Exclude by Field and negative regexp

josh104
Posts: 24
Joined: Mon Oct 09, 2006 5:39 pm

Exclude by Field and negative regexp

Post by josh104 »

I am indexing sites with articles whose pages are identified by the parameter "article_id=(some number)" in the URL.

I tried to create a negative regexp to match this in the Exclude by Field option, so that only article pages are kept while the entire site is still walked by following links on index pages, etc.

I used the following:
[^]article_id\=[0-9]{1,6}

Didn't work. Any suggestions on forming a negative regexp for matching in the Exclude by Field option?
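
To be clear about the logic I'm after, here it is in ordinary Perl-style regex as a quick Python sketch (not REX syntax, and the example URLs are made up). The negated character class I tried only negates a single character position, so the "does not contain article_id=<digits>" idea seems to need something like a negative lookahead:

import re

# "URL does not contain article_id=<1-6 digits>", expressed with a negative lookahead.
no_article = re.compile(r"^(?!.*article_id=[0-9]{1,6}).*$")

print(bool(no_article.match("http://example.com/index.jsp?page=2")))          # True  - no article, would be excluded
print(bool(no_article.match("http://example.com/view.jsp?article_id=1234")))  # False - article page, kept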

Thanks in advance.
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Exclude by Field and negative regexp

Post by John »

You could use:

>>=!article_id+>>=

which should exclude pages that don't have article_id in the URL.
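
In ordinary Perl-style regex terms the intent is roughly the following (a Python sketch for illustration only, not REX syntax; the URLs are made up):

import re

# Exclude any URL that has no "article_id" anywhere in it.
def exclude(url):
    return re.search(r"article_id", url) is None

print(exclude("http://example.com/view.jsp?article_id=1234"))  # False - has article_id, kept
print(exclude("http://example.com/index.jsp"))                 # True  - no article_id, excluded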
John Turnbull
Thunderstone Software
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Exclude by Field and negative regexp

Post by mark »

You'd need to write an expression that will match your other urls but not the desired ones. If you provide examples of wanted and unwanted urls we may be able to help further.
josh104
Posts: 24
Joined: Mon Oct 09, 2006 5:39 pm

Exclude by Field and negative regexp

Post by josh104 »

Thank you both - I appreciate the suggestions.

The >>=!article_id+>>= doesn't seem to work to exclude the URLs without article_id in them.

Listing all the URLs I don't want to match would be a massive task for some of the sites. For the majority, I've been able to implement Exclude by Field with a positive match quite successfully. However, there are a handful that are all over the place, largely navigated by a plethora of GET variables with little unity in their navigation. For these sites, I had been hoping to keep only the things that match a template for articles and just ignore everything else. If approaching the problem that way isn't possible, I'll reassess the sites and try to build profiles that nip away at unwanted areas a bit at a time.

Also, I've noticed that the regular expressions you guys post use notation that I'm unfamiliar with. Does the Search Appliance use a particular standard for its regular expressions, or is it a custom set? If it is custom, do you have any links to documentation on the formatting of the expressions?

Thanks!!
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Exclude by Field and negative regexp

Post by jason112 »

mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Exclude by Field and negative regexp

Post by mark »

If the article urls are coded consistently it *may* be possible to write 1 or 2 expressions to accomplish what you need.
josh104
Posts: 24
Joined: Mon Oct 09, 2006 5:39 pm

Exclude by Field and negative regexp

Post by josh104 »

Thanks for the link to the REX documentation.

Here's an example of the type of site I'm trying to index, where I'd use a negative regex to keep only the content pages and ignore the others while still walking them:

--baseurl--/mw/search/article_display.jsp?vnu_content_id=1003187186&schema=

--baseurl--/mw/search/more_article_display.jsp?vnu_content_id=1003187186&schema=

--baseurl--/mw/current/article_display.jsp?vnu_content_id=1003222699

--baseurl--/mw/news/recent_display.jsp?vnu_content_id=10033127646

This is an example of the kind of page I don't want to keep:

--baseurl--/mw/directories/toptens/index.jsp

Basically, the different categories have different URLs, but every content-display page (what I want to index) passes a GET variable 'vnu_content_id' whose value is a ten-digit number.
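
In ordinary regex terms (a quick Python sketch rather than REX, with --baseurl-- standing in for the real hostname), the rule I'm hoping for is roughly:

import re

# Keep pages whose URL passes vnu_content_id=<digits>; everything else would be
# walked for links but not kept.
content_page = re.compile(r"[?&]vnu_content_id=[0-9]+")

urls = [
    "--baseurl--/mw/search/article_display.jsp?vnu_content_id=1003187186&schema=",
    "--baseurl--/mw/news/recent_display.jsp?vnu_content_id=10033127646",
    "--baseurl--/mw/directories/toptens/index.jsp",
]
for u in urls:
    print("keep" if content_page.search(u) else "exclude", u)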

any thoughts?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Exclude by Field and negative regexp

Post by mark »

josh104
Posts: 24
Joined: Mon Oct 09, 2006 5:39 pm

Exclude by Field and negative regexp

Post by josh104 »

Hmm... it doesn't seem to be working. It's still keeping the content from non-vnu_content_id pages (with the 'Pages only' setting).

Also, I'm not sure I understand the part between 'baseurl/' and 'vnu'. Something that has been confusing me with REX is the use of '='. I understand from the documentation that >> indicates direction, but the explanation of the '=' character isn't clear.

How does this expression match URLs without vnu_content_id regardless of their subdirectories (e.g. '/mw/search/' vs. '/mw/directories/toptens/')?

Thanks!
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Exclude by Field and negative regexp

Post by mark »

= is a repetition operator meaning exactly one of the preceding subexpression. >>= causes the expression to be anchored to the beginning or end of the searched data. ! means not this subexpression.

So it's looking for urls that begin "http://baseurl/" (make sure you're using your actual base url, not the string "baseurl"), followed by anything that's not "?vnu_content_id" through the end of the url. The string of dots is a trick to handle the special case of not matching ?vnu_content_id at the end of the string.
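
In ordinary Perl-style regex terms the same intent would look roughly like this (a Python sketch for illustration, not REX syntax; substitute your actual base URL for "baseurl"):

import re

# Match (and therefore exclude) URLs that start with the base URL and do not
# contain "?vnu_content_id" anywhere between the base URL and the end.
exclude_pat = re.compile(r"^http://baseurl/(?:(?!\?vnu_content_id).)*$")

index_page = "http://baseurl/mw/directories/toptens/index.jsp"
article_page = "http://baseurl/mw/current/article_display.jsp?vnu_content_id=1003222699"

print(bool(exclude_pat.match(index_page)))    # True  - matches, so excluded
print(bool(exclude_pat.match(article_page)))  # False - article page, kept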