Exclude by Field and negative regexp

Post by **josh104** » Fri Oct 20, 2006 2:06 pm

I am indexing sites with articles, whose pages are identified by the parameter "article_id=(some number)" in the URL.

I tried to write a negative regexp for the Exclude by Field option so that only article pages are kept, while the walker still crawls the entire site by following links on index pages, etc.

I used the following:
[^]article_id\=[0-9]{1,6}

It didn't work. Any suggestions on forming a negative regexp for the Exclude by Field option?

Thanks in advance.
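
For readers used to Perl- or Python-style regular expressions, the intent here (flag every URL that does not contain article_id=<number>) can be sketched with a negative lookahead. In that syntax, [^...] only negates a single character class, not the whole pattern. This is an illustration in Python's re syntax under stated assumptions; it makes no claim about what the Exclude by Field option accepts, and the example.com URLs are made up.

```python
import re

# Assumption: Python "re" syntax, used only to illustrate the intent.
# The negative lookahead (?!...) makes the match fail whenever
# "article_id=<1-6 digits>" appears anywhere in the URL.
NOT_ARTICLE = re.compile(r"^(?!.*article_id=[0-9]{1,6}).*$")

def should_exclude(url: str) -> bool:
    """Return True for URLs that do NOT look like article pages."""
    return NOT_ARTICLE.match(url) is not None

# Hypothetical URLs, not taken from the original post.
urls = [
    "http://example.com/view.php?article_id=1234",
    "http://example.com/index.php?section=news",
]
excluded = [u for u in urls if should_exclude(u)]
```

With these two URLs, only the index page ends up in `excluded`; the article page is left alone.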

Post by **John** » Fri Oct 20, 2006 4:05 pm

You could use:

>>=!article_id+>>=

which should exclude pages that don't have article_id in the URL.

Post by **mark** » Fri Oct 20, 2006 4:10 pm

You'd need to write an expression that will match your other urls but not the desired ones. If you provide examples of wanted and unwanted urls we may be able to help further.

Post by **josh104** » Fri Oct 20, 2006 4:46 pm

Thank you both - I appreciate the suggestions.

The >>=!article_id+>>= expression doesn't seem to exclude the URLs without article_id in them.

Listing all the URLs I don't want to match would be a massive task for some of the sites. For the majority, I've been able to implement Exclude by Field with a positive match quite successfully. However, there are a handful that are all over the place, largely navigated by a plethora of GET variables with little consistency in their navigation. For these sites, I had been hoping to keep only the pages that match a template for articles, and just ignore everything else. If approaching the problem that way isn't possible - I'll reassess the sites and try to build profiles that nip away at unwanted areas a bit at a time.

Also, I've noticed that the regular expressions that you guys post use notation that I am unfamiliar with. Does the Search Appliance use a particular standard for its regular expressions, or is it a custom set? If it is custom, do you have any links to documentation on the formatting of the expressions?

Thanks!!

Post by **jason112** » Fri Oct 20, 2006 4:49 pm

The language is 'REX', our own pattern matching language, which can operate much faster than regexes.

http://www.thunderstone.com/site/texisman/node215.html

Post by **mark** » Fri Oct 20, 2006 5:33 pm

If the article urls are coded consistently it *may* be possible to write 1 or 2 expressions to accomplish what you need.

Post by **josh104** » Mon Oct 23, 2006 5:20 pm

Thanks for the link to the REX documentation.

Here's an example of the type of site I'm trying to index, using a negative regex to keep only content pages while still walking the other pages:

--baseurl--/mw/search/article_display.jsp?vnu_content_id=1003187186&schema=

--baseurl--/mw/search/more_article_display.jsp?vnu_content_id=1003187186&schema=

--baseurl--/mw/current/article_display.jsp?vnu_content_id=1003222699

--baseurl--/mw/news/recent_display.jsp?vnu_content_id=10033127646

This is an example of a page I don't want to keep:

--baseurl--/mw/directories/toptens/index.jsp

Basically, the different categories have different URLs, but every content page (what I want to index) passes a GET variable named 'vnu_content_id' with a ten-digit number as its value.

Any thoughts?
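
As a way to pin down the keep/walk-only split described above, here is a minimal Python sketch that classifies the sample URLs by looking for an all-digit vnu_content_id query parameter. It is not REX and not appliance configuration; "http://example.com" is a stand-in for --baseurl--, and the function name is made up for illustration.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumption: "http://example.com" stands in for the poster's --baseurl--.
BASE = "http://example.com"

def is_content_page(url: str) -> bool:
    """True if the query string carries an all-digit vnu_content_id."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("vnu_content_id", [])
    # [0-9]+ rather than exactly ten digits: one sample value above
    # (10033127646) has eleven digits.
    return any(re.fullmatch(r"[0-9]+", v) for v in values)

samples = [
    BASE + "/mw/search/article_display.jsp?vnu_content_id=1003187186&schema=",
    BASE + "/mw/search/more_article_display.jsp?vnu_content_id=1003187186&schema=",
    BASE + "/mw/current/article_display.jsp?vnu_content_id=1003222699",
    BASE + "/mw/news/recent_display.jsp?vnu_content_id=10033127646",
    BASE + "/mw/directories/toptens/index.jsp",
]

keep = [u for u in samples if is_content_page(u)]
walk_only = [u for u in samples if not is_content_page(u)]
```

The four *_display.jsp URLs land in `keep` and the toptens index lands in `walk_only`. Note that a pattern requiring exactly ten digits would miss the recent_display.jsp sample.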

Post by **mark** » Tue Oct 24, 2006 2:45 pm

This should match your non-vnu_content_id urls.

>>=http://baseurl/=!\?vnu_content_id*...............?>>=

Post by **josh104** » Tue Oct 24, 2006 4:49 pm

Hmm... it doesn't seem to be working. It's still keeping the content from non-vnu_content_id pages (with the 'Pages only' setting).

Also, I'm not sure I understand the part between 'baseurl/' and 'vnu'. Something that has been confusing me about REX is the use of '='. I understand from the documentation that >> indicates direction, but the explanation for the '=' character isn't clear.

How does this expression match URLs without vnu_content_id regardless of what their subdirectories are (i.e. '/mw/search/' vs '/mw/directories/toptens/')?

Thanks!

Post by **mark** » Tue Oct 24, 2006 6:03 pm

= is a repetition operator meaning exactly one occurrence of the preceding subexpression. >>= causes the expression to be anchored to the beginning or end of the searched data. ! means not this subexpression. So it's looking for URLs that begin with "http://baseurl/" (make sure you're using your actual base URL, not the string "baseurl"), followed by anything that's not "?vnu_content_id" through the end of the URL. The string of dots is a trick to handle the special case of not matching ?vnu_content_id at the end of the string.
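
Read literally, that explanation describes a two-part test: the URL starts with the base URL, and "?vnu_content_id" does not occur anywhere in the remainder. The Python sketch below is only a paraphrase of that described logic, not a translation of the REX expression, and it does not model the trailing-dots trick. BASE is a placeholder and the function name is hypothetical.

```python
# Assumption: placeholder base URL; as noted above, the real expression
# needs the site's actual base URL rather than the string "baseurl".
BASE = "http://example.com/"
MARKER = "?vnu_content_id"

def matches_exclude_rule(url: str) -> bool:
    """Paraphrase of the described rule: the URL starts with BASE, and
    MARKER does not occur anywhere between BASE and the end of the URL."""
    if not url.startswith(BASE):
        return False  # the rule is anchored at the beginning of the URL
    rest = url[len(BASE):]
    return MARKER not in rest  # the "not this subexpression" part
```

Because the negated part covers everything after the base URL, the subdirectory makes no difference: under this reading, http://example.com/mw/directories/toptens/index.jsp satisfies the rule, while any URL containing ?vnu_content_id does not, whether it sits under /mw/search/ or /mw/news/.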