I am indexing sites with articles, the pages of which are designated by the unit "article_id=(some number)" in the URL.
I tried to create a negative regexp to match this in the Exclude by Field option to only keep article pages, but still walk the entire site by following links of index pages, etc.
I used the following:
[^]article_id\=[0-9]{1,6}
Didn't work. Any suggestions on forming a negative regexp for matching in the Exclude by Field option?
You'd need to write an expression that will match your other urls but not the desired ones. If you provide examples of wanted and unwanted urls we may be able to help further.
The >>=!article_id+>>= doesn't seem to work to exclude the URLs without article_id in them.
Listing all the URLs I don't want to match would be a massive task for some of the sites. For the majority, I've been able to implement Exclude by Field with a positive match quite successfully. However, there are a handful that are all over the place, largely navigated by a plethora of GET variables with little unity in their navigation. For these sites, I had been hoping to keep only the things that matched a template for articles and just ignore everything else. If approaching the problem that way isn't possible, I'll reassess the sites and try to build profiles that nip away at unwanted areas a bit at a time.
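To make the "keep only what matches the article template" idea concrete, here is a minimal sketch in Python. It is not REX syntax and not something the Search Appliance runs; it only illustrates the decision being asked for. The function name `should_exclude` and the example.com URLs are made up for illustration, and it assumes article_id is passed as a GET variable with a value of 1 to 6 digits, as in the expression above.

```python
import re

# Hypothetical article template: "article_id=" as a GET variable,
# followed by 1 to 6 digits (the same range as [0-9]{1,6} above).
ARTICLE_TEMPLATE = re.compile(r"[?&]article_id=[0-9]{1,6}(?:&|#|$)")


def should_exclude(url: str) -> bool:
    """Return True for any URL that does NOT look like an article page.

    Excluded pages would still be walked for links; only their
    content would be dropped from the index.
    """
    return ARTICLE_TEMPLATE.search(url) is None


if __name__ == "__main__":
    for url in (
        "http://example.com/index.jsp",
        "http://example.com/story.jsp?article_id=12345",
        "http://example.com/list.jsp?cat=3&page=2",
    ):
        print(url, "->", "exclude" if should_exclude(url) else "keep")
```

The point of the sketch is that the exclusion test is simply the negation of a positive match, which is what a negative expression in Exclude by Field would have to express in a single pattern.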
Also, I've noticed that the regular expressions that you guys post use notation that I am unfamiliar with. Does the Search Appliance use a particular standard for its regular expressions, or is it a custom set? If it is custom, do you have any links to documentation on the formatting of the expressions?
Here's an example of the type of site I'm trying to index, using a negative regex to keep only the content pages and ignore the other pages while still walking them:
This is an example of a page I don't want to keep:
--baseurl--/mw/directories/toptens/index.jsp
basically, the different categories have different URLs, but all display of content (what I want to index) passes a GET variable of 'vnu_content_id' with a value of a ten digit number.
hmm....it doesn't seem to be working. It's still keeping the content from non-vnu_content_id pages (set to 'Pages only' setting).
Also, I'm not sure I understand the part between 'baseurl/' and 'vnu'. One thing that has been confusing me about REX is the use of '='. I understand from the documentation that >> indicates direction, but the explanation for the '=' character isn't clear.
How is this query matching URLs w/o vnu_content_id regardless of what their subdirectories are (i.e. '/mw/search/' vs '/mw/directories/toptens/')?
= is a repetition operator meaning exactly one occurrence of the preceding subexpression. >>= causes the expression to be anchored to the beginning or end of the searched data. ! means "not this subexpression". So it's looking for URLs that begin with "http://baseurl/" (make sure you're using your actual base URL, not the string "baseurl"), followed by anything that's not "?vnu_content_id" through the end of the URL. The string of dots is a trick to handle the special case of not matching ?vnu_content_id at the end of the string.
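For readers more used to Perl-style regular expressions, the same logic can be sketched in Python with a negative lookahead. This is only an analogy, not REX syntax, and the appliance does not accept it. "http://baseurl/" is a placeholder exactly as in the thread, and `is_excluded` is a hypothetical helper name.

```python
import re

BASE = "http://baseurl/"  # placeholder; substitute the real base URL

# Anchored at the start: the base URL, then anything up to the end of
# the string, provided "?vnu_content_id" appears nowhere after the base.
EXCLUDE = re.compile(r"^" + re.escape(BASE) + r"(?!.*\?vnu_content_id).*$")


def is_excluded(url: str) -> bool:
    """True when the URL is under BASE and carries no ?vnu_content_id."""
    return EXCLUDE.match(url) is not None


if __name__ == "__main__":
    print(is_excluded("http://baseurl/mw/directories/toptens/index.jsp"))
    print(is_excluded("http://baseurl/mw/search/a.jsp?vnu_content_id=1234567890"))
```

Because the lookahead scans everything after the base URL, the subdirectory ('/mw/search/' versus '/mw/directories/toptens/') makes no difference: only the presence or absence of "?vnu_content_id" decides the match. Note that this sketch assumes vnu_content_id is always the first GET variable; a URL carrying it as "&vnu_content_id" would still be excluded.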