Exclusion REX Question

Posted: Wed Sep 28, 2005 11:42 am
by velevi
Dear Support Staff:

Why won't the following REX exclusion rule work (even when running a "New"-type walk):
folder1/folder2/[\alpha]+_print\.html

The pages matching this URL pattern are still indexed.

Same holds for:
var=Definitions
where I'd like to exclude a page URL containing this specific query variable from being indexed.

Am I missing something about Exclusion REX?
(Maybe I should mention that I have the following in Extra URLs REX: ^http://my\.site\.com/.+>>\.php\? since if I just had .php in the list of allowed extensions, PHP pages with URL query strings would not get indexed. Is that a known bug, or is that how it's supposed to work?)

ALSO the normal Exclusions field is not functioning for some reason:
/folder/folder2/ won't exclude the pages with that path from the database (?!).

Thank you!

Posted: Wed Sep 28, 2005 11:52 am
by mark
Try
>>folder1/folder2/=[\alpha]+_print\.html

Posted: Wed Sep 28, 2005 11:54 am
by mark
There's an option to strip query strings. Make sure it's off and that ? is not in your excludes (it is in there by default).

Make sure you're doing a new (not refresh) walk so that it's actually thinking about visiting those pages.

Posted: Wed Sep 28, 2005 12:01 pm
by velevi
OK, I removed the "?" from the Exclusions field. I have "Strip Queries" turned off. (So, this and the former fields do the same thing basically?)

I am definitely doing a "New" walk.

I'll try the change in the REX that you suggested!

Posted: Wed Sep 28, 2005 12:48 pm
by mark
Not the same at all really. Exclusions means "if this appears in a url just skip that page entirely". Strip queries means "remove the query string from the url before fetching the page". Both have to be set appropriately for urls with query strings to be included in the walk as-is.
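
The difference between the two settings can be sketched in a few lines of Python. This is only an illustrative model, not Webinator's actual code: the function names `strip_query` and `url_to_fetch` are hypothetical, and checking exclusions before stripping is an assumption about ordering.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Model of "Strip Queries": drop the query string, keep the rest."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


def url_to_fetch(url, exclusions, strip_queries):
    """Return the URL that would be fetched, or None if the page is skipped.

    Hypothetical model: an exclusion is a plain substring test on the URL;
    the order (exclude first, then strip) is an assumption.
    """
    if any(token in url for token in exclusions):
        return None  # "Exclusions": skip the page entirely
    if strip_queries:
        return strip_query(url)  # "Strip Queries": fetch without the query string
    return url  # only this path walks the URL as-is


url = "http://example.com/page.php?stat=Definitions"
print(url_to_fetch(url, ["?"], False))  # skipped
print(url_to_fetch(url, [], True))      # fetched without the query string
print(url_to_fetch(url, [], False))     # fetched as-is
```

In this model a URL with a query string is walked unchanged only when "?" is absent from the exclusions and stripping is off, which matches the point above.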

Posted: Wed Sep 28, 2005 12:54 pm
by velevi
The corrected REX didn't work; the same pages are still included, despite the pattern being in the "Exclusion REX" field.

Any other suggestions?

Posted: Wed Sep 28, 2005 1:47 pm
by mark
Show me your exact url and exact exclusions.
Do you have anything in "Extra URLs REX"? Show that too.

Posted: Wed Sep 28, 2005 2:16 pm
by velevi
Unfortunately I cannot show the URLs I am working with. I can re-create them for you without the sensitive parts; the structure is exactly the same:

URL to exclude 1:
http://sitehost.com/folderphp/sites.php ... efinitions

URL to exclude 2:
http://sitehost.com/folderhtml/html/amyly_print.html

Exclusion REX contains the following and nothing else:
^http://another\.site\.domain\.com/.*
.*\.php\?=.+stat\=Definitions
.*/folderhtml/html/[\alpha]+_print\.html

In Extra URLs:
^http://anothersitehost\.anothersite\.com/$

Posted: Wed Sep 28, 2005 3:57 pm
by mark
Exclusion rex 1 is not needed. It only walks sites/domains you tell it to.

Try these for the other 2 exclusion rexes:
>>\.php\?=!stat\=Definitions*stat\=Definitions
>>/folderhtml/html/=[\alpha]+_print\.html

For extra urls rex:
>>=http://anothersitehost\.anothersite\.com=/?>>=

Note that rex syntax is different than grep. Please see the rex docs.
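
For readers more used to grep-style regular expressions, the intent of the two exclusion patterns can be approximated with Python's `re` module. This is a rough analogue for sanity-checking which URLs should be caught, not a translation that REX itself would accept. The names `PHP_DEFINITIONS`, `PRINT_PAGE` and `is_excluded` are hypothetical, `[A-Za-z]` is assumed to stand in for `\alpha`, and the `example.com` URLs are made up for the test.

```python
import re

# ".php?" followed, somewhere later in the URL, by "stat=Definitions".
PHP_DEFINITIONS = re.compile(r"\.php\?.*?stat=Definitions")

# A letters-only file name ending in "_print.html" under /folderhtml/html/.
PRINT_PAGE = re.compile(r"/folderhtml/html/[A-Za-z]+_print\.html")


def is_excluded(url):
    """True if either pattern is found anywhere in the URL."""
    return bool(PHP_DEFINITIONS.search(url) or PRINT_PAGE.search(url))


for candidate in (
    "http://sitehost.com/folderhtml/html/amyly_print.html",
    "http://example.com/folderphp/page.php?x=1&stat=Definitions",
    "http://example.com/folderphp/page.php?stat=Other",
):
    print(candidate, is_excluded(candidate))
```

Running candidate URLs through a check like this before a walk is a quick way to confirm that the pattern logic is right, independently of the REX syntax details.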

Posted: Thu Sep 29, 2005 9:48 am
by velevi
Thank you, the regular expressions finally worked. I know that 'grep' syntax is different from Texis REX, but things like '^' and '$' are supposed to be in REX as well, right? It seems that one would always need the '>>' anchor in REX, or not? And how come there are two anchors in your suggestion for Extra URLs?

Thank you again!