Need help with avoiding "Spider Trap"

mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Need help with avoiding "Spider Trap"

Post by mjacobson »

I am testing out Webinator 5 and have noticed that one of my test walks has stumbled into a spider trap. I am seeing URLs like:

http://somesite.com/documents.asp?Searc ... 1\doc1.doc
http://somesite.com/documents.asp?Searc ... \\doc1.doc
http://somesite.com/documents.asp?Searc ... \\doc1.doc

I need a REX expression that will keep URLs with these repeating patterns from being included in the index. Is there a way to accomplish this?
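(For illustration only, a sketch of what such an entry might look like: in REX, \ is the escape character, so \\ matches one literal backslash and \\\\ matches two in a row. An "Exclusion REX" entry of

\\\\

would then skip any URL containing a doubled backslash; any longer run of backslashes contains a doubled pair, so no repetition operator is needed.)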

On a different note, with Webinator 5, if I make a change on the walk admin page, such as adding a new REX expression like the one above, when will the change take effect on a "refresh" walk? The next time it runs, or do I have to perform a new walk to pick up the change?

Thanks for your help.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Need help with avoiding "Spider Trap"

Post by mark »

Once a URL is in the database it will be kept unless it disappears from the server or a new walk is performed. Exclusions will not be reapplied when refreshing a URL. Newly encountered URLs will follow the new exclusion rules, though.
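For example, to see which already-walked URLs contain a doubled backslash, the walk database can be queried directly with tsql (a sketch assuming the html table's Url field, and using a Metamorph LIKE whose leading / introduces a REX subexpression, as in the delete command later in this thread):

tsql -d YOURDB "select Url from html where Url like '/\\\\'"

where YOURDB is the database created for that walk.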
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Need help with avoiding "Spider Trap"

Post by mjacobson »

Thanks for the info. If I stop the walk, remove the "bad" URLs with the "List/Edit URLs" interface, and then add the \\ expression to the "Exclusions" section, the next refresh should not fall into this same spider trap?

Would it be better to add a \\ to the Exclusions section or add something like \\{2,} to the "Exclusion REX" section?
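(A sketch of the difference, assuming "Exclusions" entries are matched as plain substrings while "Exclusion REX" entries are REX patterns: a plain Exclusions entry of

\\

would skip any URL containing those two literal characters, while the REX equivalent must escape each backslash, i.e.

\\\\

since \ is REX's escape character. Whether REX accepts the {2,} quantifier form is not confirmed here, but the doubled-backslash pattern makes it unnecessary: any longer run contains a doubled pair.)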
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Need help with avoiding "Spider Trap"

Post by mjacobson »

I cleared the "bad" urls using the "List/Edit" interface, added \\ to the list of "Exclusions" and started a refresh. Within a few minutes, I had urls with 2 or more "\\" somewhere in the URL string.

Is there another table that needs to be cleaned out? My understanding is that List/Edit deletes records from the html and refs tables, creates the needed indexes, and builds the spell-checker table. It seems that the refresh did not pick up the new \\ exclusion, or that the crawlers still had some URLs in a todo table that they had started on and so ignored the new exclusion.
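To check whether pending URLs are the culprit, the todo table can be inspected while the walk is stopped (a sketch following the same tsql and Metamorph LIKE syntax as the reply below):

tsql -d YOURDB "select Url from todo where Url like '/\\\\'"

Any rows returned are queued URLs that were accepted before the new exclusion was added.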
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Need help with avoiding "Spider Trap"

Post by mark »

While the walk is stopped, clear undesirables from the todo table by hand. Something like:

tsql -d YOURDB "delete from todo where Url like '/\\\\'"

where YOURDB is the database reported as created for that walk. It will be "db1" or "db2" in that profile's dataspace.
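After the delete, a quick count can confirm the queue is clean before restarting the refresh (a sketch assuming tsql's COUNT aggregate):

tsql -d YOURDB "select count(Url) from todo where Url like '/\\\\'"

A result of 0 means no matching URLs remain queued.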