Need help with avoiding "Spider Trap"

mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Need help with avoiding "Spider Trap"

Post by mjacobson »

I am testing out Webinator 5 and have noticed that one of my test walks has stumbled into a spider trap. I am seeing URLs like:

http://somesite.com/documents.asp?Searc ... 1\doc1.doc
http://somesite.com/documents.asp?Searc ... \\doc1.doc
http://somesite.com/documents.asp?Searc ... \\doc1.doc

I need a REX expression that will keep URLs with these repeating patterns from being included in the index. Is there a way to accomplish this?
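(For illustration only, a sketch of what such an entry might look like: in REX, \ is the escape character, so \\ matches one literal backslash and \\\\ matches two in a row. An "Exclusion REX" entry of

\\\\

would then skip any URL containing a doubled backslash; any longer run of backslashes contains a doubled pair, so no repetition operator is needed.)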

On a different note, with Webinator 5, if I make a change on the walk admin page, such as adding a new REX expression like the one above, when will the change take effect on a "refresh" walk? The next time it runs, or do I have to perform a new walk to pick up the change?

Thanks for your help.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Need help with avoiding "Spider Trap"

Post by mark »

Once a URL is in the database it will be kept unless it disappears from the server or a new walk is performed. Exclusions will not be reapplied when refreshing a URL. Newly encountered URLs will follow the new exclusion rules, though.
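For example, to see which already-walked URLs contain a doubled backslash, the walk database can be queried directly with tsql (a sketch assuming the html table's Url field, and using a Metamorph LIKE whose leading / introduces a REX subexpression, as in the delete command later in this thread):

tsql -d YOURDB "select Url from html where Url like '/\\\\'"

where YOURDB is the database created for that walk.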
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Need help with avoiding "Spider Trap"

Post by mjacobson »

Thanks for the info. If I stop the walk, remove the "bad" URLs with the "List/Edit URLs" interface, and then add the \\ expression to the "Exclusions" section, the next refresh should not fall into this same spider trap?

Would it be better to add a \\ to the Exclusions section or add something like \\{2,} to the "Exclusion REX" section?
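(A sketch of the difference, assuming "Exclusions" entries are matched as plain substrings while "Exclusion REX" entries are REX patterns: a plain Exclusions entry of

\\

would skip any URL containing those two literal characters, while the REX equivalent must escape each backslash, i.e.

\\\\

since \ is REX's escape character. Whether REX accepts the {2,} quantifier form is not confirmed here, but the doubled-backslash pattern makes it unnecessary: any longer run contains a doubled pair.)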
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Need help with avoiding "Spider Trap"

Post by mjacobson »

I cleared the "bad" urls using the "List/Edit" interface, added \\ to the list of "Exclusions" and started a refresh. Within a few minutes, I had urls with 2 or more "\\" somewhere in the URL string.

Is there another table that needs to be cleaned out? My understanding is that List/Edit deletes records from the html and refs tables, creates the needed indexes, and builds the spell-checker table. It seems that the refresh did not pick up the new \\ exclusion, or that the crawlers still had some URLs in a todo table that they had started on and so ignored the new exclusion.
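To check whether pending URLs are the culprit, the todo table can be inspected while the walk is stopped (a sketch following the same tsql and Metamorph LIKE syntax as the reply below):

tsql -d YOURDB "select Url from todo where Url like '/\\\\'"

Any rows returned are queued URLs that were accepted before the new exclusion was added.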
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Need help with avoiding "Spider Trap"

Post by mark »

While the walk is stopped, clear undesirables from the todo table by hand. Something like:

tsql -d YOURDB "delete from todo where Url like '/\\\\'"

where YOURDB is the database reported as created for that walk. It will be "db1" or "db2" in that profile's dataspace.
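After the delete, a quick count can confirm the queue is clean before restarting the refresh (a sketch assuming tsql's COUNT aggregate):

tsql -d YOURDB "select count(Url) from todo where Url like '/\\\\'"

A result of 0 means no matching URLs remain queued.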