Crawling behaviour

Post by **rmehrotra** » Mon Aug 15, 2005 11:28 am

While running Thunderstone, I noticed that after about 4-5 hours the walk stops for a while and then restarts after an hour or so. The following messages appear in the log during that period:
-----------------------------------------------------
Dispatcher stopping by request. May take up to 105 seconds to stop.
57142 pages fetched (-1,621,256,519 bytes) Total
6290 errors Total
971 duplicate pages Total

Removing commonality from fetched pages...
Updating search index ...
--------------------------------------------------

Is this normal behavior, or have I set some setting wrong?

-rm

Post by **mark** » Mon Aug 15, 2005 12:05 pm

Looks like someone issued a pause walk request.

Post by **rmehrotra** » Mon Aug 15, 2005 12:52 pm

Nobody has stopped it. I have noticed this behavior several times. For example, I started the process at night, and in the morning the log showed this message and the refresh buttons on the "Walk Status" page were missing; when I refreshed the browser (F5), it showed the latest status.

Also, one more thing: I have the "*travel-mp-*" pattern in "Exclusion REx", but it is still going to URLs like:

http://ZZZ.com/California/travel-mp-304 ... h-p-1.html

Please let me know why this is happening.

-rm

Post by **mark** » Mon Aug 15, 2005 1:33 pm

*travel-mp-*
is not a valid REX expression. If you just need substring exclusion, use "excludes" instead and enter
travel-mp-
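
To illustrate what substring exclusion means here, the following is a minimal Python sketch, not Thunderstone's actual implementation. The function name `is_excluded` and the example.com URLs are hypothetical and used only for illustration; the point is that the entry is matched literally anywhere in the URL, so no surrounding wildcards are needed.

```python
def is_excluded(url, excludes):
    """Return True if any exclusion entry occurs literally anywhere in the URL."""
    return any(fragment in url for fragment in excludes)


# The exclusion entry is the bare substring, with no "*" wildcards around it.
excludes = ["travel-mp-"]

# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/California/travel-mp-1.html",
    "http://example.com/California/hotels.html",
]

for url in urls:
    status = "skip" if is_excluded(url, excludes) else "crawl"
    print(status, url)
```

With this kind of matching, the first URL would be skipped and the second crawled.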

Post by **rmehrotra** » Mon Aug 15, 2005 1:48 pm

Basically, I don't want Thunderstone to crawl any URL containing "travel-mp-".

-rm