Crawling behaviour

rmehrotra
Posts: 17
Joined: Thu Jul 28, 2005 3:12 pm

Post by rmehrotra »

While running Thunderstone, I noticed that after about 4-5 hours the walk stops for a while and then restarts after an hour or so. The following messages appear in the log during that period:
-----------------------------------------------------
Dispatcher stopping by request. May take up to 105 seconds to stop.
57142 pages fetched (-1,621,256,519 bytes) Total
6290 errors Total
971 duplicate pages Total

Removing commonality from fetched pages...
Updating search index ...
--------------------------------------------------

Is this normal behavior, or have I set some setting wrong?

-rm
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Looks like someone issued a pause walk request.
rmehrotra
Posts: 17
Joined: Thu Jul 28, 2005 3:12 pm

Post by rmehrotra »

Nobody stopped it. I have noticed this behavior several times. For example, I started the process at night, and in the morning the log showed this message and the refresh buttons on the "Walk Status" page were missing. When I refreshed the browser (F5), it showed the latest status.

Also, one more thing: I have the pattern "*travel-mp-*" in "Exclusion REx", but it is still going to URLs like:

http://ZZZ.com/California/travel-mp-304 ... h-p-1.html

Please let me know why this is happening.

-rm
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

*travel-mp-*
is not a valid REX expression. If you just need substring exclusion, use "excludes" instead and enter
travel-mp-
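
To illustrate what plain substring exclusion means, here is a minimal Python sketch. It is not Thunderstone's implementation, only an assumption about the general idea: a URL is skipped when the exclusion string appears anywhere in it, with no wildcards needed. The function name `is_excluded` and the `example.com` URLs are hypothetical.

```python
def is_excluded(url: str, excludes: list[str]) -> bool:
    """Return True if any exclusion string occurs anywhere in the URL."""
    return any(fragment in url for fragment in excludes)


# Exclusion entries are plain substrings; no leading or trailing "*" is needed.
excludes = ["travel-mp-"]

# Hypothetical URLs used only for illustration.
urls = [
    "http://example.com/California/travel-mp-123/page.html",
    "http://example.com/California/hotels.html",
]

# Keep only the URLs that do not contain an excluded substring.
to_crawl = [u for u in urls if not is_excluded(u, excludes)]
```

Under these assumptions, only the second URL remains in `to_crawl`, because the first one contains "travel-mp-".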
rmehrotra
Posts: 17
Joined: Thu Jul 28, 2005 3:12 pm

Post by rmehrotra »

Basically, I don't want Thunderstone to crawl any URL containing "travel-mp-".

-rm