I am processing a list of URLs using a page file. When the walk hits a URL that times out, it reports all subsequent URLs in the page file as timeout errors as well. However, when I access those subsequent URLs directly in a browser, they return normally. How do I get the walk to skip over the one timed-out URL and keep going with the rest?
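To make the desired behavior concrete, here is a minimal sketch (plain Python, not Webinator code) of walking a URL list where one timeout is logged and skipped rather than aborting the remaining fetches; the URLs and the 30-second timeout are placeholder values:

import urllib.error
import urllib.request

# Stand-in URL list; in the real setup these would come from the page file.
urls = ["http://example.com/a", "http://example.com/b"]

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            body = resp.read()
            print(f"fetched {url}: {len(body)} bytes")
    except (urllib.error.URLError, TimeoutError) as err:
        # A single slow or unreachable URL is logged and skipped;
        # the loop continues with the remaining URLs.
        print(f"skipping {url}: {err}")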
Skipping timed out URLs
Is it possible that there is some kind of rate control on the server that is blocking access? You may need to either add a crawl delay or periodically sleep for a while (a rough sketch of the idea is below).
John Turnbull
Thunderstone Software
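To illustrate the suggestion, here is a rough sketch (plain Python, not a Webinator setting) of spacing requests out with a per-request crawl delay plus a periodic longer sleep; the 2-second and 60-second values are arbitrary examples:

import time
import urllib.error
import urllib.request

CRAWL_DELAY = 2.0    # seconds between requests (example value only)
BATCH_PAUSE = 60.0   # longer pause after every 100 fetches (example value only)

def polite_fetch(urls):
    for count, url in enumerate(urls, start=1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                resp.read()
        except (urllib.error.URLError, TimeoutError) as err:
            print(f"error on {url}: {err}")
        time.sleep(CRAWL_DELAY)      # fixed delay between every request
        if count % 100 == 0:
            time.sleep(BATCH_PAUSE)  # periodic longer sleep to back off further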
Skipping timed out URLs
Also try setting parallelism:threads to 1 to eliminate simultaneous page fetches.
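As a rough illustration of what dropping the thread count to 1 amounts to (hypothetical Python, not the product's implementation), a worker pool of size 1 serializes the fetches so only one request is in flight at a time:

from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request

def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, f"{len(resp.read())} bytes"
    except (urllib.error.URLError, TimeoutError) as err:
        return url, f"error: {err}"

urls = ["http://example.com/a", "http://example.com/b"]  # placeholder list

# max_workers=1 means only one page is fetched from the server at any moment,
# which eliminates simultaneous requests just as the setting above does.
with ThreadPoolExecutor(max_workers=1) as pool:
    for url, outcome in pool.map(fetch, urls):
        print(url, outcome)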
Skipping timed out URLs
I got the sysadmin of the web server I'm hitting to increase the timeout period, and I'm no longer receiving timeout errors. However, I am seeing another issue where the walk just stops without reporting an error. The last URL the walk hits has over 12 MB of text data, and the link takes a while to open. I've run this walk several times and it always stops after hitting this link. Any ideas?
Skipping timed out URLs
Also, during the walk, I'm receiving some errors in monitor.log that I'm not sure about:
200 2006-11-07 12:23:40 (6128) Database Monitor on D:\Thunderstone Software\MORPH3\texis\SiebelTest4.452bd6764\db2 starting
000 2006-11-07 12:24:00 (5024) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:24:00 (5024) TXrunscheduledevents() failed
000 2006-11-07 12:25:00 (5024) Will not re-use mutex: Already exists in the function startwatchjobs
000 2006-11-07 12:25:00 (5024) TXrunscheduledevents() failed; exiting
200 2006-11-07 12:25:00 (5024) Texis Monitor exiting at request of task Cron
200 2006-11-07 12:25:05 (5636) Texis Monitor version 05.01.1145905362 starting
000 2006-11-07 12:26:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:26:00 (5636) TXrunscheduledevents() failed
000 2006-11-07 12:27:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:27:00 (5636) TXrunscheduledevents() failed
000 2006-11-07 12:28:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:28:00 (5636) TXrunscheduledevents() failed
000 2006-11-07 12:29:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:29:00 (5636) TXrunscheduledevents() failed...
Skipping timed out URLs
Check the walk status page (scroll down to see it all) and the error report. My guess is that the 12 MB page is being truncated at the "Max page size" setting.
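For reference, a page-size cap generally just means reading at most N bytes of the body and dropping the rest. Below is a minimal sketch of that idea, assuming a hypothetical 12 MB limit (check the actual "Max page size" value in the walk settings):

import urllib.request

MAX_PAGE_SIZE = 12 * 1024 * 1024  # example cap in bytes, not the actual setting value

def fetch_capped(url, limit=MAX_PAGE_SIZE):
    """Read at most `limit` bytes of the response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read(limit)
        # If anything remains after the cap, the page would have been truncated.
        truncated = resp.read(1) != b""
        return body, truncated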
Skipping timed out URLs
There are no errors:
Latest run:
0 pages in todo
394 pages scheduled to be refreshed
0 pages visited in the last hour (0 success/0 failed)
394 pages in index
The last link visited from the page file was the 12 MB link. The page file contains over 20,000 links, but the walk stops there with no error reported.
Skipping timed out URLs
The top of the walk status won't be useful for this. Scroll down to just above the error and dup reports to see what happened at the end of the walk. See if it says anything about why it stopped. It should look something like this:
...
6 pages fetched (77,296 bytes) from xxx
95 pages fetched (2,766,714 bytes) Total
84 errors Total
14 duplicate pages Total
Creating search index on fetched pages...Done.
Done.
Verifying usability of new walk.
Walk finished at 2006-11-08 02:00:17 (took 15 seconds)
Making new database live: yyy
Checking for broken hyperlinks...