I am processing a list of URLs using a page file. When the walk hits a URL that times out, it reports all subsequent URLs in the page file as timeout errors as well. However, when I access those subsequent URLs directly in a browser, they return normally. How do I get the walk to skip over the one timed-out URL and keep going with the rest?
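To make the desired behavior concrete, here is a minimal sketch (plain Python, not Webinator code) of walking a URL list where one timeout is logged and skipped rather than aborting the remaining fetches; the URLs and the 30-second timeout are placeholder values:

import urllib.error
import urllib.request

# Stand-in URL list; in the real setup these would come from the page file.
urls = ["http://example.com/a", "http://example.com/b"]

for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            body = resp.read()
            print(f"fetched {url}: {len(body)} bytes")
    except (urllib.error.URLError, TimeoutError) as err:
        # A single slow or unreachable URL is logged and skipped;
        # the loop continues with the remaining URLs.
        print(f"skipping {url}: {err}")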
Skipping timed out URLs
Is it possible that there is some kind of rate control on the server that is blocking access? You may need to either add a crawl delay or periodically sleep for a while (a rough sketch of the idea is below).
John Turnbull
Thunderstone Software
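To illustrate the suggestion, here is a rough sketch (plain Python, not a Webinator setting) of spacing requests out with a per-request crawl delay plus a periodic longer sleep; the 2-second and 60-second values are arbitrary examples:

import time
import urllib.error
import urllib.request

CRAWL_DELAY = 2.0    # seconds between requests (example value only)
BATCH_PAUSE = 60.0   # longer pause after every 100 fetches (example value only)

def polite_fetch(urls):
    for count, url in enumerate(urls, start=1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                resp.read()
        except (urllib.error.URLError, TimeoutError) as err:
            print(f"error on {url}: {err}")
        time.sleep(CRAWL_DELAY)      # fixed delay between every request
        if count % 100 == 0:
            time.sleep(BATCH_PAUSE)  # periodic longer sleep to back off further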
Skipping timed out URLs
Also try setting parallelism:threads to 1 to eliminate simultaneous page fetches.
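As a rough illustration of what dropping the thread count to 1 amounts to (hypothetical Python, not the product's implementation), a worker pool of size 1 serializes the fetches so only one request is in flight at a time:

from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request

def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, f"{len(resp.read())} bytes"
    except (urllib.error.URLError, TimeoutError) as err:
        return url, f"error: {err}"

urls = ["http://example.com/a", "http://example.com/b"]  # placeholder list

# max_workers=1 means only one page is fetched from the server at any moment,
# which eliminates simultaneous requests just as the setting above does.
with ThreadPoolExecutor(max_workers=1) as pool:
    for url, outcome in pool.map(fetch, urls):
        print(url, outcome)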
Skipping timed out URLs
I got the sysadmin of the web server I'm hitting to increase the timeout period, and I'm no longer receiving timeout errors. However, I am seeing another issue where the walk just stops without reporting an error. The last URL the walk hits has over 12 MB of text data, and the link takes a while to open. I've run this walk several times and it always stops after hitting this link. Any ideas?
Skipping timed out URLs
Also, during the walk, I'm receiving some errors in monitor.log that I'm not sure about:
200 2006-11-07 12:23:40 (6128) Database Monitor on D:\Thunderstone Software\MORPH3\texis\SiebelTest4.452bd6764\db2 starting
000 2006-11-07 12:24:00 (5024) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:24:00 (5024) TXrunscheduledevents() failed
000 2006-11-07 12:25:00 (5024) Will not re-use mutex: Already exists in the function startwatchjobs
000 2006-11-07 12:25:00 (5024) TXrunscheduledevents() failed; exiting
200 2006-11-07 12:25:00 (5024) Texis Monitor exiting at request of task Cron
200 2006-11-07 12:25:05 (5636) Texis Monitor version 05.01.1145905362 starting
000 2006-11-07 12:26:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:26:00 (5636) TXrunscheduledevents() failed
000 2006-11-07 12:27:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:27:00 (5636) TXrunscheduledevents() failed
000 2006-11-07 12:28:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:28:00 (5636) TXrunscheduledevents() failed
000 2006-11-07 12:29:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:29:00 (5636) TXrunscheduledevents() failed...
Skipping timed out URLs
Check the walk status page (scroll down to see it all) and the error report. My guess is that the 12 MB page is being truncated at the "Max page size" setting.
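For reference, a page-size cap generally just means reading at most N bytes of the body and dropping the rest. Below is a minimal sketch of that idea, assuming a hypothetical 12 MB limit (check the actual "Max page size" value in the walk settings):

import urllib.request

MAX_PAGE_SIZE = 12 * 1024 * 1024  # example cap in bytes, not the actual setting value

def fetch_capped(url, limit=MAX_PAGE_SIZE):
    """Read at most `limit` bytes of the response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read(limit)
        # If anything remains after the cap, the page would have been truncated.
        truncated = resp.read(1) != b""
        return body, truncated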
Skipping timed out URLs
There are no errors:
Latest run:
0 pages in todo
394 pages scheduled to be refreshed
0 pages visited in the last hour (0 success/0 failed)
394 pages in index
The last link visited from the page file was the 12 MB link. The page file contains over 20,000 links, but the walk stops there with no error reported.
Skipping timed out URLs
The top of the walk status won't be useful for this. Scroll down to just above the error and dup reports to see what happened at the end of the walk. See if it says anything about why it stopped. It should look something like this:
...
6 pages fetched (77,296 bytes) from xxx
95 pages fetched (2,766,714 bytes) Total
84 errors Total
14 duplicate pages Total
Creating search index on fetched pages...Done.
Done.
Verifying usability of new walk.
Walk finished at 2006-11-08 02:00:17 (took 15 seconds)
Making new database live: yyy
Checking for broken hyperlinks...