Skipping timed out URLs

rhuber0
Posts: 17
Joined: Tue Dec 06, 2005 9:29 am

Post by rhuber0 »

I am processing a list of URLs using a page file. When I get a URL that times out when it is being accessed during the walk, the walk reports all subsequent URLs in the page file as timeout errors as well. However, when I access these subsequent URLs directly in a browser, they return normally. How do I get the walk to skip over the one timed out URL and keep going on the rest of the URLs?
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Is it possible that there is some kind of rate control on the server that is blocking access? You may need to either add a crawl delay, or periodically sleep for a while.
John Turnbull
Thunderstone Software
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Also try setting parallelism:threads to 1 to eliminate simultaneous page fetches.
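The two suggestions above (a crawl delay and one fetch at a time) can be illustrated outside the product with a small sketch. This is not Webinator or Vortex code; `walk`, `fetch`, `timeout`, and `delay` are hypothetical names chosen for the example. It fetches a list of URLs sequentially, pauses between requests, and records a timed out URL as a failure instead of letting it affect the rest of the list.

```python
import time
import urllib.request


def fetch(url, timeout):
    """Fetch one URL and return its body; raises OSError on timeout or network error."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def walk(urls, fetch=fetch, timeout=30.0, delay=1.0, sleep=time.sleep):
    """Fetch URLs one at a time, skipping failures instead of aborting the run."""
    results, failed = {}, []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # crawl delay so the server's rate control is not tripped
        try:
            results[url] = fetch(url, timeout)
        except OSError as exc:
            # TimeoutError, socket.timeout, and urllib.error.URLError are all
            # OSError subclasses, so one handler covers them; note the URL and move on.
            failed.append((url, str(exc)))
    return results, failed
```

Because each URL is handled inside its own `try` block, one slow page only costs its own timeout, and the `failed` list can be retried later.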
rhuber0
Posts: 17
Joined: Tue Dec 06, 2005 9:29 am

Post by rhuber0 »

I got the sysadmin of the web server I'm hitting to increase the timeout period, and I'm no longer receiving timeout errors. However, I am now seeing another issue: the walk just stops without reporting an error. The last URL the walk hits has over 12 MB of text data, and the link takes a while to open. I've run this walk several times and it always stops after hitting this link. Any ideas?
rhuber0
Posts: 17
Joined: Tue Dec 06, 2005 9:29 am

Post by rhuber0 »

Also, during the walk, I'm seeing some errors in monitor.log that I don't understand:

200 2006-11-07 12:23:40 (6128) Database Monitor on D:\Thunderstone Software\MORPH3\texis\SiebelTest4.452bd6764\db2 starting
000 2006-11-07 12:24:00 (5024) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:24:00 (5024) TXrunscheduledevents() failed
000 2006-11-07 12:25:00 (5024) Will not re-use mutex: Already exists in the function startwatchjobs
000 2006-11-07 12:25:00 (5024) TXrunscheduledevents() failed; exiting
200 2006-11-07 12:25:00 (5024) Texis Monitor exiting at request of task Cron
200 2006-11-07 12:25:05 (5636) Texis Monitor version 05.01.1145905362 starting
000 2006-11-07 12:26:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:26:00 (5636) TXrunscheduledevents() failed
000 2006-11-07 12:27:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:27:00 (5636) TXrunscheduledevents() failed
000 2006-11-07 12:28:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:28:00 (5636) TXrunscheduledevents() failed
000 2006-11-07 12:29:00 (5636) Will not re-use mutex: Already exists in the function startwatchjobs
100 2006-11-07 12:29:00 (5636) TXrunscheduledevents() failed...
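To see how often each message recurs in a log like the one above, a short script can group the lines. This is a sketch based only on the line layout visible in the excerpt (three-digit code, timestamp, a number in parentheses, message). The code is treated as an opaque label, the number in parentheses is assumed to be a process ID, and `summarize` is a hypothetical helper, not part of Texis.

```python
import re
from collections import Counter

# Layout inferred from the excerpt: "<code> <date> <time> (<number>) <message>"
LINE_RE = re.compile(
    r"^(\d{3}) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\) (.*)$"
)


def summarize(lines):
    """Count occurrences of each (code, message) pair, ignoring unparsable lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            code, _timestamp, _pid, message = m.groups()  # _pid: assumed process ID
            counts[(code, message)] += 1
    return counts
```

Calling `summarize(open("monitor.log"))` and printing `counts.most_common()` would show whether the mutex message repeats once per minute, as it appears to in the excerpt.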
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Check the walk status page (scroll down to see it all) and the error report. My guess is that 12MB page is being truncated at the "Max page size" setting.
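For readers unfamiliar with what truncation at a size limit looks like, here is a generic sketch. It is not how the product implements "Max page size"; `read_capped` and its parameters are hypothetical, and treating a negative limit as "no limit" is an assumption of this example only.

```python
import io


def read_capped(stream, max_size, chunk_size=65536):
    """Read from stream, stopping at max_size bytes; return (data, truncated)."""
    unlimited = max_size < 0  # assumption: a negative limit disables the cap
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of stream reached without exceeding the limit.
            return bytes(buf), False
        buf += chunk
        if not unlimited and len(buf) > max_size:
            # Keep only the first max_size bytes and report the truncation.
            return bytes(buf[:max_size]), True
```

For example, `read_capped(io.BytesIO(b"x" * 100), 40)` returns the first 40 bytes with the truncated flag set, which is the kind of silent cut-off a very large page could hit.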
rhuber0
Posts: 17
Joined: Tue Dec 06, 2005 9:29 am

Post by rhuber0 »

Max Page Size = -1
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Check the walk status page (scroll down to see it all) and the error report.
rhuber0
Posts: 17
Joined: Tue Dec 06, 2005 9:29 am

Post by rhuber0 »

There are no errors:

Latest run:
0 pages in todo
394 pages scheduled to be refreshed
0 pages visited in the last hour (0 success/0 failed)
394 pages in index

The last link visited from the page file was the 12 MB link. The page file contains over 20,000 links and stops with no error reported.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

The top of the walk status won't be useful for this. Scroll down to just above the error and dup reports to see what happened at the end of the walk, and see if it says anything about why it stopped. It should look something like:

...
6 pages fetched (77,296 bytes) from xxx
95 pages fetched (2,766,714 bytes) Total
84 errors Total
14 duplicate pages Total

Creating search index on fetched pages...Done.
Done.
Verifying usability of new walk.

Walk finished at 2006-11-08 02:00:17 (took 15 seconds)
Making new database live: yyy


Checking for broken hyperlinks...