I have <urlcp timeout 120>
and am trying to fetch 1000 URLs in a <fetch> loop.
My fetch loop always exits after about 80 or so, so I used urlinfo to print a whole host of diagnostics: each URL fetched, the HTTP code, the error code, the time taken, etc. Each URL took 1-9 seconds to fetch.
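The loop is roughly shaped like this (a trimmed-down sketch from memory, not the exact script; the urlinfo item names may not be exactly what I use, so check the urlinfo docs for the real ones):

    <urlcp timeout 120>
    <fetch $adlinks>
      <!-- per-URL diagnostics; item names here are illustrative -->
      <urlinfo actualurl> url fetched: $ret
      <urlinfo errnum>    error code:  $ret
      <!-- the HTTP code and elapsed time get printed the same way -->
      <!-- ...a small amount of processing on the fetched page... -->
    </fetch>
    iterations completed: $loop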
Question: Is the urlcp timeout for ALL URLs or for each one? The docs say it is per fetch, but totalling up the time taken to fetch all my URLs, it looks as if the timeout is for ALL fetches, i.e. the whole fetch loop.
If one URL times out, does it quit the entire fetch loop?
Will it enter the loop body with the timed-out URL and set the HTTP and error codes accordingly, or will it just skip that iteration?
What I am seeing does not make sense if the urlcp timeout is per URL. With 1000 URLs, it ought to take the script up to 1000 * 120 seconds to quit in the worst case, right?
How many are you fetching in parallel? Are you doing a lot of processing in the fetch loop? Is it the overall script timeout you are hitting, or is the script just exiting the fetch loop early?
Parallel is 2. I'm not doing a great deal in the loop, and there are no exits.
It really looks as if the timeout is for ALL URLs, not each one.
I just upped the timeout value and it is fetching more records now, but it doesn't make sense to me that the timeout has to be so high for a single web URL.
It seems to run for about 20 minutes (which is a little short of 1500 seconds). I don't get any messages or errors or anything. The fetch loop just exits, and I have only processed, say, 200 of the 1000 URLs in $adlinks.
As I said before, when I print urlinfo diagnostics I never see a timed-out URL reported. I can print every URL that triggers another trip around the fetch loop, and I get, say, 200 of them, never the full 1000.
Most odd. All I know is that the more I increase <urlcp timeout>, the more records I receive.
Do you have a putmsg function trapping messages?
Are there any related messages in vortex.log?
How are you executing this script? Are you running it from the command line or accessing it via a browser? If via a browser, the browser may be timing out the connection because no data has been sent for too long, or the web server may be killing the script for running too long.
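On the putmsg question: if the script (or anything it includes) defines a putmsg function, errors and warnings get routed through it, and an empty one will hide timeout messages entirely. It looks roughly like this (the argument list here is from memory, so check the docs for the exact one):

    <A NAME=putmsg msgnum fn msg>
      <!-- an empty body silently swallows every message -->
      <!-- at minimum echo them, so fetch timeouts become visible: -->
      ($msgnum) in $fn: $msg
    </A>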
I don't see any relevant errors in the log, nor any putmsg function in the script.
It is being executed from the command line on a Linux machine with plenty of free memory and CPU.
I really am seeing a direct correlation between the size of the urlcp timeout and the number of URLs I get to fetch, just as if the timeout applied to the entire fetch loop rather than to each URL.
If you count the URLs before the opening <fetch> (with <count $adlinks> $ret), and then print $loop after the closing </fetch>, is $loop also short of the number of URLs? Are you calling <fetch> or <submit> inside the <fetch> loop?
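That is, something along these lines around your existing loop (a sketch, assuming the URL list is in $adlinks as above):

    <count $adlinks> URLs before the loop: $ret
    <fetch $adlinks>
      <!-- ...existing per-URL processing... -->
    </fetch>
    iterations after the loop: $loop

If $ret is 1000 but $loop ends up around 200, the loop itself is stopping early rather than skipping individual URLs.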