vortex walker

galderman
Posts: 9
Joined: Mon Jun 04, 2001 1:00 pm

vortex walker

Post by galderman »

I'm just starting to look at the "dowalk" Vortex walker again after a long hiatus. Looking at the script, it seems that if I have to stop the walk for any reason, the internal state of each child walker is lost. (Here that happens more often than it probably should, but it's a reality: the file system filling up, other maintenance, etc.)

I am thinking of adding another "signal" with loguser which requests a "soft stop" instead of the current "hard stop". Dumping the internal "todo" xtree list into the global "todo" database table might be all that is needed as a walker shuts down.

It would seem that losing the "done" xtree would not be such a big deal. Could you please comment on, confirm, or deny these ideas?
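
To make the idea concrete, here is roughly the "soft stop" I have in mind, sketched in Python rather than Vortex just to show the shape of it; the table and column names are hypothetical:

    import sqlite3  # stand-in for the Texis tables; names below are hypothetical

    def soft_stop(todo_tree, db_path="todo.db"):
        """On a soft-stop signal, flush this walker's internal todo list
        into the global todo table so a later walker can pick it up."""
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS todo (url TEXT PRIMARY KEY)")
        for url in todo_tree:  # the URLs still queued in the internal todo xtree
            db.execute("INSERT OR IGNORE INTO todo (url) VALUES (?)", (url,))
        db.commit()
        db.close()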

cheers,
Gary Alderman
John
Site Admin
Posts: 2597
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

vortex walker

Post by John »

Storing the "done" tree in that case would also be useful, so you don't waste time on links that have already been processed.

The global todo database is used somewhat differently in the dowalk script, and saving the internal todo tree into the todo table would probably not yield the desired results.

Note that the dowalk script will crawl sites individually, so if you are crawling more than one site you would need to save and restore each dowalk's todo and done trees, with an indication as to which was which.
John Turnbull
Thunderstone Software
galderman
Posts: 9
Joined: Mon Jun 04, 2001 1:00 pm

vortex walker

Post by galderman »

Maybe I have not correctly deciphered the operation of the script. Yes, we are crawling more than one site; it's generally several hundred, ranging from a handful of pages up to a hundred thousand pages each for a couple of our major contributors.

What I think I see is that a "child" walker is indeed tasked to walk a single site, with any "off-site" links saved to the global "todo" list. (Other walkers working other sites in parallel may find URLs referencing this particular site and put them on the global "todo" list.) On completion of everything in the INTERNAL "todo" xtree, control is returned to the "go" function. There, it tries to grab a SINGLE URL from the global "todo" table which matches this servername. (BTW, this seems a little odd; I would have expected it to try to grab them ALL.)

Anyhow, if a "child" walking site "x" has to be stopped abnormally, it would seem to me that the URLs in the internal "todo" xtree could simply be placed into the global "todo" database table for processing by some other walker at a later time. I don't understand your comment about this not having the desired results.

What I need to do is to come back later (after file system maintenance, server reboot, whatever) and fire up more walkers to continue the walk.

I think I need to grab a URL from the "todo" list, then grab ALL URLs from the same server and put them into the internal "todo" xtree. OK, the "done" xtree contains information about what has been done by this current instance of a walker, so I guess I am trying to synthesize this state information for a new walker. Perhaps I need to pre-load the "done" xtree with all URLs for this server which are already in the html table? I am starting to think I need a completely different way of kicking off such a walker.
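
Something like the following is what I'm picturing for the restart, again sketched in Python; the table and column names are hypothetical rather than the real dowalk schema:

    import sqlite3

    def synthesize_state(server, todo_db="todo.db", html_db="html.db"):
        """Rebuild a new walker's state for one server: queue everything
        still pending in the global todo table, and pre-load the done set
        from what is already in the html table."""
        pattern = "http://" + server + "/%"
        with sqlite3.connect(todo_db) as db:
            todo = {u for (u,) in db.execute(
                "SELECT url FROM todo WHERE url LIKE ?", (pattern,))}
        with sqlite3.connect(html_db) as db:
            done = {u for (u,) in db.execute(
                "SELECT url FROM html WHERE url LIKE ?", (pattern,))}
        return todo - done, done  # don't re-queue pages already fetched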

I really think I need some way to deal with this issue. Do I seem to be the only one who thinks so?

BTW, I am also trying to build a little function using the "loguser" signals to request an internal status report from the child walkers. Does anyone have any example code lying around?
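
The general pattern I'm after looks like the sketch below (plain Python signal handling rather than dowalk's actual "loguser" mechanism; the counters are made up):

    import os
    import signal

    stats = {"fetched": 0, "todo": 0, "errors": 0}  # hypothetical walker counters

    def report_status(signum, frame):
        """Print a one-line status report when sent SIGUSR1."""
        print("[pid %d] fetched=%d todo=%d errors=%d"
              % (os.getpid(), stats["fetched"], stats["todo"], stats["errors"]),
              flush=True)

    signal.signal(signal.SIGUSR1, report_status)
    # the parent (or an operator) can then run: kill -USR1 <child-pid>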
John
Site Admin
Posts: 2597
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

vortex walker

Post by John »

Since you are likely to have a lot of links in the todo tree for a given site, the todo table would be much bigger than it would otherwise be, and there would be a performance hit in processing that todo table.

The state that needs to be stored, and then restored, is the top URL and the todo and done xtrees. This would allow the children to pick up where they left off and ensure they perform the same as if they hadn't been stopped, with no loss of performance.
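
In outline, something like this would do it (a Python sketch with an illustrative file format, not code from dowalk):

    import json

    def save_state(path, top_url, todo, done):
        """Persist exactly the state a child needs to resume:
        the top URL plus the todo and done trees."""
        with open(path, "w") as f:
            json.dump({"top": top_url,
                       "todo": sorted(todo),
                       "done": sorted(done)}, f)

    def load_state(path):
        with open(path) as f:
            s = json.load(f)
        return s["top"], set(s["todo"]), set(s["done"])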

It might be a useful option to add in the future, but we have generally found that servers stay up much longer than the time it takes to crawl a site. Obviously your experience is different.
John Turnbull
Thunderstone Software
vinod1
Posts: 12
Joined: Mon Oct 22, 2001 12:07 am

vortex walker

Post by vinod1 »

I'm having to develop a crawler based on the criteria below.

1) Given a base URL (jobs.com) and a certain keyword ("computers"), I need the crawler to ultimately pick up all URLs on that site that contain the keyword, e.g. www.jobs.com/jobs/computers/today.html, by scanning pages, retrieving URLs from those pages, then loading each of those pages and searching the URLs within them, and so on. Finally, I need all the URLs gathered from the site stored in the database.

While I trust most of this is possible with Vortex, I need to know whether the recursive part is possible (pages first, then URLs within those pages, then loading those URLs and retrieving URLs within the loaded pages, and so forth, all the while making sure each URL contains the keyword). I'm experimenting with Vortex and had my doubts about this. I would appreciate a quick reply.

Vinod
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

vortex walker

Post by bart »

There is no problem with recursion within Vortex. The application should be fairly easy to create.
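
In outline, the recursive part looks like the sketch below; it is Python rather than Vortex, just to show the control flow, and fetch_page and extract_links are hypothetical stand-ins for what you would do with <fetch> and <urlinfo>:

    def crawl(url, keyword, seen, fetch_page, extract_links, matches):
        """Fetch url, then recurse into every link that contains keyword."""
        if url in seen:
            return
        seen.add(url)
        page = fetch_page(url)
        for link in extract_links(page):
            if keyword in link:
                matches.append(link)  # store into the database here
                crawl(link, keyword, seen, fetch_page, extract_links, matches)

    # usage sketch:
    # found = []
    # crawl("http://jobs.com/", "computers", set(), my_fetch, my_links, found)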
Kai
Site Admin
Posts: 1271
Joined: Tue Apr 25, 2000 1:27 pm

vortex walker

Post by Kai »

Vortex has a recursion limit of 250 by default, so erroneous recursion doesn't cause a script to consume all available memory. This is changeable with the <stack> directive at compile time, as well as dynamically with <vxcp stack>. Vortex also has local variables and parameters.
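
By way of analogy only (this is Python, not Vortex), the knob works like this:

    import sys

    def depth(n):
        return 0 if n == 0 else 1 + depth(n - 1)

    print(sys.getrecursionlimit())  # Python's default cap, same idea as Vortex's 250
    sys.setrecursionlimit(10000)    # raised at run time, in the spirit of <vxcp stack>
    print(depth(5000))              # now succeeds instead of raising RecursionError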