How to limit rewalk

gebickford
Posts: 6
Joined: Mon Jun 18, 2001 3:25 pm

Post by gebickford »

I'm trying to find the most efficient and easiest way to configure Webinator so it doesn't re-walk the pages it's already done. The site I'm working on is a newspaper's online site, where nearly all the content never changes. Each week the items in the main directory are replaced, and after a week are moved into a dated directory under /archive/. Thus, we have:

/
/welcome.html
/front1.shtml
/front2.shtml
...
/archive/
/archive/20010620/
    welcome.html
    front1.shtml
    ...
/archive/20010613/

and so forth. The archives are linked off the main page, so a depth limit won't do what I want: the new archive is at the same depth as the old ones.

Presently, I use -rewalk to reindex the entire site, which works fine. However, this is inefficient. Instead, each week I want to index only the top level plus just _one_ of the directories under /archive/ (the most recent).

I see several possible ways to do this, including using the -V option to only download modified pages, or using -x to exclude previously run directories (this would require updating for each run).
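For the -x approach, the exclusion list could be generated each week rather than maintained by hand. The following is only a sketch, assuming the archive directories are always named /archive/YYYYMMDD/. The helper name split_archives is hypothetical, and how the resulting list is handed to -x is left out on purpose.

```python
import re

# Assumption: dated archive directories look like /archive/YYYYMMDD/
DATED_DIR = re.compile(r"^/archive/(\d{8})/$")

def split_archives(paths):
    """Return (newest, older) dated archive directories found in paths.

    YYYYMMDD names sort chronologically as plain strings, so a
    lexicographic sort is enough to find the most recent one.
    """
    dated = sorted(p for p in paths if DATED_DIR.match(p))
    if not dated:
        return None, []
    return dated[-1], dated[:-1]

paths = [
    "/welcome.html",
    "/archive/",
    "/archive/20010613/",
    "/archive/20010620/",
]
newest, older = split_archives(paths)
print("walk:", newest)      # the one archive directory to index this week
print("exclude:", older)    # candidates for the exclusion list
```

The older directories are the ones that would need to be excluded on each run; the top level and the newest archive stay in the walk.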

What might be the most effective way to do this? My present config file has these options:
-d- -D9 -M -o -t5 -fshtml
The -D9 is historical, and probably not relevant any more.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Use -e with -X and -V

Post by gebickford »

If I understand correctly, then, adding (for example) -e"-7 days" -X -V will _limit_ the retrieval of older pages, but won't limit the retrieval of new pages?

Also, because of the timeliness factor in newspapers, we intend to use a version of the scripted walker code to set the page's last-modified date into the Visited field, so we can use "order by Visited" in the search and get most recent articles first. I think this won't affect the above negatively. Am I right?
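To illustrate the ordering idea outside of Webinator, here is a rough Python sketch that parses Last-Modified header values and sorts pages newest first. It is not the scripted walker code; the function name newest_first and the sample header values are made up for the example.

```python
from email.utils import parsedate_to_datetime

def newest_first(pages):
    """Sort (url, last_modified_header) pairs by date, most recent first.

    The header is assumed to be a standard HTTP date such as
    "Wed, 20 Jun 2001 08:00:00 GMT".
    """
    dated = [(parsedate_to_datetime(header), url) for url, header in pages]
    return [url for _, url in sorted(dated, reverse=True)]

# Hypothetical sample data
pages = [
    ("/archive/20010613/front1.shtml", "Wed, 13 Jun 2001 08:00:00 GMT"),
    ("/front1.shtml", "Wed, 20 Jun 2001 08:00:00 GMT"),
]
for url in newest_first(pages):
    print(url)
```

Storing the parsed date in the field used for sorting is what makes "order by Visited" return the most recent articles first.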

Adding the page date (either by parsing a meta-tag or the last-modified method) to the database in its own field would be my suggestion for the ideal new feature :O) This would get us away from the kluge of using the Visited field for this. Another, more general way of handling this might be to add a 'comments' field in which we could put arbitrary data, perhaps XML-format. This would handle all sorts of user extensions without forcing any additional expansion and incompatibility issues with the database.

Post by mark »

Right. It will check each page for modification. Modified pages will be downloaded. New hyperlinks will be added to the database. This all assumes that the web server respects the If-Modified-Since header in a request.
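For readers unfamiliar with the mechanism, a conditional request looks roughly like the Python sketch below. This shows generic HTTP behaviour, not Webinator's internals; the function names and the example URL are hypothetical.

```python
import urllib.error
import urllib.request
from email.utils import formatdate

def build_conditional_request(url, last_visit_epoch):
    """Build a GET request that asks for the page only if it changed."""
    req = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT
    req.add_header("If-Modified-Since", formatdate(last_visit_epoch, usegmt=True))
    return req

def fetch_if_modified(url, last_visit_epoch, timeout=10):
    """Return the page body, or None if the server answers 304 Not Modified."""
    req = build_conditional_request(url, last_visit_epoch)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:   # unchanged since the last visit, nothing to download
            return None
        raise

# Example (performs a network request, so it is left commented out):
# body = fetch_if_modified("http://www.example.com/front1.shtml", 993000000)
```

A server that ignores the header simply returns the full page with status 200 every time, in which case nothing is saved.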

If you're adventurous you could modify the scripted walker to do the whole thing for you rather than having to do every page twice. See http://thunderstone.master.com/texis/ma ... 3a96d57b1a

The meta field holds the arbitrary meta info. See -meta.