-index(ing) while walking?

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Post by Thunderstone »



Hello,

Is it possible to index the DB while the walker is still walking the web?

If my walker is slow (say it's very polite and pauses a long time between
requests) and it takes a few days (say a week) to crawl all the URLs, is
there any way I can index the DB (whatever is in it up to that point) while
the walker is still walking, without having to stop the walker, index,
and restart the walker?

This would be useful to me because I want my index to always be fresh. I'd
want the walker to just keep walking and rewalking, while at the same time
having the DB indexed as completely and as frequently as possible, so that
it's always 'fully searchable'.

Similarly, is it possible to have the index created as the walker walks?
For example, if the walker visits www.site.com/dir/file.html, can it store
the page in the DB and index it right away, while continuing to walk?

Thanks,

Otis




Post by Thunderstone »




You can periodically run gw -index while the walk is in progress to
update the index. This isn't done automatically for every page because
the cost in CPU, memory, and disk is too high. Pages that have not been
indexed yet are still processed, but without the benefit of the index.
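
As a rough illustration of "periodically run gw -index", here is a minimal Python sketch of a loop that reruns a command at a fixed interval. Only the command gw -index comes from the reply above; the function name run_periodically, its parameters, and the six-hour interval mentioned afterwards are assumptions made for the example, and any extra gw options your installation needs (such as which database to use) would have to be added to the command list yourself. A cron job would do the same thing.

```python
import subprocess
import sys
import time


def run_periodically(command, interval_seconds, should_continue,
                     run=subprocess.run, sleep=time.sleep):
    """Run `command` again and again while should_continue() returns True.

    `run` and `sleep` are parameters only so the loop can be exercised
    without launching a real process or actually waiting.
    Returns the number of times the command was started.
    """
    runs = 0
    while should_continue():
        result = run(command)
        if result.returncode != 0:
            # Report the failure and keep going; the next pass will retry.
            print("command failed with exit code %d" % result.returncode,
                  file=sys.stderr)
        runs += 1
        sleep(interval_seconds)
    return runs
```

A call such as run_periodically(["gw", "-index"], 6 * 60 * 60, lambda: True) would then update the index every six hours until the script is stopped. The right interval depends on how many unindexed pages you are willing to accumulate between updates.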

The above still leaves you with some unindexed data between index
updates. Having some unindexed pages is not a terribly bad thing, but
having thousands would be. If there are going to be too many unindexed
pages between index updates, you might consider a different strategy:
copy the "live" database to another directory, walk in the copy, and
periodically stop, index, and make the copy live. When making a new
database live, it's best to have it ready to go and just rename the
directory rather than copy the individual files over, since copying would
leave a broken database until all of the files were copied.
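
The "rename the directory" step can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name make_live and the three directory arguments are hypothetical, all three paths are assumed to sit on the same filesystem (so that a rename does not turn into a copy), and the walker is assumed to be stopped and the copy already indexed before the swap.

```python
import os


def make_live(live_dir, ready_dir, retired_dir):
    """Put ready_dir in place of live_dir using two directory renames.

    The old live database ends up at retired_dir so it can be inspected
    or deleted later. Between the two renames there is a brief moment
    when live_dir does not exist.
    """
    if os.path.exists(retired_dir):
        raise FileExistsError("remove the old copy first: " + retired_dir)
    os.rename(live_dir, retired_dir)
    try:
        os.rename(ready_dir, live_dir)
    except OSError:
        # Put the old database back if the new one could not be moved in.
        os.rename(retired_dir, live_dir)
        raise
```

For example, make_live("/data/db", "/data/db.new", "/data/db.old") would retire the current database and put the freshly indexed copy in its place (the paths are made up). Because nothing is copied file by file, searches should not see a half-updated database.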


