Webinator 1.3

User avatar
Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Webinator 1.3

Post by Thunderstone »



I need to index just portions of some 130 sites. I know that I could use
-jurl: to eliminate lower portions of the web tree, but it would take a fair
amount of effort to maintain this list. Is there a way to make Webinator's
default behaviour include only those pages that reside in the directory
structure that contains the entry URL? For example:
http://www.osstf.on.ca/private/security.html
would result in indexing www.osstf.on.ca/private's directory structure only,
not any other portion of the www.osstf.on.ca website.


Post by Thunderstone »




Sorry, there's currently no direct way to do what you're asking.

On Unix systems that have the "dirname" command, you can simulate it with a
shell script containing something similar to this:

#!/bin/sh
# $1 is the entry URL; dir is that URL with its last component removed
url="$1"
dir=`dirname "$url"`
gw OTHER_OPTIONS_HERE "-j$dir/" "$url"

Replace "gw" above with "echo" to see what it's doing.
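
With 130 sites, the same idea can be driven from a list of entry URLs. The
following is only a sketch under some assumptions: the helper names url_dir
and print_walks are made up for illustration, OTHER_OPTIONS_HERE is the same
placeholder as above, and shell parameter expansion is used in place of
dirname so that an entry URL already ending in "/" keeps its own directory.
It prints the gw commands rather than running them.

```shell
#!/bin/sh
# url_dir URL: print the directory part of URL with a trailing slash.
# Strips everything after the last "/" (parameter expansion, not dirname).
# Assumption: the URL has a path; a bare "http://host" is not handled.
url_dir() {
    printf '%s/\n' "${1%/*}"
}

# print_walks: read one entry URL per line from standard input and print
# the matching gw command. "echo gw" is a dry run; remove the "echo"
# once the printed commands look right.
print_walks() {
    while IFS= read -r url; do
        [ -n "$url" ] || continue    # skip blank lines
        echo gw OTHER_OPTIONS_HERE "-j$(url_dir "$url")" "$url"
    done
}
```

For example, with the entry URLs stored one per line in a file (sites.txt is
a hypothetical name), "print_walks < sites.txt" lists every command that
would be run.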

Post by Thunderstone »



I have a list of 50-odd sites that are currently being indexed using a
SPARCstation 5. I'm using the command:

gw -d/home/http/docs/unions "&/home/http/docs/unions/fqdn.txt"

I have mistakenly pressed Control-C during this walk and stopped the
indexing on two separate occasions (no, I'm not going to ask how to disable
Control-C or break; let me go on :-). Not understanding the todo list, I
first tried to wipe the database:
gw -w -d/home/http/docs/webinator/unions
Seeing that the size of the database files did not change, I decided to
delete all the files in the unions directory and restart the index process
from scratch. I have now had gw running continuously for about 16
hours, with a count of 7477/86129. The database todo list indicates that
there are 2530 URLs still in the todo database. Is this typical performance
for Webinator with one walker? Should I start another walker with the
following options to get another process working on the same todo
list?
gw -d/home/http/docs/webinator/unions


Post by Thunderstone »




You could have resumed the walk where it left off with:
gw -d/home/http/docs/unions


To wipe the database you wanted "-wipe", not "-w". They are different options (see below).
gw -wipe -d/home/http/docs/webinator/unions


There are 3 major factors governing the speed of web walking:
1. The speed of the web server being indexed
2. The network bandwidth between the indexer and indexee
3. The -w option of gw

We have no control over the first 2, but the default behavior of gw is to wait
5 seconds between page requests. You can decrease the wait period with the
"-w" option (see http://www.thunderstone.com/gwman/node22.html).

You can also run another copy of gw as you suggested if you have sufficient
horsepower and network bandwidth.
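
To get a feel for how much of the elapsed time is the wait itself, here is a
rough back-of-the-envelope sketch using the numbers from the question. It
assumes that 7477 is the number of pages fetched so far and that the default
5 second wait applied to every one of them; the function names are made up
for illustration, and the results are rounded down to whole units.

```shell
#!/bin/sh
# wait_hours PAGES WAIT_SECONDS
# Whole hours spent purely waiting between requests (rounded down).
wait_hours() {
    echo $(( $1 * $2 / 3600 ))
}

# secs_per_page PAGES ELAPSED_HOURS
# Average whole seconds per page over the walk so far (rounded down).
secs_per_page() {
    echo $(( $2 * 3600 / $1 ))
}

wait_hours 7477 5       # 7477 * 5 = 37385 seconds, i.e. 10 whole hours
secs_per_page 7477 16   # 57600 / 7477, i.e. 7 whole seconds per page
```

Under those assumptions, roughly 10 of the 16 hours would be the wait period
alone, which is why lowering it has such a large effect on the walk time.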

WARNING: Don't be mean to other people's servers by fetching too fast unless
you have their agreement.