Webinator 2 problems with robots.txt

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Webinator 2 problems with robots.txt

Post by Thunderstone »



I've recently walked about 400 sites on the net and am now trying to track
down the individual sites that excluded my Webinator robot because of
their robots.txt file. I saw the warnings pop up on the screen when I did
the walks, but they are nowhere in the gw.log or the error table.
I might be able to log them with standard error, but I thought these
messages should be in the gw.log, since they are not errors but something
that happened during the walks.

Dan McHugh
Webmaster@osstf.on.ca

P.S. If you want to see what almost 400 sites look like with Webinator on
a SPARCstation 5 with 64 MB of RAM, check it out at:

http://www.osstf.on.ca/cgi-bin/texis/we ... nionsearch



Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Webinator 2 problems with robots.txt

Post by Thunderstone »




You must have had verbosity turned up to see any warnings about
exclusion by robots.txt. gw normally only talks about robots.txt when it's
fetching it, not when pages are excluded by it.

Exclusion by robots.txt is not considered an error. It is just like
excluding with -x and is not generally reported or recorded.
You would have to capture the verbose messages you saw or revisit
all of the sites and get just their robots.txt.
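
If you want to capture those messages on a future walk, one option is to save
gw's screen output to a file and filter it afterwards. This is only a rough
sketch: the -v verbosity flag and the wording of the exclusion message are
assumptions here, so use whatever setting and text produced the warnings you
saw on screen:

# Re-walk with verbosity up, keeping a copy of everything gw prints
# (the -v flag and the "robots" message text are assumptions).
gw -v http://www.a_site.com/ 2>&1 | tee walk-output.log
# Then pull out just the robots.txt-related lines.
grep -i robots walk-output.log > robots-exclusions.log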

You could collect all of the robots.txt files into a database with something
like the following for each site:
gw -drobots http://www.a_site.com/robots.txt
You could then review the Url and Body fields using SQL select statements.
Or you could use the Vortex <fetch> command
(http://index.thunderstone.com/vortexman/node91.html) to fetch them.
Then process or store them as desired.
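
For example, to fetch every site's robots.txt into the same database, you
could loop over a list of hostnames. A sketch only: sites.txt (one hostname
per line) is a made-up file name, and it reuses the -drobots database from
the command above:

# Fetch each site's robots.txt into the "robots" database.
while read site; do
  gw -drobots "http://$site/robots.txt"
done < sites.txt

After that, a SQL select on the Url and Body fields in that database will
show you which sites disallow your walker.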

Most sites that have a robots.txt file have it for a good reason: they want to
keep walkers out of large and/or dynamic areas. You should generally respect it.
They don't often block walkers from the whole site.