Bug with robots.txt exclusion?


Bug with robots.txt exclusion?

Post by Thunderstone »



I just set up Webinator for a server running on a port other than 80.

When I give the walk command:
gw -d- http://localhost:9999/subdir/
the messages show that it can't find the robots.txt file. Checking the
error log with
gw -d- st "select Url,Reason from error"
it reports:
Url                    Reason
localhost/robots.txt   Can't connect to host

Of course it can't! It should be looking at:
localhost:9999/robots.txt
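
Fetching it by hand shows whether the file is really there on the high port. wget below is just an illustration, not part of gw:

wget -q -O - http://localhost:9999/robots.txt

Either way, the walker ought to be asking port 9999, not port 80.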

Is there a fix? Or do I have to delete things after I walk....

--Hal
Hal Wine <hal@dtor.com> voice: 510/482-0597



Bug with robots.txt exclusion?

Post by Thunderstone »




The next version will have that fixed. In the meantime you can simulate the robots.txt
file using the -x option. Use one -x option for each Disallow line in
robots.txt. You may put the options in a file so you don't have
to retype them every time.

Example:
Given the following robots.txt file, simulate it with -x options:

User-agent: *
Disallow: /text
Disallow: /junk

Create a file (named robot.opt, for example) that looks like the following
(make sure there are no trailing spaces or tabs on the lines):

xhttp://localhost:9999/text
xhttp://localhost:9999/junk

Run gw like this:

gw -d- -mrobot.opt http://localhost:9999/subdir/
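
If the real robots.txt has many Disallow lines, you can generate robot.opt from it
instead of typing the -x entries by hand. The line below is just a quick grep/sed
sketch, not a gw feature, and it assumes the same http://localhost:9999 prefix as above:

grep '^Disallow:' robots.txt | sed 's|^Disallow: */|xhttp://localhost:9999/|' > robot.opt

Double-check the output for trailing spaces or tabs before handing it to gw.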

Alternatively, if you are just trying to restrict the walk to
pages under "subdir", you can use the -j option:

gw -d- -jhttp://localhost:9999/subdir/ http://localhost:9999/subdir/