Ignoring robots.txt file

mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Ignoring robots.txt file

Post by mjacobson »

It seems that Webinator 4.3 ignores robots.txt when it discovers links from one site pointing to an area of a different site that is supposed to be off limits according to that site's robots.txt file.

Example: site1.com has the following statement in its robots.txt file:
Disallow: /softcopy_keys/asp/mob_ground

When I did a pattern search after a walk, I got several pages in the index that match the above pattern.
http://site1.com/softcopy_keys/asp/mob_ ... .asp?id=14

The one thing all of the fetched pages appear to have in common is that their parent pages come from a different site. This leads me to believe that Webinator is not honoring the robots.txt file in this case.
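
For reference, a compliant walker should refuse any URL under that Disallow prefix no matter which site the link was found on. A minimal sketch of that check using Python's standard urllib.robotparser (site1.com is the placeholder from the example above, and the trailing page name is invented for illustration, since the original URL is truncated):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://site1.com/robots.txt")
rp.read()  # fetch and parse site1.com's robots.txt

# Hypothetical page under the Disallowed prefix quoted above.
url = "http://site1.com/softcopy_keys/asp/mob_ground/example.asp"
print(rp.can_fetch("*", url))  # a compliant walker should see False here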
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Ignoring robots.txt file

Post by mark »

Not inconceivable. We'll check it out.
sunnedaze
Posts: 22
Joined: Mon Jul 28, 2003 2:07 pm

Ignoring robots.txt file

Post by sunnedaze »

I am having a similar problem. My robots.txt file is:
User-agent: *
Disallow: /wlc050403
When I do a search, the very first site in the match list is /wlc050403. I have robots.txt set to Y in my admin screen for the site. When I do a query, my robots.txt file is found. I am not sure what the problem could be.
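
As a quick sanity check outside Webinator, you can confirm the file is actually served from the site root, which is where a crawler requests it. A minimal sketch (the host name is a placeholder for the walked site):

from urllib.request import urlopen

with urlopen("http://yourhost.example.com/robots.txt") as resp:
    print(resp.status)                       # should be 200
    print(resp.headers.get("Content-Type"))  # ideally text/plain
    print(resp.read().decode("utf-8", "replace"))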
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Ignoring robots.txt file

Post by mark »

Check the walk status; it should indicate what was found in the robots.txt file, if one was fetched. Robots.txt will be ignored when processing single off-site pages.
sunnedaze
Posts: 22
Joined: Mon Jul 28, 2003 2:07 pm

Ignoring robots.txt file

Post by sunnedaze »

The walk status is suppressing output because of all the errors (file not found) and duplicate links, and this HTML page is not near the front of the listing (i.e., before the suppression begins). We only walk on weekends because it is very time-consuming. It is a single HTML page... are you saying that single pages can't be excluded via robots.txt?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Ignoring robots.txt file

Post by mark »

Not quite. When using the "off-site pages" feature to fetch individual off-site pages, only those off-site pages skip the robots.txt check. Pages on the main site are always checked against robots.txt.

The robots.txt message comes out before any of the error messages. Shortly after the "started ... on YOURBASEURL" line, it will say "robots.txt excludes the following prefixes:" followed by the URL prefixes, one per line. If you don't see that message, robots.txt was not processed. Double-check your settings and make sure you can fetch robots.txt with your browser.

You can use the getrobots utility function of dowalk to find out how your robots.txt would be processed. Something like:
texis "profile=YOURPROFILE" "top=YOURBASEURL" dowalk/getrobots.txt
sunnedaze
Posts: 22
Joined: Mon Jul 28, 2003 2:07 pm

Ignoring robots.txt file

Post by sunnedaze »

I see the robots.txt file is not being processed. I am able to see it with the browser, and I do have the button checked on the admin screen to use robots.txt. What else could be wrong? Thanks
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Ignoring robots.txt file

Post by mark »

You can use the getrobots utility function of dowalk to find out how your robots.txt would be processed. Something like:
texis "profile=YOURPROFILE" "top=YOURBASEURL" dowalk/getrobots.txt

If you have access to your webserver's log, you could look at that to see whether Webinator is fetching it or not.
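
For example, a rough scan of a common-format access log (the log path is a placeholder) that prints every request for robots.txt, so you can see whether the walker ever asked for it:

with open("/path/to/access.log") as log:
    for line in log:
        if "robots.txt" in line:
            print(line.rstrip())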
sunnedaze
Posts: 22
Joined: Mon Jul 28, 2003 2:07 pm

Ignoring robots.txt file

Post by sunnedaze »

It looks like it is not being fetched. What should I try next?

Webinator Walk Report for allenet

Creating database F:\Thunderstone Software\Webinator/texis/allenet/db1...Done.
Walk started at 2003-07-27 02:00:02 (by schedule)
JavaScript walking not enabled by current license
HTTPS walking disabled
Start fetching at http://cleohsenet01.napa.ad.etn.com/
Reading urls from file F:\iPlanet\Servers\docs\enet\html\eneturl.txt
Ignore urls containing any of the following:
/cgi-bin/
~
?

started 1 (10604) on http://cleohsenet01.napa.ad.etn.com/
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Ignoring robots.txt file

Post by mark »

It's silent about errors with robots.txt, so that indicates one of the following happened: it wasn't fetched, there was an error fetching it, there was an error processing it, or nothing appropriate was found.

Do the test I suggested above to see what Webinator thinks of your robots.txt file.
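
A rough way to tell those cases apart from outside Webinator, using the host from the walk report above and the /wlc050403 rule quoted earlier (a sketch, not Webinator's own logic):

from urllib import robotparser
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

url = "http://cleohsenet01.napa.ad.etn.com/robots.txt"
try:
    body = urlopen(url).read()  # wasn't fetched / error fetching
except (HTTPError, URLError) as e:
    print("fetch failed:", e)
else:
    try:
        lines = body.decode("utf-8").splitlines()  # error processing
    except UnicodeDecodeError as e:
        print("could not decode robots.txt:", e)
    else:
        rp = robotparser.RobotFileParser()
        rp.parse(lines)
        probe = "http://cleohsenet01.napa.ad.etn.com/wlc050403/"
        if rp.can_fetch("*", probe):
            print("parsed, but nothing excludes", probe)  # nothing appropriate found
        else:
            print("robots.txt does exclude", probe)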