Ignoring robots.txt file

mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Ignoring robots.txt file

Post by mjacobson »

It seems that Webinator 4.3 ignores robots.txt when it discovers links from one site pointing to an area of a different site that is supposed to be off limits according to that site's robots.txt file.

Example: site1.com has the following statement in its robots.txt file:
Disallow: /softcopy_keys/asp/mob_ground

When I did a pattern search after a walk, I got several pages in the index that match the above pattern.
http://site1.com/softcopy_keys/asp/mob_ ... .asp?id=14

The one thing all of the fetched pages appear to have in common is that their parent pages come from a different site. This leads me to believe that Webinator is not honoring the robots.txt file in this case.
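
For reference, a compliant walker should refuse any URL under that Disallow prefix no matter which site the link was found on. A minimal sketch of that check using Python's standard urllib.robotparser (site1.com is the placeholder from the example above, and the trailing page name is invented for illustration, since the original URL is truncated):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://site1.com/robots.txt")
rp.read()  # fetch and parse site1.com's robots.txt

# Hypothetical page under the Disallowed prefix quoted above.
url = "http://site1.com/softcopy_keys/asp/mob_ground/example.asp"
print(rp.can_fetch("*", url))  # a compliant walker should see False here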
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Ignoring robots.txt file

Post by mark »

Not inconceivable. We'll check it out.
sunnedaze
Posts: 22
Joined: Mon Jul 28, 2003 2:07 pm

Ignoring robots.txt file

Post by sunnedaze »

I am having a similar problem. My robots.txt file is:
User-agent: *
Disallow: /wlc050403
When I do a search, the very first site in the match list is /wlc050403. I have robots.txt set to Y in my admin screen for the site. When I do a query, my robots.txt file is found. I am not sure what the problem could be.
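
As a quick sanity check outside Webinator, you can confirm the file is actually served from the site root, which is where a crawler requests it. A minimal sketch (the host name is a placeholder for the walked site):

from urllib.request import urlopen

with urlopen("http://yourhost.example.com/robots.txt") as resp:
    print(resp.status)                       # should be 200
    print(resp.headers.get("Content-Type"))  # ideally text/plain
    print(resp.read().decode("utf-8", "replace"))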
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Ignoring robots.txt file

Post by mark »

Check the walk status; it should indicate what was found in the robots.txt file, if one was fetched. Robots.txt will be ignored when processing single off-site pages.
sunnedaze
Posts: 22
Joined: Mon Jul 28, 2003 2:07 pm

Ignoring robots.txt file

Post by sunnedaze »

The walk status is suppressing output because of all the errors (file not found) and duplicate links, and this HTML page is not near the front of the listing (i.e., before the suppression begins). We only walk on weekends because it is very time-consuming. It is a single HTML page... are you saying that single pages can't be excluded via robots.txt?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Ignoring robots.txt file

Post by mark »

Not quite. When using the "off-site pages" feature to fetch individual off-site pages, only those off-site pages skip the robots.txt check. Pages on the main site are always checked against robots.txt.

The robots.txt message comes out before any of the error messages. Shortly after the "started ... on YOURBASEURL" line, it will say "robots.txt excludes the following prefixes:" followed by the URL prefixes, one per line. If you don't see that message, robots.txt was not processed. Double-check your settings and make sure you can fetch robots.txt with your browser.

You can use the getrobots utility function of dowalk to find out how your robots.txt would be processed. Something like:
texis "profile=YOURPROFILE" "top=YOURBASEURL" dowalk/getrobots.txt
sunnedaze
Posts: 22
Joined: Mon Jul 28, 2003 2:07 pm

Ignoring robots.txt file

Post by sunnedaze »

I see the robots.txt file is not being processed. I am able to see it with the browser, and I do have the button checked on the admin screen to use robots.txt. What else could be wrong? Thanks
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Ignoring robots.txt file

Post by mark »

You can use the getrobots utility function of dowalk to find out how your robots.txt would be processed. Something like:
texis "profile=YOURPROFILE" "top=YOURBASEURL" dowalk/getrobots.txt

If you have access to your webserver's log, you could look at that to see whether Webinator is fetching it or not.
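
For example, a rough scan of a common-format access log (the log path is a placeholder) that prints every request for robots.txt, so you can see whether the walker ever asked for it:

with open("/path/to/access.log") as log:
    for line in log:
        if "robots.txt" in line:
            print(line.rstrip())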
sunnedaze
Posts: 22
Joined: Mon Jul 28, 2003 2:07 pm

Ignoring robots.txt file

Post by sunnedaze »

It looks like it is not being fetched. What should I try next?

Webinator Walk Report for allenet

Creating database F:\Thunderstone Software\Webinator/texis/allenet/db1...Done.
Walk started at 2003-07-27 02:00:02 (by schedule)
JavaScript walking not enabled by current license
HTTPS walking disabled
Start fetching at http://cleohsenet01.napa.ad.etn.com/
Reading urls from file F:\iPlanet\Servers\docs\enet\html\eneturl.txt
Ignore urls containing any of the following:
/cgi-bin/
~
?

started 1 (10604) on http://cleohsenet01.napa.ad.etn.com/
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Ignoring robots.txt file

Post by mark »

It's silent about errors with robots.txt, so that indicates one of the following happened: it wasn't fetched, there was an error fetching it, there was an error processing it, or nothing appropriate was found.

Do the test I suggested above to see what Webinator thinks of your robots.txt file.
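
A rough way to tell those cases apart from outside Webinator, using the host from the walk report above and the /wlc050403 rule quoted earlier (a sketch, not Webinator's own logic):

from urllib import robotparser
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

url = "http://cleohsenet01.napa.ad.etn.com/robots.txt"
try:
    body = urlopen(url).read()  # wasn't fetched / error fetching
except (HTTPError, URLError) as e:
    print("fetch failed:", e)
else:
    try:
        lines = body.decode("utf-8").splitlines()  # error processing
    except UnicodeDecodeError as e:
        print("could not decode robots.txt:", e)
    else:
        rp = robotparser.RobotFileParser()
        rp.parse(lines)
        probe = "http://cleohsenet01.napa.ad.etn.com/wlc050403/"
        if rp.can_fetch("*", probe):
            print("parsed, but nothing excludes", probe)  # nothing appropriate found
        else:
            print("robots.txt does exclude", probe)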