Directories excluded in robots.txt still being indexed

vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Directories excluded in robots.txt still being indexed

Post by vallinem »

I updated robots.txt today (and confirmed the updates by displaying robots.txt in the browser), then did a new Webinator walk, but files in the excluded directories are still showing up in the index.

Note: the files linking to the 'excluded' files are not in the excluded directories (e.g., /indexeddir/file.asp is calling /excludeddir/dontindexme.pdf) - is that the problem? Do I need to add the excluded directories to the Exclusions list?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Directories excluded in robots.txt still being indexed

Post by mark »

It doesn't matter where the excluded files are linked from. The robots.txt file format is important, though. The walk status page will show you how your robots.txt was interpreted. If it's not what you think it should be, please post your robots.txt here and indicate where Webinator disagrees with you.
vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Directories excluded in robots.txt still being indexed

Post by vallinem »

I suspect Webinator is bypassing robots.txt entirely - what the walk status shows is:

Ignore urls containing any of the following:
/cgi-bin/
~
?

which are the items listed under Exclusions. Robots.txt is marked Y. The contents of robots.txt are:

User-agent: *
Disallow: /_private/
Disallow: /_vti_bin/
Disallow: /_vti_cnf/
Disallow: /_vti_log/
Disallow: /_vti_pvt/
Disallow: /_vti_txt/
Disallow: /apps/bps2001/
Disallow: /apps/CMP/archive/
Disallow: /apps/prop36/olddata/
Disallow: /archive/
Disallow: /cgi-bin/
Disallow: /download/
Disallow: /images/
Disallow: /includes/
Disallow: /java/
Disallow: /misc/Grants/AerialImages/
Disallow: /misc/Grants/KernDataDemog/
Disallow: /NTAdmin/
Disallow: /Phone Book Service/
Disallow: /polproc/sourcedocs/
Disallow: /reports/
Disallow: /ubb/
Disallow: /Webinator/
Disallow: /y2k/

(if you tell me how to attach a file, I can send the actual file - it was edited in TextPad, a DOS editor, if that makes a difference)
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Directories excluded in robots.txt still being indexed

Post by mark »

That looks OK. There should not be spaces at the ends of lines, but Webinator will ignore them. Right after the status report line "started 1 (##) on http://..." it should say "robots.txt excludes the following prefixes:" followed by a list of prefixes. Your robots.txt file is in the root directory of your web tree, right? (e.g., http://www.yoursite.com/robots.txt)

To test what Webinator thinks of your robots.txt, you can run the following from a command or shell prompt:
texis profile=PROFILE top=THEURL dowalk/getrobots.txt
where PROFILE is the name of the profile you're using and THEURL is the URL of your robots.txt file. Run the command from the directory containing the "dowalk" script. You may need to specify the full path to "texis" if it's not in your PATH.
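
For example, with a hypothetical profile named "mysite", the invocation might look like:

texis profile=mysite top=http://www.yoursite.com/robots.txt dowalk/getrobots.txt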
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Directories excluded in robots.txt still being indexed

Post by mark »

p.s.
One thing, though: a space is not valid in URLs, so it is also not valid in robots.txt. It should be encoded as %20, as in
Disallow: /Phone%20Book%20Service/
But that only affects that entry, not the rest.
vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Directories excluded in robots.txt still being indexed

Post by vallinem »

Robots.txt is indeed in the root. While examining it, I discovered some "extra" exclusions (directories that no longer exist), so I removed those. I also removed Phone Book Service, since it was an empty directory.

After cleaning up robots.txt, I ran the command you suggested (using http://countynet/ as my "top" address), and the result was:
Requested hostprefix (derived from SSc_url): <URL>

Agent='' [note: this is two single quotes, not one double quote]
Disallow='/apps/bps...<then lists all exclusions, one line per - Webinator is last one>
/Webinator/'
<p>
rrejects:

The user agent looks a little odd - could that be the problem?

When I tried a walk after fixing robots.txt and running dowalk/getrobots.txt, the result was:

Webinator Walk Report for fullsite

Creating database d:\program files\webinator/texis/fullsite/db1...Done.
Walk started at 2001-11-26 13:39:07 (by user)
Start fetching at http://countynet/
Ignore urls containing any of the following:
/cgi-bin/
~
?

started 1 (1163) on http://countynet/
393 pages fetched (37,524,442 bytes) from http://countynet/
10 errors
5 duplicate pages

If we can't figure this out, is my best bet just to duplicate the contents of robots.txt in my 'exclusions' list?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Directories excluded in robots.txt still being indexed

Post by mark »

There's the problem. You have a funky user-agent line in robots.txt. Webinator is not as tolerant of trailing spaces in the agent name as it is in the path. Remove the trailing spaces.

If you just want to get it done, then yes, you can put your exclusions into the exclusions list. If you want your webserver configured properly for general use, you should fix robots.txt.
vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Directories excluded in robots.txt still being indexed

Post by vallinem »

I would like to fix robots.txt, but I'm not sure how - do you know of an editor or tool I can use (preferably a free download) to remove those trailing spaces?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Directories excluded in robots.txt still being indexed

Post by mark »

Windows comes with Notepad, which should do the job nicely.
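
If hand-editing is tedious, a short script can also strip the trailing whitespace. A minimal sketch, assuming Python is available and the file is named robots.txt in the current directory:

# Read robots.txt, strip trailing whitespace from every line, and write it back.
with open("robots.txt") as f:
    lines = f.readlines()
with open("robots.txt", "w") as f:
    for line in lines:
        f.write(line.rstrip() + "\n")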
vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Directories excluded in robots.txt still being indexed

Post by vallinem »

I tried that, and also tried saving directly to the directory rather than FTPing, but I still get the same results. Does Thunderstone have its own user agent that I can try adding to robots.txt?