Directories excluded in robots.txt still being indexed

vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Directories excluded in robots.txt still being indexed

Post by vallinem »

I updated robots.txt today (and confirmed the updates by displaying robots.txt in the browser), then did a new Webinator walk, but files in the excluded directories are still showing up in the index.

Note: the files linking to the 'excluded' files are not in the excluded directories (e.g., /indexeddir/file.asp is calling /excludeddir/dontindexme.pdf) - is that the problem? Do I need to add the excluded directories to the Exclusions list?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Directories excluded in robots.txt still being indexed

Post by mark »

It doesn't matter where the excluded files are linked from. The robots.txt file format is important, though. The walk status page will show you how your robots.txt was interpreted. If it's not what you think it should be, please post your robots.txt here and indicate where Webinator disagrees with you.
vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Directories excluded in robots.txt still being indexed

Post by vallinem »

I suspect Webinator is bypassing robots.txt entirely - what the walk status shows is:

Ignore urls containing any of the following:
/cgi-bin/
~
?

which are the items listed under Exclusions. Robots.txt is marked Y. The contents of robots.txt are:

User-agent: *
Disallow: /_private/
Disallow: /_vti_bin/
Disallow: /_vti_cnf/
Disallow: /_vti_log/
Disallow: /_vti_pvt/
Disallow: /_vti_txt/
Disallow: /apps/bps2001/
Disallow: /apps/CMP/archive/
Disallow: /apps/prop36/olddata/
Disallow: /archive/
Disallow: /cgi-bin/
Disallow: /download/
Disallow: /images/
Disallow: /includes/
Disallow: /java/
Disallow: /misc/Grants/AerialImages/
Disallow: /misc/Grants/KernDataDemog/
Disallow: /NTAdmin/
Disallow: /Phone Book Service/
Disallow: /polproc/sourcedocs/
Disallow: /reports/
Disallow: /ubb/
Disallow: /Webinator/
Disallow: /y2k/

(if you tell me how to attach a file, I can send the actual file - it was edited in TextPad, a DOS editor, if that makes a difference)
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Directories excluded in robots.txt still being indexed

Post by mark »

That looks OK. There should not be spaces at the ends of lines, but Webinator will ignore them. Right after the status report line "started 1 (##) on http://..." it should say "robots.txt excludes the following prefixes:" followed by a list of prefixes. Your robots.txt file is in the root directory of your web tree, right? (e.g., http://www.yoursite.com/robots.txt)

To test what Webinator thinks of your robots.txt, you can run the following from a command or shell prompt:
texis profile=PROFILE top=THEURL dowalk/getrobots.txt
where PROFILE is the name of the profile you're using and THEURL is the URL of your robots.txt file. Run the command from the directory containing the "dowalk" script. You may need to specify the full path to "texis" if it's not in your PATH.
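
For example, with a hypothetical profile named "mysite", the invocation might look like:

texis profile=mysite top=http://www.yoursite.com/robots.txt dowalk/getrobots.txt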
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Directories excluded in robots.txt still being indexed

Post by mark »

p.s.
One thing, though: a space is not valid in URLs, so it is also not valid in robots.txt. It should be encoded as %20, as in
Disallow: /Phone%20Book%20Service/
But that only affects that entry, not the rest.
vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Directories excluded in robots.txt still being indexed

Post by vallinem »

Robots.txt is indeed in the root. While examining it, I discovered some "extra" exclusions (directories that no longer exist), so I removed those. I also removed Phone Book Service, since it was an empty directory.

After cleaning up robots.txt, I ran the command you suggested (using http://countynet/ as my "top" address), and the result was:
Requested hostprefix (derived from SSc_url): <URL>

Agent='' [note: this is two single quotes, not one double quote]
Disallow='/apps/bps...<then lists all exclusions, one line per - Webinator is last one>
/Webinator/'
<p>
rrejects:

The user agent looks a little odd - could that be the problem?

When I tried a walk after fixing robots.txt and running dowalk/getrobots.txt, the result was:

Webinator Walk Report for fullsite

Creating database d:\program files\webinator/texis/fullsite/db1...Done.
Walk started at 2001-11-26 13:39:07 (by user)
Start fetching at http://countynet/
Ignore urls containing any of the following:
/cgi-bin/
~
?

started 1 (1163) on http://countynet/
393 pages fetched (37,524,442 bytes) from http://countynet/
10 errors
5 duplicate pages

If we can't figure this out, is my best bet just to duplicate the contents of robots.txt in my 'exclusions' list?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Directories excluded in robots.txt still being indexed

Post by mark »

There's the problem. You have a funky user-agent line in robots.txt. Webinator is not as tolerant of trailing spaces in the agent name as it is in the path. Remove the trailing spaces.

If you just want to get it done, then yes, you can put your exclusions into the exclusions list. If you want your webserver configured properly for general use, you should fix robots.txt.
vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Directories excluded in robots.txt still being indexed

Post by vallinem »

I would like to fix robots.txt, but I'm not sure how - do you know of an editor or tool I can use (preferably a free download) to remove those trailing spaces?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Directories excluded in robots.txt still being indexed

Post by mark »

Windows comes with Notepad, which should do the job nicely.
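
If hand-editing is tedious, a short script can also strip the trailing whitespace. A minimal sketch, assuming Python is available and the file is named robots.txt in the current directory:

# Read robots.txt, strip trailing whitespace from every line, and write it back.
with open("robots.txt") as f:
    lines = f.readlines()
with open("robots.txt", "w") as f:
    for line in lines:
        f.write(line.rstrip() + "\n")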
vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Directories excluded in robots.txt still being indexed

Post by vallinem »

I tried that, and also tried saving directly to the directory rather than FTPing, but I still get the same results. Does Thunderstone have its own user agent that I can try adding to robots.txt?