"dowalk_beta" and robots.txt

galderman
Posts: 9
Joined: Mon Jun 04, 2001 1:00 pm

"dowalk_beta" and robots.txt

Post by galderman »

I found that the getrobotstxt function in dowalk_beta
needs some work. The current version is intolerant of
the variations found in legal robots.txt files around our network.

Specifically, we need to account for comments indicated by "#" either as separate "comment lines" or as trailing comments on other lines within the records.

I have found a need to insert the following after the "fetch":
<sandr "\#=[^\x0a]" "" $robotstxt><$robotstxt = $ret><!-- whack comments: "#" to end of line -->
<sandr "\space+\x0a" "\x0a" $robotstxt><$robotstxt = $ret><!-- whack trailing spaces -->

If there is a better way, please advise me.
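
If it helps to see the intent outside of Vortex, here is roughly what those two substitutions are meant to do, sketched in Python (the function and variable names are just mine for illustration, not anything from dowalk_beta):

import re

def strip_comments(robotstxt: str) -> str:
    """Remove '#' comments (whole-line or trailing) and trailing spaces/tabs."""
    # Drop everything from '#' to the end of the line.
    robotstxt = re.sub(r"#[^\n]*", "", robotstxt)
    # Drop spaces/tabs left dangling before each newline
    # (only spaces and tabs, so blank record-separator lines survive).
    robotstxt = re.sub(r"[ \t]+\n", "\n", robotstxt)
    return robotstxt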

I think that if you take a look at the RFC at http://www.robotstxt.org/wc/norobots-rfc.html, you will see a "BNF" description in section 3.3 Formal Syntax which might help you to beef up the parsing.

I suspect that the use of multiple User-agent lines within a single record might be hard to accommodate. (It seems to be baffling me, but luckily I don't think anyone on our network is that sophisticated.)
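
To make the problem concrete, the record grouping I have in mind looks roughly like this in Python (again just an illustrative sketch, not dowalk_beta code):

def parse_records(robotstxt: str):
    """Group a cleaned robots.txt into (agents, disallows) records.
    A record may open with several User-agent lines before its rules."""
    records, agents, disallows = [], [], []
    for line in robotstxt.splitlines():
        line = line.strip()
        if not line:                      # blank line ends the current record
            if agents:
                records.append((agents, disallows))
            agents, disallows = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if disallows:                 # new record started without a blank line
                records.append((agents, disallows))
                agents, disallows = [], []
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
    if agents:
        records.append((agents, disallows))
    return records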

Another issue is that the existing code stops looking as soon as it finds "$myname". I don't see anything in the RFC that hints at this behavior. I think we need to accept directives for either "*" or "$myname". I think I'm going to just whack the <if $Agent eq $myname><break> line.

I can't believe I'm alone here. Does anyone have a correct "robots.txt" parser? (Actually, I am not all that interested in the "Allow" directive since we have told our webmasters not to try to use it.)

Oh yeah... one last point. I quote from the RFC:
...snip...
As the majority of /robots.txt files are created with platform-specific text editors, robots should be liberal in accepting files with different end-of-line conventions, specifically CR and LF in addition to CRLF.
...snip...
I'm not very good at Vortex regexps, but both the original code and my hacks shown above are specific to UNIX line-endings. What is the easiest way to accommodate line-ending variations?
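
For comparison, in Python I would just lean on splitlines(), which already accepts LF, CRLF, and bare CR; the equivalent regexp split is shown too (sketch only, not Vortex):

import re

def split_lines_tolerant(robotstxt: str):
    """Split on LF, CRLF, or bare CR, per the RFC's advice to be liberal."""
    return robotstxt.splitlines()

def split_lines_regex(robotstxt: str):
    """Same idea, spelled out as an explicit pattern."""
    return re.split(r"\r\n|\r|\n", robotstxt)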
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

"dowalk_beta" and robots.txt

Post by mark »

\space includes \x0a and could eat a delimiter. I would use
[\x20\x09]+\x0d?>>\x0a
For multiple user agents something like this should do it:
<!-- check each User-agent value in the record for $myname or "*" -->
<$xa="">
<split "\x0a" $Agent>
<loop $ret>
  <if $ret eq $myname or $ret eq "*">
    <$xa=$ret><break>
  </if>
</loop>
<if $xa ne "">
  ... it's a match ...
$myname and "*" have to be treated as distinct. Otherwise you can't disallow every robot except one, as in:
User-agent: *
Disallow: /

User-agent: Webinator
Disallow: /cgi-bin
A specific name should override *.
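
If it helps to see that selection rule outside of Vortex, the logic is roughly this (Python sketch; records is a list of (agent-names, disallow-paths) pairs, and the names are illustrative):

def select_record(records, myname: str):
    """Prefer the record that names us explicitly; fall back to '*'."""
    wildcard = None
    for agents, disallows in records:
        lowered = [a.lower() for a in agents]
        if myname.lower() in lowered:
            return disallows              # a specific match always wins
        if "*" in lowered:
            wildcard = disallows          # remember it, but keep looking
    return wildcard or []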

To handle DOS-style CRLF line endings, use
<split "\x0d?>>\x0a" $Disallow>
I think handling the Mac usage of CR *instead* of LF would be too much, but it could be done. If you want to be tolerant of more spooge, you could do
<rex ">>=[^*]+" $dislist>
to handle incorrect things like
Disallow: /somedir/*
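
In Python terms that cleanup is roughly (illustrative only):

def clean_disallow(path: str) -> str:
    """Keep only the part before the first '*', mirroring the rex above."""
    return path.split("*", 1)[0]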
galderman
Posts: 9
Joined: Mon Jun 04, 2001 1:00 pm

"dowalk_beta" and robots.txt

Post by galderman »

Thanks for the regexp and Vortex hints. I'll try to implement some of that.

On a re-re-re-reading of the RFC, I now see your point about the specific name overriding the "*" specification.

Re: the Mac... DOH!! I completely forgot about that.
For us, I think handling both PC and Unix line endings would be fine.

Thanks for the quick feedback.