I found that the getrobotstxt function in dowalk_beta
needs some work. The current version is intolerant of
the variations found in legal robots.txt files around our network.
Specifically, we need to account for comments indicated by "#" either as separate "comment lines" or as trailing comments on other lines within the records.
I found that I need to insert the following after the "fetch":
<sandr "\#=[^\x0a]" "" $robotstxt><$robotstxt = $ret><!-- whack comments: "#" to end of line -->
<sandr "\space+\x0a" "\x0a" $robotstxt><$robotstxt = $ret><!-- whack trailing spaces -->
If there is a better way, please advise me.
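To make the intent clearer (since I'm more comfortable outside Vortex), here is the same idea sketched in Python; the function name is made up and this is not meant to be real dowalk_beta code:

    import re

    def strip_comments(robotstxt):
        # Hypothetical helper: drop "#" to end of line, which covers both
        # whole-line comments and trailing comments, then drop any spaces
        # or tabs left dangling before the newline.
        robotstxt = re.sub(r"#[^\n]*", "", robotstxt)
        robotstxt = re.sub(r"[ \t]+\n", "\n", robotstxt)
        return robotstxt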
I think that if you take a look at the RFC at http://www.robotstxt.org/wc/norobots-rfc.html, you will see a BNF description in section 3.3 (Formal Syntax) that might help beef up the parsing.
I suspect that the use of multiple User-agent lines within a single record might be hard to accommodate. (It baffles me, but luckily I don't think anyone on our network is that sophisticated.)
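Here is roughly how I picture the record grouping, again as a Python sketch rather than anything in dowalk_beta (a record being one or more User-agent lines followed by rule lines, with records separated by blank lines):

    def parse_records(robotstxt):
        # Hypothetical helper: returns a list of (agents, rules) pairs, where
        # one record can carry several User-agent lines sharing the same rules.
        records, agents, rules = [], [], []
        for line in robotstxt.splitlines():
            line = line.strip()
            if not line:
                if agents:
                    records.append((agents, rules))
                agents, rules = [], []
            elif line.lower().startswith("user-agent:"):
                if rules:  # a new record started without a blank separator
                    records.append((agents, rules))
                    agents, rules = [], []
                agents.append(line.split(":", 1)[1].strip())
            else:
                rules.append(line)
        if agents:
            records.append((agents, rules))
        return records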
Another issue is that the existing code stops looking if it finds "$myname". I don't see anything in the RFC that hints at this behavior; I think we need to accept directives for "*" or "$myname". I'm inclined to just whack the <if $Agent eq $myname><break>.
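My reading of the RFC is: obey the first record whose User-agent names you, and fall back to the "*" record only if no record does. Something like this, building on the made-up parse_records above (Python sketch, not real code):

    def rules_for(records, myname):
        # Prefer the first record that names us; fall back to the "*" record.
        # (Section 3.2.1 of the RFC actually calls for a case-insensitive
        # substring match on the name token, so this is simplified.)
        fallback = None
        for agents, rules in records:
            for agent in agents:
                if agent == "*" and fallback is None:
                    fallback = rules
                elif agent.lower() == myname.lower():
                    return rules
        return fallback if fallback is not None else []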
I can't believe I'm alone here. Does anyone have a correct "robots.txt" parser? (Actually, I am not all that interested in the "Allow" directive since we have told our webmasters not to try to use it.)
Oh yeah... one last point. I quote from the RFC:
...snip...
As the majority of /robots.txt files are created with platform-specific text editors, robots should be liberal in accepting files with different end-of-line conventions, specifically CR and LF in addition to CRLF.
...snip...
I'm not very good at Vortex regexps, but both the original code and my hacks shown above are specific to UNIX line-endings. What is the easiest way to accommodate line-ending variations?
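In Python I would just normalize everything to LF before parsing, something like the sketch below; I assume there is an equivalent trick with REX character classes, but I don't know it:

    def normalize_newlines(robotstxt):
        # Hypothetical helper: fold CRLF first, then any stray CR, so that
        # every line ends in a single LF before any further parsing.
        return robotstxt.replace("\r\n", "\n").replace("\r", "\n")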