parsing robots.txt

sourceuno
Posts: 225
Joined: Mon Apr 09, 2001 3:58 pm

Post by sourceuno »

I'm having trouble getting the User-agent from the following robots.txt, which is contained in the $robotstxt variable:

User-agent: *
Disallow: /

Here's the code I'm using:

<$robotsch="multiple
recdelim \x0a=[\x20\x09]*\x0d?\x0a
allmatch \x0a
field Agent varchar User-agent ''
field Disallow varchar Disallow ''
">
<timport row $robotsch $robotstxt><!-- parse the file -->
agent:$Agent
dis: $Disallow
<local xa="">
<lower $Agent><$Agent=$ret><!-- work in lowercase -->
<split "\x0d?>>\x0a" $Agent><!-- break out multiples -->
<sandr "\space+>>=" "" $ret>
<loop $ret><!-- for each agent -->
<if $ret eq $myname or $ret eq "*"><!-- for me or anyone -->
<$xa=$ret><break><!-- something i should look at -->
</if>
</loop>
<if $xa ne "">
<$Agent=$xa><!-- matching agent name -->
<$rrejects=><!-- clear any previous matches -->
<$rrejectssrc=>
<split "\x0d?>>\x0a" $Disallow><!-- break out multiples -->
<$dislist=$ret>
<loop $dislist><!-- for each disallow -->
<rex ">>=[^ #*]+" $dislist><!-- truncate at incorrect usage -->
</loop>
</if>
</timport>

The $Agent variable never returns the *.
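For comparison, here is the same record-matching logic sketched in Python rather than Vortex (a rough translation of the intent of the code above, not the Vortex semantics: normalize line endings, split into blank-line-separated records, and take the Disallow list from the first record whose User-agent list names this robot or `*`):

```python
import re

def matching_disallows(robots_txt: str, myname: str) -> list[str]:
    """Return the Disallow paths for the record matching myname or '*'."""
    # Normalize CRLF/CR so the line split works regardless of platform.
    text = robots_txt.replace("\r\n", "\n").replace("\r", "\n")
    # Records in robots.txt are separated by blank lines.
    records = re.split(r"\n\s*\n", text.strip())
    for record in records:
        agents, disallows = [], []
        for line in record.split("\n"):
            field, _, value = line.partition(":")
            field = field.strip().lower()          # work in lowercase
            value = value.split("#")[0].strip()    # drop trailing comments
            if field == "user-agent":
                agents.append(value.lower())
            elif field == "disallow":
                disallows.append(value)
        if myname.lower() in agents or "*" in agents:
            return disallows
    return []
```

So `matching_disallows("User-agent: *\nDisallow: /\n", "mybot")` yields `["/"]`.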
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Do you get anything in $Agent? I just tried your example, and it did show $Agent as *. Is it possible there are additional control characters in the robots.txt that are causing the problem? Do you have a URL we could <fetch>?
John Turnbull
Thunderstone Software
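A quick way to check for the stray control characters John mentions (a generic Python sketch, not Vortex) is to print each line's repr, which makes hidden bytes like a carriage return visible:

```python
def show_control_chars(robots_txt: str) -> list[str]:
    """Return each line's repr so hidden control characters (\\r, \\t)
    become visible instead of silently breaking field matches."""
    return [repr(line) for line in robots_txt.split("\n")]

# A CR left behind by CRLF line endings shows up as \r at line end:
for line in show_control_chars("User-agent: *\r\nDisallow: /\r\n"):
    print(line)
```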
sourceuno

Post by sourceuno »

John
Site Admin

Post by John »

I tried adding a <fetch> of that URL and setting $robotstxt to $ret, then executing your code; the output was:

agent:*
dis: /
sourceuno

Post by sourceuno »

I stripped some code out of the timport, but I still get the following output:

agent:
dis: /

Here's the code:

<a name=main>
<$robotsch="multiple
recdelim \x0a=[\x20\x09]*\x0d?\x0a
allmatch \x0a
field Agent varchar User-agent ''
field Disallow varchar Disallow ''
">
<urlcp maxredirs 0><!-- no redirs for robots.txt -->
<fetch http://emerald2.weddingchannel.com/robots.txt><!-- get the robots.txt file -->
<$robotstxt=$ret>
<urlinfo httpcode>
<if $ret eq 200><!-- webserver says ok -->
<timport row $robotsch $robotstxt><!-- parse the file -->
agent:$Agent
dis: $Disallow
</timport>
</if>
</a>
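The empty `agent:` line in the output above is consistent with a line-separator problem. A Python illustration (not Vortex, and a simplification of what timport's default field tags do): if the file uses CRLF endings and records are split on bare `\n`, a hidden `\r` survives on each value, so an exact comparison against `*` fails:

```python
def parse_fields_naive(text: str) -> dict[str, list[str]]:
    """Split on '\\n' only and collect field values verbatim; with CRLF
    input the trailing '\\r' is kept, which breaks exact matching."""
    out: dict[str, list[str]] = {}
    for line in text.split("\n"):
        field, _, value = line.partition(":")
        out.setdefault(field.lower(), []).append(value)
    return out

unix = parse_fields_naive("User-agent: *\nDisallow: /")
dos = parse_fields_naive("User-agent: *\r\nDisallow: /")
# unix["user-agent"] == [' *'], but dos["user-agent"] == [' *\r'] --
# the same text with DOS line endings carries an invisible CR.
```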
John
Site Admin

Post by John »

That is due to variations in line separators across platforms. The easiest solution is probably to write the rex expressions yourself instead of relying on the default field tags, e.g. instead of User-agent use

/>>[\x0d\x0a]\RUser-agent:=\P[\x20\x09]*[^\x0d\x0a]+
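A rough Python-regex analogue of that rex expression (my translation: the rex lookbehind/match markers are approximated with a non-capturing prefix and a capture group, case-insensitively, and any of LF, CRLF, or bare CR is accepted before the field name):

```python
import re

# Match "User-agent:" at start of text or after any line separator,
# skip spaces/tabs, and capture the value up to the next CR or LF.
AGENT_RE = re.compile(r"(?:^|[\r\n])user-agent:[ \t]*([^\r\n]+)",
                      re.IGNORECASE)

def agents(robots_txt: str) -> list[str]:
    """Return all User-agent values regardless of line-ending style."""
    return [m.strip() for m in AGENT_RE.findall(robots_txt)]
```

With this, `agents("User-agent: *\r\nDisallow: /")` returns `["*"]` whether the file uses Unix or DOS line endings.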