parsing robots.txt

sourceuno
Posts: 225
Joined: Mon Apr 09, 2001 3:58 pm

Post by sourceuno »

I'm having trouble getting the User-agent from the following robots.txt, which is contained in the $robotstxt variable:

User-agent: *
Disallow: /

Here's the code I'm using:

<$robotsch="multiple
recdelim \x0a=[\x20\x09]*\x0d?\x0a
allmatch \x0a
field Agent varchar User-agent ''
field Disallow varchar Disallow ''
">
<timport row $robotsch $robotstxt><!-- parse the file -->
agent:$Agent
dis: $Disallow
<local xa="">
<lower $Agent><$Agent=$ret><!-- work in lowercase -->
<split "\x0d?>>\x0a" $Agent><!-- break out multiples -->
<sandr "\space+>>=" "" $ret>
<loop $ret><!-- for each agent -->
<if $ret eq $myname or $ret eq "*"><!-- for me or anyone -->
<$xa=$ret><break><!-- something i should look at -->
</if>
</loop>
<if $xa ne "">
<$Agent=$xa><!-- matching agent name -->
<$rrejects=><!-- clear any previous matches -->
<$rrejectssrc=>
<split "\x0d?>>\x0a" $Disallow><!-- break out multiples -->
<$dislist=$ret>
<loop $dislist><!-- for each disallow -->
<rex ">>=[^ #*]+" $dislist><!-- truncate at incorrect usage -->
</loop>
</if>
</timport>

The $Agent variable never returns the *.
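For comparison, here is the same record-matching logic sketched in Python rather than Vortex (a rough translation of the intent of the code above, not the Vortex semantics: normalize line endings, split into blank-line-separated records, and take the Disallow list from the first record whose User-agent list names this robot or `*`):

```python
import re

def matching_disallows(robots_txt: str, myname: str) -> list[str]:
    """Return the Disallow paths for the record matching myname or '*'."""
    # Normalize CRLF/CR so the line split works regardless of platform.
    text = robots_txt.replace("\r\n", "\n").replace("\r", "\n")
    # Records in robots.txt are separated by blank lines.
    records = re.split(r"\n\s*\n", text.strip())
    for record in records:
        agents, disallows = [], []
        for line in record.split("\n"):
            field, _, value = line.partition(":")
            field = field.strip().lower()          # work in lowercase
            value = value.split("#")[0].strip()    # drop trailing comments
            if field == "user-agent":
                agents.append(value.lower())
            elif field == "disallow":
                disallows.append(value)
        if myname.lower() in agents or "*" in agents:
            return disallows
    return []
```

So `matching_disallows("User-agent: *\nDisallow: /\n", "mybot")` yields `["/"]`.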
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Do you get anything in $Agent? I just tried your example, and it did show $Agent as *. Is it possible there are additional control characters in the robots.txt that are causing the problem? Do you have a URL we could <fetch>?
John Turnbull
Thunderstone Software
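A quick way to check for the stray control characters John mentions (a generic Python sketch, not Vortex) is to print each line's repr, which makes hidden bytes like a carriage return visible:

```python
def show_control_chars(robots_txt: str) -> list[str]:
    """Return each line's repr so hidden control characters (\\r, \\t)
    become visible instead of silently breaking field matches."""
    return [repr(line) for line in robots_txt.split("\n")]

# A CR left behind by CRLF line endings shows up as \r at line end:
for line in show_control_chars("User-agent: *\r\nDisallow: /\r\n"):
    print(line)
```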
sourceuno

Post by sourceuno »

John
Site Admin

Post by John »

I tried adding a <fetch> of that URL and setting $robotstxt to $ret, then executing your code; the output was:

agent:*
dis: /
sourceuno

Post by sourceuno »

I stripped some code out of the timport, but I still get the following output:

agent:
dis: /

Here's the code:

<a name=main>
<$robotsch="multiple
recdelim \x0a=[\x20\x09]*\x0d?\x0a
allmatch \x0a
field Agent varchar User-agent ''
field Disallow varchar Disallow ''
">
<urlcp maxredirs 0><!-- no redirs for robots.txt -->
<fetch http://emerald2.weddingchannel.com/robots.txt><!-- get the robots.txt file -->
<$robotstxt=$ret>
<urlinfo httpcode>
<if $ret eq 200><!-- webserver says ok -->
<timport row $robotsch $robotstxt><!-- parse the file -->
agent:$Agent
dis: $Disallow
</timport>
</if>
</a>
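The empty `agent:` line in the output above is consistent with a line-separator problem. A Python illustration (not Vortex, and a simplification of what timport's default field tags do): if the file uses CRLF endings and records are split on bare `\n`, a hidden `\r` survives on each value, so an exact comparison against `*` fails:

```python
def parse_fields_naive(text: str) -> dict[str, list[str]]:
    """Split on '\\n' only and collect field values verbatim; with CRLF
    input the trailing '\\r' is kept, which breaks exact matching."""
    out: dict[str, list[str]] = {}
    for line in text.split("\n"):
        field, _, value = line.partition(":")
        out.setdefault(field.lower(), []).append(value)
    return out

unix = parse_fields_naive("User-agent: *\nDisallow: /")
dos = parse_fields_naive("User-agent: *\r\nDisallow: /")
# unix["user-agent"] == [' *'], but dos["user-agent"] == [' *\r'] --
# the same text with DOS line endings carries an invisible CR.
```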
John
Site Admin

Post by John »

That is due to variations in line separators across platforms. The easiest solution is probably to write the rex expressions yourself instead of relying on the default field tags, e.g. instead of User-agent use

/>>[\x0d\x0a]\RUser-agent:=\P[\x20\x09]*[^\x0d\x0a]+
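A rough Python-regex analogue of that rex expression (my translation: the rex lookbehind/match markers are approximated with a non-capturing prefix and a capture group, case-insensitively, and any of LF, CRLF, or bare CR is accepted before the field name):

```python
import re

# Match "User-agent:" at start of text or after any line separator,
# skip spaces/tabs, and capture the value up to the next CR or LF.
AGENT_RE = re.compile(r"(?:^|[\r\n])user-agent:[ \t]*([^\r\n]+)",
                      re.IGNORECASE)

def agents(robots_txt: str) -> list[str]:
    """Return all User-agent values regardless of line-ending style."""
    return [m.strip() for m in AGENT_RE.findall(robots_txt)]
```

With this, `agents("User-agent: *\r\nDisallow: /")` returns `["*"]` whether the file uses Unix or DOS line endings.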