Still having problems with robots.txt

mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

I have posted about this problem in the past (http://thunderstone.master.com/texis/ma ... =3be3163f3), and it still seems that Webinator fails to follow the robots.txt rules.

I downloaded the newest scripts from your site (5.1.3, last modified Oct 18) and ran them without making any modifications. I indexed about 98,000 pages spanning about 93 sites. I have both "robots.txt" and "Meta" set to "Y". My excludes are:
~
/admin/
/calendar
/Calendar/
/Kalendar
/calendar.cgi
/wusage
/wusage6.0/
/statisitcs
/stat
/stats
/usage
/webstatistics

I am using Webinator to support a large private intranet that sits behind firewalls, so it is not accessible from the Internet. The Webinator service will be the main search engine for the users of this intranet, and we need it to follow the robots.txt and meta robots standards.

The robots.txt file is:
User-agent: *
Disallow: /td

I ran the getrobots.txt command and received the following output:

/usr/local/morph3/bin/texis "profile=osis" "top=http://osis.nima.mil/" ./dowalk/getrobots.txt

002 ./dowalk(applydbsettings2) 955: can't open /db2: No such file or directory in the function ddopen

000 ./dowalk(applydbsettings2) 955: Could not connect to /db2 in the function openntexis

004 ./dowalk(sysutil) 1914: Cannot create directory /db1: Permission denied

** Removed other errors to save space. **

000 ./dowalk(getrobotstxt) 3616: Could not connect to /db1 in the function openntexis
Agent='*'
Disallow='/td'
<p>
rrejects: lindev2o{ismcsys} 5:

I am not seeing the above errors when managing the walk through the web interface.

After the initial walk completed, the index contained a few URLs that should have been rejected due to the site's robots.txt file.

I need to get a fix for this as soon as possible. I am running Webinator on a Linux box; the kernel release we are using is 2.4.20-24.8. Thank you for your help.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Still having problems with robots.txt

Post by mark »

The errors are probably due to not running texis as the same user the webserver runs as. The install generally makes texis setuid to avoid such problems.

Check the parents of the undesired page. Is one of the parents on a different site? One possibility is that the page is getting by because of a multi-site walk: URLs that are referenced from another site will slip by without robots.txt processing, though meta robots and your profile excludes will still apply. For now the only workaround in that case is to add the robots.txt rules to the "Exclusion Prefix" in the profile, e.g. http://osis.nima.mil/td.
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

First, the errors: that was my fault; I was using a bad profile name. Here is the output with the correct profile name:
/usr/local/morph3/bin/texis "profile=orgScript" "top=http://osis.nima.mil/" ./dowalk/getrobots.txt
Requested hostprefix (derived from SSc_url): http://osis.nima.mil
Agent='*'
Disallow='/td'
<p>
rrejects: '/td' == '>>=\Lhttp://osis.nima.mil/td\L'


It looks like the parents are from different sites, so the page is getting by the robots.txt check. While the workaround you suggest would handle this case, it does not solve the problem for all sites, so I need to find a better solution.

Since I don't know all of the sites that may get indexed (sites are added, changed, and removed on a daily basis), it would be hard to set the excludes manually before each walk. While this network may only have a hundred servers, I have other networks with many more, so that approach would be impractical there. When a webmaster complains, the email goes to a group account, so my management gets the wrong impression; they don't always see or remember all of the positive remarks about the product.

What I need is for every page, before it is fetched, to go through the following checks:
1. Has the robots.txt file for this site been fetched?
a) If yes, is this URL allowed by the robots.txt file?
b) If no, fetch and process the robots.txt file, then check whether the URL is allowed.
2. If the URL is not allowed by robots.txt or meta robots instructions, record that fact in the error table.

Any suggestions on where to start in adding these features would be helpful. I would think a new table storing all of the robots.txt directives would be the best approach to this problem. In the example above, the string ">>=\Lhttp://osis.nima.mil/td\L" would be inserted into a robots table along with the agent and disallow statement.
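Roughly, I picture something along the lines of the sketch below. This is only a sketch of the idea, not working code; the robots table, its columns, and the $site, $agent, $disallow, $expr, and $url variables are placeholders I am making up for illustration.

<!-- hypothetical cache table: one row per robots.txt rule, created once -->
<sql novars "create table robots(id counter, Site varchar(80), Agent varchar(80), Disallow varchar(80), Expr varchar(160))"></sql>

<!-- when a site's robots.txt is parsed, store each rule along with the rex expression built from it -->
<sql novars "insert into robots values(counter, $site, $agent, $disallow, $expr)"></sql>

<!-- before a URL is fetched, test it against the stored rules for its site -->
<$allowed=Y>
<sql "select Expr from robots where Site=$site">
  <rex $Expr $url>
  <if $ret ne ""><$allowed=N></if>
</sql>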

Thanks for your help.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Still having problems with robots.txt

Post by mark »

In the script you'll find
<else><!-- add it to list of offsite to be processed
by some other walker -->
Shortly after that the URL gets inserted into todo. robots.txt needs to be consulted before the insert, and the insert skipped if the URL is not allowed.
You can use the options table to cache the robots.txt files and/or rules. Set the Profile value to something special like '_robots.txt', the Name to the site name (www.thesite.com), and String to the exclude value.
If there's no recent enough entry for the site, fetch and parse robots.txt by calling <getrobotstxt> and store the results in the table. Then extract the expressions from the table and <rex> the URL: <rex $String $todourl><if $ret ne ""><$accept=N><$reason="robots.txt"><else>...do the insert...</if>
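Roughly, putting those pieces together, the check before the todo insert could look something like this. This is an untested sketch: $site stands for whatever you use as the site name, and it assumes <getrobotstxt> has been modified to insert its rules into the options table.

<!-- has this site's robots.txt already been cached in the options table? -->
<$haverules=N>
<sql "select String from options where Profile='_robots.txt' and Name=$site">
  <$haverules=Y>
</sql>
<if $haverules eq N>
  <!-- nothing cached yet: fetch and parse it (assumes getrobotstxt also stores its rules in options) -->
  <getrobotstxt>
</if>
<!-- test the candidate URL against each cached rule -->
<$accept=Y>
<sql "select String from options where Profile='_robots.txt' and Name=$site">
  <rex $String $todourl>
  <if $ret ne ""><$accept=N><$reason="robots.txt"></if>
</sql>
<if $accept eq Y>
  ...do the insert...
</if>

A real version would also want to remember sites that have no robots.txt at all so they are not refetched for every URL.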
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

Thanks, I will give your suggestions a try.
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

I have made modifications to the script, which I will include below. Things seem to be working fine, except that some pages are being denied by a robots.txt file even though the server does not have one.

Here are a few lines from a log file that show what I am talking about.

gofetch-> (F)
offsiteUrl-> http://goldweb.nima.mil/sni/htmldocs/R4 ... /index.cfm
RobotRules-> >>=\Lhttp://goldweb.nima.mil/td\L
ret-> http://goldweb.nima.mil/

gofetch-> (F)
offsiteUrl-> http://goldweb.nima.mil/is/sbu_web.cfm
RobotRules-> >>=\Lhttp://goldweb.nima.mil/td\L
ret-> http://goldweb.nima.mil/

gofetch is either T or F, depending on whether robots.txt blocks the fetch.
offsiteUrl is the URL that needs to be checked.
RobotRules are the rules from the robots.txt file.
ret is the output of the rex comparison that I am doing:

<rex $RobotRules $offsiteUrl>
<if $ret ne ""><!-- Rule matches so don't fetch -->

Here are the changes I made. I would love it if something like this could be added in a future release of Webinator so I would not have to maintain the code myself.

Changes made to getrobotstxt:
<if $robotsUrl eq "">
<sum "%s" $hostprefix "/robots.txt"><$x=$ret>
<else>
<$x=$robotsUrl>
</if>

<$rrejects=$rrejects $trejects><!-- add to list -->
<!-- Jake Change. Add the robots.txt data to the options table. -->
<sql MAX=1 "select Name as SiteName from options where Name=$hostprefix AND String=$trejects"></sql>
<if $SiteName eq "">
<sql "insert into options (id,Profile,Name,String) values (counter,'_robots.txt',$hostprefix,$trejects)"></sql>
</if>
<!-- End Jake Change -->

Changes made to proclinks:

<else><!-- add it to list of offsite to be processed
          by some other walker -->
  <$todourl=$ret>
  <if $requires ne ""><!-- are there requirements -->
    <rex $requires $todourl>
  <else>
    <!-- $ret will still be $todourl for logic below -->
  </if>
  <if $ret ne ""><!-- matches required expr -->
    <$offsiteAllowed=F> <!-- Default, do not allow the offsite to be indexed -->
    <checkOffsiteRobots offsiteUrl=$todourl><$offsiteAllowed=$ret>
    <if $offsiteAllowed eq T>
      <vxcp putmsg log off><!-- ignore dup msgs -->
      <sql novars "insert into todo values(counter, $dlsecs,$depth+1,$todourl, 0, '000000000')"></sql>
      <vxcp putmsg log $ret>
    <else><!-- Blocked by robots.txt file -->
      <$accept=N><$reason="Denied by robots.txt">
    </if>
  <else>
    <$accept=N><$reason="Not in requirements">
  </if>
</if>

The support functions are below.
<a name=checkOffsiteRobots offsiteUrl>
  <local baseUrl id RobotRules prefix gofetch>
  <$gofetch=F><!-- Default value. Do not allow $offsiteUrl to be fetched -->
  <!-- Get just the domain part of the Url. This will be used to
       Check the Options table to see if we have the robots.txt processed or not -->
  <rex "http://" $offsiteUrl>
  <if $ret ne ""><$prefix=$ret>
  <else>
    <rex "https://" $offsiteUrl>
    <if $ret ne ""><$prefix=$ret>
    <else>
      <rex "ftp://" $offsiteUrl>
      <if $ret ne ""><$prefix=$ret>
      <else>
        <rex "gopher://" $offsiteUrl>
        <if $ret ne ""><$prefix=$ret>
        <else><$prefix="file://">
        </if>
      </if>
    </if>
  </if>
  <sandr $prefix "" $offsiteUrl>
  <rex ">>=[^/]+" $ret><$baseUrl=$ret>
  <sum "%s" $prefix $baseUrl><$baseUrl=$ret>
  <sum "%s" $baseUrl "/robots.txt"><$robotsUrl=$ret>
  <!-- Check the error table to see if we have tried to fetch the robots.txt file. -->
  <checkErrorTable r=$robotsUrl><$errorid=$ret>
  <if $errorid eq ""><!-- I don't have this in the error table. -->
    <getRobotRules T=$baseUrl><$RobotRules=$ret>
    <if $RobotRules eq ""><!-- Nothing in the options table. Fetch the robots.txt file -->
      <getrobotstxt>
      <!-- I know it has been fetched so check the options table to get any robot rules -->
      <getRobotRules T=$baseUrl><$RobotRules=$ret>
    </if>
    <if $RobotRules eq ""><!-- No rules, free to get -->
      <$gofetch=T>
    <else><!-- I have rules so see if they block the fetch -->
      <rex $RobotRules $offsiteUrl>
      <if $ret ne ""><!-- Rule matches so don't fetch -->
        <$gofetch=F>
      <else><!-- Free to fetch -->
        <$gofetch=T>
      </if>
    </if>
  <else><!-- Have it in error table so I am free to fetch $offsiteUrl -->
    <$gofetch=T>
  </if>
  <return $gofetch><!-- Return either T for fetch or F to stop fetch of Url -->
</a>

<!-- Get the id value out of the error table for the robots.txt file. -->
<a name=checkErrorTable r>
  <local id sql>
  <sum "%s" "select id from error where Url = '" $r "'"><$sql=$ret>
  <sql $sql></sql>
  <if $id gte $dayago><!-- $dayago set in <defaults> -->
    <!-- Delete the record from the error table and make id = "" This way the robots.txt file will be
         fetched again. Need to make sure that any newly created robots.txt file will be found. -->
    <sql NOVARS "delete from error where id=$id"></sql>
    <$id=>
  </if>
  <return $id>
</a>

<!-- Get any rules out of the options table -->
<a name=getRobotRules T>
  <local String sql>
  <sum "%s" "select String from options where Name = '" $T "' AND Profile='_robots.txt'"><$sql=$ret>
  <sql $sql></sql>
  <return $String>
</a>

Please let me know if there are improvements I can make or if you see anything that might be causing the above problem.

Thanks for your help.
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

I will try adding more logging. I am also not looping over the RobotRules, so I will modify my rex check to test each rule.
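For example, something along these lines (untested; it just moves the rex test inside the sql loop so each stored rule for the site is checked in turn, using the same options table and variables as in the code above):

<$gofetch=T><!-- assume allowed until a rule matches -->
<sql "select String from options where Name=$baseUrl AND Profile='_robots.txt'">
  <rex $String $offsiteUrl>
  <if $ret ne ""><$gofetch=F><!-- a rule matches, so block the fetch --></if>
</sql>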
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

I think I found the problem; it was in the getrobotstxt function. If it is OK with you, I will send the script so you or someone else can look it over for problems or suggestions.