Still having problems with robots.txt

mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

I have posted about this problem in the past (http://thunderstone.master.com/texis/ma ... =3be3163f3), and it still seems that Webinator fails to follow the robots.txt rules.

I downloaded the newest scripts from your site (5.1.3, last modified Oct 18) and ran them without making any modifications. I indexed about 98,000 pages spanning about 93 sites. I have both "robots.txt" and "Meta" set to "Y". My excludes are:
~
/admin/
/calendar
/Calendar/
/Kalendar
/calendar.cgi
/wusage
/wusage6.0/
/statisitcs
/stat
/stats
/usage
/webstatistics

I am using Webinator to support a large private intranet that sits behind firewalls, so it is not accessible from the Internet. The Webinator service will be the main search engine for the users of this intranet, and we need it to follow the robots.txt and meta robots standards.

The robots.txt file is:
User-agent: *
Disallow: /td

I ran the getrobots.txt command and received the following output:

/usr/local/morph3/bin/texis "profile=osis" "top=http://osis.nima.mil/" ./dowalk/getrobots.txt

002 ./dowalk(applydbsettings2) 955: can't open /db2: No such file or directory in the function ddopen

000 ./dowalk(applydbsettings2) 955: Could not connect to /db2 in the function openntexis

004 ./dowalk(sysutil) 1914: Cannot create directory /db1: Permission denied

** Removed other errors to save space. **

000 ./dowalk(getrobotstxt) 3616: Could not connect to /db1 in the function openntexis
Agent='*'
Disallow='/td'
<p>
rrejects: lindev2o{ismcsys} 5:

I am not seeing the above errors when managing the walk through the web interface.

After the initial walk completed, the index contained a few URLs that should have been rejected due to the site's robots.txt file.

I need to get a fix for this as soon as possible. I am running Webinator on a Linux box; the kernel release we are using is 2.4.20-24.8. Thank you for your help.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Still having problems with robots.txt

Post by mark »

The errors are probably due to not running texis as the same user the webserver runs as. The install generally makes texis setuid to avoid such problems.

Check the parents of the undesired page. Is one of the parents on a different site? One possibility is that the page is getting by because of a multi-site walk: URLs that are referenced from another site will slip by without robots.txt processing, though meta robots and your profile excludes will still apply. For now the only workaround in that case is to add the robots.txt rules to the "Exclusion Prefix" in the profile, e.g. http://osis.nima.mil/td.
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

First, the errors: that was my fault; I was using a bad profile name. Here is the output with the correct profile name:
/usr/local/morph3/bin/texis "profile=orgScript" "top=http://osis.nima.mil/" ./dowalk/getrobots.txt
Requested hostprefix (derived from SSc_url): http://osis.nima.mil
Agent='*'
Disallow='/td'
<p>
rrejects: '/td' == '>>=\Lhttp://osis.nima.mil/td\L'


It looks like the parents are from different sites, so the page is getting by the robots.txt check. While the workaround you suggest would handle this case, it does not solve the problem for all sites, so I need to find a better solution.

Since I don't know all of the sites that may get indexed (sites are added, changed, and removed on a daily basis), it would be hard to set the excludes manually before each walk. While this network may only have a hundred servers, I have other networks with many more, so that approach would be impractical there. When a webmaster complains, the email goes to a group account, so my management gets the wrong impression; they don't always see or remember all of the positive remarks about the product.

What I need is for every page, before it is fetched, to go through the following checks:
1. Has the robots.txt file for this site been fetched?
a) If yes, is this URL allowed by the robots.txt file?
b) If no, fetch and process the robots.txt file, then check whether the URL is allowed.
2. If the URL is not allowed by robots.txt or meta robots instructions, record that fact in the error table.

Any suggestions on where to start in adding these features would be helpful. I would think a new table storing all of the robots.txt directives would be the best approach to this problem. In the example above, the string ">>=\Lhttp://osis.nima.mil/td\L" would be inserted into a robots table along with the agent and disallow statement.
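Roughly, I picture something along the lines of the sketch below. This is only a sketch of the idea, not working code; the robots table, its columns, and the $site, $agent, $disallow, $expr, and $url variables are placeholders I am making up for illustration.

<!-- hypothetical cache table: one row per robots.txt rule, created once -->
<sql novars "create table robots(id counter, Site varchar(80), Agent varchar(80), Disallow varchar(80), Expr varchar(160))"></sql>

<!-- when a site's robots.txt is parsed, store each rule along with the rex expression built from it -->
<sql novars "insert into robots values(counter, $site, $agent, $disallow, $expr)"></sql>

<!-- before a URL is fetched, test it against the stored rules for its site -->
<$allowed=Y>
<sql "select Expr from robots where Site=$site">
  <rex $Expr $url>
  <if $ret ne ""><$allowed=N></if>
</sql>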

Thanks for your help.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Still having problems with robots.txt

Post by mark »

In the script you'll find
<else><!-- add it to list of offsite to be processed
by some other walker -->
Shortly after that the URL gets inserted into todo. robots.txt needs to be consulted before the insert, and the insert skipped if the URL is not allowed.
You can use the options table to cache the robots.txt files and/or rules. Set the Profile value to something special like '_robots.txt', the Name to the site name (www.thesite.com), and String to the exclude value.
If there's no recent enough entry for the site, fetch and parse robots.txt by calling <getrobotstxt> and store the results in the table. Then extract the expressions from the table and <rex> the URL: <rex $String $todourl><if $ret ne ""><$accept=N><$reason="robots.txt"><else>...do the insert...</if>
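Roughly, putting those pieces together, the check before the todo insert could look something like this. This is an untested sketch: $site stands for whatever you use as the site name, and it assumes <getrobotstxt> has been modified to insert its rules into the options table.

<!-- has this site's robots.txt already been cached in the options table? -->
<$haverules=N>
<sql "select String from options where Profile='_robots.txt' and Name=$site">
  <$haverules=Y>
</sql>
<if $haverules eq N>
  <!-- nothing cached yet: fetch and parse it (assumes getrobotstxt also stores its rules in options) -->
  <getrobotstxt>
</if>
<!-- test the candidate URL against each cached rule -->
<$accept=Y>
<sql "select String from options where Profile='_robots.txt' and Name=$site">
  <rex $String $todourl>
  <if $ret ne ""><$accept=N><$reason="robots.txt"></if>
</sql>
<if $accept eq Y>
  ...do the insert...
</if>

A real version would also want to remember sites that have no robots.txt at all so they are not refetched for every URL.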
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

Thanks, I will give your suggestions a try.
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

I have made modifications to the script, which I will include below. Things seem to be working fine, except that some pages are being denied by a robots.txt file even though the server does not have one.

Here are a few lines from a log file that show what I am talking about.

gofetch-> (F)
offsiteUrl-> http://goldweb.nima.mil/sni/htmldocs/R4 ... /index.cfm
RobotRules-> >>=\Lhttp://goldweb.nima.mil/td\L
ret-> http://goldweb.nima.mil/

gofetch-> (F)
offsiteUrl-> http://goldweb.nima.mil/is/sbu_web.cfm
RobotRules-> >>=\Lhttp://goldweb.nima.mil/td\L
ret-> http://goldweb.nima.mil/

gofetch is either T or F, depending on whether robots.txt blocks the fetch.
offsiteUrl is the URL that needs to be checked.
RobotRules are the rules from the robots.txt file.
ret is the output of the rex comparison that I am doing:

<rex $RobotRules $offsiteUrl>
<if $ret ne ""><!-- Rule matches so don't fetch -->

Here are the changes I made. I would love it if something like this could be added in a future release of Webinator so I would not have to maintain the code myself.

Changes made to getrobotstxt:
<if $robotsUrl eq "">
<sum "%s" $hostprefix "/robots.txt"><$x=$ret>
<else>
<$x=$robotsUrl>
</if>

<$rrejects=$rrejects $trejects><!-- add to list -->
<!-- Jake Change. Add the robots.txt data to the options table. -->
<sql MAX=1 "select Name as SiteName from options where Name=$hostprefix AND String=$trejects"></sql>
<if $SiteName eq "">
<sql "insert into options (id,Profile,Name,String) values (counter,'_robots.txt',$hostprefix,$trejects)"></sql>
</if>
<!-- End Jake Change -->

Changes made to proclinks:

<else><!-- add it to list of offsite to be processed
          by some other walker -->
  <$todourl=$ret>
  <if $requires ne ""><!-- are there requirements -->
    <rex $requires $todourl>
  <else>
    <!-- $ret will still be $todourl for logic below -->
  </if>
  <if $ret ne ""><!-- matches required expr -->
    <$offsiteAllowed=F> <!-- Default, do not allow the offsite to be indexed -->
    <checkOffsiteRobots offsiteUrl=$todourl><$offsiteAllowed=$ret>
    <if $offsiteAllowed eq T>
      <vxcp putmsg log off><!-- ignore dup msgs -->
      <sql novars "insert into todo values(counter, $dlsecs,$depth+1,$todourl, 0, '000000000')"></sql>
      <vxcp putmsg log $ret>
    <else><!-- Blocked by robots.txt file -->
      <$accept=N><$reason="Denied by robots.txt">
    </if>
  <else>
    <$accept=N><$reason="Not in requirements">
  </if>
</if>

The support functions are below.
<a name=checkOffsiteRobots offsiteUrl>
  <local baseUrl id RobotRules prefix gofetch>
  <$gofetch=F><!-- Default value. Do not allow $offsiteUrl to be fetched -->
  <!-- Get just the domain part of the Url. This will be used to
       Check the Options table to see if we have the robots.txt processed or not -->
  <rex "http://" $offsiteUrl>
  <if $ret ne ""><$prefix=$ret>
  <else>
    <rex "https://" $offsiteUrl>
    <if $ret ne ""><$prefix=$ret>
    <else>
      <rex "ftp://" $offsiteUrl>
      <if $ret ne ""><$prefix=$ret>
      <else>
        <rex "gopher://" $offsiteUrl>
        <if $ret ne ""><$prefix=$ret>
        <else><$prefix="file://">
        </if>
      </if>
    </if>
  </if>
  <sandr $prefix "" $offsiteUrl>
  <rex ">>=[^/]+" $ret><$baseUrl=$ret>
  <sum "%s" $prefix $baseUrl><$baseUrl=$ret>
  <sum "%s" $baseUrl "/robots.txt"><$robotsUrl=$ret>
  <!-- Check the error table to see if we have tried to fetch the robots.txt file. -->
  <checkErrorTable r=$robotsUrl><$errorid=$ret>
  <if $errorid eq ""><!-- I don't have this in the error table. -->
    <getRobotRules T=$baseUrl><$RobotRules=$ret>
    <if $RobotRules eq ""><!-- Nothing in the options table. Fetch the robots.txt file -->
      <getrobotstxt>
      <!-- I know it has been fetched so check the options table to get any robot rules -->
      <getRobotRules T=$baseUrl><$RobotRules=$ret>
    </if>
    <if $RobotRules eq ""><!-- No rules, free to get -->
      <$gofetch=T>
    <else><!-- I have rules so see if they block the fetch -->
      <rex $RobotRules $offsiteUrl>
      <if $ret ne ""><!-- Rule matches so don't fetch -->
        <$gofetch=F>
      <else><!-- Free to fetch -->
        <$gofetch=T>
      </if>
    </if>
  <else><!-- Have it in error table so I am free to fetch $offsiteUrl -->
    <$gofetch=T>
  </if>
  <return $gofetch><!-- Return either T for fetch or F to stop fetch of Url -->
</a>

<!-- Get the id value out of the error table for the robots.txt file. -->
<a name=checkErrorTable r>
  <local id sql>
  <sum "%s" "select id from error where Url = '" $r "'"><$sql=$ret>
  <sql $sql></sql>
  <if $id gte $dayago><!-- $dayago set in <defaults> -->
    <!-- Delete the record from the error table and make id = "" This way the robots.txt file will be
         fetched again. Need to make sure that any newly created robots.txt file will be found. -->
    <sql NOVARS "delete from error where id=$id"></sql>
    <$id=>
  </if>
  <return $id>
</a>

<!-- Get any rules out of the options table -->
<a name=getRobotRules T>
  <local String sql>
  <sum "%s" "select String from options where Name = '" $T "' AND Profile='_robots.txt'"><$sql=$ret>
  <sql $sql></sql>
  <return $String>
</a>

Please let me know if there are improvements I can make or if you see anything that might be causing the above problem.

Thanks for your help.
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

I will try adding more logging. I am also not looping over the RobotRules, so I will modify my rex check to test each rule.
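For example, something along these lines (untested; it just moves the rex test inside the sql loop so each stored rule for the site is checked in turn, using the same options table and variables as in the code above):

<$gofetch=T><!-- assume allowed until a rule matches -->
<sql "select String from options where Name=$baseUrl AND Profile='_robots.txt'">
  <rex $String $offsiteUrl>
  <if $ret ne ""><$gofetch=F><!-- a rule matches, so block the fetch --></if>
</sql>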
mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Still having problems with robots.txt

Post by mjacobson »

I think I found the problem; it was in the getrobotstxt function. If it is OK with you, I will send the script so you or someone else can look it over for problems or suggestions.