I have posted about this problem in the past (http://thunderstone.master.com/texis/ma ... =3be3163f3), and it still seems that Webinator fails to follow the robots exclusion rules.
I downloaded the newest scripts from your site (5.1.3, last modified Oct 18) and ran them without making any modifications. I indexed about 98,000 pages spanning about 93 sites. I have both "robots.txt" and "Meta" set to "Y". My excludes are (see the matching sketch after the list):
~
/admin/
/calendar
/Calendar/
/Kalendar
/calendar.cgi
/wusage
/wusage6.0/
/statisitcs
/stat
/stats
/usage
/webstatistics
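For what it's worth, I expect those excludes to behave roughly like substring matches against each URL; that is an assumption on my part, since I don't know Webinator's exact matching semantics. A minimal sketch of the behavior I expect:

# Minimal sketch, assuming an exclude pattern rejects any URL that contains it.
# (Webinator's real matching rules may differ, e.g. anchored or pattern-based.)
EXCLUDES = [
    "~", "/admin/", "/calendar", "/Calendar/", "/Kalendar", "/calendar.cgi",
    "/wusage", "/wusage6.0/", "/statisitcs", "/stat", "/stats", "/usage",
    "/webstatistics",
]

def is_excluded(url):
    # True if any exclude pattern occurs anywhere in the URL.
    return any(pattern in url for pattern in EXCLUDES)

print(is_excluded("http://osis.nima.mil/admin/index.html"))  # True
print(is_excluded("http://osis.nima.mil/somepage.html"))     # False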
I am using Webinator to support a large private Intranet that sits behind firewalls, so it is not accessible to the Internet. The Webinator service will be the main search engine for the users of this Intranet, and we need it to follow the robots.txt and Meta robots standards.
The robots.txt file is:
User-agent: *
Disallow: /td
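With that rule, every agent should be blocked from anything under /td. Here is a quick sanity check of the expected behavior using Python's standard urllib.robotparser (this sketches the standard, not Webinator's internals):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /td",
])

# Anything whose path starts with /td must be rejected for any agent.
print(rp.can_fetch("*", "http://osis.nima.mil/td/page.html"))  # False
print(rp.can_fetch("*", "http://osis.nima.mil/index.html"))    # True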
I ran the getrobots.txt command and received the following output:
/usr/local/morph3/bin/texis "profile=osis" "top=http://osis.nima.mil/" ./dowalk/getrobots.txt
002 ./dowalk(applydbsettings2) 955: can't open /db2: No such file or directory in the function ddopen
000 ./dowalk(applydbsettings2) 955: Could not connect to /db2 in the function openntexis
004 ./dowalk(sysutil) 1914: Cannot create directory /db1: Permission denied
** Removed other errors to save space. **
000 ./dowalk(getrobotstxt) 3616: Could not connect to /db1 in the function openntexis
Agent='*'
Disallow='/td'
rrejects: lindev2o{ismcsys} 5:
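Those "can't open /db2" and "Cannot create directory /db1" errors suggest the command-line run is resolving the profile's database directories to the filesystem root rather than to the profile directory. A quick check along these lines (the paths are copied verbatim from the output above) shows whether they exist or could even be created:

import os

# Paths taken verbatim from the error messages above.
for path in ("/db1", "/db2"):
    parent = os.path.dirname(path) or "/"
    print(path,
          "exists:", os.path.exists(path),
          "| parent writable:", os.access(parent, os.W_OK))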
I am not seeing the above errors when managing the walk through the web interface.
After the initial walk completed, I had a few URLs that should have been rejected due to the site's robots.txt file.
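A check along these lines would flag the offenders; url_list here is a placeholder for an export of the URLs Webinator actually indexed:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://osis.nima.mil/robots.txt")
rp.read()  # fetch and parse the site's live robots.txt

# url_list is a placeholder for the URLs Webinator indexed.
url_list = [
    "http://osis.nima.mil/td/somepage.html",
    "http://osis.nima.mil/index.html",
]
for url in url_list:
    if not rp.can_fetch("*", url):
        print("should have been rejected:", url)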
We need a fix for this as soon as possible. I am running Webinator on a Linux box; the kernel release we are using is 2.4.20-24.8. Thank you for your help.