Excluding a spider page

Post by **smallguy** » Wed Mar 20, 2002 10:35 am

Using:

Webinator WWW Site Indexer Version 2.56 (Commercial)
Release: 20010814

We have a page that lists every page on the site, to make the site easier for Webinator to spider (some of our links are generated with JavaScript).

The only problem is, this spider page is getting spidered itself. I've tried using robots.txt in the form of:

Disallow: /dir/spider.html

But this wasn't picked up. I tried using the -x argument but discovered this stopped the page from being read in the first place.

Does anyone have any suggestions? I need this page to be read for its links but kept out of the index!

Post by **mark** » Wed Mar 20, 2002 10:59 am

That would require a meta robots tag within the HTML page itself (<meta name="robots" content="NOINDEX,FOLLOW">) and Webinator 4, which supports meta robots. Webinator 2 does not.

The alternative is to do the walk and delete the list page afterwards. See the Webinator manual for how to remove pages from the database.

BTW, your robots.txt syntax is incomplete. See the Webinator manual for a description of the syntax.
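
For reference, a record in the standard robots exclusion format needs a User-agent line before its Disallow lines. A minimal sketch, assuming the rule is meant to apply to every robot (the path is the one from the first post):

```
User-agent: *
Disallow: /dir/spider.html
```

Note that a rule like this tells the robot not to fetch the page at all, so, like -x, it would not let the links on the page be followed.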

Post by **smallguy** » Thu Mar 21, 2002 4:52 am

I'm trying to delete using the following:

C:\Inetpub\cgi>texis -d "C:\Program Files\Thunderstone Software\Webinator2\folder\db_all" -s "DELETE FROM html WHERE Url='domain.com/spider.html'"

But I get the following back:

000 Mar 21 09:49:17 Insufficient permissions on html in the function ipreparetree
000 Insufficient permissions on html in the function ipreparetree
000 Mar 21 09:49:17 SQLPrepare() failed with -1 in the function prepntexis
000 SQLPrepare() failed with -1 in the function prepntexis

I've also tried referencing the row in the table by its id, with the same outcome.

What am I doing wrong?!

Post by **mark** » Thu Mar 21, 2002 10:01 am

gw creates the tables as the Texis user "_SYSTEM", so give texis the
-u _SYSTEM -p ""
options.
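
Applied to the command from the earlier post, that would look roughly like the following. This is only a sketch that adds the two options to the original command line; placing them before -s is an assumption, and the database path and URL are the ones from that post.

```
texis -d "C:\Program Files\Thunderstone Software\Webinator2\folder\db_all" -u _SYSTEM -p "" -s "DELETE FROM html WHERE Url='domain.com/spider.html'"
```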