Excluding a spider page

Posted: Wed Mar 20, 2002 10:35 am
by smallguy
using:

Webinator WWW Site Indexer Version 2.56 (Commercial)
Release: 20010814

We have a page that lists every page on the site, to make the site easier for Webinator to spider (since some of our links are only rendered via JavaScript).

The only problem is, this spider page is getting spidered itself. I've tried using robots.txt in the form of:

Disallow: /dir/spider.html

But this wasn't picked up. I also tried the -x argument, but discovered that it stopped the page from being read at all.

Does anyone have any suggestions? I need Webinator to read the page for its links, but not index the page itself!

Excluding a spider page

Posted: Wed Mar 20, 2002 10:59 am
by mark
That would require a meta robots tag within the HTML page itself (<meta name="robots" content="NOINDEX,FOLLOW">) and Webinator 4, which supports meta robots. Webinator 2 does not support meta robots.
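
For reference, the tag goes in the head of the list page. A hypothetical skeleton (page name and links invented for illustration):

<html>
<head>
<meta name="robots" content="NOINDEX,FOLLOW">
<title>Site index for spidering</title>
</head>
<body>
<a href="/page1.html">page1</a>
...
</body>
</html>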

The alternative is to do the walk and then delete the list page afterwards. See the Webinator manual for how to remove pages from the database.

BTW, your robots.txt syntax is incomplete. See the Webinator manual for a description of the syntax.
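
For what it's worth, the standard robots.txt format requires a User-agent line before any Disallow rules, so a minimal file would look something like:

User-agent: *
Disallow: /dir/spider.html

(check the Webinator manual for exactly which parts of the syntax Webinator 2 honors).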

Excluding a spider page

Posted: Thu Mar 21, 2002 4:52 am
by smallguy
I'm trying to delete the page using the following:

C:\Inetpub\cgi>texis -d "C:\Program Files\Thunderstone Software\Webinator2\folder\db_all" -s "DELETE FROM html WHERE Url='domain.com/spider.html'"

But I get the following back:

000 Mar 21 09:49:17 Insufficient permissions on html in the function ipreparetree
000 Insufficient permissions on html in the function ipreparetree
000 Mar 21 09:49:17 SQLPrepare() failed with -1 in the function prepntexis
000 SQLPrepare() failed with -1 in the function prepntexis


I've also tried referencing the row in the table by its id, with the same outcome.

What am I doing wrong?!

Excluding a spider page

Posted: Thu Mar 21, 2002 10:01 am
by mark
gw creates the tables as texis user "_SYSTEM", so give texis the

-u _SYSTEM -p ""

options.
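
Combined with your earlier command line, that would be something like (same paths and URL as in your example):

texis -u _SYSTEM -p "" -d "C:\Program Files\Thunderstone Software\Webinator2\folder\db_all" -s "DELETE FROM html WHERE Url='domain.com/spider.html'"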