exclusion issue

Posted: Tue Mar 12, 2002 10:15 am
by ycharreix
Hi
I'm trying to exclude all the pages of my site which contain the word "print", so I used:
gw -x/print -dmy_db http://my_site
but that doesn't work; I still have the pages in my DB.
What's wrong?
thanks

Posted: Tue Mar 12, 2002 10:27 am
by mark
http://www.thunderstone.com/site/gw25man/node58.html says
"Excludes URLs with the path component matching the REX expression EXPR"

-x only applies to the URL, not the page content. To eliminate pages based on content you need to delete them after the walk, or switch to Webinator 4, which will let you modify the walk script to exclude pages based on whatever criteria you desire.

Posted: Tue Mar 12, 2002 10:33 am
by ycharreix
Actually, I want to exclude the pages which have the word "print" in their URLs.
And I cannot switch to Webinator 4 because I'm using a Content Management Tool which doesn't support this version...

Posted: Tue Mar 12, 2002 11:05 am
by mark
Did you start with a clean database? -x won't cause gw to delete URLs that are already in the database.
What's your gw version and release (gw -version)?
What's an example of a url that's getting into the database that shouldn't?

And the standard Webinator license doesn't really allow usage with a CMS. See attachment A. You would need a Webinator CMS license or full Texis license.

Posted: Tue Mar 12, 2002 11:21 am
by ycharreix
I created the DB; it was totally empty.
This is the version I use:
CMS Webinator Webinator WWW Site Indexer Version 2.6
Copyright(c) 1995,1996,1997,1998,1999,2000,2001 Thunderstone EPI Inc.
Release: 20011228 sparc-sun-solaris2.6-64

That's the CMS version.
Half the pages generated by the CMS are printer-friendly versions, and that's why "print" is in the URL:
www.mysite.com/page.htm?print=true
so the result list contains the same pages twice...
I just started using Webinator today. To create a new DB, is the following line correct?
gw -x/print -dmy_db http://www.mysite.com/st-index

Another question: I want to create another DB, because I have another section of my site I want to index, so I typed another line:
gw -x/print -dmy_db2 http://www.mysite.com/eur-index
I get the pages under eur-index but also the ones under st-index. I wiped the DB and the todo list and retried, but I get the same result.
Did I forget something?
thanks

Posted: Tue Mar 12, 2002 11:48 am
by mark
-x/ applies to the *path* of the url. Data after the ? is not the path, it's the query. You want -exquery=print .
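
To see why -x/print never matched, it can help to split the example URL into its parts. This is only an illustrative sketch using Python's standard urllib.parse, not anything gw does internally. The http:// scheme is added here as an assumption so the parser can tell host from path, and a plain substring test stands in for the real REX match.

```python
from urllib.parse import urlsplit

def split_path_and_query(url):
    """Return (path, query) for a URL, using the standard library parser."""
    parts = urlsplit(url)
    return parts.path, parts.query

# Example URL from this thread, with an assumed http:// scheme added.
path, query = split_path_and_query("http://www.mysite.com/page.htm?print=true")

print("path: ", path)
print("query:", query)

# A path-only exclusion has nothing to match on; the word is in the query.
print("'print' in path: ", "print" in path)
print("'print' in query:", "print" in query)
```

The word "print" only shows up in the query part, which is why an option that looks at the query is needed rather than one that looks at the path.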

To stay under a specific directory you need to use the -j option
gw -exquery=print -dmy_db2 -jhttp://www.mysite.com/eur-index http://www.mysite.com/eur-index/
otherwise gw is free to fetch any page on the same machine.
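
As a rough mental model only: the sketch below assumes -j acts like a simple "URL must start with this prefix" test. That is an assumption made for illustration, not a description of gw's actual matching rules. The function name stays_under and the page names are hypothetical.

```python
def stays_under(url, prefix):
    """Assumed model of a prefix restriction: keep only URLs that start with prefix."""
    return url.startswith(prefix)

# Prefix taken from the -j option in the command above.
prefix = "http://www.mysite.com/eur-index"

# Hypothetical pages that a walk of the site might discover.
candidates = [
    "http://www.mysite.com/eur-index/",
    "http://www.mysite.com/eur-index/page.htm",
    "http://www.mysite.com/st-index/page.htm",
]

kept = [u for u in candidates if stays_under(u, prefix)]
for u in kept:
    print(u)
```

Under that assumption the st-index page is dropped, while without any prefix all three URLs are on the same machine and would be fair game.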

Posted: Tue Mar 12, 2002 12:12 pm
by ycharreix
OK, everything works just fine now.
Thanks for your help!