exclusion issue

ycharreix
Posts: 6
Joined: Tue Mar 12, 2002 10:10 am

Post by ycharreix »

Hi
I'm trying to exclude all the pages of my site that contain the word "print", so I used:
gw -x/print -dmy_db http://my_site
but that doesn't work: the pages are still in my DB.
What's wrong?
thanks
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

http://www.thunderstone.com/site/gw25man/node58.html says
"Excludes URLs with the path component matching the REX expression EXPR"

-x applies only to the URL, not the page content. To eliminate pages based on content, you need to either delete them after the walk or switch to Webinator 4, which lets you modify the walk script to exclude pages based on whatever criteria you like.
ycharreix
Posts: 6
Joined: Tue Mar 12, 2002 10:10 am

Post by ycharreix »

Actually, I want to exclude the pages that have the word "print" in their URLs.
And I cannot switch to Webinator 4 because I'm using a Content Management Tool which doesn't support that version...
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Did you start with a clean database? -x won't cause gw to delete URLs that are already in the database.
What's your gw version and release (gw -version)?
What's an example of a url that's getting into the database that shouldn't?

And the standard Webinator license doesn't really allow usage with a CMS. See attachment A. You would need a Webinator CMS license or full Texis license.
ycharreix
Posts: 6
Joined: Tue Mar 12, 2002 10:10 am

Post by ycharreix »

I created the DB myself; it was totally empty.
This is the version I use:
CMS Webinator Webinator WWW Site Indexer Version 2.6
Copyright(c) 1995,1996,1997,1998,1999,2000,2001 Thunderstone EPI Inc.
Release: 20011228 sparc-sun-solaris2.6-64

That's the CMS version.
Half the pages generated by the CMS are printer-friendly versions, which is why "print" is in the URL:
www.mysite.com/page.htm?print=true
so the result list contains every page twice...
I just started using Webinator today. To create a new DB, is the following line correct?
gw -x/print -dmy_db http://www.mysite.com/st-index

Another question: I want to create another DB, because I have another section of my site I want to index, so I typed another line:
gw -x/print -dmy_db2 http://www.mysite.com/eur-index
I get the pages under eur-index but also the ones under st-index. I wiped the DB and the todo list and retried, but I get the same result.
Did I forget something?
thanks
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

-x/ applies to the *path* of the url. Data after the ? is not the path, it's the query. You want -exquery=print .
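
To see the path/query split concretely, here is a minimal Python sketch using the standard urllib.parse module on the example URL from this thread (with http:// added so it parses as an absolute URL). The plain substring checks are only a stand-in for gw's REX matching, which is an assumption made for illustration and not how gw is implemented.

```python
from urllib.parse import urlsplit

# Example URL from this thread, with a scheme added so it parses as absolute.
url = "http://www.mysite.com/page.htm?print=true"

parts = urlsplit(url)
print(parts.path)   # /page.htm
print(parts.query)  # print=true

# Plain substring checks standing in for gw's REX matching (an assumption).
path_hit = "/print" in parts.path    # what a path-only test would look at
query_hit = "print" in parts.query   # what a query test would look at
print(path_hit, query_hit)           # False True
```

The word "print" appears only after the ?, so a test that looks at the path alone never sees it.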

To stay under a specific directory you need to use the -j option
gw -exquery=print -dmy_db2 -jhttp://www.mysite.com/eur-index http://www.mysite.com/eur-index/
otherwise gw is free to fetch any page on the same machine.
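
As a rough illustration of what staying under a prefix means, here is a short Python sketch. The helper under_prefix and the page.htm URLs are hypothetical, and a simple starts-with test is assumed; gw's actual -j matching may differ in detail.

```python
def under_prefix(url: str, prefix: str) -> bool:
    """Hypothetical helper: True if url begins with the required prefix."""
    return url.startswith(prefix)


# Prefix taken from the -j option in the command above.
prefix = "http://www.mysite.com/eur-index"

# Made-up candidate links a walker might find on the same machine.
candidates = [
    "http://www.mysite.com/eur-index/",
    "http://www.mysite.com/eur-index/page.htm",
    "http://www.mysite.com/st-index/page.htm",
]

kept = [u for u in candidates if under_prefix(u, prefix)]
for u in kept:
    print(u)
```

Only the two eur-index URLs are kept; the st-index page is on the same host but falls outside the prefix.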
ycharreix
Posts: 6
Joined: Tue Mar 12, 2002 10:10 am

Post by ycharreix »

ok
everything works just fine.
thanks for your help!