Excluding document parts (like selection boxes) from indexing

Post by **t.wuersch** » Thu Mar 29, 2001 6:15 am

I've got a bit of a problem here: I'd like to tell gw that it should index only the text parts of a document and exclude unnecessary stuff (like selection boxes). Is this possible?

This exclusion would be necessary for my web site because a Webinator search on my site currently returns results that are not very useful. Let me explain: I've got, say, 10 pages with product descriptions for 10 different products. On each of these pages there's a selection box where you can choose a different product whose description you want to see. Now here's the problem: when I search for a product by name, the search result contains all 10 pages, but it should contain only the one page for the correct product. The reason is quite obvious: gw indexes the product names in the selection box as well as the rest of the document. Now my simple question: is there a way to stop gw from doing so? Many thanks

Timo

Post by **mark** » Thu Mar 29, 2001 9:50 am

Currently no. The next release of Webinator will have that capability. We don't have a release date for that yet, but it's not too far away now.

Post by **mark** » Thu Mar 29, 2001 11:30 am

Sorry, I neglected to mention that you could do it by modifying the scripted walker ( ftp://ftp.thunderstone.com/pub/dowalk_beta ).

You can surround the unwanted html portions with <DEL> and </DEL> then set <urlcp ignoredel 1> in dowalk. See http://www.thunderstone.com/site/vortexman/node132.html

Another possibility would be to modify the scripted walker to remove all forms, or select lists, or whatever using <sandr>. Then reprocess the resultant html through <fetch>. See the vortex manual for details on how to use those functions.

Post by **bart** » Thu Mar 29, 2001 12:49 pm

The code for the <sandr> expression for this would be something like:

<fetch $theurl>
<sandr ">>\<form=!\</form\>+\</form\>" "" $ret>
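For readers without Vortex at hand, the intent of that expression (drop everything from an opening form tag through its closing tag before the text is indexed) can be sketched in Python. This is only an illustration of the idea, not the dowalk code: the helper name `strip_element` and the sample page are made up for the example, and a regex like this does not handle nested elements of the same tag.

```python
import re

def strip_element(html: str, tag: str) -> str:
    """Remove every <tag ...>...</tag> block from html (hypothetical helper).

    DOTALL lets "." span newlines; IGNORECASE also matches <FORM>, <Select>, etc.
    The non-greedy ".*?" stops at the first closing tag, so nested elements
    of the same tag are not handled.
    """
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}\b.*?</{name}\s*>", re.IGNORECASE | re.DOTALL)
    return pattern.sub("", html)

# Made-up sample page: the select box lists every product name.
page = """<h1>Widget A</h1>
<form action="/show"><select name="p">
<option>Widget A</option><option>Widget B</option>
</select></form>
<p>Widget A is our entry-level model.</p>"""

# Strip the whole form so "Widget B" no longer appears in the text to index.
cleaned = strip_element(page, "form")
print(cleaned)
```

Passing "select" instead of "form" would remove only the selection lists and leave the rest of each form in place.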

Post by **t.wuersch** » Mon Apr 02, 2001 7:37 pm

Um... And where do I have to put this code? I think this goes into the webinator/bin/sandr file but am I right or not? And is it the sandr file which is completely responsible for controlling the indexing process?

Post by **mark** » Mon Apr 02, 2001 9:43 pm

Not at all. You need to modify dowalk_beta and use it to walk instead of gw. You will find <fetch> in dowalk_beta already. You will make the changes around that after reading the manual about how fetch and sandr work.

Post by **t.wuersch** » Tue Apr 03, 2001 11:38 am

Okay, I got things running so far, but now I've got the problem that dowalk indexes only the one page I point it at and doesn't follow the links on that page... What am I doing wrong now?

Post by **mark** » Tue Apr 03, 2001 12:26 pm

If you run dowalk the way that's described at the top of the script it should behave basically like gw. Perhaps you have sandr'd out too much and there are no urls to follow? Did it walk ok before you modified it? Do you see any urls in the refs table? Do they look like ones that should have been followed? What's the url you are giving to dowalk?

Post by **t.wuersch** » Tue Apr 03, 2001 12:42 pm

As far as I can see I run dowalk correctly, using the following commands. First I create a db in the Webinator directory using

> gw -create -dvisplaytest

Then I start the dowalk script using

> texis.exe top=http://www.visplay.com/hiddensitemap.htm dowalk/dispatch.txt

It seems to walk hiddensitemap.htm correctly because I get a correct search result when I perform a Webinator search. However, the only page that appears in the search result is hiddensitemap.htm. I haven't used sandr so far because I think I'll use the urlcp select or the urlcp del options to get around the selection box problem. I didn't look at the refs table because, honestly, I don't know how.

The file hiddensitemap.htm is a helper file for Webinator, because without it Webinator wasn't able to walk the page.

Post by **mark** » Tue Apr 03, 2001 2:07 pm

The stock dowalk_beta with the mods to use a standard gw database walks that site fine. Does the gw.log file in the database directory indicate any problems?

See what urls are in the database with
gw -st "select Url from html"
List references table with
gw -st "select * from refs"

Did the database exist already? Maybe you need to wipe it
gw -wipe