I've got a bit of a problem here: I'd like to tell gw that it should index only the text parts of a document and exclude unnecessary stuff (like selection boxes). Is this possible?
This exclusion would be necessary for my web site because if I perform a Webinator search on my site the results are not very useful. I'll explain this: I've got, say, 10 pages with product descriptions for 10 different products. On all of these pages there's a selection box where you can choose a different product for which you want to see the product description. Now here's the problem: When I search for a product using its name the search result contains all 10 pages - but it should contain only one page with the correct product. The reason is quite obvious: gw indexes the product names in the selection box as well as the rest of the document. Now my simple question: Is there a way to stop gw from doing so? Many Thanks
Another possibility would be to modify the scripted walker to remove all forms, select lists, or whatever else you want excluded, using <sandr>, and then reprocess the resulting HTML through <fetch>. See the Vortex manual for details on how to use those functions.
Um... And where do I have to put this code? I think this goes into the webinator/bin/sandr file but am I right or not? And is it the sandr file which is completely responsible for controlling the indexing process?
Not at all. You need to modify dowalk_beta and use it to walk instead of gw. You will find <fetch> in dowalk_beta already. You will make the changes around that after reading the manual about how fetch and sandr work.
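To illustrate the idea only: the sketch below is plain Python, not Vortex, and it is not what dowalk_beta actually contains. It just shows the kind of search-and-replace meant here, stripping every select box out of a page before the text is indexed. The function name, the pattern, and the sample page are all made up for the example; the real change would use <sandr> as described in the manual.

```python
import re

# Hypothetical pattern: one whole <select>...</select> block, including
# its <option> entries. Non-greedy so that each box is matched separately.
SELECT_BLOCK = re.compile(
    r"<select\b[^>]*>.*?</select\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_select_boxes(html: str) -> str:
    """Return the page with every <select> block removed."""
    return SELECT_BLOCK.sub("", html)


# Made-up sample page: a product description plus a product chooser.
page = (
    "<html><body><h1>Widget A</h1>"
    '<form><select name="product">'
    "<option>Widget A</option><option>Widget B</option>"
    "</select></form>"
    "<p>Widget A is our smallest model.</p>"
    '<a href="widget-b.htm">next</a>'
    "</body></html>"
)

cleaned = strip_select_boxes(page)
print(cleaned)
```

Note that only the select box is removed. Ordinary links stay in the page, so the walker still has URLs to follow.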
Okay, I got things running so far, but now I've got the problem that dowalk indexes only the one page I tell it to but it doesn't follow the links on that page... What am I doing wrong now?
If you run dowalk the way that's described at the top of the script it should behave basically like gw. Perhaps you have sandr'd out too much and there are no urls to follow? Did it walk ok before you modified it? Do you see any urls in the refs table? Do they look like ones that should have been followed? What's the url you are giving to dowalk?
It seems to walk hiddensitemap.htm correctly because I get a correct search result when I perform a Webinator search. However, the only page that appears in the search result is hiddensitemap.htm. I haven't used sandr so far because I think I'll use the urlcp select or the urlcp del options to get around the selection box problem. I didn't look at the refs table because honestly I don't know how.
The file hiddensitemap.htm is a helper file for Webinator, because without it, it wasn't able to walk the page.
The stock dowalk_beta with the mods to use a standard gw database walks that site fine. Does the gw.log file in the database directory indicate any problems?
See what URLs are in the database with:
gw -st "select Url from html"
List the refs table with:
gw -st "select * from refs"
Did the database exist already? Maybe you need to wipe it:
gw -wipe