Confusing results

Post by **r.j.michell** » Mon Sep 03, 2001 6:41 am

We run Webinator Version 2.55 Release: 20000120, and are getting strange results from it:

If I search for "cricket", some documents with the word somewhere in the body text appear, but documents in which "cricket" occurs with greater frequency are not picked up at all!

Is there something I can do to my vortex search script to better serve results?

Here is the complete gw command sequence (taken from a once-weekly crontab procedure):

# - Change dir to webinator:
cd /www/httpd/html/webinator
# - Instruct gw to rewalk globalDB
bin/gw -rewalk -dglobalDB
# - Change owner,group,mode of globalDB after rewalk:
chown nobody globalDB
chgrp nobody globalDB
chmod 775 globalDB
# - Change dir to globalDB:
cd /www/httpd/html/webinator/globalDB
# - Change mode of all tables in globalDB:
chmod 775 *.*

Any pointers would be most welcome.
Cheers.

Russ

Post by **John** » Mon Sep 03, 2001 8:04 am

The first thing to check is that those pages were actually indexed by looking at gw.log to see if they were retrieved successfully.

If they were, and you can find those pages by searching for other terms, look at the match info and make sure "cricket" actually occurs in the text that Webinator indexed.
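A minimal sketch of that log check, assuming gw.log lines have the form "date time Retrieving URL" (the format of the log lines quoted further down this thread). The function name `retrieved_under` is made up for illustration; it is not part of Webinator.

```shell
#!/bin/sh
# Hypothetical helper: list the URLs that gw.log reports as retrieved
# under a given URL prefix.
# Assumed line format: "2001/09/02 00:46:21 Retrieving http://host/path"
retrieved_under() {
    log_file=$1
    url_prefix=$2
    # Field 3 is the action and field 4 the URL; index() == 1 means the
    # URL starts with the prefix.
    awk -v prefix="$url_prefix" \
        '$3 == "Retrieving" && index($4, prefix) == 1 { print $4 }' "$log_file"
}
```

For example, `retrieved_under /path/to/webinator/DBname/gw.log http://www.apu.ac.uk/marketing/` would print each retrieved URL below that directory. An empty result suggests the pages were never fetched during the walk.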

Post by **r.j.michell** » Mon Sep 03, 2001 8:35 am

Thanks for your reply:

I checked: /path/to/webinator/DBname/gw.log for its last sweep (yesterday) and there is no mention of the directory in which the desired document is located!??

I checked permissions on this dir: it and its docs are all 775, with the same ownership as dirs that were indexed...

As you can see from my crontab extract, we are rewalking, and as I remember it, the initial index was set to walk about 10 domains under our top domain of: .anglia.ac.uk so in theory, all dirs (of which the target dir in this context is one) under this should be re-indexed...

Our robots.txt has no entry excluding access to this dir. Could I possibly be looking at the wrong gw.log file? Running locate *gw.log brought up a page or so of results.

The pages that were retrieved from the search term: 'cricket' did feature this term in the body-text according to 'match info' - just not the desired document!

Thanks again.
Russ
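When several gw.log files turn up, one way to narrow them down is to pick the most recently modified one under the Webinator directory. This is only a sketch: the function name `newest_gw_log` is made up, and it assumes the log from the latest walk is the one modified last.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified gw.log below a
# directory.
newest_gw_log() {
    search_root=$1
    # ls -t sorts newest first; head keeps only the first entry.
    find "$search_root" -name gw.log -exec ls -t {} + | head -n 1
}
```

For example, `newest_gw_log /www/httpd/html/webinator` would print the path of the log written by the latest walk, which can then be compared with the one being inspected.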

Post by **r.j.michell** » Mon Sep 03, 2001 8:42 am

Actually my mistake! - the desired dir *does* occur in the log as follows:

2001/09/02 00:46:21 Retrieving http://www.apu.ac.uk/marketing/

which makes it all the more confusing that some pages containing the search word are not displayed as a result. Does it matter that it ends with a '/', whereas many of the successfully indexed dirs don't? For example:

2001/09/02 00:46:21 Retrieving http://www.apu.ac.uk/clearing2001

Sorry about the mess up..
Russ

Post by **John** » Mon Sep 03, 2001 9:54 am

Were the files in the directory you wanted actually indexed? I noticed that some of the links had .shtml extensions. Were these included in the original walk with a -fshtml option?

Post by **r.j.michell** » Mon Sep 03, 2001 10:12 am

Ahaaaaar - that's it! We neglected the -fshtml....

Here is the original command given to gw:

/bin/gw \
-d/www/httpd/html/webinator/nameofdatabase \
-jhttp://www.urltoindex -dns=sys http://www.urltoindex

Do I now need to wipe the database and re-run the command above with -fshtml added, like this?

/bin/gw \
-d/www/httpd/html/webinator/nameofdatabase \
-fshtml -jhttp://www.urltoindex -dns=sys http://www.urltoindex

Many thanks for your help thus far!
Cheers

Russ

Post by **John** » Mon Sep 03, 2001 10:24 am

Yes, that's what needs to be done. You could use a new name for the database to avoid having no database while you create a new one.
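A rough sketch of walking into a new database name, using only the gw options already quoted in this thread. The "-new" suffix, the `walk_new_db` function, and the GW, DB_ROOT and DRY_RUN variables are assumptions made for illustration, not Webinator features.

```shell
#!/bin/sh
# Sketch: build the index under a new database name so the old database
# keeps serving searches while the walk runs.
GW=${GW:-/bin/gw}
DB_ROOT=${DB_ROOT:-/www/httpd/html/webinator}

walk_new_db() {
    db_name=$1      # e.g. nameofdatabase
    start_url=$2    # e.g. http://www.urltoindex
    new_db="$DB_ROOT/${db_name}-new"   # hypothetical naming scheme
    # With DRY_RUN set to a non-empty value, the command is printed
    # instead of executed.
    ${DRY_RUN:+echo} "$GW" "-d$new_db" -fshtml "-j$start_url" -dns=sys "$start_url"
}
```

Running `DRY_RUN=1 walk_new_db nameofdatabase http://www.urltoindex` shows the command without starting a walk. Once the new walk has finished, the search script would presumably need to be pointed at the new database name; that step is not covered in this thread.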

Post by **r.j.michell** » Mon Sep 03, 2001 10:32 am

Thanks very much John!
Mystery solved.

Cheers once again.

Russ