new files not indexed

Post by **richard.kunst** » Mon Jan 17, 2005 12:45 pm

Hi,
I installed Webinator 5.0.2 several months ago, and installation and configuration went well. It indexes a small academic website (http://mercury.soas.ac.uk/wadict, 40 or so pages) and a small corpus of texts in the SE Asian Wa language (http://wadict.soas.ac.uk/, 100 files or so). Settings are pretty much out of the box, and almost all files are in UTF-8. Searching generally works well, but there are two problems, the first with indexing and the second with searching:
(1) When a new file is added to a watch folder, it is not automatically indexed (daily 2am rewalk schedule). By fiddling with the settings in various ways I have somehow managed to force the indexing of files manually, but I am not sure how. I have appended various log file messages.

(2) Files with the extension *.xml are indexed, as requested, but the search never finds content in any of them.

From vortex.log (after adding new file to corpus):
178 2005-01-14 02:00:03 C:\Program Files\Thunderstone Software\Webinator\texis\scripts\webinator\dowalk

Trying to insert duplicate value (http://wadict.soas.ac.uk/wadict/corpus/) in index C:\Program Files\Thunderstone Software\Webinator\texis\default\db1\xtodourl.btr
115 2005-01-14 02:00:03 C:\Program Files\Thunderstone Software\Webinator\texis\scripts\webinator\dowalk:69: Field NextCheck non-existent
000 2005-01-14 02:00:03 C:\Program Files\Thunderstone Software\Webinator\texis\scripts\webinator\dowalk:69: SQLExecute() failed with -1 in the function execntexis

(vortex.log contains thousands of messages like the last two above.)
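
With thousands of repeated entries, it can help to tally the log by message before reading it line by line. The following is only a minimal Python sketch, not a Webinator tool: it assumes every entry starts with a three-digit code, a date and a time, as in the excerpts above, and the names `ENTRY` and `tally_messages` are made up for illustration. The path to vortex.log is passed on the command line because its location will differ per install.

```python
import re
import sys
from collections import Counter

# Assumed entry shape, inferred only from the excerpts above:
# "<3-digit code> <YYYY-MM-DD> <HH:MM:SS> <script path>[:<line>: <message>]"
ENTRY = re.compile(r"^(\d{3}) (\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$")


def tally_messages(lines):
    """Count log entries by (code, message text)."""
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line.strip())
        if not match:
            continue  # blank or continuation line; ignored in this sketch
        code, _date, _time, rest = match.groups()
        # Drop the script path and line number when present ("...dowalk:69: msg").
        message = re.sub(r"^.*?:\d+: ", "", rest)
        counts[(code, message)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python tally_log.py <path to vortex.log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for (code, message), n in tally_messages(handle).most_common(10):
            print(f"{n:6d}  {code}  {message}")
```

Entries that carry no line number, such as the 178 entry above, are counted under their script path, and the message that follows on a separate line is skipped.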

Contents of db1.long after manual Update and Go on corpus:

<pre>
Walk started at 2005-01-17 16:14:51 (by resume)
JavaScript walking not enabled by current license
HTTPS walking disabled
Start fetching at http://wadict.soas.ac.uk/wadict/corpus/
Ignore urls containing any of the following:
/cgi-bin/
~
?

started 1 refresh (2204) on http://wadict.soas.ac.uk/wadict/corpus/
109 errors
11 duplicate pages

Updating search index ...Done.
Creating spell-checker dictionaries...Done.
Verifying usability of new walk.

Walk finished at 2005-01-17 16:14:56 (took 2 seconds)
Keeping database live: C:\Program Files\Thunderstone Software\Webinator/texis/default/db2
</pre>
<pre>
<hr width="50%">
Checking for broken hyperlinks...
No broken hyperlinks found. Nice Job!
Checking for duplicate pages...
No duplicate pages found.
<hr width="50%">
End of report.
</pre>

Thanks for any help!
Richard

Post by **mark** » Mon Jan 17, 2005 2:44 pm

I'd start by getting the latest scripts, 5.1.6, from the support page and doing a complete "new" walk. Then switch it to "refresh" mode for the scheduled walks.