Incomplete Walk

Posted: Wed Jun 30, 2004 12:17 pm
by mjacobson
I have been testing the new version of Webinator 5 and I can't seem to get a complete walk of all eligible documents on the server that I am indexing.

My Webinator 4 index of this site has about 130,000 URLs in it, but version 5 will only index about 1,700 URLs. I am using the same settings on both walks and have left the new Webinator 5 options, such as "Default Refresh Time", at their defaults.

The server that I am indexing is running the Apache web server with "Fancy Indexing" on. The starting URL points to a directory that does not have a default index page, so Apache generates one. This starting directory lists only sub-directories, no other types of files. These directories follow a common naming standard: the four-digit year followed by the Julian day on which the directory was created, for example 2004136.
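For anyone checking which dated directories made it into the index, a name like 2004136 can be decoded into a calendar date. This is a minimal Python sketch, assuming every directory name is exactly a four-digit year followed by a three-digit day of year; the function name `parse_dir_name` is hypothetical and not part of Webinator or Vortex.

```python
from datetime import date, datetime


def parse_dir_name(name: str) -> date:
    """Decode a directory name such as '2004136' (YYYY + day of year).

    Assumption: names are always seven digits, as in the example above.
    """
    if len(name) != 7 or not name.isdigit():
        raise ValueError(f"not a YYYYDDD directory name: {name!r}")
    # %Y is the four-digit year, %j is the day of the year (001-366).
    return datetime.strptime(name, "%Y%j").date()


if __name__ == "__main__":
    print(parse_dir_name("2004136"))
```

Sorting the decoded dates makes it easier to see whether the missing directories are the newest ones, the oldest ones, or scattered.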

Each of these sub-directories contains the HTML files that were produced on that date. Once the content has been published it is very static, but hundreds of documents get published daily to this directory, so the starting page changes constantly.

When I go to the "List/Edit URLs" page, I can see from the "Body" portion of the page that the starting URL has been completely indexed. I then click the "Children" link and all of the links on this page are displayed, but only about three-fourths of them are in the index. The other quarter are not in the index, and no reason is given. I do have Verbosity set to 8.

I then click one of the child links and get a complete list of the links on that page, but none of them are in the index and, again, no reason is given.

I downloaded the newest version of the scripts today and get the same results. If I start a refresh crawl, I get a few more documents (fewer than 1,000) each time. I have reviewed the error table and the Vortex logs and can see no indication of why this is happening. I am running the Webinator test on my backup server. The only difference between these two boxes is the version of Vortex installed. The backup box is running Commercial Version 5.00.1086121238 20040601 (i686-unknown-linux2.4.9-64-32).

Sorry for the long post and thanks for your help in this matter.

Posted: Wed Jun 30, 2004 12:24 pm
by John
You may want to change the "Maximum Process Size" limit to a larger value or to unlimited. A refresh crawl will pick up where the previous crawl left off.

Also, scheduling a refresh crawl for every minute is a useful option with Version 5, although you may want to adjust the Default/Min/Max refresh times.

Posted: Wed Jun 30, 2004 1:14 pm
by mjacobson
Is there any guidance on setting the Default/Min/Max refresh times? I am also interested in documentation on the other new controls.

Posted: Wed Jun 30, 2004 2:02 pm
by John
The default refresh time is used the first time a page is fetched, so it should be set close to what you expect the refresh period to be. The closer it is, the faster the system will stabilize at the right value. If bandwidth is a concern, set it higher; if freshness matters more, leave it lower. The refresh time then adjusts from there for each page, but always stays within the min and max.
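To illustrate the idea of a per-page refresh time that starts at the default and then adapts within the min and max, here is a rough Python sketch. The halve-on-change, double-on-no-change rule and the name `next_refresh` are assumptions made for illustration only; the thread does not describe how Webinator actually adjusts the value.

```python
def next_refresh(current: float, changed: bool,
                 min_s: float, max_s: float,
                 factor: float = 2.0) -> float:
    """Return the next refresh interval in seconds.

    Illustrative rule (an assumption, not Webinator's algorithm):
    shorten the interval when the page changed since the last fetch,
    lengthen it when it did not, and clamp to [min_s, max_s].
    """
    proposed = current / factor if changed else current * factor
    return max(min_s, min(max_s, proposed))


if __name__ == "__main__":
    interval = 3600.0  # hypothetical default refresh time: one hour
    # A static page: the interval grows until it reaches the max.
    for _ in range(6):
        interval = next_refresh(interval, changed=False,
                                min_s=600.0, max_s=86400.0)
    print(interval)
```

Under a rule like this, a default far from a page's true change rate just means more fetches before the interval settles, which is why a default near the expected period stabilizes faster.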