I have been testing the new Webinator 5 and I can't seem to get a complete walk of all eligible documents on the server that I am indexing.
My Webinator 4 index of this site has about 130,000 URLs in it, but version 5 will only index about 1,700 URLs. I am using the same settings on both walks and have left the new Webinator 5 options, like "Default Refresh Time", set to their defaults.
The server that I am indexing is running the Apache web server with "Fancy Indexing" on. The starting URL points to a directory that does not have a default index page, so Apache generates one. This starting directory only has subdirectories listed in it, no other types of files. These directories follow a common naming standard: the 4-digit year followed by the Julian date (day of the year) on which the directory was created, e.g. 2004136.
Each of these subdirectories contains the HTML files that were produced on that date. Once the content has been published, it is very static, but hundreds of documents get published daily to this directory, so the starting page changes constantly.
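To make the naming scheme concrete, here is a rough Python sketch of how a directory name maps to a calendar date. It is only an illustration and has nothing to do with Webinator or Vortex; the function name `parse_dir_name` is made up for this example, and it assumes "Julian date" means the day of the year (001 to 366).

```python
from datetime import date, datetime


def parse_dir_name(name: str) -> date:
    """Convert a YYYYDDD directory name (4-digit year + day of year) to a date.

    Hypothetical helper for illustration only; assumes the last three
    digits are the day of the year.
    """
    # %Y consumes the 4-digit year, %j the day of the year.
    return datetime.strptime(name, "%Y%j").date()


if __name__ == "__main__":
    # The example directory from this post.
    print(parse_dir_name("2004136"))  # prints 2004-05-15
```

So the directory 2004136 holds the documents produced on May 15, 2004.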
When I go to the "List/Edit URLs" page, I can see from the "Body" portion of the page that the starting URL has been completely indexed. I then click the "Children" link and all of the links on this page are displayed, but only about three-fourths of them are in the index. The other quarter are not in the index and no reason is given. I do have the Verbosity set to 8.
I then click one of the children links and I get a complete list of the links on that page, but none of them are in the index and, again, no reason is given.
I downloaded the newest version of the scripts today and get the same results. If I run a refresh crawl, I get a few more documents (fewer than 1,000) each time. I have reviewed the error table and the Vortex logs and can see no indication of why this is happening. I am running the Webinator test on my backup server. The only difference between these two boxes is the version of Vortex installed. The backup box is running Commercial Version 5.00.1086121238 20040601 (i686-unknown-linux2.4.9-64-32).
Sorry for the long post and thanks for your help in this matter.