Skipping timed out URLs

rhuber0
Posts: 17
Joined: Tue Dec 06, 2005 9:29 am

Skipping timed out URLs

Post by rhuber0 »

I figured it out. The issue appears to be that our license only allows 100,000 URLs per profile. Even though I delete all URLs from the html table before I process the page file again, it won't let me index any more URLs. Why is that? And is there some way I can delete the URLs from the index to allow more URLs to be read? After all, there will never be more than 100,000 items in the index.
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Skipping timed out URLs

Post by John »

If you are starting fresh, then doing a "New" walk should work and will start from 0. Does the Walk Status page show 0 pages in the index after the deletes? If not, try the delete again to make sure it completed.
John Turnbull
Thunderstone Software
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Skipping timed out URLs

Post by mark »

Deleting everything and starting again leaves quite a window during which searches are useless. You should just do a mode "new" walk: it walks into a new database, leaving the live search database available for searching. When the new walk is finished, it flips the new database to live and deletes the old one, so there are no search outages.
rhuber0
Posts: 17
Joined: Tue Dec 06, 2005 9:29 am

Skipping timed out URLs

Post by rhuber0 »

A new walk is not an option for us. I guess I should explain my whole situation.

My project is to replace the search functionality in our Siebel system with a Thunderstone search. To do that, I pull all the Siebel data (service requests from our customers) and place it into ASP pages for Thunderstone to index. There is too much data (and it would put too much load on the database server) to reindex all the content every night, so my solution was to place the URLs of the service requests that had been updated into a page file for processing. This would, in essence, index just the changes. For example, say 300 service requests were updated in Siebel during one business day. My database process would create the page file with the 300 ASP URLs for those changed items. I would then delete those 300 URLs from the Texis database using tsql.exe (if they had already been indexed) and run the indexer to insert the new content from the 300 ASP pages.
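To illustrate, the nightly delete amounts to a run of statements like this, one per changed URL in the page file (the database path and URL here are made-up examples; ours differ):

  tsql.exe -d C:\morph3\texis\profile "delete from html where Url = 'http://ourserver/siebel/sr12345.asp'"

After the deletes complete, the indexer runs over the same page file to pick up the fresh content.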

The problem is that I can't index any more than 100,000 URLs, even though I delete them from the index prior to reindexing them when they're updated. So, even though there will never be more than 100,000 items in the index at one time, once I've hit the 100,000 URL mark, I can't index any more content. And running a "Refresh Type = New" isn't an option because I would lose all the content that had been indexed and have to reindex every Siebel item again.

Is there any other way to delete URLs so that Thunderstone understands that I don't have more than 100,000 items in the index and will let me continue to process URLs?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Skipping timed out URLs

Post by mark »

It sounds like you have full Texis, since you have tsql. With Texis it would probably be far simpler to load the data directly into a Texis database rather than making web pages to crawl.
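A rough sketch of what a direct load could look like (the database path, table, and column names are invented for the example, not anything Webinator expects):

  tsql.exe -d C:\morph3\texis\srdb "create table sr(SrId varchar(16), Title varchar(128), Body varchar(2048))"
  tsql.exe -d C:\morph3\texis\srdb "insert into sr values('12345', 'Printer jam', 'full service request text ...')"
  tsql.exe -d C:\morph3\texis\srdb "create metamorph index sr_body on sr(Body)"

Your nightly job would then delete and re-insert only the changed rows instead of generating ASP pages for the crawler to fetch.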

But as far as the crawl is concerned, where do you see it telling you that you're at the limit? Perhaps it's the counts table that Webinator maintains as a shortcut, to avoid counting the database all the time. When you delete URLs from the html table, you need to update the counts table to reflect the change.
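To see what it's tracking, compare the two (the path is just an example):

  tsql.exe -d C:\morph3\texis\profile "select * from counts"
  tsql.exe -d C:\morph3\texis\profile "select count(*) from html"

If the counts table reports more pages than html actually contains after your deletes, that mismatch would explain the walker thinking it's at the limit.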
rhuber0
Posts: 17
Joined: Tue Dec 06, 2005 9:29 am

Skipping timed out URLs

Post by rhuber0 »

I never received a message saying the profile is at the limit, but I'm 99% sure that is the issue. I processed several page files that had just under 87,000 URLs in them combined. When I processed the final file (which had over 20,000 URLs in it), the indexer processed just over 12,000 of them and stopped without error. That put the total just under 100,000, so the URL limit seems like the most reasonable explanation to me.

So, as long as I update the counts table in the Texis database after I do the delete, I should be able to keep indexing?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Skipping timed out URLs

Post by mark »

The walk status (scroll down to see it all) should indicate if the walk stopped due to the page limit.

Updating the counts table should help Webinator stay in sync with the actual table size.
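After each batch of deletes, an update along these lines should keep it in sync (the column and row names here are guesses; look at the counts table first to get the real names):

  rem 'cnt' and 'name' are hypothetical column names - verify against your actual counts schema
  tsql.exe -d C:\morph3\texis\profile "update counts set cnt = cnt - 300 where name = 'html'"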