Deciding needs from limited run

Posted: Fri Nov 09, 2001 8:41 am
by b.sims
Can anyone suggest a way to do this:

I gave the free version of Webinator a list of 3,000 URLs to crawl. It stopped after 10,000 pages, as designed. Can anyone suggest a way to get a rough estimate from this run of how many pages would be indexed if the 10,000-page limit were not in place?

I need to know in order to decide which commercial version we need and how much bandwidth we should plan for the crawls.

Thanks in advance,

Deciding needs from limited run

Posted: Fri Nov 09, 2001 10:03 am
by Kai
It's impossible to know without knowledge of the sites behind those URLs, i.e. how many pages are linked from them. If the URLs are all distinct sites, the total might be several million URLs to crawl.

Deciding needs from limited run

Posted: Mon Nov 12, 2001 10:24 am
by b.sims
Thanks,

I realised that I can look at the todo list, which contains the pages found but not yet indexed. If I run gw again, will it index those pages as well?
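
For the record, here is one way I could check the size of that backlog (just a sketch: the database path is made up, and it assumes Texis's tsql utility is on hand and that the todo list is a table named todo):

    tsql -d /usr/local/webinator/db "select count(*) from todo"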

Deciding needs from limited run

Posted: Mon Nov 12, 2001 10:48 am
by Kai
Yes, if you're still under the applicable page limit. But once it walks those pages, more links will be found and added, so the size of the todo list is not a good estimate of the final size of the walk. Again, it depends on the size of the site(s) involved.

Deciding needs from limited run

Posted: Mon Nov 12, 2001 10:48 am
by mark
Yes, when gw starts it picks up whatever is in the todo list, but that is only a partial indication: it could still find new URLs on those pages too.

Deciding needs from limited run

Posted: Mon Nov 12, 2001 11:11 am
by b.sims
So it fills the db with 10,000 pages and then will not crawl any further using the same db? Could I run it against a different db, for example if I divided my URLs by category and had a separate search feature for each category?
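
Something like this is what I have in mind (a rough sketch; it assumes gw selects its database with a -d option, and the paths and sites are hypothetical):

    gw -d /usr/local/webinator/db_news   http://news.example.com/
    gw -d /usr/local/webinator/db_sports http://sports.example.com/

Each database would then get its own search front end.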

Deciding needs from limited run

Posted: Mon Nov 12, 2001 12:23 pm
by John
If you are just estimating capacity, you could delete from the html table and run again, probably with noindex to prevent index creation. You could also pick a sample and extrapolate: for example, crawl 30 URLs and see how many pages are fetched.
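
To make that concrete, a rough sketch of both approaches (the paths are made up, and the -d and -noindex options are assumptions based on the discussion above):

    # Reset the fetched pages and re-walk without building the index:
    tsql -d /usr/local/webinator/db "delete from html"
    gw -noindex -d /usr/local/webinator/db

    # Or sample and scale: walk 30 of the 3,000 URLs into a scratch
    # database, count the pages fetched, and multiply by 3000/30 = 100:
    gw -noindex -d /tmp/sampledb http://site1.example.com/ http://site2.example.com/
    tsql -d /tmp/sampledb "select count(*) from html"

Since site sizes vary wildly, pick the sample at random and leave a generous margin around the scaled figure.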

You may also want to check the Webinator license, as in many cases 3,000 URLs would be beyond the acceptable use clause.