Deciding needs from limited run

Posted: Fri Nov 09, 2001 8:41 am
by b.sims
Can anyone suggest a way to do this:

I gave the free version of Webinator a list of 3,000 URLs to crawl. It stopped after 10,000 pages, as designed. Can anyone suggest a way to get a rough estimate from this run of how many pages would be indexed if the 10,000-page limit were not in place?

I need to know in order to decide which commercial version we need and how much bandwidth we should plan for the crawls.

Thanks in advance,

Deciding needs from limited run

Posted: Fri Nov 09, 2001 10:03 am
by Kai
It's impossible to know without knowledge of the sites behind those URLs, i.e. how many pages are linked from them. If the URLs are all distinct sites, the total might be several million URLs to crawl.

Deciding needs from limited run

Posted: Mon Nov 12, 2001 10:24 am
by b.sims
Thanks,

I realised that I can look at the todo list, which contains the pages found but not yet indexed. If I run gw again, will it index those pages as well?
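
For the record, here is one way I could check the size of that backlog (just a sketch: the database path is made up, and it assumes Texis's tsql utility is on hand and that the todo list is a table named todo):

    tsql -d /usr/local/webinator/db "select count(*) from todo"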

Deciding needs from limited run

Posted: Mon Nov 12, 2001 10:48 am
by Kai
Yes, if you're still under the applicable page limit. But once it walks those pages, more links will be found and added, so the size of the todo list is not a good estimate of the final size of the walk. Again, it depends on the size of the site(s) involved.

Deciding needs from limited run

Posted: Mon Nov 12, 2001 10:48 am
by mark
Yes, when gw starts it picks up whatever is in the todo list, but that is only a partial indication: it could still find new URLs on those pages too.

Deciding needs from limited run

Posted: Mon Nov 12, 2001 11:11 am
by b.sims
So it fills the db with 10,000 pages and then will not crawl any further using the same db? Could I run it against a different db, for example if I divided my URLs by category and had a separate search feature for each category?
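
Something like this is what I have in mind (a rough sketch; it assumes gw selects its database with a -d option, and the paths and sites are hypothetical):

    gw -d /usr/local/webinator/db_news   http://news.example.com/
    gw -d /usr/local/webinator/db_sports http://sports.example.com/

Each database would then get its own search front end.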

Deciding needs from limited run

Posted: Mon Nov 12, 2001 12:23 pm
by John
If you are just estimating capacity, you could delete from the html table and run again, probably with noindex to prevent index creation. You could also pick a sample and extrapolate: for example, crawl 30 URLs and see how many pages are fetched.
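
To make that concrete, a rough sketch of both approaches (the paths are made up, and the -d and -noindex options are assumptions based on the discussion above):

    # Reset the fetched pages and re-walk without building the index:
    tsql -d /usr/local/webinator/db "delete from html"
    gw -noindex -d /usr/local/webinator/db

    # Or sample and scale: walk 30 of the 3,000 URLs into a scratch
    # database, count the pages fetched, and multiply by 3000/30 = 100:
    gw -noindex -d /tmp/sampledb http://site1.example.com/ http://site2.example.com/
    tsql -d /tmp/sampledb "select count(*) from html"

Since site sizes vary wildly, pick the sample at random and leave a generous margin around the scaled figure.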

You may also want to check the Webinator license, as in many cases 3,000 URLs would be beyond the acceptable use clause.