Deciding needs from limited run

b.sims
Posts: 99
Joined: Fri Oct 26, 2001 10:40 am

Post by b.sims »

Can anyone suggest a way to do this:

I gave the free version of Webinator a list of 3,000 URLs to crawl. It stopped after 10,000 pages, as designed. Can anyone suggest how I could get a rough estimate from this run of how many pages would be indexed if the 10,000-page limit were not in place?

I need to know in order to decide which commercial version we need and how much bandwidth we should plan for the crawls.

Thanks in advance,
Kai
Site Admin
Posts: 1271
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

It's impossible to know without knowledge of the sites listed in those URLs, i.e., how many pages are linked from them. If these URLs are all distinct sites, the total might be several million URLs to crawl.
b.sims
Posts: 99
Joined: Fri Oct 26, 2001 10:40 am

Post by b.sims »

Thanks,

I realised that I can look at the todo list, which contains the list of pages not indexed so far. If I run gw again, does it index these pages also?
Kai
Site Admin
Posts: 1271
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Yes, if you're still under the applicable page limit. But once it walks those pages, other links will be found and added, so the size of the todo list is not a good estimate of the final size of the walk. Again, it depends on the size of the site(s) involved.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Yes, when gw starts it picks up whatever is in the todo list. But that is only a rough indication. It could still find new URLs on those pages too.
b.sims
Posts: 99
Joined: Fri Oct 26, 2001 10:40 am

Post by b.sims »

So it fills the db with 10,000 pages and then will not run again using the same db? Can I run using a different db, for example if I divided my URLs by category and just had a search feature for each category?
John
Site Admin
Posts: 2597
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

If you are just estimating capacity, you could delete from the html table and run again, probably with noindex to prevent index creation. You could also pick a sample and estimate from it: for example, crawl 30 URLs and see how many pages are fetched.
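
The sampling idea can be sketched roughly as follows. This is not part of Webinator; it is a small stand-alone Python illustration, and the function name `estimate_total_pages` and the sample numbers are hypothetical. It assumes the sampled start URLs were picked at random from the full list and that you can count the pages fetched for each one separately.

```python
import math
import statistics

def estimate_total_pages(sample_page_counts, total_start_urls):
    """Extrapolate a crawl size from per-URL page counts in a sample."""
    n = len(sample_page_counts)
    if n < 2:
        raise ValueError("need at least two sampled start URLs")
    mean = statistics.mean(sample_page_counts)
    # Standard error of the mean. Page counts per site are usually very
    # skewed, so treat the resulting range as a loose guide only.
    stderr = statistics.stdev(sample_page_counts) / math.sqrt(n)
    estimate = mean * total_start_urls
    low = max(0.0, (mean - 2 * stderr) * total_start_urls)
    high = (mean + 2 * stderr) * total_start_urls
    return estimate, low, high

# Hypothetical numbers: pages fetched for each of 5 sampled start URLs.
sample = [120, 40, 900, 15, 300]
estimate, low, high = estimate_total_pages(sample, 3000)
print(f"estimated pages: {estimate:.0f} (rough range {low:.0f} to {high:.0f})")
```

If only the combined page count for the sample is available, the same idea reduces to pages fetched, divided by URLs sampled, times 3000. A few very large sites can dominate the total, so a larger sample gives a more trustworthy figure.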

You may also want to check the Webinator license, as in many cases 3000 URLs would be beyond the acceptable use clause.
John Turnbull
Thunderstone Software