Categories and Walk Speed

John
Site Admin
Posts: 2597
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

When the walk ends, you might run a Refresh walk; if any of the server processes stopped because they exceeded the Maximum Process Size, the refresh will pick up where they left off. You can check the todo table to see whether any URLs remain in it.

Also, you may want to reduce the number of Servers to see whether talking to that many servers at once is causing contention on the walk machine.
John Turnbull
Thunderstone Software
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

Thank you both very much for replying.

Mark:

On a brand-new, high-performance server, doesn't it sound weird for a paid Webinator to take almost 13 hours to walk 12000 pages from 21 websites? (I checked them individually, and just two of the websites account for 5000 pages.) Don't you think the issue is speed related as well as completeness related? 12000 pages seems wrong outright, even without checking parents, children, and all of that. I don't have any exclusions, and I tried 0 crawl delay and 2 servers to no avail. I turned verbosity up to 4 and got acceptable reasons (offsite, reject list for .js files, etc.), which are okay. I don't want to build separate profiles, because I have customized the profile so that it displays multiple categories (one hard-coded and others selected by the user). I could do it with separate profiles, but the result page layout would not function as we want. BTW, this would be a workaround that enterprise Webinator users should not need to think about.


John:

Unless I get good results the first time, how can the refresh walk help me? I tried with fewer servers as well. I will check the todo table.

John and Mark:

Please tell me precisely, without considering anything else: is there any limit on how many categories there can be, and on how many patterns each category can have? And most importantly, what are the implications of categories for the walk? Because in our case, the results we are getting don't justify our purchase of enterprise Webinator.


I will check this issue by running a walk with no categories specified to see if it changes anything. In the meantime, your help will be very much appreciated.

Thanks
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Categories will not affect the number of pages indexed in any way. A huge number of categories can slow the walk down somewhat. There is no limit on the number of categories or how many patterns. Each pattern will slow the crawl down slightly.
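
As a rough illustration of why that is, here is a toy model in Python. It is not Webinator's actual code: the category names, the glob-style patterns, and the functions are all made up for the example. It assumes categories are just labels attached to a page after matching its URL against each pattern, so a page that matches nothing is still indexed, and the matching work grows with the number of patterns.

```python
from fnmatch import fnmatchcase

# Hypothetical categories: name -> list of URL patterns.
# Glob syntax is an assumption for this sketch, not Webinator's pattern syntax.
CATEGORIES = {
    "regulators": ["*.gov/*"],
    "associations": ["*.org/*"],
}

def categorize(url, categories):
    """Return the names of every category with at least one pattern matching url."""
    return [
        name
        for name, patterns in categories.items()
        if any(fnmatchcase(url, pattern) for pattern in patterns)
    ]

def index_pages(urls, categories):
    """Label every page. A page with no matching category is still indexed."""
    return {url: categorize(url, categories) for url in urls}

pages = [
    "http://example.gov/rules.html",
    "http://example.org/about",
    "http://example.com/misc",
]
index = index_pages(pages, CATEGORIES)

for url, labels in index.items():
    print(url, labels)
```

All three pages end up in the index; only the labels differ. Each extra pattern adds one more match attempt per page, which is where the slight slowdown would come from in this model.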

Speed can be related to any number of factors, including connectivity to the other servers and their speed in responding. And a crawl delay of 2 will take a minimum of just under 6 hours for 12000 pages. Factor in a slow server or 3 and it could take the time you're seeing. Once the initial walk is done, subsequent refreshes will generally take much less time, as every page won't need updating.
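
The arithmetic behind that kind of estimate can be sketched in a few lines of Python. The function names are made up for the example, and the model is deliberately simple: it assumes pages are fetched one after another and that only the crawl delay costs time. Under that assumption a delay of 2 seconds works out to roughly 6.7 hours for 12000 pages, in the same range as the estimate above; how Webinator actually applies the delay across servers is not modeled here.

```python
def min_walk_hours(pages, crawl_delay_s):
    """Lower bound in hours, assuming serial fetches where only the delay costs time."""
    return pages * crawl_delay_s / 3600

def seconds_per_page(walk_hours, pages):
    """Average wall-clock seconds spent per page over a whole walk."""
    return walk_hours * 3600 / pages

print(round(min_walk_hours(12000, 2), 2))   # 6.67
print(round(min_walk_hours(12000, 1), 2))   # 3.33
print(round(seconds_per_page(13, 12000), 2))  # 3.9
```

Read the other way around, a 13-hour walk of 12000 pages averages about 3.9 seconds per page, so a delay plus a second or two of response time per page is enough to account for it.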

What size webinator do you have?
What hardware is it installed on?
How busy is that system without webinator running?
How responsive are all 21 of the servers being walked?
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

What size webinator do you have?
200,000 pages (the enterprise version)

What hardware is it installed on?
Windows Server 2003 Standard, 2.8GHz, 2.5GB RAM

How busy is that system without webinator running?
Brand new server bought just for Webinator.

How responsive are all 21 of the servers being walked?
All of them are .gov and .org financial websites, very popular with banks and credit unions. Has to be better than our system.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

"Has to be better than our system" may not be true, especially considering that everything between you and them factors in, including your DNS servers.

I assume this machine has a 100MB FULL DUPLEX connection to your LAN?
What kind of internet connectivity do you have?
If you're swamping your bandwidth at all you might do better with servers set lower than 7.
Run a browser on the machine where webinator is installed and try the various sites to get a feel for their responsiveness. Assuming the browser finds all of the sites snappy, you could try dns mode and net mode system instead of internal to see if that makes a difference.
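
If you would rather have numbers than a feel, a small script run on the same machine can time the DNS lookup and the page fetch separately. This is a minimal sketch using only the Python standard library; the helper names `timed` and `probe` are made up for the example, and it has nothing to do with Webinator's own fetcher.

```python
import socket
import time
import urllib.request
from urllib.parse import urlparse

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def probe(url, timeout=10):
    """Time the DNS lookup and the full page fetch for one URL."""
    host = urlparse(url).hostname
    # DNS lookup on its own, so a slow resolver shows up separately.
    _, dns_s = timed(socket.getaddrinfo, host, 80)

    def fetch():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return len(resp.read())

    # The fetch does its own lookup too, which the OS may already have cached.
    size, fetch_s = timed(fetch)
    return {"url": url, "dns_s": dns_s, "fetch_s": fetch_s, "bytes": size}
```

Calling probe() on the start page of each of the 21 sites and comparing dns_s with fetch_s should show whether the time is going to name resolution or to the remote servers.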
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

100MB Full Duplex: Yes

T1 Line

Set servers to 3 and crawl delay to 1, all other settings as mentioned earlier.

Kept Net and DNS to internal.

I removed the categories, and I am now reaching 35000 pages with Webinator still going!

Do you think categories have anything to do with the number of pages walked? (I am not that concerned about speed; once we get the page count we expect to see, we will just do refreshes.) I am pretty sure they shouldn't, but the results I am getting surprise me.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Categories will not affect the number of pages indexed in any way.

If the server(s) you're walking or your connectivity aren't 100% reliable they might not return some page(s). That page might be the only way to get to a big chunk of their site. That could randomly change the number of pages acquired. Future refreshes would probably pick those pages up.
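
A toy link graph shows how much one failed fetch can cost. The sketch below is a plain breadth-first walk in Python over a made-up site (the paths and the `walk` function are hypothetical, not how Webinator is implemented). When the one page that links to a section fails, nothing under that section is ever discovered.

```python
from collections import deque

# Hypothetical site: page -> pages it links to.
LINKS = {
    "/": ["/about", "/reports"],
    "/about": [],
    "/reports": ["/reports/2003", "/reports/2004"],
    "/reports/2003": ["/reports/2003/q1"],
    "/reports/2004": [],
    "/reports/2003/q1": [],
}

def walk(start, links, failed=frozenset()):
    """Breadth-first walk; pages in `failed` return nothing, so their links are never seen."""
    seen = {start}
    fetched = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in failed:
            continue  # fetch failed: no content, no outgoing links discovered
        fetched.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return fetched

print(len(walk("/", LINKS)))                       # 6
print(len(walk("/", LINKS, failed={"/reports"})))  # 2
```

One failure on "/reports" drops the count from 6 pages to 2. A later walk in which that page responds would reach the rest, which matches the point about refreshes picking those pages up.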
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

Thanks for your suggestions. I am going to put the restrictions back in one by one so that I will know precisely what the problem was. I will let you know my observations.