Categories and Walk Speed

mcintirj
Posts: 4
Joined: Fri Jun 04, 2004 3:15 pm

Categories and Walk Speed

Post by mcintirj »

Does the addition of categories normally have a negative effect on the speed of walks? Currently, I have a profile that references 400 distinct URLs and indexes approximately 250k pages. A refresh walk of the profile had been taking 40 minutes. After adding categories, two performance issues have been discovered. First, the recategorization process takes 13 hours to complete. Is this time frame normal? Second, the refresh walk had to be terminated after not completing in 72 hours. Any suggestions?
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Categories and Walk Speed

Post by mark »

Sounds like a bit much. How many categories? What kind of patterns did you use?
mcintirj
Posts: 4
Joined: Fri Jun 04, 2004 3:15 pm

Categories and Walk Speed

Post by mcintirj »

Six categories, but only one has real patterns. The one large category consists of both single page listings and URL patterns with a single trailing * wild card. There are currently 216 URL patterns defined for the category. The other five have one URL specified with a trailing * wild card. The plan was to update these five placeholder categories to be similar to the larger category.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Categories and Walk Speed

Post by mark »

Yes, that's a lot of categories. Works out to around 55 million comparisons/updates. It's a little unusual to have that many categories. Can you tell a little about how you're intending to use them?
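
For readers following along, the 55 million figure appears to come from multiplying the page count by the total number of patterns. The following is only a sketch of that arithmetic, assuming the approximate counts given earlier in the thread (250k pages, 216 patterns in the large category, one pattern in each of the other five); the variable names are made up for illustration.

```python
# Rough estimate of pattern comparisons during recategorization.
# All counts are the approximate figures quoted in this thread.
pages = 250_000              # pages indexed by the profile
large_category_patterns = 216
placeholder_categories = 5   # each has a single trailing-* pattern

total_patterns = large_category_patterns + placeholder_categories
comparisons = pages * total_patterns

print(f"{total_patterns} patterns x {pages} pages = {comparisons:,} comparisons")
```

With these numbers the product is 55,250,000, which matches the "around 55 million" estimate.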
mcintirj
Posts: 4
Joined: Fri Jun 04, 2004 3:15 pm

Categories and Walk Speed

Post by mcintirj »

One of the features of our web site allows users to perform searches against a controlled universe of verified web sites. This collection of web sites is currently 1000 distinct URLs. Each of the URLs is marked to be included in one or more disciplines (e.g., fire services, law enforcement). We are trying to use the categories to allow a user to search against URLs that are defined to be relevant to a given discipline.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Categories and Walk Speed

Post by mark »

Would they ever search across multiple categories simultaneously? If not, they should be kept in separate profiles.
mcintirj
Posts: 4
Joined: Fri Jun 04, 2004 3:15 pm

Categories and Walk Speed

Post by mcintirj »

It is unlikely that multiple disciplines would be searched simultaneously. The use of multiple profiles was actually the approach that was originally utilized. Since a large number of the same URLs are used by multiple disciplines, the same pages were being indexed multiple times. Using the categories seemed like a good solution, as it would significantly reduce the total number of pages that were indexed and would remove redundancy in the walks. Is there a limit to the number of URL patterns that can be defined for a category? Also, is there a limit to the total number of categories?
John
Site Admin
Posts: 2597
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Categories and Walk Speed

Post by John »

There isn't a hard limit to the number of categories and patterns. However, each URL fetched needs to be checked against each pattern, which is where the performance slowdown comes in.

Some customization of the scripts could probably be of great benefit, either by assigning categories more efficiently, or by allowing fetched pages to be stored in multiple profiles.
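
As a rough illustration of what "assigning categories more efficiently" could mean, the sketch below groups patterns by host so that each fetched URL is compared only against patterns for its own host instead of against every pattern. This is not Webinator's actual script or API; it is a standalone Python sketch, and `build_index`, `categorize`, and the example URLs are hypothetical. It assumes every pattern is either an exact URL or a prefix ending in a single trailing `*` that falls after the host name, as described earlier in the thread, and it compares paths case-sensitively.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_index(categories):
    """Group (pattern, category) pairs by host name.

    `categories` maps a category name to a list of patterns. Each pattern
    is assumed to be an exact URL or a URL prefix with one trailing '*'.
    """
    index = defaultdict(list)
    for name, patterns in categories.items():
        for pattern in patterns:
            host = urlsplit(pattern.rstrip("*")).netloc.lower()
            index[host].append((pattern, name))
    return index


def categorize(url, index):
    """Return the set of categories whose patterns match `url`."""
    host = urlsplit(url).netloc.lower()
    matched = set()
    # Only patterns registered for this host are examined.
    for pattern, name in index.get(host, ()):
        if pattern.endswith("*"):
            if url.startswith(pattern[:-1]):
                matched.add(name)
        elif url == pattern:
            matched.add(name)
    return matched


# Hypothetical categories and URLs, for illustration only.
categories = {
    "fire": ["http://www.example.org/fire/*", "http://fd.example.com/index.html"],
    "law": ["http://www.example.org/*"],
}
index = build_index(categories)
print(categorize("http://www.example.org/fire/a.html", index))
```

With a few hundred patterns spread over many hosts, most URLs would be checked against only a handful of patterns. Whether the walk scripts can be customized this way is a separate question.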
John Turnbull
Thunderstone Software
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Categories and Walk Speed

Post by KMandalia »

I think I have a similar issue. I have 21 websites in the Base URL section and two categories. The first category has 20 websites and the second has 1 website. Both categories have standard URL patterns like http://www.mysite.com/* separated by spaces. The walk, however, fetches only about 12000 results and takes 13 hours.

I am using the paid version of Webinator 5.0.5 on Windows Server 2003. Net and DNS mode is internal, no query stripping, ignoring case, no robots.txt checking, all extensions, all meta, process size unlimited. Crawl delay is set to 2 and servers set to 7 (I have experimented with crawl delay set to 0 as well, no difference). No limit on page size, and page timeout is 180. No stay under.

What more can I do to get everything there is?

I don't understand what the problem (if there is one) is on my side.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Categories and Walk Speed

Post by mark »

Sounds like you're asking more about walk completeness than about speed. What's missing? Is that page's parent page in the database? If not, keep going up until you find a parent page that is. When you find the parent in the database with List/edit urls, click on its "Children" link. See if the links listed are the expected ones. See if there are errors listed next to the ones not in the database. If there is no error, they were excluded because of extension, robots, or other exclusion rules. Check your settings for what would be required to include such files. If you turn verbosity up to 4 and do a new walk, it will list the reason for every excluded URL.