nice the rewalk process?

gweinstock
Posts: 15
Joined: Tue Jul 05, 2005 11:05 am

Post by gweinstock »

Hello,
Our Texis search appliance server is currently heavily loaded because we have a large number of profiles, each with 2,000 to 80,000 pages, being rewalked daily (which is a requirement).
Is there a setting, such as crawl delay or parallelism, that we can change to make the search functionality more responsive, even if the walk processes run at lower priority? What values would typically be appropriate for the different configuration settings to ensure that the search appliance does not time out during actual searches?
Finally, can we Unix 'nice' the rewalk process so that every time a walk is started, it starts at a lower OS priority than the search process? If so, how is this accomplished, and can it be done through the webmin interface?
Thank you,
Gabriel
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Setting threads and servers to 1 will reduce crawl load, as would a page delay greater than 0. Also keep max process size at small or medium so the walks don't use up the memory. Setting maximum load to something other than unlimited (-1) will prevent the walks from swamping the system. In that case you'd probably want a frequent schedule so that the walks resume as the load decreases.

Changing the Unix-level nice values for processes using the database is not a good idea. It can cause lock contention and actually slow things down.
gweinstock
Posts: 15
Joined: Tue Jul 05, 2005 11:05 am

Post by gweinstock »

It sounds like setting the maximum load alone might take care of the problem. Is there a heuristic for what that value should be, though? Currently it is unlimited. What, for instance, is a typical load for the Texis server? We could set the maximum load to a value around that. I understand the maximum load average value can be a percentage of a fully utilized CPU (so .5 for 50% utilization); is that correct?
Thanks,
g.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Basically correct. But you probably don't want any load limit less than 1, or you're just wasting CPU power or causing thrashing as walks stop and restart too frequently. A limit of 1.5 or 2 might be good. Higher loads may also be acceptable depending on the situation. You can check the load at any time from the system info page in the maintenance area.
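For anyone who wants to watch the same number from a shell on a Unix host, here is a minimal Python sketch. It assumes shell access, which an appliance may not offer, and it is not how the appliance itself enforces the maximum load setting. The names `MAX_LOAD` and `walks_may_run` are made up for illustration, and comparing with "less than" is an assumption.

```python
import os

# Hypothetical threshold, mirroring the 1.5 to 2 suggestion above.
MAX_LOAD = 2.0


def walks_may_run(load_1min: float, max_load: float = MAX_LOAD) -> bool:
    """Return True when the 1-minute load average is below the limit.

    A negative limit is treated as unlimited, like the -1 setting.
    Using "<" rather than "<=" is an assumption of this sketch.
    """
    if max_load < 0:
        return True
    return load_1min < max_load


if __name__ == "__main__":
    # os.getloadavg() is available on Unix only and returns the
    # 1, 5 and 15 minute load averages.
    one, five, fifteen = os.getloadavg()
    print(f"load averages: {one:.2f} {five:.2f} {fifteen:.2f}")
    print("walks may run" if walks_may_run(one) else "walks should pause")
```

With a low limit the answer flips between "run" and "pause" often, which is the stop-and-restart thrashing described above.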
dietric
Posts: 100
Joined: Fri May 20, 2005 10:57 am

Post by dietric »

We changed the maximum load to four, but that still seems far too low to allow acceptable reindexing times: most of the walks are paused most of the time. It seems that three or four running indexing processes already produce a load of 4.
Considering the search appliance hardware, what would be a sane upper value for the maximum load that still allows the search appliance to actually perform searches?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I'd keep it under 10. Exactly where depends somewhat on usage levels and user tolerance.

If you have that many profiles running that much you might consider a 2nd appliance to spread the profiles out or to have one for crawling and the other for searching (use replication to get the crawl data to the searcher).
dietric
Posts: 100
Joined: Fri May 20, 2005 10:57 am

Post by dietric »

I tried setting it to eight, and search seems responsive enough while indexing runs at a decent speed. I will see whether we need to get a second appliance...
For profiles that index a lot of different base URLs (and therefore take a long time to index), would it be advisable to create a separate profile for each site and use replication to push them over to a receiver profile? It seems that this would at least make results available earlier, without having to wait for all URLs to be indexed, but is there a big performance overhead?
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

If all the profiles are on the same machine, then using replication only adds to the work needed. What may be helpful is disabling the "Follow Cross-site Links" setting, as that avoids the need to check all URLs found against all the base URLs. It does assume, though, that every URL you want indexed can be found from its own base URL.
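To picture the difference, here is a rough Python sketch. It assumes, purely for illustration, that a URL is in scope when it starts with one of the base URLs; the walker's real matching rules are not described in this thread, and `in_scope` and the example URLs are hypothetical.

```python
from typing import Sequence


def in_scope(url: str, base_urls: Sequence[str]) -> bool:
    """True if url starts with any of the given base URLs (simplified prefix match)."""
    return any(url.startswith(base) for base in base_urls)


# Hypothetical base URLs for one profile.
bases = ["http://a.example.com/", "http://b.example.com/docs/"]

# A link to site B discovered while walking site A.
found = "http://b.example.com/docs/page.html"

# Cross-site links followed: each found URL is compared against every base URL.
followed = in_scope(found, bases)

# Cross-site links not followed: only the base URL the walk started from is checked,
# so this page must be reachable from site B's own base URL to get indexed.
not_followed = in_scope(found, [bases[0]])

print(followed, not_followed)
```

The first check grows with the number of base URLs in the profile, which is the extra work the setting avoids.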
John Turnbull
Thunderstone Software
dietric
Posts: 100
Joined: Fri May 20, 2005 10:57 am

Post by dietric »

What if there are sites that need to be walked in several profiles? Is it recommended to create a separate profile and replicate it to all receivers so the site does not have to be indexed multiple times?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

You might consider a meta search in that case. Set up the granular profiles for walking the sites. Then create meta profile(s) to search different combinations of the sites.
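As a way to picture that layout, the Python sketch below models granular and meta profiles as plain data. All profile names are made up, and this is not the appliance's configuration format; it only shows that each site is walked once while appearing in several searches.

```python
# Hypothetical granular profiles: each one walks a single site, once.
granular = {"site_a", "site_b", "site_c"}

# Hypothetical meta profiles: each searches a combination of granular profiles.
meta = {
    "intranet": ["site_a", "site_b"],
    "public": ["site_b", "site_c"],
}


def metas_including(profile: str) -> list:
    """Return the names of the meta profiles that search the given granular profile."""
    return sorted(name for name, members in meta.items() if profile in members)


# Every meta profile should refer only to granular profiles that exist.
unknown = {m for members in meta.values() for m in members} - granular

print(metas_including("site_b"))  # ['intranet', 'public']
```

Here `site_b` is walked by one profile yet shows up in both meta searches, so nothing has to be indexed or replicated twice.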