Refresh or New
Posted: Tue Jan 31, 2006 10:01 am
by pete.smith
Hello
I now have my entire intranet sliced up into profiles, and I am using metaprofiling to aggregate them. I would like them all to crawl the site automatically on a schedule to pick up new changes, updates, etc. I thought I wanted a nightly refresh walk, but I think I might have been wrong about that. What would your recommended strategy be for keeping current? If I do a constant refresh (every 15) it seems to slow down performance.
Pete
Posted: Tue Jan 31, 2006 10:21 am
by mark
It depends on how many pages are in the profiles and how dynamic they are. A refresh walk is generally more efficient than a new walk unless more than half of the pages are always changing. How much work the refresh does is also controlled by the refresh time settings under All Walk Settings.
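As a rough sketch of that rule of thumb, assuming you can estimate how many pages changed since the last walk: the function and parameter names below are hypothetical and are not settings of the search appliance itself.

```python
def choose_walk_type(total_pages: int, changed_pages: int) -> str:
    """Pick a walk type from the rule of thumb in this thread.

    Hypothetical helper: returns "new" only when more than half of the
    pages are changing, otherwise "refresh".
    """
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    changed_fraction = changed_pages / total_pages
    return "new" if changed_fraction > 0.5 else "refresh"


# Example: a 300K-page profile where roughly 20K pages change per day.
print(choose_walk_type(300_000, 20_000))
```

The half-way threshold is only the guideline stated above; where new content gets linked from also matters, so treat the result as a starting point.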
Posted: Tue Jan 31, 2006 10:23 am
by pete.smith
Thanks Mark,
So my big profile is 300K pages, and people add new content all the time. I just need it so that if someone adds something down in a tree, Thunderstone finds it. Maybe it is a new walk nightly? I can do the whole thing in 8 hours.
Posted: Tue Jan 31, 2006 10:47 am
by John
If there is a consistent place where new content gets added, or is linked from, then a refresh walk should work well. Otherwise, if content could be linked from a page that doesn't normally change, it could take longer to pick up the change, and the New walk would be better.
Posted: Tue Jan 31, 2006 10:53 am
by mark
If not many of the existing pages are changing, and finding new pages once a day (nightly) is sufficient, then try a refresh with a max refresh time of 12 hours or 1 day and a schedule of daily at the desired hour.
Posted: Tue Jan 31, 2006 10:55 am
by mark
Note that all scheduled walks are refresh walks regardless of the type setting in the profile. To do a new walk on a schedule, you'll have to turn off the profile schedule and use an external scheduler such as Unix cron or Windows Task Scheduler. See the manual under "using dowalk" for how to launch a walk.
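If the external scheduler calls a small wrapper script rather than the walk command directly, the wrapper could look something like the sketch below. It is only an illustration: `run_walk` is a hypothetical name, and the actual command line for launching a walk is deliberately left as a parameter, since it has to come from the manual section mentioned above.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_walk(walk_command: Sequence[str],
             timeout_seconds: Optional[float] = None) -> int:
    """Run an externally supplied walk command and return its exit code.

    walk_command is a placeholder: fill it in from the manual's
    "using dowalk" section. Nothing here knows the real arguments.
    """
    completed = subprocess.run(list(walk_command), timeout=timeout_seconds)
    # Report the outcome on stderr so cron or Task Scheduler can log it.
    print(f"walk command exited with code {completed.returncode}",
          file=sys.stderr)
    return completed.returncode
```

cron or Task Scheduler would then invoke a script that calls `run_walk` with the real command at the desired hour.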
Posted: Tue Jan 31, 2006 2:45 pm
by pete.smith
Thanks Mark, this is the behavior I don't get:
I have walk type "refresh" (I get that it does not matter for a scheduled walk) and a nightly schedule at 1 AM. I get "walk completed 11 minutes". There is no way it could do anything on that many pages in 11 minutes. If I hit "Go" it appears to do a "resume". What is the difference between "Go" and just letting the job refresh on the schedule?
Posted: Tue Jan 31, 2006 3:11 pm
by mark
The scheduled run is the same as hitting Go, except that the walk type will be ignored and refresh used on a scheduled walk. The walk status page should give you a more detailed idea of what happened than the one-line walk summary.
On a refresh walk, only pages scheduled to be refreshed will be checked (see the default/min/max refresh time settings). For the pages that do get checked, if the server supports If-Modified-Since, the page will not be downloaded if it hasn't changed.
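To illustrate why a refresh can finish in minutes, here is a minimal sketch of the two checks described above: whether a page is due yet, and a conditional HTTP request that skips the download when the server answers 304 Not Modified. This is not the crawler's actual code. The function names are hypothetical, the refresh interval is passed in directly rather than derived from the default/min/max settings, and timestamps are assumed to be timezone-aware UTC values.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from typing import Optional
import urllib.error
import urllib.request


def is_due(last_checked: datetime, now: datetime,
           refresh_interval: timedelta) -> bool:
    """A page is checked only once its refresh interval has elapsed."""
    return now - last_checked >= refresh_interval


def build_conditional_request(url: str,
                              last_fetched: datetime) -> urllib.request.Request:
    """Build a GET request carrying an If-Modified-Since header."""
    request = urllib.request.Request(url)
    # usegmt=True requires an aware UTC datetime and emits an HTTP date.
    request.add_header("If-Modified-Since",
                       format_datetime(last_fetched, usegmt=True))
    return request


def fetch_if_changed(url: str, last_fetched: datetime) -> Optional[bytes]:
    """Return the page body, or None if the server reports 304 Not Modified."""
    request = build_conditional_request(url, last_fetched)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # unchanged since last_fetched, nothing downloaded
        raise
```

With a max refresh time of one day and a walk that starts a few hours after the previous one, most pages fail the `is_due` check and are never requested, and many of the rest come back as 304, which would be consistent with an 11-minute run on a large profile.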