Hello!
We are using the Ignore tags function to avoid indexing of menus and navigation. This is working well on manually started walks, but after a scheduled rewalk (type=New) the previously ignored sections (menus & navigation) reappear in the search index.
I suspect there may be something wrong with the scheduled rewalks since they are very fast:
"Last complete walk: 2005-06-02 02:00:34 (took 28 seconds)
Success. 2,075 pages (55,966,859 bytes) (135 errors) (641 duplicates)"
A scheduled rewalk will only refresh those pages that are due to be refreshed, which may only be a few depending on your refresh time settings.
As far as we are aware, ignore tags behave the same way whether the walk is manual or scheduled. If you look at the page under List/Edit URLs, what does it show in terms of content, as well as visited and modified times?
Firstly: Hats off to the Thunderstone support team for your quick response times!
I thought the rewalk type "New" (as opposed to "Refresh") would rebuild the index from scratch, but from what you are saying I gather this is not the case.
We recently added ignore tags to our pages (CMS-based intranet), but the content was still in the index after the scheduled rewalks. The issue seems to be that our added ignore tags ("<!--NOINDEX START-->" and "<!--NOINDEX END-->") didn't cause the search engine to update the index on the scheduled rewalks. A manually started walk forces the index to update all pages, so we'll do that for all profiles.
I still don't quite grasp how the search engine decides if a page should be reindexed or not? Here's some log data and settings from our search server:
-------------------------------
One of the non-updated pages in the List/Edit URLs:
Indexed: 2005-06-01 13:55:43
Modified: 2005-06-01 13:55:43
Last Visit: 2005-06-01 13:55:43
Next Visit: 2005-06-01 14:55:43
The last scheduled rewalk for that profile:
Last complete walk: 2005-06-03 04:00:12 (took 10 seconds)
Success. 657 pages (11,837,072 bytes) (35 errors) (248 duplicates)
Default Refresh Time: 1 hour
Minimum Refresh Time: 1 minute
Maximum Refresh Time: 90 days
--------------------------------------
10 seconds to rewalk 657 pages seems a bit too quick to me. Apparently the search engine isn't revisiting the pages, so how does it decide whether a page should be reindexed?
What can we do to make sure that pages will be reindexed when the content has been updated, without putting unnecessary load on the involved servers? Meta tags? Changed settings?
A "new" walk does refetch and reindex everything. But all scheduled walks are "refresh" regardless of mode setting.
Initially all pages will have the default refresh time setting. Upon refresh, every page that's due will be checked, using If-Modified-Since if the server supports it. The refresh time will be adjusted up or down depending on whether the page changed or not, so that frequently changing pages will ultimately get refreshed more often and rarely changing pages will get refreshed less often. If you scroll down in the walk status you should be able to get a better idea of how much was refreshed in the most recent cycle. The refresh can't know what changed without asking about everything, hence the above strategy.
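
For anyone trying to picture the strategy described above, here is a rough Python sketch. It is only an illustration, not the search appliance's actual code. The thread does not say how far the refresh time moves each cycle, so the halving and doubling below are assumptions, and the names Page, refresh_cycle and page_changed are hypothetical. The three time limits are the ones from the profile settings quoted earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Values from the profile settings quoted in this thread.
DEFAULT_REFRESH = timedelta(hours=1)
MIN_REFRESH = timedelta(minutes=1)
MAX_REFRESH = timedelta(days=90)


@dataclass
class Page:
    """Hypothetical per-URL record, loosely modelled on List/Edit URLs."""
    url: str
    last_visit: datetime
    refresh: timedelta = DEFAULT_REFRESH

    @property
    def next_visit(self) -> datetime:
        return self.last_visit + self.refresh


def if_modified_since(page: Page) -> str:
    """Format the last visit as an HTTP date for an If-Modified-Since header."""
    return format_datetime(page.last_visit, usegmt=True)


def refresh_cycle(pages, now, page_changed):
    """Check only the pages that are due and return the URLs that were checked.

    page_changed is a hypothetical callback (page -> bool) standing in for
    the conditional HTTP request to the web server.
    """
    checked = []
    for page in pages:
        if page.next_visit > now:
            continue  # not due yet, so this cycle does not contact the server
        if page_changed(page):
            # Assumed rule: halve the interval for a page that changed.
            page.refresh = max(MIN_REFRESH, page.refresh / 2)
        else:
            # Assumed rule: double the interval for an unchanged page.
            page.refresh = min(MAX_REFRESH, page.refresh * 2)
        page.last_visit = now
        checked.append(page.url)
    return checked
```

Under these assumptions, a page that keeps reporting "not modified" drifts toward the 90 day maximum, while a page that changes on every check drifts toward the 1 minute minimum. A cycle in which few pages are due, or in which the server answers most checks with "not modified", finishes quickly, which would be consistent with the short walk times reported above.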