I have set up Watch URLs and configured my profile to do a rewalk on change.
The rewalk is triggered every 15 minutes or so, even though the content at the watch URL has not changed and I'm not setting a lastModified META tag. Any suggestions?
That URL was returning an error page with a modified time of "now", so it appeared changed every time. While I was looking into it, the page started working again, so it will probably behave correctly now.
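The failure mode described above can be sketched in a few lines. This is a hypothetical illustration of modified-time change detection, not the crawler's actual code: an error page that is generated on the fly gets stamped with the current time, so every poll sees a new modified time and the URL looks changed on each check.

```python
def has_changed(last_seen_modified: float, fetched_modified: float) -> bool:
    """Naive change detection based on the document's modified time
    (a simplified sketch of the behavior described above)."""
    return fetched_modified != last_seen_modified

# A real page whose modified time is stable is only flagged when it
# actually changes; an error page stamped with "now" differs on every
# poll, so it always appears changed.
stable_page_time = 1_700_000_000.0
print(has_changed(stable_page_time, stable_page_time))          # False
print(has_changed(stable_page_time, stable_page_time + 900.0))  # True
```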
I see why that would have happened. Is there any way for it NOT to trigger a rewalk if the watch URL returns a response code other than 200? The last thing you'd want while your server's down is a rewalk being triggered on top of it.
Actually, the server was returning a document that looked fine as far as the client/crawler was concerned, which is why it had a modified time. But the text of the document said the backend was having problems or some such. The crawler can't really know it was a failure in that case.
Do you mean it did return 200 as the status code? If that's the case I might be able to fix that. What is the crawler looking at to determine whether it's "good"?
HTTP status codes 100-299 are considered OK. Anything else will prevent a rewalk from being triggered and will be treated as if no attempt was made to fetch the URL. Other non-HTTP conditions that prevent triggering include connection timeouts, DNS failures, etc.
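The trigger policy above could be sketched like this. The function name and parameters are hypothetical; this is only an illustration of the stated rules, not the crawler's internals:

```python
from typing import Optional

def should_trigger_rewalk(status_code: Optional[int] = None,
                          fetch_error: Optional[str] = None) -> bool:
    """Return True only when the fetch succeeded with an HTTP status
    in the 100-299 range. Any other status, or a non-HTTP failure
    such as a connection timeout or DNS error, is treated as if no
    fetch attempt was made, so no rewalk is triggered."""
    if fetch_error is not None:  # connection timeout, DNS failure, etc.
        return False
    return status_code is not None and 100 <= status_code <= 299

print(should_trigger_rewalk(status_code=200))        # True
print(should_trigger_rewalk(status_code=503))        # False
print(should_trigger_rewalk(fetch_error="dns"))      # False
```

Note that, as discussed above, a server that returns 200 with an error message in the page body still counts as OK under this policy; the crawler has no reliable way to tell from the status code alone that the backend was unhealthy.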