Selective refresh

Post by **nroot** » Tue Oct 24, 2006 4:40 pm

We've run into an interesting problem. We're using Thunderstone to index some usenet groups through a dynamic html reader (a PHP script). The script has two modes: listing articles by page and displaying individual articles.

We've got 500,000+ articles, so a "new" walk of all pages every time would probably be impossible. We think it's safe to assume that "article" pages *never* change. So, when we do a refresh walk, we want to refresh all index pages and NEW articles, but completely ignore previously-indexed articles.

Essentially, I think we want to be able to set up different refresh rules based on URL pattern match. How can we tackle this?

Tnx- N

Post by **John** » Tue Oct 24, 2006 4:50 pm

The simplest method is probably to edit the calcnextcheck function so that it takes the URL as a parameter, and to pass the URL in wherever the function is called. You could then look at the URL and set nextcheck far in the future if it matches, e.g.

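<!-- $url is the page URL passed in to calcnextcheck -->
<rex article $url>
<if $ret neq ''>
  <!-- URL matched "article": push the next check far into the future -->
  <return "2030-01-01">
<else>
  <!-- anything else (e.g. index pages): due for a check right away -->
  <return "now">
</if>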
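
For readers who don't use Vortex, the same rule can be sketched in Python. This is only an illustration of the logic, not Thunderstone code: the function name `calc_next_check` and the `now` parameter are made up for the example, and the `article` pattern and the 2030 date simply mirror the snippet above.

```python
import re
from datetime import datetime

# Assumption: any URL containing "article" is a page that never changes.
ARTICLE_PATTERN = re.compile(r"article")

# Far-future date used to effectively exclude a page from refresh walks.
FAR_FUTURE = datetime(2030, 1, 1)


def calc_next_check(url: str, now: datetime) -> datetime:
    """Return the time at which the page at `url` should next be checked."""
    if ARTICLE_PATTERN.search(url):
        # Article pages are assumed immutable, so defer them indefinitely.
        return FAR_FUTURE
    # Everything else (e.g. index pages) is due again immediately.
    return now
```

Keeping the pattern in one place means a second rule (say, a different interval for another URL family) would only need another branch in this function.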

Post by **nroot** » Fri Oct 27, 2006 11:37 am

Thanks John- I created a script that (I think) did roughly what you said: changed the NextCheck for pages in the database with URLs matching a specific pattern to the year 2030. I set that script up to run every 15 minutes while the "new" walk was going on and replace the NextCheck date for just rows that hadn't already been hit. Seems to have worked.
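
A periodic update of that kind might look roughly like the following, sketched in Python against an in-memory SQLite database. The `pages` table, its `url` and `next_check` columns, and the `%article%` pattern are hypothetical stand-ins for illustration, not Thunderstone's actual schema or the script used here.

```python
import sqlite3

# Far-future date stored as an ISO string so plain string comparison orders correctly.
FAR_FUTURE = "2030-01-01"


def defer_articles(conn: sqlite3.Connection, pattern: str = "%article%") -> int:
    """Set next_check to FAR_FUTURE for matching rows that are not already deferred.

    Returns the number of rows changed, so repeated runs report only new rows.
    """
    cur = conn.execute(
        "UPDATE pages SET next_check = ? WHERE url LIKE ? AND next_check < ?",
        (FAR_FUTURE, pattern, FAR_FUTURE),
    )
    conn.commit()
    return cur.rowcount


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny hypothetical table with two article pages and one index page."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, next_check TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?)",
        [
            ("http://example.com/news.php?article=1", "2006-10-27"),
            ("http://example.com/news.php?article=2", FAR_FUTURE),
            ("http://example.com/news.php?page=1", "2006-10-27"),
        ],
    )
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print(defer_articles(db))  # rows deferred on the first run
    print(defer_articles(db))  # nothing left to defer on the second run
```

Because the `WHERE` clause skips rows already set to the far-future date, running this on a schedule during the walk only touches articles picked up since the previous run.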

But that sounds a little different than editing the "calcnextcheck" function... where is that function? I couldn't find any info on it anywhere.

Thanks for the help- N

Post by **mark** » Fri Oct 27, 2006 11:48 am

calcnextcheck is in the dowalk script.