Rewalk Type: Refresh - Question?

rhuber0
Posts: 17
Joined: Tue Dec 06, 2005 9:29 am

Post by rhuber0 »

I am indexing ASP pages that pull data out of a database. When data in the database changes, the new/changed data appears on the ASP page. However, when I walk these pages with 'Rewalk Type = Refresh', the new/changed data does not get updated in the index. I'm guessing the changes aren't reflected in the index because the URL didn't change and Thunderstone treats the page as "not updated". How can I get these data changes into the index with a 'Rewalk Type = Refresh'? I know that a 'Rewalk Type = New' would be effective, but there is too much data to index all the database content on a daily schedule.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

URL changing is not really related. New urls will be added to the database. Unchanged urls will be updated if their content changes.

Refresh will fetch each page that's due for refresh (check your default/min/max refresh times under all walk settings) using "ifmodsince" (the HTTP If-Modified-Since request header). If the webserver respects ifmodsince and the page hasn't changed since the last visit, nothing will be downloaded or updated. If there's no ifmodsince, or the webserver says the page is modified, the page will be downloaded and checksummed. If the checksum differs from the last download's, the new page will be stored.
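
The decision described above can be sketched roughly as follows. This is an illustration only, not Webinator's code: the function names are made up for the example, and SHA-256 stands in for whatever checksum Webinator actually uses.

```python
import hashlib
from typing import Optional


def checksum(body: bytes) -> str:
    # SHA-256 is an assumption; the real checksum algorithm is not stated in this thread.
    return hashlib.sha256(body).hexdigest()


def refresh_action(status: int, body: Optional[bytes], stored_checksum: str) -> str:
    """Decide what to do with a page that was due for refresh.

    status          -- HTTP status returned for the conditional request
    body            -- downloaded page body (None when the server sent 304)
    stored_checksum -- checksum saved at the previous download
    """
    if status == 304:
        # Server honoured If-Modified-Since: nothing is downloaded or updated.
        return "skip"
    if body is None:
        raise ValueError("a body is required unless the server answered 304")
    if checksum(body) == stored_checksum:
        # Page was downloaded, but its content matches the last download.
        return "unchanged"
    # Content differs from the last download: store the new page.
    return "store"
```

If a page embeds something that changes on every request (a timestamp, a session id), the checksum would differ on every visit, so such a page would be stored again each time it comes due.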

You can use list/edit urls to see if your pages are due for update.
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

There are a couple of possibilities. What do you have the refresh settings set to? Pages are only checked if they are likely to have changed. If the database record includes a modification time, it would be helpful to set the Last-Modified header from it, as well as respond to If-Modified-Since requests correctly. That will allow much more efficient processing of updates. Otherwise the entire page is fetched and compared to the previous version of the page.
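
As a rough sketch of that server-side logic (written in Python for illustration; the pages in question are ASP, so the same steps would need to be ported), the handler compares the record's modification time with the If-Modified-Since value and answers 304 when nothing is newer. The function name and return shape are assumptions made for this example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional, Tuple


def conditional_response(
    record_modified: datetime, if_modified_since: Optional[str]
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a page built from one database record.

    record_modified must be timezone-aware. HTTP dates have one-second
    resolution, so microseconds are dropped before comparing.
    """
    last_modified = record_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if since is not None and last_modified <= since:
            return 304, headers  # not modified: send no body

    return 200, headers  # send the full page
```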

Since you have Webinator, if you can identify the changed URLs you could modify the script to update just those pages.
John Turnbull
Thunderstone Software
rhuber0
Posts: 17
Joined: Tue Dec 06, 2005 9:29 am

Post by rhuber0 »

After doing some more research, I believe the solution that I need is similar to the one in this thread: http://thunderstone.master.com/texis/ma ... 3fb4a56110

The process would be to:
1. Create a NEW walk to index all ASP pages that contain our database content
2. Create a Page File with a list of URLs of ASP pages of database content that's been updated since the last index creation.
3. Run a dowalk with singles option e.g. "...MORPH3\texis\scripts/Webinator/dowalk\singles.txt" to add in the new content from the URLs in the Page File.
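
Step 2 above could be sketched along these lines. Everything specific here is an assumption made for illustration: the record layout (an id plus a modification time), the URL template, and the one-URL-per-line layout of the Page File.

```python
from datetime import datetime
from typing import Iterable, List, Tuple


def changed_urls(
    records: Iterable[Tuple[int, datetime]],
    last_indexed: datetime,
    url_template: str,
) -> List[str]:
    """Build URLs for records modified after the last index creation.

    records      -- (record id, modification time) pairs from the database
    last_indexed -- time of the previous walk
    url_template -- hypothetical ASP URL pattern containing "{id}"
    """
    return [
        url_template.format(id=record_id)
        for record_id, modified in records
        if modified > last_indexed
    ]


def render_page_file(urls: Iterable[str]) -> str:
    # Assumed layout: one URL per line.
    return "".join(url + "\n" for url in urls)
```

The returned text would then be written to the Page File that the singles run reads.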

The only problem that I see is that when I have content in the index for a URL from the initial walk, the content is not updated when the URL is placed in the Page File. Example: ASP URL returns the following in the body "Bug 101: SQL Server error". The initial walk is run and this data is stored in the index. A week later, the database item is updated and the ASP URL now returns "Bug 101: Oracle error". The URL is placed in the Page File and processed via dowalk\singles but when you run a Live Search, the original "SQL Server" value is still in the index. Is this an error or just a limitation of using a Page File?
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Singles is designed to add content that does not already exist in the database; it will not update existing content. Since you have access to the scripts with Webinator, you could change that behaviour.
John Turnbull
Thunderstone Software