Rewalking modified sites using script

Faiz
Posts: 109
Joined: Wed Jan 10, 2001 1:29 pm

Post by Faiz »

Hi,
I am using the dowalk script to crawl sites. Since rewalking is different from the initial crawl, I modified the script to rewalk only the sites that have been modified. Here's what I changed in the script:
select Url, Visited from html
fetch each Url using ifmodsince Visited
if modified, insert the Url into the table todo
else if not found, delete the Url from the table html
After the table todo is populated with the modified Urls, I call the dispatch function to crawl all the Urls in todo in parallel.
Then I update the corresponding fields in the table html. I am not processing any links on those pages, since I only need to crawl one page at a time.
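The steps above could be sketched roughly as follows. This is Python rather than the Vortex used by dowalk, and it is only an illustration of the decision logic: `plan_rewalk`, `if_modified_since_header`, and the `fetch_status` callable are hypothetical names, and the status handling (200 means modified, 304 means unchanged, 404/410 means gone) is an assumption about how the conditional fetch reports its result.

```python
from email.utils import formatdate


def if_modified_since_header(visited_epoch):
    """Format a Unix timestamp as an HTTP-date for a conditional GET."""
    return {"If-Modified-Since": formatdate(visited_epoch, usegmt=True)}


def plan_rewalk(rows, fetch_status):
    """Decide what to do with each previously walked URL.

    rows         -- iterable of (url, visited_epoch) pairs, as selected from html
    fetch_status -- hypothetical callable (url, headers) -> HTTP status code
    Returns (refetch, delete): URLs to crawl again and URLs to drop from html.
    """
    refetch, delete = [], []
    for url, visited in rows:
        status = fetch_status(url, if_modified_since_header(visited))
        if status == 200:            # page changed since the last visit
            refetch.append(url)
        elif status in (404, 410):   # page no longer exists
            delete.append(url)
        # 304 (not modified) and any other status: leave the row untouched
    return refetch, delete
```

Passing the fetcher in as a callable keeps the decision logic separate from the network code, so it can be exercised against a sample database without touching live sites.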
What I want to know is whether my approach is correct. Is there anything else I need to take care of, or any other modifications I should make to the code? I ran it on a sample database and I think it worked fine.

Thanx,
mark
Site Admin
Posts: 5515
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

That's mostly OK, except I would insert the URLs into the internal todo list like any other page fetch instead of placing them in the todo table. Then the refresh is just one invocation.

You need to process the links on the page to find new ones. And depending on how you do it, you may want to delete all entries from the refs table for the given URL before inserting new ones.
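As a rough illustration of that last point, here is a sketch in Python with SQLite rather than Vortex and Texis. The column names (`refs.Url`, `refs.Ref`, `html.Url`) and the function name `refresh_refs` are assumptions made for the example, not the actual dowalk schema.

```python
import sqlite3


def refresh_refs(conn, url, links):
    """Replace the stored outgoing links for url and report unseen ones.

    Assumes hypothetical tables refs(Url, Ref) and html(Url).
    Returns the links that are not yet present in html, in page order.
    """
    links = list(dict.fromkeys(links))  # drop duplicates, keep order
    with conn:  # one transaction: either both statements apply or neither
        conn.execute("DELETE FROM refs WHERE Url = ?", (url,))
        conn.executemany(
            "INSERT INTO refs (Url, Ref) VALUES (?, ?)",
            [(url, link) for link in links],
        )
    known = {row[0] for row in conn.execute("SELECT Url FROM html")}
    return [link for link in links if link not in known]
```

The returned list is what would be handed to the todo list so that newly discovered pages get walked in the same run.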