Hi,
I am using the dowalk script to crawl sites. Since rewalking is different from an initial crawl, I modified the script to rewalk only the pages that have been modified. Here is what I changed in the script:
select Url, Visited from html
fetch each Url using ifmodsince Visited
if modified, then insert the Url into the table todo
else if not found, then delete the Url from the table html
After the table todo is populated with the modified Urls, I call the dispatch function to crawl all the Urls in todo in parallel.
Then I update the respective fields in the table html. I am not processing any links on those pages, since I only need to crawl one page at a time.
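To make the steps above concrete, here is a rough sketch of the same logic in Python rather than in the dowalk script's own language. It is not the actual script code: the names check_status and plan_rewalk are made up for illustration, Visited is assumed to be stored as a Unix timestamp, and "not found" is assumed to mean an HTTP 404 or 410 response.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from email.utils import formatdate


def check_status(url, visited_epoch, timeout=10):
    """Conditional GET (hypothetical helper).

    Returns 'modified', 'unchanged', or 'gone'.
    Assumes visited_epoch is a Unix timestamp.
    """
    request = urllib.request.Request(
        url,
        headers={"If-Modified-Since": formatdate(visited_epoch, usegmt=True)},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return "modified"  # 2xx: the server sent a fresh copy
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return "unchanged"  # not modified since Visited
        if err.code in (404, 410):
            return "gone"  # assumed meaning of "not found"
        raise  # any other HTTP error is left to the caller


def plan_rewalk(rows, check=check_status, workers=8):
    """Classify (url, visited_epoch) rows in parallel.

    Returns (todo, gone): Urls to refetch and Urls to delete from html.
    """
    rows = list(rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(lambda row: check(*row), rows))
    todo = [url for (url, _), status in zip(rows, statuses) if status == "modified"]
    gone = [url for (url, _), status in zip(rows, statuses) if status == "gone"]
    return todo, gone
```

Here plan_rewalk stands in for the select and the todo/delete decisions; the Urls in todo would then be handed to whatever does the parallel refetch and the update of html. Passing a different check function makes it possible to try the logic without touching the network.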
What I would like to know is whether this approach is correct. Is there anything else I need to take care of, or any other modifications I should make to the code? I ran it on a sample database and I think it worked fine.
Thanks,