I need to somehow "scan" sites for changes, but not completely re-walk them
unless changes are found. Is there a way to do this with webinator?
Doesn't it read every file (unless you restrict it) every time it walks a
site?
A: Most of the overhead in HTTP is in the actual connect/request operation.
A1. In roughly the same time it takes to do a HEAD, you can fetch a whole page.
A2. Not all HTTPDs support HEAD.
A3. If HEAD shows that a doc _has_ changed, you then have to issue a full GET,
thereby doubling the overhead/time for modified pages.
B: Changes to a child document do not affect the LAST-MODIFIED of its parents.
B1. You will still have to follow all hyperlinks to ensure that no
document has been changed.
The only way to avoid a total rescan, given A & B, is for the web server
to give you a list of NEW and MODIFIED URLs.
Nice idea though.
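To make A3 concrete, here is a rough sketch of the check-then-fetch pattern
using curl (which is not part of Webinator; the URL and file names are
placeholders). Even when HEAD works, a changed page still costs a second,
full request:

  # HEAD request: headers only
  curl -sI http://www.example.com/page.html | grep -i '^Last-Modified:' > new-mod.txt
  # if the Last-Modified header differs from the stored one, the body
  # still has to be fetched with a full GET -- the doubling described in A3
  if ! cmp -s new-mod.txt last-mod.txt; then
      curl -s http://www.example.com/page.html -o page.html
      mv new-mod.txt last-mod.txt
  fi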
In truth, ROBOTS.TXT is really under-specified where it could be of
greatest use. The spec should also include lists of new, changed,
and preferred docs.
Is there a way to read the entry page and compare it to the last time it was
read, to determine whether there are any changes and, if so, go to a full walk?
This doesn't address B, but it's better than nothing. One improvement
would be to scan the entry page for the words "new" or "news" in a link
label and check those pages for changes as well.
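Something like this rough sketch is what I have in mind, assuming curl and
md5sum are available (neither is part of Webinator; the URL and file names
are placeholders):

  # hash the body of the entry page
  curl -s http://www.example.com/ | md5sum > entry.new
  # if the hash differs from the one saved after the last walk,
  # kick off a full re-walk (gw options omitted)
  if ! cmp -s entry.new entry.last; then
      gw http://www.example.com/
      mv entry.new entry.last
  fi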
Another alternative would be to do a full walk during the update, but note
the pages that have changed. I'd like to make this a feature request. I'd
like to be able to present a list of links that have changed since the last
index _and_ since a given date (such as when the user last logged in,
recorded elsewhere).
The method you describe is rather "possiblistic" and would tend to contradict
Webinator's more traditional role as a precise tool. A quick look at the Yahoo
home page should illustrate its inherent flaws.
If you would like to identify documents that have changed, it is already
possible. It will require a little creative shell scripting, but
the infrastructure for it is already present:
A: create two different databases each containing the full content of
the target website on different dates. Specify the "-unique" option
to gw when creating each database. (This makes the id field of the
html table into a hash key for the Body content field.)
B: Compare the id fields of the two databases. Because id is a content
hash, any id that appears in the newer database but not in the older
one marks a new or modified document; collecting those produces a list
of new/different documents at that site (sketched below).
You may feed this list back into "gw" to re-fetch these pages when
walking the site the third time.
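Here is a rough sketch of step B, assuming each walk was written to its own
database, that each gw run below is pointed at the corresponding database,
and that standard sort/comm are available:

  # dump the content-hash ids from each walk
  gw -sh "select id from html" | sort > old.ids   # run against the older database
  gw -sh "select id from html" | sort > new.ids   # run against the newer database
  # ids present only in the newer walk mark new or changed Body content
  comm -13 old.ids new.ids > changed.ids

The Urls behind those ids can then be selected from the newer database and
handed back to gw for the next walk.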
To present the user with a list of newly fetched Urls, just issue
a gw command like this:
gw -sh "select '<a href=http://'+Url+'>'+Title+'</a>' from html where Visited > '-1 day'" >newdocs.html
This will produce an HTML document of the kind you describe.
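For the "since a given date" case mentioned above, the same query should work
with the relative date replaced by an absolute one (the date shown is only a
placeholder, in a format Texis should accept):

gw -sh "select '<a href=http://'+Url+'>'+Title+'</a>' from html where Visited > '1998-01-15'" >newdocs.html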
Happy Walking,
Thunderstone
PS. After having said all this, it's still _not_ possible to know absolutely
all the docs that are new or changed on a site without re-walking
the entire site.