Web walking updates

User avatar
Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Web walking updates

Post by Thunderstone »



I need to somehow "scan" sites for changes, but not completely re-walk them
unless changes are found. Is there a way to do this with webinator?
Doesn't it read every file (unless you restrict it) every time it walks a
site?



Post by Thunderstone »




Not really. Here are the reasons:

A: Most of the overhead in HTTP is in the actual connect/request operation.
A1. In roughly the same time it takes to do a HEAD, you can fetch a whole page.
A2. Not all HTTPDs support HEAD.
A3. If HEAD reports that a doc _is_ changed, you then have to issue a full
GET, thereby doubling the overhead/time on modified pages.

B: Changes to a child document do not affect the LAST-MODIFIED of its parents.
B1. You would still have to follow every hyperlink to ensure that no
document has changed.

The only way to avoid a total rescan, given A & B, is for the Web server
to give you a list of NEW and MODIFIED URLs.

Nice idea though.
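As an aside on point A3: when the server honors it, a single conditional GET (an If-Modified-Since request header) does the change check and the fetch in one round trip, instead of a HEAD followed by a GET. A minimal sketch in Python of that standard HTTP mechanism (the URL and timestamp are placeholders; this illustrates the protocol, not a gw feature):

```python
from email.utils import formatdate
from urllib import request
from urllib.error import HTTPError

def if_modified_since(epoch_seconds):
    """Format a timestamp as an RFC-1123 date, the form HTTP headers expect."""
    return formatdate(epoch_seconds, usegmt=True)

def fetch_if_changed(url, last_fetch_epoch):
    """Issue ONE conditional GET: return the new body, or None if the server
    answered 304 Not Modified.  Avoids the HEAD-then-GET double round trip."""
    req = request.Request(url, headers={
        "If-Modified-Since": if_modified_since(last_fetch_epoch)})
    try:
        with request.urlopen(req) as resp:
            return resp.read()       # page changed: body already in hand
    except HTTPError as err:
        if err.code == 304:          # unchanged: no second request needed
            return None
        raise
```

A 304 reply carries no body, so the unchanged case costs one request and almost no transfer. The B problem (child documents) of course remains.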

In truth, ROBOTS.TXT is really too under-specified in the area where it
could be of greatest use. The spec should really include lists of new,
changed, and preferred docs too.


Thunderstone



Post by Thunderstone »



At 06:20 PM 3/6/97 EST, you wrote:
...[snipped]...

Is there a way to read the entry page, compare it to the last time it was
read to determine whether there are any changes, and if so do a full walk?
This doesn't address B, but it's better than nothing. One improvement
would be to scan the entry page for the words "new" or "news" in a link
label and check those pages for changes as well.
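A rough sketch of that entry-page check, assuming you keep a fingerprint of the page from the previous walk. The helper names and the crude link-scanning regex are mine for illustration, not anything Webinator provides:

```python
import hashlib
import re

def page_fingerprint(body_bytes):
    """Hash the raw page so a later fetch can be compared cheaply."""
    return hashlib.md5(body_bytes).hexdigest()

def entry_page_changed(body_bytes, stored_fingerprint):
    """True when the entry page differs from the fingerprint saved on the
    previous walk; only then would a full re-walk be triggered."""
    return page_fingerprint(body_bytes) != stored_fingerprint

# Crude scan for links whose label contains "new" or "news" (case-insensitive).
NEWS_LINK = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>[^<]*\bnews?\b', re.I)

def news_links(html_text):
    """Return the href of every link labeled with "new"/"news"."""
    return NEWS_LINK.findall(html_text)
```

The fingerprint approach catches any byte-level change, including ones a LAST-MODIFIED header would miss, but it still shares the B problem: an unchanged entry page says nothing about the pages below it.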

Another alternative would be to do a full walk during the update, but note
the pages that have changed. I'd like to make this a feature request. I'd
like to be able to present a list of links that have changed since the last
index _and_ since a given date (such as when the user last logged in,
recorded elsewhere).





Post by Thunderstone »



[note: cool Webinator commands below]

..[snipping]...


The method is rather "possiblistic" and would tend to contradict Webinator's
more traditional role as a precise tool. A quick look at the Yahoo
home page should illustrate its inherent flaws.


If you would like to identify documents that have changed, it is already
possible. It will require a little creative shell scripting, but
the infrastructure for it is already present:

A: create two different databases each containing the full content of
the target website on different dates. Specify the "-unique" option
to gw when creating each database. (This makes the id field of the
html table into a hash key for the Body content field.)

B: issue the commands:

gw -dDBA -st "select id,Url from html" >/tmp/lista
gw -dDBB -st "select id,Url from html" >/tmp/listb
diff /tmp/lista /tmp/listb | grep '^>' | cut -f2

This will produce a list of new/different documents at that site.
You may feed this list back into "gw" to re-fetch these pages when
walking the site the third time.
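The same extraction can be done without diff/grep/cut. This sketch assumes gw's -st output is one tab-separated "id<TAB>Url" line per row, with id being the content hash produced by "-unique" (an assumption about the dump format, not a supported interface):

```python
def changed_urls(old_lines, new_lines):
    """Given two dumps of "id<TAB>Url" lines (one per walk), return the Urls
    that are new in the second walk or whose content hash changed -- the
    same set the diff | grep | cut pipeline extracts."""
    old_ids = {line.split("\t", 1)[0] for line in old_lines if line.strip()}
    return [line.split("\t", 1)[1]
            for line in new_lines
            if line.strip() and line.split("\t", 1)[0] not in old_ids]
```

Because "-unique" makes the id a hash of the Body field, a page whose content changed shows up with a new id at the same Url, and so lands in the result alongside genuinely new pages.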

To present the user with a list of newly fetched Urls, just issue
a gw command like this:

gw -sh "select '<a href=http://'+Url+'>'+Title+'</a>' from html where Visited > '-1 day'" >newdocs.html

This will produce an HTML document of the kind you describe.

Happy Walking,

Thunderstone

PS. After having said all this, it's still _not_ possible to know with
absolute certainty all the docs that are new or changed on a site without
re-walking the entire site.