I need to somehow "scan" sites for changes, but not completely re-walk them
unless changes are found. Is there a way to do this with webinator?
Doesn't it read every file (unless you restrict it) every time it walks a
site?
A: Most of the overhead in HTTP is in the actual connect/request operation.
A1. In roughly the same time it takes to do a HEAD, you can fetch a whole page.
A2. Not all HTTPDs support HEAD.
A3. If HEAD shows that a doc _has_ changed, you then have to issue a full GET,
thereby doubling the overhead/time for modified pages.
B: Changes to a child document do not affect the LAST-MODIFIED of its parents.
B1. You will still have to follow all hyperlinks to ensure that no
document has been changed.
The only way to avoid a total rescan, given A & B, is for the web server
to give you a list of NEW and MODIFIED URLs.
Nice idea though.
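To make A3 concrete, here is a rough sketch of the check-then-fetch pattern
using curl (which is not part of Webinator; the URL and file names are
placeholders). Even when HEAD works, a changed page still costs a second,
full request:

  # HEAD request: headers only
  curl -sI http://www.example.com/page.html | grep -i '^Last-Modified:' > new-mod.txt
  # if the Last-Modified header differs from the stored one, the body
  # still has to be fetched with a full GET -- the doubling described in A3
  if ! cmp -s new-mod.txt last-mod.txt; then
      curl -s http://www.example.com/page.html -o page.html
      mv new-mod.txt last-mod.txt
  fi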
In truth, ROBOTS.TXT is really under-specified where it could be of
greatest use. The spec should also include lists of new, changed,
and preferred docs.
Is there a way to read the entry page and compare it to the last time it was
read, to determine whether there are any changes and, if so, go to a full walk?
This doesn't address B, but it's better than nothing. One improvement
would be to scan the entry page for the words "new" or "news" in a link
label and check those pages for changes as well.
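Something like this rough sketch is what I have in mind, assuming curl and
md5sum are available (neither is part of Webinator; the URL and file names
are placeholders):

  # hash the body of the entry page
  curl -s http://www.example.com/ | md5sum > entry.new
  # if the hash differs from the one saved after the last walk,
  # kick off a full re-walk (gw options omitted)
  if ! cmp -s entry.new entry.last; then
      gw http://www.example.com/
      mv entry.new entry.last
  fi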
Another alternative would be to do a full walk during the update, but note
the pages that have changed. I'd like to make this a feature request. I'd
like to be able to present a list of links that have changed since the last
index _and_ since a given date (such as when the user last logged in,
recorded elsewhere).
The method you describe is rather "possiblistic" and would tend to contradict
Webinator's more traditional role as a precise tool. A quick look at the Yahoo
home page should illustrate its inherent flaws.
If you would like to identify documents that have changed, it is already
possible. It will require a little creative shell scripting, but
the infrastructure for it is already present:
A: create two different databases each containing the full content of
the target website on different dates. Specify the "-unique" option
to gw when creating each database. (This makes the id field of the
html table into a hash key for the Body content field.)
B: Compare the id fields of the two databases. Because id is a content
hash, any id that appears in the newer database but not in the older
one marks a new or modified document; collecting those produces a list
of new/different documents at that site (sketched below).
You may feed this list back into "gw" to re-fetch these pages when
walking the site the third time.
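Here is a rough sketch of step B, assuming each walk was written to its own
database, that each gw run below is pointed at the corresponding database,
and that standard sort/comm are available:

  # dump the content-hash ids from each walk
  gw -sh "select id from html" | sort > old.ids   # run against the older database
  gw -sh "select id from html" | sort > new.ids   # run against the newer database
  # ids present only in the newer walk mark new or changed Body content
  comm -13 old.ids new.ids > changed.ids

The Urls behind those ids can then be selected from the newer database and
handed back to gw for the next walk.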
To present the user with a list of newly fetched Urls, just issue
a gw command like this:
gw -sh "select '<a href=http://'+Url+'>'+Title+'</a>' from html where Visited > '-1 day'" >newdocs.html
This will produce an HTML document of the kind you describe.
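For the "since a given date" case mentioned above, the same query should work
with the relative date replaced by an absolute one (the date shown is only a
placeholder, in a format Texis should accept):

gw -sh "select '<a href=http://'+Url+'>'+Title+'</a>' from html where Visited > '1998-01-15'" >newdocs.html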
Happy Walking,
Thunderstone
PS. After having said all this, it's still _not_ possible to know absolutely
all the docs that are new or changed on a site without re-walking
the entire site.