keeping spider in a sub-directory.

Post by Thunderstone »

Hello all,

I have been attempting to selectively index various "web-sites."
By "web-sites" I mean the pages maintained by a particular organization,
even if they are hosted on someone else's computer.
These sites often have links to other areas of the host computer.
My goal is to index ONLY the desired "web-site" and nothing else on
the host.

For example, I want to index:
http://www.abc.org/info/target_organization/

However, many of "target_organization"'s pages contain links to other
parts of www.abc.org's web server. I don't want those other pages, and
there is no way for me to know in advance which pages link elsewhere on
the server. For this reason, it is impractical to build a list of pages
to exclude.

What I would like to be able to say is:
Exclude all of http://www.abc.org/ EXCEPT for all pages below:
http://www.abc.org/info/target_organization/

Has anyone else run into this same problem?

Thanks,

David Cohen


Post by Thunderstone »

[...some snipping...]
The Webinator's idea of a "web-site" is synonymous with a single
IP address. This is because of the many ways DNS can resolve names,
combined with the mixed bag of ways a URL can be written.
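
A quick DNS lookup illustrates the problem: several hostnames can
resolve to the same address (the names and address below are made up):

   host www.abc.org
   www.abc.org has address 10.0.0.5
   host abc.org
   abc.org has address 10.0.0.5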

The Webinator has no notion of a path other than its understanding of
"../" and "./" relative references. This is the correct and intended
operation of the program. (URLs don't always mean protocol://host/pathname.)
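
As a concrete example, relative references resolve against the page
they appear on, so starting from a page inside the target directory
(file names here are made up for illustration):

   page:         http://www.abc.org/info/target_organization/index.html
   ./staff.html  ->  http://www.abc.org/info/target_organization/staff.html
   ../other/     ->  http://www.abc.org/info/other/

Note that the second link already climbs out of the target directory,
which is exactly how the walk escapes.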

Webinator simply follows the links to make sure it doesn't miss any
referenced URL on the server, and since it's a Web and not a tree, it
has no concept of up or down. (Trust me, it would have been a whole
lot easier to write if it were a tree.)

With all that said, here's an approach you might try if you only
want to index a single subdirectory on a server where you have
telnet access.

First "cd" to the htdocs directory on the server.
Then:

find MYDIR -name "*.html" -follow -exec echo http://www.abc.com/ {} \; | sed -e "s/ //" | gw -a -dMYDB "&-"

The above is all one line. MYDIR is the name of the directory you wish
to index, and MYDB is the name of the Webinator database directory
in which you'd like to place the index.
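
To see what each stage contributes, here's the kind of line each step
produces (the file name is made up for illustration):

   find MYDIR -name "*.html" -follow     ->  MYDIR/index.html
   -exec echo http://www.abc.com/ {} \;  ->  http://www.abc.com/ MYDIR/index.html
   | sed -e "s/ //"                      ->  http://www.abc.com/MYDIR/index.html

The final gw -a then takes that list of URLs on its standard input
(the "&-" argument) and adds each page to the MYDB database.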

Note: this may index more than you intend, because the find command
will also locate documents that are not exposed via hyperlinks.

If you don't have telnet access, the only other way is to play with the
-D and -a options in conjunction with the include/exclude lists.
You can "hand-walk" a site by feeding it URL lists from
a command like:

gw -d. -st "select Ref from refs where Ref like '/texis'" | gw -a -d. "-&"

or you could just delete the stuff that you don't want afterwards
with a command like:

gw -d. -st "delete from html where Url like '/wais'"

There's a whole lot of control and flexibility available once you
start combining the Texis SQL side of the Webinator with the Unix
command line.

Keep them coming!
Bart Richards