Hello,
I told this crawler of mine to index a site along with any other sites it
finds in the same domain.
However, every once in a while I see things like:
000 Can't get address for host `ferretworld.com': Unknown error
or
000 Can't get address for host `www.yourdomain.com': Unknown error
The robot must be picking these server names up from pages it crawled
(somebody probably used www.yourdomain.com as an example), but I don't
understand why the crawler is trying to look them up.
If I said I want to index http://www.mysite.com with -domain=mysite.com, why
does the crawler even bother looking up something that is obviously not in
the same domain? I know it could be a virtual domain hosted by mysite.com,
but I normally wouldn't want to index that, I think.
I think I just found a partial answer to this in the list archives at
http://www.thunderstone.com/texis/webin ... 4dxrXqwww/full.html
However, the answer says this is done in order to catch aliases for the same
server. That is what I thought.
However, I know you can have something like www.mysite.com and
aliaswww.mysite.com, which can both be the same server (aliases, CNAMEs)
... but how could www.yourdomain.com be an alias for www.mysite.com? Those
are two different _domains_, not _host_ names. Can those be aliases of one
another as well?
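To make concrete what I mean by "aliases of one another": here is a minimal sketch of the check I imagine, using Python's standard resolver. The `same_server` helper and the hostnames in the test table are mine, purely for illustration; this is not how the crawler actually does it.

```python
import socket

def same_server(host_a, host_b, resolve=socket.gethostbyname_ex):
    """Do two hostnames appear to point at the same server?
    True if they share a canonical name or any IP address.
    (Hypothetical helper, not part of the crawler.)"""
    try:
        canon_a, _, ips_a = resolve(host_a)
        canon_b, _, ips_b = resolve(host_b)
    except OSError:
        return False  # lookup failed; treat as different servers
    return canon_a == canon_b or bool(set(ips_a) & set(ips_b))
```

Note that nothing in DNS stops a CNAME from crossing domain boundaries, so in principle www.yourdomain.com really could be an alias for a host in mysite.com; only resolving it would tell.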
Also, since the crawler often encounters non-existent domains/servers (those
that people use as examples, like mysite.com here), it wastes a lot of time
trying to look them up before finally giving up when the lookup times out.
Now, it will often find the same address twice (in a row) and try to look it
up twice (in a row). I imagine this is good for situations where the first
failure was a temporary one, but is there any way to tell the crawler not to
re-look them up so often? In other words, something like: "if you fail to
look up server www.X.com, don't look it up again for the next N minutes even
if you encounter it".
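The behaviour I have in mind is essentially a negative-lookup cache with a TTL. A minimal sketch (the `NegativeCache` name, its API, and the 600-second default are all my invention, just to show the idea):

```python
import time

class NegativeCache:
    """Remember hostnames whose DNS lookup failed, and suppress
    retries for ttl_seconds after each failure."""

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock    # injectable clock, handy for testing
        self.failed = {}      # host -> time of last failed lookup

    def record_failure(self, host):
        self.failed[host] = self.clock()

    def should_skip(self, host):
        """True if host failed recently enough that we should not retry."""
        when = self.failed.get(host)
        if when is None:
            return False
        if self.clock() - when < self.ttl:
            return True
        del self.failed[host]  # entry expired; allow one retry
        return False
```

The crawler would call `should_skip()` before each DNS lookup and `record_failure()` after each timeout, so repeated references to a dead host cost nothing until the TTL expires.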
This, I think, would help the crawler move on to the working servers in its
todo list and postpone requests to those that may be temporarily down.
Otherwise it wastes a lot of time, and if the server's problem persists it
won't get any documents from it anyway. Wouldn't its time be better spent on
the other, working servers in the todo list?
Does this even make sense?
Thank you,
Otis