Hello,
I told this crawler of mine to index a site along with any other sites it
finds in the same domain.
However, every once in a while I see things like:
000 Can't get address for host `ferretworld.com': Unknown error
or
000 Can't get address for host `www.yourdomain.com': Unknown error
The robot must be picking these server names up from pages it crawled
(somebody probably used www.yourdomain.com as an example), but I don't
understand why the crawler is trying to look them up.
If I said I want to index http://www.mysite.com with -domain=mysite.com, why
does the crawler even bother looking up something that is obviously not in
the same domain? I know it could be a virtual domain hosted by mysite.com,
but I normally wouldn't want to index that, I think.
I think I just found a partial answer to this in the list archives at
http://www.thunderstone.com/texis/webin ... 4dxrXqwww/full.html
However, the answer says this is done in order to catch aliases for the same
server. That is what I thought.
However, I know you can have something like www.mysite.com and
aliaswww.mysite.com, which can both be the same server (aliases, CNAMEs)
... but how could www.yourdomain.com be an alias for www.mysite.com? Those
are two different _domains_, not _host_ names. Can those be aliases of one
another as well?
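To make concrete what I mean by "aliases of one another": here is a minimal sketch of the check I imagine, using Python's standard resolver. The `same_server` helper and the hostnames in the test table are mine, purely for illustration; this is not how the crawler actually does it.

```python
import socket

def same_server(host_a, host_b, resolve=socket.gethostbyname_ex):
    """Do two hostnames appear to point at the same server?
    True if they share a canonical name or any IP address.
    (Hypothetical helper, not part of the crawler.)"""
    try:
        canon_a, _, ips_a = resolve(host_a)
        canon_b, _, ips_b = resolve(host_b)
    except OSError:
        return False  # lookup failed; treat as different servers
    return canon_a == canon_b or bool(set(ips_a) & set(ips_b))
```

Note that nothing in DNS stops a CNAME from crossing domain boundaries, so in principle www.yourdomain.com really could be an alias for a host in mysite.com; only resolving it would tell.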
Also, since the crawler often encounters non-existent domains/servers (those
that people use as examples, like mysite.com here), it wastes a lot of time
trying to look them up before finally giving up when the lookup times out.
Now, it will often find the same address twice (in a row) and try to look it
up twice (in a row). I imagine this is good for situations where the first
failure was a temporary one, but is there any way to tell the crawler not to
re-look them up so often? In other words, something like: "if you fail to
look up server www.X.com, don't look it up again for the next N minutes even
if you encounter it".
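The behaviour I have in mind is essentially a negative-lookup cache with a TTL. A minimal sketch (the `NegativeCache` name, its API, and the 600-second default are all my invention, just to show the idea):

```python
import time

class NegativeCache:
    """Remember hostnames whose DNS lookup failed, and suppress
    retries for ttl_seconds after each failure."""

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock    # injectable clock, handy for testing
        self.failed = {}      # host -> time of last failed lookup

    def record_failure(self, host):
        self.failed[host] = self.clock()

    def should_skip(self, host):
        """True if host failed recently enough that we should not retry."""
        when = self.failed.get(host)
        if when is None:
            return False
        if self.clock() - when < self.ttl:
            return True
        del self.failed[host]  # entry expired; allow one retry
        return False
```

The crawler would call `should_skip()` before each DNS lookup and `record_failure()` after each timeout, so repeated references to a dead host cost nothing until the TTL expires.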
This, I think, would help the crawler move on to the working servers in its
todo list and postpone requests to those that may be temporarily down.
Otherwise it wastes a lot of time, and if the server's problem persists it
won't get any documents from it anyway. Wouldn't its time be better spent on
the other, working servers in the todo list?
Does this even make sense?
Thank you,
Otis