excluding sub-directories


Post by Thunderstone »



Each of our directories contains a subdirectory named 'rev' for
revision control. We do NOT want these directories indexed. Is there a
way to have Webinator NOT look in them when an index.html
is not present?

Example: my.site.com/foo/ does not have an index.html file, so
it will produce an "Index Of" page listing all files and directories.
Can we stop it from looking in the foo/rev/ and foo/bar/rev/ directories
WITHOUT specifying the full paths as -xhttp://my.site.com/foo/rev/ in
a file for use with -m?

In theory, what I'd like to do is say:
gw -xrev/ (I know this doesn't work; I've tried it.)

without having to say:
gw -s "delete from html where Url like '/rev/'"
or listing all full URLs.

Temporarily, my workaround is -mrev.set,
where rev.set was created using find . -name rev.

Obviously, this is not a very good workaround, since it requires
recreating the rev.set file each time.
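
For what it's worth, a sketch of automating that step (purely
illustrative; it assumes the document root is /var/www and that it
maps to http://my.site.com/):

# Rebuild rev.set before each walk: find every rev directory under
# the document root and turn each into a -x exclusion URL for gw.
# /var/www and my.site.com are stand-ins for the real root and host.
find /var/www -type d -name rev \
  | sed 's|^/var/www|-xhttp://my.site.com|; s|$|/|' > rev.set
gw -mrev.set http://my.site.com/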


---------

Also, a slightly unrelated question:

While evaluating your software, I have found that a few documents have
broken external links. Namely, they link to "http://www.domain.com/",
and there is no such domain name. Why does gw try resolving this?
I'm using all three possible ways of excluding this link:
-jmy.site.com
-xhttp://www.domain.com/
-domain=my.site.com

None of those work. This DRASTICALLY slows down indexing, since
it takes about 65 seconds for the DNS lookup to time out (yes, I'm using
-w0). This link is included at least 5 times!

My biggest question here is: why is it trying to read www.domain.com
at all? With all three of those options set, it SHOULD (as I understand
it) completely disregard that link.


Thanks in advance,
Tim Rosine



Post by Thunderstone »




You need to list them all in the options or in robots.txt.
I'm not sure why you would link them into the live site in the first
place, though.
You could also put access protection on those directories that denies
the walking server itself (the machine the walk is done from).
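
For example, a robots.txt at the top of the site along these lines
(using the two directories from your example; each one still has to
be listed) would keep the walker out:

User-agent: *
Disallow: /foo/rev/
Disallow: /foo/bar/rev/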


The -jmy.site.com option should be: -jhttp://my.site.com


gw looks up all hostnames to resolve host aliasing and prevent duplicates
from a single machine with multiple names.
You can prevent these lookups by using the -L option.
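
For example, something like this (a sketch using the host and the bad
link from your message; -L just disables the hostname lookups):

gw -L -jhttp://my.site.com -xhttp://www.domain.com/ http://my.site.com/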

