exclude recursive pages - how?

Post by **Thunderstone** » Wed Apr 12, 2000 4:58 pm

Hello again -

We're having severe problems with pages with dynamic content (*.shtml
pages and *.html pages). Somewhere there appears to be an incorrect link,
of the form:

http://ourserver.edu/dynamicpage.shtml/

Note the trailing slash. Since this is a dynamically generated page, the
web server treats this as valid path information and serves out the
page. However, webinator (and some browsers) gets fouled up
by this. If there are links on the page that are relative, i.e.

<a href="txt/index.html">

the URL

http://ourserver.edu/dynamicpage.shtml/txt/index.html

gets put into the todo list. When webinator hits that link, the
relative link then becomes

<a href="txt/txt/index.html">

This continues ad infinitum, until webinator's built in depth
limit gets hit. We're spidering an infinite loop.

I'd like to exclude URLs of the form:

http://ourserver.edu/.../*.html/ or http://ourserver.edu/.../*.shtml/

it's not clear to me how to do this without excluding the pages

http://ourserver.edu/.../*.html and http://ourserver.edu/.../*.shtml

themselves. Since we can't control how people make (or mismake) their
links, this becomes a real problem.

There's got to be some way to use the -x option with some regexp, but
I'm not coming up with it. I'm willing to exclude any valid directories
that are called *.html or *.shtml.

Suggestions?

Thanks,
Susan

Susan Alderman Susan_Alderman AT brown.edu
Box 1885 vox: 401-863-9466
CIS, Brown University fax: 401-863-7329
Providence, RI 02912

Post by **Thunderstone** » Wed Apr 12, 2000 5:30 pm

If your gw version (gw -version) is newer than 2.53 19990226 you can
use the -x/ option (http://www.thunderstone.com/gw25man/node58.html).

-x"/\.=s?html/=>>="