(How) can Webinator handle authentication data in the URL?

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Post by Thunderstone »



Hi, folks.

Is there a way to get gw to use a URL that looks like this:
https://www.foo.com/cgi-bin/connect/aut ... ument.html

where home2.html is the actual page to be indexed?

Almost all the pages on a secure site I maintain have their links in the
form:

https://www.foo.com/cgi-bin/connect/aut ... ument.html

which is changed on the fly (by auth.pl) to:

https://www.foo.com/cgi-bin/connect/aut ... ument.html

and delivered to the browser. The browser user then clicks on this URL to
get the next page. It's an old (sooo old) way to carry the authentication
around; in fact it'll be obsoleted by the new site in a few months.

So far, I haven't managed to convince gw that any of these links are places
to go, so it only indexes the login page and a couple of support pages. I
do have a filter (a pretty slow sed script, for this purpose) that I could
put in front and tell gw to pipe everything through (as a plugin), but I'm
not sure that would help.

Hmm. I just realized that I set it up to use "http://www.foo.com" rather
than https://www.foo.com.

I tried running gw with different options, and geturl on a bunch of
different approaches to this problem, without success. I tried this:
gw -y -C -v10 -n"text/html,html,./filter"
as my best (most creative) example so far.

Is there a way to do this?
GEB

-----------------------------------------
The axiomatic basis of political science:
1. Something must be done.
2. This is something.
3. Therefore, we must do it.

- adapted from http://www.javaworld.com/javaworld/jw-0 ... undon.html





Post by Thunderstone »



Gary E. Bickford said:



The basic issue here is that Webinator does not handle secure pages,
i.e. https: URLs. The rest of the URL would work if it were a normal
http: URL.

A plugin will not help, as its purpose is to extract the text from the
page to be indexed, and the result is not reinterpreted as HTML.

Using Webinator (or any other web indexer) to index pages secured in the
manner you describe above is not a good idea, as all the URLs shown
in the results will have the name and code visible.

To do this you will need to make the pages available via a normal http
server.



Post by Thunderstone »



Hmm.

Our entire site is secure, as it's for clients only. My intent in the long
run is to use the index as a source for our server side processes, rather
than to use the Webinator client frontend - this may require the commercial
product, but that depends on a successful demonstration to my group.

We're moving in a month or so to a new server (stronghold), which will
still be https, but won't have the auth.pl BS - the links will be std.
links. All pages will still be passed through a server side process, but
once the login is accomplished it is transparent to the user. Of course, I
can provide a hook to get Webinator in the front door.

But Webinator would still have to be able to traverse the site - what about
using file://etc./etc./ somehow, rather than http:// - this would bypass
the server entirely, if it can be done. Then the links will be file://
links, but right now I'm more interested in finding broken links and errors
than anything else.
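
If the immediate goal is only to find broken links, one way to bypass the
server entirely is to walk the document tree on disk and check that each
relative link points at a file that exists. The following is a minimal
Python sketch of that idea, not a Webinator or gw feature. The function and
class names are hypothetical, and it assumes plain HTML files with ordinary
relative or root-relative href values; links that go through a CGI wrapper
such as auth.pl would need to be unwrapped first.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlsplit, unquote


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def broken_local_links(doc_root):
    """Return (page, href) pairs whose target file is missing under doc_root."""
    broken = []
    for dirpath, _dirs, files in os.walk(doc_root):
        for fname in files:
            if not fname.lower().endswith((".html", ".htm")):
                continue
            page = os.path.join(dirpath, fname)
            collector = LinkCollector()
            with open(page, encoding="utf-8", errors="replace") as fh:
                collector.feed(fh.read())
            for href in collector.links:
                parts = urlsplit(href)
                # Skip absolute URLs (http:, mailto:, ...) and pure #fragment links.
                if parts.scheme or parts.netloc or not parts.path:
                    continue
                path = unquote(parts.path)
                if path.startswith("/"):
                    # Root-relative link: resolve against the document root.
                    target = os.path.join(doc_root, path.lstrip("/"))
                else:
                    # Relative link: resolve against the page's own directory.
                    target = os.path.join(dirpath, path)
                if not os.path.exists(target):
                    broken.append((page, href))
    return broken
```

Calling broken_local_links() on the document root returns a list of
(page, href) pairs that can be printed or logged for review.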

Hmm. I suppose I could set up an Apache server symlinked to the same
document tree. Ick. Of course, then all the links will still be
incorrect, pointing through the apache server port. But I could redirect
that port at the server after running Webinator. Ick.

A word of encouragement - I've been promoting Metamorph to lotsa folks, as
I still think it's a great tool. Webinator looks like a good way to
publicize the capabilities of Metamorph.

At 1:09P -0500 8/15/97, John Turnbull wrote:







Post by Thunderstone »



Gary E. Bickford wrote:


About the links being incorrect: I suppose you could have the Texis script fix them
up for you with the "sandr" function. I have been contemplating this because I
have two web servers, one on which Webinator runs (a slowpoke) and another which
is a much, much faster machine. I have considered tweaking the $ret string that I
get back so that the URL reference points to the faster server. For want of time I
have not done so yet.

This may be well suited to your needs if all you are doing is a replacement for
the server name.
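
To make the idea concrete, here is a minimal sketch of the server-name
replacement written in Python rather than Texis, so it does not show the
actual "sandr" syntax. The function name and the example.com host names are
hypothetical; the point is only that each result URL keeps its path and
query while the scheme and host are swapped.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_host(url, old_netloc, new_scheme, new_netloc):
    """Return url with scheme and host swapped if it points at old_netloc."""
    parts = urlsplit(url)
    if parts.netloc != old_netloc:
        # Leave URLs for any other server untouched.
        return url
    return urlunsplit(
        (new_scheme, new_netloc, parts.path, parts.query, parts.fragment)
    )


# Hypothetical result URLs as indexed through the plain-http server.
results = [
    "http://slow.example.com:8080/docs/home2.html",
    "http://other.example.org/page.html",
]
fixed = [
    rewrite_host(u, "slow.example.com:8080", "https", "fast.example.com")
    for u in results
]
```

Comparing the parsed host instead of doing a blind string replace avoids
touching URLs that merely contain the old server name somewhere in their
path or query string.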





