I have searched the threads on how to only return URLs from a search that the requester can access. Since we have data on our site that we do not want to expose to the outside world, this is of interest to me.
We have a hybrid environment that uses IIS, Apache, and Squid/Apache/Zope. All of these servers contain both public and private information.
Our current method was to use two profiles, one for internal crawls and one for external crawls. I gave the external profile an obscure name, but apparently not obscure enough, as it is being accessed by users outside of our network.
I have tried in vain to configure the search settings to return only the URLs the requester is allowed to see.
With our hybrid environment, is that possible? If that will not work, is there a way to set access permissions to individual profiles?
It should be possible, by setting Results Authorization Method (under Search Settings) to a value other than None. Typically it is set to Basic/NTLM/file, which supports URLs authenticated with HTTP Basic, NTLM, and Windows file:// methods.
What authentication method(s) does your environment use? What settings have you tried, with what results?
Thanks. I tried that and it did not work. Then I thought, "Hey, maybe this doesn't work with a Meta-search profile," so I changed it on the collection I was actually crawling.
What I got back was a login box. Is there a way to return only the URLs that the user has permission to see, without presenting a login box?
It appears that this may be the way to go, but I'm not quite sure what is being asked for here:
"Unauthorized Result Query
If a query is given in this box, then search results that return 200 Ok when fetched with the search user's credentials (when Authorization Method is active) are also checked with this query: if it matches, the user is not permitted to see the result. This is for sites that return 200 Ok with a "Bad login"-type message instead of 401 Unauthorized, for bad/unauthorized logins."
Correct; Results Authorization does not currently work with Meta Search.
The Unauthorized Result Query is not normally needed and can be left blank. It is only needed if the origin server returns a text message of "Unauthorized" (or the like) for bad logins, but does not give an actual HTTP 401 response for them. Since the Appliance normally goes by the HTTP result code, it would not know that the login failed if HTTP 200 Ok is returned with such a message.
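To make that decision rule concrete, here is a rough sketch in Python. It is not the Appliance's actual code: the function name is hypothetical, and it treats the Unauthorized Result Query as a simple case-insensitive substring match, which is a simplification of however the Appliance really evaluates the query.

```python
def is_result_visible(status_code, body, unauthorized_query=""):
    """Hypothetical helper: decide whether a search user may see a result.

    status_code and body are what the origin server returned when the
    result was fetched with the search user's credentials.
    """
    # The HTTP status is the primary signal: anything but 200 hides the result.
    if status_code != 200:
        return False
    # With no Unauthorized Result Query set, a 200 OK is trusted as-is.
    if not unauthorized_query:
        return True
    # Assumption: substring match stands in for the real query match.
    # A match means the page is a "Bad login"-type message served with 200.
    return unauthorized_query.lower() not in body.lower()
```

For example, a server that answers a failed login with 200 Ok and a "Bad login" page would slip through when the query is blank, but is hidden once the query is set to something like "bad login".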
1. If you are proxying or submitting to the Appliance and using Basic/NTLM/form auth, yes. An Appliance update is coming out shortly (in the next few days) that supports this, i.e. you can submit the rauser and rapass vars along with query etc. for a search, and it will authenticate on that submit instead of presenting the login form. (If you're using Forward Login cookies auth, you don't need the update; just make sure the cookies are domain-wide so the browser will submit them to the Appliance.) Single-sign-on NTLM auth (i.e. forwarding the user's credentials from their workstation Windows login without a separate Appliance login) is not yet supported; we are working on it.
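As an illustration of submitting credentials with the search, a proxying script might build the request along these lines. This is only a sketch: rauser and rapass are the variable names given above, but the base URL, the function name, and the assumption that the query variable is literally named "query" are placeholders to be checked against your own search form.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, user, password):
    """Hypothetical helper: build a search URL that carries credentials."""
    params = {
        "query": query,      # assumed name of the search query variable
        "rauser": user,      # Results Authorization user
        "rapass": password,  # Results Authorization password
    }
    # urlencode escapes special characters and encodes spaces as '+'.
    return base_url + "?" + urlencode(params)


# Example with a placeholder host and path:
url = build_search_url(
    "https://appliance.example.com/search", "budget report", "alice", "s3cret"
)
```

Credentials in a URL tend to end up in server and proxy logs, so sending the same variables in a POST over HTTPS is generally the safer choice.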