Exclude all HTTPS pages from walk

yuko0
Posts: 8
Joined: Wed Mar 21, 2007 8:16 pm

Post by yuko0 »

I tried various ways to exclude HTTPS links from being walked by the Search Appliance, but I can't make it work.

I unchecked "HTTPS" under "protocol".

I also tried adding https://www.myurl.com/ to "exclusion".

Because the site is dynamic, the HTTP and HTTPS pages have the same content, but the Search Appliance tends to index both, which is a problem for the site search.

I would appreciate any direction on this.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Unless you're putting https in the base url or have the https option checked it shouldn't walk https. Double check all of your walk settings and post all of the non-default ones here so someone can maybe spot the problem.
yuko0
Posts: 8
Joined: Wed Mar 21, 2007 8:16 pm

Post by yuko0 »

Thank you for your quick reply.
I found an even weirder thing:
when I uncheck "HTTPS" under protocol, the walk fails.

Extensions allowed: .html .htm .txt .pdf .doc .xls .swf .php

I think this is the only setting I changed from its default value.

What might be the problem?
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

A walk failing usually means that it didn't get any content at all, meaning the baseURLs provided failed.

Make sure the baseURL(s) you're walking don't have a robots.txt that prevents them from being indexed.

Barring that, are your baseURLs public pages that we'd be able to see?
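
For the robots.txt check, here is a minimal sketch using Python's standard urllib.robotparser. It is only an illustration of how robots.txt rules block a crawler, not how the appliance itself evaluates them. The rules, the example.com URLs, and the "ExampleCrawler" user agent are all made up for the example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines, so no
    # network access is needed for this check.
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only for illustration.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

for url in ("http://www.example.com/index.html",
            "http://www.example.com/private/report.html"):
    allowed = is_allowed(sample_rules, "ExampleCrawler", url)
    print(url, "allowed" if allowed else "blocked")
```

To test a live site instead, RobotFileParser can fetch the file itself via set_url() followed by read().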
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Check the walk status to see why it fails.
yuko0
Posts: 8
Joined: Wed Mar 21, 2007 8:16 pm

Post by yuko0 »

I see. I checked, and the baseURL is the correct site: docmagic.com

I just saw the option to check "title" for duplication.
This may work as our solution, because all I wanted to do was remove any duplicate content. (The Search Appliance may be able to filter better with "title" checked.) If this doesn't work, I will come back to figure out what this HTTP/HTTPS issue is.

Thank you so much for your input and care.
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

It looks like there's a <script> on docmagic.com that references an https: script, and that is causing the walk to fail with "disallowed protocol", even though the disallowed protocol is on the <script> instead of the page itself.

It looks like if you check the "HTTPS" box in the protocols section, and add "https://" to "Exclusion Prefix", then it will do what you want - HTTPS pages will not be included in the walk, and pages referencing https scripts will be ok.

You probably also want to add .jsp to your extensions, as that's what all the pages are.
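
As a rough sketch of how those two settings interact, the Python below filters candidate URLs by exclusion prefix and by extension. This is an illustration under assumptions, not the appliance's actual logic: the function name should_walk, the index.jsp paths, and the rule that extensionless URLs are always allowed are all hypothetical.

```python
from posixpath import splitext
from urllib.parse import urlparse


def should_walk(url, exclusion_prefixes, allowed_extensions):
    """Decide whether a crawler following these rules would walk url."""
    # Exclusion prefixes are matched against the start of the full URL.
    if any(url.startswith(prefix) for prefix in exclusion_prefixes):
        return False
    extension = splitext(urlparse(url).path)[1].lower()
    # Assumption: URLs with no extension (such as "/") are always allowed.
    return extension == "" or extension in allowed_extensions


exclusions = ["https://"]
extensions = {".html", ".htm", ".txt", ".pdf", ".doc", ".xls", ".swf",
              ".php", ".jsp"}

for url in ("http://www.docmagic.com/index.jsp",
            "https://www.docmagic.com/index.jsp"):
    print(url, should_walk(url, exclusions, extensions))
```

With "https://" in the exclusion list, the HTTPS copy of a page is skipped while the HTTP copy is kept. Without .jsp in the extension set, the HTTP copy would be skipped as well.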
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Another option would be to turn off JavaScript processing if it's not required to navigate the site.
yuko0
Posts: 8
Joined: Wed Mar 21, 2007 8:16 pm

Post by yuko0 »

Thank you so much for looking into the details. I appreciate the kind support from both of you.

It's also nice to know exactly why the HTTP walk was failing. Thank you for solving the puzzle!