explain results..

jgdoke
Posts: 167
Joined: Wed Jul 14, 2004 10:52 am

Post by jgdoke »

Searching for something that should appear on only one page, I get two results. The only difference is the depth. Why?

1: Allen-Bradley Drives Home page
A long-time leader in providing reliable, power control solutions, Rockwell Automation® offers an industry-leading family of drives. Designed for powerful performance...
http://www.ab.com/drives/ 75%
Size: 22K
Depth: 2
Find Similar
Match Info
Show Parents
2: Allen-Bradley Drives Home page
Communications A long-time leader in providing reliable, power control solutions, Rockwell Automation® offers an industry-leading family of drives. Designed for powerful...
http://www.ab.com/drives 75%
Size: 22K
Depth: 5
Find Similar
Match Info
Show Parents
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

The crawler can't know that the URL http://www.ab.com/drives, linked to from one of your pages, is really supposed to be http://www.ab.com/drives/, which another of your pages links to. Normally duplicate removal (on by default) would prevent the double entry of such duplicate pages with different URLs, unless the page content contains something that changes on every fetch, such as a timestamp. The "Ignore tags" option can be used to strip out such variations so that the pages compare the same and only one copy gets stored. Click Match Info on each result to see the text that was stored and what the difference is.
jgdoke
Posts: 167
Joined: Wed Jul 14, 2004 10:52 am

Post by jgdoke »

http://www.ab.com/drives
http://www.ab.com/drives/
http://www.ab.com/drives/index.html

These are all the same URL. Your box does not recognize this?

Also, duplicate removal was finding thousands of duplicate pages that were not duplicates, so I turned it off. Any hints on making it work correctly?

To help with your understanding: our site has SSI on every page, so the page dates are meaningless.

Does your reference to the "Ignore tags" option include the page date?

Thanks for your help.
John
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

The "Index Name" setting tells it that "/index.html" is the same as "/". According to spec, directory URLs are supposed to be specified with a trailing slash; without the slash, the URL is assumed to refer to a file. The webserver tries to deliver it, discovers it's a directory, and issues a redirect to the proper URL with the trailing slash. The appliance uses the originally specified URL rather than the redirected URL for that page.
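
To illustrate why the three URLs do not all collapse into one, here is a minimal Python sketch of index-name canonicalization. It is not the appliance's code: `canonicalize` and `INDEX_NAMES` are hypothetical names, with `INDEX_NAMES` standing in for the "Index Name" setting. The point is that "/index.html" can be folded into "/" from the URL text alone, while "/drives" cannot be known to be a directory without asking the webserver.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical stand-in for the "Index Name" setting.
INDEX_NAMES = {"index.html"}

def canonicalize(url, index_names=INDEX_NAMES):
    """Fold a trailing index file name into its directory URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Split the path into everything before the last "/" and the last segment.
    head, _, last = path.rpartition("/")
    if last in index_names:
        path = head + "/"
    # Host names are case-insensitive; the fragment is dropped.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

for url in ("http://www.ab.com/drives",
            "http://www.ab.com/drives/",
            "http://www.ab.com/drives/index.html"):
    print(url, "->", canonicalize(url))
```

In this sketch the last two URLs map to the same string, but the first stays as "http://www.ab.com/drives", since nothing in the URL itself says that "drives" is a directory rather than a file.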

What text was found on the pages that it said were duplicates? Can you supply a couple of sample URLs of false duplicates?

I was referring to dynamic content on the page, not modified-date headers or the like. If the text of the page changes every time it's fetched, duplicate removal won't work. The "Ignore tags" option lets you specify parts of the HTML page to ignore when extracting text.
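
As a rough illustration of the idea (not how the appliance implements it), the Python sketch below fingerprints the extracted text of a page and shows how skipping a tag makes two otherwise identical fetches compare equal. The names `TextExtractor` and `text_fingerprint`, the use of a `<time>` element for the changing part, and the sample pages are all assumptions made for the example.

```python
import hashlib
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect page text, skipping anything inside the ignored tags."""

    def __init__(self, ignore_tags=()):
        super().__init__()
        self.ignore_tags = set(ignore_tags)
        self.skip_depth = 0   # > 0 while inside an ignored element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.ignore_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignore_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def text_fingerprint(html, ignore_tags=()):
    """Hash the whitespace-normalized text of an HTML page."""
    parser = TextExtractor(ignore_tags)
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Two fetches of the same hypothetical page; only the embedded time differs.
page_a = "<html><body><p>Drives home page</p><time>10:52:01</time></body></html>"
page_b = "<html><body><p>Drives home page</p><time>10:52:07</time></body></html>"

print(text_fingerprint(page_a) == text_fingerprint(page_b))
print(text_fingerprint(page_a, ("time",)) == text_fingerprint(page_b, ("time",)))
```

With nothing ignored the fingerprints differ, so the two fetches look like different pages. With the changing element ignored they match, which is the situation where duplicate removal can do its job.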