Searching for a result that should appear on only one page, I get two results. The only difference is the depth. Why?
1: Allen-Bradley Drives Home page
A long-time leader in providing reliable, power control solutions, Rockwell Automation® offers an industry-leading family of drives. Designed for powerful performance... http://www.ab.com/drives/ 75%
Size: 22K
Depth: 2
2: Allen-Bradley Drives Home page
Communications A long-time leader in providing reliable, power control solutions, Rockwell Automation® offers an industry-leading family of drives. Designed for powerful... http://www.ab.com/drives 75%
Size: 22K
Depth: 5
The crawler can't know that the URL http://www.ab.com/drives, linked to from one of your pages, is really supposed to be http://www.ab.com/drives/, which is also referred to by one of your pages. Normally, duplicate removal (on by default) would prevent such duplicate pages with different URLs from being stored twice, unless the page content contains something that varies between fetches, such as a timestamp. The "Ignore tags" option can be used to strip out such variations so the pages compare as identical and only one copy gets stored. Click Match Info on each result to see the text that was stored and spot the difference.
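To illustrate the idea (this is a rough sketch, not the appliance's actual algorithm): duplicate removal compares the text extracted from each page, so a per-fetch timestamp defeats it, while stripping the varying region first makes the two copies compare equal. The `<ts>` marker and the `ignore_tags` parameter below are hypothetical stand-ins for whatever tags you configure.

```python
import hashlib
import re

def page_fingerprint(html, ignore_tags=()):
    """Hash the text of a page after stripping regions inside ignored tags.
    A sketch of duplicate detection, assuming <tag>...</tag> marker syntax."""
    text = html
    for tag in ignore_tags:
        # drop everything between <tag> and </tag>, including the tags
        text = re.sub(r"<%s>.*?</%s>" % (tag, tag), "", text, flags=re.S)
    # crude text extraction: remove remaining markup, collapse whitespace
    text = re.sub(r"<[^>]+>", " ", text)
    text = " ".join(text.split())
    return hashlib.sha1(text.encode()).hexdigest()

# two fetches of the "same" page, differing only by an embedded timestamp
a = "<html><body>Drives page <ts>2024-01-01</ts></body></html>"
b = "<html><body>Drives page <ts>2024-06-30</ts></body></html>"
print(page_fingerprint(a) == page_fingerprint(b))                   # False: timestamps differ
print(page_fingerprint(a, ["ts"]) == page_fingerprint(b, ["ts"]))   # True: variation ignored
```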
The "Index Name" setting tells it that "/index.html" is the same as "/". Directory URLs are supposed to be specified with a trailing slash according to the URL spec; without the slash, the path is assumed to be a file. The webserver tries to deliver it, discovers it's a directory, and issues a redirect to the proper URL with the trailing slash. The appliance uses the originally specified URL, rather than the redirected one, for that page.
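A minimal sketch of that normalization: if a path is known to be a directory but lacks the trailing slash, append one, as the server's redirect would. The `known_directories` lookup is hypothetical; a real crawler would learn this from the 301 Location header instead.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_directory_url(url, known_directories):
    """Append the trailing slash to a directory path that lacks one,
    mimicking the redirect the webserver would issue.
    known_directories: assumed set of paths known to be directories."""
    parts = urlsplit(url)
    path = parts.path
    if not path.endswith("/") and path in known_directories:
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))

print(normalize_directory_url("http://www.ab.com/drives", {"/drives"}))
# http://www.ab.com/drives/
```

With both link forms mapped to the canonical slash-terminated URL, the crawler would see one page instead of two.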
What text was found on the page that it reported as duplicated? Can you supply a couple of sample URLs of false duplicates?
I was referring to dynamic content on the page, not modified-date headers or the like. If the text of the page changes every time it's fetched, duplicate removal won't work. The "Ignore tags" option lets you specify parts of the HTML page to ignore when extracting text.