Duplicates?

Post by **mmcfadden** » Fri Feb 03, 2006 5:02 pm

My walk is picking up several duplicate documents and including them in the index. For example:
http://www.jan.wvu.edu/soar
http://www.jan.wvu.edu/soar/
These two URLs return the same content. Why doesn't Remove Duplicates stop this from occurring?

Also, the two documents below are identical; they are just in different formats. I have other examples like this where three copies of a document have been indexed (a .pdf, a .doc and an .html) that are exact duplicates. It would be nice if there were a mechanism to remove these duplicates automatically, with an additional option that lets me prioritize the file format I would like to keep. For example, my preferred format is .html, then .txt, then .pdf, then .doc. That way Webinator would keep the best available format and remove the other copies.
http://www.usdoj.gov/crt/ada/briefs/hoyreplbr.doc
http://www.usdoj.gov/crt/ada/briefs/hoyreplbr.pdf
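
To make the request concrete, here is a rough sketch of the format preference as a post-processing step outside Webinator. It is not a Webinator feature or setting: the `PRIORITY` list, the `rank` and `pick_preferred` helpers, and the idea of grouping by host and path stem are all assumptions made for illustration. It also assumes that files sharing a name are duplicates and never compares their content.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical preference order from the post: lower index is preferred.
PRIORITY = [".html", ".txt", ".pdf", ".doc"]

def rank(url):
    """Return the preference rank of a URL's extension (unknown formats last)."""
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    return PRIORITY.index(ext) if ext in PRIORITY else len(PRIORITY)

def pick_preferred(urls):
    """Group URLs that differ only by extension and keep the best-ranked one."""
    best = {}
    for url in urls:
        parts = urlsplit(url)
        stem = posixpath.splitext(parts.path)[0]
        key = (parts.netloc.lower(), stem)  # assumption: same host + stem = same document
        if key not in best or rank(url) < rank(best[key]):
            best[key] = url
    return list(best.values())

urls = [
    "http://www.usdoj.gov/crt/ada/briefs/hoyreplbr.doc",
    "http://www.usdoj.gov/crt/ada/briefs/hoyreplbr.pdf",
]
print(pick_preferred(urls))
# ['http://www.usdoj.gov/crt/ada/briefs/hoyreplbr.pdf']
```

With the two URLs above, only the .pdf survives because it ranks ahead of .doc in the list.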

Post by **mark** » Fri Feb 03, 2006 5:37 pm

I don't know about the first item, unless the page changed between the two times it was fetched.

I think you'll find the extracted text from html, pdf, and doc files varies somewhat. In that case the pages will compare as different.
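
A small sketch of why that happens, assuming duplicates are detected by comparing a hash of the extracted text (how Webinator actually compares pages is not described in this thread). The `fingerprint` helper and the sample strings are hypothetical.

```python
import hashlib

def fingerprint(text, normalize=False):
    """Hash extracted text; optionally collapse whitespace and case first."""
    if normalize:
        text = " ".join(text.split()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical extractions of the same document from two formats.
from_pdf = "Section 1\nSample text\n\nPage 1"  # hard line breaks plus a page footer
from_doc = "Section 1 Sample text"

# Raw text differs, so the hashes differ.
print(fingerprint(from_pdf) == fingerprint(from_doc))              # False
# Normalizing whitespace is not enough: the page footer is still there.
print(fingerprint(from_pdf, True) == fingerprint(from_doc, True))  # False
```

Even small extraction differences such as line breaks or page footers are enough to make an exact comparison fail.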