Duplicate PDF documents

vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Duplicate PDF documents

Post by vallinem »

We have some scanned PDF documents that are showing up in the walk results as duplicates - e.g.,

The link: http://countynet/procure/purchguide/boardapprmsa.pdf
Referenced by : http://countynet/procure/purchguide/
Is a duplicate of: http://countynet/misc/boscalendar.pdf

(these are on our Intranet, so you won't be able to reach the site.)

I realize that scanned PDFs don't have any body text that can be indexed, but we gave these PDFs unique titles, subjects, and descriptions and then reindexed, and they still showed up as duplicates. Is there anything else we can try?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Duplicate PDF documents

Post by mark »

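The unique hash is based only on the body text, so scanned PDFs with no extractable body text will hash the same regardless of their titles or metadata. You can change the dowalk script to hash additional fields such as the title. Change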
<hash $page>
to
<sum "%s " $title $page><hash $ret>
or (to hash meta info too)
<sum "%s " $title $mkeywords $mdescription $mother $page><hash $ret>
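
To illustrate why this change helps, here is a minimal Python sketch of the idea. It is not Webinator's hashing code: SHA-256 stands in for whatever `<hash>` computes, the function names are made up for this example, and the titles are hypothetical placeholders. It only shows that two documents with empty body text collide when the body alone is hashed, and stop colliding once the title is included.

```python
import hashlib


def body_only_key(page: str) -> str:
    """Duplicate key computed from the body text alone."""
    return hashlib.sha256(page.encode("utf-8")).hexdigest()


def combined_key(title: str, page: str, *meta: str) -> str:
    """Duplicate key computed from title, optional meta fields, then body.

    Each field is followed by a single space, mirroring the "%s " format
    used in the <sum> call above.
    """
    joined = "".join(f"{field} " for field in (title, *meta, page))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Two scanned PDFs: no extractable body text, different (hypothetical) titles.
doc_a = {"title": "Board Approval MSA", "page": ""}
doc_b = {"title": "BOS Calendar", "page": ""}

same_by_body = body_only_key(doc_a["page"]) == body_only_key(doc_b["page"])
same_by_both = (
    combined_key(doc_a["title"], doc_a["page"])
    == combined_key(doc_b["title"], doc_b["page"])
)

print("duplicate when hashing body only:", same_by_body)
print("duplicate when hashing title + body:", same_by_both)
```

The trade-off is that two copies of the same text with different titles will no longer be treated as duplicates.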

BTW, with the right scanning/OCR software, scanned PDF files may contain a picture of the page as well as OCR'd text from the page.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Duplicate PDF documents

Post by mark »

Another option is to simply turn off the "Prevent Duplicates" option. If you do, you'll probably want to turn on "Ignore Case" if you're walking web servers hosted on Microsoft operating systems, since webmasters there tend to use inconsistent capitalization in URLs.
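
As a rough illustration of why case matters once duplicate detection is off, here is a small Python sketch. It assumes that "Ignore Case" amounts to comparing URLs case-insensitively; that is an assumption about the option's effect, not a description of how Webinator implements it, and `dedupe_urls` is a hypothetical helper.

```python
def dedupe_urls(urls, ignore_case=True):
    """Keep the first occurrence of each URL, optionally ignoring case."""
    seen = set()
    kept = []
    for url in urls:
        # With ignore_case, URLs differing only in capitalization share a key.
        key = url.lower() if ignore_case else url
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


links = [
    "http://countynet/procure/purchguide/",
    "http://countynet/Procure/PurchGuide/",
]

print(dedupe_urls(links, ignore_case=True))
print(dedupe_urls(links, ignore_case=False))
```

On a case-insensitive server both links return the same page, so without case folding the page would be walked and indexed twice.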
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Duplicate PDF documents

Post by mark »

Look up the PDF(s) in List/Edit URLs to see what was extracted.

Webinator 4 is pretty ancient. You might consider/need an update.