Duplicate PDF documents

vallinem
Posts: 37
Joined: Wed Oct 03, 2001 6:32 pm

Post by vallinem »

We have some scanned PDF documents that are showing up in the walk results as duplicates - e.g.,

The link: http://countynet/procure/purchguide/boardapprmsa.pdf
Referenced by : http://countynet/procure/purchguide/
Is a duplicate of: http://countynet/misc/boscalendar.pdf

(these are on our Intranet, so you won't be able to reach the site.)

I realize that scanned PDFs don't have any body text that can be indexed, but we gave these PDFs unique titles, subjects, and descriptions and then reindexed, and they still showed up as duplicates. Is there anything else we can try?
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

The duplicate check uses a hash based only on the body text, so scanned PDFs with no extractable text all produce the same hash. You can change the dowalk script to hash additional fields such as the title. Change
<hash $page>
to
<sum "%s " $title $page><hash $ret>
or (to hash meta info too)
<sum "%s " $title $mkeywords $mdescription $mother $page><hash $ret>
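
To illustrate why the change above helps, here is a minimal Python sketch, not the dowalk script itself. It assumes SHA-256 as a stand-in for whatever hash Webinator uses internally, and the document titles are hypothetical. Like the `<sum "%s " ...>` call, it joins each field with a trailing space before hashing.

```python
import hashlib

def page_hash(body, *extra_fields):
    """Hash the extra fields followed by the body, each suffixed with a space."""
    joined = "".join(f"{field} " for field in (*extra_fields, body))
    # SHA-256 is an assumption; the real hash function is not stated in this thread.
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Two scanned PDFs with no extractable body text (hypothetical titles).
doc_a = {"title": "Board Approved MSA", "body": ""}
doc_b = {"title": "BOS Calendar", "body": ""}

# Body-only hashing: both empty bodies hash identically, so they look like duplicates.
body_only_collide = page_hash(doc_a["body"]) == page_hash(doc_b["body"])

# Title plus body: the differing titles make the hashes differ.
with_title_collide = (
    page_hash(doc_a["body"], doc_a["title"]) == page_hash(doc_b["body"], doc_b["title"])
)

print(body_only_collide)   # True
print(with_title_collide)  # False
```

The same reasoning applies to the meta fields: any field that differs between two documents is enough to separate their hashes.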

BTW, with the right scanning/OCR software, scanned PDF files may contain a picture of the page as well as OCR'd text from the page.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Another option is to simply turn off the "Prevent Duplicates" option. If you do this, you'll probably want to turn on "Ignore Case" if you're walking web servers hosted on Microsoft operating systems, since webmasters there tend to use inconsistent capitalization.
harold
Posts: 35
Joined: Tue Aug 15, 2000 12:52 am

Post by harold »

I have a site with PDF documents that are images with background OCR text. When I first tried walking the site, all the PDFs were identified as duplicates. I turned off Prevent Duplicates in the walk settings, and that indeed prevented everything being a duplicate. However, a search does not find any of the text in the PDFs. I can do a copy and paste of the OCR out of the PDF, so I know it's there. Something else I'm missing?

This is Webinator 4.3-Unix-w/plugin.

THANKS!

Harold
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Look up the PDF(s) in List/Edit URLs to see what was extracted.

Webinator 4 is pretty ancient. You might consider/need an update.
harold
Posts: 35
Joined: Tue Aug 15, 2000 12:52 am

Post by harold »

Thanks! List/Edit URLs shows all 48 pdfs were extracted. It's interesting that this DID work, but I'm not sure when new PDFs were added last. Over the years, Webinator has been moved between servers several times.

What is involved in getting an update to Webinator?

Thanks!

Harold
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

harold
Posts: 35
Joined: Tue Aug 15, 2000 12:52 am

Post by harold »

Thanks! I'm checking on an update. Meanwhile, it appears the pdfs are being fetched, but the OCR text is not being extracted by anytopdf. I was thinking that it MIGHT have to do with pdf size, but these are not terribly large (largest is 9 MB). I tried a smaller one (1.2 MB), and text in that is still not found.

Looking at the last walk report, I see:

Webinator Walk Report for MttMonitor

Creating database /usr/local/morph3/texis/MttMonitor/db2...Done.
Walk started at 2020-08-14 19:08:22 (by user)
Verbosity set to 4
JavaScript walking not enabled by current license
HTTPS walking disabled
Start fetching at http://bh.hallikainen.org/thg/monitor
http://bh.hallikainen.org/thg/monitor
Ignore urls containing any of the following:
/cgi-bin/
started 1 (4754) on http://bh.hallikainen.org/thg/monitor
49 pages fetched (365,283,014 bytes) from http://bh.hallikainen.org/thg/monitor
1 errors

Creating search index on fetched pages...Done.

Walk finished at 2020-08-14 19:08:29 (took 7 seconds)
Making new database live: /usr/local/morph3/texis/MttMonitor/db2
____________________________________________________________________________________

Checking for broken hyperlinks...

The link : http://hallikainen.org/cgi-bin/texis.cg ... or/search/
Referenced by : http://bh.hallikainen.org/thg/monitor
Had this error: Offsite
____________________________________________________________________________________

End of report.

So I see the 49 pages being fetched, but nothing about PDF text extraction. Maybe I messed up some configuration when I moved to a different server.

I'll study the documentation, but any assistance from experts would be great!

Thanks!

Harold
harold
Posts: 35
Joined: Tue Aug 15, 2000 12:52 am

Post by harold »

OK, progress! Doing a walk from the command line, I see this warning:

PDF version 1.5 -- supported version 1.4 (continuing)

I'll try converting the PDFs back to 1.4.
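
A quick way to find which files would trigger that warning is to read the version from each file's `%PDF-x.y` header. The following is a minimal Python sketch; the function names and the sample bytes are hypothetical, and the supported version of 1.4 is taken from the warning text. Note that a PDF can also declare a newer version in its document catalog, which this header-only check does not see.

```python
import re

def pdf_header_version(data: bytes):
    """Return the version from a '%PDF-x.y' header as (major, minor), or None."""
    # The header should appear at the very start; allow some leading junk bytes.
    match = re.search(rb"%PDF-(\d+)\.(\d+)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def needs_downgrade(data: bytes, supported=(1, 4)) -> bool:
    """True if the header version is newer than the supported version."""
    version = pdf_header_version(data)
    return version is not None and version > supported

# Hypothetical file starts; real use would read the first bytes of each PDF.
sample_new = b"%PDF-1.5\n%\xe2\xe3\xcf\xd3\n"
sample_old = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

print(pdf_header_version(sample_new))  # (1, 5)
print(needs_downgrade(sample_new))     # True
print(needs_downgrade(sample_old))     # False
```

For a real file, pass in the result of `open(path, "rb").read(1024)`.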

Thanks!
harold
Posts: 35
Joined: Tue Aug 15, 2000 12:52 am

Post by harold »

Here's a DOS batch script to convert a bunch of PDFs down to version 1.4 using Ghostscript.

REM Convert every PDF in the current directory to PDF version 1.4, writing the results to a subdirectory named v1.4.
REM Run this script from the directory that holds the PDFs to be converted.

if not exist "v1.4" mkdir "v1.4"

REM Loop over *.pdf only, and quote the paths so file names containing spaces work.
for %%i in (*.pdf) do "C:\Program Files\gs\gs9.52\bin\gswin64.exe" -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -sOutputFile="v1.4\%%~nxi" "%%~nxi"
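
For anyone doing the conversion on the Unix side instead, here is a rough Python sketch of the same loop. It reuses the Ghostscript flags from the script above; the executable name `gs`, the function names, and the folder layout are assumptions, so adjust them to your installation.

```python
import subprocess
from pathlib import Path

def gs_command(src: Path, out_dir: Path, gs_exe: str = "gs") -> list:
    """Build the Ghostscript argument list that rewrites src as PDF 1.4 in out_dir."""
    return [
        gs_exe,  # assumed to be on PATH; on Windows use the full path to gswin64.exe
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dNOPAUSE",
        "-dBATCH",
        f"-sOutputFile={out_dir / src.name}",
        str(src),
    ]

def convert_all(folder: Path, gs_exe: str = "gs") -> None:
    """Convert every *.pdf in folder, writing the results to folder/v1.4."""
    out_dir = folder / "v1.4"
    out_dir.mkdir(exist_ok=True)
    for src in sorted(folder.glob("*.pdf")):
        # check=True raises CalledProcessError if Ghostscript exits with an error.
        subprocess.run(gs_command(src, out_dir, gs_exe), check=True)
```

Calling `convert_all(Path("."))` from the directory holding the PDFs mirrors the batch script. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.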