PDF hit highlighting is highlighting a lot more than the search query

Post by **john.santangelo** » Wed Jan 16, 2008 10:52 am

This is not an isolated event: PDF hit highlighting seems to randomly highlight other words, and oftentimes characters that aren't even words.

How can we make PDF hit highlighting more accurate?

Post by **mark** » Wed Jan 16, 2008 12:22 pm

Adobe has a funny way of highlighting words. You have to tell Acrobat the page and word numbers to highlight, even though there are no "word" boundaries in a PDF file. Occasionally, special formatting within the file will throw off the appliance's idea of the word number versus Acrobat's.

If you could supply a few sample files and queries, and say where the highlight is off, we could investigate whether anything can be done to improve the situation.

Post by **john.santangelo** » Wed Jan 16, 2008 2:46 pm

As we are not using Thunderstone in production yet, this is the best example I can show you:

Search query: medicare

Result PDF page:
http://www.floridamedicare.com/Part_B/M ... 106801.pdf

In the first paragraph, these words or phrases are highlighted:

better serve

carrier

with the

and medicare

This is the full URL:

http://www.floridamedicare.com/Part_B/M ... f#xml=<OUR APPLIANCE>/texis/search/pdfhi.txt?query=medicare&pr=Florida&prox=page&rorder=1000&rprox=750&rdfreq=250&rwfreq=250&rlead=500&rdepth=0&sufs=1&order=r&mode=&opts=&cq=3&id=478dda0fa

Post by **john.santangelo** » Wed Jan 16, 2008 3:42 pm

Another example:
query= 2008 deductible

URL of search result:
http://www.floridamedicare.com/Part_B/M ... 478dcfd3af

Page 1 highlighting includes the following text, none of which is in the query:

Coding
Misinformation Regarding
......................................................
Clinical Laboratory
Procedure
............................................................................
Interchange
.......................

(Note: each line above is one run of highlighted text.)

Post by **mark** » Wed Jan 16, 2008 5:00 pm

Using Acrobat Reader 8, those searches in those files highlight correctly for me. However, if I use the "remove common" feature, the highlights are off. "Remove common", "keep tags", and "ignore tags" are not usable if you want accurate PDF highlighting. All of the text must be kept so that the word counts are correct.

Post by **john.santangelo** » Thu Jan 17, 2008 10:24 am

Accuracy seems to be much better on pages 2+, but page one is still off...

It is better with "remove common" off, but I suppose it will remain somewhat inaccurate, because we need to use "keep tags" to remove navigation links from search result abstracts.

Post by **mark** » Thu Jan 17, 2008 10:52 am

The word counting resyncs at each page boundary, so "remove common" should generally only affect the first page.
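
To illustrate why only the first page tends to be off, here is a minimal Python sketch of the idea, not the appliance's actual code. It assumes hits are reported as (page, word index) pairs and that the viewer counts words on the full page text. The function names, the sample text, and the choice of dropping the first three words of page 1 are all hypothetical.

```python
def hit_positions(pages, query):
    """Return (page_number, word_index) for every word equal to the query."""
    return [
        (p, i)
        for p, words in enumerate(pages)
        for i, w in enumerate(words)
        if w.lower() == query.lower()
    ]


def highlighted_words(viewer_pages, positions):
    """Return the words the viewer would highlight for the given positions."""
    return [
        viewer_pages[p][i]
        for p, i in positions
        if i < len(viewer_pages[p])
    ]


# What the viewer sees: the full text of each page (made-up sample text).
viewer_pages = [
    "Florida Medicare Update to better serve medicare providers".split(),
    "the medicare carrier".split(),
]

# Hypothetical indexer view: some text was dropped from the start of page 1
# only, so every word index on that page is shifted by three.
indexer_pages = [viewer_pages[0][3:], viewer_pages[1]]

# Positions are counted on the indexer's text but applied to the viewer's.
positions = hit_positions(indexer_pages, "medicare")
highlighted = highlighted_words(viewer_pages, positions)
print(highlighted)  # page 1 highlights the wrong word, page 2 is correct
```

Because the index restarts on every page, the shift on page 1 does not carry over to page 2. When the indexer keeps all of the text, the same functions return only the query word on both pages.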

"Keep tags" and "ignore tags" are far less likely to cause a problem, because the tags you specify for HTML are highly unlikely to match anything in a PDF document. "Remove common" does not need to be on to use "keep tags" or "ignore tags".