PDF hit highlighting is highlighting a lot more than the search query

john.santangelo
Posts: 32
Joined: Fri Aug 24, 2007 1:54 pm

Post by john.santangelo »

This is not an isolated event: PDF hit highlighting seems to randomly highlight other words, and often characters that aren't even words.

How can we make PDF hit highlighting more accurate?
mark
Site Admin
Posts: 5515
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Adobe has a funny way of highlighting words. You have to tell Acrobat the page and word numbers to highlight, even though there are no "word" boundaries in a PDF file. Occasionally, special formatting within the file will throw off the appliance's idea of a word's number relative to Acrobat's.

If you could supply a few sample files and queries, and say where the highlight is off, we could investigate whether anything can be done to improve the situation.
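
As a rough illustration of how two word counters can disagree, here is a minimal Python sketch. Both tokenizers are hypothetical stand-ins (neither is the appliance's nor Acrobat's actual rule), and the sample line is made up to resemble a table-of-contents entry with dot leaders.

```python
import re


def whitespace_words(text):
    """Hypothetical counter A: every whitespace-separated run is a word."""
    return text.split()


def alnum_words(text):
    """Hypothetical counter B: only runs of letters and digits are words."""
    return re.findall(r"[A-Za-z0-9]+", text)


# Made-up line with dot leaders, which counter A counts and counter B skips.
line = "Coding ........ 3 Clinical Laboratory ........ 5 Interchange"

number_a = whitespace_words(line).index("Interchange")
number_b = alnum_words(line).index("Interchange")

# The same word gets two different word numbers.
print(number_a, number_b)
```

If one side reports a word number using counter A and the other side interprets it using counter B, the highlight lands on a neighbouring word instead of the hit.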
john.santangelo
Posts: 32
Joined: Fri Aug 24, 2007 1:54 pm

Post by john.santangelo »

As we are not using Thunderstone in production yet, this is the best example I can show you:

Search query: medicare

Result PDF page:
http://www.floridamedicare.com/Part_B/M ... 106801.pdf

In the first paragraph, these words or phrases are highlighted:

better serve

carrier

with the

and medicare

This is the full URL

http://www.floridamedicare.com/Part_B/M ... f#xml=<OUR APPLIANCE>/texis/search/pdfhi.txt?query=medicare&pr=Florida&prox=page&rorder=1000&rprox=750&rdfreq=250&rwfreq=250&rlead=500&rdepth=0&sufs=1&order=r&mode=&opts=&cq=3&id=478dda0fa
john.santangelo
Posts: 32
Joined: Fri Aug 24, 2007 1:54 pm

Post by john.santangelo »

Another example:
query= 2008 deductible

URL of search result:
http://www.floridamedicare.com/Part_B/M ... 478dcfd3af

Page 1 highlighting includes text, none of which is the query:

Coding
Misinformation Regarding
......................................................
Clinical Laboratory
Procedure
............................................................................
Interchange
.......................

(Note: each line above is one run of highlighted text.)
mark
Site Admin
Posts: 5515
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Using Acrobat Reader 8, those searches in those files highlight correctly for me. However, if I use the "remove common" feature, the highlights are off. "Remove common", "keep tags", and "ignore tags" are not usable if you want accurate PDF highlighting. All of the text must be kept so that the word counts are correct.
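
To show why dropping text shifts the highlights, here is a minimal Python sketch. It assumes zero-based word numbers and whitespace splitting on both sides; the function names, the stop-word list, and the sample sentence are all hypothetical and do not reflect the appliance's real "remove common" logic.

```python
# Hypothetical illustration: highlights are addressed by word number,
# so the indexer and the viewer must count the same words.

COMMON = {"the", "and", "with", "to", "of"}  # assumed list of dropped words


def word_positions(text, query, drop_common=False):
    """Return the word numbers where `query` occurs, as the indexer counts them."""
    words = text.split()
    if drop_common:
        # Dropping words renumbers everything that follows them.
        words = [w for w in words if w.lower() not in COMMON]
    return [i for i, w in enumerate(words) if w.lower() == query.lower()]


def highlighted(text, positions):
    """Return the words a viewer would highlight; it counts every word."""
    words = text.split()
    return [words[i] for i in positions if i < len(words)]


page = "To better serve the carrier and Medicare providers with the new rules"

exact = word_positions(page, "medicare")
drifted = word_positions(page, "medicare", drop_common=True)

print(highlighted(page, exact))    # counts agree: the query word is highlighted
print(highlighted(page, drifted))  # counts disagree: some other word is highlighted
```

In this sketch the hit is found at the right place in both cases; only the number handed to the viewer changes, which is enough to move the highlight.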
john.santangelo
Posts: 32
Joined: Fri Aug 24, 2007 1:54 pm

Post by john.santangelo »

Accuracy seems to be much better on pages 2+, but page one is still off...

It is better with "remove common" off, but I suppose page one will remain off, because we need to use "keep tags" to remove navigation links from search result abstracts.
mark
Site Admin
Posts: 5515
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

The word counting resyncs at each page boundary, so "remove common" should generally only affect the first page.

Keep/Ignore tags are far less likely to cause a problem because the tags you specify for HTML are highly unlikely to match anything in a PDF document. Remove common does not need to be on to use Keep/Ignore Tags.
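
The page-boundary resync can be sketched in Python as follows. This is only an illustration under stated assumptions: hits are reported as (page, word) pairs, the count restarts at zero on every page, and the `skipped_on_first_page` parameter is a hypothetical way to simulate a count that is off on page one. The sample pages are made up.

```python
def locate(pages, query, skipped_on_first_page=0):
    """Return (page, word) pairs for `query`, restarting the count on each page.

    `skipped_on_first_page` simulates an indexer that did not count that many
    leading words on page 0 (a hypothetical source of drift).
    """
    hits = []
    for page_no, text in enumerate(pages):
        words = text.split()
        if page_no == 0:
            words = words[skipped_on_first_page:]
        for word_no, word in enumerate(words):  # numbering restarts per page
            if word.lower() == query.lower():
                hits.append((page_no, word_no))
    return hits


pages = [
    "Home News Contact 2008 deductible amounts",
    "The 2008 deductible is listed here",
]

correct = locate(pages, "deductible")
drifted = locate(pages, "deductible", skipped_on_first_page=3)

print(correct)
print(drifted)
```

Because the numbering restarts on each page, the miscount on the first page changes only the first pair; the hit on the second page is reported identically in both runs.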