PDF hit highlighting is highlighting a lot more than the search query

Posted: Wed Jan 16, 2008 10:52 am
by john.santangelo
This is not an isolated event: PDF hit highlighting seems to randomly highlight other words, and often characters that aren't even words.

How can we make PDF hit highlighting more accurate?

Posted: Wed Jan 16, 2008 12:22 pm
by mark
Adobe has a funny way of highlighting words. You have to tell Acrobat the page and word numbers to highlight, even though there are no "word" boundaries in a PDF file. Occasionally, special formatting within the file will throw off the appliance's idea of the word number relative to Acrobat's.
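To illustrate how a page-and-word-number scheme can drift, here is a minimal Python sketch. It is not the appliance's or Acrobat's actual algorithm: the sample sentence, the two tokenizers, and the function names are all hypothetical, chosen only to show that two programs splitting the same text differently will disagree on which word number a hit has.

```python
import re

def word_positions(pages, query, tokenizer):
    """Return (page_index, word_index) pairs where a token equals the query."""
    query = query.lower()
    hits = []
    for page_no, text in enumerate(pages):
        # Word numbering restarts at 0 on every page.
        for word_no, word in enumerate(tokenizer(text)):
            if word.lower().strip(".,;:") == query:
                hits.append((page_no, word_no))
    return hits

def split_on_whitespace(text):
    # Hypothetical "indexer" view: a hyphenated term counts as one word.
    return text.split()

def split_on_non_alnum(text):
    # Hypothetical "viewer" view: punctuation and hyphens separate words.
    return [w for w in re.split(r"[^A-Za-z0-9]+", text) if w]

# Made-up sample text, not taken from the PDF discussed in this thread.
pages = ["To better-serve you, the carrier and Medicare updated the fee schedule."]

indexer_hits = word_positions(pages, "medicare", split_on_whitespace)
viewer_hits = word_positions(pages, "medicare", split_on_non_alnum)

print(indexer_hits)  # [(0, 6)]
print(viewer_hits)   # [(0, 7)]
```

In this sketch, a viewer that is told "page 0, word 6" but counts with the second tokenizer would highlight "and" instead of "Medicare", which is the kind of off-by-a-few-words error described above.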

If you could supply a few sample files and queries, and say where the highlight is off, we could investigate whether anything can be done to improve the situation.

Posted: Wed Jan 16, 2008 2:46 pm
by john.santangelo
As we are not using Thunderstone in production yet, this is the best example I can show you:

Search query: medicare

Result PDF page:
http://www.floridamedicare.com/Part_B/M ... 106801.pdf

In the first paragraph, these words or phrases are highlighted:

better serve

carrier

with the

and medicare

This is the full URL:

http://www.floridamedicare.com/Part_B/M ... f#xml=<OUR APPLIANCE>/texis/search/pdfhi.txt?query=medicare&pr=Florida&prox=page&rorder=1000&rprox=750&rdfreq=250&rwfreq=250&rlead=500&rdepth=0&sufs=1&order=r&mode=&opts=&cq=3&id=478dda0fa

Posted: Wed Jan 16, 2008 3:42 pm
by john.santangelo
Another example:
Search query: 2008 deductible

URL of search result:
http://www.floridamedicare.com/Part_B/M ... 478dcfd3af

On page 1, the highlighting includes the following text, none of which matches the query:

Coding
Misinformation Regarding
......................................................
Clinical Laboratory
Procedure
............................................................................
Interchange
.......................

<note: each line is what text is highlighted>

Posted: Wed Jan 16, 2008 5:00 pm
by mark
Using Acrobat Reader 8, those searches in those files highlight correctly for me. However, if I use the "remove common" feature, the highlights are off. "Remove common", "keep tags", and "ignore tags" are not usable if you want accurate PDF highlighting. All of the text must be kept so that the word counts are correct.

Posted: Thu Jan 17, 2008 10:24 am
by john.santangelo
Accuracy seems to be much better on pages 2+, but page one is still off...

It is better with "remove common" off, but I suppose it will remain somewhat inaccurate, because we need to use "keep tags" to remove navigation links from search result abstracts.

Posted: Thu Jan 17, 2008 10:52 am
by mark
The word counting resyncs at each page boundary, so "remove common" should generally only affect the first page.
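A rough Python sketch of why the drift stays on the first page follows. It assumes, purely for illustration, that the dropped text is a run of words at the start of the document and that both sides restart their word count on every page. The page contents, the `skipped_leading_words` parameter, and the function names are hypothetical and are not part of the appliance.

```python
def index_hits(pages, query, skipped_leading_words=0):
    """Hit positions as the indexer sees them; counting restarts on every page."""
    hits = []
    for page_no, words in enumerate(pages):
        # Assumption: only the first page loses its leading words.
        seen = words[skipped_leading_words:] if page_no == 0 else words
        for word_no, word in enumerate(seen):
            if word.lower() == query:
                hits.append((page_no, word_no))
    return hits

def words_highlighted(pages, hits):
    """Words a viewer would highlight when it counts over the full page text."""
    return [pages[page_no][word_no] for page_no, word_no in hits]

# Made-up two-page document; each page is already split into words.
pages = [
    "Home News Contact the medicare carrier will better serve you".split(),
    "the 2008 medicare deductible is listed below".split(),
]

accurate = words_highlighted(pages, index_hits(pages, "medicare"))
drifted = words_highlighted(
    pages, index_hits(pages, "medicare", skipped_leading_words=3)
)

print(accurate)  # ['medicare', 'medicare']
print(drifted)   # ['News', 'medicare']
```

With three leading words dropped from the indexer's view, the page 1 hit lands three words early, while the hit on the second page is still correct because the count starts over there.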

Keep/Ignore tags are far less likely to cause a problem because the tags you specify for HTML are highly unlikely to match anything in a PDF document. Remove common does not need to be on to use Keep/Ignore Tags.