PDF hit highlighting is highlighting a lot more than the search query
Posted: Wed Jan 16, 2008 10:52 am
by john.santangelo
This is not an isolated event: PDF hit highlighting seems to randomly highlight other words, and often characters that aren't even words.
How can we make PDF hit highlighting more accurate?
Posted: Wed Jan 16, 2008 12:22 pm
by mark
Adobe has a funny way of highlighting words. You have to tell Acrobat the page and word numbers to highlight, even though there are no "word" boundaries in a PDF file. Occasionally, special formatting within the file will throw off the appliance's idea of the word number versus Acrobat's.
If you could supply a few sample files and queries, and say where the highlight is off, we could investigate whether anything can be done to improve the situation.
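To illustrate the page-and-word-number addressing described above, here is a minimal Python sketch. It is not the appliance's code or Acrobat's word segmentation: the `word_hits` function, the regex tokenizer, and the sample page text are all assumptions made up for illustration.

```python
import re

def word_hits(pages, query):
    """Return zero-based (page_number, word_number) pairs for each word equal to query."""
    hits = []
    for page_no, text in enumerate(pages):
        # Hypothetical tokenizer: a "word" is any run of letters, digits or underscores.
        words = re.findall(r"\w+", text)
        for word_no, word in enumerate(words):
            if word.lower() == query.lower():
                hits.append((page_no, word_no))
    return hits

# Made-up page text, one string per PDF page.
pages = ["Florida Medicare Part B update", "Changes to the Medicare deductible"]
print(word_hits(pages, "medicare"))  # [(0, 1), (1, 3)]
```

If the viewer splits the same page into words differently (for example, around special formatting), the same word number points at a different word, which is the kind of mismatch described above.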
Posted: Wed Jan 16, 2008 2:46 pm
by john.santangelo
As we are not using Thunderstone in production yet, this is the best example I can show you:
Search query: medicare
Result PDF page:
http://www.floridamedicare.com/Part_B/M ... 106801.pdf
In the first paragraph, these words or phrases are highlighted:
better serve
carrier
with the
and medicare
This is the full URL:
http://www.floridamedicare.com/Part_B/M ... f#xml=<OUR APPLIANCE>/texis/search/pdfhi.txt?query=medicare&pr=Florida&prox=page&rorder=1000&rprox=750&rdfreq=250&rwfreq=250&rlead=500&rdepth=0&sufs=1&order=r&mode=&opts=&cq=3&id=478dda0fa
Posted: Wed Jan 16, 2008 3:42 pm
by john.santangelo
Another example:
query= 2008 deductible
URL of search result:
http://www.floridamedicare.com/Part_B/M ... 478dcfd3af
Page 1 highlighting includes the following text, none of which is in the query:
Coding
Misinformation Regarding
......................................................
Clinical Laboratory
Procedure
............................................................................
Interchange
.......................
<note: each line is what text is highlighted>
Posted: Wed Jan 16, 2008 5:00 pm
by mark
Using Acrobat Reader 8, those searches in those files highlight correctly for me. However, if I use the "remove common" feature, the highlights are off. "Remove common", "keep tags", and "ignore tags" are not usable if you want accurate PDF highlighting. All of the text must be kept so that word counts are correct.
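A small Python sketch of why dropped text shifts the highlight. This is only an illustration under stated assumptions: the `highlighted_word` helper is hypothetical, and the set of dropped words is invented to show the effect, not a description of how "remove common" actually chooses what to drop.

```python
def highlighted_word(viewer_words, indexer_words, query):
    """Return the word the viewer highlights when given the indexer's word number."""
    # The indexer reports the hit position in ITS word list...
    position = indexer_words.index(query)
    # ...but the viewer counts words in the full, unmodified page.
    return viewer_words[position]

# Words as the PDF viewer sees them (sample text, for illustration only).
viewer = "to better serve the carrier and medicare".split()

# Case 1: the indexer kept all of the text, so both counts agree.
print(highlighted_word(viewer, list(viewer), "medicare"))  # medicare

# Case 2: some text was dropped before counting (hypothetical dropped set).
stripped = [w for w in viewer if w not in {"to", "the", "and"}]
print(highlighted_word(viewer, stripped, "medicare"))  # the
```

In the second case the reported word number is too small by the number of words dropped before the hit, so a different word gets highlighted.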
Posted: Thu Jan 17, 2008 10:24 am
by john.santangelo
Accuracy seems to be much better on pages 2 and up, but page one is still off.
It is better with "remove common" off, but page one will stay off, I suppose, because we need to use "keep tags" to remove navigation links from search result abstracts.
Posted: Thu Jan 17, 2008 10:52 am
by mark
The word counting resyncs at each page boundary, so "remove common" should generally affect only the first page.
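As a rough Python sketch of that per-page resync (the `page_word_index` helper and the page contents are hypothetical, and the text dropped from page one is invented for illustration):

```python
def page_word_index(pages, query):
    """Map page number -> word number of the first hit, restarting the count on every page."""
    result = {}
    for page_no, text in enumerate(pages):
        words = text.lower().split()
        if query in words:
            result[page_no] = words.index(query)
    return result

# What the viewer sees: the full text of both pages.
viewer_pages = ["home news contact medicare update", "the medicare deductible"]
# What the indexer counted: some page-one text was dropped (hypothetical).
indexer_pages = ["medicare update", "the medicare deductible"]

print(page_word_index(viewer_pages, "medicare"))   # {0: 3, 1: 1}
print(page_word_index(indexer_pages, "medicare"))  # {0: 0, 1: 1}
```

The word numbers disagree on the first page only. Because counting restarts at the page boundary, the second page lines up again.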
Keep/Ignore tags are far less likely to cause a problem because the tags you specify for HTML are highly unlikely to match anything in a PDF document. Remove common does not need to be on to use Keep/Ignore Tags.