dashes turning to ?

Posted: Mon Feb 07, 2005 10:54 am
by jgdoke
This PDF file contains the part number of the product:
1485T–P2T5–T5

When you index this PDF file, the dashes become "?":
1485T?P2T5?T5
This causes searches for the part number to fail.
I just checked, and the entire string in the PDF file is set in Helvetica. Here is a link to the PDF file. Can you find out how to have the dashes indexed as dashes?

http://literature.rockwellautomation.co ... _-en-p.pdf

Posted: Mon Feb 07, 2005 3:12 pm
by Kai
The dashes in the original PDF are en-dashes (Unicode U+2013). In the walk data, each one is currently mapped to either a question mark or a soft hyphen (U+00AD), depending on whether XML UTF-8 is Y or N, respectively. (Even if mapped to a soft hyphen, it might still *display* as a question mark, depending on browser/charset settings. To check, view the HTML source under List/Edit URLs.)
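
To make the two cases concrete, here is a rough Python sketch of the mapping as described in this post. The function name `map_en_dash` and its `xml_utf8` flag are hypothetical, made up for illustration; this is not the indexer's actual code, only a model of the behavior described.

```python
import unicodedata

EN_DASH = "\u2013"      # the character actually present in the PDF
SOFT_HYPHEN = "\u00ad"  # what it becomes when XML UTF-8 is N

def map_en_dash(text: str, xml_utf8: bool) -> str:
    """Hypothetical model of the mapping described in this post."""
    replacement = "?" if xml_utf8 else SOFT_HYPHEN
    return text.replace(EN_DASH, replacement)

pdf_text = "1485T\u2013P2T5\u2013T5"

print(unicodedata.name(EN_DASH))                     # EN DASH
print(map_en_dash(pdf_text, xml_utf8=True))          # 1485T?P2T5?T5
print(ascii(map_en_dash(pdf_text, xml_utf8=False)))  # '1485T\xadP2T5\xadT5'
```

Printing with `ascii()` in the last line shows the soft hyphen as an escape, which avoids the display ambiguity mentioned above.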

If it is a soft hyphen (XML UTF-8 is N), searches will fail, because the soft hyphen is a high-bit character that is treated as part of a word rather than as punctuation. A fix for this will be out later this week.

If it is truly a question mark (XML UTF-8 is Y), then searches should work, because the question mark is a punctuation character. E.g., with the query `1485T-P2T5-T5', the dashes make the query a phrase, and any whitespace or punctuation character(s) may appear between the words.
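
As a rough illustration of why one case matches and the other does not, here is a minimal Python sketch of phrase matching. The helpers `words` and `phrase_matches` are hypothetical, and the rule that every character at or above 0x80 counts as a word character is an assumption based on the description in this post, not the search engine's real tokenizer.

```python
import re

SOFT_HYPHEN = "\u00ad"

def words(text: str) -> list[str]:
    # Assumption: ASCII letters/digits and all high-bit characters (>= 0x80)
    # are word characters; everything else is whitespace or punctuation.
    return re.findall(r"[0-9A-Za-z\u0080-\uffff]+", text)

def phrase_matches(query: str, indexed: str) -> bool:
    # The query words must appear consecutively, in order, in the indexed text.
    q, d = words(query), words(indexed)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

query = "1485T-P2T5-T5"

# XML UTF-8 = Y: "?" is punctuation, so the text splits into the same three words.
print(phrase_matches(query, "1485T?P2T5?T5"))  # True

# XML UTF-8 = N: the soft hyphen glues everything into a single word.
print(phrase_matches(query, f"1485T{SOFT_HYPHEN}P2T5{SOFT_HYPHEN}T5"))  # False
```

In this model the query always splits into three words, so only indexed text that splits the same way can match as a phrase.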