dashes turning to ?

jgdoke
Posts: 167
Joined: Wed Jul 14, 2004 10:52 am

Post by jgdoke »

This PDF file contains the part number of the product:
1485T–P2T5–T5

When you index this PDF file, the dashes become "?":
1485T?P2T5?T5
This causes searches for the part number to fail.
I just checked, and the entire string in the PDF file is Helvetica. Here is a link to the PDF file. Can you find out how to have the dashes indexed as dashes?

http://literature.rockwellautomation.co ... _-en-p.pdf
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

The dashes in the original PDF are en-dashes (Unicode U+2013). In the walk data, an en-dash is currently mapped to either a question mark or a soft hyphen (U+00AD), depending on whether XML UTF-8 is Y or N, respectively. (Even if it is mapped to a soft hyphen, it might still *display* as a question mark, depending on browser/charset settings. View the HTML source under List/Edit URLs to check.)
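If you want to confirm which dash character a string really contains, a small script can list the code points. This is only a sketch in Python, not part of the indexer; the helper name `describe_chars` is made up for illustration, and the sample string is the part number typed with en-dashes as an assumption about what the PDF text holds.

```python
import unicodedata

def describe_chars(text):
    """Return (char, code point, Unicode name) for each non-alphanumeric char."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if not ch.isalnum()
    ]

# Part number written with en-dashes (assumed to match the PDF text).
part_number = "1485T\u2013P2T5\u2013T5"

for ch, code, name in describe_chars(part_number):
    print(code, name)
```

Running the same helper on text copied from the walk data should show whether the stored character is a question mark (U+003F) or a soft hyphen (U+00AD).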

If it is a soft hyphen (XML UTF-8 is N), searches will fail, because the character is a high-bit character that is treated as part of a word rather than as punctuation. A fix for this will be out later this week.
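To see why a soft hyphen breaks the match, here is a toy model in Python. It is not the search engine's actual tokenizer; the rule that every character at or above U+0080 counts as a word character is an assumption made only to mirror the behavior described above.

```python
import re

# Toy rule (an assumption): ASCII letters/digits and any character
# at or above U+0080 are treated as "word" characters.
WORD_RE = re.compile(r"[A-Za-z0-9\u0080-\uffff]+")

def tokenize(text):
    """Split text into words under the toy rule above."""
    return WORD_RE.findall(text)

# Indexed text with soft hyphens: the whole part number stays one word.
indexed = tokenize("1485T\u00adP2T5\u00adT5")

# Query typed with ordinary hyphens: it splits into three words.
query = tokenize("1485T-P2T5-T5")

print(len(indexed), "indexed word(s)")
print(len(query), "query word(s)")
```

Under this model the index holds one long word while the query looks for three separate ones, so nothing lines up.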

If it is truly a question mark (XML UTF-8 is Y), then searches should work, because the question mark is a punctuation character. E.g., with the query '1485T-P2T5-T5', the dashes make the query a phrase, and any whitespace/punctuation character(s) may appear between the words.
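The phrase behavior can be approximated with a regular expression. Again, this is just a rough Python sketch and not how the engine is implemented; `phrase_pattern` is a hypothetical helper, and it treats only ASCII letters and digits as word characters.

```python
import re

def phrase_pattern(query):
    """Build a regex matching the query's words in order, with any run of
    non-alphanumeric characters (punctuation or whitespace) between them."""
    words = re.findall(r"[A-Za-z0-9]+", query)
    separator = r"[^A-Za-z0-9]+"
    body = separator.join(re.escape(word) for word in words)
    # Lookarounds keep the phrase from matching inside a longer word.
    return re.compile(r"(?<![A-Za-z0-9])" + body + r"(?![A-Za-z0-9])")

pattern = phrase_pattern("1485T-P2T5-T5")

print(bool(pattern.search("Part 1485T?P2T5?T5 in stock")))  # question marks
print(bool(pattern.search("Part 1485T P2T5 T5 in stock")))  # spaces
print(bool(pattern.search("Part 1485TP2T5T5 in stock")))    # no separators
```

The first two searches match because some punctuation or whitespace separates the words; the last one does not, since the words run together.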