entity character resolution

leslie.mohn
Posts: 3
Joined: Tue Feb 01, 2005 1:02 pm

Post by leslie.mohn »

We have some pages that have character set issues, and I'm not sure if there is any easy way to resolve them.

In the walk settings (we are using the Search Appliance), we have the storage charset and source default charset both set to WINDOWS-1252. XML UTF-8 is set to N.

In the search settings, we left the display charset blank.

This seemed to resolve most of our charset issues. Looking at the HTML pages where we still have problems, though, it seems to happen only where a numeric entity is used rather than the named HTML version. For example, the same file has &#153; in one location and &reg; in another. The &reg; resolves correctly, but the &#153; does not.

Other than going into each file and changing all of the references, does anyone have any ideas for fixing this?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

HTML entity references are supposed to be interpreted as Unicode, regardless of the document character set. Character number 153 is the trademark sign in Windows-1252, but in Unicode it is U+0099, a control character with no mapping to Windows-1252, so Webinator maps such unmappable characters to a question mark.
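
To illustrate the mismatch, here is a minimal Python sketch. It is not Webinator code; it only uses Python's standard codecs (where Windows-1252 is named cp1252) to show what number 153 means in each interpretation, and the variable names are made up for the example.

```python
import unicodedata

# Byte 153 (0x99) interpreted as Windows-1252 is the trademark sign, U+2122.
tm = bytes([153]).decode("cp1252")
print(hex(ord(tm)))  # 0x2122

# Number 153 interpreted as a Unicode code point is U+0099, a control character.
ctrl = chr(153)
print(unicodedata.category(ctrl))  # Cc

# U+0099 has no Windows-1252 encoding, so a "replace" policy yields "?".
replaced = ctrl.encode("cp1252", errors="replace")
print(replaced)  # b'?'
```

The last line mirrors the question mark seen in the search results, assuming Webinator applies a similar substitution policy for unmappable characters.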

It shows up OK in Internet Explorer because IE apparently re-interprets numeric entities that do not map as Unicode, using either the document character set or one of its own Windows charsets. This is an unofficial extension.

For portability, use symbolic names where possible, e.g. &trade; instead of &#153;, or at least the Unicode numeric reference, e.g. &#8482;.
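
If editing each file by hand is impractical, the numeric references could be rewritten in bulk. The following is a rough Python sketch, not a Webinator feature: the function name fix_c1_references is hypothetical, it assumes the pages were authored with Windows-1252 in mind, and it only handles decimal references (not hex forms such as &#x99;).

```python
import re

# Matches decimal numeric character references such as &#153;
_NUMERIC_REF = re.compile(r"&#(\d+);")

def fix_c1_references(html: str) -> str:
    """Rewrite references in the 128-159 range to their Unicode equivalents,
    treating the number as a Windows-1252 byte value."""
    def repl(match):
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                char = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # Byte is undefined in Windows-1252: leave the reference alone.
                return match.group(0)
            return f"&#{ord(char)};"
        return match.group(0)
    return _NUMERIC_REF.sub(repl, html)

print(fix_c1_references("Acme&#153; and Acme&reg;"))  # Acme&#8482; and Acme&reg;
```

Named references and numeric references outside 128-159 pass through unchanged, so running it over a file should only touch the problem cases. Test it on copies of the pages first.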
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

If you change the Storage Charset to ISO-8859-1 and leave the entity references as-is, then &reg; and &#153; should both map OK. This might affect other characters, however.
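
A small Python sketch of why this can work, and of one way other characters could be affected. It uses only Python's standard codecs, not Webinator itself, and it assumes the browser ends up displaying the stored bytes as Windows-1252.

```python
# U+0099 (from &#153;) and U+00AE (from &reg;) both exist in ISO-8859-1,
# so each is stored as a single byte.
stored = "\u0099\u00ae".encode("iso-8859-1")
print(stored)  # b'\x99\xae'

# Displayed as Windows-1252, byte 0x99 becomes the trademark sign.
shown = stored.decode("cp1252")
print(shown)  # ™®

# The side effect: characters outside ISO-8859-1, such as a real
# trademark sign (U+2122), can no longer be stored.
try:
    "\u2122".encode("iso-8859-1")
    storable = True
except UnicodeEncodeError:
    storable = False
print(storable)  # False
```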