entity character resolution

Post by **leslie.mohn** » Thu Apr 07, 2005 4:34 pm

We have some pages that have character set issues, and I'm not sure if there is any easy way to resolve them.

In the walk settings (we are using the Search Appliance), we have the storage charset and source default charset both set to WINDOWS-1252. XML UTF-8 is set to N.

In the search settings, we left the display charset blank.

This seemed to resolve most of our charset issues. Looking at the HTML pages where we still have problems, though, it seems to only be happening in places where a numeric entity is used rather than the named HTML version. For example, the same file has `&#153;` in one location and `&reg;` in another. The `&reg;` is resolving correctly, but the `&#153;` is not.

Other than going into each file and changing all of the references, does anyone have any ideas for fixing this?

Post by **Kai** » Fri Apr 08, 2005 10:51 am

HTML entity references are supposed to be interpreted as Unicode, regardless of the document character set. Character number 153 is the trademark sign in Windows-1252, but in Unicode it is U+0099, a control character with no mapping to Windows-1252. Webinator maps such unmappable characters to a question mark.
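
To make the mismatch concrete, here is a small Python sketch (Python is only used for illustration; it says nothing about how Webinator is implemented). It looks up the number 153 both as a Unicode code point and as a Windows-1252 byte, then shows what happens when U+0099 has to be converted to Windows-1252.

```python
# The number 153 means different things depending on the table it is looked up in.
as_unicode = chr(153)                      # U+0099, a C1 control character
as_cp1252 = bytes([153]).decode("cp1252")  # byte 0x99 in Windows-1252

print(repr(as_unicode))  # '\x99'
print(as_cp1252)         # ™

# U+0099 has no Windows-1252 encoding, so a converter has to substitute something.
# Python's "replace" error handler uses a question mark here.
print(as_unicode.encode("cp1252", errors="replace"))  # b'?'
```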

It shows up ok in Explorer because IE apparently tries to re-interpret some numeric entities in either the document character set or one of its own Windows charsets if they do not map as Unicode, an unofficial extension.
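
This kind of lenient handling is not unique to IE. As an illustration only, Python's standard `html.unescape` follows the HTML5 parsing rules, which remap numeric references in the 128-159 range through Windows-1252 instead of treating them as control characters. The comparison below is a sketch of the two interpretations, not a description of IE's internals.

```python
import html

# Lenient, browser-style interpretation: 153 is remapped via Windows-1252.
lenient = html.unescape("&#153;")

# Strict interpretation: 153 is taken literally as a Unicode code point.
strict = chr(153)

print(lenient == "\u2122")  # True, the trademark sign
print(strict == "\u0099")   # True, a C1 control character
```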

For portability, use symbolic names where possible, e.g. `&trade;` instead of `&#153;`, or at least the Unicode numeric reference, e.g. `&#8482;`.
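
If editing every file by hand is not practical, the rewrite could be scripted. The following is a minimal sketch, assuming the pages can be preprocessed as text before they are walked; `fix_cp1252_refs` is a hypothetical helper name, not part of Webinator. It rewrites only numeric references in the 128-159 range, treating the number as a Windows-1252 byte and emitting the matching Unicode numeric reference.

```python
import re

# Matches decimal (&#153;) and hexadecimal (&#x99;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

def fix_cp1252_refs(text):
    """Rewrite numeric references in the 128-159 range as Unicode references."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        number = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not 128 <= number <= 159:
            return match.group(0)  # outside the problem range; leave untouched
        try:
            char = bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # byte is undefined in Windows-1252; leave it
        return f"&#{ord(char)};"
    return NUMERIC_REF.sub(replace, text)

print(fix_cp1252_refs("Acme&#153; and Acme&reg; and &#174;"))
# Acme&#8482; and Acme&reg; and &#174;
```

Named entities and references that are already valid Unicode pass through unchanged, so running it twice over the same file should be harmless.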

Post by **Kai** » Fri Apr 08, 2005 10:59 am

If you change the Storage Charset to ISO-8859-1 and leave the entity references as-is, then `&reg;` and `&#153;` should both map ok. This might affect other characters however.
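
A rough Python sketch of why this can work, and what the side effect might look like. It assumes the browser ends up reading the stored bytes as Windows-1252, which is an assumption about the display side and not something the search settings guarantee.

```python
ref_char = chr(153)  # what &#153; resolves to when read as Unicode
reg_char = "\u00ae"  # what &reg; resolves to

# ISO-8859-1 maps code points 0-255 straight to bytes, so both are storable.
stored = (ref_char + reg_char).encode("iso-8859-1")
print(stored)  # b'\x99\xae'

# Assumption: the browser decodes those bytes as Windows-1252.
print(stored.decode("cp1252"))  # ™®

# Characters above 255 no longer fit, which is the possible side effect:
# a real trademark sign (U+2122) would be replaced.
print("\u2122".encode("iso-8859-1", errors="replace"))  # b'?'
```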