Encoding Type used to store input

mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Post by mjacobson »

I am developing an XML output interface to return search results. I have run into an issue trying to set the correct character type when returning the "Body" value for a cache request. If the page was anything other than HTML or plain text, I get an "illegal character" error while parsing the XML.

If I change the character type to "ISO-8859-1", these problems go away. Is this the correct thing to do, or do I need to set the type depending on what kind of document was stored (PDF, PowerPoint, Word, HTML, etc.)?
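For illustration (a minimal Python sketch, not part of the original setup): an XML parser rejects bytes that are invalid under the declared encoding, which is why a single ISO-8859-1 byte breaks a document declared as UTF-8:

```python
import xml.etree.ElementTree as ET

# 0xED is 'í' in ISO-8859-1, but an invalid lone byte in UTF-8.
latin1_body = b"<doc>accented i: \xed</doc>"

# Declared UTF-8: the parser rejects the 0xED byte.
try:
    ET.fromstring(b'<?xml version="1.0" encoding="UTF-8"?>' + latin1_body)
    utf8_ok = True
except ET.ParseError:
    utf8_ok = False
print(utf8_ok)  # False

# Declared ISO-8859-1: the same bytes parse fine.
doc = ET.fromstring(b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body)
print(doc.text)  # accented i: í
```

So declaring ISO-8859-1 only "works" as long as the stored bytes really are ISO-8859-1.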
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

When I try to encode the character 'í' using %hhV, I get an error when reading the file back with UTF-8 encoding. How would I go about encoding characters like this for UTF-8?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

What is the error you get?

What is the hex value of the bytes in the file you're reading, from your %hhV output? They should be C3 AD.
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

I get an invalid-character error when trying to open the XML file in Internet Explorer. I believe the hex value is ED.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

This works in IE for me:

<$funny='í'>
<dataset>
  <record>
    <text>accented i: <fmt "%hhV" $funny></text>
  </record>
</dataset>

Where the í in the data is hex ED. %hhV gives hex C3 AD.
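(The byte values can be confirmed in any language; here is a Python sketch, independent of Vortex:)

```python
# U+00ED ('í') is one byte, 0xED, in ISO-8859-1,
# and the two-byte sequence 0xC3 0xAD in UTF-8.
ch = "\u00ed"  # í

print(ch.encode("iso-8859-1").hex())  # ed
print(ch.encode("utf-8").hex())       # c3ad

# C3 AD round-trips back to 'í' under UTF-8.
print(b"\xc3\xad".decode("utf-8") == ch)  # True
```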
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I think I cc'd the wrong person. See the previous msg on the board.
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

I think I solved the problem: I was using <strfmt "%hhV" $funny> to save the encoded value to a variable and then printing it again with <fmt "%s">.

Now I seem to be getting another error with Chinese characters. The following string was produced by <fmt "%hhV"> and is invalid in IE:

Re:±Ïҵ²»º𲻿ìʮÊ׾­µä¸èÇú&#27;
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

That doesn't look like valid <strfmt "%hhV"> output (some of it doesn't decode properly as UTF-8), unless it was mangled when posted to the message board.

If the original characters are Chinese, they must already be in UTF-8 or some other Unicode encoding, not ISO-8859-1 as in msg #3. In that case you should not be using <strfmt "%hhV">; leave the charset alone, e.g. use <strfmt "%s"> or "%H".
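(To see why re-encoding already-UTF-8 text is harmful, here is a Python sketch of the double-encoding effect, outside Vortex: treating UTF-8 bytes as ISO-8859-1 characters and encoding them again, which is effectively what %hhV does to text that is already UTF-8.)

```python
s = "\u5341"  # 十, a Chinese character
utf8_once = s.encode("utf-8")

# Treat those bytes as ISO-8859-1 characters and encode again:
# every byte >= 0x80 doubles, producing mojibake.
utf8_twice = utf8_once.decode("iso-8859-1").encode("utf-8")

print(utf8_once.hex())   # e58d81
print(utf8_twice.hex())  # c3a5c28dc281
```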

Can you copy the original (not <strfmt>) characters to a file, then run this:

<read "/tmp/file"><$org = $ret>
<fmt "Original: [%U]\n" $org>
<strfmt "%hhV" $org>
<fmt "UTF-8 enc: [%U]\n" $ret>

which will print them in hex to avoid potential cut-and-paste issues.
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

I do get the following error when fetching the Chinese page:

<!-- 018 test:25: Cannot completely convert charset gb2312 to UTF-8 via converter "C:\MORPH3\etc\iconv" -f gb2312 -t UTF-8 -c: returned exit code 1 in the function httransbuf -->

Could this possibly be the cause of the errors in the xml?

Also, is there a way to check whether what has been fetched is already UTF-8 encoded? I'm thinking I would need to add this check before using <strfmt "%hhV"> when outputting to XML.
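(Outside Vortex, the check amounts to a strict decode; a Python sketch of the idea:)

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict mode: raises on any invalid sequence
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("\u5341".encode("utf-8")))  # True
print(is_valid_utf8(b"accented i: \xed"))       # False (lone 0xED byte)
```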
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Yep, that could certainly be the cause. Is the original Chinese doc a public URL? Can you post it?

After a fetch, <urlinfo charsetsource> returns the charset of the source (raw) HTML, as determined from the headers/META tags/scan. <urlinfo charsettext> returns the charset of the formatted text (<urlinfo text>), as determined by <urlcp charsettext> and any conversion errors.

If the source-to-text charset conversion fails altogether (non-zero exit code), <urlinfo charsettext> will probably be left the same as the source charset (i.e. not UTF-8). However, if it only partially fails (as in this case), <urlinfo charsettext> will be UTF-8, but some or all of the characters in the text may be invalid. <fetch> assumes that a partially failed conversion is still mostly usable for display (i.e. in an HTML browser), and in general it has no way of knowing how bad the failure was anyway (typically only a character or two).

To check if the <urlinfo text> is 100% valid UTF-8 in such cases, you could try to convert it back to ISO-8859-1 and see if any errors occur:

<urlinfo text><$orgtxt = $ret>
<rex row "\?=" $orgtxt></rex><$orgcount = $loop>
<strfmt "%!hV" $orgtxt>
<rex row "\?=" $ret></rex><$newcount = $loop>
<if $newcount gt $orgcount>
Text contains invalid UTF-8 sequences
<else>
Text is ok
</if>

This code counts the number of question-marks before and after UTF-8 decoding: if there are more after decoding, then the text has invalid UTF-8 sequences.
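(The same heuristic in Python, for comparison: a lenient decoder substitutes U+FFFD for each invalid sequence rather than '?', with the same caveat that replacement characters already present in the data inflate the count.)

```python
def invalid_utf8_sequences(data: bytes) -> int:
    """Count invalid UTF-8 sequences via replacement-character substitution."""
    already_there = data.count("\ufffd".encode("utf-8"))
    after_decode = data.decode("utf-8", errors="replace").count("\ufffd")
    return after_decode - already_there

print(invalid_utf8_sequences("ok: \u00ed".encode("utf-8")))  # 0
print(invalid_utf8_sequences(b"bad: \xed bytes"))            # 1
```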