Encoding Type used to store input

Posted: Fri Jul 30, 2004 2:25 pm
by mjacobson
I am developing an XML output interface to return search results. I have run into an issue trying to set the correct character encoding when returning the "Body" value for a cache request. If the page was anything other than HTML or plain text, I get an "illegal character" error while parsing the XML.

If I change the character encoding to "ISO-8859-1", these problems go away. I was wondering whether this is the correct thing to do, or whether I need to set the encoding depending on what type of document was stored (PDF, PowerPoint, Word, HTML, etc.)?

Posted: Fri Jul 30, 2004 2:47 pm
by John
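That is a correct solution. An XML parser assumes UTF-8 unless the document declares another encoding, so data that is really ISO-8859-1 has to be declared as such.

Another solution is to use <fmt %hhV $var> to print the variable, which will convert the data to UTF-8 and HTML-escape it, so the output is in the UTF-8 that XML expects by default.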
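
To make the two options concrete outside of Vortex, here is a minimal Python sketch. It assumes the stored body is ISO-8859-1; the sample bytes, the <Body> element name, and both function names are made up for illustration and are not part of the search software. It builds the same body once with an ISO-8859-1 declaration and once transcoded to UTF-8, and both documents parse to the same text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical stored body: "café & crème" as ISO-8859-1 bytes.
raw = b"caf\xe9 & cr\xe8me"


def as_latin1_xml(body: bytes) -> bytes:
    """Option 1: keep the bytes as ISO-8859-1 and declare that encoding."""
    text = escape(body.decode("iso-8859-1"))  # escape &, <, >
    return (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
            b"<Body>" + text.encode("iso-8859-1") + b"</Body>")


def as_utf8_xml(body: bytes) -> bytes:
    """Option 2: transcode to UTF-8 and escape, roughly what %hhV is described as doing."""
    text = escape(body.decode("iso-8859-1"))
    return (b'<?xml version="1.0" encoding="UTF-8"?>'
            b"<Body>" + text.encode("utf-8") + b"</Body>")


# Both documents are well-formed and carry the same text.
print(ET.fromstring(as_latin1_xml(raw)).text)
print(ET.fromstring(as_utf8_xml(raw)).text)
```

Either approach only works if the stored bytes really are ISO-8859-1; bodies extracted from other document types may use a different encoding.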

Posted: Wed Jun 14, 2006 4:28 pm
by sourceone
When I try to encode the character 'í' using %hhV, I get an error when trying to read the file using utf-8 encoding. How would I go about encoding characters like this for utf-8?

Posted: Wed Jun 14, 2006 6:00 pm
by Kai
What is the error you get?

What is the hex value of the bytes in the file you're reading, from your %hhV output? They should be C3 AD.

Posted: Wed Jun 14, 2006 8:52 pm
by sourceone
I get an invalid character error when trying to open the XML file in Internet Explorer. I believe the hex value is ED.

Posted: Thu Jun 15, 2006 12:12 pm
by John
Can you create a small file with that character, read it, and fmt %U the data? Then strfmt %hhV the original data, and fmt %U the output to see exactly what is coming out.

You aren't specifying a charset in your XML, are you?

Posted: Thu Jun 15, 2006 12:28 pm
by mark
This works in IE for me:

<$funny='í'>
<dataset>
<record>
<text>accented i: <fmt "%hhV" $funny></text>
</record>
</dataset>

Where the í in the data is hex ED. %hhV gives hex C3 AD.
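
The same byte-level relationship can be checked outside of Vortex. This is a small Python sketch, not the Vortex implementation; the helper name hex_bytes is hypothetical. It shows that a lone ED byte is not valid UTF-8, and that converting it from ISO-8859-1 to UTF-8 yields C3 AD.

```python
def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


latin1 = b"\xed"  # 'í' in ISO-8859-1

# A lone ED byte is an incomplete multi-byte sequence in UTF-8.
try:
    latin1.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

utf8 = latin1.decode("iso-8859-1").encode("utf-8")

print(hex_bytes(latin1), "valid UTF-8:", valid_utf8)  # ED valid UTF-8: False
print(hex_bytes(utf8))                                # C3 AD
```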

Posted: Thu Jun 15, 2006 12:29 pm
by mark
I think I cc'd the wrong person. See the previous msg on the board.

Posted: Thu Jun 15, 2006 4:26 pm
by sourceone
I think I solved the problem. I was using <strfmt %hhV $funny> to save it to a variable and then using fmt %s to print it out again.

Now I seem to be getting another error with Chinese characters. The following string is produced by fmt %hhV and is invalid in IE:

Re:±Ïҵ²»º𲻿ìʮÊ׾­µä¸èÇú&#27;

Posted: Thu Jun 15, 2006 5:36 pm
by Kai
That doesn't look like valid <strfmt "%hhV"> output (as some of it doesn't UTF-8 decode properly), unless something was mangled during the message board post.

If the original characters were Chinese, they must already be UTF-8 or some other Unicode format, not ISO-8859-1 as in msg #3. In that case you should not be using <strfmt "%hhV">; leave the charset alone instead, e.g. with <strfmt "%s"> or "%H".

Can you copy the original (not <strfmt>) characters to a file, then run this:

<read "/tmp/file"><$org = $ret>
<fmt "Original: [%U]\n" $org>
<strfmt "%hhV" $org>
<fmt "UTF-8 enc: [%U]\n" $ret>

which will print them in hex to avoid potential cut-and-paste issues.
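
To illustrate why data that is already UTF-8 should not be converted again, here is a Python sketch of the idea. It is only a heuristic, and the function name to_utf8_once is hypothetical: it passes bytes through unchanged when they already decode as UTF-8 and treats everything else as ISO-8859-1. Some ISO-8859-1 byte sequences also happen to be valid UTF-8, and other encodings (such as legacy Chinese charsets) would need their own handling, so inspecting the hex of the original data, as in the test above, remains the reliable check.

```python
def to_utf8_once(data: bytes) -> bytes:
    """Return UTF-8 bytes, transcoding from ISO-8859-1 only when needed.

    Assumption: input is either valid UTF-8 or ISO-8859-1.
    """
    try:
        data.decode("utf-8")
        return data  # already UTF-8: leave the charset alone
    except UnicodeDecodeError:
        return data.decode("iso-8859-1").encode("utf-8")


def hex_dump(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


already_utf8 = "í".encode("utf-8")  # C3 AD

# Blindly converting UTF-8 as if it were ISO-8859-1 double-encodes it.
double_encoded = already_utf8.decode("iso-8859-1").encode("utf-8")

print(hex_dump(double_encoded))              # C3 83 C2 AD
print(hex_dump(to_utf8_once(already_utf8)))  # C3 AD
print(hex_dump(to_utf8_once(b"\xed")))       # C3 AD
```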