Encoding Type used to store input

mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Post by mjacobson »

I am developing an XML output interface to return search results. I have run into an issue trying to set the correct character encoding when returning the "Body" value for a cache request. If the page was anything other than HTML or plain text, I get an "illegal character" error while parsing the XML.

If I change the character encoding to "ISO-8859-1", these problems go away. Is this the correct thing to do, or do I need to set the encoding according to the type of document that was stored (PDF, PowerPoint, Word, HTML, etc.)?
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

That is a correct solution: an XML parser expects UTF-8 unless the document declares another encoding, and the stored data is apparently ISO-8859-1, not UTF-8.

Another solution is to print the variable with <fmt %hhV $var>, which UTF-8 encodes and HTML-escapes the data, so the output is already in the encoding XML expects.
John Turnbull
Thunderstone Software
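
To make the two fixes above concrete outside of Vortex, here is a minimal Python sketch. It assumes the stored text is ISO-8859-1, as the thread suggests. The helper names (make_xml, parses, to_utf8_escaped) are invented for illustration, and to_utf8_escaped only approximates the effect attributed to %hhV in this thread; it is not Thunderstone's implementation.

```python
import html
import xml.etree.ElementTree as ET

# "accented i" stored as ISO-8859-1: the i-acute is the single byte 0xED.
RAW = "accented i: \u00ed".encode("iso-8859-1")


def make_xml(body, declared_encoding=None):
    """Wrap raw bytes in a <text> element, optionally with an XML declaration."""
    prolog = b""
    if declared_encoding:
        prolog = (b'<?xml version="1.0" encoding="'
                  + declared_encoding.encode("ascii") + b'"?>')
    return prolog + b"<text>" + body + b"</text>"


def parses(doc):
    """Return True if the byte string is well-formed XML for this parser."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


def to_utf8_escaped(latin1_bytes):
    """Hypothetical stand-in for the %hhV effect described in the thread:
    reinterpret ISO-8859-1 bytes as text, escape markup, emit UTF-8."""
    text = latin1_bytes.decode("iso-8859-1")
    return html.escape(text, quote=False).encode("utf-8")


# No declaration: the parser assumes UTF-8 and rejects the lone 0xED byte.
undeclared_ok = parses(make_xml(RAW))
# Fix 1: declare the encoding the data is really in.
declared_ok = parses(make_xml(RAW, "ISO-8859-1"))
# Fix 2: convert the data to UTF-8 and leave the declaration out.
converted_ok = parses(make_xml(to_utf8_escaped(RAW)))

print(undeclared_ok, declared_ok, converted_ok)
```

Either fix works only if the bytes really are in the encoding you assume, which is what the rest of the thread goes on to check.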
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

When I encode the character 'í' using %hhV, I get an error when I try to read the resulting file as UTF-8. How should I encode characters like this for UTF-8?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

What is the error you get?

What is the hex value of the bytes in the file you're reading, from your %hhV output? They should be C3 AD.
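
If it helps to check the bytes outside of Vortex, the following Python sketch shows the difference between the two byte sequences mentioned here. The function names (hex_bytes, is_valid_utf8) are made up for this example, and the sample data is built in memory rather than read from your file.

```python
def hex_bytes(data):
    """Render a byte string as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1 = "\u00ed".encode("iso-8859-1")  # i-acute as one ISO-8859-1 byte
utf8 = "\u00ed".encode("utf-8")         # i-acute as two UTF-8 bytes

print(hex_bytes(latin1), is_valid_utf8(latin1))  # ED False
print(hex_bytes(utf8), is_valid_utf8(utf8))      # C3 AD True
```

To test a real file, read it in binary mode (open(path, "rb").read()) and pass the result to the same two functions. A bare ED in the output means the text never went through the UTF-8 conversion.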
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

I get an invalid character error when I open the XML file in Internet Explorer. I believe the hex value is ED.
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Can you create a small file with that character, read it, and fmt %U the data? Then strfmt %hhV the original data, and fmt %U the output to see exactly what is coming out.

You aren't specifying a charset in your XML, are you?
John Turnbull
Thunderstone Software
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

This works in IE for me:

<$funny='í'>
<dataset>
<record>
<text>accented i: <fmt "%hhV" $funny></text>
</record>
</dataset>

Here the í in the data is hex ED, and %hhV outputs hex C3 AD.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I think I cc'd the wrong person. See the previous msg on the board.
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

I think I solved the problem. I was using <strfmt %hhV $funny> to save it to a variable and then using fmt %s to print it out again.

Now I seem to be getting another error with Chinese characters. The following string is produced by fmt %hhV and is invalid in IE:

Re:±Ïҵ²»º𲻿ìʮÊ׾­µä¸èÇú&#27;
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

That doesn't look like valid <strfmt "%hhV"> output (as some of it doesn't UTF-8 decode properly), unless something was mangled during the message board post.

If the original characters were Chinese, they must already be UTF-8 or some other Unicode format, not ISO-8859-1 as in msg #3. In that case you should not use <strfmt "%hhV">; leave the charset alone instead, e.g. with <strfmt "%s"> or "%H".

Can you copy the original (not <strfmt>) characters to a file, then run this:

<read "/tmp/file"><$org = $ret>
<fmt "Original: [%U]\n" $org>
<strfmt "%hhV" $org>
<fmt "UTF-8 enc: [%U]\n" $ret>

which will print them in hex to avoid potential cut-and-paste issues.
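
The reason for not converting data that is already UTF-8 can be seen in a short Python sketch. It assumes the conversion treats every input byte as an ISO-8859-1 character, which is how the thread describes the ED to C3 AD case; the function names (latin1_to_utf8, ensure_utf8) are hypothetical and this is not Vortex code.

```python
def latin1_to_utf8(data):
    """Treat every input byte as an ISO-8859-1 character and emit UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def ensure_utf8(data):
    """Leave valid UTF-8 untouched; otherwise assume ISO-8859-1.

    This is only a heuristic: some ISO-8859-1 byte sequences happen to be
    valid UTF-8, and other legacy charsets are not detected at all.
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return latin1_to_utf8(data)


# Two Chinese characters, already UTF-8: E4 B8 AD E6 96 87.
chinese = "\u4e2d\u6587".encode("utf-8")

# Converting again doubles every high byte and no longer decodes to the
# original text, which is the kind of damage to rule out here.
double = latin1_to_utf8(chinese)

print(chinese.hex(" "))
print(double.hex(" "))
print(ensure_utf8(chinese) == chinese)
```

Comparing the hex of the original data with the hex of the converted data, as the Vortex snippet above does with %U, shows at a glance whether the input was already multi-byte before conversion.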