Encoding Type used to store input

mjacobson
Posts: 204
Joined: Fri Feb 08, 2002 3:35 pm

Post by mjacobson »

I am developing an XML output interface to return search results. I have run into an issue trying to set the correct character type when returning the "Body" value for a cache request. If the page was anything other than HTML or plain text, I get an "illegal character" error while parsing the XML.

If I change the character type to "ISO-8859-1", these problems go away. Is this the correct thing to do, or do I need to set the type depending on what kind of document was stored (PDF, PowerPoint, Word, HTML, etc.)?
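For illustration (a minimal Python sketch, not part of the original setup): an XML parser rejects bytes that are invalid under the declared encoding, which is why a single ISO-8859-1 byte breaks a document declared as UTF-8:

```python
import xml.etree.ElementTree as ET

# 0xED is 'í' in ISO-8859-1, but an invalid lone byte in UTF-8.
latin1_body = b"<doc>accented i: \xed</doc>"

# Declared UTF-8: the parser rejects the 0xED byte.
try:
    ET.fromstring(b'<?xml version="1.0" encoding="UTF-8"?>' + latin1_body)
    utf8_ok = True
except ET.ParseError:
    utf8_ok = False
print(utf8_ok)  # False

# Declared ISO-8859-1: the same bytes parse fine.
doc = ET.fromstring(b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body)
print(doc.text)  # accented i: í
```

So declaring ISO-8859-1 only "works" as long as the stored bytes really are ISO-8859-1.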
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

When I try to encode the character 'í' using %hhV, I get an error when reading the file back with UTF-8 encoding. How would I go about encoding characters like this for UTF-8?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

What is the error you get?

What is the hex value of the bytes in the file you're reading, from your %hhV output? They should be C3 AD.
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

I get an invalid-character error when trying to open the XML file in Internet Explorer. I believe the hex value is ED.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

This works in IE for me:

<$funny='í'>
<dataset>
  <record>
    <text>accented i: <fmt "%hhV" $funny></text>
  </record>
</dataset>

Where the í in the data is hex ED. %hhV gives hex C3 AD.
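(The byte values can be confirmed in any language; here is a Python sketch, independent of Vortex:)

```python
# U+00ED ('í') is one byte, 0xED, in ISO-8859-1,
# and the two-byte sequence 0xC3 0xAD in UTF-8.
ch = "\u00ed"  # í

print(ch.encode("iso-8859-1").hex())  # ed
print(ch.encode("utf-8").hex())       # c3ad

# C3 AD round-trips back to 'í' under UTF-8.
print(b"\xc3\xad".decode("utf-8") == ch)  # True
```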
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I think I cc'd the wrong person. See the previous msg on the board.
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

I think I solved the problem: I was using <strfmt "%hhV" $funny> to save the encoded value to a variable and then printing it again with <fmt "%s">.

Now I seem to be getting another error with Chinese characters. The following string was produced by <fmt "%hhV"> and is invalid in IE:

Re:±Ïҵ²»º𲻿ìʮÊ׾­µä¸èÇú&#27;
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

That doesn't look like valid <strfmt "%hhV"> output (some of it doesn't decode properly as UTF-8), unless it was mangled when posted to the message board.

If the original characters are Chinese, they must already be in UTF-8 or some other Unicode encoding, not ISO-8859-1 as in msg #3. In that case you should not be using <strfmt "%hhV">; leave the charset alone, e.g. use <strfmt "%s"> or "%H".
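(To see why re-encoding already-UTF-8 text is harmful, here is a Python sketch of the double-encoding effect, outside Vortex: treating UTF-8 bytes as ISO-8859-1 characters and encoding them again, which is effectively what %hhV does to text that is already UTF-8.)

```python
s = "\u5341"  # 十, a Chinese character
utf8_once = s.encode("utf-8")

# Treat those bytes as ISO-8859-1 characters and encode again:
# every byte >= 0x80 doubles, producing mojibake.
utf8_twice = utf8_once.decode("iso-8859-1").encode("utf-8")

print(utf8_once.hex())   # e58d81
print(utf8_twice.hex())  # c3a5c28dc281
```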

Can you copy the original (not <strfmt>) characters to a file, then run this:

<read "/tmp/file"><$org = $ret>
<fmt "Original: [%U]\n" $org>
<strfmt "%hhV" $org>
<fmt "UTF-8 enc: [%U]\n" $ret>

which will print them in hex to avoid potential cut-and-paste issues.
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

I do get the following error when fetching the Chinese page:

<!-- 018 test:25: Cannot completely convert charset gb2312 to UTF-8 via converter "C:\MORPH3\etc\iconv" -f gb2312 -t UTF-8 -c: returned exit code 1 in the function httransbuf -->

Could this possibly be the cause of the errors in the xml?

Also, is there a way to check whether what has been fetched is already UTF-8 encoded? I'm thinking I would need to add this check before using <strfmt "%hhV"> when outputting to XML.
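(Outside Vortex, the check amounts to a strict decode; a Python sketch of the idea:)

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict mode: raises on any invalid sequence
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("\u5341".encode("utf-8")))  # True
print(is_valid_utf8(b"accented i: \xed"))       # False (lone 0xED byte)
```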
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Yep, that could certainly be the cause. Is the original Chinese doc a public URL? Can you post it?

After a fetch, <urlinfo charsetsource> returns the charset of the source (raw) HTML, as determined from the headers/META tags/scan. <urlinfo charsettext> returns the charset of the formatted text (<urlinfo text>), as determined by <urlcp charsettext> and any conversion errors.

If the source-to-text charset conversion fails altogether (non-zero exit code), <urlinfo charsettext> will probably be left the same as the source charset (i.e. not UTF-8). However, if it only partially fails (as in this case), <urlinfo charsettext> will be UTF-8, but some or all of the characters in the text may be invalid. <fetch> assumes that a partially failed conversion is still mostly usable for display (i.e. in an HTML browser), and in general it has no way of knowing how bad the failure was anyway (typically only a character or two).

To check if the <urlinfo text> is 100% valid UTF-8 in such cases, you could try to convert it back to ISO-8859-1 and see if any errors occur:

<urlinfo text><$orgtxt = $ret>
<rex row "\?=" $orgtxt></rex><$orgcount = $loop>
<strfmt "%!hV" $orgtxt>
<rex row "\?=" $ret></rex><$newcount = $loop>
<if $newcount gt $orgcount>
Text contains invalid UTF-8 sequences
<else>
Text is ok
</if>

This code counts the number of question-marks before and after UTF-8 decoding: if there are more after decoding, then the text has invalid UTF-8 sequences.
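(The same heuristic in Python, for comparison: a lenient decoder substitutes U+FFFD for each invalid sequence rather than '?', with the same caveat that replacement characters already present in the data inflate the count.)

```python
def invalid_utf8_sequences(data: bytes) -> int:
    """Count invalid UTF-8 sequences via replacement-character substitution."""
    already_there = data.count("\ufffd".encode("utf-8"))
    after_decode = data.decode("utf-8", errors="replace").count("\ufffd")
    return after_decode - already_there

print(invalid_utf8_sequences("ok: \u00ed".encode("utf-8")))  # 0
print(invalid_utf8_sequences(b"bad: \xed bytes"))            # 1
```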