charsetconv

aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

I'm using <urlutil charsetconv> to convert some text from gb2312 to UTF-8. Here is the resulting text:

&#22238;&#22797;&#65306;&#20013;&#22269;-&#22235;&#24029;=? &#1;&#7;****

But when I write that text to an XML file, I get an invalid character error when trying to open that XML file in IE.
John
Site Admin
Posts: 2625
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

The &#1; and &#7; are not valid XML characters. What was in the original text?
John Turnbull
Thunderstone Software
aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

Why would <urlutil charsetconv> return invalid XML if I'm using it to convert to UTF-8? Is it possible that iconv is failing? If so, how would I go about making sure that the returned text doesn't contain invalid XML characters?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

It's possible that iconv is failing. Are there any putmsgs when it is run? To validate purported UTF-8 output, you could decode and re-encode it with <strfmt>:

<strfmt "%!hV" $ret>
<strfmt "%hV" $ret>

Then any invalid UTF-8 chars would be mapped to `?'. The `h' subflag encodes out-of-ISO-8859-1-range chars as HTML, to prevent losses.

Note that some browsers may still complain about certain *valid* UTF-8 chars, such as U+0080 through U+009F.
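For readers who want to try the same decode-and-re-encode idea outside Vortex, here is a minimal Python sketch. It is not the <strfmt> implementation: the function names are hypothetical, and Python substitutes U+FFFD for invalid sequences where <strfmt> is described above as mapping them to `?'.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def scrub_utf8(raw: bytes) -> bytes:
    """Decode purported UTF-8 and re-encode it.

    Invalid byte sequences are replaced with U+FFFD (the Unicode
    replacement character), so the result is always valid UTF-8.
    """
    text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")
```

Note that this only checks the encoding. A string can pass this check and still contain characters that an XML parser rejects.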
aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

I turned on putmsgs and see this:

Cannot completely convert charset gb2312 to UTF-8 via converter "c:\morph3\etc\iconv" -f gb2312 -t UTF-8 -c: returned exit code 1 in the function httransbuf

charsetconv still returns a value. So I shouldn't assume that this return value will always be valid UTF-8? Should I look for this error first?
aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

Here's my code:
<$var="»ظ´£ºÖйú-ËĴ¨=? ??****">
<$charset="gb2312">
<urlutil charsetconv $var $charset "UTF-8">
<if $ret ne "">
<strfmt "%!hV" $ret>
<strfmt "%hV" $ret>
<strfmt "%H" $var>
<$var = $ret>
</if>

For this specific call, I don't get the error in putmsgs mentioned above. But it still produces invalid UTF-8.
John
Site Admin
Posts: 2625
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

The ?? are the problematic characters. They are converted to valid UTF-8, but the result is not valid XML: they are control characters, which XML does not allow even though they are valid in UTF-8.
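To make the difference between "valid UTF-8" and "valid XML" concrete, here is a small Python sketch (not Vortex) that reports characters falling outside the XML 1.0 Char production. The function names are hypothetical and chosen for illustration only.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, LF, CR
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def invalid_xml_chars(text: str) -> list:
    """List (index, hex code point) for each character XML 1.0 rejects."""
    return [(i, hex(ord(c))) for i, c in enumerate(text) if not is_xml_char(c)]
```

For example, a string ending in U+0001 and U+0007 encodes to UTF-8 without error, yet both characters are reported by `invalid_xml_chars`.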
John Turnbull
Thunderstone Software
aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

I assume I should try to detect and remove these control characters before trying to convert to UTF-8. How would I go about detecting these control characters?
John
Site Admin
Posts: 2625
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

You could sandr them out of the source before the charsetconv, e.g.:

<$s="[\x00-\x08\x0b\x0c\x0e-\x1f]+">
<$r=" ">
<sandr $s $r $var>
<$var=$ret>

to replace strings of control characters with spaces. It will leave tabs, CR and LF alone as they are acceptable.
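The same substitution can be sketched in Python (not Vortex) for anyone pre-processing the data elsewhere. This assumes the input is still the raw gb2312 bytes, where multibyte characters use only high-bit bytes, so the low control range can be replaced safely; `strip_controls` is a hypothetical name.

```python
import re

# Same character class as the sandr expression: all C0 controls
# except tab (0x09), LF (0x0a) and CR (0x0d).
CONTROL_RUN = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]+")


def strip_controls(raw: bytes) -> bytes:
    """Replace each run of control bytes with a single space."""
    return CONTROL_RUN.sub(b" ", raw)
```

Because the pattern ends in `+`, a run such as 0x01 0x07 collapses to one space rather than two.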
John Turnbull
Thunderstone Software
aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

To avoid errors like the one mentioned in message #5 when converting from gb2312 to UTF-8, are there any alternative programs that will do the conversion?
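One possible alternative, offered only as a sketch and not as a Thunderstone recommendation, is Python's built-in gb2312 codec. The function name below is hypothetical. With the default strict error handling, a failed conversion raises an exception instead of returning partial output.

```python
def gb2312_to_utf8(raw: bytes, errors: str = "strict") -> bytes:
    """Convert gb2312-encoded bytes to UTF-8 bytes.

    errors="strict" raises UnicodeDecodeError on bad input;
    errors="replace" substitutes U+FFFD for undecodable bytes.
    """
    return raw.decode("gb2312", errors=errors).encode("utf-8")
```

This does not remove control characters, so input destined for XML would still need them stripped separately.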