Why would <urlutil charsetconv> return invalid XML if I'm using it to convert to UTF-8? Is it possible that iconv is failing? If so, how would I make sure that the returned text doesn't contain invalid XML characters?
It's possible that iconv is failing. Are there any putmsgs when it is run? To validate purported UTF-8 output, you could decode and re-encode it with <strfmt>:
<strfmt "%!hV" $ret>
<strfmt "%hV" $ret>
Then any invalid UTF-8 chars would be mapped to `?'. The `h' subflag encodes out-of-ISO-8859-1-range chars as HTML, to prevent losses.
Note that some browsers may still complain about certain *valid* UTF-8 chars, such as U+0080 through U+009F.
Cannot completely convert charset gb2312 to UTF-8 via converter "c:\morph3\etc\iconv" -f gb2312 -t UTF-8 -c: returned exit code 1 in the function httransbuf
charsetconv still returns a value. So I shouldn't assume that this return value will always be valid UTF-8? Should I look for this error first?
The ?? are the problematic characters. They get converted to valid UTF-8, but not to valid XML: they are control characters, which XML does not allow even though they are valid in UTF-8.
I assume I should try to detect and remove these control characters before trying to convert to UTF-8. How would I go about detecting these control characters?
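For illustration only, here is a minimal sketch of one way to detect and strip such characters once the text has been decoded. It is written in Python rather than Vortex, and the function names are hypothetical. It assumes the XML 1.0 definition of a legal character (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 upward), so it leaves the U+0080 through U+009F range alone because XML 1.0 permits it.

```python
import re

# Characters that XML 1.0 does not allow: C0 controls other than
# tab (0x09), LF (0x0A) and CR (0x0D), plus surrogates, U+FFFE and U+FFFF.
_XML_INVALID = re.compile(
    r'[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]'
)

def find_invalid_xml_chars(text):
    """Return (offset, hex code point) for each XML-invalid character."""
    return [(m.start(), hex(ord(m.group())))
            for m in _XML_INVALID.finditer(text)]

def strip_invalid_xml_chars(text, replacement=''):
    """Remove XML-invalid characters, or swap them for `replacement`."""
    return _XML_INVALID.sub(replacement, text)
```

Calling find_invalid_xml_chars first shows where the offending characters are, which may help decide whether dropping them or replacing them with a space is the safer choice for your data.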
To avoid errors like those mentioned in message #5 when converting from gb2312 to UTF-8, are there any other programs that will convert from gb2312 to UTF-8?
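One possible alternative, offered as a sketch rather than a recommendation, is to do the conversion outside iconv with a scripting language's built-in codecs. The Python below is an assumption about how that might look and the function name gb_to_utf8 is hypothetical. It decodes with gb18030 by default, a superset of gb2312, on the guess that pages labeled gb2312 sometimes contain characters outside that set. Undecodable bytes become U+FFFD instead of aborting the conversion, and the count of replacements is returned so the caller can tell that something was lost.

```python
def gb_to_utf8(raw, source_charset='gb18030'):
    """Convert GB-encoded bytes to UTF-8 bytes.

    Returns (utf8_bytes, replaced), where `replaced` counts the
    U+FFFD replacement characters present after decoding.
    """
    # errors='replace' substitutes U+FFFD for bytes that cannot be decoded
    text = raw.decode(source_charset, errors='replace')
    replaced = text.count('\ufffd')
    return text.encode('utf-8'), replaced
```

Note that this only guarantees valid UTF-8. Control characters that are valid UTF-8 but invalid XML pass through unchanged and would still need to be removed separately.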