iconv

aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

Using exec, how would I go about using iconv to convert text from a given character set to UTF-8?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

Would this be a more effective way than using <fmt "%hhV" $str> when trying to output text to an XML file?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

<fmt> is builtin to Vortex and thus is faster than <exec>ing a separate iconv process. Also, you'd need to HTML-escape the iconv output (for XML) after the <exec> with <fmt "%H">, whereas <fmt "%hhV"> does that already. (But note that "%hhV" assumes the *input* has HTML-escaped its ampersands, and must be ISO-8859-1.)

A more generic -- and just as fast -- way to convert charsets is with:

<urlutil charsetconv $datatoconvert $sourcecharset "UTF-8">
<strfmt "%H" $ret> <!-- HTML-escape for XML -->
<$converteddata = $ret>

which will handle any builtin or iconv charset. It will use the same (internal/fast) routines as <fmt> if the charset pair can be handled that way, otherwise it <exec>s iconv. Note the <strfmt> for HTML-escaping for XML.
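
For readers working outside Vortex, the same two-step order (convert the charset first, then escape for XML) can be sketched in Python. This is only an illustration of the idea, not how Vortex implements it; the function name `to_utf8_xml` and the sample string are made up for the example.

```python
from xml.sax.saxutils import escape

def to_utf8_xml(data: bytes, source_charset: str) -> bytes:
    """Decode bytes from source_charset, XML-escape them, and return UTF-8 bytes."""
    text = data.decode(source_charset)   # step 1: interpret the bytes in the source charset
    escaped = escape(text)               # step 2: escape &, < and > for XML
    return escaped.encode("utf-8")       # emit the result as UTF-8

# Hypothetical ISO-8859-1 input containing accented letters and markup characters
sample = "caf\u00e9 & cr\u00e8me <b>".encode("iso-8859-1")
result = to_utf8_xml(sample, "iso-8859-1")
```

Escaping after the conversion mirrors the order of `<urlutil charsetconv>` followed by `<strfmt "%H">` above.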
aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

So if I already know that the text is encoded in UTF-8, would I only have to <strfmt "%H" $var> to output that variable to XML?
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

Correct, %H does the HTML escaping of characters like & and <.
aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

If the $sourcecharset is equal to "Unknown", is there any attempt to try to convert the data to UTF-8? Or should I not even bother trying to call urlutil charsetconv?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Any charset that is not known internally is punted to an exec'd iconv process. So a $sourcecharset of "Unknown" would indeed exec an iconv process; you should not bother calling <urlutil charsetconv> in that case.
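
The "skip the conversion when the charset is not recognized" check can be sketched in Python as an analogy. It uses Python's own codec registry as a stand-in, which is an assumption for the example: the set of names Python knows is not the same as the set Vortex or iconv knows. The helper name `is_known_charset` is hypothetical.

```python
import codecs

def is_known_charset(name: str) -> bool:
    """Return True if Python's codec registry recognizes this charset name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        # No codec registered under this name, so a conversion attempt would fail
        return False

known = is_known_charset("ISO-8859-1")
unknown = is_known_charset("Unknown")
```

A caller would convert only when the check succeeds and otherwise leave the data alone or handle it separately.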
aitchon
Posts: 119
Joined: Mon Jan 22, 2007 10:30 am

Post by aitchon »

I'm using <urlutil charsetconv> to convert a piece of text to UTF-8 from a page that's encoded in ISO-8859-1. It seems to be having trouble converting some of the characters in this string:

gameplay variety to the series — players

is translated to:

gameplay variety to the series â&#128;&#148; players

Is this the correct conversion to UTF-8? The translated characters do not seem to be valid UTF-8 characters.
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

It looks like that string was already in UTF-8 (an em dash, U+2014); converting a UTF-8 page as if it were ISO-8859-1 will indeed produce incorrect output. Perhaps the page's charset was incorrectly labelled?
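
The effect can be reproduced outside Vortex with a few lines of Python, shown here only as an illustration of the byte arithmetic. The em dash is three bytes in UTF-8; reading those bytes as ISO-8859-1 yields three separate characters with code points 226, 128 and 148, which matches the â&#128;&#148; seen in the output above.

```python
original = "\u2014"                        # em dash, U+2014
utf8_bytes = original.encode("utf-8")      # the three bytes E2 80 94

# Misinterpret the UTF-8 bytes as ISO-8859-1: one character per byte
misread = utf8_bytes.decode("iso-8859-1")
code_points = [ord(c) for c in misread]    # decimal code points of the misread characters

# Converting the misread text to UTF-8 produces the doubly encoded result
reencoded = misread.encode("utf-8")

# Undoing the mistake: back to ISO-8859-1 bytes, then decode as UTF-8
recovered = reencoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
```

If the page really is UTF-8, no conversion is needed, only the HTML escaping for XML.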