iconv
Posted: Thu Dec 06, 2007 12:21 pm
by aitchon
Using exec, how would I go about using iconv to convert text in a character set to UTF-8?
iconv
Posted: Thu Dec 06, 2007 1:53 pm
by mark
<exec bin iconv -f $sourcecharset -t "UTF-8"><fmt "%s" $datatoconvert></exec>
<$converteddata=$ret>
See also
http://www.gnu.org/software/libiconv/do ... onv.1.html
iconv
Posted: Thu Dec 06, 2007 3:54 pm
by aitchon
Would this be a more effective way than using <fmt "%hhV" $str> when trying to output text to an XML file?
iconv
Posted: Thu Dec 06, 2007 5:35 pm
by Kai
<fmt> is built into Vortex and is thus faster than <exec>ing a separate iconv process. Also, for XML you'd need to HTML-escape the iconv output after the <exec> with <fmt "%H">, whereas <fmt "%hhV"> does that already. (But note that "%hhV" assumes the *input* is ISO-8859-1 and has already HTML-escaped its ampersands.)
A more generic -- and just as fast -- way to convert charsets is with:
<urlutil charsetconv $datatoconvert $sourcecharset "UTF-8">
<strfmt "%H" $ret> <!-- HTML-escape for XML -->
<$converteddata = $ret>
which will handle any builtin or iconv charset. It will use the same (internal/fast) routines as <fmt> if the charset pair can be handled that way, otherwise it <exec>s iconv. Note the <strfmt> for HTML-escaping for XML.
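For readers more familiar with a general-purpose language, the same two-step order (convert the charset first, then escape for XML) can be sketched in Python. This is only an analogue of the Vortex snippet above, not Vortex code: the function name to_utf8_xml_text is made up for illustration, and xml.sax.saxutils.escape is assumed to be a close enough stand-in for "%H" (it escapes &, < and >; the exact set of characters "%H" escapes may differ).

```python
from xml.sax.saxutils import escape


def to_utf8_xml_text(raw: bytes, source_charset: str) -> bytes:
    """Hypothetical helper: convert raw bytes to UTF-8, escaped for XML text."""
    # Step 1: charset conversion (the role <urlutil charsetconv> plays above).
    text = raw.decode(source_charset)
    # Step 2: escape &, < and > so the text is safe inside an XML element.
    escaped = escape(text)
    return escaped.encode("utf-8")


# An ISO-8859-1 byte string containing an accented letter and XML-special characters.
sample = "Caf\u00e9 & <bar>".encode("iso-8859-1")
print(to_utf8_xml_text(sample, "iso-8859-1"))
# prints b'Caf\xc3\xa9 &amp; &lt;bar&gt;'
```

Escaping after conversion keeps the escape step independent of the source charset, which is why the order matters.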
iconv
Posted: Fri Dec 07, 2007 10:44 am
by aitchon
So if I already know that the text is encoded in UTF-8, would I only have to use <strfmt "%H" $var> to output that variable to XML?
iconv
Posted: Fri Dec 07, 2007 11:50 am
by jason112
Correct, %H does the HTML escaping of characters like & and <.
iconv
Posted: Mon Mar 24, 2008 2:51 pm
by aitchon
If $sourcecharset is equal to "Unknown", is any attempt made to convert the data to UTF-8? Or should I not even bother calling urlutil charsetconv?
iconv
Posted: Mon Mar 24, 2008 4:46 pm
by Kai
Any charset that is not known internally is punted to an exec'd iconv process. So $sourcecharset of "Unknown" would indeed exec an iconv; you should not bother calling <urlutil charsetconv> then.
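The idea of skipping the conversion when the charset label is not usable can be sketched in Python. This is an illustration only, not how Vortex decides: the function name convert_if_known is hypothetical, and Python's codec registry stands in for the set of charsets a converter recognizes.

```python
import codecs


def convert_if_known(raw: bytes, source_charset: str):
    """Hypothetical guard: return UTF-8 bytes, or None for an unrecognized charset."""
    try:
        # codecs.lookup raises LookupError for labels such as "Unknown".
        codecs.lookup(source_charset)
    except LookupError:
        return None  # caller decides what to do with unconverted data
    # Undecodable bytes are replaced with U+FFFD rather than raising an error.
    return raw.decode(source_charset, errors="replace").encode("utf-8")


print(convert_if_known(b"\xe9", "iso-8859-1"))  # prints b'\xc3\xa9'
print(convert_if_known(b"\xe9", "Unknown"))     # prints None
```

Checking the label up front avoids paying for a conversion attempt that cannot succeed.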
iconv
Posted: Tue Apr 01, 2008 12:19 pm
by aitchon
I'm using <urlutil charsetconv> to convert a piece of text to UTF-8 from a page that's encoded in ISO-8859-1. It seems to be having trouble converting some of the characters in this string:
gameplay variety to the series — players
is translated to:
gameplay variety to the series — players
Is this the correct conversion to UTF-8? The translated characters do not seem to be valid UTF-8 characters.
iconv
Posted: Tue Apr 01, 2008 2:31 pm
by Kai
It looks like that string was already in UTF-8 (an em dash, U+2014); converting a UTF-8 page as if it were ISO-8859-1 will indeed produce incorrect output. Perhaps the page's charset was incorrectly labelled?
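The effect described here can be reproduced in a few lines of Python, shown purely as an illustration of what happens at the byte level. The helper decode_page is hypothetical, and trying strict UTF-8 before trusting the declared charset is just one possible heuristic, assumed here for the sketch: pure ASCII passes either way, and a few ISO-8859-1 byte sequences can also happen to be valid UTF-8.

```python
# The em dash (U+2014) is three bytes in UTF-8: E2 80 94.
original = "series \u2014 players"
utf8_bytes = original.encode("utf-8")

# Treating those bytes as ISO-8859-1 turns each byte into its own character,
# so converting "to UTF-8" encodes them a second time.
mislabelled = utf8_bytes.decode("iso-8859-1")
double_encoded = mislabelled.encode("utf-8")

print(utf8_bytes)      # prints b'series \xe2\x80\x94 players'
print(double_encoded)  # prints b'series \xc3\xa2\xc2\x80\xc2\x94 players'


def decode_page(raw: bytes, declared_charset: str) -> str:
    """Hypothetical heuristic: prefer strict UTF-8, else use the declared charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(declared_charset)


# A page labelled ISO-8859-1 that is really UTF-8 now decodes correctly.
print(decode_page(utf8_bytes, "iso-8859-1") == original)  # prints True
```

The double-encoded result is still well-formed UTF-8, but it encodes three unrelated characters (an "a" with circumflex followed by two control characters) instead of the single em dash.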