Encoding Type used to store input

Post by **sourceone** » Mon Jun 19, 2006 12:00 pm

I do get the following error when fetching the Chinese page:



Could this possibly be the cause of the errors in the xml?

Also, is there a way to check whether what has been fetched is already UTF-8 encoded? I'm thinking that I would need to add this check before I use strfmt %hhV when outputting to xml.

Post by **Kai** » Mon Jun 19, 2006 5:02 pm

Yep, that could certainly be the cause. Is the original Chinese doc a public URL? Can you post it?

After a fetch, <urlinfo charsetsource> returns the charset of the source (raw) HTML, as determined from headers/META/scan. <urlinfo charsettext> returns the charset of the formatted text (<urlinfo text>), as determined by <urlcp charsettext> and any conversion errors.

If the source-to-text charset conversion fails altogether (non-zero exit code), <urlinfo charsettext> will probably be left the same as the source charset (ie. not UTF-8). However, if it only partially fails (as in this case), <urlinfo charsettext> will be UTF-8, but some or all of the characters in the text may not be valid. <fetch> assumes that a partial failure is still mostly good from a display standpoint (ie. HTML browser), and in general has no way of knowing how bad the failure was anyway (typically only a character or two).

To check if the <urlinfo text> is 100% valid UTF-8 in such cases, you could try to convert it back to ISO-8859-1 and see if any errors occur:

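<urlinfo text><$orgtxt = $ret>
<!-- count the question-marks in the original text -->
<rex row "\?=" $orgtxt></rex><$orgcount = $loop>
<!-- decode the text, then count the question-marks again -->
<strfmt "%!hV" $orgtxt>
<rex row "\?=" $ret></rex>
<!-- compare the two counts ($loop holds the second count) -->
<if $loop gt $orgcount>
Text contains invalid UTF-8 sequences
<else>
Text is ok
</if>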

This code counts the number of question-marks before and after UTF-8 decoding: if there are more after decoding, then the text has invalid UTF-8 sequences.
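For readers outside Vortex, the same before/after counting idea can be sketched in Python. This is only an analogy, not what <fetch> or <strfmt> do internally: the function name `has_invalid_utf8` is hypothetical, and the replacement marker here is U+FFFD (Python's choice) rather than a question-mark.

```python
def has_invalid_utf8(data: bytes) -> bool:
    """Return True if data contains byte sequences that are not valid UTF-8.

    Hypothetical helper: counts replacement characters before and after
    a lenient decode, mirroring the question-mark counting shown above.
    """
    # U+FFFD characters that were legitimately present in the input
    before = data.count("\ufffd".encode("utf-8"))
    # A lenient decode substitutes U+FFFD for every undecodable sequence
    after = data.decode("utf-8", errors="replace").count("\ufffd")
    # More markers after decoding means something failed to decode
    return after > before


print(has_invalid_utf8("中文?".encode("utf-8")))   # valid UTF-8 text
print(has_invalid_utf8("中文".encode("gb2312")))   # GB2312 bytes, not UTF-8
```

A text with no markers before decoding is not a special case: the "before" count is simply zero, so any marker that appears afterwards signals an invalid sequence.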

Post by **sourceone** » Mon Jun 19, 2006 5:07 pm

What happens if the text doesn't have any question-marks before decoding?

Here is the URL - http://club.music.yule.sohu.com/r-oldso ... -10-0.html

Post by **sourceone** » Sun Sep 24, 2006 10:47 pm

I'm trying to understand the conversion to utf-8 during a fetch. The fetch determines the charset of the page and then converts from that charset to utf-8. Is this correct? Is this the call to the conversion function?

"C:\MORPH3\etc\iconv" -f gb2312 -t UTF-8 -c

Should I assume that utf-8 will not always be returned if the source charset cannot be determined? If invalid utf-8 is returned, are there any steps to make sure that only utf-8 is returned?

Post by **Kai** » Mon Sep 25, 2006 3:22 pm

Yes, <fetch> determines the source charset (from headers/meta/guess) and converts it to UTF-8 (the selected text charset).

Yes, iconv is used for the conversion, if the charset is not an internally-parsable charset.

Yes, if the source charset cannot be determined, or can be determined but not translated, the returned (text) charset will be the source charset instead. (You can check the text charset with <urlinfo charsettext>.)
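The fallback described in the three answers above can be pictured with a rough Python sketch. It is an assumption-laden illustration, not <fetch>'s actual implementation: the function name `to_utf8_or_source` is hypothetical, and `errors="ignore"` is used only to imitate the effect of iconv's `-c` flag (drop characters that cannot be converted).

```python
import codecs
from typing import Optional, Tuple


def to_utf8_or_source(raw: bytes, source_charset: Optional[str]) -> Tuple[bytes, Optional[str]]:
    """Return (text, text_charset) for a fetched page.

    Hypothetical helper: converts to UTF-8 when the source charset is
    known and supported, otherwise leaves the bytes in the source charset.
    """
    if not source_charset:
        # Source charset could not be determined: leave the text alone
        return raw, source_charset
    try:
        codecs.lookup(source_charset)
    except LookupError:
        # Charset was determined but cannot be translated
        return raw, source_charset
    # Drop unconvertible bytes, similar in spirit to `iconv -c`
    text = raw.decode(source_charset, errors="ignore")
    return text.encode("utf-8"), "UTF-8"


page = "中文".encode("gb2312")
print(to_utf8_or_source(page, "gb2312")[1])
print(to_utf8_or_source(page, "x-no-such-charset")[1])
```

The caller therefore always has to look at the returned charset name rather than assume UTF-8, which is the role <urlinfo charsettext> plays after a real fetch.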

Even if the text charset is UTF-8, it may still contain invalid characters, eg. if the translation partially failed (with a message) or the source was already UTF-8 but had invalid characters. The only way to make sure the text is 100% valid UTF-8 is not only to make sure <urlinfo charsettext> is "UTF-8", but also verify the text with the procedure in message #12.

Post by **sourceone** » Sat Sep 30, 2006 7:51 am

Is there a way for me to take the text from a page in its original charset and store that in Texis without converting to UTF-8?

Post by **Kai** » Mon Oct 02, 2006 6:08 pm

You mean prevent <fetch> from changing the charset? You can set <urlcp charsettext source> to make the text (formatted) charset the same as the source (original) charset. But then you'll have to know/assume/store the original charset somewhere too, so the text can be converted properly for display.

Post by **sourceone** » Mon Oct 02, 2006 9:45 pm

Does the fetch actually convert the text to UTF-8? I want to be able to store the return value from fetch and also the result of urltext in the database in its original charset. Can I do this without fetch changing the charset to UTF-8? Or, as you stated, would I have to do a fetch to determine the charset and then do a 2nd fetch using <urlcp charsettext source> to set the charset to the one retrieved during the 1st fetch?

Post by **Kai** » Wed Oct 04, 2006 6:16 pm

You do not need to do 2 <fetch>es in order to store the raw <fetch> result (HTML) as well as <urltext> both in their original (source, on-the-web-server) charset. The $ret from <fetch> (or <urlinfo rawdoc>, the same thing) is always in the source charset regardless of <urlcp> settings.
To keep <urltext> in the source charset too (instead of UTF-8), just set <urlcp charsettext source>.

There's no 3rd arg (charset) needed for <urlcp charsettext source>, ie. you don't have to do a pre- <fetch> to determine the actual charset name and then set that. `source' is a "special" charset for the `charsettext' setting that just means "same-as-the-source-HTML-charset-whatever-it-may-be" for precisely this situation.

My comment about storing the charset somewhere was only because you'll almost certainly want to know that charset later, now that the text isn't always going to be UTF-8. But you don't need a 2nd <fetch> for that; just save <urlinfo charsettext> along with <urltext>.
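As a language-neutral picture of "save the charset along with the text", here is a small Python sketch. The `StoredPage` class and its field names are hypothetical and say nothing about how a Texis table would be laid out; the point is only that the bytes and their charset name travel together.

```python
from dataclasses import dataclass


@dataclass
class StoredPage:
    """Hypothetical record: page text kept in its original charset."""
    text: bytes    # text exactly as returned, in the source charset
    charset: str   # charset name saved alongside the text

    def for_display(self) -> str:
        # Decode with the stored charset; undecodable bytes become U+FFFD
        return self.text.decode(self.charset, errors="replace")


page = StoredPage(text="中文".encode("gb2312"), charset="gb2312")
print(page.for_display())
```

Without the stored charset name, the same bytes could not be decoded reliably later.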

(One caveat: there is a small chance that <urltext> will still not be 100% in the source charset even with this setting. This is because the parser for <urltext> must first map the text to UTF-8 for parsing, and then back to the desired charset for <urltext> $ret. So if the source charset is unknown to <fetch>/iconv, one or both of those mappings may fail. Of course, in that case the text will probably be left in the original charset at parse time anyway, so it's mainly a theoretical issue.)

Post by **sourceone** » Wed Oct 11, 2006 11:07 am

Is <urlinfo charsettxt> the same as <urlinfo contenttypeparam "charset">?