Encoding Type used to store input

sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

I do get the following error when fetching the Chinese page:

<!-- 018 test:25: Cannot completely convert charset gb2312 to UTF-8 via converter "C:\MORPH3\etc\iconv" -f gb2312 -t UTF-8 -c: returned exit code 1 in the function httransbuf -->

Could this possibly be the cause of the errors in the xml?

Also, is there a way to check whether what has been fetched is already UTF-8 encoded? I'm thinking that I would need to add this check before I use strfmt %hhV when outputting to xml.
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Yep, that could certainly be the cause. Is the original Chinese doc a public URL? Can you post it?

After a fetch, <urlinfo charsetsource> returns the charset of the source (raw) HTML, as determined from headers/META/scan. <urlinfo charsettext> returns the charset of the formatted text (<urlinfo text>), as determined by <urlcp charsettext> and any conversion errors.

If the source-to-text charset conversion fails altogether (non-zero exit code), <urlinfo charsettext> will probably be left the same as the source charset (i.e. not UTF-8). However, if it only partially fails (as in this case), <urlinfo charsettext> will be UTF-8, but some or all of the characters in the text may not be valid. <fetch> assumes that a partial failure is still mostly good from a display standpoint (i.e. in an HTML browser), and in general it has no way of knowing how bad the failure was anyway (typically only a character or two is affected).

To check if the <urlinfo text> is 100% valid UTF-8 in such cases, you could try to convert it back to ISO-8859-1 and see if any errors occur:

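<urlinfo text><$orgtxt = $ret>
<!-- count the question marks already present in the text -->
<rex row "\?=" $orgtxt></rex><$orgcount = $loop>
<!-- decode the UTF-8, then count the question marks again -->
<strfmt "%!hV" $orgtxt>
<rex row "\?=" $ret></rex><$newcount = $loop>
<if $newcount gt $orgcount>
Text contains invalid UTF-8 sequences
<else>
Text is ok
</if>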

This code counts the number of question-marks before and after UTF-8 decoding: if there are more after decoding, then the text has invalid UTF-8 sequences.
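For anyone who wants to try the same counting idea outside Vortex, here is a rough Python sketch. It is only an illustration, not what <fetch> or <strfmt> does internally. The function name has_invalid_utf8 is made up for this example, and the Unicode replacement character U+FFFD stands in for the question mark, since that is what Python substitutes for bytes it cannot decode.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, what Python substitutes for undecodable bytes


def has_invalid_utf8(data: bytes) -> bool:
    """Return True if data contains byte sequences that are not valid UTF-8."""
    # Count replacement characters that are legitimately in the input already.
    before = data.count(REPLACEMENT.encode("utf-8"))
    # Decode leniently: invalid sequences are replaced by U+FFFD.
    after = data.decode("utf-8", errors="replace").count(REPLACEMENT)
    # More markers after decoding than before means something failed to decode.
    return after > before


print(has_invalid_utf8("中文".encode("utf-8")))   # valid UTF-8 input
print(has_invalid_utf8("中文".encode("gb2312")))  # GB2312 bytes read as UTF-8
```

Counting before and after, rather than just looking for the marker, avoids a false alarm when the page legitimately contains the marker character.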
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

I'm trying to understand the conversion to utf-8 during a fetch. The fetch determines the charset of the page and then converts from that charset to utf-8. Is this correct? Is this the call to the conversion function?

"C:\MORPH3\etc\iconv" -f gb2312 -t UTF-8 -c

Should I assume that utf-8 will not always be returned if the source charset cannot be determined? If invalid utf-8 is returned, are there any steps to make sure that only utf-8 is returned?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Yes, <fetch> determines the source charset (from headers/meta/guess) and converts it to UTF-8 (the selected text charset).

Yes, iconv is used for the conversion, if the charset is not an internally-parsable charset.

Yes, if the source charset cannot be determined, or can be determined but not translated, the returned (text) charset will be the source charset instead. (You can check the text charset with <urlinfo charsettext>.)

Even if the text charset is UTF-8, it may still contain invalid characters, e.g. if the translation partially failed (with a message) or the source was already UTF-8 but had invalid characters. The only way to make sure the text is 100% valid UTF-8 is to check that <urlinfo charsettext> is "UTF-8" and also to verify the text with the procedure in message #12 (the question-mark count shown earlier).
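The three outcomes described above (clean conversion, unknown charset, partial failure) can be sketched roughly in Python. This is not how <fetch> is implemented. The function name convert_to_utf8 and the "clean" flag are invented for the example, and Python's built-in codecs stand in for iconv. One difference to keep in mind: this sketch drops bytes it cannot convert, much as iconv's -c flag omits unconvertible characters, so its partial-failure output is lossy but still valid UTF-8.

```python
from typing import Tuple


def convert_to_utf8(raw: bytes, source_charset: str) -> Tuple[bytes, str, bool]:
    """Return (text bytes, charset of those bytes, True if conversion was clean)."""
    try:
        return raw.decode(source_charset).encode("utf-8"), "UTF-8", True
    except LookupError:
        # Charset name is unknown: leave the bytes alone and report the source charset.
        return raw, source_charset, False
    except UnicodeDecodeError:
        # Partial failure: drop what cannot be converted, much like iconv's -c flag.
        text = raw.decode(source_charset, errors="ignore")
        return text.encode("utf-8"), "UTF-8", False


page = "中文".encode("gb2312")
for charset in ("gb2312", "no-such-charset"):
    text, text_charset, clean = convert_to_utf8(page, charset)
    print(charset, "->", text_charset, "clean" if clean else "not clean")
```

A caller would check the returned charset first, and treat a "not clean" UTF-8 result as text that may have lost characters.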
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

Is there a way for me to take the text from a page in its original charset and store that in Texis without converting to UTF-8?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

You mean prevent <fetch> from changing the charset? You can set <urlcp charsettext source> to make the text (formatted) charset the same as the source (original) charset. But then you'll have to know/assume/store the original charset somewhere too, so the text can be converted/displayed properly later.
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

Does the fetch actually convert the text to UTF-8? I want to store both the return value from fetch and the result of urltext in the database in their original charset. Can I do this without fetch changing the charset to UTF-8? Or, as you stated, would I have to do one fetch to determine the charset and then a 2nd fetch using <urlcp charsettext source> to set the charset to the one retrieved during the 1st fetch?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

You do not need to do 2 <fetch>es in order to store the raw <fetch> result (HTML) as well as <urltext> both in their original (source, on-the-web-server) charset. The $ret from <fetch> (or <urlinfo rawdoc>, the same thing) is always in the source charset regardless of <urlcp> settings.
To keep <urltext> in the source charset too (instead of UTF-8), just set <urlcp charsettext source>.

There's no 3rd arg (charset) needed for <urlcp charsettext source>, ie. you don't have to do a pre- <fetch> to determine the actual charset name and then set that. `source' is a "special" charset for the `charsettext' setting that just means "same-as-the-source-HTML-charset-whatever-it-may-be" for precisely this situation.

My comment about storing the charset somewhere was only because you'll almost certainly want to know that charset later, now that the text isn't always going to be UTF-8. But you don't need a 2nd <fetch> for that; just save <urlinfo charsettext> along with <urltext>.

(One caveat: there is a chance, however unlikely, that <urltext> will still not be 100% in the source charset even with this setting. This is because the parser for <urltext> must first map the text to UTF-8 for parsing, and then back to the desired charset for the <urltext> $ret. So if the source charset is unknown to <fetch>/iconv, one or both of those mappings may fail. Of course, the text will then probably be left in the original charset at parse anyway, so it's mainly a theoretical issue.)
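To illustrate why the charset name should be saved next to the text, here is a small Python sketch. It is not Vortex or Texis code. The StoredPage record and the text_for_display helper are hypothetical names for this example, standing in for whatever table row you store the fetch results in.

```python
from dataclasses import dataclass


@dataclass
class StoredPage:
    raw_html: bytes   # raw fetch result, in the source charset
    text: bytes       # formatted text, kept in the source charset
    charset: str      # charset name saved at fetch time


def text_for_display(page: StoredPage) -> str:
    """Decode the stored text using the charset saved with it."""
    # errors="replace" keeps display working even if a few bytes are bad.
    return page.text.decode(page.charset, errors="replace")


page = StoredPage(
    raw_html="<html><body>中文</body></html>".encode("gb2312"),
    text="中文".encode("gb2312"),
    charset="gb2312",
)
print(ascii(text_for_display(page)))
```

Without the charset field, the stored bytes would have to be guessed at display time, which is exactly the situation the saved charset avoids.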
sourceone
Posts: 47
Joined: Tue Mar 29, 2005 2:10 pm

Post by sourceone »

Is <urlinfo charsettext> the same as <urlinfo contenttypeparam "charset">?