Encoding Type used to store input

Post by **sourceone** » Mon Jun 19, 2006 5:07 pm

What happens if the text doesn't have any question-marks before decoding?

Here is the URL - http://club.music.yule.sohu.com/r-oldso ... -10-0.html

Post by **sourceone** » Sun Sep 24, 2006 10:47 pm

I'm trying to understand the conversion to utf-8 during a fetch. The fetch determines the charset of the page and then converts from that charset to utf-8. Is this correct? Is this the call to the conversion function?

"C:\MORPH3\etc\iconv" -f gb2312 -t UTF-8 -c

Should I assume that UTF-8 is not guaranteed when the source charset cannot be determined? If invalid UTF-8 is returned, are there any steps to make sure that only valid UTF-8 is returned?

Post by **Kai** » Mon Sep 25, 2006 3:22 pm

Yes, <fetch> determines the source charset (from headers/meta/guess) and converts it to UTF-8 (the selected text charset).

Yes, iconv is used for the conversion, if the charset is not an internally-parsable charset.

Yes, if the source charset cannot be determined, or can be determined but not translated, the returned (text) charset will be the source charset instead. (You can check the text charset with <urlinfo charsettext>.)
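A rough sketch of that fallback behavior, written in Python rather than Vortex and not taken from the Texis source: `to_utf8` is a hypothetical helper, and Python's codec registry stands in for the internal parsers and iconv. Unlike <fetch>, this sketch treats any decoding error as a failed translation instead of returning partially converted text.

```python
import codecs
from typing import Optional, Tuple


def to_utf8(raw: bytes, source_charset: Optional[str]) -> Tuple[bytes, str]:
    """Return (text, text_charset).

    Hypothetical helper: converts to UTF-8 when the source charset is
    known and translatable, otherwise hands back the bytes unchanged
    and labelled with the source charset.
    """
    if source_charset is None:
        # Charset could not be determined: leave the bytes alone.
        return raw, "unknown"
    try:
        codecs.lookup(source_charset)
    except LookupError:
        # Charset is named but no converter exists for it.
        return raw, source_charset
    try:
        text = raw.decode(source_charset)
    except UnicodeDecodeError:
        # Assumption of this sketch: a failed translation keeps the source.
        return raw, source_charset
    return text.encode("utf-8"), "UTF-8"
```

The caller checks the returned charset label instead of assuming UTF-8, which mirrors checking <urlinfo charsettext> after a <fetch>.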

Even if the text charset is UTF-8, it may still contain invalid characters, e.g. if the translation partially failed (with a message) or the source was already UTF-8 but had invalid characters. To be sure the text is 100% valid UTF-8, check that <urlinfo charsettext> is "UTF-8" and also verify the text with the procedure in message #12.
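As an illustration outside Vortex, a strict validity check can be sketched in Python. This is not the procedure from message #12, which is not reproduced in this thread; `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if every byte sequence in data is well-formed UTF-8."""
    try:
        # decode() is strict by default and raises on the first bad sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A charset label of "UTF-8" plus a passing check of this kind is the combination needed; either one alone is not enough.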

Post by **sourceone** » Sat Sep 30, 2006 7:51 am

Is there a way for me to take the text from a page in its original charset and store that in Texis without converting to UTF-8?

Post by **Kai** » Mon Oct 02, 2006 6:08 pm

You mean prevent <fetch> from changing the charset? You can set <urlcp charsettext source> to make the text (formatted) charset the same as the source (original) charset. But then you'll have to know/assume/store the original charset somewhere too, so the text can be converted properly for display.

Post by **sourceone** » Mon Oct 02, 2006 9:45 pm

Does the fetch actually convert the text to UTF-8? I want to be able to store the return value from fetch and also the result of urltext in the database in their original charset. Can I do this without fetch changing the charset to UTF-8? Or, as you stated, would I have to do one fetch to determine the charset and then a 2nd fetch using <urlcp charsettext source> to set the charset to the one retrieved during the 1st fetch?

Post by **Kai** » Wed Oct 04, 2006 6:16 pm

You do not need to do 2 <fetch>es in order to store the raw <fetch> result (HTML) as well as <urltext> both in their original (source, on-the-web-server) charset. The $ret from <fetch> (or <urlinfo rawdoc>, the same thing) is always in the source charset regardless of <urlcp> settings.
To keep <urltext> in the source charset too (instead of UTF-8), just set <urlcp charsettext source>.

There's no 3rd arg (charset) needed for <urlcp charsettext source>, i.e. you don't have to do a pre-<fetch> to determine the actual charset name and then set that. `source' is a "special" charset for the `charsettext' setting that just means "same-as-the-source-HTML-charset-whatever-it-may-be" for precisely this situation.

My comment about storing the charset somewhere was only because you'll almost certainly want to know that charset later, now that the text isn't always going to be UTF-8. But you don't need a 2nd <fetch> for that; just save <urlinfo charsettext> along with <urltext>.

(One caveat: there is a chance, however unlikely, that <urltext> will still not be 100% in the source charset even with this setting. This is because the parser for <urltext> must first map the text to UTF-8 for parsing, and then back to the desired charset for the <urltext> $ret. So if the source charset is unknown to <fetch>/iconv, one or both of those mappings may fail. Of course, then the text will probably be left in the original charset at parse anyway, so it's mainly a theoretical issue.)
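The "save the charset along with the text" idea can be sketched in Python. This is an illustration only, not Texis code: `StoredPage` and `display_text` are hypothetical names, and the three fields stand in for the <fetch> $ret, <urltext> and <urlinfo charsettext> values.

```python
from dataclasses import dataclass


@dataclass
class StoredPage:
    raw_html: bytes      # raw document, always in the source charset
    text: bytes          # extracted text, kept in the source charset
    text_charset: str    # charset name saved next to the text


def display_text(page: StoredPage) -> str:
    """Decode the stored text using the charset recorded with it."""
    # errors="replace" keeps display working if a few bytes do not
    # match the recorded charset (the caveat described above).
    return page.text.decode(page.text_charset, errors="replace")
```

Because the charset name travels with each row, no second fetch is needed later to find out how to decode the text.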

Post by **sourceone** » Wed Oct 11, 2006 11:07 am

Is <urlinfo charsettext> the same as <urlinfo contenttypeparam "charset">?

Post by **mark** » Wed Oct 11, 2006 12:32 pm

No. contenttypeparam is the charset specified in the downloaded page. charsettext is the charset of the extracted text. The two can differ.

Post by **aitchon** » Thu Feb 08, 2007 1:17 pm

If I detect that <urltext> produces invalid UTF-8 data, is there a way to replace all invalid UTF-8 characters with a valid character like a space?
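Outside Vortex, one way to sketch this kind of scrubbing is a custom decode error handler in Python. This is an illustration only, not a Texis feature: the handler name "space" and the `scrub_utf8` helper are made up for the example.

```python
import codecs


def _space_handler(err):
    """Replace each undecodable byte run with one space, then resume."""
    if isinstance(err, UnicodeDecodeError):
        # err.end is the index just past the invalid bytes.
        return (" ", err.end)
    raise err


# "space" is a hypothetical handler name registered for this sketch.
codecs.register_error("space", _space_handler)


def scrub_utf8(raw: bytes) -> str:
    """Decode raw as UTF-8, substituting a space for invalid sequences."""
    return raw.decode("utf-8", errors="space")
```

For example, `scrub_utf8(b"abc\xffdef")` gives `"abc def"`, while valid UTF-8 input passes through unchanged.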