Unicode Support
skalyanaraman
Posts: 109
Joined: Tue May 29, 2001 9:13 pm

Post by skalyanaraman »

Hi,
We are trying to find out how to load, index, and search Unicode data. Could you please point me to some material, or explain how this is handled in Texis? We mainly need information on sorting, word boundaries, and similar issues in Unicode data.
Specifically, how are Japanese, Chinese, Korean, Farsi, Cyrillic, Greek, and other non-Roman-alphabet languages indexed and searched?

Any help is greatly appreciated.
Thanks In Advance!!
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

For extended multi-byte character set data, storing as UTF-8 is probably best, as it avoids the nul-truncation in varchar fields that would happen with UTF-16 or other wide-character Unicode encodings. Texis stores varchar data as-is in SQL, so the character set/encoding you store is the one that will be returned. There are format codes in Vortex to transform data to/from ISO-8859, UTF-16, UTF-8, etc. if needed before insert; see the extended codes for <strfmt> in the online manual.
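To illustrate the nul-byte point outside of Texis, here is a minimal Python sketch (not Texis or Vortex code; the sample string and the `has_nul` helper are made up for illustration). It only shows that UTF-16 output contains zero bytes for ASCII-range characters, while UTF-8 output does not:

```python
# Hypothetical sample text mixing ASCII and CJK characters.
text = "Texis 日本語"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")


def has_nul(data: bytes) -> bool:
    """Return True if the encoded data contains a nul (0) byte."""
    return b"\x00" in data


print(has_nul(utf8))   # UTF-8 never emits a 0 byte unless the text contains U+0000
print(has_nul(utf16))  # UTF-16 pads ASCII-range characters with a 0 byte
```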

Since you're indexing CJK languages, another issue is word separation. CJK data doesn't always have separator characters like space between words, so the index expression and queries will have to be modified to be aware of this. To cover both, try setting index expressions like this:

set delexp=0;
set addexp='[\alnum\x80-\xff]{2,99}';
set addexp='[\xc0-\xfd]=[\x80-\xbf]+';

The first expression covers English and other Roman-alphabet languages; the second covers languages such as CJK that run words together, by treating each UTF-8 sequence (one character) as a single word.
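As a rough way to see what the two expressions pick out, here is a Python approximation using the `re` module on UTF-8 bytes. This is only a sketch, not how Texis builds its index: it assumes `\alnum` means ASCII letters and digits, and the `index_terms` helper is hypothetical.

```python
import re

# Approximation of '[\alnum\x80-\xff]{2,99}': runs of 2 to 99 ASCII
# alphanumerics or high-bit bytes (assumes \alnum is ASCII-only).
LATIN_WORD = re.compile(rb"[0-9A-Za-z\x80-\xff]{2,99}")

# Approximation of '[\xc0-\xfd]=[\x80-\xbf]+': one UTF-8 lead byte
# followed by its continuation bytes, i.e. one multi-byte character.
UTF8_SEQ = re.compile(rb"[\xc0-\xfd][\x80-\xbf]+")


def index_terms(text: str) -> list[bytes]:
    """Return the byte strings each expression would match, in order."""
    data = text.encode("utf-8")
    return LATIN_WORD.findall(data) + UTF8_SEQ.findall(data)


for term in index_terms("search 日本"):
    print(term.decode("utf-8"))
```

With both expressions active, a run of CJK characters is matched once as a whole run and again as individual characters, which is why the single-character terms are available for phrase matching.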

You'll also want to modify CJK search queries to require all the "words" to be adjacent (a phrase):

<!-- Double-quote each group of UTF-8 entities to make a phrase: -->
<sandr "[\x80-\xfd]+" '\x22\1\x22' $query>
<!-- Separate each UTF-8 entity with space to make each a word: -->
<sandr "[\xc0-\xfd]=[\x80-\xbf]+" "\x20\1\2" $ret>
<!-- Remove leading spaces from phrases: " x" becomes "x": -->
<sandr '\x22\P=\x20=\F[\xc0-\xfd]=' "" $ret>
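To make the effect of the three substitutions easier to follow, here is the same sequence sketched in Python with the `re` module. It is an approximation for illustration, not Vortex code: the `phrase_cjk` name is hypothetical, and the REX `\P` and `\F` context markers are modeled as lookbehind and lookahead.

```python
import re


def phrase_cjk(query: str) -> str:
    """Turn each run of multi-byte characters in a query into a quoted phrase."""
    data = query.encode("utf-8")
    # Step 1: wrap each run of non-ASCII bytes in double quotes.
    data = re.sub(rb"[\x80-\xfd]+", lambda m: b'"' + m.group(0) + b'"', data)
    # Step 2: put a space before each UTF-8 sequence (lead byte plus
    # continuation bytes) so every character becomes its own word.
    data = re.sub(rb"([\xc0-\xfd])([\x80-\xbf]+)", rb" \1\2", data)
    # Step 3: drop the space that now follows each opening quote.
    data = re.sub(rb'(?<=")\x20(?=[\xc0-\xfd])', b"", data)
    return data.decode("utf-8")


print(phrase_cjk("日本語 test"))
```

ASCII terms pass through unchanged, while each CJK run becomes a phrase of single-character words.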
sabety
Posts: 76
Joined: Wed Dec 06, 2000 7:11 am

Post by sabety »

Hi folks, been a while.

Do you have to create a Unicode database in Texis in order to store Unicode data, as one might in Oracle or MySQL? If so, what is the syntax?

If not, does Texis always handle VARCHAR columns the way Oracle handles its NCHAR, NVARCHAR2, and NCLOB data types?

What about inadvertently mixing Unicode and ISO data in the same column?


thanks
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

You can store any encoding in Texis's varchar. It's up to you to know/remember the encoding for later presentation.
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

More specifically, you can store any encoding that does not use nul (0) bytes in a varchar; if the data uses nul bytes you should use a varbyte field. So ASCII, ISO-8859-1, Windows-1251 and Unicode stored as UTF-8 are ok in a varchar. For Unicode as UTF-16 or wide-char, you'd need to convert to UTF-8 first or store in a varbyte field.

For example, Webinator 5.0.5 stores data (typically as UTF-8) in the varchar Body column, and notes the character set in the Charset column in case the encoding differs for a particular row (e.g. unknown character set).
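The same idea, remembering the encoding per row and using it at presentation time, might look like this in application code. This is a hedged Python sketch: the `rows` list stands in for fetched rows, and `body_text` is a hypothetical helper; only the Body and Charset column names come from the description above.

```python
# Hypothetical fetched rows: raw bytes plus the charset noted for each row.
rows = [
    {"Body": "日本語".encode("utf-8"), "Charset": "UTF-8"},
    {"Body": "café".encode("iso-8859-1"), "Charset": "ISO-8859-1"},
]


def body_text(row: dict, default: str = "utf-8") -> str:
    """Decode a row's Body using its recorded Charset, falling back to default."""
    charset = row.get("Charset") or default
    try:
        return row["Body"].decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: decode as the default and mark bad bytes.
        return row["Body"].decode(default, errors="replace")


for row in rows:
    print(body_text(row))
```

Keeping the charset next to the data is also what makes a column with inadvertently mixed encodings recoverable later.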