Hi,
We are trying to find out how to load, index, and search Unicode data. Could you please point me to some material or give information on how these are handled in Texis? We mainly need information on sorting, word boundaries, and the like in Unicode data.
Specifically, how are Japanese, Chinese, Korean, Farsi, Cyrillic, Greek, and other non-Roman-alphabet languages indexed and searched?
Any help is greatly appreciated.
Thanks In Advance!!
For extended multi-byte character set data, storing as UTF-8 is probably best, as it avoids the nul-byte truncation in varchar fields that would occur with UTF-16 or raw wide-character Unicode. Texis stores varchar data as-is in SQL, so the character set/encoding you store in is the one returned. There are format codes in Vortex to transform data to/from ISO-8859, UTF-16, UTF-8, etc. if needed before insert; see the extended codes for <strfmt> in the online manual.
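To see why UTF-8 avoids the truncation problem, here is a short sketch in Python (used only as an illustration, not Texis code) comparing the bytes the two encodings produce:

```python
# Sketch (not Texis code): why UTF-8 is safer than UTF-16 for varchar
# fields that treat a nul (0) byte as end-of-string.

text = "héllo"  # sample mixing ASCII and non-ASCII characters

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# UTF-8 never emits a 0x00 byte except for an actual nul character...
assert b"\x00" not in utf8

# ...but UTF-16 encodes every ASCII character with a 0x00 high byte,
# so a nul-terminated varchar would truncate at the first one.
assert b"\x00" in utf16
print(utf16.index(b"\x00"))  # 1 -- 'h' is 0x68 0x00 in UTF-16-LE
```

The same text survives round-tripping through a nul-sensitive field in UTF-8, but would be cut to a single byte in UTF-16.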
Since you're indexing CJK languages, another issue is word separation. CJK text doesn't always have separator characters such as spaces between words, so the index expressions and queries have to be made aware of this. To cover both cases, try setting index expressions like this:
set delexp=0;
set addexp='[\alnum\x80-\xff]{2,99}';
set addexp='[\xc0-\xfd]=[\x80-\xbf]+';
The first expression covers English and other Roman-alphabet languages; the second covers languages without inter-word spacing, like CJK, by treating each multi-byte UTF-8 sequence as its own word.
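A rough Python equivalent (standard `re` on bytes, not Texis REX syntax) shows how the two expressions carve UTF-8 text into words:

```python
import re

# Sketch (Python re, not Texis REX): how the two index expressions
# above tokenize mixed Roman/CJK UTF-8 data.
data = "word1 word2 日本語".encode("utf-8")

# Roughly the first expression: runs of at least 2 alphanumeric or
# high (0x80-0xFF) bytes.
roman = re.findall(rb"[0-9A-Za-z\x80-\xff]{2,99}", data)

# Roughly the second expression: one UTF-8 lead byte (0xC0-0xFD)
# followed by its continuation bytes (0x80-0xBF) -- i.e. one "word"
# per multi-byte character.
cjk = re.findall(rb"[\xc0-\xfd][\x80-\xbf]+", data)

print([w.decode() for w in roman])  # ['word1', 'word2', '日本語']
print([w.decode() for w in cjk])    # ['日', '本', '語']
```

Note the first expression lumps the whole CJK run into one token, which is why the second expression is needed to index each character separately.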
You'll also want to modify CJK search queries to require all the "words" to be adjacent (a phrase):
<!-- Double-quote each group of UTF-8 entities to make a phrase: -->
<sandr "[\x80-\xfd]+" '\x22\1\x22' $query>
<!-- Separate each UTF-8 entity with space to make each a word: -->
<sandr "[\xc0-\xfd]=[\x80-\xbf]+" "\x20\1\2" $ret>
<!-- Remove leading spaces from phrases: " x" becomes "x": -->
<sandr '\x22\P=\x20=\F[\xc0-\xfd]=' "" $ret>
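The effect of the three <sandr> steps can be sketched in Python (again as an illustration with `re` on bytes, not Vortex syntax), applied to a hypothetical query mixing an English word and a CJK phrase:

```python
import re

# Sketch (Python re, not Vortex <sandr>) of the three query rewrites.
q = 'foo 日本語'.encode("utf-8")

# 1. Double-quote each run of UTF-8 bytes to make a phrase.
q = re.sub(rb'([\x80-\xfd]+)', rb'"\1"', q)

# 2. Put a space before each multi-byte UTF-8 sequence so each
#    character becomes its own "word".
q = re.sub(rb'([\xc0-\xfd][\x80-\xbf]+)', rb' \1', q)

# 3. Drop the space that step 2 left just inside an opening quote.
q = re.sub(rb'(?<=")\x20(?=[\xc0-\xfd])', b'', q)

print(q.decode("utf-8"))  # foo "日 本 語"
```

The result is a Metamorph-style query where the CJK characters are individual words required to be adjacent, while the Roman word is left alone.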
More specifically, you can store any encoding that does not use nul (0) bytes in a varchar field; if the data uses nul bytes, use a varbyte field instead. So ASCII, ISO-8859-1, Windows-1251, and Unicode stored as UTF-8 are all fine in a varchar. For Unicode as UTF-16 or wide characters, you'd need to convert to UTF-8 first or store it in a varbyte field.
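That rule can be expressed as a tiny hypothetical helper (plain Python, not part of Texis; the function name is made up for illustration):

```python
# Hypothetical helper (not a Texis API): choose a column type for a
# byte string under the rule above -- varchar only if it is nul-free.

def column_type(data: bytes) -> str:
    return "varbyte" if b"\x00" in data else "varchar"

print(column_type("résumé".encode("utf-8")))      # varchar
print(column_type("résumé".encode("utf-16-le")))  # varbyte
print(column_type(b"plain ASCII"))                # varchar
```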
For example, Webinator 5.0.5 stores page data (typically as UTF-8) in the varchar Body column, and notes the character set in the Charset column in case the encoding differs for a particular row (e.g. an unknown character set).