Use equivalences in Unicode

josmani
Posts: 53
Joined: Tue Jun 03, 2003 3:38 am

Post by josmani »

I am replacing the Texis thesaurus file with a smaller, more specific file and using backref to compile the new file for use in the site search. While this works perfectly on our English-language website, I have been trying to recreate the same nationalities for our Cyrillic (Unicode) version without any success.

The documentation mentions an ASCII equivalence file. Is there a way I can do this in Unicode?

Thanks.
Kai
Site Admin
Posts: 1271
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Case/diacritic insensitivity with Unicode (UTF-8) words in equivalences -- either explicitly via parenthetical syntax in the query, or via a thesaurus -- is not yet supported in Texis; it is planned for a future release (no target date yet). Only single-word sets are currently UTF-8 case/diacritic-insensitive. UTF-8 in equivs will probably be mangled.

For a single-set search, you might be able to emulate the equiv by manually translating it to a zero-intersect query, e.g. translate `(car,auto,vehicle)' to `car auto vehicle @0', where any of those words could be UTF-8. Of course, you would then have to look up and map the equiv yourself, and any other multi-word sets in the same query would have to be ASCII parenthetical or equiv sets (with an explicit `+' prefix to require them, to prevent grouping with `@0').
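
The manual translation described above could be done in whatever code builds the query before it reaches Texis. Below is a minimal Python sketch of that idea, not anything shipped with Texis: the `EQUIVS` table, its Cyrillic entries, and the function names `to_zero_intersect` and `expand_word` are all hypothetical, and the table is assumed to be keyed by casefolded words that you maintain yourself.

```python
import re

# Hypothetical equivalence table, keyed by casefolded word.
# In practice you would load this from your own thesaurus source.
EQUIVS = {
    "car": ["car", "auto", "vehicle"],
    "россиянин": ["россиянин", "русский"],
}

# Matches a single parenthetical set such as "(car,auto,vehicle)".
_PAREN_SET = re.compile(r"^\(([^()]+)\)$")


def to_zero_intersect(paren_set: str) -> str:
    """Translate '(a,b,c)' into the zero-intersect form 'a b c @0'."""
    match = _PAREN_SET.match(paren_set.strip())
    if match is None:
        raise ValueError(f"not a parenthetical set: {paren_set!r}")
    words = [w.strip() for w in match.group(1).split(",") if w.strip()]
    return " ".join(words) + " @0"


def expand_word(word: str, equivs=EQUIVS) -> str:
    """Look up a single word and emit a zero-intersect query for its set.

    Words with no entry (or a one-word entry) are returned unchanged.
    """
    words = equivs.get(word.casefold(), [word])
    if len(words) < 2:
        return word
    return " ".join(words) + " @0"


if __name__ == "__main__":
    print(to_zero_intersect("(car,auto,vehicle)"))
    print(expand_word("Россиянин"))
```

This only covers the single-set case. Any other multi-word sets in the same query would still need to be ASCII and carry an explicit `+' prefix, as noted above.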