Use equivalences in Unicode

Post by **josmani** » Tue Dec 03, 2013 5:41 am

I am replacing the Texis thesaurus file with a smaller, more specific file and using backref to compile the new file for use in the site search. While this works perfectly with our English-language website, I was trying to recreate the same nationalities for our Cyrillic (Unicode) version and am not having any success.

The documentation mentions an ASCII equivalence file. Is there a way I can do this in Unicode?

Thanks.

Post by **Kai** » Tue Dec 03, 2013 10:01 am

Case/diacritic insensitivity for Unicode (UTF-8) words in equivalences -- whether given explicitly via parenthetical syntax in the query or via a thesaurus -- is not yet supported in Texis; this is planned for a future release (no target date yet). Only single-word sets are currently UTF-8 case/diacritic-insensitive. UTF-8 in equivs will probably be mangled.

For a single-set search, you might be able to emulate the equiv by manually translating to a zero-intersect query, e.g. translate `(car,auto,vehicle)' to `car auto vehicle @0', where any of those words could be UTF-8. But of course you'd have to then look up and map the equiv yourself, and any other multi-word sets in the same query would have to be ASCII parenthetical or equiv sets (with an explicit `+' prefix to require them, to prevent grouping with `@0').
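The manual translation described above could be done in whatever layer builds the query before it reaches Texis. Here is a minimal Python sketch of that preprocessing step. It is not part of Texis: the function names and the dictionary-based thesaurus are hypothetical, the lookup keys are assumed to be stored case-folded, and multi-word terms are rejected rather than handled.

```python
def to_zero_intersect(terms):
    """Join single-word terms into one query string ending in @0."""
    unique = []
    for term in terms:
        term = term.strip()
        if not term:
            continue
        if any(ch.isspace() for ch in term):
            # Phrases are out of scope for this sketch.
            raise ValueError(f"multi-word term not handled: {term!r}")
        if term not in unique:  # drop duplicates, keep first-seen order
            unique.append(term)
    if not unique:
        raise ValueError("no terms to search for")
    return " ".join(unique) + " @0"


def paren_set_to_zero_intersect(paren_set):
    """Translate '(car,auto,vehicle)' into 'car auto vehicle @0'."""
    text = paren_set.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError(f"not a parenthetical set: {paren_set!r}")
    return to_zero_intersect(text[1:-1].split(","))


def expand_word(word, thesaurus):
    """Look up a word in our own mapping (hypothetical, not a Texis equiv file)."""
    equivalents = thesaurus.get(word.casefold(), [])
    return to_zero_intersect([word] + list(equivalents))


# Hypothetical Cyrillic thesaurus; keys are stored case-folded.
thesaurus = {"машина": ["автомобиль", "авто"]}

print(paren_set_to_zero_intersect("(car,auto,vehicle)"))
print(expand_word("машина", thesaurus))
```

The resulting string would be passed as the whole query. As noted above, this only covers a single set per query; any additional sets would still need to be ASCII and carry an explicit `+' prefix.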