French accents normalization

edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Hi,

We have some French records in the database, and I've noticed that when a user searches for a French word, with or without an accent, only that exact spelling is retrieved. For example, entering "Montreal" will only return "montreal", but not "montréal"; and entering "montréal" will only return "montréal", not "montreal". We used to have the AltaVista search engine, and it was able to normalize French accents, returning both "montreal" and "montréal" regardless of what users entered.

We have to index words with accents, so removing accents in the database is not an option. Is there any way to make the "search" script search for both letters when a user enters é, à, or ê in a French word? Can you do a replace in a query string, ie. ask the search function to search for the phrase both with the accented character and without, then combine the results?

I'd appreciate any help or tips. Thank you.
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Which version of Webinator are you using? What are the values for Storage Charset and Display Charset (if applicable)?

Post by edev »

I have commercial Webinator 5.1.24. Storage Charset is the default (ISO-8859-1) and Display Charset is UTF-8. The search engine is hosted on a Windows 2000 server.

Post by Kai »

With Storage Charset set to ISO-8859-1, you could change the locale to one that maps all variants of `e' (upper/lower/accent/etc.) to `e' (lower-case) when generating the lower-case equivalent. Then either spelling of `Montreal' as a query would find either spelling in the text, as queries and text are both lower-cased for searching. The locale can be set in the dowalk and search scripts with:

<sql "set locale='xxx'"></sql>

The problem is that no standard locale (English/Spanish/French) would remove accents when lower-casing letters; you'd have to create your own, which I don't have information on.

The other possibility, which would not solve the issue universally but on a word-by-word basis, would be to create a user equivalence file for the variant spellings of `Montreal', and then use that in the search script. See http://thunderstone.master.com/texis/ma ... +file&s=.1
for how to create an equiv file (you may need to copy the monitor.exe program to backref.exe to create it). Then tell the search script to use it by adding this to the end of the <a name=getapisettings> function:

<apicp ueqprefix "c:\path\to\your\equivfile">

Post by edev »

Hi Kai, I'll try the second suggestion and hopefully it will work. Thanks very much!

Post by edev »

Hi again,

I read the help files for "Customizing Thesaurus", which have many references to creating your own equivfile. I thought you had to have full Texis to make changes to the thesaurus. We have commercial Webinator, not the full version of Texis; would we still be able to create our own equivfile?

I found monitor.exe. To reindex my new equivfile, do I need to rename it to "backref.exe" and run it?

One crucial question: there are so many French words with accents that creating a table of all of them would be lengthy and inefficient. Is there any way I can loop through the query string the user has entered, extract each accented character (ie. é, à), replace it with the plain English character, then search for both the original string and the new string with a SQL statement such as

<$sql="from html
where Title\Description\Keywords\Meta\Body " $liketype " $$sq or $$sq_new
and Title like $$stq
and Url matches $$suq
and Depth <= $$sdq
and Catno matches $$scq
">

Would that work?

Thanks in advance.

Post by Kai »

You can create custom equivs with Webinator. Do not rename monitor.exe to backref.exe: make sure to copy it instead, as monitor.exe is still needed for Webinator.

You could use <sandr> to replace the accented characters. Eg. to replace small `e' with an acute accent (é) with `e':

<sandr "\xE9" "e" $sq><$sq_new = $ret>

Use similar <sandr> calls for the other possible characters.
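
For instance, a chain of calls might look like the sketch below. It is untested and only uses the ISO-8859-1 codes for é, è and à, so extend it with whichever characters you need. Each <sandr> leaves its result in $ret, so every call after the first reads $ret rather than $sq:

```
<sandr "\xE9" "e" $sq>
<sandr "\xE8" "e" $ret>
<sandr "\xE0" "a" $ret>
<$sq_new = $ret>
```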

But there are problems with simply ORing a LIKE for the two variations. The main one is that the OR is a bit slower and may affect pagination, because the index is used twice. Another is that some results may be missed, eg. if two accented words are being searched for, results with one of the words accented but the other not will be missed.

You could try instead to keep one query ($sq) in the SQL, but replace the accented words with an equiv list containing the unaccented version. Eg. if the user searches for `Montréal Canada' you replace it with `(Montréal,Montreal) Canada'. You can use <fmtcp sandcall> to search for accented words, and in the callback function, do the accent removal, then print both versions in parens (this code is untested):

<a name=fixword hit>
<local fixed>
<sandr "\xE9" "e" $hit>
... more <sandr>s on $ret for other chars ...
<fmt "(%s,%s)" $hit $ret>
</a>

<a name=fixquery sq>
<fmtcp sandcall noesc "[^\space,\x22()]*>>[\x80-\xFF]=[^\space,\x22]*" fixword>
<capture><sb>$sq</sb></capture>
<$sq = $ret>
</a>

Call <fixquery sq=$&sq> to fix up the query, then do a search as usual.

Post by edev »

Thank you so much Kai, you've been a great help. I will catch up on more reading so I can understand your solution better, and I'll let you know if it works.

Thank you very much!

Post by edev »

Hi Kai,

this is what I did:

<a name=normword hit>
<local fixed>
<sandr "\xE9" "t" $hit>
<!----- ... more <sandr>s on $ret for other chars ... ----->
<sandr "\xC9" "\x45" $hit>
<sandr "\xEA" "\x65" $hit>
<sandr "\xEB" "\x65" $hit>
<sandr "\xEE" "i" $hit>
<sandr "\xEF" "i" $hit>
<sandr "\xE0" "a" $hit>
<sandr "\xE2" "a" $hit>
<sandr "\xC7" "C" $hit>
<sandr "\xE7" "c" $hit>
<sandr "\xE8" "\x65" $hit>
<!--- rewrite the accented word as an equiv list: (original_word,new_word) -->
<$normword=$ret>
<fmt "(%s,%s)" $hit $normword>

</a>

<a name=normquery sq>
<fmtcp sandcall noesc "[^\space,\x22()]*>>[\x80-\xFF]=[^\space,\x22]*" normword>
<capture><sb>$sq</sb></capture>
<$sq = $ret>
<html>the query is $sq</html>

</a>

<A NAME=main public>
<top><!-- top of page boilerplate -->
<normquery sq=$&query>
<search><!-- perform search -->
<bottom><!-- bottom of page boilerplate -->
<flush><!-- send everything to user while we continue locally -->
...

</A>

The problem is, the "sandr" function is not replacing accented characters with normal characters. I tried using both the letter "e" and its hex value "\x65". When I displayed the fixed query on screen I got exactly the word I entered: (montréal,montréal) instead of (montréal,montreal). Any idea why?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

You keep sandring the original $hit instead of the result of sandr, $ret. The way you have it only the last sandr will have any effect.
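
A sketch of normword with the calls chained on $ret is below. It is untested and assumes the same ISO-8859-1 byte values used in the post above; the local variable name $plain is just for illustration. It also maps \xE9 to "e", since the "t" in the posted version looks like a typo:

```
<a name=normword hit>
<local plain>
<!-- the first call reads $hit; every later call reads $ret from the previous one -->
<sandr "\xE9" "e" $hit>
<sandr "\xC9" "E" $ret>
<sandr "\xE8" "e" $ret>
<sandr "\xEA" "e" $ret>
<sandr "\xEB" "e" $ret>
<sandr "\xEE" "i" $ret>
<sandr "\xEF" "i" $ret>
<sandr "\xE0" "a" $ret>
<sandr "\xE2" "a" $ret>
<sandr "\xC7" "C" $ret>
<sandr "\xE7" "c" $ret>
<$plain = $ret>
<!-- print the original word and the unaccented word as an equiv list -->
<fmt "(%s,%s)" $hit $plain>
</a>
```

With the calls chained this way, a hit of montréal should come out as (montréal,montreal), because the é replaced by the first call is carried through $ret to the end instead of being discarded.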