Accented Noise words and synonyms

edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Hi,

It seems like French accented characters are giving me more trouble than anything else. I got the equivalence file to work (created a customized "eqvsusr" file), but when I include French words with accented characters in the list, backref.exe runs fine but the search does not pick up the equivalence.

For example, if I have in my eqvsusr.lst

montréal,montreal
grâce,grace,honor

the results will not include any of the equivalent words. I typed the accented é and â in WordPad using "alt+131" and "alt+133".

I also modified the $noiselist included in the search script, but it does not pick up anything with accents. For example, I want the search script to treat "à" as a noise word. I tried
<$noiselist="à" "&agrave;" "\xC3\xA0" "\xE0">
<apicp noise $noiselist>

and it does not work.

Is there a special format for accented characters to be included in thesaurus and noiselist? Thanks.
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

I'm confused - what are the wordc and langc settings and where can I find them? Thanks.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Thanks guys. I'm not sure if accented characters would be included in "\alpha", but I put the SQL statements in the <search> function:

<A NAME=search>
<local savenext saveindexcount>

<sql "set wordc='[\alnum]'"></sql>
<sql "set langc='[\alnum]'"></sql>
...

and I also included them in the <init> function. Neither placement worked. I'm not sure whether the accented characters count as alpha characters, or whether I put the SQL statements in the wrong place. I'm using UTF-8 as the search character set. Thanks again for any help you could provide...
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

They should be in the fpar function like the other sets. You've made them more restrictive, not less. Put space and - in langc as in the default case. Also try adding all the high-bit characters to them both.

<sql "set wordc='[\alnum\x80-\xff'']'"></sql>
<sql "set langc='[\alnum\x80-\xff'' \-]'"></sql>

Make sure you include the utf-8 version of montréal in your equiv.
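To make "the utf-8 version" concrete at the byte level, here is a minimal Python sketch (not Texis/Vortex code). It assumes the editor saved the list file in Windows-1252, which is only a guess about the WordPad setup; the helper name `is_valid_utf8` is made up for illustration.

```python
# Compare how "montréal" is stored in UTF-8 versus Windows-1252.
# Assumption: the equivalence file was saved as Windows-1252 ("ANSI").
word = "montréal"

utf8_bytes = word.encode("utf-8")      # é is stored as two bytes: C3 A9
cp1252_bytes = word.encode("cp1252")   # é is stored as one byte: E9

print(utf8_bytes)    # b'montr\xc3\xa9al'
print(cp1252_bytes)  # b'montr\xe9al'


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A file saved as Windows-1252 will not match UTF-8 query text byte for byte.
print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(cp1252_bytes))  # False
```

If the list file turns out to hold the single-byte form, re-saving it as UTF-8 in the editor is one way to get the two-byte form into the equiv.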
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Thanks Mark!

I changed the settings to fpar, but I'm not sure what you mean by including the utf-8 version in my equiv. In my eqvsusr.lst, I have for example:

montréal,Montreal,Montréal,montreal

The accents are entered in WordPad using the alt+130 key. By including the utf-8 version, do you mean I have to translate the words to utf-8 codes, such as:

montr\xC3\xA9al,montreal

or every single letter in utf-8 code?
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Another question: after including the sql statements in my fpar function, I set my noiselist to exclude the noise word "à", so I have

<$noiselist = "a" "à" "\xE0" "&Agrave;" "\xC3\xA0" ...>

to include every possible format of the letter à, but when I do a search it still does not filter out the à. What am I doing wrong?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

You can't use \x notation. You have to use the actual chars.
<$noiselist = "a" "à"> works for me.
You need to enter the words for noise and equivs in whatever form they are coming from the browser. You could use
{query=<fmt "%U" $query>}
to get an idea what's coming in. Or save them to a file with
<write append /tmp/queries><fmt "%s\n" $query></write>
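As a rough guide to reading URL-encoded output like that, the following Python sketch (not Vortex code) shows what "à" looks like when percent-encoded from UTF-8 versus Latin-1. It assumes the logged query uses standard percent-encoding; the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

term = "à"

# Percent-encoded form when the browser sends UTF-8 (two bytes).
as_utf8 = quote(term, encoding="utf-8")
# Percent-encoded form when the browser sends Latin-1 (one byte).
as_latin1 = quote(term, encoding="latin-1")

print(as_utf8)    # %C3%A0
print(as_latin1)  # %E0

# Going the other way: decode a logged value to see the actual characters.
print(unquote("%C3%A0", encoding="utf-8"))  # à
```

Whichever form shows up in the log is the form the noise and equiv entries need to be stored in.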
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Thanks Mark! I figured out the problem - our Java encoder was encoding the incoming utf-8 twice, so instead of putting "à" in the noiselist I have to put "Ã", then it works. Thank you for all your help!
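For anyone hitting the same symptom, double encoding can be reproduced in a few lines of Python. This is only a sketch of the general effect, assuming the encoder misread the UTF-8 bytes as Latin-1 before encoding them again; the actual Java code is not shown in this thread.

```python
original = "à"

# Step 1: correct UTF-8 bytes for "à" are C3 A0.
utf8_bytes = original.encode("utf-8")

# Step 2 (the bug, as assumed here): those bytes are misread as Latin-1,
# giving "Ã" followed by a non-breaking space (U+00A0).
mis_decoded = utf8_bytes.decode("latin-1")

# Step 3: the misread text is encoded to UTF-8 a second time.
double_encoded = mis_decoded.encode("utf-8")

print(repr(mis_decoded))  # 'Ã\xa0'
print(double_encoded)     # b'\xc3\x83\xc2\xa0'

# Reversing the two steps recovers the original character.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)           # à
```

Note that the second character is an invisible non-breaking space, which is easy to lose when copying "Ã" into a list file.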