Accented Noise words and synonyms

edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Hi,

it seems like French accented characters are giving me more trouble than anything else. I got the equivalence file to work (I created a customized "eqvsusr" file), but when I include French words with accented characters in the list, backref.exe runs fine but the search does not pick up the equivalence.

For example, if I have in my eqvsusr.lst

montréal,montreal
grâce,grace,honor

the results will not include any of the equivalent words. I typed in the accented é,â in wordpad by using "alt+131" and "alt+133".

I also modified the $noiselist included in the search script, but it does not pick up anything with accents. For example, I want the search script to treat "à" as a noise word. I tried
<$noiselist="à" "&agrave;" "\xC3\xA0" "\xE0">
<apicp noise $noiselist>

and it does not work.

Is there a special format for accented characters to be included in thesaurus and noiselist? Thanks.
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Make sure that the accented characters are in the wordc and langc settings, otherwise the search terms will not be seen as "language" words to be processed against the thesaurus and noise list.
John Turnbull
Thunderstone Software
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

I'm confused - what are the wordc and langc settings and where can I find them? Thanks.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Thanks guys. I'm not sure if accented characters would be included in "\alpha", but I put the SQL statements in the <search> function:

<A NAME=search>
<local savenext saveindexcount>

<sql "set wordc='[\alnum]'"></sql>
<sql "set langc='[\alnum]'"></sql>
...

and I also included them in the <init> function. Neither worked. I'm not sure whether the accented characters count as alpha characters, or whether I placed the SQL statements in the wrong place. I'm using UTF-8 as the character set for searching. Thanks again for any help you can provide...
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

They should be in the fpar function like the other sets. You've made them more restrictive, not less. Put space and - in langc as in the default case. Also try adding all the high-bit characters to them both.

<sql "set wordc='[\alnum\x80-\xff'']'"></sql>
<sql "set langc='[\alnum\x80-\xff'' \-]'"></sql>

Make sure you include the utf-8 version of montréal in your equiv.
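
As a rough illustration of why the high-bit range matters, here is a Python analogy using byte-level regular expressions. It is not Texis's actual tokenizer, and the pattern names and `split_words` helper are made up for the example. In UTF-8, é is stored as two bytes, both in the \x80-\xff range, so a character class limited to ASCII letters and digits breaks the word at the accent.

```python
import re

# Assumption: these byte classes only mimic the idea behind wordc;
# the real matching rules in Texis may differ.
ASCII_ONLY = re.compile(rb"[A-Za-z0-9]+")
WITH_HIGH_BITS = re.compile(rb"[A-Za-z0-9\x80-\xff']+")

def split_words(pattern, text):
    """Return the byte runs that the given character class treats as words."""
    return pattern.findall(text.encode("utf-8"))

query = "montréal"
print(split_words(ASCII_ONLY, query))      # [b'montr', b'al']
print(split_words(WITH_HIGH_BITS, query))  # [b'montr\xc3\xa9al']
```

With the narrower class the term never reaches the lookup as a whole word, which would be consistent with the equivalence not being picked up.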
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Thanks Mark!

I changed the settings to fpar, but I'm not sure what you mean by including the utf-8 version in my equiv. In my eqvsusr.lst, I have for example:

montréal,Montreal,Montréal,montreal

The accents are entered in WordPad using the Alt+130 key. By including the utf-8 version, do I have to translate the words into UTF-8 codes, such as:

montr\xC3\xA9al,montreal

or every single letter in utf-8 code?
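
One reading of "the utf-8 version" is that the accented letters in the file need to be stored as UTF-8 bytes, not typed as escape sequences, while plain ASCII words stay exactly as they are. The Python sketch below assumes the list was saved by the editor in Windows-1252. Both that assumption and the `to_utf8` helper name are illustrative only.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw re-encoded as UTF-8, leaving already-valid UTF-8 untouched."""
    try:
        raw.decode("utf-8")          # already valid UTF-8 (or pure ASCII)
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is in the fallback code page.
        return raw.decode(fallback).encode("utf-8")

line_cp1252 = "montréal,montreal".encode("cp1252")
print(line_cp1252)           # b'montr\xe9al,montreal'
print(to_utf8(line_cp1252))  # b'montr\xc3\xa9al,montreal'
```

Only the é changes (one byte becomes two); the unaccented "montreal" is identical in both encodings, so there is no need to rewrite every letter.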
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Another question: after including the sql statements in my fpar function, I have my noiselist set to exclude the noise word "à", so I have

<$noiselist = "a" "à" "\xE0" "&Agrave;" "\xC3\xA0" ...>

to include every possible format of the letter à, but when I do a search it still does not filter out the à. What am I doing wrong?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

You can't use \x notation. You have to use the actual chars.
<$noiselist = "a" "à"> works for me.
You need to enter the words for noise and equivs in whatever form they are coming from the browser. You could use
{query=<fmt "%U" $query>}
to get an idea what's coming in. Or save them to a file with
<write append /tmp/queries><fmt "%s\n" $query></write>
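
To show what the URL-encoded form can reveal, here is a small Python sketch (not Vortex) of how the same letter looks depending on what the browser or a middle layer sent. The `url_form` helper is hypothetical, and the exact output of `%U` may differ in details.

```python
from urllib.parse import quote

def url_form(text: str, encoding: str) -> str:
    """Percent-encode text after converting it to bytes in the given encoding."""
    return quote(text, safe="", encoding=encoding)

print(url_form("à", "utf-8"))         # %C3%A0  (UTF-8 bytes)
print(url_form("à", "latin-1"))       # %E0     (single Latin-1 byte)
print(url_form("&agrave;", "utf-8"))  # %26agrave%3B  (HTML entity sent literally)
```

Whichever form shows up in the logged query is the form the noise and equivalence entries have to match.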
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Thanks Mark! I figured out the problem - our Java encoder was encoding the incoming utf-8 twice, so instead of putting "à" in the noiselist I have to put "Ã", then it works. Thank you for all your help!
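
For anyone hitting the same thing, the double encoding can be reproduced with a short Python sketch. It assumes the extra layer reads the UTF-8 bytes as Latin-1 before encoding again. The actual Java encoder is not shown in this thread, so that detail and the function names are guesses.

```python
def double_encode(text: str) -> str:
    """Mimic a layer that reads UTF-8 bytes as Latin-1 text."""
    return text.encode("utf-8").decode("latin-1")

def undo_double_encode(text: str) -> str:
    """Reverse the step above; raises if text was not garbled this way."""
    return text.encode("latin-1").decode("utf-8")

garbled = double_encode("à")
print(repr(garbled))                # 'Ã\xa0' (Ã followed by a no-break space)
print(garbled.encode("utf-8"))      # b'\xc3\x83\xc2\xa0' (what reaches the script)
print(undo_double_encode(garbled))  # à
```

The no-break space after the Ã is invisible in most editors, which makes this kind of mismatch easy to overlook.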