Accented Noise words and synonyms
Posted: Tue Oct 11, 2005 1:06 pm
by edev
Hi,
it seems like French accented characters are giving me more trouble than anything else. I got the equivalence file to work (created a customized "eqvsusr" file), but when I include French words with accented characters in the list, backref.exe runs fine but the search does not pick up the equivalences.
For example, if I have in my eqvsusr.lst
montréal,montreal
grâce,grace,honor
the results will not include any of the equivalent words. I typed in the accented é,â in wordpad by using "alt+131" and "alt+133".
I also modified the $noiselist included in the search script, but it does not pick up anything with accents. For example, I want the search script to treat "à" as a noise word. I tried
<$noiselist="à" "à" "\xC3\xA0" "\xE0">
<apicp noise $noiselist>
and it does not work.
Is there a special format for accented characters to be included in thesaurus and noiselist? Thanks.
Accented Noise words and synonyms
Posted: Tue Oct 11, 2005 2:05 pm
by John
Make sure that the accented characters are in the wordc and langc settings, otherwise the search terms will not be seen as "language" words to be processed against the thesaurus and noise list.
Accented Noise words and synonyms
Posted: Tue Oct 11, 2005 2:25 pm
by edev
I'm confused - what are the wordc and langc settings and where can I find them? Thanks.
Accented Noise words and synonyms
Posted: Tue Oct 11, 2005 2:54 pm
by mark
http://thunderstone.master.com/texis/ma ... ml?q=wordc
Also make sure your equivs include the character sets being used.
Accented Noise words and synonyms
Posted: Tue Oct 11, 2005 3:33 pm
by edev
Thanks guys. I'm not sure if accented characters would be included in "\alpha", but I put the SQL statements in the <search> function:
<A NAME=search>
<local savenext saveindexcount>
<sql "set wordc='[\alnum]'"></sql>
<sql "set langc='[\alnum]'"></sql>
...
and I also included them in the <init> function. Neither placement worked. I'm not sure whether accented characters count as alpha characters, or whether I placed the SQL statements in the wrong place. I'm using UTF-8 as the search character set. Thanks again for any help you could provide...
Accented Noise words and synonyms
Posted: Tue Oct 11, 2005 3:54 pm
by mark
They should be in the fpar function like the other sets. You've made them more restrictive, not less. Put space and - in langc as in the default case. Also try adding all the high-bit characters to them both.
<sql "set wordc='[\alnum\x80-\xff'']'"></sql>
<sql "set langc='[\alnum\x80-\xff'' \-]'"></sql>
Make sure you include the utf-8 version of montréal in your equiv.
Accented Noise words and synonyms
Posted: Thu Oct 13, 2005 1:13 pm
by edev
Thanks Mark!
I changed the settings to fpar, but I'm not sure what you mean by including the utf-8 version in my equiv. In my eqvsusr.lst, I have for example:
montréal,Montreal,Montréal,montreal
The accents are entered in WordPad using the Alt+130 key. By including the UTF-8 version, do you mean I have to translate the words to UTF-8 code, such as:
montr\xC3\xA9al,montreal
or every single letter in utf-8 code?
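For reference, the difference between the "UTF-8 version" and a single-byte version of the same word can be seen with a few lines of Python. This is only an illustrative sketch outside Texis; it assumes the editor saved the file in a single-byte Western encoding (Latin-1 is used here as a stand-in), which may not match what WordPad actually wrote.

```python
# Compare the bytes of the same word under two encodings.
# Assumption: the single-byte file encoding is Latin-1 (a stand-in for
# whatever the editor really used).
word = "montréal"

utf8_bytes = word.encode("utf-8")      # é becomes two bytes: C3 A9
latin1_bytes = word.encode("latin-1")  # é becomes one byte: E9

print(utf8_bytes)    # b'montr\xc3\xa9al'
print(latin1_bytes)  # b'montr\xe9al'
```

Only the accented letter differs between the two forms; the plain ASCII letters are the same bytes in both encodings, so they need no translation.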
Accented Noise words and synonyms
Posted: Thu Oct 13, 2005 1:16 pm
by edev
Another question: after including the SQL statements in my fpar function, I set my noiselist to exclude the noise word "à", so I have
<$noiselist = "a" "à" "\xE0" "À" "\xC3\xA0" ...>
to include every possible format of the letter à, but when I do a search it still does not filter out the à. What am I doing wrong?
Accented Noise words and synonyms
Posted: Thu Oct 13, 2005 3:49 pm
by mark
You can't use \x notation. You have to use the actual chars.
<$noiselist = "a" "à"> works for me.
You need to enter the words for noise and equivs in whatever form they arrive from the browser. You could use
{query=<fmt "%U" $query>}
to get an idea what's coming in. Or save them to a file with
<write append /tmp/queries><fmt "%s\n" $query></write>
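To read the URL-encoded output, it helps to know what "à" looks like in each candidate form. The following Python sketch (not Vortex; the helper name `url_form` is made up for illustration) prints the percent-encoded bytes for the UTF-8, Latin-1, and doubly UTF-8-encoded variants, assuming the double encoding misreads UTF-8 bytes as Latin-1.

```python
from urllib.parse import quote_from_bytes

def url_form(raw: bytes) -> str:
    """Percent-encode raw bytes so non-ASCII bytes become visible."""
    return quote_from_bytes(raw)

utf8 = "à".encode("utf-8")
latin1 = "à".encode("latin-1")
# Assumption: double encoding = UTF-8 bytes misread as Latin-1, then
# encoded to UTF-8 a second time.
doubled = utf8.decode("latin-1").encode("utf-8")

print(url_form(utf8))     # %C3%A0
print(url_form(latin1))   # %E0
print(url_form(doubled))  # %C3%83%C2%A0
```

Whichever pattern shows up in the logged query is the form the noise and equiv entries have to match.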
Accented Noise words and synonyms
Posted: Fri Oct 14, 2005 3:45 pm
by edev
Thanks Mark! I figured out the problem - our Java encoder was encoding the incoming utf-8 twice, so instead of putting "à" in the noiselist I have to put "Ã", then it works. Thank you for all your help!
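The double encoding described here can be reproduced in a short Python sketch. It assumes the encoder treated the incoming UTF-8 bytes as Latin-1 before encoding again (the actual Java encoder may have used a different single-byte charset); the function names are hypothetical.

```python
def double_encode(text: str) -> str:
    """Simulate the bug: UTF-8 bytes misread as Latin-1 characters."""
    return text.encode("utf-8").decode("latin-1")

def repair(garbled: str) -> str:
    """Reverse the misreading, assuming Latin-1 was the wrong charset."""
    return garbled.encode("latin-1").decode("utf-8")

garbled = double_encode("à")
# "à" turns into two characters: U+00C3 (Ã) followed by U+00A0, a
# non-breaking space that is easy to miss on screen.
print([hex(ord(ch)) for ch in garbled])  # ['0xc3', '0xa0']
print(repair(garbled) == "à")            # True
```

Under this assumption the garbled form is "Ã" plus an invisible non-breaking space, so fixing the encoder is likely more robust than matching the garbled text in the noise list.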