Suffix/Prefix and Minwordlen
Posted: Wed Nov 10, 2004 6:15 pm
by Zeus
Hi,
We have a field called DOCTEXT, which has this data:
"this kid is problematic indeed".
Our minwordlen is 3.
We have suffixproc and prefixproc 'ON' with the default lists. We have indexed the DOCTEXT field too.
When we search for DOCTEXT like 'problem'
we do not get a hit.
Is it because our minwordlen is 3 and 'pro', being in the prefix list, strips the search word down to just 'blem'?
Does the engine search for exactly the final stripped down version which is 'blem' and not find it in the index?
Any help is greatly appreciated!!
thanks!!
Suffix/Prefix and Minwordlen
Posted: Wed Nov 10, 2004 9:00 pm
by John
Yes, that is what is happening. If you disable the prefix processing it should work. Prefixes are disabled by default as they are generally less useful, since they can cause issues, such as the "pro" prefix being applied here.
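As a rough illustration of the stripping described above, here is a minimal Python sketch of prefix removal bounded by a minimum word length. It is a simplified model, not Texis's actual processing: the function name `strip_prefix`, the single-pass longest-match rule, and the shortened prefix list are all assumptions made for the example.

```python
# Simplified model of prefix stripping (assumption: not Texis's real algorithm).
PREFIXES = ["anti", "dis", "pre", "pro", "re", "un"]  # illustrative subset of the prefix list


def strip_prefix(word, prefixes=PREFIXES, minwordlen=3):
    """Remove the longest listed prefix, provided at least minwordlen characters remain."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= minwordlen:
            return word[len(prefix):]
    return word


print(strip_prefix("problem"))      # blem
print(strip_prefix("problematic"))  # blematic
print(strip_prefix("pro"))          # pro (too short to strip)
```

In this model, 'problem' keeps four characters after 'pro' is removed, which satisfies a minwordlen of 3, so the search term becomes 'blem'.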
Suffix/Prefix and Minwordlen
Posted: Thu Nov 11, 2004 2:12 am
by Zeus
Thanks!!
Just to be sure: after all the suffix and prefix stripping, the search for 'problem' ends up searching for 'blem' in the data. Does Texis search for *blem* or just blem?
If so, how does the search for 'receive' work? We have 've' in the suffix list, so it will be stripped to 'recei'. Does Texis then search for recei*? We ask because we do get receive and receiver as hits.
Sorry, I was a little confused.
Thanks for all the help!!
Suffix/Prefix and Minwordlen
Posted: Thu Nov 11, 2004 10:32 am
by mark
It searches for words ending in blem in that case.
receive will strip down to rece (see apicp defsufrm). It will search for words beginning with rece.
See also wordc and langc to see what's considered a word.
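The suffix case can be sketched the same way. This is a simplified Python model, not Texis's implementation: `strip_suffix` and `words_beginning_with` are hypothetical names, the suffix list is a small subset, and the model ignores the defsuffrm-style trailing-letter removal as well as any repeated stripping passes.

```python
# Simplified model of suffix stripping followed by a "begins with" match
# (assumption: ignores trailing-letter removal and repeated stripping).
SUFFIXES = ["ive", "er", "ed", "s"]  # illustrative subset of the suffix list


def strip_suffix(word, suffixes=SUFFIXES, minwordlen=3):
    """Remove the longest listed suffix, provided at least minwordlen characters remain."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= minwordlen:
            return word[:-len(suffix)]
    return word


def words_beginning_with(root, words):
    """Return the words that start with the stripped root."""
    return [w for w in words if w.startswith(root)]


root = strip_suffix("receive")
print(root)  # rece
print(words_beginning_with(root, ["receive", "receiver", "problem"]))
# ['receive', 'receiver']
```

Under these assumptions, stripping 'ive' leaves 'rece', and both 'receive' and 'receiver' begin with that root, which is consistent with the hits reported above.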
Suffix/Prefix and Minwordlen
Posted: Thu Nov 11, 2004 10:42 am
by Zeus
In the first case, searching for words ending in blem, does Texis do a linear or an indexed search?
It looks like prefix processing will always create problems. Is that right?
Also, in the second case, looking for words beginning with rece, does Texis do a linear or an indexed search?
Suffix alone, I am more comfortable with.
Thanks for the help!!
Suffix/Prefix and Minwordlen
Posted: Thu Nov 11, 2004 11:07 am
by Zeus
Oh, I forgot to mention: in the first scenario, I said prefix stripping may not work because
we have data that contains just the word 'problem' as well as the word 'problematic'.
thanks
Suffix/Prefix and Minwordlen
Posted: Thu Nov 11, 2004 11:54 am
by mark
Only prefix searching, case 2 (suffix stripping), can use an index. Suffix searching, case 1 (prefix stripping), cannot use an index and will be linear. Middle searching (where both prefixes and suffixes have been removed) will also be linear. The index/linear behavior is basically the same as with wildcards: *word and *word* are linear, word* uses an index.
Prefixes have a tendency to change the meaning of a word anyhow.
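The reason only word* can use an index can be shown with a small Python sketch. A sorted word list stands in for the index here; this is a generic illustration and an assumption, not a description of how a Texis Metamorph index is actually laid out. The function names are hypothetical.

```python
import bisect

# A sorted word list stands in for an index (assumption: generic illustration,
# not the structure of a real Texis index).
INDEX = sorted(["fit", "kid", "problem", "problematic", "receive", "receiver"])


def prefix_search(root, index=INDEX):
    """word* : binary-search to the first candidate, then read the contiguous run."""
    start = bisect.bisect_left(index, root)
    hits = []
    for word in index[start:]:
        if not word.startswith(root):
            break
        hits.append(word)
    return hits


def suffix_search(root, index=INDEX):
    """*word : matches are scattered in sort order, so every entry is examined."""
    return [word for word in index if word.endswith(root)]


def middle_search(root, index=INDEX):
    """*word* : likewise requires examining every entry."""
    return [word for word in index if root in word]


print(prefix_search("rece"))  # ['receive', 'receiver']
print(suffix_search("blem"))  # ['problem']
print(middle_search("blem"))  # ['problem', 'problematic']
```

Words sharing a leading root sit next to each other in sorted order, so one lookup finds them all. Words sharing an ending or a middle fragment can be anywhere, which is why those searches fall back to a scan.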
Suffix/Prefix and Minwordlen
Posted: Thu Nov 11, 2004 2:59 pm
by Zeus
Sorry, I have one more question on this.
As I said earlier, the data also has the word 'problem' just by itself.
Why didn't the linear search for *blem find that record as a hit?
Our wordc and langc are:
wordc=\alnum\X24
langc=\alnum\X24 \-
thanks!!
Suffix/Prefix and Minwordlen
Posted: Thu Nov 11, 2004 4:03 pm
by mark
Did you enable linear searching? It's off by default. View the source of the results page and see what errors or warnings you're getting in html comments. What are your precise settings and query? What's the excerpt of text around the word "problem"?
Suffix/Prefix and Minwordlen
Posted: Thu Nov 11, 2004 4:11 pm
by Zeus
I don't get any errors or warnings. allinear is on.
The text containing the word problem is,
"What do we do when there is a problem with fit?".
Our settings are (sorry for the long list):
<apicp qmaxsetwords 0> <!--Allow for wildcard search set to be of size provided-->
<apicp qmaxwords 0> <!--Allow for wildcard search words to be of size provided-->
<apicp alintersects 1> <!--Allow for intersects in queries for ex. @0,@1..-->
<$noise = "and" "or" "not"> <!--Allow searching of all words except these-->
<apicp noise $noise>
<apicp "allinear" "on">
<apicp "alpostproc" "on"> <!--Enable post-processing-->
<apicp "alnot" "on">
<apicp "alwithin" "on">
<apicp "exactphrase" "on"> <!--Allow quotes to force exact phrases-->
<apicp "alequivs" "on">
<apicp defsuffrm 0>
<!--Force post-processing to keep words with these suffixes-->
<apicp "suffixproc" "on">
<$suffixlist="able" "age" "aged" "ager" "ages" "al" "ally" "ance" "anced" "ancer" "ances" "ant" "ary" "at" "ate" "ated" "ater" "atery" "ates" "atic" "ed" "en" "ence" "enced" "encer" "ences" "end" "ent" "er" "ery" "es" "ess" "est" "ful" "ial" "ible" "ibler" "ic" "ical" "ice" "iced" "icer" "ices" "ics" "ide" "ided" "ider" "ides" "ier" "ily" "ing" "ion" "ious" "ise" "ised" "ises" "ish" "ism" "ist" "ity" "ive" "ived" "ives" "ize" "ized" "izer" "izes" "less" "ly" "ment" "ncy" "ness" "nt" "ory" "ous" "re" "red" "res" "ry" "s" "ship" "sion" "th" "tic" "tion" "ty" "ual" "ul" "ward" "'s" "'">
<apicp suffix $suffixlist>
<apicp "prefixproc" "on">
<$prefixlist="ante" "anti" "arch" "auto" "be" "bi" "counter" "de" "dis" "em" "en" "ex" "extra" "fore" "hyper" "in" "inter" "mis" "non" "post" "pre" "pro" "re" "semi" "sub" "super" "ultra" "un">
<apicp prefix $prefixlist>
<apicp "suffixproc" "off">
<apicp "prefixproc" "off">
<apicp minwordlen 3> <!--Number of suffixes removed-->
<apicp qminwordlen 1> <!--Allow 1 character searches-->
<apicp qminprelen 0> <!--Allow wildcards at beginning of word-->
<$wordc = '[\alnum\X27]'>
<$langc = '[\alnum\X27 \- \.]'>