Suffix/Prefix and Minwordlen

Zeus · Post by **Zeus** » Wed Nov 10, 2004 6:15 pm

Hi,
We have a field called DOCTEXT, which has this data:
"this kid is problematic indeed".

Our minwordlen is 3.
We have suffixproc and prefixproc 'ON' with the default lists. We have indexed the DOCTEXT field too.

When we search for DOCTEXT like 'problem'

we do not get a hit.
Is it because our minwordlen is 3 and 'pro', being in the prefix list, strips the search word to just 'blem'?

Does the engine search for exactly the final stripped down version which is 'blem' and not find it in the index?

Any help is greatly appreciated!!
thanks!!

Zeus · Post by **Zeus** » Thu Nov 11, 2004 2:12 am

Thanks!!
Just to be sure: after all the suffix and prefix stripping, the search for 'problem' finally searches for 'blem' in the data. Does Texis search for *blem* or just blem?

If so, how does the search for 'receive' work? We have 've' in the suffix list, so it will be stripped to 'recei'. Does Texis then search for recei*? I ask because we do get receive and receiver as hits.

Sorry, I was a little confused.

Thanks for all the help!!

Post by **mark** » Thu Nov 11, 2004 10:32 am

It searches for words ending in blem in that case.
'receive' will strip down to 'rece' (see apicp defsuffrm). It will search for words beginning with rece.

See also wordc and langc to see what's considered a word.
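
The stripping described above can be modelled roughly as follows. This is only a sketch in Python, not Texis's actual implementation: the function name `strip_affixes`, the longest-match-first rule, the suffix-before-prefix order, and the shortened lists are all assumptions made for illustration. Only the minwordlen of 3 and the affixes themselves come from this thread.

```python
# Illustrative model only; Texis's real morpheme processing may differ.
PREFIXES = ["pre", "pro", "re", "un"]        # subset of the prefix list in this thread
SUFFIXES = ["atic", "ive", "er", "ed", "s"]  # subset of the suffix list in this thread


def strip_affixes(word, prefixes=PREFIXES, suffixes=SUFFIXES, minwordlen=3):
    """Return (root, prefix_removed, suffix_removed) for a query word.

    An affix is removed only if the remaining root keeps at least
    `minwordlen` characters (assumed behaviour).
    """
    root = word.lower()
    prefix_removed = False
    suffix_removed = False

    # Strip suffixes first, longest candidate first, until nothing more fits.
    changed = True
    while changed:
        changed = False
        for suf in sorted(suffixes, key=len, reverse=True):
            if root.endswith(suf) and len(root) - len(suf) >= minwordlen:
                root = root[: -len(suf)]
                suffix_removed = True
                changed = True
                break

    # Then strip prefixes under the same length constraint.
    changed = True
    while changed:
        changed = False
        for pre in sorted(prefixes, key=len, reverse=True):
            if root.startswith(pre) and len(root) - len(pre) >= minwordlen:
                root = root[len(pre):]
                prefix_removed = True
                changed = True
                break

    return root, prefix_removed, suffix_removed


print(strip_affixes("problem"))      # ('blem', True, False)
print(strip_affixes("receive"))      # ('rece', False, True)
print(strip_affixes("problematic"))  # ('blem', True, True)
```

In this model 'receive' loses 'ive' but keeps its leading 're', because removing it would leave only two characters, which is below the minwordlen of 3.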

Zeus · Post by **Zeus** » Thu Nov 11, 2004 10:42 am

In the case of the first part, searching for words ending in blem, does Texis do a linear or an indexed search?
It looks like prefix processing will always create problems. Is that right?

Also, in the case of the second search, looking for words beginning with rece, does Texis do a linear or an indexed search?
I am more comfortable with suffix processing alone.

Thanks for the help!!

Zeus · Post by **Zeus** » Thu Nov 11, 2004 11:07 am

Oh, I forgot to mention: in the first scenario, I said prefix stripping may not work because
we have data that has just the word 'problem' along with the word 'problematic'.

thanks

Post by **mark** » Thu Nov 11, 2004 11:54 am

Only prefix searching, case 2 (suffix stripping), can use an index. Suffix searching, case 1 (prefix stripping), cannot use an index and will be linear. Middle searching (where both prefixes and suffixes have been removed) will also be linear. The index/linear behavior is basically the same as with wildcards: *word and *word* are linear, word* uses an index.

Prefixes have a tendency to change the meaning of a word anyhow.
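
As a rough illustration of the rule above, the following Python sketch maps a stripped root to the equivalent wildcard pattern and reports whether it could use an index. The names `search_plan` and `matching_words` are hypothetical, the shell-style matching from `fnmatch` only stands in for Texis's own matching, and treating an unstripped word as indexed is an assumption.

```python
from fnmatch import fnmatchcase


def search_plan(root, prefix_removed, suffix_removed):
    """Return (wildcard_pattern, 'indexed' or 'linear') for a stripped root."""
    pattern = root
    if prefix_removed:
        pattern = "*" + pattern   # unknown leading characters
    if suffix_removed:
        pattern = pattern + "*"   # unknown trailing characters
    # A leading wildcard rules out an index lookup; word* and, by
    # assumption, a plain word can use the index.
    mode = "linear" if pattern.startswith("*") else "indexed"
    return pattern, mode


def matching_words(pattern, words):
    """Filter words with shell-style matching (a stand-in, not Texis itself)."""
    return [w for w in words if fnmatchcase(w, pattern)]


words = ["problem", "problematic", "receive", "receiver"]

print(search_plan("blem", True, False))   # ('*blem', 'linear')
print(search_plan("rece", False, True))   # ('rece*', 'indexed')
print(search_plan("blem", True, True))    # ('*blem*', 'linear')
print(matching_words("*blem", words))     # ['problem']
print(matching_words("rece*", words))     # ['receive', 'receiver']
```

Under these assumptions, a query reduced to *blem needs linear searching enabled before it can match anything, while rece* can be answered from the index.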

Zeus · Post by **Zeus** » Thu Nov 11, 2004 2:59 pm

Sorry, I have one more question on this.
As I said earlier, the data also has the word 'problem' just by itself.
Why did the linear search for *blem not find the record as a hit?
Our wordc and langc are:
wordc=\alnum\X24
langc=\alnum\X24 \-

thanks!!

Post by **mark** » Thu Nov 11, 2004 4:03 pm

Did you enable linear searching? It's off by default. View the source of the results page and see what errors or warnings you're getting in html comments. What are your precise settings and query? What's the excerpt of text around the word "problem"?

Zeus · Post by **Zeus** » Thu Nov 11, 2004 4:11 pm

I don't get any errors or warnings. allinear is on.
The text containing the word 'problem' is:

"What do we do when there is a problem with fit?".

Our settings are (sorry for the long list):

<apicp qmaxsetwords 0>

<apicp qmaxwords 0> 

<apicp alintersects 1> 

<$noise = "and" "or" "not"> 

<apicp noise $noise>

<apicp "allinear" "on">

<apicp "alpostproc" "on"> 

<apicp "alnot" "on">

<apicp "alwithin" "on">

<apicp "exactphrase" "on"> 

<apicp "alequivs" "on">

<apicp defsuffrm 0>



<apicp "suffixproc" "on">

<$suffixlist="able" "age" "aged" "ager" "ages" "al" "ally" "ance" "anced" "ancer" "ances" "ant" "ary" "at" "ate" "ated" "ater" "atery" "ates" "atic" "ed" "en" "ence" "enced" "encer" "ences" "end" "ent" "er" "ery" "es" "ess" "est" "ful" "ial" "ible" "ibler" "ic" "ical" "ice" "iced" "icer" "ices" "ics" "ide" "ided" "ider" "ides" "ier" "ily" "ing" "ion" "ious" "ise" "ised" "ises" "ish" "ism" "ist" "ity" "ive" "ived" "ives" "ize" "ized" "izer" "izes" "less" "ly" "ment" "ncy" "ness" "nt" "ory" "ous" "re" "red" "res" "ry" "s" "ship" "sion" "th" "tic" "tion" "ty" "ual" "ul" "ward" "'s" "'">

<apicp suffix $suffixlist>

<apicp "prefixproc" "on">

<$prefixlist="ante" "anti" "arch" "auto" "be" "bi" "counter" "de" "dis" "em" "en" "ex" "extra" "fore" "hyper" "in" "inter""mis" "non" "post" "pre" "pro" "re" "semi" "sub" "super" "ultra" "un">

<apicp prefix $prefixlist>

<apicp "suffixproc" "off">

<apicp "prefixproc" "off">

<apicp minwordlen 3> 

<apicp qminwordlen 1> 

<apicp qminprelen 0> 

<$wordc = '[\alnum\X27]'>

<$langc = '[\alnum\X27 \- \.]'>

Post by **mark** » Thu Nov 11, 2004 5:20 pm

What's your sql statement with query?