Suffix/Prefix and Minwordlen

Zeus
Posts: 31
Joined: Thu Jul 29, 2004 5:12 pm

Post by Zeus »

Hi,
We have a field called DOCTEXT, which has this data:
"this kid is problematic indeed".

Our minwordlen is 3.
We have suffixproc and prefixproc 'ON' with the default lists. We have indexed the DOCTEXT field too.

When we search for DOCTEXT like 'problem', we do not get a hit.

Is it because our minwordlen is 3 and 'pro', being in the prefix list, strips the search word down to just 'blem'?

Does the engine search for exactly the final stripped-down version, which is 'blem', and not find it in the index?

Any help is greatly appreciated!!
thanks!!
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Yes, that is what is happening. If you disable prefix processing it should work. Prefix processing is disabled by default because it is generally less useful and can cause issues, such as the "pro" prefix being stripped here.
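
To make the mechanics concrete, here is a rough Python sketch of the stripping step. It is a toy model only: the function name, the longest-prefix-first rule, and the shortened prefix list are assumptions for illustration, not the Texis API or its actual algorithm, and it ignores suffix processing entirely.

```python
# Hypothetical toy model of prefix stripping; not Texis code.
PREFIXES = ["pre", "pro", "re", "un"]  # small excerpt of the prefix list in this thread
MIN_WORD_LEN = 3                       # mirrors the minwordlen setting


def strip_prefix(word, prefixes=PREFIXES, min_len=MIN_WORD_LEN):
    """Remove the longest listed prefix, unless fewer than min_len characters would remain."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            return word[len(prefix):]
    return word  # no prefix applies, or the remainder would be too short


print(strip_prefix("problem"))  # blem
print(strip_prefix("react"))    # act
print(strip_prefix("red"))      # red (stripping "re" would leave only 1 character)
```

With a minimum length of 3, "blem" is long enough to survive, so the query term is no longer "problem" once the prefix has been removed.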
John Turnbull
Thunderstone Software
Zeus
Posts: 31
Joined: Thu Jul 29, 2004 5:12 pm

Post by Zeus »

Thanks!!
Just to be sure: after all the suffix and prefix stripping, the search for 'problem' ends up looking for 'blem' in the data. Does Texis search for *blem* or just blem?

If so, how does the search for 'receive' work? We have 've' in the suffix list, so it will be stripped to 'recei'. Does Texis then search for recei*? We do get receive and receiver as hits.

Sorry, I was a little confused.

Thanks for all the help!!
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

It searches for words ending in "blem" in that case.
"receive" will strip down to "rece" (see apicp defsufrm). It will search for words beginning with "rece".

See also wordc and langc to see what's considered a word.
Zeus
Posts: 31
Joined: Thu Jul 29, 2004 5:12 pm

Post by Zeus »

In the first case, searching for words ending in blem, does Texis do a linear or an indexed search?
It looks like prefix processing will always create problems. Is that right?

Also, in the second case, looking for words beginning with rece, does Texis do a linear or an indexed search?
I am more comfortable with suffix processing alone.

Thanks for the help!!
Zeus
Posts: 31
Joined: Thu Jul 29, 2004 5:12 pm

Post by Zeus »

Oh, I forgot to mention: in the first scenario, I said prefix stripping may not work because
we have data that contains the word 'problem' by itself as well as the word 'problematic'.

thanks
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Only prefix searching (case 2, suffix stripping) can use an index. Suffix searching (case 1, prefix stripping) cannot use an index and will be linear. Middle searching (where both prefixes and suffixes have been removed) will also be linear. The index/linear behavior is basically the same as with wildcards: *word and *word* are linear, word* uses an index.
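
As a rough illustration of why word* can be answered from an index while *word cannot, here is a small Python sketch. It models an index as nothing more than a sorted list of words; that is an assumption for illustration, not a description of how Texis actually stores its indexes, and the word list and function names are made up.

```python
import bisect

# Toy stand-in for an index: distinct words kept in sorted order.
# Illustration only; real Texis indexes are not implemented this way.
WORDS = sorted({"fit", "indeed", "kid", "problem", "problematic",
                "receive", "receiver", "there", "this", "what"})


def starts_with(prefix):
    """word* : binary-search to the first candidate, then read a contiguous run."""
    lo = bisect.bisect_left(WORDS, prefix)
    hits = []
    for word in WORDS[lo:]:
        if not word.startswith(prefix):
            break  # sorted order guarantees no later word can match
        hits.append(word)
    return hits


def ends_with(suffix):
    """*word : sorted order gives no shortcut, so every word is examined."""
    return [word for word in WORDS if word.endswith(suffix)]


print(starts_with("rece"))  # ['receive', 'receiver']
print(ends_with("blem"))    # ['problem']
```

The first lookup touches only the matching run of entries; the second has to look at all of them, which is the same trade-off as the linear search described above.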

Prefixes have a tendency to change the meaning of a word anyhow.
Zeus
Posts: 31
Joined: Thu Jul 29, 2004 5:12 pm

Post by Zeus »

Sorry, I have one more question on this.
As I said earlier, the data also has the word 'problem' by itself.
Why didn't the linear search for *blem find that record as a hit?
Our wordc and langc are:
wordc=\alnum\X24
langc=\alnum\X24 \-

thanks!!
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Did you enable linear searching? It's off by default. View the source of the results page and see what errors or warnings you're getting in HTML comments. What are your precise settings and query? What's the excerpt of text around the word "problem"?
Zeus
Posts: 31
Joined: Thu Jul 29, 2004 5:12 pm

Post by Zeus »

I don't get any errors or warnings. allinear is on.
The text containing the word problem is:

"What do we do when there is a problem with fit?".

Our settings are (sorry for the long list):

<apicp qmaxsetwords 0> <!--Allow for wildcard search set to be of size provided-->

<apicp qmaxwords 0> <!--Allow for wildcard search words to be of size provided-->

<apicp alintersects 1> <!--Allow for intersects in queries for ex. @0,@1..-->

<$noise = "and" "or" "not"> <!--Allow searching of all words except these-->

<apicp noise $noise>

<apicp "allinear" "on">

<apicp "alpostproc" "on"> <!--Enable post-processing-->

<apicp "alnot" "on">

<apicp "alwithin" "on">

<apicp "exactphrase" "on"> <!--Allow quotes to force exact phrases-->

<apicp "alequivs" "on">


<apicp defsuffrm 0>

<!--Force post-processing to keep words with these suffixes-->

<apicp "suffixproc" "on">

<$suffixlist="able" "age" "aged" "ager" "ages" "al" "ally" "ance" "anced" "ancer" "ances" "ant" "ary" "at" "ate" "ated" "ater" "atery" "ates" "atic" "ed" "en" "ence" "enced" "encer" "ences" "end" "ent" "er" "ery" "es" "ess" "est" "ful" "ial" "ible" "ibler" "ic" "ical" "ice" "iced" "icer" "ices" "ics" "ide" "ided" "ider" "ides" "ier" "ily" "ing" "ion" "ious" "ise" "ised" "ises" "ish" "ism" "ist" "ity" "ive" "ived" "ives" "ize" "ized" "izer" "izes" "less" "ly" "ment" "ncy" "ness" "nt" "ory" "ous" "re" "red" "res" "ry" "s" "ship" "sion" "th" "tic" "tion" "ty" "ual" "ul" "ward" "'s" "'">

<apicp suffix $suffixlist>

<apicp "prefixproc" "on">

<$prefixlist="ante" "anti" "arch" "auto" "be" "bi" "counter" "de" "dis" "em" "en" "ex" "extra" "fore" "hyper" "in" "inter" "mis" "non" "post" "pre" "pro" "re" "semi" "sub" "super" "ultra" "un">

<apicp prefix $prefixlist>



<apicp "suffixproc" "off">

<apicp "prefixproc" "off">


<apicp minwordlen 3> <!--Minimum length a word may be stripped down to-->

<apicp qminwordlen 1> <!--Allow 1 character searches-->

<apicp qminprelen 0> <!--Allow wildcards at beginning of word-->



<$wordc = '[\alnum\X27]'>

<$langc = '[\alnum\X27 \- \.]'>