Word stemming challenges

Posts: 9
Joined: Mon May 13, 2024 10:52 pm

Word stemming challenges

Post by rjshelq77 »

In a recent search that included the word sobriety, a user entered ~sobriety to broaden the search. The results were quite off-target: most of the highlighted terms in the preview were semantically unrelated words such as continually, continual, continuous, and continent.

Since the search did specify "any word forms", I suspect that the built-in thesaurus included continence as a synonym for sobriety, and some sort of stemmer happily took off from there without any semantic guardrails.
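
To make that suspicion concrete, here is a toy Python sketch of how a broad thesaurus and an aggressive stemmer could compound. It is only an illustration, not how the actual search engine works: the thesaurus entry, the `crude_stem` prefix-truncation "stemmer", and the sample vocabulary are all made up for this example.

```python
# Hypothetical broad thesaurus entry (an assumption, not the real built-in data).
BROAD_THESAURUS = {"sobriety": ["abstinence", "temperance", "continence"]}


def crude_stem(word, length=6):
    """Deliberately crude stand-in for a stemmer: keep only the first few letters."""
    return word.lower()[:length]


def broad_matches(term, vocabulary, thesaurus=BROAD_THESAURUS):
    """Return vocabulary words whose crude stem equals the stem of the term
    or of any of its thesaurus synonyms."""
    candidates = [term] + thesaurus.get(term.lower(), [])
    stems = {crude_stem(w) for w in candidates}
    return [w for w in vocabulary if crude_stem(w) in stems]


vocabulary = ["sobriety", "continually", "continual", "continuous",
              "continent", "abstinence", "garden"]
print(broad_matches("sobriety", vocabulary))
```

In this sketch the synonym continence reduces to the stem "contin", which then pulls in continually, continual, continuous, and continent, none of which has anything to do with sobriety.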

Sometimes users want to widen their search a bit by using the tilde, but results like that are quite discouraging, and make my users run off to use Google.

Is there a solution for such problems?
Site Admin
Posts: 2607
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Re: Word stemming challenges

Post by John »

The built-in thesaurus and "any word forms" together will give a very broad range of matches, including words derived from the same root even if their meanings have since diverged a bit. The right answer depends on the goals of the searcher. The built-in thesaurus with all word forms may be most appropriate for a researcher who wants to make sure they don't miss anything, and for whom seeing related concepts that may not be immediately obvious is helpful.

For other users, a solution may be, for example, a more restrictive thesaurus, where abstinence and temperance are the only synonyms for sobriety; a step that allows the thesaurus with just plurals and possessives; or trying a more restrictive custom stemming list first.
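
As a rough illustration of the "restrictive thesaurus plus plurals and possessives" idea, here is a minimal Python sketch. It is not the search engine's own API or configuration: `RESTRICTIVE_THESAURUS`, `plural_possessive_forms`, `expand_term`, and `matching_words` are hypothetical names made up for this example, and the pluralization rules are simplified.

```python
import re

# Hypothetical restrictive thesaurus: only close synonyms are listed.
RESTRICTIVE_THESAURUS = {
    "sobriety": ["abstinence", "temperance"],
}


def plural_possessive_forms(word):
    """Return the word plus simple plural and possessive variants only."""
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        plural = word[:-1] + "ies"
    elif word.endswith(("s", "x", "z", "ch", "sh")):
        plural = word + "es"
    else:
        plural = word + "s"
    # Singular, singular possessive, plural, plural possessive.
    return {word, word + "'s", plural, plural + "'"}


def expand_term(term, thesaurus=RESTRICTIVE_THESAURUS):
    """Expand a term to itself and its listed synonyms, each limited to
    plural and possessive forms (no general stemming)."""
    term = term.lower()
    expanded = set()
    for word in [term] + thesaurus.get(term, []):
        expanded |= plural_possessive_forms(word)
    return expanded


def matching_words(text, term):
    """Return the tokens of text that fall inside the restricted expansion."""
    allowed = expand_term(term)
    tokens = re.findall(r"[a-z]+(?:'s|')?", text.lower())
    return [t for t in tokens if t in allowed]


print(matching_words(
    "Continual temperance and sobriety's rewards on the continent", "sobriety"))
```

Because nothing is reduced to a shared root, words like continual and continent are never considered. The trade-off is that genuinely related forms outside the synonym list are missed as well.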
John Turnbull
Thunderstone Software