Single Character Word Search in Text Fields

aaron.drielick
Posts: 1
Joined: Wed May 23, 2001 5:38 pm

Post by aaron.drielick »

I am trying to incorporate the ability to search on a single, specific character within a search index. The character is the letter "B", both upper and lower case, always preceded by a space and always followed by either a space or a punctuation mark. For instance, I want a search hit on a phrase such as "Vitamin B is good for you," while at the same time, I want the search engine to ignore a phrase such as "it was basic beginner's luck".

In creating the search index, I've incorporated the following word matching expressions with only limited success:

-k"\alnum{2,30}" -k"[Bb]{1}"

I've also added <apicp qminwordlen 1> to the search script to allow single character word searches in the query string.

The above word matching expression, unfortunately, will return any article that simply contains the letter "B". This does make sense. Searches for any single character other than "B" result in a pretty ugly error, but that's okay for now.

I've attempted to get the index to only accept <space>-"B"-<space> and <space>-"B"-<punct> with no success using the following word pattern matching expressions:

-k"\alnum{2,30}" -k"[\s][Bb]{1}[\s]" -k"[\s][Bb]{1}[\punct]"

This returns an ugly error message no matter what single character search is performed, including "B".

In all cases, I get the error message indicating that the "query would require a linear search". But first things first....

Being somewhat unfamiliar with regular expressions, I could be way off in left field on the word pattern matching expressions...or maybe not. Either way, any assistance you could provide in helping me solve this problem would be greatly appreciated!

Thanks in advance,
Aaron Drielick
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Allowing a particular single character search is unusual. Since you are setting <apicp qminwordlen 1>, you are already allowing all single-character queries through to the engine, so the simplest solution might be to use an expression of:

-k"\alnum{1,30}"

and simply allow those queries as well. The problem with the expressions you have is that they will index the space or punctuation before and after the "b", so the plain "b" won't be found.
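
To illustrate that point outside of Texis, here is a rough Python sketch. It uses Python's `re` module as a stand-in for REX (the two syntaxes differ, and the real indexer may behave differently), and `index_terms` is a hypothetical helper rather than anything from the Thunderstone API.

```python
import re

def index_terms(text, patterns):
    """Collect every substring matched by any pattern, lowercased (hypothetical helper)."""
    terms = set()
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            terms.add(match.group(0).lower())
    return terms

text = "Vitamin B is good for you."

# Rough analogue of -k"[\s][Bb]{1}[\s]": the surrounding spaces are part of the match.
with_spaces = index_terms(text, [r"\s[Bb]\s"])

# Rough analogue of -k"\alnum{1,30}": single letters are stored as bare words.
bare_words = index_terms(text, [r"[A-Za-z0-9]{1,30}"])

print(with_spaces)          # the stored term is " b ", spaces included
print("b" in with_spaces)   # False: a query for plain "b" has nothing to match
print("b" in bare_words)    # True
```

The only point of the sketch is that whatever the expression matches is what gets stored, so delimiters inside the match end up inside the indexed term.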

REX does allow you to specify preceding and following expressions that must be present, so:

-k"[^\alnum]\P=b=[^\alnum]\F="

means a "b" (case-insensitive) preceded and followed by a non-alphanumeric character.
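
For readers more used to Perl-style regular expressions, the same idea can be sketched in Python. This is only an approximation: lookbehind and lookahead are assumed here to play the role that \P and \F play in REX, and `find_single_b` is a hypothetical helper name.

```python
import re

# Assumption: lookbehind/lookahead stand in for REX's \P and \F.
# The neighbouring characters must be present and non-alphanumeric,
# but they are not part of the matched text.
SINGLE_B = re.compile(r"(?<=[^A-Za-z0-9])b(?=[^A-Za-z0-9])", re.IGNORECASE)

def find_single_b(text):
    """Return the matched text of each standalone 'b' (hypothetical helper)."""
    return [m.group(0) for m in SINGLE_B.finditer(text)]

print(find_single_b("Vitamin B is good for you."))    # ['B']
print(find_single_b("it was basic beginner's luck"))  # []
print(find_single_b("Take vitamin b, daily."))        # ['b']
```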
John Turnbull
Thunderstone Software
Faiz
Posts: 109
Joined: Wed Jan 10, 2001 1:29 pm

Post by Faiz »

I have an index expression, <sql "set addexp='\alnum=[\alnum\+\_]{1,30}'"></sql> and <apicp qminwordlen 1> in the search script. But when I search for `e-business` or `e-mail`, it says, "term e not indexable, needs post-processing".
In some cases I need to pass a single character through the search engine. How can I do that?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Before adding your custom expression, delete the default one and replace it with one that allows single letters:

<sql "set delexp=0"></sql>
<sql "set addexp='\alnum{1,30}'"></sql>
<sql "set addexp='>>\alnum=[\alnum\+\_]{1,30}'"></sql>
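
Why the minimum length matters can be shown with a small Python sketch. It is not Texis code: Python's `re` stands in for REX, the default expression is assumed (not confirmed by this thread) to require at least two characters, and `index_terms` is a hypothetical helper.

```python
import re

def index_terms(text, patterns):
    """Collect lowercased matches of every pattern (hypothetical helper)."""
    terms = set()
    for pattern in patterns:
        terms.update(m.lower() for m in re.findall(pattern, text))
    return terms

text = "Our e-business team answers e-mail."

# Assumed default: alphanumeric runs of at least 2 characters.
default_only = index_terms(text, [r"[A-Za-z0-9]{2,30}"])

# After swapping in an expression with a 1-character minimum.
single_ok = index_terms(text, [r"[A-Za-z0-9]{1,30}"])

print("e" in default_only)  # False: the lone "e" before the hyphen is never stored
print("e" in single_ok)     # True
```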
monty1
Posts: 16
Joined: Sun Jun 10, 2001 12:12 pm

Post by monty1 »

I have databases that include references to U.S. President George W. Bush, but a query for his exact name (including the W.) returns nothing.

Search for "George Bush" alone, however, and you get results.

I suspect that people are typing in "George W. Bush" and are disappointed to get nothing.

I tried adding:

k"\alnum{1,30}"

to the options file used when fetching the pages, but that produced no difference in the search results.

After reading this thread and node40.html, I just don't see what the correct configuration should be to ensure people can search for "George W. Bush" and get the results they expect.
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

You may want to check the HTML source to see what messages, if any, are produced. I suspect that the '.' after the W is causing the issue. You have a couple of options. One is to remove the '.' from the query before it is submitted; Webinator will do that automatically if it is the last thing in the query.

Another option would be to index words ending in '.', e.g.

k"\alnum{1,30}"
k"\alnum{1,30}\.="

You would need both expressions so that a search for "George W Bush" would work equally well.
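
As a rough illustration of why both are needed, here is a Python sketch. It treats the index as a simple set of terms and uses Python's `re` in place of REX, so it ignores phrase matching and everything else the real engine does; `index_terms` and `all_indexed` are hypothetical helpers.

```python
import re

def index_terms(text, patterns):
    """Collect lowercased matches of every pattern (hypothetical helper)."""
    terms = set()
    for pattern in patterns:
        terms.update(m.lower() for m in re.findall(pattern, text))
    return terms

def all_indexed(query, terms):
    """True if every whitespace-separated query word is a stored term."""
    return all(word.lower() in terms for word in query.split())

text = "President George W. Bush spoke today."

plain = r"[A-Za-z0-9]{1,30}"      # rough analogue of \alnum{1,30}
dotted = r"[A-Za-z0-9]{1,30}\."   # rough analogue of \alnum{1,30}\.=

both = index_terms(text, [plain, dotted])
plain_only = index_terms(text, [plain])

print(all_indexed("George W. Bush", both))        # True
print(all_indexed("George W Bush", both))         # True
print(all_indexed("George W. Bush", plain_only))  # False: "w." was never stored
```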
John Turnbull
Thunderstone Software