problem using IN clause with likep

kevin31 · Post by **kevin31** » Tue Sep 09, 2003 4:50 pm

I am finding that using an IN clause in a query adversely affects the likep statement. For example: "select $rank, ID from tableA where TextBlob likep 'foo'" might return 10 rows if likeprows is set to 10. However, if I add an additional IN clause like so "select $rank, ID from tableA where TextBlob likep 'foo' and Type IN (2,3,4)" suddenly many more than 10 rows is returned. What is happening here? My texis version is
Texis Web Script (Vortex) Copyright (c) 1996-2001 Thunderstone - EPI, Inc.
Commercial Version 3.01.992447526 of Jun 13, 2001 (i686-intel-winnt-32)

Post by **Kai** » Tue Sep 09, 2003 5:06 pm

likeprows is disabled (in effect 0) if something is ANDed with a LIKEP. Otherwise likeprows may cause some valid results to be dropped by LIKEP (due to low rank) before they can be matched by the IN clause. Eg. if on average 5% of the table matches the IN, and LIKEP let only the best likeprows (10) results through, it's likely you'd only get 5% of 10 results -- ie. 1 or 0 rows. So LIKEP ignores likeprows when there's an AND, so that IN-matching results aren't lost.

Use <SQL MAX> to limit results in this case.

kevin31 · Post by **kevin31** » Tue Sep 09, 2003 5:46 pm

My major problem here is performance. With likeprows disabled perf is unacceptable. Adding <SQL MAX=100 does not seem to fix that problem. I also tried doing a self join, i.e. "select $rank, ID from tableA a, tableB b where a.TextBlob likep 'foo' and b.Type IN (2,3,4) and a.ID=b.ID". This works and likeprows also seems to work, but performance is still very bad (> 30 secs on 200000 rows).

Let me re-state my problem in it's simplest terms. I have a table that looks like this:
tableA (ID int, Type int, TextBlob varchar)
I need to write a query that accomplishes the following:
"select $rank, ID from tableA where Text likep 'foo' AND Type IN (1,2,3)"

What is the most optimal way to do this such that likeprows and the various rank knobs still work? What form of query should I use and what indexes (simple, metamorph, compound etc.) should I create? My experience with Texis thus far seems to indicate that the only query that performs really fast is one that ONLY contains a likep statement. Is my only option to break the table up into multiple tables? Any feedback is most appreciated. Thanks.

Post by **Kai** » Tue Sep 09, 2003 6:04 pm

Is Type an integer field? Use a compound index: Make a metamorph inverted index on tableA(TextBlob,Type) (drop any other metamorph index on TextBlob first so it won't be used instead). This avoids the post-process table access for the IN (one index can resolve both parts).

If the results of running just "where Type in (2,3,4)" is a very small fraction of the table (eg. a few rows) it may be faster to make a regular index on Type and reverse the order of the two parts of the AND.

kevin31 · Post by **kevin31** » Tue Sep 09, 2003 6:28 pm

Sorry. We crossed paths. I was busy editting item 3 when you responded. Could you please look at my last entry? I will try what you've suggested in the meantime.

kevin31 · Post by **kevin31** » Tue Sep 09, 2003 7:58 pm

I created an inverted metamorph index on (TextBlob, Type). I previously had an index like this on (TextBlob, Type, ID). Neither of these make any difference to the performance of the query.

In answer to your previous question, the "where Type in (...)" portion of the query can return a large chunk of the table. In 200,000 rows there are perhaps 20 Types.

Post by **Kai** » Thu Sep 11, 2003 12:18 pm

The original index you had should have been sufficient, since it contained Type as an auxiliary field.

It appears that your June 2001 version did not optimally handle usage of the compound index in this case, ie. only partially utilized the compound index. The current release uses it more efficiently and is faster.