Need explanation (might involve indexing, required fields in searches, suffix processing, ...)

barry.marcus
Posts: 288
Joined: Thu Nov 16, 2006 1:05 pm

Post by barry.marcus »

I think this is a continuation of my previous issue, but since I'm not sure, I'm starting this as a new thread. In any case, this is still a very high priority issue for us, as we are in the middle of a project, and its progress is hampered until we can find a work-around or resolution.

We have a number of equivalences we use in our queries. In some of those queries we combine suffix-processing suppression (using @suffixproc=0 inline in the query) with requiring at least one of the equivalent terms (by prepending + to the root term). This combination seems to be causing an issue.

Below is a script that shows, in a nutshell, what we are seeing. It indexes exactly as we index our production data and it sets all of the query parameters that we use in our code. In it are a number of queries. However, it is only the LAST query that is representative of the kind of construct we'd like to use in this case, and it is the one that is not returning a hit. We don't know why that is. Perhaps we are missing something, or are misunderstanding how Texis handles this construct. In any case, we are expecting a hit for the last query, but are not getting one.

For this to work, you will need to build an equivalence file (called TestEquivs.lst in the script) from the following:

phatype;n,poly-hydroxy-alkanoate
xpolymertype;n,copolymer
xpolymertype;n,copolymers
xpolymertype;n,plastic
xpolymertype;n,plastics
xpolymertype;n,polymer
xpolymertype;n,polymers
xpolymertype;n,thermoplastic
xpolymertype;n,thermoplastics
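
For anyone reproducing this, the list above can be generated rather than typed by hand. The following is only a minimal Python sketch: the names `EQUIVS`, `equiv_lines` and `write_equiv_file` are made up for illustration, and the `root;n,term` one-pair-per-line layout is simply copied from the list in this post. Whether the file needs any further processing before `eqprefix` can use it is not covered in this thread.

```python
from pathlib import Path

# Root terms and their equivalences, copied from the list in this post.
EQUIVS = {
    "phatype": ["poly-hydroxy-alkanoate"],
    "xpolymertype": [
        "copolymer", "copolymers", "plastic", "plastics",
        "polymer", "polymers", "thermoplastic", "thermoplastics",
    ],
}


def equiv_lines(equivs, pos="n"):
    """Return one 'root;pos,term' line per (root, term) pair."""
    return [f"{root};{pos},{term}"
            for root, terms in equivs.items()
            for term in terms]


def write_equiv_file(path, equivs):
    """Write the equivalence source list, one pair per line (hypothetical helper)."""
    Path(path).write_text("\n".join(equiv_lines(equivs)) + "\n", encoding="ascii")
```

Calling `write_equiv_file("TestEquivs.lst", EQUIVS)` would produce the nine lines shown above.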

Here is the script:

<script language=vortex>
<a name=main>
<db=<your db path here>>

<!-- Create table -->
<sql novars "drop table test"></sql>
<sql novars "create table test(Text varchar(20))"></sql>

<$text="polymer is selected from the group consisting of poly(hydroxy alkanoates)">
<loop $text>
<sql novars "insert into test (Text) values ($text)"></sql>
</loop>

<!-- Index table -->
<sql "set delexp=0"></sql>
<sql "set addexp='\alnum{1,99}'"></sql>
<sql "set keepnoise=1"></sql>
<sql "create metamorph inverted index xtext on test(Text)"></sql>

<!-- Query parameters -->
<apicp eqprefix d:\Crosshairs\appData1\equivfiles\TestEquivs>
<apicp keepeqvs 1>
<apicp alequivs 1>
<apicp alpostproc 1>
<apicp allinear 1>
<apicp alintersects 1>
<apicp intersects -1>
<apicp alwithin 1>
<apicp minwordlen 5>
<apicp qmaxsetwords 10000>
<apicp qmaxwords 10000>
<sql "set indexwithin=7"></sql>
<sql "set hyphenphrase=1"></sql>
<apicp keepnoise 1>
<apicp withinmode "word span">

<!-- Queries -->
xPOLYMERTYPE only, suffix processing suppressed:
<sql "select Text from test where Text like '@suffixproc=0 xPOLYMERTYPE @suffixproc=1 @0'">
$Text
</sql>
hits: $loop

xPOLYMERTYPE (required) only, suffix processing suppressed:
<sql "select Text from test where Text like '@suffixproc=0 +xPOLYMERTYPE @suffixproc=1 @0'">
$Text
</sql>
hits: $loop

PHATYPE only:
<sql "select Text from test where Text like 'PHATYPE @0'">
$Text
</sql>
hits: $loop

Both, suffix processing suppressed, but xPOLYMERTYPE not required:
<sql "select Text from test where Text like '@suffixproc=0 xPOLYMERTYPE @suffixproc=1 @0 PHATYPE w/20'">
$Text
</sql>
hits: $loop

Both (xPOLYMERTYPE required), suffix processing not suppressed:
<sql "select Text from test where Text like '+xPOLYMERTYPE @0 PHATYPE w/20'">
$Text
</sql>
hits: $loop

Both, suffix processing not suppressed, xPOLYMERTYPE not required:
<sql "select Text from test where Text like 'xPOLYMERTYPE @0 PHATYPE w/20'">
$Text
</sql>
hits: $loop

Both (xPOLYMERTYPE required), suffix processing suppressed:
<sql "select Text from test where Text like '@suffixproc=0 +xPOLYMERTYPE @suffixproc=1 @0 PHATYPE w/20'">
$Text
</sql>
hits: $loop <-- Why is this zero when "xPOLYMERTYPE (required) only" returns a hit?
</a>
</script>


Thanks so much for your continued help.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Without trying it, I'd say it's because "polymer" is more than 20 characters away from "poly(hydroxy alkanoates)".

Intersects only applies to terms that have no + or - prefix. Since you have just one such term, it's effectively required as well. Your effective query is: xPOLYMERTYPE @suffixproc=1 PHATYPE w/20
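
To make the characters-versus-words point concrete, here is a small Python sketch that measures the gap between the two terms in the test row both ways. It runs outside Texis and does not claim to reproduce how Texis measures `w/20`; `gap_between` is a hypothetical helper, and "words" here just means whitespace-separated tokens.

```python
# The single row inserted by the test script.
TEXT = "polymer is selected from the group consisting of poly(hydroxy alkanoates)"


def gap_between(text, first, second):
    """Return (characters, words) lying between the end of `first`
    and the start of `second` in `text`."""
    end_first = text.index(first) + len(first)
    start_second = text.index(second)
    between = text[end_first:start_second]
    return len(between), len(between.split())


chars, words = gap_between(TEXT, "polymer", "poly(hydroxy alkanoates)")
print(chars, words)  # 42 7
```

So the two terms are more than 20 characters apart but fewer than 20 words apart, which is why the unit that `w/20` counts matters for this row.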

What are you trying to accomplish with that query?
barry.marcus
Posts: 288
Joined: Thu Nov 16, 2006 1:05 pm

Post by barry.marcus »

I guess I should have mentioned that this is a very simplified version of our actual query. Along with the non-required term included here are about thirty others, many of which are other "root terms" in our equivalence term file (e.g., in addition to "PHATYPE" there is a PHBHTYPE, PHOTYPE, etc.). In the example posted here there is only a single equivalence for PHATYPE. In our production query these root terms could have 30 to 40 equivalences each. Moreover, the root term that, in this example, precedes @0 is also one of a number of other required terms in the larger production query (i.e., they are all prepended with +), several of which are root terms for equivalence lists. Of these, a few have to have suffix processing suspended, since we may be interested in "polymer" or "polymers" but not "polymeric" or "polymic", etc. These special equivalence root words are, by our convention, preceded by a lowercase "x", as in "xPOLYMERTYPE", which stands for e(x)act.

So... What we are trying to do here is hit all the data (these are patent claims, abstracts, detailed descriptions, etc.) that has all of our required terms (for a root word, that means at least one of its equivalences), some of which we may want to match exactly (thus the @suffixproc=0 before and the @suffixproc=1 after), within a certain proximity of at least one of the non-required terms following the @0.

Here is a more realistic example:

<field> LIKE '+claim* @suffixproc=0 +xPOLYMERTYPE @suffixproc=1 +select* @0 PHBHTYPE PHATYPE PHOTYPE PHVTYPE PHBVTYPE w/20'

(this is actually a part of a much larger query that looks, in part, like this...)

<field> LIKE '+claim* @0 PHBHTYPE PHATYPE PHOTYPE PHVTYPE PHBVTYPE w/10' OR <field> LIKE '+claim* +resin @0 PHBHTYPE PHATYPE PHOTYPE PHVTYPE PHBVTYPE w/12' OR <field> LIKE '+claim* @suffixproc=0 +xPOLYMERTYPE @suffixproc=1 +group @0 PHBHTYPE PHATYPE PHOTYPE PHVTYPE PHBVTYPE w/20' OR <field> LIKE '+claim* @suffixproc=0 +xPOLYMERTYPE @suffixproc=1 +select* @0 PHBHTYPE PHATYPE PHOTYPE PHVTYPE PHBVTYPE w/20' OR <field> LIKE '+claim* +RENEWABLEPOLYMERTYPE +group @0 PHBHTYPE PHATYPE PHOTYPE PHVTYPE PHBVTYPE w/20' OR <field> LIKE '+claim* +RENEWABLEPOLYMERTYPE +select* @0 PHBHTYPE PHATYPE PHOTYPE PHVTYPE PHBVTYPE w/20'

etc., but that's beside the point...

So we'd like to find all the rows that have "claim<something>", "select<something>" and at least one of the equivalent terms of xPOLYMERTYPE (as they EXACTLY appear in the equiv file), within 20 *words* of any of the equivalent terms of the root words following the @0. This is NOT working, as far as we can tell, although it should.
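
The query shape described above (required terms, exact roots bracketed by the suffixproc switches, then @0, the optional roots, and the proximity) can be assembled mechanically. This is only a Python sketch of the string building, not of anything Texis does; `build_like_pattern` and `is_exact_root` are hypothetical names, and the "lowercase x followed by an upper-case root" test is an assumption based on the naming convention mentioned earlier.

```python
def is_exact_root(term):
    """Assumed convention: exact roots look like 'xPOLYMERTYPE'."""
    return term.startswith("x") and len(term) > 1 and term[1:].isupper()


def build_like_pattern(required, optional, within):
    """Build the LIKE pattern: +required terms, @0, optional terms, w/N."""
    parts = []
    for term in required:
        if is_exact_root(term):
            # Switch suffix processing off for this term only, then back on.
            parts.append(f"@suffixproc=0 +{term} @suffixproc=1")
        else:
            parts.append(f"+{term}")
    parts.append("@0")
    parts.extend(optional)
    parts.append(f"w/{within}")
    return " ".join(parts)


pattern = build_like_pattern(
    ["claim*", "xPOLYMERTYPE", "select*"],
    ["PHBHTYPE", "PHATYPE", "PHOTYPE", "PHVTYPE", "PHBVTYPE"],
    20,
)
print(pattern)
```

With these arguments the result is the same pattern as the "more realistic example" given earlier in this post.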

Again, I simplified the example because it seems to correctly illustrate that we are NOT getting the hits we expect. Put another way, adding all of the additional complexity does not make it work, so I distilled it down to a bare example to show what I think is the core of the issue.

Jeez... I hope I didn't confuse things. :-)

Thanks.
barry.marcus
Posts: 288
Joined: Thu Nov 16, 2006 1:05 pm

Post by barry.marcus »

Oh, and a more realistic set of the equivalences used in THIS EXAMPLE is the following:

phatype;n,mirel
phatype;n,nodax
phatype;n,pha
phatype;n,poly-beta-hydroxy-alkanoate
phatype;n,poly-hydroxy-alkanoate
phatype;n,poly-hydroxyalkanoate
phatype;n,polyhydroxy-alkanoate
phatype;n,polyhydroxyalkanoate
xpolymertype;n,copolymer
xpolymertype;n,copolymers
xpolymertype;n,macro-molecule
xpolymertype;n,macro-molecules
xpolymertype;n,macromolecule
xpolymertype;n,macromolecules
xpolymertype;n,plastic
xpolymertype;n,plastics
xpolymertype;n,polymer
xpolymertype;n,polymeric
xpolymertype;n,polymers
xpolymertype;n,thermoplastic
xpolymertype;n,thermoplastics

But like I said, it doesn't seem to make any difference... The simpler example does not seem to work correctly.
barry.marcus
Posts: 288
Joined: Thu Nov 16, 2006 1:05 pm

Post by barry.marcus »

Hi, just wondering if there are any ideas about this...

FYI, I tried some rearranging of terms, as well as increasing the proximity (i.e., w/25, w/30, etc.), all to no avail. Frustrating...

Even though "polymer" alone is found, and "poly(hydroxy alkanoates)" alone is found, and they are well within 20 words of each other, nothing I do seems to be able to get Texis to find the PAIR in the index when the first term is both required and bracketed by @suffixproc "on/off switches".
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

The simplest query that doesn't match is
poly-hydroxy-alkanoates @suffixproc=0

We found the problem. Using @setting=val in the query incorrectly causes the search to use linear rules, which then prevents "poly-hydroxy-alkanoates" from matching "poly(hydroxy alkanoates)", as in your hit markup thread.
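
As a rough illustration of why the two paths can disagree, the Python sketch below splits both strings into alphanumeric runs, which is an approximation of the `\alnum{1,99}` index expression used in the test script, and compares that with a literal character comparison. It is not Texis code and does not model the real linear matcher; `index_terms` is a hypothetical helper.

```python
import re

# Approximation (an assumption) of the index expression \alnum{1,99}.
INDEX_WORD = re.compile(r"[A-Za-z0-9]{1,99}")


def index_terms(text):
    """Split text into lower-cased alphanumeric runs."""
    return [word.lower() for word in INDEX_WORD.findall(text)]


stored = "poly(hydroxy alkanoates)"
query = "poly-hydroxy-alkanoates"

# Word by word, the stored text and the query are the same phrase.
print(index_terms(stored) == index_terms(query))  # True

# Character by character they differ, because of the punctuation.
print(query in stored)  # False
```

Seen word by word the two are the same three-word phrase, while a matcher that also looks at the characters between the words sees `(` where the query has `-`.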

The only workaround we've come up with so far is to change the parens in the search text to space or hyphen so the linear phrase matcher will accept it.
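
If the text is prepared outside Texis before it is loaded, the workaround could look something like the Python sketch below. This is only one possible way to do the substitution, under the assumption that replacing each paren with a space is acceptable for your data; `normalize_parens` is a hypothetical name.

```python
# Map each paren to a space; a hyphen would be the other option mentioned above.
_PAREN_TO_SPACE = str.maketrans({"(": " ", ")": " "})


def normalize_parens(text):
    """Replace parens with spaces and collapse the resulting runs of whitespace."""
    return " ".join(text.translate(_PAREN_TO_SPACE).split())


row = "polymer is selected from the group consisting of poly(hydroxy alkanoates)"
print(normalize_parens(row))
# polymer is selected from the group consisting of poly hydroxy alkanoates
```

Note that this changes the stored text itself, so it only suits cases where the original punctuation does not need to be shown back to users.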
barry.marcus
Posts: 288
Joined: Thu Nov 16, 2006 1:05 pm

Post by barry.marcus »

I was just wondering if any progress has been made on this issue, i.e., getting Texis to not revert to linear rules when "@setting=val" constructs are found in a query.

Thanks.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

It's been fixed. Open a ticket to request an update for your platform.