Want Texis to ignore parens and hyphens as if they were white space.

barry.marcus
Posts: 288
Joined: Thu Nov 16, 2006 1:05 pm


Post by barry.marcus »

We have a very urgent issue that has us almost dead in the water, heightened by the fact that we're at a client site and working on a tight schedule...

In much of the text we are searching we have many constructions such as the following:

poly(hydroxy alkanoates)

We would like to be able to concatenate together the individual chemical terms in these types of constructions (in this example they are "poly", "hydroxy" and "alkanoates") as (again, in this example) poly-hydroxy-alkanoates for inclusion in an equivalence file, and have Texis hit the phrase, regardless of the inclusion of parens, hyphens, etc. in the text. That is, the text may look like any of the following:

poly hydroxy (alkanoates)
poly hydroxy(alkanoates)
poly(hydroxy-alkanoates)
poly-hydroxy(alkanoates)
etc.

In other words, we need to have Texis regard hyphens and parens in the text as *white space* -- WHERE THERE MAY NOT BE ACTUAL WHITE SPACE -- so that, for example, the string poly-hydroxy-alkanoates *in an equivalence file* results in a hit of, say, poly(hydroxy alkanoates). Our recollection is that we used to be able to do this in the past, and we're scratching our heads wondering if a default search setting has changed with the current version.

Here is our version information:

Texis Web Script (Vortex) Copyright (c) 1996-2010 Thunderstone - EPI, Inc.
Commercial Version 6.00.1289279282 20101109 (i686-intel-winnt-64-32)

Thanks

The issue is compounded by the fact that there are MANY chemical terms involved in our project, and without the ability to concatenate them together with hyphens, the number of combinations of individual chemical terms becomes prohibitive.

Again, this is a very urgent issue for us. Your insight and help are appreciated.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm


Post by mark »

This seems to do what you describe, if I understood correctly.


<$text="poly hydroxy (alkanoates) 1of5"
"poly hydroxy(alkanoates) 2of5"
"poly(hydroxy-alkanoates) 3of5"
"poly-hydroxy(alkanoates) 4of5"
"poly(hydroxy alkanoates) 5of5"
"hydroxy alkanoates poly WRONG"
"poly hydroxy foo alkanoates WRONG"
>
<$q="poly-hydroxy-alkanoates"
"poly hydroxy alkanoates"
>
<apicp eqprefix eq>
<apicp keepeqvs 1>
<apicp alequivs 1>
<sql novars "drop table test"></sql>
<sql novars "create table test(Text varchar(20))"></sql>
<loop $text>
<sql novars "insert into test values($text)"></sql>
</loop>
<sql "create metamorph inverted index xtext on test(Text)"></sql>
<loop $q>
Query: $q
<sql row "select * from test where Text like $q">
$loop: $Text
</sql>
$loop rows matched for $q
</loop>


where eq.lst contains

poly hydroxy alkanoates
John
Site Admin
Posts: 2597
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH


Post by John »

Basically, as long as your metamorph inverted index is up to date and has appropriate index expressions, it should work. If your index expressions match text between the words, that might be a problem, as the words won't appear to be next to each other.
John Turnbull
Thunderstone Software
barry.marcus
Posts: 288
Joined: Thu Nov 16, 2006 1:05 pm


Post by barry.marcus »

The code you posted does work. Thank you. However, it turns out that the issue is actually not one of hitting the data in the table, but rather one of *highlighting*, so let me refine my explanation just a bit.

First, our code creates equivalence files with a slightly different format than you indicated in your initial reply. Rather than "poly-hydroxy-alkanoates" (which, by the way, works as well as "poly hydroxy alkanoates" in your example), our eq.lst looks like this:

myphatest;n,poly-hydroxy-alkanoates

This allows our criteria (in this example) to simply be "where Text LIKE 'myphatest @0'"

Making that slight change, and changing the definition of $q in your example to <$q="myphatest"> DOES work. But, as I said, the issue is highlighting, not hitting. To generate the highlighting, our code uses the return value of the mminfo function (i.e., as in "select mminfo...") as the basis for the markup information, and on closer examination THAT is what does not seem to be working in this example.

So... In addition to the change of eq.lst as described, and to the query definition <$q="myphatest"> in your example above, replace the query loop with this:

<loop $q>
Query: $q
<strfmt "select mminfo('%s',Text,0,0,3) mminf from test" $q>
<sql row $ret>
$loop: $mminf
</sql>
$loop rows matched for $q
</loop>

When I run this, the mminfo function returns nothing for each row in the table test. This is the problem we are having.
Kai
Site Admin
Posts: 1271
Joined: Tue Apr 25, 2000 1:27 pm


Post by Kai »

It comes down to fundamental differences in the way linear and index searches work; they cannot always be made to agree, especially as one is character-based and the other word-based. Linear search considers only whitespace to separate the words in a phrase; index search considers anything that does not match an index expression to separate words.
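
To make that difference concrete, here is a toy model in Python, not Texis code. It assumes an index expression that matches runs of alphanumerics, and the helper names (`linear_words`, `index_words`, `has_phrase`) are made up for illustration. Real Texis matching is considerably more involved; this is only a sketch of why the two views of the same row can disagree.

```python
import re

def linear_words(text):
    # Toy linear-search model: only whitespace separates words.
    return text.lower().split()

def index_words(text):
    # Toy index model: anything not matched by the assumed index
    # expression (a run of alphanumerics) separates words.
    return re.findall(r"[a-z0-9]+", text.lower())

def has_phrase(words, phrase):
    # True if the phrase words occur consecutively in `words`.
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))

phrase = ["poly", "hydroxy", "alkanoates"]
row = "poly(hydroxy alkanoates)"

print(has_phrase(index_words(row), phrase))   # True: parens act as separators
print(has_phrase(linear_words(row), phrase))  # False: "poly(hydroxy" is one word
```

In this model the index view sees three adjacent words, while the linear view sees `poly(hydroxy` and `alkanoates)` and so never finds the phrase, which matches the symptom of rows being hit but mminfo returning nothing.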

You could sandr out the parens for the mminfo:

<sql "select mminfo($q, sandr('[()]', ' ', Text), 0, 0, 3) mminf from test"></sql>

Since each paren char is replaced with one space char, the mminfo offsets should be the same.
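
The offset argument can be checked with a small Python sketch (again an illustration, not Texis code; `blank_parens` is a hypothetical stand-in for the sandr call, and the regular expression stands in for the phrase match). Because every paren becomes exactly one space, a match position found in the cleaned text points at the same characters in the original text.

```python
import re

def blank_parens(text):
    # Replace each paren with exactly one space so the length is unchanged.
    return re.sub(r"[()]", " ", text)

original = "poly(hydroxy alkanoates) 5of5"
cleaned = blank_parens(original)

# Find the phrase in the cleaned text, then reuse the offsets on the original.
m = re.search(r"poly\s+hydroxy\s+alkanoates", cleaned)
start, end = m.span()

print(len(original) == len(cleaned))  # True
print(original[start:end])            # poly(hydroxy alkanoates
```

The highlighted span in the original still contains the opening paren, since only the copy passed to the matcher was altered.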

(I'd avoid the <strfmt "...'%s'..." $q><sql $ret>: that is open to SQL injection.)
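
The injection risk is not specific to Texis. The sketch below uses Python's built-in sqlite3 module as a stand-in (an assumption made purely for illustration; the table and the hostile value are invented) to contrast pasting a value into the SQL text with binding it as a parameter, which is what embedding $q directly in the <sql> statement above appears to do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table test (Text text)")
conn.execute("insert into test values ('poly hydroxy alkanoates')")

hostile = "x' or '1'='1"

# Unsafe: the value is pasted into the SQL text, so its quote ends the literal
# and the rest is parsed as SQL.
unsafe_sql = "select Text from test where Text = '%s'" % hostile
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: the value is bound as a parameter and never parsed as SQL.
safe_rows = conn.execute(
    "select Text from test where Text = ?", (hostile,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))  # 1 0
```

The formatted query returns every row because the injected `or '1'='1'` is always true; the bound query returns none because no row equals the hostile string.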