how to change context view (preview) format

Post by **rjshelq** » Tue Feb 14, 2012 4:20 pm

I'm using Webinator 6.01, and would like to change the format of the context view page (the page that appears when the user clicks on "Preview document matches"). Currently, the context (or preview) view seems to display all of the query hits and, apparently depending on page size, all or most of the page text.

For my purposes, I'd like to see less text surrounding each hit.

So, how could I change the Context View portion of the search script in the following ways:

1) I'd like to have only about 60 characters of context before and after each hit (with ellipses to indicate omitted text, and a line break to put each hit, or cluster of nearly adjacent hits, on its own line)

2) I'd like to have some text near the top of the page that says how many hits are found on the page.

I am currently a bit bewildered by the power and scope of Vortex, so I'd appreciate some example code that will help me do the reformatting.

Thanks!

Post by **John** » Tue Feb 14, 2012 5:46 pm

The context view is designed to show the entire document, and highlight the hits within it.

The abstract function may help you in terms of the snippets. Try making the Abstract Length much longer in the Search Settings to see how it would look.

If you can see how the abstract is done and then drop that into the context view, that might work. If you want us to help in more detail with the code, you can contact support and we can see whether that would be covered under your maintenance contract or would be an additional service.

Post by **rjshelq** » Tue Feb 14, 2012 11:51 pm

Thank you for the quick reply. Please tell me a little more about how to tweak the abstract, or otherwise get 120 character snippets of every query hit (or the first n hits) in a document.

I notice that if I reduce the abstract length down to 160 char, that display is what I'd like each snippet to look like. But I need a snippet like that for every occurrence of the query term in the entire document.

So, is there a way to somehow iterate the abstract command over the document repeatedly to cause it to put out additional snippets which include every individual query hit in the entire document?

Or... if querymultiple could be forced to show every query hit in the document, with a specified minimum number of characters around each hit, that would seem to be exactly what I'm looking for.

Or... is there a way to format a query for mminfo such that it would return 160 character snippets of the first nhits (maybe 50 hits) in the document (instead of using the abstract function)?

Post by **Kai** » Wed Feb 15, 2012 10:55 am

You can't directly force <abstract> to show every hit and center on each. But you could get close to that by setting $maxsz to the number of actual hits times 160 (your desired per-hit length): <abstract> partially determines the length of each snippet by dividing the total size ($maxsz) by the number of snippets. Assuming the query is $query and the display text is $text, something like this:

<strfmt "%mbs" $query $text>
<rex row "" $ret></rex> 
<$maxsz = ($loop*160)> 
<abstract $text $maxsz querymultiple $query> 
<fmt "%mIH" $query $text> 

Note that <abstract> will still try to align each snippet on a nearby sentence boundary for readability, so there may not be 80 chars left and right of each hit. Also, nearby hits may be merged into one snippet.

Post by **John** » Wed Feb 15, 2012 10:57 am

mminfo is probably the way to go then. You probably want to add something like " @0 w/10" to the end of the user's query, and use that as the mminfo query, so:

<strfmt "%s" $txtquery " @0 w/10">
<$disp=(mminfo($ret, $Body, 50))>

which should give you 10 words either side of each matching word.

Post by **rjshelq** » Sun Feb 19, 2012 11:38 pm

Thank you for the suggestion of using:

<$maxsz = ($loop*160)> 
<abstract $text $maxsz querymultiple $query> 

But, unfortunately, the resulting abstract only includes a small fraction of the query terms which occur in the document. For example, in an 8000 word document which actually contains 19 query hits (each about 5 characters), I have a 3000 character abstract which contains only 2 query hits. I even increased the abstract size to 6000 characters, but I still only had 2 query hits shown in the entire abstract.

And interestingly, I get exactly the same text whether I use "smart" or "querymultiple" (or "querybest"). I had expected to get some different formatting with querymultiple than with smart... which makes me wonder if querymultiple is working properly.

I'm using the x86_64 version of Webinator, build 6.01.1325780201. Is there any problem with querymultiple in that build?

Post by **Kai** » Mon Feb 20, 2012 11:08 am

True; it turns out querymultiple only generates more than one hit for a given set (term) if there is only one set in the query, and then only generates a number of hits proportional to the square root of the abstract size; I was mistaken. <abstract> is focused primarily on the "best" hit, not all hits.

You may have to construct the loci manually, and <abstract> each one. Something like this ($text is the text, $query is the query):

<$locusSz = 160> 
<$modQuery = ($query + ' w/. @0' )> 
<apicp alintersects 1>
<apicp alwithin 1>
<$halfLocusSz = ($locusSz/2)>
<$mminfo = (mminfo($modQuery, $text, 0, 0, 8+4+2+1))>
<rex ">>Data from Texis>=\space\P*\digit+" $mminfo>
<$hitOffsets = $ret> 
<capture>
<loop $hitOffsets>
<$off = (convert($hitOffsets, 'int' ) - $halfLocusSz)>
<substr $text $off $locusSz> 
<abstract $ret $locusSz querymultiple $modQuery> 
<fmt "%s ... " $ret>
</loop>
</capture>
<$abstract = $ret>
<fmt "%mIH\n" $modQuery $abstract>

Post by **rjshelq** » Tue Feb 21, 2012 10:54 pm

Thank you Kai... that is brilliant! I really appreciate your help.

Your script suggestion does indeed find all of the hits and it produces the corresponding snippets. So far, so good... but, now I have a few questions:

1) The <substr> call gets the rough 160 character "snippet" of text just fine, but unfortunately that snippet is not aligned on a word boundary. The snippet of text generally starts and stops in the middle of a word, and the subsequent call to <abstract> does not clean up the broken words.

I did find that reducing the abstract $maxsz to be 10 or 20 characters less than the rough snippet size ($locusSz) of 160 characters generally does result in a break on word boundaries, but that approach would seem to require that the difference in "window" size should be about twice as large as the largest word in the text in order to allow room for abstract to clean up the text. Perhaps that's workable. Or perhaps it's problematic.

Do you have any built-in expressions which specifically search for word boundaries?

Do you see any better technique that could be used to get the snippet of text to begin and end on word boundaries?

2) At present, inside the capture loop there is a <fmt> statement which adds an ellipsis to separate the snippets of text. How could that be modified to produce two HTML line breaks in the output rather than an ellipsis?

3) At present each hit is processed independently, and a snippet of text is produced for each hit, regardless of the presence of nearby hits. For example, if there is a hit at offset 100, another hit at offset 120 and another at 150, those three hits will presently generate three separate snippets, even though they contain essentially the same content.

What I'd like to do in each iteration of the loop is look at the offset of the next several hits and simply extend the size of the snippet (in increments of 80 characters at a time) to include any hit that is within 80 characters of any previous hit (even if that means extending the snippet size multiple times).

Can you suggest a way to do that?

Post by **mark** » Wed Feb 22, 2012 9:57 am

1) I'd expand the substr size enough for abstract to clean it up.

2) You could change the final <fmt> to <strfmt>, then after that add:
<sandr " \.\.\. " "<br><br>" $ret>
<send $ret>

3) Requires a bunch of fiddly programming for which we'd have to charge consulting $.
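For point 1, the idea of widening the rough window until it stops cutting through a word can be illustrated outside Vortex. The following is only a minimal Python sketch of that logic, not Vortex code and not how <abstract> works internally. The function names `snap_to_words` and `snippet` are made up for illustration, whitespace is assumed to be the only word separator, and offsets are treated as character offsets (the offsets reported by mminfo may be byte offsets, so multi-byte text would need extra care).

```python
def snap_to_words(text: str, start: int, end: int) -> tuple[int, int]:
    """Widen [start, end) until neither edge falls in the middle of a word."""
    # Clamp the requested window to the text.
    start = max(0, min(start, len(text)))
    end = max(start, min(end, len(text)))
    # An edge is mid-word when the characters on both sides are non-space.
    while 0 < start < len(text) and not text[start - 1].isspace() and not text[start].isspace():
        start -= 1
    while 0 < end < len(text) and not text[end - 1].isspace() and not text[end].isspace():
        end += 1
    return start, end


def snippet(text: str, hit_offset: int, locus_size: int = 160) -> str:
    """Return roughly locus_size characters around a hit, on whole words."""
    half = locus_size // 2
    start, end = snap_to_words(text, hit_offset - half, hit_offset + half)
    return text[start:end].strip()
```

Because the window only ever grows, a snippet can exceed the requested size by up to one word on each side, which avoids having to guess how much slack to leave for the longest word.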

Post by **Kai** » Wed Feb 22, 2012 11:16 am

For 3, there's already quite a bit of merge-adjacent-loci code in <abstract>; all it probably needs in this case is an (optional) arg for locus size, and a flag/mode to (try to) give all hits. Then you could give a locus size of 160, a maxsz of say 6000 (just to avoid really large results), and the get-all-hits flag.

We're considering adding these options sometime in the future (though it's not a priority right now).
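Until such options exist, the merging asked about in question 3 can be done on the list of hit offsets before any snippets are cut. Below is a rough Python sketch of that step, again not Vortex and not the merge code inside <abstract>. The names `merge_hits` and `cluster_spans` and the `gap` and `pad` parameters are hypothetical, and the 80-character values are taken from the question. Distances are measured from the start of each hit, so the length of the hit itself is ignored.

```python
def merge_hits(offsets: list[int], gap: int = 80) -> list[list[int]]:
    """Group hit offsets so each hit is within `gap` chars of the previous one."""
    clusters: list[list[int]] = []
    for off in sorted(offsets):
        if clusters and off - clusters[-1][-1] <= gap:
            # Close enough to the last hit of the current cluster: extend it.
            clusters[-1].append(off)
        else:
            # Too far away: start a new cluster.
            clusters.append([off])
    return clusters


def cluster_spans(offsets: list[int], text_len: int,
                  gap: int = 80, pad: int = 80) -> list[tuple[int, int, int]]:
    """Return (start, end, hit_count) for each cluster, padded by `pad` chars."""
    spans = []
    for cluster in merge_hits(offsets, gap):
        start = max(0, cluster[0] - pad)
        end = min(text_len, cluster[-1] + pad)
        spans.append((start, end, len(cluster)))
    return spans
```

With hits at offsets 100, 120 and 150, this yields one span covering all three rather than three overlapping snippets. Summing the third element of each tuple gives the total hit count that could be shown at the top of the page.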