Strip Small Words

ziplux
Posts: 3
Joined: Tue Jul 03, 2001 3:53 pm

Strip Small Words

Post by ziplux »

I need to remove all words smaller than 3 letters from a Vortex variable. What regular expression could I use to do this? Should I use sandr?
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Strip Small Words

Post by wcpriest »

What Ziplux is asking is:

How can you apply the "noise list" to a fetched URL in Vortex,
producing the output text minus those words? I.e., how do you apply
a "negative dictionary" to a text string in Vortex?

All the documentation talks about using the "noise list" from within
an SQL or Webinator search.

Thanks.
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Strip Small Words

Post by Kai »

Something like this would work, given $x:

<sandr ">>\space\P=\alpha{1,2}\space" "" $x>
<sandr ">>=\alpha{1,2}\space" "" $ret>
<sandr "\space=\alpha{1,2}>>=" "" $ret>
<$x = $ret>

The 2nd and 3rd <sandr>s catch small words at the start or end of the variable.
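
For example, wrapped in a quick test (a sketch; the sample string is ours, and the bare $x at the end simply echoes the stripped text):

<$x = "it is a test of the small word stripper">
<sandr ">>\space\P=\alpha{1,2}\space" "" $x>
<sandr ">>=\alpha{1,2}\space" "" $ret>
<sandr "\space=\alpha{1,2}>>=" "" $ret>
<$x = $ret>
$x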
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Strip Small Words

Post by Kai »

The noise list is ignored in searches. What exactly is the reason you want to strip those words from the actual source text?
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Strip Small Words

Post by wcpriest »

Kai,

Thanks so much for the syntax. We are building a learning tool for
the web that matches the learner to teachers/mentors.

The reason we wish to strip noise words out of the text is that we
are going to encyclopedia sites and extracting "topic words" related
to a general subject.

So, we go to Encarta and scoop words out of the encyclopedia pertaining,
say, to Planets. We don't want the "noise words".

We also scoop words from related pages.

We then associate each teacher/mentor with local web pages whose
scooped words pertain to that person's teaching skills.

So, if the mentor says "Planets", we now have a web page with
about 5000 words having to do with planets.

Now, if we have used texttomm to extract the keywords from some
other web page, we can ask Texis to bring back "related pages" by
using LIKEP, passing the keywords from texttomm, and then
asking for the rank-ordered local pages.
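
In Vortex, that lookup might look something like this (a sketch, assuming $keywords already holds the texttomm words joined into one string, and an illustrative pages table with Body, Url, and Mentor columns; the schema is ours, not from the thread):

<sql max=1 "select Url, Mentor from pages where Body likep $keywords">
  Best match: $Url (mentor: $Mentor)
</sql>

LIKEP returns rows in rank order by default, so MAX=1 keeps just the top page.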

We then take the highest-ranked page, follow the unique page name
of that web page to its "mentor owner", and we have made
a match between the random page on the net and this mentor.

So now we can pass this information to a client that talks to both
the mentor and the learner, and match them up using an intermediate
AOL Instant Messenger server we have built.

Could we leave in the noise words? Probably, but this way we know
that the remaining words are all "topical". Storage will be reduced, and search
times will be quicker.
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Strip Small Words

Post by wcpriest »

Kai,

One further question. You have provided the solution to question #1,
not to my clarification in #2 about avoiding "noise words".

Can you think of an efficient way to filter out the noise words?

I did find the list of them at:

http://thunderstone.master.com/texis/ma ... de199.html

but could not find a function that lets me filter those words
the way you do against a query string for Webinator.

Is there a practical way to take a long string and run it against such
a "negative dictionary" (a phrase we used at MIT in the '70s)?
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Strip Small Words

Post by bart »

See: <sandr>
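
For example, building on Kai's idiom above, you could loop a <sandr> over the noise list (a sketch: the $noise list here is abbreviated, <strfmt> builds each search expression, and, as with Kai's example, words at the very start or end of the string would need extra passes):

<$noise = "a" "about" "after" "all" "an" "and">  <!-- ...the full list from the URL above -->
<loop $noise>
  <strfmt ">>\space\P=%s\space" $noise>
  <sandr $ret "" $text>
  <$text = $ret>
</loop>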
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Strip Small Words

Post by John »

You might also consider using texttomm on the original page. It does not return noise words, and you can have it return however many words you want.
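
For example (a sketch; we are assuming <texttomm> takes the text plus a word count and leaves the chosen words in $ret; check the Vortex docs for the exact calling convention):

<texttomm $pagetext 10>  <!-- 10 noise-free query words (argument order assumed) -->
<$words = $ret>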
John Turnbull
Thunderstone Software
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Strip Small Words

Post by wcpriest »

John and Bart, thanks!

John (or Bart),

OK, so texttomm does use the "noise list" (and that is expected).

Beyond that, does it return the first 10 words not in the "noise list"
(by default), or does it have some algorithm that tries to choose the
"best" words, ones that capture the sense of the material, before we
pass these as search query words to LIKEP?

Thanks.

Dr. Priest
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Strip Small Words

Post by John »

It has an algorithm to capture the words that will work best for a LIKEP query. You could specify the number of words to be 5000 if you wanted that many words for another purpose. 10 is a suitably large number for LIKEP. The other thing texttomm does, which may or may not be desired, is that it removes duplicate occurrences of words.
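
Putting the pieces of this thread together, the flow might look like this (a sketch; <texttomm>'s calling convention and the pages schema are assumptions, as above):

<texttomm $pagetext 10>  <!-- 10 deduplicated, noise-free words (assumed) -->
<sum "%s " $ret>         <!-- join the word list into one query string -->
<$query = $ret>
<sql max=1 "select Url, Mentor from pages where Body likep $query">
  $Url ($Mentor)
</sql>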
John Turnbull
Thunderstone Software