Strip Small Words

ziplux
Posts: 3
Joined: Tue Jul 03, 2001 3:53 pm

Strip Small Words

Post by ziplux »

I need to remove all words smaller than 3 letters from a Vortex variable. What regular expression could I use to do this? Should I use sandr?
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Strip Small Words

Post by wcpriest »

What Ziplux is asking is:

How can you apply the "noise list" to a fetched URL in Vortex,
producing the output text minus those words? I.e., how do you apply
a "negative dictionary" to a text string in Vortex?

All the documentation talks about using the "noise list" from within
an SQL or Webinator search.

Thanks.
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Strip Small Words

Post by Kai »

Something like this would work, given $x:

<sandr ">>\space\P=\alpha{1,2}\space" "" $x>
<sandr ">>=\alpha{1,2}\space" "" $ret>
<sandr "\space=\alpha{1,2}>>=" "" $ret>
<$x = $ret>

The 2nd and 3rd <sandr>s catch small words at the start or end of the variable.
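
For example, wrapped in a quick test (a sketch; the sample string is ours, and the bare $x at the end simply echoes the stripped text):

<$x = "it is a test of the small word stripper">
<sandr ">>\space\P=\alpha{1,2}\space" "" $x>
<sandr ">>=\alpha{1,2}\space" "" $ret>
<sandr "\space=\alpha{1,2}>>=" "" $ret>
<$x = $ret>
$x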
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Strip Small Words

Post by Kai »

The noise list is ignored in searches. What exactly is the reason you want to strip those words from the actual source text?
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Strip Small Words

Post by wcpriest »

Kai,

Thanks so much for the syntax. We are building a learning tool for
the web that matches the learner to teachers/mentors.

The reason we wish to strip noise words out of the text is that we
are going to encyclopedia sites and extracting "topic words" related
to a general subject.

So, we go to Encarta and scoop words out of the encyclopedia pertaining,
say, to Planets. We don't want the "noise words".

We also scoop words from related pages.

We then associate each teacher/mentor with local web pages whose
scooped words pertain to that person's teaching skills.

So, if the mentor says "Planets", we now have a web page with
about 5000 words having to do with planets.

Now, if we have used texttomm to extract the keywords from some
other web page, we can ask Texis to bring back "related pages" by
using LIKEP, passing the keywords from texttomm, and then
asking for the rank-ordered local pages.
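
In Vortex, that lookup might look something like this (a sketch, assuming $keywords already holds the texttomm words joined into one string, and an illustrative pages table with Body, Url, and Mentor columns; the schema is ours, not from the thread):

<sql max=1 "select Url, Mentor from pages where Body likep $keywords">
  Best match: $Url (mentor: $Mentor)
</sql>

LIKEP returns rows in rank order by default, so MAX=1 keeps just the top page.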

We then take the highest-ranked page, follow the unique page name
of that web page to its "mentor owner", and we have made
a match between the random page on the net and this mentor.

So now we can pass this information to a client that talks to both
the mentor and the learner, and match them up using an intermediate
AOL Instant Messenger server we have built.

Could we leave in the noise words? Probably, but this way we know
that the remaining words are all "topical". Storage will be reduced, and search
times will be quicker.
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Strip Small Words

Post by wcpriest »

Kai,

One further question. You have provided the solution to question #1,
not to my clarification in #2 about avoiding "noise words".

Can you think of an efficient way to filter out the noise words?

I did find the list of them at:

http://thunderstone.master.com/texis/ma ... de199.html

but could not find a function that lets me filter those words
the way you do against a query string for Webinator.

Is there a practical way to take a long string and run it against such
a "negative dictionary" (a phrase we used at MIT in the '70s)?
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Strip Small Words

Post by bart »

See: <sandr>
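
For example, building on Kai's idiom above, you could loop a <sandr> over the noise list (a sketch: the $noise list here is abbreviated, <strfmt> builds each search expression, and, as with Kai's example, words at the very start or end of the string would need extra passes):

<$noise = "a" "about" "after" "all" "an" "and">  <!-- ...the full list from the URL above -->
<loop $noise>
  <strfmt ">>\space\P=%s\space" $noise>
  <sandr $ret "" $text>
  <$text = $ret>
</loop>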
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Strip Small Words

Post by John »

You might also consider using texttomm on the original page. It does not return noise words, and you can have it return however many words you want.
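
For example (a sketch; we are assuming <texttomm> takes the text plus a word count and leaves the chosen words in $ret; check the Vortex docs for the exact calling convention):

<texttomm $pagetext 10>  <!-- 10 noise-free query words (argument order assumed) -->
<$words = $ret>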
John Turnbull
Thunderstone Software
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Strip Small Words

Post by wcpriest »

John and Bart, thanks!

John (or Bart),

OK, so texttomm does use the "noise list" (and that is expected).

Beyond that, does it return the first 10 words not in the "noise list"
(by default), or does it have some algorithm that tries to choose the
"best" words, ones that capture the sense of the material, before we
pass these as search query words to LIKEP?

Thanks.

Dr. Priest
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Strip Small Words

Post by John »

It has an algorithm to capture the words that will work best for a LIKEP query. You could specify the number of words to be 5000 if you wanted that many words for another purpose. 10 is a suitably large number for LIKEP. The other thing texttomm does, which may or may not be desired, is that it removes duplicate occurrences of words.
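
Putting the pieces of this thread together, the flow might look like this (a sketch; <texttomm>'s calling convention and the pages schema are assumptions, as above):

<texttomm $pagetext 10>  <!-- 10 deduplicated, noise-free words (assumed) -->
<sum "%s " $ret>         <!-- join the word list into one query string -->
<$query = $ret>
<sql max=1 "select Url, Mentor from pages where Body likep $query">
  $Url ($Mentor)
</sql>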
John Turnbull
Thunderstone Software