How can you use the "noise list" against a fetched URL in Vortex
to produce output text minus those words? That is, how do you apply
a "negative dictionary" to a text string in Vortex?
All the documentation talks about using the "noise list" from within
an SQL or Webinator search.
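The idea being asked about, independent of Vortex syntax, is simple word filtering. A minimal sketch in Python, assuming a tiny illustrative noise list (not Texis's actual default list):

```python
# Sketch of applying a "negative dictionary" (noise/stop-word list) to a
# text string, as one might do to text fetched from a URL.  The noise set
# below is a small illustrative subset, not the real Texis noise list.
NOISE = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}

def strip_noise(text, noise=NOISE):
    """Return the text with noise words removed, preserving word order."""
    kept = [w for w in text.split() if w.lower() not in noise]
    return " ".join(kept)
```

For example, `strip_noise("the planets of the solar system")` yields `"planets solar system"`.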
Thanks so much for the syntax. We are building a learning tool for
the web that matches the learner to teachers/mentors.
The reason we wish to strip noise words out of the text is that we
are going to encyclopedia sites and extracting "topic words" related
to a general subject.
So we go to Encarta and scoop words out of the encyclopedia pertaining,
say, to Planets. We don't want the "noise words".
We also scoop words from related pages.
We now associate a teacher/mentor with local web pages with
scooped words that pertain to that person's teaching skills.
So, if the mentor says "Planets" -- we now have a web page with
about 5000 words having to do with planets.
Now, if we have used texttomm to extract the keywords from some
other web page, we can ask Texis to bring back "related pages" --
by using LIKEP, passing the keywords from texttomm, and then
asking for the rank-ordered local pages.
We then take the highest ranked page, follow the unique page name
of that web page to its "mentor owner" and we now have made
a match between the random page on the net and this mentor.
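The flow described above (keywords -> ranked local pages -> mentor owner) can be sketched outside Texis. LIKEP's real relevance ranking is far more sophisticated than this; the toy scorer below just counts keyword hits to illustrate the pipeline, and the `page_index`/`page_owner` structures are hypothetical stand-ins for the local tables:

```python
# Toy illustration of: query keywords -> rank local pages -> follow the
# top page's unique name to its mentor owner.  Counting keyword overlap
# is a stand-in heuristic, NOT how LIKEP actually ranks.

def rank_pages(keywords, page_index):
    """Return page names ordered by how many query keywords each contains."""
    scores = {name: len(keywords & words) for name, words in page_index.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)

def match_mentor(keywords, page_index, page_owner):
    """Follow the highest-ranked page's unique name to its mentor owner."""
    best = rank_pages(keywords, page_index)[0]
    return page_owner[best]
```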
So now we can pass this information to a client that talks to both
the mentor and the learner, and match them up using an intermediate
AOL Instant Messenger server we have built.
Could we leave in the noise words? Probably, but this way we know
that the remaining words are all "topical". Storage will be reduced, and
search times will be quicker.
OK, so texttomm does use the "noise list" (and that is expected).
Beyond that, does it return the first 10 words not in the "noise list" (by
default), or does it have some algorithm to try to choose the "best"
words within a string of words that capture the sense of the material
before we pass these as search query words in LIKEP?
It has an algorithm to capture the words that will work best for a LIKEP query. You could specify the number of words to be 5000 if you wanted that many words for another purpose; 10 is a suitably large number for LIKEP. The other thing texttomm does, which may or may not be desired, is that it removes duplicate occurrences of words.
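As a rough mental model only (texttomm's actual selection algorithm is not spelled out here, and the frequency heuristic below is an assumption, not Texis's method), a picker that drops noise words, removes duplicates, and keeps the N most useful remaining words might look like:

```python
from collections import Counter

# Illustrative noise subset, not the real Texis noise list.
NOISE = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}

def pick_keywords(text, n=10, noise=NOISE):
    """Drop noise words, deduplicate, and keep the n most frequent words.
    Frequency is a stand-in heuristic; texttomm's real selection differs."""
    words = [w.lower() for w in text.split() if w.lower() not in noise]
    return [w for w, _ in Counter(words).most_common(n)]
```

Note the deduplication falls out naturally: `Counter` collapses repeats, so each word appears at most once in the result.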