Word frequency count

valery
Posts: 26
Joined: Thu Mar 15, 2001 9:24 pm

Post by valery »

Hi,

1. Can I somehow get a word frequency distribution from my commercial Webinator database? (I suppose this information is used by the ranking algorithm, so it must be available somewhere, right?)

2. Could you please send the pricing structure for full Texis to valery@inetprom.com?

Thanks,
Valery.
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Post by bart »

I'm not sure if Webinator will allow this (it may be a Texis-only feature), but
save this script into your htdocs/webinator directory under the name words.
Edit the <db=> to be correct for your machine, then execute it with the URL
http://mymachine/cgi-bin/texis/webinator/words .

Use "%" as the wildcard character instead of "*", e.g. "ab%" will show all the
words beginning with "ab".
--------------------------------------------------------
<script language=vortex>
<db=/usr/people/mosaic/htdocs/webinator/db>

<a name=main>

<form method=get>
Word:<input name=q>
</form>
<if $q neq "">
<sql "set indexaccess=1"></sql>
<sql "select Word,Count from xhtmlbod where Word matches $q">
$Count $Word<br>
</sql>
</if>

</a>
</script>
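
For readers who want to picture what the "%" wildcard lookup does, here is a minimal Python sketch. It is only an illustration, not how Texis implements `matches`: the `counts` dictionary and the helper names `like_to_regex` and `lookup` are hypothetical, and the sketch assumes "%" stands for any run of characters.

```python
import re
from typing import Dict, List, Tuple


def like_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn a pattern that uses % as a wildcard into an anchored regex."""
    # Escape each literal piece, then join the pieces with ".*" for every %.
    parts = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$")


def lookup(counts: Dict[str, int], pattern: str) -> List[Tuple[int, str]]:
    """Return (count, word) pairs for matching words, sorted by word."""
    rx = like_to_regex(pattern)
    return sorted(
        ((n, word) for word, n in counts.items() if rx.match(word)),
        key=lambda pair: pair[1],
    )


# Hypothetical word counts standing in for the Word/Count columns.
counts = {"abacus": 3, "able": 12, "about": 40, "zebra": 1}

for n, word in lookup(counts, "ab%"):
    print(n, word)
```

The loop prints one "count word" pair per line, in the same order of fields as the `$Count $Word` line in the script.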
valery
Posts: 26
Joined: Thu Mar 15, 2001 9:24 pm

Post by valery »

Thank you for your help!

However, it might indeed be a full Texis feature, as I cannot find an xhtmlbod table under my database directory...
Still, I would expect this information to be stored in some table even in commercial-license Texis...
Here is my table layout:
__________________
SYSCOLUMNS.tbl
SYSINDEX.tbl
SYSLOCKS.SEQ
SYSMETAINDEX.tbl
SYSOBJECTS.tbl
SYSPERMS.tbl
SYSSTATS.tbl
SYSTABLES.tbl
SYSTRIG.tbl
SYSUSERS.tbl
error.tbl
gw.log
html.blb
html.tbl
options.tbl
querylog.tbl
refs.tbl
todo.tbl
xhtmlid.btr
xhtmlurl.btr
xoptname.btr
xoptstr.btr
xqueryid.btr
xrefsurl.btr
xtodourl.btr
----------------

Which one should I use?

Valery.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

xhtmlbod is an index that is created by gw -index and is the one to use.
valery
Posts: 26
Joined: Thu Mar 15, 2001 9:24 pm

Post by valery »

Thanks! It's working now.

I have one more question though (a search of previous postings did not return much):
I now want to get the same information for only one indexed domain (or URL). I suppose there is a field in the xhtmlbod table which would contain either the Url or (more likely, I guess) a reference to the entry in the html table. Trouble is, I can't find the structure of the xhtmlbod table anywhere...

Can you help?

Thanks,
Valery.
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Post by bart »

The index is not structured in a way that would let you easily get word frequency data on a URL-by-URL basis. Text inversions probably work differently than you suspect.

xhtmlbod is not a table, it's an index. That is why I issued "set indexaccess=1" in my little demo script. This index points to arrays of compressed record id and word position data. The record id then points to the actual record.
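
To make the word-to-record layout concrete, here is a rough Python sketch of a generic inverted index. It is an assumption-laden toy, not the actual Texis on-disk format: the postings are left uncompressed, and the `records` data and the helpers `total_count` and `counts_for_url` are hypothetical. It shows why a total count per word is a direct lookup, while a per-URL count means walking every word's postings.

```python
import re
from collections import defaultdict

# Hypothetical records: record id -> (url, body text).
records = {
    1: ("http://example.com/a", "red fish blue fish"),
    2: ("http://example.com/b", "one fish two fish red"),
}

# Inverted index: word -> list of (record id, word position).
# A real index would store these postings in compressed form.
index = defaultdict(list)
for rec_id, (url, body) in records.items():
    for pos, word in enumerate(re.findall(r"[a-z0-9]+", body.lower())):
        index[word].append((rec_id, pos))


def total_count(word):
    """Total occurrences of a word: one lookup, no record access needed."""
    return len(index.get(word, []))


def counts_for_url(url):
    """Per-URL counts: every word's postings must be scanned and filtered."""
    wanted = {rid for rid, (u, _) in records.items() if u == url}
    result = {}
    for word, postings in index.items():
        n = sum(1 for rid, _ in postings if rid in wanted)
        if n:
            result[word] = n
    return result


print(total_count("fish"))
print(counts_for_url("http://example.com/a"))
```

Because the index is keyed by word rather than by record, restricting the counts to one URL touches the whole index, which is why reading the stored text for that URL directly tends to be simpler.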

Using the Vortex functions <rex> and <xtree>, you could fairly quickly code a script that would do what you are describing.
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Post by bart »

Try something like this to get word frequency information by URL:

<sql ROW "select Body from html where Url matches $q">
<rex "\alnum{2,}" $Body>
<xtree insert $ret wordlist>
</sql>
<xtree dump wordlist>
<$Words=$ret>
<xtree count wordlist>
<$Count=$ret>
<loop $Words $Count>
$Count $Words<br>
</loop>
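
The same idea (pull the text for the matching URLs, split it into words of two or more alphanumeric characters, and tally them) can be sketched in Python for comparison. This is only an illustration under stated assumptions: the `bodies` list is hypothetical sample text standing in for the Body column, `word_frequencies` is a made-up helper, and lowercasing the words is a choice made here, not something the Vortex snippet is known to do.

```python
import re
from collections import Counter

# Runs of two or more ASCII letters or digits, roughly like \alnum{2,}.
WORD = re.compile(r"[A-Za-z0-9]{2,}")


def word_frequencies(bodies):
    """Tally words across the given body texts (lowercased by assumption)."""
    counts = Counter()
    for body in bodies:
        counts.update(word.lower() for word in WORD.findall(body))
    return counts


# Hypothetical page bodies for the URLs that matched the query.
bodies = ["A red fish and a blue fish.", "Red sky at night"]

freq = word_frequencies(bodies)
for word, n in sorted(freq.items()):
    print(n, word)
```

Single-character words such as "a" are skipped by the two-character minimum, just as the `{2,}` repetition in the expression above skips them.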