Webinator II page fetching without gw (howto)

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Post by Thunderstone »



With Texis Web Script (Vortex) you can add pages to a
Webinator database in your own way, without running "gw".

The rudimentary script below allows you to add individual
pages to an existing Webinator II database from a web form.

If you're interested in this kind of thing, also see the sections of
the Texis Web Script Manual (http://www.thunderstone.com/vortexman/)
that describe FETCH, SUBMIT, READ, URLTEXT, URLLINKS, and URLCP.


1: Place the script in the htdocs/webinator directory under the name addit.

2: Edit the directory in the <db= ...> line near the top of the script to point at your db.

3: Point your browser at http://mysite/cgi-bin/texis/webinator/addit/


<---------CUT HERE AND SAVE AS htdocs/webinator/addit ------------------>

<script language=vortex>
<db= /usr2/pub/httpd/htdocs/tmp>

<!------------- Get the page and insert it into the db ---------------->

<a name=doit>
<look>
<theform>
<small>
Remember to update your indexes with "gw -index" when you're done.
</small>
<hr>
<fetch $theurl> <!-- You should really check for errors -->
<rex ">><title>\P=!</title>+" $ret>
<$title=$ret>
<urltext>
<$text=$ret>
<USER=_SYSTEM> <!-- login to the database as SYSTEM -->
<sql "insert into html
values (counter,1,'now',0,0,$theurl-'http://',$title,$text,'')">
</sql>
<urllinks>
<loop $ret>
<sql "insert into refs values(counter,$theurl-'http://',$ret-'http://')">
</sql>
</loop>
<b>
$loop links on the page<br>
Page contained this text:
</b>
<pre>
$text
</pre>
</look>
</a>

<!------------------------ Show the form ---------------------------->

<a name=theform>
<form method=post action=$url/doit.html>
ADD THIS URL:<input name=theurl size=40 value="http://">
<input type=submit value="Go">
</form>
</a>

<!------------------------ Header and footer ------------------------->

<a name=look>
<html>
<head>
<title>Add A Webinator Document</title>
</head>
<body bgcolor="white">
</a>

<a name=/look>
</body>
</html>
</a>

<!--------------------- The main entry point ------------------------>

<a name=main>
<look>
<theform>
</look>
</a>

</script>
<!---------------------------- The end -------------------------------->
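
The comment on the fetch line in the script above says errors should be checked. A minimal sketch of one way to do that follows. It assumes that a failed fetch leaves $ret empty, which is worth confirming against the FETCH section of the manual for your Texis version. The variable name $page and the error wording are illustrative only and are not part of the original script.

```vortex
<!-- Sketch: replaces the fetch/rex/urltext lines inside the doit function -->
<fetch $theurl>
<$page = $ret>      <!-- keep the raw page; $ret is reused by later calls -->
<if $page eq "">
  <!-- Assumption: an empty result means the fetch failed -->
  <b>Could not fetch $theurl; nothing was added.</b>
<else>
  <rex ">><title>\P=!</title>+" $page>
  <$title = $ret>
  <urltext>
  <$text = $ret>
  <!-- ... continue with the inserts into html and refs as in the script ... -->
</if>
```

Testing for an empty page only catches a fetch that returned nothing at all; a server error page with a body would still be inserted.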



wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Post by wcpriest »

We are getting a crash under NT. We have tried
3 separate installations, on both NT4 and NT5 (Windows 2000),
with two different 'db' databases.

The sample 'search' script works fine against the
database (f:\texis\db\)

We have consulted:

http://thunderstone.master.com/texis/ma ... m342a54b90

We are passing this URL:

http://localhost/scripts/texis.exe/texi ... rosoft.com

to this vortex script:

<SCRIPT LANGUAGE=vortex>
<A NAME=main>
<!--varinfo list URL>
<$fields = $ret>
<LOOP $fields>
<getvar $fields>
<IF $fields eq "URL">
<$theurl = $ret>
<fetch $theurl>
<$html = $ret>
<urlinfo actualurl>
<urltext>
</IF>
</LOOP>-->

<IF $Url ne "">
<$theurl = $Url>
$theurl:<P>
<fetch $theurl>
<$html = $ret>
<urlinfo actualurl>
<urltext>
<$urltext = $ret>
<ELSE>
No URL passed!
<exit>
</IF>
$ret
<$query = "Jared">
<USER=_SYSTEM>
<DB = "f:\texis\db\">
<SQL MAX=1 "select Url from html where (Url=$Url)"><P>$next<P>
$next = -1
</SQL>
<P>
Inserting $Url into index...<P>
<SQL "insert into html (Url, Body) VALUES ('$Url', '$urltext')">
</SQL>

<SQL MAX=10 "select Url, Title, id, texttomm(Title\Body, 10) query

from html where (id = $id) or (Url=$Url)

">
<!--<SQL MAX=10
"select 'http://' + Url Url, Title, Body, length(convert(Body,'varchar')) Size, id, Visited
from html
where Title\Meta\Body likep $query">-->
$Title
<BR>
$Url<P>
</SQL>
</A>

</SCRIPT>


Regards,

Dr. Priest
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

What kind of crash? On what line of the script? What's the content of that line within the script?

Also, your syntax for the insert values is incorrect. See the example in the first message and the manual for the correct syntax (there should not be quotes around the variables).
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Post by wcpriest »

1. I moved a </P> tag after the line saying "Inserting $Url into database..."
(just before the SQL insert)

2. The "crash" is an access violation exception that reads:

Microsoft (R) Windows 2000 (TM) Version 5.00 DrWtsn32
Copyright (C) 1985-1999 Microsoft Corp. All rights reserved.
Application exception occurred:
App: (pid=1796)
When: 6/2/2001 @ 15:49:48.344
Exception number: c0000005 (access violation)

3. The crash occurs after text from the URL is displayed to the screen
but before anything else is displayed. The last line of displayed text is:

Pacific Time ©2001 Microsoft Corporation. All rights reserved. Terms of Use Text-only Home Page | Disability/accessibility | Contact Us | Privacy Statement

P.S. Is there any way to get line-by-line "trace" information from Vortex?

Regards (and thanks),

Dr. Priest
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Post by wcpriest »

P.S. yes, I've removed the single quotes from around the
string variables, thxs.
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Post by wcpriest »

and under NT4, the screen reads:
texis.exe, Exception access violation (0xc0000005),
Address 0x00473c33

Date on texis.exe is 10/31/2000
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Try inserting all of the columns, as in the example in message 1.
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Post by wcpriest »

OK

I tried this:

<SQL "insert into html (Url, Body, Title, Meta, Visited,
Depth,Dlsecs,id,New) VALUES ($Url, $urltext,'','','',1,1,counter,'')">
</SQL>

but that also crashed.

But changing the value for the field New from an empty string to an integer (or leaving the field out):

<SQL "insert into html (Url, Body, Title, Meta, Visited,
Depth,Dlsecs,id,New) VALUES ($Url, $urltext,'','','',1,1,counter,0)">
</SQL>

works fine! Record inserted fine. Retrieved fine. And didn't require
a reindex to retrieve it. So can some records be left unindexed?

I know it degrades efficiency, but how do you run reindex from inside
Vortex? We want to further operate on the same url page from the
same Vortex page.

Perhaps at http://www.thunderstone.com/site/gw25ma ... #secfields
you could give the field "type" too ?

P.S. For anyone trying to read this: those are pairs of single quotes
(empty strings), not single double-quote characters.

Thanks!

P.S.

Perhaps it shouldn't crash if the key field is missing? And if there is
a type mismatch ??? A little more "error handling" here?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Unindexed records are searched linearly. This will update the existing index on the text:
<sql "create metamorph inverted index xhtmlbod on html(Title\Meta\Body)"></sql>

You can consult the SYSCOLUMNS table to find the types of all of the fields.

You might also want to take a look at ftp://ftp.thunderstone.com/pub/dowalk_beta
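
As a rough sketch of the SYSCOLUMNS lookup mentioned above, a small script along these lines could list the field types of the html table. It assumes the system table exposes columns named NAME, TYPE and TBNAME; check the column names against your Texis documentation before relying on it. The database path is the placeholder from the first message and should be changed to your own.

```vortex
<script language=vortex>
<db= /usr2/pub/httpd/htdocs/tmp>   <!-- placeholder path: point at your db -->

<a name=main>
<USER=_SYSTEM>   <!-- log in as SYSTEM, as the script in message 1 does -->
<!-- Assumed column names: NAME, TYPE, TBNAME -->
<sql "select NAME, TYPE from SYSCOLUMNS where TBNAME = 'html'">
$NAME: $TYPE<br>
</sql>
</a>

</script>
```

Comparing the listed types with the values in an insert statement is one way to catch a mismatch such as passing an empty string for an integer field.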
wcpriest
Posts: 14
Joined: Sat May 26, 2001 12:59 pm

Post by wcpriest »

Dear Mark,

Thank you for a thorough response to my question!

The web walking script looks very useful for our K-12 learning site. We have collected many URLs of sites related to the teaching of math and science, and we need to walk those pages, gather up a lot of words, and then let webinator search those as a single page.

Seeing those hits, the student can then decide which learning web site is best suited to his/her learning needs!

Thanks again,

Dr. Priest