Dowalk and .shtml files

jeuteneier
Posts: 32
Joined: Wed May 16, 2001 2:54 pm

Post by jeuteneier »

Hello,

I'm trying to use do_walk so I can use <DEL></DEL> tags to comment out my header, which is set up as an include on my pages. I am entering the following command:

/usr/local/apache/cgi-bin/texis top=http://172.18.21.40/retweb/index.shtml do_walk/dispatch.txt

It starts on the index.shtml page and then finishes right away. Nothing is indexed, even though the script is set up to recognize the .shtml extension. It says:

Started 1 (40050) on http://172.18.21.40/retweb/index.shtml
Finished 40050 on http://172.18.21.40/retweb/index.shtml
Updating Metamorph index

When I look at the log, I get the following (just the first few lines):

http://172.18.21.40/robots.txt
100 do_walk(getrobotstxt) 590: Document not found: http://172.18.21.40/robots.txt returned code 404 (Not Found)
http://172.18.21.40/retweb/index.shtml
100 do_walk(procpage) 415: More Values Than Fields in the function Insert while processing url http://172.18.21.40/retweb/index.shtml
http://172.18.21.40/retoffice/applicati ... owto.shtml

It seems to be crawling the links, because it visits the PDFs and all the .shtml files; it just doesn't index them. The "More Values Than Fields in the function Insert while processing url" error also seems to occur on every page.

I saw in one posting that you suggested someone look at which URLs were in the database. I did that and got:

115 Field url non-existent
115 Field non-existent or type error in url
000 SQL Prepare() failed with -1

Is it something with the .shtml?

Thanx,
Justin
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

It's nothing to do with .shtml. It would seem that whatever modifications you've made to the dowalk_beta script have made it incompatible with the webinator database. Make sure you don't add or delete fields from the "insert" statements.

The field is "Url", not "url". Case matters.
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Post by bart »

I was unable to examine the URL http://172.18.21.40/retweb/index.shtml. Your server must be walled or down.

The error message is an unusual one, and we'll have to wait until someone who knows about this message is available in the morning (EST).

One clue, however, is these messages:

115 Field url non-existent
115 Field non-existent or type error in url

Since the actual field in the table is named "Url" and not "url", I suspect someone might have been tampering with the code. Try editing line 415 of the script and changing the "u" in "url" to upper case.
jeuteneier
Posts: 32
Joined: Wed May 16, 2001 2:54 pm

Post by jeuteneier »

Here are the parts of the script that I have altered:

<a name=settings>
<$defdb=/usr/local/apache/htdocs/webinator/db>
<!-- allow this to change from command line -->
<if "" eq $db ><$db=$defdb></if> <!-- specify database -->
<user=_SYSTEM>
<db=$db>
<!---------->
<sum "%s" $db "/gw.log">
<$logfile=$ret> <!-- where to log walker output -->
<rex ">>=http://=[^/?]+" $top><$hostprefix=$ret>
<sandr "\." "\\." $hostprefix><strfmt ">>=%s" $ret><$hostexpr=$ret>
<urlcp timeout 30> <!-- page timeout -->
<urlcp maxpgsize 50000> <!-- max page size -->
<urlcp maxredirs 1> <!-- max redirs to follow -->
<urlcp getframes yes> <!-- treat frames as one page -->
<urlcp maxframes 5> <!-- max frames per page -->
<urlcp offsiteok no> <!-- fetch offsite redirs&frames -->
<!-- <urlcp user SOMEUSERNAME> --> <!-- username for secure sites -->
<!-- <urlcp pass SOMEPASSWORD --> <!-- password for secure sites -->
<!-- <urlcp useragent "Mozilla/2.0"> --><!-- pretend to be this browser -->
<urlcp 8bithtml yes> <!-- keep 8 bit chars as is -->
<urlcp alttxt yes> <!-- keep alt text from images -->
<urlcp strike yes> <!-- keep <strike> text -->
<$acceptmime=
"text/*"
"application/pdf"
"application/msword"
"application/wordperfect"
"application/x-shockwave-flash"
>
<urlcp accept $acceptmime> <!-- mime types to accept -->
<!-- <urlcp proxy http://proxyhostname> <!-- use proxy -->
<$maxlinksize=1024> <!-- ignore urls longer than this -->
<$robots=yes> <!-- respect robots.txt -->
<$maxdepth=2000000000> <!-- maximum walk depth -->
<$maxparallel=10> <!-- max # of parallel walkers -->
<$maxpages=2000000000> <!-- max # of pages to get -->
<$storerefs=1> <!-- store refs in refs table -->
<$startoffsiteattop=0> <!-- ignore path on offsite urls -->
<$stayunder=0> <!-- stay under initial url dir -->
<$dounique=1> <!-- hash pages to prevent dups -->
<$ipcstring=1.1.1.1> <!-- IP address to use as signal -->
<!-- acceptable suffixes -->
<$acceptsufs="http://=[^/?]+>>=" "/=[^./]*>>=">
<acceptext ext=".shtml">
<acceptext ext=".pdf" ><!-- Adobe Acrobat PDF Files -->
<$allowedsites=$hostexpr>
<!-- acceptable domains -->
<!-- <acceptdomain domain="somedomain.com"> -->
<!-- undesirable urls -->
<$rejects="/cgi-bin/" "[\&\?\+\~]">
<addrejectprefix prefix="http://www.thunderstone.com/">
<$requires=>
<!-- how to recognize and unalias the "index" page to "/" -->
<setindexhtml filename="index.shtml">
<$partialpageexprs="Can't close connection"
"Page not expected size"
"Max page size exceeded">
</a>

<a name=updateindex>
<!-- delete default index expression -->
<!-- <sql "set delexp=0"></sql>
<!-- add a custom index expression or 2 -->
<!-- uncomment as needed <sql "set addexp='SOME_EXPRESSION'"></sql> -->
<!-- uncomment as needed <sql "set addexp='ANOTHER_EXPRESSION'"></sql> -->
<!-- customize noise list -->
<!-- uncomment as needed - you would also do this in the search script
<$mynoise="noise" "words" "here">
<apicp noise $mynoise>
-->
<!-- index all words including noise -->
<!-- uncomment as needed <apicp keepnoise 1> also do in search script -->
<!-- create standard metamorph search index -->
<sql "create metamorph inverted index xhtmlbod on html(Title\Meta\Body)"></sql>
<!-- create custom metamorph search index - search should reflect this -->
<!-- <sql "create metamorph inverted index xhtmlbod on html(Title\Description\Keywords\Body)"></sql> -->
</a>

Does this part need to be changed to indexshtml?
<!------------------------------------------------------------------------->
<!-- cleanup links - remove location anchors -->
<a name=clean>
<sandr "\#=[\alnum_]+>>=" "" $links>
<sandr $s_indexhtml $r_indexhtml $ret>
<$link=$ret>
</a>

Here is the script around line 415. Is the last SQL statement supposed to be commented out?
<!-- insert into customized html table -->
<!-- this should agree with the create table statement in maketables -->
<sql novars "insert into html values($id,0,$when,$dlsecs,$depth,
$xurl,$title,$page,$mkeywords,
$mdescription
)">
</sql>
<!-- insert into standard html table
<sum "%s\n" $mdescription $mkeywords>
<sql novars "insert into html values($id,0,$when,$dlsecs,$depth,
$xurl,$title,$page,
$ret
)">
</sql>
-->

And finally the call to Anytotx:
<a name=dofilt opt dt>
<local saveq x pagehead>
<$didplugin=y>
<$saveq=$quietmsg>
<$errmsg="">
<$quietmsg=1>
<exec nobr /usr/local/apache/htdocs/webinator/bin/anytotx -mTitle -mSubject -mKeywords $opt><fmt "%s" $htmlpage></exec>

I haven't changed the case of the "u" in "url" yet because I wanted you to see exactly what script I was running. I know this is a long message. I appreciate your time.

Justin Euteneier
503-797-7316
Fred Meyer
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Post by bart »

Without knowing what changes you made, it's difficult to see what the issue is here.

One thing is for sure: Tech Support does not extend to consulting on or debugging OPC (other people's code).

My suggestion is to backtrack through your code modifications until the script works correctly, and then go forward again until you see what the issue is.

Texis is CaSe sensitive with respect to variable and field names. This is at the root of your current problem.
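
To illustrate why a query on "url" fails while "Url" succeeds, here is a minimal Python sketch of exact, case-sensitive field matching. It is not Texis code. The find_field helper and the HTML_FIELDS list are hypothetical; only "Url" is confirmed as a field name in this thread, and Title, Meta and Body are taken from the index statement in the posted script.

```python
def find_field(name, fields):
    """Look up a field by exact, case-sensitive name.

    Returns (found, suggestion). The suggestion is a field whose name
    differs from the request only by letter case, if one exists.
    """
    if name in fields:
        return True, None
    by_lowercase = {field.lower(): field for field in fields}
    return False, by_lowercase.get(name.lower())


# Assumed field list: "Url" comes from the replies above; the others are
# borrowed from the "create metamorph inverted index" statement.
HTML_FIELDS = ["Url", "Title", "Meta", "Body"]

print(find_field("url", HTML_FIELDS))  # (False, 'Url')
print(find_field("Url", HTML_FIELDS))  # (True, None)
```

A lookup that reports the near miss makes this kind of error much quicker to spot than a bare "Field url non-existent".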
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

You need to read the comments at the beginning of the dowalk_beta script. They indicate what parts to comment or uncomment. You should be inserting into the "standard" table.
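
As a rough way to see where "More Values Than Fields" comes from, the Python sketch below counts the values in the two insert statements posted above. It is not part of dowalk. The count_insert_values helper is hypothetical, and it assumes that no value contains a comma and that the existing html table was created with the standard layout, which this thread suggests but does not show.

```python
import re


def count_insert_values(statement):
    """Count the comma-separated values inside VALUES(...).

    Naive by design: it splits on every comma, so it would miscount
    a quoted value that itself contains a comma.
    """
    match = re.search(r"values\s*\((.*)\)", statement, re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no VALUES(...) clause found")
    return len([value for value in match.group(1).split(",") if value.strip()])


# The "customized" insert that is active in the posted script.
CUSTOM_INSERT = """insert into html values($id,0,$when,$dlsecs,$depth,
$xurl,$title,$page,$mkeywords,
$mdescription
)"""

# The "standard" insert that is commented out in the posted script.
STANDARD_INSERT = """insert into html values($id,0,$when,$dlsecs,$depth,
$xurl,$title,$page,
$ret
)"""

custom_count = count_insert_values(CUSTOM_INSERT)
standard_count = count_insert_values(STANDARD_INSERT)
print(custom_count, standard_count)  # 10 9
```

If the table really has the standard layout, the customized insert supplies one value more than the table has fields, which matches the error in the log.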