hardware or software error?

Posted: Tue Nov 20, 2001 12:04 pm
by resume.robot
Running Texis/Webinator on a 64-bit Sun E-250 server with 2 GB of RAM running Solaris 2.7.

Commercial Version 3.01.962147411 of Jun 27, 2000 (sparc-sun-solaris2.6)

This machine has produced errors while running Webinator searches from the beginning. Kai provided some phone support several months ago, and part of the problem was identified as a DNS error. The DNS was corrected, but the errors persist.

Here is the string that is being executed:

nohup gw -d/database -noindex -a -R -r -O -fshtml -fasp -fcfm -fjsp -fxml -t7 -z5000 -v4 "&list" > nohup.a

After the todo list grows, additional gw commands are started:

nohup gw -d/database -noindex -a -R -r -O -fshtml -fasp -fcfm -fjsp -fxml -t7 -z5000 -v4 > nohup.b

When I run multiple gw commands, almost all of them die within a few hours with the message

000 Got signal 11 - quitting now
or
000 Got signal 10 - quitting now

According to the manual "UNIX Unleashed" description of kill signals:

10 Bus Error. Usually caused by a programming error, a bus error can be caused only by a hardware fault or a binary program file.

11 Segment violation. Caused by a program reference to an invalid memory location; can only be caused by a binary program file.
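For reference, the two signal numbers above can be mapped to names with a small lookup. This is only a minimal sketch: it assumes the standard Solaris numbering (10 = SIGBUS, 11 = SIGSEGV), and the table is hardcoded because the same numbers mean different things elsewhere (on x86 Linux, for example, SIGBUS is 7 and 10 is SIGUSR1). The name `describe_signal` is hypothetical and not part of gw or Texis.

```python
# Signal numbers as used on Solaris (assumption: the Sun machine follows
# the standard Solaris numbering). Hardcoded on purpose, because the
# local platform's numbering may differ from the machine that wrote the log.
SOLARIS_SIGNALS = {
    10: ("SIGBUS", "bus error"),
    11: ("SIGSEGV", "segmentation violation"),
}


def describe_signal(number):
    """Return a readable description for a signal number seen in a gw log."""
    name, meaning = SOLARIS_SIGNALS.get(number, ("unknown", "not in table"))
    return f"signal {number}: {name} ({meaning})"


if __name__ == "__main__":
    for n in (10, 11):
        print(describe_signal(n))
```

Neither name says by itself whether the cause is hardware or software; the lookup only makes the log messages easier to read.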

I suspect bad RAM.

Before I approach Sun tech support again, I would like to know if you have seen this before, and if there is any chance it could be a software error.

Texis has been re-installed on this machine several times and it has not solved the problem.

I am running commercial Webinator on a Linux machine, using the exact same gw commands, and have never had any problems.

Posted: Tue Nov 20, 2001 12:52 pm
by mark
Signals 10 and 11 may be caused by software or hardware. Your specific problem doesn't sound familiar. I assume you're running the multiple gw processes at the same time against the same database? Does it also happen if you use -dns=sys?

How do the Webinator versions/releases compare on your Sun vs. your Linux machine?

Posted: Tue Nov 20, 2001 1:14 pm
by resume.robot
Yes, the multiple simultaneous gw processes are identical and run against the same database. The only difference is that the first one reads a list, while the following ones merely spider from todo.

The Webinator version on Linux is older:

Webinator WWW Site Indexer Version 2.52 (Commercial)
Copyright(c) 1995,1996,1997,1998 Thunderstone EPI Inc.
Release: 19990218

Sun:

Webinator WWW Site Indexer Version 2.56 (Commercial)
Copyright(c) 1995,1996,1997,1998,1999,2000 Thunderstone EPI Inc.
Release: 20000627


I ran -dns=sys previously but it has been a while and I don't have records. My recollection is that it did not solve the problem. Should I run a major test using this option and observe the results? If a test of -dns=sys failed, would this positively identify a hardware error?

Posted: Tue Nov 20, 2001 2:32 pm
by mark
It's hard to completely eliminate the possibility of a software glitch, but gw is pretty stable and I would expect the newer version to be generally more stable. Probably the single largest operational change between your versions that would affect walk behavior is the internal DNS routines. Using -dns=sys will eliminate them as a potential source of errors.

Do they all tend to die at about the same time as each other? Do you get any other messages just before the signals?

Posted: Tue Nov 20, 2001 2:45 pm
by resume.robot
No, they die individually at different times. I don't see any other messages.

When todo is emptied, all processes die normally and the reminder to index is printed.

Currently I am running a test with 8 gw processes started as follows. They haven't died after an hour. I will let this run and see what happens. The original gw is still adding to todo, so there are 9 processes running.

nohup /gw -d/export/usr/data/user.new -noindex -dns=sys -a -R -r -O -fshtml -fasp -fcfm -fjsp -fxml -t7 -z5000 -v9 > nohup.11115p
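To keep a with/without comparison honest, the two kinds of run should differ only in -dns=sys. The sketch below builds the argument list from the flags quoted in this thread and toggles that one option. The function name `build_gw_argv` and the default `gw_path` value are hypothetical; nothing here is taken from the gw documentation beyond the flags already shown above.

```python
# Flags copied from the command lines quoted in this thread.
COMMON_FLAGS = ["-a", "-R", "-r", "-O",
                "-fshtml", "-fasp", "-fcfm", "-fjsp", "-fxml",
                "-t7", "-z5000"]


def build_gw_argv(database, use_system_dns, verbosity=4, gw_path="gw"):
    """Build a gw argument list; only -dns=sys differs between variants.

    gw_path is a placeholder (assumption): point it at the real binary.
    """
    argv = [gw_path, f"-d{database}", "-noindex"]
    if use_system_dns:
        argv.append("-dns=sys")
    argv.extend(COMMON_FLAGS)
    argv.append(f"-v{verbosity}")
    return argv


if __name__ == "__main__":
    # Print both variants so they can be compared side by side.
    for flag in (False, True):
        print(" ".join(build_gw_argv("/database", flag)))
```

The printed lines can be pasted after nohup with an output redirect, as in the commands above.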

Posted: Tue Nov 20, 2001 2:51 pm
by resume.robot
Here is a typical kill message:

http://www.nscl.msu.edu/~anthony/dwaresume.html
1458: TotLinks: 10835, Links: 6/ 0, Good: 0, New: 0 Retrieving
1458: TotLinks: 10835, Links: 6/ 0, Good: 0, New: 0
100 Document not found: http://www.nscl.msu.edu/~anthony/dwaresume.html returned code 404 (Not Found)
1458: TotLinks: 10835, Links: 6/ 0, Good: 0, New: 0
http://lexav.nettalk.free.fr/Contact___ ... vitae.html
1458: TotLinks: 10835, Links: 6/ 0, Good: 0, New: 0 Retrieving
1458: TotLinks: 10835, Links: 6/ 0, Good: 0, New: 0
000 Got signal 11 - quitting now
1458: TotLinks: 10835, Links: 6/ 0, Good: 0, New: 0
000 Got signal 10 - quitting now
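One way to check whether the crashes cluster on particular pages is to scan the nohup output for the signal messages and note the last URL printed before each one. The sketch below assumes the log layout shown above (a URL on a line of its own, later followed by "000 Got signal N - quitting now"); the function name `find_crashes` is hypothetical.

```python
import re
import sys

# Patterns based only on the log excerpt quoted in this thread.
SIGNAL_RE = re.compile(r"^000 Got signal (\d+) - quitting now")
URL_RE = re.compile(r"^https?://\S+")


def find_crashes(lines):
    """Return a list of (signal_number, last_url_seen) tuples."""
    crashes = []
    last_url = None
    for raw in lines:
        line = raw.strip()
        if URL_RE.match(line):
            # A bare URL line marks the page gw started working on.
            last_url = line
            continue
        match = SIGNAL_RE.match(line)
        if match:
            crashes.append((int(match.group(1)), last_url))
    return crashes


if __name__ == "__main__":
    # Usage: python scan_gw_log.py nohup.a nohup.b ...
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for signum, url in find_crashes(handle):
                print(f"{path}: signal {signum} after {url}")
```

If the same URLs keep showing up before the signal, that points toward the software choking on specific content; crashes scattered across unrelated pages are more consistent with a hardware fault, though neither result is conclusive by itself.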

Posted: Tue Nov 20, 2001 3:18 pm
by mark
If you upload your list file into
ftp://ftp.thunderstone.com/pub/people/1 ... mikeclark/
we can try it here to see if we can replicate the problem or not.

Posted: Tue Nov 20, 2001 3:21 pm
by mark
P.S. Please also indicate how big your database got so we know how much space we'll need.

Posted: Tue Nov 20, 2001 3:47 pm
by resume.robot
Still running with -dns=sys. A total of 11 gw processes have not died yet. Let me see if I can kill them first. If not, then -dns=sys may be the answer.

The list has 386,000 URLs and is 18 MB.

The basically identical database on the Linux machine: du shows 1.2 GB, and ls -al shows html.tbl at 550 MB.

The partially populated database on the Sun machine: du -k shows 400 MB.

Posted: Tue Nov 20, 2001 5:00 pm
by mark
Ok. We'll wait for your report. Let us know.