crontab misbehaviour?

r.j.michell
Posts: 12
Joined: Tue Feb 27, 2001 8:38 am

Post by r.j.michell »

Hi there:

We have a crontab job that rewalks our site every night at midnight. Twice I have successfully gotten gw to pick up .shtml pages by using bin/gw -fshtml, but on checking the next day I found that the rewalk must have undone this, as the .shtml pages were no longer being picked up in a search.

Here is the batch file we run at 12:00 am:

# index-apu.bat invoked via crontab
# Change dir to webinator:
cd /www/httpd/html/webinator
# rewalk globalDB maintaining options from initial walk
bin/gw -rewalk -dglobalDB
# Domains included in globalDB:
# List of domains here
# Change owner,group,mode of globalDB after rewalk:
chown nobody globalDB
chgrp nobody globalDB
chmod 775 globalDB
# Change mode of all tables in globalDB:
chmod -R 666 /www/httpd/html/webinator/globalDB

Using -rewalk is supposed to maintain all options and sites/domains from the initial manual walk, yet this doesn't seem to be the case.

Does anyone know what's happening?
We use webinator V2.55 on RedHat 6.1

Thanks for any pointers!
Russ
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

What was your complete manual gw command before you did rewalk?
And, do you get any error messages from the rewalk?
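One way to make sure error messages from a cron-driven rewalk are not lost is to wrap the command so that its output and exit status end up in a log file. This is only a sketch: the function name run_logged and the log path are hypothetical, and the gw invocation in the usage comment is simply the one from the batch file above.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Hypothetical helper: appends the command's stdout and stderr to LOGFILE,
# then records its exit status there. Returns the command's exit status.
run_logged() {
    log=$1
    shift
    echo "=== $(date): $*" >> "$log"
    "$@" >> "$log" 2>&1
    status=$?
    echo "=== exit status: $status" >> "$log"
    return "$status"
}

# Usage (the log path is an assumption, not something from this thread):
# cd /www/httpd/html/webinator &&
#     run_logged /tmp/rewalk.log bin/gw -rewalk -dglobalDB
```

Anything gw prints during the nightly run is then available the next morning, together with the exit status.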
r.j.michell
Posts: 12
Joined: Tue Feb 27, 2001 8:38 am

Post by r.j.michell »

Complete manual gw command:

/www/httpd/html/webinator/bin/gw -fshtml -N -d/www/httpd/html/webinator/globalDB
-jhttp://ourdomain -dns=sys http://ourdomain

In the gw.log for the rewalk on globalDB NO .shtml pages were rewalked or are indicated in ANY way.

The following occasional errors are found:

* Resource temporarily unavailable
* returned code 404 (Not Found)
* Document not found: somedomain.com returned code 404 (OK)
* returned code 403 (Forbidden)
* Can't get address for host 'someexternal site.com': No such file or directory
* Can't get address for host `h641-a': No such file or directory
* Can't get address for host `someexternal site.com': Resource temporarily unavailable
* Max page size exceeded (truncated) for 'somepage.html'

These are quite normal when you consider there are over 7000 pages to be indexed.

What could be the problem? I have halted the cronjob for now so that it doesn't screw up the search again, before I can get an answer.

Thanks.
Russ
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I'm not able to replicate that behavior, but I'm using 2.56. There was a problem like you describe in versions prior to 2.52 but not since.
Check your options table with
gw -d/www/httpd/html/webinator/globalDB -st "select Name,String from options"
Mine looks like this:
Name     String
version  2.56
dns      sys
f        html
f        htm
f        txt
f
f        shtml
j        http://mysite/test
k        \alnum{2,30}
p        500000000
t        30
v        2
w        0
z        100000
Z        1024
I        index.html
M        3.0
N
T        text/*
URL      http://mysite/test/

You could try downloading the latest version from our website.
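If you want to script that check, a small filter over the options output can tell you whether a given row is present. This is a rough sketch, not part of Webinator: has_option is a hypothetical helper, and it assumes the output keeps the two-column "Name String" layout shown above.

```shell
#!/bin/sh
# has_option NAME VALUE
# Hypothetical helper: reads "Name String" rows on stdin and succeeds
# only if a row with exactly that name and value is present.
has_option() {
    awk -v name="$1" -v value="$2" '
        $1 == name && $2 == value { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage, with the same query as above:
# bin/gw -d/www/httpd/html/webinator/globalDB \
#     -st "select Name,String from options" |
#     has_option f shtml || echo "shtml is missing from the options table"
```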
r.j.michell
Posts: 12
Joined: Tue Feb 27, 2001 8:38 am

Post by r.j.michell »

Hmmmm interesting: take a look, no mention of shtml:

Name     String
version  2.55
f        html
dns      sys
f        htm
f        txt
k        \alnum{2,30}
p        500000000
f
t        30
v        2
z        100000
I        index.html
w        0
T        text/*
URL      http://domain1
M        3.0
N
URL      http://domain2
URL      http://domain3
URL      http://domain4
j        http://domain5

When I issued the original command I used '\' after each segment to stop my telnet client from wrapping the text badly across the screen. This should make no difference, right? I mean, the manual search worked...
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

That doesn't quite agree with the manual gw command you gave before, even disregarding the shtml issue. Try doing the initial walk to a totally new database. If the problem persists, upgrade to the newer version.

Properly used backslashes aren't a problem. You would have gotten error messages if they were improperly used.

BTW, -jhttp://hostname is pointless (but also harmless). gw always stays on the host(s) you specify. -j only has meaning if there's something after the hostname, such as -jhttp://hostname/somedir/
r.j.michell
Posts: 12
Joined: Tue Feb 27, 2001 8:38 am

Post by r.j.michell »

Mark, thanks for your help thus far. This walk seems to have picked up shtml:

1: bin/gw -d/www/httpd/html/webinator/apuDB -create
2: bin/gw -fshtml -N -d/www/httpd/html/webinator/apuDB -jhttp://www.apu.ac.uk/ -dns=sys http://www.apu.ac.uk/
3:
[root@]# chmod 775 apuDB
[root@]# chmod -R 666 apuDB
[root@]# chown nobody apuDB
[root@]# chgrp nobody apuDB

//Results in successful page search

4: bin/gw -d/www/httpd/html/webinator/apuDB -st "select Name,String from options"

//Results in

Name     String
version  2.55
f        html
dns      sys
f        htm
f        txt
k        \alnum{2,30}
p        500000000
f
t        30
v        2
f        shtml
I        index.html
w        0
z        100000
M        3.0
N
URL      http://mydomain/
j        http://mydomain/
T        text/*

I'll have to wait for the results of the cronjob to see if this was successful or not. If not, it would seem to be a problem with the -rewalk option not remembering the initial walk.

I may well be back tomorrow!
Thanks very much for your help.

Russ
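To see whether the rewalk really changes the options, the options table could be dumped to a file before and after the cron job and the two dumps compared. The sketch below assumes that approach; options_diff is a hypothetical helper and the file names in the usage comment are made up. It sorts both dumps first, because the row order in the listings above is not stable.

```shell
#!/bin/sh
# options_diff BEFORE AFTER
# Hypothetical helper: compares two saved dumps of the options table,
# ignoring row order. Rows only in BEFORE are printed with "<", rows only
# in AFTER with ">". Succeeds when both dumps hold the same rows.
options_diff() {
    before_sorted=$(mktemp) || return 2
    after_sorted=$(mktemp) || return 2
    sort "$1" > "$before_sorted"
    sort "$2" > "$after_sorted"
    result=$(diff "$before_sorted" "$after_sorted" | grep '^[<>]' || true)
    rm -f "$before_sorted" "$after_sorted"
    if [ -n "$result" ]; then
        printf '%s\n' "$result"
        return 1
    fi
    return 0
}

# Usage (file names are assumptions):
# bin/gw -d/www/httpd/html/webinator/apuDB \
#     -st "select Name,String from options" > /tmp/options.before
# ... let the nightly rewalk run, dump again to /tmp/options.after ...
# options_diff /tmp/options.before /tmp/options.after
```

A line such as "< f shtml" in the output would mean that row was present before the rewalk and gone afterwards.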