
Webinator 2 problems and a bonus walking script

Posted: Tue Sep 23, 1997 3:41 pm
by Thunderstone


I'm currently working on a large search engine based on Webinator, but
I keep running into problems while walking and indexing: I get errors
indicating that the database can't be opened, and so on. I have the
following settings in my /etc/system file (I'm running Solaris 2.5.1):

set enable_sm_wa = 1
set shmsys:shminfo_shmmax=0x20000000
set semsys:seminfo_semmap=64
set semsys:seminfo_semmni=128
set semsys:seminfo_semmns=3200
set semsys:seminfo_semmnu=128
set semsys:seminfo_semume=128
set semsys:seminfo_semmsl=128
set shmsys:shminfo_shmmin=32
set shmsys:shminfo_shmmni=100
set shmsys:shminfo_shmseg=32

So my question is:
With these settings, how many GW's can I safely run on a 64 meg
Solaris 2.5.1 system?

On another note:
Attached is a little Perl script that helps me run multiple GW's across
a broad list of sites I need to walk. It's still experimental, so use it
at your own risk. The script's real purpose is to:

1) Index subsets of many websites.
2) Process robots.txt for each and every site.
3) Limit the number of gw processes running at any one time to a
user-defined maximum.
4) Automatically index the complete database when all walks are
complete.


There are three variables in the script that need to be modified:

$url_file = "unions.txt";
$db_dir = "/opt/ns-home/docs/webinator/sample";
$max_walks = 15;

$db_dir
The directory in which your database resides.

$max_walks
The maximum number of GW processes you want active at one time.

$url_file
The name of a text file containing the URLs of the homepages you want
to visit. It must reside in the database directory you are indexing to.
This file is formatted as follows: each URL must be on its own line and
begin with http://. If a URL refers to a directory rather than an HTML
file, it must end with a trailing /. That is the proper way to write
such URLs and is how you should specify them here. For example,

http://www.foo.bar/~dmchugh

is the wrong format; since ~dmchugh is a directory, the URL should end
in a /:

http://www.foo.bar/~dmchugh/

The gw command for each URL is run in the background.
A URL of http://members.aol.com/ferrancz/acea.htm will be run as:

/opt/ns-home/suitespot/docs/webinator/bin/gw -noindex
-d/opt/ns-home/suitespot/docs/webinator/unions -L -N -v0
-jhttp://members.aol.com/ferrancz/
http://members.aol.com/ferrancz/acea.htm &
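
The -j prefix is computed from each URL inside the script with a
rindex() call, like this:

$current_line = "http://members.aol.com/ferrancz/acea.htm";
# everything up to and including the last "/" becomes the -j prefix
$mod_url = substr( $current_line, 0, rindex( $current_line, "/" ) ) . "/";
# $mod_url is now "http://members.aol.com/ferrancz/"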


You should also edit the script to specify the correct path to the gw
executable (it appears three times in the script). I suppose this
should be a variable as well, but for now it works.
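
If you were to make that change, it would look something like this (an
untested sketch; $gw_path is just a name I picked):

$gw_path = "/opt/ns-home/docs/webinator/bin/gw";
...
$gw_process = `ps -aef|grep -c $gw_path`;
...
system ("$gw_path -noindex -d$db_dir -L -N -v0 -j$mod_url $current_line &");
...
system ("$gw_path -d$db_dir -index");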

Dan McHugh
Webmaster@osstf.on.ca

|| Did you know that www.netcom.com only allows
|| Infoseek and Excite to index their pages while
|| www.netcom.ca allows everybody to index their
|| site except for the /bin directory?
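
For reference, the netcom.ca policy is the simple robots.txt case:

User-agent: *
Disallow: /bin

Allowing only specific robots, as netcom.com does, means an empty
Disallow for each permitted robot and a blanket Disallow for everyone
else (the User-agent tokens below are illustrative; check each robot's
documentation for the exact names):

User-agent: InfoSeek
Disallow:

User-agent: ArchitextSpider
Disallow:

User-agent: *
Disallow: /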

The attached script, RINDEX:

#!/usr/bin/perl
#
# Script to read in URLs to traverse with Webinator.
# Written by: Dan McHugh, dmchugh@netrover.com
# This is an experimental script; use it at your own risk.

$| = 1;

# The text file to read URLs from (assumed to be in the db directory),
# the database directory you want to add to, and the maximum number of
# gw processes to run at once.
$url_file = "unions.txt";
$db_dir = "/opt/ns-home/docs/webinator/sample";
$max_walks = 15;

# Build the full path to the URL file.
$url_file = $db_dir . "/" . $url_file;
$gw_process = 0;
# The ps|grep pipeline below matches two extra processes, so allow for them.
$max_walks += 2;

# Open the URL file.
open( INPUT, $url_file ) || die "Cannot open $url_file: $!\n";

# Read the URL file one line at a time.
while( $current_line = <INPUT> )
{
    $gw_process = $max_walks;
    # Wait until the number of running gw processes drops below the limit.
    while( $max_walks <= $gw_process )
    {
        sleep 1;   # brief pause so we don't hammer the system with ps
        $gw_process = `ps -aef|grep -c /opt/ns-home/docs/webinator/bin/gw`;
    }
    chomp( $current_line );
    # Strip the page name from the URL; the resulting prefix is passed
    # to -j so gw walks only that portion of the site.
    $mod_url = substr( $current_line, 0, rindex( $current_line, "/" ) ) . "/";
    system( "/opt/ns-home/docs/webinator/bin/gw -noindex -d$db_dir -L -N -v0 -j$mod_url $current_line &" );
}
# Wait for all walks to finish. The gw's were backgrounded by the shell,
# so they are not children of this script; poll ps until only the
# pipeline's own two matches remain.
while( `ps -aef|grep -c /opt/ns-home/docs/webinator/bin/gw` > 2 )
{
    sleep 5;
}
# Now build the index on the database.
system( "/opt/ns-home/docs/webinator/bin/gw -d$db_dir -index" );



Webinator 2 problems and a bonus walking script

Posted: Tue Sep 23, 1997 5:26 pm
by Thunderstone



The settings you supply have little to do with Webinator. It does not
use shared memory on Solaris. It needs only one semaphore per database
and one connection to that semaphore per running process.

You are more likely hitting a system-wide limit on the number of open
files, or running out of memory. On Solaris, each gw takes about 4 megs
while running, and each gw uses about 14 file handles (or 24 if the
search indices have been made).
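
To put numbers on that: fifteen gw's at once is roughly 15 x 4 = 60
megs, essentially all of a 64 meg machine before the OS and anything
else takes its share, and 15 x 24 = 360 file handles once the indices
exist, which is where a system-wide open file limit starts to bite.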


This will not work as desired. Multiple unrelated gw's running on the
same database with -j options will conflict with each other and you
will lose urls. (Without the -j's it would work).

There is an (intentionally) undocumented option that will let you run
several walks at once. Use -# where # is the number of simultaneous walkers
to run. Specify all of your -j options and URLs on the initial command
line. You may wish to use an option file and a URL list file to keep
the command line manageable.
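
For example, to run three walkers at once over the two sites from the
original post (a hypothetical invocation; the paths and URLs are the
poster's):

/opt/ns-home/docs/webinator/bin/gw -3 -d/opt/ns-home/docs/webinator/sample \
  -jhttp://members.aol.com/ferrancz/ -jhttp://www.foo.bar/~dmchugh/ \
  http://members.aol.com/ferrancz/acea.htm http://www.foo.bar/~dmchugh/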

NOTE: Running more than a few walkers at once is counterproductive and
discouraged. They start contending with each other for the database and
use up system resources and network bandwidth.

Webinator 2 will handle robots.txt from multiple hosts correctly.