Webinator 2 problems and a bonus walking script
Posted: Tue Sep 23, 1997 3:41 pm
I'm currently working on a large search engine based on Webinator, but I
always seem to run into problems while I'm walking and indexing. I
keep getting errors, for example ones indicating that the database can't be opened.
I have the following settings in my /etc/system file (I'm running Solaris
2.5.1).
set enable_sm_wa = 1
set shmsys:shminfo_shmmax=0x20000000
set semsys:seminfo_semmap=64
set semsys:seminfo_semmni=128
set semsys:seminfo_semmns=3200
set semsys:seminfo_semmnu=128
set semsys:seminfo_semume=128
set semsys:seminfo_semmsl=128
set shmsys:shminfo_shmmin=32
set shmsys:shminfo_shmmni=100
set shmsys:shminfo_shmseg=32
So my question is:
With these settings, how many GWs can I run safely on a 64 MB Solaris
2.5.1 system?
On another note:
Attached is a little Perl script that helps me run multiple GWs
across a broad list of sites I need to walk. It's still experimental, so use it
at your own risk. The script is meant to:
1) Index subsets of many websites.
2) Process robots.txt for each and every site.
3) Limit the number of processes running gw at any one point
in time to a user-defined maximum.
4) Automatically index the complete database when all walks are
complete.
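For anyone curious how point 3 can be done, here is a minimal sketch of capping the number of concurrent child processes with fork and wait. This is not necessarily how the attached script does it; run_limited is a name I made up for illustration, and the demo jobs just run the Perl interpreter itself rather than gw.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run each job (an array ref holding a command and
# its arguments) in a child process, never more than $limit at a time.
# Returns the number of jobs that exited with status 0.
sub run_limited {
    my ($limit, @jobs) = @_;
    my ($running, $ok) = (0, 0);
    for my $job (@jobs) {
        # At the cap: block until a child exits before starting another.
        while ($running >= $limit) {
            my $pid = wait();
            if ($pid == -1) { $running = 0; last; }   # no children left
            $ok++ if $? == 0;
            $running--;
        }
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: replace this process with the job, bypassing the shell.
            exec { $job->[0] } @$job;
            exit 127;                                 # only reached if exec fails
        }
        $running++;
    }
    # Reap whatever is still running; wait() returns -1 when none are left.
    while (wait() != -1) {
        $ok++ if $? == 0;
    }
    return $ok;
}

# Demo: five trivial jobs, at most two running at once, one of them failing.
my @demo = ((map { [$^X, '-e', 'exit 0'] } 1 .. 4), [$^X, '-e', 'exit 1']);
my $succeeded = run_limited(2, @demo);
print "$succeeded of ", scalar(@demo), " jobs succeeded\n";
```

Because the jobs are direct children of the Perl process, the final wait loop only returns once every walk has finished, which is the point at which an indexing step could safely start.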
There are three variables in the script that need to be modified:
$url_file = "unions.txt";
$db_dir = "/opt/ns-home/docs/webinator/sample";
$max_walks = 15;
$db_dir
The directory in which your database resides.
$max_walks
The maximum number of GW processes you want active at one time.
$url_file
The name of a text file containing the URLs of the
home pages you want to visit.
It must reside in the database directory you are indexing to.
This file is laid out as follows:
Each URL must be on its own line and begin with http://.
If a URL refers to a directory rather than an HTML file, you must put a
trailing / on it. This is the proper way to write such URLs and should be the
way you specify URLs of this type. For example,
http://www.foo.bar/~dmchugh
is the wrong format: since ~dmchugh is a directory, it should end in a /:
http://www.foo.bar/~dmchugh/
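If you want to check a URL file before starting a long walk, something like the following sketch may help. It only tests the first rule (one URL per line, starting with http://); it cannot tell whether a missing trailing / is a mistake. bad_url_lines is a hypothetical name, and the a.example and b.example hosts are made up for the demo.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [line_number, text] pairs for every line
# that is not exactly one URL beginning with http://.
# Blank lines are reported as well.
sub bad_url_lines {
    my (@lines) = @_;
    my @bad;
    my $n = 0;
    for my $line (@lines) {
        $n++;
        (my $text = $line) =~ s/\r?\n\z//;    # drop the line ending
        push @bad, [$n, $text] unless $text =~ m{\Ahttp://\S+\z};
    }
    return @bad;
}

# Demo input: line 1 is fine, line 2 lacks http://, line 3 holds two URLs.
my @sample = (
    "http://www.foo.bar/~dmchugh/\n",
    "www.foo.bar/index.html\n",
    "http://a.example/ http://b.example/\n",
);
printf "line %d: %s\n", @$_ for bad_url_lines(@sample);
```

To check a real file, read it into a list first (for example with open and <FH> in list context) and pass the lines to the function.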
The command for each URL is run in the background.
For example, a URL of http://members.aol.com/ferrancz/acea.htm will be run as:
/opt/ns-home/suitespot/docs/webinator/bin/gw -noindex
-d/opt/ns-home/suitespot/docs/webinator/unions -L -N -v0
-jhttp://members.aol.com/ferrancz/
http://members.aol.com/ferrancz/acea.htm &
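The -j value in that command is the URL cut off after its last /, which is also why the trailing / matters for directory URLs. Here is a small sketch of that derivation and of assembling the arguments. walk_prefix and gw_args are names I made up for illustration; the flags are simply the ones from the command above, and the trailing & is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Everything up to and including the last "/" of the URL.
sub walk_prefix {
    my ($url) = @_;
    return substr($url, 0, rindex($url, "/") + 1);
}

# Hypothetical helper: build the gw argument list for one URL, reusing
# the flags shown in the post. $gw is the path to the gw executable.
sub gw_args {
    my ($gw, $db_dir, $url) = @_;
    return ($gw, "-noindex", "-d$db_dir", "-L", "-N", "-v0",
            "-j" . walk_prefix($url), $url);
}

# The example from the post.
print join(" ", gw_args(
    "/opt/ns-home/suitespot/docs/webinator/bin/gw",
    "/opt/ns-home/suitespot/docs/webinator/unions",
    "http://members.aol.com/ferrancz/acea.htm",
)), "\n";

# Without the trailing "/", the prefix falls back to the whole host.
for my $url ("http://www.foo.bar/~dmchugh", "http://www.foo.bar/~dmchugh/") {
    printf "%-30s -> %s\n", $url, walk_prefix($url);
}
```

With http://www.foo.bar/~dmchugh the prefix comes out as http://www.foo.bar/, so the walk would no longer be limited to the ~dmchugh directory.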
You should also edit the script to specify the correct path to the gw
executable (it appears three times in the script). I guess I should
make this a variable as well, but for now it works.
Dan McHugh
Webmaster@osstf.on.ca
|| Did you know that www.netcom.com only allows
|| Infoseek and Excite to index their pages while
|| www.netcom.ca allows everybody to index their
|| site except for the /bin directory?
Attachment: RINDEX

#!/usr/bin/perl
#
# Script to read in URL's to traverse with webinator
# Written By:Dan McHugh, dmchugh@netrover.com
# This is a experimental script you use it at your risk.

$| = 1;

#Set the name of the text file to read hosts from (assumes it's in the db directory)
# the database you directory you want to add too
# and the maximum number of Process's to walk
$url_file = "unions.txt";
$db_dir = "/opt/ns-home/docs/webinator/sample";
$max_walks = 15;

# Create Path and file names
$url_file = $db_dir."/".$url_file;
$gw_process = 0;
#This script also adds two extra process due to the grep statement
$max_walks += 2;

# Open URL File
open( INPUT, $url_file) || die "Cannot Open $url_file: $!\n";

# Read URL file in one line at a time.
while( $current_line = <INPUT> )
{
    $gw_process=$max_walks;
    # Loop while you have reached the maximum number GW processes
    while( $max_walks <= $gw_process )
    {
        $gw_process = `ps -aef|grep -c /opt/ns-home/docs/webinator/bin/gw`;
    }
    chomp($current_line);
    #strip out the web page from the main url to be used
    #to walk a portion of the site
    $mod_url= substr($current_line,0,rindex($current_line,"/")) . "/";
    system ("/opt/ns-home/docs/webinator/bin/gw -noindex -d$db_dir -L -N -v0 -j$mod_url $current_line &");
}
# Wait for all child process to terminate
wait;
while($? != -1)
{
    wait;
}
#Now perform the index on the database
system ("/opt/ns-home/docs/webinator/bin/gw -d$db_dir -index");