I'm trying to retrieve a large amount of data from the web for my database. I'm also running multiple instances of my retrieval script, but I'm finding that using only one PC to collect the data is not fast enough. From previous posts, I know that I can't share the database across a network for multiple PCs to retrieve data into a single database. Can you recommend a way to speed up my data retrieval process? Also, how can multiple developers use the same database if they can't share a network drive for the database?
A single machine, properly configured, is adequate to index a vast amount of information. More often than not, people blame the wrong things for the slowness. Here are some things to look at first:
1: DNS performance. If you're grabbing a lot of data from many different domains, name resolution can be a real drag. Solution: make sure you've got local, FAST DNS, or even better, become a top-level provider. (A quick way to check resolver latency is sketched below, after item 3.)
2: Network performance. You might have a really good network by anyone's standards, but 90% of websites are hosted by vendors who suck. What this translates to in the real world is that, on average, you cannot expect any more than 8 Kbytes/sec from the joe-average remote webserver. Solution: run MANAGED multiple invocations of the crawling software (see the fork/exec sketch below, after item 3).
2a: If the data being gathered is provided by a remote ASP or Perl application, then there may be nothing to do but wait. You can cripple many ASP/Perl-driven websites with even the nicest single-threaded crawler. The problem here is that a good percentage of the people who write web applications aren't really qualified to do so; their servers work OK for the 100 hits/day they usually get, but melt when faced with 100 hits/minute. (A per-host throttle like the one sketched below keeps multiple invocations from doing that.)
3: If you're using the Webinator "gw" program to do global crawling, DON'T. It wasn't designed for this. It was designed to be a good enterprise crawler, but it's (intentionally) missing the componentry required to make it good at global indexing. Solution: get creative with the "scripted crawler": ftp://ftp.thunderstone.com/pub/dowalk_beta . That software is a big building block toward a managed global crawler. You can also hire us to write the thing for you.
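Regarding item 1: here's a rough, standalone C check of name-resolution latency. Feed it a few of the hostnames you crawl; the 50 ms "slow" threshold is just an illustrative assumption, not a magic number. If lookups consistently take tens of milliseconds, a local caching nameserver will pay for itself quickly.

/* dnscheck.c - rough check of name-resolution latency.
 * Build: cc dnscheck.c -o dnscheck
 * Run:   ./dnscheck www.example.com some.other.host ...
 * The 50 ms threshold below is only an illustrative cutoff. */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <netdb.h>

int main(int argc, char **argv)
{
    int i;
    for (i = 1; i < argc; i++) {
        struct addrinfo hints, *res = NULL;
        struct timeval t0, t1;
        double ms;
        int rc;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        gettimeofday(&t0, NULL);
        rc = getaddrinfo(argv[i], "80", &hints, &res);   /* the actual lookup */
        gettimeofday(&t1, NULL);

        ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
             (t1.tv_usec - t0.tv_usec) / 1000.0;

        if (rc != 0)
            printf("%-30s FAILED: %s\n", argv[i], gai_strerror(rc));
        else
            printf("%-30s %.1f ms%s\n", argv[i], ms,
                   ms > 50.0 ? "  <-- slow; check your resolver" : "");
        if (res)
            freeaddrinfo(res);
    }
    return 0;
}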
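Regarding item 2: "managed multiple invocations" just means a parent process that launches several copies of the crawler, each over its own partition of the URL space, and watches over them. Below is a minimal sketch of that idea in C; the ./crawl binary and its -slice option are hypothetical stand-ins for whatever crawler you actually run, and 8 workers is an arbitrary starting point (tune it to your bandwidth, not your CPU count).

/* spawn.c - launch N crawler instances, each over its own URL slice.
 * "./crawl" and "-slice" are placeholders, not any Thunderstone tool. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 8   /* arbitrary; raise it until the pipe, not the CPU, is the limit */

int main(void)
{
    pid_t pids[NWORKERS];
    int i, status;

    for (i = 0; i < NWORKERS; i++) {
        pids[i] = fork();
        if (pids[i] < 0) { perror("fork"); exit(1); }
        if (pids[i] == 0) {
            char slice[16];
            snprintf(slice, sizeof(slice), "%d/%d", i, NWORKERS);
            /* each child crawls its own partition of the URL list */
            execl("./crawl", "crawl", "-slice", slice, (char *)NULL);
            perror("execl");   /* only reached if exec fails */
            _exit(127);
        }
    }
    /* the "managed" part: wait for the children; a real manager would
     * also log their exit status and restart the ones that die */
    for (i = 0; i < NWORKERS; i++)
        waitpid(pids[i], &status, 0);
    return 0;
}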
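Regarding item 2a: whatever crawler you parallelize should call a per-host politeness throttle before each fetch, so the weaker ASP/Perl servers never see more than one hit from you every few seconds. A minimal sketch follows; the 10-second delay and the fixed-size host table are assumptions you'd tune, not requirements.

/* polite.c - per-host politeness throttle: at most one request per host
 * every DELAY seconds.  A sketch; DELAY and MAXHOSTS are tunable guesses. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define MAXHOSTS 1024
#define DELAY    10          /* seconds between hits to the same host */

static struct { char host[256]; time_t last; } seen[MAXHOSTS];
static int nseen = 0;

/* call before each fetch; blocks until it is polite to hit `host` */
void polite_delay(const char *host)
{
    int i;
    time_t now = time(NULL);

    for (i = 0; i < nseen; i++) {
        if (strcmp(seen[i].host, host) == 0) {
            time_t remain = seen[i].last + DELAY - now;
            if (remain > 0)
                sleep((unsigned)remain);      /* too soon; wait it out */
            seen[i].last = time(NULL);
            return;
        }
    }
    if (nseen < MAXHOSTS) {                   /* first time we see this host */
        strncpy(seen[nseen].host, host, sizeof(seen[nseen].host) - 1);
        seen[nseen].last = now;
        nseen++;
    }
}

int main(void)
{
    /* demo: the second hit to the same host waits DELAY seconds */
    polite_delay("www.example.com");
    printf("first fetch of www.example.com\n");
    polite_delay("www.example.com");
    printf("second fetch of www.example.com (after the delay)\n");
    return 0;
}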
We've run single-machine crawling applications that will easily fetch more than 1,000,000 pages/day. But in direct answer to your question, the best way for many machines to connect to one database is via the client-server 'C' library. A less efficient way is via ODBC or the Perl DBI driver.
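For the ODBC route, each crawling or developer machine opens its own connection to the single server process that owns the database, and never touches the database files directly. Here's a bare-bones connection sketch in C; the "webdb" DSN and the credentials are placeholders for whatever your driver manager is configured with.

/* connect.c - minimal ODBC client connection sketch.
 * Build (typical unixODBC setup): cc connect.c -o connect -lodbc
 * "webdb", "user", and "passwd" are placeholders. */
#include <stdio.h>
#include <sql.h>
#include <sqlext.h>

int main(void)
{
    SQLHENV env = SQL_NULL_HENV;
    SQLHDBC dbc = SQL_NULL_HDBC;
    SQLRETURN rc;

    /* allocate an environment handle and ask for ODBC 3 behavior */
    SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
    SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);

    /* allocate a connection handle and connect to the server-side database */
    SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);
    rc = SQLConnect(dbc,
                    (SQLCHAR *)"webdb",  SQL_NTS,   /* DSN (placeholder) */
                    (SQLCHAR *)"user",   SQL_NTS,
                    (SQLCHAR *)"passwd", SQL_NTS);
    if (!SQL_SUCCEEDED(rc)) {
        fprintf(stderr, "connect failed\n");
        return 1;
    }
    printf("connected; each crawling machine opens its own connection\n");

    /* ... SQLExecDirect() inserts would go here ... */

    SQLDisconnect(dbc);
    SQLFreeHandle(SQL_HANDLE_DBC, dbc);
    SQLFreeHandle(SQL_HANDLE_ENV, env);
    return 0;
}

The same pattern applies to the 'C' library and Perl DBI: every box talks to the one server, and only the server touches the database files.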
BTW: no real database (Oracle, MSSQL, etc.) can be used in read/write mode from multiple machines via simple shared network drives.