I'm trying to retrieve a large amount of data from the web for my database. I'm also running multiple instances of my retrieval script, but I'm finding that using only one PC to collect the data is not fast enough. From previous posts, I know that I can't share the database across a network for multiple PCs to retrieve data into a single database. Can you recommend a way to speed up my data retrieval process? Also, how can multiple developers use the same database if they can't share a network drive for the database?
A single machine, properly configured, is adequate to index a vast amount of information. More often than not, people blame the wrong things for the slowness. Here are some things to look at first:
1: DNS performance. If you're grabbing a lot of data from many different domains, name resolution can be a real drag. Solution: make sure you've got local, FAST DNS, or even better, become a top-level provider. (A quick way to check resolver latency is sketched below, after item 3.)
2: Network performance. You might have a really good network by anyone's standards, but 90% of websites are hosted by vendors who suck. What this translates to in the real world is that, on average, you cannot expect any more than 8 Kbytes/sec from the joe-average remote webserver. Solution: run MANAGED multiple invocations of the crawling software (see the fork/exec sketch below, after item 3).
2a: If the data being gathered is provided by a remote ASP or Perl application, then there may be nothing to do but wait. You can cripple many ASP/Perl-driven websites with even the nicest single-threaded crawler. The problem here is that a good percentage of the people who write web applications aren't really qualified to do so; their servers work OK for the 100 hits/day they usually get, but melt when faced with 100 hits/minute. (A per-host throttle like the one sketched below keeps multiple invocations from doing that.)
3: If you're using the Webinator "gw" program to do global crawling, DON'T. It wasn't designed for this. It was designed to be a good enterprise crawler, but it's (intentionally) missing the componentry required to make it good at global indexing. Solution: get creative with the "scripted crawler": ftp://ftp.thunderstone.com/pub/dowalk_beta . That software is a big building block toward a managed global crawler. You can also hire us to write the thing for you.
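Regarding item 1: here's a rough, standalone C check of name-resolution latency. Feed it a few of the hostnames you crawl; the 50 ms "slow" threshold is just an illustrative assumption, not a magic number. If lookups consistently take tens of milliseconds, a local caching nameserver will pay for itself quickly.

/* dnscheck.c - rough check of name-resolution latency.
 * Build: cc dnscheck.c -o dnscheck
 * Run:   ./dnscheck www.example.com some.other.host ...
 * The 50 ms threshold below is only an illustrative cutoff. */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <netdb.h>

int main(int argc, char **argv)
{
    int i;
    for (i = 1; i < argc; i++) {
        struct addrinfo hints, *res = NULL;
        struct timeval t0, t1;
        double ms;
        int rc;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        gettimeofday(&t0, NULL);
        rc = getaddrinfo(argv[i], "80", &hints, &res);   /* the actual lookup */
        gettimeofday(&t1, NULL);

        ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
             (t1.tv_usec - t0.tv_usec) / 1000.0;

        if (rc != 0)
            printf("%-30s FAILED: %s\n", argv[i], gai_strerror(rc));
        else
            printf("%-30s %.1f ms%s\n", argv[i], ms,
                   ms > 50.0 ? "  <-- slow; check your resolver" : "");
        if (res)
            freeaddrinfo(res);
    }
    return 0;
}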
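Regarding item 2: "managed multiple invocations" just means a parent process that launches several copies of the crawler, each over its own partition of the URL space, and watches over them. Below is a minimal sketch of that idea in C; the ./crawl binary and its -slice option are hypothetical stand-ins for whatever crawler you actually run, and 8 workers is an arbitrary starting point (tune it to your bandwidth, not your CPU count).

/* spawn.c - launch N crawler instances, each over its own URL slice.
 * "./crawl" and "-slice" are placeholders, not any Thunderstone tool. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 8   /* arbitrary; raise it until the pipe, not the CPU, is the limit */

int main(void)
{
    pid_t pids[NWORKERS];
    int i, status;

    for (i = 0; i < NWORKERS; i++) {
        pids[i] = fork();
        if (pids[i] < 0) { perror("fork"); exit(1); }
        if (pids[i] == 0) {
            char slice[16];
            snprintf(slice, sizeof(slice), "%d/%d", i, NWORKERS);
            /* each child crawls its own partition of the URL list */
            execl("./crawl", "crawl", "-slice", slice, (char *)NULL);
            perror("execl");   /* only reached if exec fails */
            _exit(127);
        }
    }
    /* the "managed" part: wait for the children; a real manager would
     * also log their exit status and restart the ones that die */
    for (i = 0; i < NWORKERS; i++)
        waitpid(pids[i], &status, 0);
    return 0;
}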
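Regarding item 2a: whatever crawler you parallelize should call a per-host politeness throttle before each fetch, so the weaker ASP/Perl servers never see more than one hit from you every few seconds. A minimal sketch follows; the 10-second delay and the fixed-size host table are assumptions you'd tune, not requirements.

/* polite.c - per-host politeness throttle: at most one request per host
 * every DELAY seconds.  A sketch; DELAY and MAXHOSTS are tunable guesses. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define MAXHOSTS 1024
#define DELAY    10          /* seconds between hits to the same host */

static struct { char host[256]; time_t last; } seen[MAXHOSTS];
static int nseen = 0;

/* call before each fetch; blocks until it is polite to hit `host` */
void polite_delay(const char *host)
{
    int i;
    time_t now = time(NULL);

    for (i = 0; i < nseen; i++) {
        if (strcmp(seen[i].host, host) == 0) {
            time_t remain = seen[i].last + DELAY - now;
            if (remain > 0)
                sleep((unsigned)remain);      /* too soon; wait it out */
            seen[i].last = time(NULL);
            return;
        }
    }
    if (nseen < MAXHOSTS) {                   /* first time we see this host */
        strncpy(seen[nseen].host, host, sizeof(seen[nseen].host) - 1);
        seen[nseen].last = now;
        nseen++;
    }
}

int main(void)
{
    /* demo: the second hit to the same host waits DELAY seconds */
    polite_delay("www.example.com");
    printf("first fetch of www.example.com\n");
    polite_delay("www.example.com");
    printf("second fetch of www.example.com (after the delay)\n");
    return 0;
}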
We've run single-machine crawling applications that will easily fetch more than 1,000,000 pages/day. But in direct answer to your question, the best way for many machines to connect to one database is via the client-server 'C' library. A less efficient way is via ODBC or the Perl DBI driver.
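For the ODBC route, each crawling or developer machine opens its own connection to the single server process that owns the database, and never touches the database files directly. Here's a bare-bones connection sketch in C; the "webdb" DSN and the credentials are placeholders for whatever your driver manager is configured with.

/* connect.c - minimal ODBC client connection sketch.
 * Build (typical unixODBC setup): cc connect.c -o connect -lodbc
 * "webdb", "user", and "passwd" are placeholders. */
#include <stdio.h>
#include <sql.h>
#include <sqlext.h>

int main(void)
{
    SQLHENV env = SQL_NULL_HENV;
    SQLHDBC dbc = SQL_NULL_HDBC;
    SQLRETURN rc;

    /* allocate an environment handle and ask for ODBC 3 behavior */
    SQLAllocHandle(SQL_HANDLE_ENV, SQL_NULL_HANDLE, &env);
    SQLSetEnvAttr(env, SQL_ATTR_ODBC_VERSION, (SQLPOINTER)SQL_OV_ODBC3, 0);

    /* allocate a connection handle and connect to the server-side database */
    SQLAllocHandle(SQL_HANDLE_DBC, env, &dbc);
    rc = SQLConnect(dbc,
                    (SQLCHAR *)"webdb",  SQL_NTS,   /* DSN (placeholder) */
                    (SQLCHAR *)"user",   SQL_NTS,
                    (SQLCHAR *)"passwd", SQL_NTS);
    if (!SQL_SUCCEEDED(rc)) {
        fprintf(stderr, "connect failed\n");
        return 1;
    }
    printf("connected; each crawling machine opens its own connection\n");

    /* ... SQLExecDirect() inserts would go here ... */

    SQLDisconnect(dbc);
    SQLFreeHandle(SQL_HANDLE_DBC, dbc);
    SQLFreeHandle(SQL_HANDLE_ENV, env);
    return 0;
}

The same pattern applies to the 'C' library and Perl DBI: every box talks to the one server, and only the server touches the database files.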
BTW: no real database (Oracle, MSSQL, etc.) can be used in read/write mode from multiple machines via simple shared network drives.