How to load 30 million rows into a database

Post by mjacobson »

I have a text file that lists ~30 million URLs, one per line. I want to load it into a Texis database table that has id and url columns. I tried using <READLN> with an insert into the table, but after 14 hours I had only loaded 6 million URLs. Is there a faster way?
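
The core of the loop looks roughly like this (simplified; "mytable" is a stand-in for my table name, and the id comes from counter):

<READLN $line $file>
  <SQL "insert into mytable values(counter, $line)"></SQL>
</READLN>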

Post by mark »

Have you tried command-line timport?
A schema like this should do it:

database mydatabase
table mytable
recexpr >>[^\n]+\n
field Url varchar(80) 1 ''
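
Then run it from the command line with the schema file (and your data file) as arguments, something like this (file names here are placeholders for your own):

timport myschema.txt urls.txt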

Post by John »

Using the stand-alone timport executable should be faster. Do you have any indexes on the table? Also, we have noticed that if the table is on a Veritas file system, it will slow down as the table gets bigger. We do have some workarounds that should help, but they may currently require a program using the C API.

On a normal filesystem, using timport you should be able to load all the records in about 90 minutes; using the C API, maybe 30-45 minutes.
John Turnbull
Thunderstone Software

Post by John »

Just to confirm: on the <READLN>, did you use the ROW option, and NOVARS on the SQL insert? Those options help ensure that you don't keep excessive amounts of data in RAM, which can also slow things down.
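
That is, something like this (table and variable names are just placeholders):

<READLN ROW $line $file>
  <SQL NOVARS "insert into mytable values(counter, $line)"></SQL>
</READLN>
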
John Turnbull
Thunderstone Software

Post by mjacobson »

Thanks, I will try the timport option. I did use the ROW option on <READLN> and NOVARS on the SQL insert.

Post by mjacobson »

I just tried timport as outlined above and got the following error. It is similar to the error I got when I tried to use <READ $file>.

200 Reading schema ./timportSchema.txt
800 Inserting default "id" field into table
800 Statement: "insert into googleurls values(counter,?);"
200 Connecting to server
200 Opening database /webroot/google/db
200 Verifying schema
200 Loading data
800 File: "./urls.txt"
011 Cannot allocate memory In the function: openrec
204 0 records added
204 0.0 records/sec

Post by mjacobson »

Got it working by adding the following lines to the schema:

recdelim \x0a
multiple
trimspace
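
So the full working schema is (database and table names are from my setup):

database /webroot/google/db
table googleurls
recdelim \x0a
multiple
trimspace
recexpr >>[^\n]+\n
field Url varchar(80) 1 ''

My guess is that without multiple and a record delimiter, timport was treating the entire 30-million-line file as a single record, which would explain openrec running out of memory.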

Post by mjacobson »

I'm working on a Vortex script to automate this process. My current process uses a Perl script to extract the host portion of the URL; I would like to do that in the Vortex script using <TIMPORT>.

The problem I'm having is coming up with a recexpr that captures both the entire URL and the host portion of it. So if I have http://somesite.com/dir/file.html, I would want to get:

* somesite.com
* http://somesite.com/dir/file.html

Post by mark »

recexpr >>http=s?://=[^/\n]+[^\n]*\n
field Host varchar(20) 4 ''
field Url varchar(80) 1-5 ''
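
The field numbers refer to the subexpressions of the recexpr, which break down roughly like this:

1: http
2: s?
3: ://
4: [^/\n]+ (the host)
5: [^\n]* (the rest of the URL)
6: \n

Host picks up just subexpression 4, and Url 1-5 concatenates subexpressions 1 through 5, i.e. the whole URL without the trailing newline. For http://somesite.com/dir/file.html that gives Host = somesite.com and Url = http://somesite.com/dir/file.html. Keep the recdelim/multiple/trimspace lines from your working schema along with this.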

Post by mjacobson »

Thanks, Mark. I really need to learn Texis's regular expressions.