How to load 30 million rows into a database

Post by mjacobson »

I have a text file that lists ~30 million URLs, one per line. I want to load it into a Texis database table that has id and url columns. I tried using <READLN> with an insert into the table, but after 14 hours I had only loaded 6 million URLs. Is there a faster way?
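
The core of the loop looks roughly like this (simplified; "mytable" is a stand-in for my table name, and the id comes from counter):

<READLN $line $file>
  <SQL "insert into mytable values(counter, $line)"></SQL>
</READLN>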

Post by mark »

Have you tried command-line timport?
A schema like this should do it:

database mydatabase
table mytable
recexpr >>[^\n]+\n
field Url varchar(80) 1 ''
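
Then run it from the command line with the schema file (and your data file) as arguments, something like this (file names here are placeholders for your own):

timport myschema.txt urls.txt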

Post by John »

Using the stand-alone timport executable should be faster. Do you have any indexes on the table? Also, we have noticed that if the table is on a Veritas file system, it will slow down as the table gets bigger. We do have some workarounds that should help, but they may currently require a program using the C API.

On a normal filesystem, using timport you should be able to load all the records in about 90 minutes; using the C API, maybe 30-45 minutes.
John Turnbull
Thunderstone Software

Post by John »

Just to confirm: on the <READLN>, did you use the ROW option, and NOVARS on the SQL insert? Those options help ensure that you don't keep excessive amounts of data in RAM, which can also slow things down.
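
That is, something like this (table and variable names are just placeholders):

<READLN ROW $line $file>
  <SQL NOVARS "insert into mytable values(counter, $line)"></SQL>
</READLN>
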
John Turnbull
Thunderstone Software

Post by mjacobson »

Thanks, I will try the timport option. I did use the ROW option on <READLN> and NOVARS on the SQL insert.

Post by mjacobson »

I just tried timport as outlined above and got the following error. It is similar to the error I got when I tried to use <READ $file>.

200 Reading schema ./timportSchema.txt
800 Inserting default "id" field into table
800 Statement: "insert into googleurls values(counter,?);"
200 Connecting to server
200 Opening database /webroot/google/db
200 Verifying schema
200 Loading data
800 File: "./urls.txt"
011 Cannot allocate memory In the function: openrec
204 0 records added
204 0.0 records/sec

Post by mjacobson »

Got it working by adding the following lines to the schema:

recdelim \x0a
multiple
trimspace
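
So the full working schema is (database and table names are from my setup):

database /webroot/google/db
table googleurls
recdelim \x0a
multiple
trimspace
recexpr >>[^\n]+\n
field Url varchar(80) 1 ''

My guess is that without multiple and a record delimiter, timport was treating the entire 30-million-line file as a single record, which would explain openrec running out of memory.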

Post by mjacobson »

I'm working on a Vortex script to automate this process. My current process uses a Perl script to extract the host portion of the URL; I would like to do that in the Vortex script using <TIMPORT>.

The problem I'm having is coming up with a recexpr that captures both the entire URL and the host portion of it. So if I have http://somesite.com/dir/file.html, I would want to get:

* somesite.com
* http://somesite.com/dir/file.html

Post by mark »

recexpr >>http=s?://=[^/\n]+[^\n]*\n
field Host varchar(20) 4 ''
field Url varchar(80) 1-5 ''
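
The field numbers refer to the subexpressions of the recexpr, which break down roughly like this:

1: http
2: s?
3: ://
4: [^/\n]+ (the host)
5: [^\n]* (the rest of the URL)
6: \n

Host picks up just subexpression 4, and Url 1-5 concatenates subexpressions 1 through 5, i.e. the whole URL without the trailing newline. For http://somesite.com/dir/file.html that gives Host = somesite.com and Url = http://somesite.com/dir/file.html. Keep the recdelim/multiple/trimspace lines from your working schema along with this.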

Post by mjacobson »

Thanks, Mark. I really need to learn Texis's regular expressions.