timport csv files with dirty data/invalid line

barry.marcus
Posts: 288
Joined: Thu Nov 16, 2006 1:05 pm

Post by barry.marcus »

I've got a timport conundrum. We are given CSV files that we need to import into one of our database tables. The problem is that the first line of the file does not really belong and needs to be ignored. It is not the header line of the file; that is the second line. If that first line were not there, I would have no issue whatsoever: I have a schema that imports the data perfectly.

My question is this: is there a way to specify in the schema file, or perhaps some other way, that a specific line of the data is to be ignored because it is NOT part of the data (almost like a "comment" in the data)? It is NOT an option for the user to edit the file and delete the offending line. I suppose I could write code that reads the file line by line and writes each line except the first to a separate file, and then use the resulting file as the data. But I'm wondering if there is a more elegant solution.

Here is a typical file:

"Thomson Innovation Patent Export, 2014-02-24 18:40:17 -0600 "
Publication Number,Assignee - Original,Title (English),Publication Date
"WO2014028280A2","NTHDEGREE TECHNOLOGIES WORLDWIDE INC.","CONDUCTIVE, METALLIC AND SEMICONDUCTOR INK COMPOSITIONS","2014-02-20"
"WO2014015074A1","NTHDEGREE TECHNOLOGIES WORLDWIDE INC.","DIATOMACEOUS ENERGY STORAGE DEVICES","2014-01-23"
"WO2014014758A2","NTHDEGREE TECHNOLOGIES WORLDWIDE INC.","IONIC GEL SEPARATION LAYER FOR ENERGY STORAGE DEVICES AND PRINTABLE COMPOSITIONS THEREFOR","2014-01-23"
"WO2014004712A1","NTHDEGREE TECHNOLOGIES WORLDWIDE INC.","SYSTEMS AND METHODS FOR FABRICATION OF NANOSTRUCTURES","2014-01-03"
"WO2013147986A1","NTHDEGREE TECHNOLOGIES WORLDWIDE INC.","LED LAMP USING BLUE AND CYAN LEDS AND A PHOSPHOR","2013-10-03"
"WO2013126180A8","NTHDEGREE TECHNOLOGIES WORLDWIDE INC.","ACTIVE LED MODULE","2014-02-20"

If it helps, the first line of every file always begins with

"Thomson Innovation Patent Export, ..."

but may have a different date and time stamp. The schema file that I would use if that first line was not there is very simple, and is built on the fly by our code. It looks like this:

database D:/database1
table tempTable530d1b353
csv

field PATN_WKU varchar 1
field OWNR_NAM varchar 2
field PATN_TTL varchar 3
field PATN_ISD varchar 4

To my thinking it would be an easier problem if the extraneous line weren't the first line... As it is, it's not even the header of the file. Any suggestions how to deal with files like this?

Thanks.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

If you are using command-line timport, you'll have to preprocess the file to remove the line. If you are using Vortex, you can skip the unwanted line inside the timport loop.
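
For the command-line route, the preprocessing step could look something like the rough sketch below. It is written in Python purely for illustration and is not part of Texis or timport. The function name `strip_export_banner` and both file paths are hypothetical, and the sketch assumes the unwanted line is always the first line and always begins with the "Thomson Innovation Patent Export," text described in the question.

```python
import shutil

# Assumption (from the question): the unwanted first line always starts with this text.
JUNK_PREFIX = '"Thomson Innovation Patent Export,'


def strip_export_banner(src_path, dst_path, prefix=JUNK_PREFIX):
    """Copy src_path to dst_path, dropping the first line if it starts with prefix.

    Returns True if a line was dropped, False if the file was copied unchanged.
    """
    # newline="" keeps the original line endings untouched in the copy.
    with open(src_path, "r", newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        first = src.readline()
        dropped = first.startswith(prefix)
        if not dropped:
            # Not the banner line: keep it so no real data is lost.
            dst.write(first)
        # Stream the rest of the file without loading it all into memory.
        shutil.copyfileobj(src, dst)
    return dropped
```

The cleaned copy would then be handed to timport with the existing schema unchanged. Checking the prefix, rather than always discarding line 1, means a file that arrives without the banner passes through intact.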
barry.marcus
Posts: 288
Joined: Thu Nov 16, 2006 1:05 pm

Post by barry.marcus »

Geez... I was not even aware of the Vortex version of TIMPORT!?! I was focused on calling TIMPORT with EXEC, but the Vortex version is exactly what I need here. I guess I should "read the manual"!

Thanks!