I have some problems parsing a huge CSV file into a MySQL database.
The CSV file looks like this:
ref1 data1 data2 data3...
ref1 data4 data5 data6...
ref2 data1 data2 data3 data4 data5..
ref2 data12 data13 data14
ref2 data21 data22...
.
.
.
The CSV file has about 1 million lines; it is about 7 MB zipped, or about 150 MB unzipped.
My job is to parse the data from the CSV into MySQL, but only the lines whose reference matches. Another problem is that the multiple CSV lines for one reference must be combined into a single row in MySQL.
I tried to do this with csv.reader and a for loop over each reference, but it is ultra slow.
with con:
    cur.execute("SELECT ref FROM users")
    user = cur.fetchall()

for i in range(len(user)):
    with open('hugecsv.csv', mode='rb') as f:
        reader = csv.reader(f, delimiter=';')
        for row in reader:
            if str(user[i][0]) == row[0]:
                writer.writerow(row)
So I have all the references I would like to parse in my list user. What is the fastest way to parse the file?
Please help!
The first obvious bottleneck is that you are reopening and scanning the whole CSV file for each user in your database. Doing a single pass over the CSV would be faster:
from csv import reader

# faster lookup on users
cur.execute("select ref from users")
users = set(row[0] for row in cur.fetchall())

with open("your/file.CSV") as f:
    r = reader(f, delimiter=';')
    for row in r:
        if row[0] in users:
            do_something_with(row)
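The question also needs all the lines for one reference collapsed into a single MySQL row. One way to do that inside the same single pass is to group matching lines by reference and insert once at the end. A minimal sketch, reusing the users set from above and assuming a made-up imported(ref, data) target table:

import csv

grouped = {}  # ref -> all data fields collected from that reference's lines
with open('hugecsv.csv', 'rb') as f:
    for row in csv.reader(f, delimiter=';'):
        if row[0] in users:
            grouped.setdefault(row[0], []).extend(row[1:])

# one combined row per reference
cur.executemany(
    "INSERT INTO imported (ref, data) VALUES (%s, %s)",
    [(ref, ';'.join(fields)) for ref, fields in grouped.items()],
)
con.commit()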
Use:
LOAD DATA INFILE 'EF_PerechenSkollekciyami.csv' INTO TABLE `TABLE_NAME` FIELDS TERMINATED BY ';';
This is a built-in MySQL statement.
I don't recommend using tabs to separate the columns; change them with sed to ; or some other character first. But you can try with tabs too.
You haven't included all your logic. If you just want to import everything into a single table,
cur.execute("LOAD DATA INFILE 'path_to_file.csv' INTO TABLE my_table;")
MySQL does it directly. You can't get any faster than that.
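If the rows still need to be filtered by reference, one option is to LOAD DATA into a staging table first and then filter with a join in SQL. A rough sketch, where staging and the data columns are made-up names:

cur.execute("LOAD DATA INFILE 'path_to_file.csv' INTO TABLE staging FIELDS TERMINATED BY ';';")
cur.execute("""
    INSERT INTO my_table (ref, data1, data2, data3)
    SELECT s.ref, s.data1, s.data2, s.data3
    FROM staging AS s
    JOIN users AS u ON u.ref = s.ref
""")
con.commit()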
I started creating a database with PostgreSQL and I am currently facing a problem when I want to copy the data from my CSV file into my database.
Here is my code:
connexion = psycopg2.connect(dbname="db_test", user="postgres", password="passepasse")
connexion.autocommit = True
cursor = connexion.cursor()

cursor.execute("""CREATE TABLE vocabulary(
    fname integer PRIMARY KEY,
    label text,
    mids text
)""")

with open(r'C:\mypathtocsvfile.csv', 'r') as f:
    next(f)  # skip the header row
    cursor.copy_from(f, 'vocabulary', sep=',')

connexion.commit()
I allocated 4 columns to store my CSV data, but the problem is that the data in my CSV is stored like this:
fname,labels,mids,split
64760,"Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music","/m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf",train
16399,"Electric_guitar,Guitar,Plucked_string_instrument,Musical_instrument,Music","/m/02sgy,/m/0342h,/m/0fx80y,/m/04szw,/m/04rlf",train
...
There are commas inside my label and mids columns, which is why I get the following error:
BadCopyFileFormat: ERROR: additional data after the last expected column
Which alternative should I use to copy the data from this CSV file?
Thanks.
If the file is small, the easiest way is to open it in LibreOffice and save it with a new separator.
I usually use ^.
If the file is large, write a script to replace ," and "," with ^" and "^", respectively.
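If you go the script route, a safer option than raw string replacement is to let the csv module rewrite the file with ^ as the separator, since it already understands the quoted fields. A minimal sketch with made-up file names (it assumes ^ never appears in the data):

import csv

with open('input.csv', 'r', newline='') as src, open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst, delimiter='^')
    for row in csv.reader(src):
        writer.writerow(row)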
COPY supports csv as a format, which already does what you want. But to access it via psycopg2, I think you will need to use copy_expert rather than copy_from.
cursor.copy_expert('copy vocabulary from stdin with csv', f)
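A minimal sketch of how that call could replace copy_from in the code above, assuming vocabulary has one column per field in the CSV; the HEADER option makes COPY skip the header row, so the next(f) line is no longer needed:

with open(r'C:\mypathtocsvfile.csv', 'r') as f:
    cursor.copy_expert("COPY vocabulary FROM STDIN WITH (FORMAT csv, HEADER true)", f)
connexion.commit()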
I'm using the following code to query a SQL Server DB and store the returned results in a CSV file.
import pypyodbc
import csv

connection = pypyodbc.connect('Driver={SQL Server};'
                              'Server=localhost;'
                              'Database=testdb;')
cursor = connection.cursor()

SQLCommand = ("""SELECT A AS First,
                        B AS Second
                 FROM AB""")
cursor.execute(SQLCommand)
results = cursor.fetchall()

myfile = open('test.csv', 'w')
wr = csv.writer(myfile, dialect='excel')
wr.writerow(results)

connection.close()
The SQL command is just a sample; my actual query contains a lot more columns.
With this code, all of the returned rows end up on a single line of the CSV.
But I want each record on its own row in the CSV, and I also want the column headers to show as well.
I'm guessing the formatting needs to be done within the csv.writer part of the code, but I can't seem to figure it out. Can someone please guide me?
You are seeing that strange output because fetchall returns multiple rows of output but you are using writerow instead of writerows to dump them out. You need to use writerow to output a single line of column headings, followed by writerows to output the actual results:
with open(r'C:\Users\Gord\Desktop\test.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile)
    wr.writerow([x[0] for x in cursor.description])  # column headings
    wr.writerows(cursor.fetchall())

cursor.close()
connection.close()
In my use case, I have a CSV stored as a string and I want to load it into a MySQL table. Is there a better way than saving the string to a file, using LOAD DATA INFILE, and then deleting the file? I found this answer, but it's for JDBC and I haven't found a Python equivalent.
Yes, what you describe is very possible! Say, for example, that your CSV file has three columns:
import csv
import MySQLdb

conn = MySQLdb.connect('your_connection_string')
cur = conn.cursor()

with open('yourfile.csv', 'rb') as fin:
    for row in csv.reader(fin):
        cur.execute('insert into yourtable (col1, col2, col3) values (%s, %s, %s)', row)

conn.commit()
cur.close()
conn.close()
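For the original question (the CSV is already in a string), a temporary file can be avoided entirely by parsing the string in memory and using executemany. A sketch where csv_text, the table name, and the column names are placeholders:

import csv
import io

import MySQLdb

conn = MySQLdb.connect('your_connection_string')
cur = conn.cursor()

rows = list(csv.reader(io.StringIO(csv_text)))  # csv_text is your in-memory CSV string
cur.executemany(
    'INSERT INTO yourtable (col1, col2, col3) VALUES (%s, %s, %s)',
    rows
)
conn.commit()
cur.close()
conn.close()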
When my flask application starts up, it needs to add a bunch of data to Postgres via SQLalchemy. I'm looking for a good way to do this. The data is in TSV format, and I already have a SQLalchemy db.model schema for it. Right now:
for datafile in datafiles:
    with open(datafile, 'rb') as file:
        # reader = csv.reader(file, delimiter='\t')
        reader = csv.DictReader(file, delimiter='\t', fieldnames=[])
        OCRs = # somehow efficiently convert to list of dicts...
        db.engine.execute(OpenChromatinRegion.__table__.insert(), OCRs)
Is there a better, more direct way? Otherwise, what is the best way of generating OCRs?
The solution suggested here seems clunky.
import csv
from collections import namedtuple

fh = csv.reader(open(your_file, "r"), delimiter='\t')
headers = next(fh)

Row = namedtuple('Row', headers)
OCRs = [Row._make(i)._asdict() for i in fh]

db.engine.execute(OpenChromatinRegion.__table__.insert(), OCRs)
# plus your loop for multiple files and exception handling of course =)
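A possibly simpler variant of the same idea: csv.DictReader already yields one dict-like object per row, so the namedtuple step can be skipped. A sketch assuming the first line of each TSV holds column names matching the model's columns:

import csv

OCRs = []
for datafile in datafiles:
    with open(datafile) as f:
        # DictReader takes the first line as the field names when none are given
        OCRs.extend(dict(row) for row in csv.DictReader(f, delimiter='\t'))

db.engine.execute(OpenChromatinRegion.__table__.insert(), OCRs)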
I have a CSV file which has over a million rows and I am trying to parse this file and insert the rows into the DB.
with open(file, "rb") as csvfile:
    re = csv.DictReader(csvfile)
    for row in re:
        # insert row['column_name'] into DB
For CSV files below 2 MB this works well, but anything bigger than that ends up eating my memory. It is probably because I store the DictReader's contents in a list called re and it is not able to loop over such a huge list. I definitely need to access the CSV file by its column names, which is why I chose DictReader, since it easily provides column-level access to my CSV files. Can anyone tell me why this is happening and how it can be avoided?
The DictReader does not load the whole file into memory but reads it in chunks, as explained in this answer suggested by DhruvPathak.
But depending on your database engine, the actual write to disk may only happen at commit. That means that the database (and not the csv reader) keeps all the data in memory and eventually exhausts it.
So you should try to commit every n records, with n typically between 10 and 1000 depending on the size of your lines and the available memory.
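A minimal sketch of that batching, assuming a DB-API connection/cursor pair and a made-up table and column name:

import csv

BATCH_SIZE = 1000  # tune to your row size and available memory

with open(file, "rb") as csvfile:
    reader = csv.DictReader(csvfile)
    for i, row in enumerate(reader, start=1):
        cur.execute(
            'INSERT INTO my_table (column_name) VALUES (%s)',
            (row['column_name'],)
        )
        if i % BATCH_SIZE == 0:
            conn.commit()  # flush this batch so rows don't pile up in memory
    conn.commit()          # commit the remainder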
If you don't need entire columns at once, you can simply read the file line by line, as you would a plain text file, and parse each row yourself. The exact parsing depends on your data format, but you could do something like:
delimiter = ','

with open(filename, 'r') as fil:
    headers = next(fil)
    headers = headers.strip().split(delimiter)
    dic_headers = {hdr: headers.index(hdr) for hdr in headers}

    for line in fil:
        row = line.strip().split(delimiter)
        ## do something with row[dic_headers['column_name']]
This is a very simple example, but it can be made more elaborate. For example, it does not work if your data itself contains the , delimiter inside a field.
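If that is a concern, the csv module can be used in the same streaming fashion (it also reads the file row by row, without loading everything). A sketch with the same made-up column name:

import csv

with open(filename, 'r') as fil:
    reader = csv.reader(fil, delimiter=',')
    headers = next(reader)
    dic_headers = {hdr: i for i, hdr in enumerate(headers)}
    for row in reader:
        # quoted fields containing the delimiter are parsed correctly here
        value = row[dic_headers['column_name']]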