How to insert CSV file data into MySQL using Python efficiently?

I have a CSV input file with approx. 4 million records.
The insert has been running for more than 2 hours and still has not finished.
The database is still empty.
Any suggestions on how to actually insert the values (using INSERT INTO) faster, e.g. by breaking the insert into chunks?
I'm pretty new to Python.
CSV file example:
43293,cancelled,1,0.0,
1049007,cancelled,1,0.0,
438255,live,1,0.0,classA
1007255,xpto,1,0.0,
Python script:
def csv_to_DB(xing_csv_input, db_opts):
    print("Inserting csv file {} to database {}".format(xing_csv_input, db_opts['host']))
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    try:
        with open(xing_csv_input, newline='') as csvfile:
            csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
            for row in csv_data:
                insert_str = "INSERT INTO table_x (ID, desc, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
                cur.execute(insert_str, row)
        conn.commit()
    finally:
        conn.close()
UPDATE:
Thanks for all the input.
As suggested, I tried a counter to insert in batches of 100, with a smaller CSV data set (1,000 lines).
The problem now is that only 100 records are inserted, even though the counter reaches 100 ten times.
Code change:
def csv_to_DB(xing_csv_input, db_opts):
    print("Inserting csv file {} to database {}".format(xing_csv_input, db_opts['host']))
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    count = 0
    try:
        with open(xing_csv_input, newline='') as csvfile:
            csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
            for row in csv_data:
                count += 1
                print(count)
                insert_str = "INSERT INTO table_x (ID, desc, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
                if count >= 100:
                    cur.execute(insert_str, row)
                    print("count100")
                    conn.commit()
                    count = 0
                if not row:
                    cur.execute(insert_str, row)
                    conn.commit()
    finally:
        conn.close()

There are many ways to optimise this insert. Here are some ideas:
1. You have a for loop over the entire dataset. You can do a commit() every 100 rows or so.
2. You can insert many rows in a single INSERT statement.
3. You can combine the two and make a multi-row INSERT every 100 rows of your CSV.
4. If Python is not a requirement, you can do it directly with MySQL's LOAD DATA INFILE. (If you must do it using Python, you can still prepare that statement in Python and avoid looping through the file manually.)
Examples:
For numbers 2 and 3 in the list, the code will have the following structure:
def csv_to_DB(xing_csv_input, db_opts):
    print("Inserting csv file {} to database {}".format(xing_csv_input, db_opts['host']))
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    try:
        with open(xing_csv_input, newline='') as csvfile:
            csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
            to_insert = []
            # `desc` is a reserved word in MySQL, so it has to be backtick-quoted
            insert_str = "INSERT INTO table_x (ID, `desc`, desc_version, val, class) VALUES "
            template = '(%s, %s, %s, %s, %s)'
            for row in csv_data:
                to_insert.append(tuple(row))
                if len(to_insert) == 100:
                    # value tuples are joined with commas, and the values are
                    # passed separately so the driver escapes them properly
                    query = insert_str + ','.join([template] * len(to_insert))
                    cur.execute(query, [v for r in to_insert for v in r])
                    to_insert = []
                    conn.commit()
            if to_insert:  # flush the last, partial batch
                query = insert_str + ','.join([template] * len(to_insert))
                cur.execute(query, [v for r in to_insert for v in r])
                conn.commit()
    finally:
        conn.close()
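For number 4 in the list, here is a minimal sketch of LOAD DATA LOCAL INFILE driven from pymysql (an assumption on my part: the server must allow local_infile, and the column list reuses the question's schema, with `desc` backtick-quoted):

import pymysql

def csv_to_DB_load_data(xing_csv_input, db_opts):
    # local_infile must be enabled on the client connection and on the server
    conn = pymysql.connect(local_infile=True, **db_opts)
    try:
        with conn.cursor() as cur:
            # pymysql interpolates %s client-side, producing a quoted file path
            cur.execute(
                "LOAD DATA LOCAL INFILE %s INTO TABLE table_x "
                "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
                "LINES TERMINATED BY '\\n' "
                "(ID, `desc`, desc_version, val, class)",
                (xing_csv_input,)
            )
        conn.commit()
    finally:
        conn.close()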

Here, try this snippet using executemany() and let me know if it works:
with open(xing_csv_input, newline='') as csvfile:
    # pass the reader straight to executemany() instead of materialising
    # all 4 million rows in memory first
    csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
    query = "INSERT INTO table_x (ID, `desc`, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
    try:
        cur.executemany(query, csv_data)
        conn.commit()
    except Exception:
        conn.rollback()
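As far as I know, pymysql's executemany() recognises a plain INSERT ... VALUES statement and batches it into multi-row inserts internally, so this effectively gives you idea 3 from the list above without building the query by hand.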

Related

How to properly insert data into a postgres table?

I'm using the following to try and insert a record into a PostgreSQL database table, but it's not working. I don't get any errors, but there are no records in the table. I do commit at the end of my code as well. What could it possibly be? It's frustrating, since I am not getting any syntax errors in my code.
Excuse my lack of knowledge, as I am fairly new to Python.
Here is the code:
import csv
import psycopg2
import psycopg2.extras
import config

connection = psycopg2.connect(host=config.DB_HOST, database=config.DB_NAME,
                              user=config.DB_USER, password=config.DB_PASS,
                              port=config.DB_PORT)
cursor = connection.cursor(cursor_factory=psycopg2.extras.DictCursor)
cursor.execute("select * from stock where is_etf = TRUE")
etfs = cursor.fetchall()
dates = ['2022-08-23', '2022-08-24']
for current_date in dates:
    for etf in etfs:
        print(etf['symbol'])
        with open(f"new/{current_date}/{etf['symbol']}.csv") as f:
            reader = csv.reader(f)
            next(reader)
            for row in reader:
                if len(row) == 8:
                    ticker = row[3]
                    if ticker:
                        shares = row[5]
                        weight = row[7]
                        cursor.execute("""
                            SELECT * FROM stock WHERE symbol = %s
                        """, (ticker,))
                        stock = cursor.fetchone()
                        if stock:
                            cursor.execute("""
                                INSERT INTO etf_holding (etf_id, holding_id, dt, shares, weight)
                                VALUES (%s, %s, %s, %s, %s)
                            """, (etf['id'], stock['id'], current_date, shares, weight))
connection.commit()
(Screenshots followed, showing the script's output and the still-empty etf_holding table.)
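No answer is recorded for this question here. As a hypothetical first debugging step (the ids and values below are illustrative, not from the original post), one could insert a single known row and print cursor.rowcount, to separate "the INSERT never runs" from "the commit is lost":

import psycopg2
import psycopg2.extras
import config  # the same config module the question assumes

connection = psycopg2.connect(host=config.DB_HOST, database=config.DB_NAME,
                              user=config.DB_USER, password=config.DB_PASS,
                              port=config.DB_PORT)
cursor = connection.cursor()
# illustrative ids/values -- substitute ones that exist in your stock table
cursor.execute("""
    INSERT INTO etf_holding (etf_id, holding_id, dt, shares, weight)
    VALUES (%s, %s, %s, %s, %s)
""", (1, 1, '2022-08-23', 100, 0.5))
print("rowcount:", cursor.rowcount)  # psycopg2 reports 1 if the row was staged
connection.commit()
connection.close()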

Python code for inserting a CSV file into MySQL table not working as expected

I am new to Python and having some basic problems. I am trying to insert into an existing MySQL table (holding) from a CSV file (test.csv), skip the header row, and then return the number of rows inserted. Both have identical column headings.
The code below is not inserting the data from the CSV file, but just inserting the values literally (which are my column headings; I thought I needed to declare these in the values). So this method is wrong, as it only adds the values manually and does not insert data from the CSV file. Can someone tell me what I'm doing wrong, please?
This is the code:
import mysql.connector
import csv

try:
    mydb = mysql.connector.connect(
        host=localhost, port=5306,
        user="XX",
        passwd="XX",
        database="python_test",
        auth_plugin='mysql_native_password'
    )
    # This skips the first row of the CSV file.
    with open(r'c:\TELS\\Uploaded\test.csv') as f:
        reader = csv.reader(f)
        #next(reader) # skip header
        data = [r for r in reader]
        data.pop(0)  # remove header
    mycur = mydb.cursor()
    query = "INSERT INTO hold (`Name`, `Address`, `Age`, `DOB) VALUES (%s, %s, %s, %S)"
    values = (`Name`, `Address`, `Age`, `DOB`)
    mycur.execute(query, values)
    mydb.commit()
    print(mycur.rowcount, "records inserted.")
    # close the connection to the database.
    mycur.close()
The content of your values variable is what will be inserted, so you need to map it to your data variable.
You could either do it one row at a time...
query = "INSERT INTO hold (`Name`, `Address`, `Age`, `DOB`) VALUES (%s, %s, %s, %s)"
# also notice how I converted your last %S into %s
for row in data:
    mycur.execute(query, row)
mydb.commit()
mycur.close()
Or use the executemany() function:
query = "INSERT INTO hold (`Name`, `Address`, `Age`, `DOB`) VALUES (%s, %s, %s, %s)"
mycur.executemany(query, data)
mydb.commit()
print(mycur.rowcount, "records inserted.")
mycur.close()
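(For what it's worth, mysql.connector's executemany() also rewrites a plain INSERT ... VALUES statement into a multi-row insert behind the scenes, so the second variant is usually much faster than the row-by-row loop.)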

Import Data from .csv file into mysql using python

I am trying to import data from two columns of a .csv file (time hh:mm, float). I created a database and a table in MySQL.
import mysql.connector
import csv

mydb = mysql.connector.connect(host='127.0.0.1',
                               user='xxx',
                               passwd='xxx',
                               db='pv_datenbank')
cursor = mydb.cursor()

# get rid of the BOM at the beginning of the .csv file
s = open('Sonneneinstrahlung.csv', mode='r', encoding='utf-8-sig').read()
open('Sonneneinstrahlung.csv', mode='w', encoding='utf-8').write(s)
print(s)

with open('Sonneneinstrahlung.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    sql = """INSERT INTO einstrahlung ('Uhrzeit', 'Einstrahlungsdaten') VALUES (%s, %s)"""
    for row in csv_reader:
        print(row)
        print(cursor.rowcount, "was inserted.")
        cursor.executemany(sql, csv_reader)
        #cursor.execute(sql, row, multi=True)
    mydb.commit()
mydb.close()
If I run the program with executemany(), the result is the following:
['01:00', '1']
-1 was inserted.
and after this I get the error: Not all parameters were used.
When I try the same thing with the execute() operator, no error is shown, but the data is not inserted into the table of my database.
executemany() takes a statement and a sequence of parameter sets. In your code, the for loop has already consumed csv_reader by the time executemany() sees it, and the column names must not be wrapped in single quotes.
Try this:
with open('Sonneneinstrahlung.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    sql = """INSERT INTO einstrahlung (Uhrzeit, Einstrahlungsdaten) VALUES (%s, %s)"""
    cursor.executemany(sql, csv_reader)
    mydb.commit()

How to insert rows of information into different rows in a mysql database?

I don't know how to insert a list of info into a MySQL database.
I'm trying to insert rows of data into a database, but it simply inserts the last row three times. The list is named "t" and it is a tuple.
Data:
11/04/19,17:33,33.4,55
11/04/19,17:34,22.9,57
11/04/19,17:35,11.9,81
Code:
import mysql.connector

sql = mysql.connector.connect(
    host=' ',
    user=' ',
    password=' ',
    db=" "
)
cursor = sql.cursor()
f = open("C:\Cumulus\data\Apr19log.txt", "r")
while True:
    s = f.readline()
    list = []
    if (s != ""):
        t = s.split(',')
        for item in t:
            list.append(item)
    else:
        break
sqllist = """INSERT INTO station_fenelon (variable, date,
             time, outside_temp, outside_humidity)
             VALUES (%s, %s, %s, %s, %s)"""
record = [(1, t[0], t[1], t[2], t[3]),
          (2, t[0], t[1], t[2], t[3]),
          (3, t[0], t[1], t[2], t[3])]
cursor.executemany(sqllist, record)
sql.commit()
I want to create three rows in the database with this list of information, but it only shows the last row of information in the database.
The INSERT block sits outside your while loop, so by the time it runs, t only holds the last line of the file. Try this, which builds one parameter tuple per line:
import mysql.connector

sql = mysql.connector.connect(host='', user='', password='', db='')
cursor = sql.cursor()
f = open("C:\Cumulus\data\Apr19log.txt", "r")
st = [i.strip().split(',') for i in f.readlines()]
sqllist = """INSERT INTO station_fenelon (variable, date, time, outside_temp, outside_humidity) VALUES (%s, %s, %s, %s, %s)"""
record = [(i + 1, j[0], j[1], j[2], j[3]) for i, j in enumerate(st)]
cursor.executemany(sqllist, record)
sql.commit()

How can I insert rows into mysql table faster using python?

I am trying to find a faster method to insert data into my table. The table should end up with over 100 million rows; I have been running my code for nearly 24 hours and the table currently has only 9 million rows entered, still in progress.
My code currently reads 300 CSV files at a time and stores the data in a list. The list gets filtered for duplicate rows, then I use a for loop to place each entry in the list as a tuple and update the table one tuple at a time. This method just takes too long; is there a way for me to bulk insert all rows? I have tried looking online, but the methods I am reading about do not seem to help in my situation.
Many thanks,
David
import glob
import os
import csv
import mysql.connector

# MYSQL logon
mydb = mysql.connector.connect(
    host="localhost",
    user="David",
    passwd="Sword",
    database="twitch"
)
mycursor = mydb.cursor()

# list for stream data file names
streamData = []

# This function obtains the file name list from a folder; this is to open files
# in other functions
def getFileNames():
    global streamData
    global topGames
    # the folders to be scanned
    #os.chdir("D://UG_Project_Data")
    os.chdir("E://UG_Project_Data")
    # obtains stream data file names
    for file in glob.glob("*streamD*"):
        streamData.append(file)
    return

# List to store stream data from csv files
sData = []

# Function to read all streamData csv files and store data in a list
def streamsToList():
    global streamData
    global sData
    # Same as gamesToList
    index = len(streamData)
    num = 0
    theFile = streamData[0]
    for x in range(index):
        if (num == 301):
            filterStreams(sData)
            num = 0
            sData.clear()
        try:
            theFile = streamData[x]
            timestamp = theFile[0:15]
            dateTime = timestamp[4:8]+"-"+timestamp[2:4]+"-"+timestamp[0:2]+"T"+timestamp[9:11]+":"+timestamp[11:13]+":"+timestamp[13:15]+"Z"
            with open(theFile, encoding="utf-8-sig") as f:
                reader = csv.reader(f)
                next(reader)  # skip header
                for row in reader:
                    if (row != []):
                        col1 = row[0]
                        col2 = row[1]
                        col3 = row[2]
                        col4 = row[3]
                        col5 = row[4]
                        col6 = row[5]
                        col7 = row[6]
                        col8 = row[7]
                        col9 = row[8]
                        col10 = row[9]
                        col11 = row[10]
                        col12 = row[11]
                        col13 = dateTime
                        temp = col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13
                        sData.append(temp)
        except:
            print("Problem file:")
            print(theFile)
            print(num)
        num += 1
    return

def filterStreams(self):
    sData = self
    dataSet = set(tuple(x) for x in sData)
    sData = [list(x) for x in dataSet]
    return createStreamDB(sData)

# Function to create a table of stream data
def createStreamDB(self):
    global mydb
    global mycursor
    sData = self
    tupleList = ()
    for x in sData:
        tupleList = tuple(x)
        sql = "INSERT INTO streams (id, user_id, user_name, game_id, community_ids, type, title, viewer_count, started_at, language, thumbnail_url, tag_ids, time_stamp) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
        val = tupleList
        try:
            mycursor.execute(sql, val)
            mydb.commit()
        except:
            test = 1
    return

if __name__ == '__main__':
    getFileNames()
    streamsToList()
    filterStreams(sData)
If some of your rows succeed but some fail, do you want your database to be left in a half-updated state? If not, try to commit outside of the loop, like this:
for x in sData:
    tupleList = tuple(x)
    sql = "INSERT INTO streams (id, user_id, user_name, game_id, community_ids, type, title, viewer_count, started_at, language, thumbnail_url, tag_ids, time_stamp) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    val = tupleList
    try:
        mycursor.execute(sql, val)
    except:
        # do something
        pass
try:
    mydb.commit()
except:
    test = 1
And if you don't need Python at all, try loading your CSV file into MySQL directly:
LOAD DATA INFILE "/home/your_data.csv"
INTO TABLE CSVImport
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
Also, to make this clearer: I've described three ways to insert the data (LOAD DATA above being one of them), for the case where you insist on using Python because you need to process the data first.
Bad way
In [18]: def inside_loop():
    ...:     start = time.time()
    ...:     for i in range(10000):
    ...:         mycursor = mydb.cursor()
    ...:         sql = "insert into t1(name, age) values(%s, %s)"
    ...:         try:
    ...:             mycursor.execute(sql, ("frank", 27))
    ...:             mydb.commit()
    ...:         except:
    ...:             print("Failure..")
    ...:     print("cost :{}".format(time.time() - start))
    ...:
Time cost:
In [19]: inside_loop()
cost :5.92155909538269
Okay way
In [9]: def outside_loop():
   ...:     start = time.time()
   ...:     for i in range(10000):
   ...:         mycursor = mydb.cursor()
   ...:         sql = "insert into t1(name, age) values(%s, %s)"
   ...:         try:
   ...:             mycursor.execute(sql, ["frank", 27])
   ...:         except:
   ...:             print("do something ..")
   ...:
   ...:     try:
   ...:         mydb.commit()
   ...:     except:
   ...:         print("Failure..")
   ...:     print("cost :{}".format(time.time() - start))
Time cost:
In [10]: outside_loop()
cost :0.9959311485290527
There may still be better ways (e.g., use pandas to process your data, or try redesigning your table ...).
You might like my presentation Load Data Fast! in which I compared different methods of inserting bulk data, and did benchmarks to see which was the fastest method.
Inserting one row at a time, committing a transaction for each row, is about the worst way you can do it.
Using LOAD DATA INFILE is fastest by a wide margin. Although there are some configuration changes you need to make on a default MySQL instance to allow it to work. Read the MySQL documentation about options secure_file_priv and local_infile.
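As a quick sketch of that check (assuming mysql.connector and the question's connection details), you can read both settings before attempting a bulk load:

import mysql.connector

mydb = mysql.connector.connect(host="localhost", user="David",
                               passwd="Sword", database="twitch")
cur = mydb.cursor()
# Server-side LOAD DATA INFILE may be restricted to this directory, if set:
cur.execute("SHOW VARIABLES LIKE 'secure_file_priv'")
print(cur.fetchone())
# LOAD DATA LOCAL INFILE additionally requires local_infile to be ON:
cur.execute("SHOW VARIABLES LIKE 'local_infile'")
print(cur.fetchone())
mydb.close()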
Even without using LOAD DATA INFILE, you can do much better. You can insert multiple rows per INSERT, and you can execute multiple INSERT statements per transaction.
I wouldn't try to INSERT the whole 100 million rows in a single transaction, though. My habit is to commit about once every 10,000 rows.
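A minimal sketch of that habit, assuming mysql.connector and the streams table from the question (the batch size is illustrative):

def bulk_insert(mydb, rows, batch=10000):
    # rows: a list of 13-tuples matching the streams columns
    sql = ("INSERT INTO streams (id, user_id, user_name, game_id, community_ids, "
           "type, title, viewer_count, started_at, language, thumbnail_url, "
           "tag_ids, time_stamp) VALUES (" + ", ".join(["%s"] * 13) + ")")
    cur = mydb.cursor()
    for i in range(0, len(rows), batch):
        # executemany() lets the connector batch these into multi-row INSERTs;
        # one commit per slice keeps each transaction to roughly 10,000 rows
        cur.executemany(sql, rows[i:i + batch])
        mydb.commit()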
