How can I insert rows into a MySQL table faster using Python?

I am trying to find a faster method to insert data into my table. The table should end up with over 100 million rows; I have been running my code for nearly 24 hours, and the table currently has only 9 million rows entered and is still in progress.
My code currently reads 300 csv files at a time and stores the data in a list. The list gets filtered for duplicate rows, then I use a for loop to place each entry in the list as a tuple and update the table one tuple at a time. This method just takes too long. Is there a way for me to bulk insert all the rows? I have tried looking online, but the methods I am reading about do not seem to help in my situation.
Many thanks,
David
import glob
import os
import csv
import mysql.connector

# MySQL logon
mydb = mysql.connector.connect(
    host="localhost",
    user="David",
    passwd="Sword",
    database="twitch"
)
mycursor = mydb.cursor()

# list for stream data file names
streamData = []

# This function obtains the file name list from a folder, so the files
# can be opened in other functions
def getFileNames():
    global streamData
    global topGames
    # the folders to be scanned
    #os.chdir("D://UG_Project_Data")
    os.chdir("E://UG_Project_Data")
    # obtains stream data file names
    for file in glob.glob("*streamD*"):
        streamData.append(file)
    return

# List to store stream data from csv files
sData = []

# Function to read all streamData csv files and store data in a list
def streamsToList():
    global streamData
    global sData
    # Same as gamesToList
    index = len(streamData)
    num = 0
    theFile = streamData[0]
    for x in range(index):
        if (num == 301):
            filterStreams(sData)
            num = 0
            sData.clear()
        try:
            theFile = streamData[x]
            timestamp = theFile[0:15]
            dateTime = timestamp[4:8]+"-"+timestamp[2:4]+"-"+timestamp[0:2]+"T"+timestamp[9:11]+":"+timestamp[11:13]+":"+timestamp[13:15]+"Z"
            with open(theFile, encoding="utf-8-sig") as f:
                reader = csv.reader(f)
                next(reader)  # skip header
                for row in reader:
                    if (row != []):
                        col1 = row[0]
                        col2 = row[1]
                        col3 = row[2]
                        col4 = row[3]
                        col5 = row[4]
                        col6 = row[5]
                        col7 = row[6]
                        col8 = row[7]
                        col9 = row[8]
                        col10 = row[9]
                        col11 = row[10]
                        col12 = row[11]
                        col13 = dateTime
                        temp = col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13
                        sData.append(temp)
        except:
            print("Problem file:")
            print(theFile)
            print(num)
        num += 1
    return

def filterStreams(self):
    sData = self
    dataSet = set(tuple(x) for x in sData)
    sData = [list(x) for x in dataSet]
    return createStreamDB(sData)

# Function to create a table of stream data
def createStreamDB(self):
    global mydb
    global mycursor
    sData = self
    tupleList = ()
    for x in sData:
        tupleList = tuple(x)
        sql = "INSERT INTO streams (id, user_id, user_name, game_id, community_ids, type, title, viewer_count, started_at, language, thumbnail_url, tag_ids, time_stamp) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
        val = tupleList
        try:
            mycursor.execute(sql, val)
            mydb.commit()
        except:
            test = 1
    return

if __name__ == '__main__':
    getFileNames()
    streamsToList()
    filterStreams(sData)

If some of your rows succeed but some fail, do you want your database to be left in a corrupt state? If not, try to commit outside the loop, like this:
for x in sData:
    tupleList = tuple(x)
    sql = "INSERT INTO streams (id, user_id, user_name, game_id, community_ids, type, title, viewer_count, started_at, language, thumbnail_url, tag_ids, time_stamp) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    val = tupleList
    try:
        mycursor.execute(sql, val)
    except:
        # do something
        pass
try:
    mydb.commit()
except:
    test = 1
And if you don't care about that, try to load your csv file into MySQL directly:
LOAD DATA INFILE "/home/your_data.csv"
INTO TABLE CSVImport
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
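If you want to drive that from Python, here is a minimal sketch using mysql.connector, assuming local_infile is enabled on the server and reusing the connection settings from the question:
import mysql.connector

# allow_local_infile lets the client send a local file to the server;
# the server must also have local_infile enabled for this to work
mydb = mysql.connector.connect(
    host="localhost",
    user="David",
    passwd="Sword",
    database="twitch",
    allow_local_infile=True,
)
mycursor = mydb.cursor()
mycursor.execute("""
    LOAD DATA LOCAL INFILE '/home/your_data.csv'
    INTO TABLE CSVImport
    COLUMNS TERMINATED BY ','
    OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
""")
mydb.commit()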
Also, to make this clearer, I've outlined three ways to insert this data, if you insist on using Python because you have some processing to do on your data.
Bad way
In [18]: def inside_loop():
    ...:     start = time.time()
    ...:     for i in range(10000):
    ...:         mycursor = mydb.cursor()
    ...:         sql = "insert into t1(name, age)values(%s, %s)"
    ...:         try:
    ...:             mycursor.execute(sql, ("frank", 27))
    ...:             mydb.commit()
    ...:         except:
    ...:             print("Failure..")
    ...:     print("cost :{}".format(time.time() - start))
    ...:
Time cost:
In [19]: inside_loop()
cost :5.92155909538269
Okay way
In [9]: def outside_loop():
   ...:     start = time.time()
   ...:     for i in range(10000):
   ...:         mycursor = mydb.cursor()
   ...:         sql = "insert into t1(name, age)values(%s, %s)"
   ...:         try:
   ...:             mycursor.execute(sql, ["frank", 27])
   ...:         except:
   ...:             print("do something ..")
   ...:
   ...:     try:
   ...:         mydb.commit()
   ...:     except:
   ...:         print("Failure..")
   ...:     print("cost :{}".format(time.time() - start))
Time cost:
In [10]: outside_loop()
cost :0.9959311485290527
Maybe there are still better ways, even a best way (e.g., use pandas to process your data, and try redesigning your table...).
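For example, here is a sketch of one faster variant in the same spirit, assuming the same mydb connection and t1(name, age) table as the benchmarks above: hand the driver all the rows at once with executemany() and commit a single transaction.
import time

def executemany_batch():
    start = time.time()
    mycursor = mydb.cursor()
    sql = "insert into t1(name, age) values (%s, %s)"
    # the driver batches these rows into multi-row INSERT statements
    mycursor.executemany(sql, [("frank", 27)] * 10000)
    mydb.commit()
    print("cost :{}".format(time.time() - start))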

You might like my presentation Load Data Fast! in which I compared different methods of inserting bulk data, and did benchmarks to see which was the fastest method.
Inserting one row at a time, committing a transaction for each row, is about the worst way you can do it.
Using LOAD DATA INFILE is fastest by a wide margin. Although there are some configuration changes you need to make on a default MySQL instance to allow it to work. Read the MySQL documentation about options secure_file_priv and local_infile.
Even without using LOAD DATA INFILE, you can do much better. You can insert multiple rows per INSERT, and you can execute multiple INSERT statements per transaction.
I wouldn't try to INSERT the whole 100 million rows in a single transaction, though. My habit is to commit about once every 10,000 rows.
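As an illustration of that batching pattern against the question's streams table (a sketch, assuming the same mysql.connector connection and the deduplicated sData list from the question):
sql = ("INSERT INTO streams (id, user_id, user_name, game_id, community_ids, "
       "type, title, viewer_count, started_at, language, thumbnail_url, "
       "tag_ids, time_stamp) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)")
batch = []
for row in sData:  # sData: list of 13-field tuples
    batch.append(tuple(row))
    if len(batch) == 10000:  # commit roughly every 10,000 rows
        mycursor.executemany(sql, batch)
        mydb.commit()
        batch = []
if batch:  # flush the final partial batch
    mycursor.executemany(sql, batch)
    mydb.commit()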

Related

Python DataFrame to MYSQL: TypeError: not enough arguments for format string

Been playing with this for 14 hours (I am a beginner).
Data is pulled from one database table, used to search Yahoo for all the data on that ticker, and then it's "meant" to upload it.
I originally had it as a pandas DataFrame but got an "ambiguous" error, so I have now put it back as a list []. New error. I rack my brains :( However, it does work if I leave it blank.
from __future__ import print_function

import yfinance as yf
import pandas as pd
import datetime
import warnings
import MySQLdb as mdb
import requests
import numpy as np

# Obtain a database connection to the MySQL instance
con = mdb.connect("localhost", "sec_user", "", "securities_master")

def obtain_list_of_db_tickers():
    """
    Obtains a list of the ticker symbols in the database.
    """
    with con:
        cur = con.cursor()
        cur.execute("SELECT id, ticker FROM symbol")
        data = cur.fetchall()
        print(data)
        return [(d[0], d[1]) for d in data]

def get_daily_historic_data_yahoo(ticker):
    blow = yf.download(ticker)
    data = []
    data.append(yf.download(ticker).reset_index())
    return data

def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    '''
    Takes a list of tuples of daily data and adds it to the MySQL database.
    Appends the vendor ID and symbol ID to the data.
    daily_data: List of tuples of the OHLC data (with adj_close and volume)
    '''
    # Create the time now
    now = datetime.datetime.utcnow()
    df = pd.DataFrame(data=daily_data[0])
    df.insert(0, 'data_vendor_id', data_vendor_id)
    df.insert(1, 'symbol_id', symbol_id)
    df.insert(3, 'created_date', now)
    df.insert(4, 'last_updated_date', now)
    daily_data = []
    daily_data.append(df)
    #df = daily_data
    # Amend the data to include the vendor ID and symbol ID
    # Connect to the MySQL instance
    db_host = 'localhost'
    db_user = ''
    db_pass = ''
    db_name = 'securities_master'
    con = mdb.connect("localhost", "sec_user", "", "securities_master"
                      # host=db_host, user=db_user, passwd=db_pass, db=db_name
                      )
    try:
        mdb.connect
    # If connection is not successful
    except:
        print("Can't connect to database")
        return 0
    # If Connection Is Successful
    print("Connected")
    final_str = """INSERT INTO daily_price (data_vendor_id, symbol_id, price_date, created_date,
        last_updated_date, open_price, high_price, low_price, close_price, volume, adj_close_price) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
    with con:
        cur = con.cursor()
        cur.executemany(final_str, daily_data)
        con.commit()

if __name__ == "__main__":
    # This ignores the warnings regarding Data Truncation
    # from the Yahoo precision to Decimal(19,4) datatypes
    warnings.filterwarnings('ignore')
    # Loop over the tickers and insert the daily historical
    # data into the database
    tickers = obtain_list_of_db_tickers()
    lentickers = len(tickers)
    for i, t in enumerate(tickers):
        print(
            "Adding data for %s: %s out of %s" %
            (t[1], i+1, lentickers)
        )
        yf_data = get_daily_historic_data_yahoo(t[1])
        insert_daily_data_into_db('1', t[0], yf_data)
    print("Successfully added Yahoo Finance pricing data to DB.")
Errors
Traceback (most recent call last):
  File "/home/quant/price_retrieval.py", line 106, in <module>
    insert_daily_data_into_db('1', t[0], yf_data)
  File "/home/quant/price_retrieval.py", line 88, in insert_daily_data_into_db
    cur.executemany(final_str, daily_data)
  File "/home/quant/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 230, in executemany
    return self._do_execute_many(
  File "/home/quant/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 255, in _do_execute_many
    v = values % escape(next(args), conn)
TypeError: not enough arguments for format string
I'm no data scientist, so there's probably a more elegant way to fix it directly with pandas. But the way I usually work with MySQL (and really any SQL driver) is to give it lists of Python tuples.
If you parse each row of the pandas DataFrame with for row in df.itertuples(): and craft each tuple carefully, making sure the types match the SQL table, all should work ;)
Example:
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    '''
    Takes a list of tuples of daily data and adds it to the MySQL database.
    Appends the vendor ID and symbol ID to the data.
    daily_data: List of tuples of the OHLC data (with adj_close and volume)
    '''
    # Create the time now
    now = datetime.datetime.utcnow()
    df = pd.DataFrame(data=daily_data[0])
    daily_data = []
    created_date = now
    last_updated_date = now
    for row in df.itertuples():
        _index = row[0]  # discard
        date = row[1]
        open = row[2]
        high = row[3]
        low = row[4]
        close = row[5]
        adj_close_price = row[6]
        volume = row[7]
        daily_data.append((int(data_vendor_id), symbol_id, date, created_date, last_updated_date, open, high, low, close, volume, adj_close_price))
    # Connect to the MySQL instance
    con = mdb.connect(host="localhost", user="user", password="yourpassword",
                      db="yourdbname", port=3306)
    final_str = """
        INSERT INTO daily_price (data_vendor_id, symbol_id, price_date, created_date,
        last_updated_date, open_price, high_price, low_price, close_price, volume, adj_close_price)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    """
    with con:
        cur = con.cursor()
        cur.executemany(final_str, daily_data)
        con.commit()
I've tried not to tamper with your existing code too much. Just enough to make it work.
I think what was happening was that you were technically passing it a list of pandas DataFrames, with only a single DataFrame in the list. Instead, what you want is a list of tuples with 11 fields to unpack per tuple.
Maybe you meant to pass the DataFrame directly, i.e. not contained inside a list, but I still don't think that would be right, because 1) there's an "Index" column in the DataFrame which would give erroneous results, and 2) you'd need to call some methods on the DataFrame to retrieve only the values (not the column headers) and transform it into the correct list of tuples. It's probably very doable, but I will leave that to you to find out.
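For what it's worth, here is a sketch of that DataFrame route, assuming df's columns are already in the same order as the columns named in final_str:
# itertuples(index=False, name=None) yields plain value tuples,
# which is exactly the shape executemany() expects
records = list(df.itertuples(index=False, name=None))
cur.executemany(final_str, records)
con.commit()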
I am also assuming your table schema is something like this:
CREATE TABLE IF NOT EXISTS daily_price (
    data_vendor_id INT,
    symbol_id INT,
    price_date DATETIME,
    created_date DATETIME,
    last_updated_date TIMESTAMP,
    open_price VARCHAR(256),
    high_price VARCHAR(256),
    low_price VARCHAR(256),
    close_price VARCHAR(256),
    volume INT,
    adj_close_price VARCHAR(256)
);

How to insert CSV file data into MySQL using Python efficiently?

I have a CSV input file with approx. 4 million records.
The insert has been running for over 2 hours and still has not finished.
The database is still empty.
Any suggestions on how to actually insert the values (using INSERT INTO), and faster, like breaking the insert into chunks?
I'm pretty new to Python.
csv file example
43293,cancelled,1,0.0,
1049007,cancelled,1,0.0,
438255,live,1,0.0,classA
1007255,xpto,1,0.0,
python script
def csv_to_DB(xing_csv_input, db_opts):
    print("Inserting csv file {} to database {}".format(xing_csv_input, db_opts['host']))
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    try:
        with open(xing_csv_input, newline='') as csvfile:
            csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
            for row in csv_data:
                insert_str = "INSERT INTO table_x (ID, desc, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
                cur.execute(insert_str, row)
                conn.commit()
    finally:
        conn.close()
UPDATE:
Thanks for all the input.
As suggested, I tried a counter to insert in batches of 100, and a smaller csv data set (1000 lines).
The problem now is that only 100 records are inserted, although the counter passes 10 x 100 several times.
code change:
def csv_to_DB(xing_csv_input, db_opts):
    print("Inserting csv file {} to database {}".format(xing_csv_input, db_opts['host']))
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    count = 0
    try:
        with open(xing_csv_input, newline='') as csvfile:
            csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
            for row in csv_data:
                count += 1
                print(count)
                insert_str = "INSERT INTO table_x (ID, desc, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
                if count >= 100:
                    cur.execute(insert_str, row)
                    print("count100")
                    conn.commit()
                    count = 0
                if not row:
                    cur.execute(insert_str, row)
                    conn.commit()
    finally:
        conn.close()
There are many ways to optimise this insert. Here are some ideas:
1. You have a for loop over the entire dataset. You can do a commit() every 100 rows or so.
2. You can insert many rows into one INSERT statement.
3. You can combine the two and make a multi-row INSERT every 100 rows of your CSV.
If Python is not a requirement, you can do it directly using MySQL, as explained here. (If you must do it using Python, you can still prepare that statement in Python and avoid looping through the file manually.)
Examples:
For number 2 in the list, the code will have the following structure:
def csv_to_DB(xing_csv_input, db_opts):
    print("Inserting csv file {} to database {}".format(xing_csv_input, db_opts['host']))
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    try:
        with open(xing_csv_input, newline='') as csvfile:
            csv_data = csv.reader(csvfile, delimiter=',', quotechar='"')
            to_insert = []
            insert_str = "INSERT INTO table_x (ID, desc, desc_version, val, class) VALUES "
            template = '(%s, %s, %s, %s, %s)'
            count = 0
            for row in csv_data:
                count += 1
                to_insert.append(tuple(row))
                if count % 100 == 0:
                    # join one placeholder group per row, comma-separated,
                    # and let the driver escape the flattened values
                    query = insert_str + ',\n'.join([template] * len(to_insert))
                    cur.execute(query, [val for r in to_insert for val in r])
                    to_insert = []
                    conn.commit()
            if to_insert:  # flush the final partial batch
                query = insert_str + ',\n'.join([template] * len(to_insert))
                cur.execute(query, [val for r in to_insert for val in r])
                conn.commit()
    finally:
        conn.close()
Here, try this snippet using executemany() and let me know if it works:
with open(xing_csv_input, newline='') as csvfile:
    csv_data = list(csv.reader(csvfile, delimiter=',', quotechar='"'))
query = "INSERT INTO table_x (ID, desc, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
try:
    cur.executemany(query, csv_data)
    conn.commit()
except:
    conn.rollback()
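With roughly 4 million records you may not want to hold the whole file in memory, though. Here is a sketch of the same executemany() idea applied in chunks; the chunk size of 10,000 is an arbitrary choice:
import csv
import itertools
import pymysql

def csv_to_db_chunked(xing_csv_input, db_opts, chunk_size=10000):
    conn = pymysql.connect(**db_opts)
    cur = conn.cursor()
    query = "INSERT INTO table_x (ID, desc, desc_version, val, class) VALUES (%s, %s, %s, %s, %s)"
    try:
        with open(xing_csv_input, newline='') as csvfile:
            reader = csv.reader(csvfile, delimiter=',', quotechar='"')
            while True:
                # islice pulls the next chunk_size rows without loading the rest
                chunk = list(itertools.islice(reader, chunk_size))
                if not chunk:
                    break
                cur.executemany(query, chunk)
                conn.commit()  # one commit per chunk keeps transactions small
    finally:
        conn.close()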

How to insert rows of information into different rows in a mysql database?

I don't know how to insert a list of info into a MySQL database.
I'm trying to insert rows of data into a database, but it is simply inserting the last row three times. The list is named "t" and it is a tuple.
Data:
11/04/19,17:33,33.4,55
11/04/19,17:34,22.9,57
11/04/19,17:35,11.9,81
Code:
import mysql.connector

sql = mysql.connector.connect(
    host=' ',
    user=' ',
    password=' ',
    db=" "
)
cursor = sql.cursor()
f = open("C:\Cumulus\data\Apr19log.txt", "r")
while True:
    s = f.readline()
    list = []
    if (s != ""):
        t = s.split(',')
        for item in t:
            list.append(item)
    else:
        break
sqllist = """INSERT INTO station_fenelon (variable, date,
             time, outside_temp, outside_humidity)
             VALUES (%s, %s, %s, %s, %s)"""
record = [(1, t[0], t[1], t[2], t[3]),
          (2, t[0], t[1], t[2], t[3]),
          (3, t[0], t[1], t[2], t[3])]
cursor.executemany(sqllist, record)
sql.commit()
I want to create three rows in the database with this list of information, but it is only showing the last row of information in the database.
Try this. (Your version builds record from t only after the while loop finishes, so it holds three copies of the last line read.)
import mysql.connector
sql = mysql.connector.connect(host='',user='',password='',db='')
cursor = sql.cursor()
f = open(r"C:\Cumulus\data\Apr19log.txt", "r")
st=[i.strip().split(',') for i in f.readlines()]
sqllist = """INSERT INTO station_fenelon (variable, date, time, outside_temp, outside_humidity) VALUES (%s, %s, %s, %s, %s)"""
record = [(i+1, j[0], j[1], j[2], j[3]) for i, j in enumerate(st)]
cursor.executemany(sqllist, record)
sql.commit()

Reading changes with dbf library in python

I am trying to make a program that picks up changes in a dbf file and then uploads them. I have got it to read the dbf file and upload it to a MySQL database, but it's a 50 minute upload. I have tried to get it to only upload fields that have been changed. The problem I have is that it seems I need to close and re-open the dbf file; if someone makes a change while it's doing this, it doesn't notice there's been a change.
Is there a better/right way of doing this:
import time
import dbf
import MySQLdb
import os

source_path = r"\\path\to\file"
file_name = "\\test.Dbf"
print "Found Source DBF"
source = dbf.Table(source_path + file_name)
source.open()
print "Opened DBF"
updated = list(source)
print "Copied Source"
db = MySQLdb.connect(host="myHost.com", port=3306, user="user", passwd="pass", db="database")
cur = db.cursor()
print "Connected to database"
try:
    cur.execute("DROP TABLE IF EXISTS dbftomysql")
except:
    db.rollback()
print "Dropped old table"
sql = """CREATE TABLE table(
    col1 VARCHAR(200) NOT NULL,
    col2 VARCHAR(200),
    col3 VARCHAR(200),
    col4 NUMERIC(15,2),
    col5 VARCHAR(200) )"""
cur.execute(sql)
print "Created new table"
for i, s in zip(source, updated):
    query = """INSERT table SET col1 = %s, col2 = %s, col3 = %s, col4 = %s, col5 = %s"""
    values = (i["col1"], i["col2"], i["col3"], i["col4"], i["col5"])
    cur.execute(query, values)
    db.commit()
    print i["col1"], i["col2"], i["col3"], i["col4"], i["col5"]
print "First Upload Completed"
while True:
    for i, s in zip(source, updated):
        if i["col1"] != s["col1"]:
            print i["col1"] + " col1 Updated"
            query = """UPDATE table SET col1 = %s WHERE col1 = %s"""
            values = (i["col1"], s["col1"])
            try:
                cur.execute(query, values)
                db.commit()
            except:
                db.rollback()
                print "No connection to database"
        if i["col2"] != s["col2"]:
            print i["col2"] + " col2 Updated for " + i["col1"]
            query = """UPDATE table SET col2 = %s WHERE col1 = %s OR col1 = %s"""
            values = (i["col2"], i["col1"], s["col1"])
            try:
                cur.execute(query, values)
                db.commit()
            except:
                db.rollback()
                print "No connection to database"
        # etc
    updated = list(source)
    source.close()
    source.open()
    time.sleep(0.2)
The dbf library will only fetch a record from the dbf file if it doesn't already exist in memory; when you do
updated = list(source)
you are effectively freezing all the rows, because updated is a list of records (not a list of lists or a list of tuples); this means that when you later try to compare source and updated, you are comparing the same data.
In order to make updated a separate entity from source, try
updated = [tuple(row) for row in source]
which will give you a list of tuples, or
updated = [scatter(row, dict) for row in source]
which will give you a list of dicts, which is what you need for your field comparison code further down.
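A small sketch of how that snapshot could then drive the change detection, assuming the same table path as the question:
import dbf

source = dbf.Table(r"\\path\to\file\test.Dbf")
source.open()
# freeze the current field values as plain tuples
snapshot = [tuple(row) for row in source]
# ... later, after the file may have changed ...
for row, old in zip(source, snapshot):
    if tuple(row) != old:
        print("changed record: " + repr(old))
snapshot = [tuple(row) for row in source]  # re-freeze for the next pass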

MySQL not accepting executemany() INSERT, running Python from Excel (datanitro)

I HAVE ADDED MY OWN ANSWER THAT WORKS BUT OPEN TO IMPROVEMENTS
After seeing a project at DataNitro, I took on getting a connection to MySQL (they use SQLite), and I was able to import a small test table into Excel from MySQL.
Inserting new updated data from the Excel sheet was the next task, and so far I can get one row to work like so...
import MySQLdb

db = MySQLdb.connect("xxx", "xxx", "xxx", "xxx")
c = db.cursor()
c.execute("""INSERT INTO users (id, username, password, userid, fname, lname)
             VALUES (%s, %s, %s, %s, %s, %s);""",
          (Cell(5,1).value, Cell(5,2).value, Cell(5,3).value, Cell(5,4).value, Cell(5,5).value, Cell(5,6).value,))
db.commit()
db.close()
...but attempts at multiple rows fail. I suspect issues while traversing rows in Excel. Here is what I have so far...
import MySQLdb

db = MySQLdb.connect(host="xxx.com", user="xxx", passwd="xxx", db="xxx")
c = db.cursor()
c.execute("select * from users")
usersss = c.fetchall()
updates = []
row = 2  # starting row
while True:
    data = tuple(CellRange((row,1),(row,6)).value)
    if data[0]:
        if data not in usersss:  # new record
            updates.append(data)
        row += 1
    else:  # end of table
        break
c.executemany("""INSERT INTO users (id, username, password, userid, fname, lname) VALUES (%s, %s, %s, %s, %s, %s)""", updates)
db.commit()
db.close()
...as of now, I don't get any errors, but my new line is not added (id 3). This is what my table looks like in Excel...
The database holds the same structure, minus id 3. There has to be a simpler way to traverse the rows and pull the unique content for INSERT, but after 6 hours of trying different things (and 2 new Python books) I am going to ask for help.
If I run either...
print '[%s]' % ', '.join(map(str, updates))
or
print updates
my result is
[]
So this is likely not passing any data to MySQL in the first place.
LATEST UPDATE AND WORKING SCRIPT
Not exactly what I want, but this has worked for me...
c = db.cursor()
row = 2
while Cell(row,1).value != None:
    c.execute("""INSERT IGNORE INTO users (id, username, password, userid, fname, lname)
                 VALUES (%s, %s, %s, %s, %s, %s);""",
              (CellRange((row,1),(row,6)).value))
    row = row + 1
Here is your problem:
while True:
    if data[0]:
        ...
    else:
        break
Your first id is 0, so in the first iteration of the loop data[0] will be falsy and your loop will exit without ever adding any data. What you probably meant is:
while True:
    if data[0] is not None:
        ...
    else:
        break
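Put together, a sketch of the corrected scan, assuming DataNitro's Cell/CellRange helpers and the users table from the question:
updates = []
row = 2
while True:
    data = tuple(CellRange((row,1),(row,6)).value)
    if data[0] is not None:  # an id of 0 no longer ends the scan
        if data not in usersss:  # only queue rows missing from the table
            updates.append(data)
        row += 1
    else:  # a blank cell marks the end of the table
        break
if updates:
    c.executemany("""INSERT INTO users (id, username, password, userid, fname, lname)
                     VALUES (%s, %s, %s, %s, %s, %s)""", updates)
    db.commit()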
I ended up finding a solution that gets me an INSERT for new rows and allows for an UPDATE of those that have changed. Not exactly a Python selection based on a single query, but it will do.
import MySQLdb

db = MySQLdb.connect("xxx", "xxx", "xxx", "xxx")
c = db.cursor()
row = 2
while Cell(row,1).value is not None:
    c.execute("INSERT INTO users (id, username, password, \
               userid, fname, lname) \
               VALUES (%s, %s, %s, %s, %s, %s) \
               ON DUPLICATE KEY UPDATE \
               id=VALUES(id), username=VALUES(username), password=VALUES(password), \
               userid=VALUES(userid), fname=VALUES(fname), lname=VALUES(lname);",
              (CellRange((row,1),(row,6)).value))
    row = row + 1
db.commit()
db.close()
