Write Large Pandas DataFrames to SQL Server database - python

I have 74 relatively large Pandas DataFrames (about 34,600 rows and 8 columns) that I am trying to insert into a SQL Server database as quickly as possible. After doing some research, I learned that the good ole pandas.to_sql function is not suited to such large inserts into SQL Server, which was the initial approach that I took (very slow - almost an hour for the application to complete, versus about 4 minutes when using a MySQL database).
This article, and many other StackOverflow posts have been helpful in pointing me in the right direction, however I've hit a roadblock:
I am trying to use SQLAlchemy's Core rather than the ORM for reasons explained in the link above. So, I am converting the dataframe to a dictionary, using pandas.to_dict and then doing an execute() and insert():
self._session_factory.engine.execute(
    TimeSeriesResultValues.__table__.insert(),
    data)
# 'data' is a list of dictionaries.
The problem is that the insert is not getting any values -- they appear as a bunch of empty parentheses and I get this error:
(pyodbc.IntegrityError) ('23000', "[23000] [FreeTDS][SQL Server]Cannot
insert the value NULL into the column...
There are values in the list of dictionaries that I passed in, so I can't figure out why the values aren't showing up.
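For reference, the list-of-dicts shape that the Core insert above expects usually comes from pandas' 'records' orient; a minimal sketch, assuming df is one of the DataFrames in question and its column names match the table's columns:

data = df.to_dict(orient='records')
# Each row becomes {'column_name': value, ...}. The default orient ('dict')
# instead returns {column -> {index -> value}}, which is not the per-row
# shape the executemany-style insert above needs.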
EDIT:
Here's the example I'm going off of:
def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in range(n)]
    )
    print("SQLAlchemy Core: Total time for " + str(n) +
          " records " + str(time.time() - t0) + " secs")

I've got some sad news for you: SQLAlchemy doesn't implement bulk imports for SQL Server, so it ends up issuing the same slow individual INSERT statements that to_sql does. I would say your best bet is to try to script something up using the bcp command-line tool. Here is a script that I've used in the past, but no guarantees:
from subprocess import check_output, call
import pandas as pd
import numpy as np
import os
pad = 0.1
tablename = 'sandbox.max.pybcp_test'
overwrite = True
raise_exception = True
server = 'P01'
trusted_connection = True
username = None
password = None
delimiter = '|'
df = pd.read_csv('D:/inputdata.csv', encoding='latin', error_bad_lines=False)
def get_column_def_sql(col):
    if col.dtype == object:
        width = col.str.len().max() * (1 + pad)
        return '[{}] varchar({})'.format(col.name, int(width))
    elif np.issubdtype(col.dtype, float):
        return '[{}] float'.format(col.name)
    elif np.issubdtype(col.dtype, int):
        return '[{}] int'.format(col.name)
    else:
        if raise_exception:
            raise NotImplementedError('data type {} not implemented'.format(col.dtype))
        else:
            print('Warning: cast column {} as varchar; data type {} not implemented'.format(col.name, col.dtype))
            width = col.str.len().max() * (1 + pad)
            return '[{}] varchar({})'.format(col.name, int(width))
def create_table(df, tablename, server, trusted_connection, username, password, pad):
    if trusted_connection:
        login_string = '-E'
    else:
        login_string = '-U {} -P {}'.format(username, password)
    col_defs = []
    for col in df:
        col_defs += [get_column_def_sql(df[col])]
    query_string = 'CREATE TABLE {}\n({})\nGO\nQUIT'.format(tablename, ',\n'.join(col_defs))
    if overwrite:
        query_string = "IF OBJECT_ID('{}', 'U') IS NOT NULL DROP TABLE {};".format(tablename, tablename) + query_string
    query_file = 'c:\\pybcp_tempqueryfile.sql'
    with open(query_file, 'w') as f:
        f.write(query_string)
    o = call('sqlcmd -S {} {} -i {}'.format(server, login_string, query_file), shell=True)
    if o != 0:
        raise BaseException("Failed to create table")
    # o = call('del {}'.format(query_file), shell=True)
def call_bcp(df, tablename):
    if trusted_connection:
        login_string = '-T'
    else:
        login_string = '-U {} -P {}'.format(username, password)
    temp_file = 'c:\\pybcp_tempqueryfile.csv'
    # Strip the delimiter out of the data and re-encode object columns as latin so SQL Server can read them
    df.loc[:, df.dtypes == object] = df.loc[:, df.dtypes == object].apply(
        lambda col: col.str.replace(delimiter, '').str.encode('latin'))
    df.to_csv(temp_file, index=False, sep=delimiter, errors='ignore')
    o = call('bcp {} in {} -S {} {} -t^| -r\\n -c'.format(tablename, temp_file, server, login_string), shell=True)
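A hypothetical driver for the helpers above, just to show the intended call order (the table, server and file names all come from the hard-coded module-level variables, so adjust them to your environment):

# Sketch only: build the target table from the DataFrame's dtypes, then bulk-load it with bcp.
create_table(df, tablename, server, trusted_connection, username, password, pad)
call_bcp(df, tablename)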

This has recently been updated as of SQLAlchemy 1.3.0, in case anyone else needs to know. It should make your dataframe.to_sql call much faster.
https://docs.sqlalchemy.org/en/latest/changelog/migration_13.html#support-for-pyodbc-fast-executemany
engine = create_engine(
    "mssql+pyodbc://scott:tiger@mssql2017:1433/test?driver=ODBC+Driver+13+for+SQL+Server",
    fast_executemany=True)
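As a rough sketch of how that pairs with pandas.to_sql (the connection string, driver version, table name and df below are placeholders, not the asker's real ones):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials and ODBC driver version -- swap in your own.
engine = create_engine(
    "mssql+pyodbc://user:password@myserver:1433/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True)

# With fast_executemany=True, pyodbc batches the parameter sets instead of
# sending one round trip per row, which is what made to_sql slow before.
df.to_sql("my_table", engine, if_exists="append", index=False)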

Related

Why does pymysql not insert records into the table?

I am pretty new to Python development. I have a long Python script that "clones" a database and adds additional stored functions and procedures. Clone means copying only the schema of the DB. These steps work fine.
My question is about the pymysql insert execution:
I have to copy some table contents into the new DB. I don't get any SQL error. If I debug or print the created INSERT INTO command, it is correct (I've tested it in an SQL editor). The insert execution seems correct because the result contains the exact row count... but all rows are missing from the destination table in the destination DB...
(Of course the DB_* variables have been defined!)
import pymysql
import datetime

liveDbConn = pymysql.connect(DB_HOST, DB_USER, DB_PWD, LIVE_DB_NAME)
testDbConn = pymysql.connect(DB_HOST, DB_USER, DB_PWD, TEST_DB_NAME)
tablesForCopy = ['role', 'permission']
for table in tablesForCopy:
    with liveDbConn.cursor() as liveCursor:
        # Get name of columns
        liveCursor.execute("DESCRIBE `%s`;" % (table))
        columns = ''
        for column in liveCursor.fetchall():
            columns += '`' + column[0] + '`,'
        columns = columns.strip(',')
        # Get and convert values
        values = ''
        liveCursor.execute("SELECT * FROM `%s`;" % (table))
        for result in liveCursor.fetchall():
            data = []
            for item in result:
                if type(item) == type(None):
                    data.append('NULL')
                elif type(item) == type('str'):
                    data.append("'" + item + "'")
                elif type(item) == type(datetime.datetime.now()):
                    data.append("'" + str(item) + "'")
                else:  # for numeric values
                    data.append(str(item))
            v = '(' + ', '.join(data) + ')'
            values += v + ', '
        values = values.strip(', ')
        print("### table: %s" % (table))
        testDbCursor = testDbConn.cursor()
        testDbCursor.execute("INSERT INTO `" + TEST_DB_NAME + "`.`" + table + "` (" + columns + ") VALUES " + values + ";")
        print("Result: {}".format(testDbCursor._result.message))
liveDbConn.close()
testDbConn.close()
Result is:
### table: role
Result: b"'Records: 16 Duplicates: 0 Warnings: 0"
### table: permission
Result: b'(Records: 222 Duplicates: 0 Warnings: 0'
What am I doing wrong? Thanks!
You have 2 main issues here:
You don't use conn.commit() (which here would be either liveDbConn.commit() or testDbConn.commit()). Changes to the database will not be reflected without committing them. Note that all changes need committing, but a SELECT, for example, does not.
Your query is open to SQL Injection. This is a serious problem.
Table names cannot be parameterized, so there's not much we can do about that, but you'll want to parameterize your values. I've made multiple corrections to the code in relation to type checking as well as parameterization.
for table in tablesForCopy:
    with liveDbConn.cursor() as liveCursor:
        liveCursor.execute("SELECT * FROM `%s`;" % (table))
        name_of_columns = [item[0] for item in liveCursor.description]
        insert_list = []
        for result in liveCursor.fetchall():
            data = []
            for item in result:
                if item is None:  # test identity against the None singleton
                    data.append(None)  # the driver sends None as SQL NULL
                elif isinstance(item, str):  # Use isinstance to check type
                    data.append(item)
                elif isinstance(item, datetime.datetime):
                    data.append(item.strftime('%Y-%m-%d %H:%M:%S'))
                else:  # for numeric values
                    data.append(str(item))
            insert_list.append(data)
        testDbCursor = testDbConn.cursor()
        column_list = ', '.join('`{}`'.format(name) for name in name_of_columns)
        placeholders = ', '.join(['%s'] * len(name_of_columns))
        testDbCursor.executemany("INSERT INTO `{}`.`{}` ({}) VALUES ({})".format(
                                     TEST_DB_NAME,
                                     table,
                                     column_list,
                                     placeholders),
                                 insert_list)
        testDbConn.commit()
From this github thread, I notice that executemany does not work as expected in psycopg2; it instead sends each entry as a single query. You'll need to use execute_batch:
from psycopg2.extras import execute_batch

execute_batch(testDbCursor,
              "INSERT INTO {}.{} ({}) VALUES ({})".format(TEST_DB_NAME,
                                                          table,
                                                          column_list,
                                                          placeholders),
              insert_list)
testDbConn.commit()
How to insert data into a table using Python pymysql
Find my solution below
import pymysql
import datetime

# Create a connection object
dbServerName = "127.0.0.1"
port = 8889
dbUser = "root"
dbPassword = ""
dbName = "blog_flask"
# charSet = "utf8mb4"
conn = pymysql.connect(host=dbServerName, user=dbUser, password=dbPassword, db=dbName, port=port)
try:
    # Create a cursor object
    cursor = conn.cursor()
    # Insert a row into the MySQL table
    now = datetime.datetime.utcnow()
    my_datetime = now.strftime('%Y-%m-%d %H:%M:%S')
    cursor.execute(
        'INSERT INTO posts (post_id, post_title, post_content, filename, post_time) '
        'VALUES (%s, %s, %s, %s, %s)',
        (5, 'title2', 'description2', 'filename2', my_datetime))
    conn.commit()
except Exception as e:
    print("Exception occurred: {}".format(e))
finally:
    conn.close()
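If you need to insert several rows at once, pymysql's cursor.executemany takes the same parameterized query with a sequence of tuples; a small sketch reusing the posts table, cursor and my_datetime from the answer above (the row values are made up):

rows = [
    (6, 'title3', 'description3', 'filename3', my_datetime),
    (7, 'title4', 'description4', 'filename4', my_datetime),
]
cursor.executemany(
    'INSERT INTO posts (post_id, post_title, post_content, filename, post_time) '
    'VALUES (%s, %s, %s, %s, %s)',
    rows)
conn.commit()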

Batch downloading of a table using cx_Oracle

I need to download a large table from an Oracle database onto a Python server, using cx_Oracle to do so. However, RAM is limited on the Python server, so I need to do it in batches.
I already know how to do it for the whole table at once:
usr = ''
pwd = ''
tns = '(Description = ...'
orcl = cx_Oracle.connect(usr, pwd, tns)
curs = orcl.cursor()
printHeader = True
tabletoget = 'BIGTABLE'
sql = "SELECT * FROM " + "SCHEMA." + tabletoget
curs.execute(sql)
data = pd.read_sql(sql, orcl)
data.to_csv(tabletoget + '.csv')
I'm not sure, though, how to load a batch of, say, 10,000 rows at a time, save each batch off to a CSV, and then rejoin them.
You can use cx_Oracle directly to perform this sort of batch:
curs.arraysize = 10000
curs.execute(sql)
while True:
    rows = curs.fetchmany()
    if rows:
        write_to_csv(rows)
    if len(rows) < curs.arraysize:
        break
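write_to_csv isn't defined in the answer; a minimal sketch of what it could look like (the file name and append-only behaviour are assumptions):

import csv

def write_to_csv(rows, path='BIGTABLE.csv'):
    # Append each fetched batch; write a header row separately if you need one.
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerows(rows)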
If you are using Oracle Database 12c or higher you can also use the OFFSET and FETCH NEXT ROWS options, like this:
offset = 0
numRowsInBatch = 10000
while True:
    curs.execute("select * from tabletoget offset :offset rows fetch next :nrows rows only",
                 offset=offset, nrows=numRowsInBatch)
    rows = curs.fetchall()
    if rows:
        write_to_csv(rows)
    if len(rows) < numRowsInBatch:
        break
    offset += len(rows)
This option isn't as efficient as the first one, and it gives the database more work to do, but it may suit your circumstances better.
None of these examples use pandas directly. I am not particularly familiar with that package, but if you (or someone else) can adapt this appropriately, hopefully this will help!
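For a pandas-flavoured version of the same idea, read_sql accepts a chunksize argument and then yields DataFrames batch by batch; a sketch assuming orcl is the cx_Oracle connection from the question (pandas may warn about raw DBAPI connections, but it works):

import pandas as pd

first = True
for chunk in pd.read_sql("SELECT * FROM SCHEMA.BIGTABLE", orcl, chunksize=10000):
    # Write the header only for the first batch, then append the rest.
    chunk.to_csv('BIGTABLE.csv', mode='w' if first else 'a', header=first, index=False)
    first = False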
You can achieve your result like this. Here I am loading data to df.
import cx_Oracle
import time
import pandas

user = "test"
pw = "test"
dsn = "localhost:port/TEST"
con = cx_Oracle.connect(user, pw, dsn)
start = time.time()
cur = con.cursor()
cur.arraysize = 10000
try:
    cur.execute("select * from test_table")
    names = [x[0] for x in cur.description]
    rows = cur.fetchall()
    df = pandas.DataFrame(rows, columns=names)
    print(df.shape)
    print(df.head())
finally:
    if cur is not None:
        cur.close()
elapsed = (time.time() - start)
print(elapsed, "seconds")

Getting only updated data from a database

I have to get the recently updated data from a database. To solve this, I have saved the last-read row number into a Python shelve. The following code works for a simple query like select * from rows. My code is:
from pyodbc import connect
from peewee import *
import random
import shelve
import connection

d = shelve.open("data.shelve")
db = SqliteDatabase("data.db")

class Rows(Model):
    valueone = IntegerField()
    valuetwo = IntegerField()

    class Meta:
        database = db

def CreateAndPopulate():
    db.connect()
    db.create_tables([Rows], safe=True)
    with db.atomic():
        for i in range(100):
            row = Rows(valueone=random.randrange(0, 100), valuetwo=random.randrange(0, 100))
            row.save()
    db.close()

def get_last_primay_key():
    return d.get('max_row', 0)

def doWork():
    query = "select * from rows"  # could be anything
    conn = connection.Connection("localhost", "", "SQLite3 ODBC Driver", "data.db", "", "")
    max_key_query = "SELECT MAX(%s) from %s" % ("id", "rows")
    max_primary_key = conn.fetch_one(max_key_query)[0]
    print "max_primary_key " + str(max_primary_key)
    last_primary_key = get_last_primay_key()
    print "last_primary_key " + str(last_primary_key)
    if max_primary_key == last_primary_key:
        print "no new records"
    elif max_primary_key > last_primary_key:
        print "There are some datas"
        optimizedQuery = query + " where id>" + str(last_primary_key)
        print query
        for data in conn.fetch_all(optimizedQuery):
            print data
        d['max_row'] = max_primary_key
        # print d['max_row']

# CreateAndPopulate()  # to populate data
doWork()
The code works for a simple query without a WHERE clause, but the query can be anything from simple to complex, with joins and multiple WHERE clauses. In that case, the part where I append the where id> condition will fail. How can I get only the most recently added data from the database, whatever the query is?
PS: I cannot modify database. I just have to fetch from it.
Use an OFFSET clause. For example:
SELECT * FROM [....] WHERE [....] LIMIT -1 OFFSET 1000
In your query, replace 1000 with a parameter bound to your shelve variable. That will skip the top "shelve" number of rows and only grab newer ones. You may want to consider a more robust refactor eventually, but good luck.
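A sketch of binding that offset as a parameter, using the sqlite3 module directly (the asker's connection.Connection wrapper is custom, so adapt the call accordingly):

import sqlite3

last_row = d.get('max_row', 0)  # the value tracked in the shelve
conn = sqlite3.connect("data.db")
cur = conn.cursor()
# LIMIT -1 means "no limit" in SQLite; OFFSET skips the rows already seen.
cur.execute("SELECT * FROM rows LIMIT -1 OFFSET ?", (last_row,))
for row in cur.fetchall():
    print(row)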

Python: A MySQL statement works when I put in the actual value of a variable, but not when using the variable?

I have the following code; it is part of a script used to read stdin and process logs.
jobId = loglist[19]
deliveryCount += 1
dbcur.execute('UPDATE campaign_stat_delivered SET pmta_delivered = pmta_delivered + %s WHERE id = %s') % (deliveryCount,jobId)
dbcon.commit()
dbcon.close()
I can run the following:
dbcur.execute('UPDATE campaign_stat_delivered SET pmta_delivered = pmta_delivered + 1 WHERE id=1')
dbcon.commit()
dbcon.close()
and it will work. I'm not really sure what's going on, and it's hard for me to test quickly because I can't actually see the script running, since my program feeds directly into it. I have to make changes, restart the program that feeds it, send an email, then check the database. I have other scripts, and I'm able to use variables in SQL statements with no problem.
Any suggestions as to what may be going on? And any suggestions on how I can test more quickly?
full code:
import os
import sys
import time
import MySQLdb
import csv

if __name__ == "__main__":
    dbcon = MySQLdb.connect(host="tattoine.mktrn.net", port=3306, user="adki", passwd="pKhL9vrMN8BsFrJ5", db="adki")
    dbcur = dbcon.cursor()
    # type, timeLogged, timeQueued, orig, rcpt, orcpt, dsnAction, dsnStatus, dsnDiag, dsnMta, bounceCat,
    # srcType, srcMta, dlvType, dlvSourceIp, dlvDestinationIp, dlvEsmtpAvailable, dlvSize, vmta, jobId,
    # envId, queue, vmtaPool
    while True:
        line = sys.stdin.readline()
        fwrite = open("debug.log", "w")
        # fwrite.write(str(deliveryCount))
        fwrite.write("test2")
        dbcur.execute("INSERT INTO test(event_type) VALUES ('list')")
        dbcon.commit()
        loglist = line.split(',')
        deliveryCount = 0
        bounceType = loglist[0]
        bounceCategory = loglist[10]
        email = loglist[4]
        jobId = loglist[19]
        if bounceType == 'd':
            deliveryCount += 1
            fwrite = open("debug2.log", "w")
            # fwrite.write(str(deliveryCount))
            fwrite.write("test3")
            dbcur.execute("INSERT INTO test(event_type) VALUES (%d)", deliveryCount)
            dbcon.commit()
            dbcur.execute('UPDATE campaign_stat_delivered SET pmta_delivered = pmta_delivered + %s WHERE id = %s', (deliveryCount, jobId))
            dbcon.commit()
    dbcon.close()
Never use string interpolation to run a sql query.
You should do:
dbcur.execute(
    'UPDATE campaign_stat_delivered SET pmta_delivered = pmta_delivered + %s WHERE id = %s',
    (deliveryCount, jobId)
)
There are two arguments to the execute function: the query with placeholders, and a tuple of parameters. This way, MySQL will escape your parameters for you and prevent SQL injection attacks (http://en.wikipedia.org/wiki/SQL_injection).
Your error comes from applying the % operator to the return value of dbcur.execute('UPDATE campaign_stat_delivered SET pmta_delivered = pmta_delivered + %s WHERE id = %s'), not to the query string. The query by itself (with literal %s placeholders and no parameters) is also syntactically invalid for MySQL. You have to pass the tuple of parameters to the execute function as the second argument.

How do I read cx_Oracle.LOB data in Python?

I have this code:
dsn = cx_Oracle.makedsn(hostname, port, sid)
orcl = cx_Oracle.connect(username + '/' + password + '@' + dsn)
curs = orcl.cursor()
sql = "select TEMPLATE from my_table where id ='6'"
curs.execute(sql)
rows = curs.fetchall()
print rows
template = rows[0][0]
orcl.close()
print template.read()
When I do print rows, I get this:
[(<cx_Oracle.LOB object at 0x0000000001D49990>,)]
However, when I do print template.read(), I get this error:
cx_Oracle.DatabaseError: Invalid handle!
Do how do I get and read this data? Thanks.
I've found that this happens when the connection to Oracle is closed before the cx_Oracle.LOB.read() method is used.
orcl = cx_Oracle.connect(usrpass + '@' + dbase)
c = orcl.cursor()
c.execute(sq)
dane = c.fetchall()
orcl.close() # before reading LOB to str
wkt = dane[0][0].read()
And I get: DatabaseError: Invalid handle!
But the following code works:
orcl = cx_Oracle.connect(usrpass + '@' + dbase)
c = orcl.cursor()
c.execute(sq)
dane = c.fetchall()
wkt = dane[0][0].read()
orcl.close() # after reading LOB to str
Figured it out. I have to do something like this:
curs.execute(sql)
for row in curs:
    print row[0].read()
You basically have to loop through the fetchall object
dsn = cx_Oracle.makedsn(hostname, port, sid)
orcl = cx_Oracle.connect(username + '/' + password + '@' + dsn)
curs = orcl.cursor()
sql = "select TEMPLATE from my_table where id ='6'"
curs.execute(sql)
rows = curs.fetchall()
for x in rows:
    list_ = list(x)
    print(list_)
There should be an extra comma in the for loop; see the code below, where I have added a comma after x to unpack each single-column row tuple.
dsn = cx_Oracle.makedsn(hostname, port, sid)
orcl = cx_Oracle.connect(username + '/' + password + '@' + dsn)
curs = orcl.cursor()
sql = "select TEMPLATE from my_table where id ='6'"
curs.execute(sql)
rows = curs.fetchall()
for x, in rows:
    print(x)
I had the same problem in a slightly different context. I needed to query a table with more than 27,000 rows, and it turns out that cx_Oracle cuts the connection to the DB after a while.
While a connection to the DB is open, you can use the read() method of the cx_Oracle.LOB object to transform it into a string. But if the query returns a table that is too big, it won't work, because the connection will stop at some point, and when you try to read the results of the query you'll get an error on the cx_Oracle objects.
I tried many things, like setting connection.callTimeout = 0 (according to the documentation, this means it would wait indefinitely), and using fetchall() and then putting the results into a dataframe or numpy array, but I could never read the cx_Oracle.LOB objects.
If I run the query using pandas.read_sql(query, connection), the dataframe contains cx_Oracle.LOB objects whose connection is already closed, making them useless. (Again, this only happens if the table is very big.)
In the end I found a way around this by querying and creating a CSV file immediately after, even though I know it's not ideal.
def csv_from_sql(sql: str, path: str = "dataframe.csv") -> bool:
    try:
        with cx_Oracle.connect(config.username, config.password, config.database, encoding=config.encoding) as connection:
            connection.callTimeout = 0
            data = pd.read_sql(sql, con=connection)
            data.to_csv(path)
            print("FILE CREATED")
            return True
    except cx_Oracle.Error as error:
        print(error)
        return False
    finally:
        print("PROCESS ENDED\n")
def make_query(sql: str, path: str = "dataframe.csv") -> pd.DataFrame:
    if csv_from_sql(sql, path):
        dataframe = pd.read_csv(path)
        return dataframe
    return pd.DataFrame()
This took a long time (about 4 to 5 minutes) to bring in my 27,000+ row table, but it worked when everything else didn't.
If anyone knows a better way, it would be helpful for me too.
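One alternative worth knowing about: cx_Oracle can be told to fetch CLOB/BLOB columns directly as str/bytes through an output type handler, so no LOB locator is left depending on an open connection. A minimal sketch, assuming cx_Oracle 8+ (older releases use cx_Oracle.CLOB and cx_Oracle.LONG_STRING rather than the DB_TYPE_* constants) and the connection details from the question:

import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # Fetch CLOBs as str and BLOBs as bytes instead of cx_Oracle.LOB objects.
    if default_type == cx_Oracle.DB_TYPE_CLOB:
        return cursor.var(cx_Oracle.DB_TYPE_LONG, arraysize=cursor.arraysize)
    if default_type == cx_Oracle.DB_TYPE_BLOB:
        return cursor.var(cx_Oracle.DB_TYPE_LONG_RAW, arraysize=cursor.arraysize)

orcl = cx_Oracle.connect(username, password, dsn)  # credentials as in the question
orcl.outputtypehandler = output_type_handler
curs = orcl.cursor()
curs.execute("select TEMPLATE from my_table where id = '6'")
rows = curs.fetchall()  # TEMPLATE now comes back as a plain string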
