python script hangs when calling cursor.fetchall() with large data set - python

I have a query that returns over 125K rows.
The goal is to write a script that iterates through the rows and, for each one, populates a second table with data processed from the query result.
To develop the script, I created a duplicate database with a small subset of the data (4126 rows).
On the small database, the following code works:
import os
import sys
import random
import mysql.connector
cnx = mysql.connector.connect(user='dbuser', password='thePassword',
                              host='127.0.0.1',
                              database='db')
cnx_out = mysql.connector.connect(user='dbuser', password='thePassword',
                                  host='127.0.0.1',
                                  database='db')
ins_curs = cnx_out.cursor()
curs = cnx.cursor(dictionary=True)
#curs = cnx.cursor(dictionary=True,buffered=True) #fail
with open('sql\\getRawData.sql') as fh:
    sql = fh.read()
curs.execute(sql, params=None, multi=False)
result = curs.fetchall() #<=== script stops at this point
print len(result) #<=== this line never executes
print curs.column_names
curs.close()
cnx.close()
cnx_out.close()
sys.exit()
The line curs.execute(sql, params=None, multi=False) succeeds on both the large and small databases.
If I use curs.fetchone() in a loop, I can read all records.
If I alter the line:
curs = cnx.cursor(dictionary=True)
to read:
curs = cnx.cursor(dictionary=True,buffered=True)
The script hangs at curs.execute(sql, params=None, multi=False).
I can find no documentation on any limits to fetchall(), nor can I find any way to increase the buffer size, and no way to tell how large a buffer I even need.
There are no exceptions raised.
How can I resolve this?

I was having this same issue, first on a query that returned ~70k rows and then on one that only returned around 2k rows (and for me RAM was also not the limiting factor). I switched from using mysql.connector (i.e. the mysql-connector-python package) to MySQLdb (i.e. the mysql-python package) and then was able to fetchall() on large queries with no problem. Both packages seem to follow the python DB API, so for me MySQLdb was a drop-in replacement for mysql.connector, with no code changes necessary beyond the line that sets up the connection. YMMV if you're leveraging something specific about mysql.connector.
Pragmatically speaking, if you don't have a specific reason to be using mysql.connector the solution to this is just to switch to a package that works better!
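If switching packages is not an option, the question notes that reading with fetchone() in a loop does work. A middle ground is the DB-API fetchmany() call, which pulls a bounded batch per call instead of the whole result set. The sketch below is only an illustration under assumptions: it uses the standard-library sqlite3 module as a stand-in for the MySQL connection, and iter_rows, demo, and the table names are hypothetical, not part of the original script.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows from an already-executed DB-API cursor, one batch at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row


def demo(n_rows=5000):
    """Read from one table in batches and populate a second, processed table."""
    # sqlite3 stands in for the MySQL server here (assumption for the demo).
    cnx = sqlite3.connect(":memory:")
    cnx.execute("CREATE TABLE raw (id INTEGER, value INTEGER)")
    cnx.execute("CREATE TABLE processed (id INTEGER, doubled INTEGER)")
    cnx.executemany(
        "INSERT INTO raw VALUES (?, ?)", ((i, i * 3) for i in range(n_rows))
    )
    curs = cnx.cursor()
    ins_curs = cnx.cursor()
    curs.execute("SELECT id, value FROM raw")
    for row_id, value in iter_rows(curs, batch_size=500):
        # The "processing" step is a placeholder: double the value.
        ins_curs.execute(
            "INSERT INTO processed VALUES (?, ?)", (row_id, value * 2)
        )
    cnx.commit()
    count = cnx.execute("SELECT COUNT(*) FROM processed").fetchone()[0]
    cnx.close()
    return count
```

With a real MySQL driver the same iter_rows generator should work on any DB-API cursor, since fetchmany() is part of that specification. Whether it avoids the hang described above is untested.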

Related

How to read 10 million rows from POSTGRES and write the data to a CSV in python?

I am trying to run a query on a table which has about 10 million rows. Basically I am trying to run Select * from events and then I am writing it to a CSV file.
Here is the code:
import csv
import json
import os
import sys
import psycopg2

def create_server_connection():
    DB_CONNECTION_PARAMS = os.environ["DB_REPLICA_CONNECTION"]
    json_object = json.loads(DB_CONNECTION_PARAMS)
    try:
        conn = psycopg2.connect(
            database=json_object["PGDATABASE"], user=json_object["PGUSER"], password=json_object["PGPASSWORD"], host=json_object["PGHOST"], port=json_object["PGPORT"]
        )
    except psycopg2.OperationalError as e:
        # format() must be called on the string, not on print()'s return value
        print('Unable to connect!\n{0}'.format(e))
        sys.exit(1)
    return conn

with create_server_connection() as connection:
    cursor = connection.cursor()
    cursor.itersize = 20000
    cwd = os.getcwd()
    query = open(sql_file_path, mode='r').read()
    print(query)
    cursor.execute(query)
    with open(file_name, 'w', newline='') as fp:
        a = csv.writer(fp)
        for row in cursor:
            a.writerow(row)
However, for some reason, this whole process is taking up a lot of memory. I am running this as an AWS-batch process and the process exits with this error OutOfMemoryError: Container killed due to memory usage
Is there a way to reduce memory usage?
From the psycopg2 docs:
When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process. If the query returned an huge amount of data, a proportionally large amount of memory will be allocated by the client.
If the dataset is too large to be practically handled on the client side, it is possible to create a server side cursor. Using this kind of cursor it is possible to transfer to the client only a controlled amount of data, so that a large dataset can be examined without keeping it entirely in memory.
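As a rough sketch of what the docs describe: in psycopg2, passing a name to connection.cursor() creates a server-side cursor, and itersize then controls how many rows are transferred per round trip while iterating. The helper below takes an open connection as an argument. The function name export_query_to_csv and the cursor name "csv_export" are made up for illustration, and this has not been run against the 10-million-row table in the question.

```python
import csv


def export_query_to_csv(connection, query, out_path, batch_size=20000):
    """Stream a query result to a CSV file through a server-side cursor."""
    rows_written = 0
    # Giving the cursor a name makes psycopg2 create a server-side cursor.
    # "csv_export" is an arbitrary, hypothetical name.
    cursor = connection.cursor(name="csv_export")
    try:
        # Rows fetched from the server per round trip during iteration.
        cursor.itersize = batch_size
        cursor.execute(query)
        with open(out_path, "w", newline="") as fp:
            writer = csv.writer(fp)
            for row in cursor:
                writer.writerow(row)
                rows_written += 1
    finally:
        cursor.close()
    return rows_written
```

Usage would look something like export_query_to_csv(create_server_connection(), query, file_name). Note that itersize has no effect on the default (unnamed) client-side cursor, which is why setting it in the original code did not reduce memory use.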

Pyodbc stored procedure with params not updating table

I am using python 3.9 with a pyodbc connection to call a SQL Server stored procedure with two parameters.
This is the code I am using:
connectionString = buildConnection() # build connection
cursor = connectionString.cursor() # Create cursor
command = """exec [D7Ignite].[Service].[spInsertImgSearchHitResults] @RequestId = ?, @ImageInfo = ?"""
values = (requestid, data)
cursor.execute(command, values)
cursor.commit()
cursor.close()
requestid is simply an integer number, but data is defined as follows (list of json):
[{"ImageSignatureId":"27833", "SimilarityPercentage":"1.0"}]
The stored procedure I am trying to run is supposed to insert data into a table, and it works perfectly fine when executed from Management Studio. When running the code above I notice there are no errors but data is not inserted into the table.
To help me debug, I printed the query preview:
exec [D7Ignite].[Service].[spInsertImgSearchHitResults] @RequestId = 1693, @ImageInfo = [{"ImageSignatureId":"27833", "SimilarityPercentage":"1.0"}]
Pasting this exact line into SQL Server runs the stored procedure with no problem, and data is properly inserted.
I have enabled autocommit = True when setting up the connection and other CRUD commands work perfectly fine with pyodbc.
Is there anything I'm overlooking? Or is pyodbc simply not processing my query properly? If so, are there any other ways to run Stored Procedures from Python?

What is the best way to ignore errors when importing data from a pandas data frame to SQL Server?

I'm writing a python script to daily import data from a legacy system's data dump. I'd like to import the data and just skip rows that throw errors (e.g. wrong data-type). What is the best way of achieving this?
My current code:
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
conn = engine.connect()
df = pd.read_csv(file_path)
df.to_sql(tbl_name,conn,if_exists="append",index=False)
The file is rather large, so I'd prefer not iterating through rows as I have seen in some examples.
Shouldn't df.to_sql just ignore those by default? I thought that's how it worked. If not, just set up a try/except routine.
try:
    engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
    conn = engine.connect()
    df = pd.read_csv(file_path)
    df.to_sql(tbl_name, conn, if_exists="append", index=False)
except Exception:
    print('an error was detected; please check...')
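A single try block around the whole load reports a failure but does not skip individual bad rows. One possible approach, sketched here under assumptions, is to insert in batches and fall back to row-by-row inserts only for a batch that fails, so most rows never get iterated one at a time. This is a swapped-in technique, not something df.to_sql does. The example uses the standard-library sqlite3 module as a stand-in for SQL Server, and insert_skipping_bad_rows is a hypothetical helper name.

```python
import sqlite3


def insert_skipping_bad_rows(conn, insert_sql, rows, batch_size=1000):
    """Insert rows in batches; retry a failed batch row by row, skipping failures.

    Returns the list of rows that could not be inserted.
    """
    skipped = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        try:
            # "with conn" commits on success and rolls back on an exception,
            # so a failed batch leaves no partial inserts behind.
            with conn:
                conn.executemany(insert_sql, batch)
        except sqlite3.Error:
            # Only the failing batch pays the cost of per-row inserts.
            for row in batch:
                try:
                    with conn:
                        conn.execute(insert_sql, row)
                except sqlite3.Error:
                    skipped.append(row)
    return skipped
```

With a DataFrame, the rows could come from list(df.itertuples(index=False, name=None)). With pyodbc the caught exception type would need to be that driver's error class instead of sqlite3.Error.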

pypyodbc retrieves data very slowly vs PLSQL

I tested the same Oracle query in PLSQL vs Python/pypyodbc. I'm pulling ~30k rows, which takes 27 seconds in PLSQL, while it takes approximately eight minutes in Python. My python/pypyodbc code is here:
import pandas as pd
import pypyodbc
q0 = '''
select *
from weatherview x
where x.WeatherNodeRCIKey IN (481, 562, 563, 561, 564, 565, 560, 658)
and x.WeatherDate >= '01-jan-2016'
'''
try:
    con = pypyodbc.connect(driver='{Oracle in OraClient11g_home1}',
                           server='oracle', uid='acct', pwd='Pass', dbq='table')
    with con:
        cur = con.cursor()
        cur.execute(q0)
        q0_rows = cur.fetchall()
        q0_hdnm = [i[0] for i in cur.description]
except Exception as e:
    print("Error: " + str(e))
df0 = pd.DataFrame(q0_rows, columns=q0_hdnm)
df0.head()
It's hard for me to believe Python can be so much slower. I'm curious whether this is a server-side/client-side issue, or perhaps a memory issue. I don't believe this is related to the dataframe/pandas portion of the code, as I have run the code without the last few lines, with the same result.
I'm 90% sure the problem is related to fetchall() being slow.
I would be happy if anyone can point out:
How to troubleshoot the speed issues (time the DB connection etc)
Use a different package to pull the query that will be faster
Alter this code with pypyodbc to work more efficiently
EDIT: I changed the tags a little, removing [server-side], and adding [cx_Oracle] due to the answer I found below
I found one answer via my "use a different package" option above, using cx_Oracle. The code is as follows:
import pandas as pd
import cx_Oracle
q0 = '''
select *
from weatherview_historical x
where x.WeatherNodeRCIKey IN (481, 562, 563, 561, 564, 565, 560, 658)
and x.WeatherDate >= '01-jan-2016'
and x.WeatherTypeLu in (6436,6439)
'''
try:
    #establish connection with profit (oracle) db
    con = cx_Oracle.connect(user='uid', password='pass', dsn='dsn_name')
    df0 = pd.read_sql_query(q0, con)
except Exception as e:
    #return error message
    print("Error: " + str(e))
The key was the cx_Oracle package, which fetched the results in 20 seconds vs 8 minutes for pypyodbc (and 27 seconds for PLSQL as above). I was also able to feed the rows directly into the dataframe in pandas via .read_sql_query
I'm still very interested as to why pypyodbc was so slow versus other options. If anyone has any thoughts as to making it run faster, whether that's:
changing encoding at sql level so python doesn't have to handle
changing number of items returned by DB each call (size of array/chunk size)
changing to server side
skipping the .execute or .fetchall steps (these still work with cx_Oracle it appears)
Please let me know
Use the turbodbc library. It was the only way to upload fast: https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html
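On the troubleshooting point raised in the question (timing the DB connection and so on), a simple way to see where the time goes is to time the connect, execute, and fetch phases separately. The sketch below is an assumption-laden illustration: time_query_phases is a hypothetical helper, and it is demonstrated with the standard-library sqlite3 module and a generated 30,000-row result rather than the Oracle view from the question.

```python
import sqlite3
import time


def time_query_phases(connect, sql, batch_size=1000):
    """Time connect, execute, and fetch separately for a DB-API connection.

    `connect` is a zero-argument callable that returns a new connection.
    """
    timings = {}
    n_rows = 0

    t0 = time.perf_counter()
    con = connect()
    timings["connect"] = time.perf_counter() - t0
    try:
        cur = con.cursor()

        t0 = time.perf_counter()
        cur.execute(sql)
        timings["execute"] = time.perf_counter() - t0

        t0 = time.perf_counter()
        while True:
            batch = cur.fetchmany(batch_size)
            if not batch:
                break
            n_rows += len(batch)
        timings["fetch"] = time.perf_counter() - t0
    finally:
        con.close()
    timings["rows"] = n_rows
    return timings


# Stand-in query that generates 30,000 rows without needing a table.
DEMO_SQL = (
    "WITH RECURSIVE n(i) AS "
    "(SELECT 1 UNION ALL SELECT i + 1 FROM n WHERE i < 30000) "
    "SELECT i FROM n"
)

if __name__ == "__main__":
    print(time_query_phases(lambda: sqlite3.connect(":memory:"), DEMO_SQL))
```

For the pypyodbc case, connect could be a lambda wrapping the pypyodbc.connect(...) call from the question. If the fetch phase dominates, that would support the suspicion that fetchall() is the slow part. Varying batch_size shows whether the chunk size matters.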

python+MySQLdb, simple select is too slow compared to flat file access

I have one simple table with 80 000 rows.
I'm trying to select and save all rows to python list as fast as possible.
It's taking around 4 - 10 seconds.
In contrast, if I dump the exact same table into a CSV file and process it with this code:
f = open('list.csv','rb')
lines = f.read().splitlines()
f.close()
print len(lines)
It's taking only 0.08 - 0.3 second
I tried MySQLdb and mysql.connector using fetchall() or fetchone()
import time
start = time.time()
import MySQLdb as mdb
con = mdb.connect('127.0.0.1', 'login', 'p', 'db');
with con:
    cur = con.cursor()
    cur.execute("SELECT * FROM table")
    rows = cur.fetchall()
    print len(rows)
print 'MySQLdb %s' % (time.time()-start)
Took 3.7 - 8 seconds with high CPU load
Is it possible to achieve the same speed as with the CSV file?
EDIT
My MySQL server seems to be ok.
In mysql console:
SELECT * from TABLE;
....
80789 rows in set (0.21 sec)
The entire query result is stored on the client side as a list by the time cur.execute(...) returns. Check the self._rows attribute in MySQLdb/cursor.py for this.
That is to say, reading file contents and fetching query results from a MySQL database have different time costs. As we all know, a built-in function is generally faster than a third-party function. So I don't think there is a way to make cursor.execute() as fast as open().
As for why open() is faster, I suggest you look into Python source code. Here is the link.
Hope it helps.
