I tested the same Oracle query in PLSQL vs Python/pypyodbc. I'm pulling ~30k rows, which takes 27 seconds in PLSQL, while it takes approximately eight minutes in Python. My python/pypyodbc code is here:
import pandas as pd
import pypyodbc
q0 = '''
select *
from weatherview x
where x.WeatherNodeRCIKey IN (481, 562, 563, 561, 564, 565, 560, 658)
and x.WeatherDate >= '01-jan-2016'
'''
try:
    con = pypyodbc.connect(driver='{Oracle in OraClient11g_home1}',
                           server='oracle', uid='acct', pwd='Pass', dbq='table')
    with con:
        cur = con.cursor()
        cur.execute(q0)
        q0_rows = cur.fetchall()
        q0_hdnm = [i[0] for i in cur.description]
except Exception as e:
    print("Error: " + str(e))
df0 = pd.DataFrame(q0_rows, columns=q0_hdnm)
df0.head()
It's hard for me to believe Python can be so much slower. I'm curious whether this is a server-side/client-side issue, or perhaps a memory issue. I don't believe this is related to the dataframe/pandas portion of the code, as I have run the code without the last few lines with the same result.
I'm 90% sure the problem is related to fetchall() being slow.
I would be happy if anyone can point out:
How to troubleshoot the speed issues (time the DB connection etc)
Use a different package to pull the query that will be faster
Alter this code with pypyodbc to work more efficiently
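One way to approach the first point is to time each phase separately (connect, execute, fetch), so it becomes clear whether the time goes into the query itself or into pulling rows through the driver. Below is a minimal sketch, assuming the standard-library sqlite3 module as a stand-in for the Oracle connection so that it runs anywhere. The `timed` helper, the `timings` dict and the `weather` table are illustrative names, not part of pypyodbc:

```python
import sqlite3
import time
from contextlib import contextmanager

timings = {}  # phase name -> elapsed seconds (illustrative helper state)


@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-in database; swap in pypyodbc.connect(...) to time the real driver.
with timed("connect"):
    con = sqlite3.connect(":memory:")

# Hypothetical table with roughly the row count from the question.
con.execute("CREATE TABLE weather (node INTEGER, temp REAL)")
con.executemany("INSERT INTO weather VALUES (?, ?)",
                [(i % 8, i * 0.1) for i in range(30000)])

cur = con.cursor()
with timed("execute"):
    cur.execute("SELECT * FROM weather")
with timed("fetchall"):
    rows = cur.fetchall()

for label, seconds in timings.items():
    print(f"{label:>8}: {seconds:.4f} s")
con.close()
```

If nearly all of the time lands in the fetch phase with the real driver, the bottleneck is likely on the client side (row conversion in the driver) rather than in the database.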
EDIT: I changed the tags a little, removing [server-side], and adding [cx_Oracle] due to the answer I found below
I found one answer via my "use a different package" option above, using cx_Oracle. The code is as follows:
import pandas as pd
import cx_Oracle
q0 = '''
select *
from weatherview_historical x
where x.WeatherNodeRCIKey IN (481, 562, 563, 561, 564, 565, 560, 658)
and x.WeatherDate >= '01-jan-2016'
and x.WeatherTypeLu in (6436,6439)
'''
try:
    # establish connection with profit (oracle) db
    con = cx_Oracle.connect(user='uid', password='pass', dsn='dsn_name')
    df0 = pd.read_sql_query(q0, con)
except Exception as e:
    # return error message
    print("Error: " + str(e))
The key was the cx_Oracle package, which fetched the results in 20 seconds vs 8 minutes for pypyodbc (and 27 seconds for PLSQL as above). I was also able to feed the rows directly into the pandas dataframe via .read_sql_query.
I'm still very interested in why pypyodbc was so slow compared with the other options. If anyone has any thoughts on making it run faster, whether that's:
changing encoding at sql level so python doesn't have to handle
changing number of items returned by DB each call (size of array/chunk size)
changing to server side
skipping the .execute or .fetchall steps (these still work with cx_Oracle it appears)
Please let me know
Use the turbodbc library; for me it was the only way to upload quickly: https://turbodbc.readthedocs.io/en/latest/pages/getting_started.html
Related
I'm writing a python script to daily import data from a legacy system's data dump. I'd like to import the data and just skip rows that throw errors (e.g. wrong data-type). What is the best way of achieving this?
My current code:
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
conn = engine.connect()
df = pd.read_csv(file_path)
df.to_sql(tbl_name,conn,if_exists="append",index=False)
The file is rather large, so I'd prefer not iterating through rows as I have seen in some examples.
Shouldn't df.to_sql just ignore those by default? I thought that's how it worked. If not, just set up a try/except routine.
try:
    engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
    conn = engine.connect()
    df = pd.read_csv(file_path)
    df.to_sql(tbl_name, conn, if_exists="append", index=False)
except Exception as e:
    print('An error was detected; please check:', e)
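A try/except around the whole call catches the error, but it does not by itself skip the offending rows. One possible pattern is to insert in batches and fall back to row-by-row inserts only for batches that fail, so that most of the file still goes through the fast path. This is a minimal sketch rather than a pandas recipe: it assumes the standard-library sqlite3 module as a stand-in for the SQL Server connection, and `insert_skipping_bad_rows`, the `readings` table and its CHECK constraint (which simulates a wrong-data-type error) are hypothetical.

```python
import sqlite3


def insert_skipping_bad_rows(conn, sql, rows, batch_size=1000):
    """Insert rows in batches; on a failed batch, retry its rows one by one.

    Returns the list of rows that the database rejected.
    """
    rejected = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        try:
            with conn:  # commit the batch, or roll it back on error
                conn.executemany(sql, batch)
        except sqlite3.Error:
            # Slow path, only for batches containing at least one bad row.
            for row in batch:
                try:
                    with conn:
                        conn.execute(sql, row)
                except sqlite3.Error:
                    rejected.append(row)
    return rejected


conn = sqlite3.connect(":memory:")
# The CHECK constraint makes SQLite reject non-integer values, standing in
# for a strict column type on the real server.
conn.execute(
    "CREATE TABLE readings ("
    " id INTEGER PRIMARY KEY,"
    " value INTEGER CHECK (typeof(value) = 'integer'))"
)
rows = [(1, 10), (2, "not a number"), (3, 30), (4, 40)]
bad = insert_skipping_bad_rows(
    conn, "INSERT INTO readings VALUES (?, ?)", rows, batch_size=2)
count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(count, bad)
```

With another driver, the exception class to catch would be that driver's own error type, and the rejected rows could be written to a log for later inspection.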
I have the following python code which runs multiple SQL Queries in Oracle database and combines them into one dataframe.
The queries are stored in a txt file, and every row is a separate SQL query. The loop runs the queries sequentially. I want to cancel any SQL query that runs for more than 10 seconds, so as not to create overhead in the database.
The following code doesn't actually give me the results that I want. More specifically, this bit of the code doesn't really help with my issue:
if (time.time() - start) > 10:
    connection.cancel()
The full Python code is below. Perhaps there is an Oracle function that can be called to cancel the query.
import pandas as pd
import cx_Oracle
import time
ip = 'XX.XX.XX.XX'
port = XXXX
svc = 'XXXXXX'
dsn_tns = cx_Oracle.makedsn(ip, port, service_name = svc)
connection = cx_Oracle.connect(user='XXXXXX',
                               password='XXXXXX',
                               dsn=dsn_tns,
                               encoding="UTF-8",
                               nencoding="UTF-8")
filepath = 'C:/XXXXX'
appended_data = []
with open(filepath + 'sql_queries.txt') as fp:
    line = fp.readline()
    while line:
        start = time.time()
        df = pd.read_sql(line, con=connection)
        if (time.time() - start) > 10:
            connection.cancel()
            print("Cancel")
        appended_data.append(df)
        df_combined = pd.concat(appended_data, axis=0)
        line = fp.readline()
        print(time.time() - start)
    fp.close()
A better approach would be to spend some time tuning the queries to make them as efficient as necessary. As @Andrew points out, we can't easily kill a database query from outside the database, or even from another session inside the database (it requires DBA-level privileges).
Indeed, most DBAs would rather you ran a query for 20 seconds than attempt to kill every query which runs for more than 10. Apart from anything else, having a process which polls your query to see how long it has been running is itself a waste of database resources.
I suggest you discuss this with your DBA. You may find you're worrying about nothing.
Look at cx_Oracle 7's Connection.callTimeout setting. You'll need to be using Oracle client libraries 18+. (These will connect to Oracle DB 11.2+). The doc for the equivalent node-oracledb parameter explains the fine print behind the Oracle behavior and round trips.
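A rough sketch of how that setting might be used for the loop in the question. It assumes `connection` is a cx_Oracle 7+ connection backed by Oracle client libraries 18+, and that `callTimeout` is given in milliseconds and applies per round trip to the database. `run_queries_with_timeout` is a hypothetical helper, and the function only relies on the connection and cursor interface, so cx_Oracle is not imported here:

```python
def run_queries_with_timeout(connection, queries, timeout_ms=10_000):
    """Run each query; calls exceeding timeout_ms are abandoned by the driver.

    `connection` is assumed to be a cx_Oracle 7+ connection. callTimeout is
    in milliseconds and applies to each round trip, not to the whole loop.
    """
    connection.callTimeout = timeout_ms
    results, failed = [], []
    cursor = connection.cursor()
    for sql in queries:
        try:
            cursor.execute(sql)
            results.append(cursor.fetchall())
        except Exception as exc:
            # A timeout is expected to surface as a driver exception; any
            # other database error for this query ends up here as well.
            failed.append((sql, str(exc)))
    return results, failed
```

With a real connection this could be called with the lines read from sql_queries.txt. Unlike checking the elapsed time after pd.read_sql has already returned, the timeout here is enforced while the call is still in progress.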
I am trying to pull a 1.7G file into a pandas dataframe from a Greenplum postgres data source. The psycopg2 driver takes 8 or so minutes to load. Using the pandas "chunksize" parameter does not help as the psycopg2 driver selects all data into memory, then hands it off to pandas, using a lot more than 2G of RAM.
To get around this, I'm trying to use a named cursor, but all the examples I've found then loop through row by row, and that just seems slow. The main problem, though, appears to be that my SQL just stops working in the named query for some unknown reason.
Goals
load the data as quickly as possible without doing any "unnatural acts"
use SQLAlchemy if possible - used for consistency
have the results in a pandas dataframe for fast in-memory processing (alternatives?)
Have a "pythonic" (elegant) solution. I'd love to do this with a context manager but haven't gotten that far yet.
# Named Cursor Chunky Access Test
import pandas as pd
import psycopg2
import psycopg2.extras
# Connect to database - works
conn_chunky = psycopg2.connect(
    database=database, user=username, password=password, host=hostname)
# Open named cursor - appears to work
cursor_chunky = conn_chunky.cursor(
    'buffered_fetch', cursor_factory=psycopg2.extras.DictCursor)
cursor_chunky.itersize = 100000
# This is where the problem occurs - the SQL works just fine in all other tests, returns 3.5M records
result = cursor_chunky.execute(sql_query)
# result returns None (normal behavior) but result is not iterable
df = pd.DataFrame(result.fetchall())
The pandas call returns AttributeError: 'NoneType' object has no attribute 'fetchall'. The failure seems to be due to the named cursor being used. I have tried fetchone, fetchmany, etc. Note the goal here is to let the server chunk and serve up the data in large chunks such that there is a balance of bandwidth and CPU usage. Looping through a df = df.append(row) is just plain fugly.
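For what it's worth, the AttributeError is consistent with how psycopg2 cursors behave: execute() returns None and the rows are fetched from the cursor object itself, so result.fetchall() cannot work whether or not the cursor is named. Below is a minimal sketch of pulling large chunks by calling fetchmany() on the cursor. `iter_chunks` is a hypothetical helper, and sqlite3 stands in for psycopg2 so the block runs without a server; with psycopg2, the cursor passed in would be the named cursor.

```python
import sqlite3


def iter_chunks(cursor, sql, chunk_size):
    """Execute sql, then yield lists of up to chunk_size rows.

    The return value of execute() is deliberately ignored; the fetch
    methods are called on the cursor itself.
    """
    cursor.execute(sql)
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in data: a hypothetical table with 2500 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(2500)])

chunk_sizes = [len(chunk)
               for chunk in iter_chunks(conn.cursor(), "SELECT n FROM t", 1000)]
print(chunk_sizes)
conn.close()
```

Each chunk could be turned into a DataFrame and the pieces combined with a single pd.concat at the end, rather than appending row by row.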
See related questions (not the same issue):
Streaming data from Postgres into Python
psycopg2 leaking memory after large query
Added standard client side chunking code per request
nrows = 3652504
size = nrows // 1000  # integer division: chunksize must be an int
idx = 0
first_loop = True
for dfx in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size):
    if first_loop:
        df = dfx
        first_loop = False
    else:
        df = df.append(dfx, ignore_index=True)
UPDATE:
# Chunked access
import time
import pandas as pd
from sqlalchemy import create_engine
start = time.time()
engine = create_engine(conn_str)
size = 10**4
df = pd.concat((x for x in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size)),
               ignore_index=True)
print('time:', (time.time() - start)/60, 'minutes or ', time.time() - start, 'seconds')
OLD answer:
I'd try to read data from PostgreSQL using internal Pandas method: read_sql():
from sqlalchemy import create_engine
engine = create_engine('postgresql://user@localhost:5432/dbname')
df = pd.read_sql(sql_query, engine)
I have a query that returns over 125K rows.
The goal is to write a script that iterates through the rows and, for each one, populates a second table with data processed from the result of the query.
To develop the script, I created a duplicate database with a small subset of the data (4126 rows)
On the small database, the following code works:
import os
import sys
import random
import mysql.connector
cnx = mysql.connector.connect(user='dbuser', password='thePassword',
                              host='127.0.0.1',
                              database='db')
cnx_out = mysql.connector.connect(user='dbuser', password='thePassword',
                                  host='127.0.0.1',
                                  database='db')
ins_curs = cnx_out.cursor()
curs = cnx.cursor(dictionary=True)
#curs = cnx.cursor(dictionary=True,buffered=True) #fail
with open('sql\\getRawData.sql') as fh:
    sql = fh.read()
curs.execute(sql, params=None, multi=False)
result = curs.fetchall() #<=== script stops at this point
print len(result) #<=== this line never executes
print curs.column_names
curs.close()
cnx.close()
cnx_out.close()
sys.exit()
The line curs.execute(sql, params=None, multi=False) succeeds on both the large and small databases.
If I use curs.fetchone() in a loop, I can read all records.
If I alter the line:
curs = cnx.cursor(dictionary=True)
to read:
curs = cnx.cursor(dictionary=True,buffered=True)
The script hangs at curs.execute(sql, params=None, multi=False).
I can find no documentation on any limits to fetchall(), nor can I find any way to increase the buffer size, and no way to tell how large a buffer I even need.
There are no exceptions raised.
How can I resolve this?
I was having this same issue, first on a query that returned ~70k rows and then on one that only returned around 2k rows (and for me RAM was also not the limiting factor). I switched from using mysql.connector (i.e. the mysql-connector-python package) to MySQLdb (i.e. the mysql-python package) and then was able to fetchall() on large queries with no problem. Both packages seem to follow the python DB API, so for me MySQLdb was a drop-in replacement for mysql.connector, with no code changes necessary beyond the line that sets up the connection. YMMV if you're leveraging something specific about mysql.connector.
Pragmatically speaking, if you don't have a specific reason to be using mysql.connector the solution to this is just to switch to a package that works better!
I have one simple table with 80 000 rows.
I'm trying to select and save all rows to python list as fast as possible.
It's taking around 4 - 10 seconds.
In contrast, if I dump the exact same table into a CSV file and process it with this code:
f = open('list.csv','rb')
lines = f.read().splitlines()
f.close()
print len(lines)
it takes only 0.08 - 0.3 seconds.
I tried MySQLdb and mysql.connector using fetchall() or fetchone()
import time
start = time.time()
import MySQLdb as mdb
con = mdb.connect('127.0.0.1', 'login', 'p', 'db')
with con:
    cur = con.cursor()
    cur.execute("SELECT * FROM table")
    rows = cur.fetchall()
    print len(rows)
print 'MySQLdb %s' % (time.time()-start)
Took 3.7 - 8 seconds with high CPU load
Is it possible to achieve the same speed as with the CSV file?
EDIT
My MySQL server seems to be ok.
In mysql console:
SELECT * from TABLE;
....
80789 rows in set (0.21 sec)
The entire query result is stored on the client side as a list once cur.execute(...) is done. Check the self._rows attribute in MySQLdb/cursor.py for this.
That is to say, the time cost differs between reading file content and fetching query results from a MySQL database. As we all know, a built-in function is usually faster than a third-party (3PP) function. So I don't think there is a way to make cursor.execute() as fast as open().
As for why open() is faster, I suggest you look into Python source code. Here is the link.
Hope it helps.