I've been running a Jupyter Notebook with Python3 code for quite some time. It uses a combination of pyodbc and SQLAlchemy to connect to a few SQL Server databases on my company intranet. The purpose of the notebook is to pull data from an initial SQL Server database and store it in memory as a Pandas dataframe. The file then extracts specific values from one of the columns and sends that list of values through to two different SQL Server databases to pull back a mapping list.
All of this has been working great until I decided to rewrite the raw SQL queries as SQLAlchemy Core statements. I've gone through the process of validating that the SQLAlchemy queries compile to match the raw SQL queries. However, the queries run unimaginably slowly. For instance, the initial raw SQL query runs in 25 seconds, while the same query rewritten in SQLAlchemy Core takes 15 minutes! The remaining queries didn't finish even after I let them run for 2 hours.
This could have something to do with how I'm reflecting the existing tables. I even took some time to override the ForeignKey and primary_key on the tables hoping that'd help improve performance. No dice.
I also know "if it ain't broke, don't fix it." But SQLAlchemy just looks so much nicer than a nasty block of hard coded SQL.
Can anyone explain why the SQLAlchemy queries are running so slowly? The SQLAlchemy docs don't give much insight. I'm running SQLAlchemy version 1.2.11.
import sqlalchemy
sqlalchemy.__version__
'1.2.11'
Here are the relevant lines. I'm excluding the exports for brevity, but if anyone needs to see them I'll be happy to supply them.
engine_dr2 = create_engine("mssql+pyodbc://{}:{}@SER-DR2".format(usr, pwd))
conn = engine_dr2.connect()
metadata_dr2 = MetaData()
bv = Table('BarVisits', metadata_dr2, autoload=True, autoload_with=engine_dr2, schema='livecsdb.dbo')
bb = Table('BarBillsUB92Claims', metadata_dr2, autoload=True, autoload_with=engine_dr2, schema='livecsdb.dbo')
mask = data['UCRN'].str[:2].isin(['DA', 'DB', 'DC'])
dr2 = data.loc[mask, 'UCRN'].unique().tolist()
sql_dr2 = select([bv.c.AccountNumber.label('Account_Number'),
                  bb.c.UniqueClaimReferenceNumber.label('UCRN')])
sql_dr2 = sql_dr2.select_from(bv.join(bb, and_(bb.c.SourceID == bv.c.SourceID,
                                               bb.c.BillingID == bv.c.BillingID)))
sql_dr2 = sql_dr2.where(bb.c.UniqueClaimReferenceNumber.in_(dr2))
mapping_list = pd.read_sql(sql_dr2, conn)
conn.close()
The raw SQL query that should match sql_dr2 and runs lickety split is here:
"""SELECT Account_Number = z.AccountNumber, UCRN = y.UniqueClaimReferenceNumber
FROM livecsdb.dbo.BarVisits z
INNER JOIN
livecsdb.dbo.BarBillsUB92Claims y
ON
y.SourceID = z.SourceID
AND
y.BillingID = z.BillingID
WHERE
y.UniqueClaimReferenceNumber IN ({0})""".format(', '.join(["'" + acct + "'" for acct in dr2]))
The list dr2 typically contains upwards of 70,000 elements. Again, the raw SQL handles this in one minute or less. The SQLAlchemy rewrite has been running for 8+ hours now and still not done.
Update
Additional information is provided below. I don't own the database or the tables, and they contain protected health information, so it's not something I can share directly, but I'll see about making some mock data.
insp = sqlalchemy.inspect(engine_dr2)  # inspector; its creation wasn't shown in the snippet above
tables = ['BarVisits', 'BarBillsUB92Claims']
for t in tables:
    print(insp.get_foreign_keys(t))
[]
[]
for t in tables:
    print(insp.get_indexes(t))
[{'name': 'BarVisits_SourceVisit', 'unique': False, 'column_names': ['SourceID', 'VisitID']}]
[]
for t in tables:
    print(insp.get_pk_constraint(t))
{'constrained_columns': ['BillingID', 'SourceID'], 'name': 'mtpk_visits'}
{'constrained_columns': ['BillingID', 'BillNumberID', 'ClaimID', 'ClaimInsuranceID', 'SourceID'], 'name': 'mtpk_insclaim'}
Thanks in advance for any insight.
I figured out how to make the query run fast but have no idea why it's needed with this server and not any others.
Taking
sql_dr2 = str(sql_dr2.compile(dialect=mssql.dialect(), compile_kwargs={"literal_binds": True}))
and sending that through
pd.read_sql(sql_dr2, conn)
performs the query in about 2 seconds.
Again, I have no idea why that works but it does.
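If I had to guess (and it is only a guess), literal_binds renders the ~70,000 IN values directly into the SQL text instead of sending them as individual bound parameters, so the driver never has to handle that many parameters - which would also explain why the hand-formatted raw SQL was fast. Spelled out against the earlier snippet, the only new line is the dialect import:
from sqlalchemy.dialects import mssql

# compile the statement with the values inlined, then hand the plain SQL string to pandas
sql_dr2_literal = str(sql_dr2.compile(dialect=mssql.dialect(),
                                      compile_kwargs={"literal_binds": True}))
mapping_list = pd.read_sql(sql_dr2_literal, conn)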
Related
I have made a test table in SQL. Now I extract this data using the following Python script:
import pandas as pd
import mysql.connector
db = mysql.connector.connect(host="localhost", user="root", passwd="abcdef")
pointer = db.cursor()
pointer.execute("use holdings")
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
pointer.execute(x)
rows = pointer.fetchall()
rows = pd.DataFrame(rows)
stock = rows[1]
The production table contains 200 unique trading symbols and has a schema similar to the test table.
My question is that, for the following statement:
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
I would have to replace the value of tradingsymbol 200 times, which is inefficient.
Is there a more effective way to do this?
If I understand you correctly, your problem is that you want to avoid sending a separate query for each trading symbol, correct? In that case MySQL's IN operator might be of help: you can simply send one query to the database containing all the trading symbols you want. If you then need to do different things with the various trading symbols, you can select the subsets within pandas.
Another performance improvement could be pandas.read_sql, since it speeds up building the dataframe somewhat.
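For example, here is a minimal sketch of both suggestions together, assuming the column is called tradingsymbol as in your query and using a hypothetical short list of symbols (with mysql.connector the IN clause is built from one %s placeholder per symbol, so the values stay parameterized):
import pandas as pd
import mysql.connector

symbols = ['TATACHEM', 'INFY', 'HDFCBANK']  # hypothetical list; in practice all 200 symbols
db = mysql.connector.connect(host="localhost", user="root",
                             passwd="abcdef", database="holdings")

# one %s placeholder per symbol keeps the query parameterized
placeholders = ', '.join(['%s'] * len(symbols))
query = "SELECT * FROM orders WHERE tradingsymbol IN ({})".format(placeholders)

orders = pd.read_sql(query, db, params=symbols)

# work with each symbol's subset in pandas instead of re-querying
by_symbol = {sym: grp for sym, grp in orders.groupby('tradingsymbol')}
Note that newer pandas versions warn that only SQLAlchemy connectables (and sqlite3) are officially supported by read_sql, but it does work with a plain DBAPI connection like this one.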
Two more things to add for efficiency:
Ensure that tradingsymbol is indexed in MySQL for faster lookups.
Make tradingsymbol an ENUM to ensure that no typos or the like are accepted; otherwise the IN approach above will silently match nothing for a misspelled symbol, since it compares the full string.
My question is about memory and performance with querying large data and then processing.
Long story short, because of a bug I am querying a table and getting all results between two timestamps. My Python script crashed because it ran out of memory - this table is very wide and holds a massive JSON object - so I changed the query to return only the primary key of each row.
select id from *table_name*
where updated_on between %(date_one)s and %(date_two)s
order by updated_on asc
From here I loop through and query each row one at a time by its primary key to get the row data.
for primary_key in *query_results*:
    row_data = data_helper.get_by_id( primary_key )
    # from here I do some formatting and throw a message on a queue processor,
    # this is not heavy processing
Example:
queue_helper.put_message_on_queue('update_related_tables', message_dict)
My question is: is this a "good" way of doing this? Do I need to help Python with GC, or will Python clean up the memory after each iteration of the loop?
Must be a very wide table? That doesn't seem like too crazy a number of rows. Anyway, you can make a lazy function to yield the data x rows at a time. You haven't said how you're executing your query, but this is a sqlalchemy/psycopg implementation:
with engine.connect() as conn:
    result = conn.execute(*query*)
    while True:
        chunk = result.fetchmany(x)
        if not chunk:
            break
        for row in chunk:
            heavy_processing(row)
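If you want the "lazy function" form literally, the same fetchmany loop can be wrapped in a generator - a sketch reusing the names above, with x as the chunk size and query standing in for whatever statement you execute:
def lazy_rows(conn, query, x=1000):
    # fetch x rows per round trip, but hand them back one at a time
    result = conn.execute(query)
    while True:
        chunk = result.fetchmany(x)
        if not chunk:
            break
        for row in chunk:
            yield row

with engine.connect() as conn:
    for row in lazy_rows(conn, query):
        heavy_processing(row)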
This is pretty similar to @it's-yer-boy-chet's answer, except it's using the lower-level psycopg2 library instead of sqlalchemy. After calling cursor.execute(), iterating over the cursor pulls one row at a time, which keeps the memory footprint relatively small provided there aren't thousands and thousands of columns returned by the query. Not sure if it necessarily provides any performance benefit over sqlalchemy; it might be doing basically the same thing under the hood.
If you still need more performance after that I'd look into a different database connection library like asyncpg
import psycopg2

conn = psycopg2.connect(user='user', password='password', host='host', database='database')
cursor = conn.cursor()
cursor.execute(query)  # execute() returns None in psycopg2, so iterate over the cursor itself
for row in cursor:
    message_dict = format_message(row)
    queue_helper.put_message_on_queue('update_related_tables', message_dict)
I'm trying to replace some old MSSQL stored procedures with Python, in an attempt to take some of the heavy calculations off of the SQL server. The part of the procedure I'm having issues replacing is as follows:
UPDATE mytable
SET calc_value = tmp.calc_value
FROM dbo.mytable mytable INNER JOIN
#my_temp_table tmp ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
WHERE (mytable.a = some_value)
and (mytable.x = tmp.x)
and (mytable.b = some_other_value)
Up to this point, I've made some queries with SQLAlchemy, stored that data in dataframes, and done the requisite calculations on them. What I don't know is how to put the data back into the server using SQLAlchemy, either with raw SQL or function calls. The dataframe I have on my end would essentially have to take the place of the temporary table created in MSSQL Server, but I'm not sure how I can do that.
The difficulty, of course, is that I don't know of a way to join a dataframe against an MSSQL table, and I'm guessing that wouldn't work directly, so I'm looking for a workaround.
As the pandas docs suggest:
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)
dataframe.to_sql('tablename', engine, if_exists='replace')
The engine parameter for MSSQL is basically the connection string; check the SQLAlchemy docs for the exact format.
The if_exists parameter is a bit tricky, since 'replace' actually drops the table first, then recreates it and inserts all the data at once.
Setting echo to True shows all the background logging and the SQL that gets emitted.
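To cover the join part of the question, one possible pattern (a sketch, not a drop-in solution; the table and parameter names are placeholders lifted from the question's UPDATE) is to push the dataframe into a staging table with to_sql and then run the original UPDATE ... JOIN against that staging table:
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@DSN", echo=False)

# the staging table stands in for #my_temp_table from the stored procedure
dataframe.to_sql('my_temp_table', engine, if_exists='replace', index=False)

update_sql = text("""
    UPDATE mytable
    SET calc_value = tmp.calc_value
    FROM dbo.mytable mytable
    INNER JOIN dbo.my_temp_table tmp
        ON mytable.a = tmp.a AND mytable.b = tmp.b AND mytable.c = tmp.c
    WHERE mytable.a = :some_value
      AND mytable.x = tmp.x
      AND mytable.b = :some_other_value
""")

with engine.begin() as conn:  # begin() wraps the UPDATE in a transaction and commits on success
    # placeholder values for the WHERE filters from the question
    conn.execute(update_sql, {"some_value": 1, "some_other_value": 2})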
I am trying to select all the records from a sqlite db I have with sqlalchemy, loop over each one, and do an update on it. I am doing this because I need to reformat every record in my name column.
Here is the code I am using to do a simple test:
def loadDb(name):
    sqlite3.connect(name)
    engine = create_engine('sqlite:///' + dbPath(), echo=False)
    metadata = MetaData(bind=engine)
    return metadata

db = database("dealers.db")
metadata = db.loadDb()

dealers = Table('dealers', metadata, autoload=True)
dealer = dealers.select().order_by(asc(dealers.c.id)).execute()

for d in dealer:
    u = dealers.update(dealers.c.id == d.id)
    u.execute(name="hi")
    break
I'm getting the error:
sqlalchemy.exc.OperationalError: (OperationalError) database table is locked u'UPDATE dealers SET name=? WHERE dealers.id = ?' ('hi', 1)
I'm very new to sqlalchemy and I'm not sure what this error means or how to fix it. This seems like it should be a really simple task, so I know I am doing something wrong.
With SQLite, you can't update the database while you are still performing the select. You need to force the select query to finish and store all of the data, then perform your loop. I think this would do the job (untested):
dealer = list(dealers.select().order_by(asc(dealers.c.id)).execute())
Another option would be to make a slightly more complicated SQL statement so that the loop executes inside the database instead of in Python. That will certainly give you a big performance boost.
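As a sketch of that second option, reusing the bound metadata from the question - func.upper here is just a stand-in for whatever reformatting your name column actually needs:
from sqlalchemy import func

# one UPDATE touches every row, so there is no Python loop and no open SELECT to lock against
dealers.update().values(name=func.upper(dealers.c.name)).execute()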
This loop checks whether a record is in the sqlite database, builds a list of dictionaries for the records that are missing, and then executes a multiple-insert statement with the list. This works, but it is very slow (at least I think it is slow): it takes 5 minutes to loop over 3500 queries. I am a complete newbie with Python, sqlite, and sqlalchemy, so I wonder if there is a faster way of doing this.
list_dict = []
session = Session()

for data in data_list:
    if session.query(Class_object).filter(Class_object.column_name_01 == data[2]).filter(Class_object.column_name_00 == an_id).count() == 0:
        list_dict.append({'column_name_00': a_id,
                          'column_name_01': data[2]})

conn = engine.connect()
conn.execute(prices.insert(), list_dict)
conn.close()
session.close()
edit: I moved session = Session() outside the loop. Did not make a difference.
SOLUTION:
Thanks to mcabral's answer, I modified the code as follows:
existing_record_list = []
list_dict = []

conn = engine.connect()
s = select([prices.c.column_name_01], prices.c.column_name_00 == a_id)
result = conn.execute(s)
for row in result:
    existing_record_list.append(row[0])

for data in raw_data['data']:
    if data[2] not in existing_record_list:
        list_dict.append({'column_name_00': a_id,
                          'column_name_01': data[2]})

conn = engine.connect()
conn.execute(prices.insert(), list_dict)
conn.close()
This now takes 6 seconds. That is some improvement!!
3500 queries seems like a big number.
Have you considered fetching all the existing entities in one query? Then you would be iterating over a list in memory instead of querying the database for each item.
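A compact sketch of that idea, reusing the names from the question (conn and engine as defined there) and holding the existing values in a set so each membership check is O(1) - essentially what the solution above does, just with a set instead of a list:
# pull the existing values once, in a single query
existing = {row[0] for row in conn.execute(
    select([prices.c.column_name_01], prices.c.column_name_00 == a_id))}

# build only the missing rows and insert them in one statement
list_dict = [{'column_name_00': a_id, 'column_name_01': data[2]}
             for data in data_list if data[2] not in existing]
if list_dict:
    conn.execute(prices.insert(), list_dict)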
Glad you found something that works, as an extra 2 cents:
I agree with mcabral. As a general rule, if you are putting a query inside of a loop, you are asking for trouble. Popular SQL DBs are generally optimized for data acquisition. Looping through a query generally indicates that you are procedurally doing something that should/could be done with either a single query or a series of queries that feed data into each other.
There are exceptions to this, but from my experience they are usually few and far between... Every time I ran a query through a loop, I regretted it later.