Pandas Join DataTable to SQL Table to Prevent Memory Errors - python

So I have about 4-5 million rows of data per table. I have about 10-15 of these tables. I created a table that will join 30,000 rows to some of these million rows based on some ID and snapshot date.
Is there a way to write my existing data table to a SQL query where it will filter the results down for me so that I do not have to load the entire tables into memory?
At the moment I've been loading each table in one at a time, and then releasing the memory. However, it still takes up 100% memory on my computer.
for table in tablesToJoin:
if df is not None:
print("DF LENGTH", len(df))
query = """SET NOCOUNT ON; SELECT * FROM """ + table + """ (nolock) where snapshotdate = '"""+ date +"""'"""
query += """ SET NOCOUNT OFF;"""
start = time.time()
loadedDf = pd.read_sql_query(query, conn)
if df is None:
df = loadedDf
else:
loadedDf.info(verbose=True, null_counts=True)
df.info(verbose=True, null_counts=True)
df = df.merge(loadedDf, how='left', on=["MemberID", "SnapshotDate"])
#df = df.fillna(0)
print("DATA AFTER ALL MERGING", len(df))
print("Length of data loaded:", len(loadedDf))
print("Time to load data from sql", (time.time() - start))

I once faced the same problem as you are. My solution was to filter as much as possible in the SQL layer. Since I don't have your code and your DB, what I write below is untested code and very possibly contain bugs. You will have to correct them as needed.
The idea is to read as little as possible from the DB. pandas is not designed to analyze frames of millions of rows (at least on a typical computer). To do that, pass the filter criteria from df to your DB call:
from sqlalchemy import MetaData, and_, or_
engine = ... # construct your SQL Alchemy engine. May correspond to your `conn` object
meta = MetaData()
meta.reflect(bind=engine, only=tablesToJoin)
for table in tablesToJoin:
t = meta[table]
# Building the WHERE clause. This is equivalent to:
# WHERE ((MemberID = <MemberID 1>) AND (SnapshotDate = date))
# OR ((MemberID = <MemberID 2>) AND (SnapshotDate = date))
# OR ((MemberID = <MemberID 3>) AND (SnapshotDate = date))
cond = _or(**[and_(t.c['MemberID'] == member_id, t.c['SnapshotDate'] == date) for member_id in df['MemberID'] ])
# Be frugal here: only get the columns that you need, or you will blow your memory
# If you specify None, it's equivalent to a `SELECT *`
statement = t.select(None).where(cond)
# Note that it's `read_sql`, not `read_sql_query` here
loadedDf = pd.read_sql(statement, engine)
# loadedDf should be much smaller now since you have already filtered it at the DB level
# Now do your joins...

Related

Obtain list of IDs inserted from pandas to_sql function

The following Python code successfully appends the rows belonging to the pandas dataframe into an MS SQL table via the SqlAlchemy engine previously configured.
df.to_sql(schema='stg', name = 'TEST', con=engine, if_exists='append', index=False)
I want to obtain the auto-generated IDs numbers for each of the rows inserted into the stg.Test table. In other words, what is the SqlAlchemy equivalent to the Sql Server OUTPUT clause during an INSERT statement
Unfortunately, there is no easy solution to your problem like an additional parameter in your statement. You have to use the behavior that new rows get the highest id + 1 assigned. With this knowledge, you can calculate the ids of all your rows.
Option 1: Explained in this answer. You select the current maximum id, before the insert statement. Then, you assign ids to all the entries in your DataFrame greater than the previous maximum. Lastly, insert the df which already includes the ids.
Option 2: You insert the DataFrame and then acquire the highest id. With the number of entries inserted you can calculate the id of all entries. This is how such an insert function could look like:
def insert_df_and_return_ids(df, engine):
# It is important to use same connection for both statements if
# something like last_insert_rowid() is used
conn = engine.connect()
# Insert the df into the database
df.to_sql('students', conn, if_exists='append', index=False)
# Aquire the maximum id
result = conn.execute('SELECT max(id) FROM students') # Should work for all SQL variants
# result = conn.execute('Select last_insert_rowid()') # Specifically for SQLite
# result = conn.execute('Select last_insert_id()') # Specifically for MySql
entries = df.shape[0]
last_id = -1
# Iterate over result to get last inserted id
for row in result:
last_id = int(str(row[0]))
conn.close()
# Generate list of ids
list_of_ids = list(range(last_id - entries + 1, last_id + 1))
return list_of_ids
PS: I could not test the function on an MS SQL server, but the behavior should be the same. In order to test if everything behaves as it should you can use this:
import numpy as np
import pandas as pd
import sqlalchemy as sa
# Change connection to MS SQL server
engine = sa.create_engine('sqlite:///test.lite', echo=False)
# Create table
meta = sa.MetaData()
students = sa.Table(
'students', meta,
sa.Column('id', sa.Integer, primary_key = True),
sa.Column('name', sa.String),
)
meta.create_all(engine)
# DataFrame to insert with two entries
df = pd.DataFrame({'name': ['Alice', 'Bob']})
ids = insert_df_and_return_ids(df, engine)
print(ids) # [1,2]
conn = engine.connect()
# Insert any entry with a high id in order to check if new ids are always the maximum
result = conn.execute("Insert into students (id, name) VALUES (53, 'Charlie')")
conn.close()
# Insert data frame again
ids = insert_df_and_return_ids(df, engine)
print(ids) # [54, 55]
EDIT: If multiple threads are utilized, transactions can be used to make the option thread-safe at least for SQLite:
conn = engine.connect()
transaction = conn.begin()
df.to_sql('students', conn, if_exists='append', index=False)
result = conn.execute('SELECT max(id) FROM students')
transaction.commit()

Update multiple rows of SQL table from Python script

I have a massive table (over 100B records), that I added an empty column to. I parse strings from another field (string) if the required string is available, extract an integer from that field, and want to update it in the new column for all rows that have that string.
At the moment, after data has been parsed and saved locally in a dataframe, I iterate on it to update the Redshift table with clean data. This takes approx 1sec/iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()
clean_df = raw_data.apply(clean_field_to_parse)
for ind, row in clean_df.iterrows():
update_query = build_update_query(row.id, row.clean_integer1, row.clean_integer2)
cur.execute(update_query)
where update_query is a function to generate the update query:
def update_query(id, int1, int2):
query = """
update tab_tab
set
clean_int_1 = {}::int,
clean_int_2 = {}::int,
updated_date = GETDATE()
where id = {}
;
"""
return query.format(int1, int2, id)
and where clean_df is structured like:
id . field_to_parse . clean_int_1 . clean_int_2
1 . {'int_1':'2+1'}. 3 . np.nan
2 . {'int_2':'7-0'}. np.nan . 7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows by pushing the Pandas data frame to Postgres as a staging table and then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of data frame. Even add an index of the join field, id, and drop the very large staging table at end.
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://myuser:mypwd!#myhost/mydatabase")
# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)
# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
conn.execute(sql)
sql = """UPDATE tab_tab t
SET t.clean_int_1 = c.int1,
t.clean_int_2 = c.int2,
t.updated_date = GETDATE()
FROM clean_df c
WHERE c.id = t.id
"""
conn.execute(sql)
sql = "DROP TABLE IF EXISTS clean_df"
conn.execute(sql)
engine.dispose()

Why does multi-columns indexing in SQLite slow down the query's performance, unless indexing all columns?

I am trying to optimize the performance of a simple query to a SQLite database by using indexing. As an example, the table has 5M rows, 5 columns; the SELECT statement is to pick up all columns and the WHERE statement checks for only 2 columns. However, unless I have all columns in the multi-column index, the performance of the query is worse than without any index.
Did I index the column incorrectly, or when selecting all columns, am I supposed to include all of them in the index in order to improve performance?
Below each case # is the result I got when creating the SQLite database in hard-disk. However, for some reason using the ':memory:' mode made all the indexing cases faster than without index.
import sqlite3
import datetime
import pandas as pd
import numpy as np
import os
import time
# Simulate the data
size = 5000000
apps = [f'{i:010}' for i in range(size)]
dates = np.random.choice(pd.date_range('2016-01-01', '2019-01-01').to_pydatetime().tolist(), size)
prod_cd = np.random.choice([f'PROD_{i}' for i in range(30)], size)
models = np.random.choice([f'MODEL{i}' for i in range(15)], size)
categories = np.random.choice([f'GROUP{i}' for i in range(10)], size)
# create a db in memory
conn = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
c = conn.cursor()
# Create table and insert data
c.execute("DROP TABLE IF EXISTS experiment")
c.execute("CREATE TABLE experiment (appId TEXT, dtenter TIMESTAMP, prod_cd TEXT, model TEXT, category TEXT)")
c.executemany("INSERT INTO experiment VALUES (?, ?, ?, ?, ?)", zip(apps, dates, prod_cd, models, categories))
# helper functions
def time_it(func):
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
print("time for {} function is {}".format(func.__name__, time.time() - start))
return result
return wrapper
#time_it
def read_db(query):
df = pd.read_sql_query(query, conn)
return df
#time_it
def run_query(query):
output = c.execute(query).fetchall()
print(output)
# The main query
query = "SELECT * FROM experiment WHERE prod_cd IN ('PROD_1', 'PROD_5', 'PROD_10') AND dtenter >= '2018-01-01'"
# CASE #1: WITHOUT ANY INDEX
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 2.4783718585968018
# CASE #2: WITH INDEX FOR COLUMNS IN WHERE STATEMENT
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 3.221407890319824
# CASE #3: WITH INDEX FOR MORE THEN WHAT IN WHERE STATEMENT, BUT NOT ALL COLUMNS
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId, category)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>>time for read_db function is 3.176532745361328
# CASE #4: WITH INDEX FOR ALL COLUMNS
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId, category, model)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 0.8257918357849121
The SQLite Query Optimizer Overview says:
When doing an indexed lookup of a row, the usual procedure is to do a binary search on the index to find the index entry, then extract the rowid from the index and use that rowid to do a binary search on the original table. Thus a typical indexed lookup involves two binary searches.
Index entries are not in the same order as the table entries, so if a query returns data from most of the table's pages, all those random-access lookups are slower than just scanning all table rows.
Index lookups are more efficient than a table scan only if your WHERE condition filters out much more rows than are returned.
SQLite assumes that lookups on indexed columns have a high selectivity. You can get better estimates by running ANALYZE after filling the table.
But if all your queries are in a form where an index does not help, it wold be a better idea to not use an index at all.
When you create an index over all columns used in the query, the additional table accesses are no longer necessary:
If, however, all columns that were to be fetched from the table are already available in the index itself, SQLite will use the values contained in the index and will never look up the original table row. This saves one binary search for each row and can make many queries run twice as fast.
When an index contains all of the data needed for a query and when the original table never needs to be consulted, we call that index a "covering index".

Fast loading and querying data in Python

I am doing some data analysis in Python. I have ~15k financial products identified by ISIN code and ~15 columns of daily data for each of them. I would like to easily and quickly access the data given an ISIN code.
The data is in a MySQL DB. On the Python side so far I have been working with Pandas DataFrame.
First thing I did was to use pd.read_sql to load the DF directly from the database. However, this is relatively slow. Then I tried loading the full database in a single DF and serializing it to a pickle file. The loading of the pickle file is fast, a few seconds. However, when querying for an individual product, the perfomance is the same as if I am querying the SQL DB. Here is some code:
import pandas as pd
from sqlalchemy import create_engine, engine
from src.Database import Database
import time
import src.bonds.database.BondDynamicDataETL as BondsETL
database_instance = Database(Database.get_db_instance_risk_analytics_prod())
engine = create_engine(
"mysql+pymysql://"
+ database_instance.get_db_user()
+ ":"
+ database_instance.get_db_pass()
+ "#"
+ database_instance.get_db_host()
+ "/"
+ database_instance.get_db_name()
)
con = engine.connect()
class DataBase:
def __init__(self):
print("made a DatBase instance")
def get_individual_bond_dynamic_data(self, isin):
return self.get_individual_bond_dynamic_data_from_db(isin, con)
#staticmethod
def get_individual_bond_dynamic_data_from_db(isin, connection):
df = pd.read_sql(
"SELECT * FROM BondDynamicDataClean WHERE isin = '"
+ isin
+ "' ORDER BY date ASC",
con=connection,
)
df.set_index("date", inplace=True)
return df
class PickleFile:
def __init__(self):
print("made a PickleFile instance")
df = pd.read_pickle("bonds_pickle.pickle")
# df.set_index(['isin', 'date'], inplace=True)
self.data = df
print("loaded file")
def get_individual_bond_dynamic_data(self, isin):
result = self.data.query("isin == '#isin'")
return result
fromPickle = PickleFile()
fromDB = DataBase()
isins = BondsETL.get_all_isins_with_dynamic_data_from_db(
connection=con,
table_name=database_instance.get_bonds_dynamic_data_clean_table_name(),
)
isins = isins[0:50]
start_pickle = time.time()
for i, isin in enumerate(isins):
x = fromPickle.get_individual_bond_dynamic_data(isin)
print("pickle: " + str(i))
stop_pickle = time.time()
for i, isin in enumerate(isins):
x = fromDB.get_individual_bond_dynamic_data(isin)
print("db: " + str(i))
stop_db = time.time()
pickle_t = stop_pickle - start_pickle
db_t = stop_db - stop_pickle
print("pickle: " + str(pickle_t))
print("db: " + str(db_t))
print("ratio: " + str(pickle_t / db_t))
This results in:
pickle: 7.636280059814453
db: 6.167926073074341
ratio: 1.23806283819615
Also, curiously enough setting the index on the DF (uncommenting the line in the constructor) slows down everything!
I thought of trying https://www.pytables.org/index.html as an alternative to Pandas. Any other ideas or comments?
Greetings,
Georgi
So, collating some thoughts from the comments:
Use mysqlclient instead of PyMySQL if you want more speed on the SQL side of the fence.
Ensure the columns you're querying on are indexed in your SQL table (isin for querying and date for ordering).
You can set index_col="date" directly in read_sql() according to the docs; it might be faster.
I'm no Pandas expert, but I think self.data[self.data.isin == isin] would be more performant than self.data.query("isin == '#isin'").
If you don't need to query things cross-isin and want to use pickles, you could store the data for each isin in a separate pickle file.
Also, for the sake of Lil Bobby Tables, the patron saint of SQL injection attacks, use parameters in your SQL statements instead of concatenating strings.
It helped a lot to transform the large data frame into a dictionary {isin -> DF} of smaller data frames indexed by ISIN code. Data retrieval is much more efficient from a dictionary compared to from a DF. Also, it is very natural to be able to request a single DF given an ISIN code. Hope this helps someone else.

How to select data from SQL Server based on data available in pandas data frame?

I have a list of data in one of the pandas dataframe column for which I want to query SQL Server database. Is there any way I can query a SQL Server DB based on data I have in pandas dataframe.
select * from table_name where customerid in pd.dataframe.customerid
In SAP, there is something called "For all entries in" where the SQL can query the DB based on the data available in the array, I was trying to find something similar.
Thanks.
If you are working with tiny DataFrame, then the easiest way would be to generate a corresponding SQL:
In [8]: df
Out[8]:
id val
0 1 21
1 3 111
2 5 34
3 12 76
In [9]: q = 'select * from tab where id in ({})'.format(','.join(['?']*len(df['id'])))
In [10]: q
Out[10]: 'select * from tab where id in (?,?,?,?)'
now you can read data from SQL Server:
from sqlalchemy import create_engine
conn = create_engine(...)
new = pd.read_sql(q, conn, params=tuple(df['id']))
NOTE: this approach will not work for bigger DF's as the generated query (and/or list of bind variables) might bee too long either for Pandas to_sql() function or for SQL Server or even for both.
For bigger DFs I would recommend to write your pandas DF to SQL Server table and then use SQL subquery to filter needed data:
df[list_of_columns_to_save].to_sql('tmp_tab_name', conn, index=False)
q = "select * from tab where id in (select id from tmp_tab_name)"
new = pd.read_sql(q, conn, if_exists='replace')
This is a very familiar scenario and one can use the below code to Query SQL using a very large pandas dataframe. The parameter n needs to be manipulated based on your SQL server memory. For me n=25000 worked.
n = 25000 #chunk row size
## Big_frame dataframe divided into smaller chunks of n into a list
list_df = [big_frame[i:i+n] for i in range(0,big_frame.shape[0],n)]
## Create another dataframe with columns names as expected from SQL
big_frame_2 = pd.DataFrame(columns=[<Mention all column names from SQL>])
## Print total no. of iterations
print("Total Iterations:", len(list_df))
for i in range(0,len(list_df)):
print("Iteration :",i)
temp_frame = list_df[i]
testList = temp_frame['customer_no']
## Pass smaller chunk of data to SQL(here I am passing a list of customers)
temp_DF = SQL_Query(tuple(testList))
print(temp_DF.shape[0])
## Append all the data retrieved from SQL to big_frame_2
big_frame_2=big_frame_2.append(temp_DF, ignore_index=True)

Categories