Difference between cursor.fetchall() and pandas Dataframe - python

I know there are a few ways to retrieve data from RDB table.
One with pandas as read_sql, the other with cursore.fetchall().
What are the main differences between both ways in terms of:
memory usage - is df less reccomended?
performance - selecting data from a table (e.g. large set of data)
performace - inserting data with a loop for cursor vs df.to_sql.
Thanks!

That's an interesting question.
For a ~10GB SQLite database, I get the following results for your second question.
pandas.sql_query seems comparable to speed with the cursor.fetchall.
The rest I leave as an exercise. :D
import sqlite3
import time
import pandas as pd
def db_operation(db_name):
connection = sqlite3.connect(db_name)
c = connection.cursor()
yield c
connection.commit()
connection.close()
start = time.perf_counter()
for i in range(0, 10):
with db_operation('components.db') as c:
c.execute('''SELECT * FROM hashes WHERE (weight > 50 AND weight < 100)''')
fetchall = c.fetchall()
time.perf_counter() - start
2.967 # fractional seconds
start = time.perf_counter()
for i in range(0, 10):
connection = sqlite3.connect('components.db')
sql_query = pd.read_sql_query('''SELECT * FROM hashes WHERE (weight > 50 AND weight < 100)''', con = connection)
connection.commit()
connection.close()
time.perf_counter() - start
2.983 # fractional seconds
The difference is that cursor.fetchall() is a bit more spartan (=plain).
pandas.read_sql_query returns a <class 'pandas.core.frame.DataFrame'> and so you can use all the methods of pandas.DataFrame, like pandas.DataFrame.to_latex, pandas.DataFrame.to_csv pandas.DataFrame.to_excel, etc. (documentation link)
One can accomplish the same exact goals with cursor.fetchall, but needs to press some or a lot extra keys.

Related

python Multiprocessing with ibm_db2?

I'm reading around 15 Million data from db2 LUW using python ibm_db2 module. I'm reading the data for every one million once and again reading this data in chunks to avoid memory issues. The problem here is to complete one loop of reading one million data its taking 4 Mins . How can i use multiprocessing to avoid delay . Blow is my code.
start =0
count = 15000000
check_point = 1000000
chunk_size = 100000
connection = get_db_conn_cur(secrets_db2)
for i in range(start, count, check_point):
query_str = "SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY a.timestamp) row_num, * from table a) where row_num between " + str( i + 1) + " and " + str(i + check_point) + ""
number_of_batches = check_point // chunk_size
last_chunk = check_point - (number_of_batches * chunk_size)
counter = 0
cur = connection.cursor()
cur.execute(query_str) s
chunk_size_l = chunk_size
while True:
counter = counter + 1
columns = [desc[0] for desc in cur.description]
print('counter', counter)
if counter > number_of_batches:
chunk_size_l = last_chunk
results = cur.fetchmany(chunk_size_l)
if not results:
break
df = pd.DataFrame(results)
#further processing
The problem here is not multiprocessing. Is your approach to read data. As far as I can see, you're using ROW_NUMBER() just to number the rows and then fetch them 1 million rows per loop.
As you're not using any WHERE condition on table a this will result in a FULL TABLE SCAN every loop you're running. You are wasting too much CPU and I/O on Db2 server just to have a row number, and that's why it's taking 4 minutes or more each loop.
Also, this is far from a efficient way to fetch data. You can have duplicate reads or phantom reads as long as data changes during your program. But that is subject for another thread.
You should be using one of the 4 fetch methods described here reading row by row from a SQL CURSOR. That way you'll be using a very small amount of RAM and can read data efficiently:
sql = "SELECT * FROM a ORDER BY a.timestamp"
stmt = ibm_db.exec_immediate(conn, sql)
dictionary = ibm_db.fetch_both(stmt)
while dictionary != False:
dictionary = ibm_db.fetch_both(stmt)

Load data from snowflake to pandas dataframe (python) in batches

I am having trouble loading 4.6M rows (11 vars) from snowflake to python. I generally use R, and it handles the data with no problem ... but I am struggling with Python (which I have rarely used, but need to on this occasion).
Attempts so far:
Use new python connector - obtained error message (as documented here: Snowflake Python Pandas Connector - Unknown error using fetch_pandas_all)
Amend my previous code to work in batches .. - this is what I am hoping for help with here.
The example code on the snowflake webpage https://docs.snowflake.com/en/user-guide/python-connector-pandas.html gets me almost there, but doesn't show how to concatenate the data from the multiple fetches efficiently - no doubt because those familiar with python already would know this.
This is where I am at:
import snowflake.connector
import pandas as pd
from itertools import chain
SNOWFLAKE_DATA_SOURCE = '<DB>.<Schema>.<VIEW>'
query = '''
select *
from table(%s)
;
'''
def create_snowflake_connection():
conn = snowflake.connector.connect(
user='MYUSERNAME',
account='MYACCOUNT',
authenticator = 'externalbrowser',
warehouse='<WH>',
database='<DB>',
role='<ROLE>',
schema='<SCHEMA>'
)
return conn
def fetch_pandas_concat_df(cur):
rows = 0
grow = []
while True:
dat = cur.fetchmany(50000)
if not dat:
break
colstring = ','.join([col[0] for col in cur.description])
df = pd.DataFrame(dat, columns =colstring.split(","))
grow.append(df)
rows += df.shape[0]
print(rows)
return pd.concat(grow)
def fetch_pandas_concat_list(cur):
rows = 0
grow = []
while True:
dat = cur.fetchmany(50000)
if not dat:
break
grow.append(dat)
colstring = ','.join([col[0] for col in cur.description])
rows += len(dat)
print(rows)
# note that grow is a list of list of tuples(?) [[(),()]]
return pd.DataFrame(list(chain(*grow)), columns = colstring.split(","))
cur = con.cursor()
cur.execute(query, (SNOWFLAKE_DATA_SOURCE))
df1 = fetch_pandas_concat_df(cur) # this takes forever to concatenate the dataframes - I had to stop it
df3 = fetch_pandas_concat_list(cur) # this is also taking forever.. at least an hour so far .. R is done in < 10 minutes....
df3.head()
df3.shape
cur.close()
The string manipulation you're doing is extremely expensive computationally. Besides, why would you want to combine everything to a single string, just to them break it back out?
Take a look at this section of the snowflake documentation. Essentially, you can go straight from the cursor object to the dataframe which should speed things up immensely.

Pandas Join DataTable to SQL Table to Prevent Memory Errors

So I have about 4-5 million rows of data per table. I have about 10-15 of these tables. I created a table that will join 30,000 rows to some of these million rows based on some ID and snapshot date.
Is there a way to write my existing data table to a SQL query where it will filter the results down for me so that I do not have to load the entire tables into memory?
At the moment I've been loading each table in one at a time, and then releasing the memory. However, it still takes up 100% memory on my computer.
for table in tablesToJoin:
if df is not None:
print("DF LENGTH", len(df))
query = """SET NOCOUNT ON; SELECT * FROM """ + table + """ (nolock) where snapshotdate = '"""+ date +"""'"""
query += """ SET NOCOUNT OFF;"""
start = time.time()
loadedDf = pd.read_sql_query(query, conn)
if df is None:
df = loadedDf
else:
loadedDf.info(verbose=True, null_counts=True)
df.info(verbose=True, null_counts=True)
df = df.merge(loadedDf, how='left', on=["MemberID", "SnapshotDate"])
#df = df.fillna(0)
print("DATA AFTER ALL MERGING", len(df))
print("Length of data loaded:", len(loadedDf))
print("Time to load data from sql", (time.time() - start))
I once faced the same problem as you are. My solution was to filter as much as possible in the SQL layer. Since I don't have your code and your DB, what I write below is untested code and very possibly contain bugs. You will have to correct them as needed.
The idea is to read as little as possible from the DB. pandas is not designed to analyze frames of millions of rows (at least on a typical computer). To do that, pass the filter criteria from df to your DB call:
from sqlalchemy import MetaData, and_, or_
engine = ... # construct your SQL Alchemy engine. May correspond to your `conn` object
meta = MetaData()
meta.reflect(bind=engine, only=tablesToJoin)
for table in tablesToJoin:
t = meta[table]
# Building the WHERE clause. This is equivalent to:
# WHERE ((MemberID = <MemberID 1>) AND (SnapshotDate = date))
# OR ((MemberID = <MemberID 2>) AND (SnapshotDate = date))
# OR ((MemberID = <MemberID 3>) AND (SnapshotDate = date))
cond = _or(**[and_(t.c['MemberID'] == member_id, t.c['SnapshotDate'] == date) for member_id in df['MemberID'] ])
# Be frugal here: only get the columns that you need, or you will blow your memory
# If you specify None, it's equivalent to a `SELECT *`
statement = t.select(None).where(cond)
# Note that it's `read_sql`, not `read_sql_query` here
loadedDf = pd.read_sql(statement, engine)
# loadedDf should be much smaller now since you have already filtered it at the DB level
# Now do your joins...

Fastest way to retrieve data from SQLite database

I have the following code:
import sqlite3
import numpy as np
conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute("SELECT timestamp FROM stockData");
times = np.array(c.fetchall())
times = times.reshape(len(times))
buys = []
sells = []
def price(time):
c.execute("SELECT aapl FROM stockData WHERE timestamp = :t", {'t':time});
return c.fetchone()[0]
def trader(time):
p = float(price(time))
if(p < 186):
buys.append(p)
if(p > 186.5):
sells.append(p)
vfunc = np.vectorize(trader)
vfunc(times[:1000])
This takes about a minute to execute, the reason I see for this is because I am inefficiently calling specific data points from my SQLite DB. I realize I could get around this by calling all relevant data points with something like the following:
SEARCH aapl FROM stockData WHERE aapl < 186;
But I am dead set on having my code loop through all timestamps one by one, as I want to maintain proper chronology.
My database is about 5 million rows long, thus the current method is not feasible. How can I most efficiently loop through and retrieve this data?

How to create a large pandas dataframe from an sql query without running out of memory?

I have trouble querying a table of > 5 million records from MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting to much data into memory.
This works:
import pandas.io.sql as psql
sql = "SELECT TOP 1000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
...but this does not work:
sql = "SELECT TOP 2000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
It returns this error:
File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples
(pandas\lib.c:42733) Memory Error
I have read here that a similar problem exists when creating a dataframe from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:
read_csv('exp4326.csv', iterator=True, chunksize=1000)
Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I use some other methods to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work to execute a SELECT * query. Surely there is a simpler approach.
As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql , engine, chunksize=5):
print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd
import pandas.io.sql as psql
chunk_size = 10000
offset = 0
dfs = []
while True:
sql = "SELECT * FROM MyTable limit %d offset %d order by ID" % (chunk_size,offset)
dfs.append(psql.read_frame(sql, cnxn))
offset += chunk_size
if len(dfs[-1]) < chunk_size:
break
full_df = pd.concat(dfs)
It might also be possible that the whole dataframe is simply too large to fit in memory, in that case you will have no other option than to restrict the number of rows or columns you're selecting.
Code solution and remarks.
# Create empty list
dfl = []
# Create empty dataframe
dfs = pd.DataFrame()
# Start Chunking
for chunk in pd.read_sql(query, con=conct, ,chunksize=10000000):
# Start Appending Data Chunks from SQL Result set into List
dfl.append(chunk)
# Start appending data from list to dataframe
dfs = pd.concat(dfl, ignore_index=True)
However, my memory analysis tells me that even though the memory is released after each chunk is extracted, the list is growing bigger and bigger and occupying that memory resulting in a net net no gain on free RAM.
Would love to hear what the author / others have to say.
The best way I found to handle this is to leverage the SQLAlchemy steam_results connection options
conn = engine.connect().execution_options(stream_results=True)
And passing the conn object to pandas in
pd.read_sql("SELECT *...", conn, chunksize=10000)
This will ensure that the cursor is handled server-side rather than client-side
You can use Server Side Cursors (a.k.a. stream results)
import pandas as pd
from sqlalchemy import create_engine
def process_sql_using_pandas():
engine = create_engine(
"postgresql://postgres:pass#localhost/example"
)
conn = engine.connect().execution_options(
stream_results=True)
for chunk_dataframe in pd.read_sql(
"SELECT * FROM users", conn, chunksize=1000):
print(f"Got dataframe w/{len(chunk_dataframe)} rows")
# ... do something with dataframe ...
if __name__ == '__main__':
process_sql_using_pandas()
As mentioned in the comments by others, using the chunksize argument in pd.read_sql("SELECT * FROM users", engine, chunksize=1000) does not solve the problem as it still loads the whole data in the memory and then gives it to you chunk by chunk.
More explanation here
chunksize still loads all the data in memory, stream_results=True is the answer. it is server side cursor that loads the rows in given chunks and save memory.. efficiently using in many pipelines, it may also help when you load history data
stream_conn = engine.connect().execution_options(stream_results=True)
use pd.read_sql with thechunksize
pd.read_sql("SELECT * FROM SOURCE", stream_conn , chunksize=5000)
you can update version airflow.
for example, I had that error in the version 2.2.3 using docker-compose.
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
mysq 6.7
cpus: "0.5"
mem_reservation: "10M"
mem_limit: "750M"
redis:
cpus: "0.5"
mem_reservation: "10M"
mem_limit: "250M"
airflow-webserver:
cpus: "0.5"
mem_reservation: "10M"
mem_limit: "750M"
airflow-scheduler:
cpus: "0.5"
mem_reservation: "10M"
mem_limit: "750M"
airflow-worker:
#cpus: "0.5"
#mem_reservation: "10M"
#mem_limit: "750M"
error: Task exited with return code Negsignal.SIGKILL
but update to the version
FROM apache/airflow:2.3.4.
and perform the pulls without problems, using the same resources configured in the docker-compose
enter image description here
my dag extractor:
function
def getDataForSchema(table,conecction,tmp_path, **kwargs):
conn=connect_sql_server(conecction)
query_count= f"select count(1) from {table['schema']}.{table['table_name']}"
logging.info(f"query: {query_count}")
real_count_rows = pd.read_sql_query(query_count, conn)
##sacar esquema de la tabla
metadataquery=f"SELECT COLUMN_NAME ,DATA_TYPE FROM information_schema.columns \
where table_name = '{table['table_name']}' and table_schema= '{table['schema']}'"
#logging.info(f"query metadata: {metadataquery}")
metadata = pd.read_sql_query(metadataquery, conn)
schema=generate_schema(metadata)
#logging.info(f"schema : {schema}")
#logging.info(f"schema: {schema}")
#consulta la tabla a extraer
query=f" SELECT {table['custom_column_names']} FROM {table['schema']}.{table['table_name']} "
logging.info(f"quere data :{query}")
chunksize=table["partition_field"]
data = pd.read_sql_query(query, conn, chunksize=chunksize)
count_rows=0
pqwriter=None
iteraccion=0
for df_row in data:
print(f"bloque {iteraccion} de total {count_rows} de un total {real_count_rows.iat[0, 0]}")
#logging.info(df_row.to_markdown())
if iteraccion == 0:
parquetName=f"{tmp_path}/{table['table_name']}_{iteraccion}.parquet"
pqwriter = pq.ParquetWriter(parquetName,schema)
tableData = pa.Table.from_pandas(df_row, schema=schema,safe=False, preserve_index=True)
#logging.info(f" tabledata {tableData.column(17)}")
pqwriter.write_table(tableData)
#logging.info(f"parquet name:::{parquetName}")
##pasar a parquet df directo
#df_row.to_parquet(parquetName)
iteraccion=iteraccion+1
count_rows += len(df_row)
del df_row
del tableData
if pqwriter:
print("Cerrando archivo parquet")
pqwriter.close()
del data
del chunksize
del iteraccion
Here is a one-liner. I was able to load in 49m records to the dataframe without running out of memory.
dfs = pd.concat(pd.read_sql(sql, engine, chunksize=500000), ignore_index=True)
Full one-line code using sqlalchemy and with operator:
db_engine = sqlalchemy.create_engine(db_url, pool_size=10, max_overflow=20)
with Session(db_engine) as session:
sql_qry = text("Your query")
data = pd.concat(pd.read_sql(sql_qry,session.connection().execution_options(stream_results=True), chunksize=500000), ignore_index=True)
You can try to change chunksize to find the optimal size for your case.
You can use chunksize option, but need to set it up to 6-7 digit if you have RAM issue.
for chunk in pd.read_sql(sql, engine, params = (fromdt, todt,filecode), chunksize=100000):
df1.append(chunk)
dfs = pd.concat(df1, ignore_index=True)
do this
If you want to limit the number of rows in output, just use:
data = psql.read_frame(sql, cnxn,chunksize=1000000).__next__()

Categories