I am querying a historical database and the result table should have around 144k rows but I am getting the result limited to 100k lines. I have already seem this problem using ODBC connection to the same DB but using a VBA .NET application.
I tested the same query in the DB SQL client and it return the correct number of row. The only thing that look suspect is that this client has a limit to number of rows shown on its interface and the default value is 100k.
import pypyodbc as pyodbc
import pandas as pd
import gc
import time
def GetTrend(tags,server,t_ini,t_end,period):
IP21_connection = 'DRIVER={{AspenTech SQLplus}};HOST={};PORT=10014'
IP21_Query = """select ts,avg from aggregates where name = '{}'
and ts between '{}' and '{}'
and period = 10*{}"""
conn = pyodbc.connect(IP21_connection.format(server))
cur = conn.cursor()
df2=pd.DataFrame()
i=0
for tag in tags:
cur.execute(IP21_Query.format(tag,t_ini,t_end,period))
df = pd.DataFrame(cur.fetchall(),columns=['ts',tag])
df['ts'] = pd.to_datetime(df['ts'])
df1=df.set_index('ts')
df2=pd.concat([df2,df1],axis=1)
i+=1
print('{} - {} of {} tags'
' collected'.format(time.asctime(time.localtime()), i,
len(tags)), flush=True)
gc.collect()
For the period I am querying the DB I would expect 144k rows but I am getting just 100k.
Just ran into the same issue myself and found the solution in the manual which you may have found yourself a long time ago! If not change your Driver settings to include 'MAXROWS = x', eg:
IP21_connection = 'DRIVER={{AspenTech SQLplus}};HOST={};PORT=10014;MAXROWS=1000000'
Related
I am a financial analyst with about two month's experience with Python, and I am working on a project using Python and SQL to automate the compilation of a report. The process involves accessing a changing number of Excel files saved in a share drive, pulling two tabs from each (summary and quote) and combining the datasets into two large "Quote" and "Summary" tables. The next step is to pull various columns from each, combine, calculate, etc.
The problem is that the dataset ends up being 3.4mm rows and around 30 columns. The program I wrote below works, but it took 40 minutes to work through the first part (creating the list of dataframes) and another 4.5 hours to create the database and export the data, not to mention using a LOT of memory.
I know there must be a better way to accomplish this, but I don't have a CS background. Any help would be appreciated.
import os
import pandas as pd
from datetime import datetime
import sqlite3
from sqlalchemy import create_engine
from playsound import playsound
reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'
os.chdir(month_folder)
starttime = datetime.now()
print('Started', starttime)
c = 0
tables = list()
quote_combined = list()
summary_combined = list()
# Step through files in synced Sharepoint directory, select the files with the specific
# name format. For each file, parse the file name and add to 'tables' list, then load
# two specific tabs as pandas dataframes. Add two columns, format column headers, then
# add each dataframe to the list of dataframes.
for xl in os.listdir(month_folder):
if '-Amazon' in xl:
ttime = datetime.now()
table_name = str(xl[11:-5])
tables.append(table_name)
quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
quote_sheet.insert(0,'reportmonth', reportmonth)
summary_sheet.insert(0,'reportmonth', reportmonth)
quote_sheet.insert(0,'source_file', table_name)
summary_sheet.insert(0,'source_file', table_name)
quote_sheet.columns = quote_sheet.columns.str.strip()
quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
summary_sheet.columns = summary_sheet.columns.str.strip()
summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
quote_combined.append(quote_sheet)
summary_combined.append(summary_sheet)
c = c + 1
print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)
# Concatenate the list of dataframes to append one to another.
# Totals about 3.4mm rows for August
totalQuotes = pd.concat(quote_combined)
totalSummary = pd.concat(summary_combined)
# Change directory, create Sqlite database, and send the combined dataframes to database
os.chdir(r'H:\AaronS\Databases')
conn = sqlite3.connect('AMZN-Quote-files_' + reportmonth)
cur = conn.cursor()
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()
sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'
totalQuotes.to_sql(sqlite_table, sqlite_connection, if_exists = 'replace')
totalSummary.to_sql(sqlite_table2, sqlite_connection, if_exists = 'replace')
print('Finished. It took: ', datetime.now() - starttime)
'''
I see a few things you could do. Firstly, since your first step is just to transfer the data to your SQL DB, you don't necessarily need to append all the files to each other. You can just attack the problem one file at a time (which means you can multiprocess!) - then, whatever computations need to be completed, can come later. This will also result in you cutting down your RAM usage since if you have 10 files in your folder, you aren't loading all 10 up at the same time.
I would recommend the following:
Construct an array of filenames that you need to access
Write a wrapper function that can take a filename, open + parse the file, and write the contents to your MySQL DB
Use the Python multiprocessing.Pool class to process them simultaneously. If you run 4 processes, for example, your task becomes 4 times faster! If you need to derive computations from this data and hence need to aggregate it, please do this once the data's in the MySQL DB. This will be way faster.
If you need to define some computations based on the aggregate data, do it now, in the MySQL DB. SQL is an incredibly powerful language, and there's a command out there for practically everything!
I've added in a short code snippet to show you what I'm talking about :)
from multiprocessing import Pool
PROCESSES = 4
FILES = []
def _process_file(filename):
print("Processing: "+filename)
pool = Pool(PROCESSES)
pool.map(_process_file, FILES)
SQL clarification: You don't need an independent table for every file you move to SQL! You can create a table based on a given schema, and then add the data from ALL your files to that one table, row by row. This is essentially what the function you use to go from DataFrame to table does, but it creates 10 different tables. You can look at some examples on inserting a row into a table here.However, in the specific use case that you have, setting the if_exists parameter to "append" should work, as you've mentioned in your comment. I just added the earlier references in because you mentioned that you're fairly new to Python, and a lot of my friends in the finance industry have found gaining a slightly more nuanced understanding of SQL to be extremely useful.
Try this, Here most of the time is taken is during Loading Data from excel to Dataframe. I am not sure following script will reduce the time to within seconds but It will reduce the RAM baggage, which in turn could speed up the process. It will potentially reduce the time by at least 5-10 minutes. Since I have no access to data I cannot be sure. But you should try this
import os
import pandas as pd
from datetime import datetime
import sqlite3
from sqlalchemy import create_engine
from playsound import playsound
os.chdir(r'H:\AaronS\Databases')
conn = sqlite3.connect('AMZN-Quote-files_' + reportmonth)
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()
sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'
reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'
os.chdir(month_folder)
starttime = datetime.now()
print('Started', starttime)
c = 0
tables = list()
for xl in os.listdir(month_folder):
if '-Amazon' in xl:
ttime = datetime.now()
table_name = str(xl[11:-5])
tables.append(table_name)
quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
quote_sheet.insert(0,'reportmonth', reportmonth)
summary_sheet.insert(0,'reportmonth', reportmonth)
quote_sheet.insert(0,'source_file', table_name)
summary_sheet.insert(0,'source_file', table_name)
quote_sheet.columns = quote_sheet.columns.str.strip()
quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
summary_sheet.columns = summary_sheet.columns.str.strip()
summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
quote_sheet.to_sql(sqlite_table, sqlite_connection, if_exists = 'append')
summary_sheet.to_sql(sqlite_table2, sqlite_connection, if_exists = 'append')
c = c + 1
print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)
I am having trouble loading 4.6M rows (11 vars) from snowflake to python. I generally use R, and it handles the data with no problem ... but I am struggling with Python (which I have rarely used, but need to on this occasion).
Attempts so far:
Use new python connector - obtained error message (as documented here: Snowflake Python Pandas Connector - Unknown error using fetch_pandas_all)
Amend my previous code to work in batches .. - this is what I am hoping for help with here.
The example code on the snowflake webpage https://docs.snowflake.com/en/user-guide/python-connector-pandas.html gets me almost there, but doesn't show how to concatenate the data from the multiple fetches efficiently - no doubt because those familiar with python already would know this.
This is where I am at:
import snowflake.connector
import pandas as pd
from itertools import chain
SNOWFLAKE_DATA_SOURCE = '<DB>.<Schema>.<VIEW>'
query = '''
select *
from table(%s)
;
'''
def create_snowflake_connection():
conn = snowflake.connector.connect(
user='MYUSERNAME',
account='MYACCOUNT',
authenticator = 'externalbrowser',
warehouse='<WH>',
database='<DB>',
role='<ROLE>',
schema='<SCHEMA>'
)
return conn
def fetch_pandas_concat_df(cur):
rows = 0
grow = []
while True:
dat = cur.fetchmany(50000)
if not dat:
break
colstring = ','.join([col[0] for col in cur.description])
df = pd.DataFrame(dat, columns =colstring.split(","))
grow.append(df)
rows += df.shape[0]
print(rows)
return pd.concat(grow)
def fetch_pandas_concat_list(cur):
rows = 0
grow = []
while True:
dat = cur.fetchmany(50000)
if not dat:
break
grow.append(dat)
colstring = ','.join([col[0] for col in cur.description])
rows += len(dat)
print(rows)
# note that grow is a list of list of tuples(?) [[(),()]]
return pd.DataFrame(list(chain(*grow)), columns = colstring.split(","))
cur = con.cursor()
cur.execute(query, (SNOWFLAKE_DATA_SOURCE))
df1 = fetch_pandas_concat_df(cur) # this takes forever to concatenate the dataframes - I had to stop it
df3 = fetch_pandas_concat_list(cur) # this is also taking forever.. at least an hour so far .. R is done in < 10 minutes....
df3.head()
df3.shape
cur.close()
The string manipulation you're doing is extremely expensive computationally. Besides, why would you want to combine everything to a single string, just to them break it back out?
Take a look at this section of the snowflake documentation. Essentially, you can go straight from the cursor object to the dataframe which should speed things up immensely.
I am trying to import a table that contains 81462 rows in a dataframe using the following code:
sql_conn = pyodbc.connect('DRIVER={SQL Server}; SERVER=server.database.windows.net; DATABASE=server_dev; uid=user; pwd=pw')
query = "select * from product inner join brand on Product.BrandId = Brand.BrandId"
df = pd.read_sql(query, sql_conn)
And the whole process takes a very long time. I think that I am already 30-minutes in and it's still processing. I'd assume that this is not quite normal - so how else should I import it so the processing time is quicker?
Thanks to #RomanPerekhrest. FETCH NEXT imported everything within 1-2 minutes.
SELECT product.Name, brand.Name as BrandName, description, size FROM Product inner join brand on product.brandid=brand.brandid ORDER BY Name OFFSET 1 ROWS FETCH NEXT 80000 ROWS ONLY
I have the following code:
import sqlite3
import numpy as np
conn = sqlite3.connect('test.db')
c = conn.cursor()
c.execute("SELECT timestamp FROM stockData");
times = np.array(c.fetchall())
times = times.reshape(len(times))
buys = []
sells = []
def price(time):
c.execute("SELECT aapl FROM stockData WHERE timestamp = :t", {'t':time});
return c.fetchone()[0]
def trader(time):
p = float(price(time))
if(p < 186):
buys.append(p)
if(p > 186.5):
sells.append(p)
vfunc = np.vectorize(trader)
vfunc(times[:1000])
This takes about a minute to execute, the reason I see for this is because I am inefficiently calling specific data points from my SQLite DB. I realize I could get around this by calling all relevant data points with something like the following:
SEARCH aapl FROM stockData WHERE aapl < 186;
But I am dead set on having my code loop through all timestamps one by one, as I want to maintain proper chronology.
My database is about 5 million rows long, thus the current method is not feasible. How can I most efficiently loop through and retrieve this data?
I have trouble querying a table of > 5 million records from MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting to much data into memory.
This works:
import pandas.io.sql as psql
sql = "SELECT TOP 1000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
...but this does not work:
sql = "SELECT TOP 2000000 * FROM MyTable"
data = psql.read_frame(sql, cnxn)
It returns this error:
File "inference.pyx", line 931, in pandas.lib.to_object_array_tuples
(pandas\lib.c:42733) Memory Error
I have read here that a similar problem exists when creating a dataframe from a csv file, and that the work-around is to use the 'iterator' and 'chunksize' parameters like this:
read_csv('exp4326.csv', iterator=True, chunksize=1000)
Is there a similar solution for querying from an SQL database? If not, what is the preferred work-around? Should I use some other methods to read the records in chunks? I read a bit of discussion here about working with large datasets in pandas, but it seems like a lot of work to execute a SELECT * query. Surely there is a simpler approach.
As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql , engine, chunksize=5):
print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd
import pandas.io.sql as psql
chunk_size = 10000
offset = 0
dfs = []
while True:
sql = "SELECT * FROM MyTable limit %d offset %d order by ID" % (chunk_size,offset)
dfs.append(psql.read_frame(sql, cnxn))
offset += chunk_size
if len(dfs[-1]) < chunk_size:
break
full_df = pd.concat(dfs)
It might also be possible that the whole dataframe is simply too large to fit in memory, in that case you will have no other option than to restrict the number of rows or columns you're selecting.
Code solution and remarks.
# Create empty list
dfl = []
# Create empty dataframe
dfs = pd.DataFrame()
# Start Chunking
for chunk in pd.read_sql(query, con=conct, ,chunksize=10000000):
# Start Appending Data Chunks from SQL Result set into List
dfl.append(chunk)
# Start appending data from list to dataframe
dfs = pd.concat(dfl, ignore_index=True)
However, my memory analysis tells me that even though the memory is released after each chunk is extracted, the list is growing bigger and bigger and occupying that memory resulting in a net net no gain on free RAM.
Would love to hear what the author / others have to say.
The best way I found to handle this is to leverage the SQLAlchemy steam_results connection options
conn = engine.connect().execution_options(stream_results=True)
And passing the conn object to pandas in
pd.read_sql("SELECT *...", conn, chunksize=10000)
This will ensure that the cursor is handled server-side rather than client-side
You can use Server Side Cursors (a.k.a. stream results)
import pandas as pd
from sqlalchemy import create_engine
def process_sql_using_pandas():
engine = create_engine(
"postgresql://postgres:pass#localhost/example"
)
conn = engine.connect().execution_options(
stream_results=True)
for chunk_dataframe in pd.read_sql(
"SELECT * FROM users", conn, chunksize=1000):
print(f"Got dataframe w/{len(chunk_dataframe)} rows")
# ... do something with dataframe ...
if __name__ == '__main__':
process_sql_using_pandas()
As mentioned in the comments by others, using the chunksize argument in pd.read_sql("SELECT * FROM users", engine, chunksize=1000) does not solve the problem as it still loads the whole data in the memory and then gives it to you chunk by chunk.
More explanation here
chunksize still loads all the data in memory, stream_results=True is the answer. it is server side cursor that loads the rows in given chunks and save memory.. efficiently using in many pipelines, it may also help when you load history data
stream_conn = engine.connect().execution_options(stream_results=True)
use pd.read_sql with thechunksize
pd.read_sql("SELECT * FROM SOURCE", stream_conn , chunksize=5000)
you can update version airflow.
for example, I had that error in the version 2.2.3 using docker-compose.
AIRFLOW__CORE__EXECUTOR=CeleryExecutor
mysq 6.7
cpus: "0.5"
mem_reservation: "10M"
mem_limit: "750M"
redis:
cpus: "0.5"
mem_reservation: "10M"
mem_limit: "250M"
airflow-webserver:
cpus: "0.5"
mem_reservation: "10M"
mem_limit: "750M"
airflow-scheduler:
cpus: "0.5"
mem_reservation: "10M"
mem_limit: "750M"
airflow-worker:
#cpus: "0.5"
#mem_reservation: "10M"
#mem_limit: "750M"
error: Task exited with return code Negsignal.SIGKILL
but update to the version
FROM apache/airflow:2.3.4.
and perform the pulls without problems, using the same resources configured in the docker-compose
enter image description here
my dag extractor:
function
def getDataForSchema(table,conecction,tmp_path, **kwargs):
conn=connect_sql_server(conecction)
query_count= f"select count(1) from {table['schema']}.{table['table_name']}"
logging.info(f"query: {query_count}")
real_count_rows = pd.read_sql_query(query_count, conn)
##sacar esquema de la tabla
metadataquery=f"SELECT COLUMN_NAME ,DATA_TYPE FROM information_schema.columns \
where table_name = '{table['table_name']}' and table_schema= '{table['schema']}'"
#logging.info(f"query metadata: {metadataquery}")
metadata = pd.read_sql_query(metadataquery, conn)
schema=generate_schema(metadata)
#logging.info(f"schema : {schema}")
#logging.info(f"schema: {schema}")
#consulta la tabla a extraer
query=f" SELECT {table['custom_column_names']} FROM {table['schema']}.{table['table_name']} "
logging.info(f"quere data :{query}")
chunksize=table["partition_field"]
data = pd.read_sql_query(query, conn, chunksize=chunksize)
count_rows=0
pqwriter=None
iteraccion=0
for df_row in data:
print(f"bloque {iteraccion} de total {count_rows} de un total {real_count_rows.iat[0, 0]}")
#logging.info(df_row.to_markdown())
if iteraccion == 0:
parquetName=f"{tmp_path}/{table['table_name']}_{iteraccion}.parquet"
pqwriter = pq.ParquetWriter(parquetName,schema)
tableData = pa.Table.from_pandas(df_row, schema=schema,safe=False, preserve_index=True)
#logging.info(f" tabledata {tableData.column(17)}")
pqwriter.write_table(tableData)
#logging.info(f"parquet name:::{parquetName}")
##pasar a parquet df directo
#df_row.to_parquet(parquetName)
iteraccion=iteraccion+1
count_rows += len(df_row)
del df_row
del tableData
if pqwriter:
print("Cerrando archivo parquet")
pqwriter.close()
del data
del chunksize
del iteraccion
Here is a one-liner. I was able to load in 49m records to the dataframe without running out of memory.
dfs = pd.concat(pd.read_sql(sql, engine, chunksize=500000), ignore_index=True)
Full one-line code using sqlalchemy and with operator:
db_engine = sqlalchemy.create_engine(db_url, pool_size=10, max_overflow=20)
with Session(db_engine) as session:
sql_qry = text("Your query")
data = pd.concat(pd.read_sql(sql_qry,session.connection().execution_options(stream_results=True), chunksize=500000), ignore_index=True)
You can try to change chunksize to find the optimal size for your case.
You can use chunksize option, but need to set it up to 6-7 digit if you have RAM issue.
for chunk in pd.read_sql(sql, engine, params = (fromdt, todt,filecode), chunksize=100000):
df1.append(chunk)
dfs = pd.concat(df1, ignore_index=True)
do this
If you want to limit the number of rows in output, just use:
data = psql.read_frame(sql, cnxn,chunksize=1000000).__next__()