I am having trouble loading 4.6M rows (11 variables) from Snowflake into Python. I generally use R, which handles the data with no problem, but I am struggling with Python (which I have rarely used, but need to on this occasion).
Attempts so far:
Use the new Python connector - this produced an error message (as documented here: Snowflake Python Pandas Connector - Unknown error using fetch_pandas_all).
Amend my previous code to work in batches - this is what I am hoping for help with here.
The example code on the Snowflake documentation page https://docs.snowflake.com/en/user-guide/python-connector-pandas.html gets me almost there, but doesn't show how to concatenate the data from the multiple fetches efficiently - no doubt because those already familiar with Python would know this.
This is where I am at:
import snowflake.connector
import pandas as pd
from itertools import chain

SNOWFLAKE_DATA_SOURCE = '<DB>.<Schema>.<VIEW>'

query = '''
select *
from table(%s)
;
'''

def create_snowflake_connection():
    conn = snowflake.connector.connect(
        user='MYUSERNAME',
        account='MYACCOUNT',
        authenticator='externalbrowser',
        warehouse='<WH>',
        database='<DB>',
        role='<ROLE>',
        schema='<SCHEMA>'
    )
    return conn

def fetch_pandas_concat_df(cur):
    rows = 0
    grow = []
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break
        colstring = ','.join([col[0] for col in cur.description])
        df = pd.DataFrame(dat, columns=colstring.split(","))
        grow.append(df)
        rows += df.shape[0]
        print(rows)
    return pd.concat(grow)

def fetch_pandas_concat_list(cur):
    rows = 0
    grow = []
    while True:
        dat = cur.fetchmany(50000)
        if not dat:
            break
        grow.append(dat)
        colstring = ','.join([col[0] for col in cur.description])
        rows += len(dat)
        print(rows)
    # note that grow is a list of lists of tuples: [[(), ()], ...]
    return pd.DataFrame(list(chain(*grow)), columns=colstring.split(","))

conn = create_snowflake_connection()
cur = conn.cursor()
cur.execute(query, (SNOWFLAKE_DATA_SOURCE,))

df1 = fetch_pandas_concat_df(cur)    # this takes forever to concatenate the dataframes - I had to stop it
df3 = fetch_pandas_concat_list(cur)  # this is also taking forever.. at least an hour so far .. R is done in < 10 minutes....

df3.head()
df3.shape
cur.close()
The string manipulation you're doing is computationally very expensive. Besides, why would you want to combine everything into a single string, just to then break it back apart?
Take a look at this section of the Snowflake documentation. Essentially, you can go straight from the cursor object to a dataframe, which should speed things up immensely.
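For reference, a minimal sketch of that approach, assuming the connector's pandas extra is installed (pip install "snowflake-connector-python[pandas]") and reusing the connection helper and query names from the question:
# Sketch only: fetch directly into pandas, reusing create_snowflake_connection()
# and query from the question above.
conn = create_snowflake_connection()
cur = conn.cursor()

# Option 1: everything in one call
cur.execute(query, (SNOWFLAKE_DATA_SOURCE,))
df = cur.fetch_pandas_all()

# Option 2: iterate over batches of DataFrames and concatenate once at the end
# (a single concat is far cheaper than growing a DataFrame inside a loop)
cur.execute(query, (SNOWFLAKE_DATA_SOURCE,))
df = pd.concat(cur.fetch_pandas_batches(), ignore_index=True)

cur.close()
conn.close()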
Related
I know there are a few ways to retrieve data from an RDB table.
One is with pandas.read_sql, the other with cursor.fetchall().
What are the main differences between the two in terms of:
memory usage - is the DataFrame approach less recommended?
performance - selecting data from a table (e.g. a large set of data)
performance - inserting data with a cursor loop vs df.to_sql.
Thanks!
That's an interesting question.
For a ~10GB SQLite database, I get the following results for your second question.
pandas.read_sql_query seems comparable in speed to cursor.fetchall.
The rest I leave as an exercise. :D
import sqlite3
import time
from contextlib import contextmanager

import pandas as pd

@contextmanager
def db_operation(db_name):
    connection = sqlite3.connect(db_name)
    c = connection.cursor()
    yield c
    connection.commit()
    connection.close()

start = time.perf_counter()
for i in range(0, 10):
    with db_operation('components.db') as c:
        c.execute('''SELECT * FROM hashes WHERE (weight > 50 AND weight < 100)''')
        fetchall = c.fetchall()
time.perf_counter() - start
2.967  # fractional seconds

start = time.perf_counter()
for i in range(0, 10):
    connection = sqlite3.connect('components.db')
    sql_query = pd.read_sql_query('''SELECT * FROM hashes WHERE (weight > 50 AND weight < 100)''', con=connection)
    connection.commit()
    connection.close()
time.perf_counter() - start
2.983  # fractional seconds
The difference is that cursor.fetchall() is a bit more spartan (i.e. plain): it gives you back a list of tuples and nothing more.
pandas.read_sql_query returns a <class 'pandas.core.frame.DataFrame'>, so you can use all the methods of pandas.DataFrame, like pandas.DataFrame.to_latex, pandas.DataFrame.to_csv, pandas.DataFrame.to_excel, etc. (documentation link)
You can accomplish exactly the same goals with cursor.fetchall, but it takes a bit more typing.
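To illustrate the extra typing, here is a small sketch reusing the db_operation helper and query from above: the fetchall route needs the column names pulled from cursor.description by hand before you can build the same DataFrame.
# Sketch: rebuild the DataFrame by hand from a raw fetchall() result.
with db_operation('components.db') as c:
    c.execute('''SELECT * FROM hashes WHERE (weight > 50 AND weight < 100)''')
    rows = c.fetchall()
    # cursor.description is a sequence of 7-tuples; the first element is the column name
    columns = [d[0] for d in c.description]

df = pd.DataFrame(rows, columns=columns)
df.to_csv('hashes_50_100.csv', index=False)  # now all DataFrame methods are available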
I am a financial analyst with about two months' experience with Python, and I am working on a project using Python and SQL to automate the compilation of a report. The process involves accessing a changing number of Excel files saved on a shared drive, pulling two tabs from each (summary and quote), and combining the datasets into two large "Quote" and "Summary" tables. The next step is to pull various columns from each, combine, calculate, etc.
The problem is that the dataset ends up being 3.4 million rows and around 30 columns. The program I wrote below works, but it took 40 minutes to work through the first part (creating the list of dataframes) and another 4.5 hours to create the database and export the data, not to mention using a LOT of memory.
I know there must be a better way to accomplish this, but I don't have a CS background. Any help would be appreciated.
import os
import pandas as pd
from datetime import datetime
import sqlite3
from sqlalchemy import create_engine
from playsound import playsound

reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'
os.chdir(month_folder)

starttime = datetime.now()
print('Started', starttime)

c = 0
tables = list()
quote_combined = list()
summary_combined = list()

# Step through files in synced Sharepoint directory, select the files with the specific
# name format. For each file, parse the file name and add to 'tables' list, then load
# two specific tabs as pandas dataframes. Add two columns, format column headers, then
# add each dataframe to the list of dataframes.
for xl in os.listdir(month_folder):
    if '-Amazon' in xl:
        ttime = datetime.now()
        table_name = str(xl[11:-5])
        tables.append(table_name)
        quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
        summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
        quote_sheet.insert(0, 'reportmonth', reportmonth)
        summary_sheet.insert(0, 'reportmonth', reportmonth)
        quote_sheet.insert(0, 'source_file', table_name)
        summary_sheet.insert(0, 'source_file', table_name)
        quote_sheet.columns = quote_sheet.columns.str.strip()
        quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
        summary_sheet.columns = summary_sheet.columns.str.strip()
        summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
        quote_combined.append(quote_sheet)
        summary_combined.append(summary_sheet)
        c = c + 1
        print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)

# Concatenate the list of dataframes to append one to another.
# Totals about 3.4mm rows for August
totalQuotes = pd.concat(quote_combined)
totalSummary = pd.concat(summary_combined)

# Change directory, create Sqlite database, and send the combined dataframes to database
os.chdir(r'H:\AaronS\Databases')
conn = sqlite3.connect('AMZN-Quote-files_' + reportmonth)
cur = conn.cursor()
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()
sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'
totalQuotes.to_sql(sqlite_table, sqlite_connection, if_exists='replace')
totalSummary.to_sql(sqlite_table2, sqlite_connection, if_exists='replace')

print('Finished. It took: ', datetime.now() - starttime)
I see a few things you could do. Firstly, since your first step is just to transfer the data to your SQL DB, you don't necessarily need to append all the files to each other. You can just attack the problem one file at a time (which means you can multiprocess!) - whatever computations need to be done can come later. This also cuts down your RAM usage, since if you have 10 files in your folder, you aren't loading all 10 at the same time.
I would recommend the following:
Construct an array of filenames that you need to access
Write a wrapper function that can take a filename, open + parse the file, and write the contents to your SQL DB
Use the Python multiprocessing.Pool class to process them simultaneously. If you run 4 processes, for example, your task can become up to 4 times faster! If you need to derive computations from this data and hence need to aggregate it, please do this once the data's in the SQL DB. This will be way faster.
If you need to define some computations based on the aggregate data, do it now, in the SQL DB. SQL is an incredibly powerful language, and there's a command out there for practically everything!
I've added in a short code snippet to show you what I'm talking about :)
from multiprocessing import Pool

PROCESSES = 4
FILES = []  # fill with the Excel filenames to process


def _process_file(filename):
    # Open + parse the file and write its contents to the database here.
    print("Processing: " + filename)


if __name__ == '__main__':
    pool = Pool(PROCESSES)
    pool.map(_process_file, FILES)
SQL clarification: You don't need an independent table for every file you move to SQL! You can create a table based on a given schema, and then add the data from ALL your files to that one table, row by row. This is essentially what the function you use to go from DataFrame to table does, but it creates 10 different tables. You can look at some examples of inserting a row into a table here. However, in the specific use case that you have, setting the if_exists parameter to "append" should work, as you've mentioned in your comment. I just added the earlier references because you mentioned that you're fairly new to Python, and a lot of my friends in the finance industry have found that gaining a slightly more nuanced understanding of SQL is extremely useful.
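For illustration only, a minimal sketch of that append pattern with hypothetical names ('quotes' as the single table, a placeholder file list, and a SQLAlchemy engine like the one in your script):
# Sketch: every file's rows go into the same table via if_exists='append',
# rather than one table per file. Names below are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///combined.sqlite')        # hypothetical database
for xl in ['file1-Amazon.xlsx', 'file2-Amazon.xlsx']:      # hypothetical file list
    quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
    quote_sheet.to_sql('quotes', engine, if_exists='append', index=False)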
Try this. Here, most of the time is spent loading the data from Excel into a DataFrame. I am not sure the following script will cut the time down to seconds, but it will reduce the RAM footprint, which in turn could speed up the process. It should cut the time by at least 5-10 minutes, though since I have no access to the data I cannot be sure. But you should try this:
import os
import pandas as pd
from datetime import datetime
import sqlite3
from sqlalchemy import create_engine
from playsound import playsound

# Define reportmonth before it is used to build the database/file names.
reportmonth = '2020-08'
month_folder = r'C:\syncedSharePointFolder'

os.chdir(r'H:\AaronS\Databases')
conn = sqlite3.connect('AMZN-Quote-files_' + reportmonth)
engine = create_engine('sqlite:///AMZN-Quote-files_' + reportmonth + '.sqlite', echo=False)
sqlite_connection = engine.connect()
sqlite_table = 'totalQuotes'
sqlite_table2 = 'totalSummary'

os.chdir(month_folder)
starttime = datetime.now()
print('Started', starttime)

c = 0
tables = list()
# Write each file's two sheets straight to the database instead of
# collecting every dataframe in memory first.
for xl in os.listdir(month_folder):
    if '-Amazon' in xl:
        ttime = datetime.now()
        table_name = str(xl[11:-5])
        tables.append(table_name)
        quote_sheet = pd.read_excel(xl, sheet_name='-Amazon-Quote')
        summary_sheet = pd.read_excel(xl, sheet_name='-Amazon-Summary')
        quote_sheet.insert(0, 'reportmonth', reportmonth)
        summary_sheet.insert(0, 'reportmonth', reportmonth)
        quote_sheet.insert(0, 'source_file', table_name)
        summary_sheet.insert(0, 'source_file', table_name)
        quote_sheet.columns = quote_sheet.columns.str.strip()
        quote_sheet.columns = quote_sheet.columns.str.replace(' ', '_')
        summary_sheet.columns = summary_sheet.columns.str.strip()
        summary_sheet.columns = summary_sheet.columns.str.replace(' ', '_')
        quote_sheet.to_sql(sqlite_table, sqlite_connection, if_exists='append')
        summary_sheet.to_sql(sqlite_table2, sqlite_connection, if_exists='append')
        c = c + 1
        print('Step', c, 'complete: ', datetime.now() - ttime, datetime.now() - starttime)
So I have about 4-5 million rows of data per table, and about 10-15 of these tables. I created a table that will join 30,000 rows to some of these millions of rows based on an ID and a snapshot date.
Is there a way to pass my existing data table into the SQL query so that the database filters the results down for me, and I do not have to load the entire tables into memory?
At the moment I've been loading each table one at a time and then releasing the memory. However, it still takes up 100% of the memory on my computer.
df = None  # merged result, built up table by table (initialised here so the first iteration works)
for table in tablesToJoin:
    if df is not None:
        print("DF LENGTH", len(df))
    query = """SET NOCOUNT ON; SELECT * FROM """ + table + """ (nolock) where snapshotdate = '""" + date + """'"""
    query += """ SET NOCOUNT OFF;"""
    start = time.time()
    loadedDf = pd.read_sql_query(query, conn)
    if df is None:
        df = loadedDf
    else:
        loadedDf.info(verbose=True, null_counts=True)
        df.info(verbose=True, null_counts=True)
        df = df.merge(loadedDf, how='left', on=["MemberID", "SnapshotDate"])
        #df = df.fillna(0)

    print("DATA AFTER ALL MERGING", len(df))
    print("Length of data loaded:", len(loadedDf))
    print("Time to load data from sql", (time.time() - start))
I once faced the same problem you are facing. My solution was to filter as much as possible in the SQL layer. Since I don't have your code or your DB, what I write below is untested and may well contain bugs. You will have to correct them as needed.
The idea is to read as little as possible from the DB. pandas is not designed to analyze frames of millions of rows (at least on a typical computer). To do that, pass the filter criteria from df to your DB call:
from sqlalchemy import MetaData, and_, or_

engine = ...  # construct your SQLAlchemy engine. May correspond to your `conn` object
meta = MetaData()
meta.reflect(bind=engine, only=tablesToJoin)

for table in tablesToJoin:
    t = meta.tables[table]
    # Building the WHERE clause. This is equivalent to:
    #     WHERE ((MemberID = <MemberID 1>) AND (SnapshotDate = date))
    #        OR ((MemberID = <MemberID 2>) AND (SnapshotDate = date))
    #        OR ((MemberID = <MemberID 3>) AND (SnapshotDate = date))
    #        ...
    cond = or_(*[and_(t.c['MemberID'] == member_id, t.c['SnapshotDate'] == date)
                 for member_id in df['MemberID']])

    # Be frugal here: select only the columns that you need, or you will blow your memory.
    # With no explicit column list, t.select() is equivalent to a `SELECT *`.
    statement = t.select().where(cond)

    # Note that it's `read_sql`, not `read_sql_query` here
    loadedDf = pd.read_sql(statement, engine)

    # loadedDf should be much smaller now since you have already filtered it at the DB level
    # Now do your joins...
I have this Python code that processes ~110k rows/second. I am wondering if it is possible to make it faster.
I am querying data from SQL and need to format it as JSON.
import collections
import datetime

SQLquery = "SELECT value2 FROM mytable"
cursor.execute(SQLquery)
try:
    ReturnedQuery = cursor.fetchall()
except Exception as ex:
    pass

if cursor.description:
    #print(ReturnedQuery)
    colTypes = cursor.description
    column_names = [column[0] for column in colTypes]
    NrOfColumns = len(column_names)
    NrOfRows = len(ReturnedQuery)
    print(NrOfRows)

    Time1 = datetime.datetime.now()
    data = []
    for row in ReturnedQuery:
        i = 0
        dataRow = collections.OrderedDict()
        for field in row:
            dataRow[column_names[i]] = field
            i = i + 1
        data.append(dataRow)
    Time2 = datetime.datetime.now()
    TimeDiff = Time2 - Time1
    print(TimeDiff)

connection.commit()
cursor.close()
Querying one column from SQL returns this: [(0.2,), (0.3,)]
I need to format it to look like this:
[OrderedDict([('value2', 0.2)]), OrderedDict([('value2', 0.3)])]
EDIT:
I filtered the query to get what I wanted instead. I am using TimescaleDB, so I used the following query.
SELECT time_bucket('30 minutes', datetime) AS thirty_min,
AVG(value3) AS value3
FROM mytable
WHERE datetime > '2019-1-1 12:0:0.00' AND datetime < '2019-1-12 12:0:0.00'
GROUP BY thirty_min
ORDER BY thirty_min;
You can use a list comprehension and close the connection as early as possible to save a few cycles. So this will probably be more efficient:
from collections import OrderedDict

SQLquery = "SELECT value2 FROM mytable"
cursor.execute(SQLquery)
try:
    result = cursor.fetchall()
except Exception as ex:
    pass

if cursor.description:
    column_names = [column[0] for column in cursor.description]
else:
    column_names = []
cursor.close()

if column_names:
    data = [OrderedDict(zip(column_names, row)) for row in result]
But perhaps you should check whether you really need all these rows in the first place. By filtering the data before processing it, you can usually save cycles in a more structural way.
Assuming you are bottlenecked by CPU (a single Python process runs that formatting loop on one core), I suggest you try the multiprocessing module to split the CPU load.
You can copy the fetched rows into a list, split it based on the number of cores, and create separate processes to handle each chunk, so you take advantage of multiple cores.
One problem that can occur is writing results to the same shared variable from several processes; I have used a Queue from the multiprocessing module to overcome this.
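For what it's worth, a rough sketch of that idea, using multiprocessing.Pool rather than hand-rolled processes and a Queue (Pool.map collects the per-chunk results for you, which sidesteps the shared-variable problem). The column name, chunk size, and sample rows below are placeholders:
from collections import OrderedDict
from multiprocessing import Pool


def rows_to_dicts(chunk_and_columns):
    # Convert one chunk of fetched rows into a list of OrderedDicts.
    chunk, column_names = chunk_and_columns
    return [OrderedDict(zip(column_names, row)) for row in chunk]


def format_rows_parallel(rows, column_names, processes=4, chunk_size=100_000):
    # Split the fetched rows into chunks and format them on separate cores.
    chunks = [(rows[i:i + chunk_size], column_names)
              for i in range(0, len(rows), chunk_size)]
    with Pool(processes) as pool:
        results = pool.map(rows_to_dicts, chunks)
    # Flatten the per-chunk lists back into one list.
    return [record for part in results for record in part]


if __name__ == '__main__':
    # Hypothetical usage, assuming `rows` came from cursor.fetchall()
    # and `column_names` from cursor.description as in the answer above.
    rows = [(0.2,), (0.3,)]
    column_names = ['value2']
    print(format_rows_parallel(rows, column_names, processes=2, chunk_size=1))
Bear in mind that pickling rows between processes has its own cost, so whether this is a net win depends on the row count and the work done per row.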
I have two lists: one contains the column names of the categorical variables and the other the numeric ones, as shown below.
cat_cols = ['stat','zip','turned_off','turned_on']
num_cols = ['acu_m1','acu_cnt_m1','acu_cnt_m2','acu_wifi_m2']
These are column names in a table in Redshift.
I want to pass these as a parameter to pull only the numeric columns from a table in Redshift (PostgreSQL), write that to a CSV, and close the CSV.
Next I want to pull only cat_cols, open the CSV, append to it, and close it.
My query so far:
#1.Pull num data:
seg = ['seg1','seg2']
sql_data = str(""" SELECT {num_cols} """ + """FROM public.""" + str(seg) + """ order by random() limit 50000 ;""")
df_data = pd.read_sql(sql_data, cnxn)
# Write to csv.
df_data.to_csv("df_sample.csv",index = False)
#2.Pull cat data:
sql_data = str(""" SELECT {cat_cols} """ + """FROM public.""" + str(seg) + """ order by random() limit 50000 ;""")
df_data = pd.read_sql(sql_data, cnxn)
# Append to df_sample.csv and close the connection to the csv.
with open("df_sample.csv", 'rw'):
    ## Append to the csv ##
This is the first time I am trying to do selective querying based on Python lists, and I am stuck on how to pass a list of column names into the SELECT.
Can someone please help me with this?
If you want to build the query as a string, in your case it is better to use the format method or f-strings (which require Python 3.6+).
Here is an example for your case using only the built-in format function.
seg = ['seg1', 'seg2']
num_cols = ['acu_m1','acu_cnt_m1','acu_cnt_m2','acu_wifi_m2']
query = """
SELECT {} FROM public.{} order by random() limit 50000;
""".format(', '.join(num_cols), seg)
print(query)
If you want to use only one item from the seg list, use seg[0] or seg[1] in the format call (passing the whole list produces FROM public.['seg1', 'seg2'], which is not valid SQL).
I hope this will help you!
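Since f-strings were mentioned as the other option, here is the same query as a short sketch with an f-string, assuming you want a single segment (seg[0]):
seg = ['seg1', 'seg2']
num_cols = ['acu_m1', 'acu_cnt_m1', 'acu_cnt_m2', 'acu_wifi_m2']

# f-string version (Python 3.6+): picks the first segment and joins the column list.
# Note that column and table names cannot be bound as query parameters, so only
# build strings like this from trusted lists.
query = f"SELECT {', '.join(num_cols)} FROM public.{seg[0]} ORDER BY random() LIMIT 50000;"
print(query)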