Pandas apply speed on large datasets - python

I have a table in pandas with two columns, QuarterHourDimID and StartDateDimID; these columns give me an ID for each date/quarter-hour pairing. For instance, for January 1st, 2015, at 12:15 PM, StartDateDimID would equal 1097 and QuarterHourDimID would equal 26. This is how the data I'm reading is organized.
It's a large table that I'm reading using pyodbc and pandas.read_sql(), ~450M rows and ~60 columns, so performance is an issue.
To parse the QuarterHourDimID and StartDateDimID columns into a workable datetime index, I'm running an apply function on every row to create an additional datetime column.
Reading the table without the additional parsing takes around 800 ms; however, this apply function adds around 4 s to the total run time (anywhere between 5.8 and 6 s per query is expected). The returned df is around ~45K rows and 5 columns (~450 days * ~100 quarter-hour parts).
I am hoping to more efficiently rewrite what I've written and get any input along the way.
Below is the code I've written thus far:
import pandas as pd
from datetime import datetime, timedelta
import pyodbc

def table(network, demo):
    connection_string = "DRIVER={SQL Server};SERVER=OURSERVER;DATABASE=DB"
    sql = """SELECT [ID],[StartDateDimID],[DemographicGroupDimID],[QuarterHourDimID],[Impression] FROM TABLE_NAME
             WHERE (MarketDimID = 1
                    AND RecordTypeDimID = 2
                    AND EstimateTypeDimID = 1
                    AND DailyOrWeeklyDimID = 1
                    AND RecordSequenceCodeDimID = 5
                    AND ViewingTypeDimID = 4
                    AND NetworkDimID = {}
                    AND DemographicGroupDimID = {}
                    AND QuarterHourDimID IS NOT NULL)""".format(network, demo)

    with pyodbc.connect(connection_string) as cnxn:
        df = pd.read_sql(sql=sql, con=cnxn, index_col=None)

    def time_map(quarter_hour, date):
        if quarter_hour > 72:
            return date + timedelta(minutes=(quarter_hour % 73) * 15)
        return date + timedelta(hours=6, minutes=(quarter_hour - 1) * 15)

    map_date = {}
    init_date = datetime(year=2012, month=1, day=1)
    for x in df.StartDateDimID.unique():
        map_date[x] = init_date + timedelta(days=int(x) - 1)

    # this is the part of my code that is likely bogging things down
    df['datetime'] = df.apply(lambda row: time_map(int(row['QuarterHourDimID']),
                                                   map_date[row['StartDateDimID']]),
                              axis=1)

    if network == 1278:
        df = df.loc[df.groupby('datetime')['Impression'].idxmin()]

    df = df.set_index(['datetime'])
    return df
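
One way to avoid the row-wise apply entirely is to build the datetime column with vectorized operations. Below is a minimal sketch under the same column names and quarter-hour convention as above; the helper name add_datetime_column is mine, not part of the original code:

import numpy as np
import pandas as pd

def add_datetime_column(df, init_date="2012-01-01"):
    # Sketch only: same mapping as time_map above, but computed column-wise.
    base = pd.Timestamp(init_date) + pd.to_timedelta(df['StartDateDimID'].astype(int) - 1, unit='D')
    qh = df['QuarterHourDimID'].astype(int)
    # QuarterHourDimID 1..72 starts at 06:00; values above 72 wrap past midnight.
    minutes = np.where(qh > 72, (qh % 73) * 15, 360 + (qh - 1) * 15)
    df['datetime'] = base + pd.to_timedelta(minutes, unit='min')
    return df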

Just to post an example of the date-time conversion being performed in SQL rather than pandas, along with the timing comparison: the above code yielded a mean time of 6.4 s per execution, while rewriting it entirely in SQL gave a mean time of 640 ms per execution.
The updated code:
import pandas as pd
import pyodbc
SQL_QUERY = """
SELECT [Impressions] = MIN(naf.Impression), [datetime] = DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
FROM [dbo].[NielsenAnalyticsFact] AS naf
LEFT JOIN [dbo].[DateDim] AS ddt
ON naf.StartDateDimID = ddt.DateDimID
LEFT JOIN [dbo].[TimeDim] as td
ON naf.QuarterHourDimID = td.TimeDimID
WHERE (naf.NielsenMarketDimID = 1
AND naf.RecordTypeDimID = 2
AND naf.AudienceEstimateTypeDimID = 1
AND naf.DailyOrWeeklyDimID = 1
AND naf.RecordSequenceCodeDimID = 5
AND naf.ViewingTypeDimID = 4
AND naf.NetworkDimID = 1278
AND naf.DemographicGroupDimID = 3
AND naf.QuarterHourDimID IS NOT NULL)
GROUP BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey)))
ORDER BY DATEADD(minute,td.Minute,DATEADD(hour,td.Hour,CONVERT(smalldatetime, ddt.DateKey))) ASC
"""
%%timeit -n200
with pyodbc.connect(DB_CREDENTIALS) as cnxn:
    df = pd.read_sql(sql=SQL_QUERY,
                     con=cnxn,
                     index_col=None)

200 loops, best of 3: 613 ms per loop

Related

How to reduce the time complexity of KS test python code?

I am currently working on a project where I need to compare whether two distributions are the same or not. For that I have two data frames, both containing numeric values only:
1) db_df - which is from the db
2) data - which is the user-uploaded dataframe
I have to compare each and every column from db_df with data, find the similar columns in data, and suggest them to the user as suggestions for the db column.
The dimensions of both data frames are 100 rows, 239 columns.
import time
import pandas as pd
from scipy.stats import kstest

row_list = []
suggestions = dict()
s = time.time()
db_data_columns = db_df.columns
data_columns = data.columns
for i in db_data_columns:
    col_list = list()
    for j in data_columns:
        # perform Kolmogorov-Smirnov test
        col_list.append(kstest(db_df[i], data[j])[1])
    row_list.append(col_list)
print(f"=== AFTER FOR TIME {time.time()-s}")
df = pd.DataFrame(row_list).T
df.columns = db_df.columns
df.index = data.columns
for i in df.columns:
    sorted_df = df.sort_values(by=[i], ascending=False)
    sorted_df = sorted_df[sorted_df > 0.05]
    sorted_df = sorted_df[:3].loc[:, i:i]
    sorted_df = sorted_df.dropna()
    suggestions[sorted_df.columns[0]] = list(sorted_df.to_dict().values())[0]
After getting all the p-values for all the columns in db_df against data, I need to select the top 3 columns from data for each column in db_df.
Overall, the time taken for this is 14 seconds, which is very long. Is there any chance to reduce the time to less than 5 seconds?
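As a sketch of the top-3 selection described above, the per-column loop could also be expressed with nlargest; this reuses the p-value frame df built in the code (index = columns of data, columns = columns of db_df) and is only an illustration, not a drop-in replacement:

suggestions = {}
for col in df.columns:
    # keep only p-values above 0.05, then take the three largest
    top = df[col][df[col] > 0.05].nlargest(3)
    if not top.empty:
        suggestions[col] = top.to_dict()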

How can I get the top 10 most frequent values between 2 dates from a CSV with pandas?

Essentially I have a CSV file which has an OFFENCE_CODE column and a column with some dates called OFFENCE_MONTH. The code I have provided retrieves the 10 most frequently occurring offence codes from the OFFENCE_CODE column; however, I need to be able to do this between 2 dates from the OFFENCE_MONTH column.
import numpy as np
import pandas as pd
input_date1 = 2012/11/1
input_date2 = 2013/11/1
df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
print(df['OFFENCE_CODE'].value_counts().nlargest(10))
You can use pandas.Series.between:
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'])
input_date1 = pd.to_datetime('2012/11/1')
input_date2 = pd.to_datetime('2013/11/1')
m = df['OFFENCE_MONTH'].between(input_date1, input_date2)
df.loc[m, 'OFFENCE_CODE'].value_counts().nlargest(10)
You can do this if it is per month:
import pandas as pd
input_date1 = 2012/11/1
input_date2 = 2013/11/1
# example dataframe
# df = pd.read_csv("penalty_data_set.csv", dtype='unicode', usecols=['OFFENCE_CODE', 'OFFENCE_MONTH'])
d = {'OFFENCE_MONTH': [1, 1, 1, 2, 3, 4, 4, 5, 6, 12],
     'OFFENCE_CODE': ['a', 'a', 'b', 'd', 'r', 'e', 'f', 'g', 'h', 'a']}
df = pd.DataFrame(d)
print(df)
# make a filter (example here)
df_filter = df.loc[(df['OFFENCE_MONTH']>=1) & (df['OFFENCE_MONTH']<5)]
print(df_filter)
# arrange the filter
print(df_filter['OFFENCE_CODE'].value_counts().nlargest(10))
example result:
a 2
b 1
d 1
r 1
e 1
f 1
First you need to convert the input dates and the OFFENCE_MONTH column to datetime:
from datetime import datetime
input_date1 = datetime.strptime(input_date1, "%Y-%m-%d")
input_date2 = datetime.strptime(input_date2, "%Y-%m-%d")
df['OFFENCE_MONTH'] = pd.to_datetime(df['OFFENCE_MONTH'], format="%Y-%m-%d")
Then select the rows based on your conditions:
rslt_df = df[(df['OFFENCE_MONTH'] >= input_date1) & (df['OFFENCE_MONTH'] <= input_date2)]
print(rslt_df['OFFENCE_CODE'].value_counts().nlargest(10))

How to insert multiple python parameters into a SQL string to create a dataframe

I've been trying to create a dataframe based on a SQL string, but I want to query the same things (3 counts) over different periods (in this example, monthly).
For context, I've been working in a python notebook on Civis.
I came up with this:
START_DATE_LIST = ["2021-01-01","2021-02-01","2021-03-01","2021-04-01","2021-05-01","2021-06-01","2021-07-01","2021-08-01","2021-09-01","2021-10-01","2021-11-01","2021-12-01"]
END_DATE_LIST = ["2021-01-31","2021-02-28","2021-03-31","2021-04-30","2021-05-31","2021-06-30","2021-07-31","2021-08-31","2021-09-31","2021-10-31","2021-11-30","2021-12-31"]
for start_date, end_date in zip(START_DATE_LIST, END_DATE_LIST):
    SQL = f"select count (distinct case when (CON.firstdonationdate__c< {start_date} and CON.last_gift_date__c> {start_date} ) then CON.id else null end) as Donors_START_DATE, \
            count (distinct case when (CON.firstdonationdate__c< {end_date} and CON.c_last_gift_date__c> {end_date}) then CON.id else null end) as Donors_END_DATE, \
            count (distinct case when (CON.firstdonationdate__c> {start_date} and CON.firstdonationdate__c<{end_date}) then CON.id else null end) as New_Donors \
            from staging.contact CON;"
    df2 = civis.io.read_civis_sql(SQL, "database", use_pandas=True)
    df2['START_DATE'] = start_date
    df2['END_DATE'] = end_date
It runs, but then the output is only:
   donors_start_date  donors_end_date  new_donors  START_DATE    END_DATE
0              47458                0           0  2021-12-01  2021-12-31
I'm thinking I have two problems:
1/ It reruns the df each time, and I need to find a way to stack up the outputs for each month.
2/ Why doesn't it compute the last two counts for the last month?
Any feedback is greatly appreciated!
I think you have correctly identified the problem yourself:
In each iteration, you perform an SQL query and assign the result to a DataFrame object called df2 (thus overwriting its previous value).
Instead, you want to create a DataFrame object outside the loop, then append data to it:
import pandas as pd

START_DATE_LIST = ...
END_DATE_LIST = ...

df = pd.DataFrame()
for start_date, end_date in zip(START_DATE_LIST, END_DATE_LIST):
    SQL = ...
    row = civis.io.read_civis_sql(SQL, "database", use_pandas=True)
    row['START_DATE'] = start_date
    row['END_DATE'] = end_date
    df = df.append(row)
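
Note that DataFrame.append was removed in pandas 2.0; a minimal sketch of the same pattern that collects the monthly frames in a list and concatenates once at the end (the query and the civis call are unchanged from above):

import pandas as pd

frames = []
for start_date, end_date in zip(START_DATE_LIST, END_DATE_LIST):
    SQL = ...  # same query as above
    row = civis.io.read_civis_sql(SQL, "database", use_pandas=True)
    row['START_DATE'] = start_date
    row['END_DATE'] = end_date
    frames.append(row)
df = pd.concat(frames, ignore_index=True)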

conditional skipping of the files using pandas.read_sql

I am trying to read the values in the columns I would like to use from database files, such as MS Access files, only if certain conditions are met.
I have 26 different MS Access files representing the database for 26 different years.
import pyodbc
import pandas as pd
import numpy as np

k = 1993 + np.arange(24)
for i in k:
    print(i)
    DBfile = r'D:\PMIS1993_2016' + '\\' + str(i) + '\\pmismzxpdata_' + str(i) + '.mdb'
    print(DBfile)
    conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb)};DBQ=' + DBfile)
    cur = conn.cursor()
    qry = "SELECT JCP_FAILED_JNTS_CRACKS_QTY, JCP_FAILURES_QTY, JCP_SHATTERED_SLABS_QTY, JCP_LONGITUDE_CRACKS_QTY, JCP_PCC_PATCHES_QTY FROM PMIS_JCP_RATINGS WHERE BEG_REF_MARKER_NBR = '0342' and BEG_REF_MARKER_DISP LIKE '0.5' and RATING_CYCLE_CODE = 'P'"
    dataf = pd.read_sql(qry, conn)
    print(dataf)
    D = list(dataf.values[0])
    print(D)
    conn.close()
Here I have tried to read the values of the variables JCP_FAILED_JNTS_CRACKS_QTY, JCP_FAILURES_QTY, JCP_SHATTERED_SLABS_QTY, JCP_LONGITUDE_CRACKS_QTY and JCP_PCC_PATCHES_QTY when BEG_REF_MARKER_NBR = '0342' and BEG_REF_MARKER_DISP LIKE '0.5' and RATING_CYCLE_CODE = 'P'.
However, not every year meets the conditions of BEG_REF_MARKER_NBR = '0342' and BEG_REF_MARKER_DISP LIKE '0.5' and RATING_CYCLE_CODE = 'P'.
So I would like to skip the years which do not meet these conditions, something like an if/else that indicates which years do not satisfy them.
If you have any help or ideas, I would really appreciate it.
Isaac
You can use the .empty attribute:
In [11]: pd.DataFrame().empty # This DataFrame has no rows
Out[11]: True
e.g. to skip the empty datafs:
if not dataf.empty:
    D = list(dataf.values[0])
    print(D)
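
Folded back into the year loop from the question, a minimal sketch (same variable names as above, with qry defined as in the question) might look like this:

for i in k:
    DBfile = r'D:\PMIS1993_2016' + '\\' + str(i) + '\\pmismzxpdata_' + str(i) + '.mdb'
    conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb)};DBQ=' + DBfile)
    dataf = pd.read_sql(qry, conn)
    conn.close()
    if dataf.empty:
        # no rows met the WHERE conditions for this year, so skip it
        print('skipping year', i)
        continue
    D = list(dataf.values[0])
    print(D)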

Trying to use Deque to limit DataFrame of incoming data... suggestions?

I've imported deque from collections to limit the size of my DataFrame. When new data is entered, the older rows should be progressively deleted over time.
Big Picture:
I'm creating a DataFrame of historical values for the previous 26 days from time "whatever day it is...".
Confusion:
I think my data comes in each minute as a Series, whose length I then attempted to restrict with deque's maxlen. Then I tried putting the data into a DataFrame. However, I just get NaN values.
Code:
import numpy as np
import pandas as pd
from collections import deque
def initialize(context):
context.stocks = (symbol('AAPL'))
def before_trading_start(context, data):
data = data.history(context.stocks, 'close', 20, '1m').dropna()
length = 5
d = deque(maxlen = length)
data = d.append(data)
index = pd.DatetimeIndex(start='2016-04-03 00:00:00', freq='S', periods=length)
columns = ['price']
df = pd.DataFrame(index=index, columns=columns, data=data)
print df
How can I get this to work?
Mike
If I understand the question correctly, you want to keep all the values of the last twenty-six days. Is the following function enough for you?
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        .loc[lambda x: x.index > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
If the dates are not in the index:
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        # the following line is changed for values in a specific column
        .loc[lambda x: x['column_with_date'] > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
Don't forget to change the hard-coded timezone if you are not in France. :-)
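
A hypothetical usage example for the first variant, assuming old_data and new_data are DataFrames indexed by timestamp (the column name 'price' and the values are made up):

import pandas as pd

history = pd.DataFrame(
    {'price': [100.0]},
    index=[pd.Timestamp.now(tz='Europe/Paris') - pd.Timedelta(days=1)],
)
new_bar = pd.DataFrame(
    {'price': [101.25]},
    index=[pd.Timestamp.now(tz='Europe/Paris')],
)
history = select_values_of_the_last_twenty_six_days(history, new_bar)
print(history)  # both rows kept: they fall inside the 26-day window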
