Pandas performance improvement over pd.apply with eval() - python

I'm trying to optimize my script performance via Pandas. I'm running into a roadblock where I need to apply a large number of filters to a DataFrame and store a few totals from the results.
Currently the fastest way I can make this happen is running a For Loop on the list of filters (as strings) and using eval() to calculate the totals:
for filter_str in filter_list:
    data_filtered = data[eval(filter_str)]
    avg_change = data_filtered['NewChangePerc'].mean()
Here's my attempt at using pd.apply() to speed this up because I can't think of a vectorized way to make it happen (the filters are in a DataFrame this time instead of a list):
def applying(x):
    f = data[eval(x)]
    avg_change = f['NewChangePerc'].mean()

filter_df.processed.apply(applying)
The main goal is simply to make it as fast as possible. What I don't understand is why the For Loop is faster than pd.apply(); it's about twice as fast.
Any input would be greatly appreciated.
UPDATE
Here are more specifics about what I'm trying to accomplish:
Take a data set of roughly 67 columns and 2500 rows.
      Code               Name  ...  Price  NewChangePerc
0  JOHNS33   Johnson, Matthew  ...   0.93       0.388060
1  QUEST01    Questlove, Inc.  ...  18.07       0.346498
2  773NI01  773 Entertainment  ...   1.74       0.338462
3  CLOVE03   Cloverfield, Sam  ...  21.38       0.276418
4  KITET08        Kite Teal 6  ...   0.38       0.225806
Then take a list of filters.
['Price > 5.0', 'NewChangePerc < .02']
Apply each filter to the data and calculate certain values, such as the average NewChangePerc.
For example, when applying 'Price > 5.0', the average NewChangePerc would be ~0.31.
Now grow that list of filters to a length of about 1,000, and it starts to take some time to run. I need to cut that time down as much as possible. I've run out of ideas and can't find any solutions beyond what I've listed above, but they're just too slow (~0.86s for 50 filters with the For Loop; ~1.65s for 50 filters with pd.apply()). Are there any alternatives?
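For reference, here is a minimal sketch of that loop using df.query(), which accepts these same comparison strings directly; I haven't verified whether it beats eval() at this scale, and the toy frame just mirrors the example data above:

import pandas as pd

# Toy stand-in for the real ~2500-row frame described above
data = pd.DataFrame({'Price': [0.93, 18.07, 1.74, 21.38, 0.38],
                     'NewChangePerc': [0.388060, 0.346498, 0.338462, 0.276418, 0.225806]})
filter_list = ['Price > 5.0', 'NewChangePerc < .02']

# query() evaluates the same comparison strings against the column names
# (via numexpr when it is installed)
results = {f: data.query(f)['NewChangePerc'].mean() for f in filter_list}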

Related

Pandas: How to efficiently diff() after a groupby() operation?

I have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to use the diff() function in a performant manner on a subset of the data.
Here is what my dataset looks like:
                   prec  type
location_id hours
135         78     12.0     A
            79     14.0     A
            80     14.3     A
            81     15.0     A
            82     15.0     A
            83     15.0     A
            84     15.5     A
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is apply the diff() function to the prec column for each location. The original dataset accumulates the prec numbers; by applying diff() I get the appropriate prec value for each hour.
With these in mind, I have implemented the following algorithm in Pandas:
# Filter the data first
df_filtered = df_data[df_data.type == "A"]  # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120')  # only work on certain hours

# Apply the diff()
for location_id, data_of_location in df_filtered.groupby(level="location_id"):
    df_data.loc[data_of_location.index, "prec"] = data_of_location.prec.diff().replace(np.nan, 0.0)

del df_filtered
This works functionally, but the performance and the memory consumption are horrible. It takes around 30 minutes on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
Also, the overall memory consumption of the Python script is sky-rocketing during this operation; it grows around 300%! The memory consumed by the main df_data data frame doesn't change but the overall process memory consumption rises.
With the input from @Quang Hoang and @Ben.T, I figured out a solution that is pretty fast but still consumes a lot of memory.
# Filter the data first
df_filtered = df_data[df_data.type == "A"]  # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120')  # only work on certain hours

# Apply the diff() per location and write the result back
df_diffed = df_filtered.groupby(level="location_id").prec.diff().replace(np.nan, 0.0)
df_data.loc[df_diffed.index, "prec"] = df_diffed

del df_diffed
del df_filtered
I am guessing 2 things can be done to improve memory usage:
df_filtered seems like a copy of the data; that should increase the memory a lot.
df_diffed is also a copy.
The memory usage is very intensive while computing these two variables. I am not sure if there is any in-place way to execute such operations.
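One possibility (a sketch only; I have not profiled it on the full 8 million rows) is to skip the filtered copy of the whole frame: build a boolean mask and run the grouped diff on just the prec column, so only that one column's subset is materialized:

# df_data as above; mask the MultiIndexed frame instead of copying it
hours = df_data.index.get_level_values("hours")
mask = (df_data["type"] == "A") & (hours > 0) & (hours <= 120)

# Grouped diff on the single prec column, assigned back by index alignment
df_data.loc[mask, "prec"] = (df_data.loc[mask, "prec"]
                             .groupby(level="location_id")
                             .diff()
                             .fillna(0.0))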

More efficient way to classify

normal = []
nine_plus = []
tw_plus = []
for i in df['SubjectID'].unique():
    x = df.loc[df['SubjectID'] == i]
    if len(x['Year Term ID'].unique()) <= 8:
        normal.append(i)
    elif len(x['Year Term ID'].unique()) >= 9 and len(x['Year Term ID'].unique()) < 13:
        nine_plus.append(i)
    elif len(x['Year Term ID'].unique()) >= 13:
        tw_plus.append(i)
Hello, I am dealing with a dataset that has 10 million rows. The dataset is about student records and I am trying to classify the students into three groups according to how many semesters they have attended. I feel like I am using a very crude method right now, and there could be a more efficient way of categorizing. Any suggestions?
You go through a lot of repeated iterations, which is likely slower than even a simple Python list. Use the data frame's organization in your favor.
Group your rows by Subject_ID, then Year_Term_ID.
Extract the count of rows in each sub-group -- which you currently have as len(x(...
Make a function, lambda, or extra column that represents the classification; call that len expression load:
0 if load <= 8 else 1 if load <= 12 else 2
Use that expression to re-group your students into the three desired classifications (see the sketch below).
Do not iterate through the rows of the data frame: this is a "code smell" that you're missing a vectorized capability.
Does that get you moving?
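A minimal sketch of those steps, using the column names from the question:

# Count distinct terms per student in one grouped pass
terms = df.groupby('SubjectID')['Year Term ID'].nunique()

# The classification expression above, applied to each student's load
labels = terms.map(lambda load: 0 if load <= 8 else 1 if load <= 12 else 2)

normal = terms.index[labels == 0].tolist()
nine_plus = terms.index[labels == 1].tolist()
tw_plus = terms.index[labels == 2].tolist()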

High performance apply on group by pandas

I need to calculate percentile on a column of a pandas dataframe. A subset of the dataframe is as below:
I want to calculate the 20th percentile of the SaleQTY, but for each group of ["Barcode","ShopCode"]:
so I define a function as below:
def quant(group):
    group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
    return group
And apply this function on each group of my sales data, which has almost 18 million rows and roughly 3 million groups of ["Barcode","ShopCode"]:
quant_sale = sales.groupby(['Barcode','ShopCode']).apply(quant)
That took 2 hours to complete on a Windows server with 128 GB of RAM and 32 cores.
That makes no sense, because this is just one small part of my code, so I started searching the net for ways to enhance the performance.
I came up with a "numba" solution, using the code below, which didn't work:
from numba import njit, jit

@jit(nopython=True)
def quant_numba(df):
    final_quant = []
    for bar_shop, group in df.groupby(['Barcode', 'ShopCode']):
        group["Quantile"] = np.quantile(group["SaleQTY"], 0.2)
        final_quant.append((bar_shop, group["Quantile"]))
    return final_quant

result = quant_numba(sales)
It seems that I cannot use pandas objects within this decorator.
I am not sure whether I could use multiprocessing (a concept I'm unfamiliar with), or whether there is any other solution to speed up my code. So any help would be appreciated.
You can try DataFrameGroupBy.quantile:
df1 = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].quantile(0.2)
Or, as @Jon Clements mentioned, for a new column filled with the per-group percentile use GroupBy.transform:
df['Quantile'] = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].transform('quantile', q=0.2)
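On a toy frame (made-up values) the difference between the two is visible in their shapes: quantile returns one value per group, while transform broadcasts that value back onto the original rows:

import pandas as pd

df = pd.DataFrame({'Barcode': [1, 1, 2, 2],
                   'ShopCode': ['A', 'A', 'A', 'B'],
                   'SaleQTY': [10, 20, 5, 7]})

# One 20th-percentile value per (Barcode, ShopCode) pair (MultiIndexed Series)
per_group = df.groupby(['Barcode', 'ShopCode'])['SaleQTY'].quantile(0.2)

# Same length as df: each group's value repeated on every row of that group
df['Quantile'] = (df.groupby(['Barcode', 'ShopCode'])['SaleQTY']
                  .transform('quantile', q=0.2))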
There is a built-in function in pandas called quantile().
quantile() will get the nth percentile of a column in a DataFrame.
Doc reference link
geeksforgeeks example reference

increasing pandas dataframe imputation performance

I want to impute a large datamatrix (90*90000) and later an even larger one (150000*800000) using pandas.
At the moment I am testing with the smaller one on my laptop (8 GB RAM, Haswell Core i5 2.2 GHz; the larger dataset will be run on a server).
The columns have some missing values that I want to impute with the most frequent one over all rows.
My working code for this is:
from scipy.stats import mode

# Most frequent value per column, starting from the first SNP column
# (the second row returned by `mode` gives the actual frequencies)
freq_val = pd.Series(mode(df.iloc[:, 6:])[0][0], df.iloc[:, 6:].columns.values)
# Impute unknown SNP values with the most frequent value of the respective column
df_imputed = df.iloc[:, 6:].fillna(freq_val)
The imputation takes about 20 minutes on my machine. Is there another implementation that would increase performance?
try this:
df_imputed = df.iloc[:, 6:].fillna(df.iloc[:, 6:].apply(lambda x: x.mode()).iloc[0])
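As a side note, DataFrame.mode computes the per-column modes directly (NaNs are skipped by default), so the same thing can be written without the apply (equivalent, not separately benchmarked):

# First row of df.mode() holds the most frequent value of each column
df_imputed = df.iloc[:, 6:].fillna(df.iloc[:, 6:].mode().iloc[0])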
I tried different approaches. The key learning is that the mode function is really slow. Alternatively, I implemented the same functionality using np.unique (return_counts=True) and np.bincount. The latter is supposedly faster, but it doesn't work with NaN values.
The optimized code now needs about 28 s to run. MaxU's answer needs ~48 s on my machine to finish.
The code:
n_cols = np.shape(df.iloc[:, 6:])[1]
freq_val = np.zeros(n_cols)
for i in range(n_cols):
    # np.unique returns the sorted unique values and their counts;
    # the most frequent value is the one with the largest count
    values, count = np.unique(df.iloc[:, i + 6], return_counts=True)
    freq_val[i] = values[count.argmax()]
freq_val_series = pd.Series(freq_val, df.iloc[:, 6:].columns.values)
df_imputed = df.iloc[:, 6:].fillna(freq_val_series)
Thanks for the input!

Migrating an Excel financial model and corkscrew calculation to Python Pandas

I'm working on migrating an Excel financial model to Python Pandas. By financial model I mean forecasting a cash flow, profit & loss statement and balance sheet over time for a business venture, as opposed to pricing swaps/options or working with stock price data, which are also referred to as financial models. It's quite possible that the same concepts and issues apply to the latter types; I just don't know them well enough to comment.
So far I like a lot of what I see. The models I work with in Excel have a common time series across the top of the page, defining the time period we're interested in forecasting. Calculations then run down the page as a series of rows. Each row is therefore a TimeSeries object, or a collection of rows becomes a DataFrame. Obviously you need to transpose to read between these two constructs but this is a trivial transformation.
Better yet, each Excel row should have a common, single formula and only be based on rows above it on the page. This lends itself to vector operations that are computationally fast and simple to write using Pandas.
The issue I get is when I try to model a corkscrew-type calculation. These are often used to model accounting balances, where the opening balance for one period is the closing balance of the prior period. You can't use a .shift() operation as the closing balance in a given period depends, amongst other things, on the opening balance in the same period. This is probably best illustrated with an example:
Time             2013-04-01  2013-05-01  2013-06-01  2013-07-01  ...
Opening Balance           0          +3          -2         -10
[...]
Some Operations          +3          -5          -8         +20
[...]
Closing Balance          +3          -2         -10         +10
In pseudo-code, my solution for calculating these sorts of things is as follows. It is not a vectorised solution, and it is pretty slow:
import pandas as pd

# Set up date range
dates = pd.date_range('2012-04-01', periods=500, freq='MS')

# Assumed inputs (not shown above): the initial opening balance and its date
inp = {'ob': 0}
obDate = dates[0]

# Initialise empty lists
lOB = []
lSomeOp1 = []
lSomeOp2 = []
lCB = []

# Set the closing balance for the initial loop's OB
sCB = 0

# As this is a corkscrew calculation, we need to loop through all dates
for d in dates:
    # Opening balance is either the initial opening balance, if at the
    # initial date, or else the last closing balance from the prior period
    sOB = inp['ob'] if (d == obDate) else sCB
    # Calculate some additions, write-offs, amortisation, depreciation, whatever!
    sSomeOp1 = 10
    sSomeOp2 = -sOB / 2
    # Calculate the closing balance
    sCB = sOB + sSomeOp1 + sSomeOp2
    # Build up lists of outputs
    lOB.append(sOB)
    lSomeOp1.append(sSomeOp1)
    lSomeOp2.append(sSomeOp2)
    lCB.append(sCB)

# Convert lists to time series objects
ob = pd.Series(lOB, index=dates)
someOp1 = pd.Series(lSomeOp1, index=dates)
someOp2 = pd.Series(lSomeOp2, index=dates)
cb = pd.Series(lCB, index=dates)
I can see that where you only have one or two lines of operations there might be some clever hacks to vectorise the computation; I'd be grateful for any tips people have on these sorts of tricks.
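For instance, when every operation in a period is affine in that period's opening balance, the whole corkscrew collapses to a first-order linear recurrence cb[t] = a[t]*cb[t-1] + b[t], which can be vectorised with cumulative products and sums. A sketch (a = 0.5 and b = 10 correspond to the pseudo-code above, since cb = ob + 10 - ob/2):

import numpy as np
import pandas as pd

dates = pd.date_range('2012-04-01', periods=500, freq='MS')
a = np.full(len(dates), 0.5)    # multiplier on the prior closing balance
b = np.full(len(dates), 10.0)   # per-period additions

# Dividing the recurrence through by the running product of the multipliers
# turns it into a cumulative sum. Caveat: with |a| < 1 the running product
# shrinks towards zero, so very long horizons risk overflow in b / A; this
# illustrates the idea rather than being production-ready code.
A = np.cumprod(a)
cb = pd.Series(A * np.cumsum(b / A), index=dates)  # assumes the initial balance is 0
ob = cb.shift(1, fill_value=0.0)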
Some of the corkscrews I have to build, however, have hundreds of intermediate operations. In these cases what's my best way forward? Is it to accept the slow performance of Python? Should I migrate to Cython? I've not really looked into it (so could be way off base), but the issue with the latter approach is that if I'm moving hundreds of lines into C, why am I bothering with Python in the first place? It doesn't feel like a simple lift and shift.
The following makes in-place updates, which should improve performance:
import pandas as pd
import numpy as np

idx = ['2013-04-01', '2013-05-01', '2013-06-01', '2013-07-01']
book = pd.DataFrame([[0, 3, np.nan], [np.nan, -5, np.nan], [np.nan, -8, np.nan], [np.nan, 20, np.nan]],
                    columns=['ob', 'so', 'cb'], index=idx)

ob_col, cb_col = book.columns.get_loc('ob'), book.columns.get_loc('cb')
for i in range(len(book)):
    # Closing balance = opening balance + operations for the period
    book.iloc[i, cb_col] = book.iloc[i][['ob', 'so']].sum()
    if i + 1 < len(book):
        # Roll the closing balance forward into the next period's opening balance
        book.iloc[i + 1, ob_col] = book.iloc[i, cb_col]
book
