Pandas transform method performing slowly - python

I have a canonical Pandas transform example in which performance seems inexplicably slow. I have read the Q&A on the apply method, which is related but, in my humble opinion, offers an incomplete and potentially misleading answer to my question as I explain below.
The first five lines of my dataframe are
id date xvar
0 1004 1992-05-31 4.151628
1 1004 1993-05-31 2.868015
2 1004 1994-05-31 3.043287
3 1004 1995-05-31 3.189541
4 1004 1996-05-31 4.008760
There are 24,693 rows in the dataframe.
There are 2,992 unique id values.
I want to center xvar by id.
Approach 1 takes 861 ms:
df_r['xvar_center'] = (
    df_r
    .groupby('id')['xvar']
    .transform(lambda x: x - x.mean())
)
Approach 2 takes 9 ms:
# Group means
df_r_mean = (
    df_r
    .groupby('id', as_index=False)['xvar']
    .mean()
    .rename(columns={'xvar': 'xvar_avg'})
)
# Merge group means onto dataframe and center
df_w = (
    pd
    .merge(df_r, df_r_mean, on='id', how='left')
    .assign(xvar_center=lambda x: x.xvar - x.xvar_avg)
)
The Q&A on the apply method recommends relying on vectorized functions whenever possible, much like @sammywemmy's comment implies. This is where I see the overlap. However, the Q&A on the apply method also states:
"...here are some common situations where you will want to get rid of any calls to apply...Numeric Data"
@sammywemmy's comment does not "get rid of any calls to" the transform method in their answer to my question. On the contrary, the answer relies on the transform method. Therefore, unless @sammywemmy's suggestion is strictly dominated by an alternative approach that does not rely on the transform method, I think my question and its answer are sufficiently distinct from the discussion in the Q&A on the apply method. (Thank you for your patience and help.)

This answer is due to the insightful comment from @sammywemmy, who deserves all credit and no blame for any inaccuracy here. Because a similar usage of transform is illustrated in the Pandas User's Guide, I thought elaborating might be useful for others.
My hypothesis is that the problem rests with a combination of using a non-vectorized function and a large number of groups. When I change the groupby variable from id (2,992 unique values) to year (constructed from the date variable and containing 28 unique values), the performance difference between my original approach and @sammywemmy's narrows substantially but is still significant.
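For reference, a minimal sketch of how the year grouping variable can be derived from date (assuming date is stored as strings, as in the preview above; if it is already a datetime column the to_datetime step can be skipped):
import pandas as pd

# Assumed construction: parse 'date' and extract the calendar year
df_r['date'] = pd.to_datetime(df_r['date'])
df_r['year'] = df_r['date'].dt.year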
%%timeit
df_r['xvar_center_y'] = (
    df_r
    .groupby('year')['xvar']
    .transform(lambda x: x - x.mean())
)
11.4 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs.
%timeit df_r['xvar_center_y'] = df_r.xvar - df_r.groupby('year')['xvar'].transform('mean')
1.69 ms ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
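For completeness, the same vectorized pattern applied to the original id grouping (a sketch, not re-timed here) would be:
df_r['xvar_center'] = df_r['xvar'] - df_r.groupby('id')['xvar'].transform('mean')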
The beauty of @sammywemmy's insight is that it is easy to apply to other common transformations, for potentially significant performance improvements at a modest cost in additional code. For example, consider standardizing a variable:
%%timeit
df_r['xvar_z'] = (
    df_r
    .groupby('id')['xvar']
    .transform(lambda x: (x - x.mean()) / x.std())
)
1.34 s ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vs.
%%timeit
df_r['xvar_z'] = (
    (df_r.xvar - df_r.groupby('id')['xvar'].transform('mean'))
    / df_r.groupby('id')['xvar'].transform('std')
)
3.96 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
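The pattern extends beyond centering and standardizing. As a hypothetical example (sketch only; 'xvar_minmax' is an illustrative column name, not from the original data), a group-wise min-max scaling using only built-in (cythonized) transform strings could look like:
grp = df_r.groupby('id')['xvar']
df_r['xvar_minmax'] = (
    (df_r['xvar'] - grp.transform('min'))
    / (grp.transform('max') - grp.transform('min'))
)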


Compare with another column value

train.loc[:,'nd_mean_2021-04-15':'nd_mean_2021-08-27'] > train['q_5']
I get the warning `Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do left, right = left.align(right, axis=1, copy=False) before e.g. left == right`, and strange output with a lot of columns, but I expected cell values masked with True or False so that I could calculate a sum in the next step.
Comparing each column separately works just fine:
train['nd_mean_2021-04-15'] > train['q_5']
But it is slow and makes for messy code.
I've tested your original solution and two additional ways of performing the comparison you want to make.
To cut to the chase, the following option had the smallest execution time:
%%timeit
sliced_df = df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27']
comparison_df = pd.DataFrame({col: df['q_5'] for col in sliced_df.columns})
(sliced_df > comparison_df)
# 1.46 ms ± 610 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Drawback: it's a little bit messy and requires you to create two new objects (sliced_df and comparison_df).
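As a side note, an alternative worth timing on your own data is DataFrame.gt with axis=0, which broadcasts the Series down the rows without building a helper frame (a sketch, not benchmarked here):
# One comparison per column, q_5 aligned on the row index
df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27'].gt(df['q_5'], axis=0)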
Option 2: Using DataFrame.apply (slower but more readable)
The second option, although slower than your original and the above implementation, is in my opinion the cleanest and easiest to read of them all. If you're not trying to process large amounts of data (I assume not, since you're using pandas instead of tools like Dask or Spark, which are more suitable for large volumes), then it's worth bringing to the discussion table:
%%timeit
df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27'].apply(lambda col: col > df['q_5'])
# 5.66 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Original Solution
I've also tested the performance of your original implementation and here's what I got:
%%timeit
df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27'] > df['q_5']
# 2.02 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Side note: if the FutureWarning message is bothering you, there's always the option to ignore it by adding the following code after your script imports:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
DataFrame Used for Testing
All of the above implementations used the same dataframe, which I created using the following code:
import pandas as pd
import numpy as np
columns = list(
    map(
        lambda value: f'nd_mean_{value}',
        pd.date_range('2021-04-15', '2021-08-27', freq='W').to_series().dt.strftime('%Y-%m-%d').to_list()
    )
)
df = pd.DataFrame(
    {col: np.random.randint(0, 100, 10) for col in [*columns, 'q_5']}
)

Faster pandas DatetimeIndex membership checking

I have a tight loop which, among other things, checks whether a given date (in the form of a pandas.Timestamp) is contained in a given unique pandas.DatetimeIndex (the application being checking whether a date is a custom business day).
As a minimal example, consider this bit:
import pandas as pd
dates = pd.date_range("2020", "2021")
index = dates.to_series().sample(frac=0.7).sort_index().index
for date in dates:
    if date in index:
        pass  # Do stuff...
(Note that simply iterating over index is not an option in the full application)
To my surprise, I found that the date in index bit takes up a significant part of the total runtime. Profiling furthermore shows that Pandas' membership check does a lot more than just a hash lookup, which is further confirmed by a small experiment comparing DatetimeIndex vs a plain python set:
%timeit [date in index for date in dates]
# 3.28 ms ± 81.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs
index_set = set(index)
%timeit [date in index_set for date in dates]
# 341 µs ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Note that the difference is almost 10x! Why is there such a difference, and can I do anything to make it faster?
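One workaround sketch, if the surrounding loop allows it, is to hoist the membership test out of the loop entirely with a single vectorized Index.isin call (whether this fits depends on what "Do stuff" needs):
import pandas as pd

dates = pd.date_range("2020", "2021")
index = dates.to_series().sample(frac=0.7).sort_index().index

# One vectorized membership test instead of one lookup per iteration
mask = dates.isin(index)
for date, is_member in zip(dates, mask):
    if is_member:
        pass  # Do stuff...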

How to quickly subset many dataframes?

I have 180 DataFrame objects; each one has 3,130 rows and is about 300 KB in memory.
The index is a DatetimeIndex, business days from 2000-01-03 to 2011-12-31:
from datetime import datetime
import pandas as pd
freq = pd.tseries.offsets.BDay()
index = pd.date_range(datetime(2000,1,3), datetime(2011,12,31), freq=freq)
df = pd.DataFrame(index=index)
df['A'] = 1000.0
df['B'] = 2000.0
df['C'] = 3000.0
df['D'] = 4000.0
df['E'] = 5000.0
df['F'] = True
df['G'] = 1.0
df['H'] = 100.0
I preprocess all the data taking advantage of numpy/pandas vectorization, then I have to loop through the dataframes day by day. To prevent the possibility of 'look-ahead bias' and accidentally using data from the future, I must be sure that each day I only return a subset of my dataframes, up to that datapoint. To explain: if the current datapoint I am processing is datetime(2010,5,15), I need data from datetime(2000,1,3) to datetime(2010,5,15); it should not be possible to access data more recent than datetime(2010,5,15). With this subset I'll make other computations I can't vectorize because they are path-dependent.
I modified my original loop like this:
def get_data(datapoint):
    return df.loc[:datapoint]

calendar = df.index
for datapoint in calendar:
    x = get_data(datapoint)
This kind of code is painfully slow. What is my best option to improve its speed?
If I do not try to prevent the look ahead bias my production code takes about 3 minutes to run but it is too risky. With code like this it takes 13 minutes and this is unacceptable.
A slightly faster option is using iloc instead of loc, but it is still slow:
def get_data2(datapoint):
    idx = df.index.get_loc(datapoint)
    return df.iloc[:idx]

%%timeit
for datapoint in calendar:
    x = get_data(datapoint)
371 ms ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
for datapoint in calendar:
    x = get_data2(datapoint)
327 ms ± 7.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The original code, which did not try to prevent the possibility of look-ahead bias, simply returned the whole DataFrame when called for each datapoint. In this example it is 100 times faster; in my real code it is 4 times faster.
def get_data_no_check():
    return df

%%timeit
for datapoint in calendar:
    x = get_data_no_check()
2.87 ms ± 89.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
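Since the index is sorted, one more option to consider (a sketch, not part of the original timings) is to compute all slice positions once with searchsorted and slice the underlying numpy array, avoiding the construction of a new DataFrame per step:
values = df.to_numpy()  # one snapshot, rows in the same order as df.index
# Position just past each datapoint; side='right' includes exact matches
positions = df.index.searchsorted(calendar, side='right')
for pos in positions:
    x = values[:pos]  # cheap numpy slice (a view, not a copy)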
See if this works for you:
datapoint_range = pd.date_range(datetime(2000,1,3), datetime.now(), freq=freq)
datapoint = datapoint_range[-1]
The logic is: replace the ending date with today, so as to ensure there are no future dates, then take the last date of the range.
Then use your df.loc[:datapoint] to get the range you want.
I solved it like this: first I preprocess all my data in the DataFrame to take advantage of pandas vectorization, then I convert it into a dict of dicts and iterate over it, preventing the possibility of 'look-ahead bias'. Since the data are already preprocessed I can avoid the DataFrame overhead. The increase in processing speed in production code left me speechless: down from more than 30 minutes to 40 seconds!
# Convert each DataFrame into a dict of dicts
for s, data in self._data.items():
    self._data[s] = data.to_dict(orient='index')
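For illustration, a small sketch of what the conversion produces and how it can be consumed without DataFrame overhead (using the df built in the question above):
records = df.to_dict(orient='index')
# records maps each Timestamp to a plain dict, e.g.
# {Timestamp('2000-01-03'): {'A': 1000.0, 'B': 2000.0, ...}, ...}
for datapoint, row in records.items():
    value = row['A']  # plain dict access, none of the .loc overhead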

Calculations using Pandas apply & lambda

I have a data frame with a column for dates (YYYY/MM/DD format) and one for wind speed measurements. Each date has more than one wind speed measurement associated with it, and I want to calculate the standard error of wind speed measurements on each day.
I use the pandas 'groupby' to group all of the wind speeds to the date they were taken on, and I calculate the mean and number of measurements taken from each day.
To calculate the standard error, I have to sum the squared differences between each value from a day and the average of the values from that day. Obviously, these are different lengths, and I cannot figure out how to do this with a lambda function.
Is there a better way to go about this?
# Calculate daily averages, daily number of measurements, and a list of every value from each day
average_from_date = df.groupby(['time'])['wind_spd_ms'].mean()
number = df.groupby(['time'])['wind_spd_ms'].count()
values_from_date = df.groupby(['time'])['wind_spd_ms'].apply(list)

# Return a list of standard errors for each date in the data set
standard_errors = df.groupby(['time'])['wind_spd_ms'].apply(
    lambda x: sum((values_from_date - average_from_date)**2) / (number - 1)
)
While groupby.GroupBy.sem is a good way to calculate this, since it is a ready-made function in pandas, there might be cases where you need to calculate a new column with a function that does not exist in the library.
apply() + lambda
This is how you would calculate a new column* using the "apply lambda" approach:
res1 = df.groupby(['time'])['wind_spd_ms'].apply(lambda x: ((x-np.mean(x))**2)/(len(x)-1))
What is important to understand is that since df.groupby(['time']) is a DataFrameGroupBy object, df.groupby(['time'])['wind_spd_ms'] is a SeriesGroupBy object, and its apply() takes a function as an argument; that function will be called once per group, with the group's values as a pandas Series. And you already know how to calculate the standard deviation if you are given a list/Series.
apply() + another function
With apply you are not restricted to lambdas; the argument can be any function that takes a pd.Series as its argument. So, an equally good solution would be:
def calculate_std(x):
    ave = np.mean(x)
    return ((x - ave)**2) / (len(x) - 1)

res2 = df.groupby(['time'])['wind_spd_ms'].apply(calculate_std)
With slightly more complex calculations this is a much more readable and preferable solution.
How fast are different alternatives?
One might think that "using lambdas is faster", but if you time the functions yourself you will see that there are no speed gains from using lambdas:
In [3]: timeit df.groupby(['time'])['wind_spd_ms'].apply(lambda x: ((x-np.mean(x))**2)/(len(x)-1))
3.26 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: timeit df.groupby(['time'])['wind_spd_ms'].apply(calculate_std)
2.87 ms ± 63.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
On the other hand, the sem function mentioned by oli5679 is faster
In [5]: timeit df.groupby(['time'])['wind_spd_ms'].sem()
1.33 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
(Although, library functions are not always the fastest. For example, using scipy.ndimage.interpolation.shift for shifting...)
* This is the equation for the groupwise standard deviation (not the standard error of the groupwise mean).
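If the standard error of the groupwise mean is what's wanted, the groupwise standard deviation just needs to be divided by the square root of the group size (a sketch; this should agree with the built-in sem):
grouped = df.groupby(['time'])['wind_spd_ms']
sem_manual = grouped.std() / grouped.count() ** 0.5  # matches grouped.sem()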
Pandas groupby has a 'sem' method, which you can use without having to create your own lambda function. See the pandas documentation for more info.
import pandas as pd
test = pd.DataFrame({'group':['a','a','a','b','b','b'],'val':[1,100,-40,5,7,8]})
test.groupby(['group'])['val'].sem()
#a 41.554516
#b 0.881917
See below for an example of how to do it from scratch. I think the original attempt wasn't quite fitting the definition here. You divide the total squared difference from the mean by N-1 to calculate the sample variance, but you also need to divide this by N and take the square root to get the SEM.
test["squared_difference_from_average"] = (
test["val"] - test.groupby(["group"])["val"].transform("mean")
) ** 2
group_count = test.groupby(["group"])["val"].count()
standard_errors = (
(
(test.groupby(["group"]) ["squared_difference_from_average"].sum())
/ (group_count - 1)
)
/ group_count
) ** 0.5
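A quick sanity check (a sketch) that the from-scratch computation agrees with the built-in:
print(standard_errors)                       # from-scratch computation above
print(test.groupby(['group'])['val'].sem())  # built-in; values should match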

Normalize DataFrame by group

Let's say that I have some data generated as follows:
N = 20
m = 3
data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3
and then I create some categorization variable:
indx = np.random.randint(0,3,size=N).astype(np.int32)
and generate a DataFrame:
import pandas as pd
df = pd.DataFrame(np.hstack((data, indx[:, None])),
                  columns=['a%s' % k for k in range(m)] + ['indx'])
I can get the mean value, per group as:
df.groupby('indx').mean()
What I'm unsure of how to do is to then subtract the mean off of each group, per-column in the original data, so that the data in each column is normalized by the mean within group. Any suggestions would be appreciated.
In [10]: df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
should do it.
If the data contains many groups (thousands or more), the accepted answer using a lambda may take a very long time to compute. A fast solution would be:
groups = df.groupby("indx")
mean, std = groups.transform("mean"), groups.transform("std")
normalized = (df[mean.columns] - mean) / std
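Note that normalized contains only the transformed value columns, since transform excludes the grouping key; if you still need it, it can be re-attached (a one-line usage sketch):
normalized['indx'] = df['indx']  # transform() drops the grouping column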
Explanation and benchmarking
The accepted answer suffers from a performance problem: it uses groupby.transform with a lambda. Even though groupby.transform itself is fast, as are the already vectorized calls in the lambda function (.mean(), .std() and the subtraction), the call to the pure Python lambda function itself for each group creates a considerable overhead.
This can be avoided by using pure vectorized Pandas/Numpy calls and not writing any Python method, as shown in ErnestScribbler's answer.
We can get around the headache of merging and naming the columns by leveraging the broadcasting abilities of .transform. Let's put the solution from above into a method for benchmarking:
def normalize_by_group(df, by):
    groups = df.groupby(by)
    # Computes group-wise mean/std,
    # then auto-broadcasts to the size of each group chunk
    mean = groups.transform("mean")
    std = groups.transform("std")
    normalized = (df[mean.columns] - mean) / std
    return normalized
I changed the data generation from the original question to allow for more groups:
def gen_data(N, num_groups):
    m = 3
    data = np.random.normal(size=(N, m)) + np.random.normal(size=(N, m))**3
    indx = np.random.randint(0, num_groups, size=N).astype(np.int32)
    df = pd.DataFrame(np.hstack((data, indx[:, None])),
                      columns=['a%s' % k for k in range(m)] + ['indx'])
    return df
With only two groups (thus only two Python function calls), the lambda version is only about 1.8x slower than the numpy code:
In: df2g = gen_data(10000, 2) # 3 cols, 10000 rows, 2 groups
In: %timeit normalize_by_group(df2g, "indx")
6.61 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit df2g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
12.3 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Increasing the number of groups to 1000, the runtime issue becomes apparent. The lambda version is 370x slower than the numpy code:
In: df1000g = gen_data(10000, 1000) # 3 cols, 10000 rows, 1000 groups
In: %timeit normalize_by_group(df1000g, "indx")
7.5 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit df1000g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
2.78 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The accepted answer works and is elegant.
Unfortunately, for large datasets I think that, performance-wise, using .transform() is much slower than the following, less elegant approach (illustrated with a single column 'a0'):
means_stds = df.groupby('indx')['a0'].agg(['mean','std']).reset_index()
df = df.merge(means_stds,on='indx')
df['a0_normalized'] = (df['a0'] - df['mean']) / df['std']
To do it for multiple columns you'll have to figure out the merge. My suggestion would be to flatten the multiindex columns from aggregation as in this answer and then merge and normalize for each column separately:
means_stds = df.groupby('indx')[['a0','a1']].agg(['mean','std']).reset_index()
means_stds.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in means_stds.columns]
df = df.merge(means_stds, on='indx')
for col in ['a0', 'a1']:
    df[col + '_normalized'] = (df[col] - df[col + '|mean']) / df[col + '|std']
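To make the flattening step concrete, a tiny sketch of what the column renaming produces (the names follow the scheme above):
import pandas as pd

cols = pd.MultiIndex.from_tuples([('indx', ''), ('a0', 'mean'), ('a0', 'std')])
flat = ['%s%s' % (a, '|%s' % b if b else '') for a, b in cols]
# flat == ['indx', 'a0|mean', 'a0|std']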
Although this is not the prettiest solution, you could do something like this:
indx = df['indx'].copy()
for indices in df.groupby('indx').groups.values():
    df.loc[indices] -= df.loc[indices].mean()
df['indx'] = indx
