Pandas v1.1.0: Groupby rolling count slower than rolling mean & sum

Pandas v1.1.0: Groupby rolling count slower than rolling mean & sum - python

I am running a groupby rolling count, sum & mean using Pandas v1.1.0 and I notice that the rolling count is considerably slower than the rolling mean & sum. This seems counter intuitive as we can derive the count from the mean and sum and save time. Is this a bug or am I missing something? Grateful for advice.
import pandas as pd
# Generate sample df
df = pd.DataFrame({'column1': range(600), 'group': 5*['l'+str(i) for i in range(120)]})
# sort by group for easy/efficient joining of new columns to df
df=df.sort_values('group',kind='mergesort').reset_index(drop=True)
# timing of groupby rolling count, sum and mean
%timeit df['mean']=df.groupby('group').rolling(3,min_periods=1)['column1'].mean().values
%timeit df['sum']=df.groupby('group').rolling(3,min_periods=1)['column1'].sum().values
%timeit df['count']=df.groupby('group').rolling(3,min_periods=1)['column1'].count().values
### Output
6.14 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.61 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
76.1 ms ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
### df Output for illustration
print(df.head(10))
column1 group mean sum count
0 0 l0 0.0 0.0 1.0
1 120 l0 60.0 120.0 2.0
2 240 l0 120.0 360.0 3.0
3 360 l0 240.0 720.0 3.0
4 480 l0 360.0 1080.0 3.0
5 1 l1 1.0 1.0 1.0
6 121 l1 61.0 122.0 2.0
7 241 l1 121.0 363.0 3.0
8 361 l1 241.0 723.0 3.0
9 481 l1 361.0 1083.0 3.0

Did you really mean count (number of non-NaN values)? That can not be inferred from just sum and mean.
I suspect that what you are looking for would be a size operator (just the length of the group, irrespective of whether or not there are any NaNs). While size exists in regular groupby, it seems that it is absent in RollingGroupBy (at least as of pandas 1.1.4). One can calculate the size of the rolling groups with:
# DRY:
rgb = df.groupby('group').rolling(3, min_periods=1)['column1']
# size is either:
rgb.apply(len)
# or
rgb.apply(lambda g: g.shape[0])
Neither of those two is as fast as it could, of course, because there needs to be a call to the function for each group, rather than being all vectorized and working just off of the rolling window indices start and end. On my system, either of the above is 2x slower than rgb.sum() or rgb.mean().
Thinking about how one would implement size: it is obvious (just end - start for each window).
Now, in the case one really wanted to speed up count (count of non-NaN values): one could establish a "cumulative count" at first:
cumcnt = (1 - df['column1'].isnull()).cumsum()
(this is very fast, about 200x faster than rgb.mean() on my system).
Then the rolling function could simply take cumcnt[end] - cumcnt[start].
I don't know enough about the internals of RollingGroupBy (and their use of various mixins) to assess feasibility, but at least functionally it seems pretty straightforward.
Update:
It seems that the issue is already fixed with these commits. That was fast and simple --I am impressed with the internal architecture of pandas and all the tools they already have on their Swiss army knife!

Related

Fast way to get rolling percentile ranks

Let's say we have a pandas df like this:
A B C
day1 2.4 2.1 3.0
day2 4.0 3.0 2.0
day3 3.0 3.5 2.5
day4 1.0 3.1 3.0
.....
I want to get for all columns rolling percentile ranks, with a window of 10 observations.
The following code works but it's slow:
scores = pd.DataFrame().reindex_like(df).replace(np.nan, '', regex=True)
scores = df.rolling(10).apply(lambda x: stats.percentileofscore(x, x[-1]))
I also tried this, but it's even slower:
def pctrank(x):
n = len(x)
temp = x.argsort()
ranks = np.empty(n)
ranks[temp] = (np.arange(n) + 1) / n
return ranks[-1]
scores = df.rolling(window=10,center=False).apply(pctrank)
Is there a faster solution? Thanks

As you want the rank of a single element in the rolling window, you don't need to sort at every step. You could just compare the last value to all others in the window:
def pctrank_comp(x):
x = x.to_numpy()
smaller_eq = (x <= x[-1]).sum()
return smaller_eq / len(x)
To remove the apply overhead, you could rewrite the same in NumPy, using the slide_tricks from NumPy v1.20:
from numpy.lib.stride_tricks import sliding_window_view
data = df.to_numpy()
sw = sliding_window_view(data, 10, axis=0)
scores_np = (sw <= sw[..., -1:]).sum(axis=2) / sw.shape[-1]
scores_np_df = pd.DataFrame(scores_np, columns=df.columns)
This doesn't contain the first 9 NaN values per column, as your solution, I'll leave it up to you to fix that, if needed.
Switching the sliding window axis from the last axis to the first gives another performance improvement:
sw = sliding_window_view(data, 10, axis=0).T
scores_np = (sw <= sw[-1:, ...]).sum(axis=0).T / sw.shape[0]
To benchmark, some testing data with 1000 rows:
df = pd.DataFrame(np.random.uniform(0, 10, size=(1000, 3)), columns=list("ABC"))
Original solution from question comes in at 381ms:
%timeit scores = df.rolling(window=10,center=False).apply(pctrank)
381 ms ± 2.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Diff implementation using apply, ~5x faster on my machine:
%timeit scores_comp = df.rolling(window=10,center=False).apply(pctrank_comp)
71.9 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The groupby solution from Cimbali's answer, ~45x faster on my machine:
%timeit grouped = pd.concat({n: df.shift(n) for n in range(10)}).groupby(level=1); scores_grouped = grouped.rank(pct=True).loc[0].where(grouped.count().eq(10))
8.49 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pandas sliding window from #Cimbali, ~105x faster:
%timeit scores_concat = pd.concat({n: df.shift(n).le(df) for n in range(10)}).groupby(level=1).sum() / 10
3.63 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Sum shift version from #Cimbali, ~141x faster:
%timeit scores_sum = sum(df.shift(n).le(df) for n in range(10)).div(10)
2.71 ms ± 70.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The Numpy sliding window solution from above. For 1000 elements, it's faster than the Pandas versions, at ~930x (and possibly uses less memory?), but more complicated. For larger datasets, it becomes slower than the Pandas version.
%timeit data = df.to_numpy(); sw = sliding_window_view(data, 10, axis=0); scores_np = (sw <= sw[..., -1:]).sum(axis=2) / sw.shape[-1]; scores_np_df = pd.DataFrame(scores_np, columns=df.columns)
409 µs ± 4.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The fastest solution is moving the axes around, 2800x faster than the original version for 1000 rows, and about 2x faster than the Pandas sum version for 1M rows:
%timeit data = df.to_numpy(); sw = sliding_window_view(data, 10, axis=0).T; scores_np = (sw <= sw[-1:, ...]).sum(axis=0).T / sw.shape[0]; scores_np_df = pd.DataFrame(scores_np, columns=df.columns)
132 µs ± 750 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Here’s the way to write this with pandas-only tools, where pd.DataFrame.rank() comes in handy:
df.rolling(10).apply(lambda x: x.rank(pct=True).iloc[-1])
If this remains very slow and your window is reasonable, you could concatenate across an axis to generate all the values to compare, and then use groupby.rank() to compare within each set of values:
>>> pd.concat({n: df.shift(10 - n) for n in range(10)})
A B
0 0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
9 95 17.0 9.0
96 12.0 11.0
97 11.0 19.0
98 4.0 15.0
99 8.0 17.0
[1000 rows x 2 columns]
>>> grouped = pd.concat({n: df.shift(n) for n in range(10)}).groupby(level=1)
>>> grouped.rank(pct=True).loc[0].where(grouped.count().eq(10))
A B
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
.. ... ...
95 0.75 0.50
96 0.60 1.00
97 0.20 0.60
98 0.50 0.70
99 0.75 0.35
[100 rows x 2 columns]
We can compare this with #w-m’s great answer which computes ranks using sums, this gives slightly different results, probably in the case of tie-breaks between grades. The sliding window view computation using pandas could look like:
>>> sum(df.shift(n).le(df) for n in range(10)).div(10)
A B
0 0.1 0.1
1 0.1 0.2
2 0.1 0.1
3 0.2 0.1
4 0.1 0.4
.. ... ...
95 0.8 0.5
96 0.6 1.0
97 0.2 0.6
98 0.5 0.7
99 0.8 0.4
[100 rows x 2 columns]
Note that you can always add .where(df.index.to_series().ge(10)) to the resulting dataframes to remove the 10 first rows.
Here’s what happens when I compare these solutions and both from #w-m’s post:
You can see the sliding window remains faster. If you’re using pandas you might as well use the rank(), it’s not that much slower and gives you more flexibility. .apply() techniques are always slow.
Results obtained with:
import numpy as np, pandas as pd, timeit
glob = {'df': pd.DataFrame(np.random.uniform(0, 10, size=(1000, 3)), columns=list("ABC")), 'pctrank': pctrank, 'pctrank_comp': pctrank_comp, 'sliding_window_view': np.lib.stride_tricks.sliding_window_view, 'pd': pd}
timeit.timeit('df.rolling(window=10,center=False).apply(pctrank)', globals=glob, number=10) / 10
timeit.timeit('df.rolling(window=10,center=False).apply(pctrank_comp)', globals=glob, number=100) / 100
timeit.timeit('data = df.to_numpy(); sw = sliding_window_view(data, 10, axis=0); pd.DataFrame((sw <= sw[..., -1:]).sum(axis=2) / sw.shape[-1], columns=df.columns)', globals=glob, number=10_000) / 10_000
timeit.timeit('pd.concat({n: df.shift(n).le(n) for n in range(10)}).groupby(level=1).sum()', globals=glob, number=10_000) / 10_000
timeit.timeit('sum(df.shift(n).le(df) for n in range(10)).div(10)', globals=glob, number=10_000) / 10_000
timeit.timeit('pd.concat({n: df.shift(n) for n in range(10)}).groupby(level=1).rank(pct=True).loc[0]', globals=glob, number=1000) / 1000

You can use the swifter package to apply your percentiles much faster.
https://github.com/jmcarpenter2/swifter

Pandas: How to calculate the percentage of one column against another?

I am just trying to calculate the percentage of one column against another's total, but I am unsure how to do this in Pandas so the calculation gets added into a new column.
Let's say, for argument's sake, my data frame has two attributes:
Number of Green Marbles
Total Number of Marbles
Now, how would I calculate the percentage of the Number of Green Marbles out of the Total Number of Marbles in Pandas?
Obviously, I know that the calculation will be something like this:
(Number of Green Marbles / Total Number of Marbles) * 100
Thanks - any help is much appreciated!

By default, arithmetic operations on pandas dataframes are element-wise, so this is as simple as it can be:
import pandas as pd
>>> d = pd.DataFrame()
>>> d['green'] = [3,5,10,12]
>>> d['total'] = [8,8,20,20]
>>> d
green total
0 3 8
1 5 8
2 10 20
3 12 20
>>> d['percent_green'] = d['green'] / d['total'] * 100
>>> d
green total percent_green
0 3 8 37.5
1 5 8 62.5
2 10 20 50.0
3 12 20 60.0
References:
pandas.DataFrame.div documentation;
Adding new column to existing dataframe in python pandas?

df['percentage columns'] = (df['Number of Green Marbles']) / (df['Total Number of Marbles'] ) * 100

Here is my comparison of regular vs vectorized approach:
%timeit us_consum['Commercial_%ofUS'] = us_consum['Commercial_MWhrs']*100/us_consum['Total US consumption (MWhr)']
351 µs ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit us_consum['Commercial_%ofUS'] = (us_consum['Commercial_MWhrs'].div(us_consum['Total US consumption (MWhr)']))*100
337 µs ± 60.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Is there a faster way to fill in a new column in a dataframe based on exsiting ones than using loc in python?

I have a dataframe that includes the total number of products made in a factory for every day but it is a cumulative field and not daily values. I am trying to calculate daily values by just subtracting every day's cumulative number from next day's number. Here is the code I'm using. I am using loc to make sure that it inserts the new values in the original dataframe. This way it takes 10 seconds for 1000 rows which is kind of long, since the original data is much larger. Wondering if there is a faster way.
before:
date sum
0 2020-03-24 10
1 2020-03-25 50
2 2020-03-26 90
3 2020-03-27 140
4 2020-03-28 180
code:
for i in range(1, 1000):
data.loc[i, 'daily_products'] = data.loc[i, 'sum'] - data.loc[i-1, 'sum']
after:
date sum daily_products
0 2020-03-24 10
1 2020-03-25 50 40
2 2020-03-26 90 40
3 2020-03-27 140 50
4 2020-03-28 180 40
and the time it takes for 1000 rows:
Total runtime of the program is 9.468996286392212

Let use
data['daily_products'] = data['sum'].diff()
159 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops
each)
versus
for i in range(1, len(data)-1):
data.loc[i, 'daily_products'] = data.loc[i, 'sum'] - data.loc[i-1, 'sum']
716 µs ± 26.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops
each)
and...
data['daily_products'] = data['sum'] - data['sum'].shift(-1)
305 µs ± 6.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops
each)

To address your question specifically, something you can do is using .shift
data['daily_products'] = data['sum'] - data['sum'].shift(-1)
Or cf #Scott's answer, in your case with .diff(-1)
Note that sum is a method of pd.DataFrame, which means that this is somehow a reserved name you should not use. Indeed, using such name for your variable prevents you from getting it by doing data.sum. By opposition, you can do data.daily_products since this column name does not enter in conflict with pandas's name space. Once it has been defined though.

When should I (not) want to use pandas apply() in my code?

I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".
I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:
If apply is so bad, then why is it in the API?
How and when should I make my code apply-free?
Are there ever any situations where apply is good (better than other possible solutions)?

apply, the Convenience Function you Never Needed
We start by addressing the questions in the OP, one by one.
"If apply is so bad, then why is it in the API?"
DataFrame.apply and Series.apply are convenience functions defined on DataFrame and Series object respectively. apply accepts any user defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.
Some of the things apply can do:
Run any user-defined function on a DataFrame or Series
Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
Perform index alignment while applying the function
Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
Perform element-wise transformations
Broadcast aggregated results to original rows (see the result_type argument).
Accept positional/keyword arguments to pass to the user-defined functions.
...Among others. For more information, see Row or Column-wise Function Application in the documentation.
So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory bounded applications.
There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.
Let's address the next question.
"How and when should I make my code apply-free?"
To rephrase, here are some common situations where you will want to get rid of any calls to apply.
Numeric Data
If you're working with numeric data, there is likely already a vectorized cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).
Contrast the performance of apply for a simple addition operation.
df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
<!- ->
df.apply(np.sum)
A 16
B 28
dtype: int64
df.sum()
A 16
B 28
dtype: int64
Performance wise, there's no comparison, the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.
%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even if you enable passing raw arrays with the raw argument, it's still twice as slow.
%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another example:
df.apply(lambda x: x.max() - x.min())
A 8
B 8
dtype: int64
df.max() - df.min()
A 8
B 8
dtype: int64
%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()
2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In general, seek out vectorized alternatives if possible.
String/Regex
Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.
A common problem is to check whether a value in a column is present in another column of the same row.
df = pd.DataFrame({
'Name': ['mickey', 'donald', 'minnie'],
'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
'Value': [20, 10, 86]})
df
Name Value Title
0 mickey 20 wonderland
1 donald 10 welcome to donald's castle
2 minnie 86 Minnie mouse clubhouse
This should return the row second and third row, since "donald" and "minnie" are present in their respective "Title" columns.
Using apply, this would be done using
df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)
0 False
1 True
2 True
dtype: bool
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
However, a better solution exists using list comprehensions.
df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
<!- ->
%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.
For more information on when list comprehensions should be considered a good option, see my writeup: Are for-loops in pandas really bad? When should I care?.
Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']), over,
say, df['date'].apply(pd.to_datetime).
Read more at the
docs.
A Common Pitfall: Exploding Columns of Lists
s = pd.Series([[1, 2]] * 3)
s
0 [1, 2]
1 [1, 2]
2 [1, 2]
dtype: object
People are tempted to use apply(pd.Series). This is horrible in terms of performance.
s.apply(pd.Series)
0 1
0 1 2
1 1 2
2 1 2
A better option is to listify the column and pass it to pd.DataFrame.
pd.DataFrame(s.tolist())
0 1
0 1 2
1 1 2
2 1 2
<!- ->
%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())
2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Lastly,
"Are there any situations where apply is good?"
Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.
Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.
df = pd.DataFrame(
pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2),
columns=['date1', 'date2'])
df
date1 date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30
df.dtypes
date1 object
date2 object
dtype: object
This is an admissible case for apply:
df.apply(pd.to_datetime, errors='coerce').dtypes
date1 datetime64[ns]
date2 datetime64[ns]
dtype: object
Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.
%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')
5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can make a similar case for other operations such as string operations, or conversion to category.
u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype(category))
v/s
u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
v[c] = df[c].astype(category)
And so on...
Converting Series to str: astype versus apply
This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable (and sometimes faster) than using astype.
The graph was plotted using the perfplot library.
import perfplot
perfplot.show(
setup=lambda n: pd.Series(np.random.randint(0, n, n)),
kernels=[
lambda s: s.astype(str),
lambda s: s.apply(str)
],
labels=['astype', 'apply'],
n_range=[2**k for k in range(1, 20)],
xlabel='N',
logx=True,
logy=True,
equality_check=lambda x, y: (x == y).all())
With floats, I see the astype is consistently as fast as, or slightly faster than apply. So this has to do with the fact that the data in the test is integer type.
GroupBy operations with chained transformations
GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.
One common requirement is to perform a GroupBy and then two prime operations such as a "lagged cumsum":
df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df
A B
0 a 12
1 a 7
2 b 5
3 c 4
4 c 5
5 c 4
6 d 3
7 d 2
8 e 1
9 e 10
<!- ->
You'd need two successive groupby calls here:
df.groupby('A').B.cumsum().groupby(df.A).shift()
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
Using apply, you can shorten this to a a single call.
df.groupby('A').B.apply(lambda x: x.cumsum().shift())
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).
Other Caveats
Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.
df = pd.DataFrame({
'A': [1, 2],
'B': ['x', 'y']
})
def func(x):
print(x['A'])
return x
df.apply(func, axis=1)
# 1
# 1
# 2
A B
0 1 x
1 2 y
This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25, see here for more information.)

Not all applys are alike
The below chart suggests when to consider apply1. Green means possibly efficient; red avoid.
Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).
If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.
GroupBy.apply: generally favoured
Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups apply with a custom function may still offer reasonable performance.
pd.DataFrame.apply column-wise: a mixed bag
pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it's almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:
# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3))) # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns
# Scenario_1 | Scenario_2
%timeit df.sum() # 800 ms | 109 ms
%timeit df.apply(pd.Series.sum) # 568 ms | 325 ms
%timeit df.max() - df.min() # 1.63 s | 314 ms
%timeit df.apply(lambda x: x.max() - x.min()) # 838 ms | 473 ms
%timeit df.mean() # 108 ms | 94.4 ms
%timeit df.apply(pd.Series.mean) # 276 ms | 233 ms
1 There are exceptions, but these are usually marginal or uncommon. A couple of examples:
df['col'].apply(str) may slightly outperform df['col'].astype(str).
df.apply(pd.to_datetime) working on strings doesn't scale well with rows versus a regular for loop.

For axis=1 (i.e. row-wise functions) then you can just use the following function in lieu of apply. I wonder why this isn't the pandas behavior. (Untested with compound indexes, but it does appear to be much faster than apply)
def faster_df_apply(df, func):
cols = list(df.columns)
data, index = [], []
for row in df.itertuples(index=True):
row_dict = {f:v for f,v in zip(cols, row[1:])}
data.append(func(row_dict))
index.append(row[0])
return pd.Series(data, index=index)

Are there ever any situations where apply is good?
Yes, sometimes.
Task: decode Unicode strings.
import numpy as np
import pandas as pd
import unidecode
s = pd.Series(['mañana','Ceñía'])
s.head()
0 mañana
1 Ceñía
s.apply(unidecode.unidecode)
0 manana
1 Cenia
Update
I was by no means advocating for the use of apply, just thinking since the NumPy cannot deal with the above situation, it could have been a good candidate for pandas apply. But I was forgetting the plain ol list comprehension thanks to the reminder by #jpp.

Pandas groupby nlargest sum

I am trying to use groupby, nlargest, and sum functions in Pandas together, but having trouble making it work.
State County Population
Alabama a 100
Alabama b 50
Alabama c 40
Alabama d 5
Alabama e 1
...
Wyoming a.51 180
Wyoming b.51 150
Wyoming c.51 56
Wyoming d.51 5
I want to use groupby to select by state, then get the top 2 counties by population. Then use only those top 2 county population numbers to get a sum for that state.
In the end, I'll have a list that will have the state and the population (of it's top 2 counties).
I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.
The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)

You can use apply after performing the groupby:
df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
I think this issue you're having is that df.groupby('State')['Population'].nlargest(2) will return a DataFrame, so you can no longer do group level operations. In general, if you want to perform multiple operations in a group, you'll need to use apply/agg.
The resulting output:
State
Alabama 150
Wyoming 330
EDIT
A slightly cleaner approach, as suggested by #cs95:
df.groupby('State')['Population'].nlargest(2).sum(level=0)
This is slightly slower than using apply on larger DataFrames though.
Using the following setup:
import numpy as np
import pandas as pd
from string import ascii_letters
n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
'B': np.random.randint(10**7, size=n)})
I get the following timings:
In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.

Using agg, the grouping logic looks like:
df.groupby('State').agg({'Population': {lambda x: x.nlargest(2).sum() }})
This results in another dataframe object; which you could query to find the most populous states, etc.
Population
State
Alabama 150
Wyoming 330

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas v1.1.0: Groupby rolling count slower than rolling mean & sum - python

Related

Fast way to get rolling percentile ranks

Pandas: How to calculate the percentage of one column against another?

Is there a faster way to fill in a new column in a dataframe based on exsiting ones than using loc in python?

When should I (not) want to use pandas apply() in my code?

Pandas groupby nlargest sum

Categories

Resources