pandas: GroupBy .pipe() vs .apply()

In the example from the pandas documentation for the new .pipe() method on GroupBy objects, calling .apply() with the same lambda would return the same results.
In [195]: import numpy as np
In [196]: n = 1000
In [197]: df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
   .....:                    'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
   .....:                    'Revenue': (np.random.random(n) * 50 + 10).round(2),
   .....:                    'Quantity': np.random.randint(1, 10, size=n)})
In [199]: (df.groupby(['Store', 'Product'])
   .....:    .pipe(lambda grp: grp.Revenue.sum() / grp.Quantity.sum())
   .....:    .unstack().round(2))
Out[199]:
Product  Product_1  Product_2  Product_3
Store
Store_1       6.93       6.82       7.15
Store_2       6.69       6.64       6.77
I can see how the pipe functionality differs from apply for DataFrame objects, but not for GroupBy objects. Does anyone have an explanation or examples of what can be done with pipe but not with apply for a GroupBy?

What pipe does is let you pass a callable, with the expectation that the object that called pipe is the object that gets passed to the callable.
With apply, we assume that the object calling apply has subcomponents that will each be passed to the callable given to apply. In the context of a groupby, the subcomponents are slices of the dataframe that called groupby, where each slice is a dataframe itself. The same holds for a series groupby.
The main difference in a groupby context is that pipe makes the entire groupby object available to the callable, whereas with apply you only know about the local slice.
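To make the distinction concrete, here is a minimal sketch (using the same df as in the Setup below) that prints what each callable actually receives:
import pandas as pd

df = pd.DataFrame(dict(A=list('XXXXYYYYYY'), B=range(10)))
gb = df.groupby('A').B

# pipe hands the callable the whole SeriesGroupBy object...
print(gb.pipe(lambda g: type(g).__name__))  # SeriesGroupBy

# ...while apply invokes the callable once per group slice.
print(gb.apply(lambda s: type(s).__name__))
# A
# X    Series
# Y    Series
# dtype: object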
Setup
Consider df
df = pd.DataFrame(dict(
    A=list('XXXXYYYYYY'),
    B=range(10)
))
   A  B
0  X  0
1  X  1
2  X  2
3  X  3
4  Y  4
5  Y  5
6  Y  6
7  Y  7
8  Y  8
9  Y  9
Example 1
Make the entire 'B' column sum to 1 while each sub-group sums to the same amount. This requires that the calculation be aware of how many groups exist. This is something we can't do with apply because apply wouldn't know how many groups exist.
s = df.groupby('A').B.pipe(lambda g: df.B / g.transform('sum') / g.ngroups)
s
0 0.000000
1 0.083333
2 0.166667
3 0.250000
4 0.051282
5 0.064103
6 0.076923
7 0.089744
8 0.102564
9 0.115385
Name: B, dtype: float64
Note:
s.sum()
0.99999999999999989
And:
s.groupby(df.A).sum()
A
X 0.5
Y 0.5
Name: B, dtype: float64
Example 2
Subtract the mean of one group from the values of another. Again, this can't be done with apply because apply doesn't know about other groups.
df.groupby('A').B.pipe(
    lambda g: (
        g.get_group('X') - g.get_group('Y').mean()
    ).append(
        g.get_group('Y') - g.get_group('X').mean()
    )
)
0 -6.5
1 -5.5
2 -4.5
3 -3.5
4 2.5
5 3.5
6 4.5
7 5.5
8 6.5
9 7.5
Name: B, dtype: float64
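Note that Series.append was deprecated in pandas 1.4 and removed in 2.0; a sketch of the same cross-group subtraction using pd.concat instead:
import pandas as pd

df = pd.DataFrame(dict(A=list('XXXXYYYYYY'), B=range(10)))

out = df.groupby('A').B.pipe(
    lambda g: pd.concat([
        g.get_group('X') - g.get_group('Y').mean(),  # X values minus Y's mean
        g.get_group('Y') - g.get_group('X').mean(),  # Y values minus X's mean
    ])
)
print(out)  # same values as the append-based version above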

For comparison, the result of Example 1 can also be reproduced with apply by closing over the outer df:
print(df.groupby(['A'])['B'].apply(lambda l: l / l.sum() / df.A.nunique()))

How does Python Pandas Transform work internally when passed a lambda?

I found the following example online which explains how to essentially achieve a SQL equivalent of PARTITION BY
df['percent_of_points'] = df.groupby('team')['points'].transform(lambda x: x/x.sum())
#view updated DataFrame
print(df)
  team  points  percent_of_points
0    A      30           0.352941
1    A      22           0.258824
2    A      19           0.223529
3    A      14           0.164706
4    B      14           0.191781
5    B      11           0.150685
6    B      20           0.273973
7    B      28           0.383562
I struggle to understand what the x refers to in the lambda function lambda x: x/x.sum(), because it appears to refer to an individual element when used as the numerator (x), but also appears to be a list of values when used as the denominator (x.sum()).
I think I am not thinking about this in the right way, or have a gap in my understanding of Python or pandas.
it appears to refer to an individual element when used as the
numerator i.e. 'x' but also appears to be a list of values when used
as a denominator i.e. x.sum()
It doesn't. x is a pd.Series whose length is the size of the group, and x / x.sum() is not a single value either; it is a pd.Series of the same size.
.transform will assign the values of this series to the corresponding values in that column from the group-by operation.
So, consider:
In [15]: df
Out[15]:
  team  points
0    A      30
1    A      22
2    A      19
3    A      14
4    B      14
5    B      11
6    B      20
7    B      28
In [16]: for k, g in df.groupby('team')['points']:
    ...:     print(g)
    ...:     print(g / g.sum())
    ...:
0    30
1    22
2    19
3    14
Name: points, dtype: int64
0    0.352941
1    0.258824
2    0.223529
3    0.164706
Name: points, dtype: float64
4    14
5    11
6    20
7    28
Name: points, dtype: int64
4    0.191781
5    0.150685
6    0.273973
7    0.383562
Name: points, dtype: float64
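As a fully vectorized alternative to the lambda, the built-in 'sum' aggregation can be passed to transform and the division done once on the whole column; a sketch on the same df:
# Avoids calling a Python lambda once per group, which matters on large inputs.
df['percent_of_points'] = df['points'] / df.groupby('team')['points'].transform('sum')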

Pandas count, sum, average specific range/ value for each row

I have big data and I want to count, sum, and average values in each row, but only within a specific range.
df = pd.DataFrame({'id0': [10.3, 20, 30, 50, 108, 110],
                   'id1': [100.5, 0, 300, 570, 400, 140],
                   'id2': [-2.6, -3, 5, 12, 44, 53],
                   'id3': [-100.1, 4, 6, 22, 12, 42]})
     id0    id1   id2    id3
0   10.3  100.5  -2.6 -100.1
1   20.0    0.0  -3.0    4.0
2   30.0  300.0   5.0    6.0
3   50.0  570.0  12.0   22.0
4  108.0  400.0  44.0   12.0
5  110.0  140.0  53.0   42.0
For example, I want to count the occurrences of values between 10 and 100 in each row, which should give:
0 1
1 1
2 1
3 3
4 2
5 2
Name: count_10-100, dtype: int64
Currently I do this by iterating over each row, transposing, and using groupby, but this takes a long time because I have ~500 columns and 500,000 rows.
You can apply the conditions with AND between them, and then sum along the row (axis 1):
((df >= 10) & (df <= 100)).sum(axis=1)
Output:
0 1
1 1
2 1
3 3
4 2
5 2
dtype: int64
For sum and mean, you can apply the conditions with where:
df.where((df >= 10) & (df <= 100)).sum(axis=1)
df.where((df >= 10) & (df <= 100)).mean(axis=1)
Credit for this goes to #anky, who posted it first as a comment :)
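For reference, on the sample df the where-based sum evaluates to the following (you can verify by hand, e.g. row 3 keeps 50, 12, and 22, giving 84.0):
0    10.3
1    20.0
2    30.0
3    84.0
4    56.0
5    95.0
dtype: float64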
Below summarizes the different situations in which you'd want to count something in a DataFrame (or Series, for completeness), along with the recommended method(s).
DataFrame.count returns counts for each column as a Series since the non-null count varies by column.
DataFrameGroupBy.size returns a Series, since all columns in the same group share the same row-count.
DataFrameGroupBy.count returns a DataFrame, since the non-null count could differ across columns in the same group.
To get the group-wise non-null count for a specific column, use df.groupby(...)['x'].count() where "x" is the column to count.
Code Examples
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'A': list('aabbc'), 'B': ['x', 'x', np.nan, 'x', np.nan]})
s = df['B'].copy()
df
   A    B
0  a    x
1  a    x
2  b  NaN
3  b    x
4  c  NaN
s
0      x
1      x
2    NaN
3      x
4    NaN
Name: B, dtype: object
Row Count of a DataFrame: len(df), df.shape[0], or len(df.index)
len(df)
# 5
df.shape[0]
# 5
len(df.index)
# 5
Of the three methods above, len(df.index) (as mentioned in other answers) is the fastest.
Note
All the methods above are constant time operations as they are simple attribute lookups.
df.shape (similar to ndarray.shape) is an attribute that returns a tuple of (# Rows, # Cols).
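If you want to verify the relative speeds on your own data, a quick IPython sketch (timings vary by machine and frame size):
%timeit len(df)
%timeit df.shape[0]
%timeit len(df.index)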
Column Count of a DataFrame: df.shape[1], len(df.columns)
df.shape[1]
# 2
len(df.columns)
# 2
Analogous to len(df.index), len(df.columns) is the faster of the two methods (but takes more characters to type).
Row Count of a Series:
len(s), s.size, len(s.index)
len(s)
# 5
s.size
# 5
len(s.index)
# 5
s.size and len(s.index) are about the same in terms of speed. But I recommend len(s).
size is an attribute, and it returns the number of elements (=count of rows for any Series). DataFrames also define a size attribute which returns the same result as
df.shape[0] * df.shape[1].
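For example, with the df and s defined above:
s.size
# 5
df.size
# 10, i.e. 5 rows * 2 columns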
Non-Null Row Count: DataFrame.count and Series.count
The methods described here only count non-null values (meaning NaNs are ignored).
Calling DataFrame.count will return non-NaN counts for each column:
df.count()
A    5
B    3
dtype: int64
For Series, use Series.count to similar effect:
s.count()
# 3
Group-wise Row Count: GroupBy.size
For DataFrames, use DataFrameGroupBy.size to count the number of rows per group.
df.groupby('A').size()
A
a    2
b    2
c    1
dtype: int64
Similarly, for Series, you'll use SeriesGroupBy.size.
s.groupby(df.A).size()
A
a    2
b    2
c    1
Name: B, dtype: int64
In both cases, a Series is returned.
Group-wise Non-Null Row Count: GroupBy.count
Similar to above, but use GroupBy.count, not GroupBy.size. Note that size always returns a Series, while count returns a Series if called on a specific column, or else a DataFrame.
The following methods return the same thing:
df.groupby('A')['B'].size()
df.groupby('A').size()
A
a    2
b    2
c    1
Name: B, dtype: int64
df.groupby('A').count()
   B
A
a  2
b  1
c  0
df.groupby('A')['B'].count()
A
a    2
b    1
c    0
Name: B, dtype: int64
There's a neat way to do this with agg and pandas methods. It can be read as "aggregate by row (axis=1), where x is greater than or equal to 10 and less than or equal to 100".
df.agg(lambda x: (x.ge(10) & x.le(100)).sum(), axis=1)
Something like this will help you, provided a helper like count_values_in_range is defined, for example:
def count_values_in_range(row, range_min, range_max):
    return row.between(range_min, range_max).sum()  # entries within [range_min, range_max]

range_min, range_max = 10, 100
df["n_values_in_range"] = df.apply(
    func=lambda row: count_values_in_range(row, range_min, range_max), axis=1)
Try this:
df.apply(lambda x: x.between(10, 100), axis=1).sum(axis=1)
Output:
0 1
1 1
2 1
3 3
4 2
5 2

Pandas - Grouping rows and averaging per column [duplicate]

I have a dataframe like this:
cluster  org  time
      1    a     8
      1    a     6
      2    h    34
      1    c    23
      2    d    74
      3    w     6
I would like to calculate the average of time per org per cluster.
Expected result:
cluster  mean(time)
      1          15  # = ((8 + 6) / 2 + 23) / 2
      2          54  # = (74 + 34) / 2
      3           6
I do not know how to do it in Pandas, can anybody help?
If you want to first take mean on the combination of ['cluster', 'org'] and then take mean on cluster groups, you can use:
In [59]: (df.groupby(['cluster', 'org'], as_index=False).mean()
    ...:    .groupby('cluster')['time'].mean())
Out[59]:
cluster
1    15
2    54
3     6
Name: time, dtype: int64
If you want the mean of cluster groups only, then you can use:
In [58]: df.groupby(['cluster']).mean()
Out[58]:
              time
cluster
1        12.333333
2        54.000000
3         6.000000
You can also use groupby on ['cluster', 'org'] and then use mean():
In [57]: df.groupby(['cluster', 'org']).mean()
Out[57]:
             time
cluster org
1       a       7
        c      23
2       d      74
        h      34
3       w       6
I would simply do this, which literally follows your desired logic:
df.groupby(['org']).mean().groupby(['cluster']).mean()
Another possible solution is to reshape the dataframe using pivot_table(), then take mean(). (aggfunc='mean' is pivot_table's default, shown here for clarity; it averages time by cluster and org.)
df.pivot_table(index='org', columns='cluster', values='time', aggfunc='mean').mean()
Another possibility is to use the level parameter of mean() after the first groupby() to aggregate (note that mean(level=...) was deprecated in pandas 1.3 and removed in 2.0):
df.groupby(['cluster', 'org']).mean().mean(level='cluster')
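For pandas 2.x, a sketch of the equivalent that groups on the index level instead:
df.groupby(['cluster', 'org']).mean().groupby(level='cluster').mean()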

Cubic Root of Pandas DataFrame

I understand how to take cubic root of both positive and negative numbers. But when trying to use apply-lambda method to efficiently process all elements of a dataframe, I run into an ambiguity issue. Interestingly, this error does not arise with equalities, so I am wondering what could be wrong with the code:
sample[columns] = sample[columns].apply(lambda x: (-1) * np.power(-x, 1./3) if x < 0 else np.power(x, 1./3))
It looks like you are passing a list or array of column names. I assume this because your variable name is plural, with an s at the end. If this is the case, then sample[columns] is a dataframe. This is an issue because apply iterates through each column, passing each column to the lambda you passed to apply. So you get
(-1) * np.power(-series_object, 1./3) if series_object < 0 else ...
And it's the series_object < 0 that is messing things up because you are asking for the truthiness of a whole series being less than zero.
applymap
f = lambda x: -np.power(-x, 1./3) if x < 0 else np.power(x, 1./3)
sample[columns] = sample[columns].applymap(f)
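In pandas >= 2.1, applymap is deprecated in favor of the elementwise DataFrame.map, so (assuming sample[columns] is a DataFrame) the same fix reads:
sample[columns] = sample[columns].map(f)  # DataFrame.map: elementwise, pandas >= 2.1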
That said, I'd use a lambda defined as follows
f = lambda x: np.sign(x) * np.power(abs(x), 1./3)
Then you could perform this on the entire dataframe
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.randint(-10, 10, (5, 5)))
df
    0  1   2  3   4
0   6  1  -8  0   5
1   3  1   3  9  -2
2 -10  2 -10 -8 -10
3  -3  9   3  8   2
4  -6 -7   9  3  -3
f = lambda x: np.sign(x) * np.power(abs(x), 1./3)
f(df)
          0         1         2         3         4
0  1.817121  1.000000 -2.000000  0.000000  1.709976
1  1.442250  1.000000  1.442250  2.080084 -1.259921
2 -2.154435  1.259921 -2.154435 -2.000000 -2.154435
3 -1.442250  2.080084  1.442250  2.000000  1.259921
4 -1.817121 -1.912931  2.080084  1.442250 -1.442250
Same as
df.applymap(f)
          0         1         2         3         4
0  1.817121  1.000000 -2.000000  0.000000  1.709976
1  1.442250  1.000000  1.442250  2.080084 -1.259921
2 -2.154435  1.259921 -2.154435 -2.000000 -2.154435
3 -1.442250  2.080084  1.442250  2.000000  1.259921
4 -1.817121 -1.912931  2.080084  1.442250 -1.442250
Check for equality
df.applymap(f).equals(f(df))
True
And it's faster:
%timeit df.applymap(f)
%timeit f(df)
1000 loops, best of 3: 1.11 ms per loop
1000 loops, best of 3: 473 µs per loop
It doesn't have to be complicated; simply use NumPy's cube-root function, np.cbrt:
df[columns] = np.cbrt(df[columns])
It requires NumPy >= 1.10 though.
For older versions you could use np.absolute and np.sign instead of using conditionals:
df[columns] = df[columns].apply(lambda x: np.power(np.absolute(x), 1./3) * np.sign(x))
This calculates the cube root of the absolute value and then restores the sign appropriately.
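A quick sanity check that the two approaches agree, a sketch using the random df from the previous answer:
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.randint(-10, 10, (5, 5)))

# Both give the signed cube root, including for negative entries.
a = np.cbrt(df)
b = df.apply(lambda x: np.power(np.absolute(x), 1./3) * np.sign(x))
print(np.allclose(a, b))  # True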
Try:
sample[columns] = sample[columns]**(1/3)
(Note that this produces NaN for negative values, since raising a negative float to a fractional power is undefined in real arithmetic.)

