I group the following pandas DataFrame by 'name' and then apply several lambda functions to 'value' to generate additional columns.
Is it possible to apply these lambda functions in one pass, to improve efficiency?
import pandas as pd
df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, 3, 1, 2, 3, 1, 2, 3, 3]})
df['Diff'] = df.groupby('name')['value'].transform(lambda x: x - x.iloc[0])
df['Count'] = df.groupby('name')['value'].transform(lambda x: x.count())
df['Index'] = df.groupby('name')['value'].transform(lambda x: x.index - x.index[0] + 1)
print(df)
Output:
name value Diff Count Index
0 A 1 0 2 1
1 A 3 2 2 2
2 B 1 0 4 1
3 B 2 1 4 2
4 B 3 2 4 3
5 B 1 0 4 4
6 C 2 0 3 1
7 C 3 1 3 2
8 C 3 1 3 3
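One small efficiency win that keeps the three-column style is to build the groupby object once and reuse it. A sketch on the same df as above; `cumcount() + 1` stands in for the index arithmetic, which is equivalent here because each group's rows are contiguous:

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, 3, 1, 2, 3, 1, 2, 3, 3]})

# build the groupby object once and reuse it for every derived column
g = df.groupby('name')['value']
df = df.assign(Diff=g.transform(lambda x: x - x.iloc[0]),
               Count=g.transform('count'),
               Index=g.cumcount() + 1)
print(df)
```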
It is possible to use GroupBy.apply with a single function, but I'm not sure whether it gives better performance:
def f(x):
    a = x - x.iloc[0]
    b = x.count()
    c = x.index - x.index[0] + 1
    return pd.DataFrame({'Diff': a, 'Count': b, 'Index': c})
df = df.join(df.groupby('name')['value'].apply(f))
print(df)
name value Diff Count Index
0 A 1 0 2 1
1 A 3 2 2 2
2 B 1 0 4 1
3 B 2 1 4 2
4 B 3 2 4 3
5 B 1 0 4 4
6 C 2 0 3 1
7 C 3 1 3 2
8 C 3 1 3 3
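To actually settle the performance question, here is a rough timeit sketch comparing the two approaches on an enlarged copy of the data (the repeat factor and number of runs are arbitrary choices; results will vary with data size and group count):

```python
import timeit
import pandas as pd

# enlarged copy of the example data (repeat factor is an arbitrary choice)
df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'] * 1000,
                   'value': [1, 3, 1, 2, 3, 1, 2, 3, 3] * 1000})

def three_transforms():
    # one groupby object, three derived columns
    g = df.groupby('name')['value']
    return pd.DataFrame({'Diff': g.transform(lambda x: x - x.iloc[0]),
                         'Count': g.transform('count'),
                         'Index': g.cumcount() + 1})

def one_apply():
    # single function applied once per group
    def f(x):
        return pd.DataFrame({'Diff': x - x.iloc[0],
                             'Count': x.count(),
                             'Index': x.index - x.index[0] + 1})
    return df.groupby('name')['value'].apply(f)

print('three transforms:', timeit.timeit(three_transforms, number=10))
print('one apply:       ', timeit.timeit(one_apply, number=10))
```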
Related
Consider the following DataFrame:
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': ['abc', 'def', 'ghi']}).assign(B=lambda d: d['B'].apply(list))
>>> df
A B
0 1 [a, b, c]
1 2 [d, e, f]
2 3 [g, h, i]
This is one way to get the desired result:
>>> df['B'] = df['B'].apply(enumerate).apply(list)
>>> df = df.explode('B', ignore_index=True)
>>> df[['B1', 'B2']] = pd.DataFrame(df['B'].tolist(), index=df.index)
>>> df.drop(columns='B')
A B1 B2
0 1 0 a
1 1 1 b
2 1 2 c
3 2 0 d
4 2 1 e
5 2 2 f
6 3 0 g
7 3 1 h
8 3 2 i
Is there a neater way?
A groupby on the index is an option:
df.explode('B').assign(
    B1=lambda df: df.groupby(level=0).cumcount())
A B B1
0 1 a 0
0 1 b 1
0 1 c 2
1 2 d 0
1 2 e 1
1 2 f 2
2 3 g 0
2 3 h 1
2 3 i 2
You can always reset the index if you have no use for it:
df.explode('B').assign(
    B1=lambda df: df.groupby(level=0).cumcount()).reset_index(drop=True)
A B B1
0 1 a 0
1 1 b 1
2 1 c 2
3 2 d 0
4 2 e 1
5 2 f 2
6 3 g 0
7 3 h 1
8 3 i 2
Since pandas version 1.3.0 you can pass multiple columns to explode out of the box:
df.assign(
    B1=df.B.apply(len).apply(range)).explode(['B', 'B1'], ignore_index=True)
A B B1
0 1 a 0
1 1 b 1
2 1 c 2
3 2 d 0
4 2 e 1
5 2 f 2
6 3 g 0
7 3 h 1
8 3 i 2
I think a faster option would be to run the reshaping outside pandas and then join back to the DataFrame (of course, only timing tests can confirm or deny this):
from itertools import chain
# you can use np.concatenate instead
# np.concatenate(df.B)
flattened = chain.from_iterable(df.B)
index = df.index.repeat([*map(len, df.B)])
flattened = pd.Series(flattened, index, name = 'B1')
(pd.concat([df.A, flattened], axis=1)
.assign(B2 = lambda df: df.groupby(level=0).cumcount())
)
A B1 B2
0 1 a 0
0 1 b 1
0 1 c 2
1 2 d 0
1 2 e 1
1 2 f 2
2 3 g 0
2 3 h 1
2 3 i 2
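As a variation on the chain-based idea above, here is a sketch that builds both the exploded elements and their within-list positions with NumPy alone. The column names B1/B2 follow the earlier example (B1 = position, B2 = element):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [list('abc'), list('def'), list('ghi')]})

lens = df['B'].str.len().to_numpy()
out = pd.DataFrame({
    'A': np.repeat(df['A'].to_numpy(), lens),                # repeat keys per list length
    'B1': np.concatenate([np.arange(n) for n in lens]),      # position within each list
    'B2': np.concatenate(df['B'].to_numpy()),                # flattened list elements
})
print(out)
```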
df I have:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
df I want:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
I am able to get df want by using:
df.loc['d']=df.loc['b']-df.loc['a']
However, my actual df has 'a','b','c' rows for multiple IDs 'X', 'Y' etc.
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
How can I create the same output with multiple IDs?
My original method:
df.loc['d']=df.loc['b']-df.loc['a']
fails with KeyError: 'b'.
Desired output:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
IIUC,
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]
print(df.sort_index())
Or maybe
k = (df.groupby(df.index.get_level_values(0), as_index=False)
       .apply(lambda s: pd.DataFrame([s.loc[(s.name, 'b')].values - s.loc[(s.name, 'a')].values],
                                     columns=s.columns,
                                     index=pd.MultiIndex(levels=[[s.name], ['d']], codes=[[0], [0]])))
       .reset_index(drop=True, level=0))
pd.concat([k, df]).sort_index()
Data reshaping is a useful trick when you want to manipulate a particular level of a MultiIndex. See the code below:
result = (df.unstack(0).T
.assign(d=lambda x:x.b-x.a)
.stack()
.unstack(0))
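A self-contained, runnable sketch of the reshaping trick, with a hypothetical two-level (ID, label) index built to match the question's data:

```python
import pandas as pd

# hypothetical setup matching the question's shape
idx = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame({'A': [1, 2, 1, 1, 2, 1],
                   'B': [2, 1, 1, 2, 2, 1],
                   'C': [3, 4, 1, 3, 4, 1]}, index=idx)

result = (df.unstack(0).T                 # rows: (column, ID); columns: a/b/c
            .assign(d=lambda x: x.b - x.a)  # compute d at the label level
            .stack()
            .unstack(0))                  # back to (ID, label) rows, A/B/C columns
print(result)
```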
Use pd.IndexSlice to slice the a and b rows, call diff, keep the b rows (which now hold b - a) and rename them to d. Finally, concatenate back onto the original df (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
idx = pd.IndexSlice
df1 = df.loc[idx[:, ['a', 'b']], :].diff().loc[idx[:, 'b'], :].rename({'b': 'd'})
df2 = pd.concat([df, df1]).sort_index().astype(int)
Out[106]:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
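A self-contained, runnable version of this approach with a hypothetical setup (using pd.concat rather than append):

```python
import pandas as pd

# hypothetical setup matching the question's shape
idx = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame({'A': [1, 2, 1, 1, 2, 1],
                   'B': [2, 1, 1, 2, 2, 1],
                   'C': [3, 4, 1, 3, 4, 1]}, index=idx)

sl = pd.IndexSlice
# diff() over the a/b rows, keep the b rows (now holding b - a), relabel them d
d_rows = (df.loc[sl[:, ['a', 'b']], :]
            .diff()
            .loc[sl[:, 'b'], :]
            .rename({'b': 'd'}))
result = pd.concat([df, d_rows]).sort_index().astype(int)
print(result)
```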
This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
Say I have the following Pandas Dataframe:
df = pd.DataFrame({"a" : [1,2,3], "b" : [[1,2],[2,3,4],[5]]})
a b
0 1 [1, 2]
1 2 [2, 3, 4]
2 3 [5]
How would I "unstack" the lists in the "b" column in order to transform it into the dataframe:
a b
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
Starting from pandas 0.25.0, there is a built-in method DataFrame.explode(), which was designed just for that:
res = df.explode("b")
output
In [98]: res
Out[98]:
a b
0 1 1
0 1 2
1 2 2
1 2 3
1 2 4
2 3 5
Solution for pandas versions < 0.25: a generic vectorized approach, which also works for DataFrames with multiple columns. Assuming we have the following DF:
In [159]: df
Out[159]:
a b c
0 1 [1, 2] 5
1 2 [2, 3, 4] 6
2 3 [5] 7
Solution:
In [160]: lst_col = 'b'
In [161]: pd.DataFrame({
...: col:np.repeat(df[col].values, df[lst_col].str.len())
...: for col in df.columns.difference([lst_col])
...: }).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns.tolist()]
...:
Out[161]:
a b c
0 1 1 5
1 1 2 5
2 2 2 6
3 2 3 6
4 2 4 6
5 3 5 7
Setup:
df = pd.DataFrame({
"a" : [1,2,3],
"b" : [[1,2],[2,3,4],[5]],
"c" : [5,6,7]
})
Vectorized NumPy approach:
In [124]: pd.DataFrame({'a':np.repeat(df.a.values, df.b.str.len()),
'b':np.concatenate(df.b.values)})
Out[124]:
a b
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
OLD answer:
Try this:
In [89]: df.set_index('a', append=True).b.apply(pd.Series).stack().reset_index(level=[0, 2], drop=True).reset_index()
Out[89]:
a 0
0 1 1.0
1 1 2.0
2 2 2.0
3 2 3.0
4 2 4.0
5 3 5.0
Or a slightly nicer solution provided by @Boud:
In [110]: df.set_index('a').b.apply(pd.Series).stack().reset_index(level=-1, drop=True).astype(int).reset_index()
Out[110]:
a 0
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
Here is another approach with itertuples -
df = pd.DataFrame({"a" : [1,2,3], "b" : [[1,2],[2,3,4],[5]]})
data = []
for i in df.itertuples():
    lst = i[2]
    for col2 in lst:
        data.append([i[1], col2])
df_output = pd.DataFrame(data=data, columns=df.columns)
df_output
Output is -
a b
0 1 1
1 1 2
2 2 2
3 2 3
4 2 4
5 3 5
Edit: You can also compress the loops into a single list comprehension and populate data as -
data = [[i[1], col2] for i in df.itertuples() for col2 in i[2]]
For example:
A B C
1 1 2
2 1 2
3 3 3
3 2 1
I want to add a column D that holds, for each row, the count of the most frequent value across A, B and C.
D
2
2
3
1
Option 1
You can use stack + groupby + value_counts:
df['D'] = df.stack().groupby(level=0).value_counts().groupby(level=0).max()
df
A B C D
0 1 1 2 2
1 2 1 2 2
2 3 3 3 3
3 3 2 1 1
If you want the value that has the highest count, chain a groupby + head call -
v = (df.stack()
.groupby(level=0)
.value_counts()
.groupby(level=0)
.head(1)
.reset_index(level=0, drop=True)
)
1 2
2 2
3 3
1 1
dtype: int64
df['Num'], df['Num_Mode'] = v.index, v.values # to assign it
If multiple numbers share the same highest count, only one of them is returned.
Option 2
Another option inspired by @Wen, using pd.Series.mode and counting how often the mode occurs -
df['D'] = df.stack().groupby(level=0).apply(lambda x: x.eq(x.mode().iloc[0]).sum())
Or,
df['D'] = df.apply(lambda x: x.eq(x.mode().iloc[0]).sum(), axis=1)
df
A B C D
0 1 1 2 2
1 2 1 2 2
2 3 3 3 3
3 3 2 1 1
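For comparison, a plain row-wise sketch that computes the count of the most frequent value directly with value_counts, no stacking needed:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [1, 1, 3, 2], 'C': [2, 2, 3, 1]})

# value_counts per row; its max is how often the most frequent value occurs
df['D'] = df.apply(lambda row: row.value_counts().max(), axis=1)
print(df)
```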
SciPy's stats.mode can return the count as well:
from scipy import stats
df['D']=stats.mode(df.values,1)[1]
df
Out[829]:
A B C D
0 1 1 2 2
1 2 1 2 2
2 3 3 3 3
3 3 2 1 1
More Info:
stats.mode(df.values,1)
Out[830]:
ModeResult(mode=array([[1],
[2],
[3],
[1]], dtype=int64), count=array([[2],
[3],
[4],
[2]]))
I want to know how to merge multiple columns, and split them again.
Input data
A B C
1 3 5
2 4 6
Merge A, B, C to one column X
X
1
2
3
4
5
6
Process something with X, then split X into A, B, C again. The number of rows for A, B, C is the same (2).
A B C
1 3 5
2 4 6
Is there any simple way for this work?
Start with df:
A B C
0 1 3 5
1 2 4 6
Next, get all values in one column:
df2 = df.unstack().reset_index(drop=True).rename('X').to_frame()
print(df2)
X
0 1
1 2
2 3
3 4
4 5
5 6
And, convert back to original shape:
df3 = pd.DataFrame(df2.values.reshape(2,-1, order='F'), columns=list('ABC'))
print(df3)
A B C
0 1 3 5
1 2 4 6
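Putting both steps together, here is a round-trip sketch where the "process something" step is a placeholder multiplication:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# merge: column-major order keeps all of A first, then B, then C
x = df.unstack().reset_index(drop=True).rename('X')

x = x * 10  # placeholder for "process something with X"

# split: reshape back column-major into the original 2-row shape
df2 = pd.DataFrame(x.to_numpy().reshape(len(df), -1, order='F'), columns=df.columns)
print(df2)
```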
Setup
df=pd.DataFrame({'A': {0: 1, 1: 2}, 'B': {0: 3, 1: 4}, 'C': {0: 5, 1: 6}})
df
Out[684]:
A B C
0 1 3 5
1 2 4 6
Solution
Merge df to 1 column:
df2 = pd.DataFrame(df.values.flatten('F'),columns=['X'])
Out[686]:
X
0 1
1 2
2 3
3 4
4 5
5 6
Split it back to 3 columns:
pd.DataFrame(df2.values.reshape(-1,3,order='F'),columns=['A','B','C'])
Out[701]:
A B C
0 1 3 5
1 2 4 6
To unwind in the way you'd like, you need to either unstack or ravel with order='F'.
Option 1
def proc1(df):
    v = df.values
    s = v.ravel('F')
    s = s * 2
    return pd.DataFrame(s.reshape(v.shape, order='F'), df.index, df.columns)
proc1(df)
A B C
0 2 6 10
1 4 8 12
Option 2
def proc2(df):
    return df.unstack().mul(2).unstack(0)
proc2(df)
A B C
0 2 6 10
1 4 8 12