pandas: groupby sum conditional on other column - python

I have a DataFrame that looks like this:
df = pd.DataFrame({'a': ['A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
                   'b': ['Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N'],
                   'c': [20, 5, 12, 8, 15, 10, 25, 13]})
a b c
0 A Y 20
1 B Y 5
2 B N 12
3 C Y 8
4 C Y 15
5 D N 10
6 D N 25
7 E N 13
I would like to group by column 'a', check whether any value of column 'b' in the group is 'Y' (or True) and keep that value, and then sum column 'c'.
The resulting DataFrame should look like this:
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
I tried the following but got an error:
df.groupby('a')['b'].max()['c'].sum()

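You can use agg with max on 'b' and sum on 'c'. Taking the max of column 'b' works because strings compare lexicographically and 'Y' > 'N' is True, so any group containing at least one 'Y' keeps 'Y'.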
print(df.groupby('a', as_index=False).agg({'b': 'max', 'c': 'sum'}))
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
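The max trick depends on 'Y' sorting after 'N'. If the flag values were different strings, a more explicit variant could test for 'Y' directly. The following is only a sketch under that assumption; the helper column name has_y is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
                   'b': ['Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N'],
                   'c': [20, 5, 12, 8, 15, 10, 25, 13]})

# has_y is a hypothetical helper column: True where b == 'Y'
out = (df.assign(has_y=df['b'].eq('Y'))
         .groupby('a', as_index=False)
         .agg(b=('has_y', 'any'), c=('c', 'sum')))

# Map the boolean "any" result back to the original 'Y'/'N' labels
out['b'] = out['b'].map({True: 'Y', False: 'N'})
print(out)
```

This gives the same result as the max version for this data, without relying on string ordering.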

Related

Data Transforming/formatting in Python

I have the following pandas data:
df = {'ID_1': [1,1,1,2,2,3,4,4,4,4],
'ID_2': ['a', 'b', 'c', 'f', 'g', 'd', 'v', 'x', 'y', 'z']
}
df = pd.DataFrame(df)
display(df)
ID_1 ID_2
1 a
1 b
1 c
2 f
2 g
3 d
4 v
4 x
4 y
4 z
For each ID_1, I need to find the combination (order doesn't matter) of ID_2. For example,
When ID_1 = 1, the combinations are ab, ac, bc.
When ID_1 = 2, the combination is fg.
Note, if the frequency of ID_1<2, then there is no combination here (see ID_1=3, for example).
Finally, I need to store the combination results in df2 as follows:
One way using itertools.combinations:
from itertools import combinations
def comb_df(ser):
    return pd.DataFrame(list(combinations(ser, 2)), columns=["from", "to"])
new_df = df.groupby("ID_1")["ID_2"].apply(comb_df).reset_index(drop=True)
Output:
from to
0 a b
1 a c
2 b c
3 f g
4 v x
5 v y
6 v z
7 x y
8 x z
9 y z
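The output above drops ID_1. If the group key should stay next to each pair, one possible variant is to build the rows in a comprehension. This is a sketch, not the answerer's code; the name df2 comes from the question, and the column layout is an assumption since the expected output was not shown.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'ID_1': [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
                   'ID_2': ['a', 'b', 'c', 'f', 'g', 'd', 'v', 'x', 'y', 'z']})

# One row per unordered pair within each ID_1 group;
# groups with fewer than two members yield no pairs at all.
rows = [(key, x, y)
        for key, grp in df.groupby('ID_1')['ID_2']
        for x, y in combinations(grp, 2)]

df2 = pd.DataFrame(rows, columns=['ID_1', 'from', 'to'])
print(df2)
```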

Combine two pandas index slices

How can two pandas.IndexSlice objects be combined into one?
Set up of the problem:
import pandas as pd
import numpy as np
idx = pd.IndexSlice
cols = pd.MultiIndex.from_product([['A', 'B', 'C'], ['x', 'y'], ['a', 'b']])
df = pd.DataFrame(np.arange(len(cols)*2).reshape((2, len(cols))), columns=cols)
df:
A B C
x y x y x y
a b a b a b a b a b a b
0 0 1 2 3 4 5 6 7 8 9 10 11
1 12 13 14 15 16 17 18 19 20 21 22 23
How can the two slices idx['A', 'y', :] and idx[['B', 'C'], 'x', :], be combined to show in one dataframe?
Separately they are:
df.loc[:, idx['A', 'y',:]]
A
y
a b
0 2 3
1 14 15
df.loc[:, idx[['B', 'C'], 'x', :]]
B C
x x
a b a b
0 4 5 8 9
1 16 17 20 21
Simply combining them as a list does not play nicely:
df.loc[:, [idx['A', 'y',:], idx[['B', 'C'], 'x',:]]]
....
TypeError: unhashable type: 'slice'
My current solution is incredibly clunky, but gives the sub df that I'm looking for:
df.loc[:, df.loc[:, idx['A', 'y', :]].columns.to_list() + df.loc[:,
idx[['B', 'C'], 'x', :]].columns.to_list()]
A B C
y x x
a b a b a b
0 2 3 4 5 8 9
1 14 15 16 17 20 21
However this doesn't work when one of the slices is just a series (as expected), which is less fun:
df.loc[:, df.loc[:, idx['A', 'y', 'a']].columns.to_list() + df.loc[:,
idx[['B', 'C'], 'x', :]].columns.to_list()]
...
AttributeError: 'Series' object has no attribute 'columns'
Are there any better alternatives to what I'm currently doing that would ideally work with dataframe slices and series slices?
A general solution is to select each slice separately and join the results with pd.concat, which accepts both Series and DataFrames:
a = df.loc[:, idx['A', 'y', 'a']]
b = df.loc[:, idx[['B', 'C'], 'x', :]]
df = pd.concat([a, b], axis=1)
print(df)
A B C
y x x
a a b a b
0 2 4 5 8 9
1 14 16 17 20 21
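Another option, sketched here as an assumption rather than part of the answer, is to turn each slice into integer column positions with MultiIndex.get_locs and select once with iloc. The helper name select_slices is hypothetical, and the approach assumes the column MultiIndex is lexsorted, which holds for the from_product example in the question.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice
cols = pd.MultiIndex.from_product([['A', 'B', 'C'], ['x', 'y'], ['a', 'b']])
df = pd.DataFrame(np.arange(len(cols) * 2).reshape((2, len(cols))), columns=cols)

def select_slices(frame, *slices):
    """Hypothetical helper: union of several column slices, in the order given."""
    # get_locs returns the integer positions matched by each slice
    locs = np.concatenate([frame.columns.get_locs(s) for s in slices])
    return frame.iloc[:, locs]

both = select_slices(df, idx['A', 'y', :], idx[['B', 'C'], 'x', :])
# A fully specified key such as idx['A', 'y', 'a'] still yields a DataFrame column
mixed = select_slices(df, idx['A', 'y', 'a'], idx[['B', 'C'], 'x', :])
print(both)
print(mixed)
```

Because the selection goes through positions, a single-column slice does not collapse to a Series, so no special case is needed.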

Pandas Wrap Display into Multiple Columns

All,
I have a pandas dataframe with ~30 rows and 1 column. When I display it in Jupyter, all 30 rows are displayed in one long list. I am looking for a way to wrap the rows into multiple displayed columns, such as below:
Example dataframe:
df = pd.DataFrame([
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
'u', 'v', 'w', 'x', 'y', 'z', 'aa', 'ab', 'ac', 'ad'],
columns=['value'])
Example output
value value value
0 a 10 k 20 u
1 b 11 l 21 v
2 c 12 m 22 w
3 d 13 n 23 x
4 e 14 o 24 y
5 f 15 p 25 z
6 g 16 q 26 aa
7 h 17 r 27 ab
8 i 18 s 28 ac
9 j 19 t 29 ad
You can use this helper function (it needs numpy imported as np):
def reshape(df, rows=10):
    length = len(df)
    cols = np.ceil(length / rows).astype(int)
    # Tag each row with its display row and display column, then pivot
    df = df.assign(rows=np.tile(np.arange(rows), cols)[:length],
                   cols=np.repeat(np.arange(cols), rows)[:length]) \
           .pivot(index='rows', columns='cols', values=df.columns.tolist()) \
           .sort_index(level=1, axis=1).droplevel(level=1, axis=1).rename_axis(None)
    return df
Output
>>> reshape(df)
value value value
0 a k u
1 b l v
2 c m w
3 d n x
4 e o y
5 f p z
6 g q aa
7 h r ab
8 i s ac
9 j t ad
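For a single column, the same wrapping can also be sketched with a plain NumPy reshape in column-major order. This is an alternative under stated assumptions, not the answer's method: the function name wrap_column is made up, and short final columns are padded with empty strings.

```python
import numpy as np
import pandas as pd

def wrap_column(ser, rows=10):
    """Hypothetical helper: lay a Series out in display columns of `rows` entries."""
    n_cols = -(-len(ser) // rows)  # ceiling division
    padded = np.full(rows * n_cols, '', dtype=object)
    padded[:len(ser)] = ser.to_numpy()
    # order='F' fills the first column top to bottom before moving to the next
    return pd.DataFrame(padded.reshape((rows, n_cols), order='F'),
                        columns=[ser.name] * n_cols)

df = pd.DataFrame(
    ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
     'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
     'u', 'v', 'w', 'x', 'y', 'z', 'aa', 'ab', 'ac', 'ad'],
    columns=['value'])

wrapped = wrap_column(df['value'])
print(wrapped)
```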
Try
df[['col1', 'col2']] = df['col'].str.split(' ', n=1, expand=True)

Pandas Dataframe pivot with rolling window

I am trying to prepare data for some time-series modeling with Python Pandas (first timer). My DataFrame looks like this:
df = pd.DataFrame({
'time': [0, 1, 2, 3, 4],
'colA': ['a', 'b', 'c', 'd', 'e'],
'colB': ['v', 'w', 'x', 'y', 'z'],
'value' : [10, 11, 12, 13, 14]
})
# time colA colB value
# 0 0 a v 10
# 1 1 b w 11
# 2 2 c x 12
# 3 3 d y 13
# 4 4 e z 14
Is there a combination of functions that could transform it into the following format?
# colA-2 colA-1 colA colB-2 colB-1 colB value
# _ _ a _ _ v 10
# _ a b _ v w 11
# a b c v w x 12
# b c d w x y 13
# c d e x y z 14
I am very new to Python/Pandas and I do not have any concrete code/results that got me even close to what I need...
You can use the shift function:
df['colA-2'] = df['colA'].shift(2, fill_value='-')
df['colA-1'] = df['colA'].shift(1, fill_value='-')
...
I'd use pd.concat
pd.concat([
df[['colA', 'colB']].shift(i).add_suffix(f'-{i}')
for i in range(1, 3)], axis=1
).fillna('-').join(df)
colA-1 colB-1 colA-2 colB-2 time colA colB value
0 - - - - 0 a v 10
1 a v - - 1 b w 11
2 b w a v 2 c x 12
3 c x b w 3 d y 13
4 d y c x 4 e z 14
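Both answers produce the lagged columns but not in the column order the question asked for. A possible way to get that exact layout, with '_' as the filler, is to build the columns one by one and concatenate them. This is a sketch; the lag list and the loop structure are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'time': [0, 1, 2, 3, 4],
    'colA': ['a', 'b', 'c', 'd', 'e'],
    'colB': ['v', 'w', 'x', 'y', 'z'],
    'value': [10, 11, 12, 13, 14],
})

lags = [2, 1]  # largest lag first, to match the requested column order
parts = []
for col in ['colA', 'colB']:
    for lag in lags:
        # shift moves values down by `lag` rows; the vacated rows get '_'
        parts.append(df[col].shift(lag, fill_value='_').rename(f'{col}-{lag}'))
    parts.append(df[col])
parts.append(df['value'])

out = pd.concat(parts, axis=1)
print(out)
```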

Pandas - aggregate over inconsistent values types (string vs list)

Given the following DataFrame, I am trying to aggregate over columns 'A' and 'C': for 'A', count the unique strings, and for 'C', sum the values.
The problem arises when some of the entries in 'A' are actually lists of those strings.
Here's a simplified example:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
'A' : ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
'C' : [1, 2, 15, 5, 13, 6, 7, 1]})
df
Out[100]:
ID A C
0 1 a 1
1 1 a 2
2 1 a 15
3 1 b 5
4 1 [b, c, d] 13
5 2 a 6
6 2 a 7
7 2 [a, b, c] 1
aggs = {'A' : lambda x: x.nunique(dropna=True),
'C' : 'sum'}
# This raises an error: TypeError: unhashable type: 'list'
agg_df = df.groupby('ID').agg(aggs)
I'd like the following output:
print(agg_df)
A C
ID
1 4 36
2 3 14
This is the expected result because for 'ID' = 1 we have 'a', 'b', 'c' and 'd', and for 'ID' = 2 we have 'a', 'b' and 'c'.
One solution is to split your problem into 2 parts. First flatten your dataframe to ensure df['A'] consists only of strings. Then concatenate a couple of GroupBy operations.
Step 1: Flatten your dataframe
You can use itertools.chain and numpy.repeat to chain and repeat values as appropriate.
import numpy as np
from itertools import chain
# Wrap bare strings in a list so every entry of A is a list
A = df['A'].apply(lambda x: [x] if not isinstance(x, list) else x)
lens = A.map(len)
res = pd.DataFrame({'ID': np.repeat(df['ID'], lens),
                    'A': list(chain.from_iterable(A)),
                    'C': np.repeat(df['C'], lens)})
print(res)
#    ID  A   C
# 0   1  a   1
# 1   1  a   2
# 2   1  a  15
# 3   1  b   5
# 4   1  b  13
# 4   1  c  13
# 4   1  d  13
# 5   2  a   6
# 6   2  a   7
# 7   2  a   1
# 7   2  b   1
# 7   2  c   1
Step 2: Concatenate GroupBy on original and flattened
agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
# A C
# ID
# 1 4 36
# 2 3 14
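On pandas 0.25 or later, the flattening step can probably be shortened with DataFrame.explode, which expands list entries into separate rows and leaves plain strings untouched. The following is a sketch of that variant, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})

# Count unique strings on the exploded frame, but sum C on the original frame
# so that values attached to a list are not counted once per list element.
agg_df = pd.concat([df.explode('A').groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
```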
