I have a pandas DataFrame and I need to subtract rows if they belong to the same group.
Input DataFrame:
A   B   Value
A1  B1  10.0
A1  B1  5.0
A1  B2  5.0
A2  B1  3.0
A2  B1  5.0
A2  B2  1.0
Expected DataFrame:
A   B   Value
A1  B1  5.0
A1  B2  5.0
A2  B1  -2.0
A2  B2  1.0
Logic: For example, the first and second rows of the DataFrame are in group (A1, B1), so the value must be 10.0 - 5.0 = 5.0. The 4th and 5th rows are in the same group as well, so the value must be 3.0 - 5.0 = -2.0.
Rows should only be subtracted if they have the same A and B values.
Thank you!
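For reference, the example input can be rebuilt as a DataFrame so the snippets below run standalone (a minimal sketch based on the table above):
import pandas as pd

df = pd.DataFrame({
    'A': ['A1', 'A1', 'A1', 'A2', 'A2', 'A2'],
    'B': ['B1', 'B1', 'B2', 'B1', 'B1', 'B2'],
    'Value': [10.0, 5.0, 5.0, 3.0, 5.0, 1.0],
})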
Try:
subtract = lambda x: x.iloc[0] - (x.iloc[1] if len(x) == 2 else 0)
out = df.groupby(['A', 'B'])['Value'].apply(subtract).reset_index()
print(out)
# Output:
A B Value
0 A1 B1 5.0
1 A1 B2 5.0
2 A2 B1 -2.0
3 A2 B2 1.0
You can mark the duplicated rows for subtraction by negating them and then sum after grouping. This also works for more than one duplicated row per group, provided the rows are in the correct order.
import pandas as pd

# read the example table straight from the linked question, then negate
# every row that duplicates an earlier (A, B) pair and sum per group
df = pd.read_html('https://stackoverflow.com/q/70438208/14277722')[0]
df.loc[df.duplicated(subset=['A','B']), 'Value'] *= -1
df.groupby(['A','B'], as_index=False).sum()
Output
A B Value
0 A1 B1 5.0
1 A1 B2 5.0
2 A2 B1 -2.0
3 A2 B2 1.0
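To illustrate the claim about more than one duplicated row, here is a small sketch with a hypothetical three-row group (my own data, not from the question): only the first row keeps its sign, so the group collapses to 10.0 - 5.0 - 2.0 = 3.0.
import pandas as pd

df3 = pd.DataFrame({'A': ['A1'] * 3, 'B': ['B1'] * 3, 'Value': [10.0, 5.0, 2.0]})
df3.loc[df3.duplicated(subset=['A', 'B']), 'Value'] *= -1   # negate every duplicate of (A, B)
print(df3.groupby(['A', 'B'], as_index=False).sum())
#     A   B  Value
# 0  A1  B1    3.0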
Let us pass the condition within the groupby apply:
out = df.groupby(['A','B'])['Value'].apply(lambda x : x.iloc[0]-x.iloc[-1] if len(x)>1 else x.iloc[0]).reset_index(name = 'Value')
Out[18]:
A B Value
0 A1 B1 5.0
1 A1 B2 5.0
2 A2 B1 -2.0
3 A2 B2 1.0
One option is to use numpy's np.subtract.reduce:
import numpy as np
df.groupby(['A', 'B'], as_index = False).Value.agg(np.subtract.reduce)
A B Value
0 A1 B1 5.0
1 A1 B2 5.0
2 A2 B1 -2.0
3 A2 B2 1.0
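As a side note (my own check, not part of the answer): np.subtract.reduce applied to a one-element group simply returns that element, which is why the single-row groups A1/B2 and A2/B2 pass through unchanged.
import numpy as np

print(np.subtract.reduce([5.0]))        # 5.0  (one-row group is returned as-is)
print(np.subtract.reduce([10.0, 5.0]))  # 5.0  (10.0 - 5.0)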
Probably easier to understand with an example, so here we go:
import numpy as np
import pandas as pd

random = np.random.uniform(size=3)
what_i_have = pd.DataFrame({
    ('a', 'a'): random,
    ('b', 'b1'): np.linspace(3, 5, 3),
    ('b', 'b2'): np.linspace(6, 8, 3),
    ('b', 'b3'): np.linspace(9, 11, 3)
})
what_i_want = pd.DataFrame({
    ('a', 'a'): np.concatenate((random, random, random)),
    ('b', 'b_category'): ['b1']*3 + ['b2']*3 + ['b3']*3,
    ('b', 'b_value'): np.linspace(3, 11, 9)
})
print(what_i_have)
print('----------------------------------')
print(what_i_want)
Output:
          a    b
          a   b1   b2    b3
0  0.587075  3.0  6.0   9.0
1  0.798710  4.0  7.0  10.0
2  0.206860  5.0  8.0  11.0
----------------------------------
          a          b
          a b_category b_value
0  0.587075         b1     3.0
1  0.798710         b1     4.0
2  0.206860         b1     5.0
3  0.587075         b2     6.0
4  0.798710         b2     7.0
5  0.206860         b2     8.0
6  0.587075         b3     9.0
7  0.798710         b3    10.0
8  0.206860         b3    11.0
My issue is that my data doesn't just have b1, b2, b3; it also has b4, b5, b6, and so on, all the way to about b90. The obvious solution would be a loop that creates 90 DataFrames, one per category, and then concatenates them into one DataFrame (sketched below), but I imagine there must be a better way of doing it.
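For concreteness, the loop approach referred to above would look roughly like this (my own sketch, using the what_i_have frame built earlier):
import pandas as pd

frames = []
for cat in what_i_have['b'].columns:   # 'b1', 'b2', ... (up to 'b90' in the real data)
    frames.append(pd.DataFrame({
        ('a', 'a'): what_i_have[('a', 'a')],
        ('b', 'b_category'): cat,
        ('b', 'b_value'): what_i_have[('b', cat)],
    }))
loop_result = pd.concat(frames, ignore_index=True)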
edit:
what_i_have.unstack() doesn't really solve the issue, as can be seen below. It could be an intermediate step, but there is still work to do on this result before reaching what I want, and I don't see much advantage in doing this over the loop solution mentioned above:
a  a   0     0.587075
       1     0.798710
       2     0.206860
b  b1  0     3.000000
       1     4.000000
       2     5.000000
   b2  0     6.000000
       1     7.000000
       2     8.000000
   b3  0     9.000000
       1    10.000000
       2    11.000000
Keeping the MultiIndex (there might be a better way out there, though):
df = what_i_have.copy()   # assuming df starts as the what_i_have frame from the question
df = df.melt(id_vars=[df.columns[0]], var_name=['b','b_category'], value_name='b_value')
a = df[[('a','a')]]
b = df[['b', 'b_category', 'b_value']].pivot(columns='b').swaplevel(0,1, axis=1)
df = pd.concat([a, b], axis=1)
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df)
Output:
          a          b
          a b_category b_value
0  0.737076         b1     3.0
1  0.718409         b1     4.0
2  0.269516         b1     5.0
3  0.737076         b2     6.0
4  0.718409         b2     7.0
5  0.269516         b2     8.0
6  0.737076         b3     9.0
7  0.718409         b3    10.0
8  0.269516         b3    11.0
what_i_have.columns = what_i_have.columns.droplevel()
what_i_have.melt(id_vars=['a'], value_vars=['b1','b2','b3'], var_name='b_category', value_name='b_value')
I think these commands should solve your issue. You lose the MultiIndex due to the melt; I'm not sure how to get around that at the moment, but otherwise this should meet your requirements.
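Since the real data runs up to about b90, value_vars can simply be omitted after the droplevel above and melt will pick up every remaining column (a sketch under that assumption):
what_i_have.melt(id_vars=['a'], var_name='b_category', value_name='b_value')   # melts b1..b90 without listing them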
Pull out b and melt:
out = what_i_have.pop('b')
# we need this length to extend
# what_i_have
size = out.columns.size
out = out.melt()
out.columns = ['b_category', 'b_value']
out.columns = [['b', 'b'], out.columns]
Reindex what is left of what_i_have and concatenate:
import numpy as np

index = np.tile(what_i_have.a.index, size)
what_i_want = what_i_have.reindex(index).reset_index(drop = True)
pd.concat([what_i_want, out], axis = 1)
          a          b
          a b_category b_value
0  0.883754         b1     3.0
1  0.172427         b1     4.0
2  0.631352         b1     5.0
3  0.883754         b2     6.0
4  0.172427         b2     7.0
5  0.631352         b2     8.0
6  0.883754         b3     9.0
7  0.172427         b3    10.0
8  0.631352         b3    11.0
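A quick illustration (my own, not from the answer) of what np.tile does to the index here: repeating [0, 1, 2] once per b column produces the row order needed to line the 'a' values up with the melted b columns.
import numpy as np

print(np.tile([0, 1, 2], 3))   # [0 1 2 0 1 2 0 1 2]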
I have the following dataframe df:
A B Var Value
0 A1 B1 T1name T1
1 A2 B2 T1name T1
2 A1 B1 T2name T2
3 A2 B2 T2name T2
4 A1 B1 T1res 1
5 A2 B2 T1res 1
6 A1 B1 T2res 2
7 A2 B2 T2res 2
I now want to 'halve' my dataframe because Var contains variables that should not go under the same column. My intended outcome is:
A B Name Value
0 A1 B1 T1 1
1 A2 B2 T1 1
2 A1 B1 T2 2
3 A2 B2 T2 2
What should I use to unpivot this correctly?
Just filter the rows where the string contains 'res' and assign a new column holding the first two characters of the Var column:
df[df['Var'].str.contains('res')].assign(Name=df['Var'].str[:2]).drop(columns='Var')
A B Value Name
4 A1 B1 1 T1
5 A2 B2 1 T1
6 A1 B1 2 T2
7 A2 B2 2 T2
Note that this returns a new DataFrame rather than modifying df in place.
You can also simply drop the name rows:
df = df[~df['Var'].isin(['T1name','T2name'])]
Output:
A B Var Value
4 A1 B1 T1res 1
5 A2 B2 T1res 1
6 A1 B1 T2res 2
7 A2 B2 T2res 2
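To get from here to the expected output, the 'res' suffix still has to be stripped and the column renamed; a hedged sketch of that last step, reusing the str slicing from the answer above:
df = df.assign(Name=df['Var'].str[:2]).drop(columns='Var')[['A', 'B', 'Name', 'Value']]   # 'T1res' -> 'T1'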
There are different options available looking at the df; regex seems to be at the top of the list. If regex doesn't work, maybe think about redefining your problem:
Filter Value by whether it is numeric, strip the unwanted 'res' suffix with a regex replace, and rename the Var column to Name. Code below:
df[df['Value'].str.isnumeric()].replace(regex=r'res$', value='').rename(columns={'Var':'Name'})
A B Name Value
4 A1 B1 T1 1
5 A2 B2 T1 1
6 A1 B1 T2 2
7 A2 B2 T2 2
I have a DataFrame in which I need to group by A, then show a count of instances of B separated into B1 and B2, and finally the percentage of those instances that are > 0.1, so I did this to get the first two:
A B C
id
118 a1 B1 0
119 a1 B1 0
120 a1 B1 101.1
121 a1 B1 106.67
122 a1 B2 103.33
237 a1 B2 100
df = pd.DataFrame(df.groupby(
['A', 'B'])['B'].aggregate('count')).unstack(level=1)
which gets the first part right:
     B
B   B1  B2
A
a1   4   2
a2   7   9
a3   9  17
a4   8   8
a5   7   8
But then, when I need to get the percentage of the count that is > 0:
prcnt_complete = df[['A', 'B', 'C']]
prcnt_complete['passed'] = prcnt_complete['C'].apply(lambda x: (float(x) > 1))
prcnt_complete = prcnt_complete.groupby(['A', 'B', 'passed']).count()
I get weird values that make no sense; sometimes the True and False counts don't even add up. I'm trying to understand what I'm doing wrong in the order of operations so that I can make sense of it.
The result I'm looking for is something like this:
     B      passed
B   B1  B2      B1  B2
A
a1   4   2       2   2
a2   7   9       7   6
a3   9  17       9   5
Thanks,
You can do:
(df['C'].gt(1).groupby([df['A'], df['B']])
    .agg(['size', 'sum'])
    .rename(columns={'size': 'B', 'sum': 'passed'})
    .unstack('B')
)
Output (from sample data):
    B     passed
B  B1 B2      B1 B2
A
a1  4  2       2  2
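Since the question also asks for the percentage of instances that pass, the mean of the same boolean mask gives it directly (a follow-up sketch on top of the answer above, keeping its gt(1) threshold):
pct_passed = df['C'].gt(1).groupby([df['A'], df['B']]).mean().unstack('B') * 100
print(pct_passed)
# B     B1     B2
# A
# a1  50.0  100.0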
While working on your problem, I also wanted to see if I could get the average percentage for B (while ignoring 0s). I was able to accomplish this as well while getting the counts.
DataFrame for this exercise:
A B C
0 a1 B1 0.00
1 a1 B1 0.00
2 a1 B1 98.87
3 a1 B1 101.10
4 a1 B2 106.67
5 a1 B2 103.00
6 a2 B1 0.00
7 a2 B1 0.00
8 a2 B1 33.00
9 a2 B1 100.00
10 a2 B2 80.00
11 a3 B1 90.00
12 a3 B2 99.00
Average while excluding the zeros: for this I had to add .replace(0, np.nan) before the groupby function.
import pandas as pd
import numpy as np

A = ['a1','a1','a1','a1','a1','a1','a2','a2','a2','a2','a2','a3','a3']
B = ['B1','B1','B1','B1','B2','B2','B1','B1','B1','B1','B2','B1','B2']
C = [0, 0, 98.87, 101.1, 106.67, 103, 0, 0, 33, 100, 80, 90, 99]

df = pd.DataFrame({'A': A, 'B': B, 'C': C})
df = pd.DataFrame(df.replace(0, np.nan)
                    .groupby(['A', 'B'])
                    .agg({'B': 'size', 'C': ['count', 'mean']})
                    .rename(columns={'size': 'Count', 'count': 'Passed', 'mean': 'Avg Score'})
                  ).unstack(level=1)
df.columns = df.columns.droplevel(0)
   Count     Passed     Avg Score
B     B1  B2     B1  B2         B1       B2
A
a1     4   2      2   2     99.985  104.835
a2     4   1      2   1     66.500   80.000
a3     1   1      1   1     90.000   99.000
How can I join Series A multiindexed by (A, B) with Series B indexed by A?
One way is to bring the indices to a common footing -- e.g. move the B level of the series_A MultiIndex to a column so that both series_A and series_B are indexed only by A:
import pandas as pd
series_A = pd.Series(1, index=pd.MultiIndex.from_product([['A1', 'A4'],['B1','B2']], names=['A','B']), name='series_A')
# A B
# A1 B1 1
# B2 1
# A4 B1 1
# B2 1
# Name: series_A, dtype: int64
series_B = pd.Series(2, index=pd.Index(['A1', 'A2', 'A3'], name='A'), name='series_B')
# A
# A1 2
# A2 2
# A3 2
# Name: series_B, dtype: int64
tmp = series_A.to_frame().reset_index('B')
result = tmp.join(series_B, how='outer').set_index('B', append=True)
print(result)
yields
        series_A  series_B
A  B
A1 B1        1.0       2.0
   B2        1.0       2.0
A2 NaN       NaN       2.0
A3 NaN       NaN       2.0
A4 B1        1.0       NaN
   B2        1.0       NaN
Another way to join them would be to unstack the B level from series_A:
In [215]: series_A.unstack('B').join(series_B, how='outer')
Out[215]:
B1 B2 series_B
A
A1 1.0 1.0 2.0
A2 NaN NaN 2.0
A3 NaN NaN 2.0
A4 1.0 1.0 NaN
unstack moves the B index level to the column index. Thus the theme is the
same (bring the indices to a common footing), though the result is different.
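One more variant (my own, and limited to the default left join): DataFrame.join can match on an index level name, so series_B can be attached without resetting the MultiIndex; rows that exist only in series_B (A2, A3) are dropped here.
result_left = series_A.to_frame().join(series_B, on='A')   # 'A' is an index level of series_A
#        series_A  series_B
# A  B
# A1 B1         1       2.0
#    B2         1       2.0
# A4 B1         1       NaN
#    B2         1       NaN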
I want to display the result of a single-value aggregation with two group-bys as a table.
Such that
df.groupby(['colA', 'colB']).size()
Would yield:
B1 B2 B3 B4
A1 s11 s12 s13 ..
A2 s21 s22 s23 ..
A3 s31 s32 s33 ..
A4 .. .. .. s44
What's a quick and easy way of doing this?
EDIT: here's an example. I have the logins of all users, and I want to display the number of logins (=rows) for each user and day
Day,User
1,John
1,John
1,Ben
1,Sarah
2,Ben
2,Sarah
2,Sarah
Should yield:
D\U John Ben Sarah
1 2 1 1
2 0 1 2
Use:
df.groupby(['colA', 'colB']).size().unstack()
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.transpose([np.random.choice(['B1','B2','B3'], size=10),
                                np.random.choice(['A1','A2','A3'], size=10)]),
                  columns=['A','B'])
df
A B
0 B3 A1
1 B1 A2
2 B3 A3
3 B1 A3
4 B2 A2
5 B3 A3
6 B3 A1
7 B2 A1
8 B1 A3
9 B3 A3
Now:
df.groupby(['A','B']).size().unstack()
B A1 A2 A3
A
B1 NaN 1.0 2.0
B2 1.0 1.0 NaN
B3 2.0 NaN 3.0
Update now that your post has data:
df.groupby(['Day','User']).size().unstack().fillna(0)
User Ben John Sarah
Day
1 1.0 2.0 1.0
2 1.0 0.0 2.0
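An equivalent one-liner worth mentioning (my own suggestion, not from the answer): pd.crosstab counts the Day/User pairs directly and fills missing combinations with 0, so the fillna step isn't needed.
pd.crosstab(df['Day'], df['User'])
# User  Ben  John  Sarah
# Day
# 1       1     2      1
# 2       1     0      2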