I have data that looks like this:
A B C
1 1 1
1 1 5
1 2 7
1 2 3
2 1 8
2 1 10
2 2 1
2 2 4
I need to group by A and B and sum C, then get the mean of (sum C) for each unique value in A.
Output1:
A B SumC
1 1 6
1 2 10
2 1 18
2 2 5
Output2:
A Mean C
1 8
2 11.5
My attempt:
DailyCount_ps = (df_new.groupby(["A","B"])["C"].sum()).rename("Sum C")
Any help?
Well, you can do it in two steps:
df = df.groupby(['A', 'B'], as_index=False)['C'].sum().rename({'C': 'Sum C'}, axis=1)
df['Mean C'] = df.groupby('A')['Sum C'].transform('mean')
df
A B Sum C Mean C
0 1 1 6 8.0
1 1 2 10 8.0
2 2 1 18 11.5
3 2 2 5 11.5
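If Output2 is also wanted as its own table, a second groupby over the intermediate result works. A minimal, self-contained sketch reproducing the example data (the variable names `sums` and `means` are my own):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                   'B': [1, 1, 2, 2, 1, 1, 2, 2],
                   'C': [1, 5, 7, 3, 8, 10, 1, 4]})

# Step 1: sum C within each (A, B) group
sums = df.groupby(['A', 'B'], as_index=False)['C'].sum() \
         .rename(columns={'C': 'Sum C'})

# Step 2: mean of those group sums per unique A
means = sums.groupby('A', as_index=False)['Sum C'].mean() \
            .rename(columns={'Sum C': 'Mean C'})
print(means)
```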
I cannot solve a very easy/simple problem in pandas. :(
I have the following table:
df = pd.DataFrame(data=dict(a=[1, 1, 1,2, 2, 3,1], b=["A", "A","B","A", "B", "A","A"]))
df
Out[96]:
a b
0 1 A
1 1 A
2 1 B
3 2 A
4 2 B
5 3 A
6 1 A
I would like to make an incrementing ID for each unique (a, b) group. So the result would look like this (column c):
Out[98]:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I tried with:
df.groupby(["a", "b"]).nunique().cumsum().reset_index()
Result:
Out[105]:
a b c
0 1 A 1
1 1 B 2
2 2 A 3
3 2 B 4
4 3 A 5
Unfortunately this works only on the grouped dataset and not on the original one. As you can see, the original table has 7 rows while the groupby returns only 5.
So could someone please help me on how to get the desired table:
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
Thank you in advance!
groupby + ngroup
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
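For reference, a self-contained version of the ngroup approach on the question's data (note that ngroup numbers groups in sorted key order, which happens to coincide with first-appearance order here):

```python
import pandas as pd

df = pd.DataFrame(dict(a=[1, 1, 1, 2, 2, 3, 1],
                       b=["A", "A", "B", "A", "B", "A", "A"]))

# ngroup assigns each (a, b) group an integer ID in sorted key order;
# +1 shifts the IDs to start at 1 instead of 0
df['c'] = df.groupby(['a', 'b']).ngroup() + 1
```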
Use pd.factorize after creating a tuple from the (a, b) columns:
df['c'] = pd.factorize(df[['a', 'b']].apply(tuple, axis=1))[0] + 1
print(df)
# Output
a b c
0 1 A 1
1 1 A 1
2 1 B 2
3 2 A 3
4 2 B 4
5 3 A 5
6 1 A 1
I've got a df
df1
a b
4 0 1
5 0 1
6 0 2
2 0 3
3 1 2
15 1 3
12 1 3
13 1 1
15 3 1
14 3 1
8 3 3
9 3 2
10 3 1
The df should be grouped by a and b, and I need a column c that numbers the b-groups from 1 upward within each subgroup of a.
df1
a b c
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 4
How can I do that?
We can use groupby + transform with factorize:
df['C'] = df.groupby('a')['b'].transform(lambda x: x.factorize()[0] + 1)
4 1
5 1
6 2
2 3
3 1
15 2
12 2
13 3
15 1
14 1
8 2
9 3
10 1
Name: b, dtype: int64
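A runnable version of the transform/factorize approach with the question's data (I use a plain RangeIndex rather than the question's custom index, just to keep the sketch minimal):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3],
                   'b': [1, 1, 2, 3, 2, 3, 3, 1, 1, 1, 3, 2, 1]})

# Within each a-group, factorize numbers the distinct b values
# in order of first appearance (0-based), so +1 starts the IDs at 1
df['C'] = df.groupby('a')['b'].transform(lambda x: x.factorize()[0] + 1)
```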
Just so we can see the loop version
from itertools import count
from collections import defaultdict
x = defaultdict(count)
y = {}
c = []
for a, b in zip(df.a, df.b):
    if (a, b) not in y:
        y[(a, b)] = next(x[a]) + 1
    c.append(y[(a, b)])
df.assign(C=c)
a b C
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 1
One option is to groupby a, iterate through each group, and groupby b within it. Then you can use ngroup:
df['c'] = np.hstack([g.groupby('b').ngroup().to_numpy() for _,g in df.groupby('a')])
a b c
4 0 1 0
5 0 1 0
6 0 2 1
2 0 3 2
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 0
15 3 1 0
14 3 1 0
8 3 3 2
9 3 2 1
10 3 1 0
You can use groupby.rank if you don't care about the order of first appearance in the data:
df['c'] = df.groupby('a')['b'].rank('dense').astype(int)
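A quick sketch of that caveat, again with a plain RangeIndex: dense rank numbers the distinct b values by magnitude within each a-group, not by order of first appearance:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3],
                   'b': [1, 1, 2, 3, 2, 3, 3, 1, 1, 1, 3, 2, 1]})

# 'dense' rank: ties share a rank and ranks increase by 1 between
# distinct values, so c numbers the distinct b values by size within each a
df['c'] = df.groupby('a')['b'].rank('dense').astype(int)
```

For a=1 the b values 2, 3, 3, 1 get c = 2, 3, 3, 1 (by magnitude), whereas the factorize approaches above would give 1, 2, 2, 3 (by appearance).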
I want to add some columns for group features (std, mean, ...). The code below works, but the dataset is really big and the performance is bad. Is there any good way to improve the code? Thanks.
import pandas as pd
df = pd.DataFrame([[1,2,1], [1,2,2], [1,3,3], [1,3,4],[2,8,9], [2,11,11]], columns=['A', 'B', 'C'])
df['mean'] = 0
df2 = df.groupby('A')
for a, group in df2:
    mean = group['C'].mean()
    df.loc[df['A'] == a, 'mean'] = mean
df
'''
A B C mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0
'''
Pandas' groupby.transform does the job of broadcasting aggregate statistics across the original index. This makes it perfect for your purposes and should be considered the idiomatic way to perform this task.
Pipelined solution that produces a copy of df with the new column:
df.assign(Mean=df.groupby('A').C.transform('mean'))
A B C Mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0
In-place assignment:
df['Mean'] = df.groupby('A').C.transform('mean')
df
A B C Mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0
Alternatively, you can use pd.factorize and np.bincount:
import numpy as np
f, u = pd.factorize(df.A.values)
totals = np.bincount(f, df.C.values)
counts = np.bincount(f)
df.assign(Mean=(totals / counts)[f])
A B C Mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0
Here is one way:
s = df.groupby('A')['C'].mean()
df['mean'] = df['A'].map(s)
# A B C mean
# 0 1 2 1 2.5
# 1 1 2 2 2.5
# 2 1 3 3 2.5
# 3 1 3 4 2.5
# 4 2 8 9 10.0
# 5 2 11 11 10.0
Explanation
First, group by 'A' and calculate the mean of 'C'. This creates a Series whose index is the unique entries of 'A' and whose values are the group means.
Second, map this Series onto your dataframe. This is possible because pd.Series.map can take a Series as input.
You can call mean on the index level:
df.assign(mean=df.A.map(df.set_index('A').C.mean(level=0)))
Out[28]:
A B C mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0
Or using get
df['mean']=df.set_index('A').C.mean(level=0).get(df.A).values
df
Out[35]:
A B C mean
0 1 2 1 2.5
1 1 2 2 2.5
2 1 3 3 2.5
3 1 3 4 2.5
4 2 8 9 10.0
5 2 11 11 10.0
I have a data frame like below
df=pd.DataFrame({'a':['a','a','b','a','b','a','a','a'], 'b' : [1,0,0,1,0,1,1,1], 'c' : [1,2,3,4,5,6,7,8],'d':['1','2','1','2','1','2','1','2']})
df
Out[94]:
a b c d
0 a 1 1 1
1 a 0 2 2
2 b 0 3 1
3 a 1 4 2
4 b 0 5 1
5 a 1 6 2
6 a 1 7 1
7 a 1 8 2
I want something like below
df[(df['a']=='a') & (df['b']==1)]
In [97]:
df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].rank()
Out[97]:
0 1
3 1
5 2
6 2
7 3
dtype: float64
I want this rank as a new column in dataframe df, and wherever there is no rank I want NaN. So the final output will be something like below:
a b c d rank
0 a 1 1 1 1
1 a 0 2 2 NaN
2 b 0 3 1 NaN
3 a 1 4 2 1
4 b 0 5 1 NaN
5 a 1 6 2 2
6 a 1 7 1 2
7 a 1 8 2 3
I will appreciate all the help and guidance. Thanks a lot.
Almost there, you just need to call transform to return a series with an index aligned to your original df:
In [459]:
df['rank'] = df[(df['a']=='a') & (df['b']==1)].groupby('d')['c'].transform(pd.Series.rank)
df
Out[459]:
a b c d rank
0 a 1 1 1 1
1 a 0 2 2 NaN
2 b 0 3 1 NaN
3 a 1 4 2 1
4 b 0 5 1 NaN
5 a 1 6 2 2
6 a 1 7 1 2
7 a 1 8 2 3
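Since groupby.rank already returns a Series aligned to the filtered index, an equivalent sketch without transform also works; rows outside the filter are left NaN by index alignment (reconstructing the question's data):

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'b', 'a', 'b', 'a', 'a', 'a'],
                   'b': [1, 0, 0, 1, 0, 1, 1, 1],
                   'c': [1, 2, 3, 4, 5, 6, 7, 8],
                   'd': ['1', '2', '1', '2', '1', '2', '1', '2']})

# Rank only the filtered subset; assigning the result back aligns on the
# index, so the unfiltered rows get NaN automatically
mask = (df['a'] == 'a') & (df['b'] == 1)
df['rank'] = df[mask].groupby('d')['c'].rank()
```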
I have two dfs.
First df
A B C
1 1 3
1 1 2
1 2 5
2 2 7
2 3 7
Second df
B D
1 5
2 6
3 4
The column B has the same meaning in both dfs. What is the easiest way to add column D with the corresponding values to the first df? The output should be:
A B C D
1 1 3 5
1 1 2 5
1 2 5 6
2 2 7 6
2 3 7 4
Perform a 'left' merge in your case on column 'B':
In [206]:
df.merge(df1, how='left', on='B')
Out[206]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
Another method would be to set 'B' on your second df as the index and then call map:
In [215]:
df1 = df1.set_index('B')
df['D'] = df['B'].map(df1['D'])
df
Out[215]:
A B C D
0 1 1 3 5
1 1 1 2 5
2 1 2 5 6
3 2 2 7 6
4 2 3 7 4
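One behavioral note, sketched below with hypothetical data of my own: with a left merge (or map), any B value in the first df that has no match in the second simply gets NaN in D.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 2, 9], 'C': [3, 5, 7]})
df1 = pd.DataFrame({'B': [1, 2, 3], 'D': [5, 6, 4]})

# B == 9 has no counterpart in df1, so its D becomes NaN
# (and the D column is upcast to float to hold the NaN)
out = df.merge(df1, how='left', on='B')
```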