I'm new to pandas, and I want to create a new column in my pandas dataframe. I'd like to groupby one column, and then divide two other columns together.
This works perfectly:
df['new_col'] = (df.col2/df.col3)
However, when I groupby another column, what I have doesn't work:
df['new_col'] = df.groupby('col1')(df.col2/df.col3)
Does anyone know how I can rewrite the above code? Thanks.
Setup
df = pd.DataFrame(dict(
Col1=list('AAAABBBB'),
Col2=range(1, 9, 1),
Col3=range(9, 1, -1)
))
df
Col1 Col2 Col3
0 A 1 9
1 A 2 8
2 A 3 7
3 A 4 6
4 B 5 5
5 B 6 4
6 B 7 3
7 B 8 2
Solution
Using pd.DataFrame.eval
We can use eval to create new columns in a pipeline:
df.groupby('Col1', as_index=False).sum().eval('Col4 = Col2 / Col3')
Col1 Col2 Col3 Col4
0 A 10 30 0.333333
1 B 26 14 1.857143
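If the goal is instead to keep one row per original record and attach the per-group ratio of sums back onto df (an assumption about the question's intent, since it never aggregates), a transform-based sketch:
# Sketch: each row carries its group's sum(Col2) / sum(Col3) instead of collapsing to one row per group
g = df.groupby('Col1')
df['Col4'] = g['Col2'].transform('sum') / g['Col3'].transform('sum')
df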
This may be what you are looking for:
import pandas as pd
df = pd.DataFrame([['A', 4, 3], ['B', 2, 4], ['C', 5, 1], ['A', 5, 1], ['B', 2, 7]],
columns=['Col1', 'Col2', 'Col3'])
# Col1 Col2 Col3
# 0 A 4 3
# 1 B 2 4
# 2 C 5 1
# 3 A 5 1
# 4 B 2 7
df['Col4'] = df['Col2'] / df['Col3']
df = df.sort_values('Col1')
# Col1 Col2 Col3 Col4
# 0 A 4 3 1.333333
# 3 A 5 1 5.000000
# 1 B 2 4 0.500000
# 4 B 2 7 0.285714
# 2 C 5 1 5.000000
Or if you need to perform a groupby.sum first:
df = df.groupby('Col1', as_index=False).sum()
df['Col4'] = df['Col2'] / df['Col3']
# Col1 Col2 Col3 Col4
# 0 A 9 4 2.250000
# 1 B 4 11 0.363636
# 2 C 5 1 5.000000
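The two steps above can also be chained into a single expression with assign (an alternative sketch, not part of the original answer):
# Equivalent to the groupby.sum followed by the explicit division
df.groupby('Col1', as_index=False).sum().assign(Col4=lambda d: d['Col2'] / d['Col3'])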
Related
Let's say that we have this dataframe:
d = {'col1': [1, 2, 0, 55, 12], 'col2': [3, 4, 44, 34, 46], 'col3': ['A', 'A', 'B', 'B', 'A']}
df = pd.DataFrame(data=d)
df
col1 col2 col3
0 1 3 A
1 2 4 A
2 0 44 B
3 55 34 B
4 12 46 A
I want another column that counts, for each consecutive run of A or B, how many rows are in that run, as follows:
col1 col2 col3 count
0 1 3 A 2
1 2 4 A 2
2 0 44 B 2
3 55 34 B 2
4 12 46 A 1
I have tried groupby, but it does not do what I want. Could you please help?
You can create consecutive groups by comparing shifted values for inequality with a cumulative sum, and then pass the result to GroupBy.transform with GroupBy.size:
g = df['col3'].ne(df['col3'].shift()).cumsum()
df['count'] = df.groupby(g)['col3'].transform('size')
print (df)
col1 col2 col3 count
0 1 3 A 2
1 2 4 A 2
2 0 44 B 2
3 55 34 B 2
4 12 46 A 1
Or, as an alternative, with Series.value_counts and Series.map:
s = df['col3'].ne(df['col3'].shift()).cumsum()
df['count'] = s.map(s.value_counts())
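To see why this works, it helps to print the intermediate group ids (a small illustration on the same data):
# ne(shift) marks the first row of each consecutive run; cumsum turns the marks into run ids
g = df['col3'].ne(df['col3'].shift()).cumsum()
print(g.tolist())   # [1, 1, 2, 2, 3]  ->  runs A A | B B | A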
import pandas as pd
d = {'col1': [1, 2, 0, 55, 12], 'col2': [3, 4, 44, 34, 46], 'col3': ["A", "A", "B", "B", "A"]}
df = pd.DataFrame(data=d)
len_of_col1=len(df.col1)
len_of_col2=len(df.col2)
len_of_col3=len(df.col3)
print(len_of_col1,len_of_col2,len_of_col3)
I have 2 dataframes, df1 and df2, and I want to compare 'col1' of both dataframes and get the rows from df1 whose 'col1' values don't appear in df2. Only 'col1' is common to both dataframes.
Suppose I have:
df1 = pd.DataFrame({
'col1': range(1, 6),
'col2': range(10, 60, 10),
'col3': [*'abcde']
})
df2 = pd.DataFrame({
'col1': range(1, 4),
'cola': ['Aa', 'bcd', 'h'],
'colb': [12, 'sadf', 'dd']
})
print(df1)
col1 col2 col3
0 1 10 a
1 2 20 b
2 3 30 c
3 4 40 d
4 5 50 e
print(df2)
col1 cola colb
0 1 Aa 12
1 2 bcd sadf
2 3 h dd
I want to get:
col1 col2 col3
0 4 40 d
1 5 50 e
Just use this:
df1[~df1['col1'].isin(df2['col1'])]
Now if you print the result of the above code, you will get your desired output:
col1 col2 col3
3 4 40 d
4 5 50 e
Quick and Dirty
# DataFrame.append was removed in pandas 2.0; pd.concat is the drop-in replacement
pd.concat([df1, df1.merge(df2.col1)]).drop_duplicates(keep=False)
col1 col2 col3
3 4 40 d
4 5 50 e
I want to replace all rows that have "A" in the name column with a single row from another df.
This is what I have:
data={"col1":[2,3,4,5,7],
"col2":[4,2,4,6,4],
"col3":[7,6,9,11,2],
"col4":[14,11,22,8,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
This is my single row (the other df):
data2={"col1":[0]
,"col2":[1]
,"col3":[5]
,"col4":[6]
}
df2=pd.DataFrame.from_dict(data2)
df2
This is how I want it to look:
data={"col1":[0,0,4,0,7],
"col2":[1,1,4,1,4],
"col3":[5,5,9,5,2],
"col4":[6,6,22,6,5],
"name":["A","A","V","A","B"],
"n_roll":[8,2,1,3,9]}
df=pd.DataFrame.from_dict(data)
df
I tried df.loc[df["name"]=="A"][df2.columns]=df2, but it did not work.
We can try mask + combine_first:
df = df.mask(df['name'].eq('A'), df2.loc[0], axis=1).combine_first(df)
df
col1 col2 col3 col4 name n_roll
0 0 1 5 6 A 8.0
1 0 1 5 6 A 2.0
2 4 4 9 22 V 1.0
3 0 1 5 6 A 3.0
4 7 4 2 5 B 9.0
df.loc[df["name"]=="A"][df2.columns]=df2 is chained indexing and is not expected to work. For details, see the docs on returning a view versus a copy.
You can also use boolean indexing like this:
df.loc[df['name']=='A', df2.columns] = df2.values
Output:
col1 col2 col3 col4 name n_roll
0 0 1 5 6 A 8
1 0 1 5 6 A 2
2 4 4 9 22 V 1
3 0 1 5 6 A 3
4 7 4 2 5 B 9
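As a side note on why .values is needed (my reading of pandas' alignment rules, not part of the original answer): assigning df2 directly would align df2's index (just the label 0) against the selected row labels 0, 1 and 3, leaving rows 1 and 3 as NaN, whereas the bare NumPy array broadcasts the single row across every selected row.
# Illustration of the alignment pitfall, on a throwaway copy `bad`
bad = df.copy()
bad.loc[bad['name'] == 'A', df2.columns] = df2          # aligns on index -> NaN for labels 1 and 3
df.loc[df['name'] == 'A', df2.columns] = df2.values     # broadcasts the single row, as in the answer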
I have a scenario where I have an existing dataframe and I have a new dataframe which contains rows which might be in the existing frame but might also have new rows. I have struggled to find a reliable way to drop these existing rows from the new dataframe by comparing it with the existing dataframe.
I've done my homework. The solution seems to be to use isin(). However, I find that this has hidden dangers. In particular:
pandas get rows which are NOT in other dataframe
Pandas cannot compute isin with a duplicate axis
Pandas promotes int to float when filtering
Is there a way to reliably filter out rows from one dataframe based on membership/containment in another dataframe? A simple use case which doesn't capture corner cases is shown below. Note that I want to remove the rows in new that are in existing, so that new only contains rows not in existing. The simpler problem of updating existing with new rows from new can be achieved with pd.merge() + DataFrame.drop_duplicates().
In [53]: df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
...: df2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
In [54]: df1
Out[54]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
In [55]: df2
Out[55]:
col1 col2
0 1 10
1 2 11
2 3 12
In [56]: df1[~df1.isin(df2)]
Out[56]:
col1 col2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 13.0
4 5.0 14.0
In [57]: df1[~df1.isin(df2)].dropna()
Out[57]:
col1 col2
3 4.0 13.0
4 5.0 14.0
We can use DataFrame.merge with indicator=True, plus DataFrame.query and DataFrame.drop:
df_filtered=( df1.merge(df2,how='outer',indicator=True)
.query("_merge == 'left_only'")
.drop('_merge',axis=1) )
print(df_filtered)
col1 col2
3 4 13
4 5 14
If now, for example, we change a value in row 0:
df1.iat[0,0]=3
row 0 is no longer filtered out:
df_filtered=( df1.merge(df2,how='outer',indicator=True)
.query("_merge == 'left_only'")
.drop('_merge',axis=1) )
print(df_filtered)
col1 col2
0 3 10
3 4 13
4 5 14
Step by step
df_filtered=( df1.merge(df2,how='outer',indicator=True)
)
print(df_filtered)
col1 col2 _merge
0 3 10 left_only
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 1 10 right_only
df_filtered=( df1.merge(df2,how='outer',indicator=True).query("_merge == 'left_only'")
)
print(df_filtered)
col1 col2 _merge
0 3 10 left_only
3 4 13 left_only
4 5 14 left_only
df_filtered=( df1.merge(df2,how='outer',indicator=True)
.query("_merge == 'left_only'")
.drop('_merge',axis=1)
)
print(df_filtered)
col1 col2
0 3 10
3 4 13
4 5 14
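If this anti-join pattern comes up repeatedly, it can be wrapped in a small helper (a hypothetical convenience function, not from the original answer):
def anti_join(left, right, on=None):
    # Rows of `left` with no matching row in `right`; `on` defaults to the shared columns
    merged = left.merge(right, how='outer', on=on, indicator=True)
    return merged.query("_merge == 'left_only'").drop(columns='_merge')

print(anti_join(df1, df2))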
You may try Series.isin. It is independent of the index, i.e. it only checks values. You just need to convert the columns of each dataframe to a Series of tuples to create the mask:
s1 = df1.agg(tuple, axis=1)
s2 = df2.agg(tuple, axis=1)
df1[~s1.isin(s2)]
Out[538]:
col1 col2
3 4 13
4 5 14
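A plain-Python variant of the same idea builds a set of row tuples (just a sketch; the agg(tuple) version above is usually the more idiomatic choice):
# Membership test on whole rows via a set of tuples; like agg(tuple), it ignores the index
rows_in_df2 = set(map(tuple, df2.to_numpy()))
df1[[tuple(row) not in rows_in_df2 for row in df1.to_numpy()]]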
Let's say I have a dataframe (I'll just use a simple example) that looks like this:
import pandas as pd
df = {'Col1':[3,4,2,6,5,7,3,4,9,7,1,3],
'Col2':['B','B','B','B','A','A','A','A','C','C','C','C',],
'Col3':[1,1,2,2,1,1,2,2,1,1,2,2]}
df = pd.DataFrame(df)
Which gives a dataframe like so:
Col1 Col2 Col3
0 3 B 1
1 4 B 1
2 2 B 2
3 6 B 2
4 5 A 1
5 7 A 1
6 3 A 2
7 4 A 2
8 9 C 1
9 7 C 1
10 1 C 2
11 3 C 2
What I want to do is several steps:
1) For each unique value in Col2, and for each unique value in Col3, average Col1. So a desired output would be:
Avg Col2 Col3
1 3.5 B 1
2 4 B 2
3 6 A 1
4 3.5 A 2
5 8 C 1
6 2 C 2
2) Now, for each unique value in Col3, I want the highest average and the corresponding value in Col2. So
Best Avg Col2 Col3
1 8 C 1
2 4 B 2
My attempt has been using df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'}).groupby(['Col3']).agg({'Col1':'max'})
This gives me the highest average for each Col3 value, but not the corresponding Col2 label. Thank you for any help you can give!
After your first groupby, do sort_values + drop_duplicates:
g1=df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'})
g1.sort_values('Col1').drop_duplicates('Col3',keep='last')
Out[569]:
Col3 Col2 Col1
4 2 B 4.0
2 1 C 8.0
Or, in case you have duplicate maximum mean values:
g1[g1.Col1==g1.groupby('Col3').Col1.transform('max')]
Do the following (I modified your code slightly to make it a bit shorter):
df2 = df.groupby(['Col3','Col2'], as_index = False).mean()
When you print the result, for your input, you will get:
Col3 Col2 Col1
0 1 A 6.0
1 1 B 3.5
2 1 C 8.0
3 2 A 3.5
4 2 B 4.0
5 2 C 2.0
Then run:
res = df2.iloc[df2.groupby('Col3').Col1.idxmax()]
When you print the result, you will get:
Col3 Col2 Col1
2 1 C 8.0
4 2 B 4.0
As you can see, idxmax gives the index of the row with the maximal element (for each group), and this result can be used as the argument of iloc.
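One caveat worth noting (my own observation, not part of the answer): idxmax returns index labels, which here coincide with positions only because df2 has a fresh default RangeIndex. With any other index, loc is the label-safe counterpart:
# loc selects by the labels that idxmax returns; with the default RangeIndex it matches the iloc call above
res = df2.loc[df2.groupby('Col3').Col1.idxmax()]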