Create new column with max value with groupby - python

From the following dataframe, I am trying to add a new column with the condition that, for every id, I find the maximum value; then I place that maximum value in the new column for every row of that id.
df
id value
1 0
1 0
1 0
2 0
2 1
3 1
3 1
Expected result:
id value new_column
1 0 0
1 0 0
1 0 0
2 0 1
2 1 1
3 1 1
3 1 1
I have tried:
df['new_column'] = df.groupby(['id'])['value'].idxmax()
Or:
df['new_column'] = df.groupby(['id'])['value'].max()
But neither of these give the desired result.

You need to use transform for this:
df['new_column'] = df.groupby(['id'])['value'].transform('max')
This more succinctly replicates the following:
df['new_column'] = df['id'].map(df.groupby(['id'])['value'].max())
Remember that the result of a groupby operation is a series whose index is set to the grouper column(s).
Since the indices of your original dataframe and the groupby result are not aligned, the assignment will not happen automatically.
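A minimal, runnable sketch of both spellings on the example data (column names as in the question):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 3],
                   'value': [0, 0, 0, 0, 1, 1, 1]})

# transform('max') returns a series aligned to df's original index,
# so it can be assigned directly as a new column
df['new_column'] = df.groupby('id')['value'].transform('max')

# the equivalent map-based spelling:
# df['new_column'] = df['id'].map(df.groupby('id')['value'].max())
print(df)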

Related

How to assign a value to a column for a subset of dataframe based on a condition in Pandas?

I have a data frame:
import pandas as pd

df = pd.DataFrame([[0, 4, 0, 0],
                   [1, 5, 1, 0],
                   [2, 6, 0, 0],
                   [3, 7, 1, 0]], columns=['index', 'A', 'class', 'label'])
df:
   index  A  class  label
0      0  4      0      0
1      1  5      1      0
2      2  6      0      0
3      3  7      1      0
I want to change the label to 1 if the mean of column A over the rows with class 0 is bigger than the mean of all of column A.
How can I do this in a few lines of code?
I tried this but didn't work:
if df[df['class'] == 0]['A'].mean() > df['A'].mean():
df[df['class']]['lable'] = 1
Use the following: group by 'class' with pandas.DataFrame.groupby, take the groupby.mean of 'A' for each group, check whether it is greater than df['A'].mean(), map that boolean series onto df['class'] with pandas.Series.map, cast the result astype(int), and assign it to df['label']:
>>> df['label'] = df['class'].map(
df.groupby('class')['A'].mean() > df['A'].mean()
).astype(int)
>>> df
index A class label
0 0 4 0 0
1 1 5 1 1
2 2 6 0 0
3 3 7 1 1
Since you are checking only for class == 0, you need to add another boolean mask on df['class']:
>>> df['label'] = (df['class'].map(
df.groupby('class')['A'].mean() > df['A'].mean()
) & (~df['class'].astype(bool))
).astype(int)
   index  A  class  label
0      0  4      0      0   # class-0 mean (4+6)/2 = 5 is not greater than (4+5+6+7)/4 = 5.5
1      1  5      1      0   # excluded by the class != 0 mask
2      2  6      0      0
3      3  7      1      0   # excluded by the class != 0 mask
So even if your original code had worked, you would not have seen any change, because the condition is never fulfilled for this data.
If I understand correctly: if the condition you mentioned is fulfilled, then the labels of all rows change to 1, right? In that case what you did is nearly correct, but you missed something; the code should look like this:
if df[df['class'] == 0]['A'].mean() > df['A'].mean():
df['label'] = 1
This should work.
What you did does not work because df[df['class']] does not select the rows where class is 0, and the chained ['lable'] lookup (note the typo for 'label') writes to a temporary copy, so the original DataFrame is never modified.
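For reference, a minimal sketch of the pattern that avoids chained indexing entirely: one .loc call for the read, one plain column assignment for the write, using the corrected column name 'label':
import pandas as pd

df = pd.DataFrame([[0, 4, 0, 0],
                   [1, 5, 1, 0],
                   [2, 6, 0, 0],
                   [3, 7, 1, 0]], columns=['index', 'A', 'class', 'label'])

# .loc selects the rows and the column in one step, so no intermediate copy is created
if df.loc[df['class'] == 0, 'A'].mean() > df['A'].mean():
    df['label'] = 1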

pandas: Unable to write values to single row dataframe

I have a single-row dataframe (df) in which I want to insert a value for every column using only the index numbers.
The dataframe df is in the following form:
a b c
1 0 0 0
2 0 0 0
3 0 0 0
df.iloc[[0],[1]] = predictions[:1]
This gives me the following warning and does not write anything to the row:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
However, when I try using
pred_row.iloc[0,1] = predictions[:1]
it gives me the error:
ValueError: Incompatible indexer with Series
Is there a way to write a value to a single-row dataframe?
predictions is just an arbitrary value that I am trying to set in a particular cell of df.
To set one element of a Series into the DataFrame, change the selection to the scalar predictions[0]:
print (df)
a b c
1 0 0 0
2 0 0 0
3 0 0 0
predictions = pd.Series([1,2,3])
print (predictions)
0 1
1 2
2 3
dtype: int64
df.iloc[0, 1] = predictions[0]
# more general: get one element of the Series by position
# df.iloc[0, 1] = predictions.iat[0]
print (df)
a b c
1 0 1 0
2 0 0 0
3 0 0 0
Details:
#scalar
print (predictions[0])
1
#one element Series
print (predictions[:1])
0 1
dtype: int64
Converting the one-element Series to a one-element array also works, but setting with a scalar is simpler:
df.iloc[0, 1] = predictions[:1].values
print (df)
a b c
1 0 1 0
2 0 0 0
3 0 0 0
print (predictions[:1].values)
[1]
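On the SettingWithCopyWarning itself: it typically appears when the frame being written to was sliced from a larger DataFrame, so pandas cannot tell whether the write should propagate back. A sketch of the usual remedy, with a hypothetical big_df standing in for wherever the slice came from:
import pandas as pd

# hypothetical setup: the single-row frame is sliced from a larger one
big_df = pd.DataFrame({'a': [0, 0], 'b': [0, 0], 'c': [0, 0]})
predictions = pd.Series([1, 2, 3])

pred_row = big_df[big_df['a'] == 0].head(1).copy()  # explicit copy removes the ambiguity
pred_row.iloc[0, 1] = predictions.iat[0]            # .iat fetches the scalar by position
print(pred_row)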

Converting pandas column of comma-separated strings into dummy variables

In my dataframe, I have a categorical variable that I'd like to convert into dummy variables. This column however has multiple values separated by commas:
0 'a'
1 'a,b,c'
2 'a,b,d'
3 'd'
4 'c,d'
Ultimately, I'd want to have binary columns for each possible discrete value; in other words, the final column count equals the number of unique values in the original column. I imagine I'd have to use split() to get each separate value, but I'm not sure what to do afterwards. Any hint much appreciated!
Edit: an additional twist: the column has null values. In response to a comment, the following is the desired output. Thanks!
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
Use str.get_dummies:
df['col'].str.get_dummies(sep=',')
a b c d
0 1 0 0 0
1 1 1 1 0
2 1 1 0 1
3 0 0 0 1
4 0 0 1 1
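On the null-values twist from the question: str.get_dummies emits an all-zero row for missing entries, so no extra handling is needed. A quick check, assuming the column is named 'col':
import pandas as pd

s = pd.Series(['a', 'a,b,c', None, 'c,d'], name='col')
print(s.str.get_dummies(sep=','))
#    a  b  c  d
# 0  1  0  0  0
# 1  1  1  1  0
# 2  0  0  0  0   <- the null row becomes all zeros
# 3  0  0  1  1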
Edit: Updating the answer to address some questions.
Qn 1: Why does the Series method get_dummies not accept the prefix=... argument while pandas.get_dummies() does accept it?
Series.str.get_dummies is a series-level method (as the name suggests!). We are one-hot encoding the values in one Series (or a DataFrame column), and hence there is no need for a prefix. pandas.get_dummies, on the other hand, can one-hot encode multiple columns, in which case the prefix parameter works as an identifier of the original column.
If you want to apply a prefix to the str.get_dummies output, you can always use DataFrame.add_prefix:
df['col'].str.get_dummies(sep=',').add_prefix('col_')
Qn 2: If you have more than one column to begin with, how do you merge the dummies back into the original frame?
You can use pandas.concat to merge the one-hot encoded columns with the rest of the columns in the dataframe.
df = pd.DataFrame({'other':['x','y','x','x','q'],'col':['a','a,b,c','a,b,d','d','c,d']})
df = pd.concat([df, df['col'].str.get_dummies(sep=',')], axis=1).drop(columns='col')
other a b c d
0 x 1 0 0 0
1 y 1 1 1 0
2 x 1 1 0 1
3 x 0 0 0 1
4 q 0 0 1 1
The str.get_dummies method does not accept a prefix parameter, but you can rename the columns of the returned dummy DataFrame:
data['col'].str.get_dummies(sep=',').rename(lambda x: 'col_' + x, axis='columns')

Using two different data frames to compute new variable

I have two dataframes of the same dimensions that look like:
df1
ID flag
0 1
1 0
2 1
df2
ID flag
0 0
1 1
2 0
In both dataframes I want to create a new variable that denotes an additive flag. So the new variable will look like this:
df1
ID flag new_flag
0 1 1
1 0 1
2 1 1
df2
ID flag new_flag
0 0 1
1 1 1
2 0 1
So if either flag column is a 1, the new flag will be a 1.
I tried this code:
df1['new_flag']= 1
df2['new_flag']= 1
df1['new_flag'][(df1['flag']==0)&(df1['flag']==0)]=0
df2['new_flag'][(df2['flag']==0)&(df2['flag']==0)]=0
I would expect the same number of 1s in both new_flag columns, but they differ. Is this because I'm not going row by row? Like this question:
pandas create new column based on values from other columns
If so, how do I include criteria from both dataframes?
You can use np.logical_or to achieve this. (For the demo below, df1's flag is all 0's except the last row, so the result is not just a column of 1's.) We cast the result of np.logical_or using astype(int) to convert the boolean array to 1 and 0:
In [108]:
df1['new_flag'] = np.logical_or(df1['flag'], df2['flag']).astype(int)
df2['new_flag'] = np.logical_or(df1['flag'], df2['flag']).astype(int)
df1
Out[108]:
ID flag new_flag
0 0 0 0
1 1 0 1
2 2 1 1
In [109]:
df2
Out[109]:
ID flag new_flag
0 0 0 0
1 1 1 1
2 2 0 1
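Since both flag columns here are already 0/1 integers, the bitwise OR of the two series gives the same result without NumPy; a small sketch under that assumption, using the same demo data:
import pandas as pd

df1 = pd.DataFrame({'ID': [0, 1, 2], 'flag': [0, 0, 1]})
df2 = pd.DataFrame({'ID': [0, 1, 2], 'flag': [0, 1, 0]})

# integer OR: 0|0 = 0, anything else = 1, equivalent to np.logical_or(...).astype(int) here
df1['new_flag'] = df1['flag'] | df2['flag']
df2['new_flag'] = df1['new_flag']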

Drop Rows by Multiple Column Criteria in DataFrame

I have a pandas dataframe from which I'm trying to drop rows based on a criterion across select columns: if the values in all of these select columns are zero, the row should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
a b c
0 1 1 1
1 0 2 2
2 0 0 3
3 2 0 4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
Where the 3rd row (index 2) was dropped because it took on value 0 in BOTH the columns of interest, not just one.
Your problem here is that you first assigned the result of your boolean condition, t = t[t[cols_of_interest]!=0], which overwrites your original df and sets the positions where the condition is not met to NaN.
What you want to do is generate the boolean mask, then drop the NaN rows, passing thresh=1 so that a row is kept only if it has at least one non-NaN value; we can then use loc with the index of this result to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
EDIT
As pointed out by #DSM you can achieve this simply by using any and passing axis=1 to test the condition and use this to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
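Equivalently, you can phrase the mask the way the question states it (drop rows where ALL the columns of interest are zero) and negate it; reusing t and cols_of_interest from above, the two spellings select the same rows:
# keep rows where NOT every column of interest is zero
# (by De Morgan, this is the same as: any column of interest is nonzero)
t[~(t[cols_of_interest] == 0).all(axis=1)]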
