python pandas dataframe : fill nans with a conditional mean - python

I have the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'Cat' : ['A', 'A', 'A','B', 'B', 'A', 'B'],
'Vals' : [1, 2, 3, 4, 5, np.nan, np.nan]})
Cat Vals
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 A NaN
6 B NaN
And I want indexes 5 and 6 to be filled with the conditional mean of 'Vals' based on the 'Cat' column, namely 2 and 4.5
The following code works fine:
means = df.groupby('Cat').Vals.mean()
for i in df[df.Vals.isnull()].index:
df.loc[i, 'Vals'] = means[df.loc[i].Cat]
Cat Vals
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 A 2
6 B 4.5
But I'm looking for something nicer, like
df.Vals.fillna(df.Vals.mean(Conditionally to column 'Cat'))
Edit: I found this, which is one line shorter, but I'm still not happy with it:
means = df.groupby('Cat').Vals.mean()
df.Vals = df.apply(lambda x: means[x.Cat] if pd.isnull(x.Vals) else x.Vals, axis=1)

We wish to "associate" the Cat values with the missing NaN locations.
In Pandas such associations are always done via the index.
So it is natural to set Cat as the index:
df = df.set_index(['Cat'])
Once this is done, then fillna works as desired:
df['Vals'] = df['Vals'].fillna(means)
To return Cat to a column, you could then of course use reset_index:
df = df.reset_index()
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'Cat' : ['A', 'A', 'A','B', 'B', 'A', 'B'],
'Vals' : [1, 2, 3, 4, 5, np.nan, np.nan]})
means = df.groupby(['Cat'])['Vals'].mean()
df = df.set_index(['Cat'])
df['Vals'] = df['Vals'].fillna(means)
df = df.reset_index()
print(df)
yields
Cat Vals
0 A 1.0
1 A 2.0
2 A 3.0
3 B 4.0
4 B 5.0
5 A 2.0
6 B 4.5

Related

Aggregate values in pandas dataframe based on lists of indices in a pandas series

Suppose you have a dataframe with an "id" column and a column of values:
df1 = pd.DataFrame({'id': ['a', 'b', 'c'] , 'vals': [1, 2, 3]})
df1
id vals
0 a 1
1 b 2
2 c 3
You also have a series that contains lists of "id" values that correspond to those in df1:
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])
df2
id
0 [b, c]
1 [a, c]
2 [a, b]
Now, you need a computationally efficient method for taking the mean of the "vals" column in df1 using the corresponding ids in df2 and creating a new column in df1. For instance, for the first row (index=0) we would take the mean of the values for ids "b" and "c" in df1 (since these are the id values in df2 for index=0):
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
You could do it this way:
df1['avg_vals'] = df2.apply(lambda x: df1.loc[df1['id'].isin(x), 'vals'].mean())
df1
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
...but suppose it is too slow for your purposes. I.e., I need something much more computationally efficient if possible! Thanks for your help in advance.
Let us try
df1['new'] = pd.DataFrame(df2.tolist()).replace(dict(zip(df1.id,df1.vals))).mean(1)
df1
Out[109]:
id vals new
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
Try something like:
df1['avg_vals'] = (df2.explode()
.map(df1.set_index('id')['vals'])
.groupby(level=0)
.mean()
)
output:
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5
Thanks to #Beny and #mozway for their answers. But, these still were not performing as efficiently as I needed. I was able to take some of mozway's answer and add a merge and groupby to it which sped things up:
df1 = pd.DataFrame({'id': ['a', 'b', 'c'] , 'vals': [1, 2, 3]})
df2 = pd.Series([['b', 'c'], ['a', 'c'], ['a', 'b']])
df2 = df2.explode().reset_index(drop=False)
df1['avg_vals'] = pd.merge(df1, df2, left_on='id', right_on=0, how='right').groupby('index').mean()['vals']
df1
id vals avg_vals
0 a 1 2.5
1 b 2 2.0
2 c 3 1.5

Replace missing data based on certain conditions

Let's say I have data:
a b
0 1.0 NaN
1 6.0 1
2 3.0 NaN
3 1.0 NaN
I would like to iterate over this data to see,
if Data[i] == NaN **and** column['a'] == 1.0 then replace NAN with 4 instead of replace by 4 in any NaN you see. How shall I go about it? I tried every for if function and it didn't work. I also did
for i in df.itertuples():
but the problem is df.itertuples() doesn't have a replace functionality and the other methods I've seen were to do it one by one.
End Result looking for:
a b
0 1.0 4
1 6.0 1
2 3.0 NaN
3 1.0 4
def func(x):
if x['a'] == 1 and pd.isna(x['b']):
x['b'] = 4
return x
df = pd.DataFrame.from_dict({'a': [1.0, 6.0, 3.0, 1.0], 'b': [np.nan, 1, np.nan, np.nan]})
df.apply(func, axis=1)
Instead of iterrows(), apply() may be a better option.
You can create a mask and then fill in the intended NaNs using that mask:
df = pd.DataFrame({'a': [1,6,3,1], 'b': [np.nan, 1, np.nan, np.nan]})
mask = df[['a', 'b']].apply(lambda x: (x[0] == 1) and (pd.isna(x[1])), axis=1)
df['b'] = df['b'].mask(mask, df['b'].fillna(4))
print(df)
a b
0 1 4.0
1 6 1.0
2 3 NaN
3 1 4.0
df2 = df[df['a']==1.0].fillna(4.0)
df2.combine_first(df)
Can this help you?
Like you said, you can achieve this by combining 2 conditions: a==1 and b==Nan.
To combine two conditions in python you can use &.
In your example:
import pandas as pd
import numpy as np
# Create sample data
d = {'a': [1, 6, 3, 1], 'b': [np.nan, 1, np.nan, np.nan]}
df = pd.DataFrame(data=d)
# Convert to numeric
df = df.apply(pd.to_numeric, errors='coerce')
print(df)
# Replace Nans
df[ (df['a'] == 1 ) & np.isnan(df['b']) ] = 4
print(df)
Should do the trick.

Iterate through two data frames and update a column of the first data frame with a column of the second data frame in pandas

I am converting a piece of code written in R to python. The following code is in R. df1 and df2 are the dataframes. id, case, feature, feature_value are column names. The code in R is
for(i in 1:dim(df1)[1]){
temp = subset(df2,df2$id == df1$case[i],select = df1$feature[i])
df1$feature_value[i] = temp[,df1$feature[i]]
}
My code in python is as follows.
for i in range(0,len(df1)):
temp=np.where(df1['case'].iloc[i]==df2['id']),df1['feature'].iloc[i]
df1['feature_value'].iloc[i]=temp[:,df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How to rectify this error? Appreciate any help.
Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
case feature
0 1 3
1 2 4
2 3 5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
id age
0 1 45
1 3 63
2 7 39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
case feature feature_value
0 1 3 3.0
1 2 4 NaN
2 3 5 5.0
Instead of lamda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
if df['case'] in id_list:
feature_value = df['feature']
else:
feature_value = np.nan
return feature_value
df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
===================
To address the follow-up question in the comments:
You can do this by merging the databases and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
case feature names id age a b c
0 1 3 a 1.0 45.0 30.0 40.0 50.0
1 2 4 b NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0
Then you can create the following function:
def get_feature_values_2(df):
if pd.notnull(df['id']):
feature_value = df['feature']
column_of_interest = df['names']
feature_extended_value = df[column_of_interest]
else:
feature_value = np.nan
feature_extended_value = np.nan
return feature_value, feature_extended_value
# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
case feature names id age a b c feature_value \
0 1 3 a 1.0 45.0 30.0 40.0 50.0 3.0
1 2 4 b NaN NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0 5.0
feature_extended_value
0 30.0
1 NaN
2 51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
case feature names feature_value feature_extended_value
0 1 3 a 3.0 30.0
1 2 4 b NaN NaN
2 3 5 c 5.0 51.0

Subtract one row from another in Pandas DataFrame

I am trying to subtract one row from another in a Pandas DataFrame. I have multiple descriptor columns preceding one numerical column, forcing me to set the index of the DataFrame on the two descriptor columns.
When I do this I get a KeyError on whatever the first column name listed in the set_index() list of columns is. In this case it is 'COL_A':
df = pd.DataFrame({'COL_A': ['A', 'A'],
'COL_B': ['B', 'B'],
'COL_C': [4, 2]})
df.set_index(['COL_A', 'COL_B'], inplace=True)
df.iloc[1] = (df.iloc[1] / df.iloc[0])
df.reset_index(inplace=True)
KeyError: 'COL_A'
I did not give this a second thought and cannot figure out why the KeyError is how this resolves.
I came upon this question for a quick answer. Here's what my solution ended up being.
>>> df = pd.DataFrame(data=[[5,5,5,5], [3,3,3,3]], index=['r1', 'r2'])
>>> df
0 1 2 3
r1 5 5 5 5
r2 3 3 3 3
>>> df.loc['r3'] = df.loc['r1'] - df.loc['r2']
>>> df
0 1 2 3
r1 5 5 5 5
r2 3 3 3 3
r3 2 2 2 2
>>>
Not sure I understand you correctly:
df = pd.DataFrame({'COL_A': ['A', 'A'],
'COL_B': ['B', 'B'],
'COL_C': [4, 2]})
gives:
COL_A COL_B COL_C
0 A B 4
1 A B 2
then
df.set_index(['COL_A', 'COL_B'], inplace=True)
df.iloc[1] = (df.iloc[1] / df.iloc[0])
yields:
COL_A COL_B
A B 4.0
B 0.5
If you now want to subtract, say row 0 from row 1, you can:
df.iloc[1].subtract(df.iloc[0])
to get:
COL_C -3.5

Pandas groupby result into multiple columns

I have a dataframe in which I'm looking to group and then partition the values within a group into multiple columns.
For example: say I have the following dataframe:
>>> import pandas as pd
>>> import numpy as np
>>> df=pd.DataFrame()
>>> df['Group']=['A','C','B','A','C','C']
>>> df['ID']=[1,2,3,4,5,6]
>>> df['Value']=np.random.randint(1,100,6)
>>> df
Group ID Value
0 A 1 66
1 C 2 2
2 B 3 98
3 A 4 90
4 C 5 85
5 C 6 38
>>>
I want to groupby the "Group" field, get the sum of the "Value" field, and get new fields, each of which holds the ID values of the group.
Currently I am able to do this as follows, but I am looking for a cleaner methodology:
First, I create a dataframe with a list of the IDs in each group.
>>> g=df.groupby('Group')
>>> result=g.agg({'Value':np.sum, 'ID':lambda x:x.tolist()})
>>> result
ID Value
Group
A [1, 4] 98
B [3] 76
C [2, 5, 6] 204
>>>
And then I use pd.Series to split those up into columns, rename them, and then join it back.
>>> id_df=result.ID.apply(lambda x:pd.Series(x))
>>> id_cols=['ID'+str(x) for x in range(1,len(id_df.columns)+1)]
>>> id_df.columns=id_cols
>>>
>>> result.join(id_df)[id_cols+['Value']]
ID1 ID2 ID3 Value
Group
A 1 4 NaN 98
B 3 NaN NaN 76
C 2 5 6 204
>>>
Is there a way to do this without first having to create the list of values?
You could use
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
to create id_df without the intermediate result DataFrame.
import pandas as pd
import numpy as np
np.random.seed(2016)
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
'ID': [1, 2, 3, 4, 5, 6],
'Value': np.random.randint(1, 100, 6)})
grouped = df.groupby('Group')
values = grouped['Value'].agg('sum')
id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
id_df = id_df.rename(columns={i: 'ID{}'.format(i + 1) for i in range(id_df.shape[1])})
result = pd.concat([id_df, values], axis=1)
print(result)
yields
ID1 ID2 ID3 Value
Group
A 1 4 NaN 77
B 3 NaN NaN 84
C 2 5 6 86
Another way of doing this is to first added a "helper" column on to your data, then pivot your dataframe using the "helper" column, in the case below "ID_Count":
Using #unutbu setup:
import pandas as pd
import numpy as np
np.random.seed(2016)
df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
'ID': [1, 2, 3, 4, 5, 6],
'Value': np.random.randint(1, 100, 6)})
#Create group
grp = df.groupby('Group')
#Create helper column
df['ID_Count'] = grp['ID'].cumcount() + 1
#Pivot dataframe using helper column and add 'Value' column to pivoted output.
df_out = df.pivot('Group','ID_Count','ID').add_prefix('ID').assign(Value = grp['Value'].sum())
Output:
ID_Count ID1 ID2 ID3 Value
Group
A 1.0 4.0 NaN 77
B 3.0 NaN NaN 84
C 2.0 5.0 6.0 86
Using get_dummies and MultiLabelBinarizer (scikit-learn):
import pandas as pd
import numpy as np
from sklearn import preprocessing
df = pd.DataFrame()
df['Group']=['A','C','B','A','C','C']
df['ID']=[1,2,3,4,5,6]
df['Value']=np.random.randint(1,100,6)
mlb = preprocessing.MultiLabelBinarizer(classes=classes).fit([])
df2 = pd.get_dummies(df, '', '', columns=['ID']).groupby(by='Group').sum()
df3 = pd.DataFrame(mlb.inverse_transform(df2[df['ID'].unique()].values), index=df2.index)
df3.columns = ['ID' + str(x + 1) for x in range(df3.shape[0])]
pd.concat([df3, df2['Value']], axis=1)
ID1 ID2 ID3 Value
Group
A 1 4 NaN 63
B 3 NaN NaN 59
C 2 5 6 230

Categories