Replace missing data based on certain conditions - python

Let's say I have data:
a b
0 1.0 NaN
1 6.0 1
2 3.0 NaN
3 1.0 NaN
I would like to iterate over this data so that,
if Data[i] == NaN **and** column['a'] == 1.0, the NaN is replaced with 4, rather than replacing every NaN with 4. How should I go about it? I tried every for/if combination I could think of and it didn't work. I also did
for i in df.itertuples():
but the problem is that df.itertuples() doesn't offer a way to replace values, and the other methods I've seen replace them one by one.
End Result looking for:
a b
0 1.0 4
1 6.0 1
2 3.0 NaN
3 1.0 4

import pandas as pd
import numpy as np

def func(x):
    # set 'b' to 4 only when 'a' is 1 and 'b' is missing
    if x['a'] == 1 and pd.isna(x['b']):
        x['b'] = 4
    return x

df = pd.DataFrame.from_dict({'a': [1.0, 6.0, 3.0, 1.0], 'b': [np.nan, 1, np.nan, np.nan]})
df = df.apply(func, axis=1)
Instead of iterrows(), apply() may be a better option.
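For larger frames, a fully vectorized boolean mask is a further option, since it avoids calling a Python function per row; a minimal sketch of the same logic:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 6.0, 3.0, 1.0], 'b': [np.nan, 1, np.nan, np.nan]})

# Boolean indexing: rows where 'a' == 1 and 'b' is NaN get 4 assigned to 'b'
df.loc[(df['a'] == 1) & df['b'].isna(), 'b'] = 4
print(df)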

You can create a mask and then fill in the intended NaNs using that mask:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 6, 3, 1], 'b': [np.nan, 1, np.nan, np.nan]})
# mask is True where 'a' == 1 and 'b' is NaN
mask = df.apply(lambda x: (x['a'] == 1) and pd.isna(x['b']), axis=1)
df['b'] = df['b'].mask(mask, df['b'].fillna(4))
print(df)
a b
0 1 4.0
1 6 1.0
2 3 NaN
3 1 4.0

df2 = df[df['a'] == 1.0].fillna(4.0)
df2.combine_first(df)
Can this help you? The first line keeps only the rows where a is 1.0 and fills their NaNs with 4; combine_first then patches those rows back over the original frame, leaving all other rows untouched.

Like you said, you can achieve this by combining 2 conditions: a == 1 and b == NaN.
To combine two conditions in Python you can use &.
In your example:
import pandas as pd
import numpy as np
# Create sample data
d = {'a': [1, 6, 3, 1], 'b': [np.nan, 1, np.nan, np.nan]}
df = pd.DataFrame(data=d)
# Convert to numeric
df = df.apply(pd.to_numeric, errors='coerce')
print(df)
# Replace NaNs in 'b' where 'a' == 1
df.loc[(df['a'] == 1) & np.isnan(df['b']), 'b'] = 4
print(df)
Should do the trick.

Related

Is there a pythonic way of shifting pandas dataframe cells to the left, while pushing out or overwriting any nan?

I have a pandas dataframe (starting_df) with nan values in the left-hand columns. I'd like to shift all values over to the left for a left-aligned dataframe. My Dataframe is 24x24, but for argument's sake, I'm just posting a 4x4 version.
After some cool initial answers here, I modified the dataframe to also include a non-leading nan, whose position I'd like to preserve.
I have a piece of code that accomplishes what I want, but it relies on nested for-loops and suppressing an IndexError, which does not feel very pythonic. I have no experience with error handling in general, but simply suppressing an error does not seem to be the right strategy.
Starting dataframe and desired final dataframe:
Here is the code that (poorly) accomplishes the goal.
import pandas as pd
import numpy as np

def get_left_aligned(starting_df):
    """take a starting df with right-aligned numbers and nan, and
    turn it into a left aligned table."""
    left_aligned_df = pd.DataFrame()
    for temp_index_1 in range(0, starting_df.shape[0]):
        temp_series = []
        for temp_index_2 in range(0, starting_df.shape[0]):
            try:
                temp_series.append(starting_df.iloc[temp_index_2, temp_index_2 + temp_index_1])
                temp_index_2 += 1
            except IndexError:
                pass
        temp_series = pd.DataFrame(temp_series, columns=['col' + str(temp_index_1 + 1)])
        left_aligned_df = pd.concat([left_aligned_df, temp_series], axis=1)
    return left_aligned_df

df = pd.DataFrame(dict(col1=[1, np.nan, np.nan, np.nan],
                       col2=[5, 2, np.nan, np.nan],
                       col3=[7, np.nan, 3, np.nan],
                       col4=[9, 8, 6, 4]))
df_expected = pd.DataFrame(dict(col1=[1, 2, 3, 4],
                                col2=[5, np.nan, 6, np.nan],
                                col3=[7, 8, np.nan, np.nan],
                                col4=[9, np.nan, np.nan, np.nan]))
df_left = get_left_aligned(df)
I appreciate any help with this.
Thanks!
Or transpose the df and use shift to shift by column, since the number of leading NaNs increases by 1 from one row to the next:
dfn = df.T.copy()
for i, col in enumerate(dfn.columns):
    dfn[col] = dfn[col].shift(-i)
dfn = dfn.T
print(dfn)
col1 col2 col3 col4
0 1.0 5.0 7.0 9.0
1 2.0 NaN 8.0 NaN
2 3.0 6.0 NaN NaN
3 4.0 NaN NaN NaN
One way to resolve your challenge is to move the data into numpy territory, sort the data, then return it as a pandas DataFrame:
Numpy converts pandas NA to object data type; pd.to_numeric resolves that to data types numpy can work with.
pd.DataFrame(
    np.sort(df.transform(pd.to_numeric).to_numpy(), axis=1),
    columns=df.columns,
    dtype="Int64",
)
col1 col2 col3 col4
0 1 5 7 9
1 2 8 <NA> <NA>
2 3 6 <NA> <NA>
3 4 <NA> <NA> <NA>
You can sort the values in each row by their positions, keeping the NaN values at the end by giving them a very high sort key (np.inf, for example) rather than their actual position.
df.T.apply(
    lambda x: [z[1] for z in sorted(enumerate(x), key=(lambda k: np.inf if pd.isna(k[1]) else k[0]), reverse=False)],
    axis=0
).T
Here is an example:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[
        [np.nan, 2, 4, 7],
        [np.nan, np.nan, 6, 9],
        [np.nan, np.nan, np.nan, 10],
        [np.nan, np.nan, np.nan, np.nan],
    ],
    columns=['A', 'B', 'C', 'D']
)
df2 = df.T.apply(
    lambda x: [z[1] for z in sorted(enumerate(x), key=(lambda k: np.inf if pd.isna(k[1]) else k[0]), reverse=False)],
    axis=0
).T
And this is df2:
A B C D
0 2.0 4.0 7.0 NaN
1 6.0 9.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN NaN NaN NaN
EDIT
If you have rows with NaNs after the first non-NaN value, you can use this approach based on first_valid_index:
df.apply(
    lambda x: x.shift(-list(x.index).index(x.first_valid_index() or x.index[0])),
    axis=1,
)
An example for this case:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[
        [np.nan, 2, 4, 7],
        [np.nan, np.nan, 6, 9],
        [np.nan, np.nan, np.nan, 10],
        [np.nan, np.nan, np.nan, np.nan],
        [np.nan, 5, np.nan, 3],
    ],
    columns=['A', 'B', 'C', 'D']
)
df3 = df.apply(
    lambda x: x.shift(-list(x.index).index(x.first_valid_index() or x.index[0])),
    axis=1,
)
And df3 is:
A B C D
0 2.0 4.0 7.0 NaN
1 6.0 9.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN NaN NaN NaN
4 5.0 NaN 3.0 NaN

Iterate through two data frames and update a column of the first data frame with a column of the second data frame in pandas

I am converting a piece of code written in R to Python. df1 and df2 are the dataframes, and id, case, feature, and feature_value are column names. The code in R is
for(i in 1:dim(df1)[1]){
    temp = subset(df2, df2$id == df1$case[i], select = df1$feature[i])
    df1$feature_value[i] = temp[, df1$feature[i]]
}
My code in Python is as follows.
for i in range(0, len(df1)):
    temp = np.where(df1['case'].iloc[i] == df2['id']), df1['feature'].iloc[i]
    df1['feature_value'].iloc[i] = temp[:, df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How to rectify this error? Appreciate any help.
Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
case feature
0 1 3
1 2 4
2 3 5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
id age
0 1 45
1 3 63
2 7 39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
case feature feature_value
0 1 3 3.0
1 2 4 NaN
2 3 5 5.0
Instead of lambda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
    if df['case'] in id_list:
        feature_value = df['feature']
    else:
        feature_value = np.nan
    return feature_value

df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
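A minimal sketch of that merge-based approach, assuming the same sample frames as above:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})

# A left merge keeps every df1 row; 'id' is NaN where no match was found
merged = df1.merge(df2, how='left', left_on='case', right_on='id')

# Copy 'feature' into 'feature_value' where a match exists, NaN elsewhere
df1['feature_value'] = merged['feature'].where(merged['id'].notna())
print(df1)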
===================
To address the follow-up question in the comments:
You can do this by merging the databases and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
case feature names id age a b c
0 1 3 a 1.0 45.0 30.0 40.0 50.0
1 2 4 b NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0
Then you can create the following function:
def get_feature_values_2(df):
    if pd.notnull(df['id']):
        feature_value = df['feature']
        column_of_interest = df['names']
        feature_extended_value = df[column_of_interest]
    else:
        feature_value = np.nan
        feature_extended_value = np.nan
    return feature_value, feature_extended_value

# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
case feature names id age a b c feature_value \
0 1 3 a 1.0 45.0 30.0 40.0 50.0 3.0
1 2 4 b NaN NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0 5.0
feature_extended_value
0 30.0
1 NaN
2 51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
case feature names feature_value feature_extended_value
0 1 3 a 3.0 30.0
1 2 4 b NaN NaN
2 3 5 c 5.0 51.0

Short way to replace values in a Series based on values in another Series?

In the code below, I am replacing all NaN values in column b with a blank string if the corresponding value in column a is 1.
The code works, but I have to type df.loc[df.a == 1, 'b'] twice.
Is there a shorter/better way to do it?
import pandas as pd
df = pd.DataFrame({
    'a': [1, None, 3],
    'b': [None, 5, 6],
})
filtered = df.loc[df.a == 1, 'b']
filtered.fillna('', inplace=True)
df.loc[df.a == 1, 'b'] = filtered
print(df)
How about using numpy's where to check the values in a and b and replace? See the mockup below; I have used column 'c' to illustrate.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a': [1, None, 3],
    'b': [None, 5, 6],
})
#replace b value if the corresponding value in column a is 1 and column b is NaN
df['c'] = np.where(((df['a'] == 1) & (df['b'].isna())), df['a'], df['b'])
df
original dataframe
a b
0 1.0 NaN
1 NaN 5.0
2 3.0 6.0
result:
a b c
0 1.0 NaN 1.0
1 NaN 5.0 5.0
2 3.0 6.0 6.0
Use where() to do it in one line:
import numpy as np
df['b'] = np.where((df['b'].isnull()) & (df['a'] == 1), '', df['b'])
Use Series.fillna only for matched values by condition:
df.loc[df.a == 1, 'b'] = df['b'].fillna('')

Pandas fillna() not filling values from series

I'm trying to fill missing values in a column in a DataFrame with the value from another DataFrame's column. Here's the setup:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': [2, 3, 5, np.nan, np.nan],
    'b': [10, 11, 13, 14, 15]
})
df2 = pd.DataFrame({
    'x': [1]
})
I can of course do this and it works:
df['a'] = df['a'].fillna(1)
However, this results in the missing values not being filled:
df['a'] = df['a'].fillna(df2['x'])
And this results in an error:
df['a'] = df['a'].fillna(df2['x'].values)
How can I use the value from df2['x'] to fill in missing values in df['a']?
If you can guarantee df2['x'] only has a single element, then use .item:
df['a'] = df['a'].fillna(df2.values.item())
Or,
df['a'] = df['a'].fillna(df2['x'].item())
df
a b
0 2.0 10
1 3.0 11
2 5.0 13
3 1.0 14
4 1.0 15
Otherwise, this isn't possible unless they're either the same length and/or index-aligned.
As a rule of thumb, either
pass a scalar, or
pass a dictionary mapping the index of each NaN value to its replacement value (e.g., df.a.fillna({3: 1, 4: 1})), or
pass an index-aligned series (see the sketch below).
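A minimal sketch of the index-aligned case, reusing the df from the question; the filler values are made up for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [2, 3, 5, np.nan, np.nan],
    'b': [10, 11, 13, 14, 15]
})

# A replacement series indexed like df['a']; fillna aligns on the index,
# so only positions 3 and 4 (the NaNs) pick up values from it.
filler = pd.Series([0, 0, 0, 100, 200], index=df.index)
df['a'] = df['a'].fillna(filler)
print(df['a'].tolist())  # [2.0, 3.0, 5.0, 100.0, 200.0]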
I think one general solution is to select the first value by [0] to get a scalar:
print (df2['x'].values[0])
1
df['a'] = df['a'].fillna(df2['x'].values[0])
#similar solution for select by loc
#df['a'] = df['a'].fillna(df2.loc[0, 'x'])
print (df)
a b
0 2.0 10
1 3.0 11
2 5.0 13
3 1.0 14
4 1.0 15

Dropping columns with >N NaNs excluding specific columns

I'm wondering if there is a concise way to exclude all columns with more than N NaNs, excluding one column from this subset.
For example:
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
Results in:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Running the following, I get:
df.dropna(thresh=2, axis=1)
B D
0 2.0 0
1 4.0 1
2 NaN 5
I would like to keep column 'C'. I.e., to perform this thresholding except on column 'C'.
Is that possible?
You can put the column back once you've done the thresholding. If you do this all on one line, you don't even need to store a reference to the column.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df.dropna(thresh=2, axis=1).assign(C=df['C'])
You could also do
C = df['C']
df = df.dropna(thresh=2, axis=1)
df = df.assign(C=C)
As suggested by @Wen, you can also do an indexing operation that won't remove column C to begin with.
threshold = 2
df = df.loc[:, (df.isnull().sum(0) < threshold) | (df.columns == 'C')]
The index here for the column will select columns that have fewer than threshold NaN values, or whose name is C. If you wanted to include more than just one column in the exception, you can chain more conditions with the "or" operator |. For example:
df = df.loc[
    :,
    (df.isnull().sum(0) < threshold) |
    (df.columns == 'C') |
    (df.columns == 'D')]
df.loc[:, (df.isnull().sum(0) <= 1) | (df.isnull().sum(0) == len(df))]
Out[415]:
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5
As per Zero's suggestion
df.loc[:, (df.isnull().sum(0) <= 1) | (df.isnull().all(0))]
EDIT:
df.loc[:, (df.isnull().sum(0) <= 1) | (df.columns == 'C')]
Another take that blends some concepts from other answers.
df.loc[:, df.isnull().assign(C=False).sum().lt(2)]
B C D
0 2.0 NaN 0
1 4.0 NaN 1
2 NaN NaN 5
