Fill missing values of pandas based on values of other columns - python

I have the following dataframe:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Now I want to fill the null values of A with the values in B or D, i.e. if the value in B is also null, then check D. The resulting dataframe should look like this:
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5 NaN NaN 5
3 3.0 3.0 NaN 4
I can do this using following code:
df['A'] = df['A'].fillna(df['B'])
df['A'] = df['A'].fillna(df['D'])
But I want to do this in one line, how can I do that?

You could simply chain both .fillna():
df['A'] = df.A.fillna(df.B).fillna(df.D)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4
Or using fillna with combine_first:
df['A'] = df.A.fillna(df.B.combine_first(df.D))

If you don't want to chain calls because there are many columns, it is better to back-fill missing values along the rows and select the first column by position:
df['A'] = df['A'].fillna(df[['B','D']].bfill(axis=1).iloc[:, 0])
print (df)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4
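As a quick check, the three one-liners above can be run side by side on the sample data. This is a minimal sketch; the variable names `chained`, `combined` and `backfilled` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "A": [np.nan, 3.0, np.nan, np.nan],
    "B": [2.0, 4.0, np.nan, 3.0],
    "C": [np.nan] * 4,
    "D": [0, 1, 5, 4],
})

# 1) Chain two fillna calls: B first, then D.
chained = df["A"].fillna(df["B"]).fillna(df["D"])
# 2) Build the fallback series once with combine_first.
combined = df["A"].fillna(df["B"].combine_first(df["D"]))
# 3) Back-fill across the fallback columns and take the first one.
backfilled = df["A"].fillna(df[["B", "D"]].bfill(axis=1).iloc[:, 0])

print(chained.tolist())
print(combined.tolist())
print(backfilled.tolist())
```

All three should give the same values for A; the back-fill variant scales best when the list of fallback columns grows.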

Related

Fill nan gaps in pandas df only if gaps smaller than N nans

I am working with a pandas data frame that also contains NaN values. I want to replace the NaNs with interpolated values using df.interpolate, but only if the run of consecutive NaN values has length <= N. As an example, let's assume that I choose N = 2 (so I want to fill runs of NaNs that are up to 2 long) and I have a dataframe with
print(df)
A B C
1 1 1
nan nan 2
nan nan 3
nan 4 nan
5 5 5
In this case I want to apply a function to df so that only the NaN runs of length <= 2 get filled, while longer runs are left untouched, resulting in my desired output of
print(df)
A B C
1 1 1
nan 2 2
nan 3 3
nan 4 4
5 5 5
Note that I am aware of the limit=N option of df.interpolate, but it doesn't do what I want: it fills NaN runs of any length and merely limits the filling to the first N NaNs of each run, resulting in the undesired output
print(df)
A B C
1 1 1
2 2 2
3 3 3
nan 4 4
5 5 5
So do you know of a function/ do you know how to construct a code that results in my desired output? Tnx
You can perform run-length encoding and identify the runs of NaN that are at most two elements long in each column. One way to do that is to use get_id from the package pdrle (disclaimer: I wrote it).
import pdrle
chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]
# A B C
# 0 1.0 1.0 1.0
# 1 NaN 2.0 2.0
# 2 NaN 3.0 3.0
# 3 NaN 4.0 4.0
# 4 5.0 5.0 5.0
Try:
N = 2
df_interpolated = df.interpolate()
for c in df:
    mask = df[c].isna()
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    df_interpolated[c] = df_interpolated.loc[~x, c]
print(df_interpolated)
Prints:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
Trying with different df:
A B C
0 1.0 1.0 1.0
1 NaN NaN 2.0
2 NaN NaN 3.0
3 NaN 4.0 NaN
4 5.0 5.0 5.0
5 NaN 5.0 NaN
6 NaN 5.0 NaN
7 8.0 5.0 NaN
produces:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
5 6.0 5.0 NaN
6 7.0 5.0 NaN
7 8.0 5.0 NaN
You can try the following -
n=2
cols = df.columns[df.isna().sum()<=n]
df[cols] = df[cols].interpolate()
df
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
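df.columns[df.isna().sum()<=n] selects the columns whose total NaN count is at most n. Then, you simply overwrite those columns after interpolation. Note that this counts all NaNs in a column, not the length of each consecutive run, so it matches the requirement only when each column has a single gap.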
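For reference, the run-length idea from the answers above can be sketched with plain pandas and no extra package. The helper name `interpolate_short_gaps` and its `max_gap` parameter are illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(df, max_gap):
    """Interpolate, then blank out NaN runs longer than `max_gap` rows."""
    filled = df.interpolate()
    for col in df.columns:
        is_nan = df[col].isna()
        # A new run starts wherever the NaN flag changes value.
        run_id = (is_nan != is_nan.shift()).cumsum()
        # Length of the run each row belongs to.
        run_len = is_nan.groupby(run_id).transform("size")
        too_long = is_nan & (run_len > max_gap)
        filled.loc[too_long, col] = np.nan
    return filled

# Sample frame from the question.
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, np.nan, 5],
    "B": [1, np.nan, np.nan, 4, 5],
    "C": [1, 2, 3, np.nan, 5],
})
result = interpolate_short_gaps(df, max_gap=2)
print(result)
```

On the sample data this leaves the three-row gap in A untouched and fills the shorter gaps in B and C.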

Fill NaN with the closest not NaN columns values in Python

I want to fill the NaNs in column e with the value of the closest non-NaN column to its left (by position).
a b c d e
0 1 2.0 3.0 6.0 3.0
1 3 5.0 7.0 NaN NaN
2 2 4.0 NaN NaN NaN
3 5 6.0 NaN NaN NaN
4 3 NaN NaN NaN NaN
For example, in the second row the closest non-NaN column to e by position is c, so we take 7.0. Is it possible to do this in pandas? Thanks.
The expected output is like this:
a b c d e
0 1 2.0 3.0 6.0 3.0
1 3 5.0 7.0 NaN 7.0
2 2 4.0 NaN NaN 4.0
3 5 6.0 NaN NaN 6.0
4 3 NaN NaN NaN 3.0
If the goal is simply to take the closest non-missing value to the left of the last column, forward fill along the rows and select the last column by position:
df.e = df.ffill(axis=1).iloc[:, -1]
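A self-contained version of this one-liner on the sample data, as a minimal sketch:

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "a": [1, 3, 2, 5, 3],
    "b": [2.0, 5.0, 4.0, 6.0, np.nan],
    "c": [3.0, 7.0, np.nan, np.nan, np.nan],
    "d": [6.0, np.nan, np.nan, np.nan, np.nan],
    "e": [3.0, np.nan, np.nan, np.nan, np.nan],
})

# Forward fill along each row, then keep only the last column.
df["e"] = df.ffill(axis=1).iloc[:, -1]
print(df["e"].tolist())  # [3.0, 7.0, 4.0, 6.0, 3.0]
```

Only column e is overwritten; the NaNs in b, c and d stay as they were because the forward-filled frame is never assigned back as a whole.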

How to move values over in each Pandas data frame row where np.nan are located?

If I have a pandas data frame like this:
A B C D E F G H
0 0 2 3 5 NaN NaN NaN NaN
1 2 7 9 1 2 NaN NaN NaN
2 1 5 7 2 1 2 1 NaN
3 6 1 3 2 1 1 5 5
4 1 2 3 6 NaN NaN NaN NaN
How do I move all of the numerical values to the end of each row and place the NANs before them? Such that I get a pandas data frame like this:
A B C D E F G H
0 NaN NaN NaN NaN 0 2 3 5
1 NaN NaN NaN 2 7 9 1 2
2 NaN 1 5 7 2 1 2 1
3 6 1 3 2 1 1 5 5
4 NaN NaN NaN NaN 1 2 3 6
One-line solution:
df.apply(lambda x: pd.concat([x[x.isna()==True], x[x.isna()==False]], ignore_index=True), axis=1)
I guess the best approach is to work row by row. Make a function to do the job and use apply or transform to use that function on each row.
import numpy as np
import pandas as pd
def movenan(x):
    fl = len(x)
    nl = len(x.dropna())
    nanarr = np.empty(fl - nl)
    nanarr[:] = np.nan
    return pd.concat([pd.Series(nanarr), x.dropna()], ignore_index=True)
ddf = df.transform(movenan, axis=1)
ddf.columns = df.columns
Using your sample data, the resulting ddf is:
A B C D E F G H
0 NaN NaN NaN NaN 0.0 2.0 3.0 5.0
1 NaN NaN NaN 2.0 7.0 9.0 1.0 2.0
2 NaN 1.0 5.0 7.0 2.0 1.0 2.0 1.0
3 6.0 1.0 3.0 2.0 1.0 1.0 5.0 5.0
4 NaN NaN NaN NaN 1.0 2.0 3.0 6.0
The movenan function creates an array of nan of the required length, drops the nan from the row, and concatenates the two resulting Series.
ignore_index=True is required because the values move to different columns, so their original column labels must not be preserved; as a side effect, the column names are lost and replaced by integers. The last line simply copies the column names back into the new dataframe.
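If the row-by-row apply turns out to be slow on a large frame, a different technique is to reorder each row with a stable NumPy sort on the "is not NaN" mask. This is a sketch of that alternative, not the method from the answers above, and it assumes all columns are numeric; the names `values`, `order` and `shifted` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame(
    [
        [0, 2, 3, 5, np.nan, np.nan, np.nan, np.nan],
        [2, 7, 9, 1, 2, np.nan, np.nan, np.nan],
        [1, 5, 7, 2, 1, 2, 1, np.nan],
        [6, 1, 3, 2, 1, 1, 5, 5],
        [1, 2, 3, 6, np.nan, np.nan, np.nan, np.nan],
    ],
    columns=list("ABCDEFGH"),
)

values = df.to_numpy(dtype=float)
# False (NaN) sorts before True (value); a stable sort keeps the
# original left-to-right order inside each group.
order = np.argsort(~np.isnan(values), axis=1, kind="stable")
shifted = pd.DataFrame(
    np.take_along_axis(values, order, axis=1),
    index=df.index,
    columns=df.columns,
)
print(shifted)
```

Because the index and columns are passed explicitly, the column names survive without a separate copy-back step.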

I want to subtract each column from the previous non-null column using the diff function

I have a long list of columns and I want to subtract the previous column from the current column and replace the current column with the difference.
So if I have:
A B C D
1 NaN 3 7
3 NaN 8 10
2 NaN 6 11
I want the output to be:
A B C D
1 NaN 2 4
3 NaN 5 2
2 NaN 4 5
I have been trying to use this code:
df2 = df1.diff(axis=1)
but this does not produce the desired output
Thanks in advance.
You can do this with df.where and then update to bring back the first non-null entry for each row of your DataFrame.
Sample Data: df
A B C D
0 1.0 NaN 3.0 7.0
1 1.0 4.0 5.0 9.0
2 NaN 4.0 NaN 4.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 7.0
5 3.0 NaN NaN 7.0
6 6.0 NaN NaN NaN
Code:
df_d = df.where(df.isnull(),
                df.ffill(axis=1).diff(axis=1))
df_d.update(df.where(df.notnull().cumsum(1).cumsum(1) == 1))
Output: df_d
A B C D
0 1.0 NaN 2.0 4.0
1 1.0 3.0 1.0 4.0
2 NaN 4.0 NaN 0.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 4.0
5 3.0 NaN NaN 4.0
6 6.0 NaN NaN NaN
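Applied to the data from the question itself, the same steps can be sketched as follows. It uses df.ffill(axis=1), which does the same forward fill as fillna with method='ffill'; the name `first_valid` is illustrative.

```python
import numpy as np
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "A": [1.0, 3.0, 2.0],
    "B": [np.nan, np.nan, np.nan],
    "C": [3.0, 8.0, 6.0],
    "D": [7.0, 10.0, 11.0],
})

# Difference against the previous non-null column, keeping NaN where it was.
df_d = df.where(df.isnull(), df.ffill(axis=1).diff(axis=1))
# The double cumsum equals 1 only at the first non-null entry of each row;
# restore that entry, which diff() turned into NaN.
first_valid = df.notnull().cumsum(axis=1).cumsum(axis=1) == 1
df_d.update(df.where(first_valid))
print(df_d)
```

This should reproduce the output requested in the question: A unchanged, B all NaN, C equal to C minus A, and D equal to D minus C.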
Actually, it is producing the desired result but you are trying to calculate diff on nan values which will be nan so diff is working as expected.
For your case just fetch the first column from original dataframe and you should be fine
df2=df1.diff(axis=1)
df2.A=df1.A
print(df2)
Output
A B C D
1 NaN 2.0 4.0

Pandas: operations with nans

Suppose I produce the following using pandas:
df = pd.DataFrame(np.ones(25).reshape((5,5)),index = ['A','B','C','D','E'])
df1 = pd.DataFrame(np.ones(25).reshape((5,5))*2,index = ['A','B','C','D','E'])
df[2] = np.nan
df1[3] = np.nan
df[4] = np.nan
df1[4] = np.nan
df2 = df+df1
print(df2)
0 1 2 3 4
A 3.0 3.0 NaN NaN NaN
B 3.0 3.0 NaN NaN NaN
C 3.0 3.0 NaN NaN NaN
D 3.0 3.0 NaN NaN NaN
E 3.0 3.0 NaN NaN NaN
What do I have to do to get this instead?
0 1 2 3 4
A 3 3 2 1 NaN
B 3 3 2 1 NaN
C 3 3 2 1 NaN
D 3 3 2 1 NaN
E 3 3 2 1 NaN
Use the fill_value argument of the DataFrame.add method:
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing.
df.add(df1, fill_value=0)
Out:
0 1 2 3 4
A 3.0 3.0 2.0 1.0 NaN
B 3.0 3.0 2.0 1.0 NaN
C 3.0 3.0 2.0 1.0 NaN
D 3.0 3.0 2.0 1.0 NaN
E 3.0 3.0 2.0 1.0 NaN
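A self-contained sketch that rebuilds the two frames from the question and applies fill_value; the name `total` is illustrative.

```python
import numpy as np
import pandas as pd

index = list("ABCDE")
df = pd.DataFrame(np.ones((5, 5)), index=index)
df1 = pd.DataFrame(np.full((5, 5), 2.0), index=index)
df[[2, 4]] = np.nan   # columns 2 and 4 missing in df
df1[[3, 4]] = np.nan  # columns 3 and 4 missing in df1

# A NaN on one side is treated as 0; a NaN on both sides stays NaN.
total = df.add(df1, fill_value=0)
print(total)
```

Column 4 remains NaN because it is missing in both frames, which is the behaviour described in the quoted documentation.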
