I'm trying to conditionally fill NaNs in a dataframe, based on:
1. the value in A (done with groupby)
2. inside groupby(A), if the value is NaN and is first, fill with zero and then ffill (A=a in the example)
3. inside groupby(A), if the value is NaN and isn't first, bfill (A=b in the example)
4. inside groupby(A), if the value is NaN but there is no datapoint following it, ffill (A=c in the example)
I smell ternary + lambda, but I'd like a Pythonic way of writing it.
Basically, the starting point would be:
df
A B
0 a NaN
1 a NaN
2 a 3.0
3 a 4.0
4 b 4.0
5 b NaN
6 b 6.0
7 b 6.0
8 c 7.0
9 c NaN
10 c NaN
And df should become:
df
A B
0 a 0.0
1 a 0.0
2 a 3.0
3 a 4.0
4 b 4.0
5 b 6.0
6 b 6.0
7 b 6.0
8 c 7.0
9 c 7.0
10 c 7.0
We can do GroupBy.ffill, so each group's interior and trailing values are filled in correctly; after that we can do a Series.fillna with 0, because the leading NaNs are the only ones left:
df['B'] = df.groupby('A')['B'].ffill().fillna(0)
A B
0 a 0.0
1 a 0.0
2 a 3.0
3 a 4.0
4 b 4.0
5 b 4.0
6 b 6.0
7 b 6.0
8 c 7.0
9 c 7.0
10 c 7.0
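Note that plain ffill leaves row 5 at 4.0, while the desired output back-fills it to 6.0. If you want the stated rules followed literally (leading NaNs become 0, interior NaNs are back-filled, trailing NaNs are forward-filled), a minimal sketch could look like this (fill_group is just an illustrative name):
def fill_group(s):
    s = s.copy()
    s[s.ffill().isna()] = 0   # leading NaNs (nothing before them) -> 0
    return s.bfill().ffill()  # interior NaNs -> bfill, trailing NaNs -> ffill

df['B'] = df.groupby('A')['B'].transform(fill_group)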
I am working with a pandas dataframe that also contains NaN values. I want to substitute the NaNs with interpolated values using df.interpolate, but only if the length of the sequence of NaN values is <= N. As an example, let's assume that I choose N = 2 (so I want to fill sequences of NaNs only if they are at most 2 NaNs long) and I have a dataframe with
print(df)
A B C
1 1 1
nan nan 2
nan nan 3
nan 4 nan
5 5 5
In such a case I want to apply a function to df so that only the NaN sequences of length <= 2 get filled, while the longer sequences are left untouched, resulting in my desired output of
print(df)
A B C
1 1 1
nan 2 2
nan 3 3
nan 4 4
5 5 5
Note that I am aware of the limit=N option of df.interpolate, but it doesn't do what I want, because it would fill sequences of any length and just limit the filling to the first N NaNs of each sequence, resulting in the undesired output
print(df)
A B C
1 1 1
2 2 2
3 3 3
nan 4 4
5 5 5
So do you know of a function, or how to construct code, that produces my desired output? Thanks.
You can perform run-length encoding and identify, for each column, the runs of NaN that are at most two elements long. One way to do that is to use get_id from the package pdrle (disclaimer: I wrote it).
import pdrle

# mark NaN cells that sit in a run of length <= 2 (run ids come from pdrle.get_id)
chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
# interpolate everything, but copy back only the cells flagged in chk
df[chk] = df.interpolate()[chk]
# A B C
# 0 1.0 1.0 1.0
# 1 NaN 2.0 2.0
# 2 NaN 3.0 3.0
# 3 NaN 4.0 4.0
# 4 5.0 5.0 5.0
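If you would rather avoid the extra dependency, a rough pure-pandas sketch of the same idea (run lengths computed per column from the isna mask with a shift/cumsum trick; is_na, run_len and chk are illustrative names):
is_na = df.isna()
# per column: give each run of consecutive equal values an id, then take the size of the run each cell belongs to
run_len = is_na.apply(
    lambda col: col.groupby(col.ne(col.shift()).cumsum()).transform("size")
)
# NaN cells that sit in a run of at most 2 NaNs
chk = is_na & (run_len <= 2)
df[chk] = df.interpolate()[chk]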
Try:
N = 2
df_interpolated = df.interpolate()
for c in df:
    mask = df[c].isna()
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    # assigning the shorter selection back re-introduces NaN (via index alignment) wherever the NaN run was longer than N
    df_interpolated[c] = df_interpolated.loc[~x, c]
print(df_interpolated)
Prints:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
Trying with a different df:
A B C
0 1.0 1.0 1.0
1 NaN NaN 2.0
2 NaN NaN 3.0
3 NaN 4.0 NaN
4 5.0 5.0 5.0
5 NaN 5.0 NaN
6 NaN 5.0 NaN
7 8.0 5.0 NaN
produces:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
5 6.0 5.0 NaN
6 7.0 5.0 NaN
7 8.0 5.0 NaN
You can try the following:
n = 2
cols = df.columns[df.isna().sum() <= n]
df[cols] = df[cols].interpolate()
df
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
df.columns[df.isna().sum() <= n] filters the columns based on your condition. Then you simply overwrite those columns after interpolation. Note that this counts the total number of NaNs per column rather than the length of each NaN run, which happens to match the example but would behave differently for a column with several short gaps.
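For reference, the intermediate steps on the example frame look roughly like this:
n = 2
df.isna().sum()
# A    3
# B    2
# C    1
# dtype: int64
df.columns[df.isna().sum() <= n]
# Index(['B', 'C'], dtype='object')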
Say I have two data frames:
Original:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 NaN
2 NaN NaN 9.0
Imputation:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
(both are the same dataframe, except the imputation one has the NaNs filled in).
I would like to reintroduce the NaN values into column A of the imputation df so it looks like this (columns B and C are filled in, but A keeps the NaN values):
# A B C
# 0 NaN 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN 6.0 9.0
import pandas as pd
import numpy as np
dfImputation = pd.DataFrame({'A': [1, 2, 3],
                             'B': [4, 5, 6],
                             'C': [7, 8, 9]})
dfOrginal = pd.DataFrame({'A': [np.nan, 2, np.nan],
                          'B': [4, 5, np.nan],
                          'C': [7, np.nan, 9]})
print(dfOrginal.fillna(dfImputation))
I do not get the result I want because it obviously just fills in all the values. Is there a way to reintroduce the NaN values, or a way to fill NAs only for specific columns? I'm not quite sure of the best approach to get the intended outcome.
You can fill in only specified columns by subsetting the frame you pass into the fillna operation:
>>> dfOrginal.fillna(dfImputation[["B", "C"]])
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
Check DataFrame.update, which modifies df in place using the non-NA values from the frame you pass (here im is the imputation frame):
df.update(im[['B','C']])
df
Out[7]:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
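Going the other way round, i.e. starting from the imputation frame and putting the NaNs back into column A only, a small sketch using the frames above could be (out is just an illustrative name):
# copy the imputed frame, then blank out A wherever the original A was NaN
out = dfImputation.copy()
out['A'] = out['A'].mask(dfOrginal['A'].isna())
# A becomes NaN, 2.0, NaN while B and C keep their imputed values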
How do you fill only the groups inside a dataframe that are not entirely null?
In the dataframe below, only the groups with df.A == 'b' and df.A == 'c' should get filled.
df
A B
0 a NaN
1 a NaN
2 a NaN
3 a NaN
4 b 4.0
5 b NaN
6 b 6.0
7 b 6.0
8 c 7.0
9 c NaN
10 c NaN
I was thinking of something like:
if set(df[df.A == (need help here)].B.values) == {np.nan}:
We can do a groupby; note that ffill().bfill() leaves an all-NaN group untouched anyway, since there is nothing to fill from:
df.B = df.groupby('A').B.apply(lambda x: x.ffill().bfill())
Get the index values whose groups are not completely null in B, then forward-fill/backward-fill on those rows:
df = df.set_index("A")
# get the index values (groups) where B has at least one non-null entry
# (x.eq(x) is False exactly for NaN, since NaN != NaN)
ind = df.loc[df.groupby("A").B.transform(lambda x: x.eq(x))].index.unique()
df.loc[ind] = df.loc[ind].ffill().bfill()
print(df)
B
A
a NaN
a NaN
a NaN
a NaN
b 4.0
b 4.0
b 6.0
b 6.0
c 7.0
c 7.0
c 7.0
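For completeness, a sketch that keeps the original (non-indexed) frame and builds the group mask explicitly with transform (has_value and filled are illustrative names):
# True for rows whose group has at least one non-null B
has_value = df.groupby('A')['B'].transform(lambda s: s.notna().any())
# fill within each group, then keep the fill only for the groups flagged above
filled = df.groupby('A')['B'].transform(lambda s: s.ffill().bfill())
df['B'] = df['B'].where(~has_value, filled)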
I have a dataframe containing values as well as some NaNs. I have the means of the columns, and I want to insert each column's mean into its NaN values. For example:
ColA and ColB have NaNs that should be replaced with the means I already have.
I have the means for ColA and ColB and want to insert them into the NaN locations. I could do that column by column using the replace method, but for many columns, is there another way to achieve this?
EDIT:
If you already have a Series with the means, just use DataFrame.fillna:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, np.nan, 1, 0],
    'E': [np.nan, 3, 6, np.nan, 2, 4],
    'F': list('aaabbb')
})
means = pd.Series([10,20], index=['B','E'])
df = df.fillna(means)
print (df)
A B C D E F
0 a 4.0 7 1.0 20.0 a
1 b 10.0 8 3.0 3.0 a
2 c 4.0 9 5.0 6.0 a
3 d 5.0 4 NaN 20.0 b
4 e 5.0 2 1.0 2.0 b
5 f 4.0 3 0.0 4.0 b
If you need to replace missing values in all numeric columns, use DataFrame.fillna with the column means (in older pandas this works as-is because mean() excludes non-numeric columns; in recent versions you may need df.mean(numeric_only=True)):
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, np.nan, 1, 0],
    'E': [np.nan, 3, 6, np.nan, 2, 4],
    'F': list('aaabbb')
})
df1 = df.fillna(df.mean())
print (df1)
A B C D E F
0 a 4.0 7 1.0 3.75 a
1 b 4.4 8 3.0 3.00 a
2 c 4.0 9 5.0 6.00 a
3 d 5.0 4 2.0 3.75 b
4 e 5.0 2 1.0 2.00 b
5 f 4.0 3 0.0 4.00 b
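As a side note, in pandas 2.x mean() no longer silently drops string columns, so a hedged variant of the line above (assuming a recent pandas version) is:
# restrict the means to numeric columns explicitly
df1 = df.fillna(df.mean(numeric_only=True))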
If you need to compute the means only for specific columns, change the solution to use a list of column names:
cols = ['D','B']
df[cols] = df[cols].fillna(df[cols].mean())
print (df)
A B C D E F
0 a 4.0 7 1.0 NaN a
1 b 4.4 8 3.0 3.0 a
2 c 4.0 9 5.0 6.0 a
3 d 5.0 4 2.0 NaN b
4 e 5.0 2 1.0 2.0 b
5 f 4.0 3 0.0 4.0 b
Try this for each column that you want to fill:
df['column1'] = df['column1'].fillna(df['column1'].mean())
I have the following dataframe:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Now I want to fill the null values of A with the values in B or D, i.e. if the value is null in B, then check D. So the resulting dataframe looks like this:
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5 NaN NaN 5
3 3.0 3.0 NaN 4
I can do this using the following code:
df['A'] = df['A'].fillna(df['B'])
df['A'] = df['A'].fillna(df['D'])
But I want to do this in one line. How can I do that?
You can simply chain the two .fillna() calls:
df['A'] = df.A.fillna(df.B).fillna(df.D)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4
Or use fillna together with combine_first, which first fills B's gaps from D and then uses the result to fill A:
df['A'] = df.A.fillna(df.B.combine_first(df.D))
If you don't want to chain many fillna calls because there are many fallback columns, it is better to back-fill the missing values across those columns (so the first position in each row holds the first non-null value among B and D) and select that first column by position:
df['A'] = df['A'].fillna(df[['B','D']].bfill(axis=1).iloc[:, 0])
print (df)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4