How do you fill only the groups in a dataframe that are not entirely null?
In the dataframe below, only the groups with df.A == 'b' and df.A == 'c' should get filled.
df
A B
0 a NaN
1 a NaN
2 a NaN
3 a NaN
4 b 4.0
5 b NaN
6 b 6.0
7 b 6.0
8 c 7.0
9 c NaN
10 c NaN
I was thinking of something like:
if set(df[df.A==(need help here)].B.values) == {np.nan}:
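We can use groupby with transform, which returns a result aligned to the original index so it can be assigned straight back:
df['B'] = df.groupby('A')['B'].transform(lambda x: x.ffill().bfill())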
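No explicit "is this group all null?" check is needed, because ffill and bfill have nothing to propagate inside a group that holds only NaN. Here is a minimal, self-contained sketch of the grouped fill on the sample data. It uses transform rather than apply, on the assumption that the result should stay aligned to the frame's own index; the name `filled` is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": list("aaaabbbbccc"),
    "B": [np.nan, np.nan, np.nan, np.nan,
          4.0, np.nan, 6.0, 6.0,
          7.0, np.nan, np.nan],
})

# Fill forward, then backward, separately inside each group of A.
# transform returns a Series with the same index as df.
filled = df.groupby("A")["B"].transform(lambda s: s.ffill().bfill())

print(df.assign(B=filled))
```

Group a stays entirely NaN, while b and c are filled only from their own values.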
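Get the group labels that are not completely null, then forward-fill and back-fill only the rows with those labels. Note that this fills the selected rows as one block, so a value can carry over from one group into the next; it works here because every selected group starts with a non-null value.
df = df.set_index("A")
# get the index labels of groups where B is not entirely null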
ind = df.loc[df.groupby("A").B.transform(lambda x: x.eq(x))].index.unique()
df.loc[ind] = df.loc[ind].ffill().bfill()
print(df)
B
A
a NaN
a NaN
a NaN
a NaN
b 4.0
b 4.0
b 6.0
b 6.0
c 7.0
c 7.0
c 7.0
Say I have two data frames:
Original:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 NaN
2 NaN NaN 9.0
Imputation:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
(Both are the same dataframe, except that Imputation has the NaNs filled in.)
I would like to reintroduce the NaN values into column A of the imputation df, so that it looks like this (columns B and C are filled in, but A keeps its NaN values):
# A B C
# 0 NaN 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN 6.0 9.0
import pandas as pd
import numpy as np
dfImputation = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
dfOrginal = pd.DataFrame({'A':[np.nan,2,np.nan],
'B':[4,5,np.nan],
'C':[7,np.nan,9]})
print(dfOrginal.fillna(dfImputation))
I do not get the result I want, because this fills in all the values. Is there a way to reintroduce NaN values, or a way to fill in NA only for specific columns? I'm not sure of the best approach to get the intended outcome.
You can fill in only specified columns by subsetting the frame you pass into the fillna operation:
>>> dfOrginal.fillna(dfImputation[["B", "C"]])
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
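Alternatively, use DataFrame.update, which overwrites values in place and returns None (here df is the original frame and im the imputed one):
df.update(im[['B','C']])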
df
Out[7]:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
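If you would rather start from the imputed frame and literally put the NaNs back into column A, as the question title asks, Series.mask can do that. This is a small sketch rather than a definitive recipe; the names `imputed`, `original` and `restored` are mine, and the cast to float is an assumption so that column A can hold NaN.

```python
import numpy as np
import pandas as pd

imputed = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
original = pd.DataFrame({"A": [np.nan, 2, np.nan],
                         "B": [4, 5, np.nan],
                         "C": [7, np.nan, 9]})

# Work on a float copy so the integer columns can represent NaN.
restored = imputed.astype(float)

# mask() replaces a value with NaN wherever the condition is True,
# here wherever the original column A was missing.
restored["A"] = restored["A"].mask(original["A"].isna())

print(restored)
```

Columns B and C keep their imputed values; only A gets its gaps back.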
I am trying to conditionally fill NaNs in a dataframe, based on:
1. the value of A (done with groupby)
2. inside groupby(A), if the value is NaN and comes first, fill with zero (A=a in the example)
3. inside groupby(A), if the value is NaN and isn't first, bfill (A=b in the example)
4. inside groupby(A), if the value is NaN but there's no later data point to fill from, ffill (A=c in the example)
I smell ternary + lambda, but would like a pythonic way of writing it.
Basically, starting point would be:
df
A B
0 a NaN
1 a NaN
2 a 3.0
3 a 4.0
4 b 4.0
5 b NaN
6 b 6.0
7 b 6.0
8 c 7.0
9 c NaN
10 c NaN
And df should become:
df
A B
0 a 0.0
1 a 0.0
2 a 3.0
3 a 4.0
4 b 4.0
5 b 6.0
6 b 6.0
7 b 6.0
8 c 7.0
9 c 7.0
10 c 7.0
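We can use GroupBy.ffill so that gaps inside each group are filled from the previous value in the same group. After that, the only NaNs left are the ones at the start of a group, so a Series.fillna with 0 takes care of them. Note that this forward-fills interior gaps, so row 5 becomes 4.0 rather than the 6.0 shown in the expected output: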
df['B'] = df.groupby('A')['B'].ffill().fillna(0)
A B
0 a 0.0
1 a 0.0
2 a 3.0
3 a 4.0
4 b 4.0
5 b 4.0
6 b 6.0
7 b 6.0
8 c 7.0
9 c 7.0
10 c 7.0
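If the interior gap really has to be back-filled, as in the expected output from the question (row 5 equal to 6.0), the three rules can be applied in sequence. The following is only a sketch under that reading of the rules; the helper names `grouped`, `leading` and `filled` are illustrative. One side effect to be aware of: a group that is entirely NaN would end up as all zeros.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": list("aaaabbbbccc"),
    "B": [np.nan, np.nan, 3.0, 4.0,
          4.0, np.nan, 6.0, 6.0,
          7.0, np.nan, np.nan],
})

grouped = df.groupby("A")["B"]

# Leading NaNs are exactly the ones a per-group forward fill cannot reach.
leading = grouped.ffill().isna()

# Back-fill within each group, then overwrite the leading gaps with zero.
filled = grouped.bfill().mask(leading, 0.0)

# Trailing gaps have no later value, so forward-fill them per group.
df["B"] = filled.groupby(df["A"]).ffill()

print(df)
```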
I'm currently having a problem with filling the missing values of my dataframe using a different dataframe.
Data samples:
df1
A B C
b 1.0 1.0
d NaN NaN
c 2.0 2.0
a NaN NaN
f NaN NaN
df2
A B C
c 1 5
b 2 6
a 3 7
d 4 8
I've tried to follow the solution in this question, but it appears to work only if the values you're looking up are present in both of the dataframes you're joining.
My attempt
mask = df1["B"].isnull()
df1.loc[mask, "B"] = df2[df1.loc[mask, "A"]].values
Error:
"None of [Index(['d', 'a', 'f'], dtype='object')] are in the [columns]"
Expected result:
A B C
b 1.0 1.0
d 4.0 8.0
c 2.0 2.0
a 3.0 7.0
f NaN NaN
Also, can it be used to fill two columns?
You can use combine_first here, which fills the NaNs in one dataframe with values from another, aligned on index and columns. Note that the result comes back sorted by A rather than in df1's original row order:
df1.set_index('A').combine_first(df2.set_index('A')).reset_index()
A B C
0 a 3.0 7.0
1 b 1.0 1.0
2 c 2.0 2.0
3 d 4.0 8.0
4 f NaN NaN
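If the row order of df1 matters, as in the expected result from the question, fillna with an index-aligned frame is an alternative to combine_first. This is a sketch using the sample data; `result` is just an illustrative name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "A": list("bdcaf"),
    "B": [1.0, np.nan, 2.0, np.nan, np.nan],
    "C": [1.0, np.nan, 2.0, np.nan, np.nan],
})
df2 = pd.DataFrame({"A": list("cbad"), "B": [1, 2, 3, 4], "C": [5, 6, 7, 8]})

# fillna aligns the second frame on index (A) and on column labels,
# and leaves the rows of df1 in their original order.
result = df1.set_index("A").fillna(df2.set_index("A")).reset_index()

print(result)
```

Both B and C are filled in one call, and f stays NaN because df2 has no row for it.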
If I have a pandas data frame like this:
A B C D E F G H
0 0 2 3 5 NaN NaN NaN NaN
1 2 7 9 1 2 NaN NaN NaN
2 1 5 7 2 1 2 1 NaN
3 6 1 3 2 1 1 5 5
4 1 2 3 6 NaN NaN NaN NaN
How do I move all of the numerical values to the end of each row and place the NaNs before them, so that I get a pandas data frame like this?
A B C D E F G H
0 NaN NaN NaN NaN 0 2 3 5
1 NaN NaN NaN 2 7 9 1 2
2 NaN 1 5 7 2 1 2 1
3 6 1 3 2 1 1 5 5
4 NaN NaN NaN NaN 1 2 3 6
One-line solution (the result's columns are renumbered 0 to 7, so reassign df.columns afterwards if the labels are needed):
df.apply(lambda x: pd.concat([x[x.isna()], x[x.notna()]], ignore_index=True), axis=1)
I guess the best approach is to work row by row. Make a function to do the job and use apply or transform to use that function on each row.
def movenan(x):
    fl = len(x)            # full row length
    nl = len(x.dropna())   # number of non-NaN values
    nanarr = np.empty(fl - nl)
    nanarr[:] = np.nan
    # NaNs first, then the remaining values in their original order
    return pd.concat([pd.Series(nanarr), x.dropna()], ignore_index=True)
ddf = df.apply(movenan, axis=1)
ddf.columns = df.columns
Using your sample data, the resulting ddf is:
A B C D E F G H
0 NaN NaN NaN NaN 0.0 2.0 3.0 5.0
1 NaN NaN NaN 2.0 7.0 9.0 1.0 2.0
2 NaN 1.0 5.0 7.0 2.0 1.0 2.0 1.0
3 6.0 1.0 3.0 2.0 1.0 1.0 5.0 5.0
4 NaN NaN NaN NaN 1.0 2.0 3.0 6.0
The movenan function creates an array of NaN of the required length, drops the NaNs from the row, and concatenates the two resulting Series.
ignore_index=True is required because the values must not stay tied to their original columns (they are moved to different columns). As a side effect, the column names are lost and replaced by integers, so the last line simply copies the column names back into the new dataframe.
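For larger frames, a vectorized variant avoids calling a Python function per row. This is a different technique from the row-wise function above: a stable argsort on the "is not NaN" mask, which is my own suggestion and assumes every column can be treated as float. The names `values`, `order` and `shifted` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 2, 3, 5, np.nan, np.nan, np.nan, np.nan],
     [2, 7, 9, 1, 2, np.nan, np.nan, np.nan],
     [1, 5, 7, 2, 1, 2, 1, np.nan],
     [6, 1, 3, 2, 1, 1, 5, 5],
     [1, 2, 3, 6, np.nan, np.nan, np.nan, np.nan]],
    columns=list("ABCDEFGH"),
)

values = df.to_numpy(dtype=float)

# Sorting the boolean mask puts False (NaN positions) before True (values).
# kind="stable" keeps the values in their original left-to-right order.
order = np.argsort(~np.isnan(values), axis=1, kind="stable")

shifted = pd.DataFrame(
    np.take_along_axis(values, order, axis=1),
    index=df.index,
    columns=df.columns,
)

print(shifted)
```

Because the frame is rebuilt with the original index and columns, no separate step is needed to restore the column names.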
I have the following dataframe:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Now I want to fill the null values of A with the values in B or D, i.e. if the value in B is also null, then check D. The resulting dataframe should look like this:
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5 NaN NaN 5
3 3.0 3.0 NaN 4
I can do this using following code:
df['A'] = df['A'].fillna(df['B'])
df['A'] = df['A'].fillna(df['D'])
But I want to do this in one line, how can I do that?
You could simply chain the two .fillna() calls:
df['A'] = df.A.fillna(df.B).fillna(df.D)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4
Or using fillna with combine_first:
df['A'] = df.A.fillna(df.B.combine_first(df.D))
If you don't want to chain calls because there are many fallback columns, it is better to back-fill the missing values across those columns and then select the first column by position:
df['A'] = df['A'].fillna(df[['B','D']].bfill(axis=1).iloc[:, 0])
print(df)
A B C D
0 2.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 5.0 NaN NaN 5
3 3.0 3.0 NaN 4
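When the list of fallback columns is built at run time, the chained fillna calls can also be folded over a list. This is just a sketch of that idea with functools.reduce; the `fallbacks` list is a hypothetical name, and its order sets the priority.

```python
from functools import reduce

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 3.0, np.nan, np.nan],
    "B": [2.0, 4.0, np.nan, 3.0],
    "C": [np.nan] * 4,
    "D": [0, 1, 5, 4],
})

# Columns to try, in priority order, wherever A is missing.
fallbacks = ["B", "D"]

# Equivalent to df["A"].fillna(df["B"]).fillna(df["D"]).
df["A"] = reduce(lambda acc, col: acc.fillna(df[col]), fallbacks, df["A"])

print(df)
```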