I have a dataframe that contains values as well as some NaN. I already have the mean of each column, and I want to insert each column's mean into that column's NaN positions. For example:
ColA and ColB contain NaN values that should be replaced with the means I have.
I could do this column by column with the replace method, but is there a better way when there are many columns?
EDIT:
If you already have a Series with the means, just use DataFrame.fillna:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,np.nan,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,np.nan,1,0],
'E':[np.nan,3,6,np.nan,2,4],
'F':list('aaabbb')
})
means = pd.Series([10,20], index=['B','E'])
df= df.fillna(means)
print (df)
A B C D E F
0 a 4.0 7 1.0 20.0 a
1 b 10.0 8 3.0 3.0 a
2 c 4.0 9 5.0 6.0 a
3 d 5.0 4 NaN 20.0 b
4 e 5.0 2 1.0 2.0 b
5 f 4.0 3 0.0 4.0 b
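If the precomputed means are held in a plain dict rather than a Series, fillna accepts that as well. The following is a minimal sketch on a reduced version of the sample data; the dict name `means` and its values are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, 4, 5, 5, 4],
    'D': [1, 3, 5, np.nan, 1, 0],
    'E': [np.nan, 3, 6, np.nan, 2, 4],
})

# Illustrative precomputed means, keyed by column name.
means = {'B': 10, 'E': 20}

# Only the columns named in the dict are filled; D keeps its NaN.
filled = df.fillna(means)
print(filled)
```

Columns that are not keys of the dict are left untouched, which matches the behaviour of the Series version above.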
If you need to replace missing values in all numeric columns, pass the column means to DataFrame.fillna. Older pandas versions skipped non-numeric columns in df.mean() automatically; since pandas 2.0 you have to ask for that with numeric_only=True:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,np.nan,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,np.nan,1,0],
'E':[np.nan,3,6,np.nan,2,4],
'F':list('aaabbb')
})
df1 = df.fillna(df.mean(numeric_only=True))
print (df1)
A B C D E F
0 a 4.0 7 1.0 3.75 a
1 b 4.4 8 3.0 3.00 a
2 c 4.0 9 5.0 6.00 a
3 d 5.0 4 2.0 3.75 b
4 e 5.0 2 1.0 2.00 b
5 f 4.0 3 0.0 4.00 b
If you only want to fill specific columns, select them with a list of column names:
cols = ['D','B']
df[cols] = df[cols].fillna(df[cols].mean())
print (df)
A B C D E F
0 a 4.0 7 1.0 NaN a
1 b 4.4 8 3.0 3.0 a
2 c 4.0 9 5.0 6.0 a
3 d 5.0 4 2.0 NaN b
4 e 5.0 2 1.0 2.0 b
5 f 4.0 3 0.0 4.0 b
Try this for each column you want to fill:
df['column1'] = df['column1'].fillna((df['column1'].mean()))
Related
Trying to conditionally fill NaN's in a dataframe, based on:
1. value on A (done with groupby)
2. inside groupby(A), if value is nan and is first, fill as zero and then ffill (A=a in example)
3. inside groupby(A), if value is nan and isn't first, bfill (A=b in example)
4. inside groupby(A), if value is nan but there's no datapoint to follow, ffill (A=c in example)
I smell ternary + lambda, but would like a pythonic way of writing it.
Basically, starting point would be:
df
A B
0 a NaN
1 a NaN
2 a 3.0
3 a 4.0
4 b 4.0
5 b NaN
6 b 6.0
7 b 6.0
8 c 7.0
9 c NaN
10 c NaN
And df should become:
df
A B
0 a 0.0
1 a 0.0
2 a 3.0
3 a 4.0
4 b 4.0
5 b 6.0
6 b 6.0
7 b 6.0
8 c 7.0
9 c 7.0
10 c 7.0
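We can use GroupBy.ffill first, so that the gaps inside each group are filled from that group's own earlier values. After that, Series.fillna(0) handles the remaining NaNs, which can only be the leading values of a group. Note that this forward-fills row 5 to 4.0, whereas the expected output in the question back-fills it to 6.0: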
df['B'] = df.groupby('A')['B'].ffill().fillna(0)
A B
0 a 0.0
1 a 0.0
2 a 3.0
3 a 4.0
4 b 4.0
5 b 4.0
6 b 6.0
7 b 6.0
8 c 7.0
9 c 7.0
10 c 7.0
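To reproduce the question's expected output exactly (including the back-fill in group b), the rules can be applied per group. This is only a sketch of one reading of the four rules; `fill_group` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aaaabbbbccc'),
    'B': [np.nan, np.nan, 3, 4, 4, np.nan, 6, 6, 7, np.nan, np.nan],
})

def fill_group(s):
    # Hypothetical helper: applies the question's rules to one group.
    s = s.copy()
    if pd.isna(s.iloc[0]):
        # Rule 2: a leading NaN becomes 0, then forward-fill.
        s.iloc[0] = 0
        return s.ffill()
    # Rules 3 and 4: back-fill where a later value exists,
    # then forward-fill whatever is still missing at the end.
    return s.bfill().ffill()

# transform keeps the original index, so the result assigns back directly.
df['B'] = df.groupby('A')['B'].transform(fill_group)
print(df['B'].tolist())
# [0.0, 0.0, 3.0, 4.0, 4.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0]
```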
How do you fill only groups inside a dataframe which are not fully nulls?
In the dataframe below, only groups with df.A=b and df.A=c should get filled.
df
A B
0 a NaN
1 a NaN
2 a NaN
3 a NaN
4 b 4.0
5 b NaN
6 b 6.0
7 b 6.0
8 c 7.0
9 c NaN
10 c NaN
I was thinking of something like:
if set(df[df.A==(need help here)].B.values) == {np.nan}:
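We can use groupby with transform, which keeps the original index so the result can be assigned back directly:
df['B'] = df.groupby('A')['B'].transform(lambda x: x.ffill().bfill())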
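Because ffill and bfill only copy values that already exist in the group, a group with no values has nothing to copy and stays NaN, so no explicit test for fully null groups is needed. A small check on the question's data, written with transform so the result keeps the original index (the variable name `filled` is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aaaabbbbccc'),
    'B': [np.nan] * 4 + [4, np.nan, 6, 6, 7, np.nan, np.nan],
})

# Fill forward, then backward, separately inside each group of A.
filled = df.groupby('A')['B'].transform(lambda x: x.ffill().bfill())

print(filled.iloc[:4].isna().all())   # group a is still entirely NaN
print(filled.iloc[4:].tolist())       # groups b and c are filled
```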
Alternatively, get the index labels of the groups that are not completely null, and forward-fill/back-fill only those rows:
df = df.set_index("A")
# get the index labels of groups where B is not completely null
ind = df.loc[df.groupby("A").B.transform(lambda x: x.eq(x))].index.unique()
df.loc[ind] = df.loc[ind].ffill().bfill()
print(df)
B
A
a NaN
a NaN
a NaN
a NaN
b 4.0
b 4.0
b 6.0
b 6.0
c 7.0
c 7.0
c 7.0
I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D' that holds the first non-null value in each row, taken from A, B, C in that order. The expected output is:
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D with A,B,C in that order.
df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
... return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
I think you need bfill along the columns, then select the first column with iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
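This is the same as the following on older pandas versions (the method argument of fillna is deprecated since pandas 2.1):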
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 1
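pandas (DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this only runs on older versions)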
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 2
numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
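The numpy option is compact, so here is a step-by-step sketch of what it does, using the same sample data. The intermediate names `is_nan` and `first_valid` are only for illustration. Note that a row consisting entirely of NaN would get column 0 from argmin and therefore still yield NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

v = df.to_numpy(dtype=float)

# Boolean mask: True where a value is missing.
is_nan = np.isnan(v)

# argmin returns the position of the first False in each row,
# i.e. the first column that is not NaN.
j = is_nan.argmin(axis=1)

# Pick one value per row: row i, column j[i].
first_valid = v[np.arange(len(v)), j]

print(j.tolist())            # [0, 0, 2]
print(first_valid.tolist())  # [1.0, 2.0, 7.0]
```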
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or chain the calls if you want to look up values sequentially:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
I have a dataframe df
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [np.nan, 1, 2,np.nan,2,np.nan,np.nan],
'B': [10, np.nan, np.nan,5,np.nan,np.nan,7],
'C': [1,1,2,2,3,3,3]})
which looks like :
A B C
0 NaN 10.0 1
1 1.0 NaN 1
2 2.0 NaN 2
3 NaN 5.0 2
4 2.0 NaN 3
5 NaN NaN 3
6 NaN 7.0 3
I want to replace all the NaN values in columns A and B with the value from other records in the same group, where the group is given by column C.
My expected output is :
A B C
0 1.0 10.0 1
1 1.0 10.0 1
2 2.0 5.0 2
3 2.0 5.0 2
4 2.0 7.0 3
5 2.0 7.0 3
6 2.0 7.0 3
How can I do this with a pandas dataframe?
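Use GroupBy.apply with forward and backward filling of the missing values. Select the columns with a list (double brackets) and pass group_keys=False so the result keeps the original index:
df[['A','B']] = df.groupby('C', group_keys=False)[['A','B']].apply(lambda x: x.ffill().bfill())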
print (df)
A B C
0 1.0 10.0 1
1 1.0 10.0 1
2 2.0 5.0 2
3 2.0 5.0 2
4 2.0 7.0 3
5 2.0 7.0 3
6 2.0 7.0 3
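A possible alternative, sketched here under the assumption that any non-null value from the same group is acceptable as the replacement: GroupBy.first returns the first non-null entry per group, and transform('first') broadcasts it back to every row, so it can serve as the fill value. The names `first_valid` and `out` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1, 2, np.nan, 2, np.nan, np.nan],
                   'B': [10, np.nan, np.nan, 5, np.nan, np.nan, 7],
                   'C': [1, 1, 2, 2, 3, 3, 3]})

# First non-null value of A and B within each group of C,
# repeated on every row of that group.
first_valid = df.groupby('C')[['A', 'B']].transform('first')

# Fill only the missing cells; existing values are left as they are.
out = df.copy()
out[['A', 'B']] = out[['A', 'B']].fillna(first_valid)
print(out)
```

Unlike ffill/bfill, this does not depend on the row order inside a group, but it always uses the group's first available value rather than the nearest one.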
I want to drop some rows from the data. I am using the following code:
import pandas as pd
import numpy as np
vle = pd.read_csv('/home/user/Documents/MOOC dataset original/vle.csv')
df = pd.DataFrame(vle)
df.dropna(subset = ['week_from'],axis=1,inplace = True)
df.dropna(subset = ['week_to'],axis=1,inplace = True)
df.to_csv('/home/user/Documents/MOOC dataset cleaned/studentRegistration.csv')
but it's throwing the following error:
raise KeyError(list(np.compress(check,subset)))
KeyError: [' week_from ']
What is going wrong?
I think you need to omit axis=1: the default axis=0 removes rows containing NaNs (missing values), and subset names the columns to check. The solution can also be simplified to one line:
df.dropna(subset = ['week_from', 'week_to'], inplace = True)
Sample:
df = pd.DataFrame({'A':list('abcdef'),
'week_from':[np.nan,5,4,5,5,4],
'week_to':[1,3,np.nan,7,1,0],
'E':[5,3,6,9,2,np.nan],
'F':list('aaabbb')})
print (df)
A week_from week_to E F
0 a NaN 1.0 5.0 a
1 b 5.0 3.0 3.0 a
2 c 4.0 NaN 6.0 a
3 d 5.0 7.0 9.0 b
4 e 5.0 1.0 2.0 b
5 f 4.0 0.0 NaN b
df.dropna(subset = ['week_from', 'week_to'], inplace = True)
print (df)
A week_from week_to E F
1 b 5.0 3.0 3.0 a
3 d 5.0 7.0 9.0 b
4 e 5.0 1.0 2.0 b
5 f 4.0 0.0 NaN b
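To see why the original call failed, here is a small reproduction on made-up data (only the column names come from the question). With axis=1, the labels in subset are looked up among the row labels, so a column name is not found there.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'week_from': [np.nan, 5, 4],
                   'week_to': [1, 3, np.nan]})

# With axis=1, `subset` is matched against the row labels (0, 1, 2),
# so the column name 'week_from' is missing and pandas raises KeyError.
try:
    df.dropna(subset=['week_from'], axis=1)
    raised = False
except KeyError:
    raised = True
print(raised)

# With the default axis=0, `subset` names columns and rows are dropped.
cleaned = df.dropna(subset=['week_from', 'week_to'])
print(cleaned)
```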
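If you want to remove columns instead, pass row labels in subset to say which rows to check for NaNs (shown here on the original sample df, before any rows were dropped):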
df.dropna(subset = [2, 5], axis=1, inplace = True)
print (df)
A week_from F
0 a NaN a
1 b 5.0 a
2 c 4.0 a
3 d 5.0 b
4 e 5.0 b
5 f 4.0 b
But if you need to remove columns by name, the solution is different; use drop (again on the original sample df):
df.drop(['A','week_from'],axis=1, inplace = True)
print (df)
week_to E F
0 1.0 5.0 a
1 3.0 3.0 a
2 NaN 6.0 a
3 7.0 9.0 b
4 1.0 2.0 b
5 0.0 NaN b