I am trying to build a simple function that fills the NaN values in pandas columns with
values drawn from a distribution, but it fails to fill the whole table (df still has NaN after fillna ...)
import pandas as pd
from numpy.random import normal

def simple_impute_missing(df):
    # one column of normal draws per column of df, starting at the fourth column
    rnd_filled = pd.DataFrame({c: normal(df[c].mean(), df[c].std(), len(df))
                               for c in df.columns[3:]})
    filled_df = df.fillna(rnd_filled)
    return filled_df
But the returned df still has NaNs!
I have checked that rnd_filled is full and has the right shape.
What is going on?
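I think you need to remove [3:] from df.columns[3:] to select all columns of df. fillna with a DataFrame only fills columns that are present in that DataFrame, so the first three columns are never filled.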
Sample:
df = pd.DataFrame({'A':[1,np.nan,3],
'B':[4,5,6],
'C':[np.nan,8,9],
'D':[1,3,np.nan],
'E':[5,np.nan,6],
'F':[7,np.nan,3]})
print (df)
A B C D E F
0 1.0 4 NaN 1.0 5.0 7.0
1 NaN 5 8.0 3.0 NaN NaN
2 3.0 6 9.0 NaN 6.0 3.0
rnd_filled = pd.DataFrame( {c : normal(df[c].mean(), df[c].std(), len(df))
for c in df.columns})
filled_df = df.fillna(rnd_filled)
print (filled_df)
A B C D E F
0 1.000000 4 6.922458 1.000000 5.000000 7.000000
1 2.277218 5 8.000000 3.000000 5.714767 6.245759
2 3.000000 6 9.000000 0.119522 6.000000 3.000000
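A second way fillna can leave NaNs behind is index alignment: when given a DataFrame, fillna matches on index labels as well as column names, so a filler built with a default 0..n-1 index misses rows if df has a different index. The following is a minimal sketch of a version that guards against both problems. The function name impute_normal, the seed parameter, and the sample frame are illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd


def impute_normal(df, seed=0):
    """Fill NaNs in numeric columns with draws from N(column mean, column std)."""
    rng = np.random.default_rng(seed)
    numeric_cols = df.select_dtypes(include="number").columns
    draws = pd.DataFrame(
        {c: rng.normal(df[c].mean(), df[c].std(), len(df)) for c in numeric_cols},
        index=df.index,  # reuse df's index so fillna aligns row by row
    )
    return df.fillna(draws)


sample = pd.DataFrame(
    {"A": [1, np.nan, 3], "B": [4, 5, 6], "C": [np.nan, 8, 9]},
    index=["x", "y", "z"],  # non-default index on purpose
)
filled = impute_normal(sample)
print(filled.isna().sum().sum())  # number of NaNs left after filling
```

Values that were already present are left untouched; only the NaN cells receive random draws. A column with fewer than two non-null values has an undefined standard deviation, so it would need separate handling.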
Say I have two data frames:
Original:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 NaN
2 NaN NaN 9.0
Imputation:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
(both are the same dataframes except imputation has the NaN's filled in).
I would like to reintroduce the NaN values into column A of the imputation df so it looks like this (columns B and C are filled in, but A keeps its NaN values):
# A B C
# 0 NaN 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN 6.0 9.0
import pandas as pd
import numpy as np
dfImputation = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
dfOrginal = pd.DataFrame({'A':[np.nan,2,np.nan],
'B':[4,5,np.nan],
'C':[7,np.nan,9]})
print(dfOrginal.fillna(dfImputation))
I do not get the result I want because this fills in all values. Is there a way to reintroduce NaN values, or a way to fill NaN for specific columns only? I'm not sure of the best approach to get the intended outcome.
You can fill in only specified columns by subsetting the frame you pass into the fillna operation:
>>> dfOrginal.fillna(dfImputation[["B", "C"]])
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
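You can also use update (here df is the original frame and im is the imputation frame; update modifies df in place):
df.update(im[['B','C']])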
df
Out[7]:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
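The two answers give the same result on this data, but they are not equivalent in general: fillna only touches NaN cells and returns a new frame, while update writes in place and also overwrites existing non-NaN values wherever the other frame has a value. A small sketch of the difference follows. The names original and imputation are illustrative, and the value 40 in the imputation frame is deliberately changed from the question's data so the difference is visible.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"A": [np.nan, 2, np.nan],
                         "B": [4, 5, np.nan],
                         "C": [7, np.nan, 9]})
# B[0] is 40 here (not 4) to show what update does to non-NaN cells
imputation = pd.DataFrame({"A": [1, 2, 3],
                           "B": [40, 5, 6],
                           "C": [7, 8, 9]})

# fillna: only NaN cells in B and C are filled; A is not in the filler, so it keeps its NaNs
filled = original.fillna(imputation[["B", "C"]])

# update: works in place, so copy first; every cell of B and C is overwritten
updated = original.copy()
updated.update(imputation[["B", "C"]])

print(filled)
print(updated)
```

If the imputation frame is guaranteed to agree with the original wherever the original is non-null, either approach works; otherwise fillna is the safer choice.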
How do you fill only groups inside a dataframe which are not fully nulls?
In the dataframe below, only groups with df.A=b and df.A=c should get filled.
df
A B
0 a NaN
1 a NaN
2 a NaN
3 a NaN
4 b 4.0
5 b NaN
6 b 6.0
7 b 6.0
8 c 7.0
9 c NaN
10 c NaN
Was thinking something like:
if set(df[df.A==(need help here)].B.values) == {np.nan}:.
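We can use groupby with transform, which keeps the original index so the result can be assigned straight back:
df['B'] = df.groupby('A')['B'].transform(lambda x: x.ffill().bfill())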
Get the index labels of the groups that are not completely null, then forward-fill/backward-fill on those labels:
df = df.set_index("A")
# x.eq(x) is False only for NaN, so this keeps the labels of groups with at least one non-null B
ind = df.loc[df.groupby("A").B.transform(lambda x: x.eq(x))].index.unique()
df.loc[ind] = df.loc[ind].ffill().bfill()
print(df)
B
A
a NaN
a NaN
a NaN
a NaN
b 4.0
b 4.0
b 6.0
b 6.0
c 7.0
c 7.0
c 7.0
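A per-group fill leaves fully null groups alone automatically, because there is no value inside the group to propagate, and it never carries a value across a group boundary. Here is a minimal sketch on the question's data; the column name B_filled is an illustrative choice so the original column stays available for comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": list("aaaabbbbccc"),
    "B": [np.nan] * 4 + [4, np.nan, 6, 6, 7, np.nan, np.nan],
})

# transform returns a Series with the same index as df, so it can be assigned directly
df["B_filled"] = df.groupby("A")["B"].transform(lambda s: s.ffill().bfill())
print(df)
```

Group a stays entirely NaN, while the gaps in groups b and c are filled from their own neighbours.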
I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D' holding the first non-null value of A, B, C in each row. The expected output is
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D with A,B,C in that order.
df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
... return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
I think you need bfill with selecting first column by iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
This is the same as the following, but fillna(method=...) is deprecated since pandas 2.1, so prefer bfill:
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
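option 1
pandas (DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this only works on older versions)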
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
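On pandas versions without DataFrame.lookup, the same label-based idea can be sketched with positional indexing instead. This is one possible replacement, not the answer author's code; the variable names first_col, col_pos, and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

# label of the first non-null column in each row
first_col = df.notna().idxmax(axis=1)
# translate those labels into integer column positions
col_pos = df.columns.get_indexer(first_col)
# pick one value per row from the underlying array
values = df.to_numpy()[np.arange(len(df)), col_pos]

result = df.assign(D=values)
print(result)
```

For a row that is entirely NaN, idxmax falls back to the first column, so D would simply be NaN for that row.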
option 2
numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or chain the calls if you want to fall back through several columns in order:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
I get data from sensors, and for certain periods they return a blank string for no apparent reason.
During data cleaning, I can get the rows that contain NaN using this:
df[df.isnull().values.any(axis=1)]
Time IL1 IL2 IL3 IN kVA kW kWh
12463 2018-09-17 10:30:00 63.7 78.4 53.3 25.2 NaN NaN 2039676.0
12464 2018-09-17 11:00:00 64.1 78.6 53.5 25.4 NaN NaN 2039698.0
How can I get the names of the columns that contain NaN (kVA and kW) out of the DataFrame?
Then I can find the median of kVA and kW from the other rows and replace the NaN with it.
My use case:
Right now I have to read the file and find which columns contain NaN by hand, which takes effort. I want to automate that process instead of hardcoding the column names.
trdb_a2_2018_df = pd.read_csv(PATH + 'dpm_trdb_a2_2018.csv', thousands=',', parse_dates=['Time'], date_parser=extract_dt)
trdb_a2_2018_df = trdb_a2_2018_df.replace(r'\s+', np.nan, regex=True)
median_kVA = trdb_a2_2018_df['kVA'].median()
trdb_a2_2018_df['kVA'] = trdb_a2_2018_df['kVA'].fillna(median_kVA)
I believe you need fillna with median:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,np.nan],
'C':[7,np.nan,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7.0 1 5 a
1 b 5.0 NaN 3 3 a
2 c 4.0 9.0 5 6 a
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f NaN 3.0 0 4 b
df1 = df.fillna(df.median(numeric_only=True))
print (df1)
A B C D E F
0 a 4.0 7.0 1 5 a
1 b 5.0 4.0 3 3 a
2 c 4.0 9.0 5 6 a
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f 5.0 3.0 0 4 b
If you also want to restrict the fill to the columns that contain NaNs:
m = df.isnull().any()
df.loc[:, m] = df.loc[:, m].fillna(df.loc[:, m].median())
Alternative:
cols = df.columns[df.isnull().any()]
df[cols] = df[cols].fillna(df[cols].median())
Detail:
print (df.median(numeric_only=True))
B 5.0
C 4.0
D 2.0
E 4.5
dtype: float64
IIUC, to get the column headers that contain NaNs use:
df.columns[df.isna().any()]
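Putting the two pieces together (detect the columns with NaN, then fill each with its own median) could look like the sketch below. The function name fill_nan_columns_with_median and the sample readings are made up for illustration; only the column names echo the question.

```python
import numpy as np
import pandas as pd


def fill_nan_columns_with_median(df):
    """Return a filled copy of df and the names of the numeric columns that had NaNs."""
    numeric = df.select_dtypes(include="number")
    nan_cols = numeric.columns[numeric.isna().any()].tolist()
    out = df.copy()
    if nan_cols:  # nothing to do when no numeric column has NaN
        out[nan_cols] = out[nan_cols].fillna(out[nan_cols].median())
    return out, nan_cols


# illustrative sensor-like data, not taken from the original file
readings = pd.DataFrame({
    "Time": pd.to_datetime(["2018-09-17 10:00", "2018-09-17 10:30", "2018-09-17 11:00"]),
    "IL1": [63.5, 63.7, 64.1],
    "kVA": [40.0, np.nan, 44.0],
    "kW": [np.nan, 30.0, 34.0],
})
cleaned, nan_cols = fill_nan_columns_with_median(readings)
print(nan_cols)
print(cleaned)
```

Restricting the search to numeric columns keeps the Time column and any text columns out of the median calculation, so no column names need to be hardcoded.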
There are two ways to solve this:
1. Use pandas.DataFrame.fillna to replace the NaN values with a chosen value such as 0.
2. Use pandas.DataFrame.dropna to get a new DataFrame with the rows (or columns) that contain NaN filtered out of the original.
Reference:
Pandas dropna API
Pandas fillna API
Let's assume that this is an initial df:
df = pd.DataFrame([{'kVa': np.nan, 'kW':10.1}, {'kVa': 12.5, 'kW':14.3}, {'kVa': 16.1, 'kW':np.nan}])
In [51]: df
Out[51]:
kVa kW
0 NaN 10.1
1 12.5 14.3
2 16.1 NaN
You can use the Series .fillna() method to replace NaNs. .median() skips NaN by default, so filtering with .notna() first is optional:
df['kVa'] = df['kVa'].fillna(df['kVa'].median())
df['kW'] = df['kW'].fillna(df['kW'].median())
Assigning the result back to the column is more robust than calling fillna(inplace=True) on df.kVa, which is a chained operation and may not update df in recent pandas versions.
Df after these manipulations:
In [54]: df
Out[54]:
kVa kW
0 14.3 10.1
1 12.5 14.3
2 16.1 12.2
I want to drop some rows from the data. I am using the following code:
import pandas as pd
import numpy as np
vle = pd.read_csv('/home/user/Documents/MOOC dataset original/vle.csv')
df = pd.DataFrame(vle)
df.dropna(subset = ['week_from'],axis=1,inplace = True)
df.dropna(subset = ['week_to'],axis=1,inplace = True)
df.to_csv('/home/user/Documents/MOOC dataset cleaned/studentRegistration.csv')
but it throws the following error:
raise KeyError(list(np.compress(check,subset)))
KeyError: [' week_from ']
What is going wrong?
I think you need to omit axis=1. The default, axis=0, removes rows with NaNs (missing values), and subset then names the columns to check. With axis=1, subset is interpreted as row labels, which is why 'week_from' raises a KeyError. The solution can also be simplified to one line:
df.dropna(subset = ['week_from', 'week_to'], inplace = True)
Sample:
df = pd.DataFrame({'A':list('abcdef'),
'week_from':[np.nan,5,4,5,5,4],
'week_to':[1,3,np.nan,7,1,0],
'E':[5,3,6,9,2,np.nan],
'F':list('aaabbb')})
print (df)
A week_from week_to E F
0 a NaN 1.0 5.0 a
1 b 5.0 3.0 3.0 a
2 c 4.0 NaN 6.0 a
3 d 5.0 7.0 9.0 b
4 e 5.0 1.0 2.0 b
5 f 4.0 0.0 NaN b
df.dropna(subset = ['week_from', 'week_to'], inplace = True)
print (df)
A week_from week_to E F
1 b 5.0 3.0 3.0 a
3 d 5.0 7.0 9.0 b
4 e 5.0 1.0 2.0 b
5 f 4.0 0.0 NaN b
If you want to remove columns instead, pass axis=1 and use subset to specify the rows to check for NaNs (shown here on the original sample df):
df.dropna(subset = [2, 5], axis=1, inplace = True)
print (df)
A week_from F
0 a NaN a
1 b 5.0 a
2 c 4.0 a
3 d 5.0 b
4 e 5.0 b
5 f 4.0 b
But if you need to remove columns by name, the solution is different; use drop (again on the original sample df):
df.drop(['A','week_from'],axis=1, inplace = True)
print (df)
week_to E F
0 1.0 5.0 a
1 3.0 3.0 a
2 NaN 6.0 a
3 7.0 9.0 b
4 1.0 2.0 b
5 0.0 NaN b
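To see why the question's call failed, it helps to run both variants side by side on a small frame. The sketch below uses a made-up three-row frame with the question's column names; the variable names rows_kept and raised are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "week_from": [np.nan, 5, 4],
    "week_to": [1, np.nan, 3],
    "E": [5, 3, np.nan],
})

# axis=0 (default): subset names columns, rows with NaN in any of them are dropped
rows_kept = df.dropna(subset=["week_from", "week_to"])

# axis=1: subset is looked up among the row labels 0, 1, 2, so a column name is not found
try:
    df.dropna(subset=["week_from"], axis=1)
    raised = False
except KeyError:
    raised = True

print(rows_kept)
print(raised)
```

Only the last row survives the row-wise drop, and the NaN in column E is kept because E is not listed in subset.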