I get data from sensors, and for certain periods they return a blank string for no reason.
During data cleaning, I can find the rows containing NaN using this:
df[df.isnull().values.any(axis=1)]
Time IL1 IL2 IL3 IN kVA kW kWh
12463 2018-09-17 10:30:00 63.7 78.4 53.3 25.2 NaN NaN 2039676.0
12464 2018-09-17 11:00:00 64.1 78.6 53.5 25.4 NaN NaN 2039698.0
How can I pull the kVA and kW columns out of the DataFrame? Then I can compute the median of kVA and kW from the other rows and replace the NaNs with it.
My use case:
Right now I have to read the file and find by hand which columns contain NaN, which takes manual effort, so I want to automate that process instead of hardcoding column names.
trdb_a2_2018_df = pd.read_csv(PATH + 'dpm_trdb_a2_2018.csv', thousands=',', parse_dates=['Time'], date_parser=extract_dt)
trdb_a2_2018_df = trdb_a2_2018_df.replace(r'\s+', np.nan, regex=True)
median_kVA = trdb_a2_2018_df['kVA'].median()
trdb_a2_2018_df['kVA'] = trdb_a2_2018_df['kVA'].fillna(median_kVA)
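The hard-coded column name can be avoided entirely. A minimal sketch (with made-up sample values standing in for the sensor CSV) that detects which columns still contain NaNs and fills each with its own median:

```python
import numpy as np
import pandas as pd

# toy stand-in for the cleaned sensor frame
df = pd.DataFrame({
    'IL1': [63.7, 64.1, 63.9],
    'kVA': [np.nan, np.nan, 88.2],
    'kW':  [np.nan, 70.3, 71.0],
})

# columns that still contain at least one NaN
cols = df.columns[df.isna().any()]

# fill each such column with its own median
df[cols] = df[cols].fillna(df[cols].median())
```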
I believe you need fillna with median:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,np.nan],
'C':[7,np.nan,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7.0 1 5 a
1 b 5.0 NaN 3 3 a
2 c 4.0 9.0 5 6 a
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f NaN 3.0 0 4 b
df1 = df.fillna(df.median())
print (df1)
A B C D E F
0 a 4.0 7.0 1 5 a
1 b 5.0 4.0 3 3 a
2 c 4.0 9.0 5 6 a
3 d 5.0 4.0 7 9 b
4 e 5.0 2.0 1 2 b
5 f 5.0 3.0 0 4 b
If you also want to restrict the fill to only the columns that contain NaNs:
m = df.isnull().any()
df.loc[:, m] = df.loc[:, m].fillna(df.loc[:, m].median())
Alternative:
cols = df.columns[df.isnull().any()]
df[cols] = df[cols].fillna(df[cols].median())
Detail:
print (df.median())
B 5.0
C 4.0
D 2.0
E 4.5
dtype: float64
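One caveat: on recent pandas versions, DataFrame.median raises a TypeError when object columns are present, so the call may need numeric_only=True. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, np.nan],
    'C': [7, np.nan, 9, 4, 2, 3],
})

# numeric_only=True makes median() skip the string column 'A'
# instead of raising on newer pandas
df1 = df.fillna(df.median(numeric_only=True))
```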
IIUC, to get the column labels that contain NaNs, use:
df.columns[df.isna().any()]
There are two ways to solve this question.
Use pandas.DataFrame.fillna to replace the NaN values with a certain value such as 0.
Use pandas.DataFrame.dropna to get a new DataFrame by filtering the NaN rows out of the original one.
Reference:
Pandas dropna API
Pandas fillna API
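A minimal side-by-side sketch of the two approaches on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'kVA': [1.0, np.nan, 3.0],
                   'kW':  [np.nan, 2.0, 3.0]})

filled = df.fillna(0)   # replace every NaN with a fixed value
dropped = df.dropna()   # keep only the rows without any NaN
```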
Let's assume that this is an initial df:
df = pd.DataFrame([{'kVa': np.nan, 'kW':10.1}, {'kVa': 12.5, 'kW':14.3}, {'kVa': 16.1, 'kW':np.nan}])
In [51]: df
Out[51]:
kVa kW
0 NaN 10.1
1 12.5 14.3
2 16.1 NaN
You can use the DataFrame's .fillna() method to replace NaNs and .notna() to get the indexes of values other than NaN:
df.kVa.fillna(df.kVa[df.kVa.notna()].median(), inplace=True)
df.kW.fillna(df.kW[df.kW.notna()].median(), inplace=True)
Use inplace=True to modify the column in place instead of creating a new Series instance.
Df after these manipulations:
In [54]: df
Out[54]:
kVa kW
0 14.3 10.1
1 12.5 14.3
2 16.1 12.2
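As a side note, the .notna() filter inside the median call is optional, since Series.median skips NaNs by default (skipna=True). A quick check:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 12.5, 16.1])

# median() ignores NaN by default, so pre-filtering changes nothing
m1 = s.median()
m2 = s[s.notna()].median()

filled = s.fillna(m1)
```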
Related
How do you fill only the groups inside a DataFrame which are not fully null?
In the DataFrame below, only the groups with df.A == 'b' and df.A == 'c' should get filled.
df
A B
0 a NaN
1 a NaN
2 a NaN
3 a NaN
4 b 4.0
5 b NaN
6 b 6.0
7 b 6.0
8 c 7.0
9 c NaN
10 c NaN
I was thinking of something like:
if set(df[df.A==(need help here)].B.values) == {np.nan}:.
We can do this with groupby:
df.B=df.groupby('A').B.apply(lambda x : x.ffill().bfill())
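This works because ffill/bfill have nothing to propagate inside an all-NaN group, so group 'a' is left untouched automatically. A quick check on the data above (using transform here to keep the index aligned):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aaaabbbbccc'),
    'B': [np.nan, np.nan, np.nan, np.nan,
          4.0, np.nan, 6.0, 6.0,
          7.0, np.nan, np.nan],
})

# fill within each group; an all-NaN group stays all-NaN
df['B'] = df.groupby('A')['B'].transform(lambda x: x.ffill().bfill())
```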
Get the indices of the groups that are not completely null, then forward-fill/backward-fill on those indices:
df = df.set_index("A")
# get the index labels where the entries in B are not all null;
# x.eq(x) is False only for NaN (since NaN != NaN)
ind = df.loc[df.groupby("A").B.transform(lambda x: x.eq(x))].index.unique()
df.loc[ind] = df.loc[ind].ffill().bfill()
print(df)
B
A
a NaN
a NaN
a NaN
a NaN
b 4.0
b 4.0
b 6.0
b 6.0
c 7.0
c 7.0
c 7.0
I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D' holding, for each row, the first non-null value from A, B and C in that order. The expected output is
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D with A,B,C in that order.
df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
... return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
I think you need bfill along axis=1 and then select the first column with iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
same as:
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 1
pandas
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
(note: DataFrame.lookup was deprecated and later removed from pandas, so on recent versions prefer the numpy route in option 2)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 2
numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
naive time test: benchmark plots (over the given data and over larger data) omitted
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or just stack them if you want to look up values sequentially:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
I have a DataFrame containing values as well as some NaNs. I have the mean of each column and want to insert it into that column's NaN values. For example:
ColA and ColB have NaNs to be replaced with the means I already have.
I have the means for ColA and ColB and want to insert them at the NaN locations. I could do that column by column using the replace method, but with many columns, is there another way to achieve this?
EDIT:
If you already have a Series of means, use DataFrame.fillna directly:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,np.nan,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,np.nan,1,0],
'E':[np.nan,3,6,np.nan,2,4],
'F':list('aaabbb')
})
means = pd.Series([10,20], index=['B','E'])
df= df.fillna(means)
print (df)
A B C D E F
0 a 4.0 7 1.0 20.0 a
1 b 10.0 8 3.0 3.0 a
2 c 4.0 9 5.0 6.0 a
3 d 5.0 4 NaN 20.0 b
4 e 5.0 2 1.0 2.0 b
5 f 4.0 3 0.0 4.0 b
If you need to replace the missing values in all numeric columns, use DataFrame.fillna with the mean - this works because mean() excludes non-numeric columns (on recent pandas versions you may need df.mean(numeric_only=True)):
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,np.nan,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,np.nan,1,0],
'E':[np.nan,3,6,np.nan,2,4],
'F':list('aaabbb')
})
df1 = df.fillna(df.mean())
print (df1)
A B C D E F
0 a 4.0 7 1.0 3.75 a
1 b 4.4 8 3.0 3.00 a
2 c 4.0 9 5.0 6.00 a
3 d 5.0 4 2.0 3.75 b
4 e 5.0 2 1.0 2.00 b
5 f 4.0 3 0.0 4.00 b
If you need the means only for specific columns, change the solution to use a list of column names:
cols = ['D','B']
df[cols] = df[cols].fillna(df[cols].mean())
print (df)
A B C D E F
0 a 4.0 7 1.0 NaN a
1 b 4.4 8 3.0 3.0 a
2 c 4.0 9 5.0 6.0 a
3 d 5.0 4 2.0 NaN b
4 e 5.0 2 1.0 2.0 b
5 f 4.0 3 0.0 4.0 b
Try this for each column you want to fill:
df['column1'] = df['column1'].fillna(df['column1'].mean())
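To apply that per-column pattern across many columns without repeating the line, one possible sketch loops over the numeric columns only (the column names here are made up; select_dtypes guards against string columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': list('abc'),
                   'ColA': [1.0, np.nan, 3.0],
                   'ColB': [np.nan, 2.0, 4.0]})

# fill each numeric column with its own mean
for c in df.select_dtypes(include='number').columns:
    df[c] = df[c].fillna(df[c].mean())
```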
I want to drop some rows from the data. I am using following code-
import pandas as pd
import numpy as np
vle = pd.read_csv('/home/user/Documents/MOOC dataset original/vle.csv')
df = pd.DataFrame(vle)
df.dropna(subset = ['week_from'],axis=1,inplace = True)
df.dropna(subset = ['week_to'],axis=1,inplace = True)
df.to_csv('/home/user/Documents/MOOC dataset cleaned/studentRegistration.csv')
but it throws the following error:
raise KeyError(list(np.compress(check,subset)))
KeyError: [' week_from ']
What is going wrong?
I think you need to omit axis=1, because the default axis=0 removes rows with NaNs, using the subset of columns to check for missing values. The solution can also be simplified to one line:
df.dropna(subset = ['week_from', 'week_to'], inplace = True)
Sample:
df = pd.DataFrame({'A':list('abcdef'),
'week_from':[np.nan,5,4,5,5,4],
'week_to':[1,3,np.nan,7,1,0],
'E':[5,3,6,9,2,np.nan],
'F':list('aaabbb')})
print (df)
A week_from week_to E F
0 a NaN 1.0 5.0 a
1 b 5.0 3.0 3.0 a
2 c 4.0 NaN 6.0 a
3 d 5.0 7.0 9.0 b
4 e 5.0 1.0 2.0 b
5 f 4.0 0.0 NaN b
df.dropna(subset = ['week_from', 'week_to'], inplace = True)
print (df)
A week_from week_to E F
1 b 5.0 3.0 3.0 a
3 d 5.0 7.0 9.0 b
4 e 5.0 1.0 2.0 b
5 f 4.0 0.0 NaN b
If you want to remove columns by checking specific rows for NaNs:
df.dropna(subset = [2, 5], axis=1, inplace = True)
print (df)
A week_from F
0 a NaN a
1 b 5.0 a
2 c 4.0 a
3 d 5.0 b
4 e 5.0 b
5 f 4.0 b
But if you need to remove columns by name, the solution is different - use drop:
df.drop(['A','week_from'],axis=1, inplace = True)
print (df)
week_to E F
0 1.0 5.0 a
1 3.0 3.0 a
2 NaN 6.0 a
3 7.0 9.0 b
4 1.0 2.0 b
5 0.0 NaN b
I am trying to build a simple function to fill the pandas columns with
some distribution, but it fails to fill the whole table (the df still has NaNs after fillna ...)
def simple_impute_missing(df):
from numpy.random import normal
rnd_filled = pd.DataFrame( {c : normal(df[c].mean(), df[c].std(), len(df))
for c in df.columns[3:]})
filled_df = df.fillna(rnd_filled)
return filled_df
But the returned df still has NaNs!
I have checked to make sure that rnd_filled is full and has the right shape.
What is going on?
I think you need to remove the [3:] slice from df.columns[3:], so that all columns of df are selected.
Sample:
df = pd.DataFrame({'A':[1,np.nan,3],
'B':[4,5,6],
'C':[np.nan,8,9],
'D':[1,3,np.nan],
'E':[5,np.nan,6],
'F':[7,np.nan,3]})
print (df)
A B C D E F
0 1.0 4 NaN 1.0 5.0 7.0
1 NaN 5 8.0 3.0 NaN NaN
2 3.0 6 9.0 NaN 6.0 3.0
from numpy.random import normal

rnd_filled = pd.DataFrame({c: normal(df[c].mean(), df[c].std(), len(df))
                           for c in df.columns})
filled_df = df.fillna(rnd_filled)
print (filled_df)
A B C D E F
0 1.000000 4 6.922458 1.000000 5.000000 7.000000
1 2.277218 5 8.000000 3.000000 5.714767 6.245759
2 3.000000 6 9.000000 0.119522 6.000000 3.000000
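Putting the fix together, a corrected version of the function (with the slice removed so every column gets a random draw; this sketch assumes all columns are numeric) might look like:

```python
import numpy as np
import pandas as pd

def simple_impute_missing(df):
    # one random draw per cell from each column's fitted normal,
    # used only where the original value is NaN
    rnd_filled = pd.DataFrame({c: np.random.normal(df[c].mean(), df[c].std(), len(df))
                               for c in df.columns})
    return df.fillna(rnd_filled)

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [4.0, 5.0, 6.0]})
filled = simple_impute_missing(df)
```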