KeyError when trying to drop a column in pandas - python

I want to drop some rows from the data. I am using the following code:
import pandas as pd
import numpy as np
vle = pd.read_csv('/home/user/Documents/MOOC dataset original/vle.csv')
df = pd.DataFrame(vle)
df.dropna(subset = ['week_from'],axis=1,inplace = True)
df.dropna(subset = ['week_to'],axis=1,inplace = True)
df.to_csv('/home/user/Documents/MOOC dataset cleaned/studentRegistration.csv')
but it's throwing the following error:
raise KeyError(list(np.compress(check,subset)))
KeyError: [' week_from ']
What is going wrong?

I think you need to omit axis=1, because the default axis=0 removes rows with NaNs (missing values); dropna with subset checks the listed columns for NaNs. The solution can also be simplified to one line:
df.dropna(subset = ['week_from', 'week_to'], inplace = True)
Sample:
df = pd.DataFrame({'A':list('abcdef'),
'week_from':[np.nan,5,4,5,5,4],
'week_to':[1,3,np.nan,7,1,0],
'E':[5,3,6,9,2,np.nan],
'F':list('aaabbb')})
print (df)
A week_from week_to E F
0 a NaN 1.0 5.0 a
1 b 5.0 3.0 3.0 a
2 c 4.0 NaN 6.0 a
3 d 5.0 7.0 9.0 b
4 e 5.0 1.0 2.0 b
5 f 4.0 0.0 NaN b
df.dropna(subset = ['week_from', 'week_to'], inplace = True)
print (df)
A week_from week_to E F
1 b 5.0 3.0 3.0 a
3 d 5.0 7.0 9.0 b
4 e 5.0 1.0 2.0 b
5 f 4.0 0.0 NaN b
If you want to remove columns by specifying the rows to check for NaNs (starting again from the original df):
df.dropna(subset = [2, 5], axis=1, inplace = True)
print (df)
A week_from F
0 a NaN a
1 b 5.0 a
2 c 4.0 a
3 d 5.0 b
4 e 5.0 b
5 f 4.0 b
But if you need to remove columns by name, the solution is different; use drop (again on the original df):
df.drop(['A','week_from'],axis=1, inplace = True)
print (df)
week_to E F
0 1.0 5.0 a
1 3.0 3.0 a
2 NaN 6.0 a
3 7.0 9.0 b
4 1.0 2.0 b
5 0.0 NaN b
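For completeness, a corrected version of the original script might look like this (paths kept exactly as in the question; read_csv already returns a DataFrame, so the extra pd.DataFrame call can be dropped):
import pandas as pd

# read_csv already returns a DataFrame, no need to wrap it again
df = pd.read_csv('/home/user/Documents/MOOC dataset original/vle.csv')

# default axis=0 drops rows with NaN in either of the listed columns
df.dropna(subset=['week_from', 'week_to'], inplace=True)

df.to_csv('/home/user/Documents/MOOC dataset cleaned/studentRegistration.csv')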

Related

Get the name of the 2nd non-blank column for each row

I have the following pandas dataframe:
A B C
0 1.0 NaN 2.0
1 NaN 1.0 4.0
2 7.0 1.0 2.0
I know I can get, for each row, the name of the first non-blank column with this script:
df['first'] = df.dropna(how='all').notna().idxmax(axis=1).astype('string')
but how can I get the name of the second non-blank column?
This is the expected output:
A B C first second
0 1.0 NaN 2.0 A C
1 NaN 1.0 4.0 B C
2 7.0 1.0 2.0 A B
Thanks
You can drop the NaNs with apply:
df[['first', 'second']] = df.apply(lambda x: pd.Series(x.dropna().index), axis=1)
Output:
A B C first second
0 1.0 NaN 2.0 A C
1 NaN 1.0 4.0 B C
2 7.0 1.0 2.0 A B

pandas fill only group meeting criteria?

How do you fill only groups inside a dataframe which are not fully nulls?
In the dataframe below, only groups with df.A=b and df.A=c should get filled.
df
A B
0 a NaN
1 a NaN
2 a NaN
3 a NaN
4 b 4.0
5 b NaN
6 b 6.0
7 b 6.0
8 c 7.0
9 c NaN
10 c NaN
Was thinking something like:
if set(df[df.A==(need help here)].B.values) == {np.nan}:.
We can do it with groupby:
df.B=df.groupby('A').B.apply(lambda x : x.ffill().bfill())
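For reference, a minimal sketch that rebuilds the question's frame and applies the same idea; transform is used here instead of apply purely to keep the original index alignment (that swap is my assumption, not part of the one-liner above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('aaaabbbbccc'),
    'B': [np.nan, np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 6.0, 7.0, np.nan, np.nan],
})

# ffill/bfill within each group; all-NaN groups simply have nothing to fill
df['B'] = df.groupby('A')['B'].transform(lambda x: x.ffill().bfill())
print(df)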
Get the indices that are not completely null, and then forwardfill/backwardfill on these indices
df = df.set_index("A")
#get the index labels (groups) where entries in B are not completely null
ind = df.loc[df.groupby("A").B.transform(lambda x: x.eq(x))].index.unique()
df.loc[ind] = df.loc[ind].ffill().bfill()
print(df)
B
A
a NaN
a NaN
a NaN
a NaN
b 4.0
b 4.0
b 6.0
b 6.0
c 7.0
c 7.0
c 7.0

Combining multiple columns into one column pandas [duplicate]

I have a data frame like
df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
I want to add a new column 'D'. Expected output is
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Thanks in advance!
Another way is to explicitly fill column D with A,B,C in that order.
df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
Another approach is to use the combine_first method of a pd.Series. Using your example df,
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
A B C
0 1.0 NaN 5
1 2.0 10.0 10
2 NaN NaN 7
we have
>>> df.A.combine_first(df.B).combine_first(df.C)
0 1.0
1 2.0
2 7.0
We can use reduce to abstract this pattern to work with an arbitrary number of columns.
>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
Let's put this all together in a function.
>>> def coalesce(*args):
... return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0 1.0
1 2.0
2 7.0
Name: A, dtype: float64
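To produce the D column from the question, the result can simply be assigned back (a small usage note, not part of the original answer):
>>> df['D'] = coalesce(df.A, df.B, df.C)
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0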
I think you need bfill with axis=1 and then select the first column with iloc:
df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
same as:
df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 1
pandas
df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
option 2
numpy
v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0
Naive time test: the original answer included timing plots over the given data and over larger data (plots omitted here).
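A rough sketch of how such a timing comparison might be reproduced (my own setup, not the original benchmark), comparing the bfill approach with the numpy approach on a larger random frame:
import timeit

import numpy as np
import pandas as pd

# hypothetical larger test frame with scattered NaNs
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.random((100_000, 3)), columns=list('ABC'))
big = big.mask(rng.random(big.shape) < 0.3)   # ~30% missing values
big['C'] = big['C'].fillna(0)                 # ensure every row has a value

def with_bfill(df):
    # first non-null value per row via a row-wise back-fill
    return df.bfill(axis=1).iloc[:, 0]

def with_numpy(df):
    # position of the first non-NaN entry per row, then fancy indexing
    v = df.values
    j = np.isnan(v).argmin(1)
    return pd.Series(v[np.arange(len(v)), j], index=df.index)

print('bfill:', timeit.timeit(lambda: with_bfill(big), number=10))
print('numpy:', timeit.timeit(lambda: with_numpy(big), number=10))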
There is already a method for Series in Pandas that does this:
df['D'] = df['A'].combine_first(df['C'])
Or just chain them if you want to look up values sequentially:
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
This outputs the following:
>>> df
A B C D
0 1.0 NaN 5 1.0
1 2.0 10.0 10 2.0
2 NaN NaN 7 7.0

Replace NaN values with specific value per column

I have a dataframe containing values as well as some NaNs. Now I have the means of the columns and I want to insert the mean of the particular column into its NaN values. For example:
ColA and ColB have NaNs to be replaced with the mean values I have.
I have the mean for ColA and ColB. I want to insert them into the NaN locations. I could do that individually using the replace method. But for many columns, is there any other way to achieve this?
EDIT:
If you already have a Series with the means, just use DataFrame.fillna:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,np.nan,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,np.nan,1,0],
'E':[np.nan,3,6,np.nan,2,4],
'F':list('aaabbb')
})
means = pd.Series([10,20], index=['B','E'])
df= df.fillna(means)
print (df)
A B C D E F
0 a 4.0 7 1.0 20.0 a
1 b 10.0 8 3.0 3.0 a
2 c 4.0 9 5.0 6.0 a
3 d 5.0 4 NaN 20.0 b
4 e 5.0 2 1.0 2.0 b
5 f 4.0 3 0.0 4.0 b
If you need to replace missing values in all numeric columns, use DataFrame.fillna with the mean; it works because mean excludes non-numeric columns:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,np.nan,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,np.nan,1,0],
'E':[np.nan,3,6,np.nan,2,4],
'F':list('aaabbb')
})
df1 = df.fillna(df.mean())
print (df1)
A B C D E F
0 a 4.0 7 1.0 3.75 a
1 b 4.4 8 3.0 3.00 a
2 c 4.0 9 5.0 6.00 a
3 d 5.0 4 2.0 3.75 b
4 e 5.0 2 1.0 2.00 b
5 f 4.0 3 0.0 4.00 b
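Note that on newer pandas versions (roughly 2.0 and later) calling mean() on a frame that still contains non-numeric columns raises a TypeError instead of silently skipping them, so the same idea may need numeric_only=True (a hedged variant of the line above):
df1 = df.fillna(df.mean(numeric_only=True))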
If you need to compute the means only for specific columns, change the solution to use a list of column names:
cols = ['D','B']
df[cols] = df[cols].fillna(df[cols].mean())
print (df)
A B C D E F
0 a 4.0 7 1.0 NaN a
1 b 4.4 8 3.0 3.0 a
2 c 4.0 9 5.0 6.0 a
3 d 5.0 4 2.0 NaN b
4 e 5.0 2 1.0 2.0 b
5 f 4.0 3 0.0 4.0 b
Try this for those columns which you want to fill:
df['column1'] = df['column1'].fillna((df['column1'].mean()))
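If there are many such columns, one option (my sketch, building on the line above and on fillna's dictionary support) is to compute the per-column means into a dict and pass it to fillna in a single call:
# hypothetical list of columns to fill
cols = ['column1', 'column2']
df = df.fillna({c: df[c].mean() for c in cols})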

Replacing empty values in a DataFrame with value of a column

Say I have the following pandas dataframe:
df = pd.DataFrame([[3, 2, np.nan, 0],
[5, 4, 2, np.nan],
[7, np.nan, np.nan, 5],
[9, 3, np.nan, 4]],
columns=list('ABCD'))
which returns this:
A B C D
0 3 2.0 NaN 0.0
1 5 4.0 2.0 NaN
2 7 NaN NaN 5.0
3 9 3.0 NaN 4.0
I'd like that, if a np.nan is found, the value is replaced by the value in the A column of the same row. So the result would be this:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
I've tried multiple things, but I could not get anything to work. Can anyone help?
Here a double transpose is necessary:
cols = ['B','C', 'D']
df[cols] = df[cols].T.fillna(df['A']).T
print(df)
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
because:
df[cols] = df[cols].fillna(df['A'], axis=1)
print(df)
NotImplementedError: Currently only can fill with dict/Series column by column
Another solution with numpy.where and broadcasting column A:
df = pd.DataFrame(np.where(df.isnull(), df['A'].values[:, None], df),
index=df.index,
columns=df.columns)
print (df)
A B C D
0 3.0 2.0 3.0 0.0
1 5.0 4.0 2.0 5.0
2 7.0 7.0 7.0 5.0
3 9.0 3.0 9.0 4.0
Thank you @pir for another solution:
df = pd.DataFrame(np.where(df.isnull(), df[['A']], df),
index=df.index,
columns=df.columns)
Currently, fillna doesn't allow for broadcasting a series across columns while aligning the indices.
pandas.DataFrame.mask
This functions exactly like what we'd want fillna to do. It finds the nulls and fills them in with df.A along axis=0:
df.mask(df.isna(), df.A, axis=0)
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
pandas.DataFrame.fillna using a dictionary
However, you can pass a dictionary to fillna that tells it what to do for each column.
df.fillna({k: df.A for k in df})
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
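The same dictionary trick also lets you restrict the fill to selected columns (a small variant of the line above, my assumption being that only B, C and D should be touched):
df.fillna({k: df.A for k in ['B', 'C', 'D']})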
Do fillna with reindex:
df.fillna(df[['A']].reindex(columns=df.columns).ffill(axis=1))
Out[20]:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
Or combine_first (filling the NaNs with 0 and then adding column A along the rows turns every originally missing cell into the value of A, which combine_first then picks up only where df is NaN):
df.combine_first(df.fillna(0).add(df.A, axis=0))
Out[35]:
A B C D
0 3 2.0 3.0 0.0
1 5 4.0 2.0 5.0
2 7 7.0 7.0 5.0
3 9 3.0 9.0 4.0
# for each column...
for col in df.columns:
    # select the np.nan entries and replace them with the value of A
    df.loc[df[col].isnull(), col] = df["A"]
