Combine numerical and boolean indexing - python

I have the following dataframe:
import pandas as pd
df = pd.DataFrame(index=[0, 1, 2], columns=["test1", "test2"])
df.at[1, "test1"] = 3
df.at[2, "test2"] = 5
print(df)
test1 test2
0 NaN NaN
1 3 NaN
2 NaN 5
I tried the following line in order to set all NaN values at indices 1 and 2 to False:
df.loc[[1, 2] & pd.isna(df)] = False
However, this gives me an error.
My expected output would be:
test1 test2
0 NaN NaN
1 3 False
2 False 5

You can do this:
In [917]: df.loc[1:2] = df.loc[1:2].fillna(False)
In [918]: df
Out[918]:
test1 test2
0 NaN NaN
1 3 False
2 False 5

pd.isna(df) is a mask with the same shape as your DataFrame, and you can't combine it with a list of row labels inside a .loc call. In this case you want to selectively update the null values of your DataFrame on specific rows, so we can use .fillna together with update to assign the changes back.
df.update(df.loc[[1, 2]].fillna(False))
print(df)
test1 test2
0 NaN NaN
1 3 False
2 False 5
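If you do want to combine a row selection with the NaN mask directly, one option is to build a single element-wise boolean mask and pass it to DataFrame.mask. This is only a minimal sketch; the names rows, to_fill and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(index=[0, 1, 2], columns=["test1", "test2"])
df.at[1, "test1"] = 3
df.at[2, "test2"] = 5

# Boolean array marking the rows we are allowed to touch, shape (3,)
rows = df.index.isin([1, 2])

# Element-wise mask: cell is NaN AND its row is selected.
# rows[:, None] has shape (3, 1) and broadcasts across the columns.
to_fill = df.isna().to_numpy() & rows[:, None]

# Replace only the masked cells with False, leave everything else alone
result = df.mask(to_fill, False)
print(result)
```

Row 0 is left untouched because its entries in rows are False, so the NaN test alone is not enough to select those cells.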

Let us try with fillna and pass a dict
df = df.T.fillna(dict.fromkeys(df.index[1:],False),axis=0).T
test1 test2
0 NaN NaN
1 3 False
2 False 5

Related

Unable to update Pandas row in For loop

I am using bnp-paribas-cardif-claims-management from Kaggle.
Dataset : https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data
df=pd.read_csv('F:\\Data\\Paribas_Claim\\train.csv',nrows=5000)
df.info() gives
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Columns: 133 entries, ID to v131
dtypes: float64(108), int64(6), object(19)
memory usage: 5.1+ MB
My requirement is :
I am trying to fill null values for columns with datatypes as int and object. I am trying to fill the nulls based on the target column.
My code is
df_obj = df.select_dtypes(['object','int64']).columns.to_list()
for cols in df_obj:
    df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = df[df['target'] == 1][cols].mode()
    df[( df['target'] == 0 )&( df[cols].isnull() )][cols] = df[df['target'] == 0][cols].mode()
I am able to get output in below print statement:
df[( df['target'] == 1 )&( df[cols].isnull() )][cols]
I am also able to print the values of df[df['target'] == 0][cols].mode() if I substitute cols.
But I am unable to replace the null values with the mode values.
I tried df.loc and df.at instead of df[], and df[...] == np.nan instead of df[...].isnull(), but to no avail.
Please assist if I need to do any changes in the code. Thanks.
The problem here is that you select integer columns, which cannot contain missing values (because NaN is a float), so there is nothing to replace. A possible solution is to select all numeric columns and, in a loop, set the first mode value per condition, using DataFrame.loc to avoid chained indexing and Series.iat to return only the first value (mode can sometimes return more than one value):
df=pd.read_csv('train.csv',nrows=5000)
#only numeric columns
df_obj = df.select_dtypes(np.number).columns.to_list()
#all columns
#df_obj = df.columns.to_list()
#print (df_obj)
for cols in df_obj:
    m1 = df['target'] == 1
    m2 = df['target'] == 0
    df.loc[m1 & (df[cols].isnull()), cols] = df.loc[m1, cols].mode().iat[0]
    df.loc[m2 & (df[cols].isnull()), cols] = df.loc[m2, cols].mode().iat[0]
Another solution with replace missing values by Series.fillna:
for cols in df_obj:
    m1 = df['target'] == 1
    m2 = df['target'] == 0
    df.loc[m1, cols] = df.loc[m1, cols].fillna(df.loc[m1, cols].mode().iat[0])
    df.loc[m2, cols] = df.loc[m2, cols].fillna(df.loc[m2, cols].mode().iat[0])
print (df.head())
ID target v1 v2 v3 v4 v5 v6 \
0 3 1 1.335739e+00 8.727474 C 3.921026 7.915266 2.599278e+00
1 4 1 -9.543625e-07 1.245405 C 0.586622 9.191265 2.126825e-07
2 5 1 9.438769e-01 5.310079 C 4.410969 5.326159 3.979592e+00
3 6 1 7.974146e-01 8.304757 C 4.225930 11.627438 2.097700e+00
4 8 1 -9.543625e-07 1.245405 C 0.586622 2.151983 2.126825e-07
v7 v8 ... v122 v123 v124 v125 \
0 3.176895e+00 1.294147e-02 ... 8.000000 1.989780 3.575369e-02 AU
1 -9.468765e-07 2.301630e+00 ... 1.499437 0.149135 5.988956e-01 AF
2 3.928571e+00 1.964513e-02 ... 9.333333 2.477596 1.345191e-02 AE
3 1.987549e+00 1.719467e-01 ... 7.018256 1.812795 2.267384e-03 CJ
4 -9.468765e-07 -7.783778e-07 ... 1.499437 0.149135 -9.962319e-07 Z
v126 v127 v128 v129 v130 v131
0 1.804126e+00 3.113719e+00 2.024285 0 0.636365 2.857144e+00
1 5.521558e-07 3.066310e-07 1.957825 0 0.173913 -9.932825e-07
2 1.773709e+00 3.922193e+00 1.120468 2 0.883118 1.176472e+00
3 1.415230e+00 2.954381e+00 1.990847 1 1.677108 1.034483e+00
4 5.521558e-07 3.066310e-07 0.100455 0 0.173913 -9.932825e-07
[5 rows x 133 columns]
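The same per-target fill can also be written with groupby and transform instead of one mask per target value. This is a sketch on a small made-up frame, not the Kaggle data; the helper fill_with_mode and the columns v1 and v2 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real data set
df = pd.DataFrame({
    "target": [1, 1, 1, 0, 0, 0],
    "v1": [2.0, 2.0, np.nan, 7.0, np.nan, 7.0],
    "v2": ["a", np.nan, "a", "b", "b", np.nan],
})

def fill_with_mode(s):
    """Fill NaNs in one group with that group's first mode, if it has one."""
    modes = s.mode()          # mode() ignores NaN and may return several values
    if modes.empty:           # group is entirely NaN: nothing to fill with
        return s
    return s.fillna(modes.iat[0])

for col in ["v1", "v2"]:
    # transform returns a Series aligned with df.index, so plain assignment is safe
    df[col] = df.groupby("target")[col].transform(fill_with_mode)

print(df)
```

The explicit check for an empty mode avoids an IndexError from iat[0] when a group has no non-null values in a column.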
You didn't provide sample data, so I'll just describe the approach I think you can use to solve your problem.
Try reading your DataFrame with na_filter=False; that way missing values are read as empty strings instead of np.nan.
Then, in your loop, use '' as the marker for null values. That is easier to test for than checking the type of each value you parse.
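To see what na_filter=False changes, here is a small sketch that reads the same CSV text twice. The CSV content and the column names id and city are invented for the example.

```python
import io
import pandas as pd

csv_text = "id,city\n1,Paris\n2,\n3,Rome\n"

# Default behaviour: the empty field becomes NaN
with_nan = pd.read_csv(io.StringIO(csv_text))

# na_filter=False: no NA detection, the empty field stays an empty string
with_blank = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(with_nan["city"].isna().sum())     # NaN count with the default reader
print(with_blank["city"].eq("").sum())   # empty-string count with na_filter=False
```

Keep in mind that a numeric column containing empty strings can no longer be parsed as numbers, so this approach suits text columns best.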
I think DataFrame.fillna should help.
# random dataset
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 2, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
print(df)
A B C D
0 NaN 2.0 NaN 0
1 3.0 2.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Assuming you want to replace missing values with the mode value of a given column, I'd just use:
df.fillna({'A':df.A.mode()[0],'B':df.B.mode()[0]})
A B C D
0 3.0 2.0 NaN 0
1 3.0 2.0 NaN 1
2 3.0 2.0 NaN 5
3 3.0 3.0 NaN 4
This would also work if you needed a mode value from a subset of values from given column to fill NaNs with.
# let's add 'type' column
A B C D type
0 NaN 2.0 NaN 0 1
1 3.0 2.0 NaN 1 1
2 NaN NaN NaN 5 2
3 NaN 3.0 NaN 4 2
For example, if you want to fill the NaNs in df['B'] with the mode of B taken over only the rows where df['type'] equals 2:
df.fillna({
'B': df.loc[df.type.eq(2)].B.mode()[0] # type 2
})
A B C D type
0 NaN 2.0 NaN 0 1
1 3.0 2.0 NaN 1 1
2 NaN 3.0 NaN 5 2
3 NaN 3.0 NaN 4 2
# ↑ this would have been '2.0' hadn't we filtered the column with df.loc[]
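Your problem is this:
df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = ...
Do NOT chain index, especially when assigning. See the "Why does assignment fail when using chained indexing?" section of the pandas indexing documentation.
Instead use loc, and take the first mode value, because mode() returns a Series whose index does not line up with the rows you are assigning to:
df.loc[(df['target'] == 1) & (df[cols].isnull()),
       cols] = df.loc[df['target'] == 1,
                      cols].mode().iat[0]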
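A small sketch of why the chained form silently does nothing while the loc form works. The frame and the column name v are made up; the warning filter is only there to keep the output quiet, since pandas normally warns about the chained assignment.

```python
import warnings
import numpy as np
import pandas as pd

df = pd.DataFrame({"target": [1, 1, 0], "v": [np.nan, 4.0, np.nan]})
mask = (df["target"] == 1) & df["v"].isnull()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # df[mask] is a temporary copy; the assignment is lost with it
    df[mask]["v"] = 4.0
nulls_after_chained = int(df["v"].isnull().sum())

# A single .loc call selects rows and column at once and writes into df
df.loc[mask, "v"] = 4.0
nulls_after_loc = int(df["v"].isnull().sum())

print(nulls_after_chained, nulls_after_loc)
```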

Empty DataFrame doesn't admit its empty

I must be misunderstanding something about emptiness when it comes to pandas DataFrames. I have a DataFrame with empty rows, but when I isolate one of these rows it's not empty.
Here I've made a dataframe:
>>> df = pandas.DataFrame(columns=[1,2,3], data=[[1,2,3],[1,None,3],[None, None, None],[3,2,1],[4,5,6],[None,None,None],[None,None,None]])
>>> df
1 2 3
0 1.0 2.0 3.0
1 1.0 NaN 3.0
2 NaN NaN NaN
3 3.0 2.0 1.0
4 4.0 5.0 6.0
5 NaN NaN NaN
6 NaN NaN NaN
Then I know row '2' is full of nothing so I check for that...
>>> df[2:3].empty
False
Odd. So I split it out into its own dataframe:
>>> df1 = df[2:3]
>>> df1
1 2 3
2 NaN NaN NaN
>>> df1.empty
False
How do I check for emptiness (all the elements in a row being None or NaN)?
http://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.DataFrame.empty.html
You're misunderstanding what empty is for. It checks whether a Series/DataFrame has size 0, meaning it has no rows (or no columns); it says nothing about NaN values. For example,
df.iloc[1:0]
Empty DataFrame
Columns: [1, 2, 3]
Index: []
df.iloc[1:0].empty
True
If you want to check that a row has all NaNs, use isnull + all:
df.isnull().all(1)
0 False
1 False
2 True
3 False
4 False
5 True
6 True
dtype: bool
For your example, this should do:
df[2:3].isnull().all(1).item()
True
Note that you can't use item if your slice is more than one row in size.
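Building on the isnull + all idea, here is a short sketch that lists the labels of all-NaN rows and filters them out. The names all_nan_rows, empty_labels and non_empty are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    columns=[1, 2, 3],
    data=[[1, 2, 3], [1, None, 3], [None, None, None], [3, 2, 1],
          [4, 5, 6], [None, None, None], [None, None, None]],
)

# One boolean per row: True when every column in that row is NaN
all_nan_rows = df.isnull().all(axis=1)

# Index labels of the "empty" rows
empty_labels = df.index[all_nan_rows].tolist()

# Keep only rows that hold at least one real value
non_empty = df[~all_nan_rows]

print(empty_labels)
print(df[2:3].empty)   # still False: the slice has one row, even if it is all NaN
```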
I guess you are looking for something like this:
In [296]: df[5:]
Out[296]:
1 2 3
5 NaN NaN NaN
6 NaN NaN NaN
In [297]: df[5:].isnull().all(1).all()
Out[297]: True
or even better (as proposed by @IanS):
In [300]: df[5:].isnull().all().all()
Out[300]: True
You can drop all null values from your selection and check if the result is empty:
>>> df[5:].dropna(how='all').empty
True
If you do not want to count NaN values as real data, this is equivalent to
df.dropna().iloc[5:]
You are then selecting rows that no longer exist in your DataFrame:
df.dropna().iloc[5:].empty
Out[921]: True
If you have a DataFrame and want to drop all rows in which every column is NaN, you can do this:
df.dropna(how='all')
Note that your DataFrame also has a row with NaN in only one of the columns. If you need to drop the entire row in that case too:
df.dropna(how='any')
After you do this (whichever you prefer) and assign the result back to df, you can check the length of the DataFrame (the number of rows it contains) using:
len(df)
I guess you have to use isnull() instead of empty (which is a property, not a method).
import pandas
df = pandas.DataFrame(columns=[1,2,3], data=[[1,2,3],[1,None,3],[None, None, None],[3,2,1],[4,5,6],[None,None,None],[None,None,None]])
df[2:3].isnull()
1 2 3
2 True True True

Replacing NaNs in a dataframe with a string value

I want to replace the missing values in one column of my df with the string "missing".
I tried
result['emp_title'].fillna('missing')
or
result['emp_title'] = result['emp_title'].replace({ np.nan:'missing'})
The second one works: when I count the missing values after running it with
result['emp_title'].isnull().sum()
it gives me 0.
However, the first one does not work as I expected: it does not give me 0, but the same count of missing values as before.
Why does the first one not work? Thank you!
You need to fill inplace, or assign:
result['emp_title'].fillna('missing', inplace=True)
or
result['emp_title'] = result['emp_title'].fillna('missing')
MCVE:
In [1697]: df = pd.DataFrame({'Col1' : [1, 2, 3, np.nan, 4, 5, np.nan]})
In [1702]: df.fillna('missing'); df # changes not seen in the original
Out[1702]:
Col1
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 NaN
In [1703]: df.fillna('missing', inplace=True); df
Out[1703]:
Col1
0 1
1 2
2 3
3 missing
4 4
5 5
6 missing
You should be aware that if you are trying to apply fillna to slices, don't use inplace=True, instead, use df.loc/iloc and assign to sub-slices:
In [1707]: df.Col1.iloc[:5].fillna('missing', inplace=True); df # doesn't work
Out[1707]:
Col1
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 NaN
In [1709]: df.Col1.iloc[:5] = df.Col1.iloc[:5].fillna('missing')
In [1710]: df
Out[1710]:
Col1
0 1
1 2
2 3
3 missing
4 4
5 5
6 NaN
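The same points in a compact sketch that avoids both inplace=True and chained indexing: fillna returns a new object, and a partial fill is written back through a single .loc call. The sample titles are invented, and the frame named partial is only for illustration.

```python
import numpy as np
import pandas as pd

result = pd.DataFrame({"emp_title": ["clerk", np.nan, "nurse", np.nan]})

# fillna returns a new Series; without assignment the original is unchanged
result["emp_title"].fillna("missing")
still_missing = int(result["emp_title"].isnull().sum())

# Assigning the returned Series back makes the change stick
result["emp_title"] = result["emp_title"].fillna("missing")
missing_after_assign = int(result["emp_title"].isnull().sum())

# Filling only part of a column: select rows and column in ONE .loc call
partial = pd.DataFrame({"emp_title": ["clerk", np.nan, "nurse", np.nan]})
partial.loc[:1, "emp_title"] = partial.loc[:1, "emp_title"].fillna("missing")

print(still_missing, missing_after_assign)
print(partial)
```

Note that .loc[:1] is label-based and includes the end label, so it covers rows 0 and 1 here.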

Pandas slicing and indexing use with fillna

I have a pandas dataframe tdf
I am extracting a slice based on boolean labels
idx = tdf['MYcol1'] == 1
myslice = tdf.loc[idx]  # I want myslice to be a view, not a copy
Now I want to fill the missing values in a column of myslice, and I want this to be reflected in tdf, my original dataframe:
myslice.loc[:,'MYcol2'].fillna(myslice['MYcol2'].mean(), inplace = True)  # 1
myslice.ix[:,'MYcol2'].fillna(myslice['MYcol2'].mean(), inplace = True)  # 2
Both 1 and 2 above throw the warning: A value is trying to be set on a copy of a slice from a DataFrame
What am I doing wrong?
Selecting with a boolean mask via tdf.loc[idx] returns a copy, so myslice is not a view of tdf. Whatever you do to myslice after that does not reach tdf. Consider this:
tdf
Out:
A B C
0 NaN 0.195070 -1.781563
1 -0.729045 0.196557 0.354758
2 0.616887 0.008628 NaN
3 NaN NaN 0.037006
4 0.767902 NaN NaN
5 -0.805627 NaN NaN
6 1.133080 NaN -0.659892
7 -1.139802 0.784958 -0.554310
8 -0.470638 -0.216950 NaN
9 -0.392389 -3.046143 0.543312
idx = tdf['A'] > 0
myslice = tdf.loc[idx]
Fill NaN's in myslice:
myslice.loc[:,'B'].fillna(myslice['B'].mean(), inplace = True)
C:\Anaconda3\envs\p3\lib\site-packages\pandas\core\generic.py:3191: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
myslice
Out:
A B C
2 0.616887 0.008628 NaN
4 0.767902 0.008628 NaN
6 1.133080 0.008628 -0.659892
tdf
Out:
A B C
0 NaN 0.195070 -1.781563
1 -0.729045 0.196557 0.354758
2 0.616887 0.008628 NaN
3 NaN NaN 0.037006
4 0.767902 NaN NaN
5 -0.805627 NaN NaN
6 1.133080 NaN -0.659892
7 -1.139802 0.784958 -0.554310
8 -0.470638 -0.216950 NaN
9 -0.392389 -3.046143 0.543312
It is not reflected in tdf, because:
myslice.is_copy
Out: <weakref at 0x000001CC842FD318; to 'DataFrame' at 0x000001CC8422D6A0>
If you change it to:
tdf.loc[:, 'B'].fillna(tdf.loc[idx, 'B'].mean(), inplace=True)
tdf
Out:
A B C
0 NaN 0.195070 -1.781563
1 -0.729045 0.196557 0.354758
2 0.616887 0.008628 NaN
3 NaN 0.008628 0.037006
4 0.767902 0.008628 NaN
5 -0.805627 0.008628 NaN
6 1.133080 0.008628 -0.659892
7 -1.139802 0.784958 -0.554310
8 -0.470638 -0.216950 NaN
9 -0.392389 -3.046143 0.543312
then it works. In the last part you can also use myslice['B'].mean() because you are not updating those values. But the left side should be the original DataFrame.
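If the goal is to fill only the rows selected by idx (rather than every NaN in the column) and to avoid inplace=True altogether, one possible form is a plain assignment through .loc on the original frame. This is a sketch on made-up data using the column names from the question.

```python
import numpy as np
import pandas as pd

tdf = pd.DataFrame({
    "MYcol1": [1, 1, 0, 1],
    "MYcol2": [2.0, np.nan, np.nan, 4.0],
})

idx = tdf["MYcol1"] == 1

# Mean of MYcol2 over the selected rows only (NaN is skipped by default)
fill_value = tdf.loc[idx, "MYcol2"].mean()

# The left-hand side addresses tdf itself, so the result lands in tdf
tdf.loc[idx, "MYcol2"] = tdf.loc[idx, "MYcol2"].fillna(fill_value)

print(tdf)
```

The row where MYcol1 is 0 keeps its NaN, because it is outside idx.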

Merge unaligned DataFrames while filling with empty string

I have multiple DataFrames that I want to merge, and I would like the fill value for missing rows to be an empty string rather than NaN. Some of the DataFrames already have NaN values in them, and those should be kept. concat sort of does what I want, but it fills the empty positions with NaN. How does one avoid filling them with NaN, or specify a fill_value, to achieve something like this:
>>> df1
Value1
0 1
1 NaN
2 3
>>> df2
Value2
1 5
2 NaN
3 7
>>> merge_multiple_without_nan([df1,df2])
Value1 Value2
0 1
1 NaN 5
2 3 NaN
3 7
This is what concat does:
>>> concat([df1,df2], axis=1)
Value1 Value2
0 1 NaN
1 NaN 5
2 3 NaN
3 NaN 7
Well, I couldn't find any option in concat or merge that would handle this by itself, but the code below works without much hassle:
df1 = pd.DataFrame({'Value2': [1,np.nan,3]}, index = [0,1, 2])
df2 = pd.DataFrame({'Value2': [5,np.nan,7]}, index = [1, 2, 3])
# Replace the existing NaN values with temporary placeholders ('X', 'Y') before concatenating.
df = pd.concat([df1.fillna('X'), df2.fillna('Y')], axis=1)
df=
Value2 Value2
0 1 NaN
1 X 5
2 3 Y
3 NaN 7
Step 2:
df.fillna('', inplace=True)
df=
Value2 Value2
0 1
1 X 5
2 3 Y
3 7
Step 3:
df.replace(to_replace=['X','Y'], value=np.nan, inplace=True)
df=
Value2 Value2
0 1
1 NaN 5
2 3 NaN
3 7
After using concat, you can iterate over the DataFrames you merged, find the indices that are missing, and fill them in with an empty string. This should work for concatenating an arbitrary number of DataFrames, as long as your column names are unique.
# Concatenate all of the DataFrames.
merge_dfs = [df1, df2]
full_df = pd.concat(merge_dfs, axis=1)
# Find missing indices for each merged frame, fill with an empty string.
for partial_df in merge_dfs:
    missing_idx = full_df.index.difference(partial_df.index)
    full_df.loc[missing_idx, partial_df.columns] = ''
The resulting output using your sample data:
Value1 Value2
0 1
1 NaN 5
2 3 NaN
3 7
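The loop above can be wrapped into a helper along the lines of the merge_multiple_without_nan name used in the question. This is a sketch, not a pandas function; the fill_value parameter and the cast to object dtype (so that strings can be stored next to numbers without dtype complaints) are my own assumptions.

```python
import numpy as np
import pandas as pd

def merge_multiple_without_nan(frames, fill_value=""):
    """Concatenate frames column-wise; rows a frame lacks get fill_value.

    NaNs that already exist inside a frame are kept as NaN.
    Assumes column names are unique across the frames.
    """
    # object dtype lets numbers and the string fill value share a column
    full = pd.concat(frames, axis=1).astype(object)
    for part in frames:
        # Index labels present in the result but absent from this frame
        missing = full.index.difference(part.index)
        full.loc[missing, part.columns] = fill_value
    return full

df1 = pd.DataFrame({"Value1": [1, np.nan, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"Value2": [5, np.nan, 7]}, index=[1, 2, 3])

merged = merge_multiple_without_nan([df1, df2])
print(merged)
```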
