I am using bnp-paribas-cardif-claims-management from Kaggle.
Dataset : https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data
df = pd.read_csv('F:\\Data\\Paribas_Claim\\train.csv', nrows=5000)
df.info() gives
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Columns: 133 entries, ID to v131
dtypes: float64(108), int64(6), object(19)
memory usage: 5.1+ MB
My requirement is:
I am trying to fill the null values in the columns whose dtypes are int and object, based on the target column.
My code is
df_obj = df.select_dtypes(['object','int64']).columns.to_list()
for cols in df_obj:
    df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = df[df['target'] == 1][cols].mode()
    df[( df['target'] == 0 )&( df[cols].isnull() )][cols] = df[df['target'] == 0][cols].mode()
I am able to get output from the print statement below:
df[( df['target'] == 1 )&( df[cols].isnull() )][cols]
I am also able to print the values of df[df['target'] == 0][cols].mode() if I substitute cols.
But unable to replace the null values with mode values.
I tried df.loc and df.at instead of df[...], and df[...] == np.nan instead of df[...].isnull(), but to no avail.
Please advise if I need to make any changes to the code. Thanks.
The problem here is that you selected integer columns, and these cannot contain missing values (NaN is a float), so there is nothing to replace in them. A possible solution is to select all numeric columns and, in the loop, set the first mode value per condition, using DataFrame.loc to avoid chained indexing and Series.iat to return only the first value (mode can sometimes return 2 values):
df = pd.read_csv('train.csv', nrows=5000)
#only numeric columns
df_obj = df.select_dtypes(np.number).columns.to_list()
#all columns
#df_obj = df.columns.to_list()
#print (df_obj)
for cols in df_obj:
    m1 = df['target'] == 1
    m2 = df['target'] == 0
    df.loc[m1 & (df[cols].isnull()), cols] = df.loc[m1, cols].mode().iat[0]
    df.loc[m2 & (df[cols].isnull()), cols] = df.loc[m2, cols].mode().iat[0]
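As a quick check of the "NaN is float" point, here is a minimal sketch (the values are made up):
import numpy as np
import pandas as pd

# a Series built with a missing value is upcast to float64,
# so a plain int64 column cannot hold NaN at all
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # float64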
Another solution replaces the missing values with Series.fillna:
for cols in df_obj:
    m1 = df['target'] == 1
    m2 = df['target'] == 0
    df.loc[m1, cols] = df.loc[m1, cols].fillna(df.loc[m1, cols].mode().iat[0])
    df.loc[m2, cols] = df.loc[m2, cols].fillna(df.loc[m2, cols].mode().iat[0])
print (df.head())
ID target v1 v2 v3 v4 v5 v6 \
0 3 1 1.335739e+00 8.727474 C 3.921026 7.915266 2.599278e+00
1 4 1 -9.543625e-07 1.245405 C 0.586622 9.191265 2.126825e-07
2 5 1 9.438769e-01 5.310079 C 4.410969 5.326159 3.979592e+00
3 6 1 7.974146e-01 8.304757 C 4.225930 11.627438 2.097700e+00
4 8 1 -9.543625e-07 1.245405 C 0.586622 2.151983 2.126825e-07
v7 v8 ... v122 v123 v124 v125 \
0 3.176895e+00 1.294147e-02 ... 8.000000 1.989780 3.575369e-02 AU
1 -9.468765e-07 2.301630e+00 ... 1.499437 0.149135 5.988956e-01 AF
2 3.928571e+00 1.964513e-02 ... 9.333333 2.477596 1.345191e-02 AE
3 1.987549e+00 1.719467e-01 ... 7.018256 1.812795 2.267384e-03 CJ
4 -9.468765e-07 -7.783778e-07 ... 1.499437 0.149135 -9.962319e-07 Z
v126 v127 v128 v129 v130 v131
0 1.804126e+00 3.113719e+00 2.024285 0 0.636365 2.857144e+00
1 5.521558e-07 3.066310e-07 1.957825 0 0.173913 -9.932825e-07
2 1.773709e+00 3.922193e+00 1.120468 2 0.883118 1.176472e+00
3 1.415230e+00 2.954381e+00 1.990847 1 1.677108 1.034483e+00
4 5.521558e-07 3.066310e-07 0.100455 0 0.173913 -9.932825e-07
[5 rows x 133 columns]
You didn't provide sample data, so I'll just give the methods I think you can use to solve your problem.
Try reading your DataFrame with na_filter=False; that way, the fields that would normally become np.nan are read in as blanks ('') instead.
Then, during your loop, use '' as your identifier for null values. It is easier to test for than the type of the value you are parsing.
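A minimal sketch of that idea, assuming the same train.csv as in the question (note the caveat that with na_filter=False, numeric columns containing blanks are read as object dtype, since '' cannot be parsed as a number):
import pandas as pd

# empty fields are kept as '' instead of being converted to NaN
df = pd.read_csv('train.csv', nrows=5000, na_filter=False)

for col in df.select_dtypes('object').columns:
    mask = (df['target'] == 1) & (df[col] == '')
    # first mode of the non-blank values in the target==1 group
    fill = df.loc[(df['target'] == 1) & (df[col] != ''), col].mode().iat[0]
    df.loc[mask, col] = fill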
I think pd.fillna should help.
# random dataset
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 2, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
print(df)
A B C D
0 NaN 2.0 NaN 0
1 3.0 2.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Assuming you want to replace missing values with the mode value of a given column, I'd just use:
df.fillna({'A':df.A.mode()[0],'B':df.B.mode()[0]})
A B C D
0 3.0 2.0 NaN 0
1 3.0 2.0 NaN 1
2 3.0 2.0 NaN 5
3 3.0 3.0 NaN 4
This would also work if you needed the mode of a subset of values from a given column to fill NaNs with.
# let's add 'type' column
   A    B    C   D  type
0  NaN  2.0  NaN  0     1
1  3.0  2.0  NaN  1     1
2  NaN  NaN  NaN  5     2
3  NaN  3.0  NaN  4     2
For example, if you want to fill the NaNs in df['B'] with the mode of only the rows where df['type'] equals 2:
df.fillna({
    'B': df.loc[df.type.eq(2)].B.mode()[0]  # type 2
})
A B C D type
0 NaN 2.0 NaN 0 1
1 3.0 2.0 NaN 1 1
2 NaN 3.0 NaN 5 2
3 NaN 3.0 NaN 4 2
# ↑ this would have been '2.0' hadn't we filtered the column with df.loc[]
Your problem is this
df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = ...
Do NOT chain index, especially when assigning. See the Why does assignment fail when using chained indexing? section in the pandas indexing documentation.
Instead use loc, and take the first mode value so that a scalar is assigned:
df.loc[(df['target'] == 1) & (df[cols].isnull()), cols] = df.loc[df['target'] == 1, cols].mode().iat[0]
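To see why the chained version silently does nothing, here is a small sketch (the toy frame is made up): the first [] returns an intermediate object, possibly a copy, and the assignment lands on that copy rather than on df.
import numpy as np
import pandas as pd

df = pd.DataFrame({'target': [1, 1, 0], 'v1': [np.nan, 'A', 'B']})

# chained indexing: assigns into a temporary copy, df is unchanged
df[df['target'] == 1]['v1'] = 'X'
print(df['v1'].isnull().sum())  # still 1

# .loc: one indexing operation, the assignment sticks
df.loc[(df['target'] == 1) & (df['v1'].isnull()), 'v1'] = 'X'
print(df['v1'].isnull().sum())  # 0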
Related
I would like to sum values of a dataframe conditionally, based on the values of a different dataframe. Say for example I have two dataframes:
df1 = pd.DataFrame(data = [[1,-1,5],[2,1,1],[3,0,0]],index=[0,1,2],columns = [0,1,2])
index 0 1 2
-----------------
0 1 -1 5
1 2 1 1
2 3 0 0
df2 = pd.DataFrame(data = [[1,1,3],[1,1,2],[0,2,1]],index=[0,1,2],columns = [0,1,2])
index 0 1 2
-----------------
0 1 1 3
1 1 1 2
2 0 2 1
Now what I would like is, for example, wherever a value in df1 equals 1, to sum the values at those same positions in df2.
In this example, if the condition is 1, then the sum of df2 would be 4. If the condition was 0, the result would be 3.
Another option is Pandas' query, referencing df1 from the environment with @:
df2.query("@df1==1").sum().sum()
# 4.0
You can use a mask with where:
df2.where(df1.eq(1)).to_numpy().sum()
# or
# df2.where(df1.eq(1)).sum().sum()
output: 4.0
intermediate:
df2.where(df1.eq(1))
0 1 2
0 1.0 NaN NaN
1 NaN 1.0 2.0
2 NaN NaN NaN
Assuming that one wants to store the value in the variable value, there are various options to achieve that. Two of them are below.
Option 1
One can simply do the following
value = df2[df1 == 1].sum().sum()
[Out]: 4.0 # numpy.float64
# or
value = sum(df2[df1 == 1].sum())
[Out]: 4.0 # float
Option 2
Using pandas.DataFrame.where
value = df2.where(df1 == 1, 0).sum().sum()
[Out]: 4 # numpy.int64
# or
value = sum(df2.where(df1 == 1, 0).sum())
[Out]: 4 # int
Notes:
Both df2[df1 == 1] and df2.where(df1 == 1) give the following output; passing 0 as the second argument to where fills those positions with 0 instead of NaN:
0 1 2
0 1.0 NaN NaN
1 NaN 1.0 2.0
2 NaN NaN NaN
Depending on the desired output (float, int, numpy.float64,...) one method might be better than the other.
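If a plain Python scalar is needed regardless of which method is used, one option is an explicit conversion, e.g.:
value = int(df2.where(df1 == 1, 0).to_numpy().sum())  # plain int 4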
I have a similar question to this one.
I have a dataframe with several rows, which looks like this:
Name  TypA  TypB  ...  TypF  TypA_value  TypB_value  ...  TypF_value  Divider
1     1     1     ...  NaN   10          5           ...  NaN         5
2     NaN   2     ...  NaN   NaN         20          ...  NaN         10
and I want to divide all columns ending with "value" by the column "Divider". How can I do so? One trick would be to sort the columns so that I could use the answer from above, but is there a direct way that does not require sorting the dataframe?
The outcome would be:
Name  TypA  TypB  ...  TypF  TypA_value  TypB_value  ...  TypF_value  Divider
1     1     1     ...  NaN   2           1           ...  0           5
2     NaN   2     ...  NaN   0           2           ...  0           10
So a NaN will lead to a 0.
Use DataFrame.filter to select the columns that contain _value, then use DataFrame.div along axis=0 to divide them by the Divider column, and finally use DataFrame.update to write the values back into the dataframe:
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
Result:
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 0.0 5
1 2 NaN 2 NaN 0.0 2.0 0.0 10
You could select the columns of interest using DataFrame.filter, and divide as:
value_cols = df.filter(regex=r'_value$').columns
df[value_cols] /= df['Divider'].to_numpy()[:,None]
# df[value_cols] = df[value_cols].fillna(0)
print(df)
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 NaN 5
1 2 NaN 2 NaN NaN 2.0 NaN 10
Taking two sample columns A and B :
import pandas as pd
import numpy as np
a = {'Name': [1, 2],
     'TypA': [1, np.nan],
     'TypB': [1, 2],
     'TypA_value': [10, np.nan],
     'TypB_value': [5, 20],
     'Divider': [5, 10]}
df = pd.DataFrame(a)
cols_all = df.columns
Find the columns for which the calculations are to be done, assuming they all contain 'value' and an underscore:
cols_to_calc = [c for c in cols_all if '_value' in c]
For these columns, first divide by the Divider column, then replace NaN with 0 in those columns (expected output shown after the loop):
for c in cols_to_calc:
    df[c] = df[c] / df.Divider
    df[c] = df[c].fillna(0)
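With the two-column sample above, this should leave (values checked by hand: 10/5=2.0, NaN→0.0, 5/5=1.0, 20/10=2.0):
print(df)
#    Name  TypA  TypB  TypA_value  TypB_value  Divider
# 0     1   1.0     1         2.0         1.0        5
# 1     2   NaN     2         0.0         2.0       10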
ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS
0 104.0 PUTNAM Y 3.0
1 197.0 LEXINGTON NaN NaN
2 NaN LEXINGTON N 3.0
3 201.0 BERKELEY NaN 1.0
4 203.0 BERKELEY Y NaN
This is my data frame. I want to create a user-defined function that returns a data frame showing the number of missing values per column and the row index of each missing value.
The output df should look like this:
col_name index
st_num 2
st_num 6
st_name 8
Num_bedrooms 2
Num_bedrooms 5
Num_bedrooms 7
Num_bedrooms 8 .......
You can slice the index by the isnull mask of each column to get the indices. It is also possible with stacking and groupby.
def summarize_missing(df):
    # Null counts
    s1 = df.isnull().sum().rename('No. Missing')
    s2 = pd.Series(data=[df.index[m].tolist() for m in [df[col].isnull() for col in df.columns]],
                   index=df.columns,
                   name='Index')
    # Other way, probably overkill
    #s2 = (df.isnull().replace(False, np.NaN).stack().reset_index()
    #      .groupby('level_1')['level_0'].agg(list)
    #      .rename('Index'))
    return pd.concat([s1, s2], axis=1, sort=False)
summarize_missing(df)
# No. Missing Index
#ST_NUM 1 [2]
#ST_NAME 0 NaN
#OWN_OCCUPIED 2 [1, 3]
#NUM_BEDROOMS 2 [1, 4]
Here's another way:
m = df.isna().sum().to_frame().rename(columns={0: 'No. Missing'})
m['index'] = m.index.map(lambda x: ','.join(map(str, df.loc[df[x].isna()].index.values)))
print(m)
No. Missing index
ST_NUM 1 2
ST_NAME 0
OWN_OCCUPIED 2 1,3
NUM_BEDROOMS 2 1,4
I want to replace the missing value in one column of my df with "missing value".
I tried
result['emp_title'].fillna('missing')
or
result['emp_title'] = result['emp_title'].replace({ np.nan:'missing'})
The second one works, since when I count the missing values after this code:
result['emp_title'].isnull().sum()
it gave me 0.
However, the first one did not work as I expected: it gave me the previous count of missing values instead of 0.
Why does the first one not work? Thank you!
You need to fill inplace, or assign:
result['emp_title'].fillna('missing', inplace=True)
or
result['emp_title'] = result['emp_title'].fillna('missing')
MVCE:
In [1697]: df = pd.DataFrame({'Col1' : [1, 2, 3, np.nan, 4, 5, np.nan]})
In [1702]: df.fillna('missing'); df # changes not seen in the original
Out[1702]:
Col1
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 NaN
In [1703]: df.fillna('missing', inplace=True); df
Out[1703]:
Col1
0 1
1 2
2 3
3 missing
4 4
5 5
6 missing
You should be aware that if you are trying to apply fillna to slices, don't use inplace=True; instead, use df.loc/iloc and assign to sub-slices:
In [1707]: df.Col1.iloc[:5].fillna('missing', inplace=True); df # doesn't work
Out[1707]:
Col1
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 NaN
In [1709]: df.Col1.iloc[:5] = df.Col1.iloc[:5].fillna('missing')
In [1710]: df
Out[1710]:
Col1
0 1
1 2
2 3
3 missing
4 4
5 5
6 NaN
Suppose I have a dataframe as follows,
import pandas as pd
columns=['A','B','C','D', 'E', 'F']
index=['1','2','3','4','5','6']
df = pd.DataFrame(columns=columns,index=index)
df['D']['1'] = 1
df['E'] = 1
df['F']['1'] = 1
df['A']['2'] = 1
df['B']['3'] = 1
df['C']['4'] = 1
df['A']['5'] = 1
df['B']['5'] = 1
df['C']['5'] = 1
df['D']['6'] = 1
df['F']['6'] = 1
df
A B C D E F
1 NaN NaN NaN 1 1 1
2 1 NaN NaN NaN 1 NaN
3 NaN 1 NaN NaN 1 NaN
4 NaN NaN 1 NaN 1 NaN
5 1 1 1 NaN 1 NaN
6 NaN NaN NaN 1 1 1
My condition is: I want to remove the columns that have values only when A, B, C together don't have a value, i.e. to find the columns that are mutually exclusive with the A, B, C columns. I am interested in keeping the columns that have values only when A or B or C has values. The output here would be to remove the D and F columns. But my dataframe has 400 columns, and I want a way to check this for A, B, C against the rest of the columns.
One way I can think of is to remove the NA rows from A, B, C:
df = df[np.isfinite(df['A'])]
df = df[np.isfinite(df['B'])]
df = df[np.isfinite(df['C'])]
then get the NA count of all columns and compare it with the total number of rows,
df.isnull().sum()
and remove the columns whose counts match.
Is there a better and more efficient way to do this?
Thanks
Rather than deleting rows, just select the rows where A, B and C are not all NaN at the same time.
mask = df[["A", "B", "C"]].isnull().all(axis=1)
df = df[~mask]
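To get all the way to dropping D and F, here is a sketch under the assumption that a "mutually exclusive" column is one that only has values in the rows where A, B, C are all missing:
# rows where A, B and C are all missing
mask = df[['A', 'B', 'C']].isnull().all(axis=1)

# among the remaining columns, keep those that have any value
# outside the masked rows; drop the rest (here: D and F)
rest = df.columns.difference(['A', 'B', 'C'])
to_drop = [c for c in rest if df.loc[~mask, c].isnull().all()]
df = df.drop(columns=to_drop)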