Replacing NaNs in a dataframe with a string value - python

I want to replace the missing value in one column of my df with "missing value".
I tried
result['emp_title'].fillna('missing')
or
result['emp_title'] = result['emp_title'].replace({ np.nan:'missing'})
The second one works: when I count the missing values after running it with
result['emp_title'].isnull().sum()
it gives me 0.
However, the first one does not work as I expected: it does not give me 0, but the same count of missing values as before.
Why doesn't the first one work? Thank you!

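fillna returns a new Series instead of modifying the original, so you need to fill in place or assign the result back:
result['emp_title'].fillna('missing', inplace=True)
or
result['emp_title'] = result['emp_title'].fillna('missing')
The assignment form is the safer of the two: pandas 2.2 warns that an inplace call on a selected column is chained assignment, and with Copy-on-Write (the default in pandas 3.0) it no longer updates the parent frame.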
MCVE:
In [1697]: df = pd.DataFrame({'Col1' : [1, 2, 3, np.nan, 4, 5, np.nan]})
In [1702]: df.fillna('missing'); df # changes not seen in the original
Out[1702]:
Col1
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 NaN
In [1703]: df.fillna('missing', inplace=True); df
Out[1703]:
Col1
0 1
1 2
2 3
3 missing
4 4
5 5
6 missing
Be aware that inplace=True does not work when you apply fillna to a slice. Instead, use df.loc/iloc and assign to the sub-slice:
In [1707]: df.Col1.iloc[:5].fillna('missing', inplace=True); df # doesn't work
Out[1707]:
Col1
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 NaN
In [1709]: df.Col1.iloc[:5] = df.Col1.iloc[:5].fillna('missing')
In [1710]: df
Out[1710]:
Col1
0 1
1 2
2 3
3 missing
4 4
5 5
6 NaN
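A label-based version of the partial fill might look like the sketch below. It assumes a string column like the asker's emp_title (the data is made up for illustration) and passes rows and column to a single .loc call, so no chained indexing is involved:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the asker's column
df = pd.DataFrame({"emp_title": ["a", np.nan, "b", np.nan, "c", np.nan]})

# Labels of the first four rows; rows and column go into one .loc call
rows = df.index[:4]
df.loc[rows, "emp_title"] = df.loc[rows, "emp_title"].fillna("missing")

# Only the NaN outside the selected rows is left
remaining = int(df["emp_title"].isna().sum())
print(df)
print(remaining)
```

Filling a numeric column with a string changes its dtype, which newer pandas versions may warn about or reject, so the sketch sticks to a text column.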

Related

How to pass the value of previous row to the dataframe apply function?

I have the following pandas dataframe and would like to build a new column 'c' which is the summation of column 'b' value and column 'a' previous values. With shifting column 'a' it is possible to do so. However, I would like to know how I can pass the previous values of column 'a' in the apply() function.
l1 = [1,2,3,4,5]
l2 = [3,2,5,4,6]
df = pd.DataFrame(data=l1, columns=['a'])
df['b'] = l2
df['shifted'] = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['shifted']+ row['b'], axis=1)
print(df)
a b shifted c
0 1 3 NaN NaN
1 2 2 1.0 3.0
2 3 5 2.0 7.0
3 4 4 3.0 7.0
4 5 6 4.0 10.0
I appreciate your help.
Edit: this is a dummy example. I need to use the apply function because I'm passing another function to it which uses previous rows of some columns and checks some condition.
First let's make it clear that you do not need apply for this simple operation, so I'll consider it as a dummy example of a complex function.
Assuming non-duplicate indices, you can generate a shifted Series and reference it in apply using the name attribute:
s = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['b'] + s[row.name], axis=1)
output:
a b shifted c
0 1 3 NaN NaN
1 2 2 1.0 3.0
2 3 5 2.0 7.0
3 4 4 3.0 7.0
4 5 6 4.0 10.0
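For the simple sum in the dummy example, the version without apply mentioned at the start of this answer could be sketched like this, using the data from the question:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [3, 2, 5, 4, 6]})

# shift(1) moves column 'a' down one row, so each row sees the previous 'a'
df["c"] = df["b"] + df["a"].shift(1)
print(df)
```

The first row has no previous value, so its result is NaN, as in the apply version.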

Divide several columns with the same column name ending by one other column in python

I have a similar question to this one.
I have a dataframe with several rows, which looks like this:
Name TypA TypB ... TypF TypA_value TypB_value ... TypF_value Divider
1 1 1 NaN 10 5 NaN 5
2 NaN 2 NaN NaN 20 NaN 10
and I want to divide all columns ending in "value" by the column "Divider". How can I do so? One trick would be to sort the columns and then use the answer from the question above, but is there a direct way that does not require sorting the dataframe?
The outcome would be:
Name TypA TypB ... TypF TypA_value TypB_value ... TypF_value Divider
1 1 1 NaN 2 1 0 5
2 NaN 2 NaN 0 2 0 10
So a NaN will lead to a 0.
Use DataFrame.filter to select the columns whose names contain _value, then use DataFrame.div along axis=0 to divide them by the column Divider, and finally use DataFrame.update to write the values back into the dataframe:
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
Result:
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 0.0 5
1 2 NaN 2 NaN 0.0 2.0 0.0 10
You could select the columns of interest using DataFrame.filter, and divide as:
value_cols = df.filter(regex=r'_value$').columns
df[value_cols] /= df['Divider'].to_numpy()[:,None]
# df[value_cols] = df[value_cols].fillna(0)
print(df)
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 NaN 5
1 2 NaN 2 NaN NaN 2.0 NaN 10
Taking two sample columns A and B :
import pandas as pd
import numpy as np
a={ 'Name':[1,2],
'TypA':[1,np.nan],
'TypB':[1,2],
'TypA_value':[10,np.nan],
'TypB_value':[5,20],
'Divider':[5,10]
}
df = pd.DataFrame(a)
cols_all = df.columns
Find the columns for which the calculation is to be done, assuming they all contain '_value':
cols_to_calc = [c for c in cols_all if '_value' in c]
For these columns, first divide by the Divider column, then replace NaN with 0:
for c in cols_to_calc:
    df[c] = df[c] / df.Divider
    df[c] = df[c].fillna(0)
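The same division can also be written without a loop. The following is a minimal sketch on the two-column sample above; it assigns the result back to the selected columns rather than calling update:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": [1, 2],
    "TypA": [1, np.nan],
    "TypB": [1, 2],
    "TypA_value": [10, np.nan],
    "TypB_value": [5, 20],
    "Divider": [5, 10],
})

# Columns whose names end in "_value"
value_cols = df.filter(regex=r"_value$").columns

# axis=0 aligns Divider with the rows, so each row is divided by its own divider
df[value_cols] = df[value_cols].div(df["Divider"], axis=0).fillna(0)
print(df)
```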

Get nth row of groups and fill with 'None' if row is missing

I have a df:
a b c
1 2 3 6
2 2 5 7
3 4 6 8
I want every nth row of groupby a:
w=df.groupby('a').nth(0) #first row
x=df.groupby('a').nth(1) #second row
The second group of the df has no second row, in this case I want to have 'None' values.
[In:] df.groupby('a').nth(1)
[Out:]
a b c
1 2 5 7
2 None None None
Or maybe simpler:
The df has 1-4 rows within each group. If a group has fewer than 4 rows, I want to extend it to 4 rows and fill the missing rows with 'None'. If I then pick the nth row of each group, I get the desired output.
If you are only interested in a specific nth row and some groups do not have enough rows, you can reindex with the unique values of column a:
print (df.groupby('a').nth(1).reindex(df['a'].unique()).reset_index())
a b c
0 2 5.0 7.0
1 4 NaN NaN
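As far as I know, from pandas 2.0 on nth keeps the original row index instead of using the group keys, so the reindex above may not line up on newer versions. A sketch that does not depend on that behavior picks the row by its position within the group via cumcount (the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 2, 4], "b": [3, 5, 6], "c": [6, 7, 8]},
                  index=[1, 2, 3])

n = 1  # zero-based position within each group, i.e. the second row
pos = df.groupby("a").cumcount()

# Keep the rows at position n, then add back the groups that had no such row
groups = pd.Index(df["a"].unique(), name="a")
nth_rows = df[pos == n].set_index("a").reindex(groups).reset_index()
print(nth_rows)
```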
One way is to assign a count/rank column and reindex/stack:
n=2
(df.assign(rank=df.groupby('a').cumcount())
   .query('rank < @n')
   .set_index(['a','rank'])
   .unstack('rank')
   .stack('rank', dropna=False)
   .reset_index()
   .drop('rank', axis=1)
)
Output:
a b c
0 2 3.0 6.0
1 2 5.0 7.0
2 4 6.0 8.0
3 4 NaN NaN
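The padding step can also be sketched without unstack/stack, by reindexing against every (group, rank) combination. This is an alternative under the same assumptions as above, not the answer's own code:

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 2, 4], "b": [3, 5, 6], "c": [6, 7, 8]},
                  index=[1, 2, 3])
n = 2  # number of rows every group should end up with

# Index each row by its group and its position within the group
ranked = df.assign(rank=df.groupby("a").cumcount()).set_index(["a", "rank"])

# Every (group, position) pair that should exist after padding
full = pd.MultiIndex.from_product([df["a"].unique(), range(n)],
                                  names=["a", "rank"])

# Missing pairs become NaN rows; positions >= n are dropped
padded = ranked.reindex(full).reset_index().drop(columns="rank")
print(padded)
```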

Unable to update Pandas row in For loop

I am using bnp-paribas-cardif-claims-management from Kaggle.
Dataset : https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data
df=pd.read_csv('F:\\Data\\Paribas_Claim\\train.csv',nrows=5000)
df.info() gives
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Columns: 133 entries, ID to v131
dtypes: float64(108), int64(6), object(19)
memory usage: 5.1+ MB
My requirement is :
I am trying to fill null values for columns with datatypes as int and object. I am trying to fill the nulls based on the target column.
My code is
df_obj = df.select_dtypes(['object','int64']).columns.to_list()
for cols in df_obj:
    df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = df[df['target'] == 1][cols].mode()
    df[( df['target'] == 0 )&( df[cols].isnull() )][cols] = df[df['target'] == 0][cols].mode()
I am able to get output from the print statement below:
df[( df['target'] == 1 )&( df[cols].isnull() )][cols]
I am also able to print the values of df[df['target'] == 0][cols].mode() if I substitute cols.
But I am unable to replace the null values with the mode values.
I tried df.loc and df.at instead of df[], and df[...] == np.nan instead of df[...].isnull(), but to no avail.
Please assist if I need to do any changes in the code. Thanks.
The problem here is that you select integer columns, which contain no missing values (because NaN is a float), so there is nothing to replace. A possible solution is to select all numeric columns and, in a loop, set the first value of the mode per condition, using DataFrame.loc to avoid chained indexing and Series.iat to return only the first value (mode can sometimes return more than one value):
df=pd.read_csv('train.csv',nrows=5000)
#only numeric columns
df_obj = df.select_dtypes(np.number).columns.to_list()
#all columns
#df_obj = df.columns.to_list()
#print (df_obj)
for cols in df_obj:
    m1 = df['target'] == 1
    m2 = df['target'] == 0
    df.loc[m1 & (df[cols].isnull()), cols] = df.loc[m1, cols].mode().iat[0]
    df.loc[m2 & (df[cols].isnull()), cols] = df.loc[m2, cols].mode().iat[0]
Another solution replaces the missing values with Series.fillna:
for cols in df_obj:
    m1 = df['target'] == 1
    m2 = df['target'] == 0
    df.loc[m1, cols] = df.loc[m1, cols].fillna(df.loc[m1, cols].mode().iat[0])
    df.loc[m2, cols] = df.loc[m2, cols].fillna(df.loc[m2, cols].mode().iat[0])
print (df.head())
ID target v1 v2 v3 v4 v5 v6 \
0 3 1 1.335739e+00 8.727474 C 3.921026 7.915266 2.599278e+00
1 4 1 -9.543625e-07 1.245405 C 0.586622 9.191265 2.126825e-07
2 5 1 9.438769e-01 5.310079 C 4.410969 5.326159 3.979592e+00
3 6 1 7.974146e-01 8.304757 C 4.225930 11.627438 2.097700e+00
4 8 1 -9.543625e-07 1.245405 C 0.586622 2.151983 2.126825e-07
v7 v8 ... v122 v123 v124 v125 \
0 3.176895e+00 1.294147e-02 ... 8.000000 1.989780 3.575369e-02 AU
1 -9.468765e-07 2.301630e+00 ... 1.499437 0.149135 5.988956e-01 AF
2 3.928571e+00 1.964513e-02 ... 9.333333 2.477596 1.345191e-02 AE
3 1.987549e+00 1.719467e-01 ... 7.018256 1.812795 2.267384e-03 CJ
4 -9.468765e-07 -7.783778e-07 ... 1.499437 0.149135 -9.962319e-07 Z
v126 v127 v128 v129 v130 v131
0 1.804126e+00 3.113719e+00 2.024285 0 0.636365 2.857144e+00
1 5.521558e-07 3.066310e-07 1.957825 0 0.173913 -9.932825e-07
2 1.773709e+00 3.922193e+00 1.120468 2 0.883118 1.176472e+00
3 1.415230e+00 2.954381e+00 1.990847 1 1.677108 1.034483e+00
4 5.521558e-07 3.066310e-07 0.100455 0 0.173913 -9.932825e-07
[5 rows x 133 columns]
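The per-target fill can also be sketched with groupby and transform, which handles both target values in one pass. The data and the helper name fill_with_mode are made up for illustration, and the sketch assumes every group has at least one non-missing value (otherwise mode() is empty and iat[0] fails):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle data
df = pd.DataFrame({
    "target": [1, 1, 1, 0, 0, 0],
    "v1": ["A", "A", np.nan, "B", np.nan, "B"],
})

def fill_with_mode(s):
    # mode() may return several values; take the first one
    return s.fillna(s.mode().iat[0])

# Each target group is filled with its own mode
df["v1"] = df.groupby("target")["v1"].transform(fill_with_mode)
print(df)
```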
You did not provide sample data, so I'll just describe the methods I think you can use to solve your problem.
Try reading your DataFrame with na_filter=False; that way, values that would otherwise be parsed as NaN are kept as empty strings instead.
Then, in your loop, use '' as the identifier for null values. That is easier to test for than the type of the value you are parsing.
I think DataFrame.fillna should help.
# random dataset
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[3, 2, np.nan, 1],
[np.nan, np.nan, np.nan, 5],
[np.nan, 3, np.nan, 4]],
columns=list('ABCD'))
print(df)
A B C D
0 NaN 2.0 NaN 0
1 3.0 2.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Assuming you want to replace missing values with the mode value of a given column, I'd just use:
df.fillna({'A':df.A.mode()[0],'B':df.B.mode()[0]})
A B C D
0 3.0 2.0 NaN 0
1 3.0 2.0 NaN 1
2 3.0 2.0 NaN 5
3 3.0 3.0 NaN 4
This would also work if you needed a mode value from a subset of values from given column to fill NaNs with.
# let's add 'type' column
A B C D type
0 NaN 2.0 NaN 0 1
1 3.0 2.0 NaN 1 1
2 NaN NaN NaN 5 2
3 NaN 3.0 NaN 4 2
For example, if you want to fill the NaNs in df['B'] with the mode of B over the rows where df['type'] equals 2:
df.fillna({
'B': df.loc[df.type.eq(2)].B.mode()[0] # type 2
})
A B C D type
0 NaN 2.0 NaN 0 1
1 3.0 2.0 NaN 1 1
2 NaN 3.0 NaN 5 2
3 NaN 3.0 NaN 4 2
# ↑ this would have been '2.0' hadn't we filtered the column with df.loc[]
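Put together as one runnable piece, the subset-mode fill might look like this sketch. It rebuilds the random dataset above, adds the type column with the values shown in the table, and uses the illustrative names mode_type2 and mode_all:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 2, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list("ABCD"))
df["type"] = [1, 1, 2, 2]

# Mode of B restricted to the rows with type 2, and over the whole column
mode_type2 = df.loc[df["type"].eq(2), "B"].mode()[0]
mode_all = df["B"].mode()[0]

# fillna with a dict only touches the listed columns and returns a new frame
filled = df.fillna({"B": mode_type2})
print(filled)
```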
Your problem is this
df[( df['target'] == 1 )&( df[cols].isnull() )][cols] = ...
Do NOT use chained indexing, especially when assigning. See the "Why does assignment fail when using chained indexing?" section of the pandas indexing documentation.
Instead use loc, and take the first value of the mode, since mode() returns a Series:
df.loc[(df['target'] == 1) & (df[cols].isnull()),
       cols] = df.loc[df['target'] == 1,
                      cols].mode().iat[0]

Error while replacing '?' with mean value in dataframe in Python

I have a car dataset where I want to replace the '?' values in the column normalized-losses with the mean of the remaining numerical values. The code I have used is:
mean = df["normalized-losses"].mean()
df["normalized-losses"].replace("?",mean)
However, this produces the error:
ValueError: could not convert string to float: '???164164?158?158?192192188188??121988111811811814814814814811014513713710110110111078106106858585107????145??104104104113113150150150150129115129115?115118?93939393?142???161161161161153153???125125125137128128128122103128128122103168106106128108108194194231161161??161161??16116116111911915415415474?186??????1501041501041501048383831021021021021028989858587877477819191919191919191168168168168134134134134134134656565656519719790?1221229494949494?256???1037410374103749595959595'
Can anyone help with the way in which I can convert the '?' values to the mean values. Also, this is the first time I am working with the Pandas package so if I have made any silly mistakes, please forgive me.
Use to_numeric to convert non-numeric values to NaN, and then fillna with the mean:
vals = pd.to_numeric(df["normalized-losses"], errors='coerce')
df["normalized-losses"] = vals.fillna(vals.mean())
#data from jpp
print (df)
normalized-losses
0 1.0
1 2.0
2 3.0
3 3.4
4 5.0
5 6.0
6 3.4
Details:
print (vals)
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
Name: normalized-losses, dtype: float64
print (vals.mean())
3.4
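Use replace() followed by fillna(), converting the column to float in between so that the mean is defined, and assign the result back:
df['normalized-losses'] = df['normalized-losses'].replace('?', np.nan).astype(float)
df['normalized-losses'] = df['normalized-losses'].fillna(df['normalized-losses'].mean())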
The mean of a series of mixed types is not defined. Convert to numeric and then use replace:
df = pd.DataFrame({'A': [1, 2, 3, '?', 5, 6, '??']})
mean = pd.to_numeric(df['A'], errors='coerce').mean()
df['B'] = df['A'].replace('?', mean)
print(df)
A B
0 1 1
1 2 2
2 3 3
3 ? 3.4
4 5 5
5 6 6
6 ?? ??
If you need to replace all non-numeric values, then use fillna:
nums = pd.to_numeric(df['A'], errors='coerce')
df['B'] = nums.fillna(nums.mean())
print(df)
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 ? 3.4
4 5 5.0
5 6 6.0
6 ?? 3.4
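If the data comes from a CSV file, another option is to have read_csv treat '?' as missing while loading, so the column is numeric from the start. This is a sketch on made-up inline data, not the asker's file:

```python
import io
import pandas as pd

# Hypothetical CSV content; a real file path would go in place of the StringIO
csv_text = "normalized-losses\n?\n164\n158\n?\n192\n"

# na_values tells the parser which extra strings count as missing
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

col = "normalized-losses"
df[col] = df[col].fillna(df[col].mean())
print(df)
```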
