Error while replacing '?' with mean value in dataframe in Python

I have a car dataset where I want to replace the '?' values in the column normalized-losses with the mean of the remaining numerical values. The code I have used is:
mean = df["normalized-losses"].mean()
df["normalized-losses"].replace("?",mean)
However, this produces the error:
ValueError: could not convert string to float: '???164164?158?158?192192188188??121988111811811814814814814811014513713710110110111078106106858585107????145??104104104113113150150150150129115129115?115118?93939393?142???161161161161153153???125125125137128128128122103128128122103168106106128108108194194231161161??161161??16116116111911915415415474?186??????1501041501041501048383831021021021021028989858587877477819191919191919191168168168168134134134134134134656565656519719790?1221229494949494?256???1037410374103749595959595'
Can anyone help me convert the '?' values to the mean value? Also, this is the first time I am working with the Pandas package, so please forgive any silly mistakes.

Use to_numeric to convert non-numeric values to NaN, then fillna with the mean:
vals = pd.to_numeric(df["normalized-losses"], errors='coerce')
df["normalized-losses"] = vals.fillna(vals.mean())
#data from jpp
print (df)
normalized-losses
0 1.0
1 2.0
2 3.0
3 3.4
4 5.0
5 6.0
6 3.4
Details:
print (vals)
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
Name: normalized-losses, dtype: float64
print (vals.mean())
3.4
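A self-contained sketch of the same approach on string data like that in the question. The sample values here are made up for illustration and are not the actual car dataset:

```python
import pandas as pd

# Hypothetical sample: numeric strings mixed with '?' placeholders
df = pd.DataFrame({"normalized-losses": ["164", "?", "158", "192", "?"]})

# errors='coerce' turns anything that cannot be parsed (such as '?') into NaN
vals = pd.to_numeric(df["normalized-losses"], errors="coerce")

# Replace the NaNs with the mean of the values that did parse
df["normalized-losses"] = vals.fillna(vals.mean())
print(df)
```

Because the conversion happens before the mean is taken, the column ends up with a float dtype instead of object.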

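Use replace() followed by fillna(), assigning the result back each time (this assumes the remaining values are already numeric; if they are strings, convert the column first):
df['normalized-losses'] = df['normalized-losses'].replace('?', np.nan)
df['normalized-losses'] = df['normalized-losses'].fillna(df['normalized-losses'].mean())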
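If the column holds numeric strings, as the error message in the question suggests, the replace() route needs an explicit conversion before the mean can be computed. A minimal sketch, assuming made-up sample values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the numbers are stored as strings
df = pd.DataFrame({"normalized-losses": ["10", "?", "20", "30"]})

# Mark '?' as missing, then convert the remaining strings to floats
col = df["normalized-losses"].replace("?", np.nan).astype(float)

# Fill the missing entries with the mean and assign the result back
df["normalized-losses"] = col.fillna(col.mean())
print(df)
```

Note that astype(float) raises if any other non-numeric string (for example '??') is present, which is where to_numeric with errors='coerce' is more forgiving.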

The mean of a series of mixed types is not defined. Convert to numeric and then use replace:
df = pd.DataFrame({'A': [1, 2, 3, '?', 5, 6, '??']})
mean = pd.to_numeric(df['A'], errors='coerce').mean()
df['B'] = df['A'].replace('?', mean)
print(df)
A B
0 1 1
1 2 2
2 3 3
3 ? 3.4
4 5 5
5 6 6
6 ?? ??
If you need to replace all non-numeric values, then use fillna:
nums = pd.to_numeric(df['A'], errors='coerce')
df['B'] = nums.fillna(nums.mean())
print(df)
A B
0 1 1.0
1 2 2.0
2 3 3.0
3 ? 3.4
4 5 5.0
5 6 6.0
6 ?? 3.4

Related

How to pass the value of previous row to the dataframe apply function?

I have the following pandas dataframe and would like to build a new column 'c' which is the summation of column 'b' value and column 'a' previous values. With shifting column 'a' it is possible to do so. However, I would like to know how I can pass the previous values of column 'a' in the apply() function.
l1 = [1,2,3,4,5]
l2 = [3,2,5,4,6]
df = pd.DataFrame(data=l1, columns=['a'])
df['b'] = l2
df['shifted'] = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['shifted']+ row['b'], axis=1)
print(df)
a b shifted c
0 1 3 NaN NaN
1 2 2 1.0 3.0
2 3 5 2.0 7.0
3 4 4 3.0 7.0
4 5 6 4.0 10.0
I appreciate your help.
Edit: this is a dummy example. I need to use the apply function because I'm passing another function to it which uses previous rows of some columns and checks some condition.
First let's make it clear that you do not need apply for this simple operation, so I'll consider it as a dummy example of a complex function.
Assuming non-duplicate indices, you can generate a shifted Series and reference it in apply using the name attribute:
s = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['b'] + s[row.name], axis=1)
output:
a b shifted c
0 1 3 NaN NaN
1 2 2 1.0 3.0
2 3 5 2.0 7.0
3 4 4 3.0 7.0
4 5 6 4.0 10.0
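The same idea with a named function in place of the lambda, which is closer to the case described in the question's edit. The function combine and its rule for the first row are hypothetical stand-ins for the real logic:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [3, 2, 5, 4, 6]})

# Previous-row values of 'a', aligned on the same (unique) index
prev_a = df["a"].shift(1)

def combine(row):
    # row.name is the index label of the current row
    prev = prev_a[row.name]
    # Hypothetical condition: the first row has no previous value, so keep 'b'
    if pd.isna(prev):
        return row["b"]
    return row["b"] + prev

df["c"] = df.apply(combine, axis=1)
print(df)
```

The lookup relies on the index labels being unique, as noted above.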

Rolling sum over a partition in python

Code:
data['rolling_sum'] = data.groupby(['User_id'])['Amount'].rolling().sum()
Error
TypeError: incompatible index of inserted column with frame index
Please help in figuring out the mistake in the code. An alternative method would also be appreciated.
Use reset_index with level=0 and drop=True to remove the first level of the MultiIndex. This is the safer option because the result is then aligned on the original index values:
data = pd.DataFrame({
'Amount':[5,3,6,9,2,4],
'User_id':list('aababb')
})
data['rolling_sum1'] = data.groupby(['User_id'])['Amount'].rolling(2).sum().reset_index(level=0, drop=True)
If you assign only the NumPy array instead, the values can end up in the wrong rows:
data['rolling_sum2'] = data.groupby(['User_id'])['Amount'].rolling(2).sum().values
print (data)
Amount User_id rolling_sum1 rolling_sum2
0 5 a NaN NaN
1 3 a 8.0 8.0
2 6 b NaN 12.0
3 9 a 12.0 NaN
4 2 b 8.0 8.0
5 4 b 6.0 6.0
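As an alternative, groupby().transform() returns a result that already carries the original index, so no index manipulation is needed. A sketch using the same sample data (the column name rolling_sum is just an illustrative choice):

```python
import pandas as pd

data = pd.DataFrame({
    "Amount": [5, 3, 6, 9, 2, 4],
    "User_id": list("aababb"),
})

# transform applies the rolling sum within each group and returns
# a Series aligned to data's own index
data["rolling_sum"] = (
    data.groupby("User_id")["Amount"]
        .transform(lambda s: s.rolling(2).sum())
)
print(data)
```

The lambda makes this slower than the built-in groupby rolling on large frames, so it is a convenience rather than a performance choice.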

How to do a Python DataFrame Boolean Mask on nan values [duplicate]

Given a pandas dataframe containing possible NaN values scattered here and there:
Question: How do I determine which columns contain NaN values? In particular, can I get a list of the column names containing NaNs?
UPDATE: using Pandas 0.22.0
Newer Pandas versions have new methods 'DataFrame.isna()' and 'DataFrame.notna()'
In [71]: df
Out[71]:
a b c
0 NaN 7.0 0
1 0.0 NaN 4
2 2.0 NaN 4
3 1.0 7.0 0
4 1.0 3.0 9
5 7.0 4.0 9
6 2.0 6.0 9
7 9.0 6.0 4
8 3.0 0.0 9
9 9.0 0.0 1
In [72]: df.isna().any()
Out[72]:
a True
b True
c False
dtype: bool
as list of columns:
In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']
to select those columns (containing at least one NaN value):
In [73]: df.loc[:, df.isna().any()]
Out[73]:
a b
0 NaN 7.0
1 0.0 NaN
2 2.0 NaN
3 1.0 7.0
4 1.0 3.0
5 7.0 4.0
6 2.0 6.0
7 9.0 6.0
8 3.0 0.0
9 9.0 0.0
OLD answer:
Try to use isnull():
In [97]: df
Out[97]:
a b c
0 NaN 7.0 0
1 0.0 NaN 4
2 2.0 NaN 4
3 1.0 7.0 0
4 1.0 3.0 9
5 7.0 4.0 9
6 2.0 6.0 9
7 9.0 6.0 4
8 3.0 0.0 9
9 9.0 0.0 1
In [98]: pd.isnull(df).sum() > 0
Out[98]:
a True
b True
c False
dtype: bool
or, as @root proposed, a clearer version:
In [5]: df.isnull().any()
Out[5]:
a True
b True
c False
dtype: bool
In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']
to select a subset - all columns containing at least one NaN value:
In [31]: df.loc[:, df.isnull().any()]
Out[31]:
a b
0 NaN 7.0
1 0.0 NaN
2 2.0 NaN
3 1.0 7.0
4 1.0 3.0
5 7.0 4.0
6 2.0 6.0
7 9.0 6.0
8 3.0 0.0
9 9.0 0.0
You can use df.isnull().sum(). It shows all columns and the total NaNs of each feature.
I had a problem where I had too many columns to inspect visually on screen, so a short list comprehension that filters and returns the offending columns is:
nan_cols = [i for i in df.columns if df[i].isnull().any()]
if that's helpful to anyone.
Adding to that, if you want to find the columns with more NaN values than a threshold, say 85%, use:
nan_cols85 = [i for i in df.columns if df[i].isnull().sum() > 0.85*len(df)]
This worked for me,
1. For getting Columns having at least 1 null value. (column names)
data.columns[data.isnull().any()]
2. For getting Columns with count, with having at least 1 null value.
data[data.columns[data.isnull().any()]].isnull().sum()
[Optional]
3. For getting percentage of the null count.
data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]
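The three steps above can be combined into one small helper. This is only a sketch; the function name null_summary and the output column names are made up for illustration:

```python
import numpy as np
import pandas as pd

def null_summary(frame):
    """Return null count and percentage for columns with at least one null."""
    counts = frame.isna().sum()
    counts = counts[counts > 0]          # keep only columns that have nulls
    return pd.DataFrame({
        "null_count": counts,
        "null_pct": counts * 100 / len(frame),
    })

# Small made-up frame: 'a' and 'b' contain NaN, 'c' does not
data = pd.DataFrame({
    "a": [np.nan, 0.0, 2.0, 1.0],
    "b": [7.0, np.nan, np.nan, 7.0],
    "c": [0, 4, 4, 0],
})
summary = null_summary(data)
print(summary)
```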
In datasets with a large number of columns, it's even better to see how many columns contain null values and how many don't.
print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))
print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))
print("Total no. of columns in the dataframe")
print(len(df.columns))
For example, my dataframe had 82 columns, of which 19 contained at least one null value.
You can also remove columns and rows automatically. The code below first drops every column whose null count exceeds the number of columns in the dataframe, then drops every remaining row that contains a null:
df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
df = df.dropna(axis = 0).reset_index(drop=True)
Note: the code above removes all of your null values. If you need to keep any of them, process them beforehand.
df.columns[df.isnull().any()].tolist()
This returns the names of the columns that contain null values.
I know this is a very well-answered question but I wanted to add a slight adjustment. This answer only returns columns containing nulls, and also still shows the count of the nulls.
As 1-liner:
pd.isnull(df).sum()[pd.isnull(df).sum() > 0]
Description
Count nulls in each column
null_count_ser = pd.isnull(df).sum()
True|False series describing if that column had nulls
is_null_ser = null_count_ser > 0
Use the T|F series to filter out those without
null_count_ser[is_null_ser]
Example Output
name 5
phone 187
age 644
I use these three lines of code to print out the column names which contain at least one null value:
for column in dataframe:
    if dataframe[column].isnull().any():
        print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))
This is one of the methods:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan],'c':[np.nan,2,np.nan], 'd':[np.nan,np.nan,np.nan]})
print(pd.isnull(df).sum())
Both of these should work:
df.isnull().sum()
df.isna().sum()
The DataFrame methods isna() and isnull() are identical.
Note: empty strings '' are not considered NA, so they yield False.
df.isna() returns True for NaN values and False for everything else. So:
df.isna().any()
returns True for every column that has a NaN and False for the rest.
To see just the columns containing NaNs and just the rows containing NaNs:
isnulldf = df.isnull()
columns_containing_nulls = isnulldf.columns[isnulldf.any()]
rows_containing_nulls = df[isnulldf[columns_containing_nulls].any(axis='columns')].index
only_nulls_df = df[columns_containing_nulls].loc[rows_containing_nulls]
print(only_nulls_df)
features_with_na = [feature for feature in dataframe.columns if dataframe[feature].isnull().sum() > 0]
for feature in features_with_na:
    # isnull().mean() is a fraction, so multiply by 100 to get a percentage
    print(feature, np.round(dataframe[feature].isnull().mean() * 100, 4), '% missing values')
print(features_with_na)
This prints the percentage of missing values for each column that has any.
The code works if you want to find columns containing NaN values and get a list of the column names.
na_names = df.isnull().any()
list(na_names.where(na_names == True).dropna().index)
If you want to find columns whose values are all NaNs, you can replace any with all.

Replacing NaNs in a dataframe with a string value

I want to replace the missing value in one column of my df with "missing value".
I tried
result['emp_title'].fillna('missing')
or
result['emp_title'] = result['emp_title'].replace({ np.nan:'missing'})
The second one works: when I count the missing values after running it,
result['emp_title'].isnull().sum()
it gives me 0.
However, the first one does not work as I expected: the count of missing values stays the same as before instead of dropping to 0.
Why doesn't the first one work? Thank you!
You need to fill inplace, or assign:
result['emp_title'].fillna('missing', inplace=True)
or
result['emp_title'] = result['emp_title'].fillna('missing')
MVCE:
In [1697]: df = pd.DataFrame({'Col1' : [1, 2, 3, np.nan, 4, 5, np.nan]})
In [1702]: df.fillna('missing'); df # changes not seen in the original
Out[1702]:
Col1
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 NaN
In [1703]: df.fillna('missing', inplace=True); df
Out[1703]:
Col1
0 1
1 2
2 3
3 missing
4 4
5 5
6 missing
Be aware that if you apply fillna to a slice, inplace=True will not update the original; instead, use df.loc/iloc and assign to the sub-slice:
In [1707]: df.Col1.iloc[:5].fillna('missing', inplace=True); df # doesn't work
Out[1707]:
Col1
0 1.0
1 2.0
2 3.0
3 NaN
4 4.0
5 5.0
6 NaN
In [1709]: df.Col1.iloc[:5] = df.Col1.iloc[:5].fillna('missing')
In [1710]: df
Out[1710]:
Col1
0 1
1 2
2 3
3 missing
4 4
5 5
6 NaN
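A sketch of the same slice update written as a single .loc assignment, which avoids chained indexing. The cast to object is an assumption added here so that the float column can hold strings without a dtype complaint from recent pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2, 3, np.nan, 4, 5, np.nan]})

# Allow strings and numbers to live in the same column
df["Col1"] = df["Col1"].astype(object)

# .loc slicing is label-based and inclusive, so :4 covers rows 0 through 4
df.loc[:4, "Col1"] = df.loc[:4, "Col1"].fillna("missing")
print(df)
```

Only the NaN inside the selected rows is replaced; the one in row 6 is left untouched.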

