How to do a Python DataFrame Boolean Mask on nan values [duplicate] - python

Given a pandas dataframe containing possible NaN values scattered here and there:
Question: How do I determine which columns contain NaN values? In particular, can I get a list of the column names containing NaNs?

UPDATE: using Pandas 0.22.0
Newer Pandas versions have the methods DataFrame.isna() and DataFrame.notna():
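For reference, the sample frame used below can be rebuilt like this (the values are simply read off the Out[71] listing):
import numpy as np
import pandas as pd

# small frame with NaNs scattered in columns a and b, none in c
df = pd.DataFrame({'a': [np.nan, 0, 2, 1, 1, 7, 2, 9, 3, 9],
                   'b': [7, np.nan, np.nan, 7, 3, 4, 6, 6, 0, 0],
                   'c': [0, 4, 4, 0, 9, 9, 9, 4, 9, 1]})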
In [71]: df
Out[71]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1
In [72]: df.isna().any()
Out[72]:
a True
b True
c False
dtype: bool
as a list of columns:
In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']
to select those columns (containing at least one NaN value):
In [73]: df.loc[:, df.isna().any()]
Out[73]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0
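A related sketch (not part of the original answer): the same idea works row-wise with axis=1 if you also want the rows containing NaNs.
# rows with at least one NaN (axis=1 looks across the columns of each row)
df[df.isna().any(axis=1)]

# rows with no NaN at all
df[df.notna().all(axis=1)]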
OLD answer:
Try to use isnull():
In [97]: df
Out[97]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1
In [98]: pd.isnull(df).sum() > 0
Out[98]:
a True
b True
c False
dtype: bool
or, as @root proposed, a clearer version:
In [5]: df.isnull().any()
Out[5]:
a True
b True
c False
dtype: bool
In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']
to select a subset - all columns containing at least one NaN value:
In [31]: df.loc[:, df.isnull().any()]
Out[31]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

You can use df.isnull().sum(). It shows all columns and the total NaNs of each feature.
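For the example frame above (one NaN in column a, two in b, none in c) that gives:
df.isnull().sum()
a    1
b    2
c    0
dtype: int64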

I had a problem where I had too many columns to visually inspect on the screen, so a short list comprehension that filters and returns the offending columns is
nan_cols = [i for i in df.columns if df[i].isnull().any()]
if that's helpful to anyone
Adding to that, if you want to filter out columns having more NaN values than a threshold, say 85%, then use
nan_cols85 = [i for i in df.columns if df[i].isnull().sum() > 0.85*len(df)]
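An equivalent vectorised form (a sketch, using the fact that isnull().mean() gives the fraction of NaNs per column):
# columns where more than 85% of the values are NaN
nan_cols85 = df.columns[df.isnull().mean() > 0.85].tolist()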

This worked for me:
1. To get the columns having at least one null value (column names):
data.columns[data.isnull().any()]
2. To get those columns together with their null counts:
data[data.columns[data.isnull().any()]].isnull().sum()
[Optional]
3. To get the percentage of nulls per column:
data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]

In datasets with a large number of columns it's even better to see how many columns contain null values and how many don't.
print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))
print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))
print("Total no. of columns in the dataframe")
print(len(df.columns))
For example, my dataframe contained 82 columns, of which 19 contained at least one null value.
Further, you can also automatically remove columns and rows depending on which has more null values.
Here is the code which does this:
df = df.drop(df.columns[df.isna().sum() > len(df.columns)], axis=1)
df = df.dropna(axis=0).reset_index(drop=True)
Note: the above code removes all of your null values. If you want to keep some null values, handle them before running this.

df.columns[df.isnull().any()].tolist()
It will return the names of the columns that contain null values.

I know this is a very well-answered question but I wanted to add a slight adjustment. This answer only returns columns containing nulls, and also still shows the count of the nulls.
As a one-liner:
pd.isnull(df).sum()[pd.isnull(df).sum() > 0]
Description
Count the nulls in each column:
null_count_ser = pd.isnull(df).sum()
Build a True/False series describing whether each column had nulls:
is_null_ser = null_count_ser > 0
Use that True/False series to filter out the columns without nulls:
null_count_ser[is_null_ser]
Example Output
name 5
phone 187
age 644

I use these three lines of code to print out the column names that contain at least one null value:
for column in dataframe:
    if dataframe[column].isnull().any():
        print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))

This is one of the methods:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan], 'c': [np.nan, 2, np.nan], 'd': [np.nan, np.nan, np.nan]})
print(pd.isnull(df).sum())
This prints the per-column null counts:
a    1
b    2
c    2
d    3
dtype: int64

Both of these should work:
df.isnull().sum()
df.isna().sum()
The DataFrame methods isna() and isnull() are identical (isnull() is an alias of isna()).
Note: empty strings '' are not considered NA, so they show up as False in the mask.
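A quick illustration of that note (a small sketch):
import numpy as np
import pandas as pd

s = pd.Series(['', 'x', None, np.nan])
s.isna()
# 0    False   <- empty string is not NA
# 1    False
# 2     True
# 3     True
# dtype: bool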

df.isna() returns True for NaN values and False for the rest. So, doing:
df.isna().any()
will return True for any column containing a NaN and False for the rest.

To see just the columns containing NaNs and just the rows containing NaNs:
isnulldf = df.isnull()
columns_containing_nulls = isnulldf.columns[isnulldf.any()]
rows_containing_nulls = df[isnulldf[columns_containing_nulls].any(axis='columns')].index
only_nulls_df = df[columns_containing_nulls].loc[rows_containing_nulls]
print(only_nulls_df)

features_with_na = [features for features in dataframe.columns if dataframe[features].isnull().sum() > 0]
for feature in features_with_na:
    print(feature, np.round(dataframe[feature].isnull().mean(), 4), '% missing values')
print(features_with_na)
For each column with missing data, this prints the fraction of values that are missing (multiply by 100 for an actual percentage), followed by the list of those column names.

This code works if you want to find columns containing NaN values and get a list of the column names.
na_names = df.isnull().any()
list(na_names.where(na_names == True).dropna().index)
If you want to find columns whose values are all NaNs, you can replace any with all.
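For example, a sketch of the all-NaN variant using the same boolean-mask idea:
# columns where every value is NaN
all_nan_names = df.columns[df.isnull().all()].tolist()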

Related

How to pass the value of previous row to the dataframe apply function?

I have the following pandas dataframe and would like to build a new column 'c' which is the sum of column 'b' and the previous value of column 'a'. By shifting column 'a' this is possible. However, I would like to know how I can access the previous values of column 'a' inside the apply() function.
import pandas as pd

l1 = [1, 2, 3, 4, 5]
l2 = [3, 2, 5, 4, 6]
df = pd.DataFrame(data=l1, columns=['a'])
df['b'] = l2
df['shifted'] = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['shifted'] + row['b'], axis=1)
print(df)
   a  b  shifted     c
0  1  3      NaN   NaN
1  2  2      1.0   3.0
2  3  5      2.0   7.0
3  4  4      3.0   7.0
4  5  6      4.0  10.0
I appreciate your help.
Edit: this is a dummy example. I need to use the apply function because I'm passing another function to it which uses previous rows of some columns and checks some condition.
First let's make it clear that you do not need apply for this simple operation, so I'll consider it as a dummy example of a complex function.
Assuming non-duplicate indices, you can generate a shifted Series and reference it in apply using the name attribute:
s = df['a'].shift(1)
df['c'] = df.apply(lambda row: row['b'] + s[row.name], axis=1)
output:
   a  b  shifted     c
0  1  3      NaN   NaN
1  2  2      1.0   3.0
2  3  5      2.0   7.0
3  4  4      3.0   7.0
4  5  6      4.0  10.0
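As the answer notes, apply is not needed at all for this dummy example; the vectorised equivalent is simply:
# same result without apply: add column 'b' to the shifted column 'a'
df['c'] = df['b'] + df['a'].shift(1)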

Rolling sum over a partition in python

Code:
data['rolling_sum'] = data.groupby(['User_id'])['Amount'].rolling().sum()
Error
TypeError: incompatible index of inserted column with frame index
Please help in figuring out the mistake in the code. An alternative method would also be appreciated.
Use DataFrame.reset_index with level=0 and drop=True to remove the first level of the MultiIndex. This is safer because the values are aligned by the original index:
import pandas as pd

data = pd.DataFrame({
    'Amount': [5, 3, 6, 9, 2, 4],
    'User_id': list('aababb')
})
data['rolling_sum1'] = data.groupby(['User_id'])['Amount'].rolling(2).sum().reset_index(level=0, drop=True)
If you assign only the underlying numpy array, the values are added in groupby order and end up misaligned:
data['rolling_sum2'] = data.groupby(['User_id'])['Amount'].rolling(2).sum().values
print (data)
   Amount User_id  rolling_sum1  rolling_sum2
0       5       a           NaN           NaN
1       3       a           8.0           8.0
2       6       b           NaN          12.0
3       9       a          12.0           NaN
4       2       b           8.0           8.0
5       4       b           6.0           6.0
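Another option (not from the answer above, just a sketch) that keeps the result aligned with the original index is groupby.transform, which avoids the MultiIndex entirely:
# per-group rolling sum, returned with the original index
data['rolling_sum3'] = (data.groupby('User_id')['Amount']
                            .transform(lambda s: s.rolling(2).sum()))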

Empty DataFrame doesn't admit it's empty

I must not understand something about emptiness when it comes to pandas DataFrames. I have a DF with empty rows, but when I isolate one of these rows it's not empty.
Here I've made a dataframe:
>>> import pandas
>>> df = pandas.DataFrame(columns=[1,2,3], data=[[1,2,3],[1,None,3],[None, None, None],[3,2,1],[4,5,6],[None,None,None],[None,None,None]])
>>> df
     1    2    3
0  1.0  2.0  3.0
1  1.0  NaN  3.0
2  NaN  NaN  NaN
3  3.0  2.0  1.0
4  4.0  5.0  6.0
5  NaN  NaN  NaN
6  NaN  NaN  NaN
Then I know row '2' is full of nothing so I check for that...
>>> df[2:3].empty
False
Odd. So I split it out into its own dataframe:
>>> df1 = df[2:3]
>>> df1
     1    2    3
2  NaN  NaN  NaN
>>> df1.empty
False
How do I check for emptiness (all the elements in a row being None or NaN?)
http://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.DataFrame.empty.html
You're misunderstanding what empty is for. It's meant to check that the size of a series/dataframe is greater than 0, meaning there are rows. For example,
df.iloc[1:0]
Empty DataFrame
Columns: [1, 2, 3]
Index: []
df.iloc[1:0].empty
True
If you want to check that a row has all NaNs, use isnull + all:
df.isnull().all(1)
0 False
1 False
2 True
3 False
4 False
5 True
6 True
dtype: bool
For your example, this should do:
df[2:3].isnull().all(1).item()
True
Note that you can't use item if your slice is more than one row in size.
I guess you are looking for something like this:
In [296]: df[5:]
Out[296]:
1 2 3
5 NaN NaN NaN
6 NaN NaN NaN
In [297]: df[5:].isnull().all(1).all()
Out[297]: True
or even better (as proposed by @IanS):
In [300]: df[5:].isnull().all().all()
Out[300]: True
You can drop all null values from your selection and check if the result is empty:
>>> df[5:].dropna(how='all').empty
True
If you do not want to count NaN values as real numbers, this is equivalent to
df.dropna().iloc[5:]
where you are selecting rows that do not exist in your dataframe:
df.dropna().iloc[5:].empty
Out[921]: True
If you have a dataframe and want to drop all rows that contain NaN in every column, you can do this:
df.dropna(how='all')
Note that your dataframe can also have NaN in just one of the columns in some cases. If you need to drop the entire row in such cases:
df.dropna(how='any')
After you do this (whichever is your preference) you can check the number of rows the dataframe contains using:
len(df)
I guess you have to use isnull() instead of empty().
import pandas
df = pandas.DataFrame(columns=[1,2,3], data=[[1,2,3],[1,None,3],[None, None, None],[3,2,1],[4,5,6],[None,None,None],[None,None,None]])
df[2:3].isnull()
      1     2     3
2  True  True  True

Using fillna() selectively in pandas

I would like to fill NA values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive NaNs within a column, I want them to be filled by the preceding non-NaN value, but only if the length of the NaN sequence is below a specified threshold. For example, if the threshold is 3, then a within-column sequence of 3 or fewer NaNs will be filled with the preceding non-NaN value, whereas a sequence of 4 or more NaNs will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options. But these are unfortunately not sufficient to achieve the task. I tried to specify method='ffill' and limit=3, but that fills the first 3 NaNs of any sequence, not selectively as described above.
I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?
Working with contiguous groups is still a little awkward in pandas.. or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern:
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
     0    1    2
0  2.0  5.0  4.0
1  2.0  5.0  NaN
2  2.0  5.0  NaN
3  5.0  5.0  NaN
4  9.0  3.0  NaN
5  7.0  9.0  1.0
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    filled = df.ffill(limit=3)
    unfilled = nulls & (~filled.notnull())
    nf = nulls.replace({False: 2.0, True: np.nan})
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
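Here is one more sketch along those lines (my own variant, not from the answer above): do a limited ffill, flag the cells that are still NaN (they can only occur in runs longer than the limit), and broadcast that flag back over each run with a bfill so the whole run is left untouched.
import numpy as np
import pandas as pd

def ffill_short_gaps(df, limit=3):
    nulls = df.isnull()
    # cells still NaN after a limited ffill are the (limit+1)-th and later cells of a long run
    deep = df.ffill(limit=limit).isnull()
    # hide the leading cells of every NaN run (set them to NaN) and backfill
    # from the deep cells, so the "too long" flag covers the whole run
    flag = deep.astype(float).mask(nulls & ~deep)
    too_long = flag.bfill().fillna(0).astype(bool)
    # keep long runs as NaN, forward-fill everything else
    return df.where(too_long, df.ffill())

# example from the question: only the 4-long run in the last column stays NaN
df = pd.DataFrame([[2, 5, 4],
                   [np.nan, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [5, np.nan, np.nan],
                   [9, 3, np.nan],
                   [7, 9, 1]], dtype=float)
print(ffill_short_gaps(df, limit=3))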
