Empty DataFrame doesn't admit it's empty - python

I must not understand something about emptiness when it comes to pandas DataFrames. I have a DF with empty rows, but when I isolate one of these rows it's not empty.
Here I've made a dataframe:
>>> df = pandas.DataFrame(columns=[1,2,3], data=[[1,2,3],[1,None,3],[None, None, None],[3,2,1],[4,5,6],[None,None,None],[None,None,None]])
>>> df
1 2 3
0 1.0 2.0 3.0
1 1.0 NaN 3.0
2 NaN NaN NaN
3 3.0 2.0 1.0
4 4.0 5.0 6.0
5 NaN NaN NaN
6 NaN NaN NaN
Then I know row '2' is full of nothing so I check for that...
>>> df[2:3].empty
False
Odd. So I split it out into its own dataframe:
>>> df1 = df[2:3]
>>> df1
1 2 3
2 NaN NaN NaN
>>> df1.empty
False
How do I check for emptiness (all the elements in a row being None or NaN)?
http://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.DataFrame.empty.html

You're misunderstanding what empty is for. It checks whether the size of a series/dataframe is greater than 0, i.e. whether there are any rows at all. For example,
df.iloc[1:0]
Empty DataFrame
Columns: [1, 2, 3]
Index: []
df.iloc[1:0].empty
True
If you want to check that a row has all NaNs, use isnull + all:
df.isnull().all(1)
0 False
1 False
2 True
3 False
4 False
5 True
6 True
dtype: bool
For your example, this should do:
df[2:3].isnull().all(1).item()
True
Note that you can't use item if your slice is more than one row in size.
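For slices longer than one row, a minimal sketch of the same check (rebuilding the question's frame) reduces the boolean frame with a second .all() instead of .item():

```python
import pandas as pd

df = pd.DataFrame(columns=[1, 2, 3],
                  data=[[1, 2, 3], [1, None, 3], [None, None, None],
                        [3, 2, 1], [4, 5, 6], [None, None, None],
                        [None, None, None]])

# .item() raises ValueError on more than one element; reduce twice instead
all_rows_nan = df[5:7].isnull().all(axis=1).all()
print(all_rows_nan)  # True
```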

I guess you are looking for something like this:
In [296]: df[5:]
Out[296]:
1 2 3
5 NaN NaN NaN
6 NaN NaN NaN
In [297]: df[5:].isnull().all(1).all()
Out[297]: True
or even better (as proposed by @IanS):
In [300]: df[5:].isnull().all().all()
Out[300]: True

You can drop all null values from your selection and check if the result is empty:
>>> df[5:].dropna(how='all').empty
True

If you do not want to count NaN values as real numbers, this is equivalent to
df.dropna().iloc[5:]
This selects rows that no longer exist once the NaN rows are dropped:
df.dropna().iloc[5:].empty
Out[921]: True

If you have a dataframe and want to drop the rows where all of the columns are NaN, you can do this:
df.dropna(how='all')
Notice that your dataframe also has NaN in just one of the columns in some rows. If you need to drop the entire row in that case:
df.dropna(how='any')
After you do this (whichever is your preference) you can check the length of the dataframe (the number of rows it contains) with:
len(df)
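A minimal sketch contrasting the two dropna modes on a toy frame (not the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, np.nan], 'b': [2, 5, np.nan]})

only_full_nan_dropped = df.dropna(how='all')  # drops rows where every value is NaN
any_nan_dropped = df.dropna(how='any')        # drops rows containing any NaN
print(len(only_full_nan_dropped))  # 2
print(len(any_nan_dropped))        # 1
```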

I guess you have to use isnull() instead of empty.
import pandas
df = pandas.DataFrame(columns=[1,2,3], data=[[1,2,3],[1,None,3],[None, None, None],[3,2,1],[4,5,6],[None,None,None],[None,None,None]])
df[2:3].isnull()
1 2 3
2 True True True
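To collapse that boolean frame to a single True/False in one step, .all(axis=None) reduces over both axes (a variant not shown in the answer above):

```python
import pandas as pd

df = pd.DataFrame(columns=[1, 2, 3],
                  data=[[1, 2, 3], [1, None, 3], [None, None, None],
                        [3, 2, 1], [4, 5, 6], [None, None, None],
                        [None, None, None]])

# axis=None reduces the DataFrame over both axes to a single bool
row_is_all_nan = df[2:3].isnull().all(axis=None)
print(row_is_all_nan)  # True
```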

Related

Filter rows with list of strings and NaN altogether in pandas

I am trying to filter the rows having strings and NaN values.
The following code does not work
list=['ok','good','fine']
df_new = df[(df['b'].str.contains('|'.join(list))) | (df.b.isnull())]
Example
df = pd.DataFrame({'a':[1,2,np.nan, 4, 3 ], 'b':['ok',1,np.nan, np.nan,'fine' ]})
df
a b
0 1.0 ok
1 2.0 1
2 NaN NaN
3 4.0 NaN
4 3.0 fine
df[(df['b'].str.contains('|'.join(list))) | (df.b.isnull())]
a b
0 1.0 ok
4 3.0 fine
Desired Output
0 1.0 ok
2 NaN NaN
3 4.0 NaN
4 3.0 fine
I know there are many way to achieve this but wondering why this code is not working
Your problem is that df['b'].str.contains('|'.join(l)) (using l for your list of words, to avoid shadowing the built-in list) produces NaNs, since a comparison against NaN yields NaN in the result. That is,
>>> df['b'].str.contains('|'.join(l))
0 True
1 NaN
2 NaN
3 NaN
4 True
If you logical-or(*) that with the df.b.isnull() condition, the NaNs will be interpreted as False (or probably as NaN even, but be equivalent to False):
>>> df['b'].str.contains('|'.join(l)) | df.b.isnull()
0 True
1 False
2 False
3 False
4 True
Hence you lose a few rows.
If you reverse the order of the logical-or, shortcutting kicks in: when two sides are or-ed and the left-hand side is True, there is no need to look at the right-hand side (whether it be False, True or NaN), and thus your rows remain:
>>> df.b.isnull() | df['b'].str.contains('|'.join(l))
0 True
1 False
2 True
3 True
4 True
Hence you'll get your result when you logical-or the sides in reverse order:
>>> df[df.b.isnull() | df.b.str.contains('|'.join(l))]
a b
0 1.0 ok
2 NaN NaN
3 4.0 NaN
4 3.0 fine
(You can remove some parentheses, as well as replace the ['b'] part.)
I don't think this is something to rely on, though. You would be better off, for example, replacing the NaNs with a more relevant value (perhaps an empty string), so that the boolean combination works both ways. Because at some point, a trick like the one above will bite you.
df.b = df.b.fillna("")
replaces the NaNs with an empty string.
(*) "logical-or" is not a verb that I know of, but for this context, I made it a verb. As some say: verbing weirds language.
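Another route, sketched below on the question's frame: str.contains accepts an na= parameter that maps the NaN results to False up front, so the mask is purely boolean and the order of | no longer matters (words is the question's list, renamed to avoid shadowing the built-in):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, 4, 3],
                   'b': ['ok', 1, np.nan, np.nan, 'fine']})
words = ['ok', 'good', 'fine']

# na=False turns the NaN comparisons into False, leaving a clean boolean mask
mask = df['b'].str.contains('|'.join(words), na=False) | df['b'].isnull()
print(df[mask].index.tolist())  # [0, 2, 3, 4]
```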

Dataframe filter Null values from list of Columns

So, I've got a df like so,
ID,A,B,C,D,E,F,G
1,123,30,3G,1,123,30,3G
2,456,40,4G,NaN,NaN,NaN,4G
3,789,35,5G,NaN,NaN,NaN,NaN
I also have a list that has a subset of the header list of df like so,
header_list = ["D","E","F","G"]
Now I'd like to get those records from df that CONTAINS Null values FOR ALL OF the Column Names in the header_list.
Expected Output:
ID,A,B,C,D,E,F,G
3,789,35,5G,NaN,NaN,NaN,NaN
I tried
new_df = df[df[header_list].isnull()]
but this throws an error: ValueError: Boolean array expected for the condition, not float64
I know I can do something like this,
new_df = df[(df['D'].isnull()) & (df['E'].isnull()) & (df['F'].isnull()) & (df['G'].isnull())]
But I don't want to hard code it like this. So is there a better way of doing this?
You can filter this with:
df[df[header_list].isnull().all(axis=1)]
We thus select the rows where all the values in the header_list columns are null.
For the given sample input, this gives the expected output:
>>> df[df[header_list].isnull().all(axis=1)]
A B C D E F G
3 789 35 5G NaN NaN NaN NaN
The .all(axis=1) will return True for a row only if all columns in that row are True, and False otherwise. So for the given sample input, we get:
>>> df[header_list]
D E F G
1 1.0 123.0 30.0 3G
2 NaN NaN NaN 4G
3 NaN NaN NaN NaN
>>> df[header_list].isnull()
D E F G
1 False False False False
2 True True True False
3 True True True True
>>> df[header_list].isnull().all(axis=1)
1 False
2 False
3 True
dtype: bool
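Rebuilding the question's frame to check this end to end (a sketch; the column values are taken from the sample CSV, with ID kept as a regular column rather than the index):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'A': [123, 456, 789],
                   'B': [30, 40, 35],
                   'C': ['3G', '4G', '5G'],
                   'D': [1, np.nan, np.nan],
                   'E': [123, np.nan, np.nan],
                   'F': [30, np.nan, np.nan],
                   'G': ['3G', '4G', np.nan]})
header_list = ['D', 'E', 'F', 'G']

# keep only rows where every header_list column is null
new_df = df[df[header_list].isnull().all(axis=1)]
print(new_df['ID'].tolist())  # [3]
```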

How to do a Python DataFrame Boolean Mask on nan values [duplicate]

Given a pandas dataframe containing possible NaN values scattered here and there:
Question: How do I determine which columns contain NaN values? In particular, can I get a list of the column names containing NaNs?
UPDATE: using Pandas 0.22.0
Newer Pandas versions have new methods 'DataFrame.isna()' and 'DataFrame.notna()'
In [71]: df
Out[71]:
a b c
0 NaN 7.0 0
1 0.0 NaN 4
2 2.0 NaN 4
3 1.0 7.0 0
4 1.0 3.0 9
5 7.0 4.0 9
6 2.0 6.0 9
7 9.0 6.0 4
8 3.0 0.0 9
9 9.0 0.0 1
In [72]: df.isna().any()
Out[72]:
a True
b True
c False
dtype: bool
as list of columns:
In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']
to select those columns (containing at least one NaN value):
In [73]: df.loc[:, df.isna().any()]
Out[73]:
a b
0 NaN 7.0
1 0.0 NaN
2 2.0 NaN
3 1.0 7.0
4 1.0 3.0
5 7.0 4.0
6 2.0 6.0
7 9.0 6.0
8 3.0 0.0
9 9.0 0.0
OLD answer:
Try to use isnull():
In [97]: df
Out[97]:
a b c
0 NaN 7.0 0
1 0.0 NaN 4
2 2.0 NaN 4
3 1.0 7.0 0
4 1.0 3.0 9
5 7.0 4.0 9
6 2.0 6.0 9
7 9.0 6.0 4
8 3.0 0.0 9
9 9.0 0.0 1
In [98]: pd.isnull(df).sum() > 0
Out[98]:
a True
b True
c False
dtype: bool
or, as @root proposed, a clearer version:
In [5]: df.isnull().any()
Out[5]:
a True
b True
c False
dtype: bool
In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']
to select a subset - all columns containing at least one NaN value:
In [31]: df.loc[:, df.isnull().any()]
Out[31]:
a b
0 NaN 7.0
1 0.0 NaN
2 2.0 NaN
3 1.0 7.0
4 1.0 3.0
5 7.0 4.0
6 2.0 6.0
7 9.0 6.0
8 3.0 0.0
9 9.0 0.0
You can use df.isnull().sum(). It shows all columns and the total number of NaNs in each.
I had a problem where I had too many columns to visually inspect on the screen, so a short list comprehension that filters and returns the offending columns is
nan_cols = [i for i in df.columns if df[i].isnull().any()]
if that's helpful to anyone
Adding to that: if you want to filter out columns having more NaN values than a threshold, say 85%, then use
nan_cols85 = [i for i in df.columns if df[i].isnull().sum() > 0.85 * len(df)]
This worked for me:
1. To get the columns having at least one null value (column names):
data.columns[data.isnull().any()]
2. To get those columns together with their null counts:
data[data.columns[data.isnull().any()]].isnull().sum()
[Optional]
3. To get the percentage of the null count:
data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]
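Run on a tiny made-up frame (data is just a hypothetical name here), the three steps look like:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, np.nan, np.nan, 4], 'b': [1, 2, 3, 4]})

cols_with_nulls = data.columns[data.isnull().any()]
print(cols_with_nulls.tolist())                        # ['a']
print(data[cols_with_nulls].isnull().sum().to_dict())  # {'a': 2}
pct = data[cols_with_nulls].isnull().sum() * 100 / data.shape[0]
print(pct.to_dict())                                   # {'a': 50.0}
```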
In datasets with a large number of columns it's even better to see how many columns contain null values and how many don't.
print("No. of columns containing null values")
print(len(df.columns[df.isna().any()]))
print("No. of columns not containing null values")
print(len(df.columns[df.notna().all()]))
print("Total no. of columns in the dataframe")
print(len(df.columns))
For example, my dataframe contained 82 columns, of which 19 contained at least one null value.
Further, you can also automatically remove columns and rows depending on which has more null values.
Here is the code which does this:
df = df.drop(df.columns[df.isna().sum() > len(df.columns)], axis=1)
df = df.dropna(axis=0).reset_index(drop=True)
Note: the above code removes all of your null values. If you want to keep null values, process them beforehand.
df.columns[df.isnull().any()].tolist()
This returns the names of the columns that contain null values.
I know this is a very well-answered question but I wanted to add a slight adjustment. This answer only returns columns containing nulls, and also still shows the count of the nulls.
As a 1-liner:
pd.isnull(df).sum()[pd.isnull(df).sum() > 0]
Description
Count nulls in each column
null_count_ser = pd.isnull(df).sum()
True|False series describing if that column had nulls
is_null_ser = null_count_ser > 0
Use the T|F series to filter out the columns without nulls
null_count_ser[is_null_ser]
Example Output
name 5
phone 187
age 644
I use these three lines of code to print out the column names which contain at least one null value:
for column in dataframe:
    if dataframe[column].isnull().any():
        print('{0} has {1} null values'.format(column, dataframe[column].isnull().sum()))
This is one of the methods:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan], 'c':[np.nan,2,np.nan], 'd':[np.nan,np.nan,np.nan]})
print(pd.isnull(df).sum())
Both of these should work:
df.isnull().sum()
df.isna().sum()
The DataFrame methods isna() and isnull() are completely identical.
Note: empty strings '' are considered False (not counted as NA).
df.isna() returns True for NaN and False for the rest. So, doing
df.isna().any()
will return True for any column containing a NaN, and False for the rest.
To see just the columns containing NaNs and just the rows containing NaNs:
isnulldf = df.isnull()
columns_containing_nulls = isnulldf.columns[isnulldf.any()]
rows_containing_nulls = df[isnulldf[columns_containing_nulls].any(axis='columns')].index
only_nulls_df = df[columns_containing_nulls].loc[rows_containing_nulls]
print(only_nulls_df)
features_with_na = [feature for feature in dataframe.columns if dataframe[feature].isnull().sum() > 0]
for feature in features_with_na:
    print(feature, np.round(dataframe[feature].isnull().mean(), 4), '% missing values')
print(features_with_na)
This gives the share of missing values for each column in the dataframe (note that .mean() yields a fraction; multiply by 100 for a true percentage).
This code works if you want to find the columns containing NaN values and get a list of the column names:
na_names = df.isnull().any()
list(na_names.where(na_names == True).dropna().index)
If you want to find columns whose values are all NaNs, you can replace any with all.
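A small sketch of that any/all swap on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, np.nan], 'c': [1, 2]})

some_nan_cols = df.columns[df.isnull().any()].tolist()  # at least one NaN
all_nan_cols = df.columns[df.isnull().all()].tolist()   # nothing but NaN
print(some_nan_cols)  # ['a', 'b']
print(all_nan_cols)   # ['b']
```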

Using fillna() selectively in pandas

I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive nans within a column, I want them to be filled by the preceding non-nan value, but only if the length of the nan sequence is below a specified threshold. For example, if the threshold is 3 then a within-column sequence of 3 or fewer nans will be filled with the preceding non-nan value, whereas a sequence of 4 or more nans will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options. But these are unfortunately not sufficient to achieve the task. I tried to specify method='ffill' and limit=3, but that fills in the first 3 nans of any sequence, not selectively as described above.
I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?
Working with contiguous groups is still a little awkward in pandas.. or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern:
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    filled = df.ffill(limit=3)
    unfilled = nulls & (~filled.notnull())
    nf = nulls.replace({False: 2.0, True: np.nan})
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
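The compare-cumsum-groupby pattern above, assembled into one self-contained sketch against the question's frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 5, 4],
                   [np.nan, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [5, np.nan, np.nan],
                   [9, 3, np.nan],
                   [7, 9, 1]], dtype=float)

nulls = df.isnull()
# a new group starts wherever the null/non-null state changes within a column
groups = (nulls != nulls.shift()).cumsum()
# mark positions whose contiguous run is 3 long or shorter
to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
result = df.where(~to_fill, df.ffill())
print(result)  # the 4-long nan run in the last column stays unfilled
```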
