Count all NaNs in a pandas DataFrame - python

I'm trying to count NaN element (data type class 'numpy.float64')in pandas series to know how many are there
which data type is class 'pandas.core.series.Series'
This is for count null value in pandas series
import pandas as pd
oc=pd.read_csv(csv_file)
oc.count("NaN")
my expected output of oc,count("NaN") to be 7 but it show 'Level NaN must be same as name (None)'

The argument to count isn't what you want counted (it's actually the axis name or index).
You're looking for df.isna().values.sum() (to count NaNs across the entire DataFrame), or len(df) - df['column'].count() (to count NaNs in a specific column).

You can use either of the following if your Series.dtype is float64:
oc.isin([np.nan]).sum()
oc.isna().sum()
If your Series is of mixed data-type you can use the following:
oc.isin([np.nan, 'NaN']).sum()

oc.size : returns total element counts of dataframe including NaN
oc.count().sum(): return total element counts of dataframe excluding NaN
Therefore, another way to count number of NaN in dataframe is doing subtraction on them:
NaN_count = oc.size - oc.count().sum()

Just for fun, you can do either
df.isnull().sum().sum()
or
len(df)*len(df.columns) - len(df.stack())

If your dataframe looks like this ;
aa = pd.DataFrame(np.array([[1,2,np.nan],[3,np.nan,5],[8,7,6],
[np.nan,np.nan,0]]), columns=['a','b','c'])
a b c
0 1.0 2.0 NaN
1 3.0 NaN 5.0
2 8.0 7.0 6.0
3 NaN NaN 0.0
To count 'nan' by cols, you can try this
aa.isnull().sum()
a 1
b 2
c 1
For total count of nan
aa.isnull().values.sum()
4

Related

interpolate(method="nearest") in a groupby in pandas

I have a dataset that I want to groupby("CustomerID") and fill NaNs with the nearest number within the group.
I can fill by nearest number irregardless of group like this:
df['num'] = df['num'].interpolate(method="nearest")
When I tried:
df['num'] = df.groupby('CustomerID')['num'].transform(lambda x: x.interpolate(method="nearest"))
I got ValueError: x and y arrays must have at least 2 entries, which I assume is because
some customers only have one entry with NaN or only NaNs.
However, when I extracted a select few rows that should have worked and made a new dataframe, nothing happened.
Is there a way I can group by customerID and fill NaNs with nearest number within the group, and skip customers with only NaNs or just one observation?
I ran into the same "ValueError: x and y arrays must have at least 2 entries" in my code. Adapted to your code (which I obviously could not reproduce) here is how I solved the problem:
import pandas as pd
import numpy as np
df.loc[:,'num'] = df.groupby('CustomerID')['num'].apply(lambda group: group.interpolate(method='nearest') if np.count_nonzero(np.isnan(group)) < (len(group) - 1) else group)
df.loc[:,'num'] = df.groupby('CustomerID').apply(lambda group: group.interpolate(method='linear', limit_area='outside', limit_direction='both'))
It does the following:
The first "groupby + apply" interpolates each group with the method 'nearest' ONLY if the group has at least two non NaNs values.
np.isnan(group) returns an array containing True where group has NaNs and False where it has values.
np.count_nonzero(np.isnan(group)) returns the number of True in the previous array (i.e. the number of NaNs in the group).
If the number of NaNs is strictly smaller than the length of the group minus 1 (i.e. there are at least two non NaNs in the group), the group is interpolated using 'nearest', otherwise it is left untouched.
The second "groupby + apply" finishes to interpolate each group, using method='linear' and argument limit_direction='both'.
If a group was fully interpolated in the previous step: nothing
happens.
If a group had only one non NaN value (therefore was left
untouched in the previous step): The non NaN value will be used to
fill the entire group.
If a group had only NaNs (therefore was left untouched in the previous step): the group remains full of NaNs.
Here's a dummy example using your notations:
df=pd.DataFrame({'CustomerID':['a']*3+['b']*3+['c']*3,'num':[1,np.nan,2,np.nan,1,np.nan,np.nan,np.nan,np.nan]})
df
CustomerID num
0 a 1.0
1 a NaN
2 a 2.0
3 b NaN
4 b 1.0
5 b NaN
6 c NaN
7 c NaN
8 c NaN
df.loc[:,'num'] = df.groupby('CustomerID')['num'].apply(lambda group: group.interpolate(method='nearest') if np.count_nonzero(np.isnan(group)) < (len(group) - 1) else group)
df
CustomerID num
0 a 1.0
1 a 1.0
2 a 2.0
3 b NaN
4 b 1.0
5 b NaN
6 c NaN
7 c NaN
8 c NaN
df.loc[:,'num'] = df.groupby('CustomerID').apply(lambda group: group.interpolate(method='linear', limit_area='outside', limit_direction='both'))
df
CustomerID num
0 a 1.0
1 a 1.0
2 a 2.0
3 b 1.0
4 b 1.0
5 b 1.0
6 c NaN
7 c NaN
8 c NaN
EDIT: important note
The interpolate method 'nearest' uses the numerical values of the index (see documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html). It works well in my dummy example above because the index is clean. If the index of your dataframe is messy (e.g. after concatenating dataframes) you may want to do df.reset_index(inplace=True) before you interpolate.

What happens when I pass a boolean dataframe to the indexing operator for another dataframe in pandas?

There's something fundamental about manipulating pandas dataframes which I am not getting.
TL,DR: passing a boolean series to the indexing operator [] of a pandas dataframe returns the rows or columns of that df where the series was True. But passing a boolean dataframe (ie: multidimensional) returns a weird dataframe consisting only of NaN values.
Edit: to rephrase: why is it possible to pass a dataframe of boolean values to another dataframe, and what does it do? With a series, this makes sense, but with a dataframe, I don't understand what's happening 'under the hood', and why in my example I get a dataframe of null NaN values.
In detail with examples:
When I pass a pandas boolean Series to the indexing operator, it returns a list of rows corresponding to indices where the Series is True:
test_list = [[1,2,3,4],[3,4,5],[4,5]]
test_df = pd.DataFrame(test_list)
test_df
0 1 2 3
0 1 2 3.0 4.0
1 3 4 5.0 NaN
2 4 5 NaN NaN
test_df[test_df[2].isnull()]
0 1 2 3
2 4 5 NaN NaN
So far, so good. But what happens when I do this:
test_df[test_df.isnull()]
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
Why does this return a dataframe consisting of only NaN values? I would expect it to either return an error, or perhaps to return a new dataframe truncated using the boolean mask dataframe. But I find this output completely confusing.
Edit: As an outcome I would expect to get an error. I don't understand why it's possible to pass a dataframe under these circumstances, or why it returns this dataframe of NaN values
test_df[..] calls an indexing method __getitem__(). From the source code:
def __getitem__(self, key):
...
# Do we have a (boolean) DataFrame?
if isinstance(key, DataFrame):
return self.where(key)
# Do we have a (boolean) 1d indexer?
if com.is_bool_indexer(key):
return self._getitem_bool_array(key)
As you can see, if the key is a boolean DataFrame, it will call pandas.DataFrame.where(). The function of where() is to replace values where the condition is False with NaN by default.
# print(test_df.isnull())
0 1 2 3
0 False False False False
1 False False False True
2 False False True True
# print(test_df)
0 1 2 3
0 1 2 3.0 4.0
1 3 4 5.0 NaN
2 4 5 NaN NaN
test_df.where(test_df.isnull()) replaces not null values with NaN.
I believe all values are transformed to NaN because you passed the entire df. The error 'message', precisely, is that all the returned values are NaN (including those that were not NaN), which allows us to see that something wrong happened. But surely a more experienced user will be able to answer you in more detail. Also note most of the time you want to remove or transform these NaN--not just flag them.
Following my comment above and LoukasPap's answer, here is a way to flag, count, and then remove or transform these NaN values:
First flag NaN values:
test_df.isnull()
You might also be interested to count your NaN values:
test_df.isnull().sum() # sum NaN by column
test_df.isnull().sum().sum() # get grand total of NaN
You can now drop NaN values by row
test_df.dropna()
Or by column:
test_df.dropna(axis=1)
Or replace NaN values by median:
test_df.fillna(test_df.median())

combine rows with identical index

How do I combine values from two rows with identical index and has no intersection in values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
Please stack(), drops all nans and unstack()
df.stack().unstack()
If possible simplify solution for first non missing values per index labels use GroupBy.first:
df1 = df.groupby(level=0).first()
If possible same output from sample data is use sum per labels use sum:
df1 = df.sum(level=0)
If there is multiple non missing values per groups is necessary specify expected output, obviously is is more complicated.

Replace values in dataframe with values from series of different length

I would like to replace values in a column of a dataframe with values from a series. The catch is that I only want to replace values that are designated by a mask and the series does not have the same length as the dataframe.
More specifically, I want to replace all the values that are not null with values from a series that contains one value for each non-null value in the dataframe.
Assume the column in the dataframe contains [1,2,3,NaN,5] and the series contains [2,4,6,10]
I naively thought that this might work
df[pd.notna(df)] = s
But it will make the column look like [1,2,3,NaN,NaN]. I understand why it behaves this way, but I need to find something that will give me this: [2,4,6,NaN,10]
The approach you tried is possible, but with some changes:
Update some individual column, not the whole DataFrame.
To "escape" from different index values, take values from
the "updating" Series.
To show how to do it, let's define the DataFrame (df) as:
A B
0 1.0 11.0
1 2.0 12.0
2 3.0 13.0
3 NaN NaN
4 5.0 15.0
and the "updating" Series (upd) as:
11 2
12 4
13 6
14 10
dtype: int64
As you can see, indices in df and upd are different.
To update e.g. A column in df the way you want, run:
df.A[df.A.notna()] = upd.values
The result is:
A B
0 2.0 11.0
1 4.0 12.0
2 6.0 13.0
3 NaN NaN
4 10.0 15.0

Reassign NaN based on column name in Pandas

I have a data frame with several NaN values like so:
first second
0 1.0 3.0
1 2.0 NaN
2 NaN 5.0
and another with lookup values:
fill
second 200
first 100
Is there a way to replace the NaN values with the fill values based on the column name to get this?:
first second
0 1.0 3.0
1 2.0 200
2 100 5.0
This is just an example, as I'm trying to do it on a much larger dataframe. I know that I can rearrange the fields in the dataframes so that the indices match up and I could use pd.where, but I'm wondering if there's a way to make the match just based on column name.
You can use pandas.DataFrame.fillna() for this

Categories