Using isin with NaN in dataframe - python

Let's say I have the following dataframe:
    t2   t5
0  NaN  2.0
1  2.0  NaN
2  3.0  1.0
Now I want to check whether each element of t2 is in t5, ignoring NaN.
Therefore, I run the following code:
df['t2'].isin(df['t5'])
Which gives:
0     True
1     True
2    False
However, since NaN != NaN, I expected:
0    False
1     True
2    False
How do I get what I expected? And why does this behave this way?

This isn't so much a bug as it is an inconsistency of behavior between similar libraries. Your columns have a dtype of float64, and Pandas and NumPy each have their own idea of whether nan is comparable to nan [1]. You can see this behavior with unique:
>>> np.unique([np.nan, np.nan])
array([nan, nan])
>>> pd.unique([np.nan, np.nan])
array([nan])
So clearly, pandas detects some sort of similarity with nan, which is the behavior you are seeing with isin.
Now for large Series, you won't see this behavior [2]. I think I read somewhere that the cutoff is around 1e6 elements, but don't take my word for it.
>>> u = pd.Series(np.full(100000000, np.nan, dtype=np.float64))
>>> u.isin(u).any()
False
[1] For large Series (> 1e6 elements), pandas uses numpy's definition of nan.
[2] As @root points out, this is dtype dependent.

It is because np.nan is indeed in [np.nan]: Python's in operator checks identity before equality, so x in lst behaves like any(x is b or x == b for b in lst), and np.nan is always the same object as itself. To get what you want, you can filter out the NaN in df['t2'] first:
df['t2'].notna() & df['t2'].isin(df['t5'])
gives:
0    False
1     True
2    False
Name: t2, dtype: bool
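For reference, a quick demonstration of the identity check behind Python's in operator (a minimal sketch):
>>> import numpy as np
>>> np.nan in [np.nan]                # same object, so the identity check succeeds
True
>>> float('nan') in [float('nan')]    # two distinct NaN objects: identity and equality both fail
False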


Python avoid dividing by zero in pandas dataframe

Apologies that this has been asked before, but I cannot get those solutions to work for me (I'm a native MATLAB user coming to Python).
I have a dataframe where I am taking the row-wise mean of the first 7 columns of one df and dividing it by another. However, there are many zeros in this dataset, and I want to replace the zero-division results with zeros (as that's meaningful to me) instead of the NaN my current implementation returns.
My code so far:
col_ind = list(range(0,7))
df.iloc[:,col_ind].mean(axis=1)/other.iloc[:,col_ind].mean(axis=1)
Here, if other = 0, it returns nan, but if df = 0 it returns 0. I have tried a lot of proposed solutions but none seem to register. For instance:
def foo(x, y):
    try:
        return x / y
    except ZeroDivisionError:
        return 0

foo(df.iloc[:, col_ind].mean(axis=1), other.iloc[:, col_ind].mean(axis=1))
However, this returns the same values as before, as if foo were never applied. I suspect this is because I am operating on Series rather than single values, but I'm not sure how to fix it. There are also actual NaNs in these dataframes as well. Any help appreciated.
You can use np.where to do this conditionally as a vectorised calc.
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.concatenate([np.random.randint(1, 10, (10, 7)), np.random.randint(0, 3, (10, 1))], axis=1),
                  columns=[f"col_{i}" for i in range(7)] + ["div"])
# emit 0 where "div" is 0, otherwise divide the row-wise mean of the "col_*" columns by "div"
np.where(df["div"].gt(0), df.loc[:, [c for c in df.columns if "col" in c]].mean(axis=1) / df["div"], 0)
It's not clear which version you're using and I don't know if the behavior is version-dependent, but in Python 3.8.5 / Pandas 1.2.4, a 0 / 0 in a dataframe/series will evaluate to NaN, while a non-zero / 0 will evaluate to inf. Neither will raise an error, so a try/except wouldn't have anything to catch.
>>> import pandas as pd
>>> import numpy as np
>>> x = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 0, 2]})
>>> x
   a  b
0  0  0
1  1  0
2  2  2
>>> x.a / x.b
0    NaN
1    inf
2    1.0
dtype: float64
You can replace NaN values in a pandas DataFrame or Series with the fillna() method, and you can replace inf using a standard replace():
>>> (x.a / x.b).replace(np.inf, np.nan)
0    NaN
1    NaN
2    1.0
dtype: float64
>>> (x.a / x.b).replace(np.inf, np.nan).fillna(0)
0    0.0
1    0.0
2    1.0
dtype: float64
(Note: A negative value divided by zero will evaluate to -inf, which would need to be replaced separately.)
You could replace NaN after the calculation using df.fillna(0).
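Putting the two approaches together on the question's original expression, one possible version (a sketch reusing the question's df, other, and col_ind):
import numpy as np

ratio = df.iloc[:, col_ind].mean(axis=1) / other.iloc[:, col_ind].mean(axis=1)
# map inf/-inf (non-zero / 0) to NaN, then zero out all NaNs;
# note this also zeroes any pre-existing NaNs, not just 0/0 results
ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)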

Filter a data frame containing NaN values, results in empty data frame as result [duplicate]

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
Numpy's isnan method throws errors with data types like string
Pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when a DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stack Overflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame are NaN".
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN: a == a will return False if a is NaN. This works even for strings.
Example:
In [52]:
s = pd.Series([1, np.NaN, '', 1.0])
s
Out[52]:
0      1
1    NaN
2       
3      1
dtype: object

for val in s:
    print(val == val)

True
False
True
True
This can be done in a vectorised manner:
In [54]:
s == s
Out[54]:
0     True
1    False
2     True
3     True
dtype: bool
but you can still use the method isnull on the whole series:
In [55]:
s.isnull()
Out[55]:
0    False
1     True
2    False
3    False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True, while pd.isnull(None) also returns True; so depending on whether you want to treat None as NaN, you can use == (which will not flag None) or pd.isnull (which will).
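A short illustration of the difference (a sketch):
>>> s2 = pd.Series([None, 1], dtype=object)
>>> s2 != s2          # the inequality trick misses None, since None == None
0    False
1    False
dtype: bool
>>> s2.isnull()       # isnull treats None as missing
0     True
1    False
dtype: bool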
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
              [None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False,  True],
       [ True, False]])

pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
       [False,  True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
       0      1
0  False   True
1   True  False

b.notna()
# same as
# b.notnull()
       0      1
0   True  False
1  False   True
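And since these functions also accept scalars, a quick sketch of single-value checks:
pd.isna(np.nan)     # True
pd.isna(None)       # True
pd.isna('2')        # False
pd.notna(np.nan)    # False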

Pandas: Selecting NaN values using np.nan

So I found out that the float NaN apparently doesn't equal itself. My question is how to deal with it. Let's start with a dataframe:
DF = pd.DataFrame({'X': [0, 3, None]})
DF
     X
0  0.0
1  3.0
2  NaN
DF['test1'] = np.where(DF['X'] == np.nan, 1, 0)
DF['test2'] = np.where(DF['X'].isin([np.nan]), 1, 0)
DF
     X  test1  test2
0  0.0      0      0
1  3.0      0      0
2  NaN      0      1
So test1 and test2 aren't the same. Many others have mentioned that we should use pd.isnull(). My question is, is it safe to just use isin()? For example, if I need to create a new column using np.where, can I simply do:
DF['test3'] = np.where(DF['X'].isin([0, np.nan]), 1, 0)
Or should I always use pd.isnull like so:
DF['test3'] = np.where((DF['X'] == 0) | (pd.isnull(DF['X'])), 1, 0)
You should always use pd.isnull or np.isnan if you suspect there could be nans.
For example suppose you have an object-dtype column (unfortunately these aren't uncommon):
     X
0    a
1    3
2  NaN
Then using isin won't give you correct results:
>>> df['X'].isin([np.nan])
0    False
1    False
2    False
Name: X, dtype: bool
While isnull still works correctly:
>>> df['X'].isnull()
0    False
1    False
2     True
Name: X, dtype: bool
Given that NaN support isn't explicitly mentioned in either Series.isin or DataFrame.isin, it might just be an implementation detail that it correctly "finds" NaNs, and implementation details are risky to rely on: they could change at any time.
Aside from this, it always pays off to be explicit. An explicit isnull or isnan check should be (in my opinion) preferred.
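For instance, a small check along those lines (a sketch), showing isnull giving the same answer on float and object dtypes:
>>> float_s = pd.Series([0.0, 3.0, np.nan])
>>> obj_s = pd.Series(['a', 3, np.nan], dtype=object)
>>> float_s.isnull().tolist()
[False, False, True]
>>> obj_s.isnull().tolist()
[False, False, True]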

NaN is not recognized in pandas after np.where clause. Why? Or is it a bug?

The last line of this code should show True:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: a = pd.Series([1, np.nan])
In [4]: b = pd.DataFrame(["a", "b"])
In [5]: b["1"] = np.where(
            a.isnull(),
            np.nan,
            "Hello"
        )
In [6]: b
Out[6]:
   0      1
0  a  Hello
1  b    nan
In [7]: b["1"].isnull()
Out[7]:
0    False
1    False
Name: 1, dtype: bool
You can see why if you look at the result of the where:
>>> np.where(a.isnull(), np.nan, "Hello")
array([u'Hello', u'nan'],
      dtype='<U32')
Because your other value is a string, where converts your NaN to a string as well and gives you a string-dtyped result. (The exact dtype you get may differ depending on your platform and/or Python version.) So you don't actually have a NaN in your result at all; you just have the string "nan".
If you want to do this type of mapping (in particular, mapping that changes dtypes) in pandas, it's usually better to use pandas constructs like .map and avoid dropping into numpy, because as you saw, numpy tends to do unhelpful things when it has to resolve conflicting types. Here's an example of how to do it all in pandas:
>>> b["X"] = a.isnull().map({True: np.nan, False: "Hello"})
>>> b
   0      X
0  a  Hello
1  b    NaN
>>> b.X.isnull()
0    False
1     True
Name: X, dtype: bool
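An alternative pandas-native sketch (assuming the same a and b, with a hypothetical column name "Y") uses Series.where, which fills non-matching positions with a real NaN instead of coercing to string:
>>> b["Y"] = pd.Series("Hello", index=a.index).where(a.notnull())
>>> b["Y"].isnull()
0    False
1     True
Name: Y, dtype: bool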

Why does testing `NaN == NaN` not work for dropping from a pandas dataFrame?

Please explain how NaNs are treated in pandas, because the following logic seems "broken" to me; I tried various ways (shown below) to drop the empty values.
My dataframe, which I load from a CSV file using read_csv, has a column comments, which is empty most of the time.
The column marked_results.comments looks like this; the rest of the column is NaN. Pandas loads the empty entries as NaNs, so far so good:
0      VP
1      VP
2      VP
3    TEST
4     NaN
5     NaN
...
Now I try to drop those entries, only this works:
marked_results.comments.isnull()
None of these work:
marked_results.comments.dropna() only gives the same column; nothing gets dropped, confusing.
marked_results.comments == NaN only gives a series of all Falses. Nothing was NaN... confusing.
likewise marked_results.comments == nan
I also tried:
comments_values = marked_results.comments.unique()
array(['VP', 'TEST', nan], dtype=object)
# Ah, gotcha! So now I've tried:
marked_results.comments == comments_values[2]
# but still all the results are Falses!!!
You should use isnull and notnull to test for NaN (these are more robust with pandas dtypes than the numpy equivalents); see "values considered missing" in the docs.
Using the Series method dropna on a column won't affect the original dataframe, but it does what you want:
In [11]: df
Out[11]:
  comments
0       VP
1       VP
2       VP
3     TEST
4      NaN
5      NaN

In [12]: df.comments.dropna()
Out[12]:
0      VP
1      VP
2      VP
3    TEST
Name: comments, dtype: object
The dropna DataFrame method has a subset argument (to drop rows which have NaNs in specific columns):
In [13]: df.dropna(subset=['comments'])
Out[13]:
  comments
0       VP
1       VP
2       VP
3     TEST

In [14]: df = df.dropna(subset=['comments'])
You need to test for NaN with the math.isnan() function (or numpy.isnan); NaNs cannot be checked with the equality operator.
>>> from math import isnan
>>> a = float('NaN')
>>> a
nan
>>> a == 'NaN'
False
>>> isnan(a)
True
>>> a == float('NaN')
False
Help on the function:
isnan(...)
    isnan(x) -> bool
    Check if float x is not a number (NaN).
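The numpy variant works the same way for scalars and also vectorises over arrays (a sketch):
>>> import numpy as np
>>> np.isnan(float('NaN'))
True
>>> np.isnan(np.array([1.0, float('NaN')]))
array([False,  True])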
