So I found out that the float NaN apparently doesn't equal itself. My question is how to deal with it. Let's start with a dataframe:
DF = pd.DataFrame({'X':[0, 3, None]})
DF
X
0 0.0
1 3.0
2 NaN
DF['test1'] = np.where(DF['X'] == np.nan, 1, 0)
DF['test2'] = np.where(DF['X'].isin([np.nan]), 1, 0)
DF
X test1 test2
0 0.0 0 0
1 3.0 0 0
2 NaN 0 1
So test1 and test2 aren't the same. Many others have mentioned that we should use pd.isnull(). My question is, is it safe to just use isin()? For example, if I need to create a new column using np.where, can I simply do:
DF['test3'] = np.where(DF['X'].isin([0, np.nan]), 1, 0)
Or should I always use pd.isnull like so:
DF['test3'] = np.where((DF['X'] == 0) | (pd.isnull(DF['X'])), 1, 0)
You should always use pd.isnull or np.isnan if you suspect there could be nans.
For example, suppose you have an object-dtype column (unfortunately these aren't uncommon):
X
0 a
1 3
2 NaN
Then using isin won't give you correct results:
>>> df['X'].isin([np.nan])
0 False
1 False
2 False
Name: X, dtype: bool
While isnull still works correctly:
>>> df['X'].isnull()
0 False
1 False
2 True
Name: X, dtype: bool
Given that NaN support isn't explicitly mentioned in either Series.isin or DataFrame.isin, it might just be an implementation detail that they correctly "find" NaNs. And implementation details are risky to rely on: they could change at any time...
Aside from this, it always pays off to be explicit. An explicit isnull or isnan check should be (in my opinion) preferred.
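For the question's example, a minimal sketch of the explicit approach (assuming the DF built above):
import numpy as np
import pandas as pd

DF = pd.DataFrame({'X': [0, 3, None]})

# Explicit null check: well defined for any dtype, unlike isin([np.nan]).
DF['test3'] = np.where((DF['X'] == 0) | DF['X'].isnull(), 1, 0)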
Apologies that this has been asked before, but I cannot get those solutions to work for me (I am a native MATLAB user coming to Python).
I have a dataframe where I am taking the row-wise mean of the first 7 columns of one df and dividing it by the row-wise mean of another. However, there are many zeros in this dataset, and I want to replace the zero-division results with zeros (as that's meaningful to me) instead of the NaN that is naturally returned.
My code so far:
col_ind = list(range(0,7))
df.iloc[:,col_ind].mean(axis=1)/other.iloc[:,col_ind].mean(axis=1)
Here, if other = 0, it returns nan, but if df = 0 it returns 0. I have tried a lot of proposed solutions, but none seem to have any effect. For instance:
def foo(x, y):
    try:
        return x / y
    except ZeroDivisionError:
        return 0

foo(df.iloc[:, col_ind].mean(axis=1), other.iloc[:, col_ind].mean(axis=1))
However, this returns the same values as if foo had not been used at all. I suspect this is because I am operating on Series rather than single values, but I'm not sure, nor do I know how to fix it. There are also actual nans in these dataframes. Any help appreciated.
You can use np.where to do this conditionally as a vectorised calc.
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.concatenate([np.random.randint(1, 10, (10, 7)), np.random.randint(0, 3, (10, 1))], axis=1),
                  columns=[f"col_{i}" for i in range(7)] + ["div"])
np.where(df["div"].gt(0), (df.loc[:,[c for c in df.columns if "col" in c]].mean(axis=1) / df["div"]), 0)
It's not clear which version you're using and I don't know if the behavior is version-dependent, but in Python 3.8.5 / Pandas 1.2.4, a 0 / 0 in a dataframe/series will evaluate to NaN, while a non-zero / 0 will evaluate to inf. Neither will raise an error, so a try/except wouldn't have anything to catch.
>>> import pandas as pd
>>> import numpy as np
>>> x = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 0, 2]})
>>> x
a b
0 0 0
1 1 0
2 2 2
>>> x.a / x.b
0 NaN
1 inf
2 1.0
dtype: float64
You can replace NaN values in a pandas DataFrame or Series with the fillna() method, and you can replace inf using a standard replace():
>>> (x.a / x.b).replace(np.inf, np.nan)
0 NaN
1 NaN
2 1.0
dtype: float64
>>> (x.a / x.b).replace(np.inf, np.nan).fillna(0)
0 0.0
1 0.0
2 1.0
dtype: float64
(Note: A negative value divided by zero will evaluate to -inf, which would need to be replaced separately.)
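Both signs can be handled in one call by passing a list to replace (a sketch, continuing the session above):
>>> (x.a / x.b).replace([np.inf, -np.inf], np.nan).fillna(0)
0    0.0
1    0.0
2    1.0
dtype: float64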
You could replace the NaN after the calculation using df.fillna(0).
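For example (a sketch; df, other and col_ind are the objects from the question):
result = (df.iloc[:, col_ind].mean(axis=1) / other.iloc[:, col_ind].mean(axis=1)).fillna(0)
Note that fillna(0) only catches the NaN from 0 / 0; a non-zero value divided by 0 yields inf, which needs a separate replace as shown above.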
I'm trying to create a column that would identify whether a url is present or not from an existing column called "links". I'd like all NaN values to become zeros and any urls to be denoted as 1, in the new column. I tried the following but was unable to get the correct values.
def url(x):
    if x == 'NaN':
        return 0
    else:
        return 1
df['url1'] = df['links'].apply(url)
df.head()
You can use pd.isnull(x) instead of the x == 'NaN' comparison; the missing values in the column are actual NaN floats, not the string 'NaN':
import pandas as pd
df['url1'] = df['links'].apply(lambda x: 0 if pd.isnull(x) else 1)
See my comment, but the simplest and most performant thing you can do to get your desired output is to use a pandas method:
input:
import numpy as np
import pandas as pd
df = pd.DataFrame({'links' : [np.nan, 'a', 'b', np.nan]})
Out[1]:
links
0 NaN
1 a
2 b
3 NaN
output:
df['url1'] = df['links'].notnull().astype(int)
df
Out[2]:
links url1
0 NaN 0
1 a 1
2 b 1
3 NaN 0
notnull() returns True or False, and .astype(int) changes True to 1 and False to 0, because True and False are boolean values with underlying values of 1 and 0, even though they display as True and False. So, when you change the data type to int, each boolean shows its underlying integer value of 1 or 0.
Related to my comment: 'True' would also not equal True, and 'False' would not equal False, just as 'NaN' does not equal NaN (notice the apostrophes versus no apostrophes: one is a string, the other the actual value).
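A quick demonstration of the string-versus-value distinction:
>>> import numpy as np
>>> 'NaN' == np.nan
False
>>> np.nan == np.nan
False
>>> np.isnan(np.nan)
True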
Let's say I have the following dataframe:
t2 t5
0 NaN 2.0
1 2.0 NaN
2 3.0 1.0
Now I want to check if elements in t2 is in t5, ignoring NaN.
Therefore, I run the following code:
df['t2'].isin(df['t5'])
Which gives:
0 True
1 True
2 False
However, since NaN!=NaN, I expected
0 False
1 True
2 False
How do I get what I expected? And why does this behave this way?
This isn't so much a bug as it is an inconsistency of behavior between similar libraries. Your columns have a dtype of float64, and Pandas and Numpy each have their own ideas of whether nan is comparable to nan[1]. You can see this behavior with unique:
>>> np.unique([np.nan, np.nan])
array([nan, nan])
>>> pd.unique([np.nan, np.nan])
array([nan])
So clearly, pandas detects some sort of similarity with nan, which is the behavior you are seeing with isin.
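The same split shows up when comparing numpy's isin with pandas' (a sketch):
>>> np.isin([np.nan], [np.nan])
array([False])
>>> pd.Series([np.nan]).isin([np.nan])
0    True
dtype: bool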
Now for large Series, you won't see this behavior[2]. I think I read somewhere that the cutoff is around 10e6, but don't take my word for it.
>>> u = pd.Series(np.full(100000000, np.nan, dtype=np.float64))
>>> u.isin(u).any()
False
[1] For large Series (> 10e6), pandas uses numpy's definition of nan
[2] As #root points out, this is dtype dependent.
It is because np.nan is indeed in [np.nan]: for a plain Python list, x in lst is equivalent to any(x is e or x == e for e in lst), and the identity check (is) matches NaN even though the equality check does not. To get what you want, you can filter out the NaN in df['t2'] first:
df['t2'].notna() & df['t2'].isin(df['t5'])
gives:
0 False
1 True
2 False
Name: t2, dtype: bool
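Alternatively, dropping the NaNs from the lookup values gives the same result (a sketch):
>>> df['t2'].isin(df['t5'].dropna())
0    False
1     True
2    False
Name: t2, dtype: bool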
I have come across the .any() method several times. I used it quite a few times to check if a particular string is contained in a dataframe. In that case it returns an array/dataframe (depending on how I wish to structure it) of Trues and Falses, depending on whether the string matches the value of each cell. I also found the .any(1) method, but I am not sure how or in which cases I should use it.
.any(1) is the same as .any(axis=1), which means look row-wise instead of per column.
With this sample dataframe:
x1 x2 x3
0 1 1 0
1 0 0 0
2 1 0 0
See the different outcomes:
import pandas as pd

# Construct the sample frame shown above (rather than reading it from a CSV).
df = pd.DataFrame({'x1': [1, 0, 1], 'x2': [1, 0, 0], 'x3': [0, 0, 0]})
print(df.any())
>>>
x1 True
x2 True
x3 False
dtype: bool
So .any() checks if any value in a column is True
print(df.any(1))
>>>
0 True
1 False
2 True
dtype: bool
So .any(1) checks if any value in a row is True
The documentation is self-explanatory; however, for the sake of the question:
These are the Series and DataFrame any() methods. any() checks whether any value in the caller object (DataFrame or Series) is truthy (non-zero) and returns True if so; if all values are 0, it returns False.
Note, however, that NaN is not treated as 0: by default any() skips NaN values (skipna=True).
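For example, a quick sketch of that NaN behavior:
>>> pd.Series([np.nan, 0.0]).any()
False
>>> pd.Series([np.nan, 0.0]).any(skipna=False)
True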
Example DataFrame:
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
A B C
0 1 0 0
1 2 2 0
>>> df.any()
A True
B True
C False
dtype: bool
>>> df.any(axis='columns')
0 True
1 True
dtype: bool
Calling df.any(axis=1) is equivalent to axis='columns':
>>> df.any(axis=1)
0 True
1 True
dtype: bool
any is True if at least one value is true.
any is False if all values are false.
Here is a nice blog post about any() and all() by Guido van Rossum.
I am trying to add a new column to a dataframe based on an if statement depending on the values of two columns, i.e. if column x == None then column y else column x.
Below is the script I have written, but it doesn't work. Any ideas?
dfCurrentReportResults['Retention'] = dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x)
Also I got this error message:
AttributeError: ("'Series' object has no attribute 'Retention_x'", u'occurred at index BUSINESSUNIT_NAME')
fyi: BUSINESSUNIT_NAME is the first column name
Additional Info:
My data printed out looks like this, and I want to add a 3rd column that takes a value if there is one, else keeps NaN.
Retention_x Retention_y
0 1 NaN
1 NaN 0.672183
2 NaN 1.035613
3 NaN 0.771469
4 NaN 0.916667
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
UPDATE:
In the end I was having issues referencing the null values in my dataframe. The final line of code I used, which also includes axis = 1, answered my question.
dfCurrentReportResults['RetentionLambda'] = dfCurrentReportResults.apply(lambda x : x['Retention_y'] if pd.isnull(x['Retention_x']) else x['Retention_x'], axis = 1)
Thanks #EdChum, #strim099 and #aus_lacy for all your input. As my data set gets larger I may switch to the np.where option if I notice performance issues.
Your lambda is operating on axis 0, which is column-wise. Simply add axis=1 to the apply argument list; this is clearly documented.
In [1]: import pandas
In [2]: dfCurrentReportResults = pandas.DataFrame([['a','b'],['c','d'],['e','f'],['g','h'],['i','j']], columns=['Retention_y', 'Retention_x'])
In [3]: dfCurrentReportResults.loc[1, 'Retention_x'] = None
In [4]: dfCurrentReportResults.loc[3, 'Retention_x'] = None
In [5]: dfCurrentReportResults
Out[5]:
Retention_y Retention_x
0 a b
1 c None
2 e f
3 g None
4 i j
In [6]: dfCurrentReportResults['Retention'] = dfCurrentReportResults.apply(lambda x : x.Retention_y if x.Retention_x == None else x.Retention_x, axis=1)
In [7]: dfCurrentReportResults
Out[7]:
Retention_y Retention_x Retention
0 a b b
1 c None c
2 e f f
3 g None g
4 i j j
Just use np.where:
dfCurrentReportResults['Retention'] = np.where(df.Retention_x == None, df.Retention_y, df.Retention_x)
np.where takes the test condition as its first parameter and returns the second argument (df.Retention_y) where the condition is True, otherwise the third (df.Retention_x).
Also, avoid using apply where possible, as it just loops over the values; np.where is a vectorised method and will scale much better.
UPDATE
OK, no need to use np.where; just use the following simpler syntax:
dfCurrentReportResults['Retention'] = df.Retention_y.where(df.Retention_x == None, df.Retention_x)
Further update (using isnull rather than == None, since element-wise comparisons against None/NaN evaluate to False):
dfCurrentReportResults['Retention'] = df.Retention_y.where(df.Retention_x.isnull(), df.Retention_x)
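For this exact pattern (keep Retention_x unless it is null, otherwise fall back to Retention_y), combine_first is another option (a sketch):
dfCurrentReportResults['Retention'] = df.Retention_x.combine_first(df.Retention_y)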