Pandas .str.isnumeric() for floats? - python

I'm trying to filter a column in pandas to keep floats and NaNs. I've found that .str.isnumeric() doesn't treat strings representing non-integer numbers (decimals, scientific notation) as numeric. This surprised me, since pd.to_numeric() does what I'm looking for (it keeps floats and NaNs). So in this example:
import numpy as np
import pandas as pd

test_num = pd.DataFrame(data={'col1': ['6', '5.5', '4E-05', np.nan, 'bear']})
test_num['col1'].str.isnumeric()
I'd expect the output to be
0 True
1 True
2 True
3 NaN
4 False
But instead it's
0 True
1 False
2 False
3 NaN
4 False
Has anyone else created this kind of basic numeric Series filter?

This is the expected behaviour as str.isnumeric() checks "whether all characters in each string are numeric" (docs).
But the question is interesting, because np.nan is a float, so if you wish to remove bear but keep the nan, pd.to_numeric(test_num['col1'], errors='coerce') will not work, as it converts bear to a float (nan).
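For example, a quick sketch of the pitfall (reusing the test_num frame from the question) - after coercion, 'bear' and the original NaN can no longer be told apart:
pd.to_numeric(test_num['col1'], errors='coerce').isna()
0    False
1    False
2    False
3     True   # the original NaN
4     True   # 'bear', coerced to NaN
Name: col1, dtype: bool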
So you could first filter out all values that cannot be converted to float prior to running pd.to_numeric on the Series:
def check_numeric(x):
    try:
        float(x)          # float(np.nan) succeeds, so NaN rows are kept
        return True
    except (ValueError, TypeError):
        return False

test_num = test_num[test_num['col1'].apply(check_numeric)]
test_num['col1'] = pd.to_numeric(test_num['col1'])
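If you instead want the True/NaN/False mask from the question (run on the original test_num, before filtering), a sketch combining the same helper with mask/isna:
mask = test_num['col1'].apply(check_numeric).mask(test_num['col1'].isna())
mask
0     True
1     True
2     True
3      NaN
4    False
Name: col1, dtype: object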

This should work for you.
pd.Series(np.where(test_num['col1'].isna(), np.nan,
                   pd.to_numeric(test_num['col1'], errors='coerce').notnull()
                   )).replace({1: True, 0: False})
0 True
1 True
2 True
3 NaN
4 False


Using isin with NaN in dataframe

Let's say I have the following dataframe:
t2 t5
0 NaN 2.0
1 2.0 NaN
2 3.0 1.0
Now I want to check whether each element in t2 is in t5, ignoring NaN.
Therefore, I run the following code:
df['t2'].isin(df['t5'])
Which gives:
0 True
1 True
2 False
However, since NaN!=NaN, I expected
0 False
1 True
2 False
How do I get what I expected? And why does this behave this way?
This isn't so much a bug as it is an inconsistency of behavior between similar libraries. Your columns have a dtype of float64, and both Pandas and Numpy have their own ideas of whether or not nan is comparable to nan[1]. You can see this behavior with unique
>>> np.unique([np.nan, np.nan])
array([nan, nan])
>>> pd.unique([np.nan, np.nan])
array([nan])
So clearly, pandas detects some sort of similarity with nan, which is the behavior you are seeing with isin.
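You can reproduce this directly with isin on a minimal example:
>>> pd.Series([np.nan]).isin([np.nan])
0    True
dtype: bool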
Now for large Series, you won't see this behavior[2]. I think I read somewhere that the cutoff is around 10e6, but don't take my word for it.
>>> u = pd.Series(np.full(100000000, np.nan, dtype=np.float64))
>>> u.isin(u).any()
False
[1] For large Series (> 10e6), pandas uses numpy's definition of nan
[2] As @root points out, this is dtype dependent.
It is because np.nan is indeed in [np.nan]; roughly speaking, in behaves like np.any([a is b for b in lst]), and np.nan is np.nan evaluates to True. To get what you want, you can filter out the NaN in df['t2'] first:
df['t2'].notna() & df['t2'].isin(df['t5'])
gives:
0 False
1 True
2 False
Name: t2, dtype: bool

Filter a data frame containing NaN values, results in empty data frame as result [duplicate]

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
NumPy's isnan function throws an error for non-numeric data types such as strings.
The pandas docs only cover methods to drop rows containing NaNs, or ways to check whether a DataFrame contains NaNs; I'm asking about checking whether a specific value is NaN.
Relevant Stack Overflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame are NaN".
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN
so a == a will return False if a is NaN
This will work even for strings
Example:
In[52]:
s = pd.Series([1, np.NaN, '', 1.0])
s
Out[52]:
0 1
1 NaN
2
3 1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True (so the a == a trick does not flag None as missing), while pd.isnull(None) also returns True. So depending on whether you want to treat None as NaN, you can keep using == for the comparison, or use pd.isnull if None should count as missing.
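A quick comparison of the two behaviours:
None == None        # True  -> the equality trick does not flag None
np.nan == np.nan    # False -> but it does flag NaN
pd.isnull(None)     # True  -> isnull flags both None and NaN
pd.isnull(np.nan)   # True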
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
a = np.array([[1, np.nan],
              [None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False,  True],
       [ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
       [False,  True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
0 1
0 False True
1 True False
b.notna()
# same as
# b.notnull()
0 1
0 True False
1 False True
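For the original question about a single scalar value, the same functions work directly, for example:
pd.isna(np.nan)    # True
pd.isna(None)      # True
pd.isna('')        # False - an empty string is not missing
pd.isna(1.0)       # False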

Replace numbers by `nan` in pandas data frame

I have an issue with a column on a pandas data frame. Due to data input errors, I have a column that should contain only true and false, but it also contains around 71 decimal values.
I am trying to get rid of the decimals and turn them into nan so I can ignore those rows for further analysis.
When I try:
datafinal['any_misread'] = datafinal['any_misread'].where(datafinal['any_misread'] < 1, np.nan)
I get the error:
TypeError: unorderable types: str() < int()
I have also tried logics with .replace and with no success.
What am I missing here?
Let's try using where and astype:
df = pd.DataFrame({'col1':[True, False, 0.12, True, False, .3]})
df.where((df.col1.astype(str) == 'True') | (df.col1.astype(str) == 'False'))
Output:
col1
0 True
1 False
2 NaN
3 True
4 False
5 NaN
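To write the cleaned values back to the column from the question (a sketch; datafinal and any_misread are just the names used in the question):
datafinal['any_misread'] = datafinal['any_misread'].where(
    datafinal['any_misread'].astype(str).isin(['True', 'False'])
)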
You can check if the type of each item in the column is not a bool and change the value.
df = pd.DataFrame([[True], [True], [False], [10.2], [1.0], [False], [0]], columns=['misread'])
# use .loc to avoid chained assignment; np.nan replaces the deprecated pd.np.nan
df.loc[df.misread.apply(lambda x: not isinstance(x, bool)), 'misread'] = np.nan
df
# returns
misread
0 True
1 True
2 False
3 NaN
4 NaN
5 False
6 NaN

Python Dataframe get null value counts

I am trying to find the null values in a DataFrame. Though I reviewed the following Stack Overflow post, which describes how to count null values, I am having a hard time doing the same for my dataset.
How to count the Nan values in the column in Panda Data frame
Working code:
import pandas as pd
a = ['america','britain','brazil','','china','jamaica'] #I deliberately introduce a NULL value
a = pd.DataFrame(a)
a.isnull()
#Output:
       0
0  False
1  False
2  False
3  False
4  False
5  False
a.isnull().sum()
#Output
#0 0
#dtype: int64
What am I doing wrong?
If you want '', None and NaN to all count as null, you can use the applymap method to test every value in the dataframe (flagging values that are falsy or genuinely null) and then call .sum():
import pandas as pd
import numpy as np
a = ['america','britain','brazil',None,'', np.nan, 'china','jamaica'] #I deliberately introduce a NULL value
a = pd.DataFrame(a)
a.applymap(lambda x: not x or pd.isnull(x)).sum()
# 0 3
# dtype: int64
I hope this helps.
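Note that on recent pandas versions (2.1+), applymap is deprecated in favour of DataFrame.map; assuming such a version, the same idea would be:
a.map(lambda x: not x or pd.isnull(x)).sum()
# 0    3
# dtype: int64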
The '' in your list isn't a null value, it's an empty string. To get a null, use None instead. The pandas.isnull() documentation describes missing values as "NaN in numeric arrays, [or] None/NaN in object arrays".
import pandas as pd
a = ['america','britain','brazil',None,'china','jamaica']
a = pd.DataFrame(a)
a.isnull()
0
0 False
1 False
2 False
3 True
4 False
5 False
You can see the difference by printing the two dataframes. In the first case, with the empty string, the dataframe looks like:
pd.DataFrame(['america','britain','brazil','','china','jamaica'])
0
0 america
1 britain
2 brazil
3
4 china
5 jamaica
Notice that the value at index 3 is an empty string.
In the second case, you get:
pd.DataFrame(['america','britain','brazil',None,'china','jamaica'])
0
0 america
1 britain
2 brazil
3 None
4 china
5 jamaica
The other posts addressed that '' is not a null value and therefore isn't counted as such with the isnull method...
...However, '' does evaluate to False when interpreted as a bool.
a.astype(bool)
0
0 True
1 True
2 True
3 False
4 True
5 True
This might be useful if you have '' in your dataframe and want to process it this way.
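If you want to count values that are either null or an empty string in one pass, you could combine this with isnull (a sketch, using the frame from the question, where only index 3 qualifies):
(a.isnull() | (a == '')).sum()
# 0    1
# dtype: int64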

pandas dataframe where clause with dot versus brackets column selection

I have a regular DataFrame with a string type (object) column. When I try to filter on the column using the equivalent of a WHERE clause, I get a KeyError when I use the dot notation. When in bracket notation, all is well.
I am referring to these instructions:
df[df.colA == 'blah']
df[df['colA'] == 'blah']
The first gives the equivalent of
KeyError: False
Not posting an example as I cannot reproduce the issue on a bespoke DataFrame built for the purpose of illustration: when I do, both notations yield the same result.
So I'm asking whether there is a difference between the two, and why.
Dot notation is just a convenient shortcut for accessing a column, versus the standard brackets. Notably, it doesn't work when the column name is something like sum that is already a DataFrame attribute or method. My bet is that the column name in your real example is running into that issue, so it works fine with the bracket selection but is otherwise testing whether a bound method is equal to 'blah'.
Quick example below:
In [67]: df = pd.DataFrame(np.arange(10).reshape(5,2), columns=["number", "sum"])
In [68]: df
Out[68]:
number sum
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
In [69]: df.number == 0
Out[69]:
0 True
1 False
2 False
3 False
4 False
Name: number, dtype: bool
In [70]: df.sum == 0
Out[70]: False
In [71]: df['sum'] == 0
Out[71]:
0 False
1 False
2 False
3 False
4 False
Name: sum, dtype: bool
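Attribute access has other limits besides method-name clashes: it can't be used for column names that aren't valid Python identifiers (e.g. names containing spaces), and it can't create a new column. A quick sketch of the latter (the column name new_col is just an illustration):
In [72]: df.new_col = 0        # sets a plain attribute (pandas emits a UserWarning), not a column
In [73]: 'new_col' in df.columns
Out[73]: False
In [74]: df['new_col'] = 0     # bracket assignment actually adds the column
In [75]: 'new_col' in df.columns
Out[75]: True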
