Replace numbers with `nan` in a pandas data frame - python

I have an issue with a column on a pandas data frame. Due to data input errors, I have a column that should contain only True and False, but it also contains around 71 decimal values.
I am trying to get rid of the decimals by turning them into NaN so I can ignore those rows in further analysis.
When I try:
datafinal['any_misread'] = datafinal['any_misread'].where(datafinal['any_misread'] < 1, np.nan)
I get the error:
TypeError: unorderable types: str() < int()
I have also tried similar logic with .replace, with no success.
What am I missing here?
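For context, the error message suggests the True/False entries are stored as strings in an object-dtype column, so the < 1 comparison mixes str and int. A minimal sketch reproducing the problem (the exact wording of the error varies by Python version):
import pandas as pd

s = pd.Series(['True', 'False', 0.12], dtype=object)  # hypothetical mixed column
s < 1  # raises TypeError: '<' not supported between instances of 'str' and 'int'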

Let's try using where and astype:
df = pd.DataFrame({'col1':[True, False, 0.12, True, False, .3]})
df.where((df.col1.astype(str) == 'True') | (df.col1.astype(str) == 'False'))
Output:
col1
0 True
1 False
2 NaN
3 True
4 False
5 NaN
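An alternative sketch that masks on the Python type of each element rather than on its string form:
import pandas as pd

df = pd.DataFrame({'col1': [True, False, 0.12, True, False, .3]})
# Series.where keeps values where the condition holds and inserts NaN elsewhere
df.col1.where(df.col1.map(type).eq(bool))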

You can check if the type of each item in the column is not a bool and change the value.
import numpy as np
df = pd.DataFrame([[True],[True],[False],[10.2],[1.0],[False],[0]], columns=['misread'])
# use .loc to avoid chained assignment; pd.np is deprecated, so import numpy directly
df.loc[df.misread.apply(lambda x: not isinstance(x, bool)), 'misread'] = np.nan
df
# returns
misread
0 True
1 True
2 False
3 NaN
4 NaN
5 False
6 NaN
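One reason the isinstance check matters: in Python, 1 == True and 0 == False, so an equality-based filter such as isin([True, False]) would wrongly keep the 1.0 and 0 rows. A quick sketch (pandas follows Python's equality semantics here):
import pandas as pd

df = pd.DataFrame([[True],[True],[False],[10.2],[1.0],[False],[0]], columns=['misread'])
df.misread.isin([True, False])
# only the 10.2 row comes back False; 1.0 and 0 match True and False by equality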

Related

Pandas .str.isnumeric() for floats?

I'm trying to filter a column in pandas to keep floats and NaNs. I've found that .str.isnumeric() doesn't consider non-integer numbers (floats, scientific notation) numeric. This surprised me, since pd.to_numeric() does what I'm looking for (keeps floats and NaNs). So in this example:
test_num = pd.DataFrame(data={'col1': ['6', '5.5', '4E-05', np.nan, 'bear']})
test_num['col1'].str.isnumeric()
I'd expect the output to be
0 True
1 True
2 True
3 NaN
4 False
But instead it's
0 True
1 False
2 False
3 NaN
4 False
Has anyone else created this kind of basic numeric Series filter?
This is the expected behaviour as str.isnumeric() checks "whether all characters in each string are numeric" (docs).
But the question is interesting, because np.nan is a float, so if you wish to remove bear but keep the nan, pd.to_numeric(test_num['col1'], errors='coerce') will not work, as it converts bear to a float (nan).
So you could first filter out all values that cannot be converted to float prior to running pd.to_numeric on the Series:
def check_numeric(x):
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return False
test_num = test_num[test_num['col1'].apply(check_numeric)]
test_num['col1'] = pd.to_numeric(test_num['col1'])
This should work for you.
pd.Series(np.where(test_num['col1'].isna(), np.nan,
                   pd.to_numeric(test_num['col1'], errors='coerce').notnull())
          ).replace({1: True, 0: False})
0 True
1 True
2 True
3 NaN
4 False
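A vectorized alternative sketch, without the row-wise apply: coerce once, then keep rows that either parsed as numbers or were NaN to begin with (assumes the same test_num frame as in the question):
import numpy as np
import pandas as pd

test_num = pd.DataFrame(data={'col1': ['6', '5.5', '4E-05', np.nan, 'bear']})
converted = pd.to_numeric(test_num['col1'], errors='coerce')
# keep values that parsed as numbers, plus the original NaNs
test_num = test_num[converted.notna() | test_num['col1'].isna()]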

Create a numerical column out of a url, with 1 for url present and 0 for all NaNs

I'm trying to create a column that would identify whether a url is present or not from an existing column called "links". I'd like all NaN values to become zeros and any urls to be denoted as 1, in the new column. I tried the following but was unable to get the correct values.
def url(x):
    if x == 'NaN':
        return 0
    else:
        return 1

df['url1'] = df['links'].apply(url)
df.head()
You can use pd.isnull(x) instead of the x == 'NaN' comparison
import pandas as pd
df['url1'] = df['links'].apply(lambda x: 0 if pd.isnull(x) else 1)
See my comment, but the simplest and most performant thing you can do to get your desired output is to use a built-in pandas method:
input:
import numpy as np
import pandas as pd
df = pd.DataFrame({'links' : [np.nan, 'a', 'b', np.nan]})
In[1]:
links
0 NaN
1 a
2 b
3 NaN
output:
df['url1'] = df['links'].notnull().astype(int)
df
Out[801]:
links url1
0 NaN 0
1 a 1
2 b 1
3 NaN 0
notnull() returns True or False, and .astype(int) changes True to 1 and False to 0, because booleans are stored with an underlying integer value of 1 or 0 even though they display as True and False. So when you change the data type to int, each value shows its underlying 1 or 0.
Related to my comment: the string 'True' is likewise not equal to True, and 'False' is not equal to False, just as 'NaN' does not equal NaN (notice the quotes versus no quotes).
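A quick illustration of both points, booleans carrying integer values and strings not comparing equal to the objects they name:
int(True), int(False)    # (1, 0)
'True' == True           # False
'NaN' == float('nan')    # False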

Filtering a data frame containing NaN values results in an empty data frame [duplicate]

How can I check if a given value is NaN?
e.g. if (a == np.NaN) (doesn't work)
Please note that:
Numpy's isnan method throws errors with data types like string
Pandas docs only provide methods to drop rows containing NaNs, or ways to check if/when a DataFrame contains NaNs. I'm asking about checking if a specific value is NaN.
Relevant Stack Overflow questions and Google search results seem to be about checking "if any value is NaN" or "which values in a DataFrame are NaN".
There must be a clean way to check if a given value is NaN?
You can use the innate property that NaN != NaN,
so a == a will return False if a is NaN.
This works even for strings.
Example:
In[52]:
s = pd.Series([1, np.nan, '', 1.0])
s
Out[52]:
0 1
1 NaN
2
3 1
dtype: object
for val in s:
    print(val == val)
True
False
True
True
This can be done in a vectorised manner:
In[54]:
s==s
Out[54]:
0 True
1 False
2 True
3 True
dtype: bool
but you can still use the method isnull on the whole series:
In[55]:
s.isnull()
Out[55]:
0 False
1 True
2 False
3 False
dtype: bool
UPDATE
As noted by @piRSquared, None == None returns True, so the self-comparison trick will not flag None as missing; pd.isnull(None), on the other hand, returns True. Depending on whether you want to treat None as NaN, you can use == for the comparison, or pd.isnull if None should count as NaN.
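A short sketch of that difference:
import pandas as pd

print(None == None)                   # True  -> the == trick treats None as present
print(pd.isnull(None))                # True  -> pd.isnull treats None as missing
print(float('nan') == float('nan'))   # False -> NaN fails self-equality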
Pandas has isnull, notnull, isna, and notna
These functions work for arrays or scalars.
Setup
import numpy as np
import pandas as pd

a = np.array([[1, np.nan],
              [None, '2']])
Pandas functions
pd.isna(a)
# same as
# pd.isnull(a)
array([[False, True],
[ True, False]])
pd.notnull(a)
# same as
# pd.notna(a)
array([[ True, False],
[False, True]])
DataFrame (or Series) methods
b = pd.DataFrame(a)
b.isnull()
# same as
# b.isna()
0 1
0 False True
1 True False
b.notna()
# same as
# b.notnull()
0 1
0 True False
1 False True
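Since these functions also accept scalars, a few quick checks (using the imports from the setup above; note that an empty string is not considered missing):
pd.isna(np.nan)   # True
pd.isna(None)     # True
pd.isna('')       # False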

How to avoid conversion of the data type when substituting certain values with NaN within a pandas DataFrame?

I have a pandas DataFrame and I need to replace certain values to NaN based on a filter. I'm facing a change in data type when doing so. How can I avoid this data type conversion?
Toy example code
import pandas as pd
import numpy as np
df = pd.Series([False, True, False, True])
filter = pd.Series([True, True, False, False])
df[filter] = np.nan
I would expect df to keep its True and False values as well, apart from the NaNs. However, True values were converted to 1.0 and False values were converted to 0.0, as seen in the output below.
>>> df
0 NaN
1 NaN
2 0.0
3 1.0
dtype: float64
Partial solution
The only partial solution I can think of is as follows:
df[df==1] = True
df[df==0] = False
print(df)
Which results in:
>>> df
0 NaN
1 NaN
2 False
3 True
dtype: object
Question
I know that 1 compares equal to True, and 0 compares equal to False. However, I would like my True and False values not to be changed to 1 and 0 when I convert other values to NaN. Is this possible, so that I don't need the partial solution stated above?
Change the dtype to object before filtering:
df = pd.Series([False, True, False, True])
filter = pd.Series([True, True, False, False])
df=df.astype('object')
df[filter] = np.nan
df
Out[623]:
0 NaN
1 NaN
2 False
3 True
dtype: object
More info
df.apply(type)
Out[625]:
0 <class 'float'>
1 <class 'float'>
2 <class 'bool'>
3 <class 'bool'>
dtype: object
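An alternative sketch, assuming pandas 1.0 or newer: the nullable boolean dtype keeps True/False alongside a proper missing value (pd.NA) without falling back to object:
import pandas as pd

s = pd.Series([False, True, False, True], dtype='boolean')  # nullable BooleanDtype
mask = pd.Series([True, True, False, False])
s[mask] = pd.NA
s
# 0     <NA>
# 1     <NA>
# 2    False
# 3     True
# dtype: boolean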

Python Dataframe get null value counts

I am trying to find the null values in a DataFrame. Though I reviewed the following post from Stack Overflow that describes the process for determining null values, I am having a hard time doing the same for my dataset.
How to count the Nan values in the column in Panda Data frame
Working code:
import pandas as pd
a = ['america','britain','brazil','','china','jamaica'] #I deliberately introduce a NULL value
a = pd.DataFrame(a)
a.isnull()
#Output:
0 False
1 False
2 False
3 False
4 False
5 False
a.isnull().sum()
#Output
#0 0
#dtype: int64
What am I doing wrong?
If you want '', None, and NaN to all count as null, you can use the applymap method to flag every value that is falsy or null, and then take .sum of the flags:
import pandas as pd
import numpy as np
a = ['america','britain','brazil',None,'', np.nan, 'china','jamaica'] #I deliberately introduce a NULL value
a = pd.DataFrame(a)
a.applymap(lambda x: not x or pd.isnull(x)).sum()
# 0 3
# dtype: int64
I hope this helps.
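A side note, assuming pandas 2.1 or newer: DataFrame.applymap is deprecated there in favour of DataFrame.map, which accepts the same function:
a.map(lambda x: not x or pd.isnull(x)).sum()
# 0    3
# dtype: int64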
The '' in your list isn't a null value; it's an empty string. To get a null, use None instead. The pandas.isnull() documentation describes missing values as "NaN in numeric arrays, [or] None/NaN in object arrays".
import pandas as pd
a = ['america','britain','brazil',None,'china','jamaica']
a = pd.DataFrame(a)
a.isnull()
0
0 False
1 False
2 False
3 True
4 False
5 False
You can see the difference by printing the two dataframes. In the first case, the dataframe looks like:
pd.DataFrame(['america','britain','brazil','','china','jamaica'])
0
0 america
1 britain
2 brazil
3
4 china
5 jamaica
Notice that the value at index 3 is an empty string.
In the second case, you get:
pd.DataFrame(['america','britain','brazil',None,'china','jamaica'])
0
0 america
1 britain
2 brazil
3 None
4 china
5 jamaica
The other posts addressed that '' is not a null value and therefore isn't counted as such with the isnull method...
...However, '' does evaluate to False when interpreted as a bool.
a.astype(bool)
0
0 True
1 True
2 True
3 False
4 True
5 True
This might be useful if you have '' in your dataframe and want to process it this way.
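Putting the two ideas together, a sketch that counts empty strings and real nulls in a single pass:
import numpy as np
import pandas as pd

a = pd.DataFrame(['america', 'britain', 'brazil', '', None, np.nan, 'jamaica'])
# None/NaN are caught by isnull; '' is caught by the equality test
(a[0].isnull() | (a[0] == '')).sum()
# 3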
