Replace string 'Null' with pandas NaN - Vectorised - python

The last question that asked this was in 2016 and used the astropy package:
Replacing masked values (--) with a Null or None value using fill_value from numpy.ma in Python
I was wondering whether there is now a faster, vectorised way than using applymap:
df.applymap(lambda x: np.nan if x == 'NULL' else x)

Use replace, or mask, which by default changes values to NaN where the condition holds:
df = df.replace('NULL', np.nan)
To compare mixed data, use values or cast to string:
df = df.mask(df.values == 'NULL')
df = df.mask(df.astype(str) == 'NULL')
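For reference, a minimal runnable sketch of both approaches on a made-up frame mixing numbers with the literal string 'NULL':
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10, 'NULL'], 'b': ['NULL', 20]})
print(df.replace('NULL', np.nan))          # exact-match replacement
print(df.mask(df.astype(str) == 'NULL'))   # boolean-mask replacement, same result here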

You can do it in-place:
df[df.astype(str)=='NULL'] = np.nan
Example:
>>> df
a b
0 10 NULL
1 NULL 20
>>> df[df=='NULL'] = np.nan
>>> df
a b
0 10 NaN
1 NaN 20

Related

How do I replace a string-value in a specific column using method chaining?

I have a pandas data frame, where some string values are "NA". I want to replace these values in a specific column (i.e. the 'strCol' in the example below) using method chaining.
How do I do this? (I googled quite a bit without success even though this should be easy?! ...)
Here is a minimal example:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['val1', 'val2', 'NA', 'val3']})
df = (
    df
    .rename(columns={'A': 'intCol', 'B': 'strCol'})  # method chain example operation 1
    .astype({'intCol': float})                       # method chain example operation 2
    # .where(df['strCol']=='NA', pd.NA)  # how to replace the string 'NA' here? this does not work ...
)
df
You can try replace instead of where:
df.replace({'strCol':{'NA':pd.NA}})
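For completeness, a sketch of the full chain with replace slotted in (the nested dict restricts the replacement to 'strCol'; the names follow the question's rename):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ['val1', 'val2', 'NA', 'val3']})
df = (
    df
    .rename(columns={'A': 'intCol', 'B': 'strCol'})
    .astype({'intCol': float})
    .replace({'strCol': {'NA': pd.NA}})  # replaces only in the 'strCol' column
)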
Use a lambda in where to evaluate the chained dataframe. Note that where keeps values where the condition is True, so the condition must exclude the rows you want to replace:
df = (df.rename(columns={'A': 'intCol', 'B': 'strCol'})
        .astype({'intCol': float})
        .where(lambda x: x['strCol'] != 'NA', pd.NA))
Output:
>>> df
intCol strCol
0 1.0 val1
1 2.0 val2
2 NaN <NA>
3 4.0 val3
Because the condition is a row-wise Series, where replaces every column in the matching row, so intCol in that row is nulled as well; replace is the better fit when only one column should change.
Many methods like where, mask, groupby, and apply accept a callable, so you can pass a lambda that operates on the chained dataframe.
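If only 'strCol' should change, one hedged alternative is assign with a callable plus Series.mask, sketched on an already-renamed version of the question's data:
import pandas as pd
df = pd.DataFrame({'intCol': [1.0, 2.0, 3.0, 4.0],
                   'strCol': ['val1', 'val2', 'NA', 'val3']})
df = df.assign(strCol=lambda x: x['strCol'].mask(x['strCol'] == 'NA', pd.NA))  # nulls out 'NA' in strCol only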
pandas.DataFrame.where does the following:
Replace values where the condition is False.
So the condition must be False exactly where you want the replacement to happen. A simple example:
import pandas as pd
df = pd.DataFrame({'x':[1,2,3,4,5,6,7,8,9]})
df2 = df.where(df.x%2==0,-1)
print(df2)
gives output
x
0 -1
1 2
2 -1
3 4
4 -1
5 6
6 -1
7 8
8 -1
Observe that the odd values were replaced by -1, since the condition holds only for the even values.
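As a complement, mask replaces values where the condition is True; a minimal sketch reusing the same data:
import pandas as pd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df3 = df.mask(df.x % 2 == 0, -1)  # even values become -1, odd values are kept
print(df3)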

Create a numerical column out of a url, with 1 for url present and 0 for all NaNs

I'm trying to create a column that would identify whether a url is present or not from an existing column called "links". I'd like all NaN values to become zeros and any urls to be denoted as 1, in the new column. I tried the following but was unable to get the correct values.
def url(x):
    if x == 'NaN':
        return 0
    else:
        return 1

df['url1'] = df['links'].apply(url)
df.head()
You can use pd.isnull(x) instead of the x == 'NaN' comparison
import pandas as pd
df['url1'] = df['links'].apply(lambda x: 0 if pd.isnull(x) else 1)
See my comment, but the simplest and most performant thing you can do to get your desired output is to use a pandas method:
input:
import numpy as np
import pandas as pd
df = pd.DataFrame({'links' : [np.nan, 'a', 'b', np.nan]})
In[1]:
links
0 NaN
1 a
2 b
3 NaN
output:
df['url1'] = df['links'].notnull().astype(int)
df
Out[801]:
links url1
0 NaN 0
1 a 1
2 b 1
3 NaN 0
notnull() returns True or False, and .astype(int) changes True to 1 and False to 0, because booleans have underlying integer values of 1 and 0. Converting the dtype to int simply exposes that underlying value.
Related to my comment: the string 'True' is also not equal to True, and 'False' is not equal to False, just as the string 'NaN' is not equal to NaN (note the quotes versus no quotes).
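An equivalent vectorised alternative, sketched with numpy.where on the same hypothetical 'links' column:
import numpy as np
import pandas as pd
df = pd.DataFrame({'links': [np.nan, 'a', 'b', np.nan]})
df['url1'] = np.where(df['links'].isna(), 0, 1)  # 0 for missing, 1 for a present url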

Replace numeric values in a pandas dataframe

Problem: polluted DataFrame.
Details: the frame consists of NaNs, string values whose meaning I know, and numeric values.
Task: replacing the numeric values with NaNs.
Example
import numpy as np
import pandas as pd
df = pd.DataFrame([['abc', 'cdf', 1], ['k', 'sum', 'some'], [1000, np.nan, 'nothing']])
out:
0 1 2
0 abc cdf 1
1 k sum some
2 1000 NaN nothing
Attempt 1 (Does not work, because regex only looks at string cells)
df.replace({'\d+': np.nan}, regex=True)
out:
0 1 2
0 abc cdf 1
1 k sum some
2 1000 NaN nothing
Preliminary Solution
val_set = set()
[val_set.update(i) for i in df.values]

def dis_nums(myset):
    str_s = set()
    num_replace_dict = {}
    for i in range(len(myset)):
        val = myset.pop()
        if type(val) == str:
            str_s.update([val])
        else:
            num_replace_dict.update({val: np.nan})
    return str_s, num_replace_dict

strs, rpl_dict = dis_nums(val_set)
df.replace(rpl_dict, inplace=True)
out:
0 1 2
0 abc cdf NaN
1 k sum some
2 NaN NaN nothing
Question
Is there any easier/ more pleasant solution?
You can do a round-trip conversion to str to replace the values, then cast back to object.
df.astype('str').replace({r'\d+': np.nan, 'nan': np.nan}, regex=True).astype('object')
# The 'nan' pattern makes sure already existing np.nan values are not lost
Output
0 1 2
0 abc cdf NaN
1 k sum some
2 NaN NaN nothing
You can use a loop to go through each column and check each item: if it is an integer or a float, replace it with np.nan. This is easy with the map function applied to the column, and you can change the condition of the if to cover whatever data types you want.
for x in df.columns:
    df[x] = df[x].map(lambda item: np.nan if type(item) in (int, float) else item)
This is a naive approach, and there are likely better solutions.
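A whole-frame sketch of the same idea, using the question's frame: build a boolean mask with applymap (on pandas 2.1+ you could use DataFrame.map instead) and hand it to mask:
import numpy as np
import pandas as pd
df = pd.DataFrame([['abc', 'cdf', 1], ['k', 'sum', 'some'], [1000, np.nan, 'nothing']])
# True for cells holding plain numbers (bools excluded), False for strings
is_number = df.applymap(lambda v: isinstance(v, (int, float)) and not isinstance(v, bool))
print(df.mask(is_number))  # numeric cells become NaN, everything else is kept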

How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.
Initially I tried a for-loop over each value of the dataframe, which took too much time.
Then I used data_new = data.subtract(data), which was meant to subtract all the values of the dataframe from themselves so that all the non-null values become 0.
But an error occurred because the dataframe had multiple string entries.
You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:
newdf = df.notnull().astype('int')
If you really want to write into your original DataFrame, this will work:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
Use notnull and cast the boolean result to int with astype:
print ((df.notnull()).astype('int'))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
a b
0 NaN 1.0
1 4.0 NaN
2 NaN 3.0
print (df.notnull())
a b
0 False True
1 True False
2 False True
print ((df.notnull()).astype('int'))
a b
0 0 1
1 1 0
2 0 1
I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to keep the source column that another column was derived from.
e.g. if df['col1'] is the existing column:
df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else 0)
where col2 is the new column. This also works if col1 has string entries.
I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.
I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import datetime as dt
# create dataframe with randomly place NaN's
data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)
trials = np.arange(100)
d1 = dt.datetime.now()
for r in trials:
    new_df = df.notnull().astype(int)
print((dt.datetime.now() - d1).total_seconds() / trials.size)
# create a dummy copy of df. I use a dummy copy here to prevent biasing the
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()
d1 = dt.datetime.now()
for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull() == False] = 1
print((dt.datetime.now() - d1).total_seconds() / trials.size)
This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
There is a method .fillna() on DataFrames which does what you need. For example:
df = df.fillna(0) # Replace all NaN values with zero, returning the modified DataFrame
or
df.fillna(0, inplace=True) # Replace all NaN values with zero, updating the DataFrame directly
Regarding fmarc's answer:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
The code above does not work for me; the version below does (with pandas 0.25.3):
df[~df.isnull()] = 1 # not nan
df[df.isnull()] = 0 # nan
And if you want to just change values in specific columns, you may need to create a temp dataframe and assign it to the columns of the original dataframe:
change_col = ['a', 'b']
tmp = df[change_col].copy()   # copy to avoid a SettingWithCopyWarning
tmp[tmp.isnull()] = 'xxx'
df[change_col] = tmp
Try this one:
df.notnull().mul(1)
Here is a suggestion for a single column: fill the NaNs in that column with 0, then replace the remaining (non-NaN) values with 1.
The line below sets the NaNs in your column to 0:
df.YourColumnName.fillna(0, inplace=True)
Now the rest (the non-NaN part) is replaced with 1 by the code below:
df["YourColumnName"] = df["YourColumnName"].apply(lambda x: 1 if x != 0 else 0)
The same can be applied to the whole DataFrame by not specifying a column name.
Use: df.fillna(0)
to fill NaN with 0.
Generally there are two steps: substitute all non-NaN values, then substitute all NaN values.
dataframe.where(~dataframe.notna(), 1) - this line replaces all non-NaN values with 1.
dataframe.fillna(0) - this line replaces all NaNs with 0.
Side note: if you look at the pandas documentation, .where replaces values where the condition is False, which is the important point here. That is why we invert the mask with ~dataframe.notna(): .where() then replaces exactly the non-missing values.
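The two steps chain into a single expression; a minimal sketch on a made-up frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': ['x', np.nan, 'y']})
result = df.where(~df.notna(), 1).fillna(0)  # non-NaN -> 1, NaN -> 0
print(result)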

Querying for NaN and other names in Pandas

Say I have a dataframe df with a column value holding some float values and some NaN. How can I get the part of the dataframe where we have NaN using the query syntax?
The following, for example, does not work:
df.query( '(value < 10) or (value == NaN)' )
I get name NaN is not defined (same for df.query('value ==NaN'))
Generally speaking, is there any way to use numpy names in query, such as inf, nan, pi, e, etc.?
According to this answer you can use:
df.query('value < 10 | value.isnull()', engine='python')
I verified that it works.
In general, you could use @local_variable_name, so something like
>>> pi = np.pi; nan = np.nan
>>> df = pd.DataFrame({"value": [3, 4, 9, 10, 11, np.nan, 12]})
>>> df.query("(value < 10) and (value > @pi)")
value
1 4.0
2 9.0
would work, but NaN isn't equal to itself, so value == @nan will always be False. One way to hack around this is to use that fact and treat value != value as an isnan check. We have
>>> df.query("(value < 10) or (value == @nan)")
value
0 3.0
1 4.0
2 9.0
but
>>> df.query("(value < 10) or (value != value)")
value
0 3.0
1 4.0
2 9.0
5 NaN
You can use the isna and notna Series methods, which are concise and readable.
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [3, 4, 9, 10, 11, np.nan, 12]})
available = df.query("value.notna()")
print(available)
# value
# 0 3.0
# 1 4.0
# 2 9.0
# 3 10.0
# 4 11.0
# 6 12.0
not_available = df.query("value.isna()")
print(not_available)
# value
# 5 NaN
In case you have numexpr installed, you need to pass engine="python" to make it work with .query.
numexpr is recommended by pandas to speed up the performance of .query on larger datasets.
available = df.query("value.notna()", engine="python")
print(available)
Alternatively, you can use the toplevel pd.isna function, by referencing it as a local variable. Again, passing engine="python" is required when numexpr is present.
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [3, 4, 9, 10, 11, np.nan, 12]})
df.query("#pd.isna(value)")
# value
# 5 NaN
For rows where value is not null
df.query("value == value")
For rows where value is null
df.query("value != value")
Pandas fills empty cells in a DataFrame with NumPy's nan values. As it turns out, this has some funny properties. For one, nothing is equal to this kind of null, even itself. As a result, you can't search for it by checking for any particular equality.
In : 'nan' == np.nan
Out: False
In : None == np.nan
Out: False
In : np.nan == np.nan
Out: False
However, because a cell containing a np.nan value will not be equal to anything, including another np.nan value, we can check to see if it is unequal to itself.
In : np.nan != np.nan
Out: True
You can take advantage of this using Pandas query method by simply searching for cells where the value in a particular column is unequal to itself.
df.query('a != a')
or
df[df['a'] != df['a']]
Note that df.query("value == 'NaN'") only matches cells containing the literal string 'NaN'; it will not find real NaN floats.
I think other answers will normally be better. In one case, my query had to go through eval (use eval very carefully) and the syntax below was useful. Requiring a value to be neither less than 10 nor greater than or equal to 10 excludes all numbers, leaving only null-like values.
df = pd.DataFrame({'value':[3,4,9,10,11,np.nan, 12]})
df.query("value < 10 or (~(value < 10) and ~(value >= 10))")
