Python avoid dividing by zero in pandas dataframe - python

Apologies that this has been asked before, but I cannot get those solutions to work for me (am native MATLAB user coming to Python).
I have a dataframe where I am taking the row-wise mean of the first 7 columns of one df and dividing it by another. However, there are many zeros in this dataset and I want to replace the zero division results with zeros (as that's meaningful to me) instead of the NaN that my implementation naturally returns.
My code so far:
col_ind = list(range(0,7))
df.iloc[:,col_ind].mean(axis=1)/other.iloc[:,col_ind].mean(axis=1)
Here, if other = 0, it returns nan, but if df = 0 it returns 0. I have tried a lot of proposed solutions but none seem to register. For instance:
def foo(x, y):
    try:
        return x / y
    except ZeroDivisionError:
        return 0

foo(df.iloc[:, col_ind].mean(axis=1), other.iloc[:, col_ind].mean(axis=1))
However this returns the same values without using the defined foo. I'm suspecting this is because I am operating on series rather than single values, but I'm not sure nor how to fix it. There are also actual nans in these dataframes as well. Any help appreciated.

You can use np.where to do this conditionally as a vectorised calculation.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.concatenate([np.random.randint(1, 10, (10, 7)), np.random.randint(0, 3, (10, 1))], axis=1),
    columns=[f"col_{i}" for i in range(7)] + ["div"],
)
np.where(df["div"].gt(0), df.loc[:, [c for c in df.columns if "col" in c]].mean(axis=1) / df["div"], 0)

It's not clear which version you're using and I don't know if the behavior is version-dependent, but in Python 3.8.5 / Pandas 1.2.4, a 0 / 0 in a dataframe/series will evaluate to NaN, while a non-zero / 0 will evaluate to inf. Neither will raise an error, so a try/except wouldn't have anything to catch.
>>> import pandas as pd
>>> import numpy as np
>>> x = pd.DataFrame({'a': [0, 1, 2], 'b': [0, 0, 2]})
>>> x
   a  b
0  0  0
1  1  0
2  2  2
>>> x.a / x.b
0    NaN
1    inf
2    1.0
dtype: float64
You can replace NaN values in a pandas DataFrame or Series with the fillna() method, and you can replace inf using a standard replace():
>>> (x.a / x.b).replace(np.inf, np.nan)
0    NaN
1    NaN
2    1.0
dtype: float64
>>> (x.a / x.b).replace(np.inf, np.nan).fillna(0)
0    0.0
1    0.0
2    1.0
dtype: float64
(Note: A negative value divided by zero will evaluate to -inf, which would need to be replaced separately.)
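To handle all three cases at once (NaN from 0/0 plus inf and -inf from non-zero numerators), one option is to replace both infinities before filling, for example:
>>> (x.a / x.b).replace([np.inf, -np.inf], np.nan).fillna(0)
0    0.0
1    0.0
2    1.0
dtype: float64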

You could replace the NaN values after the calculation using .fillna(0).
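Applied to the expression from the question (a sketch reusing the question's df, other and col_ind), that would be:
(df.iloc[:, col_ind].mean(axis=1) / other.iloc[:, col_ind].mean(axis=1)).fillna(0)
Note this only covers the NaN produced by 0/0; a non-zero numerator over 0 gives inf, which fillna leaves alone.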

Related

Fill missing data with random values from categorical column - Python

I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called ‘agent’ that has 13.7% missing values. My intuition is to just drop the rows of missing values, but considering the number of missing values is not that small, now I want to use the Random Sampling Imputation to replace them proportionally with the existing categorical variables.
My code is:
new_agent = hotel['agent'].dropna()
agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))
Results: the first 3 rows were NaN but are now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?
UPDATE:
Thanks to ti7, who helped me solve the problem:
new_agent = hotel['agent'].dropna()  # get a series of just the available values
n_null = hotel['agent'].isnull().sum()  # length of the missing entries
new_agent.sample(n_null, replace=True).values  # sample it with repetition and get values
hotel.loc[hotel['agent'].isnull(), 'agent'] = new_agent.sample(n_null, replace=True).values  # fill and replace
.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!
You probably want some form of generating a new Series with random values from your current series (you know the shape from subtracting the lengths) and use that for the missing values.
get a Series of just the available values (.dropna())
.sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
get the .values (this is a flat numpy array)
filter the column and assign
quick code
df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
df["agent"].isna().sum(), # get the same number of values as are missing
replace=True # repeat values
).values # throw out the index
demo
>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
   agent  b
0    1.0  3
1    2.0  4
2    NaN  5
3    NaN  6
4   10.0  7
>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])
>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
... df["agent"].isna().sum(),
... replace=True
... ).values
>>> df
   agent  b
0    1.0  3
1    2.0  4
2   10.0  5
3    2.0  6
4   10.0  7
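If you need the imputation to be repeatable, .sample() also accepts a random_state argument (a standard pandas option, not part of the original answer):
df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
    df["agent"].isna().sum(),
    replace=True,
    random_state=0  # fixed seed so repeated runs draw the same replacement values
).values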

Using isin with NaN in dataframe

Let's say I have the following dataframe:
    t2   t5
0  NaN  2.0
1  2.0  NaN
2  3.0  1.0
Now I want to check if elements in t2 is in t5, ignoring NaN.
Therefore, I run the following code:
df['t2'].isin(df['t5'])
Which gives:
0     True
1     True
2    False
However, since NaN != NaN, I expected
0    False
1     True
2    False
How do I get what I expected? And why does this behave this way?
This isn't so much a bug as it is an inconsistency of behavior between similar libraries. Your columns have a dtype of float64, and both Pandas and Numpy have their own ideas of whether or not nan is comparable to nan[1]. You can see this behavior with unique
>>> np.unique([np.nan, np.nan])
array([nan, nan])
>>> pd.unique([np.nan, np.nan])
array([nan])
So clearly, pandas detects some sort of similarity with nan, which is the behavior you are seeing with isin.
Now for large Series, you won't see this behavior[2]. I think I read somewhere that the cutoff is around 10e6, but don't take my word for it.
>>> u = pd.Series(np.full(100000000, np.nan, dtype=np.float64))
>>> u.isin(u).any()
False
[1] For large Series (> 10e6), pandas uses numpy's definition of nan
[2] As #root points out, this is dtype dependent.
It is because np.nan is indeed in [np.nan]. That is to say, the in check is roughly equivalent to np.any([a is b for b in lst]). To get what you want, you can filter out the NaN in df['t2'] first:
df['t2'].notna() & df['t2'].isin(df['t5'])
gives:
0    False
1     True
2    False
Name: t2, dtype: bool
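Put together, a minimal self-contained sketch of this approach on the example frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'t2': [np.nan, 2.0, 3.0], 't5': [2.0, np.nan, 1.0]})
result = df['t2'].notna() & df['t2'].isin(df['t5'])  # mask out NaN before the membership test
print(result)
# 0    False
# 1     True
# 2    False
# Name: t2, dtype: bool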

How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.
Initially I tried for-loop on each value of the dataframe which was taking too much time.
Then I used data_new=data.subtract(data) which was meant to subtract all the values of the dataframe to itself so that I can make all the non-null values 0.
But an error occurred as the dataframe had multiple string entries.
You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:
newdf = df.notnull().astype('int')
If you really want to write into your original DataFrame, this will work:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
Use notnull with casting boolean to int by astype:
print ((df.notnull()).astype('int'))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
     a    b
0  NaN  1.0
1  4.0  NaN
2  NaN  3.0
print (df.notnull())
       a      b
0  False   True
1   True  False
2  False   True
print ((df.notnull()).astype('int'))
   a  b
0  0  1
1  1  0
2  0  1
I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to have a source for a column populated via an operation on another.
e.g. if df['col1'] is the existing column
df['col2'] = df['col1'].apply(lambda x: 0 if pd.isnull(x) else 1)
where col2 is the new column. Should also work if col1 has string entries.
I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.
I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import datetime as dt

# create dataframe with randomly placed NaN's
data = np.ones((int(1e2), int(1e2)))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)

trials = np.arange(100)

d1 = dt.datetime.now()
for r in trials:
    new_df = df.notnull().astype(int)
print((dt.datetime.now() - d1).total_seconds() / trials.size)

# create a dummy copy of df. I use a dummy copy here to prevent biasing the
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()
d1 = dt.datetime.now()
for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull() == False] = 1
print((dt.datetime.now() - d1).total_seconds() / trials.size)
This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
There is a method .fillna() on DataFrames which does what you need. For example:
df = df.fillna(0) # Replace all NaN values with zero, returning the modified DataFrame
or
df.fillna(0, inplace=True) # Replace all NaN values with zero, updating the DataFrame directly
Regarding fmarc's answer:
df.loc[~df.isnull()] = 1  # not nan
df.loc[df.isnull()] = 0   # nan
The code above does not work for me, while the version below does:
df[~df.isnull()] = 1  # not nan
df[df.isnull()] = 0   # nan
This is with pandas 0.25.3.
And if you want to just change values in specific columns, you may need to create a temp dataframe and assign it to the columns of the original dataframe:
change_col = ['a', 'b']
tmp = df[change_col]
tmp[tmp.isnull()] = 'xxx'
df[change_col] = tmp
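A shorter route (not from the original answer) that skips the temporary frame, assuming the same change_col list:
df[change_col] = df[change_col].fillna('xxx')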
Try this one:
df.notnull().mul(1)
Here is a suggestion for a particular column: if a row in that column is NaN, replace it with 0, and if it holds a value, replace it with 1.
The line below will set the NaN entries of your column to 0:
df.YourColumnName.fillna(0, inplace=True)
Now the rest (the non-NaN part) will be replaced by 1 with the code below (note that any genuine zeros already in the column also end up as 0):
df["YourColumnName"] = df["YourColumnName"].apply(lambda x: 1 if x != 0 else 0)
The same can be applied to the whole dataframe by not specifying a column name.
Use: df.fillna(0)
to fill NaN with 0.
Generally there are two steps: substitute all non-NaN values, then substitute all NaN values.
dataframe.where(~dataframe.notna(), 1) - this line replaces all non-NaN values with 1.
dataframe.fillna(0) - this line replaces all NaNs with 0.
Side note: if you take a look at the pandas documentation, .where replaces all values where the condition is False - this is the important point. That is why we use the inversion ~dataframe.notna() as the mask by which .where() replaces values.
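Chained on a small frame, the two steps might look like this (a sketch following the lines above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1, np.nan, 3]})
result = df.where(~df.notna(), 1).fillna(0)  # non-NaN -> 1, then NaN -> 0
print(result)
#      a    b
# 0  0.0  1.0
# 1  1.0  0.0
# 2  0.0  1.0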

Transforming outliers in Pandas DataFrame using .apply, .applymap, .groupby

I'm attempting to transform a pandas DataFrame object into a new object that contains a classification of the points based upon some simple thresholds:
Value transformed to 0 if the point is NaN
Value transformed to 1 if the point is negative or 0
Value transformed to 2 if it falls outside certain criteria based on the entire column
Value is 3 otherwise
Here is a very simple self-contained example:
import pandas as pd
import numpy as np
df=pd.DataFrame({'a':[np.nan,1000000,3,4,5,0,-7,9,10],'b':[2,3,-4,5,6,1000000,7,9,np.nan]})
print(df)
The transformation process created so far:
# Loop through and find points greater than the mean -- in this simple example, these are the 'outliers'
outliers = pd.DataFrame()
for datapoint in df.columns:
    tempser = pd.DataFrame(df[datapoint][np.abs(df[datapoint]) > (df[datapoint].mean())])
    outliers = pd.merge(outliers, tempser, right_index=True, left_index=True, how='outer')
outliers[outliers.isnull() == False] = 2

# Classify everything else as "3"
df[df > 0] = 3
# Classify negative and zero points as a "1"
df[df <= 0] = 1
# Update with the outliers
df.update(outliers)
# Everything else is a "0"
df.fillna(value=0, inplace=True)
Resulting in:
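(The original post shows the result as an image; re-running the snippet above should give approximately the following.)
     a    b
0  0.0  3.0
1  2.0  3.0
2  3.0  1.0
3  3.0  3.0
4  3.0  3.0
5  1.0  2.0
6  1.0  3.0
7  3.0  3.0
8  3.0  0.0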
I have tried to use .applymap() and/or .groupby() in order to speed up the process with no luck. I found some guidance in this answer; however, I'm still unsure how .groupby() is useful when you're not grouping within a pandas column.
Here's a replacement for the outliers part. It's about 5x faster for your sample data on my computer.
>>> pd.DataFrame( np.where( np.abs(df) > df.mean(), 2, df ), columns=df.columns )
     a    b
0  NaN    2
1    2    3
2    3   -4
3    4    5
4    5    6
5    0    2
6   -7    7
7    9    9
8   10  NaN
You could also do it with apply, but it will be slower than the np.where approach (but approximately the same speed as what you are currently doing), though much simpler. That's probably a good example of why you should always avoid apply if possible, when you care about speed.
>>> df[ df.apply( lambda x: abs(x) > x.mean() ) ] = 2
You could also do this, which is faster than apply but slower than np.where:
>>> mask = np.abs(df) > df.mean()
>>> df[mask] = 2
Of course, these things don't always scale linearly, so test them on your real data and see how that compares.
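For the full classification (0 for NaN, 1 for zero or negative, 2 for outliers, 3 otherwise), another vectorised option is np.select, which applies the first matching condition; this is a sketch, not taken from the original answers:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1000000, 3, 4, 5, 0, -7, 9, 10],
                   'b': [2, 3, -4, 5, 6, 1000000, 7, 9, np.nan]})

conditions = [
    df.isna(),               # NaN -> 0
    df.le(0),                # zero or negative -> 1
    df.abs().gt(df.mean()),  # outlier relative to the column mean -> 2
]
classified = pd.DataFrame(np.select(conditions, [0, 1, 2], default=3),
                          index=df.index, columns=df.columns)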

Querying for NaN and other names in Pandas

Say I have a dataframe df with a column value holding some float values and some NaN. How can I get the part of the dataframe where we have NaN using the query syntax?
The following, for example, does not work:
df.query( '(value < 10) or (value == NaN)' )
I get name NaN is not defined (same for df.query('value ==NaN'))
Generally speaking, is there any way to use numpy names in query, such as inf, nan, pi, e, etc.?
According to this answer you can use:
df.query('value < 10 | value.isnull()', engine='python')
I verified that it works.
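For reference, a quick check of that expression on the small example frame used in the later answers (a sketch; the values are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [3, 4, 9, 10, 11, np.nan, 12]})
print(df.query('value < 10 | value.isnull()', engine='python'))
#    value
# 0    3.0
# 1    4.0
# 2    9.0
# 5    NaN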
In general, you could use @local_variable_name, so something like
>>> pi = np.pi; nan = np.nan
>>> df = pd.DataFrame({"value": [3,4,9,10,11,np.nan,12]})
>>> df.query("(value < 10) and (value > @pi)")
   value
1      4
2      9
would work, but nan isn't equal to itself, so value == NaN will always be false. One way to hack around this is to use that fact, and use value != value as an isnan check. We have
>>> df.query("(value < 10) or (value == #nan)")
value
0 3
1 4
2 9
but
>>> df.query("(value < 10) or (value != value)")
value
0 3
1 4
2 9
5 NaN
You can use the isna and notna Series methods, which is concise and readable.
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [3, 4, 9, 10, 11, np.nan, 12]})
available = df.query("value.notna()")
print(available)
#    value
# 0    3.0
# 1    4.0
# 2    9.0
# 3   10.0
# 4   11.0
# 6   12.0
not_available = df.query("value.isna()")
print(not_available)
#    value
# 5    NaN
In case you have numexpr installed, you need to pass engine="python" to make it work with .query.
numexpr is recommended by pandas to speed up the performance of .query on larger datasets.
available = df.query("value.notna()", engine="python")
print(available)
Alternatively, you can use the toplevel pd.isna function, by referencing it as a local variable. Again, passing engine="python" is required when numexpr is present.
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': [3, 4, 9, 10, 11, np.nan, 12]})
df.query("#pd.isna(value)")
# value
# 5 NaN
For rows where value is not null
df.query("value == value")
For rows where value is null
df.query("value != value")
Pandas fills empty cells in a DataFrame with NumPy's nan values. As it turns out, this has some funny properties. For one, nothing is equal to this kind of null, even itself. As a result, you can't search for it by checking for any particular equality.
In : 'nan' == np.nan
Out: False
In : None == np.nan
Out: False
In : np.nan == np.nan
Out: False
However, because a cell containing a np.nan value will not be equal to anything, including another np.nan value, we can check to see if it is unequal to itself.
In : np.nan != np.nan
Out: True
You can take advantage of this using Pandas query method by simply searching for cells where the value in a particular column is unequal to itself.
df.query('a != a')
or
df[df['a'] != df['a']]
This should also work: df.query("value == 'NaN'") (note, though, that this compares against the literal string 'NaN', so it is unlikely to match actual float NaN values; the self-inequality approaches above are safer).
I think other answers will normally be better. In one case, my query had to go through eval (use eval very carefully) and the syntax below was useful. Requiring a number to be both less than and greater than or equal to excludes all numbers, leaving only null-like values.
df = pd.DataFrame({'value':[3,4,9,10,11,np.nan, 12]})
df.query("value < 10 or (~(value < 10) and ~(value >= 10))")
