Python Pandas: condition to apply null

Hi, I would like to transform my numeric variable so that if it exceeds 1,000 it becomes null/NaN; otherwise the value is kept as is. Below is my code.
df['PREMIUM'] = pd.to_numeric(df['PREMIUM'])
df['PREMIUM_V2'] = np.where(df['PREMIUM'] > 1000, np.nan, df['PREMIUM'])
I tried this, but it makes PREMIUM_V2 non-numeric; the column just became an object dtype.

Use mask:
df = pd.DataFrame({'PREMIUM': [0,1,100,10000]})
df['PREMIUM2'] = df['PREMIUM'].mask(df['PREMIUM'].gt(1000))
output:
PREMIUM PREMIUM2
0 0 0.0
1 1 1.0
2 100 100.0
3 10000 NaN
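A quick check, as a minimal sketch using the same frame as above: mask keeps the column numeric, because it only swaps the masked positions for NaN and upcasts the rest to float.
import pandas as pd
df = pd.DataFrame({'PREMIUM': [0, 1, 100, 10000]})
df['PREMIUM2'] = df['PREMIUM'].mask(df['PREMIUM'].gt(1000))
# the masked column is upcast to float64 so it can hold NaN, not to object
print(df['PREMIUM2'].dtype)  # float64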

I'm not sure I fully understand your question. If you want to change the values in the column
df['PREMIUM'] to NaN when the value is greater than 1000:
df['PREMIUM'] = pd.to_numeric(df['PREMIUM'])
df['PREMIUM'] = np.where(df['PREMIUM'] > 1000, np.nan, df['PREMIUM'])
If you want to create a different column in the dataframe, keeping values of 1000 or less as they are and changing values greater than 1000 to NaN, you can use:
df['PREMIUM'] = pd.to_numeric(df['PREMIUM'])
df['PREMIUM_V2'] = np.where(df['PREMIUM'] > 1000, np.nan, df['PREMIUM'])
Note: numpy.where(condition, value_if_true, value_if_false).
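A small sketch of that argument order, using a plain NumPy array for illustration (the values are made up):
import numpy as np
premiums = np.array([0, 1, 100, 10000])
# positions where the condition holds get NaN, the rest keep their value
result = np.where(premiums > 1000, np.nan, premiums)
print(result)  # 0.0, 1.0, 100.0, nan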

Related

Python calculate increment rows till a condition

How do I obtain the result below? (Sample data with the expected output was shown as an image.)
'Time to default' is the column to be calculated: it should be an incrementing counter that runs until the Default column becomes 1.
The above is for a single sample account; the same has to be applied across multiple account ids.
Use:
m = df['Default'].eq(1).groupby(df['acct_id']).transform(lambda x: x.shift(fill_value=False).cummax())
df.loc[~m, 'new'] = df[~m].groupby('acct_id').cumcount()
print (df)
acct_id Default new
0 xxx123 0 0.0
1 xxx123 0 1.0
2 xxx123 1 2.0
3 xxx123 1 NaN
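For reference, a minimal sketch reconstructing sample data that reproduces the printed output (the acct_id value is taken from that output; the rest is an assumption):
import pandas as pd
df = pd.DataFrame({'acct_id': ['xxx123'] * 4, 'Default': [0, 0, 1, 1]})
# m is True only on rows strictly after the first Default == 1 within each account
m = df['Default'].eq(1).groupby(df['acct_id']).transform(lambda x: x.shift(fill_value=False).cummax())
df.loc[~m, 'new'] = df[~m].groupby('acct_id').cumcount()
print(df)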

Loop over all rows in a dataframe and generate a new column based on comparing other columns

I have the following dataframe:
ID  name     value    mean   std   upper   lower
             894.68  154.00  2.33  203.16  189.18
            1045.28  196.17  4.50  204.00  186.00
For each row, I'm trying to create a new column by comparing the value with upper and lower as follows:
df['new_col'] = df[df['mean'].notnull()].apply(lambda x: False if x['value']>x['upper'] or x['value']<x['lower'] else True)
It gives me an error which is not very clear to me: KeyError: 'value'. I guess it can't find x['value'], right? How do I fix it?
Try writing your condition directly; that gives you the boolean Series without a loop:
df['new_col'] = (~((df['value']>df['upper']) | (df['value']<df['lower'])) & df['mean'].notnull())
OR
via apply(), but it will be slow because it loops under the hood, so make sure to pass axis=1:
df['new_col'] = df[df['mean'].notnull()].apply(lambda x: False if x['value']>x['upper'] or x['value']<x['lower'] else True,axis=1)
output of df:
ID name value mean std upper lower new_col
0 NaN NaN 894.68 154.00 2.33 203.16 189.18 False
1 NaN NaN 1045.28 196.17 4.50 204.00 186.00 False
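A minimal sketch constructing the frame above to verify the vectorized version (ID and name are left as NaN, matching the output shown):
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [np.nan, np.nan], 'name': [np.nan, np.nan],
                   'value': [894.68, 1045.28], 'mean': [154.00, 196.17],
                   'std': [2.33, 4.50], 'upper': [203.16, 204.00],
                   'lower': [189.18, 186.00]})
df['new_col'] = (~((df['value'] > df['upper']) | (df['value'] < df['lower'])) & df['mean'].notnull())
print(df['new_col'].tolist())  # [False, False]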

How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.
Initially I tried for-loop on each value of the dataframe which was taking too much time.
Then I used data_new=data.subtract(data), which was meant to subtract all the values of the dataframe from itself so that I could make all the non-null values 0.
But an error occurred as the dataframe had multiple string entries.
You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise, and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:
newdf = df.notnull().astype('int')
If you really want to write into your original DataFrame, this will work:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
Use notnull with casting boolean to int by astype:
print ((df.notnull()).astype('int'))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
a b
0 NaN 1.0
1 4.0 NaN
2 NaN 3.0
print (df.notnull())
a b
0 False True
1 True False
2 False True
print ((df.notnull()).astype('int'))
a b
0 0 1
1 1 0
2 0 1
I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to have a source for a column populated via an operation on another.
e.g. if df['col1'] is the existing column
df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else np.nan)
where col2 is the new column. Should also work if col1 has string entries.
I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.
I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import datetime as dt
# create dataframe with randomly placed NaN's
data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)
trials = np.arange(100)
d1 = dt.datetime.now()
for r in trials:
    new_df = df.notnull().astype(int)
print((dt.datetime.now() - d1).total_seconds() / trials.size)
# create a dummy copy of df. I use a dummy copy here to prevent biasing the
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()
d1 = dt.datetime.now()
for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull() == False] = 1
print((dt.datetime.now() - d1).total_seconds() / trials.size)
This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
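For comparison, the standard-library timeit module can run the same measurement with less boilerplate (a hedged sketch, not part of the original benchmark; absolute numbers will vary by machine):
import timeit
import numpy as np
import pandas as pd
data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)
# average seconds per call over 100 runs of the notnull/astype approach
print(timeit.timeit(lambda: df.notnull().astype(int), number=100) / 100)
df_dummy = df.copy()
def indexed_replace():
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull() == False] = 1
# average seconds per call over 100 runs of the boolean-indexing approach
print(timeit.timeit(indexed_replace, number=100) / 100)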
There is a method .fillna() on DataFrames which does what you need. For example:
df = df.fillna(0) # Replace all NaN values with zero, returning the modified DataFrame
or
df.fillna(0, inplace=True) # Replace all NaN values with zero, updating the DataFrame directly
Regarding fmarc's answer:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
The code above does not work for me; the code below does.
df[~df.isnull()] = 1 # not nan
df[df.isnull()] = 0 # nan
This is with pandas 0.25.3.
And if you want to just change values in specific columns, you may need to create a temp dataframe and assign it to the columns of the original dataframe:
change_col = ['a', 'b']
tmp = df[change_col]
tmp[tmp.isnull()] = 'xxx'
df[change_col] = tmp
Try this one:
df.notnull().mul(1)
Here is a suggestion for a particular column: if a row in that column is NaN, replace it with 0; if it has a value, replace it with 1.
The line below replaces the NaN values in your column with 0:
df.YourColumnName.fillna(0,inplace=True)
Now the remaining non-NaN part can be replaced with 1 by the code below:
df["YourColumnName"]=df["YourColumnName"].apply(lambda x: 1 if x!=0 else 0)
The same can be applied to the whole dataframe by not specifying a column name.
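A minimal sketch of those two steps on a toy column (the column name is the placeholder from the answer, and the first step is written as a plain assignment instead of inplace=True); note that rows whose original value was 0 would also end up as 0 with this approach:
import numpy as np
import pandas as pd
df = pd.DataFrame({'YourColumnName': [np.nan, 4, np.nan]})
df['YourColumnName'] = df['YourColumnName'].fillna(0)  # step 1: NaN -> 0
df['YourColumnName'] = df['YourColumnName'].apply(lambda x: 1 if x != 0 else 0)  # step 2: non-zero -> 1
print(df['YourColumnName'].tolist())  # [0, 1, 0]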
Use: df.fillna(0)
to fill NaN with 0.
Generally there are two steps: substitute all non-NaN values, then substitute all NaN values.
dataframe.where(~dataframe.notna(), 1) - this line will replace all non-NaN values with 1.
dataframe.fillna(0) - this line will replace all NaNs with 0.
Side note: if you look at the pandas documentation, .where() replaces values where the condition is False - this is the important point. That is why we use the inversion ~dataframe.notna() as the mask, by which .where() replaces the non-NaN values.
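Putting the two steps together on the small sample frame used earlier in this thread (a minimal sketch):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1, np.nan, 3]})
# keep values only where the condition is True (i.e. where they are NaN),
# replace the rest with 1, then fill the remaining NaN with 0
out = df.where(~df.notna(), 1).fillna(0)
print(out)
#      a    b
# 0  0.0  1.0
# 1  1.0  0.0
# 2  0.0  1.0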

How to get columns index which meet some condition in pandas?

I have the following:
x = pd.DataFrame({'a':[1,5,5], 'b':[7,0,7]})
And for every row, I want to get the index of the first column whose value is greater than some threshold, let's say greater than 4.
In this example, the answer is 1 (corresponding to the value 7 in the first row), 0 (corresponding to the value 5 in the second row), and 0 (corresponding to the value 5 in the third row).
Which means the answer is [1, 0, 0].
I tried it with the apply method:
def get_values_from_row(row, th=0.9):
    """Get the first column whose value is larger than a threshold.

    Args:
        row (pd.Series): a row.
        th (float): the threshold.

    Returns:
        The label of the first column whose value meets the condition.
    """
    return row[row > th].index.tolist()[0]
It works, but I have a large data set, and it's quite slow.
What is a better alternative?
I think you need first_valid_index with get_loc:
print (x[x > 4])
a b
0 NaN 7.0
1 5.0 NaN
2 5.0 7.0
print (x[x > 4].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1))
0 1
1 0
2 0
dtype: int64
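If every row is guaranteed to have at least one value above the threshold, a NumPy-based sketch can be faster (this is my own assumption, not taken from the answer above):
import numpy as np
import pandas as pd
x = pd.DataFrame({'a': [1, 5, 5], 'b': [7, 0, 7]})
# argmax over the boolean mask gives the position of the first True per row;
# beware that it silently returns 0 for rows with no value above the threshold
first_pos = (x > 4).to_numpy().argmax(axis=1)
print(first_pos)  # [1 0 0]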

Pandas. Selection by label. One-row output

I'm trying to select every entry in a pandas DataFrame D corresponding to a certain userid, filling missing etime values with zeros as follows:
user_entries = D.loc[userid]
user_entries.index = user_entries.etime
user_entries = user_entries.reindex(range(distinct_time_entries_num))
user_entries = user_entries.fillna(0)
The problem is, for some ids there exists exactly one entry, and thus .loc returns a Series object with an unexpected index:
(Pdb) user_entries.index = user_entries.etime
*** TypeError: Index(...) must be called with a collection of some kind, 388 was passed
(Pdb) user_entries
etime 388
requested 1
rejected 0
Name: 351, dtype: int64
(Pdb) user_entries.index
Index([u'etime', u'requested', u'rejected'], dtype='object')
which is painful to handle. I'd seriously prefer a DataFrame object with one row. Is there any way around it? Thanks.
UPD: I have to apologize for the unintelligible formulation, this is my first post here. I'll try again.
So the deal is: there is a dataframe, indexed by userid. Every userid can have up to some number N of corresponding dataframe rows (the columns are 'etime', 'requested', 'rejected'), for which 'etime' is basically the key. For some 'userid' all N corresponding entries exist, but for most of them, entries are missing for some 'etime'.
My intention is: for every 'userid', construct an explicit DataFrame object containing all N entries indexed by 'etime', filled with zeros for the missing entries. That's why I'm changing the index to 'etime' and then reindexing the selected row subset with the full 'etime' range.
The problem is: for some 'userid' there is exactly one corresponding 'etime', for which .loc subsetting returns not a dataframe with one row indexed by 'userid' but a Series object indexed by the array:
Index([u'etime', u'requested', u'rejected'], dtype='object')
And that's why changing index fails. Checking dimensions and index every time I select some dataframe subset looks pretty ugly. What else can I do about it?
UPD2: here is the script demonstrating the case
from pandas import DataFrame

full_etime_range = range(10)
df = DataFrame(index=[0, 0, 1],
               columns=['etime', 'requested'],
               data=[[0, 1], [1, 1], [1, 1]])
for i in df.index:
    tmp = df.loc[i]
    tmp.index = tmp['etime']
    tmp = tmp.reindex(full_etime_range, fill_value=0)
    print(tmp)
So, starting with df being your dataframe, we can do the following safely:
In[215]: df.set_index([df.index, 'etime'], inplace=True)
In[216]: df
Out[216]:
         requested
  etime
0 0              1
  1              1
1 1              1
DF = pd.DataFrame(index=full_etime_range, columns=[])
df0 = DF.copy()
In[225]: df0.join(df.loc[0])
Out[225]:
requested
0 1
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
In[230]: df1 = DF.copy()
In[231]: df1.join(df.loc[1])
Out[231]:
requested
0 NaN
1 1
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
which is technically what you want. But behold, we can do this nicer:
listOfDf = [DF.copy().join(df.loc[i]) for i in df.index.get_level_values(1).unique()]
I wanted to do it even one level nicer, but the following did not work - maybe someone can chip in why.
df.groupby(level=0).apply(lambda x: DF.copy().join(x))
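One possible reason, as a hedged guess: inside apply each group still carries the full two-level index, so the join cannot align with DF's plain range index. Dropping the outer level first seems to work in a quick sketch, though I have only tried it against the toy data above:
df.groupby(level=0).apply(lambda x: DF.copy().join(x.droplevel(0)))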
Are you just trying to fill NAs? Why are you reindexing the dataframe?
Just
user_entries = D.loc[userid]
user_entries = user_entries.fillna(0)
Should do the trick. But if you only want to fill NAs in the etime field, what you should do is:
user_entries = D.loc[userid]
temp = user_entries["etime"].fillna(0)
user_entries["etime"] = temp
Hope it helps. If not, clarify what you're trying to achieve
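As a side note (a hedged sketch, not from the answer above): selecting with a list of labels keeps the result two-dimensional, so even a single matching row comes back as a DataFrame and the etime re-indexing works without special-casing:
user_entries = D.loc[[userid]]  # always a DataFrame, even when only one row matches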
