Replacing value based on conditional pandas - python

How do you replace values in a DataFrame based on a condition applied to the entire DataFrame, not just a single column? I have tried to use df.where, but this doesn't work as planned:
df = df.where(operator.and_(df > (-1 * .2), df < 0),0)
df = df.where(df > 0 , df * 1.2)
Basically, what I'm trying to do here is replace all values between -0.2 and 0 with zero across all columns in my DataFrame, and multiply all values greater than zero by 1.2.

You've misunderstood the way pandas.where works: it keeps the values of the original object where the condition is true and replaces them where it is false, so you need to reverse your logic:
df = df.where((df <= (-1 * .2)) | (df >= 0), 0)
df = df.where(df <= 0, df * 1.2)
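A quick check on a small frame (the sample values here are hypothetical) confirms the reversed conditions do what the asker wants:

```python
import pandas as pd

# Hypothetical sample values: one in (-0.2, 0), one positive, one below -0.2
df = pd.DataFrame({'a': [-0.1, 0.5, -0.3], 'b': [-0.05, 2.0, 0.0]})

# Keep values outside (-0.2, 0); zero out the rest
df = df.where((df <= -0.2) | (df >= 0), 0)
# Keep non-positive values; scale positives by 1.2
df = df.where(df <= 0, df * 1.2)

print(df)
# a: [0.0, 0.6, -0.3], b: [0.0, 2.4, 0.0]
```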

where allows you to have a one-line solution, which is great. I prefer to use a mask like so.
idx = (df < 0) & (df >= -0.2)
df[idx] = 0
I prefer breaking this into two lines because it is easier to read, but you could force it onto a single line as well.
df[(df < 0) & (df >= -0.2)] = 0
Just another option.

Related

Create dynamic conditions list of ranges for numpy.select

I'm relatively new to Python and hoping someone can help point me in the right direction.
For context, I want to create a new column in a pandas DataFrame that assigns a score of linearly spaced integer values based on which range the values in an existing column fall into.
There is a lower and upper bound, say 0 and 0.75. Being below or above those respectively will yield the lowest / highest value.
Written manually with relatively few conditions it looks like this using np.select():
import numpy as np
import pandas as pd

d = {'col1': [-1, 0, .1, .6, .8], 'col2': [-4, -0.02, 0.07, 1, 2]}
df = pd.DataFrame(data=d)
conditions = [
    (df['col1'] < 0),
    (df['col1'] >= 0) & (df['col1'] <= .25),
    (df['col1'] >= .25) & (df['col1'] <= .5),
    (df['col1'] >= .5) & (df['col1'] <= .75),
    (df['col1'] >= .75)
]
values = [0, 1, 2, 3, 4]
df['col3'] = np.select(conditions,values,default=None)
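For reference, a self-contained run of the manual version (np.select picks the first matching condition at the overlapping boundaries) yields:

```python
import numpy as np
import pandas as pd

d = {'col1': [-1, 0, .1, .6, .8], 'col2': [-4, -0.02, 0.07, 1, 2]}
df = pd.DataFrame(data=d)

conditions = [
    (df['col1'] < 0),
    (df['col1'] >= 0) & (df['col1'] <= .25),
    (df['col1'] >= .25) & (df['col1'] <= .5),
    (df['col1'] >= .5) & (df['col1'] <= .75),
    (df['col1'] >= .75),
]
values = [0, 1, 2, 3, 4]
df['col3'] = np.select(conditions, values, default=None)

print(df['col3'].tolist())
# [0, 1, 1, 3, 4]
```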
I would like to be able to dynamically divide the mid-range between bounds into many more conditions, which is easy enough using np.linspace.
Where I'm having trouble is in assigning the values. I have tried to do this using pd.cut and operating on a list to feed into np.select. This is the closest I have come with these:
d = {'col1': [-1, 0, .1, .6, .8],'col2': [-4,-0.02, 0.07, 1, 2]}
df = pd.DataFrame(data=d)
conditions_no = 9 # Choose number of conditions to divide the mid-range
choices = [n for n in range(1, conditions_no + 2)] # Assign values to apply starting from 1
mid_range = np.linspace(0,.75,conditions_no) # Divide mid-range by number of conditions
mid_range = np.append(mid_range[0],mid_range) # Repeat lower bound at start for < condition
cols = ['df["col1"]' for c in range(0, conditions_no + 1)] # Generate list of column references
conditions = list(zip(cols, mid_range)) # List with thresholds as values, column reference as key
conditions = [f'{k} >= {v}' for k, v in conditions] # Combine column references and thresholds into strings
conditions[0] = conditions[0].replace('>=','<') # Change first condition to less than lower bound
conditions = conditions[::-1] # Reverse values and assigned choices to check > highest value first
choices = choices[::-1]
Here the conditions are a list of strings rather than code:
['df["col1"] >= 0.75',
'df["col1"] >= 0.65625',
'df["col1"] >= 0.5625',
'df["col1"] >= 0.46875',
'df["col1"] >= 0.375',
'df["col1"] >= 0.28125',
'df["col1"] >= 0.1875',
'df["col1"] >= 0.09375',
'df["col1"] >= 0.0',
'df["col1"] < 0.0']
So they understandably throw an error:
df['col3'] = np.select(conditions, choices, default=None)
# TypeError: invalid entry 0 in condlist: should be boolean ndarray
I understand that eval() might be able to help here, but haven't been able to find a way to get that to run with np.select. I've also read that it's best to try and avoid using eval().
This is the effort so far using pd.cut:
conditions = 9
choices = [n for n in range(1, conditions + 2)]
mid_range = np.linspace(0,.75,conditions)
mid_range = np.append(-float("inf"),mid_range)
mid_range = np.append(mid_range,float("inf"))
df['col3'] = pd.cut(df['col1'], mid_range, labels=choices)
df['col4'] = pd.cut(df['col2'], mid_range, labels=choices)
This works, but assigns a categorical that I then can't operate on as needed:
df['col3'] + df['col4']
# TypeError: unsupported operand type(s) for +: 'Categorical' and 'Categorical'
After everything I've looked up, I keep coming back to np.select as likely being the best solution here. However, I can't figure out how to dynamically create the conditions - are any of these efforts along the right lines or is there a better approach I should look at?
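Since np.select only needs boolean arrays, one way to build the conditions dynamically (a sketch along the lines of the string-based attempt, but comparing the column against each breakpoint directly instead of building strings) is:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [-1, 0, .1, .6, .8]})

conditions_no = 9
thresholds = np.linspace(0, .75, conditions_no)

# Check the highest threshold first and descend; the final condition
# catches everything below the lower bound.
conditions = [df['col1'] >= t for t in thresholds[::-1]]
conditions.append(df['col1'] < thresholds[0])

# Highest score for the top bucket, 0 for below the lower bound
choices = list(range(conditions_no, 0, -1)) + [0]

df['col3'] = np.select(conditions, choices, default=0)
print(df['col3'].tolist())
# [0, 1, 2, 7, 9]
```

As for the pd.cut attempt: since the labels are integers, converting the Categorical back to a numeric dtype with df['col3'].astype(int) should make arithmetic like df['col3'] + df['col4'] work.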

How to use df.apply to switch between columns?

Consider the following code.
import numpy as np
import pandas as pd

np.random.seed(0)
df_H = pd.DataFrame({'L0': np.random.randn(100),
                     'OneAndZero': np.random.randn(100),
                     'OneAndTwo': np.random.randn(100),
                     'GTwo': np.random.randn(100),
                     'Decide': np.random.randn(100)})
I would like to create a new column named Result, which depends on the value of the column Decide: if the value in Decide is less than 0, Result should take the corresponding row's value from L0; between 0 and 1, it should grab the value in OneAndZero; between 1 and 2, OneAndTwo; and if Decide is > 2, it should grab GTwo.
How would one do this with df.apply since I have only seen examples with fixed values and not values from other columns?
Just because it is Good Friday, we can try the following; otherwise, it is a commonly asked question.
c1=df_H['Decide'].le(0)
c2=df_H['Decide'].between(0,1)
c3=df_H['Decide'].between(1,2)
c4=df_H['Decide'].gt(2)
cond=[c1,c2,c3,c4]
choices=[df_H['L0'],df_H['OneAndZero'],df_H['OneAndTwo'],df_H['GTwo']]
df_H['Result']=np.select(cond,choices)
df_H
If you really want to use apply:
def choose_res(x):
    if x['Decide'] <= 0:
        return x['L0']
    if 0 < x['Decide'] <= 1:
        return x['OneAndZero']
    if 1 < x['Decide'] <= 2:
        return x['OneAndTwo']
    if x['Decide'] > 2:
        return x['GTwo']

df_H['Result'] = df_H.apply(axis=1, func=choose_res, result_type='expand')
Using df.iloc:
df_H.reset_index(drop=True, inplace=True)
for i in range(len(df_H)):
    a = df_H['Decide'].iloc[i]
    if 0 <= a <= 1:
        b = df_H['OneAndZero'].iloc[i]
        df_H.loc[i, 'Result'] = b
    if 1 < a <= 2:
        b = df_H['OneAndTwo'].iloc[i]
        df_H.loc[i, 'Result'] = b
Maybe you can try this way.
df.apply
If you want to use apply, create a function that contains the conditions and returns the output, then use this code (axis=1 is needed so the function receives a row rather than a column):
df_H['Result'] = df_H.apply(your_function_name, axis=1)

pandas calculate/show dataframe cumsum() only for positive values and other condition

How can I make this cumsum() calculate and show values in the new column only for rows where df.col_2 == 'closed' and df.col_values > 0?
df['new_col'] = df.groupby('col_1')['col_values'].cumsum()
Here is a solution (but there might be a more elegant one):
indexes = (df.col_2 == 'closed') & (df.col_values > 0)
df.loc[indexes, 'new_col'] = df.loc[indexes].groupby('col_1')['col_values'].cumsum()
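A small sketch with made-up data (column names as in the question, values hypothetical) shows the effect: excluded rows stay NaN, and the running sum only advances on qualifying rows within each group.

```python
import pandas as pd

# Hypothetical data: col_1 holds the groups, col_2 a status, col_values amounts
df = pd.DataFrame({
    'col_1': ['a', 'a', 'a', 'b', 'b'],
    'col_2': ['closed', 'open', 'closed', 'closed', 'closed'],
    'col_values': [1, 2, 3, -1, 4],
})

indexes = (df.col_2 == 'closed') & (df.col_values > 0)
df.loc[indexes, 'new_col'] = df.loc[indexes].groupby('col_1')['col_values'].cumsum()

print(df)
# new_col: 1.0, NaN, 4.0, NaN, 4.0
```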

Pandas Mask on multiple Conditions

In my dataframe I want to substitute every value below 1 and higher than 5 with nan.
This code works
persDf = persDf.mask(persDf < 1000)
and I get every value as NaN, but this one does not:
persDf = persDf.mask((persDf < 1) and (persDf > 5))
and I have no idea why this is so. I have checked the documentation and solutions to apparently similar problems, but could not find an answer. Does anyone have an idea that could help me with this?
Use the | operator, because a value can't be < 1 and > 5 at the same time. Also note that Python's and cannot be applied element-wise to a DataFrame (it raises an ambiguous-truth-value error); the bitwise operators & and | are required:
persDf = persDf.mask((persDf < 1) | (persDf > 5))
Another method would be to use np.where and wrap the result in pd.DataFrame:
pd.DataFrame(data=np.where((df < 1) | (df > 5), np.nan, df),
             columns=df.columns)
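Both forms should produce the same frame; a quick check with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with values inside and outside the [1, 5] range
persDf = pd.DataFrame({'x': [0.5, 1.0, 3.0, 5.0, 7.0]})

masked = persDf.mask((persDf < 1) | (persDf > 5))
rebuilt = pd.DataFrame(data=np.where((persDf < 1) | (persDf > 5), np.nan, persDf),
                       columns=persDf.columns)

# Both leave 1.0, 3.0, 5.0 intact and turn 0.5 and 7.0 into NaN
print(masked.equals(rebuilt))
```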

UserWarning from Python pandas about reindexed Boolean Series key

When I use boolean indexing on a pandas.DataFrame, it gives UserWarnings, although they don't affect the output. I would like to know what causes these warnings and how to avoid them. Thanks for everyone's attention.
df = pandas.DataFrame([[k, ass4Dict[k], k[:2], k[-2:]] for k in ass4Dict])
df.columns = ['string', 'count', 'lstr', 'rstr']
df = df[df['count'] >= 10]
df = df[df['lstr'].map(lambda x:x in gram2Dict)][df['rstr'].map(lambda x:x in gram2Dict)]
df['lstrCNT'] = df['lstr'].map(lambda x: float(gram2Dict[x]))
df['rstrCNT'] = df['rstr'].map(lambda x: float(gram2Dict[x]))
df['conPow'] = df['lstrCNT'] * df['rstrCNT']
df['lstrPow'] = df['count'] / df['lstrCNT']
df['rstrPow'] = df['count'] / df['rstrCNT']
df['aux4Ratio'] = df['count'] / df['conPow']
df['aux4Log'] = df['aux4Ratio'].map(lambda x: -log(x))
df = df[df['aux4Log'] < 11][df['lstrPow'] >= 0.5][df['rstrPow'] >= 0.5]
....
沈钦言 359
纪小蕊 158
顾持钧 949
林晋修 642
4
0.256721019745 1.22976207733
ch_extract.py:153: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df = df[df['lstr'].map(lambda x:x in gram2Dict)][df['rstr'].map(lambda x:x in gram2Dict)]
ch_extract.py:161: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
df = df[df['aux4Log'] < 11][df['lstrPow'] >= 0.5][df['rstrPow'] >= 0.5]
If we just take the last line and recreate it as the following example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,50,size=(50, 4)), columns=list('ABCD'))
df[df.A < 11][df.B >= 25][df.C >= 25]
The final line is stringing together a series of slices. After the first slice, each subsequent slice needs to be reindexed as the resulting data frame does not contain all the items in the previous slice.
In this case, the proper form would be to combine your boolean slice into one expression:
df[(df.A < 11) & (df.B >= 25) & (df.C >= 25)]
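Recreating the example shows the combined mask filters in one step; no warning is emitted, and every surviving row satisfies all three conditions:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 50, size=(50, 4)), columns=list('ABCD'))

# One combined boolean mask: no intermediate frames, no reindexing warning
result = df[(df.A < 11) & (df.B >= 25) & (df.C >= 25)]

# Every surviving row satisfies all three conditions
print(((result.A < 11) & (result.B >= 25) & (result.C >= 25)).all())
```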
Some other cases that might cause this warning would be as follows:
df[df.sort_values(['A'], ascending=[False]).duplicated(subset='B', keep='first')]
In which case, use the loc command:
df.loc[df.sort_values(['A'], ascending=[False]).duplicated(subset='B', keep='first')]
