pandas change values on multiple column based on condition

pandas change values on multiple column based on condition - python

I have a data frame like
x y w h
0 1593.826218 1293.189452 353.268389 74.493565
1 1680.089430 1956.536916 87.632469 42.567752
2 1362.421731 1908.648195 52.031778 42.567752
3 1599.303248 1385.419580 351.899131 78.040878
4 1500.716721 1121.144789 397.084623 46.115064
5 1513.040037 1186.770072 514.840753 86.909160
6 1387.068363 1804.002472 212.234885 44.341408
7 787.333657 379.756446 416.254225 70.946253
I want to select rows based on certain value ranges in x and y and find the values in all four x,y,w,h and perform addition or subtraction on those values and replace them with the calculated value in that row.
I am doing something like
df.loc[(df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290), ['x', 'y', 'w','h']] = df['x'] - 20, df['y'] - 165, df['w'] + 26, df['h'] - 29
and getting error:
"Must have equal len keys and value when setting with an ndarray"
when I tried this
df.loc[(df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290), 'x'] = df['x'] - 20
it works but I want to perform operation on all four columns in one go and update the values.
My desired answer is it should select row 5 and my answer should be like
x y w h
5 1493.040037 1021.770072 540.840753 57.909160
Any help will be much appreciated.

Let us fix your code
m = (df['x'] >= 1000) & (df['x'] < 1800) \
& (df['y'] >= 1150) & (df['y'] < 1290)
df.loc[m] += [-20, -165, 26, -29]
x y w h
0 1593.826218 1293.189452 353.268389 74.493565
1 1680.089430 1956.536916 87.632469 42.567752
2 1362.421731 1908.648195 52.031778 42.567752
3 1599.303248 1385.419580 351.899131 78.040878
4 1500.716721 1121.144789 397.084623 46.115064
5 1493.040037 1021.770072 540.840753 57.909160 *** updated
6 1387.068363 1804.002472 212.234885 44.341408
7 787.333657 379.756446 416.254225 70.946253

With your approach , you can use pd.concat on the R.H.S
df.loc[(df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290), ['x', 'y', 'w','h']]=pd.concat((df['x'] - 20, df['y'] - 165, df['w'] + 26, df['h'] - 29),axis=1)
x y w h
0 1593.826218 1293.189452 353.268389 74.493565
1 1680.089430 1956.536916 87.632469 42.567752
2 1362.421731 1908.648195 52.031778 42.567752
3 1599.303248 1385.419580 351.899131 78.040878
4 1500.716721 1121.144789 397.084623 46.115064
5 1493.040037 1021.770072 540.840753 57.909160
6 1387.068363 1804.002472 212.234885 44.341408
7 787.333657 379.756446 416.254225 70.946253

You have to assign with an array of the same shape. Easiest way is to use the original df:
m = (df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290)
df.loc[m] = df.assign(x=df["x"]-20, y=df["y"]-165, w=df['w']+26, h=df['h']-29)
print (df[m])
x y w h
5 1493.040037 1021.770072 540.840753 57.90916

Related

Python: Add a complex conditional column without for loop

I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient.
Can my code be simplified and made more efficient?
(I've tried masks but I can't get my head around the syntax as I'm a relative newbie to python).
import pandas as pd
path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"
racecard_path = path + hist_file
df = pd.read_excel(racecard_path)
df["Mask"] = df["HxFPos"].copy
df["Total"] = df["HxFPos"].copy
cnt = -1
for trn in df["HxRun"]:
cnt = cnt + 1
if df.loc[cnt,"HxFPos"] > 6 or df.loc[cnt,"HxTotalBtn"] > 30:
df.loc[cnt,"Mask"] = 0
elif df.loc[cnt,"HxFPos"] < 2 and df.loc[cnt,"HxRun"] < 4 and df.loc[cnt,"HxTotalBtn"] < 10:
df.loc[cnt,"Mask"] = 1
elif df.loc[cnt,"HxFPos"] < 4 and df.loc[cnt,"HxRun"] < 9 and df.loc[cnt,"HxTotalBtn"] < 10:
df.loc[cnt,"Mask"] = 1
elif df.loc[cnt,"HxFPos"] < 5 and df.loc[cnt,"HxRun"] < 20 and df.loc[cnt,"HxTotalBtn"] < 20:
df.loc[cnt,"Mask"] = 1
else:
df.loc[cnt,"Mask"] = 0
df.loc[cnt,"Total"] = df.loc[cnt,"Mask"] * df.loc[cnt,"HxFPos"]
df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)
Sample data/output:
HxRun HxFPos HxTotalBtn Mask Total
7 5 8 0 0
13 3 2.75 1 3
12 5 3.75 0 0
11 5 5.75 0 0
11 7 9.25 0 0
11 9 14.5 0 0
10 10 26.75 0 0
8 4 19.5 1 4
8 8 67 0 0

Use df.assign() for a complex vectorized expression
Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
.loc[]
df.assign()
or alternatively df.query (if you like SQL syntax)
or if you insist on doing it by iteration (you shouldn't), you never need to use an explicit for-loop with .loc[] as you did, you can use:
df.apply(your_function_or_lambda, axis=1)
or df.iterrows() as a fallback
df.assign() (or df.query) are going to be less grief when you have long column names (as you do) which get used repreatedly in a complex expression.
Solution with df.assign()
Rewrite your fomula for clarity
When we remove all the unneeded .loc[] calls your formula boils down to:
HxFPos > 6 or HxTotalBtn > 30:
Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
Mask = 1
HxFPos < 5 and HxFPos < 20 and HxTotalBtn < 20:
Mask = 1
else:
Mask = 0
pandas doesn't have a native case-statement/method.
Renaming your variables HxFPos->f, HxFPos->r, HxTotalBtn->btn for clarity:
(f > 6) or (btn > 30):
Mask = 0
(f < 2) and (r < 4) and (btn < 10):
Mask = 1
(f < 4) and (r < 9) and (btn < 10):
Mask = 1
(f < 5) and (r < 20) and (btn < 20):
Mask = 1
else:
Mask = 0
So really the whole boolean expression for Mask is gated by (f <= 6) or (btn <= 30). (Actually your clauses imply you can only have Mask=1 for (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f<= 6) & (btn <= 30)) & ... you_do_the_rest
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a logical expression (vector of 8 bools); you can use that directly in df.loc[logical_expression_for_row, 'Mask'].
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)

Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column
by #Hossein-Kalbasi
I've just found an answer - please comment if this is not the most efficient.
df.loc[(((df['HxFPos']<3)&(df['HxRun']<5)|(df['HxRun']>4)&(df['HxFPos']<5)&(df['HxRun']<9)|(df['HxRun']>8)&(df['HxFPos']<6)&(df['HxRun']<30))&(df['HxTotalBtn']<30)), 'Mask'] = 1

What will be the mean of a conditional output

let's take a condition as :
(df['a'] > 10) & (df['a'] < 20)
This condition will give a true false output.
What will be the mean of this conditional output?
i.e np.mean((df['a'] > 10) & (df['a'] < 20)) = ?

It will give the mean of all the values that is > 10 and < 20.
to get the mean value you have to use square bracket
np.mean(df[(df['a'] > 10) & (df['a'] < 20)])

It working same like 1 and 0 values instead True and False values, so it return percentage of matched values of both conditions:
df = pd.DataFrame({'a':[9,13,23,16,23]})
m = (df['a'] > 10) & (df['a'] < 20)
print (m)
0 False
1 True
2 False
3 True
4 False
Name: a, dtype: bool
There is 2 matched values from 5 values, so percentage is 2/5=0.4:
print (m.mean())
0.4

Creating a function to iterate through DataFrame

I am running into an issue creating a function that will recognize if a particular value in a column is between two values.
def bid(x):
if df['tla'] < 85000:
return 1
elif (df['tla'] >= 85000) & (df['tla'] < 110000):
return 2
elif (df['tla'] >= 111000) & (df['tla'] < 126000):
return 3
elif (df['tla'] >= 126000) & (df['tla'] < 150000):
return 4
elif (df['tla'] >= 150000) & (df['tla'] < 175000):
return 5
elif (df['tla'] >= 175000) & (df['tla'] < 200000):
return 6
elif (df['tla'] >= 200000) & (df['tla'] < 250000):
return 7
elif (df['tla'] >= 250000) & (df['tla'] < 300000):
return 8
elif (df['tla'] >= 300000) & (df['tla'] < 375000):
return 9
elif (df['tla'] >= 375000) & (df['tla'] < 453100):
return 10
elif df['tla'] >= 453100:
return 11
I apply that to my new column:
df['bid_bucket'] = df['bid_bucket'].apply(bid)
And I am getting this error back:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Anyone have any ideas?

try the following using numpy.select
import numpy as np
values = [1,2,3,4,5,6,7,8,9,10,11]
cond = [df['tla']<85000, (df['tla'] >= 850000) & (df['tla'] < 110000), .... ]
df['bid_bucket'] = np.select(cond, values)

This can already be accomplished with pd.cut, defining the bin edges, and adding +1 to the labels to get your numbering to start at 1.
import pandas as pd
import numpy as np
df = pd.DataFrame({'tla': [7, 85000, 111000, 88888, 51515151]})
df['bid_bucket'] = pd.cut(df.tla, right=False,
bins=[-np.inf, 85000, 110000, 126000, 150000, 175000,
200000, 250000, 300000, 375000, 453100, np.inf],
labels=False)+1
Output: df
tla bid_bucket
0 7 1
1 85000 2
2 111000 3
3 88888 2
4 126000 4
5 51515151 11

You can simply use the np.digitize function to assign the ranges
df['bid_bucket'] = np.digitize(df['bid_bucket'],np.arange(85000,453100,25000))
Example
a = np.random.randint(85000,400000,10)
#array([305628, 134122, 371486, 119856, 321423, 346906, 319321, 165714,360896, 206404])
bins=[-np.inf, 85000, 110000, 126000, 150000, 175000,
200000, 250000, 300000, 375000, 453100, np.inf]
np.digitize(a,bins)
Out:
array([9, 4, 9, 3, 9, 9, 9, 5, 9, 7])

To keep it in pandas: I think referencing df['tla'] in your function means to reference a series instead of a single value which leads to the ambiguity. You should provide the specific value instead. You could use lambda x, then your code could be something like this
df = pd.DataFrame({'tla':[10,123456,999999]})
def bid(x):
if x < 85000:
return 1
elif (x >= 85000 and x < 110000):
return 2
elif (x >= 111000 and x < 126000):
return 3
elif (x >= 126000 and x < 150000):
return 4
elif (x >= 150000 and x < 175000):
return 5
elif (x >= 175000 and x < 200000):
return 6
elif (x >= 200000 and x < 250000):
return 7
elif (x >= 250000 and x < 300000):
return 8
elif (x >= 300000 and x < 375000):
return 9
elif (x >= 375000 and x < 453100):
return 10
elif x >= 453100:
return 11
df['bid_bucket'] = df['tla'].apply(lambda x: bid(x))
df

You have two possibilities.
Either apply a function defined on a row on the pandas DataFrame in a row-wise way:
def function_on_a_row(row):
if row.tla > ...
...
df.apply(function_on_a_row, axis=1)
In which case keep bid the way you defined it but replace the parameter x with a word like "row" and then the df with "row" to keep the parameters name meaningful, and use:
df.bid_bucket = df.apply(bid, axis=1)
Or apply a function defined on an element on a pandas Series.
def function_on_an_elt(element_of_series):
if element_of_series > ...
...
df.new_column = df.my_column_of_interest.apply(function_on_an_elt)
In your case redefine bid accordingly.
Here you tried to mix both approaches, which does not work.

Pandas categorizing age variable into groups

I have a dataframe df with age and I am working on categorizing the file into age groups with 0s and 1s.
df:
User_ID | Age
35435 22
45345 36
63456 18
63523 55
I tried the following
df['Age_GroupA'] = 0
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
but get this error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
To avoid it, I am going for .loc
df['Age_GroupA'] = 0
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
However, this marks all ages as 1
This is what I get
User_ID | Age | Age_GroupA
35435 22 1
45345 36 1
63456 18 1
63523 55 1
while this is the goal
User_ID | Age | Age_GroupA
35435 22 1
45345 36 0
63456 18 1
63523 55 0
Thank you

Due to peer pressure (#DSM), I feel compelled to breakdown your error:
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
this is chained indexing/assignment
so what you tried next:
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
is incorrect form, when using loc you want:
df.loc[<boolean mask>, cols of interest] = some scalar or calculated value
like this:
df.loc[(df['Age_MDB_S'] >= 1) & (df['Age_MDB_S'] <= 25), 'Age_GroupA'] = 1
You could also have done this using np.where:
df['Age_GroupA'] = np.where( (df['Age_MDB_S'] >= 1) & (df['Age_MDB_S'] <= 25), 1, 0)
To do this in 1 line, there are many ways to do this

You can convert boolean mask to int - True are 1 and False are 0:
df['Age_GroupA'] = ((df['Age'] >= 1) & (df['Age'] <= 25)).astype(int)
print (df)
User ID Age Age_GroupA
0 35435 22 1
1 45345 36 0
2 63456 18 1
3 63523 55 0

This worked for me. Jezrael already explained it.
dataframe['Age_GroupA'] = ((dataframe['Age'] >= 1) & (dataframe['Age'] <= 25)).astype(int)

Error when masking 2d numpy array

I'm not sure what the correct terminology is here but I'm trying to mask out some values in a numpy array using multiple conditions from several arrays. For example, I want to find and mask out the areas in X where arrays t/l,lat2d,x, and m meet certain criteria. All the arrays are of the same shape: (250,500). I tried this:
cs[t < 274.0 |
l > 800.0 |
lat2d > 60 |
lat2d < -60 |
(x > 0 & m > 0.8) |
(x < -25 & m < 0.2)] = np.nan
ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''.
I replaced the &,| with and/or and got the error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I've tried creating a mask: mask = t < 274.0 | l > 800.0 | lat2d > 60 | lat2d < -60 | (x > 0 & m > 0.8) | (x < -25 & m < 0.2), in order to use in a masked array but got the same error.
any idea how to do this in Python 3?

This is just a matter of operator precedence:
cs[(t < 274.0) |
(l > 800.0) |
(lat2d > 60) |
(lat2d < -60) |
((x > 0) & (m > 0.8)) |
((x < -25) & (m < 0.2))] = np.nan
should work

You could do it using with a python function and then applying that function on the array.
def cond(x):
if (np.all(t < 274.0) or np.all(l > 800.0) or np.all(lat2d > 60) or \
np.all(lat2d < -60) or (np.all(x > 0) and np.all(m > 0.8)) or \
(np.all(x < -25) and np.all(m < 0.2))):
return np.nan
Then apply this function on the array:
cs[:] = np.apply_along_axis(cond, 0, cs)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

pandas change values on multiple column based on condition - python

Related

Python: Add a complex conditional column without for loop

What will be the mean of a conditional output

Creating a function to iterate through DataFrame

Pandas categorizing age variable into groups

Error when masking 2d numpy array

Categories

Resources