What will be the mean of a conditional output - python

Let's take a condition such as:
(df['a'] > 10) & (df['a'] < 20)
This condition gives a True/False output.
What will be the mean of this conditional output?
i.e. np.mean((df['a'] > 10) & (df['a'] < 20)) = ?

If you want the mean of all the values that are > 10 and < 20, taking the mean of the condition itself is not enough.
To get the mean of the matching values you have to select them with square brackets first:
np.mean(df[(df['a'] > 10) & (df['a'] < 20)])

A boolean mask behaves like 1 and 0 values in place of True and False, so its mean is the fraction of rows that match both conditions:
df = pd.DataFrame({'a':[9,13,23,16,23]})
m = (df['a'] > 10) & (df['a'] < 20)
print (m)
0 False
1 True
2 False
3 True
4 False
Name: a, dtype: bool
There are 2 matching values out of 5, so the fraction is 2/5 = 0.4:
print (m.mean())
0.4
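To make the distinction concrete, here is a minimal sketch using the same sample frame. The variable names fraction_matching, count_matching and mean_of_matching are only illustrative; selecting column 'a' with .loc is one way to get a single number rather than a per-column result.

```python
import pandas as pd

df = pd.DataFrame({'a': [9, 13, 23, 16, 23]})
m = (df['a'] > 10) & (df['a'] < 20)

# Mean of the mask: True counts as 1, False as 0
fraction_matching = m.mean()

# Sum of the mask: number of rows that satisfy the condition
count_matching = m.sum()

# Mean of the selected values themselves (13 and 16)
mean_of_matching = df.loc[m, 'a'].mean()

print(fraction_matching, count_matching, mean_of_matching)
```

So the mask answers "how many rows match", while indexing with the mask answers "what do the matching rows average to".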

Related

applying multiple conditions in pandas dataframe

I want to fill a column with true or false, depending on whether a condition is met.
I know how to use the any() method, but I need to compare the values of two columns. I tried but have not succeeded; using & gives a type error.
My data looks something like
A B condition_met
1 2
3 3
5 9
7 2
The expected output is something like
A B condition_met
1 2 true
3 3 true
5 9 true
7 2 false
I want the value in condition_met to be true if A>3 and B>4.
What I tried was
df.loc[df['A'] > 3 & 'B' > 4, 'condition_met'] = 'True'
Update: I need to check whether the condition is met, i.e. if A>3 then B>4 must hold.
If A<=3 the result must still be true, since the condition doesn't apply.
Run:
df['condition_met'] = (df.A > 3) & (df.B > 4)
Another possible approach is to use logical_and function from Numpy:
df['condition_met'] = np.logical_and(df.A.gt(3), df.B.gt(4))
You can assign the mask directly, but wrap each comparison in parentheses, because & has higher precedence than the comparison operators:
df['condition_met'] = (df.A>3) & (df.B>4)
Or:
df['condition_met'] = df.A.gt(3) & df.B.gt(4)
Your solution, fixed, sets the string 'True' where the condition matches and leaves NaN elsewhere:
df.loc[(df['A'] > 3) & (df['B'] > 4), 'condition_met'] = 'True'
EDIT:
df['condition_met'] = (df.A>3) & (df.B>4) | (df.A <= 3)
print (df)
A B condition_met
0 1 2 True
1 3 3 True
2 5 9 True
3 7 2 False
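The rule from the update, "if A>3 then B>4", is a logical implication, and one way to read it is "either A<=3, or B>4". The sketch below rebuilds the sample data and checks that this shorter form gives the same column as the EDIT expression; the names implication and from_edit are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 7], 'B': [2, 3, 9, 2]})

# "if A > 3 then B > 4" written as "not (A > 3), or B > 4"
implication = (df['A'] <= 3) | (df['B'] > 4)

# The expression from the EDIT above, for comparison
from_edit = (df.A > 3) & (df.B > 4) | (df.A <= 3)

df['condition_met'] = implication
print(df)
```

Both forms rely on each comparison being wrapped in parentheses before & or | is applied.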

Python: Add a complex conditional column without for loop

I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient.
Can my code be simplified and made more efficient?
(I've tried masks but I can't get my head around the syntax as I'm a relative newbie to python).
import pandas as pd
path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"
racecard_path = path + hist_file
df = pd.read_excel(racecard_path)
df["Mask"] = df["HxFPos"].copy()
df["Total"] = df["HxFPos"].copy()
cnt = -1
for trn in df["HxRun"]:
    cnt = cnt + 1
    if df.loc[cnt,"HxFPos"] > 6 or df.loc[cnt,"HxTotalBtn"] > 30:
        df.loc[cnt,"Mask"] = 0
    elif df.loc[cnt,"HxFPos"] < 2 and df.loc[cnt,"HxRun"] < 4 and df.loc[cnt,"HxTotalBtn"] < 10:
        df.loc[cnt,"Mask"] = 1
    elif df.loc[cnt,"HxFPos"] < 4 and df.loc[cnt,"HxRun"] < 9 and df.loc[cnt,"HxTotalBtn"] < 10:
        df.loc[cnt,"Mask"] = 1
    elif df.loc[cnt,"HxFPos"] < 5 and df.loc[cnt,"HxRun"] < 20 and df.loc[cnt,"HxTotalBtn"] < 20:
        df.loc[cnt,"Mask"] = 1
    else:
        df.loc[cnt,"Mask"] = 0
    df.loc[cnt,"Total"] = df.loc[cnt,"Mask"] * df.loc[cnt,"HxFPos"]
df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)
Sample data/output:
HxRun HxFPos HxTotalBtn Mask Total
7 5 8 0 0
13 3 2.75 1 3
12 5 3.75 0 0
11 5 5.75 0 0
11 7 9.25 0 0
11 9 14.5 0 0
10 10 26.75 0 0
8 4 19.5 1 4
8 8 67 0 0
Use df.assign() for a complex vectorized expression
Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
.loc[]
df.assign()
or alternatively df.query (if you like SQL syntax)
or if you insist on doing it by iteration (you shouldn't), you never need to use an explicit for-loop with .loc[] as you did, you can use:
df.apply(your_function_or_lambda, axis=1)
or df.iterrows() as a fallback
df.assign() (or df.query) is going to be less grief when you have long column names (as you do) that get used repeatedly in a complex expression.
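As a rough illustration of the options listed above, the sketch below applies only the first clause of the rule (HxFPos > 6 or HxTotalBtn > 30) to a four-row subset of the sample data. The column name Flag and the variable names are hypothetical, chosen just for this comparison.

```python
import pandas as pd

df = pd.DataFrame({"HxFPos": [5, 3, 7, 9],
                   "HxTotalBtn": [8.0, 2.75, 9.25, 67.0]})

# Option 1: .loc with a boolean mask
via_loc = df.copy()
via_loc["Flag"] = 0
via_loc.loc[(via_loc["HxFPos"] > 6) | (via_loc["HxTotalBtn"] > 30), "Flag"] = 1

# Option 2: df.assign() with a vectorized expression
via_assign = df.assign(
    Flag=lambda d: ((d["HxFPos"] > 6) | (d["HxTotalBtn"] > 30)).astype(int)
)

# Option 3: df.query() filters rows rather than adding a column
flagged_rows = df.query("HxFPos > 6 or HxTotalBtn > 30")

# Option 4 (row-wise fallback): df.apply() with axis=1
via_apply = df.assign(
    Flag=df.apply(lambda row: int(row["HxFPos"] > 6 or row["HxTotalBtn"] > 30), axis=1)
)

print(via_assign)
```

The first two options operate on whole columns at once; the apply version calls a Python function per row and is generally the slowest of the four.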
Solution with df.assign()
Rewrite your formula for clarity
When we remove all the unneeded .loc[] calls your formula boils down to:
HxFPos > 6 or HxTotalBtn > 30:
    Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20:
    Mask = 1
else:
    Mask = 0
pandas doesn't have a native case-statement/method.
Renaming your variables HxFPos->f, HxRun->r, HxTotalBtn->btn for clarity:
(f > 6) or (btn > 30):
    Mask = 0
(f < 2) and (r < 4) and (btn < 10):
    Mask = 1
(f < 4) and (r < 9) and (btn < 10):
    Mask = 1
(f < 5) and (r < 20) and (btn < 20):
    Mask = 1
else:
    Mask = 0
So really the whole boolean expression for Mask is gated by (f <= 6) and (btn <= 30), which is the negation of the first clause. (Actually your clauses imply you can only have Mask=1 for (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f<= 6) & (btn <= 30)) & ... you_do_the_rest
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a logical expression (a vector of 9 bools, one per row); you can use that directly in df.loc[logical_expression_for_row, 'Mask'].
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column
by @Hossein-Kalbasi
I've just found an answer - please comment if this is not the most efficient.
df.loc[(((df['HxFPos']<3)&(df['HxRun']<5)|(df['HxRun']>4)&(df['HxFPos']<5)&(df['HxRun']<9)|(df['HxRun']>8)&(df['HxFPos']<6)&(df['HxRun']<30))&(df['HxTotalBtn']<30)), 'Mask'] = 1
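For comparison, here is one possible vectorized translation of the original for loop (not of the reworked thresholds in the line above), run on the sample data from the question. The names excluded and qualifies are illustrative, and the sketch assumes the loop's if/elif order is what is intended: the first clause forces Mask to 0, and any of the three following clauses sets it to 1.

```python
import pandas as pd

df = pd.DataFrame({
    "HxRun": [7, 13, 12, 11, 11, 11, 10, 8, 8],
    "HxFPos": [5, 3, 5, 5, 7, 9, 10, 4, 8],
    "HxTotalBtn": [8, 2.75, 3.75, 5.75, 9.25, 14.5, 26.75, 19.5, 67],
})

f, r, btn = df["HxFPos"], df["HxRun"], df["HxTotalBtn"]

# First branch of the loop: these rows always get Mask = 0
excluded = (f > 6) | (btn > 30)

# The three elif branches: any one of them gives Mask = 1
qualifies = (
    ((f < 2) & (r < 4) & (btn < 10))
    | ((f < 4) & (r < 9) & (btn < 10))
    | ((f < 5) & (r < 20) & (btn < 20))
)

df["Mask"] = (~excluded & qualifies).astype(int)
df["Total"] = df["Mask"] * df["HxFPos"]
print(df)
```

On this sample the Mask and Total columns match the ones shown in the question, with no explicit loop or row counter.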

pandas change values on multiple column based on condition

I have a data frame like
x y w h
0 1593.826218 1293.189452 353.268389 74.493565
1 1680.089430 1956.536916 87.632469 42.567752
2 1362.421731 1908.648195 52.031778 42.567752
3 1599.303248 1385.419580 351.899131 78.040878
4 1500.716721 1121.144789 397.084623 46.115064
5 1513.040037 1186.770072 514.840753 86.909160
6 1387.068363 1804.002472 212.234885 44.341408
7 787.333657 379.756446 416.254225 70.946253
I want to select rows based on certain value ranges in x and y and find the values in all four x,y,w,h and perform addition or subtraction on those values and replace them with the calculated value in that row.
I am doing something like
df.loc[(df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290), ['x', 'y', 'w','h']] = df['x'] - 20, df['y'] - 165, df['w'] + 26, df['h'] - 29
and getting error:
"Must have equal len keys and value when setting with an ndarray"
when I tried this
df.loc[(df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290), 'x'] = df['x'] - 20
it works but I want to perform operation on all four columns in one go and update the values.
The desired result is that only row 5 is selected and that it is updated to:
x y w h
5 1493.040037 1021.770072 540.840753 57.909160
Any help will be much appreciated.
Let us fix your code
m = (df['x'] >= 1000) & (df['x'] < 1800) \
& (df['y'] >= 1150) & (df['y'] < 1290)
df.loc[m] += [-20, -165, 26, -29]
x y w h
0 1593.826218 1293.189452 353.268389 74.493565
1 1680.089430 1956.536916 87.632469 42.567752
2 1362.421731 1908.648195 52.031778 42.567752
3 1599.303248 1385.419580 351.899131 78.040878
4 1500.716721 1121.144789 397.084623 46.115064
5 1493.040037 1021.770072 540.840753 57.909160 *** updated
6 1387.068363 1804.002472 212.234885 44.341408
7 787.333657 379.756446 416.254225 70.946253
With your approach, you can use pd.concat on the right-hand side:
df.loc[(df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290), ['x', 'y', 'w','h']]=pd.concat((df['x'] - 20, df['y'] - 165, df['w'] + 26, df['h'] - 29),axis=1)
x y w h
0 1593.826218 1293.189452 353.268389 74.493565
1 1680.089430 1956.536916 87.632469 42.567752
2 1362.421731 1908.648195 52.031778 42.567752
3 1599.303248 1385.419580 351.899131 78.040878
4 1500.716721 1121.144789 397.084623 46.115064
5 1493.040037 1021.770072 540.840753 57.909160
6 1387.068363 1804.002472 212.234885 44.341408
7 787.333657 379.756446 416.254225 70.946253
You have to assign a value of the same shape as the selection. The easiest way is to build it from the original df:
m = (df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290)
df.loc[m] = df.assign(x=df["x"]-20, y=df["y"]-165, w=df['w']+26, h=df['h']-29)
print (df[m])
x y w h
5 1493.040037 1021.770072 540.840753 57.90916
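A self-contained variant of the same idea, for anyone who wants to run it: the sketch below rebuilds the sample frame and keeps the per-column adjustments in a Series keyed by column name, so each offset is matched to its column by label rather than by position. The names cols and offsets are illustrative and not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1593.826218, 1680.089430, 1362.421731, 1599.303248,
          1500.716721, 1513.040037, 1387.068363, 787.333657],
    'y': [1293.189452, 1956.536916, 1908.648195, 1385.419580,
          1121.144789, 1186.770072, 1804.002472, 379.756446],
    'w': [353.268389, 87.632469, 52.031778, 351.899131,
          397.084623, 514.840753, 212.234885, 416.254225],
    'h': [74.493565, 42.567752, 42.567752, 78.040878,
          46.115064, 86.909160, 44.341408, 70.946253],
})

cols = ['x', 'y', 'w', 'h']
offsets = pd.Series({'x': -20, 'y': -165, 'w': 26, 'h': -29})

# Build the mask once, before any values change
m = (df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290)

# DataFrame + Series aligns the Series index with the column labels
df.loc[m, cols] = df.loc[m, cols] + offsets
print(df[m])
```

Computing m up front matters here, because after the update row 5 no longer satisfies the y range.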

How do I correct use of If Statement using Python

I have some values like these and am using an if statement in Python:
a = 11
b = 36
c = 70
if (a > 5 and a < 15) and (b > 25 and b < 40) and (c < 100):
    #do something
and when the values are negative:
a = -11
b = -36
c = -70
if (a < -5 and a < -15) and (b < -25 and b < -40) and (c > -100):
    #do something
but the if statement is not doing anything, and there is no error.
The reason your if statement is not doing anything is that its condition evaluates to False. Your comparisons require a to be less than -5 (True when a = -11) and also less than -15 (False when a = -11), and b to be less than -25 (True when b = -36) and also less than -40 (False when b = -36).
If I evaluate your code it looks like this:
a = -11
b = -36
c = -70
if (a < -5 and a < -15) and (b < -25 and b < -40) and (c > -100):
    # The first comparison parenthesis: (a < -5 and a < -15) evaluates to (True and False)
    # The second comparison parenthesis: (b < -25 and b < -40) evaluates to (True and False)
    # The last comparison parenthesis: (c > -100) evaluates to (True)
    # if (True and False) and (True and False) and (True)
    # if False and False and True
    # if False
    #do something
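If the intent was to mirror the positive case, i.e. a between -15 and -5, b between -40 and -25, and c above -100, then the second comparison in each pair needs to be flipped. That intent is an assumption, since the question does not state it. A minimal sketch using Python's chained comparisons, with a hypothetical helper name:

```python
def in_negative_ranges(a, b, c):
    """Return True when a, b and c fall in the assumed negative ranges."""
    # -15 < a < -5 is shorthand for (a > -15 and a < -5)
    return (-15 < a < -5) and (-40 < b < -25) and (c > -100)


a = -11
b = -36
c = -70

if in_negative_ranges(a, b, c):
    print("condition met")

print(in_negative_ranges(a, b, c))
print(in_negative_ranges(11, 36, 70))
```

Writing each range as a single chained comparison makes both bounds visible at once, which helps avoid mixing up the direction of one of them.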

Pandas: Filtering multiple conditions

I'm trying to do boolean indexing with a couple conditions using Pandas. My original DataFrame is called df. If I perform the below, I get the expected result:
temp = df[df["bin"] == 3]
temp = temp[(~temp["Def"])]
temp = temp[temp["days since"] > 7]
temp.head()
However, if I do this (which I think should be equivalent), I get no rows back:
temp2 = df[df["bin"] == 3]
temp2 = temp2[~temp2["Def"] & temp2["days since"] > 7]
temp2.head()
Any idea what accounts for the difference?
Use parentheses, because & has higher precedence than the comparison operators:
temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
Alternatively, create the conditions on separate lines:
cond1 = df["bin"] == 3
cond2 = df["days since"] > 7
cond3 = ~df["Def"]
temp2 = df[cond1 & cond2 & cond3]
Sample:
df = pd.DataFrame({'Def':[True] *2 + [False]*4,
'days since':[7,8,9,14,2,13],
'bin':[1,3,5,3,3,3]})
print (df)
Def bin days since
0 True 1 7
1 True 3 8
2 False 5 9
3 False 3 14
4 False 3 2
5 False 3 13
temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
print (temp2)
Def bin days since
3 False 3 14
5 False 3 13
OR
df_train[(df_train["fold"]==1) | (df_train["fold"]==2)]
AND
df_train[(df_train["fold"]==1) & (df_train["fold"]==2)]
Alternatively, you can use the method query:
df.query('not Def & (`days since` > 7) & (bin == 3)')
If you want multiple conditions:
Del_Det_5k_top_10 = Del_Det[(Del_Det['State'] == 'NSW') & (Del_Det['route'] == 2) |
(Del_Det['State'] == 'VIC') & (Del_Det['route'] == 3)]
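To see the precedence issue from the question in isolation, the sketch below reuses the sample frame from the answer above and runs the filter with and without parentheses. The names unparenthesized and parenthesized are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Def': [True] * 2 + [False] * 4,
                   'days since': [7, 8, 9, 14, 2, 13],
                   'bin': [1, 3, 5, 3, 3, 3]})

temp2 = df[df["bin"] == 3]

# Without parentheses, & is evaluated before >, so this is read as
# (~temp2["Def"] & temp2["days since"]) > 7, not the intended filter.
unparenthesized = temp2[~temp2["Def"] & temp2["days since"] > 7]

# Parenthesizing the comparison restores the intended meaning.
parenthesized = temp2[~temp2["Def"] & (temp2["days since"] > 7)]

print(len(unparenthesized))
print(parenthesized.index.tolist())
```

The first version returns no rows, as described in the question, while the second keeps the rows where Def is False and "days since" exceeds 7.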
