I'm trying to do boolean indexing with a couple conditions using Pandas. My original DataFrame is called df. If I perform the below, I get the expected result:
temp = df[df["bin"] == 3]
temp = temp[(~temp["Def"])]
temp = temp[temp["days since"] > 7]
temp.head()
However, if I do this (which I think should be equivalent), I get no rows back:
temp2 = df[df["bin"] == 3]
temp2 = temp2[~temp2["Def"] & temp2["days since"] > 7]
temp2.head()
Any idea what accounts for the difference?
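Wrap each comparison in parentheses, because & binds more tightly than comparison operators such as > and ==. Without them, ~temp2["Def"] & temp2["days since"] > 7 is evaluated as (~temp2["Def"] & temp2["days since"]) > 7. The corrected filter is: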
temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
Alternatively, build each condition on its own line and combine them afterwards:
cond1 = df["bin"] == 3
cond2 = df["days since"] > 7
cond3 = ~df["Def"]
temp2 = df[cond1 & cond2 & cond3]
Sample:
df = pd.DataFrame({'Def':[True] *2 + [False]*4,
'days since':[7,8,9,14,2,13],
'bin':[1,3,5,3,3,3]})
print (df)
     Def  bin  days since
0   True    1           7
1   True    3           8
2  False    5           9
3  False    3          14
4  False    3           2
5  False    3          13
temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
print (temp2)
     Def  bin  days since
3  False    3          14
5  False    3          13
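To see why the unparenthesized mask from the question returns no rows, here is a minimal sketch on the same sample frame. The names `wrong` and `right` are only for illustration; `wrong` spells out the grouping that Python applies when the parentheses are left off.

```python
import pandas as pd

df = pd.DataFrame({'Def': [True] * 2 + [False] * 4,
                   'days since': [7, 8, 9, 14, 2, 13],
                   'bin': [1, 3, 5, 3, 3, 3]})

# Without parentheses, `~df["Def"] & df["days since"] > 7` is grouped like this,
# so the & is applied before the comparison with 7:
wrong = (~df["Def"] & df["days since"]) > 7

# With parentheses, each comparison is evaluated first and the masks are combined after:
right = ~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)

print(wrong.any())               # whether any row passes the mis-grouped test
print(df[right].index.tolist())  # index labels kept by the corrected filter
```

Printing the intermediate masks like this is usually the quickest way to spot a precedence problem in a long filter.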
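For OR, use |:
df_train[(df_train["fold"]==1) | (df_train["fold"]==2)]
For AND, use & (the line below only illustrates the syntax; a single column cannot equal both 1 and 2, so this particular filter returns no rows):
df_train[(df_train["fold"]==1) & (df_train["fold"]==2)]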
Alternatively, you can use the method query:
df.query('not Def & (`days since` > 7) & (bin == 3)')
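As a quick check that the query string selects the same rows as the boolean mask, here is a small sketch on the sample frame from the earlier answer. Backtick-quoted column names need a reasonably recent pandas, and `engine="python"` is passed here only so the sketch does not depend on the optional numexpr package; both points are assumptions of this example rather than part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'Def': [True] * 2 + [False] * 4,
                   'days since': [7, 8, 9, 14, 2, 13],
                   'bin': [1, 3, 5, 3, 3, 3]})

# Same filter written as a query string; the column with a space is backtick-quoted.
via_query = df.query('not Def & (`days since` > 7) & (bin == 3)', engine="python")

# Same filter written as an explicit boolean mask.
via_mask = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]

print(via_query.equals(via_mask))
```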
If you want to mix AND and OR conditions (& binds more tightly than |, so each State/route pair is grouped first):
Del_Det_5k_top_10 = Del_Det[(Del_Det['State'] == 'NSW') & (Del_Det['route'] == 2) |
(Del_Det['State'] == 'VIC') & (Del_Det['route'] == 3)]
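The grouping in that filter can be checked with a small sketch. The rows below are hypothetical (only the column names come from the snippet), and `nsw2` and `vic3` are illustrative names. Because & is applied before |, the version without extra parentheses matches the version built from two named masks.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the snippet above.
Del_Det = pd.DataFrame({'State': ['NSW', 'NSW', 'VIC', 'VIC', 'QLD'],
                        'route': [2, 3, 3, 2, 2]})

# Each AND clause as its own named mask.
nsw2 = (Del_Det['State'] == 'NSW') & (Del_Det['route'] == 2)
vic3 = (Del_Det['State'] == 'VIC') & (Del_Det['route'] == 3)

implicit = Del_Det[(Del_Det['State'] == 'NSW') & (Del_Det['route'] == 2) |
                   (Del_Det['State'] == 'VIC') & (Del_Det['route'] == 3)]
explicit = Del_Det[nsw2 | vic3]

print(implicit.equals(explicit))
```

Naming the masks costs two extra lines but removes any doubt about how the clauses are grouped.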
I want to fill a column with True or False, depending on whether a condition is met.
I know how to use the any() method, but I need to compare the values of two columns. I tried and have not succeeded; using & gives a TypeError.
My data looks something like:
A  B  condition_met
1  2
3  3
5  9
7  2
The expected output is something like:
A  B  condition_met
1  2  true
3  3  true
5  9  true
7  2  false
I want condition_met to be true if A>3 and B>4.
What I tried was
df.loc[df['A'] > 3 & 'B' > 4, 'condition_met'] = 'True'
Update: I need to check whether the condition is met, i.e. if A>3 then B>4 must hold.
If A<=3 the result must still be true, since the condition does not apply.
Run:
df['condition_met'] = (df.A > 3) & (df.B > 4)
Another possible approach is the logical_and function from NumPy (imported as np):
df['condition_met'] = np.logical_and(df.A.gt(3), df.B.gt(4))
You can assign the mask directly; the parentheses are needed because of operator precedence:
df['condition_met'] = (df.A>3) & (df.B>4)
Or:
df['condition_met'] = df.A.gt(3) & df.B.gt(4)
Your solution, corrected. It sets the string 'True' where the rows match and leaves NaN elsewhere:
df.loc[(df['A'] > 3) & (df['B'] > 4), 'condition_met'] = 'True'
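The TypeError in the question also comes from precedence: in df['A'] > 3 & 'B' > 4, Python evaluates 3 & 'B' before either comparison, and an int cannot be combined with a str using &. A minimal sketch using the question's data (the `raised` flag is only there to make the behaviour visible):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 7], 'B': [2, 3, 9, 2]})

try:
    # `3 & 'B'` is evaluated first, which fails before any column is compared.
    df['A'] > 3 & 'B' > 4
    raised = False
except TypeError:
    raised = True

# Referring to both columns and parenthesizing each comparison gives the intended mask.
mask = (df['A'] > 3) & (df['B'] > 4)

print(raised)
print(mask.tolist())
```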
EDIT (for the updated question, where rows with A <= 3 also count as met):
df['condition_met'] = (df.A>3) & (df.B>4) | (df.A <= 3)
print (df)
   A  B  condition_met
0  1  2           True
1  3  3           True
2  5  9           True
3  7  2          False
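The updated requirement, "if A > 3 then B > 4", is a logical implication, which is false only where A > 3 holds and B > 4 does not. A sketch of that reading on the question's data; the names `implication` and `edit_version` are illustrative, and the equivalence of the two assumes neither column contains NaN.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 7], 'B': [2, 3, 9, 2]})

# "A > 3 implies B > 4" written as "not (A > 3), or B > 4".
implication = ~(df.A > 3) | (df.B > 4)

# The expression from the EDIT, for comparison.
edit_version = (df.A > 3) & (df.B > 4) | (df.A <= 3)

df['condition_met'] = implication
print(df)
print(implication.tolist() == edit_version.tolist())
```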
I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient.
Can my code be simplified and made more efficient?
(I've tried masks but I can't get my head around the syntax as I'm a relative newbie to python).
import pandas as pd
path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"
racecard_path = path + hist_file
df = pd.read_excel(racecard_path)
df["Mask"] = df["HxFPos"].copy()
df["Total"] = df["HxFPos"].copy()
cnt = -1
for trn in df["HxRun"]:
    cnt = cnt + 1
    if df.loc[cnt,"HxFPos"] > 6 or df.loc[cnt,"HxTotalBtn"] > 30:
        df.loc[cnt,"Mask"] = 0
    elif df.loc[cnt,"HxFPos"] < 2 and df.loc[cnt,"HxRun"] < 4 and df.loc[cnt,"HxTotalBtn"] < 10:
        df.loc[cnt,"Mask"] = 1
    elif df.loc[cnt,"HxFPos"] < 4 and df.loc[cnt,"HxRun"] < 9 and df.loc[cnt,"HxTotalBtn"] < 10:
        df.loc[cnt,"Mask"] = 1
    elif df.loc[cnt,"HxFPos"] < 5 and df.loc[cnt,"HxRun"] < 20 and df.loc[cnt,"HxTotalBtn"] < 20:
        df.loc[cnt,"Mask"] = 1
    else:
        df.loc[cnt,"Mask"] = 0
    df.loc[cnt,"Total"] = df.loc[cnt,"Mask"] * df.loc[cnt,"HxFPos"]
df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)
Sample data/output:
HxRun  HxFPos  HxTotalBtn  Mask  Total
    7       5        8        0      0
   13       3        2.75     1      3
   12       5        3.75     0      0
   11       5        5.75     0      0
   11       7        9.25     0      0
   11       9       14.5      0      0
   10      10       26.75     0      0
    8       4       19.5      1      4
    8       8       67        0      0
Use df.assign() for a complex vectorized expression
Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
.loc[]
df.assign()
or alternatively df.query (if you like SQL syntax)
or, if you insist on doing it by iteration (you shouldn't), you never need an explicit for-loop with .loc[] as you wrote it; you can use:
df.apply(your_function_or_lambda, axis=1)
or df.iterrows() as a fallback
df.assign() (or df.query) is going to be less grief when you have long column names (as you do) that get used repeatedly in a complex expression.
Solution with df.assign()
Rewrite your formula for clarity
When we remove all the unneeded .loc[] calls your formula boils down to:
HxFPos > 6 or HxTotalBtn > 30:
    Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20:
    Mask = 1
else:
    Mask = 0
pandas doesn't have a native case-statement/method.
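One common stand-in for a case statement is NumPy's np.select, which takes a list of conditions and a list of values and uses the first condition that matches, much like an if/elif chain. The sketch below is a swapped-in alternative, not the df.assign() approach this answer goes on to describe; it uses a few rows from the question's sample data, and the aliases `f`, `r`, `btn` plus the names `conditions` and `choices` are illustrative.

```python
import numpy as np
import pandas as pd

# A few rows taken from the sample data in the question.
df = pd.DataFrame({"HxRun": [7, 13, 11, 8],
                   "HxFPos": [5, 3, 7, 4],
                   "HxTotalBtn": [8, 2.75, 9.25, 19.5]})

# Short aliases to keep the conditions readable.
f, r, btn = df["HxFPos"], df["HxRun"], df["HxTotalBtn"]

# Conditions are checked in order; the first match decides the value.
conditions = [
    (f > 6) | (btn > 30),
    (f < 2) & (r < 4) & (btn < 10),
    (f < 4) & (r < 9) & (btn < 10),
    (f < 5) & (r < 20) & (btn < 20),
]
choices = [0, 1, 1, 1]

# Rows matching none of the conditions get the default, like the final else.
df["Mask"] = np.select(conditions, choices, default=0)
print(df)
```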
Renaming your variables HxFPos->f, HxRun->r, HxTotalBtn->btn for clarity:
(f > 6) or (btn > 30):
    Mask = 0
(f < 2) and (r < 4) and (btn < 10):
    Mask = 1
(f < 4) and (r < 9) and (btn < 10):
    Mask = 1
(f < 5) and (r < 20) and (btn < 20):
    Mask = 1
else:
    Mask = 0
So really the whole boolean expression for Mask is gated by (f <= 6) and (btn <= 30), which is the negation of the first clause. (Actually, your clauses imply you can only have Mask=1 when (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f<= 6) & (btn <= 30)) & ... you_do_the_rest
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a boolean Series (a vector of 9 bools, one per row); you can use it directly in df.loc[logical_expression_for_row, 'Mask'].
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
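Putting the pieces together, here is one possible way to finish the expression with df.assign(), run on the sample data from the question. It is a sketch rather than the only way to write it; `excluded` and `qualifies` are illustrative names, and Mask is stored as 0/1 to match the sample output.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "HxRun": [7, 13, 12, 11, 11, 11, 10, 8, 8],
    "HxFPos": [5, 3, 5, 5, 7, 9, 10, 4, 8],
    "HxTotalBtn": [8, 2.75, 3.75, 5.75, 9.25, 14.5, 26.75, 19.5, 67],
})
f, r, btn = df["HxFPos"], df["HxRun"], df["HxTotalBtn"]

# First clause of the original loop: these rows always get Mask = 0.
excluded = (f > 6) | (btn > 30)

# The three clauses that set Mask = 1, combined with OR.
qualifies = (
    ((f < 2) & (r < 4) & (btn < 10))
    | ((f < 4) & (r < 9) & (btn < 10))
    | ((f < 5) & (r < 20) & (btn < 20))
)

# Mask is 1 only where a qualifying clause holds and the row is not excluded.
df = df.assign(Mask=(~excluded & qualifies).astype(int))
df = df.assign(Total=df["Mask"] * df["HxFPos"])
print(df)
```

Total is computed in a second assign() call so that it can refer to the Mask column created by the first.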
Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column
by #Hossein-Kalbasi
I've just found an answer - please comment if this is not the most efficient.
df.loc[(((df['HxFPos']<3)&(df['HxRun']<5)|(df['HxRun']>4)&(df['HxFPos']<5)&(df['HxRun']<9)|(df['HxRun']>8)&(df['HxFPos']<6)&(df['HxRun']<30))&(df['HxTotalBtn']<30)), 'Mask'] = 1
I am attempting to change the values of two columns in my dataset from specific numeric values (2, 10, 25 etc.) to single values (1, 2, 3 or 4) based on the percentile of the specific value within the dataset.
Using the pandas quantile() function I have got the ranges I wish to replace between, but I haven't figured out a working method to do so.
age1 = datasetNB.Age.quantile(0.25)
age2 = datasetNB.Age.quantile(0.5)
age3 = datasetNB.Age.quantile(0.75)
fare1 = datasetNB.Fare.quantile(0.25)
fare2 = datasetNB.Fare.quantile(0.5)
fare3 = datasetNB.Fare.quantile(0.75)
My current solution attempt for this problem is as follows:
for elem in datasetNB['Age']:
    if elem <= age1:
        datasetNB[elem].replace(to_replace = elem, value = 1)
        print("set to 1")
    elif (elem > age1) & (elem <= age2):
        datasetNB[elem].replace(to_replace = elem, value = 2)
        print("set to 2")
    elif (elem > age2) & (elem <= age3):
        datasetNB[elem].replace(to_replace = elem, value = 3)
        print("set to 3")
    elif elem > age3:
        datasetNB[elem].replace(to_replace = elem, value = 4)
        print("set to 4")
    else:
        pass
for elem in datasetNB['Fare']:
    if elem <= fare1:
        datasetNB[elem] = 1
    elif (elem > fare1) & (elem <= fare2):
        datasetNB[elem] = 2
    elif (elem > fare2) & (elem <= fare3):
        datasetNB[elem] = 3
    elif elem > fare3:
        datasetNB[elem] = 4
    else:
        pass
What should I do to get this working?
pandas already has one function to do that, pandas.qcut.
You can simply do
q_list = [0, 0.25, 0.5, 0.75, 1]
labels = range(1, 5)
df['Age'] = pd.qcut(df['Age'], q_list, labels=labels)
df['Fare'] = pd.qcut(df['Fare'], q_list, labels=labels)
Input
import numpy as np
import pandas as pd
# Generate fake data for the sake of example
df = pd.DataFrame({
'Age': np.random.randint(10, size=6),
'Fare': np.random.randint(10, size=6)
})
>>> df
   Age  Fare
0    1     6
1    8     2
2    0     0
3    1     9
4    9     6
5    2     2
Output
DataFrame after running the above code
>>> df
   Age  Fare
0    1     3
1    4     1
2    1     1
3    1     4
4    4     3
5    3     1
Note that in your specific case, since you want quartiles, you can just assign q_list = 4.
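Because the example above draws random numbers, here is a reproducible sketch that hard-codes the values shown in the Input table and compares the explicit quantile list with the shorthand q = 4. The names `by_list` and `by_count` are illustrative.

```python
import pandas as pd

# The values shown in the Input table above, fixed so the result is reproducible.
df = pd.DataFrame({'Age': [1, 8, 0, 1, 9, 2],
                   'Fare': [6, 2, 0, 9, 6, 2]})
labels = [1, 2, 3, 4]

# Explicit quantile boundaries versus the integer shorthand for quartiles.
by_list = pd.qcut(df['Age'], [0, 0.25, 0.5, 0.75, 1], labels=labels)
by_count = pd.qcut(df['Age'], 4, labels=labels)
fare_bins = pd.qcut(df['Fare'], 4, labels=labels)

print(by_list.tolist())
print(by_count.tolist())
print(fare_bins.tolist())
```

The result of qcut is a categorical column; call .astype(int) on it if plain integers are needed downstream.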
let's take a condition as :
(df['a'] > 10) & (df['a'] < 20)
This condition produces a Series of True/False values.
What is the mean of that boolean output?
i.e. np.mean((df['a'] > 10) & (df['a'] < 20)) = ?
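Applied to the boolean mask itself, np.mean does not give the mean of the values that are > 10 and < 20; it gives the fraction of rows that satisfy the condition.
To get the mean of the matching values, select them with the mask in square brackets first:
df.loc[(df['a'] > 10) & (df['a'] < 20), 'a'].mean()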
True and False behave like 1 and 0 here, so the mean is the fraction of rows that match both conditions:
df = pd.DataFrame({'a':[9,13,23,16,23]})
m = (df['a'] > 10) & (df['a'] < 20)
print (m)
0 False
1 True
2 False
3 True
4 False
Name: a, dtype: bool
2 of the 5 values match, so the result is 2/5 = 0.4:
print (m.mean())
0.4
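The two quantities are easy to confuse, so here is a short sketch on the same data that computes both: the mean of the mask (the share of matching rows) and the mean of the values the mask selects. The names `share_matching` and `mean_of_matches` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [9, 13, 23, 16, 23]})
m = (df['a'] > 10) & (df['a'] < 20)

# Mean of the boolean mask: True counts as 1, False as 0.
share_matching = m.mean()

# Mean of the values selected by the mask (13 and 16 here).
mean_of_matches = df.loc[m, 'a'].mean()

print(share_matching)
print(mean_of_matches)
```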
Here is the head of my dataframe
df_s['makes'] = df_s['result']
df_s['misses'] = df_s['result']
df_s.loc[(df_s['team'] == 'BOS') & (df_s['shot_distance'] >= 23) &(df_s['result'] == 'made'), 'makes'] = 1
df_s.loc[(df_s['team'] != 'BOS') | (df_s['shot_distance'] < 23) | (df_s['result'] == 'missed') | (df_s['makes'] == 'made'), 'makes'] = 0
df_s.fillna(0, inplace=True)
df_s.loc[(df_s['team'] == 'BOS') & (df_s['shot_distance'] >= 23) & (df_s['result'] == 'missed'), 'misses'] = 1
df_s.loc[(df_s['team'] != 'BOS') | (df_s['shot_distance'] < 23) | (df_s['result'] == 'made'), 'misses'] = 0
df_s.fillna(0, inplace=True)
Is the following a better way to do this, or is there an easier solution?:
>>> df['filter'] = (df['a'] >= 20) & (df['b'] >= 20)
    a   b   c  filter
0   1  50   1   False
1  10  60  30   False
2  20  55   1    True
3   3   0   0   False
4  10   0   0   False
A more readable way is to create masks
mask1 = df_s['team'] == 'BOS'
mask2 = df_s['shot_distance'] >= 23
mask3 = df_s['result'] == 'made'
df_s.loc[(mask1 & mask2 & mask3), 'makes'] = 1
df_s.loc[(~mask1 | ~mask2 | ~mask3), 'makes'] = 0
df_s.fillna(0, inplace=True)
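If the goal is simply a 0/1 column, the combined mask can also be converted directly, which avoids the second assignment and the fillna call. This is a sketch on hypothetical shots (only the column names come from the question), and `bos_three` is an illustrative name.

```python
import pandas as pd

# Hypothetical shots; only the column names are taken from the question.
df_s = pd.DataFrame({
    'team': ['BOS', 'BOS', 'LAL', 'BOS'],
    'shot_distance': [25, 24, 26, 10],
    'result': ['made', 'missed', 'made', 'made'],
})

# Shared part of both conditions: a Boston shot from 23 or further.
bos_three = (df_s['team'] == 'BOS') & (df_s['shot_distance'] >= 23)

# A boolean mask converts to 1 for True and 0 for False.
df_s['makes'] = (bos_three & (df_s['result'] == 'made')).astype(int)
df_s['misses'] = (bos_three & (df_s['result'] == 'missed')).astype(int)
print(df_s)
```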