I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient.
Can my code be simplified and made more efficient?
(I've tried masks but I can't get my head around the syntax as I'm a relative newbie to python).
import pandas as pd
path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"
racecard_path = path + hist_file
df = pd.read_excel(racecard_path)
df["Mask"] = df["HxFPos"].copy
df["Total"] = df["HxFPos"].copy
cnt = -1
for trn in df["HxRun"]:
    cnt = cnt + 1
    if df.loc[cnt,"HxFPos"] > 6 or df.loc[cnt,"HxTotalBtn"] > 30:
        df.loc[cnt,"Mask"] = 0
    elif df.loc[cnt,"HxFPos"] < 2 and df.loc[cnt,"HxRun"] < 4 and df.loc[cnt,"HxTotalBtn"] < 10:
        df.loc[cnt,"Mask"] = 1
    elif df.loc[cnt,"HxFPos"] < 4 and df.loc[cnt,"HxRun"] < 9 and df.loc[cnt,"HxTotalBtn"] < 10:
        df.loc[cnt,"Mask"] = 1
    elif df.loc[cnt,"HxFPos"] < 5 and df.loc[cnt,"HxRun"] < 20 and df.loc[cnt,"HxTotalBtn"] < 20:
        df.loc[cnt,"Mask"] = 1
    else:
        df.loc[cnt,"Mask"] = 0
    df.loc[cnt,"Total"] = df.loc[cnt,"Mask"] * df.loc[cnt,"HxFPos"]
df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)
Sample data/output:
HxRun HxFPos HxTotalBtn Mask Total
7 5 8 0 0
13 3 2.75 1 3
12 5 3.75 0 0
11 5 5.75 0 0
11 7 9.25 0 0
11 9 14.5 0 0
10 10 26.75 0 0
8 4 19.5 1 4
8 8 67 0 0
Use df.assign() for a complex vectorized expression
Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
.loc[]
df.assign()
or alternatively df.query (if you like SQL syntax)
or, if you insist on doing it by iteration (you shouldn't), you never need an explicit for-loop with .loc[] as you did; you can use:
df.apply(your_function_or_lambda, axis=1) (see the sketch below)
or df.iterrows() as a fallback
df.assign() (or df.query) will cause you less grief when you have long column names (as you do) that get used repeatedly in a complex expression.
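For contrast, a row-wise apply version might look like this sketch (slow; mask_rule is just an illustrative name, mirroring your if/elif chain one row at a time):
def mask_rule(row):
    # Mirror the original if/elif chain for a single row.
    if row['HxFPos'] > 6 or row['HxTotalBtn'] > 30:
        return 0
    if row['HxFPos'] < 2 and row['HxRun'] < 4 and row['HxTotalBtn'] < 10:
        return 1
    if row['HxFPos'] < 4 and row['HxRun'] < 9 and row['HxTotalBtn'] < 10:
        return 1
    if row['HxFPos'] < 5 and row['HxRun'] < 20 and row['HxTotalBtn'] < 20:
        return 1
    return 0

df['Mask'] = df.apply(mask_rule, axis=1)
df['Total'] = df['Mask'] * df['HxFPos']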
Solution with df.assign()
Rewrite your formula for clarity
When we remove all the unneeded .loc[] calls, your formula boils down to:
HxFPos > 6 or HxTotalBtn > 30:
    Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20:
    Mask = 1
else:
    Mask = 0
pandas doesn't have a native case-statement/method; numpy's np.select (sketched below) is the usual substitute.
Renaming your variables HxFPos -> f, HxRun -> r, HxTotalBtn -> btn for clarity:
(f > 6) or (btn > 30):
    Mask = 0
(f < 2) and (r < 4) and (btn < 10):
    Mask = 1
(f < 4) and (r < 9) and (btn < 10):
    Mask = 1
(f < 5) and (r < 20) and (btn < 20):
    Mask = 1
else:
    Mask = 0
So really the whole boolean expression for Mask is gated by (f <= 6) and (btn <= 30). (Actually your clauses imply you can only have Mask=1 for (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f<= 6) & (btn <= 30)) & ... you_do_the_rest
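With those shorthand names bound to the actual columns, np.select can express the case structure directly - a minimal sketch, assuming the same df as in your question:
import numpy as np

f, r, btn = df['HxFPos'], df['HxRun'], df['HxTotalBtn']
df['Mask'] = np.select(
    condlist=[(f > 6) | (btn > 30),               # first match wins, like if/elif
              (f < 2) & (r < 4) & (btn < 10),
              (f < 4) & (r < 9) & (btn < 10),
              (f < 5) & (r < 20) & (btn < 20)],
    choicelist=[0, 1, 1, 1],
    default=0)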
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a logical expression (a vector of 9 bools); you can use it directly in df.loc[logical_expression_for_row, 'Mask'].
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column
by #Hossein-Kalbasi
I've just found an answer - please comment if this is not the most efficient.
df.loc[(((df['HxFPos']<3)&(df['HxRun']<5)|(df['HxRun']>4)&(df['HxFPos']<5)&(df['HxRun']<9)|(df['HxRun']>8)&(df['HxFPos']<6)&(df['HxRun']<30))&(df['HxTotalBtn']<30)), 'Mask'] = 1
Related
We're trying to figure out a way to easily pull values from what I guess I would describe as a grid of conditional statements. We've got two variables, x and y, and depending on those values, we want to pull one of (something1, ..., another1, ... again1...). We could definitely do this using if statements, but we were wondering if there was a better way. Some caveats: we would like to be able to easily change the bounds on the x and y conditionals. The problem with a bunch of if statements is that it's not very easy to compare the values of those bounds with the values in the example table below.
Example:
                 y < 20%       20% <= y < 50%   y >= 50%
x < 5%           something1    another1         again1
5% <= x < 10%    something2    another2         again2
x >= 10%         something3    another3         again3
So if x = 4% and y = 30%, we would get back another1. Whereas if x = 50% and y = 10%, we would get something3.
Overall two questions:
Is there a general name for this kind of problem?
Is there an easy framework or library that could do this for us without if statements?
Even though Pandas is not really made for this kind of usage, with function aggregation and boolean indexing it allows for an elegant-ish solution to your problem. Alternatively, constraint-based programming might be an option (see python-constraint on PyPI).
Define the constraints as functions.
x_constraints = [lambda x: 0 <= x < 5,
                 lambda x: 5 <= x < 10,
                 lambda x: 10 <= x < 15,
                 lambda x: x >= 15]
y_constraints = [lambda y: 0 <= y < 20,
                 lambda y: 20 <= y < 50,
                 lambda y: y >= 50]
x = 15
y = 30
Now we want to make two dataframes: one that only holds the x-values, and another that only holds the y-values, where the number of columns equals the number of x-constraints and the number of rows equals the number of y-constraints.
import pandas as pd
def make_dataframe(value):
    return pd.DataFrame(data=value,
                        index=range(len(y_constraints)),
                        columns=range(len(x_constraints)))
x_df = make_dataframe(x)
y_df = make_dataframe(y)
The dataframes look like this:
>>> x_df
0 1 2 3
0 15 15 15 15
1 15 15 15 15
2 15 15 15 15
>>> y_df
0 1 2 3
0 30 30 30 30
1 30 30 30 30
2 30 30 30 30
Next, we need the dataframe label_df that holds the possible outcomes. The shape must match the dimension of x_df and y_df above. (What's cool about this is that you can store the data in a
CSV-file and directly read it into a dataframe with pd.read_csv if you wish.)
label_df = pd.DataFrame([[f"{w}{i+1}" for i in range(len(x_constraints))] for w in "something another again".split()])
>>> label_df
0 1 2 3
0 something1 something2 something3 something4
1 another1 another2 another3 another4
2 again1 again2 again3 again4
Next, we want to apply the x_constraints to the columns of x_df, and the y_constraints to the rows of y_df. .aggregate takes
a dictionary that maps column or row names to functions {colname: func},
which we construct inline using dict(zip(...)). axis=1 means "apply the functions row-wise".
x_mask = x_df.aggregate(dict(zip(x_df.columns, x_constraints)))
y_mask = y_df.aggregate(dict(zip(y_df.columns, y_constraints)), axis=1)
The result is two dataframes holding boolean values, and ideally there should be exactly one column in x_mask and one row in y_mask that's all True, e.g.
>>> x_mask
0 1 2 3
0 False False False True
1 False False False True
2 False False False True
>>> y_mask
0 1 2 3
0 False False False False
1 True True True True
2 False False False False
If we combine them with bit-wise and &, we get a boolean mask with exactly
one True value.
>>> m = x_mask & y_mask
>>> m
0 1 2 3
0 False False False False
1 False False False True
2 False False False False
Use m to select the target value from label_df. The result df is all NaN except one value, which we extract with df.stack().iloc[0]:
>>> df = label_df[m]
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN another4
2 NaN NaN NaN NaN
>>> df.stack().iloc[0]
'another4'
And that's it! It should be very easy to maintain, by just changing the list of constraints and adapting the possible outcomes in label_df.
I haven't heard of a name for this kind of problem.
If it feels conceptually closer to you, I might suggest that you create two mapper functions that map the x and y values to the categories of your contingency table.
map_x = lambda x: 0 if x < 0.05 else 1 if x < 0.1 else 2
map_y = lambda y: 0 if y < 0.2 else 1 if y < 0.5 else 2
df.iloc[map_x(x), map_y(y)]
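For instance, a runnable sketch - the labels grid here is an assumption reconstructed from the example table in the question:
import pandas as pd

labels = pd.DataFrame([["something1", "another1", "again1"],
                       ["something2", "another2", "again2"],
                       ["something3", "another3", "again3"]])
map_x = lambda x: 0 if x < 0.05 else 1 if x < 0.1 else 2
map_y = lambda y: 0 if y < 0.2 else 1 if y < 0.5 else 2

print(labels.iloc[map_x(0.04), map_y(0.30)])  # another1
print(labels.iloc[map_x(0.50), map_y(0.10)])  # something3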
If you have just a handful of conditionals then you may define two lists with the upper bounds, and use a simple linear search:
x_bounds = [0.05, 0.1, 1.0]
y_bounds = [0.2, 0.5, 1.0]
def linear(x_bounds, y_bounds, x, y):
    for i, xb in enumerate(x_bounds):
        if x <= xb:
            break
    for j, yb in enumerate(y_bounds):
        if y <= yb:
            break
    return i, j
linear(x_bounds, y_bounds, 0.04, 0.3) # (0, 1)
If there are many conditionals, a binary search will be better:
def binary(x_bounds, y_bounds, x, y):
    # Lower-bound search: find the first index i with x <= x_bounds[i].
    lower, upper = 0, len(x_bounds) - 1
    while lower < upper:
        mid = (lower + upper) // 2
        if x_bounds[mid] < x:
            lower = mid + 1
        else:
            upper = mid
    xmid = lower
    # Same search over the y bounds.
    lower, upper = 0, len(y_bounds) - 1
    while lower < upper:
        mid = (lower + upper) // 2
        if y_bounds[mid] < y:
            lower = mid + 1
        else:
            upper = mid
    return xmid, lower

binary(x_bounds, y_bounds, 0.04, 0.3) # (0, 1)
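The standard library's bisect module implements the same lower-bound search, so the hand-rolled version above can shrink to a sketch like:
import bisect

def binary_bisect(x_bounds, y_bounds, x, y):
    # bisect_left returns the first index whose bound is >= the value.
    return bisect.bisect_left(x_bounds, x), bisect.bisect_left(y_bounds, y)

binary_bisect(x_bounds, y_bounds, 0.04, 0.3) # (0, 1)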
I have a pandas dataframe (100,000 obs) with 11 columns.
I'm trying to assign df['trade_sign'] values based on df['diff'] (which is a pd.Series object of integer values)
If diff is positive, then trade_sign = 1
if diff is negative, then trade_sign = -1
if diff is 0, then trade_sign = 0
What I've tried so far:
pos['trade_sign'] = (pos['trade_sign']>0)
pos['trade_sign'].replace({False: -1, True: 1}, inplace=True)
But this obviously doesn't take into account 0 values.
I also tried for loops with if conditions but that didn't work.
Essentially, how do I fix my .replace function to take account of diff values of 0?
Ideally, I'd prefer a solution that uses numpy over for loops with if conditions.
There's a sign function in numpy:
df["trade_sign"] = np.sign(df["diff"])
If you want integers,
df["trade_sign"] = np.sign(df["diff"]).astype(int)
a = [-1 if v < 0 else (1 if v > 0 else 0) for v in df['diff']]  # -1 negative, 1 positive, 0 zero
df['trade_sign'] = a
You could do it this way:
pos['trade_sign'] = (pos['diff'] > 0) * 1 + (pos['diff'] < 0) * -1
The boolean results of the element-wise > and < comparisons automatically get converted to int in order to allow multiplication with 1 and -1, respectively.
This sample input and test code:
import pandas as pd
pos = pd.DataFrame({'diff':[-9,0,9,-8,0,8,-7-6-5,4,3,2,0]})  # note: -7-6-5 evaluates to -18
pos['trade_sign'] = (pos['diff'] > 0) * 1 + (pos['diff'] < 0) * -1
print(pos)
... gives this output:
diff trade_sign
0 -9 -1
1 0 0
2 9 1
3 -8 -1
4 0 0
5 8 1
6 -18 -1
7 4 1
8 3 1
9 2 1
10 0 0
UPDATE: In addition to the solution above, as well as some of the other excellent ideas in other answers, you can use numpy where:
pos['trade_sign'] = np.where(pos['diff'] > 0, 1, np.where(pos['diff'] < 0, -1, 0))
I'm trying to do boolean indexing with a couple of conditions using Pandas. My original DataFrame is called df. If I perform the below, I get the expected result:
temp = df[df["bin"] == 3]
temp = temp[(~temp["Def"])]
temp = temp[temp["days since"] > 7]
temp.head()
However, if I do this (which I think should be equivalent), I get no rows back:
temp2 = df[df["bin"] == 3]
temp2 = temp2[~temp2["Def"] & temp2["days since"] > 7]
temp2.head()
Any idea what accounts for the difference?
Use () because of operator precedence: & binds more tightly than comparisons like >, so ~temp2["Def"] & temp2["days since"] > 7 is evaluated as (~temp2["Def"] & temp2["days since"]) > 7, which is why no rows match. Instead:
temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
Alternatively, create conditions on separate rows:
cond1 = df["bin"] == 3
cond2 = df["days since"] > 7
cond3 = ~df["Def"]
temp2 = df[cond1 & cond2 & cond3]
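As a side note, if the list of conditions keeps growing, numpy can AND-combine any number of masks in one call - a small sketch using the cond variables above:
import numpy as np

temp2 = df[np.logical_and.reduce([cond1, cond2, cond3])]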
Sample:
df = pd.DataFrame({'Def':[True]*2 + [False]*4,
                   'days since':[7,8,9,14,2,13],
                   'bin':[1,3,5,3,3,3]})
print (df)
Def bin days since
0 True 1 7
1 True 3 8
2 False 5 9
3 False 3 14
4 False 3 2
5 False 3 13
temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
print (temp2)
Def bin days since
3 False 3 14
5 False 3 13
OR
df_train[(df_train["fold"]==1) | (df_train["fold"]==2)]
AND
df_train[(df_train["fold"]==1) & (df_train["fold"]==2)]
Alternatively, you can use the method query:
df.query('not Def & (`days since` > 7) & (bin == 3)')
If you want multiple conditions:
Del_Det_5k_top_10 = Del_Det[(Del_Det['State'] == 'NSW') & (Del_Det['route'] == 2) |
                            (Del_Det['State'] == 'VIC') & (Del_Det['route'] == 3)]
I want to perform a for loop in pandas: for each row i, I want to take column x1 and perform a test (if/else statements)
In R I would do it like this:
df <- data.frame(x1 = rnorm(10),x2 = rexp(10))
for(i in 1:length(df$x1)){
if(df[i,'x1'] >0){
print('+')
} else{
print('-')
}
}
How can I do this in pandas data frame?
P.S. I need to perform a loop like this. But if you have better ideas, I will appreciate them
EDIT:
In case of multiple comparisons:
Thank you for the answer!
And maybe you can give me advice: how can I do the iteration if I have multiple if/else statements? For example:
if x>0:
if x%2 == 0:
#do stuff 1
else:
#do other stuff 2
elif x<0:
if x%2 == 0:
#do stuff 3
else:
#do other stuff 4
If you need a new column, use numpy.where:
import numpy as np
import pandas as pd

np.random.seed(54)
df = pd.DataFrame({'x1':np.random.randint(10, size=10)}) - 5
df['new'] = np.where(df['x1'] > 0, '+', '-')
print (df)
x1 new
0 0 -
1 -3 -
2 2 +
3 -4 -
4 -5 -
5 3 +
6 2 +
7 -4 -
8 4 +
9 1 +
But if you need a loop (obviously avoid it, because it's slow), you can use items() (called iteritems() in older pandas versions):
for i, x in df['x1'].items():
    if x > 0:
        print ('+')
    else:
        print ('-')
EDIT:
df['new'] = np.where(df['x1'] > 0, 'a',
np.where(df['x1'] & 2, 'b', 'c'))
print (df)
x1 new
0 0 c
1 -3 c
2 2 a
3 -4 c
4 -5 b
5 3 a
6 2 a
7 -4 c
8 4 a
9 1 a
But if you have many conditions (4 or more), use apply with a custom function:
def f(x):
    # x == 0
    y = 5
    if x > 0:
        if x % 2 == 0:
            y = 0
            # do stuff 1
        else:
            y = 1
            # do other stuff 2
    elif x < 0:
        if x % 2 == 0:
            y = 2
            # do stuff 3
        else:
            y = 3
            # do other stuff 4
    return y
df['new'] = df['x1'].apply(f)
print (df)
x1 new
0 0 5
1 -3 3
2 2 0
3 -4 2
4 -5 3
5 3 1
6 2 0
7 -4 2
8 4 0
9 1 1
You can use this code to print out each index with the correct symbol:
print(df['x1'].map(lambda x: '+' if x > 0 else '-').to_string(index=False))
What the above code does is create a new Series object, using map to convert each value to '+' if x > 0 and '-' otherwise. Then the Series is converted to a string and printed out without indices.
But if you absolutely need to loop through each row, you can use the following code, which is what you have but condensed into 2 lines:
for i in df['x1']:
    print('+' if i > 0 else '-')
I have a dataframe df with age and I am working on categorizing the file into age groups with 0s and 1s.
df:
User_ID | Age
35435 22
45345 36
63456 18
63523 55
I tried the following
df['Age_GroupA'] = 0
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
but get this error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
To avoid it, I am going for .loc
df['Age_GroupA'] = 0
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
However, this marks all ages as 1
This is what I get
User_ID | Age | Age_GroupA
35435 22 1
45345 36 1
63456 18 1
63523 55 1
while this is the goal
User_ID | Age | Age_GroupA
35435 22 1
45345 36 0
63456 18 1
63523 55 0
Thank you
Due to peer pressure (#DSM), I feel compelled to break down your error:
df['Age_GroupA'][(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
this is chained indexing/assignment
so what you tried next:
df['Age_GroupA'] = df.loc[(df['Age'] >= 1) & (df['Age'] <= 25)] = 1
is the incorrect form; when using loc you want:
df.loc[<boolean mask>, cols of interest] = some scalar or calculated value
like this:
df.loc[(df['Age'] >= 1) & (df['Age'] <= 25), 'Age_GroupA'] = 1
You could also have done this using np.where:
df['Age_GroupA'] = np.where( (df['Age'] >= 1) & (df['Age'] <= 25), 1, 0)
There are many ways to do this in one line.
You can convert the boolean mask to int - True becomes 1 and False becomes 0:
df['Age_GroupA'] = ((df['Age'] >= 1) & (df['Age'] <= 25)).astype(int)
print (df)
User_ID  Age  Age_GroupA
0 35435 22 1
1 45345 36 0
2 63456 18 1
3 63523 55 0
This worked for me. Jezrael already explained it.
dataframe['Age_GroupA'] = ((dataframe['Age'] >= 1) & (dataframe['Age'] <= 25)).astype(int)