I want to apply a lambda function to a DataFrame column using if...elif...else within the lambda function.
The df and the code are something like:
df=pd.DataFrame({"one":[1,2,3,4,5],"two":[6,7,8,9,10]})
df["one"].apply(lambda x: x*10 if x<2 elif x<4 x**2 else x+10)
Obviously, this doesn't work. Is there a way to use if...elif...else in a lambda?
How can I get the same result with List Comprehension?
Nest the if/else conditional expressions:
lambda x: x*10 if x<2 else (x**2 if x<4 else x+10)
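Applied to the sample frame from the question, the nested conditional expression gives (a quick sketch):

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4, 5], "two": [6, 7, 8, 9, 10]})

# x*10 if x < 2; elif x < 4 then x**2; else x + 10
df["three"] = df["one"].apply(lambda x: x * 10 if x < 2 else (x ** 2 if x < 4 else x + 10))
print(df["three"].tolist())  # [10, 4, 9, 14, 15]
```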
I do not recommend the use of apply here: it should be avoided if there are better alternatives.
For example, if you are performing the following operation on a Series:
if cond1:
    exp1
elif cond2:
    exp2
else:
    exp3
This is usually a good use case for np.where or np.select.
numpy.where
The if else chain above can be written using
np.where(cond1, exp1, np.where(cond2, exp2, ...))
np.where allows nesting. With one level of nesting, your problem can be solved with,
df['three'] = np.where(
    df['one'] < 2,
    df['one'] * 10,
    np.where(df['one'] < 4, df['one'] ** 2, df['one'] + 10))
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
numpy.select
Allows for flexible syntax and is easily extensible. It follows the form,
np.select([cond1, cond2, ...], [exp1, exp2, ...])
Or, in this case,
np.select([cond1, cond2], [exp1, exp2], default=exp3)
df['three'] = np.select(
    condlist=[df['one'] < 2, df['one'] < 4],
    choicelist=[df['one'] * 10, df['one'] ** 2],
    default=df['one'] + 10)
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
and/or (similar to the if/else)
Similar to if/else, this still requires the lambda. Beware that the short-circuit trick silently misfires when a selected branch evaluates falsy (e.g. for x == 0, x * 10 is 0 and the chain falls through to the next clause); it is safe here only because no branch yields 0 for this data:
df['three'] = df["one"].apply(
lambda x: (x < 2 and x * 10) or (x < 4 and x ** 2) or x + 10)
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
List Comprehension
Loopy solution that is still faster than apply.
df['three'] = [x*10 if x<2 else (x**2 if x<4 else x+10) for x in df['one']]
# df['three'] = [
#     (x < 2 and x * 10) or (x < 4 and x ** 2) or x + 10 for x in df['one']
# ]
df
one two three
0 1 6 10
1 2 7 4
2 3 8 9
3 4 9 14
4 5 10 15
For readability I prefer to write a function, especially if you are dealing with many conditions. For the original question:
def parse_values(x):
    if x < 2:
        return x * 10
    elif x < 4:
        return x ** 2
    else:
        return x + 10
df['one'].apply(parse_values)
You can do it using multiple loc assignments, as long as the conditions are mutually exclusive. (Note that a plain df['one'] < 4 in the second line would also re-capture the rows already handled by the first line and overwrite them, so the middle band must exclude them.) Here is a newly created column labelled 'new' with the conditional calculation applied:
df.loc[df['one'] < 2, 'new'] = df['one'] * 10
df.loc[(df['one'] >= 2) & (df['one'] < 4), 'new'] = df['one'] ** 2
df.loc[df['one'] >= 4, 'new'] = df['one'] + 10
Related
suppose I have following dataframe :
data = {"age":[2,3,2,5,9,12,20,43,55,60],'alpha' : [0,0,0,0,0,0,0,0,0,0]}
df = pd.DataFrame(data)
I want to change value of column alpha based on column age using df.loc and an arithmetic sequences but I got syntax error:
df.loc[((df.age <=4)) , "alpha"] = ".4"
df.loc[((df.age >= 5)) & ((df.age <= 20)), "alpha"] = 0.4 + (1 - 0.4)*((df$age - 4)/(20 - 4))
df.loc[((df.age > 20)) , "alpha"] = "1"
Thank you in advance.
Reference the age column using a . not a $
df.loc[((df.age >= 5)) & ((df.age <= 20)), "alpha"] = 0.4 + (1 - 0.4)*((df.age - 4)/(20 - 4))
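Putting it together, the corrected block might look like this (a sketch; I've also used floats throughout, rather than the original's mix of strings like ".4" and numbers, so alpha keeps a single numeric dtype):

```python
import pandas as pd

df = pd.DataFrame({"age": [2, 3, 2, 5, 9, 12, 20, 43, 55, 60],
                   "alpha": [0.0] * 10})

df.loc[df.age <= 4, "alpha"] = 0.4
df.loc[(df.age >= 5) & (df.age <= 20), "alpha"] = 0.4 + (1 - 0.4) * ((df.age - 4) / (20 - 4))
df.loc[df.age > 20, "alpha"] = 1.0
print(df.alpha.round(4).tolist())  # [0.4, 0.4, 0.4, 0.4375, 0.5875, 0.7, 1.0, 1.0, 1.0, 1.0]
```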
Instead of multiple .loc assignments you can combine all conditions at once using chained np.where clauses:
df['alpha'] = np.where(df.age <= 4, ".4",
              np.where((df.age >= 5) & (df.age <= 20),
                       0.4 + (1 - 0.4) * ((df.age - 4) / (20 - 4)),
                       np.where(df.age > 20, "1", df.alpha)))
print(df)
age alpha
0 2 .4
1 3 .4
2 2 .4
3 5 0.4375
4 9 0.5875
5 12 0.7
6 20 1.0
7 43 1
8 55 1
9 60 1
Besides the syntax error (due to the $), to reduce visual noise I would go for numpy.select:
import numpy as np
conditions = [df["age"].le(4),
              df["age"].gt(4) & df["age"].le(20),
              df["age"].gt(20)]
values = [".4", 0.4 + (1 - 0.4) * ((df["age"] - 4) / (20 - 4)), 1]
df["alpha"] = np.select(condlist=conditions, choicelist=values)
Output :
print(df)
age alpha
0 2 .4
1 3 .4
2 2 .4
3 5 0.4375
4 9 0.5875
5 12 0.7
6 20 1.0
7 43 1
8 55 1
9 60 1
We're trying to figure out a way to easily pull values from what I guess I would describe as a grid of conditional statements. We've got two variables, x and y, and depending on those values, we want to pull one of (something1, ..., another1, ... again1...). We could definitely do this using if statements, but we were wondering if there was a better way. Some caveats: we would like to be able to easily change the bounds on the x and y conditionals. The problem with a bunch of if statements is that it's not very easy to compare the values of those bounds with the values in the example table below.
Example:
So if x = 4% and y = 30%, we would get back another1. Whereas if x = 50% and y = 10%, we would get something3.
Overall two questions:
Is there a general name for this kind of problem?
Is there an easy framework or library that could do this for us without if statements?
Even though Pandas is not really made for this kind of usage, function aggregation and boolean indexing allow for an elegant-ish solution to your problem. Alternatively, constraint-based programming might be an option (see python-constraint on PyPI).
Define the constraints as functions.
x_constraints = [lambda x: 0 <= x < 5,
                 lambda x: 5 <= x < 10,
                 lambda x: 10 <= x < 15,
                 lambda x: x >= 15]

y_constraints = [lambda y: 0 <= y < 20,
                 lambda y: 20 <= y < 50,
                 lambda y: y >= 50]

x = 15
y = 30
Now we want to make two dataframes: one that holds only the x-values and another that holds only the y-values, where the number of columns equals the number of x-constraints and the number of rows equals the number of y-constraints.
import pandas as pd
def make_dataframe(value):
    return pd.DataFrame(data=value,
                        index=range(len(y_constraints)),
                        columns=range(len(x_constraints)))
x_df = make_dataframe(x)
y_df = make_dataframe(y)
The dataframes look like this:
>>> x_df
0 1 2 3
0 15 15 15 15
1 15 15 15 15
2 15 15 15 15
>>> y_df
0 1 2 3
0 30 30 30 30
1 30 30 30 30
2 30 30 30 30
Next, we need the dataframe label_df that holds the possible outcomes. The shape must match the dimension of x_df and y_df above. (What's cool about this is that you can store the data in a
CSV-file and directly read it into a dataframe with pd.read_csv if you wish.)
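For instance (a sketch; io.StringIO stands in for a hypothetical labels.csv on disk, and header=None because the file would have no header row):

```python
import io
import pandas as pd

# contents of a hypothetical labels.csv
csv_text = """something1,something2,something3,something4
another1,another2,another3,another4
again1,again2,again3,again4
"""
label_df = pd.read_csv(io.StringIO(csv_text), header=None)
print(label_df.shape)  # (3, 4)
```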
label_df = pd.DataFrame([[f"{w}{i+1}" for i in range(len(x_constraints))] for w in "something another again".split()])
>>> label_df
0 1 2 3
0 something1 something2 something3 something4
1 another1 another2 another3 another4
2 again1 again2 again3 again4
Next, we want to apply the x_constraints to the columns of x_df, and the y_constraints to the rows of y_df. .aggregate takes
a dictionary that maps column or row names to functions {colname: func},
which we construct inline using dict(zip(...)). axis=1 means "apply the functions row-wise".
x_mask = x_df.aggregate(dict(zip(x_df.columns, x_constraints)))
y_mask = y_df.aggregate(dict(zip(y_df.index, y_constraints)), axis=1)
The result is two dataframes holding boolean values, and ideally there should be exactly one column in x_mask and one row in y_mask that is all True, e.g.
>>> x_mask
0 1 2 3
0 False False False True
1 False False False True
2 False False False True
>>> y_mask
0 1 2 3
0 False False False False
1 True True True True
2 False False False False
If we combine them with bit-wise and &, we get a boolean mask with exactly
one True value.
>>> m = x_mask & y_mask
>>> m
0 1 2 3
0 False False False False
1 False False False True
2 False False False False
Use m to select the target value from label_df. The result df is all NaN except one value, which we extract with df.stack().iloc[0]:
>>> df = label_df[m]
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN another4
2 NaN NaN NaN NaN
>>> df.stack().iloc[0]
'another4'
And that's it! It should be very easy to maintain, by just changing the list of constraints and adapting the possible outcomes in label_df.
I don't know of a standard name for this.
If (ha-ha) it feels conceptually closer to you, I might suggest creating two mapper functions that map the x and y values to the row and column indices of your contingency table.
map_x = lambda x: 0 if x < 0.05 else 1 if x < 0.1 else 2
map_y = lambda y: 0 if y < 0.2 else 1 if y < 0.5 else 2
df.iloc[map_y(y), map_x(x)]  # rows index the y categories, columns the x categories
If you have just a handful of conditionals then you may define two lists with the upper bounds, and use a simple linear search:
x_bounds = [0.05, 0.1, 1.0]
y_bounds = [0.2, 0.5, 1.0]

def linear(x_bounds, y_bounds, x, y):
    for i, xb in enumerate(x_bounds):
        if x <= xb:
            break
    for j, yb in enumerate(y_bounds):
        if y <= yb:
            break
    return i, j

linear(x_bounds, y_bounds, 0.04, 0.3)  # (0, 1)
If there are many conditionals a binary search will be better:
def binary_search(bounds, value):
    # smallest index i with value <= bounds[i]; falls back to the last bin
    lower, upper = 0, len(bounds) - 1
    while lower < upper:
        mid = (lower + upper) // 2
        if bounds[mid] < value:
            lower = mid + 1
        else:
            upper = mid
    return lower

def binary(x_bounds, y_bounds, x, y):
    return binary_search(x_bounds, x), binary_search(y_bounds, y)

binary(x_bounds, y_bounds, 0.04, 0.3)  # (0, 1)
I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient.
Can my code be simplified and made more efficient?
(I've tried masks but I can't get my head around the syntax as I'm a relative newbie to python).
import pandas as pd
path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"
racecard_path = path + hist_file
df = pd.read_excel(racecard_path)
df["Mask"] = df["HxFPos"].copy()
df["Total"] = df["HxFPos"].copy()
cnt = -1
for trn in df["HxRun"]:
    cnt = cnt + 1
    if df.loc[cnt, "HxFPos"] > 6 or df.loc[cnt, "HxTotalBtn"] > 30:
        df.loc[cnt, "Mask"] = 0
    elif df.loc[cnt, "HxFPos"] < 2 and df.loc[cnt, "HxRun"] < 4 and df.loc[cnt, "HxTotalBtn"] < 10:
        df.loc[cnt, "Mask"] = 1
    elif df.loc[cnt, "HxFPos"] < 4 and df.loc[cnt, "HxRun"] < 9 and df.loc[cnt, "HxTotalBtn"] < 10:
        df.loc[cnt, "Mask"] = 1
    elif df.loc[cnt, "HxFPos"] < 5 and df.loc[cnt, "HxRun"] < 20 and df.loc[cnt, "HxTotalBtn"] < 20:
        df.loc[cnt, "Mask"] = 1
    else:
        df.loc[cnt, "Mask"] = 0
    df.loc[cnt, "Total"] = df.loc[cnt, "Mask"] * df.loc[cnt, "HxFPos"]
df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)
Sample data/output:
HxRun HxFPos HxTotalBtn Mask Total
7 5 8 0 0
13 3 2.75 1 3
12 5 3.75 0 0
11 5 5.75 0 0
11 7 9.25 0 0
11 9 14.5 0 0
10 10 26.75 0 0
8 4 19.5 1 4
8 8 67 0 0
Use df.assign() for a complex vectorized expression
Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
.loc[]
df.assign()
or alternatively df.query (if you like SQL syntax)
Or, if you insist on iterating (you shouldn't), you never need an explicit for-loop with .loc[] as you did; you can use:
df.apply(your_function_or_lambda, axis=1)
or df.iterrows() as a fallback.
df.assign() (or df.query) will cause less grief when long column names (as you have) get used repeatedly in a complex expression.
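As a taste of the df.query route (a sketch on a few sample rows from the question, showing only one clause: query selects the matching rows, and their index then drives the assignment):

```python
import pandas as pd

df = pd.DataFrame({"HxRun": [7, 13, 12],
                   "HxFPos": [5, 3, 5],
                   "HxTotalBtn": [8.0, 2.75, 3.75],
                   "Mask": 0})

# rows satisfying the loosest Mask=1 clause
hits = df.query("HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20").index
df.loc[hits, "Mask"] = 1
print(df["Mask"].tolist())  # [0, 1, 0]
```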
Solution with df.assign()
Rewrite your formula for clarity
When we remove all the unneeded .loc[] calls your formula boils down to:
if HxFPos > 6 or HxTotalBtn > 30:
    Mask = 0
elif HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
    Mask = 1
elif HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
    Mask = 1
elif HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20:
    Mask = 1
else:
    Mask = 0
pandas doesn't have a native case-statement/method.
Renaming your variables for clarity (HxFPos -> f, HxRun -> r, HxTotalBtn -> btn):
if (f > 6) or (btn > 30):
    Mask = 0
elif (f < 2) and (r < 4) and (btn < 10):
    Mask = 1
elif (f < 4) and (r < 9) and (btn < 10):
    Mask = 1
elif (f < 5) and (r < 20) and (btn < 20):
    Mask = 1
else:
    Mask = 0
So really the whole boolean expression for Mask is gated by (f <= 6) and (btn <= 30). (Actually your clauses imply you can only have Mask = 1 when (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f<= 6) & (btn <= 30)) & ... you_do_the_rest
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a boolean Series (a vector of 9 bools, one per row); you can use it directly in df.loc[logical_expression_for_row, 'Mask'].
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
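Putting the pieces together, the whole Mask/Total computation can be sketched with np.select inside df.assign (column names and sample rows taken from the question; np.select picks the first matching condition, like the if/elif chain):

```python
import numpy as np
import pandas as pd

# sample rows from the question
df = pd.DataFrame({"HxRun":      [7, 13, 12, 11, 11, 11, 10, 8, 8],
                   "HxFPos":     [5, 3, 5, 5, 7, 9, 10, 4, 8],
                   "HxTotalBtn": [8, 2.75, 3.75, 5.75, 9.25, 14.5, 26.75, 19.5, 67]})

f, r, btn = df["HxFPos"], df["HxRun"], df["HxTotalBtn"]
df = df.assign(Mask=np.select(
    condlist=[(f > 6) | (btn > 30),              # knockout clause first
              (f < 2) & (r < 4) & (btn < 10),
              (f < 4) & (r < 9) & (btn < 10),
              (f < 5) & (r < 20) & (btn < 20)],
    choicelist=[0, 1, 1, 1],
    default=0))
df = df.assign(Total=df["Mask"] * df["HxFPos"])
print(df["Total"].tolist())  # [0, 3, 0, 0, 0, 0, 0, 4, 0]
```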
Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column
by #Hossein-Kalbasi
I've just found an answer - please comment if this is not the most efficient.
df.loc[((df['HxFPos'] < 3) & (df['HxRun'] < 5)
        | (df['HxRun'] > 4) & (df['HxFPos'] < 5) & (df['HxRun'] < 9)
        | (df['HxRun'] > 8) & (df['HxFPos'] < 6) & (df['HxRun'] < 30))
       & (df['HxTotalBtn'] < 30), 'Mask'] = 1
This question already has answers here:
Mapping ranges of values in pandas dataframe [duplicate]
I have a pandas dataframe and I want to create categories in a new column based on the values of another column. I can solve my basic problem by doing this:
range = {
    range(0, 5): 'Below 5',
    range(6, 10): 'between',
    range(11, 1000): 'above'
}
df['range'] = df['value'].map(range)
In the final dictionary key I have chosen a large upper value for range to ensure it captures all the values I am trying to map. However, this seems an ugly hack and am wondering how to generalise this without specifying the upper limit. ie. if > 10 : 'above'.
Thanks
Assume you have a dataframe like this:
range value
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
Then you can apply the following function to the column 'value':
def get_value(range_value):
    if range_value < 5:
        return 'Below 5'
    elif range_value < 10:
        return 'Between 5 and 10'
    else:
        return 'Above 10'

df['value'] = df.apply(lambda col: get_value(col['range']), axis=1)
This gives the result you want.
You could set all values first to 'above', and then use map() for the remaining options, filling the unmatched entries back in from that default (thus with your range dict having only two items):
range = {
    range(0, 5): 'Below 5',
    range(6, 10): 'between',
}
df['range'] = 'above'
df['range'] = df['value'].map(range).fillna(df['range'])
Thanks for the hints. I see I can achieve the same with:
df['range'] = df['value'].map(range).fillna('above')
pandas.Series.map also accepts a function as its first argument, so you could do:
def fun(x):
    if x in range(0, 5):
        return 'Below 5'
    elif x in range(6, 10):
        return 'between'
    elif x >= 11:
        return 'above'
then:
df['range'] = df['value'].map(fun)
Here's another approach using numpy.select, where you specify a list of boolean conditions, and a list of choices:
import numpy as np
# Setup
df = pd.DataFrame({'value': [1, 3, 6, 8, 20, 10000000]})
condlist = [
    df['value'].lt(5),
    df['value'].between(5, 10),
    df['value'].gt(10)]
choicelist = ['Below 5', 'between', 'above']
df['out'] = np.select(condlist, choicelist)
print(df)
[out]
value out
0 1 Below 5
1 3 Below 5
2 6 between
3 8 between
4 20 above
5 10000000 above
Another idea would be to use pandas.cut with bins and labels parameters specified:
df['out'] = pd.cut(df['value'], bins=[-np.inf, 5, 10, np.inf],
labels=['below', 'between', 'above'])
value out
0 1 below
1 3 below
2 6 between
3 8 between
4 20 above
5 10000000 above
df['range'] = pd.cut(df['value'], bins = [0, 5, 10, 1000], labels = ["below 5", "between", "above"])
I want to perform a for loop in pandas: for each row i, take column x1 and perform a test (if/else statements).
In R I will do like this:
df <- data.frame(x1 = rnorm(10), x2 = rexp(10))
for (i in 1:length(df$x1)) {
  if (df[i, 'x1'] > 0) {
    print('+')
  } else {
    print('-')
  }
}
How can I do this in pandas data frame?
P.S. I need to perform a loop like this. But if you have better ideas, I will appreciate them.
EDIT:
In case of multiple comparisons:
Thank you for the answer!
And maybe you can give me advice: how can I do the iteration if I have multiple if/else statements? For example:
if x > 0:
    if x % 2 == 0:
        # do stuff 1
    else:
        # do other stuff 2
elif x < 0:
    if x % 2 == 0:
        # do stuff 3
    else:
        # do other stuff 4
If you need a new column, use numpy.where:
np.random.seed(54)
df = pd.DataFrame({'x1':np.random.randint(10, size=10)}) - 5
df['new'] = np.where(df['x1'] > 0, '+', '-')
print (df)
x1 new
0 0 -
1 -3 -
2 2 +
3 -4 -
4 -5 -
5 3 +
6 2 +
7 -4 -
8 4 +
9 1 +
But if you need a loop (obviously avoid it, because it is slow), you can use items() (formerly iteritems(), which was removed in pandas 2.0):
for i, x in df['x1'].items():
    if x > 0:
        print('+')
    else:
        print('-')
EDIT:
df['new'] = np.where(df['x1'] > 0, 'a',
np.where(df['x1'] & 2, 'b', 'c'))
print (df)
x1 new
0 0 c
1 -3 c
2 2 a
3 -4 c
4 -5 b
5 3 a
6 2 a
7 -4 c
8 4 a
9 1 a
But if you have many conditions (4 or more), use apply with a custom function:
def f(x):
    # x == 0
    y = 5
    if x > 0:
        if x % 2 == 0:
            y = 0  # do stuff 1
        else:
            y = 1  # do other stuff 2
    elif x < 0:
        if x % 2 == 0:
            y = 2  # do stuff 3
        else:
            y = 3  # do other stuff 4
    return y

df['new'] = df['x1'].apply(f)
print (df)
x1 new
0 0 5
1 -3 3
2 2 0
3 -4 2
4 -5 3
5 3 1
6 2 0
7 -4 2
8 4 0
9 1 1
You can use this code to print out each index with the correct symbol:
print(df['x1'].map(lambda x: '+' if x > 0 else '-').to_string(index=False))
What the above code does is create a new Series object, using map to convert each value into a + if x > 0 and a - otherwise. The Series is then converted to a string and printed without indices.
But if you absolutely need to loop through each row, you can use the following code, which is what you have but condensed into 2 lines:
for i in df['x1']:
    print('+' if i > 0 else '-')