Apply Python function to multiple Pandas columns


I am attempting to write a function and apply it to multiple fields in a pandas dataframe.
The function takes column colA1 and assigns a value to a new column, colB1, based on conditional statements.
This function works if a single column is given, e.g. colA1, but how could I write it to iterate through a
list of columns, returning a corresponding number of new columns?
The following function works on a single column:
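import pandas as pd
# named "data" rather than "dict" to avoid shadowing the built-in
data = {'colA1': [2, 6, 8, 28, 5],
        'colA2': [38, 6, 14, 63, 3],
        'colA3': [90, 40, 80, 98, 3]}
df = pd.DataFrame(data)
def function(x):
    # x is a single value taken from the column
    if x <= 10:
        return '<= 10'
    elif x > 10:
        return '> 10'
df['colB1'] = df['colA1'].apply(function)
df['colB1']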
This returns:
0 <= 10
1 <= 10
2 <= 10
3 > 10
4 <= 10
I attempted to apply it to multiple columns as shown in "Update Multiple Columns using Pandas Apply Function":
df[['colB1', 'colB2', 'colB3']]=df[['colA1', 'colA2', 'colA3']].apply(function)
But this returns:
ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index colA1')

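If this is actually what you want to do, a faster alternative is np.select():
import numpy as np
cond = [df <= 10, df > 10]
choice = ['<= 10', '> 10']
# a string default keeps every output value the same type as the choices
df[:] = np.select(cond, choice, default='')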
print(df)
colA1 colA2 colA3
0 <= 10 > 10 > 10
1 <= 10 <= 10 > 10
2 <= 10 > 10 > 10
3 > 10 > 10 > 10
4 <= 10 <= 10 <= 10
You can also try df.applymap() with your function (from pandas 2.1 onwards, applymap is deprecated in favour of DataFrame.map):
df[['colA1','colA2','colA3']].applymap(function)
#df.applymap(function)
colA1 colA2 colA3
0 <= 10 > 10 > 10
1 <= 10 <= 10 > 10
2 <= 10 > 10 > 10
3 > 10 > 10 > 10
4 <= 10 <= 10 <= 10
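To answer the question as asked (a list of input columns producing a matching list of new columns), one possible sketch follows. DataFrame.apply hands each whole column to the function, which is why the scalar comparison raised the ValueError, whereas Series.map calls the function once per value. The names label, source_cols and target_cols are illustrative and do not appear in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "colA1": [2, 6, 8, 28, 5],
    "colA2": [38, 6, 14, 63, 3],
    "colA3": [90, 40, 80, 98, 3],
})


def label(x):
    # Receives one scalar at a time, so a plain comparison is fine here.
    return "<= 10" if x <= 10 else "> 10"


# Hypothetical names: pair each input column with the column to create.
source_cols = ["colA1", "colA2", "colA3"]
target_cols = ["colB1", "colB2", "colB3"]

for src, dst in zip(source_cols, target_cols):
    # Series.map applies label element by element.
    df[dst] = df[src].map(label)

print(df)
```

The original colA columns are left untouched, and adding another pair only means extending the two lists.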

This should do it:
df.apply(lambda x: pd.Series([function(x['colA1']),function(x['colA2']),function(x['colA3'])]), axis=1).rename({0:'colA1',1:'colA2',2:'colA3'}, axis=1)
Output:
colA1 colA2 colA3
0 <= 10 > 10 > 10
1 <= 10 <= 10 > 10
2 <= 10 > 10 > 10
3 > 10 > 10 > 10
4 <= 10 <= 10 <= 10

Related

Python: Add a complex conditional column without for loop

I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient.
Can my code be simplified and made more efficient?
(I've tried masks but I can't get my head around the syntax as I'm a relative newbie to python).
import pandas as pd
path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"
racecard_path = path + hist_file
df = pd.read_excel(racecard_path)
df["Mask"] = df["HxFPos"].copy()
df["Total"] = df["HxFPos"].copy()
cnt = -1
for trn in df["HxRun"]:
    cnt = cnt + 1
    if df.loc[cnt,"HxFPos"] > 6 or df.loc[cnt,"HxTotalBtn"] > 30:
        df.loc[cnt,"Mask"] = 0
    elif df.loc[cnt,"HxFPos"] < 2 and df.loc[cnt,"HxRun"] < 4 and df.loc[cnt,"HxTotalBtn"] < 10:
        df.loc[cnt,"Mask"] = 1
    elif df.loc[cnt,"HxFPos"] < 4 and df.loc[cnt,"HxRun"] < 9 and df.loc[cnt,"HxTotalBtn"] < 10:
        df.loc[cnt,"Mask"] = 1
    elif df.loc[cnt,"HxFPos"] < 5 and df.loc[cnt,"HxRun"] < 20 and df.loc[cnt,"HxTotalBtn"] < 20:
        df.loc[cnt,"Mask"] = 1
    else:
        df.loc[cnt,"Mask"] = 0
    df.loc[cnt,"Total"] = df.loc[cnt,"Mask"] * df.loc[cnt,"HxFPos"]
df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)
Sample data/output:
HxRun HxFPos HxTotalBtn Mask Total
7 5 8 0 0
13 3 2.75 1 3
12 5 3.75 0 0
11 5 5.75 0 0
11 7 9.25 0 0
11 9 14.5 0 0
10 10 26.75 0 0
8 4 19.5 1 4
8 8 67 0 0

Use df.assign() for a complex vectorized expression
Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
.loc[]
df.assign()
or alternatively df.query (if you like SQL syntax)
or if you insist on doing it by iteration (you shouldn't), you never need to use an explicit for-loop with .loc[] as you did, you can use:
df.apply(your_function_or_lambda, axis=1)
or df.iterrows() as a fallback
df.assign() (or df.query) will be less grief when you have long column names (as you do) that get used repeatedly in a complex expression.
Solution with df.assign()
Rewrite your formula for clarity
When we remove all the unneeded .loc[] calls your formula boils down to:
HxFPos > 6 or HxTotalBtn > 30:
    Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
    Mask = 1
HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20:
    Mask = 1
else:
    Mask = 0
pandas has historically lacked a native case statement/method (Series.case_when was only added in pandas 2.2).
Renaming your variables HxFPos->f, HxRun->r, HxTotalBtn->btn for clarity:
(f > 6) or (btn > 30):
    Mask = 0
(f < 2) and (r < 4) and (btn < 10):
    Mask = 1
(f < 4) and (r < 9) and (btn < 10):
    Mask = 1
(f < 5) and (r < 20) and (btn < 20):
    Mask = 1
else:
    Mask = 0
So really the whole boolean expression for Mask is gated by (f <= 6) and (btn <= 30), which is the negation of the first clause. (Actually your clauses imply you can only have Mask=1 for (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f<= 6) & (btn <= 30)) & ... you_do_the_rest
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a logical expression (a vector of 9 bools, one per row); you can use that directly in df.loc[logical_expression_for_row, 'Mask'].
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
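Putting those pieces together, a minimal sketch of the full df.assign() rewrite could look like the following. It rebuilds the sample rows from the question rather than reading the Excel file, and the intermediate names excluded and qualifies are illustrative only.

```python
import pandas as pd

# Sample rows copied from the question's "Sample data/output" table.
df = pd.DataFrame({
    "HxRun": [7, 13, 12, 11, 11, 11, 10, 8, 8],
    "HxFPos": [5, 3, 5, 5, 7, 9, 10, 4, 8],
    "HxTotalBtn": [8, 2.75, 3.75, 5.75, 9.25, 14.5, 26.75, 19.5, 67],
})

f, r, btn = df["HxFPos"], df["HxRun"], df["HxTotalBtn"]

# First clause of the loop: these rows always get Mask = 0.
excluded = (f > 6) | (btn > 30)

# Any one of the three "Mask = 1" clauses.
qualifies = (
    ((f < 2) & (r < 4) & (btn < 10))
    | ((f < 4) & (r < 9) & (btn < 10))
    | ((f < 5) & (r < 20) & (btn < 20))
)

result = df.assign(
    Mask=(~excluded & qualifies).astype(int),
    # Later keyword arguments may refer to columns created earlier in the same call.
    Total=lambda d: d["Mask"] * d["HxFPos"],
)
print(result)
```

On these nine rows the Mask and Total columns match the sample output in the question, with no explicit loop.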

Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column
by @Hossein-Kalbasi
I've just found an answer - please comment if this is not the most efficient.
df.loc[(((df['HxFPos']<3)&(df['HxRun']<5)|(df['HxRun']>4)&(df['HxFPos']<5)&(df['HxRun']<9)|(df['HxRun']>8)&(df['HxFPos']<6)&(df['HxRun']<30))&(df['HxTotalBtn']<30)), 'Mask'] = 1

Creating a function to iterate through DataFrame

I am running into an issue creating a function that will recognize if a particular value in a column is between two values.
def bid(x):
    if df['tla'] < 85000:
        return 1
    elif (df['tla'] >= 85000) & (df['tla'] < 110000):
        return 2
    elif (df['tla'] >= 111000) & (df['tla'] < 126000):
        return 3
    elif (df['tla'] >= 126000) & (df['tla'] < 150000):
        return 4
    elif (df['tla'] >= 150000) & (df['tla'] < 175000):
        return 5
    elif (df['tla'] >= 175000) & (df['tla'] < 200000):
        return 6
    elif (df['tla'] >= 200000) & (df['tla'] < 250000):
        return 7
    elif (df['tla'] >= 250000) & (df['tla'] < 300000):
        return 8
    elif (df['tla'] >= 300000) & (df['tla'] < 375000):
        return 9
    elif (df['tla'] >= 375000) & (df['tla'] < 453100):
        return 10
    elif df['tla'] >= 453100:
        return 11
I apply that to my new column:
df['bid_bucket'] = df['bid_bucket'].apply(bid)
And I am getting this error back:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Anyone have any ideas?

Try the following, using numpy.select:
import numpy as np
values = [1,2,3,4,5,6,7,8,9,10,11]
cond = [df['tla'] < 85000, (df['tla'] >= 85000) & (df['tla'] < 110000), .... ]
df['bid_bucket'] = np.select(cond, values)

This can already be accomplished with pd.cut, defining the bin edges, and adding +1 to the labels to get your numbering to start at 1.
import pandas as pd
import numpy as np
df = pd.DataFrame({'tla': [7, 85000, 111000, 88888, 126000, 51515151]})
df['bid_bucket'] = pd.cut(df.tla, right=False,
                          bins=[-np.inf, 85000, 110000, 126000, 150000, 175000,
                                200000, 250000, 300000, 375000, 453100, np.inf],
                          labels=False) + 1
Output: df
tla bid_bucket
0 7 1
1 85000 2
2 111000 3
3 88888 2
4 126000 4
5 51515151 11

You can simply use the np.digitize function to assign the ranges (evenly spaced edges are used here for brevity; the example below uses the question's explicit edges):
df['bid_bucket'] = np.digitize(df['tla'], np.arange(85000,453100,25000))
Example
a = np.random.randint(85000,400000,10)
#array([305628, 134122, 371486, 119856, 321423, 346906, 319321, 165714,360896, 206404])
bins=[-np.inf, 85000, 110000, 126000, 150000, 175000,
200000, 250000, 300000, 375000, 453100, np.inf]
np.digitize(a,bins)
Out:
array([9, 4, 9, 3, 9, 9, 9, 5, 9, 7])
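As a cross-check, the sketch below runs np.digitize and pd.cut over the same finite edges and compares the bucket numbers. It assumes the buckets are contiguous (the gap between 110000 and 111000 in the question's function is treated as belonging to bucket 3), and the names edges, via_digitize and via_cut are illustrative.

```python
import numpy as np
import pandas as pd

# Finite bucket edges; anything below the first edge is bucket 1.
edges = [85000, 110000, 126000, 150000, 175000,
         200000, 250000, 300000, 375000, 453100]

tla = pd.Series([7, 85000, 111000, 88888, 126000, 51515151])

# np.digitize returns 0 for values below the first edge, so add 1.
via_digitize = np.digitize(tla.to_numpy(), edges) + 1

# pd.cut needs the open-ended outer bins spelled out with +/- infinity.
via_cut = pd.cut(tla, bins=[-np.inf, *edges, np.inf],
                 right=False, labels=False) + 1

print(via_digitize.tolist())
print(via_cut.tolist())
```

Both calls treat each bin as closed on the left and open on the right, so a value that sits exactly on an edge (85000 or 126000 here) lands in the higher bucket.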

To keep it in pandas: referencing df['tla'] inside your function refers to the whole Series rather than a single value, which is what leads to the ambiguity. Pass the individual value instead. You could use lambda x, and your code could then look something like this:
df = pd.DataFrame({'tla':[10,123456,999999]})
def bid(x):
    if x < 85000:
        return 1
    elif (x >= 85000 and x < 110000):
        return 2
    elif (x >= 111000 and x < 126000):
        return 3
    elif (x >= 126000 and x < 150000):
        return 4
    elif (x >= 150000 and x < 175000):
        return 5
    elif (x >= 175000 and x < 200000):
        return 6
    elif (x >= 200000 and x < 250000):
        return 7
    elif (x >= 250000 and x < 300000):
        return 8
    elif (x >= 300000 and x < 375000):
        return 9
    elif (x >= 375000 and x < 453100):
        return 10
    elif x >= 453100:
        return 11
df['bid_bucket'] = df['tla'].apply(lambda x: bid(x))
df

You have two possibilities.
Either apply a function defined on a row to the DataFrame, row by row:
def function_on_a_row(row):
    if row.tla > ...
        ...
df.apply(function_on_a_row, axis=1)
In that case, keep bid the way you defined it, but rename the parameter x to something like "row" and replace df with row in the body so the names stay meaningful, then use:
df['bid_bucket'] = df.apply(bid, axis=1)
Or apply a function defined on an element to a pandas Series:
def function_on_an_elt(element_of_series):
    if element_of_series > ...
        ...
df['new_column'] = df.my_column_of_interest.apply(function_on_an_elt)
In your case redefine bid accordingly.
Here you tried to mix both approaches, which does not work.
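The two approaches can be compared side by side in a small sketch. The three-bucket thresholds and the function names bucket_from_value and bucket_from_row are simplified stand-ins chosen for illustration, not the question's full bucket list.

```python
import pandas as pd

df = pd.DataFrame({"tla": [10, 123456, 999999]})


def bucket_from_value(value):
    # Element-wise: receives one scalar taken from the Series.
    if value < 85000:
        return 1
    elif value < 453100:
        return 2
    return 3


def bucket_from_row(row):
    # Row-wise: receives one row and picks out the field it needs.
    return bucket_from_value(row["tla"])


# Series.apply passes each element to the function.
df["by_element"] = df["tla"].apply(bucket_from_value)

# DataFrame.apply with axis=1 passes each row to the function.
df["by_row"] = df.apply(bucket_from_row, axis=1)

print(df)
```

Neither function touches the global df, which is what keeps every comparison scalar and avoids the ambiguous truth value error.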

multiple if else conditions in pandas dataframe and derive multiple columns

I have a dataframe like below.
import pandas as pd
import numpy as np
raw_data = {'student':['A','B','C','D','E'],
'score': [100, 96, 80, 105,156],
'height': [7, 4,9,5,3],
'trigger1' : [84,95,15,78,16],
'trigger2' : [99,110,30,93,31],
'trigger3' : [114,125,45,108,46]}
df2 = pd.DataFrame(raw_data, columns = ['student','score', 'height','trigger1','trigger2','trigger3'])
print(df2)
I need to derive a Flag column based on multiple conditions.
I need to compare the score and height columns with the trigger1 to trigger3 columns.
Flag column:
if score is greater than or equal to trigger1 and height is less than 8, then Red
if score is greater than or equal to trigger2 and height is less than 8, then Yellow
if score is greater than or equal to trigger3 and height is less than 8, then Orange
if height is greater than 8, then leave it blank
How can I write if-else conditions on a pandas dataframe and derive columns from them?
Expected Output
student score height trigger1 trigger2 trigger3 Flag
0 A 100 7 84 99 114 Yellow
1 B 96 4 95 110 125 Red
2 C 80 9 15 30 45 NaN
3 D 105 5 78 93 108 Yellow
4 E 156 3 16 31 46 Orange
For the other column, Text1, from my original question I tried the following, but the integer columns are not converted to strings by astype(str) during concatenation. Is there another approach?
def text_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger1'].astype(str) + " and less than height 5"
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger2'].astype(str) + " and less than height 5"
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger3'].astype(str) + " and less than height 5"
    elif (df['height'] > 8):
        return np.nan

You need a chained comparison using upper and lower bounds:
def flag_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return 'Red'
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return 'Yellow'
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return 'Orange'
    elif (df['height'] > 8):
        return np.nan
df2['Flag'] = df2.apply(flag_df, axis = 1)
student score height trigger1 trigger2 trigger3 Flag
0 A 100 7 84 99 114 Yellow
1 B 96 4 95 110 125 Red
2 C 80 9 15 30 45 NaN
3 D 105 5 78 93 108 Yellow
4 E 156 3 16 31 46 Orange
Note: You can do this with a deeply nested np.where, but I prefer to apply a function for multiple if-else branches.
Edit: answering @Cecilia's questions
What if the returned object is not a string but some calculation? For example, for the first condition, we want to return df['height']*2.
Not sure what you tried, but you can return a derived value instead of a string using:
def flag_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['height']*2
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['height']*3
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['height']*4
    elif (df['height'] > 8):
        return np.nan
What if there are 'NaN' values in some columns and I want to use df['xxx'] is None as a condition? The code does not seem to work.
Again, not sure what code you tried, but pandas isnull does the trick:
def flag_df(df):
    if pd.isnull(df['height']):
        return df['height']
    elif (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['height']*2
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['height']*3
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['height']*4
    elif (df['height'] > 8):
        return np.nan

Here is a way to use numpy.select() for this, with code that is neat, scalable and faster:
conditions = [
    (df2['trigger1'] <= df2['score']) & (df2['score'] < df2['trigger2']) & (df2['height'] < 8),
    (df2['trigger2'] <= df2['score']) & (df2['score'] < df2['trigger3']) & (df2['height'] < 8),
    (df2['trigger3'] <= df2['score']) & (df2['height'] < 8),
    (df2['height'] > 8)
]
choices = ['Red','Yellow','Orange', np.nan]
df2['Flag1'] = np.select(conditions, choices, default=np.nan)
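Mixing string choices with np.nan may be rejected by some NumPy versions because the two do not share a dtype, so a more cautious variant keeps every choice a string and blanks out the unmatched rows afterwards. This is only a sketch: the empty-string placeholder and the names short, score and labels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "student": ["A", "B", "C", "D", "E"],
    "score": [100, 96, 80, 105, 156],
    "height": [7, 4, 9, 5, 3],
    "trigger1": [84, 95, 15, 78, 16],
    "trigger2": [99, 110, 30, 93, 31],
    "trigger3": [114, 125, 45, 108, 46],
})

score = df2["score"]
short = df2["height"] < 8

# The first matching condition wins, as in an if/elif chain.
conditions = [
    short & (df2["trigger1"] <= score) & (score < df2["trigger2"]),
    short & (df2["trigger2"] <= score) & (score < df2["trigger3"]),
    short & (df2["trigger3"] <= score),
]
choices = ["Red", "Yellow", "Orange"]

# Use an empty string as the placeholder so every value is a string.
labels = np.select(conditions, choices, default="")

# Turn the placeholder into a missing value once the labels are in pandas.
df2["Flag"] = pd.Series(labels, index=df2.index).mask(labels == "")
print(df2)
```

Rows where no condition matches (student C, with height 9) end up as missing values, which reproduces the expected output in the question.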

You can also use apply with a custom function on axis 1, like this:
def color_selector(x):
    if (x['trigger1'] <= x['score'] < x['trigger2']) and (x['height'] < 8):
        return 'Red'
    elif (x['trigger2'] <= x['score'] < x['trigger3']) and (x['height'] < 8):
        return 'Yellow'
    elif (x['trigger3'] <= x['score']) and (x['height'] < 8):
        return 'Orange'
    elif (x['height'] > 8):
        return ''
df2 = df2.assign(flag=df2.apply(color_selector, axis=1))

Increasing column value pandas

I have a dataframe of 143999 rows which contains position and time data.
I already made a column "dt" which calculates the time difference between rows.
Now I want to create a new column which gives the dt values a group number.
So it starts with group = 0 and when dt > 60 the group number should increase by 1.
I tried the following:
def group(x):
    c = 0
    if densdata["dt"] < 60:
        densdata["group"] = c
    elif densdata["dt"] >= 60:
        c += 1
        densdata["group"] = c
densdata["group"] = densdata.apply(group, axis=1)
The error that I get is: The truth value of a Series is ambiguous.
Any ideas how to fix this problem?
This is what I want:
dt group
0.01 0
2 0
0.05 0
300 1
2 1
60 2

You can take advantage of the fact that True evaluates to 1 and use .cumsum().
densdata = pd.DataFrame({'dt': np.random.randint(low=50,high=70,size=20),
                         'group' : np.zeros(20, dtype=np.int32)})
print(densdata.head())
dt group
0 52 0
1 59 0
2 69 0
3 55 0
4 63 0
densdata['group'] = (densdata.dt >= 60).cumsum()
print(densdata.head())
dt group
0 52 0
1 59 0
2 69 1
3 55 1
4 63 2
If you want to guarantee that the first value of group will be 0, even if the first value of dt is >= 60, then use
densdata['group'] = (densdata.dt.replace(densdata.dt[0],np.nan) >= 60).cumsum()
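One caveat with the replace-based line: Series.replace swaps every occurrence of the first dt value, so later rows that happen to share that value would stop counting as group boundaries. A possible alternative, sketched below, clears only the first flag by position. The helper name group_ids and its threshold parameter are hypothetical, and the sketch assumes a non-empty Series.

```python
import pandas as pd


def group_ids(dt, threshold=60):
    # True wherever a new group should start.
    starts_new_group = dt >= threshold
    # Force the first row into group 0 without touching any other row.
    starts_new_group.iloc[0] = False
    # True counts as 1, so the running sum is the group number.
    return starts_new_group.cumsum()


# The dt values from the question's desired output.
wanted = group_ids(pd.Series([0.01, 2, 0.05, 300, 2, 60]))

# The first value (60) repeats later and must still start a new group.
repeated = group_ids(pd.Series([60, 5, 60, 5]))

print(wanted.tolist())
print(repeated.tolist())
```

The comparison creates a new boolean Series, so the caller's dt column is not modified.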

Use two or more relational operators in one sentence in python

How do two or more relational operators in a single sentence work? For example:
5 < 5 <= 3 > 10

Python supports double-ended comparisons. For example,
3 < x <= 7
is a check for 3 < x and x <= 7 (with x being evaluated just once).
By extension,
5 < 5 <= 3 > 10
means (5 < 5) and (5 <= 3) and (3 > 10), all of which are False, so the whole expression evaluates to False.
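A short sketch can confirm both points: the chained form matches the expanded "and" form, and the middle operand is evaluated only once. The helper middle and the calls list exist purely to count evaluations.

```python
calls = []


def middle():
    # Record each evaluation so the count can be checked afterwards.
    calls.append("evaluated")
    return 5


# Equivalent to (3 < middle()) and (middle() <= 7), except that
# middle() is called a single time.
chained = 3 < middle() <= 7

# The expression from the question and its expanded form.
original = 5 < 5 <= 3 > 10
expanded = (5 < 5) and (5 <= 3) and (3 > 10)

print(chained, len(calls))
print(original, expanded)
```

Like "and", a chained comparison short-circuits: once 5 < 5 is found to be False, the remaining comparisons are not evaluated.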

https://docs.python.org/2/reference/expressions.html#comparisons
It's evaluated in order, so your expression expands to
5 < 5 and 5 <= 3 and 3 > 10
which evaluates to False
