I have a dataframe like below.
import pandas as pd
import numpy as np
raw_data = {'student':['A','B','C','D','E'],
'score': [100, 96, 80, 105,156],
'height': [7, 4,9,5,3],
'trigger1' : [84,95,15,78,16],
'trigger2' : [99,110,30,93,31],
'trigger3' : [114,125,45,108,46]}
df2 = pd.DataFrame(raw_data, columns = ['student','score', 'height','trigger1','trigger2','trigger3'])
print(df2)
I need to derive a Flag column based on multiple conditions: I need to compare the score and height columns against the trigger1 to trigger3 columns.
Flag Column:
if score is greater than or equal to trigger1 and height is less than 8, then Red
if score is greater than or equal to trigger2 and height is less than 8, then Yellow
if score is greater than or equal to trigger3 and height is less than 8, then Orange
if height is greater than 8, then leave it blank
How to write if else conditions in pandas dataframe and derive columns?
Expected Output
student score height trigger1 trigger2 trigger3 Flag
0 A 100 7 84 99 114 Yellow
1 B 96 4 95 110 125 Red
2 C 80 9 15 30 45 NaN
3 D 105 5 78 93 108 Yellow
4 E 156 3 16 31 46 Orange
For another column, Text1, in my original question I tried the code below, but the integer columns are not converting to string when concatenating with astype(str). Is there another approach?
def text_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger1'].astype(str) + " and less than height 5"
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger2'].astype(str) + " and less than height 5"
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger3'].astype(str) + " and less than height 5"
    elif (df['height'] > 8):
        return np.nan
You need a chained comparison using an upper and a lower bound:
def flag_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return 'Red'
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return 'Yellow'
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return 'Orange'
    elif (df['height'] > 8):
        return np.nan

df2['Flag'] = df2.apply(flag_df, axis=1)
student score height trigger1 trigger2 trigger3 Flag
0 A 100 7 84 99 114 Yellow
1 B 96 4 95 110 125 Red
2 C 80 9 15 30 45 NaN
3 D 105 5 78 93 108 Yellow
4 E 156 3 16 31 46 Orange
Note: You can do this with a deeply nested np.where, but I prefer applying a function for multiple if-else conditions.
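For comparison, here is a sketch of what the nested np.where version could look like, using the same df2 as above (the threshold order, checking trigger3 first, is my reading of the expected output):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'student': ['A', 'B', 'C', 'D', 'E'],
                    'score': [100, 96, 80, 105, 156],
                    'height': [7, 4, 9, 5, 3],
                    'trigger1': [84, 95, 15, 78, 16],
                    'trigger2': [99, 110, 30, 93, 31],
                    'trigger3': [114, 125, 45, 108, 46]})

ok = df2['height'] < 8  # flags only apply below height 8
df2['Flag'] = np.where(ok & (df2['score'] >= df2['trigger3']), 'Orange',
              np.where(ok & (df2['score'] >= df2['trigger2']), 'Yellow',
              np.where(ok & (df2['score'] >= df2['trigger1']), 'Red', None)))
```

The checks run from the highest trigger down, so each row gets the strongest flag it qualifies for; rows with height above 8 stay None.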
Edit: answering @Cecilia's questions
What if the returned object is not strings but some calculation, e.g. for the first condition we want to return df['height']*2?
Not sure what you tried, but you can return a derived value instead of a string:
def flag_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['height']*2
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['height']*3
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['height']*4
    elif (df['height'] > 8):
        return np.nan
What if there are NaN values in some columns and I want to use df['xxx'] is None as a condition? The code doesn't seem to work.
Again, not sure what code you tried, but using pandas isnull does the trick:
def flag_df(df):
    if pd.isnull(df['height']):
        return df['height']
    elif (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['height']*2
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['height']*3
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['height']*4
    elif (df['height'] > 8):
        return np.nan
Here is a way to do this with numpy.select(): neat, scalable and faster:
conditions = [
(df2['trigger1'] <= df2['score']) & (df2['score'] < df2['trigger2']) & (df2['height'] < 8),
(df2['trigger2'] <= df2['score']) & (df2['score'] < df2['trigger3']) & (df2['height'] < 8),
(df2['trigger3'] <= df2['score']) & (df2['height'] < 8),
(df2['height'] > 8)
]
choices = ['Red', 'Yellow', 'Orange', None]  # None rather than np.nan, so the strings are not coerced to 'nan'
df2['Flag1'] = np.select(conditions, choices, default=None)
You can also use apply with a custom function on axis 1, like this:
def color_selector(x):
    if (x['trigger1'] <= x['score'] < x['trigger2']) and (x['height'] < 8):
        return 'Red'
    elif (x['trigger2'] <= x['score'] < x['trigger3']) and (x['height'] < 8):
        return 'Yellow'
    elif (x['trigger3'] <= x['score']) and (x['height'] < 8):
        return 'Orange'
    elif (x['height'] > 8):
        return ''

df2 = df2.assign(flag=df2.apply(color_selector, axis=1))
You will get something like this:
Related
I need to divide the range of my passengers' ages into 5 parts and create a new column with values from 0 to 4, one for each part (0 for the first range, 1 for the second, and so on).
a = range(0,17)
b = range(17,34)
c = range(34, 51)
d = range(51, 68)
e = range(68,81)
a1 = titset.query('Age >= 0 & Age < 17')
a2 = titset.query('Age >= 17 & Age < 34')
a3 = titset.query('Age >= 34 & Age < 51')
a4 = titset.query('Age >= 51 & Age < 68')
a5 = titset.query('Age >= 68 & Age < 81')
titset['Age_bin'] = a1.apply(0 for a in range(a))
Here is what I tried, but it does not work. I also attached a picture of the dataset.
I expect a new column named 'Age_bin' with value 0 for Age from 0 to 16 inclusive, value 1 for ages 17 to 33, and so on for the other three ranges.
Binning with pandas cut is appropriate here; note that cut is a top-level pandas function, not a Series method, so try:
titset['Age_bin'] = pd.cut(titset['Age'], bins=[0, 17, 34, 51, 68, 81], right=False, labels=False)
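As a quick sanity check on toy ages (assumed data, not the OP's dataset), right=False makes each bin left-inclusive so that 17 falls in bin 1, matching the requirement:

```python
import pandas as pd

titset = pd.DataFrame({'Age': [0, 10, 16, 17, 33, 34, 50, 51, 67, 68, 80]})
# labels=False returns the integer bin index 0-4; right=False makes bins like [0, 17), [17, 34), ...
titset['Age_bin'] = pd.cut(titset['Age'], bins=[0, 17, 34, 51, 68, 81],
                           right=False, labels=False)
```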
First of all, the variable a is already a range object, and you are calling range(a) on it, which is equivalent to range(range(0, 17)); hence the error.
Secondly, even if you fixed that, you would run into another error, since .apply takes a callable (i.e. a function, whether defined with def or as a lambda), not a generator expression.
If your goal is to assign a new column that represents the age group that each row is in, you can just filter with your result and assign them:
titset = pd.DataFrame({'Age': range(1, 81)})
a = range(0,17)
b = range(17,34)
c = range(34, 51)
d = range(51, 68)
e = range(68,81)
a1 = titset.query('Age >= 0 & Age < 17')
a2 = titset.query('Age >= 17 & Age < 34')
a3 = titset.query('Age >= 34 & Age < 51')
a4 = titset.query('Age >= 51 & Age < 68')
a5 = titset.query('Age >= 68 & Age < 81')
titset.loc[a1.index, 'Age_bin'] = 0
titset.loc[a2.index, 'Age_bin'] = 1
titset.loc[a3.index, 'Age_bin'] = 2
titset.loc[a4.index, 'Age_bin'] = 3
titset.loc[a5.index, 'Age_bin'] = 4
Or better yet, use a for loop:
age_groups = [0, 17, 34, 51, 68, 81]
for i in range(len(age_groups) - 1):
    subset = titset.query(f'Age >= {age_groups[i]} & Age < {age_groups[i+1]}')
    titset.loc[subset.index, 'Age_bin'] = i
I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient.
Can my code be simplified and made more efficient?
(I've tried masks, but I can't get my head around the syntax, as I'm a relative newbie to Python.)
import pandas as pd
path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"
racecard_path = path + hist_file
df = pd.read_excel(racecard_path)
df["Mask"] = df["HxFPos"].copy
df["Total"] = df["HxFPos"].copy
cnt = -1
for trn in df["HxRun"]:
cnt = cnt + 1
if df.loc[cnt,"HxFPos"] > 6 or df.loc[cnt,"HxTotalBtn"] > 30:
df.loc[cnt,"Mask"] = 0
elif df.loc[cnt,"HxFPos"] < 2 and df.loc[cnt,"HxRun"] < 4 and df.loc[cnt,"HxTotalBtn"] < 10:
df.loc[cnt,"Mask"] = 1
elif df.loc[cnt,"HxFPos"] < 4 and df.loc[cnt,"HxRun"] < 9 and df.loc[cnt,"HxTotalBtn"] < 10:
df.loc[cnt,"Mask"] = 1
elif df.loc[cnt,"HxFPos"] < 5 and df.loc[cnt,"HxRun"] < 20 and df.loc[cnt,"HxTotalBtn"] < 20:
df.loc[cnt,"Mask"] = 1
else:
df.loc[cnt,"Mask"] = 0
df.loc[cnt,"Total"] = df.loc[cnt,"Mask"] * df.loc[cnt,"HxFPos"]
df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)
Sample data/output:
HxRun HxFPos HxTotalBtn Mask Total
7 5 8 0 0
13 3 2.75 1 3
12 5 3.75 0 0
11 5 5.75 0 0
11 7 9.25 0 0
11 9 14.5 0 0
10 10 26.75 0 0
8 4 19.5 1 4
8 8 67 0 0
Use df.assign() for a complex vectorized expression
Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
.loc[]
df.assign()
or alternatively df.query (if you like SQL syntax)
or if you insist on doing it by iteration (you shouldn't), you never need to use an explicit for-loop with .loc[] as you did, you can use:
df.apply(your_function_or_lambda, axis=1)
or df.iterrows() as a fallback
df.assign() (or df.query) will cause less grief when you have long column names (as you do) which get used repeatedly in a complex expression.
Solution with df.assign()
Rewrite your formula for clarity
When we remove all the unneeded .loc[] calls your formula boils down to:
HxFPos > 6 or HxTotalBtn > 30:
Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
Mask = 1
HxFPos < 5 and HxRun < 20 and HxTotalBtn < 20:
Mask = 1
else:
Mask = 0
pandas doesn't have a native case-statement/method.
Renaming your variables HxFPos->f, HxRun->r, HxTotalBtn->btn for clarity:
(f > 6) or (btn > 30):
Mask = 0
(f < 2) and (r < 4) and (btn < 10):
Mask = 1
(f < 4) and (r < 9) and (btn < 10):
Mask = 1
(f < 5) and (r < 20) and (btn < 20):
Mask = 1
else:
Mask = 0
So really the whole boolean expression for Mask is gated by (f <= 6) and (btn <= 30). (Actually your clauses imply you can only have Mask=1 for (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f<= 6) & (btn <= 30)) & ... you_do_the_rest
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a logical expression (a vector of 9 bools); you can use it directly in df.loc[logical_expression_for_row, 'Mask'].
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
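Putting the pieces together, one possible complete df.assign() version might look like this (a sketch built on the question's sample rows, not the answerer's final code; the gate clause from above is applied with ~):

```python
import pandas as pd

df = pd.DataFrame({'HxRun': [7, 13, 12, 11, 11, 11, 10, 8, 8],
                   'HxFPos': [5, 3, 5, 5, 7, 9, 10, 4, 8],
                   'HxTotalBtn': [8, 2.75, 3.75, 5.75, 9.25, 14.5, 26.75, 19.5, 67]})

df = df.assign(
    # Mask = 1 when any positive clause holds AND the gate clause does not force 0
    Mask=lambda d: (
        ~((d.HxFPos > 6) | (d.HxTotalBtn > 30)) &
        (((d.HxFPos < 2) & (d.HxRun < 4) & (d.HxTotalBtn < 10)) |
         ((d.HxFPos < 4) & (d.HxRun < 9) & (d.HxTotalBtn < 10)) |
         ((d.HxFPos < 5) & (d.HxRun < 20) & (d.HxTotalBtn < 20)))
    ).astype(int)
).assign(Total=lambda d: d.Mask * d.HxFPos)
```

This reproduces the Mask and Total columns of the sample output without any explicit loop.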
Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column, by @Hossein-Kalbasi
I've just found an answer - please comment if this is not the most efficient.
df.loc[(((df['HxFPos']<3)&(df['HxRun']<5)|(df['HxRun']>4)&(df['HxFPos']<5)&(df['HxRun']<9)|(df['HxRun']>8)&(df['HxFPos']<6)&(df['HxRun']<30))&(df['HxTotalBtn']<30)), 'Mask'] = 1
I have an excel file, importing as a dataframe. I want to use python to find matches found in the same row of the dataframe that are no more than 0.0002 difference. The rules are:
Start at column 15/row 1
Compare that value to column 16/row 1. Continue increasing the column number and comparing the new value to column 15/row 1. Loop until either a match is found or the last column is reached (which in my data is 125).
If it finds 2 numbers with a delta no more than 0.0002, it will continue to the next column/row 1 and will see if there is a third number that has a delta no more than 0.0002.
If it finds 3 numbers, then it will continue to the next column to search if there is a fourth number that has a delta no more than 0.0002.
If a fourth match is found, it will stop searching against the starting column and progress to the next starting column (explained below).
If any match is found, it will place the median of the numbers, rounded to four decimal places (if 2 matches, the median of two numbers; if 3 matches, the median of three; etc.), in a new column to the right of the existing data.
If any match is found, then the starting point (was column 15/1) will now move to column 16/1 and the process will repeat as described above. The goal with continuing is to see if there is another set of numbers that match.
When the starting-point column reaches the end, it goes to the next row.
I am trying to find the right code so I can get the value in row 1 column 1 and then compare it to the other values. Will this work?
df.iat[RowNum, ColNum]
When I find a match, I create four holding columns for each match type (2, 3, or 4 matches, i.e. 12 columns in total). Each row will have a varying number of matches (or none), but for future analysis I need these values in fixed column locations to reference; that is why I planned four holding columns per match type.
For this piece of code, since I know the column name, I was looking to use the column name plus the row number (an integer) to find the right location to enter the value. Is this correct? (I concatenate the column name because the four holding columns for each match type end in 1, 2, 3, 4; if more than one match is found on a row, I have multiple columns to hold them.)
df[ColumnName + str(3)].iloc[RowNum]
I tried to figure out how to get a single 'cell' by using integers (like Cells() in Excel), but I'm not sure if this is the right way to do it. The documentation on .loc and .iloc talks about gathering rows of data, not a single 'cell'.
Here is a sample of the dataframe (due to width constraints, I only show the first column of each match type; the row with TwoMatch2 filled matched two different sets of numbers, but there are four columns in total for each type).
High Low Open Close TwoMatch1 TwoMatch2...ThrMatch1...ForMatch1
0 1.11165 1.11128 1.11137 1.11165 1.1117
1 1.11165 1.11139 1.11148 1.11165
2 1.11167 1.11138 1.11166 1.11138 1.1117 1.1114
3 1.11165 1.11144 1.11165 1.11163 1.1117
4 1.11165 1.11149 1.1115 1.11165
5 1.11165 1.1115 1.11163 1.11163 1.1116 1.1116
6 1.11165 1.11159 1.11159 1.11159 1.1116 1.1116 1.1116
When the code finishes, it writes the dataframe back to Excel, CSV or a database (I am working on replacing Excel with a database). It will contain the original data plus the new columns holding the matches for each row.
Here is the code I have developed, which I need the above formulas to finalize (in case it helps to know my intentions):
df = df.reindex(columns=df.columns.tolist() + ['TwoRBs1','TwoRBs2','TwoRBs3','TwoRBs4','ThrRBs1','ThrRBs2','ThrRBs3','ThrRBs4','ForRBs1','ForRBs2','ForRBs3','ForRBs4'])  # reindex returns a new frame, so assign it back
RowNum = 0
ttlcount = 5
OneMinGroupFlag = 0
FiveMinGroupFlag = 0
FifteenMinGroupFlag = 0
SixtyMinGroupFlag = 0
TwoFortyMinGroupFlag = 0
ColValues = 0
#------------------------------------------------------------------------------------------------------------------------------------------------------------#
#----------------------------------------------------------------Functions-----------------------------------------------------------------------------------#
#------------------------------------------------------------------------------------------------------------------------------------------------------------#
def AssignMinGroup(ColmnNum):
    """If the column of a match was found in the group, sets the flag so that group is not checked again."""
    # these flags live at module level, so global (not nonlocal) is required
    global OneMinGroupFlag, FiveMinGroupFlag, FifteenMinGroupFlag, SixtyMinGroupFlag, TwoFortyMinGroupFlag
    if (ColmnNum >= 14 and ColmnNum <= 19) or (ColmnNum >= 44 and ColmnNum <= 59): OneMinGroupFlag = 1
    elif (ColmnNum >= 20 and ColmnNum <= 25) or (ColmnNum >= 60 and ColmnNum <= 75): FiveMinGroupFlag = 1
    elif (ColmnNum >= 26 and ColmnNum <= 31) or (ColmnNum >= 76 and ColmnNum <= 91): FifteenMinGroupFlag = 1
    elif (ColmnNum >= 32 and ColmnNum <= 37) or (ColmnNum >= 92 and ColmnNum <= 107): SixtyMinGroupFlag = 1
    elif (ColmnNum >= 38 and ColmnNum <= 43) or (ColmnNum >= 108 and ColmnNum <= 123): TwoFortyMinGroupFlag = 1
def FilterGroups(ColmnNum):
    """Determines whether it is about to test a group that is to be filtered; if so, returns True so the caller skips to the next column/step number."""
    global OneMinGroupFlag, FiveMinGroupFlag, FifteenMinGroupFlag, SixtyMinGroupFlag, TwoFortyMinGroupFlag
    if ColmnNum in (44, 45, 60, 61, 76, 77, 92, 93, 108, 109): return True
    if OneMinGroupFlag == 1 and ((ColmnNum >= 14 and ColmnNum <= 19) or (ColmnNum >= 44 and ColmnNum <= 59)): return True
    elif FiveMinGroupFlag == 1 and ((ColmnNum >= 20 and ColmnNum <= 25) or (ColmnNum >= 60 and ColmnNum <= 75)): return True
    elif FifteenMinGroupFlag == 1 and ((ColmnNum >= 26 and ColmnNum <= 31) or (ColmnNum >= 76 and ColmnNum <= 91)): return True
    elif SixtyMinGroupFlag == 1 and ((ColmnNum >= 32 and ColmnNum <= 37) or (ColmnNum >= 92 and ColmnNum <= 107)): return True
    elif TwoFortyMinGroupFlag == 1 and ((ColmnNum >= 38 and ColmnNum <= 43) or (ColmnNum >= 108 and ColmnNum <= 123)): return True
    else: return False
def CheckLogMatch(ColumnName, MatchValue):
    """Checks whether the match has already been logged; if not, logs it into the next available column for the match type."""
    global RowNum
    # use a plain column label (not a list) so .loc returns a scalar, not a Series
    if abs(df.loc[RowNum, ColumnName + '1'] - MatchValue) <= 0.00029:
        if abs(df.loc[RowNum, ColumnName + '2'] - MatchValue) <= 0.00029:
            if abs(df.loc[RowNum, ColumnName + '3'] - MatchValue) <= 0.00029:
                if abs(df.loc[RowNum, ColumnName + '4'] - MatchValue) <= 0.00029: pass
                else: df.loc[RowNum, ColumnName + '4'] = MatchValue
            else: df.loc[RowNum, ColumnName + '3'] = MatchValue
        else: df.loc[RowNum, ColumnName + '2'] = MatchValue
    else: df.loc[RowNum, ColumnName + '1'] = MatchValue
from statistics import median  # median comes from the statistics module

def Find234Matches():
    """Checks subsequent columns and compares them to ColNum to find whether there are 2, 3, or 4 matches; then enters the matches in the table."""
    global ColNum, RowNum, ColValues
    TwoStep = ColNum + 1
    while TwoStep <= 123:
        if FilterGroups(TwoStep):
            TwoStep += 1
            continue
        else:
            Step2Val = df.iat[RowNum, TwoStep]
            if abs(ColValues - Step2Val) <= 0.00029:
                occur2 = round(median([ColValues, Step2Val]), 4)
                AssignMinGroup(TwoStep)
                ThreeStep = TwoStep + 1
                while ThreeStep <= 123:
                    if FilterGroups(ThreeStep):
                        if ThreeStep == 123:
                            CheckLogMatch('TwoRBs', occur2)
                            return
                        else:
                            ThreeStep += 1
                            continue
                    else:
                        Step3Val = df.iat[RowNum, ThreeStep]
                        if abs(ColValues - Step3Val) <= 0.00029:
                            occur3 = round(median([ColValues, Step2Val, Step3Val]), 4)
                            AssignMinGroup(ThreeStep)
                            FourStep = ThreeStep + 1
                            while FourStep <= 123:
                                if FilterGroups(FourStep):
                                    if FourStep == 123:
                                        CheckLogMatch('ThrRBs', occur3)
                                        CheckLogMatch('TwoRBs', occur2)
                                        return
                                    else:
                                        FourStep += 1
                                        continue
                                else:
                                    Step4Val = df.iat[RowNum, FourStep]
                                    if abs(ColValues - Step4Val) <= 0.00029:
                                        occur4 = round(median([ColValues, Step2Val, Step3Val, Step4Val]), 4)
                                        CheckLogMatch('ForRBs', occur4)
                                        CheckLogMatch('ThrRBs', occur3)
                                        CheckLogMatch('TwoRBs', occur2)
                                        return
                                    else:
                                        if FourStep == 123:
                                            CheckLogMatch('ThrRBs', occur3)
                                            CheckLogMatch('TwoRBs', occur2)
                                            return
                                        else: FourStep += 1
                        else:
                            if ThreeStep == 123:
                                CheckLogMatch('TwoRBs', occur2)
                                return
                            else: ThreeStep += 1
            else: TwoStep += 1
#------------------------------------------------------------------------------------------------------------------------------------------------------------#
#------------------------------------------------------------------------------------------------------------------------------------------------------------#
#------------------------------------------------------------------------------------------------------------------------------------------------------------#
while RowNum <= ttlcount:
    ColNum = 14
    while ColNum <= 107:
        ColValues = df.iat[RowNum, ColNum]
        if pd.isnull(ColValues) or ColValues > df.iat[RowNum, 9] or ColValues < df.iat[RowNum, 10]:
            ColNum += 1
            continue
        else:
            if ColNum in (44, 45, 60, 61, 76, 77, 92, 93, 108, 109):
                ColNum += 1
                continue
            else:
                AssignMinGroup(ColNum)
                Find234Matches()
                ColNum += 1
    RowNum += 1
[ answer in progress - working with OP to understand expected output ]
Have a look at the output below and determine whether it meets your requirements.
(Inferred) Requirements:
Compare each 'High' column to 'Low', 'Open' and 'Close'
Calculate the absolute delta
If <= 0.0002, output the delta into a new column (e.g. 'HighLow')
If > 0.0002, output None
Sample Code:
import numpy as np
import pandas as pd
cols = [('High', i) for i in df.columns[1:]]
for a, b in cols:
    df[a+b] = np.where((df[a] - df[b]).abs() <= 0.0002, (df[a] - df[b]).abs(), None)
Output:
High Low Open Close HighLow HighOpen HighClose
0 1.11165 1.11128 1.11137 1.11165 None None 0
1 1.11165 1.11139 1.11148 1.11165 None 0.00017 0
2 1.11167 1.11138 1.11166 1.11138 None 1e-05 None
3 1.11165 1.11144 1.11165 1.11163 None 0 2e-05
4 1.11165 1.11149 1.11150 1.11165 0.00016 0.00015 0
5 1.11165 1.11150 1.11163 1.11163 0.00015 2e-05 2e-05
6 1.11165 1.11159 1.11159 1.11159 6e-05 6e-05 6e-05
I figured it out. The first formula is:
dataframe.iat[RowNumber, ColNumber]
The second one is:
dataframe['ColumnName'].values[RowNum]
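For reference, a quick sketch of pandas' scalar accessors on a toy frame (the data here is made up for illustration): .iat takes integer positions, .at takes labels, and both can also be used for scalar assignment.

```python
import pandas as pd

df = pd.DataFrame({'High': [1.11165, 1.11167], 'Low': [1.11128, 1.11138]})

v = df.iat[0, 1]           # row 0, column 1 by integer position
w = df.at[1, 'High']       # row label 1, column label 'High'
df.at[0, 'Low'] = 1.11130  # scalar assignment by label
```

These are faster than .loc/.iloc for single-cell access because they skip the fancy-indexing machinery.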
I am trying to apply a function to a column, but I am getting an error.
Name weight
Person1 30
Person2 70
My code is below
def classify(x):
    if 0 <= x < 20:
        y = "0 to 20%"
    if 20 < x < 40:
        y = "20 to 40%"
    if 40 < x < 60:
        y = "40 to 60%"
    if 60 < x < 80:
        y = "60 to 80%"
    if 80 < x <= 100:
        y = "80 to 100%"
    return y
df['Target'] = df['weight'].apply(lambda x: classify(x)) throws an UnboundLocalError for y.
If I use print instead of return, I am able to see the outputs.
Expected output
Name weight Target
Person1 30 20 to 40
Person2 70 60 to 80
Why not use cut?
df['Target']=pd.cut(df['weight'],[0,20,40,60,80,100])
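If you want the label strings from the expected output rather than interval objects, pass labels= to cut. A sketch with the question's two rows (note cut's default bins are right-inclusive, so exact boundary handling may need the right parameter):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Person1', 'Person2'], 'weight': [30, 70]})
df['Target'] = pd.cut(df['weight'], bins=[0, 20, 40, 60, 80, 100],
                      labels=['0 to 20%', '20 to 40%', '40 to 60%',
                              '60 to 80%', '80 to 100%'])
```

Unlike the classify function, this cannot hit an unassigned variable on boundary values; out-of-range weights simply become NaN.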
I am running into an issue creating a function that will recognize if a particular value in a column is between two values.
def bid(x):
    if df['tla'] < 85000:
        return 1
    elif (df['tla'] >= 85000) & (df['tla'] < 110000):
        return 2
    elif (df['tla'] >= 111000) & (df['tla'] < 126000):
        return 3
    elif (df['tla'] >= 126000) & (df['tla'] < 150000):
        return 4
    elif (df['tla'] >= 150000) & (df['tla'] < 175000):
        return 5
    elif (df['tla'] >= 175000) & (df['tla'] < 200000):
        return 6
    elif (df['tla'] >= 200000) & (df['tla'] < 250000):
        return 7
    elif (df['tla'] >= 250000) & (df['tla'] < 300000):
        return 8
    elif (df['tla'] >= 300000) & (df['tla'] < 375000):
        return 9
    elif (df['tla'] >= 375000) & (df['tla'] < 453100):
        return 10
    elif df['tla'] >= 453100:
        return 11
I apply that to my new column:
df['bid_bucket'] = df['bid_bucket'].apply(bid)
And I am getting this error back:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Anyone have any ideas?
Try the following, using numpy.select:
import numpy as np
values = [1,2,3,4,5,6,7,8,9,10,11]
cond = [df['tla'] < 85000, (df['tla'] >= 85000) & (df['tla'] < 110000), .... ]
df['bid_bucket'] = np.select(cond, values)
This can already be accomplished with pd.cut, defining the bin edges, and adding +1 to the labels to get your numbering to start at 1.
import pandas as pd
import numpy as np
df = pd.DataFrame({'tla': [7, 85000, 111000, 88888, 126000, 51515151]})
df['bid_bucket'] = pd.cut(df.tla, right=False,
bins=[-np.inf, 85000, 110000, 126000, 150000, 175000,
200000, 250000, 300000, 375000, 453100, np.inf],
labels=False)+1
Output: df
tla bid_bucket
0 7 1
1 85000 2
2 111000 3
3 88888 2
4 126000 4
5 51515151 11
You can simply use the np.digitize function to assign the ranges:
df['bid_bucket'] = np.digitize(df['tla'], np.arange(85000, 453100, 25000))
Example
a = np.random.randint(85000,400000,10)
#array([305628, 134122, 371486, 119856, 321423, 346906, 319321, 165714,360896, 206404])
bins=[-np.inf, 85000, 110000, 126000, 150000, 175000,
200000, 250000, 300000, 375000, 453100, np.inf]
np.digitize(a,bins)
Out:
array([9, 4, 9, 3, 9, 9, 9, 5, 9, 7])
To keep it in pandas: referencing df['tla'] inside your function refers to the whole Series instead of a single value, which leads to the ambiguity. You should pass in the specific value instead. Using apply with a lambda, your code could look like this:
df = pd.DataFrame({'tla':[10,123456,999999]})
def bid(x):
    if x < 85000:
        return 1
    elif (x >= 85000 and x < 110000):
        return 2
    elif (x >= 111000 and x < 126000):
        return 3
    elif (x >= 126000 and x < 150000):
        return 4
    elif (x >= 150000 and x < 175000):
        return 5
    elif (x >= 175000 and x < 200000):
        return 6
    elif (x >= 200000 and x < 250000):
        return 7
    elif (x >= 250000 and x < 300000):
        return 8
    elif (x >= 300000 and x < 375000):
        return 9
    elif (x >= 375000 and x < 453100):
        return 10
    elif x >= 453100:
        return 11
df['bid_bucket'] = df['tla'].apply(lambda x: bid(x))
df
You have two possibilities.
Either apply a function defined on a row to the pandas DataFrame row-wise:
def function_on_a_row(row):
    if row.tla > ...
    ...

df.apply(function_on_a_row, axis=1)
In which case, keep bid the way you defined it, but rename the parameter x to something like row (and replace df with row inside the function) to keep the parameter names meaningful, and use:
df['bid_bucket'] = df.apply(bid, axis=1)
Or apply a function defined on an element on a pandas Series.
def function_on_an_elt(element_of_series):
    if element_of_series > ...
    ...

df['new_column'] = df.my_column_of_interest.apply(function_on_an_elt)
In your case redefine bid accordingly.
Here you tried to mix both approaches, which does not work.
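A minimal side-by-side of the two approaches, using a tiny made-up tla frame and a simplified two-bucket rule for illustration:

```python
import pandas as pd

df = pd.DataFrame({'tla': [10, 90000, 500000]})

# element-wise: the function receives one scalar at a time from the Series
df['b1'] = df['tla'].apply(lambda x: 1 if x < 85000 else 2)

# row-wise: the function receives a whole row (a Series) via axis=1
df['b2'] = df.apply(lambda row: 1 if row['tla'] < 85000 else 2, axis=1)
```

Both give the same result here; the element-wise form is simpler when only one column matters, while the row-wise form lets the function read several columns at once.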