I am running into an issue creating a function that will recognize if a particular value in a column is between two values.
def bid(x):
    if df['tla'] < 85000:
        return 1
    elif (df['tla'] >= 85000) & (df['tla'] < 110000):
        return 2
    elif (df['tla'] >= 111000) & (df['tla'] < 126000):
        return 3
    elif (df['tla'] >= 126000) & (df['tla'] < 150000):
        return 4
    elif (df['tla'] >= 150000) & (df['tla'] < 175000):
        return 5
    elif (df['tla'] >= 175000) & (df['tla'] < 200000):
        return 6
    elif (df['tla'] >= 200000) & (df['tla'] < 250000):
        return 7
    elif (df['tla'] >= 250000) & (df['tla'] < 300000):
        return 8
    elif (df['tla'] >= 300000) & (df['tla'] < 375000):
        return 9
    elif (df['tla'] >= 375000) & (df['tla'] < 453100):
        return 10
    elif df['tla'] >= 453100:
        return 11
I apply that to my new column:
df['bid_bucket'] = df['bid_bucket'].apply(bid)
And I am getting this error back:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Anyone have any ideas?
Try the following using numpy.select:
import numpy as np
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
cond = [df['tla'] < 85000, (df['tla'] >= 85000) & (df['tla'] < 110000), .... ]
df['bid_bucket'] = np.select(cond, values)
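Spelling out the full condition list might look like the sketch below, built from the bucket edges in the question (note the original boundaries leave a small gap between 110000 and 111000, assumed here to be a typo and closed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'tla': [7, 85000, 111000, 88888, 51515151]})

# Left edges of buckets 2..11; values below the first edge fall in bucket 1.
edges = [85000, 110000, 126000, 150000, 175000,
         200000, 250000, 300000, 375000, 453100]

cond = [df['tla'] < edges[0]]
cond += [(df['tla'] >= lo) & (df['tla'] < hi) for lo, hi in zip(edges, edges[1:])]
cond += [df['tla'] >= edges[-1]]

values = list(range(1, len(cond) + 1))  # buckets 1..11
df['bid_bucket'] = np.select(cond, values)
print(df['bid_bucket'].tolist())  # [1, 2, 3, 2, 11]
```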
This can already be accomplished with pd.cut, defining the bin edges, and adding +1 to the labels to get your numbering to start at 1.
import pandas as pd
import numpy as np
df = pd.DataFrame({'tla': [7, 85000, 111000, 88888, 126000, 51515151]})
df['bid_bucket'] = pd.cut(df.tla, right=False,
                          bins=[-np.inf, 85000, 110000, 126000, 150000, 175000,
                                200000, 250000, 300000, 375000, 453100, np.inf],
                          labels=False) + 1
Output: df
tla bid_bucket
0 7 1
1 85000 2
2 111000 3
3 88888 2
4 126000 4
5 51515151 11
You can use the np.digitize function to assign the ranges:
df['bid_bucket'] = np.digitize(df['tla'], np.arange(85000, 453100, 25000))
Example
a = np.random.randint(85000,400000,10)
#array([305628, 134122, 371486, 119856, 321423, 346906, 319321, 165714,360896, 206404])
bins=[-np.inf, 85000, 110000, 126000, 150000, 175000,
200000, 250000, 300000, 375000, 453100, np.inf]
np.digitize(a,bins)
Out:
array([9, 4, 9, 3, 9, 9, 9, 5, 9, 7])
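Note that np.arange(85000, 453100, 25000) does not reproduce the uneven bucket edges from the question. To get the exact 1-11 numbering from the original buckets, you can pass the explicit edge list and digitize the tla column itself (a sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'tla': [7, 85000, 111000, 88888, 51515151]})

# np.digitize returns, for each value, the number of bin edges <= value
# (with the default right=False and increasing bins), which is already the
# 0-based bucket; add 1 for 1-based numbering.
bins = [85000, 110000, 126000, 150000, 175000,
        200000, 250000, 300000, 375000, 453100]
df['bid_bucket'] = np.digitize(df['tla'], bins) + 1
print(df['bid_bucket'].tolist())  # [1, 2, 3, 2, 11]
```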
To keep it in pandas: referencing df['tla'] inside your function refers to the whole Series rather than a single value, which is what causes the ambiguity. You should work with the individual value instead. Using lambda x, your code could look like this:
df = pd.DataFrame({'tla': [10, 123456, 999999]})

def bid(x):
    if x < 85000:
        return 1
    elif x >= 85000 and x < 110000:
        return 2
    elif x >= 111000 and x < 126000:
        return 3
    elif x >= 126000 and x < 150000:
        return 4
    elif x >= 150000 and x < 175000:
        return 5
    elif x >= 175000 and x < 200000:
        return 6
    elif x >= 200000 and x < 250000:
        return 7
    elif x >= 250000 and x < 300000:
        return 8
    elif x >= 300000 and x < 375000:
        return 9
    elif x >= 375000 and x < 453100:
        return 10
    elif x >= 453100:
        return 11

df['bid_bucket'] = df['tla'].apply(lambda x: bid(x))
df
You have two possibilities.
Either apply a function defined on a row on the pandas DataFrame in a row-wise way:
def function_on_a_row(row):
    if row.tla > ...
        ...

df.apply(function_on_a_row, axis=1)
In this case, keep bid the way you defined it, but rename the parameter x to something like "row" and replace df with "row" inside the function to keep the names meaningful, then use:
df.bid_bucket = df.apply(bid, axis=1)
Or apply a function defined on an element on a pandas Series.
def function_on_an_elt(element_of_series):
    if element_of_series > ...
        ...

df.new_column = df.my_column_of_interest.apply(function_on_an_elt)
In your case redefine bid accordingly.
Here you tried to mix both approaches, which does not work.
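A minimal, runnable illustration of the two approaches (with simplified buckets for brevity; the helper names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'tla': [10_000, 90_000, 500_000]})

# Element-wise: the function receives one scalar per call.
def bid(x):
    if x < 85000:
        return 1
    elif x < 453100:
        return 2
    return 3

df['bucket_elementwise'] = df['tla'].apply(bid)

# Row-wise: the function receives a whole row (a Series) per call.
def bid_row(row):
    return bid(row['tla'])

df['bucket_rowwise'] = df.apply(bid_row, axis=1)
print(df['bucket_elementwise'].tolist())  # [1, 2, 3]
```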
Related
I have a data frame like
x y w h
0 1593.826218 1293.189452 353.268389 74.493565
1 1680.089430 1956.536916 87.632469 42.567752
2 1362.421731 1908.648195 52.031778 42.567752
3 1599.303248 1385.419580 351.899131 78.040878
4 1500.716721 1121.144789 397.084623 46.115064
5 1513.040037 1186.770072 514.840753 86.909160
6 1387.068363 1804.002472 212.234885 44.341408
7 787.333657 379.756446 416.254225 70.946253
I want to select rows based on certain value ranges in x and y, take the values in all four columns x, y, w, h, perform addition or subtraction on those values, and replace them with the calculated values in that row.
I am doing something like
df.loc[(df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290), ['x', 'y', 'w','h']] = df['x'] - 20, df['y'] - 165, df['w'] + 26, df['h'] - 29
and getting error:
"Must have equal len keys and value when setting with an ndarray"
when I tried this
df.loc[(df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290), 'x'] = df['x'] - 20
it works but I want to perform operation on all four columns in one go and update the values.
My desired answer is it should select row 5 and my answer should be like
x y w h
5 1493.040037 1021.770072 540.840753 57.909160
Any help will be much appreciated.
Let us fix your code
m = (df['x'] >= 1000) & (df['x'] < 1800) \
& (df['y'] >= 1150) & (df['y'] < 1290)
df.loc[m] += [-20, -165, 26, -29]
x y w h
0 1593.826218 1293.189452 353.268389 74.493565
1 1680.089430 1956.536916 87.632469 42.567752
2 1362.421731 1908.648195 52.031778 42.567752
3 1599.303248 1385.419580 351.899131 78.040878
4 1500.716721 1121.144789 397.084623 46.115064
5 1493.040037 1021.770072 540.840753 57.909160 *** updated
6 1387.068363 1804.002472 212.234885 44.341408
7 787.333657 379.756446 416.254225 70.946253
With your approach, you can use pd.concat on the right-hand side:
df.loc[(df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290),
       ['x', 'y', 'w', 'h']] = pd.concat((df['x'] - 20, df['y'] - 165,
                                          df['w'] + 26, df['h'] - 29), axis=1)
x y w h
0 1593.826218 1293.189452 353.268389 74.493565
1 1680.089430 1956.536916 87.632469 42.567752
2 1362.421731 1908.648195 52.031778 42.567752
3 1599.303248 1385.419580 351.899131 78.040878
4 1500.716721 1121.144789 397.084623 46.115064
5 1493.040037 1021.770072 540.840753 57.909160
6 1387.068363 1804.002472 212.234885 44.341408
7 787.333657 379.756446 416.254225 70.946253
You have to assign with an array of the same shape. The easiest way is to use the original df:
m = (df['x'] >= 1000) & (df['x'] < 1800) & (df['y'] >= 1150) & (df['y'] < 1290)
df.loc[m] = df.assign(x=df["x"]-20, y=df["y"]-165, w=df['w']+26, h=df['h']-29)
print (df[m])
x y w h
5 1493.040037 1021.770072 540.840753 57.90916
I've got a function f(a, b) that takes two pandas dataframes and applies different formulas to the values, like this:
def f(a, b):
    if a > 0 and b > 0:
        return a + b
    elif a > 0 and b < 0:
        return a - b
    elif a < 0 and b > 0:
        return a * b
    elif a < 0 and b < 0:
        return a / b
    else:
        print('bad')

dfa = pd.DataFrame({'a': [1, 1]})
dfb = pd.DataFrame({'b': [2, 2]})
f(dfa, dfb)
The particular issue is that I need the current value being processed in the function in order to branch; however, using the and operator leads to the error below.
"The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()"
and using & is leading to a
"cannot compare [type] array with a scalar of type [bool]"
Edit:
Considering the answers, I'm starting to realize that my minimal example might not convey my intention very well.
def f(a, b):
    if a > 0 and b > 0:
        X = operationA()
    elif a > 0 and b < 0:
        X = operationB()
    elif a < 0 and b > 0:
        X = operationC()
    elif a < 0 and b < 0:
        X = operationD()
    else:
        print('bad')
    Y = operationY()
    return X, Y

# both dataframes are part of a training example's label: example = (a, b)
df_label_partA = pd.DataFrame({'a': [1, 1, -1, -1]})
df_label_partB = pd.DataFrame({'b': [1, -1, 1, -1]})
f(df_label_partA, df_label_partB)
The data frames can't be considered separately, as each is part of a list of labels (basically a tuple split up into two lists).
IIUC:
pd.concat([dfa,dfb], axis=1).apply(lambda x: f(*x), axis=1)
Outputs:
0 3
1 3
dtype: int64
You can try this
def f(a, b):
    if all(a > 0) and all(b > 0):
        return dfa.a + dfb.b
    elif all(a > 0) and all(b < 0):
        return dfa.a - dfb.b
    elif all(a < 0) and all(b > 0):
        return dfa.a * dfb.b
    elif all(a < 0) and all(b < 0):
        return dfa.a / dfb.b
    else:
        print('bad')

dfa = pd.DataFrame({'a': [1, 1]})
dfb = pd.DataFrame({'b': [2, 2]})
f(dfa, dfb)
Output:
0 3
1 3
dtype: int64
I'm searching for a function that returns the position of an element in a dataframe.
- there are duplicates among the values in the dataframe
- the dataframe is about 10 x 2000
- the function will be applied to the dataframe using applymap()
# initial dataframe
df = pandas.DataFrame({"R1": [8,2,3], "R2": [2,3,4], "R3": [-3,4,-1]})
Example:
get_position(2) is not clear as it could be either "R1" or "R2". I am
wondering if there is another way that python knows which Position the
element holds - possibly during the applymap() Operation
Edit:
df.rank(axis=1,pct=True)
EDIT2:
# initial dataframe
df_initial = pandas.DataFrame({"R1": [8,2,3], "R2": [2,3,4], "R3": [-3,4,-1]})
step1)
df_rank = df_initial.rank(axis=1,pct=True)
step2)
# Building groups based on the percentage of the respective value
def function103(x):
    if 0.0 <= x <= 0.1:
        P1.append(get_column_name1(x))
        return x
    elif 0.1 < x <= 0.2:
        P2.append(get_column_name1(x))
        return x
    elif 0.2 < x <= 0.3:
        P3.append(get_column_name1(x))
        return x
    elif 0.3 < x <= 0.4:
        P4.append(get_column_name1(x))
        return x
    elif 0.4 < x <= 0.5:
        P5.append(get_column_name1(x))
        return x
    elif 0.5 < x <= 0.6:
        P6.append(get_column_name1(x))
        return x
    elif 0.6 < x <= 0.7:
        P7.append(get_column_name1(x))
        return x
    elif 0.7 < x <= 0.8:
        P8.append(get_column_name1(x))
        return x
    elif 0.8 < x <= 0.9:
        P9.append(get_column_name1(x))
        return x
    elif 0.9 < x <= 1.0:
        P10.append(get_column_name1(x))
        return x
    else:
        return x
step3)
# trying to get the column name of the respective value
# my idea was to determine the position of each value and then write a function
def get_column_name1(x):
    # to return the value's column name
step4)
# apply the function
P1=[]
P2=[]
P3=[]
P4=[]
P5=[]
P6=[]
P7=[]
P8=[]
P9=[]
P10=[]
P11=[]
df_rank.applymap(function103).head()
If you need the index or column names by value in a DataFrame, use numpy.where for the positions and then select from the index or columns values converted to numpy arrays:
df = pd.DataFrame({"R1": [8,2,3], "R2": [2,3,4], "R3": [-3,4,-1]})
i, c = np.where(df == 2)
print (i, c)
[0 1] [1 0]
print (df.index.values[i])
[0 1]
print (df.columns.values[c])
['R2' 'R1']
EDIT:
i, c = np.where(df == 2)
df1 = df.rank(axis=1,pct=True)
print (df1)
R1 R2 R3
0 1.000000 0.666667 0.333333
1 0.333333 0.666667 1.000000
2 0.666667 1.000000 0.333333
print (df1.iloc[i, c])
R2 R1
0 0.666667 1.000000
1 0.666667 0.333333
print (df1.where(df == 2).dropna(how='all').dropna(how='all', axis=1))
R1 R2
0 NaN 0.666667
1 0.333333 NaN
Or:
out = df1.stack()[df.stack() == 2].rename_axis(('idx','cols')).reset_index(name='val')
print (out)
idx cols val
0 0 R2 0.666667
1 1 R1 0.333333
EDIT:
Solution for your function - you need to iterate over a one-column DataFrame created by reshaping, and extract Series.name, which is the same as the column name:
def get_column_name1(x):
    return x.name
P1=[]
P2=[]
P3=[]
P4=[]
P5=[]
P6=[]
P7=[]
P8=[]
P9=[]
P10=[]
P11=[]
def function103(x):
    if 0.0 <= x[0] <= 0.1:
        P1.append(get_column_name1(x))
        return x
    elif 0.1 < x[0] <= 0.2:
        P2.append(get_column_name1(x))
        return x
    elif 0.2 < x[0] <= 0.3:
        P3.append(get_column_name1(x))
        return x
    elif 0.3 < x[0] <= 0.4:
        P4.append(get_column_name1(x))
        return x
    elif 0.4 < x[0] <= 0.5:
        P5.append(get_column_name1(x))
        return x
    elif 0.5 < x[0] <= 0.6:
        P6.append(get_column_name1(x))
        return x
    elif 0.6 < x[0] <= 0.7:
        P7.append(get_column_name1(x))
        return x
    elif 0.7 < x[0] <= 0.8:
        P8.append(get_column_name1(x))
        return x
    elif 0.8 < x[0] <= 0.9:
        P9.append(get_column_name1(x))
        return x
    elif 0.9 < x[0] <= 1.0:
        P10.append(get_column_name1(x))
        return x
    else:
        return x
a = df_rank.stack().reset_index(level=0, drop=True).to_frame().apply(function103, axis=1)
print (P4)
['R3', 'R1', 'R3']
I am attempting to write a function and apply it to multiple fields in a pandas dataframe.
The function takes column colA1, and assigns a value to a new column, colB2 based on conditional statements.
This function works if a single column is given, e.g. colA1, but how could I write it to iterate through a list of columns, returning a corresponding number of new columns?
The following function works on a single column:
dict = {'colA1': [2, 6, 8, 28, 5],
        'colA2': [38, 6, 14, 63, 3],
        'colA3': [90, 40, 80, 98, 3]}
df = pd.DataFrame(dict)

def function(x):
    if x <= 10:
        return '<= 10'
    elif x > 10:
        return '> 10'

df['colB1'] = df['colA1'].apply(function)
df['colB1']
This returns:
0 <= 10
1 <= 10
2 <= 10
3 > 10
4 <= 10
I attempted to apply it to multiple columns as shown here:
Update Multiple Columns using Pandas Apply Function
df[['colB1', 'colB2', 'colB3']]=df[['colA1', 'colA2', 'colA3']].apply(function)
But this returns:
ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index colA1')
If this is actually what you want to do, a faster alternative is np.select():
cond = [df <= 10, df > 10]
choice = ['<= 10', '> 10']
df[:] = np.select(cond, choice)
print(df)
colA1 colA2 colA3
0 <= 10 > 10 > 10
1 <= 10 <= 10 > 10
2 <= 10 > 10 > 10
3 > 10 > 10 > 10
4 <= 10 <= 10 <= 10
You can also try with df.applymap() for your function:
df[['colA1','colA2','colA3']].applymap(function)
#df.applymap(function)
colA1 colA2 colA3
0 <= 10 > 10 > 10
1 <= 10 <= 10 > 10
2 <= 10 > 10 > 10
3 > 10 > 10 > 10
4 <= 10 <= 10 <= 10
This should do it:
df.apply(lambda x: pd.Series([function(x['colA1']),
                              function(x['colA2']),
                              function(x['colA3'])]),
         axis=1).rename({0: 'colA1', 1: 'colA2', 2: 'colA3'}, axis=1)
Output
colA1 colA2 colA3
0 <= 10 > 10 > 10
1 <= 10 <= 10 > 10
2 <= 10 > 10 > 10
3 > 10 > 10 > 10
4 <= 10 <= 10 <= 10
I have a dataframe like below.
import pandas as pd
import numpy as np
raw_data = {'student': ['A', 'B', 'C', 'D', 'E'],
            'score': [100, 96, 80, 105, 156],
            'height': [7, 4, 9, 5, 3],
            'trigger1': [84, 95, 15, 78, 16],
            'trigger2': [99, 110, 30, 93, 31],
            'trigger3': [114, 125, 45, 108, 46]}
df2 = pd.DataFrame(raw_data, columns=['student', 'score', 'height',
                                      'trigger1', 'trigger2', 'trigger3'])
print(df2)
I need to derive a Flag column based on multiple conditions:
I need to compare the score and height columns with the trigger1-trigger3 columns.
Flag column:
- if score greater than or equal to trigger1 and height less than 8, then Red
- if score greater than or equal to trigger2 and height less than 8, then Yellow
- if score greater than or equal to trigger3 and height less than 8, then Orange
- if height greater than 8, leave it blank
How do I write if-else conditions in a pandas dataframe and derive columns?
Expected Output
student score height trigger1 trigger2 trigger3 Flag
0 A 100 7 84 99 114 Yellow
1 B 96 4 95 110 125 Red
2 C 80 9 15 30 45 NaN
3 D 105 5 78 93 108 Yellow
4 E 156 3 16 31 46 Orange
For the other column Text1 in my original question, I tried the following, but the integer columns are not being converted to strings when concatenating with astype(str). Any other approach?
def text_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger1'].astype(str) + " and less than height 5"
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger2'].astype(str) + " and less than height 5"
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['student'] + " score " + df['score'].astype(str) + " greater than " + df['trigger3'].astype(str) + " and less than height 5"
    elif (df['height'] > 8):
        return np.nan
You need a chained comparison using the upper and lower bounds:
def flag_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return 'Red'
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return 'Yellow'
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return 'Orange'
    elif (df['height'] > 8):
        return np.nan
df2['Flag'] = df2.apply(flag_df, axis = 1)
student score height trigger1 trigger2 trigger3 Flag
0 A 100 7 84 99 114 Yellow
1 B 96 4 95 110 125 Red
2 C 80 9 15 30 45 NaN
3 D 105 5 78 93 108 Yellow
4 E 156 3 16 31 46 Orange
Note: You can do this with a very nested np.where but I prefer to apply a function for multiple if-else
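For comparison, the "very nested np.where" version might look like this sketch, which checks the highest trigger first so the strictest match wins:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'score': [100, 96, 80, 105, 156],
                    'height': [7, 4, 9, 5, 3],
                    'trigger1': [84, 95, 15, 78, 16],
                    'trigger2': [99, 110, 30, 93, 31],
                    'trigger3': [114, 125, 45, 108, 46]})

# Only rows with height < 8 get a flag; the rest fall through to None.
short = df2['height'] < 8
df2['Flag'] = np.where(short & (df2['score'] >= df2['trigger3']), 'Orange',
              np.where(short & (df2['score'] >= df2['trigger2']), 'Yellow',
              np.where(short & (df2['score'] >= df2['trigger1']), 'Red', None)))
print(df2['Flag'].tolist())  # ['Yellow', 'Red', None, 'Yellow', 'Orange']
```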
Edit: answering #Cecilia's questions.
What if the returned object is not strings but some calculation - for example, for the first condition, we want to return df['height']*2?
Not sure what you tried, but you can return a derived value instead of a string:
def flag_df(df):
    if (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['height']*2
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['height']*3
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['height']*4
    elif (df['height'] > 8):
        return np.nan
What if there are NaN values in some columns and I want to use df['xxx'] is None as a condition? The code doesn't seem to work.
Again, not sure what code you tried, but using pandas isnull does the trick:
def flag_df(df):
    if pd.isnull(df['height']):
        return df['height']
    elif (df['trigger1'] <= df['score'] < df['trigger2']) and (df['height'] < 8):
        return df['height']*2
    elif (df['trigger2'] <= df['score'] < df['trigger3']) and (df['height'] < 8):
        return df['height']*3
    elif (df['trigger3'] <= df['score']) and (df['height'] < 8):
        return df['height']*4
    elif (df['height'] > 8):
        return np.nan
Here is a way to use numpy.select() for doing this with neat code, scalable and faster:
conditions = [
    (df2['trigger1'] <= df2['score']) & (df2['score'] < df2['trigger2']) & (df2['height'] < 8),
    (df2['trigger2'] <= df2['score']) & (df2['score'] < df2['trigger3']) & (df2['height'] < 8),
    (df2['trigger3'] <= df2['score']) & (df2['height'] < 8),
    (df2['height'] > 8)
]
choices = ['Red', 'Yellow', 'Orange', np.nan]
df2['Flag1'] = np.select(conditions, choices, default=np.nan)
You can also use apply with a custom function on axis 1, like this:
def color_selector(x):
    if (x['trigger1'] <= x['score'] < x['trigger2']) and (x['height'] < 8):
        return 'Red'
    elif (x['trigger2'] <= x['score'] < x['trigger3']) and (x['height'] < 8):
        return 'Yellow'
    elif (x['trigger3'] <= x['score']) and (x['height'] < 8):
        return 'Orange'
    elif (x['height'] > 8):
        return ''

df2 = df2.assign(flag=df2.apply(color_selector, axis=1))
You will get something like this: