I have a data frame that contains some daily, monthly and weekly statistics along with lost weight.
I would like to create a boolean column that indicates whether the lost weight was bigger or lower than a threshold. I tried using an if statement and np.where:
if df_prod_stats.loc[df_prod_stats['frequency'] == "daily"]:
    df_prod_stats['target_met'] = np.where(df_prod_stats['loss_weight'] < 0.5, 1, 0)
elif df_prod_stats.loc[df_prod_stats['frequency'] == "monthly"]:
    df_prod_stats['target_met'] = np.where(df_prod_stats['loss_weight'] < 15, 1, 0)
else:
    df_prod_stats['target_met'] = np.where(df_prod_stats['loss_weight'] < 3.5, 1, 0)
But I get an error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I think you will need to do this a different way. You seem to be trying to go through each row, checking whether it is daily/monthly/weekly, and comparing the loss weight accordingly; however, that is not what your code actually does. In if df_prod_stats.loc[...], the .loc returns a subset of the data frame, which cannot be evaluated as a single True or False (hence the ambiguous-truth-value error), and the next line that fills in the new column would apply to the entire original data frame anyway, not just the rows that matched the .loc statement. You can achieve what (I think) you want using several .loc statements as below.
First, create the target_met column and set it to 0:
df_prod_stats['target_met'] = 0
Then use .loc to filter on your first if condition (frequency is daily and loss weight is less than 0.5), and set target_met to 1:
df_prod_stats.loc[(df_prod_stats['frequency'] == 'daily')
& (df_prod_stats['loss_weight'] < 0.5), 'target_met'] = 1
elif condition (frequency is monthly, loss weight is less than 15):
df_prod_stats.loc[(df_prod_stats['frequency'] == 'monthly')
& (df_prod_stats['loss_weight'] < 15), 'target_met'] = 1
else condition (frequency is neither daily nor monthly, and loss weight is less than 3.5):
df_prod_stats.loc[~(df_prod_stats['frequency'].isin(['daily', 'monthly']))
& (df_prod_stats['loss_weight'] < 3.5), 'target_met'] = 1
Put together you get:
df_prod_stats['target_met'] = 0
df_prod_stats.loc[(df_prod_stats['frequency'] == 'daily')
& (df_prod_stats['loss_weight'] < 0.5), 'target_met'] = 1
df_prod_stats.loc[(df_prod_stats['frequency'] == 'monthly')
& (df_prod_stats['loss_weight'] < 15), 'target_met'] = 1
df_prod_stats.loc[~(df_prod_stats['frequency'].isin(['daily', 'monthly']))
& (df_prod_stats['loss_weight'] < 3.5), 'target_met'] = 1
Output:
frequency loss_weight target_met
0 daily -0.42 1
1 daily -0.35 1
2 daily -0.67 1
3 daily -0.11 1
4 daily -0.31 1
I hope that is what you're trying to achieve.
I found out it's also possible to use a simple set of conditions in np.where, as follows:
df_prod_stats['target_met'] = np.where((df_prod_stats['loss_weight'] < 0.5) & (df_prod_stats['frequency'] == "daily")
                                       | (df_prod_stats['loss_weight'] < 15.0) & (df_prod_stats['frequency'] == "monthly")
                                       | (df_prod_stats['loss_weight'] < 3.5) & (df_prod_stats['frequency'] == "weekly"), 1, 0)
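For reference, the same three branches can also be written with np.select, which keeps each condition on its own line and makes the default explicit. A minimal sketch, assuming the same column names and a small made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the same columns as the question
df_prod_stats = pd.DataFrame({
    "frequency": ["daily", "monthly", "weekly", "daily"],
    "loss_weight": [0.3, 20.0, 2.0, 0.9],
})

conditions = [
    (df_prod_stats["frequency"] == "daily") & (df_prod_stats["loss_weight"] < 0.5),
    (df_prod_stats["frequency"] == "monthly") & (df_prod_stats["loss_weight"] < 15.0),
    (df_prod_stats["frequency"] == "weekly") & (df_prod_stats["loss_weight"] < 3.5),
]
# Each condition maps to 1; anything unmatched falls through to 0
df_prod_stats["target_met"] = np.select(conditions, [1, 1, 1], default=0)
print(df_prod_stats["target_met"].tolist())  # [1, 0, 1, 0]
```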
I am implementing my own function for calculating taxes. My intention is to solve this problem with only one function. Below you can see the data:
df = pd.DataFrame({"id_n":["1","2","3","4","5"],
"sales1":[0,115000,440000,500000,740000],
"sales2":[0,115000,460000,520000,760000],
"tax":[0,8050,57500,69500,69500]
})
Now I want to introduce a tax function that needs to give the same results as the tax column. Below you can see an estimate of that function:
# Thresholds
min_threeshold = 500000
max_threeshold = 1020000
# Maximum taxes
max_cap = 69500
# Rates
rate_1 = 0.035
rate_2 = 0.1
# Total sales
total_sale = df['sales1'] + df['sales2']
tax = df['tax']
# Function for estimation
def tax_fun(total_sale, tax, min_threeshold, max_threeshold, max_cap, rate_1, rate_2):
    if (total_sale > 0 and tax == 0):  # <---- This line of code
        calc_tax = 0
    elif (total_sale < min_threeshold):
        calc_tax = total_sale * rate_1
    elif (total_sale >= min_threeshold) & (total_sale <= max_threeshold):
        calc_tax = total_sale * rate_2
    elif (total_sale > max_threeshold):
        calc_tax = max_cap
    return calc_tax
The next step is executing the above function; I want to have all of these results in one column.
df['new_tax']=tax_fun(total_sale,tax,min_threeshold,max_threeshold,max_cap,rate_1,rate_2)
After executing this command, I received this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So the error probably happens in this line, and for that reason the function cannot be executed: (total_sale > 0 and tax == 0)
Can anybody help me solve this problem?
The error occurs because you are comparing a series (collection of values) with a single integer.
Your variable total_sale has the following form:
0 0
1 230000
2 900000
3 1020000
4 1500000
dtype: int64
You cannot compare this series with zero as a whole. You must either compare each individual element with zero (0, 230000, 900000, etc.) or check whether any/all entries satisfy your condition.
I think you want something like this:
def tax_fun(total_sale, tax, min_threeshold, max_threeshold, max_cap, rate_1, rate_2):
    calc_tax = np.empty(shape=total_sale.shape)
    calc_tax[(total_sale > 0) & (tax == 0)] = 0
    calc_tax[total_sale < min_threeshold] = total_sale[total_sale < min_threeshold] * rate_1
    calc_tax[(total_sale >= min_threeshold) & (total_sale <= max_threeshold)] = total_sale[(total_sale >= min_threeshold) & (total_sale <= max_threeshold)] * rate_2
    calc_tax[total_sale > max_threeshold] = max_cap
    return calc_tax
df['new_tax'] = tax_fun(total_sale,tax,min_threeshold,max_threeshold,max_cap,rate_1,rate_2)
print(df)
----------------------------------------------------
id_n sales1 sales2 tax new_tax
0 1 0 0 0 0.0
1 2 115000 115000 8050 8050.0
2 3 440000 460000 57500 90000.0
3 4 500000 520000 69500 102000.0
4 5 740000 760000 69500 69500.0
----------------------------------------------------
I would use indexing instead of if and else conditions.
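For completeness, the same branching can also be written with np.select, so each condition/value pair is stated exactly once. A sketch reusing the question's thresholds; note that np.select evaluates conditions top to bottom and the first match wins, just as the order of the masked assignments above determines which one sticks:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sales1": [0, 115000, 440000, 500000, 740000],
                   "sales2": [0, 115000, 460000, 520000, 760000]})
total_sale = df["sales1"] + df["sales2"]

min_threeshold, max_threeshold = 500000, 1020000
max_cap, rate_1, rate_2 = 69500, 0.035, 0.1

# Conditions are checked in order; the first True one supplies the value
df["new_tax"] = np.select(
    [total_sale < min_threeshold,
     total_sale <= max_threeshold,
     total_sale > max_threeshold],
    [total_sale * rate_1,
     total_sale * rate_2,
     max_cap],
)
print(df["new_tax"])  # matches the new_tax column shown above
```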
I'm relatively new to Python and hoping someone can help point me in the right direction.
For context, I want to create a new column in a Pandas dataframe that assigns a score of linear integer values based on values in an existing column falling within certain ranges.
There is a lower and upper bound, say 0 and 0.75. Being below or above those respectively will yield the lowest / highest value.
Written manually with relatively few conditions it looks like this using np.select():
d = {'col1': [-1, 0, .1, .6, .8],'col2': [-4,-0.02, 0.07, 1, 2]}
df = pd.DataFrame(data=d)
conditions = [
(df['col1'] < 0),
(df['col1'] >= 0) & (df['col1'] <= .25),
(df['col1'] >= .25) & (df['col1'] <= .5),
(df['col1'] >= .5) & (df['col1'] <= .75),
(df['col1'] >= .75)
]
values = [0, 1, 2, 3, 4]
df['col3'] = np.select(conditions,values,default=None)
I would like to be able to dynamically divide the mid-range between bounds into many more conditions, which is easy enough using np.linspace.
Where I'm having trouble is in assigning the values. I have tried to do this using pd.cut and operating on a list to feed into np.select. This is the closest I have come with these:
d = {'col1': [-1, 0, .1, .6, .8],'col2': [-4,-0.02, 0.07, 1, 2]}
df = pd.DataFrame(data=d)
conditions_no = 9 # Choose number of conditions to divide the mid-range
choices = [n for n in range(1, conditions_no + 2)] # Assign values to apply starting from 1
mid_range = np.linspace(0,.75,conditions_no) # Divide mid-range by number of conditions
mid_range = np.append(mid_range[0],mid_range) # Repeat lower bound at start for < condition
cols = ['df["col1"]' for c in range(0, conditions_no + 1)] # Generate list of column references
conditions = list(zip(cols,mid_range)) # List with range as values, df as key
conditions = [f'{k} >= {v}' for k, v in conditions] # Combine column references and range values into strings
conditions[0] = conditions[0].replace('>=','<') # Change first condition to less than lower bound
conditions = conditions[::-1] # Reverse values and assigned choices to check > highest value first
choices = choices[::-1]
Here the conditions are a list of strings rather than code:
['df["col1"] >= 0.75',
'df["col1"] >= 0.65625',
'df["col1"] >= 0.5625',
'df["col1"] >= 0.46875',
'df["col1"] >= 0.375',
'df["col1"] >= 0.28125',
'df["col1"] >= 0.1875',
'df["col1"] >= 0.09375',
'df["col1"] >= 0.0',
'df["col1"] < 0.0']
So they understandably throw an error:
df['col3'] = np.select(conditions, choices, default=None)
# TypeError: invalid entry 0 in condlist: should be boolean ndarray
I understand that eval() might be able to help here, but haven't been able to find a way to get that to run with np.select. I've also read that it's best to try and avoid using eval().
This is the effort so far using pd.cut:
conditions = 9
choices = [n for n in range(1, conditions + 2)]
mid_range = np.linspace(0,.75,conditions)
mid_range = np.append(-float("inf"),mid_range)
mid_range = np.append(mid_range,float("inf"))
df['col3'] = pd.cut(df['col1'], mid_range, labels=choices)
df['col4'] = pd.cut(df['col2'], mid_range, labels=choices)
This works, but assigns a categorical that I then can't operate on as needed:
df['col3'] + df['col4']
# TypeError: unsupported operand type(s) for +: 'Categorical' and 'Categorical'
After everything I've looked up, I keep coming back to np.select as likely being the best solution here. However, I can't figure out how to dynamically create the conditions - are any of these efforts along the right lines or is there a better approach I should look at?
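One way to sidestep the Categorical issue in the pd.cut attempt (not from the original post) is to pass labels=False, which returns plain integer bin codes you can add and compare; add 1 if the scores should start at 1, as in the question's choices list. A sketch using the question's own data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [-1, 0, .1, .6, .8], "col2": [-4, -0.02, 0.07, 1, 2]})

conditions_no = 9
edges = np.linspace(0, .75, conditions_no)   # divide the mid-range
edges = np.append(-np.inf, edges)            # catch-all below the lower bound
edges = np.append(edges, np.inf)             # catch-all above the upper bound

# labels=False yields integer codes instead of a Categorical,
# so the resulting columns support arithmetic directly
df["col3"] = pd.cut(df["col1"], edges, labels=False) + 1
df["col4"] = pd.cut(df["col2"], edges, labels=False) + 1
print((df["col3"] + df["col4"]).tolist())  # [2, 2, 5, 18, 20]
```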
I want to create a directional pandas pct_change function, so that a negative number in a prior row, followed by a larger negative number in a subsequent row, results in a negative pct_change (instead of a positive one).
I have created the following function:
```
def pct_change_directional(x):
    if x.shift() > 0.0:
        return x.pct_change()  # compute normally if prior number > 0
    elif x.shift() < 0.0 and x > x.shift():
        return abs(x.pct_change())  # make positive
    elif x.shift() < 0.0 and x < x.shift():
        return -x.pct_change()  # make negative
    else:
        return 0
```
However when I apply it to my pandas dataframe column like so:
df['col_pct_change'] = pct_change_directional(df[col1])
I get the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
any ideas how I can make this work?
Thanks!
As @Wen said, you can use multiple where calls, or likewise np.select:
mask1 = df[col].shift() > 0.0
mask2 = (df[col].shift() < 0.0) & (df[col] > df[col].shift())
mask3 = (df[col].shift() < 0.0) & (df[col] < df[col].shift())
np.select([mask1, mask2, mask3],
          [df[col].pct_change(), abs(df[col].pct_change()),
           -df[col].pct_change()],
          0)
You can find much more detail about select and where here.
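The fragment above assumes df and col already exist; filling them in with a hypothetical series makes it runnable end to end:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [10.0, 12.0, -5.0, -10.0, -2.0]})
col = "col1"

prev = df[col].shift()
mask1 = prev > 0.0                          # prior value positive: normal pct_change
mask2 = (prev < 0.0) & (df[col] > prev)     # recovering from negative: force positive
mask3 = (prev < 0.0) & (df[col] < prev)     # going more negative: force negative

df["col_pct_change"] = np.select(
    [mask1, mask2, mask3],
    [df[col].pct_change(), df[col].pct_change().abs(), -df[col].pct_change()],
    0,
)
# Row -5 -> -10 now yields -1.0 (more negative), and -10 -> -2 yields +0.8
print(df["col_pct_change"].tolist())
```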
I'm trying to filter out certain rows in my dataframe, allowing only two combinations of values for two columns 'A' and 'B': either 'A' > 0 and 'B' > 0, OR 'A' < 0 and 'B' < 0. Any other combination I want to filter out.
I tried the following
df = df.loc[(df['A'] > 0 & df['B'] > 0) or (df['A'] < 0 & df['B'] < 0)]
which gives me an error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know this is probably a very trivial question, but I honestly couldn't find any solution and I can't figure out what the problem with my approach is.
You need some parentheses and to format for pandas (and/or become &/|):
df = df[((df['A'] > 0) & (df['B'] > 0)) | ((df['A'] < 0) & (df['B'] < 0))]
Keep in mind what this is doing - you're just building a giant list like [True, False, True, True] and passing it into the df index, telling it to keep each row depending on whether it gets a True or a False in the corresponding position.
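As a quick sanity check of that mental model, with a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -1, 2, -3], "B": [2, -2, -1, 4]})

# Both positive OR both negative; mixed-sign rows come out False
mask = ((df["A"] > 0) & (df["B"] > 0)) | ((df["A"] < 0) & (df["B"] < 0))
print(mask.tolist())            # [True, True, False, False]
print(df[mask].index.tolist())  # rows kept: [0, 1]
```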
I recently learned about pandas and was happy to see its analytics functionality. I am trying to convert Excel array functions into the Pandas equivalent to automate spreadsheets that I have created for the creation of performance attribution reports. In this example, I created a new column in Excel based on conditions within other columns:
={SUMIFS($F$10:$F$4518,$A$10:$A$4518,$C$4,$B$10:$B$4518,0,$C$10:$C$4518," ",$D$10:$D$4518,$D10,$E$10:$E$4518,$E10)}
The formula is summing up the values in the "F" array (security weights) based on certain conditions. "A" array (portfolio ID) is a certain number, "B" array (security id) is zero, "C" array (group description) is " ", "D" array (start date) is the date of the row that I am on, and "E" array (end date) is the date of the row that I am on.
In Pandas, I am using the DataFrame. Creating a new column on a dataframe with the first three conditions is straightforward, but I am having difficulty with the last two conditions.
reportAggregateDF['PORT_WEIGHT'] = reportAggregateDF['SEC_WEIGHT_RATE'][
    (reportAggregateDF['PORT_ID'] == portID) &
    (reportAggregateDF['SEC_ID'] == 0) &
    (reportAggregateDF['GROUP_LIST'] == " ") &
    (reportAggregateDF['START_DATE'] == reportAggregateDF['START_DATE'].ix[:]) &
    (reportAggregateDF['END_DATE'] == reportAggregateDF['END_DATE'].ix[:])].sum()
Obviously the .ix[:] in the last two conditions is not doing anything for me, but is there a way to make the sum conditional on the row that I am on without looping? My goal is to not do any loops, but instead use purely vector operations.
You want to use the apply function and a lambda:
>> df
A B C D E
0 mitfx 0 200 300 0.25
1 gs 1 150 320 0.35
2 duk 1 5 2 0.45
3 bmo 1 145 65 0.65
Let's say I want to sum column C times E but only if column B == 1 and D is greater than 5:
df['matches'] = df.apply(lambda x: x['C'] * x['E'] if x['B'] == 1 and x['D'] > 5 else 0, axis=1)
df.matches.sum()
It might be cleaner to split this into two steps:
df_subset = df[(df.B == 1) & (df.D > 5)]
df_subset.apply(lambda x: x.C * x.E, axis=1).sum()
or to simply use multiplication for speed:
df_subset = df[(df.B == 1) & (df.D > 5)]
print(sum(df_subset.C * df_subset.E))
You are absolutely right to want to do this problem without loops.
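One loop-free way to express the SUMIFS (not from the original answers) is to zero out the rows that fail the fixed conditions, then sum per (START_DATE, END_DATE) pair with groupby and transform, which broadcasts each group's sum back onto every row of that group. A sketch with made-up data and the question's column names:

```python
import pandas as pd

# Hypothetical data shaped like the question's report frame
reportAggregateDF = pd.DataFrame({
    "PORT_ID": [7, 7, 7, 7],
    "SEC_ID": [0, 0, 5, 0],
    "GROUP_LIST": [" ", " ", " ", "x"],
    "START_DATE": ["2020-01", "2020-01", "2020-01", "2020-02"],
    "END_DATE": ["2020-02", "2020-02", "2020-02", "2020-03"],
    "SEC_WEIGHT_RATE": [0.25, 0.35, 0.10, 0.40],
})
portID = 7

# Fixed conditions (the SUMIFS criteria that don't vary by row)
eligible = ((reportAggregateDF["PORT_ID"] == portID)
            & (reportAggregateDF["SEC_ID"] == 0)
            & (reportAggregateDF["GROUP_LIST"] == " "))

# Ineligible rows contribute 0; transform("sum") then gives every row
# the total for its own (START_DATE, END_DATE) pair, no loop needed
weights = reportAggregateDF["SEC_WEIGHT_RATE"].where(eligible, 0.0)
reportAggregateDF["PORT_WEIGHT"] = weights.groupby(
    [reportAggregateDF["START_DATE"], reportAggregateDF["END_DATE"]]
).transform("sum")
```

Here the first three rows share a date pair, so each receives 0.25 + 0.35 (the third row is excluded by SEC_ID), while the last row's group sums to 0.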
I'm sure there is a better way, but this did it in a loop:
for idx, eachRecord in reportAggregateDF.T.iteritems():
    reportAggregateDF['PORT_WEIGHT'].ix[idx] = reportAggregateDF['SEC_WEIGHT_RATE'][
        (reportAggregateDF['PORT_ID'] == portID) &
        (reportAggregateDF['SEC_ID'] == 0) &
        (reportAggregateDF['GROUP_LIST'] == " ") &
        (reportAggregateDF['START_DATE'] == reportAggregateDF['START_DATE'].ix[idx]) &
        (reportAggregateDF['END_DATE'] == reportAggregateDF['END_DATE'].ix[idx])].sum()