Operators and multiple conditions in pandas - python

I am implementing my own function for calculating taxes. My intention is to solve this problem with a single function. Below you can see the data:
df = pd.DataFrame({"id_n": ["1", "2", "3", "4", "5"],
                   "sales1": [0, 115000, 440000, 500000, 740000],
                   "sales2": [0, 115000, 460000, 520000, 760000],
                   "tax": [0, 8050, 57500, 69500, 69500]})
Now I want to introduce a tax function that produces the same results as the tax column. Below you can see a first attempt at that function:
# Thresholds
min_threeshold = 500000
max_threeshold = 1020000
# Maximum taxes
max_cap = 69500
# Rates
rate_1 = 0.035
rate_2 = 0.1
# Total sales
total_sale = df['sales1'] + df['sales2']
tax = df['tax']
# Function for estimation
def tax_fun(total_sale, tax, min_threeshold, max_threeshold, max_cap, rate_1, rate_2):
    if (total_sale > 0 and tax == 0):  # <---- This line of code
        calc_tax = 0
    elif (total_sale < min_threeshold):
        calc_tax = total_sale * rate_1
    elif (total_sale >= min_threeshold) & (total_sale <= max_threeshold):
        calc_tax = total_sale * rate_2
    elif (total_sale > max_threeshold):
        calc_tax = max_cap
    return calc_tax
The next step is to execute the function above; I want all of the results in one column.
df['new_tax'] = tax_fun(total_sale, tax, min_threeshold, max_threeshold, max_cap, rate_1, rate_2)
After executing this command, I received this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
So the error probably happens in the line (total_sale > 0 and tax == 0), and for that reason the function cannot be executed.
Can anybody help me solve this problem?

The error occurs because you are comparing a series (collection of values) with a single integer.
Your variable total_sale has the following form:
0 0
1 230000
2 900000
3 1020000
4 1500000
dtype: int64
You cannot compare this series with zero. You must either compare each single element with zero (0, 230000, 900000, etc.) or check whether any entry satisfies your condition.
I think you want something like this:
def tax_fun(total_sale, tax, min_threeshold, max_threeshold, max_cap, rate_1, rate_2):
    calc_tax = np.empty(shape=total_sale.shape)
    calc_tax[(total_sale > 0) & (tax == 0)] = 0
    low = total_sale < min_threeshold
    calc_tax[low] = total_sale[low] * rate_1
    mid = (total_sale >= min_threeshold) & (total_sale <= max_threeshold)
    calc_tax[mid] = total_sale[mid] * rate_2
    calc_tax[total_sale > max_threeshold] = max_cap
    return calc_tax

df['new_tax'] = tax_fun(total_sale, tax, min_threeshold, max_threeshold, max_cap, rate_1, rate_2)
print(df)
----------------------------------------------------
id_n sales1 sales2 tax new_tax
0 1 0 0 0 0.0
1 2 115000 115000 8050 8050.0
2 3 440000 460000 57500 90000.0
3 4 500000 520000 69500 102000.0
4 5 740000 760000 69500 69500.0
----------------------------------------------------
I would use boolean indexing instead of if and else conditions.
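The same masks can also be expressed with np.select, which applies the first matching condition per element, much like an if/elif chain. A minimal sketch using the question's data and thresholds (the column values are taken from the question; nothing else is assumed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id_n": ["1", "2", "3", "4", "5"],
                   "sales1": [0, 115000, 440000, 500000, 740000],
                   "sales2": [0, 115000, 460000, 520000, 760000],
                   "tax": [0, 8050, 57500, 69500, 69500]})

min_threeshold, max_threeshold = 500000, 1020000
max_cap, rate_1, rate_2 = 69500, 0.035, 0.1

total_sale = df["sales1"] + df["sales2"]
conditions = [
    (total_sale > 0) & (df["tax"] == 0),  # like the original `if`, checked first
    total_sale < min_threeshold,
    total_sale <= max_threeshold,
]
choices = [0, total_sale * rate_1, total_sale * rate_2]
# Rows matching no condition (total_sale > max_threeshold) fall back to max_cap
df["new_tax"] = np.select(conditions, choices, default=max_cap)
```

This yields the same new_tax column as the indexing version, up to floating-point rounding.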


Create a column under if condition doesn't work

I have a data frame that contains daily, monthly and weekly statistics on lost weight.
I would like to create a boolean column that indicates whether the lost weight was above or below a threshold. I tried using an if statement with np.where:
if df_prod_stats.loc[df_prod_stats['frequency'] == "daily"]:
    df_prod_stats['target_met'] = np.where(df_prod_stats['loss_weight'] < 0.5, 1, 0)
elif df_prod_stats.loc[df_prod_stats['frequency'] == "monthly"]:
    df_prod_stats['target_met'] = np.where(df_prod_stats['loss_weight'] < 15, 1, 0)
else:
    df_prod_stats['target_met'] = np.where(df_prod_stats['loss_weight'] < 3.5, 1, 0)
But I get an error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I think you will need to do this a different way. You're trying to go through each row, check its frequency, and test the loss weight accordingly, but that is not what your code actually does. In the if df_prod_stats.loc[...], the .loc returns a subset of the data frame, which evaluates as truthy if it contains any data; the next line then fills in the new column for the entire original data frame, not just the rows that matched the .loc statement. You can achieve what (I think) you want using several .loc statements, as below.
First create the target_met column and set it to 0:
df_prod_stats['target_met'] = 0
Then use .loc to filter on your first if-statement condition (frequency is daily, loss weight is less than 0.5), and set target_met to 1:
df_prod_stats.loc[(df_prod_stats['frequency'] == 'daily')
& (df_prod_stats['loss_weight'] < 0.5), 'target_met'] = 1
elif condition (frequency is monthly, loss weight is less than 15):
df_prod_stats.loc[(df_prod_stats['frequency'] == 'monthly')
& (df_prod_stats['loss_weight'] < 15), 'target_met'] = 1
else condition (frequency is neither daily nor monthly, and loss weight is less than 3.5):
df_prod_stats.loc[~(df_prod_stats['frequency'].isin(['daily', 'monthly']))
& (df_prod_stats['loss_weight'] < 3.5), 'target_met'] = 1
Put together you get:
df_prod_stats['target_met'] = 0
df_prod_stats.loc[(df_prod_stats['frequency'] == 'daily')
& (df_prod_stats['loss_weight'] < 0.5), 'target_met'] = 1
df_prod_stats.loc[(df_prod_stats['frequency'] == 'monthly')
& (df_prod_stats['loss_weight'] < 15), 'target_met'] = 1
df_prod_stats.loc[~(df_prod_stats['frequency'].isin(['daily', 'monthly']))
& (df_prod_stats['loss_weight'] < 3.5), 'target_met'] = 1
Output:
frequency loss_weight target_met
0 daily -0.42 1
1 daily -0.35 1
2 daily -0.67 1
3 daily -0.11 1
4 daily -0.31 1
I hope that is what you're trying to achieve.
I found out it's also possible to use a simple set of conditions in np.where, as follows:
df_prod_stats['target_met'] = np.where(
    (df_prod_stats['loss_weight'] < 0.5) & (df_prod_stats['frequency'] == "daily")
    | (df_prod_stats['loss_weight'] < 15.0) & (df_prod_stats['frequency'] == "monthly")
    | (df_prod_stats['loss_weight'] < 3.5) & (df_prod_stats['frequency'] == "weekly"),
    1, 0)
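For reference, here is that combined expression run end to end on a small made-up frame (the data below is hypothetical, just to show the masks combining; note that & binds tighter than |, so each frequency/threshold pair is evaluated first):

```python
import numpy as np
import pandas as pd

df_prod_stats = pd.DataFrame({
    "frequency":   ["daily", "monthly", "weekly", "daily"],
    "loss_weight": [0.3, 10.0, 4.0, 0.7],
})

df_prod_stats["target_met"] = np.where(
    (df_prod_stats["loss_weight"] < 0.5) & (df_prod_stats["frequency"] == "daily")
    | (df_prod_stats["loss_weight"] < 15.0) & (df_prod_stats["frequency"] == "monthly")
    | (df_prod_stats["loss_weight"] < 3.5) & (df_prod_stats["frequency"] == "weekly"),
    1, 0)
# daily 0.3 and monthly 10.0 meet their thresholds; weekly 4.0 and daily 0.7 do not
```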

In python, how to bin ranges based on look up table with lower and upper limits [duplicate]

I have a variable called X whose minimum value is zero and maximum is 2 million, so I cut the values into bins like this:
bins = [0,1,10000,20000,50000,60000,70000,100000,2000000]
df_input['X_bins'] = pd.cut(df_input['X'], bins, right=False)
Currently I am replacing each bin with its Weight-of-Evidence value like this:
def flag_dfstd(df_input):
    if (df_input['X'] >= 0) & (df_input['X'] < 100):
        return '-0.157688'
    elif (df_input['X'] >= 100) & (df_input['X'] < 10000):
        return '-0.083307'
    elif (df_input['X'] >= 10000) & (df_input['X'] < 20000):
        return '0.381819'
    elif (df_input['X'] >= 20000) & (df_input['X'] < 50000):
        return '0.364365'
    else:
        return '0'

df_input['X_WOE'] = df_input.apply(flag_dfstd, axis=1).astype(str)
Is there a way to assign the Weight of Evidence without looping?
I think you need cut with the labels parameter; to replace missing values it is necessary to call cat.add_categories before fillna:
df_input = pd.DataFrame({'X':[0,20,100, 10000, 30000, 1000000]})
b = [-np.inf, 100, 10000, 20000, 50000]
l = ['-0.157688', '-0.083307', '0.381819', '0.364365']
df_input['X_WOE'] = pd.cut(df_input['X'], bins=b, labels=l, right=False)
df_input['X_WOE'] = df_input['X_WOE'].cat.add_categories(['0']).fillna('0')
print(df_input)
X X_WOE
0 0 -0.157688
1 20 -0.157688
2 100 -0.083307
3 10000 0.381819
4 30000 0.364365
5 1000000 0
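One detail worth noting: right=False makes each bin closed on the left and open on the right, i.e. [lower, upper), which is what makes boundary values land in the same bins as the if/elif chain. A tiny illustration with made-up values:

```python
import pandas as pd

s = pd.Series([0, 99, 100, 9999, 10000])
# With right=False the intervals are [0, 100), [100, 10000), [10000, 20000)
cut = pd.cut(s, bins=[0, 100, 10000, 20000], right=False)
# 0 and 99 fall in the first bin; 100 and 9999 in the second; 10000 in the third
```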

Require Max Value of previous 5 high values : Based on conditions

I need help with the following code for finding the max value in a window of the previous 5 rows. I don't know why it's not working. Could anyone please help?
I am trying to put the conditions as:
if day(1), then max value = value[day(1)]
elif day(1<n<=5), then max value = value[day(n)] if value[day(n)]>value[day(n-1)]
else, max value will be the last max value iterated. Thanks for your time. Also, len(src) is 29969, if required.
def high_change(src, lkbk):
    highest_high = []
    last_val = np.nan
    for i in range(len(src)):
        for a in range(i, i+lkbk):
            if a == i:
                highest_high = high_df1[a]  # first day value is max value
                last_val = high_df1[a]
            elif high_df1[a] > high_df1[a-1]:
                highest_high = high_df1[a]  # max high value in ref to previous value
            else:
                highest_high = last_val
    return highest_high

df1['h_h'] = pd.Series(perc_change(df1, 5))
Answering my own question, I got the result from the following code. I hope it helps, and any better code is welcome.
def high_change(i, j):
    last_hval = np.nan
    for a in range(i, j):
        if a == i:
            last_hval = high_df1[a]
        elif high_df1[a] > high_df1[a-1]:
            last_hval = high_df1[a]
    return last_hval

def perc_change(src, lkbk):
    highest_high = []
    for i in range(len(src)):
        if i < lkbk:
            highest_high.append(np.nan)
        else:
            highest_high.append(high_change(i-lkbk, i))
    return highest_high

df['h_h'] = pd.Series(perc_change(df, 5), dtype=float).round(2)
First, compute whether the value of the current day is greater than the value of the previous day. Apply cumsum to get increasing values, then rolling with a window of 6 (the 5 previous days and the current one). Finally, take the max value of the window excluding the current day:
WINDOW = 5
df['highest5'] = df['high'].gt(df['high'].shift()).cumsum() \
                           .rolling(WINDOW + 1) \
                           .apply(lambda x: df.loc[x[:WINDOW].idxmax(), 'high'])
>>> df
high highest5
0 13996 NaN
1 14021 NaN
2 14019 NaN
3 14013 NaN
4 14019 NaN
5 14018 14019.0
6 14022 14019.0
7 14023 14022.0
8 14021 14023.0
9 14020 14023.0
10 14014 14023.0
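If the day-over-day increase condition is dropped and you simply want the maximum of the previous five highs, a plain rolling window does it. This is a sketch on a few made-up highs, and it can differ from the answer above whenever the highest of the five values was not part of a rise:

```python
import pandas as pd

df = pd.DataFrame({"high": [13996, 14021, 14019, 14013, 14019, 14018, 14022]})
# rolling(5).max() covers the current row and the 4 before it;
# shift() then moves the result down one row, so each row sees the
# max of the 5 rows strictly before it
df["prev5_max"] = df["high"].rolling(5).max().shift()
```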

python, operation on big pandas Dataframe

I have a pandas DataFrame named Joined with 5 fields:
product | price | percentil_25 | percentil_50 | percentile_75
For each row I want to classify the price like this: if the price is below percentil_25, the product gets class 1, and so on.
So what I did is:
classe_final = OrderedDict()
classe_final['sku'] = []
classe_final['class'] = []
for index in range(len(joined)):
    classe_final['sku'].append(joined.values[index][0])
    if float(joined.values[index][1]) <= float(joined.values[index][2]):
        classe_final['class'].append(1)
    elif float(joined.values[index][2]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][3]):
        classe_final['class'].append(2)
    elif float(joined.values[index][3]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][4]):
        classe_final['class'].append(3)
    else:
        classe_final['class'].append(4)
But as my DataFrame is quite big it's taking forever.
Do you have any idea how I could do this quicker?
# build an empty df
df = pd.DataFrame()
# get a list of the unique products, could skip this perhaps
df['Product'] = other_df['Sku'].unique()
There are 2 ways. First, define a func and call apply (note that class is a reserved word in Python, so the function needs a different name):
def classify(x):
    if x.price < x.percentil_25:
        return 1
    elif x.price >= x.percentil_25 and x.price < x.percentil_50:
        return 2
    elif x.price >= x.percentil_50 and x.price < x.percentil_75:
        return 3
    elif x.price >= x.percentil_75:
        return 4

df['class'] = other_df.apply(lambda row: classify(row), axis=1)
Another way, which I think is better and will be much faster, is to add the 'class' column to your existing df using loc, and then just take a view of the 2 columns of interest:
joined.loc[joined['price'] < joined['percentil_25'], 'class'] = 1
joined.loc[(joined['price'] >= joined['percentil_25']) & (joined['price'] < joined['percentil_50']), 'class'] = 2
joined.loc[(joined['price'] >= joined['percentil_50']) & (joined['price'] < joined['percentil_75']), 'class'] = 3
joined.loc[joined['price'] >= joined['percentil_75'], 'class'] = 4
classe_final = joined[['sku', 'class']]
Just for kicks you could use a load of np.where conditions:
classe_final['class'] = np.where(joined['price'] > joined['percentil_75'], 4,
                        np.where(joined['price'] > joined['percentil_50'], 3,
                        np.where(joined['price'] > joined['percentil_25'], 2, 1)))
This evaluates whether the price is greater than percentil_75; if so, class 4, otherwise it evaluates the next condition, and so on. It may be worth timing this compared to loc, but it is a lot less readable.
Another solution, if someone asked me to bet which one is the fastest I'd go for this:
joined.set_index("product").eval(
    "1 + (price >= percentil_25)"
    " + (price >= percentil_50)"
    " + (price >= percentil_75)"
)
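Yet another vectorised option is np.select, which reads closest to the original if/elif chain: the first matching condition wins, with class 4 as the default. A sketch on made-up numbers (the values below are hypothetical, only the column names come from the question):

```python
import numpy as np
import pandas as pd

joined = pd.DataFrame({
    "product":      ["a", "b", "c", "d"],
    "price":        [5, 15, 25, 35],
    "percentil_25": [10, 10, 10, 10],
    "percentil_50": [20, 20, 20, 20],
    "percentil_75": [30, 30, 30, 30],
})

conditions = [
    joined["price"] <= joined["percentil_25"],
    joined["price"] <= joined["percentil_50"],
    joined["price"] <= joined["percentil_75"],
]
joined["class"] = np.select(conditions, [1, 2, 3], default=4)
# prices 5, 15, 25, 35 land in classes 1, 2, 3, 4 respectively
```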

unable to loop through numpy arrays

I am really confused and can't seem to find an answer for my code below. I keep getting the following error:
File "C:\Users\antoniozeus\Desktop\backtester2.py", line 117, in backTest
if prices >= smas:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Now, as you will see in my code below, I am trying to compare two numpy arrays, step by step, to try and generate a signal once my condition is met. This is based on Apple stock data.
Going one point at a time, starting at index [0] then [1]: if my price is greater than or equal to smas (the moving average), then a signal is produced. Here is the code:
def backTest():
    # Trade Rules
    # Buy when prices are greater than our moving average
    # Sell when prices drop below our moving average
    portfolio = 50000
    tradeComm = 7.95
    stance = 'none'
    buyPrice = 0
    sellPrice = 0
    previousPrice = 0
    totalProfit = 0
    numberOfTrades = 0
    startPrice = 0
    startTime = 0
    endTime = 0
    totalInvestedTime = 0
    overallStartTime = 0
    overallEndTime = 0
    unixConvertToWeeks = 7*24*60*60
    unixConvertToDays = 24*60*60

    date, closep, highp, lowp, openp, volume = np.genfromtxt(
        'AAPL2.txt', delimiter=',', unpack=True,
        converters={0: mdates.strpdate2num('%Y%m%d')})

    ## FIRST SMA
    window = 10
    weights = np.repeat(1.0, window)/window

    '''valid makes sure that we only calculate from valid data, no MA on points 0:21'''
    smas = np.convolve(closep, weights, 'valid')
    prices = closep[9:]

    for price in prices:
        if stance == 'none':
            if prices >= smas:
                print "buy triggered"
                buyPrice = closep
                print "bought stock for", buyPrice
                stance = "holding"
                startTime = date
                print 'Enter Date:', startTime
                if numberOfTrades == 0:
                    startPrice = buyPrice
                    overallStartTime = date
                numberOfTrades += 1
        elif stance == 'holding':
            if prices < smas:
                print 'sell triggered'
                sellPrice = closep
                print 'finished trade, sold for:', sellPrice
                stance = 'none'
                tradeProfit = sellPrice - buyPrice
                totalProfit += tradeProfit
                print totalProfit
                print 'Exit Date:', endTime
                endTime = date
                timeInvested = endTime - startTime
                totalInvestedTime += timeInvested
                overallEndTime = endTime
                numberOfTrades += 1
        # this is our reset
        previousPrice = closep
You have numpy arrays: smas is the output of np.convolve, which is an array, and I believe that prices is also an array. With numpy, arr > other_arr will return an ndarray, which doesn't have a well-defined truth value (hence the error).
You probably want to compare price with a single element from smas, although I'm not sure which (or what np.convolve is going to return here -- it may only have a single element)...
I think you mean
if price >= smas
You have
if prices >= smas
which compares the whole array at once.
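Since prices and smas have the same length here (closep[9:] lines up with the 'valid' convolution output), one way to make the comparison element by element is to zip the two arrays. A minimal sketch with made-up closing prices (the real code reads them from AAPL2.txt):

```python
import numpy as np

# hypothetical closing prices standing in for the file data
closep = np.array([10.0, 11.0, 12.0, 11.5, 13.0, 12.5,
                   14.0, 13.5, 15.0, 14.0, 16.0, 13.0])
window = 10
weights = np.repeat(1.0, window) / window
smas = np.convolve(closep, weights, 'valid')  # one SMA per fully covered window
prices = closep[window - 1:]                  # aligned with smas, same length

# compare one price against one SMA at a time instead of whole arrays
signals = ["buy" if price >= sma else "sell"
           for price, sma in zip(prices, smas)]
```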
