How to use min/max with shift function in pandas? - python

I have time series data (EUR/USD).
I want to create a new column based on the following conditions (they may be easier to follow in my code below):
if the minimum of the 3 previous high prices is less than or equal to the current price, the value should be 'BUY_SIGNAL'; if the maximum of the 3 previous low prices is higher than or equal to the current price, the value should be 'SELL_SIGNAL'.
Here is what my table looks like:
DATE OPEN HIGH LOW CLOSE
0 1990.09.28 1.25260 1.25430 1.24680 1.24890
1 1990.10.01 1.25170 1.26500 1.25170 1.25480
2 1990.10.02 1.25520 1.26390 1.25240 1.26330
3 1990.10.03 1.26350 1.27000 1.26030 1.26840
4 1990.10.04 1.26810 1.27750 1.26710 1.27590
and this is my code (I tried to create 2 functions, but it does not work):
def target_label(df):
    if df['HIGH'] >= [df['HIGH'].shift(1), df['HIGH'].shift(2), df['HIGH'].shift(3)].min(axis=1):
        return 'BUY_SIGNAL'
    if df['LOW'] >= [df['LOW'].shift(1), df['LOW'].shift(2), df['LOW'].shift(3)].min(axis=1):
        return 'SELL_SIGNAL'
    else:
        return 'NO_SIGNAL'

def target_label(df):
    if df['HIGH'] >= df[['HIGH1', 'HIGH2', 'HIGH3']].min(axis=1):
        return 'BUY_SIGNAL'
    if df['LOW'] <= df[['LOW1', 'LOW2', 'LOW3']].max(axis=1):
        return 'SELL_SIGNAL'
    else:
        return 'NO_SIGNAL'

d_df.apply(lambda df: target_label(df), axis=1)

You can use rolling(3).min() to get the minimum of the previous 3 rows. The same works for other functions like max, mean, etc. Something like the following (the SELL branch compares the current low against the rolling max of the previous lows, per the conditions above):
import numpy as np

df['signal'] = np.where(
    df['HIGH'] >= df.shift(1).rolling(3)['HIGH'].min(), 'BUY_SIGNAL',
    np.where(
        df['LOW'] <= df.shift(1).rolling(3)['LOW'].max(), 'SELL_SIGNAL',
        'NO_SIGNAL'
    )
)
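If more conditions get added later, a numpy.select version may read more cleanly. Here is a minimal sketch under the same column names (the first three rows fall back to 'NO_SIGNAL' because the window of previous rows is not yet full):
import numpy as np

prev = df.shift(1).rolling(3)
conditions = [
    df['HIGH'] >= prev['HIGH'].min(),   # current high reaches the lowest of the previous 3 highs
    df['LOW'] <= prev['LOW'].max(),     # current low reaches the highest of the previous 3 lows
]
df['signal'] = np.select(conditions, ['BUY_SIGNAL', 'SELL_SIGNAL'], default='NO_SIGNAL')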

Related

How to iterate over column values for each group and track sum

I have 4 dataframes as given below:
df_raw = pd.DataFrame(
    {'stud_id': [101, 101, 101],
     'prod_id': [12, 13, 16],
     'total_qty': [100, 1000, 80],
     'ques_date': ['13/11/2020', '10/1/2018', '11/11/2017']})
df_accu = pd.DataFrame(
    {'stud_id': [101, 101, 101],
     'prod_id': [12, 13, 16],
     'accu_qty': [10, 500, 10],
     'accu_date': ['13/08/2021', '02/11/2019', '17/12/2018']})
df_inv = pd.DataFrame(
    {'stud_id': [101, 101, 101],
     'prod_id': [12, 13, 18],
     'inv_qty': [5, 100, 15],
     'inv_date': ['16/02/2022', '22/11/2020', '19/10/2019']})
df_bkl = pd.DataFrame(
    {'stud_id': [101, 101, 101, 101],
     'prod_id': [12, 12, 12, 17],
     'bkl_qty': [15, 40, 2, 10],
     'bkl_date': ['16/01/2022', '22/10/2021', '09/10/2020', '25/06/2020']})
My objective is to find out the below:
a) Get the date when the threshold exceeds 50%.
The threshold is given by the formula below:
threshold = (((df_inv['inv_qty'] + df_bkl['bkl_qty'] + df_accu['accu_qty']) / df_raw['total_qty']) * 100)
We have to add in the same order. Meaning, first we add inv_qty, then bkl_qty and finally accu_qty. We do it this way in order to identify the correct date on which they exceeded 50% of total_qty. Additionally, this has to be computed for each stud_id and prod_id.
But the problem is that df_bkl has multiple records for the same stud_id and prod_id, and that is by design; the real data also looks like this. Whereas df_accu and df_inv have only one row for each stud_id and prod_id.
In the above formula, for df_bkl['bkl_qty'] we have to use each value of bkl_qty to compute the sum.
For example, let's take stud_id = 101 and prod_id = 12.
His total_qty = 100, inv_qty = 5 and accu_qty = 10, but he has three bkl_qty values: 15, 40 and 2. So, the threshold has to be computed in a fashion like below:
5 (value of inv_qty) + 15 (1st value of bkl_qty) + 40 (2nd value of bkl_qty) + 2 (3rd value of bkl_qty) + 10 (value of accu_qty)
With the above, we know that his threshold exceeded 50% when his bkl_qty value was 40. Meaning, 5 + 15 + 40 = 60, which is greater than 50% of total_qty (100).
I was trying something like below
df_stage_1 = df_raw.merge(df_inv,on=['stud_id','prod_id'], how='left').fillna(0)
df_stage_2 = df_stage_1.merge(df_bkl,on=['stud_id','prod_id'])
df_stage_3 = df_stage_2.merge(df_accu,on=['stud_id','prod_id'])
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['bkl_qty'] + df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
But this is incorrect, as I am not able to process bkl_qty from df_bkl value by value.
In this post I have shown only sample data with one stud_id = 101, but in reality I have thousands of stud_id and prod_id values.
Therefore, any elegant and efficient approach would be useful. We have to apply this logic on datasets with millions of records.
I expect my output to be as shown below. Whenever the running sum exceeds 50% of total_qty, we need to get the corresponding date.
stud_id  prod_id  total_qty  threshold  threshold_date
101      12       100        72         22/10/2021
It can be achieved using groupby and cumsum, which does a cumulative summation.
# add a cumulative sum column to df_bkl (before the merges above, so that df_stage_3 carries it)
df_bkl['csum'] = df_bkl.groupby(['stud_id','prod_id'])['bkl_qty'].cumsum()
# use the cumulative sum to compute the threshold instead of the raw bkl_qty
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['csum'] + df_stage_3['accu_qty'])/df_stage_3['total_qty'])*100
# if inv_qty alone already exceeds 50% of total_qty, use inv_date as the crossing date
df_stage_3.loc[df_stage_3.inv_qty > df_stage_3.total_qty/2, 'bkl_date'] = df_stage_3['inv_date']
# next, do some filtering and merging to arrive at the desired df
# threshold is already a percentage, so compare it against 50
gt_thres = df_stage_3[df_stage_3['threshold'] > 50]
# the smallest crossing threshold identifies the crossing row; the largest is the final threshold
df_f1 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].min().to_frame(name='threshold').reset_index()
df_f2 = gt_thres.groupby(['stud_id','prod_id','total_qty'])['threshold'].max().to_frame(name='threshold_max').reset_index()
df = pd.merge(df_f1, df_stage_3, on=['stud_id','prod_id','total_qty','threshold'], how='inner')
df2 = pd.merge(df, df_f2, on=['stud_id','prod_id','total_qty'], how='inner')
df2 = df2[['stud_id','prod_id','total_qty','threshold_max','bkl_date']].rename(columns={'threshold_max':'threshold', 'bkl_date':'threshold_date'})
print(df2)
provides the output as:
stud_id prod_id total_qty threshold threshold_date
0 101 12 100 72.0 22/10/2021
Does this work?
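A more compact, hedged sketch of the same idea (assuming the bkl rows are already in the order in which they should be accumulated, as in the sample, and ignoring the edge case where inv_qty alone exceeds 50%): compute the running percentage and pick the crossing row per group with a named aggregation.
# running percentage per (stud_id, prod_id), in the given row order
df_stage_3['csum'] = df_stage_3.groupby(['stud_id', 'prod_id'])['bkl_qty'].cumsum()
df_stage_3['threshold'] = ((df_stage_3['inv_qty'] + df_stage_3['csum']
                            + df_stage_3['accu_qty']) / df_stage_3['total_qty']) * 100
# the first row per group that crosses 50% gives the date; the max gives the final percentage
crossed = df_stage_3[df_stage_3['threshold'] > 50]
result = (crossed.groupby(['stud_id', 'prod_id'], as_index=False)
                 .agg(total_qty=('total_qty', 'first'),
                      threshold=('threshold', 'max'),
                      threshold_date=('bkl_date', 'first')))
print(result)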

Create a column that is a conditional smoothed moving average of another column in python

My Problem
I am trying to create a column in Python which is the conditional smoothed 14-day moving average of another column. The condition is that I only want to include positive values from another column in the rolling average.
I am currently using the following code which works exactly how I want it to, but it is really slow because of the loops. I want to try and re-do it without using loops. The dataset is simply the last closing price of a stock.
Current Working Code
import numpy as np
import pandas as pd

csv1 = pd.read_csv('stock_price.csv', delimiter=',')
df = pd.DataFrame(csv1)

df['delta'] = df.PX_LAST.pct_change()
df.loc[df.index[0], 'avg_gain'] = 0
for x in range(1, len(df.index)):
    if df["delta"].iloc[x] > 0:
        df["avg_gain"].iloc[x] = ((df["avg_gain"].iloc[x - 1] * 13) + df["delta"].iloc[x]) / 14
    else:
        df["avg_gain"].iloc[x] = ((df["avg_gain"].iloc[x - 1] * 13) + 0) / 14
df
Correct Output Example
Dates PX_LAST delta avg_gain
03/09/2018 43.67800 NaN 0.000000
04/09/2018 43.14825 -0.012129 0.000000
05/09/2018 42.81725 -0.007671 0.000000
06/09/2018 43.07725 0.006072 0.000434
07/09/2018 43.37525 0.006918 0.000897
10/09/2018 43.47925 0.002398 0.001004
11/09/2018 43.59750 0.002720 0.001127
12/09/2018 43.68725 0.002059 0.001193
13/09/2018 44.08925 0.009202 0.001765
14/09/2018 43.89075 -0.004502 0.001639
17/09/2018 44.04200 0.003446 0.001768
Attempted Solutions
I tried to create a new column that comprises only the positive values and then tried to create the smoothed moving average of that new column, but it doesn't give me the right answer:
df['new_col'] = df['delta'].apply(lambda x: x if x > 0 else 0)
df['avg_gain'] = df['new_col'].ewm(14,min_periods=1).mean()
The maths behind it is as follows...
Avg_Gain = ((Avg_Gain(t-1) * 13) + (New_Col * 1)) / 14
where New_Col only equals the positive values of Delta
Does anyone know how I might be able to do it?
Cheers
This should speed up your code:
df['avg_gain'] = df[df['delta'] > 0]['delta'].rolling(14).mean()
Does your current code converge to zero? If you can provide the data, then it would be easier for folks to do some analysis.
I would suggest you add a column which is 0 if the value is < 0 and otherwise keeps the value you want to consider. Then you take the running average of this new column.
df['new_col'] = df.apply(lambda x: x['delta'] if x['delta'] >= 0 else 0, axis=1)
df['avg_gain'] = df['new_col'].rolling(14).mean()
This would take the zeros into account instead of just discarding them.
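Note that the recurrence in the question, Avg_Gain = ((Avg_Gain(t-1) * 13) + New_Col) / 14, is an exponentially weighted average with alpha = 1/14 and adjust=False, so a fully vectorized sketch is possible (assuming the same PX_LAST column and the zero seed used in the loop):
import pandas as pd

df['delta'] = df['PX_LAST'].pct_change()
# keep only positive moves; negatives and the initial NaN count as 0,
# mirroring the loop's avg_gain[0] = 0 seed
gains = df['delta'].clip(lower=0).fillna(0)
# avg_gain(t) = (13 * avg_gain(t-1) + gain(t)) / 14  ==  EWMA with alpha = 1/14
df['avg_gain'] = gains.ewm(alpha=1/14, adjust=False).mean()
This should reproduce the avg_gain column in the correct output example above without any Python-level loop.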

How to perform multiple boolean conditions in a whole DataFrame (row by row)?

I have the following DataFrame:
open high low close volume
0 62.8571 63.9285 62.7714 63.5642 82641944.0
1 63.6642 64.9285 63.5014 64.5114 88379522.0
2 61.7014 63.6857 61.4428 63.2757 112681030.0
3 62.5928 63.6399 62.0285 62.8085 113921367.0
4 63.4357 64.0499 62.6028 63.0505 110727309.0
.. .. .. .. .. ..
And currently I have the following code to generate a "bool" (0, 1, -1) Series depending on multiple conditions (selecting rows 2 by 2; in other cases I will need 3 or 4 rows in each iteration/calculation):
def check_pattern(data):
    engulfed_bar_range = data.iloc[-2]['close'] - data.iloc[-2]['open']
    if abs(engulfed_bar_range) >= params:
        if engulfed_bar_range > 0:
            return -1 * ((data.iloc[-1]['open'] > data.iloc[-2]['close']) and
                         (data.iloc[-1]['close'] < data.iloc[-2]['open']))
        else:
            return +1 * ((data.iloc[-1]['open'] < data.iloc[-2]['close']) and
                         (data.iloc[-1]['close'] > data.iloc[-2]['open']))
    return False

res = []
for index in range(1, len(all_data)):
    data = all_data.iloc[index-1:index+1]
    res.append(check_pattern(data))
s = pd.Series(res)
Is there any better/easier/better-performing way of doing this? In some other cases similar to this one, where I only needed the data of one column of the DataFrame, I have used df.rolling(..), but in this case, where I need data from several columns, I don't know how to do it. Maybe there is some function from numpy that I can use? Or pd.eval? (I have tried, but I haven't been able to get what I want.)
Thanks so much in advance for your help.
Graphical explanation of what I'm looking for in the df:
I want a pd.Series with +1 when there is a Bullish Engulfing pattern and -1 when there is a Bearish Engulfing pattern, and 0 if there is no pattern at those indexes.
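A vectorized sketch of the same check (hedged, assuming `params` is the minimum body size of the previous bar, as in the function above): compare each row with the previous one via shift(1) and combine the boolean conditions with np.select.
import numpy as np
import pandas as pd

prev = df.shift(1)                       # previous bar, aligned with the current row
body = prev['close'] - prev['open']      # previous bar's body
big_enough = body.abs() >= params        # `params` as used in the question

# bearish engulfing: previous bar bullish, current bar engulfs it downwards
bearish = big_enough & (body > 0) & (df['open'] > prev['close']) & (df['close'] < prev['open'])
# bullish engulfing: previous bar bearish, current bar engulfs it upwards
bullish = big_enough & (body <= 0) & (df['open'] < prev['close']) & (df['close'] > prev['open'])

s = pd.Series(np.select([bullish, bearish], [1, -1], default=0), index=df.index)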

Group by date range in pandas dataframe

I have time series data in pandas, and I would like to group by a certain time window in each year and calculate its min and max.
For example:
times = pd.date_range(start = '1/1/2011', end = '1/1/2016', freq = 'D')
df = pd.DataFrame(np.random.rand(len(times)), index=times, columns=["value"])
How can I group by a time window, e.g. 'Jan-10':'Mar-21', for each year and calculate the min and max of the value column?
You can use the resample method.
df.resample('5d').agg(['min','max'])
I'm not sure if there's a direct way to do it without first creating a flag for the days required. The following function is used to create the required flag:
# Function for flagging the days required
def flag(x):
    if x.month == 1 and x.day >= 10: return True
    elif x.month in [2, 3, 4]: return True
    elif x.month == 5 and x.day <= 21: return True
    else: return False
Since you require it for each year, it would be a good idea to have the year as a column.
Then the min and max for each year for given periods can be obtained with the code below:
times = pd.date_range(start = '1/1/2011', end = '1/1/2016', freq = 'D')
df = pd.DataFrame(np.random.rand(len(times)), index=times, columns=["value"])
df['Year'] = df.index.year
pd.pivot_table(df[list(pd.Series(df.index).apply(flag))], values=['value'], index = ['Year'], aggfunc=[min,max])
The output will look as follows (the sample output was provided as an image).
Hope that answers your question... :)
You can define the bin edges, then throw out the bins you don't need (every other one) with .loc[::2, :]. Here I'll define two functions just to check we're getting the date ranges we want within groups (note: since the left edges are open, we need to subtract 1 day):
import pandas as pd

edges = pd.to_datetime([x for year in df.index.year.unique()
                        for x in [f'{year}-02-09', f'{year}-03-21']])

def min_idx(x):
    return x.index.min()

def max_idx(x):
    return x.index.max()

df.groupby(pd.cut(df.index, bins=edges)).agg([min_idx, max_idx, min, max]).loc[::2, :]
Output:
                               value
                             min_idx     max_idx       min       max
(2011-02-09, 2011-03-21]  2011-02-10  2011-03-21  0.009343  0.990564
(2012-02-09, 2012-03-21]  2012-02-10  2012-03-21  0.026369  0.978470
(2013-02-09, 2013-03-21]  2013-02-10  2013-03-21  0.039491  0.946481
(2014-02-09, 2014-03-21]  2014-02-10  2014-03-21  0.029161  0.967490
(2015-02-09, 2015-03-21]  2015-02-10  2015-03-21  0.006877  0.969296
(2016-02-09, 2016-03-21]         NaT         NaT       NaN       NaN
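A hedged alternative sketch that follows the question's 'Jan-10':'Mar-21' window literally: build a month/day mask and aggregate the masked frame by year.
# select Jan 10 through Mar 21 of every year with a month/day mask
mask = (
    ((df.index.month == 1) & (df.index.day >= 10))
    | (df.index.month == 2)
    | ((df.index.month == 3) & (df.index.day <= 21))
)
df[mask].groupby(df[mask].index.year)['value'].agg(['min', 'max'])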

calculating slope on a rolling basis in pandas df python

I have a dataframe :
CAT ^GSPC
Date
2012-01-06 80.435059 1277.810059
2012-01-09 81.560600 1280.699951
2012-01-10 83.962914 1292.079956
....
2017-09-16 144.56653 2230.567646
and I want to find the slope of the stock versus the S&P index over the last 63 days at each period. I have tried:
x = 0
temp_dct = {}
for date in df.index:
    x += 1
    max(x, (len(df.index) - 64))
    temp_dct[str(date)] = np.polyfit(df['^GSPC'][0+x:63+x].values,
                                     df['CAT'][0+x:63+x].values,
                                     1)[0]
However, I feel this is very "unpythonic", and I've had trouble integrating rolling/shift functions into this.
My expected output is a column called "Beta" that holds the slope of the S&P (x values) against the stock (y values) for all available dates.
# this will operate on a Series
def polyf(seri):
    return np.polyfit(seri.index.values, seri.values, 1)[0]

# you can store the original index in a column in case you need to reset back to it after fitting
df.index = df['^GSPC']
df['slope'] = df['CAT'].rolling(63, min_periods=2).apply(polyf, raw=False)
After running this, there will be a new column storing the fitting result.
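As a hedged alternative that avoids apply entirely, the OLS slope (beta) of CAT regressed on ^GSPC over a window equals the rolling covariance divided by the rolling variance of ^GSPC. A minimal sketch that keeps the original date index:
window = 63
# slope of CAT on ^GSPC = cov(CAT, ^GSPC) / var(^GSPC) over the rolling window
cov = df['CAT'].rolling(window).cov(df['^GSPC'])
var = df['^GSPC'].rolling(window).var()
df['Beta'] = cov / var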
