Exponential moving averages in pandas - python

I have the following dataframe:
I'm trying to calculate a 13-period exponential moving average, but the results don't match at all. I'm using the following code:
stockdata['sma'] = stockdata['close'].rolling(window = 13, min_periods=13).mean()
stockdata['ema'] = stockdata['close'].ewm(span=13, adjust=False).mean()
The simple moving average works perfectly; only the exponential moving average is not giving the correct values.
The moving averages are calculated on the 'close' column:
ativo close sma ema
0 PETR4 28.18 NaN 28.180000
1 PETR4 28.63 NaN 28.244286
2 PETR4 28.39 NaN 28.265102
3 PETR4 29.18 NaN 28.395802
4 PETR4 28.93 NaN 28.472116
5 PETR4 29.13 NaN 28.566099
6 PETR4 29.48 NaN 28.696656
7 PETR4 30.13 NaN 28.901420
8 PETR4 29.72 NaN 29.018360
9 PETR4 29.42 NaN 29.075737
10 PETR4 29.36 NaN 29.116346
11 PETR4 29.75 NaN 29.206868
12 PETR4 30.55 29.296154 29.398744
In the dataframe the oldest data is at the top and the most recent at the bottom.
The correct value of the 13-period exponential moving average should be 29.53. What would be the correct way to use the function?
From the verification I did on the investment platform, the values the function is giving me look like those of an arithmetic (simple) moving average.

The issue is that you're assuming an EMA with span 13 only looks at the last 13 data points. That's not really the case: it also looks further back, simply using smaller weights for data points further in the past.
If you take the last month of closing stock prices for PETR4 and take the 13-day EMA of them, you'll arrive at the expected result:
import pandas as pd

closing_price = pd.Series([
    30.00, 29.62, 29.29, 29.65, 29.30,
    28.03, 28.80, 28.85, 28.94, 28.45,
    28.18, 28.63, 28.39, 29.18, 28.93,
    29.13, 29.48, 30.13, 29.72, 29.42,
    29.36, 29.75, 30.55,
])
And:
In []: closing_price.ewm(span=13, adjust=False).mean().iloc[-1]
Out[]: 29.52974644568118
Which after rounding seems to match your expected answer.
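For reference, this is the recursion that ewm(span=13, adjust=False) applies: alpha = 2 / (span + 1), the first observation seeds the EMA, and each new value blends the latest close with the previous EMA. A minimal sketch using the first few closes from your dataframe reproduces your 'ema' column:
span = 13
alpha = 2 / (span + 1)

closes = [28.18, 28.63, 28.39, 29.18]   # first few closes from the question
ema = closes[0]                          # adjust=False seeds with the first value
for price in closes[1:]:
    # each step keeps a fraction (1 - alpha) of all past history
    ema = alpha * price + (1 - alpha) * ema

print(round(ema, 6))   # 28.395802, matching the fourth 'ema' value above
Because every past value keeps a nonzero weight, the EMA over only the last 13 closes differs from the EMA over the full history, which is why your platform's value of 29.53 only appears once enough history is included.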

Related

How to calculate the difference between a value in the current row and a later row and column, based on a conditional, in a pandas dataframe

I’m trying to build a backtesting script for trading and I’m stuck trying to figure out how to calculate a future profit or loss based on a particular trigger. In essence, a particular pattern or other set of indicators will produce a trigger. This trigger is added as a “1” in a column of a pandas dataframe, in the same row as the close value that will be used as the trade entry. The trigger can happen at any point in the data set. What I need to do is calculate which comes first, a profit or a loss, based on a percentage profit target or stop loss. For example, say a profit target of 10% and a stop loss of 4%. For a profit, a value in the high column (with a higher index than the trigger row) needs to be equal to or greater than the trigger row's close value plus 10%. I’ll record the difference between the close value (at the trigger) and the high value that meets the criteria (a positive value in this case) in a new column, profit.
For a stop loss scenario, I'm looking for a value in the low column that is 4% or lower than the close value on the triggered row. This will be recorded in the same profit column, albeit as a negative value.
The intent here is to record the difference that occurs first, downstream (higher index value) of the current row index. Keep in mind triggers could happen on any row, including before a win or loss value occurs. For this reason, I believe I need to use an apply function, passing the close value at the trigger point, and the low and high columns (from the current row index to the last row index) to a function to do the calculation.
Unfortunately I’m not experienced enough in pandas (so far) to execute something of this complexity. Or, is there a better way?
Sample dataframe with triggers:
time open high low close trigger
1641013560000.0 46680.02 46686.17 46679.66 46680.03 NaN
1641013620000.0 46680.02 46708.64 46679.77 46698.07 NaN
1641013680000.0 46698.07 46704.37 46696.48 46700.45 NaN
1641013740000.0 46700.01 46703.06 46682.75 46684.28 1
1641013800000.0 46685.14 46700.75 46684.95 46700.75 NaN
1641013860000.0 46700.74 46726.71 46697.05 46725.09 NaN
1641013920000.0 46725.09 46728.23 46712.33 46712.34 1
1641013980000.0 46712.34 46758.99 46712.34 46757.93 1
1641014040000.0 46753.27 46779.67 46743.92 46779.67 NaN
1641014100000.0 46779.67 46780.00 46763.37 46775.48 NaN
....
1641014460000.0 46888.39 46896.29 46869.32 46896.29 NaN
1641014520000.0 46896.03 47092.27 46896.03 47046.62 NaN
1641014580000.0 47046.59 47258.84 47037.02 47182.72 1
1641014640000.0 47186.64 47453.33 47180.75 47433.39 NaN
1641014700000.0 47433.38 47448.45 47359.25 47362.86 NaN
1641014760000.0 47362.86 47415.87 47347.42 47404.15 NaN
1641014820000.0 47405.65 47486.71 47397.04 47450.39 NaN
1641014880000.0 47450.04 47463.93 47365.47 47380.77 NaN
Sample dataframe after profit calculated:
time open high low close trigger profit
1641013560000.0 46680.02 46686.17 46679.66 46680.03 NaN NaN
1641013620000.0 46680.02 46708.64 46679.77 46698.07 NaN NaN
1641013680000.0 46698.07 46704.37 46696.48 46700.45 NaN NaN
1641013740000.0 46700.01 46703.06 46682.75 46684.28 1 284
1641013800000.0 46685.14 46700.75 46684.95 46700.75 NaN NaN
1641013860000.0 46700.74 46726.71 46697.05 46725.09 NaN NaN
1641013920000.0 46725.09 46728.23 46712.33 46712.34 1 -65
1641013980000.0 46712.34 46758.99 46712.34 46757.93 1 -123
1641014040000.0 46753.27 46779.67 46743.92 46779.67 NaN NaN
1641014100000.0 46779.67 46780.00 46763.37 46775.48 NaN NaN
.....
1641014460000.0 46888.39 46896.29 46869.32 46896.29 NaN NaN
1641014520000.0 46896.03 47092.27 46896.03 47046.62 NaN NaN
1641014580000.0 47046.59 47258.84 47037.02 47182.72 1 383
1641014640000.0 47186.64 47453.33 47180.75 47433.39 NaN NaN
1641014700000.0 47433.38 47448.45 47359.25 47362.86 NaN NaN
1641014760000.0 47362.86 47415.87 47347.42 47404.15 NaN NaN
1641014820000.0 47405.65 47486.71 47397.04 47450.39 NaN NaN
1641014880000.0 47450.04 47463.93 47365.47 47380.77 NaN NaN
Note the above profit values are simulated, and not based on an actual calculation.
I've attempted something along these lines:
df['profit'] = np.where((df.trigger > 0), (df.close - df.high.rolling(k).min()), np.NaN)
df['profit'] = np.where((df.trigger > 0), (df.close - df.low.rolling(k).min()), np.NaN)
Unfortunately this only gives me an absolute high or low for a specific period, k, and doesn't tell me which occurs first.
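One possible approach is a rough sketch like the following (not the asker's code): loop over the trigger rows and scan forward until either the +10% target on 'high' or the -4% stop on 'low' is hit first. Column names follow the sample dataframe; the function name and the tie-breaking order (stop checked before target on the same bar) are assumptions.
import numpy as np
import pandas as pd

def first_hit(df, take_profit=0.10, stop_loss=0.04):
    # For each trigger row, walk forward through later rows and record
    # whichever level is reached first: the profit target on 'high'
    # (positive difference) or the stop on 'low' (negative difference).
    profit = pd.Series(np.nan, index=df.index)
    for i in df[df['trigger'] > 0].index:
        entry = df.at[i, 'close']
        target = entry * (1 + take_profit)
        stop = entry * (1 - stop_loss)
        pos = df.index.get_loc(i)
        for _, row in df.iloc[pos + 1:].iterrows():
            if row['low'] <= stop:            # stop hit first
                profit[i] = row['low'] - entry
                break
            if row['high'] >= target:         # target hit first
                profit[i] = row['high'] - entry
                break
    return profit

df['profit'] = first_hit(df)
This is O(n) per trigger in the worst case, but it makes the "which comes first" logic explicit, which a fixed rolling window cannot express.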

Historical Volatility from Prices of many different bonds in same column

I have a csv file with bid/ask prices of many bonds (using ISIN identifiers) for the past 1 year. Using these historical prices, I'm trying to calculate the historical volatility for each bond. Although it should typically be an easy task, the issue is that not all bonds have exactly the same number of days of trading price data, while they're all in the same column and not stacked. Hence, if I need to calculate a rolling standard deviation, I can't choose a standard rolling window of 252 days for 1 year.
The data set has this format:
BusinessDate    ISIN     Bid    Ask
Date 1          ISIN1    P1     P2
Date 2          ISIN1    P1     P2
Date 252        ISIN1    P1     P2
Date 1          ISIN2    P1     P2
Date 2          ISIN2    P1     P2
......
& so on.
My current code is as follows-
vol_df = pd.read_csv('hist_prices.csv')
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df['hist_vol'] = vol_df['log_return'].std() * np.sqrt(252)
The last line of code seems to be giving all NaN values in the column. This is most likely because the operation for calculating the std deviation is happening on the same row number and not for a list of numbers. I tried replacing the last line to use rolling_std-
vol_df.set_index('BusinessDate').groupby('ISIN').rolling(window = 1, freq = 'A').std()['log_return']
But this doesn't help either. It gives 2 numbers for each ISIN. I also tried to use pivot() to place the ISINs in columns and BusinessDate as index, and the Prices as "values". But it gives an error. Also I've close to 9,000 different ISINs and hence putting them in columns to calculate std() for each column may not be the best way. Any clues on how I can sort this out?
I was able to resolve this in a crude way-
vol_df_2 = vol_df.groupby('ISIN')['logret'].std()
vol_df_3 = vol_df_2.to_frame()
vol_df_3.rename(columns = {'logret': 'daily_std'}, inplace = True)
The first line above was returning a series and the std deviation column named as 'logret'. So the 2nd and 3rd line of code converts it into a dataframe and renames the daily std deviation as such. And finally the annual vol can be calculated using sqrt(252).
If anyone has a better way to do it in the same dataframe instead of creating a series, that'd be great.
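For what it's worth, one way to keep everything in the same dataframe (a small sketch, assuming the log-return column is named 'log_return' as in the earlier snippet) is groupby().transform(), which broadcasts each ISIN's standard deviation back onto its own rows:
import numpy as np

# per-ISIN daily std, aligned back to every row of that ISIN
vol_df['daily_std'] = vol_df.groupby('ISIN')['log_return'].transform('std')
vol_df['hist_vol'] = vol_df['daily_std'] * np.sqrt(252)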
OK, this almost works now.
It does need some math per ISIN to figure out the rolling period; I just used 3 and 2 in my example. You probably need to count how many days of trading there are in the year, or whatever, and fix it at that per ISIN somehow.
And then you need to figure out how to merge the data back. The output actually has errors because it's updating a copy, but that is kind of what I was looking for here. I am sure someone who knows more could fix it at this point. I can't get the merge working.
import numpy as np
import pandas as pd

toy_data = {'BusinessDate': ['10/5/2020','10/6/2020','10/7/2020','10/8/2020','10/9/2020',
                             '10/12/2020','10/13/2020','10/14/2020','10/15/2020','10/16/2020',
                             '10/5/2020','10/6/2020','10/7/2020','10/8/2020'],
            'ISIN': [1,1,1,1,1, 1,1,1,1,1, 2,2,2,2],
            'Bid': [0.295,0.295,0.295,0.295,0.295,
                    0.296,0.296,0.297,0.298,0.3,
                    2.5,2.6,2.71,2.8],
            'Ask': [0.301,0.305,0.306,0.307,0.308,
                    0.315,0.326,0.337,0.348,0.37,
                    2.8,2.7,2.77,2.82]}

#vol_df = pd.read_csv('hist_prices.csv')
vol_df = pd.DataFrame(toy_data)
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df.dropna(subset = ['log_return'], inplace=True)

# do some math here to calculate how many days you want to roll for an ISIN
# maybe count how many days over a 1 year period exist???
# not really sure how you'd miss days unless stuff just doesn't trade
# (but I don't need to understand it anyway)
rolling = {1: 3, 2: 2}

for isin in vol_df['ISIN'].unique():
    roll = rolling[isin]
    print(f'isin={isin}, roll={roll}')
    df_single = vol_df[vol_df['ISIN'] == isin]
    df_single['rolling'] = df_single['log_return'].rolling(roll).std()
    # I can't get the right syntax to merge the data back, but this shows it
    vol_df[isin, 'rolling'] = df_single['rolling']
    print(df_single)

print(vol_df)
which outputs (minus the warnings):
isin=1, roll=3
BusinessDate ISIN Bid Ask Mid Price log_return rolling
1 2020-10-06 1 0.295 0.305 0.3000 0.006689 NaN
2 2020-10-07 1 0.295 0.306 0.3005 0.001665 NaN
3 2020-10-08 1 0.295 0.307 0.3010 0.001663 0.002901
4 2020-10-09 1 0.295 0.308 0.3015 0.001660 0.000003
5 2020-10-12 1 0.296 0.315 0.3055 0.013180 0.006650
6 2020-10-13 1 0.296 0.326 0.3110 0.017843 0.008330
7 2020-10-14 1 0.297 0.337 0.3170 0.019109 0.003123
8 2020-10-15 1 0.298 0.348 0.3230 0.018751 0.000652
9 2020-10-16 1 0.300 0.370 0.3350 0.036478 0.010133
isin=2, roll=2
BusinessDate ISIN Bid ... log_return (1, rolling) rolling
11 2020-10-06 2 2.60 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.71 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.80 ... 2.522656e-02 NaN 0.005778
[3 rows x 8 columns]
BusinessDate ISIN Bid ... log_return (1, rolling) (2, rolling)
1 2020-10-06 1 0.295 ... 6.688988e-03 NaN NaN
2 2020-10-07 1 0.295 ... 1.665279e-03 NaN NaN
3 2020-10-08 1 0.295 ... 1.662511e-03 0.002901 NaN
4 2020-10-09 1 0.295 ... 1.659751e-03 0.000003 NaN
5 2020-10-12 1 0.296 ... 1.317976e-02 0.006650 NaN
6 2020-10-13 1 0.296 ... 1.784313e-02 0.008330 NaN
7 2020-10-14 1 0.297 ... 1.910886e-02 0.003123 NaN
8 2020-10-15 1 0.298 ... 1.875055e-02 0.000652 NaN
9 2020-10-16 1 0.300 ... 3.647821e-02 0.010133 NaN
11 2020-10-06 2 2.600 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.710 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.800 ... 2.522656e-02 NaN 0.005778
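A possible way to finish the merge that the loop above couldn't (a sketch under the same toy setup, reusing the hard-coded per-ISIN window sizes): compute the rolling std per ISIN group and let index alignment stitch the pieces back onto the original dataframe, with no copy warnings.
import pandas as pd

rolling = {1: 3, 2: 2}   # per-ISIN window sizes, same toy values as above

# each group's rolling std keeps its original row index,
# so concatenating and assigning aligns back onto vol_df
parts = [
    grp['log_return'].rolling(rolling[isin]).std()
    for isin, grp in vol_df.groupby('ISIN')
]
vol_df['rolling_std'] = pd.concat(parts)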

Python - Find percent change for previous 7-day period's average

I have time-series data in a dataframe. Is there any way to calculate for each day the percent change of that day's value from the average of the previous 7 days?
I have tried
df['Change'] = df['Column'].pct_change(periods=7)
However, this simply finds the difference between t and t-7 days. I need something like:
For each value of Ti, find the average of the previous 7 days, and subtract from Ti
Sure, you can for example use:
s = df['Column']
n = 7
mean = s.rolling(n, closed='left').mean()
df['Change'] = (s - mean) / mean
Note on closed='left'
There was a bug prior to pandas=1.2.0 that caused incorrect handling of closed for fixed windows. Make sure you have pandas>=1.2.0; for example, pandas=1.1.3 will not give the result below.
As described in the docs:
closed: Make the interval closed on the ‘right’, ‘left’, ‘both’ or ‘neither’ endpoints. Defaults to ‘right’.
A simple way to understand is to try with some very simple data and a small window:
a = pd.DataFrame(range(5), index=pd.date_range('2020', periods=5))
b = a.assign(
    sum_left=a.rolling(2, closed='left').sum(),
    sum_right=a.rolling(2, closed='right').sum(),
    sum_both=a.rolling(2, closed='both').sum(),
    sum_neither=a.rolling(2, closed='neither').sum(),
)
>>> b
0 sum_left sum_right sum_both sum_neither
2020-01-01 0 NaN NaN NaN NaN
2020-01-02 1 NaN 1.0 1.0 NaN
2020-01-03 2 1.0 3.0 3.0 NaN
2020-01-04 3 3.0 5.0 6.0 NaN
2020-01-05 4 5.0 7.0 9.0 NaN

Pandas - Calculate rolling cumulative product with variable window size

I have some financial time series data that I would like to calculate the rolling cumulative product with a variable window size.
What I am trying to accomplish is using the following formula but instead of having window fixed at 12, I would like to use the value stored in the last column of the dataframe labeled 'labels_y' which will change over time.
df= (1 + df).rolling(window=12).apply(np.prod, raw=True) - 1
A sample of the data:
Out[102]:
div_yield earn_variab growth ... value volatility labels_y
date ...
2004-02-23 -0.001847 0.003252 -0.001264 ... 0.004368 -0.004490 2.0
2004-02-24 -0.001668 0.007404 0.002108 ... -0.006122 0.008183 2.0
2004-02-25 -0.003272 0.004596 0.001283 ... -0.002057 0.005912 3.0
2004-02-26 0.001818 -0.003397 -0.003190 ... 0.001327 -0.003908 3.0
2004-02-27 -0.002838 0.009879 0.000808 ... 0.000350 0.010557 3.0
[5 rows x 11 columns]
and the final result should look like:
Out[104]:
div_yield earn_variab growth ... value volatility labels_y
date ...
2004-02-23 NaN NaN NaN ... NaN NaN NaN
2004-02-24 -0.003512 0.010680 0.000841 ... -0.001781 0.003656 8.0
2004-02-25 -0.006773 0.015325 0.002125 ... -0.003834 0.009589 35.0
2004-02-26 -0.003126 0.008596 0.000193 ... -0.006851 0.010180 47.0
2004-02-27 -0.004294 0.011075 -0.001104 ... -0.000383 0.012559 63.0
[5 rows x 11 columns]
Rows 1 and 2 are calculated with a 2-day rolling window, and rows 3, 4 and 5 use a 3-day window.
I have tried using
def get_window(row):
    return (1 + row).rolling(window=int(row['labels_y'])).apply(np.prod, raw=True) - 1

df = df.apply(get_window, axis=1)
I realize that calculates the cumulative product in the wrong direction. I am struggling with how to get this to work.
Any help would be hugely appreciated.
Thanks
def get_window(row, df):
    return (1 + df).rolling(window=int(row['labels_y'])).apply(np.prod, raw=True).loc[row.name] - 1

result = df1.apply(get_window, axis=1, df=df1)
Does this do the trick? Highly inefficient, but I don't see another way except for tedious for-loops.
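If the repeated full-frame rolling becomes too slow, a plain loop is another option. Below is a minimal sketch (assuming the same column layout as the question, with the window size read from 'labels_y'): for each row it compounds the previous int(labels_y) rows, leaving NaN where there is not enough history.
import numpy as np
import pandas as pd

def variable_window_cumprod(df):
    out = pd.DataFrame(np.nan, index=df.index, columns=df.columns)
    growth = 1 + df
    for i, (idx, row) in enumerate(df.iterrows()):
        n = int(row['labels_y'])                     # window size for this row
        if i + 1 >= n:                               # enough history available
            window = growth.iloc[i + 1 - n : i + 1]  # last n rows incl. current
            out.loc[idx] = window.prod() - 1
    return out

result = variable_window_cumprod(df)
For the sample above this reproduces, e.g., -0.003512 for div_yield on 2004-02-24 (a 2-day window) and -0.006773 on 2004-02-25 (a 3-day window).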

Group by multiple time units in pandas data frame

I have a data frame that consists of a time series data with 15-second intervals:
date_time value
2012-12-28 11:11:00 103.2
2012-12-28 11:11:15 103.1
2012-12-28 11:11:30 103.4
2012-12-28 11:11:45 103.5
2012-12-28 11:12:00 103.3
The data spans many years. I would like to group by both year and time to look at the distribution of time-of-day effect over many years. For example, I may want to compute the mean and standard deviation of every 15-second interval across days, and look at how the means and standard deviations change from 2010, 2011, 2012, etc. I naively tried data.groupby(lambda x: [x.year, x.time]) but it didn't work. How can I do such grouping?
In case date_time is not your index, a date_time-indexed DataFrame could be created with:
dfts = df.set_index('date_time')
From there you can group by intervals using
dfts.groupby(lambda x : x.month).mean()
to see mean values for each month. Similarly, you can do
dfts.groupby(lambda x : x.year).std()
for standard deviations across the years.
If I understood the example task you would like to achieve, you could simply split the data into years using xs, group them and concatenate the results and store this in a new DataFrame.
years = range(2012, 2015)
yearly_month_stats = [dfts.xs(str(year)).groupby(lambda x : x.month).mean() for year in years]
df2 = pd.concat(yearly_month_stats, axis=1, keys = years)
From which you get something like
2012 2013 2014
value value value
1 NaN 5.324165 15.747767
2 NaN -23.193429 9.193217
3 NaN -14.144287 23.896030
4 NaN -21.877975 16.310195
5 NaN -3.079910 -6.093905
6 NaN -2.106847 -23.253183
7 NaN 10.644636 6.542562
8 NaN -9.763087 14.335956
9 NaN -3.529646 2.607973
10 NaN -18.633832 0.083575
11 NaN 10.297902 14.059286
12 33.95442 13.692435 22.293245
You were close:
data.groupby([lambda x: x.year, lambda x: x.time])
Also be sure to set date_time as the index, as in kermit666's answer
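A lambda-free variant of the same idea (a small sketch, assuming date_time is the index as in dfts above): group directly on the index's year and time-of-day components and aggregate both statistics at once.
# mean and std of each 15-second time slot, per year
summary = dfts.groupby([dfts.index.year, dfts.index.time])['value'].agg(['mean', 'std'])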
