I'm trying to implement the Kaufman Efficiency Ratio (ER) in Python with Pandas.
In a Pandas DataFrame, I have two columns:
Date
Closing Price of a stock (the German DAX index, ^GDAXI, in this example):
Date Close
2016-01-05 10310.10
2016-01-06 10214.02
2016-01-07 9979.85
2016-01-08 9849.34
2016-01-11 9825.07
2016-01-12 9985.43
2016-01-13 9960.96
2016-01-14 9794.20
What I need is a third column that includes the ER for a given period n.
Definition of the ER:
ER = Direction / Volatility
Where:
Direction = ABS(Close – Close[n])  (the net change over the last n periods)
Volatility = the sum over the last n periods of ABS(Close – Close[1])  (the sum of the absolute one-period changes)
n = the efficiency ratio period.
An example of an n=3 period ER can be found at http://etfhq.com/blog/2011/02/07/kaufmans-efficiency-ratio/.
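As a quick worked check against the data above (my own arithmetic, not taken from that page): for 2016-01-08 with n=3, Direction = ABS(9849.34 – 10310.10) = 460.76, Volatility = 96.08 + 234.17 + 130.51 = 460.76, so ER = 460.76 / 460.76 = 1.0.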
What I'm struggling with is how to do this in Python with Pandas.
In the end, my dataframe should look like this, according to the calculation above:
Date Adj Close ER(3)
2016-01-04 10283.44
2016-01-05 10310.10
2016-01-06 10214.02
2016-01-07 9979.85 0.9
2016-01-08 9849.34 1.0
2016-01-11 9825.07 1.0
2016-01-12 9985.43 0.0
2016-01-13 9960.96 0.5
2016-01-14 9794.20 0.1
How do I make Pandas look back at the previous n rows for the calculation needed for the ER?
Any help is greatly appreciated!
Thank you in advance.
Dirk
No need to write a custom rolling function, just use diff and a rolling sum:
df['direction'] = df['Close'].diff(3).abs()
df['volatility'] = df['Close'].diff().abs().rolling(3).sum()
I think the code is pretty much self-explanatory. Please let me know if you would like explanations.
In [11]: df['direction'] / df['volatility']
Out[11]:
0 NaN
1 NaN
2 NaN
3 1.000000
4 1.000000
5 0.017706
6 0.533812
7 0.087801
dtype: float64
This looks like what you're looking for.
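For completeness, here is a self-contained sketch of the same calculation with the current pandas API (column names follow the question; treat it as a sketch rather than the canonical implementation):
import pandas as pd

def efficiency_ratio(close: pd.Series, n: int = 3) -> pd.Series:
    # Direction: absolute net change over the last n periods
    direction = close.diff(n).abs()
    # Volatility: sum of the absolute one-period changes over the same n periods
    volatility = close.diff().abs().rolling(n).sum()
    return direction / volatility

df['ER(3)'] = efficiency_ratio(df['Close'], n=3)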
I have a csv file with bid/ask prices of many bonds (using ISIN identifiers) for the past 1 yr. Using these historical prices, I'm trying to calculate the historical volatility for each bond. Although it should typically be an easy task, the issue is that not all bonds have exactly the same number of days of trading price data, and they are all in the same column rather than stacked. Hence, if I need to calculate a rolling std deviation, I can't choose a standard rolling window of 252 days for 1 yr.
The data set has this format:
BusinessDate    ISIN     Bid    Ask
Date 1          ISIN1    P1     P2
Date 2          ISIN1    P1     P2
...
Date 252        ISIN1    P1     P2
Date 1          ISIN2    P1     P2
Date 2          ISIN2    P1     P2
...
& so on.
My current code is as follows:
import numpy as np
import pandas as pd

vol_df = pd.read_csv('hist_prices.csv')
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis=1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df['hist_vol'] = vol_df['log_return'].std() * np.sqrt(252)
The last line of code seems to be giving all NaN values in the column. This is most likely because the std deviation is being calculated on a single row rather than over a list of numbers. I tried replacing the last line with a rolling std:
vol_df.set_index('BusinessDate').groupby('ISIN').rolling(window = 1, freq = 'A').std()['log_return']
But this doesn't help either; it gives 2 numbers for each ISIN. I also tried to use pivot() to place the ISINs in columns, BusinessDate as index, and the prices as values, but it gives an error. Also, I have close to 9,000 different ISINs, so putting them in columns to calculate std() for each column may not be the best way. Any clues on how I can sort this out?
I was able to resolve this in a crude way:
vol_df_2 = vol_df.groupby('ISIN')['logret'].std()
vol_df_3 = vol_df_2.to_frame()
vol_df_3.rename(columns = {'logret': 'daily_std'}, inplace = True)
The first line above returns a Series of std deviations, still named 'logret', so the 2nd and 3rd lines convert it into a dataframe and rename the column to daily_std. Finally, the annual vol can be calculated using sqrt(252).
If anyone has a better way to do it in the same dataframe instead of creating a series, that'd be great.
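One way to keep everything in the same dataframe, as a sketch (assuming the log-return column is named 'log_return' as in the original code, rather than 'logret'), is to broadcast the per-ISIN std back onto every row with transform:
import numpy as np

vol_df['daily_std'] = vol_df.groupby('ISIN')['log_return'].transform('std')
vol_df['hist_vol'] = vol_df['daily_std'] * np.sqrt(252)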
OK, this almost works now.
It does need some math per ISIN to figure out the rolling period; I just used 3 and 2 in my example. You would probably need to count how many trading days there are in the year, or whatever, and fix the window at that per ISIN somehow.
Then you need to figure out how to merge the data back. The output actually has errors because it is updating a copy, but it is roughly what I was looking for here. I am sure someone who knows more could fix it at this point; I can't get the merge working.
import numpy as np
import pandas as pd

toy_data = {'BusinessDate': ['10/5/2020','10/6/2020','10/7/2020','10/8/2020','10/9/2020',
                             '10/12/2020','10/13/2020','10/14/2020','10/15/2020','10/16/2020',
                             '10/5/2020','10/6/2020','10/7/2020','10/8/2020'],
            'ISIN': [1,1,1,1,1, 1,1,1,1,1, 2,2,2,2],
            'Bid': [0.295,0.295,0.295,0.295,0.295,
                    0.296,0.296,0.297,0.298,0.3,
                    2.5,2.6,2.71,2.8],
            'Ask': [0.301,0.305,0.306,0.307,0.308,
                    0.315,0.326,0.337,0.348,0.37,
                    2.8,2.7,2.77,2.82]}

# vol_df = pd.read_csv('hist_prices.csv')
vol_df = pd.DataFrame(toy_data)
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis=1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df.dropna(subset=['log_return'], inplace=True)

# do some math here to calculate how many days you want to roll for an ISIN
# maybe count how many days over a 1-year period exist???
# not really sure how you'd miss days unless stuff just doesn't trade
# (but I don't need to understand it anyway)
rolling = {1: 3, 2: 2}

for isin in vol_df['ISIN'].unique():
    roll = rolling[isin]
    print(f'isin={isin}, roll={roll}')
    df_single = vol_df[vol_df['ISIN'] == isin]
    df_single['rolling'] = df_single['log_return'].rolling(roll).std()
    # I can't get the right syntax to merge the data back, but this shows it
    vol_df[isin, 'rolling'] = df_single['rolling']
    print(df_single)

print(vol_df)
which outputs (minus the warnings):
isin=1, roll=3
BusinessDate ISIN Bid Ask Mid Price log_return rolling
1 2020-10-06 1 0.295 0.305 0.3000 0.006689 NaN
2 2020-10-07 1 0.295 0.306 0.3005 0.001665 NaN
3 2020-10-08 1 0.295 0.307 0.3010 0.001663 0.002901
4 2020-10-09 1 0.295 0.308 0.3015 0.001660 0.000003
5 2020-10-12 1 0.296 0.315 0.3055 0.013180 0.006650
6 2020-10-13 1 0.296 0.326 0.3110 0.017843 0.008330
7 2020-10-14 1 0.297 0.337 0.3170 0.019109 0.003123
8 2020-10-15 1 0.298 0.348 0.3230 0.018751 0.000652
9 2020-10-16 1 0.300 0.370 0.3350 0.036478 0.010133
isin=2, roll=2
BusinessDate ISIN Bid ... log_return (1, rolling) rolling
11 2020-10-06 2 2.60 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.71 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.80 ... 2.522656e-02 NaN 0.005778
[3 rows x 8 columns]
BusinessDate ISIN Bid ... log_return (1, rolling) (2, rolling)
1 2020-10-06 1 0.295 ... 6.688988e-03 NaN NaN
2 2020-10-07 1 0.295 ... 1.665279e-03 NaN NaN
3 2020-10-08 1 0.295 ... 1.662511e-03 0.002901 NaN
4 2020-10-09 1 0.295 ... 1.659751e-03 0.000003 NaN
5 2020-10-12 1 0.296 ... 1.317976e-02 0.006650 NaN
6 2020-10-13 1 0.296 ... 1.784313e-02 0.008330 NaN
7 2020-10-14 1 0.297 ... 1.910886e-02 0.003123 NaN
8 2020-10-15 1 0.298 ... 1.875055e-02 0.000652 NaN
9 2020-10-16 1 0.300 ... 3.647821e-02 0.010133 NaN
11 2020-10-06 2 2.600 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.710 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.800 ... 2.522656e-02 NaN 0.005778
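For what it's worth, here is a sketch of one way the merge-back could be done instead of the loop (my own suggestion, reusing the rolling dict of per-ISIN window sizes defined above). A groupby-apply that returns a Series already aligned to the original index can be assigned directly:
def per_isin_rolling_std(grp):
    # grp.name is the ISIN of this group; look up its window size
    window = rolling[grp.name]
    return grp['log_return'].rolling(window).std()

vol_df['rolling'] = vol_df.groupby('ISIN', group_keys=False).apply(per_isin_rolling_std)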
I have time-series data in a dataframe. Is there any way to calculate for each day the percent change of that day's value from the average of the previous 7 days?
I have tried
df['Change'] = df['Column'].pct_change(periods=7)
However, this simply finds the percent change between t and t-7 days. I need something like:
For each value of Ti, find the average of the previous 7 days, and subtract from Ti
Sure, you can for example use:
s = df['Column']
n = 7
mean = s.rolling(n, closed='left').mean()
df['Change'] = (s - mean) / mean
Note on closed='left'
There was a bug prior to pandas=1.2.0 that caused incorrect handling of closed for fixed windows. Make sure you have pandas>=1.2.0; for example, pandas=1.1.3 will not give the result below.
As described in the docs:
closed: Make the interval closed on the ‘right’, ‘left’, ‘both’ or ‘neither’ endpoints. Defaults to ‘right’.
A simple way to understand is to try with some very simple data and a small window:
a = pd.DataFrame(range(5), index=pd.date_range('2020', periods=5))
b = a.assign(
sum_left=a.rolling(2, closed='left').sum(),
sum_right=a.rolling(2, closed='right').sum(),
sum_both=a.rolling(2, closed='both').sum(),
sum_neither=a.rolling(2, closed='neither').sum(),
)
>>> b
0 sum_left sum_right sum_both sum_neither
2020-01-01 0 NaN NaN NaN NaN
2020-01-02 1 NaN 1.0 1.0 NaN
2020-01-03 2 1.0 3.0 3.0 NaN
2020-01-04 3 3.0 5.0 6.0 NaN
2020-01-05 4 5.0 7.0 9.0 NaN
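If upgrading pandas is not an option, an equivalent workaround (my suggestion, not part of the answer above) is to shift the series by one step before rolling, which avoids closed= entirely:
s = df['Column']
mean = s.shift(1).rolling(7).mean()   # mean of the previous 7 values, excluding the current one
df['Change'] = (s - mean) / mean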
I have the following dataframe:
COD CHM DATE
0 5713 0.0 2020-07-16
1 5713 1.0 2020-08-11
2 5713 2.0 2020-06-20
3 5713 3.0 2020-06-19
4 5713 4.0 2020-06-01
... ... ... ...
2135283 73306036 0.0 2020-09-30
2135284 73306055 12.0 2020-09-30
2135285 73306479 9.0 2020-09-30
2135286 73306656 3.0 2020-09-30
2135287 73306676 1.0 2020-09-30
I want to calculate the mean and the standard deviation for each COD throughout the dates (time).
For this, I am doing:
traf_user_chm_med = traf_user_chm_med.groupby(['COD', 'DATE'])['CHM'].sum().reset_index()
dates = pd.date_range(start=traf_user_chm_med.DATE.min(), end=traf_user_chm_med.DATE.max(), freq='MS', closed='left').sort_values(ascending=False)
clients = traf_user_chm_med['COD'].unique()
idx = pd.MultiIndex.from_product((clients, dates), names=['COD', 'DATE'])
M0 = pd.to_datetime('2020-08')
M1 = M0 - pd.DateOffset(month=M0.month-1)
M2 = M0 - pd.DateOffset(month=M0.month-2)
M3 = M0 - pd.DateOffset(month=M0.month-3)
M4 = M0 - pd.DateOffset(month=M0.month-4)
M5 = M0 - pd.DateOffset(month=M0.month-5)

def filter_dates(grp):
    grp.set_index('YEAR_MONTH', inplace=True)
    grp = grp[M0:M5].reset_index()
    return grp

traf_user_chm_med = traf_user_chm_med.groupby('COD').apply(filter_dates)
Not sure why it doesn't work; it returns an empty dataframe.
After this I would unstack to get the activity over the several months and calculate the mean and standard deviation for each COD.
This is a long process, and I'm not sure if there is a faster way to do it that gets me the values I want.
Still, if anyone can help me get this one working, that would be awesome!
If I understand correctly, you simply need this:
df.groupby("COD")["CHM"].agg("std")
As a general principle, there's almost always a "pythonic" way to do these things that's fewer lines and easy to understand!
You can use transform to broadcast your mean and std
...
df['mean'] = df.groupby('COD')['CHM'].transform('mean')
df['std'] = df.groupby('COD')['CHM'].transform('std')
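If a separate summary table is enough, rather than broadcasting the values back onto every row, a one-line sketch along these lines would also work (assuming df holds the COD/CHM columns shown in the question):
stats = df.groupby('COD')['CHM'].agg(['mean', 'std'])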
I have some financial time series data that I would like to calculate the rolling cumulative product with a variable window size.
What I am trying to accomplish is using the following formula, but instead of having the window fixed at 12, I would like to use the value stored in the last column of the dataframe, labeled 'labels_y', which changes over time.
df = (1 + df).rolling(window=12).apply(np.prod, raw=True) - 1
A sample of the data:
Out[102]:
div_yield earn_variab growth ... value volatility labels_y
date ...
2004-02-23 -0.001847 0.003252 -0.001264 ... 0.004368 -0.004490 2.0
2004-02-24 -0.001668 0.007404 0.002108 ... -0.006122 0.008183 2.0
2004-02-25 -0.003272 0.004596 0.001283 ... -0.002057 0.005912 3.0
2004-02-26 0.001818 -0.003397 -0.003190 ... 0.001327 -0.003908 3.0
2004-02-27 -0.002838 0.009879 0.000808 ... 0.000350 0.010557 3.0
[5 rows x 11 columns]
and the final result should look like:
Out[104]:
div_yield earn_variab growth ... value volatility labels_y
date ...
2004-02-23 NaN NaN NaN ... NaN NaN NaN
2004-02-24 -0.003512 0.010680 0.000841 ... -0.001781 0.003656 8.0
2004-02-25 -0.006773 0.015325 0.002125 ... -0.003834 0.009589 35.0
2004-02-26 -0.003126 0.008596 0.000193 ... -0.006851 0.010180 47.0
2004-02-27 -0.004294 0.011075 -0.001104 ... -0.000383 0.012559 63.0
[5 rows x 11 columns]
Rows 1 and 2 are calculated with a 2-day rolling window, and rows 3, 4 and 5 use a 3-day window.
I have tried using
def get_window(row):
    return (1 + row).rolling(window=int(row['labels_y'])).apply(np.prod, raw=True) - 1

df = df.apply(get_window, axis=1)
I realize that this calculates the cumulative product in the wrong direction. I am struggling with how to get this to work.
Any help would be hugely appreciated.
Thanks
def get_window(row, df):
    return (1 + df).rolling(window=int(row['labels_y'])).apply(np.prod, raw=True).loc[row.name] - 1

result = df1.apply(get_window, axis=1, df=df1)
Does this do the trick? Highly inefficient, but I don't see another way except for tedious for-loops.
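For larger frames, a plain loop may actually be faster, since each row's trailing product is computed only once instead of re-rolling the whole frame per row. A rough sketch of my own (assuming, as in the question, that 'labels_y' holds the window length for each row):
import numpy as np
import pandas as pd

def variable_window_cumprod(df, window_col='labels_y'):
    out = pd.DataFrame(np.nan, index=df.index, columns=df.columns)
    values = (1 + df).to_numpy(dtype=float)
    windows = df[window_col].astype(int).to_numpy()
    for i, w in enumerate(windows):
        if i + 1 >= w:
            # product over the trailing window of length w ending at (and including) row i
            out.iloc[i] = values[i + 1 - w : i + 1].prod(axis=0) - 1
    return out

result = variable_window_cumprod(df)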
I have a dataset of 4 attributes like:
taxi id date time longitude latitude
0 1 2/2/2008 15:36 116.51 39.92
1 1 2/2/2008 15:46 116.51 39.93
2 1 2/2/2008 15:56 116.51 39.91
3 1 2/2/2008 16:06 116.47 39.91
4 1 2/2/2008 16:16 116.47 39.92
datatype of each attribute is as follows:
taxi id dtype('int64')
date time dtype('O')
longitude dtype('float64')
latitude dtype('float64')
I want to calculate the mean and standard deviation (std) for each attribute.
For the mean I have tried this code:
np.mean('longitude')
but it gives me error like:
TypeError: cannot perform reduce with flexible type
You can use pandas describe:
df.describe()
Out[878]:
taxi id longitude latitude
count 5.000000 5.0 5.000000 5.000000
mean 2.000000 1.0 116.494000 39.918000
std 1.581139 0.0 0.021909 0.008367
min 0.000000 1.0 116.470000 39.910000
25% 1.000000 1.0 116.470000 39.910000
50% 2.000000 1.0 116.510000 39.920000
75% 3.000000 1.0 116.510000 39.920000
max 4.000000 1.0 116.510000 39.930000
You have to specify that you're looking for the mean of your dataframe. As it is, you're not referencing your dataframe at all when you call numpy.mean().
If your dataframe is called df, using pandas.Series.mean should work, like this:
df['longitude'].mean()
df['longitude'].std()
As it is, you're calling numpy.mean() on a string, which doesn't mean much. If you really wanted to use numpy.mean(), you could use np.mean(df['longitude'])
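To get both statistics for every numeric attribute at once, a small sketch using agg (with the column names from the question) would be:
df[['longitude', 'latitude']].agg(['mean', 'std'])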