I have a CSV file with bid/ask prices of many bonds (identified by ISIN) for the past year. Using these historical prices, I'm trying to calculate the historical volatility of each bond. Although this should typically be an easy task, the issue is that not all bonds have exactly the same number of days of trading price data, while they're all in the same column rather than stacked side by side. Hence, if I need to calculate a rolling standard deviation, I can't choose a standard rolling window of 252 days for one year.
The data set has this format:

BusinessDate   ISIN    Bid   Ask
Date 1         ISIN1   P1    P2
Date 2         ISIN1   P1    P2
...
Date 252       ISIN1   P1    P2
Date 1         ISIN2   P1    P2
Date 2         ISIN2   P1    P2
......

& so on.
My current code is as follows:

import pandas as pd
import numpy as np

vol_df = pd.read_csv('hist_prices.csv')
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis=1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df['hist_vol'] = vol_df['log_return'].std() * np.sqrt(252)
The last line of code seems to be giving all NaN values in the column. This is most likely because the standard deviation is being calculated on the same row number rather than over a list of numbers. I tried replacing the last line with a rolling std:
vol_df.set_index('BusinessDate').groupby('ISIN').rolling(window = 1, freq = 'A').std()['log_return']
But this doesn't help either. It gives 2 numbers for each ISIN. I also tried to use pivot() to place the ISINs in columns, BusinessDate as the index, and the prices as values, but it gives an error. Also, I have close to 9,000 different ISINs, so putting them in columns and calculating std() for each column may not be the best way. Any clues on how I can sort this out?
I was able to resolve this in a crude way:

vol_df_2 = vol_df.groupby('ISIN')['log_return'].std()
vol_df_3 = vol_df_2.to_frame()
vol_df_3.rename(columns={'log_return': 'daily_std'}, inplace=True)

The first line returns a Series with the standard deviation column still named 'log_return', so the second and third lines convert it into a dataframe and rename the daily standard deviation column accordingly. And finally the annual vol can be calculated by multiplying by sqrt(252).
If anyone has a better way to do it in the same dataframe instead of creating a series, that'd be great.
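One way to keep everything in the same dataframe (a sketch, assuming the log_return column computed above) is groupby().transform, which broadcasts each group's statistic back onto its own rows:

# Per-ISIN daily std on every row, then annualize in place
vol_df['daily_std'] = vol_df.groupby('ISIN')['log_return'].transform('std')
vol_df['hist_vol'] = vol_df['daily_std'] * np.sqrt(252)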
OK, this almost works now.
It does need some math per ISIN to figure out the rolling period; I just used 3 and 2 in my example. You'd probably need to count how many days of trading there are in the year, or whatever, and fix the window at that per ISIN somehow.
And then you need to figure out how to merge the data back. The output actually has errors because it's updating a copy, but that is kind of what I was looking for here. I'm sure someone who knows more could fix it at this point; I can't get the merge working.
toy_data = {'BusinessDate': ['10/5/2020', '10/6/2020', '10/7/2020', '10/8/2020', '10/9/2020',
                             '10/12/2020', '10/13/2020', '10/14/2020', '10/15/2020', '10/16/2020',
                             '10/5/2020', '10/6/2020', '10/7/2020', '10/8/2020'],
            'ISIN': [1,1,1,1,1, 1,1,1,1,1, 2,2,2,2],
            'Bid': [0.295, 0.295, 0.295, 0.295, 0.295,
                    0.296, 0.296, 0.297, 0.298, 0.3,
                    2.5, 2.6, 2.71, 2.8],
            'Ask': [0.301, 0.305, 0.306, 0.307, 0.308,
                    0.315, 0.326, 0.337, 0.348, 0.37,
                    2.8, 2.7, 2.77, 2.82]}

#vol_df = pd.read_csv('hist_prices.csv')
vol_df = pd.DataFrame(toy_data)
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis=1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df.dropna(subset=['log_return'], inplace=True)

# do some math here to calculate how many days you want to roll for an ISIN
# maybe count how many days over a 1 year period exist???
# not really sure how you'd miss days unless stuff just doesn't trade
# (but I don't need to understand it anyway)
rolling = {1: 3, 2: 2}

for isin in vol_df['ISIN'].unique():
    roll = rolling[isin]
    print(f'isin={isin}, roll={roll}')
    df_single = vol_df[vol_df['ISIN'] == isin]
    df_single['rolling'] = df_single['log_return'].rolling(roll).std()
    # I can't get the right syntax to merge the data back, but this shows it
    vol_df[isin, 'rolling'] = df_single['rolling']
    print(df_single)

print(vol_df)
which outputs (minus the warnings):
isin=1, roll=3
BusinessDate ISIN Bid Ask Mid Price log_return rolling
1 2020-10-06 1 0.295 0.305 0.3000 0.006689 NaN
2 2020-10-07 1 0.295 0.306 0.3005 0.001665 NaN
3 2020-10-08 1 0.295 0.307 0.3010 0.001663 0.002901
4 2020-10-09 1 0.295 0.308 0.3015 0.001660 0.000003
5 2020-10-12 1 0.296 0.315 0.3055 0.013180 0.006650
6 2020-10-13 1 0.296 0.326 0.3110 0.017843 0.008330
7 2020-10-14 1 0.297 0.337 0.3170 0.019109 0.003123
8 2020-10-15 1 0.298 0.348 0.3230 0.018751 0.000652
9 2020-10-16 1 0.300 0.370 0.3350 0.036478 0.010133
isin=2, roll=2
BusinessDate ISIN Bid ... log_return (1, rolling) rolling
11 2020-10-06 2 2.60 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.71 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.80 ... 2.522656e-02 NaN 0.005778
[3 rows x 8 columns]
BusinessDate ISIN Bid ... log_return (1, rolling) (2, rolling)
1 2020-10-06 1 0.295 ... 6.688988e-03 NaN NaN
2 2020-10-07 1 0.295 ... 1.665279e-03 NaN NaN
3 2020-10-08 1 0.295 ... 1.662511e-03 0.002901 NaN
4 2020-10-09 1 0.295 ... 1.659751e-03 0.000003 NaN
5 2020-10-12 1 0.296 ... 1.317976e-02 0.006650 NaN
6 2020-10-13 1 0.296 ... 1.784313e-02 0.008330 NaN
7 2020-10-14 1 0.297 ... 1.910886e-02 0.003123 NaN
8 2020-10-15 1 0.298 ... 1.875055e-02 0.000652 NaN
9 2020-10-16 1 0.300 ... 3.647821e-02 0.010133 NaN
11 2020-10-06 2 2.600 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.710 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.800 ... 2.522656e-02 NaN 0.005778
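One way to write the merge back without the copy warning (a sketch reusing the rolling dict from the code above) is to assign through .loc with a per-ISIN boolean mask:

# Assign each ISIN's rolling std straight into vol_df; .loc avoids the
# chained-assignment copy problem and index alignment does the merge
for isin, window in rolling.items():
    mask = vol_df['ISIN'] == isin
    vol_df.loc[mask, 'rolling'] = vol_df.loc[mask, 'log_return'].rolling(window).std()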
I have data of the following format:
station_number date river_height river_flow
0 1 2005-01-01 08:09:00 0.285233 0.782065
1 1 2005-01-01 11:28:12 0.129994 0.386652
2 4 2005-01-01 17:33:36 0.457168 0.167025
3 2 2005-01-01 23:21:00 0.359086 0.851716
4 4 2005-01-02 04:18:36 0.332998 0.830749
5 1 2005-01-02 09:28:12 0.867262 0.855507
6 3 2005-01-02 13:15:36 0.352409 0.023737
7 2 2005-01-02 17:31:12 0.696562 0.846762
8 1 2005-01-02 21:15:36 0.910944 0.096999
9 4 2005-01-03 02:13:12 0.981430 0.152109
I need to calculate a daily average of the river height and river flow per unique station number, so as a result something like this:
station_number date river_height river_flow
0 1 2005-01-01 0.285 0.782
1 1 2005-01-02 0.233 0.753
2 2 2005-01-01 0.129 0.386
3 2 2005-01-02 0.994 0.386
4 3 2005-01-01 0.457 0.167
5 3 2005-01-02 0.168 0.134
6 4 2005-01-01 0.356 0.321
7 4 2005-01-02 0.086 0.716
Keep in mind that the above numbers are random, and not actually the averages I'm looking for. I need an entry for each day for each station. I hope I have clarified what I need!
I have tried aggregating using groupby such as below:
monthly_flow_data_mean = df.groupby(pd.PeriodIndex(df['date'], freq="M"))['river_flow'].mean()
But this obviously just takes all river_flow measurements without considering the station numbers. I have had trouble finding the combination of groupby and aggregations I need to achieve this properly.
I tried this as well:
daily_flow_df = df.groupby(pd.PeriodIndex(df['date'], freq="D")).agg({"river_flow": "mean", "river_height": "mean", "station_number": "first"})
But I am pretty sure this also doesn't really work, as we are not actually using the station number to aggregate by, but merely choosing how to aggregate it while aggregating all river flow measurements.
I can obviously just split the dataframe into 4 classes, do the aggregation per dataframe, and merge the results back together. But I am wondering if there is some smart little groupby trick that can achieve this in fewer lines, as it will be useful later in my project(s) as well, where I might have far more classes in the data.
You can use either of the following solutions to group by 'station_number' and by day on the 'date' column, using pd.Grouper or dt.normalize:
df.groupby(['station_number', pd.Grouper(key='date', freq='D')]).mean()
or
df.groupby(['station_number', df['date'].dt.normalize()]).mean()
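For instance, a sketch that also flattens the result back into the desired layout (assuming 'date' has already been parsed with pd.to_datetime):

daily_means = (
    df.groupby(['station_number', pd.Grouper(key='date', freq='D')])
      [['river_height', 'river_flow']]
      .mean()
      .reset_index()  # back to plain columns, one row per station and day
)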
For example, let's consider the following dataframe:
Restaurant_ID Floor Cust_Arrival_Datetime
0 100 1 2021-11-17 17:20:00
1 100 1 2021-11-17 17:22:00
2 100 1 2021-11-17 17:25:00
3 100 1 2021-11-17 17:30:00
4 100 1 2021-11-17 17:50:00
5 100 1 2021-11-17 17:51:00
6 100 2 2021-11-17 17:25:00
7 100 2 2021-11-17 18:00:00
8 100 2 2021-11-17 18:50:00
9 100 2 2021-11-17 18:56:00
For the above toy example we can assume that Cust_Arrival_Datetime is sorted as well as grouped by store and floor (as seen above). How could we now calculate things such as the median time interval between customer arrivals for each unique store and floor group?
The desired output would be:
Restaurant_ID Floor Median Arrival Interval(in minutes)
0 100 1 3
1 100 2 35
The Median Arrival Interval is calculated as follows: for the first floor of the store, we can see that by the time the second customer arrives, 2 minutes have already passed since the first one arrived. Similarly, 3 minutes elapse between the 2nd and the 3rd customer, 5 minutes between the 3rd and the 4th, and so on. The median for floor 1 of restaurant 100 is therefore 3.
I have tried something like this:
df.groupby(['Restaurant_ID', 'Floor'].apply(lambda row: row['Customer_Arrival_Datetime'].shift() - row['Customer_Arrival_Datetime']).apply(np.median)
but this does not work!
Any help is welcome!
IIUC, you can do
(df.groupby(['Restaurant_ID', 'Floor'])['Cust_Arrival_Datetime']
.agg(lambda x: x.diff().dt.total_seconds().median()/60))
and you get
Restaurant_ID Floor
100 1 3.0
2 35.0
Name: Cust_Arrival_Datetime, dtype: float64
You can chain with reset_index if needed.
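For example (the output column name here is illustrative):

out = (
    df.groupby(['Restaurant_ID', 'Floor'])['Cust_Arrival_Datetime']
      .agg(lambda x: x.diff().dt.total_seconds().median() / 60)
      .reset_index(name='Median Arrival Interval (minutes)')
)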
Consider the following data frame:
df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2],
    'time': pd.to_datetime(
        ['14:14', '14:17', '14:25', '17:29', '17:40', '17:43']
    )
})
Suppose you'd like to apply a range of transformations:
def stats(group):
    diffs = group.diff().dt.total_seconds() / 60
    return {
        'min': diffs.min(),
        'mean': diffs.mean(),
        'median': diffs.median(),
        'max': diffs.max()
    }
Then you simply have to apply these:
>>> df.groupby('group')['time'].agg(stats).apply(pd.Series)
min mean median max
group
1 3.0 5.5 5.5 8.0
2 3.0 7.0 7.0 11.0
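A sketch of an equivalent approach, assuming the df above (the diff_min column name is illustrative): compute the per-group diffs once as a column, then aggregate with the built-in reducers:

# Minutes between consecutive events within each group
df['diff_min'] = df.groupby('group')['time'].diff().dt.total_seconds() / 60
out = df.groupby('group')['diff_min'].agg(['min', 'mean', 'median', 'max'])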
I have a problem with calculating a function in Python. I want to calculate the IRR for a number of investments, each of which is described in its own dataframe. For each investment I have several dataframes, each describing the flows of payments the investment has made up to a different date, with the last row of each dataframe containing the stock of capital the investment holds at that point. I do this in order to build something like a time series of the IRR for each investment. The dataframes whose IRR I want to calculate are collected in a list.
To calculate the IRR for each dataframe I made these functions:
import numpy as np
from scipy.optimize import fsolve

def npv(irr, cfs, yrs):
    return np.sum(cfs / ((1. + irr) ** yrs))

def irr(cfs, yrs, x0):
    return np.asscalar(fsolve(npv, x0=x0, args=(cfs, yrs)))
So in order to calculate the IRR for each dataframe in my list I did:
for i, new_df in enumerate(dfs):
    cash_flow = new_df.FLOWS.values
    years = new_df.timediff.values
    output.loc[i, ['DATE']] = new_df['DATE'].iloc[-1]
    output.loc[i, ['Investment']] = new_df['Investment'].iloc[-1]
    output.loc[i, ['irr']] = irr(cash_flow, years, x0=0.)
Output is the dataframe I want to create that contains the information I want, i.e. the IRR of each investment up until a certain date. The problem is, it calculates the IRR correctly for some dataframes, but not for others. For example, it calculates the IRR correctly for this dataframe:
DATE INVESTMENT FLOWS timediff
0 2014-02-24 1 -36278400.0 0.0
1 2014-03-25 1 -11490744.0 0.07945205479452055
2 2015-01-22 1 -13244300.0 0.9095890410958904
3 2015-09-24 1 -10811412.0 1.5808219178082192
4 2015-11-12 1 -6208238.0 1.715068493150685
5 2016-01-22 1 -6210161.0 1.9095890410958904
6 2016-03-31 1 -4535569.0 2.0986301369863014
7 2016-05-25 1 8420470.0 2.249315068493151
8 2016-06-30 1 12357138.0 2.347945205479452
9 2016-07-14 1 3498535.0 2.3863013698630136
10 2016-12-26 1 4085285.0 2.8383561643835615
11 2017-06-07 1 3056835.0 3.2849315068493152
12 2017-09-11 1 11254424.0 3.547945205479452
13 2017-11-16 1 9274834.0 3.728767123287671
14 2018-02-22 1 1622857.0 3.9972602739726026
15 2018-05-23 1 2642985.0 4.243835616438356
18 2018-08-23 1 9265099.0 4.495890410958904
16 2018-11-29 1 1011915.0 4.764383561643836
19 2018-12-28 1 1760734.0 4.843835616438356
17 2019-01-14 1 1940112.0 4.890410958904109
20 2019-06-30 1 116957227.3 5.347945205479452
With an IRR of 0.215. But for this dataframe, for the exact same investment, it does not. It returns an IRR of 0.0001, while the real IRR should be around 0.216.
DATE INVESTMENT FLOWS timediff
0 2014-02-24 1 -36278400.0 0.0
1 2014-03-25 1 -11490744.0 0.07945205479452055
2 2015-01-22 1 -13244300.0 0.9095890410958904
3 2015-09-24 1 -10811412.0 1.5808219178082192
4 2015-11-12 1 -6208238.0 1.715068493150685
5 2016-01-22 1 -6210161.0 1.9095890410958904
6 2016-03-31 1 -4535569.0 2.0986301369863014
7 2016-05-25 1 8420470.0 2.249315068493151
8 2016-06-30 1 12357138.0 2.347945205479452
9 2016-07-14 1 3498535.0 2.3863013698630136
10 2016-12-26 1 4085285.0 2.8383561643835615
11 2017-06-07 1 3056835.0 3.2849315068493152
12 2017-09-11 1 11254424.0 3.547945205479452
13 2017-11-16 1 9274834.0 3.728767123287671
14 2018-02-22 1 1622857.0 3.9972602739726026
15 2018-05-23 1 2642985.0 4.243835616438356
18 2018-08-23 1 9265099.0 4.495890410958904
16 2018-11-29 1 1011915.0 4.764383561643836
19 2018-12-28 1 1760734.0 4.843835616438356
17 2019-01-14 1 1940112.0 4.890410958904109
20 2019-09-30 1 123753575.7 5.6
These two dataframes have exactly the same flows except for the last row, which contains the stock of capital up until that date for that investment. So the only difference between the two dataframes is the last row, which means this investment hasn't had any inflows or outflows during that time. I don't understand why the IRR varies so much, or why some IRRs are calculated incorrectly. Most are calculated correctly, but a few are not.
Thanks for helping me.
As I thought, it is a problem with the optimization method.
When I tried your irr function with the second df, I even received a warning:
RuntimeWarning: The iteration is not making good progress, as measured by the
improvement from the last ten iterations.
warnings.warn(msg, RuntimeWarning)
But trying out scipy.optimize.root with other methods seems to work for me. I changed the function to:
import scipy.optimize as optimize

def irr(cfs, yrs, x0):
    r = optimize.root(npv, args=(cfs, yrs), x0=x0, method='broyden1')
    return float(r.x)
I just checked lm and broyden1, and both converged to around 0.216 on your second example. There are multiple methods, and I have no clue which would be the best choice among them, but most seem to be better than the hybr method used in fsolve.
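If you'd rather stay with fsolve, a hedged sketch (assuming the npv from the question; the guesses and tolerance are illustrative) is to sweep several starting points and accept only a converged root whose NPV is small relative to the size of the flows:

import numpy as np
from scipy.optimize import fsolve

def irr_multi_start(cfs, yrs, guesses=(0.0, 0.05, 0.1, 0.2, 0.5)):
    scale = np.sum(np.abs(cfs))  # for a relative residual check
    for x0 in guesses:
        root, info, ier, msg = fsolve(npv, x0=x0, args=(cfs, yrs), full_output=True)
        # ier == 1 means fsolve reports convergence; double-check the residual
        if ier == 1 and abs(npv(root[0], cfs, yrs)) < 1e-8 * scale:
            return float(root[0])
    return float('nan')  # no starting guess produced a credible root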
I have the dataframe shown below. I'm trying to calculate a 13-period exponential moving average, but the results don't match at all. I'm using the following code:

stockdata['sma'] = stockdata['close'].rolling(window=13, min_periods=13).mean()
stockdata['ema'] = stockdata['close'].ewm(span=13, adjust=False).mean()

The simple moving average is working perfectly; only the exponential moving average is not giving the correct values. The moving averages are calculated on the 'close' column.
ativo close sma ema
0 PETR4 28.18 NaN 28.180000
1 PETR4 28.63 NaN 28.244286
2 PETR4 28.39 NaN 28.265102
3 PETR4 29.18 NaN 28.395802
4 PETR4 28.93 NaN 28.472116
5 PETR4 29.13 NaN 28.566099
6 PETR4 29.48 NaN 28.696656
7 PETR4 30.13 NaN 28.901420
8 PETR4 29.72 NaN 29.018360
9 PETR4 29.42 NaN 29.075737
10 PETR4 29.36 NaN 29.116346
11 PETR4 29.75 NaN 29.206868
12 PETR4 30.55 29.296154 29.398744
In the dataframe the oldest data is at the top and the most recent at the bottom.
The correct value of the 13-period exponential moving average should be 29.53.
What would be the correct way to use the function?
From the verification I did on the investment platform, the values the function is giving me are those of an arithmetic moving average.
The issue is that you're assuming an EMA with span 13 only looks at the last 13 data points... but that's not really the case: it looks further back than that, simply using smaller weights for the data points further in the past.
If you take the last month of closing stock prices for PETR4 and compute the 13-day EMA over them, you arrive at the expected result:
closing_price = pd.Series([
30.00, 29.62, 29.29, 29.65, 29.30,
28.03, 28.80, 28.85, 28.94, 28.45,
28.18, 28.63, 28.39, 29.18, 28.93,
29.13, 29.48, 30.13, 29.72, 29.42,
29.36, 29.75, 30.55,
])
And:
In []: closing_price.ewm(span=13, adjust=False).mean().iloc[-1]
Out[]: 29.52974644568118
Which after rounding seems to match your expected answer.
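To make the weighting concrete, here is a minimal sketch of the recursion that ewm(span=13, adjust=False) applies, with alpha = 2 / (span + 1) (the function name is illustrative):

def ema_last(prices, span=13):
    # Each step blends the newest price with the previous EMA, so every
    # earlier point keeps a geometrically decaying (but nonzero) weight
    alpha = 2 / (span + 1)
    result = prices[0]
    for price in prices[1:]:
        result = alpha * price + (1 - alpha) * result
    return result

ema_last(list(closing_price)) should agree with the ewm result above, up to rounding.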
I will try and explain the problem I am currently having concerning cumulative sums on DataFrames in Python, and hopefully you'll grasp it!
Given a pandas DataFrame df with a column returns as such:
returns
Date
2014-12-10 0.0000
2014-12-11 0.0200
2014-12-12 0.0500
2014-12-15 -0.0200
2014-12-16 0.0000
Applying a cumulative sum to this DataFrame is easy, e.g. with df.cumsum(). But is it possible to apply a cumulative sum every X days (or data points), yielding only the cumulative sum of the last Y days (data points)?
Clarification: given daily data as above, how do I get the accumulated sum of the last Y days, re-evaluated (from zero) every X days?
Hope it's clear enough.
Thanks,
N
"Every X days" and "every X data points" are very different; the following assumes you really mean the first, since you mention it more frequently.
If the index is a DatetimeIndex, you can resample to a daily frequency, take a rolling sum, and then select only the original dates:
>>> df.resample("1D").asfreq().rolling(2, min_periods=1).sum().loc[df.index]
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-15 -0.02
2014-12-16 -0.02
or, step by step:
>>> df.resample("1d")
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.05
2014-12-13 NaN
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 0.00
>>> df.resample("1D").asfreq().rolling(2, min_periods=1).sum()
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-13 0.05
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 -0.02
The way I would do it is with helper columns. It's a little kludgy but it should work:
numgroups = int(len(df)/(x-1))
df['groupby'] = sorted(list(range(numgroups))*x)[:len(df)]
df['mask'] = (([0]*(x-y)+[1]*(y))*numgroups)[:len(df)]
df['masked'] = df.returns*df['mask']
df.groupby('groupby').masked.cumsum()
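For instance, a sketch of this applied to the sample frame above, with x = 3 and y = 2 chosen just for illustration (each block of three rows restarts the sum, and only the last two rows of the block accumulate):

x, y = 3, 2
numgroups = int(len(df)/(x-1))
df['groupby'] = sorted(list(range(numgroups))*x)[:len(df)]
df['mask'] = (([0]*(x-y)+[1]*y)*numgroups)[:len(df)]
df['masked'] = df.returns*df['mask']
print(df.groupby('groupby').masked.cumsum())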
I am not sure if there is a built-in method, but it does not seem very difficult to write one.
For example, here is one for a pandas Series:
def cum(df, interval):
    all = []
    quotient = len(df)//interval
    intervals = range(quotient)
    for i in intervals:
        all.append(df[0:(i+1)*interval].sum())
    return pd.Series(all)

>>> s1 = pd.Series(range(20))
>>> print(cum(s1, 4))
0 6
1 28
2 66
3 120
4 190
dtype: int64
Thanks to @DSM I managed to come up with a variation of his solution that actually does pretty much what I was looking for:
import numpy as np
import pandas as pd

df.resample("1W").agg({'A': np.sum})
Yields what I want for the example below:
rng = range(1,29)
dates = pd.date_range('1/1/2000', periods=len(rng))
r = pd.DataFrame(rng, index=dates, columns=['A'])
r2 = r.resample("1w", how={'A': np.sum})
Outputs:
>>> print(r)
A
2000-01-01 1
2000-01-02 2
2000-01-03 3
2000-01-04 4
2000-01-05 5
2000-01-06 6
2000-01-07 7
2000-01-08 8
2000-01-09 9
2000-01-10 10
2000-01-11 11
...
2000-01-25 25
2000-01-26 26
2000-01-27 27
2000-01-28 28
>>> print(r2)
A
2000-01-02 3
2000-01-09 42
2000-01-16 91
2000-01-23 140
2000-01-30 130
Even though it doesn't start "one week in" in this case (resulting in a sum of 3 for the very first entry), it always gets the correct rolling sum, starting on the previous date with an initial value of zero.