Calculating slope on a rolling basis in a pandas DataFrame in Python

I have a dataframe:
CAT ^GSPC
Date
2012-01-06 80.435059 1277.810059
2012-01-09 81.560600 1280.699951
2012-01-10 83.962914 1292.079956
....
2017-09-16 144.56653 2230.567646
and I want to find the slope of the stock vs. the S&P index over the last 63 days for each period. I have tried:
x = 0
temp_dct = {}
for date in df.index:
    x += 1
    x = min(x, len(df.index) - 64)  # cap the start so the 63-row slice stays in range
    temp_dct[str(date)] = np.polyfit(df['^GSPC'][0+x:63+x].values,
                                     df['CAT'][0+x:63+x].values,
                                     1)[0]
However, I feel this is very "unpythonic", but I've had trouble integrating rolling/shift functions into this.
My expected output is a column called "Beta" that holds the slope of the stock (y values) against the S&P (x values) for all dates available.

# this will operate on a series
def polyf(seri):
    return np.polyfit(seri.index.values, seri.values, 1)[0]

# you can store the original index in a column in case you need to reset back to it after fitting
df.index = df['^GSPC']
df['slope'] = df['CAT'].rolling(63, min_periods=2).apply(polyf, raw=False)

After running this, there will be a new column storing the fitting result.
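As a side note, the OLS slope of y on x equals Cov(x, y) / Var(x), so the same 63-day beta can be computed with pandas' rolling covariance and variance and no apply() at all. A minimal sketch, assuming the original df with 'CAT' and '^GSPC' columns and the dates still in the index:

import numpy as np
import pandas as pd

# rolling beta: slope of CAT (y) regressed on ^GSPC (x) over 63 observations
df['Beta'] = (
    df['CAT'].rolling(63).cov(df['^GSPC'])
    / df['^GSPC'].rolling(63).var()
)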

Related

'Conditional' groupby in pandas DataFrame

I am calculating market beta using daily data with pandas.DataFrame. That is, I want to calculate variances of the market return and covariances between the market return and individual stock returns, using a 252-day window with a 200-minimum-observation condition, with groupby. Beta is Cov(market_return, stock_return) / Var(market_return). First, I used an unconditional groupby to obtain the beta value, meaning that I calculated the variances and covariances for every day of my data sample. However, I then realized that calculating all betas consumes too much time and is wasteful, because only end-of-the-month data will be used. For example, even if betas are calculated on 1st Jan, 2nd Jan, ..., and 31st Jan, only the beta of 31st Jan will be used. Therefore, I want to know if there is any way to run my groupby code conditionally.
For example, my output is as follows, using a 252-day window with a 200-minimum-observation groupby:
stock_key  date        var(market_return)  covar(market_return, stock_return)
A          2012-01-26  9.4212              -4.23452
A          2012-01-27  9.3982              -4.18421
A          2012-01-28  9.1632              -4.33552
A          2012-01-29  9.0456              -4.55831
A          2012-01-30  9.2231              -4.92373
A          2012-01-31  9.0687              -4.04133
...
A          2012-02-27  8.9345              -4.72344
A          2012-02-28  9.0010              -4.82349
...
B          2012-01-26  4.8456              -1.42325
B          2012-01-27  4.8004              -1.18421
B          2012-01-28  4.0983              -1.02842
B          2012-01-29  4.9465              -1.13834
B          2012-01-30  4.7354              -1.63450
B          2012-01-31  4.1945              -1.18234
I want to know if there is any way to get the result as follows.
stock_key  date        var(market_return)  covar(market_return, stock_return)
A          2012-01-31  9.0687              -4.04133
A          2012-02-28  9.0010              -4.82349
B          2012-01-31  4.1945              -1.18234
Thank you for reading my question.
Update: I have added my code below. Here, PERMNO is the id of the stocks.
# pairwise rolling covariances per stock (252-day window, 200 min obs)
dtmpPair = dlongPair[['PERMNO','dayMktmRF','eadjret']]
dgrpPair = dtmpPair.groupby(['PERMNO']).rolling(252, min_periods=200)
dgrpCov = dgrpPair.cov().unstack()

# denominator: rolling variance of the market return
ddenom = dgrpCov['dayMktmRF']['dayMktmRF'].reset_index()
ddenom = ddenom[['PERMNO','dayMktmRF']]
ddenom['date'] = dlongPair['date']
ddenom.columns = ['PERMNO','MktVar','date']

# numerator: rolling covariance of market and stock returns
dnumer = dgrpCov['dayMktmRF']['eadjret'].reset_index()
dnumer = dnumer[['PERMNO','eadjret']]
dnumer['date'] = dlongPair['date']
dnumer.columns = ['PERMNO','Cov','date']

# beta = Cov / Var
ddfBeta = dnumer.merge(ddenom, on=['PERMNO','date'])
ddfBeta['beta_daily'] = ddfBeta['Cov'] / ddfBeta['MktVar']
ddfBeta = ddfBeta[ddfBeta['beta_daily'].notnull()]
ddfBeta['month'] = ddfBeta['date'].dt.month
ddfBeta['year'] = ddfBeta['date'].dt.year
beta_daily = ddfBeta[['date','PERMNO','year','month','beta_daily']]
Here, the dlongPair dataframe holds the underlying daily data.
Without using groupby, we can check whether the date in each row is the last day of its month.

df['date'] = pd.to_datetime(df['date'])  # string to datetime
# is the date in the row the last day of that month?
dfx = df[df['date'] - pd.offsets.Day() + pd.offsets.MonthEnd(1) == df['date']]
Output:
stock_key date var(market_return) covar(market_return, stock_return)
5 A 2012-01-31 9.0687 -4.04133
15 B 2012-01-31 4.1945 -1.18234
Note: 2012-02's last day is 29.
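As a side note, pandas also exposes this check directly on the datetime accessor, which avoids the offset arithmetic:

dfx = df[df['date'].dt.is_month_end]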

How to perform sliding window correlation operation on pandas dataframe with datetime index?

I am working with stock data coming from Yahoo Finance.
import pandas as pd
import yfinance as yf

def load_y_finance_data(y_finance_tickers: list):
    df = pd.DataFrame()
    print("Loading Y-Finance data ...")
    for ticker in y_finance_tickers:
        df[ticker.replace("^", "")] = yf.download(
            ticker,
            auto_adjust=True,  # only download adjusted data
            progress=False,
        )["Close"]
    print("Done loading Y-Finance data!")
    return df
x = load_y_finance_data(["^VIX", "^GSPC"])
x
VIX GSPC
Date
1990-01-02 17.240000 359.690002
1990-01-03 18.190001 358.760010
1990-01-04 19.219999 355.670013
1990-01-05 20.110001 352.200012
1990-01-08 20.260000 353.790009
DataSize=(8301, 2)
Here I want to perform a sliding-window operation over every 50-day period: compute the correlation (using corr()) for a 50-day slice (day_1 to day_50), then move the window forward by one day (day_2 to day_51), and so on.
I tried the naive way of using a for loop, and it works, but it takes too much time. Code below:
data_size = len(x)
period = 50
df = pd.DataFrame()
for i in range(data_size - period):
    df.loc[i, "GSPC_VIX_corr"] = x[["GSPC", "VIX"]][i:i+period].corr().loc["GSPC", "VIX"]
df
GSPC_VIX_corr
0 -0.703156
1 -0.651513
2 -0.602876
3 -0.583256
4 -0.589086
How can I do this more efficiently? Is there any built-in way I can use?
Thanks :)
You can use the rolling-window functionality of pandas with many different aggregations, including corr(). Instead of your for loop, do this:
x["VIX"].rolling(window=period).corr(x["GSPC"])

What is the best way to make time series data stationary and invert the transformation in Python?

I did 1st differencing as the time series is not stationary.
When I do the inverse transformation, some values come out negative, because diff() produces negative values. Is there a way to sort this out and bring the data back to its original format, as close as possible to the expected result?
This is my Python code. Is there a way to fix the code, or any alternate logic to make the data stationary and forecast the series?
count = 0

def invert_transformation(df_train, df_forecast):
    """Revert the differencing to get the forecast back to the original scale."""
    df_fc = df_forecast.copy()
    columns = df_train.columns
    if count > 0:  # for 1st differencing
        print("Enter into invert transformation")
        for col in columns:
            df_fc[str(col)+'_f'] = df_train[col].iloc[-1] + df_fc[str(col)+'_f'].cumsum()
    print("df_fc: \n", df_fc)
    return df_fc
# Since the data is not stationary, I did the 1st difference
df_differenced = df_train.diff().dropna()
count = count + 1  # increase the count
count
....
....
model = VAR(df_differenced)  # VAR from statsmodels.tsa.api
....
fc = model_fitted.forecast(y=forecast_input, steps=10)
df_forecast2 = pd.DataFrame(fc, index=df2.index[-nobs:], columns=df2.columns + '_f')
df_results = invert_transformation(df_train, df_forecast2)
The values of df_results (TS is the index column) are:
TS Field1_f Field2_f
44:13.0 6.826511e+05 1.198614e+06
44:14.0 -8.620101e+05 4.694556e+05
..
..
44:22.0 -1.401620e+07 -2.092826e+06
The values of df_differenced are:
TS Field1 Field2
43:34.0 187000.0 29000.0
43:35.0 175000.0 76722.0
43:36.0 -10000.0 31000.0
43:37.0 90000.0 42000.0
43:38.0 -130000.0 -42000.0
43:39.0 40000.0 -98444.0
..
..
44:11.0 -130000.0 40722.0
44:12.0 117000.0 -42444.0
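For what it's worth, negative values in a differenced series are expected and do not break the inversion: first differencing is undone exactly by a cumulative sum anchored at a known original value (for forecasts, the last training value, as in invert_transformation above). A minimal self-contained sketch of that identity:

import numpy as np
import pandas as pd

s = pd.Series([10.0, 12.0, 9.0, 14.0])             # original (non-stationary) series
d = s.diff().dropna()                              # 1st difference; some values are negative
restored = s.iloc[0] + d.cumsum()                  # invert: anchor value + running sum
assert np.allclose(restored.values, s.values[1:])  # recovers the original exactly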

Create a column that is a conditional smoothed moving average of another column in python

My Problem
I am trying to create a column in Python which is the conditional smoothed 14-day moving average of another column. The condition is that I only want to include positive values from the other column in the rolling average.
I am currently using the following code, which works exactly how I want it to, but it is really slow because of the loops. I want to try to redo it without using loops. The dataset is simply the last closing price of a stock.
Current Working Code
import numpy as np
import pandas as pd

csv1 = pd.read_csv('stock_price.csv', delimiter=',')
df = pd.DataFrame(csv1)
df['delta'] = df.PX_LAST.pct_change()
df.loc[df.index[0], 'avg_gain'] = 0
for x in range(1, len(df.index)):
    if df["delta"].iloc[x] > 0:
        df["avg_gain"].iloc[x] = ((df["avg_gain"].iloc[x - 1] * 13) + df["delta"].iloc[x]) / 14
    else:
        df["avg_gain"].iloc[x] = ((df["avg_gain"].iloc[x - 1] * 13) + 0) / 14
df
Correct Output Example
Dates PX_LAST delta avg_gain
03/09/2018 43.67800 NaN 0.000000
04/09/2018 43.14825 -0.012129 0.000000
05/09/2018 42.81725 -0.007671 0.000000
06/09/2018 43.07725 0.006072 0.000434
07/09/2018 43.37525 0.006918 0.000897
10/09/2018 43.47925 0.002398 0.001004
11/09/2018 43.59750 0.002720 0.001127
12/09/2018 43.68725 0.002059 0.001193
13/09/2018 44.08925 0.009202 0.001765
14/09/2018 43.89075 -0.004502 0.001639
17/09/2018 44.04200 0.003446 0.001768
Attempted Solutions
I tried to create a new column that comprises only the positive values, and then tried to create the smoothed moving average of that new column, but it doesn't give me the right answer:

df['new_col'] = df['delta'].apply(lambda x: x if x > 0 else 0)
df['avg_gain'] = df['new_col'].ewm(14, min_periods=1).mean()

The maths behind it is as follows:
Avg_Gain(t) = ((Avg_Gain(t-1) * 13) + (New_Col(t) * 1)) / 14
where New_Col only equals the positive values of Delta.
Does anyone know how I might be able to do it?
Cheers
This should speed up your code:
df['avg_gain'] = df[df['delta'] > 0]['delta'].rolling(14).mean()
Does your current code converge to zero? If you can provide the data, then it would be easier for the folk to do some analysis.
I would suggest you add a column which is 0 if the value is < 0 and instead has the same value as the one you want to consider if it is >= 0. Then you take the running average of this new column.
df['new_col'] = df['delta'].apply(lambda x: x if x >= 0 else 0)
df['avg_gain'] = df['new_col'].rolling(14).mean()
This would take into account zeros instead of just discarding them.
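As a footnote, the recursion Avg_Gain(t) = (13 * Avg_Gain(t-1) + New_Col(t)) / 14 is exactly an exponential moving average with alpha = 1/14 in recursive (adjust=False) form, so pandas can reproduce the loop without iteration. A sketch, assuming the same df with the delta column already computed:

# zero out losses and seed the NaN first delta with 0 (as the loop does),
# then apply the recursive EWM y_t = (1 - a)*y_{t-1} + a*x_t with a = 1/14
df['avg_gain'] = (
    df['delta'].clip(lower=0)
               .fillna(0)
               .ewm(alpha=1/14, adjust=False)
               .mean()
)

This matches the loop's output exactly, including the 0.000434 and 0.000897 values in the example above.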

Group by date range in pandas dataframe

I have a time series data in pandas, and I would like to group by a certain time window in each year and calculate its min and max.
For example:
times = pd.date_range(start = '1/1/2011', end = '1/1/2016', freq = 'D')
df = pd.DataFrame(np.random.rand(len(times)), index=times, columns=["value"])
How can I group by a time window, e.g. 'Jan-10':'Mar-21', in each year and calculate the min and max of the value column?
You can use the resample method.
df.resample('5d').agg(['min','max'])
I'm not sure if there's a direct way to do it without first creating a flag for the days required. The following function creates the required flag:

# function for flagging the days required
def flag(x):
    if x.month == 1 and x.day >= 10:
        return True
    elif x.month in [2, 3, 4]:
        return True
    elif x.month == 5 and x.day <= 21:
        return True
    else:
        return False
Since you require for each year, it would be a good idea to have the year as a column.
Then the min and max for each year for given periods can be obtained with the code below:
times = pd.date_range(start = '1/1/2011', end = '1/1/2016', freq = 'D')
df = pd.DataFrame(np.random.rand(len(times)), index=times, columns=["value"])
df['Year'] = df.index.year
pd.pivot_table(df[list(pd.Series(df.index).apply(flag))], values=['value'],
               index=['Year'], aggfunc=[min, max])

The output will be a table with the per-year min and max of value.
Hope that answers your question... :)
You can define the bin edges, then throw out the bins you don't need (every other one) with .loc[::2, :]. Here I'll define two functions just to check we're getting the date ranges we want within groups. (Note: since left edges are open, we need to subtract one day.)
import pandas as pd

edges = pd.to_datetime([x for year in df.index.year.unique()
                        for x in [f'{year}-02-09', f'{year}-03-21']])

def min_idx(x):
    return x.index.min()

def max_idx(x):
    return x.index.max()

df.groupby(pd.cut(df.index, bins=edges)).agg([min_idx, max_idx, min, max]).loc[::2, :]
Output:
value
min_idx max_idx min max
(2011-02-09, 2011-03-21] 2011-02-10 2011-03-21 0.009343 0.990564
(2012-02-09, 2012-03-21] 2012-02-10 2012-03-21 0.026369 0.978470
(2013-02-09, 2013-03-21] 2013-02-10 2013-03-21 0.039491 0.946481
(2014-02-09, 2014-03-21] 2014-02-10 2014-03-21 0.029161 0.967490
(2015-02-09, 2015-03-21] 2015-02-10 2015-03-21 0.006877 0.969296
(2016-02-09, 2016-03-21] NaT NaT NaN NaN
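As an alternative sketch (using the same Feb-10 to Mar-21 window as above, and assuming the df built in the question): a plain boolean mask on month and day plus a groupby on the year avoids constructing bin edges entirely.

# keep rows from Feb 10 through Mar 21 of every year, then aggregate per year
mask = ((df.index.month == 2) & (df.index.day >= 10)) | \
       ((df.index.month == 3) & (df.index.day <= 21))
sel = df[mask]
sel.groupby(sel.index.year)['value'].agg(['min', 'max'])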
