As the title suggests, where has the rolling function option of the ols command in Pandas migrated to in statsmodels? I can't seem to find it.
Pandas tells me doom is in the works:
FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
model = pd.ols(y=series_1, x=mmmm, window=50)
In fact, if you do something like:
import statsmodels.api as sm
model = sm.OLS(series_1, mmmm, window=50).fit()
print(model.summary())
you get results (the window argument does not stop the code from running), but only the parameters of a regression run on the entire period, not the series of parameters for each rolling window.
I created an ols module designed to mimic pandas' deprecated MovingOLS; it is here.
It has three core classes:
OLS : static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.
RollingOLS : rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimension NumPy arrays.
PandasRollingOLS : wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.
Note that the module is part of a package (which I'm currently in the process of uploading to PyPI) and it requires one inter-package import.
The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also makes extensive use of broadcasting. Attributes largely mimic statsmodels' OLS RegressionResultsWrapper.
An example:
import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS
# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"
syms = {
"TWEXBMTH" : "usd",
"T10Y2YM" : "term_spread",
"GOLDAMGBD228NLBM" : "gold",
}
params = {
"fq": "Monthly,Monthly,Monthly",
"id": ",".join(syms.keys()),
"cosd": "2000-01-01",
"coed": "2019-02-01",
}
data = pd.read_csv(
url + "?" + urllib.parse.urlencode(params, safe=","),
na_values={"."},
parse_dates=["DATE"],
index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
# usd term_spread gold
# DATE
# 2000-02-01 0.012580 -1.409091 0.057152
# 2000-03-01 -0.000113 2.000000 -0.047034
# 2000-04-01 0.005634 0.518519 -0.023520
# 2000-05-01 0.022017 -0.097561 -0.016675
# 2000-06-01 -0.010116 0.027027 0.036599
y = data.usd
x = data.drop('usd', axis=1)
window = 12 # months
model = PandasRollingOLS(y=y, x=x, window=window)
print(model.beta.head()) # Coefficients excluding the intercept
# term_spread gold
# DATE
# 2001-01-01 0.000033 -0.054261
# 2001-02-01 0.000277 -0.188556
# 2001-03-01 0.002432 -0.294865
# 2001-04-01 0.002796 -0.334880
# 2001-05-01 0.002448 -0.241902
print(model.fstat.head())
# DATE
# 2001-01-01 0.136991
# 2001-02-01 1.233794
# 2001-03-01 3.053000
# 2001-04-01 3.997486
# 2001-05-01 3.855118
# Name: fstat, dtype: float64
print(model.rsq.head()) # R-squared
# DATE
# 2001-01-01 0.029543
# 2001-02-01 0.215179
# 2001-03-01 0.404210
# 2001-04-01 0.470432
# 2001-05-01 0.461408
# Name: rsq, dtype: float64
Rolling beta with sklearn
import pandas as pd
from sklearn import linear_model
def rolling_beta(X, y, idx, window=255):
    assert len(X)==len(y)
    out_dates = []
    out_beta = []
    model_ols = linear_model.LinearRegression()
    for iStart in range(0, len(X)-window):
        iEnd = iStart+window
        model_ols.fit(X[iStart:iEnd], y[iStart:iEnd])
        #store output
        out_dates.append(idx[iEnd])
        out_beta.append(model_ols.coef_[0][0])
    return pd.DataFrame({'beta':out_beta}, index=out_dates)
df_beta = rolling_beta(df_rtn_stocks['NDX'].values.reshape(-1, 1), df_rtn_stocks['CRM'].values.reshape(-1, 1), df_rtn_stocks.index.values, 255)
For completeness, here is a speedier NumPy-based solution that limits the calculations to the regression coefficients and the final estimate.
Numpy rolling regression function
import numpy as np
import pandas as pd
def rolling_regression(y, x, window=60):
    """
    y and x must be pandas.Series
    """
    # === Clean-up ============================================================
    x = x.dropna()
    y = y.dropna()
    # === Trim acc to shortest ================================================
    if x.index.size > y.index.size:
        x = x[y.index]
    else:
        y = y[x.index]
    # === Verify enough space =================================================
    if x.index.size < window:
        return None
    else:
        # === Add a constant if needed ========================================
        X = x.to_frame()
        X['c'] = 1
        # === Loop... this can be improved ====================================
        estimate_data = []
        for i in range(window, x.index.size+1):
            X_slice = X.values[i-window:i,:] # always index in np as opposed to pandas, much faster
            y_slice = y.values[i-window:i]
            coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
            # fitted value at the last observation of the current window
            estimate_data.append(coeff[0] * x.values[i-1] + coeff[1])
        # === Assemble ========================================================
        estimate = pd.Series(data=estimate_data, index=x.index[window-1:])
        return estimate
Notes
In some specific use cases that only require the final estimate of the regression, x.rolling(window=60).apply(my_ols) appears to be somewhat slow.
As a reminder, the coefficients of a regression can be calculated as a matrix product, as you can read on Wikipedia's least squares page. This approach via NumPy's matrix multiplication can speed up the process somewhat compared with using the ols in statsmodels. This product is expressed in the line starting with coeff = ...
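The matrix product mentioned above is beta = (X'X)^-1 X'y. A minimal sketch on made-up data (the true slope 2.0 and intercept 1.0 are assumptions chosen for illustration) compares the explicit inverse with np.linalg.solve and np.linalg.lstsq, which avoid forming the inverse and are usually the numerically safer choice:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 60
x = rng.normal(size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=n)  # assumed slope and intercept

# Same layout as above: regressor first, constant column second
X = np.column_stack([x, np.ones(n)])

# Textbook form: beta = (X'X)^-1 X'y
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y
# Same normal equations solved without an explicit inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
# Least-squares reference
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv, beta_solve, beta_lstsq)
```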
For rolling trend in one column, one can just use:
import numpy as np
def calc_trend(window:int = 30):
    df['trend'] = df.rolling(window = window)['column_name'].apply(lambda x: np.polyfit(np.array(range(0,window)), x, 1)[0], raw=True)
However, in my case I wanted to find a trend with respect to date, where the date was in another column. I had to create the functionality manually, but it is easy. First, convert the datetime values to a number of days since t_0:
xdays = (df['Date'].values.astype('int64') - df['Date'][0].value) / (1e9*86400)
Then:
def calc_trend(window:int=30):
    for t in range(len(df)):
        if t < window//2:
            continue
        i0 = t - window//2 # Start window
        i1 = i0 + window # End window
        xvec = xdays[i0:i1]
        yvec = df['column_name'][i0:i1].values
        df.loc[t,('trend')] = np.polyfit(xvec, yvec, 1)[0]
I have fit a linearmodels.PanelOLS model and stored it in m. I now want to test if certain coefficients are simultaneously equal to zero.
Does a fitted linearmodels.PanelOLS object have an F-test function where I can pass my own restriction matrix?
I am looking for something like statsmodels' f_test method.
Here's a minimum reproducible example.
# Libraries
from linearmodels.panel import PanelOLS
from linearmodels.datasets import wage_panel
# Load data and set index
df = wage_panel.load()
df = df.set_index(['nr','year'])
# Add constant term
df['const'] = 1
# Fit model
m = PanelOLS(dependent=df['lwage'], exog=df[['const','expersq','married']])
m = m.fit(cov_type='clustered', cluster_entity=True)
# Is there an f_test method for m???
m.f_test(r_mat=some_matrix_here) # Something along these lines?
You can use wald_test (a standard F-test is numerically identical to a Wald test under some assumptions on the covariance).
# Libraries
from linearmodels.panel import PanelOLS
from linearmodels.datasets import wage_panel
# Load data and set index
df = wage_panel.load()
df = df.set_index(['nr','year'])
# Add constant term
df['const'] = 1
# Fit model
m = PanelOLS(dependent=df['lwage'], exog=df[['const','expersq','married']])
m = m.fit(cov_type='clustered', cluster_entity=True)
Then the test
import numpy as np
# Use matrix notation RB - q = 0 where R is restr and q is value
# Restrictions: expersq = 0.001 & expersq+married = 0.2
restr = np.array([[0,1,0],[0,1,1]])
value = np.array([0.001, 0.2])
m.wald_test(restr, value)
This returns
Linear Equality Hypothesis Test
H0: Linear equality constraint is valid
Statistic: 0.2608
P-value: 0.8778
Distributed: chi2(2)
WaldTestStatistic, id: 0x2271cc6fdf0
You can also use formula syntax if you used formulas to define your model, which can be easier to code up.
fm = PanelOLS.from_formula("lwage~ 1 + expersq + married", data=df)
fm = fm.fit(cov_type='clustered', cluster_entity=True)
fm.wald_test(formula="expersq = 0.001,expersq+married = 0.2")
The result is the same as above.
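For readers who want to see what the test computes, the Wald statistic for H0: RB - q = 0 is (Rb - q)' (R V R')^-1 (Rb - q), compared against a chi-squared distribution with one degree of freedom per restriction. The sketch below is a plain NumPy/SciPy illustration, not linearmodels' implementation; the function name wald_statistic and all numbers are hypothetical.

```python
import numpy as np
from scipy import stats

def wald_statistic(beta, cov, R, q):
    """Wald test of R @ beta = q given the coefficient covariance `cov`."""
    beta, cov, R, q = (np.asarray(a, dtype=float) for a in (beta, cov, R, q))
    diff = R @ beta - q
    # Quadratic form (Rb - q)' (R V R')^-1 (Rb - q)
    stat = float(diff @ np.linalg.solve(R @ cov @ R.T, diff))
    dof = R.shape[0]  # one degree of freedom per restriction
    pval = float(stats.chi2.sf(stat, dof))
    return stat, pval, dof

# Hypothetical estimates and covariance, for illustration only
beta = np.array([1.0, 0.5, 0.2])
cov = np.diag([0.04, 0.01, 0.01])
R = np.array([[0, 1, 0], [0, 1, 1]])
q = np.array([0.4, 0.7])

stat, pval, dof = wald_statistic(beta, cov, R, q)
print(stat, pval, dof)
```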
I am trying to simulate a pandas dataframe, using random values, with a combination of hard upper/lower values. I am using np.random.normal, as the original data is fairly normally distributed.
The code I am using to create the dataframe is:
df = pd.DataFrame({
"Temp": np.random.normal(6.809892, 2.975827,93),
"Sun": np.random.normal(1.615054,2.053996,93),
"Rel Hum": np.random.normal(87.153118,5.529958,93)
})
In the above example, I would like there to be a hard lower and upper bound for all three values. For example, Rel. Hum. could not go below 0, or above 100. (Edit: the three values would not all have the same bounds, either upper or lower. Temp can go negative, while Sun would be bounded at 0 and 24.)
How can I force these bounds, while creating a relatively normal distribution, and pass the values to the dataframe at the same time?
Edit : Note that this samples from a truncated normal for the given parameters and will most likely not be truly normally distributed, sorry for the confusion.
Use scipy truncated normal defined as :
"The standard form of this distribution is a standard normal truncated to the range [a, b]"
from scipy.stats import truncnorm
low_bound = 0
upper_bound = 100
mean = 8
std = 2
a, b = (low_bound - mean) / std, (upper_bound - mean) / std
n_samples = 1000
samples = truncnorm.rvs(a = a, b = b,
loc = mean, scale = std,
size = n_samples)
Thanks to ALollz for the corrections !
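To feed truncated draws straight into the dataframe from the question, one option is a small helper applied per column. This is a minimal sketch: the helper name truncated_normal is not part of scipy, and the Temp bounds of -30 and 40 are hypothetical because the question gives none.

```python
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

def truncated_normal(mean, std, low, high, size, random_state=None):
    """Draw from a normal(mean, std) truncated to [low, high]."""
    # truncnorm expects the bounds in standard-deviation units around the mean
    a, b = (low - mean) / std, (high - mean) / std
    return truncnorm.rvs(a, b, loc=mean, scale=std, size=size,
                         random_state=random_state)

rng = np.random.default_rng(0)
n = 93
# (mean, std, low, high) per column; the Temp bounds are an assumption
spec = {
    "Temp": (6.809892, 2.975827, -30.0, 40.0),
    "Sun": (1.615054, 2.053996, 0.0, 24.0),
    "Rel Hum": (87.153118, 5.529958, 0.0, 100.0),
}
df = pd.DataFrame({
    name: truncated_normal(m, s, lo, hi, n, rng)
    for name, (m, s, lo, hi) in spec.items()
})
print(df.describe())
```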
Try clip() function to bound the values, example:
>>> df[df['Rel Hum']>100].head()
Temp Sun Rel Hum
32 4.734005 4.102939 100.064077
>>> df['Rel Hum'].clip(0, 100, inplace=True) # assigns values outside boundary to 0 and 100
>>> df.head()
Temp Sun Rel Hum
0 9.714943 6.255931 93.105135
1 0.551001 3.063972 85.923184
2 7.780588 3.580514 79.124139
3 3.766066 3.684801 84.543149
4 8.541507 -3.066196 83.598925
>>> df[df['Rel Hum']>100].head()
Empty DataFrame
Columns: [Temp, Sun, Rel Hum]
Index: []
Just do a clip:
df = pd.DataFrame({
"Temp": np.random.normal(6.809892, 2.975827,93),
"Sun": np.random.normal(1.615054,2.053996,93),
"Rel Hum": np.random.normal(87.153118,5.529958,93)
}).clip(0,100)
And plot:
df.plot.density(subplots=True);
gives:
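Note that .clip(0,100) applies the same bounds to every column, whereas the question asks for different bounds per column. A minimal sketch of per-column clipping follows; the bounds dictionary is an assumption based on the question's edit (Temp unbounded, Sun in 0 to 24, Rel Hum in 0 to 100).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Temp": rng.normal(6.809892, 2.975827, 93),
    "Sun": rng.normal(1.615054, 2.053996, 93),
    "Rel Hum": rng.normal(87.153118, 5.529958, 93),
})

# (lower, upper) per column; None means "no bound on that side"
bounds = {"Temp": (None, None), "Sun": (0, 24), "Rel Hum": (0, 100)}

clipped = df.copy()
for col, (lo, hi) in bounds.items():
    clipped[col] = df[col].clip(lower=lo, upper=hi)

print(clipped.describe())
```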
You can clip, though this leaves you with a spike at the edges:
import pandas as pd
import numpy as np
N = 10**5
df = pd.DataFrame({"Rel Hum": np.random.normal(87.153118,5.529958, N)})
df['Rel Hum'].clip(lower=0, upper=100).plot(kind='hist', bins=np.arange(60,101,1))
If you want to avoid that spike, redraw the out-of-bounds points until everything is within bounds:
while not df['Rel Hum'].between(0, 100).all():
    m = ~df['Rel Hum'].between(0, 100)
    df.loc[m, 'Rel Hum'] = np.random.normal(87.153118, 5.529958, m.sum())
df['Rel Hum'].plot(kind='hist', bins=np.arange(60,101,1))
Background:
I'd like to solve a wide array of optimization problems such as asset weights in a portfolio, and parameters in trading strategies where the variables are passed to functions containing a bunch of other variables as well.
Until now, I've been able to do these things easily in Excel using the Solver Add-In. But I think it would be much more efficient and even more widely applicable using Python. For the sake of clarity, I'm going to boil the question down to the essence of portfolio optimization.
My question (short version):
Here's a dataframe and a corresponding plot with asset returns.
Dataframe 1:
A1 A2
2017-01-01 0.0075 0.0096
2017-01-02 -0.0075 -0.0033
.
.
2017-01-10 0.0027 0.0035
Plot 1 - Asset returns
Based on that, I would like to find the weights for the optimal portfolio with regards to risk / return (Sharpe ratio), represented by the green dot in the plot below (the red dot is the so-called minimum variance portfolio, and represents another optimization problem).
Plot 2 - Efficient frontier and optimal portfolios:
How can I do this with numpy or scipy?
The details:
The following code section contains the function returns() to build a dataframe with random returns for two assets, as well as a function pf_sharpe to calculate the Sharpe ratio of two given weights for a portfolio of the returns.
# imports
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
np.random.seed(1234)
# Reproducible data sample
def returns(rows, names):
    ''' Function to create data sample with random returns
    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets
    Example
    =======
    >>> returns(rows = 2, names = ['A', 'B'])
    A B
    2017-01-01 0.0027 0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars= names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars)
    df_temp = df_temp.set_index(rng)
    df_temp = df_temp / 10000
    return df_temp
# Sharpe ratio
def pf_sharpe(df, w1, w2):
    ''' Function to calculate risk / reward ratio
    based on a pandas dataframe with two return series
    Parameters
    ==========
    df : pandas dataframe
    w1 : portfolio weight for asset 1
    w2 : portfolio weight for asset 2
    '''
    weights = [w1,w2]
    # Calculate portfolio returns and volatility
    pf_returns = (np.sum(df.mean() * weights) * 252)
    pf_volatility = (np.sqrt(np.dot(np.asarray(weights).T, np.dot(df.cov() * 252, weights))))
    # Calculate sharpe ratio
    pf_sharpe = pf_returns / pf_volatility
    return pf_sharpe
# Make df with random returns and calculate
# sharpe ratio for a 80/20 split between assets
df_returns = returns(rows = 10, names = ['A1', 'A2'])
df_returns.plot(kind = 'bar')
sharpe = pf_sharpe(df = df_returns, w1 = 0.8, w2 = 0.2)
print(sharpe)
# Output:
# 5.09477512073
Now I'd like to find the portfolio weights that optimize the Sharpe ratio. I think you could express the optimization problem as follows:
maximize:
pf_sharpe()
by changing:
w1, w2
under the constraints:
0 < w1 < 1
0 < w2 < 1
w1 + w2 = 1
What I've tried so far:
I found a possible setup in the post Python Scipy Optimization.minimize using SLSQP showing maximized results. Below is what I have so far, and it addresses a central aspect of my question directly:
[...]where the variables are passed to functions containing a bunch of other variables as well.
As you can see, my initial challenge prevents me from even testing if my bounds and constraints will be accepted by the function optimize.minimize(). I haven't even bothered to take into consideration the fact that this is a maximization and not a minimization problem (hopefully amendable by changing the sign of the function).
Attempts:
# bounds
b = (0,1)
bnds = (b,b)
# constraints
def constraint1(w1,w2):
    return w1 - w2
cons = ({'type': 'eq', 'fun':constraint1})
# initial guess
x0 = [0.5, 0.5]
# Testing the initial guess
print(pf_sharpe(df = df_returns, weights = x0))
# Optimization attempts
attempt1 = optimize.minimize(pf_sharpe(), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
attempt2 = optimize.minimize(pf_sharpe(df = df_returns, weights), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
attempt3 = optimize.minimize(pf_sharpe(weights, df = df_returns), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
Results:
Attempt1 is closest to the scipy setup here, but understandably fails because neither df nor weights have been specified.
Attempt2 fails with SyntaxError: positional argument follows keyword argument
Attempt3 fails with NameError: name 'weights' is not defined
I was under the impression that df could freely be specified, and that x0 in optimize.minimize would be considered the variables to be tested as 'representatives' for the weights in the function specified by pf_sharpe().
As you surely understand, my transition from Excel to Python in this regard has not been the easiest, and there is plenty I don't understand here. Anyway, I'm hoping some of you may offer some suggestions or clarifications!
Thank you!
Appendix 1 - Simulation approach:
This particular portfolio optimization problem can easily be solved by simulating a bunch of portfolio weights. And I did exactly that to produce the portfolio plot above. Here's the whole function if anyone is interested:
# Portfolio simulation
def portfolioSim(df, simRuns):
    ''' Function to take a df with asset returns,
    runs a number of simulated portfolio weights,
    plots return and risk for those weights,
    and finds minimum risk portfolio
    and max risk / return portfolio
    Parameters
    ==========
    df : pandas dataframe with returns
    simRuns : number of simulations
    '''
    prets = []
    pvols = []
    pwgts = []
    names = list(df)
    for p in range (simRuns):
        # Assign random weights
        weights = np.random.random(len(list(df)))
        weights /= np.sum(weights)
        weights = np.asarray(weights)
        # Calculate risk and returns with random weights
        prets.append(np.sum(df.mean() * weights) * 252)
        pvols.append(np.sqrt(np.dot(weights.T, np.dot(df.cov() * 252, weights))))
        pwgts.append(weights)
    prets = np.array(prets)
    pvols = np.array(pvols)
    pwgts = np.array(pwgts)
    pshrp = prets / pvols
    # Store calculations in a df
    df1 = pd.DataFrame({'return':prets})
    df2 = pd.DataFrame({'risk':pvols})
    df3 = pd.DataFrame(pwgts)
    df3.columns = names
    df4 = pd.DataFrame({'sharpe':pshrp})
    df_temp = pd.concat([df1, df2, df3, df4], axis = 1)
    # Plot results
    plt.figure(figsize=(8, 4))
    plt.scatter(pvols, prets, c=prets / pvols, cmap = 'viridis', marker='o')
    # Min risk
    min_vol_port = df_temp.iloc[df_temp['risk'].idxmin()]
    plt.plot([min_vol_port['risk']], [min_vol_port['return']], marker='o', markersize=12, color="red")
    # Max sharpe
    max_sharpe_port = df_temp.iloc[df_temp['sharpe'].idxmax()]
    plt.plot([max_sharpe_port['risk']], [max_sharpe_port['return']], marker='o', markersize=12, color="green")
    # Return the max-Sharpe portfolio so that a caller can capture it
    return max_sharpe_port
# Test run
portfolioSim(df = df_returns, simRuns = 250)
Appendix 2 - Excel Solver approach:
Here is how I would approach the problem using Excel Solver. Instead of linking to a file, I've only attached a screenshot and included the most important formulas in a code section. I'm guessing not many of you are going to be interested in reproducing this anyway, but I've included it just to show that it can be done quite easily in Excel.
Grey ranges represent formulas. Ranges that can be changed and used as arguments in the optimization problem are highlighted in yellow. The green range is the objective function.
Here's an image of the worksheet and Solver setup:
Excel formulas:
C3 =AVERAGE(C7:C16)
C4 =AVERAGE(D7:D16)
H4 =COVARIANCE.P(C7:C16;D7:D16)
G5 =COVARIANCE.P(C7:C16;D7:D16)
G10 =G8+G9
G13 =MMULT(TRANSPOSE(G8:G9);C3:C4)
G14 =SQRT(MMULT(TRANSPOSE(G8:G9);MMULT(G4:H5;G8:G9)))
H13 =G12/G13
H14 =G13*252
G16 =G13/G14
H16 =H13/H14
End notes:
As you can see from the screenshot, Excel Solver suggests a 47% / 53% split between A1 and A2 to obtain an optimal Sharpe ratio of 5.6. Running the Python function sr_opt = portfolioSim(df = df_returns, simRuns = 25000) yields a Sharpe ratio of 5.3 with corresponding weights of 47% and 53% for A1 and A2:
print(sr_opt)
#Output
#return 0.361439
#risk 0.067851
#A1 0.465550
#A2 0.534450
#sharpe 5.326933
The method applied in Excel is GRG Nonlinear. I understand that changing the SLSQP argument to a non-linear method would get me somewhere, and I've looked into nonlinear solvers in scipy as well, but with little success.
And maybe scipy isn't even the best option here?
A more detailed answer: the first part of your code remains the same.
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
np.random.seed(1234)
# Reproducible data sample
def returns(rows, names):
    ''' Function to create data sample with random returns
    Parameters
    ==========
    rows : number of rows in the dataframe
    names: list of names to represent assets
    Example
    =======
    >>> returns(rows = 2, names = ['A', 'B'])
    A B
    2017-01-01 0.0027 0.0075
    2017-01-02 -0.0050 -0.0024
    '''
    listVars= names
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars)
    df_temp = df_temp.set_index(rng)
    df_temp = df_temp / 10000
    return df_temp
The function pf_sharpe is modified, the 1st input is one of the weights, the parameter to be optimised. Instead of inputting constraint w1 + w2 = 1, we can define w2 as 1-w1 inside pf_sharpe, which is perfectly equivalent but simpler and faster. Also, minimize will attempt to minimize pf_sharpe, and you actually want to maximize it, so now the output of pf_sharpe is multiplied by -1.
# Sharpe ratio
def pf_sharpe(weight, df):
    ''' Function to calculate risk / reward ratio
    based on a pandas dataframe with two return series
    '''
    weights = [weight[0], 1-weight[0]]
    # Calculate portfolio returns and volatility
    pf_returns = (np.sum(df.mean() * weights) * 252)
    pf_volatility = (np.sqrt(np.dot(np.asarray(weights).T, np.dot(df.cov() * 252, weights))))
    # Calculate sharpe ratio
    pf_sharpe = pf_returns / pf_volatility
    return -pf_sharpe
# initial guess
x0 = [0.5]
df_returns = returns(rows = 10, names = ['A1', 'A2'])
# Optimization attempts
out = minimize(pf_sharpe, x0, method='SLSQP', bounds=[(0, 1)], args=(df_returns,))
optimal_weights = [out.x, 1-out.x]
print(optimal_weights)
print(-pf_sharpe(out.x, df_returns))
This returns an optimized Sharpe ratio of 6.16 (better than 5.3), with w1 practically 1 and w2 practically 0.
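The substitution w2 = 1 - w1 only works for two assets. For more assets, the constraint from the question (weights sum to 1) can be passed to SLSQP explicitly, and fixed inputs go through args. The following is a sketch under stated assumptions: the function name neg_sharpe, the three asset names, and the simulated returns are all made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

def neg_sharpe(weights, mean_returns, cov_matrix, periods=252):
    """Negative annualized Sharpe ratio (minimize this to maximize Sharpe)."""
    weights = np.asarray(weights)
    ret = weights @ mean_returns * periods
    vol = np.sqrt(weights @ (cov_matrix * periods) @ weights)
    return -ret / vol

# Simulated daily returns for three hypothetical assets
rng = np.random.default_rng(1234)
df_returns = pd.DataFrame(rng.normal(0.001, 0.01, size=(250, 3)),
                          columns=["A1", "A2", "A3"])

mean_returns = df_returns.mean().to_numpy()
cov_matrix = df_returns.cov().to_numpy()
n_assets = len(mean_returns)

# Equality constraint: weights sum to one; bounds: 0 <= w_i <= 1
constraints = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
bounds = [(0.0, 1.0)] * n_assets
x0 = np.full(n_assets, 1.0 / n_assets)

res = minimize(neg_sharpe, x0, args=(mean_returns, cov_matrix),
               method="SLSQP", bounds=bounds, constraints=constraints)
print(res.x, -res.fun)
```

The key point for the original attempts is that minimize receives the function object itself (neg_sharpe, not neg_sharpe(...)), with the variable being optimized as its first argument and everything else supplied via args.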
Given some randomly generated data with
2 columns,
50 rows and
integer range between 0-100
With R, the poisson glm and diagnostics plot can be achieved as such:
> col=2
> row=50
> range=0:100
> df <- data.frame(replicate(col,sample(range,row,rep=TRUE)))
> model <- glm(X2 ~ X1, data = df, family = poisson)
> glm.diag.plots(model)
In Python, this would give me the line predictor vs residual plot:
import numpy as np
import pandas as pd
import statsmodels.formula.api
from statsmodels.genmod.families import Poisson
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randint(100, size=(50,2)))
df.rename(columns={0:'X1', 1:'X2'}, inplace=True)
glm = statsmodels.formula.api.gee
model = glm("X2 ~ X1", groups=None, data=df, family=Poisson())
results = model.fit()
And to plot the diagnostics in Python:
model_fitted_y = results.fittedvalues # fitted values (need a constant term for intercept)
model_residuals = results.resid # model residuals
model_abs_resid = np.abs(model_residuals) # absolute residuals
plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)
plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'X2', data=df, lowess=True, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_xlabel('Line Predictor')
plot_lm_1.axes[0].set_ylabel('Residuals')
plt.show()
But when I try to get the cook statistics,
# cook's distance, from statsmodels internals
model_cooks = results.get_influence().cooks_distance[0]
it threw an error saying:
AttributeError Traceback (most recent call last)
<ipython-input-66-0f2bedfa1741> in <module>()
4 model_residuals = results.resid
5 # normalized residuals
----> 6 model_norm_residuals = results.get_influence().resid_studentized_internal
7 # absolute squared normalized residuals
8 model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
/opt/conda/lib/python3.6/site-packages/statsmodels/base/wrapper.py in __getattribute__(self, attr)
33 pass
34
---> 35 obj = getattr(results, attr)
36 data = results.model.data
37 how = self._wrap_attrs.get(attr)
AttributeError: 'GEEResults' object has no attribute 'get_influence'
Is there a way to plot out all 4 diagnostic plots in Python like in R?
How do I retrieve the cook statistics of the fitted model results in Python using statsmodels?
The generalized estimating equations (GEE) API should give you different results than R's GLM model estimation. To get similar estimates in statsmodels, you need to use something like:
import numpy as np
import pandas as pd
import statsmodels.api as sm
# Read data generated in R using pandas or something similar
df = pd.read_csv(...) # file name goes here
# Add a column of ones for the intercept to create input X
X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1) )
# Relabel dependent variable as y (standard notation)
y = df.X2
# Fit GLM in statsmodels using the Poisson family (log link by default)
sm.GLM(y, X, family = sm.families.Poisson()).fit().summary()
EDIT -- Here is the rest of the answer on how to get Cook's distance in Poisson regression. This is a script I wrote based on some data generated in R. I compared my values against those in R calculated using the cooks.distance function and the values matched.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import statsmodels.api as sm
PATH = '/Users/robertmilletich/test_reg.csv'
def _weight_matrix(fitted_model):
    """Calculates weight matrix in Poisson regression
    Parameters
    ----------
    fitted_model : statsmodel object
        Fitted Poisson model
    Returns
    -------
    W : 2d array-like
        Diagonal weight matrix in Poisson regression
    """
    return np.diag(fitted_model.fittedvalues)
def _hessian(X, W):
    """Hessian matrix calculated as -X'*W*X
    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates
    W : 2d array-like
        Weight matrix
    Returns
    -------
    hessian : 2d array-like
        Hessian matrix
    """
    return -np.dot(X.T, np.dot(W, X))
def _hat_matrix(X, W):
    """Calculate hat matrix = W^(1/2) * X * (X'*W*X)^(-1) * X'*W^(1/2)
    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates
    W : 2d array-like
        Diagonal weight matrix
    Returns
    -------
    hat : 2d array-like
        Hat matrix
    """
    # W^(1/2) (element-wise power is valid here because W is diagonal)
    Wsqrt = W**(0.5)
    # (X'*W*X)^(-1)
    XtWX = -_hessian(X = X, W = W)
    XtWX_inv = np.linalg.inv(XtWX)
    # W^(1/2)*X
    WsqrtX = np.dot(Wsqrt, X)
    # X'*W^(1/2)
    XtWsqrt = np.dot(X.T, Wsqrt)
    return np.dot(WsqrtX, np.dot(XtWX_inv, XtWsqrt))
def main():
    # Load data and separate into X and y
    df = pd.read_csv(PATH)
    X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1 ) )
    y = df.X2
    # Fit model
    model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    # Weight matrix
    W = _weight_matrix(model)
    # Hat matrix
    H = _hat_matrix(X, W)
    hii = np.diag(H) # Diagonal values of hat matrix
    # Pearson residuals
    r = model.resid_pearson
    # Cook's distance (formula used by R = (res/(1 - hat))^2 * hat/(dispersion * p))
    # Note: dispersion is 1 since we aren't modeling overdispersion; p = 2 parameters
    cooks_d = (r/(1 - hii))**2 * hii/(1*2)
    return cooks_d
if __name__ == "__main__":
    main()
As an update here
Since version 0.10, statsmodels also provides a get_influence method for GLMResults.
https://www.statsmodels.org/dev/examples/notebooks/generated/influence_glm_logit.html
for example:
Print influence and outlier measures for 10 observations with largest cook distance:
infl = res.get_influence(observed=False)
summ_df = infl.summary_frame()
summ_df.sort_values("cooks_d", ascending=False)[:10]
There are no combination plots, but influence plot infl.plot_influence() and index plot infl.plot_index(...) for any of the measures are available.
Generic influence measures for maximum likelihood models are, or will become, available for discrete and other models.
MLE influence measures are based on hessian, i.e. observed information matrix, while for GLM both expected information matrix and hessian versions are available.
In GLM, the distinction is only relevant when non-canonical links are used.
I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close) which I import into individual Pandas dataframes to perform analysis. I am new to Python and want to calculate a rolling 12-month beta for each stock. I found a post on calculating rolling beta (Python pandas calculate rolling stock beta using rolling apply to groupby object in vectorized fashion); however, when used in my code below it takes over 2.5 hours! Considering I can run the exact same calculations in SQL tables in under 3 minutes, this is too slow.
How can I improve the performance of my code below to match that of SQL? I understand Pandas/Python has that capability. My current method loops over each row, which I know slows performance, but I am unaware of any aggregate way to perform a rolling-window beta calculation on a dataframe.
Note: the first 2 steps of loading the CSVs into individual dataframes and calculating daily returns only take ~20 seconds. All my CSV dataframes are stored in the dictionary called 'FilesLoaded' with names such as 'XAO'.
Your help would be much appreciated!
Thank you :)
import pandas as pd, numpy as np
import datetime
import ntpath
pd.set_option('display.precision',10) #Set the Decimal Point precision to DISPLAY
start_time=datetime.datetime.now()
MarketIndex = 'XAO'
period = 250
MinBetaPeriod = period
# ***********************************************************************************************
# CALC RETURNS
# ***********************************************************************************************
for File in FilesLoaded:
    FilesLoaded[File]['Return'] = FilesLoaded[File]['Close'].pct_change()
# ***********************************************************************************************
# CALC BETA
# ***********************************************************************************************
def calc_beta(df):
    np_array = df.values
    m = np_array[:,0] # market returns are column zero from numpy array
    s = np_array[:,1] # stock returns are column one from numpy array
    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
    return beta
#Build Custom "Rolling_Apply" function
def rolling_apply(df, period, func, min_periods=None):
    if min_periods is None:
        min_periods = period
    result = pd.Series(np.nan, index=df.index)
    for i in range(1, len(df)+1):
        sub_df = df.iloc[max(i-period, 0):i,:]
        if len(sub_df) >= min_periods:
            idx = sub_df.index[-1]
            result[idx] = func(sub_df)
    return result
#Create empty BETA dataframe with same index as RETURNS dataframe
df_join = pd.DataFrame(index=FilesLoaded[MarketIndex].index)
df_join['market'] = FilesLoaded[MarketIndex]['Return']
df_join['stock'] = np.nan
for File in FilesLoaded:
    df_join['stock'].update(FilesLoaded[File]['Return'])
    df_join = df_join.replace(np.inf, np.nan) #get rid of infinite values "inf" (SQL won't take "Inf")
    df_join = df_join.replace(-np.inf, np.nan)#get rid of infinite values "inf" (SQL won't take "Inf")
    df_join = df_join.fillna(0) #get rid of the NaNs in the return data
    FilesLoaded[File]['Beta'] = rolling_apply(df_join[['market','stock']], period, calc_beta, min_periods = MinBetaPeriod)
# ***********************************************************************************************
# CLEAN-UP
# ***********************************************************************************************
print('Run-time: {0}'.format(datetime.datetime.now() - start_time))
Generate Random Stock Data
20 Years of Monthly Data for 4,000 Stocks
dates = pd.date_range('1995-12-31', periods=480, freq='M', name='Date')
stoks = pd.Index(['s{:04d}'.format(i) for i in range(4000)])
df = pd.DataFrame(np.random.rand(480, 4000), dates, stoks)
df.iloc[:5, :5]
Roll Function
Returns groupby object ready to apply custom functions
See Source
def roll(df, w):
    # stack df.values w-times shifted once at each stack
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    # roll_array is now a 3-D array and can be read into
    # a pandas panel object
    # (note: pd.Panel was removed in pandas 0.25, so this needs an older pandas)
    panel = pd.Panel(roll_array,
                     items=df.index[w-1:],
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    # convert to dataframe and pivot + groupby
    # is now ready for any action normally performed
    # on a groupby object
    return panel.to_frame().unstack().T.groupby(level=0)
Beta Function
Use closed form solution of OLS regression
Assume column 0 is market
See Source
def beta(df):
    # first column is the market
    X = df.values[:, [0]]
    # prepend a column of ones for the intercept
    X = np.concatenate([np.ones_like(X), X], axis=1)
    # matrix algebra
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values[:, 1:])
    return pd.Series(b[1], df.columns[1:], name='Beta')
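For reference, the closed form used in beta can be written out as follows. The notation is my own, not the answer's: m is the market column within one window, y is one stock column, and bars denote window means. The identity for the slope assumes the market is not constant within the window.

```latex
X = \begin{bmatrix} \mathbf{1} & m \end{bmatrix}, \qquad
\hat{b} = (X^{\top} X)^{+} X^{\top} y
        = \begin{bmatrix} \hat{\alpha} \\ \hat{\beta} \end{bmatrix}, \qquad
\hat{\beta} = \frac{\sum_{t} (m_t - \bar{m})(y_t - \bar{y})}{\sum_{t} (m_t - \bar{m})^{2}}
            = \frac{\operatorname{Cov}(m, y)}{\operatorname{Var}(m)}
```

The function returns b[1], the second row of the coefficient matrix, which is this slope for every stock column at once.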
Demonstration
rdf = roll(df, 12)
betas = rdf.apply(beta)
Timing
Validation
Compare calculations with OP
def calc_beta(df):
    np_array = df.values
    m = np_array[:,0] # market returns are column zero from numpy array
    s = np_array[:,1] # stock returns are column one from numpy array
    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
    return beta
print(calc_beta(df.iloc[:12, :2]))
-0.311757542437
print(beta(df.iloc[:12, :2]))
s0001 -0.311758
Name: Beta, dtype: float64
Note the first cell
Is the same value as validated calculations above
betas = rdf.apply(beta)
betas.iloc[:5, :5]
Response to comment
Full working example with simulated multiple dataframes
num_sec_dfs = 4000
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(480, 4), dates, cols) for i in range(num_sec_dfs)}
market = pd.Series(np.random.rand(480), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(axis=1)
betas = roll(df.pct_change().dropna(), 12).apply(beta)
# .items() replaces .iteritems(), which was removed in pandas 2.0
for c, col in betas.items():
    dfs[c]['Beta'] = col
dfs['s0001'].head(20)
Using a generator to improve memory efficiency
Simulated data
m, n = 480, 10000
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
stocks = pd.Index(['s{:04d}'.format(i) for i in range(n)])
df = pd.DataFrame(np.random.rand(m, n), dates, stocks)
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([df, market], axis=1)
Beta Calculation
def beta(df, market=None):
    # If the market values are not passed,
    # I'll assume they are located in a column
    # named 'Market'. If not, this will fail.
    if market is None:
        market = df['Market']
        df = df.drop('Market', axis=1)
    X = market.values.reshape(-1, 1)
    X = np.concatenate([np.ones_like(X), X], axis=1)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
    return pd.Series(b[1], df.columns, name=df.index[-1])
roll function
This returns a generator and will be far more memory efficient
def roll(df, w):
    for i in range(df.shape[0] - w + 1):
        yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)
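To make the memory point concrete, here is a small sketch of how the generator behaves on a toy frame. The generator is repeated so the snippet runs on its own; the frame toy and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

def roll(df, w):
    # same generator as above: one small DataFrame per window
    for i in range(df.shape[0] - w + 1):
        yield pd.DataFrame(df.values[i:i + w, :], df.index[i:i + w], df.columns)

# toy data: 5 rows, so a window of 3 gives 5 - 3 + 1 = 3 windows
toy = pd.DataFrame(np.arange(10.0).reshape(5, 2), columns=["a", "Market"])

windows = roll(toy, 3)               # nothing is computed yet
first = next(windows)                # only the first window is built here
remaining = sum(1 for _ in windows)  # the other windows, one at a time

print(first.shape, remaining)
```

Only one window exists in memory at any moment, which is why this scales better than materializing every window up front.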
Putting it all together
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
Validation
OP beta calc
def calc_beta(df):
    np_array = df.values
    m = np_array[:,0] # market returns are column zero from numpy array
    s = np_array[:,1] # stock returns are column one from numpy array
    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
    return beta
Experiment setup
m, n = 12, 2
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(m, 4), dates, cols) for i in range(n)}
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(axis=1)
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
# .items() replaces .iteritems(), which was removed in pandas 2.0
for c, col in betas.items():
    dfs[c]['Beta'] = col
dfs['s0000'].head(20)
calc_beta(df[['Market', 's0000']])
0.0020118230147777435
NOTE:
The calculations are the same
While efficient subdivision of the input data set into rolling windows is important to the optimization of the overall calculations, the performance of the beta calculation itself can also be significantly improved.
The following optimizes only the subdivision of the data set into rolling windows:
import numpy
from pandas import DataFrame
def numpy_betas(x_name, window, returns_data, intercept=True):
    if intercept:
        # column of ones for the intercept term
        ones = numpy.ones(window)
    def lstsq_beta(window_data):
        x_data = numpy.vstack([window_data[x_name], ones]).T if intercept else window_data[[x_name]]
        # rcond=None avoids the FutureWarning raised by NumPy 1.14+ when rcond is omitted
        beta_arr, residuals, rank, s = numpy.linalg.lstsq(x_data, window_data, rcond=None)
        # row 0 holds the slopes on x_name; row 1 (if present) holds the intercepts
        return beta_arr[0]
    indices = range(returns_data.shape[0] - window + 1)
    return DataFrame(
        data=[lstsq_beta(returns_data.iloc[i:(i + window)]) for i in indices],
        columns=list(returns_data.columns),
        index=returns_data.index[window - 1:],
    )
The following also optimizes the beta calculation itself:
def custom_betas(x_name, window, returns_data):
    window_inv = 1.0 / window
    x_sum = returns_data[x_name].rolling(window, min_periods=window).sum()
    y_sum = returns_data.rolling(window, min_periods=window).sum()
    xy_sum = returns_data.mul(returns_data[x_name], axis=0).rolling(window, min_periods=window).sum()
    xx_sum = numpy.square(returns_data[x_name]).rolling(window, min_periods=window).sum()
    xy_cov = xy_sum - window_inv * y_sum.mul(x_sum, axis=0)
    x_var = xx_sum - window_inv * numpy.square(x_sum)
    betas = xy_cov.divide(x_var, axis=0)[window - 1:]
    betas.columns.name = None
    return betas
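The rolling-sums trick works because sum(xy) - sum(x)*sum(y)/n and sum(x^2) - sum(x)^2/n are the covariance and variance up to the same constant factor, which cancels in the ratio. A quick sanity check of that identity on made-up data (the series x and y here are random, purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
window = 12
x = pd.Series(rng.normal(size=60))            # hypothetical market returns
y = pd.Series(0.5 * x + rng.normal(size=60))  # hypothetical stock returns

# rolling sums, in the same spirit as custom_betas
x_sum = x.rolling(window).sum()
y_sum = y.rolling(window).sum()
xy_sum = (x * y).rolling(window).sum()
xx_sum = (x * x).rolling(window).sum()

# (sum(xy) - sum(x) sum(y) / n) / (sum(x^2) - sum(x)^2 / n)
sums_beta = (xy_sum - x_sum * y_sum / window) / (xx_sum - x_sum ** 2 / window)

# reference: covariance / variance over the last window only
cov = np.cov(y.iloc[-window:], x.iloc[-window:])
reference_beta = cov[0, 1] / cov[1, 1]

print(np.isclose(sums_beta.iloc[-1], reference_beta))
```

The first window - 1 entries of sums_beta are NaN because the rolling sums are not yet defined there.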
Comparing the performance of the two different calculations, you can see that as the window used in the beta calculation increases, the second method dramatically outperforms the first:
Compared with @piRSquared's implementation, the custom method takes roughly 350 ms to evaluate, versus over 2 seconds.
The following further optimizes @piRSquared's implementation for both speed and memory. The code is also simplified for clarity.
from numpy import nan, ndarray, ones_like, vstack, random
from numpy.lib.stride_tricks import as_strided
from numpy.linalg import pinv
from pandas import DataFrame, date_range
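def calc_beta(s: ndarray, m: ndarray):
    # regress s on [1, m]; row 1 of the coefficients is the slope (beta)
    x = vstack((ones_like(m), m))
    b = pinv(x.dot(x.T)).dot(x).dot(s)
    return b[1]
def rolling_calc_beta(s_df: DataFrame, m_df: DataFrame, period: int):
    result = ndarray(shape=s_df.shape, dtype=float)
    l, w = s_df.shape
    ls, ws = s_df.values.strides
    # take the market stride from the market array itself; its memory layout can differ from the stocks array
    ms = m_df.values.strides[0]
    # rows before the first full window have no beta
    result[0:period - 1, :] = nan
    s_arr = as_strided(s_df.values, shape=(l - period + 1, period, w), strides=(ls, ls, ws))
    m_arr = as_strided(m_df.values, shape=(l - period + 1, period), strides=(ms, ms))
    # window i covers rows i .. i + period - 1, so its beta is stored at row i + period - 1
    for row in range(period - 1, l):
        result[row, :] = calc_beta(s_arr[row - period + 1, :], m_arr[row - period + 1])
    return DataFrame(data=result, index=s_df.index, columns=s_df.columns)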
if __name__ == '__main__':
    num_sec_dfs, num_periods = 4000, 480
    dates = date_range('1995-12-31', periods=num_periods, freq='M', name='Date')
    stocks = DataFrame(data=random.rand(num_periods, num_sec_dfs), index=dates,
                       columns=['s{:04d}'.format(i) for i in
                                range(num_sec_dfs)]).pct_change()
    market = DataFrame(data=random.rand(num_periods), index=dates, columns=
                       ['Market']).pct_change()
    betas = rolling_calc_beta(stocks, market, 12)
%timeit betas = rolling_calc_beta(stocks, market, 12)
335 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
HERE'S THE SIMPLEST AND FASTEST SOLUTION
The accepted answer was too slow for what I needed, and I didn't understand the math behind the solutions asserted as faster. They also gave different answers, though in fairness I probably just messed it up.
I don't think you need a custom rolling function to calculate beta with pandas 1.1.4 (or even since at least 0.19). The code below assumes the data is in the same format as in the answers above: a pandas DataFrame with a date index, percent returns of some periodicity for the stocks, and the market returns in a column named 'Market'.
If you don't have this format, I recommend joining the stock returns to the market returns to ensure the same index with:
# Use .pct_change() only if joining Close data
beta_data = stock_data.join(market_data, how='inner').pct_change().dropna()
After that, it's just covariance divided by variance.
ticker_covariance = beta_data.rolling(window).cov()
# Limit results to the stock (i.e. column name for the stock) vs. 'Market' covariance
ticker_covariance = ticker_covariance.loc[pd.IndexSlice[:, stock], 'Market'].dropna()
benchmark_variance = beta_data['Market'].rolling(window).var().dropna()
beta = ticker_covariance / benchmark_variance
NOTES: If you have a multi-index, you'll have to drop the non-date levels to use the rolling().apply() solution. I only tested this for one stock and one market. If you have multiple stocks, a modification to the ticker_covariance equation after .loc is probably needed. Last, if you want to calculate beta values for the periods before the full window is available (e.g. stock_data begins 1 year ago, but you use a 3-year window), you can change the above to an expanding (instead of rolling) window with the same calculation and then .combine_first() the two.
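For the multiple-stock case mentioned in the notes, one possible modification is to pass the market series directly to rolling().cov(), which then computes each column's rolling covariance with it. This is only a sketch checked on the toy data below: the function name rolling_betas and the random returns are made up for illustration, and it assumes the same layout as above (returns in columns, market in a column named 'Market').

```python
import numpy as np
import pandas as pd

def rolling_betas(beta_data, window, market_col="Market"):
    """Rolling beta of every non-market column against the market column."""
    market = beta_data[market_col]
    stocks = beta_data.drop(columns=market_col)
    # with a Series as `other`, rolling cov is computed column by column
    covariance = stocks.rolling(window).cov(market)
    variance = market.rolling(window).var()
    # divide each stock column by the market variance, aligned on the date index
    return covariance.div(variance, axis=0)

# toy returns for one market and two stocks (random, for illustration only)
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=36, freq="D")
beta_data = pd.DataFrame(rng.normal(size=(36, 3)), index=dates,
                         columns=["Market", "s0000", "s0001"])

betas = rolling_betas(beta_data, window=12)
print(betas.tail(3))
```

The first window - 1 rows are NaN, as with any rolling calculation; the expanding-window idea from the notes can be used to fill them.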
I created a simple Python package, finance-calculator, based on NumPy and pandas, to calculate financial ratios including beta. I am using the simple formula (as per Investopedia):
beta = covariance(returns, benchmark returns) / variance(benchmark returns)
Covariance and variance are directly calculated in pandas which makes it fast. Using the api in the package is also simple:
import finance_calculator as fc
beta = fc.get_beta(scheme_data, benchmark_data, tail=False)
which will give you a DataFrame of dates and beta values, or only the last beta value if tail is True.
These approaches, however, become a bottleneck when you need beta across all dates (m) for multiple stocks (n), which requires m x n calculations.
Running each date or stock on a separate core gives some relief, but you end up needing a lot of hardware.
Most of the run time in the available solutions goes to computing the variance and covariance. NaN values must also be removed from both the index and stock data for a correct calculation (as of pandas==0.23.0).
Recomputing everything on every run is therefore wasteful unless the calculations are cached.
The NumPy variance and covariance version also miscalculates beta if NaN values are not dropped.
A Cython implementation is a must for very large data sets.
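To illustrate the NaN point: np.cov propagates NaN, so a single missing return in a window makes the whole beta NaN unless incomplete pairs are dropped first. A minimal sketch with made-up numbers (the arrays and the helper name cov_beta are purely illustrative):

```python
import numpy as np

# made-up returns; each series has one missing value
market = np.array([0.01, -0.02, 0.015, np.nan, 0.03])
stock = np.array([0.02, -0.01, 0.010, 0.005, np.nan])

def cov_beta(s, m):
    # beta = cov(stock, market) / var(market)
    c = np.cov(s, m)
    return c[0, 1] / c[1, 1]

naive = cov_beta(stock, market)  # NaN: the missing values propagate

# keep only the dates where both series have a value
valid = ~(np.isnan(stock) | np.isnan(market))
clean = cov_beta(stock[valid], market[valid])

print(naive, clean)
```

Dropping pairs this way changes the effective window length, so it is worth deciding up front how many valid observations a window needs before its beta is reported.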