I am trying to build a minimum-variance portfolio from one year of data, and then rebalance it every month, recomputing the covariance matrix each time. (My dataset starts in 1992 and ends in 2017.)
I wrote the following code, which works when it is not in a loop. Inside the loop, however, inverting the covariance matrix fails because the matrix is singular. I don't understand why this happens, since I reset every variable at the end of each iteration.
### Importing the necessary libraries ###
import pandas as pd
import numpy as np
from numpy.linalg import inv
### Importing the dataset ###
df = pd.read_csv("UK_Returns.csv", sep = ";")
df.set_index('Date', inplace = True)
### Define variables ###
stocks = df.shape[1]
returns = []
vol = []
weights_p =[]
### for loop to compute portfolio and rebalance every 30 days ###
for i in range (0,288):
    a = i*30
    b = i*30 + 252
    portfolio = df[a:b]
    mean_ret = ((1+portfolio.mean())**252)-1
    var_cov = portfolio.cov()*252
    inv_var_cov = inv(var_cov)
    doit = 0
    weights = np.dot(np.ones((1,stocks)),inv_var_cov)/(np.dot(np.ones((1,stocks)),np.dot(inv_var_cov,np.ones((stocks,1)))))
    ret = np.dot(weights, mean_ret)
    std = np.sqrt(np.dot(weights, np.dot(var_cov, weights.T)))
    returns.append(ret)
    vol.append(std)
    weights_p.append(weights)
    weights = []
    var_cov = np.zeros((stocks,stocks))
    inv_var_cov = np.zeros((stocks,stocks))
    i+=1
Does anyone have an idea how to solve this issue?
The error it yields is the following:
---------------------------------------------------------------------------
LinAlgError Traceback (most recent call last)
<ipython-input-17-979efdd1f5b2> in <module>()
21 mean_ret = ((1+portfolio.mean())**252)-1
22 var_cov = portfolio.cov()*252
---> 23 inv_var_cov = inv(var_cov)
24 doit = 0
25 weights = np.dot(np.ones((1,stocks)),inv_var_cov)/(np.dot(np.ones((1,stocks)),np.dot(inv_var_cov,np.ones((stocks,1)))))
<__array_function__ internals> in inv(*args, **kwargs)
1 frames
/usr/local/lib/python3.6/dist-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
95
96 def _raise_linalgerror_singular(err, flag):
---> 97 raise LinAlgError("Singular matrix")
98
99 def _raise_linalgerror_nonposdef(err, flag):
LinAlgError: Singular matrix
Thank you so much for any help you can provide me with!
The data is shared in the following google drive: https://drive.google.com/file/d/1-Bw7cowZKCNU4JgNCitmblHVw73ORFKR/view?usp=sharing
It would be better to identify what is causing the singularity of the matrix, but there are ways of living with singular matrices.
Try the pseudoinverse, np.linalg.pinv(). It is guaranteed to always exist.
See pinv
Another way around it is to avoid computing the inverse matrix at all.
Just find the least-squares solution of the system. See lstsq
Because var_cov is symmetric, you can replace np.dot(X, inv_var_cov), where X is a row vector, with
np.linalg.lstsq(var_cov, X.T, rcond=None)[0].T
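Both suggestions can be sketched on a deliberately singular covariance matrix. This is a minimal illustration, not the asker's data: the return matrix is random, the duplicated column is an assumption chosen to force singularity, and the explicit cutoffs (`tol`, `rcond=1e-10`) are conservative choices rather than NumPy defaults. Possible causes worth checking in the real data (not verified here) include windows with fewer rows than stocks, for example when the loop steps past the end of the dataset, and constant or duplicated columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical window: 252 daily returns for 5 assets.
returns = rng.normal(0.0, 0.01, size=(252, 5))
returns[:, 4] = returns[:, 3]  # duplicated asset, so the covariance is singular

cov = np.cov(returns, rowvar=False) * 252
ones = np.ones(cov.shape[0])

# Diagnose first: a rank below the number of assets means inv() will fail.
rank = np.linalg.matrix_rank(cov, tol=1e-10)
print("rank:", rank, "of", cov.shape[0])

# Option 1: pseudoinverse in place of the inverse.
w_pinv = np.linalg.pinv(cov, rcond=1e-10) @ ones
w_pinv /= w_pinv.sum()

# Option 2: solve cov @ z = ones in the least-squares sense, no inverse formed.
z, *_ = np.linalg.lstsq(cov, ones, rcond=1e-10)
w_lstsq = z / z.sum()

print(w_pinv)
print(w_lstsq)
```

Both routes return the minimum-norm solution, so the two weight vectors agree and the duplicated assets receive equal weight. Whether that is an acceptable portfolio is a modelling decision; removing the offending columns is often the cleaner fix.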
I am attempting to fit a function to a set of data I have. The function in question is:
x(t) = - B + sqrt(AB(t-t0) + (x0 + B)^2)
I have tried to fit my data (included at the bottom) to this using two different methods but have found that whatever I do the fit for B is extremely unstable. Changing either the method or the initial guess wildly changes the output value. In addition, when I look at the error for this fit using curve_fit the error is almost two orders of magnitude higher than the value. Does anyone have some suggestions on what I should do to decrease the error?
import numpy as np
import scipy.optimize as spo
def modelFun(t,A,B):
    return -B + np.sqrt(A*B*(t-t0) + np.power(x0 + B,2))
def errorFun(k,time,data):
    A = k[0]
    B = k[1]
    return np.sum((data-modelFun(time,A,B))**2)
data = np.genfromtxt('testdata.csv',delimiter=',',skip_header = 1)
time = data[:,0]
xt = data[:,1]
t0 = data[0,0]
x0 = data[0,1]
minErrOut = spo.minimize(errorFun,[1,1000],args=(time,xt),bounds=((0,None),(0,None)))
(curveOut, curveCovar) = spo.curve_fit(modelFun,time,xt,p0=[1,1000],method='dogbox',bounds=([-np.inf,0],np.inf))
print('minimize result: A={}; B={}'.format(*minErrOut.x))
print('curveFit result: A={}; B={}'.format(*curveOut))
print('curveFit Error: A={}; B={}'.format(*np.sqrt(np.diag(curveCovar))))
Datafile:
Time,x
201,2.67662
204,3.28159
206,3.44378
208,3.72537
210,3.94826
212,4.36716
214,4.65373
216,5.26766
219,5.59502
221,6
223,6.22189
225,6.49652
227,6.799
229,7.30846
231,7.54229
232,7.76517
233,7.6209
234,7.89552
235,7.94826
236,8.17015
237,8.66965
238,8.66965
239,8.8398
240,8.88856
241,9.00697
242,9.45075
243,9.51642
244,9.63483
245,9.63483
246,10.07861
247,10.02687
248,10.24876
249,10.31443
250,10.47164
251,10.99502
252,10.92935
253,11.0995
254,11.28358
255,11.58209
256,11.53035
257,11.62388
258,11.93632
259,11.98806
260,12.26269
261,12.43284
262,12.60299
263,12.801
264,12.99502
265,13.08557
266,13.25572
267,13.32139
268,13.57114
269,13.76617
270,13.88358
271,13.83184
272,14.10647
273,14.27662
274,14.40796
TL;DR
Your dataset is essentially linear and lacks observations at larger time scales. You can therefore capture A, which is proportional to the slope, while the model has to keep B large (and potentially unbounded) to suppress the square-root trend.
This can be confirmed by expanding your model as a Taylor series and analyzing the MSE surface associated with the regression.
In a nutshell, with this kind of dataset and the given model, accept A but don't trust B.
MCVE
First, let's reproduce your problem:
import io
import numpy as np
import pandas as pd
from scipy import optimize
stream = io.StringIO("""Time,x
201,2.67662
204,3.28159
...
273,14.27662
274,14.40796""")
data = pd.read_csv(stream)
# Origin Shift:
data = data.sub(data.iloc[0,:])
data = data.set_index("Time")
# Simplified model:
def model(t, A, B):
    return -B + np.sqrt(A*B*t + np.power(B, 2))
# NLLS Fit:
parameters, covariance = optimize.curve_fit(model, data.index, data["x"].values, p0=(1, 1000), ftol=1e-8)
# array([3.23405915e-01, 1.59960168e+05])
# array([[ 3.65068730e-07, -3.93410484e+01],
# [-3.93410484e+01, 9.77198860e+12]])
The resulting fit is fair (plot not reproduced here).
But as you noticed, the model parameters differ by several orders of magnitude, which can prevent the optimization from performing properly.
Notice that your dataset is quite linear. The observed effect is not surprising and is inherent to the chosen model. The B parameter must be several orders of magnitude larger than A to preserve the linear behaviour.
This claim is supported by the analysis of the first terms of the Taylor series:
def taylor(t, A, B):
    return (A/2*t - A**2/B*t**2/8)
parameters, covariance = optimize.curve_fit(taylor, data.index, data["x"].values, p0=(1, 100), ftol=1e-8)
parameters
# array([3.23396685e-01, 1.05237134e+09])
Unsurprisingly, the slope of your linear dataset can be captured, while the parameter B can be arbitrarily large and will cause floating-point arithmetic errors during optimization (hence the minimize warning you got, shown below).
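The expansion behind `taylor` can be sketched as follows, assuming the origin-shifted model (t0 = 0, x0 = 0), B > 0, and |At/B| < 1 so that the square-root series converges:

```latex
\begin{aligned}
x(t) &= -B + \sqrt{ABt + B^{2}}
      = B\left(\sqrt{1 + \frac{At}{B}} - 1\right) \\
     &= B\left(\frac{1}{2}\,\frac{At}{B}
        - \frac{1}{8}\left(\frac{At}{B}\right)^{2}
        + \mathcal{O}\!\left(\left(\frac{At}{B}\right)^{3}\right)\right) \\
     &= \frac{A}{2}\,t - \frac{A^{2}}{8B}\,t^{2}
        + \mathcal{O}\!\left(\frac{A^{3}t^{3}}{B^{2}}\right)
\end{aligned}
```

The linear term depends on A only, while B first appears in the quadratic term with a factor 1/B. On nearly linear data that term carries almost no information, so any sufficiently large B fits about equally well.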
Analyzing Error Surface
The problem can be reformulated as a minimization problem:
def objective(beta, t, x):
    return np.sum(np.power(model(t, beta[0], beta[1]) - x, 2))
result = optimize.minimize(objective, (1, 100), args=(data.index, data["x"].values))
# fun: 0.6594398116927569
# hess_inv: array([[8.03062155e-06, 2.94644208e-04],
# [2.94644208e-04, 1.14979735e-02]])
# jac: array([2.07304955e-04, 6.40749931e-07])
# message: 'Desired error not necessarily achieved due to precision loss.'
# nfev: 389
# nit: 50
# njev: 126
# status: 2
# success: False
# x: array([3.24090627e-01, 2.11891188e+03])
If we plot the MSE associated with your dataset, we get the following surface:
We have a canyon that is narrow along A but appears unbounded along B, at least over the first few decades. This supports the observations in your post and comments, and gives a technical insight into why B cannot be fitted properly.
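A surface of this kind can be computed on a grid, as in the sketch below. The data here are a hypothetical stand-in (an exactly linear trend with slope 0.16 over the same time span), not the original dataset, and the grid ranges are arbitrary choices for illustration.

```python
import numpy as np

def model(t, A, B):
    # Origin-shifted model from the answer.
    return -B + np.sqrt(A * B * t + B**2)

# Hypothetical stand-in for the origin-shifted data: a purely linear trend.
t = np.linspace(0.0, 73.0, 58)
x = 0.16 * t

A_grid = np.linspace(0.2, 0.5, 61)   # linear grid on A
B_grid = np.logspace(2, 8, 61)       # log grid on B (six decades)
AA, BB = np.meshgrid(A_grid, B_grid, indexing="ij")

# Evaluate the model for every (A, B) pair at every time point.
residuals = model(t[None, None, :], AA[..., None], BB[..., None]) - x
mse = np.mean(residuals**2, axis=-1)   # shape (len(A_grid), len(B_grid))

# For each B, find the A that minimizes the MSE and the MSE reached there.
best_A = A_grid[np.argmin(mse, axis=0)]
floor = mse.min(axis=0)

print(best_A[-1])          # best A at the largest B
print(floor[[0, 30, 60]])  # canyon floor at B = 1e2, 1e5, 1e8
```

Passing `mse` to a contour plot with a logarithmic B axis shows the canyon. On linear data the floor keeps dropping slowly as B grows, which is why an optimizer keeps pushing B upward.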
Performing the same operation on a synthetic dataset:
t = np.linspace(0, 1000, 100)
x = model(t, 0.35, 20)
data = pd.DataFrame(x, index=t, columns=["x"])
This dataset has the square-root shape in addition to the linear trend at the beginning.
result = optimize.minimize(objective, (1, 0), args=(data.index, data["x"].values), tol=1e-8)
# fun: 1.9284246829733202e-10
# hess_inv: array([[ 4.34760333e-05, -4.42855253e-03],
# [-4.42855253e-03, 4.59219063e-01]])
# jac: array([ 4.35726463e-03, -2.19158602e-05])
# message: 'Desired error not necessarily achieved due to precision loss.'
# nfev: 402
# nit: 94
# njev: 130
# status: 2
# success: False
# x: array([ 0.34999987, 20.000013 ])
This version of the problem has the following MSE surface:
It shows a potentially convex valley around the known solution, which explains why you are able to fit both parameters when the acquisition covers a sufficiently long time span.
Notice that the valley is strongly stretched, meaning that in this scenario your problem will benefit from normalization.
I am computing these derivatives using the Monte Carlo approach for a generic call option. I am interested in the mixed derivative (with respect to both S and sigma). When I do this with algorithmic differentiation, I get the error shown at the end of this post. What could be a possible solution? To explain the code, I am attaching the formula used to compute the "X" in the code below:
from jax import jit, grad, vmap
import jax.numpy as jnp
from jax import random
import numpy as np  # needed for np.random.RandomState below
Underlying_asset = jnp.linspace(1.1,1.4,100)
volatilities = jnp.linspace(0.5,0.6,100)
def second_derivative_mc(S,vol):
    N = 100
    j,T,q,r,k = 10000,1.,0,0,1.
    S0 = jnp.array([S]).T #(Nx1) vector underlying asset
    C = jnp.identity(N)*vol #matrix of volatilities with 0 outside diagonal
    e = jnp.array([jnp.full(j,1.)])#(1xj) vector of "1"
    Rand = np.random.RandomState()
    Rand.seed(10)
    U= Rand.normal(0,1,(N,j)) #Random number for Brownian Motion
    sigma2 = jnp.array([vol**2]).T #Vector of variance Nx1
    first = jnp.dot(sigma2,e) #First part equation
    second = jnp.dot(C,U) #Second part equation
    X = -0.5*first+jnp.sqrt(T)*second
    St = jnp.exp(X)*S0
    P = jnp.maximum(St-k,0)
    payoff = jnp.average(P, axis=-1)*jnp.exp(-q*T)
    return payoff
greek = vmap(grad(grad(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset,volatilities)
This is the error message:
> UnfilteredStackTrace Traceback (most recent call
> last) <ipython-input-78-0cc1da97ae0c> in <module>()
> 25
> ---> 26 greek = vmap(grad(grad(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset,volatilities)
>
> 18 frames UnfilteredStackTrace: TypeError: Gradient only defined for
> scalar-output functions. Output had shape: (100,).
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
The above exception was the direct cause of the following exception:
> TypeError Traceback (most recent call
> last) /usr/local/lib/python3.7/dist-packages/jax/_src/api.py in
> _check_scalar(x)
> 894 if isinstance(aval, ShapedArray):
> 895 if aval.shape != ():
> --> 896 raise TypeError(msg(f"had shape: {aval.shape}"))
> 897 else:
> 898 raise TypeError(msg(f"had abstract value {aval}"))
> TypeError: Gradient only defined for scalar-output functions. Output had shape: (100,).
As the error message indicates, gradients can only be computed for functions that return a scalar. Your function returns a vector:
print(len(second_derivative_mc(1.1, 0.5)))
# 100
For vector-valued functions, you can compute the jacobian (which is similar to a multi-dimensional gradient). Is this what you had in mind?
from jax import jacobian
greek = vmap(jacobian(jacobian(second_derivative_mc, argnums=1), argnums=0))(Underlying_asset,volatilities)
Also, this is not what you asked about, but the function above will probably not work as you intend even if you solve the issue in the question. Numpy RandomState objects are stateful, and thus will generally not work correctly with jax transforms like grad, jit, vmap, etc., which require side-effect-free code (see Stateful Computations In JAX). You might try using jax.random instead; see JAX: Random Numbers for more information.
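As a rough sketch of both points, the function below returns a scalar price for one (S, vol) pair and draws its random numbers from `jax.random`, so `grad` can be nested and then mapped with `vmap`. The name `mc_call_price`, the single-asset simplification, and the explicit `key` argument are assumptions for illustration, not the asker's original model.

```python
import jax
import jax.numpy as jnp

def mc_call_price(S, vol, key, n_paths=10_000, T=1.0, k=1.0, r=0.0, q=0.0):
    """Scalar Monte Carlo price of a call option for one (S, vol) pair."""
    # jax.random is functional: the same key always yields the same draws.
    z = jax.random.normal(key, (n_paths,))
    drift = (r - q - 0.5 * vol**2) * T
    ST = S * jnp.exp(drift + vol * jnp.sqrt(T) * z)
    payoff = jnp.maximum(ST - k, 0.0)
    return jnp.exp(-r * T) * jnp.mean(payoff)  # scalar output, so grad applies

key = jax.random.PRNGKey(10)

# Mixed second derivative: d/dS of d(price)/d(vol).
mixed = jax.grad(jax.grad(mc_call_price, argnums=1), argnums=0)

spots = jnp.linspace(1.1, 1.4, 100)
vols = jnp.linspace(0.5, 0.6, 100)

# Map over paired (S, vol) values; the key is shared across all pairs.
greek = jax.vmap(mixed, in_axes=(0, 0, None))(spots, vols, key)
print(greek.shape)
```

One caveat: the payoff has a kink at the strike, and automatic differentiation treats the derivative of the in-the-money indicator as zero. Second-order sensitivities obtained this way can therefore be biased, so the values are best treated as illustrative.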
I am trying to use the pandas ewm function to calculate exponentially weighted moving averages. However, I've noticed that information seems to carry through the entire time series. This means that every data point's moving average depends on a different number of previous data points, so the ewm function is mathematically different at every data point.
I think someone here had a similar question:
Does Pandas calculate ewm wrong?
But I did try their method, and I am not getting the functionality I want.
def EMA(arr, window):
    sma = arr.rolling(window=window, min_periods=window).mean()[:window]
    rest = arr[window:]
    return pd.concat([sma, rest]).ewm(com=window, adjust=False).mean()
a = pd.DataFrame([x for x in range(100)])
print(list(EMA(a, 10)[0])[-1])
print(list(EMA(a[50:], 10)[0])[-1])
In this example, I have an array of 0 through 99. I calculate moving averages on this array and on the sub-array 50 through 99. The last moving average should be the same in both cases, since I am using a window of only 10. But when I run this code I get two different values, indicating that ewm does depend on the entire series.
IIUC, you are asking for ewm in a rolling window, which means every window of 10 rows returns a single number. If that is the case, we can use a stride trick:
Edit: the updated function works on Series only.
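def EMA(arr, window=10, alpha=0.5):
    ret = pd.Series(index=arr.index, name=arr.name)
    arr=np.array(arr)
    l = len(arr)
    stride = arr.strides[0]
    # Each row of the strided view is one window; after .T each column is a window.
    # Note: ewm's first positional argument is `com`, so `alpha` is passed as com here.
    ret.iloc[window-1:] = (pd.DataFrame(np.lib.stride_tricks.as_strided(arr,
                                                                         (l-window+1,window),
                                                                         (stride,stride)))
                           .T.ewm(alpha)
                           .mean()
                           .iloc[-1]
                           .values
                          )
    return ret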
Test:
a = pd.Series([x for x in range(100)])
EMA(a).tail(2)
# 98 97.500169
# 99 98.500169
# Name: 9, dtype: float64
EMA(a[50:]).tail(2)
# 98 97.500169
# 99 98.500169
# Name: 9, dtype: float64
EMA(a, 2).tail(2)
# 98 97.75
# 99 98.75
# dtype: float64
Test on random data:
a = pd.Series(np.random.uniform(0,1,10000))
fig, ax = plt.subplots(figsize=(12,6))
a.plot(ax=ax)
EMA(a,alpha=0.99, window=2).plot(ax=ax)
EMA(a,alpha=0.99, window=1500).plot(ax=ax)
plt.show()
Output: we can see that the larger window (green) is less volatile than the smaller window (orange).
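The same fixed-window idea can be sketched without stride tricks by applying explicitly normalized weights inside `rolling`. The name `windowed_ema` is hypothetical, and `alpha` here is the actual smoothing factor, so the numbers differ from the outputs listed above.

```python
import numpy as np
import pandas as pd

def windowed_ema(series, window=10, alpha=0.5):
    """Exponentially weighted mean over the last `window` points only."""
    # Weights ordered oldest to newest, so the newest point gets weight (1 - alpha)**0.
    weights = (1.0 - alpha) ** np.arange(window - 1, -1, -1)
    weights /= weights.sum()
    # raw=True hands each window to the lambda as a plain NumPy array.
    return series.rolling(window).apply(lambda w: np.dot(w, weights), raw=True)

a = pd.Series(np.arange(100, dtype=float))
full = windowed_ema(a)
tail = windowed_ema(a[50:])

# Both values use only the last 10 points, so they agree.
print(full.iloc[-1], tail.iloc[-1])
```

`rolling(...).apply` calls a Python function per window, so this is slower than the strided version on long series. Its advantage is that the weights are explicit and easy to check.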
This can be achieved by working with the formula for exponential smoothing and cancelling the lagged terms. The formula can be found on the ewm page.
The following code demonstrates that no memory is left after the adjustment. For every point, the fixed window of information used is L=1000. The factor f should be included if one wants the equivalent of the adjust=True version (for adjust=False, just drop the f factor).
srs1=pd.Series(np.random.normal(size=100000))
alpha=0.02
em1=srs1.ewm(alpha=alpha,adjust=False).mean()
L=1000
f=1-(1-alpha)**np.clip(np.arange(em1.shape[0]),0,L)
em1_=(em1-em1.shift(L)*(1-alpha)**L)/f
S=1001
em2=srs1[S:].ewm(alpha=alpha,adjust=False).mean()
f=1-(1-alpha)**np.clip(np.arange(em2.shape[0]),0,L)
em2_=(em2-em2.shift(L)*(1-alpha)**L)/f
print((em2_[:10000]-em1_[S:S+10000]).abs().max())
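The cancellation used above can be written out as follows, assuming the adjust=False recursion and writing y_t for the ewm output and x_t for the input series:

```latex
\begin{aligned}
y_t &= (1-\alpha)\,y_{t-1} + \alpha\,x_t \\
    &= (1-\alpha)^{L}\,y_{t-L} + \alpha\sum_{k=0}^{L-1}(1-\alpha)^{k}\,x_{t-k} \\[4pt]
\frac{y_t - (1-\alpha)^{L}\,y_{t-L}}{1-(1-\alpha)^{L}}
    &= \sum_{k=0}^{L-1} w_k\,x_{t-k},
\qquad
w_k = \frac{\alpha\,(1-\alpha)^{k}}{1-(1-\alpha)^{L}},
\qquad
\sum_{k=0}^{L-1} w_k = 1
\end{aligned}
```

Subtracting the lagged term removes everything older than L steps, and dividing by f = 1 - (1-alpha)^L renormalizes the remaining weights to sum to one. That is why the two series in the code agree regardless of where the input starts.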
This seems to be possible in pandas 1.5 with a mix of rolling, and win_type:
pd.Series.rolling(window=10, win_type='exponential').mean(tau=0.5, center=10, sym=False)
I use a non-symmetric exponential window centered at the size of the window, in order to have an exponential function decaying towards the past.
This yields the same results as the EMA function provided by Quang Hoang.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
def EMA(arr, window=10, alpha=0.5):
    ret = pd.Series(index=arr.index, name=arr.name, dtype='float64')
    arr=np.array(arr)
    l = len(arr)
    stride = arr.strides[0]
    ret.iloc[window-1:] = (pd.DataFrame(np.lib.stride_tricks.as_strided(arr,
                                                                         (l-window+1,window),
                                                                         (stride,stride)))
                           .T.ewm(alpha)
                           .mean()
                           .iloc[-1]
                           .values
                          )
    return ret
a = pd.Series([x for x in range(100)])
custom=EMA(a)
builtin= a.rolling(window=10, win_type='exponential').mean(tau=0.5, center=10, sym=False)
custom=custom.plot.line(label="Custom EMA")
builtin.plot.line(label="Built-in EMA")
plt.legend()
I am trying to do a correlated fit of both x and y data, however when I pass in covariance matrices for my x and y measurements, I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-173-273ef42c6f27> in <module>()
----> 1 odrout = theodr.run()
/Users/anaconda/lib/python2.7/site-packages/scipy/odr/odrpack.pyc in run(self)
1098 for attr in kwd_l:
1099 obj = getattr(self, attr)
-> 1100 if obj is not None:
1101 kwds[attr] = obj
1102
ValueError: could not convert we to a suitable array
Here is a minimal NOT working example that triggers this error on my machine:
import numpy as np
import scipy.odr as spodr
# make x and y data for a function
xx = np.linspace(0, 2*np.pi, 100)
yy = 2.*np.sin(3*xx) - 1
# randomize both variables a bit, and make 10 measurements
# of each data point
xdat = xx + np.random.normal(scale=0.3, size=(10,100))
ydat = yy + np.random.normal(scale=0.3, size=(10, 100))
# the function I will fit to
sin = lambda beta, x: beta[0]*np.sin(beta[1] * x) + beta[2]
# the covariance matrices for both data sets, here I summed over
# the 10 measurements I made for both my x and y data
xcov = np.cov(xdat.transpose())
ycov = np.cov(ydat.transpose())
# setup the odr data
odrdat = spodr.RealData(np.mean(xdat, axis=0),
np.mean(ydat, axis=0), covx=xcov, covy=ycov)
# set up the odr model
model = spodr.Model(sin)
# make the odr object
theodr = spodr.ODR(odrdat, model, beta0=[2,3,-1])
# run the odr object
odrout = theodr.run()
I can't seem to see why the matrices I'm passing are not suitable arrays. From the docs:
Covariance of x covx is an array of covariance matrices of x and are converted to weights by performing a matrix inversion on each observation’s covariance matrix.
This makes me think I should be passing a covariance matrix for each data point, but I don't have that type of information, and I don't think I need it. For a correlated fit it should be enough to have the covariances between all the data. For instance, in scipy.curve_fit you can pass in a 2d-array as a covariance matrix for the y-data, you don't need one for every single point.
Is there a particular way I should be passing these covariance matrices?
I have a 24000 * 316 numpy matrix; each row represents a time series with 316 time points, and I am computing the Pearson correlation between each pair of these time series. As a result I would have a 24000 * 24000 numpy matrix of Pearson values.
My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works (though it is still slow). I am wondering whether it is expected to be this slow (it takes more than a day!) and what I might be able to do about it.
If it helps, this is my code. Nothing special or hard.
import numpy
import scipy.stats
def SimMat(mat,name):
    mrange = mat.shape[0]
    print("mrange:", mrange)
    nTRs = mat.shape[1]
    print("nTRs:", nTRs)
    SimM = numpy.zeros((mrange,mrange))
    for i in range(mrange):
        SimM[i][i] = 1
    for i in range (mrange):
        for j in range(i+1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])
            if(pearV[1] <= 0.05):
                if(pearV[0] >= 0.5):
                    print("Pearson value:", pearV[0])
                    SimM[i][j] = pearV[0]
                    SimM[j][i] = 0
            else:
                SimM[i][j] = SimM[j][i] = 0
    numpy.savetxt(name, SimM)
    return SimM, nTRs
Thanks
The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5GB). There is no reason to keep the already computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results since they work nicely with numpy. Here is a complete, minimal working example:
import numpy as np
import h5py
from scipy.stats import pearsonr
# Create the dataset
h5 = h5py.File("data.h5",'w')
h5["test"] = np.random.random(size=(24000,316))
h5.close()
# Compute the Pearson correlations row by row
h5 = h5py.File("data.h5",'r+')
A = h5["test"][:]
N = A.shape[0]
out = h5.require_dataset("pearson", shape=(N,N), dtype=float)
for i in range(N):
    out[i] = [pearsonr(A[i],A[j])[0] for j in range(N)]
Testing the first 100 rows suggests this will only take 8 hours on a single core. If you parallelized it, it should have linear speedup with the number of cores.
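A vectorized alternative, which is not the approach used in the answer above, is to let `np.corrcoef` compute all pairwise coefficients in one call. The sketch below uses a small random matrix as a stand-in for the real data and applies only the r >= 0.5 filter from the question; the p-value test is left out. For the full 24000-row input the result still needs roughly 4.6 GB as float64, so memory remains the constraint.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
mat = rng.random((200, 316))  # hypothetical stand-in: 200 series, 316 time points

# np.corrcoef treats each row as one variable, so this is the full 200 x 200 matrix.
corr = np.corrcoef(mat)

# Spot-check one entry against scipy's pairwise implementation.
r_ref, _ = pearsonr(mat[3], mat[7])
print(corr[3, 7], r_ref)

# Keep only coefficients of at least 0.5, as in the question's filter.
sim = np.where(corr >= 0.5, corr, 0.0)
np.fill_diagonal(sim, 1.0)
```

This replaces the Python double loop with a single matrix product internally, which is typically far faster. If p-values are needed as well, they would have to be derived separately from the coefficients.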