Covariance matrix calculated by Python NumPy changes every time - python

I have a 1043*261 matrix with very small numbers between 0 and 1, and I calculated the 1043*1043 covariance matrix using numpy.cov(). I ran the code a few times and got similar (but not exactly the same) covariance matrices; the elements differed slightly, on the order of 1e-7. This sometimes makes the covariance matrix non-PSD, which causes serious problems for me.
Does anyone know why the differences would exist and how to solve it?
Attached are two covariance matrices I got by running the same code twice. If you compare them element by element, you will see slight differences:
No. 1
[[ 5.05639177e-06 2.44041401e-06 3.30187175e-06 ..., 1.66634014e-06
4.03972183e-06 1.18433575e-06]
[ 2.44041401e-06 9.67277658e-06 9.04356309e-06 ..., 2.50668884e-06
5.43371939e-06 4.74297546e-06]
[ 3.30187175e-06 9.04356309e-06 2.09334309e-05 ..., 3.13977728e-06
8.69946165e-06 6.15981652e-06]
...,
[ 1.66634014e-06 2.50668884e-06 3.13977728e-06 ..., 4.20175297e-06
4.16076781e-06 1.59827406e-06]
[ 4.03972183e-06 5.43371939e-06 8.69946165e-06 ..., 4.16076781e-06
2.58010941e-05 3.02797946e-06]
[ 1.18433575e-06 4.74297546e-06 6.15981652e-06 ..., 1.59827406e-06
3.02797946e-06 6.60805238e-06]]
No.2
[[ 5.05997030e-06 2.42187179e-06 3.30788097e-06 ..., 1.66495376e-06
4.03676937e-06 1.17413702e-06]
[ 2.42187179e-06 9.60677140e-06 9.05219266e-06 ..., 2.50338648e-06
5.42679569e-06 4.75547515e-06]
[ 3.30788097e-06 9.05219266e-06 2.04172017e-05 ..., 3.13058624e-06
8.67976701e-06 6.28137859e-06]
...,
[ 1.66495376e-06 2.50338648e-06 3.13058624e-06 ..., 4.20175297e-06
4.16076781e-06 1.59827884e-06]
[ 4.03676937e-06 5.42679569e-06 8.67976701e-06 ..., 4.16076781e-06
2.58010941e-05 3.02810307e-06]
[ 1.17413702e-06 4.75547515e-06 6.28137859e-06 ..., 1.59827884e-06
3.02810307e-06 6.63834973e-06]]
Thank you very much!

numpy.cov seems to be deterministic:
import numpy
randoms = numpy.random.random((1043, 261))
covs = [numpy.cov(randoms) for _ in range(10)]
all((c==covs[0]).all() for c in covs)
#>>> True
I'd imagine the problem is elsewhere.
Also note that this result holds with numbers 1/1000th the size.
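For example, a quick sanity check along those lines (a small sketch, not part of the original answer), with the values scaled down by a factor of 1000:
import numpy
randoms = numpy.random.random((1043, 261)) / 1000
covs = [numpy.cov(randoms) for _ in range(10)]
print(all((c == covs[0]).all() for c in covs))
#>>> True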

Related

What does optimizer ImFil return as history?

I want to use the Python implementation of the optimizer ImFil (Implicit Filtering) from the scikit-quant package. On this website there is an extensive manual for the MATLAB code on which the Python implementation is based.
The code runs, but the output is not as expected. Here is a toy example to demonstrate my problem:
import numpy as np
from skquant.opt import minimize

def f(x):
    ifail = 0   # the function never fails
    icount = 1  # each call is weighted the same
    return x[0]*x[0] + x[1]*x[1], ifail, icount

bounds = np.array([[-10, 10], [-10, 10]])
x0 = np.array([5, 5])  # initial point
budget = 20
res, hist = minimize(f, x0, bounds, budget, method='ImFil')
print('optimization result: ', res)
print('optimization history: ', hist, ' shape: ', hist.shape)
My expectation: The history of the optimization 'hist' should be a (20)x(7) array, according to the documentation:
histout = iteration history, updated after each nonlinear iteration
= (number of iterations )x(N+5) array, the columns are
[fcount, fval, norm(sgrad), norm(step), iarm, xval]
fcount = cumulative function evals
fval = current function value
norm(sgrad) = current projected stencil grad norm
norm(step) = norm of last step
iarm = line searches within current iteration
= -1 means first iterate at a new scale
xval = transpose of the current iteration
What I instead get:
optimization result: Optimal value: 1.57772e-29, at parameters: [ 3.55271368e-15 -1.77635684e-15]
optimization history: [[ 5.00000000e+01 5.00000000e+00 5.00000000e+00]
[ 5.00000000e+01 -5.00000000e+00 5.00000000e+00]
[ 5.00000000e+01 5.00000000e+00 -5.00000000e+00]
[ 1.25000000e+02 1.00000000e+01 5.00000000e+00]
[ 1.25000000e+02 5.00000000e+00 1.00000000e+01]
[ 2.50000000e+01 0.00000000e+00 5.00000000e+00]
[ 2.50000000e+01 5.00000000e+00 0.00000000e+00]
[ 2.00000000e+02 -1.00000000e+01 -1.00000000e+01]
[ 2.00000000e+02 -1.00000000e+01 -1.00000000e+01]
[ 2.94733047e+01 -3.83883476e+00 -3.83883476e+00]
[ 1.60849571e+01 1.16116524e+00 -3.83883476e+00]
[ 1.60849571e+01 -3.83883476e+00 1.16116524e+00]
[ 9.28616524e+01 -8.83883476e+00 -3.83883476e+00]
[ 9.28616524e+01 -3.83883476e+00 -8.83883476e+00]
[ 1.57772181e-29 3.55271368e-15 -1.77635684e-15]
[ 2.50000000e+01 5.00000000e+00 -1.77635684e-15]
[ 2.50000000e+01 3.55271368e-15 5.00000000e+00]
[ 2.50000000e+01 -5.00000000e+00 -1.77635684e-15]
[ 2.50000000e+01 3.55271368e-15 -5.00000000e+00]
[ 6.25000000e+00 2.50000000e+00 -1.77635684e-15]
[ 6.25000000e+00 3.55271368e-15 2.50000000e+00]
[ 6.25000000e+00 -2.50000000e+00 -1.77635684e-15]
[ 6.25000000e+00 3.55271368e-15 -2.50000000e+00]] shape: (23, 3)
The optimization results are reasonable, but what is it that is returned in the history?
My real life example:
Similarly, my code runs and the result of the optimization is reasonable, but the returned history is not what I would expect:
In my case I have a 12-dimensional input to my objective function and a budget of 200. I would thus expect histout to have dimensions (200)x(12+5). What I get instead is (209)x(13). The 209 is fine; a bit of overhead is expected because the current number of evaluation calls is only compared against the budget at larger intervals. But the 13 puzzles me. Also, there is nothing that could be interpreted as fcount, the number of objective function evaluations, which I would expect to be an integer in histout. From the toy problem it looks as if the first column in each row is f(x), where x is given by the second and third columns. The result of the optimization would then be the row with the minimal value in the first column. But this is not the case for my actual problem: there, values appear in the first column that are smaller than the returned result of the optimization.
Edit:
I found out that the option 'standalone' is hardcoded to False in the code (in skquant/opt/__init__.py). When standalone is False, the returned history is fval, xval. This explains the format of the output.
The open problem remains: what is fval? The documentation says 'fval, the current value of the objective function'. What exactly does that mean? I am asking for details because my returned array contains values that lie outside the range of my function. Is it some kind of extrapolation? I have not really understood how ImFil actually works. Also, what is returned as the result is not the minimum of the fval array. How is the result determined?
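Continuing from the toy example above, a minimal sketch of how the history can be split under that finding (the [fval, xval] column layout is inferred from the toy output, not from the official documentation):
import numpy as np
fvals = hist[:, 0]    # recorded objective values
xvals = hist[:, 1:]   # the corresponding parameter vectors
i = np.argmin(fvals)
print(fvals[i], xvals[i])  # smallest recorded value and where it occurred;
                           # as noted above, this need not coincide with the
                           # returned optimization result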

Linear Dependence of Set of Vectors in numpy

I want to use numpy to check whether some vectors are linearly dependent on each other or not. I found some good suggestions for checking the linear dependency of the rows of a matrix in the link below:
How to find linearly independent rows from a matrix
I cannot follow the 'Cauchy-Schwarz inequality' method, which I think is due to a gap in my knowledge, so I tried the eigenvalue method to check linear dependency among the columns. Here is my code:
import numpy as np

A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1]
])
lambdas, V = np.linalg.eig(A)
print(lambdas)
print(V)
and I get:
[ 1. 0. 1.61803399 -0.61803399]
[[ 0. 0.70710678 0.2763932 -0.7236068 ]
[ 0. 0. 0.4472136 0.4472136 ]
[ 0. 0. 0.7236068 -0.2763932 ]
[ 1. -0.70710678 0.4472136 0.4472136 ]]
My question is: what is the relevance of these eigenvectors and eigenvalues to the dependency of the columns of my matrix? How can I tell from these values which columns are dependent on each other and which are independent?
The second column vector corresponds to the eigenvalue of 0.
Just take a look at the API documentation when you get confused.
v : (…, M, M) array
The normalized (unit “length”) eigenvectors, such that the column
v[:,i] is the eigenvector corresponding to the eigenvalue w[i].
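To make that concrete, here is a small sketch (assuming, as the eigenvalue approach requires, that the matrix is square): the eigenvector belonging to the zero eigenvalue lies in the null space of A, and its nonzero entries give the weights of a column combination that cancels out.
import numpy as np

A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1]
])
lambdas, V = np.linalg.eig(A)
v = V[:, np.argmin(np.abs(lambdas))]   # eigenvector for the (numerically) zero eigenvalue
print(np.allclose(A @ v, 0))           # True: v is in the null space of A
print(v)                               # ~[0.707, 0, 0, -0.707]: columns 0 and 3
                                       # are linearly dependent (here they are equal)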
You can find the linearly independent columns by QR decomposition as described here.
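A sketch of that QR idea (note that np.linalg.qr does not pivot columns, so checking the diagonal of R is a heuristic; scipy.linalg.qr with pivoting=True is more robust for picking an independent subset):
import numpy as np

A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1]
])
Q, R = np.linalg.qr(A)
# A column whose pivot on the diagonal of R is (numerically) zero is a linear
# combination of the columns to its left.
print(np.abs(np.diag(R)) > 1e-10)   # [ True  True  True False]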

SciPy method eigsh giving nonintuitive results

I tried to use the SciPy function scipy.sparse.linalg.eigsh to calculate a few eigenvalues and eigenvectors of a matrix. However, when I print the calculated eigenvectors, they have the same dimension as the number of eigenvalues I asked for. Shouldn't it give me the actual eigenvectors, whose dimension is the same as that of the original matrix?
My code for reference:
import numpy as np
import scipy as sp, scipy.sparse.linalg

id = np.eye(13)
val, vec = sp.sparse.linalg.eigsh(id, k=2)
print(vec[1])
Which gives me:
[-0.26158945 0.63952164]
Intuitively it should have a dimension of 13, and it should not contain non-integer values either. Is it just my misinterpretation of the function? If so, is there another function in Python that can calculate just a few eigenvectors (I don't want the full spectrum) of the expected dimensionality?
vec is an array with shape (13, 2).
In [21]: vec
Out[21]:
array([[ 0.36312724, -0.04921923],
[-0.26158945, 0.63952164],
[ 0.41693924, 0.34811192],
[ 0.30068329, -0.11360339],
[-0.05388733, -0.3225355 ],
[ 0.47402124, -0.28180261],
[ 0.50581823, 0.29527393],
[ 0.06687073, 0.19762049],
[ 0.103382 , 0.29724875],
[-0.09819873, 0.00949533],
[ 0.05458907, -0.22466131],
[ 0.15499849, 0.0621803 ],
[ 0.01420219, 0.04509334]])
The eigenvectors are stored in the columns of vec. To see the first eigenvector, use vec[:, 0]. When you printed vec[0] (which is equivalent to vec[0, :]), you printed the first row of vec, which is just the first components of the two eigenvectors.
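For illustration, a small sketch of that column access (using the name A instead of id to avoid shadowing the built-in):
import numpy as np
from scipy.sparse.linalg import eigsh

A = np.eye(13)
val, vec = eigsh(A, k=2)
v0 = vec[:, 0]                            # first eigenvector, length 13
print(v0.shape)                           # (13,)
print(np.allclose(A @ v0, val[0] * v0))   # True: it satisfies A v = lambda v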

Python: Calculating the inverse of a pseudo inverse matrix

I am trying to calculate the pseudoinverse of a matrix, which should not be very difficult. The problem is inverting the matrix.
I am using the following code:
import numpy
from numpy import mat

def pseudoinverse(A):
    helper = A.T * A
    print helper * helper.I   # should print the identity matrix
    PI = helper.I * A.T
    return PI

A = numpy.random.random_sample((4, 5))
A = mat(A)
B = pseudoinverse(A)
To test this I included the print line; helper * helper.I should give the identity. The output I get from this is:
[[ 2. -1. 0. 0. 3. ]
[ 0. 2. 0. 0. 3.5 ]
[ 0. -0.5 1.125 -1. 2.25 ]
[ 2. 0. 0.25 -2. 3. ]
[ 0. 0. 0.5 -2. 4. ]]
which is clearly not the identity. I don't know what I did wrong and would really like to know.
Your matrix A does not have full column rank. As a consequence, helper is singular and not invertible (if you print helper.I you will see some very large numbers).
The solution is to compute the right inverse instead of the left inverse:
helper = A * A.T
PI = A.T * helper.I
See Wikipedia for more details.
Unless you are doing this as an exercise, you could also use numpy's built-in implementation of the pseudoinverse.
Edit:
>>> numpy.random.seed(42)
>>> a = mat(numpy.random.random_sample((3, 4))) # smaller matrix for nicer output
>>> h = a * a.T
>>> h * h.I
matrix([[ 1.00000000e+00, 1.33226763e-15, 0.00000000e+00],
[ -1.77635684e-15, 1.00000000e+00, 0.00000000e+00],
[ 0.00000000e+00, 1.33226763e-15, 1.00000000e+00]])
Up to numeric precision this looks pretty much like an identity matrix to me.
The problem in your code is that A.T * A is not invertible. If you try to invert such a matrix you get wrong results.
In contrast, A * A.T is invertible.
You have two options:
change the direction of multiplication
call pseudoinverse(A.T)
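For reference, a minimal sketch of the built-in pseudoinverse mentioned above (numpy.linalg.pinv); it is SVD-based, works on plain ndarrays, and does not require A.T * A to be invertible:
import numpy as np

A = np.random.random_sample((4, 5))
PI = np.linalg.pinv(A)
# A random 4x5 matrix almost surely has full row rank, so A @ pinv(A)
# recovers the 4x4 identity.
print(np.allclose(A @ PI, np.eye(4)))   # True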

numpy mean of rows when speed is a concern

I want to compute the mean of the rows of a numpy matrix. So for the input:
array([[ 1, 1, -1],
[ 2, 0, 0],
[ 3, 1, 1],
[ 4, 0, -1]])
my output will be:
array([[ 0.33333333],
[ 0.66666667],
[ 1.66666667],
[ 1. ]])
I came up with the solution result = array([[x] for x in np.mean(my_matrix, axis=1)]), but this function will be called a lot of times on matrices of 40 rows x 10-300 columns, so I would like to make it faster, and this implementation seems slow.
You can do something like this:
>>> my_matrix.mean(axis=1)[:,np.newaxis]
array([[ 0.33333333],
[ 0.66666667],
[ 1.66666667],
[ 1. ]])
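As an aside (not part of the original answer), mean also accepts keepdims, which keeps the reduced axis and gives the same column shape:
>>> my_matrix.mean(axis=1, keepdims=True)   # same (4, 1) result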
If the matrices are fresh and independent there isn't much you can save, because the only way to compute the mean is to actually sum the numbers.
If however the matrices are obtained from partial views of a single fixed dataset (e.g. you're computing a moving average), then you can use a sum table. For example, after:
st = data.cumsum(0)
you can compute the average of the elements between index x0 and x1 with
avg = (st[x1] - st[x0]) / (x1 - x0)
in O(1) (i.e. the computing time doesn't depend on how many elements you are averaging).
You can even use numpy to compute an array with the moving averages directly with:
res = (st[n:] - st[:-n]) / n
This approach can even be extended to higher dimensions like computing the average of the values in a rectangle in O(1) with
st = data.cumsum(0).cumsum(1)
rectsum = (st[y1][x1] + st[y0][x0] - st[y0][x1] - st[y1][x0])
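A minimal runnable sketch of the one-dimensional sum-table idea above (it prepends a row of zeros to the cumulative sum so that the window [x0, x1) lines up exactly with ordinary slicing; that indexing convention is an assumption of this sketch, not a quote from the answer):
import numpy as np

data = np.random.random((40, 10))
st = np.vstack([np.zeros((1, data.shape[1])), data.cumsum(0)])   # zero-prefixed sum table

x0, x1 = 5, 20
avg = (st[x1] - st[x0]) / (x1 - x0)                 # O(1) per query
print(np.allclose(avg, data[x0:x1].mean(axis=0)))   # True

n = 7
res = (st[n:] - st[:-n]) / n                        # all length-n moving averages at once
print(res.shape)                                    # (34, 10): 40 - 7 + 1 windows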
