Consider the singular value decomposition M = USV*. Then the eigenvalue decomposition of M*M gives M*M = VS*U*USV* = V(S*S)V*. I wish to verify this equality with numpy by showing that the eigenvectors returned by the eigh function are the same as those returned by the svd function:
import numpy as np
np.random.seed(42)
# create mean centered data
A=np.random.randn(50,20)
M= A-np.array(A.mean(0),ndmin=2)
# svd
U1,S1,V1=np.linalg.svd(M)
S1=np.square(S1)
V1=V1.T
# eig
S2,V2=np.linalg.eigh(np.dot(M.T,M))
indx=np.argsort(S2)[::-1]
S2=S2[indx]
V2=V2[:,indx]
# both Vs are in orthonormal form
assert np.all(np.isclose(np.linalg.norm(V1,axis=1), np.ones(V1.shape[0])))
assert np.all(np.isclose(np.linalg.norm(V1,axis=0), np.ones(V1.shape[1])))
assert np.all(np.isclose(np.linalg.norm(V2,axis=1), np.ones(V2.shape[0])))
assert np.all(np.isclose(np.linalg.norm(V2,axis=0), np.ones(V2.shape[1])))
assert np.all(np.isclose(S1,S2))
assert np.all(np.isclose(V1,V2))
The last assertion fails. Why?
Just play with small numbers to debug your problem: start with A = np.random.randn(3, 2) instead of your much larger (50, 20) matrix.
In my random case, I find that
V1 = array([[-0.33872745,  0.94088454],
            [-0.94088454, -0.33872745]])
and for V2:
V2 = array([[ 0.33872745, -0.94088454],
            [ 0.94088454,  0.33872745]])
They differ only by a sign, and obviously, even when normalized to unit norm, an eigenvector is determined only up to a sign.
Now, if you try the trick
assert np.all(np.isclose(V1, -1 * V2))
on your original big matrix, it fails; again, this is OK. What happens is that some vectors have been multiplied by -1 and some others haven't.
A correct way to check for equality between the vectors (up to sign) is:
assert np.allclose(np.abs((V1 * V2).sum(0)), 1.)
To get a feeling for how this works, you can print the quantity
(V1 * V2).sum(0)
which is indeed either +1 or -1 depending on the vector:
array([ 1., -1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., -1., 1., 1., 1., -1., -1.])
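If you want the original element-wise assertion to pass, one option (a sketch that assumes all eigenvalues are distinct; see the caveat in the EDIT below) is to flip the mismatched columns of V2 before comparing:
signs = np.sign((V1 * V2).sum(0))          # +1 or -1 per eigenvector
assert np.all(np.isclose(V1, V2 * signs))  # element-wise comparison now passes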
EDIT: This will happen in most cases, especially if starting from a random matrix. Notice, however, that this test will likely fail if one or more eigenvalues has an eigenspace of dimension larger than 1, as pointed out by @Sven Marnach in his comment below:
There might be other differences than just vectors multiplied by -1.
If any of the eigenvalues has a multi-dimensional eigenspace, you
might get an arbitrary orthonormal basis of that eigenspace, and
two such bases might be rotated against each other by an arbitrary
unitary matrix
I have a 150x4 matrix X which I created from a pandas dataframe using the following code:
X = df_new.as_matrix()
I have to normalize it using this function:
X̄,j = (X,j - Uj) / σj
I know that Uj is the mean value of j, and that σj is the standard deviation of j, but I don't understand what j is. I'm having a little trouble understanding what the bar on X is, and I'm confused by the commas in the equation (I don't know if they have any significance or not).
Can anyone help me understand what this equation means so I can then write the normalization using sklearn?
You don't actually need to write code for the normalization yourself - it comes ready with sklearn.preprocessing.scale.
Here is an example from the docs:
>>> from sklearn import preprocessing
>>> import numpy as np
>>> X_train = np.array([[ 1., -1., 2.],
... [ 2., 0., 0.],
... [ 0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X_train)
>>> X_scaled
array([[ 0. ..., -1.22..., 1.33...],
[ 1.22..., 0. ..., -0.26...],
[-1.22..., 1.22..., -1.06...]])
When used with the default setting axis=0, the normalization happens column-wise (i.e. for each column j, as in your equation). As a result, it is easy to confirm that the scaled data has zero mean and unit variance:
>>> X_scaled.mean(axis=0)
array([ 0., 0., 0.])
>>> X_scaled.std(axis=0)
array([ 1., 1., 1.])
The indices of matrix X are row (i) and column (j). Hence, X,j means column j of matrix X, i.e. normalize each column of matrix X to z-scores.
You can do that using pandas:
df_new_zscores = (df_new - df_new.mean()) / df_new.std()
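One caveat: pandas' .std() uses the sample standard deviation (ddof=1) by default, while sklearn.preprocessing.scale and numpy's .std() use the population standard deviation (ddof=0), so the two results will differ slightly. A sketch that matches sklearn:
df_new_zscores = (df_new - df_new.mean()) / df_new.std(ddof=0)  # ddof=0 to match sklearn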
I do not know pandas, but I think that the equation means that the normalized matrix is given by
X̄,j = (X,j - Uj) / σj
You subtract the empirical mean and divide by the empirical standard deviation per column.
This is sometimes used for Principal Component Analysis.
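A minimal numpy sketch of this column-wise standardization (using a random stand-in for the 150x4 matrix from the question):
import numpy as np
X = np.random.randn(150, 4)  # stand-in for df_new.as_matrix()
# subtract each column's mean and divide by each column's standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)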
Is there a way to improve the precision of the output of numpy.linalg.eig() and scipy.linalg.eig()?
I'm diagonalizing a non-symmetric matrix, yet I expect on physical grounds to get a real spectrum of pairs of positive and negative eigenvalues. Indeed, the eigenvalues do come in pairs, and I have verified by an independent analytical calculation that two of the pairs are correct. The problematic pair is the one with eigenvalues close to zero, which appear to have small imaginary parts. I am expecting this pair to be degenerate at zero, so the imaginary parts can at most be at machine precision, but they are much larger. I think this leads to a small error in the eigenvectors, which however can propagate in subsequent manipulations.
The example below shows that there are fictitious imaginary parts leftovers, by checking the validity of the transformation.
import numpy as np
import scipy.linalg as sla
from functools import reduce  # reduce is not a builtin in Python 3
H = np.array(
    [[ 11.52,  -1.,    -1.,     9.52,   0.,     0.  ],
     [ -1.,    11.52,  -1.,     0.,     9.52,   0.  ],
     [ -1.,    -1.,    11.52,   0.,     0.,     9.52],
     [ -9.52,   0.,     0.,   -11.52,   1.,     1.  ],
     [  0.,    -9.52,   0.,     1.,   -11.52,   1.  ],
     [  0.,     0.,    -9.52,   1.,     1.,   -11.52]],
    dtype=np.float64
)
#print(H)
E,V = np.linalg.eig(H)
#E,V = sla.eig(H)
H2=reduce(np.dot,[V,np.diag(E),np.linalg.inv(V)])
#print(H2)
print(np.linalg.norm(H-H2))
which gives
3.93435308362e-09
a number of the order of the fictitious imaginary part of the zero eigenvalues.
You may be losing some precision by taking the inverse in computing the error above. If instead you compute:
# H = V.E.inv(V) <=> H.V = V.E
print(np.linalg.norm(H.dot(V)-V.dot(np.diag(E))))
# error: 2.81034671113e-14
the error is much smaller.
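If you do want the explicit reconstruction, a sketch that avoids forming inv(V) by solving a linear system instead (this usually loses less precision, although the result is still limited by the conditioning of V) is:
# H2 = V.diag(E).inv(V), computed without forming the inverse explicitly
VE = V.dot(np.diag(E))
H2 = np.linalg.solve(V.T, VE.T).T
print(np.linalg.norm(H - H2))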
Your problem may also be suffering from ill-conditioning, meaning that there is a very high numerical sensitivity to rounding and other errors. The Bauer-Fike Theorem gives an upper bound on the error sensitivity of the eigenvalue problem: a perturbation of size ‖ΔH‖ in the input can shift the eigenvalues by up to cond(V)·‖ΔH‖.
From this theorem, in the worst case an error at machine precision in the input could propagate to an error on the order of 1e-8 in the eigenvalues, since:
machine_precision = np.finfo(np.float64).eps
print(np.linalg.cond(V) * machine_precision)
# 4.54517272701e-08
I have a 2D Numpy array in which I want to normalise each column to zero mean and unit variance. Since I'm primarily used to C++, the way I'm doing it is to use loops to iterate over the elements in a column and do the necessary operations, then repeat this for all columns. I wanted to know about a Pythonic way to do so.
Let class_input_data be my 2D array. I can get the column mean as:
column_mean = numpy.sum(class_input_data, axis = 0)/class_input_data.shape[0]
I then subtract the mean from all columns by:
class_input_data = class_input_data - column_mean
By now, the data should be zero mean. However, the value of:
numpy.sum(class_input_data, axis = 0)
isn't equal to 0, implying that I have done something wrong in my normalisation. By "isn't equal to 0" I don't mean very small numbers that can be attributed to floating-point inaccuracies.
Something like:
import numpy as np
eg_array = 5 + (np.random.randn(10, 10) * 2)
normed = (eg_array - eg_array.mean(axis=0)) / eg_array.std(axis=0)
normed.mean(axis=0)
Out[14]:
array([ 1.16573418e-16, -7.77156117e-17, -1.77635684e-16,
9.43689571e-17, -2.22044605e-17, -6.09234885e-16,
-2.22044605e-16, -4.44089210e-17, -7.10542736e-16,
4.21884749e-16])
normed.std(axis=0)
Out[15]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
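For reference, sklearn.preprocessing.scale does the same column-wise standardization in one call, if sklearn is available:
from sklearn import preprocessing
normed_sk = preprocessing.scale(eg_array)  # column-wise zero mean, unit variance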
I have a data structure that looks like a list of values, and I am trying to compute the (x, y) 2D Hermite functions from them using numpy. I'm trying to use as many numpy arrays as possible because of the performance boost you get from getting to Fortran as quickly as possible (I'm expecting x to be, in practice, many thousands of 3-arrays). Specifically, my code looks like this:
import numpy as np
from numpy.polynomial.hermite import hermval2d

x = np.array([[1., 2., 3.], [4., 5., 6.]])
coefs = np.array([[[1., 0.],[0., 1.]], [[0., 1.], [1., 0.]]])
z = np.array([0., 0.])
z[:] = hermval2d(x[:,0], x[:,1], coefs[:])
This returns an error about the shape of the hermval2d result, which I can see by just running the hermval2d call instead of assigning it:
In [XX]: hermval2d(x[:,0], x[:,1], coefs[:])
Out[XX]:
array([[ 9., 81.],
[ 6., 18.]])
I would expect hermval2d to return a scalar for every x, y, and coefficient matrix, which is what you would expect from the documentation. So what am I missing here? What's the score?
It's right there in the docs :)
hermval2d(x, y, c)
[...]
The shape of the result will be c.shape[2:] + x.shape
In your case c.shape[2:] is (2,) and x.shape is (2,), so you get a (2, 2) array: the Hermite series with coefficients c[:, :, i] evaluated at every point (x, y), one row per i.
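If the intent is one scalar per point, each point paired with its own coefficient matrix (an assumption about what z should hold), a minimal sketch using x, coefs and hermval2d from the question is:
# z[i] = Hermite series with coefficients coefs[i], evaluated at (x[i, 0], x[i, 1])
z = np.array([hermval2d(xi[0], xi[1], ci) for xi, ci in zip(x, coefs)])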
from numpy.random import rand
from sklearn.preprocessing import normalize
from scipy.sparse import csr_matrix
from scipy.linalg import norm
w = (rand(1,10)<0.25)*rand(1,10)
x = (rand(1,10)<0.25)*rand(1,10)
w_csr = csr_matrix(w)
x_csr = csr_matrix(x)
(normalize(w_csr,axis=1,copy=False,norm='l2')*normalize(x_csr,axis=1,copy=False,norm='l2')).todense()
norm(w,ord='fro')*norm(x,ord='fro')
I am working with scipy csr_matrix and would like to normalize two matrices using the Frobenius norm, then take their product. But norm from scipy.linalg and normalize from sklearn.preprocessing seem to evaluate the matrices differently. Since, technically, in the above two cases I am calculating the same Frobenius norm, shouldn't the two expressions evaluate to the same thing? But I get the following answers:
matrix([[ 0.962341]])
0.4431811178371029
for sklearn.preprocessing and scipy.linalg.norm respectively. I am really interested to know what I am doing wrong.
sklearn.preprocessing.normalize divides each row by its norm. It returns a matrix with the same shape as its input. scipy.linalg.norm returns the norm of the matrix. So your calculations are not equivalent.
Note that your code is not correct as it is written. This line
(normalize(w_csr,axis=1,copy=False,norm='l2')*normalize(x_csr,axis=1,copy=False,norm='l2')).todense()
raises ValueError: dimension mismatch. The two calls to normalize both return matrices of shape (1, 10), and for sparse matrices * is matrix multiplication, so their dimensions are not compatible for a matrix product. What did you do to get matrix([[ 0.962341]])?
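If a matrix product was the intent, transposing the second factor makes the shapes compatible, e.g. (a sketch keeping the row-wise L2 normalization from the question):
# (1, 10) times (10, 1) gives a 1x1 matrix
(normalize(w_csr, axis=1, norm='l2') * normalize(x_csr, axis=1, norm='l2').T).todense()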
Here's a simple function to compute the Frobenius norm of a sparse (e.g. CSR or CSC) matrix:
import numpy as np

def spnorm(a):
    return np.sqrt((a.data ** 2).sum())
For example,
In [182]: b_csr
Out[182]:
<3x5 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [183]: b_csr.A
Out[183]:
array([[ 1., 0., 0., 0., 0.],
[ 0., 2., 0., 4., 0.],
[ 0., 0., 0., 2., 1.]])
In [184]: spnorm(b_csr)
Out[184]: 5.0990195135927845
In [185]: norm(b_csr.A)
Out[185]: 5.0990195135927845
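With this helper, a sketch of what the question seems to be after (scale each matrix by its Frobenius norm, then take their product) could look like this, assuming w_csr and x_csr are not all zeros:
w_unit = w_csr / spnorm(w_csr)   # dividing a sparse matrix by a scalar keeps it sparse
x_unit = x_csr / spnorm(x_csr)
print((w_unit * x_unit.T).todense())  # (1, 10) times (10, 1): a 1x1 matrix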