Error getting more than two eigenvalues in PCA - python

I am trying to perform PCA from scratch on a subset of MNIST data (digits 0 and 1) using Python.
(NOTE: x_train_0_scaled has dimensions 5923x784, where 5923 is the number of images and 784 is the 28*28 flattened pixel values.)
Here's my code to find eigenvalues:
import numpy as np
from scipy.linalg import eigh

# matrix multiplication using numpy
covar_matrix = np.matmul(x_train_0_scaled.T, x_train_0_scaled)
print("The shape of the covariance matrix = ", covar_matrix.shape)
# the parameter 'eigvals' is defined as (low index, high index)
# the eigh function returns the eigenvalues in ascending order
# this call returns only the top 2 eigenvalues (indices 782 and 783)
values, vectors = eigh(covar_matrix, eigvals=(782, 783))
print("Shape of eigen vectors = ", vectors.shape)
However, when I try to get more than two eigenvalues, I get this error:
values, vectors = eigh(covar_matrix, eigvals=(782, 783, 781))
File "/usr/local/lib/python3.8/site-packages/scipy/linalg/decomp.py", line 484, in eigh
lo, hi = [int(x) for x in subset_by_index]
ValueError: too many values to unpack (expected 2)
The reason I want more than two eigenvectors is that, as per the image below, I suspect my data is not clearly separable, so I want more dimensions to plot my data in. Is my intuition correct?

The issue is resolved: eigvals takes a (lo, hi) pair of indices, not a list of indices. So instead of specifying (781, 782, 783) I needed to pass lo=781 and hi=783 to get the top 3 eigenvalues.
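For reference, a minimal sketch of the corrected call (it assumes the covar_matrix computed above; on newer SciPy versions the parameter is renamed to subset_by_index):

import numpy as np
from scipy.linalg import eigh

# eigvals takes an inclusive (lo, hi) index pair, so this requests the top 3
# eigenvalues of the 784x784 matrix (indices 781, 782 and 783 in ascending order)
values, vectors = eigh(covar_matrix, eigvals=(781, 783))

# on SciPy >= 1.5 the equivalent spelling is:
# values, vectors = eigh(covar_matrix, subset_by_index=[781, 783])

print(values.shape)   # (3,)
print(vectors.shape)  # (784, 3)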

Related

scipy.stats.multivariate_normal error: input matrix must be symmetric positive definite

I'm trying to compute the cumulative distribution function of a multivariate normal using scipy.
I'm having trouble with the "input matrix must be symmetric positive definite" error.
To my knowledge, a diagonal matrix with positive diagonal entries is positive definite (see page 1 problem 2).
However, when some of the diagonal values are (relatively) small, the error shows up.
For example, this code:
import numpy as np
from scipy.stats import multivariate_normal
std = np.array([0.001, 2])
mean = np.array([1.23, 3])
multivariate_normal(mean=mean, cov=np.diag(std**2)).cdf([2,1])
returns 0.15865525393145702
while changing the third line to:
std = np.array([0.00001, 2])
causes the error to show up.
I'm guessing that it has something to do with floating-point computation error.
The problem is that when the dimension of the cov matrix is larger, the smallest diagonal values it accepts get bigger and bigger.
I tried multiple values on the diagonal of a 9x9 covariance matrix. It seems that when the other diagonal values are very large, small values cause the error.
Examining the stack trace, you will see that _eigvalsh_to_eps uses a relative cutoff of
1e6*np.finfo('d').eps ~ 2.2e-10 on the eigenvalues of the covariance matrix.
In your example the smaller eigenvalue is (5e-6)**2 ~ 2.5e-11 times the largest eigenvalue, which is below that cutoff, so it is treated as zero.
You can pass allow_singular=True to get it working:
import numpy as np
from scipy.stats import multivariate_normal
std = np.array([0.000001, 2])
mean = np.array([1.23, 3])
multivariate_normal(mean=mean, cov=np.diag(std**2), allow_singular=True).cdf([2,1])
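As a rough sanity check of the cutoff argument above (a sketch, using the std = [0.00001, 2] values that trigger the error in the question), compare the eigenvalue ratio of the covariance matrix against the internal threshold:

import numpy as np

cov = np.diag(np.array([0.00001, 2]) ** 2)
eigvals = np.linalg.eigvalsh(cov)

print(eigvals.min() / eigvals.max())   # ~2.5e-11, below the cutoff
print(1e6 * np.finfo('d').eps)         # ~2.2e-10, the cutoff used in _eigvalsh_to_eps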

Reshaping numpy array

What I am trying to do is take a numpy array representing 3D image data and calculate the hessian matrix for every voxel. My input is a matrix of shape (Z,X,Y) and I can easily take a slice along z and retrieve a single original image.
gx, gy, gz = np.gradient(imgs)
gxx, gxy, gxz = np.gradient(gx)
gyx, gyy, gyz = np.gradient(gy)
gzx, gzy, gzz = np.gradient(gz)
And I can access the hessian for an individual voxel as follows:
x = 100
y = 100
z = 63
H = [[gxx[z][x][y], gxy[z][x][y], gxz[z][x][y]],
     [gyx[z][x][y], gyy[z][x][y], gyz[z][x][y]],
     [gzx[z][x][y], gzy[z][x][y], gzz[z][x][y]]]
But this is cumbersome and I can't easily slice the data.
I have tried using reshape as follows
H = H.reshape(Z, X, Y, 3, 3)
But when I test this by retrieving the hessian for a specific voxel, the value returned from the reshaped array is completely different from the one built above.
I think I could use zip somehow but I have only been able to find that for making lists of tuples.
Bonus: If there's a faster way to accomplish this, please let me know. I essentially need to calculate the three eigenvalues of the hessian matrix for every voxel in the 3D data set. Calculating the hessian values is really fast, but finding the eigenvalues for a single 2D image slice takes about 20 seconds. Are there any GPU- or TensorFlow-accelerated libraries for image processing?
We can use a list comprehension to get the hessians -
H_all = np.array([np.gradient(i) for i in np.gradient(imgs)]).transpose(2,3,4,0,1)
Just to give it a bit of explanation: [np.gradient(i) for i in np.gradient(imgs)] loops through the two levels of np.gradient output, producing a (3, 3, Z, X, Y) array whose first two axes index the Hessian entries. We need those two as the last two axes in the final output, so we push them to the end with the transpose.
Thus, H_all holds all the hessians and hence we can extract our specific hessian given x,y,z, like so -
x = 100
y = 100
z = 63
H = H_all[z, x, y]
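As a possible follow-up to the bonus question (not part of the original answer): np.linalg.eigvalsh broadcasts over the leading axes, so the eigenvalues of every voxel's Hessian can be computed in one vectorized call, assuming the Hessian is (numerically close to) symmetric:

import numpy as np

# H_all as built above, shape (Z, X, Y, 3, 3)
H_all = np.array([np.gradient(g) for g in np.gradient(imgs)]).transpose(2, 3, 4, 0, 1)

# symmetrize to guard against small numerical asymmetry (gxy vs gyx)
H_sym = 0.5 * (H_all + H_all.transpose(0, 1, 2, 4, 3))

# eigenvalues of all voxel Hessians at once, shape (Z, X, Y, 3), ascending per voxel
eigvals = np.linalg.eigvalsh(H_sym)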

Saving confusion matrix

Is there any possibility to save the confusion matrix which is generated by sklearn.metrics?
I would like to save multiple results of different classification algorithms in an array or maybe a pandas data frame so I can show which algorithm works best.
print('Neural net: \n',confusion_matrix(Y_test, Y_pred), sep=' ')
How could I save the generated confusion matrix within a loop? (I am training over a set of 200 different target variables)
array[i] = confusion_matrix(Y_test,Y_pred)
I run into a definition problem here (array is not defined, whereas the version without [i] runs smoothly).
Additionally, I am normalizing the confusion matrix. How could I print out the average result of the confusion matrix after the whole loop? (average of the 200 different confusion matrices)
I am not that fluent with python yet.
First, the "array is not defined" problem.
In Python a list is declared as:
array = []
Since the size of the list is not given at declaration, no space is allocated, so we can't assign to positions that don't exist yet:
array[i] = some_value  # fails: no slot i has been allocated
So if you know the required size in advance, pre-fill the list with zeros during declaration and index into it, or use array.append() inside the loop (see the sketch below).
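A minimal sketch of the append variant (placeholder names; it assumes every iteration produces a confusion matrix of the same shape, i.e. the same set of labels):

import numpy as np
from sklearn.metrics import confusion_matrix

cms = []
for i in range(200):
    # ... train the i-th model and compute Y_pred here ...
    cms.append(confusion_matrix(Y_test, Y_pred))

average_cm = np.mean(cms, axis=0)   # element-wise average over all 200 matrices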
Now for saving the confusion matrices:
Since confusion_matrix returns a 2-D array and you need to save multiple such arrays, use a 3-D array to hold them.
import numpy as np
n_classes = len(np.unique(y_test))  # a confusion matrix is n_classes x n_classes
matrix_result = np.zeros((200, n_classes, n_classes))
for i in range(200):
    matrix_result[i] = confusion_matrix(y_test, y_pred)
For averaging:
matrix_result_average = matrix_result.mean(axis=0)
I'm not sure what you mean by training over a set of target variables (please elaborate), but here is a start at averaging over confusion matrices, using numpy.
First an empty result matrix is created, which is three-dimensional and the size of 200 stacked confusion matrices. These are then filled in one-by-one in the for-loop. Finally the resulting matrix is averaged along the dimension of the targets, resulting in the average confusion matrix.
import numpy as np
N = len(np.unique(Y_test))  # a confusion matrix is N x N, with N the number of classes
result = np.zeros((len(targets), N, N))
for i, target in enumerate(targets):
    result[i] = confusion_matrix(Y_test, Y_pred)  # do something with target?
print(result.mean(axis=0))

One dimensional Mahalanobis Distance in Python

I've been trying to validate my code to calculate Mahalanobis distance written in Python (and double check to compare the result in OpenCV)
My data points are of 1 dimension each (5 rows x 1 column).
In OpenCV (C++), I was able to calculate the Mahalanobis distance for data points with these dimensions.
The following code fails to calculate the Mahalanobis distance when the matrix is 5 rows x 1 column, but it works when the number of columns in the matrix is more than 1:
import numpy;
import scipy.spatial.distance;
s = numpy.array([[20],[123],[113],[103],[123]]);
covar = numpy.cov(s, rowvar=0);
invcovar = numpy.linalg.inv(covar)
print scipy.spatial.distance.mahalanobis(s[0],s[1],invcovar);
I get the following error:
Traceback (most recent call last):
File "/home/abc/Desktop/Return.py", line 6, in <module>
invcovar = numpy.linalg.inv(covar)
File "/usr/lib/python2.6/dist-packages/numpy/linalg/linalg.py", line 355, in inv
return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
IndexError: tuple index out of range
One-dimensional Mahalanobis distance is really easy to calculate manually:
import numpy as np
s = np.array([[20], [123], [113], [103], [123]])
std = s.std()
print np.abs(s[0] - s[1]) / std
(reducing the formula to the one-dimensional case).
But the problem with scipy.spatial.distance is that, for some reason, np.cov returns a scalar (a zero-dimensional array) when given a set of 1-d variables, and np.linalg.inv needs a 2-d array. You want to pass in a 2d array:
>>> covar = np.cov(s, rowvar=0)
>>> covar.shape
()
>>> invcovar = np.linalg.inv(covar.reshape((1,1)))
>>> invcovar.shape
(1, 1)
>>> mahalanobis(s[0], s[1], invcovar)
2.3674720531046645
Covariance needs at least two samples to compare. np.cov() by default (rowvar=True) treats each row as a variable, so if you stack the two arrays on top of each other with vstack (one sample per row), pass rowvar=False so that each column is treated as a variable; OpenCV's calcCovarMatrix has the corresponding COVAR_ROWS/COVAR_COLS flags. If your arrays are multidimensional, just flatten() them first.
So if I want to compare two 24x24 images, I flatten them both into 1x576 row vectors, then stack the two to get a 2x576 array, and that is the first argument of np.cov().
You should then get a large square matrix showing the results of comparing each element in array1 with each element in array2. In my example it will be 576x576. THAT is what you pass into your invert function.
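A small sketch of the stacking described above (random data; only the shapes matter here):

import numpy as np

img1 = np.random.rand(24, 24).flatten()   # shape (576,)
img2 = np.random.rand(24, 24).flatten()   # shape (576,)

stacked = np.vstack([img1, img2])         # shape (2, 576): one sample per row
covar = np.cov(stacked, rowvar=False)     # shape (576, 576): one row/column per pixel position
invcovar = np.linalg.pinv(covar)          # pseudo-inverse; two samples give a rank-deficient covariance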

What's wrong with my PCA?

My code:
from numpy import *

def pca(orig_data):
    data = array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    u, s, v = linalg.svd(data)
    print s  # should be s**2 instead!
    print v

def load_iris(path):
    lines = []
    with open(path) as input_file:
        lines = input_file.readlines()
    data = []
    for line in lines:
        cur_line = line.rstrip().split(',')
        cur_line = cur_line[:-1]
        cur_line = [float(elem) for elem in cur_line]
        data.append(array(cur_line))
    return array(data)

if __name__ == '__main__':
    data = load_iris('iris.data')
    pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Output:
[ 20.89551896 11.75513248 4.7013819 1.75816839]
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - Same as I got but transposed so okay I guess
Also, what's with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:
cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print val
But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should have 4 eigenvalues, not the 150 that eig gives me. Am I doing something wrong?
edit: I've noticed that the values differ by a factor of 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to add up to the number of dimensions, in this case 4. What I don't understand is why this difference is happening. If I simply divide the eigenvalues by len(data) I get the result I want, but I don't understand why. Either way the proportions of the eigenvalues aren't altered, but they are important to me, so I'd like to understand what's going on.
You decomposed the wrong matrix.
Principal Component Analysis requires manipulating the eigenvectors/eigenvalues
of the covariance matrix, not the data itself. The covariance (or correlation) matrix, created from an m x n data matrix (m observations, n features), will be an n x n matrix; for the correlation matrix used below, the main diagonal is all ones.
You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef:
import numpy as NP
import numpy.linalg as LA

# a simulated data set with 8 data points, each point having five features
data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)

# usually a good idea to mean-center your data first
# (cast to float above so the in-place subtraction of a float mean works)
data -= NP.mean(data, axis=0)

# calculate the correlation matrix
# (an n x n matrix, here 5 x 5, one row/column per feature)
C = NP.corrcoef(data, rowvar=0)

# now get the eigenvalues/eigenvectors of C
evals, evecs = LA.eig(C)
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) LA module--it is a little easier to work with than svd: the return values are the eigenvectors and eigenvalues themselves, and nothing else. By contrast, as you know, svd doesn't return these directly.
Granted, the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however, when doing PCA you'll always have a square matrix to decompose, regardless of the shape your data is in. This is because the matrix you decompose in PCA is a covariance (or correlation) matrix: its rows and columns both correspond to the variables of the original data, each cell holds the covariance (or correlation) of a pair of variables, and in the correlation case the main diagonal is all ones--a variable is perfectly correlated with itself.
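To actually reduce the data with these components (a small continuation of the snippet above, not part of the original answer), sort by eigenvalue and project onto the leading eigenvectors. Strictly, since C here is a correlation matrix, you would project the standardized data; this just shows the mechanics:

# sort eigenvalues in descending order and keep the 2 leading eigenvectors (columns)
order = NP.argsort(evals)[::-1]
top2 = evecs[:, order[:2]]

# project the centered data onto those components: an 8 x 2 reduced representation
projected = NP.dot(data, top2)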
The left singular vectors returned by SVD(A) are the eigenvectors of AA^T.
The covariance matrix of a (mean-centered) dataset A, with observations as columns, is 1/(N-1) * AA^T.
Now, when you do PCA via the SVD, you have to divide the squared singular values by (N-1) to get the eigenvalues of the covariance matrix with the correct scale.
In your case, N=150 and you haven't done this division, hence the discrepancy.
This is explained in detail here
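A quick numerical check of this scaling (a sketch; it assumes data has already been centered and scaled as in the pca function from the question):

import numpy as np

u, s, vt = np.linalg.svd(data, full_matrices=False)
print(s**2 / (len(data) - 1))                                 # eigenvalues of the sample covariance matrix, descending
print(np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1])   # the same values from an eigendecomposition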
(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)
You probably used cov incorrectly by not transposing the matrix first. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.
Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:
a. the left singular vectors of X,
b. the principal components of X,
c. the eigenvectors of X X^T.
Also, the eigenvalues of X X^T are equal to the square of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
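A quick numerical spot-check of these equivalences (a sketch with random data; X is 4-by-150 with observations as columns, as above):

import numpy as np

X = np.random.randn(4, 150)
X = X - X.mean(axis=1, keepdims=True)    # center each row (variable)

Q, S, Vt = np.linalg.svd(X, full_matrices=False)
w, V = np.linalg.eigh(np.dot(X, X.T))    # eigenvalues of X X^T, ascending

print(np.allclose(np.sort(S**2), w))     # True: squared singular values == eigenvalues of X X^T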
Question already addressed: Principal component analysis in Python
