Trouble in understanding how PCA is achieving image compression and reducing dimension - python

I was going through this amazing playlist for SVD by Steve Brunton in youtube. I think I got majority of the concepts but there are some gaps. Let me add a couple of screenshots so that it's easier for me to explain.
He is considering the input matrix X to be a collection of images. So, considering an image is 28x28 pixels, we flatten it to create a 784x1 column vector. So, each column denotes an image, and the rows denote pixel indices. Let's take the dimension of X to be n x m. Now, after computing the economy SVD, if we keep only the first r (<< m) singular values, then the approximation of X is given by
X' = σ1.u1.v1(T) + σ2.u2.v2(T) + ... + σr.ur.vr(T)
I understand that here, we're throwing away information, so the reconstructed images would be pixelated but they would still be of the same dimension (28x28). So, how are we achieving compression here? Is it because instead of storing 784m pixel values, we'll have to store r x (28 (length of each u) + 28 (length of each v)) pixels? Or is there something more to it?
My second question is, if I try to draw an analogy to numerical features, e.g. let's say a housing price dataset, that has 50 features, and 1000 data points. So, our X matrix has dimension 50 x 1000 (each column being a feature vector). In that case, if there are useless features, we'll get << 50 features (maybe 20, or 10... whatever) after applying PCA, right? I'm not able to grasp how that smaller feature vector is derived when we select only the biggest r singular values. Because X and X' have the same dimensions.
Let's have a sample code. The dimensions are reversed because of how sklearn expects it.
pca = PCA(n_components=10)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape) # original shape: (1000, 50)
print("transformed shape:", X_pca.shape) # transformed shape: (1000, 10)
So, how are we going from 50 to 10 here? I get that that in this case there would be 50 U basis vectors. So, even if we choose top r from these 50, the dimensions will still be the same, right? Any help is appreciated.

I've been searching for the answer all over the web, and finally it clicked when I saw this video tutorial. We know X = U x ∑ x V.T. Here, columns of U give us the principal components for the colspace of X. Similarly rows of V.T give us the principal components for the rowspace of X. Since, in pca we tend to represent a feature vector by a row (unlike svd), so we'd select the first r principal components from the matrix V.T.
Let's assume the dimensions of X to be mxn. So, we have m samples each having n features. That gives us the following dimensions for the SVD:
U: mxm
∑: mxn
V: nxn
Now, if we select only r (<< n) principal components then the projection of X to the r-dimensional space would be given by X.[v1 v2 ... vr]. Here each of v1, v2, ... vr is a column vector. So, the dimension of [v1 v2 ... vr] is nxr. If we now multiply X with this vector we get an nxr matrix, which is nothing but the projection of all the data points to r dimensions.

Related

Why does this euclidean distance calculation method explodes RAM usage?

I'm studying the KNN algorithm to classify images using some material from a 2017 Stanford course. We're given a dataset consisting of many images, later those sets are represented as 2D numpy arrays, and we're supposed to write functions that calculate distances between those images. More specifically, given a 2D array of the test images and a 2D array of the training images, I'm asked to write a L_2 distance function, which takes those two sets as inputs and returns a distance matrix, where every row i represents a test image and every column j represents a training image.
The exercise also asked me to do it without any loops and without using np.abs function. So I gave it a try and tried:
def compute_distances_no_loops(self, X):
"""
Compute the distance between each test point in X and each training point
in self.X_train using no explicit loops.
Input / Output: Same as compute_distances_two_loops
"""
num_test = X.shape[0]
num_train = self.X_train.shape[0]
dists = np.zeros((num_test, num_train))
all_test_subs_sq = (X[:, np.newaxis] - self.X_train)**2
dists = np.sqrt(np.sum(all_test_subs_sq), axis = 2)
return dists
Apparently that makes Google's Colab environment crash in 6 seconds due to allocating about 60 GB of RAM. I guess I should clarify the training set X_train has a shape of (5000, 3072), and the test set X has shape (500, 3072). I am not sure what happens here that is so RAM intensive, but then again I'm not the smartest guy to figure out space complexity.
I googled a bit and found out a solution that works without the need for a NASA computer, it uses the sum of the squares formula:
dists = np.reshape(np.sum(X**2, axis=1), [num_test,1]) + np.sum(self.X_train**2, axis=1)\
- 2 * np.matmul(X, self.X_train.T)
dists = np.sqrt(dists)
I'm also not really sure why doesn't this solution explode like mine did. I'd really appreciate any insight here, thank you very much for reading.
In the compute_distances_no_loops() function the intermediate array all_test_subs_sq has the shape (500, 3072, 5000), so it consists of 500 * 3072 * 5000 = 7,680,000,000 elements. Assuming that the dtype of X and X_train is float64, each element weights 8 bytes, so the total size of the array is 61,440,000,000 bytes i.e. about 60 GB.
The other solution you included avoids this problem since it does not create such a large intermediate array. The shape of np.reshape(np.sum(X**2, axis=1), [num_test,1]) is (500, 1) and the shape of np.sum(self.X_train**2, axis=1) is (5000,). When you add them you obtain an array of the shape (500, 5000). np.matmul(X, self.X_train.T) also produces an array of the same shape.
The problem is in
all_test_subs_sq = (X[:, np.newaxis] - self.X_train)**2
X[:, np.newaxis] is equivalent to X[:, np.newaxis, :] of shape (50, 1, 3072). After broadcasting, X[:, np.newaxis] - self.X_train yields a dense (500, 5000, 3072) array which is humongous 500 x 5000 x 3072 x 8 bytes ≈ 61.44 GB since you have np.float64.

Reshaping numpy array

What I am trying to do is take a numpy array representing 3D image data and calculate the hessian matrix for every voxel. My input is a matrix of shape (Z,X,Y) and I can easily take a slice along z and retrieve a single original image.
gx, gy, gz = np.gradient(imgs)
gxx, gxy, gxz = np.gradient(gx)
gyx, gyy, gyz = np.gradient(gy)
gzx, gzy, gzz = np.gradient(gz)
And I can access the hessian for an individual voxel as follows:
x = 100
y = 100
z = 63
H = [[gxx[z][x][y], gxy[z][x][y], gxz[z][x][y]],
[gyx[z][x][y], gyy[z][x][y], gyz[z][x][y]],
[gzx[z][x][y], gzy[z][x][y], gzz[z][x][y]]]
But this is cumbersome and I can't easily slice the data.
I have tried using reshape as follows
H = H.reshape(Z, X, Y, 3, 3)
But when I test this by retrieving the hessian for a specific voxel the, the value returned from the reshaped array is completely different than the original array.
I think I could use zip somehow but I have only been able to find that for making lists of tuples.
Bonus: If there's a faster way to accomplish this please let me know, I essentially need to calculate the three eigenvalues of the hessian matrix for every voxel in the 3D data set. Calculating the hessian values is really fast but finding the eigenvalues for a single 2D image slice takes about 20 seconds. Are there any GPUs or tensor flow accelerated libraries for image processing?
We can use a list comprehension to get the hessians -
H_all = np.array([np.gradient(i) for i in np.gradient(imgs)]).transpose(2,3,4,0,1)
Just to give it a bit of explanation : [np.gradient(i) for i in np.gradient(imgs)] loops through the two levels of outputs from np.gradient calls, resulting in a (3 x 3) shaped tensor at the outer two axes. We need these two as the last two axes in the final output. So, we push those at the end with the transpose.
Thus, H_all holds all the hessians and hence we can extract our specific hessian given x,y,z, like so -
x = 100
y = 100
z = 63
H = H_all[z,y,x]

Broadcasting - 3D field of coefficients to 3D field of matrices given matrix basis

I have a (large) 4D array, consisting of the 5 coefficients in a given basis for a matrix field. Given the 5 basis matrices, I want to efficiently calculate the matrix field.
The coefficient field c[x,y,z,i] being the value of i-th coefficient at position x,y,z
And the matrix field M[x,y,z,a,b] being the (3,3) matrix at position x,y,z
And the basis matrices T_1,...T_5, being the (3,3) basis matrices
I could loop over each position in space:
M[x,y,z,:,:] = T_1[:,:]*c[x,y,z,0] + T_2[:,:]*c[x,y,z,1]...T_5[:,:]*c[x,y,z,4]
But this is very inefficient. My attempts at using np.multiply,np.sum result in broadcasting errors due to the ambiguity of the desired product being a field of 3x3 matrices.
Keep in mind that to numpy, these 4 and 5d arrays are just that, not 3d arrays containing 2d matrices, etc.
Let's try to write your calculation in a way that clarifies dimensions:
M[x,y,z] = T_1*c[x,y,z,0] + T_2*c[x,y,z,1]...T_5*c[x,y,z,4]
M[x,y,z,:,:] = T_1[:,:]*c[x,y,z,0] + T_2[:,:]*c[x,y,z,1]...T_5[:,:]*c[x,y,z,4]
c[x,y,z,i] is a coefficient, right? So M is a weighted sum of the T_n arrays?
One way of expressing this is:
T = np.stack([T_1, T_2, ...T_5], axis=0) # 3d (nab)
M = np.einsum('nab,xyzn->xyzab', T, c)
We could alternatively stack T_i on a new last axis
T = np.stack([T_1, T_2 ...T_5], axis=2) # (abn)
M = np.einsum('abn,xyzn->xyzab', T, c)
or as broadcasted multiplication plus sum:
M = (T[None,None,None,:,:,:] * c[:,:,:,None,None,:]).sum(axis=-1)
I'm writing this code without testing, so there may be errors, but I think the basic outline is right.
It could also be written as a dot, if I can put the n dimension last in one argument, and 2nd to the last in the other. Or with tensordot. But there's less control over broadcasting of the other dimensions.
For test calculations you could also reshape these arrays so that the x,y,z are rolled into one, and the a,b into another, e.g
M[xyz,:] = T_n[ab]*c[xyz,n] # etc

Python error in array manipulation

I have 2 arrays (for the sake of the example, let's name them A and B) and i perform the following manipulations at them, but i get an error at the assignment of "d2" in my code.
n = len(tracks) #tracks is a list containing different-length 3d arrays
n=30; #test with a few tracks
length = len(tracks) #list containing the total number of "samples"
perm_index = np.random.permutation(length) #uniform sampling without replacement
subset_len = 5 # choose the size of subset of tracks A
subset_A = [tracks[x:x+1] for x in xrange(0, subset_len, 1)]
subset_B = [tracks[x:x+1] for x in xrange(subset_len, n, 1)]
tempA = distance_calc.dist_calcsub(len(subset_A), subset_A) # distance matrix calculation
tempA = mcp.sym_mcp(len(subset_A), tempA) # symmetrize mcp ???
tempB = distance_calc.dist_calcsubs(subset_A, subset_B) # distance matrix calculation
#symmetrize mcp ? ? its not diagonal, symmetric . . .
A = affinity.aff_conv(60, tempA) # conversion to affinity
B = affinity.aff_conv(60, tempB) # conversion to affinity
#((row,col)) = np.shape(A)
#A = normalization_affinity.norm_aff(row, col, A) # normalization of affinity matrix
# Normalize A and B for Laplacian using row sums of W, where W = [A B; B' B'*A^-1*B].
# Let d1 = [A B]*1, d2 = [B' B'*A^-1*B]*1, dhat = sqrt(1./[d1; d2]).
d1 = np.sum( np.vstack((A, np.transpose(B))) )
d2 = np.sum(B,0) + np.dot(np.sum(np.transpose(B),0), np.dot(np.linalg.pinv(A), B ))
dhat = np.transpose(np.sqrt( 1/ np.hstack((d1, d2)) ))
A = A* np.dot( dhat[0:subset_len], np.transpose(dhat[0:subset_len]) )
B = B* np.dot( dhat[0:subset_len], np.transpose(dhat[subset_len:n]) )
The error again is "ValueError: matrices are not aligned." because the np.dot vectors are 1d vectors of different size; I know the reason why this is happening but I am following exactly the equations to perform the Nystrom method.
P.S: I am following the method described in p.90-92 in this thesis: thesis link
Looking at the paper, you've got two problems here.
Let's start with the information you left out of your question. You're trying to do this operation:
bc + B.T * A^−1 * br
where ar and br are column vectors containing the row sums of A and B and bc is
the column sum of B.
In particular, you're mapping that A^-1 * br to np.dot( np.linalg.pinv(A), np.sum(B, 0)).
The first problem is that np.linalg.pinv is the pseudo-inverse, A+, not the multiplicative inverse, A^-1. Using a completely different operation just because it doesn't give you an error doesn't solve the problem.
So, how do you calculate the multiplicative inverse? Well, you can't. In general, the multiplicative inverse doesn't exist for non-square matrices, so given a 5x10 A, you're stuck right at the beginning.
Anyway, the second problem comes from the fact that your br isn't a column vector. If you want to think in matrix terms, as the paper does, it's a row vector, 10x1 instead of 1x10. If you want to think in numpy ndarray terms, it's a 1D (10,) array instead of a 2D (1, 10) array. If you think of the operation in matrix multiplication terms, you can't multiply a 10x5 matrix with a 10x1 matrix; if you think of it in NumPy terms as the multidimensional dot product, you can't multiply a (10, 5) array with a (10,) array.
It's true that you can extend the dot product to specifically the domain of MxN matrices vs. M vectors, and under that definition your multiplication would make sense. But that's not the definition used by either the paper's standard matrix multiplication notation or NumPy's dot function. So, what can you do? Well, note that the operation you're trying to do is commutative, so swapping the order of operands is perfectly legal—and if you do that, then it does happen to correspond to the general dot product. So, you could write this as np.dot(np.sum(B, 0), np.linalg.pinv(A)) and get the result you want. And there are a number of other ways you could transform the arrays that are idempotent in your matrix-vs.-vector multiplication domain but meaningful for np.dot, and they will all get you the same result. For example, np.dot(np.linalg.pinv(A).T, np.sum(B, 0)) will also work.
I'm also not sure why you're using dot product in the first place. I don't see anything in the notation to imply that
But all of this is a sideshow; if you inverted A properly, you would have something with the same dimensions as A, and multiply a 5x10 matrix with a 10x1 vector, or a (5, 10) array with a (10,) array, is already perfectly well defined. The only problem is that, again, you can't generally invert non-square matrices, so there's no way you can actually get to this place.
So, the real solution is to go back to wherever you decided on those shapes for A and B and try again.
In particular, it's pretty clear from the illustration in the paper showing the derivation of A and B from the larger matrix that the height of A is the height of B, and the width of A is the width of B.T, which is of course the height of B again.
Also, if the larger matrix is supposed to symmetric, and A is the upper left corner of a symmetric matrix, A has to be symmetric.
I also think you've mixed up row-column order and x-y order a few times, and bc is supposed to the column sums of B, not the column sums of B.T (which would just be the row sums of B, flipped into a row vector instead of a column vector).
While we're at it, let's use methods and operators where possible instead of writing everything in the longest possible way.
So, I think what you wanted is something like this:
A = np.random.random_sample((4, 4)) # square
A = (A + A.T) / 2 # and symmetric
B = np.random.random_sample((4, 10))
ar = A.sum(1)
br = B.sum(1)
bc = B.sum(0) # not B.T.sum(0), that's just br again!
d1 = ar + br
d2 = bc + np.dot(B.T, np.dot(np.linalg.inv(A), br))
Without actually reading the paper I can't be sure this is what you actually want, but this looks like it fits with a quick skim of those two pages, and it runs without any errors, so hopefully you can at least look at the results and see if they are what you want.
You are summing over the first dimension of B, so the shape is 10, the size of the second dimension of B.
You can calculate
np.dot( np.sum(B, 0), np.linalg.pinv(A))
but this gives you a vector with 5 elements, but B_T has only a size of 4. So something doesn't fit in your sample data.

What's wrong with my PCA?

My code:
from numpy import *
def pca(orig_data):
data = array(orig_data)
data = (data - data.mean(axis=0)) / data.std(axis=0)
u, s, v = linalg.svd(data)
print s #should be s**2 instead!
print v
def load_iris(path):
lines = []
with open(path) as input_file:
lines = input_file.readlines()
data = []
for line in lines:
cur_line = line.rstrip().split(',')
cur_line = cur_line[:-1]
cur_line = [float(elem) for elem in cur_line]
data.append(array(cur_line))
return array(data)
if __name__ == '__main__':
data = load_iris('iris.data')
pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Output:
[ 20.89551896 11.75513248 4.7013819 1.75816839]
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - Same as I got but transposed so okay I guess
Also, what's with the output of the linalg.eig function? According to the PCA description on wikipedia, I'm supposed to this:
cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print val
But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should have 4 eigenvalues and not 150 like the eig gives me. Am I doing something wrong?
edit: I've noticed that the values differ by 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to add to be equal to the number of dimensions, in this case, 4. What I don't understand is why this difference is happening. If I simply divided the eigenvalues by len(data) I could get the result I want, but I don't understand why. Either way the proportion of the eigenvalues isn't altered, but they are important to me so I'd like to understand what's going on.
You decomposed the wrong matrix.
Principal Component Analysis requires manipulating the eigenvectors/eigenvalues
of the covariance matrix, not the data itself. The covariance matrix, created from an m x n data matrix, will be an m x m matrix with ones along the main diagonal.
You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef:
import numpy as NP
import numpy.linalg as LA
# a simulated data set with 8 data points, each point having five features
data = NP.random.randint(0, 10, 40).reshape(8, 5)
# usually a good idea to mean center your data first:
data -= NP.mean(data, axis=0)
# calculate the covariance matrix
C = NP.corrcoef(data, rowvar=0)
# returns an m x m matrix, or here a 5 x 5 matrix)
# now get the eigenvalues/eigenvectors of C:
eval, evec = LA.eig(C)
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD,
though, you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's)
LA module--it is a little easier to work with than svd, the return values are the eigenvectors
and eigenvalues themselves, and nothing else. By contrast, as you know, svd doesn't return these these directly.
Granted the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however when doing PCA, you'll always have a square matrix to decompose,
regardless of the form that your data is in. This is obvious because the matrix you
are decomposing in PCA is a covariance matrix, which by definition is always square
(i.e., the columns are the individual data points of the original matrix, likewise
for the rows, and each cell is the covariance of those two points, as evidenced
by the ones down the main diagonal--a given data point has perfect covariance with itself).
The left singular values returned by SVD(A) are the eigenvectors of AA^T.
The covariance matrix of a dataset A is : 1/(N-1) * AA^T
Now, when you do PCA by using the SVD, you have to divide each entry in your A matrix by (N-1) so you get the eigenvalues of the covariance with the correct scale.
In your case, N=150 and you haven't done this division, hence the discrepancy.
This is explained in detail here
(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)
You probably used cov incorrectly by not transposing the matrix first. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.
Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:
a. the left singular vectors of X,
b. the principal components of X,
c. the eigenvectors of X X^T.
Also, the eigenvalues of X X^T are equal to the square of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
Question already adressed: Principal component analysis in Python

Categories