I want to whiten the CIFAR10 dataset using ZCA. The input X_train is of shape (40000, 32, 32, 3) where 40000 is the number of images, and 32x32x3 is the size of each image. I'm using the code from this answer for this purpose:
X_flat = np.reshape(X_train, (-1, 32*32*3))
# compute the covariance of the image data
cov = np.cov(X_flat, rowvar=True) # cov is (N, N)
# singular value decomposition
U,S,V = np.linalg.svd(cov) # U is (N, N), S is (N,)
# build the ZCA matrix
epsilon = 1e-5
zca_matrix = np.dot(U, np.dot(np.diag(1.0/np.sqrt(S + epsilon)), U.T))
# transform the image data; zca_matrix is (N, N)
zca = np.dot(zca_matrix, X_flat) # zca is (N, 3072)
However, at run time I encountered the following warning:
D:\toolkits.win\anaconda3-5.2.0\envs\dlwin36\lib\site-packages\ipykernel_launcher.py:8: RuntimeWarning: invalid value encountered in sqrt
So after I got the SVD output, I tried:
print(np.min(S)) # prints -1.7798217
This is unexpected because S should contain only non-negative values. Also, the ZCA whitening result was not correct and it contained nan values.
I tried reproducing this by re-running this same code a second time and this time I did not encounter any warnings or any negative S values, but instead I got:
print(np.min(S)) # prints nan
Any idea for why this might have happened?
Update: I restarted the kernel to free up CPU and RAM resources and tried running this code again. I again got the same warning about feeding negative values into np.sqrt(). Not sure if it helps, but I've also attached the CPU and RAM utilization figures:
[activity monitor screenshots]
Here are a couple of ideas. I don't have your dataset so I can't be totally sure that these will fix your problem, but I'm confident enough to post this as an answer instead of a comment.
First. Your X_train is 40'000 by 3072, where each row is a data vector, and each column is a variable or feature. You want the covariance matrix that is 3072 by 3072: pass in rowvar=False to np.cov.
I'm not really sure why the SVD of the 40'000 by 40'000 covariance matrix is misbehaving. Assuming you have enough RAM to store that roughly 12.8 GB matrix, the one thing I can think of is numerical overflow, perhaps because you're not removing the mean of the data, which ZCA (and any other whitening technique) expects.
So second. Remove the mean: X_zeromean = X_flat - np.mean(X_flat, 0).
If you do these, then the final step has to be modified a tiny bit (to make dimensions line up). Here's a quick check using uniform random data:
import numpy as np
X_flat = np.random.rand(40000, 32*32*3)
X_zeromean = X_flat - np.mean(X_flat, 0)
cov = np.cov(X_zeromean, rowvar=False)
U,S,V = np.linalg.svd(cov)
epsilon = 1e-5
zca_matrix = np.dot(U, np.dot(np.diag(1.0/np.sqrt(S + epsilon)), U.T))
zca = np.dot(zca_matrix, X_zeromean.T) # <-- transpose needed here
As a sanity check, np.cov(zca) is now very close to the identity matrix, as desired (note that zca has its dimensions transposed relative to the input).
(As a sidenote, this is a really expensive and numerically unstable way to whiten the data array: you don't need to compute the covariance and then take the SVD—you're doing twice the work. You can take the skinny SVD of the data matrix itself (np.linalg.svd with the full_matrices=False flag) and compute the whitening matrix directly from there, without ever evaluating the expensive outer product for the covariance matrix.)
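For reference, here's a minimal sketch of that skinny-SVD route. The shapes are shrunk from the question's (40000, 3072) purely so the check runs quickly; the steps are identical at full size.
import numpy as np
# Toy stand-in for the real data, already flattened to (samples, features)
X_flat = np.random.rand(5000, 300)
X_zeromean = X_flat - np.mean(X_flat, 0)
# Skinny SVD of the data matrix itself: the covariance matrix is never formed
U, S, Vt = np.linalg.svd(X_zeromean, full_matrices=False)
# Eigenvalues of the covariance are S**2 / (n - 1)
n = X_zeromean.shape[0]
eigvals = S**2 / (n - 1)
epsilon = 1e-5
# Same ZCA matrix as before, built from the right singular vectors
zca_matrix = (Vt.T * (1.0 / np.sqrt(eigvals + epsilon))) @ Vt
zca = X_zeromean @ zca_matrix  # rows stay samples
print(np.allclose(np.cov(zca, rowvar=False), np.eye(X_flat.shape[1]), atol=1e-2))  # True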
Related
I'm studying the KNN algorithm to classify images using some material from a 2017 Stanford course. We're given a dataset consisting of many images; those sets are later represented as 2D numpy arrays, and we're supposed to write functions that calculate distances between those images. More specifically, given a 2D array of the test images and a 2D array of the training images, I'm asked to write an L_2 distance function, which takes those two sets as inputs and returns a distance matrix, where every row i represents a test image and every column j represents a training image.
The exercise also asked me to do it without any loops and without using the np.abs function. So I gave it a try:
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.
    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    all_test_subs_sq = (X[:, np.newaxis] - self.X_train)**2  # shape (num_test, num_train, 3072)
    dists = np.sqrt(np.sum(all_test_subs_sq, axis=2))
    return dists
Apparently that makes Google's Colab environment crash in 6 seconds due to allocating about 60 GB of RAM. I guess I should clarify the training set X_train has a shape of (5000, 3072), and the test set X has shape (500, 3072). I am not sure what happens here that is so RAM intensive, but then again I'm not the smartest guy to figure out space complexity.
I googled a bit and found a solution that works without the need for a NASA computer; it uses the sum-of-squares formula:
dists = np.reshape(np.sum(X**2, axis=1), [num_test,1]) + np.sum(self.X_train**2, axis=1)\
- 2 * np.matmul(X, self.X_train.T)
dists = np.sqrt(dists)
I'm also not really sure why this solution doesn't explode like mine did. I'd really appreciate any insight here; thank you very much for reading.
In the compute_distances_no_loops() function the intermediate array all_test_subs_sq has the shape (500, 5000, 3072), so it consists of 500 * 5000 * 3072 = 7,680,000,000 elements. Assuming that the dtype of X and X_train is float64, each element weighs 8 bytes, so the total size of the array is 61,440,000,000 bytes, i.e. about 60 GB.
The other solution you included avoids this problem since it does not create such a large intermediate array. The shape of np.reshape(np.sum(X**2, axis=1), [num_test,1]) is (500, 1) and the shape of np.sum(self.X_train**2, axis=1) is (5000,). When you add them you obtain an array of the shape (500, 5000). np.matmul(X, self.X_train.T) also produces an array of the same shape.
The problem is in
all_test_subs_sq = (X[:, np.newaxis] - self.X_train)**2
X[:, np.newaxis] is equivalent to X[:, np.newaxis, :] and has shape (500, 1, 3072). After broadcasting, X[:, np.newaxis] - self.X_train yields a dense (500, 5000, 3072) array, which is humongous: 500 x 5000 x 3072 x 8 bytes ≈ 61.44 GB, since the dtype is np.float64.
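To convince yourself that the sum-of-squares formula produces the same distance matrix without the giant intermediate, here's a quick check on small made-up arrays (shapes are arbitrary so the broadcasted reference fits in memory):
import numpy as np
rng = np.random.default_rng(0)
X = rng.random((50, 30))        # small stand-in for the test set
X_train = rng.random((80, 30))  # small stand-in for the training set
# Memory-friendly expansion: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
dists = np.sqrt(np.sum(X**2, axis=1)[:, np.newaxis]
                + np.sum(X_train**2, axis=1)
                - 2 * X @ X_train.T)
# Reference computed with the explicit (here small) broadcasted intermediate
ref = np.sqrt(np.sum((X[:, np.newaxis] - X_train)**2, axis=2))
print(np.allclose(dists, ref))  # True, up to floating-point error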
I was going through this amazing playlist for SVD by Steve Brunton on YouTube. I think I got the majority of the concepts, but there are some gaps. Let me add a couple of screenshots so that it's easier for me to explain.
He is considering the input matrix X to be a collection of images. So, considering an image is 28x28 pixels, we flatten it to create a 784x1 column vector. So, each column denotes an image, and the rows denote pixel indices. Let's take the dimension of X to be n x m. Now, after computing the economy SVD, if we keep only the first r (<< m) singular values, then the approximation of X is given by
X' = σ1·u1·v1^T + σ2·u2·v2^T + ... + σr·ur·vr^T
I understand that here we're throwing away information, so the reconstructed images would be pixelated, but they would still have the same dimensions (28x28). So, how are we achieving compression here? Is it because instead of storing 784m pixel values, we only have to store r x (784 (length of each u) + m (length of each v)) values, plus the r singular values? Or is there something more to it?
My second question is, if I try to draw an analogy to numerical features, e.g. let's say a housing price dataset, that has 50 features, and 1000 data points. So, our X matrix has dimension 50 x 1000 (each column being a feature vector). In that case, if there are useless features, we'll get << 50 features (maybe 20, or 10... whatever) after applying PCA, right? I'm not able to grasp how that smaller feature vector is derived when we select only the biggest r singular values. Because X and X' have the same dimensions.
Here's some sample code. The dimensions are reversed (1000 x 50) because that's how sklearn expects the data.
from sklearn.decomposition import PCA
import numpy as np

X = np.random.rand(1000, 50)  # 1000 samples, 50 features
pca = PCA(n_components=10)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape)        # original shape: (1000, 50)
print("transformed shape:", X_pca.shape)  # transformed shape: (1000, 10)
So, how are we going from 50 to 10 here? I get that in this case there would be 50 U basis vectors. So, even if we choose the top r from these 50, the dimensions will still be the same, right? Any help is appreciated.
I've been searching for the answer all over the web, and finally it clicked when I saw this video tutorial. We know X = U Σ V.T. Here, the columns of U give us the principal components for the column space of X. Similarly, the rows of V.T give us the principal components for the row space of X. Since in PCA we tend to represent a feature vector by a row (unlike SVD), we select the first r principal components from the matrix V.T.
Let's assume the dimensions of X to be m x n, i.e. m samples, each with n features. That gives us the following dimensions in the SVD:
U: m x m
Σ: m x n
V: n x n
Now, if we select only r (<< n) principal components, then the projection of X onto the r-dimensional space is given by X·[v1 v2 ... vr]. Here each of v1, v2, ..., vr is a column vector, so the dimension of [v1 v2 ... vr] is n x r. If we now multiply X by this matrix we get an m x r matrix, which is nothing but the projection of all m data points onto r dimensions.
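A quick numerical check of this, with random data standing in for the housing example and r = 10 (the comparison against sklearn is only up to the sign of each component):
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
X = rng.random((1000, 50))             # m = 1000 samples, n = 50 features
Xc = X - X.mean(axis=0)                # PCA works on mean-centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
r = 10
X_proj = Xc @ Vt[:r].T                 # (1000, 10): the m x r projection
X_pca = PCA(n_components=r).fit_transform(X)
print(np.allclose(np.abs(X_proj), np.abs(X_pca)))  # True (signs may flip per component)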
The context of the problem is that I have a ResNet model in Jax (basically NumPy), and I take the gradient of an image with respect to its class prediction. This gives me a gradient vector, g, which I then want to normalize. The trouble is, the magnitudes of the components g[i] are such that g[i]**2 == 0, meaning that np.linalg.norm(g) evaluates to 0, so dividing by it gives me nans.
What I've done so far is just check whether the norm is (near) zero and, if so, multiply by some constant factor, as in g = np.where(np.linalg.norm(g) < 1e-20, g * 1e20, g).
I was thinking maybe I should instead divide by the smallest nonzero element and then normalize. Does anyone have ideas on how to properly normalize this vector?
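For what it's worth, one common way to dodge the underflow is to rescale by the largest magnitude before taking the norm. A minimal NumPy sketch (the helper name safe_normalize is just illustrative; jax.numpy has the same calls, though under jit you'd replace the Python if with jnp.where):
import numpy as np
def safe_normalize(g):
    # Rescale by the largest magnitude first so that squaring cannot underflow to 0
    max_abs = np.max(np.abs(g))
    if max_abs == 0.0:                 # truly all-zero gradient: nothing to normalize
        return g
    g_scaled = g / max_abs
    return g_scaled / np.linalg.norm(g_scaled)
g = np.full(10, 1e-170)                   # each g[i]**2 underflows to 0 in float64
print(np.linalg.norm(g))                  # 0.0 -> dividing by it blows up (inf/nan)
print(np.linalg.norm(safe_normalize(g)))  # 1.0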
I have two tensors named x_t and x_k with the following shapes: NxHxW and KxNxHxW respectively, where K is the number of autoencoders used to reconstruct x_t (if you have no idea what this is, assume they're K different nets aiming to predict x_t; this probably has nothing to do with the question anyway), N is the batch size, H is the matrix height, and W is the matrix width.
I'm trying to apply the Kullback-Leibler divergence to both tensors (after broadcasting x_t as x_k along the Kth dimension) using PyTorch's nn.functional.kl_div method.
However, it does not seem to be working as I expected. I'm looking to calculate the kl_div between each observation in x_t and x_k, resulting in a tensor of size KxN (i.e., the kl_div of each observation for each of the K autoencoders).
The actual output is a single value if I use the reduction argument, and the same tensor size (i.e., KxNxHxW) if I do not use it.
Has anyone tried something similar?
Reproducible example:
import torch
import torch.nn.functional as F
# x_t: N x H x W, x_k: K x N x H x W
x_t = torch.randn(10, 5, 5)
x_k = torch.randn(3, 10, 5, 5)
x_broadcasted = x_t.expand_as(x_k)
loss = F.kl_div(x_t, x_k, reduction="none")  # or "batchmean"; there are many options
It's unclear to me what exactly constitutes a probability distribution in your model. With reduction='none', kl_div, given an input log(x_n) and a target y_n, computes y_n * (log(y_n) - log(x_n)), which is the pointwise summand of the actual Kullback-Leibler divergence; the summation (in other words, taking the expectation) is left to you. If your point is that H and W are the two dimensions over which you want to take the expectation, it's as simple as
loss = F.kl_div(x_t, x_k, reduction="none").sum(dim=(-1, -2))
Which is of shape [K, N]. If your network output is to be interpreted differently, you need to better specify which are the event dimensions and which are sample dimensions of your distribution.
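For completeness, a sketch of that end to end, under the assumption that each H x W map should be treated as one distribution (the softmax over the flattened H*W entries is my assumption, not something stated in the question):
import torch
import torch.nn.functional as F
K, N, H, W = 3, 10, 5, 5
x_t = torch.randn(N, H, W)
x_k = torch.randn(K, N, H, W)
# kl_div expects log-probabilities for the input and probabilities for the target
log_p = F.log_softmax(x_t.view(N, -1), dim=-1).view(N, H, W)
q = F.softmax(x_k.view(K, N, -1), dim=-1).view(K, N, H, W)
# Broadcast x_t's log-probs against the K targets, then sum over H and W
loss = F.kl_div(log_p.expand_as(q), q, reduction="none").sum(dim=(-1, -2))
print(loss.shape)  # torch.Size([3, 10]), i.e. [K, N]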
I am currently calculating multiple confusion matrices and normalizing them.
for i in range(0, 215):
    [...]
    matrix_confusion[i] = np.asarray(confusion_matrix(Y_test, Y_pred))
    matrix_confusion[i] = matrix_confusion[i].astype(float) / \
        matrix_confusion[i].sum(axis=1)[:, np.newaxis]
The goal is to calculate the mean out of all confusion matrices which are filled in the loop above. The problem is that a lot of matrices are not filled because I am skipping the iterations when a ValueError is raised. So I have some matrices which are empty (prefilled with zeros).
Now I thought about doing the following:
matrix_confusion = matrix_confusion[matrix_confusion!=0]
But this also kills the 0s in the normalized confusion matrices. How should I proceed if I just want a confusion matrix that represents the mean of all the 2x2 confusion matrices filled in the loop, without considering the prefilled ones?
# prefilling
matrix_confusion = np.zeros((200,2,2))
Thanks for your help!
First find the matrices that are not all zeros:
valids = np.logical_or.reduce(matrix_confusion != 0, axis=(1, 2))
Then compute the mean:
matrix_confusion_mean = np.mean(matrix_confusion[valids], axis=0)
You should still be careful that at least some matrix is valid, otherwise you would get a matrix of NaNs. You could do:
if np.any(valids):
    matrix_confusion_mean = np.mean(matrix_confusion[valids], axis=0)
else:
    matrix_confusion_mean = np.zeros((2, 2))
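A self-contained demo on made-up numbers, where the middle matrix plays the role of a skipped iteration:
import numpy as np
matrix_confusion = np.zeros((3, 2, 2))
matrix_confusion[0] = [[0.9, 0.1], [0.2, 0.8]]
matrix_confusion[2] = [[0.7, 0.3], [0.4, 0.6]]
valids = np.logical_or.reduce(matrix_confusion != 0, axis=(1, 2))
print(valids)  # [ True False  True]
if np.any(valids):
    matrix_confusion_mean = np.mean(matrix_confusion[valids], axis=0)
else:
    matrix_confusion_mean = np.zeros((2, 2))
print(matrix_confusion_mean)  # [[0.8 0.2], [0.3 0.7]]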