Applying Kullback-Leibler (aka KL) divergence element-wise in PyTorch

I have two tensors, x_t and x_k, with shapes NxHxW and KxNxHxW respectively, where K is the number of autoencoders used to reconstruct x_t (if that means nothing to you, assume they're K different nets aiming to predict x_t; this probably has nothing to do with the question anyway), N is the batch size, H the matrix height, and W the matrix width.
I'm trying to apply the Kullback-Leibler divergence to both tensors (after broadcasting x_t to the shape of x_k along the K dimension) using PyTorch's nn.functional.kl_div function.
However, it does not work as I expected. I want to calculate the kl_div between each observation in x_t and x_k, resulting in a tensor of size KxN (i.e., the kl_div of each observation for each of the K autoencoders).
The actual output is a single value if I use the reduction argument, and a tensor of the same size (i.e., KxNxHxW) if I do not.
Has anyone tried something similar?
Reproducible example:
import torch
import torch.nn.functional as F
#                  K   N  H  W
x_t = torch.randn(    10, 5, 5)
x_k = torch.randn( 3, 10, 5, 5)
x_broadcasted = x_t.expand_as(x_k)
loss = F.kl_div(x_t, x_k, reduction="none") # or "batchmean", or there are many options

It's unclear to me what exactly constitutes a probability distribution in your model. With reduction='none', kl_div, given log(x_n) and y_n, computes kl_div = y_n * (log(y_n) - log(x_n)), which is the summand (the part inside the sum) of the actual Kullback-Leibler divergence. The summation (or, in other words, taking the expectation) is up to you. If your point is that H and W are the two dimensions over which you want to take the expectation, it's as simple as
loss = F.kl_div(x_t, x_k, reduction="none").sum(dim=(-1, -2))
This has shape [K, N]. If your network output is to be interpreted differently, you need to specify more precisely which dimensions of your distribution are event dimensions and which are sample dimensions.
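
A minimal sketch of that reduction, assuming (this is not stated in the question) that each HxW map is a single distribution and that the raw network outputs are unnormalized scores. Under that assumption the maps are normalized with a softmax over the flattened H*W entries before calling F.kl_div, which expects log-probabilities as its first argument and probabilities as its second:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
K, N, H, W = 3, 10, 5, 5
x_t = torch.randn(N, H, W)       # target observations
x_k = torch.randn(K, N, H, W)    # K reconstructions of each observation

# Assumption: each HxW map is one distribution, so normalize over H*W.
log_q = F.log_softmax(x_k.flatten(start_dim=-2), dim=-1)  # (K, N, H*W), log-probabilities
p = F.softmax(x_t.flatten(start_dim=-2), dim=-1)          # (N, H*W), probabilities
p = p.expand_as(log_q)                                    # broadcast along K

# Pointwise terms p * (log p - log q), summed over the event dimension.
kl = F.kl_div(log_q, p, reduction="none").sum(dim=-1)     # (K, N)
print(kl.shape)
```

With this argument order, kl[k, n] is the divergence of the k-th reconstruction's distribution from the n-th target distribution. If the event dimensions are different in your model, only the flatten and sum dimensions would need to change.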

Related

Numerically stable normalizing for vectors of small magnitudes

The context of the problem is that I have a ResNet model in JAX (basically NumPy), and I take the gradient of its class prediction with respect to an image. This gives me a gradient vector, g, which I then want to normalize. The trouble is that the components g[i] are so small that g[i]**2 == 0, so np.linalg.norm(g) evaluates to 0 and dividing by it gives me NaNs.
What I've done so far is check whether the norm is (nearly) 0 and, if so, multiply by some constant factor, as in g = np.where(np.linalg.norm(g) < 1e-20, g * 1e20, g).
I was thinking that maybe I should instead divide by the smallest nonzero element and then normalize. Does anyone have ideas on how to properly normalize this vector?
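
One common approach, sketched here in plain NumPy as an illustration rather than as an answer from the original thread, is to rescale by the largest absolute component before computing the norm, so that the squares can no longer underflow. The helper name safe_normalize is hypothetical:

```python
import numpy as np

def safe_normalize(g):
    """Normalize g to unit length without squaring tiny components directly."""
    scale = np.max(np.abs(g))
    if scale == 0:
        return g  # an all-zero vector has no direction to preserve
    scaled = g / scale  # the largest component is now +1 or -1
    return scaled / np.linalg.norm(scaled)

g = np.array([3e-30, 4e-30], dtype=np.float32)
print(np.linalg.norm(g))   # 0.0, because the squares underflow in float32
print(safe_normalize(g))   # approximately [0.6 0.8]
```

Dividing by the largest component rather than the smallest keeps every rescaled entry in [-1, 1], so the rescaling itself cannot overflow.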

Getting negative S value from SVD decomposition in Numpy?

I want to whiten the CIFAR10 dataset using ZCA. The input X_train is of shape (40000, 32, 32, 3) where 40000 is the number of images, and 32x32x3 is the size of each image. I'm using the code from this answer for this purpose:
X_flat = np.reshape(X_train, (-1, 32*32*3))
# compute the covariance of the image data
cov = np.cov(X_flat, rowvar=True) # cov is (N, N)
# singular value decomposition
U,S,V = np.linalg.svd(cov) # U is (N, N), S is (N,)
# build the ZCA matrix
epsilon = 1e-5
zca_matrix = np.dot(U, np.dot(np.diag(1.0/np.sqrt(S + epsilon)), U.T))
# transform the image data zca_matrix is (N,N)
zca = np.dot(zca_matrix, X_flat) # zca is (N, 3072)
However, at run time I encountered the following warning:
D:\toolkits.win\anaconda3-5.2.0\envs\dlwin36\lib\site-packages\ipykernel_launcher.py:8: RuntimeWarning: invalid value encountered in sqrt
So after I got the SVD output, I tried:
print(np.min(S)) # prints -1.7798217
This is unexpected because S can only contain non-negative values. Also, the ZCA whitening result was not correct and contained nan values.
I tried reproducing this by re-running this same code a second time and this time I did not encounter any warnings or any negative S values, but instead I got:
print(np.min(S)) # prints nan
Any idea for why this might have happened?
Update: I restarted the kernel to free up CPU and RAM resources and tried running this code again. Again I got the same warning about feeding negative values to np.sqrt(). Not sure if it helps, but I've also attached the CPU and RAM utilization figures:
[Figure: activity monitor figures]

Here are a couple of ideas. I don't have your dataset so I can't be totally sure that these will fix your problem, but I'm confident enough to post this as an answer instead of a comment.
First. Your X_train is 40'000 by 3072, where each row is a data vector, and each column is a variable or feature. You want the covariance matrix that is 3072 by 3072: pass in rowvar=False to np.cov.
I'm not really sure why the 40'000 by 40'000 covariance matrix's SVD is diverging. Assuming you have enough RAM to store the 12 GB covariance matrix, the one thing I can think of is numerical overflow, because you're perhaps not removing the mean of the data, as is expected by ZCA (and any other whitening technique)?
So second. Remove the mean: X_zeromean = X_flat - np.mean(X_flat, 0).
If you do these, then the final step has to be modified a tiny bit (to make dimensions line up). Here's a quick check using uniform random data:
import numpy as np
X_flat = np.random.rand(40000, 32*32*3)
X_zeromean = X_flat - np.mean(X_flat, 0)
cov = np.cov(X_zeromean, rowvar=False)
U,S,V = np.linalg.svd(cov)
epsilon = 1e-5
zca_matrix = np.dot(U, np.dot(np.diag(1.0/np.sqrt(S + epsilon)), U.T))
zca = np.dot(zca_matrix, X_zeromean.T) # <-- transpose needed here
As a sanity check, np.cov(zca) is now very close to the identity matrix, as desired (zca will have its dimensions flipped relative to the input: features along rows, samples along columns).
(As a side note, this is a really expensive and numerically unstable way to whiten the data array: you don't need to compute the covariance and then take its SVD, which is twice the work. You can take the skinny SVD of the data matrix itself (np.linalg.svd with the full_matrices=False flag) and compute the whitening matrix directly from there, without ever evaluating the expensive outer product for the covariance matrix.)
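
A minimal sketch of the skinny-SVD route mentioned in the side note, using a small random stand-in for the data (500 samples, 20 features) instead of CIFAR10. It relies on the fact that if Xc = U diag(s) Vt, the covariance of Xc has eigenvectors given by the rows of Vt and eigenvalues s**2 / (n - 1); the variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 20))          # stand-in data: rows are samples
Xc = X - X.mean(axis=0)            # remove the per-feature mean
n = Xc.shape[0]

# Skinny SVD of the data matrix itself: Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Covariance eigenvalues follow from the singular values.
eigvals = s**2 / (n - 1)
epsilon = 1e-5
zca_matrix = Vt.T @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ Vt  # (20, 20)

zca = Xc @ zca_matrix              # (500, 20), same layout as the input
print(np.allclose(np.cov(zca, rowvar=False), np.eye(20), atol=1e-3))  # True
```

Because the whitening matrix is applied on the right, zca keeps one sample per row here, so no transpose is needed at the end.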

Two different cost in Logistic Regression cost function

I am implementing the logistic regression algorithm with two features, x1 and x2. I am writing the code for the cost function in logistic regression.
def computeCost(X,y,theta):
    J =((np.sum(-y*np.log(sigmoid(np.dot(X,theta)))-(1-y)*(np.log(1-sigmoid(np.dot(X,theta))))))/m)
    return J
Here X is the training set matrix and y is the output.
The shape of X is (100,3) and the shape of y is (100,), as reported by NumPy's shape attribute. My theta initially contains all zeros and has shape (3,1). When I calculate the cost with these parameters I get 69.314, which is incorrect. The correct cost is 0.69314. I do get this correct cost when I reshape my y vector with y = numpy.reshape(y,(-1,1)).
But I don't understand how this reshaping corrects my cost.
Here m (the number of training examples) is 100.

First of all, never simply dump your code in future! Your post (code + explanation) should be as descriptive as it can be (but not verbose, or nobody will read it). Please post readable code in future, otherwise it's hard to read and answer. Here is what your code is doing:
def computeCost(X,y,theta):
    '''
    Using binary cross-entropy (log loss)
    X: (100,3)
    y: (100,1)
    theta: (3,1)
    Returns the scalar cost
    Cost = sum( -labels*log(predictions) - (1-labels)*log(1-predictions) ) / len(labels)
    '''
    m = len(y)
    # calculate the predictions
    predictions = sigmoid(np.dot(X,theta))
    # error for when the label is of class1 (y = 1)
    class1_cost = -y * np.log(predictions)
    # (negated) error for when the label is of class2 (y = 0)
    class2_cost = (1-y)*np.log(1-predictions)
    # total cost
    cost = class1_cost-class2_cost
    # averaging cost
    cost = cost.sum() / m
    return cost
You should first understand how the dot product works in math and what input shapes your algorithm needs in order to give you the correct answer! Don't throw random shapes at it! Your feature matrix is of shape (100,3), which when multiplied by your theta of shape (3,1) outputs a prediction vector of shape (100,1).
Matrix multiplication: the product of an M x N matrix and an N x K matrix is an M x K matrix. The new matrix takes its rows from the 1st and its columns from the 2nd.
So your y should have shape (100,1) and not (100,). Huge difference! One is [[3],[4],[6],[7],[9],...] and the other is [3,4,6,7,9,...].
Your dimensions should match for correct output!
A better way of asking the question would be: how do I calculate the error/cost in logistic regression using the correct dimensions for my labels?
For additional understanding:
import numpy as np
label_type1 = np.random.rand(100,1)
label_type2 = np.random.rand(100,)
predictions = np.random.rand(100,1)
print(label_type1.shape, label_type2.shape, predictions.shape)
# When you multiply (100,1) with (100,1) --> (100,1)
print((label_type1 * predictions).shape)
# When you do a dot product of (100,1) with (100,1) --> error; you would have to take a transpose, which isn't relevant in this context!
# print( np.dot(label_type1,predictions).shape) # error: shapes (100,1) and (100,1) not aligned: 1 (dim 1) != 100 (dim 0)
print( np.dot(label_type1.T,predictions).shape) # (1, 1)
print('*'*5)
# When you multiply (100,) with (100,1) --> (100,100) !
print((label_type2 * predictions).shape) # broadcasting
# When you do a dot product of (100,) with (100,1) --> (1,) !
print(np.dot(label_type2, predictions).shape)
print('*'*5)
# what you are doing
label_type1_addDim = np.reshape(label_type2,(-1,1))
print(label_type1_addDim.shape)
So, coming straight to the point: what you want is an element-wise cost of dim (100,1)! So either you do the 1st case, which you aren't doing, or you do the fifth, where you unknowingly add a dimension to your y, making it go from (100,) to (100,1), and then do the same * operation as in the 1st case to get dim (100,1).
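
The factor of 100 between the two costs can be reproduced with a short sketch. The data here is random and the function name compute_cost is illustrative; with theta at zero every prediction is 0.5, so each term equals log(2) regardless of the label:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, theta):
    m = X.shape[0]
    p = sigmoid(X @ theta)  # shape (m, 1)
    return np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p)) / m

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = rng.integers(0, 2, size=100)   # shape (100,)
theta = np.zeros((3, 1))

# (100,) against (100,1) broadcasts to (100,100): 10000 terms are summed.
cost_1d = compute_cost(X, y, theta)
# (100,1) against (100,1) stays element-wise: 100 terms are summed.
cost_2d = compute_cost(X, y.reshape(-1, 1), theta)
print(cost_1d, cost_2d)   # roughly 69.31 and 0.6931
```

The 1-D labels produce 100 times as many terms, which is exactly the gap between 69.314 and 0.69314.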

Why are dot products backwards in Stanford's cs231n SVM?

I'm watching the YouTube videos of Stanford's cs231n and trying to do the assignments as an exercise. While doing the SVM one I ran into the following piece of code:
def svm_loss_naive(W, X, y, reg):
    """
    Structured SVM loss function, naive implementation (with loops).
    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.
    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape) # initialize the gradient as zero
    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W) # This line
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1 # note delta = 1
            if margin > 0:
                loss += margin
Here's the line I'm having trouble with:
scores = X[i].dot(W)
This is doing the product xW; shouldn't it be Wx? By that I mean W.dot(X[i]).

Because the array shapes are (D, C) and (N, D) for W and X respectively, you can't compute W.dot(X) directly without transposing them both first (they must be (C, D)·(D, N) for matrix multiplication).
Since X.dot(W) == W.T.dot(X.T).T, the implementation simply reverses the order of the dot product instead of taking the transpose of each array. Effectively, this just comes down to a decision about how the inputs are arranged. In this case the (somewhat arbitrary) decision was made to arrange the samples and features in a more intuitive way (one sample per row), at the cost of writing the dot product as x·W rather than W·x.
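
A quick numerical check of that equivalence, using small made-up shapes (N=4, D=5, C=3) rather than anything from the assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 4, 5, 3
X = rng.standard_normal((N, D))   # one sample per row
W = rng.standard_normal((D, C))   # one class per column

scores_rows = X @ W        # (N, C): the "xW" layout used in the assignment
scores_cols = W.T @ X.T    # (C, N): the textbook "Wx" layout

# The two layouts hold the same numbers, just transposed.
print(np.allclose(scores_rows, scores_cols.T))   # True

# For a single sample the result is the same 1-D vector of C scores.
single_xw = X[0].dot(W)
single_wx = W.T.dot(X[0])
print(np.allclose(single_xw, single_wx))         # True
```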

PyTorch Linear layer input dimension mismatch

I'm getting this error when passing the input data to the Linear (fully connected) layer in PyTorch:
matrices expected, got 4D, 2D tensors
I fully understand the problem, since the input data has shape (N,C,H,W) (from a Convolutional+MaxPool layer), where:
N: Data Samples
C: Channels of the data
H,W: Height and Width
Nevertheless, I was expecting PyTorch to do the "reshaping" of the data from:
[ N , D1,...Dn] --> [ N, D] where D = D1*D2*....Dn
I tried to reshape Variable.data, but I've read that this approach is not recommended since the gradients will keep the previous shape, and that in general you should not mutate the shape of Variable.data.
I am pretty sure there is a simple solution that goes along with the framework, but I haven't found it.
Is there a good solution for this?
PS: The fully connected layer has an input size of C * H * W.

After reading some examples I found the solution. Here is how you do it without messing up the forward/backward pass flow:
(_, C, H, W) = x.data.size()
x = x.view( -1 , C * H * W)

A more general solution (would work regardless of how many dimensions x has) is to take the product of all dimension sizes but the first one (the "batch size"):
n_features = np.prod(x.size()[1:])
x = x.view(-1, n_features)

It is common to save the batch size and infer the other dimension in a flatten:
batch_size = x.shape[0]
...
x = x.view(batch_size, -1)
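
Putting the flatten step in context, here is a minimal sketch with made-up sizes (N=8, C=16, H=W=4, 10 output units) that feeds a 4D tensor into a Linear layer. It also shows torch.flatten with start_dim=1 as an equivalent way to keep the batch dimension and merge the rest:

```python
import torch
import torch.nn as nn

N, C, H, W = 8, 16, 4, 4
x = torch.randn(N, C, H, W)       # e.g. the output of a conv + max-pool stage
fc = nn.Linear(C * H * W, 10)     # input size is C * H * W, as in the question

flat = x.view(x.size(0), -1)      # (N, C*H*W); -1 infers the merged dimension
out = fc(flat)                    # (N, 10)

alt = torch.flatten(x, start_dim=1)   # same result without calling view directly
print(flat.shape, out.shape, torch.equal(flat, alt))
```

Both forms are ordinary tensor operations, so gradients flow through them during the backward pass.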
