I am working with a Variational Autoencoder-type model, and part of my loss function is the KL divergence between a normal distribution with mean 0 and variance 1 and another normal distribution whose mean and variance are predicted by my model.
I defined the loss in the following way:
def kl_loss(mean, log_sigma):
    normal = tf.contrib.distributions.MultivariateNormalDiag(
        tf.zeros(mean.get_shape()),
        tf.ones(log_sigma.get_shape()))
    enc_normal = tf.contrib.distributions.MultivariateNormalDiag(
        mean,
        tf.exp(log_sigma),
        validate_args=True,
        allow_nan_stats=False,
        name="encoder_normal")
    kl_div = tf.contrib.distributions.kl_divergence(normal,
                                                    enc_normal,
                                                    allow_nan_stats=False,
                                                    name="kl_divergence")
    return kl_div
The inputs are unconstrained vectors of length N with
log_sigma.get_shape() == mean.get_shape()
Now during training I observe a negative KL divergence after a few thousand iterations, reaching values down to about -10. Below you can see the TensorBoard training curves:
[Figure: KL divergence training curve, with a zoomed-in view]
Now this seems odd to me, as the KL divergence should be non-negative under these conditions. I understand that we require "The K-L divergence is only defined if P and Q both sum to 1 and if Q(i) > 0 for any i such that P(i) > 0." (see https://mathoverflow.net/questions/43849/how-to-ensure-the-non-negativity-of-kullback-leibler-divergence-kld-metric-rela), but I don't see how this could be violated in my case. Any help is highly appreciated!
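For reference, the divergence that kl_loss computes, KL(N(0, I) || N(mu, diag(sigma^2))) with sigma_i = exp(log_sigma_i) as the per-dimension standard deviation, has a standard closed form:

KL = sum over i of ( log(sigma_i) + (1 + mu_i^2) / (2 * sigma_i^2) - 0.5 )

Each summand is minimized at mu_i = 0, sigma_i = 1, where it equals 0, so the exact quantity is never negative; any negative value has to come from how it is evaluated numerically.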
I faced the same problem.
It happens because of the floating-point precision used.
If you notice, the negative values occur close to 0 and are bounded by a small negative value. Adding a small positive value to the loss is a workaround.
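A minimal sketch of that workaround (my own illustration; the clipping and the 1e-6 offset are assumptions, not values from the answer):

kl_div = kl_loss(mean, log_sigma)    # may come out slightly negative due to float error
kl_div = tf.maximum(kl_div, 0.0)     # option 1: clip the numerical noise at zero
# kl_div = kl_div + 1e-6             # option 2: add a small positive offset instead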
resid = df['Actual'] - df['Predicted']
resid_mean = resid.mean()
print(resid_mean)
Output:
250.8173868583906
Is my model predicting values correctly or not?
Linear regression involves minimising the mean squared error (Q) to find the best-fitting slope (a) and intercept (b). That is, Q is minimized at the values of a and b for which ∂Q / ∂a = 0 and ∂Q / ∂b = 0.
The sum of the residuals, and therefore their mean, is always zero for the data that you regressed on; this follows from one of the two conditions above.
So, unless you are checking the residual mean on data not used in training, there appears to be some mistake in the linear regression procedure you employed.
Detailed proof available here: http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
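A quick check of that claim (my own toy example with made-up data, not the asker's df):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y depends linearly on x plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(size=100)

model = LinearRegression().fit(X, y)
resid = y - model.predict(X)   # in-sample residuals
print(resid.mean())            # ~0, up to floating-point error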
I have two tensors (matrices) that I've initialized:
sm = Var(torch.randn(20, 1), requires_grad=True)
sm = torch.mm(sm, sm.t())
freq_m = Var(torch.randn(12, 20), requires_grad=True)
I am creating two lists from the data inside these 2 matrices, and I am using spearmanr to get a correlation value between these 2 lists. How I am creating the lists is not important, but the goal is to adjust the values inside the matrices so that the calculated correlation value is as close to 1 as possible.
If I were to solve this problem manually, I would tweak values in the matrices by .01 (or some small number) each time and recalculate the lists and correlation score. If the new correlation value is higher than the previous one, I would save the 2 matrices and tweak a different value until I get the 2 matrices that give me the highest correlation score possible.
Is PyTorch capable of doing this automatically? I know PyTorch can adjust values based on an equation, but the way I want to adjust the matrix values is not driven by an equation; it's driven by a correlation value that I calculate. Any guidance with this is greatly appreciated!
PyTorch has an autograd package; that means that if you have variables and pass them through differentiable functions to get a scalar result, you can perform gradient descent to update the variables so as to lower (or raise) that scalar result.
So what you need to do is define a function f that works at the tensor level, such that f(sm, freq_m) gives you the desired correlation.
Then, you should do something like:
lr = 1e-3
for i in range(100):
    # 100 updates
    loss = 1 - f(sm, freq_m)
    print(loss)
    loss.backward()
    with torch.no_grad():
        sm -= lr * sm.grad
        freq_m -= lr * freq_m.grad
        # Manually zero the gradients after updating weights
        sm.grad.zero_()
        freq_m.grad.zero_()
The learning rate is basically the size of the step you take: a learning rate that is too high will cause the loss to explode, and one that is too low will cause slow convergence, so I suggest you experiment.
Edit: to answer the comment on loss.backward: for any differentiable function f, f is a function of multiple tensors t1, ..., tn with requires_grad=True; as a result, you can calculate the gradient of the loss with respect to each of those tensors. When you call loss.backward(), it calculates those gradients and stores them in t1.grad, ..., tn.grad. Then you update t1, ..., tn using gradient descent in order to lower the value of f. This update doesn't need a computational graph, which is why you use with torch.no_grad().
At the end of the loop, you zero the gradients because .backward() doesn't overwrite the gradients but rather adds the new gradients to them. More on that here: https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903
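For concreteness, here is one possible f (my own sketch, not from the answer above): a differentiable Pearson correlation used as a stand-in, since the ranking step in Spearman's correlation is not differentiable. The flatten-and-truncate pairing is purely illustrative; in the real problem the two lists come from the question's own construction.

import torch

def f(sm, freq_m):
    # Pair up entries of the two matrices (illustrative only)
    a, b = sm.flatten(), freq_m.flatten()
    n = min(a.numel(), b.numel())
    a, b = a[:n] - a[:n].mean(), b[:n] - b[:n].mean()
    # Differentiable Pearson correlation; the epsilon avoids division by zero
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)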
I'm currently training a WGAN in Keras with the (approximate) Wasserstein loss below:
def wasserstein_loss(y_true, y_pred):
    return K.mean(y_true * y_pred)
However, this loss can obviously be negative, which is weird to me.
I trained the WGAN for 200 epochs and got the critic Wasserstein loss training curve below.
The above loss is calculated by
d_loss_valid = critic.train_on_batch(real, np.ones((batch_size, 1)))
d_loss_fake = critic.train_on_batch(fake, -np.ones((batch_size, 1)))
d_loss, _ = 0.5*np.add(d_loss_valid, d_loss_fake)
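To spell out what those three lines compute (my own algebra; it assumes train_on_batch returns the value of wasserstein_loss for the given labels and ignores that the critic's weights change slightly between the two calls):

d_loss = 0.5 * [ mean(+1 * D(real)) + mean(-1 * D(fake)) ] = 0.5 * [ mean(D(real)) - mean(D(fake)) ]

That difference of two means is a (scaled) estimate of the Wasserstein distance, up to the sign convention chosen for the labels, and nothing forces it to stay positive during training.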
The resulting generated sample quality is great, so I think I trained the WGAN correctly. However, I still cannot understand why the Wasserstein loss can be negative while the model still works. According to the original WGAN paper, the Wasserstein loss can be used as a performance indicator for a GAN, so how should we interpret it? Am I misunderstanding anything?
The Wasserstein loss is a measurement of the Earth Mover's distance, which is a difference between two probability distributions. In TensorFlow it is implemented as d_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real), which can obviously give a negative number if d_fake moves too far to the other side of the d_real distribution. You can see it in your plot, where during training your real and fake distributions change sides until they converge around zero. So as a performance measurement you can use it to see how far the generator is from the real data and on which side it is now.
See the distributions plot:
P.S. it's crossentropy loss, not Wasserstein.
Perhaps this article can help you more, if you haven't read it yet. However, the other question is how the optimizer can minimize the negative loss (to zero).
It looks like I cannot comment on the answer given by Sergeiy Isakov because I do not have enough reputation. I wanted to comment because I think that information is not correct.
In principle, the Wasserstein distance cannot be negative, because a distance metric cannot be negative. The actual expression (the dual form) for the Wasserstein distance involves a supremum over all 1-Lipschitz functions (you can look it up on the web). Since it is a supremum, we always take the Lipschitz function that gives the largest value in order to obtain the Wasserstein distance. However, the Wasserstein value we compute with a WGAN is just an estimate and not really the true Wasserstein distance. If the number of inner iterations of the critic is low, it may not have enough iterations to move to a positive value.
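For reference, the dual (Kantorovich-Rubinstein) form referred to above can be written as

W(P_r, P_g) = sup over all 1-Lipschitz f of ( E_{x ~ P_r}[f(x)] - E_{x ~ P_g}[f(x)] )

where the WGAN critic is a parametric approximation of the maximizing f.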
Thought experiment: if we suppose that we obtain a Wasserstein estimate that is negative, we can always negate the critic function to make the estimate positive. That means there exists a Lipschitz function giving a positive value that is larger than the one giving the negative value. So the Wasserstein distance itself cannot be negative, as by definition we take the supremum over all 1-Lipschitz functions; a negative value only means the critic has not reached that supremum.
I defined a custom loss function in Keras (TensorFlow backend) that is composed of the reconstruction MSE and the Kullback-Leibler divergence between the learned probability distribution and a standard normal distribution. (It is for a variational autoencoder.)
I want to be able to slowly increase how much the cost is affected by the KL divergence term during training, with a weight called "reg", starting at reg = 0.0 and increasing until it reaches 1.0. I would like the rate of increase to be tuned as a hyperparameter. (As of now, I just have the "reg" parameter set constant at 0.5.)
Is there functionality in Keras to do this?
def vae_loss(y_true, y_pred):
    reg = 0.5
    # Average cosine distance for all words in a sequence
    reconstruction_loss = tf.reduce_mean(mean_squared_error(y_true, y_pred), 1)
    # Second part of the loss ensures the z probability distribution doesn't stray too far from normal
    KL_divergence_loss = tf.reduce_mean(tf.log(z_sigma) + tf.div((1 + tf.square(z_mu)), 2 * tf.square(z_sigma)) - 0.5, 1)
    loss = reconstruction_loss + tf.multiply(reg, KL_divergence_loss)
    return loss
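One common approach (a sketch of my own, not a built-in Keras feature; KLWeightAnnealer and reg_step are made-up names) is to keep reg in a Keras backend variable and increase it with a callback at the end of each epoch:

from keras import backend as K
from keras.callbacks import Callback

reg = K.variable(0.0)   # use this variable inside vae_loss instead of the constant 0.5

class KLWeightAnnealer(Callback):
    def __init__(self, reg, reg_step=0.05):
        super(KLWeightAnnealer, self).__init__()
        self.reg = reg
        self.reg_step = reg_step   # annealing rate, tunable as a hyperparameter

    def on_epoch_end(self, epoch, logs=None):
        # Raise the KL weight toward 1.0, then keep it there
        K.set_value(self.reg, min(1.0, K.get_value(self.reg) + self.reg_step))

# model.fit(x, y, epochs=..., callbacks=[KLWeightAnnealer(reg, reg_step=0.05)])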
I have two datasets that contain 40000 samples. I want to calculate the Kullback-Leibler divergence between these two datasets in python. Is there any efficient way of doing this in python?
Edit:
OK, I figured out that it doesn't work in the input space, so the old explanation below is probably wrong, but I'll keep it anyway.
Here are my new thoughts:
In my senior project, I'm using the algorithm called AugMix. In this algorithm they calculate the Jensen-Shannon divergence between two augmented images, which is a symmetrized form of the KL divergence.
They used the model output as the probability distribution of the dataset. The idea is to fit a model to a dataset and then interpret the output of the model as the probability density function.
For example, suppose you fitted a dataset without overfitting. Then (assuming this is a classification problem) you feed your logits (the output of the last layer) to the softmax function for each class (sometimes the softmax function is added as a layer at the end of the network, so be careful). The output of your softmax function (or layer) can be interpreted as P(Y|X_{1}), where X_{1} is the input sample and Y is the ground-truth class. Then you make a prediction for another sample X_{2}, giving P(Y|X_{2}), where X_{1} and X_{2} come from different datasets (say dataset_1 and dataset_2) and the model is not trained on either of those datasets.
Then the KL divergence between dataset_1 and dataset_2 can be calculated by KL(dataset_1 || dataset_2) = P(Y|X_{1}) * log(P(Y|X_{1}) / P(Y|X_{2})).
Make sure that X_{1} and X_{2} belong to the same class.
I'm not sure if this is the correct way. Alternatively, you can train two different models (model_1 and model_2) using different datasets (dataset_1 and dataset_2) and then calculate the KL divergence on the predictions of those two models using the samples of a third dataset, dataset_3. In other words:
KL(dataset_1 || dataset_2) = sum over x in dataset_3 of model_1(x) * log(model_1(x) / model_2(x))
where model_1(x) is the softmax output of model_1 (trained on dataset_1 without overfitting) for the correct label.
The latter sounds more reasonable to me, but I'm not sure about either of them. I could not find a proper answer on my own.
The things I'm going to explain below are adapted from Jason Brownlee's blog post on the KL divergence at machinelearningmastery.com.
As far as I understood, you first have to convert your datasets into probability distributions so that you can calculate the probability of each of the samples from the union (or intersection?) of the two datasets:
KL(P || Q) = sum over x in X of P(x) * log(P(x) / Q(x))
However, most of the time the intersection of the datasets is empty. For example, if you want to measure the divergence between CIFAR10 and ImageNet, there are no samples in common. The only way you can calculate this metric is to sample from the same dataset to create two different datasets, so that some samples are present in both and the KL divergence can be calculated.
Lastly, maybe you want to check the Wasserstein divergence, which is used in GANs to compare the source distribution and the target distribution.
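As a concrete illustration of the histogram/probability-distribution approach described above, here is a hedged sketch (my own code, assuming the samples are scalar values; higher-dimensional data would need multi-dimensional binning or a different estimator):

import numpy as np

def kl_from_samples(samples_p, samples_q, bins=100, eps=1e-10):
    # Bin both sample sets on a shared grid so the histograms are comparable
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    p, _ = np.histogram(samples_p, bins=bins, range=(lo, hi))
    q, _ = np.histogram(samples_q, bins=bins, range=(lo, hi))
    # Normalize to probabilities; eps avoids log(0) and division by zero
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

# kl = kl_from_samples(np.asarray(dataset_1), np.asarray(dataset_2))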