Calculating the KL divergence term for VAE loss - python

I am interested in calculating the KL divergence term for a VAE loss function.
I have seen many examples using some facsimile of the following keras code:
kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
However, I have also found another method (which uses the torch library) of calculating the KL divergence, as shown by this code (I'll call this the latter example):
kl = kl_divergence(pred_result['latent_dist'], Normal(0, 1)).mean(dim=0).sum()
In the latter example, pred_result['latent_dist'] is a distribution object describing normal distributions parameterised by two layers: a mean and a standard deviation (st_dev rather than log_var, in this case).
Normal(0,1) is also a distribution object, representing a normal distribution with mean = 0 and standard deviation = 1.
My question is this: Is the latter example a legitimate/correct way of calculating the KL divergence term for a VAE?
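
As a side note, the two formulations can be checked against each other numerically: torch's kl_divergence between two Normal objects evaluates the analytic closed form, which matches the Keras expression once the standard-deviation parameterisation is rewritten in terms of log-variance. A minimal sketch (shapes are illustrative):

import torch
from torch.distributions import Normal, kl_divergence

# Toy posterior parameters: batch of 4 samples, latent dimension 3
z_mean = torch.randn(4, 3)
z_std = torch.rand(4, 3) + 0.1   # st_dev parameterisation, as in the question

latent_dist = Normal(z_mean, z_std)
prior = Normal(0.0, 1.0)         # broadcasts over the batch

# torch's analytic KL per element, summed over the latent dimension
kl_torch = kl_divergence(latent_dist, prior).sum(dim=-1)

# The closed form from the Keras snippet, rewritten in terms of log-variance
z_log_var = 2 * torch.log(z_std)
kl_manual = -0.5 * torch.sum(1 + z_log_var - z_mean ** 2 - z_log_var.exp(), dim=-1)

print(torch.allclose(kl_torch, kl_manual))  # True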

Related

Sampling from beta distribution in a Neural Network

I ask myself this question after reading about Variational Autoencoders, where the bottleneck of the model produces a mean m and a standard deviation u. Then, from a standard normal distribution X = N(0, 1), the VAE computes the latent vector v = X*u + m, which follows a N(m, u^2) distribution and allows the gradient to propagate (the reparameterisation trick).
I want to do the same with a beta distribution (so with parameters a and b). How is it possible to sample from a beta distribution while allowing the gradient to propagate? (Otherwise I could simply use the tfp.distributions.Beta function, but the gradient wouldn't propagate.)
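
No answer is shown here, but one known technique for this is implicit reparameterisation gradients (Figurnov et al., 2018). As a sketch of the idea in PyTorch (rather than tfp, which the question assumes), torch.distributions.Beta implements rsample(), a reparameterised draw through which gradients flow back to the concentration parameters:

import torch
from torch.distributions import Beta

# Beta parameters a and b, as in the question; requires_grad so we can inspect gradients
a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([5.0], requires_grad=True)

# rsample() keeps the sample on the autograd graph
sample = Beta(a, b).rsample()

loss = (sample - 0.3) ** 2   # any downstream loss
loss.backward()

print(a.grad, b.grad)        # both populated: the gradient propagated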

why reconstruction loss function multiplied by constant in VAE?

I am trying to understand the best way to use autoencoder loss functions.
The common starting point is that the loss consists of a KL term and a reconstruction term.
What really confuses me is that the reconstruction loss is multiplied by some large constant, for example
xent_loss = self.original_dim * metrics.binary_crossentropy(self.x, x_decoded_mean)
kl_loss = - 0.5 * K.sum(1 + self.z_log_var - K.square(self.z_mean) - K.exp(self.z_log_var), axis=-1)
vae_loss = K.mean(xent_loss + kl_loss)
from https://github.com/mattiacampana/Autoencoders/blob/master/models/vae.py
or from https://towardsdatascience.com/variational-autoencoders-as-generative-models-with-keras-e0c79415a7eb
Could you explain this? Also, what I don't understand is that sometimes they use reduce_mean and sometimes reduce_sum as the aggregate function. Is there some difference between these? Would it be painful, for example, to use reduce_sum in the KL term and reduce_mean in the reconstruction term?
Thanks for any response
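
No answer appears in this excerpt, but the constant itself is straightforward to account for: Keras's metrics.binary_crossentropy returns the mean over the last axis, so multiplying by original_dim recovers the per-sample sum over pixels, which puts the reconstruction term on the same scale as the summed KL term. A minimal sketch of that identity (shapes are illustrative):

import numpy as np

original_dim = 784                        # e.g. flattened 28x28 images
x = np.random.rand(8, original_dim)       # "true" pixels in (0, 1)
x_hat = np.random.rand(8, original_dim)   # reconstructed pixels

# Per-pixel binary cross entropy
bce = -(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

# The mean over the last axis times original_dim equals the sum over pixels
print(np.allclose(original_dim * bce.mean(axis=-1), bce.sum(axis=-1)))  # True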

Keras Categorical Cross Entropy

I'm trying to wrap my head around the categorical cross entropy loss. Looking at the implementation of the cross entropy loss in Keras:
# scale preds so that the class probas of each sample sum to 1
output = output / math_ops.reduce_sum(output, axis, True)
# Compute cross entropy from probabilities.
epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
return -math_ops.reduce_sum(target * math_ops.log(output), axis)
I do not see where delta = output - target is calculated.
See here.
What am I missing?
I think you might be confusing two different concepts / events here.
The categorical cross entropy loss is a measure of the error of your model, as calculated by:
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
<etc>
This just returns an array of losses, one per sample; it is a measure of the difference between the true label and what your model thinks the label should be.
The next step after calculating the loss (part of the forward propagation phase) is to then start backpropagation, i.e. we want to find the influence that each weight/bias matrix has on the loss you've calculated above, so that we can perform the update step.
The first step is then to calculate dL/dz, i.e. the derivative of the loss function with respect to the linear function (z = Wx + b), which by the chain rule is dL/da * da/dz (i.e. the derivative of the loss w.r.t. the activation, times the derivative of the activation w.r.t. the linear function).
The link you posted is the derivative of the activation function w.r.t. the linear function. This blog does a decent job of explaining how all the parts fit together; the activation function they use is a sigmoid, but the overall pieces fit together the same way.
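
To make the connection concrete: for a softmax activation followed by categorical cross entropy, the combined derivative dL/dz collapses to output - target, which is exactly the delta being asked about. A quick numerical check (values are illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, 2.0, 0.5, -1.0])   # logits for one sample, 4 classes
y = np.array([0.0, 1.0, 0.0, 0.0])    # one-hot target

# Analytic gradient of softmax + cross entropy w.r.t. the logits
grad_analytic = softmax(z) - y

# Central-difference check
eps = 1e-6
grad_numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    grad_numeric[i] = (ce_loss(zp, y) - ce_loss(zm, y)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True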

How to understand the loss function in scikit-learn logistic regression code?

The code for the loss function in scikit-learn logistic regression is:
# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)
However, it seems to be different from the common form of the logarithmic loss function, which reads:
-(y log(p) + (1-y) log(1-p))
(please see http://wiki.fast.ai/index.php/Log_Loss)
Could anyone tell me how to understand the code for the loss function in scikit-learn logistic regression, and what the relation is between it and the general form of the logarithmic loss function?
Thank you in advance.
First you should note that 0.5 * alpha * np.dot(w, w) is just an L2 regularisation term. Setting that aside, sklearn logistic regression reduces to the following
-np.sum(sample_weight * log_logistic(yz))
Also, the np.sum is there because it considers multiple samples, so it again reduces to
sample_weight * log_logistic(yz)
Finally, if you read HERE, you will note that sample_weight is an optional array of weights assigned to individual samples; if not provided, each sample is given unit weight. So it can be taken as one (in the original definition of the cross entropy loss we do not consider unequal weights for different samples), and hence the loss reduces to:
- log_logistic(yz)
which is equivalent to
- log_logistic(y * np.dot(X, w)).
Now, why does it look different (when in essence it is the same) from the cross entropy loss function, i.e.:
-[y log(p) + (1-y) log(1-p)]
The reason is that we can use either of two different labeling conventions for binary classification, {0, 1} or {-1, 1}, which results in the two different representations. But they are the same!
More details (on why they are the same) can be found HERE. Note that you should read the response by Manuel Morales.
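
The equivalence of the two conventions is also easy to verify numerically. With p = sigmoid(z), the {-1, 1} form -log(sigmoid(y * z)) and the {0, 1} form -[y log(p) + (1-y) log(1-p)] give identical values (log_logistic is just the log of the sigmoid):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(size=5)             # decision values, z = X.dot(w)
y01 = rng.integers(0, 2, size=5)   # labels in {0, 1}
ypm = 2 * y01 - 1                  # the same labels mapped to {-1, 1}

# sklearn-style loss: -log_logistic(y * z) with y in {-1, 1}
loss_sklearn = -np.log(sigmoid(ypm * z))

# Textbook log loss with y in {0, 1}
p = sigmoid(z)
loss_textbook = -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

print(np.allclose(loss_sklearn, loss_textbook))  # True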

Gradually update weights of custom loss in Keras during training

I defined a custom loss function in Keras (tensorflow backend) that is comprised of reconstruction MSE and the kullback leibler divergence between the learned probability distribution and a standard normal distribution. (It is for a variational autoencoder.)
I want to be able to slowly increase how much the cost is affected by the KL divergence term during training, with a weight called "reg" that starts at reg=0.0 and increases until it reaches 1.0. I would like the rate of increase to be tuned as a hyperparameter. (As of now, I just have the "reg" parameter set constant at 0.5.)
Is there functionality in Keras to do this?
def vae_loss(y_true, y_pred):
    reg = 0.5
    # Reconstruction term: mean squared error, averaged over the sequence dimension
    reconstruction_loss = tf.reduce_mean(mean_squared_error(y_true, y_pred), 1)
    # Second part of the loss ensures the z probability distribution doesn't stray too far from normal
    KL_divergence_loss = tf.reduce_mean(tf.log(z_sigma) + tf.div((1 + tf.square(z_mu)), 2 * tf.square(z_sigma)) - 0.5, 1)
    loss = reconstruction_loss + tf.multiply(reg, KL_divergence_loss)
    return loss
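
No answer is shown here, but one common pattern (a sketch, not from this thread; the KLAnnealing name and anneal_epochs hyperparameter are illustrative) is to hold reg in a backend variable and update it from a callback at the start of each epoch:

from tensorflow import keras
import tensorflow.keras.backend as K

# Keep the weight in a backend variable so the loss graph reads its current value
reg = K.variable(0.0)

class KLAnnealing(keras.callbacks.Callback):
    # Linearly increases `weight` from 0.0 to 1.0 over `anneal_epochs` epochs
    def __init__(self, weight, anneal_epochs=10):
        super().__init__()
        self.weight = weight
        self.anneal_epochs = anneal_epochs   # the tunable rate hyperparameter

    def on_epoch_begin(self, epoch, logs=None):
        K.set_value(self.weight, min(1.0, epoch / self.anneal_epochs))

# In vae_loss, use:  loss = reconstruction_loss + reg * KL_divergence_loss
# Then train with:   model.fit(x, y, epochs=50, callbacks=[KLAnnealing(reg)])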
