I ask myself this question after reading about Variational Autoencoders, where the bottleneck of the model produces a mean m and a standard deviation u. Then, using a sample X drawn from a standard normal distribution N(0, 1), the VAE computes the latent vector v = m + u*X, which follows an N(m, u^2) distribution and allows the gradient to propagate (the reparameterization trick).
I want to do the same with a beta distribution (so with parameters a and b). How is it possible to sample from a beta distribution while still allowing the gradient to propagate? (Otherwise I could simply use tfp.distributions.Beta, but then the gradient wouldn't propagate ...)
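For what it's worth, one direction I have been looking at is the Kumaraswamy distribution, which is close to a Beta(a, b) and has a closed-form inverse CDF, so a uniform sample can be transformed while keeping gradients with respect to a and b (this is what stick-breaking VAE papers do). A rough TensorFlow sketch of what I mean (the function name is mine, and I assume a and b are positive tensors produced by the encoder, e.g. via softplus):

import tensorflow as tf

def sample_kumaraswamy(a, b, eps=1e-6):
    # a, b: positive tensors from the encoder, same shape.
    u = tf.random.uniform(tf.shape(a), minval=eps, maxval=1.0 - eps)
    # Inverse CDF of Kumaraswamy(a, b); gradients flow through a and b.
    return tf.pow(1.0 - tf.pow(1.0 - u, 1.0 / b), 1.0 / a)

It may also be worth checking the reparameterization_type attribute of tfp.distributions.Beta in your TFP version; newer versions implement implicit reparameterization gradients, in which case sampling it directly does propagate gradients.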
I am interested in calculating the KL divergence term for a VAE loss function.
I have seen many examples using some facsimile of the following keras code:
kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
However, I have also found another method (which uses the torch library) of calculating the KL divergence, as shown by this code (I'll call this the latter example):
kl = kl_divergence(pred_result['latent_dist'], Normal(0, 1)).mean(dim=0).sum()
In the latter example, pred_result['latent_dist'] is a distribution object describing normal distributions parameterised by two layers: mean and st_dev (not log_var, in this case).
Normal(0,1) is also a distribution object, representing a normal distribution with mean = 0 and standard deviation = 1.
My question is this: Is the latter example a legitimate/correct way of calculating the KL divergence term for a VAE?
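One way I thought of to convince myself is to compare the closed-form Keras expression with torch.distributions.kl_divergence on the same numbers (z_mean and z_st_dev below are just random placeholders standing in for the model's outputs):

import torch
from torch.distributions import Normal, kl_divergence

z_mean = torch.randn(8, 4)
z_st_dev = torch.rand(8, 4) + 0.1  # keep standard deviations positive

# KL between the predicted Gaussians and the standard normal prior.
analytic = kl_divergence(Normal(z_mean, z_st_dev), Normal(0.0, 1.0))  # shape (8, 4)

# Closed-form expression from the Keras snippet, rewritten with log-variance.
z_log_var = 2.0 * torch.log(z_st_dev)
closed_form = -0.5 * (1 + z_log_var - z_mean.pow(2) - z_log_var.exp())

print(torch.allclose(analytic, closed_form, atol=1e-6))  # prints True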
I'm currently reading papers on Variational Autoencoders (VAE). According to this article (http://proceedings.mlr.press/v95/guo18a/guo18a.pdf):
By fitting the input data sample x^(i) into the Gaussian distribution with the reconstructed mean vector and the reconstructed standard deviation vector, we can get the corresponding reconstruction probability N(x^(i) | μ_x̂(i,l), σ_x̂(i,l)) of the l-th generated latent vector.
Basically, I don't understand what "fitting the input data sample x(i) into the Gaussian distribution" means.
I assume we first build a multivariate Gaussian distribution with the vectors μ_x̂(i,l) and σ_x̂(i,l). But then, what does it mean to fit a vector into that distribution?
Also, does the result of that calculation correspond to what we call the likelihood then?
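To make my assumption concrete, here is a small sketch with made-up numbers of what I think is meant (evaluating the density of x(i) under the decoded Gaussian):

import numpy as np
from scipy.stats import norm

# Made-up decoded parameters for one latent draw l (diagonal Gaussian) and one input x(i).
mu_hat = np.array([0.1, -0.3, 0.7])
sigma_hat = np.array([0.5, 0.4, 0.9])
x_i = np.array([0.0, -0.2, 1.0])

# "Fitting x(i) into the Gaussian" = evaluating its log-density under N(mu_hat, sigma_hat^2)?
log_likelihood = norm.logpdf(x_i, loc=mu_hat, scale=sigma_hat).sum()
print(log_likelihood)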
Many thanks,
Guillaume
I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that changes the representation of my data (i.e. a map f(x) on my input data x, parameterized by some weights W). The second is a Gaussian mixture model operating on the f(x) points, i.e. in the neural network's representation space (rather than clustering points in the input space). I am optimizing the GMM using expectation maximization, so the parameter updates are derived analytically rather than via gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (i.e. how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since the network depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)
# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))
for i in range(1000):
    net_optim.zero_grad()
    pairs, pair_labels = pairGenerator(data, labels)  # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])  # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)  # loss function based on pairwise dist.
    embedding = net.forward_one(data)  # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)
    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
1. Sample some pairs of points from the dataset
2. Push the pairs of points through the NN and compute the network loss based on those outputs
3. Embed all datapoints using the NN and perform a clustering EM step in that embedding space
4. Compute the variational loss (ELBO) based on the clustering parameters
5. Update the neural network parameters using both the variational loss and the network loss
However, to perform step (5), I am required to add the flag retain_graph=True; otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this: with retain_graph=True, by around iteration 400 each iteration takes ~30 minutes to complete. Does anyone know how I might fix this? I apologize in advance; I am still very new to automatic differentiation.
I would recommend doing
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss with respect to the GMM parameters is 0, so summing the losses won't have any adverse effect.
Here is a good thread on the PyTorch forums regarding retain_graph: https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24
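If the EM-updated tensors (means, precisions, etc.) still carry history from a previous iteration's embedding, you may also need to detach them at the end of each iteration; here is a rough sketch of the loop with a single backward pass, reusing the helper names from your code (and assuming the EM helpers return tensors):

for i in range(1000):
    net_optim.zero_grad()
    pairs, pair_labels = pairGenerator(data, labels)
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])
    net_loss = NeuralNetworkLoss(outputs, pair_labels)

    embedding = net.forward_one(data)
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)

    # Single backward pass over the summed losses; the graph is freed normally afterwards.
    (net_loss + gmm_loss).backward()
    net_optim.step()

    # Detach the EM outputs so the next iteration's graph does not reach back into this one.
    concentrations, means, covariances, precisions = (
        t.detach() for t in (concentrations, means, covariances, precisions))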
I defined a custom loss function in Keras (TensorFlow backend) that is composed of the reconstruction MSE and the Kullback-Leibler divergence between the learned probability distribution and a standard normal distribution. (It is for a variational autoencoder.)
I want to be able to slowly increase how much the cost is affected by the KL divergence term during training, with a weight called "reg", starting at reg=0.0 and increasing until it reaches 1.0. I would like the rate of increase to be tunable as a hyperparameter. (As of now, I just have the "reg" parameter set constant at 0.5.)
Is there functionality in Keras to do this?
def vae_loss(y_true, y_pred):
    reg = 0.5
    # Average cosine distance for all words in a sequence
    reconstruction_loss = tf.reduce_mean(mean_squared_error(y_true, y_pred), 1)
    # Second part of the loss ensures the z probability distribution doesn't stray too far from normal
    KL_divergence_loss = tf.reduce_mean(tf.log(z_sigma) + tf.div((1 + tf.square(z_mu)), 2 * tf.square(z_sigma)) - 0.5, 1)
    loss = reconstruction_loss + tf.multiply(reg, KL_divergence_loss)
    return loss
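One pattern I have been considering (not sure whether this is the intended Keras way; KLAnnealing and anneal_epochs are just names I made up) is to keep reg in a backend variable and ramp it up from a callback between epochs:

from tensorflow import keras
import tensorflow.keras.backend as K

# KL weight as a backend variable so a callback can change it during training.
reg = K.variable(0.0)

class KLAnnealing(keras.callbacks.Callback):
    # Linearly ramp reg from 0.0 to 1.0 over anneal_epochs epochs.
    def __init__(self, anneal_epochs=10):
        super().__init__()
        self.anneal_epochs = anneal_epochs  # the tunable rate of increase

    def on_epoch_begin(self, epoch, logs=None):
        K.set_value(reg, min(1.0, epoch / float(self.anneal_epochs)))

# vae_loss would then use the shared variable instead of the constant reg = 0.5:
#     loss = reconstruction_loss + reg * KL_divergence_loss
# and the callback would be passed to fit():
#     model.fit(x_train, x_train, epochs=50, callbacks=[KLAnnealing(anneal_epochs=10)])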
I am working with a Variational Autoencoder-type model, and part of my loss function is the KL divergence between a normal distribution with mean 0 and variance 1 and another normal distribution whose mean and variance are predicted by my model.
I defined the loss in the following way:
def kl_loss(mean, log_sigma):
    normal = tf.contrib.distributions.MultivariateNormalDiag(tf.zeros(mean.get_shape()),
                                                             tf.ones(log_sigma.get_shape()))
    enc_normal = tf.contrib.distributions.MultivariateNormalDiag(mean,
                                                                 tf.exp(log_sigma),
                                                                 validate_args=True,
                                                                 allow_nan_stats=False,
                                                                 name="encoder_normal")
    kl_div = tf.contrib.distributions.kl_divergence(normal,
                                                    enc_normal,
                                                    allow_nan_stats=False,
                                                    name="kl_divergence")
    return kl_div
The inputs are unconstrained vectors of length N with
log_sigma.get_shape() == mean.get_shape()
Now during training I observe a negative KL divergence after a few thousand iterations, reaching values as low as -10. Below you can see the TensorBoard training curves:
KL divergence curve
Zoom in of KL divergence curve
Now this seems odd to me, as the KL divergence should be non-negative whenever it is defined. I understand that we require "The K-L divergence is only defined if P and Q both sum to 1 and if Q(i) > 0 for any i such that P(i) > 0." (see https://mathoverflow.net/questions/43849/how-to-ensure-the-non-negativity-of-kullback-leibler-divergence-kld-metric-rela), but I don't see how this could be violated in my case. Any help is highly appreciated!
I faced the same problem.
It happened because of the limited float precision used.
If you look closely, the negative values occur close to 0 and are bounded by a small negative value. Adding a small positive value to the loss is a workaround.
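For example (the epsilon value is arbitrary):

kl_div = kl_div + 1e-6               # nudge the logged value away from the negative side
# or clamp it instead: kl_div = tf.maximum(kl_div, 0.0)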