I'm trying to understand the best way to use autoencoder loss functions.
The common setup is that the loss function consists of a KL loss plus a reconstruction loss.
What really confuses me is that the reconstruction loss is multiplied by some large constant, for example
xent_loss = self.original_dim * metrics.binary_crossentropy(self.x, x_decoded_mean)
kl_loss = - 0.5 * K.sum(1 + self.z_log_var - K.square(self.z_mean) - K.exp(self.z_log_var), axis=-1)
vae_loss = K.mean(xent_loss + kl_loss)
from https://github.com/mattiacampana/Autoencoders/blob/master/models/vae.py
or from https://towardsdatascience.com/variational-autoencoders-as-generative-models-with-keras-e0c79415a7eb
Could you explain this? Another thing I don't understand is that sometimes they use reduce_mean and sometimes reduce_sum as the aggregation. Is there a meaningful difference between the two? Would it be harmful, for example, to use reduce_sum in the KL term and reduce_mean in the reconstruction term?
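To make concrete what I mean by the two aggregations, here is a tiny sketch of my own (not from the linked code) showing that summing vs. averaging over the same axis only differ by a constant factor, which changes how heavily the KL term weighs against the reconstruction term:

import numpy as np

# made-up per-dimension KL values for a single sample, latent_dim = 4
per_dim_kl = np.array([0.2, 0.5, 0.1, 0.4])

kl_sum = per_dim_kl.sum()    # what K.sum(..., axis=-1) gives
kl_mean = per_dim_kl.mean()  # what a reduce_mean over the same axis gives

# kl_sum == latent_dim * kl_mean, so switching between them rescales the KL term
print(kl_sum, kl_mean, kl_sum / kl_mean)  # the ratio is exactly latent_dim = 4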
Thanks for any response
Related
I am interested in calculating the KL divergence term for a VAE loss function.
I have seen many examples using some facsimile of the following keras code:
kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
However, I have also found another method (which uses the torch library) of calculating the KL divergence, as shown by this code (I'll call this the latter example):
kl = kl_divergence(pred_result['latent_dist'], Normal(0, 1)).mean(dim=0).sum()
In the latter example, pred_result['latent_dist'] is a distribution object describing normal distributions parameterised by two layers: mean and st_dev (not log_var, in this case).
Normal(0,1) is also a distribution object, representing a normal distribution with mean = 0 and standard deviation = 1.
My question is this: Is the latter example a legitimate/correct way of calculating the KL divergence term for a VAE?
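For reference, this is how I would compare the two on the same parameters (my own untested sketch; the variable names are mine):

import torch
from torch.distributions import Normal, kl_divergence

# made-up encoder outputs: batch of 3 samples, 2-dimensional latent space
mu = torch.randn(3, 2)
log_var = torch.randn(3, 2)
std = torch.exp(0.5 * log_var)

# analytic formula, as in the Keras example: one KL value per sample
kl_analytic = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)

# distribution-based version: elementwise KL against a standard normal, summed over the latent dimension
kl_dist = kl_divergence(Normal(mu, std), Normal(0.0, 1.0)).sum(dim=-1)

print(torch.allclose(kl_analytic, kl_dist, atol=1e-6))  # I would expect this to print True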
I have a question about a fairly complex loss function.
It is a variational autoencoder loss made of two reconstruction losses, a KL divergence term, and a discriminator acting as a regularizer. All of these terms are on the same scale, but I have found that increasing one of the reconstruction losses by a factor of 20 (while leaving the rest on the previous scale) heavily increases the performance of my model.
Since I am still fairly new to DL, I don't completely understand why this happens, or how I could identify this sort of thing in future models.
Any advice/explanation is greatly appreciated.
To summarize your setting first:
loss = alpha1 * loss1 + alpha2 * loss2
When computing the gradients for backpropagation, we propagate back through this formula. Backpropagating through our error function gives us the gradient:
dError/dLoss
To continue our propagation downwards, we now want to compute dError/dLoss1 and dError/dLoss2.
dError/dLoss1 can be expanded to dError/dLoss * dLoss/dLoss1 via the chain rule (https://en.wikipedia.org/wiki/Chain_rule).
We already computed dError/dLoss, so we only need the derivative of Loss with respect to Loss1, which is
dLoss/dLoss1 = alpha1
The backpropagation now continues until we reach our weights (dLoss1/dWeight). The gradient our weight receives is:
dError/dWeight = dError/dLoss * dLoss/dLoss1 * dLoss1/dWeight = dError/dLoss * alpha1 * dLoss1/dWeight
As you can see, the gradient used to update our weight does now depend on alpha1, the factor we use to scale Loss1.
If we increase alpha1 while keeping alpha2 fixed, the gradients coming from Loss1 will have a larger impact than the gradients of Loss2, thereby changing the optimization of our model.
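As a small illustration (a self-contained sketch with made-up tensors, not your actual model), you can verify this scaling directly with autograd:

import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

alpha1, alpha2 = 1.0, 1.0           # try alpha1 = 20.0 and compare the gradient
loss1 = (w * x - 1.0) ** 2          # stand-in for a reconstruction loss
loss2 = w ** 2                      # stand-in for a second loss term

loss = alpha1 * loss1 + alpha2 * loss2
loss.backward()

# dError/dWeight = alpha1 * dLoss1/dWeight + alpha2 * dLoss2/dWeight
# with alpha1 = 1:  2*(w*x - 1)*x + 2*w = 30 + 4 = 34
# with alpha1 = 20: 20*30 + 4 = 604, i.e. Loss1's contribution is scaled by 20
print(w.grad)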
I have some doubts about the implementation of the variational autoencoder loss. This is the one I've been using so far:
def vae_loss(recon_loss, mu, logvar):
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return recon_loss + KLD
After noticing problems in my loss convergence, even on simple tasks of reconstructing 1-d vectors, I started googling around and found a variation of this:
def vae_loss(recon_loss, mu, logvar):
    KLD = torch.mean(-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1), dim=0)
    return recon_loss + KLD
With this second VAE loss the performance improves noticeably, and the improvement is clearly visible in the reconstructed vectors as well. What I'm doing with the second implementation, I guess, is taking the average over the batch of samples.
My confusion arises from the fact that I've found these two separate implementations on different blog posts and I don't know which one is correct.
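To be concrete about what the two versions actually return, this is the kind of check I would run to see the difference (my own sketch with made-up shapes):

import torch

mu = torch.randn(8, 16)       # batch of 8 samples, latent dimension 16 (made-up shapes)
logvar = torch.randn(8, 16)

kld_v1 = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
kld_v2 = torch.mean(kld_v1, dim=0)

print(kld_v1.shape)  # torch.Size([8]) -> one KL value per sample in the batch
print(kld_v2.shape)  # torch.Size([])  -> a single scalar, averaged over the batch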
For a custom Keras loss function, I need to create a float tensor from a bool tensor. Unfortunately, K.cast() is not differentiable and therefore can't be used. Is there an alternative way to do this that is differentiable?
less_than_tau = y_pred < tau
less_than_tau = K.cast(less_than_tau, 'float32')
Dr. Snoopy is right.
The way you solve this in deep learning is with "soft" functions, such as softmax instead of max.
In your case, since you want to compare y_pred to tau, you'd do something like
switch = sigmoid(y_pred - y_tau)
loss = switch * true_case + (1. - switch) * false_case
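A minimal sketch of how that could look as a Keras loss; tau, true_case, and false_case here are placeholders for your actual threshold and branch losses:

from keras import backend as K

tau = 0.5  # placeholder threshold

def soft_switch_loss(y_true, y_pred):
    # smooth, differentiable stand-in for the hard comparison y_pred < tau:
    # switch is close to 1 when y_pred > tau and close to 0 when y_pred < tau
    switch = K.sigmoid(y_pred - tau)
    true_case = K.square(y_true - y_pred)   # placeholder branch losses
    false_case = K.abs(y_true - y_pred)
    return K.mean(switch * true_case + (1.0 - switch) * false_case)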
I'm trying to maximize the number of predictions that are close to the true value, even if this results in crazy outliers that may otherwise skew a median (which I already have a working loss for) or mean.
So, I try this custom loss function:
def lossMetricPercentGreaterThanTenPercentError(y_true, y_pred):
    """
    CURRENTLY DOESN'T WORK AS LOSS: NOT DIFFERENTIABLE
    ValueError: An operation has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
    See https://keras.io/losses/
    """
    from keras import backend as K
    import tensorflow as tf

    diff = K.abs((y_true - y_pred) / K.clip(K.abs(y_true), K.epsilon(), None))
    withinTenPct = tf.reduce_sum(tf.cast(K.less_equal(diff, 0.1), tf.int32), axis=-1) / tf.size(diff, out_type=tf.int32)
    return 100 * (1 - tf.cast(withinTenPct, tf.float32))
I understand that at least the less_equal function isn't differentiable (I'm not sure if it's also throwing a fit over tf.size); is there some tensor operation that can approximate "less than or equal to"?
I'm on Tensorflow 1.12.3 and cannot upgrade, so even if tf.numpy_function(lambda x: np.sum(x <= 0.1) / len(x), diff, tf.float32) would work as a wrapper I can't use tf.numpy_function.
From the error message it looks like some gradient operation has not been implemented in Keras.
You could try to use Tensorflow operations to achieve the same result (Untested!):
diff = tf.abs(y_true - y_pred) / tf.clip_by_value(tf.abs(y_true), 1e-12, 1e12)
withinTenPct = tf.reduce_mean(tf.cast(tf.less_equal(diff, 0.1), tf.float32))
return 100.0 * (1.0 - withinTenPct)
Alternatively, you can try tf.keras.losses.logcosh(y_true, y_pred), as it seems to fit your use case. See the TF docs.
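If you need the whole thing to stay differentiable, one option is to replace the hard less_equal with a steep sigmoid (a sketch; the sharpness constant k is an assumed hyperparameter you would have to tune):

import tensorflow as tf

def soft_percent_outside_ten_pct(y_true, y_pred, k=50.0):
    # relative error, clipped to avoid division by zero
    diff = tf.abs(y_true - y_pred) / tf.clip_by_value(tf.abs(y_true), 1e-12, 1e12)
    # smooth approximation of "diff <= 0.1": close to 1 when diff < 0.1, close to 0 when diff > 0.1
    soft_within = tf.sigmoid(k * (0.1 - diff))
    return 100.0 * (1.0 - tf.reduce_mean(soft_within))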