I have some doubts about the implementation of the variational autoencoder loss. This is the one I've been using so far:
def vae_loss(recon_loss, mu, logvar):
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
    return recon_loss + KLD
After noticing problems with loss convergence, even on simple tasks such as reconstructing 1D vectors, I started googling around and found a variation of it:
def vae_loss(recon_loss, mu, logvar):
    KLD = torch.mean(-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1), dim=0)
    return recon_loss + KLD
With this second VAE loss the performance improves noticeably, and the improvement is clearly visible in the reconstructed vectors as well. What the second implementation does, I guess, is take the average of the KL term over the batch of samples.
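That guess is easy to check: the second variant is just the per-sample KL term from the first variant averaged over the batch dimension. A minimal sketch (with made-up shapes, not my actual model):

import torch

# hypothetical batch of 8 samples with a 4-dimensional latent space
mu = torch.randn(8, 4)
logvar = torch.randn(8, 4)

# first variant: one KL value per sample, shape (8,)
kld_per_sample = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

# second variant: a single scalar, the batch mean of the per-sample KL
kld_batch_mean = torch.mean(kld_per_sample, dim=0)

assert torch.allclose(kld_batch_mean, kld_per_sample.mean())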
My confusion arises from the fact that I've found these two implementations on different blog posts and I don't know which one is correct.
Related
I am interested in calculating the KL divergence term for a VAE loss function.
I have seen many examples using some facsimile of the following keras code:
kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
However, I have also found another method (which uses the torch library) of calculating the KL divergence, as shown by this code (I'll call this the latter example):
kl = kl_divergence(pred_result['latent_dist'], Normal(0, 1)).mean(dim=0).sum()
In the latter example, pred_result['latent_dist'] is a distribution object describing normal distributions parameterised by two layers: mean and st_dev (not log_var, in this case).
Normal(0,1) is also a distribution object, representing a normal distribution with mean = 0 and standard deviation = 1.
My question is this: Is the latter example a legitimate/correct way of calculating the KL divergence term for a VAE?
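For comparison, the two approaches can be checked against each other numerically: torch.distributions.kl_divergence applied to two Normal objects computes the same element-wise KL as the closed-form expression in the Keras snippet. A minimal sketch (made-up shapes, assuming a standard-deviation parameterisation as in the latter example):

import torch
from torch.distributions import Normal, kl_divergence

mu = torch.randn(8, 4)         # hypothetical latent means
std = torch.rand(8, 4) + 0.1   # hypothetical latent standard deviations

q = Normal(mu, std)                                     # approximate posterior q(z|x)
p = Normal(torch.zeros_like(mu), torch.ones_like(std))  # standard normal prior

kl_dist = kl_divergence(q, p)                           # element-wise KL, shape (8, 4)

# closed-form expression in the log-variance parameterisation
logvar = (std ** 2).log()
kl_closed = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())

assert torch.allclose(kl_dist, kl_closed, atol=1e-5)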
I tried to define a custom loss function for an AE according to the Keras spec, i.e. one that takes y and y_hat.
The loss is a combination of the MSE and the Frobenius norm of the Jacobian. Training is very fast until I return the sum of the MSE and the norm, i.e. return ret, which slows it down considerably. Not returning ret, but keeping all computations the same, makes training fast again.
E.g. the below version is slow. Just returning mse makes the training fast again.
@tf.function
def orthogonal_loss(y, y_hat):
    """
    Computes the orthogonal loss, a combination of the reconstruction loss and
    a regularizer on the orthogonality of the Jacobian.

    Args:
        y: input vector of shape (batch, dim)
        y_hat: reconstruction of y of shape (batch, dim)

    Returns: loss of MSE(y, y_hat) + scaling * || J'J - I * diag(J'J) ||_F
    """
    mse = tf.keras.losses.mean_squared_error(y, y_hat)

    with tf.GradientTape() as tape:
        z = ae.encoder(y)
        tape.watch(z)
        y_tilde = ae.decoder(z)
    # the Jacobian will be of shape (batch, output dim., latent dim.)
    jacobian = tape.batch_jacobian(y_tilde, z)

    # batched matrix multiplication; the last two dims form valid matrices
    jj = tf.matmul(jacobian, jacobian, transpose_a=True)
    # jj_diag = tf.linalg.diag_part(jj)
    # - tf.eye(128)
    ortho = tf.linalg.norm(jj, ord="fro", axis=(-2, -1))

    ret = mse + 0.0001 * ortho
    return ret
Any idea what the cause of this phenomenon is? I could only think of a complex gradient which slows down the optimizer.
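One way to probe that hypothesis (a sketch of an experiment, not a confirmed diagnosis): keep the forward computation identical but block gradients through the orthogonality term with tf.stop_gradient. If training speed then matches the mse-only case, the cost comes from backpropagating through the batch_jacobian term. For example, replacing the last two lines of orthogonal_loss above:

# same forward pass, but the optimizer no longer differentiates
# through the Jacobian-based penalty
ret = mse + 0.0001 * tf.stop_gradient(ortho)
return ret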
I have a question surrounding a pretty complex loss function I have.
This is a variational autoencoder loss function and it is fairly complex. It is made of two reconstruction losses, KL divergence and a discriminator as a regularizer. All of those losses are on the same scale, but I have found out that increasing one of the reconstruction losses by a factor of 20 (while leaving the rest on the previous scale) heavily increases the performance of my model.
Since I am still fairly new to DL, I don't completely understand why this happens, or how I could identify this sort of thing in future models.
Any advice/explanation is greatly appreciated.
To summarize your setting first:
loss = alpha1 * loss1 + alpha2 * loss2
When computing the gradients for backpropagation, we propagate back through this formula. Backpropagating through our error function gives us the gradient:
dError/dLoss
To continue our propagation downwards, we now want to compute dError/dLoss1 and dError/dLoss2.
dError/dLoss1 can be expanded to dError/dLoss * dLoss/dLoss1 via the chain rule (https://en.wikipedia.org/wiki/Chain_rule).
We already computed dError/dLoss, so we only need the derivative of Loss with respect to Loss1, which is
dLoss/dLoss1 = alpha1
The backpropagation now continues until we reach our weights (dLoss1/dWeight). The gradient our weight receives is:
dError/dWeight = dError/dLoss * dLoss/dLoss1 * dLoss1/dWeight = dError/dLoss * alpha1 * dLoss1/dWeight
As you can see, the gradient used to update our weight does now depend on alpha1, the factor we use to scale Loss1.
If we increase alpha1 while leaving alpha2 unchanged, the gradients coming from Loss1 will have a larger impact than the gradients from Loss2, thereby changing the optimization of our model.
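A tiny numerical sketch of this effect (the losses here are toy expressions made up purely for illustration):

import torch

def weight_grad(alpha1, alpha2):
    # one toy parameter and two toy loss terms that both depend on it
    w = torch.tensor(2.0, requires_grad=True)
    loss1 = (w - 1.0) ** 2
    loss2 = (w + 1.0) ** 2
    loss = alpha1 * loss1 + alpha2 * loss2
    loss.backward()
    return w.grad.item()

print(weight_grad(1.0, 1.0))    # dLoss/dw = 1*2*(2-1) + 1*2*(2+1) = 8
print(weight_grad(20.0, 1.0))   # dLoss/dw = 20*2*(2-1) + 1*2*(2+1) = 46

Scaling alpha1 by 20 scales Loss1's contribution to the weight gradient by 20, exactly as derived above.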
I am trying to understand the best way to use autoencoder loss functions.
The common pattern is that the loss function consists of a KL loss and a reconstruction loss.
What really confuses me is that the reconstruction loss is multiplied by some big constant, for example:
xent_loss = self.original_dim * metrics.binary_crossentropy(self.x, x_decoded_mean)
kl_loss = - 0.5 * K.sum(1 + self.z_log_var - K.square(self.z_mean) - K.exp(self.z_log_var), axis=-1)
vae_loss = K.mean(xent_loss + kl_loss)
from https://github.com/mattiacampana/Autoencoders/blob/master/models/vae.py
or from https://towardsdatascience.com/variational-autoencoders-as-generative-models-with-keras-e0c79415a7eb
Could you explain this? I also don't understand why they sometimes use reduce_mean and sometimes reduce_sum as the aggregation function. Is there a difference between these? Would it be a problem, for example, to use reduce_sum in the KL term and reduce_mean in the reconstruction term?
Thanks for any response
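For context on the big constant in the snippet above: Keras' binary_crossentropy averages the element-wise cross entropy over the last axis, so multiplying it by original_dim turns the per-dimension mean back into a per-sample sum, putting the reconstruction term on the same scale as the summed KL term. A minimal sketch of that relationship (made-up shapes, not the full VAE):

import numpy as np
import tensorflow as tf

original_dim = 784
x = tf.random.uniform((8, original_dim))                           # hypothetical targets in [0, 1]
x_decoded_mean = tf.random.uniform((8, original_dim), 0.05, 0.95)  # hypothetical reconstructions

# the Keras loss averages the element-wise BCE over the last axis ...
bce_mean = tf.keras.losses.binary_crossentropy(x, x_decoded_mean)  # shape (8,)
# ... so scaling by original_dim recovers the per-sample sum over dimensions
bce_sum = tf.reduce_sum(tf.keras.backend.binary_crossentropy(x, x_decoded_mean), axis=-1)

np.testing.assert_allclose(original_dim * bce_mean.numpy(), bce_sum.numpy(), rtol=1e-4)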
I am working on multiclass classification (4 classes) for a language task and I am using the BERT model for the classification. I am following this blog as a reference. My fine-tuned BERT model returns nn.LogSoftmax(dim=1) outputs.
My data is pretty imbalanced so I used sklearn.utils.class_weight.compute_class_weight to compute weights of the classes and used the weights inside the Loss.
class_weights = compute_class_weight('balanced', np.unique(train_labels), train_labels)
weights = torch.tensor(class_weights, dtype=torch.float)
cross_entropy = nn.NLLLoss(weight=weights)
My results were not so good, so I thought of experimenting with focal loss, and I have the following code for it:
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2, logits=False, reduce=True):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.logits = logits
        self.reduce = reduce

    def forward(self, inputs, targets):
        BCE_loss = nn.CrossEntropyLoss()(inputs, targets)
        pt = torch.exp(-BCE_loss)
        F_loss = self.alpha * (1 - pt) ** self.gamma * BCE_loss

        if self.reduce:
            return torch.mean(F_loss)
        else:
            return F_loss
I have 3 questions now. The first and most important is:
Should I use class weights with focal loss?
If I have to implement weights inside this focal loss, can I use the weight parameter of nn.CrossEntropyLoss()?
If this implementation is incorrect, what should be the proper code for it, including the weights (if possible)?
You may find answers to your questions as follows:
Focal loss automatically handles the class imbalance, hence weights are not required for the focal loss. The alpha and gamma factors handle the class imbalance in the focal loss equation.
No need for extra weights, because focal loss handles imbalance using the alpha and gamma modulating factors.
The implementation you mentioned is correct according to the focal loss formula, but I had trouble getting my model to converge with that version, hence I used the following implementation from the mmdetection framework:
pred_sigmoid = pred.sigmoid()
target = target.type_as(pred)
pt = (1 - pred_sigmoid) * target + pred_sigmoid * (1 - target)
focal_weight = (alpha * target + (1 - alpha) * (1 - target)) * pt.pow(gamma)
loss = F.binary_cross_entropy_with_logits(pred, target, reduction='none') * focal_weight
# weight_reduce_loss is an mmdetection helper that applies optional element-wise
# weights and then reduces the loss (mean/sum) with an optional averaging factor
loss = weight_reduce_loss(loss, weight, reduction, avg_factor)
return loss
You can also experiment with other focal loss versions that are available.
I think OP would've gotten his answer by now. I am writing this for other people who might come across this.
There is one problem in OP's implementation of focal loss:
F_loss = self.alpha * (1-pt)**self.gamma * BCE_loss
In this line, the same alpha value is multiplied with every class output probability, i.e. pt. Additionally, the code doesn't show how we get pt. A very good implementation of focal loss can be found here, but that implementation is only for binary classification, as it has alpha and 1 - alpha for the two classes in the self.alpha tensor.
In the case of multi-class or multi-label classification, the self.alpha tensor should contain a number of elements equal to the total number of labels. The values could be the inverse label frequencies or the inverse normalized label frequencies (just be cautious with labels that have a frequency of 0).
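A minimal sketch of that idea (the class name and the inverse-frequency alphas are illustrative assumptions, not a canonical implementation; it also assumes raw logits rather than the LogSoftmax outputs OP's model returns):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiClassFocalLoss(nn.Module):
    """Focal loss with a per-class alpha tensor (multi-class, single-label)."""
    def __init__(self, alpha, gamma=2.0, reduction='mean'):
        super().__init__()
        # alpha: shape (num_classes,), e.g. inverse label frequencies
        self.register_buffer('alpha', torch.as_tensor(alpha, dtype=torch.float))
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, logits, targets):
        # per-sample cross entropy (no reduction), so pt is computed per sample
        ce = F.cross_entropy(logits, targets, reduction='none')
        pt = torch.exp(-ce)              # probability assigned to the true class
        alpha_t = self.alpha[targets]    # per-sample class weight, i.e. alpha_t
        loss = alpha_t * (1 - pt) ** self.gamma * ce
        if self.reduction == 'mean':
            return loss.mean()
        return loss.sum() if self.reduction == 'sum' else loss

# usage with hypothetical class counts -> inverse-frequency alphas
counts = torch.tensor([500.0, 100.0, 50.0, 10.0])
loss_fn = MultiClassFocalLoss(alpha=counts.sum() / counts)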
I think the implementation in your question is wrong. The alpha is the class weight.
In weighted cross entropy the class weight is the alpha_t in the following expression:

CE(p_t) = -alpha_t * log(p_t)

You can see that it is alpha_t rather than alpha. In focal loss the formula is

FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)

and we can see from this popular PyTorch implementation that alpha acts the same way as a class weight.
References:
https://amaarora.github.io/2020/06/29/FocalLoss.html#alpha-and-gamma
https://github.com/clcarwin/focal_loss_pytorch