Gradually update weights of custom loss in Keras during training - python

I defined a custom loss function in Keras (tensorflow backend) that is comprised of reconstruction MSE and the kullback leibler divergence between the learned probability distribution and a standard normal distribution. (It is for a variational autoencoder.)
I want to be able to slowly increase how much the cost is affected by the KL divergence term during training, with a weight called "reg", starting at reg=0.0 and increasing until it gets to 1.0. I would like the rate of increase to be tuned as a hyperparameter.(As of now, I just have the "reg" parameter set constant at 0.5.)
Is there functionality in Keras to do this?
def vae_loss(y_true,y_pred):
reg = 0.5
# Average cosine distance for all words in a sequence
reconstruction_loss = tf.reduce_mean(mean_squared_error(y_true, y_pred),1)
# Second part of the loss ensures the z probability distribution doesn't stray too far from normal
KL_divergence_loss = tf.reduce_mean(tf.log(z_sigma) + tf.div((1 + tf.square(z_mu)),2*tf.square(z_sigma)) - 0.5,1)
loss = reconstruction_loss + tf.multiply(reg,KL_divergence_loss)
return loss

Related

Pytorch: Why does altering the scale of the loss functions improve the convergence in some models?

I have a question surrounding a pretty complex loss function I have.
This is a variational autoencoder loss function and it is fairly complex. It is made of two reconstruction losses, KL divergence and a discriminator as a regularizer. All of those losses are on the same scale, but I have found out that increasing one of the reconstruction losses by a factor of 20 (while leaving the rest on the previous scale) heavily increases the performance of my model.
Since I am still fairly novice on DL, I dont completely understand why this happens, or how I could identify this sort of thing on successive models.
Any advice/explanation is greatly appreciated.
To summarize your setting first:
loss = alpha1 * loss1 + alpha2 * loss2
When computing the gradients for backpropagation, we compute back through this formular. By backpropagating through our error function we get the gradient:
dError/dLoss
To continue our propagation downwards, we now want to compute dError/dLoss1 and dError/dLoss2.
dError/dLoss1 can be expanded to dError/dLoss * dLoss/dLoss1 via the cain rule (https://en.wikipedia.org/wiki/Chain_rule).
We already computed dError/dLoss so we only need to compute dLoss derived with respect to dLoss1, which is
dLoss/dLoss1 = alpha1
The backpropagation now continues until we reach our weights (dLoss1/dWeight). The gradient our weight receives is:
dError/dWeight = dError/dLoss * dLoss/dLoss1 * dLoss1/dWeight = dError/dLoss * alpha1 * dLoss1/dWeight
As you can see, the gradient used to update our weight does now depend on alpha1, the factor we use to scale Loss1.
If we increase alpha1 while not changing alpha2 the gradients depending on Loss1 will have higher different impact than the gradients of Loss2 and therefor changing the optimization of our model.

Keras Categorical Cross Entropy

I'm trying to wrap my head around the categorical cross entropy loss. Looking at the implementation of the cross entropy loss in Keras:
# scale preds so that the class probas of each sample sum to 1
output = output / math_ops.reduce_sum(output, axis, True)
# Compute cross entropy from probabilities.
epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
return -math_ops.reduce_sum(target * math_ops.log(output), axis)
I do not see where the delta = output - target
is calculated.
See here.
What am I missing?
I think you might be confusing two different concepts / events here.
The categorical cross entropy loss is a measure of the error of your model, as calculated by :
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
<etc>
This just returns an array of losses for each label, it is the direct difference between the true label and what your model thinks the label should be.
The next step after calculating the loss (part of the forward propagation phase) is to then start backpropagation, i.e. we want to find the influence that each weight/bias matrix has on the loss you've calculated above, so that we can perform the update step.
The first step is then to calculate dL/dz i.e. the derivative of the loss function with respect to the linear function (y = Wx + b), which itself is the combination of dL/da * da/dz (i.e. the deriv loss wrt activation * deriv activation wrt the linear function).
The link you posted is the derivative of the activation function wrt the linear function. This blog does a decent job of explaining how all the parts fit together, although the activation function they use is a sigmoid, but the overall pieces that fit together are the same.

How does PyTorch compute the backward pass when optimizing triplet loss?

I am implementing a triplet network in Pytorch where the 3 instances (sub-networks) share the same weights. Since the weights are shared, I implemented it as a single instance network that is called three times to produce the anchor, positive, and negative embeddings. The embeddings are learned by optimizing the triplet loss. Here is a small snippet for illustration:
from dependencies import *
model = SingleSubNet() # represents each instance in the triplet net
for epoch in epochs:
for anch, pos, neg in enumerate(train_loader):
optimizer.zero_grad()
fa, fp, fn = model(anch), model(pos), model(neg)
loss = triplet_loss(fa, fp, fn)
loss.backward()
optimizer.step()
# Do more stuff ...
My complete code works as expected. However, I do not understand what does the loss.backward() compute the gradient(s) in this case. I am confused because there are 3 gradients of loss is in each learning step (the gradients formulas are here). I assume the gradients are summed before performing optimizer.step(). But then it looks from the equations that if the gradients are summed, they will cancel each other out and yield zero update term. Of course, this is not true as the network learns meaningful embeddings at the end.
Thanks in advance
Late answer, but hope this helps someone.
The gradients that you linked are the gradients of the loss with respect to the embeddings (the anchor, positive embedding and negative embedding). To update the model parameters, you use the gradient of the loss with respect to the model parameters. This does not sum to zero.
The reason for this is that when calculating the gradient of the loss with respect to the model parameters, the formula makes use of the activations from the forward pass, and the 3 different inputs (anchor image, positive example and negative example) have different activations in the forward pass.

Reconstruction loss on regression type of Variational Autoencoder

I'm currently working on a variation of Variational Autoencoder in a sequential setting, where the task is to fit/recover a sequence of real-valued observation data (hence it is a regression problem).
I have built my model using tf.keras with eager execution enabled, and tensorflow_probability (tfp). Following VAE concept, the generative net emits the distribution parameters of the observation data, which I model as multivariate normal. Therefore the outputs are mean and logvar of the predicted distribution.
Regarding training process, the first component of the loss is reconstruction error. That is the log likelihood of the true observation, given the predicted (parameters) distribution from the generative net. Here, I use tfp.distributions, since it is fast and handy.
However, after training is done, marked by a considerably low loss value, it turns out that my model seems not to learn anything. The predicted value from the model is just barely flat across the time dimension (recall that the problem is sequential).
Nevertheless, for the sake of sanity check, when I replace log likelihood with MSE loss (which is not justifiable while working on VAE), it yields very good data fitting. So I conclude that there must be something wrong with this log likelihood term. Is there anyone having some clue and/or solution for this?
I have considered replacing the log likelihood with cross-entropy loss, but I think that is not applicable in my case, since my problem is regression and the data can't be normalized into [0,1] range.
I also have tried to implement annealed KL term (i.e. weighing the KL term with constant < 1) when using the log likelihood as the reconstruction loss. But it also didn't work.
Here is my code snippet of the original (using log likelihood as reconstruction error) loss function:
import tensorflow as tf
tfe = tf.contrib.eager
tf.enable_eager_execution()
import tensorflow_probability as tfp
tfd = tfp.distributions
def loss(model, inputs):
outputs, _ = SSM_model(model, inputs)
#allocate the corresponding output component
infer_mean = outputs[:,:,:latent_dim] #mean of latent variable from inference net
infer_logvar = outputs[:,:,latent_dim : (2 * latent_dim)]
trans_mean = outputs[:,:,(2 * latent_dim):(3 * latent_dim)] #mean of latent variable from transition net
trans_logvar = outputs[:,:, (3 * latent_dim):(4 * latent_dim)]
obs_mean = outputs[:,:,(4 * latent_dim):((4 * latent_dim) + output_obs_dim)] #mean of observation from generative net
obs_logvar = outputs[:,:,((4 * latent_dim) + output_obs_dim):]
target = inputs[:,:,2:4]
#transform logvar to std
infer_std = tf.sqrt(tf.exp(infer_logvar))
trans_std = tf.sqrt(tf.exp(trans_logvar))
obs_std = tf.sqrt(tf.exp(obs_logvar))
#computing loss at each time step
time_step_loss = []
for i in range(tf.shape(outputs)[0].numpy()):
#distribution of each module
infer_dist = tfd.MultivariateNormalDiag(infer_mean[i],infer_std[i])
trans_dist = tfd.MultivariateNormalDiag(trans_mean[i],trans_std[i])
obs_dist = tfd.MultivariateNormalDiag(obs_mean[i],obs_std[i])
#log likelihood of observation
likelihood = obs_dist.prob(target[i]) #shape = 1D = batch_size
likelihood = tf.clip_by_value(likelihood, 1e-37, 1)
log_likelihood = tf.log(likelihood)
#KL of (q|p)
kl = tfd.kl_divergence(infer_dist, trans_dist) #shape = batch_size
#the loss
loss = - log_likelihood + kl
time_step_loss.append(loss)
time_step_loss = tf.convert_to_tensor(time_step_loss)
overall_loss = tf.reduce_sum(time_step_loss)
overall_loss = tf.cast(overall_loss, dtype='float32')
return overall_loss

Multiple losses for imbalanced dataset with Keras

My Model:
I built a siamese network that take two input and has three outputs. So My loss functions is :
total loss = alpha( loss1) + alpah( loss2) + (1_alpah) ( loss3)
loss1 and loss2 is categorical cross entropy loss function, to classify the class identity from total of 8 classes.
loss3 is similarity loss function ( euclidean distance loss), to verify if the both input from same class or different classes.
My questions are as follow:
If I have different losses, and I want to weight them by using the variable alpha which its value depend on the epoch number. So I have to set the value pf alpha through callback. My question is it possible to pass this alpha variable that its value changed with epoch number (i.e not scalar) through the loss_weights in model.complie. The documentation said:
loss_weights: Optional list or dictionary specifying scalar
coefficients (Python floats) to weight the loss contributions of
different model outputs. The loss value that will be minimized by the
model will then be the weighted sum of all individual losses, weighted
by the loss_weights coefficients. If a list, it is expected to have a
1:1 mapping to the model's outputs. If a tensor, it is expected to map
output names (strings) to scalar coefficients.
Example
alpha = K.variable(0., dtype=tf.float32)
def changeAlpha(epoch,logs):
new_alpha = some_function(epoch)
K.set_value(alpha, new_alpha)
return
alphaChanger = LambdaCallback(on_epoch_end=changeAlpha)
model.compile(loss= [loss1, loss2, loss3], loss_weights = [alpha, alpha, (1-alpha)])
My dataset is imbalanced, so I want to use class_wights in model.fit(). So for the same model of three losses, I want to apply the class weights only for categorical cross entropy losses ( loss 1 and loss2) , So will it work on both losses and except the third loss if I pass it to model.fit? Knowing that the third loss is custom loss function.
If I want to classify the classes for siamese network, would my metric be model.compile(metrics= ['out1':'accuracy', 'out2':accuracy']])? But the final accuracy need to be the average of both,I can solve it by building my own custom metric. But is there anyway to weighted summing of both metrics.

Categories