How does Keras calculate loss if you have multiple outputs - python

I searched all over the web but I couldn't find it: how does Keras calculate the loss if we have multiple output values?

It depends on which loss function you are using.
Usually, for each batch during training, Keras calls the loss function to compute the loss and uses it to perform a gradient descent step. It also keeps track of the total loss since the beginning of the epoch and displays the mean loss.
With multiple outputs the same process happens per output: Keras computes a loss for each output and combines them into a single total loss, which is the sum of the individual losses, optionally weighted by the loss_weights argument of compile. Calculation-wise you can pick any loss function mentioned in the document here and check the examples; for SparseCategoricalCrossentropy, for instance, you can check this document.
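As a rough sketch of how those per-output losses get combined (the model and output names below are made up for illustration, not from the question), a two-output model compiled like this minimizes 0.7 * loss(out_a) + 0.3 * loss(out_b):
import tensorflow as tf

# Hypothetical two-output model; layer/output names are placeholders.
inputs = tf.keras.Input(shape=(16,))
x = tf.keras.layers.Dense(32, activation="relu")(inputs)
out_a = tf.keras.layers.Dense(10, activation="softmax", name="out_a")(x)
out_b = tf.keras.layers.Dense(1, name="out_b")(x)
model = tf.keras.Model(inputs, [out_a, out_b])

model.compile(
    optimizer="adam",
    loss={"out_a": "sparse_categorical_crossentropy", "out_b": "mse"},
    loss_weights={"out_a": 0.7, "out_b": 0.3},
)
# The total loss minimized at each step is:
#   0.7 * sparse_categorical_crossentropy(out_a) + 0.3 * mse(out_b)
If you omit loss_weights, the total loss is simply the unweighted sum of the individual output losses.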

Related

Keras GradientTape: Calculating gradients with respect to the output node

For starters: this question does not ask for help regarding reinforcement learning (RL); RL is only used as an example.
The Keras documentation contains an example actor-critic reinforcement learning implementation using Gradient Tape. Basically, they've created a model with two separate outputs: one for the actor (n actions) and one for the critic (1 reward). The following lines describe the backpropagation process (found somewhere in the code example):
# Backpropagation
loss_value = sum(actor_losses) + sum(critic_losses)
grads = tape.gradient(loss_value, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
Despite the fact that the actor and critic losses are calculated differently, they sum up those two losses to obtain the final loss value used for calculating the gradients.
When looking at this code example, one question came to my mind: Is there a way to calculate the gradients of the output layer with respect to the corresponding losses, i.e. calculate the gradients of the first n output nodes based on the actor loss and the gradient of the last output node based on the critic loss? To my understanding, this would be much more convenient than adding both (different!) losses and updating the gradients based on this cumulative approach. Do you agree?
Well, after some research I found the answer myself: It is possible to extract the trainable variables of a given layer based on the layer name. Then we can apply tape.gradient and optimizer.apply_gradients to the extracted set of trainable variables. My current solution is pretty slow, but it works. I just need to figure out how to improve its runtime.
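As a rough sketch of that idea (the layer names actor_out and critic_out and the two compute_*_loss helpers are placeholders, not from the Keras example):
# Sketch: per-head gradient updates; layer names and loss helpers are hypothetical.
actor_vars = model.get_layer("actor_out").trainable_variables
critic_vars = model.get_layer("critic_out").trainable_variables

with tf.GradientTape(persistent=True) as tape:
    action_probs, critic_value = model(state)
    actor_loss = compute_actor_loss(action_probs)    # placeholder loss computation
    critic_loss = compute_critic_loss(critic_value)  # placeholder loss computation

actor_grads = tape.gradient(actor_loss, actor_vars)
critic_grads = tape.gradient(critic_loss, critic_vars)
optimizer.apply_gradients(zip(actor_grads, actor_vars))
optimizer.apply_gradients(zip(critic_grads, critic_vars))
del tape  # a persistent tape should be released when no longer needed
Note that this only updates the head layers: the shared layers feeding both heads would still need gradients from some combined loss, which is one reason the original example simply sums the two losses.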

Loss Function in PyTorch where not all my training examples are equally weighted

I want to train a Neural Network in PyTorch. I have my training dataset, however I care more about some examples than about others. I want to include this information in the loss function - to let the NN know that it is very important to get some examples right and to not punish errors on other examples very much.
I want to do this by weighting the loss for training examples, let's say:
loss = weight_for_example * (y_true - y_pred)**2
Is there an easy way to do this in PyTorch?
It mainly depends on your task: for instance, BCEWithLogitsLoss has a weight parameter that allows a custom weight for each element of the batch. Many other built-in losses also provide this option.
Aside from solutions already available in the framework such as this, a simple approach could be the following:
build a custom dataset, returning your data and a scalar weight for that sample in your __getitem__
proceed with the forward pass
compute your loss, which you can now multiply by the factors you provided.
There's only a caveat (which is the same as for BCELoss): you probably iterate on batches with size > 1, so your dataloader will provide a batch of data with a batch of weights. You need to make sure you don't reduce your loss beforehand, so that you can still multiply it by your batch of weights and then perform a manual reduction (e.g. loss = loss.mean()), as in the sketch below.
See some examples here.
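A minimal sketch of that approach, assuming an MSE-style regression loss and a DataLoader that yields (data, target, weight) batches; model, optimizer and dataloader are placeholders:
import torch
import torch.nn as nn

criterion = nn.MSELoss(reduction="none")   # no reduction: keep per-element losses

for x, y, w in dataloader:                 # dataloader yields data, targets, per-sample weights
    optimizer.zero_grad()
    y_pred = model(x)
    per_sample = criterion(y_pred, y)                                  # shape (batch_size, ...)
    per_sample = per_sample.view(per_sample.size(0), -1).mean(dim=1)   # one loss value per sample
    loss = (w * per_sample).mean()         # weight each sample, then reduce manually
    loss.backward()
    optimizer.step()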

How to implement Batch Norm with SWA in Tensorflow?

I am using Stochastic Weight Averaging (SWA) with Batch Normalization layers in Tensorflow 2.2. For Batch Norm I use tf.keras.layers.BatchNormalization. For SWA I use my own code to average the weights (I wrote my code before tfa.optimizers.SWA appeared). I have read in multiple sources that if using batch norm and SWA we must run a forward pass to make certain data (running mean and standard deviation of the activations and/or momentum values?) available to the batch norm layers. What I do not understand - despite a lot of reading - is exactly what needs to be done and how. Specifically:
When must the forward/prediction pass be run? At the end of each
mini-batch, end of each epoch, end of all training?
When the forward pass is run, how are the running mean & stdev values made available
to the batch norm layers?
Is this process performed magically by the tfa.optimizers.SWA class?
When must the forward/prediction pass be run? At the end of each
mini-batch, end of each epoch, end of all training?
At the end of training. Think of it like this: SWA is performed by swapping your final weights with a running average, but all batch norm layers are still calculated based on statistics from your old weights. So we need to run a forward pass to let them catch up.
When the forward pass is run, how are the running mean & stdev values
made available to the batch norm layers?
During a normal forward pass (prediction) the running mean and standard deviation will not be updated. So what we actually need to do is to train the network, but not update the weights. This is what the paper refers to when it says to run the forward pass in "training mode".
The easiest way to achieve this (that I know) is to reset the batch normalization layers and train one additional epoch with learning rate set to 0.
Is this process performed magically by the tfa.optimizers.SWA class?
I don't know. But if you are using Tensorflow Keras then I have made this Keras SWA callback that does it like in the paper including the learning rate schedules.
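For reference, a rough sketch of the reset-and-retrain idea from the answer above, assuming a compiled tf.keras model and a training dataset train_ds (both placeholders):
import tensorflow as tf

# Reset the running statistics of every BatchNormalization layer.
for layer in model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.moving_mean.assign(tf.zeros_like(layer.moving_mean))
        layer.moving_variance.assign(tf.ones_like(layer.moving_variance))

# Train one extra epoch with learning rate 0: the trainable weights stay fixed,
# but the moving mean/variance are still updated because fit runs in training mode.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.0),
              loss="sparse_categorical_crossentropy")  # placeholder: reuse your original loss
model.fit(train_ds, epochs=1)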

In keras, is the loss value of model.fit the average over batches or over samples?

The background for this question is that I am trying to debug a TensorFlow pipeline. I manually computed the loss for each example based on the network's current prediction and averaged the error terms. The number I get is different from what Keras is reporting, so the question is if I found my bug, or if I am computing the wrong thing to compare to the value reported as "loss".
What exactly is Keras computing here?
The training loss that Keras reports is a running mean of the batch losses during training, updated after each batch. The weights change during training, so you cannot compare this loss to making predictions with fixed weights and then computing a loss.
This means that if you compute the average (or even a running mean) over batches using the final, fixed weights, you will get a different result.
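A quick way to see the difference (model, x_train and y_train are placeholders): compare the loss reported by fit with the loss recomputed on the same data using the final, fixed weights.
history = model.fit(x_train, y_train, epochs=1, batch_size=32)

# Running mean of the batch losses, computed while the weights were still changing:
print("loss reported by fit:", history.history["loss"][-1])

# Same data, but evaluated with the final, fixed weights
# (evaluate returns [loss, metrics...] if the model was compiled with metrics):
print("loss from evaluate:  ", model.evaluate(x_train, y_train, batch_size=32, verbose=0))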

Formulaically updating parameters in Keras layers

I am trying to write some custom layers in Keras. The ultimate goal is that certain parameters (updated according to a fixed formula after each batch of data is optimized over in the training process) be passed to the loss function. I do not believe it is possible to use dynamic loss functions in Keras, but that I should be able to pass these parameters to the loss function using multiple inputs and a custom layer.
I want to know whether it is possible to create a layer in Keras having parameters that are not trainable (and not optimized over at all in the training process), but instead updated according to a fixed formula at the end of each batch optimization in the training process.
The simplest example I can give: instead of optimizing a generic cost function (like cross-entropy), I want to optimize something proportional to the cross entropy (c*cross_entropy). After one batch of data is processed in the training procedure, I want to set, for example, c = 1.2*c, and have this updated value be used as c for the next batch of data.
(This should be more or less useless in this case, as a positive constant times the loss function shouldn't affect the minima, but it's fairly close to what I actually need to do.)
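One possible sketch of this (just an illustration with placeholder names, not a tested recipe): keep c in a non-trainable tf.Variable, read it inside a custom loss, and update it from a callback at the end of every batch.
import tensorflow as tf

c = tf.Variable(1.0, trainable=False, dtype=tf.float32)  # never touched by the optimizer

def scaled_crossentropy(y_true, y_pred):
    # The current value of c is read every time the loss is evaluated.
    return c * tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)

class UpdateC(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        c.assign(1.2 * c)  # the fixed update formula from the question

model.compile(optimizer="adam", loss=scaled_crossentropy)  # model, data are placeholders
model.fit(x_train, y_train, epochs=3, callbacks=[UpdateC()])
Because c is a tf.Variable, the compiled loss reads its current value on every batch, so the callback's update takes effect on the next batch.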
