PyTorch - custom loss function - calculations outside graph - python

I'm kinda new to pytorch and trying to wrap my head around it.
I've read about custom loss functions and, as far as I've seen, they cannot be decoupled from the internal computational graph. This means the loss function consumes tensors, performs operations on them that are implemented in PyTorch, and outputs a tensor. Is there any way to have a decoupled loss calculation and plug it back in somehow?
Use case
I'm trying to train an encoder whose latent space will be optimized for some statistical quality. This means I don't train in batches; I calculate a single loss value for the whole epoch and the whole data set. Is it even possible to train a network that way?
class Encoder(nn.Module):
    def __init__(self, genome_size: int):
        super(Encoder, self).__init__()
        self.fc1 = nn.Linear(genome_size, genome_size)
        self.fc2 = nn.Linear(genome_size, genome_size)
        self.fc3 = nn.Linear(genome_size, genome_size)
        self.genome_size = genome_size
    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x
def train_encoder(
    net: nn.Module,
    optimizer: Optimizer,
    epochs: int,
    population: Tensor,
    fitness: Tensor,
):
    running_loss = 0.0
    for epoch in range(epochs):
        outputs = net(population)
        # encoder_loss is computationally heavy and cannot be done only on tensors
        # I need to unwrap those tensors to numpy arrays and use them as an input to another model
        loss = encoder_loss(outputs, fitness)
        running_loss += loss
        print('Encoder loss:', loss)
I've seen some examples with an accumulated running_loss, but my encoder is unable to learn anything. The convergence plot just jumps all over the place.
Thanks for your time <3


Keras loss value significant jump

I am working on a simple neural network in Keras with Tensorflow. There is a significant jump in loss value from the last mini-batch of epoch L-1 to the first mini-batch of epoch L.
I am aware that the loss should decrease as the number of iterations increases, but a significant jump in loss after each epoch does look strange. Here is the code snippet:
initializer = tf.keras.initializers.he_uniform()
def my_loss(y_true, y_pred):
    epsilon = 1e-30  # epsilon is added to avoid inf/nan
    y_pred = K.cast(y_pred, K.floatx())
    y_true = K.cast(y_true, K.floatx())
    loss = y_true * K.log(y_pred + epsilon) + (1 - y_true) * K.log(1 - y_pred + epsilon)
    loss = K.mean(loss, axis=-1)
    loss = K.mean(loss)
    loss = -1 * loss
    return loss
inputs = tf.keras.Input(shape=(140,))
x = tf.keras.layers.Dense(1000, kernel_initializer=initializer)(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dense(1000, kernel_initializer=initializer)(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.Dense(1000, kernel_initializer=initializer)(x)
x = tf.keras.layers.ReLU()(x)
x = tf.keras.layers.Dense(100, kernel_initializer=initializer)(x)
outputs = tf.keras.activations.sigmoid(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
opt = tf.keras.optimizers.Adam()
recall1 = tf.keras.metrics.Recall(top_k=8)
c_entropy = tf.keras.losses.BinaryCrossentropy()
model.compile(loss=c_entropy, optimizer=opt, metrics=[recall1, my_loss], run_eagerly=True)
model.fit(..., Y_train_test, epochs=epochs, batch_size=batch, shuffle=True, verbose=1)
When I searched online, I found this article, which suggests that Keras calculates a moving average over the mini-batches. I also read that the array used for the moving average is reset after each epoch, which is why we get a very smooth curve within an epoch but a jump after the epoch.
In order to avoid the moving average, I implemented my own loss function, which should output the loss value of each mini-batch instead of the moving average over the batches. Since each mini-batch is different, the corresponding losses should also differ, so I was expecting an arbitrary loss value on each mini-batch from my implementation. Instead, I obtain exactly the same values as the Keras loss function.
I am unclear about:
Is Keras calculating a moving average over the mini-batches, with the array reset after each epoch, causing the jump? If not, what is causing the jump in the loss value?
Is my implementation of the per-mini-batch loss correct? If not, how can I obtain the loss value of each mini-batch during training?
Keras does in fact show a running average rather than the "raw" loss values. The average is reset after each epoch, which is why there is a large jump at each epoch boundary. To get the raw loss values, implement a callback as shown below:
class LossHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        # initialize a list at the beginning of training
        self.losses = []
    def on_batch_end(self, batch, logs={}):
        # record the loss reported for this batch
        self.losses.append(logs.get('loss'))
mycallback = LossHistory()
Then pass it to model.fit through the callbacks argument:
model.fit(..., Y, epochs=epochs, batch_size=batch, shuffle=True, verbose=0, callbacks=[mycallback])
I tested with the following configuration
Keras 2.3.1
Tensorflow 2.1.0
Python 3.7.9
For some reason, it didn't work with the following configuration
Keras 2.4.3
Tensorflow 2.2.0
Python 3.8.5
To answer the second question: the implementation of the loss function my_loss is correct, and the values obtained are very close to those generated by the built-in function.
In TensorFlow version 2.2 and newer, the loss provided to on_train_batch_end is now the average loss of all batches up until the current batch within the given epoch. This is also the case for additional metrics, and applies to the built-in losses/metrics as well as any custom losses/metrics.
Fortunately, the loss for the current batch can be calculated from the average loss as follows:
from tensorflow.keras.callbacks import Callback
class CustomCallback(Callback):
    ''' This callback converts the average loss (default behavior in TF>=2.2)
    into the loss for only the current batch.
    '''
    def on_epoch_begin(self, epoch, logs={}):
        self.previous_loss_sum = 0
    def on_train_batch_end(self, batch, logs={}):
        # calculate loss of current batch:
        current_loss_sum = (batch + 1) * logs['loss']
        current_loss = current_loss_sum - self.previous_loss_sum
        self.previous_loss_sum = current_loss_sum
        # use current_loss:
        # ...
This code can be added to any custom callback that needs the loss for the current batch instead of the average loss, including the LossHistory callback provided in Doc Jazzy's answer.
Also, if you are using Tensorflow 1 or TensorFlow 2 version <= 2.1, then do not include this code in your callback, as in those versions the current loss is already provided, instead of the average loss.
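To make the arithmetic in the callback above concrete, here is a minimal plain-Python sketch of the same recovery step. The function name per_batch_losses and the three loss values are made up for illustration, and the sketch assumes every batch has the same size, so that the reported value is a plain mean over the batches seen so far.

```python
def per_batch_losses(running_means):
    """Recover each batch's loss from the running mean reported after it."""
    losses = []
    previous_sum = 0.0
    for batch_index, mean_so_far in enumerate(running_means):
        # mean over (batch_index + 1) batches -> sum over those batches
        current_sum = (batch_index + 1) * mean_so_far
        losses.append(current_sum - previous_sum)
        previous_sum = current_sum
    return losses

# Hypothetical raw losses for three batches of one epoch.
raw = [4.0, 2.0, 6.0]
# What a running mean over the epoch would report after each batch.
running = [sum(raw[: i + 1]) / (i + 1) for i in range(len(raw))]
recovered = per_batch_losses(running)
```

Here running is [4.0, 3.0, 4.0], which looks much smoother than the raw values, and recovered matches raw again. If the last batch of an epoch is smaller than the others, the plain (batch + 1) factor is only approximate.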

Backpropagating multiple losses in Pytorch

I am building a cascade of neural networks and I would like to backpropagate the main loss through all of the DNNs and also backpropagate an auxiliary loss to each DNN.
I am trying to figure out what is the best practice when building such a model and how to make sure that my losses are computed properly. Do I build a single torch.nn.Module and a single optimizer, or do I have to create separate modules and optimizers for each network? Also I am likely to have more than three cascaded DNNs.
Approach a)
import torch
from torch import nn, optim
class MasterNetwork(nn.Module):
    def __init__(self):
        super(MasterNetwork, self).__init__()
        self.dnn1 = nn.ModuleList()
        self.dnn2 = nn.ModuleList()
        self.dnn3 = nn.ModuleList()
    def forward(self, x, z1, z2):
        out1 = self.dnn1(x)
        out2 = self.dnn2(out1 + z1)
        out3 = self.dnn3(out2 + z2)
        return [out1, out2, out3]
def LossFunction(inp):
    # do stuff
    return loss  # loss is a scalar value
def ac_loss_1_fn(inp):
    # do stuff
    return loss  # loss is a scalar value
def ac_loss_2_fn(inp):
    # do stuff
    return loss  # loss is a scalar value
def ac_loss_3_fn(inp):
    # do stuff
    return loss  # loss is a scalar value
model = MasterNetwork()
optimizer = optim.Adam(model.parameters())
input = torch.tensor()
z1 = torch.tensor()
z2 = torch.tensor()
outputs = model(input, z1, z2)
main_loss = LossFunction(outputs[2])
ac1_loss = ac_loss_1_fn(outputs[0])
ac2_loss = ac_loss_2_fn(outputs[1])
ac3_loss = ac_loss_3_fn(outputs[2])
This is where I am uncertain about how to backpropagate the AC losses for each DNN
in addition to the main loss.
Approach b)
This would mean creating an nn.Module class and an optimizer for each DNN and then forwarding the loss to the next DNN.
I would prefer to have a solution for approach a) since it is less tedious and I don't have to deal with tuning multiple optimizers. However, I am not sure if this is possible. There was a similar question about backpropagating multiple losses, however, I was not able to understand how combining the losses would work for the distinct components.
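The solution you are looking for is likely some form of the following: stack the losses into a single tensor and call backward on it with a gradient argument of the same shape, or simply sum them and call backward on the sum. Note that torch.stack keeps the autograd graph, whereas torch.tensor([...]) builds a new tensor that is detached from it.
y = torch.stack([main_loss, ac1_loss, ac2_loss, ac3_loss])
y.backward(gradient=torch.ones_like(y))
See the PyTorch autograd documentation for confirmation.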
A similar question exists but this one uses a different phrasing and was the question which I found first when hitting the issue. The similar question can be found at Pytorch. Can autograd be used when the final tensor has more than a single value in it?
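For approach a), a minimal sketch of a single module with a single optimizer could look like the following. Everything concrete here is an assumption made for illustration: the Cascade class, the nn.Linear sub-networks, the random inputs, and the placeholder loss expressions stand in for the real DNNs and loss functions.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

class Cascade(nn.Module):
    """Three small sub-networks chained together (hypothetical stand-in)."""
    def __init__(self, dim=4):
        super().__init__()
        self.dnn1 = nn.Linear(dim, dim)
        self.dnn2 = nn.Linear(dim, dim)
        self.dnn3 = nn.Linear(dim, dim)

    def forward(self, x, z1, z2):
        out1 = self.dnn1(x)
        out2 = self.dnn2(out1 + z1)
        out3 = self.dnn3(out2 + z2)
        return out1, out2, out3

model = Cascade()
optimizer = optim.Adam(model.parameters())
x, z1, z2 = torch.randn(8, 4), torch.randn(8, 4), torch.randn(8, 4)

out1, out2, out3 = model(x, z1, z2)
main_loss = out3.pow(2).mean()  # placeholder main loss
ac1_loss = out1.abs().mean()    # placeholder auxiliary loss on dnn1's output
ac2_loss = out2.abs().mean()    # placeholder auxiliary loss on dnn2's output

# torch.stack keeps the autograd graph intact.
losses = torch.stack([main_loss, ac1_loss, ac2_loss])
total = losses.sum()

optimizer.zero_grad()
total.backward()   # one backward pass sends every loss to the layers it depends on
optimizer.step()
grads_present = all(p.grad is not None for p in model.parameters())
```

Because autograd follows the graph, ac1_loss only contributes gradients to dnn1, ac2_loss to dnn1 and dnn2, and main_loss to all three, so no manual routing or per-network optimizer is needed in this sketch. Weighting the terms before summing is a common variation.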

neural network trained with PyTorch outputs the mean value for every input

I am using PyTorch in order to get my neural network to recognize digits from the MNIST database.
import torch
import torchvision
I'd like to implement a very simple design similar to what is shown in 3Blue1Brown's video series about neural networks. The following design in particular achieved an error rate of 1.6%.
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = torch.nn.Linear(784, 800)
        self.layer2 = torch.nn.Linear(800, 10)
    def forward(self, x):
        x = torch.sigmoid(self.layer1(x))
        x = torch.sigmoid(self.layer2(x))
        return x
The data is gathered using torchvision and organised in mini batches containing 32 images each.
batch_size = 32
training_set = torchvision.datasets.MNIST("./", download=True, transform=torchvision.transforms.ToTensor())
training_loader = torch.utils.data.DataLoader(training_set, batch_size=32)
I am using the mean squared error as a loss function and stochastic gradient descent with a learning rate of 0.001 as my optimization algorithm.
net = Net()
loss_function = torch.nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
Finally the network gets trained and saved using the following code:
for images, labels in training_loader:
    optimizer.zero_grad()
    for i in range(batch_size):
        output = net(torch.flatten(images[i]))
        desired_output = torch.tensor([float(j == labels[i]) for j in range(10)])
        loss = loss_function(output, desired_output)
        loss.backward()
    optimizer.step()
torch.save(net.state_dict(), "./trained_net.pth")
However, here are the outputs of some test images:
tensor([0.0978, 0.1225, 0.1018, 0.0961, 0.1022, 0.0885, 0.1007, 0.1077, 0.0994,
0.1081], grad_fn=<SigmoidBackward>)
tensor([0.0978, 0.1180, 0.1001, 0.0929, 0.1006, 0.0893, 0.1010, 0.1051, 0.0978,
0.1067], grad_fn=<SigmoidBackward>)
tensor([0.0981, 0.1227, 0.1018, 0.0970, 0.0979, 0.0908, 0.1001, 0.1092, 0.1011,
0.1088], grad_fn=<SigmoidBackward>)
tensor([0.1061, 0.1149, 0.1037, 0.1001, 0.0957, 0.0919, 0.1044, 0.1022, 0.0997,
0.1052], grad_fn=<SigmoidBackward>)
tensor([0.0996, 0.1137, 0.1005, 0.0947, 0.0977, 0.0916, 0.1048, 0.1109, 0.1013,
0.1085], grad_fn=<SigmoidBackward>)
tensor([0.1008, 0.1154, 0.0986, 0.0996, 0.1031, 0.0952, 0.0995, 0.1063, 0.0982,
0.1094], grad_fn=<SigmoidBackward>)
tensor([0.0972, 0.1235, 0.1013, 0.0984, 0.0974, 0.0907, 0.1032, 0.1075, 0.1001,
0.1080], grad_fn=<SigmoidBackward>)
tensor([0.0929, 0.1258, 0.1016, 0.0978, 0.1006, 0.0889, 0.1001, 0.1068, 0.0986,
0.1024], grad_fn=<SigmoidBackward>)
tensor([0.0982, 0.1207, 0.1040, 0.0990, 0.0999, 0.0910, 0.0980, 0.1051, 0.1039,
0.1078], grad_fn=<SigmoidBackward>)
As you can see the network seems to approach a state where the answer for every input is:
[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
This neural network is not better than just guessing. Where did I go wrong in my design or code?
Here are a few points that would be useful for you:
At first glance, your model is not learning, since your predictions are no better than a random guess. The first step would be to monitor your loss; here you only train for a single epoch. At the very least, you could evaluate your model on unseen data:
import torchvision.transforms as T
from torch.utils.data import DataLoader
validation_set = torchvision.datasets.MNIST('./',
    download=True, train=False, transform=T.ToTensor())
validation_loader = DataLoader(validation_set, batch_size=32)
You are using an MSE loss (the L2-norm) to train a classification task, which is not the right tool for this kind of task. You could instead use the negative log-likelihood. PyTorch offers nn.CrossEntropyLoss, which combines a log-softmax and the negative log-likelihood loss in one module. This change can be implemented with:
loss_function = torch.nn.CrossEntropyLoss()
and by using the right target shapes when applying loss_function (see below). Since the loss function applies a log-softmax itself, you shouldn't have an activation function on your model's output.
You are using sigmoid as an activation function; ReLU generally works better for intermediate non-linearities (see related post). A sigmoid is better suited to a binary classification output. Again, since we are using nn.CrossEntropyLoss, we have to remove the activation after layer2.
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.flatten = torch.nn.Flatten()
        self.layer1 = torch.nn.Linear(784, 800)
        self.layer2 = torch.nn.Linear(800, 10)
    def forward(self, x):
        x = self.flatten(x)
        x = torch.relu(self.layer1(x))
        x = self.layer2(x)  # raw logits; the loss applies log-softmax
        return x
A less crucial point is the fact that you could infer estimations on a whole batch instead of looping through each batch one element at a time. A typical training loop for one epoch would look like:
for images, labels in training_loader:
    optimizer.zero_grad()
    output = net(images)
    loss = loss_function(output, labels)
    loss.backward()
    optimizer.step()
With these kinds of modifications, you can expect a validation accuracy of around 80% after a single epoch.
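To see why the model should output raw scores when a log-softmax plus negative log-likelihood loss is used, here is a small plain-Python sketch of that combination for a single sample. The helper names log_softmax and cross_entropy and the example scores are illustrative assumptions; this is the textbook formula, not PyTorch's implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def cross_entropy(logits, label):
    """Negative log-likelihood of the true class index."""
    return -log_softmax(logits)[label]

# Ten equal scores: the "always answer 0.1" situation from the question.
uniform = cross_entropy([0.0] * 10, label=3)
# A high score on the correct class gives a loss close to zero.
confident = cross_entropy([0.0, 0.0, 0.0, 8.0] + [0.0] * 6, label=3)
```

With ten equal scores the loss is log(10), roughly 2.30, so a training loss that stays near that value is a sign the network is still guessing. Note that the target is a class index rather than a one-hot vector, which is the target shape mentioned above.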

TensorFlow 2.0: Eager execution of training either returns bad results or doesn't learn at all

I am experimenting with TensorFlow 2.0 (alpha). I want to implement a simple feed forward Network with two output nodes for binary classification (it's a 2.0 version of this model).
This is a simplified version of the script. After I defined a simple Sequential() model, I set:
# import layers + dropout & activation
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.activations import elu, softmax
# Neural Network Architecture
n_input = X_train.shape[1]
n_hidden1 = 15
n_hidden2 = 10
n_output = y_train.shape[1]
model = tf.keras.models.Sequential([
    Dense(n_input, input_shape = (n_input,), activation = elu),  # Input layer
    Dense(n_hidden1, activation = elu),  # hidden layer 1
    Dense(n_hidden2, activation = elu),  # hidden layer 2
    Dense(n_output, activation = softmax)  # Output layer
])
# define loss and accuracy
bce_loss = tf.keras.losses.BinaryCrossentropy()
accuracy = tf.keras.metrics.BinaryAccuracy()
# define optimizer
optimizer = tf.optimizers.Adam(learning_rate = 0.001)
# save training progress in lists
loss_history = []
accuracy_history = []
# loop over 1000 epochs
for epoch in range(1000):
    with tf.GradientTape() as tape:
        # take binary cross-entropy (bce_loss)
        current_loss = bce_loss(model(X_train), y_train)
    # Update weights based on the gradient of the loss function
    gradients = tape.gradient(current_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    # save in history vectors
    current_loss = current_loss.numpy()
    accuracy.update_state(model(X_train), y_train)
    current_accuracy = accuracy.result().numpy()
    # print loss and accuracy scores each 100 epochs
    if (epoch+1) % 100 == 0:
        print(str(epoch+1) + '.\tTrain Loss: ' + str(current_loss) + ',\tAccuracy: ' + str(current_accuracy))
print('\nTraining complete.')
Training goes without errors, however strange things happen:
Sometimes, the Network doesn't learn anything. All loss and accuracy scores are constant throughout all the epochs.
Other times, the network is learning, but very, very badly. Accuracy never went beyond 0.4 (while in TensorFlow 1.x I got an effortless 0.95+). Such low performance suggests to me that something went wrong in the training.
Other times, the accuracy is very slowly improving, while the loss remains constant all the time.
What can cause these problems? Please help me understand my mistakes.
After some corrections, I can make the network learn. However, its performance is extremely poor. After 1000 epochs, it reaches about 40% accuracy, which clearly means something is still wrong. Any help is appreciated.
tf.GradientTape records every operation that happens inside its scope.
You don't want the gradient calculation itself recorded on the tape; only the forward computation of the loss belongs inside it.
with tf.GradientTape() as tape:
    # take binary cross-entropy (bce_loss); Keras losses expect (y_true, y_pred)
    current_loss = bce_loss(classification, model(df))
# End of tape scope
# Update weights based on the gradient of the loss function
gradients = tape.gradient(current_loss, model.trainable_variables)
# The tape is now consumed
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
More importantly, I don't see the loop on the training set, therefore I suppose the complete code looks like:
for epoch in range(n_epochs):
    for df, classification in dataset:
        # your code that computes loss and trains
Moreover, the usage of the metric is wrong.
You want to accumulate, that is, update the internal state of the accuracy metric at every training step, and measure the overall accuracy at the end of every epoch.
Thus you have to:
# Measure the accuracy inside the training loop (metrics expect y_true first, then y_pred)
accuracy.update_state(classification, model(df))
And call accuracy.result() only at the end of the epoch, once all the accuracy values have been accumulated in the metric.
Remember to call the .reset_states() method at the end of every epoch to clear the metric's state, resetting it to zero.
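The update / result / reset cycle described above can be sketched without TensorFlow. The RunningAccuracy class below is a hypothetical plain-Python stand-in that only borrows the method names of a Keras metric, and the labels and predictions are made-up data.

```python
class RunningAccuracy:
    """Plain-Python stand-in for a stateful metric (not the Keras class)."""
    def __init__(self):
        self.correct = 0
        self.total = 0

    def update_state(self, y_true, y_pred):
        # accumulate counts; nothing is averaged yet
        self.correct += sum(int(t == p) for t, p in zip(y_true, y_pred))
        self.total += len(y_true)

    def result(self):
        return self.correct / self.total if self.total else 0.0

    def reset_states(self):
        self.correct = 0
        self.total = 0

accuracy = RunningAccuracy()
# Two epochs of hypothetical (labels, predictions) batches.
epochs = [
    [([1, 0], [1, 1]), ([0, 0], [0, 0])],
    [([1, 0], [1, 0]), ([0, 0], [0, 0])],
]
epoch_accuracy = []
for batches in epochs:
    for y_true, y_pred in batches:
        accuracy.update_state(y_true, y_pred)   # every training step
    epoch_accuracy.append(accuracy.result())    # once per epoch
    accuracy.reset_states()                     # start the next epoch from zero
```

Without the reset, the second epoch's value would still include the first epoch's mistakes, which is one way a reported accuracy can lag behind what the model actually does.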

why is my simple feedforward neural network diverging (pytorch)?

I am experimenting with a simple two-layer neural network in PyTorch, feeding in only three inputs of size 10 each, with a single value as output. I have normalised the inputs and lowered the learning rate. It is my understanding that a two-layer fully connected neural network should be able to fit this data trivially:
0.8138 1.2342 0.4419 0.8273 0.0728 2.4576 0.3800 0.0512 0.6872 0.5201
1.5666 1.3955 1.0436 0.1602 0.1688 0.2074 0.8810 0.9155 0.9641 1.3668
1.7091 0.9091 0.5058 0.6149 0.3669 0.1365 0.3442 0.9482 1.2550 1.6950
[torch.FloatTensor of size 3x10]
[124, 125, 122]
[torch.FloatTensor of size 3]
The code is adapted from a simple example and I am using MSELoss as the loss function. The loss diverges to infinity after just a few iterations:
features = torch.from_numpy(np.array(features))
x_data = Variable(torch.Tensor(features))
y_data = Variable(torch.Tensor(targets))
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear = torch.nn.Linear(10,5)
        self.linear2 = torch.nn.Linear(5,1)
    def forward(self, x):
        l_out1 = self.linear(x)
        y_pred = self.linear2(l_out1)
        return y_pred
model = Model()
criterion = torch.nn.MSELoss(size_average = False)
optim = torch.optim.SGD(model.parameters(), lr = 0.001)
def main():
    for iteration in range(1000):
        y_pred = model(x_data)
        loss = criterion(y_pred, y_data)
        optim.zero_grad()
        loss.backward()
        optim.step()
Any help would be appreciated. Thanks
Indeed it seems that this was simply due to the learning rate being too high. Setting to 0.00001 fixes convergence issues, albeit giving very slow convergence.
Note also that you're not using a non-linearity between the layers, so your network is still linear overall.
You can use ReLU to make it non-linear, for example by changing the first line of the forward method like this:
l_out1 = torch.nn.functional.relu(self.linear(x))
You could also try predicting log(y) instead of y to improve convergence further. The Adam optimizer (adaptive learning rate) should help as well, as should batch normalization (for example, between your linear layers).
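The learning-rate explanation can be illustrated with a much smaller problem than the one in the question. The sketch below is an assumption-laden toy: a one-weight model y = w * x, a summed squared error, and made-up data with targets of a similar magnitude to those in the question. For this toy problem gradient descent is stable only when lr is below 1 / sum(x^2), which is about 0.071 here.

```python
def fit(xs, ys, lr, steps=200):
    """Gradient descent on a one-weight model y = w * x with summed squared error."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys))
    return w, loss

xs = [1.0, 2.0, 3.0]
ys = [120.0, 240.0, 360.0]  # exact solution is w = 120

w_small, loss_small_lr = fit(xs, ys, lr=0.01)  # below the stability threshold
w_large, loss_large_lr = fit(xs, ys, lr=0.2)   # above it: each step overshoots further
```

With the smaller rate the weight settles at 120 and the loss goes to essentially zero; with the larger rate the error is multiplied by a factor bigger than one on every step and the loss blows up. Summing the squared errors instead of averaging them, as size_average = False does in the question, makes the gradient larger and therefore lowers the learning rate at which this happens.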
