Custom loss function using multiple indirect values in Keras - python

I am using a Keras neural network inside a system of ODEs. Here is my model:
model = Sequential()
model.add(Dense(10, input_dim=3, activation='relu'))
model.add(Dense(1))
And here is a function that describes my differential equations. That Keras model is used in the calculation of ODEs.
def dxdt_new(t, x, *args):
    N, beta, gamma, delta = args
    deltaInfected = beta * x[0] * x[1] / N
    quarantine = model.predict(np.expand_dims(x[:3], axis=0)) / N
    recoveredQ = delta * x[3]
    recoveredNoQ = gamma * x[1]
    S = -deltaInfected
    I = deltaInfected - recoveredNoQ - quarantine
    R = recoveredNoQ + recoveredQ
    Q = quarantine - recoveredQ
    return [S, I, R, Q]
And I need to use a custom loss function for training. Inside my loss function, I cannot use the values predicted by a neural network since I do not have real data on it. I am trying to use the values that are affected by the predicted value. So I do not use y_true and y_pred.
def my_loss(y_true, y_pred):
    infected = K.constant(INFECTED)
    recovered = K.constant(RECOVERED)
    dead = K.constant(DEAD)
    pred = K.constant(predicted)
    loss = K.sum((K.log(infected) - K.log(pred[1][:] + pred[3][:]))**2)
    loss += K.sum((K.log(recovered + dead) - K.log(pred[2][:]))**2)
    return loss
But when I try to train my neural network, I get the following error:
ValueError: An operation has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
So it seems like this loss function does not work properly. How can I organize my code to get it to work? Is there any other way to construct a loss function?

I cannot use the values predicted by a neural network since I do not have real data on it
For a custom loss function to work with backpropagation, it must be a differentiable function of y_pred (and usually y_true), so that gradients can flow from the loss back to the network weights. In your loss, every value is wrapped in K.constant, so the loss does not depend on the weights at all, which is why Keras reports a None gradient. When you cannot express the loss this way, or when it is non-differentiable, you have to use another algorithm to optimize the weights of your neural network, for example a genetic algorithm or particle swarm optimization.
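As a rough illustration of the gradient-free route, here is a minimal sketch of a mutation-only evolutionary search. It is simpler than a full genetic algorithm or particle swarm optimization, and it is not tied to Keras. The function loss_fn is a hypothetical stand-in for "put these weights into the network, solve the ODEs, and compare with the data"; the names evolve, sigma and population are assumptions made for this example.

```python
import random

def loss_fn(weights):
    # Hypothetical black-box loss with its minimum at (3, -1).
    # In the real problem this would run the ODE solver with the given weights.
    w0, w1 = weights
    return (w0 - 3.0) ** 2 + (w1 + 1.0) ** 2

def evolve(loss, n_weights, generations=200, population=20, sigma=0.3, seed=0):
    """(1 + lambda) evolutionary search: mutate the parent, keep the best child if it improves."""
    rng = random.Random(seed)
    parent = [0.0] * n_weights
    parent_loss = loss(parent)
    for _ in range(generations):
        # Each child is the parent plus Gaussian noise on every weight.
        children = [[w + rng.gauss(0.0, sigma) for w in parent]
                    for _ in range(population)]
        best_child = min(children, key=loss)
        child_loss = loss(best_child)
        if child_loss < parent_loss:
            parent, parent_loss = best_child, child_loss
    return parent, parent_loss

start_loss = loss_fn([0.0, 0.0])
best_weights, best_loss = evolve(loss_fn, n_weights=2)
print(start_loss, best_loss, best_weights)
```

Because the loss is only ever evaluated, never differentiated, it does not matter that the real loss passes through model.predict and an ODE solver. The price is many more loss evaluations than gradient descent would need.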

Related

Backpropagating multiple losses in Pytorch

I am building up a cascade of neural networks and I would like to backpropagate the main loss back to the DNNs and also compute an auxiliary loss back to each DNN.
I am trying to figure out what is the best practice when building such a model and how to make sure that my losses are computed properly. Do I build a single torch.nn.Module and a single optimizer, or do I have to create separate modules and optimizers for each network? Also I am likely to have more than three cascaded DNNs.
Approach a)
import torch
from torch import nn, optim
class MasterNetwork(nn.Module):
    def __init__(self):
        super(MasterNetwork, self).__init__()
        # placeholders: fill each nn.Sequential with the layers of one DNN
        self.dnn1 = nn.Sequential()
        self.dnn2 = nn.Sequential()
        self.dnn3 = nn.Sequential()
    def forward(self, x, z1, z2):
        out1 = self.dnn1(x)
        out2 = self.dnn2(out1 + z1)
        out3 = self.dnn3(out2 + z2)
        return [out1, out2, out3]
def LossFunction(out):
    # do stuff
    return loss # loss is a scalar value
def ac_loss_1_fn(out):
    # do stuff
    return loss # loss is a scalar value
def ac_loss_2_fn(out):
    # do stuff
    return loss # loss is a scalar value
def ac_loss_3_fn(out):
    # do stuff
    return loss # loss is a scalar value
model = MasterNetwork()
optimizer = optim.Adam(model.parameters())
input = torch.tensor() # placeholder: the actual input data goes here
z1 = torch.tensor() # placeholder
z2 = torch.tensor() # placeholder
outputs = model(input, z1, z2)
main_loss = LossFunction(outputs[2])
ac1_loss = ac_loss_1_fn(outputs[0])
ac2_loss = ac_loss_2_fn(outputs[1])
ac3_loss = ac_loss_3_fn(outputs[2])
optimizer.zero_grad()
'''
This is where I am uncertain about how to backpropagate the AC losses for each DNN
in addition to the main loss.
'''
optimizer.step()
Approach b)
This would involve creating an nn.Module class and an optimizer for each DNN and then forwarding the loss to the next DNN.
I would prefer a solution for approach a), since it is less tedious and I don't have to deal with tuning multiple optimizers. However, I am not sure if this is possible. There was a similar question about backpropagating multiple losses, but I was not able to understand how combining the losses would work for the distinct components.
The solution you are looking for is likely some form of the following:
y = torch.stack([main_loss, ac1_loss, ac2_loss, ac3_loss])
y.backward(gradient=torch.ones_like(y))
This is equivalent to calling backward() on the sum of the four losses. Note that torch.stack is needed here: building the vector with torch.tensor([...]) copies the values into a new tensor that is detached from the graph, so no gradients would reach the networks.
See https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients for confirmation.
A similar question exists, but this one uses a different phrasing and was the question I found first when hitting the issue. The similar question is titled "Pytorch. Can autograd be used when the final tensor has more than a single value in it?"
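To see why a combined loss still trains "the distinct components" correctly, the following framework-free sketch may help. It assumes three cascaded scalar "networks" (each is just one weight), two auxiliary losses and one main loss, and uses finite differences in place of autograd. All names here are invented for the illustration. In PyTorch the corresponding step would be a single backward() call on the sum of the scalar losses.

```python
def forward(w, x):
    # Three cascaded scalar "networks": each multiplies its input by one weight.
    out1 = w[0] * x
    out2 = w[1] * out1
    out3 = w[2] * out2
    return out1, out2, out3

def aux1_loss(w, x=2.0):
    return (forward(w, x)[0] - 0.0) ** 2   # depends on w[0] only

def aux2_loss(w, x=2.0):
    return (forward(w, x)[1] - 2.0) ** 2   # depends on w[0] and w[1]

def main_loss(w, x=2.0):
    return (forward(w, x)[2] - 0.0) ** 2   # depends on all three weights

def total_loss(w):
    return main_loss(w) + aux1_loss(w) + aux2_loss(w)

def numeric_grad(f, w, eps=1e-6):
    # Central finite differences, standing in for autograd.
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

w = [0.5, 1.5, -1.0]
g_aux1 = numeric_grad(aux1_loss, w)
g_aux2 = numeric_grad(aux2_loss, w)
g_main = numeric_grad(main_loss, w)
g_total = numeric_grad(total_loss, w)
g_sum = [a + b + c for a, b, c in zip(g_aux1, g_aux2, g_main)]
print(g_aux1, g_total, g_sum)
```

The gradient of the first auxiliary loss is zero for the second and third weights, so adding it to the total only affects the first network. The gradient of the summed loss equals the sum of the individual gradients, which is why one optimizer over all parameters is enough.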

Keras Categorical Cross Entropy

I'm trying to wrap my head around the categorical cross entropy loss. Looking at the implementation of the cross entropy loss in Keras:
# scale preds so that the class probas of each sample sum to 1
output = output / math_ops.reduce_sum(output, axis, True)
# Compute cross entropy from probabilities.
epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
return -math_ops.reduce_sum(target * math_ops.log(output), axis)
I do not see where the delta = output - target is calculated.
See here.
What am I missing?
I think you might be confusing two different concepts / events here.
The categorical cross entropy loss is a measure of the error of your model, as calculated by:
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
    <etc>
This returns one loss value per sample: the negative log of the probability your model assigns to the true class. It measures how far the prediction is from the true label, but it is not the element-wise difference output - target.
The next step after calculating the loss (which is part of the forward propagation phase) is to start backpropagation, i.e. we want to find the influence that each weight/bias matrix has on the loss calculated above, so that we can perform the update step.
The first step is then to calculate dL/dz, i.e. the derivative of the loss function with respect to the linear function (z = Wx + b), which by the chain rule is the product dL/da * da/dz (i.e. the derivative of the loss wrt the activation times the derivative of the activation wrt the linear function).
The link you posted is the derivative of the activation function wrt the linear function. This blog does a decent job of explaining how all the parts fit together. The activation function it uses is a sigmoid, but the way the pieces fit together is the same.
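As a worked check of the chain rule described above, here is a small sketch under two assumptions: the output comes from a softmax over logits z, and the target is a one-hot (or otherwise normalized) vector. Under those assumptions the gradient of the cross entropy with respect to z comes out as softmax(z) - target, which is one common origin of a delta = output - target term in backpropagation code. The function names are made up for this example.

```python
import math

def softmax(z):
    m = max(z)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(z, target):
    # Forward pass only: -sum(target * log(softmax(z)))
    p = softmax(z)
    return -sum(t * math.log(q) for t, q in zip(target, p))

def numeric_grad(z, target, eps=1e-6):
    # Central finite differences of the loss with respect to each logit.
    grads = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads

logits = [2.0, -1.0, 0.5]
target = [0.0, 1.0, 0.0]
analytic = [p - t for p, t in zip(softmax(logits), target)]   # output - target
numeric = numeric_grad(logits, target)
print(analytic)
print(numeric)
```

The forward loss code never computes output - target explicitly. The difference only appears once the loss is differentiated, which in Keras is done by the backend's automatic differentiation rather than by the loss function itself.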

The output of softmax makes the binary cross entropy's output NAN, what should I do?

I have implemented a neural network in Tensorflow where the last layer is a convolution layer. I feed the output of this convolution layer into a softmax activation function, and then feed the result, along with the labels, into a cross-entropy loss function, which is defined below. The problem is that I get NaN as the output of my loss function, and I figured out that this happens because the softmax output contains the value 1. So my question is: what should I do in this case?
My input is a 16 by 16 image where I have 0 and 1 as the values of each pixel (binary classification)
My loss function:
#Loss function
def loss(prediction, label):
    #with tf.variable_scope("Loss") as Loss_scope:
    log_pred = tf.log(prediction, name='Prediction_Log')
    log_pred_2 = tf.log(1-prediction, name='1-Prediction_Log')
    cross_entropy = -tf.multiply(label, log_pred) - tf.multiply((1-label), log_pred_2)
    return cross_entropy
Note that log(0) is undefined (tf.log returns -inf, and multiplying that by a zero label gives NaN), so if ever prediction==0 or prediction==1 you will get a NaN.
In order to get around this it is commonplace to add a very small value epsilon to the value passed to tf.log in any loss function (we also do a similar thing when dividing to avoid dividing by zero). This makes our loss function numerically stable and the epsilon value is small enough to be negligible in terms of any inaccuracy it introduces to our loss.
Perhaps try something like:
#Loss function
def loss(prediction, label):
    #with tf.variable_scope("Loss") as Loss_scope:
    epsilon = tf.constant(0.000001)
    log_pred = tf.log(prediction + epsilon, name='Prediction_Log')
    log_pred_2 = tf.log(1-prediction + epsilon, name='1-Prediction_Log')
    cross_entropy = -tf.multiply(label, log_pred) - tf.multiply((1-label), log_pred_2)
    return cross_entropy
UPDATE:
As jdehesa points out in the comments, though, the 'out of the box' loss functions already handle the numerical stability issue nicely.
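For intuition on how a built-in loss can avoid the problem without an epsilon, here is a plain-Python sketch of the binary case. It assumes a sigmoid output and works from the logit (the value before the activation) rather than from the probability. The stable expression max(x, 0) - x*z + log(1 + exp(-|x|)) is the one given in the documentation of tf.nn.sigmoid_cross_entropy_with_logits; the function names below are invented for this example.

```python
import math

def naive_bce(logit, label):
    # Mirrors the loss in the question: log of the probability and of 1 - probability.
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def stable_bce(logit, label):
    # Algebraically the same loss, rearranged so that log(0) can never occur.
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# For moderate logits both versions agree.
moderate_naive = naive_bce(0.3, 1.0)
moderate_stable = stable_bce(0.3, 1.0)

# For a large logit the sigmoid rounds to exactly 1.0 and the naive version breaks.
try:
    naive_bce(40.0, 0.0)
    naive_failed = False
except ValueError:          # math.log(0.0) raises in plain Python; tf.log would give -inf
    naive_failed = True

large_stable = stable_bce(40.0, 0.0)
print(moderate_naive, moderate_stable, naive_failed, large_stable)
```

The design choice is to never materialize the saturated probability. The loss is computed directly from the logit, so it stays finite and keeps a useful gradient even when the prediction is extremely confident.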

Reconstruction loss on regression type of Variational Autoencoder

I'm currently working on a variation of Variational Autoencoder in a sequential setting, where the task is to fit/recover a sequence of real-valued observation data (hence it is a regression problem).
I have built my model using tf.keras with eager execution enabled, and tensorflow_probability (tfp). Following VAE concept, the generative net emits the distribution parameters of the observation data, which I model as multivariate normal. Therefore the outputs are mean and logvar of the predicted distribution.
Regarding training process, the first component of the loss is reconstruction error. That is the log likelihood of the true observation, given the predicted (parameters) distribution from the generative net. Here, I use tfp.distributions, since it is fast and handy.
However, after training finishes with a considerably low loss value, it turns out that my model does not seem to have learned anything. The predicted values from the model are almost flat across the time dimension (recall that the problem is sequential).
Nevertheless, as a sanity check, when I replace the log likelihood with an MSE loss (which is not justifiable when working with a VAE), the model fits the data very well. So I conclude that there must be something wrong with this log likelihood term. Does anyone have a clue and/or a solution for this?
I have considered replacing the log likelihood with cross-entropy loss, but I think that is not applicable in my case, since my problem is regression and the data can't be normalized into [0,1] range.
I also have tried to implement annealed KL term (i.e. weighing the KL term with constant < 1) when using the log likelihood as the reconstruction loss. But it also didn't work.
Here is my code snippet of the original (using log likelihood as reconstruction error) loss function:
import tensorflow as tf
tfe = tf.contrib.eager
tf.enable_eager_execution()
import tensorflow_probability as tfp
tfd = tfp.distributions
def loss(model, inputs):
    outputs, _ = SSM_model(model, inputs)
    #allocate the corresponding output component
    infer_mean = outputs[:,:,:latent_dim] #mean of latent variable from inference net
    infer_logvar = outputs[:,:,latent_dim : (2 * latent_dim)]
    trans_mean = outputs[:,:,(2 * latent_dim):(3 * latent_dim)] #mean of latent variable from transition net
    trans_logvar = outputs[:,:, (3 * latent_dim):(4 * latent_dim)]
    obs_mean = outputs[:,:,(4 * latent_dim):((4 * latent_dim) + output_obs_dim)] #mean of observation from generative net
    obs_logvar = outputs[:,:,((4 * latent_dim) + output_obs_dim):]
    target = inputs[:,:,2:4]
    #transform logvar to std
    infer_std = tf.sqrt(tf.exp(infer_logvar))
    trans_std = tf.sqrt(tf.exp(trans_logvar))
    obs_std = tf.sqrt(tf.exp(obs_logvar))
    #computing loss at each time step
    time_step_loss = []
    for i in range(tf.shape(outputs)[0].numpy()):
        #distribution of each module
        infer_dist = tfd.MultivariateNormalDiag(infer_mean[i],infer_std[i])
        trans_dist = tfd.MultivariateNormalDiag(trans_mean[i],trans_std[i])
        obs_dist = tfd.MultivariateNormalDiag(obs_mean[i],obs_std[i])
        #log likelihood of observation
        likelihood = obs_dist.prob(target[i]) #shape = 1D = batch_size
        likelihood = tf.clip_by_value(likelihood, 1e-37, 1)
        log_likelihood = tf.log(likelihood)
        #KL of (q|p)
        kl = tfd.kl_divergence(infer_dist, trans_dist) #shape = batch_size
        #the loss
        loss = - log_likelihood + kl
        time_step_loss.append(loss)
    time_step_loss = tf.convert_to_tensor(time_step_loss)
    overall_loss = tf.reduce_sum(time_step_loss)
    overall_loss = tf.cast(overall_loss, dtype='float32')
    return overall_loss
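One thing that may be worth checking in the snippet above is the prob, clip, log sequence. The sketch below is a plain-Python illustration, not a diagnosis of the actual model: it assumes a diagonal Gaussian parameterized by mean and logvar, and compares the log likelihood computed directly in log space with the value obtained by exponentiating, clipping at 1e-37 and taking the log again. The helper name diag_gaussian_log_prob is invented here; in tensorflow_probability the distribution's log_prob method plays that role.

```python
import math

def diag_gaussian_log_prob(x, mean, logvar):
    # Sum over dimensions of the log density of independent normal distributions.
    total = 0.0
    for xi, mi, lv in zip(x, mean, logvar):
        total += -0.5 * (math.log(2.0 * math.pi) + lv + (xi - mi) ** 2 / math.exp(lv))
    return total

# A target that the current prediction fits badly (30 standard deviations away).
target = [30.0, -30.0]
mean = [0.0, 0.0]
logvar = [0.0, 0.0]

log_p = diag_gaussian_log_prob(target, mean, logvar)   # computed in log space
prob = math.exp(log_p)                                  # underflows to 0.0
clipped_log_p = math.log(min(max(prob, 1e-37), 1.0))    # what prob -> clip -> log yields

print(log_p, prob, clipped_log_p)
```

In this example the density underflows to zero, so the clipped version reports the same constant for every badly fitted point, and a constant carries no gradient to pull the prediction toward the target. Note also that a continuous density can exceed 1 when the predicted variance is small, so clipping at 1 can distort well fitted points too.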

keras combining two losses with adjustable weights where the outputs do not have the same dimensionality

My question is similar to the one posed here:
keras combining two losses with adjustable weights
However, my outputs have different dimensionality, so they cannot be concatenated and that solution is not applicable. Is there another way to solve this problem?
The question:
I have a keras functional model with two layers with outputs x1 and x2.
x1 = Dense(1,activation='relu')(prev_inp1)
x2 = Dense(2,activation='relu')(prev_inp2)
I need to use x1 and x2 in a weighted loss function like the one in the attached image, and propagate the 'same loss' into both branches. Alpha should be free to vary with the iterations.
For this question, a more elaborate solution is necessary. Since we're going to use a trainable weight, we will need a custom layer.
Also, we will be needing a different form of training, since our loss doesn't work like the others taking only y_true and y_pred and considers joining two different outputs.
Thus, we're going to create two versions of the same model, one for prediction, another for training, and the training version will contain the loss in itself, using a dummy keras loss function in compilation.
The prediction model
Let's use a very basic example of model with two outputs and one input:
#any input your true model takes
inp = Input((5,5,2))
#represents the localization output
outImg = Conv2D(1,3,activation='sigmoid')(inp)
#represents the classification output
outClass = Flatten()(inp)
outClass = Dense(2,activation='sigmoid')(outClass)
#the model
predictionModel = Model(inp, [outImg,outClass])
You use this one regularly for predictions. It's not necessary to compile this one.
The losses for each branch
Now, let's create custom loss functions for each branch, one for LossCls and another for LossLoc.
Using dummy examples here, you can elaborate these losses better if necessary. The most important is that they output batches shaped like (batch, 1) or (batch,). Both output the same shape so they can be summed later.
def calcImgLoss(x):
    true,pred = x
    loss = binary_crossentropy(true,pred)
    return K.mean(loss, axis=[1,2])
def calcClassLoss(x):
    true,pred = x
    return binary_crossentropy(true,pred)
These will be used in Lambda layers in the training model.
The loss weighting layer - (WARNING! EDITED! - See explanation at the end)
Now, let's weight the losses with the trainable alpha. Trainable parameters need custom layers to be implemented.
class LossWeighter(Layer):
    def __init__(self, **kwargs): #kwargs can have 'name' and other things
        super(LossWeighter, self).__init__(**kwargs)
    #create the trainable weight here, notice the constraint between 0 and 1
    def build(self, inputShape):
        self.weight = self.add_weight(name='loss_weight',
                                      shape=(1,),
                                      initializer=Constant(0.5),
                                      constraint=Between(0,1),
                                      trainable=True)
        super(LossWeighter,self).build(inputShape)
    def call(self,inputs):
        #old answer: will always tend to completely ignore the biggest loss
        #return (self.weight * firstLoss) + ((1-self.weight)*secondLoss)
        #problem: alpha tends to 0 or 1, eliminating the biggest of the two losses
        #proposal of working alpha optimization
        #return K.square((self.weight * firstLoss) - ((1-self.weight)*secondLoss))
        #problem: might not train any of the losses, and even increase one of them
        #in order to minimize the difference between the two losses
        #new answer - a mix between the two, applying gradients to the right weights
        loss1, loss2 = inputs #trainable
        static_loss1 = K.stop_gradient(loss1) #non_trainable
        static_loss2 = K.stop_gradient(loss2) #non_trainable
        a1 = self.weight #trainable
        a2 = 1 - a1 #trainable
        static_a1 = K.stop_gradient(a1) #non_trainable
        static_a2 = 1 - static_a1 #non_trainable
        #this trains only alpha to minimize the difference between both losses
        alpha_loss = K.square((a1 * static_loss1) - (a2 * static_loss2))
        #or K.abs (.....)
        #this trains only the original model weights to minimize both original losses
        model_loss = (static_a1 * loss1) + (static_a2 * loss2)
        return alpha_loss + model_loss
    def compute_output_shape(self,inputShape):
        return inputShape[0]
Notice that there is a custom constraint to keep this weight between 0 and 1. This constraint is implemented with:
class Between(Constraint):
    def __init__(self,min_value,max_value):
        self.min_value = min_value
        self.max_value = max_value
    def __call__(self,w):
        return K.clip(w,self.min_value, self.max_value)
    def get_config(self):
        return {'min_value': self.min_value,
                'max_value': self.max_value}
The training model
This model will take the prediction model as base, add the loss calculations and loss weighter at the end and output only the loss value. Because it outputs only a loss, we will use the true targets as inputs, and a dummy loss function defined like:
def ignoreLoss(true,pred):
    return pred #this just tries to minimize the prediction without any extra computation
Model inputs:
#true targets
trueImg = Input((3,3,1))
trueClass = Input((2,))
#predictions from the prediction model
predImg = predictionModel.outputs[0]
predClass = predictionModel.outputs[1]
Model outputs = losses:
imageLoss = Lambda(calcImgLoss, name='loss_loc')([trueImg, predImg])
classLoss = Lambda(calcClassLoss, name='loss_cls')([trueClass, predClass])
weightedLoss = LossWeighter(name='weighted_loss')([imageLoss,classLoss])
Model:
trainingModel = Model([predictionModel.input, trueImg, trueClass], weightedLoss)
trainingModel.compile(optimizer='sgd', loss=ignoreLoss)
Dummy training
inputImages = np.zeros((7,5,5,2))
outputImages = np.ones((7,3,3,1))
outputClasses = np.ones((7,2))
dummyOut = np.zeros((7,))
trainingModel.fit([inputImages,outputImages,outputClasses], dummyOut, epochs = 50)
predictionModel.predict(inputImages)
Necessary imports
import numpy as np
from keras import backend as K
from keras.layers import *
from keras.models import Model
from keras.constraints import Constraint
from keras.initializers import Constant
from keras.losses import binary_crossentropy #or another you need
(EDIT) Explaining the problem with the old answer:
The formula used in the old answer would make alpha always go to 0 or 1, meaning only the smaller of the two losses would ever be trained. (Useless)
The second formula leads alpha to make both weighted losses have the same value. Alpha would be trained properly and not tend to 0 or 1. But the losses would still not be properly trained, because "increasing one loss to reach the other" would be a possibility for the model, and once both losses were equal, the model would stop training.
The new solution is a mix of the two proposals above: the first actually trains the losses but with the wrong alpha, and the second trains alpha but with the wrong losses. The mixed solution adds both, but uses K.stop_gradient to prevent the wrong part of the training from happening.
The result will be that the "easiest" loss (not the biggest) is trained more than the hardest. We may use K.abs or K.square, analogous to "mae" or "mse" between the two losses. Which works best is a matter of experiment.
See this table comparing the old and new proposals:
This does not guarantee the best optimization though!!!
Training the easiest loss will not always give the best result, though. It may be better than favoring a huge loss just because its formula is different. But the expected result might still need some manual weighting of the losses.
I fear there is no automatic training for this weight. If you have a target metric, you can try to train on that metric directly when possible, but metrics that depend on sorting, getting an index, rounding or anything else that breaks backpropagation may not be convertible into losses.
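The behaviour of alpha described in the edit notes above can be checked with plain numbers. The sketch below holds the two losses fixed at the made-up values 4.0 and 1.0 (mimicking what K.stop_gradient does for the alpha term) and runs gradient descent on alpha alone, with the same clipping to [0, 1] as the Between constraint. The names train_alpha, old_grad and new_grad are assumptions for this example.

```python
def clip(value, low=0.0, high=1.0):
    # Same role as the Between constraint: keep alpha inside [0, 1].
    return max(low, min(high, value))

def train_alpha(grad_fn, alpha=0.5, lr=0.01, steps=200):
    # Plain gradient descent on alpha with the losses treated as constants.
    for _ in range(steps):
        alpha = clip(alpha - lr * grad_fn(alpha))
    return alpha

loss1, loss2 = 4.0, 1.0   # hypothetical fixed loss values

def old_grad(a):
    # d/da of a*loss1 + (1 - a)*loss2: constant sign, so alpha runs to a bound.
    return loss1 - loss2

def new_grad(a):
    # d/da of (a*loss1 - (1 - a)*loss2)**2: zero when the weighted losses are equal.
    return 2.0 * (a * loss1 - (1.0 - a) * loss2) * (loss1 + loss2)

alpha_old = train_alpha(old_grad)
alpha_new = train_alpha(new_grad)
print(alpha_old, alpha_new, alpha_new * loss1, (1.0 - alpha_new) * loss2)
```

With the old formula alpha collapses to 0, which removes the larger loss from the objective. With the squared-difference term alpha settles at loss2 / (loss1 + loss2), so the larger loss receives the smaller weight, which matches the remark that the "easiest" loss ends up trained more.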
There is no need to concatenate your outputs. To pass multiple arguments to a loss function, you can wrap it as follows:
def custom_loss(x1, x2, y1, y2, alpha):
    def loss(y_true, y_pred):
        return (1-alpha) * loss_cls(y1, x1) + alpha * loss_loc(y2, x2)
    return loss
And then compile your functional model as:
x1 = Dense(1, activation='relu')(prev_inp1)
x2 = Dense(2, activation='relu')(prev_inp2)
y1 = Input((1,))
y2 = Input((2,))
model.compile('sgd',
              loss=custom_loss(x1, x2, y1, y2, 0.5),
              target_tensors=[y1, y2])
NOTE: Not tested.
