TF 2.2: How to compute custom metric when using MirroredStrategy

The tf.keras.Model I am training has the following primary performance indicators:
escape rate: (#samples with predicted label 0 AND true label 1) / (#samples with true label 1)
false call rate: (#samples with predicted label 1 AND true label 0) / (#samples with true label 0)
The targeted escape rate is predefined, which means the decision threshold will have to be set appropriately. To calculate the resulting false call rate, I would like to implement a custom metric somewhere along the lines of the following pseudo code:
# separate predicted probabilities by their true label
all_ok_probabilities = all_probabilities.filter(true_label == 0)
all_nok_probabilities = all_probabilities.filter(true_label == 1)
# sort NOK samples
sorted_nok_probabilities = all_nok_probabilities.sort(ascending)
# determine decision threshold
threshold_idx = round(target_escape_rate * num_nok_samples) - 1
threshold = sorted_nok_probabilities[threshold_idx]
# calculate false call rate
false_calls = count(all_ok_probabilities > threshold)
false_call_rate = false_calls / num_ok_samples
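The pseudo code above can be turned into a small runnable sketch. This version is plain Python and makes a few assumptions that are not in the question: the function name false_call_rate_at_escape_rate is made up, a sample counts as NOK when its probability is >= the threshold, and the number of allowed escapes is rounded down rather than rounded to nearest.

```python
def false_call_rate_at_escape_rate(y_true, p_nok, target_escape_rate):
    """Return (false_call_rate, threshold) for a target escape rate.

    y_true: labels, 0 = OK, 1 = NOK.
    p_nok: predicted probability of the NOK class for each sample.
    A sample is called NOK when its probability is >= threshold (assumption).
    """
    # separate predicted probabilities by their true label
    ok_probs = [p for t, p in zip(y_true, p_nok) if t == 0]
    nok_probs = sorted(p for t, p in zip(y_true, p_nok) if t == 1)
    if not ok_probs or not nok_probs:
        raise ValueError("need at least one OK and one NOK sample")
    # number of NOK samples that may fall below the threshold (escapes)
    n_escapes = int(target_escape_rate * len(nok_probs))
    n_escapes = min(n_escapes, len(nok_probs) - 1)
    threshold = nok_probs[n_escapes]
    # OK samples at or above the threshold are false calls
    false_calls = sum(1 for p in ok_probs if p >= threshold)
    return false_calls / len(ok_probs), threshold


# 4 OK samples followed by 8 NOK samples
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
p_nok = [0.05, 0.15, 0.35, 0.9, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
fcr, threshold = false_call_rate_at_escape_rate(y_true, p_nok, 0.25)
print(fcr, threshold)
```

With a target escape rate of 0.25 and 8 NOK samples, two NOK samples may escape, so the threshold lands on the third-smallest NOK probability. The function needs all labels and predictions at once, which is exactly what makes it awkward as a per-replica metric.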
My issue is that, in a MirroredStrategy environment, tf.keras automatically distributes metric calculation across all replicas, each of them getting (batch_size / n_replicas) samples per update, and finally sums the results. My algorithm however only works correctly if ALL labels & predictions are combined (final summing could probably be overcome by dividing by the number of replicas).
My idea is to concatenate all y_true and y_pred into sequences in my metric's update_state() method and to run the evaluation in result(). The first step already seems impossible, however: tf.Variable only provides suitable aggregation methods for numeric scalars, not for sequences. tf.VariableAggregation.ONLY_FIRST_REPLICA makes me lose all data from the 2nd to the nth replica, SUM silently locks up the fit() call, and MEAN does not make any sense in my application (and might hang just as well).
I already tried to instantiate the metric outside of the MirroredStrategy scope, but tf.keras.Model.compile() does not accept that.
Any hints/ideas?
P.S.: Let me know if you need a minimal code example, I am working on it. :)

Solved myself by implementing it as callback instead of metric. I run fit() without "validation_data" and instead have all validation set metrics calculated in the callback. This avoids two redundant validation set predictions.
In order to inject the resulting metric values back into the training procedure, I used the rather hackish approach from the question "Access variables of caller function in Python".
import inspect
import numpy as np
import tensorflow as tf
class ValidationCallback(tf.keras.callbacks.Callback):
    """helper class to calculate validation set metrics after each epoch"""
    def __init__(self, val_data, escape_rate, **kwargs):
        # call parent constructor
        super(ValidationCallback, self).__init__(**kwargs)
        # save parameters
        self.val_data = val_data
        self.escape_rate = escape_rate
    def on_epoch_end(self, epoch, logs=None):
        # initialize empty arrays
        y_pred = np.empty((0, 2))
        y_true = np.empty(0)
        # iterate over validation set batches
        for batch in self.val_data:
            # concat all batch labels & predictions
            # need to do predict()[0] due to several model outputs
            y_pred = np.concatenate([y_pred, self.model.predict(batch[0])[0]], axis=0)
            y_true = np.concatenate([y_true, batch[1]], axis=0)
        # calculate classical accuracy for threshold 0.5
        acc = ((y_pred[:, 1] >= 0.5) == y_true).sum() / y_true.shape[0]
        # calculate cross-entropy loss: sum over all samples,
        # then divide by the number of samples to get the mean
        cce = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.SUM)
        loss = cce(y_true, y_pred).numpy() / y_true.shape[0]
        # calculate false call rate
        y_pred_nok = np.sort(y_pred[y_true == 1, 1])
        idx = int(np.round(self.escape_rate * y_pred_nok.shape[0]))
        threshold = y_pred_nok[idx]
        false_calls = y_pred[(y_true == 0) & (y_pred[:, 1] >= threshold), 1].shape[0]
        fcr = false_calls / y_true[y_true == 0].shape[0]
        # add metrics to 'logs' dict of our caller (tf.keras.callbacks.CallbackList.on_epoch_end()),
        # so that they become available to following callbacks
        for f in inspect.stack():
            if 'logs' in f[0].f_locals:
                f[0].f_locals['logs'].update({'val_accuracy': acc,
                                              'val_loss': loss,
                                              'val_false_call_rate': fcr})
                return
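One detail of the stack loop is worth checking: inspect.stack() lists the current frame first, so inside on_epoch_end the first frame that has a local named logs is on_epoch_end itself, because logs is one of its parameters. Under that reading, the loop ends up updating the dict the callback received, the same as calling logs.update(...) directly. The sketch below is a plain-Python check of this with hypothetical stand-in functions (fake_callback_list, on_epoch_end_with_stack and on_epoch_end_direct are not Keras APIs). Whether the updated dict reaches later callbacks still depends on how the Keras version in use passes logs around.

```python
import inspect


def on_epoch_end_with_stack(logs=None):
    """Same loop as in the callback; returns the name of the frame it updated."""
    for frame_info in inspect.stack():
        if 'logs' in frame_info.frame.f_locals:
            frame_info.frame.f_locals['logs'].update({'val_false_call_rate': 0.25})
            return frame_info.function
    return None


def on_epoch_end_direct(logs=None):
    """Mutates the received dict without any stack inspection."""
    if logs is not None:
        logs.update({'val_false_call_rate': 0.25})
    return None


def fake_callback_list(hook):
    """Hypothetical stand-in for the caller that owns the logs dict."""
    logs = {'loss': 0.1}
    updated_frame = hook(logs)
    return logs, updated_frame


logs_stack, updated_frame = fake_callback_list(on_epoch_end_with_stack)
logs_direct, _ = fake_callback_list(on_epoch_end_direct)
print(updated_frame)
print(logs_stack == logs_direct)
```

Both variants leave the caller's dict in the same state, because the dict object is shared between the caller and the hook.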

Related

Polynomial regression exploding gradients and underfitting

I wanted to code my implementation of polynomial regression, but my model's gradients either exploded or my model didn't fit the data well enough.
For testing purposes, my dataset is just the function x^2 and my model is a second-degree polynomial ax^2 + bx + c. I trained it for 50 epochs using batch gradient descent.
I noticed that the model explodes with a learning rate >= 0.001 and underfits with a learning rate <= 0.0001.
To visualize the model, at the end of each epoch, I plot the model's predictions with the labels. So, in the ideal case, these lines should be indistinguishable.
The orange line is the labels and the blue one is the model's predictions.
Here is the model exploding: [plot not included]
And here it underfits: [plot not included]
One interesting thing is, that even though the model's predictions are way too big, the line still resembles the correct polynomial. And the picture where the predictions go into negatives is also correct, but just flipped/mirrored.
I made the code in python. This is my main.py:
from decimal import Decimal
from matplotlib.pyplot import plot, draw, pause, clf
from model import PolynomialRegression
POLYNOMIAL_FUNCTION = [0, 1, 2]
LEARNING_RATE = Decimal(0.0001)
DATASET = [0, 1, 2, 3, 4, 5, 6, 7]
LABELSET = [0, 1, 4, 9, 16, 32, 64, 128]
EPOCHS = 50
model = PolynomialRegression(POLYNOMIAL_FUNCTION, LEARNING_RATE)
for _ in range(EPOCHS):
    for data, label in zip(DATASET, LABELSET):
        # train the model
        model.train(data, label)
    # update the model
    model.update()
    # predict the dataset
    predictions = [model.predict(data) for data in DATASET]
    # plot predictions and labels
    plot(predictions)
    plot(LABELSET)
    draw()
    pause(0.1)
    clf()
    print(model.parameters)
    # erase the stored gradients
    model.clear_grad()
And this is my model.py:
from decimal import Decimal
class PolynomialRegression:
    """
    Polynomial regression model.
    """
    def __init__(self, polynomial_function: list, learning_rate: Decimal) -> None:
        # the structure of the polynomial function (the exponents)
        self.polynomial_function = polynomial_function
        # parameters of the model set to be 1
        self.parameters = [Decimal(1)] * len(polynomial_function)
        self.learning_rate = learning_rate
        # stored gradients to update the model
        self.gradients = []
    def predict(self, x: Decimal) -> Decimal:
        """
        Make a prediction based on the input.
        Args:
            x (Decimal): Input to the model.
        Returns:
            Decimal: A prediction.
        """
        y = Decimal(0)
        # go through each parameter and exponent
        for param, exponent in zip(self.parameters, self.polynomial_function):
            # compute a term and add it to the final output
            y += param * (x ** exponent)
        return y
    def train(self, x: Decimal, y: Decimal) -> Decimal:
        """
        Compute a gradient from a given input and target output.
        Args:
            x (Decimal): Input for the model.
            y (Decimal): Target/Desired output.
        Returns:
            Decimal: An MSE loss.
        """
        prediction = self.predict(x)
        error = prediction - y
        loss = error ** 2
        gradient = []
        # go through each parameter and exponent
        for param, exponent in zip(self.parameters, self.polynomial_function):
            # compute the gradient for a single parameter
            param_gradient = error * (x ** exponent) * self.learning_rate
            # add the parameter gradient to the gradient list
            gradient.append(param_gradient)
        # add the gradient to a list
        self.gradients.append(gradient)
        return loss
    def __sum_gradients(self) -> list:
        """
        Return a sum of gradients along the 0 axis.
        (equivalent of numpy.sum(x, axis=0))
        Returns:
            list: List of summed Decimals.
        """
        result = [Decimal(0)] * len(self.parameters)
        # iterate through the y axis
        for gradient in self.gradients:
            # iterate through the x axis
            for i, param_gradient in enumerate(gradient):
                result[i] += param_gradient
        return result
    def update(self) -> None:
        """
        Update the model's parameters based on the stored gradients.
        """
        summed_gradients = self.__sum_gradients()
        # fraction used to calculate the average for every gradient
        averaging_fraction = Decimal(1) / len(self.gradients)
        for param_index, grad in enumerate(summed_gradients):
            self.parameters[param_index] -= averaging_fraction * grad
    def clear_grad(self) -> None:
        """
        Clear/Reset the stored gradients.
        """
        self.gradients = []
I think the problem lies somewhere in my gradient descent calculations, but it may also be something unexpected and silly.

First, your dataset consists of only 8 data points. That is too few to generalize a model, which means that you are probably overfitting.
The second thing I see is that you do not normalize the x data. The model is not very complex, so I guess it doesn't really matter in that context. But if you had a more complex model with n features, where one feature is very small and one is very big, the feature with the bigger values would influence the result much more than the smaller one, which might result in a badly performing model.
But your last plot doesn't look like it's underfitting to me. You have to realize that an ML model will always have an error. In my opinion, for 8 data points, a model with only one layer, and 50 epochs, that looks fine. You could probably improve the results by training longer, but that would mean overfitting the model even more. To be honest, if your goal is to emulate a mathematical function with ML, this should be okay. You could also add a new layer.
The fact that your learning rate has to be that small to not wreck the results tells me that you are correct: there is something wrong with the gradient descent, and you might want to look into this behavior.
An easy way to evaluate this is to build your model in PyTorch and then use your optimizer to update the weights. If you get the same problem, it was your gradient descent; if not, the problem lies somewhere else. But I strongly believe it is your gradient descent. Maybe debug into this function and look at the actual values you are subtracting.
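One way to look at the actual values is a small plain-Python experiment that runs the same kind of full-batch gradient descent on raw and on rescaled inputs. This is only a sketch under stated assumptions: it uses float arithmetic instead of Decimal, exact squares as labels (the LABELSET in the question follows x^2 only up to x = 4 and continues with 32, 64, 128), a learning rate of 0.1 and 20 epochs chosen for illustration, and helper names (features, mse, batch_gradient_descent) that are made up here.

```python
def features(x):
    """Polynomial features [1, x, x^2] for a second-degree model."""
    return [1.0, x, x * x]


def mse(params, xs, ys):
    """Mean squared error of the polynomial on the data set."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(p * f for p, f in zip(params, features(x)))
        total += (pred - y) ** 2
    return total / len(xs)


def batch_gradient_descent(xs, ys, learning_rate, epochs):
    """Full-batch gradient descent on the MSE, starting from all-ones parameters."""
    params = [1.0, 1.0, 1.0]
    for _ in range(epochs):
        grads = [0.0, 0.0, 0.0]
        for x, y in zip(xs, ys):
            feats = features(x)
            error = sum(p * f for p, f in zip(params, feats)) - y
            for i, f in enumerate(feats):
                # average gradient of the squared error over the batch
                grads[i] += 2.0 * error * f / len(xs)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params


raw_xs = [float(x) for x in range(8)]
ys = [x * x for x in raw_xs]            # exact squares (assumption)
scaled_xs = [x / 7.0 for x in raw_xs]   # inputs rescaled to [0, 1]

lr, epochs = 0.1, 20
start = [1.0, 1.0, 1.0]
raw_start = mse(start, raw_xs, ys)
raw_end = mse(batch_gradient_descent(raw_xs, ys, lr, epochs), raw_xs, ys)
scaled_start = mse(start, scaled_xs, ys)
scaled_end = mse(batch_gradient_descent(scaled_xs, ys, lr, epochs), scaled_xs, ys)
print("raw inputs:   ", raw_start, "->", raw_end)
print("scaled inputs:", scaled_start, "->", scaled_end)
```

With raw inputs the x^2 feature reaches 49, so the x^4 terms in the gradient dominate and the same step size that converges on rescaled inputs makes the loss grow. That suggests the usable learning rate here is tied to the input scale, not necessarily to a bug in the update rule.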

How to customize an LSTM loss function to only consider a given index range of prediction and target sequence?

I am currently working with an LSTM sequence to sequence model for time domain signal predictions. Because of domain knowledge, I know that the first part of the prediction (about 20%) can never be predicted correctly, since the information required is not available in the given input sequence. The remaining 80% of the predicted sequence are usually predicted quite well. In order to exclude the first 20% from the training optimization, it would be nice to define a loss function that would basically operate on a given index range like the numpy code below:
start = int(0.2*sequence_length)
stop = sequence_length
def mse(pred, target):
    """ Mean squared error between two time series np.arrays """
    return 1/target.shape[0]*np.sum((pred-target)**2)
def range_mse_loss(y_pred, y):
    return mse(y_pred[start:stop],y[start:stop])
How do I have to write this loss function in order to have it work with my preexisting keras code, where loss is simply given by model.compile(loss='mse') ?

You can slice your tensors to just the last 80% of the data.
size = int(y_true.shape[0] * 0.8) # for 2D vector, e.g., (100, 1)
loss_fn = tf.keras.losses.MeanSquaredError(name='mse')
loss_fn(y_true[-size:], y_pred[-size:])
You can also use the sample_weight argument when calling tf.keras.losses.MeanSquaredError(), passing an array of weights in which the first 20% of the weights are zero:
size = int(y_true.shape[0] * 0.8) # for 2D vector, e.g., (100, 1)
zeros = tf.zeros((y_true.shape[0] - size), dtype=tf.int32)
ones = tf.ones((size), dtype=tf.int32)
weights = tf.concat([zeros, ones], 0)
loss_fn = tf.keras.losses.MeanSquaredError(name='mse')
loss_fn(y_true, y_pred, sample_weight=weights)
One caveat about the second solution: its final loss will be lower than that of the first solution, because the first predictions are multiplied by zero but are still counted in n in the formula MSE = 1/n * sum((y - y_hat)^2).
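The size of that difference can be checked with a few lines of plain Python. This is a sketch of the arithmetic only, not of the Keras internals: it assumes the weighted loss is averaged over all n elements, and the helper names sliced_mse and zero_weighted_mse are made up for the example.

```python
def sliced_mse(y_true, y_pred, start):
    """MSE over the elements from index start onward only."""
    errors = [(t - p) ** 2 for t, p in zip(y_true[start:], y_pred[start:])]
    return sum(errors) / len(errors)


def zero_weighted_mse(y_true, y_pred, start):
    """MSE with zero weight on the first start elements, averaged over all n."""
    weights = [0.0] * start + [1.0] * (len(y_true) - start)
    errors = [w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred)]
    return sum(errors) / len(errors)


y_true = [float(i) for i in range(10)]
y_pred = [v + 1.0 for v in y_true]   # every step is off by exactly 1
start = 2                            # skip the first 20%

sliced = sliced_mse(y_true, y_pred, start)
weighted = zero_weighted_mse(y_true, y_pred, start)
print(sliced, weighted)
```

The weighted value is the sliced value times 0.8, the fraction of elements that keep a nonzero weight. Gradients are scaled by the same constant factor, so the two variants mainly differ in the reported loss value and the effective learning rate.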

One workaround would be to mark the observations as None/nan and then override the train_step method. Following TensorFlow's tutorial about customizing train_step, you would do something like this:
@tf.function
def train_step(keras_model, data):
    print('custom train_step')
    # Unpack the data. Its structure depends on your model and
    # on what you pass to `fit()`.
    x, y = data
    with tf.GradientTape() as tape:
        y_pred = keras_model(x, training=True) # Forward pass
        # masking nan values in observations, also assuming that targets are >0.0
        mask = tf.greater(y, 0.0)
        true_y = tf.boolean_mask(y, mask)
        pred_y = tf.boolean_mask(y_pred, mask)
        # Compute the loss value
        # (the loss function is configured in `compile()`)
        loss = keras_model.compiled_loss(true_y, pred_y, regularization_losses=keras_model.losses)
    # Compute gradients
    trainable_vars = keras_model.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    # Update weights
    keras_model.optimizer.apply_gradients(zip(gradients, trainable_vars))
    # Update metrics (includes the metric that tracks the loss)
    keras_model.compiled_metrics.update_state(true_y, pred_y)
    # Return a dict mapping metric names to current value
    return {m.name: m.result() for m in keras_model.metrics}
This will work for all the performance metrics you are tracking. An alternative would be to mask the nans inside the loss function, but that would be limited to that one loss function and would not carry over to other loss functions or performance metrics.

Correct way to use custom weight maps in unet architecture

There is a famous trick in u-net architecture to use custom weight maps to increase accuracy. Below are the details of it:
Now, by asking here and in multiple other places, I got to know about 2 approaches. I want to know which one is correct, or whether there is another approach that is more correct.
The first is to use torch.nn.functional in the training loop:
loss = torch.nn.functional.cross_entropy(output, target, w) where w will be the calculated custom weight.
The second is to use reduction='none' when creating the loss function outside the training loop
criterion = torch.nn.CrossEntropyLoss(reduction='none')
and then, in the training loop, to multiply with the custom weight:
gt # Ground truth, format torch.long
pd # Network output
W # per-element weighting based on the distance map from UNet
loss = criterion(pd, gt)
loss = W*loss # Ensure that weights are scaled appropriately
loss = torch.sum(loss.flatten(start_dim=1), dim=1) # Sums the loss per image
loss = torch.mean(loss) # Average across a batch
Now I am confused about which one is right, whether there is another way, or whether both are right.

The weighting portion looks like just simply weighted cross entropy which is performed like this for the number of classes (2 in the example below).
weights = torch.FloatTensor([.3, .7])
loss_func = nn.CrossEntropyLoss(weight=weights)
EDIT:
Have you seen this implementation from Patrick Black?
import torch
import torch.nn.functional as F
# Set properties
batch_size = 10
out_channels = 2
W = 10
H = 10
# Initialize logits etc. with random
logits = torch.FloatTensor(batch_size, out_channels, H, W).normal_()
target = torch.LongTensor(batch_size, H, W).random_(0, out_channels)
weights = torch.FloatTensor(batch_size, 1, H, W).random_(1, 3)
# Calculate log probabilities over the class dimension
logp = F.log_softmax(logits, dim=1)
# Gather log probabilities with respect to target
logp = logp.gather(1, target.view(batch_size, 1, H, W))
# Multiply with weights
weighted_logp = (logp * weights).view(batch_size, -1)
# Rescale so that loss is in approx. same interval
weighted_loss = weighted_logp.sum(1) / weights.view(batch_size, -1).sum(1)
# Average over mini-batch
weighted_loss = -1. * weighted_loss.mean()

Note that torch.nn.CrossEntropyLoss() is a class that calls torch.nn.functional.
See https://pytorch.org/docs/stable/_modules/torch/nn/modules/loss.html#CrossEntropyLoss
You can use the weights when you define the criteria. Comparing them functionally, both methods are the same.
Now, I do not understand your idea of computing the loss inside the training loop in method 1 and outside the training loop in method 2. If you compute the loss outside the loop, how will you backpropagate?
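One distinction may help when comparing the two approaches: in PyTorch, the weight argument of cross_entropy is a per-class weight (one entry per class), whereas a U-Net style weight map has one entry per pixel, which is what the reduction='none' variant multiplies in. The plain-Python sketch below illustrates the per-pixel arithmetic on a tiny made-up example. It is not the PyTorch implementation, the function names are hypothetical, and normalizing by the sum of the weights is one possible choice (the same one used in the snippet in the answer above).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one pixel's class scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def pixel_weighted_cross_entropy(logits, targets, weight_map):
    """Cross-entropy with one weight per pixel, normalized by the weight sum.

    logits: list of per-pixel class scores, targets: class index per pixel,
    weight_map: one non-negative weight per pixel.
    """
    weighted = 0.0
    for pixel_logits, target, w in zip(logits, targets, weight_map):
        weighted += -w * log_softmax(pixel_logits)[target]
    return weighted / sum(weight_map)


# three "pixels", two classes
logits = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [0, 0, 1]

uniform = pixel_weighted_cross_entropy(logits, targets, [1.0, 1.0, 1.0])
emphasized = pixel_weighted_cross_entropy(logits, targets, [1.0, 2.0, 1.0])
print(uniform, emphasized)
```

With uniform weights this reduces to the ordinary mean cross-entropy. Raising the weight of the worst-predicted pixel (the second one here) pulls the loss toward that pixel's error.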

keras combining two losses with adjustable weights where the outputs do not have the same dimensionality

My question is similar to the one posed here:
keras combining two losses with adjustable weights
However, the outputs have a different dimensionality resulting in the outputs not being able to be concatenated. Hence, the solution is not applicable, is there another way to solve this problem?
The question:
I have a keras functional model with two layers with outputs x1 and x2.
x1 = Dense(1,activation='relu')(prev_inp1)
x2 = Dense(2,activation='relu')(prev_inp2)
I need to use these x1 and x2 in a weighted loss function like the one in the attached image, propagating the 'same loss' into both branches. Alpha is flexible and can vary with iterations.

For this question, a more elaborated solution is necessary. Since we're going to use a trainable weight, we will need a custom layer.
Also, we will be needing a different form of training, since our loss doesn't work like the others taking only y_true and y_pred and considers joining two different outputs.
Thus, we're going to create two versions of the same model, one for prediction, another for training, and the training version will contain the loss in itself, using a dummy keras loss function in compilation.
The prediction model
Let's use a very basic example of model with two outputs and one input:
#any input your true model takes
inp = Input((5,5,2))
#represents the localization output
outImg = Conv2D(1,3,activation='sigmoid')(inp)
#represents the classification output
outClass = Flatten()(inp)
outClass = Dense(2,activation='sigmoid')(outClass)
#the model
predictionModel = Model(inp, [outImg,outClass])
You use this one regularly for predictions. It's not necessary to compile this one.
The losses for each branch
Now, let's create custom loss functions for each branch, one for LossCls and another for LossLoc.
Using dummy examples here, you can elaborate these losses better if necessary. The most important is that they output batches shaped like (batch, 1) or (batch,). Both output the same shape so they can be summed later.
def calcImgLoss(x):
    true,pred = x
    loss = binary_crossentropy(true,pred)
    return K.mean(loss, axis=[1,2])
def calcClassLoss(x):
    true,pred = x
    return binary_crossentropy(true,pred)
These will be used in Lambda layers in the training model.
The loss weighting layer - (WARNING! EDITED! - See explanation at the end)
Now, let's weight the losses with the trainable alpha. Trainable parameters need custom layers to be implemented.
class LossWeighter(Layer):
    def __init__(self, **kwargs): #kwargs can have 'name' and other things
        super(LossWeighter, self).__init__(**kwargs)
    #create the trainable weight here, notice the constraint between 0 and 1
    def build(self, inputShape):
        self.weight = self.add_weight(name='loss_weight',
                                      shape=(1,),
                                      initializer=Constant(0.5),
                                      constraint=Between(0,1),
                                      trainable=True)
        super(LossWeighter,self).build(inputShape)
    def call(self,inputs):
        #old answer: will always tend to completely ignore the biggest loss
        #return (self.weight * firstLoss) + ((1-self.weight)*secondLoss)
        #problem: alpha tends to 0 or 1, eliminating the biggest of the two losses
        #proposal of working alpha optimization
        #return K.square((self.weight * firstLoss) - ((1-self.weight)*secondLoss))
        #problem: might not train any of the losses, and even increase one of them
        #in order to minimize the difference between the two losses
        #new answer - a mix between the two, applying gradients to the right weights
        loss1, loss2 = inputs #trainable
        static_loss1 = K.stop_gradient(loss1) #non_trainable
        static_loss2 = K.stop_gradient(loss2) #non_trainable
        a1 = self.weight #trainable
        a2 = 1 - a1 #trainable
        static_a1 = K.stop_gradient(a1) #non_trainable
        static_a2 = 1 - static_a1 #non_trainable
        #this trains only alpha to minimize the difference between both losses
        alpha_loss = K.square((a1 * static_loss1) - (a2 * static_loss2))
        #or K.abs (.....)
        #this trains only the original model weights to minimize both original losses
        model_loss = (static_a1 * loss1) + (static_a2 * loss2)
        return alpha_loss + model_loss
    def compute_output_shape(self,inputShape):
        return inputShape[0]
Notice that there is a custom constraint to keep this weight between 0 and 1. This constraint is implemented with:
class Between(Constraint):
    def __init__(self,min_value,max_value):
        self.min_value = min_value
        self.max_value = max_value
    def __call__(self,w):
        return K.clip(w,self.min_value, self.max_value)
    def get_config(self):
        return {'min_value': self.min_value,
                'max_value': self.max_value}
The training model
This model will take the prediction model as base, add the loss calculations and loss weighter at the end and output only the loss value. Because it outputs only a loss, we will use the true targets as inputs, and a dummy loss function defined like:
def ignoreLoss(true,pred):
    return pred #this just tries to minimize the prediction without any extra computation
Model inputs:
#true targets
trueImg = Input((3,3,1))
trueClass = Input((2,))
#predictions from the prediction model
predImg = predictionModel.outputs[0]
predClass = predictionModel.outputs[1]
Model outputs = losses:
imageLoss = Lambda(calcImgLoss, name='loss_loc')([trueImg, predImg])
classLoss = Lambda(calcClassLoss, name='loss_cls')([trueClass, predClass])
weightedLoss = LossWeighter(name='weighted_loss')([imageLoss,classLoss])
Model:
trainingModel = Model([predictionModel.input, trueImg, trueClass], weightedLoss)
trainingModel.compile(optimizer='sgd', loss=ignoreLoss)
Dummy training
inputImages = np.zeros((7,5,5,2))
outputImages = np.ones((7,3,3,1))
outputClasses = np.ones((7,2))
dummyOut = np.zeros((7,))
trainingModel.fit([inputImages,outputImages,outputClasses], dummyOut, epochs = 50)
predictionModel.predict(inputImages)
Necessary imports
import numpy as np
from keras import backend as K
from keras.layers import *
from keras.models import Model
from keras.constraints import Constraint
from keras.initializers import Constant
from keras.losses import binary_crossentropy #or another you need
(EDIT) Explaining the problem with the old answer:
The formula used in the old answer would make alpha always go to 0 or 1, meaning only the smallest of the two losses would be ever trained. (Useless)
A new formula leads alpha to make both losses have the same value. Alpha would be trained properly and not tend to 0 or 1. But, still, the losses would not be properly trained because "increasing one loss to reach the other" would be a possibility for the model, and once both losses were equal, the model would stop training.
The new solution is a mix of the two proposals above, while the first actually trains the losses but with wrong alpha; and the second trains alpha with wrong losses. The mixed solution adds both, but uses K.stop_gradient to prevent the wrong part of the training from happening.
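The behaviour of the two earlier formulas can be checked with a toy calculation. The sketch below is plain Python, not Keras: it holds both losses fixed (a simplifying assumption, since in real training they change with the model weights), writes out the derivative of each formula with respect to alpha by hand, and clips alpha to [0, 1] after every step like the Between constraint. The function names, learning rate and loss values are illustrative choices.

```python
def clip(value, low=0.0, high=1.0):
    """Keep alpha inside [low, high], like the Between constraint."""
    return max(low, min(high, value))


def train_alpha(grad_fn, loss1, loss2, alpha=0.5, lr=0.05, steps=200):
    """Plain gradient descent on alpha with clipping after every step."""
    for _ in range(steps):
        alpha = clip(alpha - lr * grad_fn(alpha, loss1, loss2))
    return alpha


def old_formula_grad(alpha, loss1, loss2):
    # d/d_alpha [alpha * loss1 + (1 - alpha) * loss2]
    return loss1 - loss2


def balancing_grad(alpha, loss1, loss2):
    # d/d_alpha [(alpha * loss1 - (1 - alpha) * loss2) ** 2]
    return 2.0 * (alpha * loss1 - (1.0 - alpha) * loss2) * (loss1 + loss2)


loss1, loss2 = 3.0, 1.0   # fixed toy values; loss1 is the bigger loss
alpha_old = train_alpha(old_formula_grad, loss1, loss2)
alpha_balanced = train_alpha(balancing_grad, loss1, loss2)
print(alpha_old, alpha_balanced)
```

With the old formula the gradient has a constant sign, so alpha runs into the bound that removes the bigger loss. With the squared difference, alpha settles where alpha * loss1 equals (1 - alpha) * loss2, which is loss2 / (loss1 + loss2) for fixed losses.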
The result of this will be: the "easiest" loss (not the biggest) will be more trained than the hardest. We may use K.abs or K.square, as compared to "mae" or "mse" between the two losses. The best option is up to experiment.
See this table comparing the old and new proposals:
This does not guarantee the best optimization though!!!
Training the easiest loss will not always give the best result, though. It may be better than favoring a huge loss just because its formula is different. But the expected result might still need some manual weighting of the losses.
I fear there is no automatic training for this weight. If you have a target metric, you can try to train this metric (when possible, but metrics that depend on sorting, getting an index, rounding or anything that breaks backpropagation may not be possible to be transformed in losses).

There is no need to concatenate your outputs. To pass multiple arguments to a loss function, you can wrap it as follows:
def custom_loss(x1, x2, y1, y2, alpha):
    def loss(y_true, y_pred):
        return (1-alpha) * loss_cls(y1, x1) + alpha * loss_loc(y2, x2)
    return loss
And then compile your functional model as:
x1 = Dense(1, activation='relu')(prev_inp1)
x2 = Dense(2, activation='relu')(prev_inp2)
y1 = Input((1,))
y2 = Input((2,))
model.compile('sgd',
loss=custom_loss(x1, x2, y1, y2, 0.5),
target_tensors=[y1, y2])
NOTE: Not tested.

How to have Keras model do early-stopping in different fit calls

Since the data dimension is big for my task, 32 samples would consume nearly 9% of the memory on the server, of which about 105G is free in total. So I have to make consecutive calls to fit() in a loop, and I also want to do early stopping across these consecutive calls.
However, the callback methods introduced in the Keras documentation only apply to a single fit() call.
How can I do early-stopping in this case?
Following is my code snippet:
for sen_batch, cls_batch in train_data_gen:
    sen_batch = np.array(sen_batch).reshape(-1, WORD_LENGTH, 50, 1)
    cls_batch = np.array(cls_batch)
    model.fit(x = sen_batch,y = cls_batch)
    num_iterations += 1

Use fit_generator: as you have a generator, you could use generator training instead of the classical fit. This method supports callbacks, so you could use keras.callbacks.EarlyStopping.
When you cannot use fit_generator:
So - first of all - you need to use train_on_batch method - as fit call resets many model states (e.g. optimizer states).
train_on_batch method returns a loss value, but it doesn't accept callbacks. So you need to implement early stopping on your own. You can do it e.g. like this:
from six import next
patience = 4
best_loss = 1e6
rounds_without_improvement = 0
for epoch_nb in range(nb_of_epochs):
    losses_list = list()
    for batch in range(nb_of_batches):
        x, y = next(train_data_gen)
        losses_list.append(model.train_on_batch(x, y))
    mean_loss = sum(losses_list) / len(losses_list)
    if mean_loss < best_loss:
        best_loss = mean_loss
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
    if rounds_without_improvement == patience:
        break
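The same bookkeeping can be wrapped in a small helper so that it is reusable across training loops. The class below is a sketch, not a Keras API: the name EarlyStopper and the min_delta parameter are additions for illustration, and the loss values stand in for the per-epoch mean of train_on_batch results.

```python
class EarlyStopper:
    """Tracks a loss across epochs and signals when training should stop."""

    def __init__(self, patience=4, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.rounds_without_improvement = 0

    def should_stop(self, loss):
        """Record one epoch loss; return True once patience is used up."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.rounds_without_improvement = 0
        else:
            self.rounds_without_improvement += 1
        return self.rounds_without_improvement >= self.patience


stopper = EarlyStopper(patience=2)
epoch_losses = [1.0, 0.8, 0.9, 0.85, 0.7]   # stand-in for mean epoch losses
stopped_at = None
for epoch, loss in enumerate(epoch_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_loss)
```

In the loop above, mean_loss would be passed to should_stop() once per epoch. A validation loss is usually a better signal than the training loss if one is available.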
