I'm trying to implement the Multiclass Hybrid loss function in Python from following article https://arxiv.org/pdf/1808.05238.pdf for my semantic segmentation problem using an imbalanced dataset. I managed to get my implementation correct enough to start while training the model, but the results are very poor. Model architecture - U-net, learning rate in Adam optimizer is 1e-5. Mask shape is (None, 512, 512, 3), with 3 classes (in my case forest, deforestation, other). The formula I used to implement my loss:
The code I created:
def build_hybrid_loss(_lambda_=1, _alpha_=0.5, _beta_=0.5, smooth=1e-6):
def hybrid_loss(y_true, y_pred):
C = 3
tversky = 0
# Calculate Tversky Loss
for index in range(C):
inputs_fl = tf.nest.flatten(y_pred[..., index])
targets_fl = tf.nest.flatten(y_true[..., index])
#True Positives, False Positives & False Negatives
TP = tf.reduce_sum(tf.math.multiply(inputs_fl, targets_fl))
FP = tf.reduce_sum(tf.math.multiply(inputs_fl, 1-targets_fl[0]))
FN = tf.reduce_sum(tf.math.multiply(1-inputs_fl[0], targets_fl))
tversky_i = (TP + smooth) / (TP + _alpha_ * FP + _beta_ * FN + smooth)
tversky += tversky_i
tversky += C
# Calculate Focal loss
loss_focal = 0
for index in range(C):
f_loss = - (y_true[..., index] * (1 - y_pred[..., index])**2 * tf.math.log(y_pred[..., index]))
# Average over each data point/image in batch
axis_to_reduce = range(1, 3)
f_loss = tf.math.reduce_mean(f_loss, axis=axis_to_reduce)
loss_focal += f_loss
result = tversky + _lambda_ * loss_focal
return result
return hybrid_loss
The prediction of the model after the end of an epoch (I have a problem with swapped colors, so the red in the prediction is actually green, which means forest, so the prediction is mostly forest and not deforestation):
The question is what is wrong with my hybrid loss implementation, what needs to be changed to make it work?
To simplify things a little, I have divided the Hybrid loss into four separate functions: Tversky's loss, Dice coefficient, Dice loss, Hybrid loss. You can see the code below.
def TverskyLoss(targets, inputs, alpha=0.5, beta=0.5, smooth=1e-16, numLabels=3):
tversky = 0
for index in range(numLabels):
inputs_fl = tf.nest.flatten(inputs[..., index])
targets_fl = tf.nest.flatten(targets[..., index])
#True Positives, False Positives & False Negatives
TP = tf.reduce_sum(tf.math.multiply(inputs_fl, targets_fl))
FP = tf.reduce_sum(tf.math.multiply(inputs_fl, 1-targets_fl[0]))
FN = tf.reduce_sum(tf.math.multiply(1-inputs_fl[0], targets_fl))
tversky_i = (TP + smooth) / (TP + alpha*FP + beta*FN + smooth)
tversky += tversky_i
return numLabels - tversky
def dice_coef(y_true, y_pred, smooth=1e-16):
y_true_f = tf.nest.flatten(y_true)
y_pred_f = tf.nest.flatten(y_pred)
intersection = tf.math.reduce_sum(tf.math.multiply(y_true_f, y_pred_f))
return (2. * intersection + smooth) / (tf.math.reduce_sum(y_true_f) + tf.math.reduce_sum(y_pred_f) + smooth)
def dice_coef_multilabel(y_true, y_pred, numLabels=3):
dice=0
for index in range(numLabels):
dice -= dice_coef(y_true[..., index], y_pred[..., index])
return numLabels + dice
def build_hybrid_loss(_lambda_=0.5, _alpha_=0.5, _beta_=0.5, smooth=1e-16, C=3):
def hybrid_loss(y_true, y_pred):
tversky = TverskyLoss(y_true, y_pred, alpha=_alpha_, beta=_beta_)
dice = dice_coef_multilabel(y_true, y_pred)
result = tversky + _lambda_ * dice
return result
return hybrid_loss
Adding the loss=build_hybrid_loss() during model compilation will add Hybrid loss as the loss function of the model.
After a short research, I came to the conclusion that in my particular case, a Hybrid loss with _lambda_ = 0.2, _alpha_ = 0.5, _beta_ = 0.5 would not be much better than a single Dice loss or a single Tversky loss. Neither IoU (intersection over union) nor the standard accuracy metric are much better with Hybrid loss. But I believe it is not a rule of thumb that such a Hybrid loss will be worser or at the same level of performance as single loss at all cases.
link to Accuracy graph
link to IoU graph
See EDIT below, the initial post almost has no meaning now but the question still remains.
I developing a neural network to semantically segment imagery. I have worked through various loss functions (categorical cross entropy (CCE), weight CCE, focal loss, tversky loss, jaccard loss, focal tversky loss, etc) which attempt to handle highly skewed class representation, though none are producing the desired effect. My advisor mentioned attempting to create a custom loss function which ignores false negatives for a specific class (but still penalizes false positives).
I have a 6 class problem and my network is setup to work in/with one-hot encoded truth data. As a result my loss function will accept two tensors, y_true, y_pred, of shape (batch, row, col, class) (which is currently (8, 128, 128, 6)). To be able to utilize the losses I have already explored I would like to alter y_pred to set the predicted value for the specific class (the 0th class) to always be correct. That is where y_true == class 0 set y_pred == class 0, otherwise do nothing.
I have spent way too much time attempting to create this loss function as a result of tensorflow tensors being immutable. My first attempt (which I was led to through my experience with numpy)
def weighted_categorical_crossentropy_ignore(weights):
weights = K.variable(weights)
def loss(y_true, y_pred):
y_pred[tf.where(y_true == [1, 0, 0, 0, 0, 0])] = [1, 0, 0, 0, 0, 0]
# Scale predictions so that the class probs of each sample sum to 1
y_pred /= K.sum(y_pred, axis=-1, keepdims=True)
# Clip to prevent NaN's and Inf's
y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
loss = y_true * K.log(y_pred) * weights
loss = -K.sum(loss, -1)
return loss
return loss
Though obviously I cannot alter y_pred so this attempt failed. I ended up creating a few monstrosities attempting to "build" a tensor by iterating over [batch, row, col] and performing comparisons. While this(ese) attempts did not technically fail, they never actually began training. I assume it was taking on the order of minutes to compute the loss.
After many more failed efforts I started attempting to perform the requisite computation in pure numpy in a SSCCE. But keeping cognizant I was essentially limited to instantiating "simple" tensors (ie ones, zeros) and only performing "simple" operations like element-wise multiply, addition, and reshaping. Thus I arrived at this SSCCE
import numpy as np
from tensorflow.keras.utils import to_categorical
# Generate the "images" at random
true_flat = np.argmax(np.random.rand(1, 2, 2, 4), axis=3).astype('int')
true = to_categorical(true_flat, num_classes=4).astype('int')
pred_flat = np.argmax(np.random.rand(1, 2, 2, 4), axis=3).astype('int')
pred = to_categorical(pred_flat, num_classes=4).astype('int')
print('True:\n', true_flat)
print('Pred:\n', pred_flat)
# Create a mask representing an all "class 0" image
class_zero_label = np.array([1, 0, 0, 0])
czl_all = class_zero_label * np.ones(true.shape).astype('int')
# Mask both the truth and pred to locate class 0 pixels
czl_true_locs = czl_all * true
czl_pred_locs = czl_all * pred
# Subtract to create "addition" matrix
a = (czl_true_locs - czl_pred_locs) * czl_true_locs
print('a:\n', a)
# Do this
m = ((a + 1) - (a * 2))
print('m - ', m.shape, ':\n', m)
# Pull the front entry from 'm' and "expand" its value
#x = (m[:, :, :, 0].flatten() * np.ones(pred.shape).astype('int')).T.reshape(pred.shape)
m_front = m[:, :, :, 0]
print('m_front - ', m_front.shape, ':\n', m_front)
#m_flat = m_front.flatten()
m_flat = m_front.reshape(m_front.shape[0], m_front.shape[1]*m_front.shape[2])
print('m_flat - ', m_flat.shape, ':\n', m_flat)
m_expand = m_flat * np.ones(pred.shape).astype('int')
print('m_expand - ', m_expand.shape, ':\n', m_expand)
m_trans = m_expand.T
m_fixT = m_trans.reshape(pred.shape)
print('m_fixT - ', m_fixT.shape, ':\n', m_fixT)
m = m_fixT
print('m:\n', m.shape)
# Perform the math as described
pred = (pred * m) + a
print('Pred:\n', np.argmax(pred, axis=3))
This SSCCE, is well, terrible and complex. Essentially my goal here was to create two matrices, the "addition" and "multiplication" matrices. The multiplication matrix is meant to "zero out" every pixel in the predicted values where the truth value was equal to class 0. That is no matter the pixel value (ie a one-hot encoded vector) zero it out to be equal to [0, 0, 0, 0, 0, 0]. The addition matrix is then meant to add the vector [1, 0, 0, 0, 0, 0] to each of the zero'ed out locations. In the end this would achieve the goal of setting the predicted value of every truly class 0 pixel to correct.
The issue is that this SSCCE does not translate fully to tensorflow operations. The first issue is with the generation of the multiplication matrix, it is not defined correctly for when batch_size > 1. I thought no matter, just to see if it work I will break down and tf.unstack the y_true and y_pred tensors and iteration over them. Which has led me to the current instantiation of my loss function
def weighted_categorical_crossentropy_ignore(weights):
weights = K.variable(weights)
def loss(y_true, y_pred):
y_true_un = tf.unstack(y_true)
y_pred_un = tf.unstack(y_pred)
y_pred_new = []
for i in range(0, y_true.shape[0]):
yt = y_true_un[i]
yp = y_pred_un[i]
# Pred:
# [[[0 3] * [[[1 0] + [[[0 1] = [[[0 0]
# [3 1]]] [[1 1]]] [[0 0]]] [[3 1]]]
# If we multiple pred by a tensor which zeros out only incorrect class 0 labelleling
# Then add class zero to those zero'd out locations
# We can negate the effect of mis-classified class 0 pixels but still punish for
# incorrectly predicted class 0 labels for other classes.
# Create a mask respresenting an all "class 0" image
class_zero_label = K.variable([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
czl_all = class_zero_label * K.ones(yt.shape)
# Mask both true and pred to locate class 0 pixels
czl_true = czl_all * yt
czl_pred = czl_all * yp
# Subtract to create "addition matrix"
a = czl_true - czl_pred
# Do this.
m = ((a + 1) - (a * 2.))
# And this.
x = K.flatten(m[:, :, 0])
x = x * K.ones(yp.shape)
x = K.transpose(x)
x = K.reshape(x, yp.shape)
# Voila.
ypnew = (yp * x) + a
y_pred_new.append(ypnew)
y_pred_new = tf.concat(y_pred_new, 0)
# Continue calculating weighted categorical crossentropy
# -------------------------------------------------------
# Scale predictions so that the class probs of each sample sum to 1
y_pred_new /= K.sum(y_pred_new, axis=-1, keepdims=True)
# Clip to prevent NaN's and Inf's
y_pred_new = K.clip(y_pred_new, K.epsilon(), 1 - K.epsilon())
loss = y_true * K.log(y_pred_new) * weights
loss = -K.sum(loss, -1)
return loss
return loss
The current issue with this loss function lies in the apparent difference in the behavior between numpy and tensorflow when performing the operation
x = K.flatten(m[:, :, 0])
x = x * K.ones(yp.shape)
Which is meant to represent the behavior
m_flat = m_front.flatten()
m_expand = m_flat * np.ones(pred.shape).astype('int')
from the SSCCE.
So at this point I feel like I have delved so far into caveman coding I can't get out of it. I have to image there is some simple way akin to my initial attempt to perform the described behavior.
So, I guess my direct question is How do I implement
y_pred[tf.where(y_true == [1, 0, 0, 0, 0, 0])] = [1, 0, 0, 0, 0, 0]
in a custom tensorflow loss function?
EDIT: After fumbling around quite a bit more I have finally determined how to call .numpy() on the y_true, y_pred tensors to utilize numpy operations (Apparently setting tf.compat.v1.enable_eager_execution at the start of the program "doesn't work". I had to pass run_eagerly=True to Model().compile(...)).
This has allowed me to implement essentially the first attempt outlined
def weighted_categorical_crossentropy_ignore(weights):
weights = K.variable(weights)
def loss(y_true, y_pred):
yp = y_pred.numpy()
yt = y_true.numpy()
yp[np.nonzero(np.all(yt == [1, 0, 0, 0, 0, 0], axis=3))] = [1, 0, 0, 0, 0, 0]
# Continue calculating weighted categorical crossentropy
# -------------------------------------------------------
# Scale predictions so that the class probs of each sample sum to 1
yp /= K.sum(yp, axis=-1, keepdims=True)
# Clip to prevent NaN's and Inf's
yp = K.clip(yp, K.epsilon(), 1 - K.epsilon())
loss = y_true * K.log(yp) * weights
loss = -K.sum(loss, -1)
return loss
return loss
Though it seems by calling y_pred.numpy() (or the use of it thereafter) I have apparently "destroyed" the path/flow through the network. Based on the error when attempting to .fit
ValueError: No gradients provided for any variable: ['conv3d/kernel:0', <....>
I assume I somehow need to "remarshall" the tensor back to GPU memory? I have tried
yp = tf.convert_to_tensor(yp)
to no avail; same error. So I guess the same question still lies, but from a different motivation..
EDIT2: Well it seems from this SO Answer that I can't actually use numpy() to marshall the y_true, y_pred to use vanilla numpy operations. This necessarily "destroys" the network path and thus gradients cannot be calculated.
As I result I had realized with run_eagerly=True I can tf.Variable my y_true/y_pred and perform assignment. So in pure tensorflow I attempted to recreate the same code again
def weighted_categorical_crossentropy_ignore(weights):
weights = K.variable(weights)
def loss(y_true, y_pred):
# yp = y_pred.numpy().copy()
# yt = y_true.numpy().copy()
# yp[np.nonzero(np.all(yt == [1, 0, 0, 0, 0, 0], axis=3))] = [1, 0, 0, 0, 0, 0]
yp = K.variable(y_pred)
yt = K.variable(y_true)
#np.all
x = K.all(yt == [1, 0, 0, 0, 0, 0], axis=3)
#np.nonzero
ne = tf.not_equal(x, tf.constant(False))
y = tf.where(ne)
# Perform the desired operation
yp[y] = [1, 0, 0, 0, 0, 0]
# Continue calculating weighted categorical crossentropy
# -------------------------------------------------------
# Scale predictions so that the class probs of each sample sum to 1
#yp /= K.sum(yp, axis=-1, keepdims=True) # Cannot use \= on tf.var, must use var = var /
yp = yp / K.sum(yp, axis=-1, keepdims=True)
# Clip to prevent NaN's and Inf's
yp = K.clip(yp, K.epsilon(), 1 - K.epsilon())
loss = y_true * K.log(yp) * weights
loss = -K.sum(loss, -1)
return loss
return loss
But alas, this apparently creates the same issue as when calling .numpy(); no gradients can be computed. So I am again seemingly back at square 1.
EDIT3: Using the solution proposed by gobrewers14 in the answer posted below but modified based on my knowledge of the problem I have produced this loss function
def weighted_categorical_crossentropy_ignore(weights):
weights = K.variable(weights)
def loss(y_true, y_pred):
print('y_true.shape: ', y_true.shape)
print('y_pred.shape: ', y_pred.shape)
# Generate modified y_pred where all truly class0 pixels are correct
y_true_class0_indicies = tf.where(tf.math.equal(y_true, [1., 0., 0., 0., 0., 0.]))
y_pred_updates = tf.repeat([
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
repeats=y_true_class0_indicies.shape[0],
axis=0)
yp = tf.tensor_scatter_nd_update(y_pred, y_true_class0_indicies, y_pred_updates)
# Continue calculating weighted categorical crossentropy
# -------------------------------------------------------
# Scale predictions so that the class probs of each sample sum to 1
yp /= K.sum(yp, axis=-1, keepdims=True)
# Clip to prevent NaN's and Inf's
yp = K.clip(yp, K.epsilon(), 1 - K.epsilon())
loss = y_true * K.log(yp) * weights
loss = -K.sum(loss, -1)
return loss
return loss
Provided the original answer assumed y_true to be of shape [8, 128, 128] (ie a "flat" class representation, versus a one-hot encoded representation [8, 128, 128, 6]) I first print the shapes of the y_true and y_pred input tensors for sanity
y_true.shape: (8, 128, 128, 6)
y_pred.shape: (8, 128, 128, 6)
For further sanity, the output shape of the network, provided by the tail of model.summary is
conv2d_18 (Conv2D) (None, 128, 128, 6) 1542 dropout_5[0][0]
__________________________________________________________________________________________________
activation_9 (Activation) (None, 128, 128, 6) 0 conv2d_18[0][0]
==================================================================================================
Total params: 535,551,494
Trainable params: 535,529,478
Non-trainable params: 22,016
__________________________________________________________________________________________________
I then follow "the pattern" in the proposed solution and replace the original tf.math.equal(y_true, 0) with tf.math.equal(y_true, [1., 0., 0., 0., 0., 0.]) to handle the one-hot encoded case. From my understanding of the proposed solution currently (after ~10min of inspecting it) I assumed this should work. Though when attempting to train a model the following exception is thrown
InvalidArgumentError: Inner dimensions of output shape must match inner dimensions of updates shape. Output: [8,128,128,6] updates: [684584,6] [Op:TensorScatterUpdate]
Thus it seems as if the production of the (as I have named them) y_pred_updates produces a "collapsed" tensor with "too many" elements. I understand the motivation of the use of tf.repeat but its specific use seems to be incorrect. I assume it should produce a tensor with shape (8, 128, 128, 6) based on what I understand tf.tensor_scatter_nd_update to do. I assume this most likely is just based on the selection of the repeats and axis during the call to tf.repeat.
If I understand your question correctly, you are looking for something like this:
import tensorflow as tf
# batch of true labels
y_true = tf.constant([5, 0, 1, 3, 4, 0, 2, 0], dtype=tf.int64)
# batch of class probabilities
y_pred = tf.constant(
[
[0.34670502, 0.04551039, 0.14020428, 0.14341979, 0.21430719, 0.10985339],
[0.25681055, 0.14013883, 0.19890164, 0.11124421, 0.14526634, 0.14763844],
[0.09199252, 0.21889475, 0.1170236 , 0.1929019 , 0.20311192, 0.17607528],
[0.3246354 , 0.23257554, 0.15549366, 0.17282239, 0.00000001, 0.11447308],
[0.16502093, 0.13163856, 0.14371352, 0.19880624, 0.23360236, 0.12721846],
[0.27362782, 0.21408406, 0.10917682, 0.13135742, 0.10814326, 0.16361059],
[0.20697299, 0.23721898, 0.06455399, 0.11071447, 0.18990229, 0.19063729],
[0.10320242, 0.22173141, 0.2547973 , 0.2314068 , 0.07063974, 0.11822232]
], dtype=tf.float32)
# find the indices in the batch where the true label is the class 0
indices = tf.where(tf.math.equal(y_true, 0))
# create a tensor with the number of updates you want to replace in `y_pred`
updates = tf.repeat(
[[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
repeats=indices.shape[0],
axis=0)
# insert the updates into `y_pred` at the specified indices
modified_y_pred = tf.tensor_scatter_nd_update(y_pred, indices, updates)
print(modified_y_pred)
# tf.Tensor(
# [[0.34670502, 0.04551039, 0.14020428, 0.14341979, 0.21430719, 0.10985339],
# [1.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00000000],
# [0.09199252, 0.21889475, 0.1170236 , 0.1929019 , 0.20311192, 0.17607528],
# [0.3246354 , 0.23257554, 0.15549366, 0.17282239, 0.00000001, 0.11447308],
# [0.16502093, 0.13163856, 0.14371352, 0.19880624, 0.23360236, 0.12721846],
# [1.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00000000],
# [0.20697299, 0.23721898, 0.06455399, 0.11071447, 0.18990229, 0.19063729],
# [1.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.00000000]],
# shape=(8, 6), dtype=tf.float32)
This final tensor, modified_y_pred, can be used in differentiation.
EDIT:
It might be easier to do this with masks.
Example:
# these arent normalized to 1 but you get the point
probs = tf.random.normal([2, 4, 4, 6])
# raw labels per pixel
labels = tf.random.uniform(
shape=[2, 4, 4],
minval=0,
maxval=6,
dtype=tf.int64)
# your labels are already one-hot encoded
labels = tf.one_hot(labels, 6)
# boolean mask where classes are `0`
# converting back to int labels with argmax for purposes of
# using `tf.math.equal`. Matching on `[1, 0, 0, 0, 0, 0]` is
# potentially buggy; matching on an integer is a lot more
# explicit.
mask = tf.math.equal(tf.math.argmax(labels, -1), 0)[..., None]
# flip the mask to zero out the pixels across channels where
# labels are zero
probs *= tf.cast(tf.math.logical_not(mask), tf.float32)
# multiply the mask by the one-hot labels, and add back
# to the already masked probabilities.
probs += labels * tf.cast(mask, tf.float32)