Implementation of weighted binary cross entropy in Keras - Python

I'm actually working on an image segmentation project with Keras.
I am using an implementation of Unet.
I have 2 classes identified by pixel value: 0 = background, 1 = the object I'm looking for.
I have a single output with a sigmoid activation function.
I'm using binary cross-entropy as the loss function.
Here is the problem: I have a very unbalanced data set, with approximately 1 white pixel for every 100 black pixels. And from what I understood, binary cross-entropy does not cope well with an unbalanced data set.
So I tried to implement weighted cross entropy with the following formula:

loss = -(w0 * y_true * log(y_pred) + w1 * (1 - y_true) * log(1 - y_pred))
This is my code:
from tensorflow.keras import backend as K

def weighted_cross_entropy(y_true, y_pred):
    w = [0.99, 0.01]
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    val = -(w[0] * y_true_f * K.log(y_pred_f)
            + w[1] * (1 - y_true_f) * K.log(1 - y_pred_f))
    return K.mean(val, axis=-1)
I'm using the F1 score/Dice to measure the result at the end of each epoch.
But past 5 epochs, the loss is equal to NaN and the F1 score stays very low (0.02).
It seems my network is not learning, but I don't understand why. Maybe my formula is wrong?
I have also tried to swap the weight values, but the result is the same.
After some research, I noticed that it is possible to pass predefined weights directly to the fit function, like this:
from sklearn.utils import class_weight

w = class_weight.compute_class_weight('balanced', np.unique(y_train.ravel()), y_train.ravel())
model.fit(epochs=1000, ..., class_weight=w)
By doing this with the basic binary cross entropy function, the network learns correctly and gives better results than without predefined weights.
So I don't understand the difference between the two methods. Is it really necessary to implement the weighted cross entropy function?
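One likely cause of the NaN: K.log gives -inf as soon as y_pred_f hits exactly 0 or 1, and Keras' own binary cross-entropy guards against this by clipping. A sketch of the same formula with that guard added:

from tensorflow.keras import backend as K

def weighted_cross_entropy(y_true, y_pred):
    w = [0.99, 0.01]
    y_true_f = K.flatten(y_true)
    # Clip predictions away from exactly 0 and 1 so the logs stay finite.
    y_pred_f = K.clip(K.flatten(y_pred), K.epsilon(), 1 - K.epsilon())
    val = -(w[0] * y_true_f * K.log(y_pred_f)
            + w[1] * (1 - y_true_f) * K.log(1 - y_pred_f))
    return K.mean(val, axis=-1)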

Related

Which deep learning network structure should I refer to?

I am making a deep learning network that finds several points in 3D space.
The input is a stack of grayscale 1024 x 1024 images (the number of images varies from 5 to 20), and the output is a 64 x 64 x 64 space. Each voxel of the output is 0 or 1, but in my dataset there are only 2000 1s, so it is hard to tell whether my network is being trained well by observing the training losses.
For example, if my network only spat out np.zeros((64,64,64)) as output, the accuracy would still be 1 - 2000/(64 x 64 x 64) ≈ 99.2%.
So I want to ask which deep learning network I should choose for finding a very small number of answers in a 3D space. The input size is (1024 x 1024 x #img) and the output size (64 x 64 x 64). I am currently experimenting with a 2D Unet-like net and a 3D Unet-like net, with a ReLU-with-ceiling output activation.
Could somebody recommend anything to refer to? Thank you very much.
Unet-like networks seem to be a good idea. Your problem does not come from the network itself, but from the loss and metrics you are using.
Indeed, if you use a binary cross-entropy loss and accuracy as the metric, then because of the imbalance between your classes the score will always be near 100%.
I suggest that you use the Dice or Jaccard coefficient as the metric and/or the loss (in that case the loss is 1 - Dice coefficient), and that you calculate it only on the items of interest, not on the background.
Depending on the framework you are using, you should easily find an existing implementation of these metrics; then modify the code to avoid calculation on the background.
For example, for Python/TensorFlow, using your volumes:
from tensorflow.keras import backend as K

def dice_coef(y_true, y_pred, smooth=1):
    # Flatten the volumes, one-hot encode into (background, object) channels,
    # then compute Dice on the object channel only (index 1 onward).
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    y_true_f = K.one_hot(K.cast(y_true_f, 'uint8'), 2)
    y_pred_f = K.one_hot(K.cast(y_pred_f, 'uint8'), 2)
    intersection = K.sum(y_true_f[:, 1:] * y_pred_f[:, 1:], axis=[-1])
    union = K.sum(y_true_f[:, 1:], axis=[-1]) + K.sum(y_pred_f[:, 1:], axis=[-1])
    dice = K.mean((2. * intersection + smooth) / (union + smooth), axis=0)
    return dice
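To use Dice as the loss as well, as suggested above, a sketch (not part of the original snippet): note that for a trainable loss the predictions must stay soft, since the cast/one_hot in dice_coef above would block gradients. This version assumes a single sigmoid output channel for the object class:

def dice_loss(y_true, y_pred):
    # Soft Dice on raw sigmoid outputs; 1 - Dice is a quantity to minimize.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    union = K.sum(y_true_f) + K.sum(y_pred_f)
    return 1 - (2. * intersection + 1) / (union + 1)

model.compile(optimizer='adam', loss=dice_loss, metrics=[dice_coef])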

Keras Categorical Cross Entropy

I'm trying to wrap my head around the categorical cross entropy loss. Looking at the implementation of the cross entropy loss in Keras:
# scale preds so that the class probas of each sample sum to 1
output = output / math_ops.reduce_sum(output, axis, True)
# Compute cross entropy from probabilities.
epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
return -math_ops.reduce_sum(target * math_ops.log(output), axis)
I do not see where the delta = output - target is calculated.
See here.
What am I missing?
I think you might be confusing two different concepts / events here.
The categorical cross entropy loss is a measure of the error of your model, calculated as:
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
<etc>
This just returns an array of losses, one per sample; it quantifies the discrepancy between the true label and what your model thinks the label should be.
The next step after calculating the loss (part of the forward propagation phase) is to then start backpropagation, i.e. we want to find the influence that each weight/bias matrix has on the loss you've calculated above, so that we can perform the update step.
The first step is then to calculate dL/dz, i.e. the derivative of the loss function with respect to the linear function (z = Wx + b), which itself is the product dL/da * da/dz (i.e. the derivative of the loss wrt the activation times the derivative of the activation wrt the linear function).
The link you posted gives the derivative of the activation function wrt the linear function. This blog does a decent job of explaining how all the parts fit together; although the activation function they use is a sigmoid, the overall pieces are the same.
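As a concrete illustration (a standalone NumPy check, not the Keras source): for a softmax activation followed by categorical cross entropy, the product dL/da * da/dz collapses to output - target, which is exactly the delta the question asks about.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.2, -1.0, 3.0])  # logits for one sample
t = np.array([0.0, 0.0, 1.0])   # one-hot target

def loss(z):
    return -np.sum(t * np.log(softmax(z)))

analytic = softmax(z) - t  # dL/dz = output - target

# Verify against central finite differences.
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[i]) - loss(z - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric))  # True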

Tensorflow Custom Regularization Term comparing the Prediction to the True value

Hello, I am in need of a custom regularization term to add to my (binary cross entropy) loss function. Can somebody help me with the TensorFlow syntax to implement this?
I simplified everything as much as possible so it could be easier to help me.
The model takes a dataset of 10000 binary 18 x 18 configurations as input and outputs a 16 x 16 configuration. The neural network consists of just 2 convolutional layers.
My model looks like this:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

EPOCHS = 10

model = models.Sequential()
model.add(layers.Conv2D(1, 2, activation='relu', input_shape=[18, 18, 1]))
model.add(layers.Conv2D(1, 2, activation='sigmoid'))

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.BinaryCrossentropy())
model.fit(initial.reshape(10000, 18, 18, 1), target.reshape(10000, 16, 16, 1),
          batch_size=1000, epochs=EPOCHS, verbose=1)

output = model(initial.reshape(10000, 18, 18, 1)).numpy().reshape(10000, 16, 16)
Now I have written a function which I'd like to use as an additional regularization term. This function takes the true values and the predictions. Basically, it multiplies every point of both with its 'right' neighbor, and then takes the difference. I assumed that the true and prediction tensors are 16x16 (and not 10000x16x16). Is this correct?
def regularization_term(prediction, true):
    # Indices for a cyclic shift by one along the second axis (assumes width 16).
    order = list(range(1, 16)) + [0]
    deviation = (true * true[:, order]) - (prediction * prediction[:, order])
    deviation = abs(deviation) ** 2
    return 0.2 * deviation
I would really appreciate some help with adding something like this function as a regularization term to my loss for helping the neural network to train better to this 'right neighbor' interaction. I'm really struggling with using the customizable Tensorflow functionalities a lot.
Thank you, much appreciated.
It is quite simple. You need to specify a custom loss in which you add your regularization term. Something like this:
# to minimize!
def regularization_term(true, prediction):
    # Indices for a cyclic shift by one along the first non-batch axis
    # (assumes size 16). tf.gather is used because list indexing is not
    # supported on tensors.
    order = list(range(1, 16)) + [0]
    true_shifted = tf.gather(true, order, axis=1)
    pred_shifted = tf.gather(prediction, order, axis=1)
    deviation = (true * true_shifted) - (prediction * pred_shifted)
    deviation = tf.abs(deviation) ** 2
    return 0.2 * deviation

def my_custom_loss(y_true, y_pred):
    return tf.keras.losses.BinaryCrossentropy()(y_true, y_pred) + regularization_term(y_true, y_pred)

model.compile(optimizer='Adam', loss=my_custom_loss)
As stated by Keras:

Any callable with the signature loss_fn(y_true, y_pred) that returns an array of losses (one per sample in the input batch) can be passed to compile() as a loss. Note that sample weighting is automatically supported for any such loss.

So be sure to return an array of losses (EDIT: as I can see now, it is also possible to return a simple scalar; it doesn't matter if you use, for example, a reduce function). Basically, y_true and y_pred have the batch size as their first dimension.
More details here: https://keras.io/api/losses/
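If you would rather return one loss value per sample than a full tensor, here is a sketch of the reduction (assuming inputs shaped (batch, 16, 16, 1) and the same shift axis as above):

import tensorflow as tf

def regularization_term(true, prediction):
    order = list(range(1, 16)) + [0]  # cyclic shift by one position
    deviation = (true * tf.gather(true, order, axis=1)
                 - prediction * tf.gather(prediction, order, axis=1))
    # Reduce over the spatial/channel axes: one penalty value per sample.
    return 0.2 * tf.reduce_mean(tf.square(deviation), axis=[1, 2, 3])

def my_custom_loss(y_true, y_pred):
    bce = tf.keras.losses.BinaryCrossentropy()(y_true, y_pred)
    return bce + regularization_term(y_true, y_pred)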

Converting an sklearn-based custom metric function to use as a Keras metric for callbacks

I have already been given a custom metric code on which my model is going to be evaluated, but they've used sklearn's metrics. I know that if I have a metric I can use it in callbacks like:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy', custom_metric])

ModelCheckpoint(monitor='val_custom_metric',
                save_best_only=True,
                save_weights_only=True,
                mode='max',
                verbose=1)
It is a multi-output problem with 3 labels.
Submissions are evaluated using a hierarchical macro-averaged recall. First, a standard macro-averaged recall is calculated for each component (label_1, label_2, or label_3). The final score is the weighted average of those three scores, with label_1 given double weight. You can replicate the metric with the following Python snippet:
and I am unable to comprehend how to implement the code given below in Keras:
import numpy as np
import sklearn.metrics

scores = []
for component in ['label_1', 'label_2', 'label_3']:
    y_true_subset = solution[solution[component] == component]['target'].values
    y_pred_subset = submission[submission[component] == component]['target'].values
    scores.append(sklearn.metrics.recall_score(
        y_true_subset, y_pred_subset, average='macro'))
final_score = np.average(scores, weights=[2, 1, 1])
How can I convert it into a form usable as a metric? Or, more precisely, how can I use keras.backend to implement this code?
You can only implement the metric itself; the rest is very specific to the evaluation script and will certainly not take place inside Keras.
from tensorflow.keras import backend as K

threshold = 0.5  # you can tune this threshold for better results

# considering y_true is made of 0 and 1 only
# considering output shape is (batch, 3)
def custom_metric(y_true, y_pred):
    weights = K.constant([2, 1, 1])  # shape (3,)
    y_pred = K.cast(K.greater(y_pred, threshold), K.floatx())  # shape (batch, 3)
    true_positives = K.sum(y_pred * y_true, axis=0)  # shape (3,)
    false_negatives = K.sum((1 - y_pred) * y_true, axis=0)  # shape (3,)
    # epsilon guards against division by zero when a label never occurs in the batch
    recall = true_positives / (true_positives + false_negatives + K.epsilon())
    # weighted average: divide by the weight sum, matching np.average above
    return K.sum(recall * weights) / K.sum(weights)
Notice that this will be calculated batchwise, and since the denominator differs with the results, the metric calculated batchwise will differ from the metric applied to the entire dataset.
You may need big batch sizes to avoid metric instability, and it might be worthwhile to apply the metric to the entire data with a callback to get the exact result.
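A sketch of such a callback (ExactMetricCallback, X_val, and y_val are illustrative names; it assumes the same multi-label 0/1 setup and threshold as above):

import numpy as np
import sklearn.metrics
import tensorflow as tf

class ExactMetricCallback(tf.keras.callbacks.Callback):
    def __init__(self, X_val, y_val, threshold=0.5):
        super().__init__()
        self.X_val = X_val
        self.y_val = y_val
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        # Predict on the whole validation set, then apply the exact
        # sklearn-based metric instead of the batchwise approximation.
        y_pred = (self.model.predict(self.X_val) > self.threshold).astype(int)
        scores = [sklearn.metrics.recall_score(self.y_val[:, i], y_pred[:, i],
                                               average='macro')
                  for i in range(3)]
        print(f"epoch {epoch}: exact metric = "
              f"{np.average(scores, weights=[2, 1, 1]):.4f}")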

Siamese network, lower part uses a dense layer instead of a euclidean distance layer

This is a rather interesting question about Siamese networks.
I am following the example from https://keras.io/examples/mnist_siamese/.
My modified version of the code is in this google colab
The siamese network takes in 2 inputs (2 handwritten digits) and output whether they are of the same digit (1) or not (0).
Each of the two inputs is first processed by a shared base_network (3 Dense layers with 2 Dropout layers in between). input_a is extracted into processed_a, input_b into processed_b.
The last layer of the siamese network is a Euclidean distance layer between the two extracted tensors:
distance = Lambda(euclidean_distance,
                  output_shape=eucl_dist_output_shape)([processed_a, processed_b])
model = Model([input_a, input_b], distance)
I understand the reasoning behind using a Euclidean distance layer for the lower part of the network: if the features are extracted nicely, then similar inputs should have similar features.
I am wondering: why not use a normal Dense layer for the lower part instead, as in:
# distance = Lambda(euclidean_distance,
#                   output_shape=eucl_dist_output_shape)([processed_a, processed_b])
# model = Model([input_a, input_b], distance)

# my model
subtracted = Subtract()([processed_a, processed_b])
out = Dense(1, activation="sigmoid")(subtracted)
model = Model([input_a, input_b], out)
My reasoning is that if the extracted features are similar, then the Subtract layer should produce a small tensor, the difference between the extracted features. The next layer, the Dense layer, can learn that if the input is small, it should output 1, and 0 otherwise.
Because the Euclidean distance layer outputs values close to 0 when the two inputs are similar and close to 1 otherwise, I also need to invert the accuracy and loss functions, as:
# the version of loss and accuracy for the Euclidean distance layer
# def contrastive_loss(y_true, y_pred):
#     '''Contrastive loss from Hadsell-et-al.'06
#     http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
#     '''
#     margin = 1
#     square_pred = K.square(y_pred)
#     margin_square = K.square(K.maximum(margin - y_pred, 0))
#     return K.mean(y_true * square_pred + (1 - y_true) * margin_square)

# def compute_accuracy(y_true, y_pred):
#     '''Compute classification accuracy with a fixed threshold on distances.
#     '''
#     pred = y_pred.ravel() < 0.5
#     return np.mean(pred == y_true)

# def accuracy(y_true, y_pred):
#     '''Compute classification accuracy with a fixed threshold on distances.
#     '''
#     return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))

### my version, loss and accuracy
def contrastive_loss(y_true, y_pred):
    margin = 1
    square_pred = K.square(y_pred)
    margin_square = K.square(K.maximum(margin - y_pred, 0))
    # return K.mean(y_true * square_pred + (1 - y_true) * margin_square)
    return K.mean(y_true * margin_square + (1 - y_true) * square_pred)

def compute_accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.
    '''
    pred = y_pred.ravel() > 0.5
    return np.mean(pred == y_true)

def accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.
    '''
    return K.mean(K.equal(y_true, K.cast(y_pred > 0.5, y_true.dtype)))
The accuracy for the old model:
* Accuracy on training set: 99.55%
* Accuracy on test set: 97.42%
This slight change leads to a model that does not learn anything:
* Accuracy on training set: 48.64%
* Accuracy on test set: 48.29%
So my questions are:
1. What is wrong with my reasoning of using Subtract + Dense for the lower part of the Siamese network?
2. Can we fix this? I have two potential solutions in mind, but I am not confident: (1) a convolutional neural net for feature extraction, (2) more Dense layers for the lower part of the Siamese network.
In the case of two similar examples, after subtracting the two n-dimensional feature vectors (extracted using the common/base feature extraction model), you will get zero or a value around zero in most locations of the resulting n-dimensional vector, which the next/output Dense layer works on. On the other hand, we all know that in an ANN weights are learnt in such a way that less important features produce very small responses and prominent/interesting features contributing towards a decision produce high responses. Now you can see that our subtracted feature vector points in just the opposite direction: when two examples are from different classes they produce high responses, and the opposite for examples from the same class. Furthermore, with a single node in the output layer (no additional hidden layer before the output layer), it is quite difficult for the model to learn to generate a high response from zero values when two samples are of the same class. This might be an important point to solve your problem.
Based on the above discussion, you may want to try the following ideas (see the sketch after this list):
transforming the subtracted feature vector so that similarity yields high responses, perhaps by subtracting it from 1 or taking the reciprocal (multiplicative inverse) followed by normalization;
adding more Dense layers before the output layer.
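A sketch of the second idea (processed_a, processed_b, input_a, and input_b come from the original example; the absolute difference and the layer width of 64 are illustrative choices, not a prescription):

from tensorflow.keras.layers import Dense, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K

# The absolute difference maps "similar" to near-zero values regardless of
# sign; the extra hidden Dense layer gives the model room to learn to turn
# near-zero inputs into a high output.
diff = Lambda(lambda t: K.abs(t[0] - t[1]))([processed_a, processed_b])
hidden = Dense(64, activation='relu')(diff)
out = Dense(1, activation='sigmoid')(hidden)
model = Model([input_a, input_b], out)

With a sigmoid output trained against 1 = same / 0 = different, the plain binary cross-entropy loss and standard accuracy can be used directly, without the inverted contrastive loss above.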
I won't be surprised if a convolutional neural net instead of stacked Dense layers for feature extraction (as you are considering) does not improve your accuracy much, as it's just another way of doing the same thing (feature extraction).
