How is categorical_crossentropy implemented in Keras? - python

I'm trying to apply the concept of distillation, basically to train a new smaller network to do the same as the original one but with less computation.
I have the softmax outputs for every sample instead of the logits.
My question is, how is the categorical cross entropy loss function implemented?
Does it take the maximum value of the original labels and multiply it by the corresponding predicted value at the same index, or does it sum over all the outputs (one-hot encoding) as the formula says:

loss(y, ŷ) = -Σ_i y_i · log(ŷ_i)

As an answer to "Do you happen to know what the epsilon and tf.clip_by_value is doing?",
it is ensuring that output != 0, because tf.log(0) evaluates to -inf, which would make the loss (and its gradients) non-finite.
(I don't have points to comment but thought I'd contribute)
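A quick sketch of what the clipping does (using the TF2 name tf.math.log; the backend code below uses the older tf.log):
import tensorflow as tf

print(tf.math.log(0.0))           # -inf, which would poison the loss
eps = tf.keras.backend.epsilon()  # 1e-07 by default
clipped = tf.clip_by_value(tf.constant(0.0), eps, 1.0 - eps)
print(tf.math.log(clipped))       # finite, roughly -16.12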

I see that you used the tensorflow tag, so I guess this is the backend you are using?
def categorical_crossentropy(output, target, from_logits=False):
    """Categorical crossentropy between an output tensor and a target tensor.

    # Arguments
        output: A tensor resulting from a softmax
            (unless `from_logits` is True, in which
            case `output` is expected to be the logits).
        target: A tensor of the same shape as `output`.
        from_logits: Boolean, whether `output` is the
            result of a softmax, or is a tensor of logits.

    # Returns
        Output tensor.
    """
This code comes from the Keras source code. Looking directly at the code should answer all your questions :) If you need more info, just ask!
EDIT :
Here is the code that interests you:
    # Note: tf.nn.softmax_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # scale preds so that the class probas of each sample sum to 1
        output /= tf.reduce_sum(output,
                                reduction_indices=len(output.get_shape()) - 1,
                                keep_dims=True)
        # manual computation of crossentropy
        epsilon = _to_tensor(_EPSILON, output.dtype.base_dtype)
        output = tf.clip_by_value(output, epsilon, 1. - epsilon)
        return - tf.reduce_sum(target * tf.log(output),
                               reduction_indices=len(output.get_shape()) - 1)
If you look at the return, they sum it... :)
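To make the summation concrete, here is a minimal sketch using the modern tf.keras API (rather than the old backend code above). Because the loss sums -target * log(output) over the class axis, soft distillation targets work just as well as one-hot labels:
import tensorflow as tf

# soft "teacher" targets, as in distillation, instead of one-hot labels
target = tf.constant([[0.7, 0.2, 0.1]])
output = tf.constant([[0.6, 0.3, 0.1]])

loss = tf.keras.losses.categorical_crossentropy(target, output)
manual = -tf.reduce_sum(target * tf.math.log(output), axis=-1)
print(loss.numpy(), manual.numpy())  # both roughly 0.8286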

Related

No gradients provided for any variable error

I'm creating a model using the Keras functional API.
The layer architecture is as follows:
n = tf.keras.layers.Dense(1)(input)
for i in tf.range(n):
    output = tf.keras.layers.Dense(4)(input)
I then concatenate the outputs and return a tensor of shape [1, None, 4], where 1 is the batch dimension, None is n, and 4 is the output of the second dense layer.
My loss function involves comparing the shape of the output with the shape of the expected output, as well as comparing the outputs themselves.
loss = tf.convert_to_tensor(abs(tf.shape(logits)[1] - tf.shape(expected)[1])) * 100.
When running this on a custom training loop, I'm getting the error
ValueError: No gradients provided for any variable: (['while/dense/kernel:0',
'while/dense/bias:0', 'while/while/dense_1/kernel:0', 'while/while/dense_1/bias:0'],).
Provided `grads_and_vars` is ((None, <tf.Variable 'while/dense/kernel:0' shape=(786432, 1)
Shape is not differentiable; you cannot do things like this with gradient-based learning. Problems like this need to be tackled with more powerful tools, e.g. reinforcement learning, where one treats n as an action and gets a policy gradient for it.
A rule of thumb to remember is that you cannot really backprop through discrete objects. You need to produce floats, as gradients require smooth functions. In your case n should be an integer (what does a loop over a float mean?), so this should be your first warning sign. The other is shape itself, which is also an integer. A target can be discrete, but not a prediction. Note that even in classification we do not output a class, we output a probability, because probability is smooth.
You could build your model by assuming some maximum number N, treat it more like classification where you supervise N directly, and use some form of masking to keep all the results around, as in the sketch below.
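A minimal sketch of that idea; N_MAX, the input width, and the loss choices are all illustrative assumptions, not the asker's actual setup:
import tensorflow as tf

N_MAX = 8  # assumed upper bound on n

inputs = tf.keras.Input(shape=(128,))  # hypothetical input width

# Supervise n directly as a classification over 0..N_MAX-1 (differentiable).
n_logits = tf.keras.layers.Dense(N_MAX, name="n_head")(inputs)

# Always produce all N_MAX rows of 4 values; downstream code can mask
# everything past the predicted n instead of building a dynamic graph.
rows = tf.keras.layers.Dense(N_MAX * 4)(inputs)
rows = tf.keras.layers.Reshape((N_MAX, 4))(rows)

model = tf.keras.Model(inputs, [n_logits, rows])
model.compile(
    optimizer="adam",
    loss=[tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), "mse"],
)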

Loss function for comparing two vectors for categorization

I am performing an NLP task where I analyze a document and classify it into one of six categories. However, I do this operation at three different time periods. So the final output is an array of three integers (sparse), where each integer is a category 0-5. So a label looks like this: [1, 4, 5].
I am using BERT and am trying to decide what type of head I should attach to it, as well as what type of loss function I should use. Would it make sense to use BERT's output of size 1024 and run it through a Dense layer with 18 neurons, then reshape into something of size (3,6)?
Finally, I assume I would use Sparse Categorical Cross-Entropy as my loss function?
The BERT final hidden state has shape (512, 1024). You can either take the first token, which is the CLS token, or take the average pooling. Either way your final output has shape (1024,). Now simply put 3 linear layers of shape (1024, 6), as in nn.Linear(1024, 6), and pass their outputs into the loss function below. (You can make it more complex if you want to.)
Simply add up the losses and call backward. Remember you can call loss.backward() on any scalar tensor. (PyTorch)
import torch.nn as nn

def loss(time1output, time2output, time3output, time1label, time2label, time3label):
    loss1 = nn.CrossEntropyLoss()(time1output, time1label)
    loss2 = nn.CrossEntropyLoss()(time2output, time2label)
    loss3 = nn.CrossEntropyLoss()(time3output, time3label)
    return loss1 + loss2 + loss3
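For instance, a toy run under the assumptions above (a pooled BERT output of shape (batch, 1024), three hypothetical heads, and integer labels 0-5):
import torch
import torch.nn as nn

pooled = torch.randn(2, 1024)                   # stand-in for the pooled BERT output
heads = [nn.Linear(1024, 6) for _ in range(3)]  # one head per time period
outputs = [h(pooled) for h in heads]            # each of shape (2, 6)
labels = [torch.randint(0, 6, (2,)) for _ in range(3)]

total = loss(*outputs, *labels)  # the function defined above
total.backward()                 # works: total is a scalar tensor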
In a typical setup you take a CLS output of BERT (a vector of length 768 in case of bert-base and 1024 in case of bert-large) and add a classification head (it may be a simple Dense layer with dropout). In this case the inputs are word tokens and the output of the classification head is a vector of logits for each class, and usually a regular Cross-Entropy loss function is used. Then you apply softmax to it and get probability-like scores for each class, or if you apply argmax you will get the winning class. So the result might be either vector of classification scores [1x6] or the dominant class index (an integer).
(Figure illustrating this setup omitted; image taken from d2l.ai.)
You can simply concatenate 3 such networks (for each time period) to get the desired result.
Obviously, I have described only one possible solution. But as it usually provides good results, I suggest you try it before moving on to more complex ones.
Finally, Sparse Categorical Cross-Entropy loss is used when the labels are sparse class indices (say [4]) and regular Categorical Cross-Entropy loss is used when the labels are one-hot encoded (say [0 0 0 0 1 0]). Otherwise they are exactly the same.

Tensorflow with Keras: sparse_categorical_crossentropy

I'm new on StackOverflow and I also recently started to work with Tensorflow and Keras. Currently I'm developing an architecture using LSTM units. My question was partially discussed here:
What does the implementation of keras.losses.sparse_categorical_crossentropy look like?
However, in my model I have a predicted tensor, y_hat, of size (batch_size, seq_length, vocabulary_dimension) and the true labels, y, of size (batch_size, seq_length).
I would like to know how the value of the loss is computed when I call
loss = sparse_categorical_crossentropy(y, y_hat). How does the sparse_categorical_crossentropy function calculate the loss value starting from two tensors of different dimensions?
The cross entropy is a way to compare two probability distributions. That is, it says how different or similar the two are. It is a mathematical function defined on two arrays or continuous distributions as shown here.
The 'sparse' part in 'sparse_categorical_crossentropy' indicates that the y_true value must have a single value per row, e.g. [0, 2, ...] that indicates which outcome (category) was the right choice. The model then outputs the y_pred that must be like [[.99, .01, 0], [.01, .5, .49], ...]. Here, model predicts that the 0th category has a chance of .99 in the first row. This is very close to the true value, that is [1,0,0]. The sparse_categorical_crossentropy would then calculate a single number with two distributions using the above mentioned formula and return that number.
If you used a 'categorical_crossentropy' it would expect the y_true to be a one-hot encoded vector, like [[0,0,1], [0,1,0], ...].
If you would like to know the details in depth, you can take a look at the source.
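To see how the shapes from the question line up, here is a small sketch with toy sizes (when used as a model loss, Keras then averages these per-position values):
import numpy as np
import tensorflow as tf

batch_size, seq_length, vocab_size = 2, 3, 5  # toy sizes

# integer class labels, shape (batch_size, seq_length)
y = np.array([[0, 2, 4],
              [1, 1, 3]])

# predicted probabilities, shape (batch_size, seq_length, vocab_size)
y_hat = tf.nn.softmax(tf.random.normal((batch_size, seq_length, vocab_size)))

# one loss value per (sample, position): shape (batch_size, seq_length)
per_position = tf.keras.losses.sparse_categorical_crossentropy(y, y_hat)
print(per_position.shape)  # (2, 3)

# categorical_crossentropy with one-hot labels gives the same numbers
per_position_onehot = tf.keras.losses.categorical_crossentropy(
    tf.one_hot(y, depth=vocab_size), y_hat)
print(np.allclose(per_position, per_position_onehot))  # True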

from_logits=True and from_logits=False get different training result for tf.losses.CategoricalCrossentropy for UNet

I am doing an image semantic segmentation job with UNet. If I set the Softmax Activation for the last layer like this:
...
conv9 = Conv2D(n_classes, (3,3), padding = 'same')(conv9)
conv10 = (Activation('softmax'))(conv9)
model = Model(inputs, conv10)
return model
...
and then using loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
The training will not converge even for only one training image.
But if I do not set the Softmax Activation for the last layer like this:
...
conv9 = Conv2D(n_classes, (3,3), padding = 'same')(conv9)
model = Model(inputs, conv9)
return model
...
and then using loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
The training will converge for one training image.
My groundtruth dataset is generated like this:
X = []
Y = []
im = cv2.imread(impath)
X.append(im)
seg_labels = np.zeros((height, width, n_classes))
for spath in segpaths:
    mask = cv2.imread(spath, 0)
    seg_labels[:, :, c] += mask  # `c` is presumably the class index for this mask path
Y.append(seg_labels.reshape(width*height, n_classes))
Why? Is there something wrong for my usage?
This is my experiment code on GitHub: https://github.com/honeytidy/unet
You can check it out and run it (it can run on CPU). You can change the Activation layer and the from_logits argument of CategoricalCrossentropy and see what I said.
Pushing the "softmax" activation into the cross-entropy loss layer significantly simplifies the loss computation and makes it more numerically stable.
It might be the case that in your example the numerical issues are significant enough to render the training process ineffective for the from_logits=False option.
You can find a derivation of the cross entropy loss (a special case of "info gain" loss) in this post. This derivation illustrates the numerical issues that are averted when combining softmax with cross entropy loss.
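A toy illustration of how bad this can get; the extreme logits are chosen to force float32 underflow, and the exact printed value for the first loss depends on the epsilon clipping Keras applies internally:
import tensorflow as tf

logits = tf.constant([[100.0, 0.0, -100.0]])
y_true = tf.constant([[0.0, 1.0, 0.0]])

# Going through an explicit softmax underflows the correct class probability,
# and the clipping inside the loss then caps the computed loss:
probs = tf.nn.softmax(logits)
loss_probs = tf.keras.losses.CategoricalCrossentropy(from_logits=False)(y_true, probs)

# Fusing softmax into the loss computes it stably from the logits:
loss_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(y_true, logits)

print(loss_probs.numpy(), loss_logits.numpy())  # roughly 16.1 vs the true 100.0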
from_logits=True signifies that the values the loss function receives from the model are not normalized, and it is basically used when the model has no softmax activation; see for example https://www.tensorflow.org/tutorials/generative/dcgan, where no softmax activation function is used. Computing the loss directly from the logits also helps with numerical stability.
By default, all of the loss functions implemented in TensorFlow for classification problems use from_logits=False. Remember that in a classification problem, one usually wants the prediction to be produced in terms of probabilities.
So the sequence is: Neural Network ⇒ last-layer output (logits) ⇒ Softmax or Sigmoid function ⇒ probability of each class.
For example, in a multi-class classification problem where the outputs can be y1, y2, ..., yn, one wants to produce each output with some probability. This output layer is then compared with the true label in the cross-entropy loss function.
Let us take an example where our network produces output for a classification task: you convert that output into probabilities using the softmax function and calculate the loss using a cross-entropy loss function.
import tensorflow as tf

# output produced by the last layer of the NN (the logits)
nn_output_before_softmax = [3.2, 1.3, 0.2, 0.8]
# converting the output of the last layer into probabilities by applying softmax
nn_output_after_softmax = tf.nn.softmax(nn_output_before_softmax)
print(nn_output_after_softmax.numpy())
[0.77514964 0.11593805 0.03859243 0.07031998]
y_true = [1.0, 0.0, 0.0, 0.0]
Now there are two scenarios:
1) One is explicitly using a softmax (or sigmoid) function
2) One is not using a softmax function separately and wants to include it in the calculation of the loss function
1) One is explicitly using a softmax (or sigmoid) function
When one explicitly uses a softmax (or sigmoid) function for the classification task, the default option in the TensorFlow loss functions, from_logits=False, is the right one. TensorFlow then assumes that whatever you feed to the loss function are probabilities, so there is no need for it to apply the softmax function.
# By default from_logits=False
loss_taking_prob = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
loss_1 = loss_taking_prob(y_true, nn_output_after_softmax)
print(loss_1)
tf.Tensor(0.25469932, shape=(), dtype=float32)
2) One is not using the softmax function separately and wants to include it in the calculation of the loss function. This means that whatever inputs you provide to the loss function are not scaled (i.e. they are just numbers from -inf to +inf, not probabilities). Here you let TensorFlow perform the softmax operation for you.
loss_taking_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
loss_2 = loss_taking_logits(y_true, nn_output_before_softmax)
print(loss_2)
tf.Tensor(0.2546992, shape=(), dtype=float32)
Please do remember that using from_logits=False when it should be True means the loss takes a softmax of values that are already probabilities, which produces an incorrect model.
I guess the problem comes from the softmax activation function. Looking at the doc, I found that softmax is applied to the last axis by default. Can you look at model.summary() and check that this is what you want?
For softmax to work properly, you must make sure that:
1) You are using 'channels_last' as the Keras default channel config.
This means the shapes in the model will be like (None, height, width, channels).
This seems to be your case because you are putting n_classes in the last axis. But it's also strange because you are using Conv2D and your output Y should be (1, height, width, n_classes), not the strange shape you are using.
2) Your Y has only zeros and ones (not 0 and 255, as usually happens with images).
Check that Y.max() == 1 and Y.min() == 0. You may need Y = Y / 255.
3) Only one class is correct (your data does not have more than one path/channel with value = 1).
Check that (Y.sum(axis=-1) == 1).all() is True; a toy version of these checks follows.
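Putting those checks together on a toy label array (shapes and values are illustrative only):
import numpy as np

# toy stand-in for Y: (samples, height*width, n_classes), one-hot per pixel
Y = np.zeros((1, 4, 3))
Y[0, :, 0] = 1.0  # every pixel labeled as class 0

print(Y.max() == 1 and Y.min() == 0)  # True: values are 0/1, not 0/255
print((Y.sum(axis=-1) == 1).all())    # True: exactly one class per pixel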

how to reference one output from a multi-outputs with different dimension in Keras

Currently, I have this output from my model:
egen = keras.models.Model(egen_input, [classes, x])
where x has shape [None, 32, 32, 3] and classes has shape [None, 2]. How can I reference only part of the output in a custom loss function?
For example:
def customLoss():
    def loss(y_true, y_pred):
        return keras.losses.binary_crossentropy(y_true, y_pred[0])
    return loss
Currently, the above loss function gives me a mismatched-dimension error, yet if I just use y_pred, there is no error... very confused here.
Thanks!
If you want to use only classes, which is the first output, to calculate the loss, then you can set the loss_weights option (https://keras.io/models/model/) when compiling.
model.compile(...., loss_weights=[1.0, 0.0])
Note also that loss is computed for each output separately, then averaged (with equal weight at default) across outputs to obtain a single loss metric. So y_pred[0] does not mean classes, but the first element of classes and x.
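A minimal sketch of that setup, with layer sizes that are stand-ins for the question's model:
import tensorflow as tf

inp = tf.keras.Input(shape=(8,))
classes = tf.keras.layers.Dense(2, activation="sigmoid", name="classes")(inp)
x = tf.keras.layers.Dense(4, name="x")(inp)
model = tf.keras.Model(inp, [classes, x])

# One loss per output; weighting the loss of x by 0 means only `classes`
# contributes to the total training loss.
model.compile(optimizer="adam",
              loss=["binary_crossentropy", "mse"],
              loss_weights=[1.0, 0.0])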
EDIT:
"If it's the first element of classes and x, what would be the shape of y_pred[0]? A bit confused here."
Both! Keras computes the loss for classes and x separately, then takes the (weighted) average. So, if the loss function is defined as return keras.losses.binary_crossentropy(y_true, y_pred[0]) as in the question, Keras tries to calculate the loss with classes_true vs classes_pred[0], and with x_true vs x_pred[0], which raises a shape mismatch error.
