I'm new to Stack Overflow and I also recently started working with TensorFlow and Keras. Currently I'm developing an architecture using LSTM units. My question was partially discussed here:
What does the implementation of keras.losses.sparse_categorical_crossentropy look like?
However, in my model I have a predicted tensor, y_hat, of size (batch_size, seq_length, vocabulary_dimension) and the true labels, y, of size (batch_size, seq_length).
I would like to know how the value of the loss is computed when I call
loss = sparse_categorical_crossentropy(y, y_hat): how does sparse_categorical_crossentropy calculate the loss value starting from two tensors of different dimensions?
The cross entropy is a way to compare two probability distributions. That is, it says how different or similar the two are. It is a mathematical function defined on two arrays or continuous distributions as shown here.
The 'sparse' part in 'sparse_categorical_crossentropy' indicates that the y_true value must have a single value per row, e.g. [0, 2, ...], that indicates which outcome (category) was the right choice. The model then outputs y_pred, which must be like [[.99, .01, 0], [.01, .5, .49], ...]. Here, the model predicts that the 0th category has a chance of .99 in the first row. This is very close to the true value, that is [1, 0, 0]. sparse_categorical_crossentropy would then calculate a single number from the two distributions using the above-mentioned formula and return that number.
If you used a 'categorical_crossentropy' it would expect the y_true to be a one-hot encoded vector, like [[0,0,1], [0,1,0], ...].
If you would like to know the details in depth, you can take a look at the source.
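As a minimal sketch of how the two shapes from the question line up (assuming TensorFlow 2.x; the sizes here are made up): the last axis of y_hat is indexed by the integer labels in y, so one loss value is produced per sample and per timestep.

import tensorflow as tf

batch_size, seq_length, vocab_dim = 2, 3, 5

y = tf.constant([[0, 2, 1],
                 [3, 4, 0]])                   # integer class ids, shape (2, 3)
y_hat = tf.nn.softmax(tf.random.uniform((batch_size, seq_length, vocab_dim)), axis=-1)

# For each (sample, timestep) this computes -log(y_hat[b, t, y[b, t]])
loss = tf.keras.losses.sparse_categorical_crossentropy(y, y_hat)
print(loss.shape)  # (2, 3); take e.g. the mean to get a single scalar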
Related
I am performing a NLP task where I analyze a document and classify it into one of six categories. However, I do this operation at three different time periods. So the final output is an array of three integers (sparse), where each integer is the category 0-5. So a label looks like this: [1, 4, 5].
I am using BERT and am trying to decide what type of head I should attach to it, as well as what type of loss function I should use. Would it make sense to use BERT's output of size 1024 and run it through a Dense layer with 18 neurons, then reshape into something of size (3,6)?
Finally, I assume I would use Sparse Categorical Cross-Entropy as my loss function?
The BERT final hidden state has shape (512, 1024). You can either take the first token, which is the CLS token, or take the average pooling. Either way your final output has shape (1024,). Now simply add 3 linear layers of shape (1024, 6), as in nn.Linear(1024, 6), and pass their outputs into the loss function below (you can make it more complex if you want to).
Simply add up the losses and call backward. Remember you can call loss.backward() on any scalar tensor (PyTorch).
import torch.nn as nn

def loss(time1output, time2output, time3output, time1label, time2label, time3label):
    # one cross-entropy term per time period, summed into a single scalar
    loss1 = nn.CrossEntropyLoss()(time1output, time1label)
    loss2 = nn.CrossEntropyLoss()(time2output, time2label)
    loss3 = nn.CrossEntropyLoss()(time3output, time3label)
    return loss1 + loss2 + loss3
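A rough sketch of the heads described above (assuming a pooled BERT output of size 1024; the tensors and batch size here are made up for illustration):

import torch
import torch.nn as nn

pooled = torch.randn(8, 1024)            # e.g. the CLS vector for a batch of 8 documents

head1 = nn.Linear(1024, 6)               # one head per time period, 6 categories each
head2 = nn.Linear(1024, 6)
head3 = nn.Linear(1024, 6)

time1output = head1(pooled)              # logits of shape (8, 6)
time2output = head2(pooled)
time3output = head3(pooled)

# one integer label in [0, 5] per sample and time period
time1label = torch.randint(0, 6, (8,))
time2label = torch.randint(0, 6, (8,))
time3label = torch.randint(0, 6, (8,))

total = loss(time1output, time2output, time3output,
             time1label, time2label, time3label)   # uses the loss() defined above
total.backward()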
In a typical setup you take the CLS output of BERT (a vector of length 768 in the case of bert-base and 1024 in the case of bert-large) and add a classification head (it may be a simple Dense layer with dropout). In this case the inputs are word tokens and the output of the classification head is a vector of logits for each class, and usually a regular Cross-Entropy loss function is used. Then you apply softmax to it and get probability-like scores for each class, or if you apply argmax you will get the winning class. So the result might be either a vector of classification scores [1x6] or the dominant class index (an integer).
(Figure omitted; image taken from d2l.ai.)
You can simply concatenate 3 such networks (for each time period) to get the desired result.
Obviously, I have described only one possible solution. But as it usually provides good results, I suggest you try it before moving on to more complex ones.
Finally, Sparse Categorical Cross-Entropy loss is used when the labels are sparse integer indices (say [4]) and regular Categorical Cross-Entropy loss is used when the labels are one-hot encoded (say [0 0 0 0 1 0]). Otherwise they are exactly the same.
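For reference, a minimal Keras sketch of the Dense(18) + reshape idea from the question (the BERT encoder itself is omitted; pooled stands in for the CLS vector, and the layer sizes and names here are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

pooled = tf.keras.Input(shape=(1024,))          # stand-in for BERT-large's CLS vector
x = layers.Dropout(0.1)(pooled)
logits = layers.Dense(18)(x)                    # 3 time periods x 6 categories
logits = layers.Reshape((3, 6))(logits)
model = tf.keras.Model(pooled, logits)

# labels of shape (batch, 3), e.g. [[1, 4, 5], ...], match the (batch, 3, 6) logits
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))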
I'm trying to implement my custom loss function in Keras using the TensorFlow backend. The idea is for the neural network to output coefficients for Gaussians and compare the sum of four Gaussians to the target values. So we're fitting Gaussians to the data. I'd like to have y_pred in the form of [a_0, b_0, c_0, a_1, ..., c_3], calculate the sum of a_i*e^(-(x-b_i)^2/(2c_i)), i=0,1,2,3, and then work out, for example, the mean absolute error comparing this function to y_true. What I tried was
def gauss_loss(y_true, y_pred):
    # zs is the size of y_true
    # the size of y_pred is 12
    xs = np.linspace(0, 1, zs)
    gauss_sum = 0
    for i in range(0, 12, 3):
        gauss_sum += y_pred[:, i] * K.exp(-(xs - y_pred[:, i+1])**2 / (2 * y_pred[:, i+2]))
    return 1./zs * sum(K.abs(y_true - gauss_sum))
I get the error "TypeError: Tensor objects are not iterable when eager execution is not enabled. To iterate over this tensor use tf.map_fn".
However, I don't think I can use tf.map_fn either because it only accepts one argument so I can't use the first entry of y_pred as coefficient a and the next as b in the same formula.
All examples I find just use tensor operations for the entire matrix. It seems to me that this might not even be possible in Keras. Is this possible and if so, how is it done?
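For what it's worth, one vectorized approach might look like the sketch below, which avoids Python-level iteration entirely (an assumption here is that zs, the length of y_true, is known when the loss is built; the strided slices pull out the a, b, c coefficients by their positions in y_pred):

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

zs = 64  # assumed length of y_true

def gauss_loss(y_true, y_pred):
    xs = K.constant(np.linspace(0, 1, zs), dtype="float32")       # shape (zs,)
    a = y_pred[:, 0::3]                                           # shape (batch, 4)
    b = y_pred[:, 1::3]
    c = y_pred[:, 2::3]
    # broadcast to (batch, 4, zs) and sum the four Gaussians
    gauss = a[:, :, None] * K.exp(-(xs[None, None, :] - b[:, :, None]) ** 2
                                  / (2.0 * c[:, :, None]))
    gauss_sum = K.sum(gauss, axis=1)                              # shape (batch, zs)
    return K.mean(K.abs(y_true - gauss_sum), axis=-1)             # per-sample MAE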
I am trying to log AUC during training time of my model.
According to the documentation, tf.metrics.auc needs labels and predictions, both of the same shape.
But in my case of binary classification, the label is a one-dimensional tensor containing just the classes, and the prediction is a two-dimensional tensor containing the probability of each class for each data point.
How to calculate AUC in this case?
Let's have a look at the parameters in the function tf.metrics.auc:
labels: A Tensor whose shape matches predictions. Will be cast to bool.
predictions: A floating point Tensor of arbitrary shape and whose values are in the range [0, 1].
This operation already assumes a binary classification. That is, each element in labels states whether the class is "positive" or "negative" for a single sample. It is not a 1-hot vector, which requires a vector with as many elements as the number of exclusive classes.
Likewise, predictions represents the predicted binary class with some level of certainty (some people may call it a probability), and each element should also refer to one sample. It is not a softmax vector.
If the probabilities came from a neural network with a fully connected layer of 2 neurons and a softmax activation at the head of the network, consider replacing that with a single neuron and a sigmoid activation. The output can now be fed to tf.metrics.auc directly.
Otherwise, you can just slice the predictions tensor to only consider the positive class, which will represent the binary class just the same:
auc_value, auc_op = tf.metrics.auc(labels, predictions[:, 1])
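For completeness, a minimal TF 1.x usage sketch (tf.metrics.auc creates local variables that must be initialized, and the value is accumulated through the update op; the placeholder shapes and the example batch are made up):

import tensorflow as tf

labels = tf.placeholder(tf.int64, [None])             # 0/1 class per sample
predictions = tf.placeholder(tf.float32, [None, 2])   # softmax output over 2 classes

auc_value, auc_op = tf.metrics.auc(labels, predictions[:, 1])

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    # run auc_op once per batch, then read the accumulated value
    sess.run(auc_op, feed_dict={labels: [0, 1, 1, 0],
                                predictions: [[.9, .1], [.2, .8], [.4, .6], [.7, .3]]})
    print(sess.run(auc_value))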
I'm trying to train a network with unbalanced data. I have A (198 samples), B (436 samples), C (710 samples), D (272 samples) and I have read about "weighted_cross_entropy_with_logits", but all the examples I found are for binary classification, so I'm not very confident about how to set those weights.
Total samples: 1616
A_weight: 198/1616 = 0.12?
The idea behind this, if I understood correctly, is to penalize the errors of the majority class and weight the hits on the minority one more positively, right?
My piece of code:
weights = tf.constant([0.12, 0.26, 0.43, 0.17])
cost = tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(logits=pred, targets=y, pos_weight=weights))
I have read this one and other examples with binary classification, but it's still not very clear.
Note that weighted_cross_entropy_with_logits is the weighted variant of sigmoid_cross_entropy_with_logits. Sigmoid cross entropy is typically used for binary classification. Yes, it can handle multiple labels, but sigmoid cross entropy basically makes a (binary) decision on each of them -- for example, for a face recognition net, those (not mutually exclusive) labels could be "Does the subject wear glasses?", "Is the subject female?", etc.
In binary classification(s), each output channel corresponds to a binary (soft) decision. Therefore, the weighting needs to happen within the computation of the loss. This is what weighted_cross_entropy_with_logits does, by weighting one term of the cross-entropy over the other.
In mutually exclusive multilabel classification, we use softmax_cross_entropy_with_logits, which behaves differently: each output channel corresponds to the score of a class candidate. The decision comes after, by comparing the respective outputs of each channel.
Weighting in before the final decision is therefore a simple matter of modifying the scores before comparing them, typically by multiplication with weights. For example, for a ternary classification task,
# your class weights
class_weights = tf.constant([[1.0, 2.0, 3.0]])
# deduce weights for batch samples based on their true label
weights = tf.reduce_sum(class_weights * onehot_labels, axis=1)
# compute your (unweighted) softmax cross entropy loss
unweighted_losses = tf.nn.softmax_cross_entropy_with_logits(labels=onehot_labels, logits=logits)
# apply the weights, relying on broadcasting of the multiplication
weighted_losses = unweighted_losses * weights
# reduce the result to get your final loss
loss = tf.reduce_mean(weighted_losses)
You could also rely on tf.losses.softmax_cross_entropy to handle the last three steps.
In your case, where you need to tackle data imbalance, the class weights could indeed be inversely proportional to their frequency in your train data. Normalizing them so that they sum up to one or to the number of classes also makes sense.
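For the sample counts in the question, inverse-frequency weights might be computed like this (a sketch in NumPy; normalizing to sum to the number of classes instead would just multiply everything by 4):

import numpy as np

counts = np.array([198., 436., 710., 272.])   # samples per class A, B, C, D
inv_freq = 1.0 / (counts / counts.sum())      # inversely proportional to class frequency
class_weights = inv_freq / inv_freq.sum()     # normalize to sum to 1
print(class_weights)                          # roughly [0.41, 0.18, 0.11, 0.30]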
Note that in the above, we penalized the loss based on the true label of the samples. We could also have penalized the loss based on the estimated labels by simply defining
weights = class_weights
and the rest of the code need not change thanks to broadcasting magic.
In the general case, you would want weights that depend on the kind of error you make. In other words, for each pair of labels X and Y, you could choose how to penalize choosing label X when the true label is Y. You end up with a whole prior weight matrix, which results in weights above being a full (num_samples, num_classes) tensor. This goes a bit beyond what you want, but it might be useful to know nonetheless that only your definition of the weight tensor needs to change in the code above.
See this answer for an alternate solution which works with sparse_softmax_cross_entropy:
import tensorflow as tf
import numpy as np
np.random.seed(123)
sess = tf.InteractiveSession()
# let's say we have the logits and labels of a batch of size 6 with 5 classes
logits = tf.constant(np.random.randint(0, 10, 30).reshape(6, 5), dtype=tf.float32)
labels = tf.constant(np.random.randint(0, 5, 6), dtype=tf.int32)
# specify some class weightings
class_weights = tf.constant([0.3, 0.1, 0.2, 0.3, 0.1])
# specify the weights for each sample in the batch (without having to compute the onehot label matrix)
weights = tf.gather(class_weights, labels)
# compute the loss
tf.losses.sparse_softmax_cross_entropy(labels, logits, weights).eval()
Tensorflow 2.0 Compatible Answer: Migrating the Code specified in P-Gn's Answer to 2.0, for the benefit of the community.
# your class weights
class_weights = tf.compat.v2.constant([[1.0, 2.0, 3.0]])
# deduce weights for batch samples based on their true label
weights = tf.compat.v2.reduce_sum(class_weights * onehot_labels, axis=1)
# compute your (unweighted) softmax cross entropy loss
unweighted_losses = tf.compat.v2.nn.softmax_cross_entropy_with_logits(onehot_labels, logits)
# apply the weights, relying on broadcasting of the multiplication
weighted_losses = unweighted_losses * weights
# reduce the result to get your final loss
loss = tf.reduce_mean(weighted_losses)
For more information about the migration of code from Tensorflow Version 1.x to 2.x, please refer to this Migration Guide.
I'm implementing a Convolutional Neural Network in Tensorflow with python.
I'm in the following scenario: I've got a tensor of labels y (batch labels) like this:
y = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
where each row is a one-hot vector that represents a label related to the corresponding example. Now during training I want to stop the loss gradient (set it to 0) for the examples with this label (the third one):
[1,0,0]
which represents the n/a label, while the loss of the other examples in the batch is computed as usual.
For my loss computation I use a method like that:
self.y_loss = kl_divergence(self.pred_y, self.y)
I found this function that stops gradients, but how can I apply it conditionally to the batch elements?
If you don't want some samples to contribute to the gradients you could just avoid feeding them to the network during training at all. Simply remove the samples with that label from your training set.
Alternatively, since the loss is computed by summing over the KL-divergences for each sample, you could multiply the KL-divergence for each sample with either 1 if the sample should be taken into account and 0 otherwise before summing over them.
You can get the vector of values you need to multiply the individual KL-divergences with by subtracting the first column of the label tensor from 1: 1 - y[:,0]
For the kl_divergence function from the answer to your previous question it might look like this:
def kl_divergence(p, q):
    # mask out samples whose label is [1, 0, 0] (the n/a label) via 1 - p[:, 0]
    return tf.reduce_sum(tf.reduce_sum(p * tf.log(p / q), axis=1) * (1 - p[:, 0]))
where p is the ground-truth tensor and q is the predictions tensor.
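As a quick numeric check of the masking (a sketch with made-up per-sample losses):

import tensorflow as tf

y = tf.constant([[0., 1., 0.],
                 [0., 0., 1.],
                 [1., 0., 0.]])                 # third row is the n/a label
per_sample_kl = tf.constant([0.7, 0.4, 0.9])    # e.g. per-sample KL-divergences
mask = 1. - y[:, 0]                             # -> [1., 1., 0.]
masked_total = tf.reduce_sum(per_sample_kl * mask)   # the n/a sample contributes 0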