I am trying to log AUC during training time of my model.
According to the documentation, tf.metrics.auc needs labels and predictions, both of the same shape.
But in my case of binary classification, the labels are a one-dimensional tensor containing just the classes, and the predictions are a two-dimensional tensor containing the probability of each class for each data point.
How to calculate AUC in this case?
Let's have a look at the parameters in the function tf.metrics.auc:
labels: A Tensor whose shape matches predictions. Will be cast to bool.
predictions: A floating point Tensor of arbitrary shape and whose values are in the range [0, 1].
This operation already assumes a binary classification. That is, each element in labels states whether the class is "positive" or "negative" for a single sample. It is not a one-hot vector, which would require a vector with as many elements as there are exclusive classes.
Likewise, predictions represents the predicted binary class with some level of certainty (some people may call it a probability), and each element should also refer to one sample. It is not a softmax vector.
If the probabilities came from a neural network with a fully connected layer of 2 neurons and a softmax activation at the head of the network, consider replacing that with a single neuron and a sigmoid activation. The output can now be fed to tf.metrics.auc directly.
Otherwise, you can just slice the predictions tensor to only consider the positive class, which will represent the binary class just the same:
auc_value, auc_op = tf.metrics.auc(labels, predictions[:, 1])
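For completeness, here is a minimal, self-contained sketch of how this is typically run under TF 1.x (where tf.metrics.auc lives); the labels and predictions are made up, and the local-variable initialization is needed because the metric keeps running totals:

import tensorflow as tf  # assumes TF 1.x

# Toy data: 1-D integer labels and an (N, 2) matrix of per-class probabilities.
labels = tf.constant([0, 1, 1, 0])
predictions = tf.constant([[0.9, 0.1],
                           [0.2, 0.8],
                           [0.4, 0.6],
                           [0.7, 0.3]])

# Slice out the probability of the positive class (column 1).
auc_value, auc_op = tf.metrics.auc(labels, predictions[:, 1])

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())  # the metric stores its totals in local variables
    sess.run(auc_op)                            # update the running totals (call once per batch)
    print(sess.run(auc_value))                  # read the accumulated AUC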
Related
I am performing a NLP task where I analyze a document and classify it into one of six categories. However, I do this operation at three different time periods. So the final output is an array of three integers (sparse), where each integer is the category 0-5. So a label looks like this: [1, 4, 5].
I am using BERT and am trying to decide what type of head I should attach to it, as well as what type of loss function I should use. Would it make sense to use BERT's output of size 1024 and run it through a Dense layer with 18 neurons, then reshape into something of size (3,6)?
Finally, I assume I would use Sparse Categorical Cross-Entropy as my loss function?
BERT's final hidden state has shape (512, 1024). You can either take the first token, which is the CLS token, or use average pooling. Either way your final output has shape (1024,). Now simply put 3 linear layers of shape (1024, 6), as in nn.Linear(1024, 6), and pass their outputs into the loss function below (you can make it more complex if you want to).
Simply add up the losses and call backward. Remember you can call loss.backward() on any scalar tensor (PyTorch).
import torch.nn as nn

def loss(time1output, time2output, time3output, time1label, time2label, time3label):
    # one cross-entropy term per time period, summed into a single scalar
    loss1 = nn.CrossEntropyLoss()(time1output, time1label)
    loss2 = nn.CrossEntropyLoss()(time2output, time2label)
    loss3 = nn.CrossEntropyLoss()(time3output, time3label)
    return loss1 + loss2 + loss3
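A minimal sketch of the head described above (the module name and the 1024-dimensional pooled vector are my assumptions, not part of the original answer):

import torch.nn as nn

class ThreePeriodHead(nn.Module):
    """Three independent 6-way classifiers on top of the pooled BERT output."""
    def __init__(self, hidden_size=1024, num_classes=6):
        super().__init__()
        self.head1 = nn.Linear(hidden_size, num_classes)
        self.head2 = nn.Linear(hidden_size, num_classes)
        self.head3 = nn.Linear(hidden_size, num_classes)

    def forward(self, pooled):  # pooled: (batch, 1024), e.g. the CLS vector or the average pooling
        return self.head1(pooled), self.head2(pooled), self.head3(pooled)

The three outputs can then be passed to the loss function above together with the three integer labels.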
In a typical setup you take the CLS output of BERT (a vector of length 768 in the case of bert-base and 1024 in the case of bert-large) and add a classification head (it may be a simple Dense layer with dropout). In this case the inputs are word tokens and the output of the classification head is a vector of logits, one per class, and usually a regular Cross-Entropy loss function is used. Then you apply softmax to it to get probability-like scores for each class, or if you apply argmax you get the winning class. So the result is either a vector of classification scores [1x6] or the dominant class index (an integer).
[Image taken from d2l.ai]
You can simply concatenate 3 such networks (for each time period) to get the desired result.
Obviously, I have described only one possible solution, but since it usually provides good results I suggest you try it before moving on to more complex ones.
Finally, Sparse Categorical Cross-Entropy loss is used when the labels are given as integer indices (say [4]), and regular Categorical Cross-Entropy loss is used when the labels are one-hot encoded (say [0 0 0 0 1 0]). Otherwise they are exactly the same.
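As a rough illustration (not from the original answer), here is how such a setup could look in Keras, assuming the pooled BERT vector is already available as a 1024-dimensional input and each of the three heads is trained with sparse categorical cross-entropy on its integer label:

import tensorflow as tf

pooled = tf.keras.Input(shape=(1024,), name="bert_cls_output")    # pooled BERT vector
dropped = tf.keras.layers.Dropout(0.1)(pooled)
outputs = [tf.keras.layers.Dense(6, name=f"period_{i}")(dropped)  # one 6-way head per time period
           for i in range(3)]

model = tf.keras.Model(pooled, outputs)
model.compile(
    optimizer="adam",
    # integer labels such as [1, 4, 5] -> one label per head, cross-entropy computed on logits
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)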
I'm new on StackOverflow and I also recently started to work with Tensorflow and Keras. Currently I'm developing an architecture using LSTM units. My question was partially discussed here:
What does the implementation of keras.losses.sparse_categorical_crossentropy look like?
However, in my model I have a predicted tensor, y_hat, of size (batch_size, seq_length, vocabulary_dimension) and the true labels, y, of size (batch_size, seq_length).
I would like to know how the value of the loss is computed when I call
loss = sparse_categorical_crossentropy(y, y_hat): how does the sparse_categorical_crossentropy function calculate the loss value starting from two tensors of different dimensions?
Cross-entropy is a way to compare two probability distributions. That is, it says how different or similar the two are. It is a mathematical function defined on two arrays or continuous distributions; for discrete distributions it is H(p, q) = -Σᵢ pᵢ log(qᵢ).
The 'sparse' part in 'sparse_categorical_crossentropy' indicates that the y_true value must have a single value per row, e.g. [0, 2, ...], that indicates which outcome (category) was the right choice. The model then outputs y_pred, which must be like [[.99, .01, 0], [.01, .5, .49], ...]. Here, the model predicts that the 0th category has a chance of .99 in the first row. This is very close to the true value, i.e. the one-hot vector [1, 0, 0]. sparse_categorical_crossentropy then calculates a single number from the two distributions using the above-mentioned formula and returns that number.
If you used a 'categorical_crossentropy' it would expect the y_true to be a one-hot encoded vector, like [[0,0,1], [0,1,0], ...].
If you would like to know the details in depth, you can take a look at the source.
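To make the shapes concrete, here is a small, made-up example with batch_size = 2, seq_length = 2 and vocabulary_dimension = 3; the loss is computed per token, so the result has the same shape as y:

import tensorflow as tf

y = tf.constant([[0, 2],
                 [1, 1]])                    # integer class ids, shape (2, 2)
y_hat = tf.constant([[[0.90, 0.05, 0.05],
                      [0.10, 0.20, 0.70]],
                     [[0.10, 0.80, 0.10],
                      [0.30, 0.40, 0.30]]])  # probabilities, shape (2, 2, 3)

# For each position, -log of the probability assigned to the true class id,
# i.e. one loss value per token: shape (2, 2).
loss = tf.keras.losses.sparse_categorical_crossentropy(y, y_hat)
print(loss)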
I'm working on a binary semantic segmentation task where the distribution of one class is very small across any input image, hence there are only a few pixels labeled with it. When using sparse_softmax_cross_entropy,
the overall error is easily decreased by ignoring this class. Now, I'm looking for a way to weight the classes by a coefficient which penalizes misclassifications of this specific class more heavily than those of the other class.
The doc of the loss function states:
weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If weights is a tensor of shape [batch_size], then the loss weights apply to each corresponding sample.
If I understand this correctly, it says that specific samples in a batch get weighted differently from others. But this is actually not what I'm looking for. Does anyone know how to implement a weighted version of this loss function where the weights scale the importance of a specific class rather than of samples?
To answer my own question:
The authors of the U-Net paper used a pre-computed weight-map to handle imbalanced classes.
The Institute for Astronomy of ETH Zurich provides a TensorFlow-based U-Net package which contains a weighted version of the softmax cross-entropy (not the sparse version; they flatten their labels and logits first):
# class_weights: one scalar per class; flat_labels and flat_logits: [num_pixels, num_classes]
class_weights = tf.constant(np.array(class_weights, dtype=np.float32))
weight_map = tf.multiply(flat_labels, class_weights)   # select each pixel's class weight via its one-hot label
weight_map = tf.reduce_sum(weight_map, axis=1)         # collapse to a per-pixel weight, shape [num_pixels]
loss_map = tf.nn.softmax_cross_entropy_with_logits_v2(logits=flat_logits, labels=flat_labels)
weighted_loss = tf.multiply(loss_map, weight_map)      # scale each pixel's loss by its class weight
loss = tf.reduce_mean(weighted_loss)
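For reference, the same idea can be written directly against sparse integer labels (this sketch is mine, not part of the package; the label and logit tensors are made up, and tf.gather looks up each pixel's class weight):

import numpy as np
import tensorflow as tf

labels = tf.constant(np.array([[[0, 1], [0, 0]]], dtype=np.int32))    # integer labels, shape (1, 2, 2)
logits = tf.constant(np.random.randn(1, 2, 2, 2).astype(np.float32))  # logits, shape (1, 2, 2, 2)

class_weights = tf.constant([0.1, 0.9], dtype=tf.float32)  # background, rare foreground
weight_map = tf.gather(class_weights, labels)               # per-pixel weight taken from its true class
loss_map = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(loss_map * weight_map)                # class-weighted mean loss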
I am defining a weighted mean squared error in Keras as follows:
import numpy as np
import tensorflow as tf
from keras import backend as K

def weighted_mse(yTrue, yPred):
    data_weights = [w0, w1, w2, w3]  # one weight per output dimension, defined elsewhere
    data_weights_np = np.asarray(data_weights, np.float32)
    weights = tf.convert_to_tensor(data_weights_np, np.float32)
    return K.mean(weights * K.square(yTrue - yPred))
I have a list of weights, one for each of the predicted outputs. The predictions have shape, for example, (25, 4); they are generated via a final dense layer with dimension 4. I wish to weight these predictions in the mean squared error, so I generate a tensor of weights and multiply it with the squared error. Is this the correct way to do so?
Because, when I print the shapes of the tensors using tf.shape, for yTrue and yPred it shows:
Tensor("loss_19/dense_20_loss/Shape:0", shape=(3,), dtype=int32)
and for weights:
Tensor("loss_19/dense_20_loss/Shape_2:0", shape=(1,), dtype=int32)
The Keras API already provides a mechanism to supply weights, for example via the model.fit function. From the documentation:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
sample_weight: Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. In this case you should make sure to specify sample_weight_mode="temporal" in compile().
If you have a weight for each sample, you can pass the NumPy array as sample_weight to achieve the same effect without writing your own loss function.
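As an illustrative sketch (model, x_train and y_train are assumed to exist; the weighting scheme is made up), per-sample weighting then looks like this:

import numpy as np

# x_train, y_train and model are assumed to exist; one weight per training sample
sample_weights = np.ones(len(x_train), dtype=np.float32)
sample_weights[:100] = 2.0   # e.g. make the first 100 samples count twice as much in the loss

model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, sample_weight=sample_weights, epochs=10)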
First of all, let me describe my question and situation.
I want to do multi-label classification in Chainer, and my class imbalance problem is very serious.
In such cases I must slice the vector in order to calculate the loss function. For example, in multi-label classification most elements of the ground-truth label vector are 0 and only a few of them are 1. In this situation, directly applying F.sigmoid_cross_entropy to all the 0/1 elements may cause training not to converge, so I decided to use a slice a[[xx,xxx,...,xxx]] (where a is the chainer.Variable output by the last FC layer) to select specific elements for the loss calculation.
In this case, because the label imbalance may lead to low classification performance on rare classes, I want to give the rare ground-truth labels a high loss weight during back-propagation, and give the frequent labels (those that occur too often in the ground truth) a low weight.
How should I do this? What do you suggest for training on an imbalanced multi-label classification problem in Chainer?
You can use sigmoid_cross_entropy() in no-reduce mode (by passing reduce='no') to obtain a loss value at each spatial location, and then use the average function for weighted averaging.
sigmoid_cross_entropy() first computes the loss value at each spatial location and for each data point along the batch dimension, and then takes the mean or sum over the spatial dimensions and the batch dimension (depending on the normalize option). You can disable the reduction part by passing reduce='no'. If you want to do a weighted average, you should specify that, so that you get the loss value at each location and can reduce them yourself.
After that, the simplest way to do weighted averaging manually is to use average(), which accepts a weight argument that indicates the weights for averaging. It first does a weighted summation of the input and the weights, and then divides the result by the summation of the weights. You can build an appropriate weight array with the same shape as the input and pass it to average() along with the raw (unreduced) loss values obtained from sigmoid_cross_entropy(..., reduce='no'). It is also fine to multiply a weight array manually and take the sum, like F.sum(score * weight), if weight is appropriately scaled (e.g. sums up to 1).
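A minimal sketch of the second variant (manual weighted sum) with made-up data; the weighting scheme and the values are only illustrative:

import numpy as np
import chainer.functions as F

y = np.array([[2.0, -1.0, 0.5, -3.0]], dtype=np.float32)  # raw scores from the last FC layer
t = np.array([[1, 0, 1, 0]], dtype=np.int32)              # 0/1 multi-hot ground-truth labels

loss_map = F.sigmoid_cross_entropy(y, t, reduce='no')     # per-element loss, same shape as t

weight = np.where(t == 1, 5.0, 1.0).astype(np.float32)    # rare (positive) labels count more
weight /= weight.sum()                                    # scale the weights to sum to 1
loss = F.sum(loss_map * weight)                           # weighted average as a single scalar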
If you are working on multi-label classification, how about using the softmax_cross_entropy loss?
softmax_cross_entropy can take class imbalance into account through its class_weight argument.
https://github.com/chainer/chainer/blob/v3.0.0rc1/chainer/functions/loss/softmax_cross_entropy.py#L57
https://docs.chainer.org/en/stable/reference/generated/chainer.functions.softmax_cross_entropy.html
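A small sketch of the class_weight option with made-up scores and weights (the per-class weights are only illustrative):

import numpy as np
import chainer.functions as F

y = np.array([[2.0, -1.0, 0.5]], dtype=np.float32)          # scores for 3 classes
t = np.array([2], dtype=np.int32)                           # ground-truth class index
class_weight = np.array([0.5, 1.0, 5.0], dtype=np.float32)  # penalize mistakes on the rare class 2 more

loss = F.softmax_cross_entropy(y, t, class_weight=class_weight)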