First of all, let me describe my question and situation.
I want to do multi-label classification in Chainer, and my class imbalance problem is very serious.
In such cases I must slice the output vector in order to compute the loss function. For example, in multi-label classification the ground-truth label vector has mostly 0 elements and only a few 1s; directly applying F.sigmoid_cross_entropy to all the 0/1 elements may cause training not to converge. So I decided to use a slice like a[[xx,xxx,...,xxx]] (where a is the chainer.Variable output by the last FC layer) to pick out specific elements for the loss computation.
Also, because the label imbalance can lead to poor classification performance on rare classes, I want to give rare ground-truth labels a high loss weight during back-propagation, and the majority labels (those that occur very often in the ground truth) a low weight.
How should I do this? What do you suggest for training a multi-label classifier with imbalanced classes in Chainer?
You can use sigmoid_cross_entropy() in no-reduce mode (by passing reduce='no') to obtain a loss value at each spatial location, and then use the average() function for weighted averaging.
sigmoid_cross_entropy() first computes the loss value at each spatial location and for each sample along the batch dimension, and then takes the mean or sum over the spatial and batch dimensions (depending on the normalize option). You can disable this reduction step by passing reduce='no'. If you want a weighted average, do so to get the raw loss value at each location, and reduce it yourself.
After that, the simplest way to do the weighted averaging manually is average(), which accepts a weights argument indicating the weights for the averaging. It first computes the weighted sum of the inputs and then divides the result by the sum of the weights. Build a weight array with the same shape as the input and pass it to average() along with the raw (unreduced) loss values obtained from sigmoid_cross_entropy(..., reduce='no'). It is also fine to multiply by a weight array manually and take the sum, as in F.sum(score * weight), provided weight is appropriately scaled (e.g. sums to 1).
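As a minimal sketch of that recipe (the weight values are placeholders; this assumes x holds the logits from the last FC layer and t is an int32 NumPy array of 0/1 labels of the same shape):
import numpy as np
import chainer.functions as F

def weighted_sigmoid_loss(x, t, pos_weight=10.0, neg_weight=1.0):
    # per-element loss, same shape as x and t
    loss = F.sigmoid_cross_entropy(x, t, reduce='no')
    # rare positive labels get a larger weight than the common negatives
    w = np.where(t == 1, pos_weight, neg_weight).astype(np.float32)
    # weighted average by hand; F.average(loss, weights=w) is the built-in alternative
    return F.sum(loss * w) / F.sum(w)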
If you are working on multi-label classification, how about using the softmax_cross_entropy loss?
softmax_cross_entropy can take class imbalance into account via its class_weight argument.
https://github.com/chainer/chainer/blob/v3.0.0rc1/chainer/functions/loss/softmax_cross_entropy.py#L57
https://docs.chainer.org/en/stable/reference/generated/chainer.functions.softmax_cross_entropy.html
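As a hedged sketch, assuming y holds the (N, n_class) logits and t the integer labels, per-class weighting could look like this (the weight values are made up):
import numpy as np
import chainer.functions as F

# larger weights for the rare classes
class_weight = np.array([0.2, 5.0, 5.0], dtype=np.float32)
loss = F.softmax_cross_entropy(y, t, class_weight=class_weight)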
I am performing an NLP task where I analyze a document and classify it into one of six categories. However, I do this at three different time periods, so the final output is an array of three integers (sparse), where each integer is a category 0-5. A label therefore looks like this: [1, 4, 5].
I am using BERT and am trying to decide what type of head I should attach to it, as well as what loss function I should use. Would it make sense to take BERT's output of size 1024, run it through a Dense layer with 18 neurons, and then reshape into something of size (3,6)?
Finally, I assume I would use Sparse Categorical Cross-Entropy as my loss function?
BERT's final hidden state has shape (512, 1024) (sequence length x hidden size for bert-large). You can either take the first token, which is the CLS token, or take the average pooling. Either way your final output has shape (1024,). Now simply add three linear layers of shape (1024, 6), as in nn.Linear(1024, 6), and pass their outputs into the loss function below. (You can make the head more complex if you want to.)
Simply add up the losses and call backward. Remember that you can call loss.backward() on any scalar tensor (PyTorch).
import torch.nn as nn

def loss(time1output, time2output, time3output, time1label, time2label, time3label):
    loss1 = nn.CrossEntropyLoss()(time1output, time1label)
    loss2 = nn.CrossEntropyLoss()(time2output, time2label)
    loss3 = nn.CrossEntropyLoss()(time3output, time3label)
    return loss1 + loss2 + loss3
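A minimal usage sketch of the three heads described above (shapes and data are illustrative; the pooled CLS vector is assumed to be computed already):
import torch
import torch.nn as nn

pooled = torch.randn(8, 1024)                      # CLS vector for a batch of 8
heads = nn.ModuleList(nn.Linear(1024, 6) for _ in range(3))
outputs = [head(pooled) for head in heads]         # three (8, 6) logit tensors
labels = torch.randint(0, 6, (3, 8))               # one label row per time period
total = loss(outputs[0], outputs[1], outputs[2],
             labels[0], labels[1], labels[2])
total.backward()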
In a typical setup you take the CLS output of BERT (a vector of length 768 for bert-base, or 1024 for bert-large) and add a classification head (it may be a simple Dense layer with dropout). The inputs are word tokens, the output of the classification head is a vector of logits for each class, and usually a regular Cross-Entropy loss function is used. You then apply softmax to get probability-like scores for each class, or argmax to get the winning class. So the result is either a vector of classification scores [1x6] or the dominant class index (an integer).
(Figure: BERT with a classification head; image taken from d2l.ai)
You can simply concatenate 3 such networks (for each time period) to get the desired result.
Obviously, I have described only one possible solution, but as it usually provides good results I suggest you try it before moving on to more complex ones.
Finally, Sparse Categorical Cross-Entropy loss is used when the target is a sparse class index (say [4]), and regular Categorical Cross-Entropy loss is used when the target is one-hot encoded (say [0 0 0 0 1 0]). Otherwise they are absolutely the same.
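A tiny illustration of that equivalence (TensorFlow/Keras, with made-up logits):
import tensorflow as tf

logits = tf.constant([[0.1, 0.2, 0.3, 0.1, 2.0, 0.4]])
one_hot = tf.constant([[0., 0., 0., 0., 1., 0.]])
sparse = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(tf.constant([4]), logits)
dense = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(one_hot, logits)
# sparse and dense hold the same value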
I am working on a neural network with PyTorch which simply maps points from the plane to real numbers, for example
model = nn.Sequential(nn.Linear(2,2),nn.ReLU(),nn.Linear(2,1))
What I want to do, since this network defines a map h:R^2->R, is to compute the gradient of this mapping h in the training loop. So for example
for it in range(epochs):
    pred = model(X_train)
    grad = torch.autograd.grad(pred, X_train)
    ...
The training set has been defined as a tensor requiring gradients. My problem is that even though the output for each fixed point is a scalar, since I am propagating a set of N=100 points, the output is actually an Nx1 tensor. This leads to the error: autograd can only compute the gradient of scalar functions.
In fact, with the small change
pred = torch.sum(model(X_train))
everything works perfectly. However, I am interested in all the individual gradients, so is there a way to compute all of them together?
Actually, computing the sum as above gives exactly the result I expect, of course, but I wanted to know whether this is the only possibility.
There are other possibilities, but using .sum() is the simplest way. Calling .sum() on the output vector and computing dpred/dinput will give you the desired result. Here is why:
pred = sum_i f(x_i), where i indexes the inputs x.
dpred/dinput is then the matrix [dpred/dx0, dpred/dx1, ..., dpred/dx(N-1)].
Consider dpred/dx0: it equals df(x0)/dx0, since every other term df(xj)/dx0 with j != 0 is 0.
PS: Please excuse the crappy mathematical expressions... SO does not support latex/math expressions.
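For completeness, a sketch of one alternative that returns all the per-sample gradients in a single call: because each output row depends only on its own input row, passing a vector of ones as grad_outputs gives the same result as looping over the samples.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
X_train = torch.randn(100, 2, requires_grad=True)
pred = model(X_train)  # shape (100, 1)
grads, = torch.autograd.grad(pred, X_train, grad_outputs=torch.ones_like(pred))
# grads[i] equals dh(x_i)/dx_i; grads has shape (100, 2)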
I'm working on a binary semantic segmentation task where one class covers only a very small fraction of any input image, so there are only a few pixels labeled with it. When using sparse_softmax_cross_entropy,
the overall error is easily decreased by ignoring this class. Now I'm looking for a way to weight the classes by a coefficient which penalizes misclassifications of this specific class more heavily than those of the other class.
The doc of the loss function states:
weights acts as a coefficient for the loss. If a scalar is provided, then the loss is simply scaled by the given value. If weights is a tensor of shape [batch_size], then the loss weights apply to each corresponding sample.
If I understand this correctly, it says that specific samples in a batch get weighted differently from others. But that is not what I'm looking for. Does anyone know how to implement a weighted version of this loss function where the weights scale the importance of a specific class rather than of samples?
To answer my own question:
The authors of the U-Net paper used a pre-computed weight map to handle imbalanced classes.
The Institute for Astronomy of ETH Zurich provides a TensorFlow-based U-Net package which contains a weighted version of the softmax cross-entropy (not sparse; they flatten their labels and logits first):
import numpy as np
import tensorflow as tf

# flat_labels: one-hot labels [n_pixels, n_classes]; flat_logits: raw outputs, same shape
class_weights = tf.constant(np.array(class_weights, dtype=np.float32))
# per-pixel weight: the weight of each pixel's ground-truth class
weight_map = tf.multiply(flat_labels, class_weights)
weight_map = tf.reduce_sum(weight_map, axis=1)
# unweighted per-pixel cross-entropy
loss_map = tf.nn.softmax_cross_entropy_with_logits_v2(logits=flat_logits, labels=flat_labels)
weighted_loss = tf.multiply(loss_map, weight_map)
loss = tf.reduce_mean(weighted_loss)
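For context, a hedged sketch of how the flattened tensors above might be produced (names and shapes are assumptions, with two classes):
# logits: [batch, height, width, 2] network output; labels: [batch, height, width] int
flat_logits = tf.reshape(logits, [-1, 2])
flat_labels = tf.reshape(tf.one_hot(labels, 2), [-1, 2])
class_weights = [1.0, 50.0]  # penalize mistakes on the rare class 50x more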
I am trying to log AUC during training of my model.
According to the documentation, tf.metrics.auc needs labels and predictions, both of the same shape.
But in my binary classification case, labels is a one-dimensional tensor containing just the classes, while predictions is two-dimensional, containing a probability for each class for each data point.
How do I calculate AUC in this case?
Let's have a look at the parameters in the function tf.metrics.auc:
labels: A Tensor whose shape matches predictions. Will be cast to bool.
predictions: A floating point Tensor of arbitrary shape and whose values are in the range [0, 1].
This operation already assumes a binary classification. That is, each element in labels states whether the class is "positive" or "negative" for a single sample. It is not a 1-hot vector, which requires a vector with as many elements as the number of exclusive classes.
Likewise, predictions represents the predicted binary class with some level of certainty (some people may call it a probability), and each element should also refer to one sample. It is not a softmax vector.
If the probabilities came from a neural network with a fully connected layer of 2 neurons and a softmax activation at the head of the network, consider replacing that with a single neuron and a sigmoid activation. The output can now be fed to tf.metrics.auc directly.
Otherwise, you can just slice the predictions tensor to only consider the positive class, which will represent the binary class just the same:
auc_value, auc_op = tf.metrics.auc(labels, predictions[:, 1])
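A minimal TF1-style sketch of the slicing approach (the data here is made up):
import numpy as np
import tensorflow as tf

labels = tf.placeholder(tf.int64, [None])
predictions = tf.placeholder(tf.float32, [None, 2])  # softmax output
auc_value, auc_op = tf.metrics.auc(labels, predictions[:, 1])

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())  # the AUC counters are local variables
    probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]], dtype=np.float32)
    y = np.array([0, 1, 1])
    sess.run(auc_op, feed_dict={labels: y, predictions: probs})
    print(sess.run(auc_value))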
My dataset already has weighted examples, and in this binary classification I also have far more of the first class than the second.
Can I use both sample_weight and further re-weight with class_weight in the model.fit() function?
Or do I first make a new array of new_weights and pass it to the fit function as sample_weight?
Edit:
To clarify further: I already have individual weights for each sample in my dataset, and to add to the complexity, the total sum of the sample weights of the first class is far greater than the total sample weights of the second class.
For example I currently have:
y = [0,0,0,0,1,1]
sample_weights = [0.01,0.03,0.05,0.02, 0.01,0.02]
so the sum of weights for class '0' is 0.11 and for class '1' is 0.03. So I should have:
class_weight = {0 : 1. , 1: 0.11/0.03}
I need to use both sample_weight AND class_weight features. If one overrides the other then I will have to create new sample_weights and then use fit() or train_on_batch().
So my question is, can I use both, or does one override the other?
You can surely do both if you want; the question is whether that is what you need. According to the Keras docs:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
sample_weight: Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data [...].
So, given that you mention that you "have far more of the first class compared to the second", I think you should go for the class_weight parameter. There you can indicate the ratio your dataset presents, so you can compensate for the imbalanced classes. sample_weight is more for when you want to define a weight or importance for each individual data element.
For example if you pass:
class_weight = {0 : 1. , 1: 50.}
you will be saying that every sample from class 1 counts as 50 samples from class 0, therefore giving more "importance" to your elements from class 1 (as you surely have fewer of those samples). You can tailor this to your own needs. More info on imbalanced datasets in this great question.
Note: To further compare the two parameters, keep in mind that passing class_weight as {0:1., 1:50.} would be equivalent to passing sample_weight as [1.,1.,1.,...,50.,50.,...], given samples whose classes are [0,0,0,...,1,1,...].
As we can see, it is more practical to use class_weight in this case, while sample_weight is useful in more specific cases where you actually want to give an "importance" to each sample individually. Using both can also be done if the case requires it, but keep in mind their cumulative effect.
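To illustrate the note above, a hedged sketch (model and X are placeholders for your own model and inputs):
import numpy as np

y = np.array([0, 0, 0, 1, 1])
# either let Keras weight the classes...
model.fit(X, y, class_weight={0: 1., 1: 50.})
# ...or build the equivalent per-sample weights yourself
model.fit(X, y, sample_weight=np.where(y == 1, 50., 1.))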
Edit: As per your new question, digging into the Keras source code it seems that sample_weights does indeed override class_weights; here is the piece of code that does it in the _standardize_weights method (line 499):
if sample_weight is not None:
    # ...does some error handling...
    return sample_weight  # simply returns the weights you passed
elif isinstance(class_weight, dict):
    # ...some error handling and computations...
    # then creates an array repeating the class weight to match your target classes
    weights = np.asarray([class_weight[cls] for cls in y_classes
                          if cls in class_weight])
    # ...more error handling...
    return weights
This means that you can only use one or the other, but not both. Therefore you will indeed need to multiply your sample_weights by the ratio you need to compensate for the imbalance.
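For the asker's concrete numbers, a minimal sketch of that multiplication:
import numpy as np

y = np.array([0, 0, 0, 0, 1, 1])
sample_weights = np.array([0.01, 0.03, 0.05, 0.02, 0.01, 0.02])
class_weight = {0: 1.0, 1: 0.11 / 0.03}
# fold the class ratio into the per-sample weights
new_weights = sample_weights * np.array([class_weight[c] for c in y])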
Update: As of the moment of this edit (March 27, 2020), looking at the source code of training_utils.standardize_weights() we can see that it now supports both class_weights and sample_weights:
Everything gets normalized to a single sample-wise (or timestep-wise)
weight array. If both sample_weights and class_weights are provided,
the weights are multiplied together.
To add a little to DarkCygnus' answer, for those who actually need to use class weights and sample weights simultaneously:
Here is the code that I use for generating sample weights when classifying multiclass temporal data in sequences
(targets is an array of dimensions [#temporal, #categories] with values in set(#classes); class_weights is an array of [#categories, #classes]).
The generated sequence has the same length as the targets array, and the common use case in batching is to pad the targets with zeros, and the sample weights as well, up to the same size, thus making the network ignore the padded data.
import numpy as np

def multiclass_temporal_class_weights(targets, class_weights):
    s_weights = np.ones((targets.shape[0],))
    # if we are counting the classes, the weights do not exist yet!
    if class_weights is not None:
        for i in range(len(s_weights)):
            weight = 0.0
            for itarget, target in enumerate(targets[i]):
                weight += class_weights[itarget][int(round(target))]
            s_weights[i] = weight
    return s_weights
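An illustrative call (shapes per the description above; the values are made up):
targets = np.array([[0, 2], [1, 0], [2, 1]])   # [#temporal=3, #categories=2]
class_weights = np.array([[1.0, 5.0, 5.0],     # [#categories=2, #classes=3]
                          [1.0, 2.0, 2.0]])
s_weights = multiclass_temporal_class_weights(targets, class_weights)
# s_weights has shape (3,): one weight per timestep, summed over categories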