I’m trying to modify YOLO v1 to work with my task, in which each object has only 1 class (e.g. an object cannot be both cat and dog).
Due to the architecture (other outputs, such as the localization predictions, have to be treated as regression), a sigmoid was applied to the last output of the model (f.sigmoid(nearly_last_output)). For classification, YOLO v1 also uses MSE as the loss. But as far as I know, MSE sometimes does not work as well as cross-entropy for one-hot targets like the ones I want.
To be specific, the GT looks like this: 0 0 0 0 1 (say we have only 5 classes in total; each object has only 1 class, so there is exactly one 1 in the vector; in this example it is the 5th class),
and the model output at the classification part is: 0.1 0.1 0.9 0.2 0.1.
I found some suggestions to use nn.BCELoss / nn.BCEWithLogitsLoss, but I think I should ask here to be sure, since I’m not good at math and may be wrong somewhere. I just want to learn more and be certain: what should I use?
MSE loss is usually used for regression problems.
For binary classification, you can use either BCELoss or BCEWithLogitsLoss. BCEWithLogitsLoss combines a sigmoid with the BCE loss, so if a sigmoid is already applied on the last layer, you can use BCELoss directly.
The GT in your case describes a 'multi-class' classification problem, and the output shown doesn't really correspond to multi-class classification. So, in this case, you can apply CrossEntropyLoss, which combines softmax and log loss and is suitable for 'multi-class' classification problems.
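A minimal PyTorch sketch of just the classification term (not the full YOLO loss; it assumes raw class scores before any sigmoid/softmax):
import torch
import torch.nn as nn

# 5 raw class scores (logits) for one cell, compared against the one-hot GT "0 0 0 0 1",
# which CrossEntropyLoss expects as a class index rather than a one-hot vector.
logits = torch.tensor([[0.1, 0.1, 0.9, 0.2, 0.1]])  # raw scores, no sigmoid applied
target = torch.tensor([4])                          # index of the 1 in the one-hot GT

criterion = nn.CrossEntropyLoss()  # applies log-softmax + NLL internally
loss = criterion(logits, target)
print(loss.item())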
Related
When I use U-Net for semantic segmentation with two categories, I set the output of the last layer of the model to 1 channel and to 2 channels respectively, and measure each with the corresponding cross-entropy loss: BCELoss and CrossEntropyLoss.
But the gap between the two is large: the performance of the former is normal, while the latter has a very low precision and a high recall.
I used PyTorch.
Mathematically, BCE loss (with logits) is just a special case of cross-entropy loss for the case of two classes.
Are you using a sigmoid or a softmax on the output of the network? In PyTorch, CrossEntropyLoss takes the raw output of the last layer (no need to softmax the output); that is done internally for numerical stability.
BCELoss only takes inputs between 0 and 1, so a sigmoid is needed there. However, PyTorch has BCEWithLogitsLoss, which applies the sigmoid for you; this version is more stable.
One more thing it seems you are not doing correctly (it would be nicer to have some minimal code to better understand your problem): CrossEntropyLoss requires one channel per class, so if you have 2 classes, you have to give it an input with two channels. BCELoss (or its logits version) takes only one channel, with numbers ranging between 0 and 1. If I understood correctly, you are somehow giving it 2 channels; that will cause problems in training.
My best guess is that the gap in performance between the two is due to misuse of the loss functions. The PyTorch documentation has improved a lot; I recommend spending a few minutes understanding the difference between these three loss functions: https://pytorch.org/docs/stable/nn.html#loss-functions .
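To make the shape requirements concrete, here is a minimal PyTorch sketch (the tensor shapes, 2 images of 4x4, are assumed for illustration):
import torch
import torch.nn as nn

# 1-channel output -> BCEWithLogitsLoss; the target is a float mask of 0s and 1s with the same shape
logits_1ch = torch.randn(2, 1, 4, 4)
mask = torch.randint(0, 2, (2, 1, 4, 4)).float()
loss_bce = nn.BCEWithLogitsLoss()(logits_1ch, mask)

# 2-channel output -> CrossEntropyLoss; the target holds class indices (long), with no channel dimension
logits_2ch = torch.randn(2, 2, 4, 4)
labels = torch.randint(0, 2, (2, 4, 4))
loss_ce = nn.CrossEntropyLoss()(logits_2ch, labels)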
I am using a VGG16 model and have fine-tuned it on my data. I am predicting the ethnicity of images (faces). I have 5 output classes: White, Black, Asian, Sub-continent, and Others. Should I use softmax or sigmoid, and why?
Sigmoid: sigmoid(x_i) = 1 / (1 + e^(-x_i))
Softmax: softmax(x)_i = e^(x_i) / Σ_j e^(x_j)
When you use a softmax, you basically get a probability for each class (a joint distribution and a multinomial likelihood), and the probabilities are bound to sum to one.
In the case of softmax, increasing the output value of one class makes the others go down (because the sum is always 1). If you want to pick exactly one class (which is the case in your ethnicity classifier), you should use the softmax function. The character of this function is “there can be only one”, so it is ideally used in multi-class problems like yours.
Things are different for the sigmoid function. It can give you the top-n results based on a threshold. The feature of the sigmoid is that it can emphasize multiple values (yes, more than one, hence "multi-label") based on the threshold, and we use it for multi-label classification problems.
In general, if you are dealing with a multi-class classification problem, you should use a softmax, because you are guaranteed that the probabilities of all classes sum to 1 (they are weighted against each other to form a joint distribution), whereas with a sigmoid you would be predicting the probability of each class individually, not necessarily weighted against the others. If you are not careful and aware of the difference, you can run into issues with your output.
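A quick numerical illustration (the 5-class logits below are made up):
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.2])   # one made-up logit per class
probs_softmax = torch.softmax(logits, dim=0)         # classes compete, probabilities sum to 1
probs_sigmoid = torch.sigmoid(logits)                # independent per-class scores, sum unconstrained

print(probs_softmax.sum())  # tensor(1.)
print(probs_sigmoid.sum())  # generally != 1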
Please check the equation at this link and convert it into a Python loss function for a simple Keras model.
[equation picture/image link: the loss equation to be converted into the Keras loss function]
where the max part (the highlighted part of the equation in the picture) is the hinge loss, y_i represents the label of each example, φ(x) denotes the feature representation, b is a bias, k is the total number of training examples, and w is the classifier to be learned.
For easy reference, the equation is:
min_w  (1/k) * Σ_{i=1..k} max(0, 1 - y_i (w · φ(x_i) - b))  +  (1/2) ||w||^2
I can find the max part (the highlighted section of the equation in the picture), but I cannot find the 1/2 * ||w||^2 part.
You can check this link too for help -
similar link
Here I have attached some sample code to clarify my issue:
print("Create Model")
model = Sequential()
model.add(Dense(512,
input_dim=4096, init='glorot_normal',W_regularizer=l2(0.001),activation='relu'))
model.add(Dropout(0.6))
model.add(Dense(32, init='glorot_normal',W_regularizer=l2(0.001)))
model.add(Dropout(0.6))
model.add(Dense(1, init='glorot_normal',W_regularizer=l2(0.001),activation='sigmoid'))
adagrad=Adagrad(lr=0.01, epsilon=1e-08)
model.compile(loss= required_loss_function, optimizer=adagrad)
def required_loss_function(y_true, y_pred):
IN THIS LOSS FUNCTION,
CONVERT THE EQUATION IN THE
PICTURE INTO PYTHON CODE.
AS A MENTION, THE THING YOU HAVE TO FIND IS THE- 1/2 * ||w|| ^ 2 .
I can find the Python code for the remaining part of the equation in the linked picture; the hinge loss part can easily be computed with:
import keras
keras.losses.hinge(y_true, y_pred)
If you require further help, please comment for details.
Your screenshot shows the whole objective function, but only the sum(max(...)) term is called the loss term. Therefore, only that term needs to be implemented in required_loss_function. Indeed, you can probably use the pre-baked hinge loss from the Keras library rather than writing it yourself (unless you're supposed to write it yourself as part of the exercise, of course).
The other term, the 0.5*||w||^2 term, is a regularization term; specifically, it's an L2 regularization term. Keras has a completely separate way of dealing with regularization, which you can read about at https://keras.io/regularizers/ . Basically it amounts to creating an l2 instance, keras.regularizers.l2(lambdaParameter), and passing it to your layers (e.g. via the W_regularizer / kernel_regularizer argument). Your screenshotted equation doesn't have a parameter that scales the regularization term, so if that's literally what you're supposed to implement, your lambdaParameter would be 1.0.
But the listing you supply already seems to be applying l2 regularizers like this, multiple times in different contexts (I'm not very familiar with Keras, so I don't really know what's going on; I guess it's a more sophisticated model than the one represented in your screenshot).
Either way, the answer to your question is that the regularization term is handled separately and does not belong in the loss function (the signature of the loss function gives us that hint too: there's no w argument passed into it, only y_true and y_pred).
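If you do end up writing the hinge term yourself, a minimal sketch using Keras backend ops could look like the following. This assumes the labels in y_true are encoded as -1/+1, as the hinge formulation requires, and it deliberately leaves out the 1/2 * ||w||^2 term, which the l2 regularizers attached to the layers already handle:
from keras import backend as K

def required_loss_function(y_true, y_pred):
    # hinge term only: mean over examples of max(0, 1 - y_i * prediction_i)
    return K.mean(K.maximum(0., 1. - y_true * y_pred))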
When considering the problem of classifying an input into one of 2 classes, 99% of the examples I have seen use an NN with a single output and a sigmoid activation, followed by a binary cross-entropy loss. Another option I thought of is having the last layer produce 2 outputs and using a categorical cross-entropy with C=2 classes, but I have never seen it in any example.
Is there any reason for that?
Thanks
If you use a softmax on top of the two-output network, you get an output that is mathematically equivalent to using a single output with a sigmoid on top.
Do the math and you'll see.
In practice, from my experience, if you look at the raw "logits" of the two-output net (before the softmax), you'll see that one is exactly the negative of the other. This is a result of the gradients pulling each neuron in exactly opposite directions.
Therefore, since both approaches are equivalent, and the single-output configuration has fewer parameters and requires less computation, it is more advantageous to use a single output with a sigmoid on top.
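Concretely, for two logits z1 and z2, softmax gives p(class 2) = e^(z2) / (e^(z1) + e^(z2)) = 1 / (1 + e^-(z2 - z1)) = sigmoid(z2 - z1), so the two-output softmax only ever depends on the difference of the logits, which is exactly what a single sigmoid output models. A quick PyTorch check with made-up logits:
import torch

z = torch.randn(4, 2)                           # two-output "logits" for 4 examples
p_softmax = torch.softmax(z, dim=1)[:, 1]       # P(class 1) from the two-output softmax
p_sigmoid = torch.sigmoid(z[:, 1] - z[:, 0])    # sigmoid of the logit difference
print(torch.allclose(p_softmax, p_sigmoid))     # True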
I am trying to perform binary classification using TensorFlow (v1.1.0) with a single neuron at the output layer. The snippet below corresponds to the loss function and optimizer I am currently using (inspired by the answer here).
ratio = 0.034  # minority/population ratio
learning_rate = 0.001
class_weight = tf.constant([[ratio, 1.0 - ratio]], name='unbalanced_ratio')  # weight vector (lab_feed is one-hot labels)
weight_per_label = tf.transpose(tf.matmul(lab_feed, tf.transpose(class_weight)), name='weights_per_label')
xent = tf.multiply(weight_per_label, tf.nn.sigmoid_cross_entropy_with_logits(labels=lab_feed, logits=output), name='loss')
loss = tf.reduce_mean(xent)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate, name='GradientDescent').minimize(loss)
My issue, however, is that for some reason all instances end up classified as the same class as the epochs progress. Do I have to stop training early, or is there something wrong with the loss function?
You are misusing sigmoid cross-entropy as if it were softmax cross-entropy.
Sigmoid cross-entropy is adapted to binary classification — your problem is binary classification, so that's fine. But then, the output of your net should have only one channel per binary classification task — in your case, you have a single binary classification task, so your net should have one output channel only.
To balance a sigmoid cross-entropy you need to balance each individual part of the cross-entropy, i.e. the part coming from the positive and the part coming from the negative. This cannot be done on the output, as you are doing, because the output is already a sum of the positive and negative parts.
Fortunately, there is a function in TensorFlow to do just that: tf.nn.weighted_cross_entropy_with_logits. Its use is similar to tf.nn.sigmoid_cross_entropy_with_logits, with an additional parameter corresponding to the weight of the positive class.
What you are currently doing is having two binary classifiers on two different channels, and sending only the negative samples to the first and the positive samples to the second. This cannot produce anything useful.
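For reference, a sketch of the corrected setup with the TF 1.x API, assuming output is reduced to a single channel and lab_feed holds 0/1 labels of the same shape (the pos_weight value below is one common choice, not the only one):
import tensorflow as tf

ratio = 0.034                        # minority/population ratio, as in the question
pos_weight = (1.0 - ratio) / ratio   # up-weight the rare positive class
xent = tf.nn.weighted_cross_entropy_with_logits(
    targets=lab_feed, logits=output, pos_weight=pos_weight)
loss = tf.reduce_mean(xent)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)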