Softmax or sigmoid for multiclass problem - python

I am using the VGG16 model and fine-tuned it on my data. I am predicting the ethnicity of face images. I have 5 output classes: White, Black, Asian, Sub-continent and Others. Should I use softmax or sigmoid, and why?

Sigmoid: sigmoid(z) = 1 / (1 + e^(-z))
Softmax: softmax(z)_i = e^(z_i) / sum_j e^(z_j)
When you use a softmax, you basically get a probability for each class (a joint distribution and a multinomial likelihood) whose sum is bound to be one.
In the case of softmax, increasing the output value of one class makes the others go down (because the sum is always 1). If you want to predict exactly one value (which is the case in your ethnicity classifier), you should use the softmax function. The character of this function is "there can be only one", so it is ideally suited to multi-class problems like yours.
Things are different for the sigmoid function. This function can provide the top n results based on a threshold. The feature of the sigmoid is that it can emphasize multiple values (yes, more than one, hence "multi-label") based on a threshold, and we use it for multi-label classification problems.

In general, if you are dealing with a multi-class classification problem, you should use a softmax, because you are guaranteed that the probabilities of all classes sum to 1: the outputs are weighted against each other to form a joint distribution. With a sigmoid you'd be predicting the probability of each class individually, not necessarily normalized against the others. If you are not careful and aware of the difference, you can run into issues with your output.
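To make this concrete, here is a minimal Keras sketch (the input size, pooling and optimizer are assumptions, not your actual setup) contrasting the two heads on top of a VGG16 feature extractor:

import tensorflow as tf

# Shared VGG16 backbone (ImageNet weights, global average pooling).
base = tf.keras.applications.VGG16(include_top=False, pooling='avg',
                                   input_shape=(224, 224, 3))
features = base.output

# Multi-class head ("exactly one ethnicity per face"): 5 units + softmax,
# trained with categorical cross-entropy. This is what your problem needs.
multiclass_out = tf.keras.layers.Dense(5, activation='softmax')(features)
multiclass_model = tf.keras.Model(base.input, multiclass_out)
multiclass_model.compile(optimizer='adam', loss='categorical_crossentropy',
                         metrics=['accuracy'])

# Multi-label head (several labels can be active at once): sigmoid per unit,
# trained with binary cross-entropy. Shown only for contrast.
multilabel_out = tf.keras.layers.Dense(5, activation='sigmoid')(features)
multilabel_model = tf.keras.Model(base.input, multilabel_out)
multilabel_model.compile(optimizer='adam', loss='binary_crossentropy')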

Related

Mixed linear model with Probabilistic Layers Regression

I'm interested in fitting a linear mixed model using the variational inference capabilities of TensorFlow Probability and Keras. However, I cannot find a straightforward answer on how to implement such an analysis. Using the regression example in TF Probability (see Case 3 here), I am able to grasp how to fit these models if we have only random variables in the model (the example is regression using a single feature). Following the radon example here, we have two features: floor (fixed) and county (random). My understanding is that the latter should only be passed to the DenseVariational layers, while the former can be passed to a regular Dense layer. So I guess I would have to jointly train two networks, one for the fixed and one for the random features, and somehow merge their outputs (a rough sketch of this idea follows after the questions below).
So my questions are:
(1) If these are fit jointly, can the same loss function be applied to both? I often see mean squared error used, while in VI the negative log-likelihood is used (I think this is equivalent to maximizing the evidence lower bound).
(2) Does the input need to be split beforehand and fed as input to two networks?
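A minimal sketch of that two-branch idea, assuming TF 2.x with TensorFlow Probability; the posterior/prior factories follow the pattern from the TFP regression tutorial, and the county count, training-set size, KL weight and observation noise scale are placeholder assumptions:

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Mean-field posterior and trainable prior, as in the TFP regression tutorial.
def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    c = np.log(np.expm1(1.0))
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(2 * n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t[..., :n],
                       scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])

def prior_trainable(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t, scale=1.0),
            reinterpreted_batch_ndims=1)),
    ])

n_counties = 85   # placeholder: number of county levels
n_train = 900     # placeholder: number of training rows (used for the KL weight)

floor_in = tf.keras.Input(shape=(1,), name='floor')             # fixed effect
county_in = tf.keras.Input(shape=(n_counties,), name='county')  # one-hot random effect

fixed = tf.keras.layers.Dense(1)(floor_in)  # plain Dense for the fixed part
random = tfp.layers.DenseVariational(1, posterior_mean_field, prior_trainable,
                                     kl_weight=1 / n_train)(county_in)

mu = tf.keras.layers.Add()([fixed, random])
out = tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1.0))(mu)

model = tf.keras.Model([floor_in, county_in], out)

# (1) One loss for the whole model: the negative log-likelihood. The KL term
# from the variational layer is added automatically through the layer's losses,
# so minimizing this is (up to constants) maximizing the ELBO.
negloglik = lambda y, rv_y: -rv_y.log_prob(y)
model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss=negloglik)

# (2) The inputs are split as two named Keras inputs and fed together, e.g.:
# model.fit({'floor': floor_array, 'county': county_onehot}, y, epochs=...)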

When using the cross-entropy function for binary classification, a big gap between a scalar model output and a two-channel output

When I use U-Net for semantic segmentation with two categories, I set the output of the last layer of the model to 1 channel and 2 channels respectively, and then measure with a cross-entropy loss: BCELoss and CrossEntropyLoss.
But the gap between the two is large. The performance of the former is normal, but the latter has a very low precision and a very high recall.
I used PyTorch.
Mathematically, binary cross-entropy (on logits) is just a special case of cross-entropy loss for the case of two classes.
Are you using a sigmoid or a softmax at the output of the network? In PyTorch, CrossEntropyLoss takes the raw output of the last layer (there is no need to apply softmax to the output); that is done internally for numerical stability.
BCELoss only takes inputs between 0 and 1, so a sigmoid is needed there. However, PyTorch has BCEWithLogitsLoss, which applies the sigmoid for you; this version is more stable.
One more thing that it seems you are not doing correctly (it would be nicer to have a minimal amount of code to better understand your problem): CrossEntropyLoss requires one channel per class, so if you have 2 classes, you have to give it an input with two channels. The logit-based BCE only takes one channel, with numbers ranging between 0 and 1. If I understood correctly, you are somehow giving it 2 channels, and that will cause problems in training.
My best guess is that the gap in performance between the two is due to misuse of the loss functions. The PyTorch documentation has improved a lot; I recommend spending a few minutes understanding the difference between each of those three loss functions: https://pytorch.org/docs/stable/nn.html#loss-functions .
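As a concrete illustration, here is a minimal PyTorch sketch (shapes are made up) of the two correct setups:

import torch
import torch.nn as nn

batch, h, w = 4, 64, 64

# Setup A: one output channel + BCEWithLogitsLoss (sigmoid applied internally).
logits_1ch = torch.randn(batch, 1, h, w)                     # raw network output
target_mask = torch.randint(0, 2, (batch, 1, h, w)).float()  # 0/1 mask, same shape
loss_a = nn.BCEWithLogitsLoss()(logits_1ch, target_mask)

# Setup B: two output channels + CrossEntropyLoss (softmax applied internally).
logits_2ch = torch.randn(batch, 2, h, w)                     # one channel per class
target_idx = torch.randint(0, 2, (batch, h, w))              # class indices, no channel dim
loss_b = nn.CrossEntropyLoss()(logits_2ch, target_idx)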

Binary cross entropy Vs categorical cross entropy with 2 classes

When considering the problem of classifying an input into one of 2 classes, 99% of the examples I saw used an NN with a single output and sigmoid as its activation, followed by a binary cross-entropy loss. Another option that I thought of is having the last layer produce 2 outputs and use a categorical cross-entropy with C=2 classes, but I never saw it in any example.
Is there any reason for that?
Thanks
If you use a softmax on top of the two-output network, you get an output that is mathematically equivalent to using a single output with a sigmoid on top.
Do the math and you'll see: softmax([z0, z1])_1 = e^(z1) / (e^(z0) + e^(z1)) = 1 / (1 + e^(-(z1 - z0))) = sigmoid(z1 - z0), so the predicted probability depends only on the difference of the two logits.
In practice, from my experience, if you look at the raw "logits" of the two-output net (before the softmax) you'll see that one is exactly the negative of the other. This is a result of the gradients pulling the two neurons in exactly opposite directions.
Therefore, since both approaches are equivalent and the single-output configuration has fewer parameters and requires less computation, it is more advantageous to use a single output with a sigmoid on top.
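A quick numerical check of this equivalence, with made-up logits:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([1.3, -0.7])               # raw logits of a two-output net
p_two_output = softmax(z)[1]            # probability of class 1 from softmax
p_single_output = sigmoid(z[1] - z[0])  # probability of class 1 from a single logit
print(p_two_output, p_single_output)    # identical up to floating-point error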

Unbalanced Binary Classification in Tensorflow

I am trying to perform a binary classification using TensorFlow (v1.1.0) with a single neuron at the output layer. The snippet below corresponds to the loss function and optimizer I am currently using (inspired by the answer here).
import tensorflow as tf

ratio = .034  # minority/population ratio
learning_rate = 0.001
# weight vector (lab_feed is the one-hot labels)
class_weight = tf.constant([[ratio, 1.0 - ratio]], name='unbalanced_ratio')
weight_per_label = tf.transpose(tf.matmul(lab_feed, tf.transpose(class_weight)),
                                name='weights_per_label')
xent = tf.multiply(weight_per_label,
                   tf.nn.sigmoid_cross_entropy_with_logits(labels=lab_feed, logits=output),
                   name='loss')
loss = tf.reduce_mean(xent)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate,
                                              name='GradientDescent').minimize(loss)
My issue, however, is that for some reason all instances get classified as the same class as the epochs progress. Do I have to stop training in the middle, or is there something wrong with the loss function?
You are misusing sigmoid cross-entropy as if it were softmax cross-entropy.
Sigmoid cross-entropy is adapted to binary classification — your problem is binary classification, so that's fine. But then, the output of your net should have only one channel per binary classification task — in your case, you have a single binary classification task, so your net should have one output channel only.
To balance a sigmoid cross-entropy you need to balance each individual part of the cross-entropy, i.e. the part coming from the positive and the part coming from the negative. This cannot be done on the output, as you are doing, because the output is already a sum of the positive and negative parts.
Fortunately, there is a function in TensorFlow to do just that: tf.nn.weighted_cross_entropy_with_logits. Its use is similar to tf.nn.sigmoid_cross_entropy_with_logits, with an additional parameter (pos_weight) corresponding to the weight of the positive class.
What you are currently doing is having two binary classifiers on two different channels, and sending only the negative samples to the first and the positive samples to the second. This cannot produce anything useful.
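For reference, a sketch of the corrected setup (TF 1.x style to match the question; output is assumed to be a single raw logit per example and lab_feed the 0/1 labels with the same shape; the pos_weight value is one common heuristic, not the only choice):

import tensorflow as tf

ratio = .034                        # minority / population ratio
learning_rate = 0.001
pos_weight = (1.0 - ratio) / ratio  # common heuristic: up-weight the rare positive class

xent = tf.nn.weighted_cross_entropy_with_logits(targets=lab_feed,  # shape [batch, 1], 0/1 labels
                                                logits=output,     # shape [batch, 1], raw scores
                                                pos_weight=pos_weight)
loss = tf.reduce_mean(xent)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)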

Difference of implementation between tensorflow softmax_cross_entropy_with_logits and sigmoid_cross_entropy_with_logits

I recently came across tensorflow's softmax_cross_entropy_with_logits, but I cannot figure out what the difference in implementation is compared to sigmoid_cross_entropy_with_logits.
I know I am answering a bit late, but better late than never. I had the exact same doubt, and the answer was there in the tensorflow documentation. The answer is, and I quote:
softmax_cross_entropy_with_logits: Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class).
sigmoid_cross_entropy_with_logits: Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive
Edit: I thought I should add that while the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. This is not the case for sparse_softmax_cross_entropy_with_logits, where the label is a vector containing only the index of the true class.
I am also adding the links to the documentation. Hope this answer was helpful.
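To see how the three ops treat their labels differently, here is a small sketch with made-up logits (TF 2.x argument names):

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])

# Mutually exclusive classes: each row of labels is one probability distribution.
onehot = tf.constant([[1.0, 0.0, 0.0]])
loss_softmax = tf.nn.softmax_cross_entropy_with_logits(labels=onehot, logits=logits)
# shape [1]: one scalar loss per example

# Independent, not mutually exclusive classes: each entry is its own 0/1 target.
multi_hot = tf.constant([[1.0, 0.0, 1.0]])
loss_sigmoid = tf.nn.sigmoid_cross_entropy_with_logits(labels=multi_hot, logits=logits)
# shape [1, 3]: one loss per example and per class

# Sparse variant: the label is just the index of the true class.
sparse = tf.constant([0])
loss_sparse = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=sparse, logits=logits)
# shape [1]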
The softmax_cross_entropy_with_logits first calculates softmax and then a cross entropy, whereas sigmoid_cross_entropy_with_logits first calculates sigmoid and then cross entropy.
The major difference between sigmoid and softmax is that the softmax function returns a probability distribution over the classes, which is more in line with the ML philosophy: the sum of all softmax outputs is 1, and this in turn tells you how confident the network is about each answer.
Sigmoid outputs, on the other hand, are independent per-class scores between 0 and 1; they are not normalized against each other, so you would have to threshold or renormalize them yourself to turn them into a single prediction.
As far as the performance of the network goes, softmax generally gives better accuracy than sigmoid for single-label problems, but that is also highly dependent on the other hyperparameters.
