When considering the problem of classifying an input into one of 2 classes, 99% of the examples I have seen use a NN with a single output and a sigmoid activation followed by a binary cross-entropy loss. Another option I thought of is having the last layer produce 2 outputs and using a categorical cross-entropy with C=2 classes, but I have never seen it in any example.
Is there any reason for that?
Thanks
If you use a softmax on top of the two-output network, you get an output that is mathematically equivalent to using a single output with a sigmoid on top.
Do the math and you'll see.
In practice, from my experience, if you look at the raw "logits" of the two-output net (before the softmax) you'll see that one is exactly the negative of the other. This is a result of the gradients pulling each neuron in exactly opposite directions.
Therefore, since both approaches are equivalent, and the single-output configuration has fewer parameters and requires less computation, it is more advantageous to use a single output with a sigmoid on top.
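For a quick numerical check, here is a small PyTorch sketch (with made-up logits) showing that the two-output softmax probability for class 1 is exactly a sigmoid of the difference of the two logits:

```python
import torch

# Made-up two-logit outputs for a batch of two examples.
z = torch.tensor([[1.3, -0.7],
                  [0.2,  2.5]])

# Probability of class 1 via a two-output softmax...
p_softmax = torch.softmax(z, dim=1)[:, 1]

# ...equals a sigmoid applied to the logit difference z1 - z0.
p_sigmoid = torch.sigmoid(z[:, 1] - z[:, 0])

print(torch.allclose(p_softmax, p_sigmoid))  # True
```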
Related
https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
When I read the contents above, I understood that torch.nn.CrossEntropyLoss already applies the (log-)softmax to the scores of the last layer, so I thought the forward function doesn't have to include a softmax. For example,
return self.fc(x) rather than return torch.softmax(self.fc(x), dim=1). However, I'm confused, because I've seen several implementations of ConvNet classifiers that do it both ways (they return with or without the softmax while both use a cross-entropy loss).
Does this issue affect the performance of the classifier? Which way is correct?
JHPark,
You are correct - with torch.nn.CrossEntropyLoss there is no need to include a softmax layer. If one does include a softmax, it will still lead to the proper classification result, since softmax does not change which element has the maximum score. However, because the softmax is then effectively applied twice, it distorts the relative levels of the outputs, making the gradients weaker and potentially slowing training a bit.
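A minimal PyTorch sketch of that pattern (layer sizes are made up): the forward pass returns raw logits and CrossEntropyLoss applies the log-softmax internally.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, in_features=128, num_classes=5):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        return self.fc(x)  # raw logits, no softmax here

model = TinyClassifier()
criterion = nn.CrossEntropyLoss()      # applies log-softmax + NLL internally

x = torch.randn(8, 128)                # dummy batch
targets = torch.randint(0, 5, (8,))    # class indices
loss = criterion(model(x), targets)

# For predictions, argmax over the logits is enough;
# apply softmax only if you actually need probabilities.
preds = model(x).argmax(dim=1)
```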
When I use a U-Net for semantic segmentation with two categories, I set the output of the last layer of the model to 1 channel and to 2 channels, respectively, and measure with the corresponding cross-entropy losses: BCELoss and CrossEntropyLoss.
But the gap between the two is large: the performance of the former is normal, while the latter has a very low precision and a very high recall.
I am using PyTorch.
Mathematically, BCE loss (on logits) is just a special case of cross-entropy loss for the case of two classes.
Are you using a sigmoid or a softmax at the output of the network? In PyTorch, CrossEntropyLoss takes the raw output of the last layer (there is no need to softmax the output); that is done for numerical stability.
BCELoss only takes inputs between 0 and 1, so a sigmoid is needed there. However, PyTorch has BCEWithLogitsLoss, which applies the sigmoid for you; this version is more stable.
One more thing that it seems you are not doing correctly (it would be nicer to have some minimal code to better understand your problem): CrossEntropyLoss requires one channel per class, so if you have 2 classes, you have to give it an input with two channels. BCELoss (or BCEWithLogitsLoss) only takes one channel with a number between 0 and 1. If I understood correctly, you are somehow giving it 2 channels; that will cause problems in training.
My best guess is that the gap in performance between the two is due to misuse of the loss functions. The PyTorch documentation has improved a lot; I recommend spending a few minutes understanding the difference between each of these three loss functions: https://pytorch.org/docs/stable/nn.html#loss-functions .
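To illustrate the shape difference, here is a sketch (with hypothetical batch and image sizes) of the two binary-segmentation setups discussed above:

```python
import torch
import torch.nn as nn

N, H, W = 4, 64, 64  # hypothetical batch size and image size

# Option 1: one output channel + BCEWithLogitsLoss (sigmoid applied internally).
logits_1ch = torch.randn(N, 1, H, W)                     # raw logits
target_1ch = torch.randint(0, 2, (N, 1, H, W)).float()   # 0/1 mask, same shape, float
loss_bce = nn.BCEWithLogitsLoss()(logits_1ch, target_1ch)

# Option 2: two output channels + CrossEntropyLoss (log-softmax applied internally).
logits_2ch = torch.randn(N, 2, H, W)                     # one channel per class
target_2ch = torch.randint(0, 2, (N, H, W))              # class indices, no channel dim
loss_ce = nn.CrossEntropyLoss()(logits_2ch, target_2ch)
```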
I am using a VGG16 model fine-tuned on my data. I am predicting the ethnicity of face images, and I have 5 output classes: White, Black, Asian, Sub-continent, and Others. Should I use softmax or sigmoid, and why?
Sigmoid: σ(x) = 1 / (1 + e^(−x)), which squashes a single score into (0, 1).
Softmax: softmax(x_i) = e^(x_i) / Σ_j e^(x_j), which turns a vector of scores into a probability distribution summing to 1.
When you use a softmax, you basically get a probability for each class (a joint distribution and a multinomial likelihood) whose sum is bound to be one.
In the case of softmax, increasing the output value of one class makes the others go down (because the sum is always 1). If you want to predict exactly one class (which is the case in your ethnicity classifier) you should use the softmax function. The character of this function is "there can be only one", so it is ideally suited to single-label multi-class problems like yours.
Things are different for the sigmoid function. This function can give you the top-n results based on a threshold. The feature of the sigmoid is that it can emphasize multiple values (yes, more than one, hence "multi-label") depending on the threshold, and we use it for multi-label classification problems.
In general, if you are dealing with a multi-class classification problem, you should use a softmax, because you are guaranteed that the probabilities of all classes sum to 1, giving you a single joint distribution over the classes; with a sigmoid, you'd be predicting the probability of each class individually, not necessarily normalized. If you are not careful and aware of the difference, you can run into issues with your output.
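As a small illustration (made-up logits for the 5-class case), the two heads behave differently:

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5, 0.1, -0.3])  # made-up scores for 5 classes

# Multi-class, single label: softmax yields a distribution that sums to 1.
probs = torch.softmax(logits, dim=0)
print(probs.sum())       # ~1.0
print(probs.argmax())    # index of the single predicted class

# Multi-label: sigmoid scores every class independently; several can exceed a threshold.
scores = torch.sigmoid(logits)
print((scores > 0.5).nonzero().flatten())  # all classes above the 0.5 threshold
```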
I recently came across TensorFlow's softmax_cross_entropy_with_logits, but I cannot figure out what the difference in the implementation is compared to sigmoid_cross_entropy_with_logits.
I know I am answering a bit late, but better late than never. I had the exact same doubt, and the answer is right there in the TensorFlow documentation. The answer is, and I quote:
softmax_cross_entropy_with_logits: Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class).
sigmoid_cross_entropy_with_logits: Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive
edit: I should add that while the classes are mutually exclusive, their probabilities need not be; all that is required is that each row of labels is a valid probability distribution. That is not the case for sparse_softmax_cross_entropy_with_logits, where each label is just the index of the true class rather than a distribution.
I am also adding the links to the documentation. Hope this answer was helpful.
softmax_cross_entropy_with_logits first calculates a softmax and then the cross-entropy, whereas sigmoid_cross_entropy_with_logits first calculates a sigmoid and then the cross-entropy (both are fused internally for numerical stability).
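For reference, a small TensorFlow 2 sketch (with made-up logits and labels) of the three variants mentioned above:

```python
import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.3]])   # one example, three classes (made up)

# Mutually exclusive classes: each row of labels is a probability distribution.
soft_labels = tf.constant([[1.0, 0.0, 0.0]])
ce_softmax = tf.nn.softmax_cross_entropy_with_logits(labels=soft_labels, logits=logits)

# Independent classes (multi-label): each label entry is its own 0/1 target.
multi_labels = tf.constant([[1.0, 0.0, 1.0]])
ce_sigmoid = tf.nn.sigmoid_cross_entropy_with_logits(labels=multi_labels, logits=logits)

# Sparse variant: each label is just the integer index of the true class.
sparse_labels = tf.constant([0])
ce_sparse = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=sparse_labels, logits=logits)
```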
The major difference between sigmoid and softmax is that the softmax function returns a proper probability distribution over the classes, which is more in line with the ML philosophy: the sum of all softmax outputs is 1, so the outputs directly tell you how confident the network is in each class relative to the others.
Sigmoid outputs, on the other hand, are independent per-class scores; they do not sum to 1, so if you want a single distribution over the classes you would have to normalize them yourself.
As far as the performance of the network goes, softmax generally gives better accuracy than sigmoid for single-label problems, but that also depends heavily on the other hyperparameters.
I have a neural network with one input, three hidden neurons and one output. I have 720 input values and corresponding targets, 540 for training and 180 for testing.
When I train my network with a logistic sigmoid or tanh activation, I get the same output for all 180 test values, i.e. the same number every time. When I use a linear activation function, I get NaN, because the values apparently grow too large.
Is there any activation function to use in such a case? Or any improvements to be done? I can update the question with details and code if required.
Neural nets are not stable when fed input data on arbitrary scales (such as roughly 0 to 1000 in your case). Moreover, if your output units are tanh they can't even predict values outside the range -1 to 1 (or 0 to 1 for logistic units)!
You should try recentering/scaling the data (making it have mean zero and unit variance - this is called standard scaling in the data science community). Since it is a lossless transformation, you can revert to the original scale once you've trained the net and made predictions.
Additionally, a linear output unit is probably best, as it makes no assumptions about the output range, and I've found tanh units to do much better in recurrent networks and in low-dimensional input/hidden/output nets.
Newmu is right that the scaling is probably the issue here; you need to scale your inputs to lie in the valid range. (Standardization to zero mean, unit variance, as they suggest, isn't a great choice though, since that means about a third of your data will lie outside [-1, 1]...) I don't know about pybrain, but in scikit-learn you'd want sklearn.preprocessing.MinMaxScaler, as sketched below.
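A minimal scikit-learn sketch (with made-up values) of the min-max scaling and the inverse transform back to the original scale:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up 1-D inputs and targets on a roughly 0-1000 scale, as in the question.
X = np.array([[10.0], [250.0], [990.0]])
y = np.array([[35.0], [410.0], [880.0]])

x_scaler, y_scaler = MinMaxScaler(), MinMaxScaler()
X_scaled = x_scaler.fit_transform(X)   # now in [0, 1]
y_scaled = y_scaler.fit_transform(y)

# ... train the network on (X_scaled, y_scaled) ...

# Map network outputs (here just a placeholder) back to the original target scale.
y_pred_scaled = y_scaled
y_pred = y_scaler.inverse_transform(y_pred_scaled)
```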
But, also, in the comments you said your dataset looks like this:
where the horizontal axis is the input and the vertical axis is the target. So, for an input of 200 you have one training example saying the target is 80 and another saying it's 320; what do you want the network to predict there? An "optimal" neural network (which may be hard to achieve) would predict something around 200, the average of the two.
You may need to think about how to reframe your learning problem to be a more-consistent function from inputs to targets.