I am training a network on images for binary classification. The input images are normalized to have pixel values in the range [0, 1], and the weight matrices are initialized from a normal distribution. However, the output from my last Dense layer with sigmoid activation yields values with a very minute difference between the two classes. For example:
output for class 1: 0.377525, output for class 2: 0.377539
The difference between the classes only appears after 4 decimal places. Is there any workaround to make sure that the output for class 1 falls between 0 and 0.5, and for class 2 between 0.5 and 1?
Edit:
I have tried both of the following cases.
Case 1 - Dense(1, 'sigmoid') with binary crossentropy
Case 2 - Dense(2, 'softmax') with binary crossentropy
For case 1, the output values differ by a very small amount, as mentioned in the problem above. As a workaround, I am taking the mean of the predicted values to act as the classification threshold. This works up to a point, but it is not a permanent solution.
For case 2, the predictions overfit to one class only.
A sample of the code:
inputs = Input(shape=(128, 156, 1))
x = Conv2D(.....)(inputs)
x = BatchNormalization()(x)
x = MaxPooling2D()(x)
...
flat = Flatten()(x)
out = Dense(1, activation='sigmoid')(flat)  # was (x); the flattened tensor is the intended input
model = Model(inputs, out)
model.compile(optimizer='adamax', loss='binary_crossentropy', metrics=['binary_accuracy'])
It seems you are confusing a binary classification architecture with a 2-class multi-class classification setup.
Since you mention probabilities for the 2 classes, class 1 and class 2, you have set up a single-label multi-class problem: you are trying to predict the probabilities of 2 classes, where a sample can have only one of the labels at a time.
In this setup, it is proper to use softmax instead of sigmoid. Your loss function would still be binary_crossentropy.
Right now, with the multi-label setup and sigmoid activation, you are independently predicting the probability of a sample being class 1 and class 2 simultaneously (i.e., multi-label multi-class classification).
Once you change to softmax, you should see more significant differences between the probabilities, provided the sample actually belongs definitively to one of the 2 classes and your model is well trained and confident about its predictions (compare validation vs. training results).
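As a minimal sketch of that change (assuming the convolutional body from the question, with labels one-hot encoded; categorical_crossentropy is the usual pairing with one-hot labels, though the binary_crossentropy suggested above also runs, applied per output unit):

# softmax head over 2 mutually exclusive classes; labels must be one-hot encoded
out = Dense(2, activation='softmax')(flat)
model = Model(inputs, out)
model.compile(optimizer='adamax', loss='categorical_crossentropy',
              metrics=['accuracy'])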
First, I would like to say that the information you provided is insufficient to debug your problem exactly, because you didn't provide any code for your model and optimizer. I suspect there might be an error in the labels. I also suggest you use a softmax activation function instead of the sigmoid in the final layer, although your current approach can still work: a binary classification problem can output one single node with binary cross-entropy loss.
If you want to receive an accurate solution, please provide more information.
I am performing an NLP task where I analyze a document and classify it into one of six categories. However, I do this operation at three different time periods, so the final output is an array of three integers (sparse), where each integer is a category 0-5. A label looks like this: [1, 4, 5].
I am using BERT and am trying to decide what type of head I should attach to it, as well as what type of loss function I should use. Would it make sense to use BERT's output of size 1024 and run it through a Dense layer with 18 neurons, then reshape into something of size (3,6)?
Finally, I assume I would use Sparse Categorical Cross-Entropy as my loss function?
The BERT final hidden state is of shape (512, 1024). You can either take the first token, which is the CLS token, or take the average pooling. Either way your final output has shape (1024,). Now simply put 3 linear layers of shape (1024, 6), as in nn.Linear(1024, 6), and pass their outputs into the loss function below. (You can make the heads more complex if you want to.)
Simply add up the losses and call backward. Remember, you can call loss.backward() on any scalar tensor (PyTorch).
import torch.nn as nn

def loss(time1output, time2output, time3output, time1label, time2label, time3label):
    # one 6-way cross-entropy per time period, summed into a single scalar
    loss1 = nn.CrossEntropyLoss()(time1output, time1label)
    loss2 = nn.CrossEntropyLoss()(time2output, time2label)
    loss3 = nn.CrossEntropyLoss()(time3output, time3label)
    return loss1 + loss2 + loss3
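For the heads themselves, a minimal sketch might look like the following (the module name and the pooling choice are illustrative assumptions, not part of the answer above):

import torch.nn as nn

class ThreeHeadClassifier(nn.Module):
    # hypothetical wrapper: three independent 6-way heads over a pooled BERT vector
    def __init__(self, hidden_size=1024, num_classes=6):
        super().__init__()
        self.head1 = nn.Linear(hidden_size, num_classes)
        self.head2 = nn.Linear(hidden_size, num_classes)
        self.head3 = nn.Linear(hidden_size, num_classes)

    def forward(self, pooled):
        # pooled: (batch, 1024) CLS token or mean-pooled BERT output
        return self.head1(pooled), self.head2(pooled), self.head3(pooled)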
In a typical setup you take the CLS output of BERT (a vector of length 768 in case of bert-base and 1024 in case of bert-large) and add a classification head (it may be a simple Dense layer with dropout). In this case the inputs are word tokens and the output of the classification head is a vector of logits, one per class, and usually a regular cross-entropy loss function is used. Then you apply softmax to it and get probability-like scores for each class, or if you apply argmax you get the winning class. So the result is either a vector of classification scores [1x6] or the dominant class index (an integer).
[Image illustrating this classification-head setup, taken from d2l.ai]
You can simply concatenate 3 such networks (one for each time period) to get the desired result.
Obviously, I have described only one possible solution. But as it usually provides good results, I suggest you try it before moving on to more complex ones.
Finally, sparse categorical cross-entropy loss is used when the target is a sparse integer (say [4]) and regular categorical cross-entropy loss is used when the target is one-hot encoded (say [0 0 0 0 1 0]). Otherwise they are exactly the same.
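As a quick sanity check of that equivalence (a minimal sketch, assuming TensorFlow 2.x and made-up probabilities):

import numpy as np
import tensorflow as tf

probs = np.array([[0.1, 0.2, 0.1, 0.1, 0.4, 0.1]])  # made-up predicted distribution
y_sparse = np.array([4])                             # integer target
y_onehot = np.array([[0., 0., 0., 0., 1., 0.]])      # same target, one-hot encoded

# both print the same value, -log(0.4) ~ 0.916
print(tf.keras.losses.SparseCategoricalCrossentropy()(y_sparse, probs).numpy())
print(tf.keras.losses.CategoricalCrossentropy()(y_onehot, probs).numpy())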
I trained a model for multiclass classification with three classes. In the first approach, I converted the classes into one-hot vectors and trained with categorical_crossentropy as the loss function, achieving a loss of 0.07 in 1000 epochs. When I used the same approach but did not convert the classes into one-hot vectors and instead used sparse_categorical_crossentropy as the loss function, I achieved a loss of 0.05 in 1000 epochs. Does this mean that sparse_categorical_crossentropy is better than categorical_crossentropy?
Thank You!
You can't compare two loss functions in terms of their loss values, since the definition of the loss itself has changed. You can, however, compare performance on the same test dataset.
In general, use sparse_categorical_crossentropy when your labels are integers and the classes are mutually exclusive (each sample belongs to exactly one class), and categorical_crossentropy when your labels are one-hot encoded or soft probabilities (like [0.5, 0.3, 0.2]).
You got different losses because the representation of the labels changed; in fact, in Keras, sparse_categorical_crossentropy is defined as categorical cross-entropy with integer targets.
I am trying to build two neural networks for classification: one for binary and the second for multi-class classification. I am trying to use torch.nn.CrossEntropyLoss() as the loss function, but when I try to train my first neural network I get the following error:
multi-target not supported at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THNN/generic/ClassNLLCriterion.c:22
From my analysis, I found that my dataset has two problems that caused the error.
My dataset is one-hot encoded. I used one-hot encoding to preprocess my dataset. The first target variable, Y_binary, has shape torch.Size([125973, 1]) and is full of 0s and 1s indicating the classes 'No' and 'Yes'.
My data may have the wrong dimensions. I found that I can't use a simple vector with the cross-entropy loss function. Some people used the following code to reshape their target vector before feeding it to the loss function:
out = out.permute(0, 2, 3, 1).contiguous().view(-1, class_number)
But I didn't really understand the reasoning behind this code. It seems to me that I need to keep track of the following variables: Class_Number, Batch_size, Dimension_Output. For my code, these are the dimensions:
X_train.shape: (125973, 122)
Y_train2.shape: (125973, 1)
batch_size = 64
K = len(set(Y_train2))  # binary classification; for multi-class classification use K = len(set(Y_train5))
Should the target values be one-hot encoded? If not, how can I feed a nominal feature to the loss function?
If I need to reshape the output, can you help me do this for my code?
I am trying to use this loss function for both my neural networks.
Thank you in advance,
The error is due to the usage of torch.nn.CrossEntropyLoss(), which expects integer class indices as targets and is used when you want to predict exactly 1 class out of N classes. For one-hot, multi-label style targets, you should use torch.nn.BCEWithLogitsLoss(), which combines a Sigmoid layer and BCELoss in one single class.
In the multi-class case, if you use Sigmoid + BCELoss, you need the target to be one-hot encoded, i.e. something like this per sample: [0 1 0 0 0 1 0 0 1 0], where a 1 appears at the location of each class present.
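To make the shape requirements concrete, here is a minimal sketch (the tensors are made up; only the shapes and dtypes matter):

import torch
import torch.nn as nn

logits = torch.randn(4, 2)  # hypothetical model output for a batch of 4, K = 2

# CrossEntropyLoss wants integer class indices of shape (N,), dtype long
y_col = torch.tensor([[0], [1], [1], [0]])  # shape (4, 1), like Y_binary
loss_ce = nn.CrossEntropyLoss()(logits, y_col.squeeze(1).long())

# BCEWithLogitsLoss keeps one-hot float targets of shape (N, K)
y_onehot = torch.tensor([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])
loss_bce = nn.BCEWithLogitsLoss()(logits, y_onehot)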
I am using Keras for a binary classification problem. I am using the following adaptation of LeNet:
lenet_model = models.Sequential()
lenet_model.add(Convolution2D(filters=filt_size, kernel_size=(kern_size, kern_size),
                              padding='valid', input_shape=input_shape))
lenet_model.add(Activation('relu'))
lenet_model.add(BatchNormalization())
lenet_model.add(MaxPooling2D(pool_size=(maxpool_size, maxpool_size)))
lenet_model.add(Convolution2D(filters=64, kernel_size=(kern_size, kern_size), padding='valid'))
lenet_model.add(Activation('relu'))
lenet_model.add(MaxPooling2D(pool_size=(maxpool_size, maxpool_size)))
lenet_model.add(Convolution2D(filters=128, kernel_size=(kern_size, kern_size), padding='valid'))
lenet_model.add(Activation('relu'))
lenet_model.add(MaxPooling2D(pool_size=(maxpool_size, maxpool_size)))
lenet_model.add(Flatten())
lenet_model.add(Dense(1024, kernel_initializer='uniform'))
lenet_model.add(Activation('relu'))
lenet_model.add(Dense(512, kernel_initializer='uniform'))
lenet_model.add(Activation('relu'))
lenet_model.add(Dropout(0.2))
lenet_model.add(Dense(1, kernel_initializer='uniform'))
lenet_model.add(Activation('sigmoid'))
lenet_model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
But I am getting this:
ValueError: Error when checking model target: expected activation_6 to have shape (None, 1) but got array with shape (1652, 2)
It gets resolved if I use 2 in the final Dense layer.
I would suggest first checking the dimensionality of your data. The training dataset target is 2-dimensional, but the model expects a 1-dimensional target.
With lenet_model.add(Dense(1, kernel_initializer='uniform')), the final layer produces output of shape (None, 1), while your target has shape (None, 2). You need to set the final dense layer so that it matches the target shape (None, 2):
lenet_model.add(Dense(2, kernel_initializer='uniform')) is what it should be; otherwise, preprocess your data so that the target is 1-dimensional.
Consider reading the documentation before writing the code next time.
It seems that in your preprocessing steps, you have used functions to turn your numerical class labels into categorical ones, i.e., representing numerical classes in the one-hot coding scheme (in Keras, to_categorical(y, num_classes=2) would do this job for you).
Since you are dealing with a binary problem, if the original labels are 0s and 1s, the coded categorical labels would be [1, 0]s and [0, 1]s (in a label coded in the one-hot scheme, the nth position, counting from zero, is 1 if the numerical class for that instance is n, while the rest of the label is 0). This would explain why the data dimension in the error traceback is (1652, 2).
However, since you have set the output dimension of your model to 1, your output layer expects labels consisting of a single digit each, which corresponds to the raw labels before you applied any of the preprocessing steps mentioned above.
So you could fix this problem either by taking out the preprocessing of the labels or by changing the output dimension to 2; both fixes are sketched below. If you stick with categorical labels coded in the one-hot fashion, you should also switch the sigmoid activation in the last layer to a softmax activation, since sigmoid only deals with binary numerical classes, i.e., 0 or 1. For a binary classification problem, these two choices should not differ much in performance.
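A minimal sketch of both fixes (y_categorical is an assumed name for the one-hot labels of shape (1652, 2)):

import numpy as np

# Fix 1: undo the one-hot encoding so the targets match the Dense(1) sigmoid head
y_binary = np.argmax(y_categorical, axis=1)  # (1652, 2) -> (1652,)

# Fix 2: keep the one-hot targets and widen the head, replacing the final
# Dense(1)/sigmoid pair with a 2-unit softmax
lenet_model.add(Dense(2, kernel_initializer='uniform'))
lenet_model.add(Activation('softmax'))
lenet_model.compile(loss='categorical_crossentropy', optimizer=Adam(),
                    metrics=['accuracy'])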
One thing worth mentioning is that you should also pay attention to the cost function you use when you compile this model. Generally speaking, categorical labels work best with cost functions like categorical cross-entropy. Especially for multi-class classification problems (more than 2 classes), where you would have to use categorical labels together with a softmax activation, categorical cross-entropy should pretty much be your default choice, since it has many benefits over other common cost functions such as MSE and raw error count.
One of the many benefits of categorical cross-entropy is that it penalizes a "very confident mistake" much more than a case where the classifier "almost got it right", which makes sense. For example, in a binary classification setting using categorical cross-entropy as the cost function, a classifier that is 95% sure a given instance is of class 0, when the instance actually belongs to class 1, is penalized more than a classifier that is only 51% sure when it makes the same mistake. Some other cost functions, like raw error count, are insensitive to how "sure" the classifier is when it makes decisions and only take the final classification result into consideration, which essentially means losing a great deal of useful information. Other cost functions, such as MSE, place extra emphasis on wrongly classified instances, which is not always a desirable feature.
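To put rough numbers on that (made-up confidences; the penalty is -log of the probability assigned to the true class):

import numpy as np

print(-np.log(1 - 0.95))  # 95%-confident mistake: ~3.00
print(-np.log(1 - 0.51))  # 51%-confident mistake: ~0.71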
I'm trying to train a network on unbalanced data. I have A (198 samples), B (436 samples), C (710 samples), and D (272 samples). I have read about weighted_cross_entropy_with_logits, but all the examples I found are for binary classification, so I'm not very confident in how to set those weights:
Total samples: 1616
A_weight: 198/1616 = 0.12?
The idea behind this, if I understood correctly, is to penalize the errors of the majority class and value the hits in the minority one more positively, right?
My piece of code:
weights = tf.constant([0.12, 0.26, 0.43, 0.17])
cost = tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(logits=pred, targets=y, pos_weight=weights))
I have read this one and other examples with binary classification, but it is still not very clear.
Note that weighted_cross_entropy_with_logits is the weighted variant of sigmoid_cross_entropy_with_logits. Sigmoid cross entropy is typically used for binary classification. Yes, it can handle multiple labels, but sigmoid cross entropy basically makes a (binary) decision on each of them -- for example, for a face recognition net, those (not mutually exclusive) labels could be "Does the subject wear glasses?", "Is the subject female?", etc.
In binary classification(s), each output channel corresponds to a binary (soft) decision. Therefore, the weighting needs to happen within the computation of the loss. This is what weighted_cross_entropy_with_logits does, by weighting one term of the cross-entropy over the other.
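Concretely, per the TensorFlow documentation, the op applies pos_weight only to the positive term of the element-wise cross-entropy. A quick sketch (assuming TensorFlow 2.x, where the first argument is named labels; in 1.x it is targets):

import tensorflow as tf

# element-wise, with p = sigmoid(logits):
#   loss = pos_weight * labels * -log(p) + (1 - labels) * -log(1 - p)
# pos_weight > 1 makes misses on positive labels more costly
labels = tf.constant([[1.0, 0.0, 1.0, 0.0]])
logits = tf.constant([[2.0, -1.0, 0.5, -2.0]])
loss = tf.nn.weighted_cross_entropy_with_logits(
    labels=labels, logits=logits, pos_weight=tf.constant(3.0))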
In mutually exclusive multi-class classification, we use softmax_cross_entropy_with_logits, which behaves differently: each output channel corresponds to the score of a class candidate. The decision comes afterwards, by comparing the respective outputs of each channel.
Weighting before the final decision is therefore a simple matter of modifying the scores before comparing them, typically by multiplication with weights. For example, for a ternary classification task:
# your class weights
class_weights = tf.constant([[1.0, 2.0, 3.0]])
# deduce weights for batch samples based on their true label
weights = tf.reduce_sum(class_weights * onehot_labels, axis=1)
# compute your (unweighted) softmax cross entropy loss
unweighted_losses = tf.nn.softmax_cross_entropy_with_logits(labels=onehot_labels, logits=logits)
# apply the weights, relying on broadcasting of the multiplication
weighted_losses = unweighted_losses * weights
# reduce the result to get your final loss
loss = tf.reduce_mean(weighted_losses)
You could also rely on tf.losses.softmax_cross_entropy to handle the last three steps.
In your case, where you need to tackle data imbalance, the class weights could indeed be inversely proportional to their frequency in your train data. Normalizing them so that they sum up to one or to the number of classes also makes sense.
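For the sample counts in the question, that computation might look like this (a sketch; note that it inverts the raw frequencies computed in the question):

import numpy as np

counts = np.array([198, 436, 710, 272], dtype=np.float32)  # classes A, B, C, D
inv_freq = 1.0 / counts
class_weights = inv_freq / inv_freq.sum()  # normalized to sum to one
print(class_weights)  # the rarest class, A, gets the largest weight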
Note that in the above, we penalized the loss based on the true label of the samples. We could also have penalized the loss based on the estimated labels by simply defining
weights = class_weights
and the rest of the code need not change thanks to broadcasting magic.
In the general case, you would want weights that depend on the kind of error you make. In other words, for each pair of labels X and Y, you could choose how to penalize choosing label X when the true label is Y. You end up with a whole prior weight matrix, which results in the weights above being a full (num_samples, num_classes) tensor. This goes a bit beyond what you want, but it might be useful to know nonetheless that only your definition of the weight tensor needs to change in the code above.
See this answer for an alternate solution which works with sparse_softmax_cross_entropy:
import tensorflow as tf
import numpy as np
np.random.seed(123)
sess = tf.InteractiveSession()
# let's say we have the logits and labels of a batch of size 6 with 5 classes
logits = tf.constant(np.random.randint(0, 10, 30).reshape(6, 5), dtype=tf.float32)
labels = tf.constant(np.random.randint(0, 5, 6), dtype=tf.int32)
# specify some class weightings
class_weights = tf.constant([0.3, 0.1, 0.2, 0.3, 0.1])
# specify the weights for each sample in the batch (without having to compute the onehot label matrix)
weights = tf.gather(class_weights, labels)
# compute the loss
tf.losses.sparse_softmax_cross_entropy(labels, logits, weights).eval()
Tensorflow 2.0 Compatible Answer: Migrating the Code specified in P-Gn's Answer to 2.0, for the benefit of the community.
# your class weights
class_weights = tf.compat.v2.constant([[1.0, 2.0, 3.0]])
# deduce weights for batch samples based on their true label
weights = tf.compat.v2.reduce_sum(class_weights * onehot_labels, axis=1)
# compute your (unweighted) softmax cross entropy loss
unweighted_losses = tf.compat.v2.nn.softmax_cross_entropy_with_logits(onehot_labels, logits)
# apply the weights, relying on broadcasting of the multiplication
weighted_losses = unweighted_losses * weights
# reduce the result to get your final loss
loss = tf.reduce_mean(weighted_losses)
For more information about migrating code from TensorFlow 1.x to 2.x, please refer to this Migration Guide.