Compute cross entropy loss for classification in pytorch - python

I am trying to build two neural networks for classification: one for binary and one for multi-class classification. I am trying to use torch.nn.CrossEntropyLoss() as the loss function, but when I try to train my first neural network I get the following error:
multi-target not supported at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THNN/generic/ClassNLLCriterion.c:22
From my analysis, I found that my dataset has two problems that caused the error.
My data set is one hot encoded. I used one hot encoding to preprocess my dataset. The first target variable, Y_binary, has the shape torch.Size([125973, 1]) and is full of 0s and 1s indicating the classes 'No' and 'Yes'.
My data has the wrong dimensions? I found that I can't use a simple vector with the cross entropy loss function. Some people used the following code to reshape their target vector before feeding to the loss function.
out = out.permute(0, 2, 3, 1).contiguous().view(-1, class_number)
But I didn't really understand the reasoning behind this code. It seems to me that I need to keep track of the following variables: Class_Number, Batch_size, Dimension_Output. For my code, here are the dimensions:
X_train.shape: (125973, 122)
Y_train2.shape: (125973, 1)
batch_size = 64
K = len(set(Y_train2)) # Binary classification For multi class classification use K = len(set(Y_train5))
Should the target value be one hot encoded? If not, how can I feed a nominal feature to the loss function?
If I need to reshape the output, can you help me do this for my code?
I am trying to use this loss function for both my neural networks.
Thank you in advance,

The error comes from how torch.nn.CrossEntropyLoss() is being used. That loss is meant for predicting 1 class out of N classes, and it expects the target to be a 1-D tensor of class indices with shape (N,), not one-hot vectors or a column of shape (N, 1). For binary classification with a single output logit, or for multi-label classification, you should use torch.nn.BCEWithLogitsLoss(), which combines a Sigmoid layer and the BCELoss in one single class.
In the multi-label case, if you use Sigmoid + BCELoss, the target needs to be a multi-hot float vector, i.e. something like this per sample: [0 1 0 0 0 1 0 0 1 0], where a 1 marks each class that is present.
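As a minimal sketch of the shapes involved (the tensors are random stand-ins rather than the asker's data, and the single linear layers are hypothetical placeholders for a real network): CrossEntropyLoss takes raw logits of shape (N, C) together with integer class indices of shape (N,), so a target of shape (N, 1) is squeezed first, while BCEWithLogitsLoss takes a float target with the same shape as the logits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Random stand-ins for the data in the question (shapes only, not real data)
batch_size, n_features, n_classes = 64, 122, 2
x = torch.randn(batch_size, n_features)
y = torch.randint(0, n_classes, (batch_size, 1))  # shape (64, 1), like Y_train2

# Hypothetical placeholder model: one linear layer producing raw logits
model_ce = nn.Linear(n_features, n_classes)

# CrossEntropyLoss: logits (N, C), targets are class indices of shape (N,), dtype long
ce_target = y.squeeze(1).long()
ce_loss = nn.CrossEntropyLoss()(model_ce(x), ce_target)

# BCEWithLogitsLoss: one logit per sample, float targets with the same shape (N, 1)
model_bce = nn.Linear(n_features, 1)
bce_target = y.float()
bce_loss = nn.BCEWithLogitsLoss()(model_bce(x), bce_target)

print(ce_target.shape, ce_loss.item())
print(bce_target.shape, bce_loss.item())
```

The same CrossEntropyLoss call works unchanged for the multi-class network once n_classes is set to the number of classes; no one-hot encoding of the target is needed in that case.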

Related

Loss function for comparing two vectors for categorization

I am performing an NLP task where I analyze a document and classify it into one of six categories. However, I do this operation at three different time periods. So the final output is an array of three integers (sparse), where each integer is the category 0-5. So a label looks like this: [1, 4, 5].
I am using BERT and am trying to decide what type of head I should attach to it, as well as what type of loss function I should use. Would it make sense to use BERT's output of size 1024 and run it through a Dense layer with 18 neurons, then reshape into something of size (3,6)?
Finally, I assume I would use Sparse Categorical Cross-Entropy as my loss function?
The final hidden state of BERT has shape (512, 1024). You can either take the first token, which is the CLS token, or average-pool over the tokens. Either way your pooled output has shape (1024,). Now simply put 3 linear layers of shape (1024, 6), as in nn.Linear(1024, 6), on top of it and pass their outputs into the loss function below. (You can make it more complex if you want to.)
Simply add up the losses and call backward. Remember that in PyTorch you can call loss.backward() on any scalar tensor.
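import torch.nn as nn

def loss(time1output, time2output, time3output, time1label, time2label, time3label):
    # Each output holds (batch, 6) logits; each label holds (batch,) class indices
    criterion = nn.CrossEntropyLoss()
    loss1 = criterion(time1output, time1label)
    loss2 = criterion(time2output, time2label)
    loss3 = criterion(time3output, time3label)
    # The sum is a scalar tensor, so backward() can be called on it directly
    return loss1 + loss2 + loss3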
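One way the three linear layers described above could be wired together is sketched here. The class name ThreePeriodHead, the random pooled tensor standing in for the BERT output, and the example labels are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ThreePeriodHead(nn.Module):
    """Hypothetical head: one linear classifier per time period."""

    def __init__(self, hidden_size=1024, n_classes=6, n_periods=3):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, n_classes) for _ in range(n_periods)]
        )

    def forward(self, pooled):
        # pooled: (batch, hidden_size) -> list of (batch, n_classes) logits
        return [head(pooled) for head in self.heads]

torch.manual_seed(0)
pooled = torch.randn(4, 1024)  # stand-in for the pooled BERT output
labels = torch.tensor([[1, 4, 5], [0, 0, 2], [3, 3, 3], [5, 1, 0]])  # (batch, 3)

head = ThreePeriodHead()
logits = head(pooled)

# One cross-entropy term per time period, summed into a single scalar
criterion = nn.CrossEntropyLoss()
total_loss = sum(criterion(logits[t], labels[:, t]) for t in range(3))
total_loss.backward()
```

Each column of labels holds the integer category for one time period, so a label such as [1, 4, 5] is used as is, without one-hot encoding.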
In a typical setup you take the CLS output of BERT (a vector of length 768 in the case of bert-base and 1024 in the case of bert-large) and add a classification head (it may be a simple Dense layer with dropout). In this case the inputs are word tokens and the output of the classification head is a vector of logits, one per class, and usually a regular Cross-Entropy loss function is used. At prediction time you apply softmax to the logits to get probability-like scores for each class, or argmax to get the winning class. So the result might be either a vector of classification scores [1x6] or the dominant class index (an integer).
(Figure not shown; image taken from d2l.ai.)
You can simply concatenate 3 such networks (for each time period) to get the desired result.
Obviously, I have described only one possible solution. But as it usually provides good results, I suggest you try it before moving on to more complex ones.
Finally, Sparse Categorical Cross-Entropy loss is used when the labels are integer class indices (say [4]) and regular Categorical Cross-Entropy loss is used when the labels are one-hot encoded (say [0 0 0 0 1 0]). Otherwise they are the same loss.
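The equivalence of the two label formats can be checked numerically. This is a small NumPy sketch with made-up logits; the helper function names are chosen here for illustration and are not the Keras implementations.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(class_ids, logits):
    # Labels are integer class indices, e.g. [1, 4, 5]
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(class_ids)), class_ids])

def categorical_crossentropy(onehot, logits):
    # Labels are one-hot rows, e.g. [0 1 0 0 0 0]
    probs = softmax(logits)
    return -(onehot * np.log(probs)).sum(axis=1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 6))     # made-up logits: 3 samples, 6 classes
class_ids = np.array([1, 4, 5])
onehot = np.eye(6)[class_ids]

sparse_loss = sparse_categorical_crossentropy(class_ids, logits)
dense_loss = categorical_crossentropy(onehot, logits)
print(np.allclose(sparse_loss, dense_loss))
```

Both functions pick out the log-probability of the true class; the one-hot version just does it by multiplying with zeros and a single one.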

Scaling the sigmoid output

I am training a Network on images for binary classification. The input images are normalized to have pixel values in the range[0,1]. Also, the weight matrices are initialized from a normal distribution. However, the output from my last Dense layer with sigmoid activation yields values with a very minute difference for the two classes. For example -
output for class1- 0.377525 output for class2- 0.377539
The difference between the classes only shows up after 4 decimal places. Is there any workaround to make sure that the output for class 1 falls between 0 and 0.5, and for class 2 between 0.5 and 1?
Edit:
I have tried both the cases.
Case 1 - Dense(1, 'sigmoid') with binary crossentropy
Case 2- Dense(2, 'softmax') with binary crossentropy
For case 1, the output values differ by a very small amount, as mentioned in the problem above. As such, I am taking the mean of the predicted values to act as the threshold for classification. This works up to some extent, but it is not a permanent solution.
For case 2 - the prediction overfits to one class only.
Sample code:
inputs = Input(shape = (128,156,1))
x = Conv2D(.....)(inputs)
x = BatchNormalization()(x)
x = MaxPooling2D()(x)
...
.
.
flat=Flatten()(x)
out = Dense(1,'sigmoid')(flat)
model = Model(inputs,out)
model.compile(optimizer='adamax',loss='binary_crossentropy',metrics=['binary_accuracy'])
It seems you are confusing a binary classification architecture with a 2-label multi-class classification setup.
Since you mention the probabilities for the 2 classes, class1 and class2, you are describing a single-label multi-class setup. That means you are trying to predict the probabilities of 2 classes, where a sample can have only one of the labels at a time.
In this setup, it's proper to use softmax instead of sigmoid. Your loss function would be binary_crossentropy as well.
Right now, with the multi-label setup and sigmoid activation, you are independently predicting the probability of a sample being class1 and the probability of it being class2 (aka multi-label multi-class classification).
Once you change to softmax you should see more significant differences between the probabilities, provided the sample definitively belongs to one of the 2 classes and your model is well trained and confident about its predictions (compare validation vs. training results).
First, I would like to say that the information you provided is insufficient to debug your problem exactly, because you didn't provide any code of your model and optimizer. I suspect there might be an error in the labels. I also suggest you try a softmax activation function instead of the sigmoid function in the final layer. Your current approach can still work, but then the binary classification model must output one single node and the loss must be binary cross entropy.
If you want to receive an accurate solution, please provide more information.
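The difference between independent sigmoid outputs and a softmax over two classes, as discussed in the answers above, can be seen in a small NumPy sketch. The logits are made-up numbers, not outputs of the asker's model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row maximum for numerical stability
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, -1.0], [0.3, 0.1]])  # made-up logits for two samples

independent = sigmoid(logits)  # multi-label: each column is scored on its own
exclusive = softmax(logits)    # single-label: each row sums to 1

# With two classes, softmax reduces to a sigmoid of the logit difference,
# which is why a single sigmoid output is enough for binary classification
p_class1 = sigmoid(logits[:, 0] - logits[:, 1])

print(independent.sum(axis=1))
print(exclusive.sum(axis=1))
```

Neither activation by itself widens a gap as small as the one in the question; if the logits are nearly identical for both classes, the probabilities will be too, which points to a training or label problem rather than to the output layer.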

Keras output layer gives unexpected error

I am using Keras for a binary classification problem. I am using the following adaptation of LeNet:
lenet_model = models.Sequential()
lenet_model.add(Convolution2D(filters=filt_size, kernel_size=(kern_size, kern_size),
                              padding='valid', input_shape=input_shape))
lenet_model.add(Activation('relu'))
lenet_model.add(BatchNormalization())
lenet_model.add(MaxPooling2D(pool_size=(maxpool_size, maxpool_size)))
lenet_model.add(Convolution2D(filters=64, kernel_size=(kern_size, kern_size),
                              padding='valid'))
lenet_model.add(Activation('relu'))
lenet_model.add(MaxPooling2D(pool_size=(maxpool_size, maxpool_size)))
lenet_model.add(Convolution2D(filters=128, kernel_size=(kern_size, kern_size),
                              padding='valid'))
lenet_model.add(Activation('relu'))
lenet_model.add(MaxPooling2D(pool_size=(maxpool_size, maxpool_size)))
lenet_model.add(Flatten())
lenet_model.add(Dense(1024, kernel_initializer='uniform'))
lenet_model.add(Activation('relu'))
lenet_model.add(Dense(512, kernel_initializer='uniform'))
lenet_model.add(Activation('relu'))
lenet_model.add(Dropout(0.2))
lenet_model.add(Dense(1, kernel_initializer='uniform'))
lenet_model.add(Activation('sigmoid'))
lenet_model.compile(loss='binary_crossentropy', optimizer=Adam(),
                    metrics=['accuracy'])
But I am getting this:
ValueError: Error when checking model target: expected activation_6 to have shape (None, 1) but got array with shape (1652, 2).
It gets resolved if I use 2 in the final Dense layer.
I would suggest first checking the dimensionality of your data. Your training targets have 2 columns, but the model outputs a single value per sample.
With lenet_model.add(Dense(1, kernel_initializer='uniform')) the model expects targets of shape (None, 1), while yours have shape (1652, 2). You need to set the final Dense layer so that its output matches the target shape (None, 2).
lenet_model.add(Dense(2, kernel_initializer='uniform')) is what it should be; otherwise, preprocess your data so that the targets have a single column.
Consider reading the documentation before writing the code next time.
It seems that in your preprocessing steps, you have used functions to turn your numerical class labels into categorical ones, i.e., representing numerical classes in the one-hot coding scheme (in Keras, to_categorical(y, num_classes=2) would do this job for you).
Since you are dealing with a binary problem, if the original labels are 0s and 1s, the one-hot coded labels would be [1, 0] for class 0 and [0, 1] for class 1 (in the one-hot scheme, the entry at index n, counting from zero, is 1 if the numerical class for this instance is n, while the rest of the label is 0). This would explain why your data dimension in the error traceback is (1652, 2).
However, since you have set the output dimension in your model to 1, your output layer would expect the desired labels in data to be of 1 digit only, which would correspond to the raw data before you applied any preprocessing steps mentioned above.
So, you could fix this problem either by taking out the preprocessing for the labels or changing the output dimension to 2. If you stick with using categorical labels coded in the one-hot fashion, you should also switch the sigmoid activation in the last layer to softmax activation since sigmoid only deals with binary numerical classes, i.e., 0 or 1. For a binary classification problem, these two choices should not differ in performance much.
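The two label layouts can be compared in a short NumPy sketch. The label values are made up, and np.eye(2)[labels] is used here only because it produces the same values that to_categorical(labels, num_classes=2) would.

```python
import numpy as np

labels = np.array([0, 1, 1, 0, 1])  # made-up binary labels, shape (5,)

# One-hot encoding: class 0 -> [1, 0], class 1 -> [0, 1]
onehot = np.eye(2)[labels]          # shape (5, 2), fits Dense(2) + softmax

# Undoing the encoding recovers the original integer labels
restored = onehot.argmax(axis=1)    # shape (5,)
column = restored.reshape(-1, 1)    # shape (5, 1), fits Dense(1) + sigmoid

print(onehot.shape, column.shape)
```

Whichever layout is kept, the second dimension of the targets has to equal the number of units in the final Dense layer.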
One thing worth mentioning is that you should also pay attention to the cost function you use when you compile this model. Generally speaking, categorical labels work the best with cost functions like categorical crossentropy. Especially for multi-class classification (more than 2 classes) problems where you would have to use categorical labels together with a softmax activation, categorical crossentropy should pretty much be your default choice since it has many benefits over some other common cost functions such as MSE and raw error count.
One of the many benefits of categorical crossentropy would be the fact that it penalizes a "very confident mistake" much more than the case where the classifier "almost got it right", which makes sense. For example, in a binary classification setting using categorical crossentropy as the cost function, a classifier that is 95% sure that a given instance is of class 0 whereas the instance actually belongs to class 1 would be penalized more than a classifier that is 51% sure when it made this mistake. Some other cost functions like raw error count are insensitive to how "sure" the classifier is when it makes decisions and those cost functions only take into consideration the final classification result, which essentially means losing a great deal of useful information. Some other cost functions such as MSE would give more emphasis on the wrongly classified instances, which is not always the desired feature to have.
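The 95% versus 51% example above can be worked through numerically. The sketch below uses NumPy and the natural logarithm; the helper names are chosen for illustration.

```python
import numpy as np

true_onehot = np.array([0.0, 1.0])        # the instance belongs to class 1
confident_wrong = np.array([0.95, 0.05])  # 95% sure of class 0
hesitant_wrong = np.array([0.51, 0.49])   # 51% sure of class 0

def cross_entropy(y_true, y_pred):
    # Only the probability assigned to the true class contributes
    return float(-(y_true * np.log(y_pred)).sum())

def error_count(y_true, y_pred):
    # 1 if the predicted class is wrong, 0 otherwise
    return int(y_true.argmax() != y_pred.argmax())

ce_confident = cross_entropy(true_onehot, confident_wrong)  # -ln(0.05), about 3.00
ce_hesitant = cross_entropy(true_onehot, hesitant_wrong)    # -ln(0.49), about 0.71

print(ce_confident, ce_hesitant)
print(error_count(true_onehot, confident_wrong), error_count(true_onehot, hesitant_wrong))
```

Both predictions count as exactly one error, but the confident mistake costs roughly four times as much under cross-entropy.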

Custom Loss Function in TensorFlow for weighting training data

I want to weight the training data based on a column in the training data set, thereby giving more importance to certain training items than others. The weighting column should not be included as a feature for the input layer.
The TensorFlow documentation has an example of how to use the label of an item to assign a custom loss weight:
# Ensures that the loss for examples whose ground truth class is `3` is 5x
# higher than the loss for all other examples.
weight = tf.multiply(4, tf.cast(tf.equal(labels, 3), tf.float32)) + 1
onehot_labels = tf.one_hot(labels, num_classes=5)
tf.contrib.losses.softmax_cross_entropy(logits, onehot_labels, weight=weight)
I am using this in a custom DNN with three hidden layers. In theory I simply need to replace labels in the example above with a tensor containing the weight column.
I am aware that there are several threads that already discuss similar problems, e.g. "defined loss function in tensorflow?"
For some reason I am running into a lot of problems trying to bring my weight column in. It's probably two easy lines of code, or maybe there is an easier way to achieve the same result.
I believe I found the answer:
weight_tf = tf.range(features.get_shape()[0]-1, features.get_shape()[0])
loss = tf.losses.softmax_cross_entropy(target, logits, weights=weight_tf)
The weight is the last column in the features tensor.
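The idea of taking the last column as per-example weights can be sketched in NumPy as follows. Note that tf.range produces a sequence of integers rather than the values stored in a column, so this sketch slices the column explicitly. The data, the labels, and the normalisation by the sum of the weights are assumptions for illustration and are not necessarily what a given TensorFlow loss does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 4 examples, 3 real features plus a trailing weight column
data = np.hstack([rng.normal(size=(4, 3)), np.array([[1.0], [5.0], [1.0], [2.0]])])

# Split off the weights so they never reach the input layer
features, weights = data[:, :-1], data[:, -1]

labels = np.array([0, 2, 1, 2])
logits = rng.normal(size=(4, 3))  # stand-in for the network output

def per_example_cross_entropy(class_ids, logits):
    # Log-softmax computed in a numerically stable way
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(class_ids)), class_ids]

losses = per_example_cross_entropy(labels, logits)

# Weighted mean: examples with larger weights contribute more to the loss
weighted_loss = (weights * losses).sum() / weights.sum()
print(weighted_loss)
```

Here the second example counts five times as much as the first or third when the loss is averaged.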

Is it possible using Tensorflow to create a neural network for input/output mapping?

I am currently using TensorFlow to create a neural network that reproduces a certain output given an input.
The input in this case is sampled audio, and the audio is used to generate MFCC features. I know what the corresponding MFCC features are for each file, but I am not sure how I should set up the neural network.
I am following this guide/tutorial http://www.kdnuggets.com/2016/09/urban-sound-classification-neural-networks-tensorflow.html/2
In it, the neural network is set up as follows:
training_epochs = 5000
n_dim = tr_features.shape[1]
n_classes = 10
n_hidden_units_one = 280
n_hidden_units_two = 300
sd = 1 / np.sqrt(n_dim)
learning_rate = 0.01
My question here is how I define the number of classes. The real values I've computed aren't divided into classes but are decimal numbers. Should I just create multiple networks with different numbers of classes and choose the one with the smallest error compared to the original value, or is there a TensorFlow command that can do that, given that I am doing supervised learning?
Neural networks can be used for classification tasks or regression tasks. In the tutorial, the author wants to classify sounds into 10 different categories. So the neural network has 10 output neurons (n_classes), and each of their activation values gives the probability that an input sound belongs to a class.
In your case, you want to map a given sound to a decimal number (is that right?), so it's a regression task: the neural network has to learn an unknown function. The number of output neurons has to be equal to the output dimension of that unknown function (1 if it's just a decimal number).
So if you want to keep the same architecture for your regression task, just set n_classes = 1 and modify y_ to
y_ = tf.matmul(h_2,W) + b
because tf.nn.softmax converts the final scores to probabilities (that's good for classification but not for regression).
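Why the softmax has to go when there is a single output can be seen in a small NumPy sketch. The arrays are random stand-ins for the tutorial's tensors, and the mean squared error at the end is an assumed choice of regression loss that the answer itself does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_outputs = 300, 1            # n_outputs plays the role of n_classes = 1

h_2 = rng.normal(size=(5, n_hidden))    # stand-in for the last hidden layer
W = rng.normal(size=(n_hidden, n_outputs)) / np.sqrt(n_hidden)
b = np.zeros(n_outputs)
targets = rng.normal(size=(5, 1))       # made-up real-valued targets

# Linear output layer: can take any real value, as regression requires
y_linear = h_2 @ W + b

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Softmax over a single output is always exactly 1, so it carries no information
y_softmax = softmax(y_linear)

# Assumed regression loss: mean squared error between outputs and targets
mse = float(np.mean((y_linear - targets) ** 2))
print(y_softmax.ravel(), mse)
```

If the target is a vector of several MFCC coefficients per file rather than one number, n_outputs would be set to the length of that vector instead.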
