I'm working on a 200-class classification task (with a twist: it is multi-label, so there may be multiple 1's in the y vector) using a 4-layer fully-connected neural network. Most of the time y (the label vector) contains only one or two 1's, and that's where the problem is: when training, the model tends to predict all the labels as zero, even where they should be 1.
Thus the accuracy is low (below 99%, i.e. almost worse than an all-zero prediction). The activation function for each layer is sigmoid. Could you give me some advice to improve the model?
This is my loss function. Note that accuracy alone is a poor measure here, because predicting all labels as 0 already gets almost 99% accuracy.
loss = tf.reduce_mean(
    tf.reduce_sum(
        -(sum_all - sum_one) / sum_all * tf.multiply(ys, tf.log(prediction))
        - sum_one / sum_all * tf.multiply(one - ys, tf.log(one - prediction)),
        reduction_indices=[1]))
sum_one indicates the number of 1's in the label; I implemented a weighting here, so the rare positive term gets the larger coefficient (sum_all - sum_one) / sum_all.
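For reference, a minimal sketch of a similar weighting using TensorFlow's built-in weighted cross-entropy, which works on raw logits and is numerically safer than hand-rolled log terms (the name logits and the pos_weight value are assumptions; pos_weight is roughly the negative/positive ratio):

pos_weight = 199.0  # assumption: about 1 positive out of 200 labels
loss = tf.reduce_mean(
    tf.nn.weighted_cross_entropy_with_logits(
        targets=ys, logits=logits, pos_weight=pos_weight))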
I reused someone else's code to predict head pose in Euler angles. The author trained a classification network that returns bin-classification results for the three angles, i.e. yaw, roll, pitch, with 66 bins per angle. They somehow convert the bin probabilities to the corresponding angle, as written from line 150 to 152 here. Could someone explain the formula?
These are the relevant lines of code in the above file:
[56] model = hopenet.Hopenet(torchvision.models.resnet.Bottleneck, [3, 4, 6, 3], 66) # a variant of ResNet50
[80] idx_tensor = [idx for idx in xrange(66)]
[81] idx_tensor = torch.FloatTensor(idx_tensor).cuda(gpu)
[144] yaw, pitch, roll = model(img)
[146] yaw_predicted = F.softmax(yaw)
[150] yaw_predicted = torch.sum(yaw_predicted.data[0] * idx_tensor) * 3 - 99
If we look at the training code and the authors' paper,* we see that the loss function is a sum of two losses, one computed from each of the following:
the raw model output (a vector of scores, one per bin category, fed to the cross-entropy classification loss):
[144] yaw, pitch, roll = model(img)
a linear combination of the bin predictions, i.e. the predicted continuous angle (fed to the MSE regression loss):
[146] yaw_predicted = F.softmax(yaw)
[150] yaw_predicted = torch.sum(yaw_predicted.data[0] * idx_tensor) * 3 - 99
Since sum(softmax(output) * bin_index) * 3 - 99 is the final computation of the regression branch during training (but is not explicitly part of the model's forward pass), the same transform must be applied to the raw output at inference time to convert the vector of bin probabilities into a single angle prediction.
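In other words, the network outputs a distribution over 66 bins that each cover 3 degrees and jointly span [-99, 99], and the continuous angle is recovered as the expectation of the bin angles. A minimal sketch of the decoding step (function and variable names are mine, not from the repo):

import torch
import torch.nn.functional as F

idx_tensor = torch.arange(66, dtype=torch.float32)  # bin indices 0..65

def bins_to_angle(logits):
    # softmax turns the 66 raw scores into bin probabilities
    probs = F.softmax(logits, dim=-1)
    # expected bin index, mapped to degrees: bin i corresponds to 3*i - 99
    return torch.sum(probs * idx_tensor, dim=-1) * 3 - 99

# e.g. all probability mass on bin 33 gives 33 * 3 - 99 = 0 degrees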
*
3.2. The Multi-Loss Approach
All previous work which predicted head pose using convolutional networks regressed all three Euler angles directly using a mean squared error loss. We notice that this approach does not achieve the best results on our large-scale synthetic training data.
We propose to use three separate losses, one for each angle. Each loss is a combination of two components: a binned pose classification and a regression component. Any backbone network can be used and augmented with three fully-connected layers which predict the angles. These three fully-connected layers share the previous convolutional layers of the network.
The idea behind this approach is that by performing bin classification we use the very stable softmax layer and cross-entropy, thus the network learns to predict the neighbourhood of the pose in a robust fashion. By having three cross-entropy losses, one for each Euler angle, we have three signals which are backpropagated into the network, which improves learning. In order to obtain fine-grained predictions we compute the expectation of each output angle for the binned output. The detailed architecture is shown in Figure 2.
We then add a regression loss to the network, namely a mean-squared error loss, in order to improve fine-grained predictions. We have three final losses, one for each angle, and each is a linear combination of both the respective classification and regression losses. We vary the weight of the regression loss in Section 4.4 and we hold the weight of the classification loss constant at 1. The final loss for each Euler angle is the following:
L = H(y, ŷ) + α · MSE(y, ŷ)
where H and MSE respectively designate the cross-entropy and mean-squared-error loss functions.
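As a sketch of how this combined loss could look in training code (my own paraphrase of the paper's description for the yaw branch; alpha is the regression weight swept in Section 4.4, and all names are my assumptions rather than the repo's):

import torch
import torch.nn as nn
import torch.nn.functional as F

cls_criterion = nn.CrossEntropyLoss()  # classification component, weight fixed at 1
reg_criterion = nn.MSELoss()           # regression component, weight alpha
alpha = 0.5                            # assumed value
idx_tensor = torch.arange(66, dtype=torch.float32)

def yaw_loss(yaw_logits, bin_label, angle_label):
    # cross-entropy on the 66-way bin classification
    cls_loss = cls_criterion(yaw_logits, bin_label)
    # expected angle from the bin probabilities, compared to the ground truth
    yaw_predicted = torch.sum(F.softmax(yaw_logits, dim=1) * idx_tensor, dim=1) * 3 - 99
    reg_loss = reg_criterion(yaw_predicted, angle_label)
    return cls_loss + alpha * reg_loss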
I'm currently working on a dataset where I have to predict an integer output, ranging from 1 to N. I've built a network with an MSE loss function, but I feel MSE may not be an ideal loss function to minimize in the case of integer output.
I'm also rounding my prediction to get an integer output. Is there a way to make/optimize the model better in the case of integer output?
Can anyone provide some help on how to deal with integer outputs/targets? This is the loss function I'm using right now:
model.compile(optimizer=SGD(0.001), loss='mse')
You are using the wrong loss: mean squared error is a loss for regression, and you have a classification problem (discrete outputs, not continuous).
So for this your model should have a softmax output layer:
model.add(Dense(N, activation="softmax"))
And you should be using a classification loss:
model.compile(optimizer=SGD(0.001), loss='sparse_categorical_crossentropy')
Assuming your labels are shifted into the [0, N-1] integer range (your range starts at 1, so subtract one from them first), this should work. To make a prediction, you should do:
output = np.argmax(model.predict(some_data), axis=1) + 1
The +1 is because the network's integer labels go from 0 to N-1, while your targets go from 1 to N.
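Putting it together, a minimal end-to-end sketch (the layer sizes, n_features, and the array names are placeholders I'm assuming, not from the question):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

model = Sequential()
model.add(Dense(64, activation="relu", input_shape=(n_features,)))
model.add(Dense(N, activation="softmax"))  # one output per possible integer
model.compile(optimizer=SGD(0.001), loss="sparse_categorical_crossentropy")

model.fit(x_train, y_train - 1, epochs=10)             # shift labels from 1..N to 0..N-1
output = np.argmax(model.predict(x_test), axis=1) + 1  # shift predictions back to 1..N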
Ordinal regression could be an appropriate approach, in case predicting the wrong month but close to the true month is considered a smaller mistake than predicting a value one year earlier or later. Only you can know that, based on the specific problem you want to solve.
I found an implementation of an appropriate loss function on GitHub (no affiliation). For completeness, below I copy-paste the code from that repo:
from keras import backend as K
from keras import losses

def loss(y_true, y_pred):
    # weight grows with the distance between the true and predicted class index
    weights = K.cast(
        K.abs(K.argmax(y_true, axis=1) - K.argmax(y_pred, axis=1)) / (K.int_shape(y_pred)[1] - 1),
        dtype='float32'
    )
    return (1.0 + weights) * losses.categorical_crossentropy(y_true, y_pred)
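Note that this loss expects one-hot targets; assuming your labels y run from 1 to N as in the question and the softmax model from the previous answer, a usage sketch:

from keras.utils import to_categorical

y_onehot = to_categorical(y - 1, num_classes=N)  # shift 1..N down to 0..N-1, then one-hot
model.compile(optimizer=SGD(0.001), loss=loss)
model.fit(x_train, y_onehot)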
This is a rather interesting question about Siamese networks.
I am following the example from https://keras.io/examples/mnist_siamese/.
My modified version of the code is in this google colab
The Siamese network takes in 2 inputs (2 handwritten digits) and outputs whether they are of the same digit (1) or not (0).
Each of the two inputs is first processed by a shared base_network (3 Dense layers with 2 Dropout layers in between): input_a is extracted into processed_a, input_b into processed_b.
The last layer of the Siamese network is a Euclidean distance layer between the two extracted tensors:
distance = Lambda(euclidean_distance,
                  output_shape=eucl_dist_output_shape)([processed_a, processed_b])
model = Model([input_a, input_b], distance)
I understand the reasoning behind using a Euclidean distance layer for the lower part of the network: if the features are extracted nicely, then similar inputs should have similar features.
I am wondering, why not use a normal Dense layer for the lower part instead, like this:
# distance = Lambda(euclidean_distance,
#                   output_shape=eucl_dist_output_shape)([processed_a, processed_b])
# model = Model([input_a, input_b], distance)

# my model
subtracted = Subtract()([processed_a, processed_b])
out = Dense(1, activation="sigmoid")(subtracted)
model = Model([input_a, input_b], out)
My reasoning is that if the extracted features are similar, then the Subtract layer should produce a small tensor as the difference between the extracted features, and the following Dense layer can learn to output 1 if its input is small and 0 otherwise.
Because the Euclidean distance layer outputs values close to 0 when the two inputs are similar and close to 1 otherwise, I also need to invert the accuracy and loss functions, as:
# the version of loss and accuracy for the Euclidean distance layer
# def contrastive_loss(y_true, y_pred):
#     '''Contrastive loss from Hadsell-et-al.'06
#     http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
#     '''
#     margin = 1
#     square_pred = K.square(y_pred)
#     margin_square = K.square(K.maximum(margin - y_pred, 0))
#     return K.mean(y_true * square_pred + (1 - y_true) * margin_square)

# def compute_accuracy(y_true, y_pred):
#     '''Compute classification accuracy with a fixed threshold on distances.
#     '''
#     pred = y_pred.ravel() < 0.5
#     return np.mean(pred == y_true)

# def accuracy(y_true, y_pred):
#     '''Compute classification accuracy with a fixed threshold on distances.
#     '''
#     return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))

### my version, loss and accuracy
def contrastive_loss(y_true, y_pred):
    margin = 1
    square_pred = K.square(y_pred)
    margin_square = K.square(K.maximum(margin - y_pred, 0))
    # return K.mean(y_true * square_pred + (1 - y_true) * margin_square)
    return K.mean(y_true * margin_square + (1 - y_true) * square_pred)

def compute_accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.
    '''
    pred = y_pred.ravel() > 0.5
    return np.mean(pred == y_true)

def accuracy(y_true, y_pred):
    '''Compute classification accuracy with a fixed threshold on distances.
    '''
    return K.mean(K.equal(y_true, K.cast(y_pred > 0.5, y_true.dtype)))
The accuracy for the old model:
* Accuracy on training set: 99.55%
* Accuracy on test set: 97.42%
This slight change leads to a model that does not learn anything:
* Accuracy on training set: 48.64%
* Accuracy on test set: 48.29%
So my questions are:
1. What is wrong with my reasoning of using Subtract + Dense for the lower part of the Siamese network?
2. Can we fix this? I have two potential solutions in mind, but I am not confident: (1) a convolutional neural net for feature extraction, (2) more Dense layers for the lower part of the Siamese network.
In the case of two similar examples, after subtracting the two n-dimensional feature vectors (extracted by the common/base feature-extraction model), you will get values at or near zero in most positions of the resulting n-dimensional vector, which is what the next/output Dense layer works on. On the other hand, we all know that in an ANN the weights are learnt in such a way that less important features produce very low responses and prominent/interesting features contributing towards the decision produce high responses.
Now you can see that our subtracted feature vector points in exactly the opposite direction: when two examples are from different classes it produces high responses, and low responses when they are from the same class. Furthermore, with a single node in the output layer (and no additional hidden layer before it), it is quite difficult for the model to learn to generate a high response from near-zero values when two samples are of the same class. This might be the key point for solving your problem.
Based on the above discussion, you may want to try the following ideas (a minimal sketch follows below):
transforming the subtracted feature vector so that similarity produces high responses, e.g. by subtracting it from 1 or taking the reciprocal (multiplicative inverse), followed by normalization;
adding more Dense layers before the output layer.
I won't be surprised if a convolutional neural net instead of stacked Dense layers for feature extraction (as you are thinking) does not improve your accuracy much, as it is just another way of doing the same thing (feature extraction).
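A minimal sketch of the first two ideas combined, reusing processed_a/processed_b and input_a/input_b from your code (the layer sizes and the 1 - |x| transform are my assumptions, not tested on your notebook):

from keras.models import Model
from keras.layers import Subtract, Lambda, Dense
from keras import backend as K

# idea 1: turn near-zero differences into high responses
subtracted = Subtract()([processed_a, processed_b])
similarity = Lambda(lambda x: 1.0 - K.abs(x))(subtracted)

# idea 2: extra Dense layers before the single-node output
hidden = Dense(64, activation="relu")(similarity)  # 64 and 32 are assumed sizes
hidden = Dense(32, activation="relu")(hidden)
out = Dense(1, activation="sigmoid")(hidden)

model = Model([input_a, input_b], out)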
I am creating a deep convolutional neural network for pixel-wise classification. I am using the Adam optimizer and softmax with cross-entropy.
Github Repository
I asked a similar question found here, but the answer I was given did not result in me solving the problem. I also have a more detailed graph of what is going wrong. Whenever I use softmax, the problem in the graph occurs. I have done many things, such as adjusting the learning and epsilon rates, trying different optimizers, etc. The loss never decreases past 500. I do not shuffle my data at the moment. Using sigmoid in place of softmax results in this problem not occurring. However, my problem has multiple classes, so the accuracy with sigmoid is not very good. It should also be mentioned that when the loss is low, my accuracy is only about 80%; I need much better than this. Why would my loss suddenly spike like this?
x = tf.placeholder(tf.float32, shape=[None, 7168])
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])
#Many Convolutions and Relus omitted
final = tf.reshape(final, [-1, 7168])
keep_prob = tf.placeholder(tf.float32)
W_final = weight_variable([7168,7168,3])
b_final = bias_variable([7168,3])
final_conv = tf.tensordot(final, W_final, axes=[[1], [1]]) + b_final
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(final_conv, 2), tf.argmax(y_, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
You need label smoothing.
I just had the same problem. I was training with tf.nn.sparse_softmax_cross_entropy_with_logits, which is the same as using tf.nn.softmax_cross_entropy_with_logits with one-hot labels. My dataset predicts the occurrence of rare events, so the labels in the training set are 99% class 0 and 1% class 1. My loss would start to fall, then stagnate (but with reasonable predictions), then suddenly explode, after which the predictions also went bad.
Using the tf.summary ops to log internal network state into Tensorboard, I observed that the logits were growing and growing in absolute value. Eventually at >1e8, tf.nn.softmax_cross_entropy_with_logits became numerically unstable and that's what generated those weird loss spikes.
In my opinion, the reason why this happens is the softmax function itself, which is in line with Jai's comment that putting a sigmoid in there before the softmax will fix things. That will quite surely also make it impossible for the softmax likelihoods to be accurate, as it limits the value range of the logits, but in doing so it prevents the overflow.
Softmax is defined as likelihood[i] = tf.exp(logit[i]) / tf.reduce_sum(tf.exp(logit)), where the sum runs over all classes. Cross-entropy is defined as tf.reduce_sum(-label_likelihood[i] * tf.log(likelihood[i])), so if your labels are one-hot, that reduces to just the negative logarithm of your target likelihood. In practice, that means you're pushing likelihood[true_class] as close to 1.0 as you can, and due to the softmax, the only way to do that is to make tf.exp(logit[i]) for every other class as close to 0.0 as possible relative to tf.exp(logit[true_class]).
So in effect, you have asked the optimizer to produce tf.exp(x) == 0.0, and the only way to do that is by making x == -infinity. That's why you get numerical instability.
The solution is to "blur" the labels: instead of [0, 0, 1] you use [0.01, 0.01, 0.98]. Now the optimizer only works to reach tf.exp(x) == 0.01, which results in x == -4.6, safely inside the numerical range where GPU calculations are accurate and reliable.
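In TF1 this can be done by hand or with the built-in label_smoothing argument of tf.losses.softmax_cross_entropy. A minimal sketch of both (the 0.03 factor is my assumption, chosen so that a 3-class one-hot [0, 0, 1] becomes [0.01, 0.01, 0.98]; for the pixel-wise case above, the tensors may need reshaping to [batch, num_classes] first):

num_classes = 3
smoothing = 0.03

# by hand: blur the one-hot labels toward the uniform distribution
y_smooth = y_ * (1.0 - smoothing) + smoothing / num_classes
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_smooth, logits=final_conv))

# or the built-in equivalent
cross_entropy = tf.losses.softmax_cross_entropy(
    onehot_labels=y_, logits=final_conv, label_smoothing=smoothing)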
Not sure what causes it exactly. I had the same issue a few times. A few things generally help: you might reduce the learning rate, i.e. the bound of the learning rate for Adam (e.g. from 1e-5 to 1e-7 or so), or try stochastic gradient descent. Adam tries to estimate learning rates, which can lead to unstable training: see "Adam optimizer goes haywire after 200k batches, training loss grows".
Once I also removed batchnorm, and that actually helped, but this was for a "specially" designed network for stroke data (= point sequences), which was not very deep and used Conv1d layers.
I want to implement an accuracy function for a triplet-loss network so that I know how the algorithm performs during training. So far I have tried something, but I'm not sure whether it can actually work, and I also have trouble implementing it in Keras. My idea was to compare the predicted anchor-positive and anchor-negative distances (in y_pred), so that the positive distance should be low enough and the negative one large enough:
def accuracy(_, y_pred):
    pos_threshold = 0.4
    neg_threshold = 0.6
    return K.mean(y_pred[0] < pos_threshold and y_pred[1] > neg_threshold)
The problem with this is that I couldn't figure out how to implement the and condition in Keras.
Then I tried to find something on the topic of accuracy for triplet loss. One way of doing it is to define the accuracy as the proportion of triplets in which the predicted distance between the anchor image and the positive image is less than the distance between the anchor image and the negative image. With this I have even bigger problems implementing it in Keras.
I tried this (although I don't know whether it does what I described):
K.mean(y_pred[0] < y_pred[1])
which gives me an accuracy of around 0.5 all the time (probably something random). So I still don't know whether the model is bad or the accuracy function is bad.
So my question is: how do I implement any reasonable accuracy function in Keras? I don't really care whether it is one of these two.
That's what I use (the condition y_pred[0] < y_pred[1]), while taking the batch dimension into account. Note that I'm not using a mean, so that it supports sample weighting.
from keras import backend as K

def triplet_accuracy(_, y_pred):
    '''
    Input:  y_pred shape is (batch_size, 2)
            [pos, neg]
    Output: shape (batch_size, 1)
            acc[i] = 1 if y_pred[i, 0] < y_pred[i, 1] else 0
    '''
    # neg - pos for each sample in the batch
    subtraction = K.constant([-1, 1], shape=(2, 1))
    diff = K.dot(y_pred, subtraction)
    # 1 where the positive distance is smaller than the negative one
    acc = K.maximum(K.sign(diff), K.constant(0))
    return acc
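Assuming your model outputs the two distances stacked as columns [pos, neg], as in the docstring, you would plug it in as a metric (triplet_loss here is a placeholder for whatever loss function you already use):

model.compile(optimizer='adam', loss=triplet_loss, metrics=[triplet_accuracy])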