Getting the wrong output when calculating cross-entropy loss with PyTorch
Hi guys,
I calculated the cross-entropy loss using PyTorch with Input = torch.tensor([[1.,0.0,0.0],[1.,0.0,0.0]]) and label = torch.tensor([0, 0]). I expected the output to be 0, but I got tensor(0.5514). Can anyone please explain why it comes out as 0.55 instead of 0? Code below for reference.
Yes, you are getting the correct output.
import torch
Input = torch.tensor([[1.,0.0,0.0],[1.,0.0,0.0]])
label = torch.tensor([0, 0])
print(torch.nn.functional.cross_entropy(Input,label))
# tensor(0.5514)
torch.nn.functional.cross_entropy combines log_softmax and nll_loss in a single function. It is equivalent to:
torch.nn.functional.nll_loss(torch.nn.functional.log_softmax(Input, 1), label)
Code for reference:
print(torch.nn.functional.softmax(Input, 1).log())
# tensor([[-0.5514, -1.5514, -1.5514],
#         [-0.5514, -1.5514, -1.5514]])
print(torch.nn.functional.log_softmax(Input, 1))
# tensor([[-0.5514, -1.5514, -1.5514],
#         [-0.5514, -1.5514, -1.5514]])
print(torch.nn.functional.nll_loss(torch.nn.functional.log_softmax(Input, 1), label))
# tensor(0.5514)
Now you can see that torch.nn.functional.cross_entropy(Input, label) is equal to
torch.nn.functional.nll_loss(torch.nn.functional.log_softmax(Input, 1), label)
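The underlying point, shown in a small sketch below (the big tensor is just an illustrative input, not from the question), is that cross_entropy expects raw logits rather than probabilities, so [1., 0., 0.] is passed through softmax first, and the loss only approaches 0 when the logit of the true class dominates the others:
import torch
import torch.nn.functional as F

Input = torch.tensor([[1., 0., 0.], [1., 0., 0.]])
label = torch.tensor([0, 0])

# By hand: loss = -log(softmax(x)[0]) = -log(e^1 / (e^1 + e^0 + e^0)) = log(e + 2) - 1
print(torch.log(torch.exp(torch.tensor(1.)) + 2) - 1)  # tensor(0.5514)

# The loss only approaches 0 when the true-class logit dominates the rest:
big = torch.tensor([[100., 0., 0.], [100., 0., 0.]])
print(F.cross_entropy(big, label))  # approximately tensor(0.)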
I am writing a custom loss function for semi-supervised learning on the CIFAR-10 dataset, for which I need to duplicate columns of a tensor to create a sort of mask that I then multiply with the activation values and later sum over.
My loss function is a sum of entropy for the unlabelled samples and cross-entropy for the labelled samples. I add an extra class and set it to 1 for unlabelled samples.
I then create a mask identifying the row indices of unlabelled samples from the y_true tensor. From that I get an (n_samples, 1) tensor, which I need to repeat/duplicate/copy into an (n_samples, 11) tensor that I can multiply with the activation values in y_pred.
Loss function code:
a = np.ones((mini_batch_size, 1)) * 10
a_var = K.variable(value=a)
v = K.cast(K.equal(K.cast(K.argmax(y_true, axis=1), 'float32'), a_var), 'float32')
e_loss = K.sum(K.concatenate([v,v,v,v,v,v,v,v,v,v,v], axis=-1) * K.log(y_pred) * y_pred)
m_u = K.sum(K.cast(K.equal(K.cast(K.argmax(y_true, axis=1), 'float32'), a_var), 'float32'))
b = np.ones((mini_batch_size, 1)) * 10
b_var = K.variable(value=b)
v2 = K.cast(K.not_equal(K.cast(K.argmax(y_true, axis=1), 'float32'), b_var), 'float32')
ce_loss = K.sum(K.concatenate([v2, v2, v2, v2, v2, v2, v2, v2, v2, v2, v2], axis=1) * K.log(y_pred))
m_l = K.variable(value=float(mini_batch_size), dtype='float32') #- m_u
return -((e_loss/m_u) + (ce_loss/m_l))
The error I get is:
InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Incompatible shapes: [40,11] vs. [40,440]
[[{{node loss_36/dense_74_loss/mul_2}}]]
[[metrics_28/acc/Mean/_2627]]
(1) Invalid argument: Incompatible shapes: [40,11] vs. [40,440]
[[{{node loss_36/dense_74_loss/mul_2}}]]
0 successful operations.
0 derived errors ignored.
My batch size is 40.
I need my concatenated tensor to be of size [40, 11] not [40, 440]
I don't have real data to test whether the loss works properly, but this got rid of that InvalidArgumentError and worked with model.fit() for a dense model.
A few changes I made:
You don't have to repeat v 11 times to multiply it with y_pred. All you need is to reshape it to (-1, 1); broadcasting takes care of the rest (and it saves memory). See the sketch after the code below.
I got rid of all the K.variables. This is something I want to check with you: you are not trying to optimize a_var and b_var, right (i.e. they are not part of the model)? Apparently that is what causes the issue (I would need to dig deeper to see exactly why). It seems the whole point of a_var and b_var is to perform the boolean comparisons equal and not_equal, which work just fine with a plain constant.
I made m_l a K.constant.
def loss_fn(y_true, y_pred):
    v = K.cast(K.equal(K.cast(K.argmax(y_true, axis=-1), 'float32'), 10), 'float32')
    e_loss = K.sum(K.reshape(v, (-1, 1)) * K.log(y_pred) * y_pred)
    m_u = K.sum(K.cast(K.equal(K.cast(K.argmax(y_true, axis=-1), 'float32'), 10), 'float32'))
    v2 = K.cast(K.not_equal(K.cast(K.argmax(y_true, axis=-1), 'float32'), 10), 'float32')
    ce_loss = K.sum(K.reshape(v2, (-1, 1)) * K.log(y_pred))
    m_l = K.constant(value=float(mini_batch_size), dtype='float32')  # - m_u
    return -((e_loss / m_u) + (ce_loss / m_l))
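For reference, here is a minimal sketch (hypothetical shapes and random stand-in values, with K being the Keras backend as in the snippets above) of why the reshape is enough: a (batch, 1) tensor broadcasts against a (batch, 11) tensor, so no concatenation is needed.
import numpy as np
from tensorflow.keras import backend as K

y_pred = K.constant(np.random.rand(40, 11).astype('float32'))          # stand-in activations
v = K.constant(np.random.randint(0, 2, size=(40,)).astype('float32'))  # stand-in mask

masked = K.reshape(v, (-1, 1)) * y_pred   # (40, 1) broadcasts against (40, 11)
print(K.int_shape(masked))                # (40, 11)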
Note: making the loss function depend on the batch size is a bad idea. Try to get rid of any batch_size-dependent operations (especially ones that fix the shapes of tensors). You can see that I have only kept mini_batch_size to set m_l, but I would suggest setting it to some constant instead. If a batch with fewer than 40 samples comes through, you are effectively using a different loss function for that batch, and your results are not comparable between different batch sizes because the loss function changes.
In setting up the model I sometimes see the code:
# Scenario 1
# Define loss and optimizer
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=logits, labels=Y))
or
# Scenario 2
# Evaluate model (with test logits, for dropout to be disabled)
prediction = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))
The documentation of tf.reduce_mean states that it "calculates the mean of tensor elements along various dimensions of the tensor." I am confused about what it does, in simpler language. When do we need to use it, perhaps with reference to Scenario 1 and Scenario 2? Thank you.
As far as I understand, tensorflow.reduce_mean is the same as numpy.mean. It creates an operation in the underlying TensorFlow graph which computes the mean of a tensor.
The most important keyword argument of tensorflow.reduce_mean is axis. Basically, if you have a tensor of shape (4, 3, 2) and pass axis=1, an empty array of shape (4, 2) is created, and the mean values along the selected axis are computed to fill it in. (This is just a pseudo-process to help you make sense of the output; it may not be what actually happens internally.)
Here is a simple example to help you understand
import tensorflow as tf
import numpy as np
one = np.linspace(1, 30, 30).reshape(5, 3, 2)
x = tf.placeholder('float32', shape=[5, 3, 2])
op_1 = tf.reduce_mean(x)
op_2 = tf.reduce_mean(x, axis=0)
op_3 = tf.reduce_mean(x, axis=1)
op_4 = tf.reduce_mean(x, axis=2)
with tf.Session() as sess:
    print(sess.run(op_1, feed_dict={x: one}))
    print(sess.run(op_2, feed_dict={x: one}))
    print(sess.run(op_3, feed_dict={x: one}))
    print(sess.run(op_4, feed_dict={x: one}))
The first output is a single number because we didn't provide an axis. The shapes of the remaining outputs are (3, 2), (5, 2) and (5, 3), respectively.
reduce_mean is useful when the value you want to average is itself a tensor (for example a matrix), rather than a single number.
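As a quick sanity check of the numpy analogy above, the same reductions in plain numpy produce the same shapes:
import numpy as np

one = np.linspace(1, 30, 30).reshape(5, 3, 2)
print(one.mean())   # a single number, like op_1
print(one.mean(axis=0).shape, one.mean(axis=1).shape, one.mean(axis=2).shape)
# (3, 2) (5, 2) (5, 3)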
User #meTchaikovsky explained the general case of tf.reduce_mean. In both of your cases tf.reduce_mean simply works as an ordinary mean calculator, i.e. you are not taking the mean along any particular axis of a tensor; you simply divide the sum of the elements in a tensor by the number of elements.
Let's decode exactly what is happening in both cases. For both, assume batch_size = 2 and num_classes = 5, meaning there are two examples per batch.
Now, for the first case, tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y) returns an array of shape (2,).
>>import numpy as np
>>import tensorflow as tf
>>sess= tf.InteractiveSession()
>>batch_size = 2
>>num_classes = 5
>>logits = np.random.rand(batch_size,num_classes)
>>print(logits)
[[0.94108451 0.68186329 0.04000461 0.25996487 0.50391948]
[0.22781201 0.32305269 0.93359371 0.22599208 0.05942905]]
>>labels = np.array([[1,0,0,0,0],[0,1,0,0,0]])
>>print(labels)
[[1 0 0 0 0]
[0 1 0 0 0]]
>>logits_ = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_classes))
>>Y_ = tf.placeholder(dtype=tf.int32,shape=(batch_size,num_classes))
>>loss_op = tf.nn.softmax_cross_entropy_with_logits(logits=logits_, labels=Y_)
>>loss_per_example = sess.run(loss_op,feed_dict={Y_:labels,logits_:logits})
>>print(loss_per_example)
array([1.2028817, 1.6912657], dtype=float32)
You can see that loss_per_example has shape (2,). If we take the mean of this variable, we get the average loss over the full batch. Hence we calculate:
>>loss_per_example_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size))
>>final_loss_per_batch = tf.reduce_mean(loss_per_example_holder)
>>final_loss = sess.run(final_loss_per_batch,feed_dict={loss_per_example_holder:loss_per_example})
>>print(final_loss)
1.4470737
Coming to your second case:
>>predictions_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size,num_classes))
>>labels_holder = tf.placeholder(dtype=tf.int32,shape=(batch_size,num_classes))
>>prediction_tf = tf.equal(tf.argmax(predictions_holder, 1), tf.argmax(labels_holder, 1))
>>labels_match = sess.run(prediction_tf,feed_dict={predictions_holder:logits,labels_holder:labels})
>>print(labels_match)
[ True False]
The above output is expected, because only for the first example of logits is the neuron with the highest activation (0.9410) the zeroth one, which matches the label. Now we want to calculate the accuracy, which means we have to take the average of the variable labels_match.
>>labels_match_holder = tf.placeholder(dtype=tf.float32,shape=(batch_size))
>>accuracy_calc = tf.reduce_mean(tf.cast(labels_match_holder, tf.float32))
>>accuracy = sess.run(accuracy_calc, feed_dict={labels_match_holder:labels_match})
>>print(accuracy)
0.5
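Putting the two scenarios together in a single graph (a sketch reusing sess, logits and labels from above; logits_ph and labels_ph are placeholder names introduced here, and the labels are fed as floats because softmax_cross_entropy_with_logits expects labels of the same dtype as the logits):
>>logits_ph = tf.placeholder(dtype=tf.float32, shape=(batch_size, num_classes))
>>labels_ph = tf.placeholder(dtype=tf.float32, shape=(batch_size, num_classes))
>>loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits_ph, labels=labels_ph))            # Scenario 1
>>accuracy_op = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits_ph, 1), tf.argmax(labels_ph, 1)), tf.float32))    # Scenario 2
>>print(sess.run([loss_op, accuracy_op], feed_dict={logits_ph: logits, labels_ph: labels}))
[1.4470737, 0.5]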
This is my code, where I am running a multi-layer LSTM:
layers = [tf.contrib.rnn.LSTMCell(num_units=n_neurons,
                                  activation=tf.nn.leaky_relu, use_peepholes=True)
          for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
stacked_outputs = tf.layers.dense(stacked_rnn_outputs, n_outputs)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])
outputs = outputs[:,n_steps-1,:] # keep only last output of sequence
loss = tf.reduce_mean(tf.squared_difference(outputs, y)) # loss function = mean squared error
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
As you can see, I am using a mean squared error loss to direct my LSTM. The output I am getting is between 0 and 1, which is not exactly what I was expecting. My guess is that this is happening because my error function is not right for this task.
I am expecting the training outputs to fall between 0 and 0.5, and between 0.5 and 1. But the output is strange: it comes out around 0.9 to 1, which is not correct. Ultimately I am just getting the mean of everything, but that is not what I expect from the loss function. Please guide me on what would be the best error function for my scenario.
I created a neural network and attempted to train it; all was well until I added a bias.
From what I gather, during training the bias adjusts to move the expected output up or down, and the weights tend towards values that help YHat emulate some function. So for a two-layer network:
output = tanh(tanh(X0W0 + b0)W1 + b1)
In practice what I've found is that W sets all weights to near 0, and b almost echoes the trained output Y. This essentially makes the output work perfectly for the training data, but when you give it different kinds of data it always gives the same output.
This has caused quite some confusion. I know that the bias's role is to move the activation graph up or down, but when it comes to training it seems to make the entire purpose of the neural network irrelevant. Here is the code from my training method:
def train(self, X, Y, loss, epoch=10000):
    for i in range(epoch):
        YHat = self.forward(X)
        loss.append(sum(Y - YHat))
        err = -(Y - YHat)
        for l in self.__layers[::-1]:
            werr = np.sum(np.dot(l.localWGrad, err.T), axis=1)
            werr.shape = (l.height, 1)
            l.adjustWeights(werr)
            err = np.sum(err, axis=1)
            err.shape = (X.shape[0], 1)
            l.adjustBiases(err)
            err = np.multiply(err, l.localXGrad)
and here is the code for adjusting the weights and biases. (Note: epsilon is my training rate and lambda is the regularisation rate.)
def adjustWeights(self, err):
    self.__weights = self.__weights - (err * self.__epsilon + self.__lambda * self.__weights)

def adjustBiases(self, err):
    a = np.sum(np.multiply(err, self.localPartialGrad), axis=1) * self.__epsilon
    a.shape = (err.shape[0], 1)
    self.__biases = self.__biases - a
And here is the math I've done for this network.
Z0 = X0W0 + b0
X1 = relu(Z0)
Z1 = X1W1 + b1
X2 = relu(Z1)
a = YHat-X2
#Note the second part is for regularisation
loss = ((1/2)*(a^2)) + (lambda*(1/2)*(sum(W0^2) + sum(W1^2)))
And now the derivatives
dloss/dW1 = -(YHat-X2)*relu'(X1W1 + b1)X1
dloss/dW0 = -(YHat-X2)*relu'(X1W1 + b1)W1*relu'(X0W0 + b0)X0
dloss/db1 = -(YHat-X2)*relu'(X1W1 + b1)
dloss/db0 = -(YHat-X2)*relu'(X1W1 + b1)W1*relu'(X0W0 + b0)
I'm guessing I'm doing something wrong, but I have no idea what it is. I tried training this network on the following inputs
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Xnorm = X / np.amax(X)
Y = np.array([[0.0], [2.0], [4.0], [6.0]])
Ynorm = Y / np.amax(Y)
And I get this as the output:
post training:
shape: (4, 1)
[[0. ]
[1.99799666]
[3.99070622]
[5.72358125]]
Expected:
[[0.]
[2.]
[4.]
[6.]]
Which seems great... until you forward something else:
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
Then I get:
shape: (4, 1)
[[0.58289512]
[2.59967085]
[4.31654068]
[5.74322541]]
Expected:
[[4.]
[6.]
[8.]
[10.]]
I thought "perhapse this is the evil 'Overfitting I've heard of" and decided to add in some regularisation, but even that doesn't really solve the issue, why would it when it makes sense from a logical perspective that it's faster, and more optimal to set the biases to equal the output and make the weights zero... Can someone explain what's going wrong in my thinking?
Here is the network structure post training, (note if you multiply the output by the max of the training Y you will get the expected output:)
===========================NeuralNetwork===========================
Layers:
===============Layer 0 :===============
Weights: (1, 3)
[[0.05539559 0.05539442 0.05539159]]
Biases: (4, 1)
[[0. ]
[0.22897166]
[0.56300199]
[1.30167665]]
==============\Layer 0 :===============
===============Layer 1 :===============
Weights: (3, 1)
[[0.29443245]
[0.29442639]
[0.29440642]]
Biases: (4, 1)
[[0. ]
[0.13199981]
[0.32762199]
[1.10023446]]
==============\Layer 1 :===============
==========================\NeuralNetwork===========================
The graph y = 2x crosses the y-axis at 0, and thus it would make sense for all the biases to be 0, as we aren't moving the graph up or down... right?
Thanks for reading this far!
edit:
Here is the loss graph: [image not reproduced here]
edit 2:
I just tried to do this with a single weight and output and here is the output structure I got:
===========================NeuralNetwork===========================
Layers:
===============Layer 0 :===============
Weights: (1, 1)
[[0.47149317]]
Biases: (4, 1)
[[0. ]
[0.18813419]
[0.48377987]
[1.33644038]]
==============\Layer 0 :===============
==========================\NeuralNetwork===========================
and for this input:
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
I got this output:
shape: (4, 1)
[[4.41954787]
[5.53236625]
[5.89599366]
[5.99257962]]
when again it should be:
Expected:
[[4.]
[6.]
[8.]
[10.]]
Note that the problem with the biases persists; you would think that in this situation the weight would be 2 and the bias would be 0.
Moved answer from OP's question
Turns out I never dealt with my training data properly. The input vector:
[[0.0], [1.0], [2.0], [3.0]]
was normalised: I divided this vector by the max value in the input, which was 3, and thus got
[[0.0], [0.3333], [0.6666], [1.0]]
And for the input Y training vector I had
[[0.0], [2.0], [4.0], [6.0]]
and I foolishly decided to do the same with this vector, but with the max of Y, which was 6:
[[0.0], [0.333], [0.666], [1.0]]
So basically I was saying "hey network, mimic my input". That was my first error. The second error came from a further misunderstanding of the scaling.
In training, an input of 1 became 0.333, doubling it gives 0.666, and scaling back up was supposed to recover 2. But if I try the same thing with a different set of data, say:
[[2.0], [3.0], [4.0], [5.0]]
then 2 becomes 2/5 = 0.4, and 0.4*2 = 0.8, which times 5 would give the correct 4. However, in the real world we would have no way of knowing that 5 was the right scale for this dataset, so I used the max of the training Y, which was 6, and got 0.8*6 = 4.8 instead of 4.
So I got some strange behaviour from the biases and weights as a result. After essentially getting rid of the normalisation, I was free to tweak the hyperparameters, and now, for the base training data:
input:
X:
[[0.]
[1.]
[2.]
[3.]]
I get this output:
shape: (4, 1)
[[0.30926124]
[2.1030826 ]
[3.89690395]
[5.6907253 ]]
and for the extra testing data (not trained on):
shape: (4, 1)
[[2.]
[3.]
[4.]
[5.]]
I get this output:
shape: (4, 1)
[[3.89690395]
[5.6907253 ]
[7.48454666]
[9.27836801]]
So now I'm happy. I also changed my activation to a leaky ReLU, as it should fit a linear equation better (I think). I'm sure that with more testing data and more tweaking of the hyperparameters it would be a perfect fit. Thanks for the help, everyone. Trying to explain my problem really put things into perspective.
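As a follow-up, here is a minimal sketch (plain numpy; net below is a hypothetical trained model, not code from this post) of the usual way to handle the scaling: fix the normalisation constants from the training data once, and reuse them, and their inverse, for any later inputs, instead of re-normalising each new dataset by its own max.
import numpy as np

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
Y_train = np.array([[0.0], [2.0], [4.0], [6.0]])

x_scale = X_train.max()   # 3.0, fixed once from the training data
y_scale = Y_train.max()   # 6.0, fixed once from the training data

def to_net(x):            # what the network sees
    return x / x_scale

def from_net(y_norm):     # map network output back to the original units
    return y_norm * y_scale

X_test = np.array([[2.0], [3.0], [4.0], [5.0]])
# net(...) is the hypothetical trained network; the point is only that X_test is
# scaled by x_scale (3), never by its own max (5):
# Y_pred = from_net(net(to_net(X_test)))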
Here is an implementation of the AND function with a single neuron using TensorFlow:
import numpy as np
import tensorflow as tf

def tf_sigmoid(x):
    return 1 / (1 + tf.exp(-x))

data = [
    (0, 0),
    (0, 1),
    (1, 0),
    (1, 1),
]

labels = [
    0,
    0,
    0,
    1,
]

n_steps = 1000
learning_rate = .1

x = tf.placeholder(dtype=tf.float32, shape=[2])
y = tf.placeholder(dtype=tf.float32, shape=None)

w = tf.get_variable('W', shape=[2], initializer=tf.random_normal_initializer(), dtype=tf.float32)
b = tf.get_variable('b', shape=[], initializer=tf.random_normal_initializer(), dtype=tf.float32)

h = tf.reduce_sum(x * w) + b
output = tf_sigmoid(h)
error = tf.abs(output - y)

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(error)

sess = tf.Session()
sess.run(tf.initialize_all_variables())

for step in range(n_steps):
    for i in np.random.permutation(range(len(data))):
        sess.run(optimizer, feed_dict={x: data[i], y: labels[i]})
Sometimes it works perfectly, but with some parameters it gets stuck and doesn't want to learn. For example, with these initial parameters:
w = tf.Variable(initial_value=[-0.31199348, -0.46391705], dtype=tf.float32)
b = tf.Variable(initial_value=-1.94877, dtype=tf.float32)
it will hardly make any improvement in the cost function. What am I doing wrong? Maybe I should somehow adjust the initialization of the parameters?
Aren't you missing a mean(error) ?
Your problem is the particular combination of the sigmoid, the cost function, and the optimizer.
Don't feel bad; AFAIK this exact problem stalled the entire field for a few years.
The sigmoid is flat when you're far from the middle, and you're initializing it with relatively large numbers; try dividing them by 1000.
So your absolute error (or squared error) is flat too, and the GradientDescent optimizer takes steps proportional to the slope.
Either of these should fix it:
Use cross-entropy for the error; it's convex.
Use a better optimizer, like Adam, whose step size is much less dependent on the slope.
Bonus: don't roll your own sigmoid; use tf.nn.sigmoid and you'll get a lot fewer NaNs that way.
Have fun!
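For what it's worth, here is a minimal sketch of those two suggestions applied to the code above (same TF1-style API as the question; the variable names are mine): cross-entropy on the raw logit via tf.nn.sigmoid_cross_entropy_with_logits, plus the Adam optimizer.
import numpy as np
import tensorflow as tf

data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
labels = np.array([0, 0, 0, 1], dtype=np.float32)

x = tf.placeholder(tf.float32, shape=[None, 2])
y = tf.placeholder(tf.float32, shape=[None])

w = tf.get_variable('W', shape=[2], initializer=tf.random_normal_initializer())
b = tf.get_variable('b', shape=[], initializer=tf.random_normal_initializer())

logits = tf.reduce_sum(x * w, axis=1) + b   # raw score; no sigmoid applied here
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        sess.run(train_op, feed_dict={x: data, y: labels})
    print(sess.run(tf.sigmoid(logits), feed_dict={x: data}))   # roughly [0, 0, 0, 1]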