pytorch custom loss function nn.CrossEntropyLoss - python

After studying autograd, I tried to write a loss function myself.
Here is my loss:
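import torch
import torch.nn.functional as F

def myCEE(outputs, targets):
    exp = torch.exp(outputs)
    A = torch.log(torch.sum(exp, dim=1))  # log of the softmax denominator
    hadamard = F.one_hot(targets, num_classes=10).float() * outputs
    B = torch.sum(hadamard, dim=1)  # logit of the target class
    return torch.sum(A - B)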
I compared it with torch.nn.CrossEntropyLoss.
Here are the results:
for i, j in train_dl:
    inputs = i
    targets = j
    break
outputs = model(inputs)
myCEE(outputs,targets) : tensor(147.5397, grad_fn=<SumBackward0>)
loss_func = nn.CrossEntropyLoss(reduction='sum') : tensor(147.5397, grad_fn=<NllLossBackward>)
The values were the same.
I thought that, since these are different functions, the grad_fn would differ but it would not cause any problems.
But something happened!
After 4 epochs, the loss values turned to nan.
With nn.CrossEntropyLoss, by contrast, training went well.
So I wonder whether there is a problem with my function.
After reading some posts about nan problems, I stacked more convolutions onto the model.
As a result, 39 epochs of training did not produce an error.
Nevertheless, I'd like to know the difference between myCEE and nn.CrossEntropyLoss.

torch.nn.CrossEntropyLoss differs from your implementation because it uses a trick to avoid the unstable computation of the exponential when the logits are numerically large. Given the logits {l_1, ..., l_j, ..., l_n}, the softmax is defined as:
softmax(l_i) = exp(l_i) / sum_j(exp(l_j))
The trick is to multiply both the numerator and the denominator by exp(-β):
softmax(l_i) = exp(l_i)*exp(-β) / [sum_j(exp(l_j))*exp(-β)]
= exp(l_i-β) / sum_j(exp(l_j-β))
Then the log-softmax comes down to:
logsoftmax(l_i) = l_i - β - log[sum_j(exp(l_j-β))]
In practice β is chosen as the highest logit value, i.e. β = max_j(l_j), so every exponent is at most 0 and exp cannot overflow.
You can read more about it in this question: Numerically Stable Softmax.
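To make the trick concrete, here is a minimal sketch of a sum-reduced cross entropy that subtracts the per-row maximum before exponentiating, next to the formulation from the question. The names stable_cee and naive_cee and the toy logits are made up for illustration, and this is not a copy of how PyTorch implements nn.CrossEntropyLoss internally. It only shows that both versions agree on moderate logits while the naive one stops being finite once exp overflows, which is a plausible explanation for the nan seen after a few epochs.

```python
import torch
import torch.nn.functional as F


def stable_cee(outputs, targets):
    """Sum-reduced cross entropy using the max-subtraction trick."""
    # beta = max_j(l_j), taken per row; subtracting it only shifts the logits
    beta = outputs.max(dim=1, keepdim=True).values
    shifted = outputs - beta
    # log-softmax(l_i) = (l_i - beta) - log(sum_j exp(l_j - beta))
    log_probs = shifted - torch.log(torch.exp(shifted).sum(dim=1, keepdim=True))
    # pick the log-probability of the target class for each sample
    picked = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return -picked.sum()


def naive_cee(outputs, targets):
    """Formulation from the question: exp() is applied to the raw logits."""
    log_sum = torch.log(torch.exp(outputs).sum(dim=1))
    picked = outputs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return (log_sum - picked).sum()


targets = torch.tensor([0, 2])
small = torch.tensor([[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]])
large = small + 1000.0  # exp(1000) overflows in float32

reference = F.cross_entropy(small, targets, reduction="sum")
print(stable_cee(small, targets), reference)  # agree on moderate logits
print(naive_cee(large, targets))              # not finite: exp overflowed
print(stable_cee(large, targets))             # finite, same value as for `small`
```

Shifting every logit in a row by the same constant does not change the softmax, which is why the stable version returns the same loss for `small` and `large`.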

Related

The output of softmax makes the binary cross entropy's output NAN, what should I do?

I have implemented a neural network in TensorFlow whose last layer is a convolution layer. I feed the output of this layer into a softmax activation and then, together with the labels, into the cross-entropy loss function defined below. The problem is that the loss comes out as NaN, and I figured out that this happens because the softmax output contains values equal to 1. What should I do in this case?
My input is a 16 by 16 image where each pixel has a value of 0 or 1 (binary classification).
My loss function:
# Loss function
def loss(prediction, label):
    # with tf.variable_scope("Loss") as Loss_scope:
    log_pred = tf.log(prediction, name='Prediction_Log')
    log_pred_2 = tf.log(1-prediction, name='1-Prediction_Log')
    cross_entropy = -tf.multiply(label, log_pred) - tf.multiply((1-label), log_pred_2)
    return cross_entropy
Note that log(0) is not finite (it evaluates to -inf), so whenever prediction is exactly 0 or 1 the loss becomes inf or NaN; the NaN comes from the product 0 * (-inf).
To get around this, it is commonplace to add a very small value epsilon to the value passed to tf.log in any loss function (a similar thing is done when dividing, to avoid division by zero). This makes the loss function numerically stable, and epsilon is small enough that the inaccuracy it introduces is negligible.
Perhaps try something like:
# Loss function
def loss(prediction, label):
    # with tf.variable_scope("Loss") as Loss_scope:
    epsilon = tf.constant(0.000001)
    log_pred = tf.log(prediction + epsilon, name='Prediction_Log')
    log_pred_2 = tf.log(1-prediction + epsilon, name='1-Prediction_Log')
    cross_entropy = -tf.multiply(label, log_pred) - tf.multiply((1-label), log_pred_2)
    return cross_entropy
UPDATE:
As jdehesa points out in the comments, though, the 'out of the box' loss functions already handle the numerical stability issue nicely.
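As an illustration of what "handling stability" can look like, here is a NumPy sketch of binary cross entropy computed directly from logits instead of from probabilities. It uses the rearrangement max(x, 0) - x*z + log(1 + exp(-|x|)), which to my knowledge is the form given in the documentation of tf.nn.sigmoid_cross_entropy_with_logits. The function names and the sample values are invented for this example, and it assumes a sigmoid output rather than the softmax used in the question.

```python
import numpy as np


def bce_from_logits(logits, labels):
    """Element-wise binary cross entropy computed from raw logits.

    Uses max(x, 0) - x*z + log(1 + exp(-|x|)), which never evaluates log(0).
    """
    x = np.asarray(logits, dtype=np.float64)
    z = np.asarray(labels, dtype=np.float64)
    return np.maximum(x, 0.0) - x * z + np.log1p(np.exp(-np.abs(x)))


def bce_from_probs(p, labels):
    """Textbook form on probabilities; breaks when p is exactly 0 or 1."""
    p = np.asarray(p, dtype=np.float64)
    z = np.asarray(labels, dtype=np.float64)
    return -z * np.log(p) - (1.0 - z) * np.log(1.0 - p)


logits = np.array([-2.0, 0.0, 3.0, 50.0])
labels = np.array([0.0, 1.0, 1.0, 1.0])
probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid; the last entry rounds to 1.0

print(bce_from_logits(logits, labels))  # finite everywhere
with np.errstate(divide="ignore", invalid="ignore"):
    print(bce_from_probs(probs, labels))  # last entry is nan (0 * -inf)
```

Unlike the epsilon approach, this version introduces no bias into the loss, but it requires access to the logits before the activation is applied.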

Numerical equivalence of PyTorch backpropagation

After writing a simple neural network with numpy, I wanted to compare it numerically with a PyTorch implementation. Run on its own, my neural network implementation seems to converge, so it seems to have no errors.
I also checked that the forward pass matches PyTorch, so the basic setup is correct.
But something different happens during the backward pass, because the weights after one backpropagation step are different.
I don't want to post the full code here because it is spread over several .py files and most of it is irrelevant to the question. I just want to know whether PyTorch uses "basic" gradient descent or something different.
I am looking at the simplest example, the fully connected weights of the last layer, because if they differ, everything further back will differ too:
self.weight += self.learning_rate * hidden_layer.T.dot(output_delta )
where
output_delta = self.expected - self.output
self.expected holds the expected values,
self.output is the forward pass result.
There is no activation or anything further here.
The torch part is:
optimizer = torch.optim.SGD(nn.parameters() , lr = 1.0)
criterion = torch.nn.MSELoss(reduction='sum')
output = nn.forward(x_train)
loss = criterion(output, y_train)
loss.backward()
optimizer.step()
optimizer.zero_grad()
So is it possible that, with the SGD optimizer and MSELoss, PyTorch uses a different delta or backpropagation function than the basic one mentioned above? If so, I'd like to know how to check my numpy solution numerically against PyTorch.
I just want to know whether PyTorch uses "basic" gradient descent or something different.
If you use torch.optim.SGD, you get stochastic gradient descent.
There are different variants of GD. torch.optim.SGD updates the parameters from whatever gradients the last backward() call produced, which in practice usually come from a mini-batch.
Some GD implementations update the parameters only after a full epoch. As you may guess, they are very "slow"; this may be fine for testing on supercomputers. Other GD implementations update after every sample; as you may guess, their weakness is "huge" gradient fluctuations.
These are all relative terms, which is why I am using quotation marks.
Note that you are using a large learning rate, lr = 1.0, which suggests you haven't normalized your data first, but this is a skill you will pick up over time.
So is it possible that, with the SGD optimizer and MSELoss, PyTorch uses a different delta or backpropagation function than the basic one mentioned above?
It uses what you tell it to use.
Here is an example in PyTorch and in plain Python showing that the gradient computation (used in backpropagation) works as expected:
import torch
x = torch.tensor([5.], requires_grad=True)
print(x) # tensor([5.], requires_grad=True)
y = 3*x**2
y.backward()
print(x.grad) # tensor([30.])
How would you get this value of 30 in plain Python?
def y(x):
    return 3*x**2
x = 5
e = 0.01  # finite-difference step
g = (y(x+e) - y(x)) / e
print(g)  # ~30.03
As expected, we get ~30; the estimate gets even closer with a smaller step e.
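One way to run the numerical check the question asks for is to perform a single update in both frameworks from the same starting weights. The sketch below does that for a single linear layer with made-up data. It also points at one possible source of the mismatch, which the thread itself does not confirm: the derivative of a sum-reduced squared error carries a factor of 2, while the numpy update in the question uses the plain difference expected - output. All array names and sizes here are illustrative.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 3))    # hypothetical hidden-layer activations
expected = rng.normal(size=(4, 2))  # hypothetical targets
w0 = rng.normal(size=(3, 2))        # shared initial last-layer weights
lr = 0.1

# NumPy: the update from the question, and the same update with the factor 2
output = hidden @ w0
output_delta = expected - output
w_question = w0 + lr * hidden.T @ output_delta
w_factor2 = w0 + lr * 2.0 * hidden.T @ output_delta

# PyTorch: one SGD step on a sum-reduced MSE loss
w = torch.tensor(w0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=lr)
criterion = torch.nn.MSELoss(reduction="sum")
loss = criterion(torch.tensor(hidden) @ w, torch.tensor(expected))
loss.backward()
optimizer.step()
w_torch = w.detach().numpy()

print(np.allclose(w_torch, w_factor2))   # True: plain gradient descent
print(np.allclose(w_torch, w_question))  # False: the factor 2 is missing
```

If the two results match only after including the factor, halving the learning rate on the PyTorch side (or doubling it on the numpy side) should make the original code agree.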

How to understand the loss function in scikit-learn logistic regression code?

The code for the loss function in scikit-learn logistic regression is:
# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)
However, it seems to differ from the common form of the logarithmic loss function, which reads:
-(y log(p) + (1-y) log(1-p))
(please see http://wiki.fast.ai/index.php/Log_Loss)
Could anyone explain how to understand the code for the loss function in scikit-learn logistic regression, and how it relates to the general form of the logarithmic loss function?
Thank you in advance.
First, note that 0.5 * alpha * np.dot(w, w) is just the L2 regularization term. Without it, the sklearn logistic regression loss reduces to the following:
-np.sum(sample_weight * log_logistic(yz))
Also, the np.sum is there because the loss covers multiple samples, so for a single sample it reduces further to
-sample_weight * log_logistic(yz)
Finally, if you read HERE, you will see that sample_weight is an optional array of weights assigned to individual samples. If it is not provided, each sample is given unit weight. So it can be taken as one (the original definition of cross-entropy loss does not weight samples unequally), and the loss reduces to:
- log_logistic(yz)
which is equivalent to
- log_logistic(y * np.dot(X, w)).
Now, why does it look different from the cross-entropy loss function (in essence it is the same), i.e.:
-[y log(p) + (1-y) log(1-p)]
The reason is that binary classification can use either of two labeling conventions, {0, 1} or {-1, 1}, and each leads to a different representation. With y in {-1, 1} and z = np.dot(X, w), -log_logistic(yz) equals -log(p) for y = 1 and -log(1-p) for y = -1, where p = 1/(1+exp(-z)). So they are the same!
More details (on why they are the same) can be found HERE. Note that you should read the response by Manuel Morales.
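The equivalence of the two forms can also be checked numerically. The sketch below uses random data and a stand-in log_logistic written with np.logaddexp; it is not scikit-learn's own implementation, and the variable names are chosen for this example. It evaluates the loss once with labels in {-1, 1} and once with the same labels mapped to {0, 1}.

```python
import numpy as np


def log_logistic(x):
    """log(1 / (1 + exp(-x))), written with logaddexp for stability."""
    return -np.logaddexp(0.0, -x)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
w = rng.normal(size=3)
z = X @ w                            # raw scores
y01 = np.array([0, 1, 1, 0, 1, 0])   # labels in {0, 1}
ypm = 2 * y01 - 1                    # the same labels in {-1, 1}

# Form from the scikit-learn snippet (unit sample weights, no penalty term)
loss_pm = -log_logistic(ypm * z)

# Common log-loss form with p = sigmoid(z)
p = 1.0 / (1.0 + np.exp(-z))
loss_01 = -(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

print(np.allclose(loss_pm, loss_01))  # True
```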

Can Neural Network model use Weighted Mean (Sum) Squared Error as its loss function?

I am new to this field of study and this is probably a pretty silly question. I want to build a normal ANN, but I am not sure whether I can use a weighted mean squared error as the loss function.
If we do not treat every sample equally, that is, we care more about prediction accuracy for some categories of samples than for others, then we want a weighted loss function.
Let's say we have a categorical feature ci, where i is the index of the sample, and for simplicity assume that this feature is binary, taking the value 0 or 1. We can then form the loss function as
(ci + 1)(yi_hat - yi)^2
#and take the sum for all i
Is there going to be any problem with back-propagation? I don't see any issue with calculating the gradient or updating the weights between layers.
And if there is no issue, how can I program this loss function in Keras? The loss function seems to take only two parameters, y_true and y_pred, so how can I plug in the vector c?
There is absolutely nothing wrong with that. Functions can declare constants within themselves or even take constants from an outside scope:
import keras
import keras.backend as K
c = K.constant([c1,c2,c3,c4,...,cn])
def weighted_loss(y_true,y_pred):
    loss = keras.losses.get('mse')
    return c * loss(y_true,y_pred)
Exactly like yours:
def weighted_loss(y_true,y_pred):
    weighted = (c+1)*K.square(y_true-y_pred)
    return K.sum(weighted)
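On the back-propagation part of the question: the weights only rescale each sample's contribution to the gradient, so nothing special is needed. The following NumPy sketch, with invented values for y_true, y_pred and c, compares the analytic gradient 2(c_i + 1)(yi_hat - yi) against central finite differences. It is a framework-independent check, not Keras code.

```python
import numpy as np


def weighted_sse(y_true, y_pred, c):
    """Sum over samples of (c_i + 1) * (y_pred_i - y_true_i)^2."""
    return np.sum((c + 1.0) * (y_pred - y_true) ** 2)


def weighted_sse_grad(y_true, y_pred, c):
    """Analytic gradient of weighted_sse with respect to y_pred."""
    return 2.0 * (c + 1.0) * (y_pred - y_true)


y_true = np.array([1.0, 0.0, 2.0, -1.0])
y_pred = np.array([0.5, 0.2, 2.5, 0.0])
c = np.array([0.0, 1.0, 1.0, 0.0])  # binary category flag per sample

# Central finite differences as an independent check of the gradient
eps = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(len(y_pred)):
    step = np.zeros_like(y_pred)
    step[i] = eps
    numeric[i] = (weighted_sse(y_true, y_pred + step, c)
                  - weighted_sse(y_true, y_pred - step, c)) / (2 * eps)

print(np.allclose(numeric, weighted_sse_grad(y_true, y_pred, c)))  # True
```

Samples with c_i = 1 simply receive twice the gradient of samples with c_i = 0, which is the intended effect of the weighting.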

Sparse Cross Entropy in Tensorflow

Using tf.nn.sparse_softmax_cross_entropy_with_logits in TensorFlow, it's possible to calculate the loss only for specific rows by setting the class label to -1 (it is otherwise expected to be in the range 0 to numclasses-1).
Unfortunately this breaks the gradient computations (as is mentioned in the comments in the source nn_ops.py).
What I would like to do is something like the following:
raw_classification_output1 = [0,1,0]
raw_classification_output2 = [0,0,1]
classification_output =tf.concat(0,[raw_classification_output1,raw_classification_output2])
classification_labels = [1,-1]
classification_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(classification_output,classification_labels)
total_loss = tf.reduce_sum(classification_loss) + tf.reduce_sum(other_loss)
optimizer = tf.train.GradientDescentOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(total_loss)
changed_grads_and_vars = #do something to 0 the incorrect gradients
optimizer.apply_gradients(changed_grads_and_vars)
What's the most straightforward way to zero those gradients?
The easiest method is to multiply the classification loss by a tensor of the same shape containing 1's where the loss is desired and 0's where it isn't. This is made easier by the fact that the loss is already zero where you don't want it to be updated. It is basically a workaround for the fact that the sparse softmax still shows some weird gradient behavior when the loss is zero.
Add this line after tf.nn.sparse_softmax_cross_entropy_with_logits:
classification_loss_zeroed = tf.mul(classification_loss,tf.to_float(tf.not_equal(classification_loss,0)))
It should also zero out the gradients.
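The masking idea can be sketched outside TensorFlow as well. The NumPy version below is a variation on the answer, not the same technique: it builds the mask from the labels (label == -1 means "ignore") rather than from the loss values, and it replaces ignored labels with a valid index before the lookup. The function name and the ignore_label parameter are assumptions made for this example.

```python
import numpy as np


def masked_sparse_softmax_ce(logits, labels, ignore_label=-1):
    """Per-row sparse softmax cross entropy; rows with ignore_label give 0."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    mask = labels != ignore_label
    # Replace ignored labels with a valid index so the lookup stays in range
    safe_labels = np.where(mask, labels, 0)
    # Stable log-softmax: subtract the row maximum before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_row = -log_probs[np.arange(len(labels)), safe_labels]
    # Multiplying by the mask zeroes the loss of the ignored rows
    return per_row * mask


logits = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
labels = np.array([1, -1])
print(masked_sparse_softmax_ce(logits, labels))  # second entry is 0.0
```

Because the masked rows are multiplied by a constant 0, their contribution to any gradient of the summed loss is also 0 in an autodiff framework that follows the same arithmetic.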
