I was playing around with some code and it behaved differently than I expected, so I boiled it down to a minimal working example:
import torch
test_act = torch.tensor([[2.,0.]])
test_target = torch.tensor([0])
loss_function_test = torch.nn.CrossEntropyLoss()
loss_test = loss_function_test(test_act, test_target)
print(loss_test)
> tensor(0.1269)
weights=torch.tensor([0.1,0.5])
loss_function_test = torch.nn.CrossEntropyLoss(weight=weights)
loss_test = loss_function_test(test_act, test_target)
print(loss_test)
> tensor(0.1269)
As you can see, the outputs are the same regardless of whether weights are present or not, but I would expect the second output to be 0.0127.
Is there some normalization going on that I don't know about? Or is it possibly bugged?
In this example, I add a second datum with a different target class, and the effect of the weights becomes visible:
import torch
test_act = torch.tensor([[2.,1.],[1.,4.]])
test_target = torch.tensor([0,1])
loss_function_test = torch.nn.CrossEntropyLoss()
loss_test = loss_function_test(test_act, test_target)
print(loss_test)
>>> tensor(0.1809)
weights=torch.tensor([0.1,0.5])
loss_function_test = torch.nn.CrossEntropyLoss(weight=weights)
loss_test = loss_function_test(test_act, test_target)
print(loss_test)
>>> tensor(0.0927)
This effect occurs because "The losses are averaged across observations for each minibatch. If the weight argument is specified then this is a weighted average", but only across the minibatch. Concretely, the reduction is loss = sum_i(w[y_i] * l_i) / sum_i(w[y_i]), so when the minibatch contains only one class, the weight appears in both numerator and denominator and cancels out, which is why your first example is unchanged.
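A minimal sketch verifying that reduction by hand (the values match the outputs above):
import torch
import torch.nn.functional as F

test_act = torch.tensor([[2., 1.], [1., 4.]])
test_target = torch.tensor([0, 1])
weights = torch.tensor([0.1, 0.5])

# unweighted per-sample losses
per_sample = F.cross_entropy(test_act, test_target, reduction='none')  # tensor([0.3133, 0.0486])

# weighted average: sum(w[y_i] * l_i) / sum(w[y_i])
w = weights[test_target]
print((w * per_sample).sum() / w.sum())  # tensor(0.0927)

# with a single sample, the weight cancels: (0.1 * 0.1269) / 0.1 = 0.1269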
Personally I find this a bit strange and would think it would be useful to apply the weights globally (i.e. even if all classes are not present in each minibatch). One of the prominent uses of the weight parameter would ostensibly be to give more weight to classes that are under-represented in the dataset, but by this formulation the minority classes are only given higher weights for the minibatches in which they are present (which, of course, is a low percentage because they are a minority class).
In any case, that is how PyTorch defines this operation.
Related
For my problem, I want to predict customer review scores ranging from 1 to 5. I thought it would be good to implement this as a regression problem, because a predicted 1 when 5 is the true value should be a "worse" prediction than a predicted 4. I also want the model to perform roughly equally well across all review score classes. Because my dataset is highly unbalanced, I want to create my own loss function: with a plain MSE implementation, the model will only be good at predicting the majority class. Therefore I want to estimate the MSE for every review class (1, 2, 3, 4 and 5 star reviews) and then take the mean of these five MSE scores. That way the model should be incentivized to also perform well at predicting the smaller classes.
The following code shows my tensorflow loss function:
def custom_loss_function(y_true, y_pred):
    y_true_t = tf.cast(tf.Variable(y_true, validate_shape=False), dtype="float16")
    y_pred_t = tf.cast(tf.Variable(y_pred, validate_shape=False), dtype="float16")
    one = tf.cast(tf.Variable([1]), dtype="float16")
    two = tf.cast(tf.Variable([2]), dtype="float16")
    three = tf.cast(tf.Variable([3]), dtype="float16")
    four = tf.cast(tf.Variable([4]), dtype="float16")
    five = tf.cast(tf.Variable([5]), dtype="float16")
    # indices for a given review class
    i_one = tf.where(tf.equal(y_true_t, one))
    i_two = tf.where(tf.equal(y_true_t, two))
    i_three = tf.where(tf.equal(y_true_t, three))
    i_four = tf.where(tf.equal(y_true_t, four))
    i_five = tf.where(tf.equal(y_true_t, five))
    # predictions at the found indices for each review class
    y_pred_one = tf.gather(y_pred_t, i_one)
    y_pred_two = tf.gather(y_pred_t, i_two)
    y_pred_three = tf.gather(y_pred_t, i_three)
    y_pred_four = tf.gather(y_pred_t, i_four)
    y_pred_five = tf.gather(y_pred_t, i_five)
    y_true_one = tf.gather(y_true_t, i_one)
    y_true_two = tf.gather(y_true_t, i_two)
    y_true_three = tf.gather(y_true_t, i_three)
    y_true_four = tf.gather(y_true_t, i_four)
    y_true_five = tf.gather(y_true_t, i_five)
    # per-class MSEs, then their mean
    mse1 = tf.reduce_mean(tf.cast(tf.square(y_true_one - y_pred_one), dtype="float"))
    mse2 = tf.reduce_mean(tf.cast(tf.square(y_true_two - y_pred_two), dtype="float"))
    mse3 = tf.reduce_mean(tf.cast(tf.square(y_true_three - y_pred_three), dtype="float"))
    mse4 = tf.reduce_mean(tf.cast(tf.square(y_true_four - y_pred_four), dtype="float"))
    mse5 = tf.reduce_mean(tf.cast(tf.square(y_true_five - y_pred_five), dtype="float"))
    mse = (mse1 + mse2 + mse3 + mse4 + mse5) / 5
    return mse
When I try to use this loss function in my NN I get the following message:
ValueError: An operation has None for gradient. Please make sure
that all of your ops have a gradient defined (i.e. are
differentiable). Common ops without gradient: K.argmax, K.round,
K.eval.
I couldn't figure out why this error occurs. I thought all tf operations should be differentiable, but I could also be wrong.
Thanks for any help.
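For reference, a variable-free sketch of the same per-class MSE. The assumption here (not verified against the original setup) is that the None-gradient error comes from wrapping y_true and y_pred in tf.Variable, which disconnects them from the graph, while tf.where and tf.gather themselves are differentiable with respect to the gathered values:
import tensorflow as tf

def per_class_mse_loss(y_true, y_pred):
    # flatten so tf.where returns simple row indices
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    class_mses = []
    for score in (1.0, 2.0, 3.0, 4.0, 5.0):
        idx = tf.where(tf.equal(y_true, score))
        diff = tf.gather(y_true, idx) - tf.gather(y_pred, idx)
        # note: reduce_mean over an absent class yields NaN, so every
        # batch is assumed to contain all five scores
        class_mses.append(tf.reduce_mean(tf.square(diff)))
    return tf.add_n(class_mses) / 5.0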
My question is similar to the one posed here:
keras combining two losses with adjustable weights
However, the outputs have a different dimensionality, so they cannot be concatenated and that solution is not applicable. Is there another way to solve this problem?
The question:
I have a keras functional model with two layers with outputs x1 and x2.
x1 = Dense(1,activation='relu')(prev_inp1)
x2 = Dense(2,activation='relu')(prev_inp2)
I need to use x1 and x2 in a weighted loss function like in the attached image, and propagate the 'same loss' into both branches. Alpha is flexible to vary with iterations.
For this question, a more elaborate solution is necessary. Since we're going to use a trainable weight, we will need a custom layer.
We will also need a different form of training, since our loss doesn't work like the others, which take only y_true and y_pred: ours joins two different outputs.
Thus, we're going to create two versions of the same model, one for prediction and another for training. The training version will contain the loss calculation itself, using a dummy Keras loss function in compilation.
The prediction model
Let's use a very basic example of model with two outputs and one input:
#any input your true model takes
inp = Input((5,5,2))
#represents the localization output
outImg = Conv2D(1,3,activation='sigmoid')(inp)
#represents the classification output
outClass = Flatten()(inp)
outClass = Dense(2,activation='sigmoid')(outClass)
#the model
predictionModel = Model(inp, [outImg,outClass])
You use this one regularly for predictions. It's not necessary to compile this one.
The losses for each branch
Now, let's create custom loss functions for each branch, one for LossCls and another for LossLoc.
Using dummy examples here; you can elaborate these losses further if necessary. The most important thing is that they output batches shaped like (batch, 1) or (batch,). Both output the same shape, so they can be summed later.
def calcImgLoss(x):
    true, pred = x
    loss = binary_crossentropy(true, pred)
    return K.mean(loss, axis=[1,2])

def calcClassLoss(x):
    true, pred = x
    return binary_crossentropy(true, pred)
These will be used in Lambda layers in the training model.
The loss weighting layer - (WARNING! EDITED! - See explanation at the end)
Now, let's weight the losses with the trainable alpha. Trainable parameters need custom layers to be implemented.
class LossWeighter(Layer):
    def __init__(self, **kwargs): #kwargs can have 'name' and other things
        super(LossWeighter, self).__init__(**kwargs)

    #create the trainable weight here, notice the constraint between 0 and 1
    def build(self, inputShape):
        self.weight = self.add_weight(name='loss_weight',
                                      shape=(1,),
                                      initializer=Constant(0.5),
                                      constraint=Between(0,1),
                                      trainable=True)
        super(LossWeighter, self).build(inputShape)

    def call(self, inputs):
        #old answer: will always tend to completely ignore the biggest loss
        #return (self.weight * firstLoss) + ((1-self.weight)*secondLoss)
        #problem: alpha tends to 0 or 1, eliminating the biggest of the two losses

        #proposal of working alpha optimization
        #return K.square((self.weight * firstLoss) - ((1-self.weight)*secondLoss))
        #problem: might not train any of the losses, and even increase one of them
        #in order to minimize the difference between the two losses

        #new answer - a mix between the two, applying gradients to the right weights
        loss1, loss2 = inputs                 #trainable
        static_loss1 = K.stop_gradient(loss1) #non_trainable
        static_loss2 = K.stop_gradient(loss2) #non_trainable

        a1 = self.weight                      #trainable
        a2 = 1 - a1                           #trainable
        static_a1 = K.stop_gradient(a1)       #non_trainable
        static_a2 = 1 - static_a1             #non_trainable

        #this trains only alpha to minimize the difference between both losses
        alpha_loss = K.square((a1 * static_loss1) - (a2 * static_loss2))
        #or K.abs (.....)

        #this trains only the original model weights to minimize both original losses
        model_loss = (static_a1 * loss1) + (static_a2 * loss2)

        return alpha_loss + model_loss

    def compute_output_shape(self, inputShape):
        return inputShape[0]
Notice that there is a custom constraint to keep this weight between 0 and 1. This constraint is implemented with:
class Between(Constraint):
    def __init__(self, min_value, max_value):
        self.min_value = min_value
        self.max_value = max_value

    def __call__(self, w):
        return K.clip(w, self.min_value, self.max_value)

    def get_config(self):
        return {'min_value': self.min_value,
                'max_value': self.max_value}
The training model
This model will take the prediction model as base, add the loss calculations and loss weighter at the end and output only the loss value. Because it outputs only a loss, we will use the true targets as inputs, and a dummy loss function defined like:
def ignoreLoss(true, pred):
    return pred #this just tries to minimize the prediction without any extra computation
Model inputs:
#true targets
trueImg = Input((3,3,1))
trueClass = Input((2,))
#predictions from the prediction model
predImg = predictionModel.outputs[0]
predClass = predictionModel.outputs[1]
Model outputs = losses:
imageLoss = Lambda(calcImgLoss, name='loss_loc')([trueImg, predImg])
classLoss = Lambda(calcClassLoss, name='loss_cls')([trueClass, predClass])
weightedLoss = LossWeighter(name='weighted_loss')([imageLoss,classLoss])
Model:
trainingModel = Model([predictionModel.input, trueImg, trueClass], weightedLoss)
trainingModel.compile(optimizer='sgd', loss=ignoreLoss)
Dummy training
inputImages = np.zeros((7,5,5,2))
outputImages = np.ones((7,3,3,1))
outputClasses = np.ones((7,2))
dummyOut = np.zeros((7,))
trainingModel.fit([inputImages,outputImages,outputClasses], dummyOut, epochs = 50)
predictionModel.predict(inputImages)
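After training, the learned alpha can be retrieved by layer name (a sketch; add_weight created a single weight, so it is the first entry of get_weights):
alpha = trainingModel.get_layer('weighted_loss').get_weights()[0]
print(alpha)  # stays within [0, 1] thanks to the Between constraint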
Necessary imports
from keras.layers import *
from keras.models import Model
from keras.constraints import Constraint
from keras.initializers import Constant
from keras.losses import binary_crossentropy #or another you need
(EDIT) Explaining the problem with the old answer:
The formula used in the old answer would make alpha always go to 0 or 1, meaning only the smaller of the two losses would ever be trained. (Useless)
A new formula leads alpha to make both losses have the same value. Alpha would be trained properly and not tend to 0 or 1. But, still, the losses would not be properly trained because "increasing one loss to reach the other" would be a possibility for the model, and once both losses were equal, the model would stop training.
The new solution is a mix of the two proposals above, while the first actually trains the losses but with wrong alpha; and the second trains alpha with wrong losses. The mixed solution adds both, but uses K.stop_gradient to prevent the wrong part of the training from happening.
The result of this will be that the "easiest" loss (not the biggest) will be trained more than the hardest. We may use K.abs or K.square, analogous to choosing "mae" or "mse" between the two losses; the best option is a matter of experimentation.
See this table comparing the old and new proposals:
This does not guarantee the best optimization though!!!
Training the easiest loss will not always give the best result, though. It may be better than favoring a huge loss just because its formula is different. But the expected result might still require some manual weighting of the losses.
I fear there is no automatic training for this weight. If you have a target metric, you can try to train this metric (when possible; metrics that depend on sorting, getting an index, rounding, or anything else that breaks backpropagation may not be transformable into losses).
There is no need to concatenate your outputs. To pass multiple arguments to a loss function, you can wrap it as follows:
def custom_loss(x1, x2, y1, y2, alpha):
    def loss(y_true, y_pred):
        return (1-alpha) * loss_cls(y1, x1) + alpha * loss_loc(y2, x2)
    return loss
And then compile your functional model as:
x1 = Dense(1, activation='relu')(prev_inp1)
x2 = Dense(2, activation='relu')(prev_inp2)
y1 = Input((1,))
y2 = Input((2,))
model.compile('sgd',
              loss=custom_loss(x1, x2, y1, y2, 0.5),
              target_tensors=[y1, y2])
NOTE: Not tested.
I'm using the implementation found in http://adventuresinmachinelearning.com/word2vec-keras-tutorial/ to learn something about word2vec. What I am not understanding is why the loss function isn't decreasing:
Iteration 119200, loss=0.7305528521537781
Iteration 119300, loss=0.6254740953445435
Iteration 119400, loss=0.8255964517593384
Iteration 119500, loss=0.7267132997512817
Iteration 119600, loss=0.7213149666786194
Iteration 119700, loss=0.6156617999076843
Iteration 119800, loss=0.11473365128040314
Iteration 119900, loss=0.6617216467857361
The net, from my understanding, is a standard one used in this task:
input_target = Input((1,))
input_context = Input((1,))
embedding = Embedding(vocab_size, vector_dim, input_length=1, name='embedding')
target = embedding(input_target)
target = Reshape((vector_dim, 1))(target)
context = embedding(input_context)
context = Reshape((vector_dim, 1))(context)
dot_product = Dot(axes=1)([target, context])
dot_product = Reshape((1,))(dot_product)
output = Dense(1, activation='sigmoid')(dot_product)
model = Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop') #adam??
Words come from a vocabulary of size 10000 from http://mattmahoney.net/dc/text8.zip (English text).
What I notice is that some words are somewhat learned over time, e.g. the context for numbers and articles is easily guessed, yet the loss has been stuck around 0.7 from the beginning, and as the iterations go on it only fluctuates randomly.
The training part is done like this (which I find strange, given the absence of the standard fit method):
arr_1 = np.zeros((1,))
arr_2 = np.zeros((1,))
arr_3 = np.zeros((1,))
for cnt in range(epochs):
    idx = np.random.randint(0, len(labels)-1)
    arr_1[0,] = word_target[idx]
    arr_2[0,] = word_context[idx]
    arr_3[0,] = labels[idx]
    loss = model.train_on_batch([arr_1, arr_2], arr_3)
    if cnt % 100 == 0:
        print("Iteration {}, loss={}".format(cnt, loss))
Am I missing something important about this type of net? Everything not shown is implemented exactly as in the link above.
I followed the same tutorial, and the loss drops once the algorithm has gone through a sample again. Note that the loss function is calculated only for the current target and context word pair. In the code example from the tutorial, one "epoch" is only one sample, so you would need more iterations than there are target and context pairs before the loss starts to drop.
I implemented the training part with the following line
model.fit([word_target, word_context], labels, epochs=5)
Be warned that this can take a long time depending on how large the corpus is. The train_on_batch function gives you more control in training: you can vary the batch size or select specific samples at every step of the training.
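For instance, a sketch of mini-batch training with train_on_batch (assuming word_target, word_context and labels are NumPy arrays as in the tutorial; the batch size is an arbitrary choice):
import numpy as np

batch_size = 32
for step in range(10000):
    # sample a random mini-batch instead of a single pair
    idx = np.random.randint(0, len(labels), size=batch_size)
    loss = model.train_on_batch([word_target[idx], word_context[idx]], labels[idx])
    if step % 100 == 0:
        print("Step {}, loss={}".format(step, loss))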
I have written the following binary classification program in tensorflow, and it is buggy: the cost is always returning zero, no matter what the input is. I am trying to debug a larger program which does not learn anything from the data, and I have narrowed down at least one bug to the cost function always returning zero. The given program uses some random inputs and has the same problem. self.X_train and self.y_train are originally supposed to be read from files, and the function self.predict() has more layers, forming a feedforward neural network.
import numpy as np
import tensorflow as tf

class annClassifier():

    def __init__(self):
        with tf.variable_scope("Input"):
            self.X = tf.placeholder(tf.float32, shape=(100, 11))
        with tf.variable_scope("Output"):
            self.y = tf.placeholder(tf.float32, shape=(100, 1))
        self.X_train = np.random.rand(100, 11)
        self.y_train = np.random.randint(0, 2, size=(100, 1))

    def predict(self):
        with tf.variable_scope('OutputLayer'):
            weights = tf.get_variable(name='weights',
                                      shape=[11, 1],
                                      initializer=tf.contrib.layers.xavier_initializer())
            bases = tf.get_variable(name='bases',
                                    shape=[1],
                                    initializer=tf.zeros_initializer())
            final_output = tf.matmul(self.X, weights) + bases
        return final_output

    def train(self):
        prediction = self.predict()
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=self.y))
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            print(sess.run(cost, feed_dict={self.X: self.X_train, self.y: self.y_train}))

with tf.Graph().as_default():
    classifier = annClassifier()
    classifier.train()
If someone could please figure out what I am doing wrong in this, I can try making the same change in my original program. Thanks a lot!
The only problem is the invalid cost function used. softmax_cross_entropy_with_logits should only be used if you have more than two classes, as the softmax of a single output always returns 1, since it is defined as:
softmax(x)_i = exp(x_i) / SUM_j exp(x_j)
so for a single number (one dimensional output)
softmax(x) = exp(x) / exp(x) = 1
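A quick numerical check of that claim (a sketch, using the TF 1.x API as in the question):
import tensorflow as tf

with tf.Session() as sess:
    # softmax over a single-logit axis is always 1, whatever the value
    print(sess.run(tf.nn.softmax([[2.0], [-5.0]])))  # [[1.], [1.]]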
Furthermore, for softmax output TF expects one-hot encoded labels, so if you provide only 0 or 1, there are two possibilities:
True label is 0, so the cost is -0*log(1) = 0
True label is 1, so the cost is -1*log(1) = 0
Tensorflow has a separate function to handle binary classification, which applies a sigmoid instead (note that the same function for more than one output would apply the sigmoid independently to each dimension, which is what multi-label classification would expect):
tf.nn.sigmoid_cross_entropy_with_logits
Just switch to this cost and you are good to go. You also no longer have to encode anything as one-hot, as this function is designed solely for your use case.
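In your code, that is a one-line change (sketch):
cost = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=prediction, labels=self.y))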
The only missing bit is that your code does not have an actual training routine: you need to define an optimiser, ask it to minimise the loss, and then run a train op in a loop. In your current setting, you just try to predict over and over, with a network that never changes.
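A minimal sketch of such a routine, to go inside train() (the optimizer choice and learning rate are arbitrary):
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
train_op = optimizer.minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        _, c = sess.run([train_op, cost],
                        feed_dict={self.X: self.X_train, self.y: self.y_train})
        if step % 100 == 0:
            print("step {}: cost {}".format(step, c))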
In particular, please refer to the Cross Entropy Jungle question on SO, which provides a more detailed description of all these different helper functions in TF (and other libraries), which have different requirements/use cases.
softmax_cross_entropy_with_logits is basically a numerically stable implementation of these two parts:
softmax = tf.nn.softmax(prediction)
cost = -tf.reduce_mean(labels * tf.log(softmax), 1)
Now in your example, prediction is a single value, so when you apply softmax to it, it is always going to be 1 irrespective of the value (exp(prediction)/exp(prediction) = 1), and so the tf.log(softmax) term becomes 0. That's why you always get a cost of zero.
Either apply a sigmoid to get your probabilities between 0 and 1, or, if you want to use softmax, encode the labels as [1, 0] for class 0 and [0, 1] for class 1.
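A sketch of the softmax alternative, widening the output layer to two units and one-hot encoding the labels (standalone; the variable names are illustrative):
import numpy as np
import tensorflow as tf

X = tf.placeholder(tf.float32, shape=(100, 11))
y = tf.placeholder(tf.float32, shape=(100, 2))   # one-hot: [1, 0] or [0, 1]

W = tf.get_variable('W', shape=[11, 2],
                    initializer=tf.contrib.layers.xavier_initializer())
b = tf.get_variable('b', shape=[2], initializer=tf.zeros_initializer())
logits = tf.matmul(X, W) + b                     # two scores per sample

cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))

y_onehot = np.eye(2)[np.random.randint(0, 2, size=100)]  # one-hot encode
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(cost, feed_dict={X: np.random.rand(100, 11), y: y_onehot}))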
I'm trying to train a network with unbalanced data. I have A (198 samples), B (436 samples), C (710 samples) and D (272 samples), and I have read about "weighted_cross_entropy_with_logits", but all the examples I found are for binary classification, so I'm not very confident in how to set those weights.
Total samples: 1616
A_weight: 198/1616 = 0.12?
The idea behind this, if I understood correctly, is to penalize the errors of the majority class and to value more positively the hits in the minority one, right?
My piece of code:
weights = tf.constant([0.12, 0.26, 0.43, 0.17])
cost = tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(logits=pred, targets=y, pos_weight=weights))
I have read this one and other examples with binary classification, but it is still not very clear.
Note that weighted_cross_entropy_with_logits is the weighted variant of sigmoid_cross_entropy_with_logits. Sigmoid cross entropy is typically used for binary classification. Yes, it can handle multiple labels, but sigmoid cross entropy basically makes a (binary) decision on each of them -- for example, for a face recognition net, those (not mutually exclusive) labels could be "Does the subject wear glasses?", "Is the subject female?", etc.
In binary classification(s), each output channel corresponds to a binary (soft) decision. Therefore, the weighting needs to happen within the computation of the loss. This is what weighted_cross_entropy_with_logits does, by weighting one term of the cross-entropy over the other.
In mutually exclusive multiclass classification, we use softmax_cross_entropy_with_logits, which behaves differently: each output channel corresponds to the score of a class candidate. The decision comes after, by comparing the respective outputs of each channel.
Weighting before the final decision is therefore a simple matter of modifying the scores before comparing them, typically by multiplication with weights. For example, for a ternary classification task:
# your class weights
class_weights = tf.constant([[1.0, 2.0, 3.0]])
# deduce weights for batch samples based on their true label
weights = tf.reduce_sum(class_weights * onehot_labels, axis=1)
# compute your (unweighted) softmax cross entropy loss
unweighted_losses = tf.nn.softmax_cross_entropy_with_logits(labels=onehot_labels, logits=logits)
# apply the weights, relying on broadcasting of the multiplication
weighted_losses = unweighted_losses * weights
# reduce the result to get your final loss
loss = tf.reduce_mean(weighted_losses)
You could also rely on tf.losses.softmax_cross_entropy to handle the last three steps.
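Condensed, that could look like this (a sketch; note that the default reduction of tf.losses.softmax_cross_entropy, SUM_BY_NONZERO_WEIGHTS, is close to but not identical to the plain mean above):
loss = tf.losses.softmax_cross_entropy(onehot_labels, logits, weights=weights)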
In your case, where you need to tackle data imbalance, the class weights could indeed be inversely proportional to their frequency in your train data. Normalizing them so that they sum up to one or to the number of classes also makes sense.
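For the counts in the question, that could look like this (a sketch; normalizing to sum to the number of classes is one of several reasonable choices):
import numpy as np

counts = np.array([198., 436., 710., 272.])              # A, B, C, D
inv_freq = 1.0 / counts
class_weights = inv_freq / inv_freq.sum() * len(counts)  # sums to 4
print(class_weights)                                     # approx. [1.63, 0.74, 0.45, 1.18]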
Note that in the above, we penalized the loss based on the true label of the samples. We could also have penalized the loss based on the estimated labels by simply defining
weights = class_weights
and the rest of the code need not change thanks to broadcasting magic.
In the general case, you would want weights that depend on the kind of error you make. In other words, for each pair of labels X and Y, you could choose how to penalize choosing label X when the true label is Y. You end up with a whole prior weight matrix, which results in weights above being a full (num_samples, num_classes) tensor. This goes a bit beyond what you want, but it might be useful to know nonetheless that only your definition of the weight tensor needs to change in the code above.
See this answer for an alternate solution which works with sparse_softmax_cross_entropy:
import tensorflow as tf
import numpy as np
np.random.seed(123)
sess = tf.InteractiveSession()
# let's say we have the logits and labels of a batch of size 6 with 5 classes
logits = tf.constant(np.random.randint(0, 10, 30).reshape(6, 5), dtype=tf.float32)
labels = tf.constant(np.random.randint(0, 5, 6), dtype=tf.int32)
# specify some class weightings
class_weights = tf.constant([0.3, 0.1, 0.2, 0.3, 0.1])
# specify the weights for each sample in the batch (without having to compute the onehot label matrix)
weights = tf.gather(class_weights, labels)
# compute the loss
tf.losses.sparse_softmax_cross_entropy(labels, logits, weights).eval()
Tensorflow 2.0 Compatible Answer: Migrating the Code specified in P-Gn's Answer to 2.0, for the benefit of the community.
# your class weights
class_weights = tf.compat.v2.constant([[1.0, 2.0, 3.0]])
# deduce weights for batch samples based on their true label
weights = tf.compat.v2.reduce_sum(class_weights * onehot_labels, axis=1)
# compute your (unweighted) softmax cross entropy loss
unweighted_losses = tf.compat.v2.nn.softmax_cross_entropy_with_logits(onehot_labels, logits)
# apply the weights, relying on broadcasting of the multiplication
weighted_losses = unweighted_losses * weights
# reduce the result to get your final loss
loss = tf.reduce_mean(weighted_losses)
For more information about migrating code from Tensorflow version 1.x to 2.x, please refer to this Migration Guide.