I'm trying to wrap my head around the categorical cross entropy loss. Looking at the implementation of the cross entropy loss in Keras:
# scale preds so that the class probas of each sample sum to 1
output = output / math_ops.reduce_sum(output, axis, True)
# Compute cross entropy from probabilities.
epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
return -math_ops.reduce_sum(target * math_ops.log(output), axis)
I do not see where delta = output - target is calculated.
See here.
What am I missing?
I think you might be confusing two different concepts / events here.
The categorical cross entropy loss is a measure of the error of your model, as calculated by :
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
<etc>
This just returns one loss value per sample. It measures how far the predicted class probabilities are from the true label; it is not a simple difference between the two.
The next step after calculating the loss (part of the forward propagation phase) is to then start backpropagation, i.e. we want to find the influence that each weight/bias matrix has on the loss you've calculated above, so that we can perform the update step.
The first step is then to calculate dL/dz, i.e. the derivative of the loss function with respect to the linear function (z = Wx + b), which is itself the product dL/da * da/dz (i.e. the derivative of the loss wrt the activation times the derivative of the activation wrt the linear function).
The delta in the link you posted belongs to this backpropagation step, not to the loss computation: for a softmax output combined with cross entropy, the product dL/da * da/dz simplifies to output - target. This blog does a decent job of explaining how all the parts fit together; the activation function it uses is a sigmoid, but the way the pieces fit together is the same.
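To make the backpropagation step concrete, here is a minimal pure-Python sketch (not the Keras code) that checks numerically that the gradient of softmax followed by cross entropy, taken with respect to the logits, equals output - target. The logits and the one-hot label are made-up values, and the helper names are illustrative only.

```python
import math

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    # Loss for one sample: -sum_i target_i * log(softmax(z)_i)
    probs = softmax(z)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

def numerical_grad(z, target, h=1e-6):
    # Central finite differences of the loss with respect to each logit.
    grad = []
    for i in range(len(z)):
        up = list(z)
        up[i] += h
        down = list(z)
        down[i] -= h
        grad.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * h))
    return grad

z = [2.0, -1.0, 0.5]        # hypothetical logits
target = [0.0, 0.0, 1.0]    # hypothetical one-hot label

analytic = [p - t for p, t in zip(softmax(z), target)]  # output - target
numeric = numerical_grad(z, target)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic)
print(numeric)
```

The two printed vectors should agree up to finite-difference error, which is why the loss function itself never needs to compute the delta: it only appears once the loss is differentiated.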
I have a question surrounding a pretty complex loss function I have.
This is a variational autoencoder loss function and it is fairly complex. It is made of two reconstruction losses, KL divergence and a discriminator as a regularizer. All of those losses are on the same scale, but I have found out that increasing one of the reconstruction losses by a factor of 20 (while leaving the rest on the previous scale) heavily increases the performance of my model.
Since I am still a novice in DL, I don't completely understand why this happens, or how I could identify this sort of thing in subsequent models.
Any advice/explanation is greatly appreciated.
To summarize your setting first:
loss = alpha1 * loss1 + alpha2 * loss2
When computing the gradients for backpropagation, we compute back through this formula. By backpropagating through our error function we get the gradient:
dError/dLoss
To continue our propagation downwards, we now want to compute dError/dLoss1 and dError/dLoss2.
dError/dLoss1 can be expanded to dError/dLoss * dLoss/dLoss1 via the chain rule (https://en.wikipedia.org/wiki/Chain_rule).
We already computed dError/dLoss, so we only need the derivative of Loss with respect to Loss1, which is
dLoss/dLoss1 = alpha1
The backpropagation now continues until we reach our weights (dLoss1/dWeight). The gradient our weight receives is:
dError/dWeight = dError/dLoss * dLoss/dLoss1 * dLoss1/dWeight = dError/dLoss * alpha1 * dLoss1/dWeight
As you can see, the gradient used to update our weight does now depend on alpha1, the factor we use to scale Loss1.
If we increase alpha1 while not changing alpha2, the gradients coming from Loss1 will have a larger impact than the gradients of Loss2, which changes the optimization of our model.
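The effect can be seen on a toy example. The sketch below assumes two made-up losses of a single weight w (loss1 = (w*x - 1)^2 and loss2 = w^2, neither taken from the question) and returns the two gradient contributions that reach the weight.

```python
def weight_gradient(w, x, alpha1, alpha2):
    # Hypothetical toy losses of a single weight w:
    #   loss1 = (w * x - 1)^2   and   loss2 = w^2
    d_loss1_dw = 2 * (w * x - 1) * x
    d_loss2_dw = 2 * w
    # Chain rule: each contribution is scaled by dLoss/dLoss_i = alpha_i.
    return alpha1 * d_loss1_dw, alpha2 * d_loss2_dw

base = weight_gradient(w=0.5, x=3.0, alpha1=1.0, alpha2=1.0)
scaled = weight_gradient(w=0.5, x=3.0, alpha1=20.0, alpha2=1.0)
print(base)    # contributions with equal weighting
print(scaled)  # contribution of loss1 is 20 times larger, loss2 unchanged
```

With alpha1 = 20 the update direction is dominated by loss1, so the optimizer effectively prioritizes that term over the others.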
I have implemented a neural network in Tensorflow where the last layer is a convolution layer. I feed the output of this convolution layer into a softmax activation function, then feed that, along with the labels, into the cross-entropy loss function defined below. The problem is that I get NaN as the output of my loss function, and I figured out it is because I have 1 in the output of the softmax. So, my question is what should I do in this case?
My input is a 16 by 16 image where I have 0 and 1 as the values of each pixel (binary classification)
My loss function:
#Loss function
def loss(prediction, label):
    #with tf.variable_scope("Loss") as Loss_scope:
    log_pred = tf.log(prediction, name='Prediction_Log')
    log_pred_2 = tf.log(1-prediction, name='1-Prediction_Log')
    cross_entropy = -tf.multiply(label, log_pred) - tf.multiply((1-label), log_pred_2)
    return cross_entropy
Note that log(0) is undefined (tf.log returns -inf for it, and multiplying -inf by a zero label gives NaN), so if ever prediction==0 or prediction==1 you will have a NaN.
In order to get around this it is commonplace to add a very small value epsilon to the value passed to tf.log in any loss function (we also do a similar thing when dividing to avoid dividing by zero). This makes our loss function numerically stable and the epsilon value is small enough to be negligible in terms of any inaccuracy it introduces to our loss.
Perhaps try something like:
#Loss function
def loss(prediction, label):
    #with tf.variable_scope("Loss") as Loss_scope:
    epsilon = tf.constant(0.000001)
    log_pred = tf.log(prediction + epsilon, name='Prediction_Log')
    log_pred_2 = tf.log(1-prediction + epsilon, name='1-Prediction_Log')
    cross_entropy = -tf.multiply(label, log_pred) - tf.multiply((1-label), log_pred_2)
    return cross_entropy
UPDATE:
As jdehesa points out in his comments, though, the 'out of the box' loss functions already handle the numerical stability issue nicely.
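For a quick check outside of TensorFlow, the same epsilon trick can be sketched in plain Python for a single prediction. This is only an illustration of the idea, not the TensorFlow implementation; the function name and the test values are made up.

```python
import math

def binary_cross_entropy(prediction, label, epsilon=1e-6):
    # Adding epsilon keeps both logarithms away from log(0).
    log_pred = math.log(prediction + epsilon)
    log_pred_2 = math.log(1.0 - prediction + epsilon)
    return -label * log_pred - (1.0 - label) * log_pred_2

# A saturated prediction of exactly 1.0, once with a matching label
# and once with the opposite label.
saturated_correct = binary_cross_entropy(1.0, 1.0)
saturated_wrong = binary_cross_entropy(1.0, 0.0)
print(saturated_correct, saturated_wrong)
```

Both values are finite: the correct case is off from zero by roughly epsilon, and the wrong case is capped at about -log(epsilon) instead of being infinite.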
The code for the loss function in scikit-learn logistic regression is:
# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)
However, it seems to be different from the common form of the logarithmic loss function, which reads:
-(y log(p) + (1-y) log(1-p))
(please see http://wiki.fast.ai/index.php/Log_Loss)
Could anyone tell me how to understand the code for the loss function in scikit-learn logistic regression, and what the relation is between it and the general form of the logarithmic loss function?
Thank you in advance.
First you should note that 0.5 * alpha * np.dot(w, w) is just an L2 regularization term. So, ignoring it, the sklearn logistic regression loss reduces to the following
-np.sum(sample_weight * log_logistic(yz))
Also, the np.sum is there because the loss considers multiple samples, so for a single sample it again reduces to
-sample_weight * log_logistic(yz)
Finally if you read HERE, you note that sample_weight is an optional array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. So, it should be equal to one (as in the original definition of cross entropy loss we do not consider unequal weight for different samples), hence the loss reduces to:
- log_logistic(yz)
which is equivalent to
- log_logistic(y * np.dot(X, w)).
Now, why does it look different from the cross entropy loss function (in essence it is the same), i.e.:
- [y log(p) + (1-y) log(1-p)].
The reason is, we can use either of two different labeling conventions for binary classification, either using {0, 1} or {-1, 1}, which results in the two different representations. But they are the same!
More details (on why they are the same) can be found HERE. Note that you should read the response by Manuel Morales.
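The equivalence of the two labeling conventions can also be checked numerically. In the sketch below, log_logistic is a simplified local stand-in for the scikit-learn helper of the same name (not the library's own implementation), and z is a made-up value standing for np.dot(X, w) of one sample.

```python
import math

def log_logistic(t):
    # log(1 / (1 + exp(-t))), the log of the logistic (sigmoid) function.
    return -math.log1p(math.exp(-t))

def loss_pm1(y_pm1, z):
    # Form with labels in {-1, +1}: -log_logistic(y * z)
    return -log_logistic(y_pm1 * z)

def loss_01(y01, z):
    # Cross entropy form with labels in {0, 1}, where p = sigmoid(z).
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y01 * math.log(p) + (1 - y01) * math.log(1 - p))

z = 0.7  # hypothetical linear score for one sample

# Map each {0, 1} label to {-1, +1} via 2*y - 1 and compare both forms.
diffs = [abs(loss_01(y01, z) - loss_pm1(2 * y01 - 1, z)) for y01 in (0, 1)]
print(diffs)
```

The differences are zero up to floating-point rounding, which is the sense in which the two representations are the same loss.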
I am new to tensorflow
In a part of a code for a tensorflow session, there is :
loss = tf.nn.softmax_cross_entropy_with_logits_v2(
logits=net, labels=self.out_placeholder, name='cross_entropy')
self.loss = tf.reduce_mean(loss, name='mean_squared_error')
I want to use mean_squared_error loss function for this purpose. I found this loss function in tensorflow website:
tf.losses.mean_squared_error(
labels,
predictions,
weights=1.0,
scope=None,
loss_collection=tf.GraphKeys.LOSSES,
reduction=Reduction.SUM_BY_NONZERO_WEIGHTS
)
I need this loss function for a regression problem.
I tried:
loss = tf.losses.mean_squared_error(predictions=net, labels=self.out_placeholder)
self.loss = tf.reduce_mean(loss, name='mean_squared_error')
Where net = tf.matmul(input_tensor, weights) + biases
However, I'm not sure if it's the correct way.
First of all, keep in mind that cross-entropy is mainly used for classification, while MSE is used for regression.
In your case, cross entropy measures the difference between two distributions (the real occurrences, called labels, and your predictions).
So while the first loss function works on the result of the softmax layer (which can be seen as a probability distribution), the second one works directly on the floating point output of your network (which is not a probability distribution). Therefore they cannot simply be exchanged.
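As a minimal illustration of what MSE computes on raw (non-probability) outputs, here is a plain-Python sketch. The function is a local stand-in and not tf.losses.mean_squared_error, and the numbers are made up.

```python
def mean_squared_error(labels, predictions):
    # Average of the squared differences over all elements.
    squared = [(l - p) ** 2 for l, p in zip(labels, predictions)]
    return sum(squared) / len(squared)

labels = [1.0, 2.0]        # hypothetical regression targets
predictions = [1.5, 1.0]   # hypothetical raw network outputs (no softmax)
mse = mean_squared_error(labels, predictions)
print(mse)
```

The inputs are arbitrary real numbers, with no requirement that they sum to 1 as a softmax output would.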
I defined a custom loss function in Keras (tensorflow backend) that consists of the reconstruction MSE and the Kullback-Leibler divergence between the learned probability distribution and a standard normal distribution. (It is for a variational autoencoder.)
I want to be able to slowly increase how much the cost is affected by the KL divergence term during training, with a weight called "reg", starting at reg=0.0 and increasing until it gets to 1.0. I would like the rate of increase to be tuned as a hyperparameter. (As of now, I just have the "reg" parameter set constant at 0.5.)
Is there functionality in Keras to do this?
def vae_loss(y_true,y_pred):
    reg = 0.5
    # Reconstruction MSE, averaged over all words in a sequence
    reconstruction_loss = tf.reduce_mean(mean_squared_error(y_true, y_pred),1)
    # Second part of the loss ensures the z probability distribution doesn't stray too far from normal
    KL_divergence_loss = tf.reduce_mean(tf.log(z_sigma) + tf.div((1 + tf.square(z_mu)),2*tf.square(z_sigma)) - 0.5,1)
    loss = reconstruction_loss + tf.multiply(reg,KL_divergence_loss)
    return loss
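The schedule described above (reg starting at 0.0 and rising to 1.0 at a tunable rate) can be sketched independently of Keras as a function of the epoch index. The function name kl_weight and the linear shape of the ramp are assumptions made for illustration; how the value is fed into the loss during training (for example by updating a variable at the start of each epoch) is not shown here.

```python
def kl_weight(epoch, rate=0.1):
    # Linear ramp: 0.0 at epoch 0, growing by `rate` per epoch, capped at 1.0.
    # `rate` is the hyperparameter controlling how fast the KL term is phased in.
    return min(1.0, rate * epoch)

# Example: with rate=0.25 the weight reaches 1.0 after four epochs.
schedule = [kl_weight(epoch, rate=0.25) for epoch in range(6)]
print(schedule)
```

The value returned for the current epoch would take the place of the constant reg = 0.5 in the loss.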