I'm using LightGBM and I need to implement a loss function that, during training, penalizes predictions that are lower than the target. In other words, I assume that underestimates are much worse than overestimates. I've found this suggestion, which does exactly the opposite:
import numpy as np

def custom_asymmetric_train(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    # residual < 0 (prediction above target, i.e. overestimate) gets the 10x weight
    grad = np.where(residual < 0, -2 * 10.0 * residual, -2 * residual)
    hess = np.where(residual < 0, 2 * 10.0, 2.0)
    return grad, hess
def custom_asymmetric_valid(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    loss = np.where(residual < 0, (residual**2) * 10.0, residual**2)
    return "custom_asymmetric_eval", np.mean(loss), False
(from https://towardsdatascience.com/custom-loss-functions-for-gradient-boosting-f79c1b40466d)
How can I modify it for my purpose?
I believe this function is where you want to make a change.
def custom_asymmetric_valid(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    loss = np.where(residual < 0, (residual**2) * 10.0, residual**2)
    return "custom_asymmetric_eval", np.mean(loss), False
The line where the loss is computed contains a comparison.
loss = np.where(residual < 0, (residual**2)*10.0, residual**2)
When residual is less than 0, the loss is residual^2 * 10,
whereas when it is at or above 0, the loss is just residual^2.
So if we change this less-than to a greater-than, we flip the skew:
loss = np.where(residual > 0, (residual**2)*10.0, residual**2)
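One thing worth noting: the line above only flips the eval metric, while LightGBM optimizes custom_asymmetric_train, so the same flip presumably belongs in the training objective too. A minimal sketch, keeping the 10x factor from the original:

import numpy as np

def custom_asymmetric_train(y_true, y_pred):
    residual = (y_true - y_pred).astype("float")
    # residual > 0 (prediction below target, i.e. underestimate) now gets the 10x weight
    grad = np.where(residual > 0, -2 * 10.0 * residual, -2 * residual)
    hess = np.where(residual > 0, 2 * 10.0, 2.0)
    return grad, hess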
I think this would be helpful. It originates from Custom loss function with Keras to penalise more negative prediction:
import keras.backend as K

def customLoss(true, pred):
    diff = pred - true
    greater = K.greater(diff, 0)
    greater = K.cast(greater, K.floatx())  # 0 for lower, 1 for greater
    greater = greater + 1                  # 1 for lower, 2 for greater
    # use some kind of loss here, such as mse or mae, or pick one from keras
    # using mse:
    return K.mean(greater * K.square(diff))

model.compile(optimizer='adam', loss=customLoss)
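For the original question (penalizing underestimates rather than overestimates), the comparison would flip; a hypothetical variant of the same idea:

import keras.backend as K

def customLossUnder(true, pred):
    diff = pred - true
    lower = K.cast(K.less(diff, 0), K.floatx())  # 1 where pred < true (underestimate)
    weight = lower + 1                           # 2 for underestimates, 1 otherwise
    return K.mean(weight * K.square(diff))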
I have a custom loss function using keras:
import keras.backend as K

def IoU(y_true, y_pred, eps=1e-6):
    if K.max(y_true) == 0.0:
        return IoU(1 - y_true, 1 - y_pred)  # empty image; calc IoU of zeros
    intersection = K.sum(y_true * y_pred, axis=[1, 2, 3])
    union = K.sum(y_true, axis=[1, 2, 3]) + K.sum(y_pred, axis=[1, 2, 3]) - intersection
    return -K.mean((intersection + eps) / (union + eps), axis=0)
This results in an endless recursion, since if K.max(y_true) == 0.0: always evaluates to true. Why is this the case? Do I need to extract a single value out of the output of K.max? I tried converting y_true to a numpy array and using np.max instead but this was not easily possible.
Or does 1 - y_true not work the way numpy arrays would work?
Edit: y_true and y_pred are both tensors with shape:
Tensor("IoU/Shape:0", shape=(4,), dtype=int32). y_true is mostly filled with zeros, but some non zero values are present.
I'm trying to implement the categorical cross-entropy loss function to better understand the intuition behind it.
So far my implementation looks like this:
import numpy as np

# Observations
y_true = np.array([[0, 1, 0], [0, 0, 1]])
y_pred = np.array([[0.05, 0.95, 0.05], [0.1, 0.8, 0.1]])

# Loss calculations
def categorical_loss():
    loss1 = -(0.0 * np.log(0.05) + 1.0 * np.log(0.95) + 0.0 * np.log(0.05))
    loss2 = -(0.0 * np.log(0.1) + 0.0 * np.log(0.8) + 1.0 * np.log(0.1))
    loss = (loss1 + loss2) / 2  # divided by 2 because there are 2 observations
    return loss

# Show loss
print(categorical_loss())  # 1.176939193690798
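For reference, a vectorized sketch of the same calculation (my own, not part of the original question):

import numpy as np

y_true = np.array([[0, 1, 0], [0, 0, 1]])
y_pred = np.array([[0.05, 0.95, 0.05], [0.1, 0.8, 0.1]])

# -sum over the class axis, then mean over the 2 observations
print(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))  # 1.176939193690798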
However, I do not understand how the function should behave to return the correct value when:
at least one number from y_pred is 0 or 1, because then the log function returns -inf or 0; what should the code implementation look like in this case?
at least one number from y_true is 0, because multiplication by 0 always returns 0, so the value of np.log(0.95) is discarded; what should the code implementation look like in this case as well?
Regarding y_pred being 0 or 1, digging into the Keras backend source code for both binary_crossentropy and categorical_crossentropy, we get:
def binary_crossentropy(target, output, from_logits=False):
    if not from_logits:
        output = np.clip(output, 1e-7, 1 - 1e-7)
        output = np.log(output / (1 - output))
    return (target * -np.log(sigmoid(output)) +
            (1 - target) * -np.log(1 - sigmoid(output)))

def categorical_crossentropy(target, output, from_logits=False):
    if from_logits:
        output = softmax(output)
    else:
        output /= output.sum(axis=-1, keepdims=True)
        output = np.clip(output, 1e-7, 1 - 1e-7)
    return np.sum(target * -np.log(output), axis=-1, keepdims=False)
from where you can clearly see that, in both functions, there is a clipping operation of the output (i.e. predictions), in order to avoid infinities from the logarithms:
output = np.clip(output, 1e-7, 1 - 1e-7)
So, here y_pred will never be exactly 0 or 1 in the underlying calculations. The handling is similar in other frameworks.
Regarding y_true being 0, there is not any issue involved - the respective terms are set to 0, as they should be according to the mathematical definition.
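A quick sketch (my own, in plain numpy) of what goes wrong without the clip:

import numpy as np

# An exact 0 in y_pred makes the corresponding term 0 * -log(0) = 0 * inf = nan,
# which poisons the sum.
y_true = np.array([0., 1., 0.])
y_pred = np.array([0., 1., 0.])

print(np.sum(y_true * -np.log(y_pred)))   # nan (plus a divide-by-zero warning)
clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
print(np.sum(y_true * -np.log(clipped)))  # ~1e-7, effectively 0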
After answering this question, there are some interesting but confusing findings I ran into in TensorFlow 2.0. The gradients of the logits look incorrect to me. Let's say we have logits and labels here.
import tensorflow as tf

logits = tf.Variable([[0.8, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]], dtype=tf.float32)

with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits,
                                                                  from_logits=False))
grads = tape.gradient(loss, logits)
print(grads)  # [[-0.25  1.    1.  ]]
Since logits is already a probability distribution, I set from_logits=False in the loss function.
I thought TensorFlow would use loss = -sum_i p_i * log(q_i) to calculate the loss, and differentiating with respect to q_i gives the derivative -p_i/q_i. So the expected grads should be [-1.25, 0, 0]. However, TensorFlow returns [-0.25, 1, 1].
After reading the source code of tf.keras.losses.categorical_crossentropy, I found that even though we set from_logits=False, it still normalizes the probabilities. That changes the final gradient expression, specifically to -p_i/q_i + p_i/sum_j(q_j). If p_i = 1 and sum_j(q_j) = 1, the final gradient is the original plus one. That's why the gradient is -0.25; however, I haven't figured out why the last two gradients are 1.
To prove that all gradients are increased by 1/sum_j(q_j), I made up logits that are not a probability distribution, and still set from_logits=False.
logits = tf.Variable([[0.5, 0.1, 0.1]], dtype=tf.float32)
labels = tf.constant([[1, 0, 0]], dtype=tf.float32)

with tf.GradientTape(persistent=True) as tape:
    loss = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits,
                                                                  from_logits=False))
grads = tape.gradient(loss, logits)
print(grads)  # [[-0.57142866  1.4285713   1.4285713 ]]
The grads returned by TensorFlow are [-0.57142866, 1.4285713, 1.4285713], which I thought should be [-2, 0, 0].
It shows that all gradients are increased by 1/(0.5+0.1+0.1). For p_i == 1, the gradient being increased by 1/(0.5+0.1+0.1) makes sense to me. But I don't understand why, for p_i == 0, the gradient is still increased by 1/(0.5+0.1+0.1).
Update
Thanks to @OverLordGoldDragon's kind reminder. After normalizing the probs, the correct gradient formula is -p_i/q_i + 1/sum_j(q_j), so the behaviors in the question are expected.
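A quick numeric check of that formula against the two TF outputs above (my sketch):

import numpy as np

# grad_k = -p_k/q_k + 1/sum_j(q_j), compared against tape.gradient above
p = np.array([1., 0., 0.])
for q in (np.array([0.8, 0.1, 0.1]), np.array([0.5, 0.1, 0.1])):
    print(-p / q + 1.0 / q.sum())
# [-0.25  1.    1.  ]
# [-0.57142857  1.42857143  1.42857143]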
Categorical crossentropy is tricky, particularly w.r.t. one-hot encodings; the problem arises from presuming that some predictions are "tossed out" in computing the loss or gradient, based on how the loss is computed:
loss = f(labels * preds) = f([1, 0, 0] * preds)
Why do the gradients look incorrect? The above may suggest that preds[1:] don't matter, but note that this isn't actually preds - it's pred_norm, which involves every element of preds via the normalizing sum. To get a better idea of what's happening, the Numpy backend is helpful; assuming from_logits=False:
losses = []
for label, pred in zip(labels, preds):
    pred_norm = pred / pred.sum(axis=-1, keepdims=True)
    losses.append(np.sum(label * -np.log(pred_norm), axis=-1, keepdims=False))
A more complete explanation of above - here. Below is my derivation of the gradients formula, with examples comparing its Numpy implementation with tf.GradientTape results. To skip the meaty details, scroll to "Main idea".
Formula + Derivation: proof of correctness at the bottom.
"""
grad = -y * sum(p_zeros) / (p_one * sum(pred)) + p_mask / sum(pred)
p_mask = abs(y - 1)
p_zeros = p_mask * pred
y = label: 1D array of length N, one-hot
p = prediction: 1D array of length N, float32 from 0 to 1
p_norm = normalized predictions
p_mask = prediction masks (see below)
"""
What's happening? Begin with a simple example to understand what tf.GradientTape is doing:
w = tf.Variable([0.5, 0.1, 0.1])

with tf.GradientTape(persistent=True) as tape:
    f1 = w[0] + w[1]  # f = function
    f2 = w[0] / w[1]
    f3 = w[0] / (w[0] + w[1] + w[2])

print(tape.gradient(f1, w))  # [1.       1.       0.]
print(tape.gradient(f2, w))  # [10.     -50.      0.]
print(tape.gradient(f3, w))  # [0.40816 -1.02040 -1.02040]
Let w = [w1, w2, w3]. Then:
"""
grad_i = [dfi/dw1, dfi/dw2, dfi/dw3]
grad1 = [d(w1 + w2)/dw1, d(w1 + w2)/dw2, d(w1 + w2)/dw3] = [1, 1, 0]
grad2 = [d(w1 / w2)/dw1, d(w1 / w2)/dw2, d(w1 / w2)/dw3] = [1/w2, -w1/w2^2, 0] = [10, -50, 0]
grad3 = [(w2 + w3)/K, -w1/K, -w1/K] = [0.40816, -1.02040, -1.02040] -- K = (w1 + w2 + w3)^2
"""
In other words, tf.GradientTape treats each element of the input tensor it's differentiating against as a variable. With this in mind, it suffices to implement categorical crossentropy via elementary tf functions, then derive its derivative by hand and see if they agree. That's what I've done in the code at the bottom, with the loss better explained in the answer linked above.
Formula explanation:
f3 above is the most insightful, as it's actually pred_norm; all we need now is to add a natural log and handle two separate cases: grads for y == 1, and for y == 0. With Wolfram Alpha at hand, the derivatives can be computed in a flash. Adding more variables to the denominator, we can see the following pattern:
d(loss)/d(p_one)     = -sum(p_zeros) / (p_one * sum(pred))
d(loss)/d(p_non_one) = 1 / sum(pred)
where p_one is the pred element where label == 1, p_non_one is any other pred element, and p_zeros is all pred elements except p_one. The code at the bottom is simply an implementation of exactly this, using compact syntax.
Explanation example:
Suppose label = [1, 0, 0]; pred = [.5, .1, .1]. Below is numpy_gradient, step-by-step:
p_mask == [0, 1, 1]  # effectively `label` "inverted", to exclude `p_one`
p_one  == .5         # pred where `label` == 1

## grad_zeros
p_mask / np.sum(pred) == [0, 1, 1] / (.5 + .1 + .1) = [0, 1/.7, 1/.7]
                      == [0, 1.42857, 1.42857]

## grad_one
p_one * np.sum(pred)  == .5 * (.5 + .1 + .1) = .5 * .7 = .35
p_mask * pred         == [0, 1, 1] * [.5, .1, .1] = [0, .1, .1]
np.sum(p_mask * pred) == .2
label * np.sum(p_mask * pred) == [1, 0, 0] * .2 = [.2, 0, 0]
label * np.sum(p_mask * pred) / (p_one * np.sum(pred))
                      == [.2, 0, 0] / .35 = [0.57142854, 0, 0]

## grad = grad_zeros - grad_one
                      == [0, 1.42857, 1.42857] - [0.57142854, 0, 0]
                      == [-0.57142854, 1.42857, 1.42857]
Per above, we can see that the gradient is effectively divided into two computations: grad_zeros and grad_one.
Main idea: understandably, that's a lot of detail, so here's the main idea: every element of label and pred affects grad, because loss is computed using pred_norm, not pred, and the normalization step is backpropagated. We can run a little visual to confirm this:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

labels = tf.constant([[1, 0, 0]], dtype=tf.float32)
grads = []

for i in np.linspace(0, 1, 100):
    logits = tf.Variable([[0.5, 0.1, i]], dtype=tf.float32)
    with tf.GradientTape(persistent=True) as tape:
        loss = tf.keras.losses.categorical_crossentropy(
            labels, logits, from_logits=False)
    grads.append(tape.gradient(loss, logits))

grads = np.vstack(grads)
plt.plot(grads)
Even though only logits[2] is varied, grads[:, 1] varies exactly the same way. The explanation is clear from grad_zeros above, but more intuitively, categorical crossentropy doesn't care "how wrong" the zero-label predictions are individually, only collectively - because it only semi-directly computes loss from pred[0] (i.e. pred[0] / sum(pred)), which is normalized by all other pred. So whether pred[1] == .9 and pred[2] == .2 or vice versa, pred_norm[0] (and hence the loss) is exactly the same.
Closing note: derived formulas are intended for a 1D case for simplicity, and may not work for N-dimensional labels and preds tensors, but can be easily generalized.
Numpy vs. tf.GradientTape:
import numpy as np
import tensorflow as tf

def numpy_gradient(label, pred):
    p_mask = np.abs(label - 1)
    p_one = pred[np.where(label == 1)[0][0]]
    return p_mask / np.sum(pred) \
           - label * np.sum(p_mask * pred) / (p_one * np.sum(pred))

def gtape_gradient(label, pred):
    pred = tf.Variable(pred)
    label = tf.Variable(label)
    with tf.GradientTape() as tape:
        loss = -tf.math.log(tf.reduce_sum(label * pred) / tf.reduce_sum(pred))
    return tape.gradient(loss, pred).numpy()

label = np.array([1., 0., 0.])
pred = np.array([0.5, 0.1, 0.1])

print(numpy_gradient(label, pred))
# [-0.57142854  1.4285713   1.4285713 ] <-- 100% agreement
print(gtape_gradient(label, pred))
# [-0.57142866  1.4285713   1.4285713 ] <-- 100% agreement
I am trying to train a Neural Network using Keras, and I am using my own metric function as the loss function. The reason for this is that the actual values in the test set have a lot of NaN values. Let me give an example of the actual values in the test set:
12
NaN
NaN
NaN
8
NaN
NaN
3
In the preprocessing of my data, I replaced all the NaN values with zeros, so the above example contains zeros on each NaN row.
The Neural Network produces an output like this:
14
12
9
9
8
7
6
3
I only want to calculate the root mean squared error between the non-zero values. So for the example above, it should only calculate the RMSE for rows 1, 5 and 8. To do this, I created the following function:
from sklearn.metrics import mean_squared_error
from math import sqrt
import numpy as np

[...]

def evaluation_metric(y_true, y_pred):
    nonzero = np.nonzero(y_true)  # take the indices once, before overwriting y_true
    y_pred = y_pred[nonzero]
    y_true = y_true[nonzero]
    error = sqrt(mean_squared_error(y_true, y_pred))
    return error
When you test the function by hand, by feeding in the actual values from the test set and an output from the neural network initialized with random weights, it works well and produces an error value. I am able to optimize the weights using an evolutionary approach, and I am able to optimize this error measure by adjusting the weights of the network.
Now, I want to train the network with evaluation_metric as the loss function using the model.compile function from Keras. When I run:
model.compile(loss=evaluation_metric, optimizer='rmsprop', metrics=[evaluation_metric])
I get the following error:
TypeError: Using a tf.Tensor as a Python bool is not allowed. Use if t is not None: instead of if t: to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute subgraphs conditioned on the value of a tensor.
I think this has to do with the usage of np.nonzero. Since I am working with Keras, I should probably use a function of the Keras backend, or use something like tf.cond to check for the non-zero values of y_true.
Can someone help me with this?
EDIT
The code works after applying the following fix:
def evaluation_metric(y_true, y_pred):
    y_true = y_true * (y_true != 0)
    y_pred = y_pred * (y_true != 0)
    error = root_mean_squared_error(y_true, y_pred)
    return error
Along with the following function for calculating the RMSE of a tf object:
def root_mean_squared_error(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1))
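As an aside, a sketch of a variant that also corrects the denominator, so the zeroed (NaN) rows do not dilute the mean (my assumption: zeros only ever mark missing values):

import keras.backend as K

def masked_rmse(y_true, y_pred):
    mask = K.cast(K.not_equal(y_true, 0), K.floatx())  # 1 where y_true != 0
    sq_err = K.square((y_pred - y_true) * mask)
    return K.sqrt(K.sum(sq_err) / K.sum(mask))         # mean over kept rows only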
Yes, indeed the problem lies in using a numpy function. Here is a quick fix; note that the RMSE itself must also be computed with backend ops (e.g. the root_mean_squared_error helper above) rather than sklearn's mean_squared_error:
def evaluation_metric(y_true, y_pred):
    y_true = y_true * (y_true != 0)
    y_pred = y_pred * (y_true != 0)
    error = root_mean_squared_error(y_true, y_pred)
    return error
I would write the metric in TensorFlow myself, like this:
import tensorflow as tf
import numpy as np

data = np.array([0, 1, 2, 0, 0, 3, 7, 0]).astype(np.float32)
pred = np.random.randn(8).astype(np.float32)
gt = np.random.randn(8).astype(np.float32)

data_op = tf.convert_to_tensor(data)
pred_op = tf.convert_to_tensor(pred)
gt_op = tf.convert_to_tensor(gt)

# Reference value computed in numpy: RMSE over the entries where data != 0
expected = np.sqrt(((gt[data != 0] - pred[data != 0]) ** 2).mean())

def nonzero_mean(gt_op, pred_op, data_op):
    mask_op = 1 - tf.cast(tf.equal(data_op, 0), tf.float32)  # 1 where data != 0
    actual_op = ((gt_op - pred_op) * mask_op) ** 2
    # divide by the number of unmasked entries, not the full length
    actual_op = tf.reduce_sum(actual_op) / tf.cast(tf.count_nonzero(mask_op), tf.float32)
    actual_op = tf.sqrt(actual_op)
    return actual_op

with tf.Session() as sess:
    actual = sess.run(nonzero_mean(gt_op, pred_op, data_op))
    print(actual, expected)
The y_true != 0 comparison is not possible in plain TensorFlow; I'm not sure whether Keras does some magic here.
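A sketch of the explicit op one would use in plain TensorFlow instead of the != operator:

import tensorflow as tf

# TF1-era equivalent of `y_true != 0`, matching the session example above
y_true_op = tf.constant([0., 1., 2., 0.])
mask_op = tf.cast(tf.not_equal(y_true_op, 0), tf.float32)  # [0., 1., 1., 0.]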