I try to use Squared Normalized Error as my objective function for XGBoostRegressor using documentation hints here: https://xgboost.readthedocs.io/en/latest/tutorials/custom_metric_obj.html. My objective function equation is:
(prediction - observation) / standard_deviation(observations)
While trying to develop it I encountered the following issues:
I am wondering if such objective function is allowed, since standard deviation contains information about all observations (labels) while loss is calculated for each training example individually.
If my approach is correct, I am wondering how to calculate hessian and gradient of this objective function. I analyzed the squared error loss function here: Creating a Custom Objective Function in for XGBoost.XGBRegressor, but failed to understand why x=(predictions - observations) is treated as one parameter. In other words, why do we use loss function as x^2 instead of (x-y)^2? x and y correspond to predictions and observations respectively.
EDIT: I use XGBoost for the task of photovoltaic (PV) yield forecasting and I make predictions for multiple systems using one model. I would like to have low percentage error for all systems, despite their size. However, squared error makes training focus on the largest systems, as their error is naturally the largest. I changed the objective function to:
(prediction - observation) / system_size
and made system_size a global variable, as adding new input variables to gradient and hessian functions is not allowed. The code compiles without errors, but predictions are within very small range. Gradient can be divided by system_sizes, as dividing by constant does not change the derivative. Code I managed to develop so far:
def gradient_sne(predt: np.ndarray, dtrain: DMatrix) -> np.ndarray:
#Compute the gradient squared normalized error.
y = dtrain.get_label()
return 2*(predt - y)/system_sizes
def hessian_sne(predt: np.ndarray, dtrain: DMatrix) -> np.ndarray:
#Compute the hessian for squared error
y = dtrain.get_label()
return 0*y + 2
def custom_sne(y_pred, y_true):
#squared error objective. A simplified version of MSNE used as
#objective function.
grad = gradient_sne(y_pred, y_true)
hess = hessian_sne(y_pred, y_true)
return grad, hess
# Customized metric
def nrmse(predt: np.ndarray, dtrain: DMatrix):
''' Root mean squared normalized error metric. '''
y = dtrain.get_label()
predt[predt < 0] = 0 # all negative predictions are zero
std_dev = np.std(y)
elements = np.power(((y - predt) / std_dev), 2)
return 'RMSNE', float(np.sqrt(np.sum(elements) / len(y)))
I use python 3.7.5 and xgboost 1.0.2. I would appreciate your help very much.
Related
I'm trying to implement the backward pass for an NN with a final layer of Softmax and loss function of Cross Entropy. I'm following the notes in this article (particularly the "Matrix Multiplication" section).
I'd first like to make sure I'm calculating the derivative of the error with respect to the final outputs correctly. I'm working on the MNIST classification problem, and so y represents a one-hot encoding of the target and y_hat is my predicted probabilities.
def cross_entropy(y, y_hat):
value = np.log2(np.sum(y*y_hat))
return value
def d_cross_entropy(y, y_hat):
return -y/y_hat*np.log(2)
I'm a lot more confused on getting the gradient of Softmax. If we say that A = Softmax(Wx+b), then taking the gradient of A with respect to X is more difficult because Ai does not just depend on Xi but on all elements of the X vector. This means that rather than getting a simple 10-dimensional dA/dX term, I get a 10x10 matrix which throws off the matrix multiplication. I tried taking the sum to reduce this to a 10-dimensional vector, but this seems incorrect
def softmax(x):
exp = np.exp(x)
return exp/np.sum(exp)
def d_softmax(x):
softmax_x = softmax(x)
jacobian = np.outer(softmax_x, -softmax_x)
adj = np.eye(x.shape[0])*softmax_x
jacobian += adj
return jacobian.reshape((x.shape[0], x.shape[0])).sum()
I have successfully implemented a custom multiclass softmax function function in XGBoost based on this tutorial. The reason for customization is that the classes I want to predict are conditional on some data inputs - i.e. of the 24 possible classes being predicted, only a certain subset are valid. valid_transitions are lists of indices corresponding to classes we want to make predictions on and invalid_transitions are the inverse set of indices.
I have implemented .fit() and .predict_proba() such that they take valid_transitions and invalid_transitions as arguments which tells softprob_obj() and softmax()which classes to null out during training and prediction.
def softmax(x, valid_transitions, invalid_transitions):
for i in range(len(x)):
e = np.exp(x[i,valid_transitions[i]])
x[i, valid_transitions[i]] = e/np.sum(e)
x[i, invalid_transitions[i]] = 0
return x
def softprob_obj(labels, predt, data, valid_transitions, invalid_transitions):
'''Loss function. Computing the gradient and approximated hessian (diagonal).
Reimplements the `multi:softprob` inside XGBoost.
'''
kRows = len(data)
kClasses = len(np.unique(labels))
# The prediction is of shape (rows, classes), each element in a row
# represents a raw prediction (leaf weight, hasn't gone through softmax
# yet). In XGBoost 1.0.0, the prediction is transformed by a softmax
# function, fixed in later versions.
assert predt.shape == (kRows, kClasses)
eps = 1e-6
# compute the gradient and hessian, slow iterations in Python, only
# suitable for demo. Also the one in native XGBoost core is more robust to
# numeric overflow as we don't do anything to mitigate the `exp` in
# `softmax` here.
probs = softmax(predt, valid_transitions, invalid_transitions)
labels = labels.astype(int)
hess = np.maximum((2.0 * probs * (1.0 - probs)), eps)
probs[np.arange(len(probs)),labels] -= 1
# Right now (XGBoost 1.0.0), reshaping is necessary
grad = probs.reshape((kRows * kClasses, 1))
hess = hess.reshape((kRows * kClasses, 1))
return grad, hess
This works, but is pretty slow in training, presumably because the core xgboost functions I'm replacing are not written in python. I made some attempts to try to vectorize the calculation in numpy to avoid the for loop in softmax(), but ran into some issues with the ragged arrays that valid_transition and invalid_transition create. Was wondering if anyone had any ideas on how to optimize this within python. Thanks!
I'm trying to wrap my head around the categorical cross entropy loss. Looking at the implementation of the cross entropy loss in Keras:
# scale preds so that the class probas of each sample sum to 1
output = output / math_ops.reduce_sum(output, axis, True)
# Compute cross entropy from probabilities.
epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
return -math_ops.reduce_sum(target * math_ops.log(output), axis)
I do not see where the delta = output - target
is calculated.
See here.
What am I missing?
I think you might be confusing two different concepts / events here.
The categorical cross entropy loss is a measure of the error of your model, as calculated by :
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
<etc>
This just returns an array of losses for each label, it is the direct difference between the true label and what your model thinks the label should be.
The next step after calculating the loss (part of the forward propagation phase) is to then start backpropagation, i.e. we want to find the influence that each weight/bias matrix has on the loss you've calculated above, so that we can perform the update step.
The first step is then to calculate dL/dz i.e. the derivative of the loss function with respect to the linear function (y = Wx + b), which itself is the combination of dL/da * da/dz (i.e. the deriv loss wrt activation * deriv activation wrt the linear function).
The link you posted is the derivative of the activation function wrt the linear function. This blog does a decent job of explaining how all the parts fit together, although the activation function they use is a sigmoid, but the overall pieces that fit together are the same.
So I am relatively new to the ML/AI game in python, and I'm currently working on a problem surrounding the implementation of a custom objective function for XGBoost.
My differential equation knowledge is pretty rusty so I've created a custom obj function with a gradient and hessian that models the mean squared error function that is ran as the default objective function in XGBRegressor to make sure that I am doing all of this correctly. The problem is, the results of the model (the error outputs are close but not identical for the most part (and way off for some points). I don't know what I'm doing wrong or how that could be possible if I am computing things correctly. If you all could look at this an maybe provide insight into where I am wrong, that would be awesome!
The original code without a custom function is:
import xgboost as xgb
reg = xgb.XGBRegressor(n_estimators=150,
max_depth=2,
objective ="reg:squarederror",
n_jobs=-1)
reg.fit(X_train, y_train)
y_pred_test = reg.predict(X_test)
and my custom objective function for MSE is as follows:
def gradient_se(y_true, y_pred):
#Compute the gradient squared error.
return (-2 * y_true) + (2 * y_pred)
def hessian_se(y_true, y_pred):
#Compute the hessian for squared error
return 0*(y_true + y_pred) + 2
def custom_se(y_true, y_pred):
#squared error objective. A simplified version of MSE used as
#objective function.
grad = gradient_se(y_true, y_pred)
hess = hessian_se(y_true, y_pred)
return grad, hess
the documentation reference is here
Thanks!
According to the documentation, the library passes the predicted values (y_pred in your case) and the ground truth values (y_true in your case) in this order.
You pass the y_true and y_pred values in reversed order in your custom_se(y_true, y_pred) function to both the gradient_se and hessian_se functions. For the hessian it doesn't make a difference since the hessian should return 2 for all x values and you've done that correctly.
For the gradient_se function you've incorrect signs for y_true and y_pred.
The correct implementation is as follows:
def gradient_se(y_pred, y_true):
#Compute the gradient squared error.
return 2*(y_pred - y_true)
def hessian_se(y_pred, y_true):
#Compute the hessian for squared error
return 0*y_true + 2
def custom_se(y_pred, y_true):
#squared error objective. A simplified version of MSE used as
#objective function.
grad = gradient_se(y_pred, y_true)
hess = hessian_se(y_pred, y_true)
return grad, hess
Update: Please keep in mind that the native XGBoost implementation and the implementation of the sklearn wrapper for XGBoost use a different ordering of the arguments. The native implementation takes predictions first and true labels (dtrain) second, while the sklearn implementation takes the true labels (dtrain) first and the predictions second.
I've compared extensively to existing tutorials but I can't figure out why my weights don't update. Here is the function that return the list of updates:
def get_updates(cost, params, learning_rate):
updates = []
for param in params:
updates.append((param, param - learning_rate * T.grad(cost, param)))
return updates
It is defined at the top level, outside of any classes. This is standard gradient descent for each param. The 'params' parameter here is fed in as mlp.params, which is simply the concatenated lists of the param lists for each layer. I removed every layer except for a logistic regression one to isolate the reason as to why my cost was not decreasing. The following is the definition of mlp.params in MLP's constructor. It follows the definition of each layer and their respective param lists.
self.params = []
for layer in self.layers:
self.params += layer.params
The following is the train function, which I call for each minibatch during each epoch:
train = theano.function([minibatch_index], cost,
updates=get_updates(cost, mlp.params, learning_rate),
givens= {
x: train_set_x[minibatch_index * batch_size : (minibatch_index + 1) * batch_size],
y: train_set_y[minibatch_index * batch_size : (minibatch_index + 1) * batch_size]
})
If you require further details, the entire file is available here: http://pastebin.com/EeNmXfGD
I don't know how many people use Theano (it doesn't seem like plenty); if you've read to this point, thank you.
Fixed: I've determined that I can't use average squared error as the cost function. It works as usual after replacing it with a negative log-likelihood.
This behavior it caused by a few things but it comes down to the cost not being properly computed. In your implementation , the output of the LogisticRegression layer is the predicted class for every input digit (obtained with the argmax operation) and you take the squared difference between it and the expected prediction.
This will give you gradients of 0s wrt to any parameter in your model because the gradient of the output of the argmax (predicted class) wrt the input of the argmax (class probabilities) will be 0.
Instead, the LogisticRegression should output the probabilities of the classes :
def output(self, input):
input = input.flatten(2)
self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)
return self.p_y_given_x
And then in the MLP class, you compute the cost. You can used mean squared error between the desired probabilities for each class and the probabilities computed by the model but people tend to use the Negative Log Likelihood of the expected classes and you can implement it as such in the MLP class :
def neg_log_likelihood(self, x, y):
p_y_given_x = self.output(x)
return -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
Then you can use this function to compute your cost and the model trains :
cost = mlp.neg_log_likelihood(x_, y)
A few additional things:
At line 215, when you print your cost, you format it as an integer value but it is a floating point value; this will lose precision in the monitoring.
Initializing all the weights to 0s as you do in your LogisticRegression class is often not recommended. Weights should differ in their original values so as to help break symmetry