sklearn: Hyperparameter tuning by gradient descent?

Is there a way to perform hyperparameter tuning in scikit-learn by gradient descent? While a formula for the gradient of hyperparameters might be difficult to compute, numerical computation of the hyperparameter gradient by evaluating two close points in hyperparameter space should be pretty easy. Is there an existing implementation of this approach? Why is or isn't this approach a good idea?
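For concreteness, here is roughly the kind of thing I have in mind. This is just a sketch; validation_score, numerical_grad, the step size eps and the learning rate are placeholders I made up, and the hyperparameter is tuned on a log scale:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def validation_score(log_C):
    # cross-validated score as a function of a single (log-scaled) hyperparameter
    model = LogisticRegression(C=np.exp(log_C), max_iter=1000)
    return cross_val_score(model, X, y, cv=5).mean()

def numerical_grad(f, x, eps=1e-2):
    # central finite difference: evaluate two close points in hyperparameter space
    return (f(x + eps) - f(x - eps)) / (2 * eps)

log_C, lr = 0.0, 1.0
for _ in range(20):
    # gradient *ascent* on the validation score
    log_C += lr * numerical_grad(validation_score, log_C)
print('tuned C:', np.exp(log_C))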

Calculating the gradient is the least of the problems, at least in the age of advanced automatic-differentiation software (implementing this in a general way for all sklearn classifiers is of course not easy).
And while there is work by people who have used this kind of idea, it was only done for specific, well-formulated problems (e.g. SVM tuning), and it probably relied on a lot of assumptions, because:
Why is this not a good idea?
Hyper-parameter optimization is in general: non-smooth
GD really wants smooth functions, as a zero (or undefined) gradient is not helpful
(each hyper-parameter that is defined over some discrete set, e.g. the choice of L1 vs. L2 penalization, introduces non-smooth surfaces)
Hyper-parameter optimization is in general: non-convex
The whole convergence theory of GD assumes that the underlying problem is convex
Good case: you obtain some local minimum (which can be arbitrarily bad)
Worst case: GD does not even converge to some local minimum
I might add that your general problem is the worst kind of optimization problem one can consider, because it is:
non-smooth, non-convex
and even stochastic/noisy, since most underlying algorithms are heuristic approximations with some variance in the final output (and often even PRNG-based random behaviour).
The last part is the reason why the methods offered in sklearn are that simple:
random-search:
if we can't infer anything because the problem is too hard, just try many instances and pick the best
grid-search:
let's assume there is some kind of smoothness
instead of random sampling, we sample according to our smoothness assumption
(and other assumptions, like: the parameter is probably big -> use np.logspace to analyze more of the big numbers)
While there are a lot of Bayesian approaches, including available Python software like hyperopt and spearmint, many people think that random search is the best method in general (which might be surprising, but it emphasizes the problems mentioned above).
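For reference, the two simple approaches above look roughly like this in scikit-learn (the estimator, the dataset and the parameter ranges are just placeholders):

import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# grid-search: sample on a fixed grid, using np.logspace because C is a scale parameter
grid = GridSearchCV(SVC(), {'C': np.logspace(-3, 3, 7)}, cv=5).fit(X, y)

# random-search: just try many instances and pick the best one
rand = RandomizedSearchCV(SVC(), {'C': loguniform(1e-3, 1e3)},
                          n_iter=20, cv=5, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)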

Here are some papers describing gradient-based hyperparameter optimization:
Gradient-based hyperparameter optimization through reversible learning (2015):
We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.
Forward and reverse gradient-based hyperparameter optimization (2017):
We study two procedures (reverse-mode and forward-mode) for computing the gradient of the validation error with respect to the hyperparameters of any iterative learning algorithm such as stochastic gradient descent. These procedures mirror two methods of computing gradients for recurrent neural networks and have different trade-offs in terms of running time and space requirements. Our formulation of the reverse-mode procedure is linked to previous work by Maclaurin et al. [2015] but does not require reversible dynamics. The forward-mode procedure is suitable for real-time hyperparameter updates, which may significantly speed up hyperparameter optimization on large datasets.
Gradient descent: the ultimate optimizer (2019):
Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as the learning rate. There exist many techniques for automated hyperparameter optimization, but they typically introduce even more hyperparameters to control the hyperparameter optimization process. We propose to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As these towers of gradient-based optimizers grow, they become significantly less sensitive to the choice of top-level hyperparameters, hence decreasing the burden on the user to search for optimal values.

For generalized linear models (i.e. logistic regression, ridge regression, Poisson regression),
you can efficiently tune many regularization hyperparameters
using exact derivatives and approximate leave-one-out cross-validation.
But don't stop at just the gradient: compute the full Hessian and use a second-order optimizer -- it's
both more efficient and more robust.
sklearn doesn't currently have this functionality, but there are other tools available that can do it.
For example, here's how you can use the Python package bbai to fit the
hyperparameter for ridge-regularized logistic regression so as to maximize the log-likelihood of the
approximate leave-one-out cross-validation of the training data set, using the Wisconsin Breast Cancer Data Set.
Load the data set
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']
Fit the model
import bbai.glm
model = bbai.glm.LogisticRegression()
# Note: it automatically fits the C parameter to minimize the error on
# the approximate leave-one-out cross-validation.
model.fit(X, y)
Because it uses both the gradient and the Hessian with efficient exact formulas
(no automatic differentiation), it can dial in on the exact hyperparameter quickly with only a few
evaluations.
YMMV, but when I compare it to sklearn's LogisticRegressionCV with default parameters, it runs
in a fraction of the time.
import time

t1 = time.time()
model = bbai.glm.LogisticRegression()
model.fit(X, y)
t2 = time.time()
print('***** approximate leave-one-out optimization')
print('C = ', model.C_)
print('time = ', (t2 - t1))
from sklearn.linear_model import LogisticRegressionCV
print('***** sklearn.LogisticRegressionCV')
t1 = time.time()
model = LogisticRegressionCV(scoring='neg_log_loss', random_state=0)
model.fit(X, y)
t2 = time.time()
print('C = ', model.C_[0])
print('time = ', (t2 - t1))
Prints
***** approximate leave-one-out optimization
C = 0.6655139682151275
time = 0.03996014595031738
***** sklearn.LogisticRegressionCV
C = 0.3593813663804626
time = 0.2602980136871338
How it works
Approximate leave-one-out cross-validation (ALOOCV) is a close approximation to leave-one-out
cross-validation that's much more efficient to evaluate for generalized linear models.
It first fits the regularized model, then uses a single step of Newton's algorithm to approximate what
the model weights would be when we leave a single data point out. With the labels coded as y_i in {-1, +1},
the regularized cost function for the generalized linear model can be written as

    L(w) = -sum_i log sigma(y_i * u_i) + (alpha/2) * ||w||^2,    with u_i = x_i . w

Then the ALOOCV log-likelihood can be computed as

    ALO = sum_i log sigma(y_i * u_loo_i)

where

    u_loo_i = u_i - y_i * (1 - p_i) * h_i / (1 - p_i * (1 - p_i) * h_i),
    p_i = sigma(y_i * u_i),    h_i = x_i' H^-1 x_i

(Note: H represents the Hessian of the cost function at the optimal weights, and sigma is the logistic function)
For more background on ALOOCV, you can check out this guide.
It's also possible to compute exact derivatives for ALOOCV which makes it efficient to optimize.
I won't put the derivative formulas here as they are quite involved, but see the paper
Optimizing Approximate Leave-one-out Cross-validation.
If we plot out ALOOCV and compare to leave-one-out cross-validation for the example data set,
you can see that it tracks it very closely and the ALOOCV optimum is nearly the same as the
LOOCV optimum.
Compute Leave-one-out Cross-validation
import numpy as np

def compute_loocv(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    n = len(y)
    loo_likelihoods = []
    for i in range(n):
        train_indexes = [i_p for i_p in range(n) if i_p != i]
        test_indexes = [i]
        X_train, X_test = X[train_indexes], X[test_indexes]
        y_train, y_test = y[train_indexes], y[test_indexes]
        model.fit(X_train, y_train)
        pred = model.predict_proba(X_test)[0]
        loo_likelihoods.append(pred[y_test[0]])
    return sum(np.log(loo_likelihoods))
Compute Approximate Leave-one-out Cross-validation
import scipy
import scipy.linalg
import scipy.special

def fit_logistic_regression(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    model.fit(X, y)
    return np.array(list(model.coef_[0]) + list(model.intercept_))

def compute_hessian(p_vector, X, alpha):
    # Hessian of the regularized cost at the optimum: X' diag(p(1-p)) X + alpha*I
    # (the intercept column, appended last, is not regularized)
    n, k = X.shape
    a_vector = np.sqrt((1 - p_vector)*p_vector)
    R = scipy.linalg.qr(a_vector.reshape((n, 1))*X, mode='r')[0]
    H = np.dot(R.T, R)
    for i in range(k-1):
        H[i, i] += alpha
    return H

def compute_alo(X, y, C):
    alpha = 1.0 / C
    w = fit_logistic_regression(X, y, C)
    # add the intercept column and switch to +1/-1 labels
    X = np.hstack((X, np.ones((X.shape[0], 1))))
    n = X.shape[0]
    y = 2*y - 1
    u_vector = np.dot(X, w)
    p_vector = scipy.special.expit(u_vector*y)
    H = compute_hessian(p_vector, X, alpha)
    L = np.linalg.cholesky(H)
    T = scipy.linalg.solve_triangular(L, X.T, lower=True)
    # h_i = x_i' H^-1 x_i
    h_vector = np.array([np.dot(ti, ti) for ti in T.T])
    loo_u_vector = u_vector - \
        y * (1 - p_vector)*h_vector / (1 - p_vector*(1 - p_vector)*h_vector)
    loo_likelihoods = scipy.special.expit(y*loo_u_vector)
    return sum(np.log(loo_likelihoods))
Plot out the results (along with the ALOOCV optimum)
import matplotlib.pyplot as plt
Cs = np.arange(0.1, 2.0, 0.1)
loocvs = [compute_loocv(X, y, C) for C in Cs]
alos = [compute_alo(X, y, C) for C in Cs]
fig, ax = plt.subplots()
ax.plot(Cs, loocvs, label='LOOCV', marker='o')
ax.plot(Cs, alos, label='ALO', marker='x')
ax.axvline(model.C_, color='tab:green', label='C_opt')
ax.set_xlabel('C')
ax.set_ylabel('Log-Likelihood')
ax.set_title("Breast Cancer Dataset")
ax.legend()
Displays a plot titled "Breast Cancer Dataset" of LOOCV and ALO log-likelihood against C, with the two curves tracking each other closely and C_opt marking the shared optimum.

Related

MLPRegressor learning_rate_init for lbfgs solver in sklearn

For a school project I need to evaluate a neural network with different learning rates. I chose sklearn to implement the neural network (using the MLPRegressor class). Since the training data is pretty small (20 instances, 2 inputs and 1 output each) I decided to use the lbfgs solver, since stochastic solvers like sgd and adam for this size of data don't make sense.
The project mandates testing the neural network with different learning rates. That, however, is not possible with the lbfgs solver according to the documentation:
learning_rate_init double, default=0.001
The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.
Is there a way I can access the learning rate of the lbfgs solver somehow and modify it or that question doesn't even make sense?
LBFGS is an optimization algorithm that simply does not use a learning rate. For the purpose of your school project, you should use either sgd or adam. Regarding whether it makes more sense or not, I would say that training a neural network on 20 data points doesn't make a lot of sense anyway, except for learning the basics.
LBFGS is a quasi-Newton optimization method. It is based on the hypothesis that the function you seek to optimize can be approximated locally by a second-order Taylor expansion. It roughly proceeds like this:
Start from an initial guess
Use the Jacobian matrix to compute the direction of steepest descent
Use the Hessian matrix to compute the descent step and reach the next point
Repeat until convergence
The difference from Newton methods is that quasi-Newton methods use approximations of the Jacobian and/or Hessian matrices.
Newton and quasi-Newton methods require more smoothness from the function to optimize than gradient descent does, but they converge faster. Indeed, computing the descent step with the Hessian matrix is more efficient because it can foresee the distance to the local optimum, and thus does not end up oscillating around it or converging very slowly. On the other hand, gradient descent only uses the Jacobian matrix (first-order derivatives) to compute the direction of steepest descent, and uses the learning rate as the descent step.
In practice, gradient descent is used in deep learning because computing the Hessian matrix would be too expensive.
So it makes no sense to talk about a learning rate for Newton (or quasi-Newton) methods; it is simply not applicable.
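To make the advice above concrete: the learning rate is only exposed for the stochastic solvers, so a sweep for the school project could look roughly like this (a sketch with toy data shaped like the question's 20 instances with 2 inputs):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                                 # 20 instances, 2 inputs
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=20)   # 1 output

for lr in [1e-3, 1e-2, 1e-1]:
    model = MLPRegressor(solver='adam', learning_rate_init=lr,
                         max_iter=5000, random_state=0)
    model.fit(X, y)
    print('learning_rate_init =', lr, '-> final loss =', model.loss_)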
Not a complete answer, but hopefully a good pointer.
The sklearn.neural_network.MLPRegressor is implemented in the multilayer_perceptron module on GitHub.
By inspecting the module I noticed that, unlike the other solvers, scikit-learn implements the lbfgs algorithm in the base class itself, so you can easily adapt it.
It seems that they don't use any learning rate, so you could adapt this code and multiply the loss by the learning rate you want to test. I'm just not totally sure whether it makes sense to add a learning rate in the context of lbfgs.
I believe the loss is being used here:
opt_res = scipy.optimize.minimize(
    self._loss_grad_lbfgs, packed_coef_inter,
    method="L-BFGS-B", jac=True,
    options={
        "maxfun": self.max_fun,
        "maxiter": self.max_iter,
        "iprint": iprint,
        "gtol": self.tol
    },
    args=(X, y, activations, deltas, coef_grads, intercept_grads))
The code is located at line 430 of the _multilayer_perceptron.py module:
def _fit_lbfgs(self, X, y, activations, deltas, coef_grads,
               intercept_grads, layer_units):
    # Store meta information for the parameters
    self._coef_indptr = []
    self._intercept_indptr = []
    start = 0

    # Save sizes and indices of coefficients for faster unpacking
    for i in range(self.n_layers_ - 1):
        n_fan_in, n_fan_out = layer_units[i], layer_units[i + 1]
        end = start + (n_fan_in * n_fan_out)
        self._coef_indptr.append((start, end, (n_fan_in, n_fan_out)))
        start = end

    # Save sizes and indices of intercepts for faster unpacking
    for i in range(self.n_layers_ - 1):
        end = start + layer_units[i + 1]
        self._intercept_indptr.append((start, end))
        start = end

    # Run LBFGS
    packed_coef_inter = _pack(self.coefs_,
                              self.intercepts_)

    if self.verbose is True or self.verbose >= 1:
        iprint = 1
    else:
        iprint = -1

    opt_res = scipy.optimize.minimize(
        self._loss_grad_lbfgs, packed_coef_inter,
        method="L-BFGS-B", jac=True,
        options={
            "maxfun": self.max_fun,
            "maxiter": self.max_iter,
            "iprint": iprint,
            "gtol": self.tol
        },
        args=(X, y, activations, deltas, coef_grads, intercept_grads))
    self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
    self.loss_ = opt_res.fun
    self._unpack(opt_res.x)

Which optimization metric for differenced data when doing a multi-step forecast?

I've written an LSTM in Keras for univariate time series forecasting. I'm using an input window of size 48 and an output window of size 12, i.e. I'm predicting 12 steps at once. This is working generally well with an optimization metric such as RMSE.
For non-stationary time series I'm differencing the data before feeding the data to the LSTM. Then after predicting, I take the inverse difference of the predictions.
When differencing, RMSE is not suitable as an optimization metric as the earlier prediction steps are a lot more important than later steps. When we do the inverse difference after creating a 12-step forecast, then the earlier (differenced) prediction steps are going to affect the inverse difference of later steps.
So what I think I need is an optimization metric that gives the early prediction steps more weight, preferably exponentially.
Does such a metric exist already or should I write my own? Am I overlooking something?
I just wrote my own optimization metric; it seems to work well, certainly better than RMSE.
Still curious what the best practice is here. I'm relatively new to forecasting.
from tensorflow.keras import backend as K
def weighted_rmse(y_true, y_pred):
    # later prediction steps get smaller weights: [T, T-1, ..., 1] for T output steps
    weights = K.arange(start=y_pred.get_shape()[1], stop=0, step=-1, dtype='float32')
    y_true_w = y_true * weights
    y_pred_w = y_pred * weights
    return K.sqrt(K.mean(K.square(y_true_w - y_pred_w), axis=-1))
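If you want the exponential decay mentioned in the question rather than the linear weights above, a variant could look like this (the decay factor 0.8 is an arbitrary choice, and here the weights multiply the squared errors instead of the raw series):

from tensorflow.keras import backend as K

def exp_weighted_rmse(y_true, y_pred, decay=0.8):
    # weight prediction step t by decay**t, so earlier steps dominate the loss
    steps = K.arange(start=0, stop=y_pred.get_shape()[1], step=1, dtype='float32')
    weights = K.pow(decay, steps)
    return K.sqrt(K.mean(weights * K.square(y_true - y_pred), axis=-1))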

problem with alpha and lambda regularization parameters in python

Question :
Logistic Regression Train logistic regression models with L1 regularization and L2 regularization using alpha = 0.1
and lambda = 0.1. Report accuracy, precision, recall, f1-score and print the confusion matrix
My code is :
_lambda = 0.1
c = 1/_lambda
classifier = LogisticRegression(penalty='l1',C=c)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
I don't know where alpha and lambda actually go in this.
Did I do it right?
your example
alpha=0, lambda=10 (AKA .1/1)
alpha
alpha is the parameter that adds a penalty for the number of features to control overfitting, in this case either L1 (Lasso regression) or L2 (Ridge regression). The L1 and L2 penalties cannot both be applied at the same time, as there is only one lambda coefficient. Quick aside - Elastic Net is an alpha parameter that is somewhere in between L1 and L2; so, for example, if you are using sklearn.SGD_Regressor(), alpha=0 is L1, alpha=0.5 is elastic net, and alpha=1 is Ridge.
Lambda
is a term that controls the learning rate. In other words, how much change do you want the model to make during each iteration of learning.
confusion
To make matters worse, these terms are often used interchangeably, I think due to different yet similar concepts in graph theory, statistical theory, mathematical theory, and among the individuals who write commonly used machine-learning libraries.
check out some info here: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/ but also look for some of the free academic textbooks about statistical learning.
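Coming back to the mechanics of the original question: scikit-learn takes C = 1/lambda, and the requested metrics live in sklearn.metrics. A minimal sketch, assuming the X_train/X_test/y_train/y_test split from the question already exists:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

_lambda = 0.1
for penalty in ['l1', 'l2']:
    # liblinear supports both the L1 and the L2 penalty
    classifier = LogisticRegression(penalty=penalty, C=1/_lambda, solver='liblinear')
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(penalty)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))  # accuracy, precision, recall, f1-score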

How to set the learning rate in scikit-learn's ridge regression?

I'm using scikit-learn's ridge regression:
regr = linear_model.Ridge (alpha = 0.5)
# Train the model using the training sets
regr.fit(X_train, Y_train)
#bias:
print('bias: \n', regr.intercept_)
# The coefficients
print('Coefficients: \n', regr.coef_)
I found (here) the different options for the linear_model.Ridge function, but there is a specific option that I didn't find in the list: How could I set the learning rate (or learning step) of the update function?
By learning rate, I mean:
w_{t+1} = w_t - (learning_rate) * (partial derivative of objective function)
I refer to learning rate as step size.
Your code is not using the sag (stochastic average gradient) solver. The default parameter for solver is set to auto, which will choose a solver depending on the data type. A description of the other solvers and which to use is here.
To use the sag solver:
regr = linear_model.Ridge (alpha = 0.5, solver = 'sag')
However, for this solver you do not set the step size because the solver computes the step size based on your data and alpha. Here is the code for sag solver used for ridge regression, where they explain how the step size is computed.
The step size is set to 1 / (alpha_scaled + L + fit_intercept) where L is
the max sum of squares for over all samples.
Line 401 shows how the sag solver is used for ridge regression.
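If you really do need an explicit learning rate for a ridge-style model, one alternative (a different estimator, not Ridge itself) is SGDRegressor with an L2 penalty, where eta0 is the step size. A minimal sketch, reusing X_train and Y_train from the question:

from sklearn.linear_model import SGDRegressor

# same kind of objective (squared loss + L2 penalty), but optimized by SGD
# with an explicit, constant learning rate eta0
regr = SGDRegressor(penalty='l2', alpha=0.5,
                    learning_rate='constant', eta0=0.01, max_iter=1000)
regr.fit(X_train, Y_train)
print('bias:', regr.intercept_)
print('Coefficients:', regr.coef_)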

How does sklearn.svm.svc's function predict_proba() work internally?

I am using sklearn.svm.svc from scikit-learn to do binary classification. I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() internally calculates the probability?
Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.
Platt scaling requires first training the SVM as usual, then optimizing parameter vectors A and B such that
P(y|X) = 1 / (1 + exp(A * f(X) + B))
where f(X) is the signed distance of a sample from the hyperplane (scikit-learn's decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.
Mind you: the B parameter, the "intercept" or "bias" or whatever you like to call it, can cause predictions based on probability estimates from this model to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10, then the prediction for X is positive; but if B = -9.9 and A = 1, then P(y|X) = .475. I'm pulling these numbers out of thin air, but you've noticed that this can occur in practice.
Effectively, Platt scaling trains a probability model on top of the SVM's outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses an internal five-fold cross validation, meaning that training SVMs with probability=True can be quite a lot more expensive than a vanilla, non-probabilistic SVM.
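To see the two outputs side by side (and occasionally catch the inconsistency described above), a quick sketch like this works; the dataset choice is arbitrary:

from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
clf = SVC(probability=True, random_state=0).fit(X, y)

print(clf.decision_function(X[:5]))      # signed distance f(X) from the hyperplane
print(clf.predict_proba(X[:5])[:, 1])    # Platt-scaled probability of the positive class
print(clf.predict(X[:5]))                # based on decision_function, not predict_proba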
Actually I found a slightly different answer: they use this code to convert the decision value to a probability

double fApB = decision_value*A+B;
if (fApB >= 0)
    return Math.exp(-fApB)/(1.0+Math.exp(-fApB));
else
    return 1.0/(1+Math.exp(fApB));

Here the A and B values can be found in the model file (probA and probB).
It offers a way to convert probability to decision value and thus to hinge loss.
Use that ln(0) = -200.
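For reference, a direct Python translation of that snippet might look like this, with A and B being the probA/probB values from the model file:

import math

def platt_probability(decision_value, A, B):
    # numerically stable form of 1 / (1 + exp(A * f(X) + B))
    fApB = decision_value * A + B
    if fApB >= 0:
        return math.exp(-fApB) / (1.0 + math.exp(-fApB))
    return 1.0 / (1.0 + math.exp(fApB))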
