MLPRegressor learning_rate_init for lbfgs solver in sklearn - python

For a school project I need to evaluate a neural network with different learning rates. I chose sklearn to implement the neural network (using the MLPRegressor class). Since the training data is pretty small (20 instances, 2 inputs and 1 output each), I decided to use the lbfgs solver, because stochastic solvers like sgd and adam don't make sense for this amount of data.
The project mandates testing the neural network with different learning rates. That, however, is not possible with the lbfgs solver according to the documentation:
learning_rate_init double, default=0.001
The initial learning rate used. It controls the step-size in updating the weights. Only used when solver=’sgd’ or ‘adam’.
Is there a way I can somehow access and modify the learning rate of the lbfgs solver, or does that question not even make sense?

LBFGS is an optimization algorithm that simply does not use a learning rate. For the purpose of your school project, you should use either sgd or adam. Regarding whether it makes more sense or not, I would say that training a neural network on 20 data points doesn't make a lot of sense anyway, except for learning the basics.
LBFGS is a quasi-Newton optimization method. It is based on the hypothesis that the function you seek to optimize can be approximated locally by a second-order Taylor expansion. It roughly proceeds like this:
Start from an initial guess
Use the Jacobian matrix to compute the direction of steepest descent
Use the Hessian matrix to compute the descent step and reach the next point
Repeat until convergence
The difference from Newton methods is that quasi-Newton methods use approximations of the Jacobian and/or Hessian matrices.
Newton and quasi-Newton methods require more smoothness from the function being optimized than gradient descent does, but they converge faster. Indeed, computing the descent step with the Hessian matrix is more efficient because it can foresee the distance to the local optimum, so the iterate neither oscillates around it nor converges very slowly. On the other hand, gradient descent only uses the Jacobian matrix (first-order derivatives) to compute the direction of steepest descent and uses the learning rate as the descent step.
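In symbols (a generic summary, not notation taken from any particular library): gradient descent updates

    w_new = w_old - learning_rate * gradient(w_old)

while a (quasi-)Newton step updates

    w_new = w_old - inverse(approximate Hessian) * gradient(w_old)

so the role of the step size is played by the (approximate) inverse Hessian, which is why no learning rate appears.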
In practice, gradient descent is used in deep learning because computing the Hessian matrix would be too expensive.
So it makes no sense to talk about a learning rate for Newton (or quasi-Newton) methods; it is simply not applicable.
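If the project just needs a learning-rate comparison, a minimal sketch with MLPRegressor and the adam solver could look like this (X_train and y_train stand for your 20-instance training set; the hidden-layer size and iteration count are illustrative assumptions):

    from sklearn.neural_network import MLPRegressor

    for lr in (0.0001, 0.001, 0.01, 0.1):
        model = MLPRegressor(solver='adam', learning_rate_init=lr,
                             hidden_layer_sizes=(10,), max_iter=5000,
                             random_state=0)
        model.fit(X_train, y_train)   # X_train, y_train: your 20 training instances
        print(f"learning_rate_init={lr}: final loss {model.loss_:.4f}")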

Not a complete answer, but hopefully a good pointer.
The sklearn.neural_network.MLPRegressor is implemented in the _multilayer_perceptron module on GitHub.
By inspecting the module I noticed that, differently from the other solvers, scikit-learn implements the lbfgs algorithm in the base class itself, so you can easily adapt it.
It seems that they don't use any learning rate, so you could adapt this code and multiply the loss by the learning rate you want to test. I'm just not totally sure whether it makes sense to add a learning rate in the context of lbfgs.
I believe the loss is being used here:
opt_res = scipy.optimize.minimize(
    self._loss_grad_lbfgs, packed_coef_inter,
    method="L-BFGS-B", jac=True,
    options={
        "maxfun": self.max_fun,
        "maxiter": self.max_iter,
        "iprint": iprint,
        "gtol": self.tol
    },
The code is located at line 430 of the _multilayer_perceptron.py module:
def _fit_lbfgs(self, X, y, activations, deltas, coef_grads,
               intercept_grads, layer_units):
    # Store meta information for the parameters
    self._coef_indptr = []
    self._intercept_indptr = []
    start = 0

    # Save sizes and indices of coefficients for faster unpacking
    for i in range(self.n_layers_ - 1):
        n_fan_in, n_fan_out = layer_units[i], layer_units[i + 1]

        end = start + (n_fan_in * n_fan_out)
        self._coef_indptr.append((start, end, (n_fan_in, n_fan_out)))
        start = end

    # Save sizes and indices of intercepts for faster unpacking
    for i in range(self.n_layers_ - 1):
        end = start + layer_units[i + 1]
        self._intercept_indptr.append((start, end))
        start = end

    # Run LBFGS
    packed_coef_inter = _pack(self.coefs_,
                              self.intercepts_)

    if self.verbose is True or self.verbose >= 1:
        iprint = 1
    else:
        iprint = -1

    opt_res = scipy.optimize.minimize(
        self._loss_grad_lbfgs, packed_coef_inter,
        method="L-BFGS-B", jac=True,
        options={
            "maxfun": self.max_fun,
            "maxiter": self.max_iter,
            "iprint": iprint,
            "gtol": self.tol
        },
        args=(X, y, activations, deltas, coef_grads, intercept_grads))

    self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
    self.loss_ = opt_res.fun
    self._unpack(opt_res.x)
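If you do want to experiment along the lines suggested above, one possible (untested) sketch is to subclass MLPRegressor and rescale what _loss_grad_lbfgs returns. This relies on a private method, so it may break across scikit-learn versions, and a constant rescaling of the objective is largely compensated by L-BFGS-B's line search, so it may not behave like a learning rate at all:

    from sklearn.neural_network import MLPRegressor

    class ScaledLossMLPRegressor(MLPRegressor):
        # factor applied to the loss and gradient that scipy's L-BFGS-B sees
        loss_scale = 1.0

        def _loss_grad_lbfgs(self, *args, **kwargs):
            loss, grad = super()._loss_grad_lbfgs(*args, **kwargs)
            return self.loss_scale * loss, self.loss_scale * grad

    model = ScaledLossMLPRegressor(solver='lbfgs', max_iter=500)
    model.loss_scale = 0.5    # the "learning rate" to test
    # model.fit(X_train, y_train)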

Related

PyTorch: Is retain_graph=True necessary in alternating optimization?

I'm trying to optimize two models in an alternating fashion using PyTorch. The first is a neural network that is changing the representation of my data (i.e. a map f(x) on my input data x, parameterized by some weights W). The second is a Gaussian mixture model that is operating on the f(x) points, i.e. in the neural network space (rather than clustering points in the input space). I am optimizing the GMM using expectation maximization, so the parameter updates are analytically derived, rather than using gradient descent.
I have two loss functions here: the first is a function of the distances ||f(x) - f(y)||, and the second is the loss function of the Gaussian mixture model (ie how 'clustered' everything looks in the NN representation space). What I want to do is take a step in the NN optimization using both of the above loss functions (since it depends on both), and then do an expectation-maximization step for the GMM. The code looks like this (I have removed a lot since there is a ton of code):
data, labels = load_dataset()
net = NeuralNetwork()
net_optim = torch.optim.Adam(net.parameters(), lr=0.05, weight_decay=1)

# initialize weights, means, and covariances for the Gaussian clusters
concentrations, means, covariances, precisions = initialization(net.forward_one(data))

for i in range(1000):
    net_optim.zero_grad()

    pairs, pair_labels = pairGenerator(data, labels)  # samples some pairs of datapoints
    outputs = net(pairs[:, 0, :], pairs[:, 1, :])  # computes pairwise distances
    net_loss = NeuralNetworkLoss(outputs, pair_labels)  # loss function based on pairwise dist.

    embedding = net.forward_one(data)  # embeds all data in the NN space
    log_prob, log_likelihoods = expectation_step(embedding, means, precisions, concentrations)
    concentrations, means, covariances, precisions = maximization_step(embedding, log_likelihoods)
    gmm_loss = GMMLoss(log_likelihoods, log_prob, precisions, concentrations)

    net_loss.backward(retain_graph=True)
    gmm_loss.backward(retain_graph=True)
    net_optim.step()
Essentially, this is what is happening:
1. Sample some pairs of points from the dataset
2. Push pairs of points through the NN and compute network loss based on those outputs
3. Embed all datapoints using the NN and perform a clustering EM step in that embedding space
4. Compute variational loss (ELBO) based on clustering parameters
5. Update neural network parameters using both the variational loss and the network loss
However, to perform (5), I am required to add the flag retain_graph=True, otherwise I get the error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
It seems like having two loss functions means that I need to retain the computational graph?
I am not sure how to work around this; with retain_graph=True, by around iteration 400 each iteration takes ~30 minutes to complete. Does anyone know how I might fix this? I apologize in advance – I am still very new to automatic differentiation.
I would recommend doing
total_loss = net_loss + gmm_loss
total_loss.backward()
Note that the gradient of net_loss w.r.t. the GMM parameters is 0, so summing the losses won't have any unwanted side effect.
Here is a good thread on pytorch regarding the retain_graph. https://discuss.pytorch.org/t/what-exactly-does-retain-variables-true-in-loss-backward-do/3508/24
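In the training loop from the question, the change would look roughly like this (a sketch; variable names are taken from the question's code):

    net_optim.zero_grad()
    total_loss = net_loss + gmm_loss   # both losses share the network's graph
    total_loss.backward()              # single backward pass, no retain_graph needed
    net_optim.step()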

How to define the policy in the case of continuous action space that sum up to 1?

I am currently working on a continuous state-action space problem using policy gradient methods.
The environment's action space is defined as ratios that have to sum to 1 at each timestep. Hence, using a Gaussian policy doesn't seem suitable in this case.
What I did instead is tweak the softmax policy (to make sure the policy network's output sums to 1), but I had a hard time determining the loss function to use, and eventually its gradient, in order to update the network parameters.
So far, I have tried a discounted return-weighted Mean Squared Error, but the results aren't satisfactory.
Are there any other policies that can be used in this particular case? Or are there any ideas about which loss function to use?
Here is the implementation of my policy network (inside my agent class) in TensorFlow.
def policy_network(self):
    self.input = tf.placeholder(tf.float32,
                                shape=[None, self.input_dims],
                                name='input')
    self.label = tf.placeholder(tf.float32, shape=[None, self.n_actions], name='label')

    # discounted return
    self.G = tf.placeholder(tf.float32, shape=[
        None,
    ], name='G')

    with tf.variable_scope('layers'):
        l1 = tf.layers.dense(
            inputs=self.input,
            units=self.l1_size,
            activation=tf.nn.relu,
            kernel_initializer=tf.contrib.layers.xavier_initializer())

        l2 = tf.layers.dense(
            inputs=l1,
            units=self.l2_size,
            activation=tf.nn.relu,
            kernel_initializer=tf.contrib.layers.xavier_initializer())

        l3 = tf.layers.dense(
            inputs=l2,
            units=self.n_actions,
            activation=None,
            kernel_initializer=tf.contrib.layers.xavier_initializer())

        self.actions = tf.nn.softmax(l3, name='actions')

    with tf.variable_scope('loss'):
        base_loss = tf.reduce_sum(tf.square(self.actions - self.label))
        loss = base_loss * self.G

    with tf.variable_scope('train'):
        self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)
Off the top of my head, you may want to try a 2D Gaussian or multivariate Gaussian. https://en.wikipedia.org/wiki/Gaussian_function
For example, you could predict the 4 parameters (x_0, x_1, sigma_0, sigma_1) of a 2D Gaussian, from which you could generate a pair of numbers on the 2D-Gaussian plane, say (2, 1.5); then you could use softmax to produce the desired action: softmax([2, 1.5]) = [0.62245933, 0.37754067].
Then you could calculate the probability of that pair of numbers on the 2D-Gaussian plane, which you could then use to compute the negative log probability, advantage, etc., to build the loss function and update the gradients.
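A rough PyTorch sketch of that idea (the means and standard deviations here are placeholders for whatever your policy network predicts):

    import torch

    mu = torch.tensor([2.0, 1.5])            # predicted means (x_0, x_1)
    sigma = torch.tensor([0.3, 0.3])         # predicted std devs (sigma_0, sigma_1)
    dist = torch.distributions.Normal(mu, sigma)

    z = dist.sample()                        # a pair of numbers from the 2D Gaussian
    action = torch.softmax(z, dim=0)         # e.g. softmax([2, 1.5]) ≈ [0.62, 0.38]
    log_prob = dist.log_prob(z).sum()        # used in a -log_prob * advantage loss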
Have you thought of using the Dirichlet distribution? Your network can output concentration parameters alpha > 0, and then you can use them to generate a sample which would sum to one. Both PyTorch and TF support this distribution, and you can both sample from it and get the logProb of a sample. In this case, in addition to getting your sample, since it is a probability distribution, you can get a sense of its variance too, which can be a measure of the agent's confidence. For an action of 3 dimensions, having alpha = {1, 1, 1} basically means your agent doesn't have any preference, while having alpha = {100, 1, 1} would imply that it is very certain that most of the weight should go to the first dimension.
Edit based on the comment:
Vanilla REINFORCE would have a hard time optimizing the policy when you use the Dirichlet distribution. The problem is that in vanilla policy gradient you can control how fast you change your policy in the network-parameter space through gradient clipping, an adaptive learning rate, etc. However, what matters most is controlling the rate of change in probability space: some network parameters may change the probabilities a lot more than others. So even though you control the learning rate to limit the delta of your network parameters, you may still change the variance of your Dirichlet distribution a lot, which makes sense if you think about it: in order to maximize the log-prob of your actions, your network might focus more on reducing the variance than on shifting the mode of the distribution, which later hurts both exploration and learning a meaningful policy. One way to alleviate this problem is to limit the rate of change of your policy distribution by limiting the KL-divergence between your new policy distribution and the old one. TRPO and PPO are two ways to address this issue and solve the constrained optimization problem.
It is also probably good to make sure that in practice alpha > 1. You can achieve this easily by applying a softplus, ln(1 + exp(x)) + 1, to your neural network outputs before feeding them into your Dirichlet distribution. Also monitor the gradients reaching your layers and make sure they don't vanish.
You may also want to add the entropy of the distribution to your objective function to ensure enough exploration and prevent distribution with very low variance (very high alphas).
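A minimal PyTorch sketch of the Dirichlet policy head described above, combining the softplus(x) + 1 transform and an entropy bonus (the function name and the entropy coefficient are illustrative assumptions):

    import torch
    import torch.nn.functional as F

    def dirichlet_policy_loss(logits, sampled_actions, returns, entropy_coef=0.01):
        # logits: raw network outputs, shape (batch, n_actions)
        alpha = F.softplus(logits) + 1.0                 # concentrations, all > 1
        dist = torch.distributions.Dirichlet(alpha)
        log_prob = dist.log_prob(sampled_actions)        # actions lie on the simplex
        # REINFORCE-style objective with an entropy bonus for exploration
        return -(log_prob * returns).mean() - entropy_coef * dist.entropy().mean()

    # sampling an action at interaction time:
    # dist = torch.distributions.Dirichlet(F.softplus(logits) + 1.0)
    # action = dist.sample()                             # each row sums to 1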

Which optimization metric for differenced data when doing a multi-step forecast?

I've written an LSTM in Keras for univariate time series forecasting. I'm using an input window of size 48 and an output window of size 12, i.e. I'm predicting 12 steps at once. This is working generally well with an optimization metric such as RMSE.
For non-stationary time series I'm differencing the data before feeding the data to the LSTM. Then after predicting, I take the inverse difference of the predictions.
When differencing, RMSE is not suitable as an optimization metric, because the earlier prediction steps are a lot more important than the later ones: when we take the inverse difference after creating a 12-step forecast, the earlier (differenced) prediction steps affect the inverse difference of all later steps.
So what I think I need is an optimization metric that gives the early prediction steps more weight, preferably exponentially.
Does such a metric exist already or should I write my own? Am I overlooking something?
Just wrote my own optimization metric, it seems to work well, certainly better than RMSE.
Still curious what's the best practice here. I'm relatively new to forecasting.
from tensorflow.keras import backend as K

def weighted_rmse(y_true, y_pred):
    # linearly decreasing weights: the first forecast step gets the largest weight
    weights = K.arange(start=y_pred.get_shape()[1], stop=0, step=-1, dtype='float32')
    y_true_w = y_true * weights
    y_pred_w = y_pred * weights
    return K.sqrt(K.mean(K.square(y_true_w - y_pred_w), axis=-1))
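If you want the exponentially decaying weights mentioned above rather than linear ones, a variant could look like this (the decay factor is an assumption to tune):

    def exp_weighted_rmse(y_true, y_pred, decay=0.7):
        # weights decay**0, decay**1, ... so the first forecast step dominates
        steps = K.arange(0, y_pred.get_shape()[1], dtype='float32')
        weights = K.pow(decay, steps)
        return K.sqrt(K.mean(K.square(weights * (y_true - y_pred)), axis=-1))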

Loss function for simple Reinforcement Learning algorithm

This question comes from watching the following video on TensorFlow and Reinforcement Learning from Google I/O 18: https://www.youtube.com/watch?v=t1A3NTttvBA
Here they train a very simple RL algorithm to play the game of Pong.
In the slides they use, the loss is defined like this ( approx # 11m 25s ):
loss = -R(sampled_actions * log(action_probabilities))
Further they show the following code ( approx # 20m 26s):
# loss
cross_entropies = tf.losses.softmax_cross_entropy(
    onehot_labels=tf.one_hot(actions, 3), logits=Ylogits)
loss = tf.reduce_sum(rewards * cross_entropies)

# training operation
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.99)
train_op = optimizer.minimize(loss)
Now my question is this: they use +1 for winning and -1 for losing as rewards. In the code that is provided, any cross-entropy loss that's multiplied by a negative reward will be very low, right? And if the training operation is using the optimizer to minimize the loss, then isn't the algorithm being trained to lose?
Or is there something fundamental I'm missing (probably because of my very limited mathematical skills)?
Great question Corey. I am also wondering exactly what this popular loss function in RL actually means. I've seen many implementations of it, but many contradict each other. From my understanding, it means this:
Loss = - log(pi) * A
Where A is the advantage compared to a baseline case. In Google's case, they used a baseline of 0, so A = R. This is multiplied by that specific action at that specific time, so in your above example, actions were one hot encoded as [1, 0, 0]. We will ignore the 0s and only take the 1. Hence we have the above equation.
If you intuitively calculate this loss for a negative reward:
Loss = - (-1) * log(P)
But for any P less than 1, the log of that value will be negative. Therefore, you have a negative loss, which can be interpreted as "very good", but it really doesn't make physical sense.
The correct way:
However, in my opinion (and please correct me if I'm wrong), you do not look at the loss value directly; you take the gradient of the loss. That is, you take the derivative of -log(pi)*A.
Therefore, you would have:
-(d(pi) / pi) * A
Now, if you have a large negative reward, it will translate to a very large gradient.
I hope this makes sense.
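A quick numeric sanity check of the sign question (plain numpy, not the TensorFlow graph from the video):

    import numpy as np

    p = 0.3                              # probability the network gave the sampled action
    ce = -np.log(p)                      # cross-entropy of that action (positive)

    for reward in (+1.0, -1.0):
        loss = reward * ce               # what the optimizer minimizes
        dloss_dp = -reward / p           # derivative of reward * (-log p) w.r.t. p
        direction = "raises" if dloss_dp < 0 else "lowers"
        print(f"reward={reward:+.0f}: loss={loss:+.3f}, gradient descent {direction} p")

So even though the loss value is negative for a losing action, the gradient still pushes that action's probability down, which is the desired behaviour.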

sklearn: Hyperparameter tuning by gradient descent?

Is there a way to perform hyperparameter tuning in scikit-learn by gradient descent? While a formula for the gradient of hyperparameters might be difficult to compute, numerical computation of the hyperparameter gradient by evaluating two close points in hyperparameter space should be pretty easy. Is there an existing implementation of this approach? Why is or isn't this approach a good idea?
The calculation of the gradient is the least of the problems, at least in times of advanced automatic-differentiation software. (Implementing this in a general way for all sklearn classifiers is of course not easy.)
And while there are works by people who have used this kind of idea, they only did it for specific, well-formulated problems (e.g. SVM tuning). Furthermore, those works probably relied on a lot of assumptions, because:
Why is this not a good idea?
Hyper-param optimization is in general: non-smooth
GD really likes smooth functions as a gradient of zero is not helpful
(Each hyper-parameter which is defined by some discrete-set (e.g. choice of l1 vs. l2 penalization) introduces non-smooth surfaces)
Hyper-param optimization is in general: non-convex
The whole convergence theory of GD assumes that the underlying problem is convex
Good case: you obtain some local minimum (which can be arbitrarily bad)
Worst-case: GD is not even converging to some local-minimum
I might add that your general problem is the worst kind of optimization problem one can consider, because it is:
non-smooth, non-convex
and even stochastic / noisy, as most underlying algorithms are heuristic approximations with some variance in the final output (and often even PRNG-based random behaviour).
The last part is the reason why the methods offered in sklearn are that simple:
random search:
if we can't infer anything because the problem is too hard, just try many instances and pick the best
grid search:
let's assume there is some kind of smoothness
instead of random sampling, we sample with regard to our smoothness assumption
(and other assumptions like: the param is probably big -> np.logspace to analyze more big numbers)
While there are a lot of Bayesian approaches, including available Python software like hyperopt and spearmint, many people think that random search is the best method in general (which might be surprising, but it emphasizes the problems mentioned above).
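For completeness, the naive finite-difference idea from the question can be sketched in a few lines; the dataset, step sizes and iteration count here are illustrative, and the noisy, non-smooth CV objective is exactly why this tends to behave poorly:

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)

    def cv_loss(log_alpha):
        # negative mean CV score as the objective to minimize
        return -cross_val_score(Ridge(alpha=np.exp(log_alpha)), X, y, cv=5).mean()

    log_alpha, lr, eps = 0.0, 0.5, 1e-2
    for _ in range(30):
        grad = (cv_loss(log_alpha + eps) - cv_loss(log_alpha - eps)) / (2 * eps)
        log_alpha -= lr * grad

    print("tuned alpha:", np.exp(log_alpha))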
Here are some papers describing gradient-based hyperparameter optimization:
Gradient-based hyperparameter optimization through reversible learning (2015):
We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.
Forward and reverse gradient-based hyperparameter optimization (2017):
We study two procedures (reverse-mode and forward-mode) for computing the gradient of the validation error with respect to the hyperparameters of any iterative learning algorithm such as stochastic gradient descent. These procedures mirror two methods of computing gradients for recurrent neural networks and have different trade-offs in terms of running time and space requirements. Our formulation of the reverse-mode procedure is linked to previous work by Maclaurin et al. [2015] but does not require reversible dynamics. The forward-mode procedure is suitable for real-time hyperparameter updates, which may significantly speed up hyperparameter optimization on large datasets.
Gradient descent: the ultimate optimizer (2019):
Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as the learning rate. There exist many techniques for automated hyperparameter optimization, but they typically introduce even more hyperparameters to control the hyperparameter optimization process. We propose to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As these towers of gradient-based optimizers grow, they become significantly less sensitive to the choice of top-level hyperparameters, hence decreasing the burden on the user to search for optimal values.
For generalized linear models (e.g. logistic regression, ridge regression, Poisson regression), you can efficiently tune many regularization hyperparameters using exact derivatives and approximate leave-one-out cross-validation. But don't stop at just the gradient: compute the full Hessian and use a second-order optimizer -- it's both more efficient and more robust.
sklearn doesn't currently have this functionality, but there are other tools available that can do it.
For example, here's how you can use the Python package bbai to fit the hyperparameter for ridge-regularized logistic regression so as to maximize the log likelihood of the approximate leave-one-out cross-validation of the training data set, for the Wisconsin Breast Cancer Data Set.
Load the data set
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']
Fit the model
import bbai.glm
model = bbai.glm.LogisticRegression()
# Note: it automatically fits the C parameter to minimize the error on
# the approximate leave-one-out cross-validation.
model.fit(X, y)
Because it uses both the gradient and the Hessian with efficient exact formulas (no automatic differentiation), it can dial into an exact hyperparameter quickly with only a few evaluations.
YMMV, but when I compare it to sklearn's LogisticRegressionCV with default parameters, it runs in a fraction of the time.
import time

t1 = time.time()
model = bbai.glm.LogisticRegression()
model.fit(X, y)
t2 = time.time()
print('***** approximate leave-one-out optimization')
print('C = ', model.C_)
print('time = ', (t2 - t1))

from sklearn.linear_model import LogisticRegressionCV

print('***** sklearn.LogisticRegressionCV')
t1 = time.time()
model = LogisticRegressionCV(scoring='neg_log_loss', random_state=0)
model.fit(X, y)
t2 = time.time()
print('C = ', model.C_[0])
print('time = ', (t2 - t1))
Prints
***** approximate leave-one-out optimization
C = 0.6655139682151275
time = 0.03996014595031738
***** sklearn.LogisticRegressionCV
C = 0.3593813663804626
time = 0.2602980136871338
How it works
Approximate leave-one-out cross-validation (ALOOCV) is a close approximation to leave-one-out cross-validation that's much more efficient to evaluate for generalized linear models.
It first fits the regularized model, then uses a single step of Newton's algorithm to approximate what the model weights would be when we leave a single data point out. If the regularized cost function for the generalized linear model is represented as

    J(w) = -sum_i log p(y_i | x_i, w) + (alpha / 2) * ||w||^2

then the ALOOCV can be computed (written here for the logistic case implemented below, with y_i in {-1, +1}) as

    ALOOCV = sum_i log sigmoid(y_i * u_i_loo),
    u_i_loo = u_i - y_i * (1 - p_i) * h_i / (1 - p_i * (1 - p_i) * h_i)

where

    u_i = x_i^T w,    p_i = sigmoid(y_i * u_i),    h_i = x_i^T H^-1 x_i

(Note: H represents the Hessian of the cost function at the optimal weights.)
For more background on ALOOCV, you can check out this guide.
It's also possible to compute exact derivatives for ALOOCV, which makes it efficient to optimize. I won't put the derivative formulas here as they are quite involved, but see the paper Optimizing Approximate Leave-one-out Cross-validation.
If we plot ALOOCV and compare it to leave-one-out cross-validation for the example data set, you can see that it tracks it very closely and the ALOOCV optimum is nearly the same as the LOOCV optimum.
Compute Leave-one-out Cross-validation
import numpy as np

def compute_loocv(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    n = len(y)
    loo_likelihoods = []
    for i in range(n):
        train_indexes = [i_p for i_p in range(n) if i_p != i]
        test_indexes = [i]
        X_train, X_test = X[train_indexes], X[test_indexes]
        y_train, y_test = y[train_indexes], y[test_indexes]
        model.fit(X_train, y_train)
        pred = model.predict_proba(X_test)[0]
        loo_likelihoods.append(pred[y_test[0]])
    return sum(np.log(loo_likelihoods))
Compute Approximate Leave-one-out Cross-validation
import scipy

def fit_logistic_regression(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    model.fit(X, y)
    return np.array(list(model.coef_[0]) + list(model.intercept_))

def compute_hessian(p_vector, X, alpha):
    n, k = X.shape
    a_vector = np.sqrt((1 - p_vector)*p_vector)
    R = scipy.linalg.qr(a_vector.reshape((n, 1))*X, mode='r')[0]
    H = np.dot(R.T, R)
    for i in range(k-1):
        H[i, i] += alpha
    return H

def compute_alo(X, y, C):
    alpha = 1.0 / C
    w = fit_logistic_regression(X, y, C)
    X = np.hstack((X, np.ones((X.shape[0], 1))))
    n = X.shape[0]
    y = 2*y - 1
    u_vector = np.dot(X, w)
    p_vector = scipy.special.expit(u_vector*y)
    H = compute_hessian(p_vector, X, alpha)
    L = np.linalg.cholesky(H)
    T = scipy.linalg.solve_triangular(L, X.T, lower=True)
    h_vector = np.array([np.dot(ti, ti) for pi, ti in zip(p_vector, T.T)])
    loo_u_vector = u_vector - \
        y * (1 - p_vector)*h_vector / (1 - p_vector*(1 - p_vector)*h_vector)
    loo_likelihoods = scipy.special.expit(y*loo_u_vector)
    return sum(np.log(loo_likelihoods))
Plot out the results (along with the ALOOCV optimum)
import matplotlib.pyplot as plt
Cs = np.arange(0.1, 2.0, 0.1)
loocvs = [compute_loocv(X, y, C) for C in Cs]
alos = [compute_alo(X, y, C) for C in Cs]
fig, ax = plt.subplots()
ax.plot(Cs, loocvs, label='LOOCV', marker='o')
ax.plot(Cs, alos, label='ALO', marker='x')
ax.axvline(model.C_, color='tab:green', label='C_opt')
ax.set_xlabel('C')
ax.set_ylabel('Log-Likelihood')
ax.set_title("Breast Cancer Dataset")
ax.legend()
Displays a plot of the LOOCV and ALO log-likelihood curves against C for the Breast Cancer dataset, with the ALOOCV optimum marked by a vertical line; the two curves track each other closely.
