How to set the learning rate in scikit-learn's ridge regression?

I'm using scikit-learn's ridge regression:
from sklearn import linear_model

regr = linear_model.Ridge(alpha=0.5)
# Train the model using the training sets
regr.fit(X_train, Y_train)
#bias:
print('bias: \n', regr.intercept_)
# The coefficients
print('Coefficients: \n', regr.coef_)
I found (here) the different options for the linear_model.Ridge function, but there is one specific option I didn't find in the list: how can I set the learning rate (or learning step) of the update function?
By learning rate, I mean:
w_{t+1} = w_t - (learning_rate) * (partial derivative of the objective function)

I refer to learning rate as step size.
Your code is not using the sag (stochastic average gradient) solver. The default parameter for solver is set to auto, which will choose a solver depending on the data type. A description of the other solvers and which to use is here.
To use the sag solver:
regr = linear_model.Ridge(alpha=0.5, solver='sag')
However, for this solver you do not set the step size, because the solver computes the step size based on your data and alpha. Here is the code for the sag solver used for ridge regression, where they explain how the step size is computed:

The step size is set to 1 / (alpha_scaled + L + fit_intercept) where L is the max sum of squares over all samples.

Line 401 shows how sag_solver is used for ridge regression.
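For illustration, here is a small sketch applying that formula to some toy data (dividing alpha by n_samples mirrors how sklearn's sag_solver scales alpha internally; treat that detail as an assumption that may vary between versions):

import numpy as np
from sklearn import linear_model

# Toy data just to illustrate the formula
X_train = np.random.rand(100, 5)
Y_train = np.random.rand(100)
alpha = 0.5
fit_intercept = True

# Step size used by the sag solver: 1 / (alpha_scaled + L + fit_intercept),
# where L is the max sum of squares over all samples
alpha_scaled = alpha / X_train.shape[0]          # assumed scaling by n_samples
L = np.max(np.sum(X_train ** 2, axis=1))
step_size = 1.0 / (alpha_scaled + L + int(fit_intercept))
print('step size the sag solver would use:', step_size)

regr = linear_model.Ridge(alpha=alpha, solver='sag')
regr.fit(X_train, Y_train)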


Why can't I get the result I got with sklearn's LogisticRegression using the coefficients_sgd method?

from math import exp
import numpy as np
from sklearn.linear_model import LogisticRegression
I used code below from How To Implement Logistic Regression From Scratch in Python
def predict(row, coefficients):
    yhat = coefficients[0]
    for i in range(len(row)-1):
        yhat += coefficients[i + 1] * row[i]
    return 1.0 / (1.0 + exp(-yhat))

def coefficients_sgd(train, l_rate, n_epoch):
    coef = [0.0 for i in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat
            sum_error += error**2
            coef[0] = coef[0] + l_rate * error * yhat * (1.0 - yhat)
            for i in range(len(row)-1):
                coef[i + 1] = coef[i + 1] + l_rate * error * yhat * (1.0 - yhat) * row[i]
    return coef
dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]
l_rate = 0.3
n_epoch = 100
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)
[-0.39233141593823756, 1.4791536027917747, -2.316697087065274]
x = np.array(dataset)[:,:2]
y = np.array(dataset)[:,2]
model = LogisticRegression(penalty="none")
model.fit(x,y)
print(model.intercept_.tolist() + model.coef_.ravel().tolist())
[-3.233238244349982, 6.374828107647225, -9.631487530388092]
What should I change to get the same or closer coefficients? How can I establish the initial coefficients, learning rate, and n_epoch?
Well, there are many nuances here 🙂
First, recall that estimating the coefficients of logistic regression from the (negative) log-likelihood is possible using various optimization methods, including the SGD you implemented, but there is no exact, closed-form solution. So even if you implement an exact copy of scikit-learn's LogisticRegression, you will need to set the same hyperparameters (number of epochs, learning rate, etc.) and random state to obtain the same coefficients.
Second, LogisticRegression offers five different optimization methods (solver parameter). You run LogisticRegression(penalty="none") with its default parameters and the default for solver is 'lbfgs', not SGD; so depending on your data and hyperparameters, you may get significantly different results.
What should I change to get the same or closer coefficients?
I would suggest comparing your implementation with SGDClassifier(loss='log') first, since LogisticRegression does not offer an SGD solver. Keep in mind, though, that scikit-learn's implementation is more sophisticated, in particular having more hyperparameters for early stopping, like tol.
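For example, a minimal sketch of that comparison, reusing x and y from the question (parameter names assume a scikit-learn version where loss='log' is still accepted; newer releases spell it loss='log_loss', and older ones spell penalty=None as penalty='none'):

from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(
    loss='log',               # logistic loss, i.e. logistic regression fitted by SGD
    penalty=None,             # no regularization, to mirror the hand-rolled version
    learning_rate='constant',
    eta0=0.3,                 # same learning rate as l_rate above
    max_iter=100,             # same number of passes as n_epoch above
    tol=None,                 # disable early stopping so all epochs run
    shuffle=False,
    random_state=0,
)
sgd.fit(x, y)
print(sgd.intercept_.tolist() + sgd.coef_.ravel().tolist())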
How can I establish initial coefficients, learning rate, n_epoch?
Typically, coefficients for SGD are initialized randomly (e.g., uniform(-1/(2n), 1/(2n))), using some data statistics (e.g., dot(y, w)/dot(w, w) for every coefficient w), or with a pre-trained model's parameters. In contrast, there is no golden rule for the learning rate or the number of epochs. Usually, we set a big number of epochs together with some other stopping criterion (e.g., whether the norm of the difference between the current and previous coefficients is smaller than some small tol), pick a moderate learning rate, and on every iteration we reduce the learning rate following some rule (see the learning_rate parameter of SGDClassifier or the User Guide) and check the stopping criterion.
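As an illustration only, here is a sketch of that recipe built on top of the question's predict function (the decay rule, the tol value, and the function name are arbitrary choices, not a recommendation):

def coefficients_sgd_decay(train, l_rate, n_epoch, decay=0.01, tol=1e-6):
    # Same update rule as coefficients_sgd, plus a decaying learning rate
    # and a stopping criterion on the change in coefficients between epochs
    coef = [0.0 for _ in range(len(train[0]))]
    for epoch in range(n_epoch):
        prev = list(coef)
        rate = l_rate / (1.0 + decay * epoch)   # reduce the learning rate every epoch
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat
            coef[0] += rate * error * yhat * (1.0 - yhat)
            for i in range(len(row) - 1):
                coef[i + 1] += rate * error * yhat * (1.0 - yhat) * row[i]
        # Stop early when the coefficients barely change
        if sum((c - p) ** 2 for c, p in zip(coef, prev)) ** 0.5 < tol:
            break
    return coef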

Tensorflow model architecture for sparse dataset

I have a regression dataset where approximately 95% of the target variables are zeros (the other 5% are between 1 and 30), and I am trying to design a Tensorflow model to model that data. I am thinking of implementing a model that combines a classifier and a regressor (check the output of the classifier submodel; if it's less than a threshold, pass the input to the regression submodel). I have the intuition that this should be built using the functional API, but I couldn't find helpful resources on that. Any ideas?
Here is the code that generates the data that I am using to replicate the problem:
import numpy as np

n = 10000
zero_percentage = 0.95
zeros = np.zeros(round(n * zero_percentage))
non_zeros = np.random.randint(1,30,size=round(n * (1- zero_percentage)))
y = np.concatenate((zeros,non_zeros))
np.random.shuffle(y)
a = 50
b = 10
x = np.array([np.random.randint(31,60) if element == 0 else (element - b) / a for element in y])
y_classification = np.array([0 if element == 0 else 1 for element in y])
Note: I experimented with probabilistic models (Poisson regression and regression with a discretized logistic mixture distribution), and they provided good results but the training was unstable (loss diverges very often).
Instead of trying to find some heuristic to balance the training between the zero values and the others, you might want to try some input preprocessing method that can handle imbalanced training sets better (usually by mapping to another space before running the model, then doing the inverse with the results); for example, an embedding layer. Alternatively, normalize the values to a small range (like [-1, 1]) and apply an activation function before evaluating the model on the data.
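For reference, here is a rough sketch of the classifier-plus-regressor idea described in the question, built with the functional API and reusing x, y, and y_classification from the snippet above (layer sizes, losses, loss weights, and the 0.5 threshold are illustrative assumptions; the regression head is naively trained on all targets, zeros included):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

x_in = x.reshape(-1, 1)                       # Keras expects 2-D inputs

inputs = keras.Input(shape=(1,))
h = layers.Dense(64, activation='relu')(inputs)
h = layers.Dense(64, activation='relu')(h)

# Head 1: probability that the target is non-zero
is_nonzero = layers.Dense(1, activation='sigmoid', name='is_nonzero')(h)
# Head 2: regression value, only meaningful when the target is non-zero
value = layers.Dense(1, name='value')(h)

model = keras.Model(inputs, [is_nonzero, value])
model.compile(optimizer='adam',
              loss={'is_nonzero': 'binary_crossentropy', 'value': 'mse'},
              loss_weights={'is_nonzero': 1.0, 'value': 1.0})
model.fit(x_in, {'is_nonzero': y_classification, 'value': y},
          epochs=10, verbose=0)

# At prediction time, gate the regression output with the classifier output
p, v = model.predict(x_in, verbose=0)
y_pred = np.where(p > 0.5, v, 0.0).ravel()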

Linear Regression model (using Gradient Descent) does not converge on Boston Housing Dataset

I've been trying to find out why my linear regression model performs poorly when compared to sklearn's linear regression model.
My linear regression model (update rules based on gradient descent)
w0 = 0
w1 = 0
alpha = 0.001
N = len(xTrain)
for i in range(1000):
    yPred = w0 + w1*xTrain
    w0 = w0 - (alpha/N)*sum(yPred - yTrain)
    w1 = w1 - (alpha/N)*sum((yPred - yTrain) * xTrain)
Code for plotting the values of x from the training set and the predicted values of y
#Scatter plot between x and y
plot.scatter(xTrain,yTrain, c='black')
plot.plot(xTrain, w0+w1*xTrain, color='r')
plot.xlabel('Number of rooms')
plot.ylabel('Median value in 1000s')
plot.show()
I get the output as shown here https://i.stack.imgur.com/jvOfM.png
On running the same code using sklearn's inbuilt linear regression, I get this
https://i.stack.imgur.com/jvOfM.png
Can anyone help me see where my model is going wrong? I have tried changing the number of iterations and the learning rate, but there were no significant changes.
Here's the ipython notebook on colab if it helps: https://colab.research.google.com/drive/1c3lWKkv2lJfZAc19LiDW7oTuYuacQ3nd
Any help is highly appreciated
You can set a bigger learning rate, such as 0.01, and iterate more times, such as 500,000 iterations. Then you will get a similar result.
Or you can initialize w1 with a bigger number, such as 5.
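In other words, something along these lines (a sketch reusing xTrain and yTrain from the question, with the values suggested above):

w0, w1 = 0.0, 0.0
alpha = 0.01                 # bigger learning rate
N = len(xTrain)
for i in range(500000):      # many more iterations
    yPred = w0 + w1 * xTrain
    w0 = w0 - (alpha / N) * sum(yPred - yTrain)
    w1 = w1 - (alpha / N) * sum((yPred - yTrain) * xTrain)
print(w0, w1)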

Gradually update weights of custom loss in Keras during training

I defined a custom loss function in Keras (TensorFlow backend) that is composed of the reconstruction MSE and the Kullback-Leibler divergence between the learned probability distribution and a standard normal distribution. (It is for a variational autoencoder.)
I want to be able to slowly increase how much the cost is affected by the KL divergence term during training, with a weight called "reg", starting at reg=0.0 and increasing until it gets to 1.0. I would like the rate of increase to be tuned as a hyperparameter. (As of now, I just have the "reg" parameter set constant at 0.5.)
Is there functionality in Keras to do this?
def vae_loss(y_true, y_pred):
    reg = 0.5
    # Average cosine distance for all words in a sequence
    reconstruction_loss = tf.reduce_mean(mean_squared_error(y_true, y_pred), 1)
    # Second part of the loss ensures the z probability distribution doesn't stray too far from normal
    KL_divergence_loss = tf.reduce_mean(tf.log(z_sigma) + tf.div((1 + tf.square(z_mu)), 2*tf.square(z_sigma)) - 0.5, 1)
    loss = reconstruction_loss + tf.multiply(reg, KL_divergence_loss)
    return loss
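A common pattern for this (a sketch, not an official Keras loss-scheduling feature) is to make reg a backend variable and raise it from a callback after every epoch. This reuses z_mu, z_sigma, mean_squared_error, and tf from the snippet above; the anneal_rate name and value are illustrative assumptions, and with tf.keras you would import keras and its backend from tensorflow instead:

import keras
import keras.backend as K

reg = K.variable(0.0)   # KL weight, starts at 0.0

def vae_loss(y_true, y_pred):
    reconstruction_loss = tf.reduce_mean(mean_squared_error(y_true, y_pred), 1)
    KL_divergence_loss = tf.reduce_mean(
        tf.log(z_sigma) + tf.div((1 + tf.square(z_mu)), 2 * tf.square(z_sigma)) - 0.5, 1)
    # Weight the KL term by the variable instead of a constant
    return reconstruction_loss + reg * KL_divergence_loss

class KLAnnealing(keras.callbacks.Callback):
    # Increase the KL weight by anneal_rate after every epoch, capped at 1.0
    def __init__(self, anneal_rate=0.1):
        super().__init__()
        self.anneal_rate = anneal_rate   # tune this as a hyperparameter
    def on_epoch_end(self, epoch, logs=None):
        K.set_value(reg, min(1.0, float(K.get_value(reg)) + self.anneal_rate))

# model.compile(optimizer='adam', loss=vae_loss)
# model.fit(x_train, x_train, epochs=20, callbacks=[KLAnnealing(anneal_rate=0.1)])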

sklearn: Hyperparameter tuning by gradient descent?

Is there a way to perform hyperparameter tuning in scikit-learn by gradient descent? While a formula for the gradient of hyperparameters might be difficult to compute, numerical computation of the hyperparameter gradient by evaluating two close points in hyperparameter space should be pretty easy. Is there an existing implementation of this approach? Why is or isn't this approach a good idea?
The calculation of the gradient is the least of the problems, at least in this era of advanced automatic differentiation software. (Implementing it in a general way for all sklearn classifiers is, of course, not easy.)
And while there is work by people who have used this kind of idea, they only did it for some specific, well-formulated problem (e.g. SVM tuning). Furthermore, there were probably a lot of assumptions, because:
Why is this not a good idea?
Hyper-param optimization is in general: non-smooth
GD really likes smooth functions as a gradient of zero is not helpful
(Each hyper-parameter which is defined by some discrete-set (e.g. choice of l1 vs. l2 penalization) introduces non-smooth surfaces)
Hyper-param optimization is in general: non-convex
The whole convergence-theory of GD assumes that the underlying problem is convex
Good-case: you obtain some local-minimum (can be arbitrarily bad)
Worst-case: GD is not even converging to some local-minimum
I might add that your general problem is the worst kind of optimization problem one can consider, because it's:
non-smooth, non-convex
and even stochastic / noisy, as most underlying algorithms are heuristic approximations with some variance in regard to the final output (and often even PRNG-based random behaviour).
The last part is the reason, why the offered methods in sklearn are that simple:
random-search:
if we can't infer something because the problem is too hard, just try many instances and pick the best
grid-search:
let's assume there is some kind of smoothness
instead of random-sampling, we sample in regards to our smoothness-assumption
(and other assumptions like: the param is probably big -> use np.logspace to analyze more large values)
While there are a lot of Bayesian approaches, including available python software like hyperopt and spearmint, many people think that random-search is the best method in general (which might be surprising, but emphasizes the mentioned problems).
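For reference, a minimal sketch of those two built-in approaches (the dataset and parameter ranges are arbitrary illustrations):

import numpy as np
from scipy.stats import loguniform        # available in scipy >= 1.4
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

# grid-search: sample according to a smoothness assumption (log-spaced alphas)
grid = GridSearchCV(Ridge(), {'alpha': np.logspace(-3, 3, 7)}, cv=5)
grid.fit(X, y)

# random-search: just try many instances and pick the best
rand = RandomizedSearchCV(Ridge(), {'alpha': loguniform(1e-3, 1e3)},
                          n_iter=20, cv=5, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)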
Here are some papers describing gradient-based hyperparameter optimization:
Gradient-based hyperparameter optimization through reversible learning (2015):
We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.
Forward and reverse gradient-based hyperparameter optimization (2017):
We study two procedures (reverse-mode and forward-mode) for computing the gradient of the validation error with respect to the hyperparameters of any iterative learning algorithm such as stochastic gradient descent. These procedures mirror two methods of computing gradients for recurrent neural networks and have different trade-offs in terms of running time and space requirements. Our formulation of the reverse-mode procedure is linked to previous work by Maclaurin et al. [2015] but does not require reversible dynamics. The forward-mode procedure is suitable for real-time hyperparameter updates, which may significantly speed up hyperparameter optimization on large datasets.
Gradient descent: the ultimate optimizer (2019):
Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as the learning rate. There exist many techniques for automated hyperparameter optimization, but they typically introduce even more hyperparameters to control the hyperparameter optimization process. We propose to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As these towers of gradient-based optimizers grow, they become significantly less sensitive to the choice of top-level hyperparameters, hence decreasing the burden on the user to search for optimal values.
For generalized linear models (i.e. logistic regression, ridge regression, Poisson regression), you can efficiently tune many regularization hyperparameters using exact derivatives and approximate leave-one-out cross-validation. But don't stop at just the gradient: compute the full hessian and use a second-order optimizer -- it's both more efficient and more robust.
sklearn doesn't currently have this functionality, but there are other tools available that can do it. For example, here's how you can use the python package bbai to fit the hyperparameter for ridge-regularized logistic regression so as to maximize the log likelihood of the approximate leave-one-out cross-validation on the training data for the Wisconsin Breast Cancer Data Set.
Load the data set
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']
Fit the model
import bbai.glm
model = bbai.glm.LogisticRegression()
# Note: it automatically fits the C parameter to minimize the error on
# the approximate leave-one-out cross-validation.
model.fit(X, y)
Because it uses both the gradient and hessian with efficient exact formulas (no automatic differentiation), it can dial in to the exact hyperparameter quickly with only a few evaluations.
YMMV, but when I compare it to sklearn's LogisticRegressionCV with default parameters, it runs in a fraction of the time.
import time

t1 = time.time()
model = bbai.glm.LogisticRegression()
model.fit(X, y)
t2 = time.time()
print('***** approximate leave-one-out optimization')
print('C = ', model.C_)
print('time = ', (t2 - t1))
from sklearn.linear_model import LogisticRegressionCV
print('***** sklearn.LogisticRegressionCV')
t1 = time.time()
model = LogisticRegressionCV(scoring='neg_log_loss', random_state=0)
model.fit(X, y)
t2 = time.time()
print('C = ', model.C_[0])
print('time = ', (t2 - t1))
Prints
***** approximate leave-one-out optimization
C = 0.6655139682151275
time = 0.03996014595031738
***** sklearn.LogisticRegressionCV
C = 0.3593813663804626
time = 0.2602980136871338
How it works
Approximate leave-one-out cross-validation (ALOOCV) is a close approximation to leave-one-out cross-validation that's much more efficient to evaluate for generalized linear models.
It first fits the regularized model. Then it uses a single step of Newton's algorithm to approximate what the model weights would be when we leave a single data point out. For the ridge-regularized logistic regression used in the code below (with labels y_i in {-1, 1} and the intercept left unregularized), the regularized cost function is

J(w) = -sum_i log(sigmoid(y_i * u_i)) + (alpha / 2) * ||w||^2, with u_i = x_i . w

Then the ALOOCV log-likelihood can be computed as

ALOOCV = sum_i log(sigmoid(y_i * u_i_loo))

where

u_i_loo = u_i - y_i * (1 - p_i) * h_i / (1 - p_i * (1 - p_i) * h_i), p_i = sigmoid(y_i * u_i), h_i = x_i' H^-1 x_i

(Note: H represents the hessian of the cost function at the optimal weights)
For more background on ALOOCV, you can check out this guide.
It's also possible to compute exact derivatives for ALOOCV, which makes it efficient to optimize. I won't put the derivative formulas here as they are quite involved, but see the paper Optimizing Approximate Leave-one-out Cross-validation.
If we plot out ALOOCV and compare it to leave-one-out cross-validation for the example data set, you can see that it tracks it very closely and the ALOOCV optimum is nearly the same as the LOOCV optimum.
Compute Leave-one-out Cross-validation
import numpy as np
def compute_loocv(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    n = len(y)
    loo_likelihoods = []
    for i in range(n):
        train_indexes = [i_p for i_p in range(n) if i_p != i]
        test_indexes = [i]
        X_train, X_test = X[train_indexes], X[test_indexes]
        y_train, y_test = y[train_indexes], y[test_indexes]
        model.fit(X_train, y_train)
        pred = model.predict_proba(X_test)[0]
        loo_likelihoods.append(pred[y_test[0]])
    return sum(np.log(loo_likelihoods))
Compute Approximate Leave-one-out Cross-validation
import scipy.linalg
import scipy.special

def fit_logistic_regression(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    model.fit(X, y)
    return np.array(list(model.coef_[0]) + list(model.intercept_))

def compute_hessian(p_vector, X, alpha):
    n, k = X.shape
    a_vector = np.sqrt((1 - p_vector)*p_vector)
    R = scipy.linalg.qr(a_vector.reshape((n, 1))*X, mode='r')[0]
    H = np.dot(R.T, R)
    for i in range(k-1):
        H[i, i] += alpha
    return H

def compute_alo(X, y, C):
    alpha = 1.0 / C
    w = fit_logistic_regression(X, y, C)
    X = np.hstack((X, np.ones((X.shape[0], 1))))
    n = X.shape[0]
    y = 2*y - 1
    u_vector = np.dot(X, w)
    p_vector = scipy.special.expit(u_vector*y)
    H = compute_hessian(p_vector, X, alpha)
    L = np.linalg.cholesky(H)
    T = scipy.linalg.solve_triangular(L, X.T, lower=True)
    h_vector = np.array([np.dot(ti, ti) for pi, ti in zip(p_vector, T.T)])
    loo_u_vector = u_vector - \
        y * (1 - p_vector)*h_vector / (1 - p_vector*(1 - p_vector)*h_vector)
    loo_likelihoods = scipy.special.expit(y*loo_u_vector)
    return sum(np.log(loo_likelihoods))
Plot out the results (along with the ALOOCV optimum)
import matplotlib.pyplot as plt
Cs = np.arange(0.1, 2.0, 0.1)
loocvs = [compute_loocv(X, y, C) for C in Cs]
alos = [compute_alo(X, y, C) for C in Cs]
fig, ax = plt.subplots()
ax.plot(Cs, loocvs, label='LOOCV', marker='o')
ax.plot(Cs, alos, label='ALO', marker='x')
ax.axvline(model.C_, color='tab:green', label='C_opt')
ax.set_xlabel('C')
ax.set_ylabel('Log-Likelihood')
ax.set_title("Breast Cancer Dataset")
ax.legend()
Displays a plot of the LOOCV and ALO log-likelihoods against C for the Breast Cancer Dataset, with the ALOOCV optimum (C_opt) marked; the two curves track each other closely.
