Problem with alpha and lambda regularization parameters in Python

Question :
Logistic Regression: Train logistic regression models with L1 regularization and L2 regularization, using alpha = 0.1
and lambda = 0.1. Report accuracy, precision, recall, and F1-score, and print the confusion matrix.
My code is :
from sklearn.linear_model import LogisticRegression

_lambda = 0.1
c = 1 / _lambda
classifier = LogisticRegression(penalty='l1', C=c)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
I don't know where alpha and lambda actually go in this code.
Did I do this right?

Your example
lambda = 0.1, so C = 1/0.1 = 10, and penalty='l1' selects the type of regularization.
alpha
alpha is the parameter that selects which penalty is added to control overfitting: alpha = 1 gives the L1 penalty (Lasso Regression), alpha = 0 gives the L2 penalty (Ridge Regression), and values in between give Elastic Net, which is somewhere between L1 and L2. The L1 and L2 penalties cannot both be tuned independently, as there is only one lambda coefficient. Quick aside - in scikit-learn this mixing parameter is called l1_ratio (for example in sklearn.linear_model.SGDRegressor, l1_ratio=1 is L1, l1_ratio=0.5 is elastic net, and l1_ratio=0 is Ridge), while the SGDRegressor parameter named alpha is the penalty strength.
Lambda
is the term that controls the strength of the regularization penalty; in other words, how heavily the model is penalized for large coefficients when it fits. In scikit-learn's LogisticRegression it shows up as its inverse, C = 1/lambda, which is exactly what your c = 1/_lambda computes.
confusion
To make matters worse, these terms are often used interchangeably, I think due to different yet similar concepts in graph theory, statistical theory, mathematical theory, and the conventions chosen by the individuals who write commonly-used machine-learning libraries.
check out some info here: https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/ but also look for some of the free academic textbooks about statistical learning.
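To answer the original question directly: scikit-learn's LogisticRegression has no separate alpha parameter; you pick the penalty type with penalty='l1' or 'l2' and the strength with C = 1/lambda. A minimal sketch, assuming X_train, X_test, y_train, y_test already exist (the liblinear solver supports both penalties):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

_lambda = 0.1
C = 1 / _lambda  # scikit-learn's C is the inverse of lambda
for penalty in ('l1', 'l2'):
    clf = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('=== penalty =', penalty, '===')
    print(classification_report(y_test, y_pred))  # accuracy, precision, recall, f1-score
    print(confusion_matrix(y_test, y_pred))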

Related

How to understand the loss function in scikit-learn logistic regression code?

The code for the loss function in scikit-learn logistic regression is:
# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)
However, it seems to be different from the common form of the logarithmic loss function, which reads:
-(y log(p) + (1-y) log(1-p))
(please see http://wiki.fast.ai/index.php/Log_Loss)
Could anyone tell me how to understand the code for the loss function in scikit-learn logistic regression, and what the relation is between it and the general form of the logarithmic loss function?
Thank you in advance.
First you should note that 0.5 * alpha * np.dot(w, w) is just the L2 regularization term. So, sklearn logistic regression reduces to the following
-np.sum(sample_weight * log_logistic(yz))
Also, the np.sum is there because it considers multiple samples, so it again reduces to
sample_weight * log_logistic(yz)
Finally if you read HERE, you note that sample_weight is an optional array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. So, it should be equal to one (as in the original definition of cross entropy loss we do not consider unequal weight for different samples), hence the loss reduces to:
- log_logistic(yz)
which is equivalent to
- log_logistic(y * np.dot(X, w)).
Now, why does it look different (while in essence it is the same) from the cross-entropy loss function, i.e.:
-[y log(p) + (1-y) log(1-p)]?
The reason is, we can use either of two different labeling conventions for binary classification, either using {0, 1} or {-1, 1}, which results in the two different representations. But they are the same!
More details (on why they are the same) can be found HERE. Note that you should read the response by Manuel Morales.
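A small numerical check of that equivalence (a sketch with synthetic data; here log_logistic(z) = log(sigmoid(z)) is written in the numerically stable form -logaddexp(0, -z)):
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = rng.normal(size=3)
y01 = rng.integers(0, 2, size=100)   # labels in {0, 1}
ypm = 2 * y01 - 1                    # the same labels in {-1, +1}
p = expit(X @ w)                     # predicted probabilities

# {0, 1} convention: the usual cross-entropy / log loss
loss_01 = -np.sum(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

# {-1, +1} convention: -sum(log_logistic(y * Xw)), with log_logistic(z) = -logaddexp(0, -z)
loss_pm = np.sum(np.logaddexp(0, -ypm * (X @ w)))

print(loss_01, loss_pm)              # the two values agree up to floating-point error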

Optimize sparse softmax cross entropy with L2 regularization

I was training my network using tf.losses.sparse_softmax_cross_entropy as the classification function in the last layer and everything was working fine.
I simply added an L2 regularization over my weights, and now my loss is not getting optimized anymore. What could be happening?
reg = tf.nn.l2_loss(w1) + tf.nn.l2_loss(w2)
loss = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(y, logits)) + reg*beta
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
It is hard to answer with certainty given the provided information, but here is a possible cause:
tf.nn.l2_loss is computed as a sum over the elements, while your cross-entropy loss is reduced to its mean (cf. tf.reduce_mean), hence a numerical imbalance between the two terms.
Try for instance to divide each L2 loss by the number of elements it is computed over (e.g. tf.size(w1)).
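For instance, a sketch of that rescaling, reusing the tensors from the question (TF1-style API assumed; the exact scaling you prefer may differ):
# scale each L2 term by the number of weights it sums over, so it is a mean like the cross-entropy term
reg = (tf.nn.l2_loss(w1) / tf.cast(tf.size(w1), tf.float32)
       + tf.nn.l2_loss(w2) / tf.cast(tf.size(w2), tf.float32))
loss = tf.reduce_mean(tf.losses.sparse_softmax_cross_entropy(y, logits)) + reg * beta
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)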

sklearn: Hyperparameter tuning by gradient descent?

Is there a way to perform hyperparameter tuning in scikit-learn by gradient descent? While a formula for the gradient of hyperparameters might be difficult to compute, numerical computation of the hyperparameter gradient by evaluating two close points in hyperparameter space should be pretty easy. Is there an existing implementation of this approach? Why is or isn't this approach a good idea?
The calculation of the gradient is the least of the problems, at least in these times of advanced automatic-differentiation software. (Implementing it in a general way for all sklearn classifiers is of course not easy.)
And while there is work by people who have used this kind of idea, it was only done for specific, well-formulated problems (e.g. SVM tuning). Furthermore, there were probably a lot of assumptions, because:
Why is this not a good idea?
Hyper-param optimization is in general: non-smooth
GD really likes smooth functions as a gradient of zero is not helpful
(Each hyper-parameter which is defined by some discrete-set (e.g. choice of l1 vs. l2 penalization) introduces non-smooth surfaces)
Hyper-param optimization is in general: non-convex
The whole convergence-theory of GD assumes, that the underlying problem is convex
Good-case: you obtain some local-minimum (can be arbitrarily bad)
Worst-case: GD is not even converging to some local-minimum
I might add, that your general problem is the worst kind of optimization problem one can consider because it's:
non-smooth, non-convex
and even stochastic / noisy as most underlying algorithms are heuristic approximations with some variance in regards to the final output (and often even PRNG-based random-behaviour).
The last part is the reason, why the offered methods in sklearn are that simple:
random-search:
if we can't infer anything because the problem is too hard, just try many instances and pick the best
grid-search:
let's assume there is some kind of smoothness
instead of random-sampling, we sample in regards to our smoothness-assumption
(and other assumptions like: the parameter probably spans orders of magnitude -> use np.logspace to sample more large values)
While there are a lot of Bayesian-approaches including available python-software like hyperopt and spearmint, many people think, that random-search is the best method in general (which might be surprising but emphasizes the mentioned problems).
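As a concrete illustration of those two simple strategies (a sketch only; the estimator, grid, and distribution are placeholders you would adapt):
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# grid-search: encode the "C probably spans orders of magnitude" assumption via np.logspace
grid = GridSearchCV(LogisticRegression(max_iter=5000),
                    param_grid={'C': np.logspace(-3, 3, 7)},
                    scoring='neg_log_loss', cv=3)
grid.fit(X, y)

# random-search: if we can't infer much about the surface, just try many candidates and keep the best
rnd = RandomizedSearchCV(LogisticRegression(max_iter=5000),
                         param_distributions={'C': loguniform(1e-3, 1e3)},
                         n_iter=20, scoring='neg_log_loss', cv=3, random_state=0)
rnd.fit(X, y)

print(grid.best_params_, rnd.best_params_)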
Here are some papers describing gradient-based hyperparameter optimization:
Gradient-based hyperparameter optimization through reversible learning (2015):
We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.
Forward and reverse gradient-based hyperparameter optimization (2017):
We study two procedures (reverse-mode and forward-mode) for computing the gradient of the validation error with respect to the hyperparameters of any iterative learning algorithm such as stochastic gradient descent. These procedures mirror two methods of computing gradients for recurrent neural networks and have different trade-offs in terms of running time and space requirements. Our formulation of the reverse-mode procedure is linked to previous work by Maclaurin et al. [2015] but does not require reversible dynamics. The forward-mode procedure is suitable for real-time hyperparameter updates, which may significantly speed up hyperparameter optimization on large datasets.
Gradient descent: the ultimate optimizer (2019):
Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as the learning rate. There exist many techniques for automated hyperparameter optimization, but they typically introduce even more hyperparameters to control the hyperparameter optimization process. We propose to instead learn the hyperparameters themselves by gradient descent, and furthermore to learn the hyper-hyperparameters by gradient descent as well, and so on ad infinitum. As these towers of gradient-based optimizers grow, they become significantly less sensitive to the choice of top-level hyperparameters, hence decreasing the burden on the user to search for optimal values.
For generalized linear models (i.e. logistic regression, ridge regression, Poisson regression),
you can efficiently tune many regularization hyperparameters
using exact derivatives and approximate leave-one-out cross-validation.
But don't stop at just the gradient: compute the full Hessian and use a second-order optimizer -- it's
both more efficient and more robust.
sklearn doesn't currently have this functionality, but there are other tools available that can do it.
For example, here's how you can use the python package bbai to fit the
hyperparameter for ridge regularized logistic regression to maximize the log likelihood of the
approximate leave-one-out cross-validation of the training data set for the Wisconsin Breast Cancer Data Set.
Load the data set
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']
Fit the model
import bbai.glm
model = bbai.glm.LogisticRegression()
# Note: it automatically fits the C parameter to minimize the error on
# the approximate leave-one-out cross-validation.
model.fit(X, y)
Because it uses both the gradient and the Hessian with efficient exact formulas
(no automatic differentiation), it can dial in on the exact hyperparameter quickly with only a few
evaluations.
YMMV, but when I compare it to sklearn's LogisticRegressionCV with default parameters, it runs
in a fraction of the time.
import time

t1 = time.time()
model = bbai.glm.LogisticRegression()
model.fit(X, y)
t2 = time.time()
print('***** approximate leave-one-out optimization')
print('C = ', model.C_)
print('time = ', (t2 - t1))
from sklearn.linear_model import LogisticRegressionCV
print('***** sklearn.LogisticRegressionCV')
t1 = time.time()
model = LogisticRegressionCV(scoring='neg_log_loss', random_state=0)
model.fit(X, y)
t2 = time.time()
print('C = ', model.C_[0])
print('time = ', (t2 - t1))
Prints
***** approximate leave-one-out optimization
C = 0.6655139682151275
time = 0.03996014595031738
***** sklearn.LogisticRegressionCV
C = 0.3593813663804626
time = 0.2602980136871338
How it works
Approximate leave-one-out cross-validation (ALOOCV) is a close approximation to leave-one-out
cross-validation that's much more efficient to evaluate for generalized linear models.
It first fits the regularized model. It then uses a single step of Newton's algorithm to approximate what
the model weights would be when we leave a single data point out. For logistic regression with labels y_i in {-1, +1}, the regularized cost function for
the generalized linear model can be written as (matching the code below)
L(w) = SUM_i log(1 + exp(-y_i * x_i . w)) + (alpha/2) * ||w||^2
Then the ALOOCV can be computed as
ALO = SUM_i log sigmoid(y_i * u_i^loo), with u_i^loo = u_i - y_i * (1 - p_i) * h_i / (1 - p_i * (1 - p_i) * h_i)
where
u_i = x_i . w, p_i = sigmoid(y_i * u_i), and h_i = x_i^T H^-1 x_i
(Note: H represents the Hessian of the cost function at the optimal weights.)
For more background on ALOOCV, you can check out this guide.
It's also possible to compute exact derivatives for ALOOCV which makes it efficient to optimize.
I won't put the derivative formulas here as they are quite involved, but see the paper
Optimizing Approximate Leave-one-out Cross-validation.
If we plot out ALOOCV and compare to leave-one-out cross-validation for the example data set,
you can see that it tracks it very closely and the ALOOCV optimum is nearly the same as the
LOOCV optimum.
Compute Leave-one-out Cross-validation
import numpy as np

def compute_loocv(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    n = len(y)
    loo_likelihoods = []
    for i in range(n):
        train_indexes = [i_p for i_p in range(n) if i_p != i]
        test_indexes = [i]
        X_train, X_test = X[train_indexes], X[test_indexes]
        y_train, y_test = y[train_indexes], y[test_indexes]
        model.fit(X_train, y_train)
        pred = model.predict_proba(X_test)[0]
        loo_likelihoods.append(pred[y_test[0]])
    return sum(np.log(loo_likelihoods))
Compute Approximate Leave-one-out Cross-validation
import scipy.linalg
import scipy.special

def fit_logistic_regression(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    model.fit(X, y)
    return np.array(list(model.coef_[0]) + list(model.intercept_))

def compute_hessian(p_vector, X, alpha):
    n, k = X.shape
    a_vector = np.sqrt((1 - p_vector)*p_vector)
    R = scipy.linalg.qr(a_vector.reshape((n, 1))*X, mode='r')[0]
    H = np.dot(R.T, R)
    for i in range(k-1):
        H[i, i] += alpha  # regularize every weight except the intercept
    return H

def compute_alo(X, y, C):
    alpha = 1.0 / C
    w = fit_logistic_regression(X, y, C)
    X = np.hstack((X, np.ones((X.shape[0], 1))))  # append an intercept column
    n = X.shape[0]
    y = 2*y - 1  # relabel {0, 1} as {-1, +1}
    u_vector = np.dot(X, w)
    p_vector = scipy.special.expit(u_vector*y)
    H = compute_hessian(p_vector, X, alpha)
    L = np.linalg.cholesky(H)
    T = scipy.linalg.solve_triangular(L, X.T, lower=True)
    h_vector = np.array([np.dot(ti, ti) for pi, ti in zip(p_vector, T.T)])
    loo_u_vector = u_vector - \
        y * (1 - p_vector)*h_vector / (1 - p_vector*(1 - p_vector)*h_vector)
    loo_likelihoods = scipy.special.expit(y*loo_u_vector)
    return sum(np.log(loo_likelihoods))
Plot out the results (along with the ALOOCV optimum)
import matplotlib.pyplot as plt
Cs = np.arange(0.1, 2.0, 0.1)
loocvs = [compute_loocv(X, y, C) for C in Cs]
alos = [compute_alo(X, y, C) for C in Cs]
fig, ax = plt.subplots()
ax.plot(Cs, loocvs, label='LOOCV', marker='o')
ax.plot(Cs, alos, label='ALO', marker='x')
ax.axvline(model.C_, color='tab:green', label='C_opt')
ax.set_xlabel('C')
ax.set_ylabel('Log-Likelihood')
ax.set_title("Breast Cancer Dataset")
ax.legend()
Displays the plot: the ALO curve closely tracks the LOOCV curve, and the vertical line marks the ALOOCV-optimal C.

Why there is a difference between the accuracy of sklearn.LogisticRegression with penalty='l1' and 'l2' and C=1e80?

I am somewhat disappointed by the results I am getting. I create two models (sklearn.linear_model.LogisticRegression) with C=1e80 and penalty = 'l1' or 'l2', and then test them using sklearn.cross_validation.cross_val_score with cv=3 and scoring='roc_auc'. To me, C=1e80 should result in virtually no regularization, and the AUC should be the same. Instead, the model with the 'l2' penalty gives worse AUC, and multiple runs give me the same results. How does this happen?
Just to make it a bit more clear: the general form of most regularized loss functions is
C * SUM_{i=1}^{N} loss(h(x_i), y_i | theta) + regularizer(theta)
thus the whole job of C is to balance the sum of losses over the training samples against the value of the regularizer.
Now, if the loss is bounded (as in logistic regression), then without proper normalization the L2 regularizer (||theta||^2) can grow very large, so you need a very high C to make it irrelevant and thus get the same solution as with L1 (SUM_j |theta_j|). Similarly, if you have a loss that grows very fast, such as an Lp loss with p >= 2, the regularizer may be comparatively very small, so you need a very small C to make it do anything.
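A sketch of the kind of comparison the question describes (using the current sklearn.model_selection module in place of the older sklearn.cross_validation, and the breast cancer dataset as a placeholder):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for penalty in ('l1', 'l2'):
    # C=1e80 is meant to make the regularizer negligible
    model = LogisticRegression(penalty=penalty, C=1e80, solver='liblinear')
    scores = cross_val_score(model, X, y, cv=3, scoring='roc_auc')
    print(penalty, scores.mean())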

How does sklearn.svm.svc's function predict_proba() work internally?

I am using sklearn.svm.svc from scikit-learn to do binary classification. I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() internally calculates the probability?
Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.
Platt scaling requires first training the SVM as usual, then optimizing parameter vectors A and B such that
P(y|X) = 1 / (1 + exp(A * f(X) + B))
where f(X) is the signed distance of a sample from the hyperplane (scikit-learn's decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.
Mind you: the B parameter, the "intercept" or "bias" or whatever you like to call it, can cause predictions based on probability estimates from this model to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10, then the prediction for X is positive; but if B = -9.9 and A = 1, then P(y|X) = .475. I'm pulling these numbers out of thin air, but you've noticed that this can occur in practice.
Effectively, Platt scaling trains a probability model on top of the SVM's outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses an internal five-fold cross validation, meaning that training SVMs with probability=True can be quite a lot more expensive than a vanilla, non-probabilistic SVM.
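A small sketch of that caveat, comparing the sign of the decision function against the argmax of the Platt-scaled probabilities (synthetic data; whether any rows actually disagree depends on the data and the fitted A and B):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# probability=True triggers the internal Platt-scaling fit (with its five-fold CV)
clf = SVC(probability=True, random_state=0).fit(X, y)

pred_from_decision = (clf.decision_function(X) > 0).astype(int)  # sign of f(X)
pred_from_proba = clf.predict_proba(X).argmax(axis=1)            # argmax of Platt probabilities

# the two can disagree for points near the decision boundary
print('disagreements:', np.sum(pred_from_decision != pred_from_proba))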
Actually I found a slightly different answer: this is the code they use to convert the decision value to a probability:
double fApB = decision_value*A + B;
if (fApB >= 0)
    return Math.exp(-fApB)/(1.0 + Math.exp(-fApB));
else
    return 1.0/(1.0 + Math.exp(fApB));
Here A and B values can be found in the model file (probA and probB).
It also offers a way to convert a probability back to a decision value, and thus to a hinge loss (with ln(0) taken to be -200).
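For reference, a Python port of that conversion (a sketch; in scikit-learn the fitted coefficients are exposed as the SVC attributes probA_ and probB_):
import math

def platt_probability(decision_value, A, B):
    # numerically stable version of 1 / (1 + exp(A * f + B))
    fApB = decision_value * A + B
    if fApB >= 0:
        return math.exp(-fApB) / (1.0 + math.exp(-fApB))
    return 1.0 / (1.0 + math.exp(fApB))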
