Improve speed of scikit-learn multinomial logistic regression - python

I am trying to train a logistic regression model in scikit-learn and it is taking very long to train, around 2 hours. The size of the dataset is 21613 x 19. I am new to scikit-learn, so I don't know whether my code is wrong or whether it just takes this long to train. Any suggestion on how to improve the training speed would be very much appreciated!
The code used to train is below:
# get the LogisticRegression estimator
from sklearn.linear_model import LogisticRegression
# training the model
# apply algorithm to data using fit()
clf = LogisticRegression(solver='newton-cg',multi_class='multinomial')
clf.fit(X_train,y_train)

If you have a specific reason for using this solver, one thing you can do is parallelize the computations by setting the n_jobs=-1 argument.
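For instance, keeping your original solver (a sketch assuming X_train and y_train from your code; note that, per the documentation, n_jobs parallelizes over classes when multi_class='ovr', so the speed-up with the multinomial loss may be limited):
from sklearn.linear_model import LogisticRegression

# n_jobs=-1 uses all available CPU cores
clf = LogisticRegression(solver='newton-cg', multi_class='multinomial', n_jobs=-1)
clf.fit(X_train, y_train)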
If you're open to using other solvers, you can use faster solvers with a one-versus-rest strategy. For instance:
clf = LogisticRegression(solver='liblinear', multi_class='ovr')
It's all in the documentation, which can help guide your choice of solver:
solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’
Algorithm to use in the optimization problem.
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and
‘saga’ are faster for large ones.
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’
handle multinomial loss; ‘liblinear’ is limited to one-versus-rest
schemes.
‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
‘liblinear’ and ‘saga’ also handle L1 penalty
‘saga’ also supports ‘elasticnet’ penalty
‘liblinear’ does not support setting penalty='none'

It's probably slow because of the solver you have chosen. newton-cg is a Newton method: it is slow on large datasets because it computes second derivatives (the Hessian). Use a different solver such as sag or saga; they are fast for big datasets.
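If you switch, note that per the scikit-learn documentation the fast convergence of sag and saga is only guaranteed when the features have approximately the same scale, so it is worth standardizing first. A minimal sketch, assuming X_train and y_train from the question:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Standardize features, then fit with a solver suited to larger datasets
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(solver='saga', multi_class='multinomial'))
clf.fit(X_train, y_train)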

You might want to change your solver. The documentation says that scikit-learn has five different solvers: 'sag', 'saga', 'newton-cg', 'lbfgs', and 'liblinear' (the last is not suitable for multinomial problems).
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Set training and validation sets
X, y = make_classification(n_samples=1000000, n_features=19, n_classes=8, n_informative=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)

# Time the fit for each multinomial-capable solver
solvers = ['newton-cg', 'sag', 'saga', 'lbfgs']
for sol in solvers:
    start = time.time()
    logreg = LogisticRegression(solver=sol, multi_class='multinomial')
    logreg.fit(X_train, y_train)
    end = time.time()
    print(sol + " Fit Time: ", end - start)
The output (from a 16GB RAM, 8-core MacBook) prints the fit time for each solver; exact timings will vary by machine.
Choosing the right solver for a problem can save a lot of time (code adapted from here). To determine which solver is right for your problem, check out the table in the documentation (notice that 'newton-cg' is not fast for large datasets).

Related

Why is SGDClassifier with hinge loss faster than the SVC implementation in scikit-learn?

As we know, for support vector machines we can use SVC as well as SGDClassifier with a hinge loss implementation. Is SGDClassifier with hinge loss faster than SVC? If so, why?
Links to both implementations in scikit-learn:
SVC
SGDClassifier
I read on the documentation page of scikit-learn that SVC uses an algorithm from the libsvm library for optimization, while SGDClassifier uses SGD (obviously).
Maybe it is better to start by trying some practical cases and reading the code. Let's start...
First of all, if we read the documentation of SGDClassifier, it says that only a linear SVM is used:
Linear classifiers (SVM, logistic regression, a.o.) with SGD training
What if, instead of the usual SVC, we use LinearSVC?
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
Let's add an example for the three types of algorithms:
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()
X = np.random.rand(20000,2)
Y = np.random.choice(a=[False, True], size=(20000, 1))
# hinge is used as the default
svc = SVC(kernel='linear')
sgd = SGDClassifier(loss='hinge')
svcl = LinearSVC(loss='hinge')
Using Jupyter and the %%time magic we get the execution times (you can measure this in plain Python too, but this is how I did it):
%%time
svc.fit(X, Y)
Wall time: 5.61 s
%%time
sgd.fit(X, Y)
Wall time: 24ms
%%time
svcl.fit(X, Y)
Wall time: 26.5ms
As we can see, there is a huge difference between them: SVC is far slower, while LinearSVC and SGDClassifier take more or less the same time. The timings will always differ a little, since the two implementations do not share the same code.
If you are interested in each implementation, I suggest you read the code on GitHub (the new GitHub code-reading tool is really good!):
Code of LinearSVC
Code of SGDClassifier
I think it's because of the batch size used in SGD: if you used a full batch with the SGD classifier, it should take about the same time as SVM, but a smaller batch size can lead to faster convergence.
The sklearn SVC is computationally expensive compared to the sklearn SGDClassifier with loss='hinge', hence we use the SGD classifier, which is faster. This holds only for linear SVMs; if we are using an 'rbf' kernel, then SGD is not suitable.
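If you do need non-linear (e.g. RBF) behaviour while keeping SGD-style speed, one workaround worth knowing about is kernel approximation: map the data through an explicit approximate RBF feature map and train a linear SGD model on top. A minimal sketch using scikit-learn's RBFSampler (the gamma and n_components values are illustrative assumptions, not tuned):
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
import numpy as np

X = np.random.rand(20000, 2)
Y = np.random.choice(a=[False, True], size=20000)

# Approximate an RBF kernel with random Fourier features, then fit a linear model
clf = make_pipeline(RBFSampler(gamma=1.0, n_components=100, random_state=0),
                    SGDClassifier(loss='hinge'))
clf.fit(X, Y)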

Multi-target regression using scikit-learn

I am solving a classic regression problem using Python and the scikit-learn library. It's simple:
from sklearn.ensemble import GradientBoostingRegressor

ml_model = GradientBoostingRegressor()
ml_params = {}
ml_model.fit(X_train, y_train)
where y_train is a one-dimensional array-like object.
Now I would like to expand the task to predict not a single target value, but a set of them. The training set of samples X_train will remain the same.
An intuitive solution is to train several models, where X_train is the same for all of them but y_train is specific to each model. This definitely works but, it seems to me, is inefficient.
When searching for alternatives, I came across the concept of Multi-Target Regression. As far as I understand, such functionality is not implemented in scikit-learn.
How can I solve a multi-target regression problem in Python in an efficient way? Thanks!
It depends on the problem you are solving, the training data you have, and the algorithm you choose to find a solution. It's really hard to suggest anything without knowing all the details. You could try a random forest as a starting point: it's a very powerful and robust algorithm, resistant to overfitting when you don't have much data, and it can be used for multi-target regression. Here is a working example:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
X, y = make_regression(n_targets=2)
print('Feature vector:', X.shape)
print('Target vector:', y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
print('Build and fit a regressor model...')
model = RandomForestRegressor()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print('Done. Score', score)
Output:
Feature vector: (100, 100)
Target vector: (100, 2)
Build and fit a regressor model...
Done. Score 0.4405974071273537
This algorithm natively supports multi-target regression. For estimators that don't, you can use MultiOutputRegressor, which simply fits one regressor per target.
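For example, to keep the GradientBoostingRegressor from the question (which handles only a single target on its own), a minimal sketch wrapping it in MultiOutputRegressor:
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_targets=2)

# Fits one independent GradientBoostingRegressor per column of y
model = MultiOutputRegressor(GradientBoostingRegressor())
model.fit(X, y)
print(model.predict(X[:3]).shape)  # (3, 2): one prediction per target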
Another alternative to the random forest approach is an adapted version of Support Vector Regression that fits multi-target regression problems. Its advantage over fitting SVR with MultiOutputRegressor is that it takes the underlying correlations between the multiple targets into account and hence should perform better.
A working implementation with a paper reference can be found here.

Using statsmodels OLS on a test-set

I would like to use a technique from scikit-learn, namely ShuffleSplit, to benchmark my linear regression model with a sequence of randomized test and train sets. This is well established and works great for LinearRegression in scikit-learn:
from sklearn.linear_model import LinearRegression

# Fit on the training split, then score both splits
LM = LinearRegression()
LM.fit(X[train_index], Y[train_index])
train_score = LM.score(X[train_index], Y[train_index])
test_score = LM.score(X[test_index], Y[test_index])
The score one gets here is only the R² value and nothing more. Using the statsmodels OLS implementation for linear models gives a very rich set of scores, among which are adjusted R², AIC, BIC, etc. However, here one can only fit the model on the training data to get these scores. Is there a way to get them for the test set as well?
So, in my example:
from sklearn.model_selection import ShuffleSplit
from statsmodels.regression.linear_model import OLS

ss = ShuffleSplit(n_splits=40, train_size=0.15, random_state=42)
for train_index, test_index in ss.split(X):
    regr = OLS(Y[train_index], X[train_index]).fit()
    train_score_AIC = regr.aic
Is there a way to add something like:
test_score_AIC = regr.test(Y[test_index], X[test_index]).aic
Most of those measures are goodness-of-fit measures that are built into the model/results classes and are only available for the training data or estimation sample.
Many of those measures are not well defined as out-of-sample, predictive accuracy measures, or I have never seen definitions that would fit that case.
Specifically, loglike is a method of the model and can only be evaluated at the attached training sample.
related issues:
https://github.com/statsmodels/statsmodels/issues/2572
https://github.com/statsmodels/statsmodels/issues/1282
It would be possible to partially work around the current limitations of statsmodels, but none of these workarounds are currently supported and unit tested.
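That said, nothing stops you from generating predictions on the held-out rows yourself and computing simple out-of-sample measures by hand. A minimal, unsupported sketch (assuming X and Y are NumPy arrays as in the question; this computes an out-of-sample R², not AIC):
import numpy as np
from sklearn.model_selection import ShuffleSplit
from statsmodels.regression.linear_model import OLS

ss = ShuffleSplit(n_splits=40, train_size=0.15, random_state=42)
for train_index, test_index in ss.split(X):
    regr = OLS(Y[train_index], X[train_index]).fit()
    # Predict on the held-out rows and compute R² manually
    y_pred = regr.predict(X[test_index])
    ss_res = np.sum((Y[test_index] - y_pred) ** 2)
    ss_tot = np.sum((Y[test_index] - Y[test_index].mean()) ** 2)
    print('out-of-sample R2:', 1 - ss_res / ss_tot)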

Evaluate Loss Function Value Getting From Training Set on Cross Validation Set

I am following Andrew Ng's instructions for evaluating an algorithm in classification:
Find the loss function value on the Training Set.
Compare it with the loss function value on the Cross-Validation Set.
If both are close enough and small, go to the next step (otherwise, there is bias or variance, etc.).
Make a prediction on the Test Set using the resulting thetas (i.e. weights) produced in the previous step, as a final confirmation.
I am trying to apply this using the scikit-learn library; however, I am really lost and am sure that I am doing it wrong (I didn't find anything similar online):
from sklearn import model_selection, svm
from sklearn.metrics import make_scorer, log_loss
from sklearn import datasets

def main():
    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10, random_state=42)
    model = svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring=make_scorer(log_loss, greater_is_better=False))
    print(results)
Error
ValueError: y_true contains only one label (0). Please provide the true labels explicitly through the labels argument.
I am not even sure this is the right way to start. Any help is very much appreciated.
Given the clarifications you provided in the comments, and given that you are not particularly interested in the log loss itself, I think the most straightforward approach is to abandon log loss and go for accuracy instead:
from sklearn import model_selection, svm
from sklearn import datasets

iris = datasets.load_iris()
kfold = model_selection.KFold(n_splits=10, random_state=42)
model = svm.SVC(kernel='linear', C=1)
results = model_selection.cross_val_score(estimator=model,
                                          X=iris.data,
                                          y=iris.target,
                                          cv=kfold,
                                          scoring="accuracy")  # change
As already mentioned in the comments, inclusion of log loss in such situations still suffers from some unresolved issues in scikit-learn (see here and here).
For the purpose of estimating the generalization ability of your model, you will be fine with the accuracy metric.
This kind of error often appears when you do cross-validation.
Basically, your data is split into n_splits = 10 folds, and some classes are missing from some of these splits. For example, your 9th split may not have any training examples for class number 2.
So when you evaluate your loss, the number of classes present in your prediction and in the test set do not match: you cannot compute the loss if y_true has 3 classes but your model was trained to predict only 2.
What do you do in this case?
You have four possibilities:
Shuffle your data: KFold(n_splits=10, random_state=42, shuffle=True)
Make n_splits bigger
Provide the list of labels explicitly to the loss function, as follows:
args_loss = {"labels": [0, 1, 2]}
make_scorer(log_loss, greater_is_better=False, **args_loss)
Cherry-pick your splits so you make sure this doesn't happen. I don't think KFold allows this, but GridSearchCV does.
Just for future readers who are following Andrew's course:
K-Fold is not practically applicable for this purpose, because we mainly want to evaluate the thetas (i.e. weights) produced by a certain algorithm with some parameters on the cross-validation set, comparing the two cost functions J(train) vs J(CV) to determine whether the model suffers from bias, variance, or is O.K. (see the sketch below).
K-Fold, by contrast, is mainly for testing the prediction on the held-out folds using the weights produced by training the model on the training folds.
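A minimal sketch of the J(train) vs J(CV) comparison described above, assuming a plain train/CV split rather than K-Fold (LogisticRegression is used here only because it exposes predict_proba, which log_loss needs):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

iris = datasets.load_iris()
X_train, X_cv, y_train, y_cv = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Compare J(train) and J(CV): if both are small and close, move on to the test set
j_train = log_loss(y_train, model.predict_proba(X_train))
j_cv = log_loss(y_cv, model.predict_proba(X_cv))
print('J(train):', j_train, 'J(CV):', j_cv)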

How to increase the precision of text classification with an RBM?

I am learning about text classification, and I classify my own corpus using logistic regression as follows:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', C=7)
classifier.fit(training_matrix, y_train)
prediction = classifier.predict(testing_matrix)
I would like to improve the classification metrics (recall, f1-score, accuracy, etc.) with the Restricted Boltzmann Machine that scikit-learn provides, since from the documentation I read that it can be used to improve classification performance. Could anybody help me with this? Below is what I have tried so far, thanks in advance:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM

vectorizer = TfidfVectorizer(max_df=0.5,
                             max_features=None,
                             ngram_range=(1, 1),
                             norm='l2',
                             use_idf=True)
X_train = vectorizer.fit_transform(X_train_r)
X_test = vectorizer.transform(X_test_r)

logistic = LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
classifier.fit(X_train, y_train)
First, you have to understand the concepts here. An RBM can be seen as a powerful clustering algorithm, and clustering algorithms are unsupervised, i.e., they don't need labels.
Perhaps the best way to use an RBM for your problem is to first train the RBM (which only needs data without labels) and then use the RBM weights to initialize a neural network. To get logistic regression at the output, add an output layer with a logistic regression cost function to this neural net and train the whole network. This setup may result in a performance improvement.
There are a couple of things that could be wrong.
1. You haven't properly calibrated the RBM
Look at the example on the scikit-learn site: http://scikit-learn.org/stable/auto_examples/plot_rbm_logistic_classification.html
In particular, these lines:
rbm.learning_rate = 0.06
rbm.n_iter = 20
# More components tend to give better prediction performance, but larger
# fitting time
rbm.n_components = 100
You don't set these anywhere. In the example, they are obtained through cross-validation using a grid search. You should do the same and try to obtain (close to) optimal parameters for your own problem, as in the sketch below.
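A minimal sketch of such a grid search over the pipeline from the question, assuming X_train and y_train are the TF-IDF features and labels (the grid values are illustrative assumptions, not recommendations):
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM

pipeline = Pipeline(steps=[('rbm', BernoulliRBM(random_state=0)),
                           ('logistic', LogisticRegression())])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    'rbm__learning_rate': [0.01, 0.06, 0.1],
    'rbm__n_iter': [10, 20],
    'rbm__n_components': [50, 100, 200],
    'logistic__C': [1, 7, 100],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)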
Additionally, you might want to try using cross-validation to determine other parameters as well, such as the ngram range (including higher-level ngrams usually helps, if you can afford the memory and execution time; for some problems, character-level ngrams do better than word-level ones) and the logistic regression parameters.
2. You are just unlucky
There is nothing that says using an RBM as an intermediate step will definitely improve any performance measure. It can, but it's not a rule; it may very well do little or nothing for your problem. You have to be prepared for this.
It's worth trying because it shouldn't take long to implement, but be prepared to have to look elsewhere.
Also look at the SGDClassifier and the PassiveAggressiveClassifier. These might improve performance.
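For instance, as a quick drop-in experiment on the TF-IDF features from the question (a sketch, assuming X_train, y_train, X_test, and y_test are defined as in your setup; no improvement is guaranteed):
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier

# Both are fast linear classifiers that work well on sparse text features
for clf in (SGDClassifier(), PassiveAggressiveClassifier()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))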
