How to set `n_jobs` in sklearn ElasticNet - python

I am trying to run the ElasticNet function from scikit-learn on a machine with multiple CPUs. However, I need the ElasticNet fit to only use one CPU, since I need to run other fitting routines in parallel on the remaining CPUs. Whenever the thread containing ElasticNet starts the fit, it quickly takes over any free space on all CPUs instead of just the one it's called on. Because other code routines are running on these machines already, ElasticNet oversubscribes the machines and slows everything down tremendously, including itself. I need these routines to run in parallel, so I cannot just run the ElasticNet fit serially ahead of time.
Unlike other regression functions (linear, logistic...) in sklearn there is no n_jobs argument for ElasticNet. Reading the documentation, it appears that ElasticNet defaults to the n_jobs specified in joblib.parallel_backend which itself defaults to n_jobs=-1, which is all available CPUs.
I am trying to figure out the proper method for specifying n_jobs in parallel_backend so that it will override the default for ElasticNet. Following are three attempts to change n_jobs that have not worked so far.
Attempt 1
from joblib import parallel_backend
from sklearn.linear_model import ElasticNet
with parallel_backend('loky', n_jobs=1):
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                       normalize=False, copy_X=True, max_iter=10000, tol=10,
                       random_state=42, precompute=False, warm_start=False,
                       positive=False, selection='cyclic')
    model.fit(predictors, response)
Attempt 2
from sklearn.utils import parallel_backend
from sklearn.linear_model import ElasticNet
with parallel_backend('loky', n_jobs=1):
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                       normalize=False, copy_X=True, max_iter=10000, tol=10,
                       random_state=42, precompute=False, warm_start=False,
                       positive=False, selection='cyclic')
    model.fit(predictors, response)
Both Attempt 1 and Attempt 2 do not throw any errors, but also do not appear to change n_jobs from the default of using every available CPU. ElasticNet still takes over all available CPU space across all CPUs and quickly oversubscribes the machines.
Attempt 3
This is my first time using joblib directly, and so I've been reading the documentation on parallelization with joblib. Most of the example routines placed in the parallel_backend container are prefaced with the Parallel() helper class.
Following the examples, I modified Attempt 1 in the following way:
from joblib import Parallel, parallel_backend
from sklearn.linear_model import ElasticNet
with parallel_backend('loky', n_jobs=1):
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                       normalize=False, copy_X=True, max_iter=10000, tol=10,
                       random_state=42, precompute=False, warm_start=False,
                       positive=False, selection='cyclic')
    Parallel(n_jobs=1)(model.fit(predictors, response))
However when running Attempt 3, I get the following error message:
TypeError: 'ElasticNet' object is not iterable
Does anyone know how to set n_jobs=1 for sklearn's ElasticNet? There must be some way to do this because ElasticNetCV has n_jobs as a possible argument. Any help with this is greatly appreciated!
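A note on the error in Attempt 3: Parallel expects an iterable of delayed calls, whereas model.fit(...) is executed immediately and hands back the fitted estimator, which Parallel then tries to iterate over. The sketch below reproduces the same pattern with a hypothetical square function in place of the fit (square and the variable names are illustrative, not from the question). Note that even the corrected form only controls joblib workers; it would probably not limit threads started by the underlying numerical libraries.

```python
from joblib import Parallel, delayed


def square(x):
    # Hypothetical stand-in for an expensive call such as model.fit(...)
    return x * x


# Correct usage: pass a generator of delayed(...) calls.
results = Parallel(n_jobs=1)(delayed(square)(i) for i in range(4))

# Same mistake as Attempt 3: square(3) runs right away and returns an int,
# which Parallel then tries to iterate over.
raised_type_error = False
try:
    Parallel(n_jobs=1)(square(3))
except TypeError:
    raised_type_error = True

print(results, raised_type_error)
```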

In addition to the first solution, you can limit MKL to a single thread at runtime (the equivalent of setting MKL_NUM_THREADS=1) with:
import mkl
mkl.set_num_threads(1)
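If the mkl package is not available, a similar effect can usually be had through environment variables. The following is a minimal sketch, assuming the numerical backend honours the common OMP_NUM_THREADS, MKL_NUM_THREADS and OPENBLAS_NUM_THREADS variables; limit_native_threads is a hypothetical helper name, and which variable actually matters depends on how NumPy and SciPy were built.

```python
import os


def limit_native_threads(n_threads=1):
    """Hypothetical helper: cap the thread pools of common BLAS/OpenMP backends."""
    for name in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[name] = str(n_threads)


# Call this before the first import of numpy or sklearn in the process,
# because the native libraries typically read these variables when they load.
limit_native_threads(1)
```

Setting the same variables in the shell before launching Python should work equally well.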

Related

Improve speed of scikit-learn multinomial logistic regression

I am trying to train a logistic regression model in scikit-learn and it is taking very long to train, around 2 hours. The size of the dataset is 21613 x 19. I am new to scikit-learn, so I don't know whether my code is wrong or whether it just takes very long to train. Any suggestions on how to improve the training speed would be much appreciated!
The code used to train is below:
# get the LogisticRegression estimator
from sklearn.linear_model import LogisticRegression
# training the model
# apply algorithm to data using fit()
clf = LogisticRegression(solver='newton-cg',multi_class='multinomial')
clf.fit(X_train,y_train)
If you have a specific reason for using this solver, one thing you can do is parallelize the computations by setting the n_jobs=-1 argument.
If you're open to using other solvers, you can use faster solvers with a one-versus-rest strategy. For instance:
clf = LogisticRegression(solver='liblinear', multi_class='ovr')
It's all in the documentation, which can help guide your choice of solver:
solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’
Algorithm to use in the optimization problem.
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and
‘saga’ are faster for large ones.
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’
handle multinomial loss; ‘liblinear’ is limited to one-versus-rest
schemes.
‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
‘liblinear’ and ‘saga’ also handle L1 penalty
‘saga’ also supports ‘elasticnet’ penalty
‘liblinear’ does not support setting penalty='none'
It's probably slow because of the solver you have chosen. newton-cg is a Newton method, which is slow for large datasets because it computes second derivatives. Use a different solver such as sag or saga; they are fast for big datasets.
You might want to change your solver. The documentation says that scikit-learn has 5 different solvers: 'sag', 'saga', 'newton-cg', 'lbfgs', and 'liblinear' (not suitable for multinomial).
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Set training and validation sets
X, y = make_classification(n_samples=1000000, n_features=19, n_classes=8, n_informative=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)
# Solvers
solvers = ['newton-cg', 'sag', 'saga', 'lbfgs']
for sol in solvers:
    start = time.time()
    logreg = LogisticRegression(solver=sol, multi_class='multinomial')
    logreg.fit(X_train, y_train)
    end = time.time()
    print(sol + " Fit Time: ", end - start)
Output (from 16GB RAM 8 Core Macbook):
Choosing the right solver for a problem can save a lot of time (code adapted from here). To determine which solver is right for your problem, you can check out the table from the documentation to learn more (notice that 'newton-cg' is not faster for large datasets).

Why is cross_val_predict so much slower than fit for KNeighborsClassifier?

Running locally on a Jupyter notebook and using the MNIST dataset (28k entries, 28x28 pixels per image), the following takes 27 seconds.
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_jobs=1)
knn_clf.fit(pixels, labels)
However, the following takes 1722 seconds, in other words ~64 times longer:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(knn_clf, pixels, labels, cv = 3, n_jobs=1)
My naive understanding is that cross_val_predict with cv=3 is doing 3-fold cross validation, so I'd expect it to fit the model 3 times, and so take at least ~3 times longer, but I don't see why it would take 64x!
To check whether it was something specific to my environment, I ran the same in a Colab notebook. The difference was less extreme (15x), but still way above the ~3x I expected.
What am I missing? Why is cross_val_predict so much slower than just fitting the model?
In case it matters, I'm running scikit-learn 0.20.2.
KNN is also called a lazy algorithm because during fitting it does nothing but save the input data; specifically, there is no learning at all.
The actual distance calculation happens during predict, for each test data point. Hence, when using cross_val_predict, KNN has to predict on the validation data points, which makes the computation time higher!
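To make the point concrete, here is a toy brute-force nearest-neighbour classifier in pure Python. It is only an illustrative sketch, not scikit-learn's implementation (which may also build a tree structure during fit); the class name LazyNearestNeighbors and the distance_evaluations counter are made up for this example.

```python
import math
from collections import Counter


class LazyNearestNeighbors:
    """Toy brute-force k-NN classifier (illustrative only)."""

    def __init__(self, k=3):
        self.k = k
        self.distance_evaluations = 0

    def fit(self, points, labels):
        # "Training" only stores the data; no distances are computed here.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, queries):
        predictions = []
        for query in queries:
            # All the work happens here: one distance per stored training point.
            distances = []
            for point, label in zip(self.points, self.labels):
                distances.append((math.dist(query, point), label))
                self.distance_evaluations += 1
            nearest = sorted(distances, key=lambda pair: pair[0])[: self.k]
            votes = Counter(label for _, label in nearest)
            predictions.append(votes.most_common(1)[0][0])
        return predictions


train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
train_labels = ["a", "a", "a", "b", "b", "b"]

model = LazyNearestNeighbors(k=3).fit(train_points, train_labels)
evaluations_after_fit = model.distance_evaluations
predicted = model.predict([(0.05, 0.05), (5.0, 5.1)])
evaluations_after_predict = model.distance_evaluations
```

In this sketch the cost of predict grows with (number of queries) x (number of stored points), while fit stays trivial, so timing fit alone says little about how long cross-validated prediction will take.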
cross_val_predict does a fit and a predict so it might take longer than just fitting, but I did not expect 64 times longer

Use sklearn's GridSearchCV with a pipeline, preprocessing just once

I'm using scikit-learn to tune a model's hyperparameters. I'm using a pipeline to chain the preprocessing with the estimator. A simple version of my problem would look like this:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))
In my case the preprocessing (what would be StandardScaler() in the toy example) is time consuming, and I'm not tuning any parameter of it.
So, when I execute the example, the StandardScaler is executed 8 times: 2 (fit/predict) * 2 (cv folds) * 2 (parameter values). But every time StandardScaler is executed for a different value of the parameter C, it returns the same output, so it would be much more efficient to compute it once and then just run the estimator part of the pipeline.
I can manually split the pipeline between the preprocessing (no hyper parameters tuned) and the estimator. But to apply the preprocessing to the data, I should provide the training set only. So, I would have to implement the splits manually, and not use GridSearchCV at all.
Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?
Update:
Ideally, the answer below should not be used as it leads to data leakage as discussed in comments. In this answer, GridSearchCV will tune the hyperparameters on the data already preprocessed by StandardScaler, which is not correct. In most conditions that should not matter much, but algorithms which are too sensitive to scaling will give wrong results.
Essentially, GridSearchCV is also an estimator, implementing fit() and predict() methods, used by the pipeline.
So instead of:
grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
Do this:
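clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 # the grid search wraps LogisticRegression directly,
                                 # so the parameter is named 'C', with no step prefix
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))
# X, y stand for your own training data
clf.fit(X, y)
clf.predict(X)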
What this does is call StandardScaler() only once, for one call to clf.fit(), instead of multiple times as you described.
Edit:
Changed refit to True, when GridSearchCV is used inside a pipeline. As mentioned in documentation:
refit : boolean, default=True
Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance
after fitting.
If refit=False, clf.fit() will have no effect because the GridSearchCV object inside the pipeline will be reinitialized after fit().
When refit=True, the GridSearchCV will be refitted with the best scoring parameter combination on the whole data that is passed in fit().
So if you want to make the pipeline, just to see the scores of the grid search, only then the refit=False is appropriate. If you want to call the clf.predict() method, refit=True must be used, else Not Fitted error will be thrown.
For those who run into a slightly different problem, as I did:
Suppose you have this pipeline:
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])
Then, when specifying parameters, you need to include the 'clf__' prefix (the step name you used for your estimator, followed by two underscores). So the parameter grid is going to be:
params = {'clf__max_features': [0.3, 0.5, 0.7],
          'clf__min_samples_leaf': [1, 2, 3],
          'clf__max_depth': [None]
          }
It is not possible to do this in the current version of scikit-learn (0.18.1). A fix has been proposed on the github project:
https://github.com/scikit-learn/scikit-learn/issues/8830
https://github.com/scikit-learn/scikit-learn/pull/8322
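In later scikit-learn releases (0.19 onward, as far as I know), Pipeline and make_pipeline accept a memory argument that caches fitted transformers, which may avoid refitting an expensive preprocessing step for every parameter value. Below is a minimal sketch of that option applied to the toy example from the question; the temporary cache directory and the synthetic X and y are assumptions made for illustration. Caching applies to fitting the transformers on each training fold, so the transform of the validation fold is presumably still recomputed.

```python
from shutil import rmtree
from tempfile import mkdtemp

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: 20 samples, 3 features, balanced binary labels.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array([0, 1] * 10)

cache_dir = mkdtemp()  # temporary directory used as the transformer cache
try:
    # memory=... asks the pipeline to cache fitted transformers on disk.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(), memory=cache_dir)
    grid = GridSearchCV(pipe,
                        param_grid={'logisticregression__C': [0.1, 10.]},
                        cv=2,
                        refit=False)
    grid.fit(X, y)
finally:
    rmtree(cache_dir)  # remove the cache once the search is done

print(grid.cv_results_['params'])
```

Unlike wrapping GridSearchCV inside the pipeline, this keeps the scaler inside each cross-validation split, so it does not introduce the data leakage mentioned above.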

sklearn Logistic Regression with n_jobs=-1 doesn't actually parallelize

I'm trying to train a huge dataset with sklearn's logistic regression.
I've set the parameter n_jobs=-1 (also have tried n_jobs = 5, 10, ...), but when I open htop, I can see that it still uses only one core.
Does it mean that logistic regression just ignores the n_jobs parameter?
How can I fix this? I really need this process to become parallelized...
P.S. I am using sklearn 0.17.1
The parallel backend also depends on the solver. If you want to utilize multiple cores, the multiprocessing backend is needed,
but a solver like 'sag' can only use the threading backend.
Also, in most cases it can be blocked by a lot of preprocessing.
There are multiple reasons for it if you read their instructions carefully.
Can you try this set of parameters:
logit = LogisticRegression(penalty='l2',
                           random_state=42,
                           C=0.2,
                           n_jobs=-1,
                           solver='sag',
                           multi_class='ovr',
                           max_iter=200,
                           verbose=10
                           )
And it takes one minute or two to start multiple threads.

Scikit-Learn: Several X-Vals in parallel?

I would like to try several different models for my data and crossvalidate them, so the results are somewhat reliable.
For my crossvalidation I call:
cross_val_score(model, X, y, scoring = 'mean_squared_error', cv=kf, n_jobs = -1)
which does my 10-fold crossvalidation in parallel. Since the machine I'm running on has 40 cores and enough memory, I would like to try four different values for "model" in parallel, each doing a 10-fold crossvalidation.
However, when I try to do it using joblib in the following way, I get an error:
results = Parallel(n_jobs = num_jobs)(delayed(crossVal)(model) for model in models)
/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:1433: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1 for train, test in cv)
where crossVal is a method I defined, which consists mainly of calling cross_val_score.
Is there an elegant way to do this without starting several different python files manually?
Joblib can use multiprocessing and threading backends; by default it uses multiprocessing. (This is because of the CPython implementation, where threading is faster only in some particular cases. I don't want to go into details here; you can find tons of articles about CPython and the Python GIL.)
It's not an error, it's just a warning that tells you that you tried to create processes from within processes, i.e. with this line:
results = Parallel(n_jobs = num_jobs)(delayed(crossVal)(model) for model in models)
You have already spawned some number of processes (n_jobs), and then each cross_val_score inside your crossVal tries to do the same thing (spawn some number of processes), because cross_val_score uses multiprocessing by default. Joblib doesn't allow this with the multiprocessing backend. Thus, AFAIK, it gives this warning and runs the nested Parallel loop in a single process, i.e. the cross_val_score internals now run in a single thread, while your crossVal function still runs in multiprocessing mode.
You can avoid this warning if you get rid of either of those two multiprocessing loops, i.e. you can get rid of the nested multiprocessing by calling:
cross_val_score(..., n_jobs=1)
in your crossVal function, or you can call the default cross_val_score several times in a simple loop, without multiprocessing, and then aggregate the results, e.g.:
results = [cross_val_score(estimator = est, ...) for est in estimators]
In the first case you can run min(n_models, n_jobs) simultaneously (In your original case when joblib gives warning you do this implicitly already), in the second one - min(n_folds, n_cores). If you want to run min(n_jobs, n_models*n_folds) you should use GridSearchCV, because internally it spawns jobs in this way:
out = Parallel(
    n_jobs=self.n_jobs, verbose=self.verbose,
    pre_dispatch=pre_dispatch
)(
    delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_,
                            train, test, self.verbose, parameters,
                            self.fit_params, return_parameters=True,
                            error_score=self.error_score)
    for parameters in parameter_iterable
    for train, test in cv)
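The key idea in that snippet is that the (parameters, fold) pairs are flattened into a single task list, so there is only one level of parallelism. The sketch below shows the same flattening pattern using only the standard library; ThreadPoolExecutor is swapped in for joblib purely to keep the example self-contained, and fit_and_score, the model names and the scores are hypothetical stand-ins rather than real model fits.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def fit_and_score(model_name, fold_index):
    # Hypothetical stand-in for fitting one model on one fold and scoring it.
    return model_name, fold_index, float(len(model_name) + fold_index)


model_names = ["ridge", "lasso", "forest", "svr"]  # hypothetical model labels
n_folds = 10
n_jobs = 8

# One flat list of (model, fold) tasks instead of a parallel loop nested in another.
tasks = list(product(model_names, range(n_folds)))

with ThreadPoolExecutor(max_workers=n_jobs) as pool:
    results = list(pool.map(lambda task: fit_and_score(*task), tasks))

# Group the per-fold scores back by model.
scores_by_model = {name: [] for name in model_names}
for name, fold_index, score in results:
    scores_by_model[name].append(score)
```

With 4 models and 10 folds this yields 40 independent tasks, so up to min(n_jobs, 40) of them can run at once, which is the min(n_jobs, n_models*n_folds) behaviour described above.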
