sklearn Logistic Regression with n_jobs=-1 doesn't actually parallelize

I'm trying to train sklearn's logistic regression on a huge dataset.
I've set the parameter n_jobs=-1 (I have also tried n_jobs=5, 10, ...), but when I open htop, I can see that it still uses only one core.
Does it mean that logistic regression just ignores the n_jobs parameter?
How can I fix this? I really need this process to become parallelized...
P.S. I am using sklearn 0.17.1

The parallel backend also depends on the solver. To use multiple cores, the multiprocessing backend is needed,
but a solver like 'sag' can only use the threading backend.
Most of the time, it can also be held up by a lot of pre-processing.

There are multiple reasons for it, if you read the documentation carefully.
Can you try this set of parameters:
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(penalty='l2',
                           random_state=42,
                           C=0.2,
                           n_jobs=-1,
                           solver='sag',
                           multi_class='ovr',
                           max_iter=200,
                           verbose=10)
It takes a minute or two to start multiple threads.
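As a rough sketch (not taken from the answers above), one way to get per-class parallelism that does not depend on how a given sklearn version treats n_jobs inside LogisticRegression is to wrap the model in OneVsRestClassifier, whose own n_jobs fits one binary model per class in parallel. This assumes a multiclass target; with a binary target there is only one model to fit, so there is nothing to spread across cores. The synthetic dataset and its sizes are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the real data (sizes are arbitrary assumptions).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=4, random_state=42)

# One binary 'sag' model per class; the wrapper distributes them over cores.
base = LogisticRegression(C=0.2, solver='sag', max_iter=200, random_state=42)
ovr = OneVsRestClassifier(base, n_jobs=-1)
ovr.fit(X, y)

# Number of binary models that were fitted (one per class).
print(len(ovr.estimators_))
```

Watching htop while this runs on a large multiclass dataset should show up to one busy core per class.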

Related

XGBoost and scikit-optimize: BayesSearchCV and XGBRegressor are incompatible - why?

I have a very large dataset (7 million rows, 54 features) that I would like to fit a regression model to using XGBoost. To train the best possible model, I want to use BayesSearchCV from scikit-optimize to run the fit repeatedly for different hyperparameter combinations until the best performing set is found.
For a given set of hyperparameters, XGBoost takes a very long time to train a model, so in order to find the best hyperparameters without spending days on every permutation of training folds, hyperparameters, etc., I want to multithread both XGBoost and BayesSearchCV. The relevant part of my code looks like this:
xgb_pipe = Pipeline([('clf', XGBRegressor(random_state = 42, objective='reg:squarederror', n_jobs = 1))])
xgb_fit_params = {'clf__early_stopping_rounds': 5, 'clf__eval_metric': 'mae', 'clf__eval_set': [[X_val.values, y_val.values]]}
xgb_kfold = KFold(n_splits = 5, random_state = 42)
xgb_unsm_cv = BayesSearchCV(xgb_pipe, xgb_params, cv = xgb_kfold, n_jobs = 2, n_points = 1, n_iter = 15, random_state = 42, verbose = 4, scoring = 'neg_mean_absolute_error', fit_params = xgb_fit_params)
xgb_unsm_cv.fit(X_train.values, y_train.values)
However, I've found that when n_jobs > 1 in the BayesSearchCV call, the fit crashes and I get the following error:
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
This error persists whenever I use more than 1 thread in the BayesSearchCV call, and is independent of the memory I provide.
Is this some fundamental incompatibility between XGBoost and scikit-optimize, or can both packages be forced to work together somehow? Without some way of multithreading the optimization, I fear that fitting my model will take weeks to perform. What can I do to fix this?

I don't think the error has anything to do with an incompatibility between the libraries. Rather, since you are asking for two different multi-threaded operations, you are running out of memory: your program is trying to load the complete dataset into RAM not once but several times, once per worker (depending on the number of threads).
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
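The SIGKILL(-9) exit code means the operating system killed the worker, which typically happens when the system runs out of available memory. (A segmentation fault, the other cause the message mentions, is an invalid memory access rather than an out-of-memory condition.)
Note that XGBoost is a RAM-hungry beast; coupling it with another multi-threaded operation is bound to take a toll (and personally, I would not recommend it on a daily-driver machine).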
The most viable solution would probably be to use a cloud service with more memory (beware of the costs), or to reduce the size of the dataset using statistical techniques like the ones mentioned in this kaggle notebook and Data Science StackExchange Article.
The idea is: either you upscale the hardware (monetary cost), go head-on with single-threaded BayesSearchCV (time cost), or downsize the data using whatever technique best suits you.
Finally, the answer still is that the libraries are probably compatible, just the data is too large for the available RAM.
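To put a number on this, here is a back-of-the-envelope estimate for the 7 million rows and 54 features from the question. It assumes a dense float64 array and, as a worst case, that the parent process and every worker each hold a full copy; the real footprint depends on how the data is shared and on XGBoost's own internal structures, so treat the result as a rough order of magnitude only.

```python
def dense_size_gb(n_rows, n_cols, bytes_per_value=8):
    """Size of a dense numeric array in gigabytes (1 GB = 1e9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9

n_rows, n_cols = 7_000_000, 54

one_copy = dense_size_gb(n_rows, n_cols)          # float64 values
float32_copy = dense_size_gb(n_rows, n_cols, 4)   # same data cast to float32

# Worst-case assumption: the parent plus each worker holds its own copy.
n_workers = 2
worst_case = one_copy * (1 + n_workers)

print(f"one float64 copy: {one_copy:.3f} GB")
print(f"one float32 copy: {float32_copy:.3f} GB")
print(f"worst case with {n_workers} workers: {worst_case:.3f} GB")
```

Casting to float32 halves the raw array size, which is one cheap way to downsize before reaching for more hardware.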

Should I use LassoCV or GridSearchCV to find an optimal alpha for Lasso?

From my understanding, when using Lasso regression, you can use GridSearchCV or LassoCV in sklearn to find the optimal alpha, the regularization parameter. Which one is preferred over the other?

You can get the same results with both. LassoCV makes it easier by letting you pass an array of alpha values to alphas, as well as a cross-validation parameter, directly into the estimator.
To do the same thing with GridSearchCV, you would have to pass it a Lasso estimator, a grid of alpha values (i.e. {'alpha': [.5, 1, 5]}) and the CV parameter.
I would not recommend one over the other though. The only advantage I can see is that you can access cv_results_ as well as many other attributes if you use GridSearchCV. This may be helpful if you want a summary of all the models fitted for the alphas you tried. On the other hand, as pointed out by @amiola, LassoCV can take advantage of results computed in previous steps of the cross-validation process (aka warm-starting), which may result in faster fitting times.
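A small side-by-side sketch of the two approaches on synthetic data (the dataset and the alpha grid are placeholders). GridSearchCV is given scoring='neg_mean_squared_error' here on the assumption that this makes its selection criterion comparable to the mean squared error LassoCV uses; with that setting the two would normally pick the same alpha, though this is not guaranteed in every case.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (sizes and noise level are arbitrary).
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
alphas = [0.5, 1, 5]

# LassoCV: the alpha grid and the CV setting go straight into the estimator.
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)

# GridSearchCV: wrap a plain Lasso and pass the grid and CV setting separately.
grid = GridSearchCV(Lasso(), {'alpha': alphas}, cv=5,
                    scoring='neg_mean_squared_error').fit(X, y)

print("LassoCV alpha:     ", lasso_cv.alpha_)
print("GridSearchCV alpha:", grid.best_params_['alpha'])

# Per-alpha summary that GridSearchCV exposes.
print(grid.cv_results_['mean_test_score'])
```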

Wrapper Ktrain for Text Classification Multiclass - Parallelize the fit with CPU

I'm trying to solve a multiclass classification problem with BERT by leveraging the ktrain wrapper. Since I have a large amount of data, I wanted to know whether it is possible to parallelize the learner.fit_onecycle(3e-4, 5) function or the autofit() function,
for example as happens with sklearn, where you specify the number of parallel jobs with the n_jobs parameter.
Can you help me?
Thanks

First, if you use DistilBert instead of BERT in ktrain, it will cut your training time in half. Second, multiple cores should be used by default by TensorFlow2 when calling any of the fit methods in ktrain using a CPU. You can control the number of CPU cores, if you want.
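A minimal sketch of how the number of CPU threads can be set, assuming TensorFlow 2 is the backend in use. The values 4 and 2 are arbitrary, and these calls have to run before TensorFlow executes any operation (that is, before building the model or calling a fit method).

```python
import tensorflow as tf

# Threads used inside a single op (e.g. one large matrix multiplication).
tf.config.threading.set_intra_op_parallelism_threads(4)

# Threads used to run independent ops concurrently.
tf.config.threading.set_inter_op_parallelism_threads(2)

print(tf.config.threading.get_intra_op_parallelism_threads())
print(tf.config.threading.get_inter_op_parallelism_threads())
```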

Do the two kinds of xgboost interface work exactly the same?

I'm currently working on an InClass competition on Kaggle.
I have read the official Python API reference, and I'm somewhat confused about the two kinds of interface, especially regarding grid search, cross-validation and early stopping.
In the XGBoost API, I can use xgb.cv(), which splits the whole dataset into two parts to cross-validate, to tune good hyperparameters and then get the best_iteration.
Thus I can set num_boost_round to the best_iteration. To make maximal use of the data, I train on the whole dataset again with the well-tuned hyperparameters, and then use that model to classify. The only drawback is that I have to write the grid search code myself.
ATTENTION: the cross-validation set changes at each fold, so the training result will have no specific tendency toward any part of the data.
But in sklearn, it seems that I cannot get best_iteration using clf.fit() as I do with the xgb model. Indeed, the fit() method has early_stopping_rounds and eval_set to implement early stopping. Many people implement the code like this:
X_train, X_test, y_train, y_test = train_test_split(train, target_label, test_size=0.2, random_state=0)
clf = GridSearchCV(xgb_model, para_grid, scoring='roc_auc', cv=5,
                   verbose=True, refit=True, return_train_score=False)
clf.fit(X_train, y_train, early_stopping_rounds=30, eval_set=[(X_test, y_test)])
....
clf.predict(something)
But the problem is that I have split the data into two parts up front. The validation set will not change at each fold, so the result may be biased toward this random part of the whole dataset. The same problem also occurs in the grid search: the final parameters may tend to fit X_test and y_test more.
I'm fond of GridSearchCV in sklearn, but I also want the eval_set to change at each fold, just as xgb.cv does. I believe it can make full use of the data while preventing overfitting.
What should I do?
I have thought of two ways:
1. Use the XGB API and write the grid search myself.
2. Use the sklearn API and change the eval_set manually at each fold.
Are there any more convenient methods?

As you have summarised, both approaches have advantages and disadvantages.
xgb.cv will use the left-out fold for early stopping, thus you do not need an additional split into a validation/train sample to determine when to trigger early stopping.
GridSearchCV (or maybe try RandomizedSearchCV) will handle the parameter grid and the optimal choice for you.
Note that it is not a problem to use a fixed sub-sample for early stopping in all CV folds, so I do not think that you have to do anything like "change the eval_set manually at each fold". The evaluation sample used in early stopping does not directly affect the model parameters; it is used to decide when the evaluation metric on a hold-out sample stops improving. For the final model you can drop early stopping: you can see when the model stops with the optimal hyperparameters using the aforementioned split, and then use that number of trees as a fixed parameter in the final model fit.
So in the end it is a matter of taste, as in both cases you will need to compromise on something. IMO, the sklearn API is the optimal choice, as it lets you use the rest of the sklearn tools (e.g. for data pre-processing) in a natural way in a pipeline in CV, and it gives a homogeneous interface to model training for various approaches. But in the end it is up to you.

scikit-learn OpenMP libsvm

I am using scikit-learn SVC to classify some data. I would like to increase the training performance.
clf = svm.SVC(cache_size=4000, probability=True, verbose=True)
Since scikit-learn interfaces with libsvm and libsvm uses OpenMP, I was hoping that:
export OMP_NUM_THREADS=16
would run on multiple cores.
Unfortunately this did not help.
Any Ideas?
Thanks

There is no OpenMP support in the current binding for libsvm in scikit-learn. However, if you have performance issues with sklearn.svm.SVC, it is very likely that you should use a more scalable model instead.
If your data is high dimensional it might be linearly separable. In that case it is advised to first try simpler models such as naive Bayes models or sklearn.linear_model.Perceptron, which are known to be very fast to train. You can also try sklearn.linear_model.LogisticRegression and sklearn.svm.LinearSVC, both implemented using liblinear, which is more scalable than libsvm albeit less memory efficient than other linear models in scikit-learn.
If your data is not linearly separable, you can try sklearn.ensemble.ExtraTreesClassifier (adjust the n_estimators parameter to trade off training speed vs. predictive accuracy).
Alternatively you can try to approximate an RBF kernel using the RBFSampler transformer of scikit-learn and fitting a linear model on the output:
http://scikit-learn.org/dev/modules/kernel_approximation.html
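A minimal sketch of the kernel-approximation idea, assuming SGDClassifier as the linear model (any linear estimator would do) and placeholder values for gamma and n_components, both of which would need tuning on real data. The StandardScaler step is an addition for numerical convenience, not something the answer prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Random Fourier features approximating an RBF kernel, then a linear classifier.
model = make_pipeline(
    StandardScaler(),
    RBFSampler(gamma=0.1, n_components=300, random_state=0),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X, y)

train_accuracy = model.score(X, y)
print(train_accuracy)
```

Training cost grows roughly linearly with the number of samples here, which is the point of replacing the exact kernel SVM.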

If you are using cross validation or grid search in scikit-learn then you can use multiple CPUs with the n_jobs parameter:
GridSearchCV(..., n_jobs=-1)
cross_val_score(..., n_jobs=-1)
Note that cross_val_score only needs one job per fold, so if your number of folds is less than your number of CPUs you still won't be using all of your processing power.
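A small sketch of that point, on synthetic data with an arbitrary choice of 5 folds: the number of fits that can run at the same time is capped by the number of folds, whatever n_jobs is set to.

```python
import os

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data (placeholder sizes).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

n_folds = 5
# One SVC fit per fold, distributed over the available cores.
scores = cross_val_score(SVC(), X, y, cv=n_folds, n_jobs=-1)

# At most one job per fold, so extra cores beyond n_folds stay idle.
max_busy_cores = min(n_folds, os.cpu_count() or 1)
print(scores.mean(), max_busy_cores)
```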
LibSVM can use OpenMP if you compile it and use it directly, as per these instructions in the LibSVM FAQ. So you could export your scaled data in LibSVM format (here's a StackOverflow question on how to do that) and use LibSVM directly to train on your data. But that will only be of benefit if you're grid searching or want to know accuracy scores; as far as I know, the model LibSVM creates cannot be used in scikit-learn.
There is also a GPU accelerated version of LibSVM which I have tried and is extremely fast, but is not based on the current LibSVM version. I have talked to the developers and they say they hope to release a new version soon.

Although this thread is over a year old, I thought it was worth answering.
I wrote a patch adding OpenMP support to scikit-learn for both libsvm and liblinear (LinearSVC), available here: https://github.com/fidlr/sklearn-openmp
It is based on libsvm's FAQ on how to add OpenMP support, and on the multi-core implementation of liblinear.
Just clone the repo and run sklearn-build-openmp.sh to apply the patch and build it.
Timing OMP_NUM_THREADS=4 python plot_permutation_test_for_classification.py:
libsvm with a linear kernel: time dropped by a factor of 2.3.
RBF kernel: same.
liblinear with 4 threads: time dropped by a factor of 1.6.
Details and usage information can be found here:
http://fidlr.org/post/137303264732/scikit-learn-017-with-libsvm-openmp-support
