Scikit-Learn: Several X-Vals in parallel? - python

I would like to try several different models for my data and crossvalidate them, so the results are somewhat reliable.
For my crossvalidation I call:
cross_val_score(model, X, y, scoring = 'mean_squared_error', cv=kf, n_jobs = -1)
which does my 10-fold crossvalidation in parallel. Since the machine I'm running on has 40 cores and enough memory, I would like to try four different values for "model" in parallel, each doing a 10-fold crossvalidation.
However, when I try to do it using joblib in the following way, I get an error:
results = Parallel(n_jobs = num_jobs)(delayed(crossVal)(model) for model in models)
/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:1433: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1 for train, test in cv)
where crossVal is a method I defined, which consists mainly of calling cross_val_score.
Is there an elegant way to do this without starting several different python files manually?

Joblib can use either a multiprocessing or a threading backend; by default it uses multiprocessing (because of the CPython implementation, threading is faster only in some particular cases; I don't want to go into details here, you can find tons of articles about CPython and the Python GIL).
It's not an error, it's just a warning telling you that you tried to create processes from within processes, i.e. with this line:
results = Parallel(n_jobs = num_jobs)(delayed(crossVal)(model) for model in models)
you already spawned some number of processes (n_jobs), and then each cross_val_score inside your crossVal tries to do the same thing (spawn its own processes), because cross_val_score uses multiprocessing by default. Joblib does not allow such nesting with the multiprocessing backend, so AFAIK it emits this warning and runs the nested Parallel loop in a single process, i.e. the cross_val_score internals now run in a single thread, but your crossVal function itself still runs in multiprocessing mode.
You can avoid this warning by getting rid of either of those two multiprocessing layers. You can drop the nested multiprocessing by calling:
cross_val_score(..., n_jobs=1)
in your crossVal function, or you can call cross_val_score with its defaults several times in a simple loop, without the outer multiprocessing, and then aggregate the results, e.g.:
results = [cross_val_score(estimator = est, ...) for est in estimators]
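To make the first option concrete, here is a minimal sketch (assuming crossVal just wraps cross_val_score, and that X, y, kf and models are the objects from your question); the outer Parallel fans out over models while each inner cross-validation stays single-process:

from joblib import Parallel, delayed
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

def crossVal(model):
    # inner n_jobs=1 avoids nested multiprocessing
    return cross_val_score(model, X, y, scoring='mean_squared_error', cv=kf, n_jobs=1)

results = Parallel(n_jobs=len(models))(delayed(crossVal)(m) for m in models)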
In the first case you can run min(n_models, n_jobs) fits simultaneously (in your original case, when joblib gives the warning, you are already doing this implicitly); in the second case, min(n_folds, n_cores). If you want to run min(n_jobs, n_models*n_folds) jobs, you should use GridSearchCV, because internally it spawns jobs in this way:
out = Parallel(
    n_jobs=self.n_jobs, verbose=self.verbose,
    pre_dispatch=pre_dispatch
)(
    delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_,
                            train, test, self.verbose, parameters,
                            self.fit_params, return_parameters=True,
                            error_score=self.error_score)
    for parameters in parameter_iterable
    for train, test in cv)
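For example, one way to apply this to several different models (a hedged sketch, assuming a recent scikit-learn where GridSearchCV lives in sklearn.model_selection; the pipeline step name 'model' and the candidate estimators below are illustrative, not from your question, while X and y are your data) is to make the estimator itself a grid parameter:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor

# Placeholder step; the grid swaps in each candidate estimator.
pipe = Pipeline([('model', Ridge())])
param_grid = {'model': [Ridge(), Lasso(), RandomForestRegressor()]}
kf = KFold(n_splits=10)

search = GridSearchCV(pipe, param_grid, cv=kf,
                      scoring='neg_mean_squared_error', n_jobs=-1)
search.fit(X, y)  # up to n_models * n_folds fits can run in parallel

This way GridSearchCV sees n_models * n_folds independent (fit, score) tasks and can dispatch up to n_jobs of them at once.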

How to set `n_jobs` in sklearn ElasticNet

I am trying to run the ElasticNet function from scikit-learn on a machine with multiple CPUs. However, I need the ElasticNet fit to use only one CPU, since I need to run other fitting routines in parallel on the remaining CPUs. Whenever the thread containing ElasticNet starts the fit, it quickly takes over any free space on all CPUs instead of just the one it's called on. Because other code routines are already running on these machines, ElasticNet oversubscribes the machines and slows everything down tremendously, including itself. I need these routines to run in parallel, so I cannot just run the ElasticNet fit serially ahead of time.
Unlike other regression functions (linear, logistic, ...) in sklearn, there is no n_jobs argument for ElasticNet. Reading the documentation, it appears that ElasticNet defaults to the n_jobs specified in joblib.parallel_backend, which itself defaults to n_jobs=-1, i.e. all available CPUs.
I am trying to figure out the proper method for specifying n_jobs in parallel_backend so that it will override the default for ElasticNet. Following are three attempts to change n_jobs that have not worked so far.
Attempt 1
from joblib import parallel_backend
from sklearn.linear_model import ElasticNet

with parallel_backend('loky', n_jobs=1):
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                       normalize=False, copy_X=True, max_iter=10000, tol=10,
                       random_state=42, precompute=False, warm_start=False,
                       positive=False, selection='cyclic')
    model.fit(predictors, response)
Attempt 2
from sklearn.utils import parallel_backend
from sklearn.linear_model import ElasticNet

with parallel_backend('loky', n_jobs=1):
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                       normalize=False, copy_X=True, max_iter=10000, tol=10,
                       random_state=42, precompute=False, warm_start=False,
                       positive=False, selection='cyclic')
    model.fit(predictors, response)
Neither Attempt 1 nor Attempt 2 throws an error, but neither appears to change n_jobs from the default of using every available CPU. ElasticNet still takes over all available CPU space across all CPUs and quickly oversubscribes the machines.
Attempt 3
This is my first time using joblib directly, and so I've been reading the documentation on parallelization with joblib. Most of the example routines placed in the parallel_backend container are prefaced with the Parallel() helper class.
Following the examples, I modified Attempt 1 in the following way:
from joblib import parallel_backend
from sklearn.linear_model import ElasticNet

with parallel_backend('loky', n_jobs=1):
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                       normalize=False, copy_X=True, max_iter=10000, tol=10,
                       random_state=42, precompute=False, warm_start=False,
                       positive=False, selection='cyclic')
    Parallel(n_jobs=1)(model.fit(predictors, response))
However when running Attempt 3, I get the following error message:
TypeError: 'ElasticNet' object is not iterable
Does anyone know how to set n_jobs=1 for sklearn's ElasticNet? There must be some way to do this because ElasticNetCV has n_jobs as a possible argument. Any help with this is greatly appreciated!
In addition to the first solution, you can limit MKL to a single thread (the equivalent of setting MKL_NUM_THREADS=1) by:
import mkl
mkl.set_num_threads(1)
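If the mkl module is not available, another option (a sketch only, not tested on your setup) is the threadpoolctl package, which recent scikit-learn versions use internally to manage BLAS/MKL thread pools; the hyperparameters below are illustrative:

from threadpoolctl import threadpool_limits
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000, random_state=42)
# Cap the BLAS/MKL thread pool to a single thread for just this fit.
with threadpool_limits(limits=1, user_api='blas'):
    model.fit(predictors, response)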

XGBoost and scikit-optimize: BayesSearchCV and XGBRegressor are incompatible - why?

I have a very large dataset (7 million rows, 54 features) that I would like to fit a regression model to using XGBoost. To train the best possible model, I want to use BayesSearchCV from scikit-optimize to run the fit repeatedly for different hyperparameter combinations until the best performing set is found.
For a given set of hyperparameters, XGBoost takes a very long time to train a model, so in order to find the best hyperparameters without spending days on every permutation of training folds, hyperparameters, etc., I want to multithread both XGBoost and BayesSearchCV. The relevant part of my code looks like this:
xgb_pipe = Pipeline([('clf', XGBRegressor(random_state = 42, objective='reg:squarederror', n_jobs = 1))])
xgb_fit_params = {'clf__early_stopping_rounds': 5, 'clf__eval_metric': 'mae', 'clf__eval_set': [[X_val.values, y_val.values]]}
xgb_kfold = KFold(n_splits = 5, random_state = 42)
xgb_unsm_cv = BayesSearchCV(xgb_pipe, xgb_params, cv = xgb_kfold, n_jobs = 2, n_points = 1, n_iter = 15, random_state = 42, verbose = 4, scoring = 'neg_mean_absolute_error', fit_params = xgb_fit_params)
xgb_unsm_cv.fit(X_train.values, y_train.values)
However, I've found that when n_jobs > 1 in the BayesSearchCV call, the fit crashes and I get the following error:
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
This error persists whenever I use more than 1 thread in the BayesSearchCV call, and is independent of the memory I provide.
Is this some fundamental incompatibility between XGBoost and scikit-optimize, or can both packages be forced to work together somehow? Without some way of multithreading the optimization, I fear that fitting my model will take weeks to perform. What can I do to fix this?
I don't think the error has anything to do with incompatibility between the libraries. Rather, since you are asking for two different multi-threaded operations, you are running out of memory: your program is trying to put the complete dataset into RAM not once but multiple times, once for each worker instance (depending on the number of threads).
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
The workers exited with SIGKILL(-9), which typically means the operating system killed them because it ran out of available memory.
Note that XGBoost is a RAM-hungry beast; coupling it with another multi-threaded operation is bound to take a toll (and personally, I would not recommend it on a daily-driver machine).
The most viable solutions are probably to use Google's TPUs or some other cloud service (beware of the costs), or to reduce the size of the dataset using statistical techniques like the ones mentioned in this Kaggle notebook and this Data Science StackExchange article.
The idea is: either you scale up the hardware (monetary cost), run the BayesSearchCV single-threaded (time cost), or downsize the data using whatever technique best suits you.
Finally, the answer is still that the libraries are probably compatible; the data is just too large for the available RAM.
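As a rough sketch of the downsizing idea (assuming your features live in a pandas DataFrame such as X_train; the helper below is illustrative, not from any library), casting 64-bit columns down to 32-bit roughly halves the memory each worker has to copy:

import numpy as np
import pandas as pd

def downcast(df):
    # Shrink float64 -> float32 and int64 -> the smallest safe integer type.
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    return df

X_train = downcast(X_train)  # smaller copies for every BayesSearchCV worker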

Using model.predict (Keras + TF) in multiprocessing

I have the following problem. I'm using a TensorFlow Keras model to evaluate continuous sensor data. My input for my model consists of 15 sensor data frames. Because the function model.predict() takes nearly 1 second, I wanted to execute this function asynchronously so that I can collect the next data frames during that time period.
To accomplish this I created a Pool with the multiprocessing library and a function that wraps model.predict. My code looks something like this:
def predictData(data):
    return model.predict(data)

global model
model = tf.keras.models.load_model("Network.h5")
model._make_predict_function()
p = Pool(processes = 4)
...
res = p.apply_async(predictData, ([[iinput]],))
print(res.get(timeout = 10))
Now I always get a timeout error when calling predictData(). It seems like model.predict() is not working correctly. What am I doing wrong?
The reason is that each process you spawn will require a newly initialized version of your model, which it uses to make predictions. Therefore you have to make sure you instantiate/load your model for every spawned process. This is definitely not optimal.
This is a known caveat with multiprocessing for machine-learning training and/or inference. Some libraries come with multiprocessing features out of the box and provide parallelizable calls to their models. However, in most libraries, once you want to do multiprocessing you are on your own!
Make sure you instantiate your model once and then find a way to share that model across processes. One basic way to do that is to serve your model as a Flask service and then make predictions against that service to your heart's content. Cheers!
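As a rough sketch of that idea (the route, port and payload format are assumptions, not a reference implementation):

from flask import Flask, request, jsonify
import numpy as np
import tensorflow as tf

app = Flask(__name__)
model = tf.keras.models.load_model("Network.h5")  # loaded exactly once, in one process

@app.route("/predict", methods=["POST"])
def predict():
    data = np.array(request.json["data"])
    return jsonify(prediction=model.predict(data).tolist())

if __name__ == "__main__":
    app.run(port=5000)

Your data-collection loop can then POST frames to /predict without ever forking the model.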
It is possible to run multiple predictions in multiple concurrent Python processes; you just have to build, inside each independent process, its own TensorFlow computational graph and then call keras.model.predict.
Write a function which you will use with the multiprocessing module (with the Process or Pool class). Within this function you should build your model, the TensorFlow graph and whatever else you need, set all TensorFlow and Keras variables, then call the predict method on it, and finally pipe the result back to your master process.
For example:
from multiprocessing import Pool

def f(data):
    import tensorflow, keras
    # configure your tensorflow and keras settings here (e.g. GPU/CPU usage)
    keras_model = build_your_keras_model()
    result = keras_model.predict(data)
    return result

if __name__ == '__main__':
    p = Pool(processes = 4)
    res = p.apply_async(f, (data,))
    print(res.get(timeout = 10))

How can I use sklearn's GridSearchCV with data that doesn't fit in memory?

I have a dataset that is much too large to fit in memory, so I must train models in batches. I have wrapped my model in a GridSearchCV, a RandomizedSearchCV, or a BayesSearchCV (from scikit-optimize), but I see that I cannot train multiple instances of these on different parts of my enormous dataset and expect the best hyperparameters found by each to agree.
I have considered wrapping my estimators in a BatchVoter (of my own design) that manages reading from the database in batches and keeps a list of models. Passing this to the XSearchCV and updating the parameter-space dictionary so that all keys start with 'estimator__' might direct the search to set the parameters of the sub-object, but there is still a problem: a search is begun with a call to the .fit() method, which must take data.
Is there a clever way to use the native GridSearchCV with data that is too big to pass to the .fit() method?
Try dask. It supports data frames, arrays and collections. It consists of a scheduler and workers, and it also has a distributed scheduler, allowing it to process data frames across several machines.
Here is the description of how to parallelize models.
Here is the link to a complete module that can serve as a drop-in replacement for GridSearchCV.
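A minimal sketch of that drop-in route (assuming the module meant is dask-ml's GridSearchCV; the estimator, parameter grid and chunk sizes below are placeholders):

import dask.array as da
from dask_ml.model_selection import GridSearchCV
from sklearn.linear_model import SGDRegressor

# Lazily chunked arrays: only a few chunks need to be in memory at a time.
X = da.random.random((1000000, 50), chunks=(100000, 50))
y = da.random.random((1000000,), chunks=(100000,))

search = GridSearchCV(SGDRegressor(), {'alpha': [1e-4, 1e-3, 1e-2]})
search.fit(X, y)
print(search.best_params_)

With the distributed scheduler, the same code can spread the chunks and the candidate fits across several machines.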

sklearn Logistic Regression with n_jobs=-1 doesn't actually parallelize

I'm trying to train a huge dataset with sklearn's logistic regression.
I've set the parameter n_jobs=-1 (also have tried n_jobs = 5, 10, ...), but when I open htop, I can see that it still uses only one core.
Does it mean that logistic regression just ignores the n_jobs parameter?
How can I fix this? I really need this process to become parallelized...
P.S. I am using sklearn 0.17.1
The parallel backend also depends on the solver: if you want to utilize multiple cores, the multiprocessing backend is needed, but a solver like 'sag' can only use the threading backend.
Also, most of the time it can appear blocked because of the large amount of pre-processing. There are multiple reasons for this if you read the documentation carefully.
Can you try this set of parameters:
logit = LogisticRegression(penalty='l2',
                           random_state=42,
                           C=0.2,
                           n_jobs=-1,
                           solver='sag',
                           multi_class='ovr',
                           max_iter=200,
                           verbose=10
                           )
It can also take a minute or two before multiple threads actually start.
