How to parallelize with jupyter and sklearn? - python

I'm trying to parallelize scikit-learn's GridSearchCV. It's running in a Jupyter (Hub) notebook environment. After some research I found this code:
from sklearn.externals.joblib import Parallel, parallel_backend, register_parallel_backend
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend
c = Client(profile='myprofile')
print(c.ids)
bview = c.load_balanced_view()
register_parallel_backend('ipyparallel', lambda: IPythonParallelBackend(view=bview))
grid = GridSearchCV(pipeline, cv=3, n_jobs=4, param_grid=param_grid)
with parallel_backend('ipyparallel'):
    grid.fit(X_train, Y_train)
Note that I've set the n_jobs parameter to 4, which is the number of the machine's CPU cores (it's what nproc returns).
But it doesn't seem to work: ImportError: cannot import name 'register_parallel_backend', although I installed joblib with conda install joblib and also tried pip install -U joblib.
So, what's the best way to parallelize the GridSearchCV in this environment?
UPDATE:
Without ipyparallel and just setting the n_jobs parameter:
grid = GridSearchCV(pipeline, cv=3, n_jobs=4, param_grid=param_grid)
grid.fit(X_train, Y_train)
Result is the following warning message:
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning:
Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
It seems to end up in sequential execution rather than parallel execution.
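In the standalone joblib package the backend helpers are exposed at the top level, so a hedged variant of the same setup would import them from joblib directly instead of sklearn.externals.joblib. This is only a sketch: pipeline, param_grid, X_train and Y_train are assumed to be defined as in the question, and the ipyparallel engines for the profile must already be running (e.g. started with ipcluster start --profile=myprofile):
from joblib import parallel_backend, register_parallel_backend
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend
from sklearn.model_selection import GridSearchCV

# connect to the running engines and expose the load-balanced view to joblib
c = Client(profile='myprofile')
bview = c.load_balanced_view()
register_parallel_backend('ipyparallel', lambda: IPythonParallelBackend(view=bview))

# run the grid search with joblib dispatching work to the ipyparallel engines
grid = GridSearchCV(pipeline, cv=3, n_jobs=4, param_grid=param_grid)
with parallel_backend('ipyparallel'):
    grid.fit(X_train, Y_train)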

Related

Perceptron in Python

I'm using the sklearn library and have a question about the n_iter_ attribute. When executing the code I get TypeError: __init__() got an unexpected keyword argument 'n_iter_'. I also tried using n_iter, but I get the same error, so maybe I am misspelling the attribute. This is not all the code; if you need more information, let me know.
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn= Perceptron(n_iter_=40, eta0= 0.1, random_state=1)
ppn.fit(X_train_std, y_train)
The Perceptron model in sklearn.linear_model doesn't have n_iter_ as a parameter. It has the following parameters with similar names.
max_iter: int, default=1000
The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit method.
and
n_iter_no_change : int, default=5
Number of iterations with no improvement to wait before early stopping.
New in version 0.20.
Looking at your code, it seems you intended to use max_iter.
So do
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
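For reference, a minimal end-to-end sketch with max_iter on the Iris data; the scaler and split mirror the imports in the question, and the concrete values are only illustrative:
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# standardize the features before fitting the perceptron
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
print(ppn.score(X_test_std, y_test))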
Note:
You should first upgrade your scikit-learn using
pip install --upgrade scikit-learn
The parameter given in the documentation is n_iter, not n_iter_ (the trailing underscore denotes the fitted attribute that reports how many iterations actually ran).
So this should work:
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn=Perceptron(n_iter=40, eta0= 0.1, random_state=1)
ppn.fit(X_train_std, y_train)
First check which Scikit-learn version you have installed. You can do that by executing
python -c "import sklearn;print(sklearn.__version__)"
on your terminal/environment to which you have the python that executes your code.
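From inside a notebook cell the equivalent check is:
import sklearn
print(sklearn.__version__)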
The Perceptron constructor parameters changed from n_iter to max_iter in version 0.20. The best way to keep up is to head to the documentation or source code of the version you actually have installed and read the parameters there, e.g.:
documentation: Perceptron docs v0.23
source code: Perceptron v0.23 source

Issue defining KneighborsClassifier in Jupyter Notebooks

I am attempting to use KNN on the Iris data set as a "Hello World" of machine learning. I am using a Jupyter Notebook from Anaconda and have been clearly documenting each step. A "NameError: name 'knn' is not defined" exception is currently being thrown when I attempt to call knn.fit(X, Y). What am I missing here? I tested the definition of knn by calling print(knn) and I get the following output:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=1, p=2,
weights='uniform')
Code below:
#import the load_iris dataset
from sklearn.datasets import load_iris
#save "bunch" object containing iris dataset and its attributes
iris = load_iris()
X = iris.data
Y = iris.target
#import class you plan to use
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 1)
#Fit the model with data (aka "model training")
knn.fit(X,Y)
I had the same issue. Running the following worked for me:
import sklearn.neighbors
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
ran in:
Python 3.6.9
Update your scikit-learn module. If you are using a Jupyter notebook you can update it by running the command below:
conda install -c conda-forge scikit-learn
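For completeness, a minimal sketch of the same workflow that runs in one fresh cell, so the fit call does not depend on earlier cell state (the n_neighbors value mirrors the question; the sample passed to predict is made up):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# load the iris data and split features from the target
iris = load_iris()
X, Y = iris.data, iris.target

# define and fit the classifier in the same cell that uses it
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, Y)
print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))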

Parallel threading with xgboost?

According to its documentation, xgboost has an n_jobs parameter. However, when I attempt to set n_jobs, I get this error:
TypeError: __init__() got an unexpected keyword argument 'n_jobs'
The same happens for some other parameters like random_state. I assumed this might be a version issue, but it seems I have the latest version (0.6a2, installed with pip).
There isn't much needed for me to reproduce the error:
from xgboost import XGBClassifier
estimator_xGBM = XGBClassifier(max_depth=5, learning_rate=0.05, n_estimators=400, n_jobs=-1).fit(x_train, y_train)
Any ideas?
I installed xgboost yesterday (25.09.2017):
If you install with pip or conda, that xgboost version does not support the n_jobs parameter, only the nthread parameter.
If you build xgboost from the GitHub repository, you can use n_jobs though.
Update: n_jobs is the number of parallel threads used to run xgboost (it replaces nthread) for all estimators such as XGBClassifier, XGBRanker, XGBRegressor, etc.
Reference - here
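So one option is simply to upgrade the package and keep the n_jobs spelling. A minimal sketch, assuming a release recent enough to accept n_jobs (x_train and y_train are placeholders from the question):
# pip install --upgrade xgboost
from xgboost import XGBClassifier

estimator_xGBM = XGBClassifier(max_depth=5, learning_rate=0.05, n_estimators=400, n_jobs=-1)
estimator_xGBM.fit(x_train, y_train)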
nthread is the same as n_jobs, but n_jobs is preferred now.
Maybe you can try nthread instead:
nthread : int
Number of parallel threads used to run xgboost. (Deprecated, please use n_jobs)
n_jobs : int
Number of parallel threads used to run xgboost. (replaces nthread)
https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/sklearn.py#L71-L74
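Against the pip version from the question (0.6a2), a sketch of the same call with nthread instead of n_jobs would be:
from xgboost import XGBClassifier

estimator_xGBM = XGBClassifier(max_depth=5, learning_rate=0.05, n_estimators=400, nthread=-1)
estimator_xGBM.fit(x_train, y_train)  # x_train / y_train assumed as in the question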
Use this to pick an n_jobs value for your machine:
import multiprocessing
n_jobs = multiprocessing.cpu_count()-1
print(n_jobs)

Using dask for task scheduling to run machine learning models in parallel

So basically what I want is to run ML Pipelines in parallel.
I have been using scikit-learn, and I have decided to use DaskGridSearchCV.
What I have is a list of gridSearchCV = DaskGridSearchCV(pipeline, grid, scoring=evaluator) objects, and I run each of them sequentially:
for gridSearchCV in list:
    gridSearchCV.fit(train_data, train_target)
    predicted = gridSearchCV.predict(test_data)
If I have N different GridSearch objects, I want to take advantage as much as possible of all the available resources. If there are resources to run 2, 3, 4, ... or N at the same time in parallel, I want to do so.
So I started trying a few things based on dask's documentation. First I tried dask.threaded and dask.multiprocessing but it ends up being slower and I keep getting:
/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1
This is the code snippet:
def run_pipeline(self, gs, data):
    train_data, test_data, train_target, expected = train_test_split(data, target, test_size=0.25, random_state=33)
    model = gs.fit(train_data, train_target)
    predicted = gs.predict(test_data)

values = [delayed(run_pipeline)(gs, df) for gs in gs_list]
compute(*values, get=dask.threaded.get)
Maybe I am approaching this the wrong way, would you have any suggestions for me?
Yes, but I have a list of GridSearch objects, for example one using DecisionTree and another using RandomForest, and I want to run them in parallel as long as there are resources for it.
If this is your goal, I would merge them all into the same grid. Scikit-Learn Pipelines support grid-search across steps, which would allow you to do your search in only a single GridSearchCV object (for an example of this from the scikit-learn docs, see here). If you only have a single estimator (instead of a pipeline), you can use a Pipeline with a single step as a proxy. For example:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import dask_searchcv as dcv
pipeline = Pipeline([('est', DecisionTreeClassifier())])
grid = [
    {'est': [DecisionTreeClassifier()],
     'est__max_features': ['sqrt', 'log2'],
     # more parameters for DecisionTreeClassifier
     },
    {'est': [RandomForestClassifier()],
     'est__max_features': ['sqrt', 'log2'],
     # more parameters for RandomForestClassifier
     },
    # more estimator/parameter subsets
]
gs = dcv.GridSearchCV(pipeline, grid)
gs.fit(train_data, train_target)
gs.predict(test_data)
Note that for this specific case (where all the estimators share the same parameters), you can merge the grid:
grid = {'est': [DecisionTreeClassifier(), RandomForestClassifier()],
        'est__max_features': ['sqrt', 'log2'],
        # more parameters shared by all estimators
        }
As for why your delayed example didn't work: dask.delayed is for wrapping functions that don't call dask code. Since you're calling fit on a dask_searchcv.GridSearchCV object (which uses dask to compute) inside the delayed function (which also uses dask to compute), you're nesting calls to the dask scheduler, which can lead to poor performance at best and weird bugs at worst.
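To illustrate the distinction, here is a minimal sketch of delayed used as intended, around a plain function that does no dask work itself (the function and inputs are made up for illustration):
import dask
from dask import delayed

def plain_task(x):
    # ordinary Python, no dask calls inside
    return x * x

values = [delayed(plain_task)(i) for i in range(8)]
results = dask.compute(*values)  # the scheduler runs the plain functions in parallel
print(results)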

scikit-learn GridSearchCV doesn't work as samples increase

The following script runs fine on my machine with n_samples=1000, but dies (no error, just stops working) with n_samples=10000. This only happens using the Anaconda python distribution (numpy 1.8.1) but is fine with Enthought's (numpy 1.9.2). Any ideas what would be causing this?
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.metrics.scorer import log_loss_scorer
from sklearn.cross_validation import KFold
from sklearn import datasets
import numpy as np
X, y = datasets.make_classification(n_samples=10000, n_features=50,
                                    n_informative=35, n_redundant=10,
                                    random_state=1984)
lr = LogisticRegression(random_state=1984)
param_grid = {'C': np.logspace(-1, 2, 4, base=2)}
kf = KFold(n=y.size, n_folds=5, shuffle=True, random_state=1984)
gs = GridSearchCV(estimator=lr, param_grid=param_grid, scoring=log_loss_scorer, cv=kf, verbose=100,
                  n_jobs=-1)
gs.fit(X, y)
Note: I'm using sklearn 0.16.1 in both distributions and am using OS X.
I've noticed that upgrading to numpy version 1.9.2 with Enthought distribution (by updating manually) breaks the grid search. I haven't had any luck downgrading Anaconda numpy version to 1.8.1 though.
Are you on Windows? If so, you need to protect the code with
if __name__ == "__main__":
    do_stuff()
Otherwise multiprocessing will not work.
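Applied to the script in the question, a sketch of that guard (keeping the old sklearn 0.16 import paths from the question) would look like:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV  # sklearn 0.16 import path, as in the question
from sklearn import datasets

def main():
    X, y = datasets.make_classification(n_samples=10000, n_features=50, random_state=1984)
    lr = LogisticRegression(random_state=1984)
    gs = GridSearchCV(estimator=lr, param_grid={'C': np.logspace(-1, 2, 4, base=2)}, n_jobs=-1)
    gs.fit(X, y)

if __name__ == "__main__":
    # on Windows, multiprocessing workers re-import this module,
    # so the parallel work has to live behind this guard
    main()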
Per Andreas's comment, the problem seems to be with multithreading in the linear algebra library. I solved it with the following command in the terminal:
export VECLIB_MAXIMUM_THREADS=1
My (weak) understanding is that this limits the linear algebra library's use of multiple threads and lets multiprocessing handle the parallelism as it wants.
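The same workaround can be set from Python instead of the shell, as long as the variable is exported before numpy (and its Accelerate/vecLib backend) is first imported; a sketch:
import os
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # must happen before numpy is imported

import numpy as np  # the BLAS backend now stays single-threaded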
