According to its documentation, xgboost has an n_jobs parameter. However, when I attempt to set n_jobs, I get this error:
TypeError: __init__() got an unexpected keyword argument 'n_jobs'
Same issue for some other parameters like random_state. I assumed this might be an update issue, but it seems I have the latest version (0.6a2, installed with pip).
There isn't much needed for me to reproduce the error:
from xgboost import XGBClassifier
estimator_xGBM = XGBClassifier(max_depth=5, learning_rate=0.05, n_estimators=400, n_jobs=-1).fit(x_train, y_train)
Any ideas?
I installed xgboost yesterday (25.09.2017):
If you install with pip or conda, that xgboost version does not support the n_jobs parameter, only the nthread parameter.
If you build xgboost from the GitHub repository, you can use n_jobs.
Update: n_jobs is the number of parallel threads used to run xgboost; it replaces nthread for all estimators such as XGBClassifier, XGBRanker, XGBRegressor, etc.
Reference - here
nthread is the same as n_jobs, but n_jobs is preferred now.
Maybe you can try nthread instead.
nthread : int
Number of parallel threads used to run xgboost. (Deprecated, please use n_jobs)
n_jobs : int
Number of parallel threads used to run xgboost. (replaces nthread)
https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/sklearn.py#L71-L74
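If you are stuck on the pip-installed 0.6a2 build, a minimal sketch of the workaround (assuming the older parameter names nthread and seed, which that release used in place of n_jobs and random_state) could look like this:
from xgboost import XGBClassifier
# older pip builds (e.g. 0.6a2) expose nthread/seed rather than n_jobs/random_state
estimator_xGBM = XGBClassifier(max_depth=5, learning_rate=0.05, n_estimators=400,
                               nthread=-1, seed=42)
estimator_xGBM.fit(x_train, y_train)  # x_train / y_train as in the question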
Use this to choose an n_jobs value for your machine:
import multiprocessing
n_jobs = multiprocessing.cpu_count()-1
print(n_jobs)
When running the code below:
import lightgbm as lgb
params = {'num_leaves': 38,
          'min_data_in_leaf': 50,
          'objective': 'regression',
          'max_depth': -1,
          'learning_rate': 0.1,
          'device': 'gpu'
          }
trn_data = lgb.Dataset(x_train, y_train)
val_data = lgb.Dataset(x_test, y_test)
model = lgb.train(params,
                  trn_data,
                  20000,
                  valid_sets=[trn_data, val_data],
                  verbose_eval=300,
                  early_stopping_rounds=1000)
I get the following errors:
train() got an unexpected keyword argument 'verbose_eval'
train() got an unexpected keyword argument 'early_stopping_rounds'
It is important to note that I run this on GPU; when running it on CPU I do not get this error.
Has anyone got an idea how I can incorporate verbose output and early stopping rounds when running LightGBM on GPU?
For LightGBM on GPU you can check the official documentation. In the documentation there are no configuration options called verbose_eval and early_stopping_rounds.
Official Documentation
Also you can check this link Running LightGBM on GPU with python
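If your LightGBM build is new enough to have the callbacks API (that is an assumption about your version), verbose output and early stopping can be supplied as callbacks instead of keyword arguments. A rough sketch:
import lightgbm as lgb
callbacks = [
    lgb.log_evaluation(period=300),           # stands in for verbose_eval=300
    lgb.early_stopping(stopping_rounds=1000)  # stands in for early_stopping_rounds=1000
]
model = lgb.train(params,
                  trn_data,
                  num_boost_round=20000,
                  valid_sets=[trn_data, val_data],
                  callbacks=callbacks)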
Here is my code:
import autosklearn.classification
import autosklearn.metrics
from sklearn.multiclass import OneVsRestClassifier

automl = autosklearn.classification.AutoSklearnClassifier(
    include={'feature_preprocessor': ["no_preprocessing"]},
    exclude={'classifier': ['random_forest']},
    time_left_for_this_task=60*10,
    per_run_time_limit=60*1,
    memory_limit=1024*10,
    n_jobs=-1,
    metric=autosklearn.metrics.f1_macro,
)
clf = OneVsRestClassifier(automl, n_jobs=-1)
clf.fit(X_train, y_train)
When I try to fit, I get this error:
/home/user/.local/lib/python3.8/site-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42433 instead
warnings.warn(
Killed
Why is auto-sklearn asking for Dask, and how do I fix this error?
Auto-sklearn uses Dask for parallel optimization, which is controlled by the n_jobs argument, as explained in more detail here. The warning message you're seeing occurs when you start a new Dask cluster while one is already in use. One option to remove the warning is to guard the code, as shown in the example in the above link, by placing your code within if __name__ == '__main__':.
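A minimal sketch of that guard, assuming the same AutoSklearnClassifier setup as in the question:
import autosklearn.classification
from sklearn.multiclass import OneVsRestClassifier

if __name__ == '__main__':
    # building and fitting inside the guard keeps worker processes from re-running this code
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=60*10,
        per_run_time_limit=60*1,
        n_jobs=-1,
    )
    clf = OneVsRestClassifier(automl, n_jobs=-1)
    clf.fit(X_train, y_train)  # X_train / y_train as in the question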
I am trying to create a BERT model for classifying Turkish-language text. Here is my code:
import pandas as pd
import torch
df = pd.read_excel(r'preparedDataNoId.xlsx')
df = df.sample(frac=1)
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.10)
print('train shape: ',train_df.shape)
print('test shape: ',test_df.shape)
from simpletransformers.classification import ClassificationModel
# define hyperparameter
train_args = {"reprocess_input_data": True,
              "fp16": False,
              "num_train_epochs": 4}
# Create a ClassificationModel
model = ClassificationModel(
    "bert", "dbmdz/bert-base-turkish-cased",
    num_labels=4,
    args=train_args
)
I am using Anaconda and Spyder. I think everything is correct, but when I run this I get the following error:
'use_cuda' set to True when cuda is unavailable. Make sure CUDA is available or set use_cuda=False.
How can I fix this exactly?
I ran into the same problem. If you have CUDA available, then set both use_cuda and fp16 to True. If not, then set both to False.
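A small sketch of that check, using torch to detect CUDA and reusing the model setup from the question:
import torch
from simpletransformers.classification import ClassificationModel

cuda_available = torch.cuda.is_available()
train_args = {"reprocess_input_data": True,
              "fp16": cuda_available,  # enable fp16 only when a GPU is present
              "num_train_epochs": 4}
model = ClassificationModel(
    "bert", "dbmdz/bert-base-turkish-cased",
    num_labels=4,
    args=train_args,
    use_cuda=cuda_available
)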
CUDA is a parallel computing platform and programming model developed by Nvidia for general computing on its own GPUs.
If your computer does not have a GPU, this error will be thrown.
Don't forget to include this parameter:
use_cuda=False
This will not affect your result; it will just take a few more seconds than usual to process.
model = ClassificationModel(
    "bert", "dbmdz/bert-base-turkish-cased",
    num_labels=4,
    args=train_args,
    use_cuda=False
)
Adding use_cuda=False will help if a GPU is not available.
If a GPU is unavailable on your computer, make sure to check your CUDA setup or try use_cuda=False when constructing your model. This error is thrown because CUDA does not exist on your computer.
I'm using the sklearn library and I have a question about the attribute n_iter_. When executing the code I get TypeError: __init__() got an unexpected keyword argument 'n_iter_'. I also tried using n_iter, but I get the same error, so maybe I am misspelling the attribute. This is not all the code; if you need more information, let me know.
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn = Perceptron(n_iter_=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
The Perceptron model in sklearn.linear_model doesn't have n_iter_ as a parameter. It has the following parameters with similar names.
max_iter: int, default=1000
The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit method.
and
n_iter_no_change : int, default=5
Number of iterations with no improvement to wait before early stopping.
New in version 0.20.
By looking at your code it looks like you intended to use max_iter.
So do
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=1)
ppn.fit(X_train_std, y_train)
Note:
You should first upgrade your scikit-learn using
pip install --upgrade scikit-learn
The attribute given in the documentation is n_iter and not n_iter_
So this should work:
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
ppn=Perceptron(n_iter=40, eta0= 0.1, random_state=1)
ppn.fit(X_train_std, y_train)
First check which Scikit-learn version you have installed. You can do that by executing
python -c "import sklearn;print(sklearn.__version__)"
in the terminal/environment that runs the Python interpreter executing your code.
Perceptron's initial parameters changed from n_iter to max_iter in version 0.20. The best way to keep up is to head to the documentation or source code of the correct version and read the params, e.g.
documentation: perceptron docs v.0.23
source code: perceptron v0.23 code
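If you'd rather check programmatically than read the docs, a quick sketch that prints the constructor parameters your installed version actually accepts:
import sklearn
from sklearn.linear_model import Perceptron

print(sklearn.__version__)
print(sorted(Perceptron().get_params().keys()))  # shows whether max_iter or n_iter is available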
I'm trying to parallelize the GridSearchCV of scikit-learn. It's running in a Jupyter(Hub) notebook environment. After some research I found this code:
from sklearn.externals.joblib import Parallel, parallel_backend, register_parallel_backend
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend
c = Client(profile='myprofile')
print(c.ids)
bview = c.load_balanced_view()
register_parallel_backend('ipyparallel', lambda : IPythonParallelBackend(view=bview))
grid = GridSearchCV(pipeline, cv=3, n_jobs=4, param_grid=param_grid)
with parallel_backend('ipyparallel'):
    grid.fit(X_train, Y_train)
Note that I've set the n_jobs parameter to 4, which is the number of the machine's CPU cores (it's what nproc returns).
But it doesn't seem to work: ImportError: cannot import name 'register_parallel_backend', although I installed joblib with conda install joblib and also tried pip install -U joblib.
So, what's the best way to parallelize the GridSearchCV in this environment?
UPDATE:
Without ipyparallel and just setting the n_jobs parameter:
grid = GridSearchCV(pipeline, cv=3, n_jobs=4, param_grid=param_grid)
grid.fit(X_train, Y_train)
Result is the following warning message:
/opt/conda/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning:
Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
It seems like it ends up running sequentially rather than in parallel.
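For what it's worth, one likely cause of the ImportError above is that newer scikit-learn releases dropped the bundled sklearn.externals.joblib, so the backend helpers have to come from the standalone joblib package instead. A hedged sketch under that assumption (pipeline and param_grid as defined elsewhere in your notebook):
from joblib import parallel_backend, register_parallel_backend
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend
from sklearn.model_selection import GridSearchCV

c = Client(profile='myprofile')
bview = c.load_balanced_view()
register_parallel_backend('ipyparallel',
                          lambda: IPythonParallelBackend(view=bview))

grid = GridSearchCV(pipeline, cv=3, n_jobs=len(c.ids), param_grid=param_grid)
with parallel_backend('ipyparallel'):
    grid.fit(X_train, Y_train)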