Cross-validation and parameters tuning with XGBoost and hyperopt - python

One way to do nested cross-validation with an XGB model would be:
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier
# Let's assume that we have some data for a binary classification
# problem : X (n_samples, n_features) and y (n_samples,)...
gs = GridSearchCV(estimator=XGBClassifier(),
                  param_grid={'max_depth': [3, 6, 9],
                              'learning_rate': [0.001, 0.01, 0.05]})
scores = cross_val_score(gs, X, y, cv=2)
However, regarding the tuning of XGB parameters, several tutorials (such as this one) take advantage of the Python hyperopt library. I would like to be able to do nested cross-validation (as above) using hyperopt to tune the XGB parameters.
To do so, I wrote my own Scikit-Learn estimator:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split
from sklearn.exceptions import NotFittedError
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
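def optimize_params(X, y, params_space, validation_split=0.2):
    """Estimate a set of 'best' model parameters."""
    # Split X, y into train/validation
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=validation_split, stratify=y)

    # Estimate XGB params
    def objective(_params):
        # Early-stopping values are assumed (lost in the source); xgboost < 1.6 takes them in fit().
        _clf = XGBClassifier(n_estimators=10000,
                             max_depth=int(_params['max_depth']),
                             learning_rate=_params['learning_rate'],
                             min_child_weight=_params['min_child_weight'],
                             subsample=_params['subsample'],
                             colsample_bytree=_params['colsample_bytree'],
                             gamma=_params['gamma'],
                             eval_metric='auc', early_stopping_rounds=30)
        _clf.fit(X_train, y_train,
                 eval_set=[(X_train, y_train), (X_val, y_val)])
        y_pred_proba = _clf.predict_proba(X_val)[:, 1]
        roc_auc = roc_auc_score(y_true=y_val, y_score=y_pred_proba)
        return {'loss': 1. - roc_auc, 'status': STATUS_OK}

    trials = Trials()
    return fmin(fn=objective, space=params_space, algo=tpe.suggest,
                max_evals=100,  # assumed budget; the original value was lost
                trials=trials)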
class OptimizedXGB(BaseEstimator, ClassifierMixin):
    """XGB with optimized parameters.

    Parameters
    ----------
    custom_params_space : dict or None
        If not None, dictionary whose keys are the XGB parameters to be
        optimized and corresponding values are 'a priori' probability
        distributions for the given parameter value. If None, a default
        parameters space is used.
    """
    def __init__(self, custom_params_space=None):
        self.custom_params_space = custom_params_space

    def fit(self, X, y, validation_split=0.3):
        """Train an XGB model.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)
        y : ndarray, shape (n_samples,) or (n_samples, n_labels)
        validation_split : float (default: 0.3)
            Float between 0 and 1. Fraction of the samples in X which will
            be used as validation data to estimate the 'best' model parameters.
        """
        # If no custom parameters space is given, use a default one.
        if self.custom_params_space is None:
            _space = {
                'learning_rate': hp.uniform('learning_rate', 0.0001, 0.05),
                'max_depth': hp.quniform('max_depth', 8, 15, 1),
                'min_child_weight': hp.quniform('min_child_weight', 1, 5, 1),
                'subsample': hp.quniform('subsample', 0.7, 1, 0.05),
                'gamma': hp.quniform('gamma', 0.9, 1, 0.05),
                'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 0.7, 0.05)
            }
        else:
            _space = self.custom_params_space

        # Estimate best params using X, y
        opt = optimize_params(X, y, _space, validation_split)

        # Instantiate `xgboost.XGBClassifier` with optimized parameters
        best = XGBClassifier(n_estimators=10000,
                             max_depth=int(opt['max_depth']),
                             learning_rate=opt['learning_rate'],
                             min_child_weight=opt['min_child_weight'],
                             subsample=opt['subsample'],
                             gamma=opt['gamma'],
                             colsample_bytree=opt['colsample_bytree'])
        best.fit(X, y)
        self.best_estimator_ = best
        return self

    def predict(self, X):
        """Predict labels with trained XGB model.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)

        Returns
        -------
        output : ndarray, shape (n_samples,) or (n_samples, n_labels)
        """
        if not hasattr(self, 'best_estimator_'):
            raise NotFittedError('Call `fit` before `predict`.')
        return self.best_estimator_.predict(X)

    def predict_proba(self, X):
        """Predict label probabilities with trained XGB model.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)

        Returns
        -------
        output : ndarray, shape (n_samples,) or (n_samples, n_labels)
        """
        if not hasattr(self, 'best_estimator_'):
            raise NotFittedError('Call `fit` before `predict_proba`.')
        return self.best_estimator_.predict_proba(X)
My questions are:
1. Is this a valid approach? For instance, in the fit method of my OptimizedXGB, best.fit(X, y) will train an XGB model on X, y. However, this might lead to overfitting, as no eval_set is specified to ensure early stopping.
2. On a toy example (the famous iris dataset), this OptimizedXGB performs worse than a basic LogisticRegression classifier. Why is that? Is it because the example is too simplistic? See below for the code of the example.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
X = X[:, :2]
X = X[y < 2]
y = y[y < 2]
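# random_state only has an effect with shuffle=True (recent scikit-learn raises otherwise),
# so it is dropped here to keep the same unshuffled folds.
skf = StratifiedKFold(n_splits=2)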
# With a LogisticRegression classifier
pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression())])
gs = GridSearchCV(estimator=pipe, param_grid={'lr__C': [1., 10.]})
lr_scores = cross_val_score(gs, X, y, cv=skf)
# With OptimizedXGB
xgb_scores = cross_val_score(OptimizedXGB(), X, y, cv=skf)
# Print results
print('Accuracy with LogisticRegression = %.4f (+/- %.4f)' % (np.mean(lr_scores), np.std(lr_scores)))
print('Accuracy with OptimizedXGB = %.4f (+/- %.4f)' % (np.mean(xgb_scores), np.std(xgb_scores)))
Accuracy with LogisticRegression = 0.9900 (+/- 0.0100)
Accuracy with OptimizedXGB = 0.9100 (+/- 0.0300)
Although the scores are close, I would have expected the XGB model to score at least as well as a LogisticRegression classifier.

First, check this post - might help - nested CV.
Regarding your questions:
1. Yes, that's the right way to go. Once you have selected your hyperparameters, you should fit the selected model on the entire training data. However, since this model includes a model selection process inside, you can only "score" how well it generalizes using an external CV, like you did.
2. Since you are scoring the selection process as well (and not just the model, say XGB vs. linear regression), there might be some problem with the selection process. Maybe your hyperparameter space is not properly defined and you are choosing poor parameters?
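To make the "external CV scores the whole selection process" point concrete, here is a minimal sketch of nested cross-validation. It is not the asker's code: it assumes synthetic data from make_classification and uses scikit-learn's GradientBoostingClassifier as a stand-in for XGB so that the example has no extra dependencies. The grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic binary problem (illustrative data, not from the question).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: hyperparameter selection, followed by a refit on the
# whole training fold (refit=True is the GridSearchCV default).
search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Outer loop: each score evaluates "search + refit" as a single procedure.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(nested_scores.mean(), nested_scores.std())
```

A custom estimator such as OptimizedXGB plays the same role as `search` here: whatever tuning happens inside fit is part of what the outer folds evaluate.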


Sklearn: How to pass different features to each target value in a MultiOutputRegressor?

Dear colleagues, I have created a scikit-learn pipeline to train and tune several HistGradientBoostingRegressor models.
from scipy.stats import loguniform
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class loguniform_int:
    """Integer valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)


# The arguments of this call are truncated in the source.
data_train, data_test, target_train, target_test = train_test_split(...)

pipeline_hist_boost_mimo_inside = Pipeline([
    ('scaler', StandardScaler()),
    ('variance_selector', VarianceThreshold(threshold=0.03)),
    ('estimator', MultiOutputRegressor(HistGradientBoostingRegressor(loss='poisson')))])

parameters = {
    'estimator__estimator__l2_regularization': loguniform(1e-6, 1e3),
    'estimator__estimator__learning_rate': loguniform(0.001, 10),
    'estimator__estimator__max_leaf_nodes': loguniform_int(2, 256),
    'estimator__estimator__min_samples_leaf': loguniform_int(1, 100),
    'estimator__estimator__max_bins': loguniform_int(2, 255),
}

# Any further arguments of this call are truncated in the source.
random_grid_inside = RandomizedSearchCV(estimator=pipeline_hist_boost_mimo_inside, param_distributions=parameters, random_state=0, n_iter=50,
                                        n_jobs=-1, refit=True, cv=3, verbose=True)
results_inside_train = random_grid_inside.fit(data_train, target_train)
However now I would like to know if it would be possible to pass different feature names to the step pipeline_hist_boost_mimo_inside["estimator"].
I have noticed that the documentation of the multi-output regressor lists an attribute called feature_names_in_:
feature_names_in_ : ndarray of shape (n_features_in_,)
    Names of features seen during fit. Only defined if the underlying estimators expose such an attribute when fit.
    New in version 1.0.
I have also found some documentation for scikit-learn's column selector, which has the argument:
pattern : str, default=None
    Name of columns containing this regex pattern will be included. If None, column selection will not be selected based on pattern.
The problem is that this pattern will depend on the target that I am fitting.
Is there a way to do this elegantly?
EDIT: Example of the dataset:
feat1  feat2  feat3  ...  target1  target2  target3  ...
1      47     0.65        0        0.5      0.6
The multi-output regressor will fit a histogram regressor for every pair of (feat1, feat2, feat3 and targetn). For the table above, I will have a pipeline whose estimator step contains a list of 3 estimators, as I have 3 targets.
The question is how to pass for instance feat1 and feat2 to target1 but pass feat1 and feat3 to target2.
A solution is to modify MultiOutputRegressor so that it can filter specific columns to fit a model to individual target variables.
For example, I define a MultiOutputRegressorTargetFilter that accepts a features_in parameter, a dictionary indicating which columns to use for each target y value:
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
X, y = load_linnerud(return_X_y=True)
# Pass a dictionary indicating which columns to use for each target variable value
features_in = {
    0: [0, 2],    # Use columns 1 and 3 for y[0]
    1: [1, 2],    # Use columns 2 and 3 for y[1]
    2: [0, 1, 2]  # Use all columns for y[2]
}
clf = MultiOutputRegressorTargetFilter(Ridge(random_state=123), features_in=features_in).fit(X, y)
Code for MultiOutputRegressorTargetFilter
from sklearn.multioutput import _MultiOutputEstimator
from sklearn.base import RegressorMixin, clone
from sklearn.utils.validation import _check_fit_params, has_fit_parameter, check_is_fitted
from sklearn.utils.fixes import delayed
from joblib import Parallel
import numpy as np
def _fit_estimator(estimator, X, y, sample_weight=None, **fit_params):
    # Fit a fresh clone so the template estimator is never modified
    estimator = clone(estimator)
    if sample_weight is not None:
        estimator.fit(X, y, sample_weight=sample_weight, **fit_params)
    else:
        estimator.fit(X, y, **fit_params)
    return estimator
class MultiOutputRegressorTargetFilter(RegressorMixin, _MultiOutputEstimator):
    """Multi target regression.

    This strategy consists of fitting one regressor per target. This is a
    simple strategy for extending regressors that do not natively support
    multi-target regression. This estimator allows selecting different columns
    to fit a model for each of the target values.

    .. versionadded:: 0.18

    Parameters
    ----------
    estimator : estimator object
        An estimator object implementing :term:`fit` and :term:`predict`.

    features_in : dict
        Dictionary mapping each target column index to the list of feature
        column indices used to fit the model for that target.

    n_jobs : int or None, optional (default=None)
        The number of jobs to run in parallel.
        :meth:`fit`, :meth:`predict` and :meth:`partial_fit` (if supported
        by the passed estimator) will be parallelized for each target.

        When individual estimators are fast to train or predict,
        using ``n_jobs > 1`` can result in slower performance due
        to the parallelism overhead.

        ``None`` means `1` unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all available processes / threads.
        See :term:`Glossary <n_jobs>` for more details.

        .. versionchanged:: 0.20
            `n_jobs` default changed from `1` to `None`.

    Attributes
    ----------
    estimators_ : list of ``n_output`` estimators
        Estimators used for predictions.

    n_features_in_ : int
        Number of features seen during :term:`fit`. Only defined if the
        underlying `estimator` exposes such an attribute when fit.

        .. versionadded:: 0.24

    feature_names_in_ : ndarray of shape (`n_features_in_`,)
        Names of features seen during :term:`fit`. Only defined if the
        underlying estimators expose such an attribute when fit.

        .. versionadded:: 1.0

    See Also
    --------
    RegressorChain : A multi-label model that arranges regressions into a
        chain.
    MultiOutputClassifier : Classifies each output independently rather than
        chaining.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.datasets import load_linnerud
    >>> from sklearn.multioutput import MultiOutputRegressor
    >>> from sklearn.linear_model import Ridge
    >>> X, y = load_linnerud(return_X_y=True)
    >>> clf = MultiOutputRegressor(Ridge(random_state=123)).fit(X, y)
    >>> clf.predict(X[[0]])
    array([[176..., 35..., 57...]])
    """

    def __init__(self, estimator, *, n_jobs=None, features_in=None):
        super().__init__(estimator, n_jobs=n_jobs)
        self.features_in = features_in
    def fit(self, X, y, sample_weight=None, **fit_params):
        """Fit the model to data, separately for each output variable.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input data.

        y : {array-like, sparse matrix} of shape (n_samples, n_outputs)
            Multi-output targets. An indicator matrix turns on multilabel
            estimation.

        sample_weight : array-like of shape (n_samples,), default=None
            Sample weights. If `None`, then samples are equally weighted.
            Only supported if the underlying regressor supports sample
            weights.

        **fit_params : dict of string -> object
            Parameters passed to the ``estimator.fit`` method of each step.

            .. versionadded:: 0.23

        Returns
        -------
        self : object
            Returns a fitted instance.
        """
        if not hasattr(self.estimator, "fit"):
            raise ValueError("The base estimator should implement a fit method")

        y = self._validate_data(X="no_validation", y=y, multi_output=True)

        if y.ndim == 1:
            raise ValueError(
                "y must have at least two dimensions for "
                "multi-output regression but has only one."
            )

        if sample_weight is not None and not has_fit_parameter(
            self.estimator, "sample_weight"
        ):
            raise ValueError("Underlying estimator does not support sample weights.")

        fit_params_validated = _check_fit_params(X, fit_params)

        # Fit one clone per target, using only the columns listed for that target
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_estimator)(
                self.estimator, X[:, self.features_in[i]], y[:, i], sample_weight, **fit_params_validated
            )
            for i in range(y.shape[1])
        )

        if hasattr(self.estimators_[0], "n_features_in_"):
            self.n_features_in_ = self.estimators_[0].n_features_in_
        if hasattr(self.estimators_[0], "feature_names_in_"):
            self.feature_names_in_ = self.estimators_[0].feature_names_in_

        return self
    def predict(self, X):
        """Predict multi-output variable using model for each target variable.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input data.

        Returns
        -------
        y : {array-like, sparse matrix} of shape (n_samples, n_outputs)
            Multi-output targets predicted across multiple predictors.
            Note: Separate models are generated for each predictor.
        """
        if not hasattr(self.estimators_[0], "predict"):
            raise ValueError("The base estimator should implement a predict method")

        # Each fitted estimator only sees the columns it was trained on
        y = Parallel(n_jobs=self.n_jobs)(
            delayed(e.predict)(X[:, self.features_in[i]]) for i, e in enumerate(self.estimators_)
        )

        return np.asarray(y).T
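The class above relies on private scikit-learn helpers (`_MultiOutputEstimator`, `_check_fit_params`, `sklearn.utils.fixes.delayed`), which are not guaranteed to be stable across releases. As a hedged alternative, the same idea can be sketched with public API only. `PerTargetFeatureRegressor` is a hypothetical name introduced here for illustration; the sketch assumes NumPy-style inputs and drops parallelism and sample weights for brevity.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.datasets import load_linnerud
from sklearn.linear_model import Ridge


class PerTargetFeatureRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical helper: one clone of `estimator` per target column,
    each fitted on its own subset of feature columns."""

    def __init__(self, estimator, features_in):
        self.estimator = estimator
        self.features_in = features_in

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # features_in[i] lists the feature columns used for target i
        self.estimators_ = [
            clone(self.estimator).fit(X[:, self.features_in[i]], y[:, i])
            for i in range(y.shape[1])
        ]
        return self

    def predict(self, X):
        X = np.asarray(X)
        columns = [
            est.predict(X[:, self.features_in[i]])
            for i, est in enumerate(self.estimators_)
        ]
        # One column of predictions per target
        return np.column_stack(columns)


X, y = load_linnerud(return_X_y=True)
features_in = {0: [0, 2], 1: [1, 2], 2: [0, 1, 2]}
model = PerTargetFeatureRegressor(Ridge(), features_in).fit(X, y)
pred = model.predict(X[:2])
print(pred.shape)
```

Because it follows the usual estimator conventions (constructor arguments stored unchanged, fitted state in trailing-underscore attributes), a class like this should also be usable as the final step of a Pipeline.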

Optimization of XGBoost Hyperparameters ValueError: The label must consist of integer labels of form 0, 1, 2, ..., [num_class - 1]

I'm doing some basic hyperparameter optimization for an xgboost model and have run across the following issue. Firstly my code:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
import xgboost as xgb
from functools import partial
from skopt import space, gp_minimize
<Some preprocessing...>
x = Oe.fit_transform(x)
y = Ly.fit_transform(y)
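def optimize(params, param_names, x, y):
    # Map each parameter name to the value proposed by the optimizer
    params = dict(zip(param_names, params))
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    nc = len(set(y_train))
    xgb_model = xgb.XGBClassifier(use_label_encoder=False, num_class=nc+1, objective="multi:softprob", **params)
    xgb_model.fit(X_train, y_train)
    preds = xgb_model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    return -1.0 * acc


param_space = [
    space.Integer(3, 10, name="max_depth"),
    space.Real(0.01, 0.3, prior="uniform", name="learning_rate"),
    # ... remaining dimensions are truncated in the source
]

param_names = [
    "max_depth",
    "learning_rate",
    # ... remaining names are truncated in the source
]

optimization_function = partial(optimize, param_names=param_names, x=x, y=y)

result = gp_minimize(
    optimization_function,
    dimensions=param_space,
    # ... remaining arguments are truncated in the source
)

print(dict(zip(param_names, result.x)))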
After some searching, I realized that unless I fix random_state in train_test_split to make the split deterministic, I risk getting a y_train that does not contain every label 0, 1, 2, ..., which raises the following error:
ValueError: The label must consist of integer labels of form 0, 1, 2, ..., [num_class - 1].
On the other hand, if I use a random state then my optimization implementation that I use here loses its purpose since I will always have the same result, considering I'm using a small dataset.
And indeed after running my code with random_state=0 for example, after 3 iterations of gp_minimize, I end up getting the same optimum no matter what combination of hyperparameters it produces.
Update: One could argue that even if I chose different random states, the optimal combination that I would get, would also depend on that set of random states, so in the end I only want to know if this is the right approach to optimize my model.
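One possible way around the missing-label problem, sketched under assumptions: pass `stratify=y` to train_test_split so that class proportions are preserved and no label disappears from y_train, while still varying random_state between splits. The toy labels below are illustrative and not the asker's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small, imbalanced toy labels (illustrative, not the question's data).
y = np.array([0] * 20 + [1] * 6 + [2] * 4)
x = np.arange(len(y)).reshape(-1, 1)

train_label_sets = []
for seed in range(5):
    # stratify=y keeps class proportions, so every label stays in y_train
    # even though the split itself changes with the seed.
    X_train, X_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, stratify=y, random_state=seed
    )
    train_label_sets.append(np.unique(y_train).tolist())

print(train_label_sets)
```

Regarding the update: averaging the objective over several such splits (or using stratified k-fold cross-validation inside the objective) would make the selected hyperparameters depend less on any single split.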

Python Scikit-Learn RandomizedSearchCV with custom scoring functions

I am using Scikit-Learn's Random Forest Regressor, Pipeline, and RandomizedSearchCV to predict the target variable using some features in my dataset. I need to use my own custom scoring functions that calculate weighted scores using weights (signifying the importance of observations) from the dataset. My code seems to work but I am getting a warning when the grid runs:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  self._final_estimator.fit(Xt, y, **fit_params)
This is related to .fit(X_train, y_train). Based on this warning, if I change the code to .fit(X_train, y_train.values.ravel()) then I cannot get my weighted scores to work. I have tried editing the code in different/appropriate ways to get the weighted scores to work but to no avail.
I am including my code below that runs on a test data in test.csv. The file has four columns: two feature columns ('x1', 'x2'), target ('y') and weight ('weight') columns. The custom scoring functions below are simple functions that calculate weighted rmse_score and mean_abs_error_score. How can I use .fit(X_train, y_train.values.ravel()) and still compute the scores?
import pandas as pd
import numpy as np
import sklearn.model_selection as skms
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def rmse_score(y_true, y_pred, weight):
    # Look up the weights of the rows in this fold via the pandas index
    weight = weight.loc[y_true.index.values]
    rmse = np.sqrt(np.mean(weight*(y_true.values-y_pred)**2))
    return rmse

def mean_abs_error_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    mae = np.mean(weight*np.absolute(y_true.values-y_pred))
    return mae
#---- reading data
heart_df = pd.read_csv('data\\test.csv')
#---- splitting into training & testing sets
y = heart_df['y']
X = heart_df[['x1', 'x2']]
X_train, X_test, y_train, y_test = skms.train_test_split(X, y, test_size=0.20)
X_train_weights = heart_df['weight'].loc[X_train.index.values]
params = {"weight": X_train_weights}
my_scorer1 = make_scorer(rmse_score, greater_is_better=False, **params)
my_scorer2 = make_scorer(mean_abs_error_score, greater_is_better=False, **params)
#---- random forest training with hyperparameter tuning
pipe = Pipeline([("scaler", StandardScaler()), ("rfr", RandomForestRegressor())])
random_grid = {"rfr__n_estimators": [10, 100, 500, 1000],
               "rfr__max_depth": [10, 20, 30, 40, 50, None],
               "rfr__max_features": [0.25, 0.50, 0.75],
               "rfr__min_samples_split": [5, 10, 20],
               "rfr__min_samples_leaf": [3, 5, 10],
               "rfr__bootstrap": [True, False]}
rfr_cv = skms.RandomizedSearchCV(pipe,
                                 param_distributions=random_grid,
                                 n_iter=15,
                                 cv=3,
                                 scoring={'rmse': my_scorer1, 'mae': my_scorer2},
                                 refit='rmse',
                                 n_jobs=-1)
rfr_cv.fit(X_train, y_train)
best_params = rfr_cv.best_params_
best_score = rfr_cv.best_score_
print(f'best hyperparameters = {best_params}')
print(f'best score = {best_score}')
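A hedged observation on why `.values.ravel()` breaks the scorers: they look up weights through `y_true.index`, and a NumPy array has no index. One option that may avoid both problems is to pass y as a one-dimensional pandas Series instead of a one-column DataFrame. The sketch below uses a made-up four-row frame and assumes, as the question's working version implies, that the pandas slice reaches the scoring function unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"y": [1.0, 2.0, 3.0, 4.0], "weight": [0.5, 1.0, 1.5, 2.0]},
    index=[10, 11, 12, 13],
)

y_frame = df[["y"]]                     # shape (4, 1): a column vector
y_series = y_frame.squeeze("columns")   # shape (4,), index preserved


def rmse_score(y_true, y_pred, weight):
    # Same logic as in the question: weights are matched by index label
    w = weight.loc[y_true.index]
    return np.sqrt(np.mean(w * (y_true.values - y_pred) ** 2))


# A slice such as a CV fold would produce (rows labelled 11 and 13)
fold = y_series.iloc[[1, 3]]
score = rmse_score(fold, np.array([2.5, 3.0]), df["weight"])
print(y_frame.shape, y_series.shape, score)
```

Note that `np.mean(w * err**2)` divides by the number of samples rather than by the sum of the weights; whether that is the intended weighting is a modelling choice left to the asker.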

How to get the prediction probabilities using cross validation in scikit-learn

I am using RandomForestClassifier as follows using cross validation for a binary classification (class labels are 0 and 1).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring = 'accuracy')
print("Accuracy: " + str(round(100*accuracy.mean(), 2)) + "%")
f1 = cross_val_score(clf, X, y, cv=k_fold, scoring = 'f1_weighted')
print("F Measure: " + str(round(100*f1.mean(), 2)) + "%")
Now I want to order my data using prediction probabilities of class 1 with cross validation results. For that I tried the following two ways.
pred = clf.predict_proba(X)[:,1]
probs = clf.predict_proba(X)
best_n = np.argsort(probs, axis=1)[:,-6:]
I get the following error
NotFittedError: This RandomForestClassifier instance is not fitted
yet. Call 'fit' with appropriate arguments before using this method.
for both the situations.
I am just wondering where I am going wrong.
I am happy to provide more details if needed.
In case you want to use a model fitted during CV on unseen data points, use the following approach.
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = RandomForestClassifier(n_estimators=10, random_state = 42, class_weight="balanced")
cv_results = cross_validate(clf, X, y, cv=3, return_estimator=True)
clf_fold_0 = cv_results['estimator'][0]
# array([[0. , 0.5, 0.5]])
Have a look at the documentation: it specifies that the probability is calculated as the mean of the results from the trees.
In your case, you first need to call the fit() method to generate the trees in the model. Once you fit the model on the training data, you can call the predict_proba() method.
This is also specified in the error.
# Fit model (X_train / X_test are assumed names; the original calls are truncated)
model = RandomForestClassifier(...)
model.fit(X_train, Y_train)
# Probability
model.predict_proba(X_test)
I solved my problem using the following code:
proba = cross_val_predict(clf, X, y, cv=k_fold, method='predict_proba')
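To connect this with the original goal of ordering the data by the class 1 probability, here is a minimal sketch. It assumes synthetic data from make_classification in place of the asker's X and y, and keeps the same classifier and fold settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the question's X, y (binary labels 0 and 1).
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each row's probabilities come from the fold model that did not train on it.
proba = cross_val_predict(clf, X, y, cv=k_fold, method="predict_proba")
p_class1 = proba[:, 1]

# Sample indices sorted from most to least likely to be class 1.
order = np.argsort(-p_class1)
top_6 = order[:6]
print(top_6, p_class1[top_6])
```

No single fitted model is returned by cross_val_predict, so these out-of-fold probabilities are suited to ranking or diagnostics on the training data rather than to predicting new samples.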

scikit learn decision tree model evaluation

Here is the related code and documentation. When cross_val_score is called without explicitly specifying a scoring, does the output array contain precision, AUC, or some other metric?
Using Python 2.7 with miniconda interpreter.
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([ 1. , 0.93..., 0.86..., 0.93..., 0.93...,
0.93..., 0.93..., 1. , 0.93..., 1. ])
From the user guide:
By default, the score computed at each CV iteration is the score
method of the estimator. It is possible to change this by using the
scoring parameter:
From the DecisionTreeClassifier documentation:
Returns the mean accuracy on the given test data and labels. In
multi-label classification, this is the subset accuracy which is a
harsh metric since you require for each sample that each label set be
correctly predicted.
Don't be confused by "mean accuracy," it's just the regular way one computes accuracy. Follow the links to the source:
from .metrics import accuracy_score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
Now the source for metrics.accuracy_score
def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None):
    # Compute accuracy for each possible representation
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    if y_type.startswith('multilabel'):
        differing_labels = count_nonzero(y_true - y_pred, axis=1)
        score = differing_labels == 0
    else:
        score = y_true == y_pred
    return _weighted_sum(score, sample_weight, normalize)
And if you still aren't convinced:
def _weighted_sum(sample_score, sample_weight, normalize=False):
    if normalize:
        return np.average(sample_score, weights=sample_weight)
    elif sample_weight is not None:
        return np.dot(sample_score, sample_weight)
    else:
        return sample_score.sum()
Note: for accuracy_score, the normalize parameter defaults to True, so it simply returns np.average of the boolean NumPy array, i.e. the fraction of correct predictions.
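The three branches of _weighted_sum can be reproduced by hand with NumPy. This is a small illustration on made-up labels and weights, not scikit-learn's own code.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
sample_weight = np.array([1.0, 1.0, 2.0, 1.0, 1.0])

# Boolean array: True where the prediction matches the label
correct = y_true == y_pred

# normalize=True, no weights: plain fraction of correct predictions (4 of 5)
accuracy = np.average(correct)

# normalize=True with weights: the misclassified sample counts double (4 of 6)
weighted_accuracy = np.average(correct, weights=sample_weight)

# normalize=False, no weights: number of correct predictions
correct_count = correct.sum()

print(accuracy, weighted_accuracy, correct_count)
```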
If a scoring argument isn't given, cross_val_score will default to using the .score method of the estimator you're using. For DecisionTreeClassifier, it's mean accuracy (as shown in the docstring below):
In [11]: DecisionTreeClassifier.score?
Signature: DecisionTreeClassifier.score(self, X, y, sample_weight=None)
Docstring:
Returns the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.

Parameters
----------
X : array-like, shape = (n_samples, n_features)
    Test samples.

y : array-like, shape = (n_samples) or (n_samples, n_outputs)
    True labels for X.

sample_weight : array-like, shape = [n_samples], optional
    Sample weights.

Returns
-------
score : float
    Mean accuracy of self.predict(X) wrt. y.
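A quick way to check this, sketched with the current `sklearn.model_selection` import path: the default scores should match those obtained with `scoring="accuracy"`, while another scorer such as `"f1_macro"` has to be requested explicitly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# No scoring argument: falls back to clf.score, i.e. mean accuracy
default_scores = cross_val_score(clf, X, y, cv=10)

# Explicit scorers, evaluated on the same deterministic folds
accuracy_scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
f1_scores = cross_val_score(clf, X, y, cv=10, scoring="f1_macro")

print(np.allclose(default_scores, accuracy_scores))
```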
