Here are the related code and documentation. I am wondering: for the default cross_val_score, without explicitly specifying a scorer, what metric does the output array represent: precision, AUC, or something else?
Using Python 2.7 with the Miniconda interpreter.
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
        0.93...,  0.93...,  1.     ,  0.93...,  1.     ])
regards,
Lin
From the user guide:
By default, the score computed at each CV iteration is the score
method of the estimator. It is possible to change this by using the
scoring parameter:
From the DecisionTreeClassifier documentation:
Returns the mean accuracy on the given test data and labels. In
multi-label classification, this is the subset accuracy which is a
harsh metric since you require for each sample that each label set be
correctly predicted.
Don't be confused by "mean accuracy"; it's just the regular way one computes accuracy. Follow the links to the source:
from .metrics import accuracy_score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
Now the source for metrics.accuracy_score:

def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None):
    ...
    # Compute accuracy for each possible representation
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    if y_type.startswith('multilabel'):
        differing_labels = count_nonzero(y_true - y_pred, axis=1)
        score = differing_labels == 0
    else:
        score = y_true == y_pred

    return _weighted_sum(score, sample_weight, normalize)
And if you still aren't convinced:
def _weighted_sum(sample_score, sample_weight, normalize=False):
    if normalize:
        return np.average(sample_score, weights=sample_weight)
    elif sample_weight is not None:
        return np.dot(sample_score, sample_weight)
    else:
        return sample_score.sum()
Note: for accuracy_score the normalize parameter defaults to True, so it simply returns np.average of the boolean numpy array, i.e. the fraction of correct predictions.
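As a quick sanity check, here is a minimal sketch with made-up labels showing that accuracy_score is just the mean of an element-wise comparison:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# 4 of 5 predictions match, so both lines print 0.8
print(accuracy_score(y_true, y_pred))
print(np.average(y_true == y_pred))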
If a scoring argument isn't given, cross_val_score will default to using the .score method of the estimator you're using. For DecisionTreeClassifier, it's mean accuracy (as shown in the docstring below):
In [11]: DecisionTreeClassifier.score?
Signature: DecisionTreeClassifier.score(self, X, y, sample_weight=None)
Docstring:
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.
Parameters
----------
X : array-like, shape = (n_samples, n_features)
    Test samples.

y : array-like, shape = (n_samples) or (n_samples, n_outputs)
    True labels for X.

sample_weight : array-like, shape = [n_samples], optional
    Sample weights.

Returns
-------
score : float
    Mean accuracy of self.predict(X) wrt. y.
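To confirm this on the example from the question, a small sketch: passing scoring='accuracy' explicitly should reproduce the default scores exactly (note that the old sklearn.cross_validation module was later renamed to sklearn.model_selection):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

default_scores = cross_val_score(clf, iris.data, iris.target, cv=10)
accuracy_scores = cross_val_score(clf, iris.data, iris.target, cv=10, scoring='accuracy')

# Both calls use accuracy, so the arrays are identical
print(np.allclose(default_scores, accuracy_scores))  # True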
Related
Dear colleagues, I have created a scikit-learn pipeline to train and tune different HistGradientBoostingRegressor models.
from scipy.stats import loguniform
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
class loguniform_int:
    """Integer valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)
data_train, data_test, target_train, target_test = train_test_split(
    df.drop(columns=TARGETS),
    df[target_dict],
    random_state=42)

pipeline_hist_boost_mimo_inside = Pipeline([
    ('scaler', StandardScaler()),
    ('variance_selector', VarianceThreshold(threshold=0.03)),
    ('estimator', MultiOutputRegressor(HistGradientBoostingRegressor(loss='poisson')))])

parameters = {
    'estimator__estimator__l2_regularization': loguniform(1e-6, 1e3),
    'estimator__estimator__learning_rate': loguniform(0.001, 10),
    'estimator__estimator__max_leaf_nodes': loguniform_int(2, 256),
    'estimator__estimator__min_samples_leaf': loguniform_int(1, 100),
    'estimator__estimator__max_bins': loguniform_int(2, 255),
}

random_grid_inside = RandomizedSearchCV(estimator=pipeline_hist_boost_mimo_inside,
                                        param_distributions=parameters, random_state=0,
                                        n_iter=50, n_jobs=-1, refit=True, cv=3,
                                        verbose=True, pre_dispatch='2*n_jobs',
                                        return_train_score=True)

results_inside_train = random_grid_inside.fit(data_train, target_train)
However, now I would like to know whether it would be possible to pass different feature names to the step pipeline_hist_boost_mimo_inside["estimator"].
I have noticed that the documentation of the multi output regressor describes an attribute called feature_names_in_:

feature_names_in_ : ndarray of shape (n_features_in_,)
    Names of features seen during fit. Only defined if the underlying
    estimators expose such an attribute when fit.

    New in version 1.0.
I have also found some documentation for the scikit-learn column selector, which has the argument:
https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector

pattern : str, default=None
    Name of columns containing this regex pattern will be included. If
    None, column selection will not be selected based on pattern.
The problem is that this pattern will depend on the target that I am fitting.
Is there a way to do this elegantly?
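For reference, a minimal sketch of how make_column_selector is typically wired into a ColumnTransformer (the regex and step names here are illustrative assumptions, not from my pipeline):

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler

# Apply the scaler only to columns whose names start with "feat"
preprocessor = ColumnTransformer([
    ('scaled_feats', StandardScaler(), make_column_selector(pattern='^feat')),
])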
EDIT: Example of the dataset:
feat1  feat2  feat3  ...  target1  target2  target3  ...
    1     47   0.65            0      0.5      0.6
The multioutput regressor will fit a histogram gradient boosting regressor for every (feat1, feat2, feat3, ..., target_n) pair. In the example of the table above, the pipeline's estimator step will contain a list of 3 estimators, as I have 3 targets.
The question is how to pass, for instance, feat1 and feat2 to the model for target1, but feat1 and feat3 to the model for target2.
A solution is to modify MultiOutputRegressor so that it can filter specific columns and fit a model to each individual target variable.
For example, I define a MultiOutputRegressorTargetFilter that accepts a features_in parameter, a dictionary indicating which columns to use for each target y value:
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
X, y = load_linnerud(return_X_y=True)
# Pass a dictionary indicating which columns to use for each target variable
features_in = {
    0: [0, 2],     # use columns 0 and 2 for y[:, 0]
    1: [1, 2],     # use columns 1 and 2 for y[:, 1]
    2: [0, 1, 2],  # use all columns for y[:, 2]
}

clf = MultiOutputRegressorTargetFilter(Ridge(random_state=123),
                                       features_in=features_in).fit(X, y)
clf.predict(X[[0]])
Code for MultiOutputRegressorTargetFilter
from sklearn.multioutput import _MultiOutputEstimator
from sklearn.base import RegressorMixin, clone
from sklearn.utils.validation import _check_fit_params, has_fit_parameter, check_is_fitted
from sklearn.utils.fixes import delayed
from joblib import Parallel
import numpy as np
def _fit_estimator(estimator, X, y, sample_weight=None, **fit_params):
    # Clone so that each target gets its own freshly fitted estimator
    estimator = clone(estimator)
    if sample_weight is not None:
        estimator.fit(X, y, sample_weight=sample_weight, **fit_params)
    else:
        estimator.fit(X, y, **fit_params)
    return estimator
class MultiOutputRegressorTargetFilter(RegressorMixin, _MultiOutputEstimator):
    """Multi target regression.

    This strategy consists of fitting one regressor per target. This is a
    simple strategy for extending regressors that do not natively support
    multi-target regression. This estimator allows selecting different
    columns to fit a model for each of the target values.

    Parameters
    ----------
    estimator : estimator object
        An estimator object implementing :term:`fit` and :term:`predict`.

    features_in : dict
        Dictionary with (key, value) pairs indicating which variables to use
        to fit the model for each target in y.

    n_jobs : int or None, optional (default=None)
        The number of jobs to run in parallel.
        :meth:`fit`, :meth:`predict` and :meth:`partial_fit` (if supported
        by the passed estimator) will be parallelized for each target.
        When individual estimators are fast to train or predict,
        using ``n_jobs > 1`` can result in slower performance due
        to the parallelism overhead.
        ``None`` means `1` unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all available processes / threads.
        See :term:`Glossary <n_jobs>` for more details.

    Attributes
    ----------
    estimators_ : list of ``n_output`` estimators
        Estimators used for predictions.

    n_features_in_ : int
        Number of features seen during :term:`fit`. Only defined if the
        underlying `estimator` exposes such an attribute when fit.

    feature_names_in_ : ndarray of shape (`n_features_in_`,)
        Names of features seen during :term:`fit`. Only defined if the
        underlying estimators expose such an attribute when fit.

    See Also
    --------
    RegressorChain : A multi-label model that arranges regressions into a
        chain.
    MultiOutputClassifier : Classifies each output independently rather than
        chaining.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.datasets import load_linnerud
    >>> from sklearn.linear_model import Ridge
    >>> X, y = load_linnerud(return_X_y=True)
    >>> features_in = {0: [0, 2], 1: [1, 2], 2: [0, 1, 2]}
    >>> clf = MultiOutputRegressorTargetFilter(Ridge(random_state=123),
    ...                                        features_in=features_in).fit(X, y)
    >>> clf.predict(X[[0]])  # doctest: +SKIP
    """

    def __init__(self, estimator, *, n_jobs=None, features_in=None):
        super().__init__(estimator, n_jobs=n_jobs)
        self.features_in = features_in

    def fit(self, X, y, sample_weight=None, **fit_params):
        """Fit the model to data, separately for each output variable.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input data.

        y : {array-like, sparse matrix} of shape (n_samples, n_outputs)
            Multi-output targets. An indicator matrix turns on multilabel
            estimation.

        sample_weight : array-like of shape (n_samples,), default=None
            Sample weights. If `None`, then samples are equally weighted.
            Only supported if the underlying regressor supports sample
            weights.

        **fit_params : dict of string -> object
            Parameters passed to the ``estimator.fit`` method of each step.

        Returns
        -------
        self : object
            Returns a fitted instance.
        """
        if not hasattr(self.estimator, "fit"):
            raise ValueError("The base estimator should implement a fit method")

        y = self._validate_data(X="no_validation", y=y, multi_output=True)

        if y.ndim == 1:
            raise ValueError(
                "y must have at least two dimensions for "
                "multi-output regression but has only one."
            )

        if sample_weight is not None and not has_fit_parameter(
            self.estimator, "sample_weight"
        ):
            raise ValueError("Underlying estimator does not support sample weights.")

        fit_params_validated = _check_fit_params(X, fit_params)

        # Fit one clone of the base estimator per target, passing only the
        # columns listed in features_in[i] to estimator i
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_estimator)(
                self.estimator,
                X[:, self.features_in[i]],
                y[:, i],
                sample_weight,
                **fit_params_validated,
            )
            for i in range(y.shape[1])
        )

        if hasattr(self.estimators_[0], "n_features_in_"):
            self.n_features_in_ = self.estimators_[0].n_features_in_
        if hasattr(self.estimators_[0], "feature_names_in_"):
            self.feature_names_in_ = self.estimators_[0].feature_names_in_

        return self

    def predict(self, X):
        """Predict multi-output variable using model for each target variable.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            The input data.

        Returns
        -------
        y : {array-like, sparse matrix} of shape (n_samples, n_outputs)
            Multi-output targets predicted across multiple predictors.
            Note: Separate models are generated for each predictor.
        """
        check_is_fitted(self)
        if not hasattr(self.estimators_[0], "predict"):
            raise ValueError("The base estimator should implement a predict method")

        # Each fitted estimator sees only its own column subset at predict time
        y = Parallel(n_jobs=self.n_jobs)(
            delayed(e.predict)(X[:, self.features_in[i]])
            for i, e in enumerate(self.estimators_)
        )

        return np.asarray(y).T
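To connect this back to the pipeline in the question, here is a sketch of how the new class could be swapped in. The features_in indices are illustrative, and note that they refer to column positions in the numpy array that reaches the estimator step, i.e. after VarianceThreshold has possibly dropped columns:

# Hypothetical wiring: replace MultiOutputRegressor with the filtering variant
pipeline_filtered = Pipeline([
    ('scaler', StandardScaler()),
    ('variance_selector', VarianceThreshold(threshold=0.03)),
    ('estimator', MultiOutputRegressorTargetFilter(
        HistGradientBoostingRegressor(loss='poisson'),
        features_in={0: [0, 1], 1: [0, 2], 2: [1, 2]}))  # illustrative indices
])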
I have a logistic regression that I want to know the AUC for.
I created the logistic regression model using statsmodels:
import statsmodels.api as sm
y = generate_data(dependent_var) # pseudocode
X = generate_data(independent_var) # pseudocode
X['constant'] = 1
logit_model=sm.Logit(y,X)
result=logit_model.fit()
Then, I use sklearn to get an AUC score for my model predictions:
from sklearn.metrics import roc_auc_score
roc_auc_score(y, result.predict())
The code runs and I get an AUC score; I just want to make sure I am passing variables between the package calls correctly.
Input for roc_auc_score:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
sklearn.metrics.roc_auc_score(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)

y_true : array-like of shape (n_samples,) or (n_samples, n_classes)
    True labels or binary label indicators. The binary and multiclass cases
    expect labels with shape (n_samples,) while the multilabel case expects
    binary label indicators with shape (n_samples, n_classes).

y_score : array-like of shape (n_samples,) or (n_samples, n_classes)
    Target scores.
To validate the input parameters:
y_true: correct
y_score: result.predict(X), based on https://www.statsmodels.org/devel/examples/notebooks/generated/predict.html
But your validation is missing the concept of a train/test split, which is necessary in machine learning. Normally you always divide the dataset into a training part and a test part. In pseudocode this would look like:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

y = generate_data(dependent_var)    # pseudocode
X = generate_data(independent_var)  # pseudocode
X['constant'] = 1

# Hold out a test set so the AUC is computed on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logit_model = sm.Logit(y_train, X_train)
result = logit_model.fit()

roc_auc_score(y_test, result.predict(X_test))
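A runnable version of the same idea, with synthetic data standing in for the generate_data pseudocode (the data-generating lines are assumptions for illustration):

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(500, 3)))        # features plus intercept column
y = (X[:, 1] + rng.normal(size=500) > 0).astype(int)  # synthetic binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

result = sm.Logit(y_train, X_train).fit(disp=0)

# result.predict returns probabilities in [0, 1], which is exactly what
# roc_auc_score expects as y_score
print(roc_auc_score(y_test, result.predict(X_test)))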
One way to do nested cross-validation with a XGB model would be:
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier
# Let's assume that we have some data for a binary classification
# problem : X (n_samples, n_features) and y (n_samples,)...
gs = GridSearchCV(estimator=XGBClassifier(),
                  param_grid={'max_depth': [3, 6, 9],
                              'learning_rate': [0.001, 0.01, 0.05]},
                  cv=2)
scores = cross_val_score(gs, X, y, cv=2)
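To make that snippet self-contained, synthetic data could stand in for X and y, defined before the GridSearchCV/cross_val_score lines above (make_classification here is an assumption, not part of the original setup):

from sklearn.datasets import make_classification

# Illustrative stand-in for the binary classification data assumed above
X, y = make_classification(n_samples=500, n_features=10, random_state=0)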
However, regarding the tuning of XGB parameters, several tutorials (such as this one) take advantage of the Python hyperopt library. I would like to be able to do nested cross-validation (as above) using hyperopt to tune the XGB parameters.
To do so, I wrote my own Scikit-Learn estimator:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split
from sklearn.exceptions import NotFittedError
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
def optimize_params(X, y, params_space, validation_split=0.2):
    """Estimate a set of 'best' model parameters."""
    # Split X, y into train/validation
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=validation_split, stratify=y)

    # Estimate XGB params
    def objective(_params):
        _clf = XGBClassifier(n_estimators=10000,
                             max_depth=int(_params['max_depth']),
                             learning_rate=_params['learning_rate'],
                             min_child_weight=_params['min_child_weight'],
                             subsample=_params['subsample'],
                             colsample_bytree=_params['colsample_bytree'],
                             gamma=_params['gamma'])
        _clf.fit(X_train, y_train,
                 eval_set=[(X_train, y_train), (X_val, y_val)],
                 eval_metric='auc',
                 early_stopping_rounds=30)
        y_pred_proba = _clf.predict_proba(X_val)[:, 1]
        roc_auc = roc_auc_score(y_true=y_val, y_score=y_pred_proba)
        return {'loss': 1. - roc_auc, 'status': STATUS_OK}

    trials = Trials()
    return fmin(fn=objective,
                space=params_space,
                algo=tpe.suggest,
                max_evals=100,
                trials=trials,
                verbose=0)
class OptimizedXGB(BaseEstimator, ClassifierMixin):
    """XGB with optimized parameters.

    Parameters
    ----------
    custom_params_space : dict or None
        If not None, dictionary whose keys are the XGB parameters to be
        optimized and corresponding values are 'a priori' probability
        distributions for the given parameter value. If None, a default
        parameters space is used.
    """
    def __init__(self, custom_params_space=None):
        self.custom_params_space = custom_params_space

    def fit(self, X, y, validation_split=0.3):
        """Train an XGB model.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)
            Data.

        y : ndarray, shape (n_samples,) or (n_samples, n_labels)
            Labels.

        validation_split : float (default: 0.3)
            Float between 0 and 1. Corresponds to the percentage of samples
            in X which will be used as validation data to estimate the
            'best' model parameters.
        """
        # If no custom parameters space is given, use a default one.
        if self.custom_params_space is None:
            _space = {
                'learning_rate': hp.uniform('learning_rate', 0.0001, 0.05),
                'max_depth': hp.quniform('max_depth', 8, 15, 1),
                'min_child_weight': hp.quniform('min_child_weight', 1, 5, 1),
                'subsample': hp.quniform('subsample', 0.7, 1, 0.05),
                'gamma': hp.quniform('gamma', 0.9, 1, 0.05),
                'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 0.7, 0.05)
            }
        else:
            _space = self.custom_params_space

        # Estimate best params using X, y
        opt = optimize_params(X, y, _space, validation_split)

        # Instantiate `xgboost.XGBClassifier` with optimized parameters
        best = XGBClassifier(n_estimators=10000,
                             max_depth=int(opt['max_depth']),
                             learning_rate=opt['learning_rate'],
                             min_child_weight=opt['min_child_weight'],
                             subsample=opt['subsample'],
                             gamma=opt['gamma'],
                             colsample_bytree=opt['colsample_bytree'])
        best.fit(X, y)
        self.best_estimator_ = best
        return self

    def predict(self, X):
        """Predict labels with trained XGB model.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)

        Returns
        -------
        output : ndarray, shape (n_samples,) or (n_samples, n_labels)
        """
        if not hasattr(self, 'best_estimator_'):
            raise NotFittedError('Call `fit` before `predict`.')
        else:
            return self.best_estimator_.predict(X)

    def predict_proba(self, X):
        """Predict label probabilities with trained XGB model.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)

        Returns
        -------
        output : ndarray, shape (n_samples,) or (n_samples, n_labels)
        """
        if not hasattr(self, 'best_estimator_'):
            raise NotFittedError('Call `fit` before `predict_proba`.')
        else:
            return self.best_estimator_.predict_proba(X)
My questions are:
Is this a valid approach? For instance, in the fit method of my OptimizedXGB, best.fit(X, y) will train an XGB model on X, y. However, this might lead to overfitting, as no eval_set is specified to ensure early stopping.
On a toy example (the famous iris dataset), this OptimizedXGB performs worse than a basic LogisticRegression classifier. Why is that? Is it because the example is too simplistic? See below for the code of the example.
Example:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
X = X[:, :2]
X = X[y < 2]
y = y[y < 2]
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)  # shuffle=True is required when random_state is set
# With a LogisticRegression classifier
pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression())])
gs = GridSearchCV(estimator=pipe, param_grid={'lr__C': [1., 10.]})
lr_scores = cross_val_score(gs, X, y, cv=skf)
# With OptimizedXGB
xgb_scores = cross_val_score(OptimizedXGB(), X, y, cv=skf)
# Print results
print('Accuracy with LogisticRegression = %.4f (+/- %.4f)' % (np.mean(lr_scores), np.std(lr_scores)))
print('Accuracy with OptimizedXGB = %.4f (+/- %.4f)' % (np.mean(xgb_scores), np.std(xgb_scores)))
Outputs:
Accuracy with LogisticRegression = 0.9900 (+/- 0.0100)
Accuracy with OptimizedXGB = 0.9100 (+/- 0.0300)
Although the scores are close, I would have expected the XGB model to score at least as well as a LogisticRegression classifier.
EDIT:
similar post
First, check this post - it might help - on nested CV.
Regarding your questions:
Yes, that's the right way to go. Once you have your hyperparameters selected, you should fit your model (the selected model) on the entire training data. However, since this model includes a model selection process inside, you can only "score" how well it generalizes using an external CV, like you did.
Since you are scoring the selection process as well (and not just the model, say XGB vs. linear regression), there might be some problem with the selection process. Maybe your hyperparameter space is not properly defined and you are choosing poor parameters?
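For instance, the default space above searches gamma only between 0.9 and 1 and caps the learning rate at 0.05, which is quite narrow. A sketch of a broader space (the ranges are illustrative assumptions, not tuned recommendations):

import numpy as np
from hyperopt import hp

# Broader, log-scaled ranges for the rate-like parameters
space = {
    'learning_rate': hp.loguniform('learning_rate', np.log(1e-4), np.log(0.3)),
    'max_depth': hp.quniform('max_depth', 2, 12, 1),
    'min_child_weight': hp.quniform('min_child_weight', 1, 10, 1),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'gamma': hp.loguniform('gamma', np.log(1e-8), np.log(5.0)),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1.0),
}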
I am using logistic regression for prediction. My predictions are 0's and 1's. After training my model on the given data, and also when training on the important features (X_important_train), I get a score of around 70%, but when I use roc_auc_score(X, y) or roc_auc_score(X_important_train, y_train) I get a value error:
ValueError: multiclass-multioutput format is not supported
Code:
# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Instantiate the classifier
model = LogisticRegression()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

model.fit(X_important_train, y_train)
model.score(X_important_train, y_train)

roc_auc_score(X_important_train, y_train)
First of all, the roc_auc_score function expects input arguments with the same shape.
sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None)

Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format.

y_true : array, shape = [n_samples] or [n_samples, n_classes]
    True binary labels in binary label indicators.

y_score : array, shape = [n_samples] or [n_samples, n_classes]
    Target scores, can either be probability estimates of the positive class,
    confidence values, or non-thresholded measure of decisions (as returned by
    "decision_function" on some classifiers).
Now, the inputs are the true and predicted scores, NOT the training and label data as you are using in the example that you posted.
In more detail,
model.fit(X_important_train, y_train)
model.score(X_important_train, y_train)
# this is wrong here
roc_auc_score(X_important_train, y_train)
You should do something like:
y_pred = model.predict(X_test_data)
roc_auc_score(y_true, y_pred)
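Note that for ROC AUC it is generally preferable to pass probability scores rather than hard 0/1 predictions. A sketch, assuming a fitted binary classifier and a held-out test set (X_test_data and y_test_data are placeholder names):

# Probability of the positive class gives a threshold-free ranking score;
# hard predict() labels collapse the ROC curve to a single point
y_scores = model.predict_proba(X_test_data)[:, 1]
roc_auc_score(y_test_data, y_scores)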
I try to calculate an aggregated confusion matrix to evaluate my model:
cv_results = cross_validate(estimator, dataset.data, dataset.target, scoring=scoring,
                            cv=Config.CROSS_VALIDATION_FOLDS, n_jobs=N_CPUS,
                            return_train_score=False)
But I don't know how to extract the single confusion matrices of the different folds. In a scorer I can compute it:
scoring = {
    'cm': make_scorer(confusion_matrix)
}

But I cannot return the confusion matrix, because a scorer has to return a number instead of an array. If I try it, I get the following error:
ValueError: scoring must return a number, got [[...]] (<class 'numpy.ndarray'>) instead. (scorer=cm)
I wonder if it is possible to store the confusion matrices in a global variable, but I had no success using

global cm_list
cm_list.append(confusion_matrix(y_true, y_pred))
in a custom scorer.
Thanks in advance for any advice.
To return the confusion matrix for each fold, you can call confusion_matrix from the metrics module in each iteration (fold), which will give you an array as output. The input will be the y_true and y_predict values obtained for each fold.
from sklearn import metrics

print(metrics.confusion_matrix(y_true, y_predict))

array([[327582, 264313],
       [167523, 686735]])
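A sketch of what that looks like in an explicit cross-validation loop, summing the per-fold matrices into one aggregated matrix (estimator, X and y are assumed to be the objects from the question):

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

aggregated_cm = None
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    estimator.fit(X[train_idx], y[train_idx])
    fold_cm = confusion_matrix(y[test_idx], estimator.predict(X[test_idx]))
    aggregated_cm = fold_cm if aggregated_cm is None else aggregated_cm + fold_cm

print(aggregated_cm)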
Alternatively, if you are using pandas, then pandas has a crosstab function:

df_conf = pd.crosstab(y_true, y_predict, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(df_conf)
Predicted       0       1     All
Actual
0          332553   58491  391044
1           97283  292623  389906
All        429836  351114  780950
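If a single aggregated matrix over all folds is the end goal, another option is cross_val_predict, which collects the out-of-fold prediction for every sample so that one confusion_matrix call covers all folds (a sketch under the same assumptions about estimator, X and y):

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Each sample is predicted by a model trained on folds it did not appear in
y_oof = cross_val_predict(estimator, X, y, cv=5)
print(confusion_matrix(y, y_oof))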
The problem was that I could not access the estimator after RandomizedSearchCV was finished, because I did not know that RandomizedSearchCV implements a predict method. Here is my personal solution:
r_search = RandomizedSearchCV(estimator=estimator, param_distributions=param_distributions,
n_iter=n_iter, cv=cv, scoring=scorer, n_jobs=n_cpus,
refit=next(iter(scorer)))
r_search.fit(X, y_true)
y_pred = r_search.predict(X)
cm = confusion_matrix(y_true, y_pred)