sklearn RandomizedSearchCV extract confusion matrix for different folds - python

I try to calculate an aggregated confusion matrix to evaluate my model:
cv_results = cross_validate(estimator,,, scoring=scoring,
cv=Config.CROSS_VALIDATION_FOLDS, n_jobs=N_CPUS, return_train_score=False)
But I don't know how to extract the single confusion matrices of the different folds. In a scorer I can compute it:
scoring = {
'cm': make_scorer(confusion_matrix)
}
but I cannot return the confusion matrix, because a scorer has to return a number, not an array. If I try, I get the following error:
ValueError: scoring must return a number, got [[...]] (<class 'numpy.ndarray'>) instead. (scorer=cm)
I wonder if it is possible to store the confusion matrices in a global variable, but had no success using
global cm_list
in a custom scorer.
Thanks in advance for any advice.

To get a confusion matrix for each fold, call confusion_matrix from sklearn.metrics inside each iteration (fold); it returns an array. The inputs are the y_true and y_predict values obtained for that fold.
from sklearn import metrics
print(metrics.confusion_matrix(y_true, y_predict))
[[327582 264313]
 [167523 686735]]
Alternatively, if you are using pandas, it has a crosstab function:
df_conf = pd.crosstab(y_true, y_predict, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(df_conf)
Predicted       0       1     All
0          332553   58491  391044
1           97283  292623  389906
All        429836  351114  780950

The problem was that I could not access the estimator after RandomizedSearchCV had finished, because I did not know that RandomizedSearchCV implements a predict method. Here is my personal solution:
r_search = RandomizedSearchCV(estimator=estimator, param_distributions=param_distributions,
                              n_iter=n_iter, cv=cv, scoring=scorer, n_jobs=n_cpus,
                              refit=next(iter(scorer)))
r_search.fit(X, y_true)
y_pred = r_search.predict(X)
cm = confusion_matrix(y_true, y_pred)
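Note that predicting on the same X that was used for fitting gives a confusion matrix for the training data, not one per fold. If per-fold matrices are what you need, one option is to run the folds yourself. The following is only a minimal sketch under assumptions: the data is synthetic, and LogisticRegression stands in for whatever tuned estimator you have (for example r_search.best_estimator_).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption); replace with your own X and y.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
labels = np.unique(y)

# Stand-in for the tuned model (assumption), e.g. r_search.best_estimator_.
estimator = LogisticRegression(max_iter=1000)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_cms = []
for train_idx, test_idx in cv.split(X, y):
    # Fit a fresh copy on the training part of the fold.
    model = clone(estimator).fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    # Passing labels keeps every matrix the same shape,
    # even if a fold happens to miss a class.
    fold_cms.append(confusion_matrix(y[test_idx], y_pred, labels=labels))

# Element-wise sum of the per-fold matrices.
aggregated_cm = np.sum(fold_cms, axis=0)
print(aggregated_cm)
```

Because every sample lands in exactly one test fold, the entries of the aggregated matrix add up to the number of samples.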


How to deal with unbalanced xgboost multiclass classification within Scikit.learn pipeline?

I am using XGBClassifier to model an unbalanced multiclass target. I have a few questions:
First, I would like to know where I should pass the weight parameter: when instantiating the classifier or in the fit step of the pipeline?
Second, how do I calculate the weights? I assume that the sum of the array should be 1.
Third, is there any ordering of the weight array that maps to the different label classes?
Thank you all in advance
For your first question:
where should I use the parameter weight
Use sample_weight in fit():
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X, y, sample_weight=sample_weight)
When using a pipeline:
pipe = Pipeline([
    ('my_xgb_clf', xgb.XGBClassifier()),
])
pipe.fit(X, y, my_xgb_clf__sample_weight=sample_weight)
Btw, some APIs in sklearn do not support the sample_weight kwarg, e.g., learning_curve.
So I simply do this:
import functools
xgb_clf.fit = functools.partial(xgb_clf.fit, sample_weight=sample_weight)
Note: You would need to patch fit() again after a grid search, because GridSearchCV.best_estimator_ will not be the original estimator.
For the second question:
how I calculate a weights. I assume that the sum of the array should be 1.
from sklearn.utils import compute_sample_weight
sample_weight = compute_sample_weight('balanced', y_train)
This simulates class_weight='balanced' in sklearn.
The sum of the array is not 1. You can normalize it, but I think the score would be different.
This is not equivalent to class_weight='balanced_subsample'; I cannot find a way to simulate that.
For the third question:
Is there any order...
Sorry I don't understand what you mean...
Maybe you want the order in xgb_clf.classes_?
You can access this after calling xgb_clf.fit().
Or just use np.unique(y_train).
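To make the weight layout concrete, here is a small sketch. It assumes synthetic data and uses LogisticRegression as a stand-in for XGBClassifier so that it runs with scikit-learn alone; the step name "clf" is also just an illustration. The point it shows is that sample_weight has one entry per training row, aligned with y_train, so there is no separate per-class ordering to worry about.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, unbalanced 3-class target (assumption): 80 / 15 / 5 samples.
rng = np.random.default_rng(0)
y_train = np.array([0] * 80 + [1] * 15 + [2] * 5)
X_train = rng.normal(size=(100, 3)) + y_train[:, None]

# One weight per row; 'balanced' uses n_samples / (n_classes * class_count).
sample_weight = compute_sample_weight("balanced", y_train)
for cls in np.unique(y_train):
    print(cls, sample_weight[y_train == cls][0])

# LogisticRegression is a stand-in here (assumption); the routing syntax
# <step name>__sample_weight is the same for any final step whose fit()
# accepts sample_weight.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train, clf__sample_weight=sample_weight)
print(pipe.named_steps["clf"].classes_)
```

With 'balanced' weights, each class contributes the same total weight, and the weights sum to the number of samples rather than to 1.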

post-process cross-validated prediction before scoring

I have a regression problem, where I am cross-validating the results and evaluating the performance. I know beforehand that the ground truth cannot be smaller than zero. Therefore, I would like to intercept the predictions, before they are fed to the score metric, to clip the predictions to zero. I thought that using the make_scorer function would be useful to do this. Is it possible to somehow post-process the predictions after cross-validation, but before applying an evaluation metric to it?
from sklearn.metrics import mean_squared_error, r2_score, make_scorer
from sklearn.model_selection import cross_validate
# X = Stacked feature vectors
# y = ground truth vector
# regr = some regression estimator
#### How to indicate that the predictions need post-processing
#### before applying the score function???
scoring = {'r2': make_scorer(r2_score),
'neg_mse': make_scorer(mean_squared_error)}
scores = cross_validate(regr, X, y, scoring=scoring, cv=10)
PS: I know there are constrained estimators, but I wanted to see how a heuristic approach like this would perform.
One thing you can do is wrap those scorers you're looking to use (r2_score, mean_squared_error) in a custom scorer function using make_scorer() as you suggested.
Take a look at this part of the sklearn documentation and this Stack Overflow post for some examples. In particular, your function can do something like this:
import numpy as np
def clipped_r2(y_true, y_pred):
    # Clip negative predictions to zero before scoring
    y_pred_clipped = np.clip(y_pred, 0, None)
    return r2_score(y_true, y_pred_clipped)
def clipped_mse(y_true, y_pred):
    y_pred_clipped = np.clip(y_pred, 0, None)
    return mean_squared_error(y_true, y_pred_clipped)
This allows you to do the post-processing right within the scorer before calling the scoring function (in this case r2_score or mean_squared_error). To use it, call make_scorer like you were doing above, setting greater_is_better according to whether the metric is a score function (like r2, where greater is better) or a loss function (like mean_squared_error, where lower is better and 0 is perfect):
scoring = {'r2': make_scorer(clipped_r2, greater_is_better=True),
'neg_mse': make_scorer(clipped_mse, greater_is_better=False)}
scores = cross_validate(regr, X, y, scoring=scoring, cv=10)
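For reference, a self-contained sketch of the whole flow. The data is synthetic and LinearRegression is just a stand-in for regr (both are assumptions), and the built-in 'neg_mean_squared_error' scorer is added only for comparison with the clipped version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error, r2_score
from sklearn.model_selection import cross_validate

def clipped_r2(y_true, y_pred):
    # Clip negative predictions to zero before scoring.
    return r2_score(y_true, np.clip(y_pred, 0, None))

def clipped_mse(y_true, y_pred):
    return mean_squared_error(y_true, np.clip(y_pred, 0, None))

# Synthetic non-negative target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.maximum(0.0, X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200))

scoring = {
    "r2": make_scorer(clipped_r2),
    # greater_is_better=False makes the scorer return the negated loss.
    "neg_mse": make_scorer(clipped_mse, greater_is_better=False),
    # Built-in scorer without clipping, for comparison.
    "neg_mse_unclipped": "neg_mean_squared_error",
}
scores = cross_validate(LinearRegression(), X, y, scoring=scoring, cv=5)
print(scores["test_neg_mse"].mean(), scores["test_neg_mse_unclipped"].mean())
```

Since the true values are never negative, clipping can only move a negative prediction closer to the truth, so the clipped MSE is never worse than the unclipped one in any fold.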

f1_score metric in lightgbm

I want to train a lgb model with custom metric : f1_score with weighted average.
I went through the advanced examples of lightgbm over here and found the implementation of a custom binary error function. I implemented a similar function to return f1_score, as shown below.
def f1_metric(preds, train_data):
    labels = train_data.get_label()
    return 'f1', f1_score(labels, preds, average='weighted'), True
I tried to train the model by passing feval parameter as f1_metric as shown below.
evals_results = {}
bst = lgb.train(params,
valid_sets= [dvalid],
Then I am getting ValueError: Found input variables with inconsistent numbers of samples:
The training set is being passed to the function rather than the validation set.
How can I configure such that the validation set is passed and f1_score is returned?
The docs are a bit confusing. When describing the signature of the function that you pass to feval, they call its parameters preds and train_data, which is a bit misleading.
But the following seems to work:
import numpy as np
import lightgbm as lgb
from sklearn.metrics import f1_score
def lgb_f1_score(y_hat, data):
    y_true = data.get_label()
    y_hat = np.round(y_hat)  # scikit-learn's f1_score expects class labels, not probabilities
    return 'f1', f1_score(y_true, y_hat), True
evals_result = {}
# Note: LightGBM 4.0 removed the evals_result argument; there, pass callbacks=[lgb.record_evaluation(evals_result)] instead.
clf = lgb.train(param, train_data, valid_sets=[val_data, train_data], valid_names=['val', 'train'], feval=lgb_f1_score, evals_result=evals_result)
lgb.plot_metric(evals_result, metric='f1')
To use more than one custom metric, define one overall custom metrics function just like above, in which you calculate all metrics and return a list of tuples.
Edit: Fixed the code; with F1, higher is better, so the third return value should be True.
Regarding Toby's answers:
def lgb_f1_score(y_hat, data):
    y_true = data.get_label()
    y_hat = np.round(y_hat) # scikits f1 doesn't like probabilities
    return 'f1', f1_score(y_true, y_hat), True
I suggest changing the y_hat line to this:
y_hat = np.where(y_hat < 0.5, 0, 1)
I used y_hat = np.round(y_hat) and found that during training the LightGBM model will sometimes (very unlikely, but there is still a chance) treat our y prediction as multiclass instead of binary.
My speculation:
Sometimes the y prediction will be small or large enough to be rounded to a negative value or to 2? I'm not sure, but when I changed the code to use np.where, the bug was gone.
It cost me a morning to track down this bug, although I'm not really sure whether the np.where solution is good.
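As an illustration of the "list of tuples" idea combined with the np.where thresholding, here is a sketch that runs without LightGBM installed. FakeDataset is a hypothetical stand-in that only mimics the get_label() method of lgb.Dataset, and the function assumes y_hat holds probabilities (with a custom objective, LightGBM hands the metric raw scores instead, so the 0.5 threshold would not apply).

```python
import numpy as np
from sklearn.metrics import f1_score

def lgb_binary_metrics(y_hat, data):
    """feval-style function: returns a list of (name, value, is_higher_better)."""
    y_true = data.get_label()
    # Threshold at 0.5 so the result is always exactly 0 or 1.
    y_label = np.where(y_hat < 0.5, 0, 1)
    return [
        ("f1", f1_score(y_true, y_label), True),
        ("f1_weighted", f1_score(y_true, y_label, average="weighted"), True),
    ]

class FakeDataset:
    """Hypothetical stand-in for lgb.Dataset; only get_label() is mimicked."""
    def __init__(self, label):
        self._label = np.asarray(label)

    def get_label(self):
        return self._label

# Quick check outside of training.
probs = np.array([0.1, 0.8, 0.4, 0.9, 0.6, 0.2])
data = FakeDataset([0, 1, 1, 1, 0, 0])
results = lgb_binary_metrics(probs, data)
for name, value, higher_is_better in results:
    print(name, round(value, 4), higher_is_better)
```

In real training you would pass the function as feval=lgb_binary_metrics to lgb.train, and LightGBM would call it once per entry in valid_sets.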

Logistic Regression mean square error

NOTE: I appreciate the massive quantity of comments suggesting that this is inappropriate to quantify model performance. However, this is irrelevant to my error, and this error occurs for a variety of other metrics. Also, see here for the appropriate way to respond when you think the OP is "asking the wrong question"
I have an sklearn logistic model for which I am attempting to get the RMSE. However, when I .predict_proba, I get a vector of probabilities. However, my y_test is in its categorical form, which sklearn.linear_model.LogisticRegression just sort of dealt with automagically.
How do I reconcile these two things to get the RMSE?
>>> sklearn.metrics.mean_squared_error(y_test, pred_proba, sample_weight=weights_test)
ValueError: y_true and y_pred have different number of output (1!=13)
predict_proba is predicting the probability that a sample belongs to a class. The arg max of those probabilities is the predicted class (categorical form). RMSE is not a metric for classification. If you want to evaluate your model, consider a different metric like accuracy_score:
from sklearn.metrics import accuracy_score
predictions = your_model.predict(X_test)
print("Accuracy: %.3f" % accuracy_score(y_test, predictions))
The Brier score, basically the mean squared error, is a known and valid loss function for classification models that leverage probability scores; I would take a look at that as well.
To your particular issue, you want to compare the probabilities returned for your target class, i.e. for a binary class problem:
from sklearn.metrics import brier_score_loss
probs = your_model.predict_proba(X_test)
brier_score_loss(y_true, probs[:, 1])
I'm not sure the Brier score is formally defined for multiclass problems. I would point to the idea of mean misclassification error, which averages the error across classes.
To leverage this within the sklearn API, encode your y_true categorically, i.e. each class gets its own column, and call
sklearn.metrics.mean_squared_error(y_true, probs, multioutput='uniform_average')
Here is how you can calculate RMSE:
import numpy as np
from sklearn.metrics import mean_squared_error
x = np.arange(10)
y = x
rmse = np.sqrt(mean_squared_error(x, y))
One can transform the y_test into a format compatible with the predict_proba output as follows:
model = sklearn.linear_model.LogisticRegression().fit(X,y) # or whatever model
label_encoder = sklearn.preprocessing.LabelEncoder()
label_encoder.classes_ = model.classes_
y_test_onehot = sklearn.preprocessing.OneHotEncoder().fit_transform(label_encoder.transform(y_test).reshape((-1,1)))
You can now apply any of the metrics in sklearn.metrics. This is essential for computing, say, the Brier score.
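A compact variant of the same idea, sketched with label_binarize instead of the LabelEncoder/OneHotEncoder pair. The iris data and the model settings are assumptions chosen only to make the example runnable. Note that for a binary target label_binarize returns a single column, so this sketch applies to three or more classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Example 3-class data (assumption).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)  # shape (n_samples, n_classes)

# predict_proba columns follow model.classes_, so one-hot encode
# y_test using the same class order.
y_test_onehot = label_binarize(y_test, classes=model.classes_)

# The default multioutput='uniform_average' averages the per-class MSE.
mse = mean_squared_error(y_test_onehot, probs)
rmse = np.sqrt(mse)
print(rmse)
```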

Why when I use GridSearchCV with roc_auc scoring, the score is different for grid_search.score(X,y) and roc_auc_score(y, y_predict)?

I am using stratified 10-fold cross-validation to find the model that predicts y (binary outcome) from X (X has 34 labels) with the highest AUC. I set up the GridSearchCV:
log_reg = LogisticRegression()
parameter_grid = {'penalty' : ["l1", "l2"],'C': np.arange(0.1, 3, 0.1),}
cross_validation = StratifiedKFold(n_splits=10,shuffle=True,random_state=100)
grid_search = GridSearchCV(log_reg, param_grid = parameter_grid,scoring='roc_auc',
cv = cross_validation)
And then do the cross-validation:
grid_search.fit(X, y)
I do not understand the following:
Why do grid_search.score(X,y) and roc_auc_score(y, y_pr) give different results (the former is 0.74 and the latter is 0.63)? Why don't these commands do the same thing in my case?
This is due to different initialization of roc_auc when used in GridSearchCV.
Look at the source code here
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)
Observe the third parameter, needs_threshold. When it is True, the scorer requires continuous values for y_pred, such as probabilities or confidence scores, which in the grid search are computed from log_reg.decision_function().
When you explicitly call roc_auc_score with y_pr, you are using .predict() which will output the resultant predicted class labels of the data and not probabilities. That should account for the difference.
Try :
roc_auc_score(y, grid_search.decision_function(X))
If the results still differ, please update the question with complete code and some sample data.
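The effect is easy to reproduce on toy data. The following sketch uses synthetic data and a plain LogisticRegression instead of the grid search (both are assumptions) and compares the built-in 'roc_auc' scorer with roc_auc_score called on continuous scores and on hard labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, roc_auc_score

# Synthetic binary problem with some label noise (assumption).
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# What scoring='roc_auc' computes: AUC on continuous scores.
auc_from_scorer = get_scorer("roc_auc")(clf, X, y)

# Same thing done by hand with decision_function.
auc_from_scores = roc_auc_score(y, clf.decision_function(X))

# AUC on hard 0/1 predictions: a different quantity in general.
auc_from_labels = roc_auc_score(y, clf.predict(X))

print(auc_from_scorer, auc_from_scores, auc_from_labels)
```

The first two numbers agree, while the third is computed from a ROC curve with a single threshold and so usually differs.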
