How to deal with unbalanced xgboost multiclass classification within a scikit-learn pipeline?

I am using XGBClassifier to model an unbalanced multiclass target. I have a few questions:
First, where should I use the weight parameter: at instantiation of the classifier, or in the fit step of the pipeline?
Second, how do I calculate the weights? I assume the sum of the array should be 1.
Third, is there an ordering of the weight array that maps to the different label classes?
Thank you all in advance

For your first question:
where should I use the weight parameter
Use sample_weight in XGBClassifier.fit():

import xgboost as xgb

xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X, y, sample_weight=sample_weight)
When using a pipeline, prefix the keyword argument with the step name:

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('my_xgb_clf', xgb.XGBClassifier()),
])
pipe.fit(X, y, my_xgb_clf__sample_weight=sample_weight)
By the way, some APIs in sklearn do not support a sample_weight kwarg, e.g., learning_curve.
So I simply do this:
import functools

# Bind sample_weight into fit() so callers that don't accept
# a sample_weight kwarg (e.g., learning_curve) still use it:
xgb_clf.fit = functools.partial(xgb_clf.fit, sample_weight=sample_weight)
Note: You would need to patch fit() again after a grid search, because GridSearchCV.best_estimator_ will not be the original estimator.
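A minimal sketch of that re-patching step (the search grid below is illustrative, not from the original answer; X, y, and sample_weight are as defined earlier):

import functools
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(xgb.XGBClassifier(), {'max_depth': [3, 6]})  # illustrative grid
grid.fit(X, y)

# best_estimator_ is a freshly fitted clone, not the original object,
# so the instance-level patch on fit() must be applied again:
best = grid.best_estimator_
best.fit = functools.partial(best.fit, sample_weight=sample_weight)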
For the second question:
how do I calculate the weights? I assume the sum of the array should be 1.
from sklearn.utils.class_weight import compute_sample_weight

sample_weight = compute_sample_weight('balanced', y_train)
This simulates class_weight='balanced' in sklearn.
Note:
The sum of the array is not 1. You can normalize it, but I think the score result would be different.
This is not equivalent to class_weight='balanced_subsample'; I cannot find a way to simulate that.
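To make the 'balanced' behavior concrete, here is a small sketch (toy labels, illustrative only). Each sample gets the weight n_samples / (n_classes * count_of_its_class), so minority-class samples weigh more, and the weights sum to n_samples rather than 1:

import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

y_toy = np.array([0, 0, 0, 0, 1, 1, 2])        # imbalanced toy labels
w = compute_sample_weight('balanced', y_toy)
print(w)        # class 0 -> 7/12, class 1 -> 7/6, class 2 -> 7/3
print(w.sum())  # 7.0 == n_samples, not 1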
For the third question:
Is there an ordering...
Sorry, I don't understand what you mean...
Maybe you want the order in xgb_clf.classes_?
You can access this after calling xgb_clf.fit.
Or just use np.unique(y_train).
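For example (a small sketch; the label values are illustrative):

import numpy as np

print(xgb_clf.classes_)     # e.g. array([0, 1, 2]); the column order of predict_proba
print(np.unique(y_train))   # the same sorted order, available before fitting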

Related

How to use a custom loss function in a Neural Network with MLPClassifier Sklearn?

I would like to use a custom loss function to train a neural network in scikit-learn, using MLPClassifier. I would like to give more importance to larger values. Therefore, I would like to use something like the mean squared error, but with each term multiplied by y. Thus, it would look like:
$\frac{1}{n}\sum_{i=1}^{n} y_i (y_i - \hat{y}_i)^2$
Here is the code of my model:
mlp10 = MLPClassifier(hidden_layer_sizes=(150, 100, 50, 25, 10), max_iter=1000,
                      random_state=42)
mlp10.fit(X_train, y_train)
How can I modify the loss function ?
I don't believe you can modify the loss function directly, as there is no parameter for it in the constructor of the classifier, and the documentation explicitly specifies that it optimizes the log-loss function. If you're willing to be a bit flexible, you might be able to get the effect you're looking for simply by a transform of the y values before training, and then use the inverse transform to recover the predicted ys after testing.
For instance, mapping y_prime = transform(y) and y = inverse_transform(y_prime) on each value where you define transform and inverse_transform as:
import math

def transform(y):
    return y ** 2

def inverse_transform(y_prime):
    return math.sqrt(y_prime)
would cause larger values of y to have more influence in the training. Obviously you could experiment with different transforms to see what works best for your use-case. The key is just to make sure that transform is superlinear.
Before training you'd need to do:
y_train = list(map(transform, y_train))  # wrap in list(): map is lazy in Python 3
And after calling predict:
y_predict = model.predict(x)
y_predict = list(map(inverse_transform, y_predict))
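Since this transform-and-invert trick presumes a numeric target, a numpy-vectorized sketch of the same idea would use a regressor such as MLPRegressor rather than MLPClassifier; all names and data below are illustrative:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(42)
X_train = rng.rand(200, 5)                     # toy features
y_train = rng.rand(200) * 10                   # toy numeric target

model = MLPRegressor(hidden_layer_sizes=(50, 25), max_iter=2000, random_state=42)
model.fit(X_train, y_train ** 2)               # train on the superlinear transform
raw = model.predict(X_train)
y_predict = np.sqrt(np.clip(raw, 0, None))     # inverse transform; clip guards
                                               # against small negative predictions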

Logistic Regression mean square error

NOTE: I appreciate the massive quantity of comments suggesting that this is inappropriate to quantify model performance. However, this is irrelevant to my error, and this error occurs for a variety of other metrics. Also, see here for the appropriate way to respond when you think the OP is "asking the wrong question"
I have an sklearn logistic model for which I am attempting to get the RMSE. However, when I call .predict_proba, I get a vector of probabilities, while my y_test is in its categorical form, which sklearn.linear_model.LogisticRegression just sort of dealt with automagically.
How do I reconcile these two things to get the RMSE?
>>> sklearn.metrics.mean_squared_error(y_test, pred_proba, sample_weight=weights_test)
ValueError: y_true and y_pred have different number of output (1!=13)
predict_proba predicts the probability that a sample belongs to each class. The argmax of those probabilities is the predicted class (the categorical form). RMSE is not a metric for classification. If you want to evaluate your model, consider a different metric like accuracy_score:
from sklearn.metrics import accuracy_score
predictions = your_model.predict(X_test)
print("Accuracy: %.3f" % accuracy_score(y_test, predictions))
The Brier score, which is essentially the mean squared error on predicted probabilities, is a known and valid loss function for classification models that produce probability scores; I would take a look at that as well.
To your particular issue: you want to compare the probabilities returned for your target class, i.e. for a binary class problem:

from sklearn.metrics import brier_score_loss

probs = your_model.predict_proba(X_test)
brier_score_loss(y_test, probs[:, 1])
I'm not sure the Brier score is formally defined for multiclass problems. I would point to the idea of mean misclassification error, which averages the error across classes.
To leverage this within the sklearn API, one-hot encode your y_true (i.e. each class gets its own column) and call:

sklearn.metrics.mean_squared_error(y_true, probs, multioutput='uniform_average')
Here is how you can calculate RMSE:

import numpy as np
from sklearn.metrics import mean_squared_error

x = np.arange(10)
y = x + 1  # shifted by one so the RMSE comes out as exactly 1.0
rmse = np.sqrt(mean_squared_error(x, y))
One can transform y_test into a format compatible with the predict_proba output as follows:

import sklearn.linear_model
import sklearn.preprocessing

model = sklearn.linear_model.LogisticRegression().fit(X, y)  # or whatever model
label_encoder = sklearn.preprocessing.LabelEncoder()
label_encoder.classes_ = model.classes_
y_test_onehot = sklearn.preprocessing.OneHotEncoder().fit_transform(
    label_encoder.transform(y_test).reshape((-1, 1)))

You can now apply any of the metrics in sklearn.metrics. This is essential for computing, say, the Brier score.
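For instance, a sketch of the multiclass mean-squared-error call suggested above, using this encoding (note the sparse one-hot matrix is densified first):

from sklearn.metrics import mean_squared_error

probs = model.predict_proba(X_test)
mse = mean_squared_error(y_test_onehot.toarray(), probs,
                         multioutput='uniform_average')
rmse = mse ** 0.5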

Sci-kit: What's the easiest way to get the confusion matrix of an estimator when using GridSearchCV?

In this simplified example, I've trained a learner with GridSearchCV. I would like to return the confusion matrix of the best learner when predicting on the full set X.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

lr_pipeline = Pipeline([('clf', LogisticRegression())])
lr_parameters = {}

lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1)
lr_gs = lr_gs.fit(X, y)

print(lr_gs.confusion_matrix)  # Would like to be able to do this
Thanks
You will first need to predict using the best estimator in your GridSearchCV. A common method to use is GridSearchCV.decision_function(), but for your example decision_function returns continuous confidence scores from LogisticRegression, not class labels, and it does not work with confusion_matrix. Instead, find the best estimator using lr_gs and predict the labels with that estimator.
y_pred = lr_gs.best_estimator_.predict(X)
Finally, use sklearn's confusion_matrix on the real and predicted y:

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y, y_pred))
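As an aside, if you would rather have a confusion matrix built from held-out predictions than from the training data (see the next answer for why that matters), one possible sketch pairs the best estimator with cross_val_predict:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# out-of-fold predictions: each sample is predicted by a model
# fitted without that sample
y_oof = cross_val_predict(lr_gs.best_estimator_, X, y, cv=5)
print(confusion_matrix(y, y_oof))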
I found this question while searching for how to calculate the confusion matrix while fitting Sci-kit Learn's GridSearchCV. I was able to find a solution by defining a custom scoring function, although it's somewhat kludgy. I'm leaving this answer for anyone else who makes a similar search.
As mentioned by @MLgeek and @bugo99iot, the accepted answer by @Sudeep Juvekar isn't really satisfactory. It offers a literal answer to the original question as asked, but it's not usually the case that a machine learning practitioner would be interested in the confusion matrix of a fitted model on its training data. It is more typically of interest to know how well a model generalizes to data it hasn't seen.
To use a custom scoring function in GridSearchCV you will need to import the Scikit-learn helper function make_scorer.
from sklearn.metrics import make_scorer
The custom scoring function looks like this
def _count_score(y_true, y_pred, label1=0, label2=1):
    return sum((y == label1 and pred == label2)
               for y, pred in zip(y_true, y_pred))
For a given pair of labels, (label1, label2), it calculates the number of examples where the true value of y is label1 and the predicted value of y is label2.
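A quick sanity check with toy labels (illustrative values only):

y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]
# exactly one example has true label 1 and predicted label 0:
print(_count_score(y_true_toy, y_pred_toy, label1=1, label2=0))  # 1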
To start, find all of the labels in the training data
all_labels = sorted(set(y))
The optional argument scoring of GridSearchCV can receive a dictionary mapping strings to scorers. make_scorer can take a scoring function along with bindings for some of its parameters and produce a scorer, which is a particular type of callable that is used for scoring in GridSearchCV, cross_val_score, etc. Let's build up this dictionary for each pair of labels.
scorer = {}
for label1 in all_labels:
    for label2 in all_labels:
        count_score = make_scorer(_count_score, label1=label1,
                                  label2=label2)
        scorer['count_%s_%s' % (label1, label2)] = count_score
You'll also want to add any additional scoring functions you're interested in. To avoid getting into the subtleties of scoring for multi-class classification let's add a simple accuracy score.
# import placed here for the sake of demonstration.
# Should be imported alongside make_scorer above
from sklearn.metrics import accuracy_score
scorer['accuracy'] = make_scorer(accuracy_score)
We can now fit GridSearchCV
num_splits = 5

lr_gs = GridSearchCV(lr_pipeline, lr_parameters, n_jobs=-1,
                     scoring=scorer, refit='accuracy',
                     cv=num_splits)
refit='accuracy' tells GridSearchCV that it should judge by best accuracy score to decide on the parameters to use when refitting. In the case where you are passing a dictionary of multiple scorers to scoring, if you do not pass a value to the optional argument refit, GridSearchCV will not refit the model on all training data. We've explicitly set the number of splits because we'll need to know this later.
Now, for each of the training folds used in cross-validation, essentially what we've done is calculate the confusion matrix on the respective test folds. The test folds do not overlap and cover the entire space of the data; we've therefore made predictions for each data point in X in such a way that the prediction for each point does not depend on the associated target label for that point.
We can add up the confusion matrices associated to the test folds to get something useful that gives information on how well the model generalizes. It can also be interesting to look at the confusion matrices for the test folds separately and do stuff like calculate variances.
We're not done yet though. We need to actually pull out the confusion matrix for the best estimator. In this example, the cross-validation results are stored in the dictionary lr_gs.cv_results_. First let's get the index in the results corresponding to the best set of parameters (ranks in cv_results_ start at 1; since refit is set, lr_gs.best_index_ would work equally well):

import numpy as np

best_index = np.flatnonzero(lr_gs.cv_results_['rank_test_accuracy'] == 1)[0]
If you are using a different metric to decide upon the best parameters, substitute for 'accuracy' the key you are using for the associated scorer in the scoring dictionary passed to GridSearchCV.
In my own application I chose to store the confusion matrix as a nested dictionary.
from collections import defaultdict

confusion = defaultdict(lambda: defaultdict(int))
for label1 in all_labels:
    for label2 in all_labels:
        for i in range(num_splits):
            key = 'split%s_test_count_%s_%s' % (i, label1, label2)
            val = int(lr_gs.cv_results_[key][best_index])
            confusion[label1][label2] += val
confusion = {key: dict(value) for key, value in confusion.items()}
There's some stuff to unpack here. defaultdict(lambda: defaultdict(int)) constructs a nested defaultdict: a defaultdict of defaultdicts of int (the required from collections import defaultdict is included above). The last line of this snippet turns confusion into a regular dict of dict of int. Never leave defaultdicts lying around when they are no longer needed.
You will likely want to store your confusion matrix in a different way. The key fact is that the confusion matrix entry for the pair of labels (label1, label2) for test fold i is stored in

lr_gs.cv_results_['split%s_test_count_%s_%s' % (i, label1, label2)][best_index]
See here for an example of this confusion matrix calculation used in practice. I think it's a bit of a code smell to rely on the specific format of the keys in the cv_results dictionary but this does work, at least as of the day of this post.

Creating scorer for Brier Score Loss in scikit-learn

I'm trying to make use of GridSearchCV and RandomizedSearchCV in scikit-learn (0.16.1) for logistic regression and random forest classifiers (and possibly others down the road) for binary class problems. I managed to get GridSearchCV to work with the standard LogisticRegression classifier, but I cannot get LogisticRegressionCV to work (or RandomizedSearchCV for the RandomForestClassifier) with a customized scoring function, specifically brier_score_loss. I have tried this code:

from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import make_scorer, brier_score_loss

lrcv = LogisticRegressionCV(scoring=make_scorer(brier_score_loss, greater_is_better=False,
                                                needs_proba=True, needs_threshold=False,
                                                pos_label=1))
lrcv_clf = lrcv.fit(X=X_train, y=y_train)

But I keep getting errors essentially saying that the brier_score_loss function is receiving input (y_prob) with 2 columns, causing a bad-input-shape error. Is there a way to specify using only the second column of y_prob (from lrcv.predict_proba) so that the Brier score can be calculated this way? I thought pos_label might help, but apparently not. Do I need to avoid make_scorer and just create my own scoring function?
Thanks for any suggestions!
predict_proba returns two columns of probabilities for every sample: the first is the probability of class 0 and the second the probability of class 1. You should choose the one you need and pass it on to the scoring function.
I'm doing this with the simple proxy function:
def ProbaScoreProxy(y_true, y_probs, class_idx, proxied_func, **kwargs):
    return proxied_func(y_true, y_probs[:, class_idx], **kwargs)
That can be used like this:

from sklearn import metrics

scorer = metrics.make_scorer(ProbaScoreProxy, greater_is_better=False, needs_proba=True,
                             class_idx=1, proxied_func=metrics.brier_score_loss)
For binary classification, class_idx can be 0 or 1.
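Putting it together, a minimal sketch of fitting LogisticRegressionCV with this scorer (the toy data below is illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.RandomState(0)
X_train = rng.rand(100, 4)          # toy features
y_train = rng.randint(0, 2, 100)    # toy binary labels

# scorer is the ProbaScoreProxy-based scorer built above
lrcv = LogisticRegressionCV(scoring=scorer)
lrcv_clf = lrcv.fit(X=X_train, y=y_train)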

Scikit-Learn Classification and Regression with Weights

How can I do classification or regression in sklearn if I want to weight each sample differently? Is there a way to do it with a custom loss function? If so, what does that loss function look like in general? Is there an easier way?
To weigh individual samples, feed a sample_weight array to the estimator's fit method. This should be a 1-d array of length n_samples (i.e. the same dimension as y in most tasks):
estimator.fit(X, y, sample_weight=some_array)
Not all models support this; check the documentation.
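For example, a minimal sketch with LogisticRegression and balanced per-sample weights (toy data, illustrative only):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.RandomState(0)
X = rng.rand(60, 3)                             # toy features
y = rng.choice([0, 1], size=60, p=[0.8, 0.2])   # imbalanced toy labels

weights = compute_sample_weight('balanced', y)  # upweights the minority class
clf = LogisticRegression().fit(X, y, sample_weight=weights)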
