Use scoring function used by GridSearchCV to get predictions - python

I am creating dynamic pipelines in scikit-learn and I set the scoring function via a string parameter on GridSearchCV:
gs = GridSearchCV(pipeline, grid, scoring='accuracy')
However, when I try to use the scoring function to evaluate the predictions, this is what I get:
File "app/experimenter/sklearn/sklearn-dask-tests.py", line 127, in run_pipeline
print(evaluator(expected, predicted))
TypeError: __call__() takes at least 4 arguments (3 given)
This is the code:
gs.fit(train_data, train_target)
predicted = gs.predict(test_data)
evaluator = gs.scorer_
print(evaluator(expected, predicted))
So from what I have seen, the problem is that the evaluator is in fact make_scorer(accuracy_score). I guess print(evaluator(expected, predicted)) would work if I added the estimator as the first parameter, but how do I get it properly from the pipeline?
Because when I do gs.best_estimator_ I get this:
Pipeline(steps=[('mapper', DataFrameMapper(default=False, df_out=False,
features=[('Sex', LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False))],
sparse=False)), ('DecisionTree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
max_features=1, max_... min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'))])

Yes, that's the desired behaviour of GridSearchCV. The pipeline object returned by gs.best_estimator_ is already fitted on the whole train_data with the best parameters found in the grid search.
You need to pass that pipeline object to the evaluator, but your current usage of the evaluator is wrong.
What the scorer_ from make_scorer does is take the test data, make predictions on it, and then calculate the score by comparing them with the actual data.
Hence, its signature is:
scorer(estimator, X_test, y_test)
But you are trying to use it as:
evaluator(expected, predicted)
It will not work because:
- it does not satisfy the signature of the scorer, and
- it needs the X data, not the predicted y data.
So, if you already have the actual and predicted values, you can simply use:
accuracy_score(expected, predicted)
If you want to use the scorer_ (your evaluator), then you should not supply it with the predicted values; instead, supply the test_data (from which the predictions are obtained):
evaluator(gs.best_estimator_, test_data, expected)
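Putting both options together, a minimal sketch (reusing the variable names from the question) looks like this:
from sklearn.metrics import accuracy_score

# Option 1: you already have the predictions, so score them directly
print(accuracy_score(expected, predicted))

# Option 2: let the scorer predict internally from the raw test data
evaluator = gs.scorer_
print(evaluator(gs.best_estimator_, test_data, expected))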

Related

XGBoost classifier returns negative prediction after defining 'objective':'binary:logistic'

Issue
I am trying to use the XGBoost classifier and define a custom metric that uses the f-beta score, but I am getting negative predicted values returned to the custom function, and after np.round I end up with three values [-1, 0, 1].
Question
How do I solve this issue? I want to know the f_beta_score during the training eval.
Custom performance metric function
def fbeta_func(y_predicted, y_true):
    # note: compares against the global y_test rather than the y_true argument
    f = fbeta_score(list(y_test), np.round(y_predicted), beta=0.5)
    return 'f_beta', f
And then I am calling this function in the fit function:
from xgboost import XGBClassifier
clf = XGBClassifier(**param)
clf.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric=fbeta_func,
        verbose=True)
Here are the parameters:
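Whatever the exact parameters, one plausible explanation for the negative values (an assumption, not something stated in the post) is that the custom eval function receives raw margin scores rather than probabilities. A minimal sketch that guards against this by squashing the scores through a sigmoid before rounding:
import numpy as np
from sklearn.metrics import fbeta_score

def fbeta_func(y_predicted, y_true):
    # assumption: y_predicted holds raw margin scores, so map them into [0, 1] first
    proba = 1.0 / (1.0 + np.exp(-y_predicted))
    # keeps the question's comparison against the global y_test
    f = fbeta_score(list(y_test), np.round(proba), beta=0.5)
    return 'f_beta', f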

create permutation variable importance after using train instead of fit

I tried to use this permutation code:
perm_importance = permutation_importance(model, x_test, y_test, random_state=7)
Error seen:
"TypeError: estimator should be an estimator implementing 'fit' method"
The question:
Is there a way to create permutation variable importance after using xgb.train instead of xgb.fit?
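A sketch of one possible workaround, assuming the scikit-learn wrapper is acceptable in place of xgb.train (XGBRegressor and the variable names below are illustrative, not from the original post):
import xgboost as xgb
from sklearn.inspection import permutation_importance

# assumption: params, x_train/y_train and x_test/y_test mirror the original setup
model = xgb.XGBRegressor(**params)   # sklearn-compatible wrapper, so it implements fit()
model.fit(x_train, y_train)

perm_importance = permutation_importance(model, x_test, y_test, random_state=7)
print(perm_importance.importances_mean)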

Using Fitted Model in make_scorer For gridsearchcv

I am creating a custom scorer for a gridsearchcv object. For the custom scorer I need probabilities from two different dataframes, but the model should only be trained on one of the dataframes. The other dataframe is needed to get probabilities, and those probabilities will be used in the scoring function.
I had considered concatenating the dataframes, but there is no ground truth to one of the dataframes. This would create an issue with passing the y_true.
I had also tried to pass the model to the custom score function, but I got a traceback that the model was not fit. Here is an example of what I am trying to do:
def fit(self, X_train, y_train, X_info):
    grid = self._create_grid_search()
    clf = GradientBoostingClassifier()
    score_func = make_scorer(self.make_custom_score, needs_proba=True, clf=clf, X_info=X_info)
    model = GridSearchCV(estimator=clf,
                         param_grid=grid,
                         scoring=score_func,
                         cv=3)

def make_custom_score(self, y_true, y_score, clf, X_info):
I found this question: SKLearn cross-validation: How to pass info on fold examples to my scorer function?
which seems like it might be a possibility. That approach would be to write a function of the form scorer(estimator, X, y), but I think it will still have the issue that the model is trained on all of the data. Is there any way to pass the estimator to the custom score function used by gridsearchcv?
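For reference, a minimal sketch of that scorer(estimator, X, y) form, with the second dataframe captured by a closure (make_scoring_func and the placeholder score below are illustrative, not from the original code). Note that GridSearchCV fits a clone of the estimator on each training fold and passes that fitted clone to the scorer together with the held-out fold:
from sklearn.metrics import roc_auc_score

def make_scoring_func(X_info):
    # X_info is the second dataframe, captured by the closure
    def scorer(estimator, X, y):
        # estimator arrives here already fitted on the current training fold;
        # X and y are the held-out validation fold
        proba_fold = estimator.predict_proba(X)[:, 1]
        proba_info = estimator.predict_proba(X_info)[:, 1]
        # combine proba_fold and proba_info however the problem requires;
        # as a placeholder, score only the fold probabilities:
        return roc_auc_score(y, proba_fold)
    return scorer

# usage: GridSearchCV(estimator=clf, param_grid=grid, scoring=make_scoring_func(X_info), cv=3)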

auc score in scikit-learn for non-binary classifiers

I want to calculate the roc_auc for different classifiers. Some are not binary classifiers. Here is a portion of the code I used:
if hasattr(clf, "decision_function"):
    y_score = clf.fit(X_train, y_train).decision_function(X_test)
else:
    y_score = clf.fit(X_train, y_train).predict_proba(X_test)
AUC = roc_auc_score(y_test, y_score)
However, I get an error for some classifiers (Nearest Neighbors, for example):
ValueError: bad input shape
Just a remark, I used: y_score = clf.fit(X_train, y_train).predict_proba(X_test), but I don't really know if it's correct to use it.
Okay, so first things first:
clf.fit(X_train, y_train)
That will fit your model to your training data: the first parameter is the features, the second is the target. Okay, nicely done.
After fitting, you can apply .predict or .predict_proba to another dataset to get an estimate/prediction of its results, or you can do both fit and predict at the same time, as you did below:
clf.fit(X_train, y_train).predict_proba(X_test)
Now those are your predictions, not your score.
Your score will be a function of the predictions and the true values (y_test).
You can use different score metrics depending on the kind of problem you have, such as accuracy, precision, recall, F1, etc. (read more at http://scikit-learn.org/stable/modules/model_evaluation.html).
Now, roc_auc_score is one of those metrics, but you have to watch what you feed into that function, otherwise it won't work. As explained on the roc_auc_score page (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score), the parameters should be:
y_true: True binary labels in binary label indicators.
y_score : Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
So, if you have non-binary labels or multilabels in y_true, the function won't work; it has to be binary.
y_score, on the other hand, can be either binary or probabilities (ranging over [0, 1]).
hope that helps!
Edit: if you have a multilabel problem, what you can do is tackle the classes one at a time; that way it becomes many binary problems/models. (Try building a model to predict whether a sample is class A or not and compute its ROC curve, then move on to the next class and build another model, and so on.)
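A minimal sketch of that one-class-at-a-time idea, assuming a multiclass target and a classifier exposing predict_proba (label_binarize and the loop below are additions for illustration):
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

classes = np.unique(y_train)
y_test_bin = label_binarize(y_test, classes=classes)        # one binary column per class
y_score = clf.fit(X_train, y_train).predict_proba(X_test)   # one probability column per class

# one ROC AUC per class: "class k vs. the rest"
for k, cls in enumerate(classes):
    print(cls, roc_auc_score(y_test_bin[:, k], y_score[:, k]))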

The R^2 score I get from GridSearchCV is very different from the one I get from cross_val_score, why? (sklearn, python)

I'm using GridSearchCV to pick a regressor. Once it's fitted, I pull out the chosen regressor with
predictor = GridSearchCV(Pipeline(...), param_grid={...},
                         cv=10, scoring='r2')
predictor.fit(X, y)
estimator = predictor.get_params()['estimator']
and then I run cross_val_score with
cross_val_score(estimator, X, y,
                cv=10, scoring='r2')
but the R^2 I get is consistently about 5 percentage points lower than predictor.best_score_. Why?
Use predictor.best_estimator_ as the estimator in cross_val_score; that is the one with the best parameters. The way you are choosing it, you are probably getting the initial estimator with its default parameters. You could check by putting that one in cross_val_score as well and comparing the results.
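In code, the fix is just (reusing the question's variables):
cross_val_score(predictor.best_estimator_, X, y,
                cv=10, scoring='r2')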
