I tried to use this permutation code:
perm_importance = permutation_importance(model, x_test, y_test, random_state=7)
Error seen:
"TypeError: estimator should be an estimator implementing 'fit' method"
The question:
Is there a way to create permutation variable importance after using xgb.train instead of xgb.fit?
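One common workaround (a sketch, not necessarily the only option, and assuming the model can simply be retrained): train through XGBoost's scikit-learn wrapper, XGBClassifier (or XGBRegressor), which implements fit and is therefore accepted by permutation_importance. A thin wrapper class exposing fit/predict around an existing Booster can also be passed instead, but retraining via the wrapper is usually simpler. The hyperparameters below are placeholders; mirror whatever was passed to xgb.train.

from xgboost import XGBClassifier
from sklearn.inspection import permutation_importance

# Hypothetical hyperparameters; use the same ones you gave to xgb.train.
model = XGBClassifier(n_estimators=100, max_depth=6)
model.fit(X_train, y_train)

# Works because XGBClassifier implements the scikit-learn estimator API.
perm_importance = permutation_importance(model, x_test, y_test, random_state=7)
print(perm_importance.importances_mean)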
Related issue
I am trying to use the XGBoost classifier and define a custom metric that uses the f-beta score, but negative predicted values are being passed to the custom function, and after np.round I end up with three values: [-1, 0, 1].
Question
How do I solve this, given that I want to track the f-beta score during training evaluation?
Custom performance metric function:
def fbeta_func(y_predicted, y_true):
    f = fbeta_score(list(y_test), np.round(y_predicted), beta=0.5)
    return 'f_beta', f
And then I am calling this function in the fit function:
from xgboost import XGBClassifier
clf = XGBClassifier(**param)
clf.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric=fbeta_func,
        verbose=True)
Here are the parameters:
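As for the negative predicted values: one plausible explanation (it depends on the XGBoost version and objective, so treat this as an assumption) is that the custom metric receives raw margin scores rather than probabilities, in which case rounding produces values outside {0, 1}. Applying the logistic sigmoid before rounding is a common workaround; a sketch keeping the structure of the function above, including its use of the global y_test:

import numpy as np
from sklearn.metrics import fbeta_score

def fbeta_func(y_predicted, y_true):
    # Assumption: y_predicted may be raw margins; map them to (0, 1) first.
    proba = 1.0 / (1.0 + np.exp(-np.asarray(y_predicted)))
    # Keeping the global y_test as the labels, as in the original function.
    f = fbeta_score(list(y_test), np.round(proba), beta=0.5)
    return 'f_beta', f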
I am creating a custom scorer for a GridSearchCV object. For the custom scorer I need probabilities from two different dataframes, but the model should only be trained on one of them; the other dataframe is only needed to get probabilities, which are then used in the scoring function.
I considered concatenating the dataframes, but one of them has no ground truth, which would create an issue with passing y_true.
I also tried to pass the model to the custom score function, but I got a traceback saying the model was not fitted. Here is an example of what I am trying to do:
def fit(self, X_train, y_train, X_info):
    grid = self._create_grid_search()
    clf = GradientBoostingClassifier()
    score_func = make_scorer(self.make_custom_score, needs_proba=True,
                             clf=clf, X_info=X_info)
    model = GridSearchCV(estimator=clf,
                         param_grid=grid,
                         scoring=score_func,
                         cv=3)

def make_custom_score(self, y_true, y_score, clf, X_info):
I found this question: SKLearn cross-validation: How to pass info on fold examples to my scorer function?
which seems like it might be a possibility. That approach would mean writing a function of the form scorer(estimator, X, y), but I think it would still have the issue that the model is trained on all of the data. Is there any way to pass the estimator to the custom score function used by GridSearchCV?
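For what it's worth, when GridSearchCV calls a scorer of the form scorer(estimator, X, y), the estimator it passes is fitted only on the current training fold, and X/y are the held-out fold, so the extra dataframe never influences training. A rough sketch of that approach, where make_info_scorer and the score formula are only placeholders, and X_info is assumed to be the extra dataframe with the same columns as the training data:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

def make_info_scorer(X_info):
    # Build a scorer(estimator, X, y) callable that also sees X_info.
    def scorer(estimator, X, y):
        # estimator arrives here already fitted on the current training fold,
        # so X_info never influences training.
        proba_fold = estimator.predict_proba(X)[:, 1]
        proba_info = estimator.predict_proba(X_info)[:, 1]
        # Placeholder combination of the two probability sets; replace with
        # the real custom score.
        return float(np.mean(proba_fold) - np.mean(proba_info))
    return scorer

clf = GradientBoostingClassifier()
grid = {'n_estimators': [50, 100]}  # illustrative grid
model = GridSearchCV(estimator=clf, param_grid=grid,
                     scoring=make_info_scorer(X_info), cv=3)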
I'd like to make sure my custom score function behaves as expected by comparing it to a hand computation (so to speak) on a pre-defined split made with train_test_split.
However, I can't seem to pass that split to cross_val_score. By default it uses 3-fold cross-validation, and I can't mimic the splits it uses. I think the answer lies in the cv parameter, but I can't figure out how to pass an iterable in the correct form.
If you have a pre-defined split, you can simply train your model and apply the custom score function to the predictions on the test data to match the calculation; you do not need cross_val_score.
I'm pretty sure there's a better and easier way, but this is what I came up with, as the cross_val_score documentation wasn't really clear.
You are right, it's about how you use the cv parameter, and I used this format: an iterable yielding (train, test) splits.
The idea is to create an object that yields (train, test) split indices; I referred to http://fa.bianp.net/blog/2015/holdout-cross-validation-generator/.
Assume you already have a train/test split. I used the scikit-learn built-in split and returned the indices as well:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_valid, y_train, y_valid, indices_train, indices_test = train_test_split(
    train_X, train_y, np.arange(train_X.shape[0]), test_size=0.2, random_state=42)
Then, I create a class to yield the train, test split indices using the output from train_test_split:
class HoldOut:
    def __init__(self, indices_train, indices_test):
        self.ind_train = indices_train
        self.ind_test = indices_test

    def __iter__(self):
        yield self.ind_train, self.ind_test
Then you can simply pass the HoldOut object to the cv parameter:
cross_val_score(RandomForestClassifier(random_state=42, n_estimators=10),
                train_X, train_y,
                cv=HoldOut(indices_train, indices_test), verbose=1)
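For what it's worth, the same single hold-out can also be expressed without a custom class, either as a one-element list of index pairs or with scikit-learn's PredefinedSplit. A sketch reusing the index arrays from above:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import PredefinedSplit, cross_val_score

# Option 1: cv accepts any iterable of (train_indices, test_indices) pairs.
cross_val_score(RandomForestClassifier(random_state=42, n_estimators=10),
                train_X, train_y,
                cv=[(indices_train, indices_test)], verbose=1)

# Option 2: PredefinedSplit; -1 marks samples that never appear in a test fold.
test_fold = np.full(train_X.shape[0], -1)
test_fold[indices_test] = 0
cross_val_score(RandomForestClassifier(random_state=42, n_estimators=10),
                train_X, train_y,
                cv=PredefinedSplit(test_fold), verbose=1)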
I am creating dynamic pipelines in scikit-learn and I set the scoring function as a parameter string on the GridSearchCV:
gs = GridSearchCV(pipeline, grid, scoring='accuracy')
However, when I try to get the scoring function that was used in order to evaluate the predictions, this is what I get:
File "app/experimenter/sklearn/sklearn-dask-tests.py", line 127, in run_pipeline
print(evaluator(expected, predicted))
TypeError: __call__() takes at least 4 arguments (3 given)
This is the code:
gs.fit(train_data, train_target)
predicted = gs.predict(test_data)
evaluator = gs.scorer_
print(evaluator(expected, predicted))
So from what I can see, the problem is that the evaluator is in fact make_scorer(accuracy_score). I guess print(evaluator(expected, predicted)) could work if I added the estimator as the first parameter, but how do I get it properly from the pipeline?
Because when I do gs.best_estimator_ I get this:
Pipeline(steps=[('mapper', DataFrameMapper(default=False, df_out=False,
features=[('Sex', LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False))],
sparse=False)), ('DecisionTree', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
max_features=1, max_... min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best'))])
Yes, that's the desired behaviour of GridSearchCV. The pipeline object returned by gs.best_estimator_ is already fitted on the whole train_data with the best parameters found in the grid search.
You need to pass that pipeline object to the evaluator, but your current usage of the evaluator is wrong.
What the scorer_ produced by make_scorer does is take the test data, make predictions on it, and then calculate the score by comparing them with the actual data.
Hence, its signature is:
scorer(estimator, X_test, y_test)
But you are trying to use it as:
evaluator(expected, predicted)
It will not work because:
It does not satisfy the signature of scorer
It needs the X data, not the predicted y data.
So, if you already have the actual and predicted values, you can simply use:
accuracy_score(expected, predicted)
If you want to use the scorer_ (your evaluator), then you should not give it predicted; instead, give it test_data (from which predicted is obtained):
evaluator(gs.best_estimator_, test_data, expected)
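As a concrete check (a short sketch reusing the names from the question, assuming test_data and expected are the held-out features and labels), both calls below should print the same accuracy once gs has been fitted:

from sklearn.metrics import accuracy_score

predicted = gs.predict(test_data)

# Scoring from the predictions directly:
print(accuracy_score(expected, predicted))

# Scoring through the stored scorer, which runs predict internally:
print(gs.scorer_(gs.best_estimator_, test_data, expected))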
I'm having issues with attribute errors when implementing a linear SVM with scikit-learn. I'm using a linear classifier with cross-validated recursive feature elimination (RFECV), and I can't access any of the attributes of the SVC. I'm not sure whether it has to do with the feature selection or the base model.
model = svm.SVC(kernel='linear')
selector=RFECV(model)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=pct_test)
selector=selector.fit(X_train, Y_train)
my_prediction = selector.predict(X_test)
f1.append(metrics.f1_score(Y_test, my_prediction))
kappa.append(metrics.cohen_kappa_score(Y_test, my_prediction))
precision.append(metrics.precision_score(Y_test, my_prediction))
recall.append(metrics.recall_score(Y_test, my_prediction))
print(model.intercept_)
print(model.support_vectors_)
print(model.coef_)
Metrics work fine, attributes all fail.
The error message is:
AttributeError: 'SVC' object has no attribute 'intercept_'
Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
Aside: I'm very new to OOP. If there's an underlying concept I'm missing, please elaborate or send over a link.
You are fitting (training) the RFECV object selector, but trying to access attributes of the SVC object model, which has not been trained. Hence it has no intercept_ attribute.
To access the intercept of SVC, you should use:
selector.estimator_.intercept_
But understand that this estimator is fitted only on the reduced dataset (after eliminating features as specified).
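For example (a sketch reusing the fitted selector from the question), the fitted inner estimator and the selected-feature mask can be inspected like this:

# The SVC that RFECV actually fitted (on the reduced feature set):
print(selector.estimator_.intercept_)
print(selector.estimator_.coef_)
print(selector.estimator_.support_vectors_)

# Boolean mask showing which original features survived the elimination,
# i.e. which columns the coefficients above refer to:
print(selector.support_)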
Explanation:
You see, RFECV internally uses RFE to get the important features in each iteration, and RFE clones the supplied estimator for this purpose. So when you initialize RFECV with model, it is a clone of model that gets trained.
Checking the source code:
Line 407 (inside the fit method of RFECV):
rfe = RFE(estimator=self.estimator,
          n_features_to_select=n_features_to_select,
          step=self.step, verbose=self.verbose)
Line 428 (for estimating the scores):
scores = parallel(func(rfe, self.estimator, X, y, train, test, scorer)
                  for train, test in cv.split(X, y))
And then Line 165 (inside the fit method of RFE):
estimator = clone(self.estimator)
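To see the effect of that clone call in isolation (a tiny sketch, not taken from RFE's source), note that a clone is a fresh, unfitted estimator with the same parameters, which is exactly why the original model never gains fitted attributes:

from sklearn import svm
from sklearn.base import clone

model = svm.SVC(kernel='linear')
copy = clone(model)                  # same parameters, no fitted state

print(hasattr(copy, 'intercept_'))   # False: clones start unfitted
print(hasattr(model, 'intercept_'))  # False: RFECV never fits the original object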