Using Fitted Model in make_scorer For gridsearchcv - python

I am creating a custom scorer for a GridSearchCV object. For the custom scorer I need probabilities from two different dataframes, but the model should only be trained on one of them. The other dataframe is needed only to get probabilities, which are then used in the scoring function.
I had considered concatenating the dataframes, but one of them has no ground truth, which would create an issue when passing y_true.
I also tried to pass the model to the custom score function, but I got a traceback saying the model was not fit. Here is an example of what I am trying to do:
def fit(self, X_train, y_train, X_info):
    grid = self._create_grid_search()
    clf = GradientBoostingClassifier()
    score_func = make_scorer(self.make_custom_score, needs_proba=True, clf=clf, X_info=X_info)
    model = GridSearchCV(estimator=clf,
                         param_grid=grid,
                         scoring=score_func,
                         cv=3)

def make_custom_score(self, y_true, y_score, clf, X_info):
I found this question: SKLearn cross-validation: How to pass info on fold examples to my scorer function?
which seems like it might be a possibility. That approach would be to write a function with the signature scorer(estimator, X, y), but I think it will still have the issue that the model is trained on all of the data. Is there any way to pass the estimator to the custom score function used by GridSearchCV?
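For reference, GridSearchCV's scoring parameter also accepts a plain callable with the signature scorer(estimator, X, y), and the estimator it receives has already been fitted on the training portion of the current fold, so it can be reused for extra predict_proba calls. Below is a minimal sketch of that idea, meant to slot into the class above; X_info is captured via a closure, and the (y, proba_fold, proba_info) signature for make_custom_score is illustrative rather than the asker's actual code:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

def fit(self, X_train, y_train, X_info):
    grid = self._create_grid_search()
    clf = GradientBoostingClassifier()

    def custom_scorer(estimator, X, y):
        # estimator is already fitted on the training part of the current fold
        proba_fold = estimator.predict_proba(X)[:, 1]        # probabilities for the held-out fold
        proba_info = estimator.predict_proba(X_info)[:, 1]   # probabilities for the extra dataframe
        # combine both sets of probabilities and y into a single float score
        return self.make_custom_score(y, proba_fold, proba_info)  # hypothetical signature

    model = GridSearchCV(estimator=clf,
                         param_grid=grid,
                         scoring=custom_scorer,
                         cv=3)
    model.fit(X_train, y_train)
    return model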

Related

XGBoost classifier returns negative prediction after defining 'objective':'binary:logistic'

Issue
I am trying to use the XGBoost classifier and define a custom metric that uses the f-beta score, but I am getting negative predicted values passed to the custom function, and after np.round I end up with three values: [-1, 0, 1].
Question
How do I solve this issue? I want to track the f-beta score during the training evaluation.
Function for the custom performance metric:
import numpy as np
from sklearn.metrics import fbeta_score

def fbeta_func(y_predicted, y_true):
    # note: this scores against the external y_test rather than the y_true passed in
    f = fbeta_score(list(y_test), np.round(y_predicted), beta=0.5)
    return 'f_beta', f
And then I am calling this function in the fit function:
from xgboost import XGBClassifier

clf = XGBClassifier(**param)
clf.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric=fbeta_func,
        verbose=True)
Here are the parameters:
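If the values handed to the custom metric turn out to be raw margin scores rather than probabilities (which would explain the negative numbers), one possible workaround is to map them through a sigmoid before rounding. This is only a sketch under that assumption, keeping the same callback signature as above and reading the labels from the evaluation data when available:
import numpy as np
from sklearn.metrics import fbeta_score

def fbeta_func(y_predicted, y_true):
    # assumption: y_predicted holds raw margins, so squash them into (0, 1) first
    proba = 1.0 / (1.0 + np.exp(-y_predicted))
    labels = y_true.get_label() if hasattr(y_true, "get_label") else y_true
    f = fbeta_score(labels, np.round(proba), beta=0.5)
    return 'f_beta', f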

Using different dataframes to fit estimator and as input of scoring function in cross_val_score and GridSearchCV?

When I use cross_val_score
cross_val_score(randomForestEstimator.fit(X,y), X, y, cv=3, scoring=CustomScoring)
or GridSearchCV
GridSearchCV(randomForestEstimator, parameters, cv=3, verbose=3, scoring=CustomScoring)
with a custom scoring function - having the signature scorer(estimator, X, y) - the same set of features, i.e. the same dataframe, is used both to fit the estimator AND as the scorer's input.
But the problem is that I do not want my estimator to "see" some columns that are needed later by my scorer to compute the score. So my question is: how can I use different sets of features (or dataframes) for fitting my estimator and as the scorer's input?
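One pattern that might work here (a sketch only; the EXTRA_COLS names and the some_score helper are hypothetical, and X is assumed to be a pandas DataFrame): keep every column in the dataframe passed to cross_val_score or GridSearchCV, wrap the estimator in a pipeline whose first step drops the scorer-only columns, and let the scorer callable pull those columns back out of X:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

EXTRA_COLS = ["info_a", "info_b"]      # columns the estimator should not see (hypothetical names)

def drop_extra(X):
    # remove the scorer-only columns before the estimator is fitted
    return X.drop(columns=EXTRA_COLS)

def CustomScoring(estimator, X, y):
    # the pipeline drops EXTRA_COLS internally, so it can be called on the full dataframe
    proba = estimator.predict_proba(X)[:, 1]
    extra = X[EXTRA_COLS]              # available to the scoring logic
    return some_score(y, proba, extra) # hypothetical helper returning a single float

pipe = make_pipeline(FunctionTransformer(drop_extra), randomForestEstimator)
scores = cross_val_score(pipe, X, y, cv=3, scoring=CustomScoring)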

Nested cross validation for predictions using groups

I'm not able to do something and I would like to know if it's a bug or the expected behaviour.
I was trying to run a nested cross-validation on a dataset in which every sample belongs to a patient. To avoid training and testing on the same patient, I've seen that you implement a "group" mechanism, and GroupKFold seems to be the right one in my case.
As my classifier takes different parameters, I use GridSearchCV to fix the hyperparameters of my model. In the same way, I suppose that the test / train splits have to come from different patients.
(For those interested in nested cross-validation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
I proceed that way:
pipe = Pipeline([('pca', PCA()),
                 ('clf', SVC()),
                 ])

# Find the best parameters for both the feature extraction and the classifier
grid_search = GridSearchCV(estimator=pipe, param_grid=some_param, cv=GroupKFold(n_splits=5), verbose=1)
grid_search.fit(X=features, y=labels, groups=groups)

# Nested CV with parameter optimization
predictions = cross_val_predict(grid_search, X=features, y=labels, cv=GroupKFold(n_splits=5), groups=groups)
And I get:
File "_split.py", line 489, in _iter_test_indices
    raise ValueError("The 'groups' parameter should not be None.")
ValueError: The 'groups' parameter should not be None.
In the code it appears that groups is not passed on by the _fit_and_predict() method to the estimator, so the groups that are needed cannot be used.
Can I have some clues on this?
Have a nice day,
Best regards
I had the same problem and I couldn't find another way than implementing it in a more hands-on fashion:
outer_cv = GroupKFold(n_splits=4).split(X_data, y_data, groups=groups)
nested_cv_scores = []
for train_ids, test_ids in outer_cv:
    inner_cv = GroupKFold(n_splits=4).split(X_data[train_ids, :], y_data.iloc[train_ids],
                                            groups=groups[train_ids])
    rf = RandomForestClassifier()
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100,
                                   cv=inner_cv, verbose=2, random_state=42,
                                   n_jobs=-1, scoring=my_squared_score)
    # Fit the random search model
    rf_random.fit(X_data[train_ids, :], y_data.iloc[train_ids])
    print(rf_random.best_params_)
    nested_cv_scores.append(rf_random.score(X_data[test_ids, :], y_data.iloc[test_ids]))

print("Nested cv score - meta learning: " + str(np.mean(nested_cv_scores)))
I hope this helps.
Best regards,
Felix

Sklearn: Custom scorer on pre-defined split

I'd like to make sure my custom score function behaves as expected by comparing it to a hand computation (so to speak) on a pre-defined split made with train_test_split.
However, I can't seem to pass that split in to cross_val_score. By default it uses 3-fold cross-validation and I can't mimic the splits it used. I think the answer lies in the cv parameter, but I can't figure out how to pass in an iterable in the correct form.
If you have a pre-defined split, you can simply train your model and apply the custom score function to the predictions on the test data to match the calculation. You do not need to use cross_val_score.
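A minimal sketch of that suggestion, where my_custom_score stands in for whatever custom score function is being checked:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# one fixed split, reproducible via random_state
X_train, X_valid, y_train, y_valid = train_test_split(train_X, train_y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42, n_estimators=10).fit(X_train, y_train)

# hand-computed value of the custom score on the held-out data
hand_computed = my_custom_score(y_valid, clf.predict(X_valid))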
I'm pretty sure there's a better and easier way, but this is what I came up with, as the cross_val_score documentation wasn't really clear.
You are right, it's about how you use the cv parameter, and I used this format: "An iterable yielding (train, test) splits".
The idea is to create an object that yields (train, test) split indices; I referred to: http://fa.bianp.net/blog/2015/holdout-cross-validation-generator/.
Assume that you already have a train/test split. I used the sklearn built-in split and returned the indices as well:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_valid, y_train, y_valid, indices_train, indices_test = train_test_split(
    train_X, train_y, np.arange(train_X.shape[0]), test_size=0.2, random_state=42)
Then, I create a class to yield the train, test split indices using the output from train_test_split:
class HoldOut:
    def __init__(self, indices_train, indices_test):
        self.ind_train = indices_train
        self.ind_test = indices_test

    def __iter__(self):
        yield self.ind_train, self.ind_test
Then you can simply pass the HoldOut object to the cv parameter:
cross_val_score(RandomForestClassifier(random_state=42, n_estimators=10), train_X, train_y,
                cv=HoldOut(indices_train, indices_test), verbose=1)
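Since the HoldOut object yields exactly one (train, test) split, cross_val_score returns a one-element array, and that single value should match the score computed by hand on the same split, provided the same estimator and scoring function are used.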

Using cross_val_predict against test data set

I'm confused about using cross_val_predict in a test data set.
I created a simple Random Forest model and used cross_val_predict to make predictions:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_predict, KFold
lr = RandomForestClassifier(random_state=1, class_weight="balanced", n_estimators=25, max_depth=6)
kf = KFold(train_df.shape[0], random_state=1)
predictions = cross_val_predict(lr,train_df[features_columns], train_df["target"], cv=kf)
predictions = pd.Series(predictions)
I'm confused on the next step here. How do I use what is learnt above to make predictions on the test data set?
I don't think cross_val_score or cross_val_predict uses fit before predicting. It does it on the fly. If you look at the documentation (section 3.1.1.1), you'll see that they never mention fit anywhere.
As #DmitryPolonskiy commented, the model has to be trained (with the fit method) before it can be used to predict.
# Train the model (a.k.a. `fit` training data to it).
lr.fit(train_df[features_columns], train_df["target"])
# Use the model to make predictions based on testing data.
y_pred = lr.predict(test_df[features_columns])
# Compare the predicted y values to actual y values.
accuracy = (y_pred == test_df["target"]).mean()
cross_val_predict is a method of cross validation, which lets you determine the accuracy of your model. Take a look at sklearn's cross-validation page.
I am not sure the question was answered. I had a similar thought. I want to compare the results (accuracy, for example) with the method that does not apply CV. The cross-validated accuracy is computed on X_train and y_train. The other method fits the model using X_train and y_train and is tested on X_test and y_test. So the comparison is not fair, since they are on different datasets.
What you can do is use the estimators returned by cross_validate:
from sklearn.model_selection import cross_validate

lr_fit = cross_validate(lr, train_df[features_columns], train_df["target"], cv=kf, return_estimator=True)
# return_estimator=True gives one fitted estimator per fold; use, for example, the first one
y_pred = lr_fit["estimator"][0].predict(test_df[features_columns])
accuracy = (y_pred == test_df["target"]).mean()
