How do I test a final classifier on the test set? - python

After tuning the parameters for a classifier I built a model using the best ones. I decided to use StratifiedKFold for validation. First I split my dataset into train and test sets so that I can check my model on a separate test set. The thing is that, since I applied k-fold cross-validation, the accuracies I get are from the virtual validation sets. Now that I have those performances, I want to test the model on the test set we kept above, but I don't know how to proceed properly.
I know that an alternative could be to perform the k-fold cross-validation on the entire dataset (X, y), but I decided to keep a test set because I have to build more classifiers.
Here is my code for a classifier:
import numpy as np
from xgboost.sklearn import XGBClassifier
from sklearn.cross_validation import StratifiedKFold
from sklearn.pipeline import Pipeline

kfold = StratifiedKFold(y_train,
                        n_folds=10,
                        random_state=42)

pipe_xgb = Pipeline([('xgb', XGBClassifier(learning_rate=0.01,
                                           n_estimators=5000,
                                           max_depth=4,
                                           min_child_weight=6,
                                           gamma=0,
                                           subsample=0.8,
                                           colsample_bytree=0.8,
                                           objective='binary:logistic',
                                           nthread=4,
                                           scale_pos_weight=2.7,  # ratio of negative to positive classes
                                           seed=42))])

pipe_xgb.fit(X_train, y_train)  # fit on the full training set (refit again inside the loop below)

scores = []
for k, (train, val) in enumerate(kfold):
    pipe_xgb.fit(X_train[train], y_train[train])  # fit on the training folds
    score = pipe_xgb.score(X_train[val], y_train[val])  # score on the validation fold
    scores.append(score)
    print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1,
                                                    np.bincount(y_train[train]),
                                                    score))
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
I was thinking of making predictions on the test set and computing some metrics with the model I trained above, but I am not sure if that is the proper way of testing on the test set.
This is what I tried on the test set:
y_pred = pipe_xgb.predict(X_test)

from sklearn.metrics import classification_report
target_names = ['class 0', 'class 1']
print(classification_report(y_test, y_pred, target_names=target_names))

from sklearn.metrics import matthews_corrcoef
print('Matthew coefficient')
print()
print(matthews_corrcoef(y_test, y_pred))

from sklearn import metrics
print('Confusion matrix')
print(metrics.confusion_matrix(y_test, y_pred))

Usually when people set aside a validation/test set and then run cross-validation, it's because they want to find the best parameters on the training set. sklearn gives you nice tools for doing so, such as GridSearchCV (and RandomizedSearchCV). When you use GridSearchCV you can set refit=True, which results in the model being retrained with the best parameters on the whole training data; you can then use that refitted model on your set-aside test set.
Also, you don't have to use a Pipeline if there's only one item in it.
A GridSearchCV example taken from the scikit-learn documentation:
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
>>> svc = svm.SVC()
>>> clf = GridSearchCV(svc, parameters)
>>> clf.fit(iris.data, iris.target)
...
GridSearchCV(cv=None, error_score=...,
       estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
                     decision_function_shape='ovr', degree=..., gamma=...,
                     kernel='rbf', max_iter=-1, probability=False,
                     random_state=None, shrinking=True, tol=...,
                     verbose=False),
       fit_params=None, iid=..., n_jobs=1,
       param_grid=..., pre_dispatch=..., refit=..., return_train_score=...,
       scoring=..., verbose=...)
>>> sorted(clf.cv_results_.keys())
...
['mean_fit_time', 'mean_score_time', 'mean_test_score',...
 'mean_train_score', 'param_C', 'param_kernel', 'params',...
 'rank_test_score', 'split0_test_score',...
 'split0_train_score', 'split1_test_score', 'split1_train_score',...
 'split2_test_score', 'split2_train_score',...
 'std_fit_time', 'std_score_time', 'std_test_score', 'std_train_score'...]
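Applied to your case, here is a minimal sketch (my addition, assuming your existing X_train, y_train, X_test, y_test and the modern sklearn.model_selection API; the parameter grid is purely illustrative, not your tuned values):
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {'max_depth': [3, 4, 5], 'min_child_weight': [4, 6]}  # illustrative grid only
xgb = XGBClassifier(learning_rate=0.01, n_estimators=5000,
                    objective='binary:logistic', random_state=42)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
search = GridSearchCV(xgb, param_grid, cv=cv, refit=True)  # refit=True is the default
search.fit(X_train, y_train)                      # cross-validation happens inside the training set only
print(search.best_params_, search.best_score_)    # best params and their mean CV score
print('held-out test accuracy:', search.score(X_test, y_test))  # final check on the untouched test set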

Related

Why eval_set is not used in cross_validate with XGBClassifier?

I'm trying to plot a graph where I can see the evolution of the model's learning as the number of estimators increases. I can do this with eval_set in xgboost.XGBClassifier.
But when I use cross_validate(), it ignores the eval_set argument. Why? The code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.model_selection import cross_validate
scores = cross_validate(
    estimator=XGBClassifier(n_estimators=30,
                            max_depth=3,
                            min_child_weight=4,
                            random_state=42,
                            eval_set=[(X_train, y_train), (X_test, y_test)]),
    X=data.drop('Y', axis=1),
    y=data['Y'],
    # fit_params=params,
    cv=5,
    error_score='raise',
    return_train_score=True,
    return_estimator=True
)
The warning is: Parameters: { "eval_set" } are not used.
How can I do such that cross_validate takes the argument of eval_set?
eval_set fixes one train/validation split, but cross-validation determines the train and validation data in every iteration: in the i-th round, the i-th fold is used as validation data and the remaining folds are used as training data.
Therefore eval_set is meaningless for cross-validation; remove it.
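If the goal is still a per-fold learning curve, one possible approach (a sketch of my own, not part of the answer above; it assumes your data frame with a 'Y' column) is to run the folds manually so each fold's validation split can be passed as eval_set:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

X = data.drop('Y', axis=1).values
y = data['Y'].values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

histories = []
for train_idx, val_idx in skf.split(X, y):
    model = XGBClassifier(n_estimators=30, max_depth=3,
                          min_child_weight=4, random_state=42)
    model.fit(X[train_idx], y[train_idx],
              eval_set=[(X[train_idx], y[train_idx]), (X[val_idx], y[val_idx])],
              verbose=False)
    histories.append(model.evals_result())  # per-iteration eval metrics, one dict per fold, for plotting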

Scikitlearn GridSearchCV best model scores

I am trying to print the training and test score of the best model from my GridSearchCV object. My initial guess was to use cv_results_['best_train_score'] and cv_results_['best_test_score'], but after looking at the documentation I don't think there is a 'best_train_score' in cv_results_.
I also see that there is a best_estimator_ but I'm not sure if I can use this to print a test and a training score. Any help is greatly appreciated.
You can use the best_estimator_ of your fitted GridSearchCV to retrieve the best model and then use the score function of your estimator to calculate the train and test accuracy.
As follows:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2
)
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
svc = svm.SVC()
cv = GridSearchCV(svc, parameters)
cv.fit(iris.data, iris.target)

model = cv.best_estimator_
print(f"train score: {model.score(X_train, y_train)}")
print(f"test score: {model.score(X_test, y_test)}")
Output:
train score: 0.9916666666666667
test score: 1.0
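As a side note (my addition, not part of the answer above), the mean cross-validated score of the winning parameter combination is also available directly on the fitted search object:
print(f"mean CV score of best params: {cv.best_score_}")
# cv.cv_results_ holds per-split test scores (and train scores if return_train_score=True was set)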

Why are my CatBoost fit metrics different from the sklearn evaluation metrics?

I'm still not sure whether this should be a question for this forum or for Cross Validated, but I'll try this one, since it's more about the output of the code than about the technique per se. Here's the thing: I'm running a CatBoost classifier, just like this:
# import libraries
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# import data
train = pd.read_csv("train.csv")

# get features and label
X = train[["Pclass", "Sex", "SibSp", "Parch", "Fare"]]
y = train[["Survived"]]

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# model parameters
model_cb = CatBoostClassifier(
    cat_features=["Pclass", "Sex"],
    loss_function="Logloss",
    eval_metric="AUC",
    learning_rate=0.1,
    iterations=500,
    od_type="Iter",
    od_wait=200
)

# fit model
model_cb.fit(
    X_train,
    y_train,
    plot=True,
    eval_set=(X_test, y_test),
    verbose=50,
)

y_pred = model_cb.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(roc_auc_score(y_test, y_pred))
The dataframe I'm using is from the Titanic competition (link).
The problem is that the model_cb.fit step shows an AUC of 0.87, but the last line, sklearn's roc_auc_score, gives an AUC of 0.73, i.e., much lower. The AUC reported by CatBoost is, from what I understood, already computed on the test set.
Any ideas on what the problem is here and how I could fix it?
The ROC curve needs predicted probabilities or some other sort of confidence measure, not hard class predictions. Use
y_pred = model_cb.predict_proba(X_test)[:, 1]
See Scikit-learn : roc_auc_score and Why does roc_curve return only 3 values?.
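For completeness, a minimal sketch of the corrected last lines (assuming the asker's fitted model_cb and the roc_auc_score import):
y_pred = model_cb.predict(X_test)               # hard labels, fine for F1
y_proba = model_cb.predict_proba(X_test)[:, 1]  # probabilities of the positive class
print(f1_score(y_test, y_pred, average="macro"))
print(roc_auc_score(y_test, y_proba))           # AUC computed from probabilities, not hard labels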

Why does data cleaning decrease accuracy?

I am using the 20 newsgroups dataset from scikit-learn for reproducibility. When I train an SVM model and then perform data cleaning by removing headers, footers and quotes, the accuracy decreases. Isn't accuracy supposed to improve with data cleaning? What is the point of doing all that and then getting worse accuracy?
I have created this example with data cleaning to help you understand what I am referring to:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

categories = ['alt.atheism', 'comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True,
                                      random_state=2017, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True,
                                     random_state=2017, remove=('headers', 'footers', 'quotes'))
y_train = newsgroups_train.target
y_test = newsgroups_test.target

vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, max_df=0.5,
                             ngram_range=(1, 2), stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)

from sklearn.svm import SVC
from sklearn import metrics
clf = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
          decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
          probability=False, random_state=None, shrinking=True, tol=0.001,
          verbose=False)
clf = clf.fit(X_train, y_train)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

print('Train accuracy_score: ', metrics.accuracy_score(y_train, y_train_pred))
print('Test accuracy_score: ', metrics.accuracy_score(y_test, y_test_pred))
print("-" * 12)
print("Train Metrics: ", metrics.classification_report(y_train, y_train_pred))
print("-" * 12)
print("Test Metrics: ", metrics.classification_report(y_test, y_test_pred))
Results before data cleaning:
Train accuracy_score: 1.0
Test accuracy_score: 0.9731638418079096
Results after data cleaning:
Train accuracy_score: 0.9887218045112782
Test accuracy_score: 0.9209039548022598
It is not necessarily your data cleaning; I assume you ran the script twice?
The problem is this line of code:
clf = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
          decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
          probability=False, random_state=None, shrinking=True, tol=0.001,
          verbose=False)
Specifically random_state=None. You should fix the random state, e.g. random_state=42, otherwise you cannot reproduce the same result: if you ran this code again right now, you would again get a different result.
Edit:
The explanation is on the dataset site itself:
If you implement:
import numpy as np

def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))
You can now see many things that these features have overfit to:
Almost every group is distinguished by whether headers such as NNTP-Posting-Host: and Distribution: appear more or less often.
Another significant feature involves whether the sender is affiliated with a university, as indicated either by their headers or their signature.
The word “article” is a significant feature, based on how often people quote previous posts like this: “In article [article ID], [name] <[e-mail address]> wrote:”
Other features match the names and e-mail addresses of particular people who were posting at the time.
With such an abundance of clues that distinguish newsgroups, the classifiers barely have to identify topics from text at all, and they all perform at the same high level.
For this reason, the functions that load 20 Newsgroups data provide a parameter called remove, telling it what kinds of information to strip out of each file. remove should be a tuple containing any subset of ('headers', 'footers', 'quotes').
To summarize:
The remove option protects you from data leakage: you have information in your training data which you will not have at prediction time, so you have to remove it. Otherwise you will get a better result now, but that information will not be there for new data.

Cross validation and model selection

I am using sklearn for SVM training, and cross-validation to evaluate the estimator and avoid overfitting.
I split the data into two parts: train data and test data. Here is the code:
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()  # this line was missing from the original snippet
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_validation.cross_val_score(clf, X_train, y_train, cv=5)
print(scores)
Now I need to evaluate the estimator clf on X_test.
clf.score(X_test, y_test)
Here I get an error saying that the model has not been fitted using fit(), but doesn't the cross_val_score function fit the model? What is the problem?
cross_val_score is basically a convenience wrapper for the sklearn cross-validation iterators. You give it a classifier and your whole (training + validation) dataset and it automatically performs one or more rounds of cross-validation by splitting your data into random training/validation sets, fitting the training set, and computing the score on the validation set. See the documentation here for an example and more explanation.
The reason why clf.score(X_test, y_test) raises an exception is that cross_val_score performs the fitting on a copy of the estimator rather than the original (see the use of clone(estimator) in the cross_val_score source code). Because of this, clf remains unchanged, and therefore unfitted, outside of the function call, which is why calling clf.score fails with a not-fitted error.
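A minimal sketch of the fix (my addition, assuming the variables from the question): fit the estimator yourself after the cross-validation, then score it on the held-out test set.
scores = cross_validation.cross_val_score(clf, X_train, y_train, cv=5)  # CV estimate on the training data
clf.fit(X_train, y_train)           # fit the actual estimator; cross_val_score only fits clones
print(clf.score(X_test, y_test))    # now scoring on the test set works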
