I previously saw a post with code like this:
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv = cv)
My understanding is that: when we apply scaler, we should use 3 out of the 4 folds to calculate mean and standard deviation, then we apply the mean and standard deviation to all 4 folds.
In the above code, how can I know that Sklearn is following the same strategy? On the other hand, if sklearn is not following the same strategy, which means sklearn would calculate the mean/std from all 4 folds. Would that mean I should not use the above codes?
I do like the above codes because it saves tons of time.
In the example you gave, I would add an additional step using sklearn.model_selection.train_test_split:
folds = 4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1/folds), random_state=0, stratify=y)
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=(folds - 1))
scores = cross_val_score(pipeline, X_train, y_train, cv = cv)
I think best practice is to only use the training data set (i.e., X_train, y_train) when tuning the hyperparameters of your model, and the test data set (i.e., X_test, y_test) should be used as a final check, to make sure your model isn't biased towards the validation folds. At that point you would apply the same scaler that you fit on your training data set to your testing data set.
Yes, this is done properly; this is one of the reasons for using pipelines: all the preprocessing is fitted only on training folds.
Some references.
Section 6.1.1 of the User Guide:
Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
The note at the end of section 3.1.1 of the User Guide:
Data transformation with held out data
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:
...code sample...
A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
...
Finally, you can look into the source for cross_val_score. It calls cross_validate, which clones and fits the estimator (in this case, the entire pipeline) on each training split. GitHub link.
Related
I use 20% of the data set as my testing set and use GridSearchCV to implement K-fold cross-validation to tune hyperparameters.
By using a pipeline, we can put the column transformer and the machine learning algorithm into GridSearchCV together. If I set up a 5 fold cross-validation for GridSearchCV, the function will use 5 different training and validation sets to train and validate each combination of hyperparameters. As I know, GridSearchCV uses the mean of 5 fold scores to choose the best model.
Then my question is, how does it transform the testing set?
I'm very confused about this because to avoid data leakage, we should use only the training set to fit the transformer, but in this case, we have 5 different training sets and I don't know which one the GridSearchCV function uses to fit and transform the validation and testing set.
My code is given below
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size = 0.2, random_state = i)
kf = KFold(n_splits = 4, shuffle = True, random_state = i)
pipe = Pipeline(steps = [("preprocessor", preprocessor),("model", ML_algo)])
grid = GridSearchCV(estimator = pipe, param_grid=param_grid,
scoring = make_scorer(mean_squared_error, greater_is_better=False,
squared=False),cv=kf, return_train_score = True, n_jobs=-1, verbose=False)
grid.fit(X_other, y_other)
test_score = mean_squared_error(y_test, grid.predict(X_test), squared=False)
short answer: there is no data leakage, test set is not used (and should not be used) for training the model in your code.
long answer: k fold cross-validation randomly divided your X_other& y_other(training set) into k splits, for each iteration of cross-validation, k-1 fold of data is used to train the model while this model is then evaluated with the 1 fold left using the metric you specified in scoring=. (refer to the below picture from sklearn for details:https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)
After finding the best set of hyperparameters by GridSearchCV(), all training set data is used in training a final model with the found hyperparameters, then, X_test,y_test (test set) can be transformed by this model. Note that in this process, X_test,y_test is not used and should not be used other than in the final prediction.
Is there any way that I can track my model's performance in terms of it's classified labels, during the training phase? Any classifier from sklearn would work as an example.
To be more specific, I want to get something like a list of Confusion Matrices here:
clf = LinearSVC(random_state=42).fit(X_train, y_train)
# ... here ...
y_pred = clf.predict(X_test)
My objective here is to see how well the model is learning (during training). This is similar to analyzing the training loss, that is a common practice in DNN's, and libraries such as pyTorch, Keras, and Tensorflow have such capability already implemented.
I thought a quick browsing of the web would give me what I want, but apparently not. I still believe this should be fairly simple though.
Some ML practitioners like to work with three folds of data: training, validation and testing sets. The latter should not be seen in any training at all, but the middle could. For example, cross-validation uses K different folds of validation sets "during the training phase" to get a less biased performance estimation when training with different parts of the data.
But you can do this on a single validation fold for the purpose of what you asked.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train2, X_valid, y_train2, y_valid = train_test_split(X_train, y_train, test_size=0.2)
# Fit a classifier with train data only
clf = LinearSVC(random_state=42).fit(X_train2, y_train2)
y_valid_pred = clf.predict(X_valid)
confusionm_valid = confusion_matrix(y_valid, y_valid_pred) # ... here ...
# Refit with all your training data
clf = LinearSVC(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_valid)
I was assigned a task that requires creating a Decision Tree Classifier and determining the accuracy rates using the training set and 10-fold cross-validation. I went over the documentation for cross_val_predict as I believe that this is the module I am going to need.
What I am having trouble with, is the splitting of the data set. As far as I am aware, in the usual case, the train_test_split() method is used to split the data set into 2 - the train and the test. From my understanding, for K-fold validation you need to further split the train set into K-number of parts.
My question is: do I need to split the data set at the beginning into train and test, or not?
It depends. My personal opinion is yes you have to split your dataset into training and test set, then you can do a cross-validation on your training set with K-folds. Why ? Because it is interesting to test after your training and fine-tuning your model on unseen example.
But some guys just do a cross-val. Here is the workflow I often use:
# Data Partition
X_train, X_valid, Y_train, Y_valid = model_selection.train_test_split(X, Y, test_size=0.2, random_state=21)
# Cross validation on multiple model to see which models gives the best results
print('Start cross val')
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
# Then visualize the score you just obtain using mean, std or plot
print('Mean CV-score : ' + str(cv_score.mean()))
# Then I tune the hyper parameters of the best (or top-n best) model using an other cross-val
for param in my_param:
model = model_with_param
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
print('Mean CV-score with param: ' + str(cv_score.mean()))
# Now I have best parameters for the model, I can train the final model
model = model_with_best_parameters
model.fit(X_train, y_train)
# And finally test your tuned model on the test set
y_pred = model.predict(X_test)
plot_or_print_metric(y_pred, y_test)
Short answer: NO
Long answer.
If you want to use K-fold validation when you do not usually split initially into train/test.
There are a lot of ways to evaluate a model. The simplest one is to use train/test splitting, fit the model on the train set and evaluate using the test.
If you adopt a cross-validation method, then you directly do the fitting/evaluation during each fold/iteration.
It's up to you what to choose but I would go with K-Folds or LOOCV.
K-Folds procedure is summarised in the figure (for K=5):
I'm confused about using cross_val_predict in a test data set.
I created a simple Random Forest model and used cross_val_predict to make predictions:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_predict, KFold
lr = RandomForestClassifier(random_state=1, class_weight="balanced", n_estimators=25, max_depth=6)
kf = KFold(train_df.shape[0], random_state=1)
predictions = cross_val_predict(lr,train_df[features_columns], train_df["target"], cv=kf)
predictions = pd.Series(predictions)
I'm confused on the next step here. How do I use what is learnt above to make predictions on the test data set?
I don't think cross_val_score or cross_val_predict uses fit before predicting. It does it on the fly. If you look at the documentation (section 3.1.1.1), you'll see that they never mention fit anywhere.
As #DmitryPolonskiy commented, the model has to be trained (with the fit method) before it can be used to predict.
# Train the model (a.k.a. `fit` training data to it).
lr.fit(train_df[features_columns], train_df["target"])
# Use the model to make predictions based on testing data.
y_pred = lr.predict(test_df[feature_columns])
# Compare the predicted y values to actual y values.
accuracy = (y_pred == test_df["target"]).mean()
cross_val_predict is a method of cross validation, which lets you determine the accuracy of your model. Take a look at sklearn's cross-validation page.
I am not sure the question was answered. I had a similar thought. I want compare the results (Accuracy for example) with the method that does not apply CV. The CV valiadte accuracy is on the X_train and y_train. The other method fit the model using X_trian and y_train, tested on the X_test and y_test. So the comparison is not fair since they are on different datasets.
What you can do is using the estimator returned by the cross_validate
lr_fit = cross_validate(lr, train_df[features_columns], train_df["target"], cv=kf, return_estimator=Ture)
y_pred = lr_fit.predict(test_df[feature_columns])
accuracy = (y_pred == test_df["target"]).mean()
I am asking the question here, even though I hesitated to post it on CrossValidated (or DataScience) StackExchange. I have a dataset of 60 labeled objects (to be used for training) and 150 unlabeled objects (for test). The aim of the problem is to predict the labels of the 150 objects (this used to be given as a homework problem). For each object, I computed 258 features. Considering each object as a sample, I have X_train : (60,258), y_train : (60,) (labels of the objects used for training) and X_test : (150,258). Since the solution of the homework problem was given, I also have the true labels of the 150 objects, in y_test : (150,).
In order to predict the labels of the 150 objects, I choose to use a LogisticRegression (the Scikit-learn implementation). The classifier is trained on (X_train, y_train), after the data has been normalized, and used to make predictions for the 150 objects. Those predictions are compared to y_test to assess the performance of the model. For reproducibility, I copy the code I have used.
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, crosss_val_predict
# Fit classifier
LogReg = LogisticRegression(C=1, class_weight='balanced')
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
# Performance on training data
CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')
print(CV_score)
# Performance on test data
probas = LogReg.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
print(AUC)
The matrices X_train,y_train,X_test and y_test are saved in a .mat file available at this link. My problem is the following :
Using this approach, I get a good performance on training data (CV_score = 0.8) but the performance on the test data is much worse : AUC = 0.54 for C=1 in LogReg and AUC = 0.40 for C=0.01. How can I get AUC<0.5 if a naive classifier should score AUC = 0.5 ? Is this due to the fact that I have a small number of samples for training ?
I have noticed that the performance on test data improves if I change the code for :
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
AUC = metrics.roc_auc_score(y_test, y_pred)
print(AUC)
Indeed, AUC=0.87 for C=1 and 0.9 for C=0.01. Why is the AUC score so much better using cross validation predictions ? Is it because cross validation allows to make predictions on subsets of the test data which do not contain objects/samples which decrease the AUC ?
Looks like you are encountering an overfitting problem, i.e. the classifier trained using the training data is overfitting to the training data. It has poor generalization ability. That is why the performance on the testing dataset isn't good.
cross_val_predict is actually training the classifier using part of your testing data and then predict on the rest. So the performance is much better.
Overall, there seems to be quite some difference between your training and testing datasets. So the classifier with the highest training accuracy doesn't work well on your testing set.
Another point not directly related with your question: since the number of your training samples is much smaller than the feature dimensions, it may be helpful to perform dimension reduction before feeding to classifier.
It looks like your training and test process are inconsistent. Although from your code you intend to standardize your data, you fail to do so during testing. What I mean:
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
Although you define a pipeline, you do not fit the pipeline (clf.fit) but only the Logistic Regression. This matters, because your cross-validated score is calculated with the pipeline (CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')) but during test instead of using the pipeline as expected to predict, you use only LogReg, hence the test data are not standardized.
The second point you raise is different. In y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
you get predictions by doing cross-validation on the test data, while ignoring the train data. Here, you do data standardization since you use clf and thus your score is high; this is evidence that the standardization step is important.
To summarize, standardizing the test data, I believe will improve your test score.
Firstly it makes no sense to have 258 features for 60 training items. Secondly CV=10 for 60 items means you split the data into 10 train/test sets. Each of these has 6 items only in the test set. So whatever results you obtain will be useless. You need more training data and less features.