I have more of a best practice question.
I am scaling my data and I understand that I should fit_transform on my training set and transform on my test set because of potential data leakage.
Now, if I want to use 5-fold cross validation on the training set while still keeping a holdout test set, is it necessary to scale each fold independently?
My problem is that I want to use Feature Selection like this:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
scaler = MinMaxScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
efs = EFS(clf_tmp,
          min_features=min,
          max_features=max,
          cv=5,
          n_jobs=n_jobs)
efs = efs.fit(X_train, y_train)
Right now I am fitting the scaler on X_train and applying it to X_test. But when the whole scaled training set goes into the feature selector, there will be some data leakage between the selector's internal CV folds. Is this a problem for evaluation?
It's definitely best practice to include every preprocessing step within your cross-validation loop to avoid data leakage: within each CV fold, the scaler should be fitted on that fold's training portion and then applied to its held-out portion.
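If you want the scaling to happen inside each of the selector's CV folds rather than once up front, a minimal sketch (assuming clf_tmp, n_jobs and the min/max feature counts are defined as in your snippet) is to wrap the scaler and the classifier in a Pipeline and pass that Pipeline to EFS:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS

# The pipeline is cloned and refitted on the training folds of every CV
# split inside the selector, so the scaler never sees the validation fold.
pipe = Pipeline([('scaler', MinMaxScaler()),
                 ('clf', clf_tmp)])

efs = EFS(pipe,
          min_features=min_feats,   # placeholder names for your min/max counts
          max_features=max_feats,
          cv=5,
          n_jobs=n_jobs)
efs = efs.fit(X_train, y_train)     # pass the *unscaled* training data here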
I would like to use scikit-learn to predict a variable y from features X. I would like to train a classifier on a training dataset using cross validation and then apply this classifier to an unseen test dataset (as in https://www.nature.com/articles/s41586-022-04492-9).
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# Import dataset
X, y = datasets.load_iris(return_X_y=True)
# Make y binary by merging class 0 into class 1
y[y == 0] = 1
# Divide in train and test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=75, random_state=4, stratify=y)
# Define the model and cross-validate on the training data
model = SVC()
cv_model = cross_validate(model, x_train, y_train, cv=5)
Now I would like to use this cross validated model and to apply it to the unseen test set. I am unable to find how.
It would be something like
result = cv_model.score(x_test, y_test)
Except this does not work
You cannot do that; you need to fit the model before using it to predict new data. cross_validate is just a convenience function to get the scores; as clearly mentioned in the documentation, it returns just that, i.e. scores, and not a (fitted) model:
Evaluate metric(s) by cross-validation and also record fit/score times.
[...]
Returns: scores : dict of float arrays of shape (n_splits,)
Array of scores of the estimator for each run of the cross validation.
A dict of arrays containing the score/time arrays for each scorer is returned.
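One way to get from the CV scores to a model you can apply to the held-out test set, as a sketch: refit the estimator on the full training data after cross-validation (cross_validate(..., return_estimator=True) would also hand back the per-fold fitted estimators, but a single model refitted on all training data is usually what you want for the final test).
from sklearn import datasets
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
y[y == 0] = 1                      # make the target binary
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=75, random_state=4, stratify=y)

model = SVC()
cv_results = cross_validate(model, x_train, y_train, cv=5)  # CV scores only

model.fit(x_train, y_train)            # final fit on the whole training set
result = model.score(x_test, y_test)   # single evaluation on the unseen test set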
I'm using a custom tensorflow model for an imbalanced classification problem.
For this I need to split the data into a train and a test set, and split the train set into batches.
However, the batches need to be stratified because of the class imbalance. For now I'm doing it like this:
X_train, X_test, y_train, y_test = skmodel.train_test_split(
    Xscaled, y_new, test_size=0.2, stratify=y_new)

dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(
    X_train.shape[0]).batch(batch_size)
But I am not sure whether the batches in dataset are stratified or not.
If not, how can I make sure that they are stratified?
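For reference, shuffle().batch() only gives batches that are stratified in expectation, not exactly. A hedged sketch of one way to control the class ratio per batch (assuming X_train and y_train are NumPy arrays and batch_size is defined) is to build one dataset per class and interleave them with sample_from_datasets (tf.data.Dataset.sample_from_datasets in TF >= 2.7, tf.data.experimental.sample_from_datasets in older versions):
import tensorflow as tf

# One dataset per class, each shuffled and repeated indefinitely
pos = tf.data.Dataset.from_tensor_slices(
    (X_train[y_train == 1], y_train[y_train == 1])).shuffle(10_000).repeat()
neg = tf.data.Dataset.from_tensor_slices(
    (X_train[y_train == 0], y_train[y_train == 0])).shuffle(10_000).repeat()

# weights controls the approximate per-batch class ratio; adjust it to your
# actual imbalance, or use [0.5, 0.5] to oversample the minority class
dataset = tf.data.Dataset.sample_from_datasets(
    [neg, pos], weights=[0.9, 0.1]).batch(batch_size)

# Because the per-class datasets repeat forever, set steps_per_epoch (or use
# take()) when iterating over this dataset.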
I previously saw a post with code like this:
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv = cv)
My understanding is that: when we apply scaler, we should use 3 out of the 4 folds to calculate mean and standard deviation, then we apply the mean and standard deviation to all 4 folds.
In the above code, how can I know that sklearn is following the same strategy? And if sklearn is not following that strategy, i.e. it calculates the mean/std from all 4 folds, would that mean I should not use the above code?
I do like the above code because it saves tons of time.
In the example you gave, I would add an additional step using sklearn.model_selection.train_test_split:
folds = 4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1/folds), random_state=0, stratify=y)
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=(folds - 1))
scores = cross_val_score(pipeline, X_train, y_train, cv = cv)
I think best practice is to only use the training data set (i.e., X_train, y_train) when tuning the hyperparameters of your model, and the test data set (i.e., X_test, y_test) should be used as a final check, to make sure your model isn't biased towards the validation folds. At that point you would apply the same scaler that you fit on your training data set to your testing data set.
Yes, this is done properly; this is one of the reasons for using pipelines: all the preprocessing is fitted only on training folds.
Some references.
Section 6.1.1 of the User Guide:
Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
The note at the end of section 3.1.1 of the User Guide:
Data transformation with held out data
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:
...code sample...
A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
...
Finally, you can look into the source for cross_val_score. It calls cross_validate, which clones and fits the estimator (in this case, the entire pipeline) on each training split. GitHub link.
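To make the mechanics concrete, here is a rough, illustrative sketch of what cross_val_score effectively does with a Pipeline (not the actual sklearn source, and assuming X, y are NumPy arrays and pipeline is defined as in the question):
from sklearn.base import clone
from sklearn.model_selection import KFold

cv = KFold(n_splits=4)
scores = []
for train_idx, test_idx in cv.split(X):
    fold_pipe = clone(pipeline)                 # fresh, unfitted scaler + classifier
    fold_pipe.fit(X[train_idx], y[train_idx])   # scaler's mean/std come from the 3 training folds only
    scores.append(fold_pipe.score(X[test_idx], y[test_idx]))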
I want to use a Random Forest Classifier on imbalanced data, where X is a np.array representing the features and y is a np.array representing the labels (90% 0-values, 10% 1-values). As I was not sure how to do stratification within cross validation, and whether it makes a difference, I also cross-validated manually with StratifiedKFold. I would expect not identical but somewhat similar results. As this is not the case, I guess I am using one of the methods incorrectly, but I don't understand which one. Here is the code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import f1_score
rfc = RandomForestClassifier(n_estimators=200,
                             criterion="gini",
                             max_depth=None,
                             min_samples_leaf=1,
                             max_features="auto",
                             random_state=42,
                             class_weight="balanced")
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42, stratify=y)
I also tried the classifier without the class_weight argument. From here I proceed to compare both methods using the F1 score.
cv = cross_val_score(estimator=rfc,
                     X=X_train_val,
                     y=y_train_val,
                     cv=10,
                     scoring="f1")
print(cv)
The 10 f1-scores from cross validation are all around 65%.
Now the StratifiedKFold:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X_train_val, y_train_val):
    X_train, X_val = X_train_val[train_index], X_train_val[test_index]
    y_train, y_val = y_train_val[train_index], y_train_val[test_index]
    rfc.fit(X_train, y_train)
    rfc_predictions = rfc.predict(X_val)
    print("F1-Score: ", round(f1_score(y_val, rfc_predictions), 3))
The 10 f1-scores from StratifiedKFold are all around 90%. This is where I get confused, as I don't understand the large deviation between the two methods. If I just fit the classifier to the training data and apply it to the test data, I also get f1-scores of around 90%, which leads me to believe that my way of applying cross_val_score is not correct.
One possible reason for the difference is that cross_val_score uses StratifiedKFold with the default shuffle=False parameter, whereas in your manual cross-validation using StratifiedKFold you have passed shuffle=True. Therefore it could just be an artifact of the way your data is ordered that cross-validating without shuffling produces worse F1 scores.
Try passing shuffle=False when creating the skf instance and check whether the scores match those from cross_val_score. If you do want shuffling with cross_val_score, either shuffle the training data yourself beforehand or pass a shuffled StratifiedKFold as the cv argument.
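A short sketch of both options, assuming rfc, X_train_val and y_train_val are defined as in the question:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.utils import shuffle

# Option 1: turn shuffling off in the manual loop so it matches cross_val_score
skf = StratifiedKFold(n_splits=10, shuffle=False)

# Option 2: keep shuffling, but let cross_val_score do it by passing a
# shuffled StratifiedKFold as the cv argument
cv_scores = cross_val_score(
    rfc, X_train_val, y_train_val,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
    scoring="f1")

# Manual alternative: shuffle the rows once, then keep the plain cv=10
X_sh, y_sh = shuffle(X_train_val, y_train_val, random_state=42)
cv_scores_shuffled = cross_val_score(rfc, X_sh, y_sh, cv=10, scoring="f1")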
I am having trouble with the fit function when applied to MLPClassifier. I carefully read scikit-learn's documentation about it but was not able to determine how validation works.
Is it cross-validation, or is there a split between training and validation data?
Thanks in advance.
The fit function per se does not include cross-validation and also does not apply a train/test split.
Fortunately, you can do this yourself.
Train Test split:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)  # test set size is 0.33

clf = MLPClassifier()
clf.fit(X_train, y_train)
clf.predict(X_test)  # predict on the test set
K-Fold cross validation
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

kf = KFold(n_splits=2)
kf.get_n_splits(X)

clf = MLPClassifier()
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    clf.predict(X_test)  # predict on the test fold
Multiple functions are available for cross validation; you can read more about them here. The k-fold shown above is just one example.
EDIT:
Thanks for this answer, but basically how does the fit function work concretely? It just trains the network on the given data (i.e. the training set) until max_iter is reached, and that's it?
I am assuming you are using the default configuration of MLPClassifier. In that case the fit function optimizes the network's weights with the Adam optimizer, and training indeed runs until max_iter is reached (or stops earlier if the loss no longer improves by at least tol for n_iter_no_change consecutive iterations).
Moreover, in the K-Fold cross validation, is the model improving as the loop goes through, or does it just restart from scratch?
Actually, cross-validation is not used to improve the performance of your network; it is a methodology to test how well your algorithm generalizes to different data. For k-fold, k independent classifiers are trained and tested.
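As a minimal sketch of that last point, cross_val_score trains and scores k independent MLPClassifier instances, one per fold, with no weights carried over between folds (illustrative example on the iris data):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# Each of the 5 folds gets a freshly initialized classifier
scores = cross_val_score(MLPClassifier(max_iter=500), X, y, cv=5)
print(scores)   # one accuracy value per fold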