I am doing target encoding for my column using a nested cross-validation approach (to avoid leakage), as mentioned here, here and here.
If I want to include my target encoding (via a custom transformer) in the sklearn pipeline, I need a different transform for the train set than for the test set. This is because, for the train folds, the encoding is calculated using a further k-fold split of the train data, whereas for the test fold, the encoding is the mean of the train data.
I know the sklearn pipeline will apply the same transformation to the train and test splits in the CV. Is there a way to apply separate transformations to the train and test splits using an sklearn pipeline and a custom transformer?
The category_encoders package implements some target encoders. It gets around the issue of different training and testing dataset behavior by implementing a fit_transform method that is not equivalent to just fit followed by transform: fit_transform performs the training-set transformation, while transform performs the test/production set transformation.
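For example, its LeaveOneOutEncoder behaves this way. A minimal sketch (the column name and toy data are made up):

import pandas as pd
import category_encoders as ce

X_train = pd.DataFrame({"city": ["a", "a", "b", "b", "b"]})
y_train = pd.Series([1, 0, 1, 1, 0])
X_test = pd.DataFrame({"city": ["a", "b"]})

enc = ce.LeaveOneOutEncoder(cols=["city"])

# Training-set behavior: each row is encoded with the target mean of the
# OTHER rows in its category, which is what reduces leakage.
X_train_enc = enc.fit_transform(X_train, y_train)

# Test-set behavior: plain per-category target means learned from the train set.
X_test_enc = enc.transform(X_test)

This also works inside a Pipeline, because Pipeline calls fit_transform on its intermediate steps during fit and plain transform during predict, so the train/test asymmetry carries through automatically.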
Because I need different transformations for train and test (for target encoding), I cannot use an sklearn pipeline. Given this, what are my options to efficiently (simpler code, same accuracy) apply the column transformations (to the train and test CV folds separately) and also do hyperparameter tuning in sklearn/Python?
I know I cannot use GridSearchCV for tuning, since I cannot use a pipeline. Should I iterate over the fold splits myself, do the transformations separately for the train and test folds, and repeat this for every combination of hyperparameters?
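A minimal sketch of that manual approach (the "city" column, the inner-fold count, and the logistic regression are illustrative assumptions; X is assumed to be a DataFrame sharing its index with the target Series y):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, ParameterGrid

def kfold_target_encode(X_tr, y_tr, X_te, col, n_inner=5):
    # Out-of-fold target means for the train fold, plain train means for test.
    enc_tr = pd.Series(index=X_tr.index, dtype=float)
    for in_tr, in_val in KFold(n_splits=n_inner).split(X_tr):
        means = y_tr.iloc[in_tr].groupby(X_tr[col].iloc[in_tr]).mean()
        enc_tr.iloc[in_val] = X_tr[col].iloc[in_val].map(means).to_numpy()
    enc_te = X_te[col].map(y_tr.groupby(X_tr[col]).mean())
    prior = y_tr.mean()  # fall back to the global mean for unseen categories
    return enc_tr.fillna(prior), enc_te.fillna(prior)

results = {}
for params in ParameterGrid({"C": [0.1, 1.0, 10.0]}):
    scores = []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        X_tr, X_te = X.iloc[tr].copy(), X.iloc[te].copy()
        y_tr, y_te = y.iloc[tr], y.iloc[te]
        # Encode the train fold out-of-fold, the test fold with train means.
        X_tr["city"], X_te["city"] = kfold_target_encode(X_tr, y_tr, X_te, "city")
        model = LogisticRegression(**params).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    results[params["C"]] = sum(scores) / len(scores)

best_C = max(results, key=results.get)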
I have always learned that standardization or normalization should be fit only on the training set, and then be used to transform the test set. So what I'd do is:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on the train set, then scale it
X_test = scaler.transform(X_test)        # reuse the train-set statistics
Now if I were to use this model on new data, I could just save 'scaler' and load it in any new script.
I'm having trouble, though, understanding how this works for k-fold CV. Is it best practice to re-fit and transform the scaler on every fold? I can understand how this works when building the model, but which scaler should I save if I want to use the model later on?
Further, I want to extend this to time-series data. I understand how k-fold splitting works for time series, but, again, how do I combine the scaling with CV in that case? Here I would suggest saving the very last scaler, since (with k=5) it is fit on 4/5 of the data, and on the most recent data at that. Would that be the correct approach?
Is it best practice to re-fit and transform the scaler on every fold?
Yes. You might want to read scikit-learn's doc on cross-validation:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction.
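In practice you get this for free by putting the scaler and the model in a pipeline and cross-validating the pipeline as a whole. A minimal sketch, assuming generic X and y:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(StandardScaler(), Ridge())
# The scaler is re-fit on each training fold and applied to the held-out fold.
scores = cross_val_score(pipe, X, y, cv=5)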
Which scaler should I save?
Save the scaler (and any other preprocessing, i.e. a pipeline) and the predictor trained on all of your training data, not just (k-1)/k of it from cross-validation or 70% from a single split.
If you're doing a regression model, it's that simple.
If your model training requires hyperparameter search using cross-validation (e.g., grid search for xgboost learning parameters), then you have already gathered information from across folds, so you need another test set to estimate true out-of-sample model performance. (Once you have made this estimation, you can retrain yet again on the combined train+test data. This final step is not always done for neural networks that are parameterized for a particular sample size.)
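A minimal sketch of that advice (the Ridge model and file name are placeholders): refit the whole pipeline on all of your training data, then persist that single object.

import joblib
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X, y)                     # all training data, not (k-1)/k of it
joblib.dump(pipe, "model.joblib")  # the scaler statistics travel with the model

# Later, in any new script:
pipe = joblib.load("model.joblib")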
I am using the exact example from scikit-learn that compares permutation_importance with tree feature_importances.
As you can see, a Pipeline is used:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RandomForestClassifier(random_state=42))
])
rf.fit(X_train, y_train)
permutation_importance:
Now, when you fit a Pipeline, it will "fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator" (quoting the docs).
Later in the example, they used the permutation_importance on the fitted model:
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=2)
Problem: What I don't understand is that the features in the result are still the original non-transformed features. Why is this the case? Is this working correctly? What is the purpose of the Pipeline then?
tree feature_importances_:
In the same example, when they use feature_importances_, the results do correspond to the transformed features:
tree_feature_importances = (
    rf.named_steps['classifier'].feature_importances_)
I can obviously transform my features and then use permutation_importance, but it seems that the steps presented in the examples are intentional, and there should be a reason why permutation_importance does not transform the features.
This is the expected behavior. The way permutation importance works is to shuffle the input data and apply it to the pipeline (or to the model directly, if that is what you want). In fact, if you want to understand how the initial input data affects the model, you should apply it to the pipeline.
If you are interested in the feature importance of each of the additional features generated by your preprocessing steps, then you should generate the preprocessed dataset with column names and apply that data (using permutation importance) directly to the model instead of the pipeline.
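Concretely, that would look something like this (a sketch reusing the fitted rf pipeline from the example above; if your preprocessor outputs a sparse matrix, densify it first):

from sklearn.inspection import permutation_importance

# Preprocess the test set yourself, then run permutation importance
# against the bare classifier instead of the whole pipeline.
X_test_transformed = rf.named_steps['preprocess'].transform(X_test)
result = permutation_importance(
    rf.named_steps['classifier'], X_test_transformed, y_test,
    n_repeats=10, random_state=42, n_jobs=2)
# result.importances_mean is now indexed by the transformed columns.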
In most cases people are not interested in learning the impact of the secondary features that the pipeline generates. That is why they use the pipeline here to encompass the preprocessing and modeling steps.
As shown in the code below, I am using StandardScaler's fit() function to fit the training dataset (i.e., to calculate the mean and variance of the features), and then calling transform() to scale the features. I found in the doc and here that I should use transform() only to transform the test dataset.
In my case, I am implementing an anomaly detection model where all the training data comes from one targeted user, while the test data is collected from multiple other, anomalous users. That is, out of "n" users, we train the model on one-class samples from the targeted user and test it on anomalous samples selected randomly from the other "n-1" users.
Training dataset size: (4816, 158) => (No of samples, No of features)
Test dataset size: (2380, 158)
The issue is that the model gives bad results when I use fit() then transform() on the training dataset and only transform() on the test dataset. However, the model gives good results only when I use fit_transform() on both the train and test datasets instead of only transform() on the test dataset.
My question:
Should I follow the documentation of StandardScaler, such that the test dataset MUST only be transformed with transform(), without fit()? Or does it depend on the dataset, so that I can use fit_transform() on both the training and testing datasets?
Is it ever valid to use fit_transform() on both the training and testing datasets?
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

np.set_printoptions(precision=3)

# After preparing and splitting the training and testing dataset, we got
X_train  # from only the targeted user
X_test   # from the other "n-1" anomaly users

# Feature selection using VarianceThreshold, fitted on the training set
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_train = sel.fit_transform(X_train)

# Normalization using StandardScaler, fitted on the training set
scaler = StandardScaler().fit(X_train)
normalized_X_train = scaler.transform(X_train)

# Apply the already-fitted selector and scaler to the test set
X_test = sel.transform(X_test)
normalized_X_test = scaler.transform(X_test)
Should I follow the documentation of StandardScaler, such that the test dataset MUST only be transformed with transform(), without fit()? Or does it depend on the dataset, so that I can use fit_transform() on both the training and testing datasets?
The moment you re-fit the scaler on the test set, your input features take on a different scaling than the one the model was trained with. The algorithm was fitted based on the scaling learned from your training data; if you re-fit the scaler on the test data, that scaling is overwritten and you are effectively faking the test inputs you feed to the algorithm.
So the answer is: the test set MUST only be transformed.
The way you do it above is correct. You should, in principle, never use fit on test data, only on the train data. The fact that you get "better" results using fit_transform on the test data is not indicative of any real performance gains. In other words, by using fit on the test data, you lose the ability to say something meaningful about the predictive power of your model on unseen data.
The main lesson here is that any gains in test performance are meaningless once the methodological constraints (i.e. train-test separation) are violated. You may obtain higher scores using fit_transform, but these don't mean anything anymore.
When you transform data, you also need to assign the result back to the column, for example:
data[["afs"]] = scaler.transform(data[["afs"]])
I am wondering if it is possible to use a pipeline in scikit-learn in the following way:
I want to train a model on dataset A and then make predictions with the same model on dataset B. That way, I could use GridSearch to find the best pipeline parameters, using the predictions on dataset B as the evaluation measure.
I know how to write a normal pipeline and use it with GridSearch, but I can't see how I can work with two datasets.
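One way to express this in scikit-learn (a sketch, under the assumption that X_a/y_a and X_b/y_b hold the two datasets; the SVC model is a placeholder) is to stack the datasets and hand GridSearchCV a single predefined (train, test) split, so every candidate is fit on A and scored on B:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.vstack([X_a, X_b])
y = np.concatenate([y_a, y_b])
# A single "fold": train indices cover dataset A, test indices cover dataset B.
split = [(np.arange(len(X_a)), np.arange(len(X_a), len(X)))]

pipe = make_pipeline(StandardScaler(), SVC())
# refit=False prevents a final refit on A and B combined.
search = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=split, refit=False)
search.fit(X, y)  # each candidate is fit on A and scored on B
print(search.best_params_)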