I am using the exact example from the scikit-learn documentation that compares permutation_importance with the tree-based feature_importances_.
As you can see, a Pipeline is used:
rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RandomForestClassifier(random_state=42))
])
rf.fit(X_train, y_train)
permutation_importance:
Now, when you fit a Pipeline, it will "fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator".
Later in the example, they used the permutation_importance on the fitted model:
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=2)
Problem: What I don't understand is that the features in the result are still the original non-transformed features. Why is this the case? Is this working correctly? What is the purpose of the Pipeline then?
tree feature_importance:
In the same example, when they use feature_importances_, the results refer to the transformed features:
tree_feature_importances = (
    rf.named_steps['classifier'].feature_importances_)
I can obviously transform my features and then use permutation_importance, but it seems that the steps presented in the examples are intentional, and there should be a reason why permutation_importance does not transform the features.
This is the expected behavior. The way permutation importance works is to shuffle the input data and apply it to the pipeline (or to the model, if that is what you want). In fact, if you want to understand how the original input data affects the model, then you should apply it to the pipeline.
If you are interested in the feature importance of each of the additional features generated by your preprocessing steps, then you should generate the preprocessed dataset with column names and apply that data (using permutation importance) to the model directly instead of the pipeline.
In most cases people are not interested in learning the impact of the secondary features that the pipeline generates. That is why they use the pipeline here to encompass the preprocessing and modeling steps.
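For example, here is a minimal sketch of that second approach (not part of the original example), assuming the fitted 'preprocess' step implements get_feature_names_out() and returns a dense array:
from sklearn.inspection import permutation_importance

# Transform the test set with the already-fitted preprocessing step
preprocess = rf.named_steps['preprocess']
classifier = rf.named_steps['classifier']
X_test_transformed = preprocess.transform(X_test)
feature_names = preprocess.get_feature_names_out()

# Permute the transformed columns and score the bare classifier
result = permutation_importance(classifier, X_test_transformed, y_test,
                                n_repeats=10, random_state=42, n_jobs=2)

for name, importance in zip(feature_names, result.importances_mean):
    print(f"{name}: {importance:.4f}")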
Related
I am doing target encoding for my column, using a nested cross-validation approach (to avoid leakage) as mentioned here, here and here.
To include my target encoding (via a custom transformer) in the sklearn pipeline, I need a different transform for the train set and the test set. This is because, for the train folds, the encoding is calculated using a further k-fold split of the train data, whereas for the test fold, the encoding is the mean over the train data.
I know the sklearn pipeline will apply the same transformation to the train and test splits in CV. Is there a way to apply separate transformations for the train and test splits using a sklearn pipeline and a custom transformer?
The category_encoders package implements some target encoders. It gets around the issue of different training and testing dataset behavior by implementing a fit_transform method that is not equivalent to just fit followed by transform: fit_transform performs the training-set transformation, while transform performs the test/production set transformation.
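As a minimal sketch of how this plays out inside a pipeline (the encoder class, column name and estimator here are illustrative, not taken from the question), cross-validation calls fit_transform on each training fold and transform on each validation fold:
import category_encoders as ce
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    # fit_transform (training-set behaviour) is used when the pipeline is fitted,
    # transform (test/production behaviour) when it predicts or scores
    ('target_enc', ce.LeaveOneOutEncoder(cols=['Zipcodes'])),
    ('model', Ridge()),
])

# X, y are placeholders for your features and target
scores = cross_val_score(pipe, X, y, cv=5)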
I am a bit confused about how GridSearchCV actually works, so let's imagine an arbitrary regression problem where I want to predict the price of a house:
Let's say we use a simple preprocessor for target encoding on the training set:
The target encoder should call fit_transform() on X_train and transform() on X_test to prevent data leakage.
preprocessor = ColumnTransformer(
    transformers=[
        ('encoded_target_price', TargetEncoder(), ["Zipcodes"]),
    ],
    remainder='passthrough', n_jobs=-1)
We use a pipeline with scaling; again, the scaler should respect the split between training and test set.
pipe = Pipeline(steps=[("preprocessor", preprocessor),
                       ("scaler", RobustScaler()),
                       ('clf', LinearSVR()),
                       ])
Initialize GridSearchCV with some arbitrary parameters:
gscv = GridSearchCV(estimator=pipe,
                    param_grid=tuned_parameters,
                    cv=kfold,
                    n_jobs=-1)
Now we can call gscv.fit(X_train, y_train) and gscv.predict(X_test).
What I do not understand is how this works. For example, by calling fit(), the target encoder and the scaler are fitted to the training set, but the data is seemingly never transformed, so it is never changed. How can GridSearchCV calculate scores based on the untransformed training set?
I do not understand the predict method at all. How can a prediction be made without ever applying the transformations from the preprocessor to the test set X_test? When I apply big transformations like scaling and encoding to the training set, they HAVE to be applied to the test set as well.
But GridSearchCV internally only calls best_estimator_.predict(), so where does the .transform() on the test set happen?
The data transformation is implicitly applied when calling the pipeline's predict() function. It is clearly mentioned in the documentation:
Apply transforms to the data, and predict with the final estimator
So there is no need to explicitly transform the data. It is automatically done before the final estimator makes the prediction. There is also no data leakage since the pipeline will call the transform() method of each step when applying predict() to the data.
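For the pipeline from the question, pipe.predict(X_test) is roughly equivalent to this hand-written chain (a sketch using the step names defined above):
# Roughly what pipe.predict(X_test) does internally
Xt = pipe.named_steps['preprocessor'].transform(X_test)  # fitted ColumnTransformer / target encoder
Xt = pipe.named_steps['scaler'].transform(Xt)             # fitted RobustScaler
predictions = pipe.named_steps['clf'].predict(Xt)         # fitted LinearSVR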
I have always learned that standardization or normalization should be fit only on the training set, and then be used to transform the test set. So what I'd do is:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Now if I were to use this model on new data, I could just save 'scaler' and load it in any new script.
I'm having trouble, though, understanding how this works for K-fold CV. Is it best practice to re-fit and transform the scaler on every fold? I can see how this works when building the model, but what if I want to use this model later on? Which scaler should I save?
Further, I want to extend this to time-series data. I understand how k-fold CV works for time series, but again, how do I combine this with scaling? In this case I would suggest saving the very last scaler, as it would be fit on 4/5 of the data (in case of k=5), which is also the most recent data. Would that be the correct approach?
Is it best practice to re-fit and transform the scaler on every fold?
Yes. You might want to read scikit-learn's doc on cross-validation:
Just as it is important to test a predictor on data held-out from
training, preprocessing (such as standardization, feature selection,
etc.) and similar data transformations similarly should be learnt from
a training set and applied to held-out data for prediction.
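The simplest way to get this behaviour is to put the scaler and the model into a Pipeline and cross-validate the pipeline. A minimal sketch (X, y and the Ridge estimator are just placeholders):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(StandardScaler(), Ridge())

# In each fold, the scaler is re-fit on that fold's training portion only,
# then applied to the held-out portion before scoring.
scores = cross_val_score(pipe, X, y, cv=5)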
Which scaler should I save?
Save the scaler (and any other preprocessing, i.e. a pipeline) and the predictor trained on all of your training data, not just (k-1)/k of it from cross-validation or 70% from a single split.
If you're doing a regression model, it's that simple.
If your model training requires hyperparameter search using
cross-validation (e.g., grid search for xgboost learning parameters),
then you have already gathered information from across folds, so you
need another test set to estimate true out-of-sample model
performance. (Once you have made this estimation, you can retrain
yet again on combined train+test data. This final step is not always done for neural
networks that are parameterized for a particular sample size.)
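As a concrete sketch (the estimator, variable names and file name are placeholders), the object you persist is the whole pipeline refitted on all of your training data:
import joblib
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Refit scaler + model on ALL of the training data, not just (k-1)/k of it
final_model = make_pipeline(StandardScaler(), Ridge())
final_model.fit(X_train, y_train)

# One artifact holds both the fitted scaler and the fitted predictor
joblib.dump(final_model, 'model.joblib')

# Later, in another script:
loaded = joblib.load('model.joblib')
predictions = loaded.predict(X_new)  # scaling is applied automatically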
As shown in the code below, I am using StandardScaler.fit() to fit the training dataset (i.e., calculate the mean and variance of the features), and then call .transform() to scale the features. I found in the docs and here that I should use .transform() only to transform the test dataset. In my case, I am implementing an anomaly detection model where the whole training dataset comes from one targeted user, while the test dataset is collected from multiple other anomaly users. That is, we have "n" users; we train the model on one-class samples from the targeted user and test the trained model on new anomaly samples selected randomly from the other "n-1" anomaly users.
Training dataset size: (4816, 158) => (No of samples, No of features)
Test dataset size: (2380, 158)
The issue is that the model gives bad results when I use fit() then transform() for the training dataset and only transform() for the test dataset. The model only gives good results when I use fit_transform() on both the train and test datasets instead of only transform() on the test dataset.
My question:
Should I follow the StandardScaler documentation such that the test dataset MUST be transformed only using .transform(), without fit()? Or does it depend on the dataset, such that I can use fit_transform() for both the training and testing datasets?
Is it acceptable to use fit_transform() for both the training and testing datasets?
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# After preparing and splitting the training and testing dataset, we got
X_train  # from only the targeted user
X_test   # from the other "n-1" anomaly users

# Feature selection using VarianceThreshold on the training set
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_train = sel.fit_transform(X_train)

# Normalization using StandardScaler
scaler = StandardScaler().fit(X_train)
normalized_X_train = scaler.transform(X_train)
np.set_printoptions(precision=3)

# Feature selection using VarianceThreshold on the testing set
X_test = sel.transform(X_test)

# Normalization using StandardScaler
normalized_X_test = scaler.transform(X_test)
Should I follow the StandardScaler documentation such that the test dataset MUST be transformed only using .transform(), without fit()? Or does it depend on the dataset, such that I can use fit_transform() for both the training and testing datasets?
The moment you re-fit your scaler on the testing set, your input features end up on a different scale. The model was fitted on data scaled by the training-set scaler; if you re-fit the scaler on the test set, that scaling is overwritten and you are feeding the algorithm test data on a scale it was never trained on.
So the answer is: the test set MUST only be transformed.
The way you do it above is correct. You should, in principle, never use fit on test data, only on the train data. The fact that you get "better" results using fit_transform on the test data is not indicative of any real performance gains. In other words, by using fit on the test data, you lose the ability to say something meaningful about the predictive power of your model on unseen data.
The main lesson here is that any gains in test performance are meaningless once the methodological constraints (i.e. train-test separation) are violated. You may obtain higher scores using fit_transform, but these don't mean anything anymore.
Also note that transform() returns the transformed data rather than changing it in place, so you need to assign the result, for example:
normalized_X_test = scaler.transform(X_test)
I am working on an ML project (a binary classification problem) and was able to successfully run a few scikit-learn classifiers (RF, MLP, Extra Trees).
My question: I now have predict_proba results, which I have converted into a pandas DataFrame, and I would like to combine them with my original test data, which I will later export to CSV. I need to show this to my management as the final result of my ML project. The issue is that I adopted the following approach:
First standardized the whole data (using StandardScaler)
Then encoded the data using One-Hot encoding.
Then, using train_test_split, split the standardized and encoded data into two parts
How can I now get my original test data back (without standardization and one-hot encoding), with the column names intact?
Usually it's done a bit differently: we split the data set in its original format, before doing any preprocessing operations.
The preprocessing operations are then fitted on the training data set (X_train) and the result is passed to the estimator.
The same set of preprocessing operations (fitted on the training data) is then applied to the test data set (X_test) in order to score your model on the unseen subset of data.
In practice this is often done using the Pipeline class:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = \
    train_test_split(df.drop(columns=['label']), df['label'], test_size=0.25)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
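Because the split was done on the original data, X_test is still in its raw form, so you can combine it with the predicted probabilities and export everything. A sketch, assuming X_test is a pandas DataFrame and the final estimator supports predict_proba:
import pandas as pd

proba = pipeline.predict_proba(X_test)
proba_df = pd.DataFrame(proba, columns=pipeline.classes_, index=X_test.index)

# Original, untransformed test features side by side with the class probabilities
result = pd.concat([X_test, proba_df], axis=1)
result.to_csv('predictions.csv', index=False)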