When using a classifier like GaussianNB(), the resulting .predict_proba() values are sometimes poorly calibrated; that's why I'd like to wrap this classifier into sklearn's CalibratedClassifierCV.
I have now a binary classification problem with only a very few positive samples - so few that CalibratedClassifierCV fails because there are less samples than folds (the resulting error is then Requesting 5-fold cross-validation but provided less than 5 examples for at least one class.). Thus, I'd like to upsample the minority class before applying the classifier. I use imblearn's pipeline for this as it ensures that resampling takes place only during fit and not during inference.
However, I do not find a way to upsample my training data and combine it with CalibratedClassifierCV while ensuring that upsampling only takes place during fit and not during inference.
I tried the following reproducible example, but it seems that CalibratedClassifierCV wants to split the data first, prior to upsampling - and it fails.
Is there a way to correctly upsample data while using CalibratedClassifierCV?
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
X, y = make_classification(
n_samples = 100,
n_features = 10,
n_classes = 2,
weights = (0.95,), # 5% of samples are of class 1
random_state = 10,
shuffle = True
)
X_train, X_val, y_train, y_val = train_test_split(
X,
y,
test_size = 0.2,
random_state = 10,
shuffle = True,
stratify = y
)
pipeline = Pipeline([
("resampling", RandomOverSampler(
sampling_strategy=0.2,
random_state=10
)),
("model", GaussianNB())
])
m = CalibratedClassifierCV(
base_estimator=pipeline,
method="isotonic",
cv=5,
n_jobs=-1
)
m.fit(X_train, y_train) # results in error
I guess I understand my conceptual error: the cross-validation split has to happen BEFORE upsampling and not after (otherwise there would be information leakage from validation to training). But if it happens before, I cannot have more folds than samples of the positive class... Thus, oversampling does not save me from having not enough samples for CalibratedClassifierCV.
So I indeed have to reduce the number of folds, as #NMH1013 suggests.
Related
I used SHAP to explain my RF
RF_best_parameters = RandomForestRegressor(random_state=24, n_estimators=100)
RF_best_parameters.fit(X_train, y_train.values.ravel())
shap_explainer_model = shap.TreeExplainer(RF_best_parameters)
The TreeExplainer class has an attribute expected_value.
My first guess that this field is the mean of the predicted y, according to the X_train (I also read this here )
But it is not.
The output of the command:
shap_explainer_model.expected_value
is 0.2381.
The output of the command:
RF_best_parameters.predict(X_train).mean()
is 0.2389.
As we can see the values are not same.
So what is the meaning of the expected_value here?
This is due to a peculiarity of the method when used with the Random Forest algorithm; quoting from the response in the relevant Github thread shap explainer expected_value is different from model expected value:
It is because of how sklearn records the training samples in the tree models it builds. Random forests use a random subsample of the data to train each tree, and it is that random subsample that is used in sklearn to record the leaf sample weights in the model. Since TreeExplainer uses the recorded leaf sample weights to represent the training dataset, it will depend on the random sampling used during training. This will cause small variations like the ones you are seeing.
We can actually verify that this behavior is not present with other algorithms, say Gradient Boosting Trees:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
import numpy as np
import shap
shap.__version__
# 0.37.0
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])
But for RF, we get indeed a "small variation" as mentioned by the main SHAP developer in the thread above:
rf = RandomForestRegressor(random_state=0)
rf.fit(X_train, y_train)
rf_explainer = shap.TreeExplainer(rf)
rf_explainer.expected_value
# array([-11.59166808])
mean_pred_rf = np.mean(rf.predict(X_train))
mean_pred_rf
# -11.280125877556388
np.isclose(mean_pred_rf, rf_explainer.expected_value)
# array([False])
Just try :
shap_explainer_model = shap.TreeExplainer(RF_best_parameters, data=X_train, feature_perturbation="interventional", model_output="raw")
Then the shap_explainer_model.expected_value should give you the mean prediction of your model on train data.
Otherwise, TreeExplainer uses feature_perturbation="tree_path_dependent"; accoding to the documentation:
The “tree_path_dependent” approach is to just follow the trees and use the number of training examples that went down each leaf to represent the background distribution. This approach does not require a background dataset and so is used by default when no background dataset is provided.
I want to use a Random Forest Classifier on imbalanced data where X is a np.array representing the features and y is a np.array representing the labels (labels with 90% 0-values, and 10% 1-values). As I was not sure how to do stratification within Cross Validation and if it makes a difference I also manually cross validated with StratifiedKFold. I would expect not same but somewhat similar results. As this is not the case I guess that I wrongly use one method but I don´t understand which one. Here is the code
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.metrics import f1_score
rfc = RandomForestClassifier(n_estimators = 200,
criterion = "gini",
max_depth = None,
min_samples_leaf = 1,
max_features = "auto",
random_state = 42,
class_weight = "balanced")
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42, stratify=y)
I also tried the Classifier without the class_weight argument. From here I proceed to compare both methods with the f1-score
cv = cross_val_score(estimator=rfc,
X=X_train_val,
y=y_train_val,
cv=10,
scoring="f1")
print(cv)
The 10 f1-scores from cross validation are all around 65%.
Now the StratifiedKFold:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in skf.split(X_train_val, y_train_val):
X_train, X_val = X_train_val[train_index], X_train_val[test_index]
y_train, y_val = y_train_val[train_index], y_train_val[test_index]
rfc.fit(X_train, y_train)
rfc_predictions = rfc.predict(X_val)
print("F1-Score: ", round(f1_score(y_val, rfc_predictions),3))
The 10 f1-scores from StratifiedKFold gets me values around 90%. This is where I get confused as I don´t understand the large deviations between both methods. If I just fit the Classifier to the train data and apply it to the test data I get f1-scores of around 90% as well which lets me believe that my way of applying cross_val_score is not correct.
One possible reason for the difference is that cross_val_score uses StratifiedKFold with the default shuffle=False parameter, whereas in your manual cross-validation using StratifiedKFold you have passed shuffle=True. Therefore it could just be an artifact of the way your data is ordered that cross-validating without shuffling produces worse F1 scores.
Try passing shuffle=False when creating the skf instance to see if the scores match the cross_val_score, and then if you want to use shuffling when using cross_val_score just manually shuffle the training data before applying cross_val_score.
I'm currently using cross_val_score and KFold to assess the impact of using StandardScaler at different points within data pre-processing, specifically whether scaling the entire training dataset prior to performing cross validation introduces data leakage and what the effect of this is when compared to scaling the data from within a Pipeline (and therefore only applying it to the training folds).
my current process is as follows:
Experiment A
Import the boston housing dataset from sklearn.datasets and split into Data (X) and target (y)
create a Pipeline (sklearn.pipeline), that applies StandardScaler before applying linear regression
Specify the cross validation method as KFold with 5 folds
Perform cross validation (cross_val_score) using the above Pipeline and KFold method and observe the score
Experiment B
Use the same boston housing data as above
fit_transform StandardScaler on the entire dataset
Use cross_val_Score to perform cross validation on again 5 folds but this time input LinearRegression directly rather than a pipeline
Compare the scores here to Experiment A
The scores obtained are identical (to around 13 decimal places) which I question as surely Experiment B introduces Data Leakage during cross validation.
I've seen posts stating that it doesnt matter whether scaling is done on the entire training set before cross validation, if this is true I'm looking to understand why, if this isn't true I'd like to understand why the scores can still be so similar despite the data leakage?
See my test code below:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import LinearRegression
np.set_printoptions(15)
boston = datasets.load_boston()
X = boston["data"]
y = boston["target"]
scalar = StandardScaler()
clf = LinearRegression()
class StScaler(StandardScaler):
def fit_transform(self,X,y=None):
print('Length of Data on which scaler is fit on =', len(X))
output = super().fit(X,y)
# print('mean of scalar =',output.mean_)
output = super().transform(X)
return output
pipeline = Pipeline([('sc', StScaler()), ('estimator', clf)])
cv = KFold(n_splits=5, random_state=42)
cross_val_score(pipeline, X, y, cv = cv)
# Now fitting Scaler on whole train data
scaler_2 = StandardScaler()
clf_2 = LinearRegression()
X_ss = scaler_2.fit_transform(X)
cross_val_score(clf_2, X_ss, y, cv=cv)
Thanks!
I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data, and f1_score for my evaluation metric. The strange thing is that I'm noticing my model giving me different results in a pattern at each run.
data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column being the target value. Columns 1, 3, and 5 are categorical data.
The following code is what I did to preprocess and format my data:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, 6]
# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)
for i in range(len(categorical_data)):
X[categorical_data[i]] = labelenc.fit_transform(X[categorical_data[i]])
# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.fit_transform(X_val)
The next code is for the actual decision tree model training:
dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')
print("Score is = {}".format(score))
The output that I get (i.e. the score) varies, but in a pattern. For example, it would circulate among data within the range of 0.39 and 0.42.
On some iterations, I even get the UndefinedMetricWarning, that claims "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
I'm familiar with what the UndefinedMetricWarning means, after doing some searching on this community and Google. I guess the two questions I have may be organized to:
Why does my output vary for each iteration? Is there something in the preprocessing stage that happens which I'm not aware of?
I've also tried to use the F-score with other data splits, but I always get the warning. Is this unpreventable?
Thank you.
You are splitting the dataset into train and test which randomly divides sets for both train and test. Due to this, when you train your model with different training data everytime, and testing it with different test data, you will get a range of F score depending on how well the model is trained.
In order to replicate the result each time you run, use random_state parameter. It will maintain a random number state which will give you the same random number each time you run. This shows that the random numbers are generated in the same order. This can be any number.
#train test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
#Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
I have a dataset, which has previously been split into 3 sets: train, validation and test. These sets have to be used as given in order to compare the performance across different algorithms.
I would now like to optimize the parameters of my SVM using the validation set. However, I cannot find how to input the validation set explicitly into sklearn.grid_search.GridSearchCV(). Below is some code I've previously used for doing K-fold cross-validation on the training set. However, for this problem I need to use the validation set as given. How can I do that?
from sklearn import svm, cross_validation
from sklearn.grid_search import GridSearchCV
# (some code left out to simplify things)
skf = cross_validation.StratifiedKFold(y_train, n_folds=5, shuffle = True)
clf = GridSearchCV(svm.SVC(tol=0.005, cache_size=6000,
class_weight=penalty_weights),
param_grid=tuned_parameters,
n_jobs=2,
pre_dispatch="n_jobs",
cv=skf,
scoring=scorer)
clf.fit(X_train, y_train)
Use PredefinedSplit
ps = PredefinedSplit(test_fold=your_test_fold)
then set cv=ps in GridSearchCV
test_fold : “array-like, shape (n_samples,)
test_fold[i] gives the test set fold of sample i. A value of -1 indicates that the corresponding sample is not part of any test set folds, but will instead always be put into the training fold.
Also see here
when using a validation set, set the test_fold to 0 for all samples that are part of the validation set, and to -1 for all other samples.
Consider using the hypopt Python package (pip install hypopt) for which I am an author. It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
param_grid = [
{'C': [1, 10, 100], 'kernel': ['linear']},
{'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model = SVR(), param_grid = param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
EDIT: I (think I) received -1's on this response because I'm suggesting a package that I authored. This is unfortunate, given that the package was created specifically to solve this type of problem.
# Import Libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import PredefinedSplit
# Split Data to Train and Validation
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size = 0.8, stratify = y,random_state = 2020)
# Create a list where train data indices are -1 and validation data indices are 0
split_index = [-1 if x in X_train.index else 0 for x in X.index]
# Use the list to create PredefinedSplit
pds = PredefinedSplit(test_fold = split_index)
# Use PredefinedSplit in GridSearchCV
clf = GridSearchCV(estimator = estimator,
cv=pds,
param_grid=param_grid)
# Fit with all data
clf.fit(X, y)
To add to the #Vinubalan's answer, when the train-valid-test split is not done with Scikit-learn's train_test_split() function, i.e., the dataframes are already split manually beforehand and scaled/normalized so as to prevent leakage from training data, the numpy arrays can be concatenated.
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
from sklearn.model_selection import PredefinedSplit, GridSearchCV
split_index = [-1]*len(X_train) + [0]*len(X_val)
X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)
pds = PredefinedSplit(test_fold = split_index)
clf = GridSearchCV(estimator = estimator,
cv=pds,
param_grid=param_grid)
# Fit with all data
clf.fit(X, y)
I wanted to provide some reproducible code that creates a validation split using the last 20% of observations.
from sklearn import datasets
from sklearn.model_selection import PredefinedSplit, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
# load data
df_train = datasets.fetch_california_housing(as_frame=True).data
y = datasets.fetch_california_housing().target
param_grid = {"max_depth": [5, 6],
'learning_rate': [0.03, 0.06],
'subsample': [.5, .75]
}
model = GradientBoostingRegressor()
# Create a single validation split
val_prop = .2
n_val_rows = round(len(df_train) * val_prop)
val_starting_index = len(df_train) - n_val_rows
cv = PredefinedSplit([-1 if i < val_starting_index else 0 for i in df_train.index])
# Use PredefinedSplit in GridSearchCV
results = GridSearchCV(estimator = model,
cv=cv,
param_grid=param_grid,
verbose=True,
n_jobs=-1)
# Fit with all data
results.fit(df_train, y)
results.best_params_
The cv argument of the SearchCV i.e. Grid or Random can just be an iterable of indices too for train and validation split i.e. cv=((train_idcs, val_idcs),).
Note that the data on which the search classifier will be fit should be the train+val set and the indices specified will be used by the sklearn to separate them internally. Additionally, when working with dataframes, the indices specified should be accessible as ilocs, so reset indices (don't drop them if they will be required later).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
train_test_split,
RandomizedSearchCV,
)
data = load_iris(as_frame=True)["frame"]
# These indices will serves as explicit and predefined split
train_idcs, val_idcs = train_test_split(
data.index,
random_state=42,
stratify=data.target,
)
param_grid = dict(
n_estimators=[50,100,150,200],
max_samples=[0.85,0.9,0.95,1],
max_depth=[3,5,7,10],
max_features=["sqrt", "log2", 0.85, 0.9, 0.95, 1],
)
search_clf = RandomizedSearchCV(
estimator=RandomForestClassifier(),
param_distributions=param_grid,
n_iter=50,
cv=((train_idcs, val_idcs),), # explicit predefined split in terms of indices
random_state=42,
)
# X is the first 4 columns i.e. the sepal and petal widths and lengths
# and y is the 5th column i.e. target column
search_clf.fit(X=data.iloc[:,:4], y=data.target)
Also, be mindful if you want to refit on the whole data or only on the train data and thus retrain the classifier using the best fit parameters accordingly.