How to apply oversampling when doing Leave-One-Group-Out cross validation? - python

I am working on an imbalanced dataset for classification, and I previously used the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the training data. This time, however, I think I also need Leave-One-Group-Out (LOGO) cross-validation, because I want to leave one subject out in each CV iteration.
As I understand it, to combine SMOTE with k-fold CV we apply SMOTE inside the loop, on the training part of every fold, as in this code I saw on another post. Below is an example of SMOTE applied within k-fold CV.
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    # Oversample the training fold only (fit_resample was called fit_sample in older imbalanced-learn releases)
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # classification model example
    # Fit on the oversampled data, evaluate on the untouched test fold
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
Without SMOTE, I tried the following for LOGO CV. But this way the model is trained on a highly imbalanced dataset.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = X
y = np.array(df.loc[:, df.columns == 'label'])
groups = df["cow_id"].values  # because I want to leave out all data from one cow ID on each run
logo = LeaveOneGroupOut()
logo.get_n_splits(X_std, y, groups)
cv = logo.split(X_std, y, groups)
scores = []
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    model.fit(X_train, y_train.ravel())
    scores.append(model.score(X_test, y_test.ravel()))
How should I implement SMOTE inside a loop of leave-one-group-out CV? I am confused about how to define the group list for the synthetic training data.

The approach suggested here (LOOCV) makes more sense for leave-one-out cross-validation: leave out one group, which you will use as the test set, and oversample the remaining data. Train your classifier on all of the oversampled data and test it on the held-out group.
The group list is only needed to generate the splits, so the synthetic training samples do not need group labels. In your case, the following code would be the correct way to implement SMOTE inside a LOGO CV loop.
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    # Oversample the training groups only (fit_resample was called fit_sample in older imbalanced-learn releases)
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model.fit(X_train_oversampled, y_train_oversampled.ravel())
    scores.append(model.score(X_test, y_test.ravel()))
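To make the point about the group list concrete, here is a minimal sketch under several assumptions: the data and group ids are toy values, `oversample_minority` is a hypothetical helper, and plain random oversampling via `sklearn.utils.resample` stands in for SMOTE so the sketch depends only on scikit-learn. With imbalanced-learn installed, the helper call would be replaced by `SMOTE().fit_resample(X_train, y_train)`. Note that `groups` is only passed to `split`; the resampled rows never get a group label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.utils import resample

rng = np.random.RandomState(0)

# Toy imbalanced data (hypothetical): 4 subjects, 30 rows each, 20% positives.
n_groups, rows_per_group = 4, 30
groups = np.repeat(np.arange(n_groups), rows_per_group)
y = np.tile(np.r_[np.ones(6, dtype=int), np.zeros(24, dtype=int)], n_groups)
X = rng.normal(size=(len(y), 3)) + y[:, None]  # shift positives so classes differ


def oversample_minority(X_train, y_train, random_state=0):
    """Duplicate random minority rows until both classes have equal counts.

    Hypothetical stand-in for SMOTE; assumes a binary problem.
    """
    classes, counts = np.unique(y_train, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    if n_extra == 0:
        return X_train, y_train
    X_min = X_train[y_train == minority]
    X_extra = resample(X_min, replace=True, n_samples=n_extra,
                       random_state=random_state)
    X_bal = np.vstack([X_train, X_extra])
    y_bal = np.concatenate([y_train, np.full(n_extra, minority)])
    return X_bal, y_bal


logo = LeaveOneGroupOut()
scores = []
train_balances = []
for train_index, test_index in logo.split(X, y, groups):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]

    # Oversample the training groups only; the held-out subject stays untouched.
    X_bal, y_bal = oversample_minority(X_train, y_train)
    train_balances.append(np.bincount(y_bal))

    model = LogisticRegression()
    model.fit(X_bal, y_bal)
    scores.append(model.score(X_test, y_test))

print(len(scores), np.mean(scores))
```

Each iteration trains on a balanced set built from the remaining subjects and scores on the original, still imbalanced, held-out subject.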

Related

K-fold cross validation to reduce overfitting : problem with the implementation

This is the first time I am trying to use cross-validation, and I am facing an error.
First, my dataset looks like this:
To avoid or reduce overfitting of my model, I am trying to use k-fold cross-validation.
from sklearn.model_selection import KFold

X, y = creation_X_y()  # Function which is cleaning my data
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train = X[train_index]
    X_test = X[test_index]
    y_train, y_test = y[train_index], y[test_index]
However, I am facing the following error and cannot work out how to solve it. I understand that it looks for these values in the columns, but it should probably look in the index, shouldn't it? May I use X.loc[train_index], for example?
Thanks in advance for your time and your help!
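Your assumption is correct: the lookup has to go through the rows rather than the columns. Because KFold returns integer positions, use .iloc[index] (positional) rather than .loc[index] (label-based).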
Here is the code:
from sklearn.model_selection import KFold

X, y = creation_X_y()  # Function which is cleaning my data
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train = X.iloc[train_index]
    X_test = X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
Another way is to make creation_X_y() return a numpy.array.
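As a small illustration of why positional indexing matters, here is a sketch with a hypothetical DataFrame whose index labels do not start at 0 (as can happen after filtering rows). The column name and index values are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical frame whose index labels are 100..109 instead of 0..9.
X = pd.DataFrame({"feature": np.arange(10.0)}, index=np.arange(100, 110))
y = pd.Series(np.arange(10) % 2, index=X.index)

kf = KFold(n_splits=5)
train_index, test_index = next(kf.split(X))

# KFold yields positions, so positional indexing works whatever the labels are.
X_test = X.iloc[test_index]
y_test = y.iloc[test_index]
print(test_index, list(X_test.index))

# Label-based lookup fails here because the labels 0 and 1 do not exist.
try:
    X.loc[test_index]
    loc_failed = False
except KeyError:
    loc_failed = True
print("X.loc raised KeyError:", loc_failed)
```

With a default 0..n-1 index, .loc and .iloc happen to return the same rows, which is why the difference often goes unnoticed until the index changes.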

Getting several splits from each fold in StratifiedKFold

I want to perform stratified 10-fold cross validation using sklearn. The train and test indices can be obtained using
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=10)
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
However, I would like to set aside not one but two folds (one of them for hyperparameter tuning). So I want each iteration to consist of 8 folds for training, 1 for tuning and 1 for testing. Is this possible with sklearn's StratifiedKFold, or would I need to write a custom split method?
You could use StratifiedShuffleSplit to further split the test set in a stratified way too:
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

kf = StratifiedKFold(n_splits=10)
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    # stratified split on the test set
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
    X_test_ix, X_tune_ix = next(sss.split(X_test, y_test))
    X_test_ = X_test[X_test_ix]
    y_test_ = y_test[X_test_ix]
    X_tune = X_test[X_tune_ix]
    y_tune = y_test[X_tune_ix]
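Note that the snippet above halves the held-out fold, so each iteration trains on 9 folds and gets half a fold each for tuning and testing. If a full fold is wanted for each, one possible variant (an assumption on my part, not part of the answer above) is to carve the tuning set out of the nine remaining folds instead. The data below is a toy example and the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# Toy data (hypothetical): 100 samples, 30% positives.
rng = np.random.RandomState(0)
y = np.array([1] * 30 + [0] * 70)
X = rng.normal(size=(100, 4))

n_folds = 10
fold_size = len(y) // n_folds

outer = StratifiedKFold(n_splits=n_folds)
# The tuning set gets the size of one fold, taken from the other nine.
inner = StratifiedShuffleSplit(n_splits=1, test_size=fold_size, random_state=0)

sizes = []
for rest_index, test_index in outer.split(X, y):
    # Positions returned by the inner split are relative to rest_index,
    # so map them back to positions in the full arrays.
    train_pos, tune_pos = next(inner.split(X[rest_index], y[rest_index]))
    train_index = rest_index[train_pos]
    tune_index = rest_index[tune_pos]
    sizes.append((len(train_index), len(tune_index), len(test_index)))

print(sizes[0])
```

The test folds still partition the data exactly once across the ten iterations; the tuning sets are drawn at random from the rest, so they may overlap between iterations.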

What should be passed as input parameter when using train-test-split function twice in python 3.6

Basically, I want to split my dataset into training, testing and validation sets, so I have used the train_test_split function twice. I have a dataset of around 10 million rows.
On the first split I divided the data into 70% training and 30% testing. Now, to get the validation set, I am a bit confused about whether to pass the test data or the training data from that split as the input to train_test_split. Please advise. TIA
X = features
y = target
# divide X, y into 70% training, 15% testing and 15% validation data
from sklearn.model_selection import train_test_split
# features and labels split 70-30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# the test data is then split further into test and validation sets, 15-15
x_test, x_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)
Don't make the test set too small; a 20% test set is fine. It would be better to split your training data further into training and validation sets. With that in mind, change your code as follows (the second call must split X_train and assign back to the training and validation variables, otherwise it overwrites the test set):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
Taking 25% of the remaining 80% gives 60% training, 20% validation and 20% test overall, which is a common way to split a dataset.
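A quick way to check the resulting proportions is to run the two calls on a small stand-in array. The sketch below uses made-up data of 1000 rows; only the sizes matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in (hypothetical) for the real features and target.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# First split: hold out 20% of all rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second split: 25% of the remaining 80% is 20% of the full data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

The three sets are disjoint because the second call only ever sees rows that the first call assigned to training.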

How to implement SMOTE in cross validation and GridSearchCV

I'm relatively new to Python. Can you help me improve my implementation of SMOTE to a proper pipeline? What I want is to apply the over and under sampling on the training set of every k-fold iteration so that the model is trained on a balanced data set and evaluated on the imbalanced left out piece. The problem is that when I do that I cannot use the familiar sklearn interface for evaluation and grid search.
Is it possible to make something similar to model_selection.RandomizedSearchCV? My take on this:
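import numpy as np
import pandas as pd
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("Imbalanced_data.csv")  # Load the data set
X = df.iloc[:, 0:64]
X = X.values
y = df.iloc[:, 64]
y = y.values
n_splits = 2
n_measures = 2  # Recall and AUC
kf = StratifiedKFold(n_splits=n_splits)  # Stratified so every fold keeps the class proportions
kf.get_n_splits(X)
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
s = (n_splits, n_measures)
scores = np.zeros(s)
for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # sampling_strategy replaces the old ratio argument; fit_resample replaces fit_sample
    sm = SMOTE(sampling_strategy='auto', k_neighbors=5)
    smote_enn = SMOTEENN(smote=sm)
    x_train_res, y_train_res = smote_enn.fit_resample(X_train, y_train)
    clf_rf.fit(x_train_res, y_train_res)
    y_pred = clf_rf.predict(X_test)  # predict takes only the features
    # one row per fold: column 0 holds recall, column 1 holds AUC
    scores[fold, 0] = recall_score(y_test, y_pred)
    scores[fold, 1] = roc_auc_score(y_test, y_pred)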
You need to look at the pipeline object. imbalanced-learn has a Pipeline which extends the scikit-learn Pipeline to support samplers, i.e. the fit_resample() method (fit_sample() and sample() in older releases), in addition to the fit_predict(), fit_transform() and predict() methods of scikit-learn.
Have a look at this example here:
https://imbalanced-learn.org/stable/auto_examples/pipeline/plot_pipeline_classification.html
For your code, you would want to do this:
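from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline, Pipeline
from sklearn.ensemble import RandomForestClassifier

sm = SMOTE(k_neighbors=5)
smote_enn = SMOTEENN(smote=sm)
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
pipeline = make_pipeline(smote_enn, clf_rf)
OR
pipeline = Pipeline([('smote_enn', smote_enn),
                     ('clf_rf', clf_rf)])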
Then you can pass this pipeline object to GridSearchCV, RandomizedSearchCV or other cross-validation tools in scikit-learn like any regular estimator. Resampling is then applied only when the pipeline is fitted, i.e. to the training part of each fold.
kf = StratifiedKFold(n_splits=n_splits)
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                   n_iter=1000,
                                   cv=kf)
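The snippet above leaves param_dist undefined. When searching over a pipeline, parameter names are written as `<step name>__<parameter>`. The sketch below shows this with scikit-learn's own `make_pipeline`, using `StandardScaler` purely as a placeholder for the sampler step so that it runs without imbalanced-learn; as far as I know, the imbalanced-learn `make_pipeline` names its steps the same way, but treat that as an assumption. The data and the parameter values are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy imbalanced data (hypothetical): 80 negatives, 20 positives.
rng = np.random.RandomState(0)
y = np.array([0] * 80 + [1] * 20)
X = rng.normal(size=(100, 5)) + y[:, None]

# StandardScaler is only a placeholder for the sampler step (e.g. SMOTEENN).
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=25, random_state=1),
)

# make_pipeline names each step after its lowercased class name, and
# parameters of a step are addressed as "<step name>__<parameter>".
param_dist = {
    "randomforestclassifier__max_depth": [2, 4, None],
    "randomforestclassifier__min_samples_leaf": [1, 3],
}

kf = StratifiedKFold(n_splits=3)
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=4,
    cv=kf,
    scoring="recall",
    random_state=0,
)
random_search.fit(X, y)
print(random_search.best_params_)
```

With the explicit `Pipeline([('smote_enn', ...), ('clf_rf', ...)])` form, the keys would use the names given there, for example `clf_rf__max_depth`.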
This looks like it would fit the bill http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html
You'll want to create your own transformer (http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) that, upon calling fit, returns a balanced data set (presumably the one obtained from StratifiedKFold), but upon calling predict, which is what will happen for the test data, calls into SMOTE.

scikit learn cross validation classification_report

I want to have metrics per class label and an aggregate confusion matrix from a cross validation in scikit learn.
I wrote a method that performs a cross-validation for scikit learn that sums the confusion matrices and also stores all the predicted labels. Then, it calls scikit learn methods to print out the metrics.
The code below should run with any recent scikit learn installation, you can test it out with any dataset.
Is the code below the correct way to gather an aggregate confusion matrix (cm) and a classification_report when doing StratifiedKFold cross-validation?
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
import numpy as np

def customCrossValidation(self, X, y, classifier, n_folds=10, shuffle=True, random_state=0):
    ''' Perform a cross validation and print out the metrics '''
    # sklearn.cross_validation has been removed; model_selection takes n_splits and splits via .split(X, y)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=shuffle, random_state=random_state)
    cm = None
    y_predicted_overall = None
    y_test_overall = None
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        classifier.fit(X_train, y_train)
        y_predicted = classifier.predict(X_test)
        # collect the y_predicted per fold
        if y_predicted_overall is None:
            y_predicted_overall = y_predicted
            y_test_overall = y_test
        else:
            y_predicted_overall = np.concatenate([y_predicted_overall, y_predicted])
            y_test_overall = np.concatenate([y_test_overall, y_test])
        cv_cm = metrics.confusion_matrix(y_test, y_predicted)
        # sum the cv per fold
        if cm is None:
            cm = cv_cm
        else:
            cm += cv_cm
    print(metrics.classification_report(y_test_overall, y_predicted_overall, digits=3))
    print(cm)
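For comparison, the same pooled predictions can be obtained with `cross_val_predict`, which returns one out-of-fold prediction per sample. This is a sketch on toy data with an arbitrary classifier, not a replacement the original post mentions. A confusion matrix computed on the pooled predictions matches the sum of the per-fold matrices as long as every class appears in every test fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Toy data (hypothetical): 60 negatives, 40 positives.
rng = np.random.RandomState(0)
y = np.array([0] * 60 + [1] * 40)
X = rng.normal(size=(100, 3)) + y[:, None]

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each sample is predicted exactly once, by a model that did not train on it.
y_pred = cross_val_predict(LogisticRegression(), X, y, cv=skf)

cm = confusion_matrix(y, y_pred)
print(classification_report(y, y_pred, digits=3))
print(cm)
```

Because the predictions come back in the original sample order, they can be compared against `y` directly without concatenating per-fold arrays.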
