K-fold cross validation to reduce overfitting : problem with the implementation - python

It is the first time I am trying to use cross-validation and I am facing an error.
Firstly my dataset looks like this :
So, in order to avoid/reduce the overfitting of my model I am trying to use a k-fold cross validation.
from sklearn.model_selection import KFold
X,y = creation_X_y() #Function which is cleaning my data
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
print("Train:", train_index, "Validation:",test_index)
X_train = X[train_index]
X_test = X[test_index]
y_train, y_test = y[train_index], y[test_index]
However, I am facing the following error and I am not finding how I could solve it. I am understanding that it looks for these values in the columns but it should probably look in the index no ? May I use X.loc[train_index] for example ?
Thanks in advance for your time and your help !

Your assumption is correct: .iloc[index] will work.
Here is the code:
from sklearn.model_selection import KFold
X,y = creation_X_y() #Function which is cleaning my data
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
print("Train:", train_index, "Validation:",test_index)
X_train = X.iloc[train_index]
X_test = X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]
Another way is to make creation_X_y() return a numpy.array.

Related

UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names warnings.warn(

screenshot here Help please?
Already tried adding .values to the X's, still resulted in an error. Any suggestions?
X = df[['Personal income','Personal saving']]
y = df['Gross domestic product']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
regr = linear_model.LinearRegression().fit(X_train, y_train)
sample = [10000, 1000]
sample_pred = regr.predict([sample])
As stated in this issue https://github.com/tylerjrichards/Getting-Started-with-Streamlit-for-Data-Science/issues/5 , converting X_train dataframe to np array (X_train.values) before fitting removes the warning.
it did for my testing : you can either try :
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X.to_numpy(), y, test_size=0.2, random_state=42)
In addition, this warning doesn't affect the calculation precision, you can ignore it and continue working until the update of the versions of libraries.

Getting several splits from each fold in StratifiedKFold

I want to perform stratified 10-fold cross validation using sklearn. The train and test indices can be obtained using
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10)
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
X_train = X[train_index]
y_train = y[train_index]
X_test = X[test_index]
y_test = y[test_index]
However, I would like to set not one, but two folds aside (one for tuning of hyperparameters). So, I want each iteration to consist of 8 folds for training, 1 for tuning and 1 for testing. Is this possible with sklearns StratifiedKFold? Or would I need to write a custom split method?
You could use StratifiedShuffleSplit to further split the test set in a stratified way too:
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
kf = StratifiedKFold(n_splits=10)
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
X_train = X[train_index]
y_train = y[train_index]
X_test = X[test_index]
y_test = y[test_index]
#stratified split on the test set
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
X_test_ix, X_tune_ix = next(sss.split(X_test, y_test))
X_test_ = X_test[X_test_ix]
y_test_ = y_test[X_test_ix]
X_tune = X_test[X_tune_ix]
y_tune = y_test[X_tune_ix]

How to apply oversampling when doing Leave-One-Group-Out cross validation?

I am working on an imbalanced data for classification and I tried to use Synthetic Minority Over-sampling Technique (SMOTE) previously to oversampling the training data. However, this time I think I also need to use a Leave One Group Out (LOGO) cross-validation because I want to leave one subject out on each CV.
I am not sure if I can explain it nicely, but, as my understanding, to do k-fold CV using SMOTE we can loop the SMOTE on every fold, as I saw in this code on another post. Below is an example of SMOTE implementation on the k-fold CV.
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score
kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
X_train = X[train_index]
y_train = y[train_index]
X_test = X[test_index]
y_test = y[test_index]
sm = SMOTE()
X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)
model = ... # classification model example
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f'For fold {fold}:')
print(f'Accuracy: {model.score(X_test, y_test)}')
print(f'f-score: {f1_score(y_test, y_pred)}')
Without SMOTE, I tried to do this to do LOGO CV. But by doing this, I will be using a super imbalanced dataset.
X = X
y = np.array(df.loc[:, df.columns == 'label'])
groups = df["cow_id"].values #because I want to leave cow data with same ID on each run
logo = LeaveOneGroupOut()
logo.get_n_splits(X_std, y, groups)
cv=logo.split(X_std, y, groups)
scores=[]
for train_index, test_index in cv:
print("Train Index: ", train_index, "\n")
print("Test Index: ", test_index)
X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
model.fit(X_train, y_train.ravel())
scores.append(model.score(X_test, y_test.ravel()))
How should I implement SMOTE inside a loop of leave-one-group-out CV? I am confused about how to define the group list for the synthetic training data.
The approach suggested here LOOCV makes more sense for leave one out cross-validation. Leave one group which you will use as test set and over-sample the other remaining set. Train your classifier on all the over-sampled data and test your classifier on test set.
In your case, following code would be the correct way to implement SMOTE inside a loop of LOGO CV.
for train_index, test_index in cv:
print("Train Index: ", train_index, "\n")
print("Test Index: ", test_index)
X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
sm = SMOTE()
X_train_oversampled, y_train_oversampled = sm.fit_sample(X_train, y_train)
model.fit(X_train_oversampled, y_train_oversampled.ravel())
scores.append(model.score(X_test, y_test.ravel()))

create training validation split using sklearn

I have a training set consisting of X and Y, The X is of shape (4000,32,1) and Y is of shape (4000,1).
I would like to create a training/validation set based on split. Here is what I have been trying to do
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
X_train, X_valid = X[train_index], X[valid_index]
y_train, y_valid = Y[train_index], Y[valid_index]
Running the program gives the following error message related to the above code segment
for train_index, valid_index in sss.split(X, Y):
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
I am not very clear about the above error message, what's the right way to create a training/validation split for the training set as above?
It's a little bit weird because I copy/pasted your code with sklearn's breast cancer dataset as follow
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, Y = cancer.data, cancer.target
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
X_train, X_valid = X[train_index], X[valid_index]
y_train, y_valid = Y[train_index], Y[valid_index]
Here X.shape = (569, 30) and Y.shape = (569,) and I had no error, for example y_valid.shape = 57 or one tenth of 569.
I suggest you to reshape X into (4000,32) (and so Y into (4000)), because Python may see it as a list of ONE big element (I am using python 2-7 by the way).
To answer your question, you can alternatively use train_test_split
from sklearn.model_selection import train_test_split
which according to the help
Split arrays or matrices into random train and test subsets Quick utility that wraps input validation and
``next(ShuffleSplit().split(X, y))`
Basically a wrapper of what you wanted to do. You can then specify the training and the test sizes, the random_state, if you want to stratify your data or to shuffle it etc.
It's easy to use for example:
X_train, X_valid, y_train, y_valid = train_test_split(X,Y, test_size = 0.1, random_state=0)

Stratified KFold on sparse(csr) feature matrix

I have a large sparse matrix (95000, 12000) containing the features of my model. I want to do a stratified K fold cross validation using Sklearn.cross_validation module in python. However, I haven't found a way of indexing a sparse matrix in python.
Is there anyway I can perform StratifiedKFold on my sparse feature matrix?
try this:
# First make sure sparse matrix is to_csr
X_sparse = x.tocsr()
y= output
X_train = {}
Y_train = {}
skf = StratifiedKFold(5, shuffle=True, random_state=12345)
i=0
for train_index, test_index in skf.split(X,y):
print("TRAIN:", train_index, "TEST:", test_index)
X_train[i], X_test[i] = X[train_index], X[test_index]
y_train[i], y_test[i] = y[train_index], y[test_index]
i +=1

Categories