I have a large sparse matrix of shape (95000, 12000) containing the features of my model. I want to do stratified K-fold cross-validation using scikit-learn's model_selection module in Python. However, I haven't found a way to index a sparse matrix with the fold indices.
Is there any way I can perform StratifiedKFold on my sparse feature matrix?
try this:
from sklearn.model_selection import StratifiedKFold

# First make sure the sparse matrix is in CSR format, which supports row indexing
X_sparse = x.tocsr()  # x is your sparse feature matrix
y = output            # your labels

X_train, X_test = {}, {}
y_train, y_test = {}, {}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=12345)
i = 0
for train_index, test_index in skf.split(X_sparse, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train[i], X_test[i] = X_sparse[train_index], X_sparse[test_index]
    y_train[i], y_test[i] = y[train_index], y[test_index]
    i += 1
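For reference, here is a minimal self-contained sketch (a small random sparse matrix stands in for your (95000, 12000) one, and the labels are dummies) showing that CSR row indexing accepts the index arrays produced by the split:
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.model_selection import StratifiedKFold

X_sparse = sparse_random(100, 20, density=0.1, format='csr', random_state=0)  # stand-in data
y = np.array([0] * 80 + [1] * 20)  # dummy binary labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=12345)
for train_index, test_index in skf.split(X_sparse, y):
    X_tr, X_te = X_sparse[train_index], X_sparse[test_index]
    print(X_tr.shape, X_te.shape)  # (80, 20) (20, 20)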
This is the first time I am trying to use cross-validation, and I am facing an error.
My dataset is a pandas DataFrame: [screenshot omitted]
To avoid/reduce overfitting of my model, I am trying to use k-fold cross-validation.
from sklearn.model_selection import KFold

X, y = creation_X_y()  # function which cleans my data

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train = X[train_index]
    X_test = X[test_index]
    y_train, y_test = y[train_index], y[test_index]
However, I am facing the following error and cannot work out how to solve it. I understand that it looks for these values in the columns, but it should probably look in the index, no? May I use X.loc[train_index], for example?
Thanks in advance for your time and your help!
Your assumption is correct, with one tweak: the indices returned by split are positional, so use .iloc[index] rather than .loc.
Here is the code:
from sklearn.model_selection import KFold

X, y = creation_X_y()  # function which cleans my data

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train = X.iloc[train_index]
    X_test = X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
Another way is to make creation_X_y() return numpy arrays, which support positional indexing directly.
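For instance (a sketch only, since the body of creation_X_y() was not posted; the DataFrame and the column name 'label' here are placeholders):
import pandas as pd

def creation_X_y():
    # placeholder for your actual cleaning logic
    df = pd.DataFrame({'feature': range(10), 'label': [0, 1] * 5})
    X = df.drop(columns='label').to_numpy()  # plain numpy arrays support X[train_index]
    y = df['label'].to_numpy()
    return X, y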
How can I access the train and test data for each fold in cross-validation? I would like to save these in .csv files. I tried using the split function, which generates the indices, but it returns a generator object rather than the indices themselves.
from sklearn.model_selection import StratifiedKFold
import numpy as np

X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
skf = StratifiedKFold(n_splits=3)
x = skf.split(X, y)  # groups are not needed for StratifiedKFold
x
Output:
<generator object _BaseKFold.split at 0x7ff195979580>
StratifiedKFold's split method returns a generator, therefore you need to iterate over it to get the indices, as follows:
skf = StratifiedKFold(n_splits=3)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
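To also save each fold to .csv files (your original goal), you can write the arrays out inside the loop. A minimal sketch with hypothetical file names, assuming X and y are the numpy arrays defined above:
import numpy as np

for fold, (train_index, test_index) in enumerate(skf.split(X, y), 1):
    np.savetxt(f'fold_{fold}_train_idx.csv', train_index, fmt='%d', delimiter=',')  # the indices
    np.savetxt(f'fold_{fold}_test_idx.csv', test_index, fmt='%d', delimiter=',')
    np.savetxt(f'fold_{fold}_X_train.csv', X[train_index], delimiter=',')  # or the data itself
    np.savetxt(f'fold_{fold}_X_test.csv', X[test_index], delimiter=',')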
I want to perform stratified 10-fold cross validation using sklearn. The train and test indices can be obtained using
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=10)
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
However, I would like to set aside not one, but two folds (one of them for tuning hyperparameters). So I want each iteration to consist of 8 folds for training, 1 for tuning, and 1 for testing. Is this possible with sklearn's StratifiedKFold, or would I need to write a custom split method?
You could use StratifiedShuffleSplit to further split the test set in a stratified way too:
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

kf = StratifiedKFold(n_splits=10)
for fold, (train_index, test_index) in enumerate(kf.split(X, y), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

    # stratified split on the held-out fold: half test, half tuning
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
    X_test_ix, X_tune_ix = next(sss.split(X_test, y_test))
    X_test_ = X_test[X_test_ix]
    y_test_ = y_test[X_test_ix]
    X_tune = X_test[X_tune_ix]
    y_tune = y_test[X_tune_ix]
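Since n_splits=1, sss.split yields exactly one pair of index arrays, which is why a single next() call is enough; with test_size=0.5 the held-out fold is divided evenly into a test half and a tuning half, each keeping the class proportions.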
I am working on imbalanced data for classification, and I previously used the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the training data. However, this time I think I also need to use Leave One Group Out (LOGO) cross-validation, because I want to leave one subject out on each CV iteration.
I am not sure I can explain it nicely but, as I understand it, to do k-fold CV with SMOTE we can apply SMOTE on every fold, as I saw in this code on another post. Below is an example of a SMOTE implementation on k-fold CV.
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    sm = SMOTE()
    # fit_sample was renamed to fit_resample in newer imblearn versions
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # classification model example
    model.fit(X_train_oversampled, y_train_oversampled)  # train on the oversampled fold
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
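Note that SMOTE is fitted on the training fold only, after the split; the test fold stays untouched, so no synthetic samples leak into the evaluation.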
Without SMOTE, I tried the following to do LOGO CV. But this way I am training on a severely imbalanced dataset.
from sklearn.model_selection import LeaveOneGroupOut
import numpy as np

X = X_std  # standardized features
y = np.array(df.loc[:, df.columns == 'label'])
groups = df["cow_id"].values  # because I want to leave out all data with the same cow ID on each run

logo = LeaveOneGroupOut()
logo.get_n_splits(X, y, groups)
cv = logo.split(X, y, groups)

scores = []
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    model.fit(X_train, y_train.ravel())
    scores.append(model.score(X_test, y_test.ravel()))
How should I implement SMOTE inside a loop of leave-one-group-out CV? I am confused about how to define the group list for the synthetic training data.
The approach suggested here for LOOCV makes sense for leave-one-group-out as well: leave out the one group you will use as the test set and oversample the remaining set. Train your classifier on all the oversampled data and test it on the held-out set.
In your case, the following code would be the correct way to implement SMOTE inside the LOGO CV loop.
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    sm = SMOTE()
    # fit_resample in newer imblearn versions (formerly fit_sample)
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model.fit(X_train_oversampled, y_train_oversampled.ravel())
    scores.append(model.score(X_test, y_test.ravel()))
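Regarding the group list for the synthetic data: none is needed. LOGO uses the groups only to produce the split, and SMOTE runs after the split on the training fold alone, so the synthetic samples never have to carry a group label.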
As I have a small dataset, I'm using LOOCV (leave-one-out cross-validation) in sklearn.
When I ran my classifier I received the following error:
"Number of labels=41 does not match number of samples=42".
I generated the test and training sets using the following code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import LeaveOneOut

otu_trans = test_train.transpose()  # transpose otu table
# merge phenotype column from metadata file with transposed otu table
merged = pd.concat([otu_trans, metadata[status]], axis=1, join='inner')
X = merged.drop([status], axis=1)  # drop status from X
y = merged[status]

# convert T and TF labels to 0 and 1 respectively
encoder = LabelEncoder()
y = pd.Series(encoder.fit_transform(y), index=y.index, name=y.name)

loocv = LeaveOneOut()
loocv.get_n_splits(X)
for train_index, test_index in loocv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
[input data screenshot omitted]
When I check the shape of X_train and X_test, it is (42, 41) rather than (41, 257) as I believe it should be, so it appears the data is being partitioned along the wrong axis.
Can anyone explain to me why this is happening?
Thank you
First of all, the initial matrix X will not be affected at all.
It is only used to produce the indices and split the data.
The shape of the initial X will always stay the same.
Now, here is a simple example using LOOCV splitting:
import numpy as np
from sklearn.model_selection import LeaveOneOut

# I produce fake data with the same dimensions as yours
X = np.random.rand(41, 257)  # fake data
y = np.random.rand(41)       # fake labels

# Now check that the shapes are correct:
X.shape
y.shape
This will give you:
(41, 257)
(41,)
Now the splitting:
loocv = LeaveOneOut()
loocv.get_n_splits(X)

for train_index, test_index in loocv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # classifier.fit(X_train, y_train)
    # classifier.predict(X_test)

X_train.shape
X_test.shape
This prints:
(40, 257)
(1, 257)
As you can see, X_train contains 40 samples and X_test contains only 1 sample. This is correct, since we use LOOCV splitting.
Our fake X matrix had 41 samples in total, so we use 40 for training and 1 for testing.
This loop will produce many X_train and X_test matrices. To be specific, it will produce N of them, where N = number of samples (in our case, N = 41).
N is equal to loocv.get_n_splits(X).
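In your case, the wrong-axis behaviour most likely comes from X being a pandas DataFrame rather than a numpy array like the fake data above: X[train_index] on a DataFrame looks up columns, not rows. Positional indexing should fix it:
for train_index, test_index in loocv.split(X):
    # .iloc selects rows by position, matching the indices produced by the split
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]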
Hope this helps