How to partition a dataset into three equal parts? - python

I am trying to divide my dataset into three equal parts using scikit-learn. But when I use StratifiedKFold (from sklearn) to do it, it only prints the object I created rather than the actual partitions:
from sklearn.model_selection import StratifiedKFold
partition = StratifiedKFold(n_splits = 3, shuffle = True, random_state = None)
print(partition)
I am still new to Python libraries, so I am not sure how to do this.

The second line of your code creates a StratifiedKFold object; it does not actually partition your data. It is this object that you should use to split your data (see the example below):
partition = StratifiedKFold(n_splits = 3, shuffle = True, random_state = 1)
for train_index, test_index in partition.split(x, y):
    x_train_f, x_test_f = x[train_index], x[test_index]
    y_train_f, y_test_f = y[train_index], y[test_index]
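For three equal parts specifically, note that with n_splits=3 the three test folds are disjoint and together cover every sample, so they form a stratified partition into three near-equal parts. A minimal sketch with hypothetical toy data:
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy data: 9 samples, 3 per class, so each fold receives one sample of each class
x = np.arange(18).reshape(9, 2)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
partition = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
# collect the three test-fold index arrays; together they partition the data
parts = [test_index for _, test_index in partition.split(x, y)]
x_parts = [x[p] for p in parts]
y_parts = [y[p] for p in parts]
print([len(p) for p in parts])  # [3, 3, 3]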
Your question about splitting your data into three parts has already been answered here:
X_train, X_test, X_validate = np.split(X, [int(.7*len(X)), int(.8*len(X))])
y_train, y_test, y_validate = np.split(y, [int(.7*len(y)), int(.8*len(y))])
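Note those cut points give a 70/10/20 split; for three equal parts, the cut points would sit at one third and two thirds of the data, e.g. (a sketch assuming X is a NumPy array):
import numpy as np

X = np.arange(12)
a, b, c = np.split(X, [len(X) // 3, 2 * len(X) // 3])  # three equal thirds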

Related

How to determine if data point was placed in training or testing set?

I'm using train_test_split to split image data for a convolutional neural network in Python:
x_train, x_test, y_train, y_test = train_test_split(X, Y)
For each image in X, how can I figure out whether it was sent to the x_train or x_test set? Since all the data in the x_train and x_test datasets are in tensor form and randomized, I'm not sure how to relate a given instance in x_train/x_test back to its original place in X. My confusion matrix is printing inconsistent information, so I'm trying to figure out whether the way the data is split into training and testing sets is the reason.
Edit 1: Folder Structure
All the images are in one array (X = np.array(X_images)), which I built by collecting images from folders structured like this:
Data
    Class_1
    Class_2
    ...
    Class_n
I then used: Y = np_utils.to_categorical(labels, num_classes) to get the Y values
If you are able to re-run the experiment, try generating indices instead of raw arrays, then use the indices to extract the train and dev sets.
# this is a slightly modified example from the sklearn documentation:
import numpy as np
from sklearn.model_selection import KFold

X = np.array([["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]])
y = np.array(["a1", "c1", "e1", "g1"])
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print("indices", "TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
train_test_split takes as arguments an arbitrary number of arrays/vectors, so you could just pass an additional list/array containing some identifier to that call, e.g. X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(X, y, ids), where ids is some list/array containing the identifiers corresponding to each element in X/y. Then, the data point at index i in X_train/y_train will correspond to the identifier at id_train[i], and so on for the "test" data. If you don't have a row-identifier column handy, you could just use the index of X, e.g. ids = list(range(X.shape[0])).
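A minimal sketch of that idea, using hypothetical toy arrays:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(10, 3)      # hypothetical image data
y = np.arange(10) % 2          # hypothetical labels
ids = list(range(X.shape[0]))  # row identifiers: the original indices

X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
    X, y, ids, random_state=0)
# id_train[i] is the original row index in X of X_train[i]
print(id_train, id_test)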
The following solved my problem. I created a numpy array from the original image data:
indices = np.arange(X.shape[0])
I fed this into the train_test_split call and added two more return values that hold the index of each image's position in the original X:
x_train, x_test, y_train, y_test, x_train_ind, x_test_ind = train_test_split(X, Y, indices, test_size=0.2, random_state=2)
After getting the index of an image in the x_train dataset, we can plug that into x_train_ind to get its index in the original X dataset.

Split dataframe for train_test_split based on indexes

I am trying to use train_test_split to get my train data to be the dataframe between indexes 31 and 39.
I want to write something like X_train, X_test, y_train, y_test = train_test_split(faces.data, faces.target, test_size = 0.3) where faces is faces = sk.datasets.fetch_olivetti_faces()
How can I select which indexes I want to go into my train data?
As @berkayln suggested, I'm not sure your train-test split strategy is advisable, but to split the data as you're suggesting, I believe you can use:
import numpy as np
from sklearn import datasets

faces = datasets.fetch_olivetti_faces()
X_train = faces.data[31:40]
X_test = faces.data[np.r_[0:31, 40:400]]
y_train = faces.target[31:40]
y_test = faces.target[np.r_[0:31, 40:400]]
You can do this easily with fancy indexing:
n = 40  # however many samples you want in the training set
X_train = faces.data[:n]
X_test = faces.data[n:]
y_train = faces.target[:n]
y_test = faces.target[n:]

create training validation split using sklearn

I have a training set consisting of X and Y. X is of shape (4000, 32, 1) and Y is of shape (4000, 1).
I would like to create a training/validation split from it. Here is what I have been trying to do:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = Y[train_index], Y[valid_index]
Running the program gives the following error message related to the above code segment
for train_index, valid_index in sss.split(X, Y):
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
I am not very clear about the above error message, what's the right way to create a training/validation split for the training set as above?
It's a little bit weird, because I copied and pasted your code with sklearn's breast cancer dataset as follows:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, Y = cancer.data, cancer.target
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = Y[train_index], Y[valid_index]
Here X.shape = (569, 30) and Y.shape = (569,), and I had no error; for example, y_valid.shape = (57,), or one tenth of 569.
I suggest you reshape X into (4000, 32) (and Y into (4000,)), because Python may see it as a list of ONE big element (I am using Python 2.7, by the way).
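A minimal sketch of that reshape, assuming NumPy arrays with the shapes you describe:
import numpy as np

X = np.zeros((4000, 32, 1))  # stand-in for your data
Y = np.zeros((4000, 1))
X = X.reshape(4000, 32)  # drop the trailing singleton axis
Y = Y.reshape(4000)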
To answer your question, you can alternatively use train_test_split
from sklearn.model_selection import train_test_split
which according to the help
Split arrays or matrices into random train and test subsets. Quick utility that wraps input validation and ``next(ShuffleSplit().split(X, y))``.
Basically a wrapper of what you wanted to do. You can then specify the training and the test sizes, the random_state, if you want to stratify your data or to shuffle it etc.
It's easy to use, for example:
X_train, X_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.1, random_state=0)
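Since your original code used StratifiedShuffleSplit, you may also want to pass stratify=Y so the class proportions are preserved in both subsets; a small sketch with the breast cancer data from above:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X, Y = cancer.data, cancer.target
# stratify=Y keeps the class balance the same in train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, Y, test_size=0.1, random_state=0, stratify=Y)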

How to implement SMOTE in cross validation and GridSearchCV

I'm relatively new to Python. Can you help me turn my implementation of SMOTE into a proper pipeline? What I want is to apply the over- and under-sampling on the training set of every k-fold iteration, so that the model is trained on a balanced data set and evaluated on the imbalanced left-out piece. The problem is that when I do that, I cannot use the familiar sklearn interface for evaluation and grid search.
Is it possible to make something similar to model_selection.RandomizedSearchCV? My take on this:
import numpy as np
import pandas as pd
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("Imbalanced_data.csv")  # load the data set
X = df.iloc[:, 0:64].values
y = df.iloc[:, 64].values
n_splits = 2
n_measures = 2  # recall and AUC
kf = StratifiedKFold(n_splits=n_splits)  # stratified because we need balanced samples
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
scores = np.zeros((n_splits, n_measures))
for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    sm = SMOTE(ratio='auto', k_neighbors=5, n_jobs=-1)
    smote_enn = SMOTEENN(smote=sm)
    x_train_res, y_train_res = smote_enn.fit_sample(X_train, y_train)
    clf_rf.fit(x_train_res, y_train_res)
    y_pred = clf_rf.predict(X_test)  # predict takes only X, not y
    scores[fold, 0] = recall_score(y_test, y_pred)
    scores[fold, 1] = roc_auc_score(y_test, y_pred)  # auc() expects curve points; roc_auc_score is what's wanted here
You need to look at the Pipeline object. imbalanced-learn has a Pipeline which extends the scikit-learn Pipeline to adapt for the fit_sample() and sample() methods in addition to scikit-learn's fit_predict(), fit_transform() and predict() methods.
Have a look at this example here:
https://imbalanced-learn.org/stable/auto_examples/pipeline/plot_pipeline_classification.html
For your code, you would want to do this:
from imblearn.combine import SMOTEENN
from imblearn.pipeline import make_pipeline, Pipeline

smote_enn = SMOTEENN(smote=sm)  # sm is the SMOTE instance from your code
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
pipeline = make_pipeline(smote_enn, clf_rf)
OR
pipeline = Pipeline([('smote_enn', smote_enn),
                     ('clf_rf', clf_rf)])
Then you can pass this pipeline object to GridSearchCV, RandomizedSearchCV, or other cross-validation tools in scikit-learn as a regular estimator.
kf = StratifiedKFold(n_splits=n_splits)
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                   n_iter=1000,
                                   cv=kf)
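For completeness, a minimal end-to-end sketch of how these pieces could fit together. The data, param_dist values, and scoring choice here are illustrative assumptions; note that parameter names follow the <step_name>__<parameter> convention for pipeline steps:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline

# illustrative imbalanced data
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=1)

pipeline = Pipeline([('smote_enn', SMOTEENN(random_state=1)),
                     ('clf_rf', RandomForestClassifier(random_state=1))])

# keys use the <step_name>__<parameter> convention
param_dist = {'clf_rf__n_estimators': [10, 25, 50],
              'clf_rf__max_depth': [3, 5, None]}

kf = StratifiedKFold(n_splits=2)
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                   n_iter=5, cv=kf, scoring='recall',
                                   random_state=1)
random_search.fit(X, y)
print(random_search.best_params_)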
This looks like it would fit the bill http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html
You'll want to create your own transformer (http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) that upon calling fit returns a balanced data set (presumably the one obtained from StratifiedKFold), but upon calling predict, which is what is going to happen for the test data, calls into SMOTE.

Different values each time I run the code even with random_state

Each time I run this code, I get a different value for the print statement. I'm confused why it's doing that because I specifically included the random_state parameter for the train/test split. (On a side note, I hope I'm supposed to encode the data; it was giving "ValueError: could not convert string to float" otherwise).
import math
import pandas as pd
from sklearn import tree
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data',
                 names=['buying', 'maint', 'doors', 'persons',
                        'lug_boot', 'safety', 'acceptability'])
# turns variables into numbers (the algorithms won't accept strings otherwise)
df = df.apply(LabelEncoder().fit_transform)
print(df)
X = df.reindex(columns=['buying', 'maint', 'doors', 'persons',
                        'lug_boot', 'safety'])
y = df['acceptability']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train)
# decision trees classification
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_train, y_train)
y_true = y_test
y_pred = clf.predict(X_test)
print(math.sqrt(mean_squared_error(y_true, y_pred)))
DecisionTreeClassifier also takes a random_state param: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
All you did was ensure that the train/test splits are repeatable, but the classifier also needs its own seed to be the same on each run.
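For example, a sketch fixing the classifier's seed as well:
from sklearn import tree

# fixing the tree's own random_state (in addition to the split's) makes runs repeatable
clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=0)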
Update
Thanks to @Chester VonWinchester for pointing out https://github.com/scikit-learn/scikit-learn/issues/8443: due to sklearn's implementation choice, it can be non-deterministic with max_features=None, even though that should mean all features are considered.
There is further information and discussion in the link above.
