I have a training set consisting of X and Y, where X has shape (4000, 32, 1) and Y has shape (4000, 1).
I would like to create a training/validation split from it. Here is what I have been trying:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = Y[train_index], Y[valid_index]
Running the program gives the following error for the loop above:
for train_index, valid_index in sss.split(X, Y):
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
I am not clear about this error message. What is the right way to create a training/validation split for a training set shaped like this?
It's a little bit weird, because I copy/pasted your code with sklearn's breast cancer dataset as follows:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, Y = cancer.data, cancer.target
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = Y[train_index], Y[valid_index]
Here X.shape = (569, 30) and Y.shape = (569,), and I had no error; for example, y_valid.shape = (57,), or one tenth of 569.
I suggest you reshape X into (4000, 32) (and, accordingly, Y into (4000,)), because with the extra trailing dimension Python may see Y as a list of one big element (I am using Python 2.7, by the way).
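That said, the error itself says that some class in Y has only one member, which stratification cannot handle: StratifiedShuffleSplit needs at least two samples of every class. A quick hedged check (assuming Y holds discrete class labels) is to count the members of each class:

import numpy as np

# Stratified splitting requires at least 2 samples per class;
# any class with a count of 1 will trigger the error above.
classes, counts = np.unique(Y.ravel(), return_counts=True)
print(dict(zip(classes, counts)))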
To answer your question, you can alternatively use train_test_split:
from sklearn.model_selection import train_test_split
which, according to the help, will:
Split arrays or matrices into random train and test subsets. Quick utility that wraps input validation and ``next(ShuffleSplit().split(X, y))``.
It is basically a wrapper for what you wanted to do. You can then specify the training and test sizes, the random_state, and whether to stratify or shuffle your data, etc.
It's easy to use, for example:
X_train, X_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.1, random_state=0)
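If you also want the split to preserve class proportions, one hedged variant (assuming Y has been flattened to shape (4000,)) is to pass it to the stratify parameter:

# Stratified 90/10 split; each class keeps roughly the same
# proportion in the train and validation sets.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, Y, test_size=0.1, random_state=0, stratify=Y)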
Related
I'm using train_test_split to split image data for a convolutional neural network in Python:
x_train, x_test, y_train, y_test = train_test_split(X, Y)
For each image in X, how can I figure out whether it was sent to the x_train or x_test set? Since all the data in the x_train and x_test datasets is in tensor form and randomized, I'm not sure how to relate a given instance in x_train/x_test back to its original place in X. My confusion matrix is printing inconsistent information, so I'm trying to figure out whether the way the data is split into training and testing is the reason.
Edit 1: Folder Structure
All the images are in one array (X = np.array(X_images)), which I derived by collecting images from folders such that:
Data
    Class_1
    Class_2
    ...
    Class_n
I then used Y = np_utils.to_categorical(labels, num_classes) to get the Y values.
If you are able to re-run the experiment, try generating indices instead of raw arrays, then use the indices to extract the train and dev sets.
# This is a slightly modified example from the sklearn documentation:
import numpy as np
from sklearn.model_selection import KFold

X = np.array([["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]])
y = np.array(["a1", "c1", "e1", "g1"])
kf = KFold(n_splits=2)
for train_index, test_index in kf.split(X):
    print("indices", "TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
train_test_split takes as arguments an arbitrary number of arrays/vectors, so you could just pass an additional list/array containing some identifier to that call, e.g. X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(X, y, ids), where ids is some list/array containing the identifiers corresponding to each element in X/y. Then, the data point at index i in X_train/y_train will correspond to the identifier at id_train[i], and so on for the "test" data. If you don't have a row-identifier column handy, you could just use the index of X, e.g. ids = list(range(X.shape[0])).
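As a concrete hedged sketch of that idea (the ids list here is illustrative, not part of the original code):

from sklearn.model_selection import train_test_split

# Row positions double as identifiers, so each split element can be
# traced back to its original index in X.
ids = list(range(X.shape[0]))
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
    X, Y, ids, test_size=0.25, random_state=42)

# id_train[i] is the original row index in X of X_train[i].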
The following solved my problem. I created a numpy array of indices covering the original image data:
indices = np.arange(X.shape[0])
I fed this into the train_test_split call and added two more outputs that hold, for each split element, its index in the original X:
x_train, x_test, y_train, y_test, x_train_ind, x_test_ind = train_test_split(X, Y, indices, test_size=0.2, random_state=2)
After getting the index of an image in the x_train dataset, we can plug that into x_train_ind to get its index in the original X dataset.
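For example (a minimal hedged illustration of the lookup):

# The image at position i in x_train came from X[x_train_ind[i]].
i = 0
original_index = x_train_ind[i]
assert (x_train[i] == X[original_index]).all()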
I'm facing an issue where sklearn's train_test_split() divides large data sets unexpectedly. I'm trying to load the entire 118 MB data set, and the test set it assigns is more than ten times smaller than what the code should produce.
Case 1: 60K data points
# loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv', nrows=60000)
data.shape
y = data['project_is_approved']
X = data.drop(['project_is_approved'], axis=1)
X.shape, y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=0)
print(X_train.shape, y_train.shape)
# print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output:
(40200, 8) (40200,)
(19800, 8) (19800,)
Case 2: 109,000 data points
# loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv')
print(data.shape)
y = data['project_is_approved']
X = data.drop(['project_is_approved'], axis=1)
X.shape, y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=123)
print(X_train.shape, y_train.shape)
# print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output:
(109248, 9)
(90552, 8) (90552,)
(1460, 8) (1460,)
Anything more than 60K data points is being split abruptly, as in case 2, into 90K and 1.4K. I've tried changing the random state, removing the random state, and moving the data set to a new location, but the issue stays the same.
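One thing worth checking (a hedged diagnostic, not a confirmed cause): the reported shapes don't add up (90,552 + 1,460 is well short of 109,248), which can happen when the CSV parses differently than expected. Inspecting the parsed frame and label column may narrow it down:

import pandas

data = pandas.read_csv('preprocessed_data.csv')
print(data.shape)  # actual parsed row count
print(data['project_is_approved'].value_counts(dropna=False))  # label distribution
print(data.isna().sum())  # missing values per column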
Basically, I want to split my dataset into training, testing, and validation sets, so I have used the train_test_split function twice. I have a dataset of around 10 million rows.
On the first split I divided it 70/30 into training and testing sets. Now, to get a validation set, I am a bit confused about whether to pass the testing data or the training data into train_test_split to obtain it. Please give some advice. TIA
X = features
y = target

# dividing X, y into training, testing, and validation data:
# 70% training, 15% testing, 15% validation
from sklearn.model_selection import train_test_split

# features and labels split 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# the test data is then split in half into test and validation sets (15/15)
x_test, x_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)
Don't make the testing set too small; a 20% testing set is fine. It would be better to split your training dataset into training and validation sets (80%/20% of it is a fair split). Considering this, you should change your code to:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# split the remaining training data 80/20 into training and validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
This is a common way to split a dataset.
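To make the resulting proportions concrete, here is a small hedged check with made-up data (the sizes are illustrative): 0.25 of the remaining 80% is 20% overall, giving a 60/20/20 split.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200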
I'm using scikit-learn's train_test_split functionality and am getting different results when running the same code repeatedly:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)
When I log the number of unique elements in y_train:
logger.info(len(set(y_train)))
I get different values on repeated runs (with no code changes). I would have thought the random_state would ensure a deterministic split.
How can I ensure the same split each time?
The randomness is not caused by train_test_split, as you can see if you run this minimal code multiple times:
from sklearn.model_selection import train_test_split

x = [k for k in range(0, 50)]
y = [k for k in range(0, 50)]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=44)
print(x_train)
You probably have another source of randomness in your code, so maybe numpy or pandas is causing the problem.
The value you set for random_state (42 is used in many scikit-learn examples) does not really matter; what matters most is that the value is always the same, so you can validate your code across multiple runs.
There might be some other randomness present in your code that produces different results; could you post your complete code?
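If the data preparation upstream shuffles with numpy or Python's random module, those generators need their own seeds. A hedged sketch (the shuffle step is illustrative, not from the original code):

import random
import numpy as np

# Seed every RNG the pipeline touches, not just train_test_split.
random.seed(42)
np.random.seed(42)

data = list(range(100))
random.shuffle(data)  # deterministic now that the seed is fixed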
I want to split data into train and test sets, along with a vector that contains names (it serves me as an index and reference).
name_images has a shape of (2440,)
My data are:
data has a shape of (2440, 3072)
labels has a shape of (2440,)
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.3)
but I also want to split my name_images into name_images_train and name_images_test, consistent with the split of data and labels.
I tried:
x_train, x_test, y_train, y_test, name_images_train, name_images_test = train_test_split(data, labels, name_images, test_size=0.3)
but it doesn't preserve the order.
Any suggestions?
Thank you.
EDIT1:
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.3, random_state=42)
name_images_train, name_images_test = train_test_split(name_images, test_size=0.3, random_state=42)
EDIT1 doesn't preserve the order either.
There are multiple ways to accomplish this.
The most straightforward is to use the random_state parameter of train_test_split. As the documentation states:
random_state : int or RandomState :-
Pseudo-random number generator state used for random sampling.
When you fix the random_state, the indices generated for splitting the arrays into train and test are exactly the same each time.
So change your code to:
x_train, x_test, y_train, y_test, name_images_train, name_images_test = \
    train_test_split(data, labels, name_images, test_size=0.3, random_state=42)
For more understanding on random_state, see my answer here:
https://stackoverflow.com/a/42197534/3374996
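A quick hedged check that the correspondence survives the split (toy arrays, not the original data):

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape(10, 2)
labels = np.arange(10)
name_images = np.array(["img_%d" % k for k in range(10)])

x_tr, x_te, y_tr, y_te, n_tr, n_te = train_test_split(
    data, labels, name_images, test_size=0.3, random_state=42)

# Each name still lines up with its label after the split.
for label, name in zip(y_tr, n_tr):
    assert name == "img_%d" % label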
In my case, I realized that my input arrays were not in proper order in the first place. So, for future Googlers: you may want to double-check whether (data, labels) are in the same order or not.