Scikit-learn: train/test split not reproducible - python

I'm using scikit-learn's train_test_split functionality and am getting different results when running the same code repeatedly:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)
When I log the number of unique elements in y_train:
logger.info(len(set(y_train)))
I get different values on repeated runs (with no code changes). I would have thought the random_state would ensure a deterministic split.
How can I ensure the same split each time?

The randomness is not caused by train_test_split as you can see if you run this minimal code multiple times:
from sklearn.model_selection import train_test_split
x = list(range(50))
y = list(range(50))
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=44)
print(x_train)
You probably have another source of randomness in your code. So maybe numpy/pandas is causing the problem.

The value you pass as random_state (42 is used in many scikit-learn examples) does not really matter; what matters is that it is always the same value, so that you can rerun and validate your code against the same split.
There might be some other source of randomness in your code that produces the different results. Could you post your complete code?
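A quick way to confirm that the split itself is deterministic is to call train_test_split twice with identical inputs and the same random_state, then compare the results. This is only a sketch with toy lists standing in for the real x and y; if the check passes on your own data, the varying results most likely come from how x and y are built before the split.

```python
from sklearn.model_selection import train_test_split

# Toy inputs standing in for the real data (an assumption for illustration)
x = list(range(50))
y = list(range(50))

# Two independent calls with the same inputs and the same random_state
first = train_test_split(x, y, test_size=0.1, random_state=42)
second = train_test_split(x, y, test_size=0.1, random_state=42)

# Each call returns [x_train, x_test, y_train, y_test]; compare them pairwise
same_split = all(a == b for a, b in zip(first, second))
print(same_split)
```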

Related

How to retrieve the random_state of sklearn.model_selection.train_test_split?

Without setting the random_state, I split my dataset with train_test_split. Because the machine learning model trained on the split dataset performs quite well, I want to retrieve the random_state that was used to split the dataset. Is there something like numpy.random.get_state()?
If you trace through the call stack of train_test_split, you'll find the random_state parameter is used like this:
from sklearn.utils import check_random_state
rng = check_random_state(None)  # scikit-learn itself calls check_random_state(self.random_state)
print(rng)
The relevant part of check_random_state is
def check_random_state(seed):
    if seed is None or seed is np.random:
        return np.random.mtrand._rand
If random_state=None, you get the default numpy.random.RandomState singleton, which you can use to generate new random numbers, e.g.:
print(rng.permutation(10))
print(rng.randn(10))
See these questions for more information:
Difference between np.random.seed() and np.random.RandomState()
Consistently create same random numpy array
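Since an unseeded split draws from the global NumPy RandomState, one possible workaround is to capture that generator's state with np.random.get_state() before splitting and restore it later with np.random.set_state(). The sketch below uses toy arrays and only helps if the state is saved before the split is made; it cannot recover the state of a split that has already happened.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (an assumption for illustration)
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Capture the global NumPy RNG state *before* the unseeded split
state = np.random.get_state()
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, test_size=0.2)

# Restoring that state and splitting again reproduces the same split
np.random.set_state(state)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, test_size=0.2)

reproduced = np.array_equal(y_train_a, y_train_b) and np.array_equal(y_test_a, y_test_b)
print(reproduced)
```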
What do you mean?
If you want to know which random_state you are using, you have to pass random_state explicitly when calling the function, for example:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
By default it is set to None; see the docs.
Here is also further information on random_state.
Or do you mean this?
If you only have an old notebook showing a slice of one or more of the train/test subsets (e.g. X_test[0:5], y_train[-5:]), but you know the other parameters of the train_test_split() call (e.g. test_size or train_size, shuffle, stratify) and can perfectly recreate X and y, you could try brute-forcing it. Generate new splits with different random_state seeds, compare each split to your known subset slice, and record any random_state values that produce matching values (or values close enough that the differences could just be floating-point noise).
import numpy as np
from sklearn.model_selection import train_test_split
# X and y must be recreated exactly as they were before the original split
target_y_train = np.array([-5.482, -11.165, -13.926, -7.534, -8.323])
possible_random_state_values = []
for i in range(0, 1000):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=i)
    if all(np.isclose(y_train[0:5], target_y_train)):
        possible_random_state_values.append(i)
        print(f"Possible random state value found: {i}")
If you don't get any possible seeds from range(0, 1000), raise the upper bound. And when you get values, you can plug them into train_test_split(), compare other subset slices if you have any, rerun your model training pipeline, and compare your output metrics.

How do I access the datasets after running k-fold with scikit-learn?

I'm trying to apply the KFold method, but I don't know how to access the training and testing sets it generates. After going through several blogs and the scikit-learn user guide, the only thing I see people do is print the training and testing sets. This could work for a small dataframe, but it's not useful when it comes to larger dataframes. Can anyone help me?
The data I'm using: https://github.com/ageron/handson-ml/tree/master/datasets/housing
Where I'm currently at:
X = housing[['total_rooms', 'total_bedrooms']]
y = housing['median_house_value']
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
But this is only useful to get the last dataset generated. I should be able to get all.
Thanks in advance.
AFAIK, KFold (and in fact everything related to the cross validation process) is meant to provide temporary datasets, so that one is able, as you say, to use them on the fly for fitting & evaluating models as shown in Cross-validation metrics in scikit-learn for each data split.
Nevertheless, since KFold.split() results in a Python generator, you can use the indices generated in order to get permanent subsets, albeit with some manual work. Here is an example with the diabetes data (load_boston, used in older examples, was removed in scikit-learn 1.2):
from sklearn.model_selection import KFold
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
n_splits = 3
kf = KFold(n_splits=n_splits, shuffle=True)
# Consume the generator once; calling next(kf.split(X)) repeatedly would reshuffle and return only a first fold each time
folds = list(kf.split(X))
Now, for every k in range(n_splits), folds[k][0] contains the training indices and folds[k][1] the corresponding validation indices, so you can do:
X_train_1 = X[folds[0][0]]
X_test_1 = X[folds[0][1]]
and so on. Notice that the same indices are applicable to the labels y too.
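For a pandas DataFrame like the one in the question, the same idea can be sketched by appending each fold's subsets to a list instead of overwriting the variables in the loop. The small random DataFrame below is only a stand-in for the housing data (the column names come from the question, the values are made up), and storing the folds in dictionaries is just one possible layout.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for the housing DataFrame (values are made up)
rng = np.random.RandomState(0)
housing = pd.DataFrame({
    "total_rooms": rng.randint(100, 5000, size=20),
    "total_bedrooms": rng.randint(10, 1000, size=20),
    "median_house_value": rng.randint(50000, 500000, size=20),
})
X = housing[["total_rooms", "total_bedrooms"]]
y = housing["median_house_value"]

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Keep every fold instead of overwriting the variables on each iteration
folds = []
for train_index, test_index in kf.split(X):
    folds.append({
        "X_train": X.iloc[train_index],
        "X_test": X.iloc[test_index],
        "y_train": y.iloc[train_index],
        "y_test": y.iloc[test_index],
    })

print(len(folds), folds[0]["X_test"].shape)
```

Any fold can then be accessed later, e.g. folds[2]["X_train"]. For large DataFrames it may be cheaper to store only the index arrays and slice on demand.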

create training validation split using sklearn

I have a training set consisting of X and Y. X has shape (4000, 32, 1) and Y has shape (4000, 1).
I would like to create a training/validation set based on split. Here is what I have been trying to do
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = Y[train_index], Y[valid_index]
Running the program gives the following error message related to the above code segment
for train_index, valid_index in sss.split(X, Y):
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
I am not very clear about the above error message, what's the right way to create a training/validation split for the training set as above?
It's a little bit weird, because I copy/pasted your code with sklearn's breast cancer dataset as follows:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, Y = cancer.data, cancer.target
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
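for train_index, valid_index in sss.split(X, Y):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = Y[train_index], Y[valid_index]
Here X.shape = (569, 30) and Y.shape = (569,) and I had no error; for example, y_valid.shape = (57,), i.e. one tenth of 569.
The error message itself says that at least one class in Y occurs only once, and a stratified split needs at least two samples of every class, so it is worth checking the class counts in Y (for example with np.unique(Y, return_counts=True)). I would also suggest reshaping X into (4000, 32) and Y into (4000,), in case the extra dimension is being misinterpreted (I am using Python 2.7, by the way).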
To answer your question, you can alternatively use train_test_split
from sklearn.model_selection import train_test_split
which, according to the help:
Split arrays or matrices into random train and test subsets. Quick utility that wraps input validation and next(ShuffleSplit().split(X, y))
Basically a wrapper of what you wanted to do. You can then specify the training and the test sizes, the random_state, if you want to stratify your data or to shuffle it etc.
It's easy to use, for example:
X_train, X_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.1, random_state=0)
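If the class balance should be preserved in the validation set, train_test_split also accepts a stratify argument. The sketch below uses made-up data (a 2-D X and a binary Y with a 90/10 class ratio, both assumptions) and first prints the class counts, since stratification fails with the error from the question when any class has fewer than two members.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 4000 samples, 32 features, 90/10 class imbalance
rng = np.random.RandomState(0)
X = rng.rand(4000, 32)
Y = np.repeat([0, 1], [3600, 400])

# Check class counts first: every class needs at least 2 members to stratify
classes, counts = np.unique(Y, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))

# stratify=Y keeps the class proportions the same in both subsets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, Y, test_size=0.1, random_state=0, stratify=Y
)
print(X_valid.shape, int(y_valid.sum()))
```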

How to split the data set without train_test_split()?

I need to split my dataset into training and testing.
I need the last 20% of the values for testing and the first 80% for training.
I have currently used the 'train_test_split()' but it picks the data randomly instead of the last 20%. How can I get the last 20% for testing and the first 80% for training?
My code is as follows:
numpy_array = df.to_numpy()  # as_matrix() was removed in pandas 1.0
X = numpy_array[:, 1:26]
y = numpy_array[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=20) #I do not want the data to be random.
Thanks
train_pct_index = int(0.8 * len(X))
X_train, X_test = X[:train_pct_index], X[train_pct_index:]
y_train, y_test = y[:train_pct_index], y[train_pct_index:]
It's one of those situations where it's just better not to involve sklearn helpers. Very straightforward, readable, and not dependent on knowing internal options of sklearn helpers, which code readers may not have experience with.
I think this Stack Overflow topic answers your question:
How to get a non-shuffled train_test_split in sklearn
And especially this piece of text:
in scikit-learn version 0.19, you can pass the parameter shuffle=False to train_test_split to obtain a non-shuffled split.
From the documentation:
shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
Please tell me if I didn't understand your question correctly
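As a small sketch of the two approaches above, the following compares shuffle=False with plain slicing on toy arrays (the data is made up for illustration). Note that test_size is given as the fraction 0.2; an integer such as 20 would be read as an absolute number of samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real arrays
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# shuffle=False keeps the original order: first 80% train, last 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# The same split done by hand with slicing
cut = int(0.8 * len(X))
matches_slicing = np.array_equal(X_train, X[:cut]) and np.array_equal(y_test, y[cut:])
print(matches_slicing)
```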

Saving order of splitting with a vector of index

I want to split data into train and test sets, along with a vector that contains names (it serves as an index and reference).
name_images has a shape of (2440,)
My data are:
data has a shape of (2440, 3072)
labels has a shape of (2440,)
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.3)
But I also want to split name_images into name_images_train and name_images_test, consistent with the split of data and labels.
I tried
x_train, x_test, y_train, y_test, name_images_train, name_images_test = train_test_split(data, labels, name_images, test_size=0.3)
but it doesn't preserve the order.
Any suggestions?
Thank you
EDIT1:
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.3, random_state=42)
name_images_train, name_images_test = train_test_split(name_images,
                                                       test_size=0.3,
                                                       random_state=42)
EDIT1 doesn't preserve the order either.
There are multiple ways to accomplish this.
The most straightforward is to use the random_state parameter of train_test_split. As the documentation states:
random_state : int or RandomState
Pseudo-random number generator state used for random sampling.
When you fix the random_state, the indices generated for splitting the arrays into train and test are exactly the same each time.
So change your code to:
(x_train, x_test,
 y_train, y_test,
 name_images_train, name_images_test) = train_test_split(data, labels, name_images,
                                                         test_size=0.3,
                                                         random_state=42)
For more understanding on random_state, see my answer here:
https://stackoverflow.com/a/42197534/3374996
In my case, I realized that my input arrays were not in the proper order in the first place. So, for future Googlers: you may want to double-check whether data and labels are in the same order.
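To check that the three arrays stay aligned when they are split in a single call, a sketch like the following can help. The arrays are small made-up stand-ins for data, labels, and name_images, built so that row i, label i, and name "img_i" belong together.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins: row i of data, labels[i] and name_images[i] belong together
data = np.arange(30).reshape(10, 3)
labels = np.arange(10)
name_images = np.array([f"img_{i}" for i in range(10)])

# Passing all arrays in one call applies the same indices to each of them
(x_train, x_test,
 y_train, y_test,
 name_images_train, name_images_test) = train_test_split(
    data, labels, name_images, test_size=0.3, random_state=42
)

# Every name in the test set should still match its label
aligned = all(name == f"img_{label}" for name, label in zip(name_images_test, y_test))
print(aligned)
```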
