how can I train test split in scikit learn [duplicate] - python

This question already has answers here:
scikit-learn error: The least populated class in y has only 1 member
Does anyone know what the problem is?
import numpy as np
from sklearn.model_selection import train_test_split

x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))  # y is a continuous target
X = x[:, np.newaxis]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
I have this error:
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

The stratify=y parameter inside train_test_split is what gives you the error. Stratification is meant for classification labels, i.e. a y whose values repeat as classes. For example, if your label column only holds 0s and 1s, passing stratify=y preserves the original label proportions in the splits: with 60% 1s and 40% 0s, your training sample keeps that same proportion. Your y here is continuous, so almost every value occurs exactly once and there is nothing to stratify on.

Try removing stratify=y; you should be fine without it (see the sketch below).
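A minimal sketch of the same split with stratify dropped, since the continuous y has no classes to preserve:
import numpy as np
from sklearn.model_selection import train_test_split

# same data as in the question; the continuous y means there is nothing to stratify on
x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))
X = x[:, np.newaxis]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)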

From the documentation:
3.1.2.2. Cross-validation iterators with stratification based on class labels.
Some classification problems can exhibit a large imbalance in the
distribution of the target classes: for instance there could be
several times more negative samples than positive samples. In such
cases it is recommended to use stratified sampling as implemented in
StratifiedKFold and StratifiedShuffleSplit to ensure that relative
class frequencies are approximately preserved in each train and
validation fold.
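To illustrate what stratification is for, here is a small sketch with a made-up binary label split 60/40; both resulting sets keep that proportion:
import numpy as np
from sklearn.model_selection import train_test_split

# made-up classification labels: 60% ones, 40% zeros
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print(y_tr.mean(), y_te.mean())  # both ~0.6: the 60/40 ratio is preserved in each split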

Related

multiclass CV in XGboost (python) - some classes not in train/validation sub-groups

I am working with an XGBoost model in Python on a large dataset of embeddings (x) and corresponding labels (y); I have about 30000 samples.
The data is very imbalanced, with 8 different label classes.
I am attempting to perform hyperparameter tuning (using RandomizedSearchCV).
For some of the CV folds I get an error:
ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1 2 3 4 5 6 7], got [0 1 2 3 5 6 7 8].
Due to the different splitting each time (using stratified split), some splits do not have all the labels in both groups.
I searched the web a lot and couldn't find anything in this exact context, even though I imagine this should be a major issue for many imbalanced multiclass classifications.
My code:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RandomizedSearchCV, StratifiedGroupKFold
from xgboost import XGBClassifier

y = y.values.astype(int)
le = LabelEncoder()
y = le.fit_transform(y)

xgb_base = XGBClassifier(objective='multi:softprob', learning_rate=LR)
cv = StratifiedGroupKFold(n_splits=NUM_CV)

# Create the randomized search over the XGBoost parameter grid
xgb_random = RandomizedSearchCV(estimator=xgb_base, param_distributions=xgb_grid,
                                n_iter=NUM_ITER, cv=cv, verbose=2,
                                random_state=1)

# Fit the randomized search model
xgb_random.fit(X, y, groups=groups)

# Get the optimal parameters
print(xgb_random.best_params_)
This is not a bug or an error. Use stratified CV and see if that helps.
Why this is happening:
Suppose you have 3 classes and 5 samples with labels [0, 1, 0, 1, 2]. Even if you split into only 2 folds (k=2), either the train or the test fold won't contain class 2. That is what is happening in your case.
If k is greater than the minimum number of samples per class, you will definitely have this problem. Otherwise, StratifiedKFold can help: it splits the data so that each fold has approximately the same class distribution.
On a broader note, if you can, drop the classes you don't need, or merge two or more of the rare classes.
Check this link to see the difference between the different KFold types.
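To make the example concrete, a small sketch of the [0, 1, 0, 1, 2] case: counting samples per class shows why even k=2 fails, and merging the rare class (the merge target chosen here is arbitrary) is one possible workaround:
import numpy as np

y = np.array([0, 1, 0, 1, 2])  # class 2 has a single sample

# the smallest per-class count bounds the number of stratified folds you can use
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes, counts)))  # {0: 2, 1: 2, 2: 1} -> even k=2 cannot place class 2 in both folds

# one workaround: merge the rare class into another class before splitting
y_merged = np.where(y == 2, 1, y)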

How is it that the accuracy score for 10-fold cross validation is worst than for a 90-10 train_test_split using sklearn?

The task is binary classification via a neural network. The data is stored in a dictionary that contains the composite name of each entry (as the key) and the label (0 or 1, as the third element of the value vector). The first and second elements are the two parts of the composite name, which are used later to extract the corresponding features.
In both cases, the dictionary is transformed into two arrays for the purpose of performing a balanced undersampling of the majority class (which makes up 66% of the data):
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

data_for_sampling = np.asarray([key for key in list(data.keys())])
labels_for_sampling = [element[2] for element in list(data.values())]

sampler = RandomUnderSampler(sampling_strategy='majority')
data_sampled, label_sampled = sampler.fit_resample(data_for_sampling.reshape(-1, 1), labels_for_sampling)
Then the resampled arrays of names and labels are used to create train and test sets via the KFold method:
from sklearn.model_selection import KFold

kfolder = KFold(n_splits=10, shuffle=True)
kfolder.get_n_splits(data_sampled)  # returns 10; not required for the loop below
for train_index, test_index in kfolder.split(data_sampled):
    data_train, data_test = data_sampled[train_index], data_sampled[test_index]
Or the train_test_split method:
data_train, data_test, label_train, label_test = train_test_split(data_sampled, label_sampled, test_size = 0.1, shuffle = True)
Finally, the names from data_train and data_test are used to re-extract the relevant entries (by key) from the original dictionary, which is then used to gather the features of those entries.
As far as I'm concerned, a single split of the 10-fold sets should provide a similar train-test data distribution to the 90-10 train_test_split, and that seems to be true during training, where both training sets result in ~0.82 accuracy after only one epoch, run separately with model.fit(). However, when the corresponding models are evaluated using model.evaluate() on the test sets after said epoch, the set from train_test_split gives a score of ~0.86, while the set from KFold gives ~0.72.
I have run numerous tests to rule out an unlucky random seed (the seed is not fixed), but the results stayed the same. The sets also have correctly balanced label distributions and seemingly well-shuffled entries.
As it turns out, the problem originates from a combination of sources:
While shuffle = True in the train_test_split() method properly shuffles the provided data first and then splits it into the desired parts, shuffle = True in the KFold method only randomizes which samples end up in which fold; the data within each fold remains in its original order.
This is something the documentation points out, thanks to this post:
https://github.com/scikit-learn/scikit-learn/issues/16068
Now, during learning, my custom train function applies shuffle again on the train data, just to be sure, but it does not shuffle the test data. Moreover, model.evaluate() defaults to batch_size = 32 if no parameter is given, which, paired with the ordered test data, resulted in the discrepancy in the validation accuracy. The test data is indeed flawed in the sense that it contains a large portion of "hard-to-predict" entries, which were clustered together thanks to the ordering and seem to have dragged down the average accuracy in the results.
A completed run across all N folds, as pointed out by TC Arlen, may indeed give a more precise estimate in the end, but I expected closer results after only one fold, which led to the discovery of this problem.
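A quick sketch (not from the original post) of the behaviour described above: even with shuffle=True, KFold assigns samples to folds at random but yields each fold's indices in ascending order, so the test data keeps its original ordering unless you shuffle it yourself:
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)
kf = KFold(n_splits=2, shuffle=True, random_state=0)
rng = np.random.default_rng(0)

for train_idx, test_idx in kf.split(X):
    print(test_idx)                       # fold membership is random, but the order is ascending
    test_idx = rng.permutation(test_idx)  # shuffle within the fold before evaluating on it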
Depending on the amount of noise in the data and on the size of the dataset, it can be expected behavior to see scores on out-of-sample data deviate by this amount. One split is not guaranteed to be just like any other split, which is why you have 10 of them in the first place and then average across all results.
What you should trust to be the most generalizable is not any one given split (whether it comes from one of the 10 folds or from train_test_split()); what is far more trustworthy is the average result across all N folds.
Digging deeper into the data could reveal whether there is some reason why one or more splits deviate so much from the others. For example, perhaps there is some feature in your data (e.g. "date the sample was collected", where the collection methodology changed from month to month) that makes the samples differ from one another in a biased way. If that is the case, you should use a stratified test split (in your CV as well; see the scikit-learn documentation) so you get a more unbiased grouping of your data.
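As a sketch of reporting the cross-fold average rather than any single fold, with a scikit-learn classifier and synthetic data standing in for the neural network and the real dataset:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores)                        # per-fold accuracy varies from fold to fold
print(scores.mean(), scores.std())   # the average (and its spread) is the number to report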

Train, Test, Validate split Python. Three sets

Someone presented a solution to split a dataset into three sets. I wonder where the label is in this case, or how to set the labels then.
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
I will answer the question based on comments:
Using this method for splitting:
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
You are getting 3 different objects: the first 60% of the data from df goes to train, the data corresponding to the interval between 60% and 80% goes to validate, and the last 20% (80%-100%) goes to test. The labels are within these dataframes.
In train_test_split you pass two objects, X and Y, which have most likely been split from an original dataset beforehand, and you get back 4 objects, two corresponding to train and two corresponding to test. Keep in mind: you first split your dataset into independent variables and the explained/target variable, and then split these two objects into train and test.
With np.split you go the other way around: you first split your dataset into 3 objects, train, validate and test, which will later need to be split individually into the independent variables, commonly known as X, and the target variable, known as Y. You are doing the same splits, just in reverse order.
However, keep in mind that by passing the indexes to np.split, the splitting is not performed randomly (hence the df.sample(frac=1) shuffle beforehand), whereas with train_test_split you get random train and test subsets. np.split, on the other hand, allows for more flexibility, for instance, as your example shows, creating more than 2 subsets.
Maybe this will help!
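For example, assuming the target sits in a column called "label" (the column name here is made up), the three-way split and the subsequent feature/label separation could look like this:
import numpy as np
import pandas as pd

# toy frame with two features and a hypothetical "label" column
df = pd.DataFrame({"feat1": range(100),
                   "feat2": range(100),
                   "label": [0, 1] * 50})

train, validate, test = np.split(
    df.sample(frac=1, random_state=42),
    [int(.6 * len(df)), int(.8 * len(df))])

# the label column travels with each split; separate it afterwards
X_train, y_train = train.drop(columns="label"), train["label"]
X_val, y_val = validate.drop(columns="label"), validate["label"]
X_test, y_test = test.drop(columns="label"), test["label"]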
Try this: feed the output of one train_test_split into a second one as input.
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_test, X_validate, y_test, y_validate = train_test_split(X_test, y_test, test_size=0.5)
The function randomly splits 2 arrays into 4 arrays, and test_size determines the size of the split allocated to the test output vs train. The y input is meant to be a target for building a machine learning model and X is meant to be the features for the model. If you want them combined, then just concat the equivalent X and y outputs.
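Since the second call splits the 40% held out by the first call in half, you end up with a 60/20/20 split. A quick sketch with made-up data to confirm the proportions:
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(-1, 1), np.arange(100)  # made-up data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
X_test, X_validate, y_test, y_validate = train_test_split(
    X_test, y_test, test_size=0.5, random_state=0)

print(len(X_train), len(X_validate), len(X_test))  # 60 20 20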

What does this error mean with StratifiedShuffleSplit?

I'm totally new to Data Science in general and was hoping someone could explain why this does not work:
I'm using the Advertising dataset from the following url: "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv" which has 3 feature columns ("TV", "Radio", "Newspaper") and 1 label column ("sales"). My complete dataset is named data.
Next, I try to use sklearn's StratifiedShuffleSplit function to divide the data into training and testing sets.
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, random_state=0)  # can use test_size=0.8
# Generate indices to split data into training and test set.
for train_index, test_index in split.split(data.drop("sales", axis=1), data["sales"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
I get this ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
Using the same code on another dataset which has 14 feature columns and 1 label column separates the data appropriately. Why doesn't it work here? Thanks.
I think the problem is that your data_y is a 2D matrix, but as the sklearn.model_selection.StratifiedShuffleSplit doc says, it should be a 1D vector. Try encoding each row of data_y as an integer (it will be interpreted as a class), and then use the split.
Or, more likely here, your y ("sales") is a regression target (continuous numerical data), so each value forms its own class with a single member and stratification by class does not apply (see Vivek's link in the comments).
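If you do want a stratified split on a continuous target like sales, one common workaround is to bin it first and stratify on the bins. A sketch with stand-in data (the bin count is arbitrary):
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# stand-in for the Advertising data: three features and a continuous "sales" target
rng = np.random.default_rng(0)
data = pd.DataFrame({"TV": rng.uniform(0, 300, 200),
                     "Radio": rng.uniform(0, 50, 200),
                     "Newspaper": rng.uniform(0, 100, 200),
                     "sales": rng.uniform(1, 27, 200)})

# bin the continuous target so there is something categorical to stratify on
sales_cat = pd.cut(data["sales"], bins=5, labels=False)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in split.split(data.drop("sales", axis=1), sales_cat):
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]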

what is the difference between stratify and StratifiedKFold in python scikit learn?

My data consists of 99% target variable = 1 and 1% target variable = 0. Does stratify guarantee that the train and test sets have an equal ratio of data in terms of the target variable, i.e., that each contains the same proportion of '1's and '0's?
Please see below code for clarification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,stratify=y,random_state=42)
Stratification will just return a portion of the data, which may or may not be shuffled based on the arguments you pass. Let's say your dataset consists of 90 instances of class 1 and 10 instances of class 0, and you do a 70:30 split with the appropriate parameters: you get 63 class-1 and 7 class-0 instances in the training set, and 27 class-1 and 3 class-0 instances in the test set. Clearly, this is in no way balanced. The classifier you train will be highly biased and about as good as a dummy classifier that predicts every input as class 1.
A better approach would be to either collect more class-0 data, or oversample the dataset to artificially generate more class-0 instances, or undersample it to keep fewer class-1 instances. imblearn is a Python library that can help you with that; a short sketch follows below.
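A small sketch of the resampling idea with imblearn, on made-up 99:1 data (RandomUnderSampler from the same library works analogously for the undersampling direction):
import numpy as np
from imblearn.over_sampling import RandomOverSampler

# made-up 99:1 imbalance, as in the question
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 990 + [0] * 10)

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print(np.bincount(y_res))  # both classes now have 990 samples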
The first difference is that train_test_split(X, y, test_size=0.2, stratify=y) splits the data only once, with 80% in train and 20% in test.
StratifiedKFold(n_splits=2), by contrast, splits the data into 50% train and 50% test in each fold.
The second difference is that you can specify n_splits greater than 2 to get a cross-validation effect, in which the data is split n_splits times, giving multiple different divisions into train and test.
For more information about the K-fold you can look at this question:
difference between StratifiedKFold and StratifiedShuffleSplit in sklearn
The idea there is the same: with stratify set, train_test_split internally uses StratifiedShuffleSplit. A small comparison follows below.
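To see both behaviours side by side, a sketch with a made-up 99:1 label: the single stratified split and every StratifiedKFold test fold keep roughly the same 1% share of class 0:
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 990 + [0] * 10)   # 99% ones, 1% zeros

# one stratified 80/20 split
_, _, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
print(np.bincount(y_tr), np.bincount(y_te))  # about [  8 792] and [  2 198]

# StratifiedKFold: each of the 5 test folds keeps the same class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for _, test_idx in skf.split(X, y):
    print(np.bincount(y[test_idx]))          # about [  2 198] in every fold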
