What is the difference between stratify and StratifiedKFold in Python scikit-learn?


My data consists of 99% target variable = 1 and 1% target variable = 0. Does stratify guarantee that the train and test sets have an equal ratio of the target variable, that is, that they contain equal amounts of 1 and 0?
Please see the code below for clarification:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,stratify=y,random_state=42)

Stratification just returns portions of the data that keep the class proportions of the whole dataset, shuffled or not depending on the arguments you pass. Say your dataset has 100 instances, 90 of class 1 and 10 of class 0, and you decide on a stratified 70:30 split. You get 63 class-1 and 7 class-0 instances in the training set, and 27 class-1 and 3 class-0 instances in the test set. Clearly, neither set is balanced. A classifier trained on such data is likely to be heavily biased toward class 1 and may do little better than a dummy classifier that predicts class 1 for every input.
A better approach would be to collect more class-0 data, to oversample the dataset to artificially generate more class-0 instances, or to undersample it to get fewer class-1 instances. imblearn (the imbalanced-learn package) is a Python library that can help with that.
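To make the point about proportions concrete, here is a minimal sketch on synthetic data (the 990/10 class counts and the placeholder feature column are assumptions chosen to mirror the 99%/1% case in the question). It checks the share of the minority class in each subset after a stratified split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 990 samples of class 1 and 10 of class 0 (99% / 1%).
y = np.array([1] * 990 + [0] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Share of class 0 in each subset; stratification keeps both close to 1%,
# it does not make the classes equally frequent.
train_minority = np.mean(y_train == 0)
test_minority = np.mean(y_test == 0)
print(f"class 0 share in train: {train_minority:.3f}, in test: {test_minority:.3f}")
```

Both shares stay near the original 1%, so the split is representative but still imbalanced.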

The first difference is that train_test_split(X, y, test_size=0.2, stratify=y) splits the data only once, with 80% in train and 20% in test.
StratifiedKFold(n_splits=2), by contrast, splits the data into two halves and uses each half in turn as the test set, so every split is 50% train and 50% test.
The second is that you can set n_splits higher than 2 to get k-fold cross-validation, in which the data is split n_splits times, each time with a different fold held out as the test set. So there are multiple divisions of the data into train and test.
For more information about K-fold, see this question:
difference between StratifiedKFold and StratifiedShuffleSplit in sklearn
The idea is the same there. When stratify is given, train_test_split internally uses StratifiedShuffleSplit.
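As a rough illustration of the multiple divisions that StratifiedKFold produces, the sketch below uses a made-up dataset of 90 class-1 and 10 class-0 samples (an assumption, not data from the question) and records the size and minority count of every fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 90 samples of class 1 and 10 of class 0.
y = np.array([1] * 90 + [0] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_summary = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Count how many class-0 samples landed in this fold's test part.
    n_minority = int(np.sum(y[test_idx] == 0))
    fold_summary.append((len(train_idx), len(test_idx), n_minority))
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}, "
          f"class 0 in test={n_minority}")
```

Each sample appears in exactly one test fold, which is the main difference from the single split returned by train_test_split.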

Related

how can I train test split in scikit learn [duplicate]

Does anyone know what the problem is?
import numpy as np
x=np.linspace(-3,3,100)
rng=np.random.RandomState(42)
y=np.sin(4*x)+x+rng.uniform(size=len(x))
X=x[:,np.newaxis]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.25,random_state=42,stratify=y)
I have this error:
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

The stratify=y argument in train_test_split is causing the error. Stratification only makes sense when the labels are discrete classes with repeated values. For example, say your label column has the values 0 and 1. Then passing stratify=y preserves the original proportion of the labels in your training and test samples: if you had 60% 1s and 40% 0s, both samples will have roughly the same proportion. Your y is continuous, so almost every value is unique and is treated as its own class with a single member, which cannot be divided between train and test.

Try removing stratify=y; you don't need it here.
Also, have a peek here.
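If some form of stratification is still wanted for a continuous target, one possible workaround (not something the answers above propose) is to stratify on a binned copy of y. The sketch below reuses the data from the question; the choice of five quantile bins and the name y_binned are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same data as in the question: a continuous regression target.
x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))
X = x[:, np.newaxis]

# Assumption: five quantile bins give a coarse enough grouping.
# np.digitize maps each y value to the index of the bin it falls in.
bin_edges = np.quantile(y, [0.2, 0.4, 0.6, 0.8])
y_binned = np.digitize(y, bin_edges)

# Stratify on the bin labels, but keep the original continuous y as target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y_binned
)
print(X_train.shape, X_test.shape)
```

The bins are used only to guide the split, so the model still trains on the unmodified continuous target.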

From the documentation:
3.1.2.2. Cross-validation iterators with stratification based on class labels.
Some classification problems can exhibit a large imbalance in the
distribution of the target classes: for instance there could be
several times more negative samples than positive samples. In such
cases it is recommended to use stratified sampling as implemented in
StratifiedKFold and StratifiedShuffleSplit to ensure that relative
class frequencies is approximately preserved in each train and
validation fold.

Splitting test/training data for scikit?

I was given some starter code, but I'm not sure how to split it up when calling train_test_split (which I was explicitly told to use). Essentially, where does it come into play when I'm already given an X_train, Y_train, and X_test split?
The starter code looks like so:
import pandas as pd
from sklearn.svm import SVC
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
print(train_df[train_df.isnull().any(axis=1)])
##SVM
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
I need to change the acc_svc variable to be using X_test and Y_test, however. X_test is given to us, but how do I come up with a Y_test? I know the Y_test should correspond to labels, and I'm having some size mismatching going on when I attempt to do so. Should be a simple question, anyone mind pointing me in the right direction?

The test_preprocessed.csv file shouldn't be used to check your model's performance. Split train_df into training and validation sets with train_test_split() from scikit-learn, and check your model's performance on the validation set, i.e. against the validation y. Please refer to the scikit-learn documentation.
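A minimal sketch of that suggestion follows. Because the CSV files from the question are not available here, make_classification generates stand-in data; the names X_tr, X_val, y_tr and y_val are hypothetical, and with the real data X and y would come from train_df instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for the labelled training data (assumption: binary target,
# numeric features). Replace with the columns of train_df in practice.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 25% of the labelled data as a validation set.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

svc = SVC()
svc.fit(X_tr, y_tr)

# Accuracy on data the model has seen versus data it has not seen.
train_acc = svc.score(X_tr, y_tr)
val_acc = svc.score(X_val, y_val)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

The validation accuracy is the number to report, since the labels of the validation rows were never shown to the model during fitting.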

First of all, you have to be clear about your target variable. The "Y_test" you are looking for would hold the "Survived" labels of the test set, which is what your existing "Y_pred" predicts. However, although you drop "Survived" from "X_train" so that you can use it as the target, you don't do the same for "X_test", where you drop "PassengerId" instead.
Another basic point is that your dataset is already split into train and test subsets (your CSV files). I assume the test set has one column fewer than the train set, and that the missing column is "Survived". Otherwise, you should drop it from the test features to avoid a mismatch and keep it as your test target variable. You don't have to come up with a "Y_test": "Y_pred = svc.predict(X_test)" gives you the predicted labels for the test set, which take the place of "Y_test" here.
One possible reason for the size mismatch is that the number of columns in your train set is not equal to that of the test set.
If you want to do the train/test split yourself with scikit-learn, you would first merge your CSV files, then do the data analysis on the merged dataset, and finally do the split. One way to keep track of which rows came from which file, and so recover the original train and test sets, is to keep keys from the merge, for example via pandas.concat with the "keys" parameter.
Incorporating the above, one recommended simple solution might be:
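import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# reading csv files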
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
# merge train and test sets
merged_data = pd.concat([train_df, test_df], keys=[0,1])
# data preprocessing can take place in the below assigned variable
# here also you could do feature engineering etc.
# e.g. check null values for all dataset
print(merged_data[merged_data.isnull().any(axis=1)])
# now you can extract the train and test sets again, using the keys from the train-test merge
X_train = merged_data.xs(0)
X_test = merged_data.xs(1)
# setting up predictors - target
X= X_train.loc[:, X_train.columns!="Survived"]
y= X_train.loc[:, "Survived"]
# train-test split
# If test_size and train_size are both None, test_size defaults to 0.25 based on the documentation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
##SVM
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
# score on the held-out split rather than on the data used for fitting
acc_svc = round(svc.score(X_test, y_test) * 100, 2)
print("svm accuracy is:", acc_svc)
In my opinion, after understanding the above, you could further estimate and compare your model's performance using the cross_val_score function, in the way @SunilG mentions. For example, for a 3-fold (cv=3) cross-validation, you could do:
from sklearn.model_selection import cross_val_score
cross_val_score(svc, X_train, y_train.values, cv=3, scoring='accuracy')
If you do not want to go that way and prefer to stay close to your starter code, then you should delete your 5th line of code and I suppose it would run (if your test set does not include your target variable; otherwise drop it). However, in this case you would not be splitting into train and test on your own, since the data is already split, so the title of your question should be changed.

Train, Test, Validate split Python. Three sets

Someone presented a solution to split a dataset into three sets. I wonder where the labels are in this case, or how to set them.
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

I will answer the question based on comments:
Using this method for splitting:
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
You get 3 different objects: train holds the first 60% of the shuffled rows of df, validate holds the rows between 60% and 80%, and test holds the last 20% (80% to 100%). The labels are within these dataframes.
With train_test_split you pass two objects, X and Y, which have most likely been separated from an original dataset beforehand, and you get 4 objects in return, 2 corresponding to train and 2 corresponding to test. Keep this in mind: you first split your dataset into independent variables and the target variable, and then split these two objects into train and test.
With np.split you go the other way around: you first split your dataset into 3 objects, train, validate and test, each of which later needs to be split into independent variables, commonly known as X, and the target variable, known as Y. You are doing the same splits, just in reverse order.
However, keep in mind that np.split itself cuts at the fixed positions you pass, so it is not random on its own; the randomness in this example comes from shuffling with df.sample(frac=1) first, whereas train_test_split shuffles for you by default. np.split, on the other hand, allows for more flexibility, for instance, as your example shows, creating more than 2 subsets.
Maybe this will help!
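The "reverse order" described above might look like the following sketch. The dataframe, its column names and the helper split_xy are all hypothetical, and plain iloc slicing stands in for np.split here (same 60%/80% cut points) to keep the example independent of how a given NumPy/pandas combination handles np.split on a dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: two feature columns and a binary "label" column.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "label": rng.randint(0, 2, size=100),
})

# Step 1: shuffle the rows, then cut at 60% and 80% of the length.
shuffled = df.sample(frac=1, random_state=42)
n = len(shuffled)
train = shuffled.iloc[: int(0.6 * n)]
validate = shuffled.iloc[int(0.6 * n): int(0.8 * n)]
test = shuffled.iloc[int(0.8 * n):]

def split_xy(frame, target="label"):
    """Separate the feature columns from the target column."""
    return frame.drop(columns=target), frame[target]

# Step 2: separate X and y inside each of the three subsets.
X_train, y_train = split_xy(train)
X_validate, y_validate = split_xy(validate)
X_test, y_test = split_xy(test)
print(X_train.shape, X_validate.shape, X_test.shape)
```

The labels travel with their rows through the shuffle and the cuts, so each X stays aligned with its y.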

Try this. Feed the output of one of the train_test_split into a second one as input
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_test, X_validate, y_test, y_validate = train_test_split(X_test, y_test, test_size=0.5)
The function randomly splits 2 arrays into 4 arrays, and test_size determines the share of the data allocated to the test output versus train. With test_size=0.4 followed by test_size=0.5, you end up with 60% train, 20% test and 20% validate. The y input is meant to be the target for building a machine learning model and X is meant to be the features for the model. If you want them combined, just concatenate the matching X and y outputs.

Is train_test_split necessary for binary classification? And why are there 4 outcomes?

Why are there 4 outcomes to train_test_split in sklearn? Why is there y_test, if the testing data has no y_data?

The reason you get 4 outcomes is that you get train_features, test_features, train_labels and test_labels (X_train, X_test, y_train, y_test). So it splits not just the dataset into train and test sets, but also the labels (2 + 2 = 4 outcomes).

Looking into the documentation, you can see that the first parameter is *arrays, which means you can pass as many arrays as you want there. Now, what does it return?
Returns: splitting : list, length=2 * len(arrays)
This means it returns twice as many arrays as were passed to train_test_split.
So, if you already have a training and a testing set, it only makes sense to split the training set, so you can have a validation set to check the model performance.
Eg.:
train_data, validation_data, train_label, validation_label= train_test_split(original_train_data, original_train_label)
Note that you must also split the labels if you have the data and the labels in separate vectors.
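A small sketch of the "2 * len(arrays)" rule: passing three arrays returns six. The toy arrays and the third array sample_ids are hypothetical and exist only to show that all outputs stay row-aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X is [2*i, 2*i + 1], and its id is 100 + i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2
sample_ids = np.arange(100, 110)  # hypothetical extra array

# Three input arrays give 2 * 3 = 6 output arrays.
result = train_test_split(X, y, sample_ids, test_size=0.2, random_state=0)
print(len(result))

X_train, X_test, y_train, y_test, ids_train, ids_test = result

# Every array is split with the same shuffled indices, so rows stay matched.
print(np.array_equal(X_train[:, 0] // 2 + 100, ids_train))
```

With only X and y passed in, the same rule gives the familiar four outputs.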

Because you have split your original data into train and test parts, there are four outcomes.
1 (X_train, Y_train) where X_train are the training points and Y_train are their respective class labels. This is your training data, which is used to train your model with any classical model such as K-NN, logistic regression or decision trees.
2 (X_test, Y_test) where X_test represents your test data points and Y_test are the respective class labels for these test points. Once you have trained your model and calculated your training error/accuracy, you can use these points to see whether the trained model predicts the data correctly or not. The lower the difference between your training and test error, the better.
That is why you get 4 outcomes, in 2 pairs.
Hope this helps.

Why should we perform a Kfold cross validation on test set??

I was working on a k-nearest neighbours problem set. I couldn't understand why they are performing K-fold cross-validation on the test set. Can't we directly test how well our best parameter K performs on the entire test data, rather than doing a cross-validation?
import numpy as np
from sklearn import datasets, metrics, neighbors
# sklearn.cross_validation and sklearn.grid_search were removed from
# scikit-learn; their contents now live in sklearn.model_selection
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
iris = datasets.load_iris()
X = iris.data
Y = iris.target
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42)
k = np.arange(20)+1
parameters = {'n_neighbors': k}
knn = neighbors.KNeighborsClassifier()
# tune n_neighbors with 10-fold cross-validation on the training set
clf = GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
def computeTestScores(test_x, test_y, clf, cv):
    # score the already fitted estimator on each of cv folds of the test set
    kFolds = KFold(n_splits=cv)
    scores = []
    for _, test_index in kFolds.split(test_x):
        test_data = test_x[test_index]
        test_labels = test_y[test_index]
        scores.append(metrics.accuracy_score(test_labels, clf.predict(test_data)))
    return scores
scores = computeTestScores(test_x = X_test, test_y = Y_test, clf=clf, cv=5)

TL;DR
Did you ever have a science teacher who said, 'any measurement without error bounds is meaningless?'
You might worry that the score from using your fitted, hyperparameter-optimized estimator on your test set is a fluke. By doing a number of tests on randomly chosen subsamples of the test set you get a range of scores; you can report their mean, standard deviation, etc. This is, hopefully, a better proxy for how the estimator will perform on new data from the wild.
The following conceptual model may not apply to all estimators, but it is useful to bear in mind. You end up needing 3 subsets of your data. You can skip to the final paragraph if the numbered points are things you are already happy with.
1. Training your estimator will fit some internal parameters that you need not ever see directly. You optimize these by training on the training set.
2. Most estimators also have hyperparameters (number of neighbours, alpha for Ridge, ...). Hyperparameters also need to be optimized. You need to fit them to a different subset of your data; call it the validation set.
3. Finally, when you are happy with the fit of both the estimator's internal parameters and the hyperparameters, you want to see how well the fitted estimator predicts on new data. You need a final subset (the test set) of your data to figure out how well the training and hyperparameter optimization went.
In lots of cases, partitioning your data into 3 means you don't have enough samples in each subset. One way around this is to randomly split the training set a number of times, fit hyperparameters and aggregate the results. This also helps stop your hyperparameters being over-fit to a particular validation set. K-fold cross-validation is one strategy.
Another use for splitting a data set at random is to get a range of results for how your final estimator did. By splitting the test set and computing the score on each part you get a range of answers to 'how might we do on new data'. The hope is that this is more representative of what you might see as real-world novel data performance. You can also get a standard deviation for your final score. This appears to be what the Harvard cs109 gist is doing.
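As a sketch of reporting a score with a spread, the snippet below scores one fitted model on five disjoint chunks of the test set and prints the mean and standard deviation. The fixed n_neighbors=5 is an assumption standing in for whatever value hyperparameter tuning would have selected.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Assumption: n_neighbors=5 stands in for the tuned hyperparameter value.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Score the already fitted model on 5 disjoint chunks of the test set.
# Only the test indices of each fold are used; nothing is refitted.
scores = [
    accuracy_score(y_test[idx], model.predict(X_test[idx]))
    for _, idx in KFold(n_splits=5).split(X_test)
]
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

With only ten test samples per chunk the spread is coarse, so it is best read as a rough error bar rather than a precise interval.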

If you make a program that adapts to input, then it will be optimal for the input you adapted it to.
This leads to a problem known as overfitting.
In order to see if you have made a good or a bad model, you need to test it on some other data that is not what you used to make the model. This is why you separate your data into 2 parts.
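A rough way to see this effect is to compare accuracy on the data used for fitting with accuracy on held-out data. In the sketch below the noisy synthetic dataset and the unrestricted decision tree are assumptions, picked because such a tree can memorize its training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels flipped at random (label noise).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A tree with no depth limit keeps splitting until it fits every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The gap between the two numbers is the part of the training score that came from adapting to the training input rather than from learning something that generalizes.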
