Splitting test/training data for scikit?

Splitting test/training data for scikit? - python

I was given some starter code, but I'm not sure how to split it up when calling train_test_split (which I was explicitly told to use). Essentially, where does it come into play when I'm already given an X_train, Y_train, and X_test split?
The starter code looks like so:
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
print(train_df[train_df.isnull().any(axis=1)])
##SVM
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
I need to change the acc_svc variable to be using X_test and Y_test, however. X_test is given to us, but how do I come up with a Y_test? I know the Y_test should correspond to labels, and I'm having some size mismatching going on when I attempt to do so. Should be a simple question, anyone mind pointing me in the right direction?

The test_preprocessed.csv shouldn't be used to check your model performance. Split your train_df using train_test_split() in scikit-learn into train and validation datasets. You have to check your model performance on validation dataset i.e. y of validation. Please refer to: scikit-learn documentation

First of all, you have to understand and clarify your target variable. Your "Y_test" seems to be your already existed "Y_pred" variable, which seems to correspond to the "Survived" label (in your test set). However, although you are dropping it from the "X_train" so that you can use it as a target, you don't seem to do the same in the "Y_train", where instead you are dropping "PassengerId".
Another basic concept here is that your dataset is already split into train-test subsets (your CSV files). I assume that your test set has already one less column compared to the train set, and that should be the "Survived" variable as a continuation from the train CSV file. Otherwise, you should drop it to avoid mismatching and keep that as your test target variable. You don't have to come up with a "Y_test", the result from your equation "Y_pred = svc.predict(X_test)" will give you the "Y_test" which would be the result of the "Y_pred".
One possible reason you get size mismatching is that the number of columns (x-axis) in your train set is not equal with that of the test set.
If you want to split into train/test subsets based on Scikit-learn you would first merge your CSV files, then do the data analysis in the merged dataset, and finally, do the split. One way to keep track of these changes and maintain the same original size of the train-test split could be to keep key-value pairs originated from the train-test merge. One way to do that could be via the pandas.concat, using the parameter "keys".
Incorporating the above, one recommended simple solution might be:
# reading csv files
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
# merge train and test sets
merged_data = pd.concat([train_df, test_df], keys=[0,1])
# data preprocessing can take place in the below assigned variable
# here also you could do feature engineering etc.
# e.g. check null values for all dataset
print(merged_data[merged_data.isnull().any(axis=1)])
# now you can eject the train and test sets, using the key-value pairs from the train-test merge
X_train = merged_data.xs(0)
X_test = merged_data.xs(1)
# setting up predictors - target
X= X_train.loc[:, X_train.columns!="Survived"]
y= X_train.loc[:, "Survived"]
# train-test split
# If train_size is None, it will be set to 0.25 based on the documentation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
##SVM
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
In my opinion, after understanding the above you could further estimate and compare your model's performance using the cross_val_score function, in a way #SunilG mentions. For e.g. a 3-fold (CV=3) cross validation, you could:
from sklearn.model_selection import cross_val_score
cross_val_score(svc, X_train, y_train.values, cv=3, scoring='accuracy')
If you do not want to proceed to the above and you want to be close to your starter code, then you should delete your 5th line of code and I suppose it would run (if your test set does not include your target variable, otherwise drop it). However in this case you would not be able to split your train-test on your own, since it is already split, hence the title of your main question/post should be altered.

Related

Adding predict_disable_shape_check=true as a parameter on LightGBM

In short, my initial df has a column that has probabilities from an external predictive model that I would like to compare to the predictions generated from my lightGBM model. First I used the train test split on my data, which included my column old_predictions
X = A, B, C, old_predictions
Y = outcome
seed=47
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2,random_state=seed)
I do not, however, want old_predictions to be included as a feature in my lightGBM model, so I made a separate df from the X_test data (which I will later append the light GBM prediction probabilities to) and dropped the old_predictions from the X_test and X_train
pred_df = X_test
X_test.drop(['old_predictions'], axis = 1, inplace = True)
X_train.drop(['old_predictions'], axis = 1, inplace = True)
When I try to train my model, however, I receive the following error:
LightGBMError: The number of features in data (4) is not the same as it was in training data (3).
You can set ``predict_disable_shape_check=true`` to discard this error, but please be aware what you are doing.
The two questions I have are
using the logic I described about why I dropped this variable, would you agree that it is indeed fine to disregard this error?
Where do I add predict_disable_shape_check=true to disregard the error? I have tried the below, but none have been successful and the same error reappears. I tried reading the docs but am having trouble finding clarity
model = lgb.LGBMClassifier(**parameters1, predict_disable_shape_check=True)
y_pred=model.predict(X_test, predict_disable_shape_check=True)
predictions = model.predict_proba(X_test, predict_disable_shape_check=True)[:, 1]
predictions_train = model.predict_proba(X_train, predict_disable_shape_check=True)[:, 1]
I have also added it directly to the parameters list and this did not work either.

This should work.
y_pred=model.predict(X_test, predict_disable_shape_check=True)
Make sure X_test contains only the columns which were used to train the model.

xgboost - feature mismatch when I predict on my test data

I"m using xgboost to train some data and then I want to score it on a test set.
My data is a combination of categorical and numeric variables, so I used pd.get_dummies to dummy all my categorical variables. training is fine, but the problem happens when I score the model on the test set.
I get an error of "feature_names_mismatch" and it lists the columns that are missing. My dataset is already in a matrix (numpy array format).
the mismatch in feature name is valid since some dummy-categories may not be present in the test set. So if this happens, is there a way for the model to still work?

If I understood your problem correctly; you have some categorical values which appears in train set but not in test set. This usually happens when you create dummy variables (converting categorical features using one hot coding etc) separately for train and test instead of doing it on entire dataset. Following code can help
for col in featurs_object:
X[col]=pd.Categorical(X[col],categories=df[col].dropna().unique())
X_col = pd.get_dummies(X[col])
X = X.drop(col,axis=1)
X_col.columns = X_col.columns.tolist()
frames = [X_col, X]
X = pd.concat(frames,axis=1)
X = pd.concat([X,df_continous],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.3,
random_state = 1)
featurs_object : consists of all categorical columns which you want to include for model building.
df : your entire dataset (post cleanup)
df_continous : Subset of df, with only continuous features.

Train, Test, Validate split Python. Three sets

Someone presented a solution to split a dataset into three sets. I wonder where is the label in this case. Or how to set the labels then.
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

I will answer the question based on comments:
Using this method for splitting:
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
You are getting 3 different objects, which consist of the first 60% of data from df for train, the data corresponding to the interval between 60% and 80% for validate and the last 20% corresponding to 80%-100% in test. The labels are within these dataframes.
In train_test_split you are passing two objects, X and Y, which have been most likely previously split from an original dataset and getting in return 4 objects, 2 corresponding to train and two corresponding to test. Keep in mind this: You are first splitting your dataset into independent variables and explained/target variable, and then splitting these two objects into train and test.
With np.split you are going the otherway around, you are first splitting your dataset into 3 objects, train, validate and test which will later need to be split individually into independent variables commonly known as X and target variable known as Y. You are doing the same splits, just in reverse order.
However, keep in mind that by passing the indexes for np.split it means the splitting is not performed randomly, whereas with train_test_split you get a random train and test subesets. np.split on the other hand, allows for more flexibility, for instance, as you prove with your example, creating more than 2 subsets.
Maybe this will help!

Try this. Feed the output of one of the train_test_split into a second one as input
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_test, X_validate, y_test, y_validate = train_test_split(X_test, y_test, test_size=0.5)
The function randomly splits 2 arrays into 4 arrays, and test_size determines the size of the split allocated to the test output vs train. The y input is meant to be a target for building a machine learning model and X is meant to be the features for the model. If you want them combined, then just concat the equivalent X and y outputs.

How to split the data set without train_test_split()?

I need to split my dataset into training and testing.
I need the last 20% of the values for testing and the first 80% for training.
I have currently used the 'train_test_split()' but it picks the data randomly instead of the last 20%. How can I get the last 20% for testing and the first 80% for training?
My code is as follows:
numpy_array = df.as_matrix()
X = numpy_array[:, 1:26]
y = numpy_array[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=20) #I do not want the data to be random.
Thanks

train_pct_index = int(0.8 * len(X))
X_train, X_test = X[:train_pct_index], X[train_pct_index:]
y_train, y_test = y[:train_pct_index], y[train_pct_index:]
It's one of those situations where it's just better not to involve sklearn helpers. Very straightforward, readable, and not dependent on knowing internal options of sklearn helpers, which code readers may not have experience with.

I think this Stackoverflow topic answers your question :
How to get a non-shuffled train_test_split in sklearn
And especially this piece of text :
in scikit-learn version 0.19, you can pass the parameter shuffle=False to train_test_split to obtain a non-shuffled split.
From the documentation :
shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then >stratify must be None.
Please tell me if I didn't understand your question correctly

Why should we perform a Kfold cross validation on test set??

I was working on a knearest neighbours problem set. I couldn't understand why are they performing K fold cross validation on test set?? Cant we directly test how well our best parameter K performed on the entire test data? rather than doing a cross validation?
iris = sklearn.datasets.load_iris()
X = iris.data
Y = iris.target
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
X, Y, test_size=0.33, random_state=42)
k = np.arange(20)+1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
def computeTestScores(test_x, test_y, clf, cv):
kFolds = sklearn.cross_validation.KFold(test_x.shape[0], n_folds=cv)
scores = []
for _, test_index in kFolds:
test_data = test_x[test_index]
test_labels = test_y[test_index]
scores.append(sklearn.metrics.accuracy_score(test_labels, clf.predict(test_data)))
return scores
scores = computeTestScores(test_x = X_test, test_y = Y_test, clf=clf, cv=5)

TL;DR
Did you ever have a science teacher who said, 'any measurement without error bounds is meaningless?'
You might worry that the score on using your fitted, hyperparameter optimized, estimator on your test set is a fluke. By doing a number of tests on a randomly chosen subsample of the test set you get a range of scores; you can report their mean and standard deviation etc. This is, hopefully, a better proxy for how the estimator will perform on new data from the wild.
The following conceptual model may not apply to all estimators but it is a useful to bear in mind. You end up needing 3 subsets of your data. You can skip to the final paragraph if the numbered points are things you are already happy with.
Training your estimator will fit some internal parameters that you need not ever see directly. You optimize these by training on the training set.
Most estimators also have hyperparameters (number of neighbours, alpha for Ridge, ...). Hyperparameters also need to be optimized. You need to fit them to a different subset of your data; call it the validation set.
Finally, when you are happy with the fit of both the estimator's internal parameters and the hyperparmeters, you want to see how well the fitted estimator predicts on new data. You need a final subset (the test set) of your data to figure out how well the training and hyperparameter optimization went.
In lots of cases the partitioning your data into 3 means you don't have enough samples in each subset. One way around this is to randomly split the training set a number of times, fit hyperparameters and aggregate the results. This also helps stop your hyperparameters being over-fit to a particular validation set. K-fold cross-validation is one strategy.
Another use for this splitting a data set at random is to get a range of results for how your final estimator did. By splitting the test set and computing the score you get a range of answers to 'how might we do on new data'. The hope is that this is more representative of what you might see as real-world novel data performance. You can also get a standard deviation for you final score. This appears to be what the Harvard cs109 gist is doing.

If you make a program that adapts to input, then it will be optimal for the input you adapted it to.
This leads to a problem known as overfitting.
In order to see if you have made a good or a bad model, you need to test it on some other data that is not what you used to make the model. This is why you separate your data into 2 parts.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.