I need to split my dataset into training and testing.
I need the last 20% of the values for testing and the first 80% for training.
I have currently used train_test_split(), but it picks the data randomly instead of taking the last 20%. How can I get the last 20% for testing and the first 80% for training?
My code is as follows:
numpy_array = df.as_matrix()  # note: as_matrix() is deprecated in newer pandas; df.to_numpy() is the equivalent
X = numpy_array[:, 1:26]
y = numpy_array[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # I do not want the data to be random.
Thanks
train_pct_index = int(0.8 * len(X))
X_train, X_test = X[:train_pct_index], X[train_pct_index:]
y_train, y_test = y[:train_pct_index], y[train_pct_index:]
It's one of those situations where it's simply better not to involve the sklearn helpers: the slicing is straightforward and readable, and it doesn't depend on internal options of the sklearn helpers, which readers of the code may not be familiar with.
I think this Stackoverflow topic answers your question :
How to get a non-shuffled train_test_split in sklearn
And especially this piece of text :
As of scikit-learn version 0.19, you can pass the parameter shuffle=False to train_test_split to obtain a non-shuffled split.
From the documentation :
shuffle : boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
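For example, applied to the split above (a minimal sketch, assuming X and y are defined as in the question):
from sklearn.model_selection import train_test_split

# shuffle=False keeps the original order: first 80% for training, last 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)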
Please tell me if I didn't understand your question correctly
Related
How to retrieve the random state of sklearn.model_selection.train_test_split?
Without setting the random_state, I split my dataset with train_test_split. Because the machine learning model trained on the split dataset performs quite well, I want to retrieve the random_state that was used to split the dataset. Is there something like numpy.random.get_state()?
If you trace through the call stack of train_test_split, you'll find the random_state parameter is used like this:
from sklearn.utils import check_random_state
rng = check_random_state(self.random_state)
print(rng)
The relevant part of check_random_state is
def check_random_state(seed):
    if seed is None or seed is np.random:
        return np.random.mtrand._rand
If random_state=None, you get the default numpy.random.RandomState singleton, which you can use to generate new random numbers, e.g.:
print(rng.permutation(10))
print(rng.randn(10))
See these questions for more information:
Difference between np.random.seed() and np.random.RandomState()
Consistently create same random numpy array
What do you mean?
If you want to know which random_state you are using, you have to pass random_state explicitly when calling the function, for example:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
By default it is set to None; see the docs.
Here is also further information on random_state.
Or do you mean this?
If you only have an old notebook showing a slice of one or more of the train/test subsets (e.g. X_test[0:5], y_train[-5:], etc.), but you know the other parameters of the train_test_split() call (e.g. test_size or train_size, shuffle, stratify) and can perfectly recreate X and y, you could try brute-forcing it: generate new splits with different random_state seeds, compare each split against your known subset slice, and record any random_state values that produce matching values (or values close enough that the differences could just be floating-point noise).
import numpy as np
from sklearn.model_selection import train_test_split

target_y_train = np.array([-5.482, -11.165, -13.926, -7.534, -8.323])
possible_random_state_values = []
for i in range(0, 1000):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=i)
    if all(np.isclose(y_train[0:5], target_y_train)):
        possible_random_state_values.append(i)
        print(f"Possible random state value found: {i}")
If you don't get any possible seeds from the 0-1000 range, increase the upper bound. Once you get candidate values, you can plug them into train_test_split(), compare any other subset slices you have, rerun your model training pipeline, and compare your output metrics.
I was given some starter code, but I'm not sure how to split it up when calling train_test_split (which I was explicitly told to use). Essentially, where does it come into play when I'm already given an X_train, Y_train, and X_test split?
The starter code looks like so:
import pandas as pd
from sklearn.svm import SVC

train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
print(train_df[train_df.isnull().any(axis=1)])
##SVM
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
I need to change the acc_svc variable to use X_test and Y_test, however. X_test is given to us, but how do I come up with a Y_test? I know the Y_test should correspond to labels, and I get a size mismatch when I attempt to create one. Should be a simple question; anyone mind pointing me in the right direction?
The test_preprocessed.csv shouldn't be used to check your model performance. Split your train_df into train and validation datasets using train_test_split() in scikit-learn. You have to check your model performance on the validation dataset, i.e. the y of the validation split. Please refer to the scikit-learn documentation.
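A minimal sketch of that validation split, reusing the train_df and "Survived" column from your starter code (the 80/20 ratio and random_state are only illustrative choices):
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = train_df.drop("Survived", axis=1)
y = train_df["Survived"]

# hold out 20% of the labelled training data as a validation set
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

svc = SVC()
svc.fit(X_tr, y_tr)
# accuracy on held-out validation data rather than on the data the model was fit on
acc_val = round(svc.score(X_val, y_val) * 100, 2)
print("validation accuracy is:", acc_val)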
First of all, you have to understand and clarify your target variable. Your "Y_test" is essentially what your existing "Y_pred" variable already represents, and it corresponds to the "Survived" label (in your test set). However, although you are dropping that label from the train set to build "X_train" so that you can use it as a target, you don't seem to do the same for "X_test", where you are instead dropping "PassengerId".
Another basic concept here is that your dataset is already split into train and test subsets (your CSV files). I assume that your test set already has one column less than the train set, and that the missing column is the "Survived" variable carried over from the train CSV file; otherwise, you should drop it to avoid the mismatch and keep it as your test target variable. You don't have to come up with a "Y_test": the result of "Y_pred = svc.predict(X_test)" gives you the predictions that play the role of "Y_test".
One possible reason you get a size mismatch is that the number of columns in your train set is not equal to that of the test set.
If you want to split into train/test subsets with scikit-learn, you would first merge your CSV files, then do the data analysis on the merged dataset, and finally do the split. One way to keep track of these changes and preserve the original sizes of the train and test parts is to keep the key-value pairs created when merging them, for example via pandas.concat with the "keys" parameter.
Incorporating the above, one recommended simple solution might be:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# reading csv files
train_df = pd.read_csv('./train_preprocessed.csv')
test_df = pd.read_csv('./test_preprocessed.csv')
# merge train and test sets
merged_data = pd.concat([train_df, test_df], keys=[0,1])
# data preprocessing can take place in the below assigned variable
# here also you could do feature engineering etc.
# e.g. check null values for all dataset
print(merged_data[merged_data.isnull().any(axis=1)])
# now you can eject the train and test sets, using the key-value pairs from the train-test merge
X_train = merged_data.xs(0)
X_test = merged_data.xs(1)
# setting up predictors - target
X= X_train.loc[:, X_train.columns!="Survived"]
y= X_train.loc[:, "Survived"]
# train-test split
# per the documentation, test_size defaults to 0.25 only when neither test_size nor train_size is given; here it is set explicitly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
##SVM
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, y_train) * 100, 2)
print("svm accuracy is:", acc_svc)
In my opinion, after understanding the above you could further estimate and compare your model's performance using the cross_val_score function, in the way #SunilG mentions. For example, for a 3-fold (cv=3) cross-validation, you could:
from sklearn.model_selection import cross_val_score
cross_val_score(svc, X_train, y_train.values, cv=3, scoring='accuracy')
If you do not want to proceed with the above and want to stay close to your starter code, then you should delete the 5th line of your code and I suppose it would run (provided your test set does not include your target variable; otherwise drop it). However, in that case you would not be able to split your train-test sets on your own, since they are already split, so the title of your main question/post should be altered.
I'm trying to apply the k-fold method, but I don't know how to access the training and testing sets generated. After going through several blogs and the scikit-learn user guide, the only thing people do is print the training and testing sets. This could work for a small dataframe, but it's not useful for larger dataframes. Can anyone help me?
The data I'm using: https://github.com/ageron/handson-ml/tree/master/datasets/housing
Where I'm currently at:
from sklearn.model_selection import KFold

X = housing[['total_rooms', 'total_bedrooms']]
y = housing['median_house_value']

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
But this is only useful to get the last dataset generated. I should be able to get all.
Thanks in advance.
AFAIK, KFold (and in fact everything related to the cross validation process) is meant to provide temporary datasets, so that one is able, as you say, to use them on the fly for fitting & evaluating models as shown in Cross-validation metrics in scikit-learn for each data split.
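For illustration, a minimal on-the-fly sketch of that pattern, reusing the X and y from your question (LinearRegression is just a placeholder model, and missing values are assumed to be handled already):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # fit and evaluate immediately; the subsets are never stored
    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))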
Nevertheless, since KFold.split() returns a Python generator, you can use the indices it yields to build permanent subsets, albeit with some manual work. Here is an example with the Boston data:
from sklearn.model_selection import KFold
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)
n_splits = 3
kf = KFold(n_splits=n_splits, shuffle=True)
folds = list(kf.split(X))  # one (train_indices, test_indices) pair per fold
Now, for every k in range(n_splits), folds[k][0] contains the training indices and folds[k][1] the corresponding validation indices, so you can do:
X_train_1 = X[folds[0][0]]
X_test_1 = X[folds[0][1]]
and so on. Notice that the same indices are applicable to the labels y too.
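Putting it together, a small sketch for collecting every fold as a permanent subset (rather than only the first) could look like this:
# one (X_train, X_test, y_train, y_test) tuple per fold
subsets = []
for train_idx, test_idx in folds:
    subsets.append((X[train_idx], X[test_idx], y[train_idx], y[test_idx]))

# e.g. the training features of the second fold
X_train_2 = subsets[1][0]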
I'm running RandomForestRegressor(). I'm using R-squared for scoring. Why do I get dramatically different results with .score versus cross_val_score? Here is the relevant code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

X = df.drop(['y_var'], axis=1)
y = df['y_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# Random Forest Regression
rfr = RandomForestRegressor()
model_rfr = rfr.fit(X_train,y_train)
pred_rfr = rfr.predict(X_test)
result_rfr = model_rfr.score(X_test, y_test)
# cross-validation
rfr_cv_r2 = cross_val_score(rfr, X, y, cv=5, scoring='r2')
I understand that cross-validation scores multiple times versus once for .score, but the results are so radically different that something is clearly wrong. Here are the results:
R2-dot-score: .99072
R2-cross-val: [0.5349302 0.65832268 0.52918704 0.74957719 0.45649582]
What am I doing wrong? Or what might explain this discrepancy?
EDIT:
OK, I may have solved this. It seems as if cross_val_score does not shuffle the data, which may be leading to worse predictions when data is grouped together. The easiest solution I found (via this answer) to this was to simply shuffle the dataframe before running the model:
shuffled_df = df.reindex(np.random.permutation(df.index))
After I did that, I started getting similar results between .score and cross_val_score:
R2-dot-score: 0.9910715555903232
R2-cross-val: [0.99265184 0.9923142 0.9922923 0.99259524 0.99195022]
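An alternative sketch, if you prefer not to reorder the dataframe, is to pass a shuffled KFold splitter to cross_val_score (the random_state here is only illustrative):
from sklearn.model_selection import KFold, cross_val_score

# shuffle inside the splitter instead of shuffling the dataframe
cv = KFold(n_splits=5, shuffle=True, random_state=42)
rfr_cv_r2 = cross_val_score(rfr, X, y, cv=cv, scoring='r2')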
I'm trying to learn scikit-learn and Machine Learning by using the Boston Housing Data Set.
# I split the initial dataset ('housing_X' and 'housing_y')
from sklearn.cross_validation import train_test_split  # in newer scikit-learn versions: sklearn.model_selection
X_train, X_test, y_train, y_test = train_test_split(housing_X, housing_y, test_size=0.25, random_state=33)
# I scaled those two datasets
from sklearn.preprocessing import StandardScaler
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)
X_train = scalerX.transform(X_train)
y_train = scalery.transform(y_train)
X_test = scalerX.transform(X_test)
y_test = scalery.transform(y_test)
# I created the model
from sklearn import linear_model
clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=42)
train_and_evaluate(clf_sgd,X_train,y_train)
Based on this new model clf_sgd, I am trying to predict the y based on the first instance of X_train.
X_new_scaled = X_train[0]
print (X_new_scaled)
y_new = clf_sgd.predict(X_new_scaled)
print (y_new)
However, the result is quite odd to me (1.34032174, instead of 20-30, the range of house prices).
[-0.32076092 0.35553428 -1.00966618 -0.28784917 0.87716097 1.28834383
0.4759489 -0.83034371 -0.47659648 -0.81061061 -2.49222645 0.35062335
-0.39859013]
[ 1.34032174]
I guess that this 1.34032174 value should be scaled back, but I am trying to figure out how to do it with no success. Any tip is welcome. Thank you very much.
You can use inverse_transform using your scalery object:
y_new_inverse = scalery.inverse_transform(y_new)
Bit late to the game:
Just don't scale your y. By scaling y you actually lose your units. The regression or loss optimization is really driven by the relative differences between the features. By the way, for house prices (or any other monetary value) it is common practice to take the logarithm; then you obviously need a numpy.exp() to get back to actual dollars/euros/yen...
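A minimal sketch of that log-transform idea, assuming y_train and y_test hold raw (unscaled) prices and using SGDRegressor purely as an example:
import numpy as np
from sklearn.linear_model import SGDRegressor

# train on log-prices instead of standardized prices
log_y_train = np.log(y_train)
clf = SGDRegressor(random_state=42)
clf.fit(X_train, log_y_train)

# predictions come back in log-space; exponentiate to return to actual prices
y_pred = np.exp(clf.predict(X_test))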