writing a train_test_split function with numpy - python

I am trying to write my own train/test split function using NumPy instead of using sklearn's train_test_split function. I am splitting the data into 70% training and 30% test. I am using the Boston housing data set from sklearn.
This is the shape of the data:
housing_features.shape #(506,13) where 506 is sample size and it has 13 features.
This is my code:
city_data = datasets.load_boston()
housing_prices = city_data.target
housing_features = city_data.data
def shuffle_split_data(X, y):
    split = np.random.rand(X.shape[0]) < 0.7

    X_Train = X[split]
    y_Train = y[split]
    X_Test = X[~split]
    y_Test = y[~split]

    print len(X_Train), len(y_Train), len(X_Test), len(y_Test)
    return X_Train, y_Train, X_Test, y_Test

try:
    X_train, y_train, X_test, y_test = shuffle_split_data(housing_features, housing_prices)
    print "Successful"
except:
    print "Fail"
The print output I got is:
362 362 144 144
"Successful"
But I know it was not successful, because I get different lengths each time I run it, whereas sklearn's train_test_split always gives 354 for the length of X_train.
#correct output
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_prices, test_size=0.3, random_state=42)
print len(X_train)
#354
What am I missing in my function?

Because you're using np.random.rand, which gives you uniform random numbers, comparing against 0.7 only gives you approximately a 70% split; the fraction only gets close to 70% for very large sample sizes. You could instead use np.percentile to find the value below which exactly 70% of those random numbers fall, and compare against that threshold as you did:
def shuffle_split_data(X, y):
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, 70)

    X_train = X[split]
    y_train = y[split]
    X_test = X[~split]
    y_test = y[~split]

    print len(X_train), len(y_train), len(X_test), len(y_test)
    return X_train, y_train, X_test, y_test
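Because the threshold now adapts to the random numbers actually drawn, the split sizes come out the same on every run. A quick check on the 506-row Boston data (a sketch using the corrected function above; only the sizes, not the exact rows, will match sklearn's split):
X_train, y_train, X_test, y_test = shuffle_split_data(housing_features, housing_prices)
# -> prints 354 354 152 152 on every run; sklearn's 70/30 split of 506 rows is also 354/152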
EDIT
Alternatively, you could use np.random.choice to select the desired number of indices. For your case (note replace=False, so no index is picked twice):
np.random.choice(X.shape[0], int(0.7 * X.shape[0]), replace=False)
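A minimal sketch of turning those indices into the four arrays (assuming X and y are NumPy arrays; np.setdiff1d gives the complementary test indices):
train_idx = np.random.choice(X.shape[0], int(0.7 * X.shape[0]), replace=False)
test_idx = np.setdiff1d(np.arange(X.shape[0]), train_idx)   # everything not in train_idx
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]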

Related

How to perform train-val-test (3 way) split on multi-label data

I'm trying to split a multi-label dataset into train, val and test datasets. I want to do something similar to
from skmultilearn.model_selection.iterative_stratification import IterativeStratification

def iterative_train_test_split(X, y, test_size):
    stratifier = IterativeStratification(
        n_splits=2, order=1, sample_distribution_per_fold=[test_size, 1 - test_size])
    train_indices, test_indices = next(stratifier.split(X, y))
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    return X_train, X_test, y_train, y_test
but with n_splits=3. When I try to set n_splits=3 I still only get 2 sets of indices out. Am I doing something wrong?
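One way this is often handled is to apply the 2-way split twice: first carve off the test set, then split the remainder into train and validation. Below is a sketch built on the iterative_train_test_split defined above; the wrapper name iterative_train_val_test_split and the rescaled val_size are my own naming, not part of skmultilearn:
def iterative_train_val_test_split(X, y, val_size, test_size):
    # first split: separate the test set from the rest
    X_rest, X_test, y_rest, y_test = iterative_train_test_split(X, y, test_size)
    # second split: carve the validation set out of the remainder,
    # rescaling val_size so it is a fraction of what is left
    rel_val_size = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = iterative_train_test_split(X_rest, y_rest, rel_val_size)
    return X_train, X_val, X_test, y_train, y_val, y_test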

How to split dataframe for scikit

I have a big dataframe; how can I divide it into 80% for training and 20% for testing?
Thanks!
I tried splitting it but it didn't work.
from sklearn.model_selection import train_test_split
X = ...  # define the X (feature) columns
y = ...  # define the y (target) column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
This gives X_train and y_train, which contain 80% of the data, and X_test and y_test, which contain the remaining 20%.
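If you would rather split the whole dataframe in one go instead of separating X and y first, train_test_split also accepts a single dataframe (a minimal sketch, assuming your dataframe is named df):
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))   # roughly 80% / 20% of the rows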

scikit-learn train_test_split() splitting data unexpectedly

I'm facing an issue where sklearn's train_test_split() divides the data unexpectedly for large data sets. When I load the entire 118 MB data set, the test set ends up with far fewer rows than the test_size should produce.
Case 1: 60K datapoints
#loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv',nrows=60000)
data.shape
y=data['project_is_approved'] #
X=data.drop(['project_is_approved'],axis=1)
X.shape,y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y,random_state=0)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output:
(40200, 8) (40200,)
(19800, 8) (19800,)
Case 2: 109,000 data points
#loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv')
print(data.shape)
y=data['project_is_approved'] #
X=data.drop(['project_is_approved'],axis=1)
X.shape,y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y,random_state=123)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output:
(109248, 9)
(90552, 8) (90552,)
(1460, 8) (1460,)
Anything more than 60K data points is being split abruptly, as in case 2, into 90K and 1.4K. I've tried changing the random state, removing the random state, and moving the data set to a new location, but the issue stays the same.
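One quick way to narrow this down is to check how many rows the CSV parser actually produced and whether the label column is intact before splitting, since train_test_split itself never drops rows. A sketch using the file and column names from the question:
import pandas
data = pandas.read_csv('preprocessed_data.csv')
print(data.shape)                                   # total rows actually parsed
print(data['project_is_approved'].isnull().sum())   # labels missing for stratify
print(data['project_is_approved'].value_counts())   # class balance seen by stratify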

ValueError: Expected 2D array, got 1D array instead: array=[0.31818181 0.40082645 0.49173555 ... 0.14049587 0.14876033 0.15289256]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.25, random_state=0)

from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross validation iterator
    cv = KFold(K, shuffle=True, random_state=0)
    # by default the score used is the one returned by score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(
        np.mean(scores), sem(scores)))

evaluate_cross_validation(svc_1, X_train, y_train, 5)

from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    clf.fit(X_train, y_train)
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    y_pred = clf.predict(X_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))

train_and_evaluate(svc_1, X_train, X_test, y_train, y_test)

random_image_button = Button(description="New image!")

def display_face_and_prediction(b):
    index = randint(0, 400)
    face = faces.images[index]
    display_face(face)
    print("this person is smiling: {0}".format(svc_1.predict(faces.data[index, :]) == 1))

random_image_button.on_click(display_face_and_prediction)
display(random_image_button)
display_face_and_prediction(0)
When I run the code beginning from random_image_button = Button(description="New image!"), it gives me the error below:
ValueError: Expected 2D array, got 1D array instead: array=[0.31818181
0.40082645 0.49173555 ... 0.14049587 0.14876033 0.15289256]. Reshape your data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample.
How can I fix this?
Your code has a problem here:
def display_face_and_prediction(b):
    index = randint(0, 400)
    face = faces.images[index]
    display_face(face)
    print("this person is smiling: {0}".format(svc_1.predict(faces.data[index, :]) == 1))
Your model needs to be fed a 2D array for predict, yet you pass it faces.data[index, :], which is 1D.
You can reshape faces.data[index, :] into a 2D array.
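Concretely, that could look like this (a sketch; reshape(1, -1) turns the single sample into a (1, n_features) array, which is the shape predict expects for one sample):
sample = faces.data[index, :].reshape(1, -1)   # shape (1, n_features) instead of (n_features,)
print("this person is smiling: {0}".format(svc_1.predict(sample) == 1))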

sklearn loocv.split returning a smaller test and train array than expected

As I have a small dataset, I'm using LOOCV (leave-one-out cross-validation) in sklearn.
When I ran my classifier I received the following error:
"Number of labels=41 does not match number of samples=42".
I generated the test and training sets using the following code:
otu_trans = test_train.transpose()
# transpose otu table

merged = pd.concat([otu_trans, metadata[status]], axis=1, join='inner')
# merge phenotype column from metadata file with transposed otu table

X = merged.drop([status], axis=1)
# drop status from X

y = merged[status]

encoder = LabelEncoder()
y = pd.Series(encoder.fit_transform(y),
              index=y.index, name=y.name)
# convert T and TF labels to 0 and 1 respectively

loocv = LeaveOneOut()
loocv.get_n_splits(X)

for train_index, test_index in loocv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)
When I check the shapes of X_train and X_test, they are (42, 41) rather than the (41, 257) I believe they should be, so it appears the data is being partitioned along the wrong axis.
Can anyone explain to me why this is happening?
Thank you
First of all, the initial matrix X will not be affected at all.
It is only used to produce the split indices; its shape always stays the same.
Now, here is a simple example using LOOCV splitting:
import numpy as np
from sklearn.model_selection import LeaveOneOut
# I produce fake data with same dimensions as yours.
#fake data
X = np.random.rand(41,257)
#fake labels
y = np.random.rand(41)
#Now check that the shapes are correct:
X.shape
y.shape
This will give you:
(41, 257)
(41,)
Now the splitting:
loocv = LeaveOneOut()
loocv.get_n_splits(X)

for train_index, test_index in loocv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # classifier.fit(X_train, y_train)
    # classifier.predict(X_test)
X_train.shape
X_test.shape
This prints:
(40, 257)
(1, 257)
As you can see, X_train contains 40 samples and X_test contains only 1 sample. This is correct, since we use LOOCV splitting.
The fake X matrix here has 41 samples in total, so each split uses 40 for training and 1 for testing.
The loop produces many X_train and X_test matrices: specifically, N of them, where N = number of samples (in our case, N = 41).
N is equal to loocv.get_n_splits(X).
Hope this helps
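One additional detail that may matter here: the example above uses NumPy arrays, but X in the question is a pandas DataFrame (built with pd.concat), and indexing a DataFrame with a list of integers selects columns by label rather than rows by position. If X and y stay as pandas objects, positional row selection should go through .iloc; a sketch of the loop adjusted under that assumption:
for train_index, test_index in loocv.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]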
