I'm facing an issue where sklearn's train_test_split() divides the data abruptly for large data sets. I'm trying to load the entire data set of 118 MB, and it assigns less than a tenth of the test data that the code should produce.
Case 1: 60K datapoints
#loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv',nrows=60000)
data.shape
y=data['project_is_approved']
X=data.drop(['project_is_approved'],axis=1)
X.shape,y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y,random_state=0)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output:
(40200, 8) (40200,)
(19800, 8) (19800,)
Case 2: 109,000 data points
#loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv')
print(data.shape)
y=data['project_is_approved']
X=data.drop(['project_is_approved'],axis=1)
X.shape,y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y,random_state=123)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output:
(109248, 9)
(90552, 8) (90552,)
(1460, 8) (1460,)
Anything more than 60K data points is being split abruptly, as in case 2, into 90K and 1.4K. I've tried changing the random state, removing the random state, and moving the data set to a new location, but the issue stays the same.
Related
I have a big dataframe; how can I divide it into 80% for training and 20% for testing?
Thanks!
I tried train_test_split, but it didn't work:
from sklearn.model_selection import train_test_split
X = #define X columns
y = #define y columns (target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
This gives X_train and y_train, which contain 80% of the data, and X_test and y_test, which contain the remaining 20%.
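As a concrete illustration, here is a minimal runnable sketch; the file name data.csv and the column name target are placeholders for your own dataframe and target column:
import pandas as pd
from sklearn.model_selection import train_test_split

# placeholder file and column names; substitute your own
df = pd.read_csv('data.csv')
X = df.drop(['target'], axis=1)  # feature columns
y = df['target']                 # target column

# 80% train / 20% test; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # roughly 80% and 20% of the rows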
I'm trying to split my data into training, validation, and test sets using fast_ml for a machine learning task. Both my input and output data are read from .npy files with np.load. The input P is an array of shape (100000, 4, 4, 6, 1) and the target Q is a vector of shape (100000,). I use the code below:
from fast_ml.model_development import train_valid_test_split
X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(P, Q,
train_size=0.8,
valid_size=0.1,
test_size=0.1)
However, I receive this error:
AttributeError: 'numpy.ndarray' object has no attribute 'drop'
This solved my problem:
from sklearn.model_selection import train_test_split
X_train, X_rem, y_train, y_rem = train_test_split(P,Q, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5)
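The AttributeError happens because a numpy array has no .drop method, which suggests fast_ml's train_valid_test_split expects a pandas DataFrame rather than plain arrays. With the two-step train_test_split above, a quick shape check (a minimal sketch, assuming P and Q as described in the question) confirms the 80/10/10 split:
print(X_train.shape, X_valid.shape, X_test.shape)
# expected: (80000, 4, 4, 6, 1) (10000, 4, 4, 6, 1) (10000, 4, 4, 6, 1)
print(y_train.shape, y_valid.shape, y_test.shape)
# expected: (80000,) (10000,) (10000,)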
Basically, I wanted to split my dataset into training, testing, and validation sets, so I used the train_test_split function twice. I have a dataset of around 10 million rows.
On the first split I divided it 70/30 into training and testing (roughly 7 million and 3 million rows). Now, to get a validation set, I am a bit confused whether to pass the already-split testing data or the training data to train_test_split. Please give some advice. TIA
X = features
y = target
# dividing X, y into train and test and validation data 70% training dataset with 15% testing and 15% validation set
from sklearn.model_selection import train_test_split
#features and labels split 70-30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#furthermore, test data is split into test and validation sets, 15-15
x_test, x_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)
Don't make the testing set too small. A 20% testing dataset is fine. It would be better if you split your training dataset into training and validation sets (80%/20% is a fair split). Considering this, you should change your code like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
This is a common way to split a dataset.
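As a sanity check (a minimal sketch, assuming X and y are already defined as above): the second split takes 25% of the 80% training portion, so the final proportions come out to roughly 60% train, 20% validation, and 20% test of the original data.
from sklearn.model_selection import train_test_split

# first split: 80% train / 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# second split: 25% of the training portion becomes validation (20% of the total)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train) / len(X), len(X_val) / len(X), len(X_test) / len(X))
# approximately 0.6, 0.2, 0.2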
I have a training set consisting of X and Y. X is of shape (4000, 32, 1) and Y is of shape (4000, 1).
I would like to create a training/validation split from it. Here is what I have been trying to do:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
X_train, X_valid = X[train_index], X[valid_index]
y_train, y_valid = Y[train_index], Y[valid_index]
Running the program gives the following error message related to the above code segment
for train_index, valid_index in sss.split(X, Y):
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
I am not very clear about the above error message. What's the right way to create a training/validation split for a training set like the above?
It's a little bit weird, because I copy/pasted your code with sklearn's breast cancer dataset as follows:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, Y = cancer.data, cancer.target
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(test_size=0.1, random_state=23)
for train_index, valid_index in sss.split(X, Y):
X_train, X_valid = X[train_index], X[valid_index]
y_train, y_valid = Y[train_index], Y[valid_index]
Here X.shape = (569, 30) and Y.shape = (569,), and I had no error; for example, y_valid.shape is (57,), i.e. one tenth of 569.
I suggest you reshape X into (4000, 32) (and Y into (4000,)), because Python may see it as a list of ONE big element (I am using Python 2.7, by the way).
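A minimal sketch of that reshape, assuming X and Y are numpy arrays with the shapes given in the question:
import numpy as np

# drop the trailing singleton dimensions noted in the question
X = X.reshape(4000, 32)  # from (4000, 32, 1)
Y = Y.ravel()            # from (4000, 1) to (4000,)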
To answer your question, you can alternatively use train_test_split
from sklearn.model_selection import train_test_split
which, according to the help, does the following:
Split arrays or matrices into random train and test subsets. Quick utility that wraps input validation and ``next(ShuffleSplit().split(X, y))``
Basically, it is a wrapper around what you wanted to do. You can then specify the training and test sizes, the random_state, and whether you want to stratify or shuffle your data, etc.
It's easy to use; for example:
X_train, X_valid, y_train, y_valid = train_test_split(X,Y, test_size = 0.1, random_state=0)
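And if the classes are imbalanced (assuming Y holds discrete class labels, as in the breast cancer example above), you can ask train_test_split to preserve the class proportions in both subsets with the stratify parameter:
X_train, X_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.1, random_state=0, stratify=Y)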
I am trying to write my own train/test split function using numpy instead of sklearn's train_test_split. I am splitting the data into 70% training and 30% test. I am using the Boston housing data set from sklearn.
This is the shape of the data:
housing_features.shape #(506, 13), where 506 is the sample size and 13 is the number of features
This is my code:
from sklearn import datasets
city_data = datasets.load_boston()
housing_prices = city_data.target
housing_features = city_data.data
def shuffle_split_data(X, y):
split = np.random.rand(X.shape[0]) < 0.7
X_Train = X[split]
y_Train = y[split]
X_Test = X[~split]
y_Test = y[~split]
print len(X_Train), len(y_Train), len(X_Test), len(y_Test)
return X_Train, y_Train, X_Test, y_Test
try:
X_train, y_train, X_test, y_test = shuffle_split_data(housing_features, housing_prices)
print "Successful"
except:
print "Fail"
The print output I got is:
362 362 144 144
"Successful"
But I know it was not successful, because I get different numbers for the lengths each time I run it, whereas sklearn's train_test_split always gives 354 for the length of X_train.
#correct output
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_prices, test_size=0.3, random_state=42)
print len(X_train)
#354
What am I missing in my function?
Because you're using np.random.rand, which gives you uniform random numbers, the fraction falling below the 0.7 threshold will only be close to 70% for very large sample sizes. You could instead use np.percentile to get the value at the 70th percentile and then compare against that value, as you did with 0.7:
def shuffle_split_data(X, y):
    # rank the random draws: exactly the lowest 70% go to the training set
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, 70)
    X_train = X[split]
    y_train = y[split]
    X_test = X[~split]
    y_test = y[~split]
    print len(X_train), len(y_train), len(X_test), len(y_test)
    return X_train, y_train, X_test, y_test
EDIT
Alternatively, you could use np.random.choice to select the desired number of indices. For your case:
np.random.choice(range(X.shape[0]), int(0.7*X.shape[0]), replace=False)
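A minimal sketch of how those indices could be used to build the full split; replace=False keeps the sampled indices unique, and the complement of the sample becomes the test set:
import numpy as np

n = X.shape[0]
train_idx = np.random.choice(n, int(0.7 * n), replace=False)  # 70% of row indices, no duplicates
test_idx = np.setdiff1d(np.arange(n), train_idx)              # the remaining 30%

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]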