How to split a dataframe for scikit-learn - python

I have a big dataframe. How can I divide it into 80% for training and 20% for testing? Thanks!
I tried split but it didn't work.

from sklearn.model_selection import train_test_split
X = # define X columns (features)
y = # define y column (target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
This gives you X_train and y_train, which contain 80% of the data, and X_test and y_test, which contain the remaining 20%.
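A minimal end-to-end sketch, assuming the dataframe is called df and the label column is named 'target' (both names are placeholders, not from the question):

# Hedged sketch: 80/20 split of a dataframe with a hypothetical 'target' column.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data.csv')        # placeholder file name
X = df.drop(columns=['target'])     # feature columns
y = df['target']                    # label column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # roughly 80% / 20% of the rows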

Related

How to perform train-val-test (3 way) split on multi-label data

I'm trying to split a multi-label dataset into train, val and test datasets. I want to do something similar to
from skmultilearn.model_selection.iterative_stratification import IterativeStratification

def iterative_train_test_split(X, y, test_size):
    stratifier = IterativeStratification(
        n_splits=2, order=1, sample_distribution_per_fold=[test_size, 1 - test_size])
    train_indices, test_indices = next(stratifier.split(X, y))
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    return X_train, X_test, y_train, y_test
but with n_splits=3. When I try to set n_splits=3 I still only get 2 sets of indices out. Am I doing something wrong?
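One possible approach (a sketch, not from the original post) is to chain two 2-way iterative splits: first carve off the test set, then split the remainder into train and validation, rescaling the validation fraction so it is relative to what remains:

# Hedged sketch: train/val/test via two successive 2-way iterative stratified splits.
# Assumes X and y are numpy arrays and reuses the iterative_train_test_split helper above.
def iterative_train_val_test_split(X, y, val_size=0.15, test_size=0.15):
    # First split: hold out the test set.
    X_rest, X_test, y_rest, y_test = iterative_train_test_split(X, y, test_size)
    # Second split: carve the validation set out of the remainder,
    # rescaling val_size to a fraction of the remaining data.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = iterative_train_test_split(X_rest, y_rest, rel_val)
    return X_train, X_val, X_test, y_train, y_val, y_test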

scikit-learn train_test_split() splitting data unexpectedly

I'm facing an issue where sklearn's train_test_split() divides large data sets unexpectedly. When I load the entire 118 MB data set, the test set it assigns is less than a tenth of the expected size.
Case 1: 60K datapoints
#loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv',nrows=60000)
data.shape
y=data['project_is_approved'] #
X=data.drop(['project_is_approved'],axis=1)
X.shape,y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y,random_state=0)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output:
(40200, 8) (40200,)
(19800, 8) (19800,)
Case 2:109,000 data-points
#loading the data
import pandas
data = pandas.read_csv('preprocessed_data.csv')
print(data.shape)
y=data['project_is_approved'] #
X=data.drop(['project_is_approved'],axis=1)
X.shape,y.shape
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y,random_state=123)
print(X_train.shape, y_train.shape)
#print(X_cv.shape, y_cv.shape)
print(X_test.shape, y_test.shape)
Output:
(109248, 9)
(90552, 8) (90552,)
(1460, 8) (1460,)
Anything more than 60K data points is being split abruptly, as in case 2, into 90K and 1.4K. I've tried changing the random state, removing the random state, and moving the data set to a new location, but the issue stays the same.
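One thing worth checking (an assumption, not stated in the post): 90552 + 1460 = 92012, far short of the 109248 rows reported before the split, so the X and y being passed in may not contain what you expect. A minimal diagnostic sketch for inspecting the frame and the label column before splitting:

# Hedged diagnostic sketch, assuming the same preprocessed_data.csv as above.
import pandas as pd

data = pd.read_csv('preprocessed_data.csv')
print(data.shape)                                              # rows and columns actually loaded
print(data['project_is_approved'].value_counts(dropna=False))  # class balance, including NaNs
print(data.isna().sum())                                       # missing values per column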

Using sklearn.model_selection to split unbalanced dataset

I am using the following code to split my dataset into train/val/test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
The problem is that my dataset is really unbalanced. Some classes have 500 samples while others have 70, for example. Is this splitting method accurate in this situation? Is the sampling random, or does sklearn use some method to keep the distribution of the data the same in all sets?
You should use the stratify option (see the docs):
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=42, stratify=y_data)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42, stratify=y_test)
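A quick way to confirm the stratification worked (a sketch, assuming y_data and the split labels are pandas Series, which the question does not state) is to compare class proportions across the splits:

# Hedged sketch: class proportions should be roughly equal across all four Series.
import pandas as pd

print(pd.concat(
    {'full': y_data.value_counts(normalize=True),
     'train': y_train.value_counts(normalize=True),
     'val': y_val.value_counts(normalize=True),
     'test': y_test.value_counts(normalize=True)},
    axis=1))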

What should be passed as the input parameter when using train_test_split twice in Python 3.6

Basically I wanted to split my dataset into training, testing and validation sets, so I used the train_test_split function twice. I have a dataset of around 10 million rows.
On the first split I divided the data into 70% training and 30% testing. Now, to get the validation set, I am a bit confused whether to pass the split test data or the training data to train_test_split. Please give some advice. TIA
X = features
y = target
# dividing X, y into train and test and validation data 70% training dataset with 15% testing and 15% validation set
from sklearn.model_selection import train_test_split
#features and label splitted into 70-30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#furthermore test data is splitted into test and validation set 15-15
x_test, x_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)
Don't make the test set too small. A 20% test set is fine. It would be better if you split your training dataset into training and validation (80%/20% is a fair split). Considering this, you should change your code this way:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)  # 0.25 of the 80% training portion = 20% of the full data for validation
Splitting a dataset like this is common practice.
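If the original 70/15/15 target is what you actually want, here is a sketch (not part of the answer above): split off the test set first, then rescale the validation fraction relative to what remains:

# Hedged sketch: exact 70/15/15 split, assuming X and y as defined above.
from sklearn.model_selection import train_test_split

# hold out 15% of the full data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
# 15% of the full data for validation = 0.15 / 0.85 of the remaining rows
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15 / 0.85, random_state=0)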

writing a train_test_split function with numpy

I am trying to write my own train test split function using numpy instead of using sklearn's train_test_split function. I am splitting the data into 70% training and 30% test. I am using the boston housing data set from sklearn.
This is the shape of the data:
housing_features.shape #(506,13) where 506 is sample size and it has 13 features.
This is my code:
from sklearn import datasets
import numpy as np

city_data = datasets.load_boston()
housing_prices = city_data.target
housing_features = city_data.data

def shuffle_split_data(X, y):
    split = np.random.rand(X.shape[0]) < 0.7
    X_Train = X[split]
    y_Train = y[split]
    X_Test = X[~split]
    y_Test = y[~split]
    print len(X_Train), len(y_Train), len(X_Test), len(y_Test)
    return X_Train, y_Train, X_Test, y_Test

try:
    X_train, y_train, X_test, y_test = shuffle_split_data(housing_features, housing_prices)
    print "Successful"
except:
    print "Fail"
The print output i got is:
362 362 144 144
"Successful"
But I know it was not successful because I get different numbers for the lengths when I run it again, whereas sklearn's train_test_split function always gives 354 for the length of X_train.
#correct output
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_prices, test_size=0.3, random_state=42)
print len(X_train)
#354
What am I missing in my function?
Because you're using np.random.rand, which gives you uniform random numbers, the fraction falling below the 0.7 cutoff is only close to 70% for very large sample sizes. You could instead use np.percentile to get the value at the 70th percentile and compare against that value, as you did:
def shuffle_split_data(X, y):
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, 70)
    X_train = X[split]
    y_train = y[split]
    X_test = X[~split]
    y_test = y[~split]
    print len(X_train), len(y_train), len(X_test), len(y_test)
    return X_train, y_train, X_test, y_test
EDIT
Alternatively you could use np.random.choice to select the desired number of indices without replacement. For your case:
np.random.choice(X.shape[0], int(0.7 * X.shape[0]), replace=False)
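Another sketch (not from the answer) that also gives an exact split is to shuffle the row indices once with np.random.permutation and slice:

# Hedged sketch: exact 70/30 split via a single shuffled index array.
# Assumes X and y are numpy arrays with matching first dimensions.
import numpy as np

def shuffle_split_exact(X, y, train_frac=0.7, seed=None):
    rng = np.random.RandomState(seed)       # reproducible when a seed is given
    idx = rng.permutation(X.shape[0])       # shuffled row indices
    n_train = int(train_frac * X.shape[0])  # exact number of training rows
    return X[idx[:n_train]], y[idx[:n_train]], X[idx[n_train:]], y[idx[n_train:]]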