Below I split my data into train and test sets and then load them into a TensorDataset. What is a straightforward way to add a validation split?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)
train_data = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_data = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test))
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
There isn't a more straightforward way than the one you are using right now, at least not without another framework that sits on top of PyTorch (such as fastai), unless you want to manually compute the cut points, shuffle your data, and make the corresponding splits yourself (but that simple procedure does not handle stratification).
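For completeness, if you did want that manual route within PyTorch itself, a minimal sketch could use torch.utils.data.random_split; as noted above, this does not stratify the splits:
import torch
from torch.utils.data import TensorDataset, random_split

# Wrap the full arrays in a single dataset, then split it by lengths.
full_data = TensorDataset(torch.from_numpy(X), torch.from_numpy(y))
n_total = len(full_data)
n_test = int(0.2 * n_total)
n_train = n_total - n_test

# random_split shuffles the indices internally; no stratification is done.
train_data, test_data = random_split(
    full_data, [n_train, n_test],
    generator=torch.Generator().manual_seed(0),
)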
Using your approach, you can further split the train set into two parts (train and validation):
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size = 0.2, random_state = 0, stratify = y_train)
One thing to note here is that you will need to adjust test_size to your needs: using 0.2 twice in a row gives an 80/20 split followed by an 80/20 split of the remainder, i.e. 64/16/20, not 60/20/20.
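If you do want a 60/20/20 split, a quick sketch of the arithmetic (using the same variable names as above) is to make the second test_size 0.25, since 0.25 of the remaining 80% is 20% of the full data:
from sklearn.model_selection import train_test_split

# First split: 80% train+valid, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Second split: 0.25 * 0.8 = 0.2 of the original data goes to validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0, stratify=y_train)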
Once you have X_train, X_valid, and X_test splits, you can simply create the 3rd DataLoader for your validation set.
valid_data = TensorDataset(torch.from_numpy(X_valid), torch.from_numpy(y_valid))
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
Just a few notes:
unlike the training set, the validation and test sets don't need to be shuffled, so you can set shuffle=False in their DataLoaders
you don't need to use the same batch size in each DataLoader; a larger batch size for validation is preferable if your hardware can handle it, since computing with larger batches is faster and has no side effects here because gradient descent is not performed on these sets (see the sketch after these notes)
unless you know that you need a DataLoader for your test set, you can simply run inference on the whole test set at once; the test set is not part of the training loop, and inference is usually performed on the CPU (unless you know you need the GPU, e.g. for real-time inference, or the test set is so large that it does not fit into RAM, which usually isn't the case)
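A minimal sketch of the validation/test loaders with those notes applied, assuming the splits from above (the batch size of 1024 is just an arbitrary example):
from torch.utils.data import TensorDataset, DataLoader
import torch

valid_data = TensorDataset(torch.from_numpy(X_valid), torch.from_numpy(y_valid))
test_data = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test))

# No shuffling needed outside the training set; larger batches are fine here
# because no gradients are computed on these sets.
valid_loader = DataLoader(valid_data, shuffle=False, batch_size=1024)
test_loader = DataLoader(test_data, shuffle=False, batch_size=1024)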
Related
I'm using a custom TensorFlow model for an imbalanced classification problem.
For this I need to split the data into a train and test set and split the train set into batches.
However, the batches need to be stratified because of the class imbalance. For now I'm doing it like this:
X_train, X_test, y_train, y_test = skmodel.train_test_split(
    Xscaled, y_new, test_size=0.2, stratify=y_new)

dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(
    X_train.shape[0]).batch(batch_size)
But I am not sure whether the batches in dataset are stratified or not.
If not, how can I make sure that they are?
Is there any way I can track my model's performance, in terms of the labels it predicts, during the training phase? Any classifier from sklearn would work as an example.
To be more specific, I want to get something like a list of confusion matrices here:
clf = LinearSVC(random_state=42).fit(X_train, y_train)
# ... here ...
y_pred = clf.predict(X_test)
My objective is to see how well the model is learning during training. This is similar to analyzing the training loss, which is common practice for DNNs, and libraries such as PyTorch, Keras, and TensorFlow already have this capability built in.
I thought a quick browse of the web would give me what I want, but apparently not. I still believe this should be fairly simple, though.
Some ML practitioners like to work with three folds of data: training, validation and test sets. The last should not be seen during training at all, but the middle one can be. For example, cross-validation uses K different validation folds "during the training phase" to get a less biased performance estimate when training on different parts of the data.
But you can do this on a single validation fold for the purpose of what you asked:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train2, X_valid, y_train2, y_valid = train_test_split(X_train, y_train, test_size=0.2)
# Fit a classifier with train data only
clf = LinearSVC(random_state=42).fit(X_train2, y_train2)
y_valid_pred = clf.predict(X_valid)
confusionm_valid = confusion_matrix(y_valid, y_valid_pred) # ... here ...
# Refit with all your training data
clf = LinearSVC(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
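If you want a whole list of confusion matrices rather than a single one, a sketch of the cross-validation idea mentioned above could look like this (assuming X_train and y_train are NumPy arrays; StratifiedKFold and the 5 folds are just one possible choice):
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

confusion_matrices = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X_train, y_train):
    # Train on k-1 folds, evaluate on the held-out fold
    fold_clf = LinearSVC(random_state=42).fit(X_train[train_idx], y_train[train_idx])
    fold_pred = fold_clf.predict(X_train[valid_idx])
    confusion_matrices.append(confusion_matrix(y_train[valid_idx], fold_pred))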
Basically I wanted to split my dataset into training, testing and validation sets, so I have used the train_test_split function twice. I have a dataset of around 10 million rows.
On the first split I split the data 70/30 into training and testing sets (7 million training and 3 million testing rows). Now, to get the validation set, I am a bit confused about whether to pass the test data or the training data to train_test_split. Any advice? TIA
X = features
y = target
# dividing X, y into train, test and validation sets: 70% training, 15% testing, 15% validation
from sklearn.model_selection import train_test_split

# features and labels split 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# the 30% test portion is further split into test and validation sets (15%/15%)
x_test, x_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)
Don't make the test set too small; a 20% test set is fine. It would be better to split your training data (rather than the test data) further into training and validation sets. Considering this, you could change your code like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
Since test_size=0.25 takes 25% of the remaining 80%, this gives a 60/20/20 train/validation/test split, which is a common way to split a dataset.
I am having trouble with the fit function when applied to MLPClassifier. I carefully read scikit-learn's documentation about it but was not able to determine how validation works.
Is it cross-validation, or is there a split between training and validation data?
Thanks in advance.
The fit function per se does not include cross-validation and also does not apply a train/test split.
Fortunately, you can do this on your own.
Train/test split:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)  # test set size is 0.33
clf = MLPClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)  # predict on the test set
K-fold cross-validation:
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

kf = KFold(n_splits=2)
kf.get_n_splits(X)

clf = MLPClassifier()
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)  # predict on the test set
Multiple functions are available for cross-validation; you can read more about them here. The k-fold shown above is just one example.
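As one illustration of those helpers (a sketch, assuming X and y are defined as in the snippets above; cv=5 is an arbitrary choice), cross_val_score runs the whole k-fold loop for you and returns one score per fold:
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier()
# Trains and evaluates 5 independent models, one per fold
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())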
EDIT:
Thanks for this answer, but basically how does the fit function work concretely? It just trains the network on the given data (i.e. the training set) until max_iter is reached and that's it?
I am assuming you are using the default configuration of MLPClassifier. In that case the fit function optimizes the weights using the Adam optimizer and, indeed, the network trains until max_iter is reached.
Moreover, in the k-fold cross-validation, is the model improving as the loop goes through, or does it restart from scratch?
Actually, cross-validation is not used to improve the performance of your network; it's a methodology to test how well your algorithm generalizes to different data. For k-fold, k independent classifiers are trained and tested.
This question is very specific to the Python library scikit-learn. Please let me know if it's a better idea to post it somewhere else. Thanks!
Now the question...
I have a feed-forward neural network class ffnn based on BaseEstimator, which I train with SGD. It's working fine, and I can also train it in parallel using GridSearchCV().
Now I want to implement early stopping in the function ffnn.fit(), but for this I also need access to the validation data of the fold. One way of doing this is to change the line in sklearn.grid_search.fit_grid_point() which says
clf.fit(X_train, y_train, **fit_params)
into something like
clf.fit(X_train, y_train, X_test, y_test, **fit_params)
and also change ffnn.fit() to take these arguments. This would also affect other classifiers in sklearn, which is a problem. I could avoid this by checking for some kind of flag in fit_grid_point() that tells me when to call clf.fit() in either of the two ways above.
Can someone suggest a different way to do this where I don't have to edit any code in the sklearn library?
Alternatively, would it be right to further split X_train and y_train into train/validation sets randomly and check for a good stopping point, then re-train the model on all of X_train?
Thanks!
You could just make your neural network model internally extract a validation set from the passed X_train and y_train, for instance by using the train_test_split function.
Edit:
Alternatively, would it be right to further split X_train and y_train into train/validation sets randomly and check for a good stopping point, then re-train the model on all of X_train?
Yes, but that would be expensive. You could instead just find the stopping point and then do a single additional pass over the validation data that you used to find it.
There are two ways:
First:
While making the x_train/x_test split, you can take a further 0.1 split from x_train and keep it for validation as x_dev:
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.25)
x_train, x_dev, y_train, y_dev = train_test_split(x_train, y_train, test_size=0.1)
clf = GridSearchCV(YourEstimator(), param_grid=param_grid)
clf.fit(x_train, y_train, x_dev=x_dev, y_dev=y_dev)
And your estimator will look like the following, implementing early stopping with x_dev and y_dev:
class YourEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, param1, param2):
        # perform initialization
        ...

    def fit(self, x, y, x_dev=None, y_dev=None):
        # perform training with early stopping
        ...
Second:
You would not perform the second split on x_train, but would instead take out the dev set inside the fit method of the estimator:
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.25)
clf = GridSearchCV(YourEstimator(), param_grid=param_grid)
clf.fit(x_train, y_train)
And your estimator will look like the following:
class YourEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, param1, param2):
        # perform initialization
        ...

    def fit(self, x, y):
        # perform training with early stopping
        x_train, x_dev, y_train, y_dev = train_test_split(x, y, test_size=0.1)
        ...
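For illustration, here is a minimal sketch of what the early-stopping loop inside such a fit method might look like. This is not the asker's ffnn: as a stand-in it uses scikit-learn's SGDClassifier with partial_fit for the incremental SGD updates, and the EarlyStoppingSGD name, patience value, and epoch budget are arbitrary assumptions.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split


class EarlyStoppingSGD(BaseEstimator, ClassifierMixin):
    def __init__(self, max_epochs=100, patience=5):
        self.max_epochs = max_epochs
        self.patience = patience

    def fit(self, x, y):
        # Internally carve out a dev set to monitor for early stopping
        x_train, x_dev, y_train, y_dev = train_test_split(x, y, test_size=0.1)
        self.model_ = SGDClassifier()
        classes = np.unique(y)

        best_score, epochs_without_improvement = -np.inf, 0
        for epoch in range(self.max_epochs):
            # One pass (epoch) of SGD over the training part
            self.model_.partial_fit(x_train, y_train, classes=classes)
            score = self.model_.score(x_dev, y_dev)
            if score > best_score:
                best_score, epochs_without_improvement = score, 0
            else:
                epochs_without_improvement += 1
            if epochs_without_improvement >= self.patience:
                break  # stop early: dev accuracy stopped improving
        return self

    def predict(self, x):
        return self.model_.predict(x)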