Early-stopping while training neural network in scikit-learn - python

This question is very specific to the Python library scikit-learn. Please let me know if it's a better idea to post it somewhere else. Thanks!
Now the question...
I have a feed-forward neural network class ffnn based on BaseEstimator, which I train with SGD. It works fine, and I can also train it in parallel using GridSearchCV().
Now I want to implement early stopping in ffnn.fit(), but for that I also need access to the validation data of the current fold. One way of doing this is to change the line in sklearn.grid_search.fit_grid_point() which says
clf.fit(X_train, y_train, **fit_params)
into something like
clf.fit(X_train, y_train, X_test, y_test, **fit_params)
and also change ffnn.fit() to take these arguments. However, this would also affect other classifiers in sklearn, which is a problem. I can avoid this by checking for some kind of flag in fit_grid_point() that tells me when to call clf.fit() in either of the two ways above.
Can someone suggest a different way to do this where I don't have to edit any code in the sklearn library?
Alternatively, would it be right to further split X_train and y_train into train/validation sets randomly and check for a good stopping point, then re-train the model on all of X_train?
Thanks!

You could just make your neural network model internally extract a validation set from the passed X_train and y_train, using the train_test_split function for instance.
Edit:
Alternatively, would it be right to further split X_train and y_train into train/validation sets randomly and check for a good stopping point, then re-train the model on all of X_train?
Yes, but that would be expensive. You could instead find the stopping point and then just do a single additional pass over the validation data that you used to find the stopping point.
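For illustration, here is a minimal sketch of that idea. SGDClassifier and its partial_fit stand in for the question's ffnn and its SGD loop, and the dataset and patience value are made up for the example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X_train, y_train = load_iris(return_X_y=True)  # stand-in for the fold's training data
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1)

model = SGDClassifier()
best_score, bad_epochs, patience = -np.inf, 0, 5
for epoch in range(200):
    model.partial_fit(X_tr, y_tr, classes=np.unique(y_train))
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, bad_epochs = score, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # stopping point found on the internal validation split

# cheaper than retraining on all of X_train: one extra pass over the
# validation data that was used to find the stopping point
model.partial_fit(X_val, y_val)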

There are two ways:
First:
When making the x_train/x_test split, you can take a further 0.1 split from x_train and keep it for validation as x_dev:
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.25)
x_train, x_dev, y_train, y_dev = train_test_split(x_train, y_train, test_size=0.1)
clf = GridSearchCV(YourEstimator(), param_grid=param_grid)
clf.fit(x_train, y_train, x_dev=x_dev, y_dev=y_dev)
And your estimator would look like the following, implementing early stopping with x_dev and y_dev:
class YourEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, param1, param2):
        # store the hyperparameters (sklearn expects them as attributes of the same name)
        self.param1 = param1
        self.param2 = param2

    def fit(self, x, y, x_dev=None, y_dev=None):
        # perform training, using x_dev/y_dev for early stopping
        ...
Second:
You would not perform the second split on x_train, but would instead take out the dev set inside the fit method of the estimator:
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, test_size=0.25)
clf = GridSearchCV(YourEstimator(), param_grid=param_grid)
clf.fit(x_train, y_train)
And your estimator will look like the following:
class YourEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, param1, param2):
        # store the hyperparameters
        self.param1 = param1
        self.param2 = param2

    def fit(self, x, y):
        # take out a dev set internally, then train with early stopping
        x_train, x_dev, y_train, y_dev = train_test_split(x, y, test_size=0.1)
        ...

Related

Should I use training or main dataset for SVM_model.fit

I split my data into training, validation, and test sets.
I used the validation set in GridSearch to get the best parameter C.
Then I used the training set as SVM_model.fit(X_train, y_train) with the best C.
Is it correct?
My full code:
X_main, X_test, y_main, y_test = train_test_split(dataTrain, y, test_size=0.2, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(X_main, y_main, test_size=0.2, random_state=10)
# GridSearch -> defining parameter range
svm_linear = {'C': [0.0001,0.001,0.01,0.1 , 10 , 100],'kernel': ['linear']}
parameters = [svm_linear]
svc_mod = GridSearchCV(svm.SVC(), param_grid=parameters , cv=5 ,verbose=50)
svc_mod.fit(X_val,y_val)
#svc_mod.best_estimator_
print('***', svc_mod.best_params_)  # -> C=0.0001
SVM_model = svm.SVC( kernel ='linear',C=0.0001)
SVM_model.fit(X_train,y_train)
Your main mistake is in your fit call on GridSearchCV():
You must give GridSearchCV your main training data, because GridSearchCV has to fit on your main training data to find the best parameters for this algorithm and this data.
My edit to your code:
svc_mod.fit(X_train,y_train)
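For completeness, here is a sketch of how the corrected workflow could fit together; make_classification is just a placeholder for your dataTrain and y, and best_estimator_ is the model GridSearchCV refits on X_train with the best C:

from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

dataTrain, y = make_classification(n_samples=500, random_state=10)  # placeholder data
X_main, X_test, y_main, y_test = train_test_split(dataTrain, y, test_size=0.2, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(X_main, y_main, test_size=0.2, random_state=10)

parameters = [{'C': [0.0001, 0.001, 0.01, 0.1, 10, 100], 'kernel': ['linear']}]
svc_mod = GridSearchCV(svm.SVC(), param_grid=parameters, cv=5)
svc_mod.fit(X_train, y_train)            # search on the main training data
print('***', svc_mod.best_params_)

SVM_model = svc_mod.best_estimator_      # already refit on X_train with the best C
print(SVM_model.score(X_test, y_test))   # final check on the untouched test set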

How to get predicted labels during training of any classifier?

Is there any way that I can track my model's performance, in terms of its classified labels, during the training phase? Any classifier from sklearn would work as an example.
To be more specific, I want to get something like a list of Confusion Matrices here:
clf = LinearSVC(random_state=42).fit(X_train, y_train)
# ... here ...
y_pred = clf.predict(X_test)
My objective here is to see how well the model is learning during training. This is similar to analyzing the training loss, which is a common practice for DNNs, and libraries such as PyTorch, Keras, and TensorFlow already implement this capability.
I thought a quick browsing of the web would give me what I want, but apparently not. I still believe this should be fairly simple though.
Some ML practitioners like to work with three folds of data: training, validation, and test sets. The test set should not be seen during training at all, but the validation set can be. For example, cross-validation uses K different validation folds "during the training phase" to get a less biased performance estimate when training with different parts of the data.
But you can do this on a single validation fold for the purpose of what you asked.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train2, X_valid, y_train2, y_valid = train_test_split(X_train, y_train, test_size=0.2)

# Fit a classifier on the reduced training data only
clf = LinearSVC(random_state=42).fit(X_train2, y_train2)
y_valid_pred = clf.predict(X_valid)
confusion_valid = confusion_matrix(y_valid, y_valid_pred)  # ... here ...

# Refit with all your training data, then evaluate on the untouched test set
clf = LinearSVC(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
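If you want confusion matrices at several points during training rather than only once at the end, one option (a sketch, not the only way) is an estimator that supports partial_fit, evaluated after every pass; SGDClassifier stands in here for LinearSVC, which exposes no per-iteration hook:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

clf = SGDClassifier(random_state=42)
matrices = []  # one confusion matrix per epoch
for epoch in range(10):
    clf.partial_fit(X_train, y_train, classes=np.unique(y))
    y_valid_pred = clf.predict(X_valid)
    matrices.append(confusion_matrix(y_valid, y_valid_pred))
print(matrices[-1])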

What should be passed as input parameter when using train-test-split function twice in python 3.6

Basically I wanted to split my dataset into training, testing, and validation sets, so I have used the train_test_split function twice. I have a dataset of around 10 million rows.
On the first split I split it 70/30 into training and testing data. Now, to get a validation set, I am a bit confused whether to pass the split-off testing data or the training data to train_test_split in order to get the validation set. Please give some advice. TIA
X = features
y = target
# dividing X, y into train and test and validation data 70% training dataset with 15% testing and 15% validation set
from sklearn.model_selection import train_test_split
#features and label splitted into 70-30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
#furthermore test data is splitted into test and validation set 15-15
x_test, x_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)
Don't make the testing set too small; a 20% test set is fine. It would be better to also split your training data into training and validation sets (roughly 80%/20% is a fair split). Considering this, you could change your code this way:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
This is a common way to split a dataset.
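As a quick sanity check (a sketch on dummy data), the two calls above give a 60% / 20% / 20% split overall, since test_size=0.25 is taken from the remaining 80%:

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(1000).reshape(-1, 1), np.zeros(1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200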

How does `fit` function in scikit-learn make validation?

I am having trouble with the fit function when applied to MLPClassifier. I carefully read scikit-learn's documentation about it but was not able to determine how validation works.
Is it cross-validation, or is there a split between training and validation data?
Thanks in advance.
The fit function by itself does not perform cross-validation, and it also does not apply a train/test split (at least not with the default settings; if you set early_stopping=True, MLPClassifier will internally hold out validation_fraction of the training data for early stopping).
Fortunately, you can do this on your own.
Train Test split:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)  # test set is 33% of the data
clf = MLPClassifier()
clf.fit(X_train, y_train)
clf.predict(X_test)  # predict on the test set
K-Fold cross validation
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

kf = KFold(n_splits=2)
kf.get_n_splits(X)
clf = MLPClassifier()
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    clf.predict(X_test)  # predict on the test fold
Multiple functions are available for cross-validation; you can read more about them here. The k-fold shown above is just one example.
EDIT:
Thanks for this answer, but how does the fit function work concretely? Does it just train the network on the given data (i.e. the training set) until max_iter is reached, and that's it?
I am assuming you are using the default configuration of MLPClassifier. In that case the fit function optimizes the network's weights with the Adam solver, and indeed the network trains until max_iter is reached (or until the loss stops improving by more than tol).
Moreover, in the K-Fold cross-validation, does the model keep improving as the loop goes on, or does it restart from scratch each time?
Cross-validation is not actually used to improve the performance of your network; it is a methodology to test how well your algorithm generalizes on different data. For k-fold, k independent classifiers are trained and tested.
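For example, one of those helper functions, cross_val_score, trains and scores k independent classifiers in a single call (a sketch on toy data; the dataset and max_iter value are arbitrary choices here):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(MLPClassifier(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())  # generalization estimate averaged over the 5 folds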

Setting the n_estimators argument using **kwargs (Scikit Learn)

I am trying to follow this tutorial to learn machine-learning-based prediction, but I have two questions about it.
Question 1: How do I set n_estimators in the piece of code below? Otherwise it will always assume the default value.
from sklearn.cross_validation import KFold

def run_cv(X, y, clf_class, **kwargs):
    # Construct a kfolds object
    kf = KFold(len(y), n_folds=5, shuffle=True)
    y_pred = y.copy()
    # Iterate through folds
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with key word arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred
It is being called as:
from sklearn.svm import SVC
print "%.3f" % accuracy(y, run_cv(X,y,SVC))
Question 2: How do I use the already trained model (e.g. one obtained from SVM) to predict more (test) data that I didn't use for training?
For your first question: in the above code you would call run_cv(X, y, clf_class, n_estimators=100) with a classifier class that actually accepts n_estimators (for example RandomForestClassifier; SVC itself has no such parameter). The **kwargs will pass this keyword on to the classifier initializer in the step clf = clf_class(**kwargs).
For your second question, the cross validation in the code you've linked is just for model evaluation, i.e. comparing different types of models and hyperparameters, and determining the likely effectiveness of your model in production. Once you've decided on your model, you need to refit the model on the whole dataset:
clf.fit(X,y)
Then you can get predictions with clf.predict or clf.predict_proba.
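If by "already trained model file" you mean keeping the fitted model around for later use, one common option (a sketch, not something the tutorial prescribes) is to persist it with joblib and reload it when new data arrives:

import joblib
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # placeholder for your full dataset
clf = SVC(probability=True).fit(X, y)      # refit on the whole dataset, as above
joblib.dump(clf, "svc_model.joblib")       # save the fitted model to disk

clf_loaded = joblib.load("svc_model.joblib")
print(clf_loaded.predict(X[:5]))           # predictions on new rows
print(clf_loaded.predict_proba(X[:5]))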
