How does `fit` function in scikit-learn make validation? - python

I am having trouble with fit function when applied to MLPClassifier. I carefully read Scikit-Learn's documentation about that but was not able to determine how validation works.
Is it cross-validation or is there a split between training and validation data ?
Thanks in advance.

The fit function per se does not include cross-validation and also does not apply a train test split.
Fortunately you can do this by your own.
Train Test split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33) // test set size is 0.33
clf = MLPClassifier()
clf.fit(X_train, y_train)
clf.predict(X_test, y_test) // predict on test set
K-Fold cross validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=2)
kf.get_n_splits(X)
clf = MLPClassifier()
for train_index, test_index in kf.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf.fit(X_train, y_train)
clf.predict(X_test, y_test) // predict on test set
For cross validation multiple functions are available, you can read more about it here. The here stated k-fold is just an example.
EDIT:
Thanks for this answer, but basically how does fit function works
concretely ? It just trains the network on the given data (i.e.
training set) until max_iter is reached and that's it ?
I am assuming your are using the default config of MLPClassifier. In this case the fit function tries to do an optimization on basis of adam optimizer. In this case, indeed, the network trains until max_iter is reached.
Moreover, in the K-Fold cross validation, is the model improving as
long as the loop goes through or just restarts from scratch ?
Actually cross-validation is not used to improve the performance of your network, it's actually a methodology to test how well your algrotihm generalizes on different data. For k-fold, k independent classifiers are trained and tested.

Related

Scaling and data leakage on cross validation and test set

I have more of a best practice question.
I am scaling my data and I understand that I should fit_transform on my training set and transform on my test set because of potential data leakage.
Now if I want to use both (5 fold) Cross validation on my training set but I use a holdout test set anyway is it necessary to scale each fold independently?
My problem is that I want to use Feature Selection like this:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
scaler = MinMaxScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
efs = EFS(clf_tmp,
min_features=min,
max_features=max,
cv=5,
n_jobs = n_jobs)
efs = efs.fit(X_train, y_train)
Right now I am scaling X_train and X_test independently. But when the whole training set goes into the feature selector there will be some data leakage. Is this a problem for evaluation?
It's definitely best practice to include everything within your cross-validation loop to avoid data leakage. Any scaling should be done on the training set and then applied to the test set within each CV loop.

Using scaler in Sklearn PIpeline and Cross validation

I previously saw a post with code like this:
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv = cv)
My understanding is that: when we apply scaler, we should use 3 out of the 4 folds to calculate mean and standard deviation, then we apply the mean and standard deviation to all 4 folds.
In the above code, how can I know that Sklearn is following the same strategy? On the other hand, if sklearn is not following the same strategy, which means sklearn would calculate the mean/std from all 4 folds. Would that mean I should not use the above codes?
I do like the above codes because it saves tons of time.
In the example you gave, I would add an additional step using sklearn.model_selection.train_test_split:
folds = 4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1/folds), random_state=0, stratify=y)
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=(folds - 1))
scores = cross_val_score(pipeline, X_train, y_train, cv = cv)
I think best practice is to only use the training data set (i.e., X_train, y_train) when tuning the hyperparameters of your model, and the test data set (i.e., X_test, y_test) should be used as a final check, to make sure your model isn't biased towards the validation folds. At that point you would apply the same scaler that you fit on your training data set to your testing data set.
Yes, this is done properly; this is one of the reasons for using pipelines: all the preprocessing is fitted only on training folds.
Some references.
Section 6.1.1 of the User Guide:
Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
The note at the end of section 3.1.1 of the User Guide:
Data transformation with held out data
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:
...code sample...
A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
...
Finally, you can look into the source for cross_val_score. It calls cross_validate, which clones and fits the estimator (in this case, the entire pipeline) on each training split. GitHub link.

How to get predicted labels during training of any classifier?

Is there any way that I can track my model's performance in terms of it's classified labels, during the training phase? Any classifier from sklearn would work as an example.
To be more specific, I want to get something like a list of Confusion Matrices here:
clf = LinearSVC(random_state=42).fit(X_train, y_train)
# ... here ...
y_pred = clf.predict(X_test)
My objective here is to see how well the model is learning (during training). This is similar to analyzing the training loss, that is a common practice in DNN's, and libraries such as pyTorch, Keras, and Tensorflow have such capability already implemented.
I thought a quick browsing of the web would give me what I want, but apparently not. I still believe this should be fairly simple though.
Some ML practitioners like to work with three folds of data: training, validation and testing sets. The latter should not be seen in any training at all, but the middle could. For example, cross-validation uses K different folds of validation sets "during the training phase" to get a less biased performance estimation when training with different parts of the data.
But you can do this on a single validation fold for the purpose of what you asked.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train2, X_valid, y_train2, y_valid = train_test_split(X_train, y_train, test_size=0.2)
# Fit a classifier with train data only
clf = LinearSVC(random_state=42).fit(X_train2, y_train2)
y_valid_pred = clf.predict(X_valid)
confusionm_valid = confusion_matrix(y_valid, y_valid_pred) # ... here ...
# Refit with all your training data
clf = LinearSVC(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_valid)

Setting the n_estimators argument using **kwargs (Scikit Learn)

I am trying to follow this tutorial to learn the machine learning based prediction but I have got two questions on it?
Ques1. How to set the n_estimators in the below piece of code, otherwise it will always assume the default value.
from sklearn.cross_validation import KFold
def run_cv(X,y,clf_class,**kwargs):
# Construct a kfolds object
kf = KFold(len(y),n_folds=5,shuffle=True)
y_pred = y.copy()
# Iterate through folds
for train_index, test_index in kf:
X_train, X_test = X[train_index], X[test_index]
y_train = y[train_index]
# Initialize a classifier with key word arguments
clf = clf_class(**kwargs)
clf.fit(X_train,y_train)
y_pred[test_index] = clf.predict(X_test)
return y_pred
It is being called as:
from sklearn.svm import SVC
print "%.3f" % accuracy(y, run_cv(X,y,SVC))
Ques2: How to use the already trained model file (e.g. obtained from SVM) so that I can use it to predict more (test) data which I didn't used for training?
For your first question, in the above code you would call run_cv(X,y,SVC,n_classifiers=100), the **kwargs will pass this to the classifier initializer with the step clf = clf_class(**kwargs).
For your second question, the cross validation in the code you've linked is just for model evaluation, i.e. comparing different types of models and hyperparameters, and determining the likely effectiveness of your model in production. Once you've decided on your model, you need to refit the model on the whole dataset:
clf.fit(X,y)
Then you can get predictions with clf.predict or clf.predict_proba.

Cross validation and model selection

I am using sklearn for SVM training. I am using the cross-validation to evaluate the estimator and avoid the overfitting model.
I split the data into two parts. Train data and test data. Here is the code:
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
iris.data, iris.target, test_size=0.4, random_state=0
)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_validation.cross_val_score(clf, X_train, y_train, cv=5)
print scores
Now I need to evaluate the estimator clf on X_test.
clf.score(X_test, y_test)
here, I get an error saying that the model is not fitted using fit(), but normally, in cross_val_score function the model is fitted? What is the problem?
cross_val_score is basically a convenience wrapper for the sklearn cross-validation iterators. You give it a classifier and your whole (training + validation) dataset and it automatically performs one or more rounds of cross-validation by splitting your data into random training/validation sets, fitting the training set, and computing the score on the validation set. See the documentation here for an example and more explanation.
The reason why clf.score(X_test, y_test) raises an exception is because cross_val_score performs the fitting on a copy of the estimator rather than the original (see the use of clone(estimator) in the source code here). Because of this, clf remains unchanged outside of the function call, and is therefore not properly initialized when you call clf.fit.

Categories