Evaluate Loss Function Value Getting From Training Set on Cross Validation Set - python

I am following Andrew Ng's instructions for evaluating a classification algorithm:
Find the Loss Function of the Training Set.
Compare it with the Loss Function of the Cross Validation.
If both are close enough and small, go to the next step (otherwise, there is bias or variance, etc.).
Make a prediction on the Test Set using the resulting Thetas (i.e. weights) produced in the previous step as a final confirmation.
I am trying to apply this using the Scikit-Learn library; however, I am really lost and sure that I am totally wrong (I didn't find anything similar online):
from sklearn import model_selection, svm
from sklearn.metrics import make_scorer, log_loss
from sklearn import datasets

def main():
    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10, random_state=42)
    model = svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring=make_scorer(log_loss, greater_is_better=False))
    print(results)
Error
ValueError: y_true contains only one label (0). Please provide the true labels explicitly through the labels argument.
I am not even sure it's the right way to start. Any help is very much appreciated.

Given the clarifications you provide in the comments and that you are not particularly interested in the log loss itself, I think the most straightforward approach is to abandon log loss and go for the accuracy instead:
from sklearn import model_selection, svm
from sklearn import datasets

iris = datasets.load_iris()
# shuffle=True is needed for random_state to have any effect
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)
model = svm.SVC(kernel='linear', C=1)
results = model_selection.cross_val_score(estimator=model,
                                          X=iris.data,
                                          y=iris.target,
                                          cv=kfold,
                                          scoring="accuracy")  # change
As already mentioned in the comments, using log loss in such situations still suffers from some unresolved issues in scikit-learn (see here and here).
For the purpose of estimating the generalization ability of your model, you will be fine with the accuracy metric.

This kind of error appears often when you do cross validation.
Basically, your data is split into n_splits = 10 folds, and some classes are missing from some of these folds. The iris samples are ordered by class and KFold does not shuffle by default, so the first test fold contains only class 0, which is exactly what the error message reports.
When the loss is evaluated, the classes present in that fold's y_true do not match the classes the model was trained on, so the loss cannot be computed unless the full label set is known.
What do you do in this case?
You have several possibilities:
Shuffle your data: KFold(n_splits=10, random_state=42, shuffle=True)
Make n_splits bigger
Provide the list of labels explicitly to the loss function as follows:
args_loss = { "labels": [0,1,2] }
make_scorer(log_loss, greater_is_better=False, **args_loss)
Cherry-pick your splits to make sure this doesn't happen. I don't think KFold allows this, but GridSearchCV does.
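As a minimal sketch of the shuffling option, and not necessarily what the asker ended up using: the snippet below assumes a stratified, shuffled split so that every test fold contains all three classes, and it assumes SVC is created with probability=True so that the built-in "neg_log_loss" scorer can call predict_proba. Both choices are assumptions added for illustration.

```python
from sklearn import datasets, model_selection, svm

iris = datasets.load_iris()

# Assumption: stratified + shuffled folds, so each test fold holds all 3 classes.
cv = model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Assumption: probability=True so the scorer can obtain class probabilities.
model = svm.SVC(kernel="linear", C=1, probability=True, random_state=42)

# "neg_log_loss" returns the negated log loss, so values closer to 0 are better.
results = model_selection.cross_val_score(
    estimator=model,
    X=iris.data,
    y=iris.target,
    cv=cv,
    scoring="neg_log_loss",
)
print(results)
print("mean log loss:", -results.mean())
```

Using the string scorer avoids wrapping log_loss by hand; the explicit labels argument shown above remains an option if unstratified folds are required.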

Just for future readers who are following Andrew's Course:
K-Fold is not practically applicable for this purpose, because we mainly want to evaluate the Thetas (i.e. weights) produced by a certain algorithm with given parameters on the Cross-Validation Set, using those Thetas to compare the two cost functions J(train) and J(CV) and determine whether the model suffers from bias or variance, or is O.K.
K-Fold, by contrast, is mainly for testing predictions on the CV folds using the weights produced by training the model on the training folds.
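A rough sketch of that J(train) vs J(CV) comparison with a single fixed split might look like the following. The 60/20/20 split sizes, the use of LogisticRegression (chosen because it exposes predict_proba directly) and the variable names are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the test set, then split the rest 75/25 into train and CV sets.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Fit once on the training set; the learned weights are then reused as-is.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Same weights, two data sets: J(train) and J(CV).
j_train = log_loss(y_train, model.predict_proba(X_train), labels=model.classes_)
j_cv = log_loss(y_cv, model.predict_proba(X_cv), labels=model.classes_)
print("J(train) = {:.4f}, J(CV) = {:.4f}".format(j_train, j_cv))

# Final confirmation on the untouched test set.
test_accuracy = model.score(X_test, y_test)
print("test accuracy = {:.3f}".format(test_accuracy))
```

If J(train) and J(CV) are both high, that hints at bias; a low J(train) with a much higher J(CV) hints at variance.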

Related

Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV?

What I want to do is to derive a classifier which is optimal in its parameters with respect to a given metric (for example the recall score) but also calibrated (in the sense that the output of the predict_proba method can be directly interpreted as a confidence level, see https://scikit-learn.org/stable/modules/calibration.html). Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV, that is, to fit a classifier via GridSearchCV, and then pass the GridSearchCV output to the CalibratedClassifierCV object? If I'm correct, the CalibratedClassifierCV object would fit a given estimator cv times, and the probabilities for each of the folds are then averaged for prediction. However, the results of the GridSearchCV could be different for each of the folds.
Yes you can do this and it would work. I don't know if it makes sense to do this, but the least I can do is explain what I believe would happen.
We can compare doing this to the alternative which is getting the best estimator from the grid search and feeding that to the calibration.
Simply getting the best estimator and feeding it to CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
calibration_clf = CalibratedClassifierCV(clf.best_estimator_)
calibration_clf.fit(iris.data, iris.target)
calibration_clf.predict_proba(iris.data[0:10])
array([[0.91887427, 0.07441489, 0.00671085],
[0.91907451, 0.07417992, 0.00674558],
[0.91914982, 0.07412815, 0.00672202],
[0.91939591, 0.0738401 , 0.00676399],
[0.91894279, 0.07434967, 0.00670754],
[0.91910347, 0.07414268, 0.00675385],
[0.91944594, 0.07381277, 0.0067413 ],
[0.91903299, 0.0742324 , 0.00673461],
[0.91951618, 0.07371877, 0.00676505],
[0.91899007, 0.07426733, 0.00674259]])
Feeding the grid search into CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
cal_clf = CalibratedClassifierCV(clf)
cal_clf.fit(iris.data, iris.target)
cal_clf.predict_proba(iris.data[0:10])
array([[0.900434 , 0.0906832 , 0.0088828 ],
[0.90021418, 0.09086583, 0.00891999],
[0.90206035, 0.08900572, 0.00893393],
[0.9009212 , 0.09012478, 0.00895402],
[0.90101953, 0.0900889 , 0.00889158],
[0.89868497, 0.09242412, 0.00889091],
[0.90214948, 0.08889812, 0.0089524 ],
[0.8999936 , 0.09110965, 0.00889675],
[0.90204193, 0.08896843, 0.00898964],
[0.89985101, 0.09124147, 0.00890752]])
Notice that the output probabilities are slightly different between the two.
The difference between the two methods is:
Using the best estimator only does the calibration across 5 splits (the default cv). It uses the same estimator in all 5 splits.
Passing the grid search fits a separate grid search on each of the 5 CV splits from the calibration. You are essentially doing cross validation on 4/5 of the data each time, choosing the best estimator for that 4/5, and then doing the calibration with that best estimator on the remaining 1/5. You could end up with slightly different models running on each set of test data, depending on what the grid search chooses.
I think grid search and calibration have different goals, so I would probably work on each separately and go with the first way described above: get the model that works best and then feed it into the calibration.
However, I don't know your specific goals, so I can't say that the second way described here is the WRONG way. You can always try both ways, see which gives you better performance, and go with the one that works best.
I think that your approach is a little different from your objective. What your objective says is "Find a model with best recall, whose confidence should be unbiased", but what you do is "Find a model with best recall, then make the confidence unbiased". So a better (but slower) way to do that is:
Wrap your model with CalibratedClassifierCV and treat this wrapped model as the final model to optimize;
Modify your param grid so that you are tuning the model inside CalibratedClassifierCV (change param to something like base_estimator__param, where base_estimator is the attribute in which CalibratedClassifierCV holds the base estimator; recent scikit-learn releases name it estimator instead);
Feed the CalibratedClassifierCV model into your final GridSearchCV, then fit;
Get best_estimator_, which is your unbiased model with best recall.
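The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: the iris data, the SVC parameter grid, cv=3 and the "recall_macro" scorer are all chosen here for the example, and the parameter prefix is looked up at runtime because the argument is called base_estimator in older scikit-learn releases and estimator in newer ones.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: wrap the model; the calibrated wrapper is what gets optimized.
calibrated = CalibratedClassifierCV(SVC(), cv=3)

# Step 2: address the inner model's parameters through the wrapper.
# The prefix depends on the installed scikit-learn version, so detect it.
params = calibrated.get_params(deep=False)
prefix = "estimator" if "estimator" in params else "base_estimator"
param_grid = {
    prefix + "__kernel": ["linear", "rbf"],
    prefix + "__C": [1, 10],
}

# Step 3: grid-search over the calibrated model, scoring by (macro) recall.
search = GridSearchCV(calibrated, param_grid, scoring="recall_macro", cv=3)
search.fit(X, y)

# Step 4: the best estimator is already calibrated.
best_model = search.best_estimator_
proba = best_model.predict_proba(X[:5])
print(search.best_params_)
print(proba)
```

Note that this nests one cross validation inside another, so it fits many more models than either approach shown earlier.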
I would advise that you do calibrate on a separate set not to bias the estimate.
I see two options. Either you cross validate within a fraction of the folds generated for calibrating, as suggested above, or you set apart an ad-hoc evaluation set that you would use only for calibration, after performing cross validation on training set.
In any case, I would recommend that you finally evaluate on a test set.

Training a model by looping through the train_test_split and training without looping

I am new to Python and Keras, so please bear with my question.
I recently created a model in Keras, trained it and got the mean squared error (MSE) after prediction. I used the train_test_split function on the data set.
Next I created a loop with 50 iterations and applied it to the above-mentioned model. However, I kept the train_test_split function (random_state not specified) within the loop, so that every iteration would have a new set of X_train, y_train, X_test and y_test values. I obtained 50 MSE values as output and calculated their mean and standard deviation.
My question is: did I do the right thing by placing the train_test_split function within the loop? Does it affect my goal, which was to see the different MSE values generated for my data set?
If I had placed the train_test_split function outside the loop and performed the same activity, wouldn't the X_train, y_train, X_test and y_test values remain the same throughout all 50 iterations? Wouldn't this cause an overfitting problem for my model?
I would really appreciate your feedback.
My code snippet:
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np

MSE = np.zeros(50)
for i in range(50):
    predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.3)
    model = regression_model()
    model.fit(predictors_train, target_train, validation_data=(predictors_test, target_test), epochs=50, verbose=0)
    model.evaluate(predictors_test, target_test, verbose=0)
    target_predicted = model.predict(predictors_test)
    MSE[i] = metrics.mean_squared_error(target_test, target_predicted)
    print("Test set MSE for {} cycle:{}".format(i+1, MSE[i]))
The method you are implementing is a form of cross validation. It gives your model a better "view" of your data and reduces the chance that a single training split was "too perfect" or "too noisy".
So putting train_test_split in the loop generates a new training batch from your original data in each iteration, and by averaging the outputs you get what you want.
If you put train_test_split outside the loop, the training data remains the same for every iteration, resulting in overfitting like you said.
However, train_test_split is random, so two random batches can end up very much alike, which makes this method suboptimal.
A better way is to use k-fold cross validation:
import numpy as np
from sklearn import metrics
from sklearn.model_selection import KFold

# KFold rather than StratifiedKFold: stratification needs class labels,
# and the target here is continuous. The seed 42 is arbitrary.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
MSE = []
for i, (train, test) in enumerate(kfold.split(X)):
    model = regression_model()  # fresh model for every fold
    model.fit(X[train], y[train], validation_data=(X[test], y[test]), epochs=50, verbose=0)
    target_predicted = model.predict(X[test])  # predict on this fold's held-out part
    MSE.append(metrics.mean_squared_error(y[test], target_predicted))
    print("Test set MSE for {} cycle:{}".format(i + 1, MSE[i]))
print("Mean MSE for {}-fold cross validation : {}".format(len(MSE), np.mean(MSE)))
This method creates 10 folds of your data and, at each iteration, fits your model on nine of them and evaluates it on the remaining one.
You can find more info here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
Hope this will help you !
EDIT FOR PRECISION
Indeed, don't use this method on your TEST data, only on your VALIDATION data!
Your model must never see your TEST data beforehand!
You don't want to use test set during training at all. You will be tweaking the model to the point where it will start "overfitting" even the test set and your error estimates will be too optimistic.
Yes, if you place train_test_split outside of that for loop, your sets will stay the same for the whole training and it can lead to overfitting. That is why you have a validation set, which is not used for training but for validation, mostly to find out whether your model is overfitting the training set or not. If it is overfitting, you should solve it by tweaking your model (making it less complex, implementing regularization, early stopping...).
But don't train your model on the same data you use for testing. Training your model on the validation set is a different story, and it is normally done when implementing k-fold cross validation.
So the general steps to follow are:
1. Split your dataset into a test set and the "other" set. Put the test set away and don't show it to your model until you are ready for final testing, i.e. only when you have already trained and tuned your model.
2. Choose whether you want to implement k-fold cross-validation or not. If not, split the "other" set into a training set and a validation set and use them throughout the whole training: the training set for training and the validation set for validating.
3. If you want to implement k-fold cross-validation, follow step 2, measure the error metric that you want to track, then take the "other" set again, split it into a different training set and validation set, and do the whole training again. Repeat this multiple times and take the average of the error metrics measured during these cycles to get a better (average) error estimate.
4. Tune your model and repeat steps 2 and 3 until you are happy with the results.
5. Measure the error of your final model on the test set to see whether it generalizes well.
Note that while implementing k-fold cross validation is generally a good idea, this approach might be infeasible for larger neural networks because it can dramatically increase the time it takes to train them. If that is the case, you might want to stick with just one training set and one validation set or set k (in k-folds) to some low number such as 3.
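The steps above can be sketched end to end with scikit-learn alone. To keep the example quick to run it swaps the Keras network for a Ridge regressor on the diabetes data; the model, the data set, the 80/20 split and k=5 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

X, y = load_diabetes(return_X_y=True)

# Step 1: put a test set aside; it is not touched until the very end.
X_other, X_test, y_other, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 2-3: k-fold cross validation on the "other" set only.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []
for train_idx, val_idx in kfold.split(X_other):
    model = Ridge(alpha=1.0)  # stand-in for regression_model()
    model.fit(X_other[train_idx], y_other[train_idx])
    val_pred = model.predict(X_other[val_idx])
    fold_mse.append(mean_squared_error(y_other[val_idx], val_pred))
cv_mse = float(np.mean(fold_mse))
print("CV MSE: {:.1f} +/- {:.1f}".format(cv_mse, np.std(fold_mse)))

# Step 4 (tuning) would repeat the loop above with other settings.

# Step 5: refit on all of the "other" data and evaluate once on the test set.
final_model = Ridge(alpha=1.0).fit(X_other, y_other)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print("Test MSE: {:.1f}".format(test_mse))
```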

Using statsmodels OLS on a test-set

I would like to use a technique from Scikit Learn, namely the ShuffleSplit to benchmark my linear regression model with a sequence of randomized test and train sets. This is well established and works great for the LinearModel in Scikit Learn using:
from sklearn.linear_model import LinearRegression
LM = LinearRegression()
LM.fit(X[train_index], Y[train_index])  # the model must be fitted before scoring
train_score = LM.score(X[train_index], Y[train_index])
test_score = LM.score(X[test_index], Y[test_index])
The score one gets here is only the R² value and nothing more. Using the statsmodels OLS implementation for linear models gives a very rich set of scores, among which are adjusted R², AIC, BIC, etc. However, here one can only fit the model with the training data to get these scores. Is there a way to get them for the test set as well?
so in my example:
from sklearn.model_selection import ShuffleSplit
from statsmodels.regression.linear_model import OLS
ss = ShuffleSplit(n_splits=40, train_size=0.15, random_state=42)
for train_index, test_index in ss.split(X):
    regr = OLS(Y[train_index], X[train_index]).fit()
    train_score_AIC = regr.aic
is there a way to add something like
    test_score_AIC = regr.test(Y[test_index], X[test_index]).aic
Most of those measures are goodness-of-fit measures that are built into the model/results classes and are only available for the training data or estimation sample.
Many of those measures are not well defined as out-of-sample, predictive accuracy measures, or at least I have never seen definitions that would fit that case.
Specifically, loglike is a method of the model and can only be evaluated at the attached training sample.
related issues:
https://github.com/statsmodels/statsmodels/issues/2572
https://github.com/statsmodels/statsmodels/issues/1282
It would be possible to partially work around the current limitations of statsmodels but none of those are currently supported and unit tested.

Machine learning procedure splitting the data into 3 sets

While reading documentation on machine learning techniques for both classification and regression, I came across a topic that is new to me. It seems that a recommended procedure before training and testing is to split the data into three different sets: training, validation and testing. Since this procedure makes sense to me, I was wondering how I should proceed with it. Let's say we split the data into these three sets, as I came across this while reading sklearn approaches and tips.
If we follow some interesting approaches like what I found in here:
Stratified Train/Validation/Test-split in scikit-learn
Taking this into account let's say we want to build a classifier using LogisticRegression(any classifier actually). The procedure as far as I am concerned should be something like this, right?:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Now if we want to make predictions we could use:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
Then, when one has to estimate the accuracy of the model, a common approach is:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
And here is where my question comes in. Should the validation set that was split off earlier be used for calculating accuracy, or should validation be done with a k-fold CV instead? For instance:
# Perform 10-fold cross validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(logreg, df, y, cv=10)
Any hint about the procedure with these three sets would be really appreciated. My thinking was that the validation set should be used together with the training set, but I do not really know in which way.

Scikit learn: measure of goodness of fit, better splitting the dataset or use all of it?

Sort of taking inspiration from here.
My problem
So I have a dataset with 3 features and n observations. I also have n responses. Basically I want to see if this model is a good fit or not.
From the question above people use R^2 for this purpose. But I am not sure I understand..
Can I just fit the model and then calculate the Mean Squared Error?
Should I use train/test split?
All of these seem to have in common prediction, but here I just want to see how good it is at fitting it.
For instance this is my idea
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]  # single feature (column 2), same as in the split version
#my idea
regr = linear_model.LinearRegression()
regr.fit(diabetes_X, diabetes.target)
print(np.mean((regr.predict(diabetes_X)-diabetes.target)**2))
However I often see people doing things like
diabetes_X = diabetes.data[:, np.newaxis, 2]
# split X
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# split y
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# instantiate and fit
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
# MSE but based on the prediction on test
print('Mean squared error: %.2f' % np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2))
In the first instance we get: 3890.4565854612724 while in the second case we get 2548.07. Which is the most correct one?
IMPORTANT: I WANT THIS TO WORK IN MULTIPLE REGRESSION, THIS IS JUST A MWE!
Can I just fit the model and then calculate the Mean Squared Error? Should I use train/test split?
No, you will run the risk of overfitting the model. That's the reason data is split into train and test (or even validation) sets: so that the model doesn't just 'memorize' what it sees but learns to perform on new, unseen samples.
It's always preferable to evaluate the performance of the model on a new set of data that wasn't observed during training. If you're going to optimize hyper-parameters or choose among several models, an additional validation set is the right choice.
However, sometimes the data is scarce and entirely removing data from the training process is prohibitive. In these cases, I strongly recommend using more efficient ways of validating your models, such as k-fold cross-validation (see KFold and StratifiedKFold in scikit-learn).
Finally, it is a good idea to ensure that your partitions behave in a similar way in the training and test sets. I recommend sampling the data uniformly over the target space so that you train and validate your model with the same distribution of target values.
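For the multiple-regression case, a minimal sketch of the difference between the in-sample MSE and a k-fold estimate could look like this. It assumes all ten diabetes features are used and that 5 shuffled folds are acceptable; neither choice comes from the question.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)  # all features: multiple regression

# In-sample MSE: fit and evaluate on the same data (goodness of fit only).
model = LinearRegression().fit(X, y)
in_sample_mse = mean_squared_error(y, model.predict(X))

# k-fold MSE: every sample is predicted by a model that never saw it.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(LinearRegression(), X, y, cv=cv,
                          scoring="neg_mean_squared_error")
cv_mse = -neg_mse.mean()

print("in-sample MSE: {:.1f}".format(in_sample_mse))
print("5-fold CV MSE: {:.1f}".format(cv_mse))
```

The in-sample number describes how well the line fits the data it was fitted to, while the cross-validated number estimates the error on unseen data without giving up any samples for good.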
