Using statsmodels OLS on a test-set - python

I would like to use a technique from scikit-learn, namely ShuffleSplit, to benchmark my linear regression model with a sequence of randomized train and test sets. This is well established and works great for LinearRegression in scikit-learn:
from sklearn.linear_model import LinearRegression
LM = LinearRegression()
LM.fit(X[train_index], Y[train_index])
train_score = LM.score(X[train_index], Y[train_index])
test_score = LM.score(X[test_index], Y[test_index])
The score one gets here is only the R² value and nothing more. The statsmodels OLS implementation for linear models gives a much richer set of scores, among which are adjusted R², AIC, BIC, etc. However, there one can only fit the model on the training data to obtain these scores. Is there a way to get them for the test set as well?
So in my example:
from sklearn.model_selection import ShuffleSplit
from statsmodels.regression.linear_model import OLS
ss = ShuffleSplit(n_splits=40, train_size=0.15, random_state=42)
for train_index, test_index in ss.split(X):
    regr = OLS(Y[train_index], X[train_index]).fit()
    train_score_AIC = regr.aic
Is there a way to add something like
    test_score_AIC = regr.test(Y[test_index], X[test_index]).aic

Most of those measures are goodness-of-fit measures that are built into the model/results classes and are only available for the training data, i.e. the estimation sample.
Many of them are not well defined as out-of-sample, predictive-accuracy measures, or at least I have never seen definitions that would fit that case.
Specifically, loglike is a method of the model and can only be evaluated at the attached training sample.
related issues:
https://github.com/statsmodels/statsmodels/issues/2572
https://github.com/statsmodels/statsmodels/issues/1282
It would be possible to partially work around the current limitations of statsmodels, but none of those workarounds are currently supported and unit tested.
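As an illustration only, here is a minimal, hypothetical sketch of such a workaround: it evaluates the Gaussian log-likelihood of the test residuals at the parameters and error variance estimated on the training sample. The helper below is not a statsmodels API and is not unit tested.
import numpy as np

def oos_gaussian_loglike(results, X_test, y_test):
    # Hypothetical helper: Gaussian log-likelihood of the test residuals,
    # evaluated at the training-sample parameter and variance estimates.
    resid = y_test - results.predict(X_test)
    n = len(resid)
    sigma2 = results.scale  # error-variance estimate from the training fit
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(resid ** 2) / sigma2
An AIC-like number could then be formed as 2 * k - 2 * oos_gaussian_loglike(...), with k the number of estimated parameters, but whether that quantity is meaningful out of sample is exactly the caveat above.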

Related

Improve speed of scikit-learn multinomial logistic regression

I am trying to train a logistic regression model in scikit-learn and it is taking very long to train, around 2 hours. The size of the dataset is 21613 x 19. I am new to scikit-learn, so I don't know whether my code is wrong or it just takes very long to train. Any suggestion on how to improve the training speed would be very much appreciated!
The code used to train is below:
# get the LogisticRegression estimator
from sklearn.linear_model import LogisticRegression
# training the model
# apply algorithm to data using fit()
clf = LogisticRegression(solver='newton-cg',multi_class='multinomial')
clf.fit(X_train,y_train)
If you have a specific reason for using this solver, one thing you can do is parallelize the computations by setting the n_jobs=-1 argument.
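A minimal sketch of that option, keeping the original solver and assuming the same X_train and y_train as above:
from sklearn.linear_model import LogisticRegression

# same solver as before, but allow scikit-learn to use all available CPU cores
clf = LogisticRegression(solver='newton-cg', multi_class='multinomial', n_jobs=-1)
clf.fit(X_train, y_train)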
If you're open to using other solvers, you can use faster solvers with a one-versus-rest strategy. For instance:
clf = LogisticRegression(solver='liblinear', multi_class='ovr')
It's all in the documentation, which can help guide your choice of solver:
solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’
Algorithm to use in the optimization problem.
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
‘liblinear’ and ‘saga’ also handle L1 penalty
‘saga’ also supports ‘elasticnet’ penalty
‘liblinear’ does not support setting penalty='none'
It's probably that slow because of the solver you have chosen. newton-cg is a Newton method; it's slow for large datasets because it computes the second derivatives. Use a different solver like sag or saga; they are fast for big datasets.
You might want to change your solver. The documentation says that scikit-learn has 5 different solvers: 'sag', 'saga', 'newton-cg', 'lbfgs', and 'liblinear' (not suitable for multinomial).
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Set training and validation sets
X, y = make_classification(n_samples=1000000, n_features=19, n_classes = 8, n_informative=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)
# Solvers
solvers = ['newton-cg', 'sag', 'saga', 'lbfgs']
for sol in solvers:
    start = time.time()
    logreg = LogisticRegression(solver=sol, multi_class='multinomial')
    logreg.fit(X_train, y_train)
    end = time.time()
    print(sol + " Fit Time: ", end - start)
Output (from 16GB RAM 8 Core Macbook):
Choosing the right solver for a problem can save a lot of time (code adapted from here). To determine which solver is right for your problem, check out the table in the documentation (notice that 'newton-cg' is not faster for large datasets).

Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV?

What I want to do is to derive a classifier which is optimal in its parameters with respect to a given metric (for example the recall score) but also calibrated (in the sense that the output of the predict_proba method can be directly interpreted as a confidence level, see https://scikit-learn.org/stable/modules/calibration.html). Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV, that is, to fit a classifier via GridSearchCV, and then pass the GridSearchCV output to the CalibratedClassifierCV object? If I'm correct, the CalibratedClassifierCV object would fit a given estimator cv times, and the probabilities for each of the folds are then averaged for prediction. However, the results of the GridSearchCV could be different for each of the folds.
Yes, you can do this and it would work. I don't know whether it makes sense to do, but the least I can do is explain what I believe would happen.
We can compare doing this to the alternative, which is getting the best estimator from the grid search and feeding that to the calibration.
Simply getting the best estimator and feeding it to CalibratedClassifierCV:
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
calibration_clf = CalibratedClassifierCV(clf.best_estimator_)
calibration_clf.fit(iris.data, iris.target)
calibration_clf.predict_proba(iris.data[0:10])
array([[0.91887427, 0.07441489, 0.00671085],
[0.91907451, 0.07417992, 0.00674558],
[0.91914982, 0.07412815, 0.00672202],
[0.91939591, 0.0738401 , 0.00676399],
[0.91894279, 0.07434967, 0.00670754],
[0.91910347, 0.07414268, 0.00675385],
[0.91944594, 0.07381277, 0.0067413 ],
[0.91903299, 0.0742324 , 0.00673461],
[0.91951618, 0.07371877, 0.00676505],
[0.91899007, 0.07426733, 0.00674259]])
Feeding the grid search into CalibratedClassifierCV:
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
cal_clf = CalibratedClassifierCV(clf)
cal_clf.fit(iris.data, iris.target)
cal_clf.predict_proba(iris.data[0:10])
array([[0.900434 , 0.0906832 , 0.0088828 ],
[0.90021418, 0.09086583, 0.00891999],
[0.90206035, 0.08900572, 0.00893393],
[0.9009212 , 0.09012478, 0.00895402],
[0.90101953, 0.0900889 , 0.00889158],
[0.89868497, 0.09242412, 0.00889091],
[0.90214948, 0.08889812, 0.0089524 ],
[0.8999936 , 0.09110965, 0.00889675],
[0.90204193, 0.08896843, 0.00898964],
[0.89985101, 0.09124147, 0.00890752]])
Notice that the output probabilities are slightly different between the two.
The difference between the two approaches is:
Using the best estimator is only doing the calibration across 5 splits (the default cv). It uses the same estimator in all 5 splits.
Using the grid search is going to fit a grid search on each of the 5 calibration CV splits. You are essentially doing cross validation on 4/5 of the data each time, choosing the best estimator for that 4/5 of the data, and then doing the calibration with that best estimator on the last fifth. You could end up with slightly different models running on each set of test data, depending on what the grid search chooses.
I think the grid search and the calibration have different goals, so in my opinion I would work on each separately and go with the first way specified above: get the model that works best and then feed that into the calibration.
However, I don't know your specific goals so I can't say that the 2nd way described here is the WRONG way. You can always try both ways and see what gives you better performance and go with the one that works best.
I think your approach is a little out of line with your objective. What your objective says is "find a model with the best recall, whose confidence should be unbiased", but what you do is "find a model with the best recall, then make the confidence unbiased". So a better (but slower) way to do that is:
Wrap your model with CalibratedClassifierCV and treat this wrapped model as the final model to be optimized;
Modify your param grid and make sure you are tuning the model inside CalibratedClassifierCV (change param to something like base_estimator__param, base_estimator being the attribute CalibratedClassifierCV uses to hold the base estimator);
Feed the CalibratedClassifierCV model into your final GridSearchCV, then fit;
Get best_estimator_, which is your calibrated model with the best recall (see the sketch below).
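A minimal sketch of that wrapped setup, assuming an SVC base model and macro-averaged recall as the metric (the base_estimator__ prefix matches older scikit-learn versions; newer ones use estimator__ instead):
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# wrap the base model so calibration is part of the estimator being tuned
cal_svc = CalibratedClassifierCV(svm.SVC())

# tune the parameters of the inner SVC through the wrapper
param_grid = {'base_estimator__kernel': ('linear', 'rbf'),
              'base_estimator__C': [1, 10]}
search = GridSearchCV(cal_svc, param_grid, scoring='recall_macro')
search.fit(iris.data, iris.target)

best_calibrated_model = search.best_estimator_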
I would advise that you calibrate on a separate set so as not to bias the estimate.
I see two options: either you cross-validate within a fraction of the folds generated for calibrating, as suggested above, or you set apart a dedicated set that you use only for calibration, after performing cross validation on the training set.
In any case, I would recommend that you finally evaluate on a test set.
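A minimal sketch of the separate-calibration-set option, assuming an already chosen SVC; cv='prefit' tells CalibratedClassifierCV to reuse the fitted model instead of refitting it (newer scikit-learn versions replace cv='prefit' with a FrozenEstimator wrapper):
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
# train / calibration / test split (60% / 20% / 20%)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0, stratify=y)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

base = svm.SVC(C=1, kernel='rbf').fit(X_train, y_train)                   # fit on the training set
calibrated = CalibratedClassifierCV(base, cv='prefit').fit(X_cal, y_cal)  # calibrate on the held-out set
print(calibrated.score(X_test, y_test))                                   # final evaluation on the test set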

Evaluate Loss Function Value Getting From Training Set on Cross Validation Set

I am following Andrew Ng's instructions to evaluate an algorithm for classification:
Find the Loss Function of the Training Set.
Compare it with the Loss Function of the Cross Validation.
If both are close enough and small, go to the next step (otherwise, there is a bias or variance problem, etc.).
Make a prediction on the Test Set using the resulting Thetas (i.e. weights) produced in the previous step as a final confirmation.
I am trying to apply this using the scikit-learn library; however, I am really lost and pretty sure I am doing it wrong (I didn't find anything similar online):
from sklearn import model_selection, svm
from sklearn.metrics import make_scorer, log_loss
from sklearn import datasets
def main():
    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10, random_state=42)
    model = svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring=make_scorer(log_loss, greater_is_better=False))
    print(results)
Error
ValueError: y_true contains only one label (0). Please provide the true labels explicitly through the labels argument.
I am not even sure this is the right way to start. Any help is very much appreciated.
Given the clarifications you provide in the comments and that you are not particularly interested in the log loss itself, I think the most straightforward approach is to abandon log loss and go for the accuracy instead:
from sklearn import model_selection, svm
from sklearn import datasets
iris = datasets.load_iris()
kfold = model_selection.KFold(n_splits=10, random_state=42)
model= svm.SVC(kernel='linear', C=1)
results = model_selection.cross_val_score(estimator=model,
                                          X=iris.data,
                                          y=iris.target,
                                          cv=kfold,
                                          scoring="accuracy") # change
As already mentioned in the comments, inclusion of log loss in such situations still suffers from some unresolved issues in scikit-learn (see here and here).
For the purpose of estimating the generalization ability of your model, you will be fine with the accuracy metric.
This kind of error appears often when you do cross validation.
Basically your data is split into n_splits = 10 and some classes are missing on some of these splits. For example, your 9th split may not have training examples for class number 2.
So when you evaluate the loss, the number of classes present in your predictions and in the test set do not match: you cannot compute the loss if y_true has 3 classes and your model was trained to predict only 2.
What do you do in this case?
You have a few possibilities:
Shuffle your data: KFold(n_splits=10, random_state=42, shuffle=True)
Make n_splits bigger
Provide the list of labels explicitly to the loss function, as follows (see also the sketch after this list):
args_loss = { "labels": [0,1,2] }
make_scorer(log_loss, greater_is_better=False,**args_loss)
Cherry-pick your splits to make sure this doesn't happen. I don't think KFold allows this, but GridSearchCV does.
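A minimal sketch combining the first and third fixes (shuffled folds plus an explicit label list), assuming the iris data from the question. Note that log_loss needs predicted probabilities, so the SVC is created with probability=True, and that the needs_proba argument of make_scorer is called response_method='predict_proba' in newer scikit-learn versions:
from sklearn import datasets, model_selection, svm
from sklearn.metrics import make_scorer, log_loss

iris = datasets.load_iris()
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)
model = svm.SVC(kernel='linear', C=1, probability=True)   # probability=True enables predict_proba
scorer = make_scorer(log_loss, greater_is_better=False,
                     needs_proba=True, labels=[0, 1, 2])  # labels passed through to log_loss
results = model_selection.cross_val_score(estimator=model, X=iris.data, y=iris.target,
                                          cv=kfold, scoring=scorer)
print(results)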
Just for future readers who are following Andrew's Course:
K-fold is not practically applicable to this purpose, because we mainly want to evaluate the Thetas (i.e. weights) produced by a certain algorithm with some parameters on the cross-validation set, using those Thetas in a comparison of the cost functions J(train) vs J(CV) to determine whether the model suffers from bias, from variance, or is OK.
K-fold, by contrast, is mainly for testing the prediction on the CV folds using the weights produced by training the model on the training set.

Machine learning procedure splitting the data into 3 sets

Reading documentation and procedures while using machine learning techniques for both classification and regression, I came across a topic which is new to me. It seems that a recommended procedure is to split the data into three different sets before training and testing: training, validation, and testing. Since this procedure makes sense to me, I was wondering how I should proceed with it. Let's say we split the data into these three sets; I came across this while reading sklearn approaches and tips.
If we follow some interesting approaches like the one I found here:
Stratified Train/Validation/Test-split in scikit-learn
Taking this into account, let's say we want to build a classifier using LogisticRegression (any classifier actually). The procedure, as far as I understand, should be something like this, right?
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Now if we want to make predictions we could use:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
When one has to estimate the accuracy of the model, a common approach is:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
And here is where my question comes in. Should the validation set that was split off earlier be used for calculating accuracy, or for validating somehow, e.g. using k-fold CV instead? For instance:
# Perform 10-fold cross validation
scores = cross_val_score(logreg, df, y, cv=10)
Any hint about the procedure with these three sets would be really appreciated. What I was thinking is that the validation set should be used together with the training set, but I don't really know in which way.
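Not from the original question, but a minimal sketch of one common way to obtain the three sets, assuming a feature matrix X and labels y: split off the test set first, then split the remainder into training and validation sets.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 60% train, 20% validation, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)          # fit on the training set
print(logreg.score(X_val, y_val))     # tune/compare models on the validation set
print(logreg.score(X_test, y_test))   # report final accuracy on the test set once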

Scikit learn: measure of goodness of fit, better splitting the dataset or use all of it?

Sort of taking inspiration from here.
My problem
So I have a dataset with 3 features and n observations. I also have n responses. Basically I want to see if this model is a good fit or not.
In the question above, people use R^2 for this purpose, but I am not sure I understand.
Can I just fit the model and then calculate the Mean Squared Error?
Should I use train/test split?
All of these seem to involve prediction, but here I just want to see how good the model is at fitting the data.
For instance this is my idea
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
diabetes = datasets.load_diabetes()
#my idea
diabetes_X = diabetes.data[:, np.newaxis, 2]  # same single feature as in the second snippet
regr = linear_model.LinearRegression()
regr.fit(diabetes_X, diabetes.target)
print(np.mean((regr.predict(diabetes_X)-diabetes.target)**2))
However I often see people doing things like
diabetes_X = diabetes.data[:, np.newaxis, 2]
# split X
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# split y
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# instantiate and fit
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
# MSE but based on the prediction on test
print('Mean squared error: %.2f' % np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2))
In the first instance we get 3890.4565854612724, while in the second case we get 2548.07. Which is the correct one?
IMPORTANT: I WANT THIS TO WORK IN MULTIPLE REGRESSION, THIS IS JUST A MWE!
Can I just fit the model and then calculate the Mean Squared Error? Should I use train/test split?
No, you would run the risk of overfitting the model. That's the reason the data is split into train and test sets (or even a validation set as well): so that the model doesn't just 'memorize' what it sees, but learns to perform on newer, unseen samples.
It's always preferable to evaluate the performance of the model on a new set of data that wasn't observed during training. If you're going to optimize hyper-parameters or choose among several models, an additional validation set is the right choice.
However, sometimes data is scarce and removing data entirely from the training process is prohibitive. In these cases, I strongly recommend using more efficient ways of validating your models, such as k-fold cross-validation (see KFold and StratifiedKFold in scikit-learn).
Finally, it is a good idea to ensure that your partitions behave similarly, i.e. that the training and test sets follow similar distributions. I recommend sampling the data uniformly over the target space so that you train/validate your model with the same distribution of target values.
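A minimal sketch of the k-fold suggestion, applied to the diabetes example from the question: a cross-validated MSE instead of a single fixed split (this sketch uses the full feature matrix, since the question ultimately targets multiple regression):
from sklearn import datasets, linear_model
from sklearn.model_selection import KFold, cross_val_score

diabetes = datasets.load_diabetes()
regr = linear_model.LinearRegression()
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negated MSE for scoring, so flip the sign for reporting
neg_mse = cross_val_score(regr, diabetes.data, diabetes.target,
                          cv=kfold, scoring='neg_mean_squared_error')
print("Mean CV MSE: %.2f" % -neg_mse.mean())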
