How to use scikit's preprocessing/normalization along with cross validation?

How to use scikit's preprocessing/normalization along with cross validation? - python

As an example of cross-validation without any preprocessing, I can do something like this:
tuned_params = [{"penalty" : ["l2", "l1"]}]
from sklearn.linear_model import SGDClassifier
SGD = SGDClassifier()
from sklearn.grid_search import GridSearchCV
clf = GridSearchCV(myClassifier, params, verbose=5)
clf.fit(x_train, y_train)
I would like to preprocess my data using something like
from sklearn import preprocessing
x_scaled = preprocessing.scale(x_train)
But it would not be a good idea to do this before setting the cross validation, because then the training and testing sets will be normalized together. How do I setup the cross validation to preprocess the corresponding training and test sets separately on each run?

Per the documentation, if you employ Pipeline, this can be done for you. From the docs, just above section 3.1.1.1, emphasis mine:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]
More relevant information on pipelines available here.

Related

Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV?

What I want to do is to derive a classifier which is optimal in its parameters with respect to a given metric (for example the recall score) but also calibrated (in the sense that the output of the predict_proba method can be directly interpreted as a confidence level, see https://scikit-learn.org/stable/modules/calibration.html). Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV, that is, to fit a classifier via GridSearchCV, and then pass the GridSearchCV output to the CalibratedClassifierCV object? If I'm correct, the CalibratedClassifierCV object would fit a given estimator cv times, and the probabilities for each of the folds are then averaged for prediction. However, the results of the GridSearchCV could be different for each of the folds.

Yes you can do this and it would work. I don't know if it makes sense to do this, but the least I can do is explain what I believe would happen.
We can compare doing this to the alternative which is getting the best estimator from the grid search and feeding that to the calibration.
Simply getting the best estimator and feeding it to calibrationcv
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
calibration_clf = CalibratedClassifierCV(clf.best_estimator_)
calibration_clf.fit(iris.data, iris.target)
calibration_clf.predict_proba(iris.data[0:10])
array([[0.91887427, 0.07441489, 0.00671085],
[0.91907451, 0.07417992, 0.00674558],
[0.91914982, 0.07412815, 0.00672202],
[0.91939591, 0.0738401 , 0.00676399],
[0.91894279, 0.07434967, 0.00670754],
[0.91910347, 0.07414268, 0.00675385],
[0.91944594, 0.07381277, 0.0067413 ],
[0.91903299, 0.0742324 , 0.00673461],
[0.91951618, 0.07371877, 0.00676505],
[0.91899007, 0.07426733, 0.00674259]])
Feeding grid search in the Calibration cv
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets
from sklearn.calibration import CalibratedClassifierCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
cal_clf = CalibratedClassifierCV(clf)
cal_clf.fit(iris.data, iris.target)
cal_clf.predict_proba(iris.data[0:10])
array([[0.900434 , 0.0906832 , 0.0088828 ],
[0.90021418, 0.09086583, 0.00891999],
[0.90206035, 0.08900572, 0.00893393],
[0.9009212 , 0.09012478, 0.00895402],
[0.90101953, 0.0900889 , 0.00889158],
[0.89868497, 0.09242412, 0.00889091],
[0.90214948, 0.08889812, 0.0089524 ],
[0.8999936 , 0.09110965, 0.00889675],
[0.90204193, 0.08896843, 0.00898964],
[0.89985101, 0.09124147, 0.00890752]])
Notice that the output of the probabilities are slightly different between the two.
The difference between each method is:
Using the best estimator is only doing the calibration across 5 splits (the default cv). It uses the same estimator in all 5 splits.
Using grid search, is doing going to fit a grid search on each of the 5 CV splits from calibration 5 times. You are essentially doing cross validation on 4/5 of the data each time choosing the best estimator for the 4/5 of the data and then doing the calibration with that best estimator on the last 5th. You could have slightly different models running on each set of test data depending on what the grid search chooses.
I think the grid search and calibration are different goals so in my opinion I would probably work on each separately and go with the first way specified above get a model that works the best and then feed that in the calibration curve.
However, I don't know your specific goals so I can't say that the 2nd way described here is the WRONG way. You can always try both ways and see what gives you better performance and go with the one that works best.

I think that your approach is a little different with your objective. What you objective says is "Find a model with best recall, which confidence should be unbiased", but what you do is "Find a model with best recall, then make the confidence unbiased". So a better (but slower) way to do that is:
Wrap your model with CalibratedClassifierCV, treat this model as the final model you should be optimized on;
Modify your param grid, make sure that you are tuning the model inside CalibratedClassifierCV (change param to something like base_estimator__param, which is the property CalibratedClassifierCV to hold the base estimator)
Feed CalibratedClassifierCV model into your final GridSearchCV, then fit
get best_estimator_, which is your unbiased model with best recall.

I would advise that you do calibrate on a separate set not to bias the estimate.
I see two options. Either you cross validate within a fraction of the folds generated for calibrating, as suggested above, or you set apart an ad-hoc evaluation set that you would use only for calibration, after performing cross validation on training set.
In any case, I would recommend that you finally evaluate on a test set.

Using statsmodels OLS on a test-set

I would like to use a technique from Scikit Learn, namely the ShuffleSplit to benchmark my linear regression model with a sequence of randomized test and train sets. This is well established and works great for the LinearModel in Scikit Learn using:
from sklearn.linear_model import LinearRegression
LM = LinearRegression()
train_score = LM.score(X[train_index], Y[train_index])
test_score = LM.score(X[test_index], Y[test_index])
The score one gets here is only the R² values and nothing more. Using the statsmodel OLS implementation for linear models gives a very rich set of scores among whcih are adjusted R² and AIC, BIC etc. However here on can only fit the model with the training data to get these scores. Is there a way to get them also for the test set?
so in my example:
from sklearn.model_selection import ShuffleSplit
from statsmodels.regression.linear_model import OLS
ss = ShuffleSplit(n_splits=40, train_size=0.15, random_state=42)
for train_index, test_index in ss.split(X):
regr = OLS( Y.[train_index], X.[train_index]).fit()
train_score_AIC = regr.aic
is there a way to add something like
test_score_AIC = regr.test(Y.[test_index], X.[test_index]).aic

Most of those measure are goodness of fit measures that are build into the model/results classes and only available for the training data or estimation sample.
Many of those measures are not well defined for out of sample, predictive accuracy measures, or I have never seen definitions that would fit that case.
Specifically, loglike is a method of the model and can only be evaluated at the attached training sample.
related issues:
https://github.com/statsmodels/statsmodels/issues/2572
https://github.com/statsmodels/statsmodels/issues/1282
It would be possible to partially work around the current limitations of statsmodels but none of those are currently supported and unit tested.

Skip some of the transform steps (related to over and under sampling) in imbalanced-learn pipeline when predicting on test data set [duplicate]

I'm dealing with an imbalanced dataset and want to do a grid search to tune my model's parameters using scikit's gridsearchcv. To oversample the data, I want to use SMOTE, and I know I can include that as a stage of a pipeline and pass it to gridsearchcv.
My concern is that I think smote will be applied to both train and validation folds, which is not what you are supposed to do. The validation set should not be oversampled.
Am I right that the whole pipeline will be applied to both dataset splits? And if yes, how can I turn around this?
Thanks a lot in advance

Yes, it can be done, but with imblearn Pipeline.
You see, imblearn has its own Pipeline to handle the samplers correctly. I described this in a similar question here.
When called predict() on a imblearn.Pipeline object, it will skip the sampling method and leave the data as it is to be passed to next transformer.
You can confirm that by looking at the source code here:
if hasattr(transform, "fit_sample"):
pass
else:
Xt = transform.transform(Xt)
So for this to work correctly, you need the following:
from imblearn.pipeline import Pipeline
model = Pipeline([
('sampling', SMOTE()),
('classification', LogisticRegression())
])
grid = GridSearchCV(model, params, ...)
grid.fit(X, y)
Fill the details as necessary, and the pipeline will take care of the rest.

Python RandomForest classifier (how to test it)

I have been able to create a RandomForestClassifier on a dataset.
clf = RandomForestClassifier(n_estimators=100, random_state = 101)
I can then use it on the test data like this:
prediction = pd.DataFrame(clf.predict(x)) # x = Matrix of predictor values
So my question is, how can I test clf.predict outside of Python, how can I see the values that is using and how can I test it "manually" for example if you get the betas in a Regression you can then use those values in Excel and replicate the model. How to do this with RandomForests in Python?
Also is there a similar metric to Rsquared to test the model's explication power?
Thanks!

The RandomForestClassifier is an ensemble of trees which means it is composed by multiple trees.
To be able to test the trees I would suggest to do it in Python itself, you can access all the trees in the estimators_ attribute of the classifier and subsequently export them as graphs with export_graphviz from sklearn.tree module.
If you insist on exporting the trees you will need to export all the rules that each tree is composed by. For that, you can follow this instructions from the sklearn docs.
Regarding the metrics, for a classification problem you could use accuracy_score from sklearn.metrics module.

Machine learning procedure splitting the data into 3 sets

Reading documentation and procedures while using machine learning techniques for both classification and regression I came across with some topic which actually is new for me. It seems that a recommended procedure related to split the data before training and testing is to split it into three different sets training, validation and testing. Since this procedure makes sense to me I was wondering how should I proceed with this. Let's say we split the data into these three sets, since I came across with this reading sklearn approaches and tips
If we follow some interesting approaches like what I found in here:
Stratified Train/Validation/Test-split in scikit-learn
Taking this into account let's say we want to build a classifier using LogisticRegression(any classifier actually). The procedure as far as I am concerned should be something like this, right?:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Now if we want to make predictions we could use:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
What when one have to estimate accuracy of the model a common approach is:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
And here is where my question comes. Validation set which was splitted before should be use for calculating accuracy or for validating somehow using a Kfold cv instead?. For instance,:
# Perform 10-fold cross validation
scores = cross_val_score(logreg, df, y, cv=10)
Any hint of the procedure with these three sets would be really appreciated. What I was thinking of was that validation set should be use with train but do not know really in which way.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to use scikit's preprocessing/normalization along with cross validation? - python

Related

Does it make sense to use sklearn GridSearchCV together with CalibratedClassifierCV?

Using statsmodels OLS on a test-set

Skip some of the transform steps (related to over and under sampling) in imbalanced-learn pipeline when predicting on test data set [duplicate]

Python RandomForest classifier (how to test it)

Machine learning procedure splitting the data into 3 sets

Categories

Resources