Scikit-learn has very useful classifier wrappers called CalibratedClassifier and CalibratedClassifierCV, which try to make sure that the predict_proba function of a classifier really predicts a probability and not just an arbitrary number (albeit perhaps well-ranked) between zero and one.
However, when using random forests it is customary to use oob_decision_function_ to determine the performance on the training data, but this is no longer available when using the calibrated models. The calibration should therefore work well for new data but not for the training data. How can we evaluate performance on the training data to determine, e.g., overfitting?
Apparently there really was no solution to this, and so I made a pull request to scikit-learn.
The problem was that the out-of-bag predictions are created during learning. Therefore, in the CalibratedClassifierCV each of the sub-classifiers does have its own oob decision function. However, this decision function is calculated on a fold of the data. Therefore, it is necessary to store each oob prediction (keeping nan values for samples that are not in the fold), then convert all the predictions using the calibration transformation, and then average the calibrated oob predictions to create an updated oob prediction.
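The averaging idea described above can be sketched outside of scikit-learn's internals. This is only an illustration under my own assumptions, not the code from the pull request: it uses a toy binary dataset, an isotonic calibration map fitted on the held-out part of each fold, and hypothetical names such as calibrated_oob.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

n_splits = 3
# One column per fold; NaN marks samples that were not in that fold's training part.
calibrated_oob = np.full((len(y), n_splits), np.nan)

cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
for k, (train_idx, calib_idx) in enumerate(cv.split(X, y)):
    forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=k)
    forest.fit(X[train_idx], y[train_idx])

    # Fit the calibration map on the held-out part of the fold.
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(forest.predict_proba(X[calib_idx])[:, 1], y[calib_idx])

    # Raw OOB probabilities exist only for the samples this forest was trained on.
    raw_oob = forest.oob_decision_function_[:, 1]
    calibrated_oob[train_idx, k] = calibrator.predict(raw_oob)

# Average the calibrated OOB predictions over the folds that contain each sample.
oob_proba = np.nanmean(calibrated_oob, axis=1)
```

Each sample is in the training part of n_splits - 1 folds, so every row has at least one non-NaN entry to average.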
As mentioned, I created a pull request at https://github.com/scikit-learn/scikit-learn/pull/11175. It will probably be a while before it is merged into the package, though, so if anyone really needs to use it then feel free to use my fork of scikit-learn at https://github.com/yishaishimoni/scikit-learn.
Related
I am trying to perform cross-validation on NMF to find the best parameters to use. I tried using the sklearn cross-validation but get an error that states the NMF does not have a scoring method. Could anyone here help me with that? Thank you all
A property of NMF is that it is an unsupervised (machine learning) method. This generally means that there is no labeled data that can serve as a 'gold standard'.
In case of NMF you can not define what is the 'desired' outcome beforehand.
The cross validation in sklearn is designed for supervised machine learning, in which you have labeled data by definition.
Cross validation holds out sets of labeled data, trains a model on the remaining data, and evaluates this model on the held-out set. Any metric can be used for this evaluation, for example accuracy, precision, recall or F-measure, but computing these measures requires labeled data.
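To make the hold-out, train and evaluate cycle concrete, here is a minimal sketch on a labeled toy dataset. The dataset (iris), the classifier and the choice of accuracy are my own assumptions for illustration; the point is only that the scoring step needs the labels y, which NMF does not have.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

fold_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train on the data that is left over after holding out one fold.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    # Evaluate on the held-out fold; this step requires the true labels.
    fold_scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

mean_score = sum(fold_scores) / len(fold_scores)
```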
It seems that GridSearchCV of scikit-learn collects the scores of its (inner) cross-validation folds and then averages across the scores of all folds. I was wondering about the rationale behind this. At first glance, it would seem more flexible to instead collect the predictions of its cross-validation folds and then apply the chosen scoring metric to the predictions of all folds.
The reason I stumbled upon this is that I use GridSearchCV on an imbalanced data set with cv=LeaveOneOut() and scoring='balanced_accuracy' (scikit-learn v0.20.dev0). It doesn't make sense to apply a scoring metric such as balanced accuracy (or recall) to each left-out sample. Rather, I would want to collect all predictions first and then apply my scoring metric once to all predictions. Or does this involve an error in reasoning?
Update: I solved it by creating a custom grid search class based on GridSearchCV with the difference that predictions are first collected from all inner folds and the scoring metric is applied once.
GridSearchCV uses the scoring to decide what internal hyperparameters to set in the model.
If you want to estimate the performance of the "optimal" hyperparameters, you need to do an additional step of cross validation.
See http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
EDIT to get closer to answering the actual question:
For me it seems reasonable to collect predictions for each fold and then score them all, if you want to use LeaveOneOut and balanced_accuracy. I guess you need to make your own grid searcher to do that. You could use model_selection.ParameterGrid and model_selection.KFold for that.
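A rough sketch of such a hand-made grid searcher follows. It is an assumption about how one might do it, not the custom class mentioned in the update: it uses model_selection.ParameterGrid together with cross_val_predict to pool the predictions from all folds, and the function name pooled_grid_search is hypothetical.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneOut, ParameterGrid, cross_val_predict

# Small imbalanced toy problem (assumed data, for illustration only).
X, y = make_classification(n_samples=60, n_features=5, weights=[0.8, 0.2], random_state=0)

def pooled_grid_search(estimator, param_grid, X, y, cv, metric):
    """Score each parameter setting once, on the predictions pooled over all folds."""
    results = []
    for params in ParameterGrid(param_grid):
        model = clone(estimator).set_params(**params)
        y_pred = cross_val_predict(model, X, y, cv=cv)  # one prediction per sample
        results.append((metric(y, y_pred), params))
    best = max(results, key=lambda r: r[0])
    return best, results

(best_score, best_params), results = pooled_grid_search(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    X, y,
    cv=LeaveOneOut(),
    metric=balanced_accuracy_score,
)
```

Unlike GridSearchCV, this sketch does not refit the best model on the full data; that would be one extra clone, set_params and fit.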
The target variable that I need to predict is a probability (as opposed to a label). The corresponding column in my training data is also in this form. I do not want to lose information by thresholding the targets to create a classification problem out of it.
If I train the logistic regression classifier with binary labels, the scikit-learn logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikit-learn, or a suitable Python package that scales to 100K data points of 1K dimensions?
I want the regressor to use the structure of the problem. One such structure is that the targets are probabilities.
You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in the API. It is a limitation of scikit-learn.
In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.
In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
For logistic regression you can approximate probabilities as targets by oversampling instances according to the probabilities of their labels. E.g., if for a given sample class_1 has probability 0.2 and class_2 has probability 0.8, then generate 10 training instances (copies of the sample): 8 with class_2 as the "ground truth target label" and 2 with class_1.
Obviously this is a workaround and not very efficient, but it should work properly.
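A minimal sketch of this oversampling workaround for the binary case is shown below. The helper name expand_soft_targets, the toy data and the default of 10 copies are my own assumptions; p is taken to be the probability of the positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def expand_soft_targets(X, p, copies=10):
    """Copy each row `copies` times; round(p * copies) of the copies get label 1."""
    n_pos = np.rint(p * copies).astype(int)
    X_rep = np.repeat(X, copies, axis=0)  # row 0 ten times, then row 1 ten times, ...
    y_rep = np.concatenate(
        [np.r_[np.ones(k, dtype=int), np.zeros(copies - k, dtype=int)] for k in n_pos]
    )
    return X_rep, y_rep

X = np.array([[0.0], [1.0], [2.0], [3.0]])
p = np.array([0.2, 0.4, 0.6, 0.8])  # soft targets: probability of the positive class

X_rep, y_rep = expand_soft_targets(X, p)
clf = LogisticRegression().fit(X_rep, y_rep)
pred = clf.predict_proba(X)[:, 1]
```

The rounding step loses some resolution: with 10 copies, targets are effectively quantized to multiples of 0.1.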
If you're ok with upsampling approach, you can pip install eli5, and use eli5.lime.utils.fit_proba with a Logistic Regression classifier from scikit-learn.
An alternative solution is to implement (or find an implementation of) logistic regression in TensorFlow, where you can define the loss function as you like.
In compiling this solution I worked using answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I advise those for more insight.
This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as.... regression!
There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually," unless you are lucky enough to be using a platform that supports SGD-R with custom kernels and X-validation support out-of-the-box.
Note that given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it takes more work to see the weight/importance of each feature in case you want to reduce your feature space by removing weak features.
Direct Approach using Optimization
If you don't mind doing a bit of coding, you can just use scipy optimize function. This is dead simple:
Create a function of the following type:
y_o = inverse-logit(a_0 + a_1*x_1 + a_2*x_2 + ...)
where inverse-logit(z) = exp(z) / (1 + exp(z))
Use scipy minimize to minimize the sum of -1 * [y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all datapoints. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters and creates the function and then calculates the loss.
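The steps above can be sketched as follows. This is a minimal illustration under my own assumptions: the synthetic data, the BFGS method and the small clipping constant (to keep the logarithms finite) are not part of the original recipe.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable inverse-logit

def cross_entropy(params, X, y_t):
    """Sum over all datapoints of -[y_t*log(y_o) + (1-y_t)*log(1-y_o)]."""
    z = params[0] + X @ params[1:]              # a_0 + a_1*x_1 + a_2*x_2 + ...
    y_o = np.clip(expit(z), 1e-12, 1 - 1e-12)   # avoid log(0)
    return -np.sum(y_t * np.log(y_o) + (1 - y_t) * np.log(1 - y_o))

# Synthetic example: the targets are probabilities generated from known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
true_params = np.array([0.3, 1.0, -2.0])
y_t = expit(true_params[0] + X @ true_params[1:])

result = minimize(cross_entropy, x0=np.zeros(3), args=(X, y_t), method="BFGS")
fitted_params = result.x  # should be close to true_params on this noiseless data
```

For 100K points in 1K dimensions it would probably pay off to pass an analytic gradient via the jac argument instead of relying on finite differences.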
Stochastic Gradient Descent with Custom Loss
If you happen to be using a platform that has SGD regression with a custom loss then you can just use that, specifying a loss of -[y_t*log(y_o) + (1-y_t)*log(1 - y_o)]
One way to do this is just to fork scikit-learn and add log loss to the regression SGD solver.
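If forking a library is too heavy, the same idea fits in a few lines of NumPy. The following is a hand-rolled sketch, not a scikit-learn API: the function name sgd_soft_logistic, the learning rate and the epoch count are assumptions. It relies on the fact that the gradient of the cross-entropy loss with respect to the linear score is simply y_o - y_t.

```python
import numpy as np

def sgd_soft_logistic(X, y_t, lr=0.1, epochs=100, seed=0):
    """Plain SGD on -[y_t*log(y_o) + (1-y_t)*log(1-y_o)] with soft targets y_t."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):        # one sample at a time, shuffled
            y_o = 1.0 / (1.0 + np.exp(-(X[i] @ w + b)))
            err = y_o - y_t[i]                   # d(loss)/d(score)
            w -= lr * err * X[i]
            b -= lr * err
    return w, b

# Synthetic soft targets generated from known coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
true_w, true_b = np.array([1.0, -2.0]), 0.3
y_t = 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))

w, b = sgd_soft_logistic(X, y_t)
```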
Convert to Classification Problem
You can convert your problem to a classification problem by oversampling, as described by @jo9k. But note that even in this case you should not use standard X-validation, because the data are no longer independent. You will need to break up your data manually into train/test sets and oversample only after you have broken them apart.
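A sketch of the "split first, expand afterwards" order is given below. As an assumption of mine, the copies are replaced by sample weights: each training row appears once with label 1 and weight p and once with label 0 and weight 1 - p, which plays the role of fractional oversampling. The test rows are never duplicated and are scored against their original soft targets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with soft targets p (assumed, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
p = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -1.0, 0.5]))))

# 1) Split the ORIGINAL rows, so no copy of a test row can leak into training.
X_train, X_test, p_train, p_test = train_test_split(X, p, test_size=0.25, random_state=0)

# 2) Expand only the training part.
X_fit = np.vstack([X_train, X_train])
y_fit = np.r_[np.ones(len(X_train)), np.zeros(len(X_train))]
w_fit = np.r_[p_train, 1.0 - p_train]

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit, sample_weight=w_fit)

# 3) Evaluate on the untouched test rows against their soft targets.
test_mae = np.mean(np.abs(clf.predict_proba(X_test)[:, 1] - p_test))
```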
Convert to SVM
(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh Neural Network, which should be amenable to a regression task structured where training data outputs are probabilities. I might come back to this after further review.)
You should get similar results to logistic regression using an SVM with a sigmoid kernel. You can use scikit-learn's SVR class and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features.... which leads me to my final suggestion:
Convert to SVM using Approximated Kernels
This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:
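Use scikit-learn's RBFSampler to explicitly construct an approximate RBF-kernel feature map for your dataset.
Process your data through that kernel and then use scikit-learn's SGDRegressor with an SVM-style loss to realize a super-performant SVM on the transformed data. Note that the hinge loss itself is only offered by SGDClassifier; for SGDRegressor the closest counterpart is loss='epsilon_insensitive'.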
The above is laid out with code here
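Since the linked code is not reproduced here, the following is my own rough sketch of the two steps and not the linked code. One substitution should be stated plainly: SGDRegressor has no hinge loss, so the sketch uses loss="epsilon_insensitive", the SVR-style loss. The data, gamma, n_components and the final clipping to [0, 1] are assumptions.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data whose targets are probabilities (assumed, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
p = 1.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1])))

model = make_pipeline(
    StandardScaler(),
    # Random Fourier features approximating an RBF kernel.
    RBFSampler(gamma=0.1, n_components=300, random_state=0),
    # Linear model with an SVR-style loss on the transformed features.
    SGDRegressor(loss="epsilon_insensitive", epsilon=0.01, max_iter=2000, random_state=0),
)
model.fit(X, p)

# The regressor is unbounded, so clip predictions back into the probability range.
pred = np.clip(model.predict(X), 0.0, 1.0)
```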
Instead of using predict in the scikit-learn library, use the predict_proba function;
refer here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
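For reference, the difference between the two calls looks like this on an assumed toy dataset. Note that the model is still trained on hard labels, so this only yields probabilities at prediction time; it does not train on probability targets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)  # trained on hard 0/1 labels

labels = clf.predict(X[:5])        # hard class labels
proba = clf.predict_proba(X[:5])   # one column per class, rows sum to 1
```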
I have two data sets of different sizes.
1) Data set 1 has high dimension and 4500 samples (sketches).
2) Data set 2 has low dimension and 1000 samples (real data).
I assume that both data sets have the same distribution.
I want to train a non-linear SVM model using sklearn on the first data set (as pre-training), and after that I want to update the model on a part of the second data set (to fit the model).
How can I implement this kind of update in sklearn? How can I update an SVM model?
In sklearn you can do this only for a linear kernel, using SGDClassifier (with an appropriate selection of loss/penalty terms: the loss should be hinge and the penalty L2). Incremental learning is supported through the partial_fit method, which is implemented for neither SVC nor LinearSVC.
Unfortunately, in practice fitting an SVM incrementally on such small datasets is rather useless. An SVM has an easily obtainable global solution, so you do not need pretraining of any form; in fact it should not matter at all, if you are thinking about pretraining in the neural network sense. If correctly implemented, an SVM should completely forget the previous dataset. Why not learn on the whole data in one pass? This is what an SVM is supposed to do, unless you are working with some non-convex modification of SVM (then pretraining makes sense).
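A minimal sketch of that incremental route follows. It assumes, unlike the question, that both data sets share the same feature columns, since partial_fit needs a fixed number of features; the toy data and the 4500/1000 split are only there to mirror the sizes mentioned.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy stand-ins for the two data sets (assumed to have identical features).
X, y = make_classification(n_samples=5500, n_features=20, random_state=0)
X_first, y_first = X[:4500], y[:4500]
X_second, y_second = X[4500:], y[4500:]

# Hinge loss + L2 penalty gives a linear SVM trained by SGD.
clf = SGDClassifier(loss="hinge", penalty="l2", random_state=0)

# The first call must list every class that can ever appear.
clf.partial_fit(X_first, y_first, classes=np.unique(y))
coef_before = clf.coef_.copy()

# Later calls update the existing weights instead of starting from scratch.
clf.partial_fit(X_second, y_second)
coef_after = clf.coef_.copy()
```

Each partial_fit call makes a single pass over the data it is given, so several passes may be needed to get close to the converged linear SVM solution.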
To sum up:
From a theoretical and practical point of view there is no point in pretraining an SVM. You can either learn only on the second dataset, or on both at the same time. Pretraining is only reasonable for methods that suffer from local minima (or hard convergence of any kind) and thus need to start near the actual solution to be able to find a reasonable model (like neural networks). SVM is not one of them.
You can use incremental fitting (although in sklearn it is very limited) for efficiency reasons, but for such a small dataset you will be just fine fitting the whole dataset at once.
I'm trying to implement out-of-bag samples so that I won't have to partition my data into a training set and test set for random forest. Looking around, it seems that RandomForestClassifier takes in a boolean parameter oob_score, but I'm not sure if this is what will help me. (As far as I know, this only outputs an R^2 estimate?)
In R, using the randomForest package, predict.RandomForest automatically does oob if you don't input a data set to predict. Is there an equivalent way in Python?
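For what it is worth, here is a small sketch of what oob_score=True exposes on a RandomForestClassifier, using assumed toy data. For the classifier, oob_score_ is an accuracy (the R^2 estimate applies to RandomForestRegressor), and oob_decision_function_ holds the per-sample out-of-bag class probabilities for the training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# Out-of-bag class probabilities for every training sample.
oob_proba = forest.oob_decision_function_
# Out-of-bag class predictions, derived from those probabilities.
oob_pred = forest.classes_[np.argmax(oob_proba, axis=1)]
# Out-of-bag accuracy, computed by scikit-learn during fit.
oob_accuracy = forest.oob_score_
```

With very few trees, some samples may never be out of bag and their rows in oob_decision_function_ can be NaN, so a reasonably large n_estimators is advisable.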