I am implementing custom estimators using the scikit-learn library and its Pipeline, BaseEstimator, TransformerMixin and other base classes. (You can check the API here.)
Given a pipeline, you can call pipeline.fit(X) then pipeline.predict(X), or you can use pipeline.fit_predict(X), which is a bit faster because it applies the necessary transformations once instead of twice (once for the fit and once for the predict). So it is an optimization for when you want to predict on the same dataset you used to fit.
But some models, like classifiers or clusterers, have a method called predict_proba that returns the probability of each class or cluster label.
From the scikit glossary (link):
fit_predict
Used especially for unsupervised, transductive estimators, this fits
the model and returns the predictions (similar to predict) on the
training data. In clusterers, these predictions are also stored in the
labels_ attribute, and the output of .fit_predict(X) is usually
equivalent to .fit(X).predict(X). The parameters to fit_predict
are the same as those to fit.
predict_proba
A method in classifiers and clusterers that are able to return
probability estimates for each class/cluster. Its input is usually
only some observed data, X.
If the estimator was not already fitted, calling this method should
raise a exceptions.NotFittedError.
Output conventions are like those for decision_function except in the
binary classification case, where one column is output for each class
(while decision_function outputs a 1d array). For binary and
multiclass predictions, each row should add to 1.
Like other methods, predict_proba should only be present when the
estimator can make probabilistic predictions (see duck typing). This
means that the presence of the method may depend on estimator
parameters (e.g. in linear_model.SGDClassifier) or training data
(e.g. in model_selection.GridSearchCV) and may only appear after
fitting.
I am looking for a way to get a fit_predict_proba method that has the same advantages as fit_predict but returns probabilities.
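One possible way to get this behaviour is sketched below. It is not part of scikit-learn: fit_predict_proba is a hypothetical helper name, and the sketch assumes the pipeline has at least one transformer before a final estimator that exposes predict_proba. It relies on Pipeline slicing, which returns a shallow copy sharing the same step objects, so fitting the slice also fits the original pipeline's steps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def fit_predict_proba(pipeline, X, y=None):
    """Hypothetical helper: fit `pipeline` on X and return predict_proba(X),
    running each transformer over X only once."""
    transformers = pipeline[:-1]      # shallow slice: shares the step objects
    final_estimator = pipeline[-1]
    Xt = transformers.fit_transform(X, y)   # fit and transform in one pass
    final_estimator.fit(Xt, y)
    return final_estimator.predict_proba(Xt)


X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

proba = fit_predict_proba(pipe, X, y)

# Reference: the usual two-step route, which transforms X twice.
reference = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression())]
).fit(X, y).predict_proba(X)

print(np.allclose(proba, reference))
```

The helper is kept as a plain function rather than a Pipeline subclass so that it only touches public API; wrapping it in a subclass method would be a matter of taste.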
Related
Hi all, I am having trouble understanding how to use the output of sklearn.calibration.CalibratedClassifierCV.
I have calibrated my binary classifier using this method, and results are greatly improved. However I am not sure how to interpret the results.
sklearn guide states that, after calibration,
the output of predict_proba method can be directly interpreted as a
confidence level. For instance, a well calibrated (binary) classifier should classify the samples such that among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the positive class.
Now I would like to reduce false positives by applying a cutoff at .6 for the model to predict label True. Without the calibration, I would have simply used my_model.predict_proba(valid_x)[:, 1] > .6.
However, it seems that after calibration the meaning of predict_proba has changed, so I am not sure if I can do that anymore.
From a quick testing it seems that predict and predict_proba follow the same logic I would expect before calibration. The output of:
pred = my_model.predict(valid_x)
proba= my_model.predict_proba(valid_x)
pd.DataFrame({"label": pred, "proba": proba[:,1]})
is the following:
Where everything that has a probability above .5 gets classified as True, and everything below .5 as False.
Can you confirm that, after calibration, I can still use predict_proba to apply a different cutoff to identify my labels?
2 https://scikit-learn.org/stable/modules/calibration.html#calibration
In my view, you can indeed use predict_proba() after calibration to apply a different cutoff.
What happens within the CalibratedClassifierCV class (as you noticed) is that the output of predict() is derived from the output of predict_proba() (see here for reference), i.e. self.classes_[np.argmax(self.predict_proba(X), axis=1)] == self.predict(X).
On the other hand, for the non-calibrated classifier that you pass to CalibratedClassifierCV, the above equality may or may not hold, depending on whether it is a probabilistic classifier (e.g. it does not for an SVC() classifier; see here, for instance, for some other details on this).
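A small sketch of applying a custom cutoff to a calibrated classifier follows. The synthetic data, the LinearSVC base model and the cv=3 setting are illustrative assumptions, not taken from the question; the 0.6 cutoff is the one the question mentions.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# LinearSVC has no predict_proba; the calibrated wrapper adds one.
model = CalibratedClassifierCV(LinearSVC(dual=False), cv=3)
model.fit(X, y)

proba = model.predict_proba(X)

# predict() picks the most probable class, i.e. a 0.5 cutoff in the binary case.
default_labels = model.classes_[np.argmax(proba, axis=1)]
same_as_predict = np.array_equal(default_labels, model.predict(X))

# Stricter cutoff: only call a sample positive when P(class 1) exceeds 0.6.
strict_labels = (proba[:, 1] > 0.6).astype(int)
```

Raising the cutoff can only remove positives relative to the default labels, which is the trade-off the question is after: fewer false positives at the cost of more false negatives.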
There are two methods available when we build a model with sklearn.cluster.KMeans. The first is fit() and the other is fit_predict(). My understanding is that when we use the fit() method on a KMeans model, it sets an attribute labels_, which holds the info on which observation belongs to which cluster. After fit_predict(), the model also has the labels_ attribute.
So my questions are:
If fit() fulfills the need, then why is there a fit_predict()?
Are fit() and fit_predict() interchangeable when writing code?
KMeans is just one of the many models that sklearn has, and many share the same API. The basic functions are fit, which teaches the model using examples, and predict, which uses the knowledge obtained by fit to answer questions on potentially new values.
KMeans will automatically predict the cluster of all the input data during the training, because doing so is integral to the algorithm. It keeps them around for efficiency, because predicting the labels for the original dataset is very common. Thus, fit_predict adds very little: it is just a convenience method that calls fit, then returns the labels_ of the training dataset. (fit_predict is a method, so it doesn't have a labels_ attribute; it just gives you the labels.)
However, if you want to train your model on one set of data and then use this to quickly (and without changing the established cluster boundaries) get an answer for a data point that was not in the original data, you would need to use predict, not fit_predict.
In other models (for example sklearn.neural_network.MLPClassifier), training can be a very expensive operation, so you may not want to re-train a model every time you want to predict something; also, it may not be a given that the prediction result is generated as a part of the training. Or, as discussed above, you just don't want to change the model in response to new data. In those cases, you cannot get predictions from the result of fit: you need to call predict with the data you want to get a prediction on.
Also note that labels_ ends with an underscore. In scikit-learn, a trailing underscore marks an attribute that was estimated during fit (Python's "private" convention is a leading underscore), so labels_ only ever describes the training data. For anything else, use the established API, i.e. predict.
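The difference between the three calls can be seen in a short sketch. The blob data, the number of clusters and the two new points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
train_labels = km.fit_predict(X)   # fits, then returns the training labels

# The same labels stay available on the fitted estimator.
labels_match = np.array_equal(train_labels, km.labels_)

# predict() assigns unseen points to the existing centroids without refitting.
X_new = np.array([[0.0, 0.0], [10.0, 10.0]])
centers_before = km.cluster_centers_.copy()
new_labels = km.predict(X_new)
centers_unchanged = np.array_equal(centers_before, km.cluster_centers_)
```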
In scikit-learn, there are similar things such as fit and fit_transform.
Clustering needs fit followed by either predict or labels_.
Thus fit_predict is just a more compact way to write that, and its result is the same as the result of fit followed by predict (or labels_).
In addition, the fitted clustering model is used only once when determining cluster labels of samples.
I am trying to use Keras to fit a CNN model to classify images. The data set has many more images from certain classes, so it is unbalanced.
I have read different things on how to weight the loss to account for this in Keras, e.g.:
https://datascience.stackexchange.com/questions/13490/how-to-set-class-weights-for-imbalanced-classes-in-keras, which is nicely explained. But it is always explained for the fit() function, not the fit_generator() one.
Indeed, in the fit_generator() function we don't have the 'class_weights' parameter, but instead we have 'weighted_metrics', whose description I don't understand: "weighted_metrics: List of metrics to be evaluated and weighted by sample_weight or class_weight during training and testing."
How can I go from 'class_weights' to 'weighted_metrics'? Would anyone have a simple example?
We do have class_weight in fit_generator (Keras v2.2.2). According to the docs:
class_weight: Optional dictionary mapping class indices (integers) to
a weight (float) value, used for weighting the loss function (during
training only). This can be useful to tell the model to "pay more
attention" to samples from an under-represented class.
Assume you have two classes [positive and negative] with class indices 0 and 1; you can pass class_weight to fit_generator as a dictionary:
model.fit_generator(gen, class_weight={0: 0.7, 1: 1.3})
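If the weights are not known in advance, one option is to derive them from the label counts. The sketch below uses scikit-learn's compute_class_weight helper, which is an addition not mentioned in the thread, on a made-up label vector; the resulting dictionary has the {class index: weight} shape the Keras docs describe.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative): 90 samples of class 0, 10 samples of class 1.
y_train = np.array([0] * 90 + [1] * 10)

classes = np.unique(y_train)
# "balanced" uses n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)

# Keras expects a {class_index: weight} dictionary.
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)

# The dictionary can then be passed on (not run here):
# model.fit_generator(gen, class_weight=class_weight)
```

This assumes the generator yields labels whose class indices match the keys of the dictionary.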
I'm trying to compute the AUC score for a multiclass problem using sklearn's roc_auc_score() function.
I have a prediction matrix of shape [n_samples,n_classes] and a ground truth vector of shape [n_samples], named np_pred and np_label respectively.
What I'm trying to achieve is a set of AUC scores, one for each class that I have.
To do so I would like to set the average parameter to None and the multi_class parameter to "ovr", but if I run
roc_auc_score(y_score=np_pred, y_true=np_label, multi_class="ovr",average=None)
I get back
ValueError: average must be one of ('macro', 'weighted') for multiclass problems
This error is expected from the sklearn function in the multiclass case. But if you look at the roc_auc_score source code, you can see that when multi_class is set to "ovr" and average is one of the accepted values, the multiclass case is treated as a multilabel one, and the internal multilabel function accepts None as the average parameter.
So, judging by the code, it seems I should be able to run a multiclass problem with average=None in the one-vs-rest case, but the if statements in the source code do not allow that combination.
Am I wrong?
In case I'm wrong, from a theoretical point of view should I fake a multilabel case just to have the different AUCs for the different classes or should I write my own function that cycles the different classes and outputs the AUCs?
Thanks
According to the sklearn documentation, the default value of multi_class is 'raise'. The documentation notes that this default throws an exception for multiclass targets, so you have to pass 'ovr' or 'ovo' explicitly, e.g. multi_class='ovr'.
Refer to the attached screenshot
As you already know, right now sklearn's multiclass ROC AUC only handles the macro and weighted averages. But you can implement it yourself so that it returns the score for each class individually.
Theoretically speaking, you could implement OVR and calculate per-class roc_auc_score, as:
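from sklearn.metrics import roc_auc_score

# One-vs-rest: refit the classifier once per label and score it as a binary problem.
roc = {}
for label in multi_class_series.unique():
    selected_classifier.fit(train_set_dataframe, train_class == label)
    # Column 1 holds the probability of the positive class (class == label).
    predictions_proba = selected_classifier.predict_proba(test_set_dataframe)
    roc[label] = roc_auc_score(test_class == label, predictions_proba[:, 1])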
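When a prediction matrix like np_pred already exists, the per-class scores can also be computed without refitting anything, by binarizing the labels and scoring each column on its own. The sketch below builds np_pred and np_label from a synthetic dataset and a LogisticRegression purely for illustration; only the names come from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

X, np_label = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X, np_label)
np_pred = clf.predict_proba(X)            # shape (n_samples, n_classes)

classes = clf.classes_                    # column order of np_pred
y_onehot = label_binarize(np_label, classes=classes)

# One binary "this class vs the rest" AUC per column.
per_class_auc = {
    int(c): roc_auc_score(y_onehot[:, i], np_pred[:, i])
    for i, c in enumerate(classes)
}

# The macro one-vs-rest score is the unweighted mean of those values.
macro_ovr = roc_auc_score(np_label, np_pred, multi_class="ovr", average="macro")
```

This is the "fake a multilabel case" route from the question, written out explicitly so the column-to-class mapping stays visible.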
I'm using SGDClassifier with loss="hinge". But hinge loss does not support probability estimates for class labels.
I need probabilities for calculating roc_curve. How can I get probabilities for hinge loss in SGDClassifier without using SVC from svm?
I've seen people mention using CalibratedClassifierCV to get the probabilities, but I've never used it and I don't know how it works.
I really appreciate the help. Thanks
In the strict sense, that's not possible.
Support vector machine classifiers are non-probabilistic: they use a hyperplane (a line in 2D, a plane in 3D and so on) to separate points into one of two classes. Points are only characterized by which side of the hyperplane they are on, which forms the prediction directly.
This is in contrast with probabilistic classifiers like logistic regression and decision trees, which generate a probability for every point that is then converted to a prediction.
CalibratedClassifierCV is a sort of meta-estimator; to use it, you simply pass your instance of a base estimator to its constructor, so this will work:
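from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier

base_model = SGDClassifier()  # the default loss is "hinge"
model = CalibratedClassifierCV(base_model)
model.fit(X, y)
model.predict_proba(X)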
What it does is perform internal cross-validation to fit a calibrator that maps the base model's decision scores to probability estimates. Note that this is similar to what sklearn.svm.SVC does when probability=True.
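For the ROC curve specifically, probabilities are not strictly required: roc_curve accepts any score that ranks the samples, and decision_function provides one. A minimal sketch on synthetic data (the dataset and random_state are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score, roc_curve

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)

# With hinge loss the estimator exposes no predict_proba method.
has_proba = hasattr(clf, "predict_proba")

# The signed distance to the hyperplane is a valid ranking score for roc_curve.
scores = clf.decision_function(X)
fpr, tpr, thresholds = roc_curve(y, scores)
auc = roc_auc_score(y, scores)
```

Calibration is still the route to take when actual probability values are needed, for example to apply a probability cutoff.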