I have built a number of sklearn classifier models to perform multi-label classification and I would like to calibrate their predict_proba outputs so that I can obtain confidence scores. I would also like to use metrics such as sklearn.metrics.recall_score to evaluate them.
I have 4 labels to predict and the true labels are multi-hot encoded (e.g. [0, 1, 1, 1]). As a result, CalibratedClassifierCV does not directly accept my data:
clf = tree.DecisionTreeClassifier(max_depth=15)
clf = clf.fit(train_X, train_Y)
calibrated_clf = CalibratedClassifierCV(clf, cv="prefit", method="sigmoid")
calibrated_clf.fit(dev_X, dev_Y)
This would return an error:
ValueError: classes [[0 1]
[0 1]
[0 1]
[0 1]] mismatch with the labels [0 1 2 3] found in the data
Thus, I tried to wrap it in a OneVsRestClassifier:
clf = OneVsRestClassifier(tree.DecisionTreeClassifier(max_depth=15), n_jobs=4)
clf = clf.fit(train_X, train_Y)
calibrated_clf = CalibratedClassifierCV(clf, cv="prefit", method="sigmoid")
calibrated_clf.fit(dev_X, dev_Y)
This works, but the predict output of the calibrated classifier is multi-class instead of multi-label because of its implementation. There are four classes ([0 1 2 3]), and even when no label applies, it still predicts 0.
Note that MultiOutputClassifier and ClassifierChain do not work as the wrapper here, even though they might suit my problem better.
Upon further inspection by means of calibration curves, it turns out the base estimator wrapped inside the calibrated classifier is not calibrated at all. That is, (calibrated_clf.calibrated_classifiers_)[0].base_estimator returns the same clf as before calibration.
I would like to observe the performance of my (calibrated) models doing deterministic (predict) and probabilistic (predict_proba) predictions. How should I design my model/wrap things in other containers to get both calibrated probabilities for each label and comprehensible label predictions?
In your example, you're using a DecisionTreeClassifier, which by default supports targets of shape (n, m) where m > 1.
However, if you want the marginal probability of each class as the result, use OneVsRestClassifier.
Notice that CalibratedClassifierCV expects the target to be 1d, so the "trick" is to extend it to multilabel classification with MultiOutputClassifier.
Full Example
from sklearn.datasets import make_multilabel_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split, StratifiedKFold
import numpy as np
# Generate a sample multilabel target
X, y = make_multilabel_classification(n_classes=4, random_state=0)
y
>>>
array([[1, 0, 1, 0],
[0, 0, 0, 0],
[1, 0, 1, 0],
...
[0, 0, 0, 0],
[0, 1, 1, 1],
[1, 1, 0, 1]])
# Split in train/test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.9, random_state=42
)
# Stratified folds, used internally by CalibratedClassifierCV
cv = StratifiedKFold(n_splits=2)
# Decision trees support multi-output targets by default; use OneVsRestClassifier to get marginal probabilities
clf = OneVsRestClassifier(DecisionTreeClassifier(max_depth=10))
# Calibrate the estimator's probabilities (estimator passed positionally, since newer
# scikit-learn versions renamed the base_estimator keyword to estimator)
calibrated_clf = CalibratedClassifierCV(clf, cv=cv)
# CalibratedClassifierCV expects a 1d target, so wrap it to fit one calibrated model per label
multioutput_clf = MultiOutputClassifier(calibrated_clf).fit(X_train, y_train)
# Check predict
multioutput_clf.predict(X_test[-5:])
>>>
array([[0, 0, 1, 1],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1]])
# Check predict_proba
multioutput_clf.predict_proba(X_test[-5:])
>>>
[array([[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685]]),
array([[0.59166537, 0.40833463],
[0.59166537, 0.40833463],
[0.40833361, 0.59166639],
[0.59166537, 0.40833463],
[0.59166537, 0.40833463]]),
array([[0.61666922, 0.38333078],
[0.61666427, 0.38333573],
[0.80000098, 0.19999902],
[0.61666427, 0.38333573],
[0.61666427, 0.38333573]]),
array([[0.26874774, 0.73125226],
[0.26874774, 0.73125226],
[0.45208444, 0.54791556],
[0.26874774, 0.73125226],
[0.26874774, 0.73125226]])]
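Notice that the result from predict_proba is a list of 4 arrays, one per label. Each array has one row per sample and two columns: the probability that the label is absent (0) and the probability that it is present (1). For example, the first row of the first array holds the probabilities that the first sample does not, and does, belong to the first class, and so on.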
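To score such a model with metrics like recall_score, the list of per-label arrays can be collapsed into a single (n_samples, n_labels) matrix. The following is a minimal sketch rather than the answer's own code: the dataset size, the cv=3 setting and the 0.5 threshold are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic multilabel data (sizes are illustrative assumptions)
X, y = make_multilabel_classification(n_samples=400, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# One calibrated decision tree per label
model = MultiOutputClassifier(
    CalibratedClassifierCV(DecisionTreeClassifier(max_depth=10, random_state=0), cv=3)
).fit(X_train, y_train)

# predict_proba returns one (n_samples, 2) array per label;
# column 1 is the probability that the label is present
proba_list = model.predict_proba(X_test)
proba = np.column_stack([p[:, 1] for p in proba_list])

# Turn probabilities into multi-hot predictions (0.5 is an assumed threshold)
threshold = 0.5
y_pred = (proba >= threshold).astype(int)

# Multilabel metrics accept the multi-hot matrices directly
macro_recall = recall_score(y_test, y_pred, average="macro", zero_division=0)
print(proba.shape, macro_recall)
```

The threshold can be tuned per label if some labels need higher recall than others.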
Regarding the calibration curves, scikit-learn provides examples that plot the probability paths for two-dimensional and three-dimensional targets.
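For the multilabel case, one option is to compute a separate calibration curve for each label from its positive-class probabilities. This is a sketch under assumptions (synthetic data, cv=3, five bins), using sklearn.calibration.calibration_curve; it prints the curve points instead of plotting them.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic multilabel data (sizes are illustrative assumptions)
X, y = make_multilabel_classification(n_samples=400, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

model = MultiOutputClassifier(
    CalibratedClassifierCV(DecisionTreeClassifier(max_depth=10, random_state=0), cv=3)
).fit(X_train, y_train)

curves = []
for label_idx, label_proba in enumerate(model.predict_proba(X_test)):
    # Fraction of true positives vs. mean predicted probability, per bin
    prob_true, prob_pred = calibration_curve(
        y_test[:, label_idx], label_proba[:, 1], n_bins=5
    )
    curves.append((prob_true, prob_pred))
    print(f"label {label_idx}: predicted={prob_pred.round(2)} observed={prob_true.round(2)}")
```

A well-calibrated label has observed fractions close to the predicted values; each (prob_pred, prob_true) pair can be passed to a plotting library to draw the usual reliability diagram.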
Lasso regression solution in R
The above link contains the code for a Lasso regression solution in R. I am trying to solve it in Python. Can someone help me solve it in Python?
[Image: Output]
The output should look like the picture above.
Try the approach below:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LassoCV, Lasso
from sklearn.model_selection import train_test_split
# train_x, train_y, df and crimerate_df are assumed to come from your own data
# Choose the shrinkage parameter alpha by 5-fold cross-validation
new_module = LassoCV(cv=5, random_state=0, max_iter=10000)
new_module.fit(train_x, train_y)
new_module.alpha_
# Refit a Lasso with the selected alpha
BestLassofit = Lasso(alpha=new_module.alpha_)
BestLassofit.fit(train_x, train_y)
importance = np.abs(BestLassofit.coef_)[1:]
importance[:10]
# Keep only the columns with non-zero coefficients
Col = np.array(df.Col)[importance > 0]
x = sm.add_constant(df[Col])
train_x, test_x, train_y, test_y = train_test_split(
    x, crimerate_df, test_size=0.2, random_state=123
)
# Fit ordinary least squares on the selected columns
train_x_tmp = sm.add_constant(train_x)
lmod = sm.OLS(train_y, train_x_tmp).fit()
lmod.summary()
lmod.predict()[:10]
lmod.get_prediction().summary_frame()[:10]
sm.qqplot(lmod.resid, line="q")
plt.title("Q-Q plot of Standardized Residuals")
plt.show()
I'm a huge fan of scikit-learn's linear models module, where you can find sklearn.linear_model.Lasso for an out-of-the-box Lasso regression implementation.
Example from the docs:
>>> from sklearn import linear_model
>>> clf = linear_model.Lasso(alpha=0.1)
>>> clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
Lasso(alpha=0.1)
>>> print(clf.coef_)
[0.85 0. ]
>>> print(clf.intercept_)
0.15...
The link you sent seems to want you to tune the "shrinkage" parameter (which I imagine is alpha), so you could create a loop where you iterate over values of alpha, record the score (i.e. dataset error), and create the plot they display in the link.
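The loop described above could look roughly like this. It is a sketch on synthetic data, not a port of the linked R code: the alpha grid, the train/test split and the use of test-set mean squared error as the score are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real dataset
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

alphas = np.logspace(-3, 2, 20)  # candidate shrinkage values (assumed grid)
test_mse = []
n_nonzero = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    residuals = y_test - model.predict(X_test)
    test_mse.append(float(np.mean(residuals ** 2)))       # error for this alpha
    n_nonzero.append(int(np.sum(model.coef_ != 0)))       # surviving coefficients

best_alpha = alphas[int(np.argmin(test_mse))]
print("best alpha:", best_alpha)
```

Plotting test_mse against alphas on a logarithmic x-axis gives the kind of error-versus-shrinkage curve the question refers to; LassoCV automates the same search with cross-validation.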
I have been trying to cluster models based on their SGD parameters (coefficients and intercept). coef_ holds the weights w and intercept_ holds b.
How can those parameters be used to cluster (with KMedoids) a group of learned models?
import numpy as np
from sklearn import linear_model
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = linear_model.SGDClassifier()
clf.fit(X, Y)
So I want to cluster based on clf.coef_ (array([[19.47419669, 9.73709834]])) and clf.intercept_ (array([-10.])) for each learned model.
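Build your X dataset for clustering (a separate array from the training data, starting empty with shape (0, n_features + 1)) by appending the flattened coefficients and the intercept as one row every time you train a model, i.e.:
X = np.vstack((X, np.hstack((clf.coef_.ravel(), clf.intercept_))))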
Once you have all your data in X, feed it to a KMedoids model, i.e.:
from sklearn_extra.cluster import KMedoids
kmed = KMedoids(n_clusters=N).fit(X)
Note that you have to specify N, and you should probably test the clustering results for several values of N before choosing the best one based on one or more clustering metrics.
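As an end-to-end illustration of the row-building step, here is a minimal sketch that trains several SGDClassifier models and stacks their parameters into one matrix. The synthetic data, the random subsets and the name params are assumptions made for the example; the clustering call itself is left as in the answer above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary data with 2 features (illustrative assumption)
X_data, y_data = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

rng = np.random.default_rng(0)
rows = []
for seed in range(6):
    # Each model is trained on a different random subset of the data
    idx = rng.choice(len(X_data), size=100, replace=False)
    clf = SGDClassifier(random_state=seed).fit(X_data[idx], y_data[idx])
    # coef_ has shape (1, 2) and intercept_ has shape (1,), so flatten before joining
    rows.append(np.hstack((clf.coef_.ravel(), clf.intercept_)))

# One row per model: [w_1, w_2, b]
params = np.vstack(rows)
print(params.shape)
```

The resulting params matrix plays the role of X in the KMedoids call above, with one row per learned model.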
Okay, so when I use the following code, what exactly does that "clf" part mean? Is that a variable? I know that's a classifier, but is classifier a function in Python, or is it just a variable named that way, or what exactly? I am new to Python and to programming as well.
thanks already!
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
From the docs:
[GNB] can perform online updates to model parameters via partial_fit method
Example:
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> Y = np.array([1, 1, 1, 2, 2, 2])
>>> from sklearn.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(X, Y)
GaussianNB(priors=None, var_smoothing=1e-09)
>>> print(clf.predict([[-0.8, -1]]))
[1]
>>> clf_pf = GaussianNB()
>>> clf_pf.partial_fit(X, Y, np.unique(Y))
GaussianNB(priors=None, var_smoothing=1e-09)
>>> print(clf_pf.predict([[-0.8, -1]]))
[1]
What is a classifier, one may ask? According to Wikipedia, a classifier is
an algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.
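To address the original question directly: clf is just a variable name, chosen by convention as short for "classifier". GaussianNB is a class, and calling it creates an object that the variable refers to. A small sketch, reusing the data from the docs example above (the name my_model is arbitrary and only there to show that the name does not matter):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

# GaussianNB is a class; calling GaussianNB() builds a new classifier object
clf = GaussianNB()
class_name = type(clf).__name__
print(class_name)                    # the class the variable points to
print(isinstance(clf, GaussianNB))   # True

# Any other valid variable name works exactly the same way
my_model = GaussianNB()
my_model.fit(X, Y)
prediction = my_model.predict([[-0.8, -1]])
print(prediction)
```

Methods such as fit and predict belong to the object, not to the name clf.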
I know that an SVM (specifically a linear SVC) accepts probability=True as an optional parameter at instantiation, after which model.predict_proba() is supposed to give the probability of each prediction along with the label (1 or 0). However, I keep getting the numpy error "use all() on an 1 dimensional array" when I call predict_proba(), and I can only figure out how to get a prediction in the form of a label (1 or 0) using model.predict().
The documentation example works fine for me with the flag probability=True set. The problem has to be in your input data. Try this very simple example:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
clf = SVC(probability=True)
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
print(clf.predict_proba([[-0.8, -1]]))
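You can use CalibratedClassifierCV, since LinearSVC itself has no predict_proba method.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
model_svc = LinearSVC()
model = CalibratedClassifierCV(model_svc)
model.fit(X_train, y_train)
# Predict on features (X_test), not on labels
pred_class = model.predict(X_test)
probability = model.predict_proba(X_test)
predict_proba returns an array with one row per sample and one column of predicted probability per class.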
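A self-contained version of this approach, for anyone who wants to run it end to end. This is a sketch on synthetic data; the dataset, the split, max_iter and cv=5 are assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary classification data (illustrative assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# LinearSVC only exposes decision_function; the calibration wrapper
# maps those scores to probabilities using cross-validation
model = CalibratedClassifierCV(LinearSVC(max_iter=10000), cv=5)
model.fit(X_train, y_train)

pred_class = model.predict(X_test)         # hard labels
probability = model.predict_proba(X_test)  # one column per class
print(probability[:3])
```

Each row of probability sums to 1, and the columns follow the order of model.classes_.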