sklearn - how do you get a probability instead of a label? - python

I know the SVM (specifically linear SVC) has an option, namely probability=True, as an optional parameter when you instantiate it, and that model.predict_proba() is then supposed to give the probability of each of its predictions along with the label (1 or 0). However, I keep getting the numpy error "use all() on a 1 dimensional array" when I call predict_proba(), and I can only figure out how to get a prediction in the form of a label (1 or 0) using model.predict().

The documentation example works fine for me when I set the flag probability=True, so the problem has to be in your input data. Try this very simple example:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
clf = SVC(probability=True)
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
print(clf.predict_proba([[-0.8, -1]]))

You can use CalibratedClassifierCV:
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
model_svc = LinearSVC()
model = CalibratedClassifierCV(model_svc)
model.fit(X_train, y_train)
pred_class = model.predict(X_test)
probability = model.predict_proba(X_test)
This gives you the predicted probability of each class as an array.
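If it helps, here is a minimal self-contained sketch of the same idea; the data below is a stand-in generated with make_classification rather than your actual X_train/X_test:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
# Hypothetical toy data standing in for your own train/test split
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# LinearSVC has no predict_proba of its own, so wrap it in CalibratedClassifierCV
model = CalibratedClassifierCV(LinearSVC())
model.fit(X_train, y_train)
print(model.predict(X_test[:3]))        # hard labels, e.g. [0 1 0]
print(model.predict_proba(X_test[:3]))  # shape (3, 2): one probability per class, rows sum to 1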

Related

Sklearn: Calibrate a multi-label classification with CalibratedClassifierCV

I have built a number of sklearn classifier models to perform multi-label classification and I would like to calibrate their predict_proba outputs so that I can obtain confidence scores. I would also like to use metrics such as sklearn.metrics.recall_score to evaluate them.
I have 4 labels to predict and the true labels are multi-hot encoded (e.g. [0, 1, 1, 1]). As a result, CalibratedClassifierCV does not directly accept my data:
clf = tree.DecisionTreeClassifier(max_depth=15)
clf = clf.fit(train_X, train_Y)
calibrated_clf = CalibratedClassifierCV(clf, cv="prefit", method="sigmoid")
calibrated_clf.fit(dev_X, dev_Y)
This would return an error:
ValueError: classes [[0 1]
[0 1]
[0 1]
[0 1]] mismatch with the labels [0 1 2 3] found in the data
Thus, I tried to wrap it in a OneVsRestClassifier:
clf = OneVsRestClassifier(tree.DecisionTreeClassifier(max_depth=15), n_jobs=4)
clf = clf.fit(train_X, train_Y)
calibrated_clf = CalibratedClassifierCV(clf, cv="prefit", method="sigmoid")
calibrated_clf.fit(dev_X, dev_Y)
Note that MultiOutputClassifier and ClassifierChain do not work even though they possibly suit my problem better.
It works, but the predict output of the calibrated classifier is multi-class instead of multi-label because of its implementation. There are four classes ([0 1 2 3]) but if there is no need to put a label, it still predicts a 0.
Upon further inspection by means of calibration curves, it turns out the base estimator wrapped inside the calibrated classifier is not calibrated at all. That is, (calibrated_clf.calibrated_classifiers_)[0].base_estimator returns the same clf as before calibration.
I would like to observe the performance of my (calibrated) models doing deterministic (predict) and probabilistic (predict_proba) predictions. How should I design my model/wrap things in other containers to get both calibrated probabilities for each label and comprehensible label predictions?
In your example, you're using a DecisionTreeClassifier, which by default supports targets of dimension (n, m) where m > 1.
However, if you want the marginal probability of each class, use OneVsRestClassifier.
Notice that CalibratedClassifierCV expects the target to be 1-dimensional, so the "trick" is to extend it to multilabel classification with MultiOutputClassifier.
Full Example
from sklearn.datasets import make_multilabel_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split, StratifiedKFold
import numpy as np
# Generate a sample multilabel target
X, y = make_multilabel_classification(n_classes=4, random_state=0)
y
>>>
array([[1, 0, 1, 0],
[0, 0, 0, 0],
[1, 0, 1, 0],
...
[0, 0, 0, 0],
[0, 1, 1, 1],
[1, 1, 0, 1]])
# Split in train/test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.9, random_state=42
)
# Splits Stratify target variable
cv = StratifiedKFold(n_splits=2)
# Decision trees support multilabel targets by default; wrap in OneVsRestClassifier to get marginal probabilities per label
clf = OneVsRestClassifier(DecisionTreeClassifier(max_depth=10))
# Calibrate estimator probabilities
calibrated_clf = CalibratedClassifierCV(base_estimator=clf, cv=cv)
# calibrated_clf target is one dimensional, extend classifier to multi-target classification.
multioutput_clf = MultiOutputClassifier(calibrated_clf).fit(X_train, y_train)
# Check predict
multioutput_clf.predict(X_test[-5:])
>>>
array([[0, 0, 1, 1],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1]])
# Check predict_proba
multioutput_clf.predict_proba(X_test[-5:])
>>>
[array([[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685]]),
array([[0.59166537, 0.40833463],
[0.59166537, 0.40833463],
[0.40833361, 0.59166639],
[0.59166537, 0.40833463],
[0.59166537, 0.40833463]]),
array([[0.61666922, 0.38333078],
[0.61666427, 0.38333573],
[0.80000098, 0.19999902],
[0.61666427, 0.38333573],
[0.61666427, 0.38333573]]),
array([[0.26874774, 0.73125226],
[0.26874774, 0.73125226],
[0.45208444, 0.54791556],
[0.26874774, 0.73125226],
[0.26874774, 0.73125226]])]
Notice that the result of predict_proba is a list of 4 arrays, one per label. Each array has one row per sample and two columns: the probability that the label is absent and the probability that it is present. For example, the first row of the first array gives the probability that the first sample does not (first column) or does (second column) carry the first label.
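If you would rather have a single (n_samples, n_labels) matrix of positive-class probabilities, you can stack the second column of each array; a small sketch, assuming the multioutput_clf fitted above:
proba_list = multioutput_clf.predict_proba(X_test[-5:])
# Column 1 of each per-label array is the probability that the label is present
proba_matrix = np.column_stack([p[:, 1] for p in proba_list])
proba_matrix.shape
>>>
(5, 4)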
Regarding the calibration curves, scikit-learn provides examples of plotting them for binary and three-class targets.

Lasso regression on boston crime dataset in python

Lasso regression solution in R
The above link contains the code for a Lasso regression solution in R. I am trying to solve it in Python. Can someone help me solve it in Python?
Output
(The expected output is shown as an image in the original post.)
Try using the below approach:
import numpy as np
import sklearn.model_selection
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LassoCV, Lasso
# train_x, train_y, df and crimerate_df come from your own data preparation
# Pick the regularization strength (alpha) by cross-validation
new_module = LassoCV(cv=5, random_state=0, max_iter=10000)
new_module.fit(train_x, train_y)
new_module.alpha_
# Refit a plain Lasso with the selected alpha and look at coefficient magnitudes
BestLassofit = Lasso(alpha=new_module.alpha_)
BestLassofit.fit(train_x, train_y)
importance = np.abs(BestLassofit.coef_)[1:]
importance[:10]
# Keep only the columns whose Lasso coefficient is non-zero
Col = np.array(df.Col)[importance > 0]
x = sm.add_constant(df[Col])
train_x, test_x, train_y, test_y = sklearn.model_selection.train_test_split(
    x, crimerate_df, test_size=0.2, random_state=123
)
# Fit an OLS model on the selected features with statsmodels
train_x_tmp = sm.add_constant(train_x)
lmod = sm.OLS(train_y, train_x_tmp).fit()
lmod.summary()
lmod.predict()[:10]
lmod.get_prediction().summary_frame()[:10]
# Q-Q plot of the residuals
sm.qqplot(lmod.resid, line="q")
plt.title("Q-Q plot of Standardized Residuals")
plt.show()
I'm a huge fan of scikit-learn's linear models module, where you can find sklearn.linear_model.Lasso for an out-of-the-box Lasso regression implementation.
Example from the docs:
>>> from sklearn import linear_model
>>> clf = linear_model.Lasso(alpha=0.1)
>>> clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
Lasso(alpha=0.1)
>>> print(clf.coef_)
[0.85 0. ]
>>> print(clf.intercept_)
0.15...
The link you sent seems to want you to tune the "shrinkage" parameter (which I imagine is alpha), so you could create a loop where you iterate over values of alpha, record the score (i.e. dataset error), and create the plot they display in the link.
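A minimal sketch of that loop, assuming a generic regression dataset (made up with make_regression) stands in for the crime data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
# Hypothetical data standing in for the Boston crime dataset
X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
alphas = np.logspace(-3, 1, 30)
scores = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))  # R^2 on held-out data
plt.plot(alphas, scores)
plt.xscale("log")
plt.xlabel("alpha (shrinkage)")
plt.ylabel("test R^2")
plt.show()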

Clustering based on model parameters

I have been trying to cluster based on the SGD model parameters (coefficients and intercept). coef_ holds the weights w and intercept_ holds b.
How can those parameters be used for clustering (KMedoids) over a group of learned models?
import numpy as np
from sklearn import linear_model
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = linear_model.SGDClassifier()
clf.fit(X, Y)
So I want to make clustering based on clf.coef_ (array([[19.47419669, 9.73709834]])) and clf.intercept_ (array([-10.])) for each learned model.
Build your X dataset for clustering by appending the flattened coefficients and intercept every time you train a model, i.e.:
X = np.vstack((X, np.hstack((clf.coef_.ravel(), clf.intercept_))))
Once you have all your data in X, feed it to a KMedoids model, i.e.:
from sklearn_extra.cluster import KMedoids
kmed = KMedoids(n_clusters=N).fit(X)
Note that you have to specify N, and you should probably test the clustering results for a number of values of N before choosing the best one based on one or more clustering metrics.
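A small end-to-end sketch of that idea, with several models trained on made-up data subsets (replace them with your own), assuming scikit-learn-extra is installed:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn_extra.cluster import KMedoids
rng = np.random.RandomState(0)
param_rows = []
for _ in range(10):
    # Hypothetical per-model training data; substitute your own subsets
    X_i = rng.randn(50, 2)
    y_i = (X_i[:, 0] + X_i[:, 1] > 0).astype(int)
    clf = SGDClassifier().fit(X_i, y_i)
    # One row per model: flattened coefficients followed by the intercept
    param_rows.append(np.hstack((clf.coef_.ravel(), clf.intercept_)))
params = np.vstack(param_rows)           # shape (n_models, n_features + 1)
kmed = KMedoids(n_clusters=2, random_state=0).fit(params)
print(kmed.labels_)                      # cluster assignment of each model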

How to make and use Naive Bayes Classifier with Scikit

I'm following a book about machine learning in python and I just don't understand this code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation
from utilities import visualize_classifier
# Input file containing data
input_file = 'data_multivar_nb.txt'
# Load data from input file
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]
# Create Naive Bayes classifier
classifier = GaussianNB()
# Train the classifier
classifier.fit(X, y)
# Predict the values for training data
y_pred = classifier.predict(X)
# Compute accuracy
accuracy = 100.0 * (y == y_pred).sum() / X.shape[0]
print("Accuracy of Naive Bayes classifier =", round(accuracy, 2), "%")
I just have a few questions:
What does data[:, :-1] and data[:, -1] do?
The input file is in the form of:
2.18,0.57,0
4.13,5.12,1
9.87,1.95,2
4.02,-0.8,3
1.18,1.03,0
4.59,5.74,1
How does the computing accuracy part work?
What is X.shape[0]?
Lastly how do I use the classifier to predict the y for new values?
When you index a numpy array you use square brackets similar to a list.
my_list[-1] returns the last item in the list.
For example.
my_list = [1, 2, 3, 4]
my_list[-1]
4
If you're familiar with list indexing then you will know what a slice is.
my_list[:-1] returns all items from the beginning to the last-but-one.
my_list[:-1]
[1, 2, 3]
In your code, data[:, :-1] is simply indexing with slices in 2 dimensions: all rows, and every column except the last. Look up the documentation on numpy arrays for more information. Understanding ndarrays is a prerequisite for using sklearn.
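To connect it back to your input file, here is a small sketch on a 2-D array with the same shape as your data (two feature columns followed by a class label):
import numpy as np
data = np.array([[2.18, 0.57, 0],
                 [4.13, 5.12, 1],
                 [9.87, 1.95, 2]])
X = data[:, :-1]   # all rows, every column except the last -> the features
y = data[:, -1]    # all rows, only the last column -> the labels
print(X.shape[0])  # number of rows (samples), here 3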

How to use ExtraTreeClassifier to predict multiclass classifications

I'm quite new to machine learning techniques, and I'm having trouble following some of the scikit-learn documentation and other Stack Overflow posts. I'm trying to create a simple model from a bunch of medical data that will help me predict which of three classes a patient could fall into.
I load the data via pandas, convert all the objects to integers (Male = 0, Female=1 for example), and run the following code:
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.ensemble import ExtraTreesClassifier
# Upload data file with all integers:
data = pd.read_csv('datafile.csv')
y = data["Target"]
features = list(data.columns[:-1]) # Last column being the target data
x = data[features]
ydata = label_binarize(y, classes=[0, 1, 2])
n_classes = ydata.shape[1]
X_train, X_test, y_train, y_test = train_test_split(x, ydata, test_size=.5)
model2 = ExtraTreesClassifier()
model2.fit(X_train, y_train)
out = model2.predict(X_test)
print np.min(out),np.max(out)
The predicted values in out range between 0.0 and 1.0, but the classes I am trying to predict are 0, 1, and 2. What am I missing?
That's normal behaviour in scikit-learn.
There are two approaches possible:
A: You use label_binarize
Binarizing transforms y of shape [n_samples,] into shape [n_samples, n_classes] (one dimension is added; integers in range(0, n_classes) get transformed to binary indicator values).
Because fit receives this input, classifier.predict() will also return results of shape [n_predict_samples, n_classes] (with 0 and 1 as the only values). That's what you observe!
Example output: [[0 0 0 1], [1 0 0 0], [0 1 0 0]] = predictions for class: 3, 0, 1
B: You skip label_binarize (multi-class handling is done automatically by sklearn)
Without binarizing (assuming your data uses integer markers for classes): y has shape [n_samples,]
Because fit receives this input, classifier.predict() will also return results of shape [n_predict_samples,] (with possibly other values than 0 and 1)
Example output conform to above example: [3 0 1]
Both outputs are mentioned in the docs here:
predict(X)
Returns:
y : array of shape = [n_samples] or [n_samples, n_outputs]
The predicted classes.
Remark: the above behaviour should be valid for most/all classifiers! (not only ExtraTreesClassifier)
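For illustration, a small sketch contrasting the two approaches on made-up data (not the medical dataset from the question):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import label_binarize
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
# B: integer labels in, integer labels out
clf_b = ExtraTreesClassifier(random_state=0).fit(X, y)
print(clf_b.predict(X[:3]))   # e.g. [2 0 1]
# A: binarized labels in, indicator matrix out
Y = label_binarize(y, classes=[0, 1, 2])
clf_a = ExtraTreesClassifier(random_state=0).fit(X, Y)
print(clf_a.predict(X[:3]))   # e.g. [[0 0 1] [1 0 0] [0 1 0]]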
