Clustering based on model parameters - python

I have been trying to cluster learned models based on their SGD model parameters (coefficients and intercept). coef_ holds the weights w and intercept_ holds b.
How can those parameters be used to cluster (with KMedoids) a group of learned models?
import numpy as np
from sklearn import linear_model
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])
clf = linear_model.SGDClassifier()
clf.fit(X, Y)
So I want to cluster on clf.coef_ (array([[19.47419669, 9.73709834]])) and clf.intercept_ (array([-10.])) of each learned model.

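Build the dataset for clustering by appending the flattened coefficients and the intercept as one row each time you train a model, i.e.:
params = np.vstack((params, np.hstack((clf.coef_.ravel(), clf.intercept_))))
Here params starts as an empty array with one column per parameter, e.g. np.empty((0, 3)) for the two-feature example above; it is named params rather than X so the training data is not overwritten. Note that the attribute is coef_ (not coeff_), and that it must be flattened because coef_ is 2-D while intercept_ is 1-D.
Once all models are collected in params, feed it to a KMedoids model, i.e.:
from sklearn_extra.cluster import KMedoids
kmed = KMedoids(n_clusters=N).fit(params)
Note that you have to specify N, and you should probably test the clustering results for several values of N before choosing the best one based on one or more clustering metrics.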
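The steps in this answer can be sketched end to end as follows. This is only a rough illustration under stated assumptions: the synthetic datasets from make_classification, the number of models (12), the candidate range for N, and the use of the silhouette score as the clustering metric are all choices made for the example, not part of the question. KMeans from scikit-learn is used as a stand-in so the sketch runs without extra packages; KMedoids from sklearn_extra could be swapped in on the marked line.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for sklearn_extra.cluster.KMedoids
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import silhouette_score

# Train several models on (hypothetical) synthetic datasets and collect
# one row of parameters [w_1, w_2, b] per learned model.
rows = []
for seed in range(12):
    X_train, y_train = make_classification(
        n_samples=200, n_features=2, n_informative=2, n_redundant=0,
        random_state=seed,
    )
    clf = SGDClassifier(random_state=seed).fit(X_train, y_train)
    rows.append(np.hstack((clf.coef_.ravel(), clf.intercept_)))
params = np.vstack(rows)  # shape (n_models, n_features + 1)

# Try several values of N and keep the one with the best silhouette score.
scores = {}
for n in range(2, 6):
    # Swap in KMedoids(n_clusters=n) here if sklearn_extra is installed.
    labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(params)
    scores[n] = silhouette_score(params, labels)
best_n = max(scores, key=scores.get)
print(params.shape, best_n)
```

Because coefficients and intercepts can live on different scales, standardizing the columns of params before clustering may be worth trying.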

Related

Sklearn: Calibrate a multi-label classification with CalibratedClassifierCV

I have built a number of sklearn classifier models to perform multi-label classification and I would like to calibrate their predict_proba outputs so that I can obtain confidence scores. I would also like to use metrics such as sklearn.metrics.recall_score to evaluate them.
I have 4 labels to predict and the true labels are multi-hot encoded (e.g. [0, 1, 1, 1]). As a result, CalibratedClassifierCV does not directly accept my data:
clf = tree.DecisionTreeClassifier(max_depth=15)
clf = clf.fit(train_X, train_Y)
calibrated_clf = CalibratedClassifierCV(clf, cv="prefit", method="sigmoid")
calibrated_clf.fit(dev_X, dev_Y)
This would return an error:
ValueError: classes [[0 1]
[0 1]
[0 1]
[0 1]] mismatch with the labels [0 1 2 3] found in the data
Thus, I tried to wrap it in a OneVsRestClassifier:
clf = OneVsRestClassifier(tree.DecisionTreeClassifier(max_depth=15), n_jobs=4)
clf = clf.fit(train_X, train_Y)
calibrated_clf = CalibratedClassifierCV(clf, cv="prefit", method="sigmoid")
calibrated_clf.fit(dev_X, dev_Y)
It works, but the predict output of the calibrated classifier is multi-class instead of multi-label because of its implementation. There are four classes ([0 1 2 3]), and even when no label should be assigned it still predicts 0.
Note that MultiOutputClassifier and ClassifierChain do not work here, even though they possibly suit my problem better.
Upon further inspection by means of calibration curves, it turns out the base estimator wrapped inside the calibrated classifier is not calibrated at all. That is, (calibrated_clf.calibrated_classifiers_)[0].base_estimator returns the same clf as before calibration.
I would like to observe the performance of my (calibrated) models doing deterministic (predict) and probabilistic (predict_proba) predictions. How should I design my model/wrap things in other containers to get both calibrated probabilities for each label and comprehensible label predictions?
In your example, you're using a DecisionTreeClassifier, which by default supports targets of shape (n, m) where m > 1.
However, if you want the marginal probability of each class as the result, use OneVsRestClassifier.
Notice that CalibratedClassifierCV expects a 1-D target, so the "trick" is to extend it to multilabel classification with MultiOutputClassifier.
Full Example
from sklearn.datasets import make_multilabel_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
import numpy as np
# Generate a sample multilabel target
X, y = make_multilabel_classification(n_classes=4, random_state=0)
y
>>>
array([[1, 0, 1, 0],
[0, 0, 0, 0],
[1, 0, 1, 0],
...
[0, 0, 0, 0],
[0, 1, 1, 1],
[1, 1, 0, 1]])
# Split in train/test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.9, random_state=42
)
# Stratified splits, used by the calibration cross-validation
cv = StratifiedKFold(n_splits=2)
# Decision trees support multi-output targets by default; wrap in OneVsRestClassifier to get marginal probabilities
clf = OneVsRestClassifier(DecisionTreeClassifier(max_depth=10))
# Calibrate estimator probabilities (passed positionally: the keyword is base_estimator in older scikit-learn releases and estimator in newer ones)
calibrated_clf = CalibratedClassifierCV(clf, cv=cv)
# CalibratedClassifierCV expects a one-dimensional target, so extend it to multi-target classification
multioutput_clf = MultiOutputClassifier(calibrated_clf).fit(X_train, y_train)
# Check predict
multioutput_clf.predict(X_test[-5:])
>>>
array([[0, 0, 1, 1],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1]])
# Check predict_proba
multioutput_clf.predict_proba(X_test[-5:])
>>>
[array([[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685]]),
array([[0.59166537, 0.40833463],
[0.59166537, 0.40833463],
[0.40833361, 0.59166639],
[0.59166537, 0.40833463],
[0.59166537, 0.40833463]]),
array([[0.61666922, 0.38333078],
[0.61666427, 0.38333573],
[0.80000098, 0.19999902],
[0.61666427, 0.38333573],
[0.61666427, 0.38333573]]),
array([[0.26874774, 0.73125226],
[0.26874774, 0.73125226],
[0.45208444, 0.54791556],
[0.26874774, 0.73125226],
[0.26874774, 0.73125226]])]
Notice that the result from predict_proba is a list of 4 arrays, one per label. Array i has one row per sample and two columns: the probability that label i is absent (0) and the probability that it is present (1). For example, the first row of the first array holds those two probabilities for the first sample and the first label.
Regarding the calibration curves, scikit-learn provides examples that plot the probability path for two-dimensional and three-dimensional targets.
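To make the predict_proba output easier to work with, the per-label arrays can be stacked into a single (n_samples, n_labels) matrix of "label present" probabilities. The following is a minimal sketch under assumptions of its own: it uses 300 synthetic samples, a default train/test split, cv=3, and calibrates the decision tree directly (without the OneVsRestClassifier wrapper) to keep the example short. It also assumes every label has both classes in the training data, so each array has two columns.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic multilabel data with 4 labels (sizes are illustrative only)
X, y = make_multilabel_classification(n_samples=300, n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One calibrated classifier per label
base = DecisionTreeClassifier(max_depth=10, random_state=0)
model = MultiOutputClassifier(CalibratedClassifierCV(base, cv=3)).fit(X_train, y_train)

# predict_proba returns a list with one (n_samples, 2) array per label;
# column 1 is the probability that the label is present.
proba_list = model.predict_proba(X_test)
pos_proba = np.column_stack([p[:, 1] for p in proba_list])

# Deterministic multi-hot predictions and a multilabel-aware metric
y_pred = model.predict(X_test)
micro_recall = recall_score(y_test, y_pred, average="micro")
print(pos_proba.shape, y_pred.shape, round(micro_recall, 3))
```

Each column of pos_proba can then be passed, together with the matching column of y_test, to sklearn.calibration.calibration_curve to inspect the calibration of one label at a time.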

GridSearchCV: get attributes of a model

I use GridSearchCV to fit SVM, and I want to know the number of support vectors for all the fitted models. For now I can only access this SVM's attribute for the best model.
Toy example:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
clf = SVC()
params = {'C': [0.01, 0.1, 1]}
search = GridSearchCV(estimator=clf, cv=2, param_grid=params, return_train_score=True)
search.fit(X, y);
with the number of support vectors for the best model:
search.best_estimator_.n_support_
How to get the n_support_ for all models? Just as we get train/test error separately for each parameter C.
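One possible approach, offered as a sketch rather than a definitive answer: as far as I know, GridSearchCV keeps only the refitted best_estimator_ and not the models fitted for the other candidates, so the other candidates have to be refitted separately. The example below loops over the same grid with ParameterGrid and fits each candidate on the full data, which mirrors how best_estimator_ is refit; the dictionary name n_support is made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
params = {'C': [0.01, 0.1, 1]}

# Refit one SVC per candidate on the full data and record n_support_
# (the number of support vectors for each class).
n_support = {}
for candidate in ParameterGrid(params):
    model = SVC(**candidate).fit(X, y)
    n_support[candidate['C']] = model.n_support_.tolist()

for C, counts in n_support.items():
    print(C, counts)
```

If the per-fold counts are needed instead, the same loop could be nested inside a loop over the splits of a cross-validation splitter.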

How to make and use Naive Bayes Classifier with Scikit

I'm following a book about machine learning in python and I just don't understand this code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn import model_selection  # sklearn.cross_validation was removed in scikit-learn 0.20
from utilities import visualize_classifier
# Input file containing data
input_file = 'data_multivar_nb.txt'
# Load data from input file
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1]
# Create Naive Bayes classifier
classifier = GaussianNB()
# Train the classifier
classifier.fit(X, y)
# Predict the values for training data
y_pred = classifier.predict(X)
# Compute accuracy
accuracy = 100.0 * (y == y_pred).sum() / X.shape[0]
print("Accuracy of Naive Bayes classifier =", round(accuracy, 2), "%")
I just have a few questions:
What do data[:, :-1] and data[:, -1] do?
The input file is in the form of:
2.18,0.57,0
4.13,5.12,1
9.87,1.95,2
4.02,-0.8,3
1.18,1.03,0
4.59,5.74,1
How does the computing accuracy part work?
What is X.shape[0]?
Lastly how do I use the classifier to predict the y for new values?
When you index a numpy array you use square brackets similar to a list.
my_list[-1] returns the last item in the list.
For example:
my_list = [1, 2, 3, 4]
my_list[-1]
4
If you're familiar with list indexing then you will know what a slice is.
my_list[:-1] returns all items from the beginning to the last-but-one.
my_list[:-1]
[1, 2, 3]
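In your code, data[:, :-1] is simply indexing with slices in two dimensions: it takes every row and every column except the last (the features), while data[:, -1] takes every row of the last column only (the labels). Look up the documentation on numpy arrays for more information. Understanding ndarrays is a prerequisite for using sklearn.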
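The remaining questions can be illustrated with the six sample rows from the question. This is a small sketch, not the book's code: the data is typed in directly instead of being loaded from data_multivar_nb.txt, and the two new_points rows are made-up values used only to show how predict is called on unseen data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# The six sample rows from the question: two features, then the label
data = np.array([
    [2.18, 0.57, 0],
    [4.13, 5.12, 1],
    [9.87, 1.95, 2],
    [4.02, -0.8, 3],
    [1.18, 1.03, 0],
    [4.59, 5.74, 1],
])

X, y = data[:, :-1], data[:, -1]  # features / labels
print(X.shape)     # (6, 2): 6 rows (samples), 2 columns (features)
print(X.shape[0])  # 6: the number of samples

classifier = GaussianNB().fit(X, y)
y_pred = classifier.predict(X)

# (y == y_pred) is a boolean array; .sum() counts the True entries,
# i.e. the correct predictions. Dividing by the number of samples and
# multiplying by 100 gives the accuracy as a percentage.
accuracy = 100.0 * (y == y_pred).sum() / X.shape[0]
print("Accuracy =", round(accuracy, 2), "%")

# Predicting for new values: pass a 2-D array with one row per new sample
new_points = np.array([[4.0, 5.0], [1.5, 0.8]])  # hypothetical inputs
new_pred = classifier.predict(new_points)
print(new_pred)
```

Note that this accuracy is measured on the training data itself, so it says little about how the classifier performs on data it has not seen.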

How to calculate OLS regression with survey weights in Python

I want to do a linear regression on survey data with survey weights.
The survey data is from the EU and each observation has a weight (e.g. 0.4 for one respondent, 1.5 for another).
This weight is described as:
"The European Weight, variable 6, produces a representative sample of
the European Community as a whole when used in analysis. This variable
adjusts the size of each national sample according to each nation's
contribution to the population of the European Community."
To do my calculation I'm using sklearn.
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X,y, sample_weight = weights)
X is a pandas DataFrame. y is a numpy.ndarray. weights is a pandas Series.
Am I using 'sample_weight' correctly? Is this the correct way to handle survey weights in scikit-learn?
TL;DR: Yes.
Here is a very simple example of it working,
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
regr = linear_model.LinearRegression()
X = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([10, 20, 60]).reshape(-1, 1)
weights = np.array([1, 1, 1])
def weighted_lr(X, y, weights):
    """Quick function to run weighted linear regression and return a
    plot and some predictions"""
    regr.fit(X, y, sample_weight=weights)
    y_pred = regr.predict(X)
    plt.scatter(X, y)
    plt.plot(X, y_pred)
    plt.title('Weights: %s' % ', '.join(str(i) for i in weights))
    plt.show()
    return y_pred
y_pred = weighted_lr(X, y, weights)
print(y_pred)
weights = np.array([1000, 1000, 1])
y_pred = weighted_lr(X, y, weights)
print(y_pred)
[[ 7.14285714]
[ 24.28571429]
[ 58.57142857]]
[[ 9.96051333]
[ 20.05923001]
[ 40.25666338]]
In the first linear regression model, with equal weights, the model behaves as expected from an ordinary linear regression.
In the second model, however, the low weight on the last value means the fit almost ignores it: most of the training weight falls on the other two values, so the line passes close to them and far from the third point.
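To see what sample_weight does numerically, the fit can be compared with weighted least squares solved directly in numpy, where each row is scaled by the square root of its weight. This is a sketch for checking the behaviour on the toy data above, not a statement about how scikit-learn is implemented internally; the names design and beta are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1.0, 2.0, 4.0]).reshape(-1, 1)
y = np.array([10.0, 20.0, 60.0])
weights = np.array([1000.0, 1000.0, 1.0])

# Weighted fit with scikit-learn
regr = LinearRegression().fit(X, y, sample_weight=weights)

# Weighted least squares by hand: minimize sum_i w_i * (y_i - a - b * x_i)^2,
# which is ordinary least squares after scaling each row by sqrt(w_i).
design = np.column_stack([np.ones(len(X)), X[:, 0]])  # columns: intercept, x
sqrt_w = np.sqrt(weights)
beta, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w, rcond=None)

print(regr.intercept_, regr.coef_[0])
print(beta)  # [intercept, slope] from the manual solution
```

Both approaches should give the same intercept and slope, which confirms that the weights enter the squared-error objective as per-observation multipliers.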

sklearn - how do you get a probability instead of a label?

I know the SVM (specifically linear SVC) has an optional parameter, probability=True, that you set when you instantiate the model, after which model.predict_proba() is supposed to give the probability of each of its predictions along with the label (1 or 0). However, I keep getting the numpy error "use all() on an 1 dimensional array" when I call predict_proba(), and I can only figure out how to get a prediction in the form of a label (1 or 0) using model.predict().
Documentation example works fine for me setting the flag probability=True. The problem has to be in your input data. Try this very simple example:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
clf = SVC(probability=True)
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
print(clf.predict_proba([[-0.8, -1]]))
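You can use CalibratedClassifierCV.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
model_svc = LinearSVC()
model = CalibratedClassifierCV(model_svc)
model.fit(X_train, y_train)
# predict takes the feature matrix, not the labels
pred_class = model.predict(X_test)
# predict_vec: the feature rows to score, with the same columns as X_train
probability = model.predict_proba(predict_vec)
You will get the predicted probabilities as an array with one row per sample and one column per class.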
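A self-contained version of this second approach might look like the sketch below. The synthetic data from make_classification, the split, and cv=5 are assumptions made so the example runs on its own; LinearSVC itself has no predict_proba, which is why it is wrapped in CalibratedClassifierCV.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrap LinearSVC so that calibrated probabilities become available
model = CalibratedClassifierCV(LinearSVC(), cv=5).fit(X_train, y_train)

pred_class = model.predict(X_test)         # labels, shape (n_samples,)
probability = model.predict_proba(X_test)  # shape (n_samples, n_classes)

print(model.classes_)   # column order of predict_proba
print(probability[:3])  # each row sums to 1
```

The columns of probability follow the order of model.classes_, so probability[:, 1] is the probability of the second class listed there.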
