Okay, so when I use the following code, what exactly does that "clf" part mean? Is that a variable? I know that it's a classifier, but is "classifier" a function in Python, or is it just a variable named that way, or what exactly? I am new to Python and programming as well.
Thanks in advance!
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf is just a variable name: the line clf = GaussianNB() creates an instance of the GaussianNB classifier class and assigns it to that name, so clf is an object with methods like fit and predict. From the docs:
[GNB] can perform online updates to model parameters via partial_fit method
Example:
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> Y = np.array([1, 1, 1, 2, 2, 2])
>>> from sklearn.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(X, Y)
GaussianNB(priors=None, var_smoothing=1e-09)
>>> print(clf.predict([[-0.8, -1]]))
[1]
>>> clf_pf = GaussianNB()
>>> clf_pf.partial_fit(X, Y, np.unique(Y))
GaussianNB(priors=None, var_smoothing=1e-09)
>>> print(clf_pf.predict([[-0.8, -1]]))
[1]
What is a classifier, one may ask? According to Wikipedia:
An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.
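To see that the name itself carries no special meaning, here is a minimal sketch (same data as the docs example above) that uses a different variable name and behaves identically:
import numpy as np
from sklearn.naive_bayes import GaussianNB
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
my_model = GaussianNB()  # same as clf = GaussianNB(), just a different name
my_model.fit(X, Y)
print(my_model.predict([[-0.8, -1]]))  # [1]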
I have built a number of sklearn classifier models to perform multi-label classification and I would like to calibrate their predict_proba outputs so that I can obtain confidence scores. I would also like to use metrics such as sklearn.metrics.recall_score to evaluate them.
I have 4 labels to predict and the true labels are multi-hot encoded (e.g. [0, 1, 1, 1]). As a result, CalibratedClassifierCV does not directly accept my data:
clf = tree.DecisionTreeClassifier(max_depth=15)
clf = clf.fit(train_X, train_Y)
calibrated_clf = CalibratedClassifierCV(clf, cv="prefit", method="sigmoid")
calibrated_clf.fit(dev_X, dev_Y)
This would return an error:
ValueError: classes [[0 1]
[0 1]
[0 1]
[0 1]] mismatch with the labels [0 1 2 3] found in the data
Thus, I tried to wrap it in a OneVsRestClassifier:
clf = OneVsRestClassifier(tree.DecisionTreeClassifier(max_depth=15), n_jobs=4)
clf = clf.fit(train_X, train_Y)
calibrated_clf = CalibratedClassifierCV(clf, cv="prefit", method="sigmoid")
calibrated_clf.fit(dev_X, dev_Y)
Note that MultiOutputClassifier and ClassifierChain do not work even though they possibly suit my problem better.
It works, but the predict output of the calibrated classifier is multi-class instead of multi-label because of its implementation. There are four classes ([0 1 2 3]), but even when no label should be assigned, it still predicts a 0.
Upon further inspection by means of calibration curves, it turns out the base estimator wrapped inside the calibrated classifier is not calibrated at all. That is, (calibrated_clf.calibrated_classifiers_)[0].base_estimator returns the same clf as before calibration.
I would like to observe the performance of my (calibrated) models doing deterministic (predict) and probabilistic (predict_proba) predictions. How should I design my model/wrap things in other containers to get both calibrated probabilities for each label and comprehensible label predictions?
In your example, you're using a DecisionTreeClassifier, which by default supports targets of dimension (n, m) where m > 1.
However, if you want the marginal probability of each class as the result, then use OneVsRestClassifier.
Notice that CalibratedClassifierCV expects the target to be 1d, so the "trick" is to extend it to support multilabel classification with MultiOutputClassifier.
Full Example
from sklearn.datasets import make_multilabel_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split, StratifiedKFold
import numpy as np
# Generate a sample multilabel target
X, y = make_multilabel_classification(n_classes=4, random_state=0)
y
>>>
array([[1, 0, 1, 0],
[0, 0, 0, 0],
[1, 0, 1, 0],
...
[0, 0, 0, 0],
[0, 1, 1, 1],
[1, 1, 0, 1]])
# Split in train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.9, random_state=42
)
# CV splits stratified on the target variable
cv = StratifiedKFold(n_splits=2)
# DecisionTreeClassifier supports multilabel targets by default; use OneVsRest for per-class marginal probabilities
clf = OneVsRestClassifier(DecisionTreeClassifier(max_depth=10))
# Calibrate estimator probabilities
calibrated_clf = CalibratedClassifierCV(base_estimator=clf, cv=cv)
# CalibratedClassifierCV expects a 1d target, so extend it to multilabel with MultiOutputClassifier
multioutput_clf = MultiOutputClassifier(calibrated_clf).fit(X_train, y_train)
# Check predict
multioutput_clf.predict(X_test[-5:])
>>>
array([[0, 0, 1, 1],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1],
[0, 0, 0, 1]])
# Check predict_proba
multioutput_clf.predict_proba(X_test[-5:])
>>>
[array([[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685],
[0.78333315, 0.21666685]]),
array([[0.59166537, 0.40833463],
[0.59166537, 0.40833463],
[0.40833361, 0.59166639],
[0.59166537, 0.40833463],
[0.59166537, 0.40833463]]),
array([[0.61666922, 0.38333078],
[0.61666427, 0.38333573],
[0.80000098, 0.19999902],
[0.61666427, 0.38333573],
[0.61666427, 0.38333573]]),
array([[0.26874774, 0.73125226],
[0.26874774, 0.73125226],
[0.45208444, 0.54791556],
[0.26874774, 0.73125226],
[0.26874774, 0.73125226]])]
Notice that the result from predict_proba is a list of 4 arrays, one per label. Each array has two columns per sample: the probability of not having and of having that label. For example, the first row of the first array gives those two probabilities for the first sample and the first label, and so on.
Regarding the calibration curves, scikit-learn provides examples of plotting them for binary and for 3-class targets.
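For instance, a minimal sketch (assuming the multioutput_clf fitted above and matplotlib installed) that plots the calibration curve for the first label:
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
# calibrated probability that label 0 is present, for the test samples
proba_label0 = multioutput_clf.predict_proba(X_test)[0][:, 1]
# fraction of positives vs. mean predicted probability per bin
frac_pos, mean_pred = calibration_curve(y_test[:, 0], proba_label0, n_bins=5)
plt.plot(mean_pred, frac_pos, marker="o", label="label 0")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()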
I use GridSearchCV to fit an SVM, and I want to know the number of support vectors for all the fitted models. For now I can only access this attribute for the best model.
Toy example:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
clf = SVC()
params = {'C': [0.01, 0.1, 1]}
search = GridSearchCV(estimator=clf, cv=2, param_grid=params, return_train_score=True)
search.fit(X, y);
with the number of support vectors for the best model:
search.best_estimator_.n_support_
How can I get n_support_ for all models, just as we get the train/test scores separately for each value of the parameter C?
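One way to get at this (a sketch, not part of the original setup) is to refit an SVC for each parameter combination in the grid and record its n_support_; note that this refits on the full data rather than on the individual CV folds used inside GridSearchCV:
from sklearn.base import clone
from sklearn.model_selection import ParameterGrid
# refit one SVC per parameter combination and record its per-class support vector counts
n_support = {}
for param in ParameterGrid(params):
    model = clone(clf).set_params(**param)
    model.fit(X, y)
    n_support[param['C']] = model.n_support_
print(n_support)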
GridSearchCV's documentation states that I can pass a scoring function.
scoring : string, callable or None, default=None
I would like to use the native accuracy_score as a scoring function.
So here is my attempt. Imports and some data:
import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn import neighbors
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([0, 1, 0, 0, 0, 1])
Now when I use just k-fold cross-validation without my scoring function, everything works as intended:
parameters = {
    'n_neighbors': [2, 3, 4],
    'weights': ['uniform', 'distance'],
    'p': [1, 2, 3]
}
model = neighbors.KNeighborsClassifier()
k_fold = KFold(len(Y), n_folds=6, shuffle=True, random_state=0)
clf = GridSearchCV(model, parameters, cv=k_fold) # TODO will change
clf.fit(X, Y)
print clf.best_score_
But when I change the line to
clf = GridSearchCV(model, parameters, cv=k_fold, scoring=accuracy_score) # or accuracy_score()
I get the error ValueError: Cannot have number of folds n_folds=10 greater than the number of samples: 6, which in my opinion does not reflect the real problem.
In my opinion, the problem is that accuracy_score does not follow the signature scorer(estimator, X, y) described in the documentation.
So how can I fix this problem?
It will work if you change scoring=accuracy_score to scoring='accuracy' (see the documentation for the full list of scorers you can use by name in this way).
In theory, you should be able to pass custom scoring functions like you're trying, but my guess is that you're right and accuracy_score doesn't have the right API.
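If you do want to pass the metric function itself, one option (a sketch reusing model, parameters and k_fold from the code above) is to wrap it with make_scorer, which adapts a metric with the signature metric(y_true, y_pred) to the scorer(estimator, X, y) interface GridSearchCV expects:
from sklearn.metrics import accuracy_score, make_scorer
# make_scorer turns the metric into a scorer with the expected signature
clf = GridSearchCV(model, parameters, cv=k_fold, scoring=make_scorer(accuracy_score))
clf.fit(X, Y)
print(clf.best_score_)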
Here is an example of using weighted kappa as the scoring metric for GridSearchCV with a simple random forest model. The key learning for me was to pass the scorer's own parameters (here weights="quadratic") through the make_scorer function.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import cohen_kappa_score, make_scorer
kappa_scorer = make_scorer(cohen_kappa_score, weights="quadratic")
# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_features': range(2, 10),  # try 2 through 9 features
    'min_samples_leaf': [3, 4, 5],
    'n_estimators': [100, 300, 500],
    'max_depth': [5]
}
# Create a base model
random_forest = RandomForestClassifier(class_weight="balanced_subsample", random_state=1)
# Instantiate the grid search model, scoring with the weighted kappa scorer
grid_search = GridSearchCV(estimator=random_forest, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2, scoring=kappa_scorer)
# Fit the grid search to the data
grid_search.fit(final_tr, yTrain)
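After fitting, the best parameter combination and its mean cross-validated kappa can be read off in the usual way:
print(grid_search.best_params_)  # parameter combination with the highest quadratic kappa
print(grid_search.best_score_)   # mean cross-validated kappa for that combination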
I am building a WPP using a Modified Markov Model, but I need to build an SVM classifier to train on the webpage data.
What's the easiest way to implement this?
You can use Scikit-learn
Example:
from sklearn import svm
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)
And then for a prediction:
clf.predict([[2., 2.]])
I know the SVM (specifically linear SVC) has an option: when probability=True is passed as an optional parameter at instantiation, model.predict_proba() is supposed to give the probability of each of its predictions along with the label (1 or 0). However, I keep getting the numpy error "use all() on a 1-dimensional array" when I call predict_proba(), and I can only figure out how to get a prediction in the form of a label (1 or 0) using model.predict().
The documentation example works fine for me when setting the flag probability=True. The problem has to be in your input data. Try this very simple example:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
from sklearn.svm import SVC
clf = SVC(probability=True)
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
print(clf.predict_proba([[-0.8, -1]]))
You can use CalibratedClassifierCV.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
model_svc = LinearSVC()
model = CalibratedClassifierCV(model_svc)
model.fit(X_train, y_train)
pred_class = model.predict(X_test)  # predicted labels
probability = model.predict_proba(X_test)  # calibrated probability per class
predict_proba returns the predicted probability of each class as an array of scores.