SKLearn how to get decision probabilities for LinearSVC classifier - python

I am using scikit-learn's LinearSVC classifier for text mining. The y value is a 0/1 label and the X value is the TfidfVectorizer output for the text documents.
I use a pipeline like the one below:
pipeline = Pipeline([
    ('count_vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', LinearSVC())
])
For a prediction, I would like to get a confidence score or probability, in the range (0, 1), that a data point is classified as 1.
I currently use the decision function:
pipeline.decision_function(test_X)
It returns positive and negative values that seem to indicate confidence, but I am not sure what they mean.
Is there a way to get values in the range 0-1?
For example, here is the output of the decision function for some of the data points:
-0.40671879072078421,
-0.40671879072078421,
-0.64549376401063352,
-0.40610652684648957,
-0.40610652684648957,
-0.64549376401063352,
-0.64549376401063352,
-0.5468745098794594,
-0.33976011539714374,
0.36781572474117097,
-0.094943829974515004,
0.37728641897721765,
0.2856211778200019,
0.11775493140003235,
0.19387473663623439,
-0.062620918785563556,
-0.17080866610522819,
0.61791016307670399,
0.33631340372946961,
0.87081276844501176,
1.026991628346146,
0.092097790098391641,
-0.3266704728249083,
0.050368652422013376,
-0.046834129250376291,

You can't: LinearSVC itself does not provide a predict_proba method.
However, you can use sklearn.svm.SVC with kernel='linear' and probability=True.
It may take longer to run, but this classifier gives you probabilities through its predict_proba method.
import sklearn.svm
clf = sklearn.svm.SVC(kernel='linear', probability=True)
clf.fit(X, y)
clf.predict_proba(X_test)

If you want to keep using the LinearSVC class, you can wrap it in a sklearn.calibration.CalibratedClassifierCV object and fit the calibrated classifier, which gives you a probabilistic classifier.
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn import datasets
#Load iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # Using only two features
y = iris.target #3 classes: 0, 1, 2
linear_svc = LinearSVC()  # The base estimator
# The calibrated classifier, which can give probability estimates
calibrated_svc = CalibratedClassifierCV(linear_svc,
                                        method='sigmoid',  # sigmoid uses Platt scaling; see the documentation for other methods
                                        cv=3)
calibrated_svc.fit(X, y)
# predict
prediction_data = [[2.3, 5],
                   [4, 7]]
predicted_probs = calibrated_svc.predict_proba(prediction_data)  # important to use predict_proba
print(predicted_probs)
Here is the output:
[[ 9.98626760e-01 1.27594869e-03 9.72912751e-05]
[ 9.99578199e-01 1.79053170e-05 4.03895759e-04]]
which shows probabilities for each class for each data point.
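The same wrapper can be placed directly inside the text pipeline from the question. The following is a minimal sketch under stated assumptions: the tiny corpus, its labels, and the step name "tfidf" are invented for illustration, and cv=2 is chosen only because the toy data set is so small.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus, invented purely for illustration (1 = positive, 0 = negative).
train_texts = [
    "great product works well",
    "really great service",
    "works great love it",
    "love this great phone",
    "terrible product broke fast",
    "really terrible service",
    "broke quickly hate it",
    "hate this terrible phone",
]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    # Wrap LinearSVC so its decision scores are mapped to probabilities
    # with Platt scaling (method="sigmoid"). cv=2 only suits this toy data.
    ("classifier", CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=2)),
])
pipeline.fit(train_texts, train_y)

test_texts = ["great phone", "terrible service"]
probs = pipeline.predict_proba(test_texts)  # one row per document, one column per class
prob_of_class_1 = probs[:, 1]               # probability of label 1, in the range 0-1
print(prob_of_class_1)
```

Each row of probs sums to 1, so the second column can be read as the confidence that a document belongs to class 1. With so little data the calibrated values themselves are not meaningful; the sketch only shows where the wrapper goes.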

Related

How to get multi-class roc_auc in cross validate in sklearn?

I have a classification problem where I want to get the roc_auc value using cross_validate in sklearn. My code is as follows.
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state = 0, class_weight="balanced")
from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring = ('accuracy', 'roc_auc'))
However, I get the following error.
ValueError: multiclass format is not supported
Please note that I selected roc_auc specifically because it supports both binary and multiclass classification, as mentioned in: https://scikit-learn.org/stable/modules/model_evaluation.html
I have a binary classification dataset too. Please let me know how to resolve this error.
I am happy to provide more details if needed.
By default, multi_class='raise', so you need to change it explicitly.
From the docs:
multi_class {‘raise’, ‘ovr’, ‘ovo’}, default=’raise’
Multiclass only. Determines the type of configuration to use. The
default value raises an error, so either 'ovr' or 'ovo' must be passed
explicitly.
'ovr':
Computes the AUC of each class against the rest [3] [4]. This treats
the multiclass case in the same way as the multilabel case. Sensitive
to class imbalance even when average == 'macro', because class
imbalance affects the composition of each of the ‘rest’ groupings.
'ovo':
Computes the average AUC of all possible pairwise combinations of
classes [5]. Insensitive to class imbalance when average == 'macro'.
Solution:
Use make_scorer (docs):
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features.
y = iris.target
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier(random_state = 0, class_weight="balanced")
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
myscore = make_scorer(roc_auc_score, multi_class='ovo',needs_proba=True)
from sklearn.model_selection import cross_validate
cross_validate(clf, X, y, cv=10, scoring = myscore)
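As an alternative to building a scorer by hand, recent scikit-learn releases also ship predefined scorer names for the multiclass case. The sketch below assumes a version where the "roc_auc_ovo" scoring string is available (to my knowledge 0.22 or later); "roc_auc_ovr" can be substituted in the same way.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = datasets.load_iris(return_X_y=True)
X = X[:, :2]  # only the first two features, as in the question

clf = RandomForestClassifier(random_state=0, class_weight="balanced")

# "roc_auc_ovo" is the one-vs-one multiclass AUC; it is computed from
# predict_proba, so no make_scorer call is needed.
results = cross_validate(clf, X, y, cv=10, scoring=("accuracy", "roc_auc_ovo"))

fold_aucs = results["test_roc_auc_ovo"]  # one AUC value per fold
mean_auc = fold_aucs.mean()
print(mean_auc)
```

Using the string form keeps accuracy and AUC in a single cross_validate call, which is what the question originally attempted.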

How to set a value for a specific threshold in SVC model and generate a confusion matrix?

I need to apply a specific threshold and generate a confusion matrix. The data is in a CSV file (11.1 MB); the download link is: https://drive.google.com/file/d/1cQFp7HteaaL37CefsbMNuHqPzkINCVzs/view?usp=sharing
First, I received an error message: "AttributeError: predict_proba is not available when probability=False"
So I used this as a correction:
svc = SVC(C=1e9, gamma=1e-07)
scv_calibrated = CalibratedClassifierCV(svc)
svc_model = scv_calibrated.fit(X_train, y_train)
I have read a lot online, but I still don't quite understand how a custom threshold value is set. It sounds pretty hard.
Now I get a wrong output:
array([[   0,    0],
       [5359,   65]])
I have no idea what is wrong.
I need help; I'm new to this.
Thanks.
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('fraud_data.csv')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split
    svc = SVC(C=1e9, gamma=1e-07)
    scv_calibrated = CalibratedClassifierCV(svc)
    svc_model = scv_calibrated.fit(X_train, y_train)
    # set threshold as -220
    y_pred = (svc_model.predict_proba(X_test)[:, 1] >= -220)
    conf_matrix = confusion_matrix(y_pred, svc_model.predict(X_test))
    return conf_matrix
answer_four()
This function should return a confusion matrix, a 2x2 numpy array with 4 integers.
This code produces the expected output. Besides using the confusion matrix incorrectly in the previous code, I should also have used decision_function and filtered its output with the -220 threshold.
import numpy as np
def answer_four():
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import SVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split
    # SVC with no kernel specified; the default is rbf
    svc = SVC(C=1e9, gamma=1e-07).fit(X_train, y_train)
    # decision_function returns confidence scores for the samples
    y_score = svc.decision_function(X_test)
    # Apply the threshold of -220 to the scores
    y_pred = np.where(y_score > -220, 1, 0)
    conf_matrix = confusion_matrix(y_test, y_pred)
    # The threshold is applied after the model has been trained;
    # it is the boundary that separates the two classes
    return conf_matrix
answer_four()
# output:
array([[5320,   24],
       [  14,   66]])
You are using the confusion matrix in a wrong way.
The idea behind the confusion matrix is to have a picture as to how good our predictions y_pred are compared with the ground truth y_true, usually in a test set.
What you are actually doing here is computing a "confusion matrix" between your predictions with the custom threshold of -220 (y_pred) and some other predictions with the default threshold (the output of svc_model.predict(X_test)), which does not make any sense.
Your ground truth for the test set is y_test; so, to get the confusion matrix with the default threshold, you should use
confusion_matrix(y_test, svc_model.predict(X_test))
To get the confusion matrix with your custom threshold of -220, you should use
confusion_matrix(y_test, y_pred)
See the documentation for more details on usage; it is your best friend and should always be the first place to look when you have issues or doubts.
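To make the comparison against the ground truth concrete, here is a minimal sketch. It uses a synthetic imbalanced data set from make_classification as a stand-in for the asker's file, and the threshold value of -0.5 is hypothetical. Note that a threshold such as -220 is only meaningful for decision_function scores: predict_proba values always lie between 0 and 1, so every one of them is greater than -220.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (not the fraud data from the question).
X, y = make_classification(n_samples=600, n_features=6,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

svc = SVC().fit(X_train, y_train)

# Signed scores: positive means class 1, negative means class 0.
scores = svc.decision_function(X_test)

threshold = -0.5  # hypothetical custom threshold, lower than the default of 0
y_pred_default = svc.predict(X_test)
y_pred_custom = (scores > threshold).astype(int)

# Both matrices compare predictions against the ground truth y_test.
cm_default = confusion_matrix(y_test, y_pred_default)
cm_custom = confusion_matrix(y_test, y_pred_custom)
print(cm_default)
print(cm_custom)
```

Lowering the threshold below 0 labels more samples as class 1, which typically trades extra false positives for fewer false negatives.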

Determine what features to drop / select using GridSearch in scikit-learn

How does one determine what features/columns/attributes to drop using GridSearch results?
In other words, if GridSearch returns that max_features should be 3, can we determine which EXACT 3 features should one use?
Let's take the classic Iris data set with 4 features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
iris = datasets.load_iris()
all_inputs = iris.data
all_labels = iris.target
decision_tree_classifier = DecisionTreeClassifier()
parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
                  'max_features': [1, 2, 3, 4]}
cross_validation = StratifiedKFold(n_splits=10)
grid_search = GridSearchCV(decision_tree_classifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(all_inputs, all_labels)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
Let's say we get that max_features is 3. How do I find out which 3 features were the most appropriate here?
Putting in max_features = 3 will work for fitting, but I want to know which attributes were the right ones.
Do I have to generate the possible list of all feature combinations myself to feed GridSearch or is there an easier way ?
max_features is a hyperparameter of your decision tree.
It does not drop any of your features before training, nor does it identify good or bad features.
By default, your decision tree looks at all features to find the best one for splitting your data based on your labels. If you set max_features to 3, as in your example, the tree considers only three randomly chosen features at each split and picks the best of those. This makes training faster and adds some randomness to your classifier (which might also help against overfitting).
Your classifier decides which feature to split on using a criterion (such as the Gini index or information gain). So you can either use such a measure of feature importance yourself, or use an estimator that has the attribute feature_importances_, as @gorjan mentioned.
If you use an estimator that has the attribute feature_importances_ you can simply do:
feature_importances = grid_search.best_estimator_.feature_importances_
This will return an array of shape (n_features,) showing how important each feature was for the best estimator found by the grid search. Additionally, if you want to use, say, a linear classifier (logistic regression), which doesn't have the attribute feature_importances_, what you could do is:
# Get the best estimator's coefficients
estimator_coeff = grid_search.best_estimator_.coef_
# Multiply the model coefficients by the standard deviation of the data
coeff_magnitude = np.std(all_inputs, 0) * estimator_coeff
which is also an indication of the feature importance. If a model's coefficient is >> 0 or << 0, that means, in layman's terms, that the model is trying hard to capture the signal present in that feature.
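To see which features the best tree actually relied on, the importances can be paired with the feature names. This is a small sketch reusing the Iris setup from the question; random_state=0 is an added assumption so that the randomly sampled features are reproducible.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # fixed seed: an assumption for repeatability
    param_grid={"max_depth": [1, 2, 3, 4, 5], "max_features": [1, 2, 3, 4]},
    cv=StratifiedKFold(n_splits=10),
)
grid_search.fit(iris.data, iris.target)

# One importance value per input feature, taken from the refitted best tree.
importances = grid_search.best_estimator_.feature_importances_

# Pair each importance with its feature name and sort, most important first.
ranked = sorted(zip(iris.feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Features with an importance of 0 were never used for a split in the best tree, which is the closest this approach comes to telling you which columns could be dropped.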

Why doesn't fit_transform work in this sklearn Pipeline example?

I am new to sklearn Pipeline and am following some sample code. I saw in other examples that we can do pipeline.fit_transform(train_X), so I tried the same thing on the pipeline here, pipeline.fit_transform(X), but it gave me an error:
" return self.fit(X, **fit_params).transform(X)
TypeError: fit() takes exactly 3 arguments (2 given)"
If I remove the svm part and defined the pipeline as pipeline = Pipeline([("features", combined_features)]), I still saw the error.
Does anyone know why fit_transform doesn't work here?
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in old releases
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)
# Maybe some original features were good, too?
selection = SelectKBest(k=1)
# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
svm = SVC(kernel="linear")
# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
You get an error in the above example because you also need to pass the labels to your pipeline: it must be fitted with both X and y. The last step in your pipeline is a classifier, SVC, and the fit method of a classifier requires the labels as a mandatory argument, because classification algorithms use these labels to train the weights of the classifier. Note that fit_transform only makes sense when the last step is a transformer; with SVC as the last step, call pipeline.fit(X, y) instead.
Similarly, even if you remove the SVC, you still get an error because the fit method of the SelectKBest class also requires both X and y. In that case, pipeline.fit_transform(X, y) is the call you need.
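The two cases described above can be sketched as follows, assuming a reasonably recent scikit-learn. The names feature_pipeline and full_pipeline are introduced here only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

combined_features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=1)),
])

# Case 1: transformers only. fit_transform works, but y is still required
# because SelectKBest scores features against the labels.
feature_pipeline = Pipeline([("features", combined_features)])
X_features = feature_pipeline.fit_transform(X, y)
print(X_features.shape)  # 2 PCA components + 1 selected feature -> (150, 3)

# Case 2: the last step is a classifier. Train with fit, then use predict.
full_pipeline = Pipeline([
    ("features", combined_features),
    ("svm", SVC(kernel="linear")),
])
full_pipeline.fit(X, y)
predictions = full_pipeline.predict(X)
```

In both cases the labels are passed as the second argument, which is what the original call was missing.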

What is the theoretical foundation for the scikit-learn dummy classifier?

From the documentation, I understand that a dummy classifier can be used as a baseline to compare against a classification algorithm.
This classifier is useful as a simple baseline to compare with other
(real) classifiers. Do not use it for real problems.
What does the dummy classifier do when it uses the stratified approach? I know that the documentation says that it:
generates predictions by respecting the training set’s class
distribution.
Could anybody give me a more theoretical explanation of why this is a valid check of a classifier's performance?
The dummy classifier gives you a measure of "baseline" performance--i.e. the success rate one should expect to achieve even if simply guessing.
Suppose you wish to determine whether a given object possesses or does not possess a certain property. If you have analyzed a large number of those objects and have found that 90% contain the target property, then guessing that every future instance of the object possesses the target property gives you a 90% likelihood of guessing correctly. Structuring your guesses this way is equivalent to using the most_frequent method in the documentation you cite.
Because many machine learning tasks attempt to increase the success rate of (e.g.) classification, the baseline success rate provides a floor that one's classifier should outperform. In the hypothetical discussed above, you would want your classifier to get more than 90% accuracy, because 90% is the success rate available to even "dummy" classifiers.
If one trains a dummy classifier with the stratified strategy on the data discussed above, that classifier will guess at random, predicting with 90% probability that each object it encounters possesses the target property. This is different from training a dummy classifier with the most_frequent strategy, as the latter would guess that all future objects possess the target property. Here's some code to illustrate:
from sklearn.dummy import DummyClassifier
import numpy as np
two_dimensional_values = []
class_labels = []
for i in range(90):
    two_dimensional_values.append([1, 1])
    class_labels.append(1)
for i in range(10):
    two_dimensional_values.append([0, 0])
    class_labels.append(0)
# now 90% of the training data contains the target property
X = np.array(two_dimensional_values)
y = np.array(class_labels)
# train a dummy classifier to make predictions based on the most_frequent class value
dummy_classifier = DummyClassifier(strategy="most_frequent")
dummy_classifier.fit(X, y)
# this produces 100 predictions that say "1"
for i in two_dimensional_values:
    print(dummy_classifier.predict([i]))
# train a dummy classifier to make predictions based on the training class distribution
new_dummy_classifier = DummyClassifier(strategy="stratified")
new_dummy_classifier.fit(X, y)
# this produces roughly 90 guesses that say "1" and roughly 10 guesses that say "0"
for i in two_dimensional_values:
    print(new_dummy_classifier.predict([i]))
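The baseline accuracy each strategy should reach on the 90/10 data can also be worked out by hand. The short calculation below is a sketch that assumes future data follows the same 90/10 class split as the training data.

```python
# Class proportions from the example: 90% possess the property, 10% do not.
p_positive = 0.9
p_negative = 1 - p_positive

# most_frequent always guesses the majority class, so it is right
# whenever the true class is the majority class.
accuracy_most_frequent = max(p_positive, p_negative)

# stratified guesses "1" with probability 0.9 and "0" with probability 0.1,
# independently of the true class, so it is right when guess and truth agree.
accuracy_stratified = p_positive ** 2 + p_negative ** 2

print(round(accuracy_most_frequent, 2))  # 0.9
print(round(accuracy_stratified, 2))     # 0.82
```

So a real classifier on this data has to beat 90% accuracy to be better than always guessing the majority class, and 82% to be better than random guessing that respects the class distribution.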
A major motivation for the dummy classifier is the F-score when the positive class is in the minority (i.e. imbalanced classes). This classifier is used as a sanity check for the actual classifier. The dummy classifier completely ignores the input features. With the most_frequent strategy, it always predicts the most frequent label in the training set.
From the documentation: to illustrate DummyClassifier, first let’s create an imbalanced dataset:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> y[y != 1] = -1
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Next, let’s compare the accuracy of SVC and most_frequent:
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.63...
>>> clf = DummyClassifier(strategy='most_frequent',random_state=0)
>>> clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=0, strategy='most_frequent')
>>> clf.score(X_test, y_test)
0.57...
We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:
>>> clf = SVC(gamma='scale', kernel='rbf', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.97...
We see that the accuracy was boosted to almost 100%. So this is better.
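The remark about F-score and a minority positive class can be illustrated with a small sketch. The 90/10 labels below are made up for the example, and zero_division=0 is passed so that the undefined precision is reported as 0 without a warning (this parameter is assumed to be available, as it is in recent scikit-learn versions).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Made-up imbalanced labels: 90 negatives (0) and 10 positives (1).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the features are ignored by DummyClassifier

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)  # always predicts the majority class, 0

accuracy = accuracy_score(y, y_pred)       # 90 of 100 labels are matched
f1 = f1_score(y, y_pred, zero_division=0)  # no positive is ever predicted

print(accuracy)  # 0.9
print(f1)        # 0.0
```

The accuracy looks respectable while the F-score for the positive class is zero, which is why comparing against a dummy baseline on several metrics is a useful sanity check for imbalanced problems.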
