Does sklearn provide a VotingClassifier for already fit estimators? - python

I have n binary classifiers, and I'd like to see what results I get by combining their predictions, just like a VotingClassifier would do, using majority voting or the sum of the individual predict_proba to determine the final prediction.
(Looking on google I already found here some guy's implementation of what I need and the code seems correct, I'm just looking for the "sklearn way" of achieving this)
My problem is that sklearn.ensemble.VotingClassifier is a Soft Voting/Majority Rule classifier for unfitted estimators.
Does sklearn provide something for fitted estimators?
By looking at sklearn.ensemble's documentation it would seem it does not, but isn't weird that such a functionality would be missing from such a complete library? Am I missing something? Can I get that functionality using features from sklearn or do I have to implement it by myself?

Related

How to get CatBoost model's coefficients?

I need to get the parameters to use the model in another program.
I tried cat_model.coef_, cat_model.intercept_ or what I think. is that possible to catch the params ?
I totally solved this problem, what i was tryna do is named 'saving model'.
cat_model.save_model('cat_model.cbm')
Attributes .coef_ and .intercept_ only exist in sklearn applications of linear regression and logistic regression and will give you the slopes and the intercept (if fitted). You can use .feature_importances_ instead.
For catboost, your model has something called feature importances, given that it's a gradient boosting tree model what you get back is how heavy certain features are in splitting the tree up.
cat_model.feature_importances_
will tell you that. Though you should do more research into how the model works and what it will give you back because interpreting these features can be somewhat deceptive.

Unstable accuracy of Gaussian Mixture Model classifier from sklearn

I have some data (MFCC features for speaker recognition), from two different speakers. 60 vectors of 13 features for each person (in total 120). Each of them has their label (0 and 1). I need to show the results on confusion matrix. But GaussianMixture model from sklearn is unstable. For each program run i receive different scores (sometimes accuracy is 0.4, sometimes 0.7 ...). I don't know what I am doing wrong, because analogically i created SVM and k-NN models and they are working fine (stable accuracy around 0.9). Do you have any idea what am I doing wrong?
gmmclf = GaussianMixture(n_components=2, covariance_type='diag')
gmmclf.fit(X_train, y_train) #X_train are mfcc vectors, y_train are labels
ygmm_pred_class = gmmclf.predict(X_test)
print(accuracy_score(y_test, ygmm_pred_class))
print(confusion_matrix(y_test, ygmm_pred_class))
Short answer: you should simply not use a GMM for classification.
Long answer...
From the answer to a relevant thread, Multiclass classification using Gaussian Mixture Models with scikit learn (emphasis in the original):
Gaussian Mixture is not a classifier. It is a density estimation
method, and expecting that its components will magically align with
your classes is not a good idea. [...] GMM simply tries to fit mixture of Gaussians
into your data, but there is nothing forcing it to place them
according to the labeling (which is not even provided in the fit
call). From time to time this will work - but only for trivial
problems, where classes are so well separated that even Naive Bayes
would work, in general however it is simply invalid tool for the
problem.
And a comment by the respondent himself (again, emphasis in the original):
As stated in the answer - GMM is not a classifier, so asking if you
are using "GMM classifier" correctly is impossible to answer. Using
GMM as a classifier is incorrect by definition, there is no "valid"
way of using it in such a problem as it is not what this model is
designed to do. What you could do is to build a proper generative
model per class. In other words construct your own classifier where
you fit one GMM per label and then use assigned probability to do
actual classification. Then it is a proper classifier. See
github.com/scikit-learn/scikit-learn/pull/2468
(For what it may worth, you may want to notice that the respondent is a research scientist in DeepMind, and the very first person to be awarded the machine-learning gold badge here at SO)
To elaborate further (and that's why I didn't simply flag the question as a duplicate):
It is true that in the scikit-learn documentation there is a post titled GMM classification:
Demonstration of Gaussian mixture models for classification.
which I guess did not exist back in 2017, when the above response was written. But, digging into the provided code, you will realize that the GMM models are actually used there in the way proposed by lejlot above; there is no statement in the form of classifier.fit(X_train, y_train) - all usage is in the form of classifier.fit(X_train), i.e. without using the actual labels.
This is exactly what we would expect from a clustering-like algorithm (which is indeed what GMM is), and not from a classifier. It is true again that scikit-learn offers an option for providing also the labels in the GMM fit method:
fit (self, X, y=None)
which you have actually used here (and again, probably did not exist back in 2017, as the above response implies), but, given what we know about GMMs and their usage, it is not exactly clear what this parameter is there for (and, permit me to say, scikit-learn has its share on practices that may look sensible from a purely programming perspective, but which made very little sense from a modeling perspective).
A final word: although fixing the random seed (as suggested in a comment) may appear to "work", trusting a "classifier" that gives a range of accuracies between 0.4 and 0.7 depending on the random seed is arguably not a good idea...
In sklearn, the labels of clusters in gmm do not mean anything. So, each time you run a gmm, the labels may vary. It might be one reason the results are not robust.

Multilabel classification ML-kNN vs KNN

This might be a stupid question but I was just wondering what the difference between ML-KNN implemented in scikit.ml and scikit-learn's KNeighborsClassifier is. According to sklearn's docs KNeighborsClassifier has support for multilabel classification. ML-KNN however is KNN adapted for multilabel classification built on top of sklearn's architecture based on it's docs.
When searching for sample multilabel problems, MLkNN mostly appears but I do not understand if there's any advantage of using it over the base implementation of sklearn if it already supports it. Is it only a late adaptation in sklearn's side or are there more differences in the implementation?
Any input is appreciated. Thanks!
scikit-multilearn's ML-KNN implementations is an improved version of scikit-learn's KNeighborsClassifier. It is actually built on top of it. After the k nearest neighbors in the training data are found, it uses maximum a posteriori principle to label a new instance to achieve a better performance. Also, since it operates on sparse matrices internally using SciPy sparse matrix library, it is highly memory-efficient. More info here and here.

Does scikit-learn use One-Vs-Rest by default in multi-class classification?

I am dealing with a multi-class problem (4 classes) and I am trying to solve it with scikit-learn in Python.
I saw that I have three options:
I simply instantiate a classifier, then I fit with train and evaluate with test;
classifier = sv.LinearSVC(random_state=123)
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsRest object, generating a new classifier that I use for train and test;
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsOne object, generating a new classifier that I use for train and test.
classifier = OneVsOneClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I understand the difference between OneVsRest and OneVsOne, but I cannot understand what I am doing in the first scenario where I do not explicitly pick up any of these two options. What does scikit-learn do in that case? Does it implicitly use OneVsRest?
Any clarification on the matter would be highly appreciated.
Best,
MR
Edit:
Just to make things clear, I am not specifically interested in the case of SVMs. For example, what about RandomForest?
Updated answer: As clarified in the comments and edits, the question is more about the general setting of sklearn, and less about the specific case of LinearSVC which is explained below.
The main difference here is that some of the classifiers you can use have "built-in multiclass classification support", i.e. it is possible for that algorithm to discern between more than two classes by default. One example for this would for example be a Random Forest, or a Multi-Layer Perceptron (MLP) with multiple output nodes.
In these cases, having a OneVs object is not required at all, since you are already solving your task. In fact, using such a strategie might even decreaes your performance, since you are "hiding" potential correlations from the algorithm, by letting it only decide between single binary instances.
On the other hand, algorithms like SVC or LinearSVC only support binary classification. So, to extend these classes of (well-performing) algorithms, we instead have to rely on the reduction to a binary classification task, from our initial multiclass classification task.
As far as I am aware of, the most complete overview can be found here:
If you scroll down a little bit, you can see which one of the algorithms is inherently multiclass, or uses either one of the strategies by default.
Note that all of the listed algorithms under OVO actually now employ a OVR strategy by default! This seems to be slightly outdated information in that regard.
Initial answer:
This is a question that can easily be answered by looking at the relevant scikit-learn documentation.
Generally, the expectation on Stackoverflow is that you have at least done some form of research on your own, so please consider looking into existing documentation first.
multi_class : string, ‘ovr’ or ‘crammer_singer’ (default=’ovr’)
Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while
"crammer_singer" optimizes a joint objective over all classes. While
crammer_singer is interesting from a theoretical perspective as it is
consistent, it is seldom used in practice as it rarely leads to better
accuracy and is more expensive to compute. If "crammer_singer" is
chosen, the options loss, penalty and dual will be ignored.
So, clearly, it uses one-vs-rest.
The same holds by the way for the "regular" SVC.

sklearn logistic regression with unbalanced classes

I'm solving a classification problem with sklearn's logistic regression in python.
My problem is a general/generic one. I have a dataset with two classes/result (positive/negative or 1/0), but the set is highly unbalanced. There are ~5% positives and ~95% negatives.
I know there are a number of ways to deal with an unbalanced problem like this, but have not found a good explanation of how to implement properly using the sklearn package.
What I've done thus far is to build a balanced training set by selecting entries with a positive outcome and an equal number of randomly selected negative entries. I can then train the model to this set, but I'm stuck with how to modify the model to then work on the original unbalanced population/set.
What are the specific steps to do this? I've poured over the sklearn documentation and examples and haven't found a good explanation.
Have you tried to pass to your class_weight="auto" classifier? Not all classifiers in sklearn support this, but some do. Check the docstrings.
Also you can rebalance your dataset by randomly dropping negative examples and / or over-sampling positive examples (+ potentially adding some slight gaussian feature noise).
#agentscully Have you read the following paper,
[SMOTE] (https://www.jair.org/media/953/live-953-2037-jair.pdf).
I have found the same very informative. Here is the link to the Repo.
Depending on how you go about balancing your target classes, either you can use
'auto': (is deprecated in the newer version 0.17) or 'balanced' or specify the class ratio yourself {0: 0.1, 1: 0.9}.
'balanced': This mode adjusts the weights inversely proportional to class frequencies n_samples / (n_classes * np.bincount(y)
Let me know, if more insight is needed.

Categories