Encoding labels for multi-class problems in scikit-learn - python

When utilizing classifiers from scikit-learn for multi-class problems, is it necessary to encode the labels with one hot encoding? For example, I have 3 classes and simply labeled them as 0, 1, and 2 when feeding this data into the different classifiers for training. As far as I can tell, it seems to be working normally. But is there any reason this kind of basic encoding is not recommended?
Some algorithms, such as random forests, appear to handle this kind of integer label natively, and the same seems to be true of logistic regression, multilayer perceptron, and Gaussian naive Bayes, if I'm not mistaken. Is that assessment correct? Which of scikit-learn's classifiers do not handle such labels natively and are influenced by ordinality?

All scikit-learn classifiers handle multi-class problems automatically.
Internally, the labels will be converted appropriately: either a simple encoding to 0, 1, 2, etc. if the algorithm supports multi-class natively, or a one-hot encoding if the algorithm handles multi-class by reducing it to binary problems.
Please refer to the documentation to see this:
All scikit-learn classifiers are capable of multiclass classification,...
You can see that "logistic regression, multilayer perceptron, Gaussian naive Bayes, and random forest" are under the heading "Inherently multiclass".
Others, like SGDClassifier or LinearSVC, use a one-vs-rest approach to handle multi-class, but as I said above this is handled internally by scikit-learn, so you as a user don't need to do anything and can pass multi-class labels (even as strings) in a single y array to all classification estimators.
The only case where the user needs to explicitly convert labels to a one-hot encoding is the multi-label problem, where more than one label can be predicted for a sample. But I think your question is not about that.
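For example, a minimal sketch with made-up data (the toy X and y below are purely illustrative) showing string multi-class labels passed directly to two classifiers:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X = [[0.1, 1.2], [1.0, 0.3], [2.2, 2.1], [0.2, 1.0], [1.1, 0.4], [2.0, 2.3]]
y = ['cat', 'dog', 'bird', 'cat', 'dog', 'bird']   # multi-class string labels

LogisticRegression().fit(X, y)   # inherently multi-class: handled directly
LinearSVC().fit(X, y)            # binary at its core: one-vs-rest is applied internally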

Related

How does TensorFlow Decision Forests handle categorical data?

I'm evaluating two different unsupervised ML algorithms, Isolation Forest and an LSTM autoencoder model, to identify anomalies in a large time series dataset. This dataset consists mostly of categorical data such as IP addresses, cloud subscription IDs, tenant IDs, userAgents, and client application IDs.
When reading a tutorial on TensorFlow Decision Forests (TF-DF), it mentions that the model handles non-label categorical values natively and
there is no need for preprocessing in the form of one-hot encoding, normalization or extra is_present feature.
Does anybody know how Tensorflow handles the categorical features behind the scenes (assuming they do some transformation into a numeric representation)?
Tl;dr: There is a natural way of using categorical features in decision trees/forests that requires no encoding. Tensorflow Decision Forests uses this and a number of standard transformations to handle categorical features.
TensorFlow Decision Forests (TF-DF) constructs decision tree / decision forest models. A single decision tree recursively splits the dataset along its features. Splits along categorical features can naturally be performed through so-called in-set conditions. For instance, a tree can express a condition like userAgents ∈ {"Mozilla/5.0", "InternetExplorer/10.0"}. Other types of conditions are also possible. TF-DF can construct in-set conditions if the dataset contains categorical features.
More specifically, Tensorflow Decision Forests uses the C++ library Yggdrasil Decision Forests (YDF) under the hood for any advanced computations. YDF offers three different algorithms for finding a good categorical split of the data. For example, the Random algorithm will just try out many possible splits at random and pick the best one.
For performance and quality reasons, YDF also preprocesses categorical features: If a categorical value is very rare, YDF may consider it “out-of-dictionary”, the threshold for “rare” being user-configurable. Furthermore, YDF maps the categorical features to integers by decreasing item frequency, with the mapping stored as part of the model. Note that this is purely an internal encoding; the algorithms are aware that a feature is categorical, hence typical issues with integer encodings do not apply.
Finally, Tensorflow Decision Forests (TF-DF) uses Keras, which expects classification tasks to have an integer label. Therefore, TF-DF users have to encode the label themselves or use the built-in pd_dataframe_to_tf_dataset.
Note that this answer only applies to Tensorflow Decision Forests. Other parts of Tensorflow may need manual encoding.
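As a rough sketch of how this looks from the user's side (assuming a pandas DataFrame df with raw string-valued categorical columns and an integer label column named "label"; the calls follow the TF-DF tutorials, so treat this as illustrative rather than authoritative):

import tensorflow_decision_forests as tfdf

# df: pandas DataFrame with raw categorical string columns and an integer "label" column
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="label")

# No one-hot encoding or normalization of the categorical features is needed;
# TF-DF/YDF handles them internally (e.g. via in-set conditions).
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)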

Unstable accuracy of Gaussian Mixture Model classifier from sklearn

I have some data (MFCC features for speaker recognition) from two different speakers: 60 vectors of 13 features for each person (120 in total). Each vector has its label (0 or 1). I need to show the results in a confusion matrix. But the GaussianMixture model from sklearn is unstable: for each program run I receive different scores (sometimes the accuracy is 0.4, sometimes 0.7, ...). I don't know what I am doing wrong, because I created SVM and k-NN models analogously and they work fine (stable accuracy around 0.9). Do you have any idea what I am doing wrong?
from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score, confusion_matrix

gmmclf = GaussianMixture(n_components=2, covariance_type='diag')
gmmclf.fit(X_train, y_train)  # X_train are MFCC vectors, y_train are labels
ygmm_pred_class = gmmclf.predict(X_test)
print(accuracy_score(y_test, ygmm_pred_class))
print(confusion_matrix(y_test, ygmm_pred_class))
Short answer: you should simply not use a GMM for classification.
Long answer...
From the answer to a relevant thread, Multiclass classification using Gaussian Mixture Models with scikit learn (emphasis in the original):
Gaussian Mixture is not a classifier. It is a density estimation method, and expecting that its components will magically align with your classes is not a good idea. [...] GMM simply tries to fit mixture of Gaussians into your data, but there is nothing forcing it to place them according to the labeling (which is not even provided in the fit call). From time to time this will work - but only for trivial problems, where classes are so well separated that even Naive Bayes would work, in general however it is simply invalid tool for the problem.
And a comment by the respondent himself (again, emphasis in the original):
As stated in the answer - GMM is not a classifier, so asking if you are using "GMM classifier" correctly is impossible to answer. Using GMM as a classifier is incorrect by definition, there is no "valid" way of using it in such a problem as it is not what this model is designed to do. What you could do is to build a proper generative model per class. In other words construct your own classifier where you fit one GMM per label and then use assigned probability to do actual classification. Then it is a proper classifier. See github.com/scikit-learn/scikit-learn/pull/2468
(For what it may be worth, you may want to note that the respondent is a research scientist at DeepMind, and the very first person to be awarded the machine-learning gold badge here at SO.)
To elaborate further (and that's why I didn't simply flag the question as a duplicate):
It is true that in the scikit-learn documentation there is a post titled GMM classification:
Demonstration of Gaussian mixture models for classification.
which I guess did not exist back in 2017, when the above response was written. But, digging into the provided code, you will realize that the GMM models are actually used there in the way proposed by lejlot above; there is no statement in the form of classifier.fit(X_train, y_train) - all usage is in the form of classifier.fit(X_train), i.e. without using the actual labels.
This is exactly what we would expect from a clustering-like algorithm (which is indeed what GMM is), and not from a classifier. It is true again that scikit-learn offers an option for providing also the labels in the GMM fit method:
fit (self, X, y=None)
which you have actually used here (and again, probably did not exist back in 2017, as the above response implies), but, given what we know about GMMs and their usage, it is not exactly clear what this parameter is there for (and, permit me to say, scikit-learn has its share of practices that may look sensible from a purely programming perspective but make very little sense from a modeling perspective).
A final word: although fixing the random seed (as suggested in a comment) may appear to "work", trusting a "classifier" that gives a range of accuracies between 0.4 and 0.7 depending on the random seed is arguably not a good idea...
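For reference, a per-class generative classifier along the lines lejlot describes could be sketched as follows (a minimal, illustrative version, assuming X_train, y_train, X_test from the question are numpy arrays):

import numpy as np
from sklearn.mixture import GaussianMixture

# Fit one GMM per class, using only that class's training vectors
classes = np.unique(y_train)
gmms = {c: GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
           .fit(X_train[y_train == c]) for c in classes}

# Classify each test vector by the class whose GMM assigns the highest log-likelihood
log_likelihood = np.column_stack([gmms[c].score_samples(X_test) for c in classes])
ygmm_pred_class = classes[np.argmax(log_likelihood, axis=1)]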
In sklearn, the labels of the clusters found by a GMM do not mean anything, so each time you run a GMM the labels may vary. That might be one reason the results are not robust.

Python classification define feature importance

I am wondering if it is possible to define feature importances/weights in Python classification methods? For example:
model = tree.DecisionTreeClassifier(feature_weight = ...)
I've seen that RandomForest has an attribute feature_importances_, which shows the importance of features based on analysis. But is it possible to define the feature importance for the analysis in advance?
Thank you very much for your help in advance!
The feature importance determination in random forest classifiers uses a random-forest-specific method (invert all binary tests over the feature, and measure the additional classification error).
Feature importance is thus a concept that relates to the predictive ability of the model, not the training phase. Now, if you want to make your model favour some features over others, you will have to find some trick that depends on the model.
Regarding sklearn's DecisionTreeClassifier, such a trick does not appear to be trivial. You could customize your class weights, if you know some classes will be more easily predicted by some features that you want to favour; but this seems pretty dirty.
In other types of models, such as ones using kernels, you can do this more easily by setting hyperparameters which directly relate to features (see the sketch below).
If you are trying to limit overfitting, I would also simply suggest that you remove the features you know to be less important.
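As a hedged illustration of the kernel-model trick mentioned above (this is not a DecisionTreeClassifier feature; the weights below are made up): for a distance- or kernel-based model you can rescale the columns you want to favour before fitting, so they contribute more to the kernel distances.

import numpy as np
from sklearn.svm import SVC

# Hypothetical per-feature weights chosen in advance by the user:
# larger values make a feature count more in the RBF kernel's distances.
feature_weights = np.array([2.0, 1.0, 0.5, 1.0])   # assumes 4 features

clf = SVC(kernel='rbf')
clf.fit(X_train * feature_weights, y_train)
print(clf.score(X_test * feature_weights, y_test))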

Does scikit-learn use One-Vs-Rest by default in multi-class classification?

I am dealing with a multi-class problem (4 classes) and I am trying to solve it with scikit-learn in Python.
I saw that I have three options:
I simply instantiate a classifier, then I fit with train and evaluate with test;
classifier = svm.LinearSVC(random_state=123)
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsRest object, generating a new classifier that I use for train and test;
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsOne object, generating a new classifier that I use for train and test.
classifier = OneVsOneClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I understand the difference between OneVsRest and OneVsOne, but I cannot understand what I am doing in the first scenario, where I do not explicitly pick either of these two options. What does scikit-learn do in that case? Does it implicitly use OneVsRest?
Any clarification on the matter would be highly appreciated.
Edit:
Just to make things clear, I am not specifically interested in the case of SVMs. For example, what about RandomForest?
Updated answer: As clarified in the comments and edits, the question is more about the general setting of sklearn, and less about the specific case of LinearSVC which is explained below.
The main difference here is that some of the classifiers you can use have "built-in multiclass classification support", i.e. it is possible for that algorithm to discern between more than two classes by default. Examples of this are a Random Forest, or a Multi-Layer Perceptron (MLP) with multiple output nodes.
In these cases, having a OneVsRest/OneVsOne wrapper object is not required at all, since you are already solving your task. In fact, using such a strategy might even decrease your performance, since you are "hiding" potential correlations from the algorithm by letting it only decide between single binary instances.
On the other hand, algorithms like SVC or LinearSVC only support binary classification. So, to extend these classes of (well-performing) algorithms, we instead have to rely on the reduction to a binary classification task, from our initial multiclass classification task.
As far as I am aware, the most complete overview can be found here:
If you scroll down a little bit, you can see which one of the algorithms is inherently multiclass, or uses either one of the strategies by default.
Note that all of the algorithms listed under OvO actually now employ an OvR strategy by default! That listing seems to be slightly outdated in that regard.
Initial answer:
This is a question that can easily be answered by looking at the relevant scikit-learn documentation.
Generally, the expectation on Stackoverflow is that you have at least done some form of research on your own, so please consider looking into existing documentation first.
multi_class : string, ‘ovr’ or ‘crammer_singer’ (default=’ovr’)
Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while "crammer_singer" optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If "crammer_singer" is chosen, the options loss, penalty and dual will be ignored.
So, clearly, it uses one-vs-rest.
The same holds by the way for the "regular" SVC.
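A quick way to see this behaviour in practice (a minimal sketch with synthetic data, not code from the question):

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 4-class problem, as in the question's setting
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=4, random_state=123)

clf = LinearSVC(random_state=123).fit(X, y)

# With the default multi_class='ovr', one binary separator is fitted per class,
# so coef_ has one row per class.
print(clf.coef_.shape)   # expected: (4, 10)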

How to do regression as opposed to classification using logistic regression and scikit learn

The target variable that I need to predict consists of probabilities (as opposed to labels). The corresponding column in my training data is also in this form. I do not want to lose information by thresholding the targets to create a classification problem out of it.
If I train the logistic regression classifier with binary labels, the sklearn logistic regression API allows obtaining probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikit-learn, or with another suitable Python package that scales to 100K data points of 1K dimensions?
I want the regressor to use the structure of the problem. One such structure is that the targets are probabilities.
You can't have a cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in the API. It is a scikit-learn limitation.
In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.
In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
For logistic regression you can approximate probabilities as targets by oversampling instances according to the probabilities of their labels. E.g. if for a given sample class_1 has probability 0.2 and class_2 has probability 0.8, then generate 10 training instances (copies of the sample): 8 with class_2 as the "ground truth target label" and 2 with class_1.
Obviously this is a workaround and not extremely efficient, but it should work properly.
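A rough sketch of this oversampling workaround (the helper and the replication factor of 10 are made up for illustration; p_train holds the target probability of class 1 for each training sample):

import numpy as np
from sklearn.linear_model import LogisticRegression

def oversample_soft_labels(X, p, n_copies=10):
    # Expand each sample into n_copies hard-labelled copies:
    # a sample with p=0.8 becomes 8 copies labelled 1 and 2 copies labelled 0.
    X_rep, y_rep = [], []
    for x, prob in zip(X, p):
        n_pos = int(round(prob * n_copies))
        X_rep.extend([x] * n_copies)
        y_rep.extend([1] * n_pos + [0] * (n_copies - n_pos))
    return np.array(X_rep), np.array(y_rep)

X_big, y_big = oversample_soft_labels(X_train, p_train)
clf = LogisticRegression().fit(X_big, y_big)
p_pred = clf.predict_proba(X_test)[:, 1]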
If you're OK with the upsampling approach, you can pip install eli5 and use eli5.lime.utils.fit_proba with a LogisticRegression classifier from scikit-learn.
An alternative solution is to implement (or find an implementation of) logistic regression in TensorFlow, where you can define the loss function as you like.
In compiling this solution I drew on the answers to scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels; I recommend those for more insight.
This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as.... regression!
There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually," unless you are lucky enough to be using a platform that supports SGD-R with custom kernels and X-validation support out-of-the-box.
Note that given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it takes more work to see the weight/importance of each feature in case you want to reduce your feature space by removing weak features.
Direct Approach using Optimization
If you don't mind doing a bit of coding, you can just use scipy's optimize functions. This is dead simple:
Create a function of the following type:
y_o = inverse_logit(a_0 + a_1*x_1 + a_2*x_2 + ...)
where inverse_logit(z) = exp(z) / (1 + exp(z))
Use scipy minimize to minimize the sum of -1 * [y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all datapoints. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters and creates the function and then calculates the loss.
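A minimal sketch of this direct approach (assuming X is an (n_samples, n_features) array and y_t holds the target probabilities):

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, X, y_t):
    # params[0] is the intercept a_0, params[1:] are the coefficients a_1, a_2, ...
    z = params[0] + X @ params[1:]
    y_o = 1.0 / (1.0 + np.exp(-z))   # inverse-logit
    eps = 1e-12                      # avoid log(0)
    return -np.sum(y_t * np.log(y_o + eps) + (1 - y_t) * np.log(1 - y_o + eps))

result = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1] + 1), args=(X, y_t))
a_0, a_rest = result.x[0], result.x[1:]   # fitted intercept and coefficients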
Stochastic Gradient Descent with Custom Loss
If you happen to be using a platform that has SGD regression with a custom loss then you can just use that, specifying a loss of -[y_t*log(y_o) + (1-y_t)*log(1 - y_o)].
One way to do this is to fork scikit-learn and add log loss to the regression SGD solver.
Convert to Classification Problem
You can convert your problem to a classification problem by oversampling, as described by #jo9k. But note that even in this case you should not use standard X-validation because the data are not independent anymore. You will need to break up your data manually into train/test sets and oversample only after you have broken them apart.
Convert to SVM
(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh neural network, which should be amenable to a regression task where the training-data outputs are probabilities. I might come back to this after further review.)
You should get similar results to logistic regression using an SVM with sigmoid kernel. You can use sci-kit learn's SVR function and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features.... which leads me to my final suggestion:
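A minimal sketch of the SVR variant (keeping in mind the caveat above about the sigmoid kernel; p_train again holds the target probabilities):

from sklearn.svm import SVR

svr = SVR(kernel='sigmoid')
svr.fit(X_train, p_train)
# SVR output is not bounded, so clip the predictions back into [0, 1]
p_pred = svr.predict(X_test).clip(0, 1)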
Convert to SVM using Approximated Kernels
This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:
Use scikit-learn's RBFSampler to explicitly construct an approximate RBF kernel feature map for your dataset.
Process your data through that transformation and then use scikit-learn's SGDRegressor with an epsilon-insensitive (SVM-style) loss to realize a super-performant SVM on the transformed data.
The above is laid out with code here
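A rough sketch of those two steps (not the code linked above; the hyperparameter values are placeholders):

from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    RBFSampler(gamma=1.0, n_components=500, random_state=0),  # approximate RBF feature map
    SGDRegressor(loss='epsilon_insensitive'),                 # SVM-style regression loss
)
model.fit(X_train, p_train)               # p_train: target probabilities
p_pred = model.predict(X_test).clip(0, 1)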
Instead of using predict in the scikit-learn library, use the predict_proba function.
Refer here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
