Nature and redundancy of classifiers - python

I am applying a set of linear and non-linear classification models in a classification task. The input data are language vectors (CountVectorizer, Word2Vec) and binary labels. In scikit-learn, I selected the following estimators:
LogisticRegression(),
LinearSVC(),
XGBClassifier(),
SGDClassifier(),
SVC(), # Radial basis function kernel
BernoulliNB(), # Naive Bayes seems widely used for LV models
KNeighborsClassifier(),
RandomForestClassifier(),
MLPClassifier()
Question: Am I correct that LinearSVC() is a linear classifier, at least for the case of a binary estimator?
Question: In the view of experts, is there any significant redundancy among the classifiers?
Thanks for clarification.

LogisticRegression(), LinearSVC(), SGDClassifier() and BernoulliNB() are linear models.
With the default loss function (hinge), SGDClassifier() works as a linear SVM; with log loss it works as a logistic regression. So one of these three is redundant. You could also replace LogisticRegression() with LogisticRegressionCV(), which has built-in cross-validated tuning of the regularization hyperparameter.
XGBClassifier() and all the others are non-linear.
The list seems to include all the major sklearn classifiers.
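To make the redundancy concrete, here is a minimal sketch that fits three of the linear models on the same data. The synthetic data set and the names `w_true`, `models` and `scores` are assumptions made for illustration; they are not from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic, roughly linearly separable data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
w_true = rng.normal(size=10)
y = (X @ w_true + 0.3 * rng.normal(size=600) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "LinearSVC": LinearSVC(),
    # Hinge loss: a linear SVM trained by stochastic gradient descent.
    "SGDClassifier (hinge)": SGDClassifier(loss="hinge", random_state=0),
    # Logistic regression that tunes C by internal cross-validation.
    "LogisticRegressionCV": LogisticRegressionCV(Cs=10, cv=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy {scores[name]:.3f}")

# In the binary case each model learns one weight vector, i.e. one hyperplane.
print(models["LinearSVC"].coef_.shape)
```

On data like this the three models tend to score about the same, which is the practical meaning of "redundant" here; on real text features the gap may be larger, so it is worth checking before dropping one.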

Related

How to get probabilities for SGDClassifier (LinearSVM)

I'm using SGDClassifier with loss function = "hinge". But hinge loss does not support probability estimates for class labels.
I need probabilities for calculating roc_curve. How can I get probabilities for hinge loss in SGDClassifier without using SVC from svm?
I've seen people mention using CalibratedClassifierCV to get the probabilities, but I've never used it and I don't know how it works.
I really appreciate the help. Thanks
In the strict sense, that's not possible.
Support vector machine classifiers are non-probabilistic: they use a hyperplane (a line in 2D, a plane in 3D and so on) to separate points into one of two classes. A point is characterized only by which side of the hyperplane it is on, which forms the prediction directly.
This is in contrast with probabilistic classifiers like logistic regression and decision trees, which generate a probability for every point that is then converted to a prediction.
CalibratedClassifierCV is a sort of meta-estimator; to use it, you simply pass your instance of a base estimator to its constructor, so this will work:
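from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
base_model = SGDClassifier()  # default loss is "hinge"
model = CalibratedClassifierCV(base_model)
model.fit(X, y)
model.predict_proba(X)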
What it does is perform internal cross-validation to produce calibrated probability estimates. Note that this is similar to what sklearn.svm.SVC does internally when probability=True.
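For the ROC use case in the question, a runnable sketch might look like the following. The synthetic data and variable names are assumptions for illustration. It also shows an alternative the answer does not mention: roc_curve only needs a ranking score, so the raw output of decision_function can be passed directly without calibration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=500) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Option 1: calibrated probabilities from a hinge-loss model.
calibrated = CalibratedClassifierCV(
    SGDClassifier(loss="hinge", random_state=0), cv=5
)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]  # P(class 1)
fpr_p, tpr_p, _ = roc_curve(y_test, proba)

# Option 2: uncalibrated signed distances to the hyperplane.
svm = SGDClassifier(loss="hinge", random_state=0).fit(X_train, y_train)
margin = svm.decision_function(X_test)
fpr_m, tpr_m, _ = roc_curve(y_test, margin)

auc_proba = roc_auc_score(y_test, proba)
auc_margin = roc_auc_score(y_test, margin)
print(f"AUC from calibrated probabilities: {auc_proba:.3f}")
print(f"AUC from decision_function:        {auc_margin:.3f}")
```

Option 2 is enough if the ROC curve is the only goal; Option 1 is the one to use when actual probability values are needed downstream.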

Online version of Ridge Regression Classifier in scikit-learn?

I'm trying a range of online classifiers in the scikit-learn library to train a model on a huge data set. I found that many classifiers support partial_fit, which allows incremental learning. I want to use the Ridge Regression classifier in this setting, but could not find an online implementation of it. Is there an alternative model that can do this in sklearn?
sklearn.linear_model.SGDClassifier supports partial_fit; its loss options include 'hinge', 'log', 'modified_huber', 'squared_hinge' and 'perceptron'.
sklearn.linear_model.SGDRegressor also supports partial_fit; its default loss is 'squared_loss', and the possible values are 'squared_loss', 'huber', 'epsilon_insensitive' and 'squared_epsilon_insensitive'.

Working of Regression in sklearn.linear_model.LogisticRegression

How does scikit-learn's sklearn.linear_model.LogisticRegression class work with regression as well as classification problems?
As given on the Wikipedia page as well as a number of sources, since the output of Logistic Regression is based on the sigmoid function, it returns a probability. Then how does the sklearn class work as both a classifier and regressor?
Logistic regression is a method for classification, not regression. This goes for scikit-learn as for anywhere else.
If you have entered continuous values as the target vector y, then LogisticRegression will most probably fail, as it interprets the unique values of y, i.e. np.unique(y) as different classes. So you may end up having as many classes as samples.
TL;DR: Logistic regression needs a categorical target variable, because it is a classification method.
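A quick way to see this is to try it. The sketch below uses made-up data; in the scikit-learn versions I am aware of, fitting on a continuous target raises a ValueError about an unknown label type, although the exact message may vary between releases.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y_continuous = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# A continuous target is rejected by the classifier.
try:
    LogisticRegression().fit(X, y_continuous)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print("LogisticRegression on continuous y:", error_message)

# A categorical target works and yields class probabilities.
y_binary = (y_continuous > 0).astype(int)
clf = LogisticRegression().fit(X, y_binary)
probabilities = clf.predict_proba(X[:3])
print(probabilities)

# For a genuinely continuous target, use a regressor instead.
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict(X[:3]))
```

The probabilities returned by predict_proba are the sigmoid outputs the question refers to; they are still used to pick a class, not to predict a continuous quantity.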

How to increase the precision of text classification with the RBM?

I am learning about text classification, and I classify my own corpus with logistic regression as follows:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', C=7)
classifier.fit(training_matrix, y_train)
prediction = classifier.predict(testing_matrix)
I would like to improve the classification report with the Restricted Boltzmann Machine that scikit-learn provides. From the documentation I read that it can be used to improve classification recall, f1-score, accuracy, etc. Could anybody help me with this? This is what I have tried so far, thanks in advance:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.5,
                             max_features=None,
                             ngram_range=(1, 1),
                             norm='l2',
                             use_idf=True)
X_train = vectorizer.fit_transform(X_train_r)
X_test = vectorizer.transform(X_test_r)
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
logistic = LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
classifier.fit(X_train, y_train)
First, you have to understand the concepts here. RBM can be seen as a powerful clustering algorithm and clustering algorithms are unsupervised, i.e., they don't need labels.
Perhaps the best way to use an RBM in your problem is first to train the RBM (which only needs data without labels) and then use the RBM weights to initialize a neural network. To get a logistic regression at the output, you have to add an output layer with a logistic regression cost function to this neural network and then train the whole network. This setup may improve performance.
There are a couple of things that could be wrong.
1. You haven't properly calibrated the RBM
Look at the example on the scikit-learn site: http://scikit-learn.org/stable/auto_examples/plot_rbm_logistic_classification.html
In particular, these lines:
rbm.learning_rate = 0.06
rbm.n_iter = 20
# More components tend to give better prediction performance, but larger
# fitting time
rbm.n_components = 100
You don't set these anywhere. In the example, these are obtained through cross validation using a grid search. You should do the same and try to obtain (close to) optimal parameters for your own problem.
Additionally, you might want to use cross validation to determine other parameters as well, such as the ngram range and the logistic regression parameters. Using higher-order ngrams usually helps if you can afford the memory and execution time, and for some problems character-level ngrams do better than word-level ones.
2. You are just unlucky
There is nothing that says using an RBM in an intermediate step will definitely improve any performance measure. It can, but it's not a rule, it may very well do nothing or very little for your problem. You have to be prepared for this.
It's worth trying because it shouldn't take long to implement, but be prepared to have to look elsewhere.
Also look at the SGDClassifier and the PassiveAggressiveClassifier. These might improve performance.
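The grid search suggested in point 1 can be sketched as follows. The random matrix stands in for a TF-IDF matrix (BernoulliRBM expects feature values in [0, 1], which L2-normalized TF-IDF satisfies), and the grid values are arbitrary assumptions for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Stand-in data with features in [0, 1] (illustrative only).
rng = np.random.default_rng(0)
X = rng.random(size=(120, 20))
y = (X[:, :5].sum(axis=1) > 2.5).astype(int)

pipeline = Pipeline(steps=[
    ("rbm", BernoulliRBM(n_iter=5, random_state=0)),
    ("logistic", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "rbm__learning_rate": [0.01, 0.06],
    "rbm__n_components": [10, 50],
    "logistic__C": [1.0, 100.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(X, y)
print(search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.3f}")
```

It is worth running the same search on a plain LogisticRegression as a baseline, since, as noted in point 2, the RBM step is not guaranteed to help.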

Evaluation of multilabel and multiclass data labels

Are there any evaluation metrics available for multiclass-multilabel classification?
For example, I'm taking part in the following competition at Kaggle, and it requires ROC AUC as the evaluation metric: http://www.kaggle.com/c/mlsp-2013-birds
Is it possible to do this using sklearn?
There's this library from Kaggle's Director of Engineering:
https://github.com/benhamner/Metrics/tree/master/Python
As of 2021, sklearn.metrics includes several functions you can use for evaluating multiclass-multilabel classification models. For example accuracy_score can calculate the fraction of correct (i.e. all predicted labels are correct) predictions. The hamming_loss function can calculate the Hamming Loss, or fraction of labels that are incorrectly predicted, in a given test set. You can find an in-depth discussion of the available metrics here.
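As a small illustration of these functions, here is a sketch on a made-up multilabel problem with four samples and three labels. The arrays are invented for the example. roc_auc_score is included because the question asks about ROC AUC; it accepts multilabel indicator targets together with per-label scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, roc_auc_score

# Rows are samples, columns are labels (multilabel indicator format).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 1, 0],
                   [0, 0, 0]])
# Per-label scores, e.g. from predict_proba or decision_function.
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.7, 0.6],
                    [0.8, 0.6, 0.3],
                    [0.3, 0.4, 0.4]])

# Subset accuracy: a sample counts only if every label matches (2 of 4 rows).
subset_accuracy = accuracy_score(y_true, y_pred)

# Hamming loss: fraction of individual labels that are wrong (2 of 12).
h_loss = hamming_loss(y_true, y_pred)

# ROC AUC computed per label, then averaged over the three labels.
macro_auc = roc_auc_score(y_true, y_score, average="macro")

print(f"subset accuracy: {subset_accuracy:.3f}")
print(f"hamming loss:    {h_loss:.3f}")
print(f"macro ROC AUC:   {macro_auc:.3f}")
```

Note that the per-label AUC is only defined for labels that have both positive and negative examples in y_true.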
