`sample_weight` in sklearn LogisticRegression: how does it work?

The fit method of LogisticRegression has an optional sample_weight parameter. I followed the Python code and found that it just does some trivial things and dispatches to the underlying solvers (e.g. liblinear). How does sample_weight work? Does it work through oversampling or some other method?
Update
As @Alexander McFarlane said, it isn't immediately obvious that the sample weight in the decision tree is the same as in the logistic regression unless you look at the source.
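It does not oversample in the sense of copying rows; the weight simply scales each sample's contribution to the loss (and its gradient) inside the solver. Here is a minimal sketch to check that intuition, using made-up data and weights; integer weights should behave like physically duplicating the corresponding rows, up to solver tolerance:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randint(0, 2, size=20)

# Give the first 5 samples weight 3, the rest weight 1.
w = np.ones(20)
w[:5] = 3

# Fit once with sample_weight ...
clf_weighted = LogisticRegression(tol=1e-8, max_iter=1000).fit(X, y, sample_weight=w)

# ... and once on a dataset where those 5 rows are repeated two extra times.
X_dup = np.vstack([X, X[:5], X[:5]])
y_dup = np.concatenate([y, y[:5], y[:5]])
clf_duplicated = LogisticRegression(tol=1e-8, max_iter=1000).fit(X_dup, y_dup)

# The coefficients should agree up to solver tolerance, because the weight
# just multiplies each sample's term in the penalised log-loss.
print(clf_weighted.coef_)
print(clf_duplicated.coef_)

The weights are passed straight through to liblinear and the other solvers, which apply them inside their objective, which is why the Python wrapper looks so thin.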

Related

Why RandomForestClassifier doesn't have cost_complexity_pruning_path method?

In trying to prevent my Random Forest model from overfitting on the training dataset, I looked at the ccp_alpha parameter.
I notice that it is possible to tune it with a hyperparameter search method (such as GridSearchCV).
I discovered that there is a Scikit-Learn tutorial for tuning this ccp_alpha parameter for Decision Tree models.
The methodology described uses the cost_complexity_pruning_path method of the Decision Tree model. This section explains well how the method works. I understand that it seeks to find a sub-tree of the generated model that reduces overfitting, while using values of ccp_alpha determined by the cost_complexity_pruning_path method.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
However, I wonder why the Random Forest model type in scikit-learn does not implement this ccp_alpha selection and pruning concept.
Would it be possible to do this with a little tinkering? (One possible approach is sketched below.)
It seems more logical to me than trying to find a good value by a plain hyperparameter search (whichever method you use).
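The forest itself has no cost_complexity_pruning_path method, but RandomForestClassifier does accept a ccp_alpha parameter that is applied to every tree it builds. One possible piece of tinkering (a sketch, not an established recipe, reusing the X_train and y_train from the snippet above) is to take the alphas from a single tree's pruning path and let cross-validation pick among them for the forest:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate alphas from one tree's pruning path (a heuristic choice of grid).
tree = DecisionTreeClassifier(random_state=0)
path = tree.cost_complexity_pruning_path(X_train, y_train)
candidate_alphas = np.unique(path.ccp_alphas)

# Cross-validate ccp_alpha for the whole forest; the chosen value is applied
# to every tree in the ensemble.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"ccp_alpha": candidate_alphas},
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)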

Should I use LassoCV or GridSearchCV to find an optimal alpha for Lasso?

From my understanding, when using Lasso regression, you can use GridSearchCV or LassoCV in sklearn to find the optimal alpha, the regularization parameter. Which one is preferred over the other?
You can get the same results with both. LassoCV makes it easier by letting you pass an array of alpha values to alphas, as well as a cross-validation parameter, directly into the estimator.
To do the same thing with GridSearchCV, you would have to pass it a Lasso estimator, a grid of alpha values (e.g. {'alpha': [.5, 1, 5]}), and the CV parameter.
I would not recommend one over the other, though. The only advantage I can see is that you can access cv_results_ as well as many other attributes if you use GridSearchCV. This may be helpful if you want a summary of all the models returned for the alphas you tried. On the other hand, as pointed out by @amiola, LassoCV can take advantage of pre-computed results from previous steps of the cross-validation process (aka warm starting), which may result in faster fitting times.
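For illustration, here is a sketch of how the two approaches line up on the same (arbitrary) alpha grid, using toy data:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)
alphas = np.array([0.1, 0.5, 1.0, 5.0])

# Option 1: LassoCV does the alpha search internally.
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
print("LassoCV chose alpha =", lasso_cv.alpha_)

# Option 2: GridSearchCV over a plain Lasso estimator.
grid = GridSearchCV(Lasso(), param_grid={"alpha": alphas}, cv=5).fit(X, y)
print("GridSearchCV chose alpha =", grid.best_params_["alpha"])
# grid.cv_results_ holds the per-alpha, per-fold scores if you want the full summary.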

Does scikit-learn use One-Vs-Rest by default in multi-class classification?

I am dealing with a multi-class problem (4 classes) and I am trying to solve it with scikit-learn in Python.
I saw that I have three options:
I simply instantiate a classifier, then I fit with train and evaluate with test;
classifier = svm.LinearSVC(random_state=123)
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsRest object, generating a new classifier that I use for train and test;
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsOne object, generating a new classifier that I use for train and test.
classifier = OneVsOneClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I understand the difference between OneVsRest and OneVsOne, but I cannot understand what I am doing in the first scenario, where I do not explicitly pick either of these two options. What does scikit-learn do in that case? Does it implicitly use OneVsRest?
Any clarification on the matter would be highly appreciated.
Best,
MR
Edit:
Just to make things clear, I am not specifically interested in the case of SVMs. For example, what about RandomForest?
Updated answer: As clarified in the comments and edits, the question is more about the general setting of sklearn, and less about the specific case of LinearSVC which is explained below.
The main difference here is that some of the classifiers you can use have "built-in multiclass classification support", i.e. the algorithm can discern between more than two classes by default. Examples would be a Random Forest, or a Multi-Layer Perceptron (MLP) with multiple output nodes.
In these cases, a OneVs wrapper is not required at all, since you are already solving your task. In fact, using such a strategy might even decrease your performance, since you are "hiding" potential correlations from the algorithm by letting it decide only between single binary instances.
On the other hand, algorithms like SVC or LinearSVC only support binary classification. So, to extend these (well-performing) classes of algorithms to the multiclass setting, we instead have to rely on reducing the initial multiclass classification task to a set of binary classification tasks.
As far as I am aware, the most complete overview can be found in the scikit-learn multiclass documentation (http://scikit-learn.org/stable/modules/multiclass.html):
If you scroll down a little bit, you can see which one of the algorithms is inherently multiclass, or uses either one of the strategies by default.
Note that the estimators listed under OvO (such as SVC) now default to decision_function_shape='ovr', even though their underlying multiclass training is still one-vs-one, so the page can be slightly confusing in that regard.
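A quick way to see the difference in practice, as a sketch on synthetic 4-class data: a Random Forest handles the four classes natively, while a wrapped LinearSVC is decomposed into four one-vs-rest binary problems:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=4, random_state=123)

# Inherently multiclass: a single model giving probabilities over all 4 classes.
rf = RandomForestClassifier(random_state=123).fit(X, y)
print(rf.predict_proba(X).shape)   # (300, 4)

# Reduction strategy: 4 separate binary LinearSVC estimators under the hood.
ovr = OneVsRestClassifier(LinearSVC(random_state=123)).fit(X, y)
print(len(ovr.estimators_))        # 4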
Initial answer:
This is a question that can easily be answered by looking at the relevant scikit-learn documentation.
Generally, the expectation on Stack Overflow is that you have at least done some research of your own, so please consider looking into the existing documentation first.
multi_class : string, ‘ovr’ or ‘crammer_singer’ (default=’ovr’)
Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while "crammer_singer" optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If "crammer_singer" is chosen, the options loss, penalty and dual will be ignored.
So, clearly, LinearSVC uses one-vs-rest by default.
The "regular" SVC, by the way, handles multiclass problems with a one-vs-one scheme internally (its decision_function_shape='ovr' default only affects the shape of the returned decision function).

scikit-learn: what is the difference between SVC and SGD?

SVM: http://scikit-learn.org/stable/modules/svm.html#classification
SGD: http://scikit-learn.org/stable/modules/sgd.html#classification
These seem to do pretty much the same thing to my eyes, as they write "an SGD implements a linear model". Can someone explain the differences between them?
SVM is a support-vector machine, which is a special linear model. From a theoretical view it's a convex optimization problem, so we can get the global optimum in polynomial time. There are many different optimization approaches.
In the past, people used general Quadratic Programming solvers. Nowadays, specialized approaches like SMO and others are used.
sklearn's specialized SVM optimizers are based on liblinear and libsvm. There are many documents and research papers available if you are interested in the algorithms.
Keep in mind that SVC (libsvm) and LinearSVC (liblinear) make different assumptions about the optimization problem, which results in different performance on the same task (with a linear kernel, LinearSVC is in general much more efficient than SVC, but some tasks can't be tackled by LinearSVC).
SGD is a Stochastic Gradient Descent-based optimizer (SGD is a general optimization method!) which can optimize many different convex optimization problems (it is actually more or less the same method used in all those Deep Learning approaches, so people use it in the non-convex setting too, throwing away the theoretical guarantees).
sklearn says: Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions. It is actually even more versatile than that, but here it's enough to note that it subsumes (some) SVMs, logistic regression and others.
Now SGD-based optimization is very different from QP and the others. If one takes QP, for example, there are no hyper-parameters to tune. This is slightly simplified, as there can be tuning, but it's not needed to guarantee convergence and performance (the theory behind QP solvers, e.g. the interior-point method, is much more robust).
SGD-based optimizers (and first-order methods in general) are very, very hard to tune, and they need tuning! Learning rates, or learning-rate schedules more generally, are the parameters to look at, since convergence depends on them (in theory and in practice).
It's a very complex topic, but some simplified rules:
Specialized SVM-methods
scale worse with the number of samples
do not need hyper-parameter tuning
SGD-based methods
scale better for huge-data in general
need hyper-parameter tuning
solve only a subset of the tasks approachable by the above (no kernel methods!)
My opinion: use (the easier-to-use) LinearSVC as long as it's working, given your time budget!
Just to make it clear: I highly recommend grabbing some dataset (e.g. from within sklearn) and doing some comparisons between those candidates; the need for parameter tuning is not just a theoretical problem, and you will quite easily see non-optimal (objective / loss) results in the SGD case. A starting point is sketched below.
And always remember: Stochastic Gradient Descent is sensitive to feature scaling (see the docs). This is more or less a consequence of using first-order methods.
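As a concrete starting point for such a comparison, here is a sketch on a built-in dataset; the StandardScaler in both pipelines is there precisely because SGD is scale-sensitive:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# Specialized linear-SVM solver (liblinear): no learning rate to worry about.
svc = make_pipeline(StandardScaler(), LinearSVC())

# SGD with hinge loss optimizes (roughly) the same objective, but the result
# depends on the learning-rate schedule, number of epochs, etc.
sgd = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", random_state=0))

for name, model in [("LinearSVC", svc), ("SGDClassifier", sgd)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())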
SVC (SVM) uses kernel-based optimisation: the input data is transformed into a higher-dimensional feature space, which makes it possible to identify more complex boundaries between the classes. SVC can perform both linear and non-linear classification.
SVC performs linear classification when the kernel parameter is set to 'linear':
svc = SVC(kernel='linear')
SVC performs non-linear classification when the kernel parameter is set to 'poly' or 'rbf' (the default):
svc = SVC(kernel='poly')
svc = SVC(kernel='rbf')
SGDClassifier uses a stochastic gradient descent optimisation technique, in which the optimum coefficients are identified by an iterative process. SGDClassifier can perform only linear classification.
SGDClassifier fits a linear SVC (SVM) model when the loss parameter is set to 'hinge', which is the default, i.e. SGDClassifier(loss='hinge').
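To make that last point concrete: switching the loss changes which linear model SGDClassifier is fitting (the loss names below follow recent scikit-learn versions, where the logistic loss is spelled 'log_loss'; older releases use 'log'):

from sklearn.linear_model import SGDClassifier

# hinge loss -> a linear SVM, the default behaviour
svm_like = SGDClassifier(loss="hinge")

# logistic loss -> logistic regression fitted by SGD
logreg_like = SGDClassifier(loss="log_loss")

Both estimators are then fitted and scored in exactly the same way as the SVC examples above.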

Does sklearn have group lasso?

I'm interested in using group lasso for a problem I have. Here is a link to the algorithm. I know R has a slick implementation, but I'm curious to see if Python has something similar.
I think sklearn.linear_model.MultiTaskLasso might be kind of similar, but am not sure. Can anyone shed some light on this?
Whether or not to implement the Group Lasso in sklearn is discussed in this issue in the sklearn repo, where the conclusion so far is that it is too much of a niche model to justify the maintenance it would need if included in master.
Therefore, I have implemented a GroupLasso class, which passes sklearn's check_estimator(), in my Python/Cython package celer, which acts as a drop-in replacement for sklearn's Lasso, MultiTaskLasso and sparse logistic regression, with faster solvers.
The solver uses coordinate descent, working set methods and extrapolation, which should allow it to scale to problems with millions of features.
It supports sparse and dense data, along with centering and normalization (centering sparse data is not trivial as it breaks the sparsity of the design matrix), and comes with a GroupLassoCV class to perform cross-validation.
In celer's documentation, there is an example showing how to use it.
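A rough sketch of what usage might look like; treat the groups argument below as my reading of celer's API (an integer meaning contiguous groups of that size) and check celer's documentation for the exact signature:

import numpy as np
from celer import GroupLasso

rng = np.random.RandomState(0)
X = rng.randn(100, 30)
y = rng.randn(100)

# 30 features split into contiguous groups of 5 features -> 6 groups; the
# group-lasso penalty selects or discards each group as a whole.
model = GroupLasso(groups=5, alpha=0.5)
model.fit(X, y)
print(model.coef_.reshape(6, 5))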
I've also looked into this; as far as I know, scikit-learn does not provide this implementation.
The MultiTaskLasso does something else. From the documentation:
"The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks."
In other words, the MultiTaskLasso is an implementation of the Lasso which is able to predict multiple targets at the same time (hence y is a 2D array). This problem is also known as 'multi-output regression' or 'multi-target regression'. If the tasks are related, such methods can improve on modelling every task or target separately.
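To illustrate the distinction: MultiTaskLasso couples several outputs (targets), whereas a group lasso couples several input features. A minimal sketch of the former, on random data chosen purely to show the shapes:

import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
Y = rng.randn(50, 3)   # 3 targets ("tasks"), hence a 2D y

model = MultiTaskLasso(alpha=0.5).fit(X, Y)
print(model.coef_.shape)   # (3, 10): one row of coefficients per task

# The sparsity pattern is shared across tasks: a feature is either kept for
# all three tasks or dropped for all of them.
print(np.all(model.coef_ == 0, axis=0))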
