scikit-learn: what is the difference between SVC and SGD?

scikit-learn: what is the difference between SVC and SGD? - python

SVM: http://scikit-learn.org/stable/modules/svm.html#classification
SGD: http://scikit-learn.org/stable/modules/sgd.html#classification
seem to do pretty much the same to my eyes,as they write "an SGD implements a linear model". Can someone explain the differences between them?

SVM is a support-vector machine which is a special linear-model. From a theoretical view it's a convex-optimization problem and we can get the global-optimum in polynomial-time. There are many different optimization-approaches.
In the past people used general Quadratic Programming solvers. Nowadays specialized approaches like SMO and others are used.
sklearn's specialized SVM-optimizers are based on liblinear and libsvm. There are many documents and research papers if you are interested in the algorithms.
Keep in mind, that SVC (libsvm) and LinearSVC (liblinear) make different assumptions in regards to the optimization-problem, which results in different performances on the same task (linear-kernel: LinearSVC much more efficient than SVC in general; but some tasks can't be tackled by LinearSVC).
SGD is an Stochastic Gradient Descent-based (this is a general optimization method!) optimizer which can optimize many different convex-optimization problems (actually: this is more or less the same method used in all those Deep-Learning approaches; so people use it in the non-convex setting too; throwing away theoretical-guarantees).
sklearn says: Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions. Now it's actually even more versatile, but here it's enough to note that it subsumes (some) SVMs, logistic-regression and others.
Now SGD-based optimization is very different from QP and others. If one would take QP for example, there are no hyper-parameters to tune. This is a bit simplified, as there can be tuning, but it's not needed to guarantee convergence and performance! (theory of QP-solvers, e.g. Interior-point method is much more robust)
SGD-based optimizers (or general first-order methods) are very very hard to tune! And they need tuning! Learning-rates or learning-schedules in general are parameters to look at as convergence depends on these (theory and practice)!
It's a very complex topic, but some simplified rules:
Specialized SVM-methods
scale worse with the number of samples
do not need hyper-parameter tuning
SGD-based methods
scale better for huge-data in general
need hyper-parameter tuning
solve only a subset of the tasks approachable by the the above (no kernel-methods!)
My opinion: use (the easier to use) LinearSVC as long as it's working, given your time-budget!
Just to make it clear: i highly recommend grabbing some dataset (e.g. from within sklearn) and do some comparisons between those candidates. The need for param-tuning is not a theoretical-problem! You will see non-optimal (objective / loss) results in the SGD-case quite easily!
And always remember: Stochastic Gradient Descent is sensitive to feature scaling docs. This is more or less a consequence of first-order methods.

SVC(SVM) uses kernel based optimisation, where, the input data is transformed to complex data(unravelled) which is expanded thus identifying more complex boundaries between classes. SVC can perform Linear and Non-Linear classification
SVC can perform Linear classification by setting the kernel parameter to 'linear'
svc = SVC(kernel='linear')
SVC can perform non-linear classification by setting the kernel parameter to 'poly' , 'rbf'(default)
svc = SVC(kernel='poly')
svc = SVC(kernel='rbf')
SGDClassifier uses gradient descent optimisation technique, where, the optimum coefficients are identified by iteration process. SGDClassifier can perform only linear classification
SGDClassifer can use Linear SVC(SVM) model when the parameter loss is set to 'hinge'(which is the default) i.e SGDClassifier(loss='hinge')

Related

How to do regression as opposed to classification using logistic regression and scikit learn

The target variable that I need to predict are probabilities (as opposed to labels). The corresponding column in my training data are also in this form. I do not want to lose information by thresholding the targets to create a classification problem out of it.
If I train the logistic regression classifier with binary labels, sk-learn logistic regression API allows obtaining the probabilities at prediction time. However, I need to train it with probabilities. Is there a way to do this in scikits-learn, or a suitable Python package that scales to 100K data points of 1K dimension.

I want the regressor to use the structure of the problem. One such
structure is that the targets are probabilities.
You can't have cross-entropy loss with non-indicator probabilities in scikit-learn; this is not implemented and not supported in API. It is a scikit-learn's limitation.
In general, according to scikit-learn's docs a loss function is of the form Loss(prediction, target), where prediction is the model's output, and target is the ground-truth value.
In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
For logistic regression you can approximate probabilities as target by oversampling instances according to probabilities of their labels. e.g. if for given sample class_1 has probability 0.2, and class_2 has probability0.8, then generate 10 training instances (copied sample): 8 withclass_2as "ground truth target label" and 2 withclass_1`.
Obviously it is workaround and is not extremely efficient, but it should work properly.
If you're ok with upsampling approach, you can pip install eli5, and use eli5.lime.utils.fit_proba with a Logistic Regression classifier from scikit-learn.
Alternative solution is to implement (or find implementation?) of LogisticRegression in Tensorflow, where you can define loss function as you like it.
In compiling this solution I worked using answers from scikit-learn - multinomial logistic regression with probabilities as a target variable and scikit-learn classification on soft labels. I advise those for more insight.

This is an excellent question because (contrary to what people might believe) there are many legitimate uses of logistic regression as.... regression!
There are three basic approaches you can use if you insist on true logistic regression, and two additional options that should give similar results. They all assume your target output is between 0 and 1. Most of the time you will have to generate training/test sets "manually," unless you are lucky enough to be using a platform that supports SGD-R with custom kernels and X-validation support out-of-the-box.
Note that given your particular use case, the "not quite true logistic regression" options may be necessary. The downside of these approaches is that it is takes more work to see the weight/importance of each feature in case you want to reduce your feature space by removing weak features.
Direct Approach using Optimization
If you don't mind doing a bit of coding, you can just use scipy optimize function. This is dead simple:
Create a function of the following type:
y_o = inverse-logit (a_0 + a_1x_1 + a_2x_2 + ...)
where inverse-logit (z) = exp^(z) / (1 + exp^z)
Use scipy minimize to minimize the sum of -1 * [y_t*log(y_o) + (1-y_t)*log(1 - y_o)], summed over all datapoints. To do this you have to set up a function that takes (a_0, a_1, ...) as parameters and creates the function and then calculates the loss.
Stochastic Gradient Descent with Custom Loss
If you happen to be using a platform that has SGD regression with a custom loss then you can just use that, specifying a loss of y_t*log(y_o) + (1-y_t)*log(1 - y_o)
One way to do this is just to fork sci-kit learn and add log loss to the regression SGD solver.
Convert to Classification Problem
You can convert your problem to a classification problem by oversampling, as described by #jo9k. But note that even in this case you should not use standard X-validation because the data are not independent anymore. You will need to break up your data manually into train/test sets and oversample only after you have broken them apart.
Convert to SVM
(Edit: I did some testing and found that on my test sets sigmoid kernels were not behaving well. I think they require some special pre-processing to work as expected. An SVM with a sigmoid kernel is equivalent to a 2-layer tanh Neural Network, which should be amenable to a regression task structured where training data outputs are probabilities. I might come back to this after further review.)
You should get similar results to logistic regression using an SVM with sigmoid kernel. You can use sci-kit learn's SVR function and specify the kernel as sigmoid. You may run into performance difficulties with 100,000s of data points across 1000 features.... which leads me to my final suggestion:
Convert to SVM using Approximated Kernels
This method will give results a bit further away from true logistic regression, but it is extremely performant. The process is the following:
Use a sci-kit-learn's RBFsampler to explicitly construct an approximate rbf-kernel for your dataset.
Process your data through that kernel and then use sci-kit-learn's SGDRegressor with a hinge loss to realize a super-performant SVM on the transformed data.
The above is laid out with code here

Instead of using predict in the scikit learn library use predict_proba function
refer here:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba

Elastic net regression or lasso regression with weighted samples (sklearn)

Scikit-learn allows sample weights to be provided to linear, logistic, and ridge regressions (among others), but not to elastic net or lasso regressions. By sample weights, I mean each element of the input to fit on (and the corresponding output) is of varying importance, and should have an effect on the estimated coefficients proportional to its weight.
Is there a way I can manipulate my data before passing it to ElasticNet.fit() to incorporate my sample weights?
If not, is there a fundamental reason it is not possible?
Thanks!

You can read some discussion about this in sklearn's issue-tracker.
It basically reads like:
not that hard to do (theory-wise)
pain keeping all the basic sklearn'APIs and supporting all possible cases (dense vs. sparse)
As you can see in this thread and the linked one about adaptive lasso, there is not much activity there (probably because not many people care and the related paper is not popular enough; but that's only a guess).
Depending on your exact task (size? sparseness?), you could build your own optimizer quite easily based on scipy.optimize, supporting this kind of sample-weights (which will be a bit slower, but robust and precise)!

Can I use SQP(Sequential quadratic programming) in scipy for neural network regression optimization?

As title, after training and testing my neural network model in python.
Can I use SQP function in scipy for neural network regression problem optimization?
For example, I am using temperature,humid,wind speed ,these three feature for input,predicting energy usage in some area.
So I use neural network to model these input and output's relationship, now I wanna know some energy usage lowest point, what input feature are(i.e. what temperature,humid,wind seed are).This just example so may sound unrealistic.
Because as far as I know, not so many people just use scipy for neural network optimization. But in some limitation , scipy is the most ideal optimization tool what I have by now(p.s.: I can't use cvxopt).
Can someone give me some advice? I will be very appreciate!

Sure, that's possible, but your question is too broad to give a complete answer as all details are missing.
But: SLSQP is not the right tool!
There is a reason, NN training is dominated by first-order methods like SGD and all it's variants
Fast calculation of gradients and easy to do in mini-batch mode (not paying for the full gradient; less memory)
Very different convergence theory for Stochastic-Gradient-Descent which is usually much better for large-scale problems
In general: fast iteration speed (e.g. time per epoch) while possibly needing more epochs (for full convergence)
NN is unconstrained continuous optimization
SLSQP is a very general optimization able to tackle constraints and you will pay for that (performance and robustness)
LBFGS is actually the only tool (which i saw) sometimes used to do that (and also available in scipy)
It's a bound-constrained optimizer (no general constraints as SLSQP)
It approximates the inverse-hessian and therefore memory-usage is greatly reduced compared to BFGS and also SLSQP
Both methods are full-batch methods (opposed to the online/minibatch nature of SGD
They are also using Line-searches or something similar which results less hyper-parameters to tune: no learning-rates!
I think you should stick to SGD and it's variants.
If you want to go for the second-order approach: learn from sklearn's implementation using LBFGS

What are the initial estimates taken in Logistic regression in Scikit-learn for the first iteration?

I am trying out logistic regression from scratch in python.(through finding probability estimates,cost function,applying gradient descent for increasing the maximum likelihood).But I have a confusion regarding which estimates should I take for the first iteration process.I took all the estimates as 0(including the intercept).But the results are different from that we get in Scikit-learn.I want to know which are the initial estimates taken in Scikit-learn for logistic regression?

First of all scikit learn's LogsiticRegerssion uses regularization. So unless you apply that too , it is unlikely you will get exactly the same estimates. if you really want to test your method versus scikit's , it is better to use their gradient decent implementation of Logistic regersion which is called SGDClassifier . Make certain you put loss='log' for logistic regression and set alpha=0 to remove regularization, but again you will need to adjust the iterations and eta as their implementation is likely to be slightly different than yours.
To answer specifically about the initial estimates, I don't think it matters, but most commonly you set everything to 0 (including the intercept) and should converge just fine.
Also bear in mind GD (gradient Decent) models are hard to tune sometimes and you may need to apply some scaling(like StandardScaler) to your data beforehand as very high values are very likely to drive your gradient out of its slope. Scikit's implementation adjusts for that.

Optimal SVM parameters for high recall

I'm using scikit-learn to perform classification using SVM. I'm performing a binary classification task.
0: Does not belong to class A
1: Belongs to class A
Now, I want to optimize the parameters such that I get high recall. I don't care much about a few false positives but the objects belonging to class A should not be labelled as not belonging to A often.
I use a SVM with linear kernel.
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X,Y)
clf.predict(...)
How should I choose other SVM parameters like C? Also, what is the difference between SVC with a linear kernel and LinearSVC?

The choice of the kernel is really dependent on the data, so picking the kernel based on a plot of the data might be the way to go. This could be automated by running through all kernel types and picking the one that gives you either high/low recall or bias, whatever you're looking for. You can see for yourself the visual difference of the kernels.
Depending on the kernel different arguments of the SVC constructor are important, but in general the C is possibly the most influential, as it's the penalty for getting it wrong. Decreasing C would increase the recall.
Other than that there's more ways to get a better fit, for example by adding more features to the n_features of the X matrix passed on to svm.fit(X,y).
And of course it can always be useful to plot the precision/recall to get a better feel of what the parameters are doing.

Generally speaking you can tackle this problem by penalizing the two types of errors differently during the learning procedure. If you take a look at the loss function, in particular in the primal/parametric setting, you can think of scaling the penalty of false-negatives by alpha and penalty of false-positives by (1 - alpha), where alpha is in [0 1]. (To similar effect would be duplicating the number of positive instances in your training set, but this makes your problem unnecessarily larger, which should be avoided for efficiency)

You can choose the SVM parameter C, which is basically your penalty term, by cross-validation. Here you can use K-Fold cross-validation. You can also use a sklearn class called gridsearchCV in which you can pass your model and then perform cross-validation on it using the cv parameter.
According to linearSVC documentation -
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.