python scikits learn - SVM options

python scikits learn - SVM options - python

Just curious about two options in scikits learn SVM class. What does Scale_C and shrinking do? There wasn't much in the documentation. Scale C seems to be able to scale the C paramter appropriately for the training data.
Thanks

scale_C=True (deprecated in the dev version and scheduled for removal in 0.12) causes the regularization parameter C to be divided by the number of samples before it is handed to the underlying LibSVM implementation.
shrinking enables or disables the "shrinking heuristic", described by Joachims 1999, that should speed up SVM training.

Related

How to choose parameters for svm in sklearn

I'm trying to use SVM from sklearn for a classification problem. I got a highly sparse dataset with more than 50K rows and binary outputs.
The problem is I don't know quite well how to efficiently choose the parameters, mainly the kernel, gamma anc C.
For the kernels for example, am I supposed to try all kernels and just keep the one that gives me the most satisfying results or is there something related to our data that we can see in the first place before choosing the kernel ?
Same goes for C and gamma.
Thanks !

Yes, this is mostly a matter of experimentation -- especially as you've told us very little about your data set: separability, linearity, density, connectivity, ... all the characteristics that affect classification algorithms.
Try the linear and Gaussian kernels for starters. If linear doesn't work well and Gaussian does, then try the other kernels.
Once you've found the best 1 or 2 kernels, then play with the cost and gamma parameters. Gamma is a "slack" parameter: it gives the kernel permission to make a certain proportion of raw classification errors as a trade-off for other benefits: width of the gap, simplicity of the partition function, etc.
I haven't yet had an application that got more than trivial benefit from altering the cost.

Optimal SVM parameters for high recall

I'm using scikit-learn to perform classification using SVM. I'm performing a binary classification task.
0: Does not belong to class A
1: Belongs to class A
Now, I want to optimize the parameters such that I get high recall. I don't care much about a few false positives but the objects belonging to class A should not be labelled as not belonging to A often.
I use a SVM with linear kernel.
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X,Y)
clf.predict(...)
How should I choose other SVM parameters like C? Also, what is the difference between SVC with a linear kernel and LinearSVC?

The choice of the kernel is really dependent on the data, so picking the kernel based on a plot of the data might be the way to go. This could be automated by running through all kernel types and picking the one that gives you either high/low recall or bias, whatever you're looking for. You can see for yourself the visual difference of the kernels.
Depending on the kernel different arguments of the SVC constructor are important, but in general the C is possibly the most influential, as it's the penalty for getting it wrong. Decreasing C would increase the recall.
Other than that there's more ways to get a better fit, for example by adding more features to the n_features of the X matrix passed on to svm.fit(X,y).
And of course it can always be useful to plot the precision/recall to get a better feel of what the parameters are doing.

Generally speaking you can tackle this problem by penalizing the two types of errors differently during the learning procedure. If you take a look at the loss function, in particular in the primal/parametric setting, you can think of scaling the penalty of false-negatives by alpha and penalty of false-positives by (1 - alpha), where alpha is in [0 1]. (To similar effect would be duplicating the number of positive instances in your training set, but this makes your problem unnecessarily larger, which should be avoided for efficiency)

You can choose the SVM parameter C, which is basically your penalty term, by cross-validation. Here you can use K-Fold cross-validation. You can also use a sklearn class called gridsearchCV in which you can pass your model and then perform cross-validation on it using the cv parameter.
According to linearSVC documentation -
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

Large training and testing data in libsvm

I'm using Libsvm in a 5x2 cross validation to classify a very huge amount of data, that is, I have 47k samples for training and 47k samples for testing in 10 different configurations.
I usually use the Libsvm's script easy.py to classify the data, but it's taking so long, I've been waiting for results for more than 3 hours and nothing, and I still have to repeat this procedure more 9 times!
does anybody know how to use the libsvm faster with a very huge amount of data? does the C++ Libsvm functions work faster than the python functions?

LibSVM's training algorithm doesn't scale up to this kind of datasets; it takes O(n³) time in the worst case and around O(n²) on typical ones. The first thing to try is scaling your datasets properly; if it still doesn't work, switch to
a linear SVM (or logistic regression) using, e.g., Liblinear, Vowpal Wabbit or Wapiti
a more scalable kernel SVM implementation such as LaSVM

As larsmans mentioned, libsvm may not scale all that well depending on the dimensionality of the data and the number of data points.
The C implementation may run a bit faster, but it won't be a significant difference. You have a few options available to you.
You could randomly sample your data to work on a small subset of it.
You could project your data into a lower dimension with something like PCA
Depending on your data type, you can look into different kernels. Would a histogram intersection kernel work out for your data? Are you using an RBF kernel when you really just need a linear decision function?
Hope this helps! One of the toughest problems in machine learning is coping with the pure magnitude of data required at times.

easy.py is a script for training and evaluating a classifier. it does a metatraining for the SVM parameters with grid.py. in grid.py is a parameter "nr_local_worker" which is defining the mumber of threads. you might wish to increase it (check processor load).

Learning and using augmented Bayes classifiers in python

I'm trying to use a forest (or tree) augmented Bayes classifier (Original introduction, Learning) in python (preferably python 3, but python 2 would also be acceptable), first learning it (both structure and parameter learning) and then using it for discrete classification and obtaining probabilities for those features with missing data. (This is why just discrete classification and even good naive classifiers are not very useful for me.)
The way my data comes in, I'd love to use incremental learning from incomplete data, but I haven't even found anything doing both of these in the literature, so anything that does structure and parameter learning and inference at all is a good answer.
There seem to be a few very separate and unmaintained python packages that go roughly in this direction, but I haven't seen anything that is moderately recent (for example, I would expect that using pandas for these calculations would be reasonable, but OpenBayes barely uses numpy), and augmented classifiers seem completely absent from anything I have seen.
So, where should I look to save me some work implementing a forest augmented Bayes classifier? Is there a good implementation of Pearl's message passing algorithm in a python class, or would that be inappropriate for an augmented Bayes classifier anyway?
Is there a readable object-oriented implementation for learning and inference of TAN Bayes classifiers in some other language, which could be translated to python?
Existing packages I know of, but found inappropriate are
milk, which does support classification, but not with Bayesian classifiers (and I defitinetly need probabilities for the classification and unspecified features)
pebl, which only does structure learning
scikit-learn, which only learns naive Bayes classifiers
OpenBayes, which has only barely changed since somebody ported it from numarray to numpy and documentation is negligible.
libpgm, which claims to support an even different set of things. According to the main documentation, it does inference, structure and parameter learning. Except there do not seem to be any methods for exact inference.
Reverend claims to be a “Bayesian Classifier”, has negligible documentation, and from looking at the source code I am lead to the conclusion that it is mostly a Spam classifier, according to Robinson's and similar methods, and not a Bayesian classifier.
eBay's bayesian Belief Networks allows to build generic Bayesian networks and implements inference on them (both exact and approximate), which means that it can be used to build a TAN, but there is no learning algorithm in there, and the way BNs are built from functions means implementing parameter learning is more difficult than it might be for a hypothetical different implementation.

I'm afraid there is not an out-of-the-box implementation of Random Naive Bayes classifier (not that I am aware of) because it is still academic matters. The following paper present the method to combine RF and NB classifiers (behind a paywall) : http://link.springer.com/chapter/10.1007%2F978-3-540-74469-6_35
I think you should stick with scikit-learn, which is one of the most popular statistical module for Python (along with NLTK) and which is really well documented.
scikit-learn has a Random Forest module : http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees . There is a submodule which may (I insist of the uncertainty) be used to pipeline towards NB classifier :
RandomTreesEmbedding implements an unsupervised transformation of the
data. Using a forest of completely random trees, RandomTreesEmbedding
encodes the data by the indices of the leaves a data point ends up in.
This index is then encoded in a one-of-K manner, leading to a high
dimensional, sparse binary coding. This coding can be computed very
efficiently and can then be used as a basis for other learning tasks.
The size and sparsity of the code can be influenced by choosing the
number of trees and the maximum depth per tree. For each tree in the
ensemble, the coding contains one entry of one. The size of the coding
is at most n_estimators * 2 ** max_depth, the maximum number of leaves
in the forest.
As neighboring data points are more likely to lie within the same leaf
of a tree, the transformation performs an implicit, non-parametric
density estimation.
And of course there is a out-of-core implementation of Naive Bayes classifier, which can be used incrementally : http://scikit-learn.org/stable/modules/naive_bayes.html
Discrete naive Bayes models can be used to tackle large scale text
classification problems for which the full training set might not fit
in memory. To handle this case both MultinomialNB and BernoulliNB
expose a partial_fit method that can be used incrementally as done
with other classifiers as demonstrated in Out-of-core classification
of text documents.

I was similarly confused as to how to do exact inference with libpgm. However, turns out it is possible. For example (from libpgm docs),
import json
from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization
# load nodedata and graphskeleton
nd = NodeData()
skel = GraphSkeleton()
nd.load("../tests/unittestdict.txt")
skel.load("../tests/unittestdict.txt")
# toporder graph skeleton
skel.toporder()
# load evidence
evidence = dict(Letter='weak')
query = dict(Grade='A')
# load bayesian network
bn = DiscreteBayesianNetwork(skel, nd)
# load factorization
fn = TableCPDFactorization(bn)
# calculate probability distribution
result = fn.condprobve(query, evidence)
# output
print json.dumps(result.vals, indent=2)
print json.dumps(result.scope, indent=2)
print json.dumps(result.card, indent=2)
print json.dumps(result.stride, indent=2)
To get the example to work, here is the datafile (I replaced None with null and saved as a .json).
I know this is quite late to the game, but this was the best post I found when searching for a resource to do Bayesian networks with Python. I thought I'd answer in case anyone else is looking for this. (Sorry, would have commented, but just signed up for SO to answer this and rep isn't high enough.)

R's bnlearn has implementations for both Naive Bayes and Tree-augmented Naive Bayes classifiers. You can use rpy2 to port these to Python.
http://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf

There seems to be no such thing yet.
The closest thing currently seems to be eBay's open source implementation bayesian of Belief Networks. It implements inference (two exact ways, and approximate), which means that it can be used to build a TAN. An example (at the moment still an ugly piece of spaghetti code) for that can be found in my open20q repository.
Advantages:
It works.
That is, I now have an implementation of TAN inference, based on bayesian belief network inference.
With Apache 2.0 and 3-clause BSD style licenses respectively, it is legally possible to combine bayesian code and libpgm code to try to get inference and learning to work.
Disadvantages:
There is no learning whatsoever in bayesian. Trying to combine something like libpgm learning with bayesian classes and inference will be a challenge.
Even more so as bayesian assumes that nodes are given by factors which are fixed python functions. Parameter learning requires some wrapping code to enable tweaking the probabilities.
bayesian is written in pure python, using dicts etc. as basic structures, not making use of any speedup numpy, pandas or similar packages might bring, and is therefore quite slow even for the tiny example I build.

I know it's a bit late in the day, but the Octave forge NaN package might be of interest to you. One of the classifiers in this package is an Augmented Naive Bayesian Classifier. The code is GPL'ed so you could easily port it to Python.

scikit-learn OpenMP libsvm

I am using scikit-learn SVC to classify some data. I would like to increase the training performance.
clf = svm.SVC(cache_size=4000, probability=True, verbose=True)
Since sckikit-learn interfaces with libsvm and libsvm uses OpenMp I was hoping that:
export OMP_NUM_THREADS=16
would run on multiple cores.
Unfortunately this did not help.
Any Ideas?
Thanks

There is no OpenMP support in the current binding for libsvm in scikit-learn. However it is very likely that if you have performance issues with sklearn.svm.SVC should you use a more scalable model instead.
If your data is high dimensional it might be linearly separable. In that case it is advised to first try simpler models such as naive bayes models or sklearn.linear_model.Perceptron that are known to be very speedy to train. You can also try sklearn.linear_model.LogisticRegression and sklearn.svm.LinearSVC both implemented using liblinear that is more scalable than libsvm albeit less memory efficients than other linear models in scikit-learn.
If your data is not linearly separable, you can try sklearn.ensemble.ExtraTreesClassifier (adjust the n_estimators parameter to trade-off training speed vs. predictive accuracy).
Alternatively you can try to approximate a RBF kernel using the RBFSampler transformer of scikit-learn + fitting a linear model on the output:
http://scikit-learn.org/dev/modules/kernel_approximation.html

If you are using cross validation or grid search in scikit-learn then you can use multiple CPUs with the n_jobs parameter:
GridSearchCV(..., n_jobs=-1)
cross_val_score(..., n_jobs=-1)
Note that cross_val_score only needs a job per forld so if your number of folds is less than your CPUs you still won't be using all of your processing power.
LibSVM can use OpenMP if you can compile it and use it directly as per these instructions in the LibSVM FAQ. So you could export your scaled data in LibSVM format (here's a StackOverflow question on how to do that) and use LibSVM directly to train your data. But that will only be of benefit if you're grid searching or wanting to know accuracy scores, as far as I know the model LibSVM creates cannot be used in scikit-learn.
There is also a GPU accelerated version of LibSVM which I have tried and is extremely fast, but is not based on the current LibSVM version. I have talked to the developers and they say they hope to release a new version soon.

Although this thread is a year+ old, I thought it is worth answering.
I wrote a patch for openmp support on scikit-learn for both libsvm and liblinear (linearSVC) that's available here - https://github.com/fidlr/sklearn-openmp.
It is based on libsvm's FAQ on how to add OpenMP support, and the multi-core implementation of liblinear.
Just clone the repo and run sklearn-build-openmp.sh to apply the patch and build it.
Timing OMP_NUM_THREADS=4 python plot_permutation_test_for_classification.py:
svmlib with linear kernel timinig dropped by a factor of 2.3
RBF kernel - same.
Liblinear with 4 thread dropped by x1.6
Details about and usage information can be found here -
http://fidlr.org/post/137303264732/scikit-learn-017-with-libsvm-openmp-support

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.