Large training and testing data in libsvm - python

I'm using libsvm in a 5x2 cross-validation to classify a very large amount of data: 47k samples for training and 47k samples for testing, in 10 different configurations.
I usually use libsvm's easy.py script to classify the data, but it's taking too long; I've been waiting for results for more than 3 hours with nothing to show, and I still have to repeat this procedure 9 more times!
Does anybody know how to use libsvm faster with a very large amount of data? Do the C++ libsvm functions work faster than the Python ones?

LibSVM's training algorithm doesn't scale to datasets of this size; it takes O(n³) time in the worst case and around O(n²) on typical ones. The first thing to try is scaling your datasets properly; if that still doesn't work, switch to
a linear SVM (or logistic regression) using, e.g., Liblinear, Vowpal Wabbit or Wapiti
a more scalable kernel SVM implementation such as LaSVM
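As a rough sketch of the first option (assuming X_train, y_train, X_test, y_test are already loaded as numpy arrays; scikit-learn's LinearSVC wraps Liblinear):

from sklearn.svm import LinearSVC

# Liblinear-backed linear SVM: training time grows roughly linearly
# with the number of samples instead of quadratically.
clf = LinearSVC(C=1.0)        # C=1.0 is an illustrative default
clf.fit(X_train, y_train)     # X_train: (47000, n_features), assumed loaded
accuracy = clf.score(X_test, y_test)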

As larsmans mentioned, libsvm may not scale all that well depending on the dimensionality of the data and the number of data points.
The C implementation may run a bit faster, but it won't be a significant difference. You have a few options available to you.
You could randomly sample your data to work on a small subset of it.
You could project your data into a lower dimension with something like PCA.
Depending on your data type, you can look into different kernels. Would a histogram intersection kernel work out for your data? Are you using an RBF kernel when you really just need a linear decision function?
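A minimal sketch of the first two options, assuming a dense feature matrix X and labels y as numpy arrays (sizes are illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
idx = rng.choice(len(X), size=10000, replace=False)   # random 10k subsample
X_sub, y_sub = X[idx], y[idx]

pca = PCA(n_components=50)        # project onto 50 dimensions before training
X_low = pca.fit_transform(X_sub)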
Hope this helps! One of the toughest problems in machine learning is coping with the sheer magnitude of data required at times.

easy.py is a script for training and evaluating a classifier. It does meta-training of the SVM parameters with grid.py. In grid.py there is a parameter "nr_local_worker" which defines the number of threads. You might wish to increase it (check processor load).
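For example, you would edit the variable inside the script itself (the exact location varies by libsvm version; 1 is the usual default):

# in libsvm's tools/grid.py
nr_local_worker = 4   # raise toward your CPU core count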

Related

Scikit-learn SVR runs very long

I'm facing the following problem: I'm running an SVR from the scikit-learn library on a training set with about 46,500 observations, and it has been running for more than six hours so far.
I'm using the linear kernel.
from sklearn.svm import SVR

def build_linear(self):
    model = SVR(kernel='linear', C=1)
    return model
I already tried changing the C value between 1e-3 and 1000; nothing changes.
The poly kernel runs in about 5 minutes, but I need the values from the linear kernel for an evaluation and can't skip this part...
Does anyone have an idea how to speed this up?
Thanks a lot!
SVMs are known to scale badly with the number of samples!
Instead of SVR with a linear kernel, use LinearSVR, or for huge data: SGDRegressor.
LinearSVR is more restricted in terms of what it can compute (no non-linear kernels), and more restricted algorithms usually have more assumptions and use them to speed things up (or save memory).
SVR is based on libsvm, while LinearSVR is based on liblinear. Both are well-tested, high-quality implementations.
(It might be valuable to add: don't waste time in general cases like these waiting 6 hours. Sub-sample your data and try smaller, then less small, ... examples and deduce the runtime or problems from that. Edit: it seems you did that already, good!)
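A minimal sketch of that swap (parameter values are illustrative, not tuned):

from sklearn.svm import LinearSVR
from sklearn.linear_model import SGDRegressor

# Liblinear-based linear SVR: scales far better with n_samples
model = LinearSVR(C=1.0)

# For very large data: a linear SVR approximated by stochastic gradient
# descent, using the epsilon-insensitive loss.
sgd = SGDRegressor(loss='epsilon_insensitive')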

Elastic net regression or lasso regression with weighted samples (sklearn)

Scikit-learn allows sample weights to be provided to linear, logistic, and ridge regressions (among others), but not to elastic net or lasso regressions. By sample weights, I mean each element of the input to fit on (and the corresponding output) is of varying importance, and should have an effect on the estimated coefficients proportional to its weight.
Is there a way I can manipulate my data before passing it to ElasticNet.fit() to incorporate my sample weights?
If not, is there a fundamental reason it is not possible?
Thanks!
You can read some discussion about this in sklearn's issue-tracker.
It basically reads like:
not that hard to do (theory-wise)
a pain to keep all of sklearn's basic APIs intact and support all possible cases (dense vs. sparse)
As you can see in this thread and the linked one about adaptive lasso, there is not much activity there (probably because not many people care and the related paper is not popular enough; but that's only a guess).
Depending on your exact task (size? sparseness?), you could quite easily build your own optimizer based on scipy.optimize that supports this kind of sample weights (it will be a bit slower, but robust and precise)!
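A rough sketch of that do-it-yourself route (the objective below mirrors sklearn's ElasticNet parametrization with a per-sample weight added to the data term; the derivative-free Powell method is used only because the L1 term is non-smooth, which is fine for small feature counts, while a proximal solver would be the robust choice at scale):

import numpy as np
from scipy.optimize import minimize

def weighted_elastic_net(X, y, sample_weight, alpha=1.0, l1_ratio=0.5):
    n, d = X.shape
    w = np.asarray(sample_weight, dtype=float)

    def objective(beta):
        resid = y - X @ beta
        data = np.sum(w * resid ** 2) / (2 * np.sum(w))   # weighted fit term
        l1 = alpha * l1_ratio * np.sum(np.abs(beta))
        l2 = 0.5 * alpha * (1 - l1_ratio) * np.sum(beta ** 2)
        return data + l1 + l2

    return minimize(objective, np.zeros(d), method='Powell').x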

How to choose parameters for svm in sklearn

I'm trying to use SVM from sklearn for a classification problem. I have a highly sparse dataset with more than 50K rows and binary outputs.
The problem is that I don't quite know how to efficiently choose the parameters, mainly the kernel, gamma and C.
For the kernels, for example, am I supposed to try all kernels and just keep the one that gives me the most satisfying results, or is there something about our data that we can look at first before choosing the kernel?
Same goes for C and gamma.
Thanks!
Yes, this is mostly a matter of experimentation -- especially as you've told us very little about your data set: separability, linearity, density, connectivity, ... all the characteristics that affect classification algorithms.
Try the linear and Gaussian kernels for starters. If linear doesn't work well and Gaussian does, then try the other kernels.
Once you've found the best 1 or 2 kernels, play with the cost and gamma parameters. Cost (C) is the "slack" parameter: it gives the classifier permission to make a certain proportion of raw classification errors as a trade-off for other benefits: width of the gap, simplicity of the partition function, etc. Gamma, for the Gaussian (RBF) kernel, controls how far the influence of a single training example reaches: low values give a smoother, simpler decision surface, while high values let the boundary bend around individual points.
I haven't yet had an application that got more than trivial benefit from altering the cost.
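In practice this experimentation is usually wrapped in a grid search; a minimal sketch (grid values are illustrative log-scale starting points, and X_train/y_train are assumed):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100],
     'gamma': [1e-3, 1e-2, 1e-1, 1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)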

Handling K-means with large dataset 6gb with scikit-learn?

I am using scikit-learn. I want to cluster a 6 GB dataset of documents and find clusters of documents.
I only have about 4 GB of RAM though. Is there a way to get k-means to handle large datasets in scikit-learn?
Thank you. Please let me know if you have any questions.
Use MiniBatchKMeans together with HashingVectorizer; that way, you can learn a cluster model in a single pass over the data, assigning cluster labels as you go or in a second pass. There's an example script that demonstrates MBKM.
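A minimal sketch of that single-pass scheme (iter_document_batches is a hypothetical generator yielding lists of raw text strings, so the 6 GB corpus never sits in RAM at once; the cluster count is a guess):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

vectorizer = HashingVectorizer(n_features=2 ** 18)   # stateless: no fit() pass
km = MiniBatchKMeans(n_clusters=20)

for batch in iter_document_batches():      # hypothetical streaming source
    km.partial_fit(vectorizer.transform(batch))

# optional second pass to assign labels:
# labels = [km.predict(vectorizer.transform(b)) for b in iter_document_batches()]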
Clustering is not in itself that well-defined a problem (a 'good' clustering result depends on your application), and the k-means algorithm only gives locally optimal solutions based on random initialization. Therefore I doubt that the results you would get from clustering a random 2 GB subsample of the dataset would be qualitatively different from the results you would get clustering the entire 6 GB. I would certainly try clustering on the reduced dataset as a first port of call. Further options are to subsample more intelligently, or to do multiple training runs with different subsets and do some kind of selection/averaging across the runs.

Multiprocessing scikit-learn

I got LinearSVC working against a training set and test set using the load_file method, and I am trying to get it working in a multiprocessor environment.
How can I get multiprocessing to work with LinearSVC().fit() and LinearSVC().predict()? I am not really familiar with scikit-learn's datatypes yet.
I am also thinking about splitting the samples into multiple arrays, but I am not familiar with numpy arrays and scikit-learn data structures.
That way it would be easier to put into multiprocessing.Pool(): split the samples into chunks, train them, and combine the trained sets back later. Would it work?
EDIT:
Here is my scenario:
Let's say we have 1 million files in the training sample set. When we want to distribute the processing of TfidfVectorizer across several processors, we have to split those samples (in my case there will only be two categories, so let's say 500,000 samples each to train). My server has 24 cores and 48 GB of RAM, so I want to split each topic into 1000000 / 24 chunks and process TfidfVectorizer on them. I would do the same for the testing sample set, as well as for SVC.fit() and decide(). Does it make sense?
Thanks.
PS: Please do not close this.
I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.
For the multiprocessing: You can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them to the estimators, do partial fit again.
Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.
How many classes does your data have, by the way? For each class, a separate classifier will be trained (automatically). If you have nearly as many classes as cores, it might be better and much easier to just do one class per core, by specifying n_jobs in SGDClassifier.
For linear models (LinearSVC, SGDClassifier, Perceptron...) you can chunk your data, train independent models on each chunk and build an aggregate linear model (e.g. SGDClassifier) by storing in it the average values of coef_ and intercept_ as attributes. The predict methods of LinearSVC, SGDClassifier and Perceptron compute the same function (a linear prediction using a dot product against an intercept_ threshold, with one-vs-all multiclass support), so the specific model class you use to hold the averaged coefficients is not important.
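A rough sketch of that chunk-and-average scheme (chunks is a hypothetical list of (X, y) pairs, one per core, and classes is the sorted array of all labels; sklearn estimators pickle cleanly, so multiprocessing.Pool can return them):

import numpy as np
from multiprocessing import Pool
from sklearn.linear_model import SGDClassifier

def train_chunk(chunk):
    X, y = chunk
    return SGDClassifier(loss='hinge').fit(X, y)

def averaged_model(chunks, classes):
    with Pool() as pool:
        models = pool.map(train_chunk, chunks)
    # Any linear model class can hold the averaged parameters,
    # since predict() only uses coef_, intercept_ and classes_.
    agg = SGDClassifier(loss='hinge')
    agg.coef_ = np.mean([m.coef_ for m in models], axis=0)
    agg.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    agg.classes_ = np.asarray(classes)
    return agg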
However, as previously said, the tricky point is parallelizing the feature extraction, and current scikit-learn (version 0.12) does not provide any way to do this easily.
Edit: scikit-learn 0.13+ now has a hashing vectorizer that is stateless.
