I'm trying to use SVM from sklearn for a classification problem. I got a highly sparse dataset with more than 50K rows and binary outputs.
The problem is I don't know quite well how to efficiently choose the parameters, mainly the kernel, gamma anc C.
For the kernels for example, am I supposed to try all kernels and just keep the one that gives me the most satisfying results or is there something related to our data that we can see in the first place before choosing the kernel ?
Same goes for C and gamma.
Thanks !
Yes, this is mostly a matter of experimentation -- especially as you've told us very little about your data set: separability, linearity, density, connectivity, ... all the characteristics that affect classification algorithms.
Try the linear and Gaussian kernels for starters. If linear doesn't work well and Gaussian does, then try the other kernels.
Once you've found the best 1 or 2 kernels, then play with the cost and gamma parameters. Gamma is a "slack" parameter: it gives the kernel permission to make a certain proportion of raw classification errors as a trade-off for other benefits: width of the gap, simplicity of the partition function, etc.
I haven't yet had an application that got more than trivial benefit from altering the cost.
Related
There seems to be many techniques for reducing dimensionality (pca, svd etc) in order to escape the curse of dimensionality. But how do you know that your dataset in fact suffers from high-dimensionality problems? Is there a best practice, like visualizations or can one even use KNN to find out?
I have a dataset with 99 features and 1 continuous label (price) and 30 000 instances.
The curse of dimensionality refers to a problem that dictates the relation between feature dimension and your data size. It has been pointed out that as your feature size/dimension grows, the amount of data in order to successfully model your problem will also grow exponentially.
The problem actually arises when there is exponential growth in your data. Because you have to think of how to handle it properly ( storage/ computation power needed).
So we usually experiment to figure out the right size of the dimension we need for our problem (maybe using cross-validation) and only select those features. Also, keep in mind that using lots of features comes with a high risk of overfitting.
You can use either Feature selection or feature extraction for dimension reduction.LASSO can be used for feature selection or PCA, LDA for feature extraction.
Scikit-learn allows sample weights to be provided to linear, logistic, and ridge regressions (among others), but not to elastic net or lasso regressions. By sample weights, I mean each element of the input to fit on (and the corresponding output) is of varying importance, and should have an effect on the estimated coefficients proportional to its weight.
Is there a way I can manipulate my data before passing it to ElasticNet.fit() to incorporate my sample weights?
If not, is there a fundamental reason it is not possible?
Thanks!
You can read some discussion about this in sklearn's issue-tracker.
It basically reads like:
not that hard to do (theory-wise)
pain keeping all the basic sklearn'APIs and supporting all possible cases (dense vs. sparse)
As you can see in this thread and the linked one about adaptive lasso, there is not much activity there (probably because not many people care and the related paper is not popular enough; but that's only a guess).
Depending on your exact task (size? sparseness?), you could build your own optimizer quite easily based on scipy.optimize, supporting this kind of sample-weights (which will be a bit slower, but robust and precise)!
Use case:
I have a small dataset with about 3-10 samples in each class. I am using sklearn SVC to classify those with rbf kernel.
I need the confidence of the prediction along with the predicted class. I used predict_proba method of SVC.
I was getting weird results with that. I searched a bit and found out that it makes sense only for larger datasets.
Found this question on stack Scikit-learn predict_proba gives wrong answers.
The author of the question verified this by multiplying the dataset, thereby duplicating the dataset.
My questions:
1) If I multiply my dataset by lets say 100, having each sample 100 times, it increases the "correctness" of "predict_proba". What sideeffects will it have? Overfitting?
2) Is there any other way I can calculate the confidence of the classifier? Like distance from the hyperplanes?
3) For this small sample size, is SVM a recommended algorithm or should I choose something else?
First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.
SVM's are mainly popular in high dimensional settings. It is currently unclear whether that applies to your project. They build planes on a handful of (or even single) supporting instances, and are often outperformed in situation with large trainingsets by Neural Nets. A priori they might not be your worse choice.
Oversampling your data will do little for an approach using SVM. SVM is based on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vector (I am assuming you are already using the train set as test set).
Plain oversampling in this scenario will also not give you any new information on confidence, other than artififacts constructed by unbalanced oversampling, since the instances will be exact copies and no distibution changes will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique). You will basically generate synthetic instances based of the ones you have. In theory this will provide you with new instances, that won't be exact copies of the ones you have, and might thusly fall a little out of the normal classification. Note: By definition all these examples will lie in between the original examples in your sample space. This will not mean that they will lie in between your projected SVM-space, possibly learning effects that aren't really true.
Lastly, you can estimate confidence with the distance to the hyperplane. Please see: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
I am attempting to curve fit the training error of a neural network as a function of the number of training iterations. An example is shown in red in the image below. Here I've trained for 3000 iterations. What I'm interested in is whether I can find a function that I can fit on the first 1000 (or so) iterations to extrapolate out to 3000 iterations with some reasonable accuracy.
However, I don't know what functional form would be best for me to use. At first I tried an exponential of the form f(x)=A+Bexp(-Cx), which is shown in blue. Obviously this doesn't work too well. The exponential dies off way too fast and then basically just becomes the constant term.
Perhaps it's just difficult, since the beginning of the training shows a very sharp drop off of the error but then transitions to something much more gradual for higher iterations. But maybe someone with experience in neural network training and/or experience in fitting unknown functions might have some ideas. I've been trying various exponential forms and polynomial fits within scipy/numpy but with no success. I've varied the number of iterations used in the fit as well (including throwing out the small iteration numbers).
Any thoughts?
I think exponential fitting may work. In your f(x)=A+B*exp(-C*x), I choose A = 0.005, B = 0.045, and C = 1/250, I will get,
It's just about the parameter tuning. Yet I am trying to understand the motivation that you want to fit the learning curve. I think the interpolation method includes the 'extrapolation' option that you can used to predict the error after more epochs. If you want to precisely learn the curve, you can use another neural network with linear hidden layer and output to 'learn' the curve again, though I didn't try whether it works.
Check out this page: http://www.astroml.org/sklearn_tutorial/practical.html
The useful thing for your situation it describes is diagnosing whether your algorithms appear to be high bias or high variance on your data set, and offering specific directions to go in for either case.
I'm using Libsvm in a 5x2 cross validation to classify a very huge amount of data, that is, I have 47k samples for training and 47k samples for testing in 10 different configurations.
I usually use the Libsvm's script easy.py to classify the data, but it's taking so long, I've been waiting for results for more than 3 hours and nothing, and I still have to repeat this procedure more 9 times!
does anybody know how to use the libsvm faster with a very huge amount of data? does the C++ Libsvm functions work faster than the python functions?
LibSVM's training algorithm doesn't scale up to this kind of datasets; it takes O(n³) time in the worst case and around O(n²) on typical ones. The first thing to try is scaling your datasets properly; if it still doesn't work, switch to
a linear SVM (or logistic regression) using, e.g., Liblinear, Vowpal Wabbit or Wapiti
a more scalable kernel SVM implementation such as LaSVM
As larsmans mentioned, libsvm may not scale all that well depending on the dimensionality of the data and the number of data points.
The C implementation may run a bit faster, but it won't be a significant difference. You have a few options available to you.
You could randomly sample your data to work on a small subset of it.
You could project your data into a lower dimension with something like PCA
Depending on your data type, you can look into different kernels. Would a histogram intersection kernel work out for your data? Are you using an RBF kernel when you really just need a linear decision function?
Hope this helps! One of the toughest problems in machine learning is coping with the pure magnitude of data required at times.
easy.py is a script for training and evaluating a classifier. it does a metatraining for the SVM parameters with grid.py. in grid.py is a parameter "nr_local_worker" which is defining the mumber of threads. you might wish to increase it (check processor load).