I'm facing the following problem: I'm running an SVR from the scikit-learn library on a training set with about 46,500 observations, and it has been running for more than six hours so far.
I'm using the linear kernel.
from sklearn.svm import SVR

def build_linear(self):
    model = SVR(kernel='linear', C=1)
    return model
I already tried changing the "C" value between 1e-3 and 1000, but nothing changes.
The poly kernel runs in about 5 minutes, but I need the values from the linear kernel for an evaluation and can't skip this part...
Does anyone have an idea how to speed this up?
Thanks a lot!
SVMs are known to scale badly with the number of samples!
Instead of SVR with a linear kernel, use LinearSVR, or for huge data, SGDRegressor.
LinearSVR is more restricted in terms of what it can compute (no non-linear kernels), and more restricted algorithms usually have more assumptions, which they exploit to speed things up (or save memory).
SVR is based on libsvm, while LinearSVR is based on liblinear. Both are well-tested high-quality implementations.
(It might be valuable to add: in general, don't waste 6 hours waiting on cases like these. Sub-sample your data and try smaller, then less small, examples and extrapolate the runtime or spot problems from that. Edit: it seems you did that already, good!)
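A minimal sketch of what that switch could look like (hedged: X_train and y_train are placeholders for your own ~46,500-sample arrays, and LinearSVR's defaults differ from SVR's, so treat the parameters as a starting point):

from sklearn.svm import LinearSVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Scaling usually helps the liblinear-based solver converge faster
model = make_pipeline(StandardScaler(), LinearSVR(C=1.0, epsilon=0.0))
model.fit(X_train, y_train)           # X_train, y_train: your training data
predictions = model.predict(X_train)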
I'm using H2O's Random Forest Regression model in python. While evaluating its performance in terms of speed, I've compared it to scikit-learn's RandomForestRegressor.
The dataset consists of ~20,000 rows and 20 columns of data.
My machine runs Windows with Python 3.6 on it.
import time
import h2o
import pandas as pd
from h2o.estimators.random_forest import H2ORandomForestEstimator
from sklearn.ensemble import RandomForestRegressor
def timing(f):
    def wrap(*args):
        time1 = time.time()
        ret = f(*args)
        time2 = time.time()
        print('%s function took %0.3f ms' % (f.__name__, (time2 - time1) * 1000.0))
        return ret
    return wrap

@timing
def predict_row(model, row):
    return model.predict(row)
# read data
h2o.init(max_mem_size = "8G")
h2o.remove_all()
h2o_df = h2o.import_file(csv_path)
train, valid, test = h2o_df.split_frame([0.6, 0.2], seed=1234)
X = h2o_df.col_names[:-1]
y = h2o_df.col_names[-1]
# random forest at H2O
rf_h2o = H2ORandomForestEstimator(model_id="rf_h2o_v1", ntrees=100, stopping_rounds=2, score_each_iteration=True, seed=1000000,
col_sample_rate_per_tree=0.3)
rf_h2o.train(X, y, training_frame=train, validation_frame=valid)
# random forest at sklearn
train_df = train[:-1].as_data_frame(use_pandas=True)
y_df = h2o_df[:,-1].as_data_frame(use_pandas=True)
y_df = y_df[y_df.index.isin(train_df.index)]
rf_sklearn = RandomForestRegressor(n_estimators=100, min_samples_leaf=3, oob_score=True, max_features=0.25)
rf_sklearn.fit(train_df, y_df)
# prediction comparison, let's take row #5 from the test set for example
row_h20 = test[5,:-1]
row_array = test[5,:-1].as_data_frame(use_pandas=True).values
rf_sklearn.predict(row_array)
# run 20 times and test performance on all iterations
for i in range(20):
    predict_row(rf_sklearn, row_array)
    predict_row(rf_h2o, row_h20)
When I run this code, I find that the scikit-learn random forest is 4-5x faster than H2O's model (6-7 milliseconds vs. ~30 milliseconds on the fastest iterations in both cases).
My questions are:
Is that really the case? Can anyone verify that I'm not doing anything wrong?
I know H2O is built upon JVM, which is not supposed to be lightning fast, but I thought they would do at least as good as scikit-learn.
I need a super fast implementation of a predict in a regression algorithm. Does anyone have any suggestions? scikit-learn / H2O are not mandatory for me, but I need something reliable.
H2O uses data parallelism to achieve scale and efficiency, and the number of rows here is very small, so the number of parallel threads doing work is probably low.
H2O compares favorably speed-and-scale wise when there are more rows (millions +).
You can see how many threads are working by using the built-in Water Meter (on Linux), or "top" or another performance monitoring tool. And if you open the H2O Flow Web UI, after parsing the data, you can see a detailed Frame Distribution Summary breakdown (see below).
For this example, the dataset I ingested has 43,978 rows and 31 columns. You can see the number of chunks per column is 1. This means my data parallelism level is only 1, and only 1 thread will be working. H2O was designed to be efficient on larger data sets.
There are a few things that I'd consider wrong with the benchmark:
In H2O, you have col_sample_rate_per_tree=0.3, and in sklearn, you have max_features=0.25. You should set col_sample_rate in H2O to the same thing as max_features in sklearn. Using fewer features in sklearn will cause sklearn to train faster.
In sklearn, you have min_samples_leaf=3, but min_rows (the H2O equivalent) defaults to 1. These should be set to the same value. They both default to 1, so you should probably just leave them both at the default value. Setting min_samples_leaf=3 will force sklearn to stop building the tree sooner (so training will be faster).
Do not set stopping_rounds=2 in H2O; sklearn's RandomForestRegressor has no early stopping, so this makes the two runs do different amounts of work.
There are a number of other parameters that have different default values in sklearn vs H2O.
I'd recommend running the benchmark with and without scoring on each iteration to see how that affects performance (referring to score_each_iteration in H2O and oob_score in sklearn).
I'm assuming that your data contains only numeric features, but if you use a dataset that contains categorical features, sklearn will require you to one-hot encode those features, which could increase your training time quite a bit, depending on the number of levels for each categorical feature.
That said, it's still possible that sklearn can train faster than H2O on small datasets. H2O is designed with scalability in mind, so you'll start seeing the value of H2O (over other tools) when you use a bit more training data. There is a nice plot of how H2O Random Forest compares to other tools as training size increases in this benchmark (screenshot below). This is only on a single dataset, so it can't be generalized to all situations, but it's a good demonstration of how tools like sklearn start to break down as training size increases (sklearn RF runs out of memory after 1M rows in this particular benchmark).
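To make the comparison more apples-to-apples along the lines of the points above, a rough sketch might look like the following. This is illustrative only: it assumes mtries is the DRF per-split analogue of sklearn's max_features, and the specific values are placeholders rather than a recommended configuration.

# Illustrative, more closely aligned settings (not a definitive mapping)
n_features = len(X)
rf_h2o = H2ORandomForestEstimator(model_id="rf_h2o_aligned",
                                  ntrees=100,
                                  min_rows=1,                             # sklearn default: min_samples_leaf=1
                                  mtries=max(1, int(0.25 * n_features)),  # roughly max_features=0.25
                                  seed=1000000)                           # no stopping_rounds / score_each_iteration
rf_h2o.train(X, y, training_frame=train, validation_frame=valid)
rf_sklearn = RandomForestRegressor(n_estimators=100,
                                   min_samples_leaf=1,
                                   max_features=0.25,
                                   random_state=1000000)
rf_sklearn.fit(train_df, y_df)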
The 6-7 milliseconds vs. ~30 milliseconds difference is probably due to the overhead of making REST API calls.
For making predictions in production environments, where 23ms really matters, I believe it is recommended practice to use the POJO or MOJO. You are using a random forest, so you should use MOJOs, as they give a notable speed improvement according to https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/MOJO_QuickStart.md
If you are doing tests just to evaluate the speed of H2O, you should also do a comparison on a very large data set, or a more complicated model. E.g. if the difference is then 2.006s vs. 2.030s you can stop worrying about it, and move on to the more important things.
I ran into a similar issue. I had coded up my H2O predict to do one prediction at a time. The performance was disappointing. That's because there's a fair amount of overhead to set things up for H2O, including building the H2O dataframe.
To make things reasonable, I batched up several data rows to be predicted on at once, and now I get good performance.
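As a rough sketch of the batching idea, reusing the frames from the question above (illustrative, not the exact code I used):

# Score a whole batch of rows in one call instead of one row at a time
batch = test[:, :-1]                             # all test rows, feature columns only
preds = rf_h2o.predict(batch)                    # one round trip to the H2O cluster for the whole batch
preds_df = preds.as_data_frame(use_pandas=True)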
The graph that Dr. LeDell posted clearly shows the various, and sometimes large, overheads that the different platforms incur.
The OP needs to set up the benchmark to reflect how they really expect to use the model: whether the per-call overhead is more important (unlikely), or the time per prediction is more important.
I'm currently working on a project estimating a signal using some classification learning algorithms, such as logistic regression and random forest, in scikit-learn.
I'm now using the confusion matrix to estimate the performance of the different algorithms in prediction, and I found there is a common problem for both algorithms. That is, in all cases, although the accuracy of the algorithms seems relatively good (around 90% - 93%), the total number of FN is pretty high compared to TP (TPR < 3%). Does anyone have a clue about why I'm having this kind of issue in my prediction problem? If possible, can you give me some hints on how to solve it?
Thanks in advance for your replies and help.
Updates:
The dataset is extremely imbalanced (8:1), with around 180K observations in total. I have already tested several re-sampling methods, such as OSS and SMOTE (+Tomek or +ENN), but none of them returns good results. In both cases, although the recall goes up from 2.5% to 20%, the precision decreases significantly (from 60% to 20%).
You probably have an imbalanced dataset, where one of your classes has many more examples than your other class.
One solution is to assign a higher cost to misclassifying the class with fewer examples.
This question in Cross Validated covers many approaches to your problem:
https://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning
EDIT:
Given that you are using scikit-learn, you can, as a first approach, set the parameter class_weight to 'balanced' on your logistic regression.
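A minimal sketch of that first approach (X and y stand in for your own feature matrix and labels; note that older scikit-learn versions used class_weight='auto' instead of 'balanced'):

from sklearn.linear_model import LogisticRegression

# class_weight='balanced' reweights classes inversely proportional to their frequency
clf = LogisticRegression(class_weight='balanced')
clf.fit(X, y)   # X, y: your imbalanced training data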
I'm performing a stepwise model selection, progressively dropping variables with a variance inflation factor over a certain threshold.
In order to do this, I'm running OLS many, many times on datasets ranging from a few hundred MB to 10 gigs.
What would be the quickest implementation of OLS for larger datasets? The statsmodels OLS implementation seems to use numpy to invert matrices. Would a gradient descent based method be quicker? Does scikit-learn have an especially quick implementation?
Or maybe an MCMC-based approach using PyMC is quickest...
Update 1: It seems that the scikit-learn implementation of LinearRegression is a wrapper for the scipy implementation.
Update 2: Scipy OLS via scikit-learn LinearRegression is twice as fast as statsmodels OLS in my very limited tests...
The scikit-learn SGDRegressor class is (iirc) the fastest, but would probably be more difficult to tune than a simple LinearRegression.
I would give each of those a try, and see if they meet your needs. I also recommend subsampling your data - if you have many gigs but they are all samples from the same distribution, you can train/tune your model on a few thousand samples (depending on the number of features). This should lead to faster exploration of your model space, without wasting a bunch of time on "repeat/uninteresting" data.
Once you find a few candidate models, then you can try those on the whole dataset.
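A rough sketch of that subsample-then-explore workflow, assuming a NumPy feature matrix X and target y (the names, sample size and SGDRegressor defaults are placeholders that will need tuning):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Subsample a few thousand rows for quick model exploration
rng = np.random.RandomState(0)
idx = rng.choice(X.shape[0], size=5000, replace=False)
X_small, y_small = X[idx], y[idx]

# SGD is sensitive to feature scale, so standardize first
model = make_pipeline(StandardScaler(), SGDRegressor())
model.fit(X_small, y_small)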
Stepwise methods are not a good way to perform model selection, as they are entirely ad hoc and depend highly on which direction you run the stepwise procedure. It's far better to use criterion-based methods, or some other method for generating model probabilities. Perhaps the best approach is to use reversible-jump MCMC, which fits models over the entire model space, and not just the parameter space of a particular model.
PyMC does not implement rjMCMC itself, but it can be implemented. Note also that PyMC 3 makes it really easy to fit regression models using its new glm submodule.
Background / Question
I am trying to create an SVM using scikit-learn. I have a training set (here is the link to it: https://dl.dropboxusercontent.com/u/9876125/training_patients.txt) which I load and then use to train the SVM. The training set is 3600 lines long. When I use all 3600 tuples the SVM never finishes training.... BUT when I only use the first 3594 tuples it finishes training in under a minute. I've tried using a variety of different sized training sets and the same thing continues to happen... depending on how many tuples I use, the SVM either trains very quickly or never completes. This has led me to the conclusion that the SVM is having difficulty converging on an answer depending on the data.
Is my assumption about this being a convergence problem correct? If so, what is the solution? If not, what other problem could it be?
Code
import pylab as pl # #UnresolvedImport
from sklearn.datasets import load_svmlight_file
print(__doc__)
import numpy as np
from sklearn import svm, datasets
print "loading training setn"
X_train, y_train = load_svmlight_file("training_patients.txt")
h = .02 # step size in the mesh
C = 1.0 # SVM regularization parameter
print "creating svmn"
poly_svc = svm.SVC(kernel='poly', cache_size=600, degree=40, C=C).fit(X_train, y_train)
print "all done"
The optimization algorithm behind SVM has cubic (O(n^3)) complexity, assuming a relatively high cost (C) and a high-dimensional feature space (a polynomial kernel with d=40 implies a ~1600 dimensional feature space). I would not call this "problems with convergence"; for over 3000 samples it can take a while to train such a model, and that is normal. The fact that you achieve much faster convergence for some subsets is the effect of a very rich feature projection (the same can happen with the RBF kernel) - it is a common phenomenon, true even for very simple data from the UCI library. As mentioned in the comments, setting verbose=True may give you additional information about your optimization process - it will output the number of iterations and the number of support vectors (the higher the number of SVs, the more the SVM is overfitting, which can also be a reason for slow convergence).
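For reference, a minimal way to get that extra output from the setup in the question (same data loading as above):

# verbose=True makes libsvm print per-iteration progress and the number of support vectors
poly_svc = svm.SVC(kernel='poly', cache_size=600, degree=40, C=C, verbose=True)
poly_svc.fit(X_train, y_train)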
I would also add to #lejlot's answer that standardizing the input variables (centering and scaling to unit variance or rescaling to some range such as [0, 1] or [-1, 1]) can make the optimization problem much easier and speed up the convergence as well.
By having a look at your data, it seems that some features have min and max values significantly larger than others. Maybe the MinMaxScaler can help. Have a look at the preprocessing doc in general.
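A small sketch of that preprocessing step applied to the code from the question; note that load_svmlight_file returns a sparse matrix, and this sketch assumes it is small enough to densify for MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_dense = X_train.toarray()                     # densify the sparse matrix from load_svmlight_file
X_train_scaled = scaler.fit_transform(X_train_dense)
poly_svc = svm.SVC(kernel='poly', cache_size=600, degree=40, C=C).fit(X_train_scaled, y_train)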
I'm using LibSVM in a 5x2 cross-validation to classify a very large amount of data; that is, I have 47k samples for training and 47k samples for testing in 10 different configurations.
I usually use LibSVM's script easy.py to classify the data, but it's taking so long: I've been waiting for results for more than 3 hours and nothing, and I still have to repeat this procedure 9 more times!
Does anybody know how to use LibSVM faster with a very large amount of data? Do the C++ LibSVM functions work faster than the Python functions?
LibSVM's training algorithm doesn't scale up to datasets of this size; it takes O(n³) time in the worst case and around O(n²) on typical ones. The first thing to try is scaling your datasets properly; if it still doesn't work, switch to
a linear SVM (or logistic regression) using, e.g., Liblinear, Vowpal Wabbit or Wapiti (see the sketch after this list)
a more scalable kernel SVM implementation such as LaSVM
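As one concrete option for the first bullet, here is a hedged sketch using scikit-learn's LinearSVC, which wraps Liblinear (the file name is a placeholder for your own data):

from sklearn.datasets import load_svmlight_file
from sklearn.svm import LinearSVC

# Liblinear-backed linear SVM: training scales roughly linearly with the number of samples
X_train, y_train = load_svmlight_file("train.libsvm")   # placeholder path
clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)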
As larsmans mentioned, libsvm may not scale all that well depending on the dimensionality of the data and the number of data points.
The C implementation may run a bit faster, but it won't be a significant difference. You have a few options available to you.
You could randomly sample your data to work on a small subset of it.
You could project your data into a lower dimension with something like PCA (see the sketch after this list).
Depending on your data type, you can look into different kernels. Would a histogram intersection kernel work out for your data? Are you using an RBF kernel when you really just need a linear decision function?
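For the PCA route, a hedged scikit-learn sketch (it assumes the data is already loaded as a dense array X with labels y, and the number of components is an arbitrary placeholder to tune):

from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Reduce dimensionality before handing the data to the kernel SVM
pca = PCA(n_components=50)         # placeholder; tune to retain enough variance
X_reduced = pca.fit_transform(X)   # X: dense (n_samples, n_features) array
clf = SVC(kernel='rbf', C=1.0).fit(X_reduced, y)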
Hope this helps! One of the toughest problems in machine learning is coping with the sheer magnitude of data required at times.
easy.py is a script for training and evaluating a classifier. It does a meta-training of the SVM parameters with grid.py. grid.py has a parameter "nr_local_worker" which defines the number of threads. You might wish to increase it (check processor load).