I'm using H2O's Random Forest Regression model in python. While evaluating its performance in terms of speed, I've compared it to scikit-learn's RandomForestRegressor.
The dataset consists of ~20,000 rows and 20 columns of data.
My machine runs windows with python 3.6 on it.
import time
import h2o
import pandas as pd
from h2o.estimators.random_forest import H2ORandomForestEstimator
from sklearn.ensemble import RandomForestRegressor

def timing(f):
    def wrap(*args):
        time1 = time.time()
        ret = f(*args)
        time2 = time.time()
        print('%s function took %0.3f ms' % (f.__name__, (time2 - time1) * 1000.0))
        return ret
    return wrap

@timing
def predict_row(model, row):
    return model.predict(row)
# read data
h2o.init(max_mem_size="8G")
h2o.remove_all()
h2o_df = h2o.import_file(csv_path)
train, valid, test = h2o_df.split_frame([0.6, 0.2], seed=1234)
X = h2o_df.col_names[:-1]
y = h2o_df.col_names[-1]

# random forest at H2O
rf_h2o = H2ORandomForestEstimator(model_id="rf_h2o_v1", ntrees=100, stopping_rounds=2,
                                  score_each_iteration=True, seed=1000000,
                                  col_sample_rate_per_tree=0.3)
rf_h2o.train(X, y, training_frame=train, validation_frame=valid)
# random forest at sklearn
train_df = train[:-1].as_data_frame(use_pandas=True)
y_df = h2o_df[:, -1].as_data_frame(use_pandas=True)
y_df = y_df[y_df.index.isin(train_df.index)]
rf_sklearn = RandomForestRegressor(n_estimators=100, min_samples_leaf=3, oob_score=True, max_features=0.25)
rf_sklearn.fit(train_df, y_df)

# prediction comparison, let's take row #5 from the test set for example
row_h20 = test[5, :-1]
row_array = test[5, :-1].as_data_frame(use_pandas=True).values
rf_sklearn.predict(row_array)

# run 20 times and test performance on all iterations
for i in range(20):
    predict_row(rf_sklearn, row_array)
    predict_row(rf_h2o, row_h20)
When I ran this code, I found that scikit-learn's random forest is 4-5 times faster than H2O's model (6-7 milliseconds vs. ~30 milliseconds on the fastest iterations in both cases).
My questions are:
Is that really the case? Can anyone verify that I'm not doing anything wrong?
I know H2O is built on the JVM, which is not supposed to be lightning fast, but I thought it would do at least as well as scikit-learn.
I need a very fast predict implementation for a regression algorithm. Does anyone have any suggestions? scikit-learn / H2O are not mandatory for me, but I need something reliable.
H2O uses data parallelism to achieve scale and efficiency, and the number of rows here is very small, so the number of parallel threads doing work is probably low.
H2O compares favorably speed-and-scale wise when there are more rows (millions +).
You can see how many threads are working by using the built-in Water Meter (on Linux), or "top" or another performance monitoring tool. And if you open the H2O Flow Web UI, after parsing the data, you can see a detailed Frame Distribution Summary breakdown (see below).
For this example, the dataset I ingested has 43,978 rows and 31 columns. You can see the number of chunks per column is 1. This means my data parallelism level is only 1, and only 1 thread will be working. H2O was designed to be efficient on larger data sets.
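As a quick Python-side check of the available parallelism (in addition to the Water Meter, top, or Flow), here is a hedged sketch using the h2o-py client; h2o.cluster().show_status() and the nthreads argument exist in recent h2o releases, but treat this as an illustration rather than tuning advice:

    import h2o

    # Ask the local H2O instance for all available cores; -1 means "use everything".
    h2o.init(nthreads=-1, max_mem_size="8G")

    # Prints cluster details, including the number of allowed cores, so you can see
    # how much parallelism is available before looking at per-frame chunking.
    h2o.cluster().show_status()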
There are a few things that I'd consider wrong with the benchmark:
In H2O, you have col_sample_rate_per_tree=0.3, and in sklearn, you have max_features=0.25. You should set col_sample_rate in H2O to the same value as max_features in sklearn (see the sketch after these points). Using fewer features in sklearn will cause sklearn to train faster.
In sklearn, you have min_samples_leaf=3, but min_rows (the H2O equivalent) defaults to 1. These should be set to the same value. They both default to 1, so you should probably just leave them both at the default value. Setting min_samples_leaf=3 will force sklearn to stop building the tree sooner (so training will be faster).
Do not set stopping_rounds=2 in H2O.
There are a number of other parameters that have different default values in sklearn vs H2O.
I'd recommend running the benchmark with and without scoring on each iteration to see how that affects performance (referring to score_each_iteration in H2O and oob_score in sklearn).
I'm assuming that your data contains only numeric features, but if you use a dataset that contains categorical features, sklearn will require you to one-hot encode those features, which could increase your training time quite a bit, depending on the number of levels for each categorical feature.
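Putting the first few points together, here is a hedged sketch of what more closely aligned settings might look like, reusing the frames from the question (the values are illustrative, not a tuned configuration, and the column-sampling parameters of the two libraries are not exactly equivalent):

    # H2O DRF without early stopping or per-iteration scoring, default min_rows,
    # and a column-sampling fraction matching sklearn's max_features.
    rf_h2o = H2ORandomForestEstimator(model_id="rf_h2o_aligned", ntrees=100,
                                      col_sample_rate_per_tree=0.25,
                                      min_rows=1, seed=1000000)
    rf_h2o.train(X, y, training_frame=train, validation_frame=valid)

    # sklearn counterpart: same number of trees, default leaf size, matching
    # feature-sampling fraction, and no OOB scoring during the timing run.
    rf_sklearn = RandomForestRegressor(n_estimators=100, min_samples_leaf=1,
                                       max_features=0.25, oob_score=False)
    rf_sklearn.fit(train_df, y_df)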
That said, it's still possible that sklearn can train faster than H2O on small datasets. H2O is designed with scalability in mind, so you'll start seeing the value of H2O (over other tools) when you use a bit more training data. There is a nice plot of how H2O Random Forest compares to other tools as training size increases in this benchmark (screenshot below). This is only on a single dataset, so it can't be generalized to all situations, but it's a good demonstration of how tools like sklearn start to break down as training size increases (sklearn RF runs out of memory after 1M rows in this particular benchmark).
The 6-7 milliseconds vs. ~30 milliseconds difference is probably due to the overhead of making REST API calls.
For making predictions in production environments, where 23 ms really matters, I believe the recommended practice is to use the POJO or MOJO. You are using random forest, so you should use MOJOs, as they give a notable speed improvement according to https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/MOJO_QuickStart.md
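A hedged sketch of exporting and scoring a MOJO from Python, reusing rf_h2o and test from the question (download_mojo is part of the h2o-py model API, and h2o.mojo_predict_pandas exists in recent releases; for genuinely low-latency scoring the MOJO is normally embedded in a long-running JVM service via h2o-genmodel):

    # Export the trained DRF as a MOJO, together with the h2o-genmodel.jar needed
    # to score it outside the cluster.
    mojo_path = rf_h2o.download_mojo(path=".", get_genmodel_jar=True)

    # Convenience scoring of a pandas DataFrame against the MOJO. Note that this
    # helper spawns a JVM per call, so it is only for trying the MOJO out, not a
    # latency benchmark in itself.
    row_df = test[5, :-1].as_data_frame(use_pandas=True)
    preds = h2o.mojo_predict_pandas(row_df, mojo_path, genmodel_jar_path="h2o-genmodel.jar")
    print(preds)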
If you are doing tests just to evaluate the speed of H2O, you should also do a comparison on a very large data set, or a more complicated model. E.g. if the difference is then 2.006s vs. 2.030s you can stop worrying about it, and move on to the more important things.
I ran into a similar issue. I had coded up my H2O predict to do one prediction at a time. The performance was disappointing. That's because there's a fair amount of overhead to set things up for H2O, including building the H2O dataframe.
To make things reasonable, I batched up several data rows to be predicted on at once, and now I get good performance.
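A hedged sketch of the batching idea, reusing rf_h2o and test from the question above; the point is only that a single predict call on many rows pays the fixed overhead once:

    # Row-by-row: every call pays the REST/frame-handling overhead separately.
    for i in range(100):
        rf_h2o.predict(test[i, :-1])

    # Batched: one call scores 100 rows at once, so the fixed overhead is paid
    # a single time instead of 100 times.
    batch_preds = rf_h2o.predict(test[0:100, :-1])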
The graph that Dr LeDell posted clearly shows the varying, and sometimes large, overhead that different platforms incur.
The OP needs to set up the benchmark to reflect how they really expect to use the model: is the fixed overhead more important (unlikely), or is the time per prediction more important?
Related
For research purposes, I find myself needing to train an SVM via SGD on a large DS (that is, one with a large number of examples). This makes using scikit-learn's implementation (SGDClassifier) problematic, as it requires loading the entire DS at once.
The algorithm I am familiar with uses n steps of SGD to obtain n different separators w_i and then averages them (specifics can be seen in slide 12 of https://www.cse.huji.ac.il/~shais/Lectures2014/lecture8.pdf).
This made me think that maybe I can use scikit-learn to train multiple such classifiers and then take the average of the resulting linear separators (assume no bias).
Is this a reasonable line of thinking, or does scikit-learn's implementation not fall under my logic?
Edit: I am well aware of the alternatives for training SVM in different ways, but this is for a specific research purpose. I would just like to know if this line of thinking is possible with scikit-learn's implementation, or if you are aware of an alternative that will allow me to train SVM using SGD without loading an entire DS to memory.
SGDClassifier has a partial_fit method, and one of the primary objectives of partial_fit is to scale sklearn models to large datasets. Using it, you can load a part of the dataset into RAM, feed it to SGD, and keep repeating this until the full dataset has been used.
In the code below, I use KFold mainly to imitate loading chunks of the dataset.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

class GD_SVM(BaseEstimator, ClassifierMixin):
    def __init__(self):
        self.sgd = SGDClassifier(loss='hinge', random_state=42, fit_intercept=True, l1_ratio=0, tol=.001)

    def fit(self, X, y):
        # each KFold split stands in for one chunk of data loaded from disk
        cv = KFold(n_splits=10, random_state=42, shuffle=True)
        for _, test_id in cv.split(X, y):
            xt, yt = X[test_id], y[test_id]
            self.sgd = self.sgd.partial_fit(xt, yt, classes=np.unique(y))
        return self

    def predict(self, X):
        return self.sgd.predict(X)
To test this against a regular (linear) SVM:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # for simplicity; a Pipeline is the better choice
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=43)
sgd = GD_SVM()
svm = LinearSVC(loss='hinge', max_iter=1, random_state=42, C=1.0, fit_intercept=True, tol=.001)
r = cross_val_score(sgd, X, y, cv=cv)  # cross_val_score(svm, X, y, cv=cv)
print(r.mean())
This returned 95% accuracy for the GD_SVM above, versus 96% for the LinearSVC. On the Digits dataset, the SVM had 93% accuracy while the GD_SVM had 91%. The performances are broadly similar but not identical, which is expected since the two approaches use quite different optimization procedures; careful hyper-parameter tuning would likely reduce the gap.
Based on the concern of loading all of the data into memory, if you have access to more compute resources, you may want to use PySpark's SVM implementation: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#linear-support-vector-machine, as Spark is built for large-scale data processing. I don't know if averaging the separators from multiple Scikit-Learn models would work as expected; there isn't a clean way to instantiate a new model with new separators, based on the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), so it would probably have to be implemented as an ensemble approach.
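If you go the Spark route, here is a minimal hedged sketch based on the linked documentation (it assumes a SparkSession can be created and uses Spark's bundled sample libsvm file as a stand-in for your data; regParam and maxIter are illustrative):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LinearSVC

    spark = SparkSession.builder.appName("svm-via-sgd-example").getOrCreate()

    # Replace this path with your own data in libsvm format.
    training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    # Spark optimizes the hinge loss in a distributed fashion, so the full dataset
    # never has to fit into a single machine's memory.
    lsvc = LinearSVC(maxIter=10, regParam=0.1)
    model = lsvc.fit(training)
    print(model.coefficients, model.intercept)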
If you insist on using the whole DS for training instead of sampling (by the way, that is what the slides describe) and you do not care about performance, I would train n classifiers, then select only their support vectors and retrain a final version on those support vectors only. This way you effectively dismiss most of the data and concentrate only on the points that are important for the classification.
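A hedged sketch of that idea with scikit-learn's SVC (the function name and chunking scheme are just for illustration; support_ holds the indices of each fitted model's support vectors):

    import numpy as np
    from sklearn.svm import SVC

    def train_on_support_vectors(X, y, n_chunks=10, seed=0):
        # Shuffle so every chunk is likely to contain examples of each class.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        X, y = X[idx], y[idx]
        # Train one SVC per chunk and keep only its support vectors.
        sv_X, sv_y = [], []
        for X_chunk, y_chunk in zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks)):
            clf = SVC(kernel='linear', C=1.0).fit(X_chunk, y_chunk)
            sv_X.append(X_chunk[clf.support_])
            sv_y.append(y_chunk[clf.support_])
        # Retrain a final model on the union of all support vectors, which is
        # usually far smaller than the original dataset.
        return SVC(kernel='linear', C=1.0).fit(np.vstack(sv_X), np.concatenate(sv_y))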
I'm facing the following problem: I'm running an SVR from the scikit-learn library on a training set with about 46,500 observations, and it has been running for more than six hours so far.
I'm using the linear kernel.
def build_linear(self):
    model = SVR(kernel='linear', C=1)
    return model
I already tried changing the "C" value between 1e-3 and 1000; nothing changes.
The poly kernel runs in about 5 minutes, but I need the values for an evaluation and can't skip this part...
Does anyone have an idea how to speed this up?
Thanks a lot!
SVMs are known to scale badly with the number of samples!
Instead of SVR with a linear kernel, use LinearSVR or, for huge data, SGDRegressor.
LinearSVR is more restricted in terms of what it can compute (no non-linear kernels), and more restricted algorithms usually have more assumptions and use them to speed things up (or save memory).
SVR is based on libsvm, while LinearSVR is based on liblinear. Both are well-tested high-quality implementations.
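A hedged sketch of the LinearSVR route (X_train, y_train, and X_test stand for your own arrays, and the parameter values are only illustrative; scaling first helps liblinear converge):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVR

    # liblinear-based LinearSVR trains in roughly linear time in the number of
    # samples, rather than the quadratic-to-cubic behaviour of kernel SVR.
    model = make_pipeline(StandardScaler(), LinearSVR(C=1.0, max_iter=10000))
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)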
(It might be valuable to add: don't waste time in general cases like these waiting 6 hours. Sub-sample your data, try smaller and then progressively larger samples, and extrapolate the runtime or spot problems from that. Edit: it seems you did that already, good!)
Background / Question
I am trying to create an SVM using scikit-learn. I have a training set (here is the link to it: https://dl.dropboxusercontent.com/u/9876125/training_patients.txt) which I load and then use to train the SVM. The training set is 3600 lines long. When I use all 3600 tuples the SVM never finishes training... BUT when I only use the first 3594 tuples it finishes training in under a minute. I've tried using a variety of different sized training sets and the same thing continues to happen: depending on how many tuples I use, the SVM either trains very quickly or it never completes. This has led me to the conclusion that the SVM is having difficulty converging on an answer depending on the data.
Is my assumption about this being a convergence problem correct? If so, what is the solution? If not, what other problem could it be?
Code
import pylab as pl  # @UnresolvedImport
import numpy as np
from sklearn import svm, datasets
from sklearn.datasets import load_svmlight_file

print(__doc__)

print "loading training set\n"
X_train, y_train = load_svmlight_file("training_patients.txt")

h = .02  # step size in the mesh
C = 1.0  # SVM regularization parameter

print "creating svm\n"
poly_svc = svm.SVC(kernel='poly', cache_size=600, degree=40, C=C).fit(X_train, y_train)

print "all done"
The optimization algorithm behind SVM has cubic complexity (O(n^3)) assuming a relatively high cost (C) and a high-dimensional feature space (a polynomial kernel with d=40 implies roughly a 1600-dimensional feature space). I would not call this "problems with convergence"; with over 3000 samples it can simply take a while to train such a model, and that is normal. The fact that some subsets converge much faster is an effect of the very rich feature projection (the same can happen with an RBF kernel) - it is a common phenomenon, even for very simple data from the UCI repository. As mentioned in the comments, setting verbose=True may give you additional information about the optimization process - it will output the number of iterations and the number of support vectors (the more SVs, the more the SVM is overfitting, which can also be a reason for slow convergence).
I would also add to @lejlot's answer that standardizing the input variables (centering and scaling to unit variance, or rescaling to some range such as [0, 1] or [-1, 1]) can make the optimization problem much easier and speed up convergence as well.
By having a look at your data, it seems that some features have min and max values significantly larger than others. Maybe the MinMaxScaler can help. Have a look at the preprocessing doc in general.
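A hedged sketch of that preprocessing for this particular setup (MaxAbsScaler is swapped in here only because load_svmlight_file returns a sparse matrix; with a dense copy of the data, MinMaxScaler or StandardScaler would serve the same purpose):

    from sklearn.datasets import load_svmlight_file
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MaxAbsScaler
    from sklearn.svm import SVC

    X_train, y_train = load_svmlight_file("training_patients.txt")

    # Rescaling every feature to [-1, 1] keeps the polynomial kernel values in a
    # sane range, which tends to make libsvm's optimizer converge much faster.
    poly_svc = make_pipeline(MaxAbsScaler(),
                             SVC(kernel='poly', degree=40, cache_size=600, C=1.0, verbose=True))
    poly_svc.fit(X_train, y_train)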
I'm using LibSVM in a 5x2 cross validation to classify a very large amount of data; that is, I have 47k samples for training and 47k samples for testing in 10 different configurations.
I usually use LibSVM's script easy.py to classify the data, but it's taking too long: I've been waiting for results for more than 3 hours with nothing so far, and I still have to repeat this procedure 9 more times!
Does anybody know how to use libsvm faster with such a large amount of data? Do the C++ LibSVM functions work faster than the Python functions?
LibSVM's training algorithm doesn't scale up to this kind of dataset; it takes O(n³) time in the worst case and around O(n²) on typical ones. The first thing to try is scaling your datasets properly; if it still doesn't work, switch to
a linear SVM (or logistic regression) using, e.g., Liblinear, Vowpal Wabbit or Wapiti (see the sketch below)
a more scalable kernel SVM implementation such as LaSVM
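For the linear route, here is a hedged sketch using scikit-learn's liblinear wrapper (the file path is a placeholder for your own libsvm-format data, scaled as suggested above):

    from sklearn.datasets import load_svmlight_file
    from sklearn.svm import LinearSVC

    # Placeholder path: any libsvm-format training file, already scaled.
    X, y = load_svmlight_file("training_scaled.libsvm")

    # liblinear scales roughly linearly with the number of samples; with 47k rows
    # and more samples than features, the primal formulation (dual=False) is
    # usually the faster choice.
    clf = LinearSVC(C=1.0, dual=False)
    clf.fit(X, y)
    print(clf.score(X, y))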
As larsmans mentioned, libsvm may not scale all that well depending on the dimensionality of the data and the number of data points.
The C implementation may run a bit faster, but it won't be a significant difference. You have a few options available to you.
You could randomly sample your data to work on a small subset of it (see the sketch after these options).
You could project your data into a lower dimension with something like PCA
Depending on your data type, you can look into different kernels. Would a histogram intersection kernel work out for your data? Are you using an RBF kernel when you really just need a linear decision function?
Hope this helps! One of the toughest problems in machine learning is coping with the sheer magnitude of data required at times.
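Combining the first two options above, here is a hedged sketch (the function name, sample count, and component count are illustrative, and PCA as used here expects a dense array):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    def subsample_and_reduce(X, y, n_samples=10000, n_components=50, seed=0):
        # Randomly subsample the rows to cut the superlinear training cost.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(y), size=min(n_samples, len(y)), replace=False)
        X_small, y_small = X[idx], y[idx]
        # Project into a lower-dimensional space before handing the data to the SVM.
        pca = PCA(n_components=n_components)
        X_small = pca.fit_transform(X_small)
        return SVC(kernel='rbf').fit(X_small, y_small), pca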
easy.py is a script for training and evaluating a classifier. It does a meta-training (grid search) of the SVM parameters with grid.py. In grid.py there is a parameter "nr_local_worker" which defines the number of threads. You might wish to increase it (check your processor load).
I got LinearSVC working against a training set and a test set using the load_file method, and I am now trying to get it working in a multiprocessor environment.
How can I get multiprocessing to work with LinearSVC().fit() and LinearSVC().predict()? I am not really familiar with scikit-learn's datatypes yet.
I am also thinking about splitting the samples into multiple arrays, but I am not familiar with numpy arrays and scikit-learn data structures.
Doing this would make it easier to put into multiprocessing.Pool(): split the samples into chunks, train them, and combine the trained sets back later. Would that work?
EDIT:
Here is my scenario:
Let's say we have 1 million files in the training sample set. When we want to distribute the processing of TfidfVectorizer over several processors, we have to split those samples (in my case there will only be two categories, so let's say 500,000 samples each to train). My server has 24 cores and 48 GB of RAM, so I want to split each topic into 1000000 / 24 chunks and process TfidfVectorizer on them. I would do the same for the testing sample set, as well as for SVC.fit() and decide(). Does that make sense?
Thanks.
PS: Please do not close this .
I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.
For the multiprocessing: You can distribute the data sets across cores, do partial_fit, get the weight vectors, average them, distribute them to the estimators, do partial fit again.
Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.
How many classes does your data have, btw? For each class, a separate model will be trained (automatically). If you have nearly as many classes as cores, it might be better and much easier to just do one class per core, by specifying n_jobs in SGDClassifier.
For linear models (LinearSVC, SGDClassifier, Perceptron...) you can chunk your data, train independent models on each chunk and build an aggregate linear model (e.g. SGDClassifier) by assigning it the average values of coef_ and intercept_ as attributes. The predict method of LinearSVC, SGDClassifier, and Perceptron computes the same function (a linear prediction using a dot product with an intercept_ threshold and one-vs-all multiclass support), so the specific model class you use for holding the average coefficients is not important.
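A hedged sketch of that chunk-and-average approach (it assumes all chunks share the same feature space and class set, and simply reuses the first fitted model as the container for the averaged parameters):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def average_linear_models(chunks, classes):
        # Train one linear model per (X, y) chunk.
        models = []
        for X_chunk, y_chunk in chunks:
            clf = SGDClassifier(loss='hinge')
            clf.partial_fit(X_chunk, y_chunk, classes=classes)
            models.append(clf)
        # predict() only needs coef_ and intercept_ (a dot product plus a threshold,
        # one-vs-all for multiclass), so any fitted model can hold the averages.
        averaged = models[0]
        averaged.coef_ = np.mean([m.coef_ for m in models], axis=0)
        averaged.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
        return averaged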
However, as previously said, the tricky point is parallelizing the feature extraction, and current scikit-learn (version 0.12) does not provide any way to do this easily.
Edit: scikit-learn 0.13+ now has a hashing vectorizer that is stateless.