I have an extremely large dataset and would like to train several random forest models on partitions of the dataset, then average these models to come up with my final classifier. Since random forest is an ensemble method, this is an intuitively sound approach but I'm unsure whether it's possible to do using scikit-learn's random forest classifier. Any ideas?
I'd also be open to using a random forest classifier from another package as well, just not sure where to look.
Here is what I can think of:
1. Pandas + Scikit: You can write your own bootstrap procedure in which you randomly read a reasonably sized sample from the overall data set and fit scikit-learn trees on each sample (ideally randomizing the features considered at each node). Then pickle each tree and finally average their predictions to get your random forest.
2. Graphlab + SFrame: Turi has its own big data library (SFrame, similar to Pandas) and machine learning library (graphlab, very similar to scikit-learn). Very beautiful environment.
3. Blaze-Dask: This might have a slightly steeper learning curve for some people, but would be an efficient solution.
4. Memory-mapped numpy: You can go with this option too, but it's going to be more cumbersome than the first three options, and I've never done it, so I'll just leave this option here.
All in all, I would go with option 2.
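For what it's worth, here is a minimal sketch of the "train on partitions, then average" idea using only scikit-learn. The synthetic data, the four in-memory partitions, and the helper name averaged_predict_proba are all assumptions made for illustration; with real data each partition would be read from disk separately, and every partition needs to contain every class for the probability columns to line up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a dataset too large to fit in one go (assumption).
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, y_train = X[:2400], y[:2400]
X_test, y_test = X[2400:], y[2400:]

# Fit one independent forest per partition of the training data.
forests = []
for X_part, y_part in zip(np.array_split(X_train, 4), np.array_split(y_train, 4)):
    rf = RandomForestClassifier(n_estimators=25, random_state=len(forests))
    forests.append(rf.fit(X_part, y_part))


def averaged_predict_proba(models, X):
    """Average class probabilities over several fitted forests.

    Hypothetical helper: assumes every model saw the same set of classes.
    """
    return np.mean([m.predict_proba(X) for m in models], axis=0)


proba = averaged_predict_proba(forests, X_test)
pred = forests[0].classes_[proba.argmax(axis=1)]
accuracy = (pred == y_test).mean()
print("accuracy of the averaged forests:", accuracy)
```

Averaging predict_proba keeps each fitted model untouched, which seemed safer to me than splicing the estimators_ lists of several forests together.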
I use scikit-learn in Python to run RandomForestClassifier(). Because I want to visualize the random forest to understand the relationships between different features, I use export_graphviz() to achieve this goal.
from subprocess import call
from IPython.display import Image
from sklearn.tree import export_graphviz

# Pick one fitted tree out of the forest
estimator1 = best_model1.estimators_[0]

# Write that tree to a Graphviz .dot file
export_graphviz(estimator1,
                out_file='tree_from_optimized_forest.dot',
                rounded=True,
                feature_names=list(X_train.columns),
                class_names=["No", "Yes"],
                filled=True)

# Convert the .dot file to a PNG (requires the Graphviz 'dot' program on the PATH)
call(['dot', '-Tpng', 'tree_from_optimized_forest.dot', '-o', 'tree_from_optimized_forest.png', '-Gdpi=200'])

# Display the image in the notebook
Image(filename='tree_from_optimized_forest.png')
However, unlike a decision tree, a random forest produces many trees; how many depends on n_estimators in RandomForestClassifier().
from sklearn.ensemble import RandomForestClassifier
best_model1 = RandomForestClassifier(n_estimators=100,
                                     criterion='gini',
                                     random_state=42)
Besides, because DecisionTreeClassifier() uses all the samples to produce just one tree, we can interpret the results directly from this single tree.
In contrast, a random forest is trained to build several different trees, which then vote to decide the result. In addition, the contents of these trees differ because random forests rely on bootstrapping, bagging, out-of-bag samples and so on.
Therefore, I want to ask: if I only visualize one of the trees from the result of RandomForestClassifier(), does this tree have any value as a reference?
Can I directly interpret the content of this tree as the analysis result for the whole data set? If not, is DecisionTreeClassifier() the only way to analyze the relationships between features through a visualized image?
Thanks a lot!!
There has always been a trade-off in machine learning between a model's interpretability and its complexity, and your post relates directly to this.
Decision trees are quite simple models that are used heavily for their interpretability, but a single tree often copes poorly with very complex functions: a shallow tree underfits, and a fully grown tree overfits (high variance). Hence people came up with random forest classifiers. Random forests reduce the variance of the vanilla decision tree by averaging many randomized trees, but unfortunately in that process they lose the straightforward interpretability. In particular, any single tree from the forest was fit on a bootstrap sample with random feature subsets, so it should not be read as a summary of the whole model.
Yet there are still tools that can help you gain some insight into the learnt function and the contribution of each feature. One of those tools is treeinterpreter; you can learn more about it in this article.
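This is not treeinterpreter; as a rough built-in alternative, here is a sketch that compares scikit-learn's impurity-based feature_importances_ for the whole forest with those of a single tree, and prints that one tree as text with export_text. The synthetic data and the feature names f0..f5 are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic data with made-up feature names (assumption).
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
names = [f"f{i}" for i in range(6)]

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Importances aggregated over all trees vs. those of one arbitrary tree.
forest_imp = forest.feature_importances_
tree_imp = forest.estimators_[0].feature_importances_

for name, f_imp, t_imp in zip(names, forest_imp, tree_imp):
    print(f"{name}: forest={f_imp:.3f}  first tree={t_imp:.3f}")

# Text rendering of the first tree, truncated to the top levels.
print(export_text(forest.estimators_[0], feature_names=names, max_depth=2))
```

The per-tree numbers usually wander quite a bit around the forest-level ones, which is one way to see why a single tree is a weak stand-in for the ensemble.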
I am new to all these methods and am trying to get a simple answer to that or perhaps if someone could direct me to a high level explanation somewhere on the web. My googling only returned kaggle sample codes.
Are extra trees and random forest essentially the same? And xgboost uses boosting when it chooses the features for any particular tree, i.e. sampling the features. But then how do the other two algorithms select the features?
Thanks!
Extra-trees (ET), a.k.a. extremely randomized trees, are quite similar to random forests (RF). Both are bagging-style methods that aggregate fully grown decision trees. At each split, RF only tries a subset of the features (e.g. a third of them), but evaluates every possible break point within these features and picks the best. ET, however, only evaluates a few random break points and picks the best of these. ET can either bootstrap the samples for each tree or use all samples. RF must use bootstrapping to work well.
xgboost is an implementation of gradient boosting and can work with decision trees, typically smaller trees. Each tree is trained to correct the residuals of the previously trained trees. Gradient boosting can be more difficult to train, but can achieve a lower model bias than RF. For noisy data, bagging is likely to be the most promising. For low-noise data with complex structure, boosting is likely to be the most promising.
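To make the comparison concrete, here is a small sketch that fits all three kinds of model on the same synthetic data. It uses scikit-learn's GradientBoostingClassifier as a stand-in for xgboost (an assumption, so the example needs no extra package), and the data set and hyperparameters are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    # Bootstrap samples + best threshold among a random subset of features.
    "random forest": RandomForestClassifier(
        n_estimators=100, max_features="sqrt", random_state=0),
    # Random thresholds; by default every tree sees all samples.
    "extra trees": ExtraTreesClassifier(
        n_estimators=100, max_features="sqrt", random_state=0),
    # Shallow trees fitted sequentially, each correcting its predecessors.
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=3, random_state=0),
}

scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: test accuracy = {score:.3f}")
```

Note the defaults: in scikit-learn, RandomForestClassifier has bootstrap=True while ExtraTreesClassifier has bootstrap=False, which matches the description above.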
Edit 2:
There is now a lovely example in the sklearn documentation on this.
In order to see how many trees are necessary in my forest, I'd like to plot the OOB error as the number of trees used in the forest is increased. I'm in Python using a sklearn.ensemble.RandomForestClassifier but I can't find how to predict using a subset of trees in the forest. I could do this by making a new random forest on each iteration with increasing numbers of trees but this is too expensive.
It seems a similar task is possible with the Gradient Boosting object using the staged_decision_function method. See this example.
This is quite a simple procedure in R and can be achieved by simply calling plot(randomForestObject).
--Edit--
I see now the RandomForestClassifier object has an attribute estimators_ which returns all the DecisionTreeClassifier objects in a list. So to solve this I can iterate through that list, predicting the results from each tree and taking a 'cumulative average'. However, is there really no easier way to do this already implemented?
There is a discussion and code in this issue:
https://github.com/scikit-learn/scikit-learn/issues/4273
You can add trees one-by-one like this:
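from sklearn.ensemble import RandomForestClassifier

n_estimators = 100
forest = RandomForestClassifier(warm_start=True, oob_score=True)
# With warm_start=True, each fit() call keeps the existing trees and only adds new ones
for i in range(1, n_estimators + 1):
    forest.set_params(n_estimators=i)
    forest.fit(X, y)
    print(i, forest.oob_score_)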
The solution you propose also needs to get the oob indices for each tree, because you don't want to compute the score on all the training data.
I still feel this is a strange thing to do, as there is really no natural ordering of the trees in the forest.
Can you explain what your use case is? Do you want to find the minimum number of trees for a given accuracy to reduce prediction time? If you want fast prediction time, I'd suggest using GradientBoostingClassifier, which is usually much faster.
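The "cumulative average" over estimators_ mentioned in the question's edit can be sketched on a held-out validation set, which sidesteps the OOB-index bookkeeping entirely. This is only an illustration on synthetic data, and the resulting curve is a validation error, not an OOB error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Per-tree class probabilities, shape (n_trees, n_samples, n_classes).
per_tree = np.stack([tree.predict_proba(X_val) for tree in forest.estimators_])

# Running mean over the first k trees, for k = 1..n_trees.
counts = np.arange(1, len(per_tree) + 1)[:, None, None]
cum_avg = np.cumsum(per_tree, axis=0) / counts

# Validation error of the sub-forest made of the first k trees.
staged_pred = forest.classes_[cum_avg.argmax(axis=2)]
errors = (staged_pred != y_val).mean(axis=1)

for k in (1, 5, 10, 25, 50):
    print(f"{k:2d} trees: validation error = {errors[k - 1]:.3f}")
```

Since the trees have no natural order, the exact shape of the curve depends on the arbitrary order in estimators_; only the overall trend is meaningful.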
I'm performing a stepwise model selection, progressively dropping variables with a variance inflation factor over a certain threshold.
In order to do this, I'm running OLS many, many times on datasets ranging from a few hundred MB to 10 gigs.
What would be the quickest implementation of OLS for larger datasets? The statsmodels OLS implementation seems to be using numpy to invert matrices. Would a gradient-descent-based method be quicker? Does scikit-learn have an especially quick implementation?
Or maybe an mcmc based approach using pymc is quickest...
Update 1: Seems that the scikit learn implementation of LinearRegression is a wrapper for the scipy implementation.
Update 2: Scipy OLS via scikit learn LinearRegression is twice as fast as statsmodels OLS in my very limited tests...
The scikit-learn SGDRegressor class is (iirc) the fastest, but would probably be more difficult to tune than a simple LinearRegression.
I would give each of those a try, and see if they meet your needs. I also recommend subsampling your data: if you have many gigs but they are all samples from the same distribution, you can train/tune your model on a few thousand samples (depending on the number of features). This should lead to faster exploration of your model space, without wasting a bunch of time on "repeat/uninteresting" data.
Once you find a few candidate models, then you can try those on the whole dataset.
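A rough sketch of the subsampling suggestion: fit both LinearRegression and SGDRegressor on a few thousand random rows of a larger array. The synthetic data, the coefficients in true_coef, and the subsample size of 2000 are all made up for illustration; SGDRegressor is wrapped with a StandardScaler because it tends to be sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic "large" data set with known coefficients (assumption).
n, p = 100_000, 5
true_coef = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
X = rng.normal(size=(n, p))
y = X @ true_coef + 0.1 * rng.normal(size=n)

# Explore on a random subsample instead of the full data.
idx = rng.choice(n, size=2000, replace=False)
X_sub, y_sub = X[idx], y[idx]

ols = LinearRegression().fit(X_sub, y_sub)
sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
).fit(X_sub, y_sub)

print("OLS coefficients from the subsample:", np.round(ols.coef_, 3))
print("SGD R^2 on the full data:", round(sgd.score(X, y), 4))
```

Whether the subsample estimates are close enough obviously depends on the data; here the rows are i.i.d. by construction, which is exactly the situation the advice above assumes.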
Stepwise methods are not a good way to perform model selection, as they are entirely ad hoc and depend highly on which direction you run the stepwise procedure. It's far better to use criterion-based methods, or some other method for generating model probabilities. Perhaps the best approach is to use reversible-jump MCMC, which fits models over the entire model space, and not just the parameter space of a particular model.
PyMC does not implement rjMCMC itself, but it can be implemented. Note also that PyMC 3 makes it really easy to fit regression models using its new glm submodule.
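Separately from the question of which selection strategy to use, the VIF screening described in the question does not need one OLS fit per variable: the VIFs of all predictors are the diagonal of the inverse of their correlation matrix. A minimal numpy sketch follows, with vif_from_correlation as a hypothetical helper name and synthetic data in which column 3 is nearly a copy of column 0.

```python
import numpy as np


def vif_from_correlation(X):
    """Variance inflation factors for all columns of X in one shot.

    VIF_j = 1 / (1 - R_j^2) equals the j-th diagonal entry of the inverse
    of the predictors' correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
# Make column 3 almost collinear with column 0 (illustrative assumption).
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=5000)

vif = vif_from_correlation(X)
print(np.round(vif, 2))
```

This only needs the p x p correlation matrix, so it scales with the number of predictors rather than the number of rows once that matrix has been accumulated.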
I'm trying to use a forest (or tree) augmented Bayes classifier (Original introduction, Learning) in python (preferably python 3, but python 2 would also be acceptable), first learning it (both structure and parameter learning) and then using it for discrete classification and obtaining probabilities for those features with missing data. (This is why just discrete classification and even good naive classifiers are not very useful for me.)
The way my data comes in, I'd love to use incremental learning from incomplete data, but I haven't even found anything doing both of these in the literature, so anything that does structure and parameter learning and inference at all is a good answer.
There seem to be a few very separate and unmaintained python packages that go roughly in this direction, but I haven't seen anything that is moderately recent (for example, I would expect that using pandas for these calculations would be reasonable, but OpenBayes barely uses numpy), and augmented classifiers seem completely absent from anything I have seen.
So, where should I look to save me some work implementing a forest augmented Bayes classifier? Is there a good implementation of Pearl's message passing algorithm in a python class, or would that be inappropriate for an augmented Bayes classifier anyway?
Is there a readable object-oriented implementation for learning and inference of TAN Bayes classifiers in some other language, which could be translated to python?
Existing packages I know of, but found inappropriate, are:
milk, which does support classification, but not with Bayesian classifiers (and I definitely need probabilities for the classification and unspecified features)
pebl, which only does structure learning
scikit-learn, which only learns naive Bayes classifiers
OpenBayes, which has only barely changed since somebody ported it from numarray to numpy and documentation is negligible.
libpgm, which claims to support an even different set of things. According to the main documentation, it does inference, structure and parameter learning. Except there do not seem to be any methods for exact inference.
Reverend claims to be a “Bayesian Classifier”, has negligible documentation, and from looking at the source code I am led to the conclusion that it is mostly a spam classifier, according to Robinson's and similar methods, and not a Bayesian classifier.
eBay's bayesian Belief Networks allows building generic Bayesian networks and implements inference on them (both exact and approximate), which means that it can be used to build a TAN, but there is no learning algorithm in there, and the way BNs are built from functions means implementing parameter learning is more difficult than it might be for a hypothetical different implementation.
I'm afraid there is no out-of-the-box implementation of a Random Naive Bayes classifier (none that I am aware of), because it is still largely an academic topic. The following paper presents a method to combine RF and NB classifiers (behind a paywall): http://link.springer.com/chapter/10.1007%2F978-3-540-74469-6_35
I think you should stick with scikit-learn, which is one of the most popular statistical modules for Python (along with NLTK) and which is really well documented.
scikit-learn has a Random Forest module: http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees . There is a submodule which may (I stress the uncertainty) be used in a pipeline feeding an NB classifier:
RandomTreesEmbedding implements an unsupervised transformation of the
data. Using a forest of completely random trees, RandomTreesEmbedding
encodes the data by the indices of the leaves a data point ends up in.
This index is then encoded in a one-of-K manner, leading to a high
dimensional, sparse binary coding. This coding can be computed very
efficiently and can then be used as a basis for other learning tasks.
The size and sparsity of the code can be influenced by choosing the
number of trees and the maximum depth per tree. For each tree in the
ensemble, the coding contains one entry of one. The size of the coding
is at most n_estimators * 2 ** max_depth, the maximum number of leaves
in the forest.
As neighboring data points are more likely to lie within the same leaf
of a tree, the transformation performs an implicit, non-parametric
density estimation.
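A tentative sketch of what such a pipeline could look like: RandomTreesEmbedding produces the sparse leaf coding, and BernoulliNB is fitted on top of it. To be clear, this is not a tree-augmented Bayes classifier, just the pipeline idea described above; the data and parameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Look at the embedding on its own: one active leaf per tree and sample.
embedding = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
leaves = embedding.fit_transform(X)
ones_per_row = np.asarray(leaves.sum(axis=1)).ravel()
print("coding shape:", leaves.shape, "ones per row:", set(ones_per_row.tolist()))

# The same embedding feeding a naive Bayes classifier on the binary coding.
clf = make_pipeline(
    RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0),
    BernoulliNB(),
)
clf.fit(X, y)
proba = clf.predict_proba(X[:5])
print(np.round(proba, 3))
```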
And of course there is a out-of-core implementation of Naive Bayes classifier, which can be used incrementally : http://scikit-learn.org/stable/modules/naive_bayes.html
Discrete naive Bayes models can be used to tackle large scale text
classification problems for which the full training set might not fit
in memory. To handle this case both MultinomialNB and BernoulliNB
expose a partial_fit method that can be used incrementally as done
with other classifiers as demonstrated in Out-of-core classification
of text documents.
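For illustration, incremental training with partial_fit might look roughly like this. The chunks of random count data are generated on the fly as a stand-in for batches streamed from disk (an assumption); the full list of classes has to be passed on the first call. Note that this covers incremental learning only, not missing features.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
classes = np.array([0, 1])  # must be known up front for partial_fit

clf = MultinomialNB()
n_seen = 0
for _ in range(5):
    # Stand-in for one chunk read from disk: count features whose rate
    # depends on the class label.
    y_chunk = rng.integers(0, 2, size=200)
    X_chunk = rng.poisson(lam=1.0 + 2.0 * y_chunk[:, None], size=(200, 8))
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
    n_seen += len(y_chunk)

proba = clf.predict_proba(X_chunk[:3])
print("samples seen:", n_seen)
print(np.round(proba, 3))
```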
I was similarly confused as to how to do exact inference with libpgm. However, it turns out that it is possible. For example (from the libpgm docs),
import json
from libpgm.graphskeleton import GraphSkeleton
from libpgm.nodedata import NodeData
from libpgm.discretebayesiannetwork import DiscreteBayesianNetwork
from libpgm.tablecpdfactorization import TableCPDFactorization
# load nodedata and graphskeleton
nd = NodeData()
skel = GraphSkeleton()
nd.load("../tests/unittestdict.txt")
skel.load("../tests/unittestdict.txt")
# toporder graph skeleton
skel.toporder()
# load evidence
evidence = dict(Letter='weak')
query = dict(Grade='A')
# load bayesian network
bn = DiscreteBayesianNetwork(skel, nd)
# load factorization
fn = TableCPDFactorization(bn)
# calculate probability distribution
result = fn.condprobve(query, evidence)
# output
print(json.dumps(result.vals, indent=2))
print(json.dumps(result.scope, indent=2))
print(json.dumps(result.card, indent=2))
print(json.dumps(result.stride, indent=2))
To get the example to work, here is the datafile (I replaced None with null and saved as a .json).
I know this is quite late to the game, but this was the best post I found when searching for a resource to do Bayesian networks with Python. I thought I'd answer in case anyone else is looking for this. (Sorry, would have commented, but just signed up for SO to answer this and rep isn't high enough.)
R's bnlearn has implementations of both Naive Bayes and Tree-augmented Naive Bayes classifiers. You can use rpy2 to call these from Python.
http://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf
There seems to be no such thing yet.
The closest thing currently seems to be bayesian, eBay's open-source implementation of Belief Networks. It implements inference (two exact ways, and approximate), which means that it can be used to build a TAN. An example (at the moment still an ugly piece of spaghetti code) for that can be found in my open20q repository.
Advantages:
It works.
That is, I now have an implementation of TAN inference, based on bayesian belief network inference.
With Apache 2.0 and 3-clause BSD style licenses respectively, it is legally possible to combine bayesian code and libpgm code to try to get inference and learning to work.
Disadvantages:
There is no learning whatsoever in bayesian. Trying to combine something like libpgm learning with bayesian classes and inference will be a challenge.
Even more so as bayesian assumes that nodes are given by factors which are fixed python functions. Parameter learning requires some wrapping code to enable tweaking the probabilities.
bayesian is written in pure Python, using dicts etc. as basic structures, without making use of any speedup that numpy, pandas or similar packages might bring, and is therefore quite slow even for the tiny example I built.
I know it's a bit late in the day, but the Octave forge NaN package might be of interest to you. One of the classifiers in this package is an Augmented Naive Bayesian Classifier. The code is GPL'ed so you could easily port it to Python.