confused about random_state in decision tree of scikit learn - python

I'm confused about the random_state parameter, and I'm not sure why decision tree training needs randomness. My thoughts:
Is it related to random forest?
Is it related to splitting the data into training and test sets? If so, why not use the train/test split method directly (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html)?
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in versions before 0.18
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10)
array([1.  , 0.93..., 0.86..., 0.93..., 0.93...,
       0.93..., 0.93..., 1.  , 0.93..., 1.  ])

This is explained in the documentation:
The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
So, basically, a sub-optimal greedy algorithm is repeated a number of times using random selections of features and samples (a technique similar to the one used in random forests). The random_state parameter allows controlling these random choices.
The interface documentation specifically states:
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
So, the random algorithm will be used in any case. Passing any value (whether a specific int, e.g. 0, or a RandomState instance) will not change that. The only rationale for passing an int value (0 or otherwise) is to make the outcome consistent across calls: if you call this with random_state=0 (or any other fixed value), then you'll get the same result each and every time.
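To make this concrete, here is a minimal sketch (using the iris data from the question; comparing tree_.feature is just one way to check structural equality) showing that a fixed seed yields identical trees across fits:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Two fits with the same seed build identical trees.
a = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
b = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
print((a.tree_.feature == b.tree_.feature).all())  # True

# With random_state=None, repeated fits may pick different (equally
# good) splits, so the structure can differ between runs.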

The part of the documentation cited above is misleading; the underlying problem is not the greediness of the algorithm. The CART algorithm is deterministic (see e.g. here) and finds a global minimum of the weighted Gini indices.
Repeated runs of the decision tree can give different results because it is sometimes possible to split the data using different features and still achieve the same Gini index. This is described here:
https://github.com/scikit-learn/scikit-learn/issues/8443.
Setting the random state simply ensures that the CART implementation works through the same randomized list of features when looking for the minimum.

Decision trees use a heuristic process and do not guarantee the same solution globally; there will be variations in the tree structure each time you build a model. Passing a specific seed to random_state ensures that the same result is generated each time you build the model.

The random_state parameter present for decision trees in scikit-learn determines which feature to select for a split if (and only if) two splits are equally good, i.e. two features yield the exact same improvement in the selected splitting criterion (e.g. Gini). If this is not the case, the random_state parameter has no effect.
The issue linked in teatrader's answer discusses this in more detail and as a result of that discussion the following section was added to the docs (emphasis added):
random_state int, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.
To illustrate, let's consider the following example with the iris sample data set and a shallow decision tree containing just a single split:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
iris = load_iris(as_frame=True)
clf = DecisionTreeClassifier(max_depth=1)
clf = clf.fit(iris.data, iris.target)
plot_tree(clf, feature_names=iris['feature_names'], class_names=iris['target_names']);
The output of this code will alternate between the following two trees, depending on which random_state is used.
The reason for this is that splitting on either petal length <= 2.45 or petal width <= 0.8 will both perfectly separate out the setosa class from the other two classes (we can see that the leftmost setosa node contains all 50 of the setosa observations).
If we change just one observation of the data so that one of the previous two splitting criteria no longer produces a perfect separation, the random_state will have no effect and we will always end up with the same result, for example:
# Change the petal width for first observation of the "Setosa" class
# so that it overlaps with the values of the other two classes
iris['data'].loc[0, 'petal width (cm)'] = 5
clf = DecisionTreeClassifier(max_depth=1)
clf = clf.fit(iris.data, iris.target)
plot_tree(clf, feature_names=iris['feature_names'], class_names=iris['target_names']);
The first split will now always be petal length <= 2.45, since the split petal width <= 0.8 can only separate out 49 of the 50 setosa observations (in other words, a smaller decrease in the Gini score).
For a random forest (which consists of many decision trees), we would create each individual tree with a random selection of features and samples (see https://scikit-learn.org/stable/modules/ensemble.html#random-forest-parameters for details), so there is a bigger role for the random_state parameter there, but this is not the case when training just a single decision tree. This holds with the default parameters; note, however, that some parameters are affected by randomness if changed from their default values, most notably setting splitter="random" (a quick sketch of this follows).
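To illustrate that last point, a small sketch (again with iris): with splitter="random", thresholds are drawn at random, so the tree structure varies with the seed even when no splits are tied:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# With splitter="random", thresholds are drawn at random, so the
# resulting structure depends on the seed.
for seed in (0, 1, 2):
    clf = DecisionTreeClassifier(splitter="random", random_state=seed)
    clf.fit(iris.data, iris.target)
    print(seed, clf.get_depth(), clf.tree_.node_count)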
A couple of related issues:
https://ai.stackexchange.com/questions/11576/are-decision-tree-learning-algorithms-deterministic
Is the CART algorithm used by scikit-learn deterministic?

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run, which is considered good practice. You can use any number; the model quality won't depend meaningfully on the exact value you choose.

Related

ExtraTreesRegressor criterion

As I understand it, ExtraTreesRegressor from sklearn works by doing random splits instead of minimizing a metric like gini for classification or mae for regression.
I don't understand why there's a criterion parameter, as the criterion for the splits should be random.
Is it just for code compatibility, or am I missing something?
I think you misunderstood what about these splits is random. According to the User Guide for Extremely Randomized Trees:
As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds,
thresholds are drawn at random for each candidate feature
and
the best of these randomly-generated thresholds is picked as the splitting rule.
The additional randomization in ExtraTreesRegressor concerns the thresholds of the candidate features. But it must still be determined which of these thresholds provides the best split, and that is why you still need a criterion specifying the function used to evaluate the quality of a split.
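As a rough illustration, a sketch (the criterion names follow recent scikit-learn releases; older versions used "mse" and "mae"):
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Thresholds are drawn at random per candidate feature, but the
# criterion still decides which of those random splits is used.
for criterion in ("squared_error", "absolute_error"):
    est = ExtraTreesRegressor(n_estimators=50, criterion=criterion,
                              random_state=0)
    est.fit(X, y)
    print(criterion, round(est.score(X, y), 3))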

Random Forest in-bag and node dimensions

I have to build a random forest classifier for an exercise, and the exercise specifically says for the parameters (translated from my language):
in-bag percentage: 25% 50% 85%
Number of dimensions in one node: 10%, 50%, 80%
I use scikit-learn for the classifier and I don't know which are the parameters from the class to set the in-bag percentage and the number of dimensions.
You can set the number of dimensions using the max_features parameter. Something like:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_features=0.1)
Unfortunately, RandomForestClassifier doesn't yet support subsampling (i.e. an in-bag percentage). However, this feature has been added in the current development branch of sklearn, so it will be available in the future.
A good workaround for now is to use BaggingClassifier: it has a max_samples parameter for subsampling, and it can be turned into a RandomForestClassifier by using DecisionTreeClassifier as the base estimator.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier(max_features=0.1)
rf = BaggingClassifier(base_estimator=base, max_samples=0.25)
Note that BaggingClassifier also has a max_features parameter, but it works differently from the one in Random Forest: BaggingClassifier samples features once per estimator, rather than at each split.
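For readers on newer versions: the feature mentioned above did land, and since scikit-learn 0.22 the forest classes accept max_samples directly, so the exercise can be expressed without BaggingClassifier, e.g.:
from sklearn.ensemble import RandomForestClassifier

# In-bag percentage of 25%, 10% of the dimensions per split.
rf = RandomForestClassifier(max_features=0.1, max_samples=0.25)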

Classification results depend on random_state?

I want to implement an AdaBoost model using scikit-learn (sklearn). My question is similar to another question, but not entirely the same. As far as I understand, the random_state variable described in the documentation is for randomly splitting the training and test sets, according to the previous link. So if I understand correctly, my classification results should not depend on the seed, is that correct? Should I be worried if my classification results turn out to depend on the random_state variable?
Your classification scores will depend on random_state. As @Ujjwal rightly said, it is used for splitting the data into training and test sets. Not just that: a lot of algorithms in scikit-learn use random_state to select subsets of features and samples, to determine initial weights, and so on.
For example:
Tree-based estimators use random_state for random selections of features and samples (e.g. DecisionTreeClassifier, RandomForestClassifier).
In clustering estimators like KMeans, random_state is used to initialize the cluster centers.
SVMs use it for their initial probability estimates.
Some feature selection algorithms also use it for their initial selection.
And many more...
It's mentioned in the documentation that:
If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
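A short sketch of what that convention looks like in practice, built around sklearn.utils.check_random_state (draw_weights is a hypothetical helper, just for illustration):
from sklearn.utils import check_random_state

def draw_weights(n, random_state=None):
    # Accepts an int seed, a RandomState instance, or None,
    # exactly as described in the quoted documentation.
    rng = check_random_state(random_state)
    return rng.uniform(size=n)

print(draw_weights(3, random_state=0))  # reproducible
print(draw_weights(3))                  # falls back to np.random's state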
Do read the following questions and answers for better understanding:
Choosing random_state for sklearn algorithms
confused about random_state in decision tree of scikit learn
It does matter. When your training set differs, your trained model also changes. For a different subset of the data you can end up with a classifier that is a little different from one trained on some other subset.
Hence, you should use a constant seed like 0 or another integer, so that your results are reproducible.
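One quick way to check how much your scores depend on the seed is to evaluate the same model under a few values of random_state; this sketch uses synthetic data, since the original data isn't shown:
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# If these scores vary across seeds, the results depend on random_state.
for seed in (0, 1, 2):
    clf = AdaBoostClassifier(random_state=seed)
    print(seed, cross_val_score(clf, X, y, cv=5).mean())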

Not enough Folds on CrossValidation from Scikit Learn

I am trying to create a prediction model with Scikit-Learn in Python. I have a dataframe with about 850k rows and 17 columns. The last column is my label and the others are my features.
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
predictors = [a list of my predictors columns]
alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=8, min_samples_leaf=4)
scores = cross_validation.cross_val_score(alg, train[predictors],train["Sales"], cv=5)
print(scores.mean())
However, when I run the code, I get the following warning:
/Users/.../anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py:417: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=5.
% (min_labels, self.n_folds)), Warning)
I am not sure I understand the warning message. I thought I would only see it with small samples.
As suggested by David in the comments, it sounds like your output is continuous rather than categorical. If that is the case, you almost certainly want to be performing regression, not classification.
The warning stems from the fact that (at least) one of the values in your targets, which are treated as categorical, is underrepresented. If you actually do want to perform classification, a good first step is to count the number of occurrences of each class in your entire training set, as in the sketch below.
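A minimal sketch of that first step (the toy array y stands in for the question's train["Sales"] column):
import numpy as np

# Toy labels standing in for train["Sales"].
y = np.array([0, 0, 1, 1, 1, 2])

labels, counts = np.unique(y, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))  # {0: 2, 1: 3, 2: 1}

# Any class with fewer members than n_folds (here cv=5) triggers the warning.
print(counts.min())  # 1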
When you do k-fold cross-validation and min_labels < k, one of the cross-validation runs is guaranteed not to see any examples of the class with min_labels members, either at train or at test time (this happens more frequently at test time, because the test sets are smaller). If you have only one instance of a specific class, you are furthermore guaranteed to get a training run that doesn't see any examples of that class: it lies in exactly one of the folds, and that fold will be used as the test set once.

Break up Random forest classification fit into pieces in python?

I have almost 900,000 rows of data that I want to run through scikit-learn's Random Forest Classifier. The problem is that when I try to create the model, my computer freezes completely, so what I want to try is fitting the model 50,000 rows at a time, but I'm not sure if this is possible.
So the code I have now is:
# This code freezes my computer
rfc.fit(X, Y)

# What I want is something like this (.iloc replaces the long-removed .ix indexer):
model = rfc.fit(X.iloc[0:50000], Y.iloc[0:50000])
model = rfc.fit(X.iloc[0:100000], Y.iloc[0:100000])
model = rfc.fit(X.iloc[0:150000], Y.iloc[0:150000])
# ... and so on
Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of writing this), that you're on a Windows machine, and that you're using n_jobs=-1 (or some combination of the three). My suggestion would be to first upgrade scikit-learn or set n_jobs=1 and try fitting on the whole dataset.
If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators, you can fit additional trees on subsets of your data:
# First build 100 trees on the first chunk
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.iloc[0:50000], Y.iloc[0:50000])

# Add another 100 estimators on chunk 2
clf.set_params(n_estimators=200)
clf.fit(X.iloc[0:100000], Y.iloc[0:100000])

# And so forth...
clf.set_params(n_estimators=300)
clf.fit(X.iloc[0:150000], Y.iloc[0:150000])
Another possibility is to fit a new classifier on each chunk and then simply average the predictions from all classifiers, or to merge the trees into one big random forest, as described here (a rough sketch of the merging idea follows below).
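A rough sketch of the merging idea (not the linked answer's exact code; the data and chunk sizes here are made up):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, random_state=0)

# Train a separate forest on each chunk of the data.
forests = []
for start in range(0, 600, 200):
    f = RandomForestClassifier(n_estimators=50, random_state=start)
    f.fit(X[start:start + 200], y[start:start + 200])
    forests.append(f)

# Merge by concatenating the fitted trees into the first forest.
merged = forests[0]
for f in forests[1:]:
    merged.estimators_ += f.estimators_
merged.n_estimators = len(merged.estimators_)

print(merged.score(X, y))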
Another method similar to the one linked in Andreus' answer is to grow the trees in the forest individually.
I did this a while back: basically, I trained a number of DecisionTreeClassifier instances, one at a time, on different partitions of the training data. I saved each model via pickling, and afterwards I loaded them into a list which was assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately.
I ran into memory issues when I built all the trees in a single Python script. If you use this method and run into that issue, there's a workaround I posted in the linked question.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

iris = load_iris()
# Shuffle so that every chunk contains samples from each class
# (required when warm-starting a classifier; see the note below).
X, y = shuffle(iris.data, iris.target, random_state=0)

### RandomForestClassifier with warm_start
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
rfc.fit(X[:50], y[:50])          # first 10 trees, trained on chunk 1
print(rfc.score(X, y))
rfc.n_estimators += 10           # grow 10 more trees...
rfc.fit(X[50:100], y[50:100])    # ...trained on chunk 2
print(rfc.score(X, y))
rfc.n_estimators += 10           # and 10 more on chunk 3
rfc.fit(X[100:150], y[100:150])
print(rfc.score(X, y))
Below is the differentiation between warm_start and partial_fit, from the scikit-learn documentation:
When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
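For contrast, here is a sketch of the partial_fit side using SGDClassifier (scikit-learn's random forests do not implement partial_fit):
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = (X[:, 0] > 0).astype(int)

clf = SGDClassifier()
# classes must be declared up front, because an individual
# mini-batch is not guaranteed to contain every class.
for start in range(0, 300, 100):
    clf.partial_fit(X[start:start + 100], y[start:start + 100],
                    classes=np.array([0, 1]))
print(clf.score(X, y))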
Some algorithms in scikit-learn implement partial_fit() methods, which is what you are looking for. There are random forest algorithms that support this; however, I believe the scikit-learn implementation is not one of them.
However, this question and answer may have a workaround that would work for you. You can train forests on different subsets, and assemble a really big forest at the end:
Combining random forest models in scikit learn
