I've read at https://towardsdatascience.com/random-seeds-and-reproducibility-933da79446e3 that, to create reproducible machine learning models in Python, you need to set the random seed and pin the package versions.
I would like to be able to save models after training, that is, e.g. using pickle.dump(), load them up again and then get the same results.
At https://docs.python.org/3/library/random.html#notes-on-reproducibility it says:
"Sometimes it is useful to be able to reproduce the sequences given by a pseudo-random number generator. By re-using a seed value, the same sequence should be reproducible from run to run as long as multiple threads are not running."
I'm using a RandomForestClassifier with n_jobs=-1, so I'm wondering whether I need to do more or whether this is already handled internally.
For the random seed now I have:
os.environ['PYTHONHASHSEED'] = str(42)
random.seed(42)
np.random.seed(42)
And for the classifier I'm setting the random state:
rf = RandomForestClassifier(random_state=42)
According to the documentation, you must also set the random_state parameter in the RandomForestClassifier:
random_state: int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used when
building trees (if bootstrap=True) and the sampling of the features to
consider when looking for the best split at each node (if max_features
< n_features). See Glossary for details.
For example:
from sklearn.ensemble import RandomForestClassifier
SEED = 42
clf = RandomForestClassifier(random_state = SEED)
CLARIFICATIONS:
In order for the experiment to be fully reproducible, all steps in the preparation of the dataset must be checked (e.g. train and test splits) even with fixed seed. np.random.seed does not guarantee a fixed random state for sklearn. We need to set random_state parameter corresponding to each sklearn function to ensure repeatability.
Setting random_state is also sufficient when training runs in multiple threads (e.g. with n_jobs=-1). Use the latest version of sklearn if possible, to avoid bugs present in earlier versions.
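A minimal sketch of how this could be checked, assuming a synthetic dataset from make_classification in place of the real data (the names X_train, X_test and train are illustrative, not from the question). It trains two forests in parallel with the same random_state, then round-trips one of them through pickle and compares predictions on held-out rows:

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the real training set (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, y_train, X_test = X[:300], y[:300], X[300:]


def train(seed):
    # n_jobs=-1 fits the trees in parallel; random_state seeds every tree.
    model = RandomForestClassifier(n_estimators=25, n_jobs=-1, random_state=seed)
    return model.fit(X_train, y_train)


rf_a = train(42)
rf_b = train(42)
same_across_runs = np.array_equal(rf_a.predict(X_test), rf_b.predict(X_test))

# Save and reload the model in memory, then compare predictions again.
rf_loaded = pickle.loads(pickle.dumps(rf_a))
same_after_pickle = np.array_equal(rf_a.predict(X_test), rf_loaded.predict(X_test))

print(same_across_runs, same_after_pickle)
```

This only compares runs inside one process on one machine; it says nothing about different library versions or platforms.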
Related
Does setting a specific random seed (random_state) when splitting train/test datasets using scikit-learn produce the same initialization of the random number generator (i.e., produces same pseudo-random numbers) over different platforms - for instance, over different cloud computing instances?
Thanks!
As long as random_state is equal on all platforms and they are all running the same versions of numpy, you should get the exact same splits.
Since random_state is a numpy instance, I think all of scikit-learn's pseudo-random number generators are frozen because numpy froze RandomState.
You can check the documentation for random_state here, which as you can see is numpy.random.RandomState. You can check numpy's compatibility guarantee here.
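As a small illustration (with toy arrays that are assumptions, not data from the question), two independent calls to train_test_split with the same random_state can be compared directly. This shows repeatability on a single machine only; checking different platforms would mean running the same script on each and comparing the saved indices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same integer random_state.
split_1 = train_test_split(X, y, test_size=0.3, random_state=42)
split_2 = train_test_split(X, y, test_size=0.3, random_state=42)

# Each split is [X_train, X_test, y_train, y_test]; compare them pairwise.
identical = all(np.array_equal(a, b) for a, b in zip(split_1, split_2))
print(identical)
```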
I'm running a reinforcement learning program in a gym environment(BipedalWalker-v2) implemented in tensorflow. I've set the random seed of the environment, tensorflow and numpy manually as follows
os.environ['PYTHONHASHSEED']=str(42)
random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)
env = gym.make('BipedalWalker-v2')
env.seed(0)
config = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
# run the graph with sess
However, I get different results every time I run my program (without changing any code). Why are the results not consistent and what should I do if I want to obtain the same result?
Update:
The only places I can think of that may introduce randomness (other than the neural networks) are:
I use tf.truncated_normal to generate random noise epsilon in order to implement a noisy layer
I use np.random.uniform to randomly select samples from the replay buffer
I also noticed that the scores I get are fairly consistent for the first 10 episodes, but then begin to differ. Other quantities such as the losses show a similar trend but are not numerically identical.
Update 2
I've also set "PYTHONHASHSEED" and use a single-threaded CPU session as @jaypops96 described, but I still cannot reproduce the result. The code block above has been updated accordingly.
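One way to rule out the replay-buffer sampling as a source of variation is to give the buffer its own seeded generator, so that other code drawing from the global NumPy state cannot shift its sequence. The sketch below is only an illustration: the ReplayBuffer class and its methods are hypothetical and not taken from the question's code.

```python
import numpy as np


class ReplayBuffer:
    """Hypothetical minimal buffer; the name and API are illustrative only."""

    def __init__(self, seed):
        self.items = []
        # A private generator, independent of the global np.random state.
        self.rng = np.random.RandomState(seed)

    def add(self, item):
        self.items.append(item)

    def sample(self, batch_size):
        idx = self.rng.randint(0, len(self.items), size=batch_size)
        return [self.items[i] for i in idx]


def draw(seed, unrelated_draws):
    buf = ReplayBuffer(seed)
    for step in range(100):
        buf.add(step)
    # Unrelated use of the global generator does not affect the buffer.
    np.random.uniform(size=unrelated_draws)
    return buf.sample(8)


batch_1 = draw(seed=0, unrelated_draws=3)
batch_2 = draw(seed=0, unrelated_draws=50)
print(batch_1 == batch_2)
```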
I suggest checking whether your TensorFlow graph contains nondeterministic operations. For example, reduce_sum before TensorFlow 1.2 was one such operation. These operations are nondeterministic because floating-point addition and multiplication are nonassociative (the order in which floating-point numbers are added or multiplied affects the result) and because such operations don't guarantee their inputs are added or multiplied in the same order every time. See also this question.
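The nonassociativity mentioned above can be seen in plain Python, without TensorFlow. The values below are chosen purely for illustration; explicit `+` operators are used rather than `sum()` so that the order of evaluation is exactly what is written:

```python
# The same three numbers, added in two different groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

# With very different magnitudes the effect is much larger.
big_first = (1e16 + 1.0) - 1e16   # the 1.0 is lost when added to 1e16
big_last = (1e16 - 1e16) + 1.0    # the 1.0 survives
print(big_first, big_last)        # 0.0 1.0
```

A parallel reduction that adds the same terms in a different order on each run can therefore return slightly different results, and those differences can grow over many training steps.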
EDIT (Sep. 20, 2020): The GitHub repository framework-determinism has more information about sources of nondeterminism in machine learning frameworks, particularly TensorFlow.
It seems that tensorflow neural networks introduce randomness during training that isn't controlled by a numpy random seed. The randomness appears to possibly come from python hash operations and parallelized operations executing in non-controlled ordering, at the very least.
I had success getting 100% reproducibility using a keras-tensorflow NN, by following the setup steps in this response:
How to get reproducible results in keras
Specifically, I used the formulation proposed by @Poete Maudit in that link.
The key was to set random seed values UP FRONT, for numpy, python, and tensorflow, and then also to make tensorflow run on a single-threaded CPU in a specially configured session.
Here's the code I used, updated very slightly from the link I posted.
print('Running in 1-thread CPU mode for fully reproducible results training a CNN and generating numpy randomness. This mode may be slow...')
# Seed value
# Apparently you may use different seed values at each stage
seed_value= 1
# 1. Set `PYTHONHASHSEED` environment variable at a fixed value
import os
os.environ['PYTHONHASHSEED']=str(seed_value)
seed_value += 1
# 2. Set `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)
seed_value += 1
# 3. Set `numpy` pseudo-random generator at a fixed value
import numpy as np
np.random.seed(seed_value)
seed_value += 1
# 4. Set `tensorflow` pseudo-random generator at a fixed value
import tensorflow as tf
tf.set_random_seed(seed_value)
# 5. Configure a new global `tensorflow` session
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
tf.keras.backend.set_session(sess)
#rest of code...
Maybe you can try setting the number of parallelism threads to 1. I had the same problem: the loss started to differ at the seventh decimal place from the second episode onwards. It was fixed when I set
tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
I want to implement an AdaBoost model using scikit-learn (sklearn). My question is similar to another question, but it is not exactly the same. As far as I understand, the random_state variable described in the documentation is for randomly splitting the training and testing sets, according to the previous link. So if I understand correctly, my classification results should not depend on the seed. Is that correct? Should I be worried if my classification results turn out to depend on the random_state variable?
Your classification scores will depend on random_state. As @Ujjwal rightly said, it is used for splitting the data into training and test sets. Beyond that, many algorithms in scikit-learn use random_state to select subsets of features, select subsets of samples, determine initial weights, and so on.
For example:
Tree based estimators will use the random_state for random selections of features and samples (like DecisionTreeClassifier, RandomForestClassifier).
In clustering estimators like Kmeans, random_state is used to initialize centers of clusters.
SVMs use it for initial probability estimation
Some feature selection algorithms also use it for initial selection
And many more...
It's mentioned in the documentation that:
If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
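A short sketch of the pattern the documentation describes, using sklearn.utils.check_random_state to turn a random_state argument into a RandomState object. The helper sample_noise is a hypothetical function written for this illustration:

```python
import numpy as np
from sklearn.utils import check_random_state


def sample_noise(n, random_state=None):
    """Hypothetical helper that draws n normal samples."""
    # Accepts None, an int seed, or an existing RandomState instance.
    rng = check_random_state(random_state)
    return rng.normal(size=n)


# An int seed builds a fresh generator each call: same numbers every time.
a = sample_noise(3, random_state=0)
b = sample_noise(3, random_state=0)

# A shared RandomState instance is consumed: the second call continues
# the stream instead of restarting it.
shared = np.random.RandomState(0)
c = sample_noise(3, random_state=shared)
d = sample_noise(3, random_state=shared)

print(np.array_equal(a, b), np.array_equal(a, c), np.array_equal(c, d))
```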
Do read the following questions and answers for better understanding:
Choosing random_state for sklearn algorithms
confused about random_state in decision tree of scikit learn
It does matter. When your training set differs, your trained model also changes. For a different subset of the data you can end up with a classifier that is a little different from one trained on another subset.
Hence, you should use a constant seed like 0 or another integer, so that your results are reproducible.
I'm confused about the random_state parameter and not sure why decision tree training needs any randomness. My thoughts:
is it related to random forest?
is it related to split training testing data set? If so, why not use training testing split method directly (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html)?
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer releases
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10)
...
...
array([ 1. , 0.93..., 0.86..., 0.93..., 0.93...,
0.93..., 0.93..., 1. , 0.93..., 1. ])
This is explained in the documentation
The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
So, basically, a sub-optimal greedy algorithm is repeated a number of times using random selections of features and samples (a similar technique used in random forests). The random_state parameter allows controlling these random choices.
The interface documentation specifically states:
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
So, the random algorithm will be used in any case. Passing any value (whether a specific int, e.g., 0, or a RandomState instance), will not change that. The only rationale for passing in an int value (0 or otherwise) is to make the outcome consistent across calls: if you call this with random_state=0 (or any other value), then each and every time, you'll get the same result.
The above cited part of the documentation is misleading, the underlying problem is not greediness of the algorithm. The CART algorithm is deterministic (see e.g. here) and finds a global minimum of the weighted Gini indices.
Repeated runs of the decision tree can give different results because it is sometimes possible to split the data using different features and still achieve the same Gini index. This is described here:
https://github.com/scikit-learn/scikit-learn/issues/8443.
Setting the random state simply assures that the CART implementation works through the same randomized list of features when looking for the minimum.
Decision trees use a heuristic process and do not guarantee the same, globally optimal solution. The tree structure can vary each time you build a model. Passing a specific seed to random_state ensures the same result is generated each time you build the model.
The random_state parameter present for decision trees in scikit-learn determines which feature to select for a split if (and only if) there are two splits that are equally good (i.e. two features yield the exact same improvement in the selected splitting criteria (e.g. gini)). If this is not the case, the random_state parameter has no effect.
The issue linked in teatrader's answer discusses this in more detail and as a result of that discussion the following section was added to the docs (emphasis added):
random_state int, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.
To illustrate, let's consider the following example with the iris sample data set and a shallow decision tree containing just a single split:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
iris = load_iris(as_frame=True)
clf = DecisionTreeClassifier(max_depth=1)
clf = clf.fit(iris.data, iris.target)
plot_tree(clf, feature_names=iris['feature_names'], class_names=iris['target_names']);
The output of this code will alternate between the two following trees based on which random_state is used.
The reason for this is that splitting on either petal length <= 2.45 or petal width <= 0.8 will both perfectly separate out the setosa class from the other two classes (we can see that the leftmost setosa node contains all 50 of the setosa observations).
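The tie can also be checked without plotting. The sketch below fits the same depth-1 tree with several seeds (the range of 20 seeds is an arbitrary choice) and collects the feature and threshold chosen at the root, read from the fitted tree_ attribute:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

root_splits = set()
for seed in range(20):
    clf = DecisionTreeClassifier(max_depth=1, random_state=seed)
    clf.fit(iris.data, iris.target)
    # Node 0 is the root; feature holds a column index, threshold the cut point.
    feature = iris.feature_names[clf.tree_.feature[0]]
    threshold = round(float(clf.tree_.threshold[0]), 2)
    root_splits.add((feature, threshold))

# Only the two petal features are expected to appear here.
print(sorted(root_splits))
```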
If we change just one observation of the data so that one of the previous two splitting criteria no longer produces a perfect separation, the random_state will have no effect and we will always end up with the same result, for example:
# Change the petal width for first observation of the "Setosa" class
# so that it overlaps with the values of the other two classes
iris['data'].loc[0, 'petal width (cm)'] = 5
clf = DecisionTreeClassifier(max_depth=1)
clf = clf.fit(iris.data, iris.target)
plot_tree(clf, feature_names=iris['feature_names'], class_names=iris['target_names']);
The first split will now always be petal length <= 2.45, since the split petal width <= 0.8 can only separate out 49 of the 50 setosa observations (in other words, it gives a smaller decrease in the Gini score).
For a random forest (which consists of many decision trees), we would create each individual tree with a random selections of features and samples (see https://scikit-learn.org/stable/modules/ensemble.html#random-forest-parameters for details), so there is a bigger role for the random_state parameter, but this is not the case when training just a single decision tree (this is true with the default parameters, but it is worth noting that some parameters could be affected by randomness if they are changed from the default value, most notably setting splitter="random").
A couple of related issues:
https://ai.stackexchange.com/questions/11576/are-decision-tree-learning-algorithms-deterministic
Is the CART algorithm used by scikit-learn deterministic?
Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered good practice. You can use any number, and model quality won't depend meaningfully on exactly which value you choose.
I have almost 900,000 rows of information that I want to run through scikit-learn's Random Forest Classifier algorithm. Problem is, when I try to create the model my computer freezes completely, so what I want to try is running the model every 50,000 rows but I'm not sure if this is possible.
So the code I have now is
# This code freezes my computer
rfc.fit(X,Y)
#what I want is
model = rfc.fit(X.iloc[0:50000], Y.iloc[0:50000])  # .ix was removed from pandas; .iloc slices by position
model = rfc.fit(X.iloc[0:100000], Y.iloc[0:100000])
model = rfc.fit(X.iloc[0:150000], Y.iloc[0:150000])
#... and so on
Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of writing this), that you're on a Windows machine and using n_jobs=-1 (or a combination of all three). So my suggestion would be to first upgrade scikit-learn or set n_jobs=1 and try fitting on the whole dataset.
If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators you can fit additional trees on subsets of your data:
# First build 100 trees on the first 50,000 rows
# (.iloc replaces .ix, which was removed from pandas)
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.iloc[0:50000], Y.iloc[0:50000])
# add another 100 estimators, fitted on the first 100,000 rows
clf.set_params(n_estimators=200)
clf.fit(X.iloc[0:100000], Y.iloc[0:100000])
# and so forth...
clf.set_params(n_estimators=300)
clf.fit(X.iloc[0:150000], Y.iloc[0:150000])
Another possibility is to fit a new classifier on each chunk and then either average the predictions from all classifiers or merge the trees into one big random forest, as described here.
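A minimal sketch of the averaging variant, assuming synthetic data and three equally sized chunks (both are assumptions made for the example). It only works as written if every chunk contains every class, so that the predict_proba columns of the individual models line up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the large dataset; rows are already shuffled.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
chunks = [(X[i:i + 200], y[i:i + 200]) for i in range(0, 600, 200)]

# One independent forest per chunk.
models = [
    RandomForestClassifier(n_estimators=20, random_state=i).fit(X_chunk, y_chunk)
    for i, (X_chunk, y_chunk) in enumerate(chunks)
]

# Averaging is only meaningful if all models saw the same set of classes.
assert all(np.array_equal(m.classes_, models[0].classes_) for m in models)

# Average the class probabilities, then pick the most probable class.
avg_proba = np.mean([m.predict_proba(X) for m in models], axis=0)
y_pred = models[0].classes_[avg_proba.argmax(axis=1)]

print(avg_proba.shape, (y_pred == y).mean())
```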
Another method similar to the one linked in Andreus' answer is to grow the trees in the forest individually.
I did this a while back: basically I trained a number of DecisionTreeClassifier's one at a time on different partitions of the training data. I saved each model via pickling, and afterwards I loaded them into a list which was assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately.
I ran into memory issues when I built all the trees in a single Python script. If you use this method and run into that issue, there's a workaround that I posted in the linked question.
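from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
iris = load_iris()
# iris is sorted by class; shuffle so that each chunk is very likely to
# contain all three classes, which warm_start needs for classifiers
X, y = shuffle(iris.data, iris.target, random_state=0)
### RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
rfc.fit(X[:50], y[:50])
print(rfc.score(X, y))
# add 10 more trees, fitted on the second chunk
rfc.n_estimators += 10
rfc.fit(X[50:100], y[50:100])
print(rfc.score(X, y))
# add 10 more trees, fitted on the third chunk
rfc.n_estimators += 10
rfc.fit(X[100:150], y[100:150])
print(rfc.score(X, y))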
Below is the difference between warm_start and partial_fit.
When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
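For contrast with the warm_start example, here is a small sketch of partial_fit on an estimator that supports it. SGDClassifier, the synthetic data and the chunk size of 200 are choices made for this illustration, not something taken from the question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# partial_fit must be told the full set of classes on the first call,
# because a single mini-batch may not contain all of them.
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
for start in range(0, len(X), 200):
    X_batch = X[start:start + 200]
    y_batch = y[start:start + 200]
    # The mini-batch changes on each call; the model parameters stay fixed.
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.score(X, y))
```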
Some algorithms in scikit-learn implement 'partial_fit()' methods, which is what you are looking for. There are random forest algorithms that do this, however, I believe the scikit-learn algorithm is not such an algorithm.
However, this question and answer may have a workaround that would work for you. You can train forests on different subsets, and assemble a really big forest at the end:
Combining random forest models in scikit learn
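A rough sketch of that workaround, under the assumption that both forests were trained on the same features and saw the same classes. The combine function is a hypothetical helper that relies on the fitted estimators_ attribute rather than on any dedicated merging API in scikit-learn, so treat it as a hack that may not survive library changes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Two forests trained on different halves of the data.
rf_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X[:200], y[:200])
rf_b = RandomForestClassifier(n_estimators=10, random_state=1).fit(X[200:], y[200:])


def combine(first, second):
    """Hypothetical helper: append second's trees to first (mutates first)."""
    first.estimators_ += second.estimators_
    first.n_estimators = len(first.estimators_)
    return first


combined = combine(rf_a, rf_b)
print(combined.n_estimators, combined.score(X, y))
```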