Platform-independent random state in scikit-learn train_test_split - python

Does setting a specific random seed (random_state) when splitting train/test datasets using scikit-learn produce the same initialization of the random number generator (i.e., produce the same pseudo-random numbers) across different platforms - for instance, across different cloud computing instances?
Thanks!

As long as random_state is equal on all platforms and they are all running the same versions of numpy, you should get the exact same splits.
Since random_state is backed by a numpy RandomState instance, scikit-learn's pseudo-random number streams are effectively frozen across platforms, because numpy guarantees that RandomState produces the same sequence for a given seed.
You can check the documentation for random_state here; as you can see, it is a numpy.random.RandomState. You can check numpy's compatibility guarantee here.
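As a quick sanity check (a minimal sketch, not from the original answer), you can verify that a fixed random_state yields identical splits from one call to the next, and the same holds across machines running the same numpy version:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed produce identical splits.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.3, random_state=42)
assert (X_tr1 == X_tr2).all() and (y_te1 == y_te2).all()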


How to create reproducible RandomForestClassifier in Python using jobs=-1?

I've read at https://towardsdatascience.com/random-seeds-and-reproducibility-933da79446e3 that, to create reproducible machine learning models in Python, you need to set the random seed and pin the package versions.
I would like to be able to save models after training (e.g. using pickle.dump()), load them again later, and then get the same results.
At https://docs.python.org/3/library/random.html#notes-on-reproducibility it says:
"Sometimes it is useful to be able to reproduce the sequences given by a pseudo-random number generator. By re-using a seed value, the same sequence should be reproducible from run to run as long as multiple threads are not running."
I'm using a RandomForestClassifier with n_jobs=-1, so I'm wondering whether I need to do more or whether this is already handled internally.
For the random seed now I have:
import os
import random
import numpy as np

os.environ['PYTHONHASHSEED'] = str(42)
random.seed(42)
np.random.seed(42)
And for the classifier I'm setting the random state:
rf = RandomForestClassifier(random_state=42)
According to the documentation, you must also set the random_state parameter in the RandomForestClassifier:
random_state: int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used when
building trees (if bootstrap=True) and the sampling of the features to
consider when looking for the best split at each node (if max_features
< n_features). See Glossary for details.
For example:
from sklearn.ensemble import RandomForestClassifier
SEED = 42
clf = RandomForestClassifier(random_state=SEED)
CLARIFICATIONS:
For the experiment to be fully reproducible, every step in the preparation of the dataset (e.g. the train/test split) must also be controlled, even with a fixed seed: np.random.seed does not guarantee a fixed random state for sklearn. You need to set the random_state parameter of each sklearn function or estimator to ensure repeatability.
Setting random_state is also sufficient when running with multiple threads or processes (n_jobs=-1). If possible, use the latest version of sklearn to avoid bugs present in earlier versions.
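A minimal end-to-end sketch of the above (the dataset and names here are chosen for illustration): fix random_state in both the split and the classifier, train with n_jobs=-1, pickle the model, and confirm that the reloaded model gives the same predictions.
import pickle
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42
X, y = load_iris(return_X_y=True)

# Seed both the split and the estimator; n_jobs=-1 does not break reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=SEED)
rf = RandomForestClassifier(random_state=SEED, n_jobs=-1).fit(X_tr, y_tr)

blob = pickle.dumps(rf)          # save ...
rf_loaded = pickle.loads(blob)   # ... and reload
assert (rf.predict(X_te) == rf_loaded.predict(X_te)).all()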

IterativeImputer - sample_posterior

Sklearn implements an imputer called the IterativeImputer. I believe that it works by predicting missing feature values in a round-robin fashion, using an estimator.
It has an argument called sample_posterior but I can't seem to figure out when I should use it.
sample_posterior : boolean, default=False
Whether to sample from the (Gaussian) predictive posterior of the fitted estimator for each imputation. Estimator must support return_std in its predict method if set to True. Set to True if using IterativeImputer for multiple imputations.
I looked at the source code but it still wasn't clear to me. Should I use this if I have multiple features that I am going to fill using the iterative imputer or should I use this if I plan to use the imputer multiple times like for a training and then validation set?
Even with multiple features, and a training and validation/test set, you don't need sample_posterior. The "multiple imputations" part of the docstring means generating more than one imputed (missing-values-replaced) dataset; see e.g. wikipedia.
Normally, IterativeImputer imputes the missing values of a feature using the predictions of a model built on the other features (iteratively, round robin, etc.). If you use a model that produces not just a single prediction but an output distribution (the posterior), then you can sample from that distribution randomly, hence sample_posterior. By running it multiple times, with different random seeds, these random choices are different, and you get multiple imputed datasets.
The documentation on that isn't great, but there's a (somewhat aged) PR for an extended example, and a toy example on SO.
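A minimal sketch of multiple imputation in that sense (toy data, names chosen for illustration): run the imputer several times with sample_posterior=True and different seeds to obtain several plausible completed datasets.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [np.nan, 6.0],
              [8.0, 9.0]])

# Each run samples the imputed cells from the estimator's predictive posterior,
# so different seeds give (slightly) different completed datasets.
imputed_datasets = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]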

Classification results depend on random_state?

I want to implement an AdaBoost model using scikit-learn (sklearn). My question is similar to another question but not exactly the same. As far as I understand, the random_state variable described in the documentation is for randomly splitting the training and testing sets, according to the previous link. So if I understand correctly, my classification results should not depend on the seed - is that correct? Should I be worried if my classification results turn out to depend on the random_state variable?
Your classification scores will depend on random_state. As @Ujjwal rightly said, it is used for splitting the data into training and test sets. Not just that: a lot of algorithms in scikit-learn use the random_state to select subsets of features, subsets of samples, determine the initial weights, etc.
For example:
Tree-based estimators use the random_state for random selection of features and samples (e.g. DecisionTreeClassifier, RandomForestClassifier).
In clustering estimators like Kmeans, random_state is used to initialize centers of clusters.
SVMs (e.g. SVC) use it for shuffling the data when probability estimates are enabled
Some feature selection algorithms also use it for initial selection
And many more...
It's mentioned in the documentation that:
If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
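To illustrate that convention, here is a small hedged sketch (the function name is made up for illustration) using sklearn's check_random_state helper to turn a random_state argument into a RandomState object instead of calling np.random directly:
from sklearn.utils import check_random_state

def make_noise(n, random_state=None):
    # check_random_state accepts an int seed, a RandomState instance, or None
    # and always returns a RandomState object to draw from.
    rng = check_random_state(random_state)
    return rng.normal(size=n)

make_noise(3, random_state=0)  # reproducible: same values on every call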
Do read the following questions and answers for better understanding:
Choosing random_state for sklearn algorithms
confused about random_state in decision tree of scikit learn
It does matter. When your training set differs, the trained model changes too: with a different subset of the data you can end up with a classifier that is slightly different from one trained on another subset.
Hence, you should use a constant seed like 0 or another integer, so that your results are reproducible.
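A small sketch (illustrative data and seeds, not from the answers above) showing that scores can shift with random_state but are fully reproducible once it is fixed:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

for seed in (0, 1, 2):
    # Both the split and the boosting use the seed, so a rerun with the same
    # seed reproduces exactly the same score; different seeds may not.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    score = AdaBoostClassifier(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te)
    print(seed, score)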

confused about random_state in decision tree of scikit learn

I am confused about the random_state parameter; I am not sure why decision tree training needs randomness. My thoughts:
Is it related to random forest?
Is it related to splitting the training and testing data set? If so, why not use the train/test split method directly (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html)?
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10)
# array([ 1.  , 0.93..., 0.86..., 0.93..., 0.93...,
#         0.93..., 0.93..., 1.  , 0.93..., 1.  ])
This is explained in the documentation
The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
So, basically, a sub-optimal greedy algorithm is repeated a number of times using random selections of features and samples (a similar technique used in random forests). The random_state parameter allows controlling these random choices.
The interface documentation specifically states:
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
So, the random algorithm will be used in any case. Passing any value (whether a specific int, e.g., 0, or a RandomState instance), will not change that. The only rationale for passing in an int value (0 or otherwise) is to make the outcome consistent across calls: if you call this with random_state=0 (or any other value), then each and every time, you'll get the same result.
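For reference, a brief sketch of the three ways to pass random_state described in that quote (names chosen for illustration):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

clf_seeded = DecisionTreeClassifier(random_state=0)     # int seed: same result on every call
clf_rs = DecisionTreeClassifier(random_state=np.random.RandomState(0))  # explicit generator
clf_global = DecisionTreeClassifier(random_state=None)  # None: falls back to the global np.random state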
The above-cited part of the documentation is misleading; the underlying problem is not the greediness of the algorithm. The CART algorithm is deterministic (see e.g. here) and finds a global minimum of the weighted Gini indices.
Repeated runs of the decision tree can give different results because it is sometimes possible to split the data using different features and still achieve the same Gini index. This is described here:
https://github.com/scikit-learn/scikit-learn/issues/8443.
Setting the random state simply assures that the CART implementation works through the same randomized list of features when looking for the minimum.
Decision trees use a heuristic process. A decision tree does not guarantee the same solution globally: there can be variations in the tree structure each time you build a model. Passing a specific seed to random_state ensures the same result is generated each time you build the model.
For decision trees in scikit-learn, the random_state parameter determines which feature is selected for a split if (and only if) two candidate splits are equally good, i.e. two features yield the exact same improvement in the selected splitting criterion (e.g. gini). If this is not the case, the random_state parameter has no effect.
The issue linked in teatrader's answer discusses this in more detail and as a result of that discussion the following section was added to the docs (emphasis added):
random_state int, RandomState instance or None, default=None
Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm will select max_features at random at each split before finding the best split among them. But the best found split may vary across different runs, even if max_features=n_features. That is the case, if the improvement of the criterion is identical for several splits and one split has to be selected at random. To obtain a deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.
To illustrate, let's consider the following example with the iris sample data set and a shallow decision tree containing just a single split:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
iris = load_iris(as_frame=True)
clf = DecisionTreeClassifier(max_depth=1)
clf = clf.fit(iris.data, iris.target)
plot_tree(clf, feature_names=iris['feature_names'], class_names=iris['target_names']);
The output of this code alternates between two different trees (one splitting on petal length, one on petal width), depending on which random_state is used.
The reason for this is that splitting on either petal length <= 2.45 or petal width <= 0.8 will both perfectly separate out the setosa class from the other two classes (we can see that the leftmost setosa node contains all 50 of the setosa observations).
If we change just one observation of the data so that one of the previous two splitting criteria no longer produces a perfect separation, the random_state will have no effect and we will always end up with the same result, for example:
# Change the petal width for first observation of the "Setosa" class
# so that it overlaps with the values of the other two classes
iris['data'].loc[0, 'petal width (cm)'] = 5
clf = DecisionTreeClassifier(max_depth=1)
clf = clf.fit(iris.data, iris.target)
plot_tree(clf, feature_names=iris['feature_names'], class_names=iris['target_names']);
The first split will now always be petal length <= 2.45, since the split petal width <= 0.8 can only separate out 49 of the 50 setosa observations (in other words, a smaller decrease in the gini score).
For a random forest (which consists of many decision trees), each individual tree is built from a random selection of features and samples (see https://scikit-learn.org/stable/modules/ensemble.html#random-forest-parameters for details), so random_state plays a much bigger role there than when training a single decision tree. This is true with the default parameters, but note that some non-default parameters reintroduce randomness, most notably splitter="random"; see the sketch below.
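A quick hedged sketch of that last point (toy example, not from the original answer): with splitter="random", even a single tree's structure depends on the seed.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# With the non-default splitter="random", thresholds are drawn at random,
# so different seeds typically produce differently shaped trees.
trees = [DecisionTreeClassifier(splitter="random", random_state=s).fit(X, y) for s in (0, 1)]
print([t.tree_.node_count for t in trees])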
A couple of related issues:
https://ai.stackexchange.com/questions/11576/are-decision-tree-learning-algorithms-deterministic
Is the CART algorithm used by scikit-learn deterministic?
Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run, which is considered good practice. You can use any number; model quality won't depend meaningfully on exactly which value you choose.

Setting number of Gibbs steps for BernoulliRBM

I want to use the BernoulliRBM implementation in scikit-learn for Restricted Boltzmann Machines, but I can't find a way or a parameter anywhere to set the number of Gibbs steps k for the PCD sampling. Should I assume that k=1 and can't be modified?
Yes, the training algorithm uses a hardwired k=1. This can be seen in the _fit method, which takes a single sampling step and then updates the parameters.
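If you need more Gibbs steps for sampling from an already trained model (not for changing the training itself), one hedged workaround is to call the public gibbs() method repeatedly; this only chains sampling steps after fitting and does not alter the k used during PCD. A minimal sketch with toy data:
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.RandomState(0)
X = (rng.rand(100, 16) > 0.5).astype(float)  # toy binary data

rbm = BernoulliRBM(n_components=8, random_state=0).fit(X)

# Chain k Gibbs steps manually; each call to gibbs() performs one full step.
v = X[:5]
for _ in range(10):  # k = 10 sampling steps
    v = rbm.gibbs(v)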
