I was going through the sklearn class DecisionTreeClassifier.
Looking at the parameters for the class, there are two parameters, min_samples_split and min_samples_leaf. The basic idea behind them looks similar: you specify a minimum number of samples required to decide whether a node should be a leaf or be split further.
Why do we need two parameters when one implies the other? Is there any reason or scenario that distinguishes them?
From the documentation:
The main difference between the two is that min_samples_leaf guarantees a minimum number of samples in a leaf, while min_samples_split can create arbitrary small leaves, though min_samples_split is more common in the literature.
To get a grasp of this piece of documentation I think you should make the distinction between a leaf (also called external node) and an internal node. An internal node will have further splits (also called children), while a leaf is by definition a node without any children (without any further splits).
min_samples_split specifies the minimum number of samples required to split an internal node, while min_samples_leaf specifies the minimum number of samples required to be at a leaf node.
For instance, if min_samples_split = 5 and there are 7 samples at an internal node, then the split is allowed. But say the split results in two leaves, one with 1 sample and another with 6 samples. If min_samples_leaf = 2, then the split won't be allowed (even though the internal node has 7 samples) because one of the resulting leaves would have fewer than the minimum number of samples required to be at a leaf node.
As the documentation referenced above mentions, min_samples_leaf guarantees a minimum number of samples in every leaf, no matter the value of min_samples_split.
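A minimal sketch to verify that guarantee (the iris toy dataset and the specific parameter values are purely for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Require at least 5 samples to split a node and at least 2 samples in every leaf.
clf = DecisionTreeClassifier(min_samples_split=5, min_samples_leaf=2, random_state=0).fit(X, y)

tree = clf.tree_
is_leaf = tree.children_left == -1          # leaves have no children
print(tree.n_node_samples[is_leaf].min())   # never smaller than min_samples_leaf (2)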
Both parameters will produce similar results; the difference is the point of view.
The min_samples_split parameter evaluates the number of samples in the node, and if that number is less than the minimum, the split is avoided and the node becomes a leaf.
The min_samples_leaf parameter checks before the child nodes are generated: if the possible split would result in a child with fewer samples than the minimum, the split is avoided (since the minimum number of samples for the child to be a leaf has not been reached) and the node becomes a leaf.
In all cases, when a leaf contains samples from more than one class, the predicted class is the majority class among the training samples that reached that leaf.
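To see that majority-class rule concretely, here is a small sketch (iris and the parameter value are just placeholders) that prints the per-class totals stored at each leaf and the class that would be predicted:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, y)

tree = clf.tree_
leaf_ids = np.where(tree.children_left == -1)[0]
for node_id in leaf_ids:
    per_class = tree.value[node_id][0]     # per-class totals (or fractions) at this leaf
    print(node_id, per_class, "-> predicted class", np.argmax(per_class))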
In decision trees, there are many rules one can set up to configure how the tree should end up. Roughly speaking, some rules are more 'design' oriented, like max_depth. max_depth is a bit like building a house: the architect asks you how many floors you want.
Other rules are 'defensive' rules, often called stopping rules. min_samples_leaf and min_samples_split belong to this type. The explanations already provided are very well said. My two cents: the rules interact while the tree is being built. For example, with min_samples_leaf=100, you may very well end up with a tree where all the terminal nodes are much larger than 100, because other rules kicked in and stopped the tree from expanding first.
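A small sketch of that interaction (synthetic data and arbitrary parameter values, just to illustrate): with max_depth=2 the depth rule stops the tree long before min_samples_leaf=100 ever becomes the binding constraint, so every terminal node ends up far larger than 100 samples.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10_000, random_state=0)
clf = DecisionTreeClassifier(min_samples_leaf=100, max_depth=2, random_state=0).fit(X, y)

tree = clf.tree_
leaf_sizes = tree.n_node_samples[tree.children_left == -1]
print(leaf_sizes)   # typically each leaf holds thousands of samples, not ~100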
Let's say that min_samples_split = 9 and min_samples_leaf = 3.
At the internal nodes, the right split is not allowed (3 < 9), while the left split would be allowed (10 > 9).
But because min_samples_leaf = 3 and one of the resulting leaves would have only 2 samples (the right one), the node with 10 samples will not be split into 2 and 8.
Look at the leaf with 3 samples (from the first split).
If we set min_samples_leaf = 4 instead of 3, even the first split (13 into 10 and 3) would not happen.
min_samples_split is the minimum number of samples required to split an internal node. If an integer is given, it is used as the minimum number directly; if a float, it is interpreted as a fraction of the total number of samples (rounded up). The default value is 2.
min_samples_leaf is the minimum number of samples required to be at a leaf node. If an integer is given, it is used as the minimum number directly; if a float, it is interpreted as a fraction of the total number of samples (rounded up). The default value is 1.
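A short sketch of the int vs. float forms (iris has 150 samples, so the fractions below translate to 15 and 8 samples respectively):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # 150 samples

# min_samples_split=0.1  -> ceil(0.1  * 150) = 15 samples needed to split a node
# min_samples_leaf=0.05  -> ceil(0.05 * 150) = 8 samples required in every leaf
clf = DecisionTreeClassifier(min_samples_split=0.1, min_samples_leaf=0.05).fit(X, y)
print(clf.tree_.n_node_samples[clf.tree_.children_left == -1].min())  # >= 8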
Related
I'm trying to use nodevector's Node2Vec class to get an embedding for my graph. I can't show the entire code, but basically this is what I'm doing:
import networkx as nx
import pandas as pd
import nodevectors
n2v = nodevectors.Node2Vec(n_components=128,
walklen=80,
epochs=3,
return_weight=1,
neighbor_weight=1,
threads=4)
G = nx.from_pandas_edgelist(df, 'customer', 'item', edge_attr='weight', create_using=nx.Graph)
n2v.fit(G)
model = n2v.model
shape = model.ww.vectors.shape
I know G has all the nodes from my scope. Then, I fit the model, but model.ww.vectors has a number of rows smaller than my number of nodes.
I can't figure out why the number of nodes represented in my embedding (model.ww.vectors) is lower than the actual number of nodes in G.
Does anyone know why it happens?
TL;DR: Your non-default epochs=3 can result in some nodes appearing only 3 times – but the inner Word2Vec model by default ignores tokens appearing fewer than 5 times. Upping to epochs=5 may be a quick fix - but read on for the reasons & tradeoffs with various defaults.
--
If you're using the nodevectors package described here, it seems to be built on Gensim's Word2Vec – which uses a default min_count=5.
That means any tokens – in this case, nodes – which appear fewer than 5 times are ignored. Especially in the natural-language contexts where Word2Vec was pioneered, discarding such rare words entirely usually has multiple benefits:
- from only a few idiosyncratic examples, such rare words themselves get peculiar vectors less likely to generalize to downstream uses (other texts)
- compared to other frequent words, each gets very little training effort overall, & thus provides only a little pushback on shared model weights (based on their peculiar examples) - so the vectors are weaker & retain more arbitrary influence from random initialization & relative positioning in the corpus. (More-frequent words provide more varied, numerous examples to extract their unique meaning.)
- because of the Zipfian distribution of word frequencies in natural language, there are a lot of such low-frequency words – often even typos – and altogether they take up a lot of the model's memory & training time. But they don't individually get very good vectors, or have generalizable beneficial influences on the shared model. So they wind up serving a lot like noise that weakens the vectors of more-frequent words as well.
So typically in Word2Vec, discarding rare words only gives up low-value vectors while simultaneously speeding training, shrinking memory requirements, & improving the quality of the remaining vectors: a big win.
Although the distribution of node-names in graph random-walks may be very different from natural-language word-frequencies, some of the same concerns still apply for nodes that appear rarely. On the other hand, if a node truly only appears at the end of a long chain of nodes, every walk to or from it will include the exact same neighbors - and maybe extra appearances in more walks would add no new variety-of-information (at least within the inner Word2Vec window of analysis).
You may be able to confirm if the default min_count is your issue by using the Node2Vec keep_walks parameter to store the generated walks, then checking: are exactly the nodes that are 'missing' appearing fewer than min_count times in the walks?
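A hedged sketch of that check (the karate-club graph is just a stand-in for your data, and the attribute holding the stored walks, assumed here to be n2v.walks, as well as whether it contains original node names or internal ids, may differ between nodevectors versions):

from collections import Counter
import numpy as np
import networkx as nx
import nodevectors

G = nx.karate_club_graph()                      # small stand-in graph for illustration
n2v = nodevectors.Node2Vec(n_components=32, walklen=80, epochs=3, keep_walks=True)
n2v.fit(G)

walks = np.asarray(n2v.walks)                   # attribute name is an assumption
counts = Counter(node for walk in walks for node in walk)
rare = [node for node, c in counts.items() if c < 5]   # below Word2Vec's default min_count
print("nodes at risk of being dropped:", rare)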
If so, a few options may be:
- override min_count using the Node2Vec w2vparams option to something like min_count=1. As noted above, this is always a bad idea in traditional natural-language Word2Vec - but maybe it's not so bad in a graph application, where for rare/outer-edge nodes one walk is enough, and then at least you have whatever strange/noisy vector results from that minimal training.
- try to influence the walks to ensure all nodes appear enough times. I suppose some values of the Node2Vec walklen, return_weight, & neighbor_weight could improve coverage - but I don't think they could guarantee all nodes appear in at least N (say, 5, to match the default min_count) different walks. But it looks like the Node2Vec epochs parameter controls how many times every node is used as a starting point – so epochs=5 would guarantee every node appears at least 5 times, as the start of 5 separate walks. (Notably: the Node2Vec default is epochs=20 - which would never trigger a bad interaction with the internal Word2Vec min_count=5. But setting your non-default epochs=3 risks leaving some nodes with only 3 appearances.)
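A rough sketch of both options above (assuming the w2vparams dict is passed straight through to the inner gensim Word2Vec; you may want to merge it with the library's default w2vparams rather than replace them wholesale):

import nodevectors

# Option 1: keep epochs=3 but tell the inner Word2Vec not to drop rare nodes.
n2v_keep_all = nodevectors.Node2Vec(n_components=128, walklen=80, epochs=3,
                                    w2vparams={"min_count": 1})

# Option 2: raise epochs so every node starts at least 5 walks, matching the
# inner model's default min_count=5.
n2v_more_walks = nodevectors.Node2Vec(n_components=128, walklen=80, epochs=5)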
This class is used to build a decision tree. A lot of values within the tree's structure, including tree_.value and tree_.impurity, are saved as arrays without much indication of which node each value refers to. My deduction is that they are stored in preorder-traversal order, but I have no conclusive proof that this is how each array is constructed. Does anyone know where to find this information?
From _tree.pyx:
cdef class Tree:
"""Array-based representation of a binary decision tree.
The binary tree is represented as a number of parallel arrays. The i-th
element of each array holds information about the node `i`. Node 0 is the
tree's root. You can find a detailed description of all arrays in
`_tree.pxd`. NOTE: Some of the arrays only apply to either leaves or split
nodes, resp. In this case the values of nodes of the other type are
arbitrary!
So node[0] refers to the root. To add a node, it uses a splitter class on the leaves, which can either split the leaf with the greatest impurity improvement (best-first) or split depth-first. I haven't looked into the order in which nodes are added to the parallel arrays, but my guess is that they are added in the order in which they are created, which would be "similar" to a pre-order traversal if the tree were ordered by impurity improvement.
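A small sketch (adapted in spirit from scikit-learn's "understanding the decision tree structure" example) that walks the parallel arrays yourself; with the default depth-first builder (max_leaf_nodes=None), the node ids printed below come out in pre-order:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree = clf.tree_

stack = [(0, 0)]                      # (node id, depth), starting at the root
while stack:
    node_id, depth = stack.pop()
    is_leaf = tree.children_left[node_id] == -1
    print("  " * depth, node_id, "leaf" if is_leaf else
          f"X[:, {tree.feature[node_id]}] <= {tree.threshold[node_id]:.3f}")
    if not is_leaf:
        # push the right child first so the left child is printed first (pre-order)
        stack.append((tree.children_right[node_id], depth + 1))
        stack.append((tree.children_left[node_id], depth + 1))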
I developed a decision tree (ensemble) in Matlab by using the "fitctree"-function (link: https://de.mathworks.com/help/stats/classificationtree-class.html).
Now I want to rebuild the same ensemble in Python. Therefore I am using the sklearn library with the "DecisionTreeClassifier" (link: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
In Matlab I defined the maximum amount of splits in each tree by setting:
'MaxNumSplits' — Maximal number of decision splits in the "fitctree"-function.
So with this the amount of branch nodes can be defined.
Now as I understand the attributes of the "DecisionTreeClassifier" object, there isn't any option like this. Am I right? All I found to control the amount of nodes in each tree is the "max_leaf_nodes" which obviously controls the number of leaf nodes.
And secondly: What does "max_depth" exactly control? If it's not "None" what does the integer "max_depth = int" stand for?
I appreciate your help and suggestions. Thank you!
As far as I know, there is no option to limit the total number of splits (nodes) in scikit-learn. However, you can set max_leaf_nodes to MaxNumSplits + 1 and the result should be equivalent.
Assume our tree has n_splits split nodes and n_leafs leaf nodes. If we split a leaf node, we turn it into a split node and add two new leaf nodes. So n_splits and n_leafs both increase by 1. We usually start with only the root node (n_splits=0, n_leafs=1), and every split increases both numbers. In consequence, the number of leaf nodes is always n_leafs == n_splits + 1.
As for max_depth: the depth is how many "layers" the tree has. In other words, the depth is the maximum number of splits along the path from the root to the furthest leaf node. The max_depth parameter restricts this depth. It prevents further splitting of a node if it is too far down the tree. (You can think of max_depth as a limit on the number of consecutive splits before a decision is made.)
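A quick sanity-check sketch of the n_leafs == n_splits + 1 relationship (iris and a split limit of 4 are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
max_num_splits = 4                                   # the Matlab-style split limit
clf = DecisionTreeClassifier(max_leaf_nodes=max_num_splits + 1).fit(X, y)

tree = clf.tree_
n_leaves = (tree.children_left == -1).sum()
n_splits = tree.node_count - n_leaves
print(n_splits, n_leaves)                            # at most 4 splits, 5 leaves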
I'm doing a Random Forest implementation (for classification), and I have some questions regarding the tree growing algorithm mentioned in literature.
When training a decision tree, there are 2 criteria to stop growing a tree:
a. Stop when there are no more features left to split a node on.
b. Stop when the node has all samples in it belonging to the same class.
Based on that,
1. Consider growing one tree in the forest. When splitting a node of the tree, I randomly select m of the M total features, and then from these m features I find that one feature with maximum information gain. After I've found this one feature, say f, should I remove this feature from the feature list, before proceeding down to the children of the node? If I don't remove this feature, then this feature might get selected again down the tree.
If I implement the algorithm without removing the feature selected at a node, then the only way to stop growing the tree is when the leaves of the tree become "pure". When I did this, I got the "maximum recursion depth" reached error in Python, because the tree couldn't reach that "pure" condition earlier.
The RF literature, even the papers written by Breiman, says that the tree should be grown to the maximum. What does this mean?
2. At a node split, after selecting the best feature to split on (by information gain), what should the threshold be for the split? One approach is to have no threshold and create one child node for every unique value of the feature; but I have continuous-valued features too, so that would mean creating one child node per sample!
Q1
You shouldn't remove the features from M. Otherwise it will not be able to detect some types of relationships (e.g., linear relationships).
You can also stop earlier: with your condition the tree may grow until the leaves contain only 1 sample, which has no statistical significance. So it's better to stop when, say, the number of samples at a leaf is <= 3.
Q2
For continuous features, you could bin the values into groups and use those bins to find a splitting point.
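A rough sketch of one common alternative to binning: sort the unique values of the continuous feature and evaluate only the midpoints between consecutive values as binary-split thresholds. The entropy-based gain below is my own minimal stand-in, not code from any particular library, and it assumes numpy arrays with at least two distinct feature values.

import numpy as np

def entropy(labels):
    """Shannon entropy of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, left_mask):
    """Gain from splitting `labels` into left/right parts by a boolean mask."""
    n = len(labels)
    left, right = labels[left_mask], labels[~left_mask]
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

def best_threshold(feature_values, labels):
    """Evaluate only midpoints between consecutive sorted unique values."""
    v = np.unique(feature_values)
    candidates = (v[:-1] + v[1:]) / 2.0
    return max(candidates, key=lambda t: information_gain(labels, feature_values <= t))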
Hi all!
Could anybody give me an advice on Random Forest implementation in Python? Ideally I need something that outputs as much information about the classifiers as possible, especially:
(1) which vectors from the train set are used to train each decision tree;
(2) which features are selected at random at each node of each tree, which samples from the training set end up in that node, which feature(s) are selected for the split, and which threshold is used for the split.
I have found quite a few implementations; the most well known one is probably from scikit, but it is not clear how to do (1) and (2) there (see this question). Other implementations seem to have the same problems, except the one from OpenCV, but it is in C++ (the Python interface does not cover all methods for Random Forests).
Does anybody know something that satisfies (1) and (2)? Alternatively, any idea how to improve scikit implementation to get the features (1) and (2)?
Solved: checked the source code of sklearn.tree._tree.Tree. It has good comments (which fully describe the tree):
children_left : int*
children_left[i] holds the node id of the left child of node i.
For leaves, children_left[i] == TREE_LEAF. Otherwise,
children_left[i] > i. This child handles the case where
X[:, feature[i]] <= threshold[i].
children_right : int*
children_right[i] holds the node id of the right child of node i.
For leaves, children_right[i] == TREE_LEAF. Otherwise,
children_right[i] > i. This child handles the case where
X[:, feature[i]] > threshold[i].
feature : int*
feature[i] holds the feature to split on, for the internal node i.
threshold : double*
threshold[i] holds the threshold for the internal node i.
You can get nearly all the information in scikit-learn. What exactly was the problem? You can even visualize the trees using dot.
I don't think you can find out which split candidates were sampled at random, but you can find out which were selected in the end.
Edit: Look at the tree_ property of the decision tree. I agree, it is not very well documented. There really should be an example to visualize the leaf distributions etc. You can have a look at the visualization function to get an understanding of how to get to the properties.
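A short sketch of the dot-based visualization mentioned above (iris and a 3-tree forest are just placeholders for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)

# Export the first tree of the forest to Graphviz format; render with e.g.
# `dot -Tpng tree0.dot -o tree0.png`.
export_graphviz(forest.estimators_[0], out_file="tree0.dot", filled=True)

# The fitted tree_ attribute also exposes the per-node details asked about:
tree = forest.estimators_[0].tree_
print(tree.feature[:5], tree.threshold[:5])   # split feature index & threshold per node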