Traversal method of DecisionTreeClassifier in sklearn - python

This class is used to build a decision tree. Many values in the tree's structure, including tree_.value and tree_.impurity, are stored as arrays without much indication of which node each entry refers to. My deduction is that they are stored in pre-order traversal order, but I have no conclusive proof that this is how each array is constructed. Does anyone know where to find this information?

From tree.pyx:
cdef class Tree:
"""Array-based representation of a binary decision tree.
The binary tree is represented as a number of parallel arrays. The i-th
element of each array holds information about the node `i`. Node 0 is the
tree's root. You can find a detailed description of all arrays in
`_tree.pxd`. NOTE: Some of the arrays only apply to either leaves or split
nodes, resp. In this case the values of nodes of the other type are
arbitrary!
"""
So node[0] refers to the root. To add nodes, a builder applies a splitter to the leaves; the builder either expands nodes depth-first (the default) or best-first, i.e. it expands the leaf with the greatest impurity improvement (used when max_leaf_nodes is set). I haven't looked into the exact order in which nodes are appended to the parallel arrays, but my understanding is that they are added in the order in which they are created, which for the depth-first builder corresponds to a pre-order traversal.
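One way to check this empirically (my own sketch, not from the scikit-learn docs; it assumes a DecisionTreeClassifier fitted with the default depth-first builder, i.e. max_leaf_nodes=None) is to walk children_left/children_right in pre-order and compare the visited ids with 0..node_count-1:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_

def preorder(node_id=0):
    # yield the node, then its left subtree, then its right subtree
    yield node_id
    if tree.children_left[node_id] != -1:      # -1 marks a leaf (TREE_LEAF)
        yield from preorder(tree.children_left[node_id])
        yield from preorder(tree.children_right[node_id])

print(list(preorder()) == list(range(tree.node_count)))   # True here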

Related

Satisfying Properties of nodes after rotation in AVL trees

I was learning about self-balancing BSTs and AVL trees, and I am stuck on a particular case where x = z.
I have this example for better understanding:
As you can see, according to the BST property all elements >= node x should be in the right subtree of node x, but in this case 3 ends up in the left subtree of node x, which violates that property.
I may be wrong about something, since I am learning data structures on my own from online resources. It would be really helpful if you could answer this question and correct me if I am wrong.
Usually binary search trees do not have duplicate elements, so this problem is avoided.
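A minimal sketch (my own, not from the answer) of that convention: an unbalanced BST insert that simply ignores duplicate keys, so the x == z case never has to be resolved by a rotation:

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    # key == root.key: duplicate, silently ignored
    return root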

Determine the amount of splits in a decision tree of sklearn

I developed a decision tree (ensemble) in Matlab by using the "fitctree"-function (link: https://de.mathworks.com/help/stats/classificationtree-class.html).
Now I want to rebuild the same ensemble in Python. For this I am using the sklearn library with the "DecisionTreeClassifier" (link: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
In Matlab I defined the maximum amount of splits in each tree by setting:
'MaxNumSplits' — Maximal number of decision splits in the "fitctree"-function.
This way the number of branch nodes can be limited.
Now, as I understand the attributes of the "DecisionTreeClassifier" object, there isn't any option like this. Am I right? All I found to control the number of nodes in each tree is "max_leaf_nodes", which obviously controls the number of leaf nodes.
And secondly: what exactly does "max_depth" control? If it is not "None", what does the integer value "max_depth = int" stand for?
I appreciate your help and suggestions. Thank you!
As far as I know there is no option to limit the total number of splits (branch nodes) in scikit-learn. However, you can set max_leaf_nodes to MaxNumSplits + 1 and the result should be equivalent.
Assume our tree has n_splits split nodes and n_leaves leaf nodes. If we split a leaf node, we turn it into a split node and add two new leaf nodes, so n_splits and n_leaves both increase by 1. We usually start with only the root node (n_splits = 0, n_leaves = 1), and every split increases both numbers. In consequence, the number of leaf nodes always satisfies n_leaves == n_splits + 1.
As for max_depth: the depth is how many "layers" the tree has; more precisely, it is the length of the longest path from the root to a leaf. The max_depth parameter restricts this depth by preventing a node from being split if it already sits too far down the tree. (You can think of max_depth as a limit on the number of consecutive splits before a decision is made.)
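A quick illustration of the equivalence (my own example; the data set and the value 5 are arbitrary):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
max_num_splits = 5                                   # analogue of Matlab's 'MaxNumSplits'
clf = DecisionTreeClassifier(max_leaf_nodes=max_num_splits + 1).fit(X, y)

is_split = clf.tree_.children_left != -1             # -1 marks a leaf
print("split nodes:", is_split.sum())                # at most 5
print("leaf nodes:", (~is_split).sum())              # always split nodes + 1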

Random forest tree growing algorithm

I'm doing a Random Forest implementation (for classification), and I have some questions regarding the tree growing algorithm mentioned in literature.
When training a decision tree, there are 2 criteria to stop growing a tree:
a. Stop when there are no more features left to split a node on.
b. Stop when the node has all samples in it belonging to the same class.
Based on that,
1. Consider growing one tree in the forest. When splitting a node of the tree, I randomly select m of the M total features, and then from these m features I find the one with the maximum information gain. After I've found this feature, say f, should I remove it from the feature list before proceeding down to the children of the node? If I don't remove it, then this feature might get selected again further down the tree.
If I implement the algorithm without removing the feature selected at a node, then the only way to stop growing the tree is when the leaves become "pure". When I did this, I got a "maximum recursion depth exceeded" error in Python, because the tree couldn't reach that "pure" condition soon enough.
The RF literature, even the papers written by Breiman, says that the tree should be grown to the maximum. What does this mean?
2. At a node split, after selecting the best feature to split on (by information gain), what should the threshold be? One approach is to have no threshold and create one child node for every unique value of the feature; but I have continuous-valued features too, so that would mean creating one child node per sample!
Q1
You shouldn't remove the features from M. If you did, the tree could not split on the same feature again at a different threshold, so it would miss some types of relationships (e.g. a linear relationship in a continuous feature, which needs several splits to approximate).
Maybe you can stop earlier: with your stopping condition the tree may grow until leaves contain only one sample, which has no statistical significance. So it is better to stop when, say, the number of samples at a leaf is <= 3.
Q2
For continuous features, maybe you can bin the values into groups and use those to figure out a splitting point.
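A minimal sketch of the node-splitting step described in this answer (all names are my own, not from any library): sample a subset of features, try midpoints between consecutive unique values as thresholds, and stop when a node is pure or has <= 3 samples. Labels are assumed to be integers 0..k-1; call it as grow(X, y, np.random.default_rng(0)).

import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y, n_try, rng):
    # returns (feature, threshold, information gain) of the best binary split, or None
    best = None
    parent = entropy(y)
    for f in rng.choice(X.shape[1], size=n_try, replace=False):
        values = np.unique(X[:, f])
        for thr in (values[:-1] + values[1:]) / 2:     # midpoints between unique values
            left, right = y[X[:, f] <= thr], y[X[:, f] > thr]
            gain = parent - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if best is None or gain > best[2]:
                best = (f, thr, gain)
    return best

def grow(X, y, rng, min_samples=3, n_try=2):
    # stop on a pure node or a node too small to split meaningfully
    if len(np.unique(y)) == 1 or len(y) <= min_samples:
        return {"leaf": True, "prediction": int(np.bincount(y).argmax())}
    split = best_split(X, y, n_try, rng)
    if split is None or split[2] <= 0:
        return {"leaf": True, "prediction": int(np.bincount(y).argmax())}
    f, thr, _ = split
    mask = X[:, f] <= thr
    return {"leaf": False, "feature": int(f), "threshold": float(thr),
            "left": grow(X[mask], y[mask], rng, min_samples, n_try),
            "right": grow(X[~mask], y[~mask], rng, min_samples, n_try)}

Here min_samples implements the early-stopping suggestion above, and n_try plays the role of m.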

Obtaining a tree decomposition from an elimination ordering and chordal graph

I need a nice tree decomposition of a graph given an elimination ordering and a chordalization of the graph.
My idea is to obtain all cliques in the graph (which I can do) and build a binary tree starting from a root, making children (i.e., cliques) depending on how many vertices the cliques have in common. I want to do this until all cliques are used, so that I end up with a tree. The problem is that the cliques can have more than 2 vertices, so I cannot simply recurse on each vertex, since the resulting tree might not be binary.
http://en.wikipedia.org/wiki/Tree_decomposition
http://en.wikipedia.org/wiki/Chordal_graph
I am doing an implementation in Python and currently I have the chordal graph, a list of all cliques and an elimination ordering. Ideas and/or code are more than welcome!
To construct a non-nice (in general) tree decomposition of a chordal graph: find a perfect elimination ordering, enumerate the maximal cliques (the candidates are a vertex and the neighbors that appear after it in the ordering), use each clique as a decomposition node and connect it to the next clique in the ordering that it intersects. I didn't describe that quite right; see my subsequent answer.
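A sketch of the clique-enumeration step just described (my own code; adj maps each vertex to its neighbour set in the chordal graph, order is a perfect elimination ordering):

def candidate_cliques(adj, order):
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        # a vertex together with its neighbours that appear later in the ordering
        yield {v} | {u for u in adj[v] if pos[u] > pos[v]}

def maximal_cliques(adj, order):
    cands = list(candidate_cliques(adj, order))
    # keep only candidates that are not a proper subset of another candidate
    return [c for c in cands if not any(c < other for other in cands)]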
A nice tree decomposition is defined as follows (definition from Daniel Marx). Nice tree decompositions are rooted. Each node is of one of four types.
Leaf (no children): a set {v}
Introduce (exactly one child): a set S union {v} with child S (v not in S)
Forget (exactly one child): a set S with child S union {v} (v not in S)
Join (exactly two children): a set S with children S and S
Root the non-nice tree decomposition arbitrarily and start a recursive conversion procedure at the root. If the current node has no children, then construct the obvious chain consisting of a leaf node with introduce ancestors. Otherwise, observe that if some vertex belongs to at least two children, then it also belongs to the current node. Recursively convert the children and chain forget ancestors until their sets are subsets of the current node's set. The easiest way to proceed in theory is to introduce the missing elements to each child and then join en masse. However, since the running time of the next step often depends exponentially on the set size, it may be wise to try some heuristics to join children before their sets are complete.
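A partial sketch of two of the steps above, the childless base case and the forget-chaining, using my own minimal node class (bags assumed non-empty):

class Nice:
    def __init__(self, kind, bag, children=()):
        self.kind, self.bag, self.children = kind, frozenset(bag), list(children)

def leaf_chain(bag):
    # a childless bag becomes a Leaf {v} topped by a chain of Introduce nodes
    vs = list(bag)
    node = Nice("leaf", {vs[0]})
    for v in vs[1:]:
        node = Nice("introduce", set(node.bag) | {v}, [node])
    return node

def forget_down_to(node, parent_bag):
    # chain Forget nodes above `node` until its bag is a subset of `parent_bag`
    for v in set(node.bag) - set(parent_bag):
        node = Nice("forget", set(node.bag) - {v}, [node])
    return node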

Random Forest implementation in Python

Hi all!
Could anybody give me advice on a Random Forest implementation in Python? Ideally I need something that outputs as much information about the classifiers as possible, in particular:
1. which vectors from the training set are used to train each decision tree;
2. which features are selected at random at each node of each tree, which samples from the training set end up in that node, which feature(s) are selected for the split and which threshold is used for the split.
I have found quite a few implementations; the most well-known one is probably from scikit, but it is not clear how to do (1) and (2) there (see this question). Other implementations seem to have the same problem, except the one from OpenCV, but that one is in C++ (its Python interface does not cover all methods for Random Forests).
Does anybody know something that satisfies (1) and (2)? Alternatively, any idea how to improve scikit implementation to get the features (1) and (2)?
Solved: checked the source code of sklearn.tree._tree.Tree. It has good comments (which fully describe the tree):
children_left : int*
children_left[i] holds the node id of the left child of node i.
For leaves, children_left[i] == TREE_LEAF. Otherwise,
children_left[i] > i. This child handles the case where
X[:, feature[i]] <= threshold[i].
children_right : int*
children_right[i] holds the node id of the right child of node i.
For leaves, children_right[i] == TREE_LEAF. Otherwise,
children_right[i] > i. This child handles the case where
X[:, feature[i]] > threshold[i].
feature : int*
feature[i] holds the feature to split on, for the internal node i.
threshold : double*
threshold[i] holds the threshold for the internal node i.
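For example (my own snippet, not from the post), these arrays are enough to print every split and leaf of a fitted tree; leaves are marked by children_left[i] == children_right[i] == TREE_LEAF (-1):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
t = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y).tree_

def show(node=0, indent=""):
    if t.children_left[node] == -1:                                    # leaf
        print(indent + "leaf %d: value %s" % (node, t.value[node][0]))
    else:
        print(indent + "node %d: X[:, %d] <= %.3f" % (node, t.feature[node], t.threshold[node]))
        show(t.children_left[node], indent + "  ")
        show(t.children_right[node], indent + "  ")

show()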
You can get nearly all the information in scikit-learn. What exactly was the problem? You can even visualize the trees using dot.
I don't think you can find out which split candidates were sampled at random, but you can find out which were selected in the end.
Edit: Look at the tree_ property of the decision tree. I agree, it is not very well documented. There really should be an example to visualize the leaf distributions etc. You can have a look at the visualization function to get an understanding of how to get to the properties.
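For instance, the standard export to dot is (file name is my own choice; clf is a fitted DecisionTreeClassifier):

from sklearn.tree import export_graphviz
export_graphviz(clf, out_file="tree.dot")   # render with: dot -Tpng tree.dot -o tree.png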
