I'm building a decision tree, and I would like to force the algorithm to split the results into different classes after each node.
The problem is that in the trees I get, after evaluating a condition (is X less than a certain value), both children end up with the same class (yes and yes, for example). I want to have "yes" and "no" as the results of evaluating the node.
Here is the example of what I'm getting:
This is the code generating the tree and the plot:
from sklearn import tree
import graphviz

clf = tree.DecisionTreeClassifier(max_depth=2)
clf = clf.fit(users_data, users_target)
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=feature_names,
                                class_names=target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
I expect to find "YES" and "NO" classes after the nodes. Right now, I'm getting the same class in the last levels after the respective conditions.
Thanks!
As is, your model indeed looks like it doesn't offer any further discrimination between the first- and second-level nodes; so, if you are certain that this is (kind of) optimal for your case, you can simply ask it to stop there by using max_depth=1 instead of 2:
clf = tree.DecisionTreeClassifier(max_depth=1)
Keep in mind however that in reality this can be far from optimal; have a look at the tree for the iris dataset from the scikit-learn docs:
where you can see that, further down the tree levels, nodes with class=versicolor emerge from what look like "pure" nodes of class=virginica (and vice versa).
So, before deciding to prune the tree beforehand to max_depth=1, you might want to check if leaving it to grow further (i.e. by not specifying the max_depth argument, thus leaving it in its default value of None), might be better for your case.
Everything depends on why exactly you are doing this (i.e. your business case): if it is an exploratory one, you might very well stop with max_depth=1; if it is a predictive one, you should consider which configuration maximizes an appropriate metric (most probably here, the accuracy).
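A minimal sketch of such a comparison, assuming the users_data and users_target arrays from the question and cross-validated accuracy as the metric (the candidate depths are just examples):
from sklearn import tree
from sklearn.model_selection import cross_val_score

# compare a few candidate depths by cross-validated accuracy
for depth in (1, 2, None):
    clf = tree.DecisionTreeClassifier(max_depth=depth)
    scores = cross_val_score(clf, users_data, users_target, cv=5, scoring="accuracy")
    print(depth, scores.mean())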
Try using criterion="entropy". I find this solves the problem.
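For instance, keeping the rest of the original setup unchanged, this one-parameter change would look like:
clf = tree.DecisionTreeClassifier(max_depth=2, criterion="entropy")
clf = clf.fit(users_data, users_target)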
The splits of a decision tree are greedy: they happen as long as the chosen criterion is decreased by the split. This, as you noticed, does not guarantee that a particular split results in different majority classes on each side. Limiting the depth of the tree is part of the reason you see a split that is not "played out" until it can reach nodes of distinct classes.
Pruning the tree should help. I was able to avoid a similar problem using a suitable value of the ccp_alpha parameter of the DecisionTreeClassifier. Here are my before- and after- trees.
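A minimal sketch of that approach, assuming the users_data and users_target arrays from the question; the ccp_alpha value below is purely illustrative, and in practice you would pick it from cost_complexity_pruning_path or by cross-validation:
from sklearn import tree

# fit with cost-complexity pruning; a larger ccp_alpha prunes more aggressively
clf = tree.DecisionTreeClassifier(ccp_alpha=0.01)   # illustrative value
clf = clf.fit(users_data, users_target)

# the candidate alphas for this dataset can be inspected with:
path = tree.DecisionTreeClassifier().cost_complexity_pruning_path(users_data, users_target)
print(path.ccp_alphas)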
My goal is to revise the split criterion of the regression tree algorithm so that the structure of the regression tree is independent of the output values of the learning sample.
It is obvious that the CART procedure used by Random Forest does not satisfy this, but Extra Trees may. According to Pierre Geurts et al. (2006), in the extreme case, extremely randomized trees build totally randomized trees whose structures are independent of the output values of the learning sample. However, the score they use is dependent on the output.
In my limited experience with Extra Trees in Python and R, the score used to find the best split variable also relies on the output. As a result, I can't use Extra Trees directly.
I have tried to revise the criterion in the source code in R and Python, but it is difficult for me because it is written in C, which I'm not familiar with.
Are there any other existing tree algorithms that satisfy independence? Thanks in advance.
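For illustration only, here is a minimal sketch (not existing library code) of what a totally randomized split rule, chosen without ever looking at the outputs, could look like; only the leaf predictions would then use y:
import numpy as np

def random_split(X, rng):
    """Pick a split feature and threshold using only X, never y."""
    feature = rng.integers(X.shape[1])            # random feature index
    lo, hi = X[:, feature].min(), X[:, feature].max()
    threshold = rng.uniform(lo, hi)               # random cut point within its range
    return feature, threshold

rng = np.random.default_rng(0)
# feature, threshold = random_split(X_train, rng)   # tree structure depends only on X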
Having used xgboost in Python, I wanted to plot the trees. However, I don't mean a single tree
(as with plot_tree(clf, num_trees=1)), but the combination of all the decision trees.
For R, I found an option in kaggle:
"One way that we can examine our model is by looking at a representation of the combination of all the decision trees in our model. Since all the trees have the same depth (remember that we set that with a parameter!) we can stack them all on top of one another and pick the things that show up most often in each node."
xgb.plot.multi.trees(feature_names = names(diseaseInfo_matrix), model = model)
(https://www.kaggle.com/code/rtatman/machine-learning-with-xgboost-in-r/notebook)
However, I could not find an equivalent in Python. Does anyone know if there is one?
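Not a direct equivalent, but one way to get at least a combined summary of all trees in Python is to dump them into a single DataFrame and aggregate; a minimal sketch, assuming a trained Booster named model (for the sklearn wrapper, use clf.get_booster()), counting which features the splits use most often:
# one row per node, across all trees in the model
df = model.trees_to_dataframe()

# leaf nodes have Feature == "Leaf"; count split features across every tree
split_counts = (df[df["Feature"] != "Leaf"]
                .groupby("Feature")
                .size()
                .sort_values(ascending=False))
print(split_counts.head(10))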
I am making an explainable model with past data, and I am not going to use it for future prediction at all.
In the data, there are a hundred X variables and one binary Y class, and I am trying to explain how the Xs affect the binary Y (0 or 1).
I came up with the DecisionTreeClassifier, as it clearly shows how decisions are made based on the value criterion of each variable.
Here are my questions:
Is it necessary to split the X data into X_train and X_test even though I am not going to predict with this model? (I do not want to waste data on the test set since I am only interpreting.)
After I split the data and train the model, only a few variables get non-zero feature importances (like 3 out of 100 X variables) and the rest go to zero. Therefore, there are only a few branches. I do not know why this happens.
If here is not the right place to ask such question, please let me know.
Thanks.
No, it is not necessary, but it is a way to check whether your decision tree is overfitting (just remembering the input values and classes) or actually learning the pattern behind them. I would suggest you look into cross-validation, since it doesn't 'waste' any data and trains and tests on all of it, as sketched below. If you need me to explain this further, leave a comment.
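A minimal sketch of that check, assuming your features X and labels y (5 folds is just an example):
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# every row is used for both training and testing across the 5 folds
cv = cross_validate(DecisionTreeClassifier(), X, y, cv=5,
                    return_train_score=True, scoring="accuracy")
print("train:", cv["train_score"].mean(), "test:", cv["test_score"].mean())
# a large gap between train and test accuracy suggests overfitting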
Getting any particular number of important features is not an issue, since it depends entirely on your data.
Example:
Let's say I want to make a model to tell if a number will be divisible by 69 (my Y class).
I have my X variables as divisibility by 2,3,5,7,9,13,17,19 and 23.
If I train the model correctly, only divisibility by 3 and by 23 will get very high feature importance (since 69 = 3 × 23), and everything else should have very low feature importance.
Consequently, my decision tree (trees, if using ensemble models like Random Forest / XGBoost) will have fewer splits.
So, having fewer important features is normal and does not cause any problems.
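A quick, self-contained sketch of this toy example (the dataset is synthetic, built just to illustrate the point):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

divisors = [2, 3, 5, 7, 9, 13, 17, 19, 23]
numbers = np.arange(1, 20001)

# X: divisibility by each candidate divisor; y: divisibility by 69 (= 3 * 23)
X = np.column_stack([(numbers % d == 0).astype(int) for d in divisors])
y = (numbers % 69 == 0).astype(int)

clf = DecisionTreeClassifier().fit(X, y)
for d, imp in zip(divisors, clf.feature_importances_):
    print(d, round(imp, 3))   # importance concentrates on 3 and 23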
No, it isn't. However, I would still split train-test and measure performance separately. While an explainable model is nice, it is significantly less nice if it's a crap model. I'd make sure it had at least reasonable performance before considering interpretation, at which point the splitting is unnecessary.
The number of important features is data-dependent. Random forests do a good job providing this as well. In any case, fewer branches is better. You want a simpler tree, which is easier to explain.
In order to tune a machine learning model's (or even a pipeline's) hyperparameters, sklearn proposes the exhaustive GridSearchCV and the randomized RandomizedSearchCV. The latter samples the provided distributions and tests them out, to finally select the best model (and provide the result of each attempt).
But let's say I train 1,000 models using this randomized method. Later, I decide this isn't precise enough and want to try 1,000 more models. Can I resume the training? That is, ask it to sample more and try more models without losing the current progress. Calling fit() a second time "restarts" and discards the previous hyperparameter combinations.
My situation looks like the following:
from sklearn.model_selection import RandomizedSearchCV

pipeline_cv = RandomizedSearchCV(pipeline, distribution, n_iter=1000, n_jobs=-1)
pipeline_cv = pipeline_cv.fit(trainX, trainy)
predictions = pipeline_cv.predict(targetX)
Then, later, I'd decide that 1000 iterations are not enough to cover my distributions' space, so I would do something like
pipeline_cv = pipeline_cv.resume(trainX, trainy, n_iter=1000) # doesn't exist?
And then I'd have a model trained across 2,000 hyperparameter combinations.
Is my goal achievable?
There is a GitHub issue on that, going back to Sep 2017, but it is still open:
In practice it is useful to search over some parameter space and then continue the search over some related space. We could provide a warm_start parameter to make it easy to accumulate results for further candidates into cv_results_ (without reevaluating parameter combinations that have already tested).
And a similar question in Cross Validated is also effectively unanswered.
So, the answer would seem to be no (plus that the scikit-learn community has not felt the need to include such a possibility).
But let's stop for a moment to think if something like that would be really valuable...
RandomizedSearchCV essentially works by random sampling parameter values from a given distribution; e.g., using the example from the docs:
from scipy.stats import uniform

distributions = dict(C=uniform(loc=0, scale=4),
                     penalty=['l2', 'l1'])
According to the very basic principles of such random sampling, and of random number generation (RNG) in general, there is no guarantee that a randomly sampled value will not be sampled more than once, especially if the number of iterations is large. Factor in that RandomizedSearchCV does not do any bookkeeping itself either, and in principle the same parameter combinations can be tried more than once in any single run (again, provided that the number of iterations is sufficiently large).
Even in cases of continuous distributions (like the uniform one used above), where the probability of getting exact values already sampled may be very small, there is the routine case of two samples like 0.678918 and 0.678919, which, however close, are still different and count as different trials.
Given the above, I cannot see how "warm starting" a RandomizedSearchCV would be of any practical use. The real value of RandomizedSearchCV lies in the possibility of sampling a usually large area of parameter values - so large that we consider it useful to unleash the power of simple random sampling, which, let me repeat, does not itself "remember" the samples it has already returned, and may very well return samples that are (exactly or approximately) equal to ones returned in the past, thus rendering any "warm start" practically irrelevant.
So effectively, simply running two (or more) RandomizedSearchCV processes sequentially (and storing their results) does the job adequately, provided that we do not use the same random seed for different runs (i.e. what is effectively suggested in the Cross Validated thread mentioned above).
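A minimal sketch of that, reusing the pipeline, distribution, and data names from the question; the different random_state values are the important part:
from sklearn.model_selection import RandomizedSearchCV

# two independent runs with different seeds; keep both fitted objects around
search_1 = RandomizedSearchCV(pipeline, distribution, n_iter=1000,
                              n_jobs=-1, random_state=0).fit(trainX, trainy)
search_2 = RandomizedSearchCV(pipeline, distribution, n_iter=1000,
                              n_jobs=-1, random_state=1).fit(trainX, trainy)

# pick whichever run found the better cross-validated score
best = max((search_1, search_2), key=lambda s: s.best_score_)
predictions = best.predict(targetX)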
Hi all!
Could anybody give me advice on a Random Forest implementation in Python? Ideally I need something that outputs as much information about the classifiers as possible, especially:
1. which vectors from the train set are used to train each decision tree;
2. which features are selected at random in each node of each tree, which samples from the training set end up in this node, which feature(s) are selected for the split, and which threshold is used for the split.
I have found quite a few implementations; the most well-known one is probably from scikit-learn, but it is not clear how to do (1) and (2) there (see this question). Other implementations seem to have the same problems, except the one from OpenCV, but it is in C++ (the Python interface does not cover all methods for Random Forests).
Does anybody know something that satisfies (1) and (2)? Alternatively, any idea how to improve scikit implementation to get the features (1) and (2)?
Solved: checked the source code of sklearn.tree._tree.Tree. It has good comments (which fully describe the tree):
children_left : int*
    children_left[i] holds the node id of the left child of node i.
    For leaves, children_left[i] == TREE_LEAF. Otherwise,
    children_left[i] > i. This child handles the case where
    X[:, feature[i]] <= threshold[i].

children_right : int*
    children_right[i] holds the node id of the right child of node i.
    For leaves, children_right[i] == TREE_LEAF. Otherwise,
    children_right[i] > i. This child handles the case where
    X[:, feature[i]] > threshold[i].

feature : int*
    feature[i] holds the feature to split on, for the internal node i.

threshold : double*
    threshold[i] holds the threshold for the internal node i.
You can get nearly all the information in scikit-learn. What exactly was the problem? You can even visualize the trees using dot.
I don't think you can find out which split candidates were sampled at random, but you can find out which were selected in the end.
Edit: Look at the tree_ property of the decision tree. I agree, it is not very well documented. There really should be an example to visualize the leaf distributions etc. You can have a look at the visualization function to get an understanding of how to get to the properties.
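A minimal sketch of poking at those properties, here on the iris dataset; the same works for any tree of a fitted RandomForestClassifier via forest.estimators_:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

t = clf.tree_
for node in range(t.node_count):
    if t.children_left[node] == -1:            # -1 is TREE_LEAF
        print(node, "leaf, class counts:", t.value[node])
    else:
        print(node, "split on feature", t.feature[node],
              "at threshold", t.threshold[node])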