I have graphed an h2o decision tree:
I have been following a lot of posts on SO, and correct me if I'm wrong, but my understanding is that the values at the leaves are correlations, the levels are the count of categorical values, and "tree 0" means the first tree that was created.
Now my problems are:
1. I can't figure out the "greater than or equal" signs and the "smaller than" signs at the categorical values. For example, if we continue after Z<10.032598, we have a "greater than or equal" sign on the right; what does that imply? Also, we have a "smaller than" sign on the left together with NA, which are the categorical variables, but what does "smaller than" a categorical variable even mean?
2. If we start at the top (c) and go right, we have the value 1, which I understand to imply that c has a correlation of 1. But if we go down one level, again to Z<10.032598, the "greater than or equal" sign on the right implies a correlation of 1 again. What does that mean?
If you are constructing a simple decision tree, then the values at the leaf nodes are the output probabilities, not correlations, and the levels are not the count of categorical values, since the same feature can repeat in the tree at different levels. The number of levels is decided by the depth you provide when training the model.
The "greater than or equal" and "smaller than" signs show which direction you have to go. For example, at level 1, if Z >= 10.0325 then you go right, but if it is smaller then you go left in the tree. NA basically means that you also go left when the value is null, in addition to when it is smaller than the threshold. Your model is treating the categorical variables as numerical: since the data is in numerical format, it is interpreted as numerical. H2O gives you the option to change that using the categorical_encoding parameter.
The reason the decision 1 appears again is that your model is now checking a different feature to refine the result. If the first level's split does not make the model sure about the output, it checks the next level, does the same thing, and goes further down the tree until it reaches a prediction.
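If you do want the column treated as categorical, a minimal sketch with the H2O Python API could look like the following (the file name, column layout, and GBM settings are illustrative assumptions, not taken from the post):

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
df = h2o.import_file("train.csv")   # hypothetical training data
df["Z"] = df["Z"].asfactor()        # treat the numeric column Z as categorical

model = H2OGradientBoostingEstimator(
    ntrees=50,
    categorical_encoding="enum",    # or "one_hot_explicit", etc.
)
model.train(x=df.columns[:-1], y=df.columns[-1], training_frame=df)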
I'm building a model to identify a subset of features in order to classify which group an object belongs to. In detail, I have a dataset of 11 objects, of which 5 belong to group A and 6 belong to group B. Each object is characterized by the mutation status of 19,000 genes, and the values are binary: mutation or no mutation. My aim is to identify a group of genes among those 19,000 so I can predict whether an object belongs to group A or B. For example, if the object has mutations in genes A, B, and C and no mutations in genes D and E, it belongs to group A; if not, it belongs to group B.
Since I have a large number of features (19,000), I will need to perform feature selection. I'm thinking maybe I can remove features with low variance first as a preliminary step and then apply recursive feature elimination with cross-validation to select the optimal features. I also don't know yet which model I should use for the classification, SVM or random forest.
Can you give me some advice? Thank you so much.
Obviously, as a first step you can delete all features with zero variance. Also, with 11 observations against the remaining features, you will not be able to "find the truth" but maybe "find some good candidates". Whether you'll want to set a lower variance limit above zero depends on whether you have additional information or theory. If not, why not leave feature selection in the hands of the algorithm?
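For what it's worth, a minimal sketch of that two-step pipeline with scikit-learn (the dummy X and y below merely stand in for your 11 x 19,000 mutation matrix, and the parameter values are illustrative, not recommendations):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, VarianceThreshold
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(11, 19000))   # dummy binary mutation matrix
y = np.array([0] * 5 + [1] * 6)            # 5 objects in group A, 6 in group B

# Step 1: drop features that never vary (zero variance).
X_reduced = VarianceThreshold(threshold=0.0).fit_transform(X)

# Step 2: recursive feature elimination with cross-validation.
# With only 11 samples, leave-one-out is about the only usable CV scheme.
rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    step=0.1,                # drop 10% of the features per iteration
    cv=LeaveOneOut(),
    scoring="accuracy",
)
rfecv.fit(X_reduced, y)
print("Selected features:", rfecv.n_features_)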
I am new to ML. I am removing outliers using z-scores with the code given below. The problem I am facing is that even after I remove outliers, some values are still flagged as outliers. Can anyone explain why this is so? Isn't the z-score a reliable method for removing all the outliers from the data?
I am calculating the z-scores a second time to check whether any outlying data points are still left.
import numpy as np
import pandas as pd
from scipy import stats

for feature in numerical_features:
    # Work on a copy of the single feature column.
    data = pd.DataFrame(housing[feature], columns=[feature]).copy()
    z_scores = np.abs(stats.zscore(data[feature]))
    print("Before z score on ", feature, " ====> ", data[z_scores > 3].shape)
    # Replace outliers (|z| > 3) with the feature's median.
    data[z_scores > 3] = data[feature].median()
    # Recompute z-scores on the modified column.
    z_scores = np.abs(stats.zscore(data[feature]))
    print("After z score on ", feature, " ===> ", data[z_scores > 3].shape)
    housing[feature] = data[feature]
    print()
"Before" is the first time I apply the z-score and tells me how many values will be impacted when I replace outliers with the median.
"After" tells me how many values are still left as outliers.
Output screenshot: https://i.stack.imgur.com/g6kn2.png
The z-score tells you how many standard deviations away from the mean a certain point is. Using |z-score| > 3 is a very common way to identify outliers. What you are missing is that when you remove or replace outliers, the mean and standard deviation of your new distribution are different from what they used to be, so the z-scores of all remaining points shift slightly. In many cases the change is negligible; however, there are cases where the change in z-score is more pronounced.
Depending on your application, you may wish to run the z-score filter a couple of times until you get a stable distribution. Also, depending on your application, you may consider dropping the outlier rows instead of replacing them with the median. Hopefully you know why you chose replacement and the caveats associated with that choice.
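A hedged sketch of such an iterated filter (the function name and defaults are illustrative; series stands for one numeric feature column):

import numpy as np
from scipy import stats

def iterative_zscore_filter(series, threshold=3.0, max_iter=10):
    s = series.copy()
    for _ in range(max_iter):
        z = np.abs(stats.zscore(s))
        mask = z > threshold
        if not mask.any():        # distribution is stable: stop
            break
        s[mask] = s.median()      # replace (or use s = s[~mask] to drop)
    return s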
I converted all my categorical independent variables from strings to numeric values (binary 1s and 0s) using OneHotEncoder, but when I run a decision tree, the algorithm treats the binary categorical variables as continuous.
For example, gender is one of my independent variables; I converted male to 1 and female to 0. When I use this in the decision tree, the node splits at 0.5, which makes no sense.
How do I convert this numeric continuous variable to a numeric categorical one?
How do I convert this numeric continuous variable to a numeric categorical one?
If the result is the same, would you need to?
For example, gender is one of my independent variables; I converted male to 1 and female to 0. When I use this in the decision tree, the node splits at 0.5, which makes no sense.
Maybe I am wrong, but this split makes sense to me.
Let's say we have a decision tree with a categorical split rule.
The division would be binary: "0" goes left and "1" goes right (in this case).
Now, how can we optimize this division rule? Instead of checking whether a value is "0" or "1", we can replace those two checks with a single one: "0" goes left and everything else goes right. That single check can in turn be expressed as a float comparison: < 0.5 goes left, else right.
In code, it would be as simple as:
Case 1:
if value == "0":
    tree.left()
elif value == "1":
    tree.right()
else:
    pass  # with binary values this branch is never reached

Case 2:
if value == "0":
    tree.left()
else:
    tree.right()

Case 3:
if value < 0.5:
    tree.left()
else:
    tree.right()
There are basically two ways to deal with this. You can use:
Integer encoding (if the categorical variable is ordinal in nature, like size etc.)
One-hot encoding (if the categorical variable is nominal in nature, like gender etc.)
It seems you have wrongly implemented one-hot encoding for this problem. What you are using is simple integer encoding (or binary encoding, to be more specific). Correctly implemented one-hot encoding ensures there is no implied ordering in the converted values, so the results of the machine learning algorithm are not swayed in favour of a variable just because of its sheer value. You can read more about it here.
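For illustration, a minimal sketch of one-hot encoding with pandas (the DataFrame and the gender column are assumed, mirroring the question's example):

import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "male"]})

# One column per category, no implied ordering between the values.
encoded = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
print(encoded)
#    gender_female  gender_male
# 0              0            1
# 1              1            0
# 2              0            1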
I'm building a decision tree, and I would like to force the algorithm to evaluate its conditions only at integer values. The features I'm using are discrete integers, so it doesn't make sense for the tree to show "X <= 42.5"; in this example, I want the box in the tree to show one of the equivalents, "X < 43" or "X <= 42".
I need this to make the tree more understandable for non-technical people. It doesn't make sense to show "less than 42.5 songs"; it should be "less than 43" or "less than or equal to 42".
I tried changing the types of the columns in the source tables, and they are all int64, but the problem persists.
Code I'm using:
from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion='gini',
                                  max_depth=2,
                                  splitter='best')
clf = clf.fit(users_data, users_target)

So far I haven't found any parameter or anything similar in the documentation.
Thanks!
First of all, I would not adjust the tree rules themselves; I would adjust the plot.
sklearn's tree plotting function, tree.plot_tree, has this adjustable parameter:
precision : int, optional (default=3)
Number of digits of precision for floating point in the values of impurity, threshold and value attributes of each node.
You can change it; for example,
tree.plot_tree(clf, precision=0)
should give you rounded numbers.
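Putting it together, a small self-contained sketch (the dummy data below stands in for the question's users_data and users_target; note that precision=0 only changes the numbers shown in the plot, not the split rule itself, which internally stays at the .5 value):

import matplotlib.pyplot as plt
import numpy as np
from sklearn import tree

# Dummy stand-ins for the question's users_data / users_target.
rng = np.random.default_rng(0)
users_data = rng.integers(0, 100, size=(50, 3))   # integer-valued features
users_target = rng.integers(0, 2, size=50)        # binary labels

clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=2,
                                  splitter='best')
clf = clf.fit(users_data, users_target)

tree.plot_tree(clf, precision=0)  # thresholds displayed as whole numbers
plt.show()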
I am trying to teach myself Dynamic Programming, and ran into this problem from MIT.
We are given a checkerboard which has 4 rows and n columns, and
has an integer written in each square. We are also given a set of 2n pebbles, and we want to
place some or all of these on the checkerboard (each pebble can be placed on exactly one square)
so as to maximize the sum of the integers in the squares that are covered by pebbles. There is
one constraint: for a placement of pebbles to be legal, no two of them can be on horizontally or
vertically adjacent squares (diagonal adjacency is ok).
(a) Determine the number of legal patterns that can occur in any column (in isolation, ignoring
the pebbles in adjacent columns) and describe these patterns.
Call two patterns compatible if they can be placed on adjacent columns to form a legal placement.
Let us consider subproblems consisting of the first k columns, 1 ≤ k ≤ n. Each subproblem can
be assigned a type, which is the pattern occurring in the last column.
(b) Using the notions of compatibility and type, give an O(n)-time dynamic programming algorithm for computing an optimal placement.
Ok, so for part a: There are 8 possible solutions.
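As a quick sanity check on that count, one convenient (illustrative) representation encodes a column pattern as a 4-bit mask of pebbled rows; a pattern is legal exactly when no two adjacent bits are set:

patterns = [m for m in range(16) if m & (m << 1) == 0]
print(len(patterns), [format(m, "04b") for m in patterns])
# -> 8 ['0000', '0001', '0010', '0100', '0101', '1000', '1001', '1010']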
For part b, I'm unsure, but this is where I'm headed:
Split into sub-problems. Assume 1 ≤ i ≤ n.
1. Define Cj[i] to be the optimal value by pebbling columns 0,...,i, such that column i has pattern type j.
2. Create 8 separate arrays of n elements for each pattern type.
I am not sure where to go from here. I realize there are solutions to this problem online, but the solutions don't seem very clear to me.
You're on the right track. As you examine each new column, you will end up computing all possible best-scores up to that point.
Let's say you built your compatibility list (a 2D array) and called it Li[y], such that for each pattern i, the entries Li[y] are its one or more compatible patterns.
Now you examine column j. First, you compute that column's isolated score for each pattern i; call it Sj[i]. Then, for each pattern i and each compatible pattern x = Li[y], you maximize the total score Cj such that Cj[x] = Cj-1[i] + Sj[x]. This is a simple array test and update (replace if bigger).
In addition, you store the pebbling pattern that led to each score. When you update Cj[x] (i.e., you increase its score from its present value), remember the preceding pattern that caused the update as Pj[x] = i. That says "pattern x gave the best result, given the preceding pattern i".
When you are all done, just find the pattern i with the best score Cn[i]. You can then backtrack using Pj to recover the pebbling pattern from each column that led to this result.
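To make this concrete, here is a minimal Python sketch of the DP described above (the bitmask encoding of patterns and all names are illustrative; the board is assumed to be given as board[row][col] with 4 rows):

def legal_patterns():
    # A column pattern is a 4-bit mask of pebbled rows; it is legal
    # if no two vertically adjacent rows are both pebbled.
    return [m for m in range(16) if m & (m << 1) == 0]

def solve(board):
    n = len(board[0])
    pats = legal_patterns()  # the 8 legal column patterns
    # Two patterns are compatible iff they share no pebbled row
    # (no horizontal adjacency across adjacent columns).
    compat = {x: [i for i in pats if i & x == 0] for x in pats}

    def score(j, m):
        # Sum of the integers covered by pattern m in column j.
        return sum(board[r][j] for r in range(4) if m >> r & 1)

    # C[x] = best total for columns 0..j ending with pattern x;
    # P[j][x] = predecessor pattern of x, for backtracking.
    C = {x: score(0, x) for x in pats}
    P = [dict() for _ in range(n)]
    for j in range(1, n):
        new_C = {}
        for x in pats:
            best_i = max(compat[x], key=lambda i: C[i])
            new_C[x] = C[best_i] + score(j, x)
            P[j][x] = best_i
        C = new_C

    # Best final pattern, then backtrack to recover the placement.
    x = max(pats, key=lambda p: C[p])
    best, cols = C[x], [0] * n
    for j in range(n - 1, -1, -1):
        cols[j] = x
        if j:
            x = P[j][x]
    return best, cols

board = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9],
         [10, 11, 12]]
print(solve(board))   # -> (42, [10, 5, 10])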