How to handle categorical independent variables in sklearn decision trees


I converted all my categorical independent variables from strings to numeric values (binary 1s and 0s) using OneHotEncoder, but when I run a decision tree, the algorithm treats each binary categorical variable as continuous.
For example, if gender is one of my independent variables, I convert male to 1 and female to 0. When I use this in a decision tree, the node splits at 0.5, which makes no sense.
How do I convert this numeric continuous variable to a numeric categorical one?

"how to convert this numeric continuous to numeric categorical?"
If the result is the same, do you need to?
"for e.g. if gender is one of my independent variable, converted male to 1 and female to 0. when i use this in decision tree the node is splitting at 0.5, which makes no sense."
Maybe I am wrong, but this split makes sense to me.
Suppose we have a decision tree whose split rule is categorical.
The split would be binary: "0" goes left and "1" goes right (in this case).
Now, how can we optimize this rule? Instead of checking whether a value is "0" or "1", we can replace the two checks with one: "0" goes left and everything else goes right. We can then replace that categorical check with a numeric one: values below 0.5 go left, everything else goes right.
In code, it would be as simple as:
Case 1:
    if value == "0":
        tree.left()
    elif value == "1":
        tree.right()
    else:
        pass  # with binary data this branch never runs, so it is useless
Case 2:
    if value == "0":
        tree.left()
    else:
        tree.right()
Case 3:
    if value < 0.5:
        tree.left()
    else:
        tree.right()
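To check the equivalence claimed above, here is a minimal sketch in plain Python. The function names route_categorical and route_threshold are made up for illustration and are not part of sklearn; the values are assumed to be the integers 0 and 1.

```python
def route_categorical(value):
    """Case 1: test each category explicitly (value is assumed to be 0 or 1)."""
    if value == 0:
        return "left"
    elif value == 1:
        return "right"
    raise ValueError("not a binary value")


def route_threshold(value):
    """Case 3: a single numeric comparison against 0.5."""
    return "left" if value < 0.5 else "right"


# For binary inputs both rules send every sample to the same side.
for v in (0, 1):
    print(v, route_categorical(v), route_threshold(v))
```

So for a 0/1 column, a split at 0.5 is just the numeric way of writing "is it category 0 or category 1".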

There are basically two ways to deal with this. You can use:
Integer encoding (if the categorical variable is ordinal in nature, like size)
One-hot encoding (if the categorical variable is nominal in nature, like gender)
It seems you have implemented one-hot encoding incorrectly for this problem. What you are using is simple integer encoding (or binary encoding, to be more specific). Correctly implemented one-hot encoding ensures that there is no bias in the converted values, so the results of the machine learning algorithm are not swayed in favour of a variable just because of its sheer value. You can read more about it here.
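The difference between the two encodings can be sketched in plain Python. The helpers integer_encode and one_hot_encode below are hypothetical and only illustrate the idea; in practice you would probably use a library encoder.

```python
def integer_encode(values, order):
    """Map each category to its position in an explicit ordering (ordinal data)."""
    index = {category: i for i, category in enumerate(order)}
    return [index[v] for v in values]


def one_hot_encode(values):
    """Return one 0/1 column per distinct category (nominal data)."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


sizes = ["small", "large", "medium", "small"]
colours = ["red", "green", "blue", "green"]

print(integer_encode(sizes, order=["small", "medium", "large"]))  # [0, 2, 1, 0]
print(one_hot_encode(colours))
# {'blue': [0, 0, 1, 0], 'green': [0, 1, 0, 1], 'red': [1, 0, 0, 0]}
```

With one-hot encoding no category is numerically "larger" than another, which is the bias the answer refers to.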

Related

LabelEncoding() vs OneHotEncoding() (sklearn,pandas) suggestions

I have 3 types of categorical data in my dataframe, df.
df['Vehicles Owned'] = [1, 2, '3+', 2, 1, 2, '3+', 2]
df['Sex'] = ['m','m','f','m','f','f','m','m']
df['Income'] = [42424,65326,54652,9463,9495,24685,52536,23535]
What should I do for df['Vehicles Owned']: one-hot encode it, label-encode it, or leave it as is after converting 3+ to an integer? (I have used the integer values as they are. I am looking for suggestions, since there is an order.)
For df['Sex'], should I label-encode it or one-hot encode it? (As there is no order, I have used one-hot encoding.)
df['Income'] has a lot of variation, so should I convert it to bins and one-hot encode them to represent low, medium and high incomes?

I would recommend:
For sex, one-hot encode, which translates to using a single boolean variable for is_female or is_male; for n categories you need n-1 one-hot-encoded variables because the nth is linearly dependent on the first n-1.
For vehicles_owned, if you want to preserve order, I would re-map your values from [1,2,3,3+] to [1,2,3,4] and treat the column as an int variable, or to [1,2,3,3.5] as a float variable.
For income, you should probably just leave that as a float variable. Certain models (like GBT models) will likely do some sort of binning under the hood. If your income data happens to have an exponential distribution, you might try taking its log. But converting it to bins in your own feature engineering is not what I'd recommend.
Meta-advice for all these things: set up a cross-validation scheme you're confident in, try different formulations for all your feature-engineering decisions, and then follow your cross-validated performance measure to make your ultimate decision.
Finally, as for which library/function to use, I prefer pandas' get_dummies because it lets you keep column names informative in your final feature matrix, like so: https://stackoverflow.com/a/43971156/1870832
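As a rough sketch of these three recommendations, here they are applied to the sample data from the question using only the standard library. The variable names are illustrative; with pandas you would do the same on DataFrame columns.

```python
import math

vehicles_owned = [1, 2, "3+", 2, 1, 2, "3+", 2]
sex = ["m", "m", "f", "m", "f", "f", "m", "m"]
income = [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535]

# Ordinal feature: re-map to integers so the order is preserved ("3+" becomes 4).
vehicle_map = {1: 1, 2: 2, 3: 3, "3+": 4}
vehicles_int = [vehicle_map[v] for v in vehicles_owned]

# Nominal feature with n = 2 categories: n - 1 = 1 boolean column is enough.
is_female = [1 if s == "f" else 0 for s in sex]

# Continuous feature: keep it numeric, optionally on a log scale.
log_income = [math.log(x) for x in income]

print(vehicles_int)  # [1, 2, 4, 2, 1, 2, 4, 2]
print(is_female)     # [0, 0, 1, 0, 1, 1, 0, 0]
```

Whether the log transform or the 3+ mapping actually helps is something the cross-validation mentioned above should decide.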

How to force decision trees to use just integer numbers while evaluating

I'm building a decision tree, and I would like to force the algorithm to evaluate conditions using integers only. The features I'm using are discrete integers, so it doesn't make sense for the tree to show "X <= 42.5". In this example, I want the box to show one of the equivalents, "X < 43" or "X <= 42".
I need this to make the tree more understandable for non-technical people. It doesn't make sense to show "less than 15.5 songs"; it should be "less than 16" or "less than or equal to 15".
I tried changing the types of the columns of the source tables so that they are all int64, but the problem persists.
Code I'm using:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='gini',
                                  max_depth=2,
                                  splitter='best')
clf = clf.fit(users_data, users_target)
So far I didn't find any parameters or anything similar in the documentation.
Thanks!

First of all, I would not adjust the tree rules themselves; I would adjust the plot.
sklearn has a tree plotting function, plot_tree, with this parameter:
precision : int, optional (default=3)
Number of digits of precision for floating point in the values of impurity, threshold and value attributes of each node.
You can change it, for example:
tree.plot_tree(clf, precision=0)
This should give you rounded numbers.
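If you would rather produce the labels yourself, the equivalence the question relies on ("X <= 42.5" is the same as "X <= 42" for integer features) can be sketched like this. The helper integer_rule is hypothetical and not part of sklearn; it only rewrites a threshold you have already read from the fitted tree.

```python
import math


def integer_rule(feature_name, threshold):
    """Rewrite 'feature <= threshold' for an integer-valued feature.

    For an integer x, x <= 42.5 holds exactly when x <= floor(42.5) = 42.
    """
    return f"{feature_name} <= {math.floor(threshold)}"


print(integer_rule("X", 42.5))      # X <= 42
print(integer_rule("songs", 15.5))  # songs <= 15
```

This only changes how the rule is displayed; the fitted tree and its predictions stay the same.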

How to interpret an h2o decision tree?

I have graphed an h2o decision tree:
I was following a lot of posts on SO, and correct me if I'm wrong, but the values at the leaves are correlations, the levels are the counts of categorical values, and tree 0 means the first tree that was created.
Now my problems are:
1. I can't figure out the "greater than or equal" and "smaller than" signs on the categorical values. For example, if we continue after Z<10.032598, we have a "greater than or equal" sign on the right, which implies what? Also, we have a "smaller than" sign on the left with NA, which are the categorical variables, but what does "smaller than" a categorical variable even mean?
2. If we start at the top (c) and go right, we have the value 1, which I understand implies that c has a correlation of 1. But if we go down one level, again to Z<10.032598, the "greater than or equal" sign on the right implies a correlation of 1 again. What does that mean?

If you are constructing a simple decision tree, then the values at the leaf nodes are the output probabilities, not correlations, and the levels are not counts of categorical values, since the same feature can appear at multiple levels of the tree. The number of levels is determined by the depth you provide when training the model.
The greater-than or smaller-than sign shows which direction to go. For example, at level 1, if z >= 10.0325 then you go right, but if it is smaller you go left in the tree. NA shows that you go left if the value is smaller than the threshold or is null. Your model is treating categorical variables as numerical, and H2O lets you change that using categorical_encoding. Since the data is in numerical format, it is interpreted as numerical.
The reason decision 1 appears again is that your model is now checking a different feature to verify the result. If the first level is not enough for the model to be sure about the output, it checks the second level in the same way, and continues down the tree until it reaches a prediction.

SHA Hashing for training/validation/testing set split

The following is a small snippet from the full code.
I am trying to understand the logic of this splitting method.
A SHA1 digest is 40 hexadecimal characters. What kind of probability is being computed in the expression?
What is the reason for (MAX_NUM_IMAGES_PER_CLASS + 1)? Why add 1?
Does setting different values for MAX_NUM_IMAGES_PER_CLASS have an effect on the split quality?
How good a split would we get out of this? Is this a recommended way of splitting datasets?
    # We want to ignore anything after '_nohash_' in the file name when
    # deciding which set to put an image in, the data set creator has a way of
    # grouping photos that are close variations of each other. For example
    # this is used in the plant disease data set to group multiple pictures of
    # the same leaf.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    # This looks a bit magical, but we need to decide whether this file should
    # go into the training, testing, or validation sets, and we want to keep
    # existing files in the same set even if more files are subsequently
    # added.
    # To do that, we need a stable way of deciding based on just the file name
    # itself, so we do a hash of that and then use that to generate a
    # probability value that we use to assign it.
    hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
    percentage_hash = ((int(hash_name_hashed, 16) %
                        (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
        validation_images.append(base_name)
    elif percentage_hash < (testing_percentage + validation_percentage):
        testing_images.append(base_name)
    else:
        training_images.append(base_name)
result[label_name] = {
    'dir': dir_name,
    'training': training_images,
    'testing': testing_images,
    'validation': validation_images,
}

This code is simply distributing file names “randomly” (but reproducibly) over a number of bins and then grouping the bins into just the three categories. The number of bits in the hash is irrelevant (so long as it’s “enough”, which is probably about 35 for this sort of work).
Reducing modulo n+1 produces a value on [0,n], and multiplying that by 100/n obviously produces a value on [0,100], which is being interpreted as a percentage. n being MAX_NUM_IMAGES_PER_CLASS is meant to control the rounding error in the interpretation to be no more than “one image”.
This strategy is reasonable, but looks a bit more sophisticated than it is (since there is still rounding going on, and the remainder introduces a bias—although with numbers this large it is utterly unobservable). You could make it simpler and more accurate by simply precalculating ranges over the whole space of 2^160 hashes for each class and just checking the hash against the two boundaries. That still notionally involves rounding, but with 160 bits it’s only that intrinsic to representing decimals like 31% in floating point.
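The simpler variant suggested above (comparing the hash against two precomputed boundaries in the 2^160 space) might look roughly like this. The function name assign_split is made up for this sketch, and the percentages are assumed to be integers so the boundaries can be computed with exact integer arithmetic.

```python
import hashlib

HASH_SPACE = 2 ** 160  # a SHA-1 digest is 160 bits wide


def assign_split(name, validation_percentage, testing_percentage):
    """Assign a file name to a split, reproducibly, from its SHA-1 hash.

    Assumes integer percentages; the boundaries are positions in [0, 2**160].
    """
    digest = int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)
    validation_bound = HASH_SPACE * validation_percentage // 100
    testing_bound = HASH_SPACE * (validation_percentage + testing_percentage) // 100
    if digest < validation_bound:
        return "validation"
    if digest < testing_bound:
        return "testing"
    return "training"


# The same name always lands in the same split, even if more files are added.
print(assign_split("leaf_001.jpg", 10, 10))
```

Unlike the original snippet, this sketch does not strip the '_nohash_' suffix first; you would do that before hashing if you need grouped files to stay together.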

In what contexts do programming languages make real use of an Infinity value?

So in Ruby there is a trick to specify infinity:
1.0/0
=> Infinity
I believe in Python you can do something like this
float('inf')
These are just examples though, I'm sure most languages have infinity in some capacity. When would you actually use this construct in the real world? Why would using it in a range be better than just using a boolean expression? For instance
(0..1.0/0).include?(number) == (number >= 0) # True for all values of number
=> true
To summarize, what I'm looking for is a real world reason to use Infinity.
EDIT: I'm looking for real world code. It's all well and good to say this is when you "could" use it, when have people actually used it.

Dijkstra's algorithm typically assigns infinity as the initial tentative distance of every node in a graph. This doesn't have to be "infinity", just some arbitrarily large constant, but in Java I typically use Double.POSITIVE_INFINITY. I assume Ruby could be used similarly.
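A minimal Python sketch of that use, assuming the graph is given as a dict of dicts mapping each node to its neighbours and edge weights (this representation is my own choice for the example):

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; unreachable nodes keep distance inf."""
    distances = {node: float("inf") for node in graph}
    distances[source] = 0
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances[node]:
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in graph[node].items():
            candidate = dist + weight
            if candidate < distances[neighbour]:
                distances[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return distances


graph = {"a": {"b": 1, "c": 4}, "b": {"c": 2}, "c": {}, "d": {}}
print(dijkstra(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': inf}
```

Because every real distance compares smaller than inf, the first path found to a node always replaces the initial value, and nodes that are never reached stay at inf without any special casing.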

Off the top of my head, it can be useful as an initial value when searching for a minimum value.
For example:
    smallest = float('inf')
    for x in somelist:
        if x < smallest:
            smallest = x
I prefer this to initially setting smallest to the first value of somelist.
Of course, in Python, you should just use the min() built-in function in most cases.

There seems to be an implied "Why does this functionality even exist?" in your question. And the reason is that Ruby and Python are just giving access to the full range of values that one can specify in floating point form as specified by IEEE.
This page seems to describe it well:
http://steve.hollasch.net/cgindex/coding/ieeefloat.html
As a result, you can also have NaN (Not-a-number) values and -0.0, while you may not immediately have real-world uses for those either.
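To make those IEEE special values concrete, here is a short Python sketch of how they behave:

```python
import math

inf = float("inf")
nan = float("nan")
neg_zero = -0.0

print(inf > 1e308)                   # True: inf compares greater than any finite float
print(math.isnan(inf - inf))         # True: inf - inf is undefined, so the result is NaN
print(nan == nan)                    # False: NaN is not equal to anything, even itself
print(neg_zero == 0.0)               # True: the two zeros compare equal...
print(math.copysign(1.0, neg_zero))  # -1.0: ...but the sign bit is still there
```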

In some physics calculations you can normalize irregularities (i.e., infinite quantities) of the same order against each other, cancelling them and allowing an approximate result to come through.
When you deal with limits, calculations like (infinity / infinity) approaching a finite number can be carried out. It's useful for the language to have the ability to override the regular divide-by-zero error.

Use Infinity and -Infinity when implementing a mathematical algorithm calls for it.
In Ruby, Infinity and -Infinity have nice comparative properties so that -Infinity < x < Infinity for any real number x. For example, Math.log(0) returns -Infinity, extending to 0 the property that x > y implies that Math.log(x) > Math.log(y). Also, Infinity * x is Infinity if x > 0, -Infinity if x < 0, and 'NaN' (not a number; that is, undefined) if x is 0.
For example, I use the following bit of code in part of the calculation of some log likelihood ratios. I explicitly reference -Infinity to define a value even if k is 0 or n AND x is 0 or 1.
Infinity = 1.0/0.0
def Similarity.log_l(k, n, x)
  # Return the log likelihood when it is defined, and -Infinity otherwise.
  unless x == 0 or x == 1
    return k * Math.log(x.to_f) + (n-k) * Math.log(1.0-x)
  end
  -Infinity
end

Alpha-beta pruning
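For illustration, a compact alpha-beta sketch in Python where alpha and beta start at -inf and +inf. The game tree format (nested lists with numeric leaves) is an assumption made for this example.

```python
import math


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax value of a tree given as nested lists with numeric leaves."""
    if not isinstance(node, list):
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizing player will never allow this branch
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizing player already has a better option
    return best


tree = [[3, 5], [2, 9], [0, 7]]
print(alphabeta(tree, maximizing=True))  # 3
```

Starting the window at (-inf, +inf) means "no bound known yet", so the first evaluated child always tightens it.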

I use it to specify the mass and inertia of a static object in physics simulations. Static objects are essentially unaffected by gravity and other simulation forces.

In Ruby, infinity can be used to implement lazy lists. Say I want N numbers starting at 200 that get successively larger by 100 each time:
Inf = 1.0 / 0.0
(200..Inf).step(100).take(N)
More info here: http://banisterfiend.wordpress.com/2009/10/02/wtf-infinite-ranges-in-ruby/

I've used it for cases where you want to define ranges of preferences or allowed values.
For example, 37signals apps have a limit on the number of projects:
Infinity = 1 / 0.0
FREE = 0..1
BASIC = 0..5
PREMIUM = 0..Infinity
Then you can do checks like:
if PREMIUM.include? current_user.projects.count
  # do something
end

I used it for representing camera focus distance and to my surprise in Python:
>>> float("inf") is float("inf")
False
>>> float("inf") == float("inf")
True
I wonder why that is.
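As far as I understand it, this is because `is` compares object identity while `==` compares values, and each float("inf") call typically builds a new float object in CPython. A small sketch of the safer checks:

```python
import math

a = float("inf")
b = float("inf")

print(a == b)         # True: both hold the same value
print(a is a)         # True: same object
print(math.isinf(b))  # True: the explicit way to test for infinity
print(a == math.inf)  # True: math.inf is the same value as float("inf")
```

So for focus distances or any other float, compare with == or math.isinf rather than with `is`.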

I've used it in the minimax algorithm. When I'm generating new moves, if the min player wins on that node then the value of the node is -∞. Conversely, if the max player wins then the value of that node is +∞.
Also, if you're generating nodes/game states and then trying out several heuristics, you can set all the node values to -∞ or +∞, whichever makes sense, and then when you're running a heuristic it's easy to set the node value:
node_val = -∞
node_val = max(heuristic1(node), node_val)
node_val = max(heuristic2(node), node_val)
node_val = max(heuristic3(node), node_val)

I've used it in a DSL similar to Rails' has_one and has_many:
has 0..1 :author
has 0..INFINITY :tags
This makes it easy to express concepts like Kleene star and plus in your DSL.

I use it when I have a Range object where one or both ends need to be open

I've used symbolic values for positive and negative infinity in dealing with range comparisons to eliminate corner cases that would otherwise require special handling:
Given two ranges A=[a,b) and C=[c,d) do they intersect, is one greater than the other, or does one contain the other?
A > C iff a >= d
A < C iff b <= c
etc...
If you have values for positive and negative infinity that respectively compare greater than and less than all other values, you don't need to do any special handling for open-ended ranges. Since floats and doubles already implement these values, you might as well use them instead of trying to find the largest/smallest values on your platform. With integers, it's more difficult to use "infinity" since it's not supported by hardware.
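A sketch of this in Python, with half-open ranges represented as (start, end) tuples; the tuple representation and the function names are assumptions made for the example.

```python
import math


def intersects(a, c):
    """True if half-open ranges a = [a0, a1) and c = [c0, c1) overlap."""
    (a0, a1), (c0, c1) = a, c
    return a0 < c1 and c0 < a1


def is_greater(a, c):
    """A > C iff a >= d, i.e. A starts at or after the end of C."""
    return a[0] >= c[1]


bounded = (0, 10)
open_ended = (5, math.inf)

print(intersects(bounded, open_ended))      # True
print(is_greater(open_ended, (0, 5)))       # True
print(intersects((-math.inf, 0), bounded))  # False
```

The open-ended ranges go through exactly the same comparisons as the bounded ones, which is the point of using infinities here.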

I ran across this because I'm looking for an "infinite" value to set for a maximum, if a given value doesn't exist, in an attempt to create a binary tree. (Because I'm selecting based on a range of values, and not just a single value, I quickly realized that even a hash won't work in my situation.)
Since I expect all numbers involved to be positive, the minimum is easy: 0. Since I don't know what to expect for a maximum, though, I would like the upper bound to be Infinity of some sort. This way, I won't have to figure out what "maximum" I should compare things to.
Since this is a project I'm working on at work, it's technically a "real world problem". It may be kind of rare, but like a lot of abstractions, it's convenient when you need it!
Also, to those who say that this (and other examples) are contrived, I would point out that all abstractions are somewhat contrived; that doesn't mean they aren't useful when you contrive them.

When working in a problem domain where trig is used (especially tangent) infinity is an answer that can come up. Trig ends up being used heavily in graphics applications, games, and geospatial applications, plus the obvious math applications.

I'm sure there are other ways to do this, but you could use Infinity to check for reasonable inputs in a String-to-Float conversion. In Java, at least, the Float.isNaN() static method will return false for numbers with infinite magnitude, indicating they are valid numbers, even though your program might want to classify them as invalid. Checking against the Float.POSITIVE_INFINITY and Float.NEGATIVE_INFINITY constants solves that problem. For example:
// Some sample values to test our code with
String stringValues[] = {
    "-999999999999999999999999999999999999999999999",
    "12345",
    "999999999999999999999999999999999999999999999"
};
// Loop through each string representation
for (String stringValue : stringValues) {
    // Convert the string representation to a Float representation
    Float floatValue = Float.parseFloat(stringValue);
    System.out.println("String representation: " + stringValue);
    System.out.println("Result of isNaN: " + floatValue.isNaN());
    // Check the result for positive infinity, negative infinity, and
    // "normal" float numbers (within the defined range for Float values).
    if (floatValue == Float.POSITIVE_INFINITY) {
        System.out.println("That number is too big.");
    } else if (floatValue == Float.NEGATIVE_INFINITY) {
        System.out.println("That number is too small.");
    } else {
        System.out.println("That number is jussssst right.");
    }
}
Sample Output:
String representation: -999999999999999999999999999999999999999999999
Result of isNaN: false
That number is too small.
String representation: 12345
Result of isNaN: false
That number is jussssst right.
String representation: 999999999999999999999999999999999999999999999
Result of isNaN: false
That number is too big.

It is used quite extensively in graphics. For example, any pixel in a 3D image that is not part of an actual object is marked as infinitely far away, so that it can later be replaced with a background image.

I'm using a network library where you can specify the maximum number of reconnection attempts. Since I want mine to reconnect forever:
my_connection = ConnectionLibrary(max_connection_attempts = float('inf'))
In my opinion, it's more clear than the typical "set to -1 to retry forever" style, since it's literally saying "retry until the number of connection attempts is greater than infinity".
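I don't know how that library implements it, but a retry loop that accepts infinity as its limit could be sketched like this. The function connect_with_retries and its attempt_connection callback are hypothetical names for this example.

```python
def connect_with_retries(attempt_connection, max_connection_attempts=float("inf")):
    """Call attempt_connection() until it returns True or attempts run out.

    Returns the number of attempts used, or None if the limit was reached.
    """
    attempts = 0
    while attempts < max_connection_attempts:
        attempts += 1
        if attempt_connection():
            return attempts
    return None


# Hypothetical flaky connection that succeeds on the third try.
outcomes = iter([False, False, True])
print(connect_with_retries(lambda: next(outcomes)))                    # 3
print(connect_with_retries(lambda: False, max_connection_attempts=2))  # None
```

The comparison attempts < inf is always true, so the loop needs no special "-1 means forever" branch.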

Some programmers use Infinity or NaNs to show a variable has never been initialized or assigned in the program.

It is useful if you want the largest number from user input where very large negative numbers might be entered. If I enter -13543124321.431, it still works out as the largest number since it's bigger than -inf.
initial_value = float('-inf')
while True:
    try:
        x = input('gimmee a number or type the word, stop ')
    except KeyboardInterrupt:
        print("we done - by yo command")
        break
    if x == "stop":
        print("we done")
        break
    try:
        x = float(x)
    except ValueError:
        print('not a number')
        continue
    if x > initial_value:
        initial_value = x
print("The largest number is: " + str(initial_value))

You can use:
import decimal
decimal.Decimal("Infinity")
or:
from decimal import *
Decimal("Infinity")

For sorting
I've seen it used as a sort value, to say "always sort these items to the bottom".
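For example, in Python (the task data here is invented for illustration), items without a priority can be given an infinite sort key so they always end up last:

```python
import math

tasks = [
    {"name": "write report", "priority": 2},
    {"name": "someday maybe", "priority": None},
    {"name": "fix outage", "priority": 1},
]


def sort_key(task):
    """Items without a priority sort after every item that has one."""
    priority = task["priority"]
    return math.inf if priority is None else priority


ordered = sorted(tasks, key=sort_key)
print([t["name"] for t in ordered])
# ['fix outage', 'write report', 'someday maybe']
```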

To specify a non-existent maximum
If you're dealing with numbers, nil represents an unknown quantity, and should be preferred to 0 for that case. Similarly, Infinity represents an unbounded quantity, and should be preferred to (arbitrarily_large_number) in that case.
I think it can make the code cleaner. For example, I'm using Float::INFINITY in a Ruby gem for exactly that: the user can specify a maximum string length for a message, or they can specify :all. In that case, I represent the maximum length as Float::INFINITY, so that later when I check "is this message longer than the maximum length?" the answer will always be false, without needing a special case.
