I have 3 types of categorical data in my dataframe, df.
df['Vehicles Owned'] = ['1','2','3+','2','1','2','3+','2']
df['Sex'] = ['m','m','f','m','f','f','m','m']
df['Income'] = [42424,65326,54652,9463,9495,24685,52536,23535]
What should I do with df['Vehicles Owned']? (One-hot encode, label encode, or leave it as is by converting 3+ to an integer? I have used the integer values as they are, but I'm looking for suggestions since there is an order.)
For df['Sex'], should I label encode it or one-hot encode it? (As there is no order, I have used one-hot encoding.)
df['Income'] has a lot of variation, so should I convert it to bins and one-hot encode them as low, medium, and high incomes?
I would recommend:
For sex, one-hot encode, which here translates to a single boolean variable such as is_female or is_male; for n categories you only need n-1 one-hot-encoded variables, because the nth is linearly dependent on the first n-1.
For vehicles_owned, if you want to preserve the order I would re-map the values from [1, 2, 3, 3+] to [1, 2, 3, 4] and treat the column as an int variable, or to [1, 2, 3, 3.5] as a float variable.
For income: you should probably just leave it as a float variable. Certain models (like gradient-boosted tree models) will likely do some sort of binning under the hood anyway. If your income data happens to have an exponential distribution, you might try taking its log. But converting it to bins as part of your own feature engineering is not what I'd recommend.
Meta-advice for all of these decisions: set up a cross-validation scheme you're confident in, try different formulations for each feature-engineering choice, and then let your cross-validated performance measure drive the final decision.
Finally, as for which library/function to use, I prefer pandas' get_dummies because it lets you keep informative column names in your final feature matrix, like so: https://stackoverflow.com/a/43971156/1870832
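To make this concrete, here is a minimal sketch using the columns from the question (the exact mapping chosen for '3+' is just one reasonable option):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Vehicles Owned": ["1", "2", "3+", "2", "1", "2", "3+", "2"],
    "Sex": ["m", "m", "f", "m", "f", "f", "m", "m"],
    "Income": [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535],
})

# Ordinal: re-map so the order is preserved and treat as an int
df["Vehicles Owned"] = df["Vehicles Owned"].map({"1": 1, "2": 2, "3": 3, "3+": 4})

# Nominal: one-hot encode; drop_first=True keeps the n-1 independent columns
df = pd.get_dummies(df, columns=["Sex"], drop_first=True)

# Income: leave as a numeric variable; log-transform only if it looks exponential
# df["Income"] = np.log1p(df["Income"])

print(df.head())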
Related
I want to train a logistic regression model on a dataset whose categorical HomePlanet column contains 3 distinct values: Earth, Europa and Mars.
When I do:
pd.get_dummies(train['HomePlanet'])
it separates all categories into columns. Then I train the model with that dataset.
I can also make numerical categories by doing:
train['HomePlanet'] = train['HomePlanet'].replace({'Earth':1 , 'Europa':2 , 'Mars':3 })
Is it logical to use the second way to convert the categorical data and then train the model?
The first approach is called 'One-Hot Encoding' (OHE) and the second is called 'Label Encoding' (LE). Generally OHE is preferred over LE, because LE can introduce properties of similarity and ranking when in fact these don't exist in the data.
Similarity - the idea that categories encoded with numbers closer to each other are more similar. In your example, the encoding would imply that Earth is more similar to Europa than to Mars.
Ranking - labels are assigned based on a specific order that is relevant to your problem, e.g. size, distance, importance. In your case, you would effectively be saying that Mars is bigger than Europa, and Europa is bigger than Earth.
I would say that in your example one-hot encoding will work better, but there are cases where label encoding makes more sense, for example converting product reviews from "very bad, bad, neutral, good, very good" to "0, 1, 2, 3, 4" respectively. In this case 'very good' is the best option, so it is assigned the largest number. Also, 'very good' is more similar to 'good' than to 'very bad', so the code for 'very good' (4) is closer to the code for 'good' (3) than to the code for 'very bad' (0).
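To illustrate both cases, a small sketch using pd.get_dummies for the nominal column and an explicit order map for the ordinal review example (the variable names are made up):
import pandas as pd

train = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

# Nominal feature: one-hot encode, no order implied
planet_ohe = pd.get_dummies(train["HomePlanet"], prefix="HomePlanet")

# Ordinal feature: integer codes that reflect the real order
reviews = pd.Series(["very bad", "good", "neutral", "very good", "bad"])
order = {"very bad": 0, "bad": 1, "neutral": 2, "good": 3, "very good": 4}
review_codes = reviews.map(order)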
I have a data frame that has complex numbers split into a real and an imaginary column. I want to add a column (two, actually: one for each channel) to the dataframe that computes the log magnitude:
     ch1_real  ch1_imag  ch2_real  ch2_imag  ch1_phase  ch2_phase  distance
79 0.011960 -0.003418 0.005127 -0.019530 -15.95 -75.290 0.0
78 -0.009766 -0.005371 -0.015870 0.010010 -151.20 147.800 1.0
343 0.002197 0.010990 0.003662 -0.013180 78.69 -74.480 2.0
80 -0.002686 0.010740 0.011960 0.013430 104.00 48.300 3.0
341 -0.007080 0.009033 0.016600 -0.000977 128.10 -3.366 4.0
If I try this:
df['ch1_log_mag']=20*np.log10(np.abs(complex(df.ch1_real,df.ch1_imag)))
I get error: "TypeError: cannot convert the series to <class 'float'>", because I think cmath.complex cannot work on an array.
So I then experimented with loc to pick out the first element of ch1_real, for example, to work out how to use it to accomplish what I'm trying to do, but couldn't figure out how:
df.loc[0,df['ch1_real']]
This produces a KeyError.
Brute-forcing it works,
df['ch1_log_mag'] = 20 * np.log10(np.sqrt(df.ch1_real**2 + df.ch1_imag**2))
but I believe it is more legible to use np.abs to get the magnitude, and I'm more interested in understanding how dataframes and their indexing work, and why what I initially attempted does not work.
By the way, what is the difference between df.ch1_real and df['ch1_real']? When do I use one vs. the other?
Edit: more attempts at a solution
I tried using apply, since my understanding is that it "applies" the function passed to it to each row (by default):
df.apply(complex(df['ch1_real'], df['ch1_imag']))
but this generates the same TypeError, since I think the issue is that complex cannot work on a Series. Perhaps if I cast the Series to float?
After reading this post, I tried using pd.to_numeric to convert a Series to float:
dfUnique.apply(complex(pd.to_numeric(dfUnique['ch1_real'],errors='coerce'), pd.to_numeric(dfUnique['ch1_imag'],errors='coerce')))
to no avail.
You can simply multiply by 1j, which denotes the complex number 0+1j (see imaginary literals):
df['ch1_log_mag'] = 20 * np.log10((df.ch1_real + 1j * df.ch1_imag).abs())
complex(df.ch1_real, df.ch1_imag) doesn't work because complex needs float arguments, not whole Series. df.loc[0, df['ch1_real']] is not a valid expression either, as the second argument to .loc must be a column label, not a Series (df.loc[79, 'ch1_real'] would work for accessing an element).
If you want to use apply, it would be 20 * np.log10(df.apply(lambda x: complex(x.ch1_real, x.ch1_imag), axis=1).abs()), but as apply is just a disguised loop over the rows of the dataframe, it's not recommended performance-wise.
There's no difference between df.ch1_real and df['ch1_real']; it's a matter of personal preference. However, if your column name contains spaces, dots or the like, you must use the bracket form.
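For completeness, a small self-contained sketch comparing the vectorised version with the row-wise apply (the values are just the first few rows from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ch1_real": [0.011960, -0.009766, 0.002197],
    "ch1_imag": [-0.003418, -0.005371, 0.010990],
})

# Vectorised: build a complex Series and take its magnitude
df["ch1_log_mag"] = 20 * np.log10((df.ch1_real + 1j * df.ch1_imag).abs())

# Row-wise apply gives the same numbers, but loops over rows in Python
mag = df.apply(lambda row: abs(complex(row.ch1_real, row.ch1_imag)), axis=1)
assert np.allclose(df["ch1_log_mag"], 20 * np.log10(mag))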
I converted all my categorical independent variables from strings to numeric values (binary 1's and 0's) using OneHotEncoder, but when I run a decision tree the algorithm treats the binary categorical variable as continuous.
For example, if gender is one of my independent variables, I converted male to 1 and female to 0. When I use this in a decision tree the node splits at 0.5, which makes no sense.
How do I convert this numeric continuous variable to a numeric categorical one?
"How do I convert this numeric continuous variable to a numeric categorical one?"
If the result is the same, would you need to?
"For example, if gender is one of my independent variables, I converted male to 1 and female to 0. When I use this in a decision tree the node splits at 0.5, which makes no sense."
Maybe I am wrong, but this split makes sense to me.
Let's say we have a decision tree with a categorical split rule.
The division is binary: "0" goes left and "1" goes right (in this case).
Now, how can we optimize this rule? Instead of checking whether a value is "0" or "1", we can replace the two checks with a single one: "0" goes left and everything else goes right. That same check can then be expressed as a float comparison: < 0.5 goes left, everything else goes right.
In code, it would be as simple as:
Case 1:
if value == "0":
    tree.left()
elif value == "1":
    tree.right()
else:
    pass  # if you work with binary values, this branch will never happen, so it's useless

Case 2:
if value == "0":
    tree.left()
else:
    tree.right()

Case 3:
if value < 0.5:
    tree.left()
else:
    tree.right()
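If it helps, here is a tiny sketch showing that scikit-learn's DecisionTreeClassifier really does express this split as a 0.5 threshold on a 0/1 column (the data below is made up):
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# 0 = female, 1 = male; the label here depends only on gender
X = np.array([[0], [0], [1], [1], [0], [1]])
y = np.array([0, 0, 1, 1, 0, 1])

tree = DecisionTreeClassifier().fit(X, y)
# Prints a tree whose only split is "gender_male <= 0.50"
print(export_text(tree, feature_names=["gender_male"]))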
There are basically two ways to deal with this. You can use:
Integer encoding (if the categorical variable is ordinal in nature, like size)
One-hot encoding (if the categorical variable is nominal in nature, i.e. has no inherent order, like gender)
It seems you have implemented one-hot encoding incorrectly for this problem. What you are actually using is simple integer encoding (or binary encoding, to be more specific). Correctly implemented one-hot encoding ensures that no artificial ordering or bias is introduced into the converted values, so the results of the machine learning algorithm are not swayed in favour of a variable just because of its sheer magnitude.
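As a sketch of what a one-hot encoding looks like with pd.get_dummies (note that for a strictly binary variable it collapses back to a single 0/1 column when drop_first=True):
import pandas as pd

X = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Full one-hot encoding: one 0/1 column per category
print(pd.get_dummies(X, columns=["gender"]))

# With drop_first=True you keep n-1 columns, which for a binary
# variable is exactly the single male/female indicator
print(pd.get_dummies(X, columns=["gender"], drop_first=True))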
Do the names/order of the columns of my X_test dataframe have to be the same as those of the X_train dataframe I use for fitting?
Below is an example
I am training my model with:
model.fit(X_train,y)
where X_train = data[['var1', 'var2']]
But then during prediction, when I use:
model.predict(X_test)
X_test is defined as: X_test = data[['var1', 'var3']]
where var3 could be a completely different variable than var2.
Does predict assume that var3 is the same as var2 because it is the second column in X_test?
What if:
X_live was defined as: X_live = data[['var2', 'var1']]
Would predict know to re-order the columns of X_live to line them up correctly?
The names of your columns don't matter, but the order does. You need to make sure that the order is consistent between your training and test data. If you pass in two columns in your training data, your model will assume that any future inputs are those same features in that order.
Just a really simple thought experiment. Imagine you train a model that subtracts two numbers. The features are (n_1, n_2), and your output is going to be n_1 - n_2.
Your model doesn't process the names of your columns (since only numbers are passed in), and so it learns the relationship between the first column, the second column, and the output - namely output = col_1 - col_2.
Regardless of what you pass in, you'll get the result of the first thing you passed in minus the second thing you passed in. You can name those two inputs whatever you want, but at the end of the day you'll still get the result of the subtraction.
To get a little more technical, what's going on inside your model is mostly a series of matrix multiplications. You pass in the input matrix, the multiplications happen, and you get what comes out. Training the model just "tunes" the values in the matrices that your inputs get multiplied by with the intention of maximizing how close the output of these multiplications is to your label. If you pass in an input matrix that isn't like the ones it was trained on, the multiplications still happen, but you'll almost certainly get a terribly wrong output. There's no intelligent feature rearranging going on underneath.
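Here is a toy version of the subtraction thought experiment, using scikit-learn's LinearRegression on plain NumPy arrays (so no column names are involved at all):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))
y = X_train[:, 0] - X_train[:, 1]          # output = col_1 - col_2

model = LinearRegression().fit(X_train, y)

# Same two numbers, columns swapped: the prediction flips sign,
# because the model only knows positions, not names
print(model.predict(np.array([[5.0, 2.0], [2.0, 5.0]])))  # roughly [ 3., -3.]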
First, to answer your question "Does predict assume that var3 is the same as var2 because it is the second column in X_test?":
No; no machine learning model makes any such assumption about the data you pass into the fit or predict functions. What the model sees is simply an array of numbers, possibly a multidimensional array of higher order. It is entirely up to the user to keep track of the features.
Let's take a simple classification problem where you have two groups: the first is a group of kids, with short height and therefore low weight; the second is a group of adults, with higher age, height and weight.
Now you want to classify the individual below into one of the two classes.
Age   Height   Weight
10    120      34
Any well-trained classifier can easily assign this data point to the group of kids, since the age and weight are small. The vector the model sees is [10, 120, 34].
But now let us reorder the feature columns as [120, 10, 34]. You know that 120 is meant to refer to the height of the individual, not the age, but the model cannot know what you intend, and it is likely to classify the point into the group of adults.
Hope that answers both your questions.
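One practical safeguard when both sets are pandas DataFrames (X_train, X_test and model here are placeholders for your own objects) is to reindex the test columns to the training columns before predicting:
# Reorder (and sanity-check) the test columns to match training
X_test = X_test[X_train.columns]
predictions = model.predict(X_test)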
The following is a small snippet from the full code.
I am trying to understand the logic behind this method of splitting.
A SHA-1 digest is 40 hexadecimal characters. What kind of probability is being computed in the expression?
What is the reason for (MAX_NUM_IMAGES_PER_CLASS + 1)? Why add 1?
Does setting different values for MAX_NUM_IMAGES_PER_CLASS affect the quality of the split?
How good a split would we get out of this? Is this a recommended way of splitting datasets?
# We want to ignore anything after '_nohash_' in the file name when
# deciding which set to put an image in, the data set creator has a way of
# grouping photos that are close variations of each other. For example
# this is used in the plant disease data set to group multiple pictures of
# the same leaf.
hash_name = re.sub(r'_nohash_.*$', '', file_name)
# This looks a bit magical, but we need to decide whether this file should
# go into the training, testing, or validation sets, and we want to keep
# existing files in the same set even if more files are subsequently
# added.
# To do that, we need a stable way of deciding based on just the file name
# itself, so we do a hash of that and then use that to generate a
# probability value that we use to assign it.
hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
percentage_hash = ((int(hash_name_hashed, 16) %
                    (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                   (100.0 / MAX_NUM_IMAGES_PER_CLASS))
if percentage_hash < validation_percentage:
    validation_images.append(base_name)
elif percentage_hash < (testing_percentage + validation_percentage):
    testing_images.append(base_name)
else:
    training_images.append(base_name)
result[label_name] = {
    'dir': dir_name,
    'training': training_images,
    'testing': testing_images,
    'validation': validation_images,
}
This code is simply distributing file names “randomly” (but reproducibly) over a number of bins and then grouping the bins into just the three categories. The number of bits in the hash is irrelevant (so long as it’s “enough”, which is probably about 35 for this sort of work).
Reducing modulo n+1 produces a value on [0,n], and multiplying that by 100/n obviously produces a value on [0,100], which is being interpreted as a percentage. n being MAX_NUM_IMAGES_PER_CLASS is meant to control the rounding error in the interpretation to be no more than “one image”.
This strategy is reasonable, but looks a bit more sophisticated than it is (since there is still rounding going on, and the remainder introduces a bias—although with numbers this large it is utterly unobservable). You could make it simpler and more accurate by simply precalculating ranges over the whole space of 2^160 hashes for each class and just checking the hash against the two boundaries. That still notionally involves rounding, but with 160 bits it’s only that intrinsic to representing decimals like 31% in floating point.
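A sketch of the simpler boundary-check variant suggested above (the function name and default percentages are illustrative, and the _nohash_ stripping from the original snippet is omitted):
import hashlib

HASH_SPACE = 2 ** 160  # SHA-1 digests cover the integers [0, 2**160)

def which_set(file_name, validation_percentage=10, testing_percentage=10):
    """Stable, name-based assignment of a file to a split."""
    h = int(hashlib.sha1(file_name.encode("utf-8")).hexdigest(), 16)
    validation_boundary = HASH_SPACE * validation_percentage // 100
    testing_boundary = HASH_SPACE * (validation_percentage + testing_percentage) // 100
    if h < validation_boundary:
        return "validation"
    elif h < testing_boundary:
        return "testing"
    return "training"

# The same file name always lands in the same set, even as new files arrive
assert which_set("leaf_001.jpg") == which_set("leaf_001.jpg")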