I've been looking for a way to vectorize categorical variables and came across category_encoders, which supports multiple encoding schemes.
I tried TargetEncoder and BinaryEncoder, but the docs don't explain much about how they work.
I'd really appreciate it if anyone could explain how the target encoder and binary encoder work, and how they differ from one-hot encoding.
Target encoding maps the categorical variable to the mean of the target variable. Because it uses the target, steps must be taken to avoid overfitting (usually done with smoothing).
Binary encoding first maps each category to an integer, then writes that integer in binary, with each binary digit getting its own column. It is essentially a form of feature hashing.
Both help with lowering the dimensionality produced by high-cardinality categorical variables (compared to one-hot encoding), which can improve performance for some models, most notably tree-based ones.
I have a data set with 4 anonymous variables, as shown below, and the target variable is also anonymous:
Can someone please tell me how to deal with anonymous features in machine learning? What is the best approach to feature engineering with these anonymous variables, and how can I improve my predictions from these features?
You should take several steps:
1- Scale the numeric features and one-hot encode the categorical ones (you can also encode your categorical variables by their number of appearances, i.e. replace each value with its count)
2- Study the correlation between your target and the other variables
3- Use different plots to get to know your data better
4- Use variable selection methods while modeling
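The first two steps above can be sketched with plain pandas. The tiny DataFrame and its column names are invented for the example; count encoding is shown as the alternative to one-hot mentioned in step 1:

```python
import pandas as pd

df = pd.DataFrame({
    "num":    [1.0, 5.0, 3.0, 9.0],
    "cat":    ["x", "y", "x", "z"],
    "target": [0,   1,   0,   1],
})

# Step 1a: scale the numeric feature (z-score standardization)
df["num_scaled"] = (df["num"] - df["num"].mean()) / df["num"].std()

# Step 1b: count encoding -- replace each category with its number of appearances
df["cat_count"] = df["cat"].map(df["cat"].value_counts())

# Step 2: correlation between the target and the engineered features
corr = df[["num_scaled", "cat_count", "target"]].corr()["target"]
```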
I know some people have already answered this; I'm still trying to get it straight, though.
I'm still a little confused about the one-hot encoder. I was thinking that if we encode before splitting, there shouldn't be any 'information leakage' into the test set. So why do people advocate encoding after the split? Isn't the one-hot encoder just used to convert categorical variables into binary columns?
And if we encode after splitting, the results can vary quite significantly, as was pointed out here: Scikit-Learn One-hot-encode before or after train/test split
I'm just wondering what the industry norm is.
Thanks
Specifically for the One-Hot-Encoder, it should not make much difference, except when there are categories that are not represented in a split.
But in that case, there is information leakage. By splitting training/test data, you are trying to simulate how well your model (and that includes all feature selection/transformation!) generalizes. If there are categories present in the test set but not the training set, then there can surely be categories in the real world that your whole data set does not contain either. In that case you are fooling yourself if you encode before splitting.
There are cases where you would want to encode before, though. If you have few data points and are sampling to get balanced splits, you might want to ensure each split gets all the categories, or something like that. In such cases it might be useful to encode before.
In general, always keep in mind that feature selection and transformation are part of your model. One-hot encoding in particular depends on the data, so that applies even more.
One-hot encoding is a technique for representing the class of a data item. It is an alternative to integer encoding, where you simply assign each class an integer. A simple example:
Let's say, we have 3 classes: Cat, Dog, Human
In integer encoding we would give the classes as (say):
Cat - 1, Dog - 2, Human - 3
In One-hot encoding, we would do these classes as:
Cat - [1,0,0], Dog - [0,1,0], Human - [0,0,1]
So, as you can see, one-hot encoding works only for categorical data!
Hence the whole dataset has to be labeled in a homogeneous manner, and so the one-hot encoding has to be performed even before the train-test split.
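The Cat/Dog/Human example above can be written out in a few lines of plain Python:

```python
classes = ["Cat", "Dog", "Human"]

def one_hot(label, classes):
    """Return a vector with a single 1 at the label's position."""
    vec = [0] * len(classes)
    vec[classes.index(label)] = 1
    return vec

one_hot("Dog", classes)  # [0, 1, 0]
```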
I come to the same conclusion as #em_bis_me. I think most people do it that way just because they saw it in a notebook where somebody did it before, and they simply copy and paste. (Kaggle is the best community to see this: a ton of people copy and paste others' work without stopping to consider whether it is right or wrong.)
Here is an example from Kaggle where they encode after the split:
https://www.kaggle.com/code/prashant111/logistic-regression-classifier-tutorial/notebook
And here is the same dataset encoded before the split:
https://github.com/Enrique1987/machine_learning/blob/master/1_Classification_algorithms/01_Logistic_Regresion_Australian_Weather.ipynb
Of course: same results.
I need to take sequences as training data and the output column as labels, but first I have to apply one-hot encoding to the sequences. As you can see, the sequences vary in length. Please suggest how to apply one-hot encoding to the amino acids so that each is assigned a distinct value.
No one else can determine the best way to bin your data set. That's a decision that can only be made by someone who has a good understanding of the objective and the dataset. ϕ(x), your feature vector, is always very specific to your data.
For example, if you had DNA you might have features for whether a certain codon is present, or bins for the quantity of adenine, etc. This is highly subjective, and even with a good understanding, tuning is a non-trivial task.
You have to be very careful: if you generate the feature vector incorrectly, you might create biases in your data (certain classes tending toward a certain length, quantity of certain amino acids, etc.) that are not truly representative of what you are classifying for. This could lead to deceptive training and test error rates and incorrect conclusions.
Honestly, if you are at a university, I would recommend asking someone in a computer science department (or a similar one) to help contribute to your project. While it might seem tempting to use the pre-baked sklearn encodings, they are not a good fit for your case. You will very likely have outlier cases in terms of sequence length due to the limited quantity of data, and turning each character into its own feature will cause poor fitting performance.
As for actually reading your data into Python: it's a CSV, so you could parse it by hand with open() and split(',') or use one of the popular CSV-parsing libraries. YMMV
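For the parsing step, the standard library's csv module is usually enough. The sample data and column names below are hypothetical stand-ins for the real file:

```python
import csv
import io

# Stand-in for the real file; in practice use open("data.csv") instead of io.StringIO.
raw = "sequence,label\nMKV,1\nMKVLA,0\n"

rows = list(csv.reader(io.StringIO(raw)))
header, data = rows[0], rows[1:]
sequences = [r[0] for r in data]        # variable-length amino-acid strings
labels = [int(r[1]) for r in data]      # target column
```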
LightGBM has support for categorical variables. I would like to know how it encodes them. It doesn't seem to be one-hot encoding, since the algorithm is pretty fast (I tried it with data that took a long time to one-hot encode).
https://github.com/Microsoft/LightGBM/issues/699#issue-243313657
The basic idea is to sort the histogram according to its accumulated values (sum_gradient / sum_hessian), then find the best split on the sorted histogram, just as with numerical features.
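A toy sketch of that idea in plain Python (the numbers are invented, and the gain formula is the standard simplified G²/(H+λ) ingredient from gradient boosting, not LightGBM's exact code):

```python
# Each category's histogram bin holds accumulated gradient/hessian sums.
bins = {
    "a": {"sum_gradient": -4.0, "sum_hessian": 2.0},
    "b": {"sum_gradient": 1.0,  "sum_hessian": 1.0},
    "c": {"sum_gradient": 3.0,  "sum_hessian": 2.0},
}

# Sort categories by sum_gradient / sum_hessian, as the issue describes.
ordered = sorted(bins, key=lambda c: bins[c]["sum_gradient"] / bins[c]["sum_hessian"])

def split_gain(gl, hl, gr, hr, lam=1.0):
    # Simplified split score: G^2 / (H + lambda) on each side of the split.
    return gl * gl / (hl + lam) + gr * gr / (hr + lam)

total_g = sum(b["sum_gradient"] for b in bins.values())
total_h = sum(b["sum_hessian"] for b in bins.values())

# Scan thresholds over the sorted order, just like a numerical feature:
# {a} vs {b, c}, then {a, b} vs {c}; keep the best gain.
best = None
gl = hl = 0.0
for cat in ordered[:-1]:
    gl += bins[cat]["sum_gradient"]
    hl += bins[cat]["sum_hessian"]
    gain = split_gain(gl, hl, total_g - gl, total_h - hl)
    if best is None or gain > best[0]:
        best = (gain, cat)
```

This is why it is fast: sorting k categories and scanning once is O(k log k), versus building k one-hot columns.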
I am working with a medical data set that contains many variables with discrete outputs, for example: type of anesthesia, infection site, diabetes y/n. To deal with this I have been converting them into multiple columns of ones and zeros and then removing one column so there is no direct linear dependence between them, but I was wondering if there is a more efficient way of doing this.
It depends on the purpose of the transformation. Converting categories to numerical labels may not make sense if the ordinal representation does not correspond to the logic of the categories. In that case, the "one-hot" encoding approach you have adopted is the best way to go if (as I surmise from your post) the intention is to use the generated variables as input to some sort of regression model. You can achieve what you are looking for with pandas.get_dummies.
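Concretely, `drop_first=True` performs the "remove one column" step from the question automatically. The anesthesia values below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"anesthesia": ["general", "local", "spinal", "local"]})

# drop_first=True drops one dummy column per variable, avoiding the
# exact linear dependence among the full set of indicator columns.
dummies = pd.get_dummies(df["anesthesia"], drop_first=True)
```

The dropped category ("general" here, the alphabetically first) becomes the baseline: its rows are all zeros across the remaining columns.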