Having OneHotEncoder manage unseen values at the transform step - python

I am using sklearn.preprocessing.OneHotEncoder to encode categorical data of the form
A=array([[1,4,1],[0,3,2]])
B=array([[1,4,7],[0,3,2]])
Suppose I use A at the .fit(A) step and B at some later point as new data for .transform(B). If B contains values unseen with respect to A, doing so produces a feature out of bounds error. Is it possible for B to contain new unseen values such that the transform step sets all binaries to zero for the unseen value?
ValueError: Feature out of bounds. Try setting n_values.
I understand I can change the feature bounds at .fit time. But if I am using A as training data, then each time I get a new set B to predict, I would have to mess with my initial encoding.
Thanks.

Is it possible to have B containing new unseen values such that the transform step sets all binaries to zero for the unseen value?
No, but it would be nice if OneHotEncoder did that, so I've opened an issue for this. For now, you'll just have to set n_values a bit higher.
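For example, a minimal sketch assuming the legacy n_values API (this parameter was later removed in favour of categories/handle_unknown), where each entry declares how many integer values the corresponding column may take:

from numpy import array
from sklearn.preprocessing import OneHotEncoder

A = array([[1, 4, 1], [0, 3, 2]])
B = array([[1, 4, 7], [0, 3, 2]])

# Old-style workaround: reserve enough slots per feature up front so that
# the value 7 in B's third column (unseen in A) is still within bounds.
onehot = OneHotEncoder(n_values=[2, 5, 8])
onehot.fit(A)
B_enc = onehot.transform(B)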

This feature has now been added to OneHotEncoder. You can enable it by setting the parameter handle_unknown='ignore'.
For example:
from numpy import array
from sklearn.preprocessing import OneHotEncoder

A = array([[1, 4, 1], [0, 3, 2]])
B = array([[1, 4, 7], [0, 3, 2]])

onehot = OneHotEncoder(handle_unknown='ignore')
A = onehot.fit_transform(A)
B = onehot.transform(B)
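With these small arrays, the unknown value 7 in B's third column should simply get all-zero indicator columns; the output below is what I'd expect from the sorted categories, so treat the exact column order as illustrative:

print(B.toarray())
# [[0. 1. 0. 1. 0. 0.]   <- 7 was never seen, so both columns for feature 3 are 0
#  [1. 0. 1. 0. 0. 1.]]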

Related

Can we standardize a numerical column which actually is categorical?

I have the House Prices - Advanced Regression Techniques Data set. I need to do Lasso and Ridge Regularization on it. I saved the train data in the variable named house. Typed the following code:
house.info()
Got this output (screenshot of house.info() omitted).
There are columns in this data set which are numerical (int64 and float64) but which are actually categorical (both ordinal and nominal).
I wanted to ask whether I can standardize these categorical variables, or whether I should first convert them to type "object" using house[col_name] = house[col_name].astype(str), then one-hot encode these variables and standardize the rest of the numerical columns?
When a column is categorical (nominal), it is possible to apply one-hot encoding; this way each categorical column is expanded into a binary indicator column per category.
import pandas as pd

raw_df = pd.get_dummies(data=raw_df,
                        columns=['col1', 'col2', 'col3'],
                        prefix=['feature1', 'feature2', 'feature3'])
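If you decide to keep the category-like columns as categories and standardize only the genuinely numerical ones, one possible sketch uses scikit-learn's ColumnTransformer; the column names below are just examples from the House Prices data, so adjust them to your own frame:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Example columns only: MSSubClass/OverallQual are numerically coded categories,
# GrLivArea/LotArea are genuinely continuous.
categorical_cols = ['MSSubClass', 'OverallQual']
numerical_cols = ['GrLivArea', 'LotArea']

preprocess = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('scale', StandardScaler(), numerical_cols),
])
X_prepared = preprocess.fit_transform(house[categorical_cols + numerical_cols])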

How to get the original value back from binary encoding using the category_encoders package

I have a dataset which includes over 100 countries. I want to include these in an XGBoost model to make a classification prediction. I know that one-hot encoding is the go-to process for this, but I would rather do something that won't increase the dimensionality so much and will be resilient to new values, so I'm trying binary encoding using the category_encoders package. http://contrib.scikit-learn.org/categorical-encoding/binary.html
Using this encoding helped my model out over using basic one-hot encoding, but how do I get back to the original labels after encoding?
I know about the inverse_transform method, but that operates on the whole data frame. I need a way to put in a binary (or integer) value and get back the original value.
Here's some example data taken from: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
import numpy as np
import pandas as pd
import category_encoders as ce
# make some data
df = pd.DataFrame({
    'color': ["a", "c", "a", "a", "b", "b"],
    'outcome': [1, 2, 3, 2, 2, 2]})
# split into X and y
X = df.drop('outcome', axis=1)
y = df.drop('color', axis=1)
# instantiate an encoder - here we use Binary()
ce_binary = ce.BinaryEncoder(cols=['color'])
# fit and transform and presto, you've got encoded data
ce_binary.fit_transform(X, y)
I'd like to pass the values [0,0,1] or 1 into a function and get back a as a value.
The main reason for this is for looking at the feature importances of the model. I can get feature importances based on a column, but this will give me back a column id rather than the underlying value of a category that is the most important.
Please note that the article you reference suggests using the Binary Encoder for ordinal data only - that is, discrete data that has an order associated with it (small, medium, large), not nominal data (Red, White, Blue).
If you decide to use a Binary encoder, the order in which colors (or countries) are encoded will impact your performance. For example, assume red=001, white=010, and blue=011. When you apply an ML algorithm, it will see that red and blue have a feature in common (feature 3). This is probably not what you want.
In terms of applying the inverse transformation, you'll need to apply it to [0,0,1] in your example above, not to "1"; "1" is meaningless without context. You should be able to apply the inverse transformation to a single record (row) in your data, but not to a single column. The inverse transform needs to operate on an object with the output dimensions of the transformer.
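For example, a hedged sketch building on the code in the question: slice one encoded row (keeping all of the encoder's output columns) and invert just that record:

encoded = ce_binary.fit_transform(X)
one_row = encoded.iloc[[0]]                    # a single record, still 2-D
print(ce_binary.inverse_transform(one_row))    # should recover 'a' for this row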

sklearn tree treats categorical variable as float during splits, how should I solve this?

I'm having trouble with my regression tree using the sklearn package. It's about a book dataset; the fitted regression tree can be seen below (tree plot omitted).
The problem is with the STORY_LANGUAGE variable. This is a categorical variable with the values 0, 1, 2, and 3, which each correspond to a different language of the book. Before running the model I made sure that STORY_LANGUAGE is a categorical variable, yet the tree still splits it and treats it as a float (e.g. splitting at 1.5).
How should I solve this? Any help is appreciated!
By passing a list of integers as features to scikit-learn, you're telling it that there is some sort of ordering between the values, e.g. that 0 is more closely related to 1 than to 2. To get around this, you will need to do one-hot encoding with the built-in OneHotEncoder. If you have three categories, 0, 1 and 2, a 0 will be converted to [1,0,0], while a 1 will be converted to [0,1,0]. Basically, your one feature is replaced with a vector that is 1 at the position corresponding to its class and 0 otherwise.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# Generate random integers between 0 and 2
x = np.random.randint(0,3, size=(100,1))
# Create the one-hot encoder object, specifying not to use sparse arrays.
m = OneHotEncoder(sparse=False)
# Transform your features
x_one_hot = m.fit_transform(x)
If you are using sklearn's DecisionTreeRegressor, your label-encoded features will simply be treated as numerical features. If you want them to be treated as categories, you can either perform one-hot encoding (e.g. using OneHotEncoder) or use an algorithm that supports categorical features out of the box (e.g. LightGBM).
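As a rough illustration of the second option (the frame below is made up for the sketch), LightGBM's sklearn wrapper picks up pandas 'category' columns as categorical features automatically:

import lightgbm as lgb
import pandas as pd

# Hypothetical data: STORY_LANGUAGE is a language code, price is the target.
df = pd.DataFrame({'STORY_LANGUAGE': [0, 1, 2, 3, 1, 0],
                   'price': [10.0, 12.5, 9.0, 11.0, 13.0, 9.5]})
df['STORY_LANGUAGE'] = df['STORY_LANGUAGE'].astype('category')

model = lgb.LGBMRegressor()
model.fit(df[['STORY_LANGUAGE']], df['price'])  # categorical splits, no 1.5 threshold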

How to Perform a Naive Bayes Classification in Scikit from CSV

I am looking to predict whether someone is a smoker from several columns of demographic data stored in a csv, as well as their smoker status.
The columns used are:
Gender, Age, Race, ServedInMilitary, CountryofBirth, EducationLevel, MaritalStatus, HouseholdIncome, FamilyIncome, ChildrenInHouse, QuantitiyofAlcohol, PerUnitTime, ShortnessOfBreath, Asthma, Exercise, Smoker, SmokedBefore, AgeStartedSmoking.
All columns have numeric, but not necessarily binary values. Could someone help me correct my code to take these factors into account when determining smoker status and then help me measure the accuracy of my classifier?
I have the following code from a similar question: how to Load CSV Data in scikit and using it for Naive Bayes Classification
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

target_names = np.array(['Positives', 'Negatives'])
# add columns to your data frame
data['is_train'] = np.random.uniform(0, 1, len(data)) <= 0.75
data['Type'] = pd.Categorical.from_codes(targets, target_names)
data['Targets'] = targets
# define training and test sets
train = data[data['is_train'] == True]
test = data[data['is_train'] == False]
trainTargets = np.array(train['Targets']).astype(int)
testTargets = np.array(test['Targets']).astype(int)
# columns you want to model
features = data.columns[0:7]
# call Gaussian Naive Bayes class with default parameters
gnb = GaussianNB()
# train model
y_gnb = gnb.fit(train[features], trainTargets).predict(train[features])
# Predict Output
There seems to be a missing line here for the dataframe, but I will assume you have it. If you don't, then read your data using pandas.read_csv.
Also, your columns seem to have data that is both categorical and numerical. For example, the "SmokedBefore" column is likely 1/0 whereas your "Age" column is likely numbers such as 20 or 30.
This makes a difference, because in "SmokedBefore" the intent is not to say that 1>0. The intent is to say Yes/No. If your model assumes that higher (or lower) is better, then this will cause an issue. Therefore it is categorical and should not be treated like a numerical value. It is simply a tag to indicate whether someone has smoked before.
However, in "Age" the intent is to say that 30 is different than 20 by 10. Therefore, it is numerical and should be treated as such.
To handle this, you will need to transform your categorical features into another set of binary features that removes this effect. This is called one-hot encoding. Instead of 1/0 in your "SmokedBefore" column, it will become two columns, "is_1" and "is_0", with the corresponding data. That way, each category gets its own 0/1 column.
You can simply use the OneHotEncoder provided in sklearn. Use the categorical_features argument (available in older scikit-learn versions) to specify which columns hold categorical features.
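For example, with the older scikit-learn API that still had that argument (it has since been removed in favour of ColumnTransformer), a sketch could look like this, building on the train/test split above; the column indices are purely illustrative:

from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# Legacy usage (scikit-learn < 0.22): only the listed column indices are
# one-hot encoded, the remaining numeric columns are passed through as-is.
enc = OneHotEncoder(categorical_features=[0, 2, 5], sparse=False)
X_train = enc.fit_transform(train[features])
X_test = enc.transform(test[features])

gnb = GaussianNB()
gnb.fit(X_train, trainTargets)
accuracy = gnb.score(X_test, testTargets)   # simple accuracy measure on held-out data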

Discretizing continuous variables for RandomForest in Sklearn

I want to use Random Forest for feature selection based on the Gini index. My dataset has a mix of numeric (continuous) and categorical (string) data. This is an example of the dataset:
Var1 Var2
198 zcROj17IEC
336 DHeTmBftjz
252.3 crIgUHSK8h
252 ZSNrjIX0Db
I know trees work on discrete (categorical) data, but does RandomForest in sklearn require continuous numeric data to be discretized first, or can it handle it? For the categorical string variable I used the following to encode the strings into numeric columns of zeros and ones
pandas.get_dummies(X['Var2'])
and it works, but for the numeric column I tried the following to discretize
pandas.qcut(X['Var1'], 2 , retbins=True)
but I keep getting a "non-unique bins" error!
Do I need to discretize? How can I do it?
Random forest supports continuous variables without a problem. See, for example, this sample.
Trees and forests can work worse when you make dummies from your categorical values.
You just need to label-encode your categorical features - that's all!
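A minimal sketch of that label-encoding step on the example above, using pandas.factorize; the forest then sees one integer column instead of many dummies:

import pandas as pd

# Map each unique string in Var2 to an integer code; Var1 stays continuous
# and needs no discretization before fitting the random forest.
X['Var2'] = pd.factorize(X['Var2'])[0]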
