LabelEncoder for categorical features? - python

This might be a beginner question, but I have seen a lot of people using LabelEncoder() to replace categorical variables with ordinal codes. Many of them pass multiple columns at a time; however, I suspect some of my features end up with the wrong ordinality, and I wonder how that will affect my model. Here is an example:
Input
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
a = pd.DataFrame(['High','Low','Low','Medium'])
le = LabelEncoder()
le.fit_transform(a)
Output
array([0, 1, 1, 2], dtype=int64)
As you can see, the ordinal values are not mapped correctly, since LabelEncoder assigns codes by the sorted (alphabetical) order of the labels rather than by their meaning (it should be High=1, Med=2, Low=3 or vice versa). How drastically can a wrong mapping affect the models, and is there an easy way other than OrdinalEncoder() to map these values properly?

TL;DR: Using a LabelEncoder to encode ordinal, or any other kind of, features is a bad idea!
This is in fact clearly stated in the docs, where it is mentioned that, as its name suggests, this encoding method is aimed at encoding the label:
This transformer should be used to encode target values, i.e. y, and not the input X.
As you rightly point out in the question, mapping the inherent ordinality of an ordinal feature to a wrong scale will have a very negative impact on the performance of the model (in proportion to the relevance of the feature). The same applies to a nominal categorical feature, except that there the encoding imposes an order the original feature never had.
An intuitive way to think about it is through the way a decision tree sets its boundaries. During training, a decision tree learns the best feature to split on at each node, as well as a threshold that sends unseen samples down one branch or the other depending on their value.
If we encode an ordinal feature using a simple LabelEncoder, we could end up with, say, 1 representing warm, 2 representing hot, and 0 representing boiling. In that case the tree needs an unnecessarily high number of splits, and hence a much higher complexity, for what should be simple to model.
Instead, the right approach would be to use an OrdinalEncoder, and define the appropriate mapping schemes for the ordinal features. Or in the case of having a categorical feature, we should be looking at OneHotEncoder or the various encoders available in Category Encoders.
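As a minimal sketch of that recommendation, using the High/Low/Medium column from the question: LabelEncoder sorts the labels alphabetically, whereas OrdinalEncoder accepts an explicit category order. The column name "level" and the plain dictionary mapping at the end are illustrative assumptions, not part of the original post.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Same values as in the question; the column name "level" is made up here.
a = pd.DataFrame({"level": ["High", "Low", "Low", "Medium"]})

# LabelEncoder sorts the labels alphabetically: High, Low, Medium.
le = LabelEncoder().fit(a["level"])
label_codes = le.transform(a["level"])

# OrdinalEncoder lets us state the intended order explicitly.
enc = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordinal_codes = enc.fit_transform(a[["level"]]).ravel()

# A plain pandas alternative: map each label to its rank by hand.
manual_codes = a["level"].map({"Low": 0, "Medium": 1, "High": 2})

print(list(le.classes_), label_codes, ordinal_codes, manual_codes.tolist())
```

The dictionary mapping is probably the simplest option outside scikit-learn, at the cost of not being a fitted transformer that can be placed in a pipeline.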
Seeing why this is a bad idea in practice will be more intuitive than words alone.
Let's use a simple example to illustrate the above, consisting of two ordinal features, a range with the number of hours spent by a student preparing for an exam and the average grade of all previous assignments, and a target variable indicating whether the exam was passed or not. I've defined the dataframe's columns as pd.Categorical:
df = pd.DataFrame(
    # column names use underscores so they match the output and the later code
    {'Hours_of_dedication': pd.Categorical(
        # first value set to '20-25' to agree with the printed output below
        values=['20-25', '20-25', '5-10', '5-10', '40-45',
                '0-5', '15-20', '20-25', '30-35', '5-10',
                '10-15', '45-50', '20-25'],
        categories=['0-5', '5-10', '10-15', '15-20',
                    '20-25', '25-30', '30-35', '40-45', '45-50']),
     'Assignments_avg_grade': pd.Categorical(
        values=['B', 'C', 'F', 'C', 'B',
                'D', 'C', 'A', 'B', 'B',
                'B', 'A', 'D'],
        categories=['F', 'D', 'C', 'B', 'A']),
     'Result': pd.Categorical(
        values=['Pass', 'Pass', 'Fail', 'Fail', 'Pass',
                'Fail', 'Fail', 'Pass', 'Pass', 'Fail',
                'Fail', 'Pass', 'Pass'],
        categories=['Fail', 'Pass'])
     }
)
The advantage of defining a categorical column as a pandas categorical is that we get to establish an order among its categories. This allows for much faster sorting based on the established order rather than lexical sorting. It can also be used as a simple way to get codes for the different categories according to their order.
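A small sketch of what that order buys us, using made-up Low/Medium/High values rather than the dataframe from this answer: the codes follow the declared category order, and sorting follows it too when ordered=True is set.

```python
import pandas as pd

# Illustrative values; the category list defines the order, not the alphabet.
levels = pd.Categorical(
    ["High", "Low", "Low", "Medium"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)

codes = levels.codes                                 # position of each value in `categories`
sorted_levels = pd.Series(levels).sort_values().tolist()

print(codes)
print(sorted_levels)
```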
So the dataframe we'll be using looks as follows:
print(df.head(10))
Hours_of_dedication Assignments_avg_grade Result
0 20-25 B Pass
1 20-25 C Pass
2 5-10 F Fail
3 5-10 C Fail
4 40-45 B Pass
5 0-5 D Fail
6 15-20 C Fail
7 20-25 A Pass
8 30-35 B Pass
9 5-10 B Fail
The corresponding category codes can be obtained with:
X = df.apply(lambda x: x.cat.codes)
X.head(10)
Hours_of_dedication Assignments_avg_grade Result
0 4 3 1
1 4 2 1
2 1 0 0
3 1 2 0
4 7 3 1
5 0 1 0
6 3 2 0
7 4 4 1
8 6 3 1
9 1 3 0
Now let's fit a DecisionTreeClassifier, and see how the tree has defined the splits:
from sklearn import tree
dt = tree.DecisionTreeClassifier()
y = X.pop('Result')
dt.fit(X, y)
We can visualise the tree structure using plot_tree:
t = tree.plot_tree(dt,
feature_names = X.columns,
class_names=["Fail", "Pass"],
filled = True,
label='all',
rounded=True)
Is that all?? Well… yes! I've actually set the features in such a way that there is this simple and obvious relation between the Hours of dedication feature, and whether the exam is passed or not, making it clear that the problem should be very easy to model.
Now let's try to do the same by directly encoding all features with an encoding scheme we could have obtained for instance through a LabelEncoder, so disregarding the actual ordinality of the features, and just assigning a value at random:
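from matplotlib import rcParams
df_wrong = df.copy()
# set_categories(..., inplace=True) was removed in pandas 2.0, so assign the result back
df_wrong['Hours_of_dedication'] = df_wrong['Hours_of_dedication'].cat.set_categories(
    ['0-5', '40-45', '25-30', '10-15', '5-10', '45-50', '15-20',
     '20-25', '30-35'])
df_wrong['Assignments_avg_grade'] = df_wrong['Assignments_avg_grade'].cat.set_categories(
    ['A', 'C', 'F', 'D', 'B'])
rcParams['figure.figsize'] = 14, 18
# drop(columns=...) replaces the positional axis argument, which newer pandas rejects
X_wrong = df_wrong.drop(columns=['Result']).apply(lambda x: x.cat.codes)
y = df_wrong.Result
dt_wrong = tree.DecisionTreeClassifier()
dt_wrong.fit(X_wrong, y)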
t = tree.plot_tree(dt_wrong,
feature_names = X_wrong.columns,
class_names=["Fail", "Pass"],
filled = True,
label='all',
rounded=True)
As expected, the tree structure is way more complex than necessary for the simple problem we're trying to model. In order for the tree to correctly predict all training samples it has expanded to a depth of 4, when a single node should suffice.
This implies that the classifier is likely to overfit, since we’re drastically increasing the complexity. Pruning the tree and tuning the necessary parameters to prevent overfitting does not solve the problem either, since we’ve added too much noise by wrongly encoding the features.
So to summarize, preserving the ordinality of the features when encoding them is crucial; otherwise, as this example makes clear, we'll lose all their predictive power and just add noise to our model.
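The same effect can be reproduced without plotting. The sketch below is a toy reduction of the example, not the dataframe used above: it assumes one feature with nine ordered ranges, a made-up rule that the exam is passed from the fifth range upwards, and a hypothetical scrambled code assignment. It then compares the depth each tree needs to fit the data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Ordered codes 0..8 for nine hour ranges, and a toy pass/fail rule.
ordered = np.arange(9)
y = (ordered >= 4).astype(int)

# Hypothetical scrambled codes: entry i is the code given to range i.
scrambled = np.array([0, 7, 3, 4, 1, 8, 6, 2, 5])

tree_ordered = DecisionTreeClassifier(random_state=0).fit(ordered.reshape(-1, 1), y)
tree_scrambled = DecisionTreeClassifier(random_state=0).fit(scrambled.reshape(-1, 1), y)

depth_ordered = tree_ordered.get_depth()
depth_scrambled = tree_scrambled.get_depth()

print("depth with ordered codes:  ", depth_ordered)
print("depth with scrambled codes:", depth_scrambled)
```

With the ordered codes a single threshold separates the classes; with the scrambled codes, passes and fails interleave along the axis, so the tree has to keep splitting.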

Related

How to implement a function through scikit FunctionTransformer() that refers to two columns of a data frame ('kw_args' argument?)

While working on my submission for the famous Kaggle Titanic dataset (890 rows/11 columns) I would like to execute all of my 'Feature Engineering' steps within one scikit pipeline. However, I could barely find any online examples that demonstrate how to use the scikit FunctionTransformer() in order to execute slightly more complex custom functions, especially functions that refer to more than one column of the dataset.
In my concrete example, I would like to replace NaN values in the column 'Age' depending on the passenger class (column 'Pclass'). Possible passengers classes are 1, 2 or 3 and the corresponding ages that should replace the NaN values are 38, 30 and 25. My current code looks like this:
def impute_age_class(df, column_1, column_2):
    for i in range(len(df)):
        if np.isnan(df[column_1].iloc[i]):
            if df[column_2].iloc[i] == 1:
                df[column_1].iloc[i] = 38
            elif df[column_2].iloc[i] == 2:
                df[column_1].iloc[i] = 30
            else:
                df[column_1].iloc[i] = 25
    return df
age_transformers = [("impute_age_class", FunctionTransformer(impute_age_class,validate=False, kw_args={'column_1': 'Age', 'column_2': 'Pclass'}), ["Age", "Pclass"])]
The code seems to run, and I get a slightly better accuracy score with my logreg model, but I also get the warnings shown in this picture:
[image: warning message, not preserved]
I would be very thankful if you could give me any hints on whether the syntax of my code could be improved in order to avoid these warnings and ensure correct execution.
That warning is very common, and worth reading up on. But it's also not great to be looping over the rows of a dataframe. You can use pandas's own fillna for this:
def impute_age_class(df, fillme, groupby):
    df = df.copy()  # work on a copy so the caller's frame is not modified
    df.loc[:, fillme] = df[fillme].fillna(
        value=df[groupby].map(
            {1: 38, 2: 30, 3: 25})
    )
    return df
tfmr = FunctionTransformer(
    impute_age_class,
    validate=False,
    # column names as spelled in the question
    kw_args={'fillme': 'Age', 'groupby': 'Pclass'}
)
It's a little unusual to have the parameters for the two column names when you are hard-coding the mapping inside the function. And if you didn't have the mapping already in mind, it'd be better to learn it at fit time and then transform train and test sets with that mapping: see SimpleImputer with groupby and https://datascience.stackexchange.com/q/71856/55122.
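To illustrate the "learn it at fit time" suggestion, here is a minimal sketch of a custom transformer. The class name GroupMedianImputer, its parameters, the use of the median as the statistic, and the toy data are all assumptions made for illustration; this is not an existing scikit-learn class.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fill NaNs in `target` with the per-`group` median."""

    def __init__(self, target="Age", group="Pclass"):
        self.target = target
        self.group = group

    def fit(self, X, y=None):
        # Learn the mapping from the training data only.
        self.medians_ = X.groupby(self.group)[self.target].median()
        self.fallback_ = X[self.target].median()  # for groups unseen during fit
        return self

    def transform(self, X):
        X = X.copy()
        fill = X[self.group].map(self.medians_).fillna(self.fallback_)
        X[self.target] = X[self.target].fillna(fill)
        return X


# Toy data, chosen so the learned medians are 38, 30 and 25.
train = pd.DataFrame({"Age": [40.0, 36.0, np.nan, 30.0, 20.0, 30.0],
                      "Pclass": [1, 1, 1, 2, 3, 3]})
test = pd.DataFrame({"Age": [np.nan, np.nan, 50.0], "Pclass": [1, 3, 2]})

imputer = GroupMedianImputer().fit(train)
test_filled = imputer.transform(test)
print(test_filled)
```

Because the medians are stored in fit, the same values are applied to the train and test sets, which a hard-coded FunctionTransformer cannot guarantee once the data changes.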

pandas categories new levels

How does pandas categorical https://pandas.pydata.org/pandas-docs/stable/categorical.html handle new and unseen levels? I am thinking about a scikit-learn like setup. Currently, I have something like:
https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce
def: fit()
    for each column:
        fit a label encoder
def: transform()
    for each column:
        check if column was unseen
            yes (unseen): replace
            no: label encode
but this is pretty slow.
Apparently, tree-based libraries like XGBoost or LightGBM can handle categorical data directly, i.e. one would not need to fiddle around manually with this slow conversion.
But when looking at their code
https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L532 they seem to use LGBMLabelEncoder, which is a standard scikit-learn LabelEncoder.
I wonder how that can handle unseen data.
If a manual conversion is required would pandas.Categorical allow a quicker conversion - even if unseen levels are in the new data?
edit
Please see https://github.com/geoHeil/pythonQuestions/blob/master/categorical-encoding.ipynb for an overview how I could not get scikit-learn's usual suspects to work.
Still looking for something more performant than my solution. Also lightGBM https://github.com/Microsoft/LightGBM/issues/789 suggests to use custom encoding strategy.
There might be a pandas solution, but it probably works best with sklearn's LabelBinarizer:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
df = pd.DataFrame({'A':['a','b','c','a']})
lb = LabelBinarizer()
lb.fit(df["A"])
lb.transform(df["A"])
[[1 0 0]
[0 1 0]
[0 0 1]
[1 0 0]]
df2 = pd.DataFrame({'A':['a','b','d']})
lb.transform(df2['A'])
[[1 0 0]
[0 1 0]
[0 0 0]]
So we see that 'd' is mapped to none of 'a', 'b' or 'c'.
Note, however, that there is a bug which will probably be resolved in one of the next sklearn releases.
The LabelBinarizer is fit during training and remembers the values passed to it. New values get mapped to all zeros. It might be more feasible to write a transformer (as seen here before the edit) using pandas get_dummies.
This could be quite straightforward thanks to name matching of columns. Fit in the first step and store the column names, then just transform in the transform step, keeping only the column names identified during fitting (and adding zero columns for any training levels that are not present in the test set). Then you are done ;)
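A rough sketch of the transformer described above might look like the following. The class name DummyEncoder is made up, and passing dtype=int to get_dummies is a choice made here to keep the output numeric across pandas versions.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DummyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper around pd.get_dummies that remembers the training columns."""

    def fit(self, X, y=None):
        # Store the dummy column names seen during training.
        self.columns_ = pd.get_dummies(X, dtype=int).columns
        return self

    def transform(self, X):
        # Drop columns for unseen levels, add zero columns for missing ones.
        return pd.get_dummies(X, dtype=int).reindex(columns=self.columns_, fill_value=0)


train = pd.DataFrame({"A": ["a", "b", "c", "a"]})
test = pd.DataFrame({"A": ["a", "b", "d"]})

encoder = DummyEncoder().fit(train)
encoded_test = encoder.transform(test)
print(encoded_test)
```

As with the LabelBinarizer example, the unseen level 'd' ends up as a row of zeros, and the column layout always matches the training set.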

How to encode categorical features in sklearn?

I have a dataset with 41 features [from 0 to 40 columns], of which 7 are categorical. This categorical set is divided in two subset:
A subset of string type(the column-features 1, 2, 3)
A subset of int type, in binary form 0 or 1 (the column-features 6, 11, 20, 21)
Furthermore the column-features 1, 2 and 3 (of string type) have cardinality 3, 66 and 11 respectively.
In this context I have to encode them to use support vector machine algorithm.
This is the code that I have:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import feature_extraction
df = pd.read_csv("train.csv")
datanumpy = df.to_numpy() # as_matrix() was removed in pandas 1.0
X = datanumpy[:, 0:41] # select columns 1 through 41 (the features); the end index is exclusive
y = datanumpy[:, 41] # select column 42 (the labels)
I don't know whether it is better to use DictVectorizer() or OneHotEncoder() [for the reasons given above], and above all how to use them [in terms of code] with the X matrix that I have.
Or should I simply assign a number to each category in the string-type subset (since they have high cardinality and so my feature space will grow considerably)?
EDIT
With respect to the int-type subset, I guess the best choice is to keep the column-features as they are (not pass them to any encoder).
The problem persists for the string-type subset with high cardinality.
This is by far the easiest:
df = pd.get_dummies(df, drop_first=True)
If you get a memory overflow or it is too slow then reduce the cardinality:
top = df[col].isin(df[col].value_counts().index[:10])
df.loc[~top, col] = "other"
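For reference, here is a self-contained version of that cardinality reduction on toy data. The column name "service", its values and the cut-off of two categories are placeholders; the step has to run before get_dummies is called.

```python
import pandas as pd

# Toy column with four distinct values; keep the two most frequent ones.
df = pd.DataFrame({"service": ["http"] * 4 + ["smtp"] * 3 + ["ftp"] * 2 + ["ssh"]})
col = "service"
top_k = 2

keep = df[col].value_counts().index[:top_k]            # most frequent categories
df[col] = df[col].where(df[col].isin(keep), "other")   # everything else becomes "other"

dummies = pd.get_dummies(df[col], dtype=int)
print(dummies.sum())
```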
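The One Hot Encoder needs to have seen every category it will later be asked to encode. One way to guarantee that is to fit it on the combined dataset (train and test); otherwise it may not form a proper encoding. Alternatively, fit it on the training data only and pass handle_unknown='ignore', so that unseen test categories are encoded as all zeros.
And performance-wise I think One Hot Encoder is much better than DictVectorizer.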
You can use the pandas method .get_dummies() as suggested by @simon above, or you can use the sklearn equivalent, OneHotEncoder.
I prefer OneHotEncoder because you can pass it parameters such as the categorical features you want to encode and the number of values to keep for each feature (if not given, it is inferred from the data). Note that the categorical_features and n_values parameters this refers to were removed in scikit-learn 0.22; in current versions, columns are selected with ColumnTransformer and the set of values is given through the categories parameter.
If, for some features, the cardinality is too big, impose a low n_values.
If you have enough memory, don't worry: encode all the values of your features.
I guess for a cardinality of 66, if you have a basic computer, encoding all of the 66 values won't lead to a memory issue. Memory overflow usually happens when a feature has, for example, as many values as there are samples in your dataset (the case for IDs that are unique per sample). The bigger the dataset, the more likely you'll get a memory issue.
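With a recent scikit-learn, one way to encode only the string columns and pass the binary and numeric ones through unchanged is a ColumnTransformer. The sketch below uses a made-up three-column frame (the names "protocol", "flag" and "duration" are placeholders, not the asker's data) and handle_unknown="ignore" so that categories unseen during fit are encoded as all zeros.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Placeholder data: one string feature, one binary feature, one numeric feature.
train = pd.DataFrame({"protocol": ["tcp", "udp", "icmp", "tcp"],
                      "flag": [0, 1, 0, 1],
                      "duration": [1.0, 2.5, 0.3, 4.0]})
test = pd.DataFrame({"protocol": ["udp", "sctp"],   # "sctp" never appears in train
                     "flag": [1, 0],
                     "duration": [0.7, 3.1]})

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["protocol"])],
    remainder="passthrough",   # leave the binary and numeric columns as they are
    sparse_threshold=0,        # always return a dense array in this small example
)

X_train = ct.fit_transform(train)
X_test = ct.transform(test)
print(X_test)
```

The encoder is fit on the training frame only, so the test set never influences the encoding; the resulting arrays can be fed to an SVM, ideally after scaling the numeric columns.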

Possible ways to do one hot encoding in scikit-learn?

I have a pandas data frame with some categorical columns. Some of these contain non-integer values.
I currently want to apply several machine learning models to this data. With some models, some preprocessing is necessary to get better results, for example converting categorical variables into dummy/indicator variables. Indeed, pandas has a function called get_dummies for that purpose. However, the result of this function depends on the data. So if I call get_dummies on the training data and then call it again on the test data, the columns obtained in the two cases can differ, because a categorical column in the test data can contain just a subset of, or a different set of, the values present in the training data.
Therefore, I am looking for other methods to do one-hot coding.
What are possible ways to do one hot encoding in python (pandas/sklearn)?
Scikit-learn provides an encoder sklearn.preprocessing.LabelBinarizer.
For encoding training data you can use fit_transform which will discover the category labels and create appropriate dummy variables.
import sklearn.preprocessing
label_binarizer = sklearn.preprocessing.LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)
For the test data you can use the same set of categories using transform.
test_mat = label_binarizer.transform(test_df.Label)
In the past, I've found the easiest way to deal with this problem is to use get_dummies and then enforce that the columns match up between test and train. For example, you might do something like:
import numpy as np
import pandas as pd
train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)
# get the columns in train that are not in test
col_to_add = np.setdiff1d(train.columns, test.columns)
# add these columns to test, setting them equal to zero
for c in col_to_add:
    test[c] = 0
# select and reorder the test columns using the train columns
test = test[train.columns]
This will discard information about labels that you haven't seen in the training set, but will enforce consistency. If you're doing cross validation using these splits, I'd recommend two things. First, do get_dummies on the whole dataset to get all of the columns (instead of just on the training set as in the code above). Second, use StratifiedKFold for cross validation so that your splits contain the relevant labels.
Say, I have a feature "A" with possible values "a", "b", "c", "d". But the training data set consists of only three categories "a", "b", "c" as values. If get_dummies is used at this stage, features generated will be three (A_a, A_b, A_c). But ideally there should be another feature A_d as well with all zeros. That can be achieved in the following way :
import pandas as pd
data = pd.DataFrame({"A" : ["a", "b", "c"]})
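# astype("category", categories=...) was removed from pandas; use CategoricalDtype instead
data["A"] = data["A"].astype(pd.CategoricalDtype(categories=["a", "b", "c", "d"]))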
mod_data = pd.get_dummies(data[["A"]])
print(mod_data)
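The output being (recent pandas versions print True/False here unless a dtype such as float is passed to get_dummies):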
A_a A_b A_c A_d
0 1.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0
For the text columns, you can try this
from sklearn.feature_extraction.text import CountVectorizer
data = ['he is good','he is bad','he is strong']
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(data)
To show the output:
for i in range(len(data)):
    print(vectors[i, :].toarray())
Output:
[[0 1 1 1 0]]
[[1 0 1 1 0]]
[[0 0 1 1 1]]

Specify list of possible values for Pandas get_dummies

Suppose I have a Pandas DataFrame like the below and I'm encoding categorical_1 for training in scikit-learn:
data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9],
'categorical_1':['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
dummy_values = pd.get_dummies(data['categorical_1'])
The values for 'categorical_1' are A, B, or C so I end up with 3 columns in dummy_values. However, categorical_1 can in reality take on values A, B, C, D, or E so there is no column represented for values D or E.
In R I would specify levels when factoring that column - is there a corresponding way to do this with Pandas or would I need to handle that manually?
In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this.
First, if you want pandas to take more values, simply add them to the list sent to the get_dummies method:
data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9],
'categorical_1':['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
dummy_values = pd.get_dummies(data['categorical_1'] + ['D','E'])
since in Python, + on lists works as concatenation, so
['A','B','C','B','B'] + ['D','E']
results in
['A', 'B', 'C', 'B', 'B', 'D', 'E']
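Note that appending 'D' and 'E' to the list also adds two extra rows to the result. A closer analogue of R's factor levels, sketched here as an alternative rather than as part of this answer, is to declare the categories on a pandas Categorical, which adds the columns without adding rows. The value 'Z' below is a made-up example of a level outside the declared set.

```python
import pandas as pd

levels = ["A", "B", "C", "D", "E"]
data = {"numeric_1": [12.1, 3.2, 5.5, 6.8, 9.9],
        "categorical_1": ["A", "B", "C", "B", "B"]}
frame = pd.DataFrame(data)

# Declare all possible levels up front, as one would in R.
frame["categorical_1"] = pd.Categorical(frame["categorical_1"], categories=levels)
dummy_values = pd.get_dummies(frame["categorical_1"], dtype=int)

# New data encoded with the same levels; 'Z' is not a declared level and becomes NaN.
new_values = pd.Series(pd.Categorical(["D", "Z"], categories=levels))
new_dummies = pd.get_dummies(new_values, dtype=int)

print(dummy_values)
print(new_dummies)
```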
You wrote: "In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this."
From the machine learning perspective, it is quite redundant. This column is a categorical one, so the value 'D' means nothing at all to a model that has never seen it before. If you are coding the features in unary form (which I assume, since you create a column for each value), it is enough to simply represent these 'D', 'E' values with
A B C
0 0 0
(I assume that you represent the 'B' value with 0 1 0, 'C' with 0 0 1, etc.)
because if there were no such values in the training set, then during testing no model will distinguish between being given the value 'D' or 'Elephant'.
The only reason to do this would be if you expect to add data with 'D' values in the future and do not want to modify the code later. In that case it is reasonable to do it now, even though it could make training a bit more complex (you add a dimension that, for now, carries no information), but that seems a small problem.
If you are not going to encode it in the unary format, but rather want to use these values as one feature, simply with categorical values, then you would not need to create these "dummies" at all, and use a model which can work with such values, such as Naive Bayes, which could simply be trained with "Laplacian smoothing" to be able to work around non-existing values.
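As a rough illustration of that last suggestion, scikit-learn's CategoricalNB can be told how many categories exist (min_categories) and smoothed with alpha=1.0 (Laplace smoothing), so that a category never seen in training still gets a non-zero probability. The integer coding of the levels and the target labels below are invented for the example.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

levels = ["A", "B", "C", "D", "E"]
code = {lvl: i for i, lvl in enumerate(levels)}   # A -> 0, ..., E -> 4

# Training data only contains A, B and C; the labels are made up.
X_train = np.array([[code[v]] for v in ["A", "B", "C", "B", "B"]])
y_train = np.array([0, 1, 0, 1, 1])

# alpha=1.0 is Laplace smoothing; min_categories reserves room for D and E.
model = CategoricalNB(alpha=1.0, min_categories=len(levels))
model.fit(X_train, y_train)

# 'D' never occurred in training, yet prediction still works.
proba = model.predict_proba(np.array([[code["D"]]]))
print(proba)
```

Without min_categories the model would only know about the three categories present in the training data, so declaring the full set is what lets it handle 'D' and 'E'.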
I encountered the same problem as yours, that is how to unify the dummy categories between training data and testing data when using get_dummies() in Pandas. Then I found a solution when exploring the House Price competition in Kaggle, that is to process training data and testing data at the same time. Suppose you have two dataframes df_train and df_test (not containing target data in them).
all_data = pd.concat([df_train,df_test], axis=0)
all_data = pd.get_dummies(all_data)
X_train = all_data[:df_train.shape[0]] # select the processed training data
X_test = all_data[-df_test.shape[0]:] # select the processed testing data
Hope it helps.
Isn't this a better answer?
data = pd.DataFrame({
"values": [1, 2, 3, 4, 5, 6, 7],
"categories": ["A", "A", "B", "B", "C", "C", "D"]
})
possibilites = ["A", "B", "C", "D", "E", "F"]
exists = data["categories"].tolist()
difference = pd.Series([item for item in possibilites if item not in exists])
target = pd.concat([data["categories"], difference]) # Series.append was removed in pandas 2.0
target = target.reset_index(drop=True)
dummies = pd.get_dummies(
target
)
dummies = dummies.drop(dummies.index[list(range(len(dummies)-len(difference), len(dummies)))])
To handle the mismatch between the sets of categorical values in the train and test sets, I used:
length = train_categorical_data.shape[0]
empty_col = np.zeros((length,1))
test_categorical_data_processed = pd.DataFrame()
for col in train_categorical_data.columns:
    test_categorical_data_processed[col] = test_categorical_data.get(col, empty_col)
