How to encode categorical features in sklearn? - python

I have a dataset with 41 features (columns 0 to 40), of which 7 are categorical. The categorical features fall into two subsets:
A subset of string type (columns 1, 2, 3)
A subset of int type, in binary form 0 or 1 (columns 6, 11, 20, 21)
Furthermore, the string columns 1, 2 and 3 have cardinality 3, 66 and 11 respectively.
In this context I have to encode them in order to use a support vector machine.
This is the code that I have:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import feature_extraction
df = pd.read_csv("train.csv")
datanumpy = df.to_numpy()  # as_matrix() was removed in pandas 1.0
X = datanumpy[:, 0:41]  # columns 0 through 40 (the 41 features)
y = datanumpy[:, 41]  # column 41 (the labels)
I don't know whether it is better to use DictVectorizer() or OneHotEncoder() (for the reasons given above), and above all how to use them, in terms of code, with the X matrix that I have.
Or should I simply assign a number to each distinct value in the string subset (since these columns have high cardinality and one-hot encoding would greatly enlarge my feature space)?
EDIT
As for the int subset, I guess the best choice is to keep those columns as they are (not pass them to any encoder).
The problem persists for the string subset with high cardinality.

This is by far the easiest:
df = pd.get_dummies(df, drop_first=True)
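If you run out of memory or it is too slow, reduce the cardinality first by keeping only the most frequent values:
# col is the name of a high-cardinality column; keep its 10 most frequent values
top = df[col].isin(df[col].value_counts().index[:10])
df.loc[~top, col] = "other"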

According to the official documentation of OneHotEncoder, it should be fitted on the combined dataset (train and test). Otherwise it may not produce a proper encoding, because categories that appear only in the test set would be unknown to it.
Performance-wise, I think OneHotEncoder is much better than DictVectorizer.
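If combining train and test is not an option, one alternative is to fit the encoder on the training data only and let it ignore categories it has not seen. The sketch below assumes a hypothetical single-column frame named color; handle_unknown="ignore" is an existing OneHotEncoder option that encodes an unseen category as a row of zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "purple" appears only in the test set.
train = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
test = pd.DataFrame({"color": ["green", "purple"]})

# Fit on the training data only; unseen categories become all-zero rows.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# The default output is a scipy sparse matrix; densify it for inspection.
test_encoded = enc.transform(test).toarray()
print(enc.categories_)  # learned categories, sorted: blue, green, red
print(test_encoded)
```

Whether an all-zero row is acceptable for unseen values depends on the model, so treat this as one option rather than the recommended setup.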

You can use the pandas method .get_dummies() as suggested by @simon above, or you can use the sklearn equivalent, OneHotEncoder.
I prefer OneHotEncoder because you can pass it parameters such as the categorical features you want to encode and the number of values to keep for each feature (if not given, it determines the number automatically).
If the cardinality of some features is too big, impose a low n_values.
If you have enough memory, don't worry: encode all the values of your features.
I guess that for a cardinality of 66, even on a basic computer, encoding all 66 values won't lead to a memory issue. Memory overflow usually happens when a feature has, for example, as many values as there are samples in the dataset (as with IDs that are unique per sample). The bigger the dataset, the more likely you are to hit a memory issue.
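For the original question (string columns to encode, binary and numeric columns to leave alone, then an SVM), a minimal sketch follows. The toy frame and its column names are hypothetical stand-ins for train.csv. Column selection is done with ColumnTransformer, since recent scikit-learn releases no longer take the categorical_features and n_values parameters on the encoder itself.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data: two string columns, one binary column,
# one numeric column and a label.
df = pd.DataFrame({
    "protocol": ["tcp", "udp", "icmp", "tcp", "udp", "tcp"],
    "service": ["http", "dns", "echo", "ftp", "dns", "http"],
    "flag_bin": [0, 1, 0, 1, 0, 1],
    "duration": [1.0, 0.2, 0.1, 3.5, 0.3, 1.2],
    "label": [0, 1, 1, 0, 1, 0],
})
X = df.drop(columns="label")
y = df["label"]

# One-hot encode only the string columns; pass everything else through.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"),
         ["protocol", "service"]),
    ],
    remainder="passthrough",
)

model = make_pipeline(preprocess, SVC())
model.fit(X, y)

# 3 protocol values + 4 service values + 2 passthrough columns.
encoded = model.named_steps["columntransformer"].transform(X)
print(encoded.shape)
```

In practice the numeric columns would usually also be scaled before an SVM; that step is left out to keep the sketch short.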

Related

ValueError: could not convert string to float after converting features to integers for decision tree

I converted my dataset features into integers using the following code:
car_df = pd.DataFrame({'Integer Feature': [0,1,2,3,4,5],
'Categorical Feature': ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']})
This worked. Now, I am trying to create a decision tree and used the following code:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(car_df, y)
However, I get an error stating: ValueError: could not convert string to float: 'buying'
'Buying' is the first categorical feature in the dataset. There are six categorical features.
I thought that would not have been an issue since I converted the features to integers. Does anyone have an idea of how to fix this?
I just pulled this cars dataset so I have a better idea of its contents. Based on the documentation, here are the columns with possible values:
buying: v-high, high, med, low
maint: v-high, high, med, low
doors: 2, 3, 4, 5-more
persons: 2, 4, more
lug_boot: small, med, big
safety: low, med, high
So all of these columns can contain strings and they all need to be converted to numeric type before you can pass the dataset to your model's fit() method.
Per the pandas documentation of the get_dummies() method (https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html), once you have your original dataset in a dataframe (call it df), you can pass it to get_dummies() like this:
import pandas as pd
df_with_dummies = pd.get_dummies(df)
This code will convert any columns with object or category dtype to dummy (indicator) columns and will name each new column using the {original column name}_{original value} convention.
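To show the whole path from string columns to a fitted tree, here is a small sketch. The rows and the labels in y are made up for illustration and only mimic the shape of the cars data; they are not taken from the real dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows using three of the columns listed above.
df = pd.DataFrame({
    "buying": ["v-high", "low", "med", "high"],
    "safety": ["low", "high", "med", "high"],
    "doors": ["2", "4", "5-more", "3"],
})
y = ["unacc", "acc", "acc", "unacc"]  # hypothetical labels

# Every string column becomes one indicator column per distinct value.
df_with_dummies = pd.get_dummies(df)
print(df_with_dummies.columns.tolist())

# The tree now receives only numeric/boolean inputs, so fit() succeeds.
dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(df_with_dummies, y)
```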

LabelEncoder for categorical features?

This might be a beginner question, but I have seen a lot of people using LabelEncoder() to replace categorical variables with ordinal values. Many of them apply it to multiple columns at a time; however, I have some doubts about ending up with the wrong ordinality in some of my features and how that will affect my model. Here is an example:
Input
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
a = pd.DataFrame(['High','Low','Low','Medium'])
le = LabelEncoder()
le.fit_transform(a)
Output
array([0, 1, 1, 2], dtype=int64)
As you can see, the ordinal values are not mapped correctly, since LabelEncoder assigns codes in sorted (alphabetical) order rather than by meaning (it should be High=1, Med=2, Low=3 or vice versa). How drastically can a wrong mapping affect the models, and is there an easy way other than OrdinalEncoder() to map these values properly?
TL;DR: Using a LabelEncoder to encode any kind of feature is a bad idea!
This is in fact clearly stated in the docs, where it is mentioned that as its name suggests this encoding method is aimed at encoding the label:
This transformer should be used to encode target values, i.e. y, and not the input X.
As you rightly point out in the question, mapping the inherent ordinality of an ordinal feature to a wrong scale will have a very negative impact on the performance of the model (that is, proportional to the relevance of the feature). And the same applies to a categorical feature, just that the original feature has no ordinality.
An intuitive way to think about it is to consider how a decision tree sets its boundaries. During training, a decision tree learns the best feature to split on at each node, as well as a threshold that determines which branch an unseen sample follows.
If we encode an ordinal feature using a simple LabelEncoder, we could end up with a feature where, say, 1 represents warm, 2 represents hot, and 0 represents boiling. In that case the result will be a tree with an unnecessarily high number of splits, and hence a much higher complexity for what should be simple to model.
Instead, the right approach would be to use an OrdinalEncoder, and define the appropriate mapping schemes for the ordinal features. Or in the case of having a categorical feature, we should be looking at OneHotEncoder or the various encoders available in Category Encoders.
Though actually seeing why this is a bad idea will be more intuitive than just words.
Let's use a simple example to illustrate the above, consisting of two ordinal features, a range with the number of hours spent by a student preparing for an exam and the average grade of all previous assignments, and a target variable indicating whether the exam was passed or not. I've defined the dataframe's columns as pd.Categorical:
import pandas as pd

# Column names use underscores so they match the output and later code.
df = pd.DataFrame(
    {'Hours_of_dedication': pd.Categorical(
        values=['20-25', '20-25', '5-10', '5-10', '40-45',
                '0-5', '15-20', '20-25', '30-35', '5-10',
                '10-15', '45-50', '20-25'],
        categories=['0-5', '5-10', '10-15', '15-20',
                    '20-25', '25-30', '30-35', '40-45', '45-50']),
     'Assignments_avg_grade': pd.Categorical(
        values=['B', 'C', 'F', 'C', 'B',
                'D', 'C', 'A', 'B', 'B',
                'B', 'A', 'D'],
        categories=['F', 'D', 'C', 'B', 'A']),
     'Result': pd.Categorical(
        values=['Pass', 'Pass', 'Fail', 'Fail', 'Pass',
                'Fail', 'Fail', 'Pass', 'Pass', 'Fail',
                'Fail', 'Pass', 'Pass'],
        categories=['Fail', 'Pass'])
     }
)
The advantage of defining a categorical column as a pandas' categorical, is that we get to establish an order among its categories, as mentioned earlier. This allows for much faster sorting based on the established order rather than lexical sorting. And it can also be used as a simple way to get codes for the different categories according to their order.
So the dataframe we'll be using looks as follows:
print(df.head(10))
Hours_of_dedication Assignments_avg_grade Result
0 20-25 B Pass
1 20-25 C Pass
2 5-10 F Fail
3 5-10 C Fail
4 40-45 B Pass
5 0-5 D Fail
6 15-20 C Fail
7 20-25 A Pass
8 30-35 B Pass
9 5-10 B Fail
The corresponding category codes can be obtained with:
X = df.apply(lambda x: x.cat.codes)
X.head(10)
Hours_of_dedication Assignments_avg_grade Result
0 4 3 1
1 4 2 1
2 1 0 0
3 1 2 0
4 7 3 1
5 0 1 0
6 3 2 0
7 4 4 1
8 6 3 1
9 1 3 0
Now let's fit a DecisionTreeClassifier, and see how the tree has defined the splits:
from sklearn import tree
dt = tree.DecisionTreeClassifier()
y = X.pop('Result')
dt.fit(X, y)
We can visualise the tree structure using plot_tree:
t = tree.plot_tree(dt,
                   feature_names=list(X.columns),  # recent scikit-learn expects a list
                   class_names=["Fail", "Pass"],
                   filled=True,
                   label='all',
                   rounded=True)
Is that all?? Well… yes! I've actually set the features in such a way that there is this simple and obvious relation between the Hours of dedication feature, and whether the exam is passed or not, making it clear that the problem should be very easy to model.
Now let's try to do the same by directly encoding all features with an encoding scheme we could have obtained for instance through a LabelEncoder, so disregarding the actual ordinality of the features, and just assigning a value at random:
from matplotlib import rcParams

df_wrong = df.copy()
# Reassign the result instead of using inplace=True, which recent pandas
# versions no longer accept.
df_wrong['Hours_of_dedication'] = df_wrong['Hours_of_dedication'].cat.set_categories(
    ['0-5', '40-45', '25-30', '10-15', '5-10', '45-50', '15-20',
     '20-25', '30-35'])
df_wrong['Assignments_avg_grade'] = df_wrong['Assignments_avg_grade'].cat.set_categories(
    ['A', 'C', 'F', 'D', 'B'])
rcParams['figure.figsize'] = 14, 18
# drop(columns=...) replaces the positional axis argument drop([...], 1)
X_wrong = df_wrong.drop(columns=['Result']).apply(lambda x: x.cat.codes)
y = df_wrong.Result
dt_wrong = tree.DecisionTreeClassifier()
dt_wrong.fit(X_wrong, y)
t = tree.plot_tree(dt_wrong,
                   feature_names=list(X_wrong.columns),
                   class_names=["Fail", "Pass"],
                   filled=True,
                   label='all',
                   rounded=True)
As expected, the tree structure is far more complex than necessary for the simple problem we're trying to model. In order to predict all training samples correctly, the tree has expanded to a depth of 4, when a single node should suffice.
This implies that the classifier is likely to overfit, since we're drastically increasing the complexity. Pruning the tree and tuning the parameters that prevent overfitting does not solve the problem either, since we've added too much noise by wrongly encoding the features.
So to summarize, preserving the ordinality of the features when encoding them is crucial; otherwise, as this example makes clear, we'll lose all their predictive power and just add noise to our model.
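The OrdinalEncoder approach recommended above can be sketched on the High/Low/Medium column from the question. The column name level and the chosen order (Low < Medium < High) are assumptions made for illustration. A plain pandas map is shown as well, since the question asked for an alternative to OrdinalEncoder.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

a = pd.DataFrame({"level": ["High", "Low", "Low", "Medium"]})

# Pass the categories in their intended order; codes follow that order.
enc = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
codes = enc.fit_transform(a)
print(codes.ravel())

# Alternative without scikit-learn: an explicit mapping dictionary.
order = {"Low": 0, "Medium": 1, "High": 2}
mapped = a["level"].map(order)
print(mapped.tolist())
```

Both variants make the ordering explicit instead of relying on the alphabetical order that LabelEncoder uses.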

Why does a pandas dataframe with sparse columns take up more memory?

I am working on a dataset with mixed sparse / dense columns. As the sparse columns greatly outnumber the dense ones, I wanted to see if I could store them efficiently using the sparse data structures in pandas. However, while testing the functionality I found that dataframes with sparse columns appear to take up more memory. Consider the following example:
import numpy as np
import pandas as pd
a = np.zeros(10000000)
b = np.zeros(10000000)
a[3000:3100] = 2
b[300:310] = 1
df = pd.DataFrame({'a':pd.SparseArray(a), 'b':pd.SparseArray(b)})
print(df.info())
This prints memory usage: 228.9 MB.
Next:
df = pd.DataFrame({'a':a, 'b':b})
print(df.info())
This prints memory usage: 152.6 MB.
Does the non-sparse dataframe take up less space? Am I misunderstanding?
Installation info:
pandas 0.25.0
python 3.7.2
I've reproduced those exact numbers. From the docs:
Pandas provides data structures for efficiently storing sparse data.
These are not necessarily sparse in the typical “mostly 0”. Rather,
you can view these objects as being “compressed” where any data
matching a specific value (NaN / missing value, though any value can
be chosen, including 0) is omitted. The compressed values are not
actually stored in the array.
Which means you have to specify that it's the 0 elements that should be compressed. You can do that by using fill_value=0, like so:
df = pd.DataFrame({'a':pd.SparseArray(a, fill_value=0), 'b':pd.SparseArray(b, fill_value=0)})
The result of df.info() is 1.4 KB of memory usage in this case, quite a dramatic difference.
As to why it's initially bigger in your example than a normal "uncompressed" array, my guess is that it has to do with the compression data added on top of all the normal data that is still there (including the zeros in your case). Anyway, that's just a guess.
Additional reading in the docs would tell you that 0 is the default fill_value only for arrays with an integer dtype, which yours weren't (np.zeros returns float64).
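The effect of fill_value can be checked directly. The sketch below uses a smaller array than the question and pd.arrays.SparseArray, the current location of the class. The byte counts in the comments assume an 8-byte float plus a 4-byte integer index per stored element, which would also be consistent with the 228.9 MB reported in the question.

```python
import numpy as np
import pandas as pd

a = np.zeros(1_000_000)
a[3000:3100] = 2

dense = pd.Series(a)
# Float data: fill_value defaults to NaN, so every 0.0 is still stored.
sparse_nan = pd.Series(pd.arrays.SparseArray(a))
# Declaring 0 as the fill value lets pandas drop the zeros.
sparse_zero = pd.Series(pd.arrays.SparseArray(a, fill_value=0))

dense_bytes = dense.memory_usage(index=False)
nan_bytes = sparse_nan.memory_usage(index=False)
zero_bytes = sparse_zero.memory_usage(index=False)

print(dense_bytes)  # values only
print(nan_bytes)    # values plus one index entry per stored value
print(zero_bytes)   # only the 100 non-zero entries and their indices
```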

Scikit-learn fit_transform return type not homogeneous

I'm currently using sklearn and something bothers me.
The return type of Imputer.fit_transform(), LabelEncoder.fit_transform(), etc. is numpy.ndarray, but OneHotEncoder.fit_transform() returns a coo_matrix.
Is there an explanation?
Thank you.
Imputer works on existing data array. So the output will depend on the input to fit(). If the input is sparse the output is most likely sparse.
LabelEncoder just changes the string values to integer, so it requires a simple array (not sparse) and will output a similar array.
Now for OneHotEncoder: its job is to produce a dummy encoding of a column in which, for a single sample, exactly one 1 is present (that's why it is called one-hot) and all other entries are 0.
So if a column of 100 samples has 7 unique categories, you will get 7 columns in the output, for a total of 700 cells, of which only 100 contain a 1 (one per sample) and the remaining 600 contain 0. That is for a single column. Now consider multiple columns on a large dataset with perhaps more than 10000 samples: most cells will be 0. That's why, to save memory, it returns a sparse matrix.
If you have enough memory to handle that, you should initialize the OneHotEncoder as:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)  # named sparse=False before scikit-learn 1.2
With this setting, transform() will return a simple dense array.
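The 100-sample, 7-category example above can be reproduced to see both the sparse return type and the cell counts. The sketch keeps the default encoder settings and converts with toarray(), which works regardless of how the dense-output parameter is named in the installed version.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# One column with 100 samples cycling through 7 categories.
categories = np.array(list("abcdefg"))
column = np.tile(categories, 15)[:100].reshape(-1, 1)

enc = OneHotEncoder()
encoded = enc.fit_transform(column)

is_sparse = sparse.issparse(encoded)
dense = encoded.toarray()

print(is_sparse)    # the default output is a scipy sparse matrix
print(dense.shape)  # 100 rows, one column per category
print(encoded.nnz)  # stored non-zero entries: one per sample
```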

pandas categories new levels

How does pandas categorical https://pandas.pydata.org/pandas-docs/stable/categorical.html handle new and unseen levels? I am thinking about a scikit-learn like setup. Currently, I have something like:
https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce
def fit():
    for each column:
        fit a label encoder
def transform():
    for each column:
        check if column was unseen
            yes (unseen): replace
            no: label encode
but this is pretty slow.
Apparently, decision-tree libraries like XGBoost or LightGBM can handle categorical data directly, i.e. one would not need to fiddle around manually with this slow conversion.
But when looking at their code
https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L532 they seem to use LGBMLabelEncoder, which is a standard scikit-learn LabelEncoder.
I wonder how that can handle unseen data.
If a manual conversion is required would pandas.Categorical allow a quicker conversion - even if unseen levels are in the new data?
edit
Please see https://github.com/geoHeil/pythonQuestions/blob/master/categorical-encoding.ipynb for an overview how I could not get scikit-learn's usual suspects to work.
Still looking for something more performant than my solution. Also lightGBM https://github.com/Microsoft/LightGBM/issues/789 suggests to use custom encoding strategy.
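On the pandas.Categorical part of the question, a minimal sketch of how a fixed category dtype treats unseen levels follows. The series names are hypothetical. Casting new data to the dtype learned from the training data turns unseen values into missing values, which show up as code -1.

```python
import pandas as pd

train = pd.Series(["a", "b", "c", "a"])
test = pd.Series(["a", "b", "d"])  # "d" was not seen in training

# "Fit": learn the categories from the training data.
train_cat = train.astype("category")
learned_dtype = train_cat.dtype

# "Transform": reuse the learned dtype; unseen levels become NaN.
test_cat = test.astype(learned_dtype)
train_codes = train_cat.cat.codes.tolist()
test_codes = test_cat.cat.codes.tolist()

print(train_codes)
print(test_codes)  # the unseen "d" gets code -1
```

The cast is vectorised, so it avoids a per-value Python loop; whether it is fast enough for a given dataset would need to be measured.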
There might be a pandas solution, but it probably works best with sklearn's LabelBinarizer:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
df = pd.DataFrame({'A': ['a', 'b', 'c', 'a']})
lb = LabelBinarizer()
lb.fit(df["A"])
lb.transform(df["A"])
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]
df2 = pd.DataFrame({'A': ['a', 'b', 'd']})
lb.transform(df2['A'])
# [[1 0 0]
#  [0 1 0]
#  [0 0 0]]
So we see that 'd' is mapped to none of 'a', 'b' or 'c'.
Note, however, that there is a bug which will probably be resolved in one of the next sklearn releases.
The LabelBinarizer is fit during training and remembers the values passed to it. New values get mapped to all zeros. It might be more feasible to write a transformer (as seen here before the edit) using pandas get_dummies.
This could be quite straightforward thanks to name matching of columns. Fit in the first step and store the column names, then just transform in the transform step, keeping only the column names that you identified during fitting (and adding zero columns where training levels are not present in the test set). Then you are done ;)
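The get_dummies-based transformer described above might look like the following sketch. The class name DummyEncoder is hypothetical, and the code assumes the input is a DataFrame of string columns.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DummyEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: get_dummies with columns fixed at fit time."""

    def fit(self, X, y=None):
        # Remember the dummy column names seen during training.
        self.columns_ = pd.get_dummies(X).columns
        return self

    def transform(self, X):
        dummies = pd.get_dummies(X)
        # Drop columns for unseen levels, add zero columns for levels
        # missing from this data, and keep the training column order.
        aligned = dummies.reindex(columns=self.columns_, fill_value=0)
        return aligned.astype(int)


train = pd.DataFrame({"A": ["a", "b", "c", "a"]})
test = pd.DataFrame({"A": ["a", "b", "d"]})

enc = DummyEncoder().fit(train)
test_encoded = enc.transform(test)
print(test_encoded)
```

As with LabelBinarizer, the unseen level 'd' ends up as a row of zeros, and the level 'c' that is absent from the test data still gets its column.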
