Specify list of possible values for Pandas get_dummies - python

Suppose I have a Pandas DataFrame like the below and I'm encoding categorical_1 for training in scikit-learn:
import pandas as pd

data = {'numeric_1': [12.1, 3.2, 5.5, 6.8, 9.9],
        'categorical_1': ['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
dummy_values = pd.get_dummies(data['categorical_1'])
The values for 'categorical_1' are A, B, or C so I end up with 3 columns in dummy_values. However, categorical_1 can in reality take on values A, B, C, D, or E so there is no column represented for values D or E.
In R I would specify levels when factoring that column - is there a corresponding way to do this with Pandas or would I need to handle that manually?
In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this.

First, if you want pandas to account for more values, simply add them to the list passed to the get_dummies method:
data = {'numeric_1': [12.1, 3.2, 5.5, 6.8, 9.9],
        'categorical_1': ['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
dummy_values = pd.get_dummies(data['categorical_1'] + ['D', 'E'])
since in Python, + on lists works as a concatenation operation, so
['A','B','C','B','B'] + ['D','E']
results in
['A', 'B', 'C', 'B', 'B', 'D', 'E']
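The two extra entries exist only to force the D and E columns into the result; a hedged follow-up (not part of the original answer) is to drop the padding rows again afterwards:

# Build dummies for every known level, then keep only the real observations
dummy_values = pd.get_dummies(data['categorical_1'] + ['D', 'E'])
dummy_values = dummy_values.iloc[:len(frame)]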
In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this.
From the machine learning perspective, it is quite redundant. This column is categorical, so the value 'D' means nothing to a model that has never seen it before. If you are encoding the features in a unary (one-hot) fashion, which I assume given that you create a column for each value, it is enough to represent the 'D' and 'E' values with
A B C
0 0 0
(I assume that you represent the value 'B' with 0 1 0, 'C' with 0 0 1, etc.)
because if no such values appeared in the training set, then during testing no model will distinguish between being given the value 'D' or the value 'Elephant'.
The only reason for doing so would be the assumption that in the future you wish to add data with 'D' values and simply do not want to modify the code; then it is reasonable to do it now, even though it makes training a bit more complex (you add a dimension that, for now, carries no information at all), but that seems a small problem.
If you are not going to encode it in the unary format, but rather want to use these values as a single categorical feature, then you do not need to create these "dummies" at all; instead, use a model which can work with such values directly, such as Naive Bayes, which can simply be trained with Laplace smoothing to cope with values it has never seen.
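As for the original question about specifying levels the way R's factor() does: a hedged sketch of the usual pandas idiom (assuming pandas >= 0.15) is to declare the column as a Categorical with the full set of categories before calling get_dummies:

# List every level the column can take, even those absent from this sample
frame['categorical_1'] = pd.Categorical(frame['categorical_1'],
                                        categories=['A', 'B', 'C', 'D', 'E'])
dummy_values = pd.get_dummies(frame['categorical_1'])  # columns A..E; D and E are all zeros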

I encountered the same problem as yours, namely how to unify the dummy categories between training data and testing data when using get_dummies() in Pandas. I found a solution while exploring the House Prices competition on Kaggle: process the training data and testing data at the same time. Suppose you have two dataframes df_train and df_test (with the target column removed).
all_data = pd.concat([df_train,df_test], axis=0)
all_data = pd.get_dummies(all_data)
X_train = all_data[:df_train.shape[0]] # select the processed training data
X_test = all_data[-df_test.shape[0]:] # select the processed testing data
Hope it helps.
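A related idiom (a hedged aside, not part of the original answer) is to encode the two sets separately and then align the test columns to the training columns, filling any dummy columns missing from the test set with zeros:

X_train = pd.get_dummies(df_train)
# reindex aligns the test dummies to the training columns; unseen ones become all-zero
X_test = pd.get_dummies(df_test).reindex(columns=X_train.columns, fill_value=0)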

Isn't this a better answer?
data = pd.DataFrame({
    "values": [1, 2, 3, 4, 5, 6, 7],
    "categories": ["A", "A", "B", "B", "C", "C", "D"]
})
possibilities = ["A", "B", "C", "D", "E", "F"]
exists = data["categories"].tolist()
# Categories that can occur but are absent from the data
difference = pd.Series([item for item in possibilities if item not in exists])
# Pad the column with the missing categories so get_dummies creates their columns
target = pd.concat([data["categories"], difference]).reset_index(drop=True)
dummies = pd.get_dummies(target)
# Drop the padding rows again, keeping one dummy row per original observation
dummies = dummies.iloc[:len(data)]

To handle the mismatch between the sets of categorical values in the train and test sets I used:
import numpy as np

# An all-zero column to stand in for dummy columns missing from the test set
length = test_categorical_data.shape[0]
empty_col = np.zeros(length)
test_categorical_data_processed = pd.DataFrame()
for col in train_categorical_data.columns:
    test_categorical_data_processed[col] = test_categorical_data.get(col, empty_col)

Related

How to implement a function through scikit FunctionTransformer() that refers to two columns of a data frame ('kw_args' argument?)

While working on my submission for the famous Kaggle Titanic dataset (890 rows / 11 columns), I would like to execute all of my feature-engineering steps within one scikit-learn pipeline. However, I could barely find any online examples that demonstrate how to use the scikit-learn FunctionTransformer() to execute slightly more complex custom functions, especially functions that refer to more than one column of the dataset.
In my concrete example, I would like to replace NaN values in the column 'Age' depending on the passenger class (column 'Pclass'). The possible passenger classes are 1, 2 and 3, and the corresponding ages that should replace the NaN values are 38, 30 and 25. My current code looks like this:
def impute_age_class(df, column_1, column_2):
    for i in range(len(df)):
        if np.isnan(df[column_1].iloc[i]):
            if df[column_2].iloc[i] == 1:
                df[column_1].iloc[i] = 38
            elif df[column_2].iloc[i] == 2:
                df[column_1].iloc[i] = 30
            else:
                df[column_1].iloc[i] = 25
    return df

age_transformers = [("impute_age_class",
                     FunctionTransformer(impute_age_class, validate=False,
                                         kw_args={'column_1': 'Age', 'column_2': 'Pclass'}),
                     ["Age", "Pclass"])]
It seems like the code gets executed, and I receive a slightly better accuracy score with my logreg model, but I also get the warnings shown in the attached screenshot (most likely pandas' SettingWithCopyWarning, given the chained .iloc assignment; the image is not reproduced here).
I would be very thankful if you could give me any hints on whether the syntax of my code could be improved in order to avoid these warnings and ensure correct execution.
That warning is very common, and worth reading up on. But it's also not great to be looping over the rows of a dataframe. You can use pandas's own fillna for this:
def impute_age_class(df, fillme, groupby):
    df = df.copy()
    df.loc[:, fillme] = df[fillme].fillna(
        value=df[groupby].map({1: 38, 2: 30, 3: 25})
    )
    return df

tfmr = FunctionTransformer(
    impute_age_class,
    validate=False,
    kw_args={'fillme': 'Age', 'groupby': 'Pclass'}
)
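For completeness, a hedged usage sketch (train_df is a hypothetical Titanic-style frame with 'Age' and 'Pclass' columns):

# Apply the transformer directly, or drop it into a Pipeline / ColumnTransformer
imputed = tfmr.fit_transform(train_df[['Age', 'Pclass']])
print(imputed['Age'].isna().sum())  # expect 0 once the NaNs are imputed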
It's a little unusual to have the parameters for the two column names when you are hard-coding the mapping inside the function. And if you didn't have the mapping already in mind, it'd be better to learn it at fit time and then transform train and test sets with that mapping: see SimpleImputer with groupby and https://datascience.stackexchange.com/q/71856/55122.

LabelEncoder for categorical features?

This might be a beginner question, but I have seen a lot of people using LabelEncoder() to replace categorical variables with ordinal values. Many people use it by passing multiple columns at a time; however, I have some doubts about having the wrong ordinality in some of my features and how that will affect my model. Here is an example:
Input
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
a = pd.DataFrame(['High','Low','Low','Medium'])
le = LabelEncoder()
le.fit_transform(a)
Output
array([0, 1, 1, 2], dtype=int64)
As you can see, the ordinal values are not mapped correctly, since LabelEncoder only cares about the sorted (alphabetical) order of the values in the column/array (it should be High=1, Med=2, Low=3 or vice versa). How drastically can a wrong mapping affect the model, and is there an easy way other than OrdinalEncoder() to map these values properly?
TL;DR: Using a LabelEncoder to encode ordinal features (or any kind of feature) is a bad idea!
This is in fact clearly stated in the docs, where it is mentioned that, as its name suggests, this encoding method is aimed at encoding the label:
This transformer should be used to encode target values, i.e. y, and not the input X.
As you rightly point out in the question, mapping the inherent ordinality of an ordinal feature to a wrong scale will have a very negative impact on the performance of the model (that is, proportional to the relevance of the feature). And the same applies to a categorical feature, just that the original feature has no ordinality.
An intuitive way to think about it is the way a decision tree sets its boundaries. During training, a decision tree learns the optimal feature to split on at each node, as well as an optimal threshold whereby unseen samples follow one branch or the other depending on their values.
If we encode an ordinal feature using a simple LabelEncoder, we could end up with a feature in which, say, 1 represents warm, 2 translates to hot, and 0 represents boiling. In that case, the result will be a tree with an unnecessarily high number of splits, and hence a much higher complexity for something that should be simpler to model.
Instead, the right approach would be to use an OrdinalEncoder, and define the appropriate mapping schemes for the ordinal features. Or in the case of having a categorical feature, we should be looking at OneHotEncoder or the various encoders available in Category Encoders.
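As a hedged illustration (not part of the original answer; the values and ordering are made up), specifying the scale explicitly with OrdinalEncoder might look like this:

from sklearn.preprocessing import OrdinalEncoder

# The categories argument fixes the ordinal scale instead of relying on alphabetical order
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoder.fit_transform([['High'], ['Low'], ['Low'], ['Medium']])
# -> array([[2.], [0.], [0.], [1.]])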
Though actually seeing why this is a bad idea will be more intuitive than just words.
Let's use a simple example to illustrate the above, consisting of two ordinal features, one containing a range of the number of hours spent by a student preparing for an exam and the other the average grade of all previous assignments, plus a target variable indicating whether the exam was passed or not. I've defined the dataframe's columns as pd.Categorical:
df = pd.DataFrame(
    {'Hours_of_dedication': pd.Categorical(
          values=['25-30', '20-25', '5-10', '5-10', '40-45',
                  '0-5', '15-20', '20-25', '30-35', '5-10',
                  '10-15', '45-50', '20-25'],
          categories=['0-5', '5-10', '10-15', '15-20', '20-25',
                      '25-30', '30-35', '40-45', '45-50']),
     'Assignments_avg_grade': pd.Categorical(
          values=['B', 'C', 'F', 'C', 'B',
                  'D', 'C', 'A', 'B', 'B',
                  'B', 'A', 'D'],
          categories=['F', 'D', 'C', 'B', 'A']),
     'Result': pd.Categorical(
          values=['Pass', 'Pass', 'Fail', 'Fail', 'Pass',
                  'Fail', 'Fail', 'Pass', 'Pass', 'Fail',
                  'Fail', 'Pass', 'Pass'],
          categories=['Fail', 'Pass'])
    }
)
The advantage of defining a column as a pandas Categorical is that we get to establish an order among its categories, as mentioned earlier. This allows sorting based on the established order rather than lexical sorting, and it also gives us a simple way to obtain codes for the different categories according to that order.
So the dataframe we'll be using looks as follows:
print(df.head(10))

  Hours_of_dedication Assignments_avg_grade Result
0               20-25                     B   Pass
1               20-25                     C   Pass
2                5-10                     F   Fail
3                5-10                     C   Fail
4               40-45                     B   Pass
5                 0-5                     D   Fail
6               15-20                     C   Fail
7               20-25                     A   Pass
8               30-35                     B   Pass
9                5-10                     B   Fail
The corresponding category codes can be obtained with:
X = df.apply(lambda x: x.cat.codes)
X.head(10)

   Hours_of_dedication  Assignments_avg_grade  Result
0                    4                      3       1
1                    4                      2       1
2                    1                      0       0
3                    1                      2       0
4                    7                      3       1
5                    0                      1       0
6                    3                      2       0
7                    4                      4       1
8                    6                      3       1
9                    1                      3       0
Now let's fit a DecisionTreeClassifier and see how the tree has defined the splits:
from sklearn import tree
dt = tree.DecisionTreeClassifier()
y = X.pop('Result')
dt.fit(X, y)
We can visualise the tree structure using plot_tree:
t = tree.plot_tree(dt,
                   feature_names=X.columns,
                   class_names=["Fail", "Pass"],
                   filled=True,
                   label='all',
                   rounded=True)
Is that all?? Well… yes! (The plotted tree consists of a single split on Hours_of_dedication.) I've actually set the features up in such a way that there is a simple and obvious relation between the Hours_of_dedication feature and whether the exam is passed, making it clear that the problem should be very easy to model.
Now let's try to do the same by directly encoding all features with an encoding scheme we could have obtained for instance through a LabelEncoder, so disregarding the actual ordinality of the features, and just assigning a value at random:
from matplotlib import rcParams

df_wrong = df.copy()
# Scramble the category order to mimic an arbitrary LabelEncoder-style mapping
df_wrong['Hours_of_dedication'] = df_wrong['Hours_of_dedication'].cat.set_categories(
    ['0-5', '40-45', '25-30', '10-15', '5-10', '45-50', '15-20', '20-25', '30-35'])
df_wrong['Assignments_avg_grade'] = df_wrong['Assignments_avg_grade'].cat.set_categories(
    ['A', 'C', 'F', 'D', 'B'])

rcParams['figure.figsize'] = 14, 18
X_wrong = df_wrong.drop(columns=['Result']).apply(lambda x: x.cat.codes)
y = df_wrong.Result

dt_wrong = tree.DecisionTreeClassifier()
dt_wrong.fit(X_wrong, y)

t = tree.plot_tree(dt_wrong,
                   feature_names=X_wrong.columns,
                   class_names=["Fail", "Pass"],
                   filled=True,
                   label='all',
                   rounded=True)
As expected, the tree structure is far more complex than necessary for the simple problem we're trying to model. In order to correctly predict all training samples the tree has expanded to a depth of 4, when a single split should suffice.
This implies that the classifier is likely to overfit, since we're drastically increasing the complexity. And pruning the tree and tuning the parameters to prevent overfitting does not solve the problem either, since we've added too much noise by wrongly encoding the features.
So to summarize: preserving the ordinality of the features when encoding them is crucial, otherwise, as this example makes clear, we lose their predictive power and just add noise to the model.

One-hot-encoding labels giving input labels

I am trying to apply one-hot encoding to a pandas dataframe, but I can't pass a categories argument. My idea is to control the correspondence between categories and the encoding, for example:
CATEGORIES = ['A','B','C']
Y = pd.get_dummies(data['Article_Topic_1']).values
For example, Y will be [0,0,1] for category 'A', but I would like to prescribe the encoding for 'A' to be [1,0,0].
If this is not possible, is there at least a way to fix the encoding and know which string each column corresponds to?
I don't think you can do this with get_dummies() directly. But how about just reorganizing the result? If I got your question correctly, you want to reorder the columns of the one-hot-encoded data to match a prescribed ordering.
categories = ["A", "B", "C"]
Y = pd.get_dummies(data["Article_Topic_1"])
Y = Y[categories].values
Here is a function that checks some of the assumptions this solution relies on:
def get_dummies_for_coding(series, ordering):
    # Ordering must contain only values present in the series.
    assert len(set(ordering) - set(series.unique())) == 0
    # It's easier to work with a Series here, because pd.get_dummies()
    # uses string prefixes for data frames, which makes things a bit
    # more complicated.
    assert isinstance(series, pd.Series)
    dummies = pd.get_dummies(series)
    dummies = dummies[ordering]
    return dummies.values

# Example
df = pd.DataFrame([["a", "foo"],
                   ["a", "bar"],
                   ["b", "bar"],
                   ["a", "baz"],
                   ["b", "bar"]],
                  columns=["colA", "colB"])

orderingA = ["b", "a"]
orderingB = ["baz", "bar", "foo"]

ret = get_dummies_for_coding(df["colA"], orderingA)
print(ret)

ret = get_dummies_for_coding(df["colB"], orderingB)
print(ret)
Maybe you can try using scikit-learn for the one-hot encoding?
Find here a comprehensive example.
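A hedged sketch of that scikit-learn route (using the question's column name and category list): OneHotEncoder lets you prescribe the categories and their order up front:

from sklearn.preprocessing import OneHotEncoder

# categories fixes which columns exist and their order, so 'A' always encodes to [1, 0, 0]
enc = OneHotEncoder(categories=[['A', 'B', 'C']], handle_unknown='ignore')
Y = enc.fit_transform(data[['Article_Topic_1']]).toarray()
print(enc.categories_)  # recover which string each column corresponds to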

Value Label in pandas?

I am fairly new to pandas and come from a statistics background, and I am struggling with a conceptual problem:
Pandas has columns which contain values. But sometimes values have a special meaning, called "value labels" in a statistical program like SPSS or R.
Imagine a column rain with two values, 0 (meaning: no rain) and 1 (meaning: raining). Is there a way to assign these labels to those values?
Is there a way to do this in pandas, too? Mainly for plotting and visualisation purposes.
There's no need to use a map anymore. Since version 0.15, pandas allows a categorical data type for its columns.
The stored data takes less space, operations on it are faster and you can use labels.
I'm taking an example from the pandas docs:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
#Recast grade as a categorical variable
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
#Gives this:
Out[124]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
You can also rename categories and add missing categories, as sketched below.
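A hedged sketch applying this to the rain example from the question (assuming a DataFrame df with a column rain holding 0/1):

# Recast as categorical, then attach human-readable labels to the codes
df['rain'] = df['rain'].astype('category').cat.rename_categories({0: 'no rain', 1: 'raining'})
df['rain'].value_counts().plot(kind='bar')  # plots now show the labels, not the codes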
You could have a separate dictionary which maps values to labels:
d={0:"no rain",1:"raining"}
and then you could access the labelled data by doing
df.rain_column.apply(lambda x:d[x])
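An equivalent, slightly more idiomatic spelling (a hedged aside, not in the original answer) uses Series.map:

df.rain_column.map(d)  # same result as the apply above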

making multiple pandas data frames using a loop or list comprehension

I have a pandas data frame that I want to subdivide by row, but into 32 different slices (think of a large data set chopped by row into 32 smaller data sets). I can manually divide the data frames in this way:
df_a = df[df['Type']=='BROKEN PELVIS']
df_b = df[df['Type']=='ABDOMINAL STRAIN']
I'm assuming there is a much more Pythonic expression someone might like to share. I'm looking for something along the lines of:
for i in new1:
    df_%s = df[df['#RIC']=='%s'] , %i
Hope that makes sense.
In this kind of situation I think it's more Pythonic to store the DataFrames in a Python dictionary:
injuries = {injury: df[df['Type'] == injury] for injury in df['Type'].unique()}
injuries['BROKEN PELVIS'] # is the same as df_a above
Most of the time you don't need to create a new DataFrame but can use a groupby (it depends what you're doing next), see http://pandas.pydata.org/pandas-docs/stable/groupby.html:
g = df.groupby('Type')
Update: in fact there is a method get_group to access these:
In [21]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])

In [22]: g = df.groupby(0)

In [23]: g.get_group('A')
Out[23]:
   0  1
0  A  2
1  A  4
Note: most of the time you don't need to do this; apply, aggregate and transform are your friends!
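For instance, a hedged sketch of the aggregation route using the toy frame above, which avoids creating any per-group DataFrames at all:

g[1].mean()        # per-group mean of column 1: A -> 3.0, B -> 6.0
g.agg({1: 'sum'})  # or compute several aggregates at once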
