Possible ways to do one hot encoding in scikit-learn? - python

I have a pandas data frame with some categorical columns. Some of these contain non-integer values.
I want to apply several machine learning models to this data. For some models it is necessary to preprocess the data to get better results, for example by converting the categorical variables into dummy/indicator variables. Pandas has a function called get_dummies for that purpose, but its output depends on the data it is given: if I call get_dummies on the training data and then again on the test data, the resulting columns can differ, because a categorical column in the test data may contain only a subset (or a different set) of the values seen in the training data.
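For example, with made-up data (the "color" column here is purely illustrative):

import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "red"]})

print(pd.get_dummies(train).columns.tolist())  # ['color_blue', 'color_green', 'color_red']
print(pd.get_dummies(test).columns.tolist())   # ['color_red'] -- a different set of columns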
Therefore, I am looking for other methods to do one-hot encoding.
What are possible ways to do one hot encoding in python (pandas/sklearn)?

Scikit-learn provides an encoder sklearn.preprocessing.LabelBinarizer.
For encoding training data you can use fit_transform which will discover the category labels and create appropriate dummy variables.
import sklearn.preprocessing

label_binarizer = sklearn.preprocessing.LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)
For the test data you can use the same set of categories using transform.
test_mat = label_binarizer.transform(test_df.Label)
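Putting it together, here is a self-contained sketch (the data frames and labels are made up) showing that categories unseen during fit come out as all-zero rows:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

df = pd.DataFrame({"Label": ["cat", "dog", "bird"]})   # hypothetical training data
test_df = pd.DataFrame({"Label": ["dog", "fish"]})     # "fish" was never seen in training

label_binarizer = LabelBinarizer()
training_mat = label_binarizer.fit_transform(df.Label)
test_mat = label_binarizer.transform(test_df.Label)

print(label_binarizer.classes_)  # ['bird' 'cat' 'dog']
print(test_mat)                  # [[0 0 1]
                                 #  [0 0 0]]  <- unseen "fish" maps to all zeros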

In the past, I've found the easiest way to deal with this problem is to use get_dummies and then enforce that the columns match up between test and train. For example, you might do something like:
import numpy as np
import pandas as pd

train = pd.get_dummies(train_df)
test = pd.get_dummies(test_df)

# get the columns in train that are not in test
col_to_add = np.setdiff1d(train.columns, test.columns)

# add these columns to test, setting them equal to zero
for c in col_to_add:
    test[c] = 0

# select and reorder the test columns using the train columns
test = test[train.columns]
This will discard information about labels that you haven't seen in the training set, but will enforce consistency. If you're doing cross validation using these splits, I'd recommend two things. First, do get_dummies on the whole dataset to get all of the columns (instead of just on the training set as in the code above). Second, use StratifiedKFold for cross validation so that your splits contain the relevant labels.
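A rough sketch of that cross-validation setup (full_df and the "target" column name are placeholders, not from the original answer):

import pandas as pd
from sklearn.model_selection import StratifiedKFold

X = pd.get_dummies(full_df.drop(columns=["target"]))  # dummies computed on the whole dataset
y = full_df["target"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # fit and evaluate the model on this fold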

Say I have a feature "A" with possible values "a", "b", "c", "d", but the training data set contains only the three categories "a", "b", "c" as values. If get_dummies is used at this stage, only three dummy features are generated (A_a, A_b, A_c), whereas ideally there should also be a feature A_d containing all zeros. That can be achieved in the following way:
import pandas as pd

data = pd.DataFrame({"A": ["a", "b", "c"]})
# note: astype("category", categories=...) was removed from pandas; use CategoricalDtype instead
data["A"] = data["A"].astype(pd.CategoricalDtype(categories=["a", "b", "c", "d"]))
mod_data = pd.get_dummies(data[["A"]])
print(mod_data)
The output being
   A_a  A_b  A_c  A_d
0  1.0  0.0  0.0  0.0
1  0.0  1.0  0.0  0.0
2  0.0  0.0  1.0  0.0
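The same dtype can then be reused on the test data so that train and test dummies line up; a small sketch with made-up test values:

import pandas as pd

cat_dtype = pd.CategoricalDtype(categories=["a", "b", "c", "d"])
test = pd.DataFrame({"A": ["d", "a"]})   # hypothetical test data
test["A"] = test["A"].astype(cat_dtype)
print(pd.get_dummies(test[["A"]]))       # same four columns A_a, A_b, A_c, A_d, in the same order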

For text columns, you can try this:
from sklearn.feature_extraction.text import CountVectorizer
data = ['he is good','he is bad','he is strong']
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(data)
To print the output:
for i in range(len(data)):
    print(vectors[i, :].toarray())
Output:
[[0 1 1 1 0]]
[[1 0 1 1 0]]
[[0 0 1 1 1]]
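To see which word each column corresponds to, you can print the learned vocabulary (get_feature_names_out is available in recent scikit-learn versions; older ones use get_feature_names):
print(vectorizer.get_feature_names_out())
# ['bad' 'good' 'he' 'is' 'strong']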

Related

How to implement a function through scikit FunctionTransformer() that refers to two columns of a data frame ('kw_args' argument?)

While working on my submission for the famous Kaggle Titanic dataset (890 rows / 11 columns), I would like to execute all of my feature-engineering steps within one scikit pipeline. However, I could barely find any online examples that demonstrate how to use the scikit FunctionTransformer() to execute slightly more complex custom functions, especially functions that refer to more than one column of the dataset.
In my concrete example, I would like to replace NaN values in the column 'Age' depending on the passenger class (column 'Pclass'). Possible passengers classes are 1, 2 or 3 and the corresponding ages that should replace the NaN values are 38, 30 and 25. My current code looks like this:
def impute_age_class(df, column_1, column_2):
    for i in range(len(df)):
        if np.isnan(df[column_1].iloc[i]):
            if df[column_2].iloc[i] == 1:
                df[column_1].iloc[i] = 38
            elif df[column_2].iloc[i] == 2:
                df[column_1].iloc[i] = 30
            else:
                df[column_1].iloc[i] = 25
    return df

age_transformers = [("impute_age_class", FunctionTransformer(impute_age_class, validate=False, kw_args={'column_1': 'Age', 'column_2': 'Pclass'}), ["Age", "Pclass"])]
It seems like the code gets executed and I receive a slightly better accuracy score with my logreg model, but I also get the warnings shown in this picture:
[screenshot of the pandas warning omitted]
I would be very thankful if you could give me any hints on whether the syntax of my code could be improved in order to avoid these warnings and ensure correct execution.
That warning is very common, and worth reading up on. But it's also not great to be looping over the rows of a dataframe. You can use pandas's own fillna for this:
def impute_age_class(df, fillme, groupby):
    df = df.copy()
    df.loc[:, fillme] = df[fillme].fillna(
        value=df[groupby].map({1: 38, 2: 30, 3: 25})
    )
    return df

tfmr = FunctionTransformer(
    impute_age_class,
    validate=False,
    kw_args={'fillme': 'age', 'groupby': 'pclass'}
)
It's a little unusual to have the parameters for the two column names when you are hard-coding the mapping inside the function. And if you didn't have the mapping already in mind, it'd be better to learn it at fit time and then transform train and test sets with that mapping: see SimpleImputer with groupby and https://datascience.stackexchange.com/q/71856/55122.
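For illustration, here is a minimal sketch of such a fit-time transformer; the class name, the column defaults and the use of per-class medians are assumptions, not part of the original answer:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GroupMedianImputer(BaseEstimator, TransformerMixin):
    # Fill NaNs in one column with per-group medians learned at fit time.
    def __init__(self, fillme="Age", groupby="Pclass"):
        self.fillme = fillme
        self.groupby = groupby

    def fit(self, X, y=None):
        # learn one median per group from the training data only
        self.medians_ = X.groupby(self.groupby)[self.fillme].median()
        return self

    def transform(self, X):
        X = X.copy()
        X[self.fillme] = X[self.fillme].fillna(X[self.groupby].map(self.medians_))
        return X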

creating train and test datasets manually in python

I am trying to split the data set into train and test datasets manually, meaning that I don't want to use the scikit-learn package. I want to split them in such a way that if the row index modulo 4 is equal to zero, the row goes into the training dataset, otherwise it goes into the test dataset. I have done it in R like the following:
testidx = which(1:nrow(price_accommodates_bedrooms )%%4 == 0)
df_train = price_accommodates_bedrooms [-testidx, ]
df_test = price_accommodates_bedrooms[testidx, ]
But I am not sure how to do it in Python because I am new to it. Thanks in advance.
If you want to do this you can take advantage of the DataFrame index and masking:
test_df = df[df.index % 4 != 0]
train_df = df[df.index % 4 == 0]
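Note that this relies on the DataFrame having the default 0..n-1 RangeIndex. If it might not (for example after filtering), a positional variant avoids that assumption; a small sketch:

import numpy as np

pos = np.arange(len(df))          # positional row numbers, independent of the index labels
train_df = df.iloc[pos % 4 == 0]  # every 4th row goes to train, as in the question
test_df = df.iloc[pos % 4 != 0]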

Solving Kaggle's Titanic Machine Learning

I'm trying to solve Kaggle's Titanic with Python.
But I have an error trying to fit my data.
This is my code:
import pandas as pd
from sklearn import linear_model

def clean_data(data):
    data["Fare"] = data["Fare"].fillna(data["Fare"].dropna().median())
    data["Age"] = data["Age"].fillna(data["Age"].dropna().median())
    data.loc[data["Sex"] == "male", "Sex"] = 0
    data.loc[data["Sex"] == "female", "Sex"] = 1
    data.loc["Embarked"] = data["Embarked"].fillna("S")
    data.loc[data["Embarked"] == "S", "Embarked"] = 0
    data.loc[data["Embarked"] == "C", "Embarked"] = 1
    data.loc[data["Embarked"] == "Q", "Embarked"] = 2

train = pd.read_csv("train.csv")
clean_data(train)
target = train["Survived"].values
features = train[["Pclass", "Age", "Sex", "SibSp", "Parch"]].values
classifier = linear_model.LogisticRegression()
classifier_ = classifier.fit(features, target)  # Here is where the error comes from
And the error is this:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Can you help me please?
Before you fit the model with features and target, the best practice is to check whether null values are present in all the features you want to use in building the model. You can use the following to check:
dataframe_name.isnull().any() gives the column names and True if at least one NaN value is present
dataframe_name.isnull().sum() gives the column names and the number of NaN values present in each
Once you know which columns are affected, you can clean the data accordingly, and the NaN problem will not occur.
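For instance, applied to the feature columns used above (a small sketch, assuming train.csv has already been loaded into train as in the question):

cols = ["Pclass", "Age", "Sex", "SibSp", "Parch"]
print(train[cols].isnull().any())   # True for any column that still contains NaN
print(train[cols].isnull().sum())   # how many NaN values each column contains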
You should reset the index of your dataframe before running any sklearn code:
df = df.reset_index()
NaN simply represents empty, None or null values in a dataset. Before applying an ML algorithm to the dataset you first need to preprocess it; in other words, data cleaning. You can use scikit-learn's imputer to handle NaN.
How to check if a dataset has NaN:
A Series' or DataFrame's isnull() returns True/False values showing whether each entry is NaN, for example:
s = pd.Series(['a', 'b', np.nan, 'c', np.nan])
s.isnull()
out: False, False, True, False, True
And s.isnull().sum() would return the count of null values present in the series, in this case 2.
You can apply the same method to a DataFrame itself, e.g. df.isnull().
Two techniques I know to handle NaN:
1. Removing the rows that contain NaN, e.g.
s.dropna() or s.dropna(inplace=True) or df.dropna(how='all')
But this can remove a lot of valuable information from the dataset, so it is mostly avoided.
2. Imputing: replacing the NaN values with the mean/median of the column.
import numpy as np
# In older scikit-learn this was sklearn.preprocessing.Imputer;
# current versions provide SimpleImputer in sklearn.impute instead.
from sklearn.impute import SimpleImputer

# strategy can also be 'median' or 'most_frequent'
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imputer.fit_transform(training_data_df.values)
print(imputed_data)
I hope this would help you.

pandas categories new levels

How does pandas' categorical dtype (https://pandas.pydata.org/pandas-docs/stable/categorical.html) handle new and unseen levels? I am thinking about a scikit-learn-like setup. Currently, I have something like:
https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce
def fit():
    for each column:
        fit a label encoder

def transform():
    for each column:
        check if the value was unseen
            yes (unseen): replace it
            no: label encode
but this is pretty slow.
Apparently, decision-tree libraries like XGBoost or LightGBM can handle categorical data directly, i.e. one would not need to fiddle around manually with this slow conversion.
But when looking at their code
https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/sklearn.py#L532 they seem to use LGBMLabelEncoder, which is a standard scikit-learn LabelEncoder.
I wonder how that can handle unseen data.
If a manual conversion is required would pandas.Categorical allow a quicker conversion - even if unseen levels are in the new data?
Edit
Please see https://github.com/geoHeil/pythonQuestions/blob/master/categorical-encoding.ipynb for an overview of how I could not get scikit-learn's usual suspects to work.
I am still looking for something more performant than my solution. Also, LightGBM (https://github.com/Microsoft/LightGBM/issues/789) suggests using a custom encoding strategy.
There might be a pandas solution, but this probably works best with sklearn's LabelBinarizer:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

df = pd.DataFrame({'A': ['a', 'b', 'c', 'a']})
lb = LabelBinarizer()
lb.fit(df["A"])
lb.transform(df["A"])

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [1 0 0]]

df2 = pd.DataFrame({'A': ['a', 'b', 'd']})
lb.transform(df2['A'])

[[1 0 0]
 [0 1 0]
 [0 0 0]]
So we see that 'd' is essentially mapped to none of 'a', 'b' or 'c'.
Note, however, that there is a bug which will probably be resolved in one of the next sklearn releases.
The LabelBinarizer is fit during training and remembers the values passed to it; new values get mapped to all zeros. It might be more feasible to write a transformer (as seen here before the edit) using pandas get_dummies.
This could be quite straightforward thanks to the name matching of columns: fit in the first step and store the column names, then in the transform step keep only the column names that were identified during fitting (potentially adding some zero columns if training levels are not present in the test set). Then you are done ;)
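A minimal sketch of such a get_dummies-based transformer (the class name and details are illustrative, not taken from the original answer):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GetDummiesTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # remember the dummy columns produced on the training data
        self.columns_ = pd.get_dummies(X).columns
        return self

    def transform(self, X):
        dummies = pd.get_dummies(X)
        # drop unseen levels, add missing training levels as zeros, keep the column order
        return dummies.reindex(columns=self.columns_, fill_value=0)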

How to encode categorical features in sklearn?

I have a dataset with 41 features [from 0 to 40 columns], of which 7 are categorical. This categorical set is divided into two subsets:
A subset of string type (column-features 1, 2, 3)
A subset of int type, in binary form 0 or 1 (column-features 6, 11, 20, 21)
Furthermore, the column-features 1, 2 and 3 (of string type) have cardinality 3, 66 and 11 respectively.
In this context I have to encode them in order to use a support vector machine algorithm.
This is the code that I have:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import feature_extraction

df = pd.read_csv("train.csv")
datanumpy = df.to_numpy()   # .as_matrix() is deprecated/removed in newer pandas
X = datanumpy[:, 0:41]      # columns 0 to 40 are the features
y = datanumpy[:, 41]        # column 41 is the label
I don't know whether it is better to use DictVectorizer() or OneHotEncoder() [for the reasons that I exposed above], and mostly in which way to use them [in terms of code] with the X matrix that I have.
Or should I simply assign a number to each category in the string-type subset (since they have high cardinality, my feature space would otherwise blow up)?
EDIT
With respect to the subset of int type, I guess the best choice is to keep those column-features as they are (not passing them to any encoder).
The problem persists for the string-type subset with high cardinality.
This is by far the easiest:
df = pd.get_dummies(df, drop_first=True)
If you get a memory overflow or it is too slow, then reduce the cardinality:
# keep the 10 most frequent values of the column and lump everything else into "other"
top = df[col].isin(df[col].value_counts().index[:10])
df.loc[~top, col] = "other"
As per the official documentation of One Hot Encoder, it should be applied over the combined dataset (Train and Test). Otherwise it may not form a proper encoding.
And performance-wise I think One Hot Encoder is much better than DictVectorizer.
You can use the pandas method .get_dummies() as suggested by @simon here above, or you can use the sklearn equivalent given by OneHotEncoder.
I prefer OneHotEncoder because you can pass it parameters like the categorical features you want to encode and the number of values to keep for each feature (if not indicated, it will select a suitable number automatically).
If, for some features, the cardinality is too big, impose a low n_values (note that the n_values parameter exists only in older scikit-learn versions).
If you have enough memory, don't worry: encode all the values of your features.
I guess for a cardinality of 66, if you have a basic computer, encoding all 66 values won't lead to a memory issue. Memory overflow usually happens when a feature has as many values as the number of samples in your dataset (the case for IDs that are unique per sample). The bigger the dataset, the more likely you'll get a memory issue.
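A rough sketch of the OneHotEncoder route, applied only to the string columns via ColumnTransformer (the column indices follow the question; handle_unknown='ignore' keeps unseen test categories from raising errors):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("train.csv")
X = df.iloc[:, 0:41]   # the features, as in the question
y = df.iloc[:, 41]     # the label

encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), [1, 2, 3])],  # the string columns
    remainder="passthrough",   # leave the numeric/binary columns untouched
)
X_encoded = encoder.fit_transform(X)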
