Label encode then impute missing then inverse encoding - python

I have a data set on police killings that you can find on Kaggle. There's some missing data in several columns:
UID 0.000000
Name 0.000000
Age 0.018653
Gender 0.000640
Race 0.317429
Date 0.000000
City 0.000320
State 0.000000
Manner_of_death 0.000000
Armed 0.454487
Mental_illness 0.000000
Flee 0.000000
dtype: float64
I created a copy of the original df to encode it and then impute missing values. My plan was:
Label encode all categorical columns:
Index(['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed',
'Mental_illness', 'Flee'],
dtype='object')
le = LabelEncoder()
lpf = {}
for col in lepf.columns:
lpf[col] = le.fit_transform(lepf[col])
lpfdf = pd.DataFrame(lpf)
Now I have my dataframe with all categories encoded.
Then, I located those nan values in the original dataframe (pf), to substitute those encoded nan's in lpfdf:
for col in lpfdf:
print(col,"\n",len(np.where(pf[col].to_frame().isna())[0]))
Gender 8
Race 3965
City 4
State 0
Manner_of_death 0
Armed 5677
Mental_illness 0
Flee 0
For instance, Gender got three encoded labels: 0 for Male, 1 for Female, and 2 for nan. However, the feature City had >3000 values, and it was not possible to locate it using value_counts(). For that reason, I used:
np.where(pf["City"].to_frame().isna())
Which yielded:
(array([ 4110, 9093, 10355, 10549], dtype=int64), array([0, 0, 0,
0], dtype=int64))
Looking to any of these rows corresponding to the indices, I saw that the nan label for City was 3327:
lpfdf.iloc[10549]
Gender 1
Race 6
City 3327
State 10
Manner_of_death 1
Armed 20
Mental_illness 0
Flee 0
Name: 10549, dtype: int64
Then I proceded to substitute these labels for np.nan:
"""
Gender: 2,
Race: 6,
City: 3327,
Armed: 59
"""
lpfdf["Gender"] = lpfdf["Gender"].replace(2, np.nan)
lpfdf["Race"] = lpfdf["Race"].replace(6, np.nan)
lpfdf["City"] = lpfdf["City"].replace(3327, np.nan)
lpfdf["Armed"] = lpfdf["Armed"].replace(59, np.nan)
Create the instance of iterative imputer and then fit and transform lpfdf:
itimp = IterativeImputer()
iilpf = itimp.fit_transform(lpfdf)
Then make a dataframe for these new imputed values:
itimplpf = pd.DataFrame(np.round(iilpf), columns = lepf.columns)
And finally, when I go to inveres transform to see the corresponding labels it imputed I get the following error:
for col in lpfdf:
le.inverse_transform(itimplpf[col].astype(int))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-191-fbdde4bb4781> in <module>
1 for col in lpfdf:
----> 2 le.inverse_transform(itimplpf[col].astype(int))
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in inverse_transform(self, y)
158 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
159 if len(diff):
--> 160 raise ValueError(
161 "y contains previously unseen labels: %s" % str(diff))
162 y = np.asarray(y)
ValueError: y contains previously unseen labels: [2 3 4 5]
What is wrong with my steps?
Sorry for my long-winded explanation but I felt that I need to explain all the steps so that you can understand the issue properly. Thank you all.

A possibility that might be worth exploring is predicting missing categorical (encoded) values using a machine learning algorithm e.g. sklearn.ensemble.RandomForestClassifier.
Here, you would train a multiclass classification model for predicting missing values of each of your columns. You'd start by replacing missing values with a magic value (e.g -99), and then one-hot encode them. Next, train a classification model to predict the categorical value of a chosen column, using the one-hot encoded values of the other columns as training data. The training data would, of course, exclude rows where the column to be predicted is missing. Finally, compose a "test" set made from the rows where this column is missing, predict the values, and impute these values into the column. Repeat this for each column that needs to have missing values imputed.
Assuming you want to apply machine learning techniques to this data at a later point, a deeper question is whether the absence of values in some examples of the dataset may in fact carry useful information for predicting your Target, and consequently, whether a particular imputation strategy could corrupt that information.
Edit: Below is an example of what I mean, using dummy data.
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
#from catboost import CatBoostClassifier
# create some fake data
n_samples = 1000
n_features = 20
features_og, _ = make_classification(n_samples=n_samples, n_features=n_features,n_informative=3, n_repeated= 16, n_redundant = 0)
# convert to fake categorical data
features_og = (features_og*10).astype(int)
# add missing value flag (-99) at random
features = features_og.copy()
for i in range(n_samples):
for j in range(n_features):
if np.random.random() > 0.85:
features[i,j] = -99
# go column by column predicting and replacing missing values
features_fixed = features.copy()
for j in range(n_features):
# do train test split based on whether the selected column value is -99.
train = features[np.where(features[:,j] != -99)]
test = features[np.where(features[:,j] == -99)]
clf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=42)
# potentially better for categorical features is CatBoost:
#clf = CatBoostClassifier(n_estimators= 300,cat_features=[identify categorical features here])
# train the classifier to predict the value of column j using the other columns
clf.fit(train[:,[x for x in range(n_features) if x != j]], train[:,j])
# predict values for elements of column j that have the missing flag
preds = clf.predict(test[:,[x for x in range(n_features) if x != j]])
# substitute the missing values in column j with the predicted values
features_fixed[np.where(features[:,j] == -99.),j] = preds

Your approach of encoding categorical values first and then imputing missing values is prone to problems and thus, not recommended.
Some imputing strategies, like IterativeImputer, will not guarantee that the output contains only previously known numeric values . This can result in imputed values which are unknown to the encoder and will cause an error upon the inverse transformation (which is exactly your case).
It is better to first impute the missing values for both, numeric and categorical features, and then encode the categorical features. One option would be to use SimpleImputer and replacing missing values with the most frequent category or a new constant value.
Also, a note on LabelEncoder: it is clearly mentioned in its documentation that:
This transformer should be used to encode target values, i.e. y, and not the input X.
If you insist on an encoding strategy like LabelEncoder, you can use OrdinalEncoder which does the same but is actually meant for feature encoding. However, you should be aware that such an encoding strategy might falsely suggest an ordinal relationship between each category of a feature, which might lead to undesired consequences. You should therefore consider other encoding strategies as well.

The entire process can be automated with the datawig package.You just need to create an imputation model for each to-be-imputed column and it will handle the encoding and inverse encoding by itself.
It was even tested against kNN and iterative imputer and showed better results.
Here is a personal guide.

Related

How to create synthetic data based on dataset with mixed data types for classification problem?

I am trying to build a classification model, but I don't have enough data. What would be the most appropriate way to create synthetic data based on my existing dataset if I have numerical and categorical features?
I looked at using Vine copulas like here: https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html#Vine-Copulas but sampling such copulas gives floats even for the columns that I would like to be integers (label-encoded categorical features). And then I dont know how to convert such floats back to a categorical features.
Sample toy code of my problem is below
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_openml
from copulas.multivariate import VineCopula, GaussianMultivariate
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X['label'] = y
# reducing features and removing nulls to keep things simple
X = X[['sex', 'age', 'fare', 'embarked', 'label']]
row_keep = X.isnull().sum(axis=1) == 0
df = X.loc[row_keep, :].copy()
df.reset_index(drop=True, inplace=True)
# encoding columns
cat_cols = ['sex', 'embarked', 'label']
num_cols = ['age', 'fare']
label_encoders = {}
for c in cat_cols:
cat_proc = preprocessing.LabelEncoder()
col_proc = cat_proc.fit_transform(df[c])
df[c] = col_proc
label_encoders[c] = cat_proc
# Fit a copula
copula = VineCopula('regular')
copula.fit(df)
# Sample synthetic data
df_synthetic = copula.sample(1000)
All the columns of df_synthetic are floats. How would I convert those back to ints that I can map back to categorical features?
Is there another way to augment this sort of dataset? Would be even better, if it's performant and I can sample 7000-10000 new synthetic entries. The toy problem with 5 columns above took ~1mins to sample 1000 rows, but my real problem has 27 columns, which I imagine would take a lot longer.
To have your columns converted to ints, use round and then .astype(int):
df_synthetic["sex"] = round(df_synthetic["sex"]).astype(int)
df_synthetic["embarked"] = round(df_synthetic["embarked"]).astype(int)
df_synthetic["label"] = round(df_synthetic["label"]).astype(int)
You might have to adjust values manually (ex. cap sex in [0,1] if some larger/smaller value has been generated), but that will strongly depend on your data characteristics.

How to decide an optimum value for n_features parameter in Sklearn FeatureHasher

I am using hash encoding on a categorical column with 13 different value counts, and ideally speaking one-hot and dummy will give us 12 and 13 columns respectively after encoding. But when it comes to hash encoding, the default value of n_features is 2**20, which eventually creates 1000000+ columns.
How does one choose the value of n_features? I see that we need to consider a value to the nearest power of 2. Say if we consider the IRIS dataset we can end up using 2 or maybe even 4 for the n_features.
But what about a dataset where we have 13 different values in a column, what would be the n_features then?
# Feature Hashing Code:
import pandas as pd, numpy as np
from sklearn.feature_extraction import FeatureHasher
df = pd.read_csv(r'C:/Users/<user_name>/Downloads/datasets/countriesoftheworld.csv')
hash_encoder = FeatureHasher(n_features = ????, alternate_sign=False, input_type='string')
features = hash_encoder.fit_transform(df['Country'])
print(df.shape)
(227, 20)
I have not dropped the "Country" column, just to contrast between the original and encoded columns.
df = iris.join(pd.DataFrame(features.toarray()).add_prefix('encoded_'))
print(df.shape)
(227, 1048596)

How can I fill in a column missing 20% in the dataset?

There is a column missing 54% in the dataset. 17031 data is missing in this column. I did not delete it because this column is important to me. I filled it with knn. But because its neighbors are also nan values, some rows are still filled in nan. I changed the number of neighbors 3, I tried 4 and 5 but the result is the same. 12116 lines remain nan. Do you suggest me to wipe the column, do you have any other recommended method?
from sklearn.impute import KNNImputer
df_n = df[["Credit_Score","Annual_Income"]]
var_names = df_n.columns
n_df = np.array(df_n)
imputer = KNNImputer(n_neighbors=3)
new_data = imputer.fit_transform(n_df)
df2=pd.DataFrame(new_data, columns=var_names)
for s in ["Credit_Score","Annual_Income"]:
df[[s]] = df2[s]
You can use sklearn's SimpleImputer (link), which can fill the missing values with the mean, median, or other constant related to the column. This is a simpler imputation strategy than KNN, but it does ensure that no nans are remaining after imputation.

How to give column names after one-hot encoding with sklearn?

Here is my question, I hope someone can help me to figure it out..
To explain, there are more than 10 categorical columns in my data set and each of them has 200-300 categories. I want to convert them into binary values. For that I used first label encoder to convert string categories into numbers. The Label Encoder code and the output is shown below.
After Label Encoder, I used One Hot Encoder From scikit-learn again and it is worked. BUT THE PROBLEM IS, I need column names after one hot encoder. For example, column A with categorical values before encoding. A = [1,2,3,4,..]
It should be like that after encoding,
A-1, A-2, A-3
Anyone know how to assign column names to (old column names -value name or number) after one hot encoding. Here is my one hot encoding and it's output;
I need columns with name because I trained an ANN, but every time data comes up I cannot convert all past data again and again. So, I want to add just new ones every time. Thank anyway..
You can get the column names using .get_feature_names() attribute.
>>> ohenc.get_feature_names()
>>> x_cat_df.columns = ohenc.get_feature_names()
Detailed example is here.
Update
from Version 1.0, use get_feature_names_out
This example could help for future readers:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
train_X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})
>>>
Sex AgeGroup
0 male 0
1 female 15
2 male 30
3 female 45
4 male 60
5 female 75
encoder=OneHotEncoder(sparse=False)
train_X_encoded = pd.DataFrame (encoder.fit_transform(train_X[['Sex']]))
train_X_encoded.columns = encoder.get_feature_names(['Sex'])
train_X.drop(['Sex'] ,axis=1, inplace=True)
OH_X_train= pd.concat([train_X, train_X_encoded ], axis=1)
>>>
AgeGroup Sex_female Sex_male
0 0 0.0 1.0
1 15 1.0 0.0
2 30 0.0 1.0
3 45 1.0 0.0
4 60 0.0 1.0
5 75 1.0 0.0`
Hey I had the same problem whereby I had a custom Estimator which extended the BaseEstimator Class from Sklearn.base
I added a class attribute into the init called self.feature_names then as a last step in the transform method just updated self.feature_names with the columns from the result.
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
class CustomOneHotEncoder(BaseEstimator, TransformerMixin):
def __init__(self, **kwargs):
self.feature_names = []
def fit(self, X, y=None):
return self
def transform(self, X):
result = pd.get_dummies(X)
self.feature_names = result.columns
return result
A bit basic I know but it does the job I need it to.
If you want to retrieve the column names for the feature importances from your sklearn pipeline you can get the features from the classifier step and the column names from the one hot encoding step.
a = model.best_estimator_.named_steps["clf"].feature_importances_
b = model.best_estimator_.named_steps["ohc"].feature_names
df = pd.DataFrame(a,b)
df.sort_values(by=[0], ascending=False).head(20)
There is another easy way with the package category_encoders this method uses a pipeline which also is one of the data science best practices.
import pandas as pd
from category_encoders.one_hot import OneHotEncoder
X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})
ohe = OneHotEncoder(use_cat_names=True)
ohe.fit_transform(X)
Update: based on the answer of #Venkatachalam, the method get_feature_names() has been deprecated in scikit-learn 1.0. You will get a warning when trying to run it. Instead, use get_feature_names_out():
import pandas as pd
from category_encoders.one_hot import OneHotEncoder
ohenc = OneHotEncoder(sparse=False)
x_cat_df = pd.DataFrame(ohenc.fit_transform(xtrain_lbl))
x_cat_df.columns = ohenc.get_feature_names_out(input_features=xtrain_lbl.columns)
Setting the parameter sparse=False in OneHotEncoder() will return an array instead of sparse matrix, so you don't need to convert it later. fit_transform() will calculate the parameters and transform the training set in one line.
Source: OneHotEncoder documentation

LSTM with more features / classes

How can I use more than one feature/class as input/output on a LSTM using Sequential from Keras Models in Python?
To be more specific, I would like to use as input and output to the network: [FeatureA][FeatureB][FeatureC].
FeatureA is a categorial class with 100 different possible values indicating the sensor which collected the data;
FeatureB is a on/off indicator, being 0 or 1;
FeatureC is a categorial class too having 5 unique values.
Data Example:
1. 40,1,5
2. 58,1,2
3. 57,1,5
4. 40,0,1
5. 57,1,4
6. 23,0,3
When using the raw data and loss='categorical_crossentropy' on model.compile, the loss is over than 10.0.
When normalisating the data to values between 0-1 and using mean_squared_error on loss, it gets an average of 0.27 on loss. But when testing it on prediction, the results does not makes any sense.
Any suggestions here or tutorials I could consult?
Thanks in advance.
You need to convert FeatureC to a binary category. Mean squared error is for regression and as best I can tell you are trying to predict which class a certain combination of sensors and states belongs to. Since there's 5 possible classes you can kind of think that you're trying to predict if the class is Red, Green, Blue, Yellow, or Purple. Right now those are represented by numbers but for a regression you model is going to be predicting values like 3.24 which doesn't make sense.
In effect you're converting values of FeatureC to 5 columns of binary values. Since it seems like the classes are exclusive there should be a single 1 and the rest of the columns of a row would be 0s. So if the first row is 'red' it would be [1, 0, 0, 0, 0]
For best results you should also convert FeatureA to binary categorical features. For the same reason as above, sensor 80 is not 4x more than sensor 20, but instead a different entity.
The last layer of your model should be of softmax type with 5 neurons. Basically your model is going to try to be predicting the probability of each class, in the end.
It looks like you are working with a dataframe since there's an index. Therefore I would try:
import keras
import numpy as np
import pandas as pd # assume that this has probably already been done though
feature_a = data.loc[:, "FeatureA"].values # pull out feature A
labels = data.loc[:, "FeatureC"].values # y are the labels that we're trying to predict
feature_b = data.loc[:, "FeatureB"].values # pull out B to make things more clear later
# make sure that feature_b.shape = (rows, 1) otherwise reset the shape
# so hstack works later
feature_b = feature_b.reshape(feature_b.shape[0], 1)
labels -= 1 # subtract 1 from all labels values to zero base (0 - 4)
y = keras.utils.to_categorical(labels)
# labels.shape should be (rows, 5)
# convert 1-100 to binary columns
# zero base again
feature_a -= 1
# Before: feature_a.shape=(rows, 1)
feature_a_data = keras.utils.to_categorical(feature_a)
# After: feature_a_data.shape=(rows, 100)
data = np.hstack([feature_a_data, feature_b])
# data.shape should be (rows, 101)
# y.shape should be (rows, 5)
Now you're ready to train/test split and so on.
Here's something to look at which has multi-class prediction:
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Categories