Target encoding and MissForest imputation for columns with NaN - python

I have a column, risk_appetite, that contains some NaN values. The screenshot below is a summary of the column:
I plan to use the KNN method to impute the missing values, so I need to encode the column before imputing. I'm using target encoding, and this is the function I'm using:
from category_encoders import TargetEncoder
encoder = TargetEncoder(handle_missing='return_nan')
def targetencoder(data, col, target):
    data[col] = encoder.fit_transform(data[col], data[target])
Then, I call the function to encode my column:
listofcol_te = ['risk_appetite']
for col in listofcol_te:
    targetencoder(df, col, 'target_variable')
Once the encoding is done, this is the output:
Everything is fine until here. Next, I start to do imputation (using MissForest imputation) for the column:
import sys
import pandas as pd
import sklearn.neighbors._base
# Alias the old module path that missingpy tries to import
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
# Copy the original dataset
data = df.copy()
# Impute
imputer = MissForest()
data_imputed = imputer.fit_transform(data)
data_imputed = pd.DataFrame(data=data_imputed, columns=data.columns)
I managed to impute all the NaN values in risk_appetite, and this is the result:
As you can see from the screenshot above, there were initially only 5 categories for risk_appetite, but after the imputation there are 1332. MissForest seems to be creating new categories instead of assigning one of the existing categories to each NaN.
Did I do anything wrong? Or should MissForest imputation not be used for categorical features? What is the best way to impute risk_appetite if MissForest is not suitable? I have seen imputation by mean, mode and median, but I don't think those are good approaches here. Any help or advice will be greatly appreciated!
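One possible reading of the result, offered as an assumption rather than a statement about MissForest's internals: once risk_appetite is target-encoded it is just a float column, so an imputer that treats it as numeric can fill the gaps with any float, and every new float then shows up as a new "category". A minimal sketch of one workaround is to snap each imputed value back to the nearest of the original encoded levels. The data, the imputed values and the snap_to_levels helper below are all hypothetical; only NumPy and pandas are used.

```python
import numpy as np
import pandas as pd


def snap_to_levels(imputed, levels):
    """Map each value in `imputed` to the closest entry in `levels` (hypothetical helper)."""
    levels = np.sort(np.asarray(levels, dtype=float))
    values = np.asarray(imputed, dtype=float)
    # Distance from every value to every level, then pick the nearest level per value.
    nearest = np.abs(values[:, None] - levels[None, :]).argmin(axis=1)
    return levels[nearest]


# Hypothetical target-encoded column: five distinct levels plus two NaN cells.
encoded = pd.Series([0.10, 0.25, 0.40, 0.55, 0.70, np.nan, np.nan])
levels = encoded.dropna().unique()

# Hypothetical imputer output: the NaN cells were filled with arbitrary floats.
imputed = pd.Series([0.10, 0.25, 0.40, 0.55, 0.70, 0.31, 0.62])

snapped = pd.Series(snap_to_levels(imputed, levels), index=imputed.index)
print(snapped.nunique())  # number of distinct values after snapping
```

Whether snapping is appropriate depends on the data; it only guarantees that no new levels appear, not that the chosen level is the right one.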

Related

Getting NaN in a column after applying map() function

I'm trying to replace the categorical variable in the Gender column - M, F with 0, 1. However, after running my code I'm getting NaN in place of 0 & 1.
Code-
df['Gender'] = df['Gender'].map({'F':1, 'M':0})
My input data frame-
Dataframe after running the code-
Details- Gender (Data Type) - object
Kindly, suggest a way out!
Maybe the values in your dataframe are different from the expected strings 'F' and 'M'. Try using LabelEncoder from scikit-learn.
from sklearn.preprocessing import LabelEncoder
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])
This particular code resolved the issue-
# Import label encoder
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Gender'.
df['Gender']= label_encoder.fit_transform(df['Gender'])
df['Gender'].unique()
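To follow up on the suggestion that the values may not be exactly 'F' and 'M': a minimal sketch, assuming the cause is stray whitespace or inconsistent case (the sample data below is made up). It inspects the raw values first, normalises them, and only then applies map().

```python
import pandas as pd

# Hypothetical data with trailing/leading spaces and mixed case.
df = pd.DataFrame({'Gender': ['M', 'F ', ' m', 'F']})

# Look at the raw values: repr() makes hidden whitespace visible.
print([repr(v) for v in df['Gender'].unique()])

# Normalise, then map. Any value missing from the dict still becomes NaN.
cleaned = df['Gender'].str.strip().str.upper()
df['Gender'] = cleaned.map({'F': 1, 'M': 0})
print(df['Gender'].tolist())
```

Another common cause is running the map() line twice: on the second run the column already holds 0 and 1, neither of which is a key in the dict, so every value becomes NaN.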

Pipeline with SimpleImputer and OneHotEncoder - how to do properly?

I'm facing a challenge creating a pipeline that imputes (SI) a categorical variable (e.g. colour) and then one-hot encodes (OHE) two variables (e.g. colour and dayofweek). colour is used in both steps.
I wanted to put SI and OHE in one ColumnTransformer. I just learnt that SI and OHE run in parallel there, meaning OHE will not encode the imputed colour (i.e. it encodes the original, un-imputed colour).
I then tried:
si = SimpleImputer(strategy='mean', add_indicator=True)
ohe = OneHotEncoder(sparse=False,drop='first')
ctsi = ColumnTransformer(transformers=[('si',si,['colour'])],remainder='passthrough')
ctohe = ColumnTransformer(transformers=[('ohe',ohe,['colour','dayofweek'])],remainder='passthrough')
pl = Pipeline([('ctsi',ctsi),('ctohe',ctohe)])
outfit = pl.fit_transform(X,y)
I get the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I believe it's because the column name colour has been removed by SI. When I change the OHE columns to a list of int:
ctohe = ColumnTransformer(transformers=[('ohe',ohe,[0,1])],remainder='passthrough')
It goes through. I'm just testing the processing, obviously, the columns are incorrect.
So my challenge here is that given what I want to accomplish, is it possible ? And how can I do that ?
Great many thanks in advance !
Actually, I agree with your reasoning. The problem is given by the fact that ColumnTransformer forgets the column names after the transformation and indeed - quoting the answer in here - ColumnTransformer's intended usage is to deal with transformations applied in parallel. That's also specified in the doc by means of this sentence in my opinion:
This estimator allows different columns or column subsets of the input to be transformed separately. [...] This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
I guess that one solution to this might be to go with a custom renaming of the columns, passing a callable to the columns portion of ColumnTransformer's transformers tuple (name, transformer, columns) (notation which follows the documentation) according to your needs (actually, this would work I guess if you pass a callable to your second ColumnTransformer instance in the pipeline).
EDIT: I have to withdraw somehow what I wrote, I'm not actually sure that passing a callable to columns might work for your need as your problem does not really stand in column selection per se, but rather in column selection via string column names, for which you would need a DataFrame (and imo acting on column selector only won't solve such problem).
Instead, you might better need a transformer that somehow changes column names after the imputation and before the one-hot-encoding (still provided that the setting is not the ideal one when different instances of ColumnTransformer have to transform the same variables in sequence in a Pipeline) acting on a DataFrame.
Actually, a couple of months ago the following PR was merged: https://github.com/scikit-learn/scikit-learn/pull/21078. I suspect it is not yet in the latest release, because I couldn't get it to work after upgrading sklearn. Anyway, IMO it may help in similar situations in the future, as it adds get_feature_names_out() to SimpleImputer, and get_feature_names_out() is in turn really useful when dealing with column names.
In general, I would also suggest the same post linked above for further details.
Eventually, here's a naive example I could get to; it's not scalable (I tried to get to something more scalable exploiting feature_names_in_ attribute of the fitted SimpleImputer instance, without arriving to a consistent result) but hopefully might give some hints.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
X = pd.DataFrame({'city': ['London', 'London', 'Paris', np.nan],
                  'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath'],
                  'expert_rating': [5, 3, 4, 5],
                  'user_rating': [4, 5, 4, 3]})
# Step 1: impute 'city'; the output is a NumPy array, so the column names are lost
ct_1 = ColumnTransformer([('si', SimpleImputer(strategy='most_frequent'), ['city'])],
                         remainder='passthrough')
# Step 3: one-hot encode 'city' by name, which requires a DataFrame as input
ct_2 = ColumnTransformer([('ohe', OneHotEncoder(), ['city'])], remainder='passthrough', verbose_feature_names_out=True)
class ColumnExtractor(BaseEstimator, TransformerMixin):
    """Step 2: wrap the array back into a DataFrame with the given column names."""
    def __init__(self, columns):
        self.columns = columns
    def transform(self, X, *_):
        return pd.DataFrame(X, columns=self.columns)
    def fit(self, *_):
        return self
pipe = Pipeline([
    ('ct_1', ct_1),
    ('ce', ColumnExtractor(['city', 'title', 'expert_rating', 'user_rating'])),
    ('ct_2', ct_2)
])
pipe.fit_transform(X)
I think the simplest option (for now) is separate pipelines for the two columns:
si = SimpleImputer(strategy='mean', add_indicator=True)
ohe = OneHotEncoder(sparse=False, drop='first')
siohe = Pipeline([
    ('si', si),
    ('ohe', ohe),
])
ct = ColumnTransformer(
    transformers=[
        ('siohe', siohe, ['colour']),
        ('ohe', ohe, ['dayofweek']),
    ],
    remainder='passthrough'
)
(You specified the strategy as 'mean', which is probably not what you want, but I've left it in the code above.)
You might also consider just using siohe for both columns: if there are no missings in dayofweek (in production data!) then there's no real difference.
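For anyone who wants to run the separate-pipelines idea end to end, here is a small sketch on made-up data. It departs from the snippet above in three ways, all of them my own assumptions: the imputer uses strategy='most_frequent' (a mean is not defined for string categories), add_indicator is left out to keep the output easy to read, and dense output is requested through ColumnTransformer's sparse_threshold=0 rather than through the encoder, since the encoder's sparse argument was renamed in later scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'colour' has a missing value, 'dayofweek' does not.
X = pd.DataFrame({
    'colour': ['red', 'blue', np.nan, 'red'],
    'dayofweek': ['mon', 'tue', 'mon', 'wed'],
    'price': [1.0, 2.0, 3.0, 4.0],
})

# Impute and encode 'colour' sequentially inside one sub-pipeline.
impute_then_encode = Pipeline([
    ('si', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(drop='first')),
])

ct = ColumnTransformer(
    transformers=[
        ('colour', impute_then_encode, ['colour']),
        ('dayofweek', OneHotEncoder(drop='first'), ['dayofweek']),
    ],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)

out = ct.fit_transform(X)
print(out.shape)
```

Because each column is handled by exactly one entry of the ColumnTransformer, the "parallel" behaviour is no longer a problem: the sequencing happens inside the sub-pipeline for colour.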

Solving Kaggle's Titanic Machine Learning

I'm trying to solve Kaggle's Titanic with Python.
But I have an error trying to fit my data.
This is my code:
import pandas as pd
from sklearn import linear_model
def clean_data(data):
    data["Fare"] = data["Fare"].fillna(data["Fare"].dropna().median())
    data["Age"] = data["Age"].fillna(data["Age"].dropna().median())
    data.loc[data["Sex"] == "male", "Sex"] = 0
    data.loc[data["Sex"] == "female", "Sex"] = 1
    data.loc["Embarked"] = data["Embarked"].fillna("S")
    data.loc[data["Embarked"] == "S", "Embarked"] = 0
    data.loc[data["Embarked"] == "C", "Embarked"] = 1
    data.loc[data["Embarked"] == "Q", "Embarked"] = 2
train = pd.read_csv("train.csv")
clean_data(train)
target = train["Survived"].values
features = train[["Pclass", "Age", "Sex", "SibSp", "Parch"]].values
classifier = linear_model.LogisticRegression()
classifier_ = classifier.fit(features, target)  # Here is where error comes from
And the error is this:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Can you help me please?
Before you fit the model with features and target, it is good practice to check whether any of the features you want to use contain null values. You can use the following to check:
dataframe_name.isnull().any() gives, for each column, True if at least one NaN value is present
dataframe_name.isnull().sum() gives, for each column, the number of NaN values present
Once you know which columns are affected, you can clean the data in those columns.
After that, the NaN error should no longer occur.
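As a small illustration of the two checks above, here is a sketch on a made-up frame that reuses the feature names from the question (the values are invented). It restricts the check to the columns that are actually passed to the model.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned training frame.
train = pd.DataFrame({
    'Pclass': [3, 1, 3],
    'Age': [22.0, np.nan, 26.0],
    'Sex': [0, 1, np.nan],
    'SibSp': [1, 1, 0],
    'Parch': [0, 0, 0],
})
feature_cols = ['Pclass', 'Age', 'Sex', 'SibSp', 'Parch']

# True/False per column: does it contain at least one NaN?
has_nan = train[feature_cols].isnull().any()

# Number of NaN values per column.
missing_counts = train[feature_cols].isnull().sum()

# Names of the columns that still need cleaning.
cols_with_nan = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_nan)
```

Running this right before the fit call shows which feature columns still contain NaN after the cleaning step.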
You should reset the index of your dataframe before running any sklearn code:
df = df.reset_index()
NaN simply represents empty, None or null values in a dataset. Before applying an ML algorithm to the dataset you first need to preprocess it, which is also called data cleaning. You can use scikit-learn's impute module to handle NaN.
How to check if a dataset has NaN:
A pandas Series' isnull() returns a Series of True/False values showing which entries are NaN, for example:
s = pd.Series(['a', 'b', np.nan, 'c', np.nan])
s.isnull()
out: False, False, True, False, True
And s.isnull().sum() returns the count of null values present in the series, in this case 2.
You can apply this method to a DataFrame as well, e.g. df.isnull()
Two techniques I know to handle NaN:
1. Removing the rows which contain NaN, e.g.
s.dropna() or s.dropna(inplace=True) or df.dropna(how='all')
But this can remove a lot of valuable information from the dataset, so we mostly avoid it.
2. Imputing: replacing the NaN values with the mean/median of the column.
import numpy as np
from sklearn.impute import SimpleImputer
# SimpleImputer replaces the old sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# strategy can also be 'median' or 'most_frequent'
imputed_data = imputer.fit_transform(training_data_df.values)
print(imputed_data)
I hope this would help you.

Creating many feature columns in Tensorflow

I'm getting started on a Tensorflow project, and am in the middle of defining and creating my feature columns. However, I have hundreds and hundreds of features- it's a pretty extensive dataset. Even after preprocessing and scrubbing, I have a lot of columns.
The traditional way of creating a feature_column is defined in the Tensorflow tutorial and even this StackOverflow post. You essentially declare and initialize a Tensorflow object for each feature column:
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
This works all well and good if your dataset has only a few columns, but in my case, I surely don't want to have hundreds of lines of code initializing different feature_column objects.
What's the best way to resolve this issue? I notice that in the tutorial, all the columns are collected as a list:
base_columns = [
    gender, native_country, education, occupation, workclass, relationship,
    age_buckets,
]
Which is ultimately passed into your estimator:
m = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns)
So would the ideal way of handling feature_column creation for hundreds of columns be to append them directly into a list? Something like this?
my_columns = []
for col in df.columns:
    if is_string_dtype(df[col]):  # is_string_dtype is pandas function
        my_column.append(tf.feature_column.categorical_column_with_hash_bucket(col,
            hash_bucket_size=len(df[col].unique())))
    elif is_numeric_dtype(df[col]):  # is_numeric_dtype is pandas function
        my_column.append(tf.feature_column.numeric_column(col))
Is this the best way of creating these feature columns? Or am I missing some functionality to Tensorflow that allows me to work around this step?
What you have posted in the question makes sense. Small extension based on your own code:
import pandas.api.types as ptypes
import tensorflow as tf
my_columns = []
for col in df.columns:
    if ptypes.is_string_dtype(df[col]):
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(col,
            hash_bucket_size=len(df[col].unique())))
    elif ptypes.is_numeric_dtype(df[col]):
        my_columns.append(tf.feature_column.numeric_column(col))
    elif ptypes.is_categorical_dtype(df[col]):
        # tf.feature_column has no plain categorical_column(); use the category labels as the vocabulary
        my_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(col,
            vocabulary_list=list(df[col].cat.categories)))
I used your own answer. Just edited a little bit (there should be my_columns instead of my_column in for loop) and posting it the way it worked for me.
import pandas.api.types as ptypes
my_columns = []
for col in df.columns:
    if ptypes.is_string_dtype(df[col]):  # is_string_dtype is pandas function
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(col,
            hash_bucket_size=len(df[col].unique())))
    elif ptypes.is_numeric_dtype(df[col]):  # is_numeric_dtype is pandas function
        my_columns.append(tf.feature_column.numeric_column(col))
The above two methods work only if the data is provided as a pandas DataFrame with a name for each column. If all your columns are numeric and you don't want to name them (e.g. when reading several numeric columns from a NumPy array), you can use something like this:
feature_column = [tf.feature_column.numeric_column(key='image', shape=(784,))]
input_fn = tf.estimator.inputs.numpy_input_fn(dict({'image': x_train}))
where x_train is your NumPy array with 784 columns. You can check this post by Vikas Sangwan for more details.

Data imputation with fancyimpute and pandas

I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means or the most frequent values is not an option either (hence imputation with pandas and/or scikit-learn unfortunately doesn't do the trick).
I came across what seems to be a neat package called fancyimpute (you can find it here). But I have some problems with it.
Here is what I do:
# the necessary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN
# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[float])
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
However, df_filled is a single vector somehow, instead of the filled data frame. How do I get a hold of the data frame with imputations?
Update
I realized that fancyimpute needs a NumPy array, so I converted df_numeric to an array using as_matrix().
# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[float]).as_matrix()
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
The output is a dataframe with the column labels gone missing. Any way to retrieve the labels?
Add the following lines after your code:
df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index
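The two lines above only work while df_numeric is still a DataFrame, i.e. before it is converted to an array. A minimal sketch of the same label-restoring idea, assuming you are free to swap fancyimpute's KNN for scikit-learn's KNNImputer (a different implementation, used here only because it is easy to run); the data is made up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric frame with a non-default index and some gaps.
df_numeric = pd.DataFrame(
    {'a': [1.0, 2.0, np.nan, 4.0], 'b': [10.0, np.nan, 30.0, 40.0]},
    index=['w', 'x', 'y', 'z'],
)

# The imputer returns a plain NumPy array, so the labels are gone at this point.
filled = KNNImputer(n_neighbors=3).fit_transform(df_numeric)

# Rebuild the DataFrame, passing the original labels back in explicitly.
df_filled = pd.DataFrame(filled, columns=df_numeric.columns, index=df_numeric.index)
print(df_filled)
```

Keeping df_numeric as a DataFrame and only handing the array to the imputer avoids losing the labels in the first place. Note also that as_matrix() is no longer available in recent pandas versions; to_numpy() is the replacement.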
I see the frustration with fancy impute and pandas. Here is a fairly basic wrapper using the recursive override method. Takes in and outputs a dataframe - column names intact. These sort of wrappers work well with pipelines.
import pandas as pd
from fancyimpute import SoftImpute
class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""
    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100, max_rank=None, n_power_iterations=1, init_fill_method="zero",
                 min_value=None, max_value=None, normalizer=None, verbose=True):
        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value,
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters, max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value, max_value=max_value,
                                           normalizer=normalizer, verbose=False)
    def fit_transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"
        # Columns with fewer than 10 missing values are simply filled with 0.0
        for col in X.columns:
            if X[col].isnull().sum() < 10:
                X[col].fillna(0.0, inplace=True)
        z = super(SoftImputeDf, self).fit_transform(X.values)
        return pd.DataFrame(z, index=X.index, columns=X.columns)
I really appreciate @jander081's approach, and expanded on it a tiny bit to deal with setting categorical columns. I had a problem where the categorical columns would get unset and create errors during training, so I modified the code as follows:
from fancyimpute import SoftImpute
import pandas as pd
class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""
    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100, max_rank=None, n_power_iterations=1, init_fill_method="zero",
                 min_value=None, max_value=None, normalizer=None, verbose=True):
        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value,
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters, max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value, max_value=max_value,
                                           normalizer=normalizer, verbose=False)
    def fit_transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"
        for col in X.columns:
            if X[col].isnull().sum() < 10:
                X[col].fillna(0.0, inplace=True)
        z = super(SoftImputeDf, self).fit_transform(X.values)
        df = pd.DataFrame(z, index=X.index, columns=X.columns)
        cats = list(X.select_dtypes(include='category'))
        df[cats] = df[cats].astype('category')
        # return pd.DataFrame(z, index=X.index, columns=X.columns)
        return df
df=pd.DataFrame(data=mice.complete(d), columns=d.columns, index=d.index)
The np.array that is returned by the .complete() method of the fancyimpute object (be it mice or KNN) is fed as the content (argument data=) of a pandas dataframe whose cols and indexes are the same as the original data frame.
