SciKit Learn when the data contains strings or boolean values - python

I have a dataframe that looks like this (it is obviously much bigger):
  id  points  isAvailable  frequency  Score
abc1     325         True       93.0   0.01
def2     467        False       80.1   0.59
ghi3     122         True       90.3   1
jkl4     546         True       84.0   0
mno5     355        False       93.5   0.99
I want to see how much the features points, isAvailable and frequency influence the Score. I want to use Random Forests:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})
X = df
y = df['Score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
I get the following error: ValueError: could not convert string to float: 'abc1'
Questions:
How can I pre-process the data? What happens to the boolean variables?
Is it wrong to even include the id column in X?
I was thinking of using something like df = df.astype({"a": int, "b": complex}) but I don't really know how in this case and I read that there are special algorithms for encoding.

First, you have to remove the Score column from the X dataset: it is the label of your data, so it should not be used as a feature.
Second, assuming the id column is just an identifier for your data, you should remove it from X as well. It is as if you were analyzing a dataset of the weights of a group of people: you would drop their names, because there is no correlation between their names and their weights.
Last, to deal with the boolean variable, there are encoding methods like you mentioned (for example this one), but since the value can only be 0 or 1, it is fine to simply convert False = 0 and True = 1.
You can do it with this code (assuming df is the name of your DataFrame):
df['isAvailable'] = (df['isAvailable'] == True).astype(int)
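Putting the three points together, a minimal sketch of the whole flow could look like this (it reuses the question's dataframe and column names; the importance printout at the end is just one way to inspect how much each feature matters):
# Sketch only: assumes df is the question's dataframe with columns
# id, points, isAvailable, frequency, Score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

X = df.drop(columns=['id', 'Score'])              # drop the identifier and the label
X['isAvailable'] = X['isAvailable'].astype(int)   # booleans become 0/1
y = df['Score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

# feature_importances_ gives a rough measure of how much each feature drives the prediction
for name, importance in zip(X.columns, rf.feature_importances_):
    print(name, importance)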


How to build a custom scaler based on StandardScaler?

I am trying to build a custom scaler to scale only the continuous variables on a dataset (the US Adult Income: https://www.kaggle.com/uciml/adult-census-income), using StandardScaler as a base.
Here is my Python code that I used:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        self.scaler = StandardScaler(copy, with_mean, with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        self.mean_ = np.mean(X[self.columns])
        self.var_ = np.var(X[self.columns])
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
X=new_df_upsampled.copy()
X.drop('income',axis=1,inplace=True)
continuous = df.iloc[:, np.r_[0,2,10:13]]
#basically independent variables that I consider continuous
columns_to_scale = continuous
scaler = CustomScaler(columns_to_scale)
scaler.fit(X)
However when I tried to run the scaler, I met this problem:
So what is the error that I have on building the scaler? And furthermore, how could you build a custom scaler for this dataset?
Thank you!
There is no need to create a custom transformer for this problem, as this operation can be performed with ColumnTransformer. This transformer allows different columns of the input to be transformed separately.
The example below scales the columns ['A', 'B'] while leaving column C unchanged.
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10),
                   'C': np.arange(10)})
transformer = make_column_transformer(
    (StandardScaler(), ['A', 'B']),
    remainder='passthrough'
)
pd.DataFrame(transformer.fit_transform(df), columns=df.columns)
This outputs the following result:
A B C
0 -1.566699 -1.566699 0.0
1 -1.218544 -1.218544 1.0
2 -0.870388 -0.870388 2.0
3 -0.522233 -0.522233 3.0
4 -0.174078 -0.174078 4.0
5 0.174078 0.174078 5.0
6 0.522233 0.522233 6.0
7 0.870388 0.870388 7.0
8 1.218544 1.218544 8.0
9 1.566699 1.566699 9.0
I agree with @AntoineDubuis that ColumnTransformer is a better (built-in!) way to do this. That said, I'd like to address where your code goes wrong.
In fit, you have self.scaler.fit(X[self.columns], y); this indicates that self.columns should be a list of column names (or a few other options). But you've defined the parameter as continuous = df.iloc[:, np.r_[0,2,10:13]], which is a dataframe.
A couple other issues:
you should only set attributes in __init__ that come from its signature, or cloning will fail. Move the creation of self.scaler into fit, and store its parameters (copy, etc.) directly in __init__. Don't initialize mean_ or var_ there either.
you never actually use mean_ or var_. You can keep them if you want, but the relevant statistics are stored in the scaler object.
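Putting those fixes together, a rough sketch of what the corrected transformer could look like (the column names shown at the bottom are only illustrative):
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # store only the constructor arguments, so cloning works
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

    def fit(self, X, y=None):
        # build and fit the inner scaler here instead of in __init__
        self.scaler_ = StandardScaler(copy=self.copy, with_mean=self.with_mean, with_std=self.with_std)
        self.scaler_.fit(X[self.columns], y)
        return self

    def transform(self, X, y=None):
        init_col_order = X.columns
        X_scaled = pd.DataFrame(self.scaler_.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

# columns must be a list of column names, e.g. (illustrative names):
# scaler = CustomScaler(columns=['age', 'education.num', 'hours.per.week'])
# scaler.fit(X)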

Label encode then impute missing then inverse encoding

I have a data set on police killings that you can find on Kaggle. There's some missing data in several columns:
UID 0.000000
Name 0.000000
Age 0.018653
Gender 0.000640
Race 0.317429
Date 0.000000
City 0.000320
State 0.000000
Manner_of_death 0.000000
Armed 0.454487
Mental_illness 0.000000
Flee 0.000000
dtype: float64
I created a copy of the original df to encode it and then impute missing values. My plan was:
Label encode all categorical columns:
Index(['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed',
'Mental_illness', 'Flee'],
dtype='object')
le = LabelEncoder()
lpf = {}
for col in lepf.columns:
    lpf[col] = le.fit_transform(lepf[col])
lpfdf = pd.DataFrame(lpf)
Now I have my dataframe with all categories encoded.
Then, I located those nan values in the original dataframe (pf), to substitute those encoded nan's in lpfdf:
for col in lpfdf:
    print(col, "\n", len(np.where(pf[col].to_frame().isna())[0]))
Gender 8
Race 3965
City 4
State 0
Manner_of_death 0
Armed 5677
Mental_illness 0
Flee 0
For instance, Gender got three encoded labels: 0 for Male, 1 for Female, and 2 for nan. However, the feature City had >3000 values, and it was not possible to locate it using value_counts(). For that reason, I used:
np.where(pf["City"].to_frame().isna())
Which yielded:
(array([ 4110,  9093, 10355, 10549], dtype=int64), array([0, 0, 0, 0], dtype=int64))
Looking to any of these rows corresponding to the indices, I saw that the nan label for City was 3327:
lpfdf.iloc[10549]
Gender 1
Race 6
City 3327
State 10
Manner_of_death 1
Armed 20
Mental_illness 0
Flee 0
Name: 10549, dtype: int64
Then I proceeded to substitute these labels with np.nan:
"""
Gender: 2,
Race: 6,
City: 3327,
Armed: 59
"""
lpfdf["Gender"] = lpfdf["Gender"].replace(2, np.nan)
lpfdf["Race"] = lpfdf["Race"].replace(6, np.nan)
lpfdf["City"] = lpfdf["City"].replace(3327, np.nan)
lpfdf["Armed"] = lpfdf["Armed"].replace(59, np.nan)
Create the instance of iterative imputer and then fit and transform lpfdf:
itimp = IterativeImputer()
iilpf = itimp.fit_transform(lpfdf)
Then make a dataframe for these new imputed values:
itimplpf = pd.DataFrame(np.round(iilpf), columns = lepf.columns)
And finally, when I inverse transform to see the corresponding labels it imputed, I get the following error:
for col in lpfdf:
    le.inverse_transform(itimplpf[col].astype(int))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-191-fbdde4bb4781> in <module>
1 for col in lpfdf:
----> 2 le.inverse_transform(itimplpf[col].astype(int))
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in inverse_transform(self, y)
158 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
159 if len(diff):
--> 160 raise ValueError(
161 "y contains previously unseen labels: %s" % str(diff))
162 y = np.asarray(y)
ValueError: y contains previously unseen labels: [2 3 4 5]
What is wrong with my steps?
Sorry for my long-winded explanation but I felt that I need to explain all the steps so that you can understand the issue properly. Thank you all.
A possibility that might be worth exploring is predicting missing categorical (encoded) values using a machine learning algorithm e.g. sklearn.ensemble.RandomForestClassifier.
Here, you would train a multiclass classification model for predicting missing values of each of your columns. You'd start by replacing missing values with a magic value (e.g -99), and then one-hot encode them. Next, train a classification model to predict the categorical value of a chosen column, using the one-hot encoded values of the other columns as training data. The training data would, of course, exclude rows where the column to be predicted is missing. Finally, compose a "test" set made from the rows where this column is missing, predict the values, and impute these values into the column. Repeat this for each column that needs to have missing values imputed.
Assuming you want to apply machine learning techniques to this data at a later point, a deeper question is whether the absence of values in some examples of the dataset may in fact carry useful information for predicting your Target, and consequently, whether a particular imputation strategy could corrupt that information.
Edit: Below is an example of what I mean, using dummy data.
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
#from catboost import CatBoostClassifier

# create some fake data
n_samples = 1000
n_features = 20
features_og, _ = make_classification(n_samples=n_samples, n_features=n_features,
                                      n_informative=3, n_repeated=16, n_redundant=0)

# convert to fake categorical data
features_og = (features_og*10).astype(int)

# add missing value flag (-99) at random
features = features_og.copy()
for i in range(n_samples):
    for j in range(n_features):
        if np.random.random() > 0.85:
            features[i, j] = -99

# go column by column predicting and replacing missing values
features_fixed = features.copy()
for j in range(n_features):
    # do train test split based on whether the selected column value is -99.
    train = features[np.where(features[:, j] != -99)]
    test = features[np.where(features[:, j] == -99)]

    clf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=42)
    # potentially better for categorical features is CatBoost:
    #clf = CatBoostClassifier(n_estimators=300, cat_features=[identify categorical features here])

    # train the classifier to predict the value of column j using the other columns
    clf.fit(train[:, [x for x in range(n_features) if x != j]], train[:, j])
    # predict values for elements of column j that have the missing flag
    preds = clf.predict(test[:, [x for x in range(n_features) if x != j]])
    # substitute the missing values in column j with the predicted values
    features_fixed[np.where(features[:, j] == -99.), j] = preds
Your approach of encoding the categorical values first and then imputing the missing values is prone to problems and thus not recommended.
Some imputation strategies, like IterativeImputer, do not guarantee that the output contains only previously known numeric values. This can result in imputed values which are unknown to the encoder and will cause an error upon the inverse transformation (which is exactly your case).
It is better to first impute the missing values for both numeric and categorical features, and then encode the categorical features. One option would be to use SimpleImputer and replace the missing values with the most frequent category or a new constant value.
Also, a note on LabelEncoder: it is clearly mentioned in its documentation that:
This transformer should be used to encode target values, i.e. y, and not the input X.
If you insist on an encoding strategy like LabelEncoder, you can use OrdinalEncoder which does the same but is actually meant for feature encoding. However, you should be aware that such an encoding strategy might falsely suggest an ordinal relationship between each category of a feature, which might lead to undesired consequences. You should therefore consider other encoding strategies as well.
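As a minimal sketch of that impute-then-encode order (reusing the question's column names and the original dataframe pf; SimpleImputer with the most frequent category is just one possible choice):
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed', 'Mental_illness', 'Flee']

# 1. impute the categorical columns first, e.g. with the most frequent category
imp = SimpleImputer(strategy='most_frequent')
pf[cat_cols] = imp.fit_transform(pf[cat_cols])

# 2. then encode; OrdinalEncoder works on feature columns and supports inverse_transform
enc = OrdinalEncoder()
pf[cat_cols] = enc.fit_transform(pf[cat_cols])

# ... modelling ...

# the encoding can always be reversed, because every value was produced by the encoder
pf[cat_cols] = enc.inverse_transform(pf[cat_cols])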
The entire process can be automated with the datawig package. You just need to create an imputation model for each to-be-imputed column and it will handle the encoding and inverse encoding by itself.
It has even been tested against kNN and the iterative imputer and showed better results.
Here is a personal guide.
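For reference, a rough sketch of what the datawig approach could look like, assuming datawig's SimpleImputer API and the question's pf dataframe (the chosen input columns are only illustrative):
import datawig

# one imputation model per column to be imputed, e.g. for Race
imputer = datawig.SimpleImputer(
    input_columns=['Age', 'Gender', 'City', 'State', 'Armed'],  # columns used to predict the missing values
    output_column='Race',                                       # column to impute
    output_path='imputer_model'                                 # where model state is stored
)

# train on the rows where Race is known, then predict the missing ones
imputer.fit(train_df=pf[pf['Race'].notna()], num_epochs=10)
predictions = imputer.predict(pf[pf['Race'].isna()])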

Feature selection and categorical variables

I work on a dataset which contains mainly binary variables. However, two of them are categorical with multiple values (strings). I want to apply feature selection using lasso, but I get an error: KeyError: could not convert string to float.
Should I use LabelEncoder and then do the feature selection? Any ideas how to deal with this?
Here is my code
X = data.iloc[:,:-1]
y = data.iloc[:,-1]
scaler = MinMaxScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
selector = SelectFromModel(estimator=LassoCV(cv=5)).fit(X_scaled, y)
selector.get_support()
It is problematic to use one-hot encoding because each category will be coded as a separate binary column, and feeding these into lasso doesn't allow selection of the categorical variable as a whole, which is what you are after, I guess. You can also check out this post.
You can use the group lasso implementation in python. Below I use an example dataset:
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from group_lasso import GroupLasso
from group_lasso.utils import extract_ohe_groups
import scipy.sparse
data = pd.DataFrame({'cat1': np.random.choice(['A','B','C'], 100),
                     'cat2': np.random.choice(['D','E','F'], 100),
                     'bin1': np.random.choice([0,1], 100),
                     'bin2': np.random.choice([0,1], 100)})
data['y'] = 1.5*data['bin1'] + -3*data['bin2'] + 2*(data['cat1'] == 'A').astype('int') + np.random.normal(0, 1, 100)
Define the categorical and numeric (binary) columns. You don't need the min-max scaler since your values are binary. Next we one-hot encode the categorical columns and extract the groups:
cat_columns = ['cat1','cat2']
num_columns = ['bin1','bin2']
ohe = OneHotEncoder()
onehot_data = ohe.fit_transform(data[cat_columns])
groups = extract_ohe_groups(ohe)
Put the numeric and one-hot columns together. You can also convert them to dense, but that can be problematic if the data is huge:
X = scipy.sparse.hstack([onehot_data,scipy.sparse.csr_matrix(data[num_columns])])
y = data['y']
Likewise, construct the groups:
groups = np.hstack([groups,len(cat_columns) + np.arange(len(num_columns))+1])
groups
Run the group lasso:
grpLasso = GroupLasso(groups=groups, supress_warning=True, n_iter=1000)
grpLasso.fit(X, y)
grpLasso.sparsity_mask_
array([ True, True, True, False, False, False, True, True])
grpLasso.chosen_groups_
{0, 3, 4}
Check out also the help page for using it in a pipeline.
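As a rough sketch of that pipeline pattern (following the variable-selection example in the group-lasso documentation; the Ridge regressor at the end is only an illustrative choice), reusing X, y and groups from above:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from group_lasso import GroupLasso

# GroupLasso acts as the variable-selection step, Ridge fits on the selected features
pipe = Pipeline([
    ('variable_selection', GroupLasso(groups=groups, group_reg=0.05, supress_warning=True, n_iter=1000)),
    ('regressor', Ridge()),
])
pipe.fit(X, y)
print(pipe.score(X, y))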

Resampling (bootstrap) a data set of continuous data for a regression problem

For a regression problem, I have a training data set with:
- 3 variables with a Gaussian distribution
- 20 variables with a uniform distribution.
All my variables are continuous, in [0; 1].
The problem is that the test data, used to score my regression model, has a uniform distribution for all the variables.
Actually, I have bad results at the tails of the distribution, so I want to oversample my training set in order to duplicate the rarest rows.
So my idea is to bootstrap (using sampling with replacement) my training set in order to have a set of data with the same distribution as the test set.
To do that, my idea (don't know if it's a good one!) is to add 3 columns with intervals for my 3 variables and use these columns to stratify the resampling.
Example :
First, generating the data
from scipy.stats import truncnorm
def get_truncated_normal(mean=0.5, sd=0.15, min_value=0, max_value=1):
    return truncnorm(
        (min_value - mean) / sd, (max_value - mean) / sd, loc=mean, scale=sd)
generator = get_truncated_normal()
import numpy as np
from sklearn.preprocessing import MinMaxScaler
S1 = generator.rvs(1000)
S2 = generator.rvs(1000)
S3 = generator.rvs(1000)
u = np.random.uniform(0, 1, 1000)
Then check the distribution:
import seaborn as sns
sns.distplot(u);
sns.distplot(S2);
It's OK, so I'll add category columns:
import pandas as pd
df = pd.DataFrame({'S1':S1,'S2':S2,'S3':S3,'Unif':u})
BINS_NUMBER = 10
df['S1_range'] = pd.cut(df.S1, bins=BINS_NUMBER, precision=6, right=True, include_lowest=True)
df['S2_range'] = pd.cut(df.S2, bins=BINS_NUMBER, precision=6, right=True, include_lowest=True)
df['S3_range'] = pd.cut(df.S3, bins=BINS_NUMBER, precision=6, right=True, include_lowest=True)
a check
df.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709] 3
(0.116709, 0.210454] 15
(0.210454, 0.304199] 64
(0.304199, 0.397944] 152
(0.397944, 0.491689] 254
(0.491689, 0.585434] 217
(0.585434, 0.679179] 173
(0.679179, 0.772924] 86
(0.772924, 0.866669] 30
(0.866669, 0.960414] 6
dtype: int64
It's good for me.
So now I'll try to resample but it's not working as intended
from sklearn.utils import resample
df_resampled = resample(df,replace=True,n_samples=1000, stratify=df['S1_range'])
df_resampled.groupby('S1_range').size()
S1_range
(0.022025899999999998, 0.116709] 3
(0.116709, 0.210454] 15
(0.210454, 0.304199] 64
(0.304199, 0.397944] 152
(0.397944, 0.491689] 254
(0.491689, 0.585434] 217
(0.585434, 0.679179] 173
(0.679179, 0.772924] 86
(0.772924, 0.866669] 30
(0.866669, 0.960414] 6
dtype: int64
So it's not working; I get the same distribution in the output as in the input...
Can you help me?
Perhaps it's not the right way to do this?
Thanks!!
Rather than writing code from scratch to resample your continuous data, you should take advantage of a library for resampling regression data.
Whereas the popular libraries (imbalanced-learn, etc.) focus on classification (categorical) variables, there is a recent Python library (called resreg - RESampling for REGression) that allows you to resample your continuous data (resreg GitHub page).
Also, rather than bootstrapping, you may want to generate synthetic data points at the tail ends of your normally distributed variables, as doing this will likely lead to much better results (see this paper). Similar to SMOTE for classification, which interpolates between features, you can use SMOTER (SMOTE for regression) in the resreg package to generate synthetic values in regression/continuous data.
Here is an example of how you would use resreg to achieve resampling with a few lines of code:
import numpy as np
import resreg
cl = np.percentile(y, 10)  # Oversample values less than the 10th percentile
ch = np.percentile(y, 90)  # Oversample values greater than the 90th percentile

# Assign relevance scores to indicate which samples in your dataset are
# to be resampled. Values below cl and above ch are assigned a relevance
# value above 0.5; other values are assigned a relevance value below 0.5
relevance = resreg.sigmoid_relevance(X, y, cl=cl, ch=ch)

# Resample the relevant values (i.e. relevance >= 0.5) by interpolating
# between nearest k-neighbors (k=5). By setting over='balance', the
# relevant values are oversampled so that the number of relevant and
# irrelevant values are equal
X_res, y_res = resreg.smoter(X, y, relevance=relevance, relevance_threshold=0.5, k=5, over='balance', random_state=0)
My solution:
# imports assumed by this snippet; load_data_for_train() is the author's own data-loading helper
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

def create_sampled_data_set(n_samples_by_bin=1000,
                            n_bins=10,
                            replace=True,
                            save_csv=True):
    """In order to have the same distribution for S1..S3 between the training
    set and the test set, this function generates a new, resampled training set.
    Return: (X_train, y_train)
    """
    def stratified_sample_df_(df, col, n_samples, replace=True):
        if replace:
            n = n_samples
        else:
            n = min(n_samples, df[col].value_counts().min())
        df_ = df.groupby(col).apply(lambda x: x.sample(n, replace=replace))
        df_.index = df_.index.droplevel(0)
        return df_

    X_train, y_train = load_data_for_train()

    # merge the dataframes for the sampling. Target will be removed after
    X_train = pd.merge(
        X_train, y_train[['Target']], left_index=True, right_index=True)
    del y_train

    # build a categorical feature from the S1..S3 distribution
    disc = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='kmeans')
    disc.fit(X_train[['S1', 'S2', 'S3']])
    y_bin = disc.transform(X_train[['S1', 'S2', 'S3']])
    del disc

    vint = np.vectorize(int)
    y_bin = vint(y_bin)

    y_concat = []
    for i in range(len(y_bin)):
        a = y_bin[i, 0].astype('str')
        b = y_bin[i, 1].astype('str')
        c = y_bin[i, 2].astype('str')
        y_concat.append(a + ';' + b + ';' + c)
    del y_bin

    X_train['S_Class'] = y_concat
    del y_concat

    X_train_resampled = stratified_sample_df_(
        X_train, 'S_Class', n_samples_by_bin)
    del X_train

    y_train_resampled = X_train_resampled[['Target']].copy()
    y_train_resampled.rename(
        columns={y_train_resampled.columns[0]: 'Target'}, inplace=True)

    X_train_resampled = X_train_resampled.drop(['S_Class', 'Target'], axis=1)

    # save to file for further usage
    if save_csv:
        X_train_resampled.to_csv(
            "./data/training_input_resampled.csv", sep=",")
        y_train_resampled.to_csv(
            "./data/training_output_resampled.csv", sep=",")

    return (X_train_resampled,
            y_train_resampled)

How to give column names after one-hot encoding with sklearn?

Here is my question; I hope someone can help me figure it out.
To explain: there are more than 10 categorical columns in my data set and each of them has 200-300 categories. I want to convert them into binary values. For that, I first used LabelEncoder to convert the string categories into numbers. The LabelEncoder code and its output are shown below.
After LabelEncoder, I used OneHotEncoder from scikit-learn and it worked. But the problem is, I need the column names after one-hot encoding. For example, column A with categorical values before encoding: A = [1,2,3,4,..]
It should be like this after encoding:
A-1, A-2, A-3
Does anyone know how to assign column names (old column name - value name or number) after one-hot encoding? Here is my one-hot encoding and its output:
I need the column names because I trained an ANN, but every time new data comes in I cannot convert all the past data again and again. So, I want to add just the new ones every time. Thanks anyway.
You can get the column names using the .get_feature_names() method:
>>> ohenc.get_feature_names()
>>> x_cat_df.columns = ohenc.get_feature_names()
Detailed example is here.
Update
From version 1.0, use get_feature_names_out().
This example could help for future readers:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
train_X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})
>>>
Sex AgeGroup
0 male 0
1 female 15
2 male 30
3 female 45
4 male 60
5 female 75
encoder = OneHotEncoder(sparse=False)
train_X_encoded = pd.DataFrame(encoder.fit_transform(train_X[['Sex']]))
train_X_encoded.columns = encoder.get_feature_names(['Sex'])
train_X.drop(['Sex'], axis=1, inplace=True)
OH_X_train = pd.concat([train_X, train_X_encoded], axis=1)
>>>
AgeGroup Sex_female Sex_male
0 0 0.0 1.0
1 15 1.0 0.0
2 30 0.0 1.0
3 45 1.0 0.0
4 60 0.0 1.0
5 75 1.0 0.0
Hey, I had the same problem, whereby I had a custom estimator which extended the BaseEstimator class from sklearn.base.
I added an instance attribute in __init__ called self.feature_names, then as a last step in the transform method just updated self.feature_names with the columns from the result.
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class CustomOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, **kwargs):
        self.feature_names = []

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        result = pd.get_dummies(X)
        self.feature_names = result.columns
        return result
A bit basic I know but it does the job I need it to.
If you want to retrieve the column names for the feature importances from your sklearn pipeline you can get the features from the classifier step and the column names from the one hot encoding step.
a = model.best_estimator_.named_steps["clf"].feature_importances_
b = model.best_estimator_.named_steps["ohc"].feature_names
df = pd.DataFrame(a,b)
df.sort_values(by=[0], ascending=False).head(20)
There is another easy way with the package category_encoders. This encoder also plays nicely in a pipeline, which is one of the data science best practices.
import pandas as pd
from category_encoders.one_hot import OneHotEncoder
X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})
ohe = OneHotEncoder(use_cat_names=True)
ohe.fit_transform(X)
Update: based on the answer by @Venkatachalam, the method get_feature_names() has been deprecated in scikit-learn 1.0. You will get a warning when trying to run it. Instead, use get_feature_names_out():
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohenc = OneHotEncoder(sparse=False)
x_cat_df = pd.DataFrame(ohenc.fit_transform(xtrain_lbl))
x_cat_df.columns = ohenc.get_feature_names_out(input_features=xtrain_lbl.columns)
Setting the parameter sparse=False in OneHotEncoder() will return an array instead of a sparse matrix, so you don't need to convert it later. fit_transform() will calculate the parameters and transform the training set in one line.
Source: OneHotEncoder documentation
