I am using hash encoding on a categorical column with 13 distinct values. One-hot and dummy encoding would give us 13 and 12 columns respectively. With hash encoding, however, the default value of n_features is 2**20, which creates over 1,000,000 columns.
How does one choose the value of n_features? I understand that the value should be a power of 2. For example, for the IRIS dataset we could use 2 or maybe even 4 for n_features.
But what about a dataset where a column has 13 different values? What should n_features be then?
# Feature Hashing Code:
import pandas as pd, numpy as np
from sklearn.feature_extraction import FeatureHasher
df = pd.read_csv(r'C:/Users/<user_name>/Downloads/datasets/countriesoftheworld.csv')
hash_encoder = FeatureHasher(n_features = ????, alternate_sign=False, input_type='string')
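# with input_type='string', each sample must be an iterable of strings, so wrap each value in a list
features = hash_encoder.fit_transform(df['Country'].apply(lambda country: [country]))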
print(df.shape)
(227, 20)
I have not dropped the "Country" column, just to contrast between the original and encoded columns.
df = df.join(pd.DataFrame(features.toarray()).add_prefix('encoded_'))
print(df.shape)
(227, 1048596)
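One way to get a feel for n_features is to hash the distinct categories at a few candidate sizes and count how many of them end up sharing a column. The sketch below is only illustrative: the 13 category names are made-up stand-ins for the real Country values, and the candidate sizes are arbitrary powers of two.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Hypothetical stand-ins for the 13 distinct values of a categorical column.
categories = [f"category_{i}" for i in range(13)]

results = {}
for n_features in (8, 16, 32, 64):
    hasher = FeatureHasher(n_features=n_features, alternate_sign=False, input_type="string")
    # With input_type="string", every sample must be an iterable of strings,
    # so each category is wrapped in its own one-element list.
    hashed = hasher.transform([[c] for c in categories]).toarray()
    # Each row has exactly one non-zero entry: the column the category hashed to.
    distinct_columns = np.unique(hashed.argmax(axis=1)).size
    results[n_features] = (hashed.shape, distinct_columns)
    print(f"n_features={n_features}: {distinct_columns} distinct columns, "
          f"{len(categories) - distinct_columns} categories lost to collisions")
```

With fewer columns than categories, collisions are unavoidable. Larger sizes make them less likely, but hashing never guarantees zero collisions, which is the trade-off against one-hot encoding.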
I'm running an LGBM Classifier model and I'm able to use lgbm.plot_importance to plot the most important features, but I would prefer to have a list of these features instead. Does anybody know how to go about doing this?
The lightgbm.Booster object has a method .feature_importance() which can be used to access feature importances.
That method returns an array with one importance value per feature, and supports two types of importance, based on the value of importance_type:
"gain" = "cumulative gain of all splits using this feature"
"split" = "number of splits this feature was used in"
You can explore this using the following code. I ran this with lightgbm==3.3.0, numpy==1.21.0, pandas==1.2.3, and scikit-learn==0.24.1, using Python 3.8.
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data = lgb.Dataset(X, label=y)
# train model
bst = lgb.train(
    params={"objective": "binary"},
    train_set=data,
    num_boost_round=10
)
# compute importances
importance_df = (
    pd.DataFrame({
        'feature_name': bst.feature_name(),
        'importance_gain': bst.feature_importance(importance_type='gain'),
        'importance_split': bst.feature_importance(importance_type='split'),
    })
    .sort_values('importance_gain', ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
Here's an example of the output.
feature_name importance_gain importance_split
0 Column_22 1051.204456 8
1 Column_23 862.363854 10
2 Column_27 262.272097 19
3 Column_7 161.842017 13
4 Column_21 66.431762 24
This shows that, for example, feature Column_21 was used in more splits than the other top features (24), but those splits improved the model far less than the 8 splits that used Column_22.
It seems like you are using the scikit-learn API for LightGBM. This should help.
General idea:
LGBMClassifier.feature_importances_
Particular case:
model_name.feature_importances_
Full code snippet (assuming pandas dataframe was used for training):
features = train_x.columns
importances = model.feature_importances_
feature_importance = pd.DataFrame({'importance':importances,'features':features}).sort_values('importance', ascending=False).reset_index(drop=True)
feature_importance
Also you can plot importances:
lgb.plot_importance(model_name)
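If all you need is the ranked feature names as a plain Python list, a small helper can sit on top of either API. This is only a sketch: the name top_features is hypothetical, and the numbers are stand-ins copied from the example gain output shown earlier. In practice you would pass bst.feature_name() with bst.feature_importance(...), or train_x.columns with model.feature_importances_.

```python
import numpy as np
import pandas as pd

def top_features(names, importances, k=10):
    """Return the k feature names with the largest importance, highest first."""
    ranked = pd.Series(np.asarray(importances, dtype=float), index=list(names))
    return ranked.sort_values(ascending=False).head(k).index.tolist()

# Stand-in values copied from the example gain importances, deliberately unsorted.
names = ["Column_7", "Column_22", "Column_21", "Column_23", "Column_27"]
gains = [161.842017, 1051.204456, 66.431762, 862.363854, 262.272097]

top3 = top_features(names, gains, k=3)
print(top3)  # ['Column_22', 'Column_23', 'Column_27']
```

Note that with the scikit-learn wrapper, feature_importances_ follows the model's importance_type setting, which defaults to "split", so the ranking can differ from one based on "gain".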
I have a data set on police killings that you can find on Kaggle. There's some missing data in several columns:
UID 0.000000
Name 0.000000
Age 0.018653
Gender 0.000640
Race 0.317429
Date 0.000000
City 0.000320
State 0.000000
Manner_of_death 0.000000
Armed 0.454487
Mental_illness 0.000000
Flee 0.000000
dtype: float64
I created a copy of the original df to encode it and then impute missing values. My plan was:
Label encode all categorical columns:
Index(['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed',
'Mental_illness', 'Flee'],
dtype='object')
le = LabelEncoder()
lpf = {}
for col in lepf.columns:
    lpf[col] = le.fit_transform(lepf[col])
lpfdf = pd.DataFrame(lpf)
Now I have my dataframe with all categories encoded.
Then, I located those nan values in the original dataframe (pf), to substitute those encoded nan's in lpfdf:
for col in lpfdf:
    print(col,"\n",len(np.where(pf[col].to_frame().isna())[0]))
Gender 8
Race 3965
City 4
State 0
Manner_of_death 0
Armed 5677
Mental_illness 0
Flee 0
For instance, Gender got three encoded labels: 0 for Male, 1 for Female, and 2 for nan. However, the feature City had >3000 values, and it was not possible to locate it using value_counts(). For that reason, I used:
np.where(pf["City"].to_frame().isna())
Which yielded:
(array([ 4110, 9093, 10355, 10549], dtype=int64), array([0, 0, 0, 0], dtype=int64))
Looking at any of the rows at these indices, I saw that the nan label for City was 3327:
lpfdf.iloc[10549]
Gender 1
Race 6
City 3327
State 10
Manner_of_death 1
Armed 20
Mental_illness 0
Flee 0
Name: 10549, dtype: int64
Then I proceeded to replace these labels with np.nan:
"""
Gender: 2,
Race: 6,
City: 3327,
Armed: 59
"""
lpfdf["Gender"] = lpfdf["Gender"].replace(2, np.nan)
lpfdf["Race"] = lpfdf["Race"].replace(6, np.nan)
lpfdf["City"] = lpfdf["City"].replace(3327, np.nan)
lpfdf["Armed"] = lpfdf["Armed"].replace(59, np.nan)
Create the instance of iterative imputer and then fit and transform lpfdf:
itimp = IterativeImputer()
iilpf = itimp.fit_transform(lpfdf)
Then make a dataframe for these new imputed values:
itimplpf = pd.DataFrame(np.round(iilpf), columns = lepf.columns)
And finally, when I try to inverse transform to see the corresponding labels it imputed, I get the following error:
for col in lpfdf:
    le.inverse_transform(itimplpf[col].astype(int))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-191-fbdde4bb4781> in <module>
1 for col in lpfdf:
----> 2 le.inverse_transform(itimplpf[col].astype(int))
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in inverse_transform(self, y)
158 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
159 if len(diff):
--> 160 raise ValueError(
161 "y contains previously unseen labels: %s" % str(diff))
162 y = np.asarray(y)
ValueError: y contains previously unseen labels: [2 3 4 5]
What is wrong with my steps?
Sorry for my long-winded explanation but I felt that I need to explain all the steps so that you can understand the issue properly. Thank you all.
A possibility that might be worth exploring is predicting missing categorical (encoded) values using a machine learning algorithm e.g. sklearn.ensemble.RandomForestClassifier.
Here, you would train a multiclass classification model for predicting the missing values of each of your columns. You'd start by replacing missing values with a magic value (e.g. -99), and then one-hot encode them. Next, train a classification model to predict the categorical value of a chosen column, using the one-hot encoded values of the other columns as training data. The training data would, of course, exclude rows where the column to be predicted is missing. Finally, compose a "test" set made from the rows where this column is missing, predict the values, and impute these values into the column. Repeat this for each column that needs to have missing values imputed.
Assuming you want to apply machine learning techniques to this data at a later point, a deeper question is whether the absence of values in some examples of the dataset may in fact carry useful information for predicting your Target, and consequently, whether a particular imputation strategy could corrupt that information.
Edit: Below is an example of what I mean, using dummy data.
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
#from catboost import CatBoostClassifier
# create some fake data
n_samples = 1000
n_features = 20
features_og, _ = make_classification(n_samples=n_samples, n_features=n_features,n_informative=3, n_repeated= 16, n_redundant = 0)
# convert to fake categorical data
features_og = (features_og*10).astype(int)
# add missing value flag (-99) at random
features = features_og.copy()
for i in range(n_samples):
    for j in range(n_features):
        if np.random.random() > 0.85:
            features[i,j] = -99
# go column by column predicting and replacing missing values
features_fixed = features.copy()
for j in range(n_features):
    # do train test split based on whether the selected column value is -99.
    train = features[np.where(features[:,j] != -99)]
    test = features[np.where(features[:,j] == -99)]
    clf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=42)
    # potentially better for categorical features is CatBoost:
    #clf = CatBoostClassifier(n_estimators= 300,cat_features=[identify categorical features here])
    # train the classifier to predict the value of column j using the other columns
    clf.fit(train[:,[x for x in range(n_features) if x != j]], train[:,j])
    # predict values for elements of column j that have the missing flag
    preds = clf.predict(test[:,[x for x in range(n_features) if x != j]])
    # substitute the missing values in column j with the predicted values
    features_fixed[np.where(features[:,j] == -99.),j] = preds
Your approach of encoding categorical values first and then imputing missing values is prone to problems and thus not recommended.
Some imputing strategies, like IterativeImputer, do not guarantee that the output contains only previously known numeric values. This can result in imputed values that are unknown to the encoder and will cause an error upon the inverse transformation (which is exactly your case).
It is better to first impute the missing values for both numeric and categorical features, and then encode the categorical features. One option would be to use SimpleImputer and replace missing values with the most frequent category or a new constant value.
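As a rough illustration of the impute-then-encode order, here is a minimal sketch on toy data. The column values are made up, and the most-frequent strategy is just one of the two options mentioned; strategy="constant" with a fill_value would work the same way.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the categorical columns; the values are made up.
raw = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male", "Male"],
    "Race": ["White", np.nan, "Black", "White", np.nan],
}, dtype=object)

# Step 1: impute on the raw strings, so only already-known categories can appear.
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(imputer.fit_transform(raw), columns=raw.columns)

# Step 2: encode afterwards; the encoder has seen every value it will have to invert.
encoder = OrdinalEncoder()
encoded = encoder.fit_transform(imputed)

# The round trip back to labels cannot hit an unseen code.
decoded = pd.DataFrame(encoder.inverse_transform(encoded), columns=raw.columns)
print(decoded)
```

Because the encoder is fitted after imputation, every code it produces maps back to a real category, so the "previously unseen labels" error cannot occur on this path.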
Also, a note on LabelEncoder: it is clearly mentioned in its documentation that:
This transformer should be used to encode target values, i.e. y, and not the input X.
If you insist on an encoding strategy like LabelEncoder, you can use OrdinalEncoder which does the same but is actually meant for feature encoding. However, you should be aware that such an encoding strategy might falsely suggest an ordinal relationship between each category of a feature, which might lead to undesired consequences. You should therefore consider other encoding strategies as well.
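A practical difference worth seeing in code: OrdinalEncoder keeps one list of categories per column, whereas a single LabelEncoder that is refit inside a loop only remembers the last column it was fitted on. The sketch below uses made-up values. It may also be part of why the traceback in the question reports [2 3 4 5] as unseen, although that is a guess based on the posted loop.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up example values for two categorical columns.
frame = pd.DataFrame({
    "Race": ["Asian", "Black", "Hispanic", "White"],
    "Flee": ["No", "Yes", "No", "Yes"],
})

# OrdinalEncoder fits all columns at once and stores one category array per column.
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(frame)
restored = ordinal.inverse_transform(codes)
print(ordinal.categories_)

# A single LabelEncoder refit in a loop only keeps the classes of the last column.
le = LabelEncoder()
for col in frame.columns:
    le.fit(frame[col])
print(le.classes_)  # classes of "Flee" only
```

If LabelEncoder is kept anyway, storing one fitted encoder per column (for example in a dict keyed by column name) avoids this.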
The entire process can be automated with the datawig package. You just need to create an imputation model for each to-be-imputed column, and it will handle the encoding and inverse encoding by itself.
It was even tested against kNN and iterative imputer and showed better results.
Here is a personal guide.
I have a dataframe that looks like this (it is obviously much bigger):
id points isAvailable frequency Score
abc1 325 True 93.0 0.01
def2 467 False 80.1 0.59
ghi3 122 True 90.3 1
jkl4 546 True 84.0 0
mno5 355 False 93.5 0.99
I want to see how much the features points, isAvailable and frequency influence the Score. I want to use Random Forests:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})
X = df
y = df['Score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
I get the following error: ValueError: could not convert string to float: 'abc1'
Questions:
How can I pre-process the data? What happens to the boolean variables?
Is it wrong to even include the id column in X?
I was thinking of using something like df = df.astype({"a": int, "b": complex}) but I don't really know how in this case and I read that there are special algorithms for encoding.
First, you have to remove the Score column from the X dataset: it is the label of your data, so it should not be used as a feature.
Second, assuming that the id column is an identifier for your data, you should remove it from X. It is as if you were trying to analyze a dataset of the weights of a group of people: you would remove their names because there is no correlation between their names and their weight.
Last, to deal with the boolean variables, there are some encoding methods, like you said (for example this one), but since the value can only be 0 or 1, it should be fine to convert False to 0 and True to 1.
You can do it with this code (assuming df is the name of your DataFrame):
df['isAvailable'] = (df['isAvailable'] == True).astype(int)
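Putting the three steps together, a minimal sketch could look like the following. It rebuilds only the five sample rows from the question, so it skips train_test_split, and the resulting importances are not meaningful on so little data; the random_state value is arbitrary.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# The five sample rows from the question; the real frame is much larger.
df = pd.DataFrame({
    "id": ["abc1", "def2", "ghi3", "jkl4", "mno5"],
    "points": [325, 467, 122, 546, 355],
    "isAvailable": [True, False, True, True, False],
    "frequency": [93.0, 80.1, 90.3, 84.0, 93.5],
    "Score": [0.01, 0.59, 1.0, 0.0, 0.99],
})

# Drop the identifier and the label from the features, and turn booleans into 0/1.
X = df.drop(columns=["id", "Score"])
X["isAvailable"] = X["isAvailable"].astype(int)
y = df["Score"]

rf = RandomForestRegressor(n_estimators=100, random_state=12)
rf.fit(X, y)

# One importance value per feature column, highest first.
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

The importances Series is one way to see how much each of points, isAvailable and frequency contributes to the fitted forest.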
I'm new to Python and sklearn, so apologies in advance. I have two transformers and I would like to gather the results in a FeatureUnion (for a final modelling step at the end). This should be quite simple, but FeatureUnion is stacking the outputs rather than providing an nx2 array or DataFrame. In the example below I will generate some data that is 10 rows by 2 columns. This will then generate two features that are 10 rows by 1 column each. I would like the final feature union to have 10 rows and 2 columns, but what I get is a single array of length 20.
I will try to demonstrate with my example below:
some imports
import numpy as np
import pandas as pd
from sklearn import pipeline
from sklearn.base import TransformerMixin
some random data
df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])
a custom transformer that selects a column
class Trans(TransformerMixin):
    def __init__(self, col_name):
        self.col_name = col_name
    def fit(self, X):
        return self
    def transform(self, X):
        return X[self.col_name]
a pipeline that uses the transformer twice (in my real case I have two different transformers but this reproduces the problem)
pipe = pipeline.FeatureUnion([
('select_a', Trans('a')),
('select_b', Trans('b'))
])
now I use the pipeline, but it returns an array of twice the length
pipe.fit_transform(df).shape
(20,)
however I would like an array with dimensions (10, 2).
Quick fix?
The transformers in the FeatureUnion need to return 2-dimensional matrices, however in your code by selecting a column, you are returning a 1-dimensional vector. You could fix this by selecting the column with X[[self.col_name]].
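A minimal sketch of that fix is below. The class name ColumnSelector is hypothetical, and inheriting from BaseEstimator and accepting y=None in fit are additions beyond the question's code, made so the transformer also plays well with other scikit-learn tooling.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

class ColumnSelector(TransformerMixin, BaseEstimator):
    """Select one DataFrame column and return it as a 2-D, single-column frame."""

    def __init__(self, col_name):
        self.col_name = col_name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Double brackets keep the result 2-D: shape (n_rows, 1), not (n_rows,).
        return X[[self.col_name]]

df = pd.DataFrame(np.random.rand(10, 2), columns=["a", "b"])
union = FeatureUnion([
    ("select_a", ColumnSelector("a")),
    ("select_b", ColumnSelector("b")),
])
combined = union.fit_transform(df)
print(combined.shape)  # (10, 2)
```

FeatureUnion concatenates the transformer outputs horizontally, so two (10, 1) blocks become one (10, 2) array, while two 1-D vectors of length 10 are joined end to end into length 20.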