High-cardinality categorical features into numerics - python

In most academic examples, we convert categorical features using get_dummies or OneHotEncoder. Let's say I want to use Country as a feature and the dataset has 100 unique countries. When we apply get_dummies on country we get 100 columns, and the model is trained with those 100 country columns plus the other features.
Now say we have deployed this model into production and receive data containing only 10 countries. When we pre-process that data using get_dummies, the model will fail to predict with an error like "Number of features the model was trained on does not match the features passed", because we are passing only 10 country columns plus the other features.
I came across the article below, where a score can be calculated using a supervised ratio or weight of evidence. But how do we calculate that score when we want to predict the target in production, i.e. which number should be assigned to each country?
https://www.kdnuggets.com/2016/08/include-high-cardinality-attributes-predictive-model.html
Can you please help me understand how to handle such scenarios?

There are two things you can do.
Apply OHE after combining your training set and test/validation set data, not before.
Skip OHE and apply StandardScaler, because "if a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected."
I usually try the second option when I have many unique values in a categorical feature, which can cause a mismatch with my test/validation set.
Feel free to correct me.
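As a minimal sketch of option 1, assuming `train_df` holds the training data and `new_df` the data arriving later (both names, and the 'country' column, are illustrative):

    import pandas as pd

    # Encode train and new data together so the dummy columns line up.
    combined = pd.concat([train_df, new_df], keys=["train", "new"])
    combined = pd.get_dummies(combined, columns=["country"])

    train_enc = combined.xs("train")  # all country columns seen anywhere
    new_enc = combined.xs("new")      # same columns, in the same order

In a real deployment you would more likely fit an encoder such as sklearn's OneHotEncoder(handle_unknown='ignore') on the training data once and persist it, so the production pipeline always produces the same columns without needing the training data at hand.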

Related

SHAP for a single data point, instead of average prediction of entire dataset

I am trying to explain a regression model based on LightGBM using SHAP. I'm using the
shap.TreeExplainer(<lightgbm model>).shap_values(X)
method to get the SHAP values, where X is the entire training dataset. These SHAP values give me a comparison of an individual prediction against the average prediction of the entire dataset.
In the online book by Christopher Molnar, section 5.9.4, he mentions that:
"Instead of comparing a prediction to the average prediction of the entire dataset, you could compare it to a subset or even to a single data point."
I have a couple of questions regarding this:
Am I correct to interpret that if, instead of passing the entire training dataset, I pass a subset of, say, 20 observations, then the SHAP values returned will be relative to the average of these 20 observations? This would be the equivalent of the "subset" that Christopher Molnar mentions in his book.
Assuming the answer to question 1 is yes: what if, instead of generating SHAP values relative to the average of 20 observations, I want to generate SHAP values relative to one specific observation? Christopher Molnar seems to imply that is possible. If it is, how do I do that?
Thank you in advance for the guidance!
Yes, but the definition of "average" is important. If you supply a "background" dataset, your explanations will be calculated against this background, not against the whole dataset. As for "relative to the average" of the background, one needs to understand that SHAP values are average marginal contributions over all possible coalitions. As far as SHAP values are concerned, you fix a coalition, and the rest is, yes, averaged. This allows fitting the model once and then passing different coalitions (with the rest averaged out) through the model that was trained only once; this is where SHAP's time savings come from.
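For reference, this is the classical Shapley value formula that SHAP builds on, where F is the set of all features and v(S) is the expected model output when only the features in coalition S are present:

$$\phi_i \;=\; \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr)$$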
If you're interested in more, you may visit the original paper or this blog.
Yes. You supply a single data row as background; for binary classification, e.g., supply a data row from the other class for the explanation and see which features, and by how much, changed the class output.
Yes. By the mathematical formulation in the original paper, SHAP values are "the contribution of a feature to the difference between the actual prediction and the average prediction". The average prediction, sometimes called the "base value" or "expected model output", is relative to the background dataset you provided.
Yes. You can use a background dataset of 1 sample. Common choices for the background dataset are the training data, one single sample as the reference, or even a dataset of all zeros. From the author: "I recommend using either a single background data point, a small random subset of the true background, or for the best performance a set of k-medians (weighted by how many training points they each represent) designed to represent the background succinctly."
Below are more details to support my answers to the two questions and to show how question 2 can be done. So, why does the "expected model output" depend on the background dataset? To answer this question, let's walk through how SHAP works:
Step 1: We create a SHAP explainer, providing two things: a trained prediction model and a background dataset. From the background dataset, SHAP creates an artificial dataset of coalitions. Each coalition is a binary vector representing a combination of features: 1 means a feature is present, and 0 absent. So there are 2^M possible coalitions for M features.
explainer = shap.KernelExplainer(f, background_X)
Step 2: We provide the sample(s) for which we want to compute SHAP values. SHAP fills in values for this artificial dataset such that present features take the original values of that sample, and absent features are filled with values from the background dataset; then the prediction is generated for each coalition. If the background dataset has n rows, the absent features are filled n times and the average of the n predictions is used as the prediction of that coalition. If the background dataset has a single sample, then the absent features are filled with the values of that sample.
shap_values = explainer.shap_values(test_X)
Therefore, the SHAP values are relative to the average prediction of the background dataset.
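To make question 2 concrete, here is a minimal sketch that uses a single row as the background, so the explanation is computed relative to that one observation rather than to a dataset average (the names `model` and `X` are illustrative, not from the original post):

    import shap

    reference = X.iloc[[0]]                           # one row as the background
    explainer = shap.KernelExplainer(model.predict, reference)
    shap_values = explainer.shap_values(X.iloc[[5]])  # explain another single row

    # explainer.expected_value is now simply model.predict(reference), so the
    # SHAP values add up to the difference between row 5's and row 0's predictions.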

Correlations of feature columns in TensorFlow

I've recently started exploring TensorFlow's feature columns for myself.
If I understood the documentation right, feature columns are just a 'frame' for further transformations applied just before the data is fed to the model. So, if I want to use them, I define some feature columns, create a DenseFeatures layer from them, and when I fit data into the model, all features go through that DenseFeatures layer, are transformed, and are then fed into the first Dense layer of my NN.
My question is: is it possible to somehow check the correlations of the transformed features with my target variable?
For example, I have a categorical feature corresponding to the day of the week (Mon/Tue/.../Sun), which, say, I encode as 1/2/.../7. Its correlation with my target will not be the same as the correlation of the categorical feature column (e.g., an indicator column), because the model doesn't understand that 7 is the maximum of the possible sequence, whereas in the categorical case it becomes a one-hot encoded feature with precise boundaries.
Let me know if anything is unclear.
I will be grateful for the help!
TensorFlow does not provide a feature_importance attribute the way scikit-learn or XGBoost do.
However, you can test the importance, or correlation, of a feature with the target in TensorFlow as follows.
1) Shuffle the values of the particular feature whose correlation with the target you want to test. That is, if your feature is fea1, the value at df['fea1'][0] becomes the value at df['fea1'][4], the value at df['fea1'][2] becomes the value at df['fea1'][3], and so on.
2) Now fit the model to the modified training data and check the accuracy on the validation data.
3) If the accuracy drops drastically, the feature was strongly related to the target; if the accuracy barely changes, the feature is not of great importance (a large drop means high importance).
You can do the same with the other features in your training data.
This can take some time and effort.
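A rough, hedged sketch of that procedure follows; the helper `build_model`, the frames `train_df`/`val_df`, and the 'target' column name are all assumptions for illustration:

    import numpy as np

    def importance_by_shuffling(build_model, train_df, val_df, feature,
                                target="target", epochs=5):
        """Fit the model on the original data and on data with `feature` shuffled,
        then return the drop in validation accuracy (bigger drop = more important)."""
        def fit_and_score(df):
            model = build_model()  # assumed to return a compiled Keras model with an accuracy metric
            model.fit(df.drop(columns=[target]), df[target], epochs=epochs, verbose=0)
            _, acc = model.evaluate(val_df.drop(columns=[target]), val_df[target], verbose=0)
            return acc

        baseline = fit_and_score(train_df)

        shuffled = train_df.copy()
        shuffled[feature] = np.random.permutation(shuffled[feature].values)
        permuted = fit_and_score(shuffled)

        return baseline - permuted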

Implementing Scikit Learn's FeatureHasher for High Cardinality Data

Background: I am working on a binary classification of health insurance claims. The data I am working with has approximately 1 million rows and a mix of numeric features and categorical features (all of which are nominal and discrete). The issue I am facing is that several of my categorical features have high cardinality, with many values that are very uncommon or unique. I have plotted below the 8 categorical features with the highest counts of unique factor levels:
Alternative to Dummy Variables: I have been reading up on feature hashing and understand that this method is a fast and space-efficient alternative for vectorizing features, and is particularly suitable for categorical data with high cardinality. I plan to use Scikit-Learn's FeatureHasher to perform feature hashing on my categorical features with more than 100 unique levels (I will create dummy variables for the remaining categorical features with fewer than 100 unique levels). Before I implement this, I have a few questions about feature hashing and how it relates to model performance in machine learning:
What is the primary advantage of using feature hashing as opposed to dummying only the most frequently occurring factor levels? I assume there is less information loss with the feature hashing approach, but I need more clarification on what advantages hashing provides in machine learning algorithms when dealing with high cardinality.
I am interested in evaluating feature importance after evaluating a few separate classification models. Is there a way to evaluate hashed features in the context of how they relate to the original categorical levels? Is there a way to reverse hashes, or does feature hashing inevitably lead to a loss of model interpretability?
Sorry for the long post and questions. Any feedback/recommendations would be much appreciated!
Feature hashing can support new categories during inference that were not seen in training. With dummy encoding, you can only encode a fixed set of previously seen categories. If you encounter a category not seen in training, you're out of luck.
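As a minimal sketch of applying FeatureHasher to one high-cardinality column (the DataFrame `df` and the column name 'provider_id' are illustrative, and 32 output columns is an arbitrary choice):

    import pandas as pd
    from sklearn.feature_extraction import FeatureHasher

    hasher = FeatureHasher(n_features=32, input_type="string")

    # Each sample must be an iterable of string tokens, so wrap each value in a
    # list; otherwise a bare string would be hashed character by character.
    hashed = hasher.transform([[str(v)] for v in df["provider_id"]])

    X_hashed = pd.DataFrame(hashed.toarray(),
                            columns=[f"provider_id_hash_{i}" for i in range(32)],
                            index=df.index)

Because the hasher is stateless, the same transform can be applied to previously unseen categories at inference time without refitting anything.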
For feature importance, there are two canonical approaches.
a) Train/evaluate your model with and without each feature to see its effect. This can be computationally expensive.
b) Train/evaluate your model with the feature and also with that feature permuted among all samples.
With feature hashing, each feature expands into multiple columns, so b) will be tricky, and I haven't found any packages that do permutation importance on feature-hashed columns.
So, I think a) is probably your best bet, considering you only have 1 million rows.
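A hedged sketch of approach a), where `feature_groups` maps each original feature to the hashed or dummy columns it produced (all names and the choice of classifier and metric are illustrative):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def drop_column_importance(X, y, feature_groups):
        """Score the model with and without each original feature's columns."""
        model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
        baseline = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
        importances = {}
        for name, cols in feature_groups.items():
            score = cross_val_score(model, X.drop(columns=cols), y,
                                    cv=3, scoring="roc_auc").mean()
            importances[name] = baseline - score  # bigger drop => more important
        return importances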
Also, for ML questions like this you'll probably get better answers on Cross Validated than on Stack Overflow.

Why does a Random Forest take longer to fit a dataframe with dummy variables?

I am taking the fastai Intro to Machine Learning course, and in Lesson 1 he uses a Random Forest on the Blue Book for Bulldozers dataset from Kaggle.
In a move that seemed curious to me, the instructor did not use pd.get_dummies() or OneHotEncoder from sklearn to handle the categorical data. Instead he called pd.Series.cat.codes on all categorical columns.
I noticed that when the fit() method was called, it ran much faster (about 1 minute) on the dataset using pd.Series.cat.codes, whereas the dataset with the dummy variables crashed a virtual server I had running with 60 GB of RAM.
The memory each dataframe occupied was about the same, roughly 54 MB. I'm curious why one dataframe is so much more performant than the other?
Is it because with a single column of integers a Random Forest only considers the average of that column as its cut point, thus making it easier to compute? Or is it something else?
To understand this better we need to look at how tree-based models work. In a tree-based algorithm, the data is split into bins based on a feature and its values. The splitting algorithm considers all possible splits and learns the most optimal one (the one that minimizes the impurity of the resulting bins).
When we consider a continuous numeric feature for a split, there are many candidate values on which a tree can split.
Dummy-encoded categorical features, by contrast, have only a few options for splitting, which results in very sparse decision trees. This gets worse for categories with just two levels.
Dummy variables are also created to keep a model from learning false ordinality. Since tree-based models work on the principle of splitting, this is not an issue, and there is no need to create dummy variables.
pd.get_dummies will add k (or k-1 if drop_first=True) columns to your DataFrame. With a very large k, the RandomForest algorithm has more choices to make when sub-selecting features, making each tree slower to train.
You could use the max_features parameter to limit the number of features used during each tree's training, but the scikit-learn implementation doesn't take into account that your dummy variables actually come from one feature, meaning it could select only a subset of the dummies from your categorical variable.
This could lead to sub-par performance of your model. I'm guessing this is why fastai uses pd.Series.cat.codes.
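For reference, a minimal sketch of the cat.codes approach from the lesson (`df` is an illustrative DataFrame with object-typed categorical columns), as opposed to pd.get_dummies(df), which would expand each of those columns into k or k-1 new ones:

    import pandas as pd

    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes  # one integer column; -1 marks missing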

xgboost feature importance of categorical variable

I am using XGBClassifier to train in Python, and there are a handful of categorical variables in my training dataset. Originally, I planned to convert each of them into a few dummies before feeding in my data, but then the feature importance will be calculated for each dummy, not for the original categorical variables. Since I also need to rank all of my original variables (numerical + categorical) by importance, I am wondering how to get the importance of my original variables? Is it simply a matter of adding them up?
You could probably get by with summing the individual categories' importances back into their original, parent category. But unless these features are high-cardinality, my two cents would be to report them individually; I tend to err on the side of being more explicit when reporting model performance/importance measures.
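If you do want to add them up, here is a hedged sketch, assuming the dummies were made with pd.get_dummies, `model` is the fitted XGBClassifier, `X` is the dummy-encoded training frame, and `cat_cols` lists the original categorical column names (all illustrative):

    import pandas as pd

    imp = pd.Series(model.feature_importances_, index=X.columns)

    def parent(col):
        # get_dummies names columns "<original>_<level>", so map each dummy back
        # to the categorical column it came from; other columns keep their name.
        for c in cat_cols:
            if col.startswith(c + "_"):
                return c
        return col

    summed = imp.groupby(imp.index.map(parent)).sum().sort_values(ascending=False)
    print(summed)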
