xgboost feature importance of categorical variable - python

I am using XGBClassifier to train a model in Python, and there are a handful of categorical variables in my training dataset. Originally, I planned to convert each of them into a few dummy variables before feeding in my data, but then the feature importance is calculated for each dummy rather than for the original categorical variable. Since I also need to rank all of my original variables (numerical + categorical) by importance, I am wondering how to get the importance of the original variables. Is it simply a matter of adding up the dummies' importances?

You could probably get by with summing the individual categories' importances back into their original parent feature. But unless these features are high-cardinality, my two cents would be to report them individually; I tend to err on the side of being more explicit when reporting model performance and importance measures.
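If you do want the sum, here is a minimal sketch, assuming the dummies came from pd.get_dummies so each column is named parent_level (grouped_importances is a made-up helper, not an XGBoost API):

import pandas as pd

def grouped_importances(model, feature_names, cat_cols):
    # Map every dummy column back to its parent categorical feature;
    # columns without a matching prefix (numeric features) map to themselves.
    imp = pd.Series(model.feature_importances_, index=feature_names)
    parents = [next((c for c in cat_cols if f.startswith(c + "_")), f)
               for f in feature_names]
    return imp.groupby(parents).sum().sort_values(ascending=False)

# Usage sketch (df, cat_cols, and "target" are placeholders):
# from xgboost import XGBClassifier
# X = pd.get_dummies(df.drop(columns="target"), columns=cat_cols)
# model = XGBClassifier().fit(X, df["target"])
# print(grouped_importances(model, list(X.columns), cat_cols))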

Related

Correlations of feature columns in TensorFlow

I've recently started exploring feature columns in TensorFlow.
If I understood the documentation right, feature columns are just a 'frame' for further transformations applied just before fitting data to the model. So, if I want to use them, I define some feature columns, create a DenseFeatures layer from them, and when I fit data into the model, all features go through that DenseFeatures layer, get transformed, and then feed into the first Dense layer of my NN.
My question is: is it possible to check the correlations of the transformed features with my target variable?
For example, I have a categorical feature corresponding to the day of the week (Mon/Tue/.../Sun), which I might encode as 1/2/.../7. Its correlation with my target will not be the same as the correlation of a categorical feature column (e.g. an indicator column), since the model doesn't understand that 7 is the maximum of the possible sequence, whereas with categories it becomes a one-hot encoded feature with precise borders.
Let me know if anything is unclear.
I would be grateful for help!
TensorFlow does not provide a feature_importance attribute the way scikit-learn's API does for XGBoost.
However, you could test the importance (or correlation) of a feature with respect to the target in TensorFlow as follows; a rough sketch is given after the steps.
1) Shuffle the values of the particular feature whose correlation with the target you want to test. That is, if your feature is, say, fea1, the value at df['fea1'][0] becomes the value at df['fea1'][4], the value at df['fea1'][2] becomes the value at df['fea1'][3], and so on.
2) Now fit the model to your modified training data and check the accuracy on validation data.
3) If your accuracy goes down drastically, your feature had a high correlation with the target; if the accuracy barely changes, the feature isn't of great importance (high error = high importance).
You can do the same with the other features in your training data.
This could take some time and effort.
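A rough sketch of that test, assuming a hypothetical build_model() helper that returns a Keras model compiled with an accuracy metric, and a pandas DataFrame df:

import numpy as np

def shuffle_importance(df, feature, target, build_model, X_val, y_val):
    # Shuffle one feature's values, retrain from scratch, and report
    # validation accuracy; compare it to a baseline fit on untouched data.
    shuffled = df.copy()
    shuffled[feature] = np.random.permutation(shuffled[feature].values)
    model = build_model()  # hypothetical helper, returns a compiled model
    model.fit(shuffled.drop(columns=[target]), shuffled[target],
              epochs=5, verbose=0)
    _, acc = model.evaluate(X_val, y_val, verbose=0)
    return acc  # a large drop versus the baseline suggests high importance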

Unexpectedly getting different standardized data with sklearn StandardScaler

I am getting different standardized values from two scalers built on the same dataset using scikit-learn's StandardScaler class.
I built a StandardScaler object with scikit-learn on a training dataset with 52 features; call it Scaler1. I then used that scaler to standardize the training dataset and learned different models on the standardized data. This led to a best model with 26 of the 52 features selected. In order to implement a predictor class that uses the model: (1) I grabbed only the columns from the original (non-standardized) dataset that correspond to the 26 selected features; then (2) I created and saved (with joblib) a new StandardScaler object by fitting it on the newly created dataset; call it Scaler2. Below is a simple outline of my implementation.
from sklearn.preprocessing import StandardScaler
import joblib

# parameters, data, and destination are defined elsewhere
scaler = StandardScaler()
scaler.set_params(**parameters)
scaler.fit(data)
joblib.dump(scaler, destination)
Contrary to my expectation, when trying to standardize the original dataset, Scaler2 gives me different values from Scaler1 for the same data points, for each of the 26 features. Is that behaviour normal? Doesn't the standardization happen independently for each column? How can I fix this issue?
Best,
Yannick
This issue was fixed. It is important to make sure the order in which the features are processed remains the same, as the fitted scaler does not appear to save the feature names.
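To make the ordering point concrete, a small illustration (columns a and b are made up); passing the columns positionally in a different order silently applies the wrong statistics:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [100.0, 200.0, 300.0]})
scaler = StandardScaler().fit(df[["a", "b"]].values)

ok = scaler.transform(df[["a", "b"]].values)   # correct
bad = scaler.transform(df[["b", "a"]].values)  # "b" is scaled with "a"'s
                                               # mean/std, and vice versa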

How to implement feature selection for categorical variables?

I'm having a problem selecting the important features. The dataset has both categorical and numerical features, and the target variable is either False or True. There are about 100 features, so I need to drop the ones that are not related to the target variable. Which methods can be used other than Random Forest feature importance? I'm using Python. In R I can use the Boruta package to select the important features, but I do not know how to do this in Python.
Selecting relevant features can be done by calculating each feature's p-value with respect to the target; see https://towardsdatascience.com/feature-selection-correlation-and-p-value-da8921bfb3cf.
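As a rough sketch of what that looks like with scikit-learn's univariate tests (X_cat, X_num, and y are placeholder DataFrames/labels; chi2 expects non-negative inputs, e.g. one-hot encoded categoricals):

from sklearn.feature_selection import chi2, f_classif

# Chi-squared test for (one-hot encoded) categorical features,
# ANOVA F-test for numeric ones; each returns (scores, p-values).
_, p_cat = chi2(X_cat, y)
_, p_num = f_classif(X_num, y)

# Keep features whose p-value falls below a chosen threshold:
keep = list(X_cat.columns[p_cat < 0.05]) + list(X_num.columns[p_num < 0.05])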

Why does a Random Forest take longer to fit a dataframe with dummy variables?

I am taking the fastai Intro to Machine Learning course, and in Lesson 1 the instructor uses a Random Forest on the Blue Book for Bulldozers dataset from Kaggle.
In a move that was curious to me, the instructor did not use pd.get_dummies() or scikit-learn's OneHotEncoder to handle categorical data; instead he called pd.Series.cat.codes on all categorical columns.
I noticed that when the fit() method was called, it computed much faster (about 1 minute) on the dataset using pd.Series.cat.codes, whereas the dataset with the dummy variables crashed a virtual server I had running with 60 GB of RAM.
The memory each dataframe occupied was about the same: 54 MB. I'm curious why one dataframe is so much more performant than the other.
Is it because with a single column of integers a Random Forest only considers the average of that column as its cut point, making it easier to compute? Or is it something else?
To understand this better we need to look at how tree-based models work. In a tree-based algorithm, the data is split into bins based on a feature and its values. The splitting algorithm considers all possible splits and learns the most optimal one (the one that minimizes the impurity of the resulting bins).
When we consider a continuous numeric feature for a split, there are many candidate values on which a tree can split.
Categorical features are at a disadvantage and have only a few options for splitting, which results in very sparse decision trees. This becomes worse for a category with just two levels, as the quick illustration below shows.
Also, dummy variables are created to keep the model from learning false ordinality. Since tree-based models work on the principle of splitting, this is not an issue, and there is no need to create dummy variables.
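A quick numerical illustration with made-up data: the thresholds a tree can try are bounded by a column's unique values, so a binary dummy offers exactly one possible split:

import numpy as np

continuous = np.random.rand(1000)        # ~1000 unique values
dummy = (continuous > 0.5).astype(int)   # 2 unique values

print(len(np.unique(continuous)) - 1)    # ~999 candidate split thresholds
print(len(np.unique(dummy)) - 1)         # exactly 1 candidate split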
pd.get_dummies will add k (or k-1 if drop_first=True) columns to your DataFrame. In the case of a very large k, the Random Forest algorithm has more choices to make when sub-selecting features, which makes each tree longer to train.
You could use the max_features parameter to limit the number of features used during each tree's training, but scikit-learn's implementation doesn't take into account that your dummy variables actually come from a single feature, meaning it could select only a subset of the dummies from your categorical variable.
This could lead to sub-par performance of your model. I'm guessing this is why fastai uses pd.Series.cat.codes.
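For reference, the two encodings side by side on a toy column (df and "state" are made up); integer codes keep a single column while get_dummies adds one column per level:

import pandas as pd

df = pd.DataFrame({"state": ["CA", "NY", "TX", "CA"]})
df["state"] = df["state"].astype("category")

codes = df["state"].cat.codes                    # one column: 0, 1, 2, 0
dummies = pd.get_dummies(df, columns=["state"])  # three indicator columns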

High cardinal Categorical features into numerics

In most academic examples, we convert categorical features using get_dummies or OneHotEncoder. Let's say I want to use Country as a feature, and the dataset has 100 unique countries. When we apply get_dummies to Country, we get 100 columns, and the model is trained with those 100 country columns plus the other features.
Now let's say we have deployed this model to production and receive only 10 countries. When we pre-process the data using get_dummies, the model will fail to predict, because "the number of features the model was trained on does not match the features passed": we are passing only 10 country columns plus the other features.
I came across the article below, in which a score is calculated using the supervised ratio or weight of evidence. But how do we calculate the score when predicting the target in production, i.e. which number should be assigned to which country?
https://www.kdnuggets.com/2016/08/include-high-cardinality-attributes-predictive-model.html
Can you please help me understand how to handle such scenarios?
There are two things you can do.
1) Apply OHE after combining your training set and test/validation set, not before; a sketch of keeping the columns aligned is given below.
2) Skip OHE and apply StandardScaler, because "if a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected."
I usually try the second option when a categorical feature has many unique values that may not all appear in my test/validation set.
Feel free to correct me.
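A sketch of keeping the dummy columns aligned, in the spirit of the first option (train_df, prod_df, and "country" are placeholders); columns unseen in production are added as zeros:

import pandas as pd

train_X = pd.get_dummies(train_df, columns=["country"])
train_cols = train_X.columns

prod_X = pd.get_dummies(prod_df, columns=["country"])
# Add country columns missing from the production data as zeros and
# drop any columns the model was never trained on:
prod_X = prod_X.reindex(columns=train_cols, fill_value=0)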
