I need to plot feature_importances for a DecisionTreeClassifier. The features are already selected and the target results are achieved, but my teacher tells me to plot feature_importances to see the weights of the contributing factors.
I have no idea how to do it.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=12345, max_depth=8, class_weight='balanced')
model.fit(features_train, target_train)
model.feature_importances_
It gives me:
array([0.02927077, 0.3551379 , 0.01647181, ..., 0.03705096, 0. ,
0.01626676])
Why is it not attached to anything, like max_depth is, and is just an array of numbers?
Feature importances represent the effect of each factor on the outcome variable. The greater the importance, the more that factor affects the outcome. That's why you received an array: one value per feature.
For plotting, you can do:
import matplotlib.pyplot as plt
import pandas as pd
feat_importances = pd.DataFrame(model.feature_importances_, index=features_train.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(8, 6))
plt.show()
Feature importance refers to a class of techniques for assigning scores to the input features of a predictive model, indicating the relative importance of each feature when making a prediction.
Feature importance scores can be calculated for problems that involve predicting a numerical value (regression) and for problems that involve predicting a class label (classification).
Load the feature importances into a pandas series indexed by your dataframe column names, then use its plot method.
From the scikit-learn documentation:
Feature importances are provided by the fitted attribute feature_importances_ and they are computed as the mean and standard deviation of accumulation of the impurity decrease within each tree.
How are feature_importances in RandomForestClassifier determined?
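That description refers to tree ensembles; you can inspect the per-tree spread yourself. A minimal sketch, assuming a fitted RandomForestClassifier named forest (hypothetical here, since the question uses a single DecisionTreeClassifier):
import numpy as np
# Each fitted tree exposes its own impurity-based importances;
# the ensemble's feature_importances_ is their mean across trees
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
mean_imp = per_tree.mean(axis=0)  # matches forest.feature_importances_
std_imp = per_tree.std(axis=0)    # spread across trees, usable as error bars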
For your example:
feat_importances = pd.Series(model.feature_importances_, index=features_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
More ways to plot feature importances: Random Forest Feature Importance Chart using Python
I'm running a random forest regressor using scikit learn, but all the predictions end up being the same.
I realized that when I fit the data, all the feature importances are zero, which is probably why all the predictions are the same.
This is the code that I'm using:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")
target = merged_df["400kmDensity"]
merged_df.drop("400kmDensity", axis = 1, inplace = True)
features_list = list(merged_df.columns)
#Set training and testing groups
train_features, test_features, train_target, test_target = train_test_split(merged_df, target, random_state = 16)
#Train model
rf = RandomForestRegressor(n_estimators = 150, random_state = 16)
ran = rf.fit(train_features, train_target)
print("Feature importances: ", rf.feature_importances_)
#Make predictions and calculate error
predictions = ran.predict(test_features)
print("Predictions: ", predictions)
Here's a link to the data file:
https://drive.google.com/file/d/1ECgKAH82wxIvt2OCv4W5ir1te_Vr3N_r/view?usp=sharing
If anybody can see what I did wrong before fitting the data that would result in the feature importances all being zero, that would be much appreciated.
Both your variables "400kmDensity" and "410kmDensity" share a correlation coefficient of >0.99:
import numpy as np
np.corrcoef(merged_df["400kmDensity"], merged_df["410kmDensity"])
This practically means that you can predict "400kmDensity" almost exclusively from "410kmDensity". On a scatter plot, the two form an almost perfect line.
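A quick way to reproduce that scatter plot (assuming merged_df is re-read from the CSV so both columns are still present):
import matplotlib.pyplot as plt
# The near-perfect line reflects the >0.99 correlation
plt.scatter(merged_df["410kmDensity"], merged_df["400kmDensity"], s=2)
plt.xlabel("410kmDensity")
plt.ylabel("400kmDensity")
plt.show()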
In order to actually explore what affects the values of "400kmDensity", you should exclude "410kmDensity" as a regressor (an explanatory variable). The feature importance can help to identify explanatory variables afterward.
Note that feature importance may not be a perfect metric to determine actual feature importance. You may want to take a look at other available methods like Boruta Algorithm/Permutation Importance/...
In regard to the initial question: I'm not really sure why, but RandomForestRegressor seems to have trouble with your very small-valued target variable. I was able to get non-zero feature importances after scaling train_target and train_features before calling rf.fit(). However, this should not actually be necessary in order to apply Random Forest! You may want to look into the respective documentation or extend your search in this direction. Hope this serves as a hint.
from sklearn.preprocessing import scale
fitted_rf = rf.fit(scale(train_features), scale(train_target))
As mentioned before, the feature importances after this change are unsurprisingly no longer all zero (plot omitted).
Also, the column "second" holds only the value zero, which does not explain anything! Your first step should always be EDA (Exploratory Data Analysis) to get a feeling for the data, like checking correlations between columns or generating histograms to explore data distributions [...].
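For example, a quick first EDA pass along those lines might look like this (a sketch; it re-reads the CSV so the target column is still present):
import pandas as pd
import matplotlib.pyplot as plt
merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")
# Near-duplicate regressors show up as correlations close to 1 with the target
print(merged_df.corr(numeric_only=True)["400kmDensity"].sort_values(ascending=False).head())
# Degenerate columns such as "second" have a single unique value
nuniq = merged_df.nunique()
print(nuniq[nuniq == 1])
# Histograms give a feel for each distribution
merged_df.hist(figsize=(12, 10))
plt.show()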
There is much more to it, but I hope this gives you a leg-up!
I am using sklearn's permutation_importance to estimate the importance of my independent variables. The model I am fitting is a linear regression.
The permutation importance the model returns looks like this: [0.7939618 3.6692722 0.02936469].
The permutation importance is defined to be the difference between the baseline metric and the metric obtained from permuting the feature column.
In my case, I think the baseline metric is the R^2 value when I do not permute any variables (baseline R^2 ~ 0.86). How is it possible that I obtain a value of 3.66 for one of my features in that case? If I manually permute this feature and recalculate R^2, I get a value of ~0.18, so the feature importance should be ~0.68 if I am not mistaken.
If anyone could explain to me why I am observing these high feature importance values, I'd be very grateful!
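For reference, the definition above can be checked by hand. A minimal sketch (the names reg, X_val, y_val are placeholders for a fitted LinearRegression and a validation set held as a NumPy array):
import numpy as np
from sklearn.inspection import permutation_importance
baseline = reg.score(X_val, y_val)  # R^2 with no shuffling
# Manually permute one column and measure the drop in R^2
rng = np.random.default_rng(0)
X_perm = X_val.copy()
X_perm[:, 0] = rng.permutation(X_perm[:, 0])  # shuffle the first feature
manual_importance = baseline - reg.score(X_perm, y_val)
# sklearn repeats the shuffle n_repeats times and averages the differences;
# note that R^2 on permuted data can go well below 0, so an importance
# larger than 1 (like 3.66) is possible even with a baseline R^2 of ~0.86
result = permutation_importance(reg, X_val, y_val, n_repeats=30, random_state=0)
print(manual_importance, result.importances_mean[0])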
I'm running a feature selection using sns.heatmap and another using sklearn's feature_importances_.
When using the same data, I get two different sets of values.
Here is the heatmap (image omitted) and the code that produced it:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
training_data = pd.read_csv(
"/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv")
df_model = training_data.copy()
df_model = df_model.dropna()
df_model = df_model.drop(['Money_Line', 'Money_Line_Percentage', 'Money_Line_Money', 'Money_Line_Move', 'Money_Line_Direction', "Spread", 'Spread_Percentage', 'Spread_Money', 'Spread_Move', 'Spread_Direction',
"Win", "Money_Line_Percentage", 'Cover'], axis=1)
X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
'Under_Percentage', 'Under_Money']] # independent columns
y = df_model['Over_Under'] # target column
# get correlations of each features in dataset
corrmat = df_model.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))
# plot heat map
g = sns.heatmap(
df_model[top_corr_features].corr(), annot=True, cmap='hot')
plt.xticks(rotation=90)
plt.yticks(rotation=45)
plt.show()
Here is the feature_importances bar graph (image omitted) and the code:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.inspection import permutation_importance
training_data = pd.read_csv(
"/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv", index_col=False)
df_model = training_data.copy()
df_model = df_model.dropna()
X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
'Under_Percentage', 'Under_Money']] # independent columns
y = df_model['Over_Under'] # target column
model = RandomForestClassifier(
random_state=1, n_estimators=100, min_samples_split=100, max_depth=5, min_samples_leaf=2)
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
model.fit(X_train, y_train)
# use the inbuilt feature_importances_ of tree-based classifiers
print(model.feature_importances_)
# plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
perm_importance = permutation_importance(model, X_test, y_test)
feat_importances.nlargest(5).plot(kind='barh')
print(perm_importance)
plt.show()
I'm not sure which one is more accurate, or if I'm using them in the correct way. Should I be using the heatmap to eliminate collinearity and the feature importances to actually select my group of features?
You are comparing two different things; why would you expect them to be the same? And what would it even mean in this case?
Feature importances in tree-based models are computed from the splits in which each feature is used (in scikit-learn, from the total impurity decrease a feature contributes across its splits). A feature that contributes more to the model's splits is more important, for that particular model fitted on that particular dataset, than one that contributes less.
Correlation, on the other hand, is a measure of the linear relationship between two features.
I'm not sure which one is more accurate
What do you mean by accuracy? Both of these are accurate in what they are measuring. It is just that neither of them directly tells you which feature(s) to throw away.
Note that just because two features are correlated, it doesn't mean you can automatically throw one of them away. Collinearity can cause issues with the interpretability of the model: if you have highly correlated features, you can't say which one is more important based on the weights associated with these features. Collinearity should not affect the predictive power of the model; more often, you will find that by throwing away one of the correlated features, your model's predictive power decreases.
Collinearity in a dataset can therefore make the feature importances of your random forest model less interpretable, in the sense that you can't rely on their strict ordering. But again, it should not affect the predictive power of the model (except that the model is more prone to overfitting due to having more degrees of freedom).
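The importance-splitting effect is easy to demonstrate with a toy sketch (synthetic data, not your dataset):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = x1 + rng.normal(scale=0.01, size=2000)  # near-duplicate of x1
noise = rng.normal(size=2000)
X = np.column_stack([x1, x2, noise])
y = (x1 > 0).astype(int)  # only x1 truly drives the label
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# x1 and x2 share the credit; their relative order is unreliable
print(clf.feature_importances_)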
Should I be using the heatmap to eliminate collinearity and the feature importances to actually select my group of features?
Feature engineering/selection is more of an art than science (outside of end-to-end deep learning). There is no correct answer here and you will need to develop your own heuristics and try different things to see which one works better in which scenario.
An example of a simple heuristic based on feature importances and correlation (assuming you have a large number of features; a sketch of steps 1-4 follows the list):
1. Fit the random forest model and measure the feature importances.
2. Throw away those that seem to have no impact on the model (importance close to 0).
3. Refit the model with the new subset of your original data and check whether the metric of your interest (accuracy, MSE, ...) stays approximately the same as in step 1.
4. If you still have a lot of features, repeat steps 1-3, increasing the throw-away threshold until your metric of interest starts worsening.
5. Measure the correlation of the features you are left with and select the most correlated pairs (based on some threshold, e.g. |c| > 0.8).
6. Pick one pair; drop a feature from this pair; measure model performance; return the dropped feature; repeat for each pair.
7. Drop the feature that seems to have the least negative effect on the model's performance, based on the results from step 6.
8. Repeat steps 6-7 until the model's performance starts dropping.
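A minimal sketch of steps 1-4 (hypothetical names; assumes a feature DataFrame X, a target y, and accuracy as the metric of interest):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
features = list(X.columns)
baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
for threshold in [0.0, 0.005, 0.01, 0.02, 0.05]:  # increasing throw-away threshold
    model = RandomForestClassifier(random_state=0).fit(X[features], y)
    keep = [f for f, imp in zip(features, model.feature_importances_) if imp > threshold]
    if not keep:
        break
    score = cross_val_score(RandomForestClassifier(random_state=0), X[keep], y, cv=5).mean()
    if score < baseline - 0.01:  # metric of interest starts worsening: stop
        break
    features = keep
print(features)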
I used lightgbm for feature importance. However, the output is a plot of scores by some metric. My questions are:
1. What is the metric on the x-axis? Is it an F-score or something else?
2. How can I get an output showing how much each feature accounts for in the model (similar to PCA's explained variance)?
3. How do I extract that metric for all the important features in a dataframe format?
This is my code:
import lightgbm as lgb
import matplotlib.pyplot as plt
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'num_leaves': 30,
    'num_round': 360,
    'max_depth': 8,
    'learning_rate': 0.01,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.8,
    'bagging_freq': 12,
}
lgb_train = lgb.Dataset(X, y)
model = lgb.train(lgb_params, lgb_train)
plt.figure(figsize=(12,6))
lgb.plot_importance(model, max_num_features=30)
plt.title("Feature importances")
plt.show()
1) The metric on the x-axis, in your case, is the feature importance obtained with the "split" type (the default). As you can see in the LightGBM docs, the importance can be calculated using the "split" or "gain" method. If "split", the result contains the number of times the feature is used in the model. If "gain", the result contains the total gain of the splits that use the feature.
The first measure is split-based; it doesn't take the number of samples into account.
The second measure is gain-based. It's basically the same as the method in scikit-learn, with Gini impurity replaced by the objective used by the gradient boosting model.
These measures are calculated purely from training data, so there's a chance that a split creates no improvement on the objective in the test set.
2) The most similar measure to sklearn PCA's explained_variance_ratio_ (not in its meaning, but in the way it can be used) is exactly the feature importance of tree-based methods. If you prefer, you can scale these numbers to fractions that sum to 1, as sklearn's random forest does; since lgb.train returns a Booster, that is model.feature_importance() / model.feature_importance().sum(). Otherwise, there are other similar methods like permutation importance.
3) To store all the importances in a df you can do: pd.DataFrame({'name': model.feature_name(), 'importance': model.feature_importance()}).
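For example, both importance types side by side in one dataframe (a sketch, reusing the model trained above):
import pandas as pd
imp = pd.DataFrame({
    'name': model.feature_name(),
    'split': model.feature_importance(importance_type='split'),
    'gain': model.feature_importance(importance_type='gain'),
})
# Scale gain to fractions that sum to 1, mirroring sklearn's feature_importances_
imp['gain_pct'] = imp['gain'] / imp['gain'].sum()
print(imp.sort_values('gain', ascending=False))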
Can I use sklearn's BaggingClassifier to produce continuous predictions? Is there a similar package? My understanding is that the bagging classifier predicts several classifications with different models and then reports the majority answer. It seems like this algorithm could be used to generate a probability for each class and then report the mean value.
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train, Y_train)
Y_pred = trees.predict(X_test)
If you're interested in predicting probabilities for the classes in your classifier, you can use the predict_proba method, which gives you a probability for each class. It's a one-line change to your code:
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict_proba(X_test)
The shape of Y_pred will be [n_samples, n_classes].
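If what you want is a single continuous score per sample in a binary problem, take the positive-class column (column order follows trees.classes_):
positive_scores = trees.predict_proba(X_test)[:, 1]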
If your Y_train values are continuous and you want to predict those continuous values (i.e., you're working on a regression problem), then you can use the BaggingRegressor instead.
I typically use BaggingRegressor() for continuous values and then compare performance with RMSE. Example below:
import math
from sklearn import metrics
from sklearn.ensemble import BaggingRegressor
trees = BaggingRegressor()
trees.fit(X_train, Y_train)
score_RMSE = math.sqrt(metrics.mean_squared_error(Y_test, trees.predict(X_test)))