Feature Importance with XGBClassifier - python

Hopefully I'm reading this wrong, but the XGBoost library documentation mentions extracting feature importances through the feature_importances_ attribute, much like sklearn's random forest.
However, for some reason, I keep getting this error: AttributeError: 'XGBClassifier' object has no attribute 'feature_importances_'
My code snippet is below:
from sklearn import datasets
import xgboost as xg
iris = datasets.load_iris()
X = iris.data
Y = iris.target
X = X[Y < 2] # keeping only the rows that belong to classes 0 and 1
Y = Y[Y < 2] # arbitrarily removing class 2 so it can be 0 and 1
xgb = xg.XGBClassifier()
fit = xgb.fit(X, Y)
fit.feature_importances_
It seems that you can compute feature importance using the Booster object by calling its get_fscore method. The only reason I'm using XGBClassifier over Booster is that it can be wrapped in a sklearn pipeline. Any thoughts on extracting feature importances? Is anyone else experiencing this?

As the comments indicate, I suspect your issue is a versioning one. However if you do not want to/can't update, then the following function should work for you.
def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f'+str(i), 0.)) for i in range(len(feat_names))}
    total = sum(imp_dict.values())  # plain sum also works on Python 3, where dict.values() is a view
    return {k: v/total for k, v in imp_dict.items()}
>>> import numpy as np
>>> from xgboost import XGBClassifier
>>>
>>> feat_names = ['var1','var2','var3','var4','var5']
>>> np.random.seed(1)
>>> X = np.random.rand(100,5)
>>> y = np.random.rand(100).round()
>>> xgb = XGBClassifier(n_estimators=10)
>>> xgb = xgb.fit(X,y)
>>>
>>> get_xgb_imp(xgb,feat_names)
{'var5': 0.0, 'var4': 0.20408163265306123, 'var1': 0.34693877551020408, 'var3': 0.22448979591836735, 'var2': 0.22448979591836735}

For xgboost, if you fit the model with xgb.fit(), you can get the feature importance as follows.
import pandas as pd
xgb_model = xgb.fit(x, y)
xgb_fea_imp = pd.DataFrame(list(xgb_model.get_booster().get_fscore().items()),
                           columns=['feature', 'importance']).sort_values('importance', ascending=False)
print('', xgb_fea_imp)
xgb_fea_imp.to_csv('xgb_fea_imp.csv')
from xgboost import plot_importance
plot_importance(xgb_model)

I found out the answer. It appears that version 0.4a30 does not have the feature_importances_ attribute. Therefore, if you install the xgboost package using pip install xgboost, you will be unable to extract feature importances from the XGBClassifier object; refer to @David's answer if you want a workaround.
However, what I did was build it from source by cloning the repo and running . ./build.sh, which installs version 0.4, where the feature_importances_ attribute works.
Hope this helps others!

Get Feature Importance as a sorted data frame
import pandas as pd
import numpy as np
def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.booster().get_fscore()
    feats_imp = pd.DataFrame(imp_vals,index=np.arange(2)).T
    feats_imp.iloc[:,0]= feats_imp.index
    feats_imp.columns=['feature','importance']
    feats_imp.sort_values('importance',inplace=True,ascending=False)
    feats_imp.reset_index(drop=True,inplace=True)
    return feats_imp

feature_importance_df = get_xgb_imp(xgb, feat_names)

For those having the same problem as Luís Bianchin, "TypeError: 'str' object is not callable", I found a solution (that works for me at least) here.
In short, I found modifying David's code from
imp_vals = xgb.booster().get_fscore()
to
imp_vals = xgb.get_fscore()
worked for me.
For more detail I would recommend visiting the link above.
Big thanks to David and ianozsvald

You can also use the built-in plot_importance function:
from xgboost import XGBClassifier, plot_importance
fit = XGBClassifier().fit(X,Y)
plot_importance(fit)

Alternatives to the built-in feature importance include:
permutation-based importance from scikit-learn (the permutation_importance function)
importance with Shapley values (the shap package)
I really like the shap package because it provides additional plots, for example an importance plot, a summary plot, and a dependence plot.
You can read about alternative ways to compute feature importance in XGBoost in this blog post of mine.
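As a rough illustration of the permutation-based option, here is a minimal sketch using sklearn.inspection.permutation_importance. The synthetic data and the RandomForestClassifier stand-in are assumptions made so the snippet runs without xgboost; a fitted XGBClassifier should be usable in its place, since it follows the scikit-learn estimator interface.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data (an assumption for illustration): only column 0 drives the label.
rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = (X[:, 0] > 0.5).astype(int)

# Stand-in estimator; swap in a fitted XGBClassifier if xgboost is installed.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each column n_repeats times and record the resulting drop in score.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

# Rank features from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print("feature %d: %.4f +/- %.4f"
          % (idx, result.importances_mean[idx], result.importances_std[idx]))
```

For a real model it is usually more informative to compute this on held-out data rather than on the training set.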

An update of the accepted answer since it no longer works:
def get_xgb_imp(xgb_model, feat_names):
    imp_vals = xgb_model.get_fscore()
    imp_dict = {feat: float(imp_vals.get(feat, 0.)) for feat in feat_names}
    total = sum(list(imp_dict.values()))
    return {k: round(v/total, 5) for k,v in imp_dict.items()}

It seems like the API keeps changing. For xgboost version 1.0.2, changing imp_vals = xgb.booster().get_fscore() to imp_vals = xgb.get_booster().get_fscore() in @David's answer does the trick. The updated code is:
def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.get_booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f'+str(i), 0.)) for i in range(len(feat_names))}
    total = sum(imp_dict.values())  # numpy's array() cannot sum a Python 3 dict view directly
    return {k: v/total for k, v in imp_dict.items()}

I used the following code to get the feature importance. I also used DictVectorizer() in the pipeline for one-hot encoding. If you use
v = DictVectorizer()
X_to_dict = X.to_dict("records")
X_transformed = v.fit_transform(X_to_dict)
feature_names = v.get_feature_names()
best_model.get_booster().feature_names = feature_names
xgb.plot_importance(best_model.get_booster())
you can obtain the F-score plot. But I wanted to plot the feature importance against the feature names, so I modified it further:
f, ax = plt.subplots(figsize=(10, 30))
plt.barh(feature_names, best_model.feature_importances_)
plt.xticks(rotation=90)
plt.show()

Related

lasso regression for larger datasets using dask

I'm attempting to perform a lasso regression for a larger than main memory dataset by using Dask, but there doesn't seem to be a cleanly documented way to do so.
I did previously find a somewhat related question but no actual answer.
Looking into how scikit sets up the Lasso regression, I thought I could set it up the same way. For example, here is one approach I tried
from dask_ml.datasets import make_regression
import dask_glm.families
import dask_glm.regularizers
import dask_glm.algorithms
from sklearn import linear_model
# dask arrays
X, y = make_regression(n_samples=1000, chunks=100)
# in-memory NumPy arrays
df_X = X.compute()
df_y = y.compute()
family = dask_glm.families.Normal()
regularizer = dask_glm.regularizers.ElasticNet(weight=1)
b = dask_glm.algorithms.gradient_descent(X=X, y=y, max_iter=100000, family=family, regularizer=regularizer, alpha=0.01, normalize=False, fit_intercept=False)
print(b)
reg = linear_model.Lasso(alpha=0.01, fit_intercept=False)
reg.fit(df_X, df_y)
print(reg.coef_)
However, the coefficients don't match up at all, and the dask version's coefficients seem more unstable than scikit's.
Here's another approach I tried, this time based on a comment from this GH issue
from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression
X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(regularizer=L1())
lr.fit(X, y)
print(lr.coef_)
Again, the coefficients seem very unstable.
Ideally there would already be an implementation of Lasso using Dask for this, but I can't seem to find much on the internet except for running LassoCV with dask as the backend to joblib, which is a little different from what I want.

XGBoost feature importance giving the results for 10 features

I am trying to use XGBoost as a feature importance tool. However, out of 84 features, I get non-zero results for only 10 of them; for the rest it prints zeros. Do you know how to fix it?
This is my code and the results:
import numpy as np
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
X = data.iloc[:,:-1]
y = data['clusters_pred']
model = XGBClassifier()
model.fit(X, y)
sorted_idx = np.argsort(model.feature_importances_)[::-1]
for index in sorted_idx:
    print([X.columns[index], model.feature_importances_[index]])
['XYZ', 0.97976303]
['ABC', 0.008309732]
['ZZZ', 0.007930854]
['CGNK', 0.0011405549]
['TTT', 0.0011277349]
['PLT', 0.0007475067]
['HB', 0.00056899816]
['PBB', 0.00020151233]
['AGE', 0.000108826855]
['SEX', 0.0001012349]
['BLA', 0.0]
['STAT', 0.0]
['tRU', 0.0]
...
It looks like your 'XYZ' feature is by far the most important one, and going by the importance values, the features with the lowest importance are candidates for dropping.
You can try different feature combinations, try normalizing the existing features, or try a different importance type in XGBClassifier, e.g. "gain", "weight", "cover", "total_gain" or "total_cover". Note that the default depends on the API and version: Booster.get_fscore() and plot_importance count splits ("weight"), while recent releases of the scikit-learn wrapper use "gain" for feature_importances_ on tree models, so check the documentation for your version.

Output of a statsmodels regression

I would like to perform a simple linear regression using statsmodels. I've tried several different methods by now, but I just can't get it to work. The code I have now doesn't give me any errors, but it also doesn't show me the result.
I am trying to create a model for the variable "Direction", which takes the value 0 if the return for the corresponding date was negative and 1 if it was positive. The explanatory variables are the (5) lags of the returns. df13 contains the lags and also the direction for each observed date. I tried this code and, as I mentioned, it doesn't give an error but only prints:
Optimization terminated successfully.
Current function value: 0.682314
Iterations 5
However, I would like to see the typical table with all the beta values, their significance, etc.
Also, what would you say: since Direction is a binary variable, might it be better to use a logit instead of a linear model? In the assignment, however, it appeared as a linear model.
And lastly, I am sorry it's not displayed correctly here, but I don't know how to format it as code or insert my dataframe.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os
import itertools
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
...
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
X = sm.add_constant(X)
model = sm.Logit(Y.astype(float), X.astype(float)).fit()
predictions = model.predict(X)
print_model = model.summary
print(print_model)
Edit: I'm sure it has to be a logit regression so I updated that part
I don't know if this is unintentional, but it looks like you need to define X and Y separately:
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
Secondly, I'm not familiar with statsmodels, but I would try converting your dataframes to numpy arrays. You can do this with
Xnum = X.to_numpy()
ynum = Y.to_numpy()
And try passing those to the regressors.

LinearDiscriminantAnalysis - Single column output from .transform(X)

I have been successfully playing around with replicating one of the sklearn tutorials using the iris dataset in PyCharm using Python 2.7. However, when trying to repeat this with my own data I have been encountering an issue. I have been importing data from a .csv file using 'np.genfromtxt', but for some reason I keep getting a single column output for X_r2 (see below), when I should get a 2 column output. I have therefore replaced my data with some randomly generated variables to post onto SO, and I am still getting the same issue.
I have included the 'problem' code below, and I would be interested to know what I have done wrong. I have extensively used the debugging features in PyCharm to check that the type and shape of my variables are similar to the original sklearn example, but it did not help me with the problem. Any help or suggestions would be appreciated.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
y = np.random.randint(2, size=500)
X = np.random.randint(1, high=1000, size=(500, 6))
target_names = np.array([['XX'], ['YY']])
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
The array y in the example you posted has values 0, 1 and 2, while yours only has values 0 and 1. LDA yields at most n_classes - 1 components, so two classes give a single column. This change achieves what you want:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
y = np.random.randint(3, size=500)
X = np.random.randint(1, high=1000, size=(500, 6))
target_names = np.array([['XX'], ['YY']])
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
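The limit on the number of columns can be checked directly. In the sketch below, max_lda_components is a hypothetical helper that computes min(n_features, n_classes - 1), which is the largest n_components scikit-learn's LinearDiscriminantAnalysis can return; the random data mirrors the snippets above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def max_lda_components(X, y):
    """Largest number of discriminant axes LDA can return for this data."""
    return min(X.shape[1], len(np.unique(y)) - 1)


rng = np.random.RandomState(0)
X = rng.randint(1, 1000, size=(500, 6)).astype(float)
y_two_classes = rng.randint(2, size=500)
y_three_classes = rng.randint(3, size=500)

# Two classes: only one discriminant axis is available.
lda2 = LinearDiscriminantAnalysis(n_components=max_lda_components(X, y_two_classes))
proj2 = lda2.fit(X, y_two_classes).transform(X)

# Three classes: two discriminant axes are available.
lda3 = LinearDiscriminantAnalysis(n_components=max_lda_components(X, y_three_classes))
proj3 = lda3.fit(X, y_three_classes).transform(X)

print(proj2.shape, proj3.shape)  # prints (500, 1) (500, 2)
```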

What is the Python equivalent to R predict function for linear models?

What is the Python equivalent to R predict function for linear models?
I'm sure there is something in scipy that can help here but is there an equivalent function?
https://stat.ethz.ch/R-manual/R-patched/library/stats/html/predict.lm.html
Scipy has plenty of regression tools with predict methods; though IMO, Pandas is the python library that comes closest to replicating R's functionality, complete with predict methods. The following snippets in R and python demonstrate the similarities.
R linear regression:
data(trees)
linmodel <- lm(Volume~., data = trees[1:20,])
linpred <- predict(linmodel, trees[21:31,])
plot(linpred, trees$Volume[21:31])
Same data set in Python using pandas ols (note that pandas.stats.api was removed in pandas 0.20, so this snippet needs an older pandas):
import pandas as pd
from pandas.stats.api import ols
import matplotlib.pyplot as plt
trees = pd.read_csv('trees.csv')
linmodel = ols(y = trees['Volume'][0:20], x = trees[['Girth', 'Height']][0:20])
linpred = linmodel.predict(x = trees[['Girth', 'Height']][20:31])
plt.scatter(linpred,trees['Volume'][20:31])
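On current library versions, a close analogue of R's lm/predict pair is the statsmodels formula API. The following is a sketch under assumptions: the trees DataFrame here is filled with made-up numbers (not the real R dataset) so the snippet runs without trees.csv, and only the column names Girth, Height and Volume are taken from the example above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in for the trees data; replace with pd.read_csv('trees.csv').
rng = np.random.default_rng(0)
trees = pd.DataFrame({
    "Girth": rng.uniform(8, 21, size=31),
    "Height": rng.uniform(63, 87, size=31),
})
trees["Volume"] = (-58 + 4.7 * trees["Girth"] + 0.34 * trees["Height"]
                   + rng.normal(0, 3, size=31))

# Fit on the first 20 rows, as in the R example.
train, test = trees.iloc[:20], trees.iloc[20:31]
linmodel = smf.ols("Volume ~ Girth + Height", data=train).fit()

# predict() takes a DataFrame with the same column names as the formula.
linpred = linmodel.predict(test)
print(linpred.head())
```

The formula string plays the role of Volume~. in R, except that the predictors are listed explicitly.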
