I am trying to use XGBoost as a feature importance tool. However, out of 84 features, I got only results for only 10 of them and the for the rest of them prints zeros. Do you know how to fix it?
This is my code and the results:
import numpy as np
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
X = data.iloc[:,:-1]
y = data['clusters_pred']
model = XGBClassifier()
model.fit(X, y)
sorted_idx = np.argsort(model.feature_importances_)[::-1]
for index in sorted_idx:
print([X.columns[index], model.feature_importances_[index]])
['XYZ', 0.97976303]
['ABC', 0.008309732]
['ZZZ', 0.007930854]
['CGNK', 0.0011405549]
['TTT', 0.0011277349]
['PLT', 0.0007475067]
['HB', 0.00056899816]
['PBB', 0.00020151233]
['AGE', 0.000108826855]
['SEX', 0.0001012349]
['BLA', 0.0]
['STAT', 0.0]
['tRU', 0.0]
...
Looks like your 'XYZ' feature is turning out to be the most important compared to others and as per the important values - it is suggested to drop the lower important features.
You can try with different feature combination, try some normalization on the existing feature or try with different feature important type used in XGBClassifier e.g. “gain”, “weight”, “cover”, “total_gain” or “total_cover”. The default is 'weight'.
Related
I'm attempting to perform a lasso regression for a larger than main memory dataset by using Dask, but there doesn't seem to be a cleanly documented way to do so.
I did previously find a somewhat related question but no actual answer.
Looking into how scikit sets up the Lasso regression, I thought I could set it up the same way. For example, here is one approach I tried
from dask_ml.datasets import make_regression
import dask_glm.families
import dask_glm.regularizers
import dask_glm.algorithms
import pandas as pd
# dask dataframes
X, y = make_regression(n_samples=1000, chunks=100)
# pandas dataframes
df_X = X.compute()
df_y = y.compute()
family = dask_glm.families.Normal()
regularizer = dask_glm.regularizers.ElasticNet(weight=1)
b = dask_glm.algorithms.gradient_descent(X=X, y=y, max_iter=100000, family=family, regularizer=regularizer, alpha=0.01, normalize=False, fit_intercept=False)
print(b)
reg = linear_model.Lasso(alpha=0.01, fit_intercept=False)
reg.fit(df_X, df_y)
print(reg.coef_)
However, the coefficients don't match up at all, and the dask version's coefficients seem more unstable than scikit's.
Here's another approach I tried, this time based on a comment from this GH issue
from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression
X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(regularizer=L1())
lr.fit(X, y)
print(lr.coef_)
Again, the coefficients seem very unstable.
Ideally there would already be an implementation of Lasso using Dask for this, but I can't seem to find much on the internet except for running LassoCV with dask as the backend to joblib, which is a little different than I want.
Running an LGBM Classifier model and I'm able to use lgbm.plot_importance to plot the most important features but I would prefer having a list of these features instead, does anybody know how to go about doing this?
The lightgbm.Booster object has a method .feature_importance() which can be used to access feature importances.
That method returns an array with one importance value per feature, and supports two types of importance, based on the value of importance_type:
"gain" = "cumulative gain of all splits using this feature"
"split" = "number of splits this feature was used in"
You can explore this using the following code. I ran this with lightgbm==3.3.0, numpy==1.21.0, pandas==1.2.3, and scikit-learn==0.24.1, using Python 3.8.
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data = lgb.Dataset(X, label=y)
# train model
bst = lgb.train(
params={"objective": "binary"},
train_set=data,
num_boost_round=10
)
# compute importances
importance_df = (
pd.DataFrame({
'feature_name': bst.feature_name(),
'importance_gain': bst.feature_importance(importance_type='gain'),
'importance_split': bst.feature_importance(importance_type='split'),
})
.sort_values('importance_gain', ascending=False)
.reset_index(drop=True)
)
print(importance_df)
Here's an example of the output.
feature_name importance_gain importance_split
0 Column_22 1051.204456 8
1 Column_23 862.363854 10
2 Column_27 262.272097 19
3 Column_7 161.842017 13
4 Column_21 66.431762 24
This is saying that, for example, feature Column_21 was used in more splits than other top features, but the improvement those splits provided were much less impactful than the 8 splits using Column_22.
Seems like you are using Sklearn API for Lightgbm. This should help.
General idea:
LGBMClassifier.feature_importances_
Particular case:
model_name.feature_importances_
Full code snippet (assuming pandas dataframe was used for training):
features = train_x.columns
importances = model.feature_importances_
feature_importance = pd.DataFrame({'importance':importances,'features':features}).sort_values('importance', ascending=False).reset_index(drop=True)
feature_importance
Also you can plot importances:
lgb.plot_importance(model_name)
I would like to perform a simple linear regression using statsmodels and I've tried several different methods by now but I just don't get it to work. The code that I have constructed now doesn't give me any errors but it also doesn't show me the result
I am trying to create a model for the variable "Direction" which takes the value 0 if the return for the corresponding date was negative and 1 if it was positive. The explinatory variables are the (5) lags of the returns. The df13 contains the lags and also the direction for each observed date. I tried this code and as I mentioned it doesn't give an error but says " Optimization terminated successfully.
Current function value: 0.682314
Iterations 5
However, I would like to see the typical table with all the beta values, their significance etc.
Also, what would you say, since Direction is a binary variable may it be better to use a logit instead of a linear model? However, in the assignment it appeared as a linear model.
And lastly, I am sorry its not displayed here correctly but I don't know how to write as code or insert my dataframe
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os
import itertools
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
...
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
X = sm.add_constant(X)
model = sm.Logit(Y.astype(float), X.astype(float)).fit()
predictions = model.predict(X)
print_model = model.summary
print(print_model)
Edit: I'm sure it has to be a logit regression so I updated that part
I don't know if this is unintentional, but it looks like you need to define X and Y separately:
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
Secondly, I'm not familiar with statsmodel, but I would try converting your dataframes to numpy arrays. You can do this with
Xnum = X.to_numpy()
ynum = y.to_numpy()
And try passing those to the regressors.
Hopefully I'm reading this wrong but in the XGBoost library documentation, there is note of extracting the feature importance attributes using feature_importances_ much like sklearn's random forest.
However, for some reason, I keep getting this error: AttributeError: 'XGBClassifier' object has no attribute 'feature_importances_'
My code snippet is below:
from sklearn import datasets
import xgboost as xg
iris = datasets.load_iris()
X = iris.data
Y = iris.target
Y = iris.target[ Y < 2] # arbitrarily removing class 2 so it can be 0 and 1
X = X[range(1,len(Y)+1)] # cutting the dataframe to match the rows in Y
xgb = xg.XGBClassifier()
fit = xgb.fit(X, Y)
fit.feature_importances_
It seems that you can compute feature importance using the Booster object by calling the get_fscore attribute. The only reason I'm using XGBClassifier over Booster is because it is able to be wrapped in a sklearn pipeline. Any thoughts on feature extractions? Is anyone else experiencing this?
As the comments indicate, I suspect your issue is a versioning one. However if you do not want to/can't update, then the following function should work for you.
def get_xgb_imp(xgb, feat_names):
from numpy import array
imp_vals = xgb.booster().get_fscore()
imp_dict = {feat_names[i]:float(imp_vals.get('f'+str(i),0.)) for i in range(len(feat_names))}
total = array(imp_dict.values()).sum()
return {k:v/total for k,v in imp_dict.items()}
>>> import numpy as np
>>> from xgboost import XGBClassifier
>>>
>>> feat_names = ['var1','var2','var3','var4','var5']
>>> np.random.seed(1)
>>> X = np.random.rand(100,5)
>>> y = np.random.rand(100).round()
>>> xgb = XGBClassifier(n_estimators=10)
>>> xgb = xgb.fit(X,y)
>>>
>>> get_xgb_imp(xgb,feat_names)
{'var5': 0.0, 'var4': 0.20408163265306123, 'var1': 0.34693877551020408, 'var3': 0.22448979591836735, 'var2': 0.22448979591836735}
For xgboost, if you use xgb.fit(),then you can use the following method to get feature importance.
import pandas as pd
xgb_model=xgb.fit(x,y)
xgb_fea_imp=pd.DataFrame(list(xgb_model.get_booster().get_fscore().items()),
columns=['feature','importance']).sort_values('importance', ascending=False)
print('',xgb_fea_imp)
xgb_fea_imp.to_csv('xgb_fea_imp.csv')
from xgboost import plot_importance
plot_importance(xgb_model, )
I found out the answer. It appears that version 0.4a30 does not have feature_importance_ attribute. Therefore if you install the xgboost package using pip install xgboost you will be unable to conduct feature extraction from the XGBClassifier object, you can refer to #David's answer if you want a workaround.
However, what I did is build it from the source by cloning the repo and running . ./build.sh which will install version 0.4 where the feature_importance_ attribute works.
Hope this helps others!
Get Feature Importance as a sorted data frame
import pandas as pd
import numpy as np
def get_xgb_imp(xgb, feat_names):
imp_vals = xgb.booster().get_fscore()
feats_imp = pd.DataFrame(imp_vals,index=np.arange(2)).T
feats_imp.iloc[:,0]= feats_imp.index
feats_imp.columns=['feature','importance']
feats_imp.sort_values('importance',inplace=True,ascending=False)
feats_imp.reset_index(drop=True,inplace=True)
return feats_imp
feature_importance_df = get_xgb_imp(xgb, feat_names)
For those having the same problem as Luís Bianchin, "TypeError: 'str' object is not callable", I found a solution (that works for me at least) here.
In short, I found modifying David's code from
imp_vals = xgb.booster().get_fscore()
to
imp_vals = xgb.get_fscore()
worked for me.
For more detail I would recommend visiting the link above.
Big thanks to David and ianozsvald
You can also use the built-in plot_importance function:
from xgboost import XGBClassifier, plot_importance
fit = XGBClassifier().fit(X,Y)
plot_importance(fit)
The alternative to built-in feature importance can be:
permutation-based importance from scikit-learn (permutation_importance method
importance with Shapley values (shap package)
I really like shap package because it provides additional plots. Example:
Importance Plot
Summary Plot
Dependence Plot
You can read about alternative ways to compute feature importance in Xgboost in this blog post of mine.
An update of the accepted answer since it no longer works:
def get_xgb_imp(xgb_model, feat_names):
imp_vals = xgb_model.get_fscore()
imp_dict = {feat: float(imp_vals.get(feat, 0.)) for feat in feat_names}
total = sum(list(imp_dict.values()))
return {k: round(v/total, 5) for k,v in imp_dict.items()}
It seems like the api keeps on changing. For xgboost version 1.0.2, just changing from imp_vals = xgb.booster().get_fscore() to imp_vals = xgb.get_booster().get_fscore() in #David's answer does the trick. The updated code is -
from numpy import array
def get_xgb_imp(xgb, feat_names):
imp_vals = xgb.get_booster().get_fscore()
imp_dict = {feat_names[i]:float(imp_vals.get('f'+str(i),0.)) for i in range(len(feat_names))}
total = array(imp_dict.values()).sum()
return {k:v/total for k,v in imp_dict.items()}
I used the following code to get feature_importance. Also, I used DictVectorizer() in the pipeline for one_hot_encoding. If you use
v = DictVectorizer()
X_to_dict = X.to_dict("records")
X_transformed = v.fit_transform(X_to_dict)
feature_names = v.get_feature_names()
best_model.get_booster().feature_names = feature_names
xgb.plot_importance(best_model.get_booster())
You can obtain the f_score plot. But I wanted to plot the feature importance against the feature names. So I modified it further.
f, ax = plt.subplots(figsize=(10, 30)) plt.barh(feature_names, best_model.feature_importances_) plt.xticks(rotation = 90) plt.show()
I use scikit linear regression and if I change the order of the features, the coef are still printed in the same order, hence I would like to know the mapping of the feature with the coeff.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print model_1.coef_
print model_2.coef_
print model_3.coef_
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_,model_1_features):
coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())
#Robin posted a great answer, but for me I had to make one tweak on it to work the way I wanted, and it was to refer to the dimension of the 'coef_' np.array that I wanted, namely modifying to this: model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:],model_1_features):
coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, making it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the longer input will have the values in its extra index positions cut off. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great but what personally worked for me was this, as the feature names I needed were the columns of my train_date dataframe:
pd.DataFrame(data=model_1.coef_,columns=train_data.columns)
Right after training the model, the coefficient values are stored in the variable model.coef_[0]. We can iterate over the column names and store the column name and their coefficient value in a dictionary.
model.fit(X_train,y)
# assuming all the columns except last one is used in training
columns = data.iloc[:,-1].columns
coef_dict = {}
for i in range(0,len(columns)):
coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
data=model_1.coef_,
index=model_1.feature_names_in_
)
A minimally reproducible example
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
data=np.random.random((10, 3)),
columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coeficients and their respective feature names
coef_series = pd.Series(
data=lr.coef_,
index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64