I am working on a binary classification using random forest and neural network models, in which I am using SHAP to explain the model predictions. I followed the tutorial and wrote the code below to get the waterfall plot shown below.
row_to_show = 20
data_for_prediction = ord_test_t.iloc[row_to_show] # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
rf_boruta.predict_proba(data_for_prediction_array)
explainer = shap.TreeExplainer(rf_boruta)
# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
shap.plots._waterfall.waterfall_legacy(explainer.expected_value[0], shap_values[0],ord_test_t.iloc[row_to_show])
This generated the plot as shown below
However, I want to export this to a dataframe. How can I do that?
I expect my output to look like the one shown below. I want to export this for the full dataframe. Can you help me please?
Let's do a small experiment:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
What is explainer here? If you do dir(explainer) you'll find out it has some methods and attributes among which is:
explainer.expected_value
which is of interest to you because this is the base value on which the SHAP values add up.
Furthermore:
sv = explainer.shap_values(X)
len(sv)
will give a hint that sv is a list consisting of 2 objects, which are most probably the SHAP values for classes 0 and 1. They must be symmetric (because what moves towards 1 moves by exactly the same amount, but with the opposite sign, towards 0).
Hence:
sv1 = sv[1]
Now you have everything to pack it to the desired format:
df = pd.DataFrame(sv1, columns=X.columns)
df.insert(0, 'bv', explainer.expected_value[1])
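To convince yourself that the bv column really is the base the SHAP values add up from, here is a quick additivity check, a sketch assuming the model, explainer, X, and sv1 objects defined above (and an older SHAP version where shap_values returns a per-class list):
import numpy as np

# the class-1 SHAP values plus the base value should reconstruct the
# model's predicted probability for class 1 (additivity property)
reconstructed = sv1.sum(axis=1) + explainer.expected_value[1]
predicted = model.predict_proba(X)[:, 1]
print(np.allclose(reconstructed, predicted))  # should print True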
Q: How do I know?
A: Read docs and source code.
If I recall correctly, you can do something like this with pandas
import pandas as pd
shap_values = explainer.shap_values(data_for_prediction)
shap_values_df = pd.DataFrame(shap_values)
To get the feature names, you should do something like this (if data_for_prediction is a dataframe):
feature_names = data_for_prediction.columns.tolist()
shap_df = pd.DataFrame(shap_values, columns=feature_names)
I'm currently using this:
import shap
import pandas as pd

def getShapReport(classifier, X_test):
    shap_values = shap.TreeExplainer(classifier).shap_values(X_test)
    shap.summary_plot(shap_values, X_test)
    shap.summary_plot(shap_values[1], X_test)
    return pd.DataFrame(shap_values[1])
It first displays the SHAP summary plot for the model, then the per-prediction summary for the positive class, and finally it returns the dataframe for the positive class (I'm in an imbalanced context).
It is for a TreeExplainer and not a waterfall plot, but it is basically the same.
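For completeness, a usage sketch, assuming the rf_boruta classifier and ord_test_t test dataframe from the first question, and an older SHAP version where shap_values returns a per-class list:
shap_df = getShapReport(rf_boruta, ord_test_t)
shap_df.columns = ord_test_t.columns  # optional: attach the feature names
shap_df.head()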
I am working on a binary classification using a random forest algorithm.
Currently, I am trying to explain the model predictions using SHAP values.
So, I referred to this useful post and tried the below.
from shap import TreeExplainer, Explanation
from shap.plots import waterfall

sv = explainer(ord_test_t)
exp = Explanation(sv.values[:, :, 1],
                  sv.base_values[:, 1],
                  data=ord_test_t.values,
                  feature_names=ord_test_t.columns)
idx = 20
waterfall(exp[idx])
I like the above approach as it allows displaying the feature values along with the waterfall plot. So, I wish to use this approach.
However, this doesn't help me get the waterfall plot for a specific row in ord_test_t (test data).
For example, let's say ord_test_t.index.tolist() returns 3, 5, 8, 9, etc.
Now, I want to plot the waterfall plot for the row of ord_test_t whose index label is 9, but when I pass exp[9], it just gets the 9th positional row, not the row whose index is 9.
When I try exp.iloc[[9]], it throws an error because the Explanation object doesn't have iloc.
Can you help me with this please?
My suggestion is as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
import shap

print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
idx = 9

model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X.loc[[idx]])  # corrected, pass the row of interest as df
exp = Explanation(
    sv.values[:, :, 1],  # class to explain
    sv.base_values[:, 1],
    data=X.loc[[idx]].values,  # corrected, pass the row of interest as df
    feature_names=X.columns,
)
waterfall(exp[0])  # pretend you have only 1 data point which is 0th
The print(shap.__version__) call above outputs 0.40.0.
Proof:
model.predict_proba(X.loc[[idx]]) # corrected
array([[0.95752656, 0.04247344]])
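Alternatively, if you keep the Explanation built on the whole ord_test_t (as in the question) and just want to select a row by its index label rather than by position, a small sketch assuming the exp, ord_test_t, and waterfall objects from the question:
# map the index label (e.g. 9) to its positional row number,
# since Explanation objects are indexed by position
label = 9
pos = ord_test_t.index.get_loc(label)
waterfall(exp[pos])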
I'm running an LGBM classifier model and I'm able to use lgbm.plot_importance to plot the most important features, but I would prefer having a list of these features instead. Does anybody know how to go about doing this?
The lightgbm.Booster object has a method .feature_importance() which can be used to access feature importances.
That method returns an array with one importance value per feature, and supports two types of importance, based on the value of importance_type:
"gain" = "cumulative gain of all splits using this feature"
"split" = "number of splits this feature was used in"
You can explore this using the following code. I ran this with lightgbm==3.3.0, numpy==1.21.0, pandas==1.2.3, and scikit-learn==0.24.1, using Python 3.8.
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
data = lgb.Dataset(X, label=y)

# train model
bst = lgb.train(
    params={"objective": "binary"},
    train_set=data,
    num_boost_round=10
)

# compute importances
importance_df = (
    pd.DataFrame({
        'feature_name': bst.feature_name(),
        'importance_gain': bst.feature_importance(importance_type='gain'),
        'importance_split': bst.feature_importance(importance_type='split'),
    })
    .sort_values('importance_gain', ascending=False)
    .reset_index(drop=True)
)

print(importance_df)
Here's an example of the output.
  feature_name  importance_gain  importance_split
0    Column_22      1051.204456                 8
1    Column_23       862.363854                10
2    Column_27       262.272097                19
3     Column_7       161.842017                13
4    Column_21        66.431762                24
This is saying that, for example, feature Column_21 was used in more splits than the other top features, but the improvement those splits provided was much less impactful than that of the 8 splits using Column_22.
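Since the question asked for a list of features rather than a plot, note that the ranked names can be pulled straight out of the importance_df built above, for example:
# plain Python list of feature names, most important (by gain) first
top_features = importance_df['feature_name'].tolist()
print(top_features[:5])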
It seems like you are using the sklearn API for LightGBM. This should help.
General idea:
LGBMClassifier.feature_importances_
Particular case:
model_name.feature_importances_
Full code snippet (assuming a pandas dataframe was used for training):
features = train_x.columns
importances = model.feature_importances_
feature_importance = pd.DataFrame({'importance':importances,'features':features}).sort_values('importance', ascending=False).reset_index(drop=True)
feature_importance
You can also plot the importances:
lgb.plot_importance(model_name)
So I was running a Catboost model using Python, which was pretty simple, basically:
from catboost import CatBoostClassifier, Pool, cv

catboost_model = CatBoostClassifier(
    cat_features=["categorical_variable_1", "categorical_variable_2"],
    loss_function="Logloss",
    eval_metric="AUC",
    iterations=200,
)
So I wanted to get the feature importance. With XGBoost Classifier, I could prepare a dataframe with the feature importance doing something like:
from datetime import datetime
import pandas as pd

importances = xgb_model.get_fscore()
feat_list = []
date = datetime.today()
for feature, importance in importances.items():
    feat_list.append([date, feature, importance])
feat_df = pd.DataFrame(feat_list, columns=['date', 'feature', 'importance'])
Now, I wanted to do the same thing with CatBoost features. I started by doing:
catboost_model.get_feature_importance(
    Pool(X_train, y_train, cat_features=["categorical_variable_1", "categorical_variable_2"])
)
But I don't know how to move on from this (which should be very simple, but I'm lost). Can anyone give me a hand?
In short, you can do something like
pd.DataFrame({'feature_importance': model.get_feature_importance(train_pool),
              'feature_names': x_val.columns}).sort_values(by=['feature_importance'],
                                                           ascending=False)
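If you want the same date/feature/importance layout you built for XGBoost in the question, a sketch along the same lines (catboost_model, train_pool, and X_train are assumed to be the fitted model, the Pool it was trained on, and the training dataframe; adjust the names to yours):
from datetime import datetime
import pandas as pd

# hypothetical names: catboost_model, train_pool, X_train come from your training code
feat_df = pd.DataFrame({
    'date': datetime.today(),
    'feature': X_train.columns,
    'importance': catboost_model.get_feature_importance(train_pool),
}).sort_values('importance', ascending=False).reset_index(drop=True)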
You can also make a function like the one below (I found the explanation on Analyseup.com here):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_feature_importance(importance, names, model_type):
    # Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    # Create a DataFrame using a dictionary
    data = {'feature_names': feature_names, 'feature_importance': feature_importance}
    fi_df = pd.DataFrame(data)

    # Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)

    # Define size of bar plot
    plt.figure(figsize=(10, 8))

    # Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])

    # Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
and plot the feature importance from different boosting algorithms:
#plot the xgboost result
plot_feature_importance(xgb_model.feature_importances_,train.columns,'XG BOOST')
#plot the catboost result
plot_feature_importance(cb_model.get_feature_importance(),train.columns,'CATBOOST')
Now you already have a DataFrame:
data = pd.DataFrame({'feature_importance': model.get_feature_importance(train_pool),
                     'feature_names': x_val.columns}).sort_values(by=['feature_importance'],
                                                                  ascending=False)
I found it easier to plot using the inbuilt pandas tool, for instance for the top 20 features:
data[:20].sort_values(by=['feature_importance'], ascending=True).plot.barh(x='feature_names', y='feature_importance')
I'm trying to use statsmodels to run separate logistic regressions for each "group" in a pandas dataframe and save the predicted probabilities for each observation (row). Each "group" represents about 2,500 respondents or observations; I would like to get the predicted probability for each respondent, similar to how with SPSS you can "save" predicted probabilities when running a logistic regression.
I've read what others have attempted, but nothing seems to work. I'm using SPSS to check that the looping operation in Python is working correctly - the predicted probabilities should be the same (SPSS has a split function which makes this really easy).
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit

df = pd.read_csv('test_data.csv')

for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df)
    print(est_result.summary())
    df['pred'] = pred
The model summaries are correct (est_result.summary()) and match SPSS exactly. However, the saved predicted values do not match at all. I cannot seem to understand how to get it to work correctly.
Any advice is appreciated.
I solved it in a really un-pythonic kind of way. I hope someone can improve this code. The probabilities now match exactly what SPSS produces when you split the file by group, and run individual regressions by group.
results = []

for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df_slice)
    results.append(pred)
    # print(est_result.summary())

n = len(df['Brand'].unique())
r = pd.DataFrame(results)                 # put the results into a dataframe
rt = r.T                                  # transpose the dataframe
r_small = rt[rt.columns[-n:]]             # keep only the last n columns, n = number of categories
r_new = r_small.bfill(axis=1).iloc[:, 0]  # merge the n columns and remove the NaNs
r_new                                     # show us
df['predicted'] = r_new                   # combine the r_new series with the original dataframe
df                                        # show us
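A slightly more direct version of the same idea, offered as a sketch under the same column names (Brand, binary, var_1): write each group's predictions straight back into the original frame with .loc, which avoids the transpose/bfill step.
import numpy as np
from statsmodels.formula.api import logit

df['predicted'] = np.nan
for cat in df['Brand'].unique():
    mask = df['Brand'] == cat
    est_result = logit('binary ~ var_1', df[mask]).fit(disp=0)
    # predict only this group's rows and write them back in place;
    # the returned Series aligns with df on the index
    df.loc[mask, 'predicted'] = est_result.predict(df[mask])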
I use scikit-learn's linear regression, and if I change the order of the features, the coefficients are still printed in the same order; hence I would like to know the mapping of the features to the coefficients.
from sklearn import linear_model

# training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print(model_1.coef_)
print(model_2.coef_)
print(model_3.coef_)
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_, model_1_features):
    coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())
@Robin posted a great answer, but I had to make one tweak for it to work the way I wanted, which was to refer to the dimension of the coef_ np.array that I wanted, namely model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0, :], model_1_features):
    coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, making it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the longer input will have the values in its extra index positions cut off. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great, but what personally worked for me was this, as the feature names I needed were the columns of my train_data dataframe:
pd.DataFrame(data=[model_1.coef_], columns=train_data.columns)
Right after training the model, the coefficient values are stored in the variable model.coef_[0]. We can iterate over the column names and store each column name and its coefficient value in a dictionary.
model.fit(X_train, y)

# assuming all the columns except the last one are used in training
columns = data.iloc[:, :-1].columns
coef_dict = {}
for i in range(0, len(columns)):
    coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
    data=model_1.coef_,
    index=model_1.feature_names_in_
)
A minimal reproducible example:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# for repeatability
np.random.seed(0)

# random data
Xy = pd.DataFrame(
    data=np.random.random((10, 3)),
    columns=["x0", "x1", "y"]
)

# separate X and y
X = Xy.drop(columns="y")
y = Xy.y

# initialize estimator
lr = LinearRegression()

# fit to pandas.DataFrame
lr.fit(X, y)

# get coefficients and their respective feature names
coef_series = pd.Series(
    data=lr.coef_,
    index=lr.feature_names_in_
)

print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64