How to plot specific features on SHAP summary plots? - python

I am currently trying to plot a set of specific features on a SHAP summary plot. However, I am struggling to find the code necessary to do so.
When looking at the source code on GitHub, the summary_plot function does seem to have a 'features' parameter. However, this does not seem to be the solution to my problem.
Could anybody help me plot a specific set of features, or is this not a viable option in the current code of SHAP?

A possible, albeit hacky, solution is to slice both the SHAP values and the data down to the columns you want. For example, to draw a summary plot for the single feature at column index 5 (the sixth column):
shap.summary_plot(shap_values[:,5:6], X.iloc[:, 5:6])

I rebuild the shap_values array so that it includes the features I want in the plot, using the code below.
# SHAP values for class 1 of a binary classifier
shap_values = explainer.shap_values(samples)[1]
# Rank the features by mean absolute SHAP value
vals = np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(
    list(zip(samples.columns, vals)),
    columns=["col_name", "feature_importance_vals"],
)
feature_importance.sort_values(
    by=["feature_importance_vals"], ascending=False, inplace=True
)
feature_importance['rank'] = feature_importance['feature_importance_vals'].rank(method='max', ascending=False)
# Requested features that did not make the top 20
missing_features = [
    i
    for i in columns_to_show
    if i not in feature_importance["col_name"][:20].tolist()
]
missing_index = []
for i in missing_features:
    missing_index.append(samples.columns.tolist().index(i))
# Prefix each extra feature with its rank so it is recognisable in the plot
missing_features_new = []
rename_col = {}
for i in missing_features:
    rank = int(feature_importance.loc[feature_importance['col_name'] == i, 'rank'].iloc[0])
    missing_features_new.append('rank:' + str(rank) + ' - ' + i)
    rename_col[i] = 'rank:' + str(rank) + ' - ' + i
column_names = feature_importance["col_name"][:20].values.tolist() + missing_features_new
# The sorted frame keeps the original column positions in its index
feature_index = feature_importance.index[:20].tolist() + missing_index
shap.summary_plot(
    shap_values[:, feature_index].reshape(
        samples.shape[0], len(feature_index)
    ),
    samples.rename(columns=rename_col)[column_names],
    max_display=len(feature_index),
)
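The selection logic in that answer can be condensed. What follows is only a sketch, assuming shap_values is a 2-D array with one column per feature; the helper name top_plus_requested and the toy numbers are made up for illustration and are not part of SHAP.

```python
import numpy as np
import pandas as pd

def top_plus_requested(shap_values, columns, requested, top_n=20):
    """Return the top_n features by mean |SHAP|, followed by any requested
    features that did not make the cut (hypothetical helper)."""
    importance = pd.Series(np.abs(shap_values).mean(axis=0), index=columns)
    ranked = importance.sort_values(ascending=False)
    top = list(ranked.index[:top_n])
    extras = [c for c in requested if c not in top]
    return top + extras

# Toy stand-in for real SHAP values: 2 samples, 4 features
toy_values = np.array([[1.0, -4.0, 2.0, 3.0],
                       [-1.0, 4.0, -2.0, -3.0]])
selected = top_plus_requested(toy_values, ["a", "b", "c", "d"], ["a", "d"], top_n=2)
print(selected)
```

The returned names can then be turned into column positions and used to slice both the SHAP array and the data frame before calling shap.summary_plot.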

To plot only one feature, look up the position of that feature in the list of column names:
i = X.columns.get_loc('your_feature_name_here')
shap.summary_plot(shap_values[1][:, i:i+1], X.iloc[:, i:i+1])
To plot several selected features:
your_feature_list = ['your_feature_1', 'your_feature_2', 'your_feature_3']
your_feature_indices = [X.columns.get_loc(x) for x in your_feature_list]
shap.summary_plot(shap_values[1][:, your_feature_indices], X.iloc[:, your_feature_indices])
Feel free to change "your_feature_indices" to a shorter variable name.
Change shap_values[1] to shap_values if you are not doing binary classification.
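The name-based slicing can be wrapped in a small helper. This is a sketch only: select_features is a hypothetical name, the arrays are toy stand-ins for real SHAP output, and it assumes the column names in X are unique and shap_values is a 2-D array aligned with X's columns.

```python
import numpy as np
import pandas as pd

def select_features(shap_values, X, feature_names):
    """Slice SHAP values and data down to the named columns (hypothetical helper)."""
    idx = [X.columns.get_loc(name) for name in feature_names]
    return shap_values[:, idx], X.iloc[:, idx]

# Toy stand-ins: 3 samples, 4 features
X = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=["a", "b", "c", "d"])
shap_values = np.arange(12.0).reshape(3, 4) * 0.1

sub_values, sub_X = select_features(shap_values, X, ["d", "b"])
print(sub_values.shape, list(sub_X.columns))
# With real data the pair would be passed on, e.g.:
# shap.summary_plot(sub_values, sub_X)
```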

Related

Why does multioutput XGBoost feature importance give different results using plot_importance or estimators_[0].feature_importances_?

I have a multioutput XGBoost model and am trying to plot the important features for each output. There are 23 outputs.
I have tried to do this in two ways:
Important features as a dataframe:
# Get the importances for the first output as a NumPy array. The index can be any number in [0, 22]
features = multioutputregressor.estimators_[0].feature_importances_
# Convert the importances to a dataframe indexed by the corresponding feature names
wo_interaction_terms = pd.DataFrame(features, index=list(X_train.columns),
                                    columns=['importance']).sort_values('importance', ascending=False)
Important features as bar plots in a for loop, to cover all 23 outputs:
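f = 0
fig, ax = plt.subplots(5, 5, figsize=(12, 18))
for i in range(5):
    for j in range(5):
        # The 5x5 grid has 25 axes but there are only 23 estimators
        if f < len(multioutputregressor.estimators_):
            plot_importance(multioutputregressor.estimators_[f], height=0.2, ax=ax[i, j], title=output_cols[f])
        f += 1
fig.tight_layout()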
The first approach gives the following result for output 0:
The plot from the second approach shows a different set of important features, and the values also differ from those in the first image.
f22 is not "Lead", f0 is not "Gaseous CO2", and so on.
Questions:
1- plot_importance uses the F score, but what criterion does .estimators_[0].feature_importances_ use? The numbers are obviously different.
2- How do I add feature names to the plots? I saw other posts, like the one here, but it doesn't work for multioutput XGBoost. What are the options in this case?
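On the second question, one option is to translate the default f0, f1, ... labels back to column names before plotting. This is only a sketch: XGBoost falls back to those labels when the booster was trained without feature names, so the number after "f" is assumed here to be the column position in X_train. The helper rename_default_features and the sample scores are made up for illustration. On the first question, as far as I know plot_importance defaults to importance_type="weight" (split counts), while feature_importances_ on the scikit-learn wrapper follows the estimator's importance_type, which is gain-based by default for tree boosters, so different numbers would be expected.

```python
def rename_default_features(scores, columns):
    """Map XGBoost-style default keys ('f0', 'f1', ...) to real column names.

    Keys that do not match the default pattern are left unchanged.
    """
    names = {f"f{i}": col for i, col in enumerate(columns)}
    return {names.get(key, key): value for key, value in scores.items()}

# Hypothetical scores, shaped like the dict returned by Booster.get_score()
scores = {"f0": 12.0, "f2": 3.0}
columns = ["Gaseous CO2", "Lead", "Zinc"]

renamed = rename_default_features(scores, columns)
print(renamed)
```

With a fitted model, the scores dict could come from estimators_[f].get_booster().get_score(importance_type="weight"), and the renamed dict could then be passed to plot_importance, which also accepts a dict of scores.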

How do I color clusters after k-means and TSNE in either seaborn or matplotlib?

I have a dataframe that looks something like this:
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[-true_k:, 0]
df["comp-2"] = transformed_centroids[-true_k:, 1]
The 'y' are the k-means labels I want to color by, and "comp-1" and "comp-2" are the results from the TSNE model. When I try to plot like this:
sns.scatterplot(x=transformed_centroids[:-true_k, 0], y=transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(x=df['comp-1'], y=df['comp-2'], marker='o', hue=df['y'])
plt.show()
It gives me this error:
ValueError: Length of values (2) does not match length of index (35104) (from this line: df["comp-1"] = transformed_centroids[-true_k:, 0])
This happens even if I do this:
sns.scatterplot(x=transformed_centroids[:-true_k, 0], y=transformed_centroids[:-true_k, 1], marker='x')
sns.scatterplot(x=df['comp-1'], y=df['comp-2'], marker='o', hue=df.y.astype('category').cat.codes)
plt.show()
I've tried several other pieces of code scattered around random tutorials and here, but I haven't found a solution. I feel silly having successfully completed the clustering but failing on the colors.
EDIT: I realized I was using the wrong plot points. The updated code and error are below:
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:, 0]
df["comp-2"] = transformed_centroids[:, 1]
ValueError: Length of values (35106) does not match length of index (35104)
I'm not sure where the two dropped data-points are being... dropped.
EDIT2: Here is the TSNE code:
centroids = model.cluster_centers_
tweets_df2['labels'] = model.labels_
everything = np.concatenate((X.todense(), centroids))
tsne_init = 'pca' # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
model2 = TSNE(n_components=2, random_state=0, init=tsne_init, perplexity=tsne_perplexity,
              early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)
transformed_centroids = model2.fit_transform(everything)
df = pd.DataFrame()
I took this code from another Stack Overflow post and fit it to my data, so I can't explain it 100%. I just know I needed to use TSNE to make my data points plottable in 2D, since the data was words vectorized using TF-IDF.
With help from @tdy, I realized that one of the solutions I tried a little while ago was the one I needed. My main problem was in my edit 2: I wasn't graphing the right set of data. I changed the df to this:
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:-2, 0]
df["comp-2"] = transformed_centroids[:-2, 1]
Of course, for my 2-cluster code this is the same as:
df["y"] = model.labels_
df["comp-1"] = transformed_centroids[:-true_k, 0]
df["comp-2"] = transformed_centroids[:-true_k, 1]
where true_k is the variable holding the number of k-means clusters. I had this solution earlier but changed it because I thought getting rid of true_k would solve my 2-variable problem, and I never reverted it. I just needed to use the right transformed_centroids[] slice, and everything should run smoothly in 7 minutes when it's done melting my CPU... :)
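The length mismatch comes from the centroids being appended to the data before the t-SNE step, so the embedding has true_k more rows than there are labels. A minimal sketch of the split, with random numbers standing in for the real t-SNE output and labels (the sizes and the names points_2d and centroids_2d are made up):

```python
import numpy as np

n_points, true_k = 6, 2
rng = np.random.default_rng(0)

# Stand-in for model2.fit_transform(everything): data rows first, centroid rows last
embedded = rng.normal(size=(n_points + true_k, 2))
# Stand-in for model.labels_: one label per data row
labels = rng.integers(0, true_k, size=n_points)

points_2d = embedded[:-true_k]     # everything except the last true_k rows
centroids_2d = embedded[-true_k:]  # only the last true_k rows

print(points_2d.shape, centroids_2d.shape, labels.shape)
```

Only points_2d has the same length as the labels, so it is the slice that belongs in the dataframe next to "y"; centroids_2d can be drawn as a separate layer with its own marker.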

how to replicate plot: density bar plot in Python

I'm working on a project and would like to plot my data in a similar way to this example from a book:
So I would like to create a density histogram for my categorical features (left image) and then add a separate column for each value of another feature (middle and right image).
In my case the feature I want to plot is called [district_code] and I would like to create columns based on a feature called [status_group]
What I've tried so far:
sns.kdeplot(data = raw, x = "district_code"): problem, it is a line plot, not a histogram
sns.kdeplot(data = raw, x = "district_code", col = "status_group"): problem, you can't use the col argument for this plot type
sns.displot(raw, x="district_code", col = 'status_group'): problem, the col argument works, but it creates a count plot, not a density plot
I would really appreciate some suggestions about the correct code I could use.
This is just an example for one of my categorical features, but I have many more I would like to plot. Any suggestions on how to turn this into a function where I could run the code for a list of categorical features would be highly appreciated.
UPDATE:
sns.displot(raw, x="source_class", stat = 'density', col = 'status_group', color = 'black'): works, but looks a bit awkward for some features.
How could I improve this?
Good:
Not so good:
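For the original request (a per-group density for a categorical feature, repeated over a list of features), one possible approach is to compute the proportions with pandas first and plot them as bars afterwards. This is only a sketch: category_proportions is a hypothetical helper and the tiny frame below is made-up data that borrows the column names from the question.

```python
import pandas as pd

def category_proportions(df, feature, group):
    """Share of each `feature` value within every `group` value.

    Rows are the feature's categories, columns are the groups,
    and every column sums to 1.
    """
    return pd.crosstab(df[feature], df[group], normalize="columns")

# Made-up data using the column names from the question
raw = pd.DataFrame({
    "district_code": [1, 1, 2, 3, 1, 2],
    "status_group": ["functional", "functional", "functional",
                     "non functional", "non functional", "non functional"],
})

# Loop over as many categorical features as needed
tables = {feature: category_proportions(raw, feature, "status_group")
          for feature in ["district_code"]}
props = tables["district_code"]
print(props)
```

Each table could then be drawn with something like props.plot(kind="bar", subplots=True, layout=(1, -1), sharey=True), which gives one panel per status_group value.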

ggplot summarise mean value of categorical variable on y axis

I am trying to replicate a Python plot in R that I found in this Kaggle notebook: Titanic Data Science Solutions
This is the Python code to generate the plot, the dataset used can be found here:
import seaborn as sns
...
grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
Here is the resulting plot.
The survival column takes values of 0 and 1 (survive or not survive) and the y-axis is displaying the mean per pclass. When searching for a way to calculate the mean using ggplot2, I usually find the stat_summary() function. The best I could do was this:
library(dplyr)
library(ggplot2)
...
train_df %>%
  ggplot(aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
  stat_summary(fun = mean, geom = "line") +
  facet_grid(Embarked ~ .)
The output can be found here.
There are some issues:
There seems to be an empty facet, maybe from NA's in Embarked?
The points don't align with the line
The lines are different than those in the Python plot
I think I also haven't fully grasped the layering concept of ggplot. I would like to separate the geom = "line" in the stat_summary() function and rather add it as a + geom_line().
There is actually an empty level (i.e. "") in train_df$Embarked. You can filter that out before plotting.
train_df <- read.csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
train_df <- subset(train_df, Embarked != "")
ggplot(train_df, aes(x = factor(Pclass), y = Survived, group = Sex, colour = Sex)) +
  stat_summary(fun.data = 'mean_cl_boot') +
  geom_line(stat = 'summary', fun = mean) +
  facet_grid(Embarked ~ .)
You can replicate the Python plot by drawing confidence intervals using stat_summary. Although your lines with stat_summary were great, I've rewritten that part as a geom_line call, as you asked.
Note that your ggplot code doesn't draw any points, so I can't answer that part, but probably you were drawing the raw values which are just many 0s and 1s.

a nice pystan trace plot for a stan vector parameter

I am doing a multiple regression in Stan.
I want a trace plot of the beta vector parameter for the regressors/design matrix.
When I do the following:
fit = model.sampling(data=data, iter=2000, chains=4)
fig = fit.plot('beta')
I get a pretty horrid image:
I was after something a little more user friendly. I have managed to hack the following which is closer to what I am after.
My hack plugs into the back of pystan as follows.
r = fit.extract() # r for results
from pystan.external.pymc import plots
param = 'beta'
beta = r[param]
name = df.columns.values.tolist()
(rows, cols) = beta.shape
assert(len(df.columns) == cols)
values = {param+'['+str(k+1)+'] '+name[k]:
          beta[:,k] for k in range(cols)}
fig = plots.traceplot(values, values.keys())
for a in fig.axes:
    # shorten the y-labels
    l = a.get_ylabel()
    if l == 'frequency':
        a.set_ylabel('freq')
    if l == 'sample value':
        a.set_ylabel('val')
fig.set_size_inches(8, 12)
fig.tight_layout(pad=1)
fig.savefig(g_dir+param+'-trace.png', dpi=125)
plt.close()
My question - surely I have missed something - but is there an easier way to get the kind of output I am after from pystan for a vector parameter?
Discovered that the ArviZ module does this pretty well.
ArviZ can be found here: https://arviz-devs.github.io/arviz/
I also struggled with this and just found a way to extract the parameters for the traceplot (the betas, I already knew).
When you do your fit, you can save it to a dataframe:
fit_df = fit.to_dataframe()
Now you have a new variable, your dataframe. Yes, it took me a while to find that pystan had a straightforward way to save the fit to a dataframe.
With that at hand you can check your dataframe. You can see its header by printing the keys:
fit_df.keys()
the output is something like this:
Index([u'chain', u'chain_idx', u'warmup', u'accept_stat__', u'energy__',
u'n_leapfrog__', u'stepsize__', u'treedepth__', u'divergent__',
u'beta[1,1]', ...
u'eta05[892]', u'eta05[893]', u'eta05[894]', u'eta05[895]',
u'eta05[896]', u'eta05[897]', u'eta05[898]', u'eta05[899]',
u'eta05[900]', u'lp__'],
dtype='object', length=9037)
Now you have everything you need! The betas are in columns, as are the chain ids. That's all you need to plot the betas and the traceplot. You can therefore manipulate the data in any way you want and customize your figures as you wish. I'll show you an example of how I did it:
chain_idx = fit_df['chain_idx']
beta11 = fit_df['beta[1,1]']
beta12 = fit_df['beta[1,2]']
plt.figure(figsize=(15,3))
plt.subplot(1,4,1)
sns.kdeplot(beta11)
plt.subplot(1,4,2)
plt.plot(chain_idx, beta11)
plt.subplot(1,4,3)
sns.kdeplot(beta12)
plt.subplot(1,4,4)
plt.plot(chain_idx, beta12)
plt.tight_layout()
plt.show()
The image from the above plot!
I hope it helps (if you still need it) ;)
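Since the dataframe carries both a 'chain' and a 'chain_idx' column, one further option is to pivot it so that each chain becomes its own column, which lets the trace be drawn with one line per chain. This is a sketch under assumptions: the column names are taken from the printed header above, per_chain_traces is a hypothetical helper, and the small frame below is synthetic rather than real sampler output.

```python
import pandas as pd

def per_chain_traces(fit_df, param):
    """One row per draw index, one column per chain, values of `param`."""
    return fit_df.pivot(index="chain_idx", columns="chain", values=param)

# Synthetic stand-in for fit.to_dataframe(): 2 chains with 3 draws each
fit_df = pd.DataFrame({
    "chain": [0, 0, 0, 1, 1, 1],
    "chain_idx": [0, 1, 2, 0, 1, 2],
    "beta[1,1]": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

traces = per_chain_traces(fit_df, "beta[1,1]")
print(traces)
```

Calling traces.plot() would then draw each chain as a separate line, and the same helper can be looped over every beta column of interest.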
