I am doing a multiple regression in Stan.
I want a trace plot of the beta vector parameter for the regressors/design matrix.
When I do the following:
fit = model.sampling(data=data, iter=2000, chains=4)
fig = fit.plot('beta')
I get a pretty horrid image:
I was after something a little more user friendly. I have managed to hack the following which is closer to what I am after.
My hack plugs into the back of pystan as follows.
r = fit.extract() # r for results
from pystan.external.pymc import plots
param = 'beta'
beta = r[param]
name = df.columns.values.tolist()
(rows, cols) = beta.shape
assert(len(df.columns) == cols)
values = {param+'['+str(k+1)+'] '+name[k]:
          beta[:,k] for k in range(cols)}
fig = plots.traceplot(values, values.keys())
for a in fig.axes:
    # shorten the y-labels
    l = a.get_ylabel()
    if l == 'frequency':
        a.set_ylabel('freq')
    if l=='sample value':
        a.set_ylabel('val')
fig.set_size_inches(8, 12)
fig.tight_layout(pad=1)
fig.savefig(g_dir+param+'-trace.png', dpi=125)
plt.close()
My question - surely I have missed something - but is there an easier way to get the kind of output I am after from pystan for a vector parameter?
Discovered that the ArviZ module does this pretty well.
ArviZ can be found here: https://arviz-devs.github.io/arviz/
I also struggled with this and just found a way to extract the parameters for the traceplot (the betas, I already knew).
When you do your fit, you can save it to a dataframe:
fit_df = fit.to_dataframe()
Now you have a new variable, your dataframe. Yes, it took me a while to find that pystan had a straightforward way to save the fit to a dataframe.
With that in hand you can inspect your dataframe. You can see its header by printing the keys:
fit_df.keys()
the output is something like this:
Index([u'chain', u'chain_idx', u'warmup', u'accept_stat__', u'energy__',
u'n_leapfrog__', u'stepsize__', u'treedepth__', u'divergent__',
u'beta[1,1]', ...
u'eta05[892]', u'eta05[893]', u'eta05[894]', u'eta05[895]',
u'eta05[896]', u'eta05[897]', u'eta05[898]', u'eta05[899]',
u'eta05[900]', u'lp__'],
dtype='object', length=9037)
Now you have everything you need! The betas are in columns, and so are the chain ids. That's all you need to plot the betas and the traceplot, so you can manipulate the data in any way you want and customize your figures as you wish. I'll show you an example of how I did it:
chain_idx = fit_df['chain_idx']
beta11 = fit_df['beta[1,1]']
beta12 = fit_df['beta[1,2]']
plt.subplots(figsize=(15,3))
plt.subplot(1,4,1)
sns.kdeplot(beta11)
plt.subplot(1,4,2)
plt.plot(chain_idx, beta11)
plt.subplot(1,4,3)
sns.kdeplot(beta12)
plt.subplot(1,4,4)
plt.plot(chain_idx, beta12)
plt.tight_layout()
plt.show()
The image from the above plot!
I hope it helps (if you still need it) ;)
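One thing to watch with the snippet above: plotting chain_idx against a beta column directly puts the draws from all chains on a single line. Here is a minimal sketch that draws one line per chain instead. It assumes the column layout shown in the header above ('chain', 'chain_idx', 'beta[...]'); the synthetic fit_df and the helper name plot_traces are made up for illustration, so swap in your own fit.to_dataframe() result.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for fit.to_dataframe() (assumption): 4 chains x 500 draws.
rng = np.random.default_rng(0)
n_chains, n_draws = 4, 500
fit_df = pd.DataFrame({
    "chain": np.repeat(np.arange(n_chains), n_draws),
    "chain_idx": np.tile(np.arange(n_draws), n_chains),
    "beta[1,1]": rng.normal(0.5, 0.1, n_chains * n_draws),
    "beta[1,2]": rng.normal(-1.0, 0.2, n_chains * n_draws),
})

def plot_traces(df, prefix="beta["):
    """Draw a histogram panel and a per-chain trace panel for each column starting with `prefix`."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    fig, axes = plt.subplots(len(cols), 2, figsize=(10, 2.5 * len(cols)), squeeze=False)
    for (ax_hist, ax_trace), col in zip(axes, cols):
        ax_hist.hist(df[col], bins=40, density=True)
        ax_hist.set_title(col)
        # One line per chain, so the chains are not joined end to end.
        for chain_id, grp in df.groupby("chain"):
            ax_trace.plot(grp["chain_idx"], grp[col], lw=0.5, label=f"chain {chain_id}")
        ax_trace.set_title(f"{col} trace")
    fig.tight_layout()
    return fig

fig = plot_traces(fit_df)
```

Selecting columns by prefix means the same helper works for any vector or matrix parameter, not only beta.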
I'm working on a project and would like to plot my data in a similar way to this example from a book:
So I would like to create a density histogram for my categorical features (left image) and then add a separate column for each value of another feature (middle and right image).
In my case the feature I want to plot is called [district_code] and I would like to create columns based on a feature called [status_group]
What I've tried so far:
sns.kdeplot(data = raw, x = "district_code"): problem, it is a line plot, not a histogram
sns.kdeplot(data = raw, x = "district_code", col = "status_group"): problem, you can't use the col argument for this plottype
sns.displot(raw, x="district_code", col = 'status_group'): problem, col argument works, but it creates a countplot, not a density plot
I would really appreciate some suggestions about the correct code I could use.
This is just an example for one of my categorical features, but I have many more I would like to plot. Any suggestions on how to turn this into a function where I could run the code for a list of categorical features would be highly appreciated.
UPDATE:
sns.displot(raw, x="source_class", stat = 'density', col = 'status_group', color = 'black'): works but looks a bit awkward for some features.
How could I improve this?
Good:
Not so good:
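For categorical features, one option is to compute the within-group shares yourself and plot those as bars, rather than relying on a density estimate. Below is a minimal sketch under stated assumptions: the synthetic raw frame and the helper name category_shares are made up for illustration, and only the column names district_code, source_class and status_group come from the question.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `raw` (assumption): two categorical features and a group column.
rng = np.random.default_rng(1)
raw = pd.DataFrame({
    "district_code": rng.integers(1, 6, 300),
    "source_class": rng.choice(["groundwater", "surface", "unknown"], 300),
    "status_group": rng.choice(["functional", "non functional"], 300),
})

def category_shares(df, feature, by="status_group"):
    """Share of each category of `feature` within every level of `by` (each column sums to 1)."""
    return pd.crosstab(df[feature], df[by], normalize="columns")

# Run the same computation for a list of categorical features.
features = ["district_code", "source_class"]
shares = {f: category_shares(raw, f) for f in features}
```

Each table can then be drawn with one panel per status_group, for example with shares["district_code"].plot.bar(subplots=True, layout=(1, -1), sharey=True, legend=False). Because every column is normalised separately, groups of very different size stay comparable.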
I am just getting started with Holoviews. My questions are on customizing histograms, but also I am sharing a complete example as it may be helpful for other newbies to look at, since the documentation for Holoviews is very thorough but can be overwhelming.
I have a number of time series in text files loaded as Pandas DataFrames where:
each file is for a specific location
at each location about 10 time series were collected, each with about 15,000 points
I am building a small interactive tool where a Selector can be used to choose the location / DataFrame, and then another Selector to pick 3 of 10 of the time series to be plotted together.
My goal is to allow linked zooms (both x and y scales). The questions and code will focus on this aspect of the tool.
I cannot share the actual data I am using, unfortunately, as it is proprietary, but I have created 3 random walks with specific data ranges that are consistent with the actual data.
## preliminaries ##
import pandas as pd
import numpy as np
import holoviews as hv
from holoviews.util.transform import dim
from holoviews.selection import link_selections
from holoviews import opts
from holoviews.operation.datashader import shade, rasterize
import hvplot.pandas
hv.extension('bokeh', width=100)
## create random walks (one location) ##
data_df = pd.DataFrame()
npoints=15000
np.random.seed(71)
x = np.arange(npoints)
y1 = 1300+2.5*np.random.randn(npoints).cumsum()
y2 = 1500+2*np.random.randn(npoints).cumsum()
y3 = 3+np.random.randn(npoints).cumsum()
data_df.loc[:,'x'] = x
data_df.loc[:,'rand1'] = y1
data_df.loc[:,'rand2'] = y2
data_df.loc[:,'rand3'] = y3
This first block is just to plot the data and show how, by design, one of the random walks has a different range from the other two:
data_df.hvplot(x='x',y=['rand1','rand2','rand3'],value_label='y',width=800,height=400)
As a result, although hvplot subplots work out of the box (for linking), ranges are different so the scaling is not quite there:
data_df.hvplot(x='x',y=['rand1','rand2','rand3'],
value_label='y',subplots=True,width=800,height=200).cols(1)
So, my first attempt was to adapt the Python-based Points example from Linked brushing in the documentation:
colors = hv.Cycle('Category10').values
dims = ['rand1', 'rand2', 'rand3']
layout = hv.Layout([
hv.Points(data_df, dim).opts(color=c)
for c, dim in zip(colors, [['x', d] for d in dims])
])
link_selections(layout).opts(opts.Points(width=1200, height=300)).cols(1)
That is already an amazing result for a 20-minute effort!
However, what I would really like is to plot a curve rather than points, and also see a histogram, so I adapted the comprehension syntax to work with Curve (after reading the documentation pages Applying customization, and Composing elements):
colors = hv.Cycle('Category10').values
dims = ['rand1', 'rand2', 'rand3']
layout = hv.Layout([hv.Curve(data_df,'x',dim).opts(height=300,width=1200,
color=c).hist(dim) for c,
dim in zip(colors,[d for d in dims])])
link_selections(layout).cols(1)
Which is almost exactly what I want. But I still struggle with the different layers of opts syntax.
Question 1: with the comprehension from the last code block, how would I make the histogram share color with the curves?
Now, suppose I want to rasterize the plots (although I do not think it is necessary yet with 15,000 points, as in this case). I tried to adapt the first example with Points:
cmaps = ['Blues', 'Greens', 'Reds']
dims = ['rand1', 'rand2', 'rand3']
layout = hv.Layout([
shade(rasterize(hv.Points(data_df, dims),
cmap=c)).opts(width=1200, height = 400).hist(dims[1])
for c, dims in zip(cmaps, [['x', d] for d in dims])
])
link_selections(layout).cols(1)
This is a decent start, but again I struggle with the options/customization.
Question 2: in the above code block, how would I pass the colormaps (it does not work as it is now), and how do I make the histogram reflect data values as in the previous case (and also have the right colormap)?
Thank you!
Sander answered how to color the histogram, but for the other question about coloring the datashaded plot, Datashader renders your data with a colormap rather than a single color, so the parameter is named cmap rather than color. So you were correct to use cmap in the datashaded case, but (a) cmap is actually a parameter to shade (which does the colormapping of the output of rasterize), and (b) you don't really need shade, as you can let Bokeh do the colormapping in most cases nowadays, in which case cmap is an option rather than an argument. Example:
from bokeh.palettes import Blues, Greens, Reds
cmaps = [Blues[256][200:], Greens[256][200:], Reds[256][200:]]
dims = ['rand1', 'rand2', 'rand3']
layout = hv.Layout([
rasterize(hv.Points(data_df, ds)).opts(cmap=c,width=1200, height = 400).hist(ds[1])
for c, ds in zip(cmaps, [['x', d] for d in dims])
])
link_selections(layout).cols(1)
To answer your first question to make the histogram share the color of the curve, I've added .opts(opts.Histogram(color=c)) to your code.
When you have a layout you can specify the options of an element inside the layout like that.
colors = hv.Cycle('Category10').values
dims = ['rand1', 'rand2', 'rand3']
layout = hv.Layout(
[hv.Curve(data_df,'x',dim)
.opts(height=300,width=600, color=c)
.hist(dim)
.opts(opts.Histogram(color=c))
for c, dim in zip(colors,[d for d in dims])]
)
link_selections(layout).cols(1)
I am currently trying to plot a set of specific features on a SHAP summary plot. However, I am struggling to find the code necessary to do so.
When looking at the source code on Github, the summary_plot function does seem to have a 'features' attribute. However, this does not seem to be the solution to my problem.
Could anybody help me plot a specific set of features, or is this not a viable option in the current code of SHAP?
A possible, albeit hacky, solution could be as follows, for example plotting a summary plot for a single feature in the 5th column
shap.summary_plot(shap_values[:,5:6], X.iloc[:, 5:6])
I rebuild the shap_values array so that the plot includes the features you want, on top of the 20 most important ones, using the code below.
shap_values = explainer.shap_values(samples)[1]
vals = np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(
list(zip(samples.columns, vals)),
columns=["col_name", "feature_importance_vals"],
)
feature_importance.sort_values(
by=["feature_importance_vals"], ascending=False, inplace=True
)
feature_importance['rank'] = feature_importance['feature_importance_vals'].rank(method='max',ascending=False)
missing_features = [
i
for i in columns_to_show
if i not in feature_importance["col_name"][:20].tolist()
]
missing_index = []
for i in missing_features:
    missing_index.append(samples.columns.tolist().index(i))
missing_features_new = []
rename_col = {}
for i in missing_features:
    # take the single matching rank as a scalar before converting to int
    rank = int(feature_importance.loc[feature_importance['col_name'] == i, 'rank'].iloc[0])
    missing_features_new.append('rank:'+str(rank)+' - '+i)
    rename_col[i] = 'rank:'+str(rank)+' - '+i
column_names = feature_importance["col_name"][:20].values.tolist() + missing_features_new
feature_index = feature_importance.index[:20].tolist() + missing_index
shap.summary_plot(
shap_values[:, feature_index].reshape(
samples.shape[0], len(feature_index)
),
samples.rename(columns=rename_col)[column_names],
max_display=len(feature_index),
)
To plot only one feature, look up the position of that feature among the columns of X:
i = X.columns.tolist().index('your_feature_name_here')
shap.summary_plot(shap_values[1][:,i:i+1], X.iloc[:, i:i+1])
To plot a selected set of features:
your_feature_list = ['your_feature_1','your_feature_2','your_feature_3']
your_feature_indices = [X.columns.tolist().index(x) for x in your_feature_list]
shap.summary_plot(shap_values[1][:,your_feature_indices], X.iloc[:, your_feature_indices])
feel free to change "your_feature_indices" to a shorter variable name
change shap_values[1] to shap_values if you are not doing binary classification
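The name-to-index lookup used in these answers can be wrapped in a small helper. This is only a sketch: select_features is a hypothetical name, the data is synthetic, and it assumes shap_values is a 2-D array of shape (n_samples, n_features) whose columns line up with X.columns.

```python
import numpy as np
import pandas as pd

def select_features(shap_values, X, feature_names):
    """Return the SHAP columns and the matching data columns for `feature_names`."""
    idx = X.columns.get_indexer(feature_names)  # -1 marks a name that is not a column
    missing = [n for n, i in zip(feature_names, idx) if i < 0]
    if missing:
        raise KeyError(f"not in X.columns: {missing}")
    return np.asarray(shap_values)[:, idx], X.iloc[:, idx]

# Synthetic stand-ins (assumption): 10 samples, 4 features.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(10, 4)), columns=["age", "bmi", "bp", "glucose"])
shap_values = rng.normal(size=X.shape)

sv_sub, X_sub = select_features(shap_values, X, ["glucose", "age"])
# With shap installed, the subset could then be plotted as:
# shap.summary_plot(sv_sub, X_sub)
```

Failing loudly on unknown names avoids silently plotting the wrong column when a feature name is misspelled.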
I'm trying to plot a probability distribution using a pandas.Series and I'm struggling to set different yerr for each bar. In summary, I'm plotting the following distribution:
It comes from a Series and works fine, except for the yerr: the error bars should not extend above 1 or below 0. So I'd like to set different errors for each bar. I therefore went to the documentation, which is available here and here.
According to them, I have three options for either yerr or xerr:
scalar: Symmetric +/- values for all data points.
shape(N,): Symmetric +/- values for each data point.
shape(2,N): Separate - and + values for each bar. The first row contains the lower errors, the second row contains the upper errors.
The case I need is the last one. In this case, I can use a DataFrame, Series, array-like, dict and str. Thus, I set the arrays for each yerr bar, however it's not working as expected. Just to replicate what's happening, I prepared the following examples:
First I set a pandas.Series:
import pandas as pd
se = pd.Series(data=[0.1,0.2,0.3,0.4,0.4,0.5,0.2,0.1,0.1],
index=list('abcdefghi'))
Then, I'm replicating each case:
This works as expected:
err1 = [0.2]*9
se.plot(kind="bar", width=1.0, yerr=err1)
This works as expected:
err2 = list(err1)  # copy, so that err1 is left unchanged
err2[3] = 0.5
se.plot(kind="bar", width=1.0, yerr=err2)
Now the problem: this doesn't work as expected!
err_up = [0.3]*9
err_low = [0.1]*9
err3 = [err_low, err_up]
se.plot(kind="bar", width=1.0, yerr=err3)
It's not setting different errors for low and up. I found an example here and a similar SO question here; although they use matplotlib instead of pandas, the same approach should work here.
I'd be glad of any solution.
Thank you.
Strangely, plt.bar works as expected:
err_up = [0.3]*9
err_low = [0.1]*9
err3 = [err_low, err_up]
fig, ax = plt.subplots()
ax.bar(se.index, se, width=1.0, yerr=err3)
plt.show()
Output:
A bug/feature/design-decision of pandas maybe?
Based on @Quanghoang's comment, I started to think it was a bug. So I tried changing the shape of yerr and, surprisingly, the following code worked:
err_up = [0.3]*9
err_low = [0.1]*9
err3 = [[err_low, err_up]]
print (err3)
se.plot(kind="bar", width=1.0, yerr=err3)
Observe I included a new axis in err3. Now it's a (1,2,N) array. However, the documentation says it should be (2,N).
In addition, a possible workaround I found is to call ax.set_ylim(0, 1). It doesn't solve the underlying problem, but the graph is drawn correctly.
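Since the original goal was to keep the error bars inside [0, 1], the lower and upper errors can also be clipped per bar before plotting. A minimal sketch, with bounded_yerr as a hypothetical helper name and the same probabilities as the Series above:

```python
import numpy as np

def bounded_yerr(values, err, lower=0.0, upper=1.0):
    """Build a (2, N) array of lower/upper errors so value -/+ error stays within [lower, upper]."""
    values = np.asarray(values, dtype=float)
    err = np.broadcast_to(np.asarray(err, dtype=float), values.shape)
    # Never let an error reach past the bound; never return a negative error.
    err_low = np.clip(np.minimum(err, values - lower), 0.0, None)
    err_up = np.clip(np.minimum(err, upper - values), 0.0, None)
    return np.vstack([err_low, err_up])  # first row: lower errors, second row: upper errors

probs = [0.1, 0.2, 0.3, 0.4, 0.4, 0.5, 0.2, 0.1, 0.1]
yerr = bounded_yerr(probs, 0.2)
```

The result has the (2, N) shape that ax.bar accepts, as in the plt.bar example above. For se.plot, the observation above suggests wrapping it in one more list (yerr=[yerr]), at least with the pandas version discussed here.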
That is a plot i generated using pyplot and (attempted to) adjust the text using the adjustText library which i also found here.
as you can see, it gets pretty crowded in the parts where 0 < x < 0.1. i was thinking that there's still ample space in 0.8 < y < 1.0 such that they could all fit and label the points pretty well.
my attempt was:
plt.plot(df.fpr,df.tpr,marker='.',ls='-')
texts = [plt.text(df.fpr[i],df.tpr[i], str(df.thr1[i])) for i in df.index]
adjust_text(texts,
expand_text=(2,2),
expand_points=(2,2),
expand_objects=(2,2),
force_objects = (2,20),
force_points = (0.1,0.25),
lim=150000,
arrowprops=dict(arrowstyle='-',color='red'),
autoalign='y',
only_move={'points':'y','text':'y'}
)
where my df is a pandas dataframe which can be found here
from what i understood in the docs, i tried varying the bounding boxes and the y-force by making them larger, thinking that it would push the labels further up, but it does not seem to be the case.
I'm the author of adjustText, sorry I just noticed this question. you are having this problem because you have a lot of overlapping texts with exactly the same y-coordinate. It's easy to solve by adding a tiny random shift along the y to the labels (and you do need to increase the force for texts, otherwise along one dimension it works very slowly), like so:
np.random.seed(0)
f, ax = plt.subplots(figsize=(12, 6))
plt.plot(df.fpr,df.tpr,marker='.',ls='-')
texts = [plt.text(df.fpr[i], df.tpr[i]+np.random.random()/100, str(df.thr1[i])) for i in df.index]
plt.margins(y=0.125)
adjust_text(texts,
force_text=(2, 2),
arrowprops=dict(arrowstyle='-',color='red'),
autoalign='y',
only_move={'points':'y','text':'y'},
)
Also notice that I increased the margins along the y axis; it helps a lot with the corners. The result is not quite perfect, since limiting the algorithm to just one axis makes life more difficult... But it's OK-ish already.
Have to mention, size of the figure is very important, I don't know what yours was.
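If you would rather leave labels with unique y-coordinates exactly where they are, the random shift can be limited to the tied values. This is just a sketch of that variation, not part of adjustText itself; jitter_ties is a hypothetical helper and the tpr array is made-up data.

```python
import numpy as np

def jitter_ties(y, scale=0.01, seed=0):
    """Add a small random upward offset only to values that occur more than once in `y`."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    _, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    tied = counts[inverse] > 1  # True where the value is shared with another point
    out = y.copy()
    out[tied] += rng.uniform(0.0, scale, size=tied.sum())
    return out

tpr = np.array([0.2, 0.9, 0.9, 0.9, 1.0, 1.0])
label_y = jitter_ties(tpr)
```

The jittered values would replace df.tpr[i]+np.random.random()/100 as the y-coordinate passed to plt.text, while the curve itself is still plotted from the original data.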