I am trying to create a chart in Altair with dropdowns. Here's the code
import altair as alt
import pandas as pd

df = pd.DataFrame([["Merc", "US", 500, "Car_A"], ["BMW", "US", 55, "Car_B"],
                   ["BMW", "US", 40, "Car_C"], ["Merc", "China", 650, "Car_D"],
                   ["BMW", "US", 80, "Car_E"], ["Merc", "China", 850, "Car_F"]],
                  columns=list("ABCD"))
position_dropdown_Type = alt.binding_select(name = "Type:" , options=[None] + list(df["B"].unique()), labels = ['All'] + list(df["B"].unique()))
position_selection_Type = alt.selection_single(fields=['B'], bind=position_dropdown_Type)
position_dropdown_1_Car = alt.binding_select(name = "Company:", options=[None] + list(df["A"].unique()), labels = ['All'] + list(df["A"].unique()))
position_selection_1_Car = alt.selection_single(fields=['A'], bind=position_dropdown_1_Car, name = "__")
#interval = alt.selection_multi(fields=['GICS_SUB_IND'], bind='legend')
# The interval selection above is commented out, so it is not added to the chart
Car_bar = alt.Chart(df).mark_bar().encode(
    color = 'A',
    y = alt.Y('C', scale=alt.Scale(domain=[0, 1.2*df["C"].max()]),
              title = 'Range'),
    x = alt.X('D:O', sort = alt.EncodingSortField(field="C", order='ascending', op='max'), title = 'Care_Name')
).add_selection(position_selection_Type).transform_filter(position_selection_Type).add_selection(position_selection_1_Car).transform_filter(position_selection_1_Car)
(Car_bar).properties(width = 700)
The code works and produces a graph like this when all values are selected.
However, when I make a selection, the remaining bars expand to fill the entire chart width, as seen below.
Defining the size inside mark_bar(size = 10) is not an option, as the code will be accessing different datasets with a wide range of sample sizes. Also, since the dropdown list will be quite long, selecting from the legend is not ideal either.
Is there a way to keep the bar width the same when a selection is made from the dropdown?
Edit: removing the properties option at the end does not solve the issue either.
It seems the issue is with the properties(width = 700) setting. It forces the bars to be wide enough to fill the requested chart width. If you remove it, the bar width adapts accordingly.
Edit: Output plot
From what I can see, the boxplot() method expects a sequence of raw values (numbers) as input, from which it then computes percentiles to draw the boxplot(s).
I would like to have a method by which I could pass in the percentiles and get the corresponding boxplot.
For example:
Assume that I have run several benchmarks and for each benchmark I've measured latencies ( floating point values ). Now additionally, I have precomputed the percentiles for these values.
Hence for each benchmark, I have the 25th, 50th, 75th percentile along with the min and max.
Now given these data, I would like to draw the box plots for the benchmarks.
As of 2020, there is a better method than the one in the accepted answer.
The matplotlib.axes.Axes class provides a bxp method, which can be used to draw the boxes and whiskers based on the percentile values. Raw data is only needed for the outliers, and that is optional.
Example:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
boxes = [
    {
        'label' : "Male height",
        'whislo': 162.6,    # Bottom whisker position
        'q1'    : 170.2,    # First quartile (25th percentile)
        'med'   : 175.7,    # Median (50th percentile)
        'q3'    : 180.4,    # Third quartile (75th percentile)
        'whishi': 187.8,    # Top whisker position
        'fliers': []        # Outliers
    }
]
ax.bxp(boxes, showfliers=False)
ax.set_ylabel("cm")
plt.savefig("boxplot.png")
plt.close()
This produces the following image:
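For the benchmark case in the question, the precomputed values can be turned into the dictionaries that bxp expects. The following is only a minimal sketch: it assumes each benchmark is stored as a (min, p25, p50, p75, max) tuple, the helper names percentiles_to_bxp_stats and plot_benchmarks are hypothetical, and the latencies are made-up numbers. The whiskers are placed at the min and max rather than at a 1.5*IQR fence.

```python
def percentiles_to_bxp_stats(benchmarks):
    """Convert {name: (min, p25, p50, p75, max)} into a list of bxp dicts."""
    stats = []
    for name, (lo, q1, med, q3, hi) in benchmarks.items():
        if not lo <= q1 <= med <= q3 <= hi:
            raise ValueError(f"percentiles for {name!r} are not in ascending order")
        stats.append({
            'label': name,
            'whislo': lo,   # bottom whisker at the minimum
            'q1': q1,
            'med': med,
            'q3': q3,
            'whishi': hi,   # top whisker at the maximum
            'fliers': [],   # no raw data, so no outliers
        })
    return stats


def plot_benchmarks(benchmarks, path="benchmarks.png"):
    # matplotlib is imported here so the conversion above has no dependencies
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.bxp(percentiles_to_bxp_stats(benchmarks), showfliers=False)
    ax.set_ylabel("latency (ms)")
    fig.savefig(path)
    plt.close(fig)


# Hypothetical latencies in milliseconds, not measured data
latencies = {
    "bench_a": (1.2, 2.0, 2.6, 3.1, 5.4),
    "bench_b": (0.8, 1.5, 1.9, 2.8, 4.0),
}
stats = percentiles_to_bxp_stats(latencies)
```

Calling plot_benchmarks(latencies) would then draw one box per benchmark in the order the dictionary was filled.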
To draw the box plot using just the percentile values and the outliers ( if any ) I made a customized_box_plot function that basically modifies attributes in a basic box plot ( generated from a tiny sample data ) to make it fit according to your percentile values.
The customized_box_plot function
def customized_box_plot(percentiles, axes, redraw = True, *args, **kwargs):
    """
    Generates a customized boxplot based on the given percentile values
    """
    n_box = len(percentiles)
    box_plot = axes.boxplot([[-9, -4, 2, 4, 9],]*n_box, *args, **kwargs)
    # Creates len(percentiles) no of box plots
    min_y, max_y = float('inf'), -float('inf')
    for box_no, (q1_start,
                 q2_start,
                 q3_start,
                 q4_start,
                 q4_end,
                 fliers_xy) in enumerate(percentiles):
        # Lower cap
        box_plot['caps'][2*box_no].set_ydata([q1_start, q1_start])
        # xdata is determined by the width of the box plot
        # Lower whiskers
        box_plot['whiskers'][2*box_no].set_ydata([q1_start, q2_start])
        # Higher cap
        box_plot['caps'][2*box_no + 1].set_ydata([q4_end, q4_end])
        # Higher whiskers
        box_plot['whiskers'][2*box_no + 1].set_ydata([q4_start, q4_end])
        # Box
        box_plot['boxes'][box_no].set_ydata([q2_start,
                                             q2_start,
                                             q4_start,
                                             q4_start,
                                             q2_start])
        # Median
        box_plot['medians'][box_no].set_ydata([q3_start, q3_start])
        # Outliers
        if fliers_xy is not None and len(fliers_xy[0]) != 0:
            # If outliers exist
            box_plot['fliers'][box_no].set(xdata = fliers_xy[0],
                                           ydata = fliers_xy[1])
            min_y = min(q1_start, min_y, fliers_xy[1].min())
            max_y = max(q4_end, max_y, fliers_xy[1].max())
        else:
            min_y = min(q1_start, min_y)
            max_y = max(q4_end, max_y)
        # The y axis is rescaled to fit the new box plot completely with 10%
        # of the maximum value at both ends
        axes.set_ylim([min_y*1.1, max_y*1.1])
    # If redraw is set to true, the canvas is updated.
    if redraw:
        axes.figure.canvas.draw()
    return box_plot
USAGE
Using inverse logic ( code at the very end ) I extracted the percentile values from this example
>>> percentiles
(-1.0597368367634488, 0.3977683984966961, 1.0298955252405229, 1.6693981537742526, 3.4951447843464449)
(-0.90494930553559483, 0.36916539612108634, 1.0303658700697103, 1.6874542731392828, 3.4951447843464449)
(0.13744105279440233, 1.3300645202649739, 2.6131540656339483, 4.8763411136047647, 9.5751914834437937)
(0.22786243898199182, 1.4120860286080519, 2.637650402506837, 4.9067126578493259, 9.4660357513550899)
(0.0064696168078617741, 0.30586770128093388, 0.70774153557312702, 1.5241965711101928, 3.3092932063051976)
(0.007009744579241136, 0.28627373934008982, 0.66039691869500572, 1.4772725266672091, 3.221716765477217)
(-2.2621660374110544, 5.1901313713883352, 7.7178532139979357, 11.277744848353247, 20.155971739152388)
(-2.2621660374110544, 5.1884411864079532, 7.3357079047721054, 10.792299385806913, 18.842012119715388)
(2.5417888074435702, 5.885996170695587, 7.7271286220368598, 8.9207423361593179, 10.846938621419374)
(2.5971767318505856, 5.753551925927133, 7.6569980004033464, 8.8161056254143233, 10.846938621419374)
Note that to keep this short I haven't shown the outliers vectors which will be the 6th element of each of the percentile array.
Also note that all usual additional kwargs / args can be used since they are simply passed to the boxplot method inside it :
>>> fig, ax = plt.subplots()
>>> b = customized_box_plot(percentiles, ax, redraw=True, notch=0, sym='+', vert=1, whis=1.5)
>>> plt.show()
EXPLANATION
The boxplot method returns a dictionary mapping the components of the boxplot to the individual matplotlib.lines.Line2D instances that were created.
Quoting from the matplotlib.pyplot.boxplot documentation :
That dictionary has the following keys (assuming vertical boxplots):
boxes: the main body of the boxplot showing the quartiles and the median’s confidence intervals if enabled.
medians: horizontal lines at the median of each box.
whiskers: the vertical lines extending to the most extreme, non-outlier data points.
caps: the horizontal lines at the ends of the whiskers.
fliers: points representing data that extend beyond the whiskers (outliers).
means: points or lines representing the means.
For example observe the boxplot of a tiny sample data of [-9, -4, 2, 4, 9]
>>> b = ax.boxplot([[-9, -4, 2, 4, 9],])
>>> b
{'boxes': [<matplotlib.lines.Line2D at 0x7fe1f5b21350>],
'caps': [<matplotlib.lines.Line2D at 0x7fe1f54d4e50>,
<matplotlib.lines.Line2D at 0x7fe1f54d0e50>],
'fliers': [<matplotlib.lines.Line2D at 0x7fe1f5b317d0>],
'means': [],
'medians': [<matplotlib.lines.Line2D at 0x7fe1f63549d0>],
'whiskers': [<matplotlib.lines.Line2D at 0x7fe1f5b22e10>,
<matplotlib.lines.Line2D at 0x7fe20c54a510>]}
>>> plt.show()
The matplotlib.lines.Line2D objects have two methods that I'll be using in my function extensively. set_xdata ( or set_ydata ) and get_xdata ( or get_ydata ).
Using these methods we can alter the position of the constituent lines of the base box plot to conform to your percentile values ( which is what the customized_box_plot function does ). After altering the constituent lines' position, you can redraw the canvas using figure.canvas.draw()
Summarizing the mappings from percentile to the coordinates of the various Line2D objects.
The Y Coordinates :
The max ( q4_end - end of 4th quartile ) corresponds to the top most cap Line2D object.
The min ( q1_start - start of the 1st quartile ) corresponds to the lowermost cap Line2D object.
The median corresponds to the ( q3_start ) median Line2D object.
The 2 whiskers lie between the ends of the boxes and extreme caps ( q1_start and q2_start - lower whisker; q4_start and q4_end - upper whisker )
The box is actually an interesting n shaped line bounded by a cap at the lower portion. The extremes of the n shaped line correspond to the q2_start and the q4_start.
The X Coordinates :
The Central x coordinates ( for multiple box plots are usually 1, 2, 3... )
The library automatically calculates the bounding x coordinates based on the width specified.
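The mapping summarized above can be written down without any plotting. A small sketch follows; the function name and the dictionary keys are illustrative only and not part of matplotlib. It returns the y data that each Line2D of one box receives in customized_box_plot.

```python
def line_ydata_from_percentiles(q1_start, q2_start, q3_start, q4_start, q4_end):
    """Return the y coordinates for each line of a single box plot."""
    return {
        # Caps are horizontal, so both end points share one y value
        'lower_cap': [q1_start, q1_start],
        'upper_cap': [q4_end, q4_end],
        # Whiskers run from a cap to the nearest edge of the box
        'lower_whisker': [q1_start, q2_start],
        'upper_whisker': [q4_start, q4_end],
        # The box is a closed path: bottom edge, top edge, back to the start
        'box': [q2_start, q2_start, q4_start, q4_start, q2_start],
        # The median is a horizontal line inside the box
        'median': [q3_start, q3_start],
    }


# First row of the percentiles listed in the USAGE section, rounded
ydata = line_ydata_from_percentiles(-1.06, 0.40, 1.03, 1.67, 3.50)
```

Each list can be passed to set_ydata on the corresponding entry of the dictionary returned by boxplot.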
INVERSE FUNCTION TO RETRIEVE THE PERCENTILES FROM THE boxplot DICT:
def get_percentiles_from_box_plots(bp):
    percentiles = []
    for i in range(len(bp['boxes'])):
        percentiles.append((bp['caps'][2*i].get_ydata()[0],
                            bp['boxes'][i].get_ydata()[0],
                            bp['medians'][i].get_ydata()[0],
                            bp['boxes'][i].get_ydata()[2],
                            bp['caps'][2*i + 1].get_ydata()[0],
                            (bp['fliers'][i].get_xdata(),
                             bp['fliers'][i].get_ydata())))
    return percentiles
NOTE:
The reason why I did not make a completely custom boxplot method is because, there are many features offered by the inbuilt box plot that cannot be fully reproduced.
Also excuse me if I may have unnecessarily explained something that may have been too obvious.
Here is an updated version of this useful routine. Setting the vertices directly appears to work for both filled boxes (patch_artist=True) and unfilled ones.
def customized_box_plot(percentiles, axes, redraw = True, *args, **kwargs):
    """
    Generates a customized boxplot based on the given percentile values
    """
    n_box = len(percentiles)
    box_plot = axes.boxplot([[-9, -4, 2, 4, 9],]*n_box, *args, **kwargs)
    # Creates len(percentiles) no of box plots
    min_y, max_y = float('inf'), -float('inf')
    for box_no, pdata in enumerate(percentiles):
        if len(pdata) == 6:
            (q1_start, q2_start, q3_start, q4_start, q4_end, fliers_xy) = pdata
        elif len(pdata) == 5:
            (q1_start, q2_start, q3_start, q4_start, q4_end) = pdata
            fliers_xy = None
        else:
            raise ValueError("Percentile arrays for customized_box_plot must have either 5 or 6 values")
        # Lower cap
        box_plot['caps'][2*box_no].set_ydata([q1_start, q1_start])
        # xdata is determined by the width of the box plot
        # Lower whiskers
        box_plot['whiskers'][2*box_no].set_ydata([q1_start, q2_start])
        # Higher cap
        box_plot['caps'][2*box_no + 1].set_ydata([q4_end, q4_end])
        # Higher whiskers
        box_plot['whiskers'][2*box_no + 1].set_ydata([q4_start, q4_end])
        # Box
        path = box_plot['boxes'][box_no].get_path()
        path.vertices[0][1] = q2_start
        path.vertices[1][1] = q2_start
        path.vertices[2][1] = q4_start
        path.vertices[3][1] = q4_start
        path.vertices[4][1] = q2_start
        # Median
        box_plot['medians'][box_no].set_ydata([q3_start, q3_start])
        # Outliers
        if fliers_xy is not None and len(fliers_xy[0]) != 0:
            # If outliers exist
            box_plot['fliers'][box_no].set(xdata = fliers_xy[0],
                                           ydata = fliers_xy[1])
            min_y = min(q1_start, min_y, fliers_xy[1].min())
            max_y = max(q4_end, max_y, fliers_xy[1].max())
        else:
            min_y = min(q1_start, min_y)
            max_y = max(q4_end, max_y)
        # The y axis is rescaled to fit the new box plot completely with 10%
        # of the maximum value at both ends
        axes.set_ylim([min_y*1.1, max_y*1.1])
    # If redraw is set to true, the canvas is updated.
    if redraw:
        axes.figure.canvas.draw()
    return box_plot
Here is a bottom-up approach where the box plot is built up using matplotlib's vlines, Rectangle, and normal plot functions:
def boxplot(df, ax=None, box_width=0.2, whisker_size=20, mean_size=10, median_size = 10 , line_width=1.5, xoffset=0,
            color=0):
    """Plots a boxplot from existing percentiles.

    Parameters
    ----------
    df: pandas DataFrame
    ax: matplotlib Axes
        an existing axes to plot on, if any
    box_width: float
    whisker_size: float
        size of the bar at the end of each whisker
    mean_size: float
        size of the mean symbol
    color: int or rgb(list)
        If int, that entry of the property cycler is taken. Example of rgb: [1,0,0] (red)

    Returns
    -------
    f, a, boxes, vlines, whisker_tips, mean, median
    """
    if type(color) == int:
        color = plt.rcParams['axes.prop_cycle'].by_key()['color'][color]
    if ax:
        a = ax
        f = a.get_figure()
    else:
        f, a = plt.subplots()
    boxes = []
    vlines = []
    xn = []
    for row in df.iterrows():
        x = row[0] + xoffset
        xn.append(x)
        # box
        y = row[1][25]
        height = row[1][75] - row[1][25]
        box = plt.Rectangle((x - box_width / 2, y), box_width, height)
        a.add_patch(box)
        boxes.append(box)
        # whiskers
        y = (row[1][95] + row[1][5]) / 2
        vl = a.vlines(x, row[1][5], row[1][95])
        vlines.append(vl)
    for b in boxes:
        b.set_linewidth(line_width)
        b.set_facecolor([1, 1, 1, 1])
        b.set_edgecolor(color)
        b.set_zorder(2)
    for vl in vlines:
        vl.set_color(color)
        vl.set_linewidth(line_width)
        vl.set_zorder(1)
    whisker_tips = []
    if whisker_size:
        g, = a.plot(xn, df[5], ls='')
        whisker_tips.append(g)
        g, = a.plot(xn, df[95], ls='')
        whisker_tips.append(g)
    for wt in whisker_tips:
        wt.set_markeredgewidth(line_width)
        wt.set_color(color)
        wt.set_markersize(whisker_size)
        wt.set_marker('_')
    mean = None
    if mean_size:
        g, = a.plot(xn, df['mean'], ls='')
        g.set_marker('o')
        g.set_markersize(mean_size)
        g.set_zorder(20)
        g.set_markerfacecolor('None')
        g.set_markeredgewidth(line_width)
        g.set_markeredgecolor(color)
        mean = g
    median = None
    if median_size:
        g, = a.plot(xn, df['median'], ls='')
        g.set_marker('_')
        g.set_markersize(median_size)
        g.set_zorder(20)
        g.set_markeredgewidth(line_width)
        g.set_markeredgecolor(color)
        median = g
    a.set_ylim(np.nanmin(df), np.nanmax(df))
    return f, a, boxes, vlines, whisker_tips, mean, median
This is how it looks in action:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
nopts = 12
df = pd.DataFrame()
df['mean'] = np.random.random(nopts) + 7
df['median'] = np.random.random(nopts) + 7
df[5] = np.random.random(nopts) + 4
df[25] = np.random.random(nopts) + 6
df[75] = np.random.random(nopts) + 8
df[95] = np.random.random(nopts) + 10
out = boxplot(df)
With Plotly, I can easily plot a single line and fill the area between the line and y == 0:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(
x=[1, 2, 3, 4],
y=[-2, -1.5, 1, 2.5],
fill='tozeroy',
mode='lines',
))
fig.show()
How can I split the filled area in two? In particular, filling with red where y < 0 and with green where y > 0.
I would like to maintain the line as continuous. That means, I am not interested in just drawing two separate filled polygons.
Note that the line does not necessarily have values at y == 0.
I was in need of such a graph so I wrote a function.
import numpy as np
import plotly.graph_objects as go
from collections import Counter

def resid_fig(resid, tickvalues):
    """
    resid: The y-axis points to be plotted. x-axis is assumed to be linearly increasing
    tickvalues: The values you want to be displayed as ticklabels on x-axis. Works for hoverlabels too. Has to be
    same length as `resid`. (This is necessary to ignore 'gaps' in graph where two polygons meet.)
    """
    #Adjusting array with paddings to connect polygons at zero line on x-axis
    index_array = []
    start_digit = 0
    split_array = np.split(resid,np.where(np.abs(np.diff(np.sign(resid)))==2)[0]+1)
    split_array = [np.append(x,0) for x in split_array]
    split_array = [np.insert(x,0,0) for x in split_array]
    split_array[0] = np.delete(split_array[0],0)
    split_array[-1] = np.delete(split_array[-1],-1)
    for x in split_array:
        index_array.append(np.arange(start_digit,start_digit+len(x)))
        start_digit += len(x)-1
    #Making an array for ticklabels
    flat = []
    for x in index_array:
        for y in x:
            flat.append(y)
    flat_counter = Counter(flat)
    none_indices = np.where([(flat_counter[x]>1) for x in flat_counter])[0]
    custom_tickdata = []
    neg_padding = 0
    start_pos = 0
    for y in range(len(flat)):
        for x in range(start_pos,flat[-1]+1):
            if x in none_indices:
                custom_tickdata.append('')
                break
            custom_tickdata.append(tickvalues[x-neg_padding])
        neg_padding +=1
        start_pos = 1+x
    #Making an array for hoverlabels
    custom_hoverdata=[]
    sublist = []
    for x in custom_tickdata:
        if x == '':
            custom_hoverdata.append(sublist)
            sublist = []
            sublist.append(x)
            continue
        sublist.append(x)
    sublist2 = sublist.copy()
    custom_hoverdata.append(sublist2)
    #Creating figure
    fig = go.Figure()
    idx = 0
    for x,y in zip(split_array,index_array):
        color = 'rgba(219,43,57,0.8)' if x[1]<0 else 'rgba(47,191,113,0.8)'
        if (idx==0 and x[0] < 0):
            color= 'rgba(219,43,57,0.8)'
        fig.add_scatter(y=x, x=y, fill='tozeroy', fillcolor=color, line_color=color, customdata=custom_hoverdata[idx],
                        hovertemplate='%{customdata}<extra></extra>',legendgroup='mytrace',
                        showlegend=False if idx>0 else True)
        idx += 1
    fig.update_layout()
    fig.update_xaxes(tickformat='', hoverformat='',tickmode = 'array',
                     tickvals = np.arange(index_array[-1][-1]+1),
                     ticktext = custom_tickdata)
    fig.update_traces(mode='lines')
    fig.show()
Example-
resid_fig([-2,-5,7,11,3,2,-1,1,-1,1], [1,2,3,4,5,6,7,8,9,10])
Now, for the caveats-
It does use separate polygons but I have combined all the traces into a single legendgroup so clicking on legend turns all of them on or off. About the legend colour, one way is to change 0 to 1 in showlegend=False if idx>0 in the fig.add_scatter() call. It then shows two legends, red and green still in the same legendgroup though so they still turn on and off together.
The function works by first separating continuous positive and negative values into arrays, adding 0 to the end and start of each array so the polygons can meet at the x-axis. This means the figure does not scale as well but depending on the use-case, it might not matter as much. This does not affect the hoverlabels or ticklabels as they are blank on these converging points.
The most important one: the graph does not work as intended when any of the points in the passed array is 0. I'm sure it can be modified to handle that, but I have no use for it and the question doesn't ask for it either.
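The splitting step can also be sketched on its own. Note that this is a different technique from the zero padding used in resid_fig: instead of inserting a 0 at the next integer position, it linearly interpolates the x value where the line crosses y == 0, so the polygons meet exactly on the line. The function names are hypothetical, and, like the function above, it does not treat points that are exactly 0 specially.

```python
def split_at_zero_crossings(xs, ys):
    """Split a polyline into runs of constant sign.

    Wherever two neighbouring points have opposite signs, the interpolated
    crossing point (x, 0) closes one run and opens the next one.
    """
    segments = []
    cur_x, cur_y = [xs[0]], [ys[0]]
    for i in range(1, len(xs)):
        x0, y0 = xs[i - 1], ys[i - 1]
        x1, y1 = xs[i], ys[i]
        if y0 * y1 < 0:  # strict sign change between neighbours
            # Linear interpolation for the x where the segment hits y == 0
            xc = x0 + (x1 - x0) * (0 - y0) / (y1 - y0)
            cur_x.append(xc)
            cur_y.append(0.0)
            segments.append((cur_x, cur_y))
            cur_x, cur_y = [xc], [0.0]
        cur_x.append(x1)
        cur_y.append(y1)
    segments.append((cur_x, cur_y))
    return segments


def segment_color(ys):
    # Red for runs that go below zero, green otherwise
    return 'red' if min(ys) < 0 else 'green'


# The data from the question
segments = split_at_zero_crossings([1, 2, 3, 4], [-2, -1.5, 1, 2.5])
colors = [segment_color(ys) for _, ys in segments]
```

Each (x, y) pair in segments could then be passed to fig.add_scatter with fill='tozeroy' and fillcolor set from segment_color, in the same way as the loop in resid_fig.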
I am plotting some data with bokeh, using a for loop to iterate over the columns in my dataframe. For some reason the box select and lasso tools, which I have managed to link in explicitly created plots (i.e. not generated with a for loop), do not seem to work now.
Do I need to increment some bokeh function within the for loop?
#example dataframe
array = {'variable': ['var1', 'var2', 'var3', 'var4'],
'var1': [np.random.rand(10)],
'var2': [np.random.rand(10)],
'var3': [np.random.rand(10)],
'var4': [np.random.rand(10)]}
cols = ['var1',
'var2',
'var3',
'var4']
df = pd.DataFrame(array, columns = cols)
w = 500
h = 400
#collect plots in a list (start with an empty)
plots = []
#iterate over the columns in the dataframe
# specify the tools in TOOLS
#add additional lines to show tolerance bands etc
for c in df[cols]:
    source = ColumnDataSource(data = dict(x = df.index, y = df[c]))
    TOOLS = "pan,wheel_zoom,box_zoom,reset,save,box_select,lasso_select"
    f = figure(tools = TOOLS, width = w, plot_height = h, title = c + ' Run Chart',
               x_axis_label = 'Run ID', y_axis_label = c)
    f.line('x', 'y', source = source, name = 'data')
    f.triangle('x', 'y', source = source)
    #data mean line
    f.line(df.index, df[c].mean(), color = 'orange')
    #tolerance lines
    f.line (df.index, df[c + 'USL'][0], color = 'red', line_dash = 'dashed', line_width = 2)
    f.line (df.index, df[c + 'LSL'][0], color = 'red', line_dash = 'dashed', line_width = 2)
    #append the new plot in this loop to the existing list of plots
    plots.append(f)
#link all the x_ranges
for i in plots:
    i.x_range = plots[0].x_range
#plot
p = gridplot(plots, ncols = 2)
output_notebook()
show(p)
I expect to produce plots which are linked and allow me to box or lasso select some points on one chart and for them to be highlighted on the others. However, the plots only let me select on one plot with no linked behaviour.
SOLUTION
This may seem like a bit of a noob problem, but I am sure someone else will come across it, so here is the answer.
Bokeh works by referring to a data source object (the ColumnDataSource object). You can pass your whole dataframe into it and then refer to explicit x and y columns when creating glyphs (e.g. my f.line, f.triangle etc).
So I moved the 'source' outside of the loop to prevent it being reset on each iteration and just passed my df to it. Then, within the loop, I use the column name plus a descriptor string (USL, LSL, mean) for the y values and 'index' for my x values.
I add a box select tool explicitly with a 'name' defined, so that the box only selects the glyphs I want it to select (i.e. I don't want it to select my constant-value mean and spec limit lines).
Also, be careful if you want to output to an HTML file or similar: you will probably need to suppress your in-notebook output, as bokeh does not like having duplicate plots open. I have not included my HTML output solution here.
As for adding linked lasso selection to plots generated in a for loop, I could only find an explicit box select tool class, so I'm not sure this is possible.
So here it is:
#keep the source out of the loop to stop it resetting every time
Source = ColumnDataSource(df)
for c in cols:
    TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
    f = figure(tools = TOOLS, width = w, plot_height = h, title = c + ' Run Chart',
               x_axis_label = 'Run ID', y_axis_label = c)
    f.line(x = 'index', y = c , source = Source, name = 'data')
    f.triangle(x = 'index', y = c, source = Source, name = 'data')
    #data mean line
    f.line(x = 'index', y = c + '_mean', source = Source, color = 'orange')
    #tolerance lines
    f.line (x = 'index', y = c + 'USL', color = 'red', line_dash = 'dashed', line_width = 2, source = Source)
    f.line (x = 'index', y = c + 'LSL', color = 'red', line_dash = 'dashed', line_width = 2, source = Source)
    # Add BoxSelect tool - this allows points on one plot to be highlighted on all linked plots. Note only the delta info
    # is linked using name='data'. Again names can be used to ensure only the relevant glyphs are highlighted.
    bxselect1 = BoxSelectTool(renderers=f.select(name='data'))
    f.add_tools(bxselect1)
    plots.append(f)
#tie the x_ranges together so that panning is linked between plots
for i in plots:
    i.x_range = plots[0].x_range
forp = gridplot(plots, ncols = 2)
show(forp)
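The loop above reads columns such as var1_mean, var1USL and var1LSL from the shared source, but the post does not show how they were created. The following is only a sketch of one way they might be prepared, as a plain dict of equal-length columns. The function name build_source_data, the sample runs and the spec limits are all hypothetical; only the column naming follows the solution code.

```python
from statistics import mean


def build_source_data(values_by_var, spec_limits):
    """Return a dict of equal-length columns for one shared data source.

    values_by_var: {name: list of measurements}, one entry per run
    spec_limits:   {name: (LSL, USL)}
    """
    lengths = {len(v) for v in values_by_var.values()}
    if len(lengths) != 1:
        raise ValueError("all variables must have the same number of runs")
    n = lengths.pop()
    data = {'index': list(range(n))}
    for name, values in values_by_var.items():
        lsl, usl = spec_limits[name]
        data[name] = list(values)
        # Constant lines are stored as repeated values so every column has n rows
        data[name + '_mean'] = [mean(values)] * n
        data[name + 'USL'] = [usl] * n
        data[name + 'LSL'] = [lsl] * n
    return data


# Hypothetical measurements and (LSL, USL) pairs
runs = {'var1': [0.2, 0.4, 0.6], 'var2': [1.0, 3.0, 2.0]}
limits = {'var1': (0.0, 1.0), 'var2': (0.5, 3.5)}
data = build_source_data(runs, limits)
```

A dict of this shape (column names mapped to sequences of equal length) is what ColumnDataSource accepts, so ColumnDataSource(data) would be the assumed next step, created once outside the loop.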
I can draw a boxplot from data:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.rand(100)
plt.boxplot(data)
Then, the box will range from the 25th-percentile to 75th-percentile, and the whisker will range from the smallest value to the largest value between (25th-percentile - 1.5*IQR, 75th-percentile + 1.5*IQR), where the IQR denotes the inter-quartile range. (Of course, the value 1.5 is customizable).
Now I want to know the values used in the boxplot, i.e. the median, upper and lower quartile, the upper whisker end point and the lower whisker end point. While the former three are easy to obtain by using np.median() and np.percentile(), the end point of the whiskers will require some verbose coding:
median = np.median(data)
upper_quartile = np.percentile(data, 75)
lower_quartile = np.percentile(data, 25)
iqr = upper_quartile - lower_quartile
upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()
I was wondering, while this is acceptable, would there be a neater way to do this? It seems that the values should be ready to pull-out from the boxplot, as it's already drawn.
Why do you want to do so? What you are doing is already pretty direct.
Yeah, if you want to fetch them from the plot once it has been made, simply use the get_ydata() method.
B = plt.boxplot(data)
[item.get_ydata() for item in B['whiskers']]
It returns an array of shape (2,) for each whisker; the second element is the value we want:
[item.get_ydata()[1] for item in B['whiskers']]
I've had this recently and have written a function to extract the boxplot values from the boxplot as a pandas dataframe.
The function is:
def get_box_plot_data(labels, bp):
    rows_list = []
    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)
    return pd.DataFrame(rows_list)
And is called by passing an array of labels (the ones that you would pass to the boxplot plotting function) and the data returned by the boxplot function itself.
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def get_box_plot_data(labels, bp):
    rows_list = []
    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)
    return pd.DataFrame(rows_list)

data1 = np.random.normal(loc = 0, scale = 1, size = 1000)
data2 = np.random.normal(loc = 5, scale = 1, size = 1000)
data3 = np.random.normal(loc = 10, scale = 1, size = 1000)
labels = ['data1', 'data2', 'data3']
bp = plt.boxplot([data1, data2, data3], labels=labels)
print(get_box_plot_data(labels, bp))
plt.show()
Outputs the following from get_box_plot_data:
label lower_whisker lower_quartile median upper_quartile upper_whisker
0 data1 -2.491652 -0.587869 0.047543 0.696750 2.559301
1 data2 2.351567 4.310068 4.984103 5.665910 7.489808
2 data3 7.227794 9.278931 9.947674 10.661581 12.733275
And produces the following plot:
upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()
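are equal to
upper_whisker = data.max()
lower_whisker = data.min()
only when no data point lies beyond the 1.5*IQR fences, since the whiskers then end at the real extremes of the dataset. Statistically speaking, the fence values are upper_quartile + 1.5*IQR and lower_quartile - 1.5*IQR, and the whiskers stop at the most extreme data points inside those fences.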
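The difference between the fences and the whisker ends is easiest to see on a small dataset with one outlier. Below is a minimal sketch using only the standard library. The function name box_stats is hypothetical, and statistics.quantiles with method='inclusive' is used here on the assumption that its linear interpolation agrees with the default of np.percentile.

```python
from statistics import quantiles


def box_stats(data, whis=1.5):
    """Compute quartiles, fences, whisker ends and outliers for one dataset."""
    values = sorted(data)
    # method='inclusive' interpolates linearly between data points
    q1, med, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers end at the most extreme data points that are inside the fences
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        'q1': q1, 'median': med, 'q3': q3,
        'lower_fence': lower_fence, 'upper_fence': upper_fence,
        'lower_whisker': inside[0], 'upper_whisker': inside[-1],
        'fliers': [v for v in values if v < lower_fence or v > upper_fence],
    }


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
stats = box_stats(sample)
```

For this sample the upper fence is 14.5, the upper whisker ends at 9, and the maximum of 100 is reported as an outlier, so the whisker end, the fence and data.max() are three different values.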