Interpretation of boxplot [duplicate]


This question already has answers here:
Why is matplotlib's notched boxplot folding back on itself?
(1 answer)
strange shape of the boxplot using matplotlib
(1 answer)
Unintended Notched Boxplot from Matplotlib, Error from Seaborn
(2 answers)
Closed last year.
I am trying to create a box plot with matplotlib library of python. The code is given below.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
bp = ax.boxplot([corr_df['bi'], corr_df['ndsi'], corr_df['dbsi'], corr_df['mbi']], patch_artist=True, notch=True, vert=True)
ax.set_title("Spearman’s correlation coefficient for Soil indices", fontsize=14)
ax.set_xlabel("Indices", fontsize=14)
ax.set_ylabel("Spearman’s correlation coefficient", fontsize=14)
colors = ['#088A08', '#FFFF00', '#01DFD7', '#FF00FF', '#3A01DF']
# color each box; zip stops after the four boxes
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
ax.grid()
ax.set_xticklabels(['bi', 'ndsi', 'dbsi', 'mbi'])
This creates an image like this:
I am not able to understand the 1st and 3rd boxplot. These two (box plots of bi and dbsi) have neck-like structures in them, which the other two boxplots don't have. What does this show? The interpretation of the boxplot as described on the web doesn't include this part.

In your example, the argument notch is set to True so according to the doc, it displays:
notch bool, default: False
Whether to draw a notched boxplot (True), or a rectangular boxplot (False). The notches represent the confidence interval (CI) around the median. The documentation for bootstrap describes how the locations of the notches are computed by default, but their locations may also be overridden by setting the conf_intervals parameter.
Specifically, the behavior (flipped appearance) you're describing is documented as follows:
Note
In cases where the values of the CI are less than the lower quartile
or greater than the upper quartile, the notches will extend beyond the
box, giving it a distinctive "flipped" appearance. This is expected
behavior and consistent with other statistical visualization packages.
You will find more details in this answer.
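To check numerically whether a sample will produce a "flipped" notch, one option is to compare the notch bounds with the quartiles. The sketch below assumes matplotlib.cbook.boxplot_stats, which returns the statistics as dictionaries with the keys q1, q3, cilo and cihi; the helper name notch_is_flipped and the sample data are made up for illustration.

```python
import numpy as np
from matplotlib import cbook


def notch_is_flipped(values):
    """Return True if the notch (CI around the median) extends beyond the box."""
    stats = cbook.boxplot_stats(np.asarray(values, dtype=float))[0]
    return bool(stats["cilo"] < stats["q1"] or stats["cihi"] > stats["q3"])


small = [1, 2, 3, 4, 5]   # few points: wide CI around the median
large = np.arange(1000)   # many points: narrow CI around the median

print("small sample flipped:", notch_is_flipped(small))
print("large sample flipped:", notch_is_flipped(large))
```

Small samples are the usual cause: the interval around the median shrinks with the square root of the sample size, so with only a handful of values it can easily be wider than the box itself.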

Related

Visualizing Prediction and Test values for comparison [duplicate]

This question already has answers here:
How do I equalize the scales of the x-axis and y-axis?
(5 answers)
Closed 8 months ago.
I'd like to make comparing these Prediction and Test values easier, so I'm thinking of two ways to achieve that:
Scale the X and Y axis to the same scale
Plot a linear line (y=x)
Really like to have some way to either 'exclude' the outliers or perhaps 'zoom in' to the area where the points are dense, without manually excluding the outliers from the dataset (so its done automatically). Is this possible?
sns.scatterplot(x=y_pred, y=y_true)
plt.grid()
Looked around and tested plt.axis('equal') as mentioned on another question but it didn't seem quite right. Tried using plt.plot((0,0), (30,30)) to create the linear plot but it didn't show anything. Any other input on how to visualise this would be really appreciated as well. Thanks!

There are short ways to achieve everything you've suggested:
Force scaled axes with matplotlib.axes.Axes.set_aspect.
Add an infinite line with slope 1 through the origin with matplotlib.axes.Axes.axline.
Set your plot to interactive mode, so you can pan and zoom. The way to do this depends on your environment and is explained in the docs.
Best to combine them all.
import matplotlib.pyplot as plt
from numpy import random
plt.ion() # activates interactive mode in most environments
plt.scatter(random.random_sample(10), random.random_sample(10))
ax = plt.gca()
ax.axline((0, 0), slope=1)
ax.set_aspect('equal', adjustable='datalim') # force equal aspect

To plot the linear line:
plt.plot([0,30], [0,30])
To scale x and y axis to same scale (see doc for set_aspect):
plt.xlim(0, 30)
plt.ylim(0, 30)
plt.gca().set_aspect('equal', adjustable='box')
plt.draw()
From the doc for set_aspect:
Axes.set_aspect(aspect, adjustable=None, anchor=None, share=False)
Set the aspect ratio of the axes scaling, i.e. y/x-scale
aspect='equal': same as aspect=1, i.e. same scaling for x and y.
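Neither answer covers the "zoom in without removing outliers" part of the question. One possible approach, which is not taken from the answers above, is to derive the axis limits from percentiles of the data, so that extreme points stay in the dataset but fall outside the visible range. A minimal sketch with made-up data follows; the 1st/99th percentile cut-offs are an arbitrary assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data standing in for y_true / y_pred, with a few extreme predictions
rng = np.random.default_rng(0)
y_true = rng.normal(15, 3, 200)
y_pred = y_true + rng.normal(0, 1, 200)
y_pred[:3] += 100

# Shared limits taken from the 1st and 99th percentile of all values
both = np.concatenate([y_true, y_pred])
lo, hi = np.percentile(both, [1, 99])

fig, ax = plt.subplots()
ax.scatter(y_pred, y_true, s=10)
ax.axline((0, 0), slope=1, color="gray")  # y = x reference line
ax.set_xlim(lo, hi)
ax.set_ylim(lo, hi)
ax.set_aspect("equal", adjustable="box")  # same scale on both axes
fig.savefig("pred_vs_true.png")
```

Because the limits are only a view setting, the outliers are still plotted and reappear as soon as the limits are widened.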

seaborn - how to use `multiple` parameter in seaborn.histplot?

seaborn.histplot takes a keyword argument called multiple with one of the values {'layer', 'dodge', 'stack', 'fill'}. I presume it controls how multiple bars overlap when hue is used, but the examples and documentation don't make it clear when to use which type of multiple. Any information from the community will be helpful!

A picture says more, etc.
from matplotlib import pyplot as plt
import seaborn as sns
penguins = sns.load_dataset("penguins")
fig, axes = plt.subplots(2, 2, figsize=(15, 15))
kws = ["layer", "dodge", "stack", "fill"]
for kw, ax in zip(kws, axes.flat):
    sns.histplot(data=penguins, x="flipper_length_mm", hue="species", multiple=kw, ax=ax)
    ax.set_title(kw)
plt.show()
The docs say "Approach to resolving multiple elements when semantic mapping creates subsets. Only relevant with univariate data." meaning this is only of relevance when categories are plotted within one graph:
layer - overlaid categories (giving rise to ... interesting color combinations)
dodge - categories side by side (not applicable to pure KDE plots for the obvious reasons)
stack - stacked categories
fill - categories add up to 100%.

I think the visible examples are pretty good (in comparison to other "documentation" I have seen).
The default for the parameter "multiple" is layer, which just draws the different sub-histograms on top of each other. This is helpful if you want to compare the shape (skewness, variation, location of the median/average on the x-axis, etc.) of each sub-histogram and to compare the shapes to each other.
stack piles the bars on top of each other. This is most useful if you want an idea of the proportions (e.g. which subcategory dominates in which region of the x-axis, and a quick look at where each subcategory has gaps).
dodge places the bars next to each other. This is very similar to stack and, in my opinion, just a variation with the same intent.
If I remember right, fill stacks the bars and normalizes each bin so that the bars fill the whole height of the plot, showing the share of each subcategory per bin.
This is not as useful in my opinion, because without the exact shape of each distribution the user could miss some important insights.
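The difference between stack and fill can also be written out by hand. The following is only a NumPy sketch of the arithmetic, not seaborn's actual implementation; the two groups and the bin edges are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical groups standing in for the hue categories
groups = {"a": rng.normal(0, 1, 300), "b": rng.normal(2, 1, 100)}

edges = np.linspace(-4, 6, 11)  # shared bin edges for all groups
counts = np.array([np.histogram(v, bins=edges)[0] for v in groups.values()])

# "stack": each group's bar starts where the previous group's bar ended
stack_tops = counts.cumsum(axis=0)

# "fill": same stacking, but every non-empty bin is rescaled to a total of 1
totals = counts.sum(axis=0)
nonempty = totals > 0
fill = np.zeros(counts.shape)
fill[:, nonempty] = counts[:, nonempty] / totals[nonempty]

print(stack_tops[-1])  # total count per bin
print(fill.round(2))   # share of each group per bin
```

With fill, a bin containing 2 points and a bin containing 200 points look equally tall, which is why the shape of each distribution is no longer visible.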

partially visible spine matplotlib [duplicate]

This question already has an answer here:
Changing the length of axis lines in matplotlib
(1 answer)
Closed 1 year ago.
I am interested in creating a plot where only part of the spine is visible (say only for positive values), while the plot is shown for both negative and positive values.
set_position # seems to only set the point where it intersects with the other axis
set_visible # is an on-off switch. It does not allow for partial visibility.
Is there a way to do this?

With ax as the axes, if the x-axis is to show only between 0 and 0.5, then:
ax.spines['bottom'].set_bounds(0, 0.5)
You might need to set the ticks, as well, so, for instance:
ax.set_xticks([0, 0.25, 0.5])
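Put together as a self-contained sketch (the plotted curve is arbitrary and only there to have data on both sides of zero):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-0.5, 0.5, 100)

fig, ax = plt.subplots()
ax.plot(x, x**3)                        # data covers negative and positive x
ax.spines["bottom"].set_bounds(0, 0.5)  # draw the bottom spine only from 0 to 0.5
ax.set_xticks([0, 0.25, 0.5])           # keep ticks inside the visible part of the spine
ax.spines["top"].set_visible(False)     # hide the full-length top spine for clarity
fig.savefig("partial_spine.png")
```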

Why is the legend shown in a Seaborn JointGrid incorrect?

I am experimenting with JointGrid from Seaborn. I used .plot_joint() to plot my scatter plot, group-colored using the hue parameter. I have filtered my dataset to only include 2 of the 5 groups, to prevent too much overlap in the plots.
The plotted points appear correct, in that they match what I expect from the two groups I chose. Additionally, I double-checked my filtering by viewing the filtered dataframe. That too was correct as it contained only the two groups I chose.
However the legend that is automatically plotted along with the scatterplot is incorrect. It shows 4 groups (not sure why not 5), and the coloring is also incorrect. For 2 groups I would expect only the Red and Blue colors (the first 2 colors in the Set1 palette), but my 2nd group is colored with the 4th color in the Set1 palette.
plt.rcParams['figure.figsize'] = (12, 4)
df_tmp = df[df.Kmeans_Clusters.isin([0, 3])].copy()
# initialize Joint Grid
grid = sns.JointGrid(data=df_tmp, x='MP', y='PTS')
# plot scatter (main plot)
grid = grid.plot_joint(sns.scatterplot, data=df_tmp, hue='Kmeans_Clusters',
                       palette='Set1')
# plot marginal distplot for cluster 0, X & Y
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 0].MP, ax=grid.ax_marg_x,
             vertical=False, color='firebrick', label='Cluster0')
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 0].PTS, ax=grid.ax_marg_y,
             vertical=True, color='firebrick', label='Cluster0')
# plot marginal distplot for cluster 3, X & Y
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 3].MP, ax=grid.ax_marg_x,
             vertical=False, color='steelblue', label='Cluster3')
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 3].PTS, ax=grid.ax_marg_y,
             vertical=True, color='steelblue', label='Cluster3')
plt.suptitle('PTS vs MP, Cluster 0 & 3\n1982-2019', y=1.05, fontsize=20)
plt.show()
[Image: JointGrid plot with incorrect legend and coloring]
--- Update ---
I just tried this with a simple scatterplot (no JointGrid) and I can repeat my previous observation. Is there just something I am not understanding with the hue parameter and the scatterplot() function?
I do not see this issue with lmplot()
plt.rcParams['figure.figsize'] = (12, 4)
df_tmp = df[df.Kmeans_Clusters.isin([0, 3])].copy()
sns.scatterplot(data=df_tmp, y='PTS', x='MP', hue='Kmeans_Clusters', palette='Set1')
plt.title('PTS vs MP\n1982-2019')
plt.xlabel('Minutes Played Annually')
plt.ylabel('Points Scored Annually')
plt.show()

Again, once I fine tuned my searches, I was able to find the solution. In fact, here's another stackoverflow question that asks the same thing and is answered in detail: The `hue` parameter in Seaborn.relplot() skips an integer when given numerical data?.
Pasting the solution I used, as described in the link above:
"""
An alternative is to make sure the values are treated as categorical. Unfortunately, even if you plug in the numbers as strings, they will be converted to numbers, falling back to the same mechanism described above. This may be seen as a bug.
However, one choice you have is to use real categories, like e.g. single letters.
'cluster':list("ABCDE")
works fine,
"""
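Applied to the dataframe from the question, this could look like the sketch below. The values are made up, and the column name cluster_label as well as the "Cluster N" label format are assumptions for illustration; the point is only that the labels are non-numeric strings, so they cannot be converted back to numbers.

```python
import pandas as pd

# Hypothetical stand-in for the filtered dataframe from the question
df_tmp = pd.DataFrame({
    "MP": [1200, 2500, 900, 3000],
    "PTS": [400, 1500, 250, 2100],
    "Kmeans_Clusters": [0, 3, 0, 3],
})

# Build non-numeric labels and store them as a real categorical column
df_tmp["cluster_label"] = ("Cluster " + df_tmp["Kmeans_Clusters"].astype(str)).astype("category")

print(df_tmp["cluster_label"].cat.categories.tolist())
```

The new column can then be passed as hue, e.g. sns.scatterplot(data=df_tmp, x='MP', y='PTS', hue='cluster_label', palette='Set1'), so that only the two groups actually present should appear in the legend.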

Weird behavior of matplotlibs boxplot when using the notch shape

I am encountering some weird behavior in matplotlib's boxplot function when I am using the "notch" shape. I am using some code that I have written a while ago and never had those issues -- I am wondering what the problem is. Any ideas?
When I turn the notch shape off it looks normal though
This would be the code:
def boxplot_modified(data):
    fig = plt.figure(figsize=(8,6))
    ax = plt.subplot(111)
    bplot = plt.boxplot(data,
                        #notch=True,        # notch shape
                        vert=True,          # vertical box alignment
                        sym='ko',           # black circles for outliers
                        patch_artist=True,  # fill with color
                        )
    # choosing custom colors to fill the boxes
    colors = 3*['lightgreen'] + 3*['lightblue'] + ['lightblue', 'lightblue', 'lightblue']
    for patch, color in zip(bplot['boxes'], colors):
        patch.set_facecolor(color)
    # modifying the whiskers: straight lines, black, wider
    for whisker in bplot['whiskers']:
        whisker.set(color='black', linewidth=1.2, linestyle='-')
    # making the caps a little bit wider
    for cap in bplot['caps']:
        cap.set(linewidth=1.2)
    # hiding axis ticks
    plt.tick_params(axis="both", which="both", bottom=False, top=False,
                    labelbottom=True, left=False, right=False, labelleft=True)
    # adding horizontal grid lines
    ax.yaxis.grid(True)
    # remove axis spines
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["bottom"].set_visible(True)
    ax.spines["left"].set_visible(True)
    plt.xticks([y+1 for y in range(len(data))], 8*['x'])
    # raised title
    #plt.text(2, 1, 'Modified',
    #         horizontalalignment='center',
    #         fontsize=18)
    plt.tight_layout()
    plt.show()

boxplot_modified(df.values)
and when I make a plain plot without the customization, the problem still occurs:
def boxplot(data):
    fig = plt.figure(figsize=(8,6))
    ax = plt.subplot(111)
    bplot = plt.boxplot(data,
                        notch=True,         # notch shape
                        vert=True,          # vertical box alignment
                        sym='ko',           # black circles for outliers
                        patch_artist=True,  # fill with color
                        )
    plt.show()

boxplot(df.values)

Okay, as it turns out, this is actually a correct behavior ;)
From Wikipedia:
Notched box plots apply a "notch" or narrowing of the box around the median. Notches are useful in offering a rough guide to significance of difference of medians; if the notches of two boxes do not overlap, this offers evidence of a statistically significant difference between the medians. The width of the notches is proportional to the interquartile range of the sample and inversely proportional to the square root of the size of the sample. However, there is uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples). One convention is to use +/-1.58*IQR/sqrt(n).
This was also discussed in an issue on GitHub; R produces a similar output as evidence that this behaviour is "correct."
Thus, if we have this weird "flipped" appearance in the notched box plots, it simply means that the lower bound of the confidence interval of the median lies below the 1st quartile (or, correspondingly, that the upper bound lies above the 3rd quartile). Although it looks ugly, it's actually useful information about how uncertain the median estimate is.
Bootstrapping (random sampling with replacement to estimate parameters of a sampling distribution, here: confidence intervals) might reduce this effect:
From the plt.boxplot documentation:
bootstrap : None (default) or integer
Specifies whether to bootstrap the confidence intervals
around the median for notched boxplots. If bootstrap==None,
no bootstrapping is performed, and notches are calculated
using a Gaussian-based asymptotic approximation (see McGill, R.,
Tukey, J.W., and Larsen, W.A., 1978, and Kendall and Stuart,
1967). Otherwise, bootstrap specifies the number of times to
bootstrap the median to determine its 95% confidence intervals.
Values between 1000 and 10000 are recommended.
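To see how much the bootstrap changes the notch for a given sample, the two intervals can be compared directly. This is a sketch under the assumption that matplotlib.cbook.boxplot_stats accepts the same bootstrap argument as plt.boxplot; the sample data are made up, and whether the bootstrapped interval is narrower depends on the data.

```python
import numpy as np
from matplotlib import cbook

# Seed for reproducibility (assuming the bootstrap draws from NumPy's global random state)
np.random.seed(0)
data = np.random.normal(size=30)  # hypothetical small sample

plain = cbook.boxplot_stats(data)[0]                 # Gaussian-based asymptotic approximation
boot = cbook.boxplot_stats(data, bootstrap=5000)[0]  # 5000 bootstrap resamples of the median

for name, s in [("asymptotic", plain), ("bootstrap", boot)]:
    print(f"{name:>10}: q1={s['q1']:.2f}  "
          f"ci=({s['cilo']:.2f}, {s['cihi']:.2f})  q3={s['q3']:.2f}")
```

In the plot itself the same setting would be plt.boxplot(data, notch=True, bootstrap=5000).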
