Center nested boxplots in Python/Seaborn with unequal classes

I have grouped data from which I want to generate boxplots using seaborn. However, not every group has all classes. As a result, the boxplots are not centered if classes are missing within one group:
Figure
The graph is generated using the following code:
sns.boxplot(x="label2", y="value", hue="variable",palette="Blues")
Is there any way to force seaborn to center these boxes? I didn't find any appropriate way.
Thank you in advance.

Yes, there is, but you are not going to like it.
Centering these will mean that you will have the same y value for median values, so normalize your data so that the median is 0.5 for each y value for each value of x. That will give you the plot you want, but you should note that somewhere in the plot so people will not be confused.
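If the goal is horizontal centering of the boxes around each group's tick, a different approach from the normalization described above is to compute the x positions yourself and draw with matplotlib's boxplot. This is only a rough sketch: the sample data, the centered_positions helper and the width values are all made up for illustration, and it does not colour the boxes by class.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-format data: group "b" has no class "z".
df = pd.DataFrame({
    "label2":   ["a"] * 9 + ["b"] * 6,
    "variable": ["x", "y", "z"] * 3 + ["x", "y"] * 3,
    "value":    [1, 2, 3, 2, 3, 4, 3, 4, 5, 1, 2, 2, 3, 3, 4],
})

def centered_positions(df, group_col, hue_col, width=0.25):
    """Map (group, class) to an x position centred on the group's tick."""
    positions = {}
    for i, (group, sub) in enumerate(df.groupby(group_col, sort=True)):
        classes = sorted(sub[hue_col].unique())
        n = len(classes)
        for j, cls in enumerate(classes):
            # Offsets are symmetric around the integer tick position i.
            positions[(group, cls)] = i + (j - (n - 1) / 2) * width
    return positions

positions = centered_positions(df, "label2", "variable")

fig, ax = plt.subplots()
for (group, cls), x in positions.items():
    mask = (df["label2"] == group) & (df["variable"] == cls)
    ax.boxplot(df.loc[mask, "value"].to_numpy(), positions=[x], widths=0.2)

# Put one tick per group at the integer positions the boxes are centred on.
groups = sorted(df["label2"].unique())
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups)
```

Here group "a" gets three boxes spread around x = 0 and group "b" gets two boxes spread symmetrically around x = 1, so neither group looks shifted.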

Related

How can I plot the sum of values (rather than the count) using seaborn violinplot?

I have a data set with various quantities which I would like to make a violin plot of using seaborn. However, instead of visualizing the COUNT of occurrences of all of the quantities I'd like to display the SUM of the quantities.
For example, if I have a data set like this...
df = pd.DataFrame({'quantity':[1,1,1,4]})
sns.violinplot(data=df)
Actual Result
I really want it to display something like this...
df = pd.DataFrame({'quantity':[1,1,1,4,4,4,4]})
sns.violinplot(data=df)
Expected Result
My data set is around 500k values ranging between 1 and 10k, so it's not possible to transform the data as above (i.e. [1,1,2,3] -> [1,1,2,2,3,3,3]).
I know I could do something like this below to get the values I want and then plot with a bar plot or something...
df.quantity.value_counts() * df.quantity.value_counts().index
However, I really like seaborn's violin plot and the ability to pair it with catplot, so if anyone knows a way to do this in seaborn I'd be very grateful.
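For reference, the "count times value" aggregation mentioned above can also be written as a groupby sum, which avoids expanding the data. A minimal sketch on the four-row example from the question (it only computes the sums; it does not by itself produce a violin plot):

```python
import pandas as pd

df = pd.DataFrame({"quantity": [1, 1, 1, 4]})

# Sum of 'quantity' per distinct value, i.e. count * value for each value.
sums = df.groupby("quantity")["quantity"].sum()

# The same numbers via value_counts, as in the question.
counts = df["quantity"].value_counts().sort_index()
sums_via_counts = counts * counts.index

print(sums)
```

For this data the value 1 sums to 3 and the value 4 sums to 4.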

How do I create a multiline plot using seaborn?

I am trying out Seaborn to make my plot visually better than matplotlib. I have a dataset which has a column 'Year' which I want to plot on the X-axis and 4 Columns say A,B,C,D on the Y-axis using different coloured lines. I was trying to do this using the sns.lineplot method but it allows for only one variable on the X-axis and one on the Y-axis. I tried doing this
sns.lineplot(data_preproc['Year'],data_preproc['A'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['B'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['C'], err_style=None)
sns.lineplot(data_preproc['Year'],data_preproc['D'], err_style=None)
But this way I don't get a legend in the plot to show which coloured line corresponds to what. I tried checking the documentation but couldn't find a proper way to do this.

Seaborn favors the "long format" as input. The key ingredient to convert your DataFrame from its "wide format" (one column per measurement type) into long format (one column for all measurement values, one column to indicate the type) is pandas.melt. Given a data_preproc structured like yours, filled with random values:
import numpy as np
import pandas as pd
import seaborn as sns

num_rows = 20
years = list(range(1990, 1990 + num_rows))
# Wide format: one column per measurement type
data_preproc = pd.DataFrame({
    'Year': years,
    'A': np.random.randn(num_rows).cumsum(),
    'B': np.random.randn(num_rows).cumsum(),
    'C': np.random.randn(num_rows).cumsum(),
    'D': np.random.randn(num_rows).cumsum()})
A single plot with four lines, one per measurement type, is obtained with
sns.lineplot(x='Year', y='value', hue='variable',
             data=pd.melt(data_preproc, ['Year']))
(Note that 'value' and 'variable' are the default column names returned by melt, and can be adapted to your liking.)
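To see what melt actually produces, here is a tiny sketch on a made-up two-row frame (the frame and its values are just for illustration):

```python
import pandas as pd

wide = pd.DataFrame({
    "Year": [1990, 1991],
    "A": [0.1, 0.2],
    "B": [1.0, 1.1],
})

# id_vars stay as columns; every other column is stacked into a
# 'variable' (former column name) / 'value' pair.
long = pd.melt(wide, id_vars=["Year"])
print(long)
```

The result has one row per (Year, column) combination, which is the shape the hue argument expects.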

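If 'Year' is the index, this:
sns.lineplot(data=data_preproc.set_index('Year'))
will do what you want. With wide-form data, seaborn plots each column as a line against the index, so leaving 'Year' as a regular column would draw it as one more line.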

See the documentation:
sns.lineplot(x="Year", y="signal", hue="label", data=data_preproc)
You probably need to re-organize your dataframe in a suitable way so that there is one column for the x data, one for the y data, and one which holds the label for the data point.
You can also just use matplotlib.pyplot. If you import seaborn, much of the improved design is also used for "regular" matplotlib plots. Seaborn is really "just" a collection of methods which conveniently feed data and plot parameters to matplotlib.
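A minimal sketch of the plain matplotlib route, assuming a frame shaped like the data_preproc in the question (the random values are placeholders). Passing label= to each plot call is what makes the legend show up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
num_rows = 20
data_preproc = pd.DataFrame({"Year": range(1990, 1990 + num_rows)})
for col in ["A", "B", "C", "D"]:
    data_preproc[col] = rng.standard_normal(num_rows).cumsum()

fig, ax = plt.subplots()
for col in ["A", "B", "C", "D"]:
    # One line per column; the label is what ends up in the legend.
    ax.plot(data_preproc["Year"], data_preproc[col], label=col)
ax.set_xlabel("Year")
ax.legend()
```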

Discrepancy between Seaborn plotted mean and calculated mean. (Python/Pandas)

Hello there wonderful people of StackOverflow!
I have been getting to grips with Python and was starting to feel pretty confident that I knew what I was doing until this doozy came up:
I am plotting and comparing two subselections of a dataframe where "Type" = "area" and "". Seaborn plots a boxplot of these and marks the mean, but when I calculate the mean using .mean() it gives a different answer. Here's the code:
plotdata = df[df['Type'].isin(['A','B'])]
g = sns.violinplot(x="Type", y="value", data=plotdata, inner="quartile")
plt.ylim(ymin=-4, ymax=4) # This is to zoom on the plot to make the 0 line clearer
This is the resulting plot, note how the means are ~-0.1 and ~1.5
But when I calculate them with:
print(df_long[df_long['charttype'].isin(['area'])]['error'].mean())
print(df_long[df_long['charttype'].isin(['angle'])]['error'].mean())
It returns:
0.014542483333332705
-2.024809368191722
So my question is, why don't these numbers match?

Total misunderstanding of basic statistics was the problem!
Box plots (which are inside the seaborn violin plots) plot the interquartile range and the MEDIAN, whereas I was later calculating the MEAN.
Just needed to sleep on it and hey presto all becomes clear.
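A quick illustration of how far apart the two can be on skewed data (the numbers are invented, not the data from the question):

```python
import pandas as pd

errors = pd.Series([0.0, 0.0, 0.0, 0.0, 10.0])

mean = errors.mean()      # pulled up by the single large value
median = errors.median()  # the middle value, which is what the plot marks

print(mean, median)
```

Here the mean is 2.0 while the median is 0.0, so a plot that marks the median will not line up with .mean().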

seaborn: separate groups in factorplot

I'm using seaborn to create a factorplot. Some entries in my data are missing, so it is hard to understand which bar belongs to which group.
In this example, the factorplot analyzes the answers in a survey. People are grouped by the answer they gave to a first question. Then for each of these groups, the distribution for a second question is plotted.
Sometimes the data is rather sparse. That means some bars are missing, and the boundaries between the groups are unclear. Can you tell whether the black bar in the middle belongs to group 2 or 3?
I would like to add separators between the groups. A vertical line for example would be nice.
In the pandas docs, I can't find anything like this.
Is there a way to add a visual separation between the factorplot groups?

You may use an axvline to create a vertical line spanning over the complete height of the axes. You would position that line in between the categories, i.e. at positions [1.5, 2.5, ...] in terms of axes labels. In case your plot is categorical, those would rather be [0.5, 1.5, ...]
E.g.
ax = df.plot(...)
for i in range(len(categories)-1):
    ax.axvline(i+0.5, color="grey")
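A self-contained version of the same idea, using a plain matplotlib bar chart with invented categories and counts in place of the elided df.plot(...) call:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical categories and counts, one bar per category at x = 0, 1, 2, ...
categories = ["1", "2", "3", "4"]
counts = [5, 0, 2, 7]

fig, ax = plt.subplots()
ax.bar(range(len(categories)), counts)
ax.set_xticks(range(len(categories)))
ax.set_xticklabels(categories)

# Separators sit halfway between neighbouring category positions.
separators = [i + 0.5 for i in range(len(categories) - 1)]
for x in separators:
    ax.axvline(x, color="grey", linestyle="--")
```

With seaborn, the same loop should work on the Axes object that the plotting function returns or exposes.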

How to connect boxplot median values

It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax to do the boxplot so that it automatically generates the box plot for Y vs. X device without having to do external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing is to just do a line plot by getting the mean values from the boxplot, but I'm not sure how to extract that information from the plot.

You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
                positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
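For the median variant, a rough sketch without seaborn: compute the medians with groupby and draw them at the same positions as the boxes. The small data frame is invented for illustration, and the boxes are placed at 0, 1, 2, ... so the line and the boxes share x coordinates.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "value": [1.0, 2.0, 6.0, 2.0, 4.0, 5.0, 3.0, 3.5, 9.0],
})

# One median per group, in sorted group order.
medians = df.groupby("group")["value"].median()
groups = list(medians.index)
positions = list(range(len(groups)))

fig, ax = plt.subplots()
ax.boxplot([df.loc[df["group"] == g, "value"].to_numpy() for g in groups],
           positions=positions)
ax.set_xticklabels(groups)

# Connect the medians with a line drawn over the boxes.
ax.plot(positions, medians.to_numpy(), marker="o")
```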

You can get the value of the medians by using the .get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
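import matplotlib.pyplot as plt
import numpy as np

# data: your dataset, e.g. a sequence of arrays (one per box)
bp = plt.boxplot(data)
X = []
Y = []
for m in bp['medians']:
    # Each median is a horizontal line segment from (x0, y0) to (x1, y1)
    [[x0, x1], [y0, y1]] = m.get_data()
    X.append(np.mean((x0, x1)))
    Y.append(np.mean((y0, y1)))
plt.plot(X, Y, c='C1')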
For an arbitrary dataset (data), this script generates this figure. Hope it helps!
