seaborn: separate groups in factorplot - python

I'm using seaborn to create a factorplot. Some entries in my data are missing, so it is hard to understand which bar belongs to which group.
In this example, the factorplot analyzes the answers in a survey. People are grouped by the answer they gave to a first question. Then for each of these groups, the distribution for a second question is plotted.
Sometimes the data is rather sparse. That means some bars are missing, and the boundaries between the groups are unclear. Can you tell whether the black bar in the middle belongs to group 2 or 3?
I would like to add separators between the groups. A vertical line for example would be nice.
In the pandas docs, I can't find anything like this.
Is there a way to add a visual separation between the factorplot groups?

You may use an axvline to create a vertical line spanning the complete height of the axes. You would position that line in between the categories, i.e. at positions [1.5, 2.5, ...] in terms of axes labels. If your plot is categorical, the categories sit at integer positions 0, 1, 2, ..., so those would rather be [0.5, 1.5, ...].
E.g.
ax = df.plot(...)
for i in range(len(categories)-1):
    ax.axvline(i+0.5, color="grey")
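For a seaborn categorical plot, the same idea might look like the sketch below. It assumes a recent seaborn, where factorplot has been renamed catplot, and it uses a made-up survey DataFrame with hypothetical columns q1 and q2. The bars for category i are centered on x = i, so the separators go at i + 0.5.

```python
import pandas as pd
import seaborn as sns

# Invented survey answers (assumption): q1 defines the groups, q2 the bars.
df = pd.DataFrame({
    "q1": [1, 1, 2, 2, 3, 3, 3, 4],
    "q2": ["a", "b", "a", "c", "b", "c", "a", "b"],
})

# catplot returns a FacetGrid; with a single facet, .ax is the only Axes.
g = sns.catplot(data=df, x="q1", hue="q2", kind="count")
ax = g.ax

# Categories sit at x = 0, 1, 2, ..., so boundaries are halfway in between.
n_groups = df["q1"].nunique()
separators = [i + 0.5 for i in range(n_groups - 1)]
for pos in separators:
    ax.axvline(pos, color="grey", linestyle="--", linewidth=1)
```

Call plt.show() from matplotlib.pyplot (or g.savefig(...)) afterwards to see the result.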


seaborn - how to use `multiple` parameter in seaborn.histplot?

seaborn.histplot takes a keyword argument called multiple with one of the values {'layer', 'dodge', 'stack', 'fill'}. I presume it handles how multiple bars overlap, or what happens when hue is used. But the examples and documentation don't make it clear when to use which type of multiple. Any information from the community will be helpful!
A picture says more, etc.
from matplotlib import pyplot as plt
import seaborn as sns
penguins = sns.load_dataset("penguins")
fig, axes = plt.subplots(2, 2, figsize=(15, 15))
kws = ["layer", "dodge", "stack", "fill"]
for kw, ax in zip(kws, axes.flat):
    sns.histplot(data=penguins, x="flipper_length_mm", hue="species", multiple=kw, ax=ax)
    ax.set_title(kw)
plt.show()
The docs say "Approach to resolving multiple elements when semantic mapping creates subsets. Only relevant with univariate data." meaning this is only of relevance when categories are plotted within one graph:
layer - overlaid categories (giving rise to ... interesting color combinations)
dodge - categories side by side (not applicable to pure KDE plots for the obvious reasons)
stack - stacked categories
fill - categories add up to 100%.
I think the visible examples are pretty good (in comparison to other "documentation" I have seen).
The default for the parameter "multiple" is layer, which just draws the different sub-histograms on top of each other. This is helpful if you want to compare the form (skewness, variation, location of median/average on the x-axis, etc.) of each sub-histogram and to compare the forms to each other.
stack piles the bars on top of each other. This is most useful if you want an idea of the proportions (e.g. which subcategory dominates in which area of the x-axis, and a quick look at where each subcategory has gaps).
dodge places the bars next to each other. This is very similar to stack and in my opinion just a variation with the same intent.
If I remember right, fill rescales the stacked bars so that they fill the whole height of the plot in every bin.
This is not as useful in my opinion, because losing the exact form of each distribution could lead the user to miss some important insights.
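To make the difference between stack and fill concrete, here is a rough sketch of the arithmetic on invented data. It is not seaborn's actual implementation, only plain NumPy that mimics the idea: stack accumulates the per-group counts in each bin, and fill divides them by the bin total so the groups add up to 1.

```python
import numpy as np

# Two invented groups of samples (assumption, for illustration only).
rng = np.random.default_rng(0)
groups = {"a": rng.normal(0, 1, 200), "b": rng.normal(1, 1, 100)}

# Shared bin edges, then one row of counts per group.
edges = np.histogram_bin_edges(np.concatenate(list(groups.values())), bins=10)
counts = np.array([np.histogram(v, bins=edges)[0] for v in groups.values()])

# "stack": each group's bar starts where the previous group's bar ended.
stack_tops = counts.cumsum(axis=0)

# "fill": the same stacking, but every bin is rescaled to a total height of 1.
totals = counts.sum(axis=0)
fill_heights = np.divide(
    counts, totals, out=np.zeros(counts.shape), where=totals > 0
)
```

The top of the last stacked row equals the overall histogram, while the fill heights only show each group's share per bin, which is why the shape of the individual distributions is lost.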

Why is the legend shown in a Seaborn JointGrid incorrect?

I am experimenting with JointGrid from Seaborn. I used .plot_joint() to plot my scatter plot, group-colored using the hue parameter. I have filtered my dataset to only include 2 of the 5 groups, to prevent too much overlap in the plots.
The plotted points appear correct, in that they match what I expect from the two groups I chose. Additionally, I double-checked my filtering by viewing the filtered dataframe. That too was correct as it contained only the two groups I chose.
However the legend that is automatically plotted along with the scatterplot is incorrect. It shows 4 groups (not sure why not 5), and the coloring is also incorrect. For 2 groups I would expect only the Red and Blue colors (the first 2 colors in the Set1 palette), but my 2nd group is colored with the 4th color in the Set1 palette.
plt.rcParams['figure.figsize'] = (12, 4)
df_tmp = df[df.Kmeans_Clusters.isin([0, 3])].copy()
# initialize Joint Grid
grid = sns.JointGrid(data=df_tmp, x='MP', y='PTS')
# plot scatter (main plot)
grid = grid.plot_joint(sns.scatterplot, data=df_tmp, hue='Kmeans_Clusters',
                       palette='Set1')
# plot marginal distplot for cluster 0, X & Y
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 0].MP, ax=grid.ax_marg_x,
             vertical=False, color='firebrick', label='Cluster0')
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 0].PTS, ax=grid.ax_marg_y,
             vertical=True, color='firebrick', label='Cluster0')
# plot marginal distplot for cluster 3, X & Y
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 3].MP, ax=grid.ax_marg_x,
             vertical=False, color='steelblue', label='Cluster3')
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 3].PTS, ax=grid.ax_marg_y,
             vertical=True, color='steelblue', label='Cluster3')
plt.suptitle('PTS vs MP, Cluster 0 & 3\n1982-2019', y=1.05, fontsize=20)
plt.show()
--- Update ---
I just tried this with a simple scatterplot (no JointGrid) and I can repeat my previous observation. Is there just something I am not understanding with the hue parameter and the scatterplot() function?
I do not see this issue with lmplot()
plt.rcParams['figure.figsize'] = (12, 4)
df_tmp = df[df.Kmeans_Clusters.isin([0, 3])].copy()
sns.scatterplot(data=df_tmp, y='PTS', x='MP', hue='Kmeans_Clusters', palette='Set1')
plt.title('PTS vs MP\n1982-2019')
plt.xlabel('Minutes Played Annually')
plt.ylabel('Points Scored Annually')
plt.show()
Again, once I fine-tuned my searches, I was able to find the solution. In fact, here's another Stack Overflow question that asks the same thing and is answered in detail: The `hue` parameter in Seaborn.relplot() skips an integer when given numerical data?.
Pasting the solution I used, as described in the link above:
"""
An alternative is to make sure the values are treated as categorical. Unfortunately, even if you plug in the numbers as strings, they will be converted to numbers, falling back to the same mechanism described above. This may be seen as a bug.
However, one choice you have is to use real categories, e.g. single letters:
'cluster': list("ABCDE")
works fine.
"""
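A minimal sketch of that workaround, using invented data in place of the original df: the integer cluster ids are mapped to single letters in a new, hypothetical column cluster_label, and that column is passed as hue. Newer seaborn releases may also treat string or pandas categorical hue values as categories directly, so it is worth checking against the version you have installed.

```python
import pandas as pd
import seaborn as sns

# Invented stand-in for the filtered frame: only clusters 0 and 3 remain.
df_tmp = pd.DataFrame({
    "MP": [500, 1200, 2100, 800, 2600, 1500],
    "PTS": [150, 480, 1300, 260, 1900, 700],
    "Kmeans_Clusters": [0, 3, 3, 0, 3, 0],
})

# Map each integer id to a letter (0 -> "A", 3 -> "D") so hue is categorical.
letters = {int(k): chr(ord("A") + int(k))
           for k in sorted(df_tmp["Kmeans_Clusters"].unique())}
df_tmp["cluster_label"] = df_tmp["Kmeans_Clusters"].map(letters)

ax = sns.scatterplot(data=df_tmp, x="MP", y="PTS",
                     hue="cluster_label", palette="Set1")

# Collect the legend entries to confirm only the two real groups appear.
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
```

With letter labels there is no numeric range for seaborn to interpolate over, so the legend should list only the groups that are present.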

Center nested boxplots in Python/Seaborn with unequal classes

I have grouped data from which I want to generate boxplots using seaborn. However, not every group has all classes. As a result, the boxplots are not centered if classes are missing within one group.
The graph is generated using the following code:
sns.boxplot(x="label2", y="value", hue="variable",palette="Blues")
Is there any way to force seaborn to center these boxes? I didn't find any appropriate way.
Thank you in advance.
Yes, there is, but you are not going to like it.
Centering these will mean that you will have the same y value for median values, so normalize your data so that the median is 0.5 for each y value for each value of x. That will give you the plot you want, but you should note that somewhere in the plot so people will not be confused.

How do I create a multiline plot using seaborn?

I am trying out Seaborn to make my plot visually better than matplotlib. I have a dataset which has a column 'Year' which I want to plot on the X-axis and 4 Columns say A,B,C,D on the Y-axis using different coloured lines. I was trying to do this using the sns.lineplot method but it allows for only one variable on the X-axis and one on the Y-axis. I tried doing this
sns.lineplot(x=data_preproc['Year'], y=data_preproc['A'], err_style=None)
sns.lineplot(x=data_preproc['Year'], y=data_preproc['B'], err_style=None)
sns.lineplot(x=data_preproc['Year'], y=data_preproc['C'], err_style=None)
sns.lineplot(x=data_preproc['Year'], y=data_preproc['D'], err_style=None)
But this way I don't get a legend in the plot to show which coloured line corresponds to what. I tried checking the documentation but couldn't find a proper way to do this.
Seaborn favors the "long format" as input. The key ingredient to convert your DataFrame from its "wide format" (one column per measurement type) into long format (one column for all measurement values, one column to indicate the type) is pandas.melt. Given a data_preproc structured like yours, filled with random values:
import numpy as np
import pandas as pd
num_rows = 20
years = list(range(1990, 1990 + num_rows))
data_preproc = pd.DataFrame({
    'Year': years,
    'A': np.random.randn(num_rows).cumsum(),
    'B': np.random.randn(num_rows).cumsum(),
    'C': np.random.randn(num_rows).cumsum(),
    'D': np.random.randn(num_rows).cumsum()})
A single plot with four lines, one per measurement type, is obtained with
sns.lineplot(x='Year', y='value', hue='variable',
             data=pd.melt(data_preproc, ['Year']))
(Note that 'value' and 'variable' are the default column names returned by melt, and can be adapted to your liking.)
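To see what melt actually produces, here is a tiny sketch on a two-row frame. The data and the custom column names series and reading are invented for illustration; var_name and value_name are the pandas.melt parameters that replace the default names 'variable' and 'value'.

```python
import pandas as pd

# Tiny wide-format frame (invented values): one column per measurement type.
wide = pd.DataFrame({"Year": [1990, 1991], "A": [1.0, 2.0], "B": [3.0, 4.0]})

# Long format: one row per (Year, measurement type) pair.
long = pd.melt(wide, id_vars=["Year"], var_name="series", value_name="reading")
print(long)
```

The long frame can then be passed to sns.lineplot with x='Year', y='reading', hue='series'.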
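This:
sns.lineplot(data=data_preproc.set_index('Year'))
will do what you want. With wide-form data, seaborn plots every column against the index, so Year has to be the index rather than a regular column.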
See the documentation:
sns.lineplot(x="Year", y="signal", hue="label", data=data_preproc)
You probably need to re-organize your dataframe in a suitable way so that there is one column for the x data, one for the y data, and one which holds the label for the data point.
You can also just use matplotlib.pyplot. If you import seaborn, much of the improved design is also used for "regular" matplotlib plots. Seaborn is really "just" a collection of methods which conveniently feed data and plot parameters to matplotlib.
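As a sketch of the plain matplotlib route, assuming a frame shaped like the data_preproc from the question (filled here with random values): passing label= to each plot call and then calling ax.legend() gives the legend the question was missing.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the asker's data (assumption): Year plus columns A-D.
rng = np.random.default_rng(0)
num_rows = 20
data_preproc = pd.DataFrame({"Year": range(1990, 1990 + num_rows)})
for col in ["A", "B", "C", "D"]:
    data_preproc[col] = rng.normal(size=num_rows).cumsum()

fig, ax = plt.subplots()
for col in ["A", "B", "C", "D"]:
    # The label is what ends up in the legend entry for this line.
    ax.plot(data_preproc["Year"], data_preproc[col], label=col)
ax.set_xlabel("Year")
ax.legend()
```

Calling sns.set_theme() before plotting should apply seaborn's styling to these lines as well.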

Change certain squares in a seaborn heat map

Say I have a heat map that looks like this (axes are trimmed off):
I want to be able to alter certain squares to denote statistical significance. I know that I could mask out the squares that are not statistically significant, but I still want to retain that information (and not set the values to zero). Options for doing this include 1) making the text on certain squares bold, 2) adding a hatch-like functionality so that certain squares have stippling, or 3) adding a symbol to certain squares.
What should I do?
One approach is to access the Text objects directly and change their weight/style. The below code will take some sample data and try to make every entry equal to 118 stand out:
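import seaborn as sns
flights = sns.load_dataset("flights")
# keyword arguments are required by recent pandas versions
flights = flights.pivot(index="month", columns="year", values="passengers")
ax = sns.heatmap(flights, annot=True, fmt="d")
for text in ax.texts:
    text.set_size(14)
    if text.get_text() == '118':
        # make the matching annotations larger, bold and italic
        text.set_size(18)
        text.set_weight('bold')
        text.set_style('italic')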
I'm not a matplotlib/seaborn expert, but it appears to me that requiring an individual cell in the heatmap to be hatched would require a bit of work. In short, the heatmap is a Collection of matplotlib Patches, and the hatching of a collection can only be set on the collection as a whole. To set the hatch of an individual cell, you need them to be distinct patches, and things get messy. Perhaps (hopefully) someone more knowledgeable than I can come along and say that this is wrong, and that it's quite easy -- but if I had to guess, I'd say that changing the text style will be easier than setting a hatch.
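One possible way around the collection limitation, offered as a sketch rather than a seaborn feature: draw a separate, unfilled Rectangle with a hatch over each cell you want to mark. This assumes the heatmap cell in row i and column j spans x from j to j+1 and y from i to i+1 in data coordinates, which matches how seaborn lays out its mesh by default. The data and the significant mask are invented for illustration.

```python
import numpy as np
import seaborn as sns
from matplotlib.patches import Rectangle

# Invented 4x4 data and an invented "significance" mask (the diagonal).
x = np.arange(16).reshape(4, 4)
significant = x % 5 == 0

ax = sns.heatmap(x, annot=True, fmt="d")

# Overlay one hatched, unfilled rectangle per marked cell.
for row, col in zip(*np.nonzero(significant)):
    ax.add_patch(Rectangle((col, row), 1, 1, fill=False,
                           hatch="//", edgecolor="black", linewidth=0))
```

Because the rectangles are unfilled, the underlying colors and annotations stay visible, so no values have to be masked out.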
You could plot twice, applying a mask to the cells you do not want to emphasize the second time:
import numpy as np
import seaborn as sns
x = np.random.randn(10, 10)
sns.heatmap(x, annot=True)
sns.heatmap(x, mask=x < 1, cbar=False,
            annot=True, annot_kws={"weight": "bold"})
