seaborn - how to use `multiple` parameter in seaborn.histplot? - python

seaborn.histplot takes a keyword argument called multiple with one of the values {'layer', 'dodge', 'stack', 'fill'}. I presume it controls how multiple bars overlap, for example when hue is used, but the examples and documentation don't make it clear when to use which type of multiple. Any information from the community would be helpful!

A picture says more, etc.
from matplotlib import pyplot as plt
import seaborn as sns
penguins = sns.load_dataset("penguins")
fig, axes = plt.subplots(2, 2, figsize=(15, 15))
kws = ["layer", "dodge", "stack", "fill"]
for kw, ax in zip(kws, axes.flat):
    sns.histplot(data=penguins, x="flipper_length_mm", hue="species", multiple=kw, ax=ax)
    ax.set_title(kw)
plt.show()
The docs say "Approach to resolving multiple elements when semantic mapping creates subsets. Only relevant with univariate data." meaning this is only of relevance when categories are plotted within one graph:
layer - overlaid categories (giving rise to ... interesting color combinations)
dodge - categories side by side (not applicable to pure KDE plots for the obvious reasons)
stack - stacked categories
fill - categories add up to 100%.

I think the visible examples are pretty good (in comparison to other "documentation" I have seen).
The default for the parameter "multiple" is layer, which just draws the different sub-histograms on top of each other. This is helpful if you want to compare the form (skewness, variation, location of median/average on the x-axis, etc.) of each sub-histogram and to compare the forms to each other.
stack piles the bars on top of each other. This would be most useful if you want to get an idea of the proportions (e.g. which subcategory dominates which area of the x-axis, and a quick look at where there are gaps for which subcategory).
dodge will place the columns next to each other. This is very similar to stack and in my opinion just a variation with the same intent.
If I remember right, fill will fill the whole plot: in every bin the stacked bars are scaled to the full height, so only the proportions remain.
This is not as useful in my opinion, because losing the exact form of each distribution could lead the user to miss some important insights.
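To make the difference between the modes concrete, here is a rough NumPy-only sketch of the bar heights each mode ends up drawing. It is not seaborn's actual implementation; the two synthetic groups "a" and "b" and the choice of 10 bins are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two made-up groups standing in for the hue categories
groups = {
    "a": rng.normal(190, 6, size=150),
    "b": rng.normal(215, 7, size=120),
}

# All groups share the same bin edges, as in a hue-split histogram
edges = np.histogram_bin_edges(np.concatenate(list(groups.values())), bins=10)
counts = np.array([np.histogram(v, bins=edges)[0] for v in groups.values()])

# layer (and dodge): every group keeps its own counts, bars start at zero;
# dodge only makes the bars narrower and shifts them sideways
layer_heights = counts

# stack: each group's bar starts where the previous group's bar ended
stack_tops = counts.cumsum(axis=0)

# fill: the stacked bars are rescaled so every non-empty bin reaches 1.0
totals = counts.sum(axis=0)
fill_tops = stack_tops / np.where(totals == 0, 1, totals)
```

With stack the top of the last group equals the total count per bin, while with fill that top is always 1, which is why the overall shape of the distribution is no longer visible.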

Related

How to reproduce this legend with multiple curves?

I've been working hard on a package of functions for my work, and I'm stuck on a layout problem. Sometimes I need to work with a lot of column subplots (1 row x N columns), and the standard matplotlib legend is sometimes not helpful and makes it hard to visualize all the data.
I've been trying to create something like the picture below. I already tried to create one subplot for the curves and another one for the legends (and display the x-axis scale as a horizontal plot). I also tried offsetting the x-axis spines, but when I have a lot of curves plotted inside the same subplot the legend becomes huge.
The following image is from a piece of software. I'd like to create a similar look. Notice that these legends are "static": they remain fixed regardless of the zoom level. Another observation: I don't need all the ticks or anything like that.
What I already have is the following (the code is a mess because I'm trying many different solutions, and it is neither organized nor pythonic yet):
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker
fig, ax = plt.subplots(1,2, sharey = True)
ax[0].semilogx(np.zeros_like(dados.Depth)+0.02, dados.Depth)
ax[0].semilogx(dados.AHT90, dados.Depth, label = 'aht90')
ax[0].set_xlim(0.2,2000)
ax[0].grid(True, which = 'both', axis = 'both')
axres1 = ax[0].twiny()
axres1.semilogx(dados.AHT90, dados.Depth, label = 'aht90')
axres1.set_xlim(0.2 , 2000)
axres1.set_xticks(np.logspace(np.log10(0.2),np.log10(2000),2))
axres1.spines["top"].set_position(("axes", 1.02))
axres1.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())
axres1.tick_params(axis='both', which='both', labelsize=6)
axres1.set_xlabel('sss')#, labelsize = 5)
axres2 = ax[0].twiny()
axres2.semilogx(dados.AHT10, dados.Depth, label = 'aht10')
axres2.set_xlim(0.2 , 2000)
axres2.set_xticks(np.logspace(np.log10(0.2),np.log10(2000),2))
axres2.spines["top"].set_position(("axes", 1.1))
axres2.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())
axres2.tick_params(axis='both', which='both', labelsize=6)
axres2.set_xlabel('aht10')#, labelsize = 5)
fig.show()
and the result is:
But I'm having trouble making this automatic. If I add more curves, it is not practical to keep setting the "set_position" parameter by hand:
set_position(("axes", 1.02))
Another problem is that the more curves I add, the more that kind of "legend" keeps growing upward, and I have to adjust the subplot size with
fig.subplots_adjust(top=0.75)
I also want to make this adjustment automatic, without having to update that parameter whenever I add more curves.
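One way to avoid hand-tuning these two numbers might be to compute them from the number of curves. The following is only a sketch, not a tested layout recipe: the helper names header_offsets and top_margin and the default gap sizes are made up for illustration, and it assumes every stacked axis needs the same vertical room, measured in axes fractions.

```python
def header_offsets(n_curves, first_gap=0.02, step=0.08):
    """Axes-fraction positions for the top spines of n_curves stacked x-axes."""
    return [1.0 + first_gap + i * step for i in range(n_curves)]


def top_margin(n_curves, bottom=0.1, first_gap=0.02, step=0.08):
    """Value for fig.subplots_adjust(top=...) that leaves room for the stack.

    The stack is `extra` axes-heights tall, i.e. extra * (top - bottom) in
    figure coordinates, and that has to fit into the remaining space 1 - top.
    Solving top + extra * (top - bottom) = 1 for top gives the formula below.
    """
    if n_curves == 0:
        return 1.0
    # Offset of the last spine above the axes, plus one step for its label
    extra = first_gap + n_curves * step
    return (1.0 + extra * bottom) / (1.0 + extra)


offsets = header_offsets(3)  # one spine position per curve
top = top_margin(3)          # matching top margin for the figure
```

Each twiny axis would then take its position from the list, for example spines["top"].set_position(("axes", offsets[i])), followed by a single fig.subplots_adjust(top=top, bottom=0.1). The bottom value has to match the one actually used by the figure.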

Vary legend properties for different data in seaborn scatterplots

I have created a seaborn scatterplot for a dataset, where I set the sizes parameter to one column, and the hue parameter to another. Now the hue parameter only consists of five different values and is supposed to help classifying my data, while the sizes parameter consists of a lot more to represent actual numeric data. In this current data set, my hue values only consist of 0, 2, and 4, but in the "brief" legend option, the legend labels are not synchronized to that, which is very confusing. In the "full" legend option, the hue-labels are correct, but the size-labels are way too many. Therefore I would like to display the full legend for my hue parameter, but only a brief legend for the sizes parameter, because it consists of lots of unique values.
How the overcrowded "full" legend looks
The "brief" legend that is confusingly labeled
Edit: I added some code that demonstrates the issue for a random dataset. To specify my question again, I want the "shape" values to be fully depicted in the legend, while the "size" values have to be shortened (equivalent to the legend setting "brief").
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
x_condition=np.arange(0,20,1)
y_condition=np.arange(0,20,1)
size=np.random.randint(0,200,20)
# I haven't made a random distribution here, because I wanted to make sure it contains at least one of each [0,2,4]
shape=[0,2,0,4]*5
df=pd.DataFrame({"x_condition":x_condition,"y_condition":y_condition,"size":size,"shape":shape})
sns.scatterplot(x="x_condition", y="y_condition", hue="shape", size="size", data=df, palette="coolwarm", legend="brief")
plt.show()
sns.scatterplot(x="x_condition", y="y_condition", hue="shape", size="size", data=df, palette="coolwarm", legend="full")
plt.show()
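One possible direction is to draw the "full" legend and then thin out only the size section before rebuilding it. The sketch below works on plain lists of handles and labels; the function name thin_legend_section is hypothetical, and it assumes that the legend labels contain the variable names ("shape", "size") as section titles, each followed by the entries of that section. That layout should be checked against the labels your seaborn version actually produces.

```python
def thin_legend_section(handles, labels, section, titles, max_entries=5):
    """Keep all legend entries, except in `section`, which is subsampled.

    `titles` is the set of labels that act as section titles. At most
    `max_entries` (>= 2) evenly spaced entries of `section` are kept,
    always including its first and last entry.
    """
    start = labels.index(section) + 1
    # The section ends at the next title, or at the end of the legend
    end = next((i for i in range(start, len(labels)) if labels[i] in titles),
               len(labels))
    n = end - start
    if n <= max_entries:
        keep = set(range(start, end))
    else:
        keep = {start + (k * (n - 1)) // (max_entries - 1)
                for k in range(max_entries)}
    pairs = [(h, l) for i, (h, l) in enumerate(zip(handles, labels))
             if not (start <= i < end) or i in keep]
    return [h for h, _ in pairs], [l for _, l in pairs]


# Stand-in data: strings instead of real matplotlib legend handles
demo_labels = ["shape", "0", "2", "4", "size"] + [str(v) for v in range(0, 200, 10)]
demo_handles = ["handle_" + str(i) for i in range(len(demo_labels))]
new_handles, new_labels = thin_legend_section(
    demo_handles, demo_labels, section="size", titles={"shape", "size"})
```

With a real plot, the inputs would come from ax.get_legend_handles_labels() on the axes returned by the legend="full" call, and the result would be passed to ax.legend(new_handles, new_labels).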

seaborn: separate groups in factorplot

I'm using seaborn to create a factorplot. Some entries in my data are missing, so it is hard to understand which bar belongs to which group.
In this example, the factorplot analyzes the answers in a survey. People are grouped by the answer they gave to a first question. Then for each of these groups, the distribution for a second question is plotted.
Sometimes the data is rather sparse. That means some bars are missing, and the boundaries between the groups are unclear. Can you tell whether the black bar in the middle belongs to group 2 or 3?
I would like to add separators between the groups. A vertical line for example would be nice.
In the pandas docs, I can't find anything like this.
Is there a way to add a visual separation between the factorplot groups?
You may use an axvline to create a vertical line spanning over the complete height of the axes. You would position that line in between the categories, i.e. at positions [1.5, 2.5, ...] in terms of axes labels. In case your plot is categorical, those would rather be [0.5, 1.5, ...]
E.g.
ax = df.plot(...)
for i in range(len(categories)-1):
    ax.axvline(i+0.5, color="grey")

Compare 1 independent vs many dependent variables using seaborn pairplot in an horizontal plot

The pairplot function from seaborn lets you plot pairwise relationships in a dataset.
According to the documentation (highlight added):
By default, this function will create a grid of Axes such that each variable in data will by shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.
It is also possible to show a subset of variables or plot different variables on the rows and columns.
I could find only one example of subsetting different variables for rows and columns, here (it's the 6th plot under the Plotting pairwise relationships with PairGrid and pairplot() section). As you can see, it's plotting many independent variables (x_vars) against the same single dependent variable (y_vars) and the results are pretty nice.
I'm trying to do the same plotting a single independent variable against many dependent ones.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
ages = np.random.gamma(6,3, size=50)
data = pd.DataFrame({"age": ages,
                     "weight": 80*ages**2/(ages**2+10**2)*np.random.normal(1,0.2,size=ages.shape),
                     "height": 1.80*ages**5/(ages**5+12**5)*np.random.normal(1,0.2,size=ages.shape),
                     "happiness": (1-ages*0.01*np.random.normal(1,0.3,size=ages.shape))})
pp = sns.pairplot(data=data,
                  x_vars=['age'],
                  y_vars=['weight', 'height', 'happiness'])
The problem is that the subplots get arranged vertically, and I couldn't find a way to change it.
I know that then the tiling structure would not be so neat as the Y axis should be labeled at every subplot. Also, I know I could generate the plots making it by hand with something like this:
fig, axes = plt.subplots(ncols=3)
for i, yvar in enumerate(['weight', 'height', 'happiness']):
    axes[i].scatter(data['age'],data[yvar])
Still, I'm learning to use seaborn and I find its interface very convenient, so I wonder if there's a way. Also, this example is pretty easy, but for more complex datasets seaborn handles many more things for you that would quickly make the raw-matplotlib approach much more complex (hue, to start).
You can achieve what it seems you are looking for by swapping the variable names passed to the x_vars and y_vars parameters. So revisiting the sns.pairplot portion of your code:
pp = sns.pairplot(data=data,
                  y_vars=['age'],
                  x_vars=['weight', 'height', 'happiness'])
Note that all I've done here is swap x_vars for y_vars. The plots should now be displayed horizontally:
The x-axis will now be unique to each plot with a common y-axis determined by the age column.

Change certain squares in a seaborn heat map

Say I have a heat map that looks like this (axes are trimmed off):
I want to be able to alter certain squares to denote statistical significance. I know that I could mask out the squares that are not statistically significant, but I still want to retain that information (and not set the values to zero). Options for doing this include 1) making the text on certain squares bold, 2) adding a hatch-like functionality so that certain squares have stippling, or 3) adding a symbol to certain squares.
What should I do?
One approach is to access the Text objects directly and change their weight/style. The below code will take some sample data and try to make every entry equal to 118 stand out:
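import seaborn as sns
flights = sns.load_dataset("flights")
# Keyword arguments are required by recent pandas versions
flights = flights.pivot(index="month", columns="year", values="passengers")
ax = sns.heatmap(flights, annot=True, fmt="d")
for text in ax.texts:
    text.set_size(14)
    # Emphasize every annotation whose text is exactly '118'
    if text.get_text() == '118':
        text.set_size(18)
        text.set_weight('bold')
        text.set_style('italic')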
I'm not a matplotlib/seaborn expert, but it appears to me that requiring an individual cell in the heatmap to be hatched would require a bit of work. In short, the heatmap is a Collection of matplotlib Patches, and the hatching of a collection can only be set on the collection as a whole. To set the hatch of an individual cell, you need them to be distinct patches, and things get messy. Perhaps (hopefully) someone more knowledgeable than I can come along and say that this is wrong, and that it's quite easy -- but if I had to guess, I'd say that changing the text style will be easier than setting a hatch.
You could plot twice, applying a mask to the cells you do not want to emphasize the second time:
import numpy as np
import seaborn as sns
x = np.random.randn(10, 10)
sns.heatmap(x, annot=True)
sns.heatmap(x, mask=x < 1, cbar=False,
            annot=True, annot_kws={"weight": "bold"})
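For the hatching option mentioned in the question, one possible workaround is to leave the heatmap untouched and draw a separate hatched rectangle over each cell of interest. This is a sketch only: the helper names cell_corners and hatch_cells are made up, and it assumes the usual heatmap layout in which the cell in row i and column j covers the unit square with corner (j, i) in data coordinates.

```python
import numpy as np


def cell_corners(mask):
    """Return the (x, y) corner of every cell where the boolean mask is True.

    Row i, column j of the mask maps to x = j, y = i.
    """
    rows, cols = np.nonzero(np.asarray(mask))
    return [(int(c), int(r)) for r, c in zip(rows, cols)]


def hatch_cells(ax, mask, hatch="///"):
    """Overlay a transparent, hatched unit rectangle on each selected cell."""
    # Imported here so that cell_corners can be used without matplotlib
    from matplotlib.patches import Rectangle

    for x, y in cell_corners(mask):
        ax.add_patch(Rectangle((x, y), 1, 1, fill=False, hatch=hatch,
                               edgecolor="black", linewidth=0))


# Example mask: mark the cells of a 2 x 3 array that are greater than 1
values = np.array([[0.2, 1.5, 0.3],
                   [0.1, 0.4, 2.0]])
corners = cell_corners(values > 1)
```

With the random data from the answer above this would be used as ax = sns.heatmap(x, annot=True) followed by hatch_cells(ax, x > 1), which keeps all values visible and only adds the stippling.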
