Say I have a heat map that looks like this (axes are trimmed off):
I want to be able to alter certain squares to denote statistical significance. I know that I could mask out the squares that are not statistically significant, but I still want to retain that information (and not set the values to zero). Options for doing this include 1) making the text on certain squares bold, 2) adding a hatch-like functionality so that certain squares have stippling, or 3) adding a symbol to certain squares.
What should I do?
One approach is to access the Text objects directly and change their weight/style. The below code will take some sample data and try to make every entry equal to 118 stand out:
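import seaborn as sns

flights = sns.load_dataset("flights")
# recent pandas versions require keyword arguments for pivot
flights = flights.pivot(index="month", columns="year", values="passengers")
ax = sns.heatmap(flights, annot=True, fmt="d")
for text in ax.texts:
    text.set_size(14)
    # emphasize the cells whose annotation reads 118
    if text.get_text() == '118':
        text.set_size(18)
        text.set_weight('bold')
        text.set_style('italic')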
I'm not a matplotlib/seaborn expert, but it appears to me that requiring an individual cell in the heatmap to be hatched would require a bit of work. In short, the heatmap is a Collection of matplotlib Patches, and the hatching of a collection can only be set on the collection as a whole. To set the hatch of an individual cell, you need them to be distinct patches, and things get messy. Perhaps (hopefully) someone more knowledgeable than I can come along and say that this is wrong, and that it's quite easy -- but if I had to guess, I'd say that changing the text style will be easier than setting a hatch.
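One possible workaround for the hatching case, offered as a sketch rather than a confirmed recipe: instead of hatching the heatmap's own collection, overlay a separate unfilled, hatched Rectangle on each cell of interest. This assumes each heatmap cell occupies a unit square with a corner at (column, row). The data and the `significant` mask below are made up for illustration.

```python
import numpy as np
import seaborn as sns
from matplotlib.patches import Rectangle

rng = np.random.default_rng(0)
values = rng.normal(size=(5, 5))      # made-up data
significant = np.abs(values) > 1.0    # hypothetical significance mask

ax = sns.heatmap(values, annot=True, fmt=".1f")

# Overlay one unfilled, hatched rectangle per flagged cell.
# Cell (row, col) covers x in [col, col + 1] and y in [row, row + 1].
for row, col in zip(*np.nonzero(significant)):
    ax.add_patch(Rectangle((col, row), 1, 1, fill=False,
                           hatch="///", edgecolor="black", linewidth=0))
```

The colors and annotations underneath stay visible, so no values are lost. Call plt.show() from matplotlib.pyplot to display the figure.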
You could plot twice, applying a mask to the cells you do not want to emphasize the second time:
import numpy as np
import seaborn as sns
x = np.random.randn(10, 10)
sns.heatmap(x, annot=True)
sns.heatmap(x, mask=x < 1, cbar=False,
            annot=True, annot_kws={"weight": "bold"})
seaborn.histplot takes a keyword argument called multiple with one of the values {'layer', 'dodge', 'stack', 'fill'}. I presume it controls how multiple bars overlap, for example when hue is used, but the examples and documentation don't make it clear when to use which type of multiple. Any information from the community would be helpful!
A picture says more, etc.
from matplotlib import pyplot as plt
import seaborn as sns
penguins = sns.load_dataset("penguins")
fig, axes = plt.subplots(2, 2, figsize=(15, 15))
kws = ["layer", "dodge", "stack", "fill"]
for kw, ax in zip(kws, axes.flat):
    sns.histplot(data=penguins, x="flipper_length_mm", hue="species", multiple=kw, ax=ax)
    ax.set_title(kw)
plt.show()
The docs say "Approach to resolving multiple elements when semantic mapping creates subsets. Only relevant with univariate data." meaning this is only of relevance when categories are plotted within one graph:
layer - overlaid categories (giving rise to ... interesting color combinations)
dodge - categories side by side (not applicable to pure KDE plots for the obvious reasons)
stack - stacked categories
fill - categories add up to 100%.
I think the visible examples are pretty good (in comparison to other "documentation" I have seen).
The default for the parameter "multiple" is layer, which just draws the different sub-histograms on top of each other. This is helpful if you want to compare the shape (skewness, variation, location of the median/average on the x-axis, etc.) of each sub-histogram and to compare the shapes with each other.
stack piles the bars on top of each other. This is most useful if you want to get an idea of the proportions (e.g. which subcategory dominates which region of the x-axis, and a quick look at where a subcategory has gaps).
dodge places the bars of the categories next to each other. This is very similar to stack and in my opinion just a variation with the same intent.
If I remember right, fill will fill the whole plot: the stacked bars in each bin are scaled to the full height, so only proportions remain.
This is not as useful in my opinion, because without the exact shape of each distribution the user could miss some important insights.
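To make the difference between the options concrete, here is a small NumPy sketch of the bar heights each one would roughly produce for two groups. It mirrors the idea only and is not seaborn's actual implementation; the samples and bin edges are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
group_a = rng.normal(0, 1, size=200)   # made-up samples
group_b = rng.normal(2, 1, size=100)

bins = np.linspace(-4, 6, 11)          # shared bin edges for both groups
counts_a, _ = np.histogram(group_a, bins=bins)
counts_b, _ = np.histogram(group_b, bins=bins)

# layer / dodge: each group keeps its own bar height per bin
layer_heights = np.vstack([counts_a, counts_b])

# stack: the second group's bars start where the first group's bars end
stack_tops = np.cumsum(layer_heights, axis=0)

# fill: stacked heights divided by the bin total,
# so every non-empty bin reaches exactly 1
totals = layer_heights.sum(axis=0)
fill_tops = np.divide(stack_tops, totals,
                      out=np.zeros(stack_tops.shape), where=totals > 0)
```

With layer and dodge the absolute counts of each group can be read directly, stack shows the combined count per bin, and fill keeps only the share of each group per bin.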
I am experimenting with JointGrid from Seaborn. I used .plot_joint() to plot my scatter plot, group-colored using the hue parameter. I have filtered my dataset to only include 2 of the 5 groups, to prevent too much overlap in the plots.
The plotted points appear correct, in that they match what I expect from the two groups I chose. Additionally, I double-checked my filtering by viewing the filtered dataframe. That too was correct as it contained only the two groups I chose.
However the legend that is automatically plotted along with the scatterplot is incorrect. It shows 4 groups (not sure why not 5), and the coloring is also incorrect. For 2 groups I would expect only the Red and Blue colors (the first 2 colors in the Set1 palette), but my 2nd group is colored with the 4th color in the Set1 palette.
plt.rcParams['figure.figsize'] = (12, 4)
df_tmp = df[df.Kmeans_Clusters.isin([0, 3])].copy()
# initialize Joint Grid
grid = sns.JointGrid(data=df_tmp, x='MP', y='PTS')
# plot scatter (main plot)
grid = grid.plot_joint(sns.scatterplot, data=df_tmp, hue='Kmeans_Clusters',
                       palette='Set1')
# plot marginal distplot for cluster 0, X & Y
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 0].MP, ax=grid.ax_marg_x,
             vertical=False, color='firebrick', label='Cluster0')
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 0].PTS, ax=grid.ax_marg_y,
             vertical=True, color='firebrick', label='Cluster0')
# plot marginal distplot for cluster 3, X & Y
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 3].MP, ax=grid.ax_marg_x,
             vertical=False, color='steelblue', label='Cluster3')
sns.distplot(df_tmp[df_tmp.Kmeans_Clusters == 3].PTS, ax=grid.ax_marg_y,
             vertical=True, color='steelblue', label='Cluster3')
plt.suptitle('PTS vs MP, Cluster 0 & 3\n1982-2019', y=1.05, fontsize=20)
plt.show()
[Image: JointGrid scatter plot with incorrect legend and coloring]
--- Update---
I just tried this with a simple scatterplot (no JointGrid) and I can repeat my previous observation. Is there just something I am not understanding with the hue parameter and the scatterplot() function?
I do not see this issue with lmplot()
plt.rcParams['figure.figsize'] = (12, 4)
df_tmp = df[df.Kmeans_Clusters.isin([0, 3])].copy()
sns.scatterplot(data=df_tmp, y='PTS', x='MP', hue='Kmeans_Clusters', palette='Set1')
plt.title('PTS vs MP\n1982-2019')
plt.xlabel('Minutes Played Annually')
plt.ylabel('Points Scored Annually')
plt.show()
Again, once I fine-tuned my searches, I was able to find the solution. In fact, here's another Stack Overflow question that asks the same thing and is answered in detail: The `hue` parameter in Seaborn.relplot() skips an integer when given numerical data?.
Pasting the solution I used, as described in the link above:
"""
An alternative is to make sure the values are treated as categorical. Unfortunately, even if you plug in the numbers as strings, they will be converted to numbers, falling back to the same mechanism described above. This may be seen as a bug.
However, one choice you have is to use real categories, e.g. single letters.
'cluster':list("ABCDE")
works fine.
"""
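Applied to the filtered data frame from the question, the idea could look roughly like the sketch below. The data values, the `cluster_label` column and the letter mapping are made up for illustration; only the column names MP, PTS and Kmeans_Clusters come from the question.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the filtered data frame from the question
df_tmp = pd.DataFrame({
    "MP": [100, 250, 400, 1200, 1500, 1800],
    "PTS": [30, 90, 150, 700, 900, 1100],
    "Kmeans_Clusters": [0, 0, 0, 3, 3, 3],
})

# Hypothetical letter labels, so that hue is treated as categorical
letters = {0: "A", 3: "D"}
df_tmp["cluster_label"] = df_tmp["Kmeans_Clusters"].map(letters)

ax = sns.scatterplot(data=df_tmp, x="MP", y="PTS",
                     hue="cluster_label", palette="Set1")

# Collect the legend entries to check that only the two groups appear
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
```

Because the hue values are now letters, the legend should list only the groups that are actually present, and the palette colors are assigned in order.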
I am using a bit of code to run and generate reports with Python. This code takes information from an online survey tool and runs basic statistics on the data then generates a word document based on the results. I am creating a number of graphs along the way. I have the following function to help me build some of the histograms.
import numpy as np
import matplotlib.pyplot as plt

def histogram_by(df, df_column, sort_by, height):
    """
    df = location of the data
    df_column = column in the data frame with the required data
    sort_by = the column used to categorize the data
    height = the calculated height of the subplots, changes depending on number of plots
    """
    # generate_subplots is a helper defined elsewhere (not shown)
    f, ax = generate_subplots(df[sort_by].nunique(), height)
    df[df_column].hist(
        ax=ax,
        by=df[sort_by],
        xrot=360,
        bins=np.linspace(1, 5, 9))
    plt.tight_layout()
    plt.savefig('plt.png')
So the first picture shows what the graphs look like when there is enough data to force integers. This happens in most cases.
In the second picture there is not enough data to force the Y-Axis to make integers, so it creates floats. It also appears that the graphs in this version are a bit wider in comparison to the 'correct' output. Any ideas?
The amount of data changes based on how many people answered the surveys. Is there any way to force the Y-Axis to use integers instead of defaulting to floats?
Thanks for taking the time to help me out.
Best,
Chris
First create a minimal example of the issue.
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4,1.7))
data = np.random.randint(1,9, size=52)
ax.hist(data, bins=np.arange(0,9)+0.5, ec="k")
plt.show()
Now, you can get rid of the decimals on the y axis by telling the default AutoLocator to use only integers. Add this line before plt.show():
ax.locator_params(axis='y', integer=True)
Result: the y axis now shows only integer ticks.
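For the multi-panel histograms in the question, the same idea would have to be applied to every subplot. A minimal sketch, assuming the axes come back from the pandas hist call as an array; it uses MaxNLocator(integer=True) from matplotlib.ticker, and the small data frame is made up.

```python
import numpy as np
import pandas as pd
from matplotlib.ticker import MaxNLocator

# Made-up survey-like data: few answers per group, so counts stay small
df = pd.DataFrame({
    "score": [1, 2, 2, 3, 4, 5, 5, 1, 3],
    "group": list("aaabbbccc"),
})

# One histogram per group; pandas returns an array of Axes
axes = df["score"].hist(by=df["group"], bins=np.linspace(1, 5, 9))

# Restrict the y-axis ticks of every subplot to integers
for ax in np.ravel(axes):
    ax.yaxis.set_major_locator(MaxNLocator(integer=True))
```

Setting the locator explicitly on each Axes avoids depending on which locator the subplot happens to use by default.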
In a classifieds website I maintain, I'm comparing classifieds that receive greater-than-median views vs classifieds that are below median in this criterion. I call the former "high performance" classifieds. Here's a simple countplot showing this:
The hue is simply the number of photos the classified had.
My question is - is there a plot type in seaborn or matplotlib which shows proportions instead of absolute counts?
I essentially want the same countplot, but with each bar as a % of the total items in that particular category. For example, notice that in the countplot, classifieds with 3 photos make up a much larger proportion of the high perf category. It takes a while to glean that information. If each bar's height was instead represented by its % contribution to its category, it'd be a much easier comparison. That's why I'm looking for what I'm looking for.
An illustrative example would be great.
Instead of trying to find a special case plotting function that would do exactly what you want, I would suggest to consider keeping data generation and visualization separate. At the end what you want is to plot a bar graph of some values, so the idea would be to generate the data in such a way that they can easily be plotted.
To this end, you may crosstab the two columns in question and divide each row (or column) in the resulting table by its sum. This table can then easily be plotted using the pandas plotting wrapper.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
import pandas as pd
plt.rcParams["figure.figsize"] = 5.6, 7.0
n = 100
df = pd.DataFrame({"performance": np.random.choice([0,1], size=n, p=[0.7,0.3]),
"photo" : np.random.choice(range(4), size=n, p=[0.6,0.1,0.2,0.1]),
"someothervalue" : np.random.randn(n) })
fig, (ax,ax2, ax3) = plt.subplots(nrows=3)
freq = pd.crosstab(df["performance"],df["photo"])
freq.plot(kind="bar", ax=ax)
relative = freq.div(freq.sum(axis=1), axis=0)
relative.plot(kind="bar", ax=ax2)
relative = freq.div(freq.sum(axis=0), axis=1)
relative.plot(kind="bar", ax=ax3)
ax.set_title("countplot of absolute frequency")
ax2.set_title("barplot of relative frequency by performance")
ax3.set_title("barplot of relative frequency by photo")
for a in [ax, ax2, ax3]: a.legend(title="Photo", loc=6, bbox_to_anchor=(1.02,0.5))
plt.subplots_adjust(right=0.8,hspace=0.6)
plt.show()
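As a shorter variant of the division step, pd.crosstab can do the normalization itself through its normalize argument. The sketch below uses made-up data with the same column names as above and produces the same two relative tables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "performance": rng.choice([0, 1], size=n, p=[0.7, 0.3]),
    "photo": rng.choice(range(4), size=n, p=[0.6, 0.1, 0.2, 0.1]),
})

# Absolute counts, as before
freq = pd.crosstab(df["performance"], df["photo"])

# Each row sums to 1: share of each photo count within a performance class
by_performance = pd.crosstab(df["performance"], df["photo"], normalize="index")

# Each column sums to 1: share of each performance class within a photo count
by_photo = pd.crosstab(df["performance"], df["photo"], normalize="columns")
```

Either table can then be plotted with the pandas wrapper, for example by_performance.plot(kind="bar").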
The pairplot function from seaborn lets you plot pairwise relationships in a dataset.
According to the documentation (highlight added):
By default, this function will create a grid of Axes such that each variable in data will by shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.
It is also possible to show a subset of variables or plot different variables on the rows and columns.
I could find only one example of subsetting different variables for rows and columns, here (it's the 6th plot under the Plotting pairwise relationships with PairGrid and pairplot() section). As you can see, it's plotting many independent variables (x_vars) against the same single dependent variable (y_vars) and the results are pretty nice.
I'm trying to do the same plotting a single independent variable against many dependent ones.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
ages = np.random.gamma(6,3, size=50)
data = pd.DataFrame({"age": ages,
"weight": 80*ages**2/(ages**2+10**2)*np.random.normal(1,0.2,size=ages.shape),
"height": 1.80*ages**5/(ages**5+12**5)*np.random.normal(1,0.2,size=ages.shape),
"happiness": (1-ages*0.01*np.random.normal(1,0.3,size=ages.shape))})
pp = sns.pairplot(data=data,
                  x_vars=['age'],
                  y_vars=['weight', 'height', 'happiness'])
The problem is that the subplots get arranged vertically, and I couldn't find a way to change it.
I know that the tiling would then not be so neat, as the y-axis would have to be labeled on every subplot. Also, I know I could build the plots by hand with something like this:
fig, axes = plt.subplots(ncols=3)
for i, yvar in enumerate(['weight', 'height', 'happiness']):
    axes[i].scatter(data['age'], data[yvar])
Still, I'm learning to use seaborn and I find its interface very convenient, so I wonder if there's a way. Also, this example is pretty simple, but for more complex datasets seaborn handles many more things for you that would make the raw-matplotlib approach much more complex quite quickly (hue, to start).
You can achieve what it seems you are looking for by swapping the variable names passed to the x_vars and y_vars parameters. So revisiting the sns.pairplot portion of your code:
pp = sns.pairplot(data=data,
                  y_vars=['age'],
                  x_vars=['weight', 'height', 'happiness'])
Note that all I've done here is swap x_vars for y_vars. The plots should now be displayed horizontally:
The x-axis will now be unique to each plot with a common y-axis determined by the age column.
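If age has to stay on the x-axis, a different route (not the one from the answer above) is to reshape the data to long form and let seaborn facet by variable. This is a sketch under that assumption, with simplified made-up data; the names `measure` and `value` are hypothetical.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
ages = rng.gamma(6, 3, size=50)
data = pd.DataFrame({
    "age": ages,
    "weight": 80 * ages**2 / (ages**2 + 10**2),
    "height": 1.80 * ages**5 / (ages**5 + 12**5),
})

# Long form: one row per (age, measure) pair
long = data.melt(id_vars="age", var_name="measure", value_name="value")

# One column of the grid per measure, each with its own y-axis scale
g = sns.relplot(data=long, x="age", y="value", col="measure",
                facet_kws={"sharey": False})
```

The panels are laid out in a single row, and further semantics such as hue can be passed to relplot in the usual way.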