Overlayed seaborn distplots sharing x axis - python

Sorry if this is too basic; this is my first question on the forum.
I'm using the Titanic dataset for practice and
I'm trying to plot two distributions of the variable 'Age': one with only the passengers who survived and another with the passengers who perished. For some reason, they don't share the same x-axis when plotted together.
Here's my code so far:
dfage = df[df['Age'].notnull()]
dfage_survived = dfage[dfage.Survived == 1]
dfage_perished = dfage[dfage.Survived == 0]
sns.set(style="white", palette="muted", color_codes=True)
fig = plt.figure(constrained_layout=True, figsize=(8, 8))
spec = fig.add_gridspec(3, 2)
ax1 = fig.add_subplot(spec[0, 0])
ax1 = sns.barplot(x='Sex', y = 'Survived', data =df)
ax2 = fig.add_subplot(spec[0, 1])
ax2 = sns.barplot(x='Embarked', y = 'Survived', data =df)
ax3 = fig.add_subplot(spec[1, 0])
ax3 = sns.barplot(x='Pclass', y ='Survived', data =df)
ax4 = fig.add_subplot(spec[1, 1])
ax4 = sns.barplot(x='SibSp', y ='Survived', data=df)
ax5 = fig.add_subplot(spec[2, :])
ax5_1 = sns.distplot(dfage_survived['Age'], kde = False, label = 'Survived')
ax5_2 = sns.distplot(dfage_perished['Age'], kde = False, label = 'Perished')
plt.legend(prop={'size': 12})
OUTPUT:

You must set bins for each sns.distplot call. Otherwise seaborn chooses the bins itself from the minimum and maximum of the data passed in, and since these differ between the perished and survived groups, the bars won't line up. Use the bins parameter to pass the same bins to both calls (see https://seaborn.pydata.org/generated/seaborn.distplot.html).
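As a minimal sketch of the shared-bins idea, the snippet below draws two overlaid histograms with plain matplotlib using one common array of bin edges. The two age arrays are synthetic stand-ins (an assumption, not the Titanic data), and the 5-year bin width is only an illustrative choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two groups (hypothetical data, not the Titanic ages)
ages_survived = rng.integers(1, 81, size=200)
ages_perished = rng.integers(5, 75, size=300)

# One set of bin edges covering both groups: 0, 5, 10, ..., 85
shared_bins = np.arange(0, 86, 5)

fig, ax = plt.subplots(figsize=(8, 3))
# Passing the same edges to both calls makes the bars line up
_, edges_a, _ = ax.hist(ages_survived, bins=shared_bins, alpha=0.5, label='Survived')
_, edges_b, _ = ax.hist(ages_perished, bins=shared_bins, alpha=0.5, label='Perished')
ax.legend(prop={'size': 12})
```

The same shared_bins array could be passed as bins= to each sns.distplot or sns.histplot call.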

The histogram bins divide the range between the smallest and largest x value into equal parts, and the two sets have different minimum and maximum values. Moreover, your data is discrete, so the bin boundaries are best placed between the integer values. The bins can be set explicitly: sns.distplot(..., bins=np.arange(-0.5, 86, 5)) for both.
A simpler approach, however, is to use seaborn's hue= parameter, which lets seaborn divide the groups and create both histograms in one go.
Note that sns.distplot has been replaced by sns.histplot as of seaborn 0.11. If you want both histograms stacked, you can add the parameter multiple='stack'.
To obtain a stand-alone example, the code below uses the standard Seaborn Titanic dataset, which uses the column names in lowercase.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
df = sns.load_dataset('titanic')
sns.set(style="white", palette="muted", color_codes=True)
fig = plt.figure(constrained_layout=True, figsize=(8, 3))
spec = fig.add_gridspec(1, 2)
ax5 = fig.add_subplot(spec[0, :])
sns.histplot(df, x='age', bins=np.arange(-0.5, 86, 5), kde=False, hue='survived', legend=True, ax=ax5)
ax5.legend(['Yes', 'No'], title='Survived?', prop={'size': 12})
plt.show()


Matplotlib stacked histogram numpy.ndarray error

I am trying to make a stacked histogram using matplotlib by looping through the categories in the dataframe and assigning the bar color based on a dictionary.
I get this error on the ax1.hist() call. How should I fix it?
AttributeError: 'numpy.ndarray' object has no attribute 'hist'
Reproducible Example
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
plt.style.use('seaborn-whitegrid')
y = [1,5,9,2,4,2,5,6,1]
cat = ['A','B','B','B','A','B','B','B','B']
df = pd.DataFrame(list(zip(y,cat)), columns =['y', 'cat'])
fig, axes = plt.subplots(3,3, figsize=(5,5), constrained_layout=True)
fig.suptitle('Histograms')
ax1 = axes[0]
mycolorsdict = {'A':'magenta', 'B':'blue'}
for key, batch in df.groupby(['cat']):
    ax1.hist(batch.y, label=key, color=mycolorsdict[key],
             density=False, cumulative=False, edgecolor='black',
             orientation='horizontal', stacked=True)
Updated effort, still not working
This is close, but it is not stacking (should see stacks at y=5); I think maybe because of the loop?
mycolorsdict = {'A':'magenta', 'B':'blue'}
for ii, ax in enumerate(axes.flat):
    for key, batch in df.groupby(['cat']):
        ax.hist(batch.y,
                label=key, color=mycolorsdict[key], density=False, edgecolor='black',
                cumulative=False, orientation='horizontal', stacked=True)
To draw on a specific subplot, two indices are needed (row, column), so axes[0,0] for the first subplot. The error message comes from using ax1 = axes[0] instead of ax1 = axes[0,0].
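A quick way to see the difference is to inspect what each kind of indexing returns. This is only an illustrative sketch; the variable names row and first are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(3, 3)

row = axes[0]        # one index: a 1-D numpy array holding the 3 Axes of the first row
first = axes[0, 0]   # two indices: a single Axes object

print(type(row).__name__, row.shape)  # the array has no .hist method
print(hasattr(first, 'hist'))         # the Axes does
```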
Now, to create a stacked histogram via ax.hist(), all the y-data need to be provided at the same time. The code below shows how this can be done starting from the result of groupby. Also note, that when your values are discrete, it is important to explicitly set the bin boundaries making sure that the values fall precisely between these boundaries. Setting the boundaries at the halves is one way.
Things can be simplified a lot using seaborn's histplot(). Here is a breakdown of the parameters used:
data=df the dataframe
y='y' gives the dataframe column for histogram. Use x= (instead of y=) for a vertical histogram.
hue='cat' gives the dataframe column used to create multiple groups
palette=mycolorsdict; the palette defines the coloring; there are many ways to assign a palette, one of which is a dictionary on the hue values
discrete=True: when working with discrete data, seaborn sets the appropriate bin boundaries
multiple='stack' creates a stacked histogram, depending on the hue categories
alpha=1: default seaborn sets an alpha of 0.75; optionally this can be changed
ax=axes[0, 1]: draw on the 2nd subplot of the 1st row
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')  # named 'seaborn-whitegrid' before matplotlib 3.6
y = [1, 5, 9, 2, 4, 2, 5, 6, 1]
cat = ['A', 'B', 'B', 'B', 'A', 'B', 'B', 'B', 'B']
df = pd.DataFrame({'y':y, 'cat':cat})
fig, axes = plt.subplots(3, 3, figsize=(20, 10), constrained_layout=True)
fig.suptitle('Histograms')
mycolorsdict = {'A': 'magenta', 'B': 'blue'}
# group by the column name (not a one-element list) so the keys are plain strings
groups = df.groupby('cat')
axes[0, 0].hist([batch.y for _, batch in groups],
                label=[key for key, _ in groups], color=[mycolorsdict[key] for key, _ in groups], density=False,
                edgecolor='black',
                cumulative=False, orientation='horizontal', stacked=True, bins=np.arange(0.5, 10))
axes[0, 0].legend()
sns.histplot(data=df, y='y', hue='cat', palette=mycolorsdict, discrete=True, multiple='stack', alpha=1, ax=axes[0, 1])
plt.show()
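To illustrate why the bin boundaries matter for discrete values, this small sketch counts the example's y values with np.histogram, once with edges at the halves and once with the default binning. It only uses numpy; the variable names are chosen for the example.

```python
import numpy as np

y = np.array([1, 5, 9, 2, 4, 2, 5, 6, 1])  # the discrete values from the example

# Edges at the halves: 0.5, 1.5, ..., 9.5, so each integer 1..9 sits in the middle of its own bin
edges = np.arange(0.5, 10)
counts, _ = np.histogram(y, bins=edges)
print(counts)

# Default binning for comparison: 10 equal-width bins between min(y) and max(y),
# so the edges land on non-integer positions such as 1.8 and 2.6
default_counts, default_edges = np.histogram(y)
print(default_edges)
```

With the edges at the halves, every bin corresponds to exactly one integer value, which is what makes the stacked bars line up.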

How to remove or hide y-axis ticklabels from a matplotlib / seaborn plot

I made a plot that looks like this
I want to turn off the ticklabels along the y axis. And to do that I am using
plt.tick_params(labelleft=False, left=False)
And now the plot looks like this. Even though the labels are turned off the scale 1e67 still remains.
Turning off the scale 1e67 would make the plot look better. How do I do that?
seaborn is used to draw the plot, but it's just a high-level API for matplotlib.
The functions called to remove the y-axis labels and ticks are matplotlib methods.
After creating the plot, use .set().
.set(yticklabels=[]) should remove tick labels.
This doesn't work if you use .set_title(), but you can use .set(title='')
.set(ylabel=None) should remove the axis label.
.tick_params(left=False) will remove the ticks.
Similarly, for the x-axis: How to remove or hide x-axis labels from a seaborn / matplotlib plot?
Tested in python 3.11, pandas 1.5.2, matplotlib 3.6.2, seaborn 0.12.1
Example 1
import seaborn as sns
import matplotlib.pyplot as plt
# load data
exercise = sns.load_dataset('exercise')
pen = sns.load_dataset('penguins')
# create figures
fig, ax = plt.subplots(2, 1, figsize=(8, 8))
# plot data
g1 = sns.boxplot(x='time', y='pulse', hue='kind', data=exercise, ax=ax[0])
g2 = sns.boxplot(x='species', y='body_mass_g', hue='sex', data=pen, ax=ax[1])
plt.show()
Remove Labels
fig, ax = plt.subplots(2, 1, figsize=(8, 8))
g1 = sns.boxplot(x='time', y='pulse', hue='kind', data=exercise, ax=ax[0])
g1.set(yticklabels=[]) # remove the tick labels
g1.set(title='Exercise: Pulse by Time for Exercise Type') # add a title
g1.set(ylabel=None) # remove the axis label
g2 = sns.boxplot(x='species', y='body_mass_g', hue='sex', data=pen, ax=ax[1])
g2.set(yticklabels=[])
g2.set(title='Penguins: Body Mass by Species for Gender')
g2.set(ylabel=None) # remove the y-axis label
g2.tick_params(left=False) # remove the ticks
plt.tight_layout()
plt.show()
Example 2
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# sinusoidal sample data
sample_length = range(1, 1+1) # number of columns of frequencies
rads = np.arange(0, 2*np.pi, 0.01)
data = np.array([(np.cos(t*rads)*10**67) + 3*10**67 for t in sample_length])
df = pd.DataFrame(data.T, index=pd.Series(rads.tolist(), name='radians'), columns=[f'freq: {i}x' for i in sample_length])
df.reset_index(inplace=True)
# plot
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot('radians', 'freq: 1x', data=df)
# or skip the previous two lines and plot df directly
# ax = df.plot(x='radians', y='freq: 1x', figsize=(8, 8), legend=False)
Remove Labels
# plot
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot('radians', 'freq: 1x', data=df)
# or skip the previous two lines and plot df directly
# ax = df.plot(x='radians', y='freq: 1x', figsize=(8, 8), legend=False)
ax.set(yticklabels=[]) # remove the tick labels
ax.tick_params(left=False) # remove the ticks
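If you would rather keep the tick_params call from the question, another option (not part of the answer above, so treat it as a sketch) is to hide the offset text, the "1e67" label, directly through the axis object. The data here is a made-up cosine curve scaled to the same magnitude.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rads = np.arange(0, 2 * np.pi, 0.01)
values = np.cos(rads) * 1e67 + 3e67  # large values, so matplotlib adds a 1e67 offset label

fig, ax = plt.subplots(figsize=(8, 8))
ax.plot(rads, values)

# Hide tick marks and tick labels, as in the question
ax.tick_params(labelleft=False, left=False)
# Hide the scale/offset text at the top of the y-axis
ax.yaxis.get_offset_text().set_visible(False)
```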

Shift bar locations on multi-bar bar plot

Much searching has not yielded a working solution to a Python matplotlib problem. I'm sure I'm missing something simple...
MWE:
import pandas as pd
import matplotlib.pyplot as plt
#MWE plot
T = [1, 2, 3, 4, 5, 6]
n = len(T)
d1 = list(zip([500]*n, [250]*n))
d2 = list(zip([250]*n, [125]*n))
df1 = pd.DataFrame(data=d1, index=T)
df2 = pd.DataFrame(data=d2, index=T)
fig = plt.figure()
ax = fig.add_subplot(111)
df1.plot(kind='bar', stacked=True, align='edge', width=-0.4, ax=ax)
df2.plot(kind='bar', stacked=True, align='edge', width=0.4, ax=ax)
plt.show()
Generates:
Shifted Plot
No matter what parameters I play around with, that first bar is cut off on the left. If I only plot a single bar (i.e. not clusters of bars), the bars are not cut off and in fact there is nice even white space on both sides.
I hard-coded the data for this MWE; however, I am trying to find a generic way to ensure the correct alignment since I will likely produce a LOT of these plots with varying numbers of items on the x axis and potentially a varying number of bars in each cluster.
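How do I shift the bars so that they are spaced correctly on the x axis with even white space?
It all depends on the width that you use in your plots. The bars drawn with align='edge' and a negative width extend 0.4 to the left of the first tick position, which is outside the x-limits, so the first bar gets cut off. Set the limits explicitly with xlim.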
import pandas as pd
import matplotlib.pyplot as plt
#MWE plot
T = [1, 2, 3, 4, 5, 6]
n = len(T)
d1 = list(zip([500]*n, [250]*n))
d2 = list(zip([250]*n, [125]*n))
df1 = pd.DataFrame(data=d1, index=T)
df2 = pd.DataFrame(data=d2, index=T)
fig = plt.figure()
ax = fig.add_subplot(111)
df1.plot(kind='bar', stacked=True, align='edge', width=-0.4, ax=ax)
df2.plot(kind='bar', stacked=True, align='edge', width=0.4, ax=ax)
plt.xlim(-.4,5.4)
plt.show()
Hope it works!
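For a more generic version, the limits can be computed from the number of clusters and the bar width instead of being hard-coded. This is a sketch under the assumption that pandas places the bar clusters at positions 0 to n-1 (as in the plot above); the pad variable is a hypothetical extra margin, not something from the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

T = [1, 2, 3, 4, 5, 6]
n = len(T)
df1 = pd.DataFrame(data=list(zip([500] * n, [250] * n)), index=T)
df2 = pd.DataFrame(data=list(zip([250] * n, [125] * n)), index=T)

width = 0.4  # width of each bar in a cluster
pad = 0.2    # hypothetical extra white space on both sides

fig, ax = plt.subplots()
df1.plot(kind='bar', stacked=True, align='edge', width=-width, ax=ax)
df2.plot(kind='bar', stacked=True, align='edge', width=width, ax=ax)

# Clusters sit at 0 .. n-1 and reach `width` to each side of their position
ax.set_xlim(-width - pad, (n - 1) + width + pad)
```

With pad = 0 this reproduces the hard-coded plt.xlim(-.4, 5.4) for six clusters.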

How to make xticks evenly spaced despite their value?

I am trying to generate a plot with x-axis being a geometric sequence while the y axis is a number between 0.0 and 1.0. My code looks like this:
from matplotlib import pyplot as plt
plt.xticks(X)
plt.plot(X,Y)
plt.show()
which generates a plot like this:
As you can see, I am explicitly setting the x-axis ticks to the ones belonging to the geometric sequence.
My question: Is it possible to make the x-ticks evenly spaced regardless of their value? The initial terms of the sequence are small and crowded together. I am looking for something like a logarithmic scale, which would be ideal for powers of a base but not, I think, for a geometric sequence, as is the case here.
You can do it by plotting your variable as a function of the "natural" variable that parametrizes your curve. For example:
import numpy as np
import matplotlib.pyplot as plt
n = 12
a = np.arange(n)
x = 2**a
y = np.random.rand(n)
fig = plt.figure(1, figsize=(7,7))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
ax1.plot(x,y)
ax1.xaxis.set_ticks(x)
ax2.plot(a, y) #we plot y as a function of a, which parametrizes x
ax2.xaxis.set_ticks(a) #set the ticks to be a
ax2.xaxis.set_ticklabels(x) # change the ticks' names to x
which produces:
I had the same problem and spent several hours trying to find something appropriate. But it appears to be really easy and you do not need to make any parameterization or play with some x-ticks positions, etc.
The only thing you need to do is just to plot your x-values as str, not int: plot(x.astype('str'), y)
By modifying the code from the previous answer you will get:
n = 12
a = np.arange(n)
x = 2**a
y = np.random.rand(n)
fig = plt.figure(1, figsize=(7,7))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
ax1.plot(x,y)
ax1.xaxis.set_ticks(x)
ax2.plot(x.astype('str'), y)
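The reason this works is that matplotlib treats string x-values as categories and places them at consecutive integer positions. The following sketch (variable names are illustrative) checks the tick positions after plotting the string-converted values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

n = 12
x = 2 ** np.arange(n)   # geometric sequence 1, 2, 4, ..., 2048
y = np.random.rand(n)

fig, ax = plt.subplots()
ax.plot(x.astype(str), y)  # strings are plotted as categories

# The categories are placed at 0, 1, ..., n-1, hence the even spacing
tick_positions = ax.get_xticks()
print(tick_positions)
```

One caveat of this approach is that the x-axis is no longer numeric, so the distances between points carry no meaning.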
Seaborn has several categorical plots that handle this kind of task natively,
such as pointplot:
sns.pointplot(x="x", y="y", data=df, ax=ax)
Example
fig, [ax1, ax2] = plt.subplots(2, figsize=(7,7))
sns.lineplot(data=df, x="x", y="y", ax=ax1) #relational plot
sns.pointplot(data=df, x="x", y="y", ax=ax2) #categorical plot
If you are using a pandas DataFrame:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
n = 12
df = pd.DataFrame(dict(
    X=2**np.arange(n),
    Y=np.random.randint(1, 9, size=n),
)).set_index('X')
# index is reset in order to use as xticks
df.reset_index(inplace=True)
fig = plt.figure()
ax1 = plt.subplot(111)
df['Y'].plot(kind='bar', ax=ax1, figsize=(7, 7), use_index=True)
# set_ticklabels used to place original indexes
ax1.xaxis.set_ticklabels(df['X'])
Convert the int values to str:
X = list(map(str, X))
plt.xticks(X)
plt.plot(X,Y)
plt.show()

Histogram with Boxplot above in Python

Hi, I want to draw a histogram with a boxplot on top of it showing Q1, Q2 and Q3 as well as the outliers. An example photo is below. (I am using Python and pandas.)
I have checked several examples using matplotlib.pyplot but could not come up with a good result. I also want the histogram curve to appear as in the image below.
I also tried seaborn, which gave me the density curve along with the histogram, but I didn't find a way to combine it with a boxplot above it.
Can anyone help me achieve this with matplotlib.pyplot?
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks")
x = np.random.randn(100)
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True,
                                    gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(x, ax=ax_box)
sns.distplot(x, ax=ax_hist)
ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
From seaborn v0.11.2, sns.distplot is deprecated. Use sns.histplot for axes-level plots instead.
np.random.seed(2022)
x = np.random.randn(100)
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(x=x, ax=ax_box)
sns.histplot(x=x, bins=12, kde=True, stat='density', ax=ax_hist)
ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
Solution using only matplotlib, just because:
# start the plot: 2 rows, because we want the boxplot on the first row
# and the hist on the second
fig, ax = plt.subplots(
    2, figsize=(7, 5), sharex=True,
    gridspec_kw={"height_ratios": (.3, .7)}  # the boxplot gets 30% of the vertical space
)
# the boxplot
ax[0].boxplot(data, vert=False)
# removing borders
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[0].spines['left'].set_visible(False)
# the histogram
ax[1].hist(data)
# and we are good to go
plt.show()
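Since the question asks for Q1, Q2 and Q3, the matplotlib-only layout can be extended by computing the quartiles with np.percentile and marking them on the histogram. This is a sketch with a random placeholder sample (the data variable and the seed are assumptions), not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2022)
data = rng.normal(size=100)  # placeholder sample

# Quartiles Q1, Q2 (median) and Q3
q1, q2, q3 = np.percentile(data, [25, 50, 75])

fig, (ax_box, ax_hist) = plt.subplots(
    2, figsize=(7, 5), sharex=True,
    gridspec_kw={"height_ratios": (.3, .7)}
)
bp = ax_box.boxplot(data, vert=False)      # horizontal boxplot on top
ax_hist.hist(data, bins=12, density=True)  # histogram below, sharing the x-axis

# Mark the quartiles on the histogram
for q in (q1, q2, q3):
    ax_hist.axvline(q, color='k', linestyle='--')

# The median line drawn by the boxplot sits at Q2
median_x = bp['medians'][0].get_xdata()[0]
```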
Expanding on the answer from @mwaskom, I made a little adaptable function.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def histogram_boxplot(data, xlabel=None, title=None, font_scale=2, figsize=(9, 8), bins=None):
    """ Boxplot and histogram combined
    data: 1-d data array
    xlabel: xlabel
    title: title
    font_scale: the scale of the font (default 2)
    figsize: size of fig (default (9,8))
    bins: number of bins (default None / auto)
    example use: histogram_boxplot(np.random.rand(100), bins = 20, title="Fancy plot")
    """
    sns.set(font_scale=font_scale)
    # two rows sharing the x-axis: a thin boxplot on top, the histogram below
    f2, (ax_box2, ax_hist2) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)}, figsize=figsize)
    sns.boxplot(data, ax=ax_box2)
    # only pass bins when a value was given, otherwise let seaborn choose
    if bins:
        sns.distplot(data, ax=ax_hist2, bins=bins)
    else:
        sns.distplot(data, ax=ax_hist2)
    if xlabel: ax_hist2.set(xlabel=xlabel)
    if title: ax_box2.set(title=title)
    plt.show()
histogram_boxplot(np.random.randn(100), bins=20, title="Fancy plot", xlabel="Some values")
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    f, (ax_box, ax_hist) = plt.subplots(nrows=2, sharex=True, gridspec_kw={'height_ratios': (.25, .75)}, figsize=figsize)
    sns.distplot(feature, kde=False, ax=ax_hist, bins=bins)
    sns.boxplot(feature, ax=ax_box, color='Red')
    ax_hist.axvline(np.mean(feature), color='g', linestyle='-')
    ax_hist.axvline(np.median(feature), color='y', linestyle='--')
