Matplotlib stacked histogram numpy.ndarray error - python

I am trying to make a stacked histogram using matplotlib by looping through the categories in the dataframe and assigning the bar color based on a dictionary.
I get this error on the ax1.hist() call. How should I fix it?
AttributeError: 'numpy.ndarray' object has no attribute 'hist'
Reproducible Example
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
plt.style.use('seaborn-whitegrid')
y = [1,5,9,2,4,2,5,6,1]
cat = ['A','B','B','B','A','B','B','B','B']
df = pd.DataFrame(list(zip(y,cat)), columns =['y', 'cat'])
fig, axes = plt.subplots(3,3, figsize=(5,5), constrained_layout=True)
fig.suptitle('Histograms')
ax1 = axes[0]
mycolorsdict = {'A':'magenta', 'B':'blue'}
for key, batch in df.groupby(['cat']):
    ax1.hist(batch.y, label=key, color=mycolorsdict[key],
             density=False, cumulative=False, edgecolor='black',
             orientation='horizontal', stacked=True)
Updated effort, still not working
This is close, but it is not stacking (should see stacks at y=5); I think maybe because of the loop?
mycolorsdict = {'A':'magenta', 'B':'blue'}
for ii, ax in enumerate(axes.flat):
    for key, batch in df.groupby(['cat']):
        ax.hist(batch.y,
                label=key, color=mycolorsdict[key], density=False, edgecolor='black',
                cumulative=False, orientation='horizontal', stacked=True)

To draw on a specific subplot, two indices are needed (row, column), so axes[0,0] for the first subplot. The error message comes from using ax1 = axes[0] instead of ax1 = axes[0,0].
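As a minimal sketch of why the error appears (the printed checks are my own addition, not part of the original answer): plt.subplots(3, 3) returns a 2D numpy array of axes, so a single index selects a whole row, which is itself an array and has no hist method.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 3)

# axes is a 3x3 numpy array of Axes objects
print(type(axes), axes.shape)

# One index selects a whole row, which is still a numpy array
row = axes[0]
print(type(row), hasattr(row, 'hist'))

# Two indices (row, column) select a single Axes, which does have .hist()
ax1 = axes[0, 0]
print(type(ax1).__name__, hasattr(ax1, 'hist'))
ax1.hist([1, 5, 9, 2, 4, 2, 5, 6, 1])
```

Add plt.show() at the end to display the figure.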
Now, to create a stacked histogram via ax.hist(), all the y-data need to be provided in a single call. The code below shows how this can be done starting from the result of groupby. Also note that when your values are discrete, it is important to set the bin boundaries explicitly so that each value falls strictly between two boundaries. Placing the boundaries at the half-integers is one way.
Things can be simplified a lot using seaborn's histplot(). Here is a breakdown of the parameters used:
data=df the dataframe
y='y' gives the dataframe column for histogram. Use x= (instead of y=) for a vertical histogram.
hue='cat' gives the dataframe column used to create multiple groups
palette=mycolorsdict; the palette defines the coloring; there are many ways to assign a palette, one of which is a dictionary on the hue values
discrete=True: when working with discrete data, seaborn sets the appropriate bin boundaries
multiple='stack' creates a stacked histogram, depending on the hue categories
alpha=1: by default seaborn sets an alpha of 0.75; optionally this can be changed
ax=axes[0, 1]: draw on the 2nd subplot of the 1st row
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-whitegrid')  # this style was named 'seaborn-whitegrid' before matplotlib 3.6
y = [1, 5, 9, 2, 4, 2, 5, 6, 1]
cat = ['A', 'B', 'B', 'B', 'A', 'B', 'B', 'B', 'B']
df = pd.DataFrame({'y':y, 'cat':cat})
fig, axes = plt.subplots(3, 3, figsize=(20, 10), constrained_layout=True)
fig.suptitle('Histograms')
mycolorsdict = {'A': 'magenta', 'B': 'blue'}
# group by a single column name (not a list) so that the keys are plain strings
groups = df.groupby('cat')
axes[0, 0].hist([batch.y for _, batch in groups],
                label=[key for key, _ in groups],
                color=[mycolorsdict[key] for key, _ in groups],
                density=False, edgecolor='black', cumulative=False,
                orientation='horizontal', stacked=True, bins=np.arange(0.5, 10))
axes[0, 0].legend()
sns.histplot(data=df, y='y', hue='cat', palette=mycolorsdict, discrete=True, multiple='stack', alpha=1, ax=axes[0, 1])
plt.show()
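To see what the half-integer bin boundaries do, here is a small sketch that counts the question's y values with np.histogram. Using np.histogram for inspection is my own choice for illustration; ax.hist() accepts the same bins argument.

```python
import numpy as np

y = [1, 5, 9, 2, 4, 2, 5, 6, 1]

# Edges at 0.5, 1.5, ..., 9.5: every integer 1..9 sits in the middle of its own bin
edges = np.arange(0.5, 10)
counts, _ = np.histogram(y, bins=edges)

# Print how often each integer value occurs
for value, count in zip(range(1, 10), counts):
    print(value, count)
```

With these edges each bar corresponds to exactly one integer value, so the value 5 (which occurs twice) ends up in a single bin.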

Related

Aligning subplots with a pyplot barplot and seaborn heatmap

I am attempting to place a Seaborn time-based heatmap on top of a bar chart, indicating the number of patients in each bin/timeframe. I can successfully make an individual heatmap and bar plot, but combining the two does not work as intended.
import pandas as pd
import numpy as np
import seaborn as sb
from matplotlib import pyplot as plt
# Mock data
patient_counts = [650, 28, 8]
missings_df = pd.DataFrame(np.array([[-15.8, 600/650, 580/650, 590/650],
                                     [488.2, 20/23, 21/23, 21/23],
                                     [992.2, 7/8, 8/8, 8/8]]),
                           columns=['time', 'Resp. (/min)', 'SpO2', 'Blood Pressure'])
missings_df.set_index('time', inplace=True)
# Plot heatmap
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(26, 16), sharex=True, gridspec_kw={'height_ratios': [5, 1]})
sb.heatmap(missings_df.T, cmap="Blues", cbar_kws={"shrink": .8}, ax=ax1, xticklabels=False)
plt.xlabel('Time (hours)')
# Plot line graph under heatmap to show nr. of patients in each bin
x_ticks = [time for time in missings_df.index]
ax2.bar([i for i, _ in enumerate(x_ticks)], patient_counts, align='center')
plt.xticks([i for i, _ in enumerate(x_ticks)], x_ticks)
plt.show()
This code gives me the graph below. As you can see, there are two issues:
The bar plot extends too far
The first and second bar are not aligned with the top graph, where the tick of the first plot does not line up with the centre of the bar either.
I've tried looking online but could not find a good resource to fix the issues. Any ideas?
A problem is that the colorbar takes away space from the heatmap, making its plot narrower than the bar plot. You can create a 2x2 grid to make room for the colorbar, and remove the empty subplot. Change sharex=True to sharex='col' to prevent the colorbar getting the same x-axis as the heatmap.
Another problem is that the heatmap has its cell borders at positions 0, 1, 2, ..., so their centers are at 0.5, 1.5, 2.5, .... You can put the bars at these centers instead of at their default positions:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
fig, ((ax1, cbar_ax), (ax2, dummy_ax)) = plt.subplots(nrows=2, ncols=2, figsize=(26, 16), sharex='col',
                                                      gridspec_kw={'height_ratios': [5, 1], 'width_ratios': [20, 1]})
missings_df = np.random.rand(3, 3)
sns.heatmap(missings_df.T, cmap="Blues", cbar_ax=cbar_ax, xticklabels=False, linewidths=2, ax=ax1)
ax2.set_xlabel('Time (hours)')
patient_counts = np.random.randint(10, 50, 3)
x_ticks = ['Time1', 'Time2', 'Time3']
x_tick_pos = [i + 0.5 for i in range(len(x_ticks))]
ax2.bar(x_tick_pos, patient_counts, align='center')
ax2.set_xticks(x_tick_pos)
ax2.set_xticklabels(x_ticks)
dummy_ax.axis('off')
plt.tight_layout()
plt.show()
PS: Be careful not to mix matplotlib's "functional" (pyplot) interface with its "object-oriented" interface. For example, avoid plt.xlabel(), because it is not obvious that it applies to the "current" ax (ax2 in the code of the question); ax2.set_xlabel() makes the target explicit.
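A small sketch of that difference, using label texts I made up for illustration: after plt.subplots(nrows=2), the "current" axes is the last one created, so plt.xlabel() labels the bottom subplot, while ax1.set_xlabel() states its target explicitly.

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(nrows=2)

# pyplot call: applies to whatever the "current" axes happens to be
plt.xlabel('set via pyplot')

# object-oriented call: the target axes is explicit
ax1.set_xlabel('set via ax1')

current = plt.gca()
print(current is ax2)
print(repr(ax1.get_xlabel()), repr(ax2.get_xlabel()))
```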

Overlayed seaborn distplots sharing x axis

Sorry if this is too basic, this is my first question to the forum.
I'm using the Titanic dataset for practice and
I'm trying to plot two distributions of the variable 'Age', one with only the passengers who survived and another with the passengers who perished. But for some reason, they don't share the same x-axis when plotted together.
Here's my code so far:
dfage = df[df['Age'].notnull()]
dfage_survived = dfage[dfage.Survived == 1]
dfage_perished = dfage[dfage.Survived == 0]
sns.set(style="white", palette="muted", color_codes=True)
fig = plt.figure(constrained_layout=True, figsize=(8, 8))
spec = fig.add_gridspec(3, 2)
ax1 = fig.add_subplot(spec[0, 0])
ax1 = sns.barplot(x='Sex', y = 'Survived', data =df)
ax2 = fig.add_subplot(spec[0, 1])
ax2 = sns.barplot(x='Embarked', y = 'Survived', data =df)
ax3 = fig.add_subplot(spec[1, 0])
ax3 = sns.barplot(x='Pclass', y ='Survived', data =df)
ax4 = fig.add_subplot(spec[1, 1])
ax4 = sns.barplot(x='SibSp', y ='Survived', data=df)
ax5 = fig.add_subplot(spec[2, :])
ax5_1 = sns.distplot(dfage_survived['Age'], kde = False, label = 'Survived')
ax5_2 = sns.distplot(dfage_perished['Age'], kde = False, label = 'Perished')
plt.legend(prop={'size': 12})
OUTPUT:
You must set bins for each sns.distplot call. Otherwise seaborn chooses the bins for you based on the minimum and maximum of the data passed in, and since these differ between perished and survived, the bars won't line up. Use the bins parameter to set appropriate bins (see https://seaborn.pydata.org/generated/seaborn.distplot.html).
By default, the bins of the histogram divide the range between the smallest and largest x into equal parts, and the two sets have different minimum and maximum values. Moreover, your data is discrete, so the bin boundaries are best placed in between the integer values. The bins can be set explicitly: sns.distplot(..., bins=np.arange(-0.5, 86, 5)) for both.
A simpler approach, however, is to make use of Seaborn's hue= parameter to make seaborn take care of dividing the groups and creating both histograms in one go.
Note that sns.distplot has been deprecated in favor of sns.histplot since seaborn 0.11. If you want both histograms stacked, you can add the parameter multiple='stack'.
To obtain a stand-alone example, the code below uses the standard Seaborn Titanic dataset, which uses the column names in lowercase.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
df = sns.load_dataset('titanic')
sns.set(style="white", palette="muted", color_codes=True)
fig = plt.figure(constrained_layout=True, figsize=(8, 3))
spec = fig.add_gridspec(1, 2)
ax5 = fig.add_subplot(spec[0, :])
sns.histplot(df, x='age', bins=np.arange(-0.5, 86, 5), kde=False, hue='survived', legend=True, ax=ax5)
ax5.legend(['Yes', 'No'], title='Survived?', prop={'size': 12})
plt.show()
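The explicit-bins approach can also be sketched with two separate histogram calls. This is only an illustration under my own assumptions: the ages are randomly generated stand-ins (not the Titanic data), and plain ax.hist() is used in place of the deprecated sns.distplot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up ages with different minima and maxima for the two groups
age_survived = rng.integers(1, 81, size=100)
age_perished = rng.integers(5, 75, size=150)

# One shared set of bin edges, placed between the integer ages
bins = np.arange(-0.5, 86, 5)

fig, ax = plt.subplots(figsize=(8, 3))
_, edges_s, _ = ax.hist(age_survived, bins=bins, alpha=0.5, label='Survived')
_, edges_p, _ = ax.hist(age_perished, bins=bins, alpha=0.5, label='Perished')
ax.legend(prop={'size': 12})

# Both histograms report the same edges, so their bars line up
print(np.array_equal(edges_s, edges_p))
```

Because both calls receive the same bins array, the bars share identical boundaries regardless of the range of each group.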

Shift bar locations on multi-bar bar plot

Much searching has not yielded a working solution to a Python matplotlib problem. I'm sure I'm missing something simple...
MWE:
import pandas as pd
import matplotlib.pyplot as plt
#MWE plot
T = [1, 2, 3, 4, 5, 6]
n = len(T)
d1 = list(zip([500]*n, [250]*n))
d2 = list(zip([250]*n, [125]*n))
df1 = pd.DataFrame(data=d1, index=T)
df2 = pd.DataFrame(data=d2, index=T)
fig = plt.figure()
ax = fig.add_subplot(111)
df1.plot(kind='bar', stacked=True, align='edge', width=-0.4, ax=ax)
df2.plot(kind='bar', stacked=True, align='edge', width=0.4, ax=ax)
plt.show()
Generates:
[Image: Shifted Plot]
No matter what parameters I play around with, that first bar is cut off on the left. If I only plot a single bar (i.e. not clusters of bars), the bars are not cut off and in fact there is nice even white space on both sides.
I hard-coded the data for this MWE; however, I am trying to find a generic way to ensure the correct alignment since I will likely produce a LOT of these plots with varying numbers of items on the x axis and potentially a varying number of bars in each cluster.
How do I shift the bars so that they are spaced correctly on the x axis with even white space?
It all depends on the width that you use in your plots. Set the x-limits explicitly with xlim:
import pandas as pd
import matplotlib.pyplot as plt
#MWE plot
T = [1, 2, 3, 4, 5, 6]
n = len(T)
d1 = list(zip([500]*n, [250]*n))
d2 = list(zip([250]*n, [125]*n))
df1 = pd.DataFrame(data=d1, index=T)
df2 = pd.DataFrame(data=d2, index=T)
fig = plt.figure()
ax = fig.add_subplot(111)
df1.plot(kind='bar', stacked=True, align='edge', width=-0.4, ax=ax)
df2.plot(kind='bar', stacked=True, align='edge', width=0.4, ax=ax)
plt.xlim(-.4,5.4)
plt.show()
Hope it works!
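For a more generic variant, the limits could be derived from the number of categories and the bar width instead of being hard-coded. This is a sketch under my own assumptions: that pandas places the categories at positions 0 to n-1 and that each cluster spans one bar width on either side of its position; the pad value is a hypothetical choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

T = [1, 2, 3, 4, 5, 6]
n = len(T)
df1 = pd.DataFrame(list(zip([500] * n, [250] * n)), index=T)
df2 = pd.DataFrame(list(zip([250] * n, [125] * n)), index=T)

bar_width = 0.4  # width of one bar; a cluster is assumed to span +/- bar_width
pad = 0.1        # hypothetical extra white space on each side

fig, ax = plt.subplots()
df1.plot(kind='bar', stacked=True, align='edge', width=-bar_width, ax=ax)
df2.plot(kind='bar', stacked=True, align='edge', width=bar_width, ax=ax)

# Set the limits after both plot calls, since each call resets them
ax.set_xlim(-(bar_width + pad), (n - 1) + bar_width + pad)
print(ax.get_xlim())
```

Changing T or bar_width then adjusts the limits without editing any numbers by hand.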

Matplotlib: how to give xticks values from a list

I have the following code:
import matplotlib.pyplot as plt
import numpy as np
xticks = ['A','B','C']
Scores = np.array([[5,7],[4,6],[8,3]])
colors = ['red','blue']
fig, ax = plt.subplots()
ax.hist(Scores,bins=3,density=True,histtype='bar',color=colors)
plt.show()
Which gives the following output:
I have two questions:
How can I make the height of bars represent the values in Scores e.g. the left most red column should be of height 5 and left most blue column should be of height 7, and so on.
How can I assign values across x-axis from xticks list e.g. the left two columns should have 'A' written under them, the next two 'B' and so on.
You are confusing a histogram with a bar plot. What you want here is a bar plot. If you want to use pandas, this is going to be very easy:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
xticks = ['A','B','C']
Scores = np.array([[5,7],[4,6],[8,3]])
colors = ['red','blue']
names = ["Cat", "Dog"]
fig, ax = plt.subplots()
pd.DataFrame(Scores, index=xticks, columns=names).plot.bar(color=colors, ax=ax)
plt.show()
If using matplotlib alone, it's slightly more complicated, because each column needs to be plotted independently:
import matplotlib.pyplot as plt
import numpy as np
xticks = ['A','B','C']
Scores = np.array([[5,7],[4,6],[8,3]])
colors = ['red','blue']
names = ["Cat", "Dog"]
fig, ax = plt.subplots()
x = np.arange(len(Scores))
ax.bar(x-0.2, Scores[:,0], color=colors[0], width=0.4, label=names[0])
ax.bar(x+0.2, Scores[:,1], color=colors[1], width=0.4, label=names[1])
ax.set(xticks=x, xticklabels=xticks)
ax.legend()
plt.show()
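If there are more than two columns, the separate ax.bar calls could be written as a loop. The sketch below is my own generalization, not part of the original answer; total_width is a hypothetical parameter for the space each group of bars occupies.

```python
import numpy as np
import matplotlib.pyplot as plt

xticks = ['A', 'B', 'C']
Scores = np.array([[5, 7], [4, 6], [8, 3]])
colors = ['red', 'blue']
names = ["Cat", "Dog"]

n_groups, n_series = Scores.shape
total_width = 0.8                    # hypothetical: width taken up by one group
bar_width = total_width / n_series   # 0.4 for two series, as in the code above
x = np.arange(n_groups)

fig, ax = plt.subplots()
for j in range(n_series):
    # Center the group of bars on each tick position
    offset = (j - (n_series - 1) / 2) * bar_width
    ax.bar(x + offset, Scores[:, j], width=bar_width, color=colors[j], label=names[j])
ax.set_xticks(x)
ax.set_xticklabels(xticks)
ax.legend()
```

For two series this reproduces the offsets of -0.2 and +0.2 used above.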
You already did a lot of the work for the histogram. Now you just need some bar plots.
import matplotlib.pyplot as plt
import numpy as np
xticks = ['A','B','C']
Scores = np.array([[5,7],[4,6],[8,3]])
colors = ['red','blue']
fig, ax = plt.subplots()
# Width of bars
w=.2
# Plot both separately
ax.bar([1,2,3],Scores[:,0],width=w,color=colors[0])
ax.bar(np.add([1,2,3],w),Scores[:,1],width=w,color=colors[1])
# Assumes you want ticks in the middle
ax.set_xticks(ticks=np.add([1,2,3],w/2))
ax.set_xticklabels(xticks)
plt.show()
plt.xticks(range(0, 6), ('A', 'A', 'B', 'B', 'C', 'C')) would work to answer question part 2 I believe. I'm not sure about the heights, as I haven't made histograms.

Label Points in Seaborn lmplot (python) with multiple plots

I'm trying to add labels to each data point in my lmplot. I want to label each data point by an index. Right now my code is the following:
p1 = sns.lmplot(x="target", y="source", col="color", hue="color",
                data=ddf, col_wrap=2, ci=None, palette="muted",
                scatter_kws={"s": 50, "alpha": 1})
def label_point(x, y, val, ax):
    a = pd.concat({'x': x, 'y': y, 'val': val}, axis=1)
    for i, point in a.iterrows():
        ax.text(point['x']+.02, point['y'], str(point['val']))
label_point(ddf.target, ddf.source, ddf.chip, plt.gca())
This plots all the labels onto the last plot.
[Image: lmplot with labels]
I tried label_point(ddf.target, ddf.source, ddf.chip, plt.gcf()) instead to use the whole figure rather than the current axes but then it throws an error.
ValueError: Image size of 163205x147206 pixels is too large.
It must be less than 2^16 in each direction.
The problem is, how should the labeling function know which plot to label, if the entire dataset is passed to it?!
As an example, you can use pandas' .groupby to loop through the unique colors and create a seaborn.regplot for each of them. Then it's easy to label each axes individually.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
import pandas as pd
import seaborn as sns
def label_point(df, ax):
    for i, point in df.iterrows():
        ax.annotate("{:.1f}".format(point['val']), xy = (point['x'], point['y']),
                    xytext=(2,-2), textcoords="offset points")
df = pd.DataFrame({"x": np.sort(np.random.rand(50)),
                   "y": np.cumsum(np.random.randn(50)),
                   "val" : np.random.randint(10,31, size=50),
                   "color" : np.random.randint(0,3,size=50 )})
colors = ["crimson", "indigo", "limegreen"]
fig, axes = plt.subplots(2,2, sharex=True, sharey=True)
for (c, grp), ax in zip(df.groupby("color"), axes.flat):
    sns.regplot(x="x", y="y", data=grp, color=colors[c], ax=ax,
                scatter_kws={"s": 25, "alpha": 1})
    label_point(grp, ax)
axes.flatten()[-1].remove()
plt.show()
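If you would rather keep lmplot, another possibility is to loop over the axes of the FacetGrid it returns and label each subplot with only its own subset of the data. This is a sketch under assumptions of mine: it relies on the FacetGrid's axes_dict attribute (available in seaborn 0.11 and later), and the data is randomly generated for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(42)
# Made-up data; the column names follow the example above
df = pd.DataFrame({"x": np.sort(rng.random(50)),
                   "y": np.cumsum(rng.standard_normal(50)),
                   "val": rng.integers(10, 31, size=50),
                   "color": rng.integers(0, 3, size=50)})

g = sns.lmplot(x="x", y="y", col="color", hue="color", data=df,
               col_wrap=2, ci=None, scatter_kws={"s": 25, "alpha": 1})

# axes_dict maps each value of the col variable to its subplot
for color_value, ax in g.axes_dict.items():
    subset = df[df["color"] == color_value]
    for row in subset.itertuples():
        ax.annotate(str(row.val), xy=(row.x, row.y),
                    xytext=(2, -2), textcoords="offset points")
```

Each subplot then only receives the labels of the points drawn on it.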
