I have the following dataframe and would like to differentiate the minor decimal differences in each "step" with a different color scheme in a heatmap.
Sample data:
Sample Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8
A 64.847 54.821 20.897 39.733 23.257 74.942 75.945
B 64.885 54.767 20.828 39.613 23.093 74.963 75.928
C 65.036 54.772 20.939 39.835 23.283 74.944 75.871
D 64.869 54.740 21.039 39.889 23.322 74.925 75.894
E 64.911 54.730 20.858 39.608 23.101 74.956 75.930
F 64.838 54.749 20.707 39.394 22.984 74.929 75.941
G 64.887 54.781 20.948 39.748 23.238 74.957 75.909
H 64.903 54.720 20.783 39.540 23.028 74.898 75.911
I 64.875 54.761 20.911 39.695 23.082 74.897 75.866
J 64.839 54.717 20.692 39.377 22.853 74.849 75.939
K 64.857 54.736 20.934 39.699 23.130 74.880 75.903
L 64.754 54.746 20.777 39.536 22.991 74.877 75.902
M 64.798 54.811 20.963 39.824 23.187 74.886 75.895
An example of what I am looking for:
My first approach would be based on a figure with multiple subplots. The number of plots would equal the number of columns in your dataframe; the gap between the plots can be shrunk to zero:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# one colormap per column
cm = ['Blues', 'Reds', 'Greens', 'Oranges', 'Purples', 'bone', 'winter']
f, axs = plt.subplots(1, df.columns.size, gridspec_kw={'wspace': 0})
for i, (s, a, c) in enumerate(zip(df.columns, axs, cm)):
    sns.heatmap(np.array([df[s].values]).T, yticklabels=df.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c, cbar=False)
    # keep the row labels only on the leftmost subplot
    if i>0:
        a.yaxis.set_ticks([])
Result:
Not sure if this will lead to a helpful or even self describing visualization of data, but that's your choice - perhaps this helps to start...
Supplemental:
Regarding adding the colorbars: of course you can. But - besides not knowing the background of your data and the purpose of the visualization - I'd like to add some thoughts on all that:
First: adding all those colorbars as a separate bunch of bars on one side or below the heatmap is probably possible, but I find it already quite hard to read the data, plus: you already have all those annotations - it would mess all up I think.
Additionally: in the meantime @ImportanceOfBeingErnest provided such a beautiful solution on that topic that adding it here would not be too meaningful, imo.
Second: if you really want to stick to the heatmap thing, perhaps splitting up and giving every column its colorbar would suit better:
cm = ['Blues', 'Reds', 'Greens', 'Oranges', 'Purples', 'bone', 'winter']
f, axs = plt.subplots(1, df.columns.size, figsize=(10, 3))
for i, (s, a, c) in enumerate(zip(df.columns, axs, cm)):
    sns.heatmap(np.array([df[s].values]).T, yticklabels=df.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c)
    if i>0:
        a.yaxis.set_ticks([])
f.tight_layout()
However, all that said - I dare to doubt that this is the best visualization for your data. Of course, I don't know what you want to say, see or find with these plots, but that's the point: if the visualization type would fit to the needs, I guess I'd know (or at least could imagine).
Just for example:
A simple df.plot() results in
and I feel that this tells more about different characteristics of your columns within some tenths of a second than the heatmap.
Or are you explicitly after the differences to each column's mean?
(df - df.mean()).plot()
... or the distribution of each column around them?
(df - df.mean()).boxplot()
What I want to say: data visualization becomes powerful when a plot begins to tell something about the underlying data before you begin/have to explain anything...
I suppose the problem can be divided into several parts.
Getting several heatmaps with different colormaps into the same picture. This can be done by masking the complete array column-wise, plotting each masked array separately via imshow and applying a different colormap. To visualize the concept:
Obtaining a variable number of distinct colormaps. Matplotlib provides a large number of colormaps; however, they generally differ a lot in luminosity and saturation. Here it seems desirable to have colormaps of differing hue, but otherwise the same saturation and luminosity.
An option is to create the colormaps on the fly, choosing n different (and equally spaced) hues, and create a colormap using the same saturation and luminosity.
Obtaining a distinct colorbar for each column. Since the values within columns might be on totally different scales, a colorbar for each column would be needed to know the values shown, e.g. in the first column the brightest color may correspond to a value of 1, while in the second column it may correspond to a value of 100. Several colorbars can be created inside the axes of a GridSpec which is placed next to the actual heatmap axes. The number of columns and rows of that gridspec would depend on the number of columns in the dataframe.
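The masking idea from the first point can be tried on a tiny array before any plotting. This is only a sketch with made-up numbers; the names data and per_column are mine and not part of the full solution:

```python
import numpy as np

# Small 3x4 array standing in for the dataframe values (made-up numbers).
data = np.arange(12, dtype=float).reshape(3, 4)

# premask[r, c] == c: every cell holds its own column index.
premask = np.tile(np.arange(data.shape[1]), data.shape[0]).reshape(data.shape)

# One masked array per column; everything outside column i is masked.
per_column = [np.ma.array(data, mask=premask != i) for i in range(data.shape[1])]

# Only the values of column 1 stay visible; imshow leaves masked cells blank,
# so each of these arrays can be drawn with its own colormap on the same axes.
print(per_column[1])
```

Because each masked array only contributes its own column, the colormap limits of every image are taken from that column alone.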
In total this may then look as follows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.gridspec import GridSpec
def get_hsvcmap(i, N, rot=0.):
    nsc = 24
    chsv = mcolors.rgb_to_hsv(plt.cm.hsv(((np.arange(N)/N)+rot) % 1.)[i,:3])
    rhsv = mcolors.rgb_to_hsv(plt.cm.Reds(np.linspace(.2,1,nsc))[:,:3])
    arhsv = np.tile(chsv,nsc).reshape(nsc,3)
    arhsv[:,1:] = rhsv[:,1:]
    rgb = mcolors.hsv_to_rgb(arhsv)
    return mcolors.LinearSegmentedColormap.from_list("",rgb)
def columnwise_heatmap(array, ax=None, **kw):
    ax = ax or plt.gca()
    premask = np.tile(np.arange(array.shape[1]), array.shape[0]).reshape(array.shape)
    images = []
    for i in range(array.shape[1]):
        col = np.ma.array(array, mask = premask != i)
        im = ax.imshow(col, cmap=get_hsvcmap(i, array.shape[1], rot=0.5), **kw)
        images.append(im)
    return images
### Create some dataset
ind = list("ABCDEFGHIJKLM")
m = len(ind)
n = 8
df = pd.DataFrame(np.random.randn(m,n) + np.random.randint(20,70,n),
index=ind, columns=[f"Step {i}" for i in range(2,2+n)])
### Plot data
fig, ax = plt.subplots(figsize=(8,4.5))
ims = columnwise_heatmap(df.values, ax=ax, aspect="auto")
ax.set(xticks=np.arange(len(df.columns)), yticks=np.arange(len(df)),
xticklabels=df.columns, yticklabels=df.index)
ax.tick_params(bottom=False, top=False,
labelbottom=False, labeltop=True, left=False)
### Optionally add colorbars.
fig.subplots_adjust(left=0.06, right=0.65)
rows = 3
cols = len(df.columns) // rows + int(len(df.columns)%rows > 0)
gs = GridSpec(rows, cols)
gs.update(left=0.7, right=0.95, wspace=1, hspace=0.3)
for i, im in enumerate(ims):
    cax = fig.add_subplot(gs[i//cols, i % cols])
    fig.colorbar(im, cax = cax)
    cax.set_title(df.columns[i], fontsize=10)
plt.show()
Related
My dataset is in the form of :
Data[0] = [headValue,x0,x1,..xN]
Data[1] = [headValue_ya,ya0,ya1,..yaN]
Data[2] = [headValue_yb,yb0,yb1,..ybN]
...
Data[n] = [headvalue_yz,yz0,yz1,..yzN]
I want to plot f(y*) = x, so I can visualize all Lineplots in the same figure with different colors, each color determined by the headervalue_y*.
I also want to add a colorbar whose colors match the lines, and therefore the header values, so we can visually link each header value to the behaviour it leads to.
Here is what I am aiming for :(Plot from Lacroix B, Letort G, Pitayu L, et al. Microtubule Dynamics Scale with Cell Size to Set Spindle Length and Assembly Timing. Dev Cell. 2018;45(4):496–511.e6. doi:10.1016/j.devcel.2018.04.022)
I have trouble adding the colorbar. I have tried to extract N colors from a colormap (N is my number of different headValues, i.e. the number of columns minus 1) and then to assign the corresponding color to each line plot. Here is my code to clarify:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
Data = [['Time',0,0.33,..200],[0.269,4,4.005,...11],[0.362,4,3.999,...16.21],...[0.347,4,3.84,...15.8]]
headValues = [0.269,0.362,0.335,0.323,0.161,0.338,0.341,0.428,0.245,0.305,0.305,0.314,0.299,0.395,0.32,0.437,0.203,0.41,0.392,0.347]
# the differents headValues_y* of each column here in a list but also in Data
# with headValue[0] = Data[1][0], headValue[1] = Data[2][0] ...
cmap = mpl.cm.get_cmap('rainbow') # I choose my colormap
rgba = [] # the color container
for value in headValues:
rgba.append(cmap(value)) # so rgba will contain a different color for each headValue
fig, (ax,ax1) = plt.subplots(2,1) # creating my figure and two axes to put the Lines and the colorbar
c = 0 # index for my colors
for i in range(1, len(Data)):
ax.plot( Data[0][1:], Data[i][1:] , color = rgba[c])
# Data[0][1:] is x, Data[i][1:] is y, and the color associated with Data[i][0]
c += 1
fig.colorbar(mpl.cm.ScalarMappable(cmap= mpl.colors.ListedColormap(rgba)), cax=ax1, orientation='horizontal')
# here I create my scalarMappable for my lineplot and with the previously selected colors 'rgba'
plt.show()
The current result:
How to add the colorbar on the side or the bottom of the first axis ?
How to properly add a scale to this colorbar corresponding to the different headValues ?
How to make the colorbar scale and colors match to the different lines on the plot with the link One color = One headValue ?
I have tried to work with scatterplot which are more convenient to use with scalarMappable but no solution allows me to do all these things at once.
Here is a possible approach. As the 'headValues' aren't sorted, nor equally spaced and one is even used twice, it is not fully clear what the most-desired result would be.
Some remarks:
The standard way of creating a colorbar in matplotlib doesn't need a separate subplot. Matplotlib will shrink the existing plot a bit and put the colorbar next to it (or below it for a horizontal bar).
Converting the 'headValues' to a numpy array allows for compact code, e.g. writing rgba = cmap(headValues) directly calculates the complete array.
Calling cmap on unchanged values will map 0 to the lowest color and 1 to the highest color, so for values only between 0.16 and 0.44 they all will be mapped to quite similar colors. One approach is to create a norm to map 0.16 to the lowest color and 0.44 to the highest. In code: norm = plt.Normalize(headValues.min(), headValues.max()) and then calculate rgba = cmap(norm(headValues)).
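To make the last two remarks concrete, here is a small sketch. It uses only four of the head values and dummy curves (the same square-root shape as the test data further down), and the names head_values, raw_colors and scaled_colors are just illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

# A few of the head values from the question.
head_values = np.array([0.269, 0.362, 0.161, 0.437])

cmap = plt.get_cmap("rainbow")
norm = Normalize(vmin=head_values.min(), vmax=head_values.max())

# Without the norm, the raw values only cover a narrow slice of the colormap.
raw_colors = cmap(head_values)
# With the norm, 0.161 maps to the first color and 0.437 to the last one.
scaled_colors = cmap(norm(head_values))

fig, ax = plt.subplots()
x = np.linspace(0, 200, 50)
for value, color in zip(head_values, scaled_colors):
    ax.plot(x, np.sqrt(4 * x) * value + 4, color=color)  # dummy curves

# Passing ax=ax lets matplotlib shrink that axes and put the colorbar next to it.
cbar = fig.colorbar(ScalarMappable(norm=norm, cmap=cmap), ax=ax)
cbar.set_label("head value")
```

Call plt.show() afterwards to display the figure. The same norm is used for the lines and for the colorbar, which is what keeps the two in sync.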
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
headValues = np.array([0.269, 0.362, 0.335, 0.323, 0.161, 0.338, 0.341, 0.428, 0.245, 0.305, 0.305, 0.314, 0.299, 0.395, 0.32, 0.437, 0.203, 0.41, 0.392, 0.347])
x = np.linspace(0, 200, 500)
# create Data similar to the data in the question
Data = [['Time'] + list(x)] + [[val] + list(np.sqrt(4 * x) * val + 4) for val in headValues]
headValues = np.array([d[0] for d in Data[1:]])
order = np.argsort(headValues)
inverse_order = np.argsort(order)
cmap = plt.get_cmap('rainbow')  # mpl.cm.get_cmap is deprecated and missing from recent matplotlib releases
rgba = cmap(np.linspace(0, 1, len(headValues))) # evenly spaced colors
fig, ax = plt.subplots(1, 1)
for i in range(1, len(Data)):
    ax.plot(Data[0][1:], Data[i][1:], color=rgba[inverse_order[i-1]])
    # Data[0][1:] is x, Data[i][1:] is y, and the color is associated with Data[i][0]
# ax=ax tells matplotlib which axes to shrink to make room for the colorbar
cbar = fig.colorbar(mpl.cm.ScalarMappable(cmap=mpl.colors.ListedColormap(rgba)), ax=ax, orientation='vertical',
                    ticks=np.linspace(0, 1, len(rgba) * 2 + 1)[1::2])
cbar.set_ticklabels(headValues[order])
plt.show()
Alternatively, the colors can be assigned using their position in the colormap, but without creating a ListedColormap:
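cmap = plt.get_cmap('rainbow')  # mpl.cm.get_cmap is deprecated and missing from recent matplotlib releases
norm = plt.Normalize(headValues.min(), headValues.max())
fig, ax = plt.subplots(1, 1)
for i in range(1, len(Data)):
    ax.plot(Data[0][1:], Data[i][1:], color=cmap(norm(Data[i][0])))
# ax=ax tells matplotlib which axes to shrink to make room for the colorbar
cbar = fig.colorbar(mpl.cm.ScalarMappable(cmap=cmap, norm=norm), ax=ax)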
To get ticks for each of the 'headValues', these ticks can be set explicitly. As putting a label for each tick will result in overlapping text, labels that are too close to other labels can be replaced by an empty string:
headValues.sort()
cbar2 = fig.colorbar(mpl.cm.ScalarMappable(cmap=cmap, norm=norm), ax=ax, ticks=headValues)
# nxt is the next-larger head value; a label is dropped when it sits too close to it
cbar2.set_ticklabels([val if val < nxt - 0.007 else '' for val, nxt in zip(headValues[:-1], headValues[1:])]
                     + [headValues[-1]])
At the left the result of the first approach (colors in segments), at the right the alternative colorbars (color depending on value):
Currently, I'm working on an introductory paper on data manipulation and such; however... the CSV I'm working on has some things I wish to do a scatter graph on!
I want a scatter graph to show me the volume sold on certain items as well as their average price, differentiating all data according to their region (Through colours I assume).
So what I want is to know if I can add the region column as a quantitative value
or if there's a way to make this possible...
It's my first time using Python and I'm confused way too often
I'm not sure if this is what you mean, but here is some working code, assuming you have data in the format of [(country, volume, price), ...]. If not, you can change the inputs to the scatter method as needed.
import random
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
n_countries = 50
# get the data into "countries", for example
countries = ...
# in this example: countries is [('BS', 21, 25), ('WZ', 98, 25), ...]
df = pd.DataFrame(countries)
# arbitrary method to get a color
def get_color(i, max_i):
    cmap = plt.get_cmap('Spectral')  # matplotlib.cm.get_cmap is deprecated and missing from recent releases
    return cmap(i/max_i)
# get the figure and axis - make a larger figure to fit more points
# add labels for metric names
def get_fig_ax():
    fig = plt.figure(figsize=(14,14))
    ax = fig.add_subplot(1, 1, 1)
    ax.set_xlabel('volume')
    ax.set_ylabel('price')
    return fig, ax
# switch around the assignments depending on your data
def get_x_y_labels():
    x = df[1]
    y = df[2]
    labels = df[0]
    return x, y, labels
offset = 1 # offset just so annotations aren't on top of points
x, y, labels = get_x_y_labels()
fig, ax = get_fig_ax()
# add a point and annotation for each of the labels/regions
for i, region in enumerate(labels):
    ax.annotate(region, (x[i] + offset, y[i] + offset))
    # note that you must use "label" for "legend" to work
    ax.scatter(x[i], y[i], color=get_color(i, len(x)), label=region)
# Add the legend just outside of the plot.
# The .1, 0 at the end will put it outside
ax.legend(loc='upper right', bbox_to_anchor=(1, 1, .1, 0))
plt.show()
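If several rows share the same region, another option is to treat the region as a category and draw one scatter call per region. The following is only a sketch: the column names region, volume and avg_price and all the numbers are made up, so adapt them to the actual CSV.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up rows; the column names are assumptions, not taken from the original CSV.
df = pd.DataFrame({
    "region": ["North", "South", "North", "West", "South", "West"],
    "volume": [120, 80, 150, 60, 95, 70],
    "avg_price": [1.10, 1.35, 1.05, 1.60, 1.30, 1.55],
})

fig, ax = plt.subplots()
# One scatter call per region: matplotlib takes the next color of its cycle
# for each call, and the label feeds the legend.
for region, group in df.groupby("region"):
    ax.scatter(group["volume"], group["avg_price"], label=region)

ax.set_xlabel("volume")
ax.set_ylabel("average price")
ax.legend(title="region")
```

This way the region never has to be converted into a number; it only decides which group (and therefore which color) a point belongs to. Call plt.show() to display the figure.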
I have a synthetic dataset with 1000 noisy polygons of various orders and sin/cos curves that I can plot as lines using python seaborn.
Since I have quite a few lines that are overlapping, I'd like to plot some sort of heatmap or histogram of my line graphs.
I've tried iterating over the columns and aggregating the counts to use seaborn's heatmap graph, but with many lines this takes quite a while.
The next best thing that results in what I want was a hexbin graph (with seaborn jointgraph).
But it's a compromise between runtime and granularity (the shown graph has gridsize 750). I couldn't find any other graph-type for my problem. But I also don't know exactly what it might be called.
I've also tried with line alpha set to 0.2. This results in a similar graph to what I want. But it's less precise (if more than 5 lines overlap at the same point I already have zero transparency left). Also, it misses the typical coloration of heatmaps.
(Moot search terms were: heatmap, 2D line histogram, line histogram, density plots...)
Does anybody know packages to plot this more efficiently and high(er) quality or knows how to do it with the popular python plotters (i.e. the matplotlib family: matplotlib, seaborn, bokeh). I'm really fine with any package though.
It took me a while, but I finally solved this using Datashader. If using a notebook, the plots can be embedded into interactive Bokeh plots, which looks really nice.
Anyhow, here is the code for static images, in case someone else is in need of something similar:
# coding: utf-8
import time
import numpy as np
from numpy.polynomial import polynomial
import pandas as pd
import matplotlib.pyplot as plt
import datashader as ds
import datashader.transfer_functions as tf
plt.style.use("seaborn-v0_8-whitegrid")  # named "seaborn-whitegrid" before matplotlib 3.6
def create_data():
# ...
# Each column is one data sample
df = create_data()
# Following will append a nan-row and reshape the dataframe into two columns, with each sample stacked on top of each other
# THIS IS CRUCIAL TO OPTIMIZE SPEED: https://github.com/bokeh/datashader/issues/286
# Append row with nan-values
df = pd.concat([df, pd.DataFrame([np.array([np.nan] * len(df.columns))], columns=df.columns, index=[np.nan])])  # DataFrame.append was removed in pandas 2.0
# Reshape
x, y = df.shape
arr = df.to_numpy().reshape((x * y, 1), order='F')  # as_matrix() was removed in pandas 1.0
df_reshaped = pd.DataFrame(arr, columns=list('y'), index=np.tile(df.index.values, y))
df_reshaped = df_reshaped.reset_index()
df_reshaped.columns.values[0] = 'x'
# Plotting parameters
x_range = (min(df.index.values), max(df.index.values))
y_range = (df.min().min(), df.max().max())
w = 1000
h = 750
dpi = 150
cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=h, plot_width=w)
# Aggregate data
t0 = time.time()
aggs = cvs.line(df_reshaped, 'x', 'y', ds.count())
print("Time to aggregate line data: {}".format(time.time()-t0))
# One colored plot
t1 = time.time()
stacked_img = tf.Image(tf.shade(aggs, cmap=["darkblue", "darkblue"]))
print("Time to create stacked image: {}".format(time.time() - t1))
# Save
f0 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax0 = f0.add_subplot(111)
ax0.imshow(stacked_img.to_pil())
ax0.grid(False)
f0.savefig("stacked.png", bbox_inches="tight", dpi=dpi)
# Heat map - This uses a equalized histogram (built-in default), there are other options, though.
t2 = time.time()
heatmap_img = tf.Image(tf.shade(aggs, cmap=plt.cm.Spectral_r))
print("Time to create stacked image: {}".format(time.time() - t2))
# Save
f1 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax1 = f1.add_subplot(111)
ax1.imshow(heatmap_img.to_pil())
ax1.grid(False)
f1.savefig("heatmap.png", bbox_inches="tight", dpi=dpi)
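The NaN-row reshape used in the code above is easier to follow on a tiny frame. This sketch uses two made-up samples and current pandas calls; the names padded and long_df are mine. As far as I understand it, Datashader treats a NaN as a break in the line, so all samples can be passed as one long x/y table:

```python
import numpy as np
import pandas as pd

# Two short samples (columns) sharing the same x index; values are made up.
df = pd.DataFrame({"s0": [1.0, 2.0, 3.0], "s1": [10.0, 20.0, 30.0]}, index=[0, 1, 2])

# Add one all-NaN row so that every sample ends with a separator.
nan_row = pd.DataFrame([[np.nan] * df.shape[1]], columns=df.columns, index=[np.nan])
padded = pd.concat([df, nan_row])

# Stack the columns on top of each other (column-major order) into one long
# y column, and repeat the x index once per sample.
n_rows, n_cols = padded.shape
long_df = pd.DataFrame({
    "x": np.tile(padded.index.to_numpy(), n_cols),
    "y": padded.to_numpy().reshape(n_rows * n_cols, order="F"),
})
print(long_df)
```

Every fourth row of long_df is NaN in both x and y, which marks where one sample stops and the next one starts.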
With following run times (in seconds):
Time to aggregate line data: 0.7710442543029785
Time to create stacked image: 0.06000351905822754
Time to create stacked image: 0.05600309371948242
The resulting plots:
Although it seems you have tried this, plotting the counts seems to give a good representation of the data. However, it really depends what you're trying to find in your data, what is it supposed to tell you?
The long run time is due to plotting so many lines; a heatmap based on the counts, however, will plot fairly quickly.
I created some dummy data for sine waves, based on noise, no. of lines, amplitude and shift, and added both a boxplot and a heatmap.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import random
import pandas as pd
np.random.seed(0)
#create dummy data
N = 200
sinuses = []
no_lines = 200
for i in range(no_lines):
    a = np.random.randint(5, 40)/5 #amplitude
    x = random.choice([int(N/5), int(N/(2/5))]) #random shift
    sinuses.append(np.roll(a * np.sin(np.linspace(0, 2 * np.pi, N)) + np.random.randn(N), x))
fig = plt.figure(figsize=(20 / 2.54, 20 / 2.54))
sins = pd.DataFrame(sinuses, )
ax1 = plt.subplot2grid((3,10), (0,0), colspan=10)
ax2 = plt.subplot2grid((3,10), (1,0), colspan=10)
ax3 = plt.subplot2grid((3,10), (2,0), colspan=9)
ax4 = plt.subplot2grid((3,10), (2,9))
# plot line data
sins.T.plot(ax=ax1, color='lightblue',linewidth=.3)
ax1.legend_.remove()
ax1.set_xlim(0, N)
# try boxplot
sins.plot.box(ax=ax2, showfliers=False)
xticks = ax2.xaxis.get_major_ticks()
for index, label in enumerate(ax2.get_xaxis().get_ticklabels()):
    xticks[index].set_visible(False) # hide ticks where labels are hidden
#make a list of bins
no_bins = 20
bins = list(np.arange(sins.min().min(), sins.max().max(), int(abs(sins.min().min())+sins.max().max())/no_bins))
bins.append(sins.max().max())
# calculate histogram
hists = []
for col in sins.columns:
    count, division = np.histogram(sins.iloc[:,col], bins=bins)
    hists.append(count)
hists = pd.DataFrame(hists, columns=[str(i) for i in bins[1:]])
print(hists.shape, '\n', hists.head())
cmap = mpl.colors.ListedColormap(['white', '#FFFFBB', '#C3FDB8', '#B5EAAA', '#64E986', '#54C571',
'#4AA02C', '#347C17', '#347235', '#25383C', '#254117'])
#heatmap
im = ax3.pcolor(hists.T, cmap=cmap)
cbar = plt.colorbar(im, cax=ax4)
yticks = np.arange(0, len(bins))
yticklabels = hists.columns.tolist()
ax3.set_yticks(yticks)
ax3.set_yticklabels([round(i,1) for i in bins])
ax3.set_title('Count')
yticks = ax3.yaxis.get_major_ticks()
for index, label in enumerate(ax3.get_yaxis().get_ticklabels()):
    if index % 3 != 0: #make some labels invisible
        yticks[index].set_visible(False) # hide ticks where labels are hidden
plt.show()
Although the boxplot is easy to interpret, it doesn't show the actual distribution of the data very well, but knowing where the median and quantiles lie may be helpful.
Increasing the number of lines and amount of values per line will increase plotting time considerably for the line plots, the heatmap is still fairly quick though to generate. The boxplot becomes indiscernible however.
I couldn't exactly replicate your data (or know the actual size of it), but perhaps the heatmap may be helpful.
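If looping over the columns is too slow, the counts can also be computed in one call with np.histogram2d. This is a sketch with dummy sine curves (not the data from the question), and it counts the sampled points per cell rather than the line segments between them, so it assumes the curves are sampled densely enough:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_lines, n_points = 200, 100
x = np.linspace(0, 2 * np.pi, n_points)
# Dummy noisy sine curves with random amplitudes, one row per line.
amplitudes = rng.uniform(1, 8, size=(n_lines, 1))
lines = amplitudes * np.sin(x) + rng.normal(size=(n_lines, n_points))

# Count how many samples fall into each (x, y) cell in a single call.
counts, x_edges, y_edges = np.histogram2d(
    np.tile(x, n_lines),  # x coordinate of every sample
    lines.ravel(),        # matching y values, line after line
    bins=(n_points, 40),
)

fig, ax = plt.subplots()
# histogram2d returns counts indexed as [x_bin, y_bin]; pcolormesh wants [y, x].
mesh = ax.pcolormesh(x_edges, y_edges, counts.T, cmap="viridis")
fig.colorbar(mesh, ax=ax, label="samples per cell")
```

The number of y bins (40 here) plays the same role as no_bins in the code above and controls the granularity of the heatmap.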
I am trying to replicate the following image in matplotlib and it seems barh is my only option. Though it appears that you can't stack barh graphs so I don't know what to do
If you know of a better python library to draw this kind of thing, please let me know.
This is all I could come up with as a start:
import matplotlib.pyplot as plt; plt.rcdefaults()
import numpy as np
import matplotlib.pyplot as plt
people = ('A','B','C','D','E','F','G','H')
y_pos = np.arange(len(people))
bottomdata = 3 + 10 * np.random.rand(len(people))
topdata = 3 + 10 * np.random.rand(len(people))
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax.barh(y_pos, bottomdata,color='r',align='center')
ax.barh(y_pos, topdata,color='g',align='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(people)
ax.set_xlabel('Distance')
plt.show()
I would then have to add labels individually using ax.text which would be tedious. Ideally I would like to just specify the width of the part to be inserted then it updates the center of that section with a string of my choosing. The labels on the outside (e.g. 3800) I can add myself later, it is mainly the labeling over the bar section itself and creating this stacked method in a nice way I'm having problems with. Can you even specify a 'distance' i.e. span of color in any way?
Edit 2: for more heterogeneous data. (I've left the above method since I find it more usual to work with the same number of records per series)
Answering the two parts of the question:
a) barh returns a container of handles to all the patches that it drew. You can use the coordinates of the patches to aid the text positions.
b) Following these two answers to the question that I noted before (see Horizontal stacked bar chart in Matplotlib), you can stack bar graphs horizontally by setting the 'left' input.
and additionally c) handling data that is less uniform in shape.
One way to handle data that is less uniform in shape is simply to process each segment independently, as shown below.
import numpy as np
import matplotlib.pyplot as plt
# some labels for each row
people = ('A','B','C','D','E','F','G','H')
r = len(people)
# how many data points overall (average of 3 per person)
n = r * 3
# which person does each segment belong to?
rows = np.random.randint(0, r, (n,))
# how wide is the segment?
widths = np.random.randint(3,12, n,)
# what label to put on the segment (xrange in py2.7, range for py3)
labels = range(n)
colors ='rgbwmc'
patch_handles = []
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
left = np.zeros(r,)
row_counts = np.zeros(r,)
for (r, w, l) in zip(rows, widths, labels):
    print(r, w, l)
    patch_handles.append(ax.barh(r, w, align='center', left=left[r],
        color=colors[int(row_counts[r]) % len(colors)]))
    left[r] += w
    row_counts[r] += 1
    # we know there is only one patch but could enumerate if expanded
    patch = patch_handles[-1][0]
    bl = patch.get_xy()
    x = 0.5*patch.get_width() + bl[0]
    y = 0.5*patch.get_height() + bl[1]
    ax.text(x, y, "%d%%" % (l), ha='center',va='center')
y_pos = np.arange(8)
ax.set_yticks(y_pos)
ax.set_yticklabels(people)
ax.set_xlabel('Distance')
plt.show()
Which produces a graph like this , with a different number of segments present in each series.
Note that this is not particularly efficient since each segment uses an individual call to ax.barh. There may be more efficient methods (e.g. by padding a matrix with zero-width segments or nan values) but this is likely to be problem-specific and is a distinct question.
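For what it's worth, the zero-width padding idea could look roughly like this. It is only a sketch under the assumption that the ragged data comes as a dict of width lists per person (the segments dict and its values are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

people = ("A", "B", "C")
# Ragged input: each person has a different number of segment widths (made up).
segments = {"A": [4, 6, 3], "B": [5], "C": [7, 2]}

# Pad to a rectangular (people x max_segments) matrix; missing segments get width 0.
max_len = max(len(v) for v in segments.values())
widths = np.zeros((len(people), max_len))
for row, name in enumerate(people):
    widths[row, :len(segments[name])] = segments[name]

fig, ax = plt.subplots()
y_pos = np.arange(len(people))
left = np.zeros(len(people))
# One barh call per segment position instead of one per individual segment.
for col in range(max_len):
    bars = ax.barh(y_pos, widths[:, col], left=left, align="center")
    for rect, w in zip(bars, widths[:, col]):
        if w > 0:  # skip the zero-width padding
            ax.text(rect.get_x() + w / 2, rect.get_y() + rect.get_height() / 2,
                    f"{w:g}", ha="center", va="center")
    left += widths[:, col]

ax.set_yticks(y_pos)
ax.set_yticklabels(people)
```

The number of barh calls then equals the longest row rather than the total number of segments; the zero-width bars are still drawn but are invisible and get no label.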
Edit: updated to answer both parts of the question.
import numpy as np
import matplotlib.pyplot as plt
people = ('A','B','C','D','E','F','G','H')
segments = 4
# generate some multi-dimensional data & arbitrary labels
data = 3 + 10* np.random.rand(segments, len(people))
percentages = (np.random.randint(5,20, (len(people), segments)))
y_pos = np.arange(len(people))
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
colors ='rgbwmc'
patch_handles = []
left = np.zeros(len(people)) # left alignment of data starts at zero
for i, d in enumerate(data):
    patch_handles.append(ax.barh(y_pos, d,
        color=colors[i%len(colors)], align='center',
        left=left))
    # accumulate the left-hand offsets
    left += d
# go through all of the bar segments and annotate
for j in range(len(patch_handles)):
    for i, patch in enumerate(patch_handles[j].get_children()):
        bl = patch.get_xy()
        x = 0.5*patch.get_width() + bl[0]
        y = 0.5*patch.get_height() + bl[1]
        ax.text(x,y, "%d%%" % (percentages[i,j]), ha='center')
ax.set_yticks(y_pos)
ax.set_yticklabels(people)
ax.set_xlabel('Distance')
plt.show()
You can achieve a result along these lines (note: the percentages I used have nothing to do with the bar widths, as the relationship in the example seems unclear):
See Horizontal stacked bar chart in Matplotlib for some ideas on stacking horizontal bar plots.
Imports and Test DataFrame
Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2
For vertical stacked bars see Stacked Bar Chart with Centered Labels
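import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # used by the plotting code below
import seaborn as sns  # used in the seaborn section below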
# create sample data as shown in the OP
np.random.seed(365)
people = ('A','B','C','D','E','F','G','H')
bottomdata = 3 + 10 * np.random.rand(len(people))
topdata = 3 + 10 * np.random.rand(len(people))
# create the dataframe
df = pd.DataFrame({'Female': bottomdata, 'Male': topdata}, index=people)
# display(df)
Female Male
A 12.41 7.42
B 9.42 4.10
C 9.85 7.38
D 8.89 10.53
E 8.44 5.92
F 6.68 11.86
G 10.67 12.97
H 6.05 7.87
Updated with matplotlib v3.4.2
Use matplotlib.pyplot.bar_label
See How to add value labels on a bar chart for additional details and examples with .bar_label.
For python < 3.8, use labels = [f'{v.get_width():.2f}%' if v.get_width() > 0 else '' for v in c ], which avoids the assignment expression (:=).
Plotted using pandas.DataFrame.plot with kind='barh'
ax = df.plot(kind='barh', stacked=True, figsize=(8, 6))
for c in ax.containers:
    # customize the label to account for cases when there might not be a bar section
    labels = [f'{w:.2f}%' if (w := v.get_width()) > 0 else '' for v in c ]
    # set the bar label
    ax.bar_label(c, labels=labels, label_type='center')
    # uncomment and use the next line if there are no nan or 0 length sections; just use fmt to add a % (the previous two lines of code are not needed, in this case)
    # ax.bar_label(c, fmt='%.2f%%', label_type='center')
# move the legend
ax.legend(bbox_to_anchor=(1.025, 1), loc='upper left', borderaxespad=0.)
# add labels
ax.set_ylabel("People", fontsize=18)
ax.set_xlabel("Percent", fontsize=18)
plt.show()
Using seaborn
sns.barplot does not have an option for stacked bar plots, however, sns.histplot and sns.displot can be used to create horizontal stacked bars.
seaborn typically requires the dataframe to be in a long, instead of wide, format, so use pandas.DataFrame.melt to reshape the dataframe.
Reshape dataframe
# convert the dataframe to a long form
df = df.reset_index()
df = df.rename(columns={'index': 'People'})
dfm = df.melt(id_vars='People', var_name='Gender', value_name='Percent')
# display(dfm)
People Gender Percent
0 A Female 12.414557
1 B Female 9.416027
2 C Female 9.846105
3 D Female 8.885621
4 E Female 8.438872
5 F Female 6.680709
6 G Female 10.666258
7 H Female 6.050124
8 A Male 7.420860
9 B Male 4.104433
10 C Male 7.383738
11 D Male 10.526158
12 E Male 5.916262
13 F Male 11.857227
14 G Male 12.966913
15 H Male 7.865684
sns.histplot: axes-level plot
fig, axe = plt.subplots(figsize=(8, 6))
sns.histplot(data=dfm, y='People', hue='Gender', discrete=True, weights='Percent', multiple='stack', ax=axe)
# iterate through each set of containers
for c in axe.containers:
    # add bar annotations
    axe.bar_label(c, fmt='%.2f%%', label_type='center')
axe.set_xlabel('Percent')
plt.show()
sns.displot: figure-level plot
g = sns.displot(data=dfm, y='People', hue='Gender', discrete=True, weights='Percent', multiple='stack', height=6)
# iterate through each facet / subplot
for axe in g.axes.flat:
    # iterate through each set of containers
    for c in axe.containers:
        # add the bar annotations
        axe.bar_label(c, fmt='%.2f%%', label_type='center')
    axe.set_xlabel('Percent')
plt.show()
Original Answer - before matplotlib v3.4.2
The easiest way to plot a horizontal or vertical stacked bar, is to load the data into a pandas.DataFrame
This will plot, and annotate correctly, even when all categories ('People'), don't have all segments (e.g. some value is 0 or NaN)
Once the data is in the dataframe:
It's easier to manipulate and analyze
It can be plotted with the matplotlib engine, using:
pandas.DataFrame.plot.barh
label_text = f'{width}' for annotations
pandas.DataFrame.plot.bar
label_text = f'{height}' for annotations
SO: Vertical Stacked Bar Chart with Centered Labels
These methods return a matplotlib.axes.Axes or a numpy.ndarray of them.
The .patches attribute holds a list of matplotlib.patches.Rectangle objects, one for each of the sections of the stacked bar.
Each .Rectangle has methods for extracting the various values that define the rectangle.
Each .Rectangle is in order from left to right, and bottom to top, so all the .Rectangle objects, for each level, appear in order, when iterating through .patches.
The labels are made using an f-string, label_text = f'{width:.2f}%', so any additional text can be added as needed.
Plot and Annotate
Plotting the bar, is 1 line, the remainder is annotating the rectangles
# plot the dataframe with 1 line
ax = df.plot.barh(stacked=True, figsize=(8, 6))
# .patches is everything inside of the chart
for rect in ax.patches:
    # Find where everything is located
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()
    # For horizontal bars, the width of the bar is the data value and can be used as the label
    label_text = f'{width:.2f}%' # f'{width:.2f}' to format decimal values
    # ax.text(x, y, text)
    label_x = x + width / 2
    label_y = y + height / 2
    # only plot labels greater than given width
    if width > 0:
        ax.text(label_x, label_y, label_text, ha='center', va='center', fontsize=8)
# move the legend
ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
# add labels
ax.set_ylabel("People", fontsize=18)
ax.set_xlabel("Percent", fontsize=18)
plt.show()
Example with Missing Segment
# set one of the dataframe values to 0
df.iloc[4, 1] = 0
Note the annotations are all in the correct location from df.
For this case, the above answers work perfectly. The issue I had, and didn't find a plug-and-play solution online, was that I often have to plot stacked bars in multi-subplot figures, with many values, which tend to have very non-homogenous amplitudes.
(Note: I usually work with pandas dataframes and matplotlib. I couldn't get matplotlib's bar_label() method to work every time.)
So, I just give a kind of ad-hoc, but easily generalizable solution. In this example, I was working with single-row dataframes (for power-exchange monitoring purposes per hour), so, my dataframe (df) had just one row.
(I provide an example figure to show how this can be useful in very densely-packed plots)
(Example figure: https://i.stack.imgur.com/9akd8.png)
'''
This implementation produces a stacked, horizontal bar plot.
df --> pandas dataframe. Columns are used as the iterator, and only the first value of each column is used.
waterfall--> bool: if True, apart from the stack-direction, also a perpendicular offset is added.
cyclic_offset_x --> list (of any length) or None: loop through these values to use as x-offset pixels.
cyclic_offset_y --> list (of any length) or None: loop through these values to use as y-offset pixels.
ax --> matplotlib Axes, or None: if None, creates a new axis and figure.
'''
def magic_stacked_bar(df, waterfall=False, cyclic_offset_x=None, cyclic_offset_y=None, ax=None):
    if cyclic_offset_x is None:
        cyclic_offset_x = [0, 0]
    if cyclic_offset_y is None:
        cyclic_offset_y = [0, 0]
    ax0 = ax
    if ax is None:
        fig, ax = plt.subplots()
        fig.set_size_inches(19, 10)
    cycler = 0
    prev = 0  # summation variable to make it stacked
    for c in df.columns:
        value = df[c].values[0]  # only the first row of each column is used
        if waterfall:
            y = c; label = ""  # bidirectional stack
        else:
            y = 0; label = c  # unidirectional stack
        ax.barh(y=y, width=value, height=1, left=prev, label=label)
        prev += value  # add to sum-stack
        # loop through the offset lists, wrapping around at the end
        offset_x = cyclic_offset_x[cycler % len(cyclic_offset_x)]
        offset_y = cyclic_offset_y[cycler % len(cyclic_offset_y)]
        # annotate the middle of the segment that was just drawn
        ax.annotate(text="{}".format(int(value)), xy=(prev - value / 2, y),
                    xytext=(offset_x, offset_y), textcoords='offset pixels',
                    ha='center', va='top', fontsize=8,
                    arrowprops=dict(facecolor='black', shrink=0.01, width=0.3, headwidth=0.3),
                    bbox=dict(boxstyle='round', facecolor='grey', alpha=0.5))
        cycler += 1
    if not waterfall:
        ax.legend()  # if waterfall, the index annotates the columns. If
                     # waterfall == False, the legend annotates the columns
    if ax0 is None:
        ax.set_title("Voi la")
        ax.set_xlabel("UltraWatts")
        plt.show()
    else:
        return ax
''' Sometimes it is more tedious, and some custom functions are needed to make the labels look alright.
'''
A, B = 80,80
n_units = df.shape[1]
cyclic_offset_x = -A*np.cos(2*np.pi / (2*n_units) *np.arange(n_units))
cyclic_offset_y = B*np.sin(2*np.pi / (2*n_units) * np.arange(n_units)) + B/2
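To see what these two arrays actually contain, here is a small sketch with a hypothetical n_units of 6 (in the answer it comes from df.shape[1]). The angles run from 0 to just under pi, so the offsets trace the upper half of an ellipse and consecutive labels fan out from left to right.

```python
import numpy as np

A, B = 80, 80
n_units = 6  # hypothetical; in the answer this is df.shape[1]

# angles from 0 to just under pi, one per column
angles = 2 * np.pi / (2 * n_units) * np.arange(n_units)

cyclic_offset_x = -A * np.cos(angles)         # starts at -A and increases
cyclic_offset_y = B * np.sin(angles) + B / 2  # never drops below B / 2

for dx, dy in zip(cyclic_offset_x, cyclic_offset_y):
    print(f"x offset {dx:7.2f} px, y offset {dy:7.2f} px")
```

Since the function indexes the offsets with a wrap-around, lists shorter than the number of columns would simply repeat.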
I want to create a bar chart of two series (say 'A' and 'B') contained in a Pandas dataframe. If I wanted to just plot them using a different y-axis, I can use secondary_y:
df = pd.DataFrame(np.random.uniform(size=10).reshape(5,2),columns=['A','B'])
df['A'] = df['A'] * 100
df.plot(secondary_y=['A'])
But if I want to create bar graphs, the equivalent command is ignored (it doesn't put different scales on the y-axis), so the bars from 'A' are so big that the bars from 'B' cannot be distinguished:
df.plot(kind='bar',secondary_y=['A'])
How can I do this in pandas directly? Or how would you create such a graph?
I'm using pandas 0.10.1 and matplotlib version 1.2.1.
I don't think pandas plotting supports this. I wrote some manual matplotlib code; you can tweak it further.
import pylab as pl
fig = pl.figure()
ax1 = pl.subplot(111,ylabel='A')
#ax2 = gcf().add_axes(ax1.get_position(), sharex=ax1, frameon=False, ylabel='axes2')
ax2 =ax1.twinx()
ax2.set_ylabel('B')
ax1.bar(df.index,df.A.values, width =0.4, color ='g', align = 'center')
ax2.bar(df.index,df.B.values, width = 0.4, color='r', align = 'edge')
ax1.legend(['A'], loc = 'upper left')
ax2.legend(['B'], loc = 'upper right')
fig.show()
I am sure there are ways to tweak this further: move the bars further apart, make one slightly transparent, etc.
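One possible variation on the same idea, sketched with matplotlib.pyplot instead of pylab: shift each series by half a bar width so the two sets of bars sit side by side rather than overlapping. The seed, the width of 0.4, and the variable names are assumptions for illustration; the Agg line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(size=(5, 2)), columns=["A", "B"])
df["A"] = df["A"] * 100  # make the two series differ in scale

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis

x = np.arange(len(df))
width = 0.4
# shift each series by half a bar width so the bars sit next to each other
bars_a = ax1.bar(x - width / 2, df["A"], width=width, color="g", label="A")
bars_b = ax2.bar(x + width / 2, df["B"], width=width, color="r", label="B")

ax1.set_ylabel("A")
ax2.set_ylabel("B")
ax1.set_xticks(x)
ax1.set_xticklabels(df.index)
# one legend for both axes
ax1.legend(handles=[bars_a, bars_b], loc="upper left")
```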
OK, I had the same problem recently, and even though this is an old question, I think I can give an answer in case someone else is losing their mind over it. Joop gave the basis of what to do, and it's easy when you only have (for example) two columns in your dataframe, but it becomes really nasty when you have different numbers of columns for the two axes, because you need to play with the position argument of the pandas plot() function. In my example I use seaborn, but it's optional:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
df1 = pd.DataFrame(np.array([[i*99 for i in range(11)]]).transpose(), columns = ["100"], index = [i for i in range(11)])
df2 = pd.DataFrame(np.array([[i for i in range(11)], [i*2 for i in range(11)]]).transpose(), columns = ["1", "2"], index = [i for i in range(11)])
fig, ax = plt.subplots()
ax2 = ax.twinx()
# we must define the length of each column.
df1_len = len(df1.columns.values)
df2_len = len(df2.columns.values)
column_width = 0.8 / (df1_len + df2_len)
# we calculate the position of each column in the plot. This value is based on the position definition :
# Specify relative alignments for bar plot layout. From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center)
# http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.plot.html
df1_posi = 0.5 + (df2_len/float(df1_len)) * 0.5
df2_posi = 0.5 - (df1_len/float(df2_len)) * 0.5
# In order to have nice color, I use the default color palette of seaborn
df1.plot(kind='bar', ax=ax, width=column_width*df1_len, color=sns.color_palette()[:df1_len], position=df1_posi)
df2.plot(kind='bar', ax=ax2, width=column_width*df2_len, color=sns.color_palette()[df1_len:df1_len+df2_len], position=df2_posi)
ax.legend(loc="upper left")
# Pandas adds a line at x = 0 for each dataframe.
ax.lines[0].set_visible(False)
ax2.lines[0].set_visible(False)
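# Specific to seaborn: we have to remove the background grid lines
# (the flag is passed positionally because the keyword b= was removed in newer matplotlib)
ax2.grid(False, axis='both')
# We need to add some space; the default xlim doesn't account for the new positions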
column_length = (ax2.get_xlim()[1] - abs(ax2.get_xlim()[0])) / float(len(df1.index))
ax2.set_xlim([ax2.get_xlim()[0] - column_length, ax2.get_xlim()[1] + column_length])
fig.patch.set_facecolor('white')
plt.show()
And the result : http://i.stack.imgur.com/LZjK8.png
I didn't test every possibility, but it looks like it works fine whatever the number of columns in each dataframe.
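For what it's worth, the position arithmetic can be checked by hand. The sketch below assumes that pandas starts a bar group at tick - width * position (this is my reading of its layout, not something stated above), and group_span is a hypothetical helper. With 1 column on the first axis and 2 on the second, the two groups end up back to back and together cover 0.8 of each tick.

```python
def group_span(width, position):
    """Left and right edge of a bar group relative to its tick.

    Assumes the group starts at -width * position (assumed pandas layout).
    """
    left = -width * position
    return left, left + width


df1_len, df2_len = 1, 2                      # columns per dataframe, as in the example
column_width = 0.8 / (df1_len + df2_len)     # width of a single bar

df1_posi = 0.5 + (df2_len / df1_len) * 0.5   # 1.5
df2_posi = 0.5 - (df1_len / df2_len) * 0.5   # 0.25

span1 = group_span(column_width * df1_len, df1_posi)
span2 = group_span(column_width * df2_len, df2_posi)

print(span1)  # roughly (-0.4, -0.133)
print(span2)  # roughly (-0.133, 0.4)
```

Under that assumption, the right edge of the first group equals the left edge of the second for any pair of column counts, which would explain why the bars never overlap.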