How can I plot a dendrogram right on top of a matrix of values, reordered appropriately to reflect the clustering, in Python? An example is the following figure:
This is Figure 6 from: A panel of induced pluripotent stem cells from chimpanzees: a resource for comparative functional genomics
I use scipy.cluster.hierarchy.dendrogram to make my dendrogram and perform hierarchical clustering on a matrix of data. How can I then plot the data as a matrix where the rows have been reordered to reflect the clustering induced by cutting the dendrogram at a particular threshold, and have the dendrogram plotted alongside the matrix? I know how to plot the dendrogram in scipy, but not how to plot the intensity matrix of data with the right scale bar next to it.
The question does not define matrix very well: "matrix of values", "matrix of data". I assume that you mean a distance matrix. In other words, element D_ij in the symmetric nonnegative N-by-N distance matrix D denotes the distance between two feature vectors, x_i and x_j. Is that correct?
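For concreteness, a distance matrix of that kind could be built from feature vectors roughly as follows. This is only a sketch: the `features` array and the Euclidean metric are my assumptions, not something stated in the question.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
features = rng.random((5, 3))   # 5 hypothetical feature vectors x_i, 3 dimensions each

# pdist returns the condensed form: one entry per unordered pair (i, j), i < j
condensed = pdist(features, metric="euclidean")
# squareform expands it to the full symmetric N-by-N matrix with a zero diagonal
D = squareform(condensed)
print(D.shape, condensed.shape)
```

The condensed form is what `sch.linkage` expects as input, which is why the code below converts with `squareform` in the other direction.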
If so, then try this (edited June 13, 2010, to reflect two different dendrograms).
Tested in python 3.10 and matplotlib 3.5.1
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform
# Generate random features and distance matrix.
np.random.seed(200) # for reproducible data
x = np.random.rand(40)
D = np.zeros([40, 40])
for i in range(40):
    for j in range(40):
        D[i,j] = abs(x[i] - x[j])
condensedD = squareform(D)
# Compute and plot first dendrogram.
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_axes([0.09, 0.1, 0.2, 0.6])
Y = sch.linkage(condensedD, method='centroid')
Z1 = sch.dendrogram(Y, orientation='left')
ax1.set_xticks([])
ax1.set_yticks([])
# Compute and plot second dendrogram.
ax2 = fig.add_axes([0.3, 0.71, 0.6, 0.2])
Y = sch.linkage(condensedD, method='single')
Z2 = sch.dendrogram(Y)
ax2.set_xticks([])
ax2.set_yticks([])
# Plot distance matrix.
axmatrix = fig.add_axes([0.3, 0.1, 0.6, 0.6])
idx1 = Z1['leaves']
idx2 = Z2['leaves']
D = D[idx1,:]
D = D[:,idx2]
im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=plt.cm.YlGnBu)
axmatrix.set_xticks([]) # remove axis labels
axmatrix.set_yticks([]) # remove axis labels
# Plot colorbar.
axcolor = fig.add_axes([0.91, 0.1, 0.02, 0.6])
plt.colorbar(im, cax=axcolor)
plt.show()
fig.savefig('dendrogram.png')
Edit: For different colors, adjust the cmap argument passed to matshow. See the scipy/matplotlib docs for examples. That page also describes how to create your own colormap. For convenience, I recommend using a preexisting colormap. In my example, I used YlGnBu.
Edit: add_axes (see documentation here) accepts a list or tuple: (left, bottom, width, height). For example, (0.5,0,0.5,1) adds an Axes on the right half of the figure. (0,0.5,1,0.5) adds an Axes on the top half of the figure.
Most people probably use add_subplot for its convenience. I like add_axes for its control.
To remove the border, use add_axes([left,bottom,width,height], frame_on=False). See example here.
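A small sketch of the two rectangles mentioned above, in case the (left, bottom, width, height) convention is easier to see in code. The figure size and the variable names are arbitrary choices of mine.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))

# (left, bottom, width, height), all given as fractions of the figure size
ax_right = fig.add_axes([0.5, 0.0, 0.5, 1.0])                # right half of the figure
ax_top = fig.add_axes([0.0, 0.5, 1.0, 0.5], frame_on=False)  # top half, drawn without a border

print(ax_right.get_position().bounds)
print(ax_top.get_frame_on())
```

The two Axes overlap here on purpose; add_axes does not try to prevent that, which is exactly the control that makes the dendrogram layout possible.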
If, in addition to the matrix and dendrograms, you need to show the labels of the elements, the following code can be used. It shows all the labels, rotating the x labels and reducing the font size to avoid overlap on the x axis. The colorbar has to be moved to make room for the y labels:
axmatrix.set_xticks(range(40))
axmatrix.set_xticklabels(idx2, minor=False)  # columns were reordered with idx2
axmatrix.xaxis.set_label_position('bottom')
axmatrix.xaxis.tick_bottom()
plt.setp(axmatrix.get_xticklabels(), rotation=-90, fontsize=8)
axmatrix.set_yticks(range(40))
axmatrix.set_yticklabels(idx1, minor=False)  # rows were reordered with idx1
axmatrix.yaxis.set_label_position('right')
axmatrix.yaxis.tick_right()
axcolor = fig.add_axes([0.94,0.1,0.02,0.6])
The result obtained is this (with a different color map):
I am trying to plot contours of data that has been binned using numpy.histogram2d, except the bins are set using numpy.logspace (equal binning in log space).
Unfortunately, this results in a strange behavior that I can't seem to resolve: the placement of the contours does not match the location of the points in x/y. I plot both the 2d histogram of the data, and the contours, and they do not overlap.
It looks like what is actually happening is the contours are being placed on the physical location of the plot in linear space where I expect them to be placed in log space.
It's a strange phenomenon that I think is best described by the following plots, which use identical data binned in different ways:
Here is a minimum working example to produce the logbinned data:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(loc=500, scale=100,size=10000)
y = np.random.normal(loc=600, scale=60, size=10000)
nbins = 50
bins = (np.logspace(np.log10(10),np.log10(1000),nbins),np.logspace(np.log10(10),np.log10(1000),nbins))
HH, xe, ye = np.histogram2d(x,y,bins=bins)
plt.hist2d(x,y,bins=bins,cmin=1);
grid = HH.transpose()
extent = np.array([xe.min(), xe.max(), ye.min(), ye.max()])
cs = plt.contourf(grid,2,extent=extent,extend='max',cmap='plasma',alpha=0.5,zorder=100)
plt.contour(grid,2,extent=extent,colors='k',zorder=100)
plt.yscale('log')
plt.xscale('log')
It's fairly clear what is happening: the contour is getting misplaced due to the scaling of the bins. I'd like to be able to plot the histogram and the contour here together.
If anyone has an idea of how to resolve this, that would be very helpful - thanks!
This is your problem:
cs = plt.contourf(grid,2,extent=extent,...)
You are passing in a single 2d array specifying the values of the histograms, but you aren't passing the x and y coordinates these data correspond to. By only passing in extent there's no way for pyplot to do anything other than assume that the underlying grid is uniform, stretched out to fit extent.
So instead what you have to do is to define x and y components for each value in grid. You have to think a bit how to do this, because you have (n, n)-shaped data and (n+1,)-shaped edges to go with it. We should probably choose the center of each bin to associate a data point with. So we need to find the midpoint of each bin, and pass those arrays to contour[f].
Something like this:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng()
size = 10000
x = rng.normal(loc=500, scale=100, size=size)
y = rng.normal(loc=600, scale=60, size=size)
nbins = 50
bins = (np.geomspace(10, 1000, nbins),) * 2
HH, xe, ye = np.histogram2d(x, y, bins=bins)
fig, ax = plt.subplots()
ax.hist2d(x, y, bins=bins, cmin=1)
grid = HH.transpose()
# compute bin midpoints
midpoints = (xe[1:] + xe[:-1])/2, (ye[1:] + ye[:-1])/2
cs = ax.contourf(*midpoints, grid, levels=2, extend='max', cmap='plasma', alpha=0.5, zorder=100)
ax.contour(*midpoints, grid, levels=2, colors='k', zorder=100)
# these are a red herring during debugging:
#ax.set_yscale('log')
#ax.set_xscale('log')
(I've cleaned up your code a bit.)
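One thing to keep in mind about the midpoint step: the code above uses arithmetic midpoints, while for log-spaced edges the geometric mean is the point that sits in the middle of the bin on a log axis. Which one is appropriate depends on what you want the contour to represent, so take this as a sketch only. The helper name `bin_centers` and its `log` flag are mine, not part of numpy or matplotlib.

```python
import numpy as np

def bin_centers(edges, log=False):
    """Return one centre per bin: geometric mean if log=True, else arithmetic mean."""
    edges = np.asarray(edges, dtype=float)
    if log:
        return np.sqrt(edges[1:] * edges[:-1])
    return (edges[1:] + edges[:-1]) / 2

edges = np.geomspace(10, 1000, 5)
ari = bin_centers(edges)            # arithmetic midpoints, as in the answer's code
geo = bin_centers(edges, log=True)  # geometric midpoints, centred in log space
print(ari)
print(geo)
```

Either array has one entry fewer than `edges`, so it matches the shape of the histogram along that axis and can be passed to contour or contourf in place of the midpoints.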
Alternatively, if you want to avoid having those white strips at the top and edge, you can keep your bin edges, and pad your grid with zeros:
grid_padded = np.pad(grid, [(0, 1)])
cs = ax.contourf(xe, ye, grid_padded, levels=2, extend='max', cmap='plasma', alpha=0.5, zorder=100)
ax.contour(xe, ye, grid_padded, levels=2, colors='k', zorder=100)
This gives us something like
This seems prettier, but if you think about your data this is less exact, because your data points are shifted with respect to the bin coordinates they correspond to. If you look closely you can see the contours being shifted with respect to the output of hist2d. You could fix this by generating geomspaces with one more final value which you only use for this final plotting step, and again use the midpoints of these edges (complete with a last auxiliary one).
This plot shows the voltage course lines of a simulated neuron:
I would like to place a zoomed-in plot in the upper right corner so that the fluctuations of the lines are easier to see (at the current scale they are hardly visible).
The code for the plot is attached:
##### factor to define voltage-amplitude heights
v_amp_factor = 1/(50)
##### distances between lines and x-axis
offset = np.cumsum(distance_comps_middle)/meter
offset = (offset/max(offset))*10
plt.close(plot_name)
voltage_course = plt.figure(plot_name)
for ii in comps_to_plot:
    plt.plot(time_vector/ms, offset[ii] - v_amp_factor*(voltage_matrix[ii, :]-V_res)/mV, "#000000")
plt.yticks(np.linspace(0,10, int(length_neuron/mm)+1),range(0,int(length_neuron/mm)+1,1))
plt.xlabel('Time/ms', fontsize=16)
plt.gca().invert_yaxis() # inverts y-axis => - v_amp_factor*(.... has to be written above
##### no grid
plt.grid(False)
plt.savefig('plot_name', dpi=600)
plt.show()
parameter description:
Parameters
----------
plot_name : string
    This defines how the plot window will be named.
time_vector : list of time values
    Vector contains the time points that correspond to the voltage values
    of the voltage matrix.
voltage_matrix : matrix of membrane potentials
    Matrix has one row for each compartment and one column for each time
    step. Number of columns has to be the same as the length of the time vector.
comps_to_plot : vector of integers
    This vector includes the numbers of the compartments which should be part of the plot.
distance_comps_middle : list of lengths
    This list contains the distances of the compartments, which allows
    the lines in the plot to be spaced according to the real compartment distances.
length_neuron : length
    Defines the total length of the neuron.
V_res : voltage
    Defines the resting potential of the model.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import zoomed_inset_axes
# let's plot something similar to your stuff
t = np.linspace(-5, 5, 2001)
y = np.exp(-20*t**2)
fig, ax = plt.subplots()
for i in range(14):
    start = 900-10*i
    ax.plot(t[1000:1500], -5*y[start:start+500]/(1+i*0.3)+i, 'k')
ax.set_ylim((15, -10)) ; ax.set_yticks(range(14))
# now, create an inset axis, in the upper right corner, with
# a zoom factor of two
zax = zoomed_inset_axes(ax, 2, loc=1)
# plot again (PLOT AGAIN) the same stuff as before in the new axes
for i in range(14):
    start = 900-10*i
    zax.plot(t[1000:1500], -5*y[start:start+500]/(1+i*0.3)+i, 'k')
# and eventually specify the x, y limits for the zoomed plot,
# as well as the tick positions
zax.set_xlim((0.2, 0.7)) ; zax.set_xticks((0.2, 0.3, 0.4, 0.5, 0.6, 0.7))
zax.set_ylim((1, -6)) ; zax.set_yticks([1]+[-i*0.5 for i in range(12)]) ;
# that's all folks
plt.show()
After some research I found a method that should be exactly what you are looking for: the inset_axes method of matplotlib. There is a nice working example in the docs. I hope it helps. I would have tried to apply it to your code, but without the data you used for the plot there wasn't much I could do.
As you can find in the docs this method will allow you to do something as shown in this image:
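Since the original data is not available, here is a minimal sketch on made-up data. It assumes Matplotlib 3.0 or later, where `Axes.inset_axes` and `Axes.indicate_inset_zoom` exist; the signal, the inset rectangle and the zoom limits are all placeholder values of mine.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 10, 1000)
v = 0.02 * np.sin(20 * t)     # small fluctuations that are hard to see at full scale

fig, ax = plt.subplots()
ax.plot(t, v, "k")
ax.set_ylim(-1, 1)

# inset in the upper right corner; the rectangle is (x0, y0, width, height)
# in axes-fraction coordinates of the parent Axes
axins = ax.inset_axes([0.55, 0.55, 0.4, 0.4])
axins.plot(t, v, "k")         # the data has to be plotted again in the inset
axins.set_xlim(2, 3)          # region of the parent plot to zoom into
axins.set_ylim(-0.03, 0.03)
ax.indicate_inset_zoom(axins) # mark the zoomed region on the parent Axes
```

For the neuron plot, the loop over `comps_to_plot` would presumably have to be repeated with the inset Axes in place of plt, in the same way the other answer plots everything a second time.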
I'm trying to visualise a dataset in 3D which consists of a time series (along y) of x-z data, using Python and Matplotlib.
I'd like to create a plot like the one below (which was made in Python: http://austringer.net/wp/index.php/2011/05/20/plotting-a-dolphin-biosonar-click-train/), but where the colour varies with Z - i.e. so the intensity is shown by a colormap as well as the peak height, for clarity.
An example showing the colormap in Z is (apparently made using MATLAB):
This effect can be created using the waterfall plot option in MATLAB, but I understand there is no direct equivalent of this in Python.
I have also tried using the plot_surface option in Python (below), which works ok, but I'd like to 'force' the lines running over the surface to only be in the x direction (i.e. making it look more like a stacked time series than a surface). Is this possible?
Any help or advice greatly welcomed. Thanks.
I have written a function that replicates the MATLAB waterfall behaviour in matplotlib, but I don't think it is the best solution when it comes to performance.
I started from two examples in the matplotlib documentation: multicolor lines and multiple lines in a 3D plot. The only way I found to draw a line whose color follows a colormap according to its z value is the one used in the first example: reshape the input array so that the line is drawn in segments of 2 points, and set the color of each segment to the mean z value of its 2 points.
Thus, given n-by-m input matrices X, Y and Z, the function loops over the smaller of the two dimensions n, m and plots each line in 2-point segments, reshaping the arrays with the same code as the example.
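To make the reshaping step less opaque, this is what it does on a tiny line of four points. The arrays here are made-up illustration data.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
z = np.array([0.0, 1.0, 0.0, 1.0])

# one (x, z) pair per point, with an extra axis so that pairs can be joined
points = np.array([x, z]).T.reshape(-1, 1, 2)                 # shape (4, 1, 2)
# join each point with its successor: one 2-point segment per consecutive pair
segments = np.concatenate([points[:-1], points[1:]], axis=1)  # shape (3, 2, 2)
# value used for colormapping each segment: mean z of its two end points
seg_values = (z[1:] + z[:-1]) / 2

print(segments.shape)
print(seg_values)
```

A line with k points therefore becomes k-1 segments, which is why the color array passed to the LineCollection has one entry fewer than the line itself.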
def waterfall_plot(fig,ax,X,Y,Z):
    '''
    Make a waterfall plot
    Input:
        fig,ax : matplotlib figure and axes to populate
        Z : n,m numpy array. Must be a 2d array even if only one line should be plotted
        X,Y : n,m array
    '''
    # Set normalization to the same values for all plots
    norm = plt.Normalize(Z.min().min(), Z.max().max())
    # Check sizes to loop always over the smallest dimension
    n,m = Z.shape
    if n>m:
        X=X.T; Y=Y.T; Z=Z.T
        m,n = n,m
    for j in range(n):
        # reshape the X,Z into pairs
        points = np.array([X[j,:], Z[j,:]]).T.reshape(-1, 1, 2)
        segments = np.concatenate([points[:-1], points[1:]], axis=1)
        lc = LineCollection(segments, cmap='plasma', norm=norm)
        # Set the values used for colormapping
        lc.set_array((Z[j,1:]+Z[j,:-1])/2)
        lc.set_linewidth(2) # set linewidth a little larger to see properly the colormap variation
        line = ax.add_collection3d(lc,zs=(Y[j,1:]+Y[j,:-1])/2, zdir='y') # add line to axes
    fig.colorbar(lc) # add colorbar; as the normalization is the same for all, it doesn't matter which of the lc objects we use
Therefore, plots that look like a MATLAB waterfall can easily be generated from the same input matrices as a matplotlib surface plot:
import numpy as np; import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from mpl_toolkits.mplot3d import Axes3D
# Generate data
x = np.linspace(-2,2, 500)
y = np.linspace(-2,2, 40)
X,Y = np.meshgrid(x,y)
Z = np.sin(X**2+Y**2)
# Generate waterfall plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
waterfall_plot(fig,ax,X,Y,Z)
ax.set_xlabel('X') ; ax.set_xlim3d(-2,2)
ax.set_ylabel('Y') ; ax.set_ylim3d(-2,2)
ax.set_zlabel('Z') ; ax.set_zlim3d(-1,1)
The function assumes that, when the meshgrid is generated, the x array is the longer one, so by default each line has fixed y and it is the x coordinate that varies. However, if the y dimension is larger, the matrices are transposed and the lines are generated with fixed x. Thus, generating the meshgrid with the sizes inverted (len(x)=40 and len(y)=500) yields:
With a pandas DataFrame that has the x axis as the first column and each spectrum as another column:
offset=0
for c in s.columns[1:]:
    plt.plot(s.wavelength,s[c]+offset)
    offset+=.25
plt.xlim([1325,1375])
I have a massive scatterplot (~100,000 points) that I'm generating in matplotlib. Each point has a location in this x/y space, and I'd like to generate contours containing certain percentiles of the total number of points.
Is there a function in matplotlib which will do this? I've looked into contour(), but I'd have to write my own function to work in this way.
Thanks!
Basically, you want a density estimate of some sort. There are multiple ways to do this:
Use a 2D histogram of some sort (e.g. matplotlib.pyplot.hist2d or matplotlib.pyplot.hexbin) (You could also display the results as contours--just use numpy.histogram2d and then contour the resulting array.)
Make a kernel-density estimate (KDE) and contour the results. A KDE is essentially a smoothed histogram. Instead of a point falling into a particular bin, it adds a weight to surrounding bins (usually in the shape of a gaussian "bell curve").
Using a 2D histogram is simple and easy to understand, but fundamentally gives "blocky" results.
There are some wrinkles to doing the second one "correctly" (i.e. there's no one correct way). I won't go into the details here, but if you want to interpret the results statistically, you need to read up on it (particularly the bandwidth selection).
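Just to show where the bandwidth enters in scipy: `gaussian_kde` takes a `bw_method` argument and exposes the resulting scaling as its `factor` attribute. The data and the value 0.1 below are arbitrary choices for illustration, not a recommendation.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1977)
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 200)

k_default = gaussian_kde(data.T)                # Scott's rule is the default
k_narrow = gaussian_kde(data.T, bw_method=0.1)  # a scalar sets the factor directly

# the factor scales the data covariance to give the kernel covariance;
# a smaller factor means less smoothing
print(k_default.factor, k_narrow.factor)
```

With 200 points in 2 dimensions the default factor is well above 0.1, so the second estimate follows the individual points much more closely.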
At any rate, here's an example of the differences. I'm going to plot each one similarly, so I won't use contours, but you could just as easily plot the 2D histogram or gaussian KDE using a contour plot:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
np.random.seed(1977)
# Generate 200 correlated x,y points
data = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 200)
x, y = data.T
nbins = 20
fig, axes = plt.subplots(ncols=2, nrows=2, sharex=True, sharey=True)
axes[0, 0].set_title('Scatterplot')
axes[0, 0].plot(x, y, 'ko')
axes[0, 1].set_title('Hexbin plot')
axes[0, 1].hexbin(x, y, gridsize=nbins)
axes[1, 0].set_title('2D Histogram')
axes[1, 0].hist2d(x, y, bins=nbins)
# Evaluate a gaussian kde on a regular grid of nbins x nbins over data extents
k = gaussian_kde(data.T)
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
axes[1, 1].set_title('Gaussian KDE')
axes[1, 1].pcolormesh(xi, yi, zi.reshape(xi.shape))
fig.tight_layout()
plt.show()
One caveat: With very large numbers of points, scipy.stats.gaussian_kde will become very slow. It's fairly easy to speed it up by making an approximation: just take the 2D histogram and blur it with a Gaussian filter of the right radius and covariance. I can give an example if you'd like.
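A rough sketch of that approximation, using `scipy.ndimage.gaussian_filter`. Note the simplifications: the filter here is isotropic with an arbitrary sigma given in bins, so it ignores the covariance part entirely, and the bin count is a placeholder as well.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(1977)
x, y = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 100_000).T

# fine 2D histogram first; binning is cheap even for many points
H, xedges, yedges = np.histogram2d(x, y, bins=100)

# blur the counts; sigma is in units of bins and 2.0 is an arbitrary choice,
# not a bandwidth derived from the data
H_smooth = gaussian_filter(H, sigma=2.0)
print(H.shape, H_smooth.shape)
```

`H_smooth.T` can then be contoured or passed to pcolormesh together with the bin edges or centres, in the same way as the KDE grid above.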
One other caveat: If you're doing this in a non-cartesian coordinate system, none of these methods apply! Getting density estimates on a spherical shell is a bit more complicated.
I have the same question.
If you want to plot contours that enclose a given fraction of the points, you can use the following algorithm:
Create a 2D histogram:
h2, xedges, yedges = np.histogram2d(X, Y, bins=[30, 30])
h2 is now a 2D array whose entries are the number of points in each rectangular bin.
hravel = np.sort(np.ravel(h2))[::-1]  # bin counts, sorted in descending order
hcumsum = np.cumsum(hravel)
Now the ugly hack:
give every bin the fraction of all points that lie in bins whose count is equal to or greater than the count of that bin. The result is written to a separate array, so that fractions already filled in are never mistaken for counts.
hunique = np.unique(hravel)
hsum = np.sum(h2)
hfrac = np.zeros_like(h2)
for h in hunique:
    hfrac[h2 == h] = hcumsum[np.argwhere(hravel == h)[-1]]/hsum
Now plot contours of hfrac: the contour at level 0.9, for example, encloses the densest bins, which together contain roughly 90% of all points.
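The same idea can also be written without the loop, by asking directly for the count threshold that corresponds to each fraction. This is my own compact variant, not the code above; the function name `levels_for_fractions` is made up, and the tiny array is only there to make the arithmetic easy to follow.

```python
import numpy as np

def levels_for_fractions(H, fractions):
    """Return count thresholds such that the bins with count >= threshold
    together hold at least the requested fraction of all points."""
    counts = np.sort(np.ravel(H))[::-1]   # densest bins first
    cumulative = np.cumsum(counts)        # points held by the k densest bins
    targets = np.asarray(fractions) * cumulative[-1]
    idx = np.searchsorted(cumulative, targets)
    idx = np.clip(idx, 0, len(counts) - 1)
    return counts[idx]

H = np.array([[4.0, 3.0], [2.0, 1.0]])    # 10 points in total
levels = levels_for_fractions(H, [0.25, 0.5, 0.85])
print(levels)
```

Because contour expects increasing levels, the result would be sorted (for example with np.sort) before being passed as `levels=` when contouring the histogram itself.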
I am not able to draw a simple, vertical arrow in the following log-log plot:
#!/usr/bin/python2
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.yscale('log')
plt.xscale('log')
plt.ylim((1e-20,1e-10))
plt.xlim((1e-12,1))
plt.arrow(0.00006666, 1e-20, 0, 1e-8 - 1e-20, length_includes_head=True)
plt.savefig('test.pdf')
It just doesn't show. From the documentation it appears as if all the arguments, like width, height and so on relate to the scale of the axis. This is very counter-intuitive. I tried using twin() of the axisartist package to define an axis on top of mine with limits (0,1), (0,1) to have more control over the arrow's parameters, but I couldn't figure out how to have a completely independent axis on top of the primary one.
Any ideas?
I was looking for an answer to this question, and found a useful answer! You can specify any "mathtext" character (matplotlib's version of LaTeX) as a marker. Try:
plt.plot(x, y, color='k', linestyle='none', marker=r'$\downarrow$', markersize=20)
This will plot a downward pointing, black arrow at position (x,y) that looks good on any plot (even log-log).
See: matplotlib.org/users/mathtext.html#mathtext-tutorial for more symbols you can use.
Subplots approach
After creating the subplots do the following
Align the positions
Use set_axis_off() to turn the axis off (ticks, labels, etc)
Draw the arrow!
So a few lines get you what you want!
E.g.
#!/usr/bin/python2
import matplotlib.pyplot as plt
hax = plt.subplot(1,2,1)
plt.yscale('log')
plt.xscale('log')
plt.ylim((1e-20,1e-10))
plt.xlim((1e-12,1))
hax2 = plt.subplot(1,2,2)
plt.arrow(0.1, 1, 0, 1, length_includes_head=True)
hax.set_position([0.1, 0.1, 0.8, 0.8])
hax2.set_position([0.1, 0.1, 0.8, 0.8])
hax2.set_axis_off()
plt.savefig('test.pdf')
Rescale data
Alternatively a possibly easier approach, though the axis labels may be tricky, is to rescale the data.
i.e.
import numpy
# Other import commands and data input
plt.plot(numpy.log10(x), numpy.log10(y))
Not a great solution, but a decent result if you can handle the tick labels!
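A different route, which is not one of the two approaches above, is to draw the arrow with annotate: its arrow is built from the two end points in data coordinates, so it does not depend on arrow()'s width and head sizes being in axis units. A minimal sketch, with end points chosen by me so that they fall inside the axis limits of the question:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlim(1e-12, 1)
ax.set_ylim(1e-20, 1e-10)

# the arrow runs from xytext to xy, both in data coordinates;
# the empty string means that no text is drawn
arrow = ax.annotate(
    "",
    xy=(6.666e-5, 1e-12),
    xytext=(6.666e-5, 1e-19),
    arrowprops=dict(arrowstyle="->", color="k"),
)
# fig.savefig('test.pdf')  # or plt.show() to look at the result
```

The arrow head is sized in points rather than in data units, so it should look the same on linear and log axes.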
I know this thread has been dead for a long time now, but I figure posting my solution might be helpful for anyone else trying to figure out how to draw arrows on log-scale plots efficiently.
As an alternative to what others have already posted, you could use a transformation object to input the arrow coordinates not in the scale of the original axes but in the (linear) scale of the "axes coordinates". What I mean by axes coordinates are those that are normalized to [0,1] (horizontal range) by [0,1] (vertical range), where the point (0,0) would be the bottom-left corner and the point (1,1) would be the top-right, and so on. Then you could simply include an arrow by:
plt.arrow(0.1, 0.1, 0.8, 0.8, transform=plot1.transAxes, length_includes_head=True)
This gives an arrow that spans diagonally over 4/5 of the plot's horizontal and vertical range, from the bottom-left to the top-right (where plot1 is the subplot name).
If you want to do this in general, where exact coordinates (x0,y0) and (x1,y1) in the log-space can be specified for the arrow, this is not too difficult if you write two functions fx(x) and fy(y) that transform from the original coordinates to these "axes" coordinates. I've given an example of how the original code posted by the OP could be modified to implement this below (apologies for not including the images the code produces, I don't have the required reputation yet).
#!/usr/bin/python3
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
# functions fx and fy take log-scale coordinates to 'axes' coordinates
ax = 1E-12 # [ax,bx] is range of horizontal axis
bx = 1E0
def fx(x):
    return (np.log(x) - np.log(ax))/(np.log(bx) - np.log(ax))
ay = 1E-20 # [ay,by] is range of vertical axis
by = 1E-10
def fy(y):
    return (np.log(y) - np.log(ay))/(np.log(by) - np.log(ay))
plot1 = plt.subplot(111)
plt.xscale('log')
plt.yscale('log')
plt.xlim(ax, bx)
plt.ylim(ay, by)
# transformed coordinates for arrow from (1E-10,1E-18) to (1E-4,1E-16)
x0 = fx(1E-10)
y0 = fy(1E-18)
x1 = fx(1E-4) - fx(1E-10)
y1 = fy(1E-16) - fy(1E-18)
plt.arrow(
    x0, y0, x1, y1, # input transformed arrow coordinates
    transform = plot1.transAxes, # tell matplotlib to use axes coordinates
    facecolor = 'black',
    length_includes_head=True
)
plt.grid(True)
plt.savefig('test.pdf')
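As a possible shortcut, the hand-written fx and fy can be replaced by composing matplotlib's own transforms: `transData` maps data to display coordinates and `transAxes.inverted()` maps display back to axes fractions. This is a sketch along the same lines as the code above, with the same arrow end points; it assumes the axis limits are set before the conversion is done.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlim(1e-12, 1)
ax.set_ylim(1e-20, 1e-10)

# data -> display -> axes fraction; follows whatever scale and limits the Axes has
data_to_axes = ax.transData + ax.transAxes.inverted()

x0, y0 = data_to_axes.transform((1e-10, 1e-18))   # arrow tail
x1, y1 = data_to_axes.transform((1e-4, 1e-16))    # arrow tip
ax.arrow(x0, y0, x1 - x0, y1 - y0, transform=ax.transAxes,
         facecolor="black", length_includes_head=True)
print(x0, y0, x1, y1)
```

The converted values should agree with what fx and fy return for the same points, so the two versions draw the same arrow as long as the limits are not changed afterwards.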