Line-based heatmap or 2D line histogram - python

I have a synthetic dataset with 1000 noisy polynomials of various orders and sin/cos curves that I can plot as lines using seaborn.
Since I have quite a few lines that are overlapping, I'd like to plot some sort of heatmap or histogram of my line graphs.
I've tried iterating over the columns and aggregating the counts to use seaborn's heatmap graph, but with many lines this takes quite a while.
The next best thing that comes close to what I want was a hexbin plot (with seaborn's jointplot).
But it's a compromise between runtime and granularity (the shown graph has gridsize 750). I couldn't find any other plot type for my problem, but I also don't know exactly what it might be called.
I've also tried with line alpha set to 0.2. This results in a similar graph to what I want. But it's less precise (if more than 5 lines overlap at the same point I already have zero transparency left). Also, it misses the typical coloration of heatmaps.
(Moot search terms were: heatmap, 2D line histogram, line histogram, density plots...)
Does anybody know a package that plots this more efficiently and at higher quality, or how to do it with the popular Python plotting libraries (e.g. matplotlib, seaborn, bokeh)? I'm really fine with any package, though.

It took me a while, but I finally solved this using Datashader. If you are using a notebook, the plots can be embedded into interactive Bokeh plots, which looks really nice.
Anyhow, here is the code for static images, in case someone else is in need of something similar:
# coding: utf-8
import time
import numpy as np
from numpy.polynomial import polynomial
import pandas as pd
import matplotlib.pyplot as plt
import datashader as ds
import datashader.transfer_functions as tf
plt.style.use("seaborn-whitegrid")  # this style is named "seaborn-v0_8-whitegrid" in Matplotlib 3.6 and later
def create_data():
    # ...
# Each column is one data sample
df = create_data()
# The following appends a NaN row and reshapes the dataframe into two columns, with the samples stacked on top of each other
# THIS IS CRUCIAL TO OPTIMIZE SPEED: https://github.com/bokeh/datashader/issues/286
# Append a row of NaN values (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
nan_row = pd.DataFrame([np.array([np.nan] * len(df.columns))], columns=df.columns, index=[np.nan])
df = pd.concat([df, nan_row])
# Reshape (DataFrame.as_matrix was removed in pandas 1.0, so to_numpy is used)
x, y = df.shape
arr = df.to_numpy().reshape((x * y, 1), order='F')
df_reshaped = pd.DataFrame(arr, columns=list('y'), index=np.tile(df.index.values, y))
df_reshaped = df_reshaped.reset_index()
df_reshaped.columns = ['x', 'y']
# Plotting parameters
x_range = (min(df.index.values), max(df.index.values))
y_range = (df.min().min(), df.max().max())
w = 1000
h = 750
dpi = 150
cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=h, plot_width=w)
# Aggregate data
t0 = time.time()
aggs = cvs.line(df_reshaped, 'x', 'y', ds.count())
print("Time to aggregate line data: {}".format(time.time()-t0))
# One colored plot
t1 = time.time()
stacked_img = tf.Image(tf.shade(aggs, cmap=["darkblue", "darkblue"]))
print("Time to create stacked image: {}".format(time.time() - t1))
# Save
f0 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax0 = f0.add_subplot(111)
ax0.imshow(stacked_img.to_pil())
ax0.grid(False)
f0.savefig("stacked.png", bbox_inches="tight", dpi=dpi)
# Heat map - this uses an equalized histogram (the built-in default); there are other options, though.
t2 = time.time()
heatmap_img = tf.Image(tf.shade(aggs, cmap=plt.cm.Spectral_r))
print("Time to create stacked image: {}".format(time.time() - t2))
# Save
f1 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax1 = f1.add_subplot(111)
ax1.imshow(heatmap_img.to_pil())
ax1.grid(False)
f1.savefig("heatmap.png", bbox_inches="tight", dpi=dpi)
With the following run times (in seconds):
Time to aggregate line data: 0.7710442543029785
Time to create stacked image: 0.06000351905822754
Time to create stacked image: 0.05600309371948242
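To see what the NaN-row reshape in the code above actually produces, here is a minimal sketch on a tiny, made-up frame. The `toy` frame, its values and the names `padded` and `stacked` are assumptions for illustration only, not the original data. Each column is stacked under the previous one, with a NaN row in between so that a line renderer breaks the line between samples instead of joining them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 samples (columns) observed at 3 x positions (index)
toy = pd.DataFrame({"s0": [1.0, 2.0, 3.0], "s1": [4.0, 5.0, 6.0]},
                   index=[0.0, 0.5, 1.0])

# Append one NaN row (NaN index, NaN values) to terminate every sample
nan_row = pd.DataFrame([[np.nan] * toy.shape[1]], columns=toy.columns, index=[np.nan])
padded = pd.concat([toy, nan_row])

# Stack the columns on top of each other in column-major ('F') order
rows, cols = padded.shape
stacked = pd.DataFrame({
    "x": np.tile(padded.index.to_numpy(), cols),
    "y": padded.to_numpy().reshape(rows * cols, order="F"),
})
print(stacked)
```

The result has one `x` and one `y` column, and every fourth row is all NaN, which marks the end of one sample.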
The resulting plots:

Although it seems you have tried this, plotting the counts seems to give a good representation of the data. However, it really depends on what you're trying to find in your data: what is it supposed to tell you?
The long run time comes from plotting so many lines; a heatmap based on the counts, however, will plot fairly quickly.
I created some dummy data for sine waves, based on noise, number of lines, amplitude and shift, and added both a boxplot and a heatmap.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import random
import pandas as pd
np.random.seed(0)
#create dummy data
N = 200
sinuses = []
no_lines = 200
for i in range(no_lines):
    a = np.random.randint(5, 40)/5 #amplitude
    x = random.choice([int(N/5), int(N/(2/5))]) #random shift
    sinuses.append(np.roll(a * np.sin(np.linspace(0, 2 * np.pi, N)) + np.random.randn(N), x))
fig = plt.figure(figsize=(20 / 2.54, 20 / 2.54))
sins = pd.DataFrame(sinuses, )
ax1 = plt.subplot2grid((3,10), (0,0), colspan=10)
ax2 = plt.subplot2grid((3,10), (1,0), colspan=10)
ax3 = plt.subplot2grid((3,10), (2,0), colspan=9)
ax4 = plt.subplot2grid((3,10), (2,9))
# plot line data
sins.T.plot(ax=ax1, color='lightblue',linewidth=.3)
ax1.legend_.remove()
ax1.set_xlim(0, N)
# try boxplot
sins.plot.box(ax=ax2, showfliers=False)
xticks = ax2.xaxis.get_major_ticks()
for index, label in enumerate(ax2.get_xaxis().get_ticklabels()):
    xticks[index].set_visible(False) # hide ticks where labels are hidden
#make a list of bins
no_bins = 20
bins = list(np.arange(sins.min().min(), sins.max().max(), int(abs(sins.min().min())+sins.max().max())/no_bins))
bins.append(sins.max().max())
# calculate histogram
hists = []
for col in sins.columns:
    count, division = np.histogram(sins.iloc[:,col], bins=bins)
    hists.append(count)
hists = pd.DataFrame(hists, columns=[str(i) for i in bins[1:]])
print(hists.shape, '\n', hists.head())
cmap = mpl.colors.ListedColormap(['white', '#FFFFBB', '#C3FDB8', '#B5EAAA', '#64E986', '#54C571',
                                  '#4AA02C', '#347C17', '#347235', '#25383C', '#254117'])
#heatmap
im = ax3.pcolor(hists.T, cmap=cmap)
cbar = plt.colorbar(im, cax=ax4)
yticks = np.arange(0, len(bins))
yticklabels = hists.columns.tolist()
ax3.set_yticks(yticks)
ax3.set_yticklabels([round(i,1) for i in bins])
ax3.set_title('Count')
yticks = ax3.yaxis.get_major_ticks()
for index, label in enumerate(ax3.get_yaxis().get_ticklabels()):
    if index % 3 != 0: #make some labels invisible
        yticks[index].set_visible(False) # hide ticks where labels are hidden
plt.show()
Although the boxplot is easy to interpret, it doesn't show the actual distribution of the data very well, but knowing where the median and quantiles lie may be helpful.
Increasing the number of lines and the number of values per line will increase plotting time considerably for the line plots, though the heatmap is still fairly quick to generate. The boxplot becomes indiscernible, however.
I couldn't exactly replicate your data (or know the actual size of it), but perhaps the heatmap may be helpful.
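If the per-column histogram loop itself becomes the bottleneck, the same kind of count matrix can be built in one call. The following is a minimal sketch using `np.histogram2d` on synthetic sine waves. The array names, the bin count and the data are assumptions for illustration, not the data from the question. Note that it counts sampled points per cell, not rasterized line segments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 noisy sine waves, each sampled at 100 positions
n_lines, n_points = 200, 100
x = np.arange(n_points)
lines = 3 * np.sin(np.linspace(0, 2 * np.pi, n_points)) + rng.normal(size=(n_lines, n_points))

# Pair every y value with its x position and bin all points at once
xs = np.tile(x, n_lines)   # x position of each value, line after line
ys = lines.ravel()         # matching y values in the same order
n_bins = 20
counts, x_edges, y_edges = np.histogram2d(
    xs, ys,
    bins=[np.arange(n_points + 1) - 0.5, n_bins],  # one x bin per sample position
)

# counts[i, j] = number of lines whose value at position i falls into y bin j
print(counts.shape)
```

Something like `plt.pcolormesh(x_edges, y_edges, counts.T)` could then be used to draw the result as a heatmap.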

Related

How to add (or annotate) value labels (or frequencies) on a matplotlib "histogram" chart

I want to add frequency labels to the histogram generated using plt.hist.
Here is the data :
np.random.seed(30)
d = np.random.randint(1, 101, size = 25)
print(sorted(d))
I looked up other questions on stackoverflow like :
Adding value labels on a matplotlib bar chart
and their answers, but apparently the objects returned by plt.plot(kind='bar') are different from those returned by plt.hist, and I got errors while using the 'get_height' or 'get_width' functions, as suggested in some of the answers for bar plots.
Similarly, couldn't find the solution by going through the matplotlib documentation on histograms.
got this error
Here is how I managed it. If anyone has some suggestions to improve my answer, (specifically the for loop and using n=0, n=n+1, I think there must be a better way to write the for loop without having to use n in this manner), I'd welcome it.
# import base packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# generate data
np.random.seed(30)
d = np.random.randint(1, 101, size = 25)
print(sorted(d))
# generate histogram
# a histogram returns 3 objects : n (i.e. frequencies), bins, patches
freq, bins, patches = plt.hist(d, edgecolor='white', label='d', bins=range(1,101,10))
# x coordinate for labels
bin_centers = np.diff(bins)*0.5 + bins[:-1]
n = 0
for fr, x, patch in zip(freq, bin_centers, patches):
    height = int(freq[n])
    plt.annotate("{}".format(height),
                 xy = (x, height), # top centre of the histogram bar
                 xytext = (0,0.2), # offsetting label position above its bar
                 textcoords = "offset points", # Offset (in points) from the *xy* value
                 ha = 'center', va = 'bottom'
                 )
    n = n+1
plt.legend()
plt.show()
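One way to drop the manual counter is to use the values that `zip` already yields on each iteration. This is only a sketch on the same data as above. The `labels` list is there just so the result can be inspected, and the `Agg` backend line is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(30)
d = np.random.randint(1, 101, size=25)

fig, ax = plt.subplots()
freq, bins, patches = ax.hist(d, edgecolor="white", label="d", bins=range(1, 101, 10))

# Midpoint of each bin, used as the x coordinate of its label
bin_centers = 0.5 * (bins[:-1] + bins[1:])

labels = []
for count, x in zip(freq, bin_centers):
    # count is the bar height, so no separate index into freq is needed
    text = ax.annotate(str(int(count)), xy=(x, count), xytext=(0, 2),
                       textcoords="offset points", ha="center", va="bottom")
    labels.append(text)
ax.legend()
```

On Matplotlib 3.4 or newer, `ax.bar_label(patches)` should give a similar result in a single call.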

Using timedelta to properly align peaks on graph

I am using the scipy.signal library to find the peaks of a time graph. I passed in the y values of my pandas series, and it gave me the locations of the peaks. Now I am trying to use the locations from the find_peaks function to return the position in time of the peaks. Here is my function:
def turn_peaks_to_time_series(df,t_interval):
    df_values = df['l'].values
    fig, ax1 = plt.subplots()
    x_of_peaks, _ = find_peaks(df_values, height=None)
    y_of_peaks = df_values[x_of_peaks]
    x_values_to_t_values = lambda x : timedelta(minutes=x) * t_interval
    time_initial = np.min(df.index)
    t_of_peaks = [ time_initial + x_values_to_t_values(int(i)) for i in x_of_peaks ] #source of issue
    ax1.plot(t_of_peaks, y_of_peaks, "rp",label='peak') #plot peaks on graph
    ax1.plot(df.index,df.l) # plot df line
    plt.show()
However, peaks are not properly aligning
I know the issue is with my x_values_to_t_values function. In addition, any suggestions to optimize my code are very welcome.
Turns out I was trying to reinvent the wheel. The solution to my problem was extremely simple. I also adjusted the code to be more general.
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
def turn_peaks_to_time_series(series):
    series_values = series.values
    series_index = series.index
    fig, ax1 = plt.subplots()
    x_of_peaks, _ = find_peaks(series_values, height=None)
    y_of_peaks = series_values[x_of_peaks]
    ax1.plot(series_index[x_of_peaks], y_of_peaks, "rp",label='peak') #plot peaks on graph
    ax1.plot(series_index,series_values) # plot df line
    plt.show()
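The point of the fix is that find_peaks returns integer positions, and those positions can index the series index directly. Here is a small sketch without any plotting that shows just that step. The sine data, the start date and the one-minute spacing are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# Hypothetical series: two sine periods sampled once per minute
index = pd.date_range("2021-01-01", periods=100, freq="min")
series = pd.Series(np.sin(np.linspace(0, 4 * np.pi, 100)), index=index)

# find_peaks returns integer positions into the value array
positions, _ = find_peaks(series.to_numpy())

# Indexing the DatetimeIndex with those positions yields the peak timestamps
peak_times = series.index[positions]
peak_values = series.to_numpy()[positions]
print(peak_times)
```

No timedelta arithmetic is needed, and the approach keeps working when the index is not evenly spaced.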

Annotated heatmap with multiple color schemes

I have the following dataframe and would like to differentiate the minor decimal differences in each "step" with a different color scheme in a heatmap.
Sample data:
Sample Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8
A 64.847 54.821 20.897 39.733 23.257 74.942 75.945
B 64.885 54.767 20.828 39.613 23.093 74.963 75.928
C 65.036 54.772 20.939 39.835 23.283 74.944 75.871
D 64.869 54.740 21.039 39.889 23.322 74.925 75.894
E 64.911 54.730 20.858 39.608 23.101 74.956 75.930
F 64.838 54.749 20.707 39.394 22.984 74.929 75.941
G 64.887 54.781 20.948 39.748 23.238 74.957 75.909
H 64.903 54.720 20.783 39.540 23.028 74.898 75.911
I 64.875 54.761 20.911 39.695 23.082 74.897 75.866
J 64.839 54.717 20.692 39.377 22.853 74.849 75.939
K 64.857 54.736 20.934 39.699 23.130 74.880 75.903
L 64.754 54.746 20.777 39.536 22.991 74.877 75.902
M 64.798 54.811 20.963 39.824 23.187 74.886 75.895
An example of what I am looking for:
My first approach would be based on a figure with multiple subplots. The number of plots would equal the number of columns in your dataframe; the gap between the plots could be shrunk down to zero:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# df is assumed to hold the sample data above, indexed by the "Sample" column
cm = ['Blues', 'Reds', 'Greens', 'Oranges', 'Purples', 'bone', 'winter']
f, axs = plt.subplots(1, df.columns.size, gridspec_kw={'wspace': 0})
for i, (s, a, c) in enumerate(zip(df.columns, axs, cm)):
    sns.heatmap(np.array([df[s].values]).T, yticklabels=df.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c, cbar=False)
    if i>0:
        a.yaxis.set_ticks([])
Result:
Not sure if this will lead to a helpful or even self describing visualization of data, but that's your choice - perhaps this helps to start...
Supplemental:
Regarding adding the colorbars: of course you can. But - besides not knowing the background of your data and the purpose of the visualization - I'd like to add some thoughts on all that:
First: adding all those colorbars as a separate bunch of bars on one side or below the heatmap is probably possible, but I find it already quite hard to read the data, plus: you already have all those annotations - it would mess all up I think.
Additionally: in the meantime @ImportanceOfBeingErnest provided such a beautiful solution on that topic that this would not be too meaningful here, imo.
Second: if you really want to stick to the heatmap thing, perhaps splitting up and giving every column its colorbar would suit better:
cm = ['Blues', 'Reds', 'Greens', 'Oranges', 'Purples', 'bone', 'winter']
f, axs = plt.subplots(1, df.columns.size, figsize=(10, 3))
for i, (s, a, c) in enumerate(zip(df.columns, axs, cm)):
    sns.heatmap(np.array([df[s].values]).T, yticklabels=df.index, xticklabels=[s], annot=True, fmt='.2f', ax=a, cmap=c)
    if i>0:
        a.yaxis.set_ticks([])
f.tight_layout()
However, all that said - I dare to doubt that this is the best visualization for your data. Of course, I don't know what you want to say, see or find with these plots, but that's the point: if the visualization type would fit to the needs, I guess I'd know (or at least could imagine).
Just for example:
A simple df.plot() results in
and I feel that this tells more about different characteristics of your columns within some tenths of a second than the heatmap.
Or are you explicitly after the differences from each column's mean?
(df - df.mean()).plot()
... or the distribution of each column around them?
(df - df.mean()).boxplot()
What I want to say: data visualization becomes powerful when a plot begins to tell something about the underlying data before you begin/have to explain anything...
I suppose the problem can be divided into several parts.
Getting several heatmaps with different colormaps into the same picture. This can be done by masking the complete array column-wise, plotting each masked array separately via imshow and applying a different colormap to each. To visualize the concept:
Obtaining a variable number of distinct colormaps. Matplotlib provides a large number of colormaps; however, they are in general very different concerning luminosity and saturation. Here it seems desirable to have colormaps of differing hue, but otherwise the same saturation and luminosity.
An option is to create the colormaps on the fly, choosing n different (and equally spaced) hues, and create a colormap using the same saturation and luminosity.
Obtaining a distinct colorbar for each column. Since the values within columns might be on totally different scales, a colorbar for each column would be needed to know the values shown, e.g. in the first column the brightest color may correspond to a value of 1, while in the second column it may correspond to a value of 100. Several colorbars can be created inside the axes of a GridSpec which is placed next to the actual heatmap axes. The number of columns and rows of that gridspec would depend on the number of columns in the dataframe.
In total this may then look as follows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.gridspec import GridSpec
def get_hsvcmap(i, N, rot=0.):
    nsc = 24
    chsv = mcolors.rgb_to_hsv(plt.cm.hsv(((np.arange(N)/N)+rot) % 1.)[i,:3])
    rhsv = mcolors.rgb_to_hsv(plt.cm.Reds(np.linspace(.2,1,nsc))[:,:3])
    arhsv = np.tile(chsv,nsc).reshape(nsc,3)
    arhsv[:,1:] = rhsv[:,1:]
    rgb = mcolors.hsv_to_rgb(arhsv)
    return mcolors.LinearSegmentedColormap.from_list("",rgb)
def columnwise_heatmap(array, ax=None, **kw):
    ax = ax or plt.gca()
    premask = np.tile(np.arange(array.shape[1]), array.shape[0]).reshape(array.shape)
    images = []
    for i in range(array.shape[1]):
        col = np.ma.array(array, mask = premask != i)
        im = ax.imshow(col, cmap=get_hsvcmap(i, array.shape[1], rot=0.5), **kw)
        images.append(im)
    return images
### Create some dataset
ind = list("ABCDEFGHIJKLM")
m = len(ind)
n = 8
df = pd.DataFrame(np.random.randn(m,n) + np.random.randint(20,70,n),
                  index=ind, columns=[f"Step {i}" for i in range(2,2+n)])
### Plot data
fig, ax = plt.subplots(figsize=(8,4.5))
ims = columnwise_heatmap(df.values, ax=ax, aspect="auto")
ax.set(xticks=np.arange(len(df.columns)), yticks=np.arange(len(df)),
       xticklabels=df.columns, yticklabels=df.index)
ax.tick_params(bottom=False, top=False,
               labelbottom=False, labeltop=True, left=False)
### Optionally add colorbars.
fig.subplots_adjust(left=0.06, right=0.65)
rows = 3
cols = len(df.columns) // rows + int(len(df.columns)%rows > 0)
gs = GridSpec(rows, cols)
gs.update(left=0.7, right=0.95, wspace=1, hspace=0.3)
for i, im in enumerate(ims):
    cax = fig.add_subplot(gs[i//cols, i % cols])
    fig.colorbar(im, cax = cax)
    cax.set_title(df.columns[i], fontsize=10)
plt.show()
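The masking step inside columnwise_heatmap can be checked in isolation. Here is a minimal sketch on a small made-up array (the 3x4 array and the choice of column 1 are assumptions for illustration): every cell outside the chosen column is masked, so imshow would leave those cells blank and only draw that one column with its colormap.

```python
import numpy as np

# Small made-up array: 3 rows, 4 columns
array = np.arange(12, dtype=float).reshape(3, 4)

# premask[r, c] == c, i.e. every cell holds its own column number
premask = np.tile(np.arange(array.shape[1]), array.shape[0]).reshape(array.shape)

# Mask everything except column 1; only the unmasked cells would be drawn
col = np.ma.array(array, mask=premask != 1)
print(col)
```

Repeating this for each column index gives one masked array per column, all with the full array shape, which is why the images line up when drawn on the same axes.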

Bokeh: linking a line plot and a scatter plot

I have a line plot and a scatter plot that are conceptually linked by sample IDs, i.e. each dot on the 2D scatter plot corresponds to a line on the line plot.
While I have done linked plotting before using scatter plots, I have not seen examples of this for the situation above - where I select dots and thus selectively view lines.
Is it possible to link dots on a scatter plot to a line on a line plot? If so, is there an example implementation available online?
Searching the web for bokeh link line and scatter plot yields no examples online, as of 14 August 2018.
I know this is a little late - but maybe this snippet of code will help?
import numpy as np
from bokeh.io import output_file, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.models import Circle,MultiLine
def play():
    x = np.linspace(0,10,100)
    y = np.random.rand(100)
    xs = np.random.rand(100,3)
    ys = np.random.normal(size=(100,3))
    xp = [list(xi) for xi in xs] # Because MultiLine does not like numpy arrays
    yp = [list(yi) for yi in ys]
    output_file('play.html')
    # Both glyphs share one ColumnDataSource, so selecting rows in one plot selects the same rows in the other
    source = ColumnDataSource(data=dict(x=x,y=y,xp=xp,yp=yp))
    TOOLS = 'box_select'
    left = figure(tools=TOOLS,plot_width=700,plot_height=700)
    c1 = left.circle('x','y',source=source)
    c1.nonselection_glyph = Circle(fill_color='gray',fill_alpha=0.4,
                                   line_color=None)
    c1.selection_glyph = Circle(fill_color='orange',line_color=None)
    right = figure(tools=TOOLS,plot_width=700,plot_height=700)
    c2 = right.multi_line(xs='xp',ys='yp',source=source)
    c2.nonselection_glyph = MultiLine(line_color='gray',line_alpha=0.2)
    c2.selection_glyph = MultiLine(line_color='orange')
    p = gridplot([[left, right]])
    show(p)
play() # run the example
As things turn out, I was able to make this happen by using HoloViews rather than Bokeh. The relevant example for making this work comes from the Selection1d tap stream.
http://holoviews.org/reference/streams/bokeh/Selection1D_tap.html#selection1d-tap
I will do an annotated version of the example below.
First, we begin with imports. (Note: all of this assumes work is being done in the Jupyter notebook.)
import numpy as np
import holoviews as hv
from holoviews.streams import Selection1D
from scipy import stats
hv.extension('bokeh')
First off, we set some styling options for the charts. In my experience, I usually build the chart before styling it, though.
%%opts Scatter [color_index=2 tools=['tap', 'hover'] width=600] {+framewise} (marker='triangle' cmap='Set1' size=10)
%%opts Overlay [toolbar='above' legend_position='right'] Curve (line_color='black') {+framewise}
This function below generates data.
def gen_samples(N, corr=0.8):
    xx = np.array([-0.51, 51.2])
    yy = np.array([0.33, 51.6])
    means = [xx.mean(), yy.mean()]
    stds = [xx.std() / 3, yy.std() / 3]
    covs = [[stds[0]**2 , stds[0]*stds[1]*corr],
            [stds[0]*stds[1]*corr, stds[1]**2]]
    return np.random.multivariate_normal(means, covs, N)
data = [('Week %d' % (i%10), np.random.rand(), chr(65+np.random.randint(5)), i) for i in range(100)]
sample_data = hv.NdOverlay({i: hv.Points(gen_samples(np.random.randint(1000, 5000), r2))
                            for _, r2, _, i in data})
The real magic begins here. First off, we set up a scatterplot using the hv.Scatter object.
points = hv.Scatter(data, ['Date', 'r2'], ['block', 'id']).redim.range(r2=(0., 1))
Then, we create a Selection1D stream. It pulls in points from the points object.
stream = Selection1D(source=points)
We then create a function to display the regression plot on the right. There's an empty plot that is the "default", and then there's a callback that hv.DynamicMap calls on.
empty = (hv.Points(np.random.rand(0, 2)) * hv.Curve(np.random.rand(0, 2))).relabel('No selection')
def regression(index):
    if not index:
        return empty
    scatter = sample_data[index[0]]
    xs, ys = scatter['x'], scatter['y']
    slope, intercep, rval, pval, std = stats.linregress(xs, ys)
    xs = np.linspace(*scatter.range(0)+(2,))
    reg = slope*xs+intercep
    return (scatter * hv.Curve((xs, reg))).relabel('r2: %.3f' % slope)
Now, we create the DynamicMap which dynamically loads the regression curve data.
reg = hv.DynamicMap(regression, kdims=[], streams=[stream])
# Ignoring annotation for average - it is not relevant here.
average = hv.Curve(points, 'Date', 'r2').aggregate(function=np.mean)
Finally, we display the plots.
points * average + reg
The most important thing I learned from building this is that the indices for the points have to be lined up with the indices for the regression curves.
I hope this helps others building awesome viz using HoloViews!

Generate a heatmap in MatPlotLib using a scatter data set

My question is almost identical to this one. However, I'm not satisfied with the answers, because I want to generate an actual heatmap, without explicitly binning the data.
To be precise, I would like to display the function that is the result of a convolution between the scatter data and a custom kernel, such as 1/x^2.
How should I implement this with matplotlib?
EDIT: Basically, what I have done is this. The result is here. I'd like to keep everything, the axis, the title, the labels and so on. Basically just change the plot to be like I described, while re-implementing as little as possible.
Convert your time series data into a numeric format with matplotlib.dates.date2num. Lay down a rectangular grid that spans your x and y ranges and do your convolution on that grid. Make a pseudo-color plot of your convolution and then reformat the x labels to be dates.
The label formatting is a little messy, but reasonably well documented. You just need to replace AutoDateFormatter with DateFormatter and an appropriate formatting string.
You'll need to tweak the constants in the convolution for your data.
import numpy as np
import datetime as dt
import pylab as plt
import matplotlib.dates as dates
t0 = dt.date.today()
t1 = t0+dt.timedelta(days=10)
times = np.linspace(dates.date2num(t0), dates.date2num(t1), 10)
dt = times[-1]-times[0]
price = 100 - (times-times.mean())**2
dp = price.max() - price.min()
volume = np.linspace(1, 100, 10)
tgrid = np.linspace(times.min(), times.max(), 100)
pgrid = np.linspace(70, 110, 100)
tgrid, pgrid = np.meshgrid(tgrid, pgrid)
heat = np.zeros_like(tgrid)
for t,p,v in zip(times, price, volume):
    delt = (t-tgrid)**2
    delp = (p-pgrid)**2
    heat += v/( delt + delp*1.e-2 + 5.e-1 )**2
fig = plt.figure()
ax = fig.add_subplot(111)
ax.pcolormesh(tgrid, pgrid, heat, cmap='gist_heat_r')
plt.scatter(times, price, volume, marker='x')
locator = dates.DayLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(dates.AutoDateFormatter(locator))
fig.autofmt_xdate()
plt.show()
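For more than a handful of points, the accumulation loop can also be written with NumPy broadcasting. The following is a sketch under stated assumptions: the three points, the grid limits and the regularised inverse-square `kernel` (with a made-up `eps` of 0.5) are placeholders, not the constants used above, and would need tuning for real data.

```python
import numpy as np

# Made-up scatter data: positions (t, p) with weights v
t = np.array([0.0, 1.0, 2.0])
p = np.array([1.0, 3.0, 2.0])
v = np.array([1.0, 2.0, 0.5])

# Regular grid on which the smoothed field is evaluated
tgrid, pgrid = np.meshgrid(np.linspace(-1, 3, 50), np.linspace(0, 4, 40))

def kernel(dt2, dp2, eps=0.5):
    # Regularised inverse-square kernel; eps avoids the singularity at zero distance
    return 1.0 / (dt2 + dp2 + eps)

# Broadcast: points along a new leading axis, grid along the trailing two axes
dt2 = (t[:, None, None] - tgrid[None]) ** 2
dp2 = (p[:, None, None] - pgrid[None]) ** 2
heat = (v[:, None, None] * kernel(dt2, dp2)).sum(axis=0)
print(heat.shape)
```

This builds an intermediate array of shape (points, rows, columns), so for very many points or a very fine grid the explicit loop may be the more memory-friendly choice.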
