Generate a heatmap in MatPlotLib using a scatter data set

My question is almost identical to this one. However, I'm not satisfied with the answers, because I want to generate an actual heatmap without explicitly binning the data.
To be precise, I would like to display the function that is the result of a convolution between the scatter data and a custom kernel, such as 1/x^2.
How should I implement this with matplotlib?
EDIT: Basically, what I have done is this. The result is here. I'd like to keep everything, the axis, the title, the labels and so on. Basically just change the plot to be like I described, while re-implementing as little as possible.

Convert your time series data into a numeric format with matplotlib.dates.date2num. Lay down a rectangular grid that spans your x and y ranges and do your convolution on that grid. Make a pseudo-color plot of your convolution and then reformat the x labels to be dates.
The label formatting is a little messy, but reasonably well documented. You just need to replace AutoDateFormatter with DateFormatter and an appropriate formatting string.
You'll need to tweak the constants in the convolution for your data.
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dates
# toy time series: ten points over ten days, stored as matplotlib date numbers
t0 = dt.date.today()
t1 = t0 + dt.timedelta(days=10)
times = np.linspace(dates.date2num(t0), dates.date2num(t1), 10)
price = 100 - (times - times.mean())**2
volume = np.linspace(1, 100, 10)
# rectangular grid spanning the time and price ranges
tgrid = np.linspace(times.min(), times.max(), 100)
pgrid = np.linspace(70, 110, 100)
tgrid, pgrid = np.meshgrid(tgrid, pgrid)
# sum the kernel contribution of every data point on the grid
heat = np.zeros_like(tgrid)
for t, p, v in zip(times, price, volume):
    delt = (t - tgrid)**2
    delp = (p - pgrid)**2
    heat += v / (delt + delp*1.e-2 + 5.e-1)**2
fig = plt.figure()
ax = fig.add_subplot(111)
ax.pcolormesh(tgrid, pgrid, heat, cmap='gist_heat_r')
ax.scatter(times, price, volume, marker='x')
# show the numeric x values as dates again
locator = dates.DayLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(dates.AutoDateFormatter(locator))
fig.autofmt_xdate()
plt.show()
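For larger data sets the per-point loop above can be replaced by NumPy broadcasting. The following is only a sketch of that idea, not part of the original answer: the function name kernel_heat and the parameters y_scale and soften are made up here, and their defaults simply mirror the constants 1.e-2 and 5.e-1 used above.

```python
import numpy as np

def kernel_heat(x, y, weights, xgrid, ygrid, y_scale=1.0e-2, soften=5.0e-1):
    """Sum a softened inverse-square-style kernel of every point over a grid.

    x, y, weights: 1-D arrays of equal length (the scatter data).
    xgrid, ygrid: 2-D arrays as returned by np.meshgrid.
    y_scale, soften: hypothetical tuning constants, same role as in the loop.
    """
    # add two trailing axes so each point broadcasts against the whole grid
    x = np.asarray(x, dtype=float)[:, None, None]
    y = np.asarray(y, dtype=float)[:, None, None]
    w = np.asarray(weights, dtype=float)[:, None, None]
    dist2 = (x - xgrid) ** 2 + y_scale * (y - ygrid) ** 2 + soften
    # sum the contributions of all points (axis 0) for every grid cell
    return (w / dist2 ** 2).sum(axis=0)

# same toy data as above, with plain numbers standing in for date numbers
times = np.linspace(0.0, 10.0, 10)
price = 100 - (times - times.mean()) ** 2
volume = np.linspace(1, 100, 10)
tgrid, pgrid = np.meshgrid(np.linspace(0, 10, 100), np.linspace(70, 110, 100))
heat = kernel_heat(times, price, volume, tgrid, pgrid)
```

The result can be passed to pcolormesh exactly like the heat array from the loop. Note that the intermediate array has shape (n_points, ny, nx), so for very many points the loop may still be the more memory-friendly choice.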

Related

Polar coordinates, datetimes

I am trying to plot some time-dependent data in a polar coordinate system. The output is so far off that I have no clue where to start: times, locators, … none of them are drawn “correctly”.
I need major ticks every 2 hours, minor ticks every hour, and a day label at each day transition. In normal coordinates it seems fine, but after switching to polar it gets more complicated and very confusing. I am missing something, but I don't know what.
I have tried with
p_locator = mpolar.ThetaLocator(mdates.AutoDateLocator(minticks=24, maxticks=24))
p_formatter = mpolar.ThetaFormatter()
or with
p_locator = mdates.AutoDateLocator()
p_formatter = mdates.DateFormatter("%H:%M")
but without success. I think I have misunderstood how matplotlib works internally with datetime objects. Not only are the ticks, locators and so on "wrong", but the data don't even fill the full circle (should I apply a scale transformation?).
I would really appreciate some help to understand the mechanism behind it.
Update the axis with
ax.set_xticks(time_ticks)
ax.xaxis.set_major_formatter(p_formatter)
ax.xaxis.set_major_locator(p_locator)
Here is a code sample:
import datetime
import random
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import matplotlib.projections.polar as mpolar
N_MAJOR_TICKS = 12 # number of major ticks
IS_MULTI_DAYS = True
def random_daily_hours(n_times, base_date=datetime.datetime.today()): # ordered list of random times within one day
    secs = sorted([random.randint(0, 24*60*60 - 1) for _ in range(n_times)])
    return [base_date + datetime.timedelta(seconds=s) for s in secs]
def uniform_daily_hours(n, base_date=datetime.datetime.today()): # evenly spaced datetime objects, not matplotlib float dates
    return [base_date + datetime.timedelta(days=1)*i/n for i in range(n)]
# time ticks
time_ticks = uniform_daily_hours(N_MAJOR_TICKS)
# random values
random.seed(10) # fixing random state for reproducibility
x = random_daily_hours(15)
y = [random.random() for _ in range(len(x))]
# multi days - append a consecutive day
if IS_MULTI_DAYS:
    day = datetime.timedelta(days=1)
    x += random_daily_hours(15, base_date=datetime.datetime.today() + 1 * day)
    y = [random.random() for _ in range(len(x))]
ax = plt.subplot(projection='polar')
# fix scale & orientation
ax.set_rticks([0.5, 1, 1.5, 2]) # set values of the radial ticks
ax.set_rlabel_position(120) # set location of the radial scale
ax.set_theta_zero_location('N') # set polar reference direction
ax.set_theta_direction(-1) # set orientation - clockwise
# attempt 1
ax.set_xticks(np.linspace(0, 2 * np.pi, N_MAJOR_TICKS, endpoint=False))
ax.set_xticklabels(time_ticks)
ax.plot(x, y, '-')
ax.grid(True)
plt.show()
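One way to approach the scaling part of this question is to skip the date locators entirely and convert each timestamp to an angle by hand, so that 24 hours correspond to one full turn. This is only a sketch under that assumption, not an answer from the thread: time_to_theta, origin, samples and values are hypothetical names and made-up data, and the day label at a day transition is not covered.

```python
import datetime
import numpy as np
import matplotlib.pyplot as plt

SECONDS_PER_DAY = 24 * 60 * 60

def time_to_theta(moment, origin):
    """Angle in radians for a datetime: one full turn per 24 hours after origin."""
    elapsed = (moment - origin).total_seconds()
    return 2 * np.pi * elapsed / SECONDS_PER_DAY

# made-up samples spanning a bit more than one day
origin = datetime.datetime(2024, 1, 1)  # midnight of the first day
samples = [origin + datetime.timedelta(hours=h) for h in (1, 5, 9, 14, 20, 27, 33)]
values = [0.2, 0.8, 0.5, 1.1, 0.7, 0.9, 0.4]
theta = [time_to_theta(s, origin) for s in samples]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.set_theta_zero_location("N")  # midnight at the top
ax.set_theta_direction(-1)       # clockwise, like a clock face
ax.plot(theta, values, "o-")     # angles above 2*pi wrap onto the next turn

# major ticks every 2 hours with HH:MM labels, minor ticks every hour
major_hours = np.arange(0, 24, 2)
ax.set_xticks(2 * np.pi * major_hours / 24)
ax.set_xticklabels([f"{h:02d}:00" for h in major_hours])
ax.set_xticks(2 * np.pi * np.arange(24) / 24, minor=True)
```

Because the angle is computed from the elapsed time rather than the time of day, data from the second day continue around the circle instead of jumping back across it. Call plt.show() afterwards to display the figure.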

Offset secondary axis in matplotlib

I'm trying to bring together two different plot settings in matplotlib. I found nice examples for each of them in the matplotlib example gallery/documentation and on Stack Overflow, but I couldn't find anything on my specific problem.
So what I know so far is, how to add one or more axes with offset y-axis for plotting different data with respect to the same x-axis, by using ax.twinx(). The third y-axis is called parasite axis in the example Parasite axis demo. However, if you want to add an additional axis which is just a scaled version of the existing one, you can use ax.secondary_yaxis(), as shown in the Secondary axis demo. There is no additional data to be plotted.
What I could not achieve so far is a secondary y-axis which is offset from the original one. This can be very helpful to make plots more readable across scientific communities. For instance, while some scientists use frequency as reference for the electromagnetic spectrum, others use the wavelength or the wavenumber. Afsar [1] used a very convenient axis labeling which includes all the three variables in the same plot:
I would like to do something similar, just on the y-axis instead of the x-axis. Is there a way to offset the secondary axis from the primary axis? I tried a few parameters but couldn't figure it out.
Thank you for any help!
[1] Afsar, Mohammed Nurul. “Precision Millimeter-Wave Measurements of Complex Refractive Index, Complex Dielectric Permittivity, and Loss Tangent of Common Polymers.” IEEE Transactions on Instrumentation and Measurement IM–36, no. 2 (June 1987): 530–36. https://doi.org/10.1109/TIM.1987.6312733.

A complete example. The third-to-last line is the relevant one.
import datetime
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
dates = [datetime.datetime(2018, 1, 1) + datetime.timedelta(hours=k * 6)
         for k in range(240)]
temperature = np.random.randn(len(dates)) * 4 + 6.7
fig, ax = plt.subplots(constrained_layout=True)
ax.plot(dates, temperature)
ax.set_ylabel(r'$T\ [^oC]$')
plt.xticks(rotation=70)
def date2yday(x):
    """Convert matplotlib datenum to days since 2018-01-01."""
    y = x - mdates.date2num(datetime.datetime(2018, 1, 1))
    return y
def yday2date(x):
    """Return a matplotlib datenum for *x* days after 2018-01-01."""
    y = x + mdates.date2num(datetime.datetime(2018, 1, 1))
    return y
secax_x = ax.secondary_xaxis('top', functions=(date2yday, yday2date))
secax_x.set_xlabel('yday [2018]')
def celsius_to_fahrenheit(x):
    return x * 1.8 + 32
def fahrenheit_to_celsius(x):
    return (x - 32) / 1.8
secax_y = ax.secondary_yaxis(
    'right', functions=(celsius_to_fahrenheit, fahrenheit_to_celsius))
secax_y.set_ylabel(r'$T\ [^oF]$')
def celsius_to_anomaly(x):
    return (x - np.mean(temperature))
def anomaly_to_celsius(x):
    return (x + np.mean(temperature))
# document use of a float for the position:
secax_y2 = ax.secondary_yaxis(
    1.2, functions=(celsius_to_anomaly, anomaly_to_celsius))
secax_y2.set_ylabel(r'$T - \overline{T}\ [^oC]$')
plt.show()
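Applied to the frequency/wavelength/wavenumber case from the question, the same float-position trick might look like the sketch below. The data are made up, the helper names (ghz_to_mm, ghz_to_wavenumber, wavenumber_to_ghz) are hypothetical, and the offset 1.25 is an arbitrary choice you will probably need to tune.

```python
import numpy as np
import matplotlib.pyplot as plt

C_MM_GHZ = 299.792458  # speed of light expressed in mm * GHz

def ghz_to_mm(freq_ghz):
    """Frequency in GHz -> wavelength in mm (the same formula is its own inverse)."""
    freq_ghz = np.asarray(freq_ghz, dtype=float)
    with np.errstate(divide="ignore"):  # tolerate a zero axis limit
        return C_MM_GHZ / freq_ghz

def ghz_to_wavenumber(freq_ghz):
    """Frequency in GHz -> wavenumber in 1/cm."""
    return np.asarray(freq_ghz, dtype=float) * 10.0 / C_MM_GHZ

def wavenumber_to_ghz(wavenumber):
    """Wavenumber in 1/cm -> frequency in GHz."""
    return np.asarray(wavenumber, dtype=float) * C_MM_GHZ / 10.0

# made-up data: some quantity plotted against frequency on the y-axis
freq = np.linspace(100.0, 600.0, 50)
signal = np.sqrt(freq)

fig, ax = plt.subplots(constrained_layout=True)
ax.plot(signal, freq)
ax.set_xlabel("signal (arbitrary units)")
ax.set_ylabel("frequency [GHz]")

# first secondary axis directly on the right spine
sec_wavelength = ax.secondary_yaxis("right", functions=(ghz_to_mm, ghz_to_mm))
sec_wavelength.set_ylabel("wavelength [mm]")

# a float location is in axes coordinates: 1.0 is the right spine, 1.25 lies further out
sec_wavenumber = ax.secondary_yaxis(
    1.25, functions=(ghz_to_wavenumber, wavenumber_to_ghz))
sec_wavenumber.set_ylabel(r"wavenumber [cm$^{-1}$]")
```

Each secondary axis needs a forward and an inverse function. Since wavelength is inversely proportional to frequency, that axis runs in the opposite direction to the primary one. Call plt.show() afterwards to display the figure.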

Here is another approach, although maybe it's more of a hack:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
# tick formatter: radians -> degrees
def twin1_formatter(x, pos):
    return f'{x/np.pi*180:.0f}'
# tick formatter: radians -> multiples of pi
def twin2_formatter(x, pos):
    return rf'{x/np.pi:.1f} $\pi$'
data = np.arange(0, 2*np.pi, 0.1)
fig, ax = plt.subplots()
twin1 = ax.twiny()
# move the first twin axis outwards so it does not overlap the second one
twin1.spines['top'].set_position(('axes', 1.2))
twin1.set_xlabel('Degrees')
twin1.xaxis.set_major_formatter(FuncFormatter(twin1_formatter))
twin2 = ax.twiny()
twin2.set_xlabel('Pies')
twin2.xaxis.set_major_formatter(FuncFormatter(twin2_formatter))
twin2.xaxis.set_ticks(np.array([0, 1/2, 1, 3/2, 2])*np.pi)
ax.plot(data, np.sin(data))
ax.set_xlabel('Radians')
# keep all three axes on the same range
twin1.set_xlim(ax.get_xlim())
twin2.set_xlim(ax.get_xlim())
plt.show()

Adding quantitative values to differentiate data through colours in a scatterplot's legend in Python?

Currently, I'm working on an introductory paper on data manipulation and such; however... the CSV I'm working on has some things I wish to do a scatter graph on!
I want a scatter graph to show me the volume sold on certain items as well as their average price, differentiating all data according to their region (Through colours I assume).
So what I want is to know if I can add the region column as a quantitative value
or if there's a way to make this possible...
It's my first time using Python and I'm confused way too often

I'm not sure if this is what you mean, but here is some working code, assuming you have data in the format of [(country, volume, price), ...]. If not, you can change the inputs to the scatter method as needed.
import random
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
n_countries = 50
# get the data into "countries", for example
countries = ...
# in this example: countries is [('BS', 21, 25), ('WZ', 98, 25), ...]
df = pd.DataFrame(countries)
# arbitrary method to get a color
def get_color(i, max_i):
    cmap = plt.get_cmap('Spectral')
    return cmap(i/max_i)
# get the figure and axis - make a larger figure to fit more points
# add labels for metric names
def get_fig_ax():
    fig = plt.figure(figsize=(14,14))
    ax = fig.add_subplot(1, 1, 1)
    ax.set_xlabel('volume')
    ax.set_ylabel('price')
    return fig, ax
# switch around the assignments depending on your data
def get_x_y_labels():
    x = df[1]
    y = df[2]
    labels = df[0]
    return x, y, labels
offset = 1 # offset just so annotations aren't on top of points
x, y, labels = get_x_y_labels()
fig, ax = get_fig_ax()
# add a point and annotation for each of the labels/regions
for i, region in enumerate(labels):
    ax.annotate(region, (x[i] + offset, y[i] + offset))
    # note that you must use "label" for "legend" to work
    ax.scatter(x[i], y[i], color=get_color(i, len(x)), label=region)
# Add the legend just outside of the plot.
# The .1, 0 at the end will put it outside
ax.legend(loc='upper right', bbox_to_anchor=(1, 1, .1, 0))
plt.show()
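If your CSV has several rows per region, it may be easier to draw one scatter call per region so that each region gets its own colour and a single legend entry. The sketch below assumes column names region, volume and avg_price and uses made-up numbers; replace the inline DataFrame with pd.read_csv on your own file.

```python
import pandas as pd
import matplotlib.pyplot as plt

# made-up stand-in for the CSV contents; the column names are assumptions
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "West", "West"],
    "volume": [120, 340, 90, 410, 230, 180],
    "avg_price": [1.10, 0.95, 1.40, 0.88, 1.05, 1.20],
})

fig, ax = plt.subplots()
# one scatter call per region: matplotlib cycles the colour automatically
# and the label creates exactly one legend entry per region
for region, group in df.groupby("region"):
    ax.scatter(group["volume"], group["avg_price"], label=region)
ax.set_xlabel("volume")
ax.set_ylabel("average price")
ax.legend(title="region")
```

The region column stays categorical here; there is no need to convert it into a number just to get colours. Call plt.show() afterwards to display the figure.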

Adjusting x-axis in matplotlib

I have a range of values for every hour of year. Which means there are 24 x 365 = 8760 values. I want to plot this information neatly with matplotlib, with x-axis showing January, February......
Here is my current code:
from matplotlib import pyplot as plt
plt.plot(x_data,y_data,label=str("Plot"))
plt.xticks(rotation=45)
plt.xlabel("Time")
plt.ylabel("Y axis values")
plt.title("Y axis values vs Time")
plt.legend(loc='upper right')
axes = plt.gca()
axes.set_ylim([0,some_value * 3])
plt.show()
x_data is a list containing dates in datetime format. y_data contains values corresponding to the values in x_data. How can I get the plot neatly done with months on the X axis? An example:

You could create a scatter plot with horizontal lines as markers. The month is extracted by using the datetime module. In case the dates are not ordered, the plot sorts both lists first according to the date:
#creating a toy dataset for one year, random data points within month-specific limits
from datetime import date, timedelta
import random
x_data = [date(2017, 1, 1) + timedelta(days = i) for i in range(365)]
random.shuffle(x_data)
y_data = [random.randint(50 * (i.month - 1), 50 * i.month) for i in x_data]
#the actual plot starts here
from matplotlib import pyplot as plt
#get a scatter plot with horizontal markers for each data point
#in case the dates are not ordered, sort first the dates and the y values accordingly
plt.scatter([day.strftime("%b") for day in sorted(x_data)], [y for _xsorted, y in sorted(zip(x_data, y_data))], marker = "_", s = 900)
plt.show()
The disadvantage is obviously that the lines have a fixed length. Also, if a month doesn't have a data point, it will not appear in the graph.
Edit 1:
You could also use Axes.hlines, as seen here.
This has the advantage, that the line length changes with the window size. And you don't have to pre-sort the lists, because each start and end point is calculated separately.
The toy dataset is created as above.
from matplotlib import pyplot as plt
#prepare the axis with categories Jan to Dec
x_ax = [date(2017, 1, 1) + timedelta(days = 31 * i) for i in range(12)]
#create invisible bar chart to retrieve start and end points from automatically generated bars
Bars = plt.bar([month.strftime("%b") for month in x_ax], [month.month for month in x_ax], align = "center", alpha = 0)
start_1_12 = [plt.getp(item, "x") for item in Bars]
end_1_12 = [plt.getp(item, "x") + plt.getp(item, "width") for item in Bars]
#retrieve start and end point for each data point line according to its month
x_start = [start_1_12[day.month - 1] for day in x_data]
x_end = [end_1_12[day.month - 1] for day in x_data]
#plot hlines for all data points
plt.hlines(y_data, x_start, x_end, colors = "blue")
plt.show()
Edit 2:
Now your description of the problem is totally different from what you show in your question. You want a simple line plot with specific axis formatting. This can be found easily in the matplotlib documentation and all over SO. An example, how to achieve this with the above created toy dataset would be:
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, MonthLocator
ax = plt.subplot(111)
ax.plot([day for day in sorted(x_data)], [y for _xsorted, y in sorted(zip(x_data, y_data))], "r.-")
ax.xaxis.set_major_locator(MonthLocator(bymonthday=15))
ax.xaxis.set_minor_locator(MonthLocator())
ax.xaxis.set_major_formatter(DateFormatter("%B"))
plt.show()
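For the hourly data described in the question (24 x 365 = 8760 values), the same locator/formatter approach might look like this sketch. The y values are placeholders, and the abbreviated month format "%b" is just one possible choice.

```python
from datetime import datetime, timedelta
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, MonthLocator

# one timestamp per hour of a non-leap year
start = datetime(2017, 1, 1)
x_data = [start + timedelta(hours=h) for h in range(24 * 365)]
y_data = np.sin(np.linspace(0, 2 * np.pi, len(x_data)))  # placeholder values

fig, ax = plt.subplots()
ax.plot(x_data, y_data, label="Plot")
# one major tick at the start of every month, labelled Jan, Feb, ...
ax.xaxis.set_major_locator(MonthLocator())
ax.xaxis.set_major_formatter(DateFormatter("%b"))
ax.set_xlim(x_data[0], x_data[-1])
ax.set_xlabel("Time")
ax.legend(loc="upper right")
fig.autofmt_xdate(rotation=45)
```

Call plt.show() afterwards to display the figure.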

Line-based heatmap or 2D line histogram

I have a synthetic dataset with 1000 noisy polynomials of various orders and sin/cos curves that I can plot as lines using Python's seaborn.
Since I have quite a few lines that are overlapping, I'd like to plot some sort of heatmap or histogram of my line graphs.
I've tried iterating over the columns and aggregating the counts to use seaborn's heatmap graph, but with many lines this takes quite a while.
The next best thing that comes close to what I want is a hexbin graph (made with seaborn's jointplot).
But it's a compromise between runtime and granularity (the shown graph has gridsize 750). I couldn't find any other graph-type for my problem. But I also don't know exactly what it might be called.
I've also tried with line alpha set to 0.2. This results in a similar graph to what I want. But it's less precise (if more than 5 lines overlap at the same point I already have zero transparency left). Also, it misses the typical coloration of heatmaps.
(Moot search terms were: heatmap, 2D line histogram, line histogram, density plots...)
Does anybody know of packages that plot this more efficiently and at high(er) quality, or how to do it with the popular Python plotters (i.e. the matplotlib family: matplotlib, seaborn, bokeh)? I'm really fine with any package, though.

It took me awhile, but I finally solved this using Datashader. If using a notebook, the plots can be embedded into interactive Bokeh plots, which looks really nice.
Anyhow, here is the code for static images, in case someone else is in need of something similar:
# coding: utf-8
import time
import numpy as np
from numpy.polynomial import polynomial
import pandas as pd
import matplotlib.pyplot as plt
import datashader as ds
import datashader.transfer_functions as tf
plt.style.use("seaborn-v0_8-whitegrid")  # named "seaborn-whitegrid" before matplotlib 3.6
def create_data():
    # ...
    raise NotImplementedError("data generation was elided in the original answer")
# Each column is one data sample
df = create_data()
# Following will append a nan-row and reshape the dataframe into two columns, with each sample stacked on top of each other
# THIS IS CRUCIAL TO OPTIMIZE SPEED: https://github.com/bokeh/datashader/issues/286
# Append row with nan-values
df = pd.concat([df, pd.DataFrame([np.array([np.nan] * len(df.columns))], columns=df.columns, index=[np.nan])])
# Reshape
x, y = df.shape
arr = df.to_numpy().reshape((x * y, 1), order='F')
df_reshaped = pd.DataFrame(arr, columns=list('y'), index=np.tile(df.index.values, y))
df_reshaped = df_reshaped.reset_index()
df_reshaped.columns = ['x', 'y']
# Plotting parameters
x_range = (min(df.index.values), max(df.index.values))
y_range = (df.min().min(), df.max().max())
w = 1000
h = 750
dpi = 150
cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=h, plot_width=w)
# Aggregate data
t0 = time.time()
aggs = cvs.line(df_reshaped, 'x', 'y', ds.count())
print("Time to aggregate line data: {}".format(time.time()-t0))
# One colored plot
t1 = time.time()
stacked_img = tf.Image(tf.shade(aggs, cmap=["darkblue", "darkblue"]))
print("Time to create stacked image: {}".format(time.time() - t1))
# Save
f0 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax0 = f0.add_subplot(111)
ax0.imshow(stacked_img.to_pil())
ax0.grid(False)
f0.savefig("stacked.png", bbox_inches="tight", dpi=dpi)
# Heat map - This uses a equalized histogram (built-in default), there are other options, though.
t2 = time.time()
heatmap_img = tf.Image(tf.shade(aggs, cmap=plt.cm.Spectral_r))
print("Time to create stacked image: {}".format(time.time() - t2))
# Save
f1 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax1 = f1.add_subplot(111)
ax1.imshow(heatmap_img.to_pil())
ax1.grid(False)
f1.savefig("heatmap.png", bbox_inches="tight", dpi=dpi)
With following run times (in seconds):
Time to aggregate line data: 0.7710442543029785
Time to create stacked image: 0.06000351905822754
Time to create stacked image: 0.05600309371948242
The resulting plots:
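To make the NaN-row reshape step easier to follow, here is a tiny stand-alone sketch of it on a frame with two lines of three points each. The helper name stack_with_nan_breaks is made up for illustration; it is not a Datashader function.

```python
import numpy as np
import pandas as pd

def stack_with_nan_breaks(wide):
    """Turn a wide frame (index = x, one column per line) into long x/y columns.

    A NaN row is appended after every line so that a line renderer can tell
    where one line ends and the next one begins.
    """
    n_rows, n_cols = wide.shape
    # x values of one line, followed by the NaN separator
    x = np.append(wide.index.to_numpy(dtype=float), np.nan)
    # y values with one extra NaN row at the bottom
    y = np.vstack([wide.to_numpy(dtype=float), np.full((1, n_cols), np.nan)])
    return pd.DataFrame({
        "x": np.tile(x, n_cols),          # repeat the x block once per line
        "y": y.reshape(-1, order="F"),    # column-major: line after line
    })

wide = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}, index=[0, 1, 2])
long = stack_with_nan_breaks(wide)
print(long)
```

The printed frame has eight rows: the three points of line a, a NaN row, the three points of line b, and a final NaN row.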

Although it seems you have tried this, plotting the counts seems to give a good representation of the data. However, it really depends on what you're trying to find in your data: what is it supposed to tell you?
The long run time comes from plotting so many lines; a heatmap based on the counts, however, will plot fairly quickly.
I created some dummy data for sine waves, based on noise, number of lines, amplitude and shift. I added both a boxplot and a heatmap.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import random
import pandas as pd
np.random.seed(0)
#create dummy data
N = 200
sinuses = []
no_lines = 200
for i in range(no_lines):
    a = np.random.randint(5, 40)/5 #amplitude
    x = random.choice([int(N/5), int(N/(2/5))]) #random shift
    sinuses.append(np.roll(a * np.sin(np.linspace(0, 2 * np.pi, N)) + np.random.randn(N), x))
fig = plt.figure(figsize=(20 / 2.54, 20 / 2.54))
sins = pd.DataFrame(sinuses, )
ax1 = plt.subplot2grid((3,10), (0,0), colspan=10)
ax2 = plt.subplot2grid((3,10), (1,0), colspan=10)
ax3 = plt.subplot2grid((3,10), (2,0), colspan=9)
ax4 = plt.subplot2grid((3,10), (2,9))
# plot line data
sins.T.plot(ax=ax1, color='lightblue',linewidth=.3)
ax1.legend_.remove()
ax1.set_xlim(0, N)
# try boxplot
sins.plot.box(ax=ax2, showfliers=False)
xticks = ax2.xaxis.get_major_ticks()
for index, label in enumerate(ax2.get_xaxis().get_ticklabels()):
    xticks[index].set_visible(False) # hide ticks where labels are hidden
#make a list of bins
no_bins = 20
bins = list(np.arange(sins.min().min(), sins.max().max(), int(abs(sins.min().min())+sins.max().max())/no_bins))
bins.append(sins.max().max())
# calculate histogram
hists = []
for col in sins.columns:
    count, division = np.histogram(sins.iloc[:,col], bins=bins)
    hists.append(count)
hists = pd.DataFrame(hists, columns=[str(i) for i in bins[1:]])
print(hists.shape, '\n', hists.head())
cmap = mpl.colors.ListedColormap(['white', '#FFFFBB', '#C3FDB8', '#B5EAAA', '#64E986', '#54C571',
                                  '#4AA02C', '#347C17', '#347235', '#25383C', '#254117'])
#heatmap
im = ax3.pcolor(hists.T, cmap=cmap)
cbar = plt.colorbar(im, cax=ax4)
yticks = np.arange(0, len(bins))
yticklabels = hists.columns.tolist()
ax3.set_yticks(yticks)
ax3.set_yticklabels([round(i,1) for i in bins])
ax3.set_title('Count')
yticks = ax3.yaxis.get_major_ticks()
for index, label in enumerate(ax3.get_yaxis().get_ticklabels()):
    if index % 3 != 0: #make some labels invisible
        yticks[index].set_visible(False) # hide ticks where labels are hidden
plt.show()
Although the boxplot is easy to interpret, it doesn't show the actual distribution of the data very well, but knowing where the median and quantiles lie may be helpful.
Increasing the number of lines and the number of values per line will increase plotting time considerably for the line plots, whereas the heatmap is still fairly quick to generate. The boxplot, however, becomes indiscernible.
I couldn't exactly replicate your data (or know the actual size of it), but perhaps the heatmap may be helpful.
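If the per-column histogram loop itself becomes the bottleneck, the counts can be computed in a single call to np.histogram2d. This is only a sketch of that idea, assuming the lines are stored as an array with one row per line; the function name column_histograms and the demo data are made up.

```python
import numpy as np

def column_histograms(lines, bin_edges):
    """Count, for every x position, how many lines fall into each value bin.

    lines: array of shape (n_lines, n_points), one row per line.
    bin_edges: 1-D array of value bin edges.
    Returns an array of shape (n_bins, n_points).
    """
    lines = np.asarray(lines, dtype=float)
    n_lines, n_points = lines.shape
    # x position (column index) of every single value
    cols = np.broadcast_to(np.arange(n_points), lines.shape)
    # one bin per column, centred on the integer column index
    col_edges = np.arange(n_points + 1) - 0.5
    counts, _, _ = np.histogram2d(
        lines.ravel(), cols.ravel(), bins=[bin_edges, col_edges])
    return counts

# made-up demo data: 50 noisy lines with 30 points each
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 30))
edges = np.linspace(demo.min(), demo.max(), 11)
counts = column_histograms(demo, edges)
```

The returned array already has the value bins along the first axis, so it can be passed to pcolor or pcolormesh without transposing.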
