I have a line plot and a scatter plot that are conceptually linked by sample IDs, i.e. each dot on the 2D scatter plot corresponds to a line on the line plot.
While I have done linked plotting before using scatter plots, I have not seen examples of this for the situation above - where I select dots and thus selectively view lines.
Is it possible to link dots on a scatter plot to a line on a line plot? If so, is there an example implementation available online?
Searching the web for "bokeh link line and scatter plot" yields no examples, as of 14 August 2018.
I know this is a little late - but maybe this snippet of code will help?
import numpy as np
from bokeh.io import output_file, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.models import Circle,MultiLine
def play():
    x = np.linspace(0, 10, 100)
    y = np.random.rand(100)
    xs = np.random.rand(100, 3)
    ys = np.random.normal(size=(100, 3))
    xp = [list(xi) for xi in xs]  # because MultiLine does not like numpy arrays
    yp = [list(yi) for yi in ys]

    output_file('play.html')
    source = ColumnDataSource(data=dict(x=x, y=y, xp=xp, yp=yp))

    TOOLS = 'box_select'
    left = figure(tools=TOOLS, plot_width=700, plot_height=700)
    c1 = left.circle('x', 'y', source=source)
    c1.nonselection_glyph = Circle(fill_color='gray', fill_alpha=0.4,
                                   line_color=None)
    c1.selection_glyph = Circle(fill_color='orange', line_color=None)

    right = figure(tools=TOOLS, plot_width=700, plot_height=700)
    c2 = right.multi_line(xs='xp', ys='yp', source=source)
    c2.nonselection_glyph = MultiLine(line_color='gray', line_alpha=0.2)
    c2.selection_glyph = MultiLine(line_color='orange')

    p = gridplot([[left, right]])
    show(p)

play()
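The two plots stay linked because both glyphs draw from the same ColumnDataSource: a box selection on the circles selects those rows in the source, and the MultiLine glyph on the right highlights the lines for exactly those rows via its selection/nonselection glyphs.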
As things turn out, I was able to make this happen by using HoloViews rather than Bokeh. The relevant example for making this work comes from the Selection1d tap stream.
http://holoviews.org/reference/streams/bokeh/Selection1D_tap.html#selection1d-tap
I will do an annotated version of the example below.
First, we begin with imports. (Note: all of this assumes work is being done in the Jupyter notebook.)
import numpy as np
import holoviews as hv
from holoviews.streams import Selection1D
from scipy import stats
hv.extension('bokeh')
Next, we set some styling options for the charts. In my experience, I usually build the chart before styling it, though.
%%opts Scatter [color_index=2 tools=['tap', 'hover'] width=600] {+framewise} (marker='triangle' cmap='Set1' size=10)
%%opts Overlay [toolbar='above' legend_position='right'] Curve (line_color='black') {+framewise}
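If the %%opts cell magics aren't available (for example, outside a notebook, or on newer HoloViews versions where the magics are discouraged), roughly the same options can be set with the opts API. This is just a sketch under that assumption; option names can differ slightly across HoloViews versions:
from holoviews import opts

# Roughly equivalent to the %%opts magics above (a sketch, not a drop-in guarantee)
opts.defaults(
    opts.Scatter(color_index=2, tools=['tap', 'hover'], width=600,
                 marker='triangle', cmap='Set1', size=10, framewise=True),
    opts.Overlay(toolbar='above', legend_position='right', framewise=True),
    opts.Curve(line_color='black'),
)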
This function below generates data.
def gen_samples(N, corr=0.8):
    xx = np.array([-0.51, 51.2])
    yy = np.array([0.33, 51.6])
    means = [xx.mean(), yy.mean()]
    stds = [xx.std() / 3, yy.std() / 3]
    covs = [[stds[0]**2, stds[0]*stds[1]*corr],
            [stds[0]*stds[1]*corr, stds[1]**2]]
    return np.random.multivariate_normal(means, covs, N)
data = [('Week %d' % (i%10), np.random.rand(), chr(65+np.random.randint(5)), i) for i in range(100)]
sample_data = hv.NdOverlay({i: hv.Points(gen_samples(np.random.randint(1000, 5000), r2))
                            for _, r2, _, i in data})
The real magic begins here. First, we set up a scatter plot using the hv.Scatter object.
points = hv.Scatter(data, ['Date', 'r2'], ['block', 'id']).redim.range(r2=(0., 1))
Then, we create a Selection1D stream. It pulls in points from the points object.
stream = Selection1D(source=points)
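For orientation (not part of the original example), the stream simply carries the integer indices of whatever is currently selected:
# After tapping one or more points, the stream's 'index' parameter holds the
# selected row indices (a sketch of what you would see):
print(stream.contents)   # e.g. {'index': [3, 17]}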
We then create a function to display the regression plot on the right. There's an empty plot that serves as the default, and a callback that hv.DynamicMap invokes whenever the selection changes.
empty = (hv.Points(np.random.rand(0, 2)) * hv.Curve(np.random.rand(0, 2))).relabel('No selection')
def regression(index):
    if not index:
        return empty
    scatter = sample_data[index[0]]
    xs, ys = scatter['x'], scatter['y']
    slope, intercep, rval, pval, std = stats.linregress(xs, ys)
    xs = np.linspace(*scatter.range(0)+(2,))
    reg = slope*xs+intercep
    return (scatter * hv.Curve((xs, reg))).relabel('r2: %.3f' % slope)
Now, we create the DynamicMap which dynamically loads the regression curve data.
reg = hv.DynamicMap(regression, kdims=[], streams=[stream])
# Ignoring annotation for average - it is not relevant here.
average = hv.Curve(points, 'Date', 'r2').aggregate(function=np.mean)
Finally, we display the plots.
points * average + reg
The most important thing I learned from building this is that the indices for the points have to be lined up with the indices for the regression curves.
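A quick way to sanity-check that alignment in this example (a sketch based on the structures above, not part of the original code):
# Every integer id carried by the Scatter points must also be a key of the
# NdOverlay that holds the per-sample point clouds.
assert all(i in sample_data.keys() for *_, i in data)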
I hope this helps others building awesome viz using HoloViews!
Related
I am using the scipy.signal library to find the peaks of a time series. I passed in the y values of my pandas Series, and it gave me the locations of the peaks. Now I am trying to use the locations from the find_peaks function to return the positions in time of the peaks. Here is my function:
from datetime import timedelta

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks

def turn_peaks_to_time_series(df, t_interval):
    df_values = df['l'].values
    fig, ax1 = plt.subplots()
    x_of_peaks, _ = find_peaks(df_values, height=None)
    y_of_peaks = df_values[x_of_peaks]
    x_values_to_t_values = lambda x: timedelta(minutes=x) * t_interval
    time_initial = np.min(df.index)
    t_of_peaks = [time_initial + x_values_to_t_values(int(i)) for i in x_of_peaks]  # source of issue
    ax1.plot(t_of_peaks, y_of_peaks, "rp", label='peak')  # plot peaks on graph
    ax1.plot(df.index, df.l)  # plot df line
    plt.show()
However, the peaks are not aligning properly with the line.
I know the issue is with my x_values_to_t_values function. In addition, any suggestions to optimize my code are very welcome.
It turns out I was trying to reinvent the wheel; the solution to my problem was extremely simple. I also adjusted the code to be more general.
def turn_peaks_to_time_series(series):
    series_values = series.values
    series_index = series.index
    fig, ax1 = plt.subplots()
    x_of_peaks, _ = find_peaks(series_values, height=None)
    y_of_peaks = series_values[x_of_peaks]
    ax1.plot(series_index[x_of_peaks], y_of_peaks, "rp", label='peak')  # plot peaks on graph
    ax1.plot(series_index, series_values)  # plot the series as a line
    plt.show()
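For anyone else using this, here is a hypothetical usage sketch with a time-indexed Series (the index can be any type; it is used directly for the x axis):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import find_peaks

idx = pd.date_range("2021-01-01", periods=500, freq="min")   # hypothetical timestamps
s = pd.Series(np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.rand(500), index=idx)
turn_peaks_to_time_series(s)   # peaks are plotted at their actual timestamps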
Using Blender, I created this model,
which can be seen in A-frame at this link.
This model is great and gives an overview of what I'm trying to accomplish here. Basically, instead of having the names, I'd have dots, each symbolizing one specific platform.
As far as I can see, the best way to achieve this with the current state of the art is through Plotly 3D Scatter Plots. I've got the following scatter plot:
import plotly.express as px
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/tiago-peres/immersion/master/Platforms_dataset.csv')
fig = px.scatter_3d(df, x='Functionality ', y='Accessibility', z='Immersion', color='Platforms')
fig.show()
which you can open in Colab at the click of a button via this link.
This nearly looks like the model. Yet, I still need to add three planes to the plot at specific locations, more precisely at x=?, y=? and z=? (I'm using question marks because the values can be anything we establish).
In other words, I want to add three planes to that scatter plot:
x = 10
y = 30
z = 40
In the documentation, the closest thing to what I want is 3D Surface Plots.
I've done some research and found two similar questions for R:
Insert 2D plane into a 3D Plotly scatter plot in R
Add Regression Plane to 3d Scatter Plot in Plotly
I think you might be looking for the add_trace function in Plotly, so you can just create the surfaces and then add them to the figure.
Also note there are definitely ways to simplify this code, but for a general idea:
import plotly.express as px
import pandas as pd
import plotly.graph_objects as go
import numpy as np

# df comes from the question above
df = pd.read_csv('https://raw.githubusercontent.com/tiago-peres/immersion/master/Platforms_dataset.csv')

fig = px.scatter_3d(df, x='Functionality ', y='Accessibility', z='Immersion', color='Platforms')
bright_blue = [[0, '#7DF9FF'], [1, '#7DF9FF']]
bright_pink = [[0, '#FF007F'], [1, '#FF007F']]
light_yellow = [[0, '#FFDB58'], [1, '#FFDB58']]
# need to add starting point of 0 to each dimension so the plane extends all the way out
zero_pt = pd.Series([0])
z = zero_pt.append(df['Immersion'], ignore_index = True).reset_index(drop = True)
y = zero_pt.append(df['Accessibility'], ignore_index = True).reset_index(drop = True)
x = zero_pt.append(df['Functionality '], ignore_index = True).reset_index(drop = True)
length_data = len(z)
z_plane_pos = 40*np.ones((length_data,length_data))
fig.add_trace(go.Surface(x=x, y=y, z=z_plane_pos, colorscale=light_yellow, showscale=False))
fig.add_trace(go.Surface(x=x.apply(lambda x: 10), y=y, z = np.array([z]*length_data), colorscale= bright_blue, showscale=False))
fig.add_trace(go.Surface(x=x, y= y.apply(lambda x: 30), z = np.array([z]*length_data).transpose(), colorscale=bright_pink, showscale=False))
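As a possible simplification (my own sketch, not part of the original answer): each constant plane can also be built from a tiny meshgrid instead of reusing the data columns. The 0 to 100 axis ranges here are assumptions; match them to your data.
import numpy as np
import plotly.graph_objects as go

xx, yy = np.meshgrid(np.linspace(0, 100, 2), np.linspace(0, 100, 2))

# z = 40 plane; the same pattern works for x = 10 and y = 30 by swapping arguments
fig.add_trace(go.Surface(x=xx, y=yy, z=np.full_like(xx, 40),
                         colorscale=light_yellow, showscale=False))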
I have a synthetic dataset with 1000 noisy polygons of various orders and sin/cos curves that I can plot as lines using python seaborn.
Since I have quite a few lines that are overlapping, I'd like to plot some sort of heatmap or histogram of my line graphs.
I've tried iterating over the columns and aggregating the counts to use seaborn's heatmap graph, but with many lines this takes quite a while.
The next best thing that results in something like what I want is a hexbin graph (with seaborn's jointplot).
But it's a compromise between runtime and granularity (the shown graph has gridsize 750). I couldn't find any other graph-type for my problem. But I also don't know exactly what it might be called.
I've also tried with line alpha set to 0.2. This results in a similar graph to what I want. But it's less precise (if more than 5 lines overlap at the same point I already have zero transparency left). Also, it misses the typical coloration of heatmaps.
(Unsuccessful search terms were: heatmap, 2D line histogram, line histogram, density plots...)
Does anybody know of packages to plot this more efficiently and at higher quality, or know how to do it with the popular Python plotters (i.e. the matplotlib family: matplotlib, seaborn, bokeh)? I'm really fine with any package, though.
It took me a while, but I finally solved this using Datashader. If using a notebook, the plots can be embedded into interactive Bokeh plots, which looks really nice.
Anyhow, here is the code for static images, in case someone else is in need of something similar:
# coding: utf-8
import time
import numpy as np
from numpy.polynomial import polynomial
import pandas as pd
import matplotlib.pyplot as plt
import datashader as ds
import datashader.transfer_functions as tf
plt.style.use("seaborn-whitegrid")
def create_data():
    # ...

# Each column is one data sample
df = create_data()
# Following will append a nan-row and reshape the dataframe into two columns, with each sample stacked on top of each other
# THIS IS CRUCIAL TO OPTIMIZE SPEED: https://github.com/bokeh/datashader/issues/286
# Append row with nan-values
df = df.append(pd.DataFrame([np.array([np.nan] * len(df.columns))], columns=df.columns, index=[np.nan]))
# Reshape
x, y = df.shape
arr = df.as_matrix().reshape((x * y, 1), order='F')
df_reshaped = pd.DataFrame(arr, columns=list('y'), index=np.tile(df.index.values, y))
df_reshaped = df_reshaped.reset_index()
df_reshaped.columns.values[0] = 'x'
# Plotting parameters
x_range = (min(df.index.values), max(df.index.values))
y_range = (df.min().min(), df.max().max())
w = 1000
h = 750
dpi = 150
cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=h, plot_width=w)
# Aggregate data
t0 = time.time()
aggs = cvs.line(df_reshaped, 'x', 'y', ds.count())
print("Time to aggregate line data: {}".format(time.time()-t0))
# One colored plot
t1 = time.time()
stacked_img = tf.Image(tf.shade(aggs, cmap=["darkblue", "darkblue"]))
print("Time to create stacked image: {}".format(time.time() - t1))
# Save
f0 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax0 = f0.add_subplot(111)
ax0.imshow(stacked_img.to_pil())
ax0.grid(False)
f0.savefig("stacked.png", bbox_inches="tight", dpi=dpi)
# Heat map - this uses an equalized histogram (built-in default); there are other options, though.
t2 = time.time()
heatmap_img = tf.Image(tf.shade(aggs, cmap=plt.cm.Spectral_r))
print("Time to create stacked image: {}".format(time.time() - t2))
# Save
f1 = plt.figure(figsize=(w / dpi, h / dpi), dpi=dpi)
ax1 = f1.add_subplot(111)
ax1.imshow(heatmap_img.to_pil())
ax1.grid(False)
f1.savefig("heatmap.png", bbox_inches="tight", dpi=dpi)
With the following run times (in seconds):
Time to aggregate line data: 0.7710442543029785
Time to create stacked image: 0.06000351905822754
Time to create stacked image: 0.05600309371948242
The resulting plots:
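One caveat if you run this on newer pandas: DataFrame.append and as_matrix have since been removed. A rough equivalent of the reshape step (a sketch, untested against the original data) would be:
# Append a nan-row, then stack every column into one long (x, y) frame
nan_row = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns, index=[np.nan])
df = pd.concat([df, nan_row])

x, y = df.shape
arr = df.to_numpy().reshape((x * y, 1), order='F')
df_reshaped = pd.DataFrame(arr, columns=['y'], index=np.tile(df.index.values, y)).reset_index()
df_reshaped.columns = ['x', 'y']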
Although it seems you have tried this, plotting the counts gives a good representation of the data. However, it really depends on what you're trying to find in your data: what is it supposed to tell you?
The long run time is due to plotting so many lines; a heatmap based on the counts, however, will plot fairly quickly.
I created some dummy data for sine waves, based on noise, number of lines, amplitude, and shift, and added both a boxplot and a heatmap.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import random
import pandas as pd
np.random.seed(0)
#create dummy data
N = 200
sinuses = []
no_lines = 200
for i in range(no_lines):
    a = np.random.randint(5, 40) / 5  # amplitude
    x = random.choice([int(N/5), int(N/(2/5))])  # random shift
    sinuses.append(np.roll(a * np.sin(np.linspace(0, 2 * np.pi, N)) + np.random.randn(N), x))
fig = plt.figure(figsize=(20 / 2.54, 20 / 2.54))
sins = pd.DataFrame(sinuses, )
ax1 = plt.subplot2grid((3,10), (0,0), colspan=10)
ax2 = plt.subplot2grid((3,10), (1,0), colspan=10)
ax3 = plt.subplot2grid((3,10), (2,0), colspan=9)
ax4 = plt.subplot2grid((3,10), (2,9))
# plot line data
sins.T.plot(ax=ax1, color='lightblue',linewidth=.3)
ax1.legend_.remove()
ax1.set_xlim(0, N)
# try boxplot
sins.plot.box(ax=ax2, showfliers=False)
xticks = ax2.xaxis.get_major_ticks()
for index, label in enumerate(ax2.get_xaxis().get_ticklabels()):
    xticks[index].set_visible(False)  # hide ticks where labels are hidden
#make a list of bins
no_bins = 20
bins = list(np.arange(sins.min().min(), sins.max().max(), int(abs(sins.min().min())+sins.max().max())/no_bins))
bins.append(sins.max().max())
# calculate histogram
hists = []
for col in sins.columns:
    count, division = np.histogram(sins.iloc[:, col], bins=bins)
    hists.append(count)
hists = pd.DataFrame(hists, columns=[str(i) for i in bins[1:]])
print(hists.shape, '\n', hists.head())
cmap = mpl.colors.ListedColormap(['white', '#FFFFBB', '#C3FDB8', '#B5EAAA', '#64E986', '#54C571',
'#4AA02C', '#347C17', '#347235', '#25383C', '#254117'])
#heatmap
im = ax3.pcolor(hists.T, cmap=cmap)
cbar = plt.colorbar(im, cax=ax4)
yticks = np.arange(0, len(bins))
yticklabels = hists.columns.tolist()
ax3.set_yticks(yticks)
ax3.set_yticklabels([round(i,1) for i in bins])
ax3.set_title('Count')
yticks = ax3.yaxis.get_major_ticks()
for index, label in enumerate(ax3.get_yaxis().get_ticklabels()):
    if index % 3 != 0:  # make some labels invisible
        yticks[index].set_visible(False)  # hide ticks where labels are hidden
plt.show()
Although the boxplot is easy to interpret, it doesn't show the actual distribution of the data very well, but knowing where the median and quantiles lie may be helpful.
Increasing the number of lines and the number of values per line will increase plotting time considerably for the line plots; the heatmap is still fairly quick to generate, though. The boxplot, however, becomes indiscernible.
I couldn't exactly replicate your data (or know the actual size of it), but perhaps the heatmap may be helpful.
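For completeness, roughly the same count heatmap can be produced more directly with numpy's 2D histogram (a sketch assuming the sins DataFrame from above, where each row is one line of N values):
xs = np.tile(np.arange(N), no_lines)   # sample position of every value
ys = sins.to_numpy().ravel()           # the values themselves, row by row
counts, xedges, yedges = np.histogram2d(xs, ys, bins=[N, no_bins])

plt.pcolormesh(xedges, yedges, counts.T, cmap=cmap)
plt.colorbar(label='Count')
plt.show()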
I am translating a set of R visualizations to Python. I have the following target R multiple plot histograms:
Using a combination of Matplotlib and Seaborn, and with the help of a kind Stack Overflow member (see the link: Python Seaborn Distplot Y value corresponding to a given X value), I was able to create the following Python plot:
I am satisfied with its appearance, except that I don't know how to put the header information in the plots. Here is my Python code that creates the charts:
""" Program to draw the sampling histogram distributions """
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
def main():
    """ Main routine for the sampling histogram program """
    sns.set_style('whitegrid')
    markers_list = ["s", "o", "*", "^", "+"]
    # create the data dataframe as df_orig
    df_orig = pd.read_csv('lab_samples.csv')
    df_orig = df_orig.loc[df_orig.hra != -9999]
    hra_list_unique = df_orig.hra.unique().tolist()
    # create and subset df_hra_colors to match the actual hra colors in df_orig
    df_hra_colors = pd.read_csv('hra_lookup.csv')
    df_hra_colors['hex'] = np.vectorize(rgb_to_hex)(df_hra_colors['red'], df_hra_colors['green'], df_hra_colors['blue'])
    df_hra_colors.drop(labels=['red', 'green', 'blue'], axis=1, inplace=True)
    df_hra_colors = df_hra_colors.loc[df_hra_colors['hra'].isin(hra_list_unique)]
    # hard coding the current_component to pc1 here, we will extend it by looping
    # through the list of components
    current_component = 'pc1'
    num_tests = 5
    df_columns = df_orig.columns.tolist()
    start_index = 5
    for test in range(num_tests):
        current_tests_list = df_columns[start_index:(start_index + num_tests)]
        # now create the sns distplots for each HRA color and overlay the tests
        i = 1
        for _, row in df_hra_colors.iterrows():
            plt.subplot(3, 3, i)
            select_columns = ['hra', current_component] + current_tests_list
            df_current_color = df_orig.loc[df_orig['hra'] == row['hra'], select_columns]
            y_data = df_current_color.loc[df_current_color[current_component] != -9999, current_component]
            axs = sns.distplot(y_data, color=row['hex'],
                               hist_kws={"ec": "k"},
                               kde_kws={"color": "k", "lw": 0.5})
            data_x, data_y = axs.lines[0].get_data()
            axs.text(0.0, 1.0, row['hra'], horizontalalignment="left", fontsize='x-small',
                     verticalalignment="top", transform=axs.transAxes)
            for current_test_index, current_test in enumerate(current_tests_list):
                # this_x defines the series of current_component (pc1, pc2, rhob) for this test,
                # indicated by 1; the corresponding R program calls this test_vector
                x_series = df_current_color.loc[df_current_color[current_test] == 1, current_component].tolist()
                for this_x in x_series:
                    this_y = np.interp(this_x, data_x, data_y)
                    axs.plot([this_x], [this_y - current_test_index * 0.05],
                             markers_list[current_test_index], markersize=3, color='black')
            axs.xaxis.label.set_visible(False)
            axs.xaxis.set_tick_params(labelsize=4)
            axs.yaxis.set_tick_params(labelsize=4)
            i = i + 1
        start_index = start_index + num_tests
    # plt.show()
    pp = PdfPages('plots.pdf')
    pp.savefig()
    pp.close()

def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)

if __name__ == "__main__":
    main()
The pandas code works fine and does what it is supposed to. It is my lack of knowledge and experience with PdfPages in Matplotlib that is the bottleneck. How can I show, in Python/Matplotlib/Seaborn, the header information that the corresponding R visualization shows? By the header information, I mean what the R visualization has at the top before the histograms, i.e., 'pc1', MRP, XRD, ....
I can get their values easily from my program, e.g., current_component is 'pc1', etc. But I don't know how to format the plots with the header. Can someone provide some guidance?
You may be looking for a figure title or super title, fig.suptitle:
fig.suptitle('this is the figure title', fontsize=12)
In your case you can easily get the figure with plt.gcf(), so try
plt.gcf().suptitle("pc1")
The rest of the information in the header would be called a legend.
For the following let's suppose all subplots have the same markers. It would then suffice to create a legend for one of the subplots.
To create legend labels, you can pass the label argument to the plot call, i.e.
axs.plot( ... , label="MRP")
When later calling axs.legend() a legend will automatically be generated with the respective labels. Ways to position the legend are detailed e.g. in this answer.
Here, you may want to place the legend in terms of figure coordinates, i.e.
axs.legend(loc="lower center", bbox_to_anchor=(0.5, 0.8), bbox_transform=plt.gcf().transFigure)
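Putting both pieces together, a minimal sketch (the marker labels here are hypothetical placeholders for your tests):
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 3)
axs = axes.flat[0]
axs.plot([0.2], [0.5], "s", color="black", label="MRP")   # hypothetical test marker
axs.plot([0.4], [0.3], "o", color="black", label="XRD")   # hypothetical test marker

fig.suptitle("pc1")
axs.legend(loc="lower center", bbox_to_anchor=(0.5, 0.8),
           bbox_transform=fig.transFigure, ncol=2)
plt.show()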
My question is almost identical to this one. However, I'm not satisfied with the answers, because I want to generate an actual heatmap, without explicitly binning the data.
To be precise, I would like to display the function that is the result of a convolution between the scatter data and a custom kernel, such as 1/x^2.
How should I implement this with matplotlib?
EDIT: Basically, what I have done is this. The result is here. I'd like to keep everything (the axes, the title, the labels and so on) and basically just change the plot to be like I described, while re-implementing as little as possible.
Convert your time series data into a numeric format with matplotlib.dates.date2num. Lay down a rectangular grid that spans your x and y ranges and do your convolution on that grid. Make a pseudo-color plot of your convolution and then reformat the x labels to be dates.
The label formatting is a little messy, but reasonably well documented. You just need to replace AutoDateFormatter with DateFormatter and an appropriate formatting string.
You'll need to tweak the constants in the convolution for your data.
import numpy as np
import datetime as dt
import pylab as plt
import matplotlib.dates as dates
t0 = dt.date.today()
t1 = t0+dt.timedelta(days=10)
times = np.linspace(dates.date2num(t0), dates.date2num(t1), 10)
dt = times[-1]-times[0]
price = 100 - (times-times.mean())**2
dp = price.max() - price.min()
volume = np.linspace(1, 100, 10)
tgrid = np.linspace(times.min(), times.max(), 100)
pgrid = np.linspace(70, 110, 100)
tgrid, pgrid = np.meshgrid(tgrid, pgrid)
heat = np.zeros_like(tgrid)
for t, p, v in zip(times, price, volume):
    delt = (t-tgrid)**2
    delp = (p-pgrid)**2
    heat += v/(delt + delp*1.e-2 + 5.e-1)**2
fig = plt.figure()
ax = fig.add_subplot(111)
ax.pcolormesh(tgrid, pgrid, heat, cmap='gist_heat_r')
plt.scatter(times, price, volume, marker='x')
locator = dates.DayLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(dates.AutoDateFormatter(locator))
fig.autofmt_xdate()
plt.show()
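For example, to get explicit date labels instead of the automatic ones, the formatter line above could be swapped for something like the following (the format string is just an example):
# Assumes the same `ax` and `dates` names as in the snippet above
ax.xaxis.set_major_formatter(dates.DateFormatter('%Y-%m-%d'))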