Python interactive plotting for large data sets

Suppose I have a dataset with 100k rows (1000 different times, 100 different series, an observation for each, and auxiliary information). I'd like to create something like the following:
(1) first panel of plot has time on x axis, and average of the different series (and standard error) on y axis.
(2) based on the time slice (vertical line) we hover over in panel 1, display a (potentially downsampled) scatter plot of auxiliary information versus the series value at that time slice.
I've looked into a few options for this: (1) matplotlib + ipywidgets doesn't seem to handle it unless you explicitly select points via a slider. This also doesn't translate well to HTML exporting. This is not ideal, but is potentially workable. (2) altair - this library is pretty sleek, but from my understanding, I need to give it the whole dataset for it to handle the interactions, and it also can't handle more than ~5k data points. This would preclude my use case, correct?
Any suggestions as to how to proceed? Is what I'm asking impossible in the current state of things?

You can work with datasets larger than 5k rows in Altair, as specified in this section of the docs.
One of the most convenient solutions, in my opinion, is to install altair_data_server and then add alt.data_transformers.enable('data_server') at the top of your notebooks and scripts. This server provides the data to Altair for as long as your Python process is running, so there is no need to include all the data as part of the created chart specification, which avoids the 5k-row error. The main drawback is that it won't work if you export to a standalone HTML file, because you rely on being in an environment where the server's Python process is running.
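As a minimal sketch (the DataFrame here is synthetic stand-in data shaped like the question's, since the original dataset isn't shown):

import numpy as np
import pandas as pd
import altair as alt

# Serve data from the running Python process instead of embedding it
# in the chart spec; requires `pip install altair_data_server`.
alt.data_transformers.enable('data_server')

# Synthetic stand-in: 100 series x 1000 time points = 100k rows
df = pd.DataFrame({
    'time': np.tile(np.arange(1000), 100),
    'series': np.repeat(np.arange(100), 1000),
    'value': np.random.randn(100000).cumsum(),
})

# A panel-1-style chart: mean of the series over time
chart = alt.Chart(df).mark_line().encode(
    x='time:Q',
    y='mean(value):Q',
)
chart  # renders without the MaxRowsError normally raised above 5k rows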

Related

Python/Jupyter: scroll scatter-plot with many data points horizontally

I want to visualize time-series-like data with several measurements over time.
There are a lot of such measurements in a dataset, in the order of tens to hundreds of thousands.
In order to view these in a notebook or HTML page, I would like some efficient method to show a subrange of the whole time range with just a few hundred to a thousand data points, and have controls to scroll left/right, i.e. forward/backward in time through the data.
I have tried doing this with Plotly and a range slider, but unfortunately this does not scale to a lot of data at all. Apparently, this approach creates all the graph data in the output javascript, which slows down everything and at some point makes the browser hang or crash.
What I would need is an approach that actually only renders the data in the subrange and interacts with the python code via the scrolling widgets to update the view.
Ideally, this would work with Plotly as I am using it for all other visualizations, but any other efficient solution would also be welcome.
Plotly runs into rendering issues when there are too many data points within the window (see Plotly Benchmarks). I would suggest using Plotly-Resampler which resamples data that is within the user's view.
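A minimal sketch of how plotly-resampler is typically wired up (assuming `pip install plotly-resampler`; the data here is a synthetic random walk):

import numpy as np
import plotly.graph_objects as go
from plotly_resampler import FigureResampler

n = 500_000
x = np.arange(n)
y = np.cumsum(np.random.randn(n))

# FigureResampler keeps the full series in Python and only sends the
# points needed for the current view to the browser, resampling on zoom/pan.
fig = FigureResampler(go.Figure())
fig.add_trace(go.Scattergl(name="signal"), hf_x=x, hf_y=y)
fig.show_dash(mode="inline")  # runs a small Dash app that handles the callbacks

Note that because the resampling happens in Python callbacks, this needs a running kernel and does not work in a static HTML export.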

Plotting table in python (matplotlib alternatives)

Are there any alternatives to matplotlib when it comes to plotting tables?
I find it very inconvenient to plot tables in matplotlib. It's hard to make a small change to one parameter: adjusting the table size (scale), font size, cell contents, column widths/heights, etc. often requires fine-tuning all the other parameters.
There are some alternatives. One of them that you might want to look at is Plotly. Have a look at its documentation and examples showing how to plot interactive tables.
https://plotly.com/python/table/
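A small self-contained example along the lines of those docs (the column names and values here are made up):

import plotly.graph_objects as go

fig = go.Figure(data=[go.Table(
    # Header and cells are styled independently, so tweaking one
    # does not require re-tuning the other.
    header=dict(values=["Name", "Score"], fill_color="lightgrey", align="left"),
    cells=dict(values=[["A", "B", "C"], [1, 2, 3]], align="left"),
)])
fig.show()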

Choroplethmapbox slow to render?

I’ve been playing around with Plotly and Dash for the first time over the past few days, with the hope of developing a browser-based data explorer for geographic NetCDF4 data. I’ve been impressed at how straightforward this has been so far, however I’m finding that some interactions with choroplethmapbox are taking longer to update and render than expected. I believe this may be the same issue discussed here
The following refers to the code and sample data available here, where the Dash application can be run using:
python choropleth.py (Python 3.7).
The source of my data comes from a 4D NetCDF4 file (in this case a model of ocean temperature - temp.nc) with dimensions of time, depth, lat and lon. In my case I'm only plotting a 2D choropleth map, but I'd like the user to interactively select the desired time interval (and eventually depth) as well (the render will always be in 2D space).
Using the examples from here, I’m using a GeoJSON file of the 2D grid cells coupled with a Pandas DataFrame to render ocean temperature. Everything is working as expected, however any changes to the slider value (time) take a long time to update (approx six seconds on my machine). It appears as though there’s a second or so between selecting the slider value and running the update_figure() callback, then another 4-5 seconds before the new render starts to take place in the browser.
The update_figure() callback reads the requested data directly from the NetCDF4 file, then directly updates the Z values in the existing figure dictionary and returns this as a new figure (see code fragment below). At first I was concerned that the slow response time was due to reading from the NetCDF4 file, however a basic timing function shows that the update_figure() callback runs in less than 0.01 seconds in most cases. So it appears the delay is coming either from the @app.callback or the render function (post update_figure()) in Dash?
# Create the callback and callback function (update_figure)
@app.callback(Output('plot', 'figure'),
              [Input('slide', 'value')],
              [State('plot', 'relayoutData'), State('plot', 'figure')])
def update_figure(x, r, f):
    t0 = tme.time()  # (assumes: import time as tme)
    f['layout']['mapbox']['center']['lat'] = f['layout']['mapbox']['center']['lat']
    f['layout']['mapbox']['center']['lon'] = f['layout']['mapbox']['center']['lon']
    f['layout']['mapbox']['zoom'] = f['layout']['mapbox']['zoom']
    # If the map window has been panned or zoomed, grab those values for the new figure
    if r is not None:
        if 'mapbox.center' in r:
            f['layout']['mapbox']['center']['lat'] = r['mapbox.center']['lat']
            f['layout']['mapbox']['center']['lon'] = r['mapbox.center']['lon']
            f['layout']['mapbox']['zoom'] = r['mapbox.zoom']
    # Extract the new time values from the NetCDF file
    tmp = nc['temp'][x, -1, :, :].values.flatten()
    # Replace the Z values in the original figure with the updated values; leave
    # everything else (e.g. cell geojson and max/min ranges) as-is
    f['data'][0]['z'] = np.where(np.isnan(tmp), None, tmp).tolist()
    print("update_figure() time: ", tme.time() - t0)
    return f
I suspect that the slow render times are somehow related to the GeoJSON of each cell polygon (47k grid cell polygons are being rendered in total, with each polygon being defined by 6 points (i.e. 284k points total)), and unfortunately this can’t be simplified any further.
I'm seeking suggestions on how I can speed up the update/render when a user is interacting with the application. Two ideas I've had include:
Utilising WebGL if possible? It's unclear to me from the documentation whether choroplethmapbox already uses WebGL? If not, is there a pathway for making use of this for faster rendering?
Implementing some form of client-side callback, although I don't know if this is possible given that I need to read the values directly out of the NetCDF file when requested by the user? Perhaps it's possible to just read/return the new Z values, then merge them with the existing GeoJSON on the client side? (A rough sketch of this idea follows below.)
Suggestions appreciated.
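For what it's worth, here is a rough, untested sketch of the second idea, not code from the post: the server-side callback would write only the new Z list into a hypothetical dcc.Store (id 'z-store') instead of returning the whole figure, and a clientside callback would patch it into the existing figure in the browser, so the large GeoJSON never makes a round trip:

from dash.dependencies import Input, Output, State

# Hypothetical sketch: `app` is the dash.Dash instance from choropleth.py,
# and the @app.callback above is assumed to now return only the z list
# into dcc.Store(id='z-store') rather than the full figure.
app.clientside_callback(
    """
    function(zData, fig) {
        if (!zData || !fig) { return window.dash_clientside.no_update; }
        // Copy the figure and swap in the new z values client-side
        const newFig = JSON.parse(JSON.stringify(fig));
        newFig.data[0].z = zData;
        return newFig;
    }
    """,
    Output('plot', 'figure'),
    [Input('z-store', 'data')],
    [State('plot', 'figure')],
)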

Bokeh: Repeated plotting in jupyter lab increases (browser) memory usage

I am using Bokeh to plot many time-series (>100) with many points (~20,000) within a Jupyter Lab Notebook.
When executing the cell multiple times in Jupyter, the memory consumption of Chrome increases by over 400 MB per run. After several cell executions Chrome tends to crash, usually once several GB of RAM usage have accumulated. Furthermore, the plotting tends to get slower after each execution.
A "Clear [All] Outputs" or "Restart Kernel and Clear All Outputs..." in Jupyter also does not free any memory. In a classic Jupyter Notebook as well as with Firefox or Edge the issue also occurs.
Minimal version of my .ipynb:
import numpy as np
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
import bokeh
output_notebook() # See e.g.: https://github.com/bokeh/bokeh-notebooks/blob/master/tutorial/01%20-%20Basic%20Plotting.ipynb
# Just create a list of numpy arrays with random-walks as dataset
ts_length = 20000
n_lines = 100
np.random.seed(0)
dataset = [np.cumsum(np.random.randn(ts_length)) + i*100 for i in range(n_lines)]
# Plot exactly the same linechart every time
plot = figure(x_axis_type="linear")
for data in dataset:
    plot.line(x=range(ts_length), y=data)
show(plot)
This 'memory-leak' behavior continues, even if I execute the following cell every time before re-executing the (plot) cell above:
bokeh.io.curdoc().clear()
bokeh.io.state.State().reset()
bokeh.io.reset_output()
output_notebook() # has to be done again because output was reset
Are there any additional mechanisms in Bokeh that I might have overlooked which would allow me to clean up the plot and free the memory (in the browser/JS/client)?
Do I have to plot (or show the plot) somehow else within a Jupyter Notebook to avoid this issue? Or is this simply a bug of Bokeh/Jupyter?
Installed Versions on my System (Windows 10):
Python 3.6.6 : Anaconda custom (64-bit)
bokeh: 1.4.0
Chrome: 78.0.3904.108
jupyter:
core: 4.6.1
lab: 1.1.4
ipywidgets: 7.5.1
labextensions:
@bokeh/jupyter_bokeh: v1.1.1
@jupyter-widgets/jupyterlab-manager: v1.0.*
TL;DR: This is probably worth opening an issue (or two) for.
Memory usage
Just some notes about different aspects:
Clear/Reset functions
First to note, these:
bokeh.io.curdoc().clear()
bokeh.io.state.State().reset()
bokeh.io.reset_output()
These only affect data structures in the Python process (e.g. the Jupyter kernel); they never have any effect on the browser's memory usage or footprint.
One-time Memory footprint
Based on just the data, I'd expect somewhere in the neighborhood of ~64 MB:
20000 * 100 * 2 * 2 * 8 bytes = 64 MB
That's: 100 lines with 20k (x, y) points, which will also be converted to (sx, sy) screen coordinates, all in float64 (8-byte) arrays. However, Bokeh also constructs a spatial index over all the data to support things like hover tools. I expect you are blowing up this index with this data. It is probably worth making this feature configurable so that folks who do not need hit testing do not have to pay for it. A feature-request issue to discuss this would be appropriate.
Repeated Execution
There are supposed to be DOM event triggers that will clean up when a notebook cell is re-executed. Perhaps these have become broken? Maintaining integrations between three large hybrid Python/JS tools (including classic Notebook) with a tiny team is unfortunately an ongoing challenge. A bug-report issue would be appropriate so that this can be tracked and investigated.
Other options
What can you do right now?
More optimal usage
At least for the specific case you have here, with timeseries all of the same length, the above code is structured in a very suboptimal way. You should try putting everything in a single ColumnDataSource instead:
import numpy as np
from bokeh.io import show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

ts_length = 20000
n_lines = 100
np.random.seed(0)

# One shared source: a single x column plus one y column per line
source = ColumnDataSource(data=dict(x=np.arange(ts_length)))
for i in range(n_lines):
    source.data[f"y{i}"] = np.cumsum(np.random.randn(ts_length)) + i*100

plot = figure()
for i in range(n_lines):
    plot.line(x='x', y=f"y{i}", source=source)
show(plot)
By passing sequence literals to line, your code results in the creation of 99 unnecessary CDS objects (one per line call). It also does not re-use the x data, which sends 99 * 20k extra points to BokehJS unnecessarily. And by sending a plain list instead of a NumPy array, all of these get encoded with the less efficient (in time and space) default JSON encoding, instead of the efficient binary encoding that is available for NumPy arrays.
That said, this is not causing all the issues here, and is probably not a solution on its own. But I wanted to make sure to point it out.
Datashader
For this many points, you might consider using Datashader in conjunction with Bokeh. The HoloViews library also integrates Bokeh and Datashader automatically at a high level. By pre-rendering images on the Python side, Datashader is effectively a bandwidth-compression tool (among other things).
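A minimal sketch of the HoloViews + Datashader route, assuming both packages are installed (the data mirrors the random-walk example above):

import numpy as np
import holoviews as hv
from holoviews.operation.datashader import datashade
hv.extension('bokeh')

ts_length = 20000
n_lines = 100
np.random.seed(0)
x = np.arange(ts_length)

# Overlay of 100 curves, rasterized server-side into a single image
curves = hv.NdOverlay({
    i: hv.Curve((x, np.cumsum(np.random.randn(ts_length)) + i * 100))
    for i in range(n_lines)
})
datashade(curves)  # only the rendered image is shipped to the browser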
PNG export
Bokeh tilts its trade-offs towards affording various kinds of interactivity. But if you don't actually need that interactivity, then you are paying some extra costs. If that's your situation, you could consider generating static PNGs instead:
from bokeh.io.export import get_screenshot_as_png
p = get_screenshot_as_png(plot)
You'll need to install the additional optional dependencies listed in Exporting Plots, and if you are exporting many plots you might want to consider saving and reusing a webdriver explicitly across calls.
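A sketch of what reusing a webdriver might look like (assumes Selenium and a Firefox geckodriver are installed; `plots` stands in for your list of figures):

from selenium import webdriver
from bokeh.io.export import get_screenshot_as_png

driver = webdriver.Firefox()  # created once, reused for every export
try:
    images = [get_screenshot_as_png(p, driver=driver) for p in plots]
finally:
    driver.quit()  # clean up the browser process when done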

Efficient visualisation in Python

I have data (generated by an algorithm I wrote for it) for a random process which consists of coalescing and branching random walks on a finite space that I would like to visualize using python and probably something from matplotlib.
The data look like this:
A list of lists of the states of the process at the times when something changes (a walk moves to an empty spot, coalesces with another one, or a new particle is born), so something like this (let's say the process lives on {0,1,2,3,4}):
[[0,1,2,0,2],...,[1,0,2,2,0]], so at the beginning I start with the process having particles at positions 1, 2 and 4 (there are two different kinds of particles, so that "1" indicates the presence of the first type and "2" of the second, while "0" means nothing is there).
And I also have the list of events that alter the process, so a list of lists of the form
[place,time,type]
so I know what happens where and at what time (which corresponds to writing appropriate marks in the graphical representation, for example an arrow to the left if the event was that a particle moved to the left).
I wrote something like this:
import pylab as P
P.plot(-spacebound, 0, spacebound, maxtime)
while something in the process:              # pseudocode: loop while events remain
    current = listofevents.pop(0)
    for i that are nonempty at current time: # pseudocode
        P.arrow(...)                         # in a way corresponding to the data
P.show()
This works, but it is extremely slow, so that if I have a big process it takes an enormous amount of time to make this visualization (while generating the process data takes a few seconds at most, even for rather extreme parameters: a big space, a long time horizon and a high rate of particle births, which means a lot of events changing the process often).
I am pretty sure using arrows like this is pretty idiotic, but since I've only visualized things in R so far (I could of course simply export my data from python and visualize them in R but I want to avoid that) I am also very green at doing this in Python.
I tried some googling, found out about matplotlib and looked at some tutorials there and apart from the arrows I also tried just visualizing the states of the process (without the events) by looping plt.scatter() over all the states, but while this is slightly faster, it is still extremely slow and it also looks messy.
So how would I plot this in a sensible way? Even a link to something like "learn to do plotting in Python properly" is welcome as an answer. Thanks!
matplotlib is not for interactive plotting; it is used for generating article-quality plots. For interactive plots you could try Chaco or other libraries. The Chaco ideology is to create a plot and link it with the data: as you update the data, your chart gets updated automatically.
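As a side note on the static part of the problem (this is not from the answer above): the per-arrow overhead the question describes can usually be avoided in matplotlib by batching all arrows into a single quiver() call. A rough sketch with made-up stand-in data:

import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-in for the event list: (place, time, type) triples
rng = np.random.default_rng(0)
place = rng.integers(0, 100, size=5000).astype(float)
time = np.sort(rng.uniform(0, 1000, size=5000))
kind = rng.integers(0, 2, size=5000)

# One arrow per event, direction chosen from the event type
dx = np.where(kind == 0, -0.5, 0.5)
dy = np.zeros_like(dx)

fig, ax = plt.subplots()
ax.quiver(place, time, dx, dy, angles='xy', scale_units='xy', scale=1, width=0.002)
ax.set_xlabel("space")
ax.set_ylabel("time")
plt.show()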
