I am using Bokeh to plot many time-series (>100) with many points (~20,000) within a Jupyter Lab Notebook.
When executing the cell multiple times in Jupyter, the memory consumption of Chrome increases by over 400 MB per run. After several cell executions Chrome tends to crash, usually once several GB of RAM usage have accumulated, and the plotting gets slower with each execution.
A "Clear [All] Outputs" or "Restart Kernel and Clear All Outputs..." in Jupyter does not free any memory either. The issue also occurs in a classic Jupyter Notebook and with Firefox or Edge.
Minimal version of my .ipynb:
import numpy as np
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
import bokeh
output_notebook() # See e.g.: https://github.com/bokeh/bokeh-notebooks/blob/master/tutorial/01%20-%20Basic%20Plotting.ipynb
# Just create a list of numpy arrays with random-walks as dataset
ts_length = 20000
n_lines = 100
np.random.seed(0)
dataset = [np.cumsum(np.random.randn(ts_length)) + i*100 for i in range(n_lines)]
# Plot exactly the same linechart every time
plot = figure(x_axis_type="linear")
for data in dataset:
    plot.line(x=range(ts_length), y=data)
show(plot)
This 'memory-leak' behavior continues, even if I execute the following cell every time before re-executing the (plot) cell above:
bokeh.io.curdoc().clear()
bokeh.io.state.State().reset()
bokeh.io.reset_output()
output_notebook() # has to be done again because output was reset
Are there any additional mechanisms in Bokeh that I might have overlooked which would allow me to clean up the plot and free the memory (in the browser/JS client)?
Do I have to plot (or show) the plot some other way within a Jupyter Notebook to avoid this issue? Or is this simply a bug in Bokeh/Jupyter?
Installed Versions on my System (Windows 10):
Python 3.6.6 : Anaconda custom (64-bit)
bokeh: 1.4.0
Chrome: 78.0.3904.108
jupyter:
core: 4.6.1
lab: 1.1.4
ipywidgets: 7.5.1
labextensions:
@bokeh/jupyter_bokeh: v1.1.1
@jupyter-widgets/jupyterlab-manager: v1.0.*
TL;DR: This is probably worth opening an issue (or two) for.
Memory usage
Just some notes about different aspects:
Clear/Reset functions
First to note, these:
bokeh.io.curdoc().clear()
bokeh.io.state.State().reset()
bokeh.io.reset_output()
only affect data structures in the Python process (e.g. the Jupyter kernel). They will never have any effect on browser memory usage or footprint.
One-time Memory footprint
Based on just the data, I'd expect somewhere in the neighborhood of ~64 MB:
20000 * 100 * 2 * 2 * 8 bytes = 64 MB
That's 100 lines with 20k (x, y) points each, which will also be converted to (sx, sy) screen coordinates, all in float64 (8-byte) arrays. However, Bokeh also constructs a spatial index over all the data to support things like hover tools, and I expect you are blowing up this index with this data. It is probably worth making this feature configurable so that folks who do not need hit testing do not have to pay for it; a feature-request issue to discuss this would be appropriate.
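In code, that back-of-envelope estimate is simply:
n_points = 20000 * 100                  # 20k points per line, 100 lines
n_arrays = 2 * 2                        # (x, y) data plus (sx, sy) screen coords
bytes_total = n_points * n_arrays * 8   # float64 is 8 bytes per value
print(bytes_total / 1e6)                # 64.0 (MB)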
Repeated Execution
There are supposed to be DOM event triggers that clean up when a notebook cell is re-executed. Perhaps these have become broken? Maintaining integrations between three large hybrid Python/JS tools (including the classic Notebook) with a tiny team is unfortunately an ongoing challenge, so a bug-report issue would be appropriate so that this can be tracked and investigated.
Other options
What can you do, right now?
More optimal usage
At least for the specific case you have here, with timeseries all of the same length, the above code is structured in a very suboptimal way. You should try putting everything in a single ColumnDataSource instead:
import numpy as np
from bokeh.io import show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

ts_length = 20000
n_lines = 100
np.random.seed(0)
# One shared x column plus one y column per series, all in a single CDS
source = ColumnDataSource(data=dict(x=np.arange(ts_length)))
for i in range(n_lines):
    source.data[f"y{i}"] = np.cumsum(np.random.randn(ts_length)) + i*100
plot = figure()
for i in range(n_lines):
    plot.line(x='x', y=f"y{i}", source=source)
show(plot)
By passing sequence literals to line, your code results in the creation of 99 unnecessary CDS objects (one per line call). It also does not re-use the x data, which sends 99 * 20k extra points to BokehJS unnecessarily. And because plain lists are sent instead of NumPy arrays, everything gets encoded with the less efficient (in time and space) default JSON encoding, rather than the efficient binary encoding that is available for NumPy arrays.
That said, this is not causing all the issues here, and is probably not a solution on its own. But I wanted to make sure to point it out.
Datashader
For this many points, you might consider using Datashader in conjunction with Bokeh. The HoloViews library also integrates Bokeh and Datashader automatically at a high level. By pre-rendering images on the Python side, Datashader is effectively a bandwidth-compression tool (among other things).
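As a rough illustration, a minimal Datashader pipeline for line data like the above might look like this (a sketch, not a tuned implementation; NaN rows are a standard way to keep the concatenated lines from being joined together):
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

ts_length, n_lines = 20000, 100
np.random.seed(0)
# Concatenate all lines into one frame, separated by NaNs
xs = np.concatenate([np.append(np.arange(ts_length, dtype=float), np.nan)
                     for _ in range(n_lines)])
ys = np.concatenate([np.append(np.cumsum(np.random.randn(ts_length)) + i*100, np.nan)
                     for i in range(n_lines)])
df = pd.DataFrame({"x": xs, "y": ys})

canvas = ds.Canvas(plot_width=800, plot_height=400)
agg = canvas.line(df, "x", "y")   # rasterize all lines into one fixed-size grid
img = tf.shade(agg)               # shade the aggregate into an RGB image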
PNG export
Bokeh tilts its trade-offs towards affording various kinds of interactivity. But if you don't actually need that interactivity, then you are paying some extra cost for nothing. If that's your situation, you could consider generating static PNGs instead:
from bokeh.io.export import get_screenshot_as_png
p = get_screenshot_as_png(plot)
You'll need to install the additional optional dependencies listed in Exporting Plots, and if you are generating many plots you might want to consider creating a webdriver explicitly and reusing it for each call.
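For example, reusing a single webdriver across exports might look roughly like this (a sketch; it assumes selenium plus a driver binary such as geckodriver are installed, and `plots` stands in for your list of figures):
from bokeh.io.export import get_screenshot_as_png
from selenium import webdriver

driver = webdriver.Firefox()            # created once, reused for every export
try:
    for i, plot in enumerate(plots):    # `plots` is hypothetical
        image = get_screenshot_as_png(plot, driver=driver)
        image.save(f"plot_{i}.png")     # a PIL image is returned
finally:
    driver.quit()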
Related
Suppose I have a dataset with 100k rows (1000 different times, 100 different series, an observation for each, plus auxiliary information). I'd like to create something like the following:
(1) The first panel has time on the x axis and the average of the different series (with standard error) on the y axis.
(2) Based on the time slice (vertical line) hovered over in panel 1, display a (potentially downsampled) scatter plot of the auxiliary information versus the series values at that time slice.
I've looked into a few options for this: (1) matplotlib + ipywidgets doesn't seem to handle it unless you explicitly select points via a slider, and it doesn't translate well to HTML exporting. This is not ideal, but potentially workable. (2) Altair: this library is pretty sleek, but from my understanding I need to give it the whole dataset for it to handle the interactions, and it can't handle more than ~5k data points. Would that preclude my use case?
Any suggestions as to how to proceed? Is what I'm asking impossible in the current state of things?
You can work with datasets larger than 5k rows in Altair, as specified in this section of the docs.
One of the most convenient solutions, in my opinion, is to install altair_data_server and then add alt.data_transformers.enable('data_server') at the top of your notebooks and scripts. This server provides the data to Altair for as long as your Python process is running, so there is no need to include all the data as part of the created chart specification, which means the 5k-row error is avoided. The main drawback is that it won't work if you export to a standalone HTML file, because you rely on being in an environment where the server's Python process is running.
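Concretely, the setup amounts to just this (after pip install altair_data_server):
import altair as alt

alt.data_transformers.enable('data_server')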
I want to understand the clear difference between Datashader and other graphing libraries, e.g. Plotly/Matplotlib.
I understand that in order to plot millions/billions of data points we need Datashader, as other plotting libraries will hang the browser.
But what exactly is it that makes Datashader fast and keeps it from hanging the browser? How exactly is the plotting done so that it puts no load on the browser?
Also, is it that Datashader puts no load on the browser because, in the backend, it creates the plot from my dataframe and sends only the resulting image to the browser, and that is why it's fast?
Please explain; I am unable to understand the ins and outs clearly.
It may be helpful to first think of Datashader not in comparison to Matplotlib or Plotly, but in comparison to numpy.histogram2d. By default, Datashader will turn a long list of (x, y) points into a 2D histogram, just like histogram2d. Doing so only requires a simple increment of a grid cell for each new point, which is easily accelerated to machine-code speeds with Numba and is trivial to parallelize with Dask. The resulting array is then at most the size of your display screen, no matter how big your dataset is. So it's cheap to process in a separate program that adds axes, labels, etc., and it will never crash your browser.
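A quick NumPy illustration of that point: no matter how many input points there are, the output grid stays screen-sized.
import numpy as np

x, y = np.random.randn(2, 1_000_000)    # a million points
counts, xedges, yedges = np.histogram2d(x, y, bins=(800, 600))
print(counts.shape)                     # (800, 600): one bin per pixel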
By contrast, a plotting program like Plotly will need to convert each data point into a JSON or other serialized representation, pass that to JavaScript in the browser, have JavaScript draw a shape into a graphics buffer, and make each such shape support hover and other interactive features. Those interactive features are great, but it means Plotly is doing vastly more work per data point than Datashader is, and requires that the browser can hold all those data points. The only computation Datashader needs to do with your full data is to linearly scale the x and y locations of each point to fit the grid, then increment the grid value, which is much easier than what Plotly does.
The comparison to Matplotlib is slightly more complicated, because with an Agg backend, Matplotlib is also pre-rendering to a fixed-size graphics buffer before display (somewhat like Datashader). But Matplotlib was written before Numba and Dask (making it more difficult to speed up), it still has to draw shapes for each point (not just a simple increment), it can't fully parallelize the operations (because later points overwrite earlier ones in Matplotlib), and it provides anti-aliasing and other nice features not available in Datashader. So again Matplotlib is doing a lot more work than Datashader.
But if what you really want to do is see the faithful 2D distribution of billions of data points, Datashader is the way to go, because that's really all it is doing. :-)
From the datashader docs,
datashader is designed to "rasterize" or "aggregate" datasets into regular grids that can be viewed as images, making it simple and quick to see the properties and patterns of your data. Datashader can plot a billion points in a second or so on a 16GB laptop, and scales up easily to out-of-core or distributed processing for even larger datasets.
There aren't any tricks going on in any of these libraries: rendering a huge number of points simply takes a long time. What datashader does is shift the burden of visualization from rendering to computing. There's a very good reason you have to create a canvas before issuing plotting instructions in datashader: the first step in a datashader pipeline is to rasterize the dataset. In other words, it approximates the position of each piece of data and then uses aggregation functions to determine the intensity or color of each pixel. This allows datashader to plot enormous numbers of points, even more points than can be held in memory.
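The canvas-then-aggregate pipeline looks roughly like this (a sketch; `df` stands in for a pandas DataFrame with numeric 'x' and 'y' columns):
import datashader as ds
import datashader.transfer_functions as tf

canvas = ds.Canvas(plot_width=400, plot_height=400)   # fixed, screen-sized grid
agg = canvas.points(df, 'x', 'y')    # count how many points land in each pixel
img = tf.shade(agg)                  # map per-pixel counts to colors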
Matplotlib, on the other hand, renders every single point you instruct it to plot, making plotting large datasets time consuming or even impossible.
While there are a few things that are still being worked out, I am a big fan of the LightTable editor. The IPython Notebook is a remarkable delivery system, but managing a larger product is a bit easier in a more conventional development environment.
One thing that I have not yet figured out, however, is complicated plotting in LightTable. With no cell equivalent, I am not sure how to modify plot components because each command seems to be considered independently. In particular, I am not clear on how to work with subplots. I am unable to connect the actual plot to the subplot array. For example, consider the following:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2)
ax[0].hist(np.random.uniform(size=100))
ax[1].hist(np.random.normal(size=100))
When I create the subplots, they show up empty inline. The remaining code, however, does not cause them to update inline. In the Notebook, all the code is considered jointly in batch; LightTable interactivity is a bit closer to dealing with an interpreter in interactive mode (even though the script is obviously preserved). I have experimented with turning interactivity on and off via plt.ioff(), but to no avail. Any assistance would be greatly appreciated.
I have a strange issue. Using IPython Notebook, I created a quite extensive script using pandas and matplotlib to create a number of charts.
When my tinkering was finished, I copied (and cleaned) the code into a standalone python script (so that I can push it into the svn and my paper co-authors can create the charts as well).
For convenience, I import the standalone python script into the notebook again and create a number of charts:
import create_charts as cc
df = cc.read_csv_files("./data")
cc.chart_1(df, 'fig_chart1.pdf')
...
Strangely enough, the .pdf file I get using the above method is slightly different from the .pdf file I get when I run my standalone Python script from my Windows 7 terminal. The most notable difference is that in a particular chart the legend is located in the upper corner instead of the lower corner, but there are other small differences as well (bounding-box size, and the font seems slightly different).
What could be the cause of this. And how can I troubleshoot it?
(I already shut down my notebook and restarted it, to reimport my create_charts script and rule out any unsaved changes)
My terminal reports I am using Python 2.7.2, and pip freeze | grep ipython reports ipython 0.13.1
To complete Joe's answer, the inline backend (IPython/kernel/zmq/pylab/backend_inline.py) has some default matplotlib parameters:
# The typical default figure size is too large for inline use,
# so we shrink the figure size to 6x4, and tweak fonts to
# make that fit.
rc = Dict({'figure.figsize': (6.0, 4.0),
           # play nicely with white background in the Qt and notebook frontend
           'figure.facecolor': 'white',
           'figure.edgecolor': 'white',
           # 12pt labels get cutoff on 6x4 logplots, so use 10pt.
           'font.size': 10,
           # 72 dpi matches SVG/qtconsole
           # this only affects PNG export, as SVG has no dpi setting
           'savefig.dpi': 72,
           # 10pt still needs a little more room on the xlabel:
           'figure.subplot.bottom': .125
           }, config=True,
          help="""Subset of matplotlib rcParams that should be different for the
          inline backend.""")
As this is not obvious to everyone, you can set it in the config through c.InlineBackend.rc.
[Edit] more precise info about configurability.
IPython has the particularity that most of its classes have properties whose default values can be configured. Such classes are often referred to as Configurables (uppercase C), and the configurable properties can easily be recognized in the code, as they are declared like so before __init__:
property = A_Type( <default_value>, config=True , help="a string")
You can overwrite those properties in the IPython configuration files (which one depends on what you want to do) by writing:
c.ClassName.property_name = value
Here, as it is a dict, you could do:
#put your favorite matplotlib config here.
c.InlineBackend.rc = {'figure.facecolor': 'black'}
I guess an empty dict would let the inline backend use the matplotlib defaults.
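For example, a complete entry in a profile config file might look like this (a sketch; the exact file, e.g. ipython_notebook_config.py created via ipython profile create, depends on your IPython version and profile):
c = get_config()
c.InlineBackend.rc = {'figure.figsize': (8.0, 6.0), 'savefig.dpi': 100}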
Extending Matt's answer (lots of credit to him, but I think the answer can be less complex), this is how I eventually solved it:
(a) I looked up IPython's default matplotlib settings in C:\Python27\Lib\site-packages\IPython\zmq\pylab\backend_inline.py (see Matt's answer),
(b) and overwrote them with the values set in the terminal version (I used print mpl.rcParams['figure.figsize'] etc. to find them) by inserting the following code in my script:
import matplotlib as mpl
#To make sure we have always the same matplotlib settings
#(the ones in comments are the ipython notebook settings)
mpl.rcParams['figure.figsize']=(8.0,6.0) #(6.0,4.0)
mpl.rcParams['font.size']=12 #10
mpl.rcParams['savefig.dpi']=100 #72
mpl.rcParams['figure.subplot.bottom']=.1 #.125
The font-size issues are due to differences in dpi. I'd guess the slightly different size of the figure (in pixels) changes the "best" location for the legend as well.
The default dpi a figure is displayed at is 80, while savefig defaults to 100. This means that by default, matplotlib figures will look slightly different when saved compared to what's displayed on the screen.
I don't know for sure, but I'm guessing that ipython notebooks set the dpi to something other than 100 (most likely 80) and use that when saving figures.
Try doing savefig('filename.pdf', dpi=80) in your standalone script.
I'm writing a thousand plots to a PDF using matplotlib. I've already optimized the plotting code, i.e. reusing figures/axes/lines and just changing the y data.
The bulk of the remaining time is spent in save_figure.
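For concreteness, the reuse-and-save pattern I mean is roughly this (a sketch; `datasets` stands in for my real iterable of y arrays):
import numpy as np
import matplotlib
matplotlib.rcParams['pdf.compression'] = 0
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

x = np.arange(1000)
fig, ax = plt.subplots()
(line,) = ax.plot(x, np.zeros_like(x, dtype=float))
with PdfPages("plots.pdf") as pdf:
    for y in datasets:          # `datasets` is hypothetical
        line.set_ydata(y)
        ax.relim()
        ax.autoscale_view()
        pdf.savefig(fig)        # the bulk of the time is spent here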
R, in comparison, seems to output a plot to PDF about 2x faster. Plots with all-zero data seem to be even faster in R, while they're the same speed in Python.
I've set pdf.compression = 0, which makes a small improvement.
Tried rasterizing the data, it made no difference to plotting speed (although it used a ton of RAM).
Is there anything else I can try to speed up the matplotlib with PDF backend, or are there any alternative backends I should consider? I'm trying to beat R.
Thanks!
Have you tried pyreport from Gael Varoquaux? You call it on your script; it then collects all calls to pylab.show(), makes a PNG of each, and creates a PDF from them.
It uses LaTeX in the end, so you'll need that installed. But I expect this might be faster, as the PDF creation is delegated to LaTeX.