I want to understand the clear difference between Datashader and other graphing libraries, e.g. Plotly or Matplotlib.
I understand that in order to plot millions/billions of data points, we need Datashader, as other plotting libraries will hang up the browser.
But what exactly is it that makes Datashader fast and keeps it from hanging up the browser, and how exactly is the plotting done so that it doesn't put any load on the browser?
Also, is it that Datashader doesn't put any load on the browser because, in the backend, it creates a graph from my dataframe and sends only the image to the browser, which is why it's fast?
Please explain; I'm unable to understand the ins and outs clearly.
It may be helpful to first think of Datashader not in comparison to Matplotlib or Plotly, but in comparison to numpy.histogram2d. By default, Datashader will turn a long list of (x,y) points into a 2D histogram, just like histogram2d. Doing so only requires a simple increment of a grid cell for each new point, which is easily accelerated to machine-code speeds with Numba and is trivial to parallelize with Dask. The resulting array is then at most the size of your display screen, no matter how big your dataset is. So it's cheap to process in a separate program that adds axes, labels, etc., and it will never crash your browser.
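As a rough illustration of that analogy (the data here is made up), even a million points collapse into a grid no bigger than the display:

import numpy as np

# A million random (x, y) points (made-up example data)
x = np.random.standard_normal(1_000_000)
y = np.random.standard_normal(1_000_000)

# Reduce them to a display-sized 2D grid of counts -- the same kind of
# aggregation Datashader performs by default.
counts, xedges, yedges = np.histogram2d(x, y, bins=(800, 600))
print(counts.shape)  # (800, 600), fixed size regardless of the input length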
By contrast, a plotting program like Plotly will need to convert each data point into a JSON or other serialized representation, pass that to JavaScript in the browser, have JavaScript draw a shape into a graphics buffer, and make each such shape support hover and other interactive features. Those interactive features are great, but it means Plotly is doing vastly more work per data point than Datashader is, and requires that the browser can hold all those data points. The only computation Datashader needs to do with your full data is to linearly scale the x and y locations of each point to fit the grid, then increment the grid value, which is much easier than what Plotly does.
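To make "scale and increment" concrete, here is a minimal illustrative sketch of that inner loop (this is not Datashader's actual code):

import numpy as np
from numba import njit

@njit
def aggregate(xs, ys, width, height, x0, x1, y0, y1):
    # One counter per output pixel, fixed size no matter how many points come in
    grid = np.zeros((height, width), dtype=np.uint32)
    for i in range(xs.shape[0]):
        # Linearly scale the point into grid coordinates...
        col = int((xs[i] - x0) / (x1 - x0) * (width - 1))
        row = int((ys[i] - y0) / (y1 - y0) * (height - 1))
        # ...and do a single increment; that's all the per-point work.
        if 0 <= col < width and 0 <= row < height:
            grid[row, col] += 1
    return grid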
The comparison to Matplotlib is slightly more complicated, because with an Agg backend, Matplotlib is also pre-rendering to a fixed-size graphics buffer before display (somewhat like Datashader). But Matplotlib was written before Numba and Dask (making it more difficult to speed up), it still has to draw shapes for each point (not just a simple increment), it can't fully parallelize the operations (because later points overwrite earlier ones in Matplotlib), and it provides anti-aliasing and other nice features not available in Datashader. So again Matplotlib is doing a lot more work than Datashader.
But if what you really want to do is see the faithful 2D distribution of billions of data points, Datashader is the way to go, because that's really all it is doing. :-)
From the datashader docs,
datashader is designed to "rasterize" or "aggregate" datasets into regular grids that can be viewed as images, making it simple and quick to see the properties and patterns of your data. Datashader can plot a billion points in a second or so on a 16GB laptop, and scales up easily to out-of-core or distributed processing for even larger datasets.
There aren't any tricks going on in any of these libraries - rendering a huge number of points takes a long time. What datashader does is to shift the burden of visualization from rendering to computing. There's a very good reason you have to create a canvas before issuing plotting instructions in datashader. The first step in a datashader pipeline is to rasterize the dataset; in other words, it approximates the position of each piece of data and then uses aggregation functions to determine the intensity or color of each pixel. This allows datashader to plot enormous numbers of points - even more points than can be held in memory.
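A typical pipeline looks roughly like this (the dataframe below is just a made-up example):

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# Made-up dataframe with ten million points
df = pd.DataFrame({
    'x': np.random.standard_normal(10_000_000),
    'y': np.random.standard_normal(10_000_000),
})

# 1. Define the fixed-size grid (the canvas) to rasterize onto
cvs = ds.Canvas(plot_width=800, plot_height=600)

# 2. Aggregate: count how many points fall into each grid cell
agg = cvs.points(df, 'x', 'y', ds.count())

# 3. Shade: turn the small grid of counts into an image
img = tf.shade(agg, how='log')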
Matplotlib, on the other hand, renders every single point you instruct it to plot, making plotting large datasets time consuming or even impossible.
Related
Suppose I have a dataset with 100k rows (1000 different times, 100 different series, an observation for each, and auxiliary information). I'd like to create something like the following:
(1) first panel of plot has time on x axis, and average of the different series (and standard error) on y axis.
(2) based on the time slice (vertical line) we hover over in panel 1, display a (potentially down-sampled) scatter plot of auxiliary information versus the series value at that time slice.
I've looked into a few options for this: (1) matplotlib + ipywidgets doesn't seem to handle it unless you explicitly select points via a slider. This also doesn't translate well to HTML exporting. It's not ideal, but is potentially workable. (2) altair - this library is pretty sleek, but from my understanding, I need to give it the whole dataset for it to handle the interactions, and it also can't handle more than ~5k data points. That would preclude my use case, correct?
Any suggestions as to how to proceed? Is what I'm asking impossible in the current state of things?
You can work with datasets larger than 5k rows in Altair, as specified in this section of the docs.
One of the most convenient solutions, in my opinion, is to install altair_data_server and then add alt.data_transformers.enable('data_server') at the top of your notebooks and scripts. This server will provide the data to Altair as long as your Python process is running, so there is no need to include all the data as part of the created chart specification, which means the 5k error will be avoided. The main drawback is that it won't work if you export to a standalone HTML file, because you rely on being in an environment where the server's Python process is running.
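A minimal sketch of that setup (the dataframe here is just a placeholder):

import numpy as np
import pandas as pd
import altair as alt

# Serve rows from the local Python process instead of embedding them in the
# chart spec, so the usual 5000-row limit no longer applies.
alt.data_transformers.enable('data_server')

# Placeholder dataframe with 100k rows
df = pd.DataFrame({
    'time': np.tile(np.arange(1000), 100),
    'value': np.random.standard_normal(100_000),
})

alt.Chart(df).mark_line().encode(
    x='time:Q',
    y='mean(value):Q',
)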
TL;DR: I want to do something like
cache.append(fig.save_lines)
....
cache.load_into(fig)
I'm writing a (QML) front-end for a pyplot-like, matplotlib-based MCMC sample visualisation library, and have hit a small roadblock. I want to be able to produce and cache figures in the background, so that when the user moves some sliders, the plots aren't re-generated (they are complex and expensive to re-compute) but just brought in from the cache.
In order to do that I need to be able to do the plotting (but not the rendering) offline and then simply change the contents of a canvas. Effectively I want to do something like cache the
line = plt.plot(x,y)
object, but for multiple subplots.
The library produces very complex plots so I can't keep track of the line2D objects and use those.
My attempt at a solution: render to a pixmap with the correct DPI and use that. Issues arise if I resize the canvas and don't want to re-scale the pixmaps. I've had situations where the wonderful SO community came up with much better solutions than what I had in mind, so if anyone has experience and/or ideas for how to get this behaviour, I'd be very much obliged!
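Concretely, the pixmap approach I tried looks roughly like this (simplified sketch):

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

# Render the finished figure once at a fixed DPI and keep the raw RGBA buffer
fig, ax = plt.subplots(dpi=100)
ax.plot([0, 1, 2], [0, 1, 4])
fig.canvas.draw()
w, h = fig.canvas.get_width_height()
pixmap = np.frombuffer(fig.canvas.buffer_rgba(), dtype=np.uint8).reshape(h, w, 4)
plt.close(fig)

# The drawback: the pixmap is baked at one size/DPI, so resizing the target
# canvas means either re-rendering or re-scaling the image.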
I usually use adjustText to produce labels (annotations) that do not overlap in pyplot.
https://github.com/Phlya/adjustText
But at 10,000 data points on a scatter plot, it becomes very slow.
I was wondering if there's anything faster, that perhaps takes advantage of multi-core and/or GPU resources?
I just created a package for this purpose, textalloc, which is fast due to a more brute-force implementation with broadcasting in numpy.
It generates a plot (in the scattertext style) with ~2000 points in 2.9 s on a laptop. Here I use the base setting of not plotting texts that end up overlapping after running the algorithm, but if you want to plot 10,000 text labels that do not overlap, I assume you could use an even smaller text size and the draw_all setting.
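The core idea of the broadcasting approach, as an illustrative sketch (not the package's actual code), is to compare every text bounding box against every other box in one vectorized pass instead of a Python loop:

import numpy as np

def overlapping_pairs(boxes):
    # boxes: (N, 4) array of [xmin, ymin, xmax, ymax]
    xmin, ymin, xmax, ymax = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    # Broadcast each coordinate against every other box: (N, 1) vs (1, N) -> (N, N)
    sep_x = (xmax[:, None] < xmin[None, :]) | (xmin[:, None] > xmax[None, :])
    sep_y = (ymax[:, None] < ymin[None, :]) | (ymin[:, None] > ymax[None, :])
    overlap = ~(sep_x | sep_y)
    np.fill_diagonal(overlap, False)  # ignore a box overlapping itself
    return np.argwhere(overlap)

boxes = np.array([[0, 0, 1, 1], [0.5, 0.5, 1.5, 1.5], [2, 2, 3, 3]])
print(overlapping_pairs(boxes))  # [[0 1] [1 0]]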
I have been reading a lot about Bokeh for visualisations of large datasets. I plan on plotting a heatmap with over 25 million points.
I read the page on speeding up with WebGL, and it mentions that any plots with glyphs are accelerated.
Does the Heatmap plot use glyphs? Will there be any benefits in turning on WebGL for heatmap plots?
Pretty much everything that Bokeh draws is a glyph of some type. However, the text on the page you link actually states that WebGL "allows rendering some glyph types on graphics hardware." Currently (as of Bokeh 0.12.3), WebGL support only extends to scatter-type markers (e.g. circle, x, etc.) and to lines. But HeatMap is implemented using the Rect glyph, so I would not expect WebGL to offer any improvement at the present time.
But I would add: It's good to thoroughly investigate any actual performance hotspots. Bokeh is really two libraries: a Python library and a JavaScript library. If you are seeing performance issues, are you sure it's on the JS side? For example, you have not said what your data sizes are. Are you sure it's not actually the binning/aggregation (that happens on the Python side) that is your issue?
Finally, if you have data sizes that are in the millions-to-billions of points range, you should probably be looking at the separate bokeh/datashader project.
I'm writing a thousand plots to a PDF using matplotlib. I've already optimized the plotting code, i.e. reusing figures/axes/lines and just changing the y data.
The bulk of the remaining time is spent in save_figure.
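For reference, the kind of loop I'm describing, in simplified form (dummy data):

import numpy as np
import matplotlib
matplotlib.use('pdf')
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

x = np.arange(1000)
fig, ax = plt.subplots()
(line,) = ax.plot(x, np.zeros_like(x))
ax.set_ylim(-3, 3)

with PdfPages('plots.pdf') as pdf:
    for _ in range(1000):
        line.set_ydata(np.random.standard_normal(x.shape[0]))  # only the y data changes
        pdf.savefig(fig)  # this save step is where nearly all the time goes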
R, in comparison, seems to output a plot to PDF about 2x faster. Plots with all-zero data seem to be even faster in R, while they're the same speed in Python.
I've set pdf.compression = 0, which makes a small improvement.
I tried rasterizing the data; it made no difference to plotting speed (although it used a ton of RAM).
Is there anything else I can try to speed up the matplotlib with PDF backend, or are there any alternative backends I should consider? I'm trying to beat R.
Thanks!
Have you tried pyreport from Gael Varoquaux? You call it on your script; it then collects all calls to pylab.show(), makes a PNG of each, and then creates a PDF from them.
It uses LaTeX in the end, so you'll need that installed. But I expect this might be faster, as PDF creation is delegated to LaTeX.