Matplotlib performance problem: savefig too slow, any alternative?

I am generating a matplotlib figure which consists of 2 subplots: a 3D plot (a Poly3DCollection) whose data is updated each time, and an image file that I plug in with plt.imshow().
I am able to save this figure with plt.savefig; however, savefig is very slow for my application, since I need to save roughly 4 million of these figures. I have already tried saving in different file formats with various parameters, rendering into memory and then reading the buffer with PIL to save via PIL's save method, etc. None of the solutions I could find brought the saving of a single figure below one second.
I'd very much appreciate a suggestion here, but I need a major increase in performance; minor changes will not matter much.
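For concreteness, here is a minimal sketch of the "render into memory and hand the buffer to PIL" route mentioned above (the figure contents and file name are placeholders); this is the kind of approach already tried, not a guaranteed speed-up:

import numpy as np
from PIL import Image
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no GUI overhead
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.imshow(np.random.rand(64, 64))  # placeholder for the real subplots

fig.canvas.draw()                            # rasterize once with Agg
rgba = np.asarray(fig.canvas.buffer_rgba())  # grab the pixel buffer without touching disk
Image.fromarray(rgba).save("frame_000001.png")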

How to buffer pyplot plots

TL;DR: I want to do something like
cache.append(fig.save_lines)
....
cache.load_into(fig)
I'm writing a (QML) front-end for a pyplot-like, matplotlib-based MCMC sample visualisation library, and have hit a small roadblock. I want to be able to produce and cache figures in the background, so that when the user moves some sliders, the plots aren't re-generated (they are complex and expensive to re-compute) but just brought in from the cache.
In order to do that I need to be able to do the plotting (but not the rendering) offline and then simply change the contents of a canvas. Effectively I want to do something like cache the
line = plt.plot(x,y)
object, but for multiple subplots.
The library produces very complex plots, so I can't keep track of the Line2D objects and use those.
My attempt at a solution: render to a pixmap with the correct DPI and use that. Issues arise if I resize the canvas and don't want to re-scale the pixmaps. I've had situations where the wonderful SO community came up with much better solutions than what I had in mind, so if anyone has experience and/or ideas for how to get this behaviour, I'd be very much obliged!
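One possible approach, offered only as a sketch (it is not what the question tried): matplotlib Figure objects are picklable, so an expensive figure can be built once in the background, cached as bytes, and later restored without re-plotting the data.

import pickle
import matplotlib
matplotlib.use("Agg")  # build figures offline, no GUI
import matplotlib.pyplot as plt

# Build the expensive figure once, in the background.
fig, axes = plt.subplots(2, 2)
for ax in axes.flat:
    ax.plot([0, 1, 2], [0, 1, 4])

cache = pickle.dumps(fig)  # store in an in-memory cache, e.g. keyed by slider values
plt.close(fig)

# Later, when the user asks for that view again:
restored = pickle.loads(cache)                # a full Figure, no re-plotting of the data
restored.savefig("cached_view.png", dpi=150)  # or attach it to a GUI canvas

Whether this is fast enough depends on how heavy the figures are; pickling skips the data-to-artist work but not the final rendering.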

difference between datashader and other plotting libraries

I want to understand the clear difference between Datashader and other graphing libraries, e.g. Plotly/Matplotlib etc.
I understand that in order to plot millions/billions of data points, we need Datashader, as other plotting libraries will hang the browser.
But what exactly is the reason that makes Datashader fast and keeps it from hanging the browser, and how exactly is the plotting done so that it doesn't put any load on the browser?
Also, is it that Datashader doesn't put any load on the browser because, in the backend, it creates the graph from my dataframe and sends only the image to the browser, which is why it's fast?
Please explain; I am unable to understand the ins and outs clearly.
It may be helpful to first think of Datashader not in comparison to Matplotlib or Plotly, but in comparison to numpy.histogram2d. By default, Datashader will turn a long list of (x,y) points into a 2D histogram, just like histogram2d. Doing so only requires a simple increment of a grid cell for each new point, which is easily accelerated to machine-code speeds with Numba and is trivial to parallelize with Dask. The resulting array is then at most the size of your display screen, no matter how big your dataset is. So it's cheap to process in a separate program that adds axes, labels, etc., and it will never crash your browser.
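A minimal illustration of that analogy in plain NumPy (this is not Datashader's actual implementation, just the same idea): ten million points are reduced to a screen-sized grid, and only that grid is ever drawn.

import numpy as np
import matplotlib.pyplot as plt

# Ten million points, but the aggregate is only 800 x 600 numbers.
x = np.random.standard_normal(10_000_000)
y = np.random.standard_normal(10_000_000)
counts, xedges, yedges = np.histogram2d(x, y, bins=(800, 600))

# Only the small counts array reaches the display layer.
plt.imshow(np.log1p(counts.T), origin="lower", interpolation="none")
plt.show()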
By contrast, a plotting program like Plotly will need to convert each data point into a JSON or other serialized representation, pass that to JavaScript in the browser, have JavaScript draw a shape into a graphics buffer, and make each such shape support hover and other interactive features. Those interactive features are great, but it means Plotly is doing vastly more work per data point than Datashader is, and requires that the browser can hold all those data points. The only computation Datashader needs to do with your full data is to linearly scale the x and y locations of each point to fit the grid, then increment the grid value, which is much easier than what Plotly does.
The comparison to Matplotlib is slightly more complicated, because with an Agg backend, Matplotlib is also pre-rendering to a fixed-size graphics buffer before display (somewhat like Datashader). But Matplotlib was written before Numba and Dask (making it more difficult to speed up), it still has to draw shapes for each point (not just a simple increment), it can't fully parallelize the operations (because later points overwrite earlier ones in Matplotlib), and it provides anti-aliasing and other nice features not available in Datashader. So again Matplotlib is doing a lot more work than Datashader.
But if what you really want to do is see the faithful 2D distribution of billions of data points, Datashader is the way to go, because that's really all it is doing. :-)
From the datashader docs,
datashader is designed to "rasterize" or "aggregate" datasets into regular grids that can be viewed as images, making it simple and quick to see the properties and patterns of your data. Datashader can plot a billion points in a second or so on a 16GB laptop, and scales up easily to out-of-core or distributed processing for even larger datasets.
There aren't any tricks going on in any of these libraries - rendering a huge number of points takes a long time. What datashader does is to shift the burden of visualization from rendering to computing. There's a very good reason you have to create a canvas before issuing plotting instructions in datashader. The first step in a datashader pipeline is to rasterize a dataset; in other words, it approximates the position of each piece of data and then uses aggregation functions to determine the intensity or color of each pixel. This allows datashader to plot enormous numbers of points, even more points than can be held in memory.
Matplotlib, on the other hand, renders every single point you instruct it to plot, making plotting large datasets time consuming or even impossible.
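To make that pipeline concrete, here is a minimal Datashader sketch (column names, canvas size and the random data are placeholders): the heavy step is aggregation into a fixed-size grid, and only the resulting small image is handed to the browser or plotting front-end.

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# A large point dataset; only the aggregated grid ever gets rendered.
df = pd.DataFrame({
    "x": np.random.standard_normal(10_000_000),
    "y": np.random.standard_normal(10_000_000),
})

canvas = ds.Canvas(plot_width=800, plot_height=600)  # the fixed-size grid
agg = canvas.points(df, "x", "y", ds.count())        # rasterize: count points per pixel
img = tf.shade(agg, how="log")                       # map counts to colors
img.to_pil().save("points.png")                      # small image, regardless of data size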

Size of figure too big - How to compress without quality loss in matplotlib

I have a figure with subplots, where each subplot features a contour plot.
As it turns out, if I save it to .pdf, the size of the figure is 50 MB. Is there a way to compress the figure without having as big a quality loss as with .png?
Thanks in advance.
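One common approach, sketched here with placeholder data rather than the asker's figure: keep axes, labels and text as vector graphics, but rasterize only the dense contour artists, which are usually what makes the PDF huge.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 500)
X, Y = np.meshgrid(x, x)
Z = np.sin(X) * np.cos(Y)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax in axes.flat:
    ax.contourf(X, Y, Z, levels=100, zorder=0)  # draw the contours below zorder 1
    ax.set_rasterization_zorder(1)              # rasterize everything below zorder 1

# The contours become a compressed bitmap at this dpi; axes and text stay vector.
fig.savefig("contours.pdf", dpi=200)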

matplotlib shows different figure than saves from the show() window

I plot rather complex data with matplotlib's imshow(), so I prefer to visually inspect whether it is all right before saving. So I usually call plt.show(), see if it is fine, and then manually save it with the GUI dialog in the show() window. Everything was always fine, but recently I started getting something weird: when I save the figure I get a very wrong picture, even though it looks perfectly fine in matplotlib's interactive window.
If I zoom to a specific location and then save what I see, I get a fine figure.
So, this is the correct one (a small area of the picture, saved with zooming first):
And this one is a zoom into approximately the same area of the figure, after I saved it all:
For some reason the pixels in the second one are much bigger! That is very bad for me; as you can see, it loses a lot of detail.
Unfortunately, my code is quite complicated and I wasn't able to reproduce the problem with randomly generated data. It appeared after I started to plot the two triangles of the picture separately: I read my two huge data files with np.loadtxt(), take np.triu(data1) and np.tril(data2), mask zeroes, NAs, -inf and +inf, and then plot them on the same axes with plt.imshow(data, interpolation='none', origin='lower', extent=extent). I do lots of other things to make it nicer, but I guess that doesn't matter, because it all worked like a charm before.
Please, let me know, if you need to know anything else specific from my code, that could be relevant to this problem.
When you save a figure in png/jpg you are forced to rasterize it, convert it to a finite number of pixels. If you want to keep the full resolution, you have a few options:
Use a very high dpi parameter, like 900. Saving the plot will be slow, and many image viewers will take some time to open it, but the information is there and you can always crop it.
Save the image data, the exact numbers you used to make the plot. Whenever you need to inspect it, load it in Matplotlib in interactive mode, navigate to your desired corner, and save it (a short sketch of this appears after the SVG example below).
Use SVG: it is a vector graphics format, so you are not limited to pixels.
Here is how to use SVG:
import matplotlib
matplotlib.use('SVG')            # select the SVG backend before importing pyplot
import matplotlib.pyplot as plt

# Generate the image; `image` is the 2D data array you are plotting
plt.imshow(image, interpolation='none')
plt.savefig('output_image.svg')  # written as a scalable vector file
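For the second option above (saving the raw data and re-inspecting it later), a minimal sketch, with image_data standing in for whatever array was passed to imshow:

import numpy as np
import matplotlib.pyplot as plt

# At plot time, keep the exact array you passed to imshow:
np.save('image_data.npy', image_data)

# Later, in an interactive session, reload and inspect at full resolution:
image_data = np.load('image_data.npy')
plt.imshow(image_data, interpolation='none', origin='lower')
plt.show()  # zoom to the region of interest, then save from the toolbar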
Edit:
To save a true SVG you need to use the SVG backend from the beginning, which is, unfortunately, incompatible with interactive mode. Some backends, like GTKCairo, seem to allow both, but the result is still rasterized, not a true SVG.
This may be a bug in matplotlib; at least, to the best of my knowledge, it is not documented.

Matplotlib PDF backend slow?

I'm writing a thousand plots to a PDF using matplotlib. I've already optimized the plotting code, i.e. reusing figures/axes/lines and just changing the y data.
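For reference, a minimal sketch of that reuse pattern, assuming the multipage PDF is written with PdfPages and with random data standing in for the real series:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

x = np.arange(100)
fig, ax = plt.subplots()
(line,) = ax.plot(x, np.zeros_like(x))  # create the artists once
ax.set_ylim(-3, 3)                      # fixed limits, no autoscaling per page

with PdfPages('plots.pdf') as pdf:
    for _ in range(1000):
        y = np.random.standard_normal(100)  # placeholder for the real data
        line.set_ydata(y)                   # only the y data changes
        pdf.savefig(fig)                    # this call dominates the runtime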
The bulk of the remaining time is spent in save_figure.
R, in comparison, seems to output a plot to PDF about 2x faster. Plots with all-zero data seem to be even faster in R, while they're the same speed in Python.
I've set pdf.compression = 0, which makes a small improvement.
I tried rasterizing the data; it made no difference to plotting speed (although it used a ton of RAM).
Is there anything else I can try to speed up matplotlib with the PDF backend, or are there alternative backends I should consider? I'm trying to beat R.
Thanks!
Have you tried pyreport from Gael Varoquaux? You call it on your script; it then collects all calls to pylab.show(), makes a PNG of each, and then creates a PDF from them.
It uses LaTeX in the end, so you'll need that installed. But I expect this might be faster, as PDF creation is delegated to LaTeX.
