I'm writing a thousand plots to a PDF using matplotlib. I've already optimized the plotting code, i.e. reusing figures/axes/lines and just changing the y data.
The bulk of the remaining time is spent in save_figure.
R, in comparison, seems to output a plot to PDF about 2x faster. Plots with all-zero data seem to be even faster in R, while in Python they take the same time as any other plot.
I've set pdf.compression = 0, which makes a small improvement.
I tried rasterizing the data, but it made no difference to plotting speed (although it used a ton of RAM).
Is there anything else I can try to speed up matplotlib with the PDF backend, or are there any alternative backends I should consider? I'm trying to beat R.
Thanks!
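For reference, here is a minimal sketch of the setup described above (artists created once, only the y data swapped, pdf.compression set to 0); the data and file names are just placeholders:

import numpy as np
import matplotlib
matplotlib.use('pdf')                       # non-interactive PDF backend
matplotlib.rcParams['pdf.compression'] = 0  # skip zlib compression of the streams
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

x = np.arange(100)
fig, ax = plt.subplots()
line, = ax.plot(x, np.zeros_like(x))        # create the artists once

with PdfPages('plots.pdf') as pdf:
    for _ in range(1000):
        y = np.random.rand(len(x))          # stand-in for the real y data
        line.set_ydata(y)                   # reuse the figure, only swap y
        ax.relim(); ax.autoscale_view()
        pdf.savefig(fig)                    # this is the step that dominates the time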
Have you tried pyreport from Gael Varoquaux? You call it on your script; it then collects all calls to pylab.show(), makes a PNG of each, and creates a PDF from them.
It uses LaTeX in the end, so you'll need that installed. But I expect this might be faster, as PDF creation is delegated to LaTeX.
I am generating a matplotlib figure which consists of 2 subplots: a 3D plot whose data is updated each time (a Poly3DCollection) and an image file that I add with plt.imshow().
I am able to save this figure with plt.savefig; however, savefig is very slow for my application, since I need to save ~4 million of these figures. I have already tried saving in different file formats with various parameters, saving the figure into memory and then reading it with PIL to save via PIL.save, etc. None of the solutions I could find got a single figure saved in less than a second.
I'd very much appreciate any suggestion, but I need a major increase in performance; minor changes will not matter much.
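For context, the in-memory route mentioned above looks roughly like this (render the Agg canvas to an RGBA buffer and hand it to PIL); this is only a sketch, not the actual figure code:

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from PIL import Image

fig, ax = plt.subplots()
ax.imshow(np.random.rand(64, 64))            # stand-in for the real image/3D content

fig.canvas.draw()                            # force the Agg render
buf = np.asarray(fig.canvas.buffer_rgba())   # grab the raw RGBA pixel buffer
Image.fromarray(buf).save('frame.png')       # hand the pixels straight to PIL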
I find the amount of whitespace around plots in both normal Python Matplotlib and Matlab quite annoying, specifically the left and right margins that make your plot look tiny when inserting the saved (landscape) figure into a standard (portrait) .doc or .pdf file.
Fortunately, Python Matplotlib has the tight_layout() functionality that takes care of this beautifully. Does Matlab have a similarly easy, one-solution-fits-all way of doing it?
I know there are ways to reduce the margins for plots in Matlab (such as this for subplots, or this and this for PDF output), but I can't seem to find a single all-encompassing "minimize the amount of whitespace" function like Python's tight_layout(). A minimal example of the Python side is shown below.
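For comparison, the Python side referred to above is just this (a minimal sketch):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2)
for ax in axes.flat:
    ax.plot(range(10))
fig.tight_layout()                      # trims the surrounding whitespace
fig.savefig('figure.pdf', bbox_inches='tight')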
You can achieve that with tiledlayout, introduced in Matlab R2019b. To reduce whitespace you can use the 'TileSpacing' and 'Padding' parameters, with values either 'compact' or 'none':
h = tiledlayout(2,2, 'TileSpacing', 'none', 'Padding', 'none');
nexttile
plot(1:4, rand(1,4))
nexttile
plot(1:8, rand(1,8))
nexttile
plot(1:16, rand(1,16))
nexttile
plot(1:32, rand(1,32))
TL;DR: I want to do something like
cache.append(fig.save_lines)
....
cache.load_into(fig)
I'm writing a (QML) front-end for a pyplot-like, matplotlib-based MCMC sample visualisation library, and have hit a small roadblock. I want to be able to produce and cache figures in the background, so that when the user moves some sliders the plots aren't re-generated (they are complex and expensive to re-compute) but are just brought in from the cache.
In order to do that, I need to be able to do the plotting (but not the rendering) offline and then simply change the contents of a canvas. Effectively I want to do something like cache the
line = plt.plot(x,y)
object, but for multiple subplots.
The library produces very complex plots, so I can't keep track of the Line2D objects and use those.
My attempt at a solution: render to a pixmap at the correct DPI and use that. Issues arise if I resize the canvas and don't want to re-scale the pixmaps. I've had situations where the wonderful SO community came up with much better solutions than what I had in mind, so if anyone has experience and/or ideas for how to get this behaviour, I'd be very much obliged!
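For reference, a rough sketch of the pixmap-style caching described above (render each figure once at a fixed DPI, keep the RGBA buffer, re-display it later); the cache structure and function names are purely illustrative:

import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

cache = {}

def render_and_cache(key, x, y, dpi=100):
    # The expensive plotting happens here, offline.
    fig, ax = plt.subplots(dpi=dpi)
    ax.plot(x, y)
    fig.canvas.draw()
    cache[key] = np.asarray(fig.canvas.buffer_rgba()).copy()
    plt.close(fig)

def show_cached(key, target_ax):
    # Cheap: just draw the stored pixels onto the visible canvas.
    target_ax.imshow(cache[key])
    target_ax.set_axis_off()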
I want to understand the clear difference between Datashader and other graphing libraries, e.g. Plotly/Matplotlib.
I understand that in order to plot millions/billions of data points we need Datashader, as other plotting libraries will hang the browser.
But what exactly makes Datashader fast, why doesn't it hang the browser, and how exactly is the plotting done so that it doesn't put any load on the browser?
Also, is it that Datashader doesn't put any load on the browser because, in the backend, it builds the graph from my dataframe and sends only the image to the browser, and that is why it's fast?
Please explain; I'm unable to understand the ins and outs clearly.
It may be helpful to first think of Datashader not in comparison to Matplotlib or Plotly, but in comparison to numpy.histogram2d. By default, Datashader will turn a long list of (x,y) points into a 2D histogram, just like histogram2d. Doing so only requires a simple increment of a grid cell for each new point, which is easily accelerated to machine-code speeds with Numba and is trivial to parallelize with Dask. The resulting array is then at most the size of your display screen, no matter how big your dataset is. So it's cheap to process in a separate program that adds axes, labels, etc., and it will never crash your browser.
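A tiny illustration of that histogram2d analogy (sizes are arbitrary):

import numpy as np

n = 10_000_000
x, y = np.random.randn(n), np.random.randn(n)

# Aggregate 10M points into a screen-sized grid: one count per pixel.
counts, xedges, yedges = np.histogram2d(x, y, bins=(800, 600))
print(counts.shape)   # (800, 600) regardless of how large n is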
By contrast, a plotting program like Plotly will need to convert each data point into a JSON or other serialized representation, pass that to JavaScript in the browser, have JavaScript draw a shape into a graphics buffer, and make each such shape support hover and other interactive features. Those interactive features are great, but it means Plotly is doing vastly more work per data point than Datashader is, and requires that the browser can hold all those data points. The only computation Datashader needs to do with your full data is to linearly scale the x and y locations of each point to fit the grid, then increment the grid value, which is much easier than what Plotly does.
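That per-point work is essentially a scale-and-increment loop, which is why it compiles so well with Numba; a rough sketch (not Datashader's actual code):

import numpy as np
from numba import njit

@njit
def aggregate(x, y, width, height, xmin, xmax, ymin, ymax):
    grid = np.zeros((height, width), dtype=np.uint32)
    for i in range(x.shape[0]):
        # Linearly scale each point into pixel coordinates...
        col = int((x[i] - xmin) / (xmax - xmin) * (width - 1))
        row = int((y[i] - ymin) / (ymax - ymin) * (height - 1))
        # ...and do nothing more than bump that cell's count.
        if 0 <= col < width and 0 <= row < height:
            grid[row, col] += 1
    return grid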
The comparison to Matplotlib is slightly more complicated, because with an Agg backend, Matplotlib is also pre-rendering to a fixed-size graphics buffer before display (somewhat like Datashader). But Matplotlib was written before Numba and Dask (making it more difficult to speed up), it still has to draw shapes for each point (not just a simple increment), it can't fully parallelize the operations (because later points overwrite earlier ones in Matplotlib), and it provides anti-aliasing and other nice features not available in Datashader. So again Matplotlib is doing a lot more work than Datashader.
But if what you really want to do is see the faithful 2D distribution of billions of data points, Datashader is the way to go, because that's really all it is doing. :-)
From the datashader docs,
datashader is designed to "rasterize" or "aggregate" datasets into regular grids that can be viewed as images, making it simple and quick to see the properties and patterns of your data. Datashader can plot a billion points in a second or so on a 16GB laptop, and scales up easily to out-of-core or distributed processing for even larger datasets.
There aren't any tricks going on in any of these libraries - rendering a huge number of points takes a long time. What datashader does is shift the burden of visualization from rendering to computing. There's a very good reason you have to create a canvas before issuing plotting instructions in datashader: the first step in a datashader pipeline is to rasterize the dataset. In other words, it approximates the position of each piece of data and then uses aggregation functions to determine the intensity or color of each pixel. This allows datashader to plot enormous numbers of points, even more points than can be held in memory.
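A minimal sketch of that pipeline (canvas, then aggregate, then shade), assuming a DataFrame with 'x' and 'y' columns:

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

df = pd.DataFrame({'x': np.random.randn(1_000_000),
                   'y': np.random.randn(1_000_000)})

canvas = ds.Canvas(plot_width=800, plot_height=600)   # the fixed-size grid
agg = canvas.points(df, 'x', 'y', agg=ds.count())     # rasterize/aggregate the points
img = tf.shade(agg)                                   # map counts to colors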
Matplotlib, on the other hand, renders every single point you instruct it to plot, making plotting large datasets time consuming or even impossible.
I am writing code to fit a Gaussian over a function, and if I don't plot the result (it is a datacube of ~60x60 spectra, so I am using a loop) the code runs really fast.
But when I tell the code to plot every graph it gets really slow, something like 2 graphs per second (when I don't plot it does about 40).
OK, I understand that plotting can slow things down a lot, but there is code in IDL that does the exact same thing, and it runs 8-10 plots per second.
Is there a way to improve this? Or is Python really slower than IDL?
Here is the plot code:
plt.plot(wavelengthset, data_datacube[minpixel:maxpixel+1, j, i], 'k-',
wavelengthset, gaussian(fit[0], wavelengthset), 'r-')
plt.draw()
plt.clf()
I recommend looking into removing plt.draw() and using blitting instead. If that's insufficient, please let me know a little more about your data and the purpose of the plots.
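As a rough sketch of what that looks like (create the two lines once, update their data each iteration, and blit only the changed region); this is not your exact loop, just the general pattern:

import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
data_line, = ax.plot([], [], 'k-')       # create the artists once
fit_line,  = ax.plot([], [], 'r-')
ax.set_xlim(0, 10); ax.set_ylim(0, 1)    # fixed limits so nothing autoscales

x = np.linspace(0, 10, 200)
fig.canvas.draw()                        # one initial full draw
background = fig.canvas.copy_from_bbox(ax.bbox)

for _ in range(100):
    data_line.set_data(x, np.random.rand(x.size))   # new spectrum
    fit_line.set_data(x, np.random.rand(x.size))    # new fitted Gaussian
    fig.canvas.restore_region(background)           # blit: restore, redraw artists only
    ax.draw_artist(data_line)
    ax.draw_artist(fit_line)
    fig.canvas.blit(ax.bbox)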
See this answer for more info: why is plotting with Matplotlib so slow?
As the answer at the above link mentions, matplotlib is designed for quality, customizable, interactive plots. Matplotlib may be slower than the data processing tools you're familiar with in IDL, but that's not to say that another, speed-conscious Python toolkit won't be just as fast/helpful.
Good luck!