I am working on software that processes time series. Sometimes these are very long (>10 million data points). Our software is very usable for shorter time series but gets unusably bogged down for these long ones. When looking at the RAM usage, it's almost 10x what all the time series data together occupy.
When doing some tests, it's clear that a lot of memory is used by matplotlib, which we are using to plot the time series. Using a separate piece of code that includes ONLY loading of the time series from a file and plotting, I can see that when going from loading only (with the plotting command commented out) to plotting, the memory usage goes up almost 3-fold. This is true whether or not the whole time range is visible within the given axis limits, although passing only a small slice of the series (numpy array) to matplotlib DOES proportionally reduce the excess memory.
Given that we expect users to scroll through the time series and only view short chunks at a time, it would be much better to have matplotlib only fetch the visible portion of the numpy array, grabbing new elements as the user scrolls or zooms. In fact, it would likely be preferable to replace the X and Y arrays with generators that re-compute the values on the fly as the plot needs them, possibly caching points just outside the limits to make scrolling faster. The X values in particular are simple linspaces that would likely be best not stored at all, given that computing them should be as fast as a lookup into a huge array, never mind storing them once in the outer software AND also in matplotlib.
I know we could try to "fake" this by capturing user events sent to the plot and re-sending new X and Y arrays all the time, but this feels clunky, prone to all sorts of corner cases where things get out of sync, and like trying to take over from the plotting library things it "wants" to do itself. At some point it would become easier just to write our own simple plotting routine in C/C++ that does the computations and draws lines using a graphics API. In fact, the nearest closed-source competitor to our software seems to be doing just that, given that it's super snappy and uses an amount of RAM that is a mere fraction of the size of a time series. But, we want our software to be extensible by users without a deep understanding of the internals of our code.
Is there a standard way of handling this, or is this just too far from the "spirit" of matplotlib to be worth using it? And in that case, is there an alternative Python plotting library with exactly this use case in mind? I would imagine that data scientists working with terabytes of data would want a way to graphically explore it without the plotting code eating terabytes of memory itself...
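For concreteness, here is a minimal sketch of the event-driven approach described above (the "re-send new arrays on scroll/zoom" idea, not a built-in matplotlib feature; the data, sample spacing, and decimation rule are all illustrative):

import numpy as np
import matplotlib.pyplot as plt

N = 10000000            # length of the full series
dt = 0.001              # sample spacing; x is an implicit linspace, never stored
y = np.random.randn(N).astype(np.float32)  # stand-in for the real series

fig, ax = plt.subplots()
line, = ax.plot([], [], lw=0.5)
MAX_POINTS = 4000       # never hand matplotlib more points than this

def on_xlim_changed(ax):
    lo, hi = ax.get_xlim()
    i0 = max(0, int(lo / dt))
    i1 = min(N, int(hi / dt) + 1)
    step = max(1, (i1 - i0) // MAX_POINTS)  # crude decimation
    idx = np.arange(i0, i1, step)
    line.set_data(idx * dt, y[idx])         # only the visible slice
    ax.figure.canvas.draw_idle()

ax.callbacks.connect('xlim_changed', on_xlim_changed)
ax.set_xlim(0, 10)      # triggers the first draw via the callback
ax.set_ylim(-5, 5)
plt.show()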
I have been reading the docs and searching online a bit, but I am still confused about the difference between persist and scatter.
I have been working with datasets about half a TB in size, and have been using scatter to generate futures and then send them to workers. This has been working fine. But recently I started scaling up, and with datasets a few TB in size this method stops working. On the dashboard, I see that workers are not being triggered, and I am fairly certain this is a scheduler issue.
I saw this video by Matt Rocklin. When he deals with a large dataset, the first thing he does is persist it to (distributed) memory. I will give this a try with my large datasets, but meanwhile I am wondering: what is the difference between persist and scatter? For which situations is each best suited? Do I still need to scatter after I persist?
Thanks.
First, persist. Imagine you have table A, which is used to make table B, and then you use B to generate two tables C and D. You have two chains of lineage, A->B->C and A->B->D. Because of Dask's lazy evaluation, the A->B step can be computed twice: once to generate C and again for D. Calling persist on B computes it once and keeps the result in distributed memory, so both C and D can reuse it.
Scatter is also called broadcast in other distributed frameworks. Basically, you have a sizeable local object that you want to send to the workers ahead of time to minimize transfer, think of something like a machine learning model. You can scatter it ahead of time so it's available on all workers.
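To make the distinction concrete, a minimal sketch (assuming a running dask.distributed cluster; the scheduler address, file pattern, column name, and computations are all illustrative):

from dask.distributed import Client
import dask.dataframe as dd

client = Client("tcp://scheduler:8786")   # hypothetical scheduler address

a = dd.read_parquet("data/*.parquet")     # table A (lazy)
b = a[a.value > 0]                        # A -> B (still lazy)

# persist: compute B once and keep the result in distributed memory,
# so C and D below reuse it instead of recomputing A -> B twice
b = b.persist()
c = b.value.mean().compute()              # B -> C
d = b.value.std().compute()               # B -> D

# scatter: push a local object to the workers ahead of time;
# broadcast=True replicates it on every worker
model = {"weights": [0.1, 0.2]}           # stand-in for a real ML model
model_future = client.scatter(model, broadcast=True)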
I have a Python program that gets data from a measurement instrument and plots the data using matplotlib (I am on Debian Linux). The plotting is done in a separate thread, which updates the data plots at fixed time intervals. At every update, the existing lines are removed from the plot, and then the lines are re-created with the new data (yes, there might be more efficient ways, but it's not possible to just add the new data to existing lines in my situation).
After a while, the program will take up huge amounts of memory (gigabytes). This does not happen if I modify the code to skip the plotting/matplotlib part, so the use of humongous amounts of memory is clearly related to matplotlib. If I put some pressure on the system by running another application that consumes a lot of memory, my Python program will at some point start to release the excess memory used up by matplotlib (ending up at about 50 MB or so). Releasing the memory does not seem to have any negative effect on the operation of my program. This tells me that the large chunk of memory used by matplotlib is not vital (if not outright useless) to my application.
How can I keep matplotlib from using so much memory?
Not sure if this will help, but have you tried using LineCollections? They should be more efficient for plotting a huge number of lines in one go. Aside from that, some lines of code showing what you're doing might help identify the problem.
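Not the asker's actual code, but a minimal sketch of the LineCollection approach (the data and update logic are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection

fig, ax = plt.subplots()
coll = None

def update(segments):
    # replace one collection artist instead of removing and re-creating
    # many individual Line2D objects
    global coll
    if coll is not None:
        coll.remove()
    coll = LineCollection(segments, linewidths=1)
    ax.add_collection(coll)
    ax.autoscale_view()
    fig.canvas.draw_idle()

# segments: one (N, 2) array of (x, y) vertices per line
segments = [np.column_stack([np.arange(100), np.random.rand(100) + i])
            for i in range(50)]
update(segments)
plt.show()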
I have about a million rows of data with lat and lon attached, and more to come. Even now, just reading the data from the SQLite file (I read it with pandas, then create a point for each row) takes a lot of time.
Now, I need to do a spatial join over those points to attach a zip code to each one, and I really want to optimise this process.
So I wonder: is there any relatively easy way to parallelize those computations?
I am assuming you have already tried GeoPandas and are still running into difficulties?
You can improve this by further hashing your coords data, similar to how Google hashes its search data. Some databases already provide support for these types of operations (e.g. MongoDB). Imagine if you took the first (left) digit of your coords and put each set of corresponding data into a separate sqlite file. Each digit then acts as a hash pointing to the correct file to look in, so your lookup time improves by roughly a factor of 20 (one bucket per signed leading digit, range(-9, 10)), assuming the hash lookup itself takes minimal time in comparison.
As it turned out, the most convenient solution in my case is to use the pandas.read_sql function with a specific chunksize parameter. In this case, it returns a generator of data chunks, which can be efficiently fed to mp.Pool().map() along with the job.
In this (my) case the job consists of 1) reading the geoboundaries, 2) spatially joining the chunk, and 3) writing the chunk to the database.
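A minimal sketch of that pipeline (table, column, and file names are illustrative; note the geopandas join-predicate keyword is predicate in recent versions and op in older ones; here the write happens serially in the parent to avoid concurrent writers):

import multiprocessing as mp
import sqlite3
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

zips = gpd.read_file("zipcodes.shp")      # hypothetical zip-code boundaries

def job(chunk):
    # 1) build points from lat/lon, 2) spatial join against the zip polygons
    pts = gpd.GeoDataFrame(
        chunk,
        geometry=[Point(xy) for xy in zip(chunk.lon, chunk.lat)],
        crs=zips.crs,
    )
    joined = gpd.sjoin(pts, zips, how="left", predicate="within")
    return pd.DataFrame(joined.drop(columns="geometry"))

if __name__ == "__main__":
    conn = sqlite3.connect("points.db")
    out = sqlite3.connect("joined.db")
    chunks = pd.read_sql("SELECT lat, lon FROM points", conn, chunksize=100000)
    with mp.Pool() as pool:
        for joined in pool.imap(job, chunks):
            # 3) write each joined chunk to the output database
            joined.to_sql("points_with_zip", out, if_exists="append", index=False)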
This method is completely dependent on your spatial scale, but one way you might parallelize your join would be to subdivide your polygons into subpolygons and then offload the work to separate worker processes on separate cores, as sketched below. This geopandas r-tree tutorial demonstrates that technique, subdividing a large polygon into many small ones and intersecting each with a large set of points. But again, this only works if your spatial scale is appropriate: i.e., a few polygons and a lot of points (such as a few zip code polygons and millions of points in and around them).
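A sketch of the subdivision step (illustrative, not the tutorial's code; poly is a single large shapely polygon):

import numpy as np
from shapely.geometry import box

def subdivide(poly, nx=4, ny=4):
    # split poly's bounding box into an nx-by-ny grid and clip each cell
    minx, miny, maxx, maxy = poly.bounds
    xs = np.linspace(minx, maxx, nx + 1)
    ys = np.linspace(miny, maxy, ny + 1)
    cells = [box(xs[i], ys[j], xs[i + 1], ys[j + 1]).intersection(poly)
             for i in range(nx) for j in range(ny)]
    return [c for c in cells if not c.is_empty]

# each sub-polygon can then be intersected with the point set in its own
# worker process (e.g. via multiprocessing.Pool), and the partial results
# concatenated afterwards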
I'm writing a program that creates vario-function plots for a fixed region of a digital elevation model that has been converted to an array. I calculate the variance (difference in elevation) and lag (distance) between point pairs within the window constraints. Every array position is compared with every other array position. For each pair, the lag and variance values are appended to separate lists. Once all pairs have been compared, these lists are then used for data binning, averaging and eventually plotting.
The program runs fine for smaller window sizes (say 60x60 px). For windows up to about 120x120 px or so, which would give two lists of 207,360,000 entries each, I can still get the program to run, albeit slowly. Beyond this, I run into "MemoryError" reports - e.g. for a 240x240 px region, I would have 3,317,760,000 entries per list.
At the beginning of the program, I create two empty lists:
variance = []
lag = []
Then within a for loop where I calculate my lags and variances, I append the values to the different lists:
variance.append(var_val)
lag.append(lag_val)
I've had a look over the stackoverflow pages and have seen a similar issue discussed here. That solution would potentially improve the program's runtime; however, the solution offered only goes up to 100 million entries and therefore doesn't help me with the larger regions (as with the 240x240 px example). I've also considered using numpy arrays to store the values, but I don't think this will stave off the memory issues.
Any suggestions for ways to work with lists of the proportions I have described for the larger window sizes would be much appreciated.
I'm new to python so please forgive any ignorance.
The main bulk of the code can be seen here
Use the array module of Python. It offers some list-like types that are more memory efficient (but cannot be used to store random objects, unlike regular lists). For example, you can have arrays containing regular floats ("doubles" in C terms), or even single-precision floats (four bytes each instead of eight, at the cost of reduced precision). An array of 3 billion such single-precision floats would fit into 12 GB of memory.
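A minimal sketch using the array module for the two lists from the question (the sample values are illustrative):

from array import array

# 'f' = single-precision C float: 4 bytes per entry, versus a full Python
# float object per entry in a regular list
variance = array('f')
lag = array('f')

for var_val, lag_val in [(1.5, 0.3), (2.0, 0.7)]:  # stand-in for the pair loop
    variance.append(var_val)
    lag.append(lag_val)

print(len(variance), variance.itemsize)  # itemsize confirms 4 bytes per entry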
You could look into PyTables, a library wrapping the HDF5 C library that can be used with numpy and pandas.
Essentially PyTables will store your data on disk and transparently load it into memory as needed.
Alternatively if you want to stick to pure python, you could use a sqlite3 database to store and manipulate your data - the docs say the size limit for a sqlite database is 140TB, which should be enough for your data.
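A minimal sketch of the PyTables route (file and node names are illustrative): create extendable on-disk arrays once, then append chunk by chunk instead of holding everything in RAM.

import numpy as np
import tables

h5 = tables.open_file("variogram.h5", mode="w")
variance = h5.create_earray(h5.root, "variance", tables.Float32Atom(), shape=(0,))
lag = h5.create_earray(h5.root, "lag", tables.Float32Atom(), shape=(0,))

for _ in range(10):  # stand-in for the pair loop, appending in chunks
    variance.append(np.random.rand(1000000).astype(np.float32))
    lag.append(np.random.rand(1000000).astype(np.float32))

print(variance.nrows)  # the data lives on disk and is read back on demand
h5.close()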
You could try the heapq module (import heapq). Note, however, that heapq is a priority-queue algorithm implemented on top of an ordinary Python list; Python objects are heap-allocated in any case, so it changes how the data is ordered, not where it is stored or how much memory it uses.
Original Question:
I have a question about the Python Aggdraw module that I cannot find in the Aggdraw documentation. I'm using the ".polygon" command which renders a polygon on an image object and takes input coordinates as its argument.
My question is if anyone knows or has experience with what types of sequence containers the xy coordinates can be in (list, tuple, generator, itertools-generator, array, numpy-array, deque, etc), and most importantly which input type will help Aggdraw render the image in the fastest possible way?
The docs only mention that the polygon method takes: "A Python sequence (x, y, x, y, …)"
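For reference, a minimal example of the call in question (colors and coordinates are illustrative):

from PIL import Image
import aggdraw

img = Image.new("RGB", (400, 400), "white")
draw = aggdraw.Draw(img)
pen = aggdraw.Pen("black", 1)
brush = aggdraw.Brush("red")

# the flat (x, y, x, y, ...) sequence whose container type is at issue
xy = (50, 50, 350, 50, 200, 350)
draw.polygon(xy, pen, brush)
draw.flush()  # commit the drawing back to the PIL image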
I'm thinking that Aggdraw is optimized for some sequence types more than others, and/or that some sequence types have to be converted first, and thus some types will be faster than others. So maybe someone knows these details about Aggdraw's inner workings, either in theory or from experience?
I have done some preliminary testing and will do more soon, but I still want to know the theory behind why one option might be faster, because it might be that I'm not doing the tests properly or that there are additional ways to optimize Aggdraw rendering that I don't know about.
(Btw, this may seem like trivial optimization, but not when the goal is to be able to render tens of thousands of polygons quickly and to be able to zoom in and out of them. So for this question I don't want suggestions for other rendering modules (from my testing Aggdraw appears to be one of the fastest anyway). I also know that there are other optimization bottlenecks, like coordinate-to-pixel transformations, but for now I'm only focusing on the final step of Aggdraw's internal rendering speed.)
Thanks a bunch, curious to see what knowledge and experience others out there have with Aggdraw.
A Winner? Some Preliminary Tests
I have now conducted some preliminary tests and reported the results in an Answer further down the page, if you want the details. The main finding is that rounding float coordinates to integer pixel coordinates and storing them in arrays is the fastest way to make Aggdraw render an image or map, yielding rendering speedups on the scale of 650%, at speeds comparable to well-known and commonly used GIS software. What remains is to find fast ways to optimize coordinate transformations and shapefile loading, and those are daunting tasks indeed. For all the findings, check out my Answer post further down the page.
I'm still interested to hear if you have done any tests of your own, or if you have other useful answers or comments. I'm still curious about the answers to the Bonus question if anyone knows.
Bonus question:
If you don't know the specific answer to this question, it might still help if you know which programming language the actual Aggdraw rendering is done in. I've read that the Aggdraw module is just a Python binding for the original C++ Anti-Grain Geometry library, but I'm not entirely sure what that actually means. Does it mean that the Aggdraw Python commands are simply a way of accessing and activating the C++ library "behind the scenes", so that the actual rendering is done in C++ and at C++ speeds? If so, then I would guess that C++ would have to convert the Python sequence to a C++ sequence, and the optimization would be to find out which Python sequence can be converted the fastest. Or is the Aggdraw module simply the original library rewritten in pure Python (and thus much slower than the C++ version)? If so, which Python types does it support, and which is faster for the type of rendering work it has to do?
A Winner? Some Preliminary Tests
Here are the results from my initial tests of which input types are faster for aggdraw rendering. One clue was to be found in the aggdraw docs, which say that aggdraw.polygon() only takes "sequences": officially defined as "str, unicode, list, tuple, bytearray, buffer, xrange" (http://docs.python.org/2/library/stdtypes.html). Luckily, however, I found that there are additional input types that aggdraw rendering accepts. After some testing I came up with a list of the input container types that aggdraw (and maybe also PIL) rendering supports:
tuples
lists
arrays
Numpy arrays
deques
Unfortunately, aggdraw does not support, and raises errors for, coordinates contained in:
generators
itertools generators
sets
dictionaries
And then for the performance testing! The test polygons were a subset of 20 000 (multi)polygons from the Global Administrative Units Database of worldwide sub-national province boundaries, loaded into memory using the PyShp shapefile reader module (http://code.google.com/p/pyshp/). To ensure that the tests only measured aggdraw's internal rendering speed, I made sure to start the timer only after the polygon coordinates had already been transformed to aggdraw image pixel coordinates, and after I had created a list of input arguments with the correct input type and the aggdraw.Pen and .Brush objects. I then timed and ran the rendering using itertools.starmap with the preloaded coordinates and arguments:
import time
import itertools

t = time.time()
# draw is the aggdraw.Draw() object; args is the preloaded list of
# (coordinates, pen, brush) argument tuples
iterat = itertools.starmap(draw.polygon, args)
for runfunc in iterat:  # iterating through the itertools generator consumes and runs it
    pass
print time.time() - t
My findings confirm the traditional notion that tuples and arrays are among the fastest Python sequence types; both ended up fastest here. Lists were about 50% slower, as were numpy arrays (this was initially surprising given numpy's speed reputation, but numpy arrays are fast mainly when you use numpy's own vectorized functions on them; for plain Python iteration they are generally slower than other types). Deques, usually considered fast, turned out to be the slowest (almost 100%, i.e. 2x, slower).
### Coordinates as FLOATS
### Pure rendering time (seconds) for 20 000 polygons from the GADM dataset
tuples    8.90130587328
arrays    9.03419164657
lists    13.424952522
numpy    13.1880489246
deque    16.8887938784
In other words, if you usually use lists for aggdraw coordinates you should know that you can gain a 50% performance improvement by instead putting them into a tuple or array. Not the most radical improvement but still useful and easy to implement.
But wait! I did find another way to squeeze even more performance out of the aggdraw module, quite a lot, actually. I forget why I did it, but when I tried rounding the transformed floating-point coordinates to the nearest pixel integer as integer type (i.e. int(round(eachcoordinate))) before rendering them, I got a 6.5x rendering speedup (650%) compared to the most common list container, a well-worthwhile and also easy optimization. Surprisingly, the array container type turns out to be about 25% faster than tuples when the renderer doesn't have to worry about rounding numbers. This pre-rounding causes no loss of visual detail that I could see, because each floating-point coordinate can only be assigned to one pixel anyway, and that is probably also why pre-converting/pre-rounding the coordinates before sending them off to the aggdraw renderer speeds up the process: aggdraw no longer has to do it itself. A potential caveat is that discarding the decimal information could change how aggdraw does its anti-aliasing, but in my opinion the final map still looks equally anti-aliased and smooth. Finally, this rounding optimization must be weighed against the time it takes to round the numbers in Python, but from what I can see, the time spent pre-rounding does not outweigh the benefit of the rendering speedup. Fast ways to round and convert the coordinates should be explored further.
### Coordinates as INTEGERS (rounded to pixels)
### Pure rendering time (seconds) for 20 000 polygons from the GADM dataset
arrays    1.40970077294
tuples    2.19892537074
lists     6.70839555276
numpy     6.47806400659
deque     7.57472232757
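For clarity, the pre-rounding step amounts to something like this (the float coordinates are illustrative):

from array import array

float_coords = [50.3, 49.8, 350.6, 50.1, 200.2, 349.7]  # transformed pixel floats

# round each coordinate to its pixel and store in a C-int array, the
# fastest container in the tests above
xy = array('i', [int(round(c)) for c in float_coords])

# draw.polygon(xy, pen, brush)  # then render as before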
In conclusion then: arrays and tuples are the fastest container types to use when providing aggdraw (and possibly also PIL?) with drawing coordinates.
Given the hefty rendering speeds that can be obtained when using the correct input type with aggdraw, it becomes particularly crucial and rewarding to find even the slightest optimizations for other aspects of the map rendering process, such as coordinate transformation routines (I am already exploring and finding for instance that Numpy is particularly fast for such purposes).
A more general finding from all of this is that Python can potentially be used for very fast map-rendering applications, which further opens the possibilities for Python geospatial scripting; e.g. the entire GADM dataset of 200,000+ provinces could theoretically be rendered in about 1.5 * 10 = 15 seconds, not counting coordinate-to-image transformations, which is way faster than QGIS and even ArcGIS, which in my experience struggles to display the GADM dataset.
All results were obtained on an 8-core, two-year-old Windows 7 machine, using Python 2.6.5. Whether these results are also the most efficient when it comes to loading and/or processing the data is a question that has to be tested and answered in another post. It would be interesting to hear if someone else already has good insights on these aspects.