Fastest way to extract all bands from raster at once (python/gdal) - python

I have some hyperspectral imagery with a large number of bands, which I want to do analysis on. My script needs to be able to access all the bands at once.
Currently, I'm achieving this with the following:
bands = np.asarray([dataset.GetRasterBand(n+1) for n in range(dataset.RasterCount)])
This works fine, but it seems that this step is taking up a significant amount of time in my processing workflow, and I suspect there is a better way to do it. Also, I am under the impression that it is poor practice to use list comprehensions with numpy in this way (?).
Do numpy or gdal have any built-in methods that can make this faster?

In GDAL there is a distinction between the bands and the data in the bands. Assuming you want the latter, just use:
data = dataset.ReadAsArray()
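For example, a quick sketch of grabbing every band in one call (the file name is a placeholder):

from osgeo import gdal
import numpy as np

dataset = gdal.Open("hyperspectral.tif")   # placeholder file name
data = dataset.ReadAsArray()               # shape (bands, rows, cols) for a multi-band raster
print(data.shape, data.dtype)

# if you need per-pixel spectra, a (rows, cols, bands) cube is often handier:
cube = np.transpose(data, (1, 2, 0))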

Related

python xarray write to netcdf file very slow

import xarray as xr

list_of_ds = []
for fname in ids['fnames']:
    aq = xr.open_dataset(fname, chunks='auto', mask_and_scale=False)
    aq = aq[var_lists]
    aq = aq.isel(lat=slice(yoff, yoff+ysize), lon=slice(xoff, xoff+xsize))
    list_of_ds.append(aq)
    aq.close()
all_ds = xr.concat(list_of_ds, dim='time')
all_ds.to_netcdf('tmp.nc')
Hi all, I am using xarray to read netCDF files (around 1000) and save selected results to a temporary file, as shown above. However, the saving part runs very slowly. How can I speed this up?
I also tried loading the data directly, but it is still very slow.
I've also tried using open_mfdataset with parallel=True, and it's also slow:
aq = xr.open_mfdataset(
    sorted(ids_list),
    data_vars=var_lists,
    preprocess=add_time_dim,
    combine='by_coords',
    mask_and_scale=False,
    decode_cf=False,
    parallel=True,
)
aq.isel({'lon': irlon, 'lat': irlat}).to_netcdf('tmp.nc')
Unfortunately, concatenating ~1000 files in xarray will be slow; there's not a great way around that.
It's hard for us to offer specific advice without more detail about your data and setup. But here are some things I'd try:
Use xr.open_mfdataset. Your second code block looks great; dask will generally be faster and more efficient at managing tasks than you will be with a for loop.
Make sure your chunks are aligned with how you're slicing the data - you don't want to read in more than you have to. If you're reading netCDFs, you have flexibility in how the data is read into dask. Since you're selecting (it looks like) a small spatial region within each array, it may make sense to chunk the data explicitly so that you only read a small portion of each array, e.g. with chunks={"lat": 50, "lon": 50}. You'll want to balance a few things here: keep the chunk sizes manageable but not too small (too many tasks). As a general rule, shoot for chunks in the ~100-500 MB range, and try to keep the number of tasks under 1 million (or the number of chunks under ~10-100k across all your datasets).
Be explicit about your concatenation. The more "magic" the process feels, the more work xarray is doing to infer what you mean. Generally, combine='nested' performs better than 'by_coords', so if you're concatenating files that are structured logically along one or more dimensions, it may help to arrange the file list to match those dimensions and pass it with combine='nested'.
Skip the pre-processing. If you can, add new dimensions at concatenation time rather than as an ingestion step. This lets dask plan the computation more fully, rather than treating your preprocess function as a black box - and, worse, as a prerequisite to scheduling the final array construction because you're using combine='by_coords', where the coords are the result of an earlier dask operation. If you need to attach a time dim to each file, with one element per file, something like xr.open_mfdataset(files, concat_dim=pd.Index(pd.date_range("2020-01-01", freq="D", periods=1000), name="time"), combine="nested") works well in my experience (see the sketch below).
If this is all taking too long, you could try pre-processing the data. Using a compiled utility like nco or even just subsetting the data and grouping smaller subsets of the data into larger files using dask.distributed's client.map might help cut down on the complexity of the final dataset join.
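Putting those suggestions together, a minimal sketch might look like the following; the chunk sizes and the daily time axis are illustrative assumptions, while ids_list, var_lists and the lat/lon offsets come from the question's own code:

import pandas as pd
import xarray as xr

files = sorted(ids_list)   # one file per time step, in order

ds = xr.open_mfdataset(
    files,
    combine="nested",
    concat_dim=pd.Index(
        pd.date_range("2020-01-01", freq="D", periods=len(files)), name="time"
    ),
    data_vars=var_lists,
    chunks={"lat": 50, "lon": 50},   # illustrative; align with the region you slice
    mask_and_scale=False,
    decode_cf=False,
    parallel=True,
)

ds.isel(lat=slice(yoff, yoff + ysize), lon=slice(xoff, xoff + xsize)).to_netcdf("tmp.nc")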

How do I store a multidimensional array?

I am trying to write a code in python that will display the trajectory of projectile on a 2D graph. The initial velocity and launch angle will be varying. Instead of calculating it every time, I was wondering if there is any way to create a data file which will store all the values of the coordinates for each of those different combinations of speed and launch angle. That is a 4 dimensional database. Is this even possible?
This sounds like a pretty ideal case for using CSV as your file format. It's not so much a "4-dimensional" database as a "4-column" one:
initial_velocity, launch_angle, end_x, end_y
which you can write out and read in easily - using either the standard library's csv module, or pandas' read_csv()
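As a rough sketch, assuming flat ground and example velocities/angles (the range formula here is just the standard projectile result, not anything from the question):

import csv
import math

g = 9.81
rows = []
for v0 in (10, 20, 30):              # initial velocities (m/s), example values
    for angle in (30, 45, 60):       # launch angles (degrees), example values
        end_x = v0 ** 2 * math.sin(math.radians(2 * angle)) / g   # flat-ground range
        end_y = 0.0                  # projectile lands back at launch height
        rows.append((v0, angle, end_x, end_y))

with open("trajectories.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["initial_velocity", "launch_angle", "end_x", "end_y"])
    writer.writerows(rows)

# reading it back is one line with pandas: pd.read_csv("trajectories.csv")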
I think you should look at the HDF5 format, which was built for working with big data (it is used at NASA) and has proven bulletproof in very large scale applications.
From the website:
HDF5 lets you store huge amounts of numerical data, and easily
manipulate that data from NumPy. For example, you can slice into
multi-terabyte datasets stored on disk, as if they were real NumPy
arrays. Thousands of datasets can be stored in a single file,
categorized and tagged however you want.
I'd add that NumPy was developed to work with multidimensional arrays very efficiently. Good luck!
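For instance, a minimal h5py sketch of the 4-dimensional use case in the question; the dataset names, grid sizes, and compression setting are illustrative assumptions:

import h5py
import numpy as np

velocities = np.linspace(10, 50, 5)      # m/s
angles = np.linspace(15, 75, 5)          # degrees
n_steps = 100                            # points along each trajectory

# (velocity, angle, time step, x/y) -> a 4-D array
trajectories = np.zeros((len(velocities), len(angles), n_steps, 2))

with h5py.File("trajectories.h5", "w") as f:
    dset = f.create_dataset("trajectories", data=trajectories, compression="gzip")
    dset.attrs["axes"] = "velocity, angle, time_step, xy"
    f.create_dataset("velocities", data=velocities)
    f.create_dataset("angles", data=angles)

# later, slice straight from disk without loading everything:
with h5py.File("trajectories.h5", "r") as f:
    one_run = f["trajectories"][2, 3]    # trajectory for velocities[2], angles[3]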

Iterate and compute over multiple dask arrays

I have multiple dask arrays and would like to save them to a GIF or some movie format using imageio one frame at a time, but I think the problem is generic enough that the solution could help other people. I'm wondering if there is a way to compute the arrays in order and while computing one array and writing it to disk, start computing the next one on the remaining workers. If possible, it would be nice if the scheduler/graph could share tasks between the dask arrays if any.
The code would look something like this in my eyes:
import dask.array as da
writer = Writer(...)
for dask_arr in da.compute([dask_arr1, dask_arr2, dask_arr3]):
    writer.write_frame(dask_arr)
It looks like this is probably hackable by users with the distributed scheduler, but I'd like to use the threaded scheduler if possible. I'm also not sure if this is super useful in my exact real-world case, given memory usage or possibly having to write entire frames at a time instead of chunks. I also don't doubt that this could be handled somehow in a custom array-like object with da.store.
If you're able to write a function that takes in a slice of the array and then writes it appropriately you might be able to use a function like da.map_blocks.
This would become much more complex if you're trying to write into a single file where random access is harder to guarantee.
Perhaps you could use map_blocks to save each slice as a single image and then use some post-processing tool to stitch those images together.
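As a rough sketch of that idea - imageio, the frame size, the file naming, and the random stand-in array are all assumptions, not part of the original question:

import dask.array as da
import imageio
import numpy as np

# stand-in for your real arrays: one frame per chunk along the first axis
frames = da.random.random((10, 256, 256), chunks=(1, 256, 256))

def save_block(block, block_info=None):
    # block_info may be missing during dask's meta-inference pass; skip in that case
    if not block_info or block.size == 0:
        return block
    # block_info tells us which chunk this is, so each frame gets its own file
    idx = block_info[0]["chunk-location"][0]
    imageio.imwrite(f"frame_{idx:04d}.png", (block[0] * 255).astype(np.uint8))
    return block   # pass the data through unchanged

da.map_blocks(save_block, frames, dtype=frames.dtype).compute()
# ...then stitch frame_*.png into a GIF or movie with a separate post-processing tool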

How to save large Python numpy datasets?

I'm attempting to create an autonomous RC car and my Python program is supposed to query the live stream on a given interval and add it to a training dataset. The data I want to collect is the array of the current image from OpenCV and the current speed and angle of the car. I would then like it to be loaded into Keras for processing.
I found out that numpy.save() just saves one array to a file. What is the best/most efficient way of saving data for my needs?
As with anything regarding performance or efficiency, test it yourself. The problem with recommendations for the "best" of anything is that they might change from year to year.
First, you should determine whether this is even an issue you should be tackling. If you're not experiencing performance or storage issues, then don't bother optimizing until it becomes a problem. Whatever you do, don't waste your time on premature optimizations.
Next, assuming it actually is an issue, try out every method for saving to see which one yields the smallest results in the shortest amount of time. Maybe compression is the answer, but that might slow things down? Maybe pickling objects would be faster? Who knows until you've tried.
Finally, weigh the trade-offs and decide which method you can compromise on; you'll almost never have one silver-bullet solution. While you're at it, determine whether just throwing more CPU, RAM or disk space at the problem would solve it. Cloud computing affords you a lot of headroom in those areas.
The simplest way is np.savez_compressed(). This saves any number of arrays using the same format as np.save(), but encapsulated in a standard Zip file.
If you need to be able to add more arrays to an existing file, you can do that easily, because after all the NumPy ".npz" format is just a Zip file. So open or create a Zip file using zipfile, and then write arrays into it using np.save(). The APIs aren't perfectly matched for this, so you can first construct an in-memory io.BytesIO "file", write into it with np.save(), then use writestr() on the zipfile.
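A hedged sketch of both approaches; the array names, shapes, and file paths are made up:

import io
import zipfile
import numpy as np

frames = np.random.rand(10, 240, 320)      # example image stack
telemetry = np.array([[0.5, 12.0]] * 10)   # example speed/angle pairs

# simplest: save several arrays into one compressed .npz file
np.savez_compressed("training_data.npz", frames=frames, telemetry=telemetry)

# appending another array later: .npz is just a Zip archive of .npy members
new_frames = np.random.rand(5, 240, 320)
buf = io.BytesIO()
np.save(buf, new_frames)
with zipfile.ZipFile("training_data.npz", mode="a") as zf:
    zf.writestr("new_frames.npy", buf.getvalue())

loaded = np.load("training_data.npz")
print(sorted(loaded.files))  # ['frames', 'new_frames', 'telemetry']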

Coordinate container types in Python Aggdraw for fastest possible rendering?

Original Question:
I have a question about the Python Aggdraw module that I cannot find in the Aggdraw documentation. I'm using the ".polygon" command which renders a polygon on an image object and takes input coordinates as its argument.
My question is if anyone knows or has experience with what types of sequence containers the xy coordinates can be in (list, tuple, generator, itertools-generator, array, numpy-array, deque, etc), and most importantly which input type will help Aggdraw render the image in the fastest possible way?
The docs only mention that the polygon method takes: "A Python sequence (x, y, x, y, …)"
I'm thinking that Aggdraw is optimized for some sequence types more than others, and/or that some sequence types have to be converted first, and thus some types will be faster than others. So maybe someone knows these details about Aggdraw's inner workings, either in theory or from experience?
I have done some preliminary testing, and will do more soon, but I still want to know the theory behind why one option might be faster, because it might be that I not doing the tests properly or that there are some additional ways to optimize Aggdraw rendering that I didn't know about.
(Btw, this may seem like trivial optimization, but not when the goal is to be able to render tens of thousands of polygons quickly and to be able to zoom in and out of them. So for this question I don't want suggestions for other rendering modules (from my testing Aggdraw appears to be one of the fastest anyway). I also know that there are other optimization bottlenecks like coordinate-to-pixel transformations etc, but for now I'm only focusing on the final step of Aggdraw's internal rendering speed.)
Thanks a bunch, curious to see what knowledge and experience others out there have with Aggdraw.
A Winner? Some Preliminary Tests
I have now conducted some preliminary tests and reported the results in an Answer further down the page if you want the details. The main finding is that rounding float coordinates to integer pixel coordinates and storing them in arrays is the fastest way to make Aggdraw render an image or map, giving rendering speedups on the scale of 650% and speeds comparable to well-known and commonly used GIS software. What remains is to find fast ways to optimize coordinate transformations and shapefile loading, and these are daunting tasks indeed. For all the findings check out my Answer post further down the page.
I'm still interested to hear if you have done any tests of your own, or if you have other useful answers or comments. I'm still curious about the answers to the Bonus question if anyone knows.
Bonus question:
If you don't know the specific answer to this question, it might still help if you know which programming language the actual Aggdraw rendering is done in. I've read that the Aggdraw module is just a Python binding for the original C++ Anti-Grain Geometry library, but I'm not entirely sure what that actually means. Does it mean that the Aggdraw Python commands are simply a way of accessing and activating the C++ library "behind the scenes", so that the actual rendering is done in C++ and at C++ speeds? If so, then I would guess that C++ would have to convert the Python sequence to a C++ sequence, and the optimization would be to find out which Python sequence can be converted the fastest. Or is the Aggdraw module simply the original library rewritten in pure Python (and thus much slower than the C++ version)? If so, which Python types does it support, and which is faster for the type of rendering work it has to do?
A Winner? Some Preliminary Tests
Here are the results from my initial testings of which input types are faster for aggdraw rendering. One clue was to be found in the aggdraw docs where it said that aggdraw.polygon() only takes "sequences": officially defined as "str, unicode, list, tuple, bytearray, buffer, xrange" (http://docs.python.org/2/library/stdtypes.html). Luckily however I found that there are also additional input types that aggdraw rendering accepts. After some testing I came up with a list of the input container types that I could find that aggdraw (and maybe also PIL) rendering supports:
tuples
lists
arrays
Numpy arrays
deques
Unfortunately, aggdraw does not support, and raises errors for, coordinates contained in:
generators
itertool generators
sets
dictionaries
And then for the performance testing! The test polygons were a subset of 20 000 (multi)polygons from the Global Administrative Units Database of worldwide sub-national province boundaries, loaded into memory using the PyShp shapefile reader module (http://code.google.com/p/pyshp/). To ensure that the tests only measured aggdraw's internal rendering speed I made sure to start the timer only after the polygon coordinates were already transformed to aggdraw image pixel coordinates, AND after I had created a list of input arguments with the correct input type and aggdraw.Pen and .Brush objects. I then timed and ran the rendering using itertools.starmap with the preloaded coordinates and arguments:
import time, itertools

t = time.time()
iterat = itertools.starmap(draw.polygon, args)  # draw is the aggdraw.Draw() object
for runfunc in iterat:  # iterating through the itertools generator consumes and runs it
    pass
print time.time() - t
My findings confirm the traditional notion that tuples and arrays are the fastest Python sequence types to iterate over, and both ended up being the fastest here. Lists were about 50% slower, and so were numpy arrays (this was initially surprising given the speed reputation of Numpy arrays, but then I read that Numpy arrays are only fast when you use the internal Numpy functions on them, and that for normal Python iteration they are generally slower than other types). Deques, usually considered to be fast, turned out to be the slowest (almost 100%, i.e. 2x slower).
### Coordinates as FLOATS
### Pure rendering time (seconds) for 20 000 polygons from the GADM dataset
tuples    8.90130587328
arrays    9.03419164657
lists     13.424952522
numpy     13.1880489246
deque     16.8887938784
In other words, if you usually use lists for aggdraw coordinates you should know that you can gain a 50% performance improvement by instead putting them into a tuple or array. Not the most radical improvement but still useful and easy to implement.
But wait! I did find another way to squeeze quite a lot more performance out of the aggdraw module. I forget why I did it, but when I tried rounding the transformed floating point coordinates to the nearest pixel integer (i.e. int(round(eachcoordinate))) before rendering them, I got a 6.5x rendering speedup (650%) compared to the most common list container - a worthwhile and easy optimization. Surprisingly, the array container type turns out to be about 25% faster than tuples when the renderer doesn't have to worry about rounding numbers. This pre-rounding causes no loss of visual detail that I could see, because each floating point coordinate can only be assigned to one pixel anyway, which is probably why pre-converting/pre-rounding the coordinates before sending them to the aggdraw renderer speeds up the process: aggdraw no longer has to do it. A potential caveat is that removing the decimal information could change how aggdraw does its anti-aliasing, but to my eye the final map still looks equally anti-aliased and smooth. Finally, this rounding optimization must be weighed against the time it takes to round the numbers in Python, but from what I can see the cost of pre-rounding does not outweigh the rendering speedup. Fast ways to round and convert the coordinates remain to be explored.
### Coordinates as INTEGERS (rounded to pixels)
### Pure rendering time (seconds) for 20 000 polygons from the GADM dataset
arrays    1.40970077294
tuples    2.19892537074
lists     6.70839555276
numpy     6.47806400659
deque     7.57472232757
In conclusion then: arrays and tuples are the fastest container types to use when providing aggdraw (and possibly also PIL?) with drawing coordinates.
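For reference, here is a minimal sketch of that winning recipe; the coordinates are made up, and it assumes the usual aggdraw Draw/Pen/Brush/polygon/flush calls on a PIL image:

import array
import aggdraw
from PIL import Image

img = Image.new("RGB", (400, 400), "white")
draw = aggdraw.Draw(img)
pen = aggdraw.Pen("black", 1)
brush = aggdraw.Brush("lightblue")

# hypothetical transformed (float) pixel coordinates: x, y, x, y, ...
float_coords = [10.3, 12.7, 200.1, 15.9, 180.4, 300.2, 20.8, 250.6]

# round once, up front, so aggdraw does not have to deal with floats
int_coords = array.array("l", (int(round(c)) for c in float_coords))

draw.polygon(int_coords, pen, brush)
draw.flush()
img.save("polygon.png")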
Given the hefty rendering speeds that can be obtained when using the correct input type with aggdraw, it becomes particularly crucial and rewarding to find even the slightest optimizations for other aspects of the map rendering process, such as coordinate transformation routines (I am already exploring and finding for instance that Numpy is particularly fast for such purposes).
A more general finding from all of this is that Python can potentially be used for very fast map rendering applications, which further opens the possibilities for Python geospatial scripting; e.g. the entire GADM dataset of 200 000+ provinces can theoretically be rendered in about 1.5*10 = 15 seconds, not counting the coordinate-to-image-coordinate transformation, which is way faster than QGIS and even ArcGIS, which in my experience struggles with displaying the GADM dataset.
All results were obtained on an 8-core, two-year-old Windows 7 machine, using Python 2.6.5. Whether these results are also the most efficient when it comes to loading and/or processing the data is a question that has to be tested and answered in another post. It would be interesting to hear if someone else already has good insights on these aspects.
