LoadPages in Shady very slow on Linux Mint - python

I am trying to display a sequence of frames using Shady, but I'm running into difficulties. I'm looking at 25 frames covering a 1080x1080-pixel area. The stimulus is grayscale, and I am doing luminance linearization off-line, so I only need to save a uint8 value for each pixel. The full sequence is thus about 29 MB. I define the stimulus as a 3-D numpy array [1080x1080x25], save it to disk using np.save(), and then load it using np.load().
try:
    yy = np.load(fname)
except IOError:
    print(fname + ' does not exist')
    return
This step takes about 20ms. It is my understanding that Shady does not deal with uint8 luminance values, but rather with floats between 0 and 1. I thus convert it into a float array and divide by 255.
yy = yy.astype(np.float64) / 255.0
This second step takes approx 260ms, which is already not great (ideally I need to get the stimulus loaded and ready to be presented in 400ms).
I now create a list of 25 numpy arrays to use as my pages parameter in the Stimulus class:
pages = []
for j in range(yy.shape[2]):
    pages.append(np.squeeze(yy[:, :, j]))
This is virtually instantaneous. But at my next step I run into serious timing problems.
if self.sequence is None:
    self.sequence = self.wind.Stimulus(pages, 'sequence', multipage=True, anchor=Shady.LOCATION.UPPER_LEFT, position=[deltax, deltay], visible=False)
else:
    self.sequence.LoadPages(pages, visible=False)
Here I either create a Stimulus object, or update its pages attribute if this is not the first sequence I load. Either way, this step takes about 10s, which is about 100 times what I can tolerate in my application.
Is there a way to significantly speed this up? What am I doing wrong? I have a pretty mediocre graphics card on this machine (Radeon Pro WX 4100), and if that is the problem I could upgrade it, but I do not want to go through the hassle if that is not going to fix it.

Based on jez's comments, his tests, and my tests, I conclude that on some configurations (in my case Linux Mint 19 with Cinnamon and a mediocre AMD video card) loading floats can be much slower than loading uint8 values. With uint8 the behavior appears to be consistent across configurations, so go with uint8 if you can. Since this will (I assume) disable much of what Shady can do in terms of gamma correction and dynamic-range enhancement, it may be limiting for some.

Shady can accept uint8 pixel values as-is so you can cut out your code for scaling and type-conversion. Of course, you lose out on Shady's ability to do dynamic range enhancement that way, but it seems like you have your own offline solutions for that kind of thing. If you're going to use uint8 stimuli exclusively, you can save a bit of GPU processing effort by turning off dithering (set the .ditheringDenominator of both the World and the Stimulus to 0 or a negative value).
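A minimal sketch of that uint8 route, assuming the [rows x columns x pages] layout from the question (the file name and World setup are placeholders, not part of the original code):
import numpy as np
import Shady

world = Shady.World()
world.ditheringDenominator = 0                 # 0 or negative switches dithering off

yy = np.load('sequence.npy')                   # keep the values as uint8, no scaling
pages = [np.squeeze(yy[:, :, j]) for j in range(yy.shape[2])]

stim = world.Stimulus(pages, 'sequence', multipage=True, visible=False)
stim.ditheringDenominator = 0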
It seems like the ridiculous 10-to-15-second delays come from inside the compiled binary "accelerator" component, when transferring the raw texture data from RAM to the graphics card. The problem is apparently (a) specific to transferring floating-point texture data rather than integer data, and (b) specific to the graphics card you have (since you reported the problem went away on the same system when you swapped in an NVidia card). Possibly it's also OS- or driver-specific with regard to the old graphics card.
Note that you can also reduce your LoadPages() time from 300–400ms down to about 40ms by cutting down the amount of numpy operations Shady has to do. Save your arrays as [pages x rows x columns] instead of [rows x columns x pages]. Relative to your existing workflow, this means you do yy = yy.transpose([2, 0, 1]) before saving. Then, when you load, don't transpose back: just split on axis=0, and then squeeze the leftmost dimension out of each resulting page:
pages = [ page.squeeze(0) for page in numpy.split(yy, yy.shape[0], axis=0) ]
That way you'll end up with 25 views into the original array, each of which is a contiguous memory block. By contrast, if you do it the original [rows x columns x pages] way, then regardless of whether you do split-and-squeeze or your original slice-and-squeeze loop, you get 25 non-contiguous views into the original memory, and that fact will catch up with you sooner or later—if not when you or Shady convert between numeric formats, then at latest when Shady uses numpy's .tostring method to serialize the data for transfer.
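Putting the save and load sides of that reordering together, a minimal sketch (the file names are placeholders):
import numpy as np

fname_in = 'stimulus_rows_cols_pages.npy'      # placeholder: original [rows, cols, pages] file
fname_out = 'stimulus_pages_first.npy'         # placeholder: reordered [pages, rows, cols] file

# One-off, offline: transpose so each page becomes a contiguous block.
yy = np.load(fname_in)
np.save(fname_out, yy.transpose([2, 0, 1]))

# At presentation time: split on axis 0; each page is a contiguous view, no copies.
yy = np.load(fname_out)
pages = [page.squeeze(0) for page in np.split(yy, yy.shape[0], axis=0)]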

Related

Configuring matplotlib to fetch data on demand?

I am working on software that processes time series. Sometimes these are very long (>10 million data points). Our software is very usable for shorter time series but gets unusably bogged down for these long ones. When looking at the RAM usage, it's almost 10x what all the time series data together occupy.
When doing some tests, it's clear that a lot of memory is used by matplotlib, which we are using to plot the time series. Using a separate piece of code that includes ONLY loading of the time series from a file and plotting, I can see that when going from loading only (with the plotting command commented out) to plotting, the memory usage goes up almost 3-fold. This is true whether or not the whole time range is visible within the given axis limits, although passing only a small slice of the series (numpy array) to matplotlib DOES proportionally reduce the excess memory.
Given that we expect users to scroll through the time series and only view short chunks at a time, it would be much better to have matplotlib only fetch the visible portion of the numpy array, grabbing new elements as the user scrolls or zooms. In fact, it would likely be preferable to replace the X and Y arrays with generators that re-compute the values on the fly as the plot needs them, possibly caching points just outside the limits to make scrolling faster. The X values in particular are simple linspaces that would likely be best not stored at all, given that computing them should be as fast as a lookup into a huge array, never mind storing them once in the outer software AND also in matplotlib.
I know we could try to "fake" this by capturing user events sent to the plot and re-sending new X and Y arrays all the time, but this feels clunky, prone to all sorts of corner cases where things get out of sync, and like trying to take over from the plotting library things it "wants" to do itself. At some point it would become easier just to write our own simple plotting routine in C/C++ that does the computations and draws lines using a graphics API. In fact, the nearest closed-source competitor to our software seems to be doing just that, given that it's super snappy and uses an amount of RAM that is a mere fraction of the size of a time series. But, we want our software to be extensible by users without a deep understanding of the internals of our code.
Is there a standard way of handling this, or is this just too far from the "spirit" of matplotlib to be worth using it? And in that case, is there an alternative Python plotting library with exactly this use case in mind? I would imagine that data scientists working with terabytes of data would want a way to graphically explore it without the plotting code eating terabytes of storage itself...
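For reference, a rough sketch of the "capture events and re-send arrays" workaround described above, using the axes' xlim_changed callback (the series length, decimation threshold, and wiring are all illustrative, not a recommended design):
import numpy as np
import matplotlib.pyplot as plt

y = np.random.randn(10_000_000)     # stand-in for a long time series
x = np.arange(y.size)               # conceptually a linspace; kept explicit here

fig, ax = plt.subplots()
line, = ax.plot([], [])

def on_xlim_changed(ax):
    # Re-slice so matplotlib only ever holds the visible chunk, decimated.
    lo, hi = map(int, ax.get_xlim())
    lo, hi = max(lo, 0), min(hi, y.size)
    step = max((hi - lo) // 100_000, 1)
    line.set_data(x[lo:hi:step], y[lo:hi:step])
    ax.relim()
    ax.autoscale_view(scalex=False)  # rescale y to the visible chunk only

ax.callbacks.connect('xlim_changed', on_xlim_changed)
ax.set_xlim(0, 100_000)
plt.show()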

SatPy load all channels at once

I have a Scene object and I would like to load all channels into a numpy array of shape (24, 24, 3), where 3 is the number of channels.
scene_xybox = scn.crop(xy_bbox=box)
I have to select each channel:
channel = scene_xybox['VIS006'].values
then repeat for each channel and stack the results at the end.
Is there a way to get the stacked numpy array in one line?
This takes about 5 seconds per box, and I have many files, so doing the same operation for multiple boxes in an image, across multiple images, will take a very long time.
A perfect answer may require more information from you regarding what your end goal is, how many "boxes" you are cutting out, etc. But I'll see what I can clear up first. I assume you are not resampling the data with Scene.resample in your code at all.
Satpy uses dask, so if possible it would be best to compute everything at once, or at least limit how many times things are computed (.values computes the dask array). If you have a lot of boxes to cut out and your system has the available memory, you may want to calculate the slices yourself for all the xy bboxes (I think there are methods to help with this), load the entire image (see xr.concat below), and then use basic slicing techniques to get each of the box cutouts. This should save you from loading the data from disk each time you call .values, and it will also really help with processing the other files you have, since the slices should be the same across all times (except for special instrument cases).
You say you want the final shape to be (rows, cols, N). Is there a good reason you can't have (N, rows, cols)? The latter should be faster, as the arrays stay in their original contiguous form. And if whatever processing you do afterwards can be done with dask, it would "flow" really well with the tasks that get created.
You can use xr.concat, passing all the DataArrays at once and then call .values to get the full numpy array underneath. This should compute all the bands at the same time. Something like:
final_arr = xr.concat([scn['VIS006'], scn['band2'], scn['band3']], "bands").values
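If there are many boxes, a hedged sketch of the slice-it-yourself variant described above (the band names and slice bounds are placeholders):
import xarray as xr

# Stack the bands once, pull the full numpy array out a single time, then cut
# each box with plain slicing instead of calling .values per box.
full = xr.concat([scn['VIS006'], scn['band2'], scn['band3']], "bands").values   # (bands, rows, cols)

boxes = [(0, 24, 0, 24), (24, 48, 0, 24)]   # (row_start, row_stop, col_start, col_stop)
cutouts = [full[:, r0:r1, c0:c1] for r0, r1, c0, c1 in boxes]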

Astronomical FITS Image calibration: Indexing issue using ccdproc

I seem to be having an issue with some basic astronomical image processing/calibration using the python package ccdproc.
I'm currently combining 30 bias frames into a single image that is the average of the component frames. Before the combination I iterate over each image, subtracting the overscan region using subtract_overscan() and then selecting the image dimensions I want to retain using trim_image().
I believe my indexing is correct, but when I get to the combination it takes extremely long (more than a couple of hours), and I'm not sure that this is normal. I suspect something might be being misinterpreted by my computer. I've created the averaged image before without any of the other processing and it didn't take long (5-10 minutes or so), which is why I'm thinking it might be an issue with my indexing.
If anyone can verify that my code is correct and/or comment on what might be the issue it'd be a lot of help.
Image dimensions: NAXIS1 = 3128, NAXIS2 = 3080; allfiles is a ccdproc.ImageFileCollection.
from astropy.io import fits
from astropy import units as u
import ccdproc as cp

biasImages = []
for filename in allfiles.files_filtered(NAXIS1=3128, NAXIS2=3080, OBSTYPE='BIAS'):
    ccd = fits.getdata(allfiles.location + filename)
    ccd = cp.CCDData(ccd, unit=u.adu)
    ccd = cp.subtract_overscan(ccd, overscan_axis=1, fits_section='[3099:3124,:]')
    ccd = cp.trim_image(ccd, fits_section='[27:3095,3:3078]')
    biasImages.append(ccd)

master_bias = cp.combine(biasImages, output_file=path + 'mbias_avg.fits', method='average')
The code looks similar to my own code for combining biases together (see this example), so there is nothing jumping out immediately as a red flag. I rarely do such a large number of biases and the ccdproc.combine task could be far more optimized, so I'm not surprised it is very slow.
One thing I sometimes run into is issues with garbage collection. If you are running this in a notebook or as part of a large script, there may be a problem with memory not being cleared. It is useful to watch what is happening in memory, and I sometimes explicitly delete the biasImages object (or any other list of ccd objects) once it has been used and isn't needed any further.
I'm happy to respond further here, or if you have further issues please open an issue at the github repo.
In case you're just looking for a solution, skip ahead to the end of this answer; but if you're interested in why this (probably) happens, don't skip ahead.
it takes extremely long (more than a couple of hours).
That sounds like you're running out of RAM and your computer has started using swap memory. That means it saves part (or all) of the objects to your hard disk and removes them from RAM so it can load them again when needed. In some cases swap memory can be very efficient, because it only rarely needs to reload from the hard disk, but in other cases it has to reload many times, and then you'll notice a "whole system slowdown" and "never-ending operations".
After some investigation I think the problem is mainly that the numpy array created by ccdproc.combine is stacked along the first axis and the operation is performed along the first axis. The first axis would be fine if the array were FORTRAN-contiguous, but ccdproc doesn't specify any "order", so it's going to be C-contiguous. That means the elements along the last axis are stored next to each other in memory (if it were FORTRAN-contiguous, the elements along the first axis would be next to each other). So if you run out of RAM and your computer starts using swap memory, it puts parts of the array on the disk, but because the operation is performed along the first axis the memory addresses of the elements used in each operation are "far away from each other". That means it cannot use the swap memory effectively, because it basically has to reload parts of the array from the hard disk for each next item.
That's not essential to know; I just included it in case you're interested in the reason for the observed behavior. The main point to take away is that if you notice the system becoming very slow whenever you run the program, and it doesn't seem to make much progress, it's because you have run out of RAM.
The easiest solution (although it has nothing to do with programming) is to buy more RAM.
The complicated solution would be to reduce the memory footprint of your program.
Let's first do a small calculation how much memory we're dealing with:
Your images are 3128 * 3080 pixels, which is 9,634,240 elements. They might be any type when you read them, but after ccdproc.subtract_overscan they will be floats. One float (well, actually np.float64) uses 8 bytes, so we're dealing with 77,073,920 bytes. That's roughly 73 MB per bias image. You have 30 bias images, so we're dealing with roughly 2.2 GB of data here. That's assuming your images don't have uncertainty or mask; if they do, that would add another 2.2 GB for the uncertainties or 0.26 GB for the masks.
2.2 GB sounds like a small enough number, but ccdproc.combine stacks the NumPy arrays. That means it creates a new array and copies the data of your ccds into it, which doubles the memory right there. It makes sense to stack them, because even though it takes more memory the actual "combining" will be much faster, but the memory cost hits before it even gets that far.
All in all 4.4 GB could already exhaust your RAM. Some computers will only have 4GB RAM and don't forget that your OS and the other programs need some RAM as well. However, it's unlikely that you run out of RAM if you have 8GB or more but given the numbers and your observations I assume that you only have 4-6GB of RAM.
The interesting question is actually how to avoid the problem. That really depends on the amount of memory you have:
Less than 4GB RAM
That's tricky, because you won't have much free RAM after you deduct the size of all the CCDData objects and what your OS and the other processes need. In that case it would be best to process, for example, 5 bias images at a time and then combine the results of the first combinations. That works because you use average as the method (it wouldn't work if you used median), since (A+B+C+D)/4 is equal to ((A+B)/2 + (C+D)/2)/2.
That would be (I haven't actually checked this code, so please inspect it carefully before you run it):
biasImages = []
biasImagesCombined = []
for idx, filename in enumerate(allfiles.files_filtered(NAXIS1=3128, NAXIS2=3080, OBSTYPE='BIAS')):
    ccd = fits.getdata(allfiles.location + filename)
    ccd = cp.CCDData(ccd, unit=u.adu)
    ccd = cp.subtract_overscan(ccd, overscan_axis=1, fits_section='[3099:3124,:]')
    ccd = cp.trim_image(ccd, fits_section='[27:3095,3:3078]')
    biasImages.append(ccd)
    # Combine every 5 bias images. This only works correctly if the number of
    # images is a multiple of 5 and you use average as the combine method.
    if (idx + 1) % 5 == 0:
        tmp_bias = cp.combine(biasImages, method='average')
        biasImages = []
        biasImagesCombined.append(tmp_bias)

master_bias = cp.combine(biasImagesCombined, output_file=path + 'mbias_avg.fits', method='average')
4GB of RAM
In that case you probably have 500 MB to spare, so you could simply use mem_limit to limit the amount of RAM the combine will take additionally. In that case just change your last line to:
# To account for additional memory usage I chose 100 MB of additional memory;
# you could try to adapt the actual number.
master_bias = cp.combine(biasImages, mem_limit=1024*1024*100, output_file=path + 'mbias_avg.fits', method='average')
More than 4GB of RAM but less than 8GB
In that case you probably have 1 GB of free RAM that could be used. It's still the same approach as the 4GB option but you could use a much higher mem_limit. I would start with 500 MB: mem_limit=1024*1024*500.
More than 8GB of RAM
In that case I must have missed something because using ~4.5GB of RAM shouldn't actually exhaust your RAM.

Sensible storage of 1 billion+ values in a python list type structure

I'm writing a program that creates vario-function plots for a fixed region of a digital elevation model that has been converted to an array. I calculate the variance (difference in elevation) and lag (distance) between point pairs within the window constraints. Every array position is compared with every other array position. For each pair, the lag and variance values are appended to separate lists. Once all pairs have been compared, these lists are then used for data binning, averaging and eventually plotting.
The program runs fine for smaller window sizes (say 60x60 px). For windows up to about 120x120 px or so, which gives two lists of 207,360,000 entries each, the program runs slowly but completes. Beyond that I run into MemoryError reports - e.g. for a 240x240 px region I would have 3,317,760,000 entries per list.
At the beginning of the program, I create an empty list:
variance = []
lag = []
Then within a for loop where I calculate my lags and variances, I append the values to the different lists:
variance.append(var_val)
lag.append(lag_val)
I've had a look over the stackoverflow pages and have seen a similar issue discussed here. That solution would potentially improve the program's runtime performance; however, it only goes up to 100 million entries and therefore doesn't help me with the larger regions (such as the 240x240 px example). I've also considered using numpy arrays to store the values, but I don't think this will stave off the memory issues.
Any suggestions for ways to store lists of the size I've described for the larger window sizes would be much appreciated.
I'm new to python so please forgive any ignorance.
The main bulk of the code can be seen here
Use the array module of Python. It offers some list-like types that are more memory efficient (but cannot be used to store random objects, unlike regular lists). For example, you can have arrays containing regular floats ("doubles" in C terms), or even single-precision floats (four bytes each instead of eight, at the cost of a reduced precision). An array of 3 billion such single-floats would fit into 12 GB of memory.
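For instance, a minimal sketch assuming single precision is acceptable for your variance and lag values (the literals stand in for the values appended inside your pair loop):
from array import array

# Typed arrays store raw C floats: 4 bytes per element instead of a full
# Python float object plus a list pointer for every entry.
variance = array('f')
lag = array('f')

variance.append(1.25)      # in your loop this would be var_val
lag.append(310.0)          # and this would be lag_val
print(len(variance), variance.itemsize)   # itemsize is 4 bytes per element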
You could look into PyTables, a library wrapping the HDF5 C library that can be used with numpy and pandas.
Essentially PyTables will store your data on disk and transparently load it into memory as needed.
Alternatively if you want to stick to pure python, you could use a sqlite3 database to store and manipulate your data - the docs say the size limit for a sqlite database is 140TB, which should be enough for your data.
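As a rough sketch of the PyTables route using extendable arrays (the file and node names are made up):
import numpy as np
import tables

# Disk-backed, append-only arrays; PyTables loads chunks into memory as needed.
h5 = tables.open_file('variogram.h5', mode='w')
variance = h5.create_earray(h5.root, 'variance', tables.Float32Atom(), shape=(0,))
lag = h5.create_earray(h5.root, 'lag', tables.Float32Atom(), shape=(0,))

# Append in chunks rather than one value at a time to keep things fast.
variance.append(np.array([1.25, 0.50], dtype=np.float32))
lag.append(np.array([310.0, 125.5], dtype=np.float32))
h5.close()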
Try using heapq (import heapq). It uses the heap for storage rather than the stack, allowing you to access the computer's full memory.

Techniques for working with large Numpy arrays? [duplicate]

This question already has answers here:
Very large matrices using Python and NumPy
(11 answers)
Closed 2 years ago.
There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that pickling (pickle, cPickle, PyTables, etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).
Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (i.e., speed, robustness, etc.)?
I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.
I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:
Use the most significant 4 bits of every RGB value, as indices into a three-dimensional look up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing 6x the storage of the image in order to process it.
A couple of things you can do to handle this:
1. Divide and conquer
Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a python for loop iterating over 10 arrays of 100x1,000, it is still going to beat by a very far margin a python iterator over 1,000,000 items! It's going to be slower than a single vectorized pass, yes, but not by as much.
2. Cache expensive computations
This relates directly to my interpolation example above, and is harder to come across, although it is worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them and store them using 64KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with 6x the number of pixels without having to subdivide the array.
3. Use your dtypes wisely
If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.
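As a quick illustration of the dtype point (the array shape is arbitrary):
import numpy as np

# The same 1,000 x 1,000 image held as uint8 versus int32.
a8 = np.zeros((1000, 1000), dtype=np.uint8)
a32 = np.zeros((1000, 1000), dtype=np.int32)
print(a8.nbytes)   # 1000000 bytes, about 1 MB
print(a32.nbytes)  # 4000000 bytes, about 4 MB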
First most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing into life and discarding/garbage collecting lots of temporary arrays. Sounds a little bit old-fashioned, but with careful programming speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.)
Second: use numpy.memmap and hope that the OS caching of accesses to the disk is efficient enough (a small sketch follows at the end of this answer).
Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.
EDIT:
Avoid unnecessary list comprehensions, as pointed out in this answer on SE.
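A minimal sketch of the memmap idea from the list above (the file name, dtype, and shape are made up):
import numpy as np

# A disk-backed array; the OS pages chunks in and out of RAM on demand.
big = np.memmap('scratch.dat', dtype=np.float32, mode='w+', shape=(100_000, 10_000))
big[:1_000, :] = 1.0    # only the touched pages need to be resident in RAM
big.flush()             # write dirty pages back to disk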
The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.
You could also look into Spartan, Distarray, and Biggus.
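A minimal dask.array sketch (the array shape and chunk sizes are illustrative):
import dask.array as da

# A large array processed block by block; only a few chunks are in memory at a time.
x = da.random.random((20_000, 20_000), chunks=(5_000, 5_000))
result = (x ** 2 + 2 * x).mean(axis=0).compute()   # evaluated chunk-wise, in parallel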
If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (for a and b being arrays) it
will compile machine code that will execute fast and with minimal memory overhead, taking care of memory locality stuff (and thus cache optimization) if the same array occurs several times in your expression,
uses all cores of your dual or quad core CPU,
is an extension to numpy, not an alternative.
For medium and large sized arrays, it is faster than numpy alone.
Take a look at the web page given above; there are examples that will help you understand whether numexpr is for you.
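A minimal sketch of the expression mentioned above (the array sizes are arbitrary):
import numpy as np
import numexpr as ne

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)

# Evaluated in cache-sized blocks across all cores, without large temporaries.
result = ne.evaluate("a**2 + b**2 + 2*a*b")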
On top of everything said in the other answers: when we don't need to keep the intermediate results of a computation in memory, we can use reduce to perform an aggregation directly; when we do want all the intermediate results, we can use accumulate. Both come from numpy's ufuncs:
Aggregates
For binary ufuncs, there are some interesting aggregates that can be computed directly from the object. For example, if we'd like to reduce an array with a particular operation, we can use the reduce method of any ufunc. A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling reduce on the add ufunc returns the sum of all elements in the array:
x = np.arange(1, 6)
np.add.reduce(x) # Outputs 15
Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
np.multiply.reduce(x) # Outputs 120
Accumulate
If we'd like to store all the intermediate results of the computation, we can instead use accumulate:
np.add.accumulate(x) # Outputs array([ 1, 3, 6, 10, 15], dtype=int32)
np.multiply.accumulate(x) # Outputs array([ 1, 2, 6, 24, 120], dtype=int32)
Using these numpy operations wisely while performing many intermediate operations on one or more large numpy arrays can give you great results without any additional libraries.
