Astronomical FITS Image calibration: Indexing issue using ccdproc - python

I seem to be having an issue with some basic astronomical image processing/calibration using the python package ccdproc.
I'm currently combining 30 bias frames into a single image that is the average of the component frames. Before the combination I iterate over each image to subtract the overscan region using subtract_overscan() and then select the image dimensions I want to retain using trim_image().
I believe my indexing is correct, but when I get to the combination it takes extremely long (more than a couple of hours). I'm not sure that this is normal; I suspect something is being misinterpreted by my computer. I've created the averaged image before without any of the other processing and it didn't take long (5-10 minutes or so), which is why I think it might be an issue with my indexing.
If anyone can verify that my code is correct and/or comment on what might be the issue it'd be a lot of help.
Image dimensions: NAXIS1 = 3128 , NAXIS2 = 3080 and allfiles is a ccdproc.ImageFileCollection.
from astropy.io import fits
import astropy.units as u
import ccdproc as cp

biasImages = []
for filename in allfiles.files_filtered(NAXIS1=3128, NAXIS2=3080, OBSTYPE='BIAS'):
    ccd = fits.getdata(allfiles.location + filename)
    ccd = cp.CCDData(ccd, unit=u.adu)
    ccd = cp.subtract_overscan(ccd, overscan_axis=1, fits_section='[3099:3124,:]')
    ccd = cp.trim_image(ccd, fits_section='[27:3095,3:3078]')
    biasImages.append(ccd)

master_bias = cp.combine(biasImages, output_file=path + 'mbias_avg.fits', method='average')

The code looks similar to my own code for combining biases together (see this example), so nothing jumps out immediately as a red flag. I rarely do such a large number of biases, and the ccdproc.combine task could be far more optimized, so I'm not surprised it is very slow.
One thing that I sometimes run into is issues with garbage collection. So if you are running this in a notebook or as part of a large script, there may be a problem with memory not being cleared. It is useful to see what is happening in memory, and I sometimes delete the biasImages object (or any other list of CCD objects) after it has been used and isn't needed any further.
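For example, something like this (a rough sketch of what I mean, not tested against your exact setup):

import gc

master_bias = cp.combine(biasImages, output_file=path + 'mbias_avg.fits', method='average')

# Once the master bias exists, the list of CCDData objects is no longer needed,
# so drop the reference and ask the garbage collector to reclaim the memory.
del biasImages
gc.collect()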
I'm happy to respond further here, or if you have further issues please open an issue at the github repo.

In case you're just looking for a solution, skip ahead to the end of this answer; but in case you're interested in why this (probably) happens, don't skip ahead.
it takes extremely long (more than a couple of hours).
That sounds like you're running out of RAM and your computer then starts using swap memory. That means it will save part (or all) of the objects on your hard disk and remove them from RAM so it can load them again when needed. In some cases swap memory can be very efficient, because it only needs to reload from the hard disk rarely, but in other cases it has to reload lots of times, and then you're going to notice a "whole system slow down" and "never ending operations".
After some investigation I think the problem is mainly that the numpy.array created by ccdproc.combine is stacked along the first axis and the operation is performed along the first axis. Operating along the first axis would be fine if the array were FORTRAN-contiguous, but ccdproc doesn't specify any "order", so the array will be C-contiguous. That means the elements of the last axis are stored next to each other in memory (if it were FORTRAN-contiguous, the elements of the first axis would be next to each other). So if you run out of RAM and your computer starts using swap memory, it puts parts of the array on disk, but because the operation is performed along the first axis, the memory addresses of the elements used in each operation are "far away from each other". That means it cannot use the swap memory effectively, because it basically has to reload parts of the array from the hard disk for "each" next item.
It's not very important to know all that, actually; I just included it in case you're interested in the reason for the observed behavior. The main point to take away is: if you notice that the whole system becomes very slow when you run this program and it doesn't seem to make much progress, it's because you have run out of RAM!
The easiest solution (although it has nothing to do with programming) is to buy more RAM.
The complicated solution would be to reduce the memory footprint of your program.
Let's first do a small calculation of how much memory we're dealing with:
Your images are 3128 x 3080, which is 9,634,240 elements. They might be any type when you read them, but after ccdproc.subtract_overscan they will be floats. One float (well, actually np.float64) uses 8 bytes, so we're dealing with 77,073,920 bytes. That's roughly 73 MB per bias image. You have 30 bias images, so we're dealing with roughly 2.2 GB of data here. That's assuming your images don't have an uncertainty or mask; if they do, that would add another 2.2 GB for the uncertainties or 0.26 GB for the masks.
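As a quick sanity check of those numbers (just the arithmetic spelled out):

n_images = 30
n_pixels = 3128 * 3080                        # 9,634,240 pixels per frame
bytes_per_image = n_pixels * 8                # np.float64 uses 8 bytes per pixel
print(bytes_per_image / 1024**2)              # roughly 73 MB per bias image
print(n_images * bytes_per_image / 1024**3)   # roughly 2.2 GB for 30 frames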
2.2 GB sounds like a manageable number, but ccdproc.combine stacks the NumPy arrays. That means it will create a new array and copy the data of your CCDs into it, which doubles the memory right there. It makes sense to stack them because, even though it takes more memory, the actual "combining" will be much faster; but we're not at the combining step yet.
All in all, 4.4 GB could already exhaust your RAM. Some computers only have 4 GB of RAM, and don't forget that your OS and the other programs need some RAM as well. However, it's unlikely that you run out of RAM if you have 8 GB or more, so given the numbers and your observations I assume you only have 4-6 GB of RAM.
The interesting question is actually how to avoid the problem. That really depends on the amount of memory you have:
Less than 4GB RAM
That's tricky, because you won't have much free RAM after you deduct the size of all the CCDData objects and what your OS and the other processes need. In that case it would be best to process, for example, 5 bias images at a time and then combine the results of those first combinations. That works because you use average as the method (it wouldn't work if you used median), since (A+B+C+D)/4 is equal to ((A+B)/2 + (C+D)/2)/2.
That would be (I haven't actually checked this code, so please inspect it carefully before you run it):
biasImages = []
biasImagesCombined = []

for idx, filename in enumerate(allfiles.files_filtered(NAXIS1=3128, NAXIS2=3080, OBSTYPE='BIAS')):
    ccd = fits.getdata(allfiles.location + filename)
    ccd = cp.CCDData(ccd, unit=u.adu)
    ccd = cp.subtract_overscan(ccd, overscan_axis=1, fits_section='[3099:3124,:]')
    ccd = cp.trim_image(ccd, fits_section='[27:3095,3:3078]')
    biasImages.append(ccd)
    # Combine every 5 bias images. This only works correctly if the number of
    # images is a multiple of 5 and you use average as the combine method.
    if (idx + 1) % 5 == 0:
        tmp_bias = cp.combine(biasImages, method='average')
        biasImages = []
        biasImagesCombined.append(tmp_bias)

master_bias = cp.combine(biasImagesCombined, output_file=path + 'mbias_avg.fits', method='average')
4GB of RAM
In that case you probably have about 500 MB to spare, so you could simply use mem_limit to limit the amount of additional RAM the combine will use. In that case just change your last line to:
# To account for additional memory usage I chose 100 MB of additional memory here;
# you could try to adapt the actual number.
master_bias = cp.combine(biasImages, mem_limit=1024*1024*100, output_file=path + 'mbias_avg.fits', method='average')
More than 4GB of RAM but less than 8GB
In that case you probably have 1 GB of free RAM that could be used. It's still the same approach as the 4GB option but you could use a much higher mem_limit. I would start with 500 MB: mem_limit=1024*1024*500.
More than 8GB of RAM
In that case I must have missed something because using ~4.5GB of RAM shouldn't actually exhaust your RAM.


Numpy's memmap acting strangely?

I am dealing with large numpy arrays and I am trying out memmap as it could help.
big_matrix = np.memmap(parameters.big_matrix_path, dtype=np.float16, mode='w+', shape=(1000000, 1000000))
The above works fine and it creates a file on my hard drive of about 140GB.
1000000 is just a random number I used - not the one I am actually using.
I want to fill the matrix with values. Currently it is just set to zero.
for i in tqdm(range(len(big_matrix))):
    modified_row = get_row(i)
    big_matrix[i, :] = modified_row
At this point now, I have a big_matrix filled with the values I want.
The problem is that from this point on I can't operate on this memmap.
For example I want to multiply column wise (broadcast).
I run this:
big_matrix * weights[:, np.newaxis]
Where weights has the same length.
It just hangs and throws an out-of-memory error as my RAM and swap are all used up.
My understanding was that the memmap will keep everything on the hard drive.
For example save the results directly there.
So I tried this then:
for i in tqdm(range(big_matrix.shape[1])):
    temp = big_matrix[:, i].tolist()
    temp = np.array(temp) * weights
The above loads only 1 column into memory and multiplies it by the weights.
Then I will save that column back into big_matrix.
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
At this point I am thinking of switching to sqlite.
I wanted to get some insight into why my code is not working.
Do I need to flush the memmap every time I change it?
np.memmap maps a part of the virtual memory to the storage device space here. The OS is free to preload pages and cache them for fast reuse. The memory is generally not flushed unless it is reclaimed (e.g. by another process or the same process). When this happens, the OS typically (partially) flushes data to the storage device and (partially) frees the physical memory used for the mapping. That being said, this behaviour depends on the actual OS: it works that way on Windows, while on Linux you can use madvise to tune it, but madvise is a low-level C function not yet supported by NumPy (though it is apparently supported for Python; see this issue for more information). Actually, NumPy does not even support closing the memmapped space (which is leaky). The solution is generally to flush data manually so as not to lose it. There are alternative solutions, but none of them is great yet.
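A minimal sketch of such a manual flush (the file name and shape here are made up for illustration):

import numpy as np

big_matrix = np.memmap('big_matrix.dat', dtype=np.float16, mode='w+',
                       shape=(10_000, 10_000))
big_matrix[0, :] = 1.0

# Explicitly write the dirty pages back to the file so the data is not lost
# if the mapping is reclaimed or the process exits unexpectedly.
big_matrix.flush()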
big_matrix * weights[:, np.newaxis]
It just hangs and throws an out-of-memory error as my RAM and swap are all used up.
This is normal, since NumPy creates a new temporary array stored in RAM. There is no way to tell NumPy to store temporary arrays on the storage device. That being said, you can tell NumPy where the output data is stored using the out parameter of some functions (e.g. np.multiply supports it). The output array can be created with memmap so as not to use too much memory (subject to the behaviour of the OS).
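A sketch of what that could look like, reusing big_matrix from above (the output file name and the placeholder weights are made up for the example):

# Placeholder weights with the same dtype so the result fits the float16 output.
weights = np.ones(big_matrix.shape[0], dtype=np.float16)

out = np.memmap('scaled.dat', dtype=np.float16, mode='w+', shape=big_matrix.shape)
np.multiply(big_matrix, weights[:, np.newaxis], out=out)  # result goes to the memmap, not a RAM temporary
out.flush()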
But even with 1 column my program hangs. The only difference here is that the RAM is not used up.
This is also expected, especially if you use an HDD and not an SSD. Indeed, the array is stored (virtually) contiguously on the storage device, so big_matrix[:, i] has to fetch data with a huge stride. For each item, with a size of only 2 bytes, the OS performs an IO request to the storage device. Storage devices are optimized for contiguous reads, so fetches are buffered and each IO request has a pretty significant latency. In practice, the OS will generally fetch at least a page (typically 4096 bytes, that is, 2048 times more than what is actually needed). Moreover, there is a limit on the number of IO requests that can be completed per second: HDDs can typically do about 20-200 IO requests per second, while the fastest NVMe SSDs reach 100_000-600_000 IO requests per second. Note that the cache helps avoid reloading data for the next column, unless there are too many loaded pages and the OS has to flush them. Reading a matrix of size (1_000_000, 1_000_000) causes up to 1_000_000 * 1_000_000 = 1_000_000_000_000 fetches, which is horribly inefficient. The cache could reduce this by a large margin, but operating simultaneously on 1_000_000 pages is also horribly inefficient, since the processor cannot do that (due to the limited number of entries in the TLB). This typically results in TLB misses, that is, expensive kernel calls for each item to be read. Because a kernel call typically takes (at least) about ~1 us on a mainstream PC, this means more than a week to do the whole computation.
If you want to read columns efficiently, then you need to read large chunks of columns. For example, you certainly need at least several hundred columns per read even on a fast NVMe SSD. For an HDD, it is at least several tens of thousands of columns to get a proper throughput. This means you certainly cannot read the full columns efficiently due to the high amount of RAM required. Using another data layout (tiles + transposed data) is critical in this case.
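If you do keep the current layout, reading columns in large chunks rather than one at a time could look roughly like this (a sketch only; the chunk size is a guess to tune against your hardware, and it only mitigates rather than fixes the layout problem):

chunk = 1024  # number of columns per read; larger is generally better here
for start in range(0, big_matrix.shape[1], chunk):
    stop = min(start + chunk, big_matrix.shape[1])
    # One wide read per row instead of a tiny 2-byte read per item.
    block = np.asarray(big_matrix[:, start:stop])
    big_matrix[:, start:stop] = block * weights[:, np.newaxis]
big_matrix.flush()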

Improve efficiency of an exponential string in Python

I've been trying to make a dragon curve fractal in Python, and I've gotten as far as 32 iterations, yet 33 gives me a memory error. My computer is fairly powerful (a GE75 Raider 9SE), and with 32 iterations everything is fine; I can even run 3D modeling software at the same time.
I am using 64-bit Python and have not yet tried allocating more memory or multiprocessing. I still want to see whether the efficiency of my actual generation process could be improved first. My code so far is shown below:
old = 'r'
new = old
table = str.maketrans("lr", "rl")
iteration = 32

for i in range(iteration):
    new = old + 'r'
    old = "".join(old[::-1])
    old = old.translate(table)
    new = new + old
    old = new
Is one of the things I'm performing redundant? Are any of the functions I'm calling inefficient? I would like to know before I explore more options. I don't make any more copies of this string, so there aren't many unnecessary objects in that regard.
Well, I don't think there are too many "inefficiencies" there... just a lot of memory required for the ginormous string you are making.
For every iteration, your string length roughly doubles, and so does the memory required to hold it.
At 30 iterations, the string is roughly 1 billion characters, which takes about 1 GB of RAM to hold (each character is a byte). You can check this with the sys.getsizeof() function on your old at the end of the loop. So going from 30 to 33 iterations is a factor of 8, which gets you to 8 GB of RAM to hold the result; plus, for some period of time before the result is garbage collected, it will hold both old and new, so depending on internals there is probably some penalty on top of that.
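A quick way to watch that growth at a smaller, safe iteration count (this just compresses the loop body from the question into one line):

import sys

old = 'r'
table = str.maketrans("lr", "rl")
for i in range(20):                         # 20 iterations is harmless; 33 is not
    old = old + 'r' + old[::-1].translate(table)
    print(i + 1, len(old), sys.getsizeof(old))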
You can search the site for ways to expand your system's allocation for the Python virtual machine, which may give you some help, but at the end of the day, do you really need more than a billion elements of the fractal?

LoadPage in Shady very slow on Linux Mint

I am trying to display a sequence of frames using Shady, but I'm running into difficulties. I'm looking at 25 frames covering a 1080x1080 pixel area. The stimulus is grayscale and I am doing luminance linearization off-line, so I only need to save a uint8 value for each pixel. The full sequence is thus about 29 MB. I define the stimulus as a 3-D numpy array [1080x1080x25], save it to disk using np.save(), and then load it using np.load().
try:
    yy = np.load(fname)
except:
    print(fname + ' does not exist')
    return
This step takes about 20ms. It is my understanding that Shady does not deal with uint8 luminance values, but rather with floats between 0 and 1. I thus convert it into a float array and divide by 255.
yy = yy.astype(np.float64) / 255.0
This second step takes approx 260ms, which is already not great (ideally I need to get the stimulus loaded and ready to be presented in 400ms).
I now create a list of 25 numpy arrays to use as my pages parameter in the Stimulus class:
pages = []
for j in range(yy.shape[2]):
    pages.append(np.squeeze(yy[:, :, j]))
This is virtually instantaneous. But at my next step I run into serious timing problems.
if self.sequence is None:
    self.sequence = self.wind.Stimulus(pages, 'sequence', multipage=True, anchor=Shady.LOCATION.UPPER_LEFT, position=[deltax, deltay], visible=False)
else:
    self.sequence.LoadPages(pages, visible=False)
Here I either create a Stimulus object, or update its pages attribute if this is not the first sequence I load. Either way, this step takes about 10s, which is about 100 times what I can tolerate in my application.
Is there a way to significantly speed this up? What am I doing wrong? I have a pretty mediocre graphics card on this machine (Radeon Pro WX 4100), and if that is the problem I could upgrade it, but I do not want to go through the hassle if that is not going to fix it.
Based on jez's comments, his tests, and my tests, I guess that on some configurations (in my case Linux Mint 19 with Cinnamon and a mediocre AMD video card) loading floats can be much slower than loading uint8. With uint8 the behavior appears to be consistent across configurations, so go with uint8 if you can. Since this will (I assume) disable much of what Shady can do in terms of gamma correction and dynamic range enhancement, this might be limiting for some.
Shady can accept uint8 pixel values as-is so you can cut out your code for scaling and type-conversion. Of course, you lose out on Shady's ability to do dynamic range enhancement that way, but it seems like you have your own offline solutions for that kind of thing. If you're going to use uint8 stimuli exclusively, you can save a bit of GPU processing effort by turning off dithering (set the .ditheringDenominator of both the World and the Stimulus to 0 or a negative value).
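A rough sketch of what the uint8 path could look like, reusing the names from the question above (untested, just to illustrate the two suggestions):

yy = np.load(fname)                       # keep the uint8 values as-is: no astype, no /255.0
pages = [np.squeeze(yy[:, :, j]) for j in range(yy.shape[2])]

self.wind.ditheringDenominator = -1       # turn off dithering for raw uint8 pixels
self.sequence = self.wind.Stimulus(pages, 'sequence', multipage=True,
                                   anchor=Shady.LOCATION.UPPER_LEFT,
                                   position=[deltax, deltay], visible=False)
self.sequence.ditheringDenominator = -1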
It seems like the ridiculous 10-to-15-second delays come from inside the compiled binary "accelerator" component, when transferring the raw texture data from RAM to the graphics card. The problem is apparently (a) specific to transferring floating-point texture data rather than integer data, and (b) specific to the graphics card you have (since you reported the problem went away on the same system when you swapped in an NVidia card). Possibly it's also OS- or driver-specific with regard to the old graphics card.
Note that you can also reduce your LoadPages() time from 300–400ms down to about 40ms by cutting down the amount of numpy operations Shady has to do. Save your arrays as [pages x rows x columns] instead of [rows x columns x pages]. Relative to your existing workflow, this means you do yy = yy.transpose([2, 0, 1]) before saving. Then, when you load, don't transpose back: just split on axis=0, and then squeeze the leftmost dimension out of each resulting page:
pages = [ page.squeeze(0) for page in numpy.split(yy, yy.shape[0], axis=0) ]
That way you'll end up with 25 views into the original array, each of which is a contiguous memory block. By contrast, if you do it the original [rows x columns x pages] way, then regardless of whether you do split-and-squeeze or your original slice-and-squeeze loop, you get 25 non-contiguous views into the original memory, and that fact will catch up with you sooner or later—if not when you or Shady convert between numeric formats, then at latest when Shady uses numpy's .tostring method to serialize the data for transfer.
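On the save side, the change described above might look like this (a sketch, assuming yy is still [rows x columns x pages] at that point):

yy = yy.transpose([2, 0, 1]).copy()   # reorder to [pages x rows x columns]; copy() makes it contiguous
np.save(fname, yy)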

Python: slow nested for loop

I need to find an optimal selection of media, based on certain constraints. I am doing it with FOUR nested for loops, and since it takes about O(n^4) iterations it is slow. I have been trying to make it faster, but it is still terribly slow. My variables can be as high as a couple of thousand.
Here is a small example of what I am trying to do:
max_disks = 5
max_ssds = 5
max_tapes = 1
max_BR = 1
allocations = []
for i in range(max_disks):
    for j in range(max_ssds):
        for k in range(max_tapes):
            for l in range(max_BR):
                # This is just an example. In the actual program I do processing here,
                # like checking bandwidth and cost constraints, and choosing the
                # allocation based on that.
                allocations.append((i, j, k, l))
It wasn't slow for up to hundreds of each media type but would slow down for thousands.
The other way I tried is:
max_disks = 5
max_ssds = 5
max_tapes = 1
max_BR = 1
allocations = [(i,j,k,l) for i in range(max_disks) for j in range(max_ssds) for k in range(max_tapes) for l in range(max_BR)]
This way it is slow even for such small numbers.
Two questions:
Why is the second one slow even for small numbers?
How can I make my program work for big numbers (in thousands)?
Here is the version with itertools.product
import itertools

max_disks = 500
max_ssds = 100
max_tapes = 100
max_BR = 100

for i, j, k, l in itertools.product(range(max_disks), range(max_ssds),
                                    range(max_tapes), range(max_BR)):
    pass
It takes 19.8 seconds to finish with these numbers.
From the comments, I got that you're working on a problem that can be rewritten as an ILP. You have several constraints, and need to find a (near) optimal solution.
Now, ILPs are quite difficult to solve, and brute-forcing them quickly becomes intractable (as you've already witnessed). This is why there are several really clever algorithms used in the industry that truly work magic.
For Python, there are quite a few interfaces that hook up to modern solvers; for more details, see e.g. this SO post. You could also consider using an optimizer like SciPy optimize, but those generally don't do integer programming.
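For a flavor of what that looks like, here is a tiny sketch using PuLP (one of those solver interfaces); the costs, bandwidths, and bounds are invented placeholders, not your real constraints:

import pulp

prob = pulp.LpProblem("media_allocation", pulp.LpMinimize)
disks = pulp.LpVariable("disks", lowBound=0, upBound=2000, cat="Integer")
ssds = pulp.LpVariable("ssds", lowBound=0, upBound=2000, cat="Integer")

prob += 10 * disks + 25 * ssds              # objective: total cost (made-up prices)
prob += 100 * disks + 500 * ssds >= 20000   # constraint: required bandwidth (made up)

prob.solve()
print(pulp.LpStatus[prob.status], disks.value(), ssds.value())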
Doing any operation in Python a trillion times is going to be slow. However, that's not all you're doing. By attempting to store all trillion items in a single list, you are keeping lots of data in memory and manipulating it in a way that creates a lot of work for the computer to swap memory in and out once it no longer fits in RAM.
The way Python lists work is that they allocate some amount of memory to store the items in the list. When you fill up the list and it needs more room, Python allocates a larger block and copies all the old entries into the new storage space. This is fine as long as it fits in memory: even though it has to copy the whole list each time it expands the storage, it has to do so less and less frequently as the over-allocation grows with the list. The problem comes when it runs out of memory and has to swap unused memory out to disk. The next time it tries to resize the list, it has to reload from disk all the entries that are now swapped out, then swap them all back out again to get space to write the new entries. This creates lots of slow disk operations that will get in the way of your task and slow it down even more.
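You can see this over-allocation directly (a small demo, nothing specific to your program):

import sys

lst, sizes_seen = [], set()
for i in range(1000):
    lst.append(i)
    sizes_seen.add(sys.getsizeof(lst))
print(sorted(sizes_seen))   # capacity grows in occasional jumps, not one slot per append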
Do you really need to store every item in a list? What are you going to do with them when you're done? You could perhaps write them out to disk as you go instead of accumulating them in a giant list, though if you have a trillion of them that's still a very large amount of data! Or perhaps you're filtering most of them out? That will help.
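If you do need to keep the surviving results, streaming them to disk as you go could look roughly like this (meets_constraints is a hypothetical stand-in for your bandwidth/cost checks):

import csv
import itertools

with open('allocations.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for i, j, k, l in itertools.product(range(max_disks), range(max_ssds),
                                        range(max_tapes), range(max_BR)):
        if meets_constraints(i, j, k, l):   # hypothetical filter
            writer.writerow((i, j, k, l))   # write and forget, instead of keeping a giant list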
All that said, without seeing the actual program itself, it's hard to know if you have a hope of completing this work by an exhaustive search. Can all the variables be on the thousands scale at once? Do you really need to consider every combination of these variables? When max_disks==2000, do you really need to distinguish the results for i=1731 from i=1732? For example, perhaps you could consider values of i 1,2,3,4,5,10,20,30,40,50,100,200,300,500,1000,2000? Or perhaps there's a mathematical solution instead? Are you just counting items?

Initializing a big matrix with numpy taking way too long

I am working in Python and I have encountered a problem: I have to initialize a huge array (a 21 x 2000 x 4000 matrix) so that I can copy a submatrix onto it.
The problem is that I want it to be really quick since it is for a real-time application, but when I run numpy.ones((21,2000,4000)), it takes about one minute to create this matrix.
When I run numpy.zeros((21,2000,4000)), it is instantaneous, but as soon as I copy the submatrix, it takes one minute, while in the first case the copying part was instantaneous.
Is there a faster way to initialize a huge array?
I guess there's not a faster way. The matrix you're building is quite large (8 byte float64 x 21 x 2000 x 4000 = 1.25 GB), and might be using up a large fraction of the physical memory on your system; thus, the one minute that you're waiting might be because the operating system has to page other stuff out to make room. You could check this by watching top or similar (e.g., System Monitor) while you're doing your allocation and watching memory usage and paging.
numpy.zeros seems to be instantaneous when you call it, because memory is allocated lazily by the OS. However, as soon as you try to use it, the OS actually has to fit that data somewhere. See Why the performance difference between numpy.zeros and numpy.zeros_like?
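You can see the lazy allocation directly with a rough timing like this (the exact numbers will vary by system):

import time
import numpy as np

t0 = time.time()
a = np.zeros((21, 2000, 4000))   # looks instantaneous: the pages are not committed yet
print('zeros:', time.time() - t0)

t0 = time.time()
a[:] = 1.0                       # first real write: the OS now has to back ~1.25 GB
print('first touch:', time.time() - t0)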
Can you restructure your code so that you only create the submatrices that you were intending to copy, without making the big matrix?
