Creating an HDF5 datacube in Python with multiple FITS files

I am currently struggling with a problem. I have 1500 FITS files, each containing a 3800 x 3800 array. My objective is to create a single HDF5 datacube from them. Unfortunately, I cannot provide the FITS files (due to storage problems). So far, I have been able to create an empty HDF5 array with the required shape (3800, 3800, 1500) by doing:
import h5py
import numpy as np
from astropy.io import fits
import glob
outname = "test.hdf5"
NAXIS1 = 3800
NAXIS2 = 3800
NAXIS3 = 1500
f = h5py.File(outname,'w')
dataset = f.create_dataset("DataCube", (NAXIS1, NAXIS2, NAXIS3))
f.close()
but I am having trouble trying to write the arrays from the fits files, because each element from the following for loop takes 30 mins at least:
f = h5py.File(outname,'r+')
# This is the actual line, but I will replace it by random noise
# in order to make the example reproducible.
# fitslist = glob.glob("*fits") # They are 1500 fits files
for i in range(NAXIS3):
    # Again, I replace the real data with noise.
    # hdul = fits.open(fitslist[i])
    # f['DataCube'][:,:,i] = hdul[0].data
    data = np.random.normal(0,1,(NAXIS1,NAXIS2))
    f['DataCube'][:,:,i] = data
f.close()
Is there a better way to construct a 3D datacube made of N slices that are already stored in N FITS files? I expected that once the HDF5 file had been created on disk, writing to it would be quite fast, but it wasn't.
Thank you very much for your help.
EDIT 1: I tested the modification proposed by astrofrog and it worked really well. Now the performance is quite good. In addition to this, I stored several fits files (~50) into one temporary numpy array in order to reduce the number of times I write into the hdf5 file. Now the code looks like this:
NAXIS1 = len(fitslist)
NAXIS2 = fits_0[ext].header['NAXIS1']
NAXIS3 = fits_0[ext].header['NAXIS2']
shape_array = (NAXIS2, NAXIS3)
f = h5py_cache.File(outname, 'w', chunk_cache_mem_size=3000*1024**2,
dataset = f.create_dataset("/x", (NAXIS1, NAXIS2, NAXIS3),
cache_size = 50
cache_array = np.empty(shape=(cache_size, NAXIS2, NAXIS3))
j = 0
for i in tqdm(range(len(fitslist))):
    hdul = fits.getdata(fitslist[i], ext)
    cache_array[j:j+1, :, :] = hdul
    if ((i % cache_size == 0) & (i != 0)):
        print("Writing to disc")
        f['/x'][i-cache_size+1:i+1, :, :] = cache_array
        j = 0
    if (i % 100 == 0):
        print("collecting garbage")
    j = j + 1
My question is: is there a more Pythonic way of doing this? I am not sure if this is the most efficient way of writing files with h5py, or if there is a better way to read from FITS to numpy and then to HDF5.

I think the issue might be that the order of the dimensions should be NAXIS3, NAXIS2, NAXIS1 (currently I think it is doing very inefficient striding over the array). I would also only add the array to the HDF5 file at the end:
import glob
import h5py
import numpy as np
from astropy.io import fits
fitslist = glob.glob("*.fits")
NAXIS1 = 3800
NAXIS2 = 3800
NAXIS3 = 1000
array = np.zeros((NAXIS3, NAXIS2, NAXIS1), dtype=np.float32)
for i in range(NAXIS3):
    hdul = fits.open(fitslist[i], memmap=False)
    array[i, :, :] = hdul[0].data
f = h5py.File('test.hdf5', 'w')
f.create_dataset("DataCube", data=array)
f.close()
If you need the array in the NAXIS1, NAXIS2, NAXIS3 order, just transpose it at the very end.
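To see why the slice index belongs on the first axis, here is a small NumPy-only sketch (the sizes are arbitrary stand-ins, not the real 3800 x 3800 images). It compares the memory layout of one image when the slice index comes first versus last, and shows that the final transpose only changes the view of the data.

```python
import numpy as np

# Small stand-in cube: 4 slices of 3 x 5 "images" (sizes are arbitrary).
n_slices, ny, nx = 4, 3, 5
cube = np.zeros((n_slices, ny, nx), dtype=np.float32)

# With the slice index first, one image occupies a single contiguous block.
first_axis_view = cube[0, :, :]
# With the slice index last, the same image is scattered through memory.
last_axis_view = np.zeros((ny, nx, n_slices), dtype=np.float32)[:, :, 0]

print(first_axis_view.flags["C_CONTIGUOUS"])  # True
print(last_axis_view.flags["C_CONTIGUOUS"])   # False
print(first_axis_view.strides, last_axis_view.strides)

# Transposing at the end reorders the axes without copying the data.
transposed = cube.transpose(2, 1, 0)
print(transposed.shape)
```

Filling `cube[i, :, :]` therefore writes one contiguous block per FITS file, whereas filling `[:, :, i]` touches memory with a large stride for every pixel.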


(Python/h5py) Memory-efficient processing of HDF5 dataset slices

I'm working in python with a large dataset of images, each 3600x1800. There are ~360000 images in total. Each image is added to the stack one by one since an initial processing step is run on each image. H5py has proven effective at building the stack, image by image, without it filling the entire memory.
The analysis I am running is calculated on the grid cells, so on 1x1x360000 slices of the stack. Since the analysis of each slice depends on the max and min values within that slice, I think it is necessary to hold the 360000-long array in memory. I have a fair bit of RAM to work with (~100GB) but not enough to hold the entire stack of 3600x1800x360000 in memory at once.
This means I need (or I think I need) a time-efficient way of accessing the 360000-long arrays. While h5py is efficient at adding each image to the stack, it seems that slicing perpendicular to the images is much, much slower (hours or more).
Am I missing an obvious method to slice the data perpendicular to the images?
Code below is a timing benchmark for 2 different slice directions:
import time
import h5py
file = "file/path/to/large/stack.h5"
t0 = time.time()
with h5py.File(file, 'r') as f:
    dat = f['Merged_liqprec'][:,:,1]
print('Time = ' + str(time.time()- t0))
t1 = time.time()
with h5py.File(file, 'r') as f:
    dat = f['Merged_liqprec'][500,500,:]
print('Time = ' + str(time.time()- t1))
## time to read a image slice, e.g. [:,:,1]:
Time = 0.0701
## time to read a slice thru the image stack, e.g. [500,500,:]:
Time = multiple hours, server went offline for maintenance while running
You have the right approach. Using numpy slicing notation to read the stack of interest reduces the memory footprint.
However, with this large dataset, I suspect I/O performance is going to depend on chunking: 1) Did you define chunks= when you created the dataset, and 2) if so, what is the chunk shape? The chunk is the shape of the data block used when reading or writing. When an element in a chunk is accessed, the entire chunk is read from disk. As a result, you will not be able to optimize the shape for both writing images (3600x1800x1) and reading stacked slices (1x1x360000). The optimal shape for writing an image is (shape[0], shape[1], 1) and for reading a stacked slice is (1, 1, shape[2]).
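As a rough illustration of that trade-off, the sketch below counts how many chunks a read has to touch for two chunk shapes. `chunks_touched` is a hypothetical helper written for this example (it is not part of h5py), and the dataset shape is a small stand-in.

```python
def chunks_touched(start, stop, chunk_shape):
    """Count the chunks overlapped by the half-open index box [start, stop)."""
    count = 1
    for lo, hi, c in zip(start, stop, chunk_shape):
        count *= (hi - 1) // c - lo // c + 1
    return count

# Hypothetical small dataset of shape (36, 18, 360) and two chunk shapes.
image_chunks = (36, 18, 1)    # one chunk per image
stack_chunks = (1, 1, 360)    # one chunk per [i, j, :] column

image = ((0, 0, 5), (36, 18, 6))       # index box of ds[:, :, 5]
column = ((10, 10, 0), (11, 11, 360))  # index box of ds[10, 10, :]

for name, chunks in (("image", image_chunks), ("stack", stack_chunks)):
    print(name, chunks_touched(*image, chunks), chunks_touched(*column, chunks))
# image 1 360
# stack 648 1
```

Each chunk shape makes one access pattern a single-chunk read and the other a many-chunk read, which is why an intermediate shape is usually a compromise.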
Tuning the chunk shape is not trivial. h5py docs recommend the chunk size should be between 10 KiB and 1 MiB, larger for larger datasets. You could start with chunks=True (h5py determines the best shape) and see if that helps.
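A quick way to sanity-check a candidate chunk shape against that guideline is to compute its uncompressed size. The helper below is a hypothetical convenience, and the 8-byte item size assumes a 64-bit integer dataset.

```python
from math import prod

KIB, MIB = 1024, 1024**2

def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size in bytes of one chunk with the given item size."""
    return prod(chunk_shape) * itemsize

# Candidate chunk shapes; itemsize=8 assumes 64-bit integers.
for chunks in [(9, 9, 180), (10, 10, 1440), (144, 72, 100)]:
    size = chunk_nbytes(chunks, itemsize=8)
    verdict = "within" if 10 * KIB <= size <= MIB else "outside"
    print(chunks, f"{size / KIB:.1f} KiB", verdict, "the 10 KiB to 1 MiB range")
```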
Assuming you will create this file once, and read many times, I suggest optimizing for reading. However, creating the file may take a long time. I wrote a simple example that you can use to "tinker" with the chunk shape on a small file to observe the behavior before you work on the large file. The table below shows the effect of different chunk shapes on 2 different file sizes. The code follows the table.
[Table: chunk shapes (9, 9, 180) / True, (144, 72, 100), and (10, 10, 1440), compared for Dataset shape=(36,18,360) and Dataset shape=(144,72,1440) (4x4x4 larger); the timing values are missing.]
Code below:
import time
import h5py
import numpy as np
f = 4
a0, a1, n = f*36, f*18, f*360
c0, c1, cn = a0, a1, 100
#c0, c1, cn = 10, 10, n
arr = np.random.randint(256,size=(a0,a1))
print(f'Writing dataset shape=({a0},{a1},{n})')
start = time.time()
with h5py.File('SO_73791464.h5','w') as h5f:
    # Use chunks=True for default size or use chunks=(c0,c1,cn) for user defined size
    ds = h5f.create_dataset('images',dtype=int,shape=(a0,a1,n),chunks=True)
    for i in range(n):
        ds[:,:,i] = arr
print(f"Time to create file:{(time.time()-start): .3f}")
start = time.time()
with h5py.File('SO_73791464.h5','r') as h5f:
    ds = h5f['images']
    for i in range(ds.shape[0]):
        for j in range(ds.shape[1]):
            dat = ds[i,j,:]
print(f"Time to read file:{(time.time()-start): .3f}")
HDF5 chunked storage improves I/O time, but can still be very slow when reading small slices (e.g., [1,1,360000]). To further improve performance, you need to read larger slices into an array as described by @Jérôme Richard in the comments below your question. You can then quickly access a single slice from the array (because it is in memory).
This answer combines the 2 techniques: 1) HDF5 chunked storage (from my first answer), and 2) reading large slices into an array and then reading a single [i,j] slice from that array. The code to create the file is very similar to the first answer. It is set up to create a dataset of shape [3600, 1800, 100] with default chunk size ([113, 57, 7] on my system). You can increase n to test with larger datasets.
When reading the file, the large slice [0] and [1] dimensions are set equal to the associated chunk shape (so I only access each chunk once). As a result, the process to read the file is "slightly more complicated" (but worth it). There are 2 loops: the 1st loop reads a large slice into an array 'dat_chunk', and the 2nd loop reads a [1,1,:] slice from 'dat_chunk' into a second array 'dat'.
The difference in timing for 100 images is dramatic. It takes 8 seconds to read all of the data using the method below. It required 74 min with the first answer (reading each [i,j] pair directly). Clearly that is too slow. Just for fun, I increased the dataset size to 1000 images (shape=[3600, 1800, 1000]) in my test file and reran. It takes 4:33 (m:ss) to read all slices with this method. I didn't even try with the previous method (for obvious reasons). Note: my computer is pretty old and slow with an HDD, so your timings should be faster.
Code to create the file:
import time
import h5py
import numpy as np
a0, a1, n = 3600, 1800, 100
print(f'Writing dataset shape=({a0},{a1},{n})')
start = time.time()
with h5py.File('SO_73791464.h5','w') as h5f:
    ds = h5f.create_dataset('images',dtype=int,shape=(a0,a1,n),chunks=True)
    for i in range(n):
        arr = np.random.randint(256,size=(a0,a1))
        ds[:,:,i] = arr
print(f"Time to create file:{(time.time()-start): .3f}")
Code to read the file using large array slices:
start = time.time()
with h5py.File('SO_73791464.h5','r') as h5f:
    ds = h5f['images']
    ds_i_max, ds_j_max, ds_k_max = ds.shape
    ch_i_max, ch_j_max, ch_k_max = ds.chunks
    i = 0
    while i < ds_i_max:
        i_stop = min(i+ch_i_max,ds_i_max)
        print(f'i_range: {i}:{i_stop}')
        j = 0
        while j < ds_j_max:
            j_stop = min(j+ch_j_max,ds_j_max)
            print(f' j_range: {j}:{j_stop}')
            dat_chunk = ds[i:i_stop,j:j_stop,:]
            # print(dat_chunk.shape)
            for ic in range(dat_chunk.shape[0]):
                for jc in range(dat_chunk.shape[1]):
                    dat = dat_chunk[ic,jc,:]
            j = j_stop
        i = i_stop
print(f"Time to read file:{(time.time()-start): .3f}")
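The two while loops can also be folded into a small generator. This is a sketch under assumptions: `iter_blocks` is a hypothetical helper, and a NumPy array stands in for the h5py dataset so the example runs without a file. With a real dataset, `ds.chunks[:2]` would be passed as the block size. The per-pixel quantity computed here (max minus min) is only a placeholder for the real analysis.

```python
import numpy as np

def iter_blocks(shape, block):
    """Yield (i_slice, j_slice) pairs tiling the first two axes in steps of block."""
    for i in range(0, shape[0], block[0]):
        for j in range(0, shape[1], block[1]):
            yield (slice(i, min(i + block[0], shape[0])),
                   slice(j, min(j + block[1], shape[1])))

# Stand-in for the dataset and its chunk shape along the first two axes.
data = np.arange(7 * 5 * 4).reshape(7, 5, 4)
block = (3, 2)

result = np.empty(data.shape[:2], dtype=data.dtype)
n_blocks = 0
for si, sj in iter_blocks(data.shape, block):
    dat_chunk = data[si, sj, :]  # one large read per block
    # Work on the in-memory block; every [ic, jc, :] column is available here.
    result[si, sj] = dat_chunk.max(axis=2) - dat_chunk.min(axis=2)
    n_blocks += 1

print(n_blocks, np.array_equal(result, data.max(axis=2) - data.min(axis=2)))
```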

Fast way to create an animation from binary data

I have a C++ program which writes pixel data for a 2D grid to a binary file. Each binary file may have numerous grid states back-to-back, and there may be multiple of these binary files.
e.g. I could have 10 binary files bin0, bin1, bin2... bin9 each holding data for 10 grid states for a total of 100 grid states to animate.
I'm looking for a fast way to create an animation from the grid states in these binary files.
My best attempt used python and PIL.Image to create a gif:
import glob
import numpy as np
from PIL import Image
from functools import partial
def create_images():
    paths = glob.glob('./outfiles/dump*')
    imgs = []
    for path in sorted(paths, key=lambda x: int(x.split("p")[1])):
        with open(path, 'rb') as ifile:
            # Read one rows*cols frame at a time until the file is exhausted
            for dat in iter(partial(ifile.read, rows*cols), b''):
                mem = memoryview(dat).cast('B', shape=[rows,cols])
                arr = np.asarray(mem)
                img = Image.fromarray(arr*255)
                imgs.append(img)
    return imgs
rows = 128
cols = 128
imgs = create_images()
imgs[0].save('./animation.gif', save_all=True, append_images=imgs[1:], loop=0)
but the final line where I actually write the gif can take a long time if I have potentially thousands of images, each with thousands of pixels. The rendering of the gif is also poor quality when the images are large.
Looking forward to suggestions for how to make this run fast using a different library from Pillow. Not bothered about sticking to Python if there is a better alternative using C/C++ (or other languages, but Python or C/C++ preferred), nor does the animation have to be a gif.
In my case I'm working with grid data which is either 0 or 1 (the context is Conway's Game of Life), so optimizations which take advantage of this would be welcome. Note, however, that currently each 1 or 0 occupies a whole byte in the binary file, so the values are not packed into bits.
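Since the frames are stored back-to-back with one byte per cell, a whole file can be viewed as a stack of frames in a single step, and the 0/1 values can be packed eight to a byte. A minimal NumPy sketch, with small made-up sizes and random bytes standing in for a real file:

```python
import numpy as np

rows, cols, n_frames = 4, 8, 3  # small stand-in sizes

# Stand-in for the bytes of one binary file: n_frames grids back-to-back,
# one byte (0 or 1) per cell.
raw = np.random.randint(2, size=rows * cols * n_frames, dtype=np.uint8).tobytes()

# View the whole file as a stack of frames in one step, without copying.
frames = np.frombuffer(raw, dtype=np.uint8).reshape(-1, rows, cols)

# Pack eight cells into each byte along the last axis. cols is a multiple
# of 8 here, so unpacking restores the frames exactly.
packed = np.packbits(frames, axis=-1)
restored = np.unpackbits(packed, axis=-1)

print(frames.shape, packed.shape)
```

Whether packing helps the animation step depends on the encoder used; the sketch only shows that the data itself shrinks by a factor of eight.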
Just adding a helper python script to generate binary files as I described above for anyone who gives it a go.
import numpy as np
rows = 128
cols = 128
img_per_file = 10
num_files = 3
filesize = rows*cols*img_per_file
for i in range(num_files):
    filename = 'bin'+str(i)
    data = np.random.randint(2, size=filesize, dtype=np.uint8).tobytes()
    with open(filename, 'wb') as f:
        f.write(data)

Efficiently using 1-D pyfftw on small slices of a 3-D numpy array

I have a 3D data cube of values of size on the order of 10,000x512x512. I want to parse a window of vectors (say 6) along dim[0] repeatedly and generate the fourier transforms efficiently. I think I'm doing an array copy into the pyfftw package and it's giving me massive overhead. I'm going over the documentation now since I think there is an option I need to set, but I could use some extra help on the syntax.
This code was originally written by another person with numpy.fft.rfft and accelerated with numba. But the implementation wasn't working on my workstation so I re-wrote everything and opted to go for pyfftw instead.
import numpy as np
import pyfftw as ftw
from tkinter import simpledialog
from math import ceil
import multiprocessing
ftw.config.NUM_THREADS = multiprocessing.cpu_count()
def runme():
    # normally I would load a file, but for Stack Overflow, I'm just going to generate a 3D data cube so I'll delete references to the binary saving/loading functions:
    # load the file
    dataChunk = np.random.random((1000,512,512))
    numFrames = dataChunk.shape[0]
    # select the window size
    windowSize = int(simpledialog.askstring('Window Size',
        'How many frames to demodulate a single time point?'))
    numChannels = windowSize//2+1
    # create fftw arrays
    ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
    ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
    fftObject = ftw.FFTW(ftwIn,ftwOut)
    # perform DFT on the data chunk
    demodFrames = dataChunk.shape[0]//windowSize
    channelChunks = np.zeros([numChannels,demodFrames,
        dataChunk.shape[1],dataChunk.shape[2]])
    channelChunks = getDFT(dataChunk,channelChunks,
        ftwIn,ftwOut,fftObject,windowSize,numChannels)
    return channelChunks
def getDFT(data,channelOut,ftwIn,ftwOut,fftObject,
           windowSize,numChannels):
    frameLen = data.shape[0]
    demodFrames = frameLen//windowSize
    for yy in range(data.shape[1]):
        for xx in range(data.shape[2]):
            index = 0
            for i in range(0,frameLen-windowSize+1,windowSize):
                ftwIn[:] = data[i:i+windowSize,yy,xx]
                fftObject()  # execute the planned transform
                channelOut[:,index,yy,xx] = 2*np.abs(ftwOut[:numChannels])/windowSize
                index += 1
    return channelOut
if __name__ == '__main__':
    runme()
What happens is I get a 4D array; the variable channelChunks. I am saving out each channel to a binary (not included in the code above, but the saving part works fine).
This process is for a demodulation project we have, the 4D data cube channelChunks is then parsed into eval(numChannel) 3D data cubes (movies) and from that we are able to separate a movie by color given our experimental set up. I was hoping I could circumvent writing a C++ function that calls the fft on the matrix via pyfftw.
Effectively, I am taking windowSize=6 elements along the 0 axis of dataChunk at a given index of 1 and 2 axis and performing a 1D FFT. I need to do this throughout the entire 3D volume of dataChunk to generate the demodulated movies. Thanks.
The FFTW advanced plans can be automatically built by pyfftw.
The code could be modified in the following way:
Real to complex transforms can be used instead of complex to complex transform.
With pyfftw, this is typically written as:
ftwIn = ftw.empty_aligned(windowSize, dtype='float64')
ftwOut = ftw.empty_aligned(windowSize//2+1, dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut)
Add a few flags to the FFTW planner. For instance, FFTW_MEASURE will time different algorithms and pick the best. FFTW_DESTROY_INPUT signals that the input array can be modified, which lets FFTW use some implementation tricks.
fftObject = ftw.FFTW(ftwIn,ftwOut, flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
Limit the number of divisions. A division costs more than a multiplication.
scale = 1.0/windowSize
for ...
    for ...
        2*np.abs(ftwOut[:,:,:])*scale #instead of /windowSize
Avoid multiple for loops by making use of FFTW advanced plan through pyfftw.
# create fftw arrays
ftwIn = ftw.empty_aligned((nbwindow,windowSize,dataChunk.shape[2]), dtype='float64')
ftwOut = ftw.empty_aligned((nbwindow,windowSize//2+1,dataChunk.shape[2]), dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut, axes=(1,), flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
for yy in range(data.shape[1]):
    ftwIn[:] = np.reshape(data[0:nbwindow*windowSize,yy,:],(nbwindow,windowSize,data.shape[2]),order='C')
    fftObject()
    channelOut[:,:,yy,:]=np.transpose(2*np.abs(ftwOut[:,:,:])*scale, (1,0,2))
Here is the modified code. I also decreased the number of frames to 100, set the seed of the random generator to check that the outcome is not modified, and commented out tkinter. The size of the window can be set to a power of two, or a number made by multiplying 2, 3, 5 or 7, so that the Cooley-Tukey algorithm can be efficiently applied. Avoid large prime numbers.
import numpy as np
import pyfftw as ftw
#from tkinter import simpledialog
from math import ceil
import multiprocessing
import time
ftw.config.NUM_THREADS = multiprocessing.cpu_count()
def runme():
    # normally I would load a file, but for Stack Overflow, I'm just going to generate a 3D data cube so I'll delete references to the binary saving/loading functions:
    # load the file
    np.random.seed(0)  # fixed seed so runs can be compared (seed value assumed)
    dataChunk = np.random.random((100,512,512))
    numFrames = dataChunk.shape[0]
    # select the window size
    #windowSize = int(simpledialog.askstring('Window Size',
    #    'How many frames to demodulate a single time point?'))
    windowSize = 6  # assumed: the example window size from the question
    numChannels = windowSize//2+1
    nbwindow = numFrames//windowSize
    # create fftw arrays
    ftwIn = ftw.empty_aligned((nbwindow,windowSize,dataChunk.shape[2]), dtype='float64')
    ftwOut = ftw.empty_aligned((nbwindow,windowSize//2+1,dataChunk.shape[2]), dtype='complex128')
    #ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
    #ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
    fftObject = ftw.FFTW(ftwIn,ftwOut, axes=(1,), flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
    # perform DFT on the data chunk
    demodFrames = dataChunk.shape[0]//windowSize
    channelChunks = np.zeros([numChannels,demodFrames,
        dataChunk.shape[1],dataChunk.shape[2]])
    channelChunks = getDFT(dataChunk,channelChunks,
        ftwIn,ftwOut,fftObject,windowSize,numChannels)
    return channelChunks
def getDFT(data,channelOut,ftwIn,ftwOut,fftObject,
           windowSize,numChannels):
    frameLen = data.shape[0]
    demodFrames = frameLen//windowSize
    nbwindow = demodFrames
    scale = 1.0/windowSize
    printed = 0
    for yy in range(data.shape[1]):
        #for xx in range(data.shape[2]):
        index = 0
        ftwIn[:] = np.reshape(data[0:nbwindow*windowSize,yy,:],(nbwindow,windowSize,data.shape[2]),order='C')
        fftObject()  # execute the planned transform
        channelOut[:,:,yy,:]=np.transpose(2*np.abs(ftwOut[:,:,:])*scale, (1,0,2))
        #for i in range(nbwindow):
        #channelOut[:,i,yy,xx] = 2*np.abs(ftwOut[i,:])*scale
        if printed==0:
            for j in range(channelOut.shape[0]):
                print(j,channelOut[j,0,yy,0])
            printed = 1
    return channelOut
if __name__ == '__main__':
    seconds = time.time()
    runme()
    print("time: ", time.time()-seconds)
Let us know how much it speeds up your computations! I went from 24s to less than 2s on my computer...
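As a cross-check of the reshape trick, the same per-window magnitudes can be computed with NumPy alone. This sketch swaps in `np.fft.rfft` for pyfftw, so it says nothing about pyfftw's speed; it only checks that the vectorised layout gives the same numbers as the original triple loop. The cube size and the window size of 6 are small stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
window_size = 6                # example window length from the question
data = rng.random((20, 4, 5))  # small stand-in for the (frames, y, x) cube

n_windows = data.shape[0] // window_size
n_channels = window_size // 2 + 1
scale = 1.0 / window_size

# Vectorised: split axis 0 into (window, sample) and transform every window at once.
windows = data[: n_windows * window_size].reshape(
    n_windows, window_size, *data.shape[1:])
spectrum = np.fft.rfft(windows, axis=1)  # (n_windows, n_channels, y, x)
channels = (2 * np.abs(spectrum) * scale).transpose(1, 0, 2, 3)

# Reference: one 1-D transform per pixel and window, as in the question.
reference = np.zeros((n_channels, n_windows) + data.shape[1:])
for yy in range(data.shape[1]):
    for xx in range(data.shape[2]):
        for w in range(n_windows):
            seg = data[w * window_size:(w + 1) * window_size, yy, xx]
            reference[:, w, yy, xx] = 2 * np.abs(np.fft.rfft(seg)) / window_size

print(channels.shape, np.allclose(channels, reference))
```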

IncrementalPCA & partial_fit - number of components

I work with Python and about 4000 images of watches (examples: watch_1, watch_2). The images are RGB and their resolution is 450x450. My aim is to find the most similar watches among them. For this reason I am using IncrementalPCA and partial_fit from scikit-learn to handle this large dataset with my 26GB of RAM (see also: SO_Link_1, SO_Link_2). My source code is the following:
import cv2
import numpy as np
import os
from glob import glob
from sklearn.decomposition import IncrementalPCA
from sklearn import neighbors
from sklearn import preprocessing
data = []
# Read images from file #
for filename in glob('Watches/*.jpg'):
    img = cv2.imread(filename)
    height, width = img.shape[:2]
    img = np.array(img)
    # Check that all my images are of the same resolution
    if height == 450 and width == 450:
        # Reshape each image so that it is stored in one line
        img = np.concatenate(img, axis=0)
        img = np.concatenate(img, axis=0)
        data.append(img)
# Normalise data #
data = np.array(data)
Norm = preprocessing.Normalizer()
data = Norm.transform(data)
# IncrementalPCA model #
ipca = IncrementalPCA(n_components=6)
length = len(data)
chunk_size = 4
pca_data = np.zeros(shape=(length, ipca.n_components))
for i in range(0, length // chunk_size):
    ipca.partial_fit(data[i*chunk_size : (i+1)*chunk_size])
    pca_data[i * chunk_size: (i + 1) * chunk_size] = ipca.transform(data[i*chunk_size : (i+1)*chunk_size])
# K-Nearest neighbours #
knn = neighbors.NearestNeighbors(n_neighbors=4, algorithm='ball_tree', metric='minkowski').fit(data)
distances, indices = knn.kneighbors(data)
However when I run this program for start with 40 images of watches I get the following error when i = 1:
ValueError: Number of input features has changed from 4 to 6 between calls to partial_fit! Try setting n_components to a fixed value.
However, it is obvious that I set n_components to 6 when coding ipca = IncrementalPCA(n_components=6) but for some reason ipca considers chunk_size = 4 as the number of components when i = 0 and then when i = 1 changes to 6.
Why is this happening?
How can I fix it?
This seems to follow from the math behind PCA, which is ill-conditioned for n_components > n_samples.
You might be interested in reading this (introduction of error-message) and some discussion behind it.
Try increasing the batch size / chunk size (or lowering n_components).
(In general I'm also somewhat sceptical about this approach. I hope you tested it on a small example dataset using batch PCA. It does not seem that your watches are preprocessed with regard to geometry: cropping; maybe hist-/color-normalization.)
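One way to act on the batch-size suggestion is to generate the batch boundaries so that no batch has fewer rows than n_components. The helper below is a hypothetical sketch in plain Python (not a scikit-learn function); each slice it yields would be used to index `data` before calling `partial_fit`.

```python
def batch_slices(n_samples, batch_size, n_components):
    """Yield slices covering range(n_samples), each with >= n_components rows."""
    if batch_size < n_components:
        raise ValueError("batch_size must be >= n_components")
    if n_samples < n_components:
        raise ValueError("n_samples must be >= n_components")
    start = 0
    while start < n_samples:
        stop = min(start + batch_size, n_samples)
        # Fold a too-small tail into the current batch instead of dropping it.
        if 0 < n_samples - stop < n_components:
            stop = n_samples
        yield slice(start, stop)
        start = stop

# 40 images as in the question, with the batch size raised from 4 to 6.
batches = list(batch_slices(n_samples=40, batch_size=6, n_components=6))
print([(s.start, s.stop) for s in batches])
# [(0, 6), (6, 12), (12, 18), (18, 24), (24, 30), (30, 40)]
```

Unlike `range(0, length // chunk_size)` in the question, this also covers the rows left over when the number of samples is not a multiple of the batch size.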

Python garbage collector for large data formatting program

I have written a program that reads through a folder of Excel files and loads each file. It then takes the data and creates an array of zeros of size (3001,2001), which is iterated through, and the entries at the coordinate values from the Excel file are changed to ones. The array is then reshaped to a size of (1,6005001). I am using tensorflow to reshape the array since the program considers it a tuple, but the final values are stored in a numpy array. Finally, I store the formatted array in a csv file named "filename_Array.csv" and the program moves on to the next Excel file to be formatted. I am running Python on Eclipse with tensorflow installed.
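For reference, the grid-filling and reshaping steps described above can be written with NumPy alone. This is only a sketch under assumptions: it swaps `np.reshape` in for `tf.reshape`, the `uint8` dtype is a guess, `make_flat_grid` is a hypothetical name, and the coordinates are made up.

```python
import numpy as np

max_g, max_h = 3000, 2000

def make_flat_grid(g, h):
    """Return a (1, (max_g+1)*(max_h+1)) array with ones at the (g, h) cells."""
    image = np.zeros((max_g + 1, max_h + 1), dtype=np.uint8)  # dtype assumed
    # Integer-array indexing sets all points at once, without a Python loop.
    image[np.asarray(g, dtype=int), np.asarray(h, dtype=int)] = 1
    return image.reshape(1, -1)

# Made-up coordinates standing in for the two columns of one input file.
flat = make_flat_grid([0, 10, 3000], [0, 5, 2000])
print(flat.shape, int(flat.sum()))
```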
The issue I am running into is that some values are being cached in memory, but I cannot figure out which ones. I have tried explicitly deleting large variables that will be reinitialized and calling gc.collect() to clean up the inactive memory. I still see a steady increase in memory usage until around 25 files have been formatted, at which point the computer begins freezing up because all of the RAM on my PC (12GB) is in use. I know that Python automatically clears memory for values that are completely unreachable by the program, so I am not sure if this is an issue with fragmentation of RAM or something else.
Sorry for the walls of text, I am just trying to give as much info to the problem as possible.
Here is a link to a screenshot of my performance tab while running the program through about 24 files before I had to terminate the program due to the computer freezing.
Here is my code:
from __future__ import print_function
import os
import tensorflow as tf
import numpy as np
import csv
import gc
path = r'C:\Users\jeremy.desforges\Desktop\Eclipse\NN_MNIST\VAM SLIJ-II 4.500'
def create_array(g,h,trainingdata,filename):
    # Multiplying by factors of 10 to keep precision of data
    g = g*1000
    h = h*1
    max_g = 3000
    max_h = 2000
    # Initializes an array with zeros to represent a blank graph
    image = np.zeros((max_g+1,max_h+1),
    shape = ((max_g+1)*(max_h+1))
    # Fills the blank graph with the input data points
    for i in range(len(h)):
        image[g[i].astype('int'),h[i].astype('int')] = 1
    image = tf.reshape(image,[-1,shape])
    # Converts tensor objects to numpy arrays to feed into network
    sess = tf.InteractiveSession()
    image =
    np.savetxt((filename + "_Array.csv"), np.flip(image,1).astype(int), fmt = '%i' ,delimiter=",")
    print(filename, "appended")
    print(image,"= output array")
    del image,shape,g,h,filename,sess
# Initializing variables
image = []
shape = 1
g = 1.0
h = 1.0
f = 1
specials = '.csv'
folder = os.listdir(path)
for filename in folder:
    trainingdata = open(filename, "r+")
    filename = str(filename.replace(specials, ''))
    data_read = csv.reader(trainingdata)
    for row in data_read:
        in1 = float(row[0])
        in2 = float(row[1])
        if (f==0):
            z_ = np.array([in1])
            g = np.hstack((g,z_))
            q = np.array([in2])
            h = np.hstack((h,q))
        if (f == 1):
            g = np.array([in1])
            h = np.array([in2])
            f = 0
    image = []
    shape = 1
    g = 1.0
    h = 1.0
    f = 1
