I have an HDF5 output file from NASTRAN that contains mode shape data. I am trying to read it into MATLAB and Python to check various post-processing techniques. The file in question is in the local directory for both of these tests. The file is semi-large at 1.2 GB, but certainly not that large compared to HDF5 files I have read previously. There are 17567342 rows and 8 columns in the table I want to access. The first and last columns are integers; the middle 6 are floating-point numbers.
Matlab:
file = 'HDF5.h5';
hinfo = hdf5info(file);
% ... Find the dataset I want to extract
t = hdf5read(file, '/NASTRAN/RESULT/NODAL/EIGENVECTOR');
This last operation is extremely slow (can be measured in hours).
Python:
import tables
hfile = tables.open_file("HDF5.h5")
modetable = hfile.root.NASTRAN.RESULT.NODAL.EIGENVECTOR
data = modetable.read()
This last operation is basically instant. I can then access data as if it were a numpy array. I am clearly missing something very basic about what these commands are doing. I'm thinking it might have something to do with data conversion but I'm not sure. If I do type(data) I get back numpy.ndarray and type(data[0]) returns numpy.void.
What is the correct (i.e. speedy) way to read the dataset I want into Matlab?
Matt, are you still working on this problem?
I am not a MATLAB guy, but I am familiar with the NASTRAN HDF5 file format. You are right; 1.2 GB is big, but not that big by today's standards.
You might be able to diagnose the MATLAB performance bottleneck by running tests with different numbers of rows in your EIGENVECTOR dataset. To do that (without running a lot of NASTRAN jobs), I created some simple code to create an HDF5 file with a user-defined number of rows. It mimics the structure of the NASTRAN Eigenvector result dataset. See below:
import tables as tb
import numpy as np

hfile = tb.open_file('SO_54300107.h5', 'w')

eigen_dtype = np.dtype([('ID', int), ('X', float), ('Y', float), ('Z', float),
                        ('RX', float), ('RY', float), ('RZ', float), ('DOMAIN_ID', int)])

fsize = 1000.0
isize = int(fsize)

recarr = np.recarray((isize,), dtype=eigen_dtype)
id_arr = np.arange(1, isize+1)
dom_arr = np.ones((isize,), dtype=int)
arr = np.array(np.arange(fsize))/fsize

recarr['ID'] = id_arr
recarr['X'] = arr
recarr['Y'] = arr
recarr['Z'] = arr
recarr['RX'] = arr
recarr['RY'] = arr
recarr['RZ'] = arr
recarr['DOMAIN_ID'] = dom_arr

modetable = hfile.create_table('/NASTRAN/RESULT/NODAL', 'EIGENVECTOR',
                               createparents=True, obj=recarr)
hfile.close()
Try running this with different values for fsize (the number of rows), then load the HDF5 file it creates into MATLAB. Maybe you can find the point where performance noticeably degrades.
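For a rough point of comparison, a quick Python-side timing of the generated file could look like the sketch below. It simply times a full read of the table created above; nothing here is specific to your model.
import time
import tables as tb

# Time a full read of the generated dataset with PyTables,
# so the MATLAB numbers have a baseline to compare against.
t0 = time.time()
with tb.open_file('SO_54300107.h5', 'r') as hfile:
    data = hfile.root.NASTRAN.RESULT.NODAL.EIGENVECTOR.read()
print('PyTables read %d rows in %.3f seconds' % (len(data), time.time() - t0))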
MATLAB provides another HDF5 reader called h5read. Using the same basic approach, the time taken to read the data was drastically reduced. In fact, hdf5read is listed for removal in a future release. Here is the same basic code with the preferred functions.
file = 'HDF5.h5';
hinfo = h5info(file);
% ... Find the dataset I want to extract
t = h5read(file, '/NASTRAN/RESULT/NODAL/EIGENVECTOR');
Following suggestions in this SO post, I also found that appending with PyTables is exceptionally time-efficient. However, in my case the output file (earray.h5) is huge. Is there a way to append the data such that the output file is not as large? For example, in my case (see link below) a 13 GB input file (dset_1: 2.1E8 x 4 and dset_2: 2.1E8 x 4) gives a 197 GB output file with just one column (2.5E10 x 1). All elements are float64.
I want to reduce the output file size such that the execution speed of the script is not compromised and reading the output file is also efficient for later use. Can saving the data along columns, and not just rows, help? Any suggestions on this? Given below is an MWE.
Output and input files' details here
import h5py
import tables

# no. of chunks from dset-1 and dset-2 in inp.h5
loop_1 = 40
loop_2 = 20

# save to disk after these many rows
app_len = 10**6

# **********************************************
# Grabbing input.h5 file
# **********************************************
filename = 'inp.h5'
f2 = h5py.File(filename, 'r')
chunks1 = f2['dset_1']
chunks2 = f2['dset_2']
shape1, shape2 = chunks1.shape[0], chunks2.shape[0]

f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

size1 = shape1//loop_1
size2 = shape2//loop_2

# ***************************************************
# Grabbing chunks to process and append data
# ***************************************************
for c in range(loop_1):
    h = c*size1
    # grab chunks from dset_1 of inp.h5
    chunk1 = chunks1[h:(h + size1)]
    for d in range(loop_2):
        g = d*size2
        chunk2 = chunks2[g:(g + size2)]  # grab chunks from dset_2 of inp.h5
        r1 = chunk1.shape[0]
        r2 = chunk2.shape[0]
        left, right = 0, 0
        for j in range(r1):  # grab col.2 values from dataset-1
            e1 = chunk1[j, 1]
            # ...Algebraic operations here to output a row containing 4 float64
            # ...append to a (earray) when no. of rows reach a million
        del chunk2
    del chunk1

f2.close()
f1.close()  # close the output file too
I wrote the answer you are referencing. That is a simple example that "only" writes 1.5e6 rows. I didn't do anything to optimize performance for very large files. You are creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some suggestions based on comments in another thread.
Areas I recommend (3 related to PyTables code, and 2 based on external utilities).
PyTables code suggestions:
Enable compression when you create the file (add the filters= parameter when you create the file). Start with tb.Filters(complevel=1).
Define the expectedrows= parameter in .create_earray() (per the PyTables docs, 'this will optimize the HDF5 B-Tree and amount of memory used'). The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it's only 10000 in my installation). I suggest you set this to a larger value if you are creating 10**6 (or more) rows.
There is a side benefit to setting expectedrows=. If you don't define chunkshape, 'a sensible value is calculated based on the expectedrows parameter'. Check the value used. This won't decrease the created file size, but will improve I/O performance (see the sketch below).
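For example, a minimal sketch of the EArray creation from your MWE with both suggestions applied might look like this; the expectedrows value is an assumed placeholder, so substitute your real row estimate.
import tables as tb

# Compression is enabled when the file is created; expectedrows gives
# PyTables a hint for the B-tree and the automatic chunkshape.
filters = tb.Filters(complevel=1, complib='blosc')
f1 = tb.open_file("table.h5", "w", filters=filters)
a = f1.create_earray(f1.root, "dataset_1", atom=tb.Float64Atom(), shape=(0, 4),
                     expectedrows=int(2.5e10))  # assumed row count; adjust to your case
print(a.chunkshape)  # check the chunkshape PyTables calculated from expectedrows
f1.close()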
If you didn't use compression when you created the file, there are 2 methods to compress existing files:
External Utilities:
The PyTables utility ptrepack - run it against an HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice versa). It is delivered with PyTables and runs on the command line.
The HDF5 utility h5repack - works similarly to ptrepack. It is delivered with the HDF5 installer from The HDF Group.
There are trade-offs with file compression: it reduces the file size, but increases access time (reduces I/O performance). I tend to use uncompressed files that I open frequently (for best I/O performance). Then, when done, I convert to compressed format for long-term archiving. You can continue to work with them in compressed format (the API handles this cleanly).
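If you would rather do the conversion from Python than from the command line, a hedged sketch using PyTables' copy_file helper accomplishes roughly the same thing as ptrepack; the file names here are placeholders.
import tables as tb

# Copy an uncompressed file into a compressed one; roughly what
# ptrepack does on the command line. File names are placeholders.
tb.copy_file('table.h5', 'table_compressed.h5', overwrite=True,
             filters=tb.Filters(complevel=1))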
After receiving the H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead. warning, I changed my code to:
import h5py
import numpy as np
f = h5py.File('myfile.hdf5', mode='r')
foo = f['foo']
bar = f['bar']
N, C, H, W = foo.shape  # (8192, 3, 1080, 1920)
data_foo = np.array(foo[()]) # [()] equivalent to .value
and when I tried to read a (not so) big file of images, I got a Killed: 9 in my terminal: the process was killed on the last line of the code because it was consuming too much memory, despite that archaic comment of mine there.
However, my original code:
f = h5py.File('myfile.hdf5', mode='r')
data_foo = f.get('foo').value
# script's logic after that worked, process not killed
worked just fine, apart from the issued warning.
Why did my code work?
Let me explain what your code is doing, and why you are getting memory errors. First some HDF5/h5py basics. (The h5py docs are an excellent starting point. Check here: h5py QuickStart)
foo = f['foo'] and foo = f.get('foo') both return an h5py dataset object named 'foo'. (Note: it's more common to see this as foo = f['foo'], but there's nothing wrong with the get() method.) A dataset object is not the same as a NumPy array. Datasets behave like NumPy arrays; both have a shape and a data type, and support array-style slicing. However, when you access a dataset object, you do not read all of the data into memory, so datasets require less memory to access. This is important when working with large datasets!
This statement returns a Numpy array: data_foo = f.get('foo').value. The preferred method is data_foo = f['foo'][:]. (NumPy slicing notation is the way to return a NumPy array from a dataset object. As you discovered, .value is deprecated.)
This also returns a Numpy array: data_foo = foo[()] (assuming foo is defined as above).
So, when you execute the statement data_foo = np.array(foo[()]), you are creating a new NumPy array from another array (foo[()] is the input object). I suspect your process was killed because the amount of memory needed to create a copy of an (8192, 3, 1080, 1920) array exceeded your system resources. That statement will work for small datasets/arrays, but it's not good practice.
Here's an example to show how to use the different methods (h5py dataset object vs NumPy array).
h5f = h5py.File('myfile.hdf5', mode='r')
# This returns a h5py object:
foo_ds = h5f['foo']
# You can slice to get elements like this:
foo_slice1 = foo_ds[0,:,:,:] # first row
foo_slice2 = foo_ds[-1,:,:,:] # last row
# This is the recommended method to get a Numpy array of the entire dataset:
foo_arr = h5f['foo'][:]
# or, referencing h5py dataset object above
foo_arr = foo_ds[:]
# you can also create an array with a slice
foo_slice1 = h5f['foo'][0,:,:,:]
# is the same as (from above):
foo_slice1 = foo_ds[0,:,:,:]
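If you eventually need to visit every image but cannot hold the whole (8192, 3, 1080, 1920) array at once, a minimal sketch of block-wise reading is shown below; the block size and the per-block computation are placeholders.
import h5py

with h5py.File('myfile.hdf5', mode='r') as h5f:
    foo_ds = h5f['foo']
    block = 256  # assumed block size; tune to your available RAM
    for start in range(0, foo_ds.shape[0], block):
        chunk = foo_ds[start:start + block]  # only this block is read into memory
        print(start, chunk.mean())           # placeholder per-block work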
I have a large binary file that needs to be converted into the HDF5 file format.
I am using Python 3.6. My idea is to read in the file, sort the relevant information, unpack it, and store it away. My information is stored such that an 8-byte time is followed by 2 bytes of energy and then 2 bytes of extra information, then again time, and so on. My current way of doing it is the following (my information is read as a bytearray, with the name byte_array):
for i in range(0, len(byte_array)+1, 12):
    if i == 0:
        timestamp_bytes = byte_array[i:i+8]
        energy_bytes = byte_array[i+8:i+10]
        extras_bytes = byte_array[i+10:i+12]
    else:
        timestamp_bytes += byte_array[i:i+8]
        energy_bytes += byte_array[i+8:i+10]
        extras_bytes += byte_array[i+10:i+12]

timestamp_array = np.ndarray((len(timestamp_bytes)//8,), '<Q', timestamp_bytes)
energy_array = np.ndarray((len(energy_bytes)//2,), '<h', energy_bytes)
extras_array = np.ndarray((len(timestamp_bytes)//8,), '<H', extras_bytes)
I assume there is a much faster way of doing this, maybe by avoiding looping over the whole thing. My files are up to 15 GB in size, so every bit of improvement would help a lot.
You should be able to just tell NumPy to interpret the data as a structured array and extract fields:
import numpy

as_structured = numpy.ndarray(shape=(len(byte_array)//12,),
                              dtype='<Q, <h, <H',
                              buffer=byte_array)

timestamps = as_structured['f0']
energies = as_structured['f1']
extras = as_structured['f2']
This will produce three arrays backed by the input bytearray. Creating these arrays should be effectively instant, but I can't guarantee that working with them will be fast - I think NumPy may need to do some implicit copying to handle alignment issues with these arrays. It's possible (I don't know) that explicitly copying them yourself with .copy() first might speed things up.
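If you want to test the copying idea, a small self-contained sketch (with a stand-in, zero-filled buffer in place of your real data) would be:
import numpy as np

byte_array = bytearray(120)  # stand-in buffer of 10 packed 12-byte records
as_structured = np.ndarray(shape=(len(byte_array)//12,),
                           dtype='<Q, <h, <H',
                           buffer=byte_array)

# .copy() materializes contiguous, aligned arrays detached from the buffer.
timestamps = as_structured['f0'].copy()
energies = as_structured['f1'].copy()
extras = as_structured['f2'].copy()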
You can use numpy.frombuffer with a custom datatype:
import struct
import random
import numpy as np
data = [
    (random.randint(0, 255**8), random.randint(0, 255*255), random.randint(0, 255*255))
    for _ in range(20)
]
Bytes = b''.join(struct.pack('<Q2H', *row) for row in data)

dtype = np.dtype([('time', np.uint64),
                  ('energy', np.uint16),  # you may need to change that to `np.int16`, if energy can be negative
                  ('extras', np.uint16)])
original = np.array(data, dtype=np.uint64)
result = np.frombuffer(Bytes, dtype)
print((result['time'] == original[:, 0]).all())
print((result['energy'] == original[:, 1]).all())
print((result['extras'] == original[:, 2]).all())
print(result)
Example output:
True
True
True
[(6048800706604665320, 52635, 291) (8427097887613035313, 15520, 4976)
(3250665110135380002, 44078, 63748) (17867295175506485743, 53323, 293)
(7840430102298790024, 38161, 27601) (15927595121394361471, 47152, 40296)
(8882783920163363834, 3480, 46666) (15102082728995819558, 25348, 3492)
(14964201209703818097, 60557, 4445) (11285466269736808083, 64496, 52086)
(6776526382025956941, 63096, 57267) (5265981349217761773, 19503, 32500)
(16839331389597634577, 49067, 46000) (16893396755393998689, 31922, 14228)
(15428810261434211689, 32003, 61458) (5502680334984414629, 59013, 42330)
(6325789410021178213, 25515, 49850) (6328332306678721373, 59019, 64106)
(3222979511295721944, 26445, 37703) (4490370317582410310, 52413, 25364)]
I'm not an expert on numpy, but here's my 5 cents:
You have lots of data, and probably it's more than your RAM.
This points to the simplest solution - don't try to fit all data in your program.
When you read a file into a variable, those X GB are read into RAM. If that's more than the available RAM, your OS starts swapping. Swapping slows you down, because on top of the disk reads from the source file you now also have disk writes to dump RAM contents into the swap file.
Instead, write the script so that it uses parts of the input file as necessary (in your case you read through the file sequentially anyway and don't go back or jump far ahead).
Try opening the input file as a memory-mapped data structure (note the differences in usage between Unix and Windows environments).
Then you can do a simple read([n]) of a few bytes at a time and append them to your arrays.
Behind the scenes, the data is read into RAM page by page as needed and will not exceed the available memory, which also leaves more room for your arrays to grow.
Also consider that your resultant arrays can also outgrow RAM, which will cause a similar slowdown to reading a big file.
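As a concrete illustration of the memory-mapping idea, here is a minimal sketch; the file name is a placeholder, the structured dtype mirrors the 8+2+2 byte layout described in the question, and it assumes the file length is an exact multiple of the 12-byte record size.
import mmap
import numpy as np

record = np.dtype([('time', '<u8'), ('energy', '<i2'), ('extras', '<u2')])

with open('data.bin', 'rb') as fh:  # placeholder file name
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    # The OS pages data in on demand; the 15 GB file is never copied wholesale into RAM.
    records = np.frombuffer(mm, dtype=record)
    timestamps = records['time']
    energies = records['energy']
    extras = records['extras']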
I have a 120 GB file saved (in binary via pickle) that contains about 50,000 (600x600) 2d numpy arrays. I need to stack all of these arrays using a median. The easiest way to do this would be to simply read in the whole file as a list of arrays and use np.median(arrays, axis=0). However, I don't have much RAM to work with, so this is not a good option.
So, I tried to stack them pixel-by-pixel, as in I focus on one pixel position (i, j) at a time, then read in each array one by one, appending the value at the given position to a list. Once all the values for a certain position across all arrays are saved, I use np.median and then just have to save that value in a list -- which in the end will have the medians of each pixel position. In the end I can just reshape this to 600x600, and I'll be done. The code for this is below.
import pickle
import time
import numpy as np

filename = 'images.dat'  # contains my 50,000 2D numpy arrays

def stack_by_pixel(i, j):
    pixels_at_position = []
    with open(filename, 'rb') as f:
        while True:
            try:
                # Gather pixels at a given position
                array = pickle.load(f)
                pixels_at_position.append(array[i][j])
            except EOFError:
                break
    # Stacking at position (median)
    stacked_at_position = np.median(np.array(pixels_at_position))
    return stacked_at_position

# Form whole stacked image
stacked = []
for i in range(600):
    for j in range(600):
        t1 = time.time()
        stacked.append(stack_by_pixel(i, j))
        t2 = time.time()
        print('Done with element %d, %d: %f seconds' % (i, j, (t2-t1)))

stacked_image = np.reshape(stacked, (600,600))
After seeing some of the time printouts, I realize that this is wildly inefficient. Each completion of a position (i, j) takes about 150 seconds or so, which is not surprising since it is reading about 50,000 arrays one by one. And given that there are 360,000 (i, j) positions in my large arrays, this is projected to take 22 months to finish! Obviously this isn't feasible. But I'm sort of at a loss, because there's not enough RAM available to read in the whole file. Or maybe I could save all the pixel positions at once (a separate list for each position) for the arrays as it opens them one by one, but wouldn't saving 360,000 lists (that are about 50,000 elements long) in Python use a lot of RAM as well?
Any suggestions are welcome for how I could make this run significantly faster without using a lot of RAM. Thanks!
This is a perfect use case for numpy's memory mapped arrays.
Memory mapped arrays allow you to treat a .npy file on disk as though it were loaded in memory as a numpy array, without actually loading it. It's as simple as
arr = np.load('filename', mmap_mode='r')
For the most part you can treat this as any other array. Array elements are only loaded into memory as required. Unfortunately, some quick experimentation suggests that median doesn't handle memory-mapped arrays well*; it still seems to load a substantial portion of the data into memory at once. So median(arr, 0) may not work.
However, you can still loop over each index and calculate the median without running into memory issues.
[[np.median([arr[k][i][j] for k in range(50000)]) for i in range(600)] for j in range(600)]
where 50,000 reflects the total number of arrays.
Without the overhead of unpickling every array just to extract a single pixel, the run time should be much quicker (by about 360,000 times).
Of course, that leaves the problem of creating a .npy file containing all of the data. A file can be created as follows,
arr = np.lib.format.open_memmap(
    'filename',              # File to store in
    mode='w+',               # Specify to create the file and write to it
    dtype=np.float32,        # Change this to your data's type
    shape=(50000, 600, 600)  # Shape of resulting array
)
Then, load the data as before and store it into the array (which just writes it to disk behind the scenes).
idx = 0
with open(filename, 'rb') as f:
    while True:
        try:
            arr[idx] = pickle.load(f)
            idx += 1
        except EOFError:
            break
Give it a couple hours to run, then head back to the start of this answer to see how to load it and take the median. Can't be any simpler**.
*I just tested it on a 7GB file, taking the median of 1,500 samples of 5,000,000 elements and memory usage was around 7GB, suggesting the entire array may have been loaded into memory. It doesn't hurt to try this way first though. If anyone else has experience with median on memmapped arrays feel free to comment.
** If you believe strangers on the internet.
Note: I use Python 2.x, porting this to 3.x shouldn't be difficult.
My idea is simple - disk space is plentiful, so let's do some preprocessing and turn that big pickle file into something that is easier to process in small chunks.
Preparation
In order to test this, I wrote a small script that generates a pickle file resembling yours. I assumed your input images are grayscale with 8-bit depth, and generated 10000 random images using numpy.random.randint.
This script will act as a benchmark that we can compare the preprocessing and processing stages against.
import numpy as np
import pickle
import time

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000

t1 = time.time()

with open('data/raw_data.pickle', 'wb') as f:
    for i in range(FILE_COUNT):
        data = np.random.randint(256, size=IMAGE_WIDTH*IMAGE_HEIGHT, dtype=np.uint8)
        data = data.reshape(IMAGE_HEIGHT, IMAGE_WIDTH)
        pickle.dump(data, f)
        print i,

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run this script completed in 372 seconds, generating a ~10 GB file.
Preprocessing
Let's split the input images on a row-by-row basis -- we will have 600 files, where file N contains row N from each input image. We can store the row data in binary using numpy.ndarray.tofile (and later load those files using numpy.fromfile).
import numpy as np
import pickle
import time

# Increase open file limit
# See https://stackoverflow.com/questions/6774724/why-python-has-limit-for-count-of-file-handles
import win32file
win32file._setmaxstdio(1024)

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600
FILE_COUNT = 10000

t1 = time.time()

outfiles = []
for i in range(IMAGE_HEIGHT):
    outfilename = 'data/row_%03d.dat' % i
    outfiles.append(open(outfilename, 'wb'))

with open('data/raw_data.pickle', 'rb') as f:
    for i in range(FILE_COUNT):
        data = pickle.load(f)
        for j in range(IMAGE_HEIGHT):
            data[j].tofile(outfiles[j])
        print i,

for i in range(IMAGE_HEIGHT):
    outfiles[i].close()

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run, this script completed in 134 seconds, generating 600 files of 6 million bytes each. It used ~30 MB of RAM.
Processing
Simple: just load each row file using numpy.fromfile, use numpy.median to get per-column medians (reducing it to a single row), and accumulate such rows in a list.
Finally, use numpy.vstack to reassemble a median image.
import numpy as np
import time

IMAGE_WIDTH = 600
IMAGE_HEIGHT = 600

t1 = time.time()

result_rows = []
for i in range(IMAGE_HEIGHT):
    outfilename = 'data/row_%03d.dat' % i
    data = np.fromfile(outfilename, dtype=np.uint8).reshape(-1, IMAGE_WIDTH)
    median_row = np.median(data, axis=0)
    result_rows.append(median_row)
    print i,

result = np.vstack(result_rows)
print result

t2 = time.time()
print '\nDone in %0.3f seconds' % (t2 - t1)
In a test run, this script completed in 74 seconds. You could even parallelize it quite easily, but it doesn't seem to be worth it. The script used ~40MB of RAM.
Given that both of those scripts are linear, the time used should scale linearly as well. For 50000 images, this is about 11 minutes for preprocessing and 6 minutes for the final processing. This is on an i7-4930K @ 3.4 GHz, using 32-bit Python on purpose.
I am working on a project in which a lot of data is being generated. I want a way to save my data as I go so I don't have to keep it all in RAM. I am currently using numpy to save everything in a npz file when the program finishes. The things that need to be saved are scalars, lists, and lists of lists. The lists have values added to them incrementally, so I need a way to append to each list without having to load everything into memory.
I am still a bit new to python so if there is a standard way of doing this please point me in that direction.
Thanks
PyTables is a NumPy-friendly package designed to page data to disk, so you can operate on datasets that don't fit in memory.
See: https://www.pytables.org/usersguide/tutorials.html
https://kastnerkyle.github.io/posts/using-pytables-for-larger-than-ram-data-processing/
Usage
# Create a data-frame description (called a table);
# each attribute of Particle below is a column.
from tables import *

class Particle(IsDescription):
    name      = StringCol(16)   # 16-character String
    idnumber  = Int64Col()      # Signed 64-bit integer
    ADCcount  = UInt16Col()     # Unsigned short integer
    TDCcount  = UInt8Col()      # unsigned byte
    grid_i    = Int32Col()      # 32-bit integer
    grid_j    = Int32Col()      # 32-bit integer
    pressure  = Float32Col()    # float (single-precision)
    energy    = Float64Col()    # double (double-precision)

# create an HDF5 file on disk to store data in
h5file = open_file("tutorial1.h5", mode="w", title="Test file")

# create a group, then a table within it using the Particle description class
group = h5file.create_group("/", 'detector', 'Detector information')
table = h5file.create_table(group, 'readout', Particle, "Readout example")
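To append data incrementally (the part that keeps RAM usage low), fill the table's row accessor and flush periodically. The sketch below follows the same PyTables tutorial; the values written are arbitrary examples.
# Append rows one at a time; only rows buffered since the last flush() sit in memory.
particle = table.row
for i in range(10):
    particle['name'] = 'Particle: %6d' % i
    particle['idnumber'] = i
    particle['ADCcount'] = (i * 256) % (1 << 16)
    particle['TDCcount'] = i % 256
    particle['grid_i'] = i
    particle['grid_j'] = 10 - i
    particle['pressure'] = float(i * i)
    particle['energy'] = float(particle['pressure'] ** 4)
    particle.append()
table.flush()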
Performance
It is especially useful for computations across many data rows.
PyTables supports Blosc (which is a neat trick)
You can perform "in-kernel" queries with the where() method.
result = [row['col2'] for row in table.where(
            '''(((col4 >= lim1) & (col4 < lim2)) |
               ((col2 > lim3) & (col2 < lim4)) &
               ((col1+3.1*col2+col3*col4) > lim5))''')]