Optimal HDF5 dataset chunk shape for reading rows - python

I have a reasonably sized (18 GB compressed) HDF5 dataset and am looking to optimize reading rows for speed. The shape is (639038, 10000). I will be reading a selection of rows (say ~1000 rows) many times, located across the dataset, so I can't use x:(x+1000) to slice rows.
Reading rows from out-of-memory HDF5 is already slow using h5py since I have to pass a sorted list and resort to fancy indexing. Is there a way to avoid fancy indexing, or is there a better chunk shape/size I can use?
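To make the access pattern concrete, here is roughly what a single read looks like right now (a simplified sketch; the file name and index choice are placeholders):

import numpy as np
import h5py

with h5py.File('data.h5', 'r') as f:
    dset = f['10000']
    # h5py fancy indexing requires a sorted, duplicate-free index list
    rows = np.sort(np.random.choice(dset.shape[0], size=1000, replace=False))
    batch = dset[rows, :]           # slow: fancy indexing across scattered chunks
    summed = batch.sum(axis=0)      # the length-10000 result I actually need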
I have read rules of thumb such as 1 MB-10 MB chunk sizes and choosing a chunk shape consistent with what I'm reading. However, building a large number of HDF5 files with different chunk shapes to test them is computationally expensive and very slow.
For each selection of ~1,000 rows, I immediately sum them to get an array of length 10,000. My current dataset looks like this:
'10000': {'chunks': (64, 1000),
'compression': 'lzf',
'compression_opts': None,
'dtype': dtype('float32'),
'fillvalue': 0.0,
'maxshape': (None, 10000),
'shape': (639038, 10000),
'shuffle': False,
'size': 2095412704}
What I have tried already:
Rewriting the dataset with chunk shape (128, 10000), which I calculate to be ~5MB, is prohibitively slow.
I looked at dask.array to optimise, but since ~1,000 rows fit easily in memory I saw no benefit.

Finding the right chunk cache size
First, I want to discuss some general things.
It is very important to know that each individual chunk can only be read or written as a whole. The default chunk-cache size of h5py, which can avoid excessive disk I/O, is only 1 MB and should in many cases be increased, as discussed later on.
As an example:
We have a dataset with shape (639038, 10000), float32 (25.5 GB uncompressed).
We want to write our data column-wise (dset[:,i] = arr) and read it row-wise (arr = dset[i,:]).
We choose a completely wrong chunk shape for this type of work, i.e. (1, 10000).
In this case the reading speed won't be too bad (although the chunk size is a little small), because we read only the data we are using. But what happens when we write to that dataset? If we write a column, one floating-point number of each chunk is written. This means we are actually writing the whole dataset (25.5 GB) with every iteration, and reading the whole dataset every other time. This is because if you modify a chunk, you have to read it first if it is not cached (I assume a chunk-cache size below 25.5 GB here).
So what can we improve here?
In such a case we have to make a compromise between write/read speed and the memory used by the chunk cache.
An assumption which gives both decent read and write speed:
We choose a chunk shape of (100, 1000).
If we want to iterate over the first dimension (writing columns), we need at least 639038*1000*4 bytes -> 2.55 GB of cache to avoid the additional I/O overhead described above; reading one full row touches 100*10000*4 bytes -> 4 MB of chunks.
So we should provide at least 2.6 GB of chunk cache in this example.
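As a quick sanity check of these numbers (a back-of-the-envelope sketch using the shapes above):

itemsize = 4                                    # float32
chunk_bytes = 100 * 1000 * itemsize             # one (100, 1000) chunk
chunks_per_column = -(-639038 // 100)           # ceil(639038/100) = 6391 chunks touched per column write
row_chunks = 10000 // 1000                      # 10 chunks touched per row read
print(chunk_bytes / 1e6, "MB per chunk")                              # 0.4 MB
print(chunks_per_column * chunk_bytes / 1e9, "GB to write a column")  # ~2.55 GB
print(row_chunks * chunk_bytes / 1e6, "MB to read a row")             # 4 MB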
Conclusion
There is no universally right chunk size or shape; it depends heavily on the task at hand. Never choose your chunk size or shape without thinking about the chunk cache. RAM is orders of magnitude faster than the fastest SSD for random reads and writes.
Regarding your problem
I would simply read the random rows; the improper chunk-cache size is your real problem.
Compare the performance of the following code with your version:
import h5py as h5
import time
import numpy as np

def ReadingAndWriting():
    File_Name_HDF5 = 'Test.h5'
    #shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # We are using 4 GB of chunk cache here ("rdcc_nbytes")
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=10**7)
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")

    # Writing columns
    t1 = time.time()
    for i in range(0, shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)
    f.close()
    print(time.time() - t1)

    # Reading random rows
    # If we read one row, 100 rows are actually read, but if we access a row
    # which is already in the cache we see a huge speed-up.
    f = h5.File(File_Name_HDF5, 'r', rdcc_nbytes=1024**2*4000, rdcc_nslots=10**7)
    d = f["Test"]
    for j in range(0, 639):
        t1 = time.time()
        # With more iterations it becomes more likely that we hit an already cached row
        inds = np.random.randint(0, high=shape[0]-1, size=1000)
        for i in range(0, inds.shape[0]):
            Array = np.copy(d[inds[i], :])
        print(time.time() - t1)
    f.close()

if __name__ == "__main__":
    ReadingAndWriting()
The simplest form of fancy slicing
I wrote in the comments that I couldn't see this behavior in recent versions. I was wrong. Compare the following:
def Writing():
    File_Name_HDF5 = 'Test.h5'
    #shape = (639038, 10000)
    shape = (639038, 1000)
    chunk_shape = (100, 1000)
    Array = np.array(np.random.rand(shape[0]), np.float32)

    # Writing_1: normal indexing
    ###########################################
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=10**7)
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")
    t1 = time.time()
    for i in range(shape[1]):
        d[:, i:i+1] = np.expand_dims(Array, 1)
    f.close()
    print(time.time() - t1)

    # Writing_2: simplest form of fancy indexing (same cache settings as above)
    ###########################################
    f = h5.File(File_Name_HDF5, 'w', rdcc_nbytes=1024**2*4000, rdcc_nslots=10**7)
    d = f.create_dataset('Test', shape, dtype=np.float32, chunks=chunk_shape, compression="lzf")
    # Writing columns
    t1 = time.time()
    for i in range(shape[1]):
        d[:, i] = Array
    f.close()
    print(time.time() - t1)
On my HDD this gives 34 seconds for the first version and 78 seconds for the second version.

Related

(Python/h5py) Memory-efficient processing of HDF5 dataset slices

I'm working in Python with a large dataset of images, each 3600x1800. There are ~360000 images in total. Each image is added to the stack one by one, since an initial processing step is run on each image. h5py has proven effective at building the stack, image by image, without filling the entire memory.
The analysis I am running is calculated on the grid cells - so on 1x1x360000 slices of the stack. Since the analysis of each slice depends on the max and min values within that slice, I think it is necessary to hold the 360000-long array in memory. I have a fair bit of RAM to work with (~100 GB) but not enough to hold the entire stack of 3600x1800x360000 in memory at once.
This means I need (or I think I need) a time-efficient way of accessing the 360000-long arrays. While h5py is efficient at adding each image to the stack, it seems that slicing perpendicular to the images is much, much slower (hours or more).
Am I missing an obvious method to slice the data perpendicular to the images?
Code below is a timing benchmark for 2 different slice directions:
file = "file/path/to/large/stack.h5"
t0 = time.time()
with h5py.File(file, 'r') as f:
dat = f['Merged_liqprec'][:,:,1]
print('Time = ' + str(time.time()- t0))
t1 = time.time()
with h5py.File(file, 'r') as f:
dat = f['Merged_liqprec'][500,500,:]
print('Time = ' + str(time.time()- t1))
Output:
## time to read an image slice, e.g. [:,:,1]:
Time = 0.0701
## time to read a slice thru the image stack, e.g. [500,500,:]:
Time = multiple hours, server went offline for maintenance while running
You have the right approach. Using numpy slicing notation to read the stack of interest reduces the memory footprint.
However, with this large dataset, I suspect I/O performance is going to depend on chunking: 1) did you define chunks= when you created the dataset, and 2) if so, what is the chunk shape? The chunk is the shape of the data block used when reading or writing. When an element in a chunk is accessed, the entire chunk is read from disk. As a result, you will not be able to optimize the shape for both writing images (3600x1800x1) and reading stacked slices (1x1x360000). The optimal shape for writing an image is (shape[0], shape[1], 1) and for reading a stacked slice it is (1, 1, shape[2]).
Tuning the chunk shape is not trivial. h5py docs recommend the chunk size should be between 10 KiB and 1 MiB, larger for larger datasets. You could start with chunks=True (h5py determines the best shape) and see if that helps.
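To answer 1) and 2) for the file you already have, you can simply inspect the dataset's chunks attribute (a minimal sketch; the path and dataset name are taken from your timing code):

import h5py

with h5py.File("file/path/to/large/stack.h5", 'r') as f:
    ds = f['Merged_liqprec']
    print(ds.shape, ds.dtype)
    print(ds.chunks)        # None means contiguous (unchunked) storage
    print(ds.compression)   # compression filter, if any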
Assuming you will create this file once, and read many times, I suggest optimizing for reading. However, creating the file may take a long time. I wrote a simple example that you can use to "tinker" with the chunk shape on a small file to observe the behavior before you work on the large file. The table below shows the effect of different chunk shapes on 2 different file sizes. The code follows the table.
Dataset shape=(36,18,360)

chunks        Writing (s)   Reading (s)
None          0.349         0.102
(36,18,1)     0.187         0.436
(1,1,360)     2.347         0.095

Dataset shape=(144,72,1440) (4x4x4 larger)

chunks               Writing (s)   Reading (s)
None                 59.963        1.248
(9, 9, 180) / True   11.334        1.588
(144, 72, 100)       79.844        2.637
(10, 10, 1440)       56.945        1.464
Code below:
import time
import h5py
import numpy as np

f = 4
a0, a1, n = f*36, f*18, f*360
c0, c1, cn = a0, a1, 100
#c0, c1, cn = 10, 10, n
arr = np.random.randint(256, size=(a0, a1))

print(f'Writing dataset shape=({a0},{a1},{n})')
start = time.time()
with h5py.File('SO_73791464.h5', 'w') as h5f:
    # Use chunks=True for the default shape, or chunks=(c0,c1,cn) for a user-defined shape
    ds = h5f.create_dataset('images', dtype=int, shape=(a0,a1,n), chunks=True)
    print(f'chunks={ds.chunks}')
    for i in range(n):
        ds[:,:,i] = arr
print(f"Time to create file:{(time.time()-start): .3f}")

start = time.time()
with h5py.File('SO_73791464.h5', 'r') as h5f:
    ds = h5f['images']
    for i in range(ds.shape[0]):
        for j in range(ds.shape[1]):
            dat = ds[i,j,:]
print(f"Time to read file:{(time.time()-start): .3f}")
HDF5 chunked storage improves I/O time, but can still be very slow when reading small slices (e.g., [1,1,360000]). To further improve performance, you need to read larger slices into an array as described by @Jérôme Richard in the comments below your question. You can then quickly access a single slice from the array (because it is in memory).
This answer combines the 2 techniques: 1) HDF5 chunked storage (from my first answer), and 2) reading large slices into an array and then reading a single [i,j] slice from that array. The code to create the file is very similar to the first answer. It is set up to create a dataset of shape [3600, 1800, 100] with the default chunk size ([113, 57, 7] on my system). You can increase n to test with larger datasets.
When reading the file, the large slice [0] and [1] dimensions are set equal to the associated chunk shape (so I only access each chunk once). As a result, the process to read the file is "slightly more complicated" (but worth it). There are 2 loops: the 1st loop reads a large slice into an array 'dat_chunk', and the 2nd loop reads a [1,1,:] slice from 'dat_chunk' into a second array 'dat'.
The difference in timing data for 100 images is dramatic. It takes 8 seconds to read all of the data using the method below. It required 74 min with the first answer (reading each [i,j] pair directly). Clearly that is too slow. Just for fun, I increased the dataset size to 1000 images (shape=[3600, 1800, 1000]) in my test file and reran. It takes 4:33 (m:ss) to read all slices with this method. I didn't even try with the previous method (for obvious reasons). Note: my computer is pretty old and slow with an HDD, so your timing data should be faster.
Code to create the file:
import time
import h5py
import numpy as np

a0, a1, n = 3600, 1800, 100
print(f'Writing dataset shape=({a0},{a1},{n})')
start = time.time()
with h5py.File('SO_73791464.h5', 'w') as h5f:
    ds = h5f.create_dataset('images', dtype=int, shape=(a0,a1,n), chunks=True)
    print(f'chunks={ds.chunks}')
    for i in range(n):
        arr = np.random.randint(256, size=(a0,a1))
        ds[:,:,i] = arr
print(f"Time to create file:{(time.time()-start): .3f}")
Code to read the file using large array slices:
start = time.time()
with h5py.File('SO_73791464.h5', 'r') as h5f:
    ds = h5f['images']
    print(f'shape={ds.shape}')
    print(f'chunks={ds.chunks}')
    ds_i_max, ds_j_max, ds_k_max = ds.shape
    ch_i_max, ch_j_max, ch_k_max = ds.chunks
    i = 0
    while i < ds_i_max:
        i_stop = min(i + ch_i_max, ds_i_max)
        print(f'i_range: {i}:{i_stop}')
        j = 0
        while j < ds_j_max:
            j_stop = min(j + ch_j_max, ds_j_max)
            print(f' j_range: {j}:{j_stop}')
            dat_chunk = ds[i:i_stop, j:j_stop, :]
            # print(dat_chunk.shape)
            for ic in range(dat_chunk.shape[0]):
                for jc in range(dat_chunk.shape[1]):
                    dat = dat_chunk[ic, jc, :]
            j = j_stop
        i = i_stop
print(f"Time to read file:{(time.time()-start): .3f}")

How to create a large array efficiently in Python?

Currently I have some tracks that simulate people walking around in an area of 1280x720 pixels over 12 hours. The recordings store the x, y coordinates and the time frame (in seconds) of each specific recording.
I want to create a movie sequence that shows how the people are walking over the 12 hours. To do this I split the data into 43200 different frames, so every frame corresponds to one second. My end goal is to use these data in a machine learning algorithm.
The idea is then simple. Initialize the frames, loop through all the x, y coordinates and add them to the array frames at their respective time frames:
>>> frames = np.zeros((43200, 1280, 720, 1))
>>> for track in tracks:
...     for x, y, time in track:
...         frames[int(time), y, x] = 255  # to visualize the walking
This will in theory create 43200 frames that can be saved as an mp4, gif or some other format and played back. However, the problem occurs when I try to initialize the numpy array:
>>> np.zeros((43200,720,1280,1))
MemoryError: Unable to allocate 297. GiB for an array with shape (43200, 1280, 720, 1) and data type float64
This makes sense, because I'm trying to allocate:
>>> (43200 * 1280 * 720 * 8) / 1024**3
296.630859375
I then thought about saving each frame to an npy file, but each file would be 7.4 MB, which sums up to 320 GB.
I also thought about splitting the frames up into five different arrays:
>>> a = np.zeros((8640, 720, 1280, 1))
>>> b = np.zeros((8640, 720, 1280, 1))
>>> c = np.zeros((8640, 720, 1280, 1))
>>> d = np.zeros((8640, 720, 1280, 1))
>>> e = np.zeros((8640, 720, 1280, 1))
But I think that seems cumbersome and it does not feel like the best solution. It will most likely slow the training of my machine learning algorithm. Is there a smarter way to do this?
I would just build the video a few frames at a time, then join the frames together using ffmpeg. There should be no need to store the whole video in memory at once based on the description of the use case.
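A minimal sketch of that idea, assuming imageio with its ffmpeg plugin is installed (the output filename and fps are arbitrary; tracks is the structure from your question):

import collections
import numpy as np
import imageio

# Bucket the recorded points by second so each frame is built from a small list.
points_by_second = collections.defaultdict(list)
for track in tracks:
    for x, y, time in track:
        points_by_second[int(time)].append((y, x))

with imageio.get_writer('walking.mp4', fps=25) as writer:
    for t in range(43200):
        frame = np.zeros((720, 1280), dtype=np.uint8)   # one frame at a time, <1 MB
        for y, x in points_by_second.get(t, []):
            frame[y, x] = 255
        writer.append_data(frame)                       # encoded and discarded, never all in RAM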
I think you will have to split your data into different, small arrays, and that probably won't be an issue for machine learning purposes.
However, I don't know if you will be able to create these five numpy arrays, as together they will still take 297 GB of RAM.
I would probably:
save the numpy arrays as PNGs using for instance matplotlib.pyplot.imsave, or
store them as short videos, as a person won't be seen for longer than that on your video anyway, or
reduce the fps or the resolution if you really want the whole video in one variable (see the memory arithmetic after these lists)
Let me also add that:
the nested for loops in your snippet are very expensive in Python; converting each track to integer index arrays and assigning in one vectorized call (frames[t_idx, y_idx, x_idx] = 255) would be much faster than setting elements one by one
if you were to create an array by setting all of its elements one by one, it would be more efficient to initialize it with np.empty(shape), as that spares you the time needed to set all the elements to zero only to overwrite them in your for loop
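To make those trade-offs concrete, some quick memory arithmetic (a sketch; the reduced numbers in the last line are only examples):

frames, h, w = 43200, 720, 1280
print(frames * h * w * 8 / 1024**3)    # float64 stack: ~296.6 GiB
print(frames * h * w * 1 / 1024**3)    # uint8 (enough to store 0/255): ~37.1 GiB
# Half the resolution and one frame every 5 seconds:
print((frames // 5) * (h // 2) * (w // 2) / 1024**3)   # ~1.9 GiB as uint8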

Running out of memory when building features (converting images into derived features [numpy arrays])?

I copied some image data to an instance on Google Cloud (8 vCPUs, 64 GB memory, Tesla K80 GPU) and am running into memory problems when converting the raw data into features and changing the data structure of the output. Eventually I'd like to use the derived features in a Keras/TensorFlow neural net.
Process
After copying the data to a storage bucket, I run a build_features.py function to convert the raw data into processed data for the neural network. In this pipeline, I first take each raw image and put it into a list x (which stores the derived features).
Since I'm working with a large number of images (tens of thousands of images of type float32 with dimensions 250x500x3), the list x becomes quite large. Each element of x is a numpy array that stores the image in shape 250x500x3.
Problem 1 - reduced memory as list x grows
I took 2 screenshots that show available memory decreasing as x grows (below). I'm eventually able to complete this step but I'm only left with a few GB of memory so I definitely want to fix this (in the future I want to work with larger data sets). How can I build features in a way where I'm not limited by the size of x?
Problem 2 - Memory error when converting x into numpy array
The step where the instance actually fails is the following:
x = np.array(x)
The failure message is:
Traceback (most recent call last):
File "build_features.py", line 149, in <module>
build_features(pipeline='9_11_2017_fan_3_lights')
File "build_features.py", line 122, in build_features
x = np.array(x)
MemoryError
How can I adjust this step so that I don't run out of memory?
Your code has two copies of every image - one in the list, and one in the array:
images = []
for i in range(many):
    images.append(load_img(i))  # here's the first copy of each image
x = np.array(images)            # joins them all together into a second copy
Just load the images straight into the array:
x = np.zeros((many, 250, 500, 3))
for i in range(many):
    x[i] = load_img(i)
Which means that you only hold a copy of one image at a time.
If you don't know the size or dtype of the image ahead of time, or don't want to hard code it, you can use:
x0 = load_img(0)
x = np.zeros((many,) + x0.shape, x0.dtype)
x[0] = x0                      # don't forget the image we already loaded
for i in range(1, many):
    x[i] = load_img(i)
Having said that, you're on a tricky path here. If you don't have enough room to store your dataset twice in memory, you also don't have room to compute y = x + 1.
You might want to consider using np.float16 to buy more storage, at the cost of precision.
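As a rough illustration of that float16 trade-off (the 250x500x3 shape is from your question; the image count is made up):

import numpy as np

many = 20000                                   # hypothetical image count
per_image_32 = np.zeros((250, 500, 3), dtype=np.float32).nbytes / 1024**2   # ~1.43 MiB
per_image_16 = np.zeros((250, 500, 3), dtype=np.float16).nbytes / 1024**2   # ~0.72 MiB
print(f"float32: {many * per_image_32 / 1024:.1f} GiB, "
      f"float16: {many * per_image_16 / 1024:.1f} GiB")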

Python (numpy) crashes system with large number of array elements

I'm trying to build a basic character recognition model using the many classifiers that scikit provides. The dataset being used is a standard handwritten set of alphanumeric samples (Chars74K image dataset taken from this source: EnglishHnd.tgz).
There are 55 samples of each character (62 alphanumeric characters in all), each 900x1200 pixels. I'm flattening each matrix (after first converting to grayscale) into a 1x1080000 array (each element representing a feature).
for sample in sample_images:  # sample_images is the list of the .png files
    img = imread(sample)
    img_gray = rgb2gray(img)
    if n == 0 and m == 0:  # n and m are global variables
        n, m = np.shape(img_gray)
    img_gray = np.reshape(img_gray, n*m)
    img_gray = np.append(img_gray, sample_id)  # sample_id stores the label of the training sample
    if len(samples) == 0:  # samples is the final numpy ndarray
        samples = np.append(samples, img_gray)
        samples = np.reshape(samples, [1, n*m + 1])
    else:
        samples = np.append(samples, [img_gray], axis=0)
So the final data structure should have 55x62 rows, where each row holds about 1,080,000 elements (the flattened pixels plus the label). Only the final structure is being stored (the scope of the intermediate matrices is local).
The amount of data being stored to learn the model is pretty large (I guess), because the program isn't really progressing beyond a point, and it crashed my system to the extent that the BIOS had to be repaired!
Up to this point, the program is only gathering the data to send to the classifier; the classification hasn't even been introduced into the code yet.
Any suggestions as to what can be done to handle the data more efficiently?
Note: I'm using numpy to store the final structure of flattened matrices.
Also, the system has an 8Gb RAM.
This seems like a case of stack overflow. You have 3,682,800,000 array elements, if I understand your question. What is the element type? If it is one byte, that is about 3 gigabytes of data, easily enough to fill up your stack size (usually about 1 megabyte). Even with one bit per element, you are still at roughly 500 MB. Try using heap memory (up to 8 GB on your machine).
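For reference, a quick check of those totals, assuming float64 elements (which skimage's rgb2gray typically returns):

samples_total = 55 * 62              # 3410 training samples
pixels = 900 * 1200                  # 1,080,000 features per sample
elements = samples_total * pixels
print(elements)                      # 3,682,800,000
print(elements * 8 / 1024**3)        # ~27.4 GiB as float64 -- far more than 8 GB of RAM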
I was encouraged to post this as a solution, although the comments above are probably more enlightening.
The issue with the user's program is twofold. Really it's just overwhelming the stack.
Much more common, especially with image processing in things like computer graphics or computer vision, is to process the images one at a time. This could work well with sklearn, where you could update your model as you read in each image.
You could use this bit of code found in this Stack Overflow post:
import os

rootdir = '/path/to/my/pictures'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file[-3:] == 'png':  # or whatever your file type is / some check
            # do your training here
            img = imread(os.path.join(subdir, file))
            img_gray = rgb2gray(img)
            if n == 0 and m == 0:  # n and m are global variables
                n, m = np.shape(img_gray)
            img_gray = np.reshape(img_gray, n*m)
            # sample_id stores the label of the training sample
            img_gray = np.append(img_gray, sample_id)
            # samples is the final numpy ndarray
            if len(samples) == 0:
                samples = np.append(samples, img_gray)
                samples = np.reshape(samples, [1, n*m + 1])
            else:
                samples = np.append(samples, [img_gray], axis=0)
This is more of pseudocode, but the general flow should have the right idea. Let me know if there's anything else I can do! Also check out OpenCV if you're interested in some cool deep learning algorithms. There's a bunch of cool stuff there, and images make for great sample data.
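If you want to combine that loop with the incremental-training idea above, a minimal sketch using scikit-learn's SGDClassifier.partial_fit could look like this (the classifier choice and the label_from_filename helper are illustrative placeholders, not from your code):

import os
import numpy as np
from sklearn.linear_model import SGDClassifier
from skimage.io import imread
from skimage.color import rgb2gray

clf = SGDClassifier()
all_classes = np.arange(62)             # assumes the 62 labels are encoded as 0..61

rootdir = '/path/to/my/pictures'
for subdir, dirs, files in os.walk(rootdir):
    for fname in files:
        if fname.endswith('.png'):
            img_gray = rgb2gray(imread(os.path.join(subdir, fname)))
            x = img_gray.reshape(1, -1)                  # one sample, 1,080,000 features
            y = np.array([label_from_filename(fname)])   # hypothetical label helper
            clf.partial_fit(x, y, classes=all_classes)   # classes are required on the first call
# Only one flattened image is held in memory at a time.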
Hope this helps.

Improve speed of I/O operations ? For HDF5 data creation for caffe

The basic objective of my program is to read images and make an HDF5 format file.
I'm splitting the HDF5 data files into parts of 1000 images for manageability.
The program reads and resizes the images and then writes them to file.
I don't think that using multi-threading would improve the speed of this, but I might be wrong.
My dataset is around 15 million images.
I use a powerful PC with a 4 GB GPU, 32 GB RAM and an Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz.
P.S. I could try using some other image transformation package like OpenCV, but I have no basis for comparison.
As of now the program has been running for 3 days non-stop and is almost 80% done.
I would like to avoid this problem in the future when I do something similar.
ipfldr= "/path/to/img/fldr"
os.chdir(ipfldr)
SIZE = 58 # fixed size to all images
nof = 16
with open( '/path/to/txtfile', 'r' ) as T :
lines = T.readlines()
# If you do not have enough memory split data into
# multiple batches and generate multiple separate h5 files
print len(lines)
X = np.zeros( (1000,nof*3, SIZE, SIZE), dtype=np.int )
y = np.zeros( (1000,1), dtype=np.int )
for i,l in enumerate(lines):
sp = l.split(' ')#split files into 17 cats
cla= int(sp[0].split("/")[0])
for fr in range(0,nof,1):
img = caffe.io.load_image( sp[fr] )
img = caffe.io.resize( img, (3,SIZE, SIZE) ) # resize to fixed size
# you may apply other input transformations here...
X[i%1000,fr:fr+3] = img
y[i%1000] = cla
if i%1000==0
with h5py.File('val/'+'val'+str(int(i/1000))+'.h5','w') as H:
H.create_dataset( 'data', data=X ) # note the name X given to the dataset!
H.create_dataset( 'label', data=y ) # note the name y given to the dataset!
with open('val_h5_list.txt','w') as L:
L.write( 'val'+str(int(i/1000))+'.h5' ) # list all h5 files you are going to use
if (len(lines)-i >= 1000):
X = np.zeros( (1000,nof*3, SIZE, SIZE), dtype=np.int )
y = np.zeros( (1000,1), dtype=np.int )
else:
break
I am quite sure you can improve your performance with a multi-threading approach: you haven't been spending 3 days just loading data from disk (you would need an unrealistic amount of disk space for that), so you seem to be waiting for the resize step on the CPU.
You could, for example, have: one reader that reads data in large chunks and puts single images in a queue; some workers that take an image from the queue, resize it, and put it in another queue; and one writer that takes the resized images out of the second queue and writes them to disk once it has gathered many (the reader and writer can probably be the same process without efficiency loss, assuming you read/write to the same disk anyway).
My guess is that 1 worker per HW thread (16 in your case), minus 2 for the cores you put the reader and writer on (so 14), should be a good starting point.
This way you isolate I/O waits from the CPU work, and you minimize the I/O overhead by doing a large chunk of work each time you initiate a read/write.
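A rough sketch of that pipeline using multiprocessing queues (load_and_resize and write_batch are placeholders for the caffe.io / h5py code above; the queue sizes and worker count are just starting points):

import multiprocessing as mp

def reader(image_list, in_q, n_workers):
    for item in image_list:          # (path, label) tuples parsed from your txt file
        in_q.put(item)
    for _ in range(n_workers):       # one poison pill per worker
        in_q.put(None)

def worker(in_q, out_q):
    while True:
        item = in_q.get()
        if item is None:             # poison pill: shut down
            break
        path, label = item
        out_q.put((load_and_resize(path), label))   # placeholder for the load + resize step

def main(image_list, n_workers=14):
    in_q = mp.Queue(maxsize=2000)
    out_q = mp.Queue(maxsize=2000)
    procs = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(n_workers)]
    procs.append(mp.Process(target=reader, args=(image_list, in_q, n_workers)))
    for p in procs:
        p.start()
    batch = []
    for _ in range(len(image_list)):             # the writer runs in the main process
        batch.append(out_q.get())
        if len(batch) == 1000:
            write_batch(batch)                   # placeholder: one .h5 file per 1000 images
            batch = []
    if batch:
        write_batch(batch)
    for p in procs:
        p.join()

if __name__ == '__main__':                       # guard required on platforms that spawn processes
    main(image_list)                             # image_list built from your txt file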
