HDF5 core driver (H5FD_CORE): loading selected dataset(s) - python

Currently, I load HDF5 data in python via h5py and read a dataset into memory.
f = h5py.File('myfile.h5', 'r')
dset = f['mydataset'][:]
This works, but if 'mydataset' is the only dataset in myfile.h5, then the following is more efficient:
f = h5py.File('myfile.h5', 'r', driver='core')
dset = f['mydataset'][:]
I believe this is because the 'core' driver memory maps the entire file, which is an optimised way of loading data into memory.
My question is: is it possible to use the 'core' driver on selected dataset(s)? In other words, on loading the file I only wish to memory map selected datasets and/or groups. I have a file with many datasets and I would like to load each one into memory sequentially. I cannot load them all at once, since in aggregate they won't fit in memory.
I understand one alternative is to split my single HDF5 file with many datasets into many HDF5 files with one dataset each. However, I am hoping there might be a more elegant solution, possibly using the h5py low-level API.
Update: Even if what I am asking is not possible, can someone explain why using driver='core' has substantially better performance when reading in a whole dataset? Is reading the only dataset of an HDF5 file into memory very different from memory mapping it via the core driver?
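For reference, loading the datasets one at a time without the core driver looks roughly like the sketch below; process is a hypothetical placeholder for whatever work is done per dataset.
import h5py

def process(arr):
    # placeholder for the real per-dataset work
    print(arr.shape, arr.dtype)

# Read each top-level dataset fully into memory, one at a time, so that only
# one array is resident at once.
with h5py.File('myfile.h5', 'r') as f:
    for name in f:                       # iterates over top-level members
        obj = f[name]
        if isinstance(obj, h5py.Dataset):
            data = obj[:]                # this dataset is read into memory here
            process(data)
            del data                     # drop the reference before the next one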

I guess it is the same problem as when you read the file by looping over an arbitrary axis without setting a proper chunk cache size.
If you read it with the core driver, it is guaranteed that the whole file is read sequentially from disk, and everything else (decompression, rearranging chunked data into a compact array, ...) is done entirely in RAM.
I used the simplest form of fancy slicing example from here https://stackoverflow.com/a/48405220/4045774 to write the data.
import h5py as h5
import time
import numpy as np
import h5py_cache as h5c

def Reading():
    File_Name_HDF5 = 'Test.h5'

    t1 = time.time()
    f = h5.File(File_Name_HDF5, 'r', driver='core')
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

    t1 = time.time()
    f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2 * 500)
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

    t1 = time.time()
    f = h5.File(File_Name_HDF5, 'r')
    dset = f['Test'][:]
    print(time.time() - t1)
    f.close()

if __name__ == "__main__":
    Reading()
On my machine this gives 2.38 s (core driver), 2.29 s (with a 500 MB chunk cache) and 4.29 s (with the default chunk cache of 1 MB).
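Side note: in newer h5py versions (2.9 and later, as far as I know) the chunk cache can be set directly on h5py.File via rdcc_nbytes, so the separate h5py_cache package is not needed. A quick sketch of the same 500 MB read:
import h5py as h5

# Same read as above, but with a 500 MB chunk cache requested directly from h5py.
f = h5.File('Test.h5', 'r', rdcc_nbytes=1024**2 * 500)
dset = f['Test'][:]
f.close()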

Related

Saving deque to file minding performance and portability

I have a while loop that collects data from a microphone (replaced here by np.random.random() to make it more reproducible). I do some operations; let's say I take the abs().mean() here, so my output is a one-dimensional array.
This loop is going to run for a LONG time (e.g., once a second for a week) and I am wondering about my options for saving this. My main concerns are saving the data with acceptable performance and keeping the result portable (e.g., .csv beats .npy).
The simple way: just append things to a .txt file. Could this be replaced by csv.gz maybe? Maybe using np.savetxt()? Would it be worth it?
The hdf5 way: this should be a nicer way, but reading the whole dataset back just to append to it doesn't seem like good practice or better performing than dumping into a text file. Is there another way to append to hdf5 files?
The npy way (code not shown): I could save this into a .npy file, but I would rather make it portable by using a format that can be read from any program.
from collections import deque

import numpy as np
import h5py

save_interval = save_interval_sec = 5        # example values; set to your real interval

amplitudes = deque(maxlen=save_interval_sec)

# Read from the microphone in a continuous stream
while True:
    data = np.random.random(100)
    amplitude = np.abs(data).mean()
    print(amplitude, end="\r")
    amplitudes.append(amplitude)

    # Option 1: save the amplitudes to a text file every n iterations
    if len(amplitudes) == save_interval:
        with open("amplitudes.txt", "a") as f:
            for amp in amplitudes:
                f.write(str(amp) + "\n")
        amplitudes.clear()

    # Option 2: save the amplitudes to an HDF5 file every n iterations
    if len(amplitudes) == save_interval:
        # Convert the deque to a Numpy array
        amplitudes_array = np.array(amplitudes)
        # Open an HDF5 file
        with h5py.File("amplitudes.h5", "a") as f:
            # Get the existing dataset or create a new one if it doesn't exist
            dset = f.get("amplitudes")
            if dset is None:
                dset = f.create_dataset("amplitudes", data=amplitudes_array, dtype=np.float32,
                                        maxshape=(None,), chunks=True, compression="gzip")
            else:
                # Get the current size of the dataset
                current_size = dset.shape[0]
                # Resize the dataset to make room for the new data
                dset.resize((current_size + save_interval,))
                # Write the new data to the dataset
                dset[current_size:] = amplitudes_array
        # Clear the deque
        amplitudes.clear()

    # For debug only
    if len(amplitudes) > 3:
        break
Update
I get that the answer might depend a bit on the sampling frequency (once a second might be too slow) and the data dimensions (a single column might be too little). I guess I asked because anything can work, but I always just dump to text, and I am not sure where the breaking points are that tip the decision toward one method or the other.
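On the csv.gz idea, a minimal sketch of appending gzip-compressed CSV rows (the file name and values are placeholders); each append writes a new gzip member, which Python's gzip module reads back as a single stream:
import csv
import gzip

def append_amplitudes(amplitudes, path="amplitudes.csv.gz"):
    # "at" = append in text mode; each call appends a new gzip member
    with gzip.open(path, "at", newline="") as f:
        writer = csv.writer(f)
        for amp in amplitudes:
            writer.writerow([amp])

# example usage with made-up values
append_amplitudes([0.12, 0.34, 0.56])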

How can I create a Numpy Array that is much bigger than my RAM from 1000s of CSV files?

I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy, because the two libraries are heavily integrated.
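For illustration, a small sketch of the sparse route (my own example values) when most entries are zero:
import numpy as np
from scipy import sparse

dense = np.zeros((10_000, 1_000))
dense[::100, ::50] = 1.0                 # only a handful of non-zero entries
sp = sparse.csr_matrix(dense)            # stores just the non-zero values and their indices
print(dense.nbytes, sp.data.nbytes)      # dense: 80 MB of float64, sparse values: a few KB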
Another answer (IMO the correct solution to your problem) is to use a memory-mapped array. This lets numpy load only the necessary parts into RAM when needed and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in python module that would handle this is struct. Appending more data is as simple as opening the file in append mode and writing more bytes of data. Make sure that any references to the memory-mapped array are re-created any time more data is written to the file, so the information stays fresh.
Finally, there is compression. Numpy can compress arrays with savez_compressed, which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped and must be loaded into memory entirely. Loading one column at a time may get you under the threshold, but the same idea could be applied to the other methods to reduce memory usage. Numpy's built-in compression techniques will only save disk space, not memory. There may exist other libraries that perform some sort of streaming compression, but that is beyond the scope of my answer.
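A quick sketch of that compression route (array and key names are placeholders); as noted, the loaded array ends up fully in memory:
import numpy as np

arr = np.random.rand(1_000, 100)
np.savez_compressed('data.npz', column=arr)   # writes a compressed .npz archive

npz = np.load('data.npz')
col = npz['column']                           # decompressed fully into RAM here
print(col.shape)
npz.close()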
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np

# open a file for data of a single column
with open('column_data.dat', 'wb') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())

# open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float64)

# read some data
print(np.mean(column_mmap[0:1024]))

# write some data
column_mmap[0:512] = .5

# Deletion closes the memory-mapped file and flushes changes to disk.
# del isn't strictly needed, as python will garbage collect objects that are no
# longer accessible. If, for example, you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created,
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array; when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap

# write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())

Loading a .csv file in python consuming too much memory

I'm loading a 4 GB .csv file in python. Since it's only 4 GB I expected it would be fine to load it all at once, but after a while my 32 GB of RAM gets completely filled.
Am I doing something wrong? Why does 4 GB on disk become so much larger in RAM?
Is there a faster way of loading this data?
fname = "E:\Data\Data.csv"
a = []
with open(fname) as csvfile:
reader = csv.reader(csvfile,delimiter=',')
cont = 0;
for row in reader:
cont = cont+1
print(cont)
a.append(row)
b = np.asarray(a)
You copied the entire content of the csv at least twice: once in a and again in b.
Any additional work over that data consumes additional memory to hold values and such.
You could del a once you have b, but note that the pandas library provides a read_csv function and a way to count the rows of the resulting DataFrame.
You should be able to do
a = list(reader)
which might be a little better.
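For reference, a hedged sketch of the pandas route mentioned above: read_csv with chunksize yields DataFrames of at most that many rows, so the whole file never has to sit in memory at once (the chunk size here is arbitrary):
import pandas as pd

fname = r"E:\Data\Data.csv"
total_rows = 0
for chunk in pd.read_csv(fname, chunksize=100_000):
    total_rows += len(chunk)          # count rows, or process each chunk here
print(total_rows)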
Because it's Python :-D One of the easiest approaches: create your own class for rows that stores the data in __slots__, which can save a couple of hundred bytes per row (the reader returns each row as a Python container of string objects, which is kinda huge even when nearly empty). If you go further, you could store a binary representation of the data.
But maybe you can process the data without keeping the entire data array? That would consume significantly less memory.
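A rough sketch of the __slots__ idea (column names are made up): a class with __slots__ has no per-instance __dict__, which saves memory compared with keeping each row as a dict or an ordinary object:
import csv

class Row:
    __slots__ = ("a", "b", "c")          # hypothetical column names

    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

rows = []
with open(r"E:\Data\Data.csv") as csvfile:
    for fields in csv.reader(csvfile, delimiter=','):
        rows.append(Row(*fields[:3]))    # assumes at least three columns per row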

NetCDF Big data

I need to read large (15+ GB) NetCDF files into a program which holds a 3D variable (with Time as the record dimension, and the data laid out as latitudes by longitudes).
I'm processing the data in a 3-level nested loop, checking whether each value of the NetCDF passes a certain criterion. For example:
from netCDF4 import Dataset
import numpy as np

File = Dataset('Somebigfile.nc', 'r')
Data = File.variables['Wind'][:]
Getdimensions = np.shape(Data)
Time = Getdimensions[0]
Latdim = Getdimensions[1]
Longdim = Getdimensions[2]

for t in range(0, Time):
    for i in range(0, Latdim):
        for j in range(0, Longdim):
            if Data[t, i, j] > Somethreshold:
                pass  # Do something
Is there any way I can read in the NetCDF file one time record at a time, reducing the memory usage hugely? Any help is greatly appreciated.
I know of the NCO operators but would prefer not to use those methods to break up the files before using the script.
It sounds like you've already settled on a solution, but I'll throw out a much more elegant and vectorized (likely faster) solution that uses xarray and dask. Your nested for loop is going to be very inefficient. Combining xarray and dask, you can work on the data in your file incrementally in a semi-vectorized manner.
Since your "Do something" step isn't all that specific, you'll have to extrapolate from my example.
import xarray as xr
# xarray will open your file but doesn't load in any data until you ask for it
# dask handles the chunking and memory management for you
# chunk size can be optimized for your specific dataset.
ds = xr.open_dataset('Somebigfile.nc', chunks={'time': 100})
# mask out values below the threshold
da_thresh = ds['Wind'].where(ds['Wind'] > Somethreshold)
# Now just operate on the values greater than your threshold
do_something(da_thresh)
Xarray/Dask docs: http://xarray.pydata.org/en/stable/dask.html
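For comparison, a plain netCDF4 sketch that reads one time record at a time, so only a single latitude-by-longitude slice is ever held in memory (the threshold value is a placeholder, and the variable name follows the question):
from netCDF4 import Dataset

Somethreshold = 10.0                     # placeholder threshold

nc = Dataset('Somebigfile.nc', 'r')
wind = nc.variables['Wind']              # no data is read yet, only metadata
for t in range(wind.shape[0]):
    data_t = wind[t, :, :]               # reads just this time record from disk
    mask = data_t > Somethreshold        # boolean mask of values over the threshold
    # Do something with data_t[mask]
nc.close()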

Efficient way of inputting large raster data into PyTables

I am looking for an efficient way to feed a large (20 GB) raster data file (GeoTIFF) into PyTables for further out-of-core computation.
Currently I am reading it as a numpy array using GDAL and writing the numpy array into pytables using the code below:
import gdal, numpy as np, tables as tb

inraster = gdal.Open('infile.tif').ReadAsArray().astype(np.float32)
f = tb.openFile('myhdf.h5', 'w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(), shape=np.shape(inraster))
dataset[:] = inraster
dataset.flush()
dataset.close()
f.close()
inraster = None
Unfortunately, since my input file is extremely large, my PC runs into a memory error while reading it as a numpy array. Is there an alternative way to feed the data into PyTables, or any suggestions to improve my code?
I do not have a GeoTIFF file, so I fiddled around with a normal TIFF file. You may have to omit the 3 in the shape and in the slice when writing the data to the pytables file. Essentially, I loop over the array without reading everything into memory in one go. You have to adjust n_chunks so that the chunk size read in one go does not exceed your system memory.
import gdal, numpy as np, tables as tb

ds = gdal.Open('infile.tif')
x_total, y_total = ds.RasterXSize, ds.RasterYSize
n_chunks = 100

f = tb.openFile('myhdf.h5', 'w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(), shape=(3, y_total, x_total))

# prepare the chunk indices
x_offsets = np.linspace(0, x_total, n_chunks).astype(int)
x_offsets = list(zip(x_offsets[:-1], x_offsets[1:]))
y_offsets = np.linspace(0, y_total, n_chunks).astype(int)
y_offsets = list(zip(y_offsets[:-1], y_offsets[1:]))

for x1, x2 in x_offsets:
    for y1, y2 in y_offsets:
        dataset[:, y1:y2, x1:x2] = ds.ReadAsArray(xoff=x1, yoff=y1, xsize=x2 - x1, ysize=y2 - y1)

f.close()
