Efficient way of inputting large raster data into PyTables - python

I am looking for the efficient way to feed up the raster data file (GeoTiff) with 20GB size into PyTables for further out of core computation.
Currently I am reading it as numpy array using Gdal, and writing the numpy array into
pytables using the code below:
import gdal, numpy as np, tables as tb
inraster = gdal.Open('infile.tif').ReadAsArray().astype(np.float32)
f = tb.openFile('myhdf.h5','w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(),shape=np.shape(inraster)
dataset[:] = inraster
dataset.flush()
dataset.close()
f.close()
inraster = None
Unfortunately, since my input file is extremely large, while reading it as numpy error, my PC shows memory error. Is there any alternative way to feed up the data into PyTables or any suggestions to improve my code?

I do not have a geotiff file, so I fiddled around with a normal tif file. You may have to omit the 3 in the shape and the slice in the writing if the data to the pytables file. Essentially, I loop over the array without reading everything into memory in one go. You have to adjust n_chunks so the chunksize that gets read in one go does not exceed your system memory.
ds=gdal.Open('infile.tif')
x_total,y_total=ds.RasterXSize,ds.RasterYSize
n_chunks=100
f = tb.openFile('myhdf.h5','w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(),shape=(3,y_total,x_total)
#prepare the chunk indices
x_offsets=linspace(0,x_total,n_chunks).astype(int)
x_offsets=zip(x_offsets[:-1],x_offsets[1:])
y_offsets=linspace(0,y_total,n_chunks).astype(int)
y_offsets=zip(y_offsets[:-1],y_offsets[1:])
for x1,x2 in x_offsets:
for y1,y2 in y_offsets:
dataset[:,y1:y2,x1:x2]=ds.ReadAsArray(xoff=x1,yoff=y1,xsize=x2-x1, ysize=y2-y1)

Related

How can I create a Numpy Array that is much bigger than my RAM from 1000s of CSV files?

I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy because the two libraries are heavily integrated.
Another answer (IMO: the correct solution to your problem) is to use a memory mapped array. This will let numpy only load the necessary parts to ram when needed, and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in python module that would handle this is struct. Appending more data would be as simple as opening the file in append mode, and writing more bytes of data. Make sure that any references to the memory mapped array are re-created any time more data is written to the file so the information is fresh.
Finally is something like compression. Numpy can compress arrays with savez_compressed which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped, and must be loaded into memory entirely. Loading one column at a time may be able to get you under the threshold, but this could similarly be applied to other methods to reduce memory usage. Numpy's built in compression techniques will only save disk space not memory. There may exist other libraries that perform some sorts of streaming compression, but that is beyond the scope of my answer.
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np
#open a file for data of a single column
with open('column_data.dat', 'wb') as f:
#for 1024 "csv files"
for _ in range(1024):
csv_data = np.random.rand(1024).astype(np.float) #represents one column of data
f.write(csv_data.tobytes())
#open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float)
#read some data
print(np.mean(column_mmap[0:1024]))
#write some data
column_mmap[0:512] = .5
#deletion closes the memory-mapped file and flush changes to disk.
# del isn't specifically needed as python will garbage collect objects no
# longer accessable. If for example you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array, then when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap
#write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
#for 1024 "csv files"
for _ in range(1024):
csv_data = np.random.rand(1024).astype(np.float) #represents one column of data
f.write(csv_data.tobytes())

HDF5 core driver (H5FD_CORE): loading selected dataset(s)

Currently, I load HDF5 data in python via h5py and read a dataset into memory.
f = h5py.File('myfile.h5', 'r')
dset = f['mydataset'][:]
This works, but if 'mydataset' is the only dataset in myfile.h5, then the following is more efficient:
f = h5py.File('myfile.h5', 'r', driver='core')
dset = f['mydataset'][:]
I believe this is because the 'core' driver memory maps the entire file, which is an optimised way of loading data into memory.
My question is: is it possible to use 'core' driver on selected dataset(s)? In other words, on loading the file I only wish to memory map selected datasets and/or groups. I have a file with many datasets and I would like to load each one into memory sequentially. I cannot load them all at once, since on aggregate they won't fit in memory.
I understand one alternative is to split my single HDF5 file with many datasets into many HDF5 files with one dataset each. However, I am hoping there might be a more elegant solution, possibly using h5py low-level API.
Update: Even if what I am asking is not possible, can someone explain why using driver='core' has substantially better performance when reading in a whole dataset? Is reading the only dataset of an HDF5 file into memory very different from memory mapping it via core driver?
I guess it is the same problem as if you read the file by looping over an abitrary axis without setting a proper chunk-cache-size.
If you are reading it with the core driver, it is guaranteed that the whole file is read sequentially from the disk and everything else (decompressing, chunked data to compact data,...) is done completely in RAM.
I used the simplest form of fancy slicing example from here https://stackoverflow.com/a/48405220/4045774 to write the data.
import h5py as h5
import time
import numpy as np
import h5py_cache as h5c
def Reading():
File_Name_HDF5='Test.h5'
t1=time.time()
f = h5.File(File_Name_HDF5, 'r',driver='core')
dset = f['Test'][:]
f.close()
print(time.time()-t1)
t1=time.time()
f = h5c.File(File_Name_HDF5, 'r',chunk_cache_mem_size=1024**2*500)
dset = f['Test'][:]
f.close()
print(time.time()-t1)
t1=time.time()
f = h5.File(File_Name_HDF5, 'r')
dset = f['Test'][:]
print(time.time()-t1)
f.close()
if __name__ == "__main__":
Reading()
This gives on my machine 2,38s (core driver), 2,29s (with 500 MB chunk-cache-size), 4,29s (with the default chunk-cache-size of 1MB)

Why 6GB csv file is not possible to read whole to memory (64GB) in numpy

I have csv file which sze is 6.8GB and I am not able to read it into memory into numpy array although I have 64GB RAM
CSV file has 10 milion of lines, each line has 131 records (mix of int and float)
I tried to read it to float numpy array
import numpy as np
data = np.genfromtxt('./data.csv', delimiter=';')
it failed due to memory.
when I read just one line and get size
data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1)
data.nbytes
I get 1048 bytes
So , I would expect that 10.000.000 * 1048 = 10,48 GB which should be stored in memory without any problem. Why it doesn't work?
Finaly I tried to optimize array in memory by defining types
data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1,
dtype="i1,i1,f4,f4,....,i2,f4,f4,f4")
data.nbytes
so I get only 464B per line, so it would be only 4,56 GB but it is still not possible to load to memory.
Do you have any idea?
I need to use this array in Keras.
Thank you
genfromtext is regular python code, that converts the data to a numpy array only as a final step. During this last step, the RAM needs to hold a giant python list as well as the resultant numpy array, both at the same time. Maybe you could try numpy.fromfile, or the Pandas csv reader. Since you know the type of data per column and the number of lines, you also preallocate a numpy array yourself and fill it using a simple for-loop.

python struct.pack and write vs matlab fwrite

I am trying to port this bit of matlab code to python
matlab
function write_file(im,name)
fp = fopen(name,'wb');
M = size(im);
fwrite(fp,[M(1) M(2) M(3)],'int');
fwrite(fp,im(:),'float');
fclose(fp);
where im is a 3D matrix. As far as I understand, the function first writes a binary file with a header row containing the matrix size. The header is made of 3 integers. Then, the im is written as a single column of floats. In matlab this takes few seconds for a file of 150MB.
python
import struct
import numpy as np
def write_image(im, file_name):
with open(file_name, 'wb') as f:
l = im.shape[0]*im.shape[1]*im.shape[2]
header = np.array([im.shape[0], im.shape[1], im.shape[2]])
header_bin = struct.pack("I"*3, *header)
f.write(header_bin)
im_bin = struct.pack("f"*l,*np.reshape(im, (l,1), order='F'))
f.write(im_bin)
f.close()
where im is a numpy array. This code works well as I compared with the binary returned by matlab and they are the same. However, for the 150MB file, it takes several seconds and tends to drain all the memory (in the image linked I stopped the execution to avoid it, but you can see how it builds up!).
This does not make sense to me as I am running the function on a 15GB of RAM PC. How come a 150MB file processing requires so much memory?
I'd happy to use a different method, as far as it is possible to have two formats for the header and the data column.
There is no need to use struct to save your array. numpy.ndarray has a convenience method for saving itself in binary mode: ndarray.tofile. The following should be much more efficient than creating a gigantic string with the same number of elements as your array:
def write_image(im, file_name):
with open(file_name, 'wb') as f:
np.array(im.shape).tofile(f)
im.T.tofile(f)
tofile always saves in row-major C order, while MATLAB uses column-major Fortran order. The simplest way to get around this is to save the transpose of the array. In general, ndarray.T should create a view (wrapper object pointing to the same underlying data) instead of a copy, so your memory usage should not increase noticeably from this operation.

NetCDF Big data

I need to read in large (+15GB) NetCDF files into a program, which holds a 3D variable (etc Time as the record dimension, and the data is latitudes by longitudes).
I'm processing the data in a 3 level nested loop (checking each block of the NetCDF if it passes a certain criteria. For example;
from netCDF4 import Dataset
import numpy as np
File = Dataset('Somebigfile.nc', 'r')
Data = File.variables['Wind'][:]
Getdimensions = np.shape(Data)
Time = Getdimensions[0]
Latdim = Getdimensions[1]
Longdim = Getdimensions[2]
for t in range(0,Time):
for i in range(0,Latdim):
for j in range(0,Longdim):
if Data[t,i,j] > Somethreshold:
#Do something
Is there anyway I can read in the NetCDF file one time record at a time? Reducing the memory usage hugely. Any help hugely appreciated.
I know of NCO operators but would prefer not to use these methods to break up files before using the script.
It sounds like you've already settled on a solution but I'll throw out a much more elegant and vectorized (likely faster) solution that uses xarray and dask. Your nested for loop is going to be very inefficient. Combining xarray and dask, you can work on the data in your file incrementally in a semi-vectorized manor.
Since your Do something step isn't all that specific, you'll have to extrapolate from my example.
import xarray as xr
# xarray will open your file but doesn't load in any data until you ask for it
# dask handles the chunking and memory management for you
# chunk size can be optimized for your specific dataset.
ds = xr.open_dataset('Somebigfile.nc', chunks={'time': 100})
# mask out values below the threshold
da_thresh = ds['Wind'].where(ds['Wind'] > Somethreshold)
# Now just operate on the values greater than your threshold
do_something(da_thresh)
Xarray/Dask docs: http://xarray.pydata.org/en/stable/dask.html

Categories