Specify read buffer size in NetCDF (Python or C)

I would like to benchmark NetCDF I/O performance compared to reading a plain binary file in Python and C.
For the binary file, I read with several different buffer sizes. Here is the Python read function:
import itertools

def parse_os_read(filename, buffersize):
    with open(filename, 'rb') as f:
        for loops in itertools.count():
            record = f.read(buffersize)
            if not record:
                break
    return loops
I can then call this function with different buffer sizes ranging from 64 KB to several hundred MB.
I would now like to implement a comparable sequential-read benchmark for NetCDF. However, I could not find a read function that lets me set a predefined buffer size. I assume the NetCDF file should contain a 1D array of random floats as its data structure.
Any thoughts?
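For reference, a minimal sketch of what such a sequential NetCDF read loop might look like with the netCDF4 Python package, assuming a single 1-D float variable named "data"; the variable name, and expressing the "buffer size" as a number of elements per slice rather than bytes, are assumptions of this sketch, not part of the question:

import netCDF4

def parse_netcdf_read(filename, chunklen):
    # chunklen is a number of array elements; bytes per request is
    # chunklen * itemsize (e.g. 8 bytes per element for float64).
    loops = 0
    with netCDF4.Dataset(filename, 'r') as ds:
        var = ds.variables['data']  # hypothetical 1-D variable name
        for start in range(0, var.shape[0], chunklen):
            _ = var[start:start + chunklen]  # each slice is one read request
            loops += 1
    return loops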

Related

Faster alternative than pandas.read_csv for reading txt files with data

I am working on a project where I need to process a large number of txt files (about 7000), each with 2 columns of floats and 12500 rows.
I am using pandas, and it takes about 2 min 20 sec, which is a bit long. With MATLAB it takes about 1 min less. I would like to get close to MATLAB or faster.
Is there any faster alternative that I can implement with Python?
I tried Cython and the speed was the same as with pandas.
Here is the code I am using. It reads the files composed of column 1 (time) and column 2 (amplitude). I calculate the envelope and build a list with the resulting envelopes for all files. I extract the time from the first file using a simple numpy.loadtxt(), which is slower than pandas but has no impact since it is just one file.
Any ideas?
For coding suggestions, please try to use the same format that I use.
Cheers
Data example:
1.7949600e-05 -5.3232106e-03
1.7950000e-05 -5.6231098e-03
1.7950400e-05 -5.9230090e-03
1.7950800e-05 -6.3228746e-03
1.7951200e-05 -6.2978830e-03
1.7951600e-05 -6.6727570e-03
1.7952000e-05 -6.5727906e-03
1.7952400e-05 -6.9726562e-03
1.7952800e-05 -7.0726226e-03
1.7953200e-05 -7.2475638e-03
1.7953600e-05 -7.1725890e-03
1.7954000e-05 -6.9476646e-03
1.7954400e-05 -6.6227738e-03
1.7954800e-05 -6.4228410e-03
1.7955200e-05 -5.8480342e-03
1.7955600e-05 -6.1979166e-03
1.7956000e-05 -5.7980510e-03
1.7956400e-05 -5.6231098e-03
1.7956800e-05 -5.3482022e-03
1.7957200e-05 -5.1732611e-03
1.7957600e-05 -4.6484375e-03
20 files here:
https://1drv.ms/u/s!Ag-tHmG9aFpjcFZPqeTO12FWlMY?e=f6Zk38
import os
from os.path import join

import numpy as np
import pandas as pd
from scipy.signal import hilbert

folder_tarjet = r"D:\this"

if len(folder_tarjet) > 0:
    print("You chose %s" % folder_tarjet)

list_of_files = os.listdir(folder_tarjet)
list_of_files.sort(key=lambda f: os.path.getmtime(join(folder_tarjet, f)))
num_files = len(list_of_files)

envs_a = []
for elem in list_of_files:
    file_name = os.path.join(folder_tarjet, elem)
    amp = pd.read_csv(file_name, header=None, dtype={'amp': np.float64}, delim_whitespace=True)
    env_amplitudes = np.abs(hilbert(np.array(pd.DataFrame(amp[1]))))
    envs_a.append(env_amplitudes)
envelopes = np.array(envs_a).T

file_name = os.path.join(folder_tarjet, list_of_files[1])
Time = np.loadtxt(file_name, usecols=0)
I would suggest not using CSV to store and load big quantities of data. If you already have your data in CSV you can still convert all of it at once into a faster format; for instance you can use pickle, HDF5, Feather or Parquet. They are not human readable but have much better performance on just about every other metric.
After the conversion you will be able to load the data in a few seconds (if not less than a second) instead of minutes, so in my opinion that is a much better solution than trying to squeeze out a marginal 50% optimization.
If the data is being generated, make sure to generate it in a compressed format. If you don't know which format to use, Parquet for instance is one of the fastest for reading and writing, and you can also load it from MATLAB.
Here you will find the options already supported by pandas.
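A minimal one-off conversion sketch along those lines, assuming the two-column whitespace-separated files from the question and the pyarrow (or fastparquet) backend for Parquet; folder, file and column names are illustrative:

import glob
import os

import pandas as pd

def convert_folder_to_parquet(folder):
    # One-off conversion: each .txt file gets a .parquet file next to it.
    for txt_path in glob.glob(os.path.join(folder, "*.txt")):
        df = pd.read_csv(txt_path, header=None, names=["time", "amp"],
                         delim_whitespace=True, dtype="float64")
        df.to_parquet(txt_path + ".parquet")

# On later runs, read the converted files instead of re-parsing text:
# df = pd.read_parquet(txt_path + ".parquet")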
As discussed in @Ziur Olpa's answer and the comments, a binary format is bound to be faster to read than parsing text.
The quick way to get those gains is to use NumPy's own NPY format and have your reader function cache the parsed arrays on disk; that way, when you re-(re-re-)run your data analysis, it will use the pre-parsed NPY files instead of the "raw" TXT files. Should the TXT files change, you can just remove all the NPY files and wait a while longer for parsing to happen (or add logic that compares the modification times of the NPY files with their corresponding TXT files).
Something like this – I hope I got your hilbert/abs/transposition logic the way you wanted it.
import glob
import os

import numpy as np
import pandas as pd
from scipy import signal


def read_scans(folder):
    """
    Read scan files from a given folder, yield tuples of filename/matrix.
    This will also create "cached" npy files for each txt file, e.g. a.txt -> a.txt.npy.
    """
    for filename in sorted(glob.glob(os.path.join(folder, "*.txt")), key=os.path.getmtime):
        np_filename = filename + ".npy"
        if os.path.exists(np_filename):
            yield (filename, np.load(np_filename))
        else:
            df = pd.read_csv(filename, header=None, dtype=np.float64, delim_whitespace=True)
            mat = df.values
            np.save(np_filename, mat)
            yield (filename, mat)


def read_data_opt(folder):
    times = None
    envs = []
    for filename, mat in read_scans(folder):
        if times is None:
            # If we hadn't got the times yet, grab them from this
            # matrix (by slicing the first column).
            times = mat[:, 0]
        env_amplitudes = signal.hilbert(mat[:, 1])
        envs.append(env_amplitudes)
    envs = np.abs(np.array(envs)).T
    return (times, envs)
On my machine, the first run (for your 20-file dataset) takes
read_data_opt took 0.13302898406982422 seconds
and the subsequent runs are 4x faster:
read_data_opt took 0.031115055084228516 seconds

How can I create a Numpy Array that is much bigger than my RAM from 1000s of CSV files?

I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy because the two libraries are heavily integrated.
Another answer (IMO: the correct solution to your problem) is to use a memory mapped array. This will let numpy only load the necessary parts to ram when needed, and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in python module that would handle this is struct. Appending more data would be as simple as opening the file in append mode, and writing more bytes of data. Make sure that any references to the memory mapped array are re-created any time more data is written to the file so the information is fresh.
Finally, there is compression. Numpy can compress arrays with savez_compressed, which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped and must be loaded into memory entirely. Loading one column at a time may get you under the threshold, but this could similarly be applied to the other methods to reduce memory usage. Numpy's built-in compression techniques will only save disk space, not memory. There may exist other libraries that perform some sort of streaming compression, but that is beyond the scope of my answer.
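For the compression route specifically, a small illustrative sketch (file and array names are made up); note that the whole array is decompressed when it is accessed:

import numpy as np

data = np.random.rand(1024, 1024)
np.savez_compressed('columns.npz', data=data)  # compressed on disk

archive = np.load('columns.npz')
col = archive['data'][:, 0]  # decompresses the full 'data' array, then slices
archive.close()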
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np

# open a file for data of a single column
with open('column_data.dat', 'wb') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())

# open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float64)

# read some data
print(np.mean(column_mmap[0:1024]))

# write some data
column_mmap[0:512] = .5

# Deletion closes the memory-mapped file and flushes changes to disk.
# del isn't specifically needed, as python will garbage collect objects no
# longer accessible. If, for example, you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created,
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array; when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap

# write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())

Fast way to reverse float32 endianness in binary file

I have a binary file, several hundred MBs in size. It contains samples in float32 big-endian format (4 bytes per sample). I want to convert them to little-endian format. Some background: I want to write them to a .wav file later on and that needs data in the little-endian format afaik.
The code below is what I currently use. It seems to work fine, but it is quite slow (I assume because I am only writing 4 bytes at a time):
import struct

infile = "infile_big_endian.raw"
outfile = "outfile_little_endian.raw"

with open(infile, "rb") as old, open(outfile, "wb") as new:
    for chunk in iter(lambda: old.read(4), b""):
        chunk = struct.pack("<f", struct.unpack(">f", chunk)[0])
        new.write(chunk)
Is there a quicker way to do this in python?
NumPy might be faster:
numpy.memmap(infile, dtype=numpy.int32).byteswap().tofile(outfile)
Or overwriting the input file:
numpy.memmap(infile, dtype=numpy.int32).byteswap(inplace=True).flush()
We memory-map the array and use byteswap to reverse the endianness at C speed. I've used int32 instead of float32 just in case NaNs might be a problem with float32.
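If you prefer to keep the float32 interpretation explicit, a sketch of the same idea using endian-qualified dtypes (reading the whole file into memory rather than memory-mapping it; infile/outfile as in the question):

import numpy as np

# '>f4' is big-endian float32, '<f4' is little-endian float32
np.fromfile(infile, dtype='>f4').astype('<f4').tofile(outfile)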

python struct.pack and write vs matlab fwrite

I am trying to port this bit of matlab code to python
matlab
function write_file(im,name)
fp = fopen(name,'wb');
M = size(im);
fwrite(fp,[M(1) M(2) M(3)],'int');
fwrite(fp,im(:),'float');
fclose(fp);
where im is a 3D matrix. As far as I understand, the function writes a binary file with a header containing the matrix size as 3 integers, followed by im written as a single column of floats. In MATLAB this takes a few seconds for a 150MB file.
python
import struct
import numpy as np

def write_image(im, file_name):
    with open(file_name, 'wb') as f:
        l = im.shape[0] * im.shape[1] * im.shape[2]
        header = np.array([im.shape[0], im.shape[1], im.shape[2]])
        header_bin = struct.pack("I" * 3, *header)
        f.write(header_bin)
        im_bin = struct.pack("f" * l, *np.reshape(im, (l, 1), order='F'))
        f.write(im_bin)
        f.close()
where im is a numpy array. This code works well, as I compared it with the binary returned by MATLAB and they are the same. However, for the 150MB file it takes several seconds and tends to drain all the memory (in the image linked I stopped the execution to avoid it, but you can see how the usage builds up!).
This does not make sense to me, as I am running the function on a PC with 15GB of RAM. How come processing a 150MB file requires so much memory?
I'd be happy to use a different method, as long as it is possible to keep the two formats for the header and the data column.
There is no need to use struct to save your array. numpy.ndarray has a convenience method for saving itself in binary mode: ndarray.tofile. The following should be much more efficient than creating a gigantic string with the same number of elements as your array:
def write_image(im, file_name):
    with open(file_name, 'wb') as f:
        np.array(im.shape).tofile(f)
        im.T.tofile(f)
tofile always saves in row-major C order, while MATLAB uses column-major Fortran order. The simplest way to get around this is to save the transpose of the array. In general, ndarray.T should create a view (wrapper object pointing to the same underlying data) instead of a copy, so your memory usage should not increase noticeably from this operation.
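One caveat: tofile writes whatever dtypes the arrays happen to have, while the file in the question uses a 3 x 32-bit integer header and float32 data, so an exact byte-for-byte match may need explicit casts (which, unlike .T, do create temporary copies). A sketch under that assumption, with a matching read-back helper (function names are illustrative):

import numpy as np

def write_image_exact(im, file_name):
    with open(file_name, 'wb') as f:
        np.array(im.shape, dtype=np.int32).tofile(f)  # 3 x int32 header
        im.T.astype(np.float32).tofile(f)             # column-major float32 data

def read_image_exact(file_name):
    with open(file_name, 'rb') as f:
        shape = np.fromfile(f, dtype=np.int32, count=3)
        data = np.fromfile(f, dtype=np.float32)
    return data.reshape(shape, order='F')  # restore the original 3D layout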

Efficient way of inputting large raster data into PyTables

I am looking for an efficient way to feed a raster data file (GeoTIFF, 20GB in size) into PyTables for further out-of-core computation.
Currently I am reading it as a numpy array using GDAL and writing the numpy array into PyTables using the code below:
import gdal, numpy as np, tables as tb

inraster = gdal.Open('infile.tif').ReadAsArray().astype(np.float32)
f = tb.openFile('myhdf.h5', 'w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(), shape=np.shape(inraster))
dataset[:] = inraster
dataset.flush()
dataset.close()
f.close()
inraster = None
Unfortunately, since my input file is extremely large, my PC reports a memory error while reading it into a numpy array. Is there an alternative way to feed the data into PyTables, or any suggestions to improve my code?
I do not have a GeoTIFF file, so I fiddled around with a normal tif file. You may have to omit the 3 in the shape and the slice when writing the data to the pytables file. Essentially, I loop over the array without reading everything into memory in one go. You have to adjust n_chunks so that the chunk size read in one go does not exceed your system memory.
import gdal
import numpy as np
import tables as tb

ds = gdal.Open('infile.tif')
x_total, y_total = ds.RasterXSize, ds.RasterYSize
n_chunks = 100

f = tb.openFile('myhdf.h5', 'w')
dataset = f.createCArray(f.root, 'mydata', atom=tb.Float32Atom(), shape=(3, y_total, x_total))

# prepare the chunk indices
x_offsets = np.linspace(0, x_total, n_chunks).astype(int)
x_offsets = list(zip(x_offsets[:-1], x_offsets[1:]))
y_offsets = np.linspace(0, y_total, n_chunks).astype(int)
y_offsets = list(zip(y_offsets[:-1], y_offsets[1:]))

for x1, x2 in x_offsets:
    for y1, y2 in y_offsets:
        dataset[:, y1:y2, x1:x2] = ds.ReadAsArray(xoff=x1, yoff=y1, xsize=x2 - x1, ysize=y2 - y1)
