Fast way to reverse float32 endianness in binary file - python

I have a binary file, several hundred MB in size. It contains samples in float32 big-endian format (4 bytes per sample), and I want to convert them to little-endian. Some background: I want to write them to a .wav file later on, and that needs the data in little-endian format, as far as I know.
The code below is what I currently use. It works, but it is quite slow (I assume because I am only writing 4 bytes at a time):
import struct
infile = "infile_big_endian.raw"
outfile = "outfile_little_endian.raw"
with open(infile, "rb") as old, open(outfile , "wb") as new:
    for chunk in iter(lambda: old.read(4), b""):
        chunk = struct.pack("<f", struct.unpack(">f", chunk)[0])
        new.write(chunk)
Is there a quicker way to do this in python?

NumPy might be faster:
numpy.memmap(infile, dtype=numpy.int32).byteswap().tofile(outfile)
Or overwriting the input file:
numpy.memmap(infile, dtype=numpy.int32).byteswap(inplace=True).flush()
We memory-map the file and use byteswap to reverse the endianness at C speed. I've used int32 rather than float32: since byteswap only swaps the raw 4-byte values, the output is identical either way, but staying with integers sidesteps any handling of NaN bit patterns.
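If the file fits comfortably in memory, another option is to spell the byte order out in the dtype strings and let astype do the swap. This is only a sketch, reusing the infile/outfile names from the question, and it loads the whole file at once, unlike the memmap versions above:
import numpy as np

# '>i4' is big-endian 32-bit, '<i4' little-endian; astype preserves the
# integer values, so each 4-byte sample comes out byte-swapped on disk.
np.fromfile(infile, dtype=">i4").astype("<i4").tofile(outfile)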

Related

Compressing numpy float arrays

I have a large numpy float array (~4k x 16k, float64) that I want to store on disk. I am trying to understand the differences between the following compression approaches:
1) Use np.save to save in .npy format and gzip the result (as in one of the answers to Compress numpy arrays efficiently):
f = gzip.GzipFile("my_file.npy.gz", "w")
numpy.save(f, my_array)
f.close()
I get equivalent file sizes if I do the following as well
numpy.save('my_file',my_array)
check_call(['gzip', os.getcwd()+'/my_file.npy'])
2) Write the array into a binary file using tofile(). Close the file and zip this generated binary file using GZIP.
f = open("my_file","wb")
my_array.tofile(f)
f.close()
with open('my_file', 'rb') as f_in:
with gzip.open('my_file.gz', 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
The above is a workaround to the following code which does not achieve any compression. This is expected according to the GzipFile docs.
f = gzip.GzipFile("my_file_1.gz", "w")
my_array.tofile(f)
f.close()
Here is my question: the file size from 1) is about 6 times smaller than from 2). As far as I understand the .npy format, it is exactly the same as a raw binary file except for a small header that preserves the array shape and dtype. I don't see any reason why the file sizes should differ so drastically.
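One sanity check worth running (a sketch, assuming the files produced by approaches 1) and 2) above already exist) is to decompress both archives and verify that the payloads really are the same data, since identical payloads should compress to roughly similar sizes:
import gzip
import numpy as np

# Load the .npy payload from the first archive and the raw payload from
# the second, then compare them element by element.
with gzip.open("my_file.npy.gz", "rb") as f:
    a = np.load(f)
with gzip.open("my_file.gz", "rb") as f:
    b = np.frombuffer(f.read(), dtype=a.dtype).reshape(a.shape)
print(np.array_equal(a, b))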

Specify read buffer size in NetCDF (Python or C)

I would like to benchmark NetCDF I/O performance compared to reading a plain binary file in Python and C.
For reading the binary file, I have defined different buffer sizes to read. Here is the code of the read function in Python:
import itertools

def parse_os_read(filename, buffersize):
    f = open(filename, 'rb')
    for loops in itertools.count():
        record = f.read(buffersize)
        if not record:
            break
    return loops
I can then call this function with buffer sizes ranging from 64 KB to several hundred MB.
I would now like to implement a comparable sequential-read benchmark for NetCDF. However, I could not find a read function that lets me specify a predefined buffer size. For the NetCDF file I assume a single 1D array of random floats as the data structure.
Any thoughts?
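For what it's worth, one way to approximate a buffered sequential read is to request a fixed number of values per slice from the variable. The sketch below assumes the netCDF4 package and a 1D float32 variable named "data" (both assumptions); note that the library's own chunk cache still sits underneath, so this only controls how much is requested per call:
import itertools
from netCDF4 import Dataset

def parse_netcdf_read(filename, buffersize, varname="data"):
    # Read the 1D variable in slices of roughly `buffersize` bytes
    # (4 bytes per float32 value).
    values_per_read = max(1, buffersize // 4)
    with Dataset(filename, "r") as ds:
        var = ds.variables[varname]
        n = var.shape[0]
        for loops in itertools.count():
            start = loops * values_per_read
            if start >= n:
                break
            _ = var[start:start + values_per_read]
    return loops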

python struct.pack and write vs matlab fwrite

I am trying to port this bit of matlab code to python
matlab
function write_file(im,name)
fp = fopen(name,'wb');
M = size(im);
fwrite(fp,[M(1) M(2) M(3)],'int');
fwrite(fp,im(:),'float');
fclose(fp);
where im is a 3D matrix. As far as I understand it, the function writes a binary file starting with a header of three integers containing the matrix size, followed by im written as a single column of floats. In MATLAB this takes a few seconds for a 150 MB file.
python
import struct
import numpy as np
def write_image(im, file_name):
    with open(file_name, 'wb') as f:
        l = im.shape[0]*im.shape[1]*im.shape[2]
        header = np.array([im.shape[0], im.shape[1], im.shape[2]])
        header_bin = struct.pack("I"*3, *header)
        f.write(header_bin)
        im_bin = struct.pack("f"*l, *np.reshape(im, (l,1), order='F'))
        f.write(im_bin)
        f.close()
where im is a numpy array. The code works correctly: I compared its output with the binary file produced by MATLAB and they are identical. However, for the 150 MB file it takes several seconds and tends to eat up all the memory (in the linked image I stopped the execution to avoid that, but you can see how it builds up!).
This does not make sense to me, as I am running the function on a PC with 15 GB of RAM. How can processing a 150 MB file require so much memory?
I'd be happy to use a different method, as long as it keeps the two parts of the format: the header and the data column.
There is no need to use struct to save your array. numpy.ndarray has a convenience method for saving itself in binary mode: ndarray.tofile. The following should be much more efficient than creating a gigantic string with the same number of elements as your array:
def write_image(im, file_name):
    # im is assumed to already be float32, matching MATLAB's 'float'
    with open(file_name, 'wb') as f:
        np.array(im.shape, dtype=np.int32).tofile(f)  # int32 header, like MATLAB's 'int'
        im.T.tofile(f)
tofile always saves in row-major C order, while MATLAB uses column-major Fortran order. The simplest way to get around this is to save the transpose of the array. In general, ndarray.T should create a view (wrapper object pointing to the same underlying data) instead of a copy, so your memory usage should not increase noticeably from this operation.
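For completeness, here is a sketch of a matching reader that undoes the transpose by reshaping in Fortran order; it assumes the int32 header and float32 data of the format above:
import numpy as np

def read_image(file_name):
    # Read the three int32 header values, then the float32 payload,
    # and rebuild the original shape in MATLAB's column-major order.
    with open(file_name, 'rb') as f:
        shape = np.fromfile(f, dtype=np.int32, count=3)
        data = np.fromfile(f, dtype=np.float32)
    return data.reshape(shape, order='F')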

How to read a compressed binary file as an array of floats

I need to read a compressed unformatted binary file in as an array of floats. The only way I have found to do this is to use os to unzip it, read it with np.fromfile, and then zip it up again.
os.system('gunzip filename.gz')
array = np.fromfile('filename','f4')
os.system('gzip filename')
However, this is not acceptable. Apart from being messy, I need to read files for which I don't have write permission. I understand that np.fromfile cannot directly read a compressed file. I have found people recommending this:
f=gzip.GzipFile('filename')
file_content = f.read()
But this returns something like this: '\x00\x80\xe69\x00\x80\xd19\x00\x80'
instead of an array of floats. Does anyone know how to convert this output into an array of floats, or have a better way to do this?
After you've read the file content, you can get an array with numpy.frombuffer (numpy.fromstring also works here, but it is deprecated for binary data):
import gzip
import numpy as np
f = gzip.GzipFile('filename')
file_content = f.read()
array = np.frombuffer(file_content, dtype='f4')
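Note that numpy.frombuffer returns a read-only view over the immutable bytes object, so copy it if you need to modify the array in place. A small sketch of the same idea with a context manager:
import gzip
import numpy as np

with gzip.GzipFile('filename') as f:
    # .copy() detaches the array from the read-only bytes buffer
    array = np.frombuffer(f.read(), dtype='f4').copy()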

Optimize writing a ctypes float array as byte stream to file from python

I do have a small problem with the way I currently handle writing data to a file.
The data is stored as a ctypes pointer to an array of floats of known size (over 100 million elements).
The file itself is supposed to be a raw file holding some header information plus the above-mentioned volume data represented as unsigned 8-bit integers.
What I currently do (after opening the file and writing the header information) is this:
texVolume = voxelData.from_address(ptr.as_pointer())
for pixel in texVolume.dataset_p.contents:
    f.write(c_uint8(int(pixel*255)))
f.close()
with voxelData being a ctypes struct holding (among other data) a ctypes float array pointer in the variable "dataset_p". This is the data I need to store as unsigned int bytes.
While my current implementation works just fine, it is very slow, and saving a file shouldn't take this long.
I already tried doing the whole float-to-uint conversion in bulk before the loop, but I can't get it to work with the c_uint8 conversion any more; it throws a TypeError instead ("only length-1 arrays can be converted to Python scalars").
Thank you for any help you may provide. If anything is unclear, don't hesitate to ask.
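One way to do the conversion in bulk is to wrap the ctypes array in a numpy view and write everything in a single call. This is only a sketch: it assumes numpy is available and that dataset_p.contents is a ctypes float array of known length, reusing the names from the snippet above:
import numpy as np

# Zero-copy view over the ctypes float array, then scale, cast to uint8
# and write the whole buffer in one call instead of one byte at a time.
floats = np.ctypeslib.as_array(texVolume.dataset_p.contents)
(floats * 255).astype(np.uint8).tofile(f)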
