Read binary flatfile and skip bytes - python

I have a binary file with data organized into 400-byte records. I want to build an array of type np.uint32 from the 4 bytes at positions 304 to 308 of each record. However, I cannot find a NumPy method that lets me select which bytes to read; numpy.fromfile only accepts an initial offset.
For example, if my file contains 1000 groups of 400 bytes, I need an array of size 1000 such that:
arr[0] = bytes 304-308
arr[1] = bytes 704-708
...
arr[-1] = bytes 399904 - 399908
Is there a NumPy method that would allow me to specify which bytes to read from a buffer?
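For concreteness, a test file with this layout could be generated like the sketch below (file name and values are arbitrary; on a little-endian machine, record i then holds the value i at bytes 304-308):
import numpy as np

records = np.zeros((1000, 400), dtype=np.uint8)  # 1000 groups of 400 bytes
records[:, 304:308] = np.arange(1000, dtype=np.uint32).view(np.uint8).reshape(1000, 4)
records.tofile('testfile.bin')                   # record i holds i at bytes 304-308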

Another way to phrase what you are looking for is that you want to read uint32 numbers starting at offset 304, with a stride of 400 bytes. np.fromfile does not provide an argument for custom strides (although it probably should). You have a couple of different options going forward.
The simplest is probably to load the entire file and subset the column you want:
data = np.fromfile(filename, dtype=np.uint32)[304 // 4::400 // 4].copy()
If you want more control over the exact positioning of the bytes (e.g., if the offset or block size is not a multiple of 4), you can use structured arrays instead:
dt = np.dtype([('_1', 'u1', 304), ('data', 'u4'), ('_2', 'u1', 92)])
data = np.fromfile(filename, dtype=dt)['data'].copy()
Here, _1 and _2 are used to discard the unneeded bytes with 1-byte resolution rather than 4.
Loading the entire file is generally going to be much faster than seeking between reads, so these approaches are likely desirable for files that fit into memory. If that is not the case, you can use memory mapping, or an entirely home-grown solution.
Memory maps can be implemented via Python's mmap module and wrapped in an ndarray using the buffer parameter, or you can use the np.memmap class that does it for you:
mm = np.memmap(filename, dtype=np.uint32, mode='r', offset=0, shape=(1000, 400 // 4))
data = np.array(mm[:, 304 // 4])
del mm
Using a raw mmap is arguably more efficient because you can specify strides and an offset that index directly into the map, skipping all the extra data. It also lets you use an offset and strides that are not multiples of the size of np.uint32:
import mmap

with open(filename, 'rb') as f, mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
    data = np.ndarray(buffer=mm, dtype=np.uint32, offset=304, strides=(400,), shape=(1000,)).copy()
The final call to copy is required because the underlying buffer will be invalidated as soon as the memory map is closed, possibly leading to a segfault.


Fastest way to convert a bytearray into a numpy array

I have a bytearray which I want to convert to a numpy array of int16 to perform FFT operations on. The bytearray is coming out of a UDP socket, so first I convert two consecutive bytes into an int16 using struct.unpack, and then convert the result to a numpy array using np.asarray.
The current approach, however, is too slow. The original bytearray is 1e6 bytes long, so each of the mentioned steps (struct.unpack and np.asarray) takes about 20 ms, for a total of 40 ms. That is a relatively long frame time for my application, so I need it shortened.
Currently, I'm doing this:
temp1 = self.data_buffer[0:FRAME_LEN_B]
self.temp_list = np.asarray(struct.unpack('h' * (len(temp1) // 2), temp1))
You can try np.frombuffer. This can wrap any object supporting the buffer protocol, which bytearray explicitly does, into an array:
arr = np.frombuffer(self.data_buffer, dtype=np.int16, count=FRAME_LEN_B // 2)
You can manipulate the array however you want after that: slice, reshape, transpose, etc.
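For example, here is a rough sketch of the FFT use case (the packet contents and the 512-sample frame length are made-up values, not from the original code):
import numpy as np

packet = bytearray(np.arange(4096, dtype=np.int16).tobytes())  # stand-in for the UDP payload

arr = np.frombuffer(packet, dtype=np.int16)   # zero-copy view of the bytes
frames = arr.reshape(-1, 512)                 # split into 512-sample frames
spectra = np.fft.rfft(frames, axis=1)         # FFT of each frame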
If your native byte order is opposite to what you have coming in from the network, you can swap the interpretation order without having to swap the data in-place:
dt = np.dtype(np.int16).newbyteorder('>')
arr = np.frombuffer(self.data_buffer, dtype=dt, count=FRAME_LEN_B // 2)
If the order is non-native, operations on the array may take longer, as the data will have to be swapped on the fly every time. If that becomes a problem, you can swap the bytes in place ahead of time and then reinterpret them with the native dtype:
arr.byteswap(inplace=True)
arr = arr.view(arr.dtype.newbyteorder())
This overwrites the contents of the original packet. If you want a separate copy instead, call byteswap with inplace=False, which is the default and returns a new, swapped array.

How to get uncompressed size of a > 4GB .gz file in python

So there is this super interesting thread already about getting the original size of a .gz file. It turns out the size one can get from the last 4 bytes of the file is 'just' there to make sure extraction was successful. However: it's fine to rely on it IF the extracted data size is below 2**32 bytes, i.e. 4 GB.
Now IF there is more than 4 GB of uncompressed data, there must be multiple members in the .gz! The last 4 bytes then only indicate the uncompressed size of the last chunk!
So how do we get the ending bytes of the other chunks?
Reading the gzip specs, I don't see a length field for the
+=======================+
|...compressed blocks...|
+=======================+
OK. It must depend on the CM - compression method - which is probably deflate. Let's look at the RFC for that. There on page 11 it says there is a LEN attribute for "Non-compressed blocks", but it gets funky when they talk about the compressed ones ...
I can imagine something like
full_size = os.path.getsize(gz_path)
gz = gzip.open(gz_path)
pos = 0
size = 0
while True:
    try:
        head_len = get_header_length(gz, pos)
        block_len = get_block_length(gz, pos + head_len)
        size += get_orig_size(gz, pos + head_len + block_len)
        pos += head_len + block_len + 8
    except Exception:
        break
print('uncompressed size of "%s" is: %i bytes' % (gz_path, size))
But how to get_block_length?!? :|
This was probably never intended because ... "stream data". But I don't wanna give up now.
One big bummer already: even 7-Zip shows such a big .gz with exactly the uncompressed size given by just the very last 4 bytes.
Does someone have another idea?
First off, no, there do not need to be multiple members. There is no limit on the length of a gzip member. If the uncompressed data is more than 4 GB, then the last four bytes simply represent that length modulo 2**32. A gzip file with more than 4 GB of uncompressed data is in fact very likely to be a single member.
Second, the fact that you can have multiple members is true even for small gzip files. The uncompressed data does not need to be more than 4 GB for the last four bytes of the file to be useless.
The only way to reliably determine the amount of uncompressed data in a gzip file is to decompress it. You don't have to write the data out, but you have to process the entire gzip file and count the number of uncompressed bytes.
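A minimal sketch of that counting approach (the function name and chunk size here are arbitrary choices):
import gzip

def gz_uncompressed_size(path, chunk_size=1024 * 1024):
    total = 0
    with gzip.open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)   # count uncompressed bytes without storing them
    return total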
I'm coming here to leave an estimation approach for what you are looking for. The good answer is the one given by Mark Adler: the only reliable way to determine the uncompressed size of a gzip file is to actually decompress it.
But I'm working with an estimate that will usually give good results, although it can fail at the boundaries. The assumptions are:
there is only one stream in the file
the beginning of the stream has a compression ratio similar to that of the whole file
The idea is to get the compression ratio of the beginning of the file (decompress a 1 MB sample and measure), use it to extrapolate the uncompressed size from the compressed size, and finally substitute the 32 least significant bits with the size modulo 2**32 read from the end of the gzip stream.
The caveat comes at multiples of the 4 GiB boundary, as the extrapolation could over- or underestimate the size and give an estimate displaced by +/-4 GiB.
The code would be:
from io import SEEK_END
import os
import struct
import zlib

def estimate_uncompressed_gz_size(filename):
    # From the input file, get some data:
    # - the 32 LSB of the uncompressed size from the end of the gzip stream
    # - a 1 MB sample of compressed data
    # - the compressed file size
    with open(filename, "rb") as gz_in:
        sample = gz_in.read(1000000)
        gz_in.seek(-4, SEEK_END)
        lsb = struct.unpack('<I', gz_in.read(4))[0]  # the ISIZE field is little-endian
        file_size = os.fstat(gz_in.fileno()).st_size

    # Estimate the total size by decompressing the sample to get the
    # compression ratio, then extrapolate the uncompressed size
    # from that ratio and the real file size
    dobj = zlib.decompressobj(31)
    d_sample = dobj.decompress(sample)

    compressed_len = len(sample) - len(dobj.unconsumed_tail)
    decompressed_len = len(d_sample)
    estimate = file_size * decompressed_len // compressed_len

    # Clear the 32 LSB and substitute the value read from the file
    mask = ~0xFFFFFFFF
    adjusted_estimate = (estimate & mask) | lsb
    return adjusted_estimate
A workaround for the stated caveats could be to check the difference between the estimate and the adjusted estimate and, if it is bigger than 2 GiB, add/subtract 4 GiB accordingly. But in the end it will always be an estimate, not a reliable number.
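A rough sketch of that workaround, reusing the two values computed in the function above (the helper name is made up):
GIB = 1024**3

def adjust_for_wraparound(estimate, adjusted_estimate):
    # If substituting the 32 LSB moved the value by more than 2 GiB,
    # the extrapolation probably landed on the wrong 4 GiB multiple.
    diff = adjusted_estimate - estimate
    if diff > 2 * GIB:
        adjusted_estimate -= 4 * GIB
    elif diff < -2 * GIB:
        adjusted_estimate += 4 * GIB
    return adjusted_estimate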

simple reading of fortran binary data not so simple in python

I have a binary output file from a FORTRAN code. Want to read it in python. (Reading with FORTRAN and outputting text to read for python is not an option. Long story.) I can read the first record in a simplistic manner:
>>> binfile=open('myfile','rb')
>>> pad1=struct.unpack('i',binfile.read(4))[0]
>>> ver=struct.unpack('d',binfile.read(8))[0]
>>> pad2=struct.unpack('i',binfile.read(4))[0]
>>> pad1,ver,pad2
(8,3.13,8)
Just fine. But this is a big file and I need to do this more efficiently. So I try:
>>> (pad1,ver,pad2)=struct.unpack('idi',binfile.read(16))
This won't run. Gives me an error and tells me that unpack needs an argument with a length of 20. This makes no sense to me since the last time I checked, 4+8+4=16. When I give in and replace the 16 with 20, it runs, but the three numbers are populated with numerical junk. Does anyone see what I am doing wrong? Thanks!
The size you get is due to alignment; try struct.calcsize('idi') to verify that the size is actually 20 after alignment. To use the native byte order without alignment, prefix the format string with '=' (check with struct.calcsize('=idi')) and adapt your example accordingly.
For more info on the struct module, check http://docs.python.org/2/library/struct.html
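For illustration, on a typical 64-bit machine the two sizes compare like this:
>>> import struct
>>> struct.calcsize('idi')    # native alignment pads the double to an 8-byte boundary
20
>>> struct.calcsize('=idi')   # '=' keeps the native byte order but disables padding
16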
The struct module is mainly intended to interoperate with C structures and because of this it aligns the data members. idi corresponds to the following C structure:
struct
{
int int1;
double double1;
int int2;
}
double entries require 8 byte alignment in order to function efficiently (or even correctly) with most CPU load operations. That's why 4 bytes of padding are being added between int1 and double1, which increases the size of the structure to 20 bytes. The same padding is performed by the struct module, unless you suppress the padding by adding < (on little endian machines) or > (on big endian machines), or simply = at the beginning of the format string:
>>> struct.unpack('idi', d)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
struct.error: unpack requires a string argument of length 20
>>> struct.unpack('<idi', d)
(-1345385859, 2038.0682530887993, 428226400)
>>> struct.unpack('=idi', d)
(-1345385859, 2038.0682530887993, 428226400)
(d is a string of 16 random chars.)
I recommend using the array module to read a file that was written by FORTRAN with UNFORMATTED, SEQUENTIAL.
Your specific example using arrays, would be as follows:
import array
binfile=open('myfile','rb')
pad = array.array('i')
ver = array.array('d')
pad.fromfile(binfile,1) # read the length of the record
ver.fromfile(binfile,1) # read the actual data written by FORTRAN
pad.fromfile(binfile,1) # read the length of the record
If you have FORTRAN records that write arrays of integers and doubles, which is very common, your python would look something like this:
import array
binfile=open('myfile','rb')
pad = array.array('i')
my_integers = array.array('i')
my_floats = array.array('d')
number_of_integers = 1000 # replace with how many you need to read
number_of_floats = 10000 # replace with how many you need to read
pad.fromfile(binfile,1) # read the length of the record
my_integers.fromfile(binfile,number_of_integers) # read the integer data
my_floats.fromfile(binfile,number_of_floats) # read the double data
pad.fromfile(binfile,1) # read the length of the record
Final comment is that if you have characters on the file, you can read those into an array as well, and then decode it into a string. Something like this:
import array
binfile=open('myfile','rb')
pad = array.array('i')
my_characters = array.array('B')
number_of_characters = 63 # replace with number of characters to read
pad.fromfile(binfile,1) # read the length of the record
my_characters.fromfile(binfile,number_of_characters ) # read the data
my_string = my_characters.tobytes().decode(encoding='utf_8')
pad.fromfile(binfile,1) # read the length of the record

Storing 'struct' data to binary file

I need to store a binary file with a 12 byte header composed of 4 fields. They are namely: sSamples (4-bytes integer), sSampPeriod (4-bytes integer), sSampSize (2-bytes integer), and finally sParmKind (2-bytes integer).
I'm using 'struct' to pack my variables into the desired fields. Now that I have them defined separately, how can I merge them all to store the 12-byte header?
sSamples = struct.pack('i', nSamples) # 4-bytes integer
sSampPeriod = struct.pack('i', nSampPeriod) # 4-bytes integer
sSampSize = struct.pack('H', nSampSize) # 2-bytes integer / unsigned short
sParmKind = struct.pack('H', 9) # 2-bytes integer / unsigned short
In addition, I have an npVect float array of dimensionality D (numpy.ndarray - float32). How could I store this vector in the same binary file, after the header?
As Cody Brocious wrote, you can pack your entire header at once:
header = struct.pack('<iiHH', nSamples, nSampPeriod, nSampSize, nParmKind)
He also mentioned endianness, which is important if you want to pack your data so as to reliably unpack it on machines with different architectures. The < at the beginning of my format string specifies "pack this data using a little-endian convention".
As for the array, you'll have to pack its length in order to determine how many values to unpack when you read it again. Doing it all in one call:
flattened = npVect.ravel() # get a 1-D array of numbers
arrSize = len(flattened)
# pack header, count of numbers, and numbers, all in one call
packed = struct.pack('<iiHHi%df' % arrSize,
                     nSamples, nSampPeriod, nSampSize, nParmKind, arrSize, *flattened)
Depending on how big your array is likely to be, you could end up with a huge string representing the entire contents of your binary file, and you might want to look into alternatives to struct which don't require you to have the entire file in memory.
Unpacking:
fmt = '<iiHHi'
nSamples, nSampPeriod, nSampSize, nParmKind, arrSize = struct.unpack(fmt, packed)
# Use unpack_from to start reading after the packed header and count
flattened = struct.unpack_from('<%df' % arrSize, packed, struct.calcsize(fmt))
npVect = np.array(flattened, dtype='float32').reshape(# your dimensions go here
)
EDIT: Oops, the array format isn't quite as simple as that :) The general idea holds, though: flatten your array into a list of numbers using any method you like, pack the number of values, then pack each value. On the other side, read the array as a flat list, then impose whatever structure you need on it.
EDIT: Changed format strings to use repeat specifiers, rather than string multiplication. Thanks to John Machin for pointing it out.
EDIT: Added numpy code to flatten the array before packing and reconstruct it after unpacking.
struct.pack returns a string of bytes (a bytes object in Python 3), so you can combine the fields simply by concatenation:
header = sSamples + sSampPeriod + sSampSize + sParmKind
assert len( header ) == 12
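Either way, here is a minimal end-to-end sketch of writing the 12-byte header followed by the float data and reading both back; the file name and values are made up for illustration:
import struct
import numpy as np

nSamples, nSampPeriod, nSampSize, nParmKind = 1000, 100000, 4, 9
npVect = np.zeros(12, dtype=np.float32)                # placeholder data

with open('header_and_data.bin', 'wb') as f:
    f.write(struct.pack('<iiHH', nSamples, nSampPeriod, nSampSize, nParmKind))
    f.write(npVect.astype('<f4').tobytes())            # raw little-endian floats after the header

with open('header_and_data.bin', 'rb') as f:
    nSamples, nSampPeriod, nSampSize, nParmKind = struct.unpack('<iiHH', f.read(12))
    npVect = np.frombuffer(f.read(), dtype='<f4')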

What is the equivalent of 'fread' from Matlab in Python?

I have practically no knowledge of Matlab, and need to translate some parsing routines into Python. They are for large files, that are themselves divided into 'blocks', and I'm having difficulty right from the off with the checksum at the top of the file.
What exactly is going on here in Matlab?
status = fseek(fid, 0, 'cof');
fposition = ftell(fid);
disp(' ');
disp(['** Block ',num2str(iBlock),' File Position = ',int2str(fposition)]);
% ----------------- Block Start ------------------ %
[A, count] = fread(fid, 3, 'uint32');
if(count == 3)
    magic_l = A(1);
    magic_h = A(2);
    block_length = A(3);
else
    if(fposition == file_length)
        disp(['** End of file OK']);
    else
        disp(['** Cannot read block start magic ! Note File Length = ',num2str(file_length)]);
    end
    ok = 0;
    break;
end
fid is the file currently being looked at
iBlock is a counter for which 'block' you're in within the file
magic_l and magic_h are to do with checksums later, here is the code for that (follows straight from the code above):
disp(sprintf(' Magic_L = %08X, Magic_H = %08X, Length = %i', magic_l, magic_h, block_length));
correct_magic_l = hex2dec('4D445254');
correct_magic_h = hex2dec('43494741');
if(magic_l ~= correct_magic_l | magic_h ~= correct_magic_h)
    disp(['** Bad block start magic !']);
    ok = 0;
    return;
end
remaining_length = block_length - 3*4 - 3*4; % We read Block Header, and we expect a footer
disp(sprintf(' Remaining Block bytes = %i', remaining_length));
What is going on with the %08X and the hex2dec stuff?
Also, why specify 3*4 instead of 12?
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32'); in Python, as io.readline() is just pulling the first 3 characters of the file. Apologies if I'm missing the point somewhere here. It's just that using io.readline(3) on the file seems to return something it shouldn't, and I don't understand how the block_length can fit in a single byte when it could potentially be very long.
Thanks for reading this ramble. I hope you can understand kind of what I want to know! (Any insight at all is appreciated.)
Python Code for Reading a 1-Dimensional Array
When replacing Matlab with Python, I wanted to read binary data into a numpy.array, so I used numpy.fromfile to read the data into a 1-dimensional array:
import numpy as np
with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16)
Some advantages of using numpy.fromfile versus other Python solutions include:
Not having to manually determine the number of items to be read. You can specify them using the count= argument, but it defaults to -1 which indicates reading the entire file.
Being able to specify either an open file object (as I did above with fid) or you can specify a filename. I prefer using an open file object, but if you wanted to use a filename, you could replace the two lines above with:
data_array = np.fromfile(inputfilename, np.int16)
Matlab Code for a 2-Dimensional Array
Matlab's fread has the ability to read the data into a matrix of form [m, n] instead of just reading it into a column vector. For instance, to read data into a matrix with 2 rows use:
fid = fopen(inputfilename, 'r');
data_array = fread(fid, [2, inf], 'int16');
fclose(fid);
Equivalent Python Code for a 2-Dimensional Array
You can handle this scenario in Python using Numpy's shape and transpose.
import numpy as np
with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16).reshape((-1, 2)).T
The -1 tells numpy.reshape to infer the length of the array for that dimension based on the other dimension—the equivalent of Matlab's inf infinity representation.
The .T transposes the array so that it is a 2-dimensional array with the first dimension—the axis—having a length of 2.
From the documentation of fread, it is a function to read binary data. The second argument specifies the size of the output vector, the third one the size/type of the items read.
In order to recreate this in Python, you can use the array module:
import array
f = open(...)  # open the file in binary mode, e.g. 'rb'
a = array.array("I")  # 'I' (unsigned int) is 4 bytes on most platforms; 'L' can be 8 bytes on 64-bit systems
a.fromfile(f, 3)
This will read three uint32 values from the file f, which are available in a afterwards. From the documentation of fromfile:
Read n items (as machine values) from the file object f and append them to the end of the array. If less than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won’t do.
Arrays implement the sequence protocol and therefore support the same operations as lists, but you can also use the .tolist() method to create a normal list from the array.
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32');
In Matlab, one of fread()'s signatures is fread(fileID, sizeA, precision). This reads in the first sizeA elements (not bytes) of a file, each of a size sufficient for precision. In this case, since you're reading in uint32, each element is of size 32 bits, or 4 bytes.
So, instead, read 12 bytes at once with f.read(12) (readline stops at newline bytes, so it is not suited to binary data) to get the first three 4-byte elements from the file.
The first part is covered by Torsten's answer... you're going to need array or numarray to do anything with this data anyway.
As for the %08X and the hex2dec stuff, %08X is just the print format for those uint32 numbers (8-digit hex, exactly the same as Python), and hex2dec('4D445254') is Matlab for 0x4D445254.
Finally, ~= in Matlab means "not equal"; use != in Python.
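Putting the pieces together, a rough Python sketch of the block-header read and magic check might look like this (the file name is made up, and '<' assumes the data is little-endian):
import struct

with open('myblocks.bin', 'rb') as fid:
    raw = fid.read(12)                       # like fread(fid, 3, 'uint32'): 3 elements x 4 bytes
    if len(raw) == 12:
        magic_l, magic_h, block_length = struct.unpack('<3I', raw)
        print(' Magic_L = %08X, Magic_H = %08X, Length = %i' % (magic_l, magic_h, block_length))
        if magic_l != 0x4D445254 or magic_h != 0x43494741:   # hex2dec('4D445254') etc. in Matlab
            print('** Bad block start magic !')
    else:
        print('** End of file OK' if not raw else '** Cannot read block start magic !')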
