Unpacking binary files - Python

I have to read a binary file, so I have totally immersed myself in the Python struct module.
However there are still things that confuse me. Let's consider the following chunk of code:
import struct
print struct.pack('5c', *'Hello')
to_pack = (5.9, 14.87, 'HEAD', 32321, 238, 99)
packed = struct.pack('2f4s3i', *to_pack)
print "packed: ", packed
output:
Hello
packed: �̼#��mAHEADA~�
I packed, in order, two floats, a 4-character string, and three integers.
Then when unpacking:
unpacked = struct.unpack('2f4s3i', packed)
print "unpacked: ", unpacked
Output:
unpacked: (5.900000095367432, 14.869999885559082, 'HEAD', 32321, 238, 99)
So the packing function turned my original data into binary data, while unpacking did the
opposite. But does that mean I always have to know how my data is organized, i.e. which
types are encoded and in what order?
What if I don't know; how could I guess the right types and order? For instance, if I do:
unpacked = struct.unpack('2f4s3h', packed) # I replaced the 3i with 3h
print "unpacked: ", unpacked
I would get a nice error:
unpacked = struct.unpack('2f4s3h', packed)
struct.error: unpack requires a string argument of length 18
So it seems to me that whatever binary data I get when reading a binary file, if I don't
know the correct types in the right order, I cannot convert it back to its original form.
Is there a way to convert the data back to non-binary without specifying the expected types,
or would I really be stuck with an unusable binary file?
I mean, even for people who create huge binary files from gigantic datasets, how would they
manage to retrieve their data successfully?
For information, my example was taken from this pdf file: https://gebloggendings.files.wordpress.com/2012/07/struct.pdf

Yes, it's raw binary data, so you need to tell Python about its structure in order to usefully unpack it. Python doesn't know whether that 24-byte blob of data you created in packed is 6 floats, or 6 ints, or 3 doubles, or any combination of those, or something completely different.
>>> from struct import unpack
>>> unpack('6f', packed)
(5.900000095367432, 14.869999885559082, 773.08251953125, 4.5291367665442413e-41, 3.3350903450930646e-43, 1.3872854796815689e-43)
>>> unpack('6i', packed)
(1086115021, 1097722757, 1145128264, 32321, 238, 99)
>>> unpack('3d', packed)
(15686698.023046875, 6.8585591728324e-310, 2.10077583423e-312)
>>> unpack('dfid', packed)
(15686698.023046875, 773.08251953125, 32321, 2.10077583423e-312)

Related

Write List of Tuples (int, float) to Stream Without Converting to String

I have a list in Python that consists of tuples with the following format: (int, float). I want to write this list to an io byte or raw stream without having to convert the ints and/or floats to strings. How can I do this? Thanks.
There are many formats which can be used to serialize Python objects into bytes. There are pros and cons for each of them.
If the data is only a list of tuples of integers and floats, that makes the job rather simple.
Let's assume, this is the data:
data = 100 * [(1, 1.111), (18, 1.234), (555555, 0.001), (-1, 1e70)]
Which formats fall into the category of "strings" is not clear to me. The most obvious "string" format would be str(data). How big is it?
>>> len(str(data))
5500
This takes up 5500 bytes. The question asks for something more compressed. So, we're looking for something much shorter than 5500 bytes.
JSON is a very popular format (it is also a string). How big is it?
>>> import json
>>> len(json.dumps(data))
5500
This has the same size (5500 bytes), but at least it is well defined. Can it be smaller? How about a BZipped JSON?
>>> import bz2
>>> len(bz2.compress(json.dumps(data).encode('utf-8')))
131
That is much better!
This was probably very good because of a repeating pattern. Is there a format which does not use zipping? Maybe pickle?
>>> import pickle
>>> len(pickle.dumps(data))
862
Not as good as zip (of course!), but still good.
Could we make a BZipped pickle?
>>> len(bz2.compress(pickle.dumps(data)))
155
Better, but there is no reason for it to be better than BZipped JSON.
How about some other format? You could convert each tuple to the equivalent of this C structure, using the struct module:
struct {
    int    i;
    double f;
};
However, then you'd have to know how big the int can be. A Python int can be as big as you want, but if you know, say, that all numbers are between 0 and 255, you need just one byte. For the float, you need 64 bits (i.e. 8 bytes), or you lose precision. The sample data above contains ints like 555555 and -1, so each tuple needs a 4-byte int plus an 8-byte double, which for 400 tuples comes to 4800 bytes. Not very good.
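A quick sanity check of that arithmetic with struct.calcsize:
>>> import struct
>>> struct.calcsize('<id')  # one 4-byte int plus one 8-byte double, no padding
12
>>> struct.calcsize('<id') * len(data)
4800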
There are also other built-in options documented in Python's documentation on Persistence.
You can also invent your own format.
In the end, you have to decide what suits you best.
You can dump integers and floats into bytes directly really easily using the struct module.
>>> import struct
>>> data = [(2, 1.0), (3, 2.0), (25, 55.5)]
>>> for tup in data:
...     bytes_data = struct.pack("<ld", *tup)
...     print(bytes_data)
b'\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?'
b'\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@'
b'\x19\x00\x00\x00\x00\x00\x00\x00\x00\xc0K@'
As an aside, the string I use as the first argument to pack is a format specifier that tells struct the type and size of each value: < means little-endian, l is a signed long, and d is a double.
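To actually write these to a stream, as the question asks, something along these lines should work (a Python 3 sketch; io.BytesIO stands in for whatever binary stream you are writing to):
import io
import struct

data = [(2, 1.0), (3, 2.0), (25, 55.5)]

# Write each (int, float) pair into an in-memory binary stream.
stream = io.BytesIO()
for tup in data:
    stream.write(struct.pack("<ld", *tup))

# Read them all back; iter_unpack steps through the buffer 12 bytes at a time.
stream.seek(0)
recovered = list(struct.iter_unpack("<ld", stream.read()))
print(recovered)  # [(2, 1.0), (3, 2.0), (25, 55.5)]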

Python: Converting two sequential 2-byte registers (4 bytes) into IEEE floating-point big endian

I am hooking up an instrument to a laptop over TCP/IP. I have been using a python package to talk to it, and have it return numbers. There are two probes hooked up to this instrument, and I believe these are the bytes corresponding to the temperature readings of these two probes.
The instrument, by default, is set to Big Endian and these data should be of a 32-bit floating point variety - meaning that the variable (b) in the code chunk represents two numbers. b is representative of the output that I would get from the TCP functions.
>>> b = [16746, 42536, 16777, 65230]
>>>
My goal is to convert these into their float values and to automate the process. Currently, I am running b through the hex() function to retrieve the hexadecimal equivalent of each value:
>>> c =[hex(value) for value in b]
>>>
>>> c
['0x416a', '0xa628', '0x4189', '0xfece']
>>>
... then I have manually created data_1 and data_2 below to match these hex values, then unpacked them using struct.unpack as I found in this other answer:
>>> data_1 = b'\x41\x6a\xa6\x28'
>>> import struct
>>> struct.unpack('>f', data_1)
(14.665565490722656,)
>>> data_2 = b'\x41\x89\xfe\xce'
>>> struct.unpack('>f', data_2)
(17.24941635131836,)
>>>
Some questions:
Am I fundamentally missing something? I am a biologist by trade, and usually a R programmer, so Python is relatively new to me.
I am primarily looking for a streamlined way to get from the TCP output (b) to the number outputs of struct.unpack. The eventual goal of this project is to constantly be polling the sensors for data, which will be graphed/displayed on screen as well as being saved to a .csv.
Thank you!
The function below produces the same numbers you found:
import struct
def bigIntToFloat(bigIntlist):
    pair = []
    for bigInt in bigIntlist:
        pair.append(bytes.fromhex(format(bigInt, '04x')))
        if len(pair) == 2:
            yield struct.unpack('>f', b''.join(pair))[0]
            pair = []
The key parts are format(bigInt, '04x'), which turns an integer into a hex value without the (in this case) unneeded '0x' prefix while ensuring it's zero-padded to four characters, and bytes.fromhex, which turns the output of that into a bytes object suitable for struct.unpack.
As for whether you're missing something, that's hard for me to say, but I will say that the numbers you give look "reasonable": if you had the ordering wrong, I'd expect the numbers to be vastly different from each other rather than only slightly.
The simplest way is to use struct.pack to turn those numbers back into a byte string, then unpack as you were doing. pack and unpack can also work with multiple values at a time; the only snag is that pack expects individual arguments instead of a list, so you must put a * in front to expand the list.
>>> struct.unpack('>2f', struct.pack('>4H', *b))
(14.665565490722656, 17.24941635131836)
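If you want to go straight from the register list to floats without the manual hex step, a small helper along these lines should do (a sketch; registers_to_floats is a made-up name, and it assumes the registers arrive big-endian, two per float):
import struct

def registers_to_floats(regs):
    # Pack the 16-bit registers back into raw bytes (big endian),
    # then reinterpret every 4 bytes as one IEEE-754 float.
    raw = struct.pack('>%dH' % len(regs), *regs)
    return list(struct.unpack('>%df' % (len(regs) // 2), raw))

print(registers_to_floats([16746, 42536, 16777, 65230]))
# [14.665565490722656, 17.24941635131836]
From there, polling in a loop and appending each pair of floats to a .csv with the csv module is straightforward.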

How would you unpack a 32bit int in Python?

I'm fairly weak with structs but I have a feeling they're the best way to do this. I have a large string of binary data and need to pull 32 of those chars, starting at a specific index, and store them as an int. What is the best way to do this?
Since I need to start at an initial position, I have been playing with struct.unpack_from(). Based on the format table here, I thought the 'i' format, being 4 bytes, was exactly what I needed, but the code below executes and prints "(825307441,)" where I was expecting either the binary, decimal or hex form. Can anyone explain to me what 825307441 represents?
Also is there a method of extracting the data in a similar fashion but returning it in a list instead of a tuple? Thank you
st = "1111111111111111111111111111111"
test = struct.unpack_from('i',st,0)
print test
Just use int
>>> st = "1111111111111111111111111111111"
>>> int(st,2)
2147483647
>>> int(st[1:4],2)
7
You can slice the string any way you want to get the indices you desire. Passing 2 to int tells it that the string you're giving it is in base 2 (binary).
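To answer the other part of the question: 825307441 is what you get when the first four characters of the string, "1111", are read as a single little-endian 4-byte integer, since the ASCII code of '1' is 0x31:
>>> hex(825307441)
'0x31313131'
And if you do want the struct result as a list rather than a tuple, just wrap it: list(struct.unpack_from('i', st, 0)).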

simple reading of fortran binary data not so simple in python

I have a binary output file from a FORTRAN code and want to read it in Python. (Reading it with FORTRAN and writing out text for Python to read is not an option. Long story.) I can read the first record in a simplistic manner:
>>> import struct
>>> binfile=open('myfile','rb')
>>> pad1=struct.unpack('i',binfile.read(4))[0]
>>> ver=struct.unpack('d',binfile.read(8))[0]
>>> pad2=struct.unpack('i',binfile.read(4))[0]
>>> pad1,ver,pad2
(8, 3.13, 8)
Just fine. But this is a big file and I need to do this more efficiently. So I try:
>>> (pad1,ver,pad2)=struct.unpack('idi',binfile.read(16))
This won't run; it gives me an error saying that unpack needs an argument with a length of 20. That makes no sense to me, since the last time I checked, 4+8+4=16. When I give in and replace the 16 with 20, it runs, but the three numbers come out as numerical junk. Does anyone see what I am doing wrong? Thanks!
The size you get is due to alignment; try struct.calcsize('idi') to verify that the size really is 20 after alignment. To use the native byte order without alignment, use the format '=idi' instead (struct.calcsize('=idi') is 16) and adapt it to your example.
For more info on the struct module, check http://docs.python.org/2/library/struct.html
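You can see both sizes directly (shown on a typical 64-bit machine, where doubles are aligned to 8 bytes):
>>> import struct
>>> struct.calcsize('idi')   # native alignment pads after the first int
20
>>> struct.calcsize('=idi')  # '=' keeps native byte order but drops alignment
16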
The struct module is mainly intended to interoperate with C structures and because of this it aligns the data members. idi corresponds to the following C structure:
struct
{
    int    int1;
    double double1;
    int    int2;
};
double entries require 8-byte alignment in order to work efficiently (or even correctly) with most CPU load operations. That's why 4 bytes of padding are added between int1 and double1, which increases the size of the structure to 20 bytes. The struct module performs the same padding, unless you suppress it by putting < (on little-endian machines), > (on big-endian machines), or simply = at the beginning of the format string:
>>> struct.unpack('idi', d)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
struct.error: unpack requires a string argument of length 20
>>> struct.unpack('<idi', d)
(-1345385859, 2038.0682530887993, 428226400)
>>> struct.unpack('=idi', d)
(-1345385859, 2038.0682530887993, 428226400)
(d is a string of 16 random chars.)
I recommend using the array module to read a file that was written by FORTRAN with UNFORMATTED, SEQUENTIAL.
Your specific example using arrays, would be as follows:
import array
binfile=open('myfile','rb')
pad = array.array('i')
ver = array.array('d')
pad.fromfile(binfile,1) # read the length of the record
ver.fromfile(binfile,1) # read the actual data written by FORTRAN
pad.fromfile(binfile,1) # read the length of the record
If you have FORTRAN records that write arrays of integers and doubles, which is very common, your Python would look something like this:
import array
binfile=open('myfile','rb')
pad = array.array('i')
my_integers = array.array('i')
my_floats = array.array('d')
number_of_integers = 1000 # replace with how many you need to read
number_of_floats = 10000 # replace with how many you need to read
pad.fromfile(binfile,1) # read the length of the record
my_integers.fromfile(binfile,number_of_integers) # read the integer data
my_floats.fromfile(binfile,number_of_floats) # read the double data
pad.fromfile(binfile,1) # read the length of the record
A final comment: if you have characters in the file, you can read those into an array as well, and then decode it into a string. Something like this:
import array
binfile=open('myfile','rb')
pad = array.array('i')
my_characters = array.array('B')
number_of_characters = 63 # replace with number of characters to read
pad.fromfile(binfile,1) # read the length of the record
my_characters.fromfile(binfile,number_of_characters ) # read the data
my_string = my_characters.tobytes().decode(encoding='utf_8')
pad.fromfile(binfile,1) # read the length of the record
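If you would rather stay with struct, the same record structure can be read with a small helper that also checks the two length markers against each other (a sketch; read_fortran_record is a made-up name, and it assumes the common 4-byte, native-endian record markers, which can vary between compilers):
import struct

def read_fortran_record(f):
    # Read one UNFORMATTED, SEQUENTIAL record and return its raw payload.
    head = f.read(4)
    if not head:
        return None  # end of file
    (n,) = struct.unpack('=i', head)  # leading record-length marker
    payload = f.read(n)
    (trailer,) = struct.unpack('=i', f.read(4))
    if trailer != n:
        raise ValueError('corrupt record: %d != %d' % (n, trailer))
    return payload

binfile = open('myfile', 'rb')
rec = read_fortran_record(binfile)  # the first record from the question
(ver,) = struct.unpack('=d', rec)   # it held a single double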

Storing 'struct' data to binary file

I need to store a binary file with a 12-byte header composed of 4 fields. They are: sSamples (4-byte integer), sSampPeriod (4-byte integer), sSampSize (2-byte integer), and finally sParmKind (2-byte integer).
I'm using 'struct' to convert my variables to the desired fields. Now that I have them defined separately, how could I merge them all to store the 12-byte header?
sSamples = struct.pack('i', nSamples) # 4-bytes integer
sSampPeriod = struct.pack('i', nSampPeriod) # 4-bytes integer
sSampSize = struct.pack('H', nSampSize) # 2-bytes integer / unsigned short
sParmKind = struct.pack('H', 9) # 2-bytes integer / unsigned short
In addition, I have an npVect float array of dimensionality D (numpy.ndarray - float32). How could I store this vector in the same binary file, but after the header?
As Cody Brocious wrote, you can pack your entire header at once:
header = struct.pack('<iiHH', nSamples, nSampPeriod, nSampSize, nParmKind)
He also mentioned endianness, which is important if you want to pack your data so as to reliably unpack it on machines with different architectures. The < at the beginning of my format string specifies "pack this data using a little-endian convention".
As for the array, you'll have to pack its length in order to determine how many values to unpack when you read it again. Doing it all in one call:
flattened = npVect.ravel() # get a 1-D array of numbers
arrSize = len(flattened)
# pack header, count of numbers, and numbers, all in one call
packed = struct.pack('<iiHHi%df' % arrSize,
                     nSamples, nSampPeriod, nSampSize, nParmKind, arrSize, *flattened)
Depending on how big your array is likely to be, you could end up with a huge string representing the entire contents of your binary file, and you might want to look into alternatives to struct which don't require you to have the entire file in memory.
Unpacking:
fmt = '<iiHHi'
nSamples, nSampPeriod, nSampSize, nParmKind, arrSize = struct.unpack(fmt, packed)
# Use unpack_from to start reading after the packed header and count
flattened = struct.unpack_from('<%df' % arrSize, packed, struct.calcsize(fmt))
npVect = np.array(flattened, dtype='float32').reshape(  # your dimensions go here
)
EDIT: Oops, the array format isn't quite as simple as that :) The general idea holds, though: flatten your array into a list of numbers using any method you like, pack the number of values, then pack each value. On the other side, read the array as a flat list, then impose whatever structure you need on it.
EDIT: Changed format strings to use repeat specifiers, rather than string multiplication. Thanks to John Machin for pointing it out.
EDIT: Added numpy code to flatten the array before packing and reconstruct it after unpacking.
struct.pack returns a string (bytes, in Python 3), so you can combine the fields simply by concatenation:
header = sSamples + sSampPeriod + sSampSize + sParmKind
assert len(header) == 12
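For the numpy vector, one simple way to put it after the header (a sketch with made-up example values; npVect stands in for your real array) is to write the raw float32 buffer right behind the packed header:
import struct
import numpy as np

nSamples, nSampPeriod, nSampSize, nParmKind = 1000, 100000, 8, 9  # example values
npVect = np.zeros(16, dtype='float32')  # stand-in for the real vector

header = struct.pack('<iiHH', nSamples, nSampPeriod, nSampSize, nParmKind)
out = open('out.bin', 'wb')
out.write(header)             # the 12-byte header
out.write(npVect.tobytes())   # raw float32 bytes (native byte order) after it
out.close()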
