Python Struct for Binary Data

Python Struct for Binary Data - python

I'm trying to read binary files containing a stream of float and int16 values. Those values are stored alternating.
[float][int16][float][int16]... and so on
now I want to read this data file by a python program using the struct functions.
For reading a block of say one such float-int16-pairs I assume the format string would be "fh".
Following output makes sense, the total size is 6 bytes
In [73]: struct.calcsize('fh')
Out[73]: 6
Now I'd like to read larger blocks at once to speed up the program...
In [74]: struct.calcsize('fhfh')
Out[74]: 14
Why is this not returning 12?

Quoting the documentation:
Note By default, the result of packing a given C struct includes pad bytes in order to maintain proper alignment for the C types involved; similarly, alignment is taken into account when unpacking. This behavior is chosen so that the bytes of a packed struct correspond exactly to the layout in memory of the corresponding C struct. To handle platform-independent data formats or omit implicit pad bytes, use standard size and alignment instead of native size and alignment: see Byte Order, Size, and Alignment for details.
https://docs.python.org/2/library/struct.html
If you want calcsize('fhfh') to be exactly twice calcsize('fh'), then you'll need to specify an alignment character.
Try '<fhfh' or '>fhfh', instead.

You have to specify the byte order or Endianness as size and alignment are based off of that So if you try this:
>>> struct.calcsize('fhfh')
>>> 14
>>> struct.calcsize('>fhfh')
>>> 12
The reason why is because in struct not specifying an endian defaults to native
for more details check here: https://docs.python.org/3.0/library/struct.html#struct.calcsize

Related

Why I get this output while I trying to print this packed struct?

I running this code in python:
import struct
res = struct.pack('hhl', 1, 2, 3)
print(res)
and I get the following output:
b'\x01\x00\x02\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'
but I don't understand why this is the output? after all, the format h means 2 bytes, and the format l means 4 bytes. so why i get this output in this case?

From the struct doc,
To handle platform-independent data formats or omit implicit pad
bytes, use standard size and alignment instead of native size and
alignment
By default h may be padded, depending on the platform you are running on. Select "big-endian" (>), "little-endian" (<) or the alternate native style (=) to remove padding. For example,
>>> struct.Struct('hhl').size
16
>>> struct.Struct('<hhl').size
8
>>> struct.Struct('>hhl').size
8
>>> struct.Struct('=hhl').size
8
You would choose one depending on what pattern you are trying to match. If its a C structure for a natively compiled app, it depends on native memory layout (e.g., 16 bit architectures) and whether the compiler packs or pads. If a network protocol, big or little endian, depending on its specification.

Python Version for Ruby's array.pack() and unpack()?

In Ruby, I could easily pack an array representing some sequence into a binary string:
# for int
# "S*!" directive means format for 16-bit int, and using native endianess
# 16-bit int, so each digit was represented by two bytes. "\x01\x00" and "\x02\x00"
# here the native endianess is "little endian", so you should
# look at it backwards, "\x01\x00" becomes 0001, and "\x02\x00" becomes 0002
"\x01\x00\x02\x00".unpack("S!*")
# [1, 2]
# for hex
# "H*" means every element in the array is a digit for the hexstream
["037fea0651b358c361de"].pack("H*")
# "\x03\x7F\xEA\x06Q\xB3X\xC3a\xDE"
API doc for pack and unpack.
I couldn't find an uniform and equivalent way of transforming sequence to bytes (or vice versa) in python.
While struct provides methods for packing into bytes objects, the format string available has no option for hexstream.
EDIT: What I really want is something as versatile as Ruby's arr.pack and str.unpack, which supports multiple formatting and endianess control.

for a string in the utf-8 range it would be:
from binascii import unhexlify
strg = "464F4F"
unhexlify(strg).decode() # FOO (str)
if your content is just binary
strg = "037fea0651b358c361de"
unhexlify(strg) # b'\x03\x7f\xea\x06Q\xb3X\xc3a\xde' (bytes)
also bytes.fromhex (as in Davis Herring's answer) may be worth checking out.

struct does only fixed-width encodings that correspond to a memory dump of something like a C struct. You want bytes.fromhex or binascii.unhexlify, depending on the source type (which is never a list).
After any such conversion, you can use struct.unpack on a byte string containing any number of “records” corresponding to the format string; each is decoded into an element of the returned tuple. The format string supports the usual integer sizes and endianness choices; it is of course possible to construct a format dynamically to do things like read a matrix whose dimensions are chosen at runtime:
mat=struct.unpack("%dd"%cols,buf) # rows determined from len(buf)
It’s also possible to construct a lower-memory array if the element type is primitive; then you can follow up with byteswap as Alec A mentioned. NumPy offers similar facilities.

Try memoryview.cast, which allows you to change the endianness of an array or byte object.
Storing values as arrays makes things easier, as you can use the byteswap function.

Python struct calsize different from actual

I am trying to read one short and long from a binary file using python struct.
But the
print(struct.calcsize("hl")) # o/p 16
which is wrong, It should have been 2 bytes for short and 8 bytes for long. I am not sure i am using the struct module the wrong way.
When i print the value for each it is
print(struct.calcsize("h")) # o/p 2
print(struct.calcsize("l")) # o/p 8
Is there a way to force python to maintain the precision on datatypes?

By default struct alignment rules, 16 is the correct answer. Each field is aligned to match its size, so you end up with a short for two bytes, then six bytes of padding (to reach the next address aligned to a multiple of eight bytes), then eight bytes for the long.
You can use a byte order prefix (any of them disable padding), but they also disable machine native sizes (so struct.calcsize("=l") will be a fixed 4 bytes on all systems, and struct.calcsize("=hl") will be 6 bytes on all systems, not 10, even on systems with 8 byte longs).
If you want to compute struct sizes for arbitrary structures using machine native types with non-default padding rules, you'll need to go to the ctypes module, define your ctypes.Structure subclass with the desired _pack_ setting, then use ctypes.sizeof to check the size, e.g.:
from ctypes import Structure, c_long, c_short, sizeof
class HL(Structure):
_pack_ = 1 # Disables padding for field alignment
# Defines (unnamed) fields, a short followed by long
_fields_ = [("", c_short),
("", c_long)]
print(sizeof(HL))
which outputs 10 as desired.
This could be factored out as a utility function if needed (this is a simplified example that doesn't handle all struct format codes, but you can expand if needed):
from ctypes import *
FMT_TO_TYPE = dict(zip("cb?hHiIlLqQnNfd",
(c_char, c_byte, c_bool, c_short, c_ushort, c_int, c_uint,
c_long, c_ulong, c_longlong, c_ulonglong,
c_ssize_t, c_size_t, c_float, c_double)))
def calcsize(fmt, pack=None):
'''Compute size of a format string with arbitrary padding (defaults to native)'''
class _(Structure):
if pack is not None:
_pack_ = pack
_fields_ = [("", FMT_TO_TYPE[c]) for c in fmt]
return sizeof(_)
which, once defined, lets you compute sizes padded or unpadded like so:
>>> calcsize("hl") # Defaults to native "natural" alignment padding
16
>>> calcsize("hl", 1) # pack=1 means no alignment padding between members
10

This is what the doc says:
By default, the result of packing a given C struct includes pad bytes in order to maintain proper alignment for the C types involved; similarly, alignment is taken into account when unpacking. This behavior is chosen so that the bytes of a packed struct correspond exactly to the layout in memory of the corresponding C struct. To handle platform-independent data formats or omit implicit pad bytes, use standard size and alignment instead of native size and alignment
Changing it from standard to native is pretty easy: you just append the prefix = before the format characters.
print(struct.calcsize("=hl"))
EDIT
Since from the native to standard mode, some default sizes are changed, you have two options:
keeping the native mode, but switching the format characters, in this way: struct.calcsize("lh"). In C even the order of your variable inside the struct is important. Here the padding is 8 bytes, it means that every variable has to be referenced at multiple of 8 bytes.
Using the format characters of the standard mode, so: struct.calcsize("=hq")

python: unexpected behavior from struct pack()

I'm having trouble using the struct.pack() for packing an integer.
With
struct.pack("BIB", 1, 0x1234, 0)
I'm expecting
'\x01\x00\x00\x034\x12\x00'
but instead I got
'\x01\x00\x00\x004\x12\x00\x00\x00'
I'm probably missing something here. Please help.

'\x01\x00\x00\x004\x12\x00\x00\x00'
^ this '4' is not part of a hex escape
is actually the same as:
'\x01\x00\x00\x00\x34\x12\x00\x00\x00'
Because the ASCII code for "4" is 0x34.
Because you used the default (native) format, Python used native alignment for the data, so the second field was aligned to offset 4 and 3 zeroes were added before it.
To get a result more like what you wanted, use the format >BIB or <BIB (for big-endian or little-endian respectively) This gives you '\x01\x00\x00\x12\x34\x00' or '\x01\x34\x12\x00\x00\x00'. Neither of those are exactly what you specified, because the example you gave was not proper big-endian or little-endian representation of 0x1234.
See also: section Byte Order, Size, and Alignment in the documentation.

From the docs
Note By default, the result of packing a given C struct includes pad bytes in order to maintain proper alignment for the C types
involved; similarly, alignment is taken into account when unpacking.
This behavior is chosen so that the bytes of a packed struct
correspond exactly to the layout in memory of the corresponding C
struct. To handle platform-independent data formats or omit implicit
pad bytes, use standard size and alignment instead of native size and
alignment: see Byte Order, Size, and Alignment for details.
You can get your desired result by forcing the byte order. (chr(0x34) == '4')
>>> struct.pack(">BIB", 1, 0x1234, 0)
'\x01\x00\x00\x124\x00'

Python, how to put 32-bit integer into byte array

I usually perform things like this in C++, but I'm using python to write a quick script and I've run into a wall.
If I have a binary list (or whatever python stores the result of an "fread" in). I can access the individual bytes in it with: buffer[0], buffer[1], etc.
I need to change the bytes [8-11] to hold a new 32-bit file-size (read: there's already a filesize there, I need to update it). In C++ I would just get a pointer to the location and cast it to store the integer, but with python I suddenly realized I have no idea how to do something like this.
How can I update 4 bytes in my buffer at a specific location to hold the value of an integer in python?
EDIT
I'm going to add more because I can't seem to figure it out from the solutions (though I can see they're on the right track).
First of all, I'm on python 2.4 (and can't upgrade, big corporation servers) - so that apparently limits my options. Sorry for not mentioning that earlier, I wasn't aware it had so many less features.
Secondly, let's make this ultra-simple.
Lets say I have a binary file named 'myfile.binary' with the five-byte contents '4C53535353' in hex - this equates to the ascii representations for letters "L and 4xS" being alone in the file.
If I do:
f = open('myfile.binary', 'rb')
contents = f.read(5)
contents should (from Sven Marnach's answer) hold a five-byte immutable string.
Using Python 2.4 facilities only, how could I change the 4 S's held in 'contents' to an arbitrary integer value? I.e. give me a line of code that can make byte indices contents [1-4] contain the 32-bit integer 'myint' with value 12345678910.

What you need is this function:
struct.pack_into(fmt, buffer, offset, v1, v2, ...)
It's documented at http://docs.python.org/library/struct.html near the top.
Example code:
import struct
import ctypes
data=ctypes.create_string_buffer(10)
struct.pack_into(">i", data, 5, 0x12345678)
print list(data)
Similar posting: Python: How to pack different types of data into a string buffer using struct.pack_into
EDIT: Added a Python 2.4 compatible example:
import struct
f=open('myfile.binary', 'rb')
contents=f.read(5)
data=list(contents)
data[0:4]=struct.pack(">i", 0x12345678)
print data

Have a look at Struct module. You need pack function.
EDIT:
The code:
import struct
s = "LSSSS" # your string
s = s[0] + struct.pack('<I', 1234567891) # note "shorter" constant than in your example
print s
Output:
L╙☻ЦI
struct.pack should be available in Python2.4.
Your number "12345678910" cannot be packed into 4 bytes, I shortened it a bit.

The result of file.read() is a string in Python, and it is immutable. Depending on the context of the task you are trying to accomplish, there are different solutions to the problem.
One is using the array module: Read the file directly as an array of 32-bit integers. You can modify this array and write it back to the file.
with open("filename") as f:
f.seek(0, 2)
size = f.tell()
f.seek(0)
data = array.array("i")
assert data.itemsize == 4
data.fromfile(f, size // 4)
data[2] = new_value
# use data.tofile(g) to write the data back to a new file g

You could install the numpy module, which is often used for scientific computing.
read_data = numpy.fromfile(file=id, dtype=numpy.uint32)
Then access the data at the desired location and make your changes.

The following is just a demonstration for you to understand what really happens when the four bytes are converted into an integer.
Suppose you have a number: 15213
Decimal: 15213
Binary: 0011 1011 0110 1101
Hex: 3 B 6 D
On little-endian systems (i.e x86 machines), this number can be represented using a length-4 bytearray as: b"\x6d\x3b\x00\x00" or b"m;\x00\x00" when you print it on the screen, to convert the four bytes into an integer, we simply do a bit of base conversion, which in this case, is:
sum(n*(256**i) for i,n in enumerate(b"\x6d\x3b\x00\x00"))
This gives you the result: 15213

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.