Switching endianness in the middle of a struct.unpack format string - python

I have a bunch of binary data (the contents of a video game save-file, as it happens) where a part of the data contains both little-endian and big-endian integer values. Naively, without reading much of the docs, I tried to unpack it this way...
struct.unpack(
'3sB<H<H<H<H4s<I<I32s>IbBbBbBbB12s20sBB4s',
string_data
)
...and of course I got this cryptic error message:
struct.error: bad char in struct format
The problem is that struct.unpack format strings do not expect individual fields to be marked with endianness. The actually correct format-string here would be something like
struct.unpack(
'<3sBHHHH4sII32sIbBbBbBbB12s20sBB4s',
string_data
)
except that this will flip the endianness of the third I field (parsing it as little-endian, when I really want to parse it as big-endian).
Is there an easy and/or "Pythonic" solution to my problem? I have already thought of three possible solutions, but none of them is particularly elegant. In the absence of better ideas I'll probably go with number 3:
I could extract a substring and parse it separately:
(my.f1, my.f2, ...) = struct.unpack('<3sBHHHH4sII32sIbBbBbBbB12s20sBB4s', string_data)
my.f11 = struct.unpack('>I', string_data[56:60])
I could flip the bits in the field after the fact:
(my.f1, my.f2, ...) = struct.unpack('<3sBHHHH4sII32sIbBbBbBbB12s20sBB4s', string_data)
my.f11 = swap32(my.f11)
I could just change my downstream code to expect this field to be represented differently — it's actually a bitmask, not an arithmetic integer, so it wouldn't be too hard to flip around all the bitmasks I'm using with it; but the big-endian versions of these bitmasks are more mnemonically relevant than the little-endian versions.

A little late to the party, but I just had the same problem. I solved it with a custom numpy dtype, which allows to mix elements with different endianess (see https://numpy.org/doc/stable/reference/generated/numpy.dtype.html):
t=np.dtype('>u4,<u4') # Compound type with two 4-byte unsigned int with different byte order
a=np.zeros(shape=1, dtype=t) # Create an array of length one with above type
a[0][0]=1 # Assign first uint
a[0][1]=1 # Assign second uint
bytes=a.tobytes() # bytes should be b'\x01\x00\x00\x00\x00\x00\x00\x01'
b=np.frombuffer(buf, dtype=t) # should yield array[(1,1)]
c=np.frombuffer(buf, dtype=np.uint32) # yields array([ 1, 16777216]

Related

Python Version for Ruby's array.pack() and unpack()?

In Ruby, I could easily pack an array representing some sequence into a binary string:
# for int
# "S*!" directive means format for 16-bit int, and using native endianess
# 16-bit int, so each digit was represented by two bytes. "\x01\x00" and "\x02\x00"
# here the native endianess is "little endian", so you should
# look at it backwards, "\x01\x00" becomes 0001, and "\x02\x00" becomes 0002
"\x01\x00\x02\x00".unpack("S!*")
# [1, 2]
# for hex
# "H*" means every element in the array is a digit for the hexstream
["037fea0651b358c361de"].pack("H*")
# "\x03\x7F\xEA\x06Q\xB3X\xC3a\xDE"
API doc for pack and unpack.
I couldn't find an uniform and equivalent way of transforming sequence to bytes (or vice versa) in python.
While struct provides methods for packing into bytes objects, the format string available has no option for hexstream.
EDIT: What I really want is something as versatile as Ruby's arr.pack and str.unpack, which supports multiple formatting and endianess control.
for a string in the utf-8 range it would be:
from binascii import unhexlify
strg = "464F4F"
unhexlify(strg).decode() # FOO (str)
if your content is just binary
strg = "037fea0651b358c361de"
unhexlify(strg) # b'\x03\x7f\xea\x06Q\xb3X\xc3a\xde' (bytes)
also bytes.fromhex (as in Davis Herring's answer) may be worth checking out.
struct does only fixed-width encodings that correspond to a memory dump of something like a C struct. You want bytes.fromhex or binascii.unhexlify, depending on the source type (which is never a list).
After any such conversion, you can use struct.unpack on a byte string containing any number of “records” corresponding to the format string; each is decoded into an element of the returned tuple. The format string supports the usual integer sizes and endianness choices; it is of course possible to construct a format dynamically to do things like read a matrix whose dimensions are chosen at runtime:
mat=struct.unpack("%dd"%cols,buf) # rows determined from len(buf)
It’s also possible to construct a lower-memory array if the element type is primitive; then you can follow up with byteswap as Alec A mentioned. NumPy offers similar facilities.
Try memoryview.cast, which allows you to change the endianness of an array or byte object.
Storing values as arrays makes things easier, as you can use the byteswap function.

Proper way for converting to bigendian for network submission

I need to get an int through the network. Is this the proper way to convert to bytes in big-endian?
pack("I",socket.htonl(integer_value))
I unpack it as:
socket.ntohl(unpack("I",data)[0])
I noticed that pack-unpack also have the <> to use for endian conversion so I am not sure if I could just directly use that instead or if htonl is safer.
You should use only the struct module for communicating with another system. By using the htonl first, you'll end up with an indeterminate order being transmitted.
Since you need to convert the integer into a string of bytes in order to send it to another system, you'll need to use struct.pack (because htonl just returns a different integer than the one passed as argument and you cannot directly send an integer). And in using struct.pack you must choose an endianness for that string of bytes (if you don't specify one, you'll get a default ordering which may not be the same on the receiving side so you really need to choose one).
Converting an integer to a sequence of bytes in a definite order is exactly what struct.pack("!I", integer_value) does and a sequence of bytes in a definite order is exactly what you need on the receiving end.
On the other hand, if you use struct.pack("!I", socket.htonl(integer_value)), what does that do? Well, first it puts the integer into big-endian order (network byte order), then it takes your already big-endian integer and converts it to bytes in "big-endian order". But, on a little endian machine, that will actually reverse the ordering again, and you will end up transmitting the integer in little-endian byte order if you do both those two operations.
But on a big-endian machine htonl is a no-op, and then you're converting the result into bytes in big-endian order.
So using ntohl actually defeats the purpose and a receiving machine would have to know the byte-order used on the sending machine in order to properly decode it. Observe...
Little-endian box:
>>> print(socket.htonl(27))
452984832
>>> print(struct.pack("!I", 27))
b'\x00\x00\x00\x1b'
>>> print(struct.pack("!I", socket.htonl(27)))
b'\x1b\x00\x00\x00'
Big-endian box:
>>> print(socket.htonl(27))
27
>>> print(struct.pack("!I", 27))
b'\x00\x00\x00\x1b'
>>> print(struct.pack("!I", socket.htonl(27)))
b'\x00\x00\x00\x1b'
struct.unpack() uses '!' in the format specifiers for network byte order. But its the same as '>'...

python proper way to decode raw udp packet

I'm making a script to get Valve's server information (players online, map, etc)
the packet I get when I request for information is this:
'\xff\xff\xff\xffI\x11Stargate Central CAP SBEP\x00sb_wuwgalaxy_fix\x00garrysmod\x00Spacebuild\x00\xa0\x0f\n\x0c\x00dw\x00\x0114.09.08\x00\xb1\x87i\x06\xb4g\x17.\x15#\x01gm:spacebuild3\x00\xa0\x0f\x00\x00\x00\x00\x00\x00'
This may help you to see what I'm trying to do https://developer.valvesoftware.com/wiki/Server_queries#A2S_INFO
The problem is, I don't know how to decode this properly, it's easy to get the string but I have no idea how to get other types like byte and short
for example '\xa0\x0f'
For now I'm doing multiple split but do you know if there is any better way of doing this?
Python has functions for encoding/decoding different data types into bytes. Take a look at the struct package, the functions struct.pack() and struct.unpack() are your friends there.
taken from https://docs.python.org/2/library/struct.html
>>> from struct import *
>>> pack('hhl', 1, 2, 3)
'\x00\x01\x00\x02\x00\x00\x00\x03'
>>> unpack('hhl', '\x00\x01\x00\x02\x00\x00\x00\x03')
(1, 2, 3)
The first argument of the unpack function defines the format of the data stored in the second argument. Now you need to translate the description given by valve to a format string. If you wanted to unpack 2 bytes and a short from a data string (that would have a length of 4 bytes, of course), you could do something like this:
(first_byte, second_byte, the_short) = unpack("cc!h", data)
You'll have to take care yourself, to get the correct part of the data string (and I don't know if those numbers are signed or not, be sure to take care of that, too).
The strings you'll have to do differently (they are null-terminated here, so start were you know a string starts and read to the first "\0" byte).
pack() work's the other way around and stores data in a byte string. Take a look at the examples on the python doc and play around with it a bit to get a feel for it (when a tuple is returned/needed, e.g.).
struct supports you in getting the right byte order, which most of the time is network byte order and different from your system. That is of course only necessary for multi byte integers (like short) - so a format string of `"!h" should unpack a short correctly.

python: unexpected behavior from struct pack()

I'm having trouble using the struct.pack() for packing an integer.
With
struct.pack("BIB", 1, 0x1234, 0)
I'm expecting
'\x01\x00\x00\x034\x12\x00'
but instead I got
'\x01\x00\x00\x004\x12\x00\x00\x00'
I'm probably missing something here. Please help.
'\x01\x00\x00\x004\x12\x00\x00\x00'
^ this '4' is not part of a hex escape
is actually the same as:
'\x01\x00\x00\x00\x34\x12\x00\x00\x00'
Because the ASCII code for "4" is 0x34.
Because you used the default (native) format, Python used native alignment for the data, so the second field was aligned to offset 4 and 3 zeroes were added before it.
To get a result more like what you wanted, use the format >BIB or <BIB (for big-endian or little-endian respectively) This gives you '\x01\x00\x00\x12\x34\x00' or '\x01\x34\x12\x00\x00\x00'. Neither of those are exactly what you specified, because the example you gave was not proper big-endian or little-endian representation of 0x1234.
See also: section Byte Order, Size, and Alignment in the documentation.
From the docs
Note By default, the result of packing a given C struct includes pad bytes in order to maintain proper alignment for the C types
involved; similarly, alignment is taken into account when unpacking.
This behavior is chosen so that the bytes of a packed struct
correspond exactly to the layout in memory of the corresponding C
struct. To handle platform-independent data formats or omit implicit
pad bytes, use standard size and alignment instead of native size and
alignment: see Byte Order, Size, and Alignment for details.
You can get your desired result by forcing the byte order. (chr(0x34) == '4')
>>> struct.pack(">BIB", 1, 0x1234, 0)
'\x01\x00\x00\x124\x00'

How Does One Read Bytes from File in Python

Similar to this question, I am trying to read in an ID3v2 tag header and am having trouble figuring out how to get individual bytes in python.
I first read all ten bytes into a string. I then want to parse out the individual pieces of information.
I can grab the two version number chars in the string, but then I have no idea how to take those two chars and get an integer out of them.
The struct package seems to be what I want, but I can't get it to work.
Here is my code so-far (I am very new to python btw...so take it easy on me):
def __init__(self, ten_byte_string):
self.whole_string = ten_byte_string
self.file_identifier = self.whole_string[:3]
self.major_version = struct.pack('x', self.whole_string[3:4]) #this
self.minor_version = struct.pack('x', self.whole_string[4:5]) # and this
self.flags = self.whole_string[5:6]
self.len = self.whole_string[6:10]
Printing out any value except is obviously crap because they are not formatted correctly.
If you have a string, with 2 bytes that you wish to interpret as a 16 bit integer, you can do so by:
>>> s = '\0\x02'
>>> struct.unpack('>H', s)
(2,)
Note that the > is for big-endian (the largest part of the integer comes first). This is the format id3 tags use.
For other sizes of integer, you use different format codes. eg. "i" for a signed 32 bit integer. See help(struct) for details.
You can also unpack several elements at once. eg for 2 unsigned shorts, followed by a signed 32 bit value:
>>> a,b,c = struct.unpack('>HHi', some_string)
Going by your code, you are looking for (in order):
a 3 char string
2 single byte values (major and minor version)
a 1 byte flags variable
a 32 bit length quantity
The format string for this would be:
ident, major, minor, flags, len = struct.unpack('>3sBBBI', ten_byte_string)
Why write your own? (Assuming you haven't checked out these other options.) There's a couple options out there for reading in ID3 tag info from MP3s in Python. Check out my answer over at this question.
I am trying to read in an ID3v2 tag header
FWIW, there's already a module for this.
I was going to recommend the struct package but then you said you had tried it. Try this:
self.major_version = struct.unpack('H', self.whole_string[3:5])
The pack() function convers Python data types to bits, and the unpack() function converts bits to Python data types.

Categories