I have a function that reads a binary file and then unpacks the file's contents using struct.unpack(). My function works just fine. It is faster when I unpack the whole file with one long format string. The problem is that sometimes the byte order changes partway through the file, so a single format string (which would then be invalid) would look like '<10sHHb>llh' (this is just an example; they are usually much longer). Is there any ultra-slick/Pythonic way of handling this situation?
Nothing super-slick, but if speed counts: the struct module's top-level functions are wrappers that must repeatedly recheck a cache for the struct.Struct instance corresponding to the format string. While you do need separate format strings for the differently ordered chunks, you can solve part of your speed problem by avoiding that repeated cache check.
Instead of doing:
buffer = memoryview(somedata)
allresults = []
while buffer:
    allresults += struct.unpack_from('<10sHHb', buffer)
    buffer = buffer[struct.calcsize('<10sHHb'):]
    allresults += struct.unpack_from('>llh', buffer)
    buffer = buffer[struct.calcsize('>llh'):]
You'd do:
buffer = memoryview(somedata)
structa = struct.Struct('<10sHHb')
structb = struct.Struct('>llh')
allresults = []
while buffer:
    allresults += structa.unpack_from(buffer)
    buffer = buffer[structa.size:]
    allresults += structb.unpack_from(buffer)
    buffer = buffer[structb.size:]
No, it's not much nicer looking, and the speed gains aren't likely to blow you away. But you've got weird data, so this is the least brittle solution.
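If you want to check whether the precompiled-Struct version helps on your own data, a quick timeit harness along these lines might do (the zeroed two-record payload here is made up purely for illustration):

```python
import struct
import timeit

# Hypothetical payload: two alternating records (little-endian then
# big-endian), all zero bytes, purely for timing purposes.
record = b'\x00' * (struct.calcsize('<10sHHb') + struct.calcsize('>llh'))
data = record * 2

structa = struct.Struct('<10sHHb')
structb = struct.Struct('>llh')

def with_module_functions():
    buf = memoryview(data)
    results = []
    while buf:
        results += struct.unpack_from('<10sHHb', buf)
        buf = buf[struct.calcsize('<10sHHb'):]
        results += struct.unpack_from('>llh', buf)
        buf = buf[struct.calcsize('>llh'):]
    return results

def with_compiled_structs():
    buf = memoryview(data)
    results = []
    while buf:
        results += structa.unpack_from(buf)
        buf = buf[structa.size:]
        results += structb.unpack_from(buf)
        buf = buf[structb.size:]
    return results

# Same values either way; only the lookup overhead differs.
assert with_module_functions() == with_compiled_structs()
print(timeit.timeit(with_module_functions, number=10000))
print(timeit.timeit(with_compiled_structs, number=10000))
```

The precompiled version typically shaves a modest constant factor off each call; the gap grows with the number of chunks per record.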
If you want an unnecessarily clever/brittle solution, you could do this with custom ctypes Structures, nesting a BigEndianStructure inside a LittleEndianStructure or vice versa. For your example format:
from ctypes import *

class BEStruct(BigEndianStructure):
    _pack_ = 1
    # c_int32 matches struct's fixed-width '>l'; c_long may be 8 bytes
    # on some platforms.
    _fields_ = [('x', 2 * c_int32), ('y', c_short)]

class MainStruct(LittleEndianStructure):
    _pack_ = 1
    _fields_ = [('a', 10 * c_char), ('b', 2 * c_ushort), ('c', c_byte), ('big', BEStruct)]
would give you a structure such that you could do:
mystruct = MainStruct()
memoryview(mystruct).cast('B')[:] = bytes(range(25))
and you'd then get results in the expected order, e.g.:
>>> hex(mystruct.b[0]) # Little endian as expected in main struct
'0xb0a'
>>> hex(mystruct.big.x[0]) # Big endian from inner big endian structure
'0xf101112'
While clever in a way, it will likely run slower (ctypes attribute lookup is surprisingly slow in my experience), and unlike the struct module functions, you can't unpack into top-level named variables in a single line; it's attribute access all the way down.
I would like to move data from one variable to another.
I have the following code:
a = 'a' # attempting to move the contents of b into here
b = 'b'
obj = ctypes.py_object.from_address(id(a))
obj2 = ctypes.py_object.from_address(id(b))
ptr = ctypes.pointer(obj)
ptr2 = ctypes.pointer(obj2)
ctypes.memmove(ptr, ptr2, ctypes.sizeof(obj2))
print(a, b) # expected result: b b
a does not change, and gives no errors.
Is this simply not possible, or is it something I am doing wrong?
NOT RECOMMENDED But interesting for learning...
It's possible on CPython due to the implementation detail that id(obj) returns the address of the internal PyObject, but it's a very bad idea. Python strings are immutable, so corrupting their inner workings will break things. Python objects carry internal data such as reference counts, a type pointer, and a length, all of which get corrupted by blindly copying over them.
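As a quick, read-only illustration of that implementation detail (nothing is overwritten here, so this one is safe to run):

```python
import ctypes

# In CPython, id(obj) is the address of the underlying PyObject, so
# casting that address back to py_object recovers the original object.
# This only reads memory; nothing is modified.
a = 'hello'
recovered = ctypes.cast(id(a), ctypes.py_object).value
assert recovered is a
```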
import ctypes as ct
import sys
# Using strings that are more unique and less likely to be used inside Python
# (lower reference counts).
a = '123'
b = '456'
# Create ctypes byte buffers that reference the same memory as a and b
bytes_a = (ct.c_ubyte * sys.getsizeof(a)).from_address(id(a))
bytes_b = (ct.c_ubyte * sys.getsizeof(b)).from_address(id(b))
# View the bytes as hex. The first bytes are the reference counts.
# The last bytes are the ASCII bytes of the strings.
print(bytes(bytes_a).hex())
print(bytes(bytes_b).hex())
ct.memmove(bytes_b, bytes_a, len(bytes_a))
# Does what you want, but Python crashes on exit in my case
print(a,b)
Output:
030000000000000060bc9563fc7f00000300000000000000bf4fda89331c3232e5a5a97d1b020000000000000000000031323300
030000000000000060bc9563fc7f00000300000000000000715a1b84492b4696e5feaf7d1b020000000000000000000034353600
123 123
Exception ignored deletion of interned string failed:
KeyError: '123'
A safe way to make a copy of the memory and view it:
import ctypes as ct
import sys
a = '123'
# Copy memory at address to a Python bytes object.
bytes_a = ct.string_at(id(a), sys.getsizeof(a))
print(bytes_a.hex())
Output:
020000000000000060bc5863fc7f000003000000000000001003577d19c6d60be59f53919b010000000000000000000031323300
I'm programming an interface with 3M document scanners.
I am calling a function called MMMReader_GetData
MMMReaderErrorCode MMMReader_GetData(MMMReaderDataType aDataType,void* DataPtr,int* aDataLen);
Description:
After a data item has been read from a document it may be obtained via
this API. The buffer supplied in the aDataPtr parameter will be
written to with the data, and aDataLen updated to be the length of the
data.
The problem is how to create the void* aDataPtr and how to get the data back out of it.
I have tried:
from ctypes import *
lib=cdll.LoadLibrary('MMMReaderHighLevelAPI.dll')
CD_CODELINE = 0
aDataLen = c_int()
aDataPtr = c_void_p()
index= c_int(0)
r = lib.MMMReader_GetData(CD_CODELINE,byref(aDataPtr),byref(aDataLen),index)
aDataLen always returns a value but aDataPtr returns None
What you need to do is allocate a "buffer". The address of the buffer will be passed as the void* parameter, and the size of the buffer in bytes will be passed as the aDataLen parameter. Then the function will put its data in the buffer you gave it, and then you can read the data back out of the buffer.
In C or C++ you would use malloc or something similar to create a buffer. When using ctypes, you can use ctypes.create_string_buffer to make a buffer of a certain length, and then pass the buffer and the length to the function. Then once the function fills it in, you can read the data out of the buffer you created, which works like a list of characters with [] and len().
With ctypes, it is best to define the argument types and the return type for better error checking; declaring pointer types correctly is especially important on 64-bit systems.
from ctypes import *
MMMReaderErrorCode = c_int # Set to an appropriate type
MMMReaderDataType = c_int # ditto...
lib = CDLL('MMMReaderHighLevelAPI')
lib.MMMReader_GetData.argtypes = [MMMReaderDataType, c_void_p, POINTER(c_int)]
lib.MMMReader_GetData.restype = MMMReaderErrorCode
CD_CODELINE = 0
# Make sure to pass in the original buffer size.
# Assumption: the API should update it on return with the actual size used (or needed)
# and will probably return an error code if the buffer is not large enough.
aDataLen = c_int(256)
# Allocate a writable buffer of the correct size.
aDataPtr = create_string_buffer(aDataLen.value)
# aDataPtr is already a pointer, so no need to pass it by reference,
# but aDataLen is a reference so the value can be updated.
r = lib.MMMReader_GetData(CD_CODELINE,aDataPtr,byref(aDataLen))
On return you can access just the returned portion of the buffer by string slicing, e.g.:
>>> from ctypes import *
>>> aDataLen = c_int(10)
>>> aDataPtr = create_string_buffer(aDataLen.value)
>>> aDataPtr.raw
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> aDataLen.value = 5 # Value gets updated
>>> aDataPtr[:aDataLen.value] # Get the valid portion of buffer
'\x00\x00\x00\x00\x00'
There are several issues with your code:
You need to allocate the buffer pointed to by aDataPtr.
You need to pass the buffer length in aDataLen. According to [1], if the buffer isn't big enough, MMMReader_GetData will reallocate it as needed.
You should pass aDataPtr directly, not byref.
You are passing an extra argument (index) that does not appear in the declaration of MMMReader_GetData you provided.
Try the following:
import ctypes
lib = ctypes.cdll.LoadLibrary('MMMReaderHighLevelAPI.dll')
CD_CODELINE = 0
aDataLen = ctypes.c_int(1024)
aDataPtr = ctypes.create_string_buffer(aDataLen.value)
err = lib.MMMReader_GetData(CD_CODELINE, aDataPtr, ctypes.byref(aDataLen))
Then you can read the content of the buffer as a regular character array. The actual length is returned back for you in aDataLen.
[1] 3M Page Reader Programmers' Guide: https://wenku.baidu.com/view/1a16b6d97f1922791688e80b.html
I have created an array of integers, and I would like it to be interpreted according to the structure definition I have created.
import array
from ctypes import *

class MyStruct(Structure):
    _fields_ = [("init", c_uint),
                ("state", c_char),
                ("constant", c_int),
                ("address", c_uint),
                ("size", c_uint),
                ("sizeMax", c_uint),
                ("start", c_uint),
                ("end", c_uint),
                ("timestamp", c_uint),
                ("location", c_uint),
                ("nStrings", c_uint),
                ("nStringsMax", c_uint),
                ("maxWords", c_uint),
                ("sizeFree", c_uint),
                ("stringSizeMax", c_uint),
                ("stringSizeFree", c_uint),
                ("recordCount", c_uint),
                ("categories", c_uint),
                ("events", c_uint),
                ("wraps", c_uint),
                ("consumed", c_uint),
                ("resolution", c_uint),
                ("previousStamp", c_uint),
                ("maxTimeStamp", c_uint),
                ("threshold", c_uint),
                ("notification", c_uint),
                ("version", c_ubyte)]

# arr = array.array('I', [1])
# How can I do this?
# mystr = MyStruct(arr)  # magic
# (mystr.init == 1) == True
I can do the following:
mystr = MyStruct()
rest = array.array('I')
with open('myfile.bin', 'rb') as binaryFile:
    binaryFile.readinto(mystr)
    rest.fromstring(binaryFile.read())
# Now create another struct with rest
rest.readinto(mystr)  # Does not work
How can I avoid using a file to convert an array of Ints to a struct if the data is contained in an array.array('I')? I am not sure what the Structure constructor accepts or how the readinto works.
Solution #1: Star unpacking for one-line initialization
Star-unpacking will work, but only if all the fields in your structure are integer types. In Python 2.x, c_char cannot be initialized from an int (it works fine in 3.5). If you change the type of state to c_byte, then you can just do:
mystr = MyStruct(*myarr)
This doesn't actually benefit from any array specific magic (the values are briefly converted to Python ints in the unpacking step, so you're not reducing peak memory usage), so you'd only bother with an array if initializing said array was easier than directly reading into the structure for whatever reason.
If you go the star-unpacking route, reading .state will now get you int values instead of length-1 str values. If you want to initialize with an int but read back a one-character str, you can use a protected name wrapped in a property:
class MyStruct(Structure):
    _fields_ = [...
                ("_state", c_byte),  # "Protected" name, int-like; constructor expects int
                ...]

    @property
    def state(self):
        return chr(self._state)

    @state.setter
    def state(self, x):
        if isinstance(x, basestring):
            x = ord(x)
        self._state = x
A similar technique could be used without properties by defining your own __init__ that converts the state argument:
class MyStruct(Structure):
    _fields_ = [("init", c_uint),
                ("state", c_char),
                ...]

    def __init__(self, init=0, state=b'\0', *args, **kwargs):
        if not isinstance(state, basestring):
            state = chr(state)
        super(MyStruct, self).__init__(init, state, *args, **kwargs)
Solution #2: Direct memcpy-like solutions to reduce temporaries
You can use some array-specific magic to avoid the temporary Python-level ints (and avoid the need to change state to c_byte) without real file objects, by using a fake in-memory file-like object:
import io
mystr = MyStruct() # Default initialize
# Use BytesIO to gain the ability to write the raw bytes to the struct
# because BytesIO's readinto isn't finicky about exact buffer formats
io.BytesIO(myarr.tostring()).readinto(mystr)
# In Python 3, where array implements the buffer protocol, you can simplify to:
io.BytesIO(myarr).readinto(mystr)
# This still performs two memcpys (one occurs internally in BytesIO), but
# it's faster by avoiding a Python level method call
This only works because your non-c_int width attributes are followed by c_int width attributes (so they're padded out to four bytes anyway); if you had two c_ubyte/c_char/etc. types back to back, then you'd have problems (because one value of the array would initialize two fields in the struct, which does not appear to be what you want).
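A small sketch of that caveat, using made-up structs: ctypes pads the lone narrow field out to c_uint alignment, while two narrow fields back to back share a single 4-byte unit (sizes assume the usual 4-byte int alignment):

```python
from ctypes import Structure, c_ubyte, c_uint, sizeof

# One narrow field: padded out to the next c_uint boundary, so each
# 4-byte array element lands on exactly one field.
class Padded(Structure):
    _fields_ = [("a", c_ubyte), ("b", c_uint)]

# Two narrow fields back to back: they share one 4-byte unit, so a
# single array element would have to initialize both fields at once.
class BackToBack(Structure):
    _fields_ = [("a", c_ubyte), ("b", c_ubyte), ("c", c_uint)]

print(sizeof(Padded), sizeof(BackToBack))  # both 8 with 4-byte ints
```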
If you were using Python 3, you could benefit from array specific magic to avoid the cost of both unpacking and the two step memcpy of the BytesIO technique (from array -> bytes -> struct). It works in Py3 because Py3's array type supports the buffer protocol (it didn't in Py2), and because Py3's memoryview features a cast method that lets you change the format of the memoryview to make it directly compatible with array:
mystr = MyStruct() # Default initialize
# Make a view on mystr's underlying memory that behaves like a C array of
# unsigned ints in native format (matching array's type code)
# then perform a "memcpy" like operation using empty slice assignment
# to avoid creating any Python level values.
memoryview(mystr).cast('B').cast('I')[:] = myarr
Like the BytesIO solution, this only works because your fields all happen to pad out to four bytes.
Performance
Performance-wise, star unpacking wins for small numbers of fields, but for large numbers of fields (your case has a couple dozen), direct memcpy based approaches win out; in tests for a 23 field class, the BytesIO solution won over star unpacking on my Python 2.7 install by a factor of 2.5x (star unpacking was 2.5 microseconds, BytesIO was 1 microsecond).
The memoryview solution scales similarly to the BytesIO solution, though as of 3.5, it's slightly slower than the BytesIO approach (likely a result of the need to construct several temporary memoryviews to perform the necessary casting operations and/or the memoryview slice assignment code being general purpose for many possible formats, so it's not simple memcpy in implementation). memoryview might scale better for much larger copies (if the losses are due to the fixed cast overhead), but it's rare that you'd have a struct large enough to matter; it would only be in more general purpose copying scenarios (to and from ctypes arrays or the like) that memoryview would potentially win.
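For reference, a benchmark along the lines described might look like this (Fields is a made-up 23-field stand-in; absolute numbers will vary by machine and Python version):

```python
import io
import timeit
from array import array
from ctypes import Structure, c_uint

# Made-up stand-in for a 23-field, all-c_uint structure.
class Fields(Structure):
    _fields_ = [("f%d" % i, c_uint) for i in range(23)]

myarr = array('I', range(23))

def star_unpack():
    return Fields(*myarr)

def bytes_io():
    s = Fields()
    io.BytesIO(myarr.tobytes()).readinto(s)
    return s

# Both approaches fill the struct identically.
assert star_unpack().f22 == bytes_io().f22 == 22
print(timeit.timeit(star_unpack, number=20000))
print(timeit.timeit(bytes_io, number=20000))
```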
Does this have to be an array? Could you maybe use a list? To unpack a list into a function call, you can use the * operator:
mystr = MyStruct(*arr)
or a dict with:
mystr = MyStruct(**arr)
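Both spellings work for a ctypes Structure because its constructor accepts positional and keyword field values; a minimal sketch with a made-up two-field struct:

```python
from ctypes import Structure, c_uint

# Made-up two-field struct for illustration.
class Pair(Structure):
    _fields_ = [("x", c_uint), ("y", c_uint)]

p = Pair(*[1, 2])              # positional, from a list
q = Pair(**{"x": 1, "y": 2})   # keyword, from a dict of field names
assert (p.x, p.y) == (q.x, q.y) == (1, 2)
```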
I have a string buffer: b = create_string_buffer(numb) where numb is a number of bytes.
In my wrapper I need to slice this buffer up. When calling a function that expects a POINTER(c_char), I can do myfunction(self, byref(b, offset)), but in a Structure:
class mystruct(Structure):
    _fields_ = [("buf", POINTER(c_char))]
I am unable to do this; I get an argument type exception. So my question is: how can I assign .buf to be an offset into b? Direct assignment (.buf = b) works, but is unsuitable. (Python does not hold up too well against ~32,000 such buffers being created every second, hence my desire to slice up a single buffer.)
Use ctypes.cast:
>>> import ctypes
>>> b = ctypes.create_string_buffer(500)
>>> b[:6] = 'foobar'
>>> ctypes.cast(ctypes.byref(b, 4), ctypes.POINTER(ctypes.c_char))
<ctypes.LP_c_char object at 0x100756e60>
>>> _.contents
c_char('a')
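Applied to the structure from the question, the cast result can be assigned to the field directly; a sketch in Python 3 syntax (buffer contents and offset are made up):

```python
import ctypes

class mystruct(ctypes.Structure):
    _fields_ = [("buf", ctypes.POINTER(ctypes.c_char))]

b = ctypes.create_string_buffer(500)
b[:6] = b"foobar"

s = mystruct()
# cast() converts the offset reference to the declared pointer type,
# so the field assignment passes ctypes' type check.
s.buf = ctypes.cast(ctypes.byref(b, 4), ctypes.POINTER(ctypes.c_char))
assert s.buf[0] == b"a"  # byte at offset 4 of the buffer
```

Note that the struct stores only the pointer; you must keep b alive yourself, since ctypes does not track the buffer's lifetime through a cast.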
This might be a silly question but I couldn't find a good answer in the docs or anywhere.
If I use struct to define a binary structure, the struct has 2 symmetrical methods for serialization and deserialization (pack and unpack) but it seems ctypes doesn't have a straightforward way to do this. Here's my solution, which feels wrong:
from ctypes import *

class Example(Structure):
    _fields_ = [
        ("index", c_int),
        ("counter", c_int),
    ]

def Pack(ctype_instance):
    buf = string_at(byref(ctype_instance), sizeof(ctype_instance))
    return buf

def Unpack(ctype, buf):
    cstring = create_string_buffer(buf)
    ctype_instance = cast(pointer(cstring), POINTER(ctype)).contents
    return ctype_instance
if __name__ == "__main__":
    e = Example(12, 13)
    buf = Pack(e)
    e2 = Unpack(Example, buf)
    assert(e.index == e2.index)
    assert(e.counter == e2.counter)
    # note: for some reason e == e2 is False...
The PythonInfo wiki has a solution for this.
FAQ: How do I copy bytes to Python from a ctypes.Structure?
def send(self):
    return buffer(self)[:]
FAQ: How do I copy bytes to a ctypes.Structure from Python?
def receiveSome(self, bytes):
    fit = min(len(bytes), ctypes.sizeof(self))
    ctypes.memmove(ctypes.addressof(self), bytes, fit)
Their send is the (more-or-less) equivalent of pack, and receiveSome is sort of a pack_into. If you have a "safe" situation where you're unpacking into a struct of the same type as the original, you can one-line it like memmove(addressof(y), buffer(x)[:], sizeof(y)) to copy x into y. Of course, you'll probably have a variable as the second argument, rather than a literal packing of x.
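On Python 3, where the buffer built-in no longer exists, the same pair can be written with bytes() and memmove; a sketch using the Example struct from the question:

```python
import ctypes

class Example(ctypes.Structure):
    _fields_ = [("index", ctypes.c_int), ("counter", ctypes.c_int)]

def send(self):
    return bytes(self)  # Python 3 replacement for buffer(self)[:]

def receive_some(self, data):
    # Copy at most sizeof(self) bytes into the struct's memory.
    fit = min(len(data), ctypes.sizeof(self))
    ctypes.memmove(ctypes.addressof(self), data, fit)

e = Example(12, 13)
e2 = Example()
receive_some(e2, send(e))
assert (e2.index, e2.counter) == (12, 13)
```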
Have a look at this link on binary i/o in python:
http://www.dabeaz.com/blog/2009/08/python-binary-io-handling.html
Based on this, you can simply write the following to read a file directly into a ctypes structure (no intermediate buffer needed):
g = open("foo","rb")
q = Example()
g.readinto(q)
To write is simply:
g.write(q)
The same for using sockets:
s.send(q)
and
s.recv_into(q)
I did some testing with pack/unpack and ctypes, and this approach is the fastest except for writing it straight in C.
Tested on Python 3:
e = Example(12, 13)
serialized = bytes(e)
deserialized = Example.from_buffer_copy(serialized)
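Made self-contained with the Example class from the question, the round trip checks out:

```python
import ctypes

class Example(ctypes.Structure):
    _fields_ = [("index", ctypes.c_int), ("counter", ctypes.c_int)]

e = Example(12, 13)
serialized = bytes(e)                                 # pack
deserialized = Example.from_buffer_copy(serialized)   # unpack
assert (deserialized.index, deserialized.counter) == (12, 13)
# e == deserialized is still False: ctypes Structures don't define
# field-wise equality, so == falls back to identity.
```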