Python says I need 4 bytes for a format code of "BH":
struct.error: unpack requires a string argument of length 4
Here is the code, I am putting in 3 bytes as I think is needed:
major, minor = struct.unpack("BH", self.fp.read(3))
"B" Unsigned char (1 byte) + "H" Unsigned short (2 bytes) = 3 bytes (!?)
struct.calcsize("BH") says 4 bytes.
EDIT: The file is ~800 MB and this is in the first few bytes of the file so I'm fairly certain there's data left to be read.
The struct module mimics C structures. It takes more CPU cycles for a processor to read a 16-bit word on an odd address or a 32-bit dword on an address not divisible by 4, so structures add "pad bytes" to make structure members fall on natural boundaries. Consider:
struct {
    char a;
    short b;
    char c;
    int d;
};

          11
012345678901
------------
axbbcxxxdddd
This structure will occupy 12 bytes of memory (x being pad bytes).
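One way to check this layout from Python is to mirror the struct with ctypes, which follows the platform's C alignment rules. This is only a minimal sketch: the class name Example is made up for illustration, and the offsets assume a typical platform with a 2-byte short and a 4-byte int.

```python
import ctypes

class Example(ctypes.Structure):
    # Same members, in the same order, as the C struct above.
    _fields_ = [
        ("a", ctypes.c_char),
        ("b", ctypes.c_short),
        ("c", ctypes.c_char),
        ("d", ctypes.c_int),
    ]

# Byte offset at which each member starts.
offsets = {name: getattr(Example, name).offset for name, _ in Example._fields_}
# Total size, including the pad bytes.
total_size = ctypes.sizeof(Example)

print(offsets)
print(total_size)
```

The gaps between consecutive offsets are the pad bytes marked x in the diagram.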
Python works similarly (see the struct documentation):
>>> import struct
>>> struct.pack('BHBL',1,2,3,4)
'\x01\x00\x02\x00\x03\x00\x00\x00\x04\x00\x00\x00'
>>> struct.calcsize('BHBL')
12
Compilers usually have a way of eliminating padding. In Python, any of =<>! will eliminate padding:
>>> struct.calcsize('=BHBL')
8
>>> struct.pack('=BHBL',1,2,3,4)
'\x01\x02\x00\x03\x04\x00\x00\x00'
Beware of letting struct handle padding. In C, these structures:
struct A {
    short a;
    char b;
};

struct B {
    int a;
    char b;
};
are typically 4 and 8 bytes, respectively. The padding occurs at the end of the structure in case the structures are used in an array. This keeps the 'a' members aligned on correct boundaries for structures later in the array. Python's struct module does not pad at the end:
>>> struct.pack('LB',1,2)
'\x01\x00\x00\x00\x02'
>>> struct.pack('LBLB',1,2,3,4)
'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04'
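If trailing padding is needed to match a C array of such structures, the struct documentation describes ending the format with a zero-repeat code for the type whose alignment you want. A short sketch, using 'i' rather than 'L' because the size of a native long varies more between platforms (the exact numbers printed are still platform-dependent):

```python
import struct

unpadded = struct.calcsize("ib")   # int + char, no padding at the end
padded = struct.calcsize("ib0i")   # "0i" aligns the end to an int boundary

print(unpadded, padded)
```

The zero-repeat code adds no data of its own; it only rounds the total size up to the alignment of that type.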
By default, on many platforms the short will be aligned to an offset at a multiple of 2, so there will be a padding byte added after the char.
To disable this, use struct.unpack("=BH", data). The = prefix selects standard sizes with no alignment, so no padding is added:
>>> struct.calcsize('=BH')
3
The = character will use native byte ordering. You can also use < or > instead of = to force little-endian or big-endian byte ordering, respectively.
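Applied to the read in the question, this could look like the sketch below. The io.BytesIO object is a hypothetical stand-in for the real file, and little-endian byte order is an assumption; the actual file format decides which prefix is right.

```python
import io
import struct

# Hypothetical stand-in for the real file: 1 byte major, 2 bytes minor.
fp = io.BytesIO(b"\x01\x02\x00...rest of the file...")

header = struct.Struct("<BH")   # little-endian, standard sizes, no padding
major, minor = header.unpack(fp.read(header.size))

print(header.size, major, minor)
```

Reading header.size bytes instead of a hard-coded 3 keeps the read length and the format string in sync.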
I receive a 32-bit number over the serial line, using num = ser.read(4). Checking the value of num in the shell returns something like a very unreadable b'\xcbu,\x0c'.
I can check against the ASCII table to find the values of "u" and ",", and determine that the hex value of the received number is actually equal to "cb 75 2c 0c", or in the format that Python outputs, it's b'\xcb\x75\x2c\x0c'. I can also type the value into a calculator and convert it to decimal (or run int(0xcb752c0c) in Python), which returns 3413453836.
How can I do this conversion from a binary string literal to an integer in Python?
I found two alternatives to solve this problem.
Using the int.from_bytes(bytes, byteorder, *, signed=False) method
Using the struct.unpack(format, buffer) from the builtin struct module
Using int.from_bytes
Starting from Python 3.2, you can use int.from_bytes.
The second argument, byteorder, specifies the endianness of your bytestring. It can be either 'big' or 'little'. You can also use sys.byteorder to get your host machine's native byte order.
From the docs:
The byteorder argument determines the byte order used to represent the integer. If byteorder is "big", the most significant byte is at the beginning of the byte array. If byteorder is "little", the most significant byte is at the end of the byte array. To request the native byte order of the host system, use sys.byteorder as the byte order value.
int.from_bytes(bytes, byteorder, *, signed=False)
Code applied to your case:
>>> import sys
>>> int.from_bytes(b'\x11', byteorder=sys.byteorder)
17
>>> bin(int.from_bytes(b'\x11', byteorder=sys.byteorder))
'0b10001'
Here is the official demonstrative code from the docs:
>>> int.from_bytes(b'\x00\x10', byteorder='big')
16
>>> int.from_bytes(b'\x00\x10', byteorder='little')
4096
>>> int.from_bytes(b'\xfc\x00', byteorder='big', signed=True)
-1024
>>> int.from_bytes(b'\xfc\x00', byteorder='big', signed=False)
64512
>>> int.from_bytes([255, 0, 0], byteorder='big')
16711680
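For the four bytes from the question, a sketch along the same lines. Which byte order is correct depends on the device sending the data, so both are shown:

```python
raw = b'\xcbu,\x0c'   # the same bytes as b'\xcb\x75\x2c\x0c'

as_big = int.from_bytes(raw, byteorder='big')
as_little = int.from_bytes(raw, byteorder='little')

print(as_big)      # matches int(0xcb752c0c) from the question
print(as_little)   # the same bytes read in the opposite order
```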
Using the struct.unpack method
The function you need to achieve your goal is struct.unpack.
To understand where you can use it, we need to understand the parameters to give and their impact on the result.
struct.unpack(format, buffer)
Unpack from the buffer buffer (presumably packed by pack(format, ...)) according to the format string format. The result is a tuple even if it contains exactly one item. The buffer’s size in bytes must match the size required by the format, as reflected by calcsize().
The buffer is the bytes that we have to give and the format is how the bytestring should be read.
The format string is made up of format characters that describe the byte order, the C type and size of each value, and so on.
From the docs:
Format characters have the following meaning; the conversion between C and Python values should be obvious given their types. The ‘Standard size’ column refers to the size of the packed value in bytes when using standard size; that is, when the format string starts with one of '<', '>', '!' or '='. When using native size, the size of the packed value is platform-dependent.
This table lists format characters available in Python 3.10.6 (standard sizes in bytes):
Format   C type                Standard size
x        pad byte
c        char                  1
b        signed char           1
B        unsigned char         1
?        bool                  1
h        short                 2
H        unsigned short        2
i        int                   4
I        unsigned int          4
l        long                  4
L        unsigned long         4
q        long long             8
Q        unsigned long long    8
n        ssize_t
N        size_t
f        float                 4
d        double                8
s        char[]
p        char[]
P        void*
And here is the table of prefix characters that select byte order, size and alignment:
Character   Byte order               Size       Alignment
@           native                   native     native
=           native                   standard   none
<           little-endian            standard   none
>           big-endian               standard   none
!           network (= big-endian)   standard   none
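A small sketch of how the prefix changes the bytes produced for the same value. Only the result for = depends on the machine running it:

```python
import struct
import sys

little = struct.pack('<H', 1)    # least significant byte first
big = struct.pack('>H', 1)       # most significant byte first
network = struct.pack('!H', 1)   # network order is big-endian
native = struct.pack('=H', 1)    # whatever byte order the host uses

print(little, big, network, native, sys.byteorder)
```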
Examples
Here is an example of how you can use it:
>>> import struct
>>> format_chars = '<i' #4 bytes, endianness is 'little'
>>> struct.unpack(format_chars,b"f|cs")
(1935899750,)
Check the built-in struct module:
https://docs.python.org/3/library/struct.html
In your case, it should probably be something like:
import struct
struct.unpack(">I", b'\xcb\x75\x2c\x0c')
The right format depends on the endianness and on whether the value is signed or unsigned, so do read the entire doc.
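To make that dependence concrete, here is a sketch that unpacks the same four bytes with three different formats. Which one is right depends on the sender, which is not stated in the question:

```python
import struct

raw = b'\xcb\x75\x2c\x0c'

(unsigned_big,) = struct.unpack('>I', raw)      # big-endian, unsigned
(signed_big,) = struct.unpack('>i', raw)        # big-endian, signed
(unsigned_little,) = struct.unpack('<I', raw)   # little-endian, unsigned

print(unsigned_big, signed_big, unsigned_little)
```

The trailing comma in each assignment unwraps the one-element tuple that struct.unpack always returns.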
I'm trying to translate this C++ code into Python, but I'm having problems with the char* to ushort* conversion:
void sendAsciiCommand(string command) {
    unsigned int nchars = command.length() + 1;   // Char count of command string
    unsigned int nshorts = ceil(nchars / 2);      // Number of shorts to store the string
    std::vector<unsigned short> regs(nshorts);    // Vector of short registers

    // Transform char array to short array with endianness conversion
    unsigned short *ascii_short_ptr = (unsigned short *)(command.c_str());
    for (unsigned int i = 0; i < nshorts; i++)
        regs[i] = htons(ascii_short_ptr[i]);

    return std::string((char *)regs.data());
}
So far I have tried this code in Python 2.7:
from math import ceil
from array import array
command = "hello"
nchars = len(command) + 1
nshorts = ceil(nchars/2)
regs = array("H", command)
But it gives me the error:
ValueError: string length not a multiple of item size
Any help?
The exception text:
ValueError: string length not a multiple of item size
means what it says, i.e., the length of the string from which you are trying to create an array must be a multiple of the item size. In this case the item size is that of an unsigned short, which is 2 bytes. Therefore the length of the string must be a multiple of 2. hello has length 5, which is not a multiple of 2, so you can't create an array of 2-byte integers from it. It will work if the string is 6 bytes long, e.g. hello!.
>>> array("H", 'hello!')
array('H', [25960, 27756, 8559])
You might still need to convert to network byte order. array uses the native byte order on your machine, so if your native byte order is little endian you will need to convert it to big endian (network byte order). Use sys.byteorder to check and array.byteswap() to swap the byte order if required:
import sys
from array import array
s = 'hello!'
regs = array('H', s)
print(regs)
# array('H', [25960, 27756, 8559])
if sys.byteorder != 'big':
    regs.byteswap()
print(regs)
# array('H', [26725, 27756, 28449])
However, it's easier to use struct.unpack() to convert straight to network byte order if necessary:
import struct
s = 'hello!'
n = len(s) // struct.calcsize('H')  # integer division, so the count stays an int
regs = struct.unpack('!{}H'.format(n), s)
print(regs)
#(26725, 27756, 28449)
If you really need an array:
regs = array('H', struct.unpack('!{}H'.format(n), s))
It's also worth pointing out that your C++ code contains an error. If the string length is odd, an extra byte will be read at the end of the string and this will be included in the converted data. That extra byte will be \0, as the C string should be null-terminated, but the last unsigned short should either be ignored, or you should check that the length of the string is a multiple of the size of an unsigned short, just as Python does.
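Putting these pieces together, a rough translation of the C++ function might look like the sketch below. The name ascii_to_registers is made up, and appending a terminating NUL plus padding to an even length are assumptions about what the receiving side expects.

```python
import struct

def ascii_to_registers(command):
    # Include the terminating NUL, as command.c_str() does in the C++ code.
    data = command.encode("ascii") + b"\x00"
    # Pad to a whole number of 16-bit registers instead of reading past the end.
    if len(data) % 2:
        data += b"\x00"
    count = len(data) // 2
    # "!" unpacks in network (big-endian) order, which is what htons produces.
    return list(struct.unpack("!{}H".format(count), data))

regs = ascii_to_registers("hello")
print(regs)
```

Padding explicitly avoids the out-of-bounds read described above for odd-length input.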
On a 64-bit system an integer in Python takes 24 bytes. This is 3 times the memory that would be needed in e.g. C for a 64-bit integer. Now, I know this is because Python integers are objects. But what is the extra memory used for? I have my guesses, but it would be nice to know for sure.
Remember that the Python int type does not have a limited range like C int has; the only limit is the available memory.
Memory goes to storing the value, the current size of the integer storage (the storage size is variable to support arbitrary sizes), and the standard Python object bookkeeping (a pointer to the object's type and a reference count).
You can look up the longintrepr.h source (the Python 3 int type was traditionally known as the long type in Python 2); it makes effective use of the PyVarObject C type to track integer size:
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
The ob_digit array stores 'digits' of either 15 or 30 bits wide (depending on your platform); so on my 64-bit OS X system, an integer up to (2 ^ 30) - 1 uses 1 'digit':
>>> sys.getsizeof((1 << 30) - 1)
28
but if you use 2 30-bit digits in the number an additional 4 bytes are needed, etc:
>>> sys.getsizeof(1 << 30)
32
>>> sys.getsizeof(1 << 60)
36
>>> sys.getsizeof(1 << 90)
40
The base 24 bytes then are the PyObject_VAR_HEAD structure, holding the object size, the reference count and the type pointer (each 8 bytes / 64 bits on my 64-bit OS X platform).
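The per-digit growth can be checked without hard-coding 30 bits, since sys.int_info reports the digit width of the running interpreter. A small sketch, assuming CPython; the absolute sizes vary between versions, but the step between them should equal the digit size:

```python
import sys

bits = sys.int_info.bits_per_digit        # 15 or 30, depending on the build
digit_bytes = sys.int_info.sizeof_digit   # bytes used to store one digit

# Integers that need 1, 2, 3 and 4 digits respectively.
sizes = [sys.getsizeof(1 << (bits * k)) for k in range(4)]
# Growth in bytes from each size to the next.
steps = [b - a for a, b in zip(sizes, sizes[1:])]

print(bits, digit_bytes, sizes, steps)
```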
On Python 2, integers <= sys.maxint but >= -sys.maxint - 1 are stored using a simpler structure storing just the single value:
typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;
because this uses PyObject instead of PyVarObject there is no ob_size field in the struct and the memory size is limited to just 24 bytes; 8 for the long value, 8 for the reference count and 8 for the type object pointer.
From longintrepr.h, we see that a Python 'int' object is defined with this C structure:
struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
digit is an unsigned integer type, 32 bits wide on builds that use 30-bit digits. The bulk of the space is taken by the variable-size object header. From object.h, we can find its definition:
typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
We can see that a Py_ssize_t, which is 64 bits wide on a 64-bit system, stores the count of "digits" in the value. This is possibly wasteful. We can also see that the general object header has a 64-bit reference count and a pointer to the object type, which also takes 64 bits of storage. The reference count is necessary for Python to know when to deallocate the object, and the pointer to the object type is necessary to know that we have an int and not, say, a string, as C structures have no way to test the type of an object from an arbitrary pointer.
_PyObject_HEAD_EXTRA is defined to nothing on most builds of Python, but can be used to store a linked list of all Python objects on the heap if the build enables that option, using another two pointers of 64 bits each.
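Some of these numbers are visible from Python itself through the __basicsize__ and __itemsize__ attributes of a type. A sketch, assuming a regular CPython build; the exact values differ between versions and build options:

```python
import struct
import sys

pointer_size = struct.calcsize('P')   # size of a C pointer on this build

object_header = object.__basicsize__  # plain object header: refcount + type pointer
int_header = int.__basicsize__        # int header that precedes the digits
per_digit = int.__itemsize__          # bytes added for each digit

print(pointer_size, object_header, int_header, per_digit)
```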
I'm trying to pack integers as bytes in python and unpack them in C. So in my python code I have something like
testlib = ctypes.CDLL('/something.so')
testlib.process(repr(pack('B',10)))
which packs 10 as a byte and calls the function "process" in my C code.
What do I need in my C code to unpack this packed data? That is, what do I need to do to get 10 back from the given packed data.
Assuming you have a 10 byte string containing 10 integers, just copy the data.
char packed_data[10];
int unpacked[10];
int i;
for(i = 0; i < 10; ++i)
    unpacked[i] = packed_data[i];
... or using memcpy
On the other hand, if you're using 4 bytes per int when packing, you can split the char string in C and use atoi on it. How are you exchanging data from Python to C?
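One thing worth checking on the Python side is that the raw packed bytes are passed, not their repr(), which is printable text. The sketch below uses ctypes to reinterpret the packed bytes roughly the way C code would see them; it does not call the real library, and the three-int layout is only an illustration.

```python
import ctypes
import struct

packed = struct.pack('B', 10)   # one raw byte with value 10
as_text = repr(packed)          # printable text, not the raw byte

# Reinterpret the raw byte the way C code reading an unsigned char would.
value = ctypes.c_ubyte.from_buffer_copy(packed).value

# Several ints: '=' keeps native byte order but fixes the size at 4 bytes.
packed_ints = struct.pack('=3i', 1, 2, 3)
ints = list((ctypes.c_int32 * 3).from_buffer_copy(packed_ints))

print(len(packed), len(as_text), value, ints)
```

On the C side, the matching read for the second case would be an array of three int32_t values.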
From http://www.cs.bell-labs.com/cm/cs/pearls/sol01.html:
The C code looks like this:
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];
void set(int i) { a[i>>SHIFT] |= (1<<(i & MASK)); }
void clr(int i) { a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int test(int i){ return a[i>>SHIFT] & (1<<(i & MASK)); }
I found ctypes, BitArrays and numpy, but I'm not sure whether they could be as efficient as the C code above.
For example, if I write code like this:
from ctypes import c_int
a=[c_int(9)]*1024*1024
Would the space used be 1 MB, or much more?
Does anyone know some good libraries that can do the same things in Python?
Numpy or ctypes are both good choices. But are you sure your Python code really needs to be as efficient as C, and are you sure this code is a performance hotspot?
The best thing to do is to use the Python profiler to ensure that this code really does need to be as efficient as C. If it truly does, then it will probably be easiest to just keep the code in C and link to it using something like ctypes or SWIG.
Edit: To answer your updated question, a numpy array of size N with element size M will contain N*M bytes of contiguous memory, plus a header and some bytes for views.
Here are a couple of related links:
Python memory usage of numpy arrays
Memory usage of numpy-arrays
You can also check the built-in array module:
>>> import array
>>> help(array)
Help on built-in module array:
NAME
array
FILE
(built-in)
DESCRIPTION
This module defines an object type which can efficiently represent
an array of basic values: characters, integers, floating point
numbers. Arrays are sequence types and behave very much like lists,
except that the type of objects stored in them is constrained. The
type is specified at object creation time by using a type code, which
is a single character. The following type codes are defined:
Type code C Type Minimum size in bytes
'b' signed integer 1
'B' unsigned integer 1
'u' Unicode character 2 (see note)
'h' signed integer 2
'H' unsigned integer 2
'i' signed integer 2
'I' unsigned integer 2
'l' signed integer 4
'L' unsigned integer 4
'f' floating point 4
'd' floating point 8
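As a rough illustration, the set/clr/test functions from the question could be sketched on top of an array of unsigned bytes. The class name BitVector is made up, and 8-bit words are used instead of the 32-bit words of the C version because the size of the larger type codes is platform-dependent.

```python
from array import array

class BitVector:
    def __init__(self, nbits):
        # One unsigned byte holds 8 bits; round up to whole bytes.
        self.bits = array('B', bytes((nbits + 7) // 8))

    def set(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)

    def clr(self, i):
        # Mask with 0xFF so the inverted value still fits in one byte.
        self.bits[i >> 3] &= ~(1 << (i & 7)) & 0xFF

    def test(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

v = BitVector(10000000)
v.set(12345)
print(v.test(12345), v.test(12346), len(v.bits) * v.bits.itemsize)
```

The storage is one contiguous buffer of about N/8 bytes, unlike a list, which holds a reference per element.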
This:
a=[c_int()]
makes a list which contains a reference to a c_int object.
Multiplying the list merely duplicates the references, so:
a = [c_int()] * 1024 * 1024
actually creates a list of 1024 * 1024 references to the same single c_int object.
If you want an array of 1024 * 1024 c_ints, do this:
a = (c_int * (1024 * 1024))()
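The difference between the two is easy to see in a short sketch. A smaller length of 1024 is used here to keep it quick, and the name IntArray is only illustrative:

```python
import ctypes
from ctypes import c_int

shared = [c_int(0)] * 3    # three references to one c_int object
shared[0].value = 7        # the change is visible through every element

IntArray = c_int * 1024    # an array type, not yet an array
a = IntArray()             # calling the type creates a zero-initialized array
a[0] = 7                   # only element 0 changes

print(shared[2].value, a[0], a[1], ctypes.sizeof(a))
```

ctypes.sizeof(a) reports the contiguous storage of the array, which is the element count times the size of a C int.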