Is it possible to define byte order when converting a numpy array to binary string (with tobytes())?
I would want to force little endianness, but I don't want byte-swapping if it is not necessary.
When interfacing with C code I use this pattern
numpy.ascontiguousarray(x, dtype='>i4')
That dtype string specifies the endianess and precise bit width.
You can check ndarray.flags to see if conversions are necessary.
Related
Consider a C buffer of N elements created with:
from ctypes import byref, c_double
N = 3
buffer = (c_double * N)()
# C++ function that fills the buffer byref
pull_function(byref(buffer))
# load buffer in numpy
data = np.frombuffer(buffer, dtype=c_double)
Works great. But my issue is that the dtype may be numerical (float, double, int8, ...) or string.
from ctypes import byref, c_char_p
N = 3
buffer = (c_char_p * N)()
# C++ function that fills the buffer byref
pull_function(byref(buffer))
# Load in.. a list?
data = [v.decode("utf-8") for v in buffer]
How can I load those UTF-8 encoded string directly in a numpy array? np.char.decode seems to be a good candidate, but I can't figure out how to use it. np.char.decode(np.frombuffer(buffer, dtype=np.bytes_)) is failing with ValueError: itemsize cannot be zero in type.
EDIT: The buffer can be filled from the Python API. The corresponding lines are:
x = [list of strings]
x = [v.encode("utf-8") for v in x]
buffer = (c_char_p * N)(*x)
push_function(byref(buffer))
Note that this is a different buffer from the one above. push_function pushes the data in x on the network while pull_function retrieves the data from the network. Both are part of the LabStreamingLayer C++ library.
Edit 2: I suspect I can get this to work if I can reload the 'push' buffer into a numpy array before sending it to the network. The 'pull' buffer is probably the same. In that sense, here is a MWE demonstrating the ValueError described above.
from ctypes import c_char_p
import numpy as np
x = ["1", "23"]
x = [elt.encode("utf-8") for elt in x]
buffer = (c_char_p * 2)(*x)
np.frombuffer(buffer, dtype=np.bytes_) # fails
[elt.decode("utf-8") for elt in buffer] # works
You can convert a byte buffer to a Python string using string_at of ctypes. Using buffer.decode("utf-8") also works as your saw (on one c_char_p, not an array of them).
c_char_p * N is an array of pointer of characters (basically an array of C strings having a C type char*[3]). The point is Numpy stores strings using a flat buffer so a copy is nearly mandatory. All the strings of a Numpy array have a bounded size and the reserved size of the overall array is arr.size * maxStrSize * bytePerChar where maxStrSize is the biggest string of the array unless manually changed/specified and bytePerChar is 1 for Numpy byte string arrays (ie. S) and typically 4 for Numpy unicode string arrays (ie. U). Indeed, Numpy should use the UCS-4 encoding for unicode string (AFAIK, unicode strings could also be represented in memory as UCS-2 depending on how the Python interpreter was compiled, but one can check if the UCS-4 coding is used by checking if np.dtype('U1').itemsize == 4 is actually true). The only way not to do a copy is if your C++ code can directly write in a preallocated Numpy array. This means the C++ code must use the same representation than Numpy arrays and the bounded size of all the strings is known before calling the C++ function.
np.frombuffer interprets a buffer as a 1-dimensional array. Thus the buffer needs to be flat while your buffer is not so np.frombuffer cannot be directly used in this case.
A quite inefficient solution is simply to convert strings to CPython bytes array and then build a Numpy array with all of them so Numpy will find the biggest string, allocate the big buffer and copy each strings. This is trivial to implement: np.array([elt.decode("utf-8") for elt in buffer]). This is not very efficient since CPython does the conversion of each string and allocates string that are then read by Numpy before being deallocated.
A faster solution is to copy each string in a raw buffer and then use np.frombuffer. But this is not so simple in practice: one need to check the size of the strings using strlen (or to known the bounded size if any), then allocate a big buffer, then use a memcpy loop (one should not forget to write the final 0 character after that if a string is smaller than the maximum size) and then finally use np.frombuffer (by specifying dtype='S%d' % maxLen). This can be certainly done in Cython or using C extensions for the sake of performance. A better alternative is to preallocate a Numpy array and write directly in its raw buffer. There is a problem though: this only works for ASCII/byte string arrays (ie. S), not for unicode ones (ie. U). For unicode strings, the strings needs to be decoded from the UTF-8 encoding and then encoded back to an UCS-2/UCS-4 byte-buffer. np.frombuffer cannot be used in this case because of the zero-sized dtype as pointed out by #BillHorvath. Thus, one need to do that more manually since AFAIK there is no way to do that efficiently using only CPython or Numpy. The best is certainly to do that in C using fast specialized libraries. Note that unicode strings tends to be inherently inefficient (because of the variable size of each character) so please consider using byte strings if the target strings are guaranteed to be ASCII ones.
It looks like the error message you're seeing is because bytes_ is a flexible data type, whose itemsize is 0 by default:
The 24 built-in array scalar type objects all convert to an associated data-type object. This is true for their sub-classes as well...Note that not all data-type information can be supplied with a type-object: for example, flexible data-types have a default itemsize of 0, and require an explicitly given size to be useful.
And reconstructing an array from a buffer using a dtype that is size 0 by default fails by design.
If you know in advance the type and length of the data you'll see in the buffer, then this answer might have the solution you're looking for:
One way to customize the dtype is to assign names in getnfromtxt, and recast the values after with astype.
In Ruby, I could easily pack an array representing some sequence into a binary string:
# for int
# "S*!" directive means format for 16-bit int, and using native endianess
# 16-bit int, so each digit was represented by two bytes. "\x01\x00" and "\x02\x00"
# here the native endianess is "little endian", so you should
# look at it backwards, "\x01\x00" becomes 0001, and "\x02\x00" becomes 0002
"\x01\x00\x02\x00".unpack("S!*")
# [1, 2]
# for hex
# "H*" means every element in the array is a digit for the hexstream
["037fea0651b358c361de"].pack("H*")
# "\x03\x7F\xEA\x06Q\xB3X\xC3a\xDE"
API doc for pack and unpack.
I couldn't find an uniform and equivalent way of transforming sequence to bytes (or vice versa) in python.
While struct provides methods for packing into bytes objects, the format string available has no option for hexstream.
EDIT: What I really want is something as versatile as Ruby's arr.pack and str.unpack, which supports multiple formatting and endianess control.
for a string in the utf-8 range it would be:
from binascii import unhexlify
strg = "464F4F"
unhexlify(strg).decode() # FOO (str)
if your content is just binary
strg = "037fea0651b358c361de"
unhexlify(strg) # b'\x03\x7f\xea\x06Q\xb3X\xc3a\xde' (bytes)
also bytes.fromhex (as in Davis Herring's answer) may be worth checking out.
struct does only fixed-width encodings that correspond to a memory dump of something like a C struct. You want bytes.fromhex or binascii.unhexlify, depending on the source type (which is never a list).
After any such conversion, you can use struct.unpack on a byte string containing any number of “records” corresponding to the format string; each is decoded into an element of the returned tuple. The format string supports the usual integer sizes and endianness choices; it is of course possible to construct a format dynamically to do things like read a matrix whose dimensions are chosen at runtime:
mat=struct.unpack("%dd"%cols,buf) # rows determined from len(buf)
It’s also possible to construct a lower-memory array if the element type is primitive; then you can follow up with byteswap as Alec A mentioned. NumPy offers similar facilities.
Try memoryview.cast, which allows you to change the endianness of an array or byte object.
Storing values as arrays makes things easier, as you can use the byteswap function.
I want to convert a numpy array to a bytestring in python 2.7. Lets say my numpy array a is a simple 2x2 array, looking like this:
[[1,10],
[16,255]]
My question is, how to convert this array to a string of bytes or bytearray with the output looking like:
\x01\x0A\x10\xff
or equally well:
bytearray(b'\x01\x0A\x10\xff')
Assuming a is an array of np.int8 type, you can use tobytes() to get the output you specify:
>>> a.tobytes()
b'\x01\n\x10\xff'
Note that my terminal prints \x0A as the newline character \n.
Calling the Python built in function bytes on the array a does the same thing, although tobytes() allows you to specify the memory layout (as per the documentation).
If a has a type which uses more bytes for each number, your byte string might be padded with a lot of unwanted null bytes. You can either cast to the smaller type, or use slicing (or similar). For example if a is of type int64:
>>> a.tobytes()[::8]
b'\x01\n\x10\xff
As a side point, you can also interpret the underlying memory of the NumPy array as bytes using view. For instance, if a is still of int64 type:
>>> a.view('S8')
array([[b'\x01', b'\n'],
[b'\x10', b'\xff']], dtype='|S8')
Is there a direct way instead of the following?
np.uint32(int.from_bytes(b'\xa3\x8eq\xb5', 'big'))
Using np.fromstring for this is deprecated now. Use np.frombuffer instead. You can also pass in a normal numpy dtype:
import numpy as np
np.frombuffer(b'\xa3\x8eq\xb5', dtype=np.uint32)
The trick is to get the right datatype. To read big endian uint32 from a string the datatype (as a string) is '>u4'.
>>> np.fromstring(b'\xa3\x8eq\xb5', dtype='>u4')
array([2744021429], dtype=uint32)
This gives you an array back, but getting a scalar from there is a pretty trivial matter. More importantly, it allows you to read a large number of these objects in one go (which you can't do with your int.from_bytes trick).
I'm not sure about the data type.
np.fromstring(b'\xa3\x8eq\xb5', dtype='<i')
I'm trying to read binary files containing a stream of float and int16 values. Those values are stored alternating.
[float][int16][float][int16]... and so on
now I want to read this data file by a python program using the struct functions.
For reading a block of say one such float-int16-pairs I assume the format string would be "fh".
Following output makes sense, the total size is 6 bytes
In [73]: struct.calcsize('fh')
Out[73]: 6
Now I'd like to read larger blocks at once to speed up the program...
In [74]: struct.calcsize('fhfh')
Out[74]: 14
Why is this not returning 12?
Quoting the documentation:
Note By default, the result of packing a given C struct includes pad bytes in order to maintain proper alignment for the C types involved; similarly, alignment is taken into account when unpacking. This behavior is chosen so that the bytes of a packed struct correspond exactly to the layout in memory of the corresponding C struct. To handle platform-independent data formats or omit implicit pad bytes, use standard size and alignment instead of native size and alignment: see Byte Order, Size, and Alignment for details.
https://docs.python.org/2/library/struct.html
If you want calcsize('fhfh') to be exactly twice calcsize('fh'), then you'll need to specify an alignment character.
Try '<fhfh' or '>fhfh', instead.
You have to specify the byte order or Endianness as size and alignment are based off of that So if you try this:
>>> struct.calcsize('fhfh')
>>> 14
>>> struct.calcsize('>fhfh')
>>> 12
The reason why is because in struct not specifying an endian defaults to native
for more details check here: https://docs.python.org/3.0/library/struct.html#struct.calcsize