Consider a C buffer of N elements created with:
import numpy as np
from ctypes import byref, c_double
N = 3
buffer = (c_double * N)()
# C++ function that fills the buffer byref
pull_function(byref(buffer))
# load buffer in numpy
data = np.frombuffer(buffer, dtype=c_double)
Works great. But my issue is that the dtype may be numerical (float, double, int8, ...) or string.
from ctypes import byref, c_char_p
N = 3
buffer = (c_char_p * N)()
# C++ function that fills the buffer byref
pull_function(byref(buffer))
# Load in.. a list?
data = [v.decode("utf-8") for v in buffer]
How can I load those UTF-8 encoded strings directly into a numpy array? np.char.decode seems to be a good candidate, but I can't figure out how to use it. np.char.decode(np.frombuffer(buffer, dtype=np.bytes_)) fails with ValueError: itemsize cannot be zero in type.
EDIT: The buffer can be filled from the Python API. The corresponding lines are:
x = [list of strings]
x = [v.encode("utf-8") for v in x]
buffer = (c_char_p * N)(*x)
push_function(byref(buffer))
Note that this is a different buffer from the one above. push_function pushes the data in x on the network while pull_function retrieves the data from the network. Both are part of the LabStreamingLayer C++ library.
Edit 2: I suspect I can get this to work if I can reload the 'push' buffer into a numpy array before sending it to the network. The 'pull' buffer is probably the same. In that sense, here is a MWE demonstrating the ValueError described above.
from ctypes import c_char_p
import numpy as np
x = ["1", "23"]
x = [elt.encode("utf-8") for elt in x]
buffer = (c_char_p * 2)(*x)
np.frombuffer(buffer, dtype=np.bytes_) # fails
[elt.decode("utf-8") for elt in buffer] # works
You can convert a C byte buffer to a Python bytes object using string_at from ctypes. Calling .decode("utf-8") also works, as you saw (on a single c_char_p value, not on an array of them).
c_char_p * N is an array of pointers to characters (basically an array of C strings, with the C type char*[3]). The point is that Numpy stores strings in a flat buffer, so a copy is nearly mandatory. All the strings of a Numpy array have a bounded size, and the reserved size of the overall array is arr.size * maxStrSize * bytePerChar, where maxStrSize is the length of the longest string in the array (unless manually changed/specified) and bytePerChar is 1 for Numpy byte string arrays (i.e. S) and typically 4 for Numpy unicode string arrays (i.e. U). Indeed, Numpy should use the UCS-4 encoding for unicode strings (AFAIK, unicode strings could also be represented in memory as UCS-2 depending on how the Python interpreter was compiled, but one can check whether UCS-4 is used by checking that np.dtype('U1').itemsize == 4 is true). The only way to avoid a copy is for your C++ code to write directly into a preallocated Numpy array. This means the C++ code must use the same representation as Numpy arrays, and the bounded size of all the strings must be known before calling the C++ function.
np.frombuffer interprets a buffer as a 1-dimensional array of fixed-size items. The buffer therefore needs to hold the data itself in a flat layout, whereas yours holds pointers, so np.frombuffer cannot be used directly in this case.
A quite inefficient solution is simply to convert the strings to CPython objects and then build a Numpy array from them, so that Numpy finds the longest string, allocates one big buffer, and copies each string into it. This is trivial to implement: np.array([elt.decode("utf-8") for elt in buffer]). It is not very efficient, since CPython converts each string and allocates temporary string objects that Numpy then reads before they are deallocated.
A faster solution is to copy each string into a raw buffer and then use np.frombuffer. But this is not so simple in practice: one needs to check the size of the strings using strlen (or to know the bounded size, if any), then allocate a big buffer, then use a memcpy loop (one should not forget to write the final 0 character after each string that is shorter than the maximum size), and finally use np.frombuffer (specifying dtype='S%d' % maxLen). This can certainly be done in Cython or using C extensions for the sake of performance. A better alternative is to preallocate a Numpy array and write directly into its raw buffer. There is a problem though: this only works for ASCII/byte string arrays (i.e. S), not for unicode ones (i.e. U). For unicode strings, the strings need to be decoded from UTF-8 and then encoded back to a UCS-2/UCS-4 byte buffer. np.frombuffer cannot be used with the bare np.bytes_ type because of its zero-sized dtype, as pointed out by @BillHorvath. Thus, one needs to do that more manually, since AFAIK there is no way to do it efficiently using only CPython or Numpy. The best option is certainly to do it in C using fast specialized libraries. Note that unicode strings tend to be inherently inefficient (because of the variable size of each character), so please consider using byte strings if the target strings are guaranteed to be ASCII.
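The copy-into-a-flat-buffer idea can be sketched in pure Python. This is only an illustration of the layout, not a fast implementation: the helper name c_strings_to_bytes_array is hypothetical, and the loop runs in the interpreter rather than in C or Cython.

```python
from ctypes import c_char_p

import numpy as np


def c_strings_to_bytes_array(buffer):
    """Copy an array of char* into one flat, zero-padded NumPy 'S' array."""
    # Iterating a ctypes array of c_char_p yields bytes (or None for NULL).
    raw = [v if v is not None else b"" for v in buffer]
    # Width of each fixed-size slot; at least 1 so the dtype is not zero-sized.
    max_len = max([len(v) for v in raw] + [1])
    flat = bytearray(len(raw) * max_len)  # zero-filled, so padding is already there
    for i, v in enumerate(raw):
        start = i * max_len
        flat[start:start + len(v)] = v
    return np.frombuffer(flat, dtype="S%d" % max_len)


x = [s.encode("utf-8") for s in ["1", "23"]]
buffer = (c_char_p * 2)(*x)
as_bytes = c_strings_to_bytes_array(buffer)   # byte-string array, dtype S2
as_text = np.char.decode(as_bytes, "utf-8")   # unicode array, costs one more copy
```

The last line is where np.char.decode fits in: it needs a fixed-width S array as input, which is what the flat buffer provides.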
It looks like the error message you're seeing is because bytes_ is a flexible data type, whose itemsize is 0 by default:
The 24 built-in array scalar type objects all convert to an associated data-type object. This is true for their sub-classes as well...Note that not all data-type information can be supplied with a type-object: for example, flexible data-types have a default itemsize of 0, and require an explicitly given size to be useful.
And reconstructing an array from a buffer using a dtype that is size 0 by default fails by design.
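A small check of the point above, assuming nothing beyond NumPy itself: the bare np.bytes_ type maps to a dtype with itemsize 0, while an explicit width such as "S3" gives np.frombuffer something it can slice the buffer with.

```python
import numpy as np

# The flexible type on its own carries no width.
flexible = np.dtype(np.bytes_)
flexible_itemsize = flexible.itemsize

# With an explicit width, a flat buffer can be split into fixed-size items.
raw = b"abcdef"
fixed = np.frombuffer(raw, dtype="S3")
```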
If you know in advance the type and length of the data you'll see in the buffer, then this answer might have the solution you're looking for:
One way to customize the dtype is to assign names in genfromtxt, and recast the values afterwards with astype.
Related
I have a numpy.ndarray named values containing numpy.unicode_ strings and I have a C function foo that consumes an array of C-strings. There is a CFFI wrapper interface for foo.
So I have tried to do something like this
p = ffi.from_buffer("char**", values)
and also
p = ffi.from_buffer("char*[]", values)
This doesn't give any errors in CFFI. But once I run the code it crashes in the C implementation of foo and indeed when I look at the pointers they look bad:
(gdb) p d
$1 = (char **) 0x1f978a50
(gdb) p d[0]
$2 = 0x7300000061 <error: Cannot access memory at address 0x7300000061>
I am on a 64 bit architecture.
It won't work like you are trying to do, because the numpy array contains pointers to Python objects (all of type str), I believe. In any case, it is something else than a raw array of char * pointers to the UTF8-encoded versions of the strings.
I think there is no automatic way to do the conversion. You need to do the loop over the items manually, and manually convert all the strings to char[] arrays, and make sure they are all kept alive long enough. This should do it:
items = [ffi.new("char[]", x.encode('utf-8')) for x in values]
p = ffi.new("char *[]", items)
# keep 'items' alive as long as you need 'p'
or, if all you need is to call a C function that expects a char ** argument, you can rely on the automatic Python-list-to-C-array conversion, as long as every item of the Python list is a char *:
items = [ffi.new("char[]", x.encode('utf-8')) for x in values]
lib.my_c_function(items)
The problem is that numpy does not represent an array of strings as a char*[]. It is more like one big char[] in which the strings are laid out with a stride equal to .itemsize, which for an array of strings is the size of the longest string. Shorter strings are padded with zero bytes. Also, the optional first argument cdecl of ffi.from_buffer does not involve any rigorous type checking of the underlying buffer/memory view. It is the programmer's responsibility to know the correct type of that buffer/memory view.
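The layout can be inspected from Python. The sketch below assumes, purely as a guess, that the first element of the array in the question started with "as"; under that assumption the first eight bytes of the buffer, read as a little-endian 64-bit pointer, reproduce the bad address shown by gdb.

```python
import numpy as np

# Hypothetical contents; the explicit "<U4" fixes the byte order for the demo.
values = np.array(["as", "dfgh"], dtype="<U4")

itemsize = values.itemsize   # 4 characters * 4 bytes (UCS-4) per element
strides = values.strides     # one element every itemsize bytes

# What C sees if it treats the start of the buffer as a char* value.
first_eight = values.view(np.uint8)[:8].tobytes()
as_pointer = int.from_bytes(first_eight, "little")
```

Here hex(as_pointer) is 0x7300000061: the code points of "a" (0x61) and "s" (0x73) stored as two 32-bit values, not an address.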
The cdecl argument will provide type safety when for instance used in conjunction with calls to other CFFI wrapped functions.
The way I solved this is by allocating a separate array of char pointers in cffi
t = ffi.new('char*[]', array_size)
Next, massage the numpy array a bit to guarantee that each string is null terminated. Then implement some logic in Python (or in C, wrapped with CFFI, if performance is required) to point each member of the char*[] array to its corresponding string in the numpy array.
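The same idea can be sketched with ctypes from the standard library instead of CFFI (a deliberate swap so the example has no extra dependency; with CFFI the pointer array would come from ffi.new as above). The input values are hypothetical.

```python
import ctypes

import numpy as np

values = np.array(["as", "dfg"])            # hypothetical input
encoded = np.char.encode(values, "utf-8")   # fixed-width byte strings, dtype S3

# Widen every slot by one byte so even the longest string ends with a NUL.
padded = encoded.astype("S%d" % (encoded.itemsize + 1))

# Point each char* at the start of its slot inside the NumPy buffer.
base = padded.ctypes.data
step = padded.strides[0]
addresses = [base + i * step for i in range(len(padded))]
pointers = (ctypes.c_char_p * len(padded))(*addresses)

# Reading the pointers back stops at each NUL terminator.
round_trip = list(pointers)
```

As with the CFFI version, padded must stay alive for as long as pointers is in use, because the pointer array does not own the string data.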
I have a list of variably-sized NumPy arrays with dtype=np.uint8 (these represent UTF-8 encoded strings). How do I efficiently convert this list to a single dtype=np.unicode_ array?
l = [np.frombuffer(b'asd', dtype = np.uint8), np.frombuffer(b'asdasdas', dtype = np.uint8)]
# The following will work, but will first create a temporary string which is inefficient.
# I'm looking for a method that would directly allocate a target np.unicode_-typed array
# and encode the data into it.
a = np.array([s.tostring().decode('utf-8') for s in l])
The arrays are not just ASCII encoded, they do contain other characters:
s = b'8 \xd0\x93\xd0\xbe\xd1\x80\xd0\xbe\xd0\xb4 \xd0\x91\xd0\xb0\xd0\xb9\xd0\xba\xd0\xbe\xd0\xbd\xd1\x83\xd1\x80 (\xd0\xa0\xd0\xb5\xd1\x81\xd0\xbf\xd1\x83\xd0\xb1\xd0\xbb\xd0\xb8\xd0\xba\xd0\xb0 \xd0\x9a\xd0\xb0\xd0\xb7\xd0\xb0\xd1\x85\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd)'
s.decode('utf-8') # works
update
It turns out the Python utf-8 codec can decode an ndarray directly, without first copying its contents to a byte string with .tostring(): with the codecs module it is possible to retrieve the callable that converts utf-8 byte sequences to unicode strings without having to go through str.decode.
lst = [np.frombuffer(b'asd', dtype = np.uint8), np.frombuffer(b'asdasdas', dtype = np.uint8)]
import codecs
decoder = codecs.getdecoder("utf-8")
data = np.array([decoder(item)[0] for item in lst], dtype="unicode")
This avoids one step of the conversion. There is another step that could be avoided, because this still creates a list of all the strings in memory before calling the final np.array constructor. numpy has a fromiter array constructor, but it can't create an array from arbitrary unicode objects: it needs a fixed character width. That would probably end up consuming more memory than you are using so far:
data = np.fromiter((decoder(item)[0] for item in lst), count=len(lst), dtype="U120") # For max-length of 120 characters.
original answer (mostly some rambling)
Modern Python internal handling of Unicode text is quite efficient, with the internal unicode point representation depending on the widest char in a string.
Numpy, on the other hand, just stores a 32-bit value for each unicode character, and it has no business "understanding" utf-8. The Python language does that very well, and it's fast. Although Python won't use any multi-threaded, multi-core, or hardware-accelerated strategy when decoding utf-8 bytes to text, the decoding takes place in native code, and is as fast as you will get on a single CPU core.
Decoding a 4MB-sized text to unicode using plain Python just took under 30ms in my system.
In other words: you are worrying about the wrong problem, unless whatever you are coding needs to convert about 100 bible-sized text corpora per second in a sustained way.
Just let Python do the utf-8 decoding, and hand the result back to numpy (which will encode it again in its 32-bit format). The time spent on this is so negligible for the great majority of real-world tasks that this is the way the Pandas library, for example, performs almost all of its actions on data: creating new copies of it after each operation.
In Ruby, I could easily pack an array representing some sequence into a binary string:
# for int
# the "S!*" directive means 16-bit int format, using native endianness
# 16-bit int, so each number is represented by two bytes: "\x01\x00" and "\x02\x00"
# here the native endianness is little-endian, so you should
# read it backwards: "\x01\x00" becomes 0001, and "\x02\x00" becomes 0002
"\x01\x00\x02\x00".unpack("S!*")
# [1, 2]
# for hex
# "H*" means every element in the array is a digit for the hexstream
["037fea0651b358c361de"].pack("H*")
# "\x03\x7F\xEA\x06Q\xB3X\xC3a\xDE"
API doc for pack and unpack.
I couldn't find a uniform and equivalent way of transforming a sequence to bytes (or vice versa) in Python.
While struct provides methods for packing into bytes objects, the available format strings have no option for a hex stream.
EDIT: What I really want is something as versatile as Ruby's arr.pack and str.unpack, which support multiple formats and endianness control.
For a string in the UTF-8 range it would be:
from binascii import unhexlify
strg = "464F4F"
unhexlify(strg).decode() # FOO (str)
If your content is just binary:
strg = "037fea0651b358c361de"
unhexlify(strg) # b'\x03\x7f\xea\x06Q\xb3X\xc3a\xde' (bytes)
Also, bytes.fromhex (as in Davis Herring's answer) may be worth checking out.
struct does only fixed-width encodings that correspond to a memory dump of something like a C struct. You want bytes.fromhex or binascii.unhexlify, depending on the source type (which is never a list).
After any such conversion, you can use struct.unpack on a byte string that holds exactly the fields named in the format string; each field is decoded into an element of the returned tuple. For a byte string containing any number of such “records”, struct.iter_unpack yields one tuple per record. The format string supports the usual integer sizes and endianness choices; it is of course possible to construct a format dynamically to do things like read a matrix whose dimensions are chosen at runtime:
mat = list(struct.iter_unpack("%dd" % cols, buf))  # rows determined from len(buf)
It’s also possible to construct a lower-memory array if the element type is primitive; then you can follow up with byteswap as Alec A mentioned. NumPy offers similar facilities.
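As a rough Python counterpart to the two Ruby examples in the question, the sketch below uses only the standard library. The byte order is spelled out as "<" (little-endian) rather than native so the result does not depend on the machine; the sample matrix data is made up for illustration.

```python
import struct
from binascii import unhexlify

# Hex stream -> bytes, like Ruby's ["..."].pack("H*"); both calls are equivalent.
hex_text = "037fea0651b358c361de"
packed = bytes.fromhex(hex_text)
same = unhexlify(hex_text)
hex_again = packed.hex()          # and back to the hex string

# Bytes -> 16-bit unsigned ints, like Ruby's unpack with an "S" directive.
numbers = struct.unpack("<2H", b"\x01\x00\x02\x00")
big_endian = struct.pack(">2H", *numbers)   # same values, other byte order

# A format built at runtime: rows of `cols` doubles read from one buffer.
cols = 2
buf = struct.pack("<4d", 1.0, 2.0, 3.0, 4.0)
rows = list(struct.iter_unpack("<%dd" % cols, buf))
```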
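Try memoryview.cast, which lets you reinterpret a bytes-like object as items of another native type without copying. It only supports native byte order, so it does not swap endianness by itself.
Storing the values in an array (array.array) makes things easier, as you can use its byteswap method.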
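A minimal sketch of both suggestions, assuming the "H" type code is 2 bytes wide (true on common platforms, but not guaranteed by the language). The input is treated as little-endian data, and byteswap is applied only when the host is big-endian.

```python
import sys
from array import array

raw = b"\x01\x00\x02\x00"   # two little-endian 16-bit values

# array.array reads bytes in native order; swap if the host disagrees with the data.
values = array("H")
values.frombytes(raw)
if sys.byteorder == "big":
    values.byteswap()

# byteswap works in place, so copy first to keep the original around.
swapped = array("H", values)
swapped.byteswap()

# memoryview.cast gives a zero-copy view, always in native byte order.
native_view = memoryview(raw).cast("H")
native_count = len(native_view)
```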
I want to convert a numpy array to a bytestring in python 2.7. Let's say my numpy array a is a simple 2x2 array, looking like this:
[[1,10],
[16,255]]
My question is, how to convert this array to a string of bytes or bytearray with the output looking like:
\x01\x0A\x10\xff
or equally well:
bytearray(b'\x01\x0A\x10\xff')
Assuming a is an array of np.uint8 type (255 does not fit in a signed np.int8), you can use tobytes() to get the output you specify:
>>> a.tobytes()
b'\x01\n\x10\xff'
Note that my terminal prints \x0A as the newline character \n.
Calling the Python built in function bytes on the array a does the same thing, although tobytes() allows you to specify the memory layout (as per the documentation).
If a has a type which uses more bytes for each number, your byte string might be padded with a lot of unwanted null bytes. You can either cast to the smaller type, or use slicing (or similar). For example if a is of type int64:
>>> a.tobytes()[::8]
b'\x01\n\x10\xff'
As a side point, you can also interpret the underlying memory of the NumPy array as bytes using view. For instance, if a is still of int64 type:
>>> a.view('S8')
array([[b'\x01', b'\n'],
[b'\x10', b'\xff']], dtype='|S8')
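Put together as a runnable sketch (the variable names are illustrative): tobytes() on a one-byte dtype gives the requested output directly, the order argument controls the memory layout, and a wider array can be cast down first instead of slicing out the padding bytes.

```python
import numpy as np

a = np.array([[1, 10], [16, 255]], dtype=np.uint8)

row_major = a.tobytes()              # C order: rows one after another
col_major = a.tobytes(order="F")     # Fortran order: columns one after another

# With a wider dtype, cast down before serializing to avoid the null padding.
wide = a.astype(np.int64)
narrowed = wide.astype(np.uint8).tobytes()

as_bytearray = bytearray(row_major)  # mutable variant, if that is what you need
```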
I'm trying to use the pack function in the struct module to encode data into formats required by a network protocol. I've run into a problem in that I don't see any way to encode arrays of anything other than 8-bit characters.
For example, to encode "TEST", I can use format specifier "4s". But how do I encode an array or list of 32-bit integers or other non-string types?
Here is a concrete example. Suppose I have a function doEncode which takes an array of 32-bit values. The protocol requires a 32-bit length field, followed by the array itself. Here is what I have been able to come up with so far.
from array import *
from struct import *
def doEncode(arr):
    bin=pack('>i'+len(arr)*'I',len(arr), ???)
arr=array('I',[1,2,3])
doEncode(arr)
The best I have been able to come up with is generating the format string for pack dynamically from the length of the array. Is there some way of specifying that I have an array so I don't need to do this, as there is with a string (which would be e.g. pack('>i' + str(len(arr)) + 's'))?
Even with the above approach, I'm not sure how I would go about actually passing the elements in the array in a similar dynamic way, i.e. I can't just say , arr[0], arr[1], ... because I don't know ahead of time what the length will be.
I suppose I could just pack each individual integer in the array in a loop, and then join all the results together, but this seems like a hack. Is there some better way to do this? The array and struct modules each seem to do their own thing, but in this case what I'm trying to do is a combination of both, which neither wants to do.
# tobytes() (formerly tostring()) emits the elements in native byte order
data = pack('>i', len(arr)) + arr.tobytes()
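If the elements must be big-endian like the length field, one option is to build the format from the length and pass the elements with argument unpacking. This is a sketch of that approach; do_encode is just a renamed version of the doEncode function from the question.

```python
import struct
from array import array


def do_encode(arr):
    """Pack a 32-bit length followed by the 32-bit elements, all big-endian."""
    # A repeat count in the format ("3I") avoids repeating the letter by hand,
    # and *arr spreads the elements out as individual arguments.
    return struct.pack(">i%dI" % len(arr), len(arr), *arr)


arr = array("I", [1, 2, 3])
data = do_encode(arr)
```

Unlike concatenating the raw bytes of the array, this gives the same result on little-endian and big-endian hosts, because the ">" prefix applies to every field.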