Convert python byte string to numpy int?


Is there a direct way instead of the following?
np.uint32(int.from_bytes(b'\xa3\x8eq\xb5', 'big'))

Using np.fromstring for this is deprecated now. Use np.frombuffer instead. You can also pass in a normal numpy dtype:
import numpy as np
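# dtype=np.uint32 uses the machine's native byte order; pass dtype='>u4' to match the big-endian int.from_bytes call
np.frombuffer(b'\xa3\x8eq\xb5', dtype=np.uint32)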

The trick is to get the right datatype. To read a big-endian uint32 from a byte string, the dtype (given as a string) is '>u4'.
>>> np.fromstring(b'\xa3\x8eq\xb5', dtype='>u4')
array([2744021429], dtype=uint32)
This gives you an array back, but getting a scalar from there is a pretty trivial matter. More importantly, it allows you to read a large number of these objects in one go (which you can't do with your int.from_bytes trick).
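As a hedged sketch of the same idea with the non-deprecated np.frombuffer (the variable names are illustrative and not taken from the answers above), an explicit '>u4' dtype keeps the result independent of the machine's native byte order, and indexing yields the scalar:

```python
import numpy as np

raw = b'\xa3\x8eq\xb5'

# '>u4' = big-endian unsigned 32-bit, matching int.from_bytes(raw, 'big')
arr = np.frombuffer(raw, dtype='>u4')
value = arr[0]  # NumPy scalar taken from the one-element array

# Several values in one call: the buffer length must be a multiple of 4.
many = np.frombuffer(raw * 3, dtype='>u4')

# Optional: convert the array to the native byte order.
native = many.astype(np.uint32)
```

Note that np.frombuffer returns a read-only view of the bytes object, so call .copy() on the result if the array has to be modified.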

I'm not sure about the data type.
np.fromstring(b'\xa3\x8eq\xb5', dtype='<i')

Related

Load c_char_p_Array in a Numpy array

Consider a C buffer of N elements created with:
from ctypes import byref, c_double
N = 3
buffer = (c_double * N)()
# C++ function that fills the buffer byref
pull_function(byref(buffer))
# load buffer in numpy
data = np.frombuffer(buffer, dtype=c_double)
Works great. But my issue is that the dtype may be numerical (float, double, int8, ...) or string.
from ctypes import byref, c_char_p
N = 3
buffer = (c_char_p * N)()
# C++ function that fills the buffer byref
pull_function(byref(buffer))
# Load in.. a list?
data = [v.decode("utf-8") for v in buffer]
How can I load those UTF-8 encoded strings directly into a numpy array? np.char.decode seems to be a good candidate, but I can't figure out how to use it. np.char.decode(np.frombuffer(buffer, dtype=np.bytes_)) fails with ValueError: itemsize cannot be zero in type.
EDIT: The buffer can be filled from the Python API. The corresponding lines are:
x = [list of strings]
x = [v.encode("utf-8") for v in x]
buffer = (c_char_p * N)(*x)
push_function(byref(buffer))
Note that this is a different buffer from the one above. push_function pushes the data in x on the network while pull_function retrieves the data from the network. Both are part of the LabStreamingLayer C++ library.
Edit 2: I suspect I can get this to work if I can reload the 'push' buffer into a numpy array before sending it to the network. The 'pull' buffer is probably the same. In that sense, here is a MWE demonstrating the ValueError described above.
from ctypes import c_char_p
import numpy as np
x = ["1", "23"]
x = [elt.encode("utf-8") for elt in x]
buffer = (c_char_p * 2)(*x)
np.frombuffer(buffer, dtype=np.bytes_) # fails
[elt.decode("utf-8") for elt in buffer] # works

You can convert a byte buffer to a Python byte string using string_at from ctypes. Using buffer.decode("utf-8") also works, as you saw (on one c_char_p, not on an array of them).
c_char_p * N is an array of pointers to characters (basically an array of C strings, with the C type char*[3] here). The point is that Numpy stores strings in a flat buffer, so a copy is nearly mandatory. All the strings of a Numpy array have a bounded size, and the reserved size of the whole array is arr.size * maxStrSize * bytePerChar, where maxStrSize is the length of the longest string in the array (unless manually changed or specified) and bytePerChar is 1 for Numpy byte string arrays (i.e. S) and typically 4 for Numpy unicode string arrays (i.e. U). Indeed, Numpy should use the UCS-4 encoding for unicode strings (AFAIK, unicode strings could also be represented in memory as UCS-2 depending on how the Python interpreter was compiled, but one can check that UCS-4 is used by verifying that np.dtype('U1').itemsize == 4 is true). The only way to avoid a copy is for your C++ code to write directly into a preallocated Numpy array. This means the C++ code must use the same representation as Numpy arrays, and the bounded size of all the strings must be known before calling the C++ function.
np.frombuffer interprets a buffer as a 1-dimensional array. The buffer therefore needs to be flat, and yours is not (it holds pointers, not characters), so np.frombuffer cannot be used directly in this case.
A quite inefficient solution is simply to convert each string to a CPython string object and then build a Numpy array from all of them, so that Numpy finds the longest string, allocates the big buffer and copies each string. This is trivial to implement: np.array([elt.decode("utf-8") for elt in buffer]). It is not very efficient, because CPython converts each string and allocates temporary string objects that Numpy reads before they are deallocated.
A faster solution is to copy each string into a raw buffer and then use np.frombuffer. But this is not so simple in practice: one needs to check the size of the strings using strlen (or to know the bounded size, if any), then allocate a big buffer, then use a memcpy loop (not forgetting to write the final 0 character if a string is shorter than the maximum size) and finally call np.frombuffer (specifying dtype='S%d' % maxLen). This can certainly be done in Cython or with a C extension for the sake of performance.
A better alternative is to preallocate a Numpy array and write directly into its raw buffer. There is a problem though: this only works for ASCII/byte string arrays (i.e. S), not for unicode ones (i.e. U). For unicode strings, the strings need to be decoded from UTF-8 and then encoded back into a UCS-2/UCS-4 byte buffer. np.frombuffer cannot be used with np.bytes_ in this case because of the zero-sized dtype, as pointed out by @BillHorvath. Thus, one needs to do this more manually, since AFAIK there is no way to do it efficiently using only CPython or Numpy. The best option is certainly to do it in C using fast specialized libraries. Note that unicode strings tend to be inherently inefficient (because of the variable size of each character in UTF-8), so please consider using byte strings if the target strings are guaranteed to be ASCII.
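The flat-buffer layout described above can be illustrated in pure Python. This is only a sketch under stated assumptions: it reuses the data from the question's MWE as a stand-in for the buffer filled by the C++ library, the names items, max_len, flat, byte_arr and text_arr are made up for the example, and it still copies every string in Python, so it shows the memory layout rather than the fast Cython/C path.

```python
from ctypes import c_char_p
import numpy as np

# Stand-in for the buffer filled by the C++ library (same data as the MWE).
strings = [s.encode("utf-8") for s in ["1", "23"]]
buffer = (c_char_p * len(strings))(*strings)

# Indexing a c_char_p array yields Python bytes (None for NULL pointers).
items = [b if b is not None else b"" for b in buffer]
max_len = max(1, max(len(b) for b in items))  # the itemsize must be non-zero

# Pad every string with NUL bytes to max_len and concatenate into one flat buffer.
flat = b"".join(b.ljust(max_len, b"\x00") for b in items)

byte_arr = np.frombuffer(flat, dtype="S%d" % max_len)  # fixed-width byte strings
text_arr = np.char.decode(byte_arr, "utf-8")           # unicode ('U') array
```

The explicit size in the dtype ('S2' for this data) is what avoids the "itemsize cannot be zero" error from the question.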

It looks like the error message you're seeing is because bytes_ is a flexible data type, whose itemsize is 0 by default:
The 24 built-in array scalar type objects all convert to an associated data-type object. This is true for their sub-classes as well...Note that not all data-type information can be supplied with a type-object: for example, flexible data-types have a default itemsize of 0, and require an explicitly given size to be useful.
And reconstructing an array from a buffer using a dtype that is size 0 by default fails by design.
If you know in advance the type and length of the data you'll see in the buffer, then this answer might have the solution you're looking for:
One way to customize the dtype is to assign names in genfromtxt, and recast the values after with astype.
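The zero itemsize mentioned above is easy to check, and giving the dtype an explicit size is enough for np.frombuffer to accept it. A minimal sketch, where the 8-byte sample buffer and the variable names are assumptions made for illustration:

```python
import numpy as np

# The flexible bytes type has no size until one is given.
flexible = np.dtype(np.bytes_)
sized = np.dtype("S4")  # 4 bytes per element

raw = b"abcdwxyz"
arr = np.frombuffer(raw, dtype=sized)  # two 4-byte elements

print(flexible.itemsize, sized.itemsize, arr)
```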

Numpy tobytes() with defined byteorder

Is it possible to define byte order when converting a numpy array to binary string (with tobytes())?
I would want to force little endianness, but I don't want byte-swapping if it is not necessary.

When interfacing with C code I use this pattern
numpy.ascontiguousarray(x, dtype='>i4')
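That dtype string specifies the endianness and the precise bit width ('>i4' is big-endian; use '<i4' to force little-endian).
You can check ndarray.flags (contiguity, alignment) and ndarray.dtype.byteorder to see whether a conversion is necessary.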
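For the little-endian case asked about in the question, one possible sketch (an assumption about what fits the use case, not something stated in the answer) is astype with copy=False, which only produces a byte-swapped copy when the array is not already in the requested layout:

```python
import numpy as np

x = np.array([1, 2, 3], dtype=np.int32)  # native byte order

# copy=False: no new array is made if x is already little-endian int32;
# otherwise a converted (byte-swapped) copy is returned.
little = x.astype('<i4', copy=False)
payload = little.tobytes()

# dtype.byteorder reports '=' (native), '<', '>' or '|' (not applicable).
order = x.dtype.byteorder
```

Here payload is little-endian regardless of the host, because the conversion happens before tobytes() is called.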

Convert numpy array to hex bytearray

I want to convert a numpy array to a bytestring in python 2.7. Let's say my numpy array a is a simple 2x2 array, looking like this:
[[1,10],
[16,255]]
My question is, how to convert this array to a string of bytes or bytearray with the output looking like:
\x01\x0A\x10\xff
or equally well:
bytearray(b'\x01\x0A\x10\xff')

Assuming a is an array of np.uint8 type (255 does not fit in a signed np.int8), you can use tobytes() to get the output you specify:
>>> a.tobytes()
b'\x01\n\x10\xff'
Note that my terminal prints \x0A as the newline character \n.
Calling the Python built-in function bytes on the array a does the same thing, although tobytes() allows you to specify the memory layout (as per the documentation).
If a has a type which uses more bytes for each number, your byte string might be padded with a lot of unwanted null bytes. You can either cast to the smaller type, or use slicing (or similar). For example if a is of type int64:
>>> a.tobytes()[::8]
b'\x01\n\x10\xff'
As a side point, you can also interpret the underlying memory of the NumPy array as bytes using view. For instance, if a is still of int64 type:
>>> a.view('S8')
array([[b'\x01', b'\n'],
[b'\x10', b'\xff']], dtype='|S8')
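Putting the pieces of this answer together, here is a small sketch assuming Python 3 (bytes.hex is not available in Python 2.7) and the 2x2 array from the question; the variable names are illustrative:

```python
import numpy as np

a = np.array([[1, 10], [16, 255]], dtype=np.uint8)
raw = a.tobytes()          # row-major (C order) by default
as_bytearray = bytearray(raw)
hex_text = raw.hex()       # hexadecimal text form of the same bytes

# For a wider dtype, cast down first instead of slicing out the padding bytes.
a64 = np.array([[1, 10], [16, 255]], dtype=np.int64)
raw64 = a64.astype(np.uint8).tobytes()
```

Casting with astype(np.uint8) does not depend on the machine's byte order, whereas the [::8] slicing shown above picks the right bytes only on a little-endian host.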

Using python's pack with arrays

I'm trying to use the pack function in the struct module to encode data into formats required by a network protocol. I've run into a problem in that I don't see any way to encode arrays of anything other than 8-bit characters.
For example, to encode "TEST", I can use format specifier "4s". But how do I encode an array or list of 32-bit integers or other non-string types?
Here is a concrete example. Suppose I have a function doEncode which takes an array of 32-bit values. The protocol requires a 32-bit length field, followed by the array itself. Here is what I have been able to come up with so far.
from array import *
from struct import *
def doEncode(arr):
    bin=pack('>i'+len(arr)*'I',len(arr), ???)
arr=array('I',[1,2,3])
doEncode(arr)
The best I have been able to come up with is generating a format to the pack string dynamically from the length of the array. Is there some way of specifying that I have an array so I don't need to do this, like there is with a string (which e.g. would be pack('>i'+len(arr)+'s'))?
Even with the above approach, I'm not sure how I would go about actually passing the elements in the array in a similar dynamic way, i.e. I can't just say , arr[0], arr[1], ... because I don't know ahead of time what the length will be.
I suppose I could just pack each individual integer in the array in a loop, and then join all the results together, but this seems like a hack. Is there some better way to do this? The array and struct modules each seem to do their own thing, but in this case what I'm trying to do is a combination of both, which neither wants to do.

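# array.tostring() was removed in Python 3.9; tobytes() is its replacement.
# Note: tobytes() emits the machine's native byte order and item size, unlike the '>i' length prefix.
data = pack('>i', len(arr)) + arr.tobytes()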
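If the elements must be big-endian 32-bit values like the length field, a hedged alternative is to build the format string dynamically, as the question suggests, and pass the elements with argument unpacking. This sketch reuses the doEncode name from the question; the fmt and encoded names are illustrative:

```python
from array import array
from struct import pack

def doEncode(arr):
    # '>' selects big-endian with standard sizes: 'i' and 'I' are 4 bytes each.
    fmt = '>i%dI' % len(arr)
    # *arr unpacks the elements as separate arguments to pack().
    return pack(fmt, len(arr), *arr)

encoded = doEncode(array('I', [1, 2, 3]))
```

The *arr syntax answers the "arr[0], arr[1], ..." part of the question: the number of arguments is decided at call time, so the length does not need to be known in advance.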

Printing numpy.float64 with full precision

What is the proper/accepted way to print and convert a numpy.float64 to a string? I've noticed just using print or str() will lose some precision. However, repr maintains the full precision. For example:
>>> import numpy
>>> print numpy.float64('6374.345407799015')
6374.3454078
>>> print repr(numpy.float64('6374.345407799015'))
6374.3454077990154
I assume that just calling print turns into calling str() on the float64 object. So is __str__() for numpy.float64 implemented with something like '%s' % (float(self)) or somehow casts the float64 with Python's built-in float()? I tried to quickly look around the numpy source for this but it wasn't immediately obvious what was happening.
I've always thought repr() should return valid Python code that could be used by eval() to re-create the object. Is this an accepted convention? Luckily in this case numpy does not follow this convention because repr() returns just the raw number as a string instead of something like "numpy.float64('6374.345407799015')".
So, all of this confuses me. What is the correct way to convert a numpy.float64 to a string and/or print it while guaranteeing you always have the same, full precision?

The astype method works well:
>>> numpy.float64('6374.345407799015').astype(str)
'6374.345407799015'

Look into numpy.set_printoptions. Specifically,
numpy.set_printoptions(precision=15)
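For converting a single value rather than changing global print options, a few round-trip-safe conversions are sketched below. This assumes Python 3 and a NumPy version that provides np.format_float_positional (1.14 or later); the variable names are illustrative:

```python
import numpy as np

x = np.float64('6374.345407799015')

# Shortest string that converts back to exactly the same double.
shortest = repr(float(x))

# 17 significant digits are always enough to round-trip a 64-bit float.
seventeen = format(x, '.17g')

# NumPy's own formatter; unique=True asks for the shortest unique representation.
positional = np.format_float_positional(x, unique=True)

# Each string converts back to the original value.
assert float(shortest) == x
assert float(seventeen) == x
assert float(positional) == x
```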
