Fastest way to convert a bytearray into a numpy array - python

I have a bytearray which I want to convert to a numpy array of int16 to perform FFT operations on. The bytearray is coming out of a UDP socket, so I first convert each pair of consecutive bytes into an int16 using struct.unpack, and then convert the result to a numpy array using np.asarray.
The current approach, however, is too slow. The original bytearray is about 1e6 bytes long, and each of the mentioned steps (struct.unpack and np.asarray) takes around 20 ms, for a total of 40 ms. That is a relatively long frame time for my application, so I need to shorten it.
Currently, I'm doing this:
temp1 = self.data_buffer[0:FRAME_LEN_B]
self.temp_list = np.asarray(struct.unpack('h' * (len(temp1) // 2), temp1))

You can try np.frombuffer. This can wrap any object supporting the buffer protocol, which bytearray explicitly does, into an array:
arr = np.frombuffer(self.data_buffer, dtype=np.int16, count=FRAME_LEN_B // 2)
You can manipulate the array however you want after that: slice, reshape, transpose, etc.
If your native byte order is opposite to what you have coming in from the network, you can swap the interpretation order without having to swap the data in-place:
dt = np.dtype(np.int16).newbyteorder('>')
arr = np.frombuffer(self.data_buffer, dtype=dt, count=FRAME_LEN_B // 2)
If the byte order is non-native, operations on the array may take longer, since the data has to be swapped on the fly every time. In that case you can swap the data in place ahead of time and reinterpret it with a native-order dtype:
arr = arr.byteswap(inplace=True).view(arr.dtype.newbyteorder())
This will overwrite the contents of the original packet. If you want to make a separate copy, just set inplace=False, which is the default.
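Putting it together, here is a minimal standalone sketch of the zero-copy path from a received frame to an FFT; data_buffer and FRAME_LEN_B stand in for the question's self.data_buffer and frame length, and the samples are assumed to arrive in native byte order:
import numpy as np

FRAME_LEN_B = 4096
data_buffer = bytearray(FRAME_LEN_B)  # stand-in for the received UDP payload
samples = np.frombuffer(data_buffer, dtype=np.int16, count=FRAME_LEN_B // 2)  # zero-copy view
spectrum = np.fft.rfft(samples)  # FFT directly on the view, no struct.unpack needed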

Related

Deterministic method to hash np.array -> int

I am creating a system that stores large numpy arrays in pyarrow.plasma.
I want to give each array a unique, deterministic plasma.ObjectID; unfortunately, np.array is not hashable.
My current (broken) approach is:
import numpy as np
from pyarrow import plasma

def int_to_bytes(x: int) -> bytes:
    # https://stackoverflow.com/questions/21017698/converting-int-to-bytes-in-python-3
    return x.to_bytes((x.bit_length() + 7) // 8, "big")

def get_object_id(arr):
    arr_id = int(arr.sum() / (arr.shape[0]))
    oid: bytes = int_to_bytes(arr_id).zfill(20)  # fill from left with zeroes, must be of length 20
    return plasma.ObjectID(oid)
But this can easily fail, for example:
arr = np.arange(12)
a1 = arr.reshape(3, 4)
a2 = arr.reshape(3,2,2)
assert get_object_id(a1) != get_object_id(a2), 'Hash collision'
# another good test case
assert get_object_id(np.ones(12)) != get_object_id(np.ones(12).reshape(4,3))
assert get_object_id(np.ones(12)) != get_object_id(np.zeros(12))
It also involves summing the array, which could be very slow for large arrays.
Feel free to assume that the dtype of arr will be np.uint or np.int.
I now think that it's impossible to never have a hash collision (I only have 20 bytes of ID and there are more than 2^20 possible inputs), so I am just looking for something that is either
a) cheaper to compute
b) less likely to fail in practice
or, ideally, both!
The hashlib module has routines for computing hash digests from byte strings (typically used for checksums and cryptographic hashing). You can convert an ndarray into a bytes string with ndarray.tobytes; however, your examples will still fail because those arrays have the same bytes but different shapes. So you could just hash the shape as well.
import hashlib

def hasharr(arr):
    h = hashlib.blake2b(arr.tobytes(), digest_size=20)  # hash the raw array data
    for dim in arr.shape:
        h.update(dim.to_bytes(4, byteorder='big'))      # mix in the shape as well
    return h.digest()
Example:
>>> hasharr(a1)
b'\x9f\xd7<\x16\xb6u\xfdM\x14\xc2\xe49.\xf0P\xaa[\xe9\x0bZ'
>>> hasharr(a2)
b"Z\x18+'`\x83\xd6\xc8\x04\xd4%\xdc\x16V)\xb3\x97\x95\xf7v"
I'm not an expert on blake2b so you'd have to do your own research to figure out how likely a collision would be.
I'm not sure why you tagged pyarrow, but if you want to do the same on pyarrow arrays without converting to numpy, you can get the buffers of an array with arr.buffers() and convert these buffers (there will be multiple, and some may be None) to byte strings with buf.to_pybytes(). Just hash all the buffers. There is no need to worry about the shape here because pyarrow arrays are always one-dimensional.
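A minimal sketch of that idea, assuming a pyarrow array arr (hash_pa_array is an illustrative name, not a library function):
import hashlib
import pyarrow as pa

def hash_pa_array(arr: pa.Array) -> bytes:
    h = hashlib.blake2b(digest_size=20)
    for buf in arr.buffers():        # validity/data buffers; some may be None
        if buf is not None:
            h.update(buf.to_pybytes())
    return h.digest()                # 20 bytes, suitable for plasma.ObjectID(...)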

Read binary flatfile and skip bytes

I have a binary file that has data organized into 400 byte groups. I want to build an array of type np.uint32 from bytes at position 304 to position 308. However, I cannot find a method provided by NumPy that lets me select which bytes to read, only an initial offset as defined in numpy.fromfile.
For example, if my file contains 1000 groups of 400 bytes, I need an array of size 1000 such that:
arr[0] = bytes 304-308
arr[1] = bytes 704-708
...
arr[-1] = bytes 399904 - 399908
Is there a NumPy method that would allow me to specify which bytes to read from a buffer?
Another way to phrase what you are looking for is that you want to read uint32 numbers starting at offset 304, with a stride of 400 bytes. np.fromfile does not provide an argument for custom strides (although it probably should). You have a couple of options going forward.
The simplest is probably to load the entire file and subset the column you want:
data = np.fromfile(filename, dtype=np.uint32)[304 // 4::400 // 4].copy()
If you want more control over the exact positioning of the bytes (e.g., if the offset or block size is not a multiple of 4), you can use structured arrays instead:
dt = np.dtype([('_1', 'u1', 304), ('data', 'u4'), ('_2', 'u1', 92)])
data = np.fromfile(filename, dtype=dt)['data'].copy()
Here, _1 and _2 are used to discard the unneeded bytes with 1-byte resolution rather than 4.
Loading the entire file is generally going to be much faster than seeking between reads, so these approaches are likely desirable for files that fit into memory. If that is not the case, you can use memory mapping, or an entirely home-grown solution.
Memory maps can be implemented via Python's mmap module and wrapped in an ndarray using the buffer parameter, or you can use the np.memmap class, which does it for you:
mm = np.memmap(filename, dtype=np.uint32, mode='r', offset=0, shape=(1000, 400 // 4))
data = np.array(mm[:, 304 // 4])
del mm
Using a raw mmap is arguably more efficient because you can specify strides and an offset that index directly into the map, skipping all the extra data. It also works when the offset and strides are not multiples of the size of np.uint32:
import mmap

with open(filename, 'rb') as f, mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as mm:
    data = np.ndarray(buffer=mm, dtype=np.uint32, offset=304, strides=(400,), shape=(1000,)).copy()
The final call to copy is required because the underlying buffer will be invalidated as soon as the memory map is closed, possibly leading to a segfault.

How to do elementwise processing (first int, then pairwise absolute length) to avoid memory problems? (python)

I want to process a big list of uint numbers (test1), and I do that in chunks of "length". I need them as signed ints, and then I need the absolute length of each pair of even- and odd-indexed values in the list.
But I want to get rid of two problems:
it uses a lot of RAM
it takes ages!
So how could I make this faster? Any trick? I could also use numpy, no problem in doing so.
Thanks in advance!
test2 = -127 + test1[i:i+length*2048000*2 + 2048000*2*1]
test3 = (test2[::2]**2 + test2[1::2]**2)**0.5
An efficient way is to use Numpy functions, e.g.:
n = 10
ff = np.random.randint(0, 255, n) # generate some data
ff2 = ff.reshape(n // 2, 2) # new view on ff (only makes a copy if needed)
l_ff = np.linalg.norm(ff2, axis=1) # calculate vector length of each row
Note that when you modify an entry in ff2, ff will change as well, and vice versa.
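For example, a quick check of that shared memory, continuing with the ff and ff2 defined above:
ff2[0, 0] = 42   # write through the reshaped view...
print(ff[0])     # ...and the original array reflects the change: 42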
Internally, Numpy stores data as contiguous memory blocks. So there are further methods besides np.reshape() to exploit that structure. For efficient conversion of data types, you can try:
dd_s = np.arange(-5, 10, dtype=np.int8)
dd_u = dd_s.astype(np.uint8) # conversion from signed to unsigned
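Applied to the question's data, a sketch could look like this; test1 here is just a stand-in for one chunk of the incoming uint8 values:
import numpy as np

test1 = np.frombuffer(bytes(range(16)), dtype=np.uint8)  # stand-in for a chunk of data
test2 = test1.astype(np.int16) - 127        # widen before subtracting to avoid uint8 wrap-around
pairs = test2.reshape(-1, 2)                # (even, odd) value pairs
test3 = np.linalg.norm(pairs, axis=1)       # absolute length of each pair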

Simple to implement struct for memory efficient list of tuples

I need to create a list of the following type
[(latitude, longitude, date), ...]
where latitude and longitude are floats, and date is an integer. I'm running out of memory on my local machine because I need to store about 60 million of these tuples. What is the most memory efficient (and at the same time simple to implement) way of representing these tuples in python?
The precision of the latitude and longitude does not need to be so great (just enough to represent values such as -65.100234) and the integers need to be big enough to handle UNIX timestamps.
I have used SWIG before to define C structs, which are in general much more memory efficient than plain Python objects, but this is complicated to implement... maybe there is some scipy or numpy way to declare such tuples that uses less memory... any ideas?
If you are fine with using NumPy, you could use a numpy.recarray. If you want 8 significant digits for your coordinates, single-precision floats are probably not enough, so your records would have two double-precision floats and a 32-bit integer, i.e. twenty bytes in total, so 60 million records would need about 1.2 GB of memory. Note that NumPy arrays have a fixed size and need to be reallocated if the size changes.
Code example:
import numpy

# Create an uninitialised array with 100 records
a = numpy.recarray(100,
                   formats=["f8", "f8", "i4"],
                   names=["latitude", "longitude", "date"])
# initialise to 0
a[:] = (0.0, 0.0, 0)
# assign a single record
a[0] = (-65.100234, -38.32432, 1309351408)
# access the date of the first record
a[0].date
# access the whole date column
a.date
If you want to avoid a dependency on NumPy, you could also use ctypes arrays of ctypes structures, which are less convenient than NumPy arrays, but more convenient than using SWIG.
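For reference, a minimal sketch of the ctypes variant (the Record name is illustrative):
import ctypes

class Record(ctypes.Structure):
    _pack_ = 1                      # no padding, so each record is exactly 20 bytes
    _fields_ = [("latitude", ctypes.c_double),
                ("longitude", ctypes.c_double),
                ("date", ctypes.c_int32)]

records = (Record * 100)()          # array of 100 zero-initialised records
records[0] = Record(-65.100234, -38.32432, 1309351408)
print(records[0].latitude, records[0].date, ctypes.sizeof(Record))  # ... 20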

Storing 'struct' data to binary file

I need to store a binary file with a 12-byte header composed of 4 fields: sSamples (4-byte integer), sSampPeriod (4-byte integer), sSampSize (2-byte integer), and finally sParmKind (2-byte integer).
I'm using 'struct' to pack my variables into the desired fields. Now that I have them defined separately, how can I merge them all to store the 12-byte header?
sSamples = struct.pack('i', nSamples) # 4-bytes integer
sSampPeriod = struct.pack('i', nSampPeriod) # 4-bytes integer
sSampSize = struct.pack('H', nSampSize) # 2-bytes integer / unsigned short
sParmKind = struct.pack('H', 9) # 2-bytes integer / unsigned short
In addition, I have an npVect float array of dimensionality D (numpy.ndarray, float32). How could I store this vector in the same binary file, after the header?
As Cody Brocious wrote, you can pack your entire header at once:
header = struct.pack('<iiHH', nSamples, nSampPeriod, nSampSize, nParmKind)
He also mentioned endianness, which is important if you want to pack your data so as to reliably unpack it on machines with different architectures. The < at the beginning of my format string specifies "pack this data using a little-endian convention".
As for the array, you'll have to pack its length in order to determine how many values to unpack when you read it again. Doing it all in one call:
flattened = npVect.ravel() # get a 1-D array of numbers
arrSize = len(flattened)
# pack header, count of numbers, and numbers, all in one call
packed = struct.pack('<iiHHi%df' % arrSize,
                     nSamples, nSampPeriod, nSampSize, nParmKind, arrSize, *flattened)
Depending on how big your array is likely to be, you could end up with a huge string representing the entire contents of your binary file, and you might want to look into alternatives to struct which don't require you to have the entire file in memory.
Unpacking:
fmt = '<iiHHi'
nSamples, nSampPeriod, nSampSize, nParmKind, arrSize = struct.unpack(fmt, packed)
# Use unpack_from to start reading after the packed header and count
flattened = struct.unpack_from('<%df' % arrSize, packed, struct.calcsize(fmt))
npVect = np.array(flattened, dtype='float32').reshape(
    # your dimensions go here
)
EDIT: Oops, the array format isn't quite as simple as that :) The general idea holds, though: flatten your array into a list of numbers using any method you like, pack the number of values, then pack each value. On the other side, read the array as a flat list, then impose whatever structure you need on it.
EDIT: Changed format strings to use repeat specifiers, rather than string multiplication. Thanks to John Machin for pointing it out.
EDIT: Added numpy code to flatten the array before packing and reconstruct it after unpacking.
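As noted above, if the array is large you can avoid building the whole file as one bytes object: pack only the header with struct and let NumPy stream the data itself. A sketch, with made-up values standing in for the question's variables:
import struct
import numpy as np

nSamples, nSampPeriod, nSampSize, nParmKind = 1000, 100000, 4, 9
npVect = np.random.rand(nSamples).astype(np.float32)

with open('data.bin', 'wb') as f:
    f.write(struct.pack('<iiHHi', nSamples, nSampPeriod, nSampSize, nParmKind, int(npVect.size)))
    npVect.astype('<f4').tofile(f)   # raw little-endian float32 data right after the header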
struct.pack returns a bytes object (a str in Python 2), so you can combine the fields simply by concatenation:
header = sSamples + sSampPeriod + sSampSize + sParmKind
assert len(header) == 12
