I am creating a system that stores large numpy arrays in pyarrow.plasma.
I want to give each array a unique, deterministic plasma.ObjectID; sadly, np.array is not hashable.
My current (broken) approach is:
import numpy as np
from pyarrow import plasma

def int_to_bytes(x: int) -> bytes:
    return x.to_bytes(
        (x.bit_length() + 7) // 8, "big"
    )  # https://stackoverflow.com/questions/21017698/converting-int-to-bytes-in-python-3

def get_object_id(arr):
    arr_id = int(arr.sum() / (arr.shape[0]))
    oid: bytes = int_to_bytes(arr_id).zfill(20)  # fill from left with zeroes, must be of length 20
    return plasma.ObjectID(oid)
But this can easily fail, for example:
arr = np.arange(12)
a1 = arr.reshape(3, 4)
a2 = arr.reshape(3,2,2)
assert get_object_id(a1) != get_object_id(a2), 'Hash collision'
# another good test case
assert get_object_id(np.ones(12)) != get_object_id(np.ones(12).reshape(4,3))
assert get_object_id(np.ones(12)) != get_object_id(np.zeros(12))
It also involves summing the array, which could be very slow for large arrays.
Feel free to assume that the dtype of arr will be np.uint or np.int.
I now think that it's impossible to never have a hash collision (I only have 20 bytes of ID and there are more than 2^20 possible inputs), so I am just looking for something that is either
a) cheaper to compute
b) less likely to fail in practice
or, ideally, both!
The hashlib module has routines for computing hashes from byte strings (e.g. SHA-2 and BLAKE2). You can convert an ndarray into a byte string with ndarray.tobytes; however, your examples will still fail, because those arrays have the same bytes but different shapes. So you could just hash the shape as well.
import hashlib

def hasharr(arr):
    # Hash the raw bytes of the array...
    hash = hashlib.blake2b(arr.tobytes(), digest_size=20)
    # ...then mix in the shape, so equal bytes with different shapes differ.
    for dim in arr.shape:
        hash.update(dim.to_bytes(4, byteorder='big'))
    return hash.digest()
Example:
>>> hasharr(a1)
b'\x9f\xd7<\x16\xb6u\xfdM\x14\xc2\xe49.\xf0P\xaa[\xe9\x0bZ'
>>> hasharr(a2)
b"Z\x18+'`\x83\xd6\xc8\x04\xd4%\xdc\x16V)\xb3\x97\x95\xf7v"
I'm not an expert on blake2b so you'd have to do your own research to figure out how likely a collision would be.
I'm not sure why you tagged pyarrow, but if you want to do the same on pyarrow arrays without converting to NumPy, you can get the buffers of an array with arr.buffers() and convert those buffers (there will be multiple, and some may be None) to byte strings with buf.to_pybytes(). Just hash all the buffers. There is no need to worry about the shape here, because pyarrow arrays are always one-dimensional.
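A minimal sketch of that buffer-hashing approach, assuming arr is a pyarrow.Array (the helper name is mine; the resulting 20-byte digest can be fed to plasma.ObjectID just like the NumPy version above):
import hashlib
import pyarrow as pa

def hash_pyarrow_array(arr: pa.Array) -> bytes:
    h = hashlib.blake2b(digest_size=20)
    for buf in arr.buffers():
        if buf is not None:  # e.g. the validity bitmap may be absent
            h.update(buf.to_pybytes())
    return h.digest()

print(hash_pyarrow_array(pa.array([1, 2, 3])))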
Related
I have a bytearray which I want to convert to a numpy array of int16 to perform FFT operations on. The bytearray comes out of a UDP socket, so first I convert two consecutive bytes into an int16 using struct.unpack, and then convert that to a numpy array using np.asarray.
My current approach, however, is too slow. The original bytearray is 1e6 bytes long, and each of the mentioned steps (struct.unpack and np.asarray) takes about 20 ms, for a total of 40 ms. This is a relatively long frame time for my application, so I need to shorten it.
Currently, I'm doing this:
temp1 = self.data_buffer[0:FRAME_LEN_B]
self.temp_list = np.asarray(struct.unpack('h' * (len(temp1) // 2), temp1))
You can try np.frombuffer. This can wrap any object supporting the buffer protocol, which bytearray explicitly does, into an array:
arr = np.frombuffer(self.data_buffer, dtype=np.int16, count=FRAME_LEN_B // 2)
You can manipulate the array however you want after that: slice, reshape, transpose, etc.
If your native byte order is opposite to what you have coming in from the network, you can swap the interpretation order without having to swap the data in-place:
dt = np.dtype(np.int16).newbyteorder('>')  # newbyteorder returns a new dtype object
arr = np.frombuffer(self.data_buffer, dtype=dt, count=FRAME_LEN_B // 2)
If the order is non-native, operations on the array may take longer, as the data will have to be swapped every time on the fly. You can therefore change the byte order in-place ahead of time if that is the case:
arr.byteswap(inplace=True)
This will overwrite the contents of the original packet. If you want to make a separate copy, just set inplace=False, which is the default.
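For reference, a minimal stand-alone sketch of the replacement for the two lines in the question (the zero-filled buffer below is just a stand-in for self.data_buffer):
import numpy as np

FRAME_LEN_B = 1_000_000               # frame length in bytes, as in the question
data_buffer = bytearray(FRAME_LEN_B)  # stand-in for self.data_buffer

# Two bytes per int16 sample; no struct.unpack and no intermediate tuple.
samples = np.frombuffer(data_buffer, dtype=np.int16, count=FRAME_LEN_B // 2)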
How can I compare whether two numpy arrays are exactly identical in memory, so that e.g.
np.array([0,1]) == np.array([0,1])
is True, but
np.array([0,1]) == np.array([[0,1]])
np.array([0,1], dtype=np.int32) == np.array([0,1], dtype=np.int64)
are both False. np.array_equal doesn't have a compare_dtypes option. I guess there might be other ways for the memory representation of an array to differ too (e.g. endian-ness)
You could use itemsize to compare the number of bytes each of their elements occupies:
a1 = np.array([0,1], dtype=np.int32)
a2 = np.array([0,1], dtype=np.int64)
a1.itemsize == a2.itemsize
# False
If you want to compare both their size and content you could examine the raw contents of data memory with ndarray.tobytes:
a1.tobytes() == a2.tobytes()
# False
Depending on which aspects you want to cover, the minimum would be comparing x.dtype (this includes endianness), x.shape, and x.strides.
You may also want to look at some flags. For example, x.flags.aligned may be considered part of the memory layout in a broad sense as may be x.flags.writeable (and perhaps x.flags.owndata).
The C/F_CONTIGUOUS flags, on the other hand, are redundant once you know shape and strides, and finally, there are UPDATEIFCOPY and WRITEBACKIFCOPY which I don't understand well enough to comment on.
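A minimal sketch of such a comparison, under the assumption that dtype, shape, strides, two of the flags, and the raw data are the aspects you care about (the function name is my own):
import numpy as np

def same_memory_layout(x, y):
    # Compare content plus the layout aspects discussed above.
    return (x.dtype == y.dtype            # includes endianness
            and x.shape == y.shape
            and x.strides == y.strides
            and x.flags.aligned == y.flags.aligned
            and x.flags.writeable == y.flags.writeable
            and x.tobytes() == y.tobytes())

print(same_memory_layout(np.array([0, 1]), np.array([0, 1])))    # True
print(same_memory_layout(np.array([0, 1]), np.array([[0, 1]])))  # False
print(same_memory_layout(np.array([0, 1], dtype=np.int32),
                         np.array([0, 1], dtype=np.int64)))      # False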
Currently I'm using pickle.dumps:
import pickle
a1 = np.array([0,1], dtype=np.int32)
a2 = np.array([0,1], dtype=np.int64)
pickle.dumps(a1) == pickle.dumps(a2)
But it seems a bit of a hack.
Let's say I have a numpy array of some integer type (say np.int64) and want to cast it to another type (say np.int8). How can I most effectively check if the operation is safe (preserving all values)?
There are two approaches I've come up with:
Approach 1: Use the type information
def is_safe(data, new_type):
    if np.can_cast(data, new_type):
        return True  # Handle the trivial allowed cases
    type_info = np.iinfo(new_type)
    return np.all((data >= type_info.min) & (data <= type_info.max))
Approach 2: Use np.can_cast on all items
def is_safe(data, new_type):
    if np.can_cast(data, new_type):
        return True  # Handle the trivial allowed cases
    return all(np.can_cast(item, new_type) for item in np.nditer(data))
Both of these approaches seem to be valid (and work for trivial cases) but are they correct and efficient? Is there another, better approach?
P.S. To complicate things further, np.can_cast(np.int8, np.uint64) returns False (naturally) so changing between signed and unsigned integers has to be checked somewhat separately.
If you already know that the array is of a NumPy integer type, then the only check needed is that the values are within the range specified by min/max of the target integer range. This is a much simpler check than the generic can_cast, which has no a priori knowledge of the things it is fed. Consequently, can_cast takes longer. I tested this on casting integers 0-99 from np.int64 to np.int8.
So, while both approaches are correct, the first one is preferable if you know that data is a NumPy integer array.
>>> timeit.timeit("np.all((data >= type_info.min) & (data <= type_info.max))", setup="import numpy as np\ndata = np.array(range(100), dtype=np.int64)\ntype_info = np.iinfo(np.int8)")
6.745509549000417
>>> timeit.timeit("all(np.can_cast(item, np.uint8) for item in np.nditer(data))", setup="import numpy as np\ndata = np.array(range(100), dtype=np.int64)")
51.0065170609887
It is slightly faster (20% or so) to assign the min and max values to new variables:
type_info = np.iinfo(new_type)
a = type_info.min
b = type_info.max
return np.all((data >= a) & (data <= b))
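Putting the range check and the variable binding together, a minimal sketch of the combined function (as noted in the P.S., casts involving uint64 bounds may still need separate care):
import numpy as np

def is_safe(data, new_type):
    # True if every value of the integer array fits into new_type.
    if np.can_cast(data, new_type):
        return True                # trivially safe based on the dtypes alone
    info = np.iinfo(new_type)
    lo, hi = info.min, info.max    # bind once, as suggested above
    return bool(np.all((data >= lo) & (data <= hi)))

print(is_safe(np.arange(100, dtype=np.int64), np.int8))    # True
print(is_safe(np.array([300], dtype=np.int64), np.uint8))  # False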
I want to process a big list of uint numbers (test1), and I do that in chunks of "length". I need them as signed ints, and then I need the absolute magnitude of each pair of even- and odd-indexed values in this list.
But I want to get rid of two problems:
it uses a lot of RAM
it takes ages!
So how could I make this faster? Any trick? I could also use numpy, no problem in doing so.
Thanks in advance!
test2 = -127 + test1[i:i+length*2048000*2 + 2048000*2*1]
test3 = (test2[::2]**2 + test2[1::2]**2)**0.5
An efficient way is to use NumPy functions, e.g.:
n = 10
ff = np.random.randint(0, 255, n) # generate some data
ff2 = ff.reshape(n // 2, 2) # new view on ff (only makes a copy if needed)
l_ff = np.linalg.norm(ff2, axis=1) # calculate vector length of each row
Note that when modifying an entry in ff2 then ff will change as well and vice versa.
Internally, Numpy stores data as contiguous memory blocks. So there are further methods besides np.reshape() to exploit that structure. For efficient conversion of data types, you can try:
dd_s = np.arange(-5, 10, dtype=np.int8)
dd_u = dd_s.astype(np.uint8) # conversion from signed to unsigned
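For example, the two lines from the question could be expressed with these building blocks roughly as follows (a sketch, assuming test1 is a one-dimensional uint8 array of even length; the random data is just a stand-in). Widening to int16 before subtracting avoids the wrap-around that uint8 arithmetic would give:
import numpy as np

test1 = np.random.randint(0, 255, 2048000 * 2, dtype=np.uint8)  # stand-in data

test2 = test1.astype(np.int16) - 127   # signed values, no overflow
pairs = test2.reshape(-1, 2)           # each even/odd pair becomes one row
test3 = np.linalg.norm(pairs, axis=1)  # vector length of each pair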
Let's consider a list of large integers, for example one given by:
def primesfrom2to(n):
    # http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
    """ Input n>=6, Returns a array of primes, 2 <= p < n """
    sieve = np.ones(n/3 + (n%6==2), dtype=np.bool)
    sieve[0] = False
    for i in xrange(int(n**0.5)/3+1):
        if sieve[i]:
            k = 3*i+1 | 1
            sieve[((k*k)/3)::2*k] = False
            sieve[(k*k+4*k-2*k*(i&1))/3::2*k] = False
    return np.r_[2, 3, ((3*np.nonzero(sieve)[0]+1) | 1)]
primesfrom2to(2000000)
I want to calculate the sum of that, and the expected result is 142913828922.
But if I do:
sum(primesfrom2to(2000000))
I get 1179908154, which is clearly wrong. The problem is that I have an int overflow, but I don't understand why. Let me explain. Consider this testing code:
a = primesfrom2to(2000000)
b = [float(i) for i in a]
c = [long(i) for i in a]
sumI = 0
sumF = 0
sumL = 0
m = 0
for i, j, k in zip(a, b, c):
    m = m + 1
    sumI = sumI + i
    sumF = sumF + j
    sumL = sumL + k
    print sumI, sumF, sumL
    if sumI < 0:
        print i, m
        break
I found out that the first integer overflow is happening at a[i=20444]=225289
If I do:
>>> sum(a[:20043])+225289
-2147310677
But if I do:
>>> sum(a[:20043])
2147431330
>>> 2147431330+225289
2147656619L
What's happening? Why such a different behaviour? Why can't sum switch automatically to long type and give the correct result?
Look at the types of your results. You are summing a numpy array, which is using numpy datatypes, which can overflow. When you do sum(a[:20043]), you get a numpy object back (some sort of int32 or the like), which overflows when added to another number. When you manually type in the same number, you're creating a Python builtin int, which can auto-promote to long. Numpy arrays cannot autopromote like Python builtin types, because the array type (and its memory layout) have to be fixed when the array is created. This makes operations much faster at the expense of type flexibility.
You may be able to get around the problem by using a wider datatype (like np.int64) for the array of primes instead of the platform-default integer. However, it depends how big your numbers are. A simple example:
# Python types ok
>>> 2**62
4611686018427387904L
>>> 2**63
9223372036854775808L
# numpy types overflow
>>> np.int64(2)**62
4611686018427387904
>>> np.int64(2)**63
-9223372036854775808
Your example works correctly for me on 64-bit Python, so I guess you're using 32-bit Python. If you can use 64-bit types you will be able to get past the limit you found, but as my example shows you will eventually overflow 64-bit ints too if your numbers get super huge.
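To illustrate the workaround on the numbers from the question (assuming the primesfrom2to function above is in scope), forcing a 64-bit accumulator gives the expected result even on a 32-bit build:
a = primesfrom2to(2000000)
print(a.sum(dtype=np.int64))  # 142913828922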