Efficient numpy row-wise matrix multiplication using 3d arrays


I have two 3d arrays of shape (N, M, D) and I want to perform an efficient row wise (over N) matrix multiplication such that the resulting array is of shape (N, D, D).
An inefficient code sample showing what I try to achieve is given by:
import numpy as np

N = 100
M = 10
D = 50
arr1 = np.random.normal(size=(N, M, D))
arr2 = np.random.normal(size=(N, M, D))
result = []
for i in range(N):
    # (D, M) @ (M, D) -> (D, D) for each of the N slices
    result.append(arr1[i].T @ arr2[i])
result = np.array(result)
However, this approach is quite slow for large N due to the loop. Is there a more efficient way to achieve this computation without using loops? I already tried to find a solution via tensordot and einsum, to no avail.

The vectorization solution is to swap the last two axes of arr1:
>>> N, M, D = 2, 3, 4
>>> np.random.seed(0)
>>> arr1 = np.random.normal(size=(N, M, D))
>>> arr2 = np.random.normal(size=(N, M, D))
>>> arr1.transpose(0, 2, 1) @ arr2
array([[[ 6.95815626, 0.38299107, 0.40600482, 0.35990016],
[-0.95421604, -2.83125879, -0.2759683 , -0.38027618],
[ 3.54989101, -0.31274318, 0.14188485, 0.19860495],
[ 3.56319723, -6.36209602, -0.42687188, -0.24932248]],
[[ 0.67081341, -0.08816343, 0.35430089, 0.69962394],
[ 0.0316968 , 0.15129449, -0.51592291, 0.07118177],
[-0.22274906, -0.28955683, -1.78905988, 1.1486345 ],
[ 1.68432706, 1.93915798, 2.25785798, -2.34404577]]])
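As a quick sanity check, the following sketch compares the batched product against the explicit loop from the question. The einsum spelling is an addition for illustration and not part of the original answer; the sizes and the seeded generator are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 4, 3, 5  # small illustrative sizes
arr1 = rng.normal(size=(N, M, D))
arr2 = rng.normal(size=(N, M, D))

# Reference: explicit loop over the first axis.
looped = np.array([arr1[i].T @ arr2[i] for i in range(N)])

# Batched matmul: (N, D, M) @ (N, M, D) -> (N, D, D).
batched = arr1.transpose(0, 2, 1) @ arr2

# The same contraction with einsum: sum over the shared M axis.
via_einsum = np.einsum("nmd,nme->nde", arr1, arr2)

print(batched.shape)
print(np.allclose(looped, batched), np.allclose(looped, via_einsum))
```

Both vectorized forms contract over the M axis separately for each index along N, which is what the loop does one slice at a time.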
A simple benchmark for a very large N:
In [225]: arr1.shape
Out[225]: (100000, 10, 50)
In [226]: %%timeit
...: result = []
...: for i in range(N):
...:     result.append(arr1[i].T @ arr2[i])
...: result = np.array(result)
...:
...:
12.4 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [227]: %timeit arr1.transpose(0, 2, 1) @ arr2
843 ms ± 7.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use a preallocated list and skip the conversion to an array after the loop. The loop is then not much slower than the vectorized version, which means that most of the overhead comes from the final data conversion:
In [375]: %%timeit
...: result = [None] * N
...: for i in range(N):
...:     result[i] = arr1[i].T @ arr2[i]
...: # result = np.array(result)
...:
...:
1.22 s ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Performance of the loop solution with data conversion:
In [376]: %%timeit
...: result = [None] * N
...: for i in range(N):
...:     result[i] = arr1[i].T @ arr2[i]
...: result = np.array(result)
...:
...:
11.3 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Another variant follows the answer by @9769953 and adds one more optimization: each product is written directly into a preallocated array through the out argument of np.matmul. To my surprise, its performance is almost the same as the vectorized solution:
In [378]: %%timeit
...: result = np.empty_like(arr1, shape=(N, D, D))
...: for res, ar1, ar2 in zip(result, arr1.transpose(0, 2, 1), arr2):
...:     np.matmul(ar1, ar2, out=res)
...:
843 ms ± 4.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Out of interest, I wondered about the loop overhead, which I guess is minimal compared to the matrix multiplication. In particular, it is minimal compared to the potential reallocation of the list memory, which with N = 10000 could be significant.
Using a pre-allocated array instead of a list, I compared the loop result and the solution provided by Mechanic Pig, and achieved the following results on my machine:
In [10]: %timeit result1 = arr1.transpose(0, 2, 1) @ arr2
33.7 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
versus
In [14]: %%timeit
...: result = np.empty_like(arr1, shape=(N, D, D))
...: for i in range(N):
...:     result[i, ...] = arr1[i].T @ arr2[i]
...:
...:
48.5 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The pure NumPy solution is still faster, so that's good, but only by a factor of about 1.5. Not too bad. Depending on the needs, the loop may be clearer about what it intends (and easier to modify, in case there's a need for an if-statement or other shenanigans).
And naturally, a simple comment above the faster solution can easily point out what it actually replaces.
Following the comments to this answer by Mechanic Pig, I've added below the timing results of a loop without preallocating an array (but with a preallocated list) and without conversion to a NumPy array. Mainly so the results are compared for the same machine:
In [11]: %%timeit
...: result = [None] * N
...: for i in range(N):
...:     result.append(arr1[i].T @ arr2[i])
...:
49.5 ms ± 672 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So interestingly, this result, without conversion, is (a tiny bit) slower than the one with a pre-allocated array and direct assignment into the array.

Related

How to vectorize computation on arrays of different dimensions?

I have some large numpy arrays of complex numbers I need to perform computations on.
import numpy as np
# Reduced sizes -- real ones are orders of magnitude larger
n, d, l = 50000, 3, 1000
# Two complex matrices and a column vector
x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
v = np.random.rand(l)[:, np.newaxis]
The function is basically x*v*s for each row of x (and s) and then that product is summed across the row. Because the arrays are different sizes, I can't figure out a way to vectorize the computation and it's way too slow to use a for-loop.
My current implementation is this (~3.5 seconds):
h = []
for i in range(len(x)):
    h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
h = np.asarray(h)
I also tried using np.apply_along_axis() with an augmented matrix but it's only slightly faster (~2.6s) and not that readable.
def func(m, v):
    return np.sum(m[:d]*v*m[d:], axis=1)
h = np.apply_along_axis(func, 1, np.hstack([x, s]), v)
What's a much quicker way to compute this result? I can leverage other packages such as dask if that helps.

With broadcasting this should work:
np.sum((x*s)[...,None]*v[:,0], axis=1)
but with your sample dimensions I'm getting a memory error. The 'outer' broadcasted array (n,d,l) shape is too large for my memory.
I can reduce memory usage by iterating on the smaller d dimension:
res = np.zeros((n,l), dtype=x.dtype)
for i in range(d):
    res += (x[:,i]*s[:,i])[:,None]*v[:,0]
This tests the same as your h, but I wasn't able to complete time tests. Generally iterating on the smaller dimension is faster.
I may repeat things with small dimensions.
This probably can also be expressed as an einsum problem, though it may not help with these dimensions.
In [1]: n, d, l = 5000, 3, 1000
...:
...: # Two complex matrices and a column vector
...: x = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: s = np.random.rand(n, d) + 1j*np.random.rand(n, d)
...: v = np.random.rand(l)[:, np.newaxis]
In [2]: h = []
...: for i in range(len(x)):
...:     h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...:
...: h = np.asarray(h)
In [3]: h.shape
Out[3]: (5000, 1000)
In [4]: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...:     res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
In [5]: res.shape
Out[5]: (5000, 1000)
In [6]: np.allclose(res,h)
Out[6]: True
In [7]: %%timeit
...: h = []
...: for i in range(len(x)):
...:     h.append(np.sum(x[i,:]*v*s[i,:], axis=1))
...: h = np.asarray(h)
...:
...:
490 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %%timeit
...: res = np.zeros((n,l), dtype=x.dtype)
...: for i in range(d):
...:     res += (x[:,i]*s[:,i])[:,None]*v[:,0]
...:
354 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: np.sum((x*s)[...,None]*v[:,0], axis=1).shape
Out[9]: (5000, 1000)
In [10]: out = np.sum((x*s)[...,None]*v[:,0], axis=1)
In [11]: np.allclose(h,out)
Out[11]: True
In [12]: timeit out = np.sum((x*s)[...,None]*v[:,0], axis=1)
310 ms ± 964 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Some time savings, but not big.
And the einsum version:
In [13]: np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
Out[13]: (5000, 1000)
In [14]: np.allclose(np.einsum('ij,ij,k->ik',x,s,v[:,0]),h)
Out[14]: True
In [15]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0]).shape
167 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Good time savings. But I don't know how it will scale.
But the einsum made me realize that we can sum on d dimension earlier, before multiplying by v - and gain a lot in time and memory usage:
In [16]: np.allclose(np.sum(x*s, axis=1)[:,None]*v[:,0],h)
Out[16]: True
In [17]: timeit np.sum(x*s, axis=1)[:,None]*v[:,0]
68.4 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@cs95 got there first!
As per @PaulPanzer's comment, the optimize flag helps. It's probably making the same deduction - that we can sum on j early:
In [18]: timeit np.einsum('ij,ij,k->ik',x,s,v[:,0],optimize=True).shape
91.6 ms ± 991 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
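The reason summing early is valid: v[k] does not depend on the summation index j, so h[i, k] = sum_j x[i, j] * v[k] * s[i, j] = v[k] * sum_j x[i, j] * s[i, j]. A minimal sketch of that factorization, checked against the loop from the question, is below. The sizes are deliberately small, and np.outer is used here as one possible spelling of the final product, not the one timed above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 200, 3, 50  # small illustrative sizes
x = rng.random((n, d)) + 1j * rng.random((n, d))
s = rng.random((n, d)) + 1j * rng.random((n, d))
v = rng.random(l)[:, np.newaxis]

# Reference: the row-by-row loop from the question.
h = np.asarray([np.sum(x[i, :] * v * s[i, :], axis=1) for i in range(n)])

# Sum over d first (n values), then form the (n, l) outer product with v.
row_sums = np.sum(x * s, axis=1)
out = np.outer(row_sums, v[:, 0])

print(out.shape, np.allclose(h, out))
```

The intermediate is now of size n instead of n*d*l, which is where the memory saving comes from.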

Numpy - How to remove trailing N*8 zeros

I have a 1d array and need to remove all trailing blocks of 8 zeros.
[0,1,1,0,1,0,0,0, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0]
->
[0,1,1,0,1,0,0,0]
a.shape[0] % 8 == 0 always, so no worries about that.
Is there a better way to do it?
import numpy as np
P = 8
arr1 = np.random.randint(2,size=np.random.randint(5,10) * P)
arr2 = np.random.randint(1,size=np.random.randint(5,10) * P)
arr = np.concatenate((arr1, arr2))
indexes = []
arr = np.flip(arr).reshape(arr.shape[0] // P, P)
for i, f in enumerate(arr):
    if (f == 0).all():
        indexes.append(i)
    else:
        break
arr = np.delete(arr, indexes, axis=0)
arr = np.flip(arr.reshape(arr.shape[0] * P))

You can do it without allocating more space by using views and np.argmax to get the last nonzero element:
index = arr.size - np.argmax(arr[::-1])
Rounding up to the nearest multiple of eight is easy:
index = int(np.ceil(index / 8)) * 8
Now chop off the rest:
arr = arr[:index]
Or as a one-liner:
arr = arr[:int(np.ceil((arr.size - np.argmax(arr[::-1])) / 8)) * 8]
This version is O(n) in time and O(1) in space because it reuses the same buffers for everything (including the output).
This has the additional advantage that it will work correctly even if there are no trailing zeros. Using argmax does rely on all the nonzero elements having the same value, though, because argmax returns the first occurrence of the maximum rather than the first nonzero entry. If that is not the case, you will need to compute a mask first, e.g. with arr.astype(bool).
If you want to use your original approach, you could vectorize that too, although there will be a bit more overhead:
view = arr.reshape(-1, 8)
mask = view.any(axis = 1)
index = view.shape[0] - np.argmax(mask[::-1])
arr = arr[:index * 8]
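For comparison, here is a small self-contained sketch of the same rounding logic with the edge cases spelled out. The function name trim_trailing_blocks is hypothetical, and it uses np.flatnonzero rather than argmax, so it allocates an index array and does not share the O(1)-space property claimed above. Returning an empty array for all-zero input is an assumption about the desired behaviour; the argmax version keeps the whole array in that case, because argmax of an all-zero array is 0.

```python
import numpy as np

def trim_trailing_blocks(arr, block=8):
    """Return a view of arr without trailing all-zero blocks of length `block`."""
    arr = np.asarray(arr)
    nonzero = np.flatnonzero(arr)
    if nonzero.size == 0:
        # Assumed behaviour: every block is zero, so nothing is kept.
        return arr[:0]
    last = nonzero[-1]                 # position of the last nonzero element
    keep = (last // block + 1) * block  # round up to the end of its block
    return arr[:keep]

a = np.array([0, 1, 1, 0, 1, 0, 0, 0] + [0] * 16)
print(trim_trailing_blocks(a))
```

Because the result is a slice, it is a view on the input rather than a copy.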

There is a numpy function that does almost what you want: np.trim_zeros. We can use that:
import numpy as np
def trim_mod(a, m=8):
    t = np.trim_zeros(a, 'b')
    return a[:len(a)-(len(a)-len(t))//m*m]
def test(a, t, m=8):
    assert (len(a) - len(t)) % m == 0
    assert len(t) < m or np.any(t[-m:])
    assert not np.any(a[len(t):])
for _ in range(1000):
    a = (np.random.random(np.random.randint(10, 100000))<0.002).astype(int)
    m = np.random.randint(4, 20)
    t = trim_mod(a, m)
    test(a, t, m)
print("Looks correct")
Prints:
Looks correct
It seems to scale linearly in the number of trailing zeros:
But feels rather slow in absolute terms (units are ms per trial), so maybe np.trim_zeros is just a python loop.
Code for the picture:
from timeit import timeit
A = (np.random.random(1000000)<0.02).astype(int)
m = 8
T = []
for last in range(1, 1000, 9):
    A[-last:] = 0
    A[-last] = 1
    T.append(timeit(lambda: trim_mod(A, m), number=100)*10)
import pylab
pylab.plot(range(1, 1000, 9), T)
pylab.show()

A low level approach :
import numba
@numba.njit
def trim8(a):
    n=a.size-1
    while n>=0 and a[n]==0 : n-=1
    c= (n//8+1)*8
    return a[:c]
Some tests :
In [194]: A[-1]=1 # best case
In [196]: %timeit trim_mod(A,8)
5.7 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [197]: %timeit trim8(A)
714 ns ± 33.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [198]: %timeit A[:(A.size - np.argmax(A[::-1]) // 8) * 8]
4.83 ms ± 479 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [202]: A[:]=0 #worst case
In [203]: %timeit trim_mod(A,8)
2.5 s ± 49.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [204]: %timeit trim8(A)
1.14 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [205]: %timeit A[:(A.size - np.argmax(A[::-1]) // 8) * 8]
5.5 ms ± 950 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It has a short circuit mechanism like trim_zeros, but is much faster.

Searching large array by two columns

I have a large array, that looks like something below:
np.random.seed(42)
arr = np.random.permutation(np.array([
(1,1,2,2,2,2,3,3,4,4,4),
(8,9,3,4,7,9,1,9,3,4,50000)
]).T)
It isn't sorted, the rows of this array are unique, I also know the bounds for the values in both columns, they are [0, n] and [0, k]. So the maximum possible size of the array is (n+1)*(k+1), but the actual size is closer to log of that.
I need to search the array by both columns to find the row such that arr[row,:] == (i,j), and return -1 when (i,j) is absent from the array. The naive implementation of such a function is:
def get(arr, i, j):
    cond = (arr[:,0] == i) & (arr[:,1] == j)
    if np.any(cond):
        return np.where(cond)[0][0]
    else:
        return -1
Unfortunately, since in my case arr is very large (>90M rows), this is very inefficient, especially since I would need to call get() multiple times.
Alternatively I tried translating this to a dict with (i,j) keys, such that
index[(i,j)] = row
that can be accessed by:
def get(index, i, j):
    try:
        return index[(i,j)]
    except KeyError:
        return -1
This works (and is much faster when tested on smaller data than I have), but again, creating the dict on-the-fly by
index = {}
for row in range(arr.shape[0]):
    i,j = arr[row, :]
    index[(i,j)] = row
takes huge amount of time and eats lots of RAM in my case. I was also thinking of first sorting arr and then using something like np.searchsorted, but this didn't lead me anywhere.
So what I need is a fast function get(arr, i, j) that returns
>>> get(arr, 2, 3)
4
>>> get(arr, 4, 100)
-1

A partial solution would be:
In [36]: arr
Out[36]:
array([[ 2, 9],
[ 1, 8],
[ 4, 4],
[ 4, 50000],
[ 2, 3],
[ 1, 9],
[ 4, 3],
[ 2, 7],
[ 3, 9],
[ 2, 4],
[ 3, 1]])
In [37]: (i,j) = (2, 3)
# we can use `assume_unique=True` which can speed up the calculation
In [38]: np.all(np.isin(arr, [i,j], assume_unique=True), axis=1, keepdims=True)
Out[38]:
array([[False],
[False],
[False],
[False],
[ True],
[False],
[False],
[False],
[False],
[False],
[False]])
# we can use `assume_unique=True` which can speed up the calculation
In [39]: mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1, keepdims=True)
In [40]: np.argwhere(mask)
Out[40]: array([[4, 0]])
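If you need the final result as a scalar, then don't use the keepdims argument and cast the array to a scalar as shown here. (np.asscalar was removed in NumPy 1.23; on current versions, call .item() on the array instead.)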
# we can use `assume_unique=True` which can speed up the calculation
In [41]: mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1)
In [42]: np.argwhere(mask)
Out[42]: array([[4]])
In [43]: np.asscalar(np.argwhere(mask))
Out[43]: 4
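One caveat that may explain why this is only a partial solution: np.isin tests membership in the set {i, j} without regard to column, so rows such as (j, i) or (i, i) also pass the mask. The sketch below uses a made-up four-row array to show the difference from a column-wise comparison.

```python
import numpy as np

# Hypothetical data chosen to expose the difference.
arr = np.array([[2, 9], [3, 2], [2, 2], [2, 3]])
i, j = 2, 3

# Membership test: ignores which column a value sits in.
isin_mask = np.all(np.isin(arr, [i, j]), axis=1)

# Column-wise comparison: matches only the exact pair (i, j).
exact_mask = (arr[:, 0] == i) & (arr[:, 1] == j)

print(np.flatnonzero(isin_mask))   # [1 2 3]
print(np.flatnonzero(exact_mask))  # [3]
```

On the sample array from the question the two masks happen to agree, because no other row consists only of the values 2 and 3.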

Solution
Python offers a set type to store unique values, but sadly no ordered version of a set. But you can use the ordered-set package.
Create an OrderedSet from the data. Fortunately, this only needs to be done once:
import ordered_set
o = ordered_set.OrderedSet(map(tuple, arr))
def ordered_get(o, i, j):
    try:
        return o.index((i,j))
    except KeyError:
        return -1
Runtime
Finding the index of a value should be O(1), according to the documentation:
In [46]: %timeit get(arr, 2, 3)
10.6 µs ± 39 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [47]: %timeit ordered_get(o, 2, 3)
1.16 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [48]: %timeit ordered_get(o, 2, 300)
1.05 µs ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Testing this for a much larger array:
a2 = np.random.randint(10000, size=1000000).reshape(-1,2)
o2 = ordered_set.OrderedSet()
for t in map(tuple, a2):
    o2.add(t)
In [65]: %timeit get(a2, 2, 3)
1.05 ms ± 2.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [66]: %timeit ordered_get(o2, 2, 3)
1.03 µs ± 2.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [67]: %timeit ordered_get(o2, 2, 30000)
1.06 µs ± 28.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Looks like it indeed is O(1) runtime.

def get_agn(arr, i, j):
    idx = np.flatnonzero((arr[:,0] == i) & (arr[:,1] == j))
    return -1 if idx.size == 0 else idx[0]
Also, just in case you are thinking about the ordered_set solution, here is a better one (however, in both cases see timing tests below):
d = { (i, j): k for k, (i, j) in enumerate(arr)}
def unordered_get(d, i, j):
    return d.get((i, j), -1)
and its "full" equivalent (that builds the dictionary inside the function):
def unordered_get_full(arr, i, j):
    d = { (i, j): k for k, (i, j) in enumerate(arr)}
    return d.get((i, j), -1)
Timing tests:
First, define @kmario23's function:
def get_kmario23(arr, i, j):
    # fundamentally, kmario23's code rearranged to return scalars
    # and -1 when (i, j) is not found:
    mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1)
    idx = np.argwhere(mask)
    return -1 if idx.size == 0 else idx[0].item()
Second, define @ChristophTerasa's function (original and the full version):
import ordered_set
o = ordered_set.OrderedSet(map(tuple, arr))
def ordered_get(o, i, j):
    try:
        return o.index((i,j))
    except KeyError:
        return -1
def ordered_get_full(arr, i, j):
    # "Full" version that builds ordered set inside the function
    o = ordered_set.OrderedSet(map(tuple, arr))
    try:
        return o.index((i,j))
    except KeyError:
        return -1
Generate some large data:
arr = np.random.randint(1, 2000, 200000).reshape((-1, 2))
Timing results:
In [55]: %timeit get_agn(arr, *arr[-1])
149 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [56]: %timeit get_kmario23(arr, *arr[-1])
1.42 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [57]: %timeit get_kmario23(arr, *arr[0])
1.2 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Ordered set tests:
In [80]: o = ordered_set.OrderedSet(map(tuple, arr))
In [81]: %timeit ordered_get(o, *arr[-1])
1.74 µs ± 32.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [82]: %timeit ordered_get_full(arr, *arr[-1]) # include ordered set creation time
166 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unordered dictionary tests:
In [83]: d = { (i, j): k for k, (i, j) in enumerate(arr)}
In [84]: %timeit unordered_get(d, *arr[-1])
1.18 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [85]: %timeit unordered_get_full(arr, *arr[-1])
102 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, when taking into account the time needed to create either ordered set or unordered dictionary, these methods are quite slow. You must plan running several hundred searches on the same data for these methods to make sense. Even then, there is no need to use ordered_set package - regular dictionaries are faster.
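For comparison, the sort-then-np.searchsorted idea raised in the question can be sketched without any Python-level container. This is an illustration under stated assumptions, not something timed in this thread: the helper names build_sorted_index and get_sorted are hypothetical, the column bounds [0, n] and [0, k] from the question are taken as given, and i * (k + 1) + j is assumed to fit in int64.

```python
import numpy as np

def build_sorted_index(arr, k):
    """Encode each row (i, j) as i * (k + 1) + j and sort the keys once."""
    keys = arr[:, 0].astype(np.int64) * (k + 1) + arr[:, 1]
    order = np.argsort(keys, kind="stable")
    return keys[order], order

def get_sorted(sorted_keys, order, k, i, j):
    """Return the original row number of (i, j), or -1 if it is absent."""
    if not 0 <= j <= k:
        return -1  # outside the assumed bounds; would alias another key
    key = i * (k + 1) + j
    pos = np.searchsorted(sorted_keys, key)  # binary search, O(log n)
    if pos < sorted_keys.size and sorted_keys[pos] == key:
        return int(order[pos])
    return -1

arr = np.array([[2, 9], [1, 8], [4, 4], [4, 50000], [2, 3], [1, 9]])
k = 50000
sorted_keys, order = build_sorted_index(arr, k)
print(get_sorted(sorted_keys, order, k, 2, 3))    # 4
print(get_sorted(sorted_keys, order, k, 4, 100))  # -1
```

The index costs two integer arrays of the same length as arr, and each lookup is a binary search rather than a hash lookup, so it trades some query speed for a much smaller memory footprint than a dict of tuples.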

It seems I was over-thinking this problem; there is an easy solution. I was considering either filtering and subsetting the array or using a dict with index[(i,j)] = row. Filtering and subsetting was slow (O(n) when searching), while using a dict was fast (O(1) access time), but creating the dict was slow and memory intensive.
The simple solution for this problem is using nested dicts.
index = {}
for row in range(arr.shape[0]):
    i,j = arr[row, :]
    try:
        index[i][j] = row
    except KeyError:
        index[i] = {}
        index[i][j] = row
def get(index, i, j):
    try:
        return index[i][j]
    except KeyError:
        return -1
Alternatively, instead of a plain dict at the top level, I could use index = defaultdict(dict), which would allow assigning index[i][j] = row directly, without the try ... except block. But then the defaultdict(dict) object would create an empty {} whenever get(index, i, j) queries a nonexistent i, so it would expand the index unnecessarily.
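A minimal sketch of that defaultdict behaviour follows. The .get-based lookup at the end is an added variation, not part of the original answer; it reads without triggering the default factory, so the index stays the same size.

```python
from collections import defaultdict

index = defaultdict(dict)
index[1][8] = 0  # building the index needs no try/except

size_before = len(index)  # 1
_ = index[7]              # subscripting a missing key inserts an empty dict
size_after = len(index)   # 2

def get(index, i, j):
    # dict.get never calls the default factory, so nothing is inserted.
    return index.get(i, {}).get(j, -1)

print(size_before, size_after)
print(get(index, 9, 9), len(index))  # -1 2
```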
The access time is O(1) for the first dict and O(1) for the nested dicts, so basically it's O(1). The upper level dict has manageable size (bounded by n < n*k), while the nested dicts are small (the nesting order is chosen based on the fact that in my case k << n). Building the nested dict is also very fast, even for >90M rows in the array. Moreover, it can be easily extended to more complicated cases.

Numpy grouping by range of difference between elements

I have an array of angles that I want to group into arrays with a max difference of 2 deg between them.
eg: input:
angles = np.array([[1],[2],[3],[4],[4],[5],[10]])
output
('group', 1)
[[1]
[2]
[3]]
('group', 2)
[[4]
[4]
[5]]
('group', 3)
[[10]]
numpy.diff gets the difference of the next element from the current, I need the difference of the next elements from the first of the group
itertools.groupby groups the elements not within a definable range
numpy.digitize groups the elements by a predefined range, not by the range specified by the elements of the array.
(Maybe I can use this by getting the unique values of angles, grouping them by their difference and using that as the predefined range?)
My approach which works but seems extremely inefficient and non-pythonic:
(I am using expand_dims and vstack because I'm working with a 1d arrays (not just angles) but I've reduced them to simplify it for this question)
angles = np.array([[1],[2],[3],[4],[4],[5],[10]])
groupedangles = []
idx1 = 0
diffAngleMax = 2
while(idx1 < len(angles)):
    angleA = angles[idx1]
    group = np.expand_dims(angleA, axis=0)
    for idx2 in range(idx1+1,len(angles)):
        angleB = angles[idx2]
        diffAngle = angleB - angleA
        if abs(diffAngle) <= diffAngleMax:
            group = np.vstack((group,angleB))
        else:
            idx1 = idx2
            groupedangles.append(group)
            break
    if idx2 == len(angles) - 1:
        if idx1 == idx2:
            angleA = angles[idx1]
            group = np.expand_dims(angleA, axis=0)
        groupedangles.append(group)
        break
for idx, x in enumerate(groupedangles):
    print('group', idx+1)
    print(x)
What is a better and faster way to do this?

Update Here is some Cython treatment
In [1]: import cython
In [2]: %load_ext Cython
In [3]: %%cython
...: import numpy as np
...: cimport numpy as np
...: def cluster(np.ndarray array, np.float64_t maxdiff):
...:     cdef np.ndarray[np.float64_t, ndim=1] flat = np.sort(array.flatten())
...:     cdef list breakpoints = []
...:     cdef np.float64_t seed = flat[0]
...:     cdef np.int64_t int = 0
...:     for i in range(0, len(flat)):
...:         if (flat[i] - seed) > maxdiff:
...:             breakpoints.append(i)
...:             seed = flat[i]
...:     return np.split(array, breakpoints)
...:
...:
Sparsity test
In [4]: angles = np.random.choice(np.arange(5000), 500).astype(np.float64)[:, None]
In [5]: %timeit cluster(angles, 2)
422 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Duplication test
In [6]: angles = np.random.choice(np.arange(500), 1500).astype(np.float64)[:, None]
In [7]: %timeit cluster(angles, 2)
263 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Both tests show a significant improvement. The algorithm now sorts the input and makes a single run over the sorted array, which makes it stable O(N*log(N)).
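The sort-then-single-pass idea can also be sketched in plain NumPy and Python, without Cython. The function name cluster_sorted is hypothetical, and this version is not the one benchmarked above. Note that it splits the sorted copy rather than the original array, which matters when the input is not already sorted.

```python
import numpy as np

def cluster_sorted(values, maxdiff):
    """Group values so each group spans at most maxdiff from its first element."""
    flat = np.sort(np.asarray(values, dtype=float).ravel())
    if flat.size == 0:
        return []
    breakpoints = []
    seed = flat[0]  # first (smallest) element of the current group
    for idx, value in enumerate(flat):
        if value - seed > maxdiff:
            breakpoints.append(idx)  # a new group starts here
            seed = value
    return np.split(flat, breakpoints)

angles = np.array([[1], [2], [3], [4], [4], [5], [10]])
for number, group in enumerate(cluster_sorted(angles, 2), start=1):
    print("group", number, group)
```

The groups come back as 1d arrays; indexing each with [:, None] restores the column shape used in the question.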
Pre-update
This is a variation on seed clustering. It requires no sorting
def cluster(array, maxdiff):
    tmp = array.copy()
    groups = []
    while len(tmp):
        # select seed
        seed = tmp.min()
        mask = (tmp - seed) <= maxdiff
        groups.append(tmp[mask, None])
        tmp = tmp[~mask]
    return groups
Example:
In [27]: cluster(angles, 2)
Out[27]:
[array([[1],
[2],
[3]]), array([[4],
[4],
[5]]), array([[10]])]
A benchmark for 500, 1000 and 1500 angles:
In [4]: angles = np.random.choice(np.arange(500), 500)[:, None]
In [5]: %timeit cluster(angles, 2)
1.25 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: angles = np.random.choice(np.arange(500), 1000)[:, None]
In [7]: %timeit cluster(angles, 2)
1.46 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: angles = np.random.choice(np.arange(500), 1500)[:, None]
In [9]: %timeit cluster(angles, 2)
1.99 ms ± 72.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
While the algorithm is O(N^2) in the worst case and O(N) in the best case, the benchmarks above clearly show near-linear time growth, because the actual runtime depends on the structure of your data: sparsity and the duplication rate. In most real cases you won't hit the worst case.
Some sparsity benchmarks
In [4]: angles = np.random.choice(np.arange(500), 500)[:, None]
In [5]: %timeit cluster(angles, 2)
1.06 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: angles = np.random.choice(np.arange(1000), 500)[:, None]
In [7]: %timeit cluster(angles, 2)
1.79 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: angles = np.random.choice(np.arange(1500), 500)[:, None]
In [9]: %timeit cluster(angles, 2)
2.16 ms ± 90.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: angles = np.random.choice(np.arange(5000), 500)[:, None]
In [11]: %timeit cluster(angles, 2)
3.21 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Here is a sorting based solution. One could try and be a bit smarter and use bincount and argpartition to avoid the sorting, but at N <= 500 it's not worth the trouble.
import numpy as np
def flexibin(a):
    idx0 = np.argsort(a)
    as_ = a[idx0]
    A = np.r_[as_, as_+2]
    idx = np.argsort(A)
    uinv = np.flatnonzero(idx >= len(a))
    linv = np.empty_like(idx)
    linv[np.flatnonzero(idx < len(a))] = np.arange(len(a))
    bins = [0]
    curr = 0
    while True:
        for j in range(uinv[idx[curr]], len(idx)):
            if idx[j] < len(a) and A[idx[j]] > A[idx[curr]] + 2:
                bins.append(j)
                curr = j
                break
        else:
            return np.split(idx0, linv[bins[1:]])
a = 180 * np.random.random((500,))
bins = flexibin(a)
mn, mx = zip(*((np.min(a[b]), np.max(a[b])) for b in bins))
assert np.all(np.diff(mn) > 2)
assert np.all(np.subtract(mx, mn) <= 2)
print('all ok')

Count unique elements along an axis of a NumPy array

I have a three-dimensional array like
A=np.array([[[1,1],
[1,0]],
[[1,2],
[1,0]],
[[1,0],
[0,0]]])
Now I would like to obtain an array that has a nonzero value in a given position if only a unique nonzero value (or zero) occurs in that position. It should have zero if only zeros or more than one nonzero value occur in that position. For the example above, I would like
[[1,0],
[1,0]]
since
in A[:,0,0] there are only 1s
in A[:,0,1] there are 0, 1 and 2, so more than one nonzero value
in A[:,1,0] there are 0 and 1, so 1 is retained
in A[:,1,1] there are only 0s
I can find how many nonzero elements there are with np.count_nonzero(A, axis=0), but I would like to keep 1s or 2s even if there are several of them. I looked at np.unique but it doesn't seem to support what I'd like to do.
Ideally, I'd like a function like np.count_unique(A, axis=0) which would return an array in the original shape, e.g. [[1, 3],[2, 1]], so I could check whether 3 or more occur and then ignore that position.
All I could come up with was a list comprehension iterating over the positions of the array that I'd like to obtain:
[[len(np.unique(A[:, i, j])) for j in range(A.shape[2])] for i in range(A.shape[1])]
Any other ideas?

You can use np.diff to stay at numpy level for the second task.
def diffcount(A):
    B=A.copy()
    B.sort(axis=0)
    C=np.diff(B,axis=0)>0
    D=C.sum(axis=0)+1
    return D
# [[1 3]
# [2 1]]
It seems to be a little faster on big arrays:
In [62]: A=np.random.randint(0,100,(100,100,100))
In [63]: %timeit diffcount(A)
46.8 ms ± 769 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [64]: timeit [[len(np.unique(A[:, i, j])) for j in range(A.shape[2])]\
for i in range(A.shape[1])]
149 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In principle, counting unique values is a simpler task than sorting, so a factor of ln(A.shape[0]) could be gained.
One way to gain this factor is to use the set mechanism:
In [81]: %timeit np.apply_along_axis(lambda a:len(set(a)),axis=0,arr=A)
183 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unfortunately, this is not faster.
Another way is to do it by hand :
def countunique(A,Amax):
    res=np.empty(A.shape[1:],A.dtype)
    c=np.empty(Amax+1,A.dtype)
    for i in range(A.shape[1]):
        for j in range(A.shape[2]):
            T=A[:,i,j]
            for k in range(c.size): c[k]=0
            for x in T:
                c[x]=1
            res[i,j]= c.sum()
    return res
At python level:
In [70]: %timeit countunique(A,100)
429 ms ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Which is not so bad for a pure python approach. Then just shift this code at low level with numba :
import numba
countunique2=numba.jit(countunique)
In [71]: %timeit countunique2(A,100)
3.63 ms ± 70.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Which will be difficult to improve a lot.
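For the first task in the question (keep the value where exactly one distinct nonzero value occurs, else zero), a count is not strictly needed. The sketch below is a different technique from the ones above and assumes non-negative integer entries, as in the example: a position has a single distinct nonzero value exactly when its maximum equals its smallest nonzero value. The function name single_nonzero_value is hypothetical.

```python
import numpy as np

def single_nonzero_value(A):
    """Per position along axis 0: the unique nonzero value, or 0 if none or several."""
    hi = A.max(axis=0)                       # largest value per position
    lo = np.where(A > 0, A, hi).min(axis=0)  # smallest nonzero value (zeros replaced by hi)
    return np.where(hi == lo, hi, 0)

A = np.array([[[1, 1],
               [1, 0]],
              [[1, 2],
               [1, 0]],
              [[1, 0],
               [0, 0]]])
print(single_nonzero_value(A))
# [[1 0]
#  [1 0]]
```

Where every entry is zero, hi and lo are both 0, so the result is 0 there as required.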

One approach would be to use A as first axis indices for setting a boolean array of the same lengths along the other two axes and then simply counting the non-zeros along the first axis of it. Two variants would be possible - One keeping it as 3D and another would be to reshape into 2D for some performance benefit as indexing into 2D would be faster. Thus, the two implementations would be -
def nunique_axis0_maskcount_app1(A):
    m,n = A.shape[1:]
    mask = np.zeros((A.max()+1,m,n),dtype=bool)
    mask[A,np.arange(m)[:,None],np.arange(n)] = 1
    return mask.sum(0)
def nunique_axis0_maskcount_app2(A):
    m,n = A.shape[1:]
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.zeros((maxn,N),dtype=bool)
    mask[A,np.arange(N)] = 1
    A.shape = (-1,m,n)
    return mask.sum(0).reshape(m,n)
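To make the indexing trick concrete, here is a small sketch of the 2D variant applied to the example array from the question. It uses reshape instead of assigning to A.shape, so the input is left untouched; the name nunique_axis0 is hypothetical, and non-negative integer entries are assumed because the values are used as indices.

```python
import numpy as np

def nunique_axis0(A):
    """Count distinct values along axis 0 for non-negative integer arrays."""
    p, m, n = A.shape
    flat = A.reshape(p, m * n)
    # seen[value, position] is True when `value` occurs at `position`.
    seen = np.zeros((flat.max() + 1, m * n), dtype=bool)
    seen[flat, np.arange(m * n)] = True
    return seen.sum(axis=0).reshape(m, n)

A = np.array([[[1, 1],
               [1, 0]],
              [[1, 2],
               [1, 0]],
              [[1, 0],
               [0, 0]]])
print(nunique_axis0(A))
# [[1 3]
#  [2 1]]
```

The boolean table has (A.max() + 1) rows, so memory grows with the largest value in A, not with the number of distinct values.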
Runtime test -
In [154]: A = np.random.randint(0,100,(100,100,100))
# @B. M.'s soln
In [155]: %timeit f(A)
10 loops, best of 3: 28.3 ms per loop
# @B. M.'s soln using slicing : (B[1:] != B[:-1]).sum(0)+1
In [156]: %timeit f2(A)
10 loops, best of 3: 26.2 ms per loop
In [157]: %timeit nunique_axis0_maskcount_app1(A)
100 loops, best of 3: 12 ms per loop
In [158]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 9.14 ms per loop
Numba method
Using the same strategy as used for nunique_axis0_maskcount_app2 with directly getting the counts at C-level with numba, we would have -
from numba import njit
@njit
def nunique_loopy_func(mask, N, A, p, count):
    for j in range(N):
        mask[:] = True
        mask[A[0,j]] = False
        c = 1
        for i in range(1,p):
            if mask[A[i,j]]:
                c += 1
            mask[A[i,j]] = False
        count[j] = c
    return count
def nunique_axis0_numba(A):
    p,m,n = A.shape
    A.shape = (-1,m*n)
    maxn = A.max()+1
    N = A.shape[1]
    mask = np.empty(maxn,dtype=bool)
    count = np.empty(N,dtype=int)
    out = nunique_loopy_func(mask, N, A, p, count).reshape(m,n)
    A.shape = (-1,m,n)
    return out
Runtime test -
In [328]: np.random.seed(0)
In [329]: A = np.random.randint(0,100,(100,100,100))
In [330]: %timeit nunique_axis0_maskcount_app2(A)
100 loops, best of 3: 11.1 ms per loop
# @B.M.'s numba soln
In [331]: %timeit countunique2(A,A.max()+1)
100 loops, best of 3: 3.43 ms per loop
# Numba soln posted in this post
In [332]: %timeit nunique_axis0_numba(A)
100 loops, best of 3: 2.76 ms per loop
