Speed up numpy indexing of large array - python

I have a numpy array x of size (n, n, p) and I need to index it using a list m. I need to return two new arrays of sizes (n, m, p) and (n, n-m, p). Both p and m are generally small (range 10 to 100), but n can be from 100 to 10000+.
When n is small, there is no issue. However when n gets large, these indexing operations take the majority of my function call time.
In my actual implementation, the indexing took 15 seconds, and the rest of the function was less than 1 sec.
I've tried regular indexing, np.delete, and np.take. np.take was faster by a factor of 2, and it is what I am currently using to get the 15-second time.
An example is below:
import numpy as np
m = [1, 7, 12, 40]
r = np.arange(5000)
r = np.delete(r, m, axis=0)
x = np.random.rand(5000,5000,10)
%timeit tmp = x[:,m,:]
1.55 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = x[:,r,:]
1.7 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tmp = np.delete(x, r, axis=1)
1.46 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = np.delete(x, m, axis=1)
1.64 s ± 18.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit tmp = np.take(x, m, axis=1)
1.21 ms ± 61.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit tmp2 = np.take(x, r, axis=1)
1.04 s ± 79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Except that in my real code it takes 15 times as long instead of 1 second, and I have to call this function a few hundred or a few thousand times.
Is there something I can do to speed this indexing up?
I'm using Python 3.6.10 through Spyder 4.0.1 on a Windows 10 laptop with an Intel i7-8650U and 16GB of RAM. I checked the array sizes and my available RAM when executing the commands and did not hit the maximum usage at any point in the execution.
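For reference, the split described above can also be written with a single boolean mask, so the complement index list does not have to be built with np.delete first. This is only a sketch of the operation on a smaller array; the sizes and the names mask, selected, and rest are illustrative assumptions, and it makes no claim about speed.

```python
import numpy as np

n, p = 200, 10                      # smaller than the sizes in the question, for illustration
m = [1, 7, 12, 40]
x = np.random.rand(n, n, p)

mask = np.zeros(n, dtype=bool)      # True for the columns listed in m
mask[m] = True

# Boolean indexing returns columns in ascending index order
selected = x[:, mask, :]            # shape (n, len(m), p)
rest = x[:, ~mask, :]               # shape (n, n - len(m), p)

print(selected.shape, rest.shape)   # (200, 4, 10) (200, 196, 10)
```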

Related

Broadcast comparison on sliced numpy array using "," is a lot slower than "]["

I'm not sure why comparing on a sliced numpy array using , is a lot slower than ][. For example:
import time
import numpy as np

start = time.time()
a = np.zeros((100,100))
for _ in range(1000000):
    a[1:99][1:99] == 1
print(time.time() - start)
start = time.time()
a = np.zeros((100,100))
for _ in range(1000000):
    a[1:99, 1:99] == 1
print(time.time() - start)
3.2756259441375732
11.044903039932251
That's over 3 times worse.
The time measurements are approximately the same using timeit.
I'm working on a recursive algorithm (intentionally), and this slowdown makes my program run much slower, from about 1 second to about 10 seconds. I just want to know the reason behind it. Maybe this is a bug. I'm using Python 3.9.9. Thanks.
The first is the same as a[2:99]==1: a (98,100) row slice, followed by a (97,100) row slice of that result, and then the == test.
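A quick check of the shapes involved, as a sketch with a small integer array (the names chained and combined are illustrative), shows that chaining two row slices is a different selection from a combined row, column slice:

```python
import numpy as np

a = np.arange(100 * 100).reshape(100, 100)

chained = a[1:99][1:99]      # the second slice acts on rows again
combined = a[1:99, 1:99]     # rows 1..98 and columns 1..98

print(chained.shape)                      # (97, 100)
print(combined.shape)                     # (98, 98)
print(np.array_equal(chained, a[2:99]))   # True
```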
In [177]: timeit (a[1:99][1:99]==1)
8.51 µs ± 16.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [178]: timeit (a[1:99][1:99])
383 ns ± 5.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [179]: timeit (a[1:99])
208 ns ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The bulk of the time is the test, not the slicing.
In [180]: a[1:99,1:99].shape
Out[180]: (98, 98)
In [181]: timeit a[1:99,1:99]==1
32.2 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [182]: timeit a[1:99,1:99]
301 ns ± 3.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Again the slicing is a minor part of the timing, but the == test is significantly slower. In the first case we selected a subset of the rows, so the test is on a contiguous block of the data-buffer. In the second we select a subset of rows and columns. Iteration through the data-buffer is more complicated.
We can simplify the comparison by testing a slice of columns versus a slice of rows:
In [183]: timeit a[:,2:99]==1
32.3 µs ± 13.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [184]: timeit a[2:99,:]==1
8.58 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As a further test, make a new array with 'F' order. Now "rows" are the slow slice
In [189]: b = np.array(a, order='F')
In [190]: timeit b[:,2:99]==1
8.83 µs ± 20.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [191]: timeit b[2:99,:]==1
32.8 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
===
But why are you trying to compare these two slices, one that makes a (97,100) array and the other a (98,98)? They are picking different parts of a.
I wonder if you really meant to test a sequential row, column slice, not two row slices.
In [193]: timeit (a[1:99][:,1:99]==1)
32.6 µs ± 92.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Comparing just the slicing we see that the sequential one is slower - by just a bit.
In [194]: timeit (a[1:99][:,1:99])
472 ns ± 3.76 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [195]: timeit (a[1:99,1:99])
306 ns ± 3.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
===
The data for a is actually stored in a 1d C array. The numpy code uses strides and shape to iterate through it when doing something like a[...] == 1.
So imagine a (3,6) array whose data buffer looks like
[0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5]
Sliced with [1:3], it will use
[_ _ _ _ _ _ 0 1 2 3 4 5 0 1 2 3 4 5]
Sliced with [:,1:4], it will use
[_ 1 2 3 _ _ _ 1 2 3 _ _ _ 1 2 3 _ _]
Regardless of the processor caching details, the iteration through the 2nd is more complex.
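The contiguity difference described above can be inspected directly through the strides and flags attributes of the views. The following is a small sketch (the names row_slice and col_slice are illustrative); it shows the memory layout only and does not by itself measure the timing difference.

```python
import numpy as np

a = np.zeros((100, 100))         # C order, float64 (8 bytes per element)

row_slice = a[2:99, :]           # whole rows: one contiguous block of the buffer
col_slice = a[:, 2:99]           # part of every row: gaps between consecutive rows

print(row_slice.strides, row_slice.flags['C_CONTIGUOUS'])   # (800, 8) True
print(col_slice.strides, col_slice.flags['C_CONTIGUOUS'])   # (800, 8) False

b = np.array(a, order='F')       # column-major copy: the roles swap
print(b[:, 2:99].flags['F_CONTIGUOUS'])   # True
print(b[2:99, :].flags['F_CONTIGUOUS'])   # False
```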

Range of all elements in numpy array

Suppose I have a 1d array a where from each element I would like to have a range of which the size is stored in ranges:
a = np.array([10,9,12])
ranges = np.array([2,4,3])
The desired output would be:
np.array([10,11,9,10,11,12,12,13,14])
I could of course use a for loop, but I prefer a fully vectorized approach. np.repeat allows one to repeat the elements in a a number of times by setting repeats=, but I am not aware of a similar numpy function particularly dealing with the problem above.
>>> np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
array([10, 11, 9, 10, 11, 12, 12, 13, 14])
With pandas it could be easier:
>>> import pandas as pd
>>> x = pd.Series(np.repeat(a, ranges))
>>> x + x.groupby(x).cumcount()
0 10
1 11
2 9
3 10
4 11
5 12
6 12
7 13
8 14
dtype: int64
>>>
If you want a numpy array:
>>> x.add(x.groupby(x).cumcount()).to_numpy()
array([10, 11, 9, 10, 11, 12, 12, 13, 14], dtype=int64)
>>>
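For comparison, the same output can be built in plain NumPy with np.repeat and a cumulative sum, which avoids both the Python-level loop and the pandas groupby. This is a sketch rather than one of the answers above; the function name expand_ranges_numpy and its variable names are illustrative. Because it works from positions rather than values, it does not rely on the elements of a being distinct.

```python
import numpy as np

def expand_ranges_numpy(a, ranges):
    """Concatenate arange(a[i], a[i] + ranges[i]) for every i, without a Python loop."""
    a = np.asarray(a)
    ranges = np.asarray(ranges)
    # Position in the output where each block starts: [0, r0, r0 + r1, ...]
    block_starts = np.cumsum(ranges) - ranges
    # Global output position minus block start = offset within the block
    offsets = np.arange(ranges.sum()) - np.repeat(block_starts, ranges)
    return np.repeat(a, ranges) + offsets

a = np.array([10, 9, 12])
ranges = np.array([2, 4, 3])
print(expand_ranges_numpy(a, ranges))
# [10 11  9 10 11 12 12 13 14]
```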
Someone asked about timing, so I compared the times of the three solutions (so far) in a very simple manner, using the %timeit magic function in Jupyter notebook cells.
I set it up as follows:
N = 1
a = np.array([10,9,12])
a = np.tile(a, N)
ranges = np.array([2,4,3])
ranges = np.tile(ranges, N)
a.shape, ranges.shape
So I could easily scale (albeit things not random, but repeated).
Then I ran:
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
,
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
and
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
Results are as follows:
N = 1:
9.81 µs ± 481 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
568 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.53 µs ± 81.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
N = 10:
63.4 µs ± 976 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
575 µs ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
25.1 µs ± 698 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
N = 100:
612 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
608 µs ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
237 µs ± 9.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 1000:
6.09 ms ± 52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
852 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.44 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the Pandas solution wins when things get to arrays of 1000 elements or more, but the Python double list comprehension does an excellent job until that point. np.hstack probably loses out because of extra memory allocation and copying, but that's a guess. Note also that the Pandas solution is nearly the same time for each array size.
Caveats still exist because there are repeated numbers, and all values are relatively small integers. This really shouldn't matter, but I'm not (yet) betting on it. (For example, Pandas groupby functionality may be fast because of the repeated numbers.)
Bonus: the OP stated in a comment that "The real life arrays are around 1000 elements, yet with ranges ranging from 100 to 1000. So becomes quite big – pr94".
So I adjusted my timing test to the following:
import numpy as np
import pandas as pd
N = 1000
a = np.random.randint(100, 1000, N)
# This is how I understand "ranges ranging from 100 to 1000"
ranges = np.random.randint(100, 1000, N)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
Which comes out as :
hstack: 2.78 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
pandas: 18.4 ms ± 663 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
double list comprehension: 64.1 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Which shows that those caveats I mentioned, in some form at least, do seem to exist. But people should double check whether this testing code is actually the most relevant and appropriate, and whether it is correct.
This problem is probably going to be solved much faster with a Numba-compiled function:
import numba as nb
import numpy as np

@nb.jit
def expand_range(values, counts):
    n = len(values)
    m = np.sum(counts)  # total length of the output
    r = np.zeros((m,), dtype=values.dtype)
    k = 0  # write position in the output
    for i in range(n):
        x = values[i]
        for j in range(counts[i]):
            r[k] = x + j
            k += 1
    return r
On the very small inputs:
%timeit expand_range(a, ranges)
# 1.16 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
and on somewhat larger inputs:
b = np.random.randint(0, 1000, 1000)
b_ranges = np.random.randint(1, 10, 1000)
%timeit expand_range(b, b_ranges)
# 5.07 µs ± 98.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit x = pd.Series(np.repeat(a, ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 617 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(a, ranges)])
# 25 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.array([i for j in range(len(a)) for i in range(a[j],a[j]+ranges[j])])
# 13.5 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
These show the Numba-based approach winning, with a speed gain of at least 100x over any of the other approaches proposed so far.
With numbers closer to what has been indicated in one of the comments by the OP:
b = np.random.randint(10, 1000, 1000)
b_ranges = np.random.randint(100, 1000, 1000)
%timeit expand_range(b, b_ranges)
# 1.5 ms ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1000
%timeit x = pd.Series(np.repeat(b, b_ranges)); x.add(x.groupby(x).cumcount()).to_numpy()
# 91.8 ms ± 6.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.hstack([np.arange(start, start+size) for start, size in zip(b, b_ranges)])
# 10.7 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.array([i for j in range(len(b)) for i in range(b[j],b[j]+b_ranges[j])])
# 144 ms ± 4.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
which is still at least a respectable 7x over the others.

Faster return_inverse in np.unique

I have a large numpy 1D array with over 100 million elements and am applying np.unique to it
import numpy as np
x = np.random.randint(0,10000, size=100_000_000)
_, index = np.unique(x, return_inverse=True)
What I actually need is the index that is returned from np.unique but I do not need the unique array at all (i.e., it is throwaway). Since, in my real use case, I need to call np.unique many times on different arrays (all with the same length), this becomes the bottleneck. I'm guessing that a lot of the time is spent on sorting the unique array.
What is the fastest way to obtain the index for a large 1D array (it may be over a billion elements in length)?
Is there a parallelized option?
Here's a way with array-assignment + masking + indexing trickery specific to the case of non-negative integers only in the input array x -
def return_inverse_only(x, maxnum=None):
    if maxnum is None:
        maxnum = x.max()+1 # Determines extent of indexing array
    p = np.zeros(maxnum, dtype=bool)
    p[x] = 1
    p2 = np.empty(maxnum, dtype=np.uint64)
    c = p.sum()
    p2[p] = np.arange(c)
    out = p2[x]
    return out
If the maximum number in the input array is known beforehand, pass that maximum plus one as maxnum to boost performance further.
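As a consistency check, the result can be compared against the inverse index that np.unique itself returns. This is a sketch that assumes non-negative integer input, as stated above; the random generator, seed, and array size are illustrative choices.

```python
import numpy as np

def return_inverse_only(x, maxnum=None):
    if maxnum is None:
        maxnum = x.max() + 1                 # extent of the lookup arrays
    p = np.zeros(maxnum, dtype=bool)
    p[x] = 1                                 # mark every value that occurs in x
    p2 = np.empty(maxnum, dtype=np.uint64)
    c = p.sum()                              # number of distinct values
    p2[p] = np.arange(c)                     # rank of each occurring value, in ascending order
    return p2[x]                             # look up the rank of every element

rng = np.random.default_rng(0)
x = rng.integers(0, 10000, size=100_000)

_, expected = np.unique(x, return_inverse=True)
result = return_inverse_only(x)
print(np.array_equal(result, expected))      # True
```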
Timings on large arrays -
In [146]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=100000)
In [147]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
10.9 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
539 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
446 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [148]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=1000000)
In [149]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
149 ms ± 5.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
6.1 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.3 ms ± 504 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [150]: np.random.seed(0)
...: x = np.random.randint(0,10000, size=10000000)
In [151]: %timeit np.unique(x, return_inverse=True)
...: %timeit return_inverse_only(x)
...: %timeit return_inverse_only(x, maxnum=10000)
1.88 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
67.9 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
55.8 ms ± 1.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
30x+ speedup!

for loop performance of pandas Series

I am curious about the fact that, when applying a function to each element of a pd.Series inside a for loop, the execution time looks significantly faster than O(N).
Consider the function below, which rotates a number bit-wise; the code itself is not important here.
def rotate(x: np.uint32) -> np.uint32:
    return np.uint32(x >> 1) | np.uint32((x & 1) << 31)
When executing this code 1000 times in a for loop, it takes on the order of 1000 times as long, as expected.
x = np.random.randint(2 ** 32 - 1, dtype=np.uint32)
%timeit rotate(x)
# 13 µs ± 807 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
for i in range(1000):
    rotate(x)
# 9.61 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
However, when I apply this code inside a for loop over a Series of size 1000, it gets significantly faster.
s = pd.Series(np.random.randint(2 ** 32 - 1, size=1000, dtype=np.uint32))
%%timeit
for x in s:
    rotate(x)
# 2.08 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I am curious about the mechanism that makes this happen?
Note in your first loop you're not actually using the next value of the iterator. The following is a better comparison:
...: %%timeit
...: for i in range(1000):
...:     rotate(i)
...:
1.46 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
...: %%timeit
...: for x in s:
...:     rotate(x)
...:
1.6 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, they perform more or less the same.
In your original example, because x is a variable declared outside the loop, the interpreter has to load it with LOAD_GLOBAL 2 (x), whereas the loop variable i can be loaded with LOAD_FAST 0 (i), which, as the name hints, is faster.
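The two opcodes mentioned above can be seen with the standard dis module. The sketch below uses two hypothetical helper functions, read_global and read_local, that are only disassembled and never called. It illustrates the bytecode difference only; it does not measure how much of the timing gap that difference accounts for.

```python
import dis

def read_global():
    return x          # x is not assigned in this function, so it is looked up as a global

def read_local(i):
    return i          # i is a local variable (a parameter) of the function

global_ops = [ins.opname for ins in dis.get_instructions(read_global)]
local_ops = [ins.opname for ins in dis.get_instructions(read_local)]

# The exact LOAD_FAST variant depends on the Python version, hence startswith
print("LOAD_GLOBAL" in global_ops)
print(any(op.startswith("LOAD_FAST") for op in local_ops))
```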

ufunc memory consumption in arithmetic expressions

What is the memory consumption of arithmetic numpy expressions, e.g.
vec ** 3 + vec ** 2 + vec
(vec being a numpy.ndarray)? Is an array stored for each intermediate operation? Could such compound expressions use several times the memory of the underlying ndarray?
You are correct, a new array will be allocated for each intermediate result. Fortunately, the package numexpr is designed to deal with this issue. From the description:
The main reason why NumExpr achieves better performance than NumPy is that it avoids allocating memory for intermediate results. This results in better cache utilization and reduces memory access in general. Due to this, NumExpr works best with large arrays.
Example:
In [97]: xs = np.random.rand(1_000_000)
In [98]: %timeit xs ** 3 + xs ** 2 + xs
26.8 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [99]: %timeit numexpr.evaluate('xs ** 3 + xs ** 2 + xs')
1.43 ms ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Thanks to @max9111 for pointing out that numexpr simplifies power to multiplication. It seems that most of the discrepancy in the benchmark is explained by optimization of xs ** 3.
In [421]: %timeit xs * xs
1.62 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [422]: %timeit xs ** 2
1.63 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [423]: %timeit xs ** 3
22.8 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [424]: %timeit xs * xs * xs
2.52 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
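If numexpr is not available, the number of temporaries can also be reduced in plain NumPy by reusing one output buffer with in-place operators. The sketch below is an illustration rather than part of the answer above; it assumes the expression is rewritten in the factored form ((vec + 1) * vec + 1) * vec, and the names naive and out are illustrative.

```python
import numpy as np

vec = np.random.rand(1_000_000)

# Naive form: each ** and + produces a new array the size of vec
naive = vec ** 3 + vec ** 2 + vec

# Factored form evaluated in place: only one new array is allocated
out = vec + 1      # the single allocation
out *= vec         # (vec + 1) * vec           = vec**2 + vec
out += 1           #                             vec**2 + vec + 1
out *= vec         # (vec**2 + vec + 1) * vec  = vec**3 + vec**2 + vec

print(np.allclose(out, naive))   # True
```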
