Sum a numpy array in chunks - python

Let's say I have a numpy array:
x = np.array([3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
And I want to sum it in groups of, say, 3, so that the result is as follows:
np.array([12, 21, 30, 39])
Here is one way to do it:
n = x.size
out = x.reshape(n//3, 3) @ np.ones(3)
Is there a quicker way? I feel like this could be improved.
EDIT: just wanted to give an update on some of the methods described here
n = int(1e6)
arr = np.random.random(4*n)
def method1(arr):
    return arr.reshape(n, 4) @ np.ones(4)

def method2(arr):
    return arr.reshape(n, 4).sum(-1)

def method3(arr):
    return np.add.reduceat(arr, np.arange(0, 4*n, 4))
%timeit method1(arr)
1.53 ms ± 85.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit method2(arr)
14.6 ms ± 867 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit method3(arr)
14.2 ms ± 369 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

method2 is the basic way to do that in NumPy. That being said, it is not very well optimized internally for such a case yet. Indeed, the reduction is done along a very small number of items, while the internal reduction is optimized for a relatively large number of items. AFAIR, compilers like GCC tend to auto-vectorize the code using SIMD instructions, resulting in a much slower execution for small reductions. It might be optimized in the future, but this is tricky to do since the problem is mainly related to the way compilers optimize the code and the assumptions they make during the optimization steps. Thus, it is not really a problem of NumPy, though there are ways to specifically optimize this use case at the expense of less maintainable code.
method3 is not very efficient since np.add.reduceat is currently not well optimized internally in NumPy. We plan to improve it, but one should not expect a drastic improvement since the method is fundamentally not very efficient on modern CPUs anyway.
method1 is clever because it makes use of BLAS routines that are heavily optimized internally. The default implementation on most platforms, OpenBLAS, carefully optimizes many use cases, including small matrix/vector multiplications, resulting in a much faster execution. That being said, it is not optimal due to the unneeded multiplications by ones (BLAS does not optimize the computation based on the content of the values).
AFAIK, there is no way to write a faster implementation than method1 in pure NumPy. As a result, the only option left to speed up the code is to execute natively compiled code specifically designed for your use case. This is possible using Numba or Cython. Here is a naive implementation:
import numba as nb

@nb.njit('(float64[::1],)')
def method4(arr):
    res = np.empty(n)
    for i in range(n):
        res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
    return res
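As a quick sanity check (a small sketch of mine, reusing the arr and n defined in the benchmark above), the compiled function can be compared against the plain NumPy reduction:
out = method4(arr)
assert np.allclose(out, arr.reshape(n, 4).sum(-1))   # same result as method2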
If you run this code, you will certainly get performance results similar to BLAS, demonstrating how good BLAS implementations are (in fact, OpenBLAS is a bit faster on my machine). This code is not optimal because it is mainly memory-bound and page faults slow things down on most systems (see this related post). You can mitigate their overhead using multiple threads. This is still not optimal, as page faults do not scale well on all platforms (quite fine on Linux but poor on Windows). Alternatively, you can preallocate the output array once so as to pay this overhead only once. You can even mix both approaches depending on your needs (using multiple threads can be useful to ensure the memory bandwidth is saturated whatever the target platform, though creating threads can be expensive). Here is the naive parallel implementation and an optimized parallel implementation:
# Naive parallel implementation mitigating a bit the page-fault overhead
@nb.njit('(float64[::1],)', parallel=True)
def method5(arr):
    res = np.empty(n)
    for i in nb.prange(n):
        res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
    return res
# Parallel implementation avoiding page faults completely
# (assuming `res` is preallocated and filled)
@nb.njit('(float64[::1],float64[::1])', parallel=True)
def method6(arr, res):
    for i in nb.prange(n):
        res[i] = arr[i*4] + arr[i*4+1] + arr[i*4+2] + arr[i*4+3]
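For reference, here is how method6 would be called (my own sketch, not part of the original benchmark): the output buffer is allocated and filled once up front so the page-fault cost is paid only on the first use, and the same buffer is then reused for every subsequent call.
res = np.full(n, 0.0)   # preallocate and touch every page once
method6(arr, res)       # later calls only overwrite the already-mapped pages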
Benchmark
method1: 3.64 ms
method2: 11.7 ms
method3: 16.0 ms
method4: 3.88 ms
method5: 2.05 ms
method6: 0.84 ms <----
This last method is nearly optimal and about 4.3 times faster than the previously fastest one, the BLAS-based method1.

Any chance of making this faster? (numpy.einsum)

I'm trying to multiply three arrays (A x B x A), with dimensions (19000, 3) x (19000, 3, 3) x (19000, 3), so that in the end I get a 1d array of size (19000,), i.e. I want to multiply only along the last one/two dimensions.
I've got it working with np.einsum() but I'm wondering if there is any way of making this faster, as this is the bottleneck of my whole code.
np.einsum('...i,...ij,...j', A, B, A)
I've already tried it with two separate np.einsum() calls, but that gave me the same performance:
np.einsum('...i, ...i', np.einsum('...i,...ij', A, B), A)
I've also already tried the @ operator and adding some additional axes, but that didn't make it faster either:
(A[:, None]@B@A[...,None]).squeeze()
I've tried to get it working with np.inner(), np.dot(), np.tensordot() and np.vdot(), but these never gave me the same results, so I couldn't compare them.
Any other ideas? Is there any way I could get a better performance?
I've already had a quick look at Numba, but as Numba doesn't support np.einsum() and many other NumPy functions, I would have to rewrite a lot of code.
You could use Numba
In the beginning it is always a good idea to look at what np.einsum does. With optimize="optimal" it is usually really good at finding a contraction order with fewer FLOPs. In this case there is actually only a minor optimization possible and the intermediate array is relatively large (I will stick to the naive version). It should also be mentioned that contractions with very small (fixed?) dimensions are a quite special case. This is also a reason why it is quite easy to outperform np.einsum here (unrolling etc., which a compiler does if it knows that a loop consists of only 3 elements).
import numpy as np
A=np.random.rand(19000, 3)
B=np.random.rand(19000, 3, 3)
print(np.einsum_path('...i,...ij,...j', A, B, A,optimize="optimal")[1])
"""
Complete contraction: si,sij,sj->s
Naive scaling: 3
Optimized scaling: 3
Naive FLOP count: 5.130e+05
Optimized FLOP count: 4.560e+05
Theoretical speedup: 1.125
Largest intermediate: 5.700e+04 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
3 sij,si->js sj,js->s
2 js,sj->s s->s
"""
Numba implementation
import numba as nb
#si,sij,sj->s
@nb.njit(fastmath=True, parallel=True, cache=True)
def nb_einsum(A, B):
    # check the inputs at the beginning
    # I assume that the asserted shapes are always constant
    # This makes it easier for the compiler to optimize
    assert A.shape[1] == 3
    assert B.shape[1] == 3
    assert B.shape[2] == 3

    # allocate output
    res = np.empty(A.shape[0], dtype=A.dtype)

    for s in nb.prange(A.shape[0]):
        # Using a syntax like this is also important for performance
        acc = 0
        for i in range(3):
            for j in range(3):
                acc += A[s, i]*B[s, i, j]*A[s, j]
        res[s] = acc
    return res
Timings
# warmup: the first call is always slower
# (due to compilation or loading the cached function)
res=nb_einsum(A,B)
%timeit nb_einsum(A,B)
#43.2 µs ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.einsum('...i,...ij,...j', A, B, A,optimize=True)
#450 µs ± 8.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.einsum('...i,...ij,...j', A, B, A)
#977 µs ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.allclose(np.einsum('...i,...ij,...j', A, B, A,optimize=True),nb_einsum(A,B))
#True

Can idle threads in OpenMP/Cython be used to parallelize remaining section of the work block?

I am new to OpenMP and using it to parallelize a for-loop (to be accurate, I am using prange in Cython).
However, the operations are very uneven, and, as a result, there are quite a few idle threads till one block of the for-loop is completed.
I wanted to know whether there is a way to access the idle threads so that I can use them to parallelize the bottleneck operations.
This question boils down to the problem of optimally scheduling tasks, which is quite hard in the general case, so usually one falls back to heuristics.
OpenMP offers different heuristics for scheduling, which can be chosen via the schedule argument of prange (documentation).
Let's look at the following example:
%%cython -c=/openmp --link-args=/openmp
cdef double calc(int n) nogil:
    cdef double d = 0.0
    cdef int i
    for i in range(n):
        d += 0.1*i*n
    return d

def single_sum(int n):
    cdef int i
    cdef double sum = 0.0
    for i in range(n):
        sum += calc(i)
    return sum
The evaluation of calc takes O(n), because an IEEE-754-compliant compiler is not allowed to replace the for-loop with a closed-form expression.
Now let's replace range with prange:
...
from cython.parallel import prange

def default_psum(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2):
        sum += calc(i)
    return sum
I have chosen to limit the number of threads to 2, to make the effect more dramatic. Now, comparing the running times we see:
N=4*10**4
%timeit single_sum(N) #no parallelization
# 991 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit default_psum(N) #parallelization
# 751 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
not as much improvement as we would like (i.e. we would like a speed-up of 2)!
It is an implementation detail of the OpenMP provider which schedule is chosen when it is not explicitly set, but most probably it will be "static" without a defined chunksize. In this case, the range is halved and one thread gets the first, fast half, while the other gets the second half, where almost all of the work must be done - so a big part of the work isn't parallelized in the end.
A better strategy for achieving a good balance is to give i=0 to the first thread, i=1 to the second, i=2 again to the first and so on. This can be achieved for the "static" schedule by setting chunksize to 1:
def static_psum1(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2, schedule="static", chunksize=1):
        sum += calc(i)
    return sum
With it, we almost reach the maximum possible speed-up of 2:
%timeit static_psum1(N)
# 511 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Choosing the best schedule is a trade-off between scheduling overhead (not very high in the example above) and work balance - and the best trade-off can be achieved only after analyzing the problem (and hardware!) at hand.
Here are some timings for the above example with different scheduling strategies and different numbers of threads (N = num_threads):

(schedule, chunksize)   N=2      N=8
single-threaded         991 ms   991 ms
(default)               751 ms   265 ms
static                  757 ms   274 ms
static, 1               511 ms   197 ms
static, 10              512 ms   166 ms
dynamic, 1              509 ms   158 ms
dynamic, 10             509 ms   156 ms
guided                  508 ms   158 ms
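For completeness, here is a sketch of the dynamic-schedule variant used in the table (my own, assumed to live in the same %%cython cell as the functions above; only the schedule argument changes, and chunksize/num_threads simply mirror the table entries):
def dynamic_psum1(int n):
    cdef int i
    cdef double sum = 0.0
    for i in prange(n, nogil=True, num_threads=2, schedule="dynamic", chunksize=1):
        sum += calc(i)
    return sum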
Trying out different schedules only makes sense when there is at least a theoretical possibility of achieving a good balance.
If there is a task which takes 90% of the running time, then no matter which scheduling strategy is used, it will not be possible to improve the performance. In this case the big task itself should be parallelized; sadly, Cython's support for OpenMP is somewhat lacking (see for example this SO post), so possibly it is better to code in pure C and then wrap the resulting functionality with Cython.

NumPy is faster than PyTorch for larger cross or outer products

I'm computing huge outer products between vectors of size (50500,) and found out that NumPy is (much?) faster than PyTorch while doing so.
Here are the tests:
# NumPy
In [64]: a = np.arange(50500)
In [65]: b = a.copy()
In [67]: %timeit np.outer(a, b)
5.81 s ± 56.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
-------------
# PyTorch
In [73]: t1 = torch.arange(50500)
In [76]: t2 = t1.clone()
In [79]: %timeit torch.ger(t1, t2)
7.73 s ± 143 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'd ideally like to have the computation done in PyTorch. So, how can I speed things up for computing outer product in PyTorch for such huge vectors?
Note: I tried to move the tensors to GPU but I was treated with MemoryError because it needs around 19 GiB of space. So, I eventually have to do it on the CPU.
Unfortunately there's really no way to specifically speed up torch's method of computing the outer product torch.ger() without a vast amount of effort.
Explanation and Options
The reason the numpy function np.outer() is so fast is that the heavy lifting is done in C; you can see the Python-level wrapper here: https://github.com/numpy/numpy/blob/7e3d558aeee5a8a5eae5ebb6aef03de892a92ebd/numpy/core/numeric.py#L1123
where the function delegates to operations from the umath C source code.
PyTorch's torch.ger() function is implemented in C++ here: https://github.com/pytorch/pytorch/blob/7ce634ebc2943ff11d2ec727b7db83ab9758a6e0/aten/src/ATen/native/LinearAlgebra.cpp#L142 which makes it ever so slightly slower, as you can see in your example.
Your options to "speed up computing outer product in PyTorch" would be to add a C implementation for the outer product in PyTorch's native code, or to write your own outer-product function interfacing with C using something like Cython, if you really don't want to use numpy (which wouldn't make much sense).
P.S.
Also, just as an aside, using a GPU would only improve the parallel computation speed on the GPU itself, which may not outweigh the cost of transferring data between RAM and GPU memory.
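As a back-of-the-envelope check of the memory figure mentioned in the question (my own sketch, not part of the original answer), a 50500 x 50500 result of 8-byte elements needs:
n = 50500
print(n * n * 8 / 2**30)   # ~19.0 GiB, matching the MemoryError reported above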
A very nice solution is to combine both.
class LazyFrames(object):
    def __init__(self, frames):
        self._frames = frames

    def __array__(self, dtype=None):
        out = np.concatenate(self._frames, axis=0)
        if dtype is not None:
            out = out.astype(dtype)
        return out
frames might just be your pytorch tensors, for instance.
This object ensures that common frames between the observations are only stored once. It exists purely to optimize memory usage, which can be huge (e.g. DQN's 1M-frame replay buffers). It should only be converted to a numpy array right before being passed to the model.
Reference : https://github.com/Shmuma/ptan/blob/master/ptan/common/wrappers.py
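A hypothetical usage sketch (the frame shapes and names here are mine, just for illustration): wrap the per-step tensors once and only materialize the concatenated array when the model actually needs it.
import numpy as np
import torch

frames = [torch.randn(1, 84, 84) for _ in range(4)]   # e.g. four stacked observation frames
lazy = LazyFrames([f.numpy() for f in frames])         # nothing is concatenated yet
batch = np.asarray(lazy, dtype=np.float32)             # __array__ runs here; shape (4, 84, 84)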

numpy.sum performance depending on axis

When summing over a dimension in a numpy array, is there a performance difference between the first and the last axis?
Specifically, considering the following code, which of sum1 and sum2 will be performed faster?
import numpy as np
a = np.ones((1000,200))
b = np.ones((200,1000))
sum1 = np.sum(a, axis=0)
sum2 = np.sum(b, axis=-1)
I believe this question actually boils down to how numpy internally stores dimensions, and that this can be overridden to use a row-wise or column-wise format. However, when using the default setting, which of these will be faster? Also, what about N-dimensional arrays?
It is quite easy to check whether or not there is a performance difference (IPython; I increased the sizes a bit to make the difference more noticeable):
import numpy as np
a = np.ones((10000, 2000))
b = np.ones((2000, 10000))
%timeit np.sum(a, axis=0)
# 27.6 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(b, axis=-1)
# 34.6 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now, by the time you have an actual performance issue with np.sum you will probably have run out of memory anyway, but yes, there is a difference. By default, NumPy arrays are stored in row-major (C) order: first the whole first row, then the second, and so on.
It makes sense, then, that reducing over the outer axis of a row-major array is faster. In the first case NumPy walks the array in memory order, adding each contiguous row element-wise into a small output vector that stays in cache the whole time, so both the streaming reads and the accumulator are handled very efficiently. In the second case each output element requires collapsing an entire contiguous row down to a single scalar; the data is still read sequentially, but this kind of horizontal reduction makes less effective use of the cache-resident accumulator and of the SIMD units, which is why it typically comes out somewhat slower.
This is a rather gross approximation; in general there are several cache levels, some shared among cores and some not, and understanding the exact effect a particular piece of code has on them is a complicated topic, but the general idea holds: reductions are cheapest when the traversal matches the memory layout. The same reasoning carries over to N-dimensional arrays, and for a Fortran-ordered array the roles of the axes are simply swapped.
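To see that the effect follows the memory layout rather than the axis label, here is a small sketch (mine, reusing the a defined above): with a Fortran-ordered copy of the same data, the cheap and expensive directions swap.
a_f = np.asfortranarray(a)                                  # same values, column-major storage
print(a.flags['C_CONTIGUOUS'], a_f.flags['F_CONTIGUOUS'])   # True True
%timeit np.sum(a_f, axis=0)    # now each output collapses a whole contiguous column to a scalar
%timeit np.sum(a_f, axis=-1)   # now the streaming, cache-friendly direction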

Why python broadcasting in the example below is slower than a simple loop?

I have an array of vectors and compute the norm of their diffs vs the first one.
When using python broadcasting, the calculation is significantly slower than doing it via a simple loop. Why?
import numpy as np
def norm_loop(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    for i in range(n):
        d[i] = np.sum((M[i] - v)**2)
    return d

def norm_bcast(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    d = np.sum((M - v)**2, axis=1)
    return d
M = np.random.random_sample((1000, 10000))
v = M[0]
%timeit norm_loop(M, v)
25.9 ms
%timeit norm_bcast(M, v)
38.5 ms
I have Python 3.6.3 and Numpy 1.14.2
To run the example in google colab:
https://drive.google.com/file/d/1GKzpLGSqz9eScHYFAuT8wJt4UIZ3ZTru/view?usp=sharing
Memory access.
First off, the broadcast version can be simplified to
def norm_bcast(M, v):
    return np.sum((M - v)**2, axis=1)
This still runs slightly slower than the looped version.
Now, conventional wisdom says that vectorized code using broadcasting should always be faster, which in many cases isn't true (I'll shamelessly plug another of my answers here). So what's happening?
As I said, it comes down to memory access.
In the broadcast version, v is subtracted from every row of M in one pass over the whole array. By the time the last row of M has been processed, the results of processing the first row have been evicted from cache, so for the second step these differences are loaded into cache again and squared. Finally, they are loaded and processed a third time for the summation. Since M is quite large, parts of the cache are cleared on each step to accommodate all of the data.
In the looped version each row is processed completely in one smaller step, leading to fewer cache misses and overall faster code.
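To make the cache argument concrete, here is a sketch of a blocked variant (my own illustration, not from the original answer): it applies the same broadcast expression to small groups of rows so that the intermediate arrays stay cache-sized, which should reduce the extra memory traffic of the fully broadcast version.
def norm_bcast_blocked(M, v, block=10):
    n = M.shape[0]
    d = np.empty(n)
    for start in range(0, n, block):
        stop = min(start + block, n)
        diff = M[start:stop] - v               # small intermediate (~10 rows) stays in cache
        d[start:stop] = np.sum(diff**2, axis=1)
    return d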
Lastly, it is possible to avoid this with some array operations by using einsum.
This function allows mixing matrix multiplications and summations.
First, I'll point out it's a function that has rather unintuitive syntax compared to the rest of numpy, and potential improvements often aren't worth the extra effort to understand it.
The result may also be slightly different due to rounding errors.
In this case it can be written as
def norm_einsum(M, v):
    tmp = M - v
    return np.einsum('ij,ij->i', tmp, tmp)
This reduces it to two operations over the entire array - a subtraction, and calling einsum, which performs the squaring and summation.
This gives a slight improvement:
%timeit norm_bcast(M, v)
30.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_loop(M, v)
25.1 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_einsum(M, v)
21.7 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Squeezing out maximum performance
On vectorized operations you clearly have bad cache behaviour. But the calculation itself is also slow because it does not exploit modern SIMD instructions (AVX2, FMA). Fortunately, it isn't really complicated to overcome these issues.
Example
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def norm_loop_improved(M, v):
    n = M.shape[0]
    d = np.empty(n, dtype=M.dtype)

    # enables SIMD-vectorization
    # if the arrays are not aligned
    M = np.ascontiguousarray(M)
    v = np.ascontiguousarray(v)

    for i in nb.prange(n):
        dT = 0.
        for j in range(v.shape[0]):
            dT += (M[i, j] - v[j])*(M[i, j] - v[j])
        d[i] = dT
    return d
Performance
M = np.random.random_sample((1000, 1000))
norm_loop_improved: 0.11 ms**, 0.28 ms
norm_loop: 6.56 ms
norm_einsum: 3.84 ms

M = np.random.random_sample((10000, 10000))
norm_loop_improved: 34 ms
norm_loop: 223 ms
norm_einsum: 379 ms
** Be careful when measuring performance
The first result (0.11 ms) comes from calling the function repeatedly with the same data. That would require 77 GB/s of read throughput from RAM, which is far more than my dual-channel DDR3 RAM is capable of. Since calling a function with the same input parameters over and over isn't realistic at all, we have to modify the measurement.
To avoid this issue we have to call the same function with different data at least twice (8 MB L3 cache, 8 MB of data per array) and then divide the result by two, so that all the caches are effectively cleared between calls.
The relative performance of these methods also differs with the array size (have a look at the einsum results).
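As a sketch of that measurement strategy (my own illustration of the rule above), one can alternate between two distinct arrays that together exceed the L3 cache and halve the reported time:
M1 = np.random.random_sample((1000, 1000))   # ~8 MB each, about the size of the L3 cache
M2 = np.random.random_sample((1000, 1000))
%timeit norm_loop_improved(M1, M1[0]); norm_loop_improved(M2, M2[0])
# divide the reported per-loop time by two to get the per-call cost with cold caches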
