What is the fastest way to do bulk assignments into NumPy arrays? - python

I'm working on a machine learning problem, and I need to construct an array of dimensions m x n x p.
For the sake of the question, let's say we have m locations, n time windows, and p features for each observation at each time window.
Our data store can only return one location worth of data at a time, in other words we get an array of size 1 x n x p back. But for the prediction step, we want everything consolidated into a single array of size m x n x p.
With relatively small p, the naïve approach is fast enough that we don't care. In some cases, however, we have fairly large p, and that makes this fairly slow.
For example:
In [44]: A = np.arange(4800000000, dtype=np.float32).reshape(20,30,8000000)
In [45]: x = np.random.randn(30, 8000000)
In [46]: %timeit A[0,:] = x
39.5 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
While 40 ms is not a lot on its face, doing that for an m of 100 means 4 seconds spent on this one fairly straightforward operation.
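Concretely, the naïve construction looks something like this (a sketch only; fetch_location and the scaled-down sizes are placeholders for our data store, not real code):

import numpy as np

# Scaled-down p so the sketch runs anywhere; the question uses p = 8_000_000.
m, n, p = 100, 30, 8_000

def fetch_location(i):
    # Hypothetical stand-in for the data store, which returns one location
    # at a time as a (1, n, p) array.
    return np.random.randn(1, n, p).astype(np.float32)

# Preallocate the consolidated (m, n, p) array, then fill it one location at
# a time -- each assignment is the ~40 ms step timed above, repeated m times.
A = np.empty((m, n, p), dtype=np.float32)
for i in range(m):
    A[i, :] = fetch_location(i)[0]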
Is there a faster way to do this array construction?

Related

Doing for loop computations faster

I have a for loop doing some operation on the elements of an array. There are 1e5 elements in the array.
import numpy as np
A = np.array([1,2,3,4..........100000])
for i in range(0, len(A)):
    A[i] = (A[i]*2 + A[i]*4)**(1/3)
I want to obtain parallelisation in the above code so that each execution of the for loop goes to a different core to make the code execution faster. I have a workstation with 48 cores. How to achieve this parallel processing in python? Please help.
Don't bother parallelizing just yet. Right now, you're taking no advantage of numpy vectorization; you may as well be using a Python list (or maybe array.array) for all the benefit numpy is giving you.
Actually use the vectorization features, and the overhead should drop by several orders of magnitude:
import numpy as np
A = np.array([1,2,3,4..........100000]) # If these are actually the values you want, use np.arange(1, 100000+1) to speed it up
A = (A * 6) ** (1 / 3)
# If the result should truncate back to int64, not convert to doubles, cast back at the end
A = A.astype(np.int64)
(A * 6) ** (1 / 3) does the same work as the for loop did, but much faster (you could match the original code more closely with A = (A * 2 + A * 4) ** (1/3), but multiplying by 2 and 4 separately and adding them together is pointless when you could just multiply by 6 directly). The final (optional, depending on intent) line gets exact equivalent behavior of the original loop by truncating back to the original integer dtype.
Comparing performance with ipython %%timeit magic for a microbenchmark:
In [2]: %%timeit
...: A = np.arange(1, 100000+1)
...: for i in range(len(A)):
...:     A[i] = (A[i]*2 + A[i]*4) ** (1/3)
...:
427 ms ± 6.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %%timeit
...: A = np.arange(1, 100000+1)
...: A = (A * 6) ** (1/3)
...:
2.72 ms ± 51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The vectorized code takes about 0.6% of the time taken by the naive loop; merely parallelizing the naive loop would never come close to achieving that sort of speedup. Adding the .astype(np.int64) cast only increases runtime by about 6%, still a trivial fraction of what the original for loop required.
Let numpy do the hard work.
A = (A*2+A*4)**(1/3)

Performing matrix operation on two large matrices

I have two large matrices (40000*4096) and I would like to compare and match each row of the first matrix to all of the rows of the second matrix; as a result, the output will have size (40000*40000). However, since I need to do this several thousand times, it is wildly time consuming: 26k seconds for each iteration, so for 5000 times ...
I would be glad if you could give me some smart suggestion. Thank you.
P.S. this is what I did so far for just one iteration (1 of 5000)
def matcher(Antigens, Antibodies, ind):
    temp = np.zeros((Antibodies.shape[0], Antibodies.shape[1]))
    output = np.zeros((Antibodies.shape[0], 1))
    for i in range(len(Antibodies)):
        temp[i] = np.int32(np.equal(Antigens[ind], Antibodies[i]))
        output[i] = np.sum(temp[i])
    return output

output = [matcher(gens, Antibodies) for gens in Antigens]
Okay, I think I understand what your goal is:
Count the number of row matches (antigen vs. antibody matrix). Each element of the resulting vector (40,000 x 1) represents a count of exact matches between one antigen row and all of the antibody rows (so values from 0 to 40_000).
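A tiny plain-numpy illustration of that interpretation (made-up 2-element rows, just to show what is being counted):

import numpy as np

ag = np.array([[1, 2], [3, 4]])            # 2 antigen rows
ab = np.array([[1, 2], [9, 9], [1, 2]])    # 3 antibody rows

# For each antigen row, count how many antibody rows it matches exactly.
counts = (ag[:, None, :] == ab[None, :, :]).all(-1).sum(1)
print(counts)   # [2 0] -> the first antigen row matches two antibody rows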
I made some fake data:
import numpy as np
import numba as nb
num_mat = 5 # number of matrices
num_row = 10_000 # number of rows per matrix
num_elm = 4_096 # number of elements per row
dim = (num_mat,num_row,num_elm)
Antigens = np.random.randint(0,256,dim,dtype=np.uint8)
Antibodies = np.random.randint(0,256,dim,dtype=np.uint8)
There's one important point here: I reduced the matrices to the smallest datatype that can represent the data in order to reduce their memory footprint. I'm not sure what your data looks like, but hopefully you can do this as well.
Also, the following code assumes your dimensions look like the fake data:
(number of matrices, rows, elements)
@nb.njit
def match_arr(arr1, arr2):
    for i in range(arr1.shape[0]): # 4096 vs 4096
        if arr1[i] != arr2[i]:
            return False
    return True

@nb.njit
def match_mat_sum(ag, ab):
    out = np.zeros((ag.shape[0])) # 40000
    for i in range(ag.shape[0]):
        tmp = 0
        for j in range(ab.shape[0]):
            tmp += match_arr(ag[i], ab[j])
        out[i] = tmp
    return out

@nb.njit(parallel=True)
def match_sets(Antigens, Antibodies):
    out = np.empty((Antigens.shape[0] * Antibodies.shape[0], Antigens.shape[1])) # 5000 x 40000
    # multiprocessing per antigen matrix, may want to move this as suits your data
    for i in nb.prange(Antigens.shape[0]):
        for j in range(Antibodies.shape[0]):
            out[j+(5*i)] = match_mat_sum(Antigens[i], Antibodies[j]) # need to figure out the index to avoid race conditions
    return out
I lean on Numba heavily. One of the key optimizations is not to check the equivalence of entire rows with np.equal(), but to write a custom function match_arr() that breaks as soon as it finds a mismatched element. Hopefully, this lets us skip a ton of comparisons.
Time comparison:
%timeit match_arr(arr1, arr2)
314 ns ± 0.361 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.equal(arr1, arr2)
1.07 µs ± 5.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
match_mat_sum
This function simply calculates the middle step (the 40,000 x 1 vector) that represents the sum of exact matches between two matrices. This step reduces two matrices like: (m x n), (o x n) -> (m)
match_sets()
The last function parallelizes this operation with explicit parallel loops through nb.prange. You might want to move the prange to a different loop depending on what your data looks like (e.g. if you have one antigen matrix but 5000 antibody matrices, you should move prange to the inner loop or you won't be leveraging parallelization). The fake data assumes some antigen and some antibody matrices.
Another important thing to note here is the indexing on the out array. In order to avoid race conditions, each parallel loop iteration needs to write to a unique space. Again, depending on your data, you'll need to index the proper "place" to put the result.
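For reference, a minimal usage sketch on the fake data above (shapes follow the code as written):

# Run the parallel comparison over the fake data defined earlier.
result = match_sets(Antigens, Antibodies)
print(result.shape)   # (num_mat * num_mat, num_row) -> (25, 10000) for these sizes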
On a Ryzen 1600 (6-core) with 16 gigs of RAM, using this fake data, I generated a result in 10.2 seconds.
Your data is about 3,200 times larger. Assuming linear scaling, the full set would take approximately 9 hours, provided you have enough memory.
You could write some kind of batch loader as well, rather than loading 5000 giant matrices directly into memory.
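A rough sketch of what such a batch loader could look like (load_antibody_matrix and its file names are hypothetical placeholders; it reuses match_mat_sum from above so only one antibody matrix is in memory at a time):

import numpy as np

def load_antibody_matrix(k):
    # Hypothetical loader -- replace with however you read matrix k from disk.
    return np.load(f"antibodies_{k}.npy")

def batched_match(antigens_2d, num_batches):
    # antigens_2d: a single (rows x elements) antigen matrix.
    results = []
    for k in range(num_batches):
        antibodies_2d = load_antibody_matrix(k)
        results.append(match_mat_sum(antigens_2d, antibodies_2d))
    return np.stack(results)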
This problem can be tackled with a mixture of numpy broadcasting, and the module numexpr, which performs operations fast while minimizing the storage of intermediate values
import numexpr as ne
# expand arrays dimensions to support broadcasting when doing comparison
Antigens, Antibodies = Antigens[None, :, :], Antibodies[:, None, :]
output = ne.evaluate('sum((Antigens==Antibodies)*1, axis=2)')
# *1 is a hack because numexpr does not currently support sum on bool
This may be faster than your current solution, but for such large arrays it will take a while.
The performance of numexpr for this operation is a bit lackluster, but you can at least use broadcasting inside the loop:
output = np.zeros((Antibodies.shape[0],)*2, dtype=np.int32)
for row, out_row in zip(Antibodies, output):
    (row[None,:]==Antigens).sum(1, out=out_row)

Why doesn't numpy.zeros allocate all of its memory on creation? And how can I force it to?

I want to create an empty Numpy array in Python, to later fill it with values. The code below generates a 1024x1024x1024 array with 2-byte integers, which means it should take at least 2GB in RAM.
>>> import numpy as np; from sys import getsizeof
>>> A = np.zeros((1024,1024,1024), dtype=np.int16)
>>> getsizeof(A)
2147483776
From getsizeof(A), we see that the array takes 2^31 + 128 bytes (presumably of header information.) However, using my task manager, I can see Python is only taking 18.7 MiB of memory.
Suspecting the array was being compressed, I assigned random values to each memory slot so that it could not be.
>>> for i in range(1024):
...     for j in range(1024):
...         for k in range(1024):
...             A[i,j,k] = np.random.randint(32767, dtype = np.int16)
The loop is still running, and my RAM is slowly increasing (presumably as the arrays composing A inflate with the incompressible noise). I'm assuming it would make my code faster to force numpy to allocate this array in full from the beginning. Curiously, I haven't seen this documented anywhere!
So, 1. Why does numpy do this? and 2. How can I force numpy to allocate memory?
A neat answer to your first question can also be found in this StackOverflow answer.
To answer your second question, you can force the memory to be allocated as follows in a more or less efficient manner:
A = np.empty((1024,1024,1024), dtype=np.int16)
A.fill(0)
because then the memory is touched.
On my machine with my setup,
A = np.empty(0)
A.resize((1024, 1024, 1024))
also does the trick, but I cannot find this behavior documented, and this might be an implementation detail; realloc is used under the hood in numpy.
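If you want to verify that the pages really are committed, one way (a sketch assuming the third-party psutil package; it is not part of NumPy) is to watch the process's resident memory before and after touching the array:

import os
import numpy as np
import psutil   # extra dependency, used only to read resident memory

proc = psutil.Process(os.getpid())
print(f"baseline:       {proc.memory_info().rss / 2**20:.0f} MiB")

A = np.empty((1024, 1024, 1024), dtype=np.int16)
print(f"after np.empty: {proc.memory_info().rss / 2**20:.0f} MiB")   # still small

A.fill(0)   # touches every page
print(f"after fill(0):  {proc.memory_info().rss / 2**20:.0f} MiB")   # roughly 2 GiB larger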
Let's look at some timings for a smaller case:
In [107]: A = np.zeros(10000,int)
In [108]: for i in range(A.shape[0]): A[i]=np.random.randint(327676)
We don't need to make A 3d to get the same effect; 1d of the same total size would be just as good.
In [109]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676)
37 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now compare that time to the alternative of generating the random numbers with one call:
In [110]: timeit np.random.randint(327676, size=A.shape)
185 µs ± 905 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Much much faster.
If we do the same loop, but simply assign the random number to a variable (and throw it away):
In [111]: timeit for i in range(A.shape[0]): x=np.random.randint(327676)
32.3 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The times are nearly the same as the original case. Assigning the values to the zeros array is not the big time consumer.
I'm not testing as large a case as yours, and my A has already been initialized in full, so you are welcome to repeat the comparisons at your size. But I think the pattern will still hold - iterating 1024x1024x1024 times (100,000 times larger than my example) is the big time consumer, not the memory allocation task.
Something else you might experiment with: just iterate over the first dimension of A, and assign a randint block shaped like the other two dimensions. For example, expanding my A with a size-10 dimension:
In [112]: A = np.zeros((10,10000),int)
In [113]: timeit for i in range(A.shape[0]): A[i]=np.random.randint(327676,size=A.shape[1])
1.95 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A is 10x larger than in [107], but takes roughly 16x less time to fill, because it only has to iterate 10 times. In numpy, if you must iterate, try to do it only a few times over a larger task.
(timeit repeats the test many times (e.g. 7*10), so it isn't going to capture any initial memory allocation step, even if I use a large enough array for that to matter).

numpy.sum performance depending on axis

When summing over a dimension in a numpy array, is there a performance difference between the first and the last axis?
Specifically, considering the following code, which of sum1 and sum2 will be performed faster?
import numpy as np
a = np.ones((1000,200))
b = np.ones((200,1000))
sum1 = np.sum(a, axis=0)
sum2 = np.sum(b, axis=-1)
I believe this question actually boils down to how numpy internally stores dimensions, and that this can be overridden to use either row-wise or column-wise format. However, when using the default setting, which of these will be faster? Also, what about N-dimensional arrays?
It is quite easy to check whether or not there is a performance difference (IPython, I increased a bit the numbers to have a more noticeable difference):
import numpy as np
a = np.ones((10000, 2000))
b = np.ones((2000, 10000))
%timeit np.sum(a, axis=0)
# 27.6 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(b, axis=-1)
# 34.6 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now, by the time you are having an actual performance issue with np.sum you will probably have run out of memory anyway, but yes, there is a difference. By default, NumPy arrays are stored in row-major order: first comes the first row, then the second, and so on. It makes sense, then, that summing (or operating) over the outer dimension is faster, because the cache will be far more effective.
Simply put, in the first case, when you fetch the first element of the array, a bunch of contiguous data comes into the cache with it, so when you want to sum the next elements they are already there. In the second case, on the other hand, the elements to sum are quite far apart from each other (2000 elements apart, actually), so the cache won't help much column-wise. That is not to say the cache won't help at all, since you are summing all the columns, so cached data will still be reused to a degree, just not as effectively. This is a rather gross approximation; in general there are several cache levels, some shared among cores and some not, and understanding the exact effect that a given piece of code has on them is a complicated topic, but the general idea holds.
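If you want to see the layout effect directly, one quick check (a sketch; np.asfortranarray simply makes a column-major copy of the same data) is to repeat the reduction on a Fortran-ordered copy, where the faster axis should flip:

import numpy as np
from timeit import timeit

a_c = np.ones((10000, 2000))      # default C (row-major) layout
a_f = np.asfortranarray(a_c)      # same values, column-major layout

# With the layout flipped, the axis that walks contiguous memory changes,
# so the relative timings of the two reductions should swap as well.
print(timeit(lambda: a_c.sum(axis=0), number=20))
print(timeit(lambda: a_f.sum(axis=0), number=20))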

Why python broadcasting in the example below is slower than a simple loop?

I have an array of vectors and compute the norm of their diffs vs the first one.
When using python broadcasting, the calculation is significantly slower than doing it via a simple loop. Why?
import numpy as np
def norm_loop(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    for i in range(n):
        d[i] = np.sum((M[i] - v)**2)
    return d

def norm_bcast(M, v):
    n = M.shape[0]
    d = np.zeros(n)
    d = np.sum((M - v)**2, axis=1)
    return d
M = np.random.random_sample((1000, 10000))
v = M[0]
%timeit norm_loop(M, v)
25.9 ms
%timeit norm_bcast(M, v)
38.5 ms
I have Python 3.6.3 and Numpy 1.14.2
To run the example in google colab:
https://drive.google.com/file/d/1GKzpLGSqz9eScHYFAuT8wJt4UIZ3ZTru/view?usp=sharing
Memory access.
First off, the broadcast version can be simplified to
def norm_bcast(M, v):
    return np.sum((M - v)**2, axis=1)
This still runs slightly slower than the looped version.
Now, conventional wisdom says that vectorized code using broadcasting should always be faster, which in many cases isn't true (I'll shamelessly plug another of my answers here). So what's happening?
As I said, it comes down to memory access.
In the broadcast version, v is subtracted from every row of M. By the time the last row of M has been processed, the results of processing the first row have been evicted from cache, so for the second step these differences are loaded into cache again and squared. Finally, they are loaded and processed a third time for the summation. Since M is quite large, parts of the cache are cleared on each step to accommodate all of the data.
In the looped version each row is processed completely in one smaller step, leading to fewer cache misses and overall faster code.
Lastly, it is possible to avoid this with some array operations by using einsum.
This function allows mixing matrix multiplications and summations.
First, I'll point out it's a function that has rather unintuitive syntax compared to the rest of numpy, and potential improvements often aren't worth the extra effort to understand it.
The answer may also be slightly different due to rounding errors.
In this case it can be written as
def norm_einsum(M, v):
    tmp = M - v
    return np.einsum('ij,ij->i', tmp, tmp)
This reduces it to two operations over the entire array - a subtraction, and calling einsum, which performs the squaring and summation.
This gives a slight improvement:
%timeit norm_bcast(M, v)
30.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_loop(M, v)
25.1 ms ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit norm_einsum(M, v)
21.7 ms ± 65.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Squeezing out maximum performance
On the vectorized operations you clearly have bad cache behaviour. But the calculation itself is also slow because it doesn't exploit modern SIMD instructions (AVX2, FMA). Fortunately, it isn't really complicated to overcome these issues.
Example
import numpy as np
import numba as nb
@nb.njit(fastmath=True, parallel=True)
def norm_loop_improved(M, v):
    n = M.shape[0]
    d = np.empty(n, dtype=M.dtype)

    # enables SIMD-vectorization
    # if the arrays are not aligned
    M = np.ascontiguousarray(M)
    v = np.ascontiguousarray(v)

    for i in nb.prange(n):
        dT = 0.
        for j in range(v.shape[0]):
            dT += (M[i, j] - v[j]) * (M[i, j] - v[j])
        d[i] = dT
    return d
Performance
M = np.random.random_sample((1000, 1000))
norm_loop_improved: 0.11 ms**, 0.28ms
norm_loop: 6.56 ms
norm_einsum: 3.84 ms
M = np.random.random_sample((10000, 10000))
norm_loop_improved: 34 ms
norm_loop: 223 ms
norm_einsum: 379 ms
** Be careful when measuring performance
The first result (0.11 ms) comes from calling the function repeatedly with the same data. That would require 77 GB/s of read throughput from RAM, which is far more than my dual-channel DDR3 RAM is capable of. Because calling a function with the same input parameters over and over isn't realistic at all, we have to modify the measurement.
To avoid this issue, we have to call the same function on different data at least twice (8 MB L3 cache, 8 MB of data, so the caches are cleared between calls) and then divide the result by two.
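A minimal sketch of that measurement pattern (sizes are illustrative; each array is about 8 MB so the second call cannot reuse cached data from the first):

import numpy as np
from timeit import timeit

# Two independent inputs so back-to-back calls don't hit warm caches.
M1 = np.random.random_sample((1000, 1000))
M2 = np.random.random_sample((1000, 1000))
v1, v2 = M1[0], M2[0]

t = timeit(lambda: (norm_loop_improved(M1, v1), norm_loop_improved(M2, v2)), number=100)
print("per call:", t / (2 * 100), "s")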
The relative performance of these methods also differs with array size (have a look at the einsum results).
