My sincere apologies in advance if this question seems quite basic.
Given:
>>> import numpy as np
>>> import time
>>> A = np.random.rand( int(1e5), int(5e4) ) # large numpy array
Goal:
>>> bt=time.time(); B=np.argsort(A,axis=1);et=time.time();print(f"Took {(et-bt):.2f} s")
However, it takes quite a long time to compute the array of indices:
# Took 316.90 s
Question:
Are there any more time-efficient ways to do this?
Cheers,
The input array A has a shape of (100_000, 50_000) and contains np.float64 values by default. This means that you need 8 * 100_000 * 50_000 / 1024**3 ≈ 37.25 GiB of memory just for this array. You will likely need the same amount of space for the output matrix B (which contains items of type np.int64). This means you need a machine with at least 74.5 GiB, not to mention the space required for the operating system (OS) and running software (so probably at least 80 GiB). If you do not have that much memory, your OS will use your storage device as swap memory, which is much slower.
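The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: it assumes np.argsort returns 8-byte indices (the case on typical 64-bit platforms) and ignores everything else the process allocates.

```python
import numpy as np

rows, cols = 100_000, 50_000
GiB = 1024**3

# Input A: float64 items (8 bytes each).
a_gib = rows * cols * np.dtype(np.float64).itemsize / GiB
# Output B: assumed to hold int64 indices (8 bytes each).
b_gib = rows * cols * np.dtype(np.int64).itemsize / GiB

print(f"A: {a_gib:.2f} GiB, B: {b_gib:.2f} GiB, total: {a_gib + b_gib:.2f} GiB")
```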
Assuming you have that much memory available, the computation is still very expensive. This is mainly due to the page faults when filling the B array, the sheer size of B, and the fact that the default Numpy implementation does the computation sequentially. You can speed the computation up using parallel Numba code and a smaller output. Here is an example:
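import numba as nb

@nb.njit('int32[:,::1](float64[:,::1])', parallel=True)
def fastSort(a):
    # int32 indices halve the size of the output compared to int64
    b = np.empty(a.shape, dtype=np.int32)
    # Rows are independent, so they can be sorted in parallel
    for i in nb.prange(a.shape[0]):
        b[i,:] = np.argsort(a[i,:])
    return b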
Note that Numba's implementation of argsort is less efficient than Numpy's, but the parallel version should be much faster if the target machine has many cores and good memory bandwidth.
Here are the results on my 6-core machine on a matrix of size (10_000, 50_000) (10 times smaller because I do not have 80 GiB of RAM):
Original implementation: 28.14 s
Sequential Numba: 38.70 s
Parallel Numba: 6.79 s
The resulting solution is thus 4.1 times faster.
Note that you could even use the type uint16 for the items of B in this specific case, as the size of each line is less than 2**16 = 65536. This will likely not be significantly faster but it should save some additional memory. The resulting required memory will be about 46.6 GiB. You can further reduce the amount of memory needed using the np.float32 type (often at the expense of a loss of accuracy).
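As a rough illustration of the uint16 idea, here is a small NumPy-only sketch on a stand-in matrix (the 4-row shape and the names idx64/idx16 are chosen for the example, not taken from the question). It assumes a 64-bit build where np.argsort returns int64 indices.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((4, 50_000))          # small stand-in for the real matrix

idx64 = np.argsort(a, axis=1)        # default index type (int64 on typical 64-bit builds)
assert a.shape[1] <= 2**16           # every column index fits in an unsigned 16-bit integer
idx16 = idx64.astype(np.uint16)      # same indices, 2 bytes per item

print(idx64.nbytes, idx16.nbytes)
```

Converting after the fact still creates the full int64 array first, so it does not lower the peak memory usage. In the Numba version it is probably better to allocate b with dtype=np.uint16 directly (and adjust the signature accordingly).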
If you want to improve the execution time further, then you need to write a faster implementation of argsort for your specific needs using a low-level language like C or C++. But be aware that beating Numpy is far from easy if you are not an experienced programmer in such a language or not familiar with low-level optimizations. If you are interested in such a solution, a good start is probably to implement a radix sort.
Related
I have a huge numpy array with shape (50000000, 3) and I'm using:
x = array[np.where((array[:,0] == value) | (array[:,1] == value))]
to get the part of the array that I want. But this way seems to be quite slow.
Is there a more efficient way of performing the same task with numpy?
np.where is highly optimized and I doubt anyone can write faster code than the one implemented in the latest Numpy version (disclaimer: I was one of the people who optimized it). That being said, the main issue here is not so much np.where as the conditional, which creates temporary boolean arrays. This is unfortunately the way to do it in Numpy, and there is not much to do as long as you use only Numpy with the same input layout.
One reason it is not very efficient is that the input data layout is inefficient. Indeed, assuming array is stored contiguously in memory using the default row-major ordering, array[:,0] == value will read 1 item out of every 3 items of the array in memory. Due to the way CPU caches work (i.e. cache lines, prefetching, etc.), 2/3 of the memory bandwidth is wasted. In addition, the output boolean array also needs to be written, and filling a newly created array is a bit slow due to page faults. Note that array[:,1] == value will certainly reload data from RAM due to the size of the input (which cannot fit in most CPU caches). RAM is slow, and it is getting slower relative to the computational speed of CPUs and caches. This problem, called the "memory wall", was observed a few decades ago and is not expected to be fixed any time soon. Also note that the logical-or will create yet another array read/written from/to RAM. A better data layout is a (3, 50000000) transposed array that is contiguous in memory (note that np.transpose does not produce a contiguous array).
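To make the layout point concrete, here is a minimal sketch on a tiny stand-in array (the names view_t and array_t are illustrative). It shows that a plain transpose is only a view, while np.ascontiguousarray produces a real (3, n) copy whose rows are contiguous.

```python
import numpy as np

array = np.arange(30, dtype=np.int64).reshape(10, 3)   # small stand-in for the (50000000, 3) input
value = 4

view_t = array.T                          # a view: no copy, not C-contiguous
array_t = np.ascontiguousarray(array.T)   # a real (3, n) copy, each row contiguous

print(view_t.flags['C_CONTIGUOUS'], array_t.flags['C_CONTIGUOUS'])

# Each comparison now scans one contiguous row instead of every third item.
mask = (array_t[0] == value) | (array_t[1] == value)
x = array[mask]
```

The copy itself costs one pass over the data, so this mainly pays off if the input can be created in the transposed layout from the start or if many queries are run on it.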
Another reason explaining the performance issue is that Numpy tends not to be optimized for operating on very small axes.
One main solution is to create the input in a transposed way if possible. Another solution is to write Numba or Cython code. Here is an implementation for the non-transposed input:
# Compilation for the most frequent types.
# Please pick the right ones so as to speed up the compilation time.
@nb.njit(['(uint8[:,::1],uint8)', '(int32[:,::1],int32)', '(int64[:,::1],int64)', '(float64[:,::1],float64)'], parallel=True)
def select(array, value):
    n = array.shape[0]
    mask = np.empty(n, dtype=np.bool_)
    # Each row is tested independently, so the loop can run in parallel
    for i in nb.prange(n):
        mask[i] = array[i, 0] == value or array[i, 1] == value
    return mask

x = array[select(array, value)]
Note that I used a parallel implementation since the or operator is sub-optimal with Numba (the only solution seems to be to use native code or Cython) and also because the RAM cannot be fully saturated with one thread on some platforms like computing servers. Also note that it can be faster to use array[np.where(select(array, value))[0]], depending on the result of select. Indeed, if the result is random or very small, then np.where can be faster since it has special optimizations for these cases that boolean indexing does not perform. Note that np.where is not particularly optimized in the context of a Numba function since Numba uses its own implementation of Numpy functions, and they are sometimes not as well optimized for large arrays. A faster implementation consists in creating x in parallel, but this is not trivial to do with Numba since the number of output items is not known ahead of time and threads must know where to write data, not to mention that Numpy is already fairly fast at doing that sequentially as long as the output is predictable.
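The two ways of applying the mask mentioned above select the same rows. The sketch below uses a plain NumPy mask as a stand-in for select(array, value) so that it runs without Numba. It only checks that both forms agree; which one is faster depends on the data.

```python
import numpy as np

rng = np.random.default_rng(0)
array = rng.integers(0, 100, size=(1000, 3))
value = 7

# Stand-in for select(array, value): a plain NumPy mask with the same meaning.
mask = (array[:, 0] == value) | (array[:, 1] == value)

x_bool = array[mask]                 # boolean indexing
x_idx = array[np.where(mask)[0]]     # integer indexing through np.where
```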
I'm trying to do some matrix computations as
def get_P(X,Z):
    n_sample,n_m,n_t,n_f = X.shape
    res = np.zeros((n_sample,n_m,n_t,n_t))
    for i in range(n_sample):
        res[i,:,:,:] = np.dot(X[i,:,:,:],Z[i,:,:])
    return res
Because the size of X and Z is large, it takes more than 600ms to compute one np.dot, and I have 10k rows in X.
Is there any way we can speed it up?
Well, there might be some avoidable overhead from your zero initialization (which gets overwritten right away): just use np.empty (or the lower-level np.ndarray constructor) instead.
Other than that: numpy is fairly well-optimized. You can probably speed things up by using dtype=numpy.float32 instead of the default 64-bit floating point numbers for your X, Z and res, but that's about it. Dot products mostly spend their time going linearly through RAM and multiplying and summing numbers, things that numpy, compilers and CPUs are radically good at these days.
Note that numpy may only use one CPU core at a time here, depending on its configuration and the BLAS library it is linked against, so it might make sense to parallelize. For example, if you've got 16 CPU cores, you'd make 16 separate res partitions and calculate subsets of your range(n_sample) dot products on each core. Python does bring the multithreading / async facilities to do so; you'll find plenty of examples, and explaining how would lead too far.
If you can spend the development time, and have massive amounts of data so that this pays off: you can use e.g. GPUs to multiply matrices. These are really good at that, and cuBLASLt (GEMM) is an excellent implementation, but honestly, you'd mostly be abandoning Numpy and would need to work things out yourself in C/C++.
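The partitioning idea from this answer could be sketched with a thread pool as follows. The function name get_P_threaded and the n_workers parameter are made up for the example, and the shape of Z is assumed to be (n_sample, n_f, n_t) from the shape of res in the question. Any speedup relies on numpy releasing the GIL inside np.dot and on the underlying BLAS not already using all cores, so it should be measured rather than assumed.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def get_P_threaded(X, Z, n_workers=4):
    n_sample, n_m, n_t, n_f = X.shape
    # np.empty skips the zero fill, since every entry is overwritten below.
    res = np.empty((n_sample, n_m, n_t, Z.shape[2]), dtype=np.result_type(X, Z))

    def work(chunk):
        # Each worker fills its own slice of res, so no locking is needed.
        for i in chunk:
            res[i] = np.dot(X[i], Z[i])

    # One contiguous block of sample indices per worker.
    chunks = np.array_split(np.arange(n_sample), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # list() waits for completion and re-raises any worker error.
        list(pool.map(work, chunks))
    return res
```

It would be called like the original, for example P = get_P_threaded(X, Z, n_workers=16) on a 16-core machine.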
You can use numpy's einsum to do this multiplication in one vectorized step.
It can be much faster than this loop-based dot product. For examples, see https://rockt.github.io/2018/04/30/einsum
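For the get_P function from the question, the einsum call might look like the sketch below. The shapes are small stand-ins, and Z is assumed to have shape (n_sample, n_f, n_t), which is what the shape of res implies. The subscript letters are arbitrary labels.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 2, 3, 4))   # (n_sample, n_m, n_t, n_f)
Z = rng.random((5, 4, 3))      # (n_sample, n_f, n_t), assumed from the shape of res

# i: sample, m: n_m axis, t: n_t axis of X, f: summed axis, s: last axis of Z
P_einsum = np.einsum('imtf,ifs->imts', X, Z)

# Equivalent batched matrix product: Z gains a length-1 axis that broadcasts over m.
P_matmul = X @ Z[:, None, :, :]
```

Whether either form beats the loop depends on the shapes and the NumPy build. np.einsum also accepts optimize=True, which may help for larger contractions, so it is worth timing on the real data.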
In C there is a classic example of the importance of memory utilization: naive matrix multiplication using 3 for loops (i,j,k). One can show that using (i,k,j) is much faster than (i,j,k) due to memory coalescence. In Python with numpy, the order of the indexes (again naive 3 loops, not np.dot) does not matter. Why is that?
First of all, you need to know why the (i,k,j) loop is faster than (i,j,k) in C. This happens because of how memory is used: in your computer's memory the matrix is laid out linearly, so if you iterate using (i,k,j) you use this in your favor, since each step loads a block of memory into the CPU cache and the following steps reuse it. If you use (i,j,k) you are working against it: each step loads a block of memory into the cache and discards it on the next step, because you are jumping between blocks.
The implementation of numpy handles this for you, so even if you use the worst order, numpy will arrange the work to run faster.
Repeatedly throwing away cached data and reloading it is what is called a cache miss.
At this link you can see a much better explanation of how the memory is allocated and why some specific iteration orders are faster.
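The linear layout described above can be inspected from Python through an array's strides. This is a minimal sketch for a small float64 array in the default C (row-major) order.

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)   # C (row-major) order by default

# Bytes to step to the next row and to the next column.
print(a.strides)            # (32, 8)

# Walking along a row touches neighbouring addresses;
# walking down a column jumps a full row (32 bytes) at each step.
print(a.flags['C_CONTIGUOUS'])
```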
I just changed a program I am writing to hold my data as numpy arrays as I was having performance issues, and the difference was incredible. It originally took 30 minutes to run and now takes 2.5 seconds!
I was wondering how it does it. I assume it is because it removes the need for for loops, but beyond that I am stumped.
Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.
Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.
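The difference in storage can be seen with a short sketch like the following (the names lst and arr are illustrative). The exact sizes reported by sys.getsizeof vary between Python versions, so no particular values are claimed here.

```python
import sys
import numpy as np

n = 1000
lst = [float(i) for i in range(n)]
arr = np.arange(n, dtype=np.float64)

# The array stores n raw 8-byte doubles back to back.
print(arr.dtype, arr.itemsize, arr.nbytes)

# The list stores n pointers; each element is a separate Python float object.
print(sys.getsizeof(lst), sys.getsizeof(lst[0]))
```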
numpy arrays are specialized data structures.
This means you don't only get the benefits of an efficient in-memory representation, but efficient specialized implementations as well.
E.g. if you are summing up two arrays the addition will be performed with the specialized CPU vector operations, instead of calling the python implementation of int addition in a loop.
Consider the following code:
import numpy as np
import time
a = np.random.rand(1000000)
b = np.random.rand(1000000)
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print("Vectorised version: " + str(1000*(toc-tic)) + "ms")
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()
print("For loop: " + str(1000*(toc-tic)) + "ms")
Output:
Vectorised version: 2.011537551879883ms
For loop: 539.8685932159424ms
Here Numpy is much faster because it takes advantage of data parallelism through Single Instruction Multiple Data (SIMD) instructions, while a traditional for loop can't make use of it.
Numpy arrays are extremely similar to 'normal' arrays such as those in C. Notice that every element has to be of the same type. The speedup is great because you can take advantage of prefetching and you can instantly access any element in the array by its index.
You still have for loops, but they are done in C. Numpy can be built on top of ATLAS, which is a library for linear algebra operations.
http://math-atlas.sourceforge.net/
When facing a big computation, it will run tests using several implementations to find out which is the fastest one on our computer at this moment. With some numpy builds, computations may be parallelized on multiple CPUs. So you will have highly optimized C code running on contiguous memory blocks.
Numpy arrays are stored as contiguous blocks of memory, while Python lists are stored as small blocks scattered in memory. Memory access is therefore easy and fast in a numpy array, and difficult and slow in a Python list.
source: https://algorithmdotcpp.blogspot.com/2022/01/prove-numpy-is-faster-than-normal-list.html
Currently I am working with a NumPy memmap array with 2,000,000 * 33 * 33 *4 (N * W * H * C) data. My program reads random (N) indices from this array.
I have 8GB of RAM, 2TB HDD. The HDD read IO is only around 20M/s, RAM usage stays at 2.5GB. It seems that there is a HDD bottleneck because I am retrieving random indices that are obviously not in the memmap cache. Therefore, I would like the memmap cache to use RAM as much as possible.
Is there a way for me to tell memmap to maximize IO and RAM usage?
(Checking my python 2.7 source)
As far as I can tell NumPy memmap uses mmap.
mmap does define:
# Variables with simple values
...
ALLOCATIONGRANULARITY = 65536
PAGESIZE = 4096
However, I am not sure it would be wise (or even possible) to change those.
Furthermore, this may not solve your problem and would definitely not give you the most efficient solution, because there is caching and page reading at the OS level and at the hardware level (for the hardware it takes more or less the same time to read a single value or the whole page).
A much better solution would probably be to sort your requests (I suppose here that N is large, otherwise just sort them once):
Gather a bunch of them (say one or ten million?) and sort them before doing the requests. Then issue the queries in sorted order. After getting the answers, put them back in their original order.
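The sort-then-restore idea could be sketched as follows. The helper name read_sorted, the temporary file and the small (100, 33, 33, 4) shape are made up for the example. It is only a sketch of the access pattern, not a measured fix for the HDD bottleneck.

```python
import os
import tempfile
import numpy as np

def read_sorted(mm, indices):
    """Read mm[indices] by visiting the file in ascending order, then restore the request order."""
    indices = np.asarray(indices)
    order = np.argsort(indices, kind='stable')      # positions that sort the requests
    sorted_rows = np.asarray(mm[indices[order]])    # reads happen in ascending index order
    out = np.empty_like(sorted_rows)
    out[order] = sorted_rows                        # undo the sort
    return out

# Small stand-in for the real (2_000_000, 33, 33, 4) memmap.
path = os.path.join(tempfile.mkdtemp(), 'data.dat')
mm = np.memmap(path, dtype=np.uint8, mode='w+', shape=(100, 33, 33, 4))
mm[:] = np.arange(100, dtype=np.uint8)[:, None, None, None]   # sample k is filled with the value k
mm.flush()

idx = np.array([42, 7, 99, 7, 0])
batch = read_sorted(mm, idx)
```

Duplicate indices are handled naturally, and the returned batch is in the same order as the original requests.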