I just changed a program I am writing to hold my data as numpy arrays as I was having performance issues, and the difference was incredible. It originally took 30 minutes to run and now takes 2.5 seconds!
I was wondering how it does it. I assume it is because it removes the need for for loops, but beyond that I am stumped.
Numpy arrays are densely packed arrays of homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So, you get the benefits of locality of reference.
Also, many Numpy operations are implemented in C, avoiding the general cost of loops in Python, pointer indirection and per-element dynamic type checking. The speed boost depends on which operations you're performing, but a few orders of magnitude isn't uncommon in number crunching programs.
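To make that layout difference concrete, here is a small sketch (the exact byte counts depend on the platform and Python version, so treat the numbers as illustrative):
import sys
import numpy as np
n = 1_000_000
arr = np.arange(n, dtype=np.int64)   # one contiguous block of 8-byte integers
lst = list(range(n))                 # an array of pointers to separate int objects
print(arr.nbytes)          # 8000000: the actual data, tightly packed
print(sys.getsizeof(lst))  # size of the pointer array only; every int object
                           # it points to is a separate allocation elsewhere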
NumPy arrays are specialized data structures.
This means you not only get the benefits of an efficient in-memory representation, but efficient specialized implementations as well.
E.g. if you are summing two arrays, the addition will be performed with specialized CPU vector operations, instead of calling the Python implementation of int addition in a loop.
Consider the following code:
import numpy as np
import time
a = np.random.rand(1000000)
b = np.random.rand(1000000)
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print("Vectorised version: " + str(1000*(toc-tic)) + "ms")
c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()
print("For loop: " + str(1000*(toc-tic)) + "ms")
Output:
Vectorised version: 2.011537551879883ms
For loop: 539.8685932159424ms
Here NumPy is much faster because it takes advantage of data-level parallelism (Single Instruction, Multiple Data, or SIMD), while a traditional for loop cannot make use of it.
NumPy arrays are extremely similar to 'normal' arrays such as those in C. Notice that every element has to be of the same type. The speedup is great because you can take advantage of prefetching and you can access any element of the array directly by its index.
You still have for loops, but they are done in C. NumPy can be built on top of ATLAS, which is a library for linear algebra operations.
http://math-atlas.sourceforge.net/
When ATLAS is built, it runs tests with several implementations of each operation to find out which one is fastest on the machine it is installed on. With some NumPy builds, computations may also be parallelized over multiple CPUs. So you end up with highly optimized C running on contiguous memory blocks.
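If you want to check which BLAS/LAPACK implementation your own NumPy build is linked against, NumPy can report it directly (the output varies from one installation to another):
import numpy as np
# Prints build information, including the BLAS/LAPACK backend
# (ATLAS, OpenBLAS, MKL, ...) this NumPy was compiled against.
np.show_config()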
NumPy arrays are stored in memory as contiguous blocks, while Python lists store pointers to objects that are scattered around memory. As a result, memory access is cheap and fast for a NumPy array and comparatively slow for a Python list.
source: https://algorithmdotcpp.blogspot.com/2022/01/prove-numpy-is-faster-than-normal-list.html
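You can see that layout from Python itself through the flags and strides attributes of an array (a tiny array is used here just for illustration):
import numpy as np
a = np.arange(12, dtype=np.int64)
print(a.flags['C_CONTIGUOUS'])   # True: the 12 integers sit in one contiguous block
print(a.strides)                 # (8,): move 8 bytes to reach the next element
lst = list(range(12))            # a list stores pointers; the int objects themselves
                                 # are separate allocations scattered around the heap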
Related
I have a simple problem. A function receives an array [a, b] of two numbers, and it returns another array [aa, ab]. The sample code is
import numpy as np
def func(array_1):
    array_2 = np.zeros_like(array_1)
    array_2[0] = array_1[0]*array_1[0]
    array_2[1] = array_1[0]*array_1[1]
    return array_2
array_1 = np.array([3., 4.]) # sample test array [a, b]
print(array_1) # prints this test array [a, b]
print(func( array_1 ) ) # prints [a*a, a*b]
The two lines inside the function func
array_2[0] = array_1[0]*array_1[0]
array_2[1] = array_1[0]*array_1[1]
are independent and I want to parallelize them.
Please tell me
how to parallelize this (without Numba)?
how to parallelize this (with Numba)?
It does not make sense to parallelise this code using multiple threads/processes, because the arrays are far too small for such parallelisation to be useful. Indeed, creating a thread typically takes about 1-100 microseconds on a mainstream machine, while this code should take clearly less than a microsecond with Numba. In fact, the two computing lines should take less than 0.01 microseconds. Thus, creating threads would make the execution far slower.
Assuming your array would be much bigger, the typical way to parallelize a Python script is to use multiprocessing (which creates processes). For a Numba code, it is prange + parallel=True (which creates threads).
If you execute a jitted Numba function, then the code already runs a bit in parallel. Indeed, modern mainstream processors already execute instructions in parallel. This is called instruction-level parallelism. More specifically, modern processors pipeline the instructions and execute multiple of them thanks to a superscalar execution and in an out-of-order way. All of this is completely automatic. You just need to avoid having dependencies between the executed instructions.
Finally, if you want to speed up this function, then you need to use Numba in the caller function because a function call from CPython is far more expensive than computing and storing two floats. Note also that allocating an array is pretty expensive too so it is better to reuse buffers at this granularity.
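Just to illustrate what the Numba route looks like when the input really is large, here is a sketch for a big 1-D array (the function name scaled_by_first and the ten-million-element size are made up for the example; it is the large-array analogue of the two independent lines in the question):
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def scaled_by_first(array_1):
    # Multiply every element by the first one; the iterations are
    # independent, so prange can distribute them over threads.
    array_2 = np.empty_like(array_1)
    a0 = array_1[0]
    for i in nb.prange(array_1.shape[0]):
        array_2[i] = a0 * array_1[i]
    return array_2
big = np.random.rand(10_000_000)
print(scaled_by_first(big)[:3])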
I have a huge numpy array with shape (50000000, 3) and I'm using:
x = array[np.where((array[:,0] == value) | (array[:,1] == value))]
to get the part of the array that I want. But this way seems to be quite slow.
Is there a more efficient way of performing the same task with numpy?
np.where is highly optimized and I doubt someone can write faster code than the one implemented in the latest NumPy version (disclaimer: I am one of the people who optimized it). That being said, the main issue here is not so much np.where as the conditional, which creates a temporary boolean array. This is unfortunately the way to do it in NumPy, and there is not much you can do about it as long as you use only NumPy with the same input layout.
One reason it is not very efficient is that the input data layout is unfavourable. Indeed, assuming array is stored contiguously in memory using the default row-major ordering, array[:,0] == value reads only 1 item out of every 3 of the array in memory. Due to the way CPU caches work (i.e. cache lines, prefetching, etc.), 2/3 of the memory bandwidth is wasted. In fact, the output boolean array also needs to be written, and filling a newly created array is a bit slow due to page faults. Note that array[:,1] == value will almost certainly reload the data from RAM because the input is too large to fit in most CPU caches. RAM is slow, and it is getting slower relative to the computational speed of CPUs and caches. This problem, called the "memory wall", was identified a few decades ago and is not expected to be fixed any time soon. Also note that the logical-or creates yet another array that is read from and written to RAM. A better data layout would be a (3, 50000000) transposed array that is contiguous in memory (note that np.transpose does not produce a contiguous array).
Another reason for the performance issue is that NumPy tends not to be optimized for operating on very small axes.
One main solution is to create the input in a transposed way, if possible. Another solution is to write Numba or Cython code. Here is an implementation for the non-transposed input:
import numpy as np
import numba as nb

# Compilation for the most frequent types.
# Please pick the right ones so to speed up the compilation time.
@nb.njit(['(uint8[:,::1],uint8)', '(int32[:,::1],int32)', '(int64[:,::1],int64)', '(float64[:,::1],float64)'], parallel=True)
def select(array, value):
    n = array.shape[0]
    mask = np.empty(n, dtype=np.bool_)
    for i in nb.prange(n):
        mask[i] = array[i, 0] == value or array[i, 1] == value
    return mask

x = array[select(array, value)]
Note that I used a parallel implementation since the or operator is sub-optimal with Numba (the only solution seems to be using native code or Cython), and also because the RAM cannot be fully saturated with one thread on some platforms like computing servers. Also note that it can be faster to use array[np.where(select(array, value))[0]] depending on the result of select. Indeed, if the result is random or very small, then np.where can be faster, since it has special optimizations for these cases that boolean indexing does not perform. Note that np.where is not particularly optimized in the context of a Numba function, since Numba uses its own implementation of NumPy functions and they are sometimes not as optimized for large arrays. A faster implementation consists in creating x in parallel, but this is not trivial to do with Numba since the number of output items is not known ahead of time and threads must know where to write data, not to mention NumPy is already fairly fast at doing that sequentially as long as the output is predictable.
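For completeness, here is a sketch of the transposed-layout variant (assuming you can build the (3, n) array contiguously in the first place; the name select_transposed is made up for the example):
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def select_transposed(array_t, value):
    # array_t has shape (3, n) and is C-contiguous, so array_t[0] and
    # array_t[1] are each read as one dense block of memory.
    n = array_t.shape[1]
    mask = np.empty(n, dtype=np.bool_)
    for i in nb.prange(n):
        mask[i] = array_t[0, i] == value or array_t[1, i] == value
    return mask
# x = array_t[:, select_transposed(array_t, value)]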
My sincere apologies in advance if this question seems quite basic.
Given:
>>> import numpy as np
>>> import time
>>> A = np.random.rand( int(1e5), int(5e4) ) # large numpy array
Goal:
>>> bt=time.time(); B=np.argsort(A,axis=1);et=time.time();print(f"Took {(et-bt):.2f} s")
However, it takes quite a long time to calculate the array of indices:
# Took 316.90 s
Question:
Is there any other time efficient ways to do this?
Cheers,
The input array A has a shape of (100_000, 50_000) and contains np.float64 values by default. This means that you need 8 * 100_000 * 50_000 / 1024**3 = 37.25 GiB of memory just for this array. You also likely need the same amount of space for the output matrix B (which should contain items of type np.int64). This means you need a machine with at least 74.5 GiB of RAM, not to mention the space required for the operating system (OS) and running software (so probably at least 80 GiB). If you do not have that much memory, then your OS will use your storage device as swap memory, which is much, much slower.
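To spell out that arithmetic (a quick back-of-the-envelope check, nothing more):
rows, cols = 100_000, 50_000
a_bytes = 8 * rows * cols             # float64 input A
b_bytes = 8 * rows * cols             # int64 argsort output B
print(a_bytes / 1024**3)              # ~37.25 GiB for A alone
print((a_bytes + b_bytes) / 1024**3)  # ~74.5 GiB for A and B together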
Assuming you have that much memory available, such a computation is very expensive. This is mainly due to the page faults when filling the B array, the fact that B is pretty huge, and the fact that the default NumPy implementation does the computation sequentially. You can speed the computation up using parallel Numba code and a smaller output. Here is an example:
import numpy as np
import numba as nb

@nb.njit('int32[:,::1](float64[:,::1])', parallel=True)
def fastSort(a):
    b = np.empty(a.shape, dtype=np.int32)
    for i in nb.prange(a.shape[0]):
        b[i,:] = np.argsort(a[i,:])
    return b
Note that Numba's implementation of argsort is less efficient than NumPy's, but the parallel version should be much faster if the target machine has many cores and good memory bandwidth.
Here are the results on my 6-core machine on a matrix of size (10_000, 50_000) (10 times smaller because I do not have 80 GiB of RAM):
Original implementation: 28.14 s
Sequential Numba: 38.70 s
Parallel Numba: 6.79 s
The resulting solution is thus 4.1 times faster.
Note that you could even use the type uint16 for the items of B in this specific case, as the size of each line is less than 2**16 = 65536. This will likely not be significantly faster, but it should save some additional memory; the resulting required memory would be 46.5 GiB. You can further reduce the amount of memory needed by using the np.float32 type for the input (often at the expense of some loss of accuracy).
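Here is a sketch of that variant (same structure as fastSort above, just with a float32 input and a uint16 output; the name fastSortCompact is made up, and whether float32 accuracy is acceptable depends on your data):
import numpy as np
import numba as nb
@nb.njit('uint16[:,::1](float32[:,::1])', parallel=True)
def fastSortCompact(a):
    # Each row has fewer than 2**16 columns, so every index
    # fits in an unsigned 16-bit integer.
    b = np.empty(a.shape, dtype=np.uint16)
    for i in nb.prange(a.shape[0]):
        b[i,:] = np.argsort(a[i,:]).astype(np.uint16)
    return b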
If you want to improve the execution time further, then you need to implement a faster implementation of argsort for your specific needs using a low-level language like C or C++. But be aware that beating Numpy is far from being easy if you are not an experienced programmer in such a language or not familiar with low-level optimizations. If you are interested in such a solution, a good start is probably to implement a radix sort.
This question already has answers here: subsampling every nth entry in a numpy array (2 answers). Closed 7 years ago.
Suppose a 1-dimensional numpy array. I want to make a new array that contains every nth element. What is the computationally fastest way to do this?
Example:
a = numpy.arange(1,10)
b = numpy.fancytricks(a,?)
# b is now [2,4,6,8] if n = 2.
The absolute fastest way to do this is to write an extension module in pure C and use the buffer protocol to access the data directly. If you use Cython or another such tool to write the C for you, you may see small amounts of performance lost in automatic reference counting. Since you still have to do manual reference counting in handwritten C, the difference is likely to be negligible to nonexistent.
This will have marginally less overhead than the slicing syntax NumPy provides out of the box. However, assuming you use NumPy correctly, the overall performance gain is likely to be small and constant, so it's not clear to me that this is worth the extra effort in any reasonable situation.
b = a[1::step]
n = len(a)
Computational cost is O(n): you are "making" a loop of length len(a)/step.
UPDATE:
Computational cost is actually O(1): no numpy.array data is re-arranged, just one constant (the stride) in the view's access method is set/changed. Once created, the access speed is the same as for the original array.
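A quick way to confirm the view behaviour (standard NumPy attributes, shown on a small array):
import numpy as np
a = np.arange(1, 10)
b = a[1::2]                  # [2, 4, 6, 8]
print(b.base is a)           # True: b is a view, no data was copied
print(a.strides, b.strides)  # the view just uses a larger stride over the same buffer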
Given the thread here
It seems that numpy is not the most ideal for ultra fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?
Well, depends on what you want to do. XOR is, for instance, hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization) performance will usually be quite good.
Of course, you can obviously get even better performance by dumping the python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high performance programming with C++ vs. numpy, and what operation exactly you're trying to do.
Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can however do things in-place explicitly using the augmented assignment (*=, +=, etc.) operators and be assured that copies aren't made.
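For example, one way to avoid most of those temporaries (out= and the augmented assignments are standard NumPy; the actual savings depend on the array sizes):
import numpy as np
a, b, c, d, e = (np.random.rand(1_000_000) for _ in range(5))
# Naive version: temporaries are created for a*b, c/d and the intermediate sum.
x = a * b + c / d + e
# Buffer-reusing version: fewer allocations, same result.
x2 = a * b                   # one allocation that we keep reusing
tmp = np.empty_like(c)
np.divide(c, d, out=tmp)     # c / d written into a preallocated buffer
x2 += tmp                    # in-place adds avoid further temporaries
x2 += e
print(np.allclose(x, x2))    # True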
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.
I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.
Your sub-question: a = sin(x), how many roundtrips are there.
The trick is to pass a NumPy array to sin(x): then there is only one 'roundtrip' for the whole array, since NumPy returns an array of sine values. There is no Python for loop involved in this operation.
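Concretely (np.sin is the standard NumPy ufunc; timings are omitted since they vary by machine):
import math
import numpy as np
x = np.linspace(0, 2 * np.pi, 1_000_000)
a = np.sin(x)                                 # one call; the loop runs inside NumPy
a_slow = np.array([math.sin(v) for v in x])   # one interpreter roundtrip per element
print(np.allclose(a, a_slow))                 # True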