I'm trying to optimize the performance of some numpy code. The problem is, I'm just blindly replacing code with equivalent code, and profiling to compare the results.
Does numpy provide any resource describing the complexity/performance of its basic operations (either a theoretical analysis or some practical rules of thumb)? For example:
basic operations (e.g. sum, multiplication, exponentiation)
different data types (e.g. float vs complex)
vectorization/broadcasting as a function of the size of the matrix (e.g. vectorizing beyond 2 dimensions seems to make performance worse than an equivalent for-loop)
I was wondering if it is possible to write an iterative algorithm without a for loop, using as_strided together with some operation that edits the memory in place.
For example, say I want an algorithm that replaces each number in an array with the sum of its neighbors. I came up with this abomination (yes, it's summing an element with its 2 right neighbors, but it's just to get the idea):
import numpy as np

a = np.arange(10)
ops = 2

# 3D view: `ops` copies (stride 0 on the first axis) of every length-3 sliding window over a
a_view_window = np.lib.stride_tricks.as_strided(a, shape=(ops, a.size - 2, 3), strides=(0,) + 2 * a.strides)
# 2D view of the same memory, used as the output of the reduction
a_view = np.lib.stride_tricks.as_strided(a, shape=(ops, a.size - 2), strides=(0,) + a.strides)

np.add.reduce(a_view_window, axis=-1, out=a_view)
print(a)
So I am taking an array of 10 numbers and creating this strange view which increases dimensionality without changing the strides. My thinking is that the reduction will run over the fake new dimension and write over the previous values, so when it gets to the next major dimension it will have to read from the data it overwrote, and thus iteratively perform the addition.
Sadly this does not work :(
(yes, I know this is a terrible way to do things, but I am curious about how the underlying numpy machinery works and whether it can be hacked in this way)
This code results in undefined behavior prior to NumPy 1.13 and works out-of-place in newer versions, so as to avoid overlapping/aliasing issues. Indeed, you cannot assume NumPy iterates in a given order over the input/output array views. In fact, NumPy often uses SIMD instructions to speed up the code and sometimes tells compilers that views do not overlap/alias each other (using the restrict keyword) so they can generate much more efficient code. For more information you can read the docs on ufuncs (and this issue):
Operations where ufunc input and output operands have memory overlap produced undefined results in previous NumPy versions, due to data dependency issues. In NumPy 1.13.0, results from such operations are now defined to be the same as for equivalent operations where there is no memory overlap.
Operations affected now make temporary copies, as needed to eliminate data dependency. As detecting these cases is computationally expensive, a heuristic is used, which may in rare cases result to needless temporary copies. For operations where the data dependency is simple enough for the heuristic to analyze, temporary copies will not be made even if the arrays overlap, if it can be deduced copies are not necessary.
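To make the quoted behavior concrete, here is a minimal sketch (not the original stride trick, just an illustration): an overlapping ufunc call behaves as if the input were copied first, and a genuinely data-dependent in-place update needs an explicit loop.

import numpy as np

a = np.arange(10)
# Overlapping input/output: since NumPy 1.13 this acts as if a copy of the
# input were taken first, so there is no chained accumulation.
np.add(a[:-1], a[1:], out=a[:-1])
print(a)   # every element was computed from the original neighbor values

# To let each step read what earlier steps wrote, iterate explicitly
# (or use Numba/Cython for speed):
b = np.arange(10)
for i in reversed(range(b.size - 1)):
    b[i] += b[i + 1]   # reads the value written in the previous iteration
print(b)               # suffix sums: each element depends on already-updated later ones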
I'm trying to do some matrix computations as follows:
import numpy as np

def get_P(X, Z):
    n_sample, n_m, n_t, n_f = X.shape
    res = np.zeros((n_sample, n_m, n_t, n_t))
    for i in range(n_sample):
        # X[i] has shape (n_m, n_t, n_f), Z[i] has shape (n_f, n_t)
        res[i, :, :, :] = np.dot(X[i, :, :, :], Z[i, :, :])
    return res
Because X and Z are large, it takes more than 600 ms to compute one np.dot, and I have 10k rows in X.
Is there any way we can speed it up?
Well, there might be some avoidable overhead from your zero initialization (which gets overwritten right away): just allocate uninitialized memory with np.empty instead.
Other than that: numpy is fairly well-optimized. You can probably speed things up by using dtype=numpy.float32 instead of the default 64-bit floating point numbers for your X, Z and res, but that's about it. Dot products mostly spend their time streaming linearly through RAM while multiplying and summing numbers, things that numpy, compilers and CPUs are remarkably good at these days.
Note that, depending on the BLAS backend, numpy may only use one CPU core here, so it can make sense to parallelize: for example, if you have 16 CPU cores, you could split range(n_sample) into 16 partitions of res and compute each subset of dot products on its own core. Python brings the multithreading/async facilities to do so (a minimal sketch follows below), and you'll find plenty of examples.
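A minimal sketch of that idea, assuming np.dot releases the GIL to the BLAS library so that plain threads can run the per-sample products concurrently (the shapes here are made up for illustration):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

n_sample, n_m, n_t, n_f = 64, 8, 100, 100
X = np.random.rand(n_sample, n_m, n_t, n_f)
Z = np.random.rand(n_sample, n_f, n_t)
res = np.empty((n_sample, n_m, n_t, n_t))

def work(i):
    # each task fills its own slice of res, so no locking is needed
    res[i] = np.dot(X[i], Z[i])

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(work, range(n_sample)))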
If you can spend the development time, and you handle massive enough amounts of data that it pays off, you can use e.g. GPUs to multiply matrices; these are really good at that, and cuBLASLt (GEMM) is an excellent implementation. But honestly, you'd mostly be abandoning numpy and would need to work things out yourself in C/C++.
You can use numpy's einsum to do this multiplication in one vectorized step.
It will be much faster than the loop-based dot product. For examples, check this link: https://rockt.github.io/2018/04/30/einsum
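For the shapes in the question (assuming X is (n_sample, n_m, n_t, n_f) and Z is (n_sample, n_f, n_t), so that np.dot(X[i], Z[i]) yields (n_m, n_t, n_t)), a hedged sketch of the vectorized version could look like this; np.matmul gives an equivalent batched product:

import numpy as np

n_sample, n_m, n_t, n_f = 16, 8, 32, 32          # made-up sizes for illustration
X = np.random.rand(n_sample, n_m, n_t, n_f)
Z = np.random.rand(n_sample, n_f, n_t)

# one vectorized step instead of the Python loop over n_sample
res_einsum = np.einsum('smtf,sfu->smtu', X, Z)

# np.matmul broadcasts the batch dimensions, so this is equivalent
res_matmul = X @ Z[:, None, :, :]

assert np.allclose(res_einsum, res_matmul)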
Why is numba not speeding up the following piece of code?
import numpy as np
from numba import jit

@jit(nopython=True)
def sort(x):
    for i in range(1000):
        np.sort(x)
I thought numba was made for these sorts of tasks, where you have for loops combined with numpy operations. Yet this jitted function is 2-3x slower than the pure Python variant (i.e. the same function but without the jit), and yes I have run it after it was compiled.
Am I doing something wrong?
EDIT:
x has len = 5000 and dtype int32 or float64 (I tried both).
The Numba implementation is not meant to be faster for relatively big arrays (e.g. > 1024 items). Indeed, both Numba and Numpy use a compiled sorting algorithm (except that Numba compiles it just-in-time). Numba can only be better here for small arrays, because it can mostly remove the overhead of calling a Numpy function from the CPython interpreter (and of performing many input checks). For an array of size 5000, the running time is dominated by the sorting calls and not by the overhead of the loop (see below).
Besides this, the two appear to use slightly different algorithm implementations (at least not the same thresholds). As a result, the two implementations perform differently, and this depends on the input array: some sorting algorithms are fast on one specific kind of distribution where other sorting algorithms are slow, and vice versa for other kinds of distribution.
Here is the running time of the two implementations plotted against the array size, measured on random arrays on my machine (with 32-bit integers from 0 to 1,000,000,000):
One can see that Numba is faster for small arrays and slower for big ones. When len=5000, the Numba implementation is 50% slower.
Note that you can tune the algorithm used via the kind parameter (see the sketch below). Note also that some optimized Numpy primitives use parallelism so they can run faster; in that case the comparison with the Numba implementation is not fair, since Numba uses a sequential implementation (especially when parallel=True is not set). Besides this, this appears to be a well-known issue and developers are working on it.
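For reference, a small sketch of the kind parameter mentioned above (the array mirrors the benchmark setup: 5000 random 32-bit integers):

import numpy as np

x = np.random.randint(0, 1_000_000_000, size=5000, dtype=np.int32)

a = np.sort(x)                  # default 'quicksort' (an introsort variant)
b = np.sort(x, kind='stable')   # stable sort (radix/timsort based)
assert np.array_equal(a, b)     # same values, possibly different running time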
I wouldn't expect any performance benefit either. Numba isn't a magic wand: you don't magically get better performance just by adding it, and it has an overhead that can easily sneak up on you. It helps to understand what exactly numba does: it parses the AST of a Python function and compiles it to native code using LLVM, and for a lot of non-trivial cases this makes a huge difference, because, honestly, Python sucks at complex math and branching (a reasonable drawback of its design choices). Now take a look at your code: it is a numpy sort function inside a for loop. Think logically about what optimisation numba could possibly make to speed this up. Remember that numpy is already damn fast and numba can't really affect that performance. So you have essentially added overhead to the most critical part of your code, and hence the loss in performance.
As I understand it, tensordot
http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.tensordot.html
multiplies two tensors and sums over the given indices. This is precisely what einsum does.
So what is actually the difference between these two functions? Is it a matter of performance?
They are different approaches to similar problems. einsum is more general. Speed can be similar, though you need to check individual cases.
tensordot works by reshaping and transposing axes, reducing the problem to one that np.dot can solve. Its code, up to the dot call, is Python, so you can read it for yourself.
einsum is built from the ground up to work with the 'Einstein notation' used in physics (it was written by a scientist to fit his needs and usage). The documentation covers that. It is C code, so it is a little harder to study. Basically it parses the indexing string and builds an nditer object that iterates over the input arrays, performing some sort of sum-of-products calculation. It can take shortcuts in cases where you just want indexing, a diagonal, etc.
There have been a number of questions asking about either of these functions, or suggesting their use in the answers.
In newer versions there is also np.matmul, which generalizes dot in a different way. It is linked to the new @ operator in Python 3.5.
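A small illustration of that relationship (arbitrary shapes, just to show the equivalence):

import numpy as np

A = np.random.rand(3, 4, 5)
B = np.random.rand(4, 5, 6)

# contract axes 1 and 2 of A with axes 0 and 1 of B -> shape (3, 6)
r1 = np.tensordot(A, B, axes=([1, 2], [0, 1]))
# the same contraction written in Einstein notation
r2 = np.einsum('ijk,jkl->il', A, B)
assert np.allclose(r1, r2)

# np.matmul / the @ operator covers the plain (and batched) matrix product case
M, N = np.random.rand(3, 4), np.random.rand(4, 5)
assert np.allclose(M @ N, np.einsum('ij,jk->ik', M, N))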
Given the thread here
It seems that numpy is not ideal for ultra-fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?
Well, it depends on what you want to do. XOR, for instance, is hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from Python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization), performance will usually be quite good.
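A minimal sketch of what moving the loop into numpy looks like in practice (the dot product here is just an arbitrary example):

import numpy as np

x = np.random.rand(1_000_000)

# slow: the Python interpreter executes the loop body once per element
total = 0.0
for v in x:
    total += v * v

# fast: one interpreter call, the loop runs in compiled code
total_vec = np.dot(x, x)

assert np.isclose(total, total_vec)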
Of course, you can obviously get even better performance by dumping the python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high performance programming with C++ vs. numpy, and what operation exactly you're trying to do.
Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can however do things in-place explicitly using the augmented assignment (*=, +=, etc.) operators and be assured that copies aren't made.
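A hedged sketch of that point, using arbitrary arrays (the in-place variant still needs a temporary for c / d, but it avoids the ones for the sums and the final result):

import numpy as np

a, b, c, d, e = (np.random.rand(1_000_000) for _ in range(5))

# allocates temporaries for a * b, c / d and the intermediate sums
x = a * b + c / d + e

# reuses one output buffer via augmented assignment
y = a * b      # one allocation for the result
y += c / d     # c / d is still a temporary, but the addition is in place
y += e

assert np.allclose(x, y)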
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.
I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.
Your sub-question: for a = sin(x), how many roundtrips are there?
The trick is to pass a numpy array to sin(x); then there is only one 'roundtrip' for the whole array, since numpy returns an array of sin values. There is no Python for loop involved in this operation.
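A short sketch of that difference (the names here are just for illustration):

import math
import numpy as np

x = np.linspace(0, 2 * np.pi, 1_000_000)

# one 'roundtrip': numpy loops over the whole array in C and returns an array
y = np.sin(x)

# one interpreter roundtrip per element, the slow way
y_slow = np.array([math.sin(v) for v in x])

assert np.allclose(y, y_slow)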