Why numpy is 'slow' by itself?

Why numpy is 'slow' by itself? - python

Given the thread here
It seems that numpy is not the most ideal for ultra fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?

Well, depends on what you want to do. XOR is, for instance, hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization) performance will usually be quite good.
Of course, you can obviously get even better performance by dumping the python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high performance programming with C++ vs. numpy, and what operation exactly you're trying to do.

Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can however do things in-place explicitly using the augmented assignment (*=, +=, etc.) operators and be assured that copies aren't made.
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.

I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.

Your sub-question: a = sin(x), how many roundtrips are there.
The trick is to pass a numpy array to sin(x), then there is only one 'roundtrip' for the whole array, since numpy will return an array of sin-values. There is no python for loop involved in this operation.

Related

Speed up N-d array dot

I'm trying to do some matrix computations as
def get_P(X,Z):
n_sample,n_m,n_t,n_f = X.shape
res = np.zeros((n_sample,n_m,n_t,n_t))
for i in range(n_sample):
res[i,:,:,:] = np.dot(X[i,:,:,:],Z[i,:,:])
return res
Because the size of X and Z is large, it takes more than 600ms to compute one np.dot, and I have 10k rows in X.
Is there anyway we can speed it up?

Well, there might be some avoidable overhead posed by your zero initialization (which gets overwritten right away): Just use np.ndarray instead.
Other than that: numpy is fairly well-optimized. Probably you can speed things up if you used dtype=numpy.float32 instead of the default 64-bit floating point numbers for your X, Z and res – but that's about it. Dot products are mostly spending time going linear through RAM and multiplying and summing numbers – things that numpy, compilers and CPUs are radically good at these days.
Note that numpy will only use one CPU core at a time in its default configuration - it might make sense to parallelize; for example, if you've got 16 CPU cores, you'd make 16 separate res partitions and calculate subsets of your range(n_sample) dot products on each core; python does bring the multithreading / async facilities to do so – you'll find plenty of examples, and explaining how would lead too far.
If you can spend the development time, and need massive amounts of data, so that this pays: you can use e.g. GPUs to multiply matrices; these are really good at that, and cuBLASlt (GEMM) is an excellent implementation, but honestly, you'd mostly be abandoning Numpy and would need to work things out yourself – in C/C++.

You can use numpy einsum to do this multiplication in one vectorized step.
It will be much faster than this loop based dot product. For examples, check this link https://rockt.github.io/2018/04/30/einsum

Why is numba not speeding up the following piece of code?

Why is numba not speeding up the following piece of code?
#jit(nopython=True)
def sort(x):
for i in range(1000):
np.sort(x)
I thought numba was made for these sorts of tasks, where you have for loops combined with numpy operations. Yet this jitted function is 2-3x slower than the pure Python variant (i.e. the same function but without the jit), and yes I have run it after it was compiled.
Am I doing something wrong?
EDIT:
Size of x and data-type is dtype = int32 AND float64 (I tried both), len = 5000.

The performance of the Numba implementation is not mean to be faster with relatively big array (eg. > 1024). Indeed, both Numba and Numpy use a compiled sorting algorithm as Numba does (except Numba use a JIT). Numba an only be better here for small arrays because it can mostly remove the overhead of calling a Numpy function from the CPython interpreter (and performing many input checks). The running time is dominated by the time of the sorting calls and not the overhead of the loop for an array of size=5000 (see below).
Besides this, both implementation appear to use slightly different algorithm implementations (at least not the same thresholds). As a result, the two implementations results in different performance. This is dependent of the input array. Some sorting algorithm are fast on some specific kind of distribution where some other sorting algorithm are slow and vice versa for other kind of distribution.
Here is the runtime execution of the two implementation plotted against the array size tested on random arrays on my machine (with 32-bit integers from 0 to 1,000,000,000):
One can see that Numba is faster for small arrays and faster for big ones. When len=5000, the Numba implementation is 50% slower.
Note that you can tune the algorithm used using the parameter kind. Note also that some Numpy optimized implementations use parallelism so that primitives can run faster. In that case, the comparison with the Numba implementation is not fair as Numba should use a sequential implementation (especially if parallel=True is not set). Besides this, this problem appear to be a well known issue and developers are working on it.

I wouldn't expect any performance benefit either. Numba isn't a magic wand that if you just add it you magically get better performance. It does have an overhead that can easily sneak up on you. It helps to understand what exactly numba does. It parses the ast of a python function and compiles it to native code using llvm and for a lot of non-trivial cases, this makes a huge difference because honestly, python sucks at complex math and branching. That is a reasonable drawback for its design choices. Take a look at your code though. It is a numpy sort function inside a for loop. Think logically what optimisation could numba possibly make that could speed this up. Remember that numpy is already damn fast and numba cant really affect that performance. So you have essentially added overhead to the most critical part of your code and hence the loss in performance.

What are the benefits / drawbacks of a list of lists compared to a numpy array of OBJECTS with regards to SPEED?

This is a follow up to this question
What are the benefits / drawbacks of a list of lists compared to a numpy array of OBJECTS with regards to MEMORY?
I'm interested in understanding the speed implications of using a numpy array vs a list of lists when the array is of type object.
If anyone is interested in the object I'm using:
import gmpy2 as gm
gm.mpfr('0') # <-- this is the object

The biggest usual benefits of numpy, as far as speed goes, come from being able to vectorize operations, which means you replace a Python loop around a Python function call with a C loop around some inlined C (or even custom SIMD assembly) code. There are probably no built-in vectorized operations for arrays of mpfr objects, so that main benefit vanishes.
However, there are some place you'll still benefit:
Some operations that would require a copy in pure Python are essentially free in numpy—transposing a 2D array, slicing a column or a row, even reshaping the dimensions are all done by wrapping a pointer to the same underlying data with different striding information. Since your initial question specifically asked about A.T, yes, this is essentially free.
Many operations can be performed in-place more easily in numpy than in Python, which can save you some more copies.
Even when a copy is needed, it's faster to bulk-copy a big array of memory and then refcount all of the objects than to iterate through nested lists deep-copying them all the way down.
It's a lot easier to write your own custom Cython code to vectorize an arbitrary operation with numpy than with Python.
You can still get some benefit from using np.vectorize around a normal Python function, pretty much on the same order as the benefit you get from a list comprehension over a for statement.
Within certain size ranges, if you're careful to use the appropriate striding, numpy can allow you to optimize cache locality (or VM swapping, at larger sizes) relatively easily, while there's really no way to do that at all with lists of lists. This is much less of a win when you're dealing with an array of pointers to objects that could be scattered all over memory than when dealing with values that can be embedded directly in the array, but it's still something.
As for disadvantages… well, one obvious one is that using numpy restricts you to CPython or sometimes PyPy (hopefully in the future that "sometimes" will become "almost always", but it's not quite there as of 2014); if your code would run faster in Jython or IronPython or non-NumPyPy PyPy, that could be a good reason to stick with lists.

speeding up numpy.dot inside list comprehension

I have a numpy script that is currently running quite slowly.
spends the vast majority of it's time performing the following operation inside a loop:
terms=zip(Coeff_3,Coeff_2,Curl_x,Curl_y,Curl_z,Ex,Ey,Ez_av)
res=[np.dot(C2,array([C_x,C_y,C_z]))+np.dot(C3,array([ex,ey,ez])) for (C3,C2,C_x,C_y,C_z,ex,ey,ez) in terms]
res=array(res)
Ex[1:Nx-1]=res[1:Nx-1,0]
Ey[1:Nx-1]=res[1:Nx-1,1]
It's the list comprehension that is really slowing this code down.
In this case, Coeff_3, and Coeff_2 are length 1000 lists whose elements are 3x3 numpy matricies, and Ex,Ey,Ez, Curl_x, etc are all length 1000 numpy arrays.
I realize it might be faster if i did things like setting a single 3x1000 E vector, but i have to perform a significant amount of averaging of different E vectors between step, which would make things very unwieldy.
Curiously however, i perform this operation twice per loop (once for Ex,Ey, once for Ez), and performing the same operation for the Ez's takes almost twice as long:
terms2=zip(Coeff_3,Coeff_2,Curl_x,Curl_y,Curl_z,Ex_av,Ey_av,Ez)
res2=array([np.dot(C2,array([C_x,C_y,C_z]))+np.dot(C3,array([ex,ey,ez])) for (C3,C2,C_x,C_y,C_z,ex,ey,ez) in terms2])
Anyone have any idea what's happening? Forgive me if it's anything obvious, i'm very new to python.

As pointed out in previous comments, use array operations. np.hstack(), np.vstack(), np.outer() and np.inner() are useful here. You're code could become something like this (not sure about your dimensions):
Cxyz = np.vstack((Curl_x,Curl_y,Curl_z))
C2xyz = np.dot(C2, Cxyz)
...
Check the shape of your resulting dimensions, to make sure you translated your problem right. Sometimes numexpr can also to speed up such tasks significantly with little extra effort,

Matlab equivalent of Numpy broadcasting?

I'm trying to find some way to substract a size 3 vector from each column of a 3*(a big number) matrix in Matlab. Of course I could use a loop, but I'm trying to find some more efficient solution, a bit like numpy broadcasting. Oh, and I can't use repmat because I just don't have enough memory to use it (as it creates yet another 3*(a big number) matrix)...
Is this possible?

The other answers are a bit out of date -- Matlab R2016b appears to have added broadcasting as a standard feature. An example from that blog post that matches the question:
>> A = ones(2) + [1 5]'
A =
2 2
6 6

Loops aren't bad in MATLAB anymore thanks to compiler optimizations like just-in-time acceleration (JITA). etc. Most of the time, I've noticed that a solution with loops in current MATLAB versions is much faster than complicated (albeit, cool :D) one-liners.
bsxfun might do the trick but in my experience, it tends to have memory issues as well but less so than repmat.
So the syntax would be:
AA = bsxfun(#minus,A,b) where b is the vector and A is your big matrix
But I urge you to profile the loopy version and then decide! Most probably, due to memory constraints, you might not have a choice :)

I don't know if this will speed up the code, but subtraction of a scalar from a vector doesn't have memory issues. Since your matrix size is so asymmetrical, the overhead from a for-loop on the short dimension is negligible.
So maybe
matout = matin;
for j = 1:size(matin, 1) %3 in this case
matout(j,:) = matin(j,:) - vec_to_subtract(j);
end
of course, you could do this in place, but I didn't know if you wanted to preserve the original matrix.

Actually, it seems that http://www.frontiernet.net/~dmschwarz/genops.html (operator overloading with mex files) does the trick too, even though I haven't tested it yet.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.