As I understand it, tensordot
http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.tensordot.html
multiplies two tensors and sums over the given indices. This is precisely what einsum does.
So what is the difference between these two functions? Is it just a matter of performance?
They are different approaches to similar problems. einsum is more general. Speed can be similar, though you need to check individual cases.
tensordot works by reshaping and transposing axes, reducing the problem to one that np.dot can solve. Its code, up to the dot call is Python, so you can read it for yourself.
einsum is built from the ground up to work with the Einstein notation used in physics (it was written by a scientist to fit his own needs and usage). The documentation covers that. It is C code, so it is a little harder to study. Basically, it parses the indexing string and builds an nditer object that iterates over the input arrays, performing some sort of sum-of-products calculation. It can take shortcuts in cases where you just want indexing, a diagonal, etc.
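To make the comparison concrete, here is a small sketch (the array shapes and names are made up for illustration) showing the same contraction written three ways: with tensordot, with the reshape-then-dot reduction that tensordot is described as performing, and with einsum index notation.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((3, 4, 5))   # illustrative shapes only
b = rng.standard_normal((4, 5, 6))

# Contract axes 1 and 2 of `a` with axes 0 and 1 of `b`.
via_tensordot = np.tensordot(a, b, axes=([1, 2], [0, 1]))

# The same reduction by hand: flatten the contracted axes, then one np.dot.
via_dot = np.dot(a.reshape(3, 4 * 5), b.reshape(4 * 5, 6))

# The same contraction spelled out in index notation.
via_einsum = np.einsum('ijk,jkl->il', a, b)
```

All three results have shape (3, 6) and agree up to floating-point rounding; which one is fastest is something to time for your own sizes.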
There have been a number of questions asking about either of these functions, or suggesting their use in answers.
Newer versions also have np.matmul, which generalizes dot in a different way. It is linked to the new @ operator in Python 3.5.
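A quick way to see how matmul and dot differ for arrays with more than two dimensions is to compare output shapes. This is only an illustration with made-up shapes:

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))

# matmul (and the @ operator) treat the leading axis as a batch:
# a[0] is multiplied with b[0], a[1] with b[1].
batched = a @ b

# dot pairs every matrix in `a` with every matrix in `b`,
# so the result carries both leading axes.
all_pairs = np.dot(a, b)

print(batched.shape, all_pairs.shape)
```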
Related
I'm trying to optimize the performance of some numpy code. The problem is, I'm just blindly replacing code with equivalent code, and profiling to compare the results.
Does numpy provide any resource describing the complexity/performance of its basic operations (either a theoretical analysis, or some practical rules of thumb), for example:
basic operations (eg: sum, multiplication, exponentiation)
different data types (eg: float vs complex)
vectorization/broadcasting as function of the size of the matrix (eg: vectorizing beyond 2 dimensions seems to make performance worse than equivalent for-loop)
I'm trying to do some matrix computations as
def get_P(X,Z):
    n_sample,n_m,n_t,n_f = X.shape
    res = np.zeros((n_sample,n_m,n_t,n_t))
    for i in range(n_sample):
        res[i,:,:,:] = np.dot(X[i,:,:,:],Z[i,:,:])
    return res
Because X and Z are large, it takes more than 600 ms to compute one np.dot, and I have 10k rows in X.
Is there any way we can speed it up?
Well, there might be some avoidable overhead from your zero initialization (which gets overwritten right away): just use np.empty (or the lower-level np.ndarray constructor) instead.
Other than that: numpy is fairly well optimized. You can probably speed things up by using dtype=numpy.float32 instead of the default 64-bit floating point numbers for your X, Z and res, but that's about it. Dot products mostly spend their time going linearly through RAM and multiplying and summing numbers, which are things that numpy, compilers and CPUs are extremely good at these days.
Note that, depending on how it was built, numpy may use only one CPU core at a time, so it might make sense to parallelize. For example, if you've got 16 CPU cores, you'd make 16 separate res partitions and calculate subsets of your range(n_sample) dot products on each core. Python does provide the multithreading / async facilities to do so; you'll find plenty of examples, and explaining how in detail would lead too far.
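For orientation only, the partitioning idea could look roughly like the sketch below. The function name get_P_threaded and the n_workers parameter are made up here, and the sketch assumes that np.dot releases the GIL for these inputs so that threads can actually overlap. Whether it helps at all depends on your numpy build and should be measured.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def get_P_threaded(X, Z, n_workers=4):
    """Hypothetical threaded variant of get_P; same result, split by sample."""
    n_sample, n_m, n_t, _ = X.shape
    res = np.empty((n_sample, n_m, n_t, Z.shape[2]))

    def work(indices):
        # Each worker writes to its own disjoint slice of `res`.
        for i in indices:
            res[i] = np.dot(X[i], Z[i])

    # One chunk of sample indices per worker.
    chunks = np.array_split(np.arange(n_sample), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(work, chunks))  # list() forces completion and surfaces errors
    return res
```

If numpy is already linked against a multithreaded BLAS, adding threads on top of it can make things slower rather than faster.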
If you can spend the development time, and have massive amounts of data so that this pays off, you can use e.g. GPUs to multiply matrices; these are really good at that, and cuBLASLt (GEMM) is an excellent implementation. But honestly, you'd mostly be abandoning numpy and would need to work things out yourself, in C/C++.
You can use numpy's einsum to do this multiplication in one vectorized step.
It can be much faster than this loop-based dot product, though it is worth timing both for your sizes. For examples, check this link: https://rockt.github.io/2018/04/30/einsum
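As a sketch of what that could look like for the get_P function in the question (the function names below are made up, and Z is assumed to have shape (n_sample, n_f, n_t), as the original loop implies):

```python
import numpy as np

def get_P_einsum(X, Z):
    # i: sample, m: n_m, t: n_t, f: n_f (summed over), s: last axis of Z
    return np.einsum('imtf,ifs->imts', X, Z)

def get_P_matmul(X, Z):
    # Alternative: add a length-1 axis so Z[i] broadcasts over the n_m axis,
    # then let matmul treat the two leading axes as a batch.
    return np.matmul(X, Z[:, None, :, :])
```

Both return the same array as the loop version. Which of the three is fastest depends on the array sizes and the BLAS library in use, so it is worth profiling them side by side.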
I have a numpy script that is currently running quite slowly.
It spends the vast majority of its time performing the following operation inside a loop:
terms=zip(Coeff_3,Coeff_2,Curl_x,Curl_y,Curl_z,Ex,Ey,Ez_av)
res=[np.dot(C2,array([C_x,C_y,C_z]))+np.dot(C3,array([ex,ey,ez])) for (C3,C2,C_x,C_y,C_z,ex,ey,ez) in terms]
res=array(res)
Ex[1:Nx-1]=res[1:Nx-1,0]
Ey[1:Nx-1]=res[1:Nx-1,1]
It's the list comprehension that is really slowing this code down.
In this case, Coeff_3 and Coeff_2 are length-1000 lists whose elements are 3x3 numpy matrices, and Ex, Ey, Ez, Curl_x, etc. are all length-1000 numpy arrays.
I realize it might be faster if I did things like setting up a single 3x1000 E vector, but I have to perform a significant amount of averaging of different E vectors between steps, which would make things very unwieldy.
Curiously, however, I perform this operation twice per loop (once for Ex and Ey, once for Ez), and performing the same operation for the Ez's takes almost twice as long:
terms2=zip(Coeff_3,Coeff_2,Curl_x,Curl_y,Curl_z,Ex_av,Ey_av,Ez)
res2=array([np.dot(C2,array([C_x,C_y,C_z]))+np.dot(C3,array([ex,ey,ez])) for (C3,C2,C_x,C_y,C_z,ex,ey,ez) in terms2])
Does anyone have any idea what's happening? Forgive me if it's anything obvious, I'm very new to Python.
As pointed out in previous comments, use array operations. np.hstack(), np.vstack(), np.outer() and np.inner() are useful here. Your code could become something like this (not sure about your dimensions):
Cxyz = np.vstack((Curl_x,Curl_y,Curl_z))
C2xyz = np.dot(C2, Cxyz)
...
Check the shapes of your resulting arrays to make sure you translated your problem correctly. Sometimes numexpr can also speed up such tasks significantly with little extra effort.
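One possible way to express the whole list comprehension as array operations is sketched below. It assumes the coefficient lists can be stacked into arrays of shape (1000, 3, 3) and the field components into arrays of shape (1000, 3). The stacked names Curl and E, and the random stand-in data, are made up for illustration.

```python
import numpy as np

n = 1000
rng = np.random.default_rng(0)

# Stand-ins for the question's data; np.array(list_of_3x3) gives shape (n, 3, 3).
Coeff_2 = rng.standard_normal((n, 3, 3))
Coeff_3 = rng.standard_normal((n, 3, 3))
Curl = rng.standard_normal((n, 3))   # columns: Curl_x, Curl_y, Curl_z
E = rng.standard_normal((n, 3))      # columns: Ex, Ey, Ez_av

# res[k] = Coeff_2[k] . Curl[k] + Coeff_3[k] . E[k], for all k at once.
res = (np.einsum('kij,kj->ki', Coeff_2, Curl)
       + np.einsum('kij,kj->ki', Coeff_3, E))
```

With the real data, something like np.column_stack((Curl_x, Curl_y, Curl_z)) would build the (1000, 3) array from the separate components.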
I'm trying to find some way to subtract a size 3 vector from each column of a 3*(a big number) matrix in Matlab. Of course I could use a loop, but I'm trying to find a more efficient solution, a bit like numpy broadcasting. Oh, and I can't use repmat because I just don't have enough memory for it (as it creates yet another 3*(a big number) matrix)...
Is this possible?
The other answers are a bit out of date -- Matlab R2016b appears to have added broadcasting as a standard feature. An example from that blog post that matches the question:
>> A = ones(2) + [1 5]'
A =
2 2
6 6
Loops aren't bad in MATLAB anymore, thanks to compiler optimizations like just-in-time acceleration (JITA). Most of the time, I've noticed that a solution with loops in current MATLAB versions is much faster than complicated (albeit cool :D) one-liners.
bsxfun might do the trick, but in my experience it tends to have memory issues as well, though less so than repmat.
So the syntax would be:
AA = bsxfun(@minus,A,b) where b is the vector and A is your big matrix
But I urge you to profile the loopy version and then decide! Most probably, due to memory constraints, you might not have a choice :)
I don't know if this will speed up the code, but subtraction of a scalar from a vector doesn't have memory issues. Since your matrix size is so asymmetrical, the overhead from a for-loop on the short dimension is negligible.
So maybe
matout = matin;
for j = 1:size(matin, 1) %3 in this case
    matout(j,:) = matin(j,:) - vec_to_subtract(j);
end
Of course, you could do this in place, but I didn't know whether you wanted to preserve the original matrix.
Actually, it seems that http://www.frontiernet.net/~dmschwarz/genops.html (operator overloading with mex files) does the trick too, even though I haven't tested it yet.
Given the thread here
It seems that numpy is not ideal for ultra-fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?
Well, depends on what you want to do. XOR is, for instance, hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization) performance will usually be quite good.
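As a toy illustration of moving the loop out of Python (the array and its size are arbitrary), compare a sum of squares written both ways:

```python
import numpy as np

x = np.arange(10_000, dtype=np.float64)

# Python-level loop: the interpreter handles every single element.
total_loop = 0.0
for v in x:
    total_loop += v * v

# Vectorized: one call, and the loop runs in compiled code inside numpy/BLAS.
total_vec = np.dot(x, x)
```

Both give the same number; timing them (for example with timeit) shows how much of the loop version's cost is interpreter overhead.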
Of course, you can get even better performance by dumping the Python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high-performance programming with C++ vs. numpy, and on exactly what operation you're trying to do.
Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can however do things in-place explicitly using the augmented assignment (*=, +=, etc.) operators and be assured that copies aren't made.
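A minimal sketch of the explicit in-place style, using made-up input arrays, might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c, d, e = (rng.random(1000) + 1.0 for _ in range(5))

# Straightforward form: intermediate results live in temporary arrays.
x_simple = a * b + c / d + e

# Explicit form: allocate the result once, use one scratch array,
# and accumulate with augmented assignment.
x = np.multiply(a, b)     # result buffer
tmp = np.divide(c, d)     # single scratch array
x += tmp                  # in place, no new array
x += e                    # in place, no new array
```

The scratch array could also be reused across iterations of an outer loop by passing out=tmp to np.divide, which avoids reallocating it each time.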
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.
I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.
Your sub-question: for a = sin(x), how many round trips are there?
The trick is to pass a numpy array to sin(x); then there is only one 'round trip' for the whole array, since numpy returns an array of sine values. No Python for loop is involved in this operation.
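A small sketch of the difference, with an arbitrary input array:

```python
import math
import numpy as np

x = np.linspace(0.0, np.pi, 5)

# One call into numpy for the whole array; the loop happens in compiled code.
y = np.sin(x)

# Element-by-element equivalent: one interpreter round trip per value.
y_loop = np.array([math.sin(v) for v in x])
```

Both produce the same values; only the first avoids the per-element Python overhead.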