Cython loop over array of indexes - python

I would like to do a series of operations on particular elements of matrices. I need to define the indices of these elements in an external object (self.indices in the example below).
Here is a contrived example of an implementation in Cython:
%%cython -f -c=-O2 -I./
import numpy as np

cimport numpy as np
cimport cython


cdef class Test:
    cdef double[:, ::1] a, b
    cdef Py_ssize_t[:, ::1] indices

    def __cinit__(self, a, b, indices):
        self.a = a
        self.b = b
        self.indices = indices

    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run1(self):
        """ Use of external structure of indices. """
        cdef Py_ssize_t idx, ix, iy
        cdef int n = self.indices.shape[0]
        for idx in range(n):
            ix = self.indices[idx, 0]
            iy = self.indices[idx, 1]
            self.b[ix, iy] = ix * iy * self.a[ix, iy]

    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run2(self):
        """ Direct formulation """
        cdef Py_ssize_t idx, ix, iy
        cdef int nx = self.a.shape[0]
        cdef int ny = self.a.shape[1]
        for ix in range(nx):
            for iy in range(ny):
                self.b[ix, iy] = ix * iy * self.a[ix, iy]
with this on the Python side:
import itertools
import numpy as np
N = 256
a = np.random.rand(N, N)
b = np.zeros_like(a)
indices = np.array([[i, j] for i, j in itertools.product(range(N), range(N))], dtype=int)
test = Test(a, b, indices)
and the results:
%timeit test.run1()
75.6 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit test.run2()
41.4 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Why does the Test.run1() method run much slower than the Test.run2() method?
What are the possibilities to keep a similar level of performance as with Test.run2() by using an external list, array, or any other kind of structure of indices?

Because run1 is significantly more complicated...
run1 is having to read from two separate bits of memory which almost certainly makes the CPU cache less efficient.
It's fairly trivial for the compiler to work out exactly what order it's accessing the array elements in run2. In contrast in run1 it could be accessing them in any order. That likely allows for significant optimizations.
Your current performance is probably as good as it gets.

In addition to the good answer from @DavidW, note that run2 is SIMD-friendly as opposed to run1. This means a compiler can easily generate SIMD instructions for run2 so as to read multiple packed items from memory, multiply several items at once thanks to packed SIMD instructions, and write the packed items back to memory. If the array is small enough to fit in the CPU caches, which is the case here, the SIMD computation can be very fast. Indeed, nearly all modern x86-64 processors support the 256-bit wide AVX/AVX-2 instruction set, which can operate on 8 32-bit integers or 4 double-precision floating-point numbers at once. Additionally, such a code can easily be unrolled and is well pipelined by modern processors. Hardware prefetchers are also optimized for this kind of use-case.
Meanwhile, run1 does indirect memory accesses. Compilers can hardly assume the indices are actually contiguous and generate packed loads/stores (this is very unlikely in most codes, and it is up to developers to write this kind of optimization). The indirection requires multiple load instructions that saturate the load ports and make the overall computation at least twice slower. AVX-2 has a gather instruction that can theoretically help in such a case. That being said, the instruction is currently not efficiently implemented on current Intel/AMD processors (it basically does scalar loads internally, saturating the load ports). Still, it should certainly make run1 run as fast as run2 if the latter is not vectorized (otherwise run2 should sharply outperform run1 even with gather instructions). Compilers unfortunately still have a hard time using such instructions.
In fact, regarding the code and the timings, run2 should be even faster if SIMD instructions were used. I think this is not the case, certainly because the -O2 optimization level is currently set in your code and compilers like GCC do not yet automatically vectorize the code at that level (except possibly the very latest versions, AFAIK). Please consider using -O3. Please also consider enabling -mavx and -mavx2 if possible (this assumes the target processors are not too old), as it should make the code faster.
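For reference, here is a sketch of how those flags could be passed through the %%cython cell magic used in the question (flag spellings assume GCC/Clang; MSVC would need different switches):
%%cython -f -c=-O3 -c=-mavx -c=-mavx2 -I./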

Related

Is it possible to improve python performance for this code?

I have a simple code that:
Reads a trajectory file that can be seen as a list of 2D arrays (lists of positions in space), stored in Y
I then want to compute the RMSD for each pair (scipy.pdist style)
My code works fine:
import numpy as np

# read() is the asker's trajectory reader (e.g. ase.io.read)
trajectory = read("test.lammpstrj", index="::")
m = len(trajectory)
# .get_positions() returns a 2D numpy array
Y = np.array([snapshot.get_positions() for snapshot in trajectory])
b = [np.sqrt(((((Y[i] - Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]
This code executes in 0.86 seconds using Python 3.10; the same kind of code in Julia 1.8 executes in 0.46 seconds.
I plan to have much larger trajectories (~200,000 elements). Would it be possible to get a speed-up using Python, or should I stick to Julia?
You've mentioned that snapshot.get_positions() returns some 2D array, suppose of shape (p, q). So I expect that Y is a 3D array with some shape (m, p, q), where m is the number of snapshots in the trajectory. You also expect m to scale rather high.
Let's see a basic way to speed up the distance calculation, on the setting m=1000:
import numpy as np

# dummy inputs
m = 1000
p, q = 4, 5
Y = np.random.randn(m, p, q)

# your current method
def foo():
    return [np.sqrt(((((Y[i] - Y[j])**2))*3).mean()) for i in range(m) for j in range(i + 1, m)]

# vectorized approach -> compute the upper triangle of the pairwise distance matrix
def bar():
    u, v = np.triu_indices(Y.shape[0], 1)
    return np.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# Check for correctness
out_1 = foo()
out_2 = bar()
print(np.allclose(out_1, out_2))
# True
If we test the time required:
%timeit -n 10 -r 3 foo()
# 3.16 s ± 50.3 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
The first method is really slow, it takes over 3 seconds for this calculation. Let's check the second method:
%timeit -n 10 -r 3 bar()
# 97.5 ms ± 405 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So we have a ~30x speedup here, which would make your large calculation in python much more feasible than using the original code. Feel free to test out with other sizes of Y to see how it scales compared to the original.
JIT
In addition, you can also try out JIT, mainly jax or numba. It is fairly simple to port the function bar with jax.numpy, for example:
import jax
import jax.numpy as jnp

@jax.jit
def jit_bar(Y):
    u, v = jnp.triu_indices(Y.shape[0], 1)
    return jnp.sqrt((3 * (Y[u] - Y[v]) ** 2).mean(axis=(-1, -2)))
# check for correctness
print(np.allclose(bar(), jit_bar(Y)))
# True
If we test the time of the jitted jnp op:
%timeit -n 10 -r 3 jit_bar(Y)
# 10.6 ms ± 678 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
So compared to the original, we can reach up to a ~300x speedup.
Note that not every operation can be converted to jax/jit so easily (this particular problem is conveniently suitable), so the general advice is to simply avoid python loops and use numpy's broadcasting/vectorization capabilities, like in bar().
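One usage note on the JAX timing above (my addition, not part of the original answer): JAX dispatches work asynchronously, so blocking on the result gives a fairer measurement, e.g.:
%timeit -n 10 -r 3 jit_bar(Y).block_until_ready()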
Stick to Julia.
If you already wrote it in a language which runs faster, why are you trying to use Python in the first place?
Your question is about speeding up Python, relative to Julia, so I'd like to offer some Julia code for comparison.
Since your data is most naturally expressed as a list of 4x5 arrays, I suggest expressing it as a vector of SMatrixes:
sumdiff2(A, B) = sum((A[i] - B[i])^2 for i in eachindex(A, B))

function dists(Y)
    M = length(Y)
    V = Vector{float(eltype(eltype(Y)))}(undef, sum(1:M-1))
    Threads.@threads for i in eachindex(Y)
        ii = sum(M-i+1:M-1)  # don't worry about this sum
        for j in i+1:lastindex(Y)
            ind = ii + (j-i)
            V[ind] = sqrt(3 * sumdiff2(Y[i], Y[j]) / length(Y[i]))
        end
    end
    return V
end

using Random: randn
using StaticArrays: SMatrix

Ys = [randn(SMatrix{4,5,Float64}) for _ in 1:1000];
Benchmarks:
# single-threaded
julia> using BenchmarkTools
julia> @btime dists($Ys);
  6.561 ms (2 allocations: 3.81 MiB)

# multi-threaded with 6 cores
julia> @btime dists($Ys);
  1.606 ms (75 allocations: 3.82 MiB)
I was not able to install jax on my computer, but when comparing with @Mercury's numpy code I got
foo: 5.5 seconds
bar: 179 ms
i.e. approximately a 3400x speedup over foo.
It is possible to write this as a one-liner at a ~2-3x performance cost.
While Python tends to be slower than Julia for many tasks, it is possible to write numerical codes as fast as Julia in Python using Numba and plain loops. Indeed, Numba is based on llvmlite, which is basically a JIT compiler built on the LLVM toolchain. The standard implementation of Julia also uses a JIT and the LLVM toolchain. This means the two should behave pretty similarly, apart from language overheads that become negligible once the computation is performed in parallel (because the resulting computation will be memory-bound on nearly all modern platforms).
This computation can be parallelized in both Julia and Python (still using Numba). While writing a sequential computation is quite straightforward, writing a parallel one is a bit more complex. Indeed, computing the upper triangular values can result in an imbalanced workload and thus a sub-optimal execution time. An efficient strategy is to compute, at each iteration, a pair of lines: one from the top of the upper triangular part and one from the bottom. The top line contains m-i items while the bottom one contains i+1 items, so there are m+1 items to compute per iteration and the amount of work is independent of the iteration number. This results in much better load balancing. When the number of lines is odd, the middle line needs to be computed separately.
Here is the final implementation:
import numba as nb
import numpy as np

@nb.njit(inline='always', fastmath=True)
def compute_line(tmp, res, i, m, n):
    offset = (i * (2 * m - i - 1)) // 2
    factor = 3.0 / n
    for j in range(i + 1, m):
        s = 0.0
        for k in range(n):
            s += (tmp[i, k] - tmp[j, k]) ** 2
        res[offset] = np.sqrt(s * factor)
        offset += 1
    return res

@nb.njit('()', parallel=True, fastmath=True)
def fastest():
    m, n = Y.shape[0], Y.shape[1] * Y.shape[2]
    res = np.empty(m*(m-1)//2)
    tmp = Y.reshape(m, n)
    for i in nb.prange(m//2):
        compute_line(tmp, res, i, m, n)
        compute_line(tmp, res, m-i-1, m, n)
    if m % 2 == 1:
        compute_line(tmp, res, m//2, m, n)  # the middle line left over when m is odd
    return res
# [...] same as others
%timeit -n 100 fastest()
Results
Here are performance results on my machine (with a i5-9600KF having 6 cores):
foo (seq, Python, Mercury): 4910.7 ms
bar (seq, Python, Mercury): 134.2 ms
jit_bar (seq, Python, Mercury): ???
dists (seq, Julia, DNF) 6.9 ms
dists (par, Julia, DNF) 2.2 ms
fastest (par, Python, me): 1.5 ms <-----
(Jax does not work on my machine so I cannot test it yet)
This implementation is the fastest one and succeeds in beating the best Julia code so far.
Optimal implementation
Note that for large arrays like (200_000, 4, 5), all implementations provided so far are inefficient since they are not cache friendly. Indeed, the input array will take 32 MiB and will not fit in the cache of most modern processors (and even if it could, one needs to consider the space needed for the output and the fact that caches are not perfect). This can be fixed using tiling, at the expense of an even more complex code. I think such an implementation should be optimal if you use Z-order curves.
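For what it's worth, here is a rough sketch of the tiling idea (my own illustration, not the author's benchmarked code; the tile size, the reshape and the packed-index arithmetic are assumptions):
import numba as nb
import numpy as np

TILE = 256  # tile of rows; two row tiles of shape (TILE, n) should fit in the L2 cache

@nb.njit(parallel=True, fastmath=True)
def dists_tiled(Y):
    m, n = Y.shape[0], Y.shape[1] * Y.shape[2]
    tmp = Y.reshape(m, n)
    res = np.empty(m * (m - 1) // 2)
    nblocks = (m + TILE - 1) // TILE
    # Each thread owns a tile of rows i, so writes to res never overlap.
    for bi in nb.prange(nblocks):
        i0 = bi * TILE
        i1 = min(i0 + TILE, m)
        for j0 in range(i0, m, TILE):
            j1 = min(j0 + TILE, m)
            for i in range(i0, i1):
                # Packed upper-triangle offset of entry (i, j) is
                # i*(2m-i-1)//2 + (j-i-1); the constant part is folded into `base`.
                base = i * (2 * m - i - 1) // 2 - i - 1
                for j in range(max(j0, i + 1), j1):
                    s = 0.0
                    for k in range(n):
                        s += (tmp[i, k] - tmp[j, k]) ** 2
                    res[base + j] = np.sqrt(3.0 * s / n)
    return res
The ordering of res matches foo/bar above; the per-tile workload is still imbalanced, so the pair-of-lines trick or a Z-order traversal would be needed on top of this for the best scaling.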

numpy array: fast assign short array to large array with index

I want to assign values to a large array from short arrays, using indexing. A simple version of the code is as follows:
import numpy as np

def assign_x():
    a = np.zeros((int(3e6), 20))
    index_a = np.random.randint(int(3e6), size=(int(3e6), 20))
    b = np.random.randn(1000, 20)
    for i in range(20):
        index_b = np.random.randint(1000, size=int(3e6))
        a[index_a[:, i], i] = b[index_b, i]
    return a
%timeit x = assign_x()
# 2.79 s ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have tried other ways which may be relevant, for example np.take and Numba's jit, but the above seems to be the fastest way. It could also possibly be sped up using multiprocessing. I have profiled the code; most of the time is spent on the line below, as it runs many times (20 here):
a[index_a[:, i], i] = b[index_b, i]
Any chance I can make this faster before using multiprocessing?
Why this is slow
This is slow because the memory access pattern is very inefficient. Indeed, random accesses are slow because the processor cannot predict them. As a result, they cause expensive cache misses (if the array does not fit in the L1/L2 cache) that cannot be avoided by prefetching data ahead of time. The thing is, the arrays are too big to fit in caches: index_a and a each take 457 MiB while b takes 156 KiB. As a result, accesses to b are typically done in the L2 cache with a higher latency, and accesses to the two other arrays are done in RAM. This is slow because current DDR RAM has a huge latency of 60-100 ns on a typical PC. Even worse: this latency is not likely to get much smaller in the near future; the RAM latency has not changed much over the last two decades. This is called the memory wall. Note also that modern processors fetch a full cache line of usually 64 bytes from the RAM when a value at a random location is requested (resulting in 56/64 = 87.5% of the bandwidth being wasted). Finally, generating random numbers is a quite expensive process, especially large integers, and np.random.randint can generate either 32-bit or 64-bit integers depending on the target platform.
How to improve this
The first improvement is to prefer indirection on the most contiguous dimension, which is generally the last one, since a[:, i] is slower than a[i, :]. You can transpose the arrays and swap the indexed axes. However, the NumPy transposition function only returns a view and does not actually transpose the array in memory, so an explicit copy is currently required. The best here is simply to generate the arrays directly in the layout that makes accesses efficient (rather than using expensive transpositions). Note you can use single precision so the arrays fit better in caches, at the expense of a lower precision.
Here is an example that returns a transposed array:
import numpy as np

def assign_x():
    a = np.zeros((20, int(3e6)))
    index_a = np.random.randint(int(3e6), size=(20, int(3e6)))
    b = np.random.randn(20, 1000)
    for i in range(20):
        index_b = np.random.randint(1000, size=int(3e6))
        a[i, index_a[i, :]] = b[i, index_b]
    return a
%timeit x = assign_x()
The code can be improved further using Numba so as to run it in parallel (one core is not enough to saturate the memory because of the RAM latency, but many cores can use it better since multiple fetches can be done concurrently). Moreover, it helps avoid the creation of big temporary arrays.
Here is an optimized Numba code:
import numpy as np
import numba as nb
import random

@nb.njit('float64[:,:]()', parallel=True)
def assign_x():
    a = np.zeros((20, int(3e6)))
    b = np.random.randn(20, 1000)
    for i in nb.prange(20):
        for j in range(3_000_000):
            # random.randint's upper bound is inclusive, hence the -1
            index_a = random.randint(0, 3_000_000 - 1)
            index_b = random.randint(0, 1000 - 1)
            a[i, index_a] = b[i, index_b]
    return a
%timeit x = assign_x()
Here are results on a 10-core Skylake Xeon processor:
Initial code: 2798 ms
Better memory access pattern: 1741 ms
With Numba: 318 ms
Note that parallelizing the inner-most loop would theoretically be faster because one line of a is more likely to fit in the last-level cache. However, doing so would cause a race condition that can only be fixed efficiently with atomic stores, which are not yet available in Numba (on CPU).
Note that the final code does not scale well because it is memory-bound. This is due to 87.5% of the memory throughput being wasted, as explained before. Additionally, on many processors (like all Intel and AMD Zen processors) the write-allocate cache policy forces data to be read from memory for each store in this case. This makes the computation much more inefficient, raising the wasted throughput to 93.7%... AFAIK, there is no way to prevent this in Python. In C/C++, the write-allocate issue can be fixed using low-level instructions. The rule of thumb: avoid random memory access patterns on big arrays like the plague.

Cython: size attribute of memoryviews

I'm using a lot of 3D memoryviews in Cython, e.g.
cython.declare(a='double[:, :, ::1]')
a = np.empty((10, 20, 30), dtype='double')
I often want to loop over all elements of a. I can do this using a triple loop like
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        for k in range(a.shape[2]):
            a[i, j, k] = ...
If I do not care about the indices i, j and k, it is more efficient to do a flat loop, like
cython.declare(a_ptr='double*')
a_ptr = cython.address(a[0, 0, 0])
for i in range(size):
    a_ptr[i] = ...
Here I need to know the number of elements (size) in the array. This is given by the product of the elements in the shape attribute, i.e. size = a.shape[0]*a.shape[1]*a.shape[2], or more generally size = np.prod(np.asarray(a).shape). I find both of these ugly to write, and the (albeit small) computational overhead bothers me. The nice way to do it is to use the builtin size attribute of memoryviews, size = a.size. However, for reasons I cannot fathom, this leads to unoptimized C code, as evident from the annotations html file generated by Cython. Specifically, the C code generated by size = a.shape[0]*a.shape[1]*a.shape[2] is simply
__pyx_v_size = (((__pyx_v_a.shape[0]) * (__pyx_v_a.shape[1])) * (__pyx_v_a.shape[2]));
where the C code generated from size = a.size is
__pyx_t_10 = __pyx_memoryview_fromslice(__pyx_v_a, 3, (PyObject *(*)(char *)) __pyx_memview_get_double, (int (*)(char *, PyObject *)) __pyx_memview_set_double, 0);; if (unlikely(!__pyx_t_10)) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_10);
__pyx_t_14 = __Pyx_PyObject_GetAttrStr(__pyx_t_10, __pyx_n_s_size); if (unlikely(!__pyx_t_14)) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_14);
__Pyx_DECREF(__pyx_t_10); __pyx_t_10 = 0;
__pyx_t_7 = __Pyx_PyIndex_AsSsize_t(__pyx_t_14); if (unlikely((__pyx_t_7 == (Py_ssize_t)-1) && PyErr_Occurred())) __PYX_ERR(0, 2238, __pyx_L1_error)
__Pyx_DECREF(__pyx_t_14); __pyx_t_14 = 0;
__pyx_v_size = __pyx_t_7;
To generate the above code, I have enabled all possible optimizations through compiler directives, meaning that the unwieldy C code generated by a.size cannot be optimized away. It looks to me as though the size "attribute" is not really a pre-computed attribute, but actually carries out a computation upon lookup. Furthermore, this computation is quite a bit more involved than simply taking the product over the shape attribute. I cannot find any hint of an explanation in the docs.
What is the explanation of this behavior, and do I have a better choice than writing out a.shape[0]*a.shape[1]*a.shape[2], if I really care about this micro optimization?
By looking at the produced C code, you can already see that size is a property and not a simple C member. Here is the original Cython code for memoryviews:
@cname('__pyx_memoryview')
cdef class memoryview(object):
    ...
    cdef object _size
    ...
    @property
    def size(self):
        if self._size is None:
            result = 1
            for length in self.view.shape[:self.view.ndim]:
                result *= length
            self._size = result
        return self._size
It is easy to see that the product is calculated only once and then cached. Clearly it doesn't play a big role for 3-dimensional arrays, but for a higher number of dimensions caching could become pretty important (as we will see, there are at most 8 dimensions, so it is not that clear-cut whether this caching is really worth it).
One can understand the decision to lazily calculate the size - after all, size is not always needed/used, and one doesn't want to pay for it. Clearly, there is a price to pay for this laziness if you use the size a lot - that is the trade-off Cython makes.
I would not dwell too long on the overhead of calling a.size - it is nothing compared to the overhead of calling a cython-function from python.
For example, the measurements of @danny measure only this Python-call overhead and not the actual performance of the different approaches. To show this, I throw a third function into the mix:
%%cython
...
def both():
    a.size + a.shape[0]*a.shape[1]*a.shape[2]
which does double the amount of work, but
>>> %timeit mv_size
22.5 ns ± 0.0864 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> %timeit mv_product
20.7 ns ± 0.087 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>>%timeit both
21 ns ± 0.39 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
is just as fast. On the other hand:
%%cython
...
def nothing():
    pass
isn't faster:
%timeit nothing
24.3 ns ± 0.854 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In a nutshell: I would use a.size because of its readability, assuming that optimizing it would not speed up my application - unless profiling proves otherwise.
The whole story: the variable a is of type __Pyx_memviewslice and not of type __pyx_memoryview as one could think. The struct __Pyx_memviewslice has the following definition:
struct __pyx_memoryview_obj;

typedef struct {
    struct __pyx_memoryview_obj *memview;
    char *data;
    Py_ssize_t shape[8];
    Py_ssize_t strides[8];
    Py_ssize_t suboffsets[8];
} __Pyx_memviewslice;
That means shape can be accessed very efficiently by the Cython code, as it is a simple C array (by the way, I asked myself what happens if there are more than 8 dimensions - the answer is: you cannot have more than 8 dimensions).
The member memview is where the memory is held, and __pyx_memoryview_obj is the C extension type which is produced from the Cython code we saw above and looks as follows:
/* "View.MemoryView":328
*
* @cname('__pyx_memoryview')
* cdef class memoryview(object): # <<<<<<<<<<<<<<
*
* cdef object obj
*/
struct __pyx_memoryview_obj {
  PyObject_HEAD
  struct __pyx_vtabstruct_memoryview *__pyx_vtab;
  PyObject *obj;
  PyObject *_size;
  PyObject *_array_interface;
  PyThread_type_lock lock;
  __pyx_atomic_int acquisition_count[2];
  __pyx_atomic_int *acquisition_count_aligned_p;
  Py_buffer view;
  int flags;
  int dtype_is_object;
  __Pyx_TypeInfo *typeinfo;
};
So, __Pyx_memviewslice is not really a Python object - it is a kind of convenience wrapper, which caches important data, like shape and strides, so this information can be accessed fast and cheaply.
What happens when we call a.size? First, __pyx_memoryview_fromslice is called which does some additional reference counting and some further stuff and returns the member memview from the __Pyx_memviewslice-object.
Then the property size is called on this returned memoryview, which accesses the cached value in _size, as has been shown in the Cython code above.
It looks as if the Cython programmers introduced a shortcut for such important information as shape, strides and suboffsets, but not for size, which is probably not so important - this is the reason for the cleaner C code in the case of shape.
The generated C code for a.size looks fine.
It has to interface with Python because memory views are python extension types. size on the memory view is a python attribute and gets converted to ssize_t. That is all the C code does. The conversion can be avoided by typing the size variable as Py_ssize_t rather than ssize_t.
So there is not anything in the C code that looks unoptimised - it's just looking up an attribute on a python object, size on a memory view in this case.
Here are results of micro-benchmark for the two methods.
Setup:
cimport numpy as np
import numpy as np
cimport cython
cython.declare(a='double[:, :, ::1]')
a = np.empty((10, 20, 30), dtype='double')
def mv_size():
    return a.size

def mv_product():
    return a.shape[0]*a.shape[1]*a.shape[2]
Results:
%timeit mv_size
10000000 loops, best of 3: 23.4 ns per loop
%timeit mv_product
10000000 loops, best of 3: 23.4 ns per loop
Performance is pretty much identical.
The product method is purely C code which matters if it needs to be executed in parallel, but otherwise there is no performance benefit over memory view size.
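To illustrate that last point, here is a rough sketch of my own (not from the answer) where only the shape product works, because a.size is a Python attribute lookup and would need the GIL inside the prange loop:
%%cython
# cython: boundscheck=False, wraparound=False
from cython.parallel import prange

def flat_fill(double[:, :, ::1] a, double value):
    cdef Py_ssize_t i
    cdef Py_ssize_t size = a.shape[0] * a.shape[1] * a.shape[2]  # pure C arithmetic
    cdef double* a_ptr = &a[0, 0, 0]
    # a.size could not be used to bound this loop: it goes through the Python API.
    for i in prange(size, nogil=True):
        a_ptr[i] = value
(prange only actually runs in parallel if the cell is compiled with OpenMP flags, e.g. --compile-args=-fopenmp --link-args=-fopenmp; otherwise it falls back to a serial loop.)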

Using pointers to numpy array data attribute

I'm trying to solve the bottleneck in my application, which is an elementwise sum of two matrices.
I'm using NumPy and Cython. I have a cdef class with a matrix attribute. Since Cython still doesn't support buffer arrays in class attributes, I followed this and tried to use a pointer to the data attribute of the matrix. The thing is, I'm sure I'm doing something wrong, as the results indicate.
What I tried to do is more or less the following:
cdef class the_class:
    cdef np.ndarray the_matrix
    cdef float_t* the_matrix_p

    def __init__(self):
        the_matrix_p = <float_t*> self.the_matrix.data

    cpdef the_function(self):
        other_matrix = self.get_other_matrix()
        the_matrix_p += other_matrix.data
I have serious doubts that adding two numpy arrays is a bottleneck you can solve by rewriting things in C. See the following code, which uses scipy.weave:
import numpy as np
from scipy.weave import inline

a = np.random.rand(10000000)
b = np.random.rand(10000000)
c = np.empty((10000000,))

def c_sum(a, b, c):
    length = a.shape[0]
    code = '''
    for(int j = 0; j < length; j++)
    {
        c[j] = a[j] + b[j];
    }
    '''
    inline(code, ['a', 'b', 'c', 'length'])
Once you run c_sum(a, b, c) once to get the C code compiled, these are the timings I get:
In [12]: %timeit c_sum(a, b, c)
10 loops, best of 3: 33.5 ms per loop
In [16]: %timeit np.add(a, b, out=c)
10 loops, best of 3: 33.6 ms per loop
So it seems you are looking at something like a 0.3% performance improvement, if the timing differences are not simply random noise, on an operation that takes a handful of ms when working on arrays of ten million elements. If it really is a bottleneck, this is hardly going to solve it.
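Note that scipy.weave has since been removed from SciPy; a rough modern equivalent of the same check, sketched here as a typed-memoryview Cython loop (my own code, not the answerer's), can be timed against np.add(a, b, out=c) in exactly the same way:
%%cython -c=-O3
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def c_sum(double[::1] a, double[::1] b, double[::1] c):
    cdef Py_ssize_t j
    cdef Py_ssize_t n = a.shape[0]
    for j in range(n):
        c[j] = a[j] + b[j]
The point stands either way: for arrays of this size both versions are limited by memory bandwidth rather than by Python overhead.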
Try compiling ATLAS and recompiling numpy after that. This probably won't help with addition, but you can get a really nice performance boost with more complicated matrix operations (if you use such, of course).
Check out this simple benchmark. If your results fall too far from those given in the post, maybe your numpy is not linked against some optimized BLAS implementation.
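If you want to check which BLAS/LAPACK your NumPy build is linked against, NumPy can report it directly:
import numpy as np
np.show_config()  # lists the BLAS/LAPACK libraries (OpenBLAS, MKL, ATLAS, ...) used by this build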

Numpy vs Cython speed

I have an analysis code that does some heavy numerical operations using numpy. Just out of curiosity, I tried to compile it with Cython with few changes, and then I rewrote the numpy part using loops.
To my surprise, the code based on loops was much faster (8x). I cannot post the complete code, but I put together a very simple, unrelated computation that shows similar behavior (albeit the timing difference is not so big):
Version 1 (without cython)
import numpy as np

def _process(array):
    rows = array.shape[0]
    cols = array.shape[1]
    out = np.zeros((rows, cols))
    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)
    return out

def main():
    data = np.load('data.npy')
    out = _process(data)
    np.save('vianumpy.npy', out)
Version 2 (building a module with cython)
import cython
cimport cython
import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):
    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)
    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('viacynpy.npy', out)
Version 3 (building a module with cython)
import cython
cimport cython
import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):
    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
    for row in range(0, rows):
        for col in range(0, cols):
            for row2 in range(0, rows):
                out[row, col] += array[row2, col] - array[row, col]
    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('vialoop.npy', out)
With a 10000x10 matrix saved in data.npy, the times are:
$ python -m timeit -c "from version1 import main;main()"
10 loops, best of 3: 4.56 sec per loop
$ python -m timeit -c "from version2 import main;main()"
10 loops, best of 3: 4.57 sec per loop
$ python -m timeit -c "from version3 import main;main()"
10 loops, best of 3: 2.96 sec per loop
Is this expected, or is there an optimization that I am missing? The fact that versions 1 and 2 give the same result is somehow expected, but why is version 3 faster?
Ps.- This is NOT the calculation that I need to make, just a simple example that shows the same thing.
With slight modification, version 3 becomes twice as fast:
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def process2(np.ndarray[DTYPE_t, ndim=2] array):
    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row, col, row2
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.empty((rows, cols))
    for row in range(rows):
        for row2 in range(rows):
            for col in range(cols):
                out[row, col] += array[row2, col] - array[row, col]
    return out
The bottleneck in your calculation is memory access. Your input array is C ordered, which means that moving along the last axis makes the smallest jump in memory. Therefore your inner loop should be along axis 1, not axis 0. Making this change cuts the run time in half.
If you need to use this function on small input arrays, you can reduce the overhead by using np.empty instead of np.zeros. To reduce the overhead further, use PyArray_EMPTY from the numpy C API.
If you use this function on very large input arrays (more than 2**31 elements), then the integers used for indexing (and in the range function) will overflow. To be safe, use:
cdef Py_ssize_t rows = array.shape[0]
cdef Py_ssize_t cols = array.shape[1]
cdef Py_ssize_t row, col, row2
instead of
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2
Timing:
In [2]: a = np.random.rand(10000, 10)
In [3]: timeit process(a)
1 loops, best of 3: 3.53 s per loop
In [4]: timeit process2(a)
1 loops, best of 3: 1.84 s per loop
where process is your version 3.
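Regarding the PyArray_EMPTY suggestion above, here is a minimal sketch of what that could look like (my own illustration, not part of the original answer):
# empty_fast.pyx
import numpy as np
cimport numpy as np

np.import_array()  # required before any use of the NumPy C API

def make_empty(Py_ssize_t rows, Py_ssize_t cols):
    cdef np.npy_intp dims[2]
    dims[0] = rows
    dims[1] = cols
    # Uninitialized, C-ordered float64 array allocated through the C API,
    # skipping the Python-level call to np.empty.
    cdef np.ndarray[np.float64_t, ndim=2] out = np.PyArray_EMPTY(2, dims, np.NPY_FLOAT64, 0)
    return out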
As mentioned in the other answers, version 2 is essentially the same as version 1, since Cython is unable to dig into the array access operator in order to optimise it. There are two reasons for this:
First, there is a certain amount of overhead in each call to a numpy function, as compared to optimised C code. However, this overhead becomes less significant if each operation deals with large arrays.
Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as out[row, :] = A[row, :] + B[row, :]*C[row, :]. In this case a whole array B*C must be created in memory, then added to A. This means that the CPU cache is being thrashed, as data is being read from and written to memory rather than being kept in the CPU and used straight away. Importantly, this problem becomes worse if you are dealing with large arrays.
Particularly since you state that your real code is more complex than your example, and it shows a much greater speedup, I suspect that the second reason is likely to be the main factor in your case.
As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr, although of course cython is useful in many more situations so it may be the better approach for you.
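As a small illustration of the numexpr suggestion (my own sketch, not from the answer): numexpr evaluates the expression in cache-sized blocks, so the B*C intermediate is never materialized as a full array.
import numpy as np
import numexpr as ne

rows, cols = 10000, 10
A = np.random.rand(rows, cols)
B = np.random.rand(rows, cols)
C = np.random.rand(rows, cols)
out = np.empty_like(A)

# Same result as out[:] = A + B * C, but without the full-size B*C temporary.
ne.evaluate("A + B * C", out=out)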
I would recommend using the -a flag to have cython generate the html file that shows what is being translated into pure c vs calling the python API:
http://docs.cython.org/src/quickstart/cythonize.html
Version 2 gives nearly the same result as Version 1, because all of the heavy lifting is being done by the Python API (via numpy) and cython isn't doing anything for you. In fact on my machine, numpy is built against MKL, so when I compile the cython generated c code using gcc, Version 3 is actually a little slower than the other two.
Cython shines when you are doing an array manipulation that numpy can't do in a 'vectorized' way, or when you are doing something memory intensive that it allows you to avoid creating a large temporary array. I've gotten 115x speed-ups using cython vs numpy for some of my own code:
https://github.com/synapticarbors/pylangevin-integrator
Part of that was calling randomkit directly at the level of the C code instead of calling it through numpy.random, but most of that was cython translating the computationally intensive for loops into pure C without calls to Python.
The difference may be due to version 1 and 2 doing a Python-level call to np.sum() for each row, while version 3 likely compiles to a tight, pure C loop.
Studying the difference between version 2 and 3's Cython-generated C source should be enlightening.
I'd guess the main overhead you are saving is the temporary arrays created. You create a great big array, array - array[row, :], then reduce it into a smaller array using sum. But building that big temporary array won't be free, especially if you need to allocate memory.
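As a side note on this particular toy example (the asker's real computation is different), the big temporary can be avoided entirely, since summing the differences over rows reduces to a column sum minus a scaled copy; a sketch:
import numpy as np

def _process_no_temp(array):
    # out[row, :] = sum_j (array[j, :] - array[row, :])
    #             = array.sum(axis=0) - rows * array[row, :]
    return array.sum(axis=0) - array.shape[0] * array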
