I'm trying to solve the bottleneck in my application, which is an elementwise sum of two matrices.
I'm using NumPy and Cython. I have a cdef class with a matrix attribute. Since Cython still doesn't support buffer arrays in class attributes, I followed this and tried to use a pointer to the data attribute of the matrix. The thing is, I'm sure I'm doing something wrong, as the results indicate.
What I tried to do is more or less the following:
cdef class the_class:
    cdef np.ndarray the_matrix
    cdef float_t* the_matrix_p

    def __init__(self):
        the_matrix_p = <float_t*> self.the_matrix.data

    cpdef the_function(self):
        other_matrix = self.get_other_matrix()
        the_matrix_p += other_matrix.data
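(For reference, a minimal sketch of the same idea using typed memoryviews instead of a raw data pointer; the shape, dtype, and the get_other_matrix body below are assumptions for illustration, not part of the original code:)

import numpy as np
cimport numpy as np
cimport cython

ctypedef np.float32_t float_t

cdef class the_class:
    cdef np.ndarray the_matrix

    def __init__(self):
        # assumed shape and dtype, purely for illustration
        self.the_matrix = np.zeros((100, 100), dtype=np.float32)

    def get_other_matrix(self):
        # placeholder: assumed to return a C-contiguous float32 array of the same shape
        return np.ones((100, 100), dtype=np.float32)

    @cython.boundscheck(False)
    @cython.wraparound(False)
    cpdef the_function(self):
        # take typed memoryviews locally; no pointer attribute is needed
        cdef float_t[:, ::1] m = self.the_matrix
        cdef float_t[:, ::1] other = self.get_other_matrix()
        cdef Py_ssize_t i, j
        for i in range(m.shape[0]):
            for j in range(m.shape[1]):
                m[i, j] += other[i, j]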
I have serious doubts that adding two numpy arrays is a bottleneck that you can solve by rewriting things in C. See the following code, which uses scipy.weave:
import numpy as np
from scipy.weave import inline
a = np.random.rand(10000000)
b = np.random.rand(10000000)
c = np.empty((10000000,))
def c_sum(a, b, c):
    length = a.shape[0]
    code = '''
    for(int j = 0; j < length; j++)
    {
        c[j] = a[j] + b[j];
    }
    '''
    inline(code, ['a', 'b', 'c', 'length'])
Once you run c_sum(a, b, c) once to get the C code compiled, these are the timings I get:
In [12]: %timeit c_sum(a, b, c)
10 loops, best of 3: 33.5 ms per loop
In [16]: %timeit np.add(a, b, out=c)
10 loops, best of 3: 33.6 ms per loop
So it seems you are looking at something like a 0.3% performance difference, if the timing differences are not simply random noise, on an operation that takes a handful of ms when working on arrays of ten million elements. If it really is a bottleneck, rewriting it in C is hardly going to solve it.
Try compiling ATLAS and recompiling numpy after that. This probably won't help with addition, but you can get a really nice performance boost with more complicated matrix operations (if you use such, of course).
Check out this simple benchmark. If your results fall too far from those given in the post, maybe your numpy is not linked against an optimized BLAS implementation.
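For example, a quick way to check which BLAS/LAPACK libraries your numpy build is linked against (the exact output format varies between numpy versions):

import numpy as np

# prints the BLAS/LAPACK configuration numpy was built with
np.show_config()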
I would like to do a series of operations on particular elements of matrices. I need to define the indices of these elements in an external object (self.indices in the example below).
Here is a simple example of an implementation in Cython:
%%cython -f -c=-O2 -I./
import numpy as np
cimport numpy as np
cimport cython
cdef class Test:

    cdef double[:, ::1] a, b
    cdef Py_ssize_t[:, ::1] indices

    def __cinit__(self, a, b, indices):
        self.a = a
        self.b = b
        self.indices = indices

    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run1(self):
        """ Use of external structure of indices. """
        cdef Py_ssize_t idx, ix, iy
        cdef int n = self.indices.shape[0]

        for idx in range(n):
            ix = self.indices[idx, 0]
            iy = self.indices[idx, 1]
            self.b[ix, iy] = ix * iy * self.a[ix, iy]

    @cython.boundscheck(False)
    @cython.nonecheck(False)
    @cython.wraparound(False)
    @cython.initializedcheck(False)
    cpdef void run2(self):
        """ Direct formulation """
        cdef Py_ssize_t idx, ix, iy
        cdef int nx = self.a.shape[0]
        cdef int ny = self.a.shape[1]

        for ix in range(nx):
            for iy in range(ny):
                self.b[ix, iy] = ix * iy * self.a[ix, iy]
with this on the Python side:
import itertools
import numpy as np
N = 256
a = np.random.rand(N, N)
b = np.zeros_like(a)
indices = np.array([[i, j] for i, j in itertools.product(range(N), range(N))], dtype=int)
test = Test(a, b, indices)
and the results:
%timeit test.run1()
75.6 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit test.run2()
41.4 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Why does the Test.run1() method run much slower than the Test.run2() method?
What are the possibilities to keep a similar level of performance as with Test.run2() by using an external list, array, or any other kind of structure of indices?
Because run1 is significantly more complicated...
run1 is having to read from two separate bits of memory which almost certainly makes the CPU cache less efficient.
It's fairly trivial for the compiler to work out exactly what order it's accessing the array elements in run2. In contrast in run1 it could be accessing them in any order. That likely allows for significant optimizations.
Your current performance is probably as good as it gets.
In addition to the good answer from @DavidW, note that run2 is SIMD-friendly as opposed to run1. This means a compiler can easily generate SIMD instructions for run2, so as to read multiple packed items from memory, multiply multiple items at once thanks to packed SIMD instructions, and write the packed items back to memory. If the array is small enough to fit in the CPU caches, which is the case here, the SIMD computation can be very fast. Indeed, nearly all modern x86-64 processors support the 256-bit wide AVX/AVX-2 instruction set, which can operate on 8 32-bit integers at once, or 4 double-precision floating-point numbers. Additionally, such code can easily be unrolled and well pipelined by modern processors. Hardware prefetchers are also optimized for this kind of use-case.
Meanwhile, run1 does indirect memory accesses. Compilers can hardly assume the indices are actually contiguous and generate packed loads/stores (this is very unlikely in most code, and it is up to developers to write this kind of optimization). The indirection requires multiple load instructions that saturate the load ports and make the overall computation at least twice as slow. AVX-2 has a gather instruction that can theoretically help in such a case. That being said, the instruction is currently not efficiently implemented on current Intel/AMD processors (it basically does scalar loads internally, saturating the load ports). Still, it should make run1 run about as fast as run2 if the latter is not vectorized (otherwise run2 should sharply outperform run1 even with gather instructions). Unfortunately, compilers still have a hard time making use of such instructions.
In fact, regarding the code and the timings, run2 should be even faster if SIMD instructions were used. I think this is not the case because the -O2 optimization level is currently set in your code, and compilers like GCC do not yet automatically vectorize code at that level (except with very recent versions, AFAIK). Please consider using -O3. Please also consider enabling -mavx and -mavx2 if possible (this assumes the target processors are not too old), as it should make the code faster.
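For instance, with the %%cython cell magic from the question, those flags could be passed as extra compile args, roughly like this (a sketch; it assumes GCC/Clang and an AVX2-capable CPU):

%%cython -f -c=-O3 -c=-mavx -c=-mavx2
# same cimports and Test class as above, now compiled with -O3 and AVX/AVX2 enabled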
Description
I simply want to create a matrix of rows x cols that is filled with 0s. Since I always work with numpy, I thought using np.zeros as described in the docs would be the easiest way:
DTYPE = np.int
ctypedef np.int_t DTYPE_t

def f1():
    cdef:
        int dim = 40000
        int i, j
        np.ndarray[DTYPE_t, ndim=2] mat = np.zeros([40000, 40000], dtype=DTYPE)

    for i in range(dim):
        for j in range(dim):
            mat[i, j] = 1
Then I compared this with using plain C arrays:
def f2():
    cdef:
        int dim = 40000
        int[40000][40000] mat
        int i, j

    for i in range(dim):
        for j in range(dim):
            mat[i][j] = 1
The numpy version took 3 secs on my pc whereas the C version only took 2.4e-5 secs. However, when I return the array from f2() I noticed it is not zero-filled (of course here it can't be, since the loop fills it; but even when I don't fill it, it won't return a 0 array either). How can this be done in Cython? I know in regular C it would be like: int arr[n][m] = {};
Question
How can the C array be filled with 0s? (I would go for numpy instead if there is something obviously wrong in my code.)
You do not want to be writing code like this:
int[40000][40000] mat generates a 6 gigabyte array on the stack (assuming 4 byte ints). Typically maximum stack sizes are of the order of a few Mb. I have no idea how this isn't crashing your PC.
However when I return the array from f2() [...]
The array you have allocated is completely local to the function. From a C point of view you cannot return it since it ceases to exist after the function has finished. I think Cython may convert it to a (nested) Python list for you. This requires a slow copy element-by-element and is not what you want.
For what you're doing here you're much better just using Numpy.
Cython doesn't support a good equivalent of the C int arr[n][m] = {}; initializer, so if you do want to initialize sensibly sized, small C arrays you need to use one of:
loops,
memset (which you can cimport from libc.string),
a typed memoryview of the array with memview[:, :] = 0 (both of the latter options are sketched below).
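For illustration, a minimal sketch of the memset and memoryview options on a small, hypothetical 16x16 C array:

from libc.string cimport memset

def zero_small_array():
    cdef int mat[16][16]
    cdef int[:, :] mv

    # option 1: memset fills the whole block with zero bytes
    memset(&mat[0][0], 0, sizeof(mat))

    # option 2: wrap the C array in a typed memoryview and assign 0
    mv = mat
    mv[:, :] = 0

    return mat[0][0]  # 0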
The numpy version took 3 secs on my pc whereas the C version only took 2.4e-5 secs.
This kind of difference usually suggests that the C compiler has optimized some code out (by detecting that the result is unused). It is unlikely to be a genuine speed-up.
I have read through the einsum manual and ajcr's basic introduction.
I have zero experience with einstein summation in a non-coding context, although I have tried to remedy that with some internet research (would provide links but don't have the reputation for more than two yet). I've also tried experimenting in python with einsum to see if I could get a better handle on things.
And yet I'm still unclear on whether it is both possible and efficient to do as follows:
given two arrays of arrays (a and b) of equal row length (3) and height (n), row by row produce the outer product of (row i of a on row i of b) plus the outer product of (row i of b on row i of a), and then sum all of the outer product matrices to output one final matrix.
I know that 'i,j->ij' produces the outer product of one vector on another-- it's the next steps that have lost me. ('ijk,jik->ij' is definitely not it)
My other available option is to loop through the array and call the basic functions (a double outer product, and a matrix addition) from functions I've written in Cython (using the numpy built-in outer and sum functions is not an option; they are far too slow). It is likely I'd end up moving the loop itself to Cython as well.
so:
how can I express einsum-ically the procedure I described above?
would it offer real gains over doing everything in cython? or are there other alternatives I'm not aware of? (including the possibility that I've been using numpy less efficiently than I could be...)
Thanks.
edit with example:
A = np.zeros((3,3))
arrays_1 = np.array([[1,0,0],[1,2,3],[0,1,0],[3,2,1]])
arrays_2 = np.array([[1,2,3],[0,1,0],[1,0,0],[3,2,1]])

for i in range(len(arrays_1)):
    A = A + (np.outer(arrays_1[i], arrays_2[i]) + np.outer(arrays_2[i], arrays_1[i]))
(note, however, that in practice we're dealing with arrays of much greater length (ie still length 3 for each internal member but up to a few thousand such members), and this section of code gets (unavoidably) called many times)
in case it's at all helpful, here's the cython for the summing two outer products:
def outer_product_sum(np.ndarray[DTYPE_t, ndim=1] a_in, np.ndarray[DTYPE_t, ndim=1] b_in):
    cdef double *a = <double *>a_in.data
    cdef double *b = <double *>b_in.data
    return np.array([
        [a[0]*b[0]+a[0]*b[0], a[0]*b[1]+a[1]*b[0], a[0]*b[2]+a[2]*b[0]],
        [a[1]*b[0]+a[0]*b[1], a[1]*b[1]+a[1]*b[1], a[1]*b[2]+a[2]*b[1]],
        [a[2]*b[0]+a[0]*b[2], a[2]*b[1]+a[1]*b[2], a[2]*b[2]+a[2]*b[2]]])
which, right now, I call from within an 'i in range(len(array))' loop as shown above.
Einstein summation can only be used for the multiplicative part of the question (i.e. the outer product). Luckily, the summation over rows does not have to be performed as a separate loop over matrices; it can be folded into the einsum reduction itself. Using the arrays from your example:
arrays_1 = np.array([[1,0,0],[1,2,3],[0,1,0],[3,2,1]])
arrays_2 = np.array([[1,2,3],[0,1,0],[1,0,0],[3,2,1]])
A = np.einsum('ki,kj->ij', arrays_1, arrays_2) + np.einsum('ki,kj->ij', arrays_2, arrays_1)
The input arrays are of shape (4,3), summation takes place over the first index (named 'k'). If summation should take place over the second index, change the subscripts string to 'ik,jk->ij'.
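As a quick sanity check, this matches the explicit loop from the question:

A_loop = np.zeros((3, 3))
for i in range(len(arrays_1)):
    A_loop += np.outer(arrays_1[i], arrays_2[i]) + np.outer(arrays_2[i], arrays_1[i])

print(np.allclose(A, A_loop))  # True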
Whatever you can do with np.einsum, you can usually do faster using np.dot. In this case, A is the sum of two dot products:
arrays_1 = np.array([[1,0,0],[1,2,3],[0,1,0],[3,2,1]])
arrays_2 = np.array([[1,2,3],[0,1,0],[1,0,0],[3,2,1]])
A1 = (np.einsum('ki,kj->ij', arrays_1, arrays_2) +
np.einsum('ki,kj->ij', arrays_2, arrays_1))
A2 = arrays_1.T.dot(arrays_2) + arrays_2.T.dot(arrays_1)
print(np.allclose(A1, A2))
# True
%timeit (np.einsum('ki,kj->ij', arrays_1, arrays_2) +
np.einsum('ki,kj->ij', arrays_2, arrays_1))
# 100000 loops, best of 3: 7.51 µs per loop
%timeit arrays_1.T.dot(arrays_2) + arrays_2.T.dot(arrays_1)
# 100000 loops, best of 3: 4.51 µs per loop
I am confused as to how a NumPy nested loop over a 3D array can be so slow in comparison with Cython.
I wrote a trivial example.
Python/NumPy version:
import numpy as np
def my_func(a, b, c):
    s = 0
    for z in xrange(401):
        for y in xrange(401):
            for x in xrange(401):
                if a[z,y,x] == 0 and b[x,y,z] >= 0:
                    c[z,y,x] = 1
                    b[z,y,x] = z*y*x
                    s += 1
    return s
a = np.zeros((401,401,401), dtype=np.float32)
b = np.zeros((401,401,401), dtype=np.uint32)
c = np.zeros((401,401,401), dtype=np.uint8)
s = my_func(a,b,c)
Cythonized version:
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def my_func(np.float32_t[:,:,::1] a, np.uint32_t[:,:,::1] b, np.uint8_t[:,:,::1] c):
    cdef np.uint16_t z, y, x
    cdef np.uint32_t s = 0
    for z in range(401):
        for y in range(401):
            for x in range(401):
                if a[z,y,x] == 0 and b[x,y,z] >= 0:
                    c[z,y,x] = 1
                    b[z,y,x] = z*y*x
                    s = s + 1
    return s
The Cythonized version of my_func() runs approx. 6500x faster. A simpler function with only the if-statement and the array accesses can be even 10000x faster. The Python version of my_func() takes 500.651 sec to finish. Is iterating over a relatively small 3D array really this slow, or did I make some mistake in the code?
Cython version 0.21.1, Python 2.7.5, GCC 4.8.1, Xubuntu 13.10.
Python is an interpreted language. One of the benefits of compiling to machine code is the huge speedup you get, especially with things like nested loops.
I don't know what your expectations are, but all interpreted languages will be terribly slow at the things you are trying to do (JIT compiling may help to some extent though).
The trick of getting good performance out of Numpy (or MATLAB or anything similar) is to avoid looping altogether and instead try to refactor your code into a few operations on large matrices. This way, the looping will take place in the (heavily optimized) machine code libraries instead of your Python code.
As mentioned by Krumelur, python loops are definitely slow. You can, however, use numpy to your advantage. Operations on entire arrays are quite fast, although you need a little ingenuity sometimes.
For instance, in your code, since your loop never reads the value in b after you modify it (I think? My head is a little fuzzy at the moment, so you'll definitely want to go through this), the following should be equivalent:
# Precalculate a matrix of z*y*x products
tmp = np.indices(a.shape)
prod = tmp[0] * tmp[1] * tmp[2]

# Use array-wide logical operations to compute the condition from a and the transpose of b
condition = np.logical_and(a == 0, b.T >= 0)

# Use condition to alter b and c only where the condition is true
b[condition] = prod[condition]
c[condition] = 1
s = condition.sum()
So this does calculate x*y*z even in cases where the condition is false. You could probably avoid that if it turns out that is using lots of time, but it's likely not to be a significant factor.
For loops over numpy arrays in Python are slow; you should use vectorized calculations where possible. If the algorithm needs a for loop over every element in the array, here are some hints to speed it up.
a[z,y,x] is a numpy scalar value, and calculations with numpy scalar values are very slow:
x = 3.0
%timeit x > 0
x = np.float64(3.0)
%timeit x > 0
the output on my pc with numpy 1.8.2, windows 7:
10000000 loops, best of 3: 64.3 ns per loop
1000000 loops, best of 3: 657 ns per loop
You can use the item() method to get the plain Python value directly:
if a.item(z, y, x) == 0 and b.item(x, y, z) >= 0:
...
It can speed up the for loop by about 8x.
I have been playing around with writing cffi modules in python, and their speed is making me wonder if I'm using standard python correctly. It's making me want to switch to C completely! Truthfully there are some great python libraries I could never reimplement myself in C so this is more hypothetical than anything really.
This example shows the sum function in python being used with a numpy array, and how slow it is in comparison with a c function. Is there a quicker pythonic way of computing the sum of a numpy array?
import numpy as np
from cffi import FFI

def cast_matrix(matrix, ffi):
    ap = ffi.new("double* [%d]" % (matrix.shape[0]))
    ptr = ffi.cast("double *", matrix.ctypes.data)
    for i in range(matrix.shape[0]):
        ap[i] = ptr + i*matrix.shape[1]
    return ap
ffi = FFI()
ffi.cdef("""
    double sum(double**, int, int);
""")

C = ffi.verify("""
double sum(double** matrix, int x, int y){
    int i, j;
    double sum = 0.0;
    for (i=0; i<x; i++){
        for (j=0; j<y; j++){
            sum = sum + matrix[i][j];
        }
    }
    return(sum);
}
""")
m = np.ones(shape=(10,10))
print 'numpy says', m.sum()
m_p = cast_matrix(m, ffi)
sm = C.sum(m_p, m.shape[0], m.shape[1])
print 'cffi says', sm
just to show the function works:
numpy says 100.0
cffi says 100.0
now if I time this simple function I find that numpy is really slow!
Am I using numpy in the correct way? Is there a faster way to calculate the sum in python?
import time
n = 1000000
t0 = time.time()
for i in range(n): C.sum(m_p, m.shape[0], m.shape[1])
t1 = time.time()
print 'cffi', t1-t0
t0 = time.time()
for i in range(n): m.sum()
t1 = time.time()
print 'numpy', t1-t0
times:
cffi 0.818415880203
numpy 5.61657714844
Numpy is slower than C for two reasons: the Python overhead (probably similar to cffi) and generality. Numpy is designed to deal with arrays of arbitrary dimensions, in a bunch of different data types. Your example with cffi was made for a 2D array of floats. The cost was writing several lines of code vs .sum(), 6 characters to save less than 5 microseconds. (But of course, you already knew this). I just want to emphasize that CPU time is cheap, much cheaper than developer time.
Now, if you want to stick to Numpy and you want better performance, your best option is to use Bottleneck. It provides a few functions optimised for 1D and 2D arrays of floats and doubles, and they are blazing fast. In your case, 16 times faster, which would put the execution time at about 0.35 s, or about twice as fast as cffi.
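A minimal sketch of that, assuming Bottleneck is installed; bn.nansum is used here as its optimised sum, and for an array without NaNs it returns the same value as m.sum():

import numpy as np
import bottleneck as bn

m = np.ones(shape=(10, 10))
print 'bottleneck says', bn.nansum(m)  # 100.0, same as m.sum()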
For other functions that Bottleneck does not have, you can use Cython. It helps you write C code with a more Pythonic syntax. Or, if you will, progressively convert Python into C until you are happy with the speed.