Why is cffi so much quicker than numpy?

I have been playing around with writing cffi modules in Python, and their speed is making me wonder whether I'm using standard Python correctly. It makes me want to switch to C completely! Truthfully, there are some great Python libraries I could never reimplement myself in C, so this is more hypothetical than anything.
This example shows the sum function in Python being used with a numpy array, and how slow it is in comparison with a C function. Is there a quicker Pythonic way of computing the sum of a numpy array?
import numpy as np
from cffi import FFI

def cast_matrix(matrix, ffi):
    ap = ffi.new("double* [%d]" % (matrix.shape[0]))
    ptr = ffi.cast("double *", matrix.ctypes.data)
    for i in range(matrix.shape[0]):
        ap[i] = ptr + i*matrix.shape[1]
    return ap
ffi = FFI()
ffi.cdef("""
double sum(double**, int, int);
""")

C = ffi.verify("""
double sum(double** matrix, int x, int y){
    int i, j;
    double sum = 0.0;
    for (i=0; i<x; i++){
        for (j=0; j<y; j++){
            sum = sum + matrix[i][j];
        }
    }
    return(sum);
}
""")
m = np.ones(shape=(10,10))
print 'numpy says', m.sum()
m_p = cast_matrix(m, ffi)
sm = C.sum(m_p, m.shape[0], m.shape[1])
print 'cffi says', sm
Just to show that the function works:
numpy says 100.0
cffi says 100.0
Now, if I time this simple function, I find that numpy is really slow!
Am I using numpy in the correct way? Is there a faster way to calculate the sum in python?
import time
n = 1000000
t0 = time.time()
for i in range(n): C.sum(m_p, m.shape[0], m.shape[1])
t1 = time.time()
print 'cffi', t1-t0
t0 = time.time()
for i in range(n): m.sum()
t1 = time.time()
print 'numpy', t1-t0
times:
cffi 0.818415880203
numpy 5.61657714844

Numpy is slower than C for two reasons: the Python overhead (probably similar to cffi's) and generality. Numpy is designed to deal with arrays of arbitrary dimensions, in a bunch of different data types. Your cffi example was made for a 2D array of doubles. The cost was writing several lines of code versus .sum(), six characters, to save less than 5 microseconds. (But of course, you already knew this.) I just want to emphasize that CPU time is cheap, much cheaper than developer time.
Now, if you want to stick to Numpy and you want better performance, your best option is to use Bottleneck. It provides a few functions optimised for 1D and 2D arrays of floats and doubles, and they are blazing fast. In your case, 16 times faster, which would put the execution time at about 0.35 s, or about twice as fast as cffi.
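For illustration, the call is as simple as the following sketch (bn.nansum is one of Bottleneck's optimised reductions; your exact speedup will depend on your build):

import numpy as np
import bottleneck as bn

m = np.ones(shape=(10, 10))
# Bottleneck's optimised reduction; nansum equals sum when there are no NaNs
print 'bottleneck says', bn.nansum(m)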
For other functions that Bottleneck does not have, you can use Cython. It helps you write C code with a more Pythonic syntax. Or, if you prefer, progressively convert the Python into C until you are happy with the speed.
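If you go the Cython route, a typed loop is enough to get close to the C version. A rough, untested sketch (the function name and the memoryview typing are mine, not from the question):

# sum2d.pyx -- compile with cythonize; a minimal sketch, not benchmarked here
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def sum2d(double[:, :] m):
    cdef double total = 0.0
    cdef Py_ssize_t i, j
    for i in range(m.shape[0]):
        for j in range(m.shape[1]):
            total += m[i, j]
    return total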

Related

more efficient array calculation on this python code case

I have written a function which takes an N by N array and computes an output array based on it.
Here's what my code looks like:
import numpy as np

def calculate_output(input, N):
    output = np.zeros((N, N))
    for y in range(N):
        for x in range(N):
            val1 = 0 if y-1 < 0 else output[y-1][x] + input[y][x]
            val2 = 0 if x-1 < 0 else output[y][x-1] + input[y][x]
            output[y][x] = max(val1, val2)
    return output

N = 10000
input = np.reshape(np.random.binomial(1, [0.25] * N * N), (N, N))
output = calculate_output(input, N)
However, this computation is not fast enough and takes about 300 seconds on my machine (compared to 3 seconds when implemented in C++).
Is there any way to improve this without writing a C extension?
I have tried using PyPy, but in this case the code is even slower.
CPython is very slow because it is an interpreter, and it clearly cannot compete with C and C++ in such a case. The usual approach to reducing the cost of the interpreter is to avoid loops as much as possible and use a few Numpy vectorized calls instead. However, in this case it is barely possible to write an efficient implementation using Numpy vectorized calls.
On the other hand, PyPy is often much better for numerical code because of its JIT compilation. But its Numpy support is not great, mainly because PyPy relies on a reimplementation of Numpy written in Python, which is not as good as the native Numpy implementation, and the native implementation cannot run efficiently under PyPy because of the way Python C-extension modules are currently implemented. To put it shortly, AFAIK, the PyPy JIT cannot optimize Numpy accesses with the native implementation. As a result, PyPy can be slower than the CPython interpreter in your case.
However, you can speed the code up a lot using the Numba JIT compiler, which has been written for exactly this use case. Moreover, a few optimizations can be applied to speed the code up even more (whatever the programming language used):
conditionals are generally slow; you can move them out of the main loops by handling the borders in separate loops
filling the output matrix with zeros initially is not required and is actually slower
using direct 2D indexing (output[y, x]) is cleaner and likely a bit faster
integers can be used instead of floating-point numbers, since the output contains only integers and integer arithmetic is faster than the same operations on floating-point numbers.
import numpy as np
import numba as nb

@nb.njit(['int32[:,::1](int32[:,::1],int32)', 'int64[:,::1](int64[:,::1],int64)'])
def calculate_output(input, N):
    output = np.empty((N, N), input.dtype)
    for x in range(0, N):
        val2 = 0 if x-1 < 0 else output[0, x-1] + input[0, x]
        output[0, x] = max(0, val2)
    for y in range(1, N):
        val1 = 0 if y-1 < 0 else output[y-1, 0] + input[y, 0]
        output[y, 0] = max(val1, 0)
    for y in range(1, N):
        for x in range(1, N):
            val1 = output[y-1, x] + input[y, x]
            val2 = output[y, x-1] + input[y, x]
            output[y, x] = max(val1, val2)
    return output
The resulting calculate_output call is 730 times faster on my machine.
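For reference, a small driver along the lines of the original snippet (my own; the astype call just makes the input match one of the signatures declared above, and timings will vary):

import numpy as np

N = 10000
# int64 input matches the second declared signature; must be C-contiguous
inp = np.random.binomial(1, 0.25, (N, N)).astype(np.int64)
out = calculate_output(inp, N)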

Faster alternative to numpy for manual element-wise operations on large arrays?

I have some code that was originally written in C (by someone else) using C-style malloc arrays. I later converted a lot of it to C++ style, using vector<vector<vector<complex>>> arrays for consistency with the rest of my project. I never timed it, but both methods seemed to be of similar speed.
I recently started a new project in Python, and I wanted to use some of this old code. Not wanting to move data back and forth between projects, I decided to port this old code into Python so that it's all in one project. I naively typed up all of the code in Python syntax, replacing any arrays in the old code with numpy arrays (initialising them like this: array = np.zeros(list((1024, 1024)), dtype=complex)). The code works fine, but it is excruciatingly slow. If I had to guess, I would say it's on the order of 1000 times slower.
Now having looked into it, I see that a lot of people say numpy is very slow for element-wise operations. While I have used some of the numpy functions for common mathematical operations, such as FFTs and matrix multiplication, most of my code involves nested for loops. A lot of it is pretty complicated and doesn't seem to me to be amenable to reducing to simple array operations that are faster in numpy.
So, I'm wondering if there is an alternative to numpy that is faster for these kind of calculations. The ideal scenario would be that there is a module that I can import that has a lot of the same functionality, so I don't have to rewrite much of my code (i.e., something that can do FFTs and initialises arrays in the same way, etc.), but failing that, I would be happy with something that I could at least use for the more computationally demanding parts of the code and cast back and forth between the numpy arrays as needed.
cpython arrays sounded promising, but a lot of benchmarks I've seen don't show enough of a difference in speed for my purposes. To give an idea of the kind of thing I'm talking about, this is one of the methods that is slowing down my code. This is called millions of times, and the vz_at() method contains a lookup table and does some interpolation to give the final return value:
def tra(self, tr, x, y, z_number, i, scalex, idx, rmax2, rminsq):
    M = 1024
    ixo = int(x[i] / scalex)
    iyo = int(y[i] / scalex)
    nx1 = ixo - idx
    nx2 = ixo + idx
    ny1 = iyo - idx
    ny2 = iyo + idx
    for ix in range(nx1, nx2 + 1):
        rx2 = x[i] - float(ix) * scalex
        rx2 = rx2 * rx2
        ixw = ix
        while ixw < 0:
            ixw = ixw + M
        ixw = ixw % M
        for iy in range(ny1, ny2 + 1):
            rsq = y[i] - float(iy) * scalex
            rsq = rx2 + rsq * rsq
            if rsq <= rmax2:
                iyw = iy
                while iyw < 0:
                    iyw = iyw + M
                iyw = iyw % M
                if rsq < rminsq:
                    rsq = rminsq
                vz = P.vz_at(z_number[i], rsq)
                tr[ixw, iyw] += vz
All up, there are a couple of thousand lines of code; this is just a small snippet to give an example. To be clear, a lot of my arrays are 1024x1024x1024 or 1024x1024 and are complex-valued. Others are one-dimensional arrays on the order of a million elements. What's the best way I can speed these element-wise operations up?
For information, some of your code can be made more concise and thus a bit more readable. For instance:
array = np.zeros(list((1024, 1024)), dtype=complex)
can be written
array = np.zeros((1024, 1024), dtype=complex)
As you are trying out Python, this is at least a nice benefit :-)
Now, for your problem there are several solutions in the current Python scientific landscape:
Numba is a just-in-time compiler for Python that is dedicated to array processing, achieving good performance when NumPy hits its limits.
Pros: little to no modification of your code, as you just write plain Python, and it shows good performance in many situations. Numba should recognize some NumPy operations to avoid a Numba->Python->NumPy slowdown. (A short sketch follows these two options.)
Cons: Can be tedious to install and hence to distribute Numba-based code.
Cython is a mix of Python and C to generate compiled functions. You can start from a pure Python file and accelerate the code via type annotations and the use of some "C"-isms.
Pros: stable, widely used, relatively easy to distribute Cython-based code.
Cons: need to rewrite the performance critical code, even if only in part.
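As a rough illustration of the Numba option: decorating a plain nested-loop function is usually all it takes. This toy function is mine, not the tra method from the question, and is only meant to show the shape of the change:

import numpy as np
from numba import njit

@njit
def scale_in_place(arr, factor):
    # explicit loops compile to machine code; complex128 arrays are supported
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            arr[i, j] *= factor

a = np.zeros((1024, 1024), dtype=complex)
scale_in_place(a, 2.0 + 0.5j)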
As an additional hint, Nicolas Rougier (a French scientist) wrote an online book on many situations where you can make use of NumPy to speed up Python code: http://www.labri.fr/perso/nrougier/from-python-to-numpy/

Improve cython array indexing speed

I have a pretty simple function which I need to speed up. Essentially I have a big array of 16-bit numbers with some holes in it (about 10%). I need to traverse the array, find areas where there are two 0's in a row, then fill them in with the average of the previous and next elements. This takes only a few milliseconds in C, but Python is doing way worse.
I've switched from regular python arrays to numpy arrays, and then compiled my code using cython, but I'm still really far from my target. I was hoping someone with more experience might look at what I'm doing and give me some feedback.
My regular python code looks like this:
self.rawData = numpy.fromfile(ql, numpy.uint16, 50000)
[snip]

def fixZeroes(self):
    for x in range(2, len(self.rawData)):
        if self.rawData[x] == 0 and self.rawData[x-1] == 0:
            self.rawData[x] = (self.rawData[x-2] + self.rawData[x+2]) / 2
            self.rawData[x-1] = (self.rawData[x-3] + self.rawData[x+1]) / 2
My Cython code looks very similar:
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.uint16
ctypedef np.uint16_t DTYPE_t

@cython.boundscheck(False)
def fix_zeroes(np.ndarray[DTYPE_t, ndim=1] raw):
    assert raw.dtype == DTYPE
    cdef int len = 50000
    for x in range(2, len):
        if raw[x] == 0 and raw[x-1] == 0:
            raw[x] = (raw[x-2] + raw[x+2]) / 2
            raw[x-1] = (raw[x-3] + raw[x+1]) / 2
    return raw
When I run this code, the performance is still way slower than I'd like:
Starting cython zero fix
Finished: 0:00:36.983681
starting python zero fix
Finished: 0:00:41.434476
I really think I must be doing something wrong. Almost every article I've seen talks about the huge performance gains numpy and Cython add, but I'm barely breaking 10%.
You should declare the x variable that you are using to index the raw array:
cdef int x
You can also use other directives that usually provide a performance boost:
@cython.wraparound(False)
@cython.cdivision(True)
@cython.nonecheck(False)
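Putting the declaration and the directives together, the function could look roughly like this (an untested sketch based on the question's code; I renamed len to n to avoid shadowing the builtin):

import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.uint16
ctypedef np.uint16_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
@cython.nonecheck(False)
def fix_zeroes(np.ndarray[DTYPE_t, ndim=1] raw):
    assert raw.dtype == DTYPE
    cdef int n = 50000
    cdef int x                       # typing the loop index avoids Python-object indexing
    for x in range(2, n):
        if raw[x] == 0 and raw[x-1] == 0:
            # note: as in the original, x+2 can run past the end of the array
            # for the last couple of elements
            raw[x] = (raw[x-2] + raw[x+2]) / 2
            raw[x-1] = (raw[x-3] + raw[x+1]) / 2
    return raw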

Speeding up dynamic programming in python/numpy

I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. As such, I have a function like:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
which is obviously recursive. P1 is some additive constant. My code, which works more or less, is:
from numpy import array, inf

def optimalcost(cost, P1=10):
    width1, width2 = cost.shape
    M = array(cost)
    for i in range(0, width1):
        for j in range(0, width2):
            try:
                M[i,j] = M[i,j] + min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
            except:
                M[i,j] = inf
    return M
Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts to cutting the time down. However, as I need to evaluate potentially the entire matrix I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1s - maybe memory access?
Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration is dependent unless there's a smarter way to memoize things.
There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?
I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?
#define P1 10
#define MIN(a, b) ((a) < (b) ? (a) : (b))

void optimalcost(double** costin, double** costout){
    /*
    We assume that costout is initially
    filled with costin's values.
    (Rough sketch: boundary handling for i==0 / j==0 and the
    initialisation of prevcost are glossed over here.)
    */
    int i, j;
    float a, b, c, prevcost = 0.0f;
    for(i=0; i<400; i++){
        for(j=0; j<400; j++){
            a = prevcost + P1;
            b = costout[i][j-1] + P1;
            c = costout[i-1][j-1];
            costout[i][j] += MIN(prevcost, MIN(b, c));
            prevcost = costout[i][j];
        }
    }
}
Update:
I'm on Mac, and I don't want to install a whole new Python toolchain so I used Homebrew.
> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba
New "numba'd" code:
from numba import autojit, jit
import time
import numpy as np

@autojit
def cost(left, right):
    height, width = left.shape
    cost = np.zeros((height, width, width))
    for row in range(height):
        for x in range(width):
            for y in range(width):
                cost[row, x, y] = abs(left[row, x] - right[row, y])
    return cost

@autojit
def optimalcosts(initcost):
    costs = np.zeros_like(initcost)
    height = len(initcost)
    for row in range(height):
        costs[row, :, :] = optimalcost(initcost[row])
    return costs

@autojit
def optimalcost(cost):
    width1, width2 = cost.shape
    P1 = 10
    prevcost = 0.0
    M = np.array(cost)
    for i in range(1, width1):
        for j in range(1, width2):
            M[i,j] += min(M[i-1,j-1], prevcost+P1, M[i,j-1]+P1)
            prevcost = M[i,j]
    return M
prob_size = 400
left = np.random.rand(prob_size,prob_size)
right = np.random.rand(prob_size,prob_size)
print '---------- Numba Time ----------'
t = time.time()
c = cost(left,right)
optimalcost(c[100])
print time.time()-t
print '---------- Native python Time --'
t = time.time()
c = cost.py_func(left,right)
optimalcost.py_func(c[100])
print time.time()-t
It's interesting writing code in Python that is so un-Pythonic. Note for anyone interested in writing Numba code, you need to explicitly express loops in your code. Before, I had the neat Numpy one-liner,
abs(left[row,:][:,newaxis] - right[row,:])
to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.
It's an unfair comparison to compare it to native Python code, because Numpy can do that pretty quickly, but:
Numba compiled: 0.509318113327s
Native: 172.70626092s
I'm impressed both by the numbers and how utterly simple the conversion is.
If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic algorithm would probably offer a lot of speedup without making you leave Python.
Numpy is usually not very good at iterative jobs (though it does have some commonly used iterative functions such as np.cumsum, np.cumprod, np.linalg.*, etc.). But for simple tasks like finding the shortest (or lowest-energy) path above, you can vectorize the problem by thinking about what can be computed at the same time (and also try to avoid making copies).
Suppose we are finding the shortest path in the "row" direction (i.e. horizontally); we can first create the algorithm's input:
import numpy as np

# The problem, 300 400*400 matrices
# Create infinitely high boundaries so that we don't need to handle indexing "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:,:,::a.shape[2]-1] = np.inf
then prepare some utility arrays which we will use later (creation takes constant time):
# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(a, (a.shape[1],len(a),a.shape[2]-2,3), (a.strides[1],a.strides[0],a.strides[2],a.strides[2]))
# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)
# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')
and finally do the computation and timeit:
%%timeit
for i in np.arange(a.shape[1]-1):
A[i].min(2, T)
B[i] += T
The timing result on my (super old laptop) machine is 1.78 s, which is already way faster than 3 minutes. I believe you can improve it even more (while sticking to numpy) by optimizing the memory layout and alignment (somehow). Or, you can simply use multiprocessing.Pool. It is easy to use, and this problem is trivial to split into smaller problems (by dividing on the batch axis).
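A rough sketch of that multiprocessing route, reusing optimalcost from the question (the sizes and names here are mine):

from multiprocessing import Pool
import numpy as np

if __name__ == '__main__':
    # 300 independent 400x400 cost matrices; one task per matrix
    costs = list(np.random.rand(300, 400, 400).astype('f'))
    pool = Pool()
    # assumes optimalcost is defined at module level so worker processes can import it
    results = pool.map(optimalcost, costs)
    pool.close()
    pool.join()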

Parallelized code runs much slower in Python than in Matlab

I have a piece of code that does the following:
for each file (already read into RAM):
    call a function and obtain a result
add the results up and display
Each file can be analyzed in parallel. The function that analyzes each file is as follows:
# Complexity = 1000*19*19 units of work
def fun(args):
    (a, b, p) = args
    for itr in range(1000):
        for i in range(19):
            for j in range(19):
                # The following random number generated depends on
                # latest values in (i-1, j), (i+1, j), (i, j-1) & (i, j+1)
                # cells of latest a and b arrays
                u = np.random.rand()
                if (u < p):
                    a[i, j] += -1
                else:
                    b[i, j] += 1
    return a + b
I am using the multiprocessing package to achieve parallelism:
import numpy as np
import time
from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    t = time.time()
    pool = Pool(processes=cpu_count())
    args = [None]*100
    for i in range(100):
        a = np.random.randint(2, size=(19, 19))
        b = np.random.randint(2, size=(19, 19))
        p = np.random.rand()
        args[i] = (a, b, p)
    result = pool.map(fun, args)
    for i in range(2, 100):
        result[0] += result[i]
    print result[0]
    print time.time() - t
I have written equivalent MATLAB code that uses parfor and calls fun in each iteration of parfor:
tic
args = cell(100, 1);
r = cell(100, 1);
parfor i = 1:100
    a = randi(2, 19, 19);
    b = randi(2, 19, 19);
    p = rand();
    args{i}.a = a;
    args{i}.b = b;
    args{i}.p = p;
    r{i} = fun(args{i});
end
for i = 2:100
    r{1} = r{1} + r{i};
end
disp(r{1});
toc
The implementation of fun is as follows:
function [ ret ] = fun( args )
    a = args.a;
    b = args.b;
    p = args.p;
    for itr = 1:1000
        for i = 1:19
            for j = 1:19
                u = rand();
                if (u < p)
                    a(i, j) = a(i, j) + -1;
                else
                    b(i, j) = b(i, j) + 1;
                end
            end
        end
    end
    ret = a + b;
end
I find that MATLAB is blazingly fast, it takes around 1.5 seconds on a dual core processor compared to around 33-34 seconds taken by Python program. Why is this so?
EDIT: A lot of answers suggested that I should vectorize the random number generation. Actually, the way it works is that the random number generated depends on the latest a and b 2D arrays. I just put a simple rand() call in to keep the program simple and readable. In my actual program, the random number is always generated by looking at certain horizontally and vertically neighbouring cells of the (i, j) cell, so it is not possible to vectorize it.
Have you benchmarked both implementations of fun in a non-parallel context? One might just be a lot faster. In particular, those nested loops in the Python fun look like they might turn into a faster vectorized solution in Matlab, or may be optimized by Matlab's JIT.
Stick both implementations in profilers to see where they're spending their time. Convert both implementations to non-parallel and profile those first, to make sure they're equivalent in performance before introducing the complexities of the parallelization stuff.
And one last check - you're setting up Matlab's Parallel Computation Toolbox with a local pool of workers, right, and not hooking in to a remote machine or picking up some other resources? How many workers on the Matlab side?
I did some tests on your Python code, though without the multiprocessing part, and achieved a roughly 25x speedup by making the following changes:
Used Python lists instead of a NumPy array, because the latter is really slow when you need to do a lot of indexing. My timings include the time it takes to do ndarray.tolist(), so this might actually be a viable option, as long as the arrays are not too big, I think.
Ran it in PyPy instead of the regular Python interpreter, because PyPy has a JIT-compiler, to make the comparison with MATLAB a bit more fair. Regular CPython doesn't have such a feature.
Made the "random" call local to the function, i.e. do rand = np.random.rand and later u = rand(), because in Python lookups in the local namespace are faster, which can be significant in tight loops such as this.
Used Python's random.random instead of np.random.rand (also bound to a name local to the function).
Exchanged range for the generator-based xrange.
(This list is sorted from most significant speedup to only very small gain)
Of course there's also the parallel computing aspect. With multiprocessing all the Python objects that are passed around between processes are "pickled", meaning they have to be serialized and de-serialized on top of being copied between processes. MATLAB also copies data between processes (or threads?), but probably does so in a less wasteful way. Next to that, setting up a multiprocessing.Pool also takes a short while, which may not be fair to your MATLAB benchmark, but I'm not sure about this.
Based on your timings and mine, I would say that Python and MATLAB could be about equally fast for this specific task. But, unfortunately, you have to jump through some hoops to get that speed with Python. Maybe using the @autojit capability of Numba could be interesting in this regard, if you have it available.
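To make those changes concrete, the rewritten fun looks roughly like this (a reconstruction of the list-based version described above, not the exact code I timed):

import random
import numpy as np

def fun(args):
    a, b, p = args
    a = a.tolist()            # plain lists index much faster than ndarrays here
    b = b.tolist()
    rand = random.random      # local binding avoids repeated attribute lookups
    for itr in xrange(1000):
        for i in xrange(19):
            for j in xrange(19):
                if rand() < p:
                    a[i][j] += -1
                else:
                    b[i][j] += 1
    # convert back to an ndarray so the caller sees the same result type
    return np.array(a) + np.array(b)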
Try this version of fun and see if it gives you a speedup.
def fun(args):
    a, b, p = args
    n = 1000
    u = np.random.random((n, 19, 19))
    msk = u < p
    msk_sum = msk.sum(0)
    a -= msk_sum
    b += (n - msk_sum)
    return a + b
This is the more efficient way of implementing this kind of function using numpy.
These kinds of nested loops can have quite high overhead in interpreted languages like MATLAB and Python, but I suspect the JIT compensates, at least partially, in MATLAB, so the performance difference between the vectorized and loop implementations will be smaller. CPython does not currently have any optimizations for these kinds of loops (that I know of), but at least one Python implementation, PyPy, does have a JIT. Unfortunately, PyPy currently only has limited numpy support.
Update:
It seems like you have an iterative algorithm, and at least in my experience, those are the hardest to optimize with numpy/CPython. Consider using Cython; this tutorial might also be useful for writing your nested loops. Other people might have other suggestions, but that's the best I can think of.
Given the available information, you would probably benefit from using Cython, which lets you also use some parallelism. Depending on the random numbers you need, you could for instance use GSL to generate them.
Original question
There's no need to use multiprocessing, because fun can be easily vectorized, which yields a massive speed up (over 50x).
I'm not sure why MATLAB is not hit as hard as numpy, but its JIT may be saving it. Python does not like doing two attribute lookups (the two dots in np.random.rand) in the inner loop, nor calling expensive functions there.
def fun_fast(args):
    a, b, p = args
    for i in xrange(19):
        for j in xrange(19):
            u = np.random.rand(1000)
            msk = u < p
            msk_sum = msk.sum()
            a[i, j] -= msk_sum
            b[i, j] += msk.size - msk_sum
    return a + b
