I have a pretty simple function which I need to speed up. Essentially I have a big array of 16-bit numbers with some holes in it (about 10% of the entries). I need to traverse the array, find areas where there are two 0s in a row, then fill them in with the average of the previous and next elements. This takes only a few milliseconds in C, but Python is doing far worse.
I've switched from regular python arrays to numpy arrays, and then compiled my code using cython, but I'm still really far from my target. I was hoping someone with more experience might look at what I'm doing and give me some feedback.
My regular python code looks like this:
self.rawData = numpy.fromfile(ql, numpy.uint16, 50000)

[snip]

def fixZeroes(self):
    for x in range(2, len(self.rawData)):
        if self.rawData[x] == 0 and self.rawData[x-1] == 0:
            self.rawData[x] = (self.rawData[x-2] + self.rawData[x+2]) / 2
            self.rawData[x-1] = (self.rawData[x-3] + self.rawData[x+1]) / 2
My Cython code looks very similar:
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.uint16
ctypedef np.uint16_t DTYPE_t

@cython.boundscheck(False)
def fix_zeroes(np.ndarray[DTYPE_t, ndim=1] raw):
    assert raw.dtype == DTYPE
    cdef int len = 50000
    for x in range(2, len):
        if raw[x] == 0 and raw[x-1] == 0:
            raw[x] = (raw[x-2] + raw[x+2]) / 2
            raw[x-1] = (raw[x-3] + raw[x+1]) / 2
    return raw
When I run this code, the performance is still way slower than I'd like:
Starting cython zero fix
Finished: 0:00:36.983681
starting python zero fix
Finished: 0:00:41.434476
I really think I must be doing something wrong. Almost every article I've seen talks about the huge performance gains numpy and cython provide, but I'm barely seeing a 10% improvement.
You should declare the x variable that you are using to index the raw array:

cdef int x

You can also use other directives that usually provide a performance boost:

@cython.wraparound(False)
@cython.cdivision(True)
@cython.nonecheck(False)
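Putting those suggestions together, a sketch of the revised function might look like the following. The length, dtype, and overall logic are taken straight from the question; the loop range is trimmed at both ends so the x+2 and x-3 lookups stay in bounds once bounds-checking and wraparound are disabled (treat this as illustrative, not a verified drop-in):

import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.uint16
ctypedef np.uint16_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
@cython.nonecheck(False)
def fix_zeroes(np.ndarray[DTYPE_t, ndim=1] raw):
    assert raw.dtype == DTYPE
    cdef int n = raw.shape[0]
    cdef int x                      # typed index keeps the whole loop in C
    for x in range(3, n - 2):       # leave room for the x-3 and x+2 lookups
        if raw[x] == 0 and raw[x-1] == 0:
            raw[x] = (raw[x-2] + raw[x+2]) / 2
            raw[x-1] = (raw[x-3] + raw[x+1]) / 2
    return raw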
Description
I simply want to create a matrix of rows x cols that is filled with 0s. Since I always work with numpy, I thought using np.zeros as described in the docs would be the easiest:
DTYPE = np.int
ctypedef np.int_t DTYPE_t

def f1():
    cdef:
        int dim = 40000
        int i, j
        np.ndarray[DTYPE_t, ndim=2] mat = np.zeros([40000, 40000], dtype=DTYPE)
    for i in range(dim):
        for j in range(dim):
            mat[i, j] = 1
Then I compared this with using plain C arrays:
def f2():
    cdef:
        int dim = 40000
        int[40000][40000] mat
        int i, j
    for i in range(dim):
        for j in range(dim):
            mat[i][j] = 1
The numpy version took 3 secs on my PC, whereas the C version only took 2.4e-5 secs. However, when I return the array from f2() I noticed it is not zero-filled (of course here it can't be, since I fill it with 1s; but even when I don't fill it, it won't return a zero array either). How can this be done in Cython? I know in regular C it would be: int arr[n][m] = {};.
Question
How can the C array be filled with 0s? (I would go for numpy instead if there is something obviously wrong in my code)
You do not want to be writing code like this:
int[40000][40000] mat generates a 6 gigabyte array on the stack (assuming 4-byte ints). Typical maximum stack sizes are on the order of a few MB. I have no idea how this isn't crashing your PC.
However when I return the array from f2() [...]
The array you have allocated is completely local to the function. From a C point of view you cannot return it since it ceases to exist after the function has finished. I think Cython may convert it to a (nested) Python list for you. This requires a slow copy element-by-element and is not what you want.
For what you're doing here you're much better just using Numpy.
Cython doesn't have a good equivalent of C's arr = {}, so if you do want to initialize sensible, small C arrays you need to use one of the following (a sketch follows this list):
loops,
memset (which you can cimport from libc.string),
a typed memoryview of the array, with memview[:, :] = 0.
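For example, a minimal sketch with a deliberately small, hypothetical 100x100 array, showing the memset and memoryview options side by side:

from libc.string cimport memset

def zeroed_small_array():
    cdef int mat[100][100]       # small enough to live on the stack
    cdef int[:, :] mv = mat      # typed memoryview over the C array

    # option 1: memset the whole block to zero
    memset(&mat[0][0], 0, sizeof(mat))

    # option 2: slice-assignment through the memoryview
    mv[:, :] = 0

    return mat[0][0]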
The numpy version took 3 secs on my pc whereas the c version only took 2.4e-5 secs.
This kind of difference usually suggests that the C compiler has optimized some code out (by detecting that the result is unused). It is unlikely to be a genuine speed-up.
I've started messing around with parallel programming and cython/openmp, and I have a simple program that sums over an array using prange:
import numpy as np
from cython.parallel import prange
from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def parallel_summation(double[:] vec):
    cdef int n = vec.shape[0]
    cdef double total
    cdef int i
    for i in prange(n, nogil=True):
        total += vec[i]
    return total
It seems to work OK with a setup.py file. However, I was wondering if it is possible to adjust this function and have a little more control over what the processors are doing.
Let's say I have 4 processors: I want to split the vector to be summed into 4 parts, and then have each processor locally add the elements inside. Then at the end, I can combine the results from each processor to get the total sum. From the cython documentation, I wasn't able to gather whether something like this is possible or not (the documentation is a little sparse).
I'd appreciate it if someone could explain if/how something like this is done using cython/openmp, or maybe help locate some relevant examples (it's been surprisingly hard to find simple ones online).
I want to split the vector to be summed into 4 parts, and then have each processor locally add the elements inside. Then at the end, I can combine the results from each processor to get the total sum.
That's exactly what's happening here already. Cython infers from your in-place operation that you want to do a reduction. OpenMP will implement a parallel loop with private (zero-initialized) copies of the total variable and add them all to total at the end of the loop.
In the generated C, this looks like this:
#pragma omp parallel
{
#pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) reduction(+:__pyx_v_total)
for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
{
__pyx_v_i = (int)(0 + 1 * __pyx_t_2);
__pyx_t_4 = __pyx_v_i;
__pyx_v_total = (__pyx_v_total + (*((double *) ( /* dim=0 */ (__pyx_v_vec.data + __pyx_t_4 * __pyx_v_vec.strides[0]) ))));
}
}
}
You just need to enable OpenMP as described here.
The one thing that you should change in your code is to initialize total = 0; otherwise it's just an uninitialized C variable which may contain garbage.
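With that one change, the function from the question becomes:

import numpy as np
from cython.parallel import prange
from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def parallel_summation(double[:] vec):
    cdef int n = vec.shape[0]
    cdef double total = 0           # initialized, so the reduction starts from zero
    cdef int i
    for i in prange(n, nogil=True):
        total += vec[i]
    return total

If you do want more explicit control over how the work is divided, prange also accepts num_threads and schedule arguments; for example prange(n, nogil=True, num_threads=4, schedule='static') gives each of the 4 threads one contiguous chunk of the vector, which is essentially the manual split you describe.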
I have been playing around with writing cffi modules in python, and their speed is making me wonder if I'm using standard python correctly. It's making me want to switch to C completely! Truthfully there are some great python libraries I could never reimplement myself in C so this is more hypothetical than anything really.
This example shows the sum function in python being used with a numpy array, and how slow it is in comparison with a c function. Is there a quicker pythonic way of computing the sum of a numpy array?
import numpy as np
from cffi import FFI

def cast_matrix(matrix, ffi):
    ap = ffi.new("double* [%d]" % (matrix.shape[0]))
    ptr = ffi.cast("double *", matrix.ctypes.data)
    for i in range(matrix.shape[0]):
        ap[i] = ptr + i * matrix.shape[1]
    return ap
ffi = FFI()
ffi.cdef("""
    double sum(double**, int, int);
""")
C = ffi.verify("""
double sum(double** matrix, int x, int y){
    int i, j;
    double sum = 0.0;
    for (i=0; i<x; i++){
        for (j=0; j<y; j++){
            sum = sum + matrix[i][j];
        }
    }
    return(sum);
}
""")
m = np.ones(shape=(10,10))
print 'numpy says', m.sum()
m_p = cast_matrix(m, ffi)
sm = C.sum(m_p, m.shape[0], m.shape[1])
print 'cffi says', sm
just to show the function works:
numpy says 100.0
cffi says 100.0
now if I time this simple function I find that numpy is really slow!
Am I using numpy in the correct way? Is there a faster way to calculate the sum in python?
import time
n = 1000000
t0 = time.time()
for i in range(n): C.sum(m_p, m.shape[0], m.shape[1])
t1 = time.time()
print 'cffi', t1-t0
t0 = time.time()
for i in range(n): m.sum()
t1 = time.time()
print 'numpy', t1-t0
times:
cffi 0.818415880203
numpy 5.61657714844
Numpy is slower than C for two reasons: the Python overhead (probably similar to cffi) and generality. Numpy is designed to deal with arrays of arbitrary dimensions, in a bunch of different data types. Your example with cffi was made for a 2D array of floats. The cost was writing several lines of code vs .sum(), 6 characters to save less than 5 microseconds. (But of course, you already knew this). I just want to emphasize that CPU time is cheap, much cheaper than developer time.
Now, if you want to stick with Numpy and you want better performance, your best option is to use Bottleneck. It provides a few functions optimised for 1D and 2D arrays of floats and doubles, and they are blazing fast. In your case, about 16 times faster than numpy, which would put the execution time around 0.35 s, or roughly twice as fast as cffi.
For other functions that Bottleneck does not have, you can use Cython. It helps you write C code with a more Pythonic syntax. Or, if you prefer, progressively convert Python into C until you are happy with the speed.
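A minimal sketch of the Bottleneck route (assuming the package is installed; bn.nansum is its summation routine, which treats NaNs as zero, so for NaN-free data it matches m.sum()):

import numpy as np
import bottleneck as bn

m = np.ones(shape=(10, 10))

# same result as m.sum() for NaN-free data, but dispatched
# to a routine specialised for 2D float64 arrays
print 'bottleneck says', bn.nansum(m)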
I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. As such, I have a function like:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
which is obviously recursive. P1 is some additive constant. My code, which works more or less, is:
from numpy import array, inf

def optimalcost(cost, P1=10):
    width1, width2 = cost.shape
    M = array(cost)
    for i in range(0, width1):
        for j in range(0, width2):
            try:
                M[i, j] = M[i, j] + min(M[i-1, j-1], M[i-1, j] + P1, M[i, j-1] + P1)
            except:
                M[i, j] = inf
    return M
Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts to cutting the time down. However, as I need to evaluate potentially the entire matrix I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1s - maybe memory access?
Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration is dependent unless there's a smarter way to memoize things.
There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?
I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?
#include <math.h>
#define P1 10
#define min(a, b) fmin((a), (b))

void optimalcost(double** costin, double** costout){
    /*
      We assume that costout is initially
      filled with costin's values.
    */
    int i, j;
    double a, b, c, prevcost = 0.0;

    /* start at 1 so the [i-1] and [j-1] accesses stay in bounds */
    for(i = 1; i < 400; i++){
        for(j = 1; j < 400; j++){
            a = prevcost + P1;
            b = costout[i][j-1] + P1;
            c = costout[i-1][j-1];
            costout[i][j] += min(a, min(b, c));
            prevcost = costout[i][j];
        }
    }
    return;
}
Update:
I'm on Mac, and I don't want to install a whole new Python toolchain so I used Homebrew.
> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba
New "numba'd" code:
from numba import autojit, jit
import time
import numpy as np

@autojit
def cost(left, right):
    height, width = left.shape
    cost = np.zeros((height, width, width))
    for row in range(height):
        for x in range(width):
            for y in range(width):
                cost[row, x, y] = abs(left[row, x] - right[row, y])
    return cost

@autojit
def optimalcosts(initcost):
    costs = np.zeros_like(initcost)
    for row in range(len(initcost)):
        costs[row, :, :] = optimalcost(initcost[row])
    return costs

@autojit
def optimalcost(cost):
    width1, width2 = cost.shape
    P1 = 10
    prevcost = 0.0
    M = np.array(cost)
    for i in range(1, width1):
        for j in range(1, width2):
            M[i, j] += min(M[i-1, j-1], prevcost + P1, M[i, j-1] + P1)
            prevcost = M[i, j]
    return M
prob_size = 400
left = np.random.rand(prob_size,prob_size)
right = np.random.rand(prob_size,prob_size)
print '---------- Numba Time ----------'
t = time.time()
c = cost(left,right)
optimalcost(c[100])
print time.time()-t
print '---------- Native python Time --'
t = time.time()
c = cost.py_func(left,right)
optimalcost.py_func(c[100])
print time.time()-t
It's interesting writing code in Python that is so un-Pythonic. A note for anyone interested in writing Numba code: you need to express loops explicitly. Before, I had the neat Numpy one-liner,
abs(left[row,:][:,newaxis] - right[row,:])
to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.
It's unfair to compare this against native Python code, because Numpy can do that operation pretty quickly itself, but:
Numba compiled: 0.509318113327s
Native: 172.70626092s
I'm impressed both by the numbers and how utterly simple the conversion is.
If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic algorithm would probably offer a lot of speedup without making you leave Python.
Numpy is usually not very good at iterative jobs (though it does have some commonly used iterative functions such as np.cumsum, np.cumprod, np.linalg.*, etc.). But for simple tasks like finding the shortest path (or lowest-energy path) above, you can vectorize the problem by thinking about what can be computed at the same time (and also try to avoid making copies):
Suppose we are finding the shortest path in the "row" direction (i.e. horizontally). We can first create our algorithm input:
# The problem, 300 400*400 matrices
# Create infinitely high boundary so that we dont need to handle indexing "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:,:,::a.shape[2]-1] = np.inf
then prepare some utility arrays which we will use later (creation takes constant time):
# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(a, (a.shape[1],len(a),a.shape[2]-2,3), (a.strides[1],a.strides[0],a.strides[2],a.strides[2]))
# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)
# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')
and finally do the computation and timeit:
%%timeit
for i in np.arange(a.shape[1]-1):
    A[i].min(2, T)
    B[i] += T
The timing result on my (super old laptop) machine is 1.78 s, which is already way faster than 3 minutes. I believe you can improve it even more (while sticking to numpy) by optimizing the memory layout and alignment (somehow). Or, you can simply use multiprocessing.Pool. It is easy to use, and this problem is trivial to split into smaller problems (by dividing on the batch axis).
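A rough sketch of the multiprocessing.Pool route, splitting on the batch axis (the mymodule import is just a stand-in for wherever the question's plain-Python optimalcost lives, and the shapes are the ones quoted in the question):

import numpy as np
from multiprocessing import Pool
from mymodule import optimalcost     # hypothetical module holding the question's function

if __name__ == '__main__':
    # 300 independent 400x400 cost matrices
    costs = np.random.rand(300, 400, 400).astype('f')

    pool = Pool(4)                           # one worker per core
    results = pool.map(optimalcost, costs)   # iterating the array yields one 2D matrix per task
    pool.close()
    pool.join()

    out = np.array(results)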
I have an analysis code that does some heavy numerical operations using numpy. Just out of curiosity, I tried compiling it with Cython with only small changes, and then I rewrote the numpy parts using explicit loops.
To my surprise, the code based on loops was much faster (8x). I cannot post the complete code, but I put together a very simple, unrelated computation that shows similar behavior (albeit the timing difference is not so big):
Version 1 (without cython)
import numpy as np

def _process(array):
    rows = array.shape[0]
    cols = array.shape[1]
    out = np.zeros((rows, cols))
    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)
    return out

def main():
    data = np.load('data.npy')
    out = _process(data)
    np.save('vianumpy.npy', out)
Version 2 (building a module with cython)
import cython
cimport cython
import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):
    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)
    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('viacynpy.npy', out)
Version 3 (building a module with cython)
import cython
cimport cython
import numpy as np
cimport numpy as np
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row
cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
for row in range(0, rows):
for col in range(0, cols):
for row2 in range(0, rows):
out[row, col] += array[row2, col] - array[row, col]
return out
def main():
cdef np.ndarray[DTYPE_t, ndim=2] data
cdef np.ndarray[DTYPE_t, ndim=2] out
data = np.load('data.npy')
out = _process(data)
np.save('vialoop.npy', out)
With a 10000x10 matrix saved in data.npy, the times are:
$ python -m timeit -c "from version1 import main;main()"
10 loops, best of 3: 4.56 sec per loop
$ python -m timeit -c "from version2 import main;main()"
10 loops, best of 3: 4.57 sec per loop
$ python -m timeit -c "from version3 import main;main()"
10 loops, best of 3: 2.96 sec per loop
Is this expected, or is there an optimization that I am missing? The fact that versions 1 and 2 give the same result is somewhat expected, but why is version 3 faster?
P.S. This is NOT the calculation that I need to make, just a simple example that shows the same behavior.
With slight modification, version 3 becomes twice as fast:
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def process2(np.ndarray[DTYPE_t, ndim=2] array):
    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row, col, row2
    # zero-initialize: the loop below accumulates with +=
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
    for row in range(rows):
        for row2 in range(rows):
            for col in range(cols):
                out[row, col] += array[row2, col] - array[row, col]
    return out
The bottleneck in your calculation is memory access. Your input array is C ordered, which means that moving along the last axis makes the smallest jump in memory. Therefore your inner loop should be along axis 1, not axis 0. Making this change cuts the run time in half.
If you need to use this function on small input arrays then you can reduce the allocation overhead by using np.empty instead of np.zeros (provided you then assign, rather than accumulate into, every element). To reduce the overhead further, use PyArray_EMPTY from the numpy C API.
If you use this function on very large input arrays (more than 2**31 elements) then the integers used for indexing (and in the range function) will overflow. To be safe, use:
cdef Py_ssize_t rows = array.shape[0]
cdef Py_ssize_t cols = array.shape[1]
cdef Py_ssize_t row, col, row2
instead of
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2
Timing:
In [2]: a = np.random.rand(10000, 10)
In [3]: timeit process(a)
1 loops, best of 3: 3.53 s per loop
In [4]: timeit process2(a)
1 loops, best of 3: 1.84 s per loop
where process is your version 3.
As mentioned in the other answers, version 2 is essentially the same as version 1, since Cython is unable to dig into the array access operator in order to optimise it. There are two reasons for this.
First, there is a certain amount of overhead in each call to a numpy function, compared to optimised C code. However, this overhead becomes less significant if each operation deals with large arrays.
Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as out[row, :] = A[row, :] + B[row, :]*C[row, :]. In this case a whole array B*C must be created in memory, then added to A. This means that the CPU cache is being thrashed, as data is being read from and written to memory rather than being kept in the CPU and used straight away. Importantly, this problem becomes worse if you are dealing with large arrays.
Particularly since you state that your real code is more complex than your example, and it shows a much greater speedup, I suspect that the second reason is likely to be the main factor in your case.
As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr, although of course cython is useful in many more situations so it may be the better approach for you.
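For instance, a small sketch with made-up arrays (numexpr compiles the string expression and evaluates it blockwise, so the full-size B*C temporary is never materialised):

import numpy as np
import numexpr as ne

A = np.random.rand(10000, 10)
B = np.random.rand(10000, 10)
C = np.random.rand(10000, 10)

# evaluated block by block in compiled code; no full-size B*C temporary
out = ne.evaluate("A + B * C")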
I would recommend using the -a flag to have cython generate the html file that shows what is being translated into pure c vs calling the python API:
http://docs.cython.org/src/quickstart/cythonize.html
Version 2 gives nearly the same result as Version 1, because all of the heavy lifting is being done by the Python API (via numpy) and cython isn't doing anything for you. In fact on my machine, numpy is built against MKL, so when I compile the cython generated c code using gcc, Version 3 is actually a little slower than the other two.
Cython shines when you are doing an array manipulation that numpy can't do in a 'vectorized' way, or when you are doing something memory intensive that it allows you to avoid creating a large temporary array. I've gotten 115x speed-ups using cython vs numpy for some of my own code:
https://github.com/synapticarbors/pylangevin-integrator
Part of that was calling randomkit directly at the level of the C code instead of calling it through numpy.random, but most of it was cython translating the computationally intensive for loops into pure C without calls to Python.
The difference may be due to version 1 and 2 doing a Python-level call to np.sum() for each row, while version 3 likely compiles to a tight, pure C loop.
Studying the difference between version 2 and 3's Cython-generated C source should be enlightening.
I'd guess the main overhead you are saving is the temporary arrays created. You create a great big array, array - array[row, :], then reduce it into a smaller array using sum. But building that big temporary array won't be free, especially if you need to allocate memory.
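One illustrative way to see this while staying in numpy (a sketch, not the asker's code) is to allocate that temporary once and reuse it across rows:

import numpy as np

def _process_reuse_tmp(array):
    # same computation as version 1, but the big temporary is allocated only once
    rows, cols = array.shape
    out = np.zeros((rows, cols))
    tmp = np.empty_like(array)
    for row in range(rows):
        np.subtract(array, array[row, :], out=tmp)   # fills tmp in place
        tmp.sum(axis=0, out=out[row])                # reduces straight into the output row
    return out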