Description
I simply want to create a rows x cols matrix filled with 0s. Since I always work with numpy, I thought using np.zeros as described in the docs would be the easiest:
import numpy as np
cimport numpy as np

DTYPE = np.int
ctypedef np.int_t DTYPE_t

def f1():
    cdef:
        int dim = 40000
        int i, j
        np.ndarray[DTYPE_t, ndim=2] mat = np.zeros([40000, 40000], dtype=DTYPE)
    for i in range(dim):
        for j in range(dim):
            mat[i, j] = 1
Then I compared this against the same loop using a plain C array:
def f2():
    cdef:
        int dim = 40000
        int[40000][40000] mat
        int i, j
    for i in range(dim):
        for j in range(dim):
            mat[i][j] = 1
The numpy version took 3 secs on my PC whereas the C version only took 2.4e-5 secs. However, when I return the array from f2() I noticed it is not zero-filled (of course here it can't be, since the loop fills it; but even when I don't fill it, it won't come back as a 0 array either). How can this be done in Cython? I know in regular C it would be: int arr[n][m] = {};.
Question
How can the C array be filled with 0s? (I would go for numpy instead if there is something obviously wrong in my code.)
You do not want to be writing code like this:
int[40000][40000] mat generates a roughly 6 gigabyte array on the stack (assuming 4-byte ints). Typical maximum stack sizes are on the order of a few MB. I have no idea how this isn't crashing your PC.
However when I return the array from f2() [...]
The array you have allocated is completely local to the function. From a C point of view you cannot return it, since it ceases to exist after the function has finished. I think Cython may convert it to a (nested) Python list for you. This requires a slow element-by-element copy and is not what you want.
For what you're doing here, you're much better off just using Numpy.
Cython doesn't support a good equivalent of the C arr = {}, so if you do want to initialize sensible, small C arrays, you need to use one of the following (sketched below):
a loop,
memset (which you can cimport from libc.string),
a typed memoryview of the array: memview[:, :] = 0.
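A minimal sketch of the memset and memoryview options, assuming an array small enough to actually live on the stack:

from libc.string cimport memset

def init_small():
    cdef int mat[100][100]              # small enough for the stack
    # option 1: zero the whole block with memset
    memset(&mat[0][0], 0, sizeof(mat))
    # option 2: coerce the C array to a typed memoryview and slice-assign
    cdef int[:, :] mv = mat
    mv[:, :] = 0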
The numpy version took 3 secs on my pc whereas the c version only took 2.4e-5 secs.
This kind of difference usually suggests that the C compiler has optimized some code out (by detecting that the result is unused). It is unlikely to be a genuine speed-up.
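As a hypothetical illustration, consuming the result makes it harder (though not impossible) for the compiler to discard the work, and usually gives a more honest timing:

def f2_checked():
    cdef int dim = 500                  # small enough for the stack (~1 MB)
    cdef int mat[500][500]
    cdef int i, j
    for i in range(dim):
        for j in range(dim):
            mat[i][j] = 1
    return mat[dim - 1][dim - 1]        # using a value keeps the loops observable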
Related
I am currently passing the following pointer-to-pointer from Cython to C:
# convert the input Python 2D array to a memory view
cdef double[:,:] a_cython = np.asarray(a, order="C")
# define a pointer-to-pointer with the dimensions of a
cdef double** point_to_a = <double **>malloc(N * sizeof(double*))
# initialize the pointer
if not point_to_a: raise MemoryError
#try:
for i in range(N):
    point_to_a[i] = &a_cython[i, 0]
# pass this double pointer to a C function
logistic_sigmoid(&point_to_a[0], N, M)
where a is a numpy array with dimensions N x M, and point_to_a is a Cython pointer-to-pointer referring to the Cython memoryview a_cython. Since the input a from Python is a 2-dimensional array, I thought this was the best approach to pass the data directly to C.
The call goes smoothly and the computation is done correctly. However, I am now trying to convert point_to_a back to a numpy array, but I am struggling a bit.
I am considering various solutions. I would like to explore whether it's possible to keep an N dimensional array throughout the entire process, so I was experimenting with this approach in Cython:
# define an integer array for the dimensions
cdef np.npy_intp dims[2]
dims[0] = N
dims[1] = M
# create a new ndarray via PyArray_SimpleNewFromData to wrap the pointer
cdef np.ndarray[double, ndim=2] new_a = np.PyArray_SimpleNewFromData(2, &dims[0], np.NPY_DOUBLE, point_to_a)
However, when I convert new_a to a numpy array with array = np.asarray(new_a), I get an array containing only 0s.
Do you have any ideas?
Thanks very much
As soon as you use int** (or similar), your data is in a so-called indirect memory layout. Cython's typed memory views support indirect memory layout (see for example Cython: understanding a typed memoryview with an indirect_contiguous memory layout), however there are not many classes implementing this interface.
Numpy's ndarrays do not implement indirect memory layout - they only support direct memory layouts (i.e. a pointer of type int*, not int**), so passing an int** to a numpy array will do no good.
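For comparison, a hedged sketch of what PyArray_SimpleNewFromData actually expects - the direct buffer &a_cython[0, 0], not the double** table (names taken from the question):

cimport numpy as np
np.import_array()                       # required before calling the numpy C API

cdef np.npy_intp dims[2]
dims[0] = N
dims[1] = M
# wraps the existing direct buffer without copying; a_cython must stay alive
cdef np.ndarray new_a = np.PyArray_SimpleNewFromData(
    2, &dims[0], np.NPY_DOUBLE, <void*>&a_cython[0, 0])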
The good thing is that, because you share the memory with a_cython, the values were already updated in place. You can get the underlying numpy array by returning the base object of the typed memory view, i.e.
return a_cython.base # returns 2d-numpy array.
There is no need to copy memory at all!
There are however some issues with memory management (e.g. you need to free point_to_a).
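A minimal sketch of that cleanup, assuming the names from the question (a_cython, logistic_sigmoid, N, M):

from libc.stdlib cimport malloc, free

cdef double** point_to_a = <double **>malloc(N * sizeof(double*))
cdef Py_ssize_t i
if not point_to_a:
    raise MemoryError
try:
    for i in range(N):
        point_to_a[i] = &a_cython[i, 0]
    logistic_sigmoid(&point_to_a[0], N, M)
finally:
    free(point_to_a)    # frees only the pointer table; the data belongs to a_cython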
This is maybe overkill in your case, but I'll use the opportunity to shamelessly plug a library of mine, indirect_buffer: because alternatives for indirect memory layout buffers are scarce, and from time to time one needs one, I've created one to avoid always writing the same code.
With indirect_buffer, your function could look like the following:
%%cython
# just an example of a C function
cdef extern from *:
    """
    void fillit(int** ptr, int N, int M){
        int cnt=0;
        for(int i=0;i<N;i++){
            for(int j=0;j<M;j++){
                ptr[i][j]=cnt++;
            }
        }
    }
    """
    void fillit(int** ptr, int N, int M)

from indirect_buffer.buffer_impl cimport IndirectMemory2D

def py_fillit(a):
    # create the collection; it is a view of a
    indirect_view = IndirectMemory2D.cy_view_from_rows(a, readonly=False)
    fillit(<int**>indirect_view.ptr, indirect_view.shape[0], indirect_view.shape[1])
    # values are updated directly in a
which now can be used, for example:
import numpy as np
a=np.zeros((3,4), dtype=np.int32)
py_fillit(a)
print(a)
# prints as expected:
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11]])
The above version does a lot of things right: memory management, locking of buffers and so on.
I have a pretty simple function which I need to speed up. Essentially I have a big array of 16-bit numbers with some holes in it (about 10%). I need to traverse the array, find areas where there are two 0s in a row, then fill them in with the average of the previous and next elements. This takes only a few milliseconds in C, but Python is doing way worse.
I've switched from regular python arrays to numpy arrays, and then compiled my code using cython, but I'm still really far from my target. I was hoping someone with more experience might look at what I'm doing and give me some feedback.
My regular python code looks like this:
self.rawData = numpy.fromfile(ql, numpy.uint16, 50000)
[snip]
def fixZeroes(self):
    for x in range(2, len(self.rawData)):
        if self.rawData[x] == 0 and self.rawData[x-1] == 0:
            self.rawData[x] = (self.rawData[x-2] + self.rawData[x+2]) / 2
            self.rawData[x-1] = (self.rawData[x-3] + self.rawData[x+1]) / 2
My Cython code looks very similar:
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.uint16
ctypedef np.uint16_t DTYPE_t

@cython.boundscheck(False)
def fix_zeroes(np.ndarray[DTYPE_t, ndim=1] raw):
    assert raw.dtype == DTYPE
    cdef int len = 50000
    for x in range(2, len):
        if raw[x] == 0 and raw[x-1] == 0:
            raw[x] = (raw[x-2] + raw[x+2]) / 2
            raw[x-1] = (raw[x-3] + raw[x+1]) / 2
    return raw
When I run this code, the performance is still way slower than I'd like:
Starting cython zero fix
Finished: 0:00:36.983681
starting python zero fix
Finished: 0:00:41.434476
I really think I must be doing something wrong. Almost every article I've seen talks about the huge performance gains numpy and cython bring, but I'm barely seeing a 10% improvement.
You should declare the x variable that you are using to index the raw array:
cdef int x
You can also use other directives that usually provide a performance boost:
@cython.wraparound(False)
@cython.cdivision(True)
@cython.nonecheck(False)
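Putting the advice together, the hot loop might look like this (a sketch; note the loop bound stops two elements early so that raw[x+2] stays in bounds once bounds checking is off):

import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.uint16
ctypedef np.uint16_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def fix_zeroes(np.ndarray[DTYPE_t, ndim=1] raw):
    cdef int x                          # typed index: the loop compiles to plain C
    for x in range(2, raw.shape[0] - 2):
        if raw[x] == 0 and raw[x-1] == 0:
            raw[x] = (raw[x-2] + raw[x+2]) / 2
            raw[x-1] = (raw[x-3] + raw[x+1]) / 2
    return raw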
I have been playing around with writing cffi modules in Python, and their speed is making me wonder if I'm using standard Python correctly. It's making me want to switch to C completely! Truthfully, there are some great Python libraries I could never reimplement myself in C, so this is more hypothetical than anything.
This example shows the sum function in python being used with a numpy array, and how slow it is in comparison with a c function. Is there a quicker pythonic way of computing the sum of a numpy array?
import numpy as np
from cffi import FFI

def cast_matrix(matrix, ffi):
    # build a double*[rows] table whose entries point at each row of the matrix
    ap = ffi.new("double* [%d]" % (matrix.shape[0]))
    ptr = ffi.cast("double *", matrix.ctypes.data)
    for i in range(matrix.shape[0]):
        ap[i] = ptr + i*matrix.shape[1]
    return ap
ffi = FFI()
ffi.cdef("""
double sum(double**, int, int);
""")
C = ffi.verify("""
double sum(double** matrix, int x, int y){
    int i, j;
    double sum = 0.0;
    for (i=0; i<x; i++){
        for (j=0; j<y; j++){
            sum = sum + matrix[i][j];
        }
    }
    return(sum);
}
""")
m = np.ones(shape=(10,10))
print 'numpy says', m.sum()
m_p = cast_matrix(m, ffi)
sm = C.sum(m_p, m.shape[0], m.shape[1])
print 'cffi says', sm
Just to show the function works:
numpy says 100.0
cffi says 100.0
Now if I time this simple function, I find that numpy is really slow!
Am I using numpy in the correct way? Is there a faster way to calculate the sum in python?
import time
n = 1000000
t0 = time.time()
for i in range(n): C.sum(m_p, m.shape[0], m.shape[1])
t1 = time.time()
print 'cffi', t1-t0
t0 = time.time()
for i in range(n): m.sum()
t1 = time.time()
print 'numpy', t1-t0
times:
cffi 0.818415880203
numpy 5.61657714844
Numpy is slower than C for two reasons: the Python overhead (probably similar to cffi's) and generality. Numpy is designed to deal with arrays of arbitrary dimensions, in a bunch of different data types. Your cffi example was made for a 2D array of doubles. The cost was writing several lines of code vs the 6 characters of .sum(), to save less than 5 microseconds. (But of course, you already knew this.) I just want to emphasize that CPU time is cheap, much cheaper than developer time.
Now, if you want to stick with Numpy and get better performance, your best option is to use Bottleneck. It provides a few functions optimised for 1D and 2D arrays of floats and doubles, and they are blazing fast. In your case, 16 times faster, which would put the execution time at about 0.35 s, or about twice as fast as cffi.
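For example, a sketch assuming Bottleneck is installed (bn.nansum is its optimised sum; it treats NaN as zero):

import numpy as np
import bottleneck as bn

m = np.ones(shape=(10, 10))
print 'bottleneck says', bn.nansum(m)   # C loop specialised for this dtype/ndim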
For other functions that Bottleneck does not have, you can use Cython. It helps you write C code with a more pythonic syntax. Or, if you will, progressively convert Python into C until you are happy with the speed.
I'm trying to solve the bottleneck in my application, which is an elementwise sum of two matrices.
I'm using NumPy and Cython. I have a cdef class with a matrix attribute. Since Cython still doesn't support buffer arrays in class attributes, I followed this and tried to use a pointer to the data attribute of the matrix. The thing is, I'm sure I'm doing something wrong, as the results indicate.
What I tried to do is more or less the following:
cdef class the_class:
    cdef np.ndarray the_matrix
    cdef float_t* the_matrix_p
    def __init__(self):
        the_matrix_p = <float_t*> self.the_matrix.data
    cpdef the_function(self):
        other_matrix = self.get_other_matrix()
        the_matrix_p += other_matrix.data
I have serious doubts that adding two numpy arrays is a bottleneck you can solve by rewriting things in C. See the following code, which uses scipy.weave:
import numpy as np
from scipy.weave import inline

a = np.random.rand(10000000)
b = np.random.rand(10000000)
c = np.empty((10000000,))

def c_sum(a, b, c):
    length = a.shape[0]
    code = '''
    for(int j = 0; j < length; j++)
    {
        c[j] = a[j] + b[j];
    }
    '''
    inline(code, ['a', 'b', 'c', 'length'])
After running c_sum(a, b, c) once to get the C code compiled, these are the timings I get:
In [12]: %timeit c_sum(a, b, c)
10 loops, best of 3: 33.5 ms per loop
In [16]: %timeit np.add(a, b, out=c)
10 loops, best of 3: 33.6 ms per loop
So it seems you are looking at something like a 0.3% performance improvement, if the timing difference is not simply random noise, on an operation that takes a handful of ms when working on arrays of ten million elements. If this really is your bottleneck, rewriting it in C is hardly going to solve it.
Try compiling ATLAS and recompiling numpy against it. This probably won't help with addition, but you can get a really nice performance boost with more complicated matrix operations (if you use such, of course).
Check out this simple benchmark. If your results fall too far from those given in the post, maybe your numpy is not linked against an optimized BLAS implementation.
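A quick way to check which BLAS/LAPACK numpy was built against:

import numpy as np
np.show_config()    # prints the BLAS/LAPACK libraries numpy was linked against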
I have an analysis code that does some heavy numerical operations using numpy. Just out of curiosity, I tried to compile it with cython with minimal changes, and then I rewrote the numpy part using loops.
To my surprise, the code based on loops was much faster (8x). I cannot post the complete code, but I put together a very simple unrelated computation that shows similar behavior (albeit the timing difference is not so big):
Version 1 (without cython)
import numpy as np

def _process(array):
    rows = array.shape[0]
    cols = array.shape[1]
    out = np.zeros((rows, cols))
    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)
    return out

def main():
    data = np.load('data.npy')
    out = _process(data)
    np.save('vianumpy.npy', out)
Version 2 (building a module with cython)
import cython
cimport cython
import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):
    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)
    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('viacynpy.npy', out)
Version 3 (building a module with cython)
import cython
cimport cython
import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):
    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
    for row in range(0, rows):
        for col in range(0, cols):
            for row2 in range(0, rows):
                out[row, col] += array[row2, col] - array[row, col]
    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('vialoop.npy', out)
With a 10000x10 matrix saved in data.npy, the times are:
$ python -m timeit -c "from version1 import main;main()"
10 loops, best of 3: 4.56 sec per loop
$ python -m timeit -c "from version2 import main;main()"
10 loops, best of 3: 4.57 sec per loop
$ python -m timeit -c "from version3 import main;main()"
10 loops, best of 3: 2.96 sec per loop
Is this expected, or is there an optimization that I am missing? The fact that versions 1 and 2 give the same timing is somewhat expected, but why is version 3 faster?
P.S. This is NOT the calculation I need to make, just a simple example that shows the same behavior.
With slight modification, version 3 becomes twice as fast:
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def process2(np.ndarray[DTYPE_t, ndim=2] array):
    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row, col, row2
    # out must start zeroed because the loop accumulates with +=
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
    for row in range(rows):
        for row2 in range(rows):
            for col in range(cols):
                out[row, col] += array[row2, col] - array[row, col]
    return out
The bottleneck in your calculation is memory access. Your input array is C ordered, which means that moving along the last axis makes the smallest jump in memory. Therefore your inner loop should be along axis 1, not axis 0. Making this change cuts the run time in half.
If you need to use this function on small input arrays, you can reduce the overhead by using np.empty instead of np.zeros - but note that np.empty returns uninitialized memory, so the loop must then assign rather than accumulate. To reduce the overhead further, use PyArray_EMPTY from the numpy C API.
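A hedged sketch of the PyArray_EMPTY route, reusing the names from the answer (it needs np.import_array() and, like np.empty, returns uninitialized memory):

cimport numpy as np
np.import_array()                       # must be called once before using the numpy C API

cdef np.npy_intp dims[2]
dims[0] = rows
dims[1] = cols
# allocates without the Python-level overhead of calling np.empty
cdef np.ndarray[DTYPE_t, ndim=2] out = np.PyArray_EMPTY(2, dims, np.NPY_FLOAT64, 0)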
If you use this function on very large input arrays (more than 2**31 elements), the integers used for indexing (and in the range function) will overflow. To be safe, use:
cdef Py_ssize_t rows = array.shape[0]
cdef Py_ssize_t cols = array.shape[1]
cdef Py_ssize_t row, col, row2
instead of
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2
Timing:
In [2]: a = np.random.rand(10000, 10)
In [3]: timeit process(a)
1 loops, best of 3: 3.53 s per loop
In [4]: timeit process2(a)
1 loops, best of 3: 1.84 s per loop
where process is your version 3.
As mentioned in the other answers, version 2 is essentially the same as version 1, since cython is unable to dig into the array access operator in order to optimise it. There are two reasons for this:
First, there is a certain amount of overhead in each call to a numpy function, compared to optimised C code. However, this overhead becomes less significant if each operation deals with large arrays.
Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as out[row, :] = A[row, :] + B[row, :]*C[row, :]. In this case a whole array B*C must be created in memory, then added to A. This means the CPU cache is being thrashed, since data is being read from and written to memory rather than being kept in the CPU and used straight away. Importantly, this problem becomes worse if you are dealing with large arrays.
Particularly since you state that your real code is more complex than your example, and it shows a much greater speedup, I suspect that the second reason is likely to be the main factor in your case.
As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr (see the sketch below), although of course cython is useful in many more situations, so it may be the better approach for you.
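A sketch, assuming numexpr is installed: ne.evaluate compiles the whole expression and runs it in cache-sized blocks, so no full-size B*C intermediate is materialised:

import numpy as np
import numexpr as ne

A = np.random.rand(10000, 100)
B = np.random.rand(10000, 100)
C = np.random.rand(10000, 100)

# evaluated in blocked chunks; avoids allocating the B*C temporary
out = ne.evaluate("A + B * C")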
I would recommend using the -a flag to have cython generate the html file that shows what is being translated into pure c vs calling the python API:
http://docs.cython.org/src/quickstart/cythonize.html
Version 2 gives nearly the same timing as version 1 because all of the heavy lifting is being done by the Python API (via numpy) and cython isn't doing anything for you. In fact on my machine, numpy is built against MKL, so when I compile the cython-generated C code using gcc, version 3 is actually a little slower than the other two.
Cython shines when you are doing an array manipulation that numpy can't do in a 'vectorized' way, or when you are doing something memory-intensive where it lets you avoid creating a large temporary array. I've gotten 115x speed-ups using cython vs numpy for some of my own code:
https://github.com/synapticarbors/pylangevin-integrator
Part of that was calling randomkit directly at the level of the C code instead of calling it through numpy.random, but most of it was cython translating the computationally intensive for loops into pure C without calls to Python.
The difference may be due to version 1 and 2 doing a Python-level call to np.sum() for each row, while version 3 likely compiles to a tight, pure C loop.
Studying the difference between version 2 and 3's Cython-generated C source should be enlightening.
I'd guess the main overhead you are saving is the creation of temporary arrays. You build a great big array, array - array[row, :], and then reduce it into a smaller one using sum. But building that big temporary won't be free, especially if memory needs to be allocated.