cython.parallel cannot see the difference in speed

I tried to use cython.parallel's prange, but I only see two cores being used at about 50%. How can I make use of all the cores, i.e. send the loops to the cores simultaneously while sharing the arrays volume and mc_vol?
EDIT: I also timed a purely sequential for-loop version, which is about 30 seconds faster than the cython.parallel prange version. Both of them use only one core. Is there a way to parallelize this?
cimport cython
from cython.parallel import prange, parallel, threadid
from libc.stdio cimport sprintf
from libc.stdlib cimport malloc, free
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef MC_Surface(np.ndarray[np.int_t,ndim=3] volume, np.ndarray[np.float32_t,ndim=3] mc_vol):
    cdef int vol_len=len(volume)-1
    cdef int k, j, i
    cdef char* pattern # a string pointer - allocate later
    Perm_area = {
        "00000000": 0.000000,
        ...
        "00011101": 1.515500
    }
    try:
        pattern = <char*>malloc(sizeof(char)*260)
        for k in range(vol_len):
            for j in range(vol_len):
                for i in range(vol_len):
                    sprintf(pattern, "%i%i%i%i%i%i%i%i",
                            volume[i, j, k],
                            volume[i, j + 1, k],
                            volume[i + 1, j, k],
                            volume[i + 1, j + 1, k],
                            volume[i, j, k + 1],
                            volume[i, j + 1, k + 1],
                            volume[i + 1, j, k + 1],
                            volume[i + 1, j + 1, k + 1]);
                    mc_vol[i, j, k] = Perm_area[pattern]
                    # if Perm_area[pattern] > 0:
                    #     print pattern, 'Area: ', Perm_area[pattern]
                    # total_area += Perm_area[pattern]
    finally:
        free(pattern)
    return mc_vol
EDIT following DavidW's suggestion, but prange is considerably slower:
cpdef MC_Surface(np.ndarray[np.int_t,ndim=3] volume, np.ndarray[np.float32_t,ndim=3] mc_vol):
    cdef int vol_len=len(volume)-1
    cdef int k, j, i
    cdef char* pattern # a string pointer - allocate later
    Perm_area = {
        "00000000": 0.000000,
        ...
        "00011101": 1.515500
    }
    with nogil,parallel():
        try:
            pattern = <char*>malloc(sizeof(char)*260)
            for k in prange(vol_len):
                for j in range(vol_len):
                    for i in range(vol_len):
                        sprintf(pattern, "%i%i%i%i%i%i%i%i",
                                volume[i, j, k],
                                volume[i, j + 1, k],
                                volume[i + 1, j, k],
                                volume[i + 1, j + 1, k],
                                volume[i, j, k + 1],
                                volume[i, j + 1, k + 1],
                                volume[i + 1, j, k + 1],
                                volume[i + 1, j + 1, k + 1]);
                        with gil:
                            mc_vol[i, j, k] = Perm_area[pattern]
                            # if Perm_area[pattern] > 0:
                            #     print pattern, 'Area: ', Perm_area[pattern]
                            # total_area += Perm_area[pattern]
        finally:
            free(pattern)
    return mc_vol
My setup file looks like:
setup(
    name='SurfaceArea',
    ext_modules=[
        Extension('c_marchSurf', ['c_marchSurf.pyx'],
                  include_dirs=[numpy.get_include()],
                  extra_compile_args=['-fopenmp'],
                  extra_link_args=['-fopenmp'],
                  language="c++")
    ],
    cmdclass={'build_ext': build_ext},
    requires=['Cython', 'numpy', 'matplotlib', 'pathos', 'scipy', 'cython.parallel']
)

The problem is the with gil:, which defines a block which can only run on one core at once. You aren't doing anything else inside the loop so you shouldn't really expect any speed-up.
To avoid the GIL you need to avoid Python features wherever possible. You already do that in the string formatting part by using C's sprintf to build the string. For the dictionary lookup, the easiest thing is probably to use the C++ standard library, which contains a map class with similar behaviour. (Note that you will now need to compile in Cython's C++ mode.)
# at the top of your file
from libc.stdio cimport sprintf
from libc.stdlib cimport malloc, free
from libcpp.map cimport map
from libcpp.string cimport string
import numpy as np
cimport numpy as np
# ... code omitted ....
cpdef MC_Surface(np.ndarray[np.int_t,ndim=3] volume, np.ndarray[np.float32_t,ndim=3] mc_vol):
    # note above I've defined volume as a numpy array so that
    # I can do fast, GIL-less direct array lookup
    cdef char* pattern # a string pointer - allocate later
    Perm_area = {} # some dictionary, as before
    # depending on the size of Perm_area, this conversion to
    # a C++ object is potentially quite slow (it involves a lot
    # of string copies)
    cdef map[string,float] Perm_area_m = Perm_area
    # ... code omitted ...
    with nogil,parallel():
        try:
            # assigning pattern here makes it thread local
            # it's assigned once per thread which isn't too bad
            pattern = <char*>malloc(sizeof(char)*50)
            # when you allocate pattern you need to make it big enough
            # either by calculating a size, or by just making it overly big
            # ... more code omitted...
            # then later, inside your loops
            sprintf(pattern, "%i%i%i%i%i%i%i%i", volume[i, j, k],
                    volume[i, j + 1, k],
                    volume[i + 1, j, k],
                    volume[i + 1, j + 1, k],
                    volume[i, j, k + 1],
                    volume[i, j + 1, k + 1],
                    volume[i + 1, j, k + 1],
                    volume[i + 1, j + 1, k + 1]);
            # and now do the dictionary lookup without the GIL
            # because we're using the C++ class instead.
            # Unfortunately, we also need to do a string copy (which might slow things down)
            mc_vol[i, j, k] = Perm_area_m[string(pattern)]
            # be aware that, unlike a Python dict, operator[] on a C++ map
            # does not raise for a missing pattern: it inserts a default
            # value (0.0). Inserting from several threads at once is not
            # safe, so make sure every pattern is already in the map.
        finally:
            free(pattern)
I've also had to change volume to being a numpy array, since if it were just a Python object I'd need the GIL to index its elements.
(Edit: changed to take the dictionary lookup out of the GIL block too by using C++ map)
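One possible alternative to building a string key is to pack the eight corner values into a single integer and use it as an index into a flat 256-entry table. The sketch below shows the idea in plain Python, assuming every voxel value is 0 or 1. The names pattern_index and build_table are hypothetical helpers, and the dictionary holds only the two entries quoted in the question.

```python
def pattern_index(bits):
    """Pack eight 0/1 values into one integer, first value as the highest bit."""
    index = 0
    for bit in bits:
        index = (index << 1) | (bit & 1)
    return index


def build_table(perm_area):
    """Turn a {"00011101": area} dict into a 256-entry list indexed by pattern."""
    table = [0.0] * 256
    for key, area in perm_area.items():
        table[int(key, 2)] = area
    return table


# Only the two entries shown in the question; the real dict has more.
perm_area = {"00000000": 0.000000, "00011101": 1.515500}
table = build_table(perm_area)

# Corner values in the same order as the sprintf arguments.
corners = [0, 0, 0, 1, 1, 1, 0, 1]
area = table[pattern_index(corners)]
```

If this fits the data, the table could be held in a C array or typed memoryview on the Cython side, so the inner loop would need no sprintf, no string copy and no GIL.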

Related

Is my Python/Cython iteration benchmark representative?

I want to iterate through a large data structure in a Python program and perform a task for each element. For simplicity, let's say the elements are integers and the task is just an incrementation. In the end, the last incremented element is returned as (dummy) result. In search of the best structure/method to do this I compared timings in pure Python and Cython for these structures (I could not find a direct comparison of them elsewhere):
Python list
NumPy array / typed memory view
Cython extension type with underlying C++ vector
The iterations I timed are:
Python foreach in list iteration (it_list)
Cython list iteration with explicit element access (cit_list)
Python foreach in array iteration (it_nparray)
Python NumPy vectorised operation (vec_nparray)
Cython memory view iteration with explicit element access (cit_memview)
Python foreach in underlying vector iteration (it_pyvector)
Python foreach in underlying vector iteration via __iter__ (it_pyvector_iterator)
Cython vector iteration with explicit element access (cit_pyvector)
Cython vector iteration via vector.iterator (cit_pyvector_iterator)
I am concluding from this (timings are below):
plain Python iteration over the NumPy array is extremely slow (about 10 times slower than the Python list iteration) -> not a good idea
Python iteration over the wrapped C++ vector is slow, too (about 1.5 times slower than the Python list iteration) -> not a good idea
Cython iteration over the wrapped C++ vector is the fastest option, approximately equal to the C contiguous memory view
The iteration over the vector using explicit element access is slightly faster than using an iterator -> why bother to use an iterator?
The memory view approach has comparably larger overhead than the extension type approach
My question is now: Are my numbers reliable (did I do something wrong or miss anything here)? Is this in line with your experience with real-world examples? Is there anything else I could do to improve the iteration? The code I used and the timings are below. I am running this in a Jupyter notebook, by the way. Suggestions and comments are highly appreciated!
Relative timings (minimum value 1.000), for different data structure sizes n:
================================================================================
Timings for n = 1:
--------------------------------------------------------------------------------
cit_pyvector_iterator: 1.000
cit_pyvector: 1.005
cit_list: 1.023
it_list: 3.064
it_pyvector: 4.230
it_pyvector_iterator: 4.937
cit_memview: 8.196
vec_nparray: 20.187
it_nparray: 25.310
================================================================================
================================================================================
Timings for n = 1000:
--------------------------------------------------------------------------------
cit_pyvector_iterator: 1.000
cit_pyvector: 1.001
cit_memview: 2.453
vec_nparray: 5.845
cit_list: 9.944
it_list: 137.694
it_pyvector: 199.702
it_pyvector_iterator: 218.699
it_nparray: 1516.080
================================================================================
================================================================================
Timings for n = 1000000:
--------------------------------------------------------------------------------
cit_pyvector: 1.000
cit_memview: 1.056
cit_pyvector_iterator: 1.197
vec_nparray: 2.516
cit_list: 7.089
it_list: 87.099
it_pyvector_iterator: 143.232
it_pyvector: 162.374
it_nparray: 897.602
================================================================================
================================================================================
Timings for n = 10000000:
--------------------------------------------------------------------------------
cit_pyvector: 1.000
cit_memview: 1.004
cit_pyvector_iterator: 1.060
vec_nparray: 2.721
cit_list: 7.714
it_list: 88.792
it_pyvector_iterator: 130.116
it_pyvector: 149.497
it_nparray: 872.798
================================================================================
Cython code:
%%cython --annotate
# distutils: language = c++
# cython: boundscheck = False
# cython: wraparound = False
from libcpp.vector cimport vector
from cython.operator cimport dereference as deref, preincrement as princ
# Extension type wrapping a vector
cdef class pyvector:
    cdef vector[long] _data
    cpdef void push_back(self, long x):
        self._data.push_back(x)
    def __iter__(self):
        cdef size_t i, n = self._data.size()
        for i in range(n):
            yield self._data[i]
    @property
    def data(self):
        return self._data
# Cython iteration over Python list
cpdef long cit_list(list l):
    cdef:
        long j, ii
        size_t i, n = len(l)
    for i in range(n):
        ii = l[i]
        j = ii + 1
    return j
# Cython iteration over NumPy array
cpdef long cit_memview(long[::1] v) nogil:
    cdef:
        size_t i, n = v.shape[0]
        long j
    for i in range(n):
        j = v[i] + 1
    return j
# Iterate over pyvector
cpdef long cit_pyvector(pyvector v) nogil:
    cdef:
        size_t i, n = v._data.size()
        long j
    for i in range(n):
        j = v._data[i] + 1
    return j
cpdef long cit_pyvector_iterator(pyvector v) nogil:
    cdef:
        vector[long].iterator it = v._data.begin()
        long j
    while it != v._data.end():
        j = deref(it) + 1
        princ(it)
    return j
Python code:
# Python iteration over Python list
def it_list(l):
    for i in l:
        j = i + 1
    return j
# Python iteration over NumPy array
def it_nparray(a):
    for i in a:
        j = i + 1
    return j
# Vectorised NumPy operation
def vec_nparray(a):
    a + 1
    return a[-1]
# Python iteration over C++ vector extension type
def it_pyvector_iterator(v):
    for i in v:
        j = i + 1
    return j
def it_pyvector(v):
    for i in v.data:
        j = i + 1
    return j
And for the benchmark:
import numpy as np
from operator import itemgetter
def bm(sizes):
    """Call functions with data structures of varying length"""
    Timings = {}
    for n in sizes:
        Timings[n] = {}
        # Python list
        list_ = list(range(n))
        # NumPy array
        a = np.arange(n, dtype=np.int64)
        # C++ vector extension type
        pyv = pyvector()
        for i in range(n):
            pyv.push_back(i)
        calls = [
            (it_list, list_),
            (cit_list, list_),
            (it_nparray, a),
            (vec_nparray, a),
            (cit_memview, a),
            (it_pyvector, pyv),
            (it_pyvector_iterator, pyv),
            (cit_pyvector, pyv),
            (cit_pyvector_iterator, pyv),
        ]
        for fxn, arg in calls:
            Timings[n][fxn.__name__] = %timeit -o fxn(arg)
    return Timings
def ratios(timings, base=None):
    """Show relative performance of runs based on `timings` dict"""
    if base is not None:
        base = timings[base].average
    else:
        base = min(x.average for x in timings.values())
    return sorted([
        (k, v.average / base)
        for k, v in timings.items()
    ], key=itemgetter(1))
Timings = {}
sizes = [1, 1000, 1000000, 10000000]
Timings.update(bm(sizes))
for s in sizes:
    print("=" * 80)
    print(f"Timings for n = {s}:")
    print("-" * 80)
    for x in ratios(Timings[s]):
        print(f"{x[0]:>25}: {x[1]:7.3f}")
    print("=" * 80, "\n")
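The %timeit magic only works inside IPython or Jupyter. As a rough sketch of the same relative-timing idea in a plain script, the standard library's timeit module can be used instead. The helper relative_timings and the function idx_list are hypothetical names introduced here, and the sketch scales by the best run rather than the average used above.

```python
import timeit
from operator import itemgetter


def it_list(l):
    """Plain Python iteration over a list."""
    j = 0
    for i in l:
        j = i + 1
    return j


def idx_list(l):
    """Iteration over a list with explicit element access."""
    j = 0
    for i in range(len(l)):
        j = l[i] + 1
    return j


def relative_timings(funcs, arg, repeat=3, number=10):
    """Return [(name, ratio)] sorted ascending, with the fastest scaled to 1.0."""
    best = {
        f.__name__: min(timeit.repeat(lambda: f(arg), repeat=repeat, number=number))
        for f in funcs
    }
    base = min(best.values())
    return sorted(((k, v / base) for k, v in best.items()), key=itemgetter(1))


results = relative_timings([it_list, idx_list], list(range(1000)))
for name, ratio in results:
    print(f"{name:>25}: {ratio:7.3f}")
```

Taking the minimum over several repeats tends to be less sensitive to background noise than the average, which may matter for the very short n = 1 runs.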

Sum of partial derivatives of a product over a symbolic number of variables

I would like SymPy to evaluate an expression like the following:
How would I define the symbols and the expression so that SymPy could handle it nicely? I would like to keep N as just a symbol, i.e. not make an actual finite list of x's. I have tried various combinations of IndexedBase and Sum /Product, but didn't get it working right.

Ideally it would be this:
from sympy import IndexedBase, Piecewise, Product, Sum, exp, expand_log, symbols
x = IndexedBase("x")
i, j, N = symbols("i j N")
expr = Sum(Product(exp(-x[j]**2), (j, 1, N)).diff(x[i]), (i, 1, N))
So far this is unevaluated, expr is
Sum(Derivative(Product(exp(-x[j]**2), (j, 1, N)), x[i]), (i, 1, N))
The method doit can be used to evaluate it. Unfortunately the differentiation of a product doesn't quite work yet: expr.doit() returns
N*Derivative(Product(exp(-x[j]**2), (j, 1, N)), x[i])
Rewriting the product as the sum prior to differentiation helps:
expr = Sum(Product(exp(-x[j]**2), (j, 1, N)).rewrite(Sum).diff(x[i]), (i, 1, N))
expr.doit()
returns
Sum(Piecewise((-2*exp(Sum(log(exp(-x[j]**2)), (j, 1, N)))*x[i], (1 <= i) & (i <= N)), (0, True)), (i, 1, N))
which is the correct result of differentiation. Sadly we have that extraneous condition in Piecewise, and also log(exp(...)) that should have been simplified. SymPy doesn't infer that (1 <= i) & (i <= N) is True from the context of the outer sum, and it also hesitates to simplify log(exp thinking x[j] might be complex. So I resort to surgical procedure with Piecewise, replacing it by the first piece, and to forcefully expanding logs:
e = expr.doit()
p = next(iter(e.atoms(Piecewise)))
e = expand_log(e.xreplace({p: p.args[0][0]}), force=True)
Now e is
Sum(-2*exp(Sum(-x[j]**2, (j, 1, N)))*x[i], (i, 1, N))
Couldn't get exp(Sum(..)) to become a Product again, unfortunately.
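As a sanity check on the simplified result, it can be compared numerically against a finite-difference estimate for one concrete N. The sketch below uses only the standard library; the test point, the step size h and the helper names are arbitrary choices made for illustration and are not part of the SymPy session above.

```python
import math


def product(xs):
    """prod_j exp(-x_j**2)"""
    return math.prod(math.exp(-x * x) for x in xs)


def closed_form(xs):
    """sum_i -2 * x_i * exp(sum_j -x_j**2), the simplified result above."""
    return sum(-2.0 * x * math.exp(-sum(y * y for y in xs)) for x in xs)


def finite_difference(xs, h=1e-6):
    """Sum over i of a central-difference estimate of d(product)/d(x_i)."""
    total = 0.0
    for i in range(len(xs)):
        up = xs[:i] + [xs[i] + h] + xs[i + 1:]
        down = xs[:i] + [xs[i] - h] + xs[i + 1:]
        total += (product(up) - product(down)) / (2.0 * h)
    return total


point = [0.3, -0.7, 1.1]   # arbitrary test point, N = 3
exact = closed_form(point)
approx = finite_difference(point)
print(exact, approx)
```

The two values should agree to several decimal places, which supports dropping the Piecewise condition and the log(exp(...)) wrapper for real-valued x.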

cpython vs cython vs numpy array performance

I am doing some performance test on a variant of the prime numbers generator from http://docs.cython.org/src/tutorial/numpy.html.
The below performance measures are with kmax=1000
Pure Python implementation, running in CPython: 0.15s
Pure Python implementation, running in Cython: 0.07s
def primes(kmax):
    p = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p.append(n)
            k = k + 1
        n = n + 1
    return p
Pure Python+Numpy implementation, running in CPython: 1.25s
import numpy
def primes(kmax):
    p = numpy.empty(kmax, dtype=int)
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
        n = n + 1
    return p
Cython implementation using int*: 0.003s
from libc.stdlib cimport malloc, free
def primes(int kmax):
    cdef int n, k, i
    cdef int *p = <int *>malloc(kmax * sizeof(int))
    result = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    free(p)
    return result
The above performs great but looks horrible, as it holds two copies of the data... so I tried reimplementing it:
Cython + Numpy: 1.01s
import numpy as np
cimport numpy as np
cimport cython
DTYPE = np.int
ctypedef np.int_t DTYPE_t
@cython.boundscheck(False)
def primes(DTYPE_t kmax):
    cdef DTYPE_t n, k, i
    cdef np.ndarray p = np.empty(kmax, dtype=DTYPE)
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
        n = n + 1
    return p
Questions:
why is the numpy array so incredibly slower than a python list, when running on CPython?
what did I do wrong in the Cython+Numpy implementation? cython is obviously NOT treating the numpy array as an int[] as it should.
how do I cast a numpy array to a int*? The below doesn't work
cdef numpy.nparray a = numpy.zeros(100, dtype=int)
cdef int * p = <int *>a.data

cdef DTYPE_t [:] p_view = p
Using this view instead of p in the calculations reduced the runtime from 580 ms down to 2.8 ms for me. That is about the same runtime as the implementation using int*, and about the best you can expect.
DTYPE = np.int
ctypedef np.int_t DTYPE_t
@cython.boundscheck(False)
def primes(DTYPE_t kmax):
    cdef DTYPE_t n, k, i
    cdef np.ndarray p = np.empty(kmax, dtype=DTYPE)
    cdef DTYPE_t [:] p_view = p
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p_view[i] != 0:
            i = i + 1
        if i == k:
            p_view[k] = n
            k = k + 1
        n = n + 1
    return p

why is the numpy array so incredibly slower than a python list, when running on CPython?
Because you didn't fully type it. Use
cdef np.ndarray[dtype=np.int, ndim=1] p = np.empty(kmax, dtype=DTYPE)
how do I cast a numpy array to a int*?
By using np.intc as the dtype, not np.int (which is a C long). That's
cdef np.ndarray[dtype=int, ndim=1] p = np.empty(kmax, dtype=np.intc)
(But really, use a memoryview, they're much cleaner and the Cython folks want to get rid of the NumPy array syntax in the long run.)

Best syntax I found so far:
import numpy
cimport numpy
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def primes(int kmax):
    cdef int n, k, i
    cdef numpy.ndarray[int] p = numpy.empty(kmax, dtype=numpy.int32)
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
        n = n + 1
    return p
Note where I used numpy.int32 instead of int. Anything on the left side of a cdef is a C type (so on typical platforms int is a 32-bit integer and float is a 32-bit float), while anything on the right side of it (or outside of a cdef) is a Python type (as a NumPy dtype, int usually means int64 and float means float64).
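The widths of the C types on a given machine can be checked from Python with the standard ctypes module. This is only a quick sketch for inspecting the local platform; the sizes dictionary is a name introduced here, and the C standard does not fix the width of int or long.

```python
import ctypes

# Size in bytes of the C types that Cython's cdef declarations refer to.
sizes = {
    "int": ctypes.sizeof(ctypes.c_int),
    "long": ctypes.sizeof(ctypes.c_long),
    "float": ctypes.sizeof(ctypes.c_float),
    "double": ctypes.sizeof(ctypes.c_double),
}
for name, nbytes in sizes.items():
    print(f"C {name}: {nbytes * 8} bits")
```

The dtype passed to numpy.empty has to match the element type declared in the cdef, so knowing these widths shows which NumPy dtype to pair with a given C type.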

Fast 2 dimensional array of floats for python (access/write)

For my project, I need to store a certain number (~100x100) of floats in a two-dimensional array. During the function calculation I need to read from and write to the array, and since that function is the bottleneck (consuming 98% of the time), I really need it to be fast.
I did some experiments with numpy and cython:
import numpy
import time
cimport numpy
cimport cython
cdef int col, row
DTYPE = numpy.int
ctypedef numpy.int_t DTYPE_t
cdef numpy.ndarray[DTYPE_t, ndim=2] matrix_c = numpy.zeros([100 + 1, 100 + 1], dtype=DTYPE)
time_ = time.time()
for l in xrange(5000):
    for col in xrange(100):
        for row in xrange(100):
            matrix_c[<unsigned int>row + 1][<unsigned int>col + 1] = matrix_c[<unsigned int>row][<unsigned int>col]
print "Numpy + cython time: {0}".format(time.time() - time_)
but I found that, in spite of all my attempts, the version using Python lists is still significantly faster.
Code using lists:
matrix = []
for i in xrange(100 + 1):
    matrix.append([])
    for j in xrange(100 + 1):
        matrix[i].append(0)
time_ = time.time()
for l in xrange(5000):
    for col in xrange(100):
        for row in xrange(100):
            matrix[row + 1][col + 1] = matrix[row][col]
print "list time: {0}".format(time.time() - time_)
And results:
list time: 0.0141758918762
Numpy + cython time: 0.484772920609
Have I done something wrong? If not, is there anything that would help me to improve the results?

Here's my version of the code you have.
There are three functions, dealing with integer arrays, 32 bit floating point arrays and double precision floating point arrays, respectively.
from numpy cimport ndarray as ar
cimport numpy as np
import numpy as np
cimport cython
import time
@cython.boundscheck(False)
@cython.wraparound(False)
def access_write_int(ar[int,ndim=2] c, int n):
    cdef int l, col, row, h=c.shape[0], w=c.shape[1]
    time_ = time.time()
    for l in range(n):
        for row in range(h-1):
            for col in range(w-1):
                c[row+1,col+1] = c[row,col]
    print "Numpy + cython time: {0}".format(time.time() - time_)
@cython.boundscheck(False)
@cython.wraparound(False)
def access_write_float(ar[np.float32_t,ndim=2] c, int n):
    cdef int l, col, row, h=c.shape[0], w=c.shape[1]
    time_ = time.time()
    for l in range(n):
        for row in range(h-1):
            for col in range(w-1):
                c[row+1,col+1] = c[row,col]
    print "Numpy + cython time: {0}".format(time.time() - time_)
@cython.boundscheck(False)
@cython.wraparound(False)
def access_write_double(ar[double,ndim=2] c, int n):
    cdef int l, col, row, h=c.shape[0], w=c.shape[1]
    time_ = time.time()
    for l in range(n):
        for row in range(h-1):
            for col in range(w-1):
                c[row+1,col+1] = c[row,col]
    print "Numpy + cython time: {0}".format(time.time() - time_)
To call these functions from Python I run this
import numpy as np
from numpy.random import rand, randint
print "integers"
c = randint(0, high=20, size=(101,101))
access_write_int(c, 5000)
print "32 bit float"
c = rand(101, 101).astype(np.float32)
access_write_float(c, 5000)
print "double precision"
c = rand(101, 101)
access_write_double(c, 5000)
The following changes are important:
Avoid slicing the array by accessing it using indices of the form [i,j] instead of [i][j]
Define the variables l, col, and row, as integers so that the for loops run in C.
Use the function decorators @cython.boundscheck(False) and @cython.wraparound(False) to turn off bounds checking and wraparound indexing for the key portion of the program. This allows out-of-bounds memory accesses, so you should do this only when you are certain that your indices are what they should be.
Swap the two innermost for loops so that you access your array according to how it is arranged in memory. This makes a much bigger difference for larger arrays. The arrays given by np.zeros, np.random.rand, etc. are usually C contiguous, so rows are stored in contiguous blocks and it is faster to vary the row index in the outer for loop and the column index in the inner one. If you want to keep the for loops as they are, consider taking the transpose of your array before you run the function on it so that the columns are in contiguous blocks instead.
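The effect of loop order on memory access can be seen from the flat offsets each ordering touches. The following pure-Python sketch assumes a C-contiguous (row-major) layout; flat_offset and visit_order are hypothetical names used only for illustration.

```python
def flat_offset(row, col, ncols):
    """Offset of element [row, col] in a C-contiguous (row-major) 2-D array."""
    return row * ncols + col


def visit_order(nrows, ncols, row_outer):
    """Flat offsets touched by a double loop with rows in the outer or inner loop."""
    if row_outer:
        return [flat_offset(r, c, ncols) for r in range(nrows) for c in range(ncols)]
    return [flat_offset(r, c, ncols) for c in range(ncols) for r in range(nrows)]


row_major = visit_order(3, 4, row_outer=True)
col_major = visit_order(3, 4, row_outer=False)
print(row_major)
print(col_major)
```

With rows in the outer loop the offsets increase by one at every step, while with columns in the outer loop each step jumps by a full row, which is what makes the second ordering less cache-friendly on large arrays.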

The problem seems to be the way you are accessing the matrix elements.
Use [i,j] instead of [i][j].
You can also remove the <unsigned int> casts, which guard against wrong values being used but add overhead.
Also, I would use range instead of xrange, since all the Cython examples in the documentation use range.
The result will be something like:
import numpy
import time
cimport numpy
cimport cython
cdef int col, row
INT = numpy.int
ctypedef numpy.int_t cINT
cdef numpy.ndarray[cINT, ndim=2] matrix_c = numpy.zeros([100 + 1, 100 + 1], dtype=INT)
time_ = time.time()
for l in range(5000):
    for col in range(100):
        for row in range(100):
            matrix_c[row + 1, col + 1] = matrix_c[row, col]
print "Numpy + cython time: {0}".format(time.time() - time_)
Strongly recommended reference:
- Working with NumPy in Cython

Error using scipy.weave.inline

I am using several techniques (NumPy, Weave, Cython, Numba) to perform a Python performance benchmark. The code takes two numpy arrays of size NxN and multiplies them element-wise and stores the values in another array C.
My weave.inline() code gives me a scipy.weave.build_tools.CompileError. I have created a minimalist piece of code which generates the same error. Could someone please help?
import time
import numpy as np
from scipy import weave
from scipy.weave import converters
def benchmark():
    N = np.array(5000, dtype=np.int)
    A = np.random.rand(N, N)
    B = np.random.rand(N, N)
    C = np.zeros([N, N], dtype=float)
    t = time.clock()
    weave_inline_loop(A, B, C, N)
    print time.clock() - t
def weave_inline_loop(A, B, C, N):
    code = """
           int i, j;
           for (i = 0; i < N; ++i)
           {
               for (j = 0; j < N; ++j)
               {
                   C(i, j) = A(i, j) * B(i, j);
               }
           }
           return_val = C;
           """
    C = weave.inline(code, ['A', 'B', 'C', 'N'], type_converters=converters.blitz, compiler='gcc')
benchmark()

Three small changes are needed:
N can't be a 0D-numpy array (it has to be an integer so that i < N works in the C code). You should write N = 5000 instead of N = np.array(5000, dtype=np.int).
The C array is being modified in-place, so it doesn't have to be returned. I don't know the restrictions on the kind of objects that return_val can handle, but if you keep return_val = C; compilation fails with: don't know how to convert ‘blitz::Array<double, 2>’ to ‘const py::object&’.
After that, weave.inline returns None. Keeping the assignment C = weave.inline(... makes the code look confusing, even if it works fine and the array named C will hold the result in the benchmark scope.
This is the end result:
import time
import numpy as np
from scipy import weave
from scipy.weave import converters
def benchmark():
    N = 5000
    A = np.random.rand(N, N)
    B = np.random.rand(N, N)
    C = np.zeros([N, N], dtype=float)
    t = time.clock()
    weave_inline_loop(A, B, C, N)
    print time.clock() - t
def weave_inline_loop(A, B, C, N):
    code = """
           int i, j;
           for (i = 0; i < N; ++i)
           {
               for (j = 0; j < N; ++j)
               {
                   C(i, j) = A(i, j) * B(i, j);
               }
           }
           """
    weave.inline(code, ['A', 'B', 'C', 'N'], type_converters=converters.blitz, compiler='gcc')

Two issues. First, you don't need the line return_val = C. You are directly manipulating the data in the variable C in your inlined code, so it's already available to Python and there's no need to explicitly return it to the environment (trying to do so causes errors during the type conversion). So change your function to:
def weave_inline_loop(A, B, C, N):
    code = """
           int i, j;
           for (i = 0; i < N; ++i)
           {
               for (j = 0; j < N; ++j)
               {
                   C(i, j) = A(i, j) * B(i, j);
               }
           }
           """
    weave.inline(code, ['A', 'B', 'C', 'N'], type_converters=converters.blitz, compiler='gcc')
    return C
Second issue: you are comparing i and j (both ints) to N, which is a NumPy array rather than a plain integer. This also generated an error. It works if you pass int(N) when calling your code:
def benchmark():
    N = np.array(5000, dtype=np.int)
    A = np.random.rand(N, N)
    B = np.random.rand(N, N)
    C = np.zeros([N, N], dtype=float)
    t = time.clock()
    print weave_inline_loop(A, B, C, int(N))
    # I added a print statement so you can see that C is being
    # populated with the new 2d array
    print time.clock() - t
