cython prange not so fast than single thread

cython prange not so fast than single thread - python

I just wrote a trivial program to test how cython's prange performs, and here is the code:
from cython.parallel import prange
import numpy as np
def func(int r, int c):
cdef:
double[:,:] a = np.arange(r*c, dtype=np.double).reshape(r,c)
double total = 0
int i, j
for i in prange(r, nogil=True, schedule='static', chunksize=1):
for j in range(c):
total += a[i,j]
return total
On Mac Book pro, with OMP_NUM_THREADS=3, the above code takes almost 18 sec for (r,c) = (10000, 100000), and with single thread, it takes about 21 sec.
Why there is so little performance boost? Am I using this prange correctly?

Have you timed how long it takes just to allocate a? A 10000 x 100000 float64 array takes up 8GB of memory.
a = np.ones((10000, 100000), np.double)
takes over six seconds on my laptop with 16GB of RAM. If you don't have 8GB free then you'll hit the swap and it will take a lot longer. Since func spends almost all of its time just allocating a, parallelizing your outer for loop can therefore only gain you a small fractional improvement on the total runtime.
To demonstrate this, I have modified your function to accept a as an input. In tmp.pyx:
#cython: boundscheck=False, wraparound=False, initializedcheck=False
from cython.parallel cimport prange
def serial(double[:, :] a):
cdef:
double total = 0
int i, j
for i in range(a.shape[0]):
for j in range(a.shape[1]):
total += a[i, j]
return total
def parallel(double[:, :] a):
cdef:
double total = 0
int i, j
for i in prange(a.shape[0], nogil=True, schedule='static', chunksize=1):
for j in range(a.shape[1]):
total += a[i, j]
return total
For example:
In [1]: import tmp
In [2]: r, c = 10000, 100000
In [3]: a = np.random.randn(r, c) # this takes ~6.75 sec
In [4]: %timeit tmp.serial(a)
1 loops, best of 3: 1.25 s per loop
In [5]: %timeit tmp.parallel(a)
1 loops, best of 3: 450 ms per loop
Parallelizing the function gave about a 2.8x speed-up* on my laptop with 4 cores, but this is only a small fraction of the time taken to allocate a.
The lesson here is to always profile your code to understand where it spends its most of its time before you dive into optimizations.
* You could do a little better by passing larger chunks of a to each worker process, e.g. by increasing chunksize or using schedule='guided'

Related

Is my Python/Cython iteration benchmark representative?

I want to iterate through a large data structure in a Python program and perform a task for each element. For simplicity, let's say the elements are integers and the task is just an incrementation. In the end, the last incremented element is returned as (dummy) result. In search of the best structure/method to do this I compared timings in pure Python and Cython for these structures (I could not find a direct comparison of them elsewhere):
Python list
NumPy array / typed memory view
Cython extension type with underlying C++ vector
The iterations I timed are:
Python foreach in list iteration (it_list)
Cython list iteration with explicit element access (cit_list)
Python foreach in array iteration (it_nparray)
Python NumPy vectorised operation (vec_nparray)
Cython memory view iteration with explicit element access (cit_memview)
Python foreach in underlying vector iteration (it_pyvector)
Python foreach in underlying vector iteration via __iter__ (it_pyvector_iterator)
Cython vector iteration with explicit element access (cit_pyvector)
Cython vector iteration via vector.iterator (cit_pyvector_iterator)
I am concluding from this (timings are below):
plain Python iteration over the NumPy array is extremely slow (about 10 times slower than the Python list iteration) -> not a good idea
Python iteration over the wrapped C++ vector is slow, too (about 1.5 times slower than the Python list iteration) -> not a good idea
Cython iteration over the wrapped C++ vector is the fastest option, approximately equal to the C contiguous memory view
The iteration over the vector using explicit element access is slightly faster than using an iterator -> why bother to use an iterator?
The memory view approach has comparably larger overhead than the extension type approach
My question is now: Are my numbers reliable (did I do something wrong or miss anything here)? Is this in line with your experience with real-world examples? Is there anything else I could do to improve the iteration? Below the code that I used and the timings. I am using this in a Jupyter notebook by the way. Suggestions and comments are highly appreciated!
Relative timings (minimum value 1.000), for different data structure sizes n:
================================================================================
Timings for n = 1:
--------------------------------------------------------------------------------
cit_pyvector_iterator: 1.000
cit_pyvector: 1.005
cit_list: 1.023
it_list: 3.064
it_pyvector: 4.230
it_pyvector_iterator: 4.937
cit_memview: 8.196
vec_nparray: 20.187
it_nparray: 25.310
================================================================================
================================================================================
Timings for n = 1000:
--------------------------------------------------------------------------------
cit_pyvector_iterator: 1.000
cit_pyvector: 1.001
cit_memview: 2.453
vec_nparray: 5.845
cit_list: 9.944
it_list: 137.694
it_pyvector: 199.702
it_pyvector_iterator: 218.699
it_nparray: 1516.080
================================================================================
================================================================================
Timings for n = 1000000:
--------------------------------------------------------------------------------
cit_pyvector: 1.000
cit_memview: 1.056
cit_pyvector_iterator: 1.197
vec_nparray: 2.516
cit_list: 7.089
it_list: 87.099
it_pyvector_iterator: 143.232
it_pyvector: 162.374
it_nparray: 897.602
================================================================================
================================================================================
Timings for n = 10000000:
--------------------------------------------------------------------------------
cit_pyvector: 1.000
cit_memview: 1.004
cit_pyvector_iterator: 1.060
vec_nparray: 2.721
cit_list: 7.714
it_list: 88.792
it_pyvector_iterator: 130.116
it_pyvector: 149.497
it_nparray: 872.798
================================================================================
Cython code:
%%cython --annotate
# distutils: language = c++
# cython: boundscheck = False
# cython: wraparound = False
from libcpp.vector cimport vector
from cython.operator cimport dereference as deref, preincrement as princ
# Extension type wrapping a vector
cdef class pyvector:
cdef vector[long] _data
cpdef void push_back(self, long x):
self._data.push_back(x)
def __iter__(self):
cdef size_t i, n = self._data.size()
for i in range(n):
yield self._data[i]
#property
def data(self):
return self._data
# Cython iteration over Python list
cpdef long cit_list(list l):
cdef:
long j, ii
size_t i, n = len(l)
for i in range(n):
ii = l[i]
j = ii + 1
return j
# Cython iteration over NumPy array
cpdef long cit_memview(long[::1] v) nogil:
cdef:
size_t i, n = v.shape[0]
long j
for i in range(n):
j = v[i] + 1
return j
# Iterate over pyvector
cpdef long cit_pyvector(pyvector v) nogil:
cdef:
size_t i, n = v._data.size()
long j
for i in range(n):
j = v._data[i] + 1
return j
cpdef long cit_pyvector_iterator(pyvector v) nogil:
cdef:
vector[long].iterator it = v._data.begin()
long j
while it != v._data.end():
j = deref(it) + 1
princ(it)
return j
Python code:
# Python iteration over Python list
def it_list(l):
for i in l:
j = i + 1
return j
# Python iteration over NumPy array
def it_nparray(a):
for i in a:
j = i + 1
return j
# Vectorised NumPy operation
def vec_nparray(a):
a + 1
return a[-1]
# Python iteration over C++ vector extension type
def it_pyvector_iterator(v):
for i in v:
j = i + 1
return j
def it_pyvector(v):
for i in v.data:
j = i + 1
return j
And for the benchmark:
import numpy as np
from operator import itemgetter
def bm(sizes):
"""Call functions with data structures of varying length"""
Timings = {}
for n in sizes:
Timings[n] = {}
# Python list
list_ = list(range(n))
# NumPy array
a = np.arange(n, dtype=np.int64)
# C++ vector extension type
pyv = pyvector()
for i in range(n):
pyv.push_back(i)
calls = [
(it_list, list_),
(cit_list, list_),
(it_nparray, a),
(vec_nparray, a),
(cit_memview, a),
(it_pyvector, pyv),
(it_pyvector_iterator, pyv),
(cit_pyvector, pyv),
(cit_pyvector_iterator, pyv),
]
for fxn, arg in calls:
Timings[n][fxn.__name__] = %timeit -o fxn(arg)
return Timings
def ratios(timings, base=None):
"""Show relative performance of runs based on `timings` dict"""
if base is not None:
base = timings[base].average
else:
base = min(x.average for x in timings.values())
return sorted([
(k, v.average / base)
for k, v in timings.items()
], key=itemgetter(1))
Timings = {}
sizes = [1, 1000, 1000000, 10000000]
Timings.update(bm(sizes))
for s in sizes:
print("=" * 80)
print(f"Timings for n = {s}:")
print("-" * 80)
for x in ratios(Timings[s]):
print(f"{x[0]:>25}: {x[1]:7.3f}")
print("=" * 80, "\n")

Comparing numpy array with itself by element efficiently

I am performing a large number of these calculations:
A == A[np.newaxis].T
where A is a dense numpy array which frequently has common values.
For benchmarking purposes we can use:
n = 30000
A = np.random.randint(0, 1000, n)
A == A[np.newaxis].T
When I perform this calculation, I run into memory issues. I believe this is because the output isn't in more efficient bitarray or np.packedbits format. A secondary concern is we are performing twice as many comparisons as necessary, since the resulting Boolean array is symmetric.
The questions I have are:
Is it possible to produce the Boolean numpy array output in a more memory efficient fashion without sacrificing speed? The options I know about are bitarray and np.packedbits, but I only know how to apply these after the large Boolean array is created.
Can we utilise the symmetry of our calculation to halve the number of comparisons processed, again without sacrificing speed?
I will need to be able to perform & and | operations on Boolean arrays output. I have tried bitarray, which is super-fast for these bitwise operations. But it is slow to pack np.ndarray -> bitarray and then unpack bitarray -> np.ndarray.
[Edited to provide clarification.]

Here's one with numba to give us a NumPy boolean array as output -
from numba import njit
#njit
def numba_app1(idx, n, s, out):
for i,j in zip(idx[:-1],idx[1:]):
s0 = s[i:j]
c = 0
for p1 in s0[c:]:
for p2 in s0[c+1:]:
out[p1,p2] = 1
out[p2,p1] = 1
c += 1
return out
def app1(A):
s = A.argsort()
b = A[s]
n = len(A)
idx = np.flatnonzero(np.r_[True,b[1:] != b[:-1],True])
out = np.zeros((n,n),dtype=bool)
numba_app1(idx, n, s, out)
out.ravel()[::out.shape[1]+1] = 1
return out
Timings -
In [287]: np.random.seed(0)
...: n = 30000
...: A = np.random.randint(0, 1000, n)
# Original soln
In [288]: %timeit A == A[np.newaxis].T
1 loop, best of 3: 317 ms per loop
# #Daniel F's soln-1 that skips assigning lower diagonal in output
In [289]: %timeit sparse_outer_eq(A)
1 loop, best of 3: 450 ms per loop
# #Daniel F's soln-2 (complete one)
In [291]: %timeit sparse_outer_eq(A)
1 loop, best of 3: 634 ms per loop
# Solution from this post
In [292]: %timeit app1(A)
10 loops, best of 3: 66.9 ms per loop

This isn't even a numpy answer, but should work to keep your data requirements down by using a bit of homebrewed sparse notation
from numba import jit
#jit # because this is gonna be loopy
def sparse_outer_eq(A):
n = A.size
c = []
for i in range(n):
for j in range(i + 1, n):
if A[i] == A[j]:
c.append((i, j))
return c
Now c is a list of coordinate tuples (i, j), i < j that correspond to coordinates in your boolean array that are "True". You can easily do and and or operations on these setwise:
list(set(c1) & set(c2))
list(set(c1) | set(c2))
Later, when you want to apply this mask to an array, you can back out the coordinates and use them for fancy indexing instead:
i_, j_ = list(np.array(c).T)
i = np.r_[i_, j_, np.arange(n)]
j = np.r_[j_, i_, np.arange(n)]
You can then np.lexsort i nd j if you care about order
Alternatively, you can define sparse_outer_eq as:
#jit
def sparse_outer_eq(A):
n = A.size
c = []
for i in range(n):
for j in range(n):
if A[i] == A[j]:
c.append((i, j))
return c
Which keeps >2x the data, but then the coordinates come out simply:
i, j = list(np.array(c).T)
if you've done any set operations, this will still need to be lexsorted if you want a rational order.
If your coordinates are each n-bit integers, this should be more space-efficient than boolean format as long as your sparsity is less than 1/n -> 3% or so for 32-bit.
as for time, thanks to numba it's even faster than broadcasting:
n = 3000
A = np.random.randint(0, 1000, n)
%timeit sparse_outer_eq(A)
100 loops, best of 3: 4.86 ms per loop
%timeit A == A[:, None]
100 loops, best of 3: 11.8 ms per loop
and comparisons:
a = A == A[:, None]
b = B == B[:, None]
a_ = sparse_outer_eq(A)
b_ = sparse_outer_eq(B)
%timeit a & b
100 loops, best of 3: 5.9 ms per loop
%timeit list(set(a_) & set(b_))
1000 loops, best of 3: 641 µs per loop
%timeit a | b
100 loops, best of 3: 5.52 ms per loop
%timeit list(set(a_) | set(b_))
1000 loops, best of 3: 955 µs per loop
EDIT: if you want to do &~ (as per your comment) use the second sparse_outer_eq method (so you don't have to keep track of the diagonal) and just do:
list(set(a_) - set(b_))

Here is the more or less canonical argsort solution:
import numpy as np
def f_argsort(A):
idx = np.argsort(A)
As = A[idx]
ne_ = np.r_[True, As[:-1] != As[1:], True]
bnds = np.flatnonzero(ne_)
valid = np.diff(bnds) != 1
return [idx[bnds[i]:bnds[i+1]] for i in np.flatnonzero(valid)]
n = 30000
A = np.random.randint(0, 1000, n)
groups = f_argsort(A)
for grp in groups:
print(len(grp), set(A[grp]), end=' ')
print()

I'm adding a solution to my question because it satisfies these 3 properties:
Low, fixed, memory requirement
Fast bitwise operations (&, |, ~, etc)
Low storage, 1-bit per Boolean via packing integers
The downside is it is stored in np.packbits format. It is substantially slower than other methods (especially argsort), but if speed is not an issue the algorithm should work well. If anyone figures a way to optimise further, this would be very helpful.
Update: A more efficient version of the below algorithm can be found here: Improving performance on comparison algorithm np.packbits(A==A[:, None], axis=1).
import numpy as np
from numba import jit
#jit(nopython=True)
def bool2int(x):
y = 0
for i, j in enumerate(x):
if j: y += int(j)<<(7-i)
return y
#jit(nopython=True)
def compare_elementwise(arr, result, section):
n = len(arr)
for row in range(n):
for col in range(n):
section[col%8] = arr[row] == arr[col]
if ((col + 1) % 8 == 0) or (col == (n-1)):
result[row, col // 8] = bool2int(section)
section[:] = 0
return result
A = np.random.randint(0, 10, 100)
n = len(A)
result_arr = np.zeros((n, n // 8 if n % 8 == 0 else n // 8 + 1)).astype(np.uint8)
selection_arr = np.zeros(8).astype(np.uint8)
packed = compare_elementwise(A, result_arr, selection_arr)

Comparing runtime of standard list comprehension vs. NumPy

I'm experimenting with NumPy to see how and where it is faster than using generic list comprehensions in Python. Here's a standard coding question I'm using for this experiment.
Find the sum of all the multiples of 3 or 5 below 1000000.
I have written three functions to compute this number.
def fA(M):
sum = 0
for x in range(M):
if x % 3 == 0 or x % 5 == 0:
sum += x
return sum
def fB(M):
multiples_3 = range(0, M, 3)
multiples_5 = range(0, M, 5)
multiples_15 = range(0, M, 15)
return sum(multiples_3) + sum(multiples_5) - sum(multiples_15)
def fC(M):
arr = np.arange(M)
return np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])
I first did a quick sanity check to see that the three functions produced the same answer.
I then used timeit to compare the runtimes for the three functions.
%timeit -n 100 fA(1000000)
100 loops, best of 3: 182 ms per loop
%timeit -n 100 fB(1000000)
100 loops, best of 3: 14.4 ms per loop
%timeit -n 100 fC(1000000)
100 loops, best of 3: 44 ms per loop
It's no surprise that fA is the slowest. But why is fB so much better than fC? Is there a better way to compute this answer using NumPy?
I don't think size is an issue here. In fact, if I change the 1e6 to 1e9, fC becomes even slower when compared to fB.

fB is so much faster than fC because fC is not the NumPy equivalent of fB. fC is the NumPy equivalent of fA. This is the NumPy equivalent of fB:
def fD(M):
multiples_3 = np.arange(0, M, 3)
multiples_5 = np.arange(0, M, 5)
multiples_15 = np.arange(0, M, 15)
return multiples_3.sum() + multiples_5.sum() - multiples_15.sum()
It runs way faster:
In [4]: timeit fB(1000000)
100 loops, best of 3: 9.96 ms per loop
In [5]: timeit fD(1000000)
1000 loops, best of 3: 637 µs per loop

In fB you are constructing the ranges with the exact multiples you want. Their sizes become smaller from 3 to 5 to 15 and thus each takes less time to construct than the one before, after they are constructed you only need to take the sum and do some arithmetic.
In fC you are constructing a 100000 element array, the size isn't really the issue as much as the two modulo comparisons you are doing which must look at every single element in the array. This takes the lion's share of the execution time (about 90 %) for fC.

You're only really using numpy there to generate an array. You'd see a much bigger difference if you were trying to perform operations on arrays as opposed to performing them on lists or tuples. With regards to that particular problem, take a look at the function fD in the code below, which just calculates how many multiples there should be in each range and then calculates their sum, rather than generating the array. Actually, if you run the below snippet, you'll see how the times change in function of M. Also, fC breaks down for M >= 100000. I couldn't tell you why.
import numpy as np
from time import time
def fA(M):
sum = 0
for x in range(M):
if x % 3 == 0 or x % 5 == 0:
sum += x
return sum
def fB(M):
multiples_3 = range(0, M, 3)
multiples_5 = range(0, M, 5)
multiples_15 = range(0, M, 15)
return sum(multiples_3) + sum(multiples_5) - sum(multiples_15)
def fC(M):
arr = np.arange(M)
return np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])
def fD(M):
return sum_mult(M,3)+sum_mult(M,5)-sum_mult(M,15)
def sum_mult(M,n):
instances=(M-1)/n
check=len(range(n,M,n))
return (n*instances*(instances+1))/2
for x in range(5,20):
print "*"*20
M=2**x
print M
answers=[]
T=[]
for f in (fA,fB,fC,fD):
ts=time()
answers.append(f(M))
for i in range(20):
f(M)
T.append(time()-ts)
if not all([x==answers[0] for x in answers]):
print "Warning! Answers do not match!",answers
print T

How to boost efficiency for large number modulus multiplications in Python

I am using python3 without any tailored library for some simple arithmetic. The operation that dominates computational efficiency is a multiplication of many 2048 bit values:
length=len(array)
res=1
for x in range(length):
res=(res*int(array[x]))
ret=res%n2
To give you an insight it takes ~3940 seconds to make 10000 multiplications moduli a number for every multiplication for an:
Intel Core i5 CPU M 560 # 2.67GHz × 4 with 8GB of memory, running Ubuntu 12.04 32bit machine.
Would it make sense to boost it up using a library like gmpy2 or there would not be any advantage?

You seem to be calculating the product of all the numbers first then taking the remainder, rather than exploiting the properties of modular multiplication: a * b * c mod p == (a * b mod p) * c mod p. This takes very little time at all to multiply 10,000 2048-bit numbers modulo some n:
In [1]: import random
In [2]: array = [random.randrange(2**2048) for i in range(10000)]
In [3]: n = random.randrange(2**2048)
In [4]: prod = 1
In [5]: %%time
...: for e in array:
...: prod *= e
...: prod %= n
...:
CPU times: user 210 ms, sys: 4.07 ms, total: 214 ms
Wall time: 206 ms
For you, I would suggest:
array = map(int, array)
prod = 1
for x in array:
prod *= x
prod %= n2

What is the best way to compute the trace of a matrix product in numpy?

If I have numpy arrays A and B, then I can compute the trace of their matrix product with:
tr = numpy.linalg.trace(A.dot(B))
However, the matrix multiplication A.dot(B) unnecessarily computes all of the off-diagonal entries in the matrix product, when only the diagonal elements are used in the trace. Instead, I could do something like:
tr = 0.0
for i in range(n):
tr += A[i, :].dot(B[:, i])
but this performs the loop in Python code and isn't as obvious as numpy.linalg.trace.
Is there a better way to compute the trace of a matrix product of numpy arrays? What is the fastest or most idiomatic way to do this?

You can improve on #Bill's solution by reducing intermediate storage to the diagonal elements only:
from numpy.core.umath_tests import inner1d
m, n = 1000, 500
a = np.random.rand(m, n)
b = np.random.rand(n, m)
# They all should give the same result
print np.trace(a.dot(b))
print np.sum(a*b.T)
print np.sum(inner1d(a, b.T))
%timeit np.trace(a.dot(b))
10 loops, best of 3: 34.7 ms per loop
%timeit np.sum(a*b.T)
100 loops, best of 3: 4.85 ms per loop
%timeit np.sum(inner1d(a, b.T))
1000 loops, best of 3: 1.83 ms per loop
Another option is to use np.einsum and have no explicit intermediate storage at all:
# Will print the same as the others:
print np.einsum('ij,ji->', a, b)
On my system it runs slightly slower than using inner1d, but it may not hold for all systems, see this question:
%timeit np.einsum('ij,ji->', a, b)
100 loops, best of 3: 1.91 ms per loop

From wikipedia you can calculate the trace using the hadamard product (element-wise multiplication):
# Tr(A.B)
tr = (A*B.T).sum()
I think this takes less computation than doing numpy.trace(A.dot(B)).
Edit:
Ran some timers. This way is much faster than using numpy.trace.
In [37]: timeit("np.trace(A.dot(B))", setup="""import numpy as np;
A, B = np.random.rand(1000,1000), np.random.rand(1000,1000)""", number=100)
Out[38]: 8.6434469223022461
In [39]: timeit("(A*B.T).sum()", setup="""import numpy as np;
A, B = np.random.rand(1000,1000), np.random.rand(1000,1000)""", number=100)
Out[40]: 0.5516049861907959

Note that one slight variant is to take the dot product of the vectorized matrices. In python, vectorization is done using .flatten('F'). It's slightly slower than taking the sum of the Hadamard product, on my computer, so it's a worse solution than wflynny's , but I think it's kind of interesting, since it can be more intuitive, in some situations, in my opinion. For example, personally I find that for the matrix normal distribution, the vectorized solution is easier for me to understand.
Speed comparison, on my system:
import numpy as np
import time
N = 1000
np.random.seed(123)
A = np.random.randn(N, N)
B = np.random.randn(N, N)
tart = time.time()
for i in range(10):
C = np.trace(A.dot(B))
print(time.time() - start, C)
start = time.time()
for i in range(10):
C = A.flatten('F').dot(B.T.flatten('F'))
print(time.time() - start, C)
start = time.time()
for i in range(10):
C = (A.T * B).sum()
print(time.time() - start, C)
start = time.time()
for i in range(10):
C = (A * B.T).sum()
print(time.time() - start, C)
Result:
6.246593236923218 -629.370798672
0.06539678573608398 -629.370798672
0.057890892028808594 -629.370798672
0.05709719657897949 -629.370798672

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

cython prange not so fast than single thread - python

Related

Is my Python/Cython iteration benchmark representative?

Comparing numpy array with itself by element efficiently

Comparing runtime of standard list comprehension vs. NumPy

How to boost efficiency for large number modulus multiplications in Python

What is the best way to compute the trace of a matrix product in numpy?

Categories

Resources