I want to iterate through a large data structure in a Python program and perform a task for each element. For simplicity, let's say the elements are integers and the task is just an incrementation. In the end, the last incremented element is returned as a (dummy) result. In search of the best structure/method for this, I compared timings in pure Python and Cython for these structures (I could not find a direct comparison of them elsewhere):
Python list
NumPy array / typed memory view
Cython extension type with underlying C++ vector
The iterations I timed are:
Python foreach in list iteration (it_list)
Cython list iteration with explicit element access (cit_list)
Python foreach in array iteration (it_nparray)
Python NumPy vectorised operation (vec_nparray)
Cython memory view iteration with explicit element access (cit_memview)
Python foreach in underlying vector iteration (it_pyvector)
Python foreach in underlying vector iteration via __iter__ (it_pyvector_iterator)
Cython vector iteration with explicit element access (cit_pyvector)
Cython vector iteration via vector.iterator (cit_pyvector_iterator)
I am concluding from this (timings are below):
plain Python iteration over the NumPy array is extremely slow (about 10 times slower than the Python list iteration) -> not a good idea
Python iteration over the wrapped C++ vector is slow, too (about 1.5 times slower than the Python list iteration) -> not a good idea
Cython iteration over the wrapped C++ vector is the fastest option, approximately equal to the C contiguous memory view
The iteration over the vector using explicit element access is slightly faster than using an iterator -> why bother to use an iterator?
The memory view approach has comparatively larger overhead than the extension type approach
My question is now: Are my numbers reliable (did I do something wrong or miss anything here)? Is this in line with your experience with real-world examples? Is there anything else I could do to improve the iteration? Below are the code I used and the timings. I am using this in a Jupyter notebook, by the way. Suggestions and comments are highly appreciated!
Relative timings (minimum value 1.000), for different data structure sizes n:
================================================================================
Timings for n = 1:
--------------------------------------------------------------------------------
cit_pyvector_iterator: 1.000
cit_pyvector: 1.005
cit_list: 1.023
it_list: 3.064
it_pyvector: 4.230
it_pyvector_iterator: 4.937
cit_memview: 8.196
vec_nparray: 20.187
it_nparray: 25.310
================================================================================
================================================================================
Timings for n = 1000:
--------------------------------------------------------------------------------
cit_pyvector_iterator: 1.000
cit_pyvector: 1.001
cit_memview: 2.453
vec_nparray: 5.845
cit_list: 9.944
it_list: 137.694
it_pyvector: 199.702
it_pyvector_iterator: 218.699
it_nparray: 1516.080
================================================================================
================================================================================
Timings for n = 1000000:
--------------------------------------------------------------------------------
cit_pyvector: 1.000
cit_memview: 1.056
cit_pyvector_iterator: 1.197
vec_nparray: 2.516
cit_list: 7.089
it_list: 87.099
it_pyvector_iterator: 143.232
it_pyvector: 162.374
it_nparray: 897.602
================================================================================
================================================================================
Timings for n = 10000000:
--------------------------------------------------------------------------------
cit_pyvector: 1.000
cit_memview: 1.004
cit_pyvector_iterator: 1.060
vec_nparray: 2.721
cit_list: 7.714
it_list: 88.792
it_pyvector_iterator: 130.116
it_pyvector: 149.497
it_nparray: 872.798
================================================================================
Cython code:
%%cython --annotate
# distutils: language = c++
# cython: boundscheck = False
# cython: wraparound = False
from libcpp.vector cimport vector
from cython.operator cimport dereference as deref, preincrement as princ
# Extension type wrapping a vector
cdef class pyvector:
cdef vector[long] _data
cpdef void push_back(self, long x):
self._data.push_back(x)
def __iter__(self):
cdef size_t i, n = self._data.size()
for i in range(n):
yield self._data[i]
@property
def data(self):
return self._data
# Cython iteration over Python list
cpdef long cit_list(list l):
cdef:
long j, ii
size_t i, n = len(l)
for i in range(n):
ii = l[i]
j = ii + 1
return j
# Cython iteration over NumPy array
cpdef long cit_memview(long[::1] v) nogil:
cdef:
size_t i, n = v.shape[0]
long j
for i in range(n):
j = v[i] + 1
return j
# Iterate over pyvector
cpdef long cit_pyvector(pyvector v) nogil:
cdef:
size_t i, n = v._data.size()
long j
for i in range(n):
j = v._data[i] + 1
return j
cpdef long cit_pyvector_iterator(pyvector v) nogil:
cdef:
vector[long].iterator it = v._data.begin()
long j
while it != v._data.end():
j = deref(it) + 1
princ(it)
return j
Python code:
# Python iteration over Python list
def it_list(l):
for i in l:
j = i + 1
return j
# Python iteration over NumPy array
def it_nparray(a):
for i in a:
j = i + 1
return j
# Vectorised NumPy operation
def vec_nparray(a):
    b = a + 1
    return b[-1]
# Python iteration over C++ vector extension type
def it_pyvector_iterator(v):
for i in v:
j = i + 1
return j
def it_pyvector(v):
for i in v.data:
j = i + 1
return j
And for the benchmark:
import numpy as np
from operator import itemgetter
def bm(sizes):
"""Call functions with data structures of varying length"""
Timings = {}
for n in sizes:
Timings[n] = {}
# Python list
list_ = list(range(n))
# NumPy array
a = np.arange(n, dtype=np.int64)
# C++ vector extension type
pyv = pyvector()
for i in range(n):
pyv.push_back(i)
calls = [
(it_list, list_),
(cit_list, list_),
(it_nparray, a),
(vec_nparray, a),
(cit_memview, a),
(it_pyvector, pyv),
(it_pyvector_iterator, pyv),
(cit_pyvector, pyv),
(cit_pyvector_iterator, pyv),
]
for fxn, arg in calls:
Timings[n][fxn.__name__] = %timeit -o fxn(arg)
return Timings
def ratios(timings, base=None):
"""Show relative performance of runs based on `timings` dict"""
if base is not None:
base = timings[base].average
else:
base = min(x.average for x in timings.values())
return sorted([
(k, v.average / base)
for k, v in timings.items()
], key=itemgetter(1))
Timings = {}
sizes = [1, 1000, 1000000, 10000000]
Timings.update(bm(sizes))
for s in sizes:
print("=" * 80)
print(f"Timings for n = {s}:")
print("-" * 80)
for x in ratios(Timings[s]):
print(f"{x[0]:>25}: {x[1]:7.3f}")
print("=" * 80, "\n")
Related
In the NumPy library, one can pass a list into the numpy.searchsorted function, which searches through a different (sorted) list one element at a time and returns an array of the same size containing the indices needed to preserve order. However, this seems to waste performance if both lists are sorted. For example:
m=[1,3,5,7,9]
n=[2,4,6,8,10]
numpy.searchsorted(m,n)
would return [1,2,3,4,5], which is the correct answer, but it looks like this has complexity O(n ln(m)), whereas if one were to simply loop through m while keeping some kind of pointer into n, the complexity seems more like O(n+m). Is there some kind of function in NumPy which does this?
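To make the idea concrete, here is a minimal pure-Python sketch of the two-pointer pass described above (my own illustration, not part of the question); for sorted inputs it reproduces the searchsorted result in O(n + m):

def merge_search(m, n):
    # For each value of the sorted list n, advance a single pointer over the
    # sorted list m; total work is O(len(m) + len(n)).
    out = []
    i = 0
    for x in n:
        while i < len(m) and m[i] < x:
            i += 1
        out.append(i)
    return out

print(merge_search([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))  # [1, 2, 3, 4, 5]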
AFAIK, this is not possible to do in linear time with NumPy alone without making additional assumptions on the inputs (e.g. the integers are small and bounded). An alternative solution is to use Numba to do the merge manually:
import numba as nb
import numpy as np

# Note: Numba requires a function signature with well-defined array types
@nb.njit('int64[:](int64[::1], int64[::1])')
def search_both_sorted(a, b):
i, j = 0, 0
result = np.empty(b.size, np.int64)
    while i < a.size and j < b.size:
if a[i] < b[j]:
i += 1
else:
result[j] = i
j += 1
for k in range(j, b.size):
result[k] = i
return result
a, b = np.cumsum(np.random.randint(0, 100, (2, 1000000)).astype(np.int64), axis=1)
result = search_both_sorted(a, b)
A faster implementation consists of using a branchless approach to remove the overhead of branch mispredictions (especially on random/unpredictable inputs) when a and b are about the same size. Additionally, the O(n log m) algorithm can be faster when b is small, so using np.searchsorted in that case is very efficient, as pointed out by @MichaelSzczesny. Note that the Numba implementation of np.searchsorted can be a bit slower than NumPy's, so it is better to pick the NumPy implementation. Here is the optimized version:
@nb.njit('int64[:](int64[::1], int64[::1])')
def search_both_sorted_opt_numba(a, b):
sa, sb = a.size, b.size
# Choose the best algorithm
if sb < sa * 0.15:
# Use a version with branches because `a[i] < b[j]`
# should be most of the time true.
i, j = 0, 0
result = np.empty(b.size, np.int64)
while i < a.size and j < b.size:
if a[i] < b[j]:
i += 1
else:
result[j] = i
j += 1
for k in range(j, b.size):
result[k] = i
else:
        # Use a branchless approach to avoid branch mispredictions
i, j = 0, 0
result = np.empty(b.size, np.int64)
while i < a.size and j < b.size:
tmp = a[i] < b[j]
result[j] = i
i += tmp
j += ~tmp
for k in range(j, b.size):
result[k] = i
return result
def search_both_sorted_opt(a, b):
sa, sb = a.size, b.size
# Choose the best algorithm
if 2 * sb * np.log2(sa) < sa + sb:
return np.searchsorted(a, b)
else:
return search_both_sorted_opt_numba(a, b)
searchsorted: 19.1 ms
snp_search: 11.8 ms
search_both_sorted: 6.5 ms
search_both_sorted_branchless: 4.3 ms
The optimized branchless Numba implementation is about 4.4 times faster than searchsorted which is pretty good considering that the code of searchsorted is already highly optimized. It can be even faster when a and b are huge because of cache locality.
You could use sortednp; unfortunately it does not give much flexibility. In the code snippet below I used its merge with index tracking, but it produces three arrays, so four times more memory than necessary is used; still, it is faster than searchsorted.
import numpy as np
import sortednp as snp
a = np.cumsum(np.random.rand(1000000))
b = np.cumsum(np.random.rand(1000000))
def snp_search(a,b):
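    # Each b[k] lands in the merged array after searchsorted(a, b[k]) elements
    # of a and after its k predecessors from b, so ib[k] - k recovers the index in a.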
m, (ib, ia) = snp.merge(b, a, indices=True)
return ib - np.arange(len(ib))
assert(np.all(snp_search(a,b) == np.searchsorted(a,b)))
np.searchsorted(a, b); #58 ms
snp_search(a,b); # 22ms
np.searchsorted already takes this into account, as can be seen from the source code:
/*
* Updating only one of the indices based on the previous key
* gives the search a big boost when keys are sorted, but slightly
* slows down things for purely random ones.
*/
if (cmp(last_key_val, key_val)) {
max_idx = arr_len;
}
else {
min_idx = 0;
max_idx = (max_idx < arr_len) ? (max_idx + 1) : arr_len;
}
Here min_idx, max_idx are used to perform binary search on the array. If last_key_val < key_val then only max_idx is reset to the array length, but min_idx remains at its current value, i.e. binary search starts at the same lower boundary as for the previous key.
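A quick way to see this effect (my own check, not part of the original answer) is to time searchsorted with the same keys in sorted and in shuffled order; the sorted case should typically come out faster thanks to the boundary reuse described above, although the exact numbers depend on the NumPy version and machine:

import numpy as np
import timeit

a = np.cumsum(np.random.randint(1, 100, 1000000))
b_sorted = np.cumsum(np.random.randint(1, 100, 1000000))
b_shuffled = np.random.permutation(b_sorted)

print(timeit.timeit(lambda: np.searchsorted(a, b_sorted), number=10))
print(timeit.timeit(lambda: np.searchsorted(a, b_shuffled), number=10))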
I am trying to solve a problem at Codility.
And I am wondering whether answ = [max] * N takes linear or constant time.
def solution(N, A):
answ = [0] * N
max = 0
for item in A:
if item > N:
answ = [max] * N # this line here. Linear or constant time ?
else:
answ[item-1] += 1
if answ[item-1] > max:
max = answ[item-1]
return answ
List A has length M.
So, if time is constant I will receive O(M) complexity of algorithm.
If linear, I will receive O(M*N) complexity.
Yes, it is linear: CPython lists are merely arrays of pointers. Check out the struct definition in listobject.h:
https://hg.python.org/cpython/file/tip/Include/listobject.h#l22
typedef struct {
PyObject_VAR_HEAD
/* Vector of pointers to list elements. list[0] is ob_item[0], etc. */
PyObject **ob_item;
/* ob_item contains space for 'allocated' elements. The number
* currently in use is ob_size.
* Invariants:
* 0 <= ob_size <= allocated
* len(list) == ob_size
* ob_item == NULL implies ob_size == allocated == 0
* list.sort() temporarily sets allocated to -1 to detect mutations.
*
* Items must normally not be NULL, except during construction when
* the list is not yet visible outside the function that builds it.
*/
Py_ssize_t allocated;
} PyListObject;
If that doesn't convince you....
In [1]: import time
In [2]: import matplotlib.pyplot as plt
In [3]: def build_list(N):
...: start = time.time()
...: lst = [0]*N
...: stop = time.time()
...: return stop - start
...:
In [4]: x = list(range(0,1000000, 10000))
In [5]: y = [build_list(n) for n in x]
In [6]: plt.scatter(x, y)
Out[6]: <matplotlib.collections.PathCollection at 0x7f2d0cae7438>
In [7]: plt.show()
Since you're populating an array of size N with the value max, that means you're doing N writes, hence it is linear in complexity.
There are some data structures that can provide a "default" value for all items that aren't explicitly assigned, within a bounded array size. However, Python's list() isn't such a structure.
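For completeness (this is not part of the original answers): a common way to sidestep the repeated O(N) refill in this kind of Codility problem is to remember the current "floor" value and apply it lazily, instead of rewriting the whole list; a sketch:

def solution_lazy(N, A):
    counters = [0] * N
    floor = 0   # value every counter is guaranteed to have reached
    best = 0    # current maximum counter value
    for item in A:
        if item > N:
            floor = best   # O(1) instead of answ = [max] * N
        else:
            idx = item - 1
            counters[idx] = max(counters[idx], floor) + 1
            best = max(best, counters[idx])
    return [max(c, floor) for c in counters]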
I just wrote a trivial program to test how cython's prange performs, and here is the code:
from cython.parallel import prange
import numpy as np
def func(int r, int c):
cdef:
double[:,:] a = np.arange(r*c, dtype=np.double).reshape(r,c)
double total = 0
int i, j
for i in prange(r, nogil=True, schedule='static', chunksize=1):
for j in range(c):
total += a[i,j]
return total
On a MacBook Pro, with OMP_NUM_THREADS=3, the above code takes almost 18 sec for (r, c) = (10000, 100000), while with a single thread it takes about 21 sec.
Why is there so little performance boost? Am I using prange correctly?
Have you timed how long it takes just to allocate a? A 10000 x 100000 float64 array takes up 8GB of memory.
a = np.ones((10000, 100000), np.double)
takes over six seconds on my laptop with 16GB of RAM. If you don't have 8GB free then you'll hit the swap and it will take a lot longer. Since func spends almost all of its time just allocating a, parallelizing your outer for loop can therefore only gain you a small fractional improvement on the total runtime.
To demonstrate this, I have modified your function to accept a as an input. In tmp.pyx:
#cython: boundscheck=False, wraparound=False, initializedcheck=False
from cython.parallel cimport prange
def serial(double[:, :] a):
cdef:
double total = 0
int i, j
for i in range(a.shape[0]):
for j in range(a.shape[1]):
total += a[i, j]
return total
def parallel(double[:, :] a):
cdef:
double total = 0
int i, j
for i in prange(a.shape[0], nogil=True, schedule='static', chunksize=1):
for j in range(a.shape[1]):
total += a[i, j]
return total
For example:
In [1]: import numpy as np; import tmp
In [2]: r, c = 10000, 100000
In [3]: a = np.random.randn(r, c) # this takes ~6.75 sec
In [4]: %timeit tmp.serial(a)
1 loops, best of 3: 1.25 s per loop
In [5]: %timeit tmp.parallel(a)
1 loops, best of 3: 450 ms per loop
Parallelizing the function gave about a 2.8x speed-up* on my laptop with 4 cores, but this is only a small fraction of the time taken to allocate a.
The lesson here is to always profile your code to understand where it spends most of its time before you dive into optimizations.
* You could do a little better by passing larger chunks of a to each worker thread, e.g. by increasing chunksize or using schedule='guided'
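To make that footnote concrete, a guided-schedule variant of parallel might look like the sketch below (my addition, assuming the same tmp.pyx module and prange cimport as above; the actual benefit depends on the machine and array shape):

def parallel_guided(double[:, :] a):
    cdef:
        double total = 0
        int i, j
    # 'guided' hands out large chunks first and progressively smaller ones,
    # which reduces scheduling overhead compared to chunksize=1.
    for i in prange(a.shape[0], nogil=True, schedule='guided'):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total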
I was trying to write a function that takes a matrix of 2D points and a probability p, and changes or swaps each point's coordinates with probability p.
So I asked a question, and I was trying to use a binary sequence as an array of the powers of a specific matrix swap_matrix=[[0,1],[1,0]] to randomly swap (in a specific proportion) the coordinates of a given set of 2D points. However, I realised that the power function only accepts integer values and not arrays. And shuffle, as far as I can tell, operates on the whole matrix; you cannot specify a specific dimension.
Having either of these two functions is OK.
For example:
swap(a=[[1,2],[2,3],[3,4],[3,5],[5,6]],b=[0,0,0,1,1])
should return [[1,2],[2,3],[3,4],[5,3],[6,5]]
The idea that just popped up, and which I am now adding in this edit, is:
def swap(mat,K,N):
#where K/N is the proportion and K and N are natural numbers
#mat is an N*2 matrix; for each row I plan to either randomly swap
#its coordinates or keep them as they are
a=[[[0,1],[1,0]]]
b=[[[1,0],[0,1]]]
a=np.repeat(a,K,axis=0)
b=np.repeat(b,N-K,axis=0)
out=np.append(a,b,axis=0)
np.random.shuffle(out)
return np.multiply(mat,out.T)
Where I get an error because I cannot flatten only once to make the matrices compatible for multiplication!
Again, I am looking for an efficient method (vectorised, in the MATLAB sense).
P.S. In my special case the matrix has shape (N, 2), with the second column all ones, if that helps.
Maybe this is good enough for your purposes. In a quick test it appears to be about 13x faster than the blunt for-loop approach (@Naji, posting your "inefficient" code is helpful for making a comparison).
Edited my code following Jaime's comment
def swap(a, b):
a = np.copy(a)
b = np.asarray(b, dtype=np.bool)
a[b] = a[b, ::-1] # equivalent to: a[b] = np.fliplr(a[b])
return a
# the following is faster, but modifies the original array
def swap_inplace(a, b):
b = np.asarray(b, dtype=np.bool)
a[b] = a[b, ::-1]
print swap(a=[[1,2],[2,3],[3,4],[3,5],[5,6]],b=[0,0,0,1,1])
Outputs:
[[1 2]
[2 3]
[3 4]
[5 3]
[6 5]]
Edit to include more detailed timings
I wanted to know if I could speed this up still further with Cython, so I investigated the efficiency some more :-) The results are worth mentioning I think (since efficiency is part of the actual question), but I do apologize in advance for the amount of additional code.
First the results. The "cython" function is clearly the fastest of all, another 10x faster than the proposed NumPy solution above. The "blunt loop approach" I mentioned is given by the function named "loop", but as it turns out there are much faster methods conceivable. My pure Python solution is only 3x slower than the vectorized NumPy code above! Another thing to note is that "swap_inplace" was most of the time only marginally faster than "swap". Also, the timings vary a bit with different random matrices a and b... So now you know :-)
function     | millisec | normalized
-------------+----------+-----------
loop         |   184    |   10.
double_loop  |    84    |    4.7
pure_python  |    51    |    2.8
swap         |    18    |    1
swap_inplace |    17    |    0.95
cython       |     1.9  |    0.11
And the rest of the code I used (it seems I took this way too seriously :P):
def loop(a, b):
a_c = np.copy(a)
for i in xrange(a.shape[0]):
if b[i]:
            a_c[i,:] = a[i, ::-1]
    return a_c
def double_loop(a, b):
a_c = np.copy(a)
n, m = a_c.shape
for i in xrange(n):
if b[i]:
for j in xrange(m):
a_c[i, j] = a[i, m-j-1]
return a_c
from copy import copy
def pure_python(a, b):
a_c = copy(a)
n, m = len(a), len(a[0])
for i in xrange(n):
if b[i]:
for j in xrange(m):
a_c[i][j] = a[i][m-j-1]
return a_c
import pyximport; pyximport.install()
import testcy
def cython(a, b):
return testcy.swap(a, np.asarray(b, dtype=np.uint8))
def rand_bin_array(K, N):
arr = np.zeros(N, dtype=np.bool)
arr[:K] = 1
np.random.shuffle(arr)
return arr
N = 100000
a = np.random.randint(0, N, (N, 2))
b = rand_bin_array(int(0.33*N), N)
# before timing the pure python solution I first did:
a = a.tolist()
b = b.tolist()
######### In the file testcy.pyx #########
#cython: boundscheck=False
#cython: wraparound=False
import numpy as np
cimport numpy as np
def swap(np.ndarray[np.int_t, ndim=2] a, np.ndarray[np.uint8_t, ndim=1] b):
cdef np.ndarray[np.int_t, ndim=2] a_c
cdef int n, m, i, j
a_c = a.copy()
n = a_c.shape[0]
m = a_c.shape[1]
for i in range(n):
if b[i]:
for j in range(m):
a_c[i, j] = a[i, m-j-1]
return a_c
For my project, I need to store a certain number (~100x100) of floats in a two-dimensional array. During the function's calculation I need to read from and write to the array, and since the function is the real bottleneck (consuming 98% of the time), I really need it to be fast.
I did some experiments with numpy and cython:
import numpy
import time
cimport numpy
cimport cython
cdef int col, row
DTYPE = numpy.int
ctypedef numpy.int_t DTYPE_t
cdef numpy.ndarray[DTYPE_t, ndim=2] matrix_c = numpy.zeros([100 + 1, 100 + 1], dtype=DTYPE)
time_ = time.time()
for l in xrange(5000):
for col in xrange(100):
for row in xrange(100):
matrix_c[<unsigned int>row + 1][<unsigned int>col + 1] = matrix_c[<unsigned int>row][<unsigned int>col]
print "Numpy + cython time: {0}".format(time.time() - time_)
but I found out that, in spite of all my attempts, the version using Python lists is still significantly faster.
Code using lists:
matrix = []
for i in xrange(100 + 1):
matrix.append([])
for j in xrange(100 + 1):
matrix[i].append(0)
time_ = time.time()
for l in xrange(5000):
for col in xrange(100):
for row in xrange(100):
matrix[row + 1][col + 1] = matrix[row][col]
print "list time: {0}".format(time.time() - time_)
And results:
list time: 0.0141758918762
Numpy + cython time: 0.484772920609
Have I done something wrong? If not, is there anything that would help me to improve the results?
Here's my version of the code you have.
There are three functions, dealing with integer arrays, 32 bit floating point arrays and double precision floating point arrays, respectively.
from numpy cimport ndarray as ar
cimport numpy as np
import numpy as np
cimport cython
import time
@cython.boundscheck(False)
@cython.wraparound(False)
def access_write_int(ar[int,ndim=2] c, int n):
cdef int l, col, row, h=c.shape[0], w=c.shape[1]
time_ = time.time()
for l in range(n):
for row in range(h-1):
for col in range(w-1):
c[row+1,col+1] = c[row,col]
print "Numpy + cython time: {0}".format(time.time() - time_)
@cython.boundscheck(False)
@cython.wraparound(False)
def access_write_float(ar[np.float32_t,ndim=2] c, int n):
cdef int l, col, row, h=c.shape[0], w=c.shape[1]
time_ = time.time()
for l in range(n):
for row in range(h-1):
for col in range(w-1):
c[row+1,col+1] = c[row,col]
print "Numpy + cython time: {0}".format(time.time() - time_)
@cython.boundscheck(False)
@cython.wraparound(False)
def access_write_double(ar[double,ndim=2] c, int n):
cdef int l, col, row, h=c.shape[0], w=c.shape[1]
time_ = time.time()
for l in range(n):
for row in range(h-1):
for col in range(w-1):
c[row+1,col+1] = c[row,col]
print "Numpy + cython time: {0}".format(time.time() - time_)
To call these functions from Python I run this
import numpy as np
from numpy.random import rand, randint
print "integers"
c = randint(0, high=20, size=(101,101))
access_write_int(c, 5000)
print "32 bit float"
c = rand(101, 101).astype(np.float32)
access_write_float(c, 5000)
print "double precision"
c = rand(101, 101)
access_write_double(c, 5000)
The following changes are important:
Avoid slicing the array by accessing it using indices of the form [i,j] instead of [i][j]
Define the variables l, col, and row, as integers so that the for loops run in C.
Use the function decorators @cython.boundscheck(False) and @cython.wraparound(False) to turn off bounds checking and wraparound indexing for the key portion of the program. This allows out-of-bounds memory accesses, so you should only do this when you are certain that your indices are what they should be.
Swap the two innermost for loops so that you access your array according to how it is arranged in memory. This makes a much bigger difference for larger arrays. The arrays given by np.zeros, np.random.rand, etc. are usually C contiguous, so rows are stored in contiguous blocks, and it is faster to put the row index in the outer for loop and the column index in the inner one, so that the innermost loop walks along a contiguous row (see the short sketch after this list). If you want to keep the for loops as they are, consider taking the transpose of your array before you run the function on it so that the columns are in contiguous blocks instead.
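A short sketch of the first and last points (my own illustration, not from the answer): [i][j] first materialises the row and then indexes it, while [i, j] is a single lookup, and for a C-contiguous array the column index is the one that should vary fastest:

import numpy as np

a = np.zeros((101, 101))

# Two-step indexing: a[5] builds a row object first, which is then indexed;
# with a typed Cython array this path stays at the Python level.
x = a[5][7]

# Single indexing operation: with a typed array this compiles to one C access.
y = a[5, 7]

# C-contiguous layout: elements of a row are adjacent in memory, so keep the
# column index in the innermost loop.
for row in range(100):
    for col in range(100):
        a[row + 1, col + 1] = a[row, col]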
The problem seems to be the way you are accessing the matrix elements.
Use [i,j] instead of [i][j].
Also, you can remove the <unsigned int> casts; they guard against negative indices, but they add overhead to every access here.
Also, I would use range instead of xrange, since all the Cython examples in the documentation use range.
The result will be something like:
import numpy
import time
cimport numpy
cimport cython
cdef int col, row
INT = numpy.int
ctypedef numpy.int_t cINT
cdef numpy.ndarray[cINT, ndim=2] matrix_c = numpy.zeros([100 + 1, 100 + 1], dtype=INT)
time_ = time.time()
for l in range(5000):
for col in range(100):
for row in range(100):
matrix_c[row + 1, col + 1] = matrix_c[row, col]
print "Numpy + cython time: {0}".format(time.time() - time_)
Strongly recommended reference:
- Working with NumPy in Cython
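As a side note that is not part of either answer: newer Cython code usually prefers typed memoryviews over the np.ndarray buffer syntax used above. An equivalent sketch (assuming the same cimport cython as in the first answer, and an array whose dtype matches a C long) could look like:

@cython.boundscheck(False)
@cython.wraparound(False)
def access_write_memview(long[:, :] c, int n):
    # Same access pattern as access_write_int, but on a typed memoryview.
    cdef int l, col, row, h = c.shape[0], w = c.shape[1]
    for l in range(n):
        for row in range(h - 1):
            for col in range(w - 1):
                c[row + 1, col + 1] = c[row, col]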