The following is my cython code, the purpose is to do a bootstrap.
def boots(int trial, np.ndarray[double, ndim=2] empirical, np.ndarray[double, ndim=2] expected):
cdef int length = len(empirical)
cdef np.ndarray[double, ndim=2] ret = np.empty((trial, 100))
cdef np.ndarray[long] choices
cdef np.ndarray[double] m
cdef np.ndarray[double] n
cdef long o
cdef int i
cdef int j
for i in range(trial):
choices = np.random.randint(0, length, length)
m = np.zeros(100)
n = np.zeros(100)
for j in range(length):
o = choices[j]
m.__iadd__(empirical[o])
n.__iadd__(expected[o])
empirical_boot = m / length
expected_boot = n / length
ret[i] = empirical_boot / expected_boot - 1
ret.sort(axis=0)
return ret[int(trial * 0.025)].reshape((10,10)), ret[int(trial * 0.975)].reshape((10,10))
# test code
empirical = np.ones((40000, 100))
expected = np.ones((40000, 100))
%prun -l 10 boots(100, empirical,expected)
It takes 11 seconds in pure python with fancy indexing, and no matter how hard I tuned in cython it stays the same.
np.random.randint(0, 40000, 40000) takes 1 ms, so 100x takes 0.1s.
np.sort(np.ones((40000, 100)) takes 0.2s.
Thus I feel there must be ways to improve boots.
The primary issue you are seeing is that Cython only optimizes single-item access for typed arrays. This means that each of the lines in your code where you are using vectorization from NumPy still involve creating and interacting with Python objects.
The code you have there wasn't faster than the pure Python version because it wasn't really doing any of the computation differently.
You will have to avoid this by writing out the looping operations explicitly.
Here is a modified version of your code that runs significantly faster.
from numpy cimport ndarray as ar
from numpy cimport int32_t as int32
from numpy import empty
from numpy.random import randint
cimport cython
ctypedef int
# Notice the use of these decorators to tell Cython to turn off
# some of the checking it does when accessing arrays.
#cython.boundscheck(False)
#cython.wraparound(False)
def boots(int32 trial, ar[double, ndim=2] empirical, ar[double, ndim=2] expected):
cdef:
int32 length = empirical.shape[0], i, j, k
int32 o
ar[double, ndim=2] ret = empty((trial, 100))
ar[int32] choices
ar[double] m = empty(100), n = empty(100)
for i in range(trial):
# Still calling Python on this line
choices = randint(0, length, length)
# It was faster to compute m and n separately.
# I suspect that has to do with cache management.
# Instead of allocating new arrays, I just filled the old ones with the new values.
o = choices[0]
for k in range(100):
m[k] = empirical[o,k]
for j in range(1, length):
o = choices[j]
for k in range(100):
m[k] += empirical[o,k]
o = choices[0]
for k in range(100):
n[k] = expected[o,k]
for j in range(1, length):
o = choices[j]
for k in range(100):
n[k] += expected[o,k]
# Here I simplified some of the math and got rid of temporary arrays
for k in range(100):
ret[i,k] = m[k] / n[k] - 1.
ret.sort(axis=0)
return ret[int(trial * 0.025)].reshape((10,10)), ret[int(trial * 0.975)].reshape((10,10))
If you want to have a look at which lines of your code involve Python calls, the Cython compiler can generate an html file showing which lines call Python.
This option is called annotation.
The way you use it depends on how you are compiling your cython code.
If you are using the IPython notebook, just add the --annotate flag to the Cython cell magic.
You may also be able to benefit from turning on the C compiler optimization flags.
Related
My problem is as follows. I am generating a random bitstring of size n, and need to iterate over the indices for which the random bit is 1. For example, if my random bitstring ends up being 00101, I want to retrieve [2, 4] (on which I will iterate over). The goal is to do so in the fastest way possible with Python/NumPy.
One of the fast methods is to use NumPy and do
bitstring = np.random.randint(2, size=(n,))
l = np.nonzero(bitstring)[0]
The advantage with np.non_zero is that it finds indices of bits set to 1 much faster than if one iterates (with a for loop) over each bit and checks if it is set to 1.
Now, NumPy can generate a random bitstring faster via np.random.bit_generator.randbits(n). The problem is that it returns it as an integer, on which I cannot use np.nonzero anymore. I saw that for integers one can get the count of bits set to 1 in an integer x by using x.bit_count(), however there is no function to get the indices where bits are set to 1. So currently, I have to resort to a slow for loop, hence losing the initial speedup given by np.random.bit_generator.randbits(n).
How would you do something similar to (and as fast as) np.non_zero, but on integers instead?
Thank you in advance for your suggestions!
A minor optimisation to your code would be to use the new style random interface and generate bools rather than 64bit integers
rng = np.random.default_rng()
def original(n):
bitstring = rng.integers(2, size=n, dtype=bool)
return np.nonzero(bitstring)[0]
this causes it to take ~24 µs on my laptop, tested n upto 128.
I've previously noticed that getting a Numpy to generate a permutation is particularly fast, hence my comment above. Leading to:
def perm(n):
a = rng.permutation(n)
return a[:rng.binomial(n, 0.5)]
which takes between ~7 µs and ~10 µs depending on n. It also returns the indicies out of order, not sure if that's an issue for you. If your n isn't changing much, you could also swap to using rng.shuffle on an pre-allocated array, something like:
n = 32
a = np.arange(n)
def shuffle():
rng.shuffle(a)
return a[:rng.binomial(n, 0.5)]
which saves a couple of microseconds.
After some interesting proposals, I decided to do some benchmarking to understand how the running times grow as a function of n. The functions tested are the following:
def func1(n):
bit_array = np.random.randint(2, size=n)
return np.nonzero(bit_array)[0]
def func2(n):
bit_int = np.random.bit_generator.randbits(n)
a = np.zeros(bit_int.bit_count())
i = 0
for j in range(n):
if 1 & (bit_int >> j):
a[i] = j
i += 1
return a
def func3(n):
bit_string = format(np.random.bit_generator.randbits(n), f'0{n}b')
bit_array = np.array(list(bit_string), dtype=int)
return np.nonzero(bit_array)[0]
def func4(n):
rng = np.random.default_rng()
a = rng.permutation(n)
return a[:rng.binomial(n, 0.5)]
def func5(n):
a = np.arange(n)
rng.shuffle(a)
return a[:rng.binomial(n, 0.5)]
I used timeit to do the benchmark, looping 1000 over a statement each time and averaging over 10 runs. The value of n ranges from 2 to 65536, growing as powers of 2. The average running time is plotted and error bars correspondond to the standard deviation.
For solutions generating a bitstring, the simple func1 actually performs best among them whenever n is large enough (n>32). We can see that for low values of n (n< 16), using the randbits solution with the for loop (func2) is fastest, because the loop is not costly yet. However as n becomes larger, this becomes the worst solution, because all the time is spent in the for loop. This is why having a nonzero for integers would bring the best of both world and hopefully give a faster solution. We can observe that func3, which does a conversion in order to use nonzero after using randbits spends too long doing the conversion.
For implementations which exploit the binomial distribution (see Sam Mason's answer), we see that the use of shuffle (func5) instead of permutation (func4) can reduce the time by a bit, but overall they have similar performance.
Considering all values of n (that were tested), the solution given by Sam Mason which employs a binomial distribution together with shuffling (func5) is so far the most performant in terms of running time. Let's see if this can be improved!
I had a play with Cython to see how much difference it would make. I ended up with quite a lot of code and only ~5x better runtime performance:
from cpython.pycapsule cimport PyCapsule_IsValid, PyCapsule_GetPointer
import numpy as np
cimport numpy as np
cimport cython
from numpy.random cimport bitgen_t
np.import_array()
DTYPE = np.uint32
ctypedef np.uint32_t DTYPE_t
cdef extern int __builtin_popcountl(unsigned long) nogil
cdef extern int __builtin_ffsl(unsigned long) nogil
cdef const char *bgen_capsule_name = "BitGenerator"
#cython.boundscheck(False) # Deactivate bounds checking
#cython.wraparound(False) # Deactivate negative indexing.
cdef size_t generate_bits(object bitgen, np.uint64_t *state, Py_ssize_t state_len, np.uint64_t last_mask):
cdef Py_ssize_t i
cdef size_t nset
cdef bitgen_t *rng
capsule = bitgen.capsule
if not PyCapsule_IsValid(capsule, bgen_capsule_name):
raise ValueError("Expecting Numpy BitGenerator Capsule")
rng = <bitgen_t *> PyCapsule_GetPointer(capsule, bgen_capsule_name)
with bitgen.lock:
nset = 0
for i in range(state_len-1):
state[i] = rng.next_uint64(rng.state)
nset += __builtin_popcountl(state[i])
i = state_len-1
state[i] = rng.next_uint64(rng.state) & last_mask
nset += __builtin_popcountl(state[i])
return nset
cdef size_t write_setbits(DTYPE_t *result, DTYPE_t off, np.uint64_t state) nogil:
cdef size_t j
cdef int k
j = 0
while state:
# find first set bit returns zero when nothing is set
k = __builtin_ffsl(state) - 1
# clear out bit k
state &= ~(1ul<<k)
# record in output
result[j] = off + k
j += 1
return j
#cython.boundscheck(False) # Deactivate bounds checking
#cython.wraparound(False) # Deactivate negative indexing.
def rint(bitgen, unsigned int n):
cdef Py_ssize_t i, j, nset
cdef np.uint64_t[::1] state
cdef DTYPE_t[::1] result
state = np.empty((n + 63) // 64, dtype=np.uint64)
nset = generate_bits(bitgen, &state[0], len(state), (1ul << (n & 63)) - 1)
pyresult = np.empty(nset, dtype=DTYPE)
result = pyresult
j = 0
for i in range(len(state)):
j += write_setbits(&result[j], i * 64, state[i])
return pyresult
The above code is easy to use via the Cython Jupyter extension.
Comparing this to slightly tidied up versions of the OP's code can be done via:
import random
import timeit
import numpy as np
import matplotlib.pyplot as plt
bitgen = np.random.PCG64()
def func1(n):
# bool type is a bit faster
bit_array = np.random.randint(2, size=n, dtype=bool)
return np.nonzero(bit_array)[0]
def func2(n):
# OPs variant ends up using a CSPRNG which is slower
bit_int = random.getrandbits(n)
# this is much easier than using numpy arrays
return [i for i in range(n) if 1 & (bit_int >> i)]
def func3(n):
bit_string = format(random.getrandbits(n), f'0{n}b')
bit_array = np.array(list(bit_string), dtype='int8')
return np.nonzero(bit_array)[0]
def func4(n):
# shuffle variant is mostly the same
# plot already busy enough
a = np.random.permutation(n)
return a[:np.random.binomial(n, 0.5)]
def func_cython(n):
return rint(bitgen, n)
result = {}
niter = [2**i for i in range(1, 17)]
for name in 'func1 func2 func3 func4 func_cython'.split():
result[name] = res = []
for n in niter:
t = timeit.Timer(f"fn({n})", f"fn = {name}", globals=globals())
nit, dt = t.autorange()
res.append(dt / nit)
plt.loglog()
for name, times in result.items():
plt.plot(niter, np.array(times) * 1000, '.-', label=name)
plt.legend()
Which might produce output like:
Note that in order to reduce variance it's helpful to turn off CPU frequency scaling and turn off turbo modes. The Arch wiki has useful info on how to do this under Linux.
you could convert the number you get with randbits(n) to a numpy.ndarray.
depending on the size of n the compute time of the conversion should be faster than the loop.
n = 10
l = np.random.bit_generator.randbits(n) # gives you the int 616
l_string = f'{l:0{n}b}' # gives you a string representation of the int in length n 1001101000
l_nparray = np.array(list(l_string), dtype=int) # gives you the numpy.ndarray like np.random.randint [1 0 0 1 1 0 1 0 0 0]
I used cython to speed my bottleneck in python. The task is to compute the selective inverse (below S) of a sparse matrix given by its cholesky factorization provided in csc-format (data, indptr, indices). But the the task is not really important, in the end it is a 3 times nested for-loop where I have to access elements of S fast.
When I use a memoryview of a full/huge matrix
double[:,:] Sfull
and access the entries then the algorithm is quite fast and meets my expectations. But it is clear, that this is only possible when the matrix Sfull fits into the memory.
My approach was to use a list/vector of dictionaries/maps such that I can access the elements also relatively fast.
cdef vector[map[int, double]] S
It turned out, that accessing the elements inside the loop with this data structure is around 20 times slower. Is this expected or is there another issue? Do you see any other data structure?
Thank you very much for any comments or help!
Best,
Manuel
Bellow, the cython code, where the version with the full memoryview is commented out.
cdef int invTakC12( double[:] id_diag, double[:] data, int len_i, int[:] indptr, int[:] indices, double[:, :] Sfull):
cdef vector[map[int, double]] S = testDictC(len_i-1) #list of empty dicts
cdef int i, j, j_i, lc
cdef double q
for i in range(len_i-2, -1, -1):
for j_i in range(indptr[i+1]-1, indptr[i]-1, -1):
j = indices[j_i]
q = 0
for lc in range(indptr[i+1] -1, indptr[i], -1):
q += data[lc] * S[j][ indices[lc] ]
#q += data[lc] * Sfull[ indices[lc], j ]
S[j][i] = -q
#Sfull[i,j] = -q
if i==j:
S[j][i] += id_diag[i]
#Sfull[i,j] += id_diag[i]
else:
S[i][j] -= q
#Sfull[j,i] -= id_diag[i]
return 0
You can access the arrays independently - e.g.:
cdef double[:] S_data = S.data
cdef np.int32_t[:] S_ind = S.indices
cdef np.int32_t[:] S_indptr = S.indptr
If that's too inconvenient, you can put them in a C struct as pointers:
cdef struct mapped_csc:
double *data
np.int32_t *indices
np.int32_t *indptr
How does one iterate in parallel over a (Python) list in Cython?
Consider the following simple function:
def sumList():
cdef int n = 1000
cdef int sum = 0
ls = [i for i in range(n)]
cdef Py_ssize_t i
for i in prange(n, nogil=True):
sum += ls[i]
return sum
This gives a lot of compiler errors, because a parallel section without the GIL apparently cannot work with any Python object:
Error compiling Cython file:
------------------------------------------------------------
...
ls = [i for i in range(n)]
cdef Py_ssize_t i
for i in prange(n, nogil=True):
sum += ls[i]
^
------------------------------------------------------------
src/parallel.pyx:42:6: Coercion from Python not allowed without the GIL
Error compiling Cython file:
------------------------------------------------------------
...
ls = [i for i in range(n)]
cdef Py_ssize_t i
for i in prange(n, nogil=True):
sum += ls[i]
^
------------------------------------------------------------
src/parallel.pyx:42:6: Operation not allowed without gil
Error compiling Cython file:
------------------------------------------------------------
...
ls = [i for i in range(n)]
cdef Py_ssize_t i
for i in prange(n, nogil=True):
sum += ls[i]
^
------------------------------------------------------------
src/parallel.pyx:42:6: Converting to Python object not allowed without gil
Error compiling Cython file:
------------------------------------------------------------
...
ls = [i for i in range(n)]
cdef Py_ssize_t i
for i in prange(n, nogil=True):
sum += ls[i]
^
------------------------------------------------------------
src/parallel.pyx:42:11: Indexing Python object not allowed without gil
I am not aware of any way to do this. A list is a Python object, so using its __getitem__ method requires the GIL. If you are able to use a NumPy array in this case, it will work. For example, if you wanted to iterate over an array A of double precision floating point values you could do something like this:
cimport cython
from numpy cimport ndarray as ar
from cython.parallel import prange
#cython.boundscheck(False)
#cython.wraparound(False)
cpdef cysumpar(ar[double] A):
cdef double tot=0.
cdef int i, n=A.size
for i in prange(n, nogil=True):
tot += A[i]
return tot
On my machine, for this particular case, prange doesn't make it any faster than a normal loop, but it could work better in other cases. For more on how to use prange see the documentation at http://docs.cython.org/src/userguide/parallelism.html
Whether or not you can use an array depends on how much you are changing the size of the array. If you need a lot of flexibility with the size, the array will not work. You could also try interfacing with the vector class in C++. I've never done that myself, but there is a brief description of how to do that here: http://docs.cython.org/src/userguide/wrapping_CPlusPlus.html#nested-class-declarations
Convert your list into an array if you need any numeric value, or a bytearray if values are limited between 0 and 255. If you store anything else than numeric values, try numpy or use dtypes directly. For example with bytes:
cdef int[::1] gen = array.array('i',[1, 2, 3, 4])
And if you want to use C types:
ctypedef unsigned char uint8_t
I have the following piece of code which I'd like to optimize using Cython:
sim = numpy.dot(v1, v2) / (sqrt(numpy.dot(v1, v1)) * sqrt(numpy.dot(v2, v2)))
dist = 1-sim
return dist
I have written and compiled the .pyx file and when I ran the code I do not see any significant improvement in performance. According to the Cython documentation I have to add c_types. The HTML file generated by Cython indicates that the bottleneck is the dot products (which is expected of course). Does this mean that I have to define a C function for the dot products? If yes how do I do that?
EDIT:
After some research I have come up with the following code. The improvement is only marginal. I am not sure if there is something I can do to improve it :
from __future__ import division
import numpy as np
import math as m
cimport numpy as np
cimport cython
cdef extern from "math.h":
double c_sqrt "sqrt"(double)
ctypedef np.float reals #typedef_for easier readding
cdef inline double dot(np.ndarray[reals,ndim = 1] v1, np.ndarray[reals,ndim = 1] v2):
cdef double result = 0
cdef int i = 0
cdef int length = v1.size
cdef double el1 = 0
cdef double el2 = 0
for i in range(length):
el1 = v1[i]
el2 = v2[i]
result += el1*el2
return result
#cython.cdivision(True)
def distance(np.ndarray[reals,ndim = 1] ex1, np.ndarray[reals,ndim = 1] ex2):
cdef double dot12 = dot(ex1, ex2)
cdef double dot11 = dot(ex1, ex1)
cdef double dot22 = dot(ex2, ex2)
cdef double sim = dot12 / (c_sqrt(dot11 * dot22))
cdef double dist = 1-sim
return dist
As a general note, if you are calling numpy functions from within cython and doing little else, you generally will see only marginal gains if any at all. You generally only get massive speed-ups if you are statically typing code that makes use of an explicit for loop at the python level (not in something that is calling the Numpy C-API already).
You could try writing out the code for a dot product with all of the static typing of the counter, input numpy arrays, etc, with wraparound and boundscheck set to False, import the clib version of the sqrt function and then try to leverage the parallel for loop (prange) to make use of openmp.
You can change the expression
sim = numpy.dot(v1, v2) / (sqrt(numpy.dot(v1, v1)) * sqrt(numpy.dot(v2, v2)))
to
sim = numpy.dot(v1, v2) / sqrt(numpy.dot(v1, v1) * numpy.dot(v2, v2))
Is there anything I've forgotten to do here in order to speed things up a bit? I'm trying to implement an algorithm described in a book called Tuning Timbre Spectrum Scale. Also---if all else fails, is there a way for me to just write this part of the code in C, then be able to call it from python?
import numpy as np
cimport numpy as np
# DTYPE = np.float
ctypedef np.float_t DTYPE_t
np.seterr(divide='raise', over='raise', under='ignore', invalid='raise')
"""
I define a timbre as the following 2d numpy array:
[[f0, a0], [f1, a1], [f2, a2]...] where f describes the frequency
of the given partial and a is its amplitude from 0 to 1. Phase is ignored.
"""
#Test Timbre
# cdef np.ndarray[DTYPE_t,ndim=2] t1 = np.array( [[440,1],[880,.5],[(440*3),.333]])
# Calculates the inherent dissonance of one timbres of the above form
# using the diss2Partials function
cdef DTYPE_t diss1Timbre(np.ndarray[DTYPE_t,ndim=2] t):
cdef DTYPE_t runningDiss1
runningDiss1 = 0.0
cdef unsigned int len = np.shape(t)[0]
cdef unsigned int i
cdef unsigned int j
for i from 0 <= i < len:
for j from i+1 <= j < len:
runningDiss1 += diss2Partials(t[i], t[j])
return runningDiss1
# Calculates the dissonance between two timbres of the above form
cdef DTYPE_t diss2Timbres(np.ndarray[DTYPE_t,ndim=2] t1, np.ndarray[DTYPE_t,ndim=2] t2):
cdef DTYPE_t runningDiss2
runningDiss2 = 0.0
cdef unsigned int len1 = np.shape(t1)[0]
cdef unsigned int len2 = np.shape(t2)[0]
runningDiss2 += diss1Timbre(t1)
runningDiss2 += diss1Timbre(t2)
cdef unsigned int i1
cdef unsigned int i2
for i1 from 0 <= i1 < len1:
for i2 from 0 <= i2 < len2:
runningDiss2 += diss2Partials(t1[i1], t2[i2])
return runningDiss2
cdef inline DTYPE_t float_min(DTYPE_t a, DTYPE_t b): return a if a <= b else b
# Calculates the dissonance of two partials of the form [f,a]
cdef DTYPE_t diss2Partials(np.ndarray[DTYPE_t,ndim=1] p1, np.ndarray[DTYPE_t,ndim=1] p2):
cdef DTYPE_t f1 = p1[0]
cdef DTYPE_t f2 = p2[0]
cdef DTYPE_t a1 = abs(p1[1])
cdef DTYPE_t a2 = abs(p2[1])
# In order to insure that f2 > f1:
if (f2 < f1):
(f1,f2,a1,a2) = (f2,f1,a2,a1)
# Constants of the dissonance curves
cdef DTYPE_t _xStar
_xStar = 0.24
cdef DTYPE_t _s1
_s1 = 0.021
cdef DTYPE_t _s2
_s2 = 19
cdef DTYPE_t _b1
_b1 = 3.5
cdef DTYPE_t _b2
_b2 = 5.75
cdef DTYPE_t a = float_min(a1,a2)
cdef DTYPE_t s = _xStar/(_s1*f1 + _s2)
return (a * (np.exp(-_b1*s*(f2-f1)) - np.exp(-_b2*s*(f2-f1)) ) )
cpdef dissTimbreScale(np.ndarray[DTYPE_t,ndim=2] t,np.ndarray[DTYPE_t,ndim=1] s):
cdef DTYPE_t currDiss
currDiss = 0.0;
cdef unsigned int i
for i from 0 <= i < s.size:
currDiss += diss2Timbres(t, transpose(t,s[i]))
return currDiss
cdef np.ndarray[DTYPE_t,ndim=2] transpose(np.ndarray[DTYPE_t,ndim=2] t, DTYPE_t ratio):
return np.dot(t, np.array([[ratio,0],[0,1]]))
Link to code: Cython Code
Here are some things that I noticed:
Use t1.shape[0] instead of np.shape(t1)[0] and in so on in other places.
Don't use len as a variable because it is a built-in function in Python (not for speed, but for good practice). Use L or something like that.
Don't pass two-element arrays to functions unless you really need to. Cython checks the buffer every time you do pass an array. So, when using diss2Partials(t[i], t[j]) do diss2Partials(t[i,0], t[i,1], t[j,0], t[j,1]) instead and redefine diss2Partials appropriately.
Don't use abs, or at least not the Python one. It is having to convert your C double to a Python float, call the abs function, then convert back to a C double. It would probably be better to make an inlined function like you did with float_min.
Calling np.exp is doing a similar thing to using abs. Change np.exp to exp and add from libc.math cimport exp to your imports at the top.
Get rid of the transpose function completely. The np.dot is really slowing things down, but there really is no need for matrix multiplication here anyway. Rewrite your dissTimbreScale function to create an empty matrix, say t2. Before the current loop, set the second column of t2 to be equal to the second column of t (using a loop preferably, but you could probably get away with a Numpy operation here). Then, inside of the current loop, put in a loop that sets the first column of t2 equal to the first column of t times s[i]. That's what your matrix multiplication was really doing. Then just pass t2 as the second parameter to diss2Timbres instead of the one returned by the transpose function.
Do 1-5 first because they are rather easy. Number 6 may take a little more time, effort and maybe experimentation, but I suspect that it may also give you a significant boost in speed.
In your code:
for i from 0 <= i < len:
for j from i+1 <= j < len:
runningDiss1 += diss2Partials(t[i], t[j])
return runningDiss1
bounds checking is performed for each array lookup, use the decorator #cython.boundscheck(False) before the function, and then cast to an unsigned int type before using i and j as the indices. Look up the cython for Numpy tutorial for more info.
I would profile your code in order to see which function takes the most time. If it is diss2Timbres you may benefit from the package "numexpr".
I compared Python/Cython and Numexpr for one of my functions (link to SO). Depending on the size of the array, numexpr outperformed both, Cython and Fortran.
NOTE: Just figured out this post is really old...