Why does this function in Cython return a different result on every run?
I passed in 50000 as a test:
cpdef int fun1(int num):
    cpdef int result
    cdef int x
    for x in range(num):
        result += x*x
    return result
Edit:
So now I changed it to a long long result, like this:
cpdef long long fun1(int num):
    cdef long long result = 0
    cdef int x = 0
    for x in range(num):
        result += x*x
    return result
and it returns 25950131338936.
But :) the Python function
def pyfunc(num):
    result = 0
    for x in range(num):
        result += x * x
    return result
returns 41665416675000.
So, hm, what is wrong?
There are probably two problems here. First, result should be initialized to zero. Second, the result is the sum of the squares of all integers from 0 to 50,000 (exclusive), which is a very large number.
The problem is that the storage type int cannot fit such a big number. Try using a larger storage type like long long and it will work. The maximum value a 32-bit integer can hold is roughly 2^31. The maximum value a long long can hold is typically 2^63. Consult the C compiler on the system at hand to figure out the exact limits.
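One further subtlety, which likely explains the 25950131338936 result in the edit above (assuming a 32-bit int): the product x*x is still computed in int arithmetic before being widened to long long, so it overflows once x exceeds 46340. Declaring x as long long as well avoids this; a minimal sketch:

cpdef long long fun1(int num):
    cdef long long result = 0
    cdef long long x  # long long, so x*x is evaluated in 64-bit arithmetic
    for x in range(num):
        result += x * x
    return result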
The cdef statement is used to declare C variables, either local or module-level.
So you need to set an initial value for the result variable. If you don't, it holds whatever value happened to be in that memory at call time, which can be anything.
cpdef int fun1(int num):
    cdef int result = 0
    cdef int x
    for x in range(num):
        result += x * x
    return result
My problem is as follows. I am generating a random bitstring of size n, and need to iterate over the indices for which the random bit is 1. For example, if my random bitstring ends up being 00101, I want to retrieve [2, 4] (which I will then iterate over). The goal is to do so in the fastest way possible with Python/NumPy.
One of the fast methods is to use NumPy and do
bitstring = np.random.randint(2, size=(n,))
l = np.nonzero(bitstring)[0]
The advantage of np.nonzero is that it finds the indices of bits set to 1 much faster than iterating (with a for loop) over each bit and checking whether it is set to 1.
Now, NumPy can generate a random bitstring faster via np.random.bit_generator.randbits(n). The problem is that it returns it as an integer, on which I cannot use np.nonzero anymore. I saw that for integers one can get the count of bits set to 1 in an integer x by using x.bit_count(); however, there is no function to get the indices where bits are set to 1. So currently, I have to resort to a slow for loop, hence losing the initial speedup given by np.random.bit_generator.randbits(n).
How would you do something similar to (and as fast as) np.nonzero, but on integers instead?
Thank you in advance for your suggestions!
A minor optimisation to your code would be to use the new-style random interface and generate bools rather than 64-bit integers:
rng = np.random.default_rng()

def original(n):
    bitstring = rng.integers(2, size=n, dtype=bool)
    return np.nonzero(bitstring)[0]
This causes it to take ~24 µs on my laptop, tested with n up to 128.
I've previously noticed that getting NumPy to generate a permutation is particularly fast, hence my comment above, leading to:
def perm(n):
    a = rng.permutation(n)
    return a[:rng.binomial(n, 0.5)]
which takes between ~7 µs and ~10 µs depending on n. It also returns the indices out of order; not sure if that's an issue for you. If your n isn't changing much, you could also swap to using rng.shuffle on a pre-allocated array, something like:
n = 32
a = np.arange(n)

def shuffle():
    rng.shuffle(a)
    return a[:rng.binomial(n, 0.5)]
which saves a couple of microseconds.
After some interesting proposals, I decided to do some benchmarking to understand how the running times grow as a function of n. The functions tested are the following:
def func1(n):
    bit_array = np.random.randint(2, size=n)
    return np.nonzero(bit_array)[0]

def func2(n):
    bit_int = np.random.bit_generator.randbits(n)
    a = np.zeros(bit_int.bit_count())
    i = 0
    for j in range(n):
        if 1 & (bit_int >> j):
            a[i] = j
            i += 1
    return a

def func3(n):
    bit_string = format(np.random.bit_generator.randbits(n), f'0{n}b')
    bit_array = np.array(list(bit_string), dtype=int)
    return np.nonzero(bit_array)[0]

def func4(n):
    rng = np.random.default_rng()
    a = rng.permutation(n)
    return a[:rng.binomial(n, 0.5)]

# func5 relies on a module-level generator (not defined in the original
# snippet, so it is added here)
rng = np.random.default_rng()

def func5(n):
    a = np.arange(n)
    rng.shuffle(a)
    return a[:rng.binomial(n, 0.5)]
I used timeit to do the benchmark, looping 1000 times over each statement and averaging over 10 runs. The value of n ranges from 2 to 65536, growing as powers of 2. The average running time is plotted, and error bars correspond to the standard deviation.
For the solutions generating a bitstring, the simple func1 actually performs best among them whenever n is large enough (n > 32). We can see that for low values of n (n < 16), the randbits solution with the for loop (func2) is fastest, because the loop is not costly yet. However, as n becomes larger, it becomes the worst solution, because all the time is spent in the for loop. This is why having a nonzero for integers would bring the best of both worlds and hopefully give a faster solution. We can also observe that func3, which converts the integer so that nonzero can be used after randbits, spends too long doing the conversion.
For implementations which exploit the binomial distribution (see Sam Mason's answer), we see that the use of shuffle (func5) instead of permutation (func4) can reduce the time by a bit, but overall they have similar performance.
Considering all values of n (that were tested), the solution given by Sam Mason which employs a binomial distribution together with shuffling (func5) is so far the most performant in terms of running time. Let's see if this can be improved!
I had a play with Cython to see how much difference it would make. I ended up with quite a lot of code and only ~5x better runtime performance:
from cpython.pycapsule cimport PyCapsule_IsValid, PyCapsule_GetPointer

import numpy as np
cimport numpy as np
cimport cython
from numpy.random cimport bitgen_t

np.import_array()

DTYPE = np.uint32
ctypedef np.uint32_t DTYPE_t

cdef extern int __builtin_popcountl(unsigned long) nogil
cdef extern int __builtin_ffsl(unsigned long) nogil

cdef const char *bgen_capsule_name = "BitGenerator"

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef size_t generate_bits(object bitgen, np.uint64_t *state, Py_ssize_t state_len, np.uint64_t last_mask):
    cdef Py_ssize_t i
    cdef size_t nset
    cdef bitgen_t *rng

    capsule = bitgen.capsule
    if not PyCapsule_IsValid(capsule, bgen_capsule_name):
        raise ValueError("Expecting Numpy BitGenerator Capsule")
    rng = <bitgen_t *> PyCapsule_GetPointer(capsule, bgen_capsule_name)

    with bitgen.lock:
        nset = 0
        for i in range(state_len-1):
            state[i] = rng.next_uint64(rng.state)
            nset += __builtin_popcountl(state[i])
        i = state_len-1
        state[i] = rng.next_uint64(rng.state) & last_mask
        nset += __builtin_popcountl(state[i])

    return nset

cdef size_t write_setbits(DTYPE_t *result, DTYPE_t off, np.uint64_t state) nogil:
    cdef size_t j
    cdef int k
    j = 0
    while state:
        # find first set bit (ffsl returns zero when nothing is set, hence -1)
        k = __builtin_ffsl(state) - 1
        # clear out bit k
        state &= ~(1ul<<k)
        # record in output
        result[j] = off + k
        j += 1
    return j

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def rint(bitgen, unsigned int n):
    cdef Py_ssize_t i, j, nset
    cdef np.uint64_t[::1] state
    cdef DTYPE_t[::1] result

    state = np.empty((n + 63) // 64, dtype=np.uint64)
    nset = generate_bits(bitgen, &state[0], len(state), (1ul << (n & 63)) - 1)

    pyresult = np.empty(nset, dtype=DTYPE)
    result = pyresult

    j = 0
    for i in range(len(state)):
        j += write_setbits(&result[j], i * 64, state[i])

    return pyresult
The above code is easy to use via the Cython Jupyter extension.
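For instance, in a notebook (a sketch; the magic names are from the Cython documentation):

%load_ext Cython

Then, in a separate cell, prefix the module above with the cell magic so it is compiled on execution:

%%cython
# ...the module from above...

After that, rint() can be called directly, e.g. rint(np.random.PCG64(), 100).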
Comparing this to slightly tidied up versions of the OP's code can be done via:
import random
import timeit

import numpy as np
import matplotlib.pyplot as plt

bitgen = np.random.PCG64()

def func1(n):
    # bool type is a bit faster
    bit_array = np.random.randint(2, size=n, dtype=bool)
    return np.nonzero(bit_array)[0]

def func2(n):
    # OP's variant ends up using a CSPRNG which is slower
    bit_int = random.getrandbits(n)
    # this is much easier than using numpy arrays
    return [i for i in range(n) if 1 & (bit_int >> i)]

def func3(n):
    bit_string = format(random.getrandbits(n), f'0{n}b')
    bit_array = np.array(list(bit_string), dtype='int8')
    return np.nonzero(bit_array)[0]

def func4(n):
    # shuffle variant is mostly the same,
    # plot is already busy enough
    a = np.random.permutation(n)
    return a[:np.random.binomial(n, 0.5)]

def func_cython(n):
    return rint(bitgen, n)

result = {}
niter = [2**i for i in range(1, 17)]
for name in 'func1 func2 func3 func4 func_cython'.split():
    result[name] = res = []
    for n in niter:
        t = timeit.Timer(f"fn({n})", f"fn = {name}", globals=globals())
        nit, dt = t.autorange()
        res.append(dt / nit)

plt.loglog()
for name, times in result.items():
    plt.plot(niter, np.array(times) * 1000, '.-', label=name)
plt.legend()
This produces a log-log plot comparing the timings of the five functions (figure omitted here).
Note that in order to reduce variance it's helpful to turn off CPU frequency scaling and turn off turbo modes. The Arch wiki has useful info on how to do this under Linux.
You could convert the number you get from randbits(n) to a numpy.ndarray.
Depending on the size of n, the conversion should be faster than the loop.
n = 10
l = np.random.bit_generator.randbits(n)  # gives you the int 616
l_string = f'{l:0{n}b}'  # string representation of the int, padded to length n: '1001101000'
l_nparray = np.array(list(l_string), dtype=int)  # ndarray like np.random.randint: [1 0 0 1 1 0 1 0 0 0]
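A nonzero call on the converted array then recovers the indices of the set bits, completing the pipeline from the question. Note that these are positions counted from the most significant bit of the padded string, not bit positions from the least significant end:

indices = np.nonzero(l_nparray)[0]  # array([0, 3, 4, 6]) for '1001101000'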
Say I have a simple function that takes as input a pointer to an integer. How do I change the originating integer value?
My idea was as follows:
cdef myFunc(int n, int *nnz):
    nnz_int = <uintptr_t>nnz
    nnz_int = 0
    for i in range(0, n):
        nnz_int += n
but upon reflection, I think I only initially cast the value of nnz onto nnz_int, and then change nnz_int, without changing the original nnz. How do I achieve that?
From the Cython docs:
Note that Cython uses array access for pointer dereferencing, as *x is not valid Python syntax, whereas x[0] is.
So this should work:
cdef myFunc(int n, int *nnz):
    for i in range(0, n):
        nnz[0] += n
Not sure what you're trying to achieve by adding n to the pointed-to value n times; why not simply add n*n to it once?
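For completeness, a minimal sketch of what this looks like from the calling side (the wrapper name is illustrative):

def call_myfunc(int n):
    cdef int total = 0
    myFunc(n, &total)  # myFunc writes through the pointer, adding n each pass
    return total       # total is now n*n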
I am trying to write the short code below in Python (it is from a .pyx file). My issue is the lines with "double[:,::1]" in them. Is there any equivalent in Python for it? Also, how does "cdef unsigned int i, j" translate to Python? I am new to programming, and most of what I found online is over my head. Any suggestion or help is appreciated.
def _step_scalar(
        double[:,::1] u_tp1 not None,
        double[:,::1] u_t not None,
        double[:,::1] u_tm1 not None,
        unsigned int x1, unsigned int x2, unsigned int z1, unsigned int z2,
        double dt, double ds,
        double[:,::1] vel not None):
    """
    Perform a single time step in the Finite Difference solution for scalar
    waves 4th order in space
    """
    cdef unsigned int i, j
    for i in xrange(z1, z2):
        for j in xrange(x1, x2):
            u_tp1[i,j] = (2.*u_t[i,j] - u_tm1[i,j]
                + ((vel[i,j]*dt/ds)**2)*(
                    (-u_t[i,j + 2] + 16.*u_t[i,j + 1] - 30.*u_t[i,j] +
                     16.*u_t[i,j - 1] - u_t[i,j - 2])/12. +
                    (-u_t[i + 2,j] + 16.*u_t[i + 1,j] - 30.*u_t[i,j] +
                     16.*u_t[i - 1,j] - u_t[i - 2,j])/12.))
They're type declarations that help Cython speed up the code. Python is dynamically typed (it accepts variables of any type), so the declarations aren't meaningful in Python, and you can get rid of them.
double[:,::1] defines the variable as a 2D, C-contiguous memoryview of doubles. This means the function expects something similar to a 2D numpy array (which is still what you should pass to your Cython function).
u_tp1 is the variable name. You should keep this.
not None tells Cython to assume that you won't pass None into the function (so it disables some checks for extra speed). This can be deleted in Python.
cdef unsigned int i, j defines i and j as C integers, for extra speed. In Python, i and j are created when they are needed in the for loop, so the definition can be deleted completely.
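Putting those points together, a plain-Python version of the function might look like this sketch (the inputs are assumed to be 2D NumPy arrays, and Python 3's range replaces xrange):

def _step_scalar(u_tp1, u_t, u_tm1, x1, x2, z1, z2, dt, ds, vel):
    """
    Perform a single time step in the Finite Difference solution for scalar
    waves 4th order in space.
    """
    for i in range(z1, z2):
        for j in range(x1, x2):
            u_tp1[i, j] = (2.*u_t[i, j] - u_tm1[i, j]
                + ((vel[i, j]*dt/ds)**2)*(
                    (-u_t[i, j+2] + 16.*u_t[i, j+1] - 30.*u_t[i, j] +
                     16.*u_t[i, j-1] - u_t[i, j-2])/12. +
                    (-u_t[i+2, j] + 16.*u_t[i+1, j] - 30.*u_t[i, j] +
                     16.*u_t[i-1, j] - u_t[i-2, j])/12.))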
What I want to do - transform my pure Python code into Cython.
Pure Python code:
def conflicts(list1, list2):
    numIt = 10
    for i in list1:
        for j in list2:
            if i == j and i < numIt:
                return True
    return False

conflicts([1,2,3], [6,9,8])
My Cython code so far:
cdef char conflicts(int [] list1, int [] list2):
    cdef int numIt = 10
    for i in list1:
        for j in list2:
            if i == j and i < numIt:
                return True
    return False

conflicts([1,2,3], [6,9,8])
Since I am completely new to Cython (and not really a pro in Python) I would like to get some feedback about my transformation. Am I doing the right thing? Is there anything else I should do in order to make the function even faster?
Update:
Does anyone know how I can add types in the header of the function for the input (list1, list2)? I tried "int [:]", which compiles without error, but when I try to call the function with two lists I get the message "TypeError: 'list' does not have the buffer interface".
"i" and "j" could be declared for optimize your code. First optimization with cython is accomplished using explicit declaration.
You can use
cython -a yourcode.py
and see some automatic suggestion of possible changes for optimize your python code with cython (yellow lines). You can work with c module generated (work perfect!).
Some handwrite cython optimization:
+ Using type list for list1 and list2.
+ bint type for conflicts functions because that return boolean value.
+ Get lenght of lists because for loop requiere end index.
+ Map lists in int arrays (because lists has only integer values).
import numpy as np

cdef bint conflicts(list list1, list list2):
    cdef int numIt = 10
    cdef int i, j
    cdef int end_index1 = len(list1)
    cdef int end_index2 = len(list2)
    # plain Python lists don't expose the buffer interface, so copy them
    # into typed arrays first (this also addresses the TypeError in the update)
    cdef int[:] my_list1 = np.array(list1, dtype=np.intc)
    cdef int[:] my_list2 = np.array(list2, dtype=np.intc)
    for i in range(end_index1):
        for j in range(end_index2):
            if my_list1[i] == my_list2[j] and my_list1[i] < numIt:
                return True
    return False

conflicts([1,2,3], [6,9,8])
As I commented, you should be able to get a pretty substantial improvement by changing your algorithm, without messing with Cython at all. Your current code is O(len(list1)*len(list2)), but you can reduce this to O(len(list1)+len(list2)) by using a set. You can also simplify the code by using the built-in any function:
def conflicts(list1, list2):
    numIt = 10
    s1 = set(list1)
    return any(x in s1 and x < numIt for x in list2)
Depending on how many numbers in each list you expect to be less than 10, you might try moving the x < numIt test around a bit to see what is fastest (filtering list1 before you turn it into a set, for instance, or putting if x < numIt after the for in the generator expression inside any).
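One of those variations, filtering list1 while building the set, might look like this (a sketch; it is equivalent because a matching element must satisfy both conditions):

def conflicts(list1, list2):
    numIt = 10
    s1 = {x for x in list1 if x < numIt}  # only keep candidates below numIt
    return any(x in s1 for x in list2)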
Is there anything I've forgotten to do here in order to speed things up a bit? I'm trying to implement an algorithm described in a book called Tuning Timbre Spectrum Scale. Also, if all else fails, is there a way for me to just write this part of the code in C and then call it from Python?
import numpy as np
cimport numpy as np

# DTYPE = np.float
ctypedef np.float_t DTYPE_t

np.seterr(divide='raise', over='raise', under='ignore', invalid='raise')

"""
I define a timbre as the following 2d numpy array:
[[f0, a0], [f1, a1], [f2, a2]...] where f describes the frequency
of the given partial and a is its amplitude from 0 to 1. Phase is ignored.
"""

# Test timbre
# cdef np.ndarray[DTYPE_t,ndim=2] t1 = np.array( [[440,1],[880,.5],[(440*3),.333]])

# Calculates the inherent dissonance of one timbre of the above form
# using the diss2Partials function
cdef DTYPE_t diss1Timbre(np.ndarray[DTYPE_t,ndim=2] t):
    cdef DTYPE_t runningDiss1
    runningDiss1 = 0.0
    cdef unsigned int len = np.shape(t)[0]
    cdef unsigned int i
    cdef unsigned int j
    for i from 0 <= i < len:
        for j from i+1 <= j < len:
            runningDiss1 += diss2Partials(t[i], t[j])
    return runningDiss1

# Calculates the dissonance between two timbres of the above form
cdef DTYPE_t diss2Timbres(np.ndarray[DTYPE_t,ndim=2] t1, np.ndarray[DTYPE_t,ndim=2] t2):
    cdef DTYPE_t runningDiss2
    runningDiss2 = 0.0
    cdef unsigned int len1 = np.shape(t1)[0]
    cdef unsigned int len2 = np.shape(t2)[0]
    runningDiss2 += diss1Timbre(t1)
    runningDiss2 += diss1Timbre(t2)
    cdef unsigned int i1
    cdef unsigned int i2
    for i1 from 0 <= i1 < len1:
        for i2 from 0 <= i2 < len2:
            runningDiss2 += diss2Partials(t1[i1], t2[i2])
    return runningDiss2

cdef inline DTYPE_t float_min(DTYPE_t a, DTYPE_t b): return a if a <= b else b

# Calculates the dissonance of two partials of the form [f,a]
cdef DTYPE_t diss2Partials(np.ndarray[DTYPE_t,ndim=1] p1, np.ndarray[DTYPE_t,ndim=1] p2):
    cdef DTYPE_t f1 = p1[0]
    cdef DTYPE_t f2 = p2[0]
    cdef DTYPE_t a1 = abs(p1[1])
    cdef DTYPE_t a2 = abs(p2[1])
    # In order to ensure that f2 > f1:
    if (f2 < f1):
        (f1,f2,a1,a2) = (f2,f1,a2,a1)
    # Constants of the dissonance curves
    cdef DTYPE_t _xStar
    _xStar = 0.24
    cdef DTYPE_t _s1
    _s1 = 0.021
    cdef DTYPE_t _s2
    _s2 = 19
    cdef DTYPE_t _b1
    _b1 = 3.5
    cdef DTYPE_t _b2
    _b2 = 5.75
    cdef DTYPE_t a = float_min(a1,a2)
    cdef DTYPE_t s = _xStar/(_s1*f1 + _s2)
    return (a * (np.exp(-_b1*s*(f2-f1)) - np.exp(-_b2*s*(f2-f1)) ) )

cpdef dissTimbreScale(np.ndarray[DTYPE_t,ndim=2] t, np.ndarray[DTYPE_t,ndim=1] s):
    cdef DTYPE_t currDiss
    currDiss = 0.0
    cdef unsigned int i
    for i from 0 <= i < s.size:
        currDiss += diss2Timbres(t, transpose(t,s[i]))
    return currDiss

cdef np.ndarray[DTYPE_t,ndim=2] transpose(np.ndarray[DTYPE_t,ndim=2] t, DTYPE_t ratio):
    return np.dot(t, np.array([[ratio,0],[0,1]]))
Here are some things that I noticed:
1. Use t1.shape[0] instead of np.shape(t1)[0], and so on in other places.
2. Don't use len as a variable name, because it is a built-in function in Python (not for speed, but for good practice). Use L or something like that.
3. Don't pass two-element arrays to functions unless you really need to. Cython checks the buffer every time you pass an array. So instead of diss2Partials(t[i], t[j]), use diss2Partials(t[i,0], t[i,1], t[j,0], t[j,1]) and redefine diss2Partials appropriately.
4. Don't use abs, or at least not the Python one. It has to convert your C double to a Python float, call the abs function, then convert back to a C double. It would probably be better to make an inlined function like you did with float_min.
5. Calling np.exp does a similar thing to using abs. Change np.exp to exp and add from libc.math cimport exp to your imports at the top.
6. Get rid of the transpose function completely. The np.dot is really slowing things down, and there is no need for matrix multiplication here anyway. Rewrite your dissTimbreScale function to create an empty matrix, say t2. Before the current loop, set the second column of t2 equal to the second column of t (using a loop, preferably, though you could probably get away with a NumPy operation here). Then, inside the current loop, add a loop that sets the first column of t2 equal to the first column of t times s[i]. That's what your matrix multiplication was really doing. Then pass t2 as the second parameter to diss2Timbres instead of the value returned by transpose.
Do 1-5 first because they are rather easy. Number 6 may take a little more time, effort and maybe experimentation, but I suspect that it may also give you a significant boost in speed.
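A sketch of what point 6 could look like for dissTimbreScale (assuming the declarations from the question; untested):

cpdef dissTimbreScale(np.ndarray[DTYPE_t,ndim=2] t, np.ndarray[DTYPE_t,ndim=1] s):
    cdef DTYPE_t currDiss = 0.0
    cdef unsigned int i, k
    cdef unsigned int L = t.shape[0]
    cdef np.ndarray[DTYPE_t,ndim=2] t2 = np.empty_like(t)
    for k from 0 <= k < L:
        t2[k,1] = t[k,1]              # amplitudes are never changed
    for i from 0 <= i < s.size:
        for k from 0 <= k < L:
            t2[k,0] = t[k,0] * s[i]   # scale the frequencies by the ratio
        currDiss += diss2Timbres(t, t2)
    return currDiss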
In your code:
for i from 0 <= i < len:
    for j from i+1 <= j < len:
        runningDiss1 += diss2Partials(t[i], t[j])
return runningDiss1
bounds checking is performed for each array lookup. Use the decorator @cython.boundscheck(False) before the function, and cast i and j to an unsigned int type before using them as indices. Look up the "Cython for NumPy users" tutorial for more info.
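Applied to the first loop from the question, that might look like this sketch (assuming the same module-level declarations, and with len renamed to L per the other answer):

cimport cython

@cython.boundscheck(False)
cdef DTYPE_t diss1Timbre(np.ndarray[DTYPE_t,ndim=2] t):
    cdef DTYPE_t runningDiss1 = 0.0
    cdef unsigned int L = t.shape[0]
    cdef unsigned int i, j
    for i from 0 <= i < L:
        for j from i+1 <= j < L:
            # no per-lookup bounds checks inside this function now
            runningDiss1 += diss2Partials(t[i], t[j])
    return runningDiss1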
I would profile your code in order to see which function takes the most time. If it is diss2Timbres you may benefit from the package "numexpr".
I compared Python/Cython and numexpr for one of my functions (link to SO). Depending on the size of the array, numexpr outperformed both Cython and Fortran.
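For example, if diss2Partials were vectorized over whole arrays of partials, the inner expression could be evaluated by numexpr in one multi-threaded pass (a sketch; f1, f2, a are assumed NumPy arrays and b1, b2, s mirror the constants in the question):

import numpy as np
import numexpr as ne

# hypothetical vectorized inputs: one pair of partials per array element
f1 = np.random.uniform(100, 1000, 10**6)
f2 = f1 * 1.5
a = np.random.uniform(0, 1, 10**6)
b1, b2 = 3.5, 5.75
s = 0.24 / (0.021 * f1 + 19)

diss = ne.evaluate("a * (exp(-b1*s*(f2-f1)) - exp(-b2*s*(f2-f1)))")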
NOTE: Just figured out this post is really old...