Slow indexing of cython memoryview

Slow indexing of cython memoryview - python

I have a very sparse matrix, say 5000x3000, double precision floats. 80% of this matrix are zeros. I need to compute a sum of each row. All of that in python/cython. I wanted to speed up the process. Because I need to compute this sum a few million times, I thought that if I make indices of non zero elements and sum only them it will be faster. The result turns to be much slower than original "brute-force" summation of all zeros.
Here a minimal example:
#cython: language_level=2
import numpy as np
cimport numpy as np
import time
cdef int Ncells = 5000, KCells = 400, Ne= 350
cdef double x0=0.1, x1=20., x2=1.4, x3=2.8, p=0.2
# Setting up weight
all_weights = np.zeros( (Ncells,KCells) )
all_weights[ :Ne, :Ne ] = x0
all_weights[ :Ne, Ne: ] = x1
all_weights[Ne: , :Ne ] = x2
all_weights[Ne: , Ne: ] = x3
all_weights = all_weights * (np.random.rand(Ncells,KCells) < p)
# Making a memory view
cdef np.float64_t[:,:] my_weights = all_weights
# make an index of non zero weights
x,y = np.where( np.array(my_weights) > 0.)
#np_pawid = np.column_stack( (x ,y ) )
np_pawid = np.column_stack( (x ,y ) ).astype(int)
cdef np.int_t[:,:] pawid = np_pawid
# Making vector for column sum
summEE = np.zeros(KCells)
# Memory view
cdef np.float64_t [:] my_summEE = summEE
cdef int cc,dd,i
# brute-force summing
ntm = time.time()
for cc in range(KCells):
my_summEE[cc] = 0
for dd in range(Ncells):
my_summEE[cc] += my_weights[dd,cc]
stm = time.time()
print "BRUTE-FORCE summation : %f s"%(stm-ntm)
my_summEE[:] = 0
# summing only non zero indices
ntm = time.time()
for dd,cc in pawid:
my_summEE[cc] += my_weights[dd,cc]
stm = time.time()
print "INDEX summation : %f s"%(stm-ntm)
my_summEE[:] = 0
# summing only non zero indices unpacked by zip
ntm = time.time()
for dd,cc in zip(pawid[:,0],pawid[:,1]):
my_summEE[cc] += my_weights[dd,cc]
stm = time.time()
print "ZIPPED INDEX summation : %f s"%(stm-ntm)
my_summEE[:] = 0
# summing only non zero indices unpacked by zip
ntm = time.time()
for i in range(pawid.shape[0]):
dd = pawid[i,0]
cc = pawid[i,1]
my_summEE[cc] += my_weights[dd,cc]
stm = time.time()
print "INDEXING over INDEX summation: %f s"%(stm-ntm)
# Numpy brute-froce summing
ntm = time.time()
sumwee = np.sum(all_weights,axis=0)
stm = time.time()
print "NUMPY BRUTE-FORCE summation : %f s"%(stm-ntm)
#>
print
print "Number of brute-froce summs :",my_weights.shape[0]*my_weights.shape[1]
print "Number of indexing summs :",pawid.shape[0]
#<
I ran it on Raspberry Pi 3, but it seems the same results on PC too.
BRUTE-FORCE summation : 0.381014 s
INDEX summation : 18.479018 s
ZIPPED INDEX summation : 3.615952 s
INDEXING over INDEX summation: 0.450131 s
NUMPY BRUTE-FORCE summation : 0.013017 s
Number of brute-froce summs : 2000000
Number of indexing summs : 400820
NUMPY BRUTE-FORCE in Python : 0.029143 s
Can anyone explain why is cython code 3-4 time slower than numpy? Why is indexing, which reduces the number of summations from 2000000 to 400820, 45 times slower? It doesn't make any sense.

You're outside a function so accessing global variables. This means that Cython has to check they exist each time they're accessed, unlike function locals which it knows can't be accessed from elsewhere.
By default Cython handles negative indices and does bounds checking. You can turn these off in a number of ways. An obvious way is to add #cython.wraparound(False) and #cython.boundscheck(False) as decorators to your function definition. Be aware of what these actually do - the only turn off these features on cdefed numpy arrays or typed memoryviews and don't apply to much else (so don't just apply them everywhere as a cargo-cult type thing).
A good way to see where issues might be is to run cython -a <filename> and look at the annotated html file. Areas with yellow are potentially not optimized, and you can expand the lines to see the underlying C code. Obviously only worry about frequently called functions and loops in this respect - the fact your code to setup the Numpy arrays contains Python calls is expected and not a problem.
A few measurements:
As you wrote it
BRUTE-FORCE summation : 0.008625 s
INDEX summation : 0.713661 s
ZIPPED INDEX summation : 0.127343 s
INDEXING over INDEX summation: 0.002154 s
NUMPY BRUTE-FORCE summation : 0.001461 s
In a function
BRUTE-FORCE summation : 0.007706 s
INDEX summation : 0.681892 s
ZIPPED INDEX summation : 0.123176 s
INDEXING over INDEX summation: 0.002069 s
NUMPY BRUTE-FORCE summation : 0.001429 s
In a function with boundscheck and wraparound off:
BRUTE-FORCE summation : 0.005208 s
INDEX summation : 0.672948 s
ZIPPED INDEX summation : 0.124641 s
INDEXING over INDEX summation: 0.002006 s
NUMPY BRUTE-FORCE summation : 0.001467 s
My suggestions do help, but not too dramatically. My differences aren't as dramatic as you see (even for your unchanged code). Numpy still wins - at a guess:
I suspect it's multithreading.
a direct sum over a whole array will have predictable patterns of memory access, which may make it more efficient than a smaller number of operations with unpredictable memory access

Related

Nonzero for integers

My problem is as follows. I am generating a random bitstring of size n, and need to iterate over the indices for which the random bit is 1. For example, if my random bitstring ends up being 00101, I want to retrieve [2, 4] (on which I will iterate over). The goal is to do so in the fastest way possible with Python/NumPy.
One of the fast methods is to use NumPy and do
bitstring = np.random.randint(2, size=(n,))
l = np.nonzero(bitstring)[0]
The advantage with np.non_zero is that it finds indices of bits set to 1 much faster than if one iterates (with a for loop) over each bit and checks if it is set to 1.
Now, NumPy can generate a random bitstring faster via np.random.bit_generator.randbits(n). The problem is that it returns it as an integer, on which I cannot use np.nonzero anymore. I saw that for integers one can get the count of bits set to 1 in an integer x by using x.bit_count(), however there is no function to get the indices where bits are set to 1. So currently, I have to resort to a slow for loop, hence losing the initial speedup given by np.random.bit_generator.randbits(n).
How would you do something similar to (and as fast as) np.non_zero, but on integers instead?
Thank you in advance for your suggestions!

A minor optimisation to your code would be to use the new style random interface and generate bools rather than 64bit integers
rng = np.random.default_rng()
def original(n):
bitstring = rng.integers(2, size=n, dtype=bool)
return np.nonzero(bitstring)[0]
this causes it to take ~24 µs on my laptop, tested n upto 128.
I've previously noticed that getting a Numpy to generate a permutation is particularly fast, hence my comment above. Leading to:
def perm(n):
a = rng.permutation(n)
return a[:rng.binomial(n, 0.5)]
which takes between ~7 µs and ~10 µs depending on n. It also returns the indicies out of order, not sure if that's an issue for you. If your n isn't changing much, you could also swap to using rng.shuffle on an pre-allocated array, something like:
n = 32
a = np.arange(n)
def shuffle():
rng.shuffle(a)
return a[:rng.binomial(n, 0.5)]
which saves a couple of microseconds.

After some interesting proposals, I decided to do some benchmarking to understand how the running times grow as a function of n. The functions tested are the following:
def func1(n):
bit_array = np.random.randint(2, size=n)
return np.nonzero(bit_array)[0]
def func2(n):
bit_int = np.random.bit_generator.randbits(n)
a = np.zeros(bit_int.bit_count())
i = 0
for j in range(n):
if 1 & (bit_int >> j):
a[i] = j
i += 1
return a
def func3(n):
bit_string = format(np.random.bit_generator.randbits(n), f'0{n}b')
bit_array = np.array(list(bit_string), dtype=int)
return np.nonzero(bit_array)[0]
def func4(n):
rng = np.random.default_rng()
a = rng.permutation(n)
return a[:rng.binomial(n, 0.5)]
def func5(n):
a = np.arange(n)
rng.shuffle(a)
return a[:rng.binomial(n, 0.5)]
I used timeit to do the benchmark, looping 1000 over a statement each time and averaging over 10 runs. The value of n ranges from 2 to 65536, growing as powers of 2. The average running time is plotted and error bars correspondond to the standard deviation.
For solutions generating a bitstring, the simple func1 actually performs best among them whenever n is large enough (n>32). We can see that for low values of n (n< 16), using the randbits solution with the for loop (func2) is fastest, because the loop is not costly yet. However as n becomes larger, this becomes the worst solution, because all the time is spent in the for loop. This is why having a nonzero for integers would bring the best of both world and hopefully give a faster solution. We can observe that func3, which does a conversion in order to use nonzero after using randbits spends too long doing the conversion.
For implementations which exploit the binomial distribution (see Sam Mason's answer), we see that the use of shuffle (func5) instead of permutation (func4) can reduce the time by a bit, but overall they have similar performance.
Considering all values of n (that were tested), the solution given by Sam Mason which employs a binomial distribution together with shuffling (func5) is so far the most performant in terms of running time. Let's see if this can be improved!

I had a play with Cython to see how much difference it would make. I ended up with quite a lot of code and only ~5x better runtime performance:
from cpython.pycapsule cimport PyCapsule_IsValid, PyCapsule_GetPointer
import numpy as np
cimport numpy as np
cimport cython
from numpy.random cimport bitgen_t
np.import_array()
DTYPE = np.uint32
ctypedef np.uint32_t DTYPE_t
cdef extern int __builtin_popcountl(unsigned long) nogil
cdef extern int __builtin_ffsl(unsigned long) nogil
cdef const char *bgen_capsule_name = "BitGenerator"
#cython.boundscheck(False) # Deactivate bounds checking
#cython.wraparound(False) # Deactivate negative indexing.
cdef size_t generate_bits(object bitgen, np.uint64_t *state, Py_ssize_t state_len, np.uint64_t last_mask):
cdef Py_ssize_t i
cdef size_t nset
cdef bitgen_t *rng
capsule = bitgen.capsule
if not PyCapsule_IsValid(capsule, bgen_capsule_name):
raise ValueError("Expecting Numpy BitGenerator Capsule")
rng = <bitgen_t *> PyCapsule_GetPointer(capsule, bgen_capsule_name)
with bitgen.lock:
nset = 0
for i in range(state_len-1):
state[i] = rng.next_uint64(rng.state)
nset += __builtin_popcountl(state[i])
i = state_len-1
state[i] = rng.next_uint64(rng.state) & last_mask
nset += __builtin_popcountl(state[i])
return nset
cdef size_t write_setbits(DTYPE_t *result, DTYPE_t off, np.uint64_t state) nogil:
cdef size_t j
cdef int k
j = 0
while state:
# find first set bit returns zero when nothing is set
k = __builtin_ffsl(state) - 1
# clear out bit k
state &= ~(1ul<<k)
# record in output
result[j] = off + k
j += 1
return j
#cython.boundscheck(False) # Deactivate bounds checking
#cython.wraparound(False) # Deactivate negative indexing.
def rint(bitgen, unsigned int n):
cdef Py_ssize_t i, j, nset
cdef np.uint64_t[::1] state
cdef DTYPE_t[::1] result
state = np.empty((n + 63) // 64, dtype=np.uint64)
nset = generate_bits(bitgen, &state[0], len(state), (1ul << (n & 63)) - 1)
pyresult = np.empty(nset, dtype=DTYPE)
result = pyresult
j = 0
for i in range(len(state)):
j += write_setbits(&result[j], i * 64, state[i])
return pyresult
The above code is easy to use via the Cython Jupyter extension.
Comparing this to slightly tidied up versions of the OP's code can be done via:
import random
import timeit
import numpy as np
import matplotlib.pyplot as plt
bitgen = np.random.PCG64()
def func1(n):
# bool type is a bit faster
bit_array = np.random.randint(2, size=n, dtype=bool)
return np.nonzero(bit_array)[0]
def func2(n):
# OPs variant ends up using a CSPRNG which is slower
bit_int = random.getrandbits(n)
# this is much easier than using numpy arrays
return [i for i in range(n) if 1 & (bit_int >> i)]
def func3(n):
bit_string = format(random.getrandbits(n), f'0{n}b')
bit_array = np.array(list(bit_string), dtype='int8')
return np.nonzero(bit_array)[0]
def func4(n):
# shuffle variant is mostly the same
# plot already busy enough
a = np.random.permutation(n)
return a[:np.random.binomial(n, 0.5)]
def func_cython(n):
return rint(bitgen, n)
result = {}
niter = [2**i for i in range(1, 17)]
for name in 'func1 func2 func3 func4 func_cython'.split():
result[name] = res = []
for n in niter:
t = timeit.Timer(f"fn({n})", f"fn = {name}", globals=globals())
nit, dt = t.autorange()
res.append(dt / nit)
plt.loglog()
for name, times in result.items():
plt.plot(niter, np.array(times) * 1000, '.-', label=name)
plt.legend()
Which might produce output like:
Note that in order to reduce variance it's helpful to turn off CPU frequency scaling and turn off turbo modes. The Arch wiki has useful info on how to do this under Linux.

you could convert the number you get with randbits(n) to a numpy.ndarray.
depending on the size of n the compute time of the conversion should be faster than the loop.
n = 10
l = np.random.bit_generator.randbits(n) # gives you the int 616
l_string = f'{l:0{n}b}' # gives you a string representation of the int in length n 1001101000
l_nparray = np.array(list(l_string), dtype=int) # gives you the numpy.ndarray like np.random.randint [1 0 0 1 1 0 1 0 0 0]

`numpy.nanpercentile` is extremely slow

numpy.nanpercentile is extremely slow.
So, I wanted to use cupy.nanpercentile; but there is not cupy.nanpercentile implemented yet.
Do someone have solution for it?

I also had a problem with np.nanpercentile being very slow for my datasets. I found a wokraround that lets you use the standard np.percentile. And it can also be applied to many other libs.
This one should solve your problem. And it also works alot faster than np.nanpercentile:
arr = np.array([[np.nan,2,3,1,2,3],
[np.nan,np.nan,1,3,2,1],
[4,5,6,7,np.nan,9]])
mask = (arr >= np.nanmin(arr)).astype(int)
count = mask.sum(axis=1)
groups = np.unique(count)
groups = groups[groups > 0]
p90 = np.zeros((arr.shape[0]))
for g in range(len(groups)):
pos = np.where (count == groups[g])
values = arr[pos]
values = np.nan_to_num (values, nan=(np.nanmin(arr)-1))
values = np.sort (values, axis=1)
values = values[:,-groups[g]:]
p90[pos] = np.percentile (values, 90, axis=1)
So instead of taking the percentile with the nans, it sorts the rows by the amount of valid data, and takes the percentile of those rows separated. Then adds everything back together. This also works for 3D-arrays, just add y_pos and x_pos instead of pos. And watch out for what axis you are calculating over.

def testset_gen(num):
init=[]
for i in range (num):
a=random.randint(65,122) # Dummy name
b=random.randint(1,100) # Dummy value: 11~100 and 10% of nan
if b<11:
b=np.nan # 10% = nan
init.append([a,b])
return np.array(init)
np_testset=testset_gen(30000000) # 468,751KB
def f1_np (arr, num):
return np.percentile (arr[:,1], num)
# 55.0, 0.523902416229248 sec
print (f1_np(np_testset[:,1], 50))
def cupy_nanpercentile (arr, num):
return len(cp.where(arr > num)[0]) / (len(arr) - cp.sum(cp.isnan(arr))) * 100
# 55.548758317136446, 0.3640251159667969 sec
# 43% faster
# If You need same result, use int(). But You lose saved time.
print (cupy_nanpercentile(cp_testset[:,1], 50))
I can't imagine How test result takes few days. With my computer, It seems 1 Trillion line of data or more. Because of this, I can't reproduce same problem due to lack of resource.

Here's an implementation with numba. After it's been compiled it is more than 7x faster than the numpy version.
Right now it is set up to take the percentile along the first axis, however it could be changed easily.
#numba.jit(nopython=True, cache=True)
def nan_percentile_axis0(arr, percentiles):
"""Faster implementation of np.nanpercentile
This implementation always takes the percentile along axis 0.
Uses numba to speed up the calculation by more than 7x.
Function is equivalent to np.nanpercentile(arr, <percentiles>, axis=0)
Params:
arr (np.array): Array to calculate percentiles for
percentiles (np.array): 1D array of percentiles to calculate
Returns:
(np.array) Array with first dimension corresponding to
values as passed in percentiles
"""
shape = arr.shape
arr = arr.reshape((arr.shape[0], -1))
out = np.empty((len(percentiles), arr.shape[1]))
for i in range(arr.shape[1]):
out[:,i] = np.nanpercentile(arr[:,i], percentiles)
shape = (out.shape[0], *shape[1:])
return out.reshape(shape)

Python exponentiation of two lists performance

Is there a way to compute the Cobb-Douglas utility function faster in Python. I run it millions of time, so a speed increase would help. The function raises elements of quantities_list to power of corresponding elements of exponents list, and then multiplies all the resulting elements.
n = 10
quantities = range(n)
exponents = range(n)
def Cobb_Douglas(quantities_list, exponents_list):
number_of_variables = len(quantities_list)
value = 1
for variable in xrange(number_of_variables):
value *= quantities_list[variable] ** exponents_list[variable]
return value
t0 = time.time()
for i in xrange(100000):
Cobb_Douglas(quantities, exponents)
t1 = time.time()
print t1-t0

Iterators are your friend. I got a 28% speedup on my computer by switching your loop to this:
for q, e in itertools.izip(quantities_list, exponents_list):
value *= q ** e
I also got similar results when switching your loop to a functools.reduce call, so it's not worth providing a code sample.
In general, numpy is the right choice for fast arithmetic operations, but numpy's largest integer type is 64 bits, which won't hold the result for your example. If you're using a different numeric range or arithmetic type, numpy is king:
quantities = np.array(quantities, dtype=np.int64)
exponents = np.array(exponents, dtype=np.int64)
def Cobb_Douglas(quantities_list, exponents_list):
return np.product(np.power(quantities_list, exponents_list))
# result: 2649120435010011136
# actual: 21577941222941856209168026828800000

Couple of suggestions:
Use Numpy
Vectorize your code
If quantities are large and and nothing's going to be zero or negative, work in log-space.
I got about a 15% speedup locally using:
def np_Cobb_Douglas(quantities_list, exponents_list):
return np.product(np.power(quantities_list, exponents_list))
And about 40% using:
def np_log_Cobb_Douglas(quantities_list, exponents_list):
return np.exp(np.dot(np.log(quantities_list), np.log(exponents_list)))
Last but not least, there should be some scaling of your Cobb-Douglas parameters so you don't run into overflow errors (if I'm remembering my intro macro correctly).

Python vectorization with a constant

I have a series X of length n(=300,000). Using a window length of w (=40), I need to implement:
mu(i)= X(i)-X(i-w)
s(i) = sum{k=i-w to i} [X(k)-X(k-1) - mu(i)]^2
I was wondering if there's a way to prevent loops here. The fact that mu(i) is constant in second equation is causing complications in vectorization. I did the following so far:
x1=x.shift(1)
xw=x.shift(w)
mu= x-xw
dx=(x-x1-mu)**2 # wrong because mu wouldn't be constant for each i
s=pd.rolling_sum(dx,w)
The above code would work (and was working) in a loop setting but takes too long, so any help regarding vectorization or other speed improvement methods would be helpful. I posted this on crossvalidated with mathjax formatting but that doesn't seem to work here.
https://stats.stackexchange.com/questions/241050/python-vectorization-with-a-constant
Also just to clarify, I wasn't using a double loop, just a single one originally:
for i in np.arange(w, len(X)):
x=X.ix[i-w:i,0] # clip a series of size w
x1=x.shift(1)
mu.ix[i]= x.ix[-1]-x.ix[0]
temp= (x-x1-mu.ix[i])**2 # returns a series of size w but now mu is constant
s.ix[i]= temp.sum()

Approach #1 : One vectorized approach would be using broadcasting -
N = X.shape[0]
a = np.arange(N)
k2D = a[:,None] - np.arange(w+1)[::-1]
mu1D = X - X[a-w]
out = ((X[k2D] - X[k2D-1] - mu1D[:,None])**2).sum(-1)
We can further optimize the last step to get squared summations with np.einsum -
subs = X[k2D] - X[k2D-1] - mu1D[:,None]
out = np.einsum('ij,ij->i',subs,subs)
Further improvement is possible with the use of NumPy strides to get X[k2D] and X[k2D-1].
Approach #2 : To save on memory when working very large arrays, we can use one loop instead of two loops used in the original code, like so -
N = X.shape[0]
s = np.zeros((N))
k_idx = np.arange(-w,1)
for i in range(N):
mu = X[i]-X[i-w]
s[i] = ((X[k_idx]-X[k_idx-1] - mu)**2).sum()
k_idx += 1
Again, np.einsum could be used here to compute s[i], like so -
subs = X[k_idx]-X[k_idx-1] - mu
s[i] = np.einsum('i,i->',subs,subs)

Numpy optimization

I have a function that assigns value depending on the condition. My dataset size is usually in the range of 30-50k. I am not sure if this is the correct way to use numpy but when it's more than 5k numbers, it gets really slow. Is there a better way to make it faster ?
import numpy as np
N = 5000; #dataset size
L = N/2;
d=0.1; constant = 5;
x=constant+d*np.random.random(N);
matrix = np.zeros([L,N]);
print "Assigning matrix"
for k in xrange(L):
for i in xrange(k+1):
matrix[k,i] = random.random()
for i in xrange(k+1,N-k-1):
if ( x[i] > x[i-k-1] ) and ( x[i] > x[i+k+1] ):
matrix[k,i] = 0
else:
matrix[k,i] = random.random()
for i in xrange(N-k-1,N):
matrix[k,i] = random.random()

If you are using for loops, you are going to lose the speed in numpy. The way to get speed is to use numpys functions and vectorized operations. Is there a way you can create a random matrix:
matrix = np.random.randn(L,k+1)
Then do something to this matrix to get the 0's positioned you want? Can you elaborate on the condition for setting an entry to 0? For example, you can make the matrix then do:
matrix[matrix > value]
To retain all values above a threshold. If the condition can be expressed as some boolean indexer or arithmetic operation, you can speed it up. If it has to be in the for loop (ie it depends on the values surrounding it as the loop cycles) it may not be able to be vectorized.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Slow indexing of cython memoryview - python

Related

Nonzero for integers

`numpy.nanpercentile` is extremely slow

Python exponentiation of two lists performance

Python vectorization with a constant

Numpy optimization

Categories

Resources