Cupy is slower than numpy - python

I tried to speed up my Python code by using CuPy instead of NumPy. The problem is that with CuPy my code got drastically slower. Maybe I approached the problem a little too naively.
Maybe someone can find a bottleneck in my code:
import cupy as np
import time as ti

def f(y, t):
    y_ = np.zeros(2 * N_1*N_2)  # n: e-6, c: e-5
    for i in range(0, N_1*N_2):
        y_[i] = y[i + N_1*N_2]  # n: e-7, c: e-5 or e-6
    for i in range(N_1*N_2):
        sum = -4*y[i]           # n: e-7, c: e-7 after some statements e-5
        if (i + 1 in indexes) and (not (i in indi)):
            sum += y[i+1]       # n: e-7, c: e-7 after some statements e-5
        if (i - 1) in indexes and (i % N_1 != 0):
            sum += y[i-1]       # n: e-7, c: e-7 after some statements e-5
        if i + N_1 in indexes:
            sum += y[i+N_1]     # n: e-7, c: e-7 after some statements e-5
        if i - N_1 in indexes:
            sum += y[i-N_1]     # n: e-7, c: e-7 after some statements e-5
        y_[i + N_1*N_2] = sum
    return y_

def k_1(y, t, h):
    return np.asarray(f(y, t)) * h

def k_2(y, t, h):
    return np.asarray(f(np.add(np.asarray(y), np.multiply(1/2, k_1(y, t, h))), t + 1/2 * h)) * h

# k_3, k_4 look just like k_2, maybe with a 1/2 here or there

# some init stuff is happening here
while t < T_end:
    # also some magic happening here which is just data saving
    y = np.asarray(y) + 1/6*(k_1(y, t, m) + 2*k_2(y, t, m) + 2*k_3(y, t, m) + k_4(y, t, m))
    t += m
EDIT
I tried to benchmark my code and here are some results; they can be seen as comments in the code above. Each number stands for one line. The units are seconds. n: NumPy, c: CuPy. I mostly give a rough estimate of the order of magnitude.
Additionally I tested
np.multiply # n: e-6, c: e-5
and
np.add # n: e-5 or e-6, c: 0.005 or e-5

Your code is not slow because NumPy is slow, but because you call many (Python) functions, and calling functions (and iterating and accessing objects and basically everything) is slow in Python. Thus CuPy will not help you (and will probably hurt performance, because it has to do more setup, e.g. copying data over to the GPU). If you can formulate your algorithm to use fewer Python function calls (vectorizing, as in the other answer), this will speed up your code tremendously (and you probably won't need CuPy at all).
You could also look into Numba, which compiles your code with LLVM to native code. If you do so, be sure to read some documentation and use nopython=True, otherwise you will only swap slow CuPy code for slow Numba code.
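A minimal sketch of that Numba suggestion (not the asker's exact boundary logic; the grid shape and the plain five-point stencil here are assumptions for illustration):

import numpy as np
from numba import njit

@njit  # shorthand for @jit(nopython=True)
def rhs(y, n1, n2):
    n = n1 * n2
    y_ = np.zeros(2 * n)
    for i in range(n):
        y_[i] = y[i + n]        # first half: copy the "velocity" part
        s = -4.0 * y[i]         # five-point stencil, ignoring boundary details
        if i + 1 < n:
            s += y[i + 1]
        if i - 1 >= 0:
            s += y[i - 1]
        if i + n1 < n:
            s += y[i + n1]
        if i - n1 >= 0:
            s += y[i - n1]
        y_[i + n] = s
    return y_

y0 = np.random.rand(2 * 10 * 10)
print(rhs(y0, 10, 10)[:5])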

Your code example doesn't work since you haven't defined N_1, N_2, indexes and indi anywhere. Also, the comments in your code don't really help others understand what's going on.
Your code probably won't benefit from Numba/CuPy since you haven't vectorized the operations. Plain lists would probably be just as fast as NumPy arrays the way your code works at the moment.
If you get rid of your for loops and change
y_ = np.zeros(2 * N_1*N_2)
for i in range(0, N_1*N_2):
    y_[i] = y[i + N_1*N_2]
to
n = N_1*N_2
y_ = np.zeros(2*n)
y_[:n] = y[n:2*n]
and so forth, you will speed your code up substantially.
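If the index sets indexes and indi just encode an N_1 x N_2 grid with zeros outside the boundary (an assumption, since they are not shown in the question), the whole stencil loop can be vectorized with 2-D slicing in the same spirit, for example:

import numpy as np

def rhs_vectorized(y, n1, n2):
    n = n1 * n2
    y_ = np.zeros(2 * n)
    y_[:n] = y[n:2*n]
    u = y[:n].reshape(n2, n1)      # view the state as a 2-D grid
    lap = -4.0 * u
    lap[:, :-1] += u[:, 1:]        # right neighbour
    lap[:, 1:]  += u[:, :-1]       # left neighbour
    lap[:-1, :] += u[1:, :]        # lower neighbour
    lap[1:, :]  += u[:-1, :]       # upper neighbour
    y_[n:] = lap.ravel()
    return y_

y0 = np.random.rand(2 * 12 * 8)
print(rhs_vectorized(y0, 12, 8).shape)   # (192,)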

Related

Converting Mathematica Fourier series code to Python

I have some simple Mathematica code that I'm struggling to convert to Python and could use some help:
a = ((-1)^(n))*4/(Pi*(2 n + 1));
f = a*Cos[(2 n + 1)*t];
sum = Sum[f, {n, 0, 10}];
Plot[sum, {t, -2 \[Pi], 2 \[Pi]}]
The plot looks like this:
For context, I have a function f(t):
I need to plot the sum of the first 10 terms. In Mathematica this was pretty straightforward, but for some reason I just can't seem to figure out how to make it work in Python. I've tried defining a function a(n), but when I try to set f(t) equal to the sum using my list of odd numbers, it doesn't work because t is not defined, even though t is supposed to be a variable. Any help would be much appreciated.
Below is a sample of one of the many different things I've tried. I know that it's not quite right in terms of getting the parity of the terms to alternate, but more importantly I just want to figure out how to get f to be the sum of the first 10 terms of the summation:
n = list(range(1, 20, 2))
def a(n):
    return ((-1)**(n))*4/(np.pi*n)
f = 0
for i in n:
    f += a(i)*np.cos(i*t)
Modifying your code (look at the parts which are different), the main mistake was that you were not calculating the terms based on n = 0..10:
n = np.arange(0, 10)
t = np.linspace(-2 * np.pi, 2 * np.pi, 10000)
def a(n):
    return ((-1)**(n))*4/(np.pi*(2*n+1))
f = 0
for i in n:
    f += a(i)*np.cos((2*i + 1) * t)
However, you could also write it in matrix form and avoid looping altogether, using vectors and broadcasting:
n = np.arange(10)[:,None]
t = np.linspace(-2 * np.pi, 2 *np.pi, 10000)[:,None]
a = ((-1) ** n) * 4 / (np.pi*(2*n + 1))
f = (a * np.cos((2 * n + 1) * t.T )).sum(axis=0)
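Assuming matplotlib is available (the question never shows its plotting code, so this is just an illustration), the partial sum can then be plotted much like the Mathematica Plot call:

import numpy as np
import matplotlib.pyplot as plt

n = np.arange(10)[:, None]
t = np.linspace(-2 * np.pi, 2 * np.pi, 10000)[:, None]
a = ((-1) ** n) * 4 / (np.pi * (2 * n + 1))
f = (a * np.cos((2 * n + 1) * t.T)).sum(axis=0)

plt.plot(t.ravel(), f)
plt.xlabel("t")
plt.ylabel("partial sum of the Fourier series")
plt.show()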

Is there a better way to search a sorted list if the other list is sorted too?

In the NumPy library, one can pass a list into the numpy.searchsorted function, whereby it searches through a different list one element at a time and returns an array of the same size containing the indices needed to preserve order. However, this seems to waste performance if both lists are sorted. For example:
m=[1,3,5,7,9]
n=[2,4,6,8,10]
numpy.searchsorted(m,n)
would return [1,2,3,4,5], which is the correct answer, but it looks like this has complexity O(n log m), whereas if one were to simply loop through m and keep some kind of pointer into n, the complexity seems more like O(n+m). Is there some kind of function in NumPy which does this?
AFAIK, it is not possible to do that in linear time with NumPy alone without making additional assumptions on the inputs (e.g. the integers are small and bounded). An alternative solution is to use Numba to do the merge manually:
import numba as nb
import numpy as np

# Note: Numba requires a function signature with well defined array types
@nb.njit('int64[:](int64[::1], int64[::1])')
def search_both_sorted(a, b):
    i, j = 0, 0
    result = np.empty(b.size, np.int64)
    while i < a.size and j < b.size:
        if a[i] < b[j]:
            i += 1
        else:
            result[j] = i
            j += 1
    for k in range(j, b.size):
        result[k] = i
    return result

a, b = np.cumsum(np.random.randint(0, 100, (2, 1000000)).astype(np.int64), axis=1)
result = search_both_sorted(a, b)
A faster implementation consists of using a branchless approach to remove the overhead of branch misprediction (especially on random/unpredictable inputs) when a and b are about the same size. Additionally, the O(n log m) algorithm can be faster when b is small, so using np.searchsorted in that case is very efficient, as pointed out by @MichaelSzczesny. Note that the Numba implementation of np.searchsorted can be a bit slower than NumPy's, so it is better to pick the NumPy implementation there. Here is the optimized version:
@nb.njit('int64[:](int64[::1], int64[::1])')
def search_both_sorted_opt_numba(a, b):
    sa, sb = a.size, b.size
    # Choose the best algorithm
    if sb < sa * 0.15:
        # Use a version with branches because `a[i] < b[j]`
        # should be true most of the time.
        i, j = 0, 0
        result = np.empty(b.size, np.int64)
        while i < a.size and j < b.size:
            if a[i] < b[j]:
                i += 1
            else:
                result[j] = i
                j += 1
        for k in range(j, b.size):
            result[k] = i
    else:
        # Use a branchless approach to avoid mispredictions
        i, j = 0, 0
        result = np.empty(b.size, np.int64)
        while i < a.size and j < b.size:
            tmp = a[i] < b[j]
            result[j] = i
            i += tmp
            j += ~tmp
        for k in range(j, b.size):
            result[k] = i
    return result

def search_both_sorted_opt(a, b):
    sa, sb = a.size, b.size
    # Choose the best algorithm
    if 2 * sb * np.log2(sa) < sa + sb:
        return np.searchsorted(a, b)
    else:
        return search_both_sorted_opt_numba(a, b)
searchsorted: 19.1 ms
snp_search: 11.8 ms
search_both_sorted: 6.5 ms
search_both_sorted_branchless: 4.3 ms
The optimized branchless Numba implementation is about 4.4 times faster than searchsorted, which is pretty good considering that the code of searchsorted is already highly optimized. It can be even faster when a and b are huge, because of better cache locality.
You could use sortednp. Unfortunately it does not give much flexibility: in the code snippet below I use its merge with index tracking, but it produces three arrays, so about four times more memory than necessary is used. Still, it is faster than searchsorted.
import numpy as np
import sortednp as snp

a = np.cumsum(np.random.rand(1000000))
b = np.cumsum(np.random.rand(1000000))

def snp_search(a, b):
    m, (ib, ia) = snp.merge(b, a, indices=True)
    return ib - np.arange(len(ib))

assert(np.all(snp_search(a, b) == np.searchsorted(a, b)))

np.searchsorted(a, b)  # 58 ms
snp_search(a, b)       # 22 ms
np.searchsorted already takes this into account, as can be seen from the source code:
/*
 * Updating only one of the indices based on the previous key
 * gives the search a big boost when keys are sorted, but slightly
 * slows down things for purely random ones.
 */
if (cmp(last_key_val, key_val)) {
    max_idx = arr_len;
}
else {
    min_idx = 0;
    max_idx = (max_idx < arr_len) ? (max_idx + 1) : arr_len;
}
Here min_idx, max_idx are used to perform binary search on the array. If last_key_val < key_val then only max_idx is reset to the array length, but min_idx remains at its current value, i.e. binary search starts at the same lower boundary as for the previous key.
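A rough timing sketch of that effect (numbers will vary by machine; the array sizes are arbitrary), comparing a sorted query array against the same values shuffled:

import numpy as np
from timeit import timeit

a = np.cumsum(np.random.rand(1000000))
b_sorted = np.cumsum(np.random.rand(1000000))
b_shuffled = np.random.permutation(b_sorted)

# Sorted queries let searchsorted reuse the previous lower bound,
# so the first call should be noticeably faster than the second.
print(timeit(lambda: np.searchsorted(a, b_sorted), number=10))
print(timeit(lambda: np.searchsorted(a, b_shuffled), number=10))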

Converting SimpleCluster1D pseudo code to python

I refer to the dissertation written by Marcel R. Ackermann, found at https://d-nb.info/100345531X/34. In the dissertation, Marcel wrote pseudo-code for an optimal 1-dimensional k-median algorithm. It is shown as such:
pseudo-code for optimal K-Median
I tried to convert the code into python, as shown below:
import math
import statistics

def cost(arr, median):
    cost = 0
    for i in range(len(arr)):
        cost = cost + abs(arr[i] - median)
    return cost

def simpleCluster1D(arr, k):
    n = len(arr)
    B = [[0] * k for i in range(n)]
    C = [[0] * k for i in range(n)]
    for i in range(k):
        c = statistics.median(arr[:i+1])
        B[i][0] = cost(arr[:i+1], c)
        C[i][0] = c
    for j in range(1, k):
        for i in range(j, n):
            B[i][j] = math.inf
            C[i][j] = []
            for t in range(j, i+1):
                c = statistics.median(arr[t:i+1])
                b = B[t-1][j-1] + cost(arr[t:i+1], c)
                if b < B[i][j]:
                    B[i][j] = b
                    tmp = C[t-1][j-1]
                    C[i][j] = [C[t-1][j-1]] + [c]
    return C[n-1][k-1]
However, the results I obtained are not intuitive.
For example, when
arr = [50,60,70,80]
k = 2
simpleCluster1D(arr, k)
The result is [0,80], which is wrong. The answer should be [55,75] or [50,70].
I don't know where I have gone wrong.
I am wondering if anyone can help me with this conversion. I am a little confused about the declaration of the array C: column 1 of the array contains the median, and column 2 contains a list in each array index. How do I do that?
Also, do the libraries/packages available online for R/Python (e.g. flexclust in R and pyclustering in Python) already have a built-in optimal 1-D solver? I know that for d > 1 it is impossible to achieve an optimal result, and thus heuristics are used to obtain a locally optimal solution. That is why I concluded that these libraries will also solve 1-D problems with heuristics, and hence the answer is not deterministic. Am I right to come to that conclusion?
I don't know where I have gone wrong.
You haven't. The error is in the dissertation; the line
1: for i = 1,2,...,k do
has to be
1: for i = 1,2,...,n do
- otherwise the rows from k+1 to n of the arrays B and C aren't fully initialized.
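Applied to the code posted above, the only change is in the first loop of simpleCluster1D (a sketch of just that part; the rest of the function stays as written):

for i in range(n):                      # was: for i in range(k)
    c = statistics.median(arr[:i+1])
    B[i][0] = cost(arr[:i+1], c)
    C[i][0] = c

With this change, simpleCluster1D([50, 60, 70, 80], 2) returns [50, 70], one of the expected answers.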

Python: Faster Universal Hashing function with built in libs

I am trying to implement the universal hashing function with only base libs:
I am having issues because I am unable to run it in an acceptable time. I know % is slow, so I have tried the following:
((a * x + b) % P) % n
divmod(divmod(a * x + b, P)[1], n)[1]
subeq = pow(a * x + b, 1, P)
hash = pow(subeq, 1, self.n)
All of these functions are too slow for what I am trying to do. Is there a faster way to do the modulo operation using only the base libraries that I am unaware of?
EDIT: To elaborate, I will be running this function about 200000 times (or more) and I need all 200000 runs to complete in under 4 seconds. None of these methods are even in that ballpark (they take minutes).
You're not going to do better than ((a * x + b) % P) % m in pure Python code; the overhead of the Python interpreter will bottleneck you more than anything else. Yes, if you ensure m is a power of two, you can precompute mm1 = m - 1 and change the computation to ((a * x + b) % P) & mm1, replacing a more expensive remaindering operation with a cheaper bitmasking operation, but unless P is huge (hundreds of bits minimum), the interpreter overhead will likely outweigh the difference between remainder and bitmasking.
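A tiny sketch of that bitmask variant (the concrete values of P, a, b and x below are just placeholders, and m must be a power of two for the trick to be valid):

# % m and & (m - 1) agree only because m is a power of two
P = (1 << 61) - 1       # example modulus (a Mersenne prime)
a, b, x = 12345, 67890, 424242
m = 1 << 10             # 1024 bins
mm1 = m - 1

h_mod  = ((a * x + b) % P) % m
h_mask = ((a * x + b) % P) & mm1
assert h_mod == h_mask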
If you really need the performance, and the types you're working with will fit in C level primitive type, you may benefit from writing a Python C extension that converts all the values to size_t, Py_hash_t, uint64_t, or whatever suits your problem and performs the math as a set of bulk conversions to C types, C level math, then a single conversion back to Python int, saving a bunch of byte code and intermediate values (that are promptly tossed).
If the values are too large to fit in C primitives, GMP types are an option (look at mpz_import and mpz_export for efficient conversions from PyLong to mpz_t and back), but the odds of seeing big savings go down; GMP does math faster in general, and can mutate numbers in place rather than creating and destroying lots of temporaries, but even with mpz_import and mpz_export, the cost of converting between Python and GMP types would likely eat most of the savings.
from math import ceil, log2
from primesieve import nth_prime  # nth_prime(n, start) returns the nth prime after start
from random import randint

class UniversalHashing:
    """ N = #bins
        p = prime number such that p >= N, e.g. for N = 50:
            nth_prime(1, 1 << max(32, ceil(log2(N))))
            = nth_prime(1, 2**32)
            = nth_prime(1, 4294967296)
            = 4294967311
        assert: raises an error if the condition p >= N is not satisfied
        << operator: shift left, i.e. multiply by a power of two, e.g. 2 << 2 = 2*2**2 = 8, 7 << 3 = 7*2**3 = 56;
            ceil rounds up to the next integer, e.g. ceil(1) = 1, ceil(1.1) = 2
        randint(a, b): returns a random integer N such that a <= N <= b (alias for randrange(a, b+1)) """

    def __init__(self, N, p=None):
        self.N = N
        if p is None:
            p = nth_prime(1, 1 << max(32, ceil(log2(N))))
        assert p >= N, 'Prime number p should be at least N!'
        self.p = p

    def draw(self):
        a = randint(1, self.p - 1)
        b = randint(0, self.p - 1)
        return lambda x: ((a * x + b) % self.p) % self.N

if __name__ == '__main__':
    N = 50       # bins
    n = 100000   # elements
    H = UniversalHashing(N)
    h = H.draw()
    T = [0] * N
    for _ in range(n):
        x = randint(0, n * 10)
        T[h(x)] += 1
    for i in range(len(T)):
        print(T[i] / n)  # This should be approximately equal for all bins

optimization tool

I wonder whether there are any tools to optimize my program in terms of loop unrolling, and how I can use them.
I have the following python code:
for i in range(0, 1000):
    a = a * 10 + a%4 + i
for j in range(0, 1000):
    j = j + a
for b in range(0, 1000):
    result = j + b
I want to optimize this code segment so that I can try to understand loop unrolling a bit. Besides Python, I would also like to know of a C optimizer.
a = 30
for i in range(0, 1000):
    a = a * 10 + a%4 + i
can be rewritten as:
a = reduce(lambda a,b: a * 10 + a%4 + b, xrange(1000), 30)
takes about the same time (~4ms on my computer).
for j in range(0, 1000):
    j = j + a
doesn't make much sense. You are iterating j over 0-999, and each time you add your huge a to it, and the result is immediately forgotten because the next j is taken. It can be rewritten as:
j = 999 + a
for b in range(0, 1000):
    result = j + b
doesn't make much sense either. It is equivalent to:
result = j + 999
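Putting those simplifications together (a sketch, using the a = 30 starting value from the answer and Python 3's functools.reduce in place of the Python 2 built-in):

from functools import reduce

# the three loops collapse to a handful of statements
a = reduce(lambda a, b: a * 10 + a % 4 + b, range(1000), 30)
j = 999 + a
result = j + 999
print(result)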
If you aren't satisfied with the performance of your code, have profiled it, and found that low-level loops like this are a bottleneck, you should be able to speed up your code hugely by using cython to turn the expensive bits of code into C extensions. Also, if you are using python 2.x, you should be using xrange instead of range.
There exists a scientific paper regarding effects of loop unrolling in Python (pdf link). These are the slides of the related talk.
However, in terms of automatic C code optimization you can use LLVM in combination with LooPo and possibly Polly. Anyway, LLVM is a good starting point.
