I have the following nested loop:
sum_tot = 0.0
for i in range(len(N)-1):
    for j in range(len(N)-1):
        sum_tot = sum_tot + N[i]**2*N[j]**2*W[i]*W[j]*x_i[j][-1] / (N[j]**2 - x0**2) * (z_i[i][j] - z_j[i][j])*x_j[i][-1] / (N[i]**2 - x0**2)
It's basically a mathematical function that has a double summation. Each sum goes up to the length of N. I've been trying to figure out if there was a way to write this without using a nested for-loop in order to reduce computational time. I tried using list comprehension, but the computational time is similar if not the same. Is there a way to write this expression as matrices to avoid the loops?
Note that with your current loop the indices only go up to len(N)-2, since range stops one short of its argument. You probably mean to write for i in range(len(N)).
It's also difficult to reduce the cost of a summation: the time it takes is driven by the number of terms computed, so writing it a different way that still involves the same number of terms will take just as long. However, O(n^2) isn't exactly bad: it looks like the best you can do here unless you find a mathematical simplification of the problem.
You might consider checking this post to gather ways to write out the summation in a neater fashion.
@Kraigolas makes valid points. But let's run a few benchmarks on a dummy double-nested operation anyway. (Hint: Numba might help you speed things up.)
Note: I would avoid building NumPy arrays here specifically because the full cross product of the ranges would have to sit in memory at once. If the range is massive, you may run out of memory.
Nested for loops
n = 5000
s1 = 0
for i in range(n):
    for j in range(n):
        s1 += (i/2) + (j/3)
print(s1)
#2.26 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
List comprehension
n = 5000
s2 = 0
s2 = sum([i/2+j/3 for i in range(n) for j in range(n)])
print(s2)
#3.2 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Itertools product
from itertools import product
n = 5000
s3 = 0
for i,j in product(range(n), repeat=2):
    s3 += (i/2) + (j/3)
print(s3)
#2.35 s ± 186 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note: when using Numba, you should run the function at least once before timing it, because the first call compiles the code and is therefore slow. The real speedup shows up from the second call onwards.
Numba njit (SIMD)
from numba import njit

n = 5000

@njit
def f(n):
    s = 0
    for i in range(n):
        for j in range(n):
            s += (i/2) + (j/3)
    return s
s4 = f(n)
#29.4 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numba njit parallel with prange
An excellent suggestion by @Tim, added to the benchmarks.
from numba import njit, prange

@njit(parallel=True)
def f(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            s += (i/2) + (j/3)
    return s

s5 = f(n)
#21.8 ms ± 4.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A significant boost with Numba, as expected. Maybe try that?
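If you go the Numba route, applying it directly to your own expression could look something like the sketch below. double_sum is just an illustrative name, and I'm assuming N, W, x_i, x_j, z_i, z_j are NumPy arrays (nopython mode won't accept nested Python lists); the loop bounds are kept as in your snippet.
from numba import njit

@njit
def double_sum(N, W, x_i, x_j, z_i, z_j, x0):
    # same double summation as the original nested loop
    total = 0.0
    for i in range(len(N)-1):
        for j in range(len(N)-1):
            total += (N[i]**2 * N[j]**2 * W[i] * W[j]
                      * x_i[j, -1] / (N[j]**2 - x0**2)
                      * (z_i[i, j] - z_j[i, j])
                      * x_j[i, -1] / (N[i]**2 - x0**2))
    return total

sum_tot = double_sum(N, W, x_i, x_j, z_i, z_j, x0)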
To convert this to matrix calculations, I would suggest combining some terms first.
If these objects are not numpy arrays, it's better to convert them to numpy arrays, as they support element-wise operations.
To convert, simply do
import numpy
N = numpy.array(N)
W = numpy.array(W)
x_i = numpy.array(x_i)
x_j = numpy.array(x_j)
z_i = numpy.array(z_i)
z_j = numpy.array(z_j)
Then,
common_terms = N**2*W/(N**2 - x0**2)
i_terms = common_terms*x_j[:,-1]
j_terms = common_terms*x_i[:,-1]
i_j_matrix = z_i - z_j
sum_output = (i_terms.reshape((1,-1)) @ i_j_matrix @ j_terms.reshape((-1,1)))[0,0]
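Equivalently, the same contraction can be written with numpy.einsum, which avoids the explicit reshapes (just a stylistic alternative; it computes the same double sum):
sum_output = numpy.einsum('i,ij,j->', i_terms, i_j_matrix, j_terms)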
Related
Apply slicing and conditionals to sparse arrays with parallelization
I want to do something like dynamic programming on a sparse array.
Could you check the following example function, which I would like to implement for a sparse array?
(The first example is for a numpy.array.)
First, import the modules:
from numba import jit
import numpy as np
from scipy import sparse as sp
from numba import prange
then the first example
@jit(parallel=True, nopython=True)
def mytest_csc(inptmat):
    something = np.zeros(inptmat.shape[1])
    for i in prange(inptmat.shape[1]):
        target = 0
        partmat = inptmat[:, i]
        for j in range(len(partmat)):
            counter = 0
            if partmat[j] > 0:
                new_val = partmat[j] / (partmat[j] + something[j])
                target = (something[j] + new_val) / (counter + 1)
                counter += 1
        something[i] = target
    return something
In the above function, the following were done:
slicing/indexing the array
addition and multiplication
a nested for-loop
parallelization with Numba's prange
Here is my question: how can I implement this for a sparse array like scipy.sparse.csc_matrix?
The following is what I have tried.
This function can accept an np.array or a scipy.sparse.csc_matrix as the input, but it cannot be parallelized...
def mytest_csc2(inptmat):
    something = np.zeros(inptmat.shape[1])
    for i in prange(inptmat.shape[1]):
        target = 0
        partmat = inptmat[:, i]
        for j in range(len(partmat)):
            counter = 0
            if partmat[j] > 0:
                new_val = partmat[j] / (partmat[j] + something[j])
                target = (something[j] + new_val) / (counter + 1)
                counter += 1
        something[i] = target
    return something
The parallelization is a must.
Here are the speeds of the above functions.
In the example I made a 100x100 matrix, but in reality I need to process significantly bigger matrices, like 100000x100000, so I can't avoid sparse arrays...
inptmat=np.zeros((100,100)) #test input matrix,normal numpy array
%%timeit
mytest_csc(inptmat)
16.1 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
inptmat=sp.csc_matrix(inptmat) #test input matrix,scipy.sparse.csc_matrix
%%timeit
mytest_csc2(inptmat)
1.39 s ± 70.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I need to optimize the second test function so that it works as fast as possible, like the first example with Numba.
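One common workaround, since Numba's nopython mode does not understand scipy.sparse objects directly, is to pass the CSC matrix's underlying data, indices and indptr arrays into the jitted function and slice each column through indptr yourself. Below is a minimal sketch of that pattern (mytest_csc_sparse is an illustrative name, not a drop-in replacement); because the original condition partmat[j] > 0 already skips zero entries, iterating only the stored nonzeros reproduces the same per-element update for nonnegative data.
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def mytest_csc_sparse(data, indices, indptr, n_cols):
    # data/indices/indptr are the raw CSC buffers;
    # the nonzeros of column i live in the slice indptr[i]:indptr[i+1]
    something = np.zeros(n_cols)
    for i in prange(n_cols):
        target = 0.0
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]   # row index of this stored entry
            val = data[p]    # value of this stored entry
            counter = 0
            if val > 0:
                new_val = val / (val + something[j])
                target = (something[j] + new_val) / (counter + 1)
                counter += 1
        something[i] = target
    return something

# usage sketch:
# m = sp.csc_matrix(inptmat)
# out = mytest_csc_sparse(m.data, m.indices, m.indptr, m.shape[1])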
Multiplying large matrices takes a very long time. How can this problem be solved? I use the galois library and numpy; I think it should still work stably. I tried to implement my own GF4 arithmetic and multiplied the matrices using numpy, but it takes even longer. Thank you for your reply.
For r = 2, 3, 4, 5, 6 it multiplies quickly, but beyond that it takes a long time. To me these are not very large matrix sizes. This is just a code snippet: given r, I get the sizes n, k of matrices of a certain family, and I need to multiply matrices with those parameters.
import numpy as np
import galois
def family_Hamming(q,r):
n = int((q**r-1)/(q-1))
k = int((q**r-1)/(q-1)-r)
res = (n,k)
return res
q = 4
r = 7
n,k = family_Hamming(q,r)
GF = galois.GF(2**2)
# shape (5454, 5454)
a = GF(np.random.randint(4, size=(k, k)))
# shape (5454, 5461)
b = GF(np.random.randint(4, size=(k, n)))
c = np.dot(a,b)
print(c)
I'm not sure it is actually faster, but np.dot should be used for the dot product of two vectors; for matrix multiplication, use A @ B. That's as efficient as you can get with Python, as far as I know.
I'm the author of galois. I added performance improvements to matrix multiplication in v0.3.0 by parallelizing the arithmetic over multiple cores. The next performance improvement will come once GPU support is added.
I'm open to other performance improvement suggestions, but as far as I know the algorithm is running as fast as possible on a CPU.
In [1]: import galois
In [2]: GF = galois.GF(2**2)
In [3]: A = GF.Random((300, 400), seed=1)
In [4]: B = GF.Random((400, 500), seed=2)
# v0.2.0
In [5]: %timeit A @ B
1.02 s ± 7.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# v0.3.0
In [5]: %timeit A @ B
99 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Try using jax on a CUDA runtime. For example, you can try it out on Google Colab's free GPU. (Open a notebook -> Runtime -> Change runtime type -> GPU).
import numpy as np
import jax.numpy as jnp
from jax import device_put

# GF, k and n as defined in the question
a = GF(np.random.randint(4, size=(k, k)))
b = GF(np.random.randint(4, size=(k, n)))
a, b = device_put(a), device_put(b)
c = jnp.dot(a, b)
c = np.asarray(c)
Timing test:
%timeit jnp.dot(a, b).block_until_ready()
# 765 ms ± 96.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
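To double-check that jax actually picked up the GPU backend on Colab, you can list the visible devices:
import jax
print(jax.devices())  # should list a GPU device when the GPU runtime is active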
I have a question about using Numba for optimization. I am coding a fixed-point iteration to calculate the value of a certain array, named gamma, which satisfies the equation f(gamma)=gamma. I am trying to optimize this function with the Python package Numba. It looks as follows.
from numba import jit

@jit
def fixed_point(gamma_guess):
    for i in range(17):
        gamma_guess = f(gamma_guess)
    return gamma_guess
Numba optimizes this function well, because it knows in advance how many times it will perform the operation (17 times), and it runs fast. But I need to control the error tolerance of my desired gamma: the difference between a gamma and the next one obtained by the fixed-point iteration should be less than some number epsilon=0.01. So I tried
@jit
def fixed_point(gamma_guess):
    err = 1000
    gamma_old = gamma_guess.copy()
    while err > 0.01:
        gamma_guess = f(gamma_guess)
        err = np.max(abs(gamma_guess - gamma_old))
        gamma_old = gamma_guess.copy()
    return gamma_guess
It also works and calculates the desired result, but not as fast as the first implementation; it is much slower. I think this is because Numba cannot optimize the while loop well, since we do not know in advance when it will stop. Is there a way to optimize this so it runs as fast as the first implementation?
Edit:
Here is the f that I'm using
import numpy as np
from scipy import fftpack as sp

S = 0.01
Amu = 0.7

@jit
def f(gammaa, z, zal, kappa):
    # N and h are defined elsewhere in my code
    ka = sp.diff(kappa)
    gamma0 = gammaa
    for i in range(N):
        suma = 0
        for j in range(N):
            if (abs(j-i)) % 2 == 1:
                if (z[i]-z[j]) != 0:
                    suma += (gamma0[j]/(z[i]-z[j]))
        gamma0[i] = 2.0*Amu*np.real(-(zal[i]/z[i]) + zal[i]*(1.0/(2*np.pi*1j))*suma*2*h) + S*ka[i]
    return gamma0
I always use np.ones(2048)*0.5 as the initial guess, and the other parameters I pass to my function are z = np.cos(alphas) + 1j*(np.sin(alphas)+0.1), zal = -np.sin(alphas) + 1j*np.cos(alphas), kappa = np.ones(2048), with alphas = np.arange(0, 2*np.pi, 2*np.pi/2048).
I made a small test script to see if I could reproduce your issue:
import numba as nb
from IPython import get_ipython
ipython = get_ipython()
@nb.jit(nopython=True)
def f(x):
    return (x+1)/x

def fixed_point_for(x):
    for _ in range(17):
        x = f(x)
    return x

@nb.jit(nopython=True)
def fixed_point_for_nb(x):
    for _ in range(17):
        x = f(x)
    return x
def fixed_point_while(x):
    error = 1
    x_old = x
    while error > 0.01:
        x = f(x)
        error = abs(x_old - x)
        x_old = x
    return x

@nb.jit(nopython=True)
def fixed_point_while_nb(x):
    error = 1
    x_old = x
    while error > 0.01:
        x = f(x)
        error = abs(x_old - x)
        x_old = x
    return x
print("for loop without numba:")
ipython.magic("%timeit fixed_point_for(10)")
print("for loop with numba:")
ipython.magic("%timeit fixed_point_for_nb(10)")
print("while loop without numba:")
ipython.magic("%timeit fixed_point_while(10)")
print("for loop with numba:")
ipython.magic("%timeit fixed_point_while_nb(10)")
As I don't know your f, I just used the simplest stabilizing function I could think of. I then ran tests with and without Numba, each time with both the for and the while loop. The results on my machine are:
for loop without numba:
3.35 µs ± 8.72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
for loop with numba:
282 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while loop without numba:
1.86 µs ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
while loop with numba:
214 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The following thoughts arise:
It can't be that your function is not optimizable, since your for loop is fast (at least you said so; have you tested it without Numba?).
It could be that your function takes far more iterations to converge than you think (see the counting sketch after this list).
We are using different software versions. My versions are:
numba 0.49.0
numpy 1.18.3
python 3.8.2
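To rule out the second point, a small variant of your while loop that also counts iterations will tell you how many passes your tolerance of 0.01 actually needs (fixed_point_count is just an illustrative name; f, jit and np as in your question):
@jit
def fixed_point_count(gamma_guess):
    # same while loop as above, but also reports the number of iterations
    err = 1000.0
    gamma_old = gamma_guess.copy()
    n_iter = 0
    while err > 0.01:
        gamma_guess = f(gamma_guess)
        err = np.max(np.abs(gamma_guess - gamma_old))
        gamma_old = gamma_guess.copy()
        n_iter += 1
    return gamma_guess, n_iter

# e.g. gamma, n_iter = fixed_point_count(np.ones(2048) * 0.5)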
Is an == comparison of a large np.array with a single number very slow in Python? I used line_profiler to locate the bottleneck in my code. The bottleneck is just a simple comparison of a 1d np.array with a constant number. It accounts for 80% of the total runtime. Did I do anything wrong to make it so slow? Is there any way to accelerate it?
I also tried multiprocessing; however, in the test code (snippet 2), using multiprocessing is slower than running sequentially or using map directly. Could anyone explain this phenomenon?
Any comments or suggestions are sincerely appreciated.
Snippet 1:
Line #   Hits    Time          Per Hit   % Time   Line Contents
    38   12635   305767927.0   24200.1     80.0   res = map(logicalEqual, assembly)

def logicalEqual(x):
    return F[:,-1] == x

assembly = [1,2,3,4,5,7,8,9,...,25]
F is an int-typed (281900, 6) np.array.
Snippet 2:
import numpy as np
from multiprocessing import Pool
import time
y=np.random.randint(2, 20, size=10000000)
def logicalEqual(x):
    return y == x
p=Pool()
start = time.time()
res0=p.map(logicalEqual, [1,2,3,4,5,7,8,9,10,11,12,13,14,15])
# p.close()
# p.join()
runtime = time.time()-start
print(f'runtime using multiprocessing.Pool is {runtime}')
res1 = []
start = time.time()
for x in [1,2,3,4,5,7,8,9,10,11,12,13,14,15]:
    res1.append(logicalEqual(x))
runtime = time.time()-start
print(f'sequential runtime is {runtime}')
start = time.time()
res2=list(map(logicalEqual,[1,2,3,4,5,7,8,9,10,11,12,13,14,15]))
runtime = time.time()-start
print(f'runtime is {runtime}')
runtime using multiprocessing.Pool is 0.3612203598022461
sequential runtime is 0.17401981353759766
runtime is 0.19697237014770508
Array comparison is fast, since it is done in C code, not Python.
x = np.random.rand(1000000)
y = 4.5
test = 0.55
%timeit x == test
386 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit y == test
33.2 ns ± 0.121 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
So comparing one Python float to another takes about 33 ns, while comparing 1e6 NumPy floats takes 386 µs, which is only 386 µs / 33 ns ≈ 11,700 times longer despite involving 1,000,000 times as many values. The same holds for ints (377 µs vs 34 ns). But as Dunes mentioned in a comment, comparing a lot of values simply takes a lot of cycles; there is nothing you can do about that.
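If the goal is simply to collect all the comparison results at once, broadcasting folds the Python-level map over assembly into a single vectorized call; the element-wise work is unchanged, so don't expect a large speedup. A sketch, assuming F and assembly as in the question:
import numpy as np

assembly_arr = np.asarray(assembly)
# row k of res is equivalent to logicalEqual(assembly[k]);
# the result has shape (len(assembly), F.shape[0])
res = F[:, -1] == assembly_arr[:, None]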
Often, to save some time, I would like to use n = len(s) in my local function.
I am curious which form is faster, or whether they are the same:
while i < len(s):
    # do something
vs
while i < n:
    # do something
There should not be much difference, but with len(s) we need to look up s first and then call len on it, which is O(1) + O(1); with n it is just O(1). At least, that is my assumption.
It has to be faster.
Using n, you look up the name once.
Using len(s), you look up names twice (len itself is a name that has to be looked up), and then you call the function.
That said, if you write while i < n:, most of the time you can get away with a classic for i in range(len(s)): loop, since the upper boundary doesn't change and range evaluates it only once at the start (which may lead you to: why not iterate directly over the elements, or use enumerate?).
while i < len(s) lets you compare your index against a list whose length changes; that's its whole point. If the bound is fixed, it becomes less attractive.
In a for loop it's also easy to skip an iteration with continue (just as easy as it is to forget to increment i in a while loop and end up looping forever).
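A quick way to see the extra work behind len(s) is to disassemble both conditions with the standard dis module; the first needs a global lookup of len plus a call, the second is a single local load:
import dis

def use_len(s, i):
    return i < len(s)

def use_n(n, i):
    return i < n

dis.dis(use_len)  # shows a global lookup of len plus a call before the comparison
dis.dis(use_n)    # shows a single local variable load before the comparison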
You're right; here are some benchmarks:
import numpy as np
s = np.random.rand(100)
n = 100
Above is setup.
%%timeit
50 < len(s)
86.3 ns ± 2.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Versus:
%%timeit
50 < n
36.8 ns ± 1.15 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
But then again, it's hard to imagine that differences at the ~60 ns level would affect overall speed, unless you're calling len(s) millions of times.