I have this code:
import numpy as np
import math
import matplotlib.pyplot as plt
from skimage.util import img_as_ubyte
from skimage.feature import canny
image = img_as_ubyte(sf_img)
edges = np.flipud(canny(image, sigma=3, low_threshold=10, high_threshold=25))
non_zeros = np.nonzero(edges)
true_rows = non_zeros[0]
true_col = non_zeros[1]
plt.imshow(edges)
plt.show()
N_im = 256
x0 = 0
y0 = -0.25
Npx = 129
Npy = 60
delta_py = 0.025
delta_px = 0.031
Nr = 9
delta_r = 0.5
rho = 0.063
epsilon = 0.75
r_k = np.zeros((1, Nr))
r_min = 0.5
for k in range(0, Nr):
    r_k[0, k] = k * delta_r + r_min
a = np.zeros((Npy, Npx, Nr))
# FOR LOOP TO BE TIME OPTIMIZED
for i in range(0, np.size(true_col, 0)):  # true_col and true_rows have the same size, so either works
    for m in range(0, Npy):
        for l in range(0, Npx):
            d = math.sqrt(
                math.pow((true_col[i] - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                         - (l * delta_px - (Npx * delta_px / 2) + x0), 2)
                + math.pow((true_rows[i] - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                           - (m * delta_py - (Npy * delta_py / 2) + y0), 2))
            min_idx = np.argmin(np.abs(d - r_k))
            rk_hat = r_k[0, min_idx]
            if np.abs(d - rk_hat) < rho:
                a[m, l, min_idx] = a[m, l, min_idx] + 1
# ANOTHER LOOP TO BE OPTIMIZED
# for m in range(0, Npy):
#     for l in range(0, Npx):  # ORIGINAL
#         for k in range(0, Nr):
#             if a[m, l, k] < epsilon * np.max(a):
#                 a[m, l, k] = 0
a[np.where(a[:, :, :] < epsilon * np.max(a))] = 0  # SUBSTITUTED
a_prime = np.sum(a, axis=2)
acc_x = np.zeros((Npx, 1))
acc_y = np.zeros((Npy, 1))
for l in range(0, Npx):
    acc_x[l, 0] = l * delta_px - (Npx * delta_px / 2) + x0
for m in range(0, Npy):
    acc_y[m, 0] = m * delta_py - (Npy * delta_py / 2) + y0
prod = 0
for m in range(0, Npy):
    for l in range(0, Npx):
        prod = prod + (np.array([acc_x[l, 0], acc_y[m, 0]]) * a_prime[m, l])
points = prod / np.sum(a_prime)
Based on a comment to an answer:
true_rows = np.random.randint(0,256,10)
true_col = np.random.randint(0,256,10)
Briefly, the code scans a 256x256 image that has previously been processed with Canny edge detection.
The for loop must visit every pixel of the resulting image and then run two nested for loops that perform some operations depending on the l and m indices of the 'a' matrix.
Since the edge detection returns an image of zeros and ones (ones at the edges), and since the inner operations only have to be done for the one-valued points, I've used
non_zeros = np.nonzero(edges)
to obtain only the indexes I'm interested in. Indeed, previously the code was in this way
for i in range(0, N_im):
    for j in range(0, N_im):
        if edges[i, j] == 1:
            for m in range(0, Npy):
                for l in range(0, Npx):
                    d = math.sqrt(
                        math.pow((i - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                                 - (l * delta_px - (Npx * delta_px / 2) + x0), 2)
                        + math.pow((j - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                                   - (m * delta_py - (Npy * delta_py / 2) + y0), 2))
                    min_idx = np.argmin(np.abs(d - r_k))
                    rk_hat = r_k[0, min_idx]
                    if np.abs(d - rk_hat) < rho:
                        a[m, l, min_idx] = a[m, l, min_idx] + 1
It seems I managed to optimize the first two loops, but my script needs to be faster still.
It takes roughly 6-7 minutes to run, and I need to execute it about 1000 times. Can you help me optimize these for loops even further? Thank you!
You can use Numba's JIT to speed up the computation (the default CPython interpreter is very slow for this kind of numeric loop). Moreover, you can rework the loops so that the code runs in parallel.
Here is the resulting code:
import math
import numpy as np
import numba as nb

# Assume you work with 64-bit integers;
# feel free to change the signature to 32-bit integers if this is not the case.
# If you encounter a typing issue, let Numba choose the types with: @nb.njit(parallel=True)
# However, note that the first run will be slower if you let Numba choose.
@nb.njit('int64[:,:,::1](bool_[:,:], float64[:,:], int64, int64, int64, int64, float64, float64, float64, float64, float64)', parallel=True)
def fasterImpl(edges, r_k, Npy, Npx, Nr, N_im, delta_px, delta_py, rho, x0, y0):
    a = np.zeros((Npy, Npx, Nr), dtype=nb.int64)
    # Find all the positions where edges[i,j]==1
    validEdgePos = np.where(edges == 1)
    for m in nb.prange(0, Npy):
        for l in range(0, Npx):
            # Iterate over the (i, j) values where edges[i,j]==1
            for i, j in zip(validEdgePos[0], validEdgePos[1]):
                d = math.sqrt(
                    math.pow((i - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                             - (l * delta_px - (Npx * delta_px / 2) + x0), 2)
                    + math.pow((j - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                               - (m * delta_py - (Npy * delta_py / 2) + y0), 2))
                min_idx = np.argmin(np.abs(d - r_k))
                rk_hat = r_k[0, min_idx]
                if np.abs(d - rk_hat) < rho:
                    a[m, l, min_idx] += 1
    return a
On my machine, with inputs described in your question (including the provided sf_img), this code is 616 times faster.
Reference time: 109.680 s
Optimized time: 0.178 s
Note that the results are exactly the same as those of the reference implementation.
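For reference, a call matching the signature above would look like this (a sketch; it assumes edges, r_k and the scalar parameters are defined as in your question, with r_k as a float64 array of shape (1, Nr)):
a = fasterImpl(edges, r_k, Npy, Npx, Nr, N_im,
               delta_px, delta_py, rho, x0, y0)
With an explicit signature, Numba compiles the function at decoration time, so every call reuses the compiled machine code - important if you execute the script ~1000 times.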
Based on your script, you seem to have little experience with NumPy in general. NumPy is optimized with SIMD instructions, and your code largely defeats that. I would advise you to review the basics of writing NumPy code, starting with this cheat sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
For instance, this code can be changed from
r_k = np.zeros((1, Nr))
for k in range(0, Nr):
    r_k[0, k] = k * delta_r + r_min

### to a simple np.arange assignment
r_k = np.zeros((1, Nr))
r_k[0, :] = np.arange(Nr) * delta_r + r_min

### or you can do everything in one line
r_k = np.expand_dims(np.arange(Nr) * delta_r + r_min, axis=0)
This code is a little awkward because you are building an np.array inside the loop on every iteration. You can probably simplify it too. Are you intentionally turning prod from an int into a length-2 np.array here?
prod = 0
for m in range(0, Npy):
    for l in range(0, Npx):
        prod = prod + (np.array([acc_x[l, 0], acc_y[m, 0]]) * a_prime[m, l])
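If prod is meant to be the a_prime-weighted sum of the (x, y) coordinate pairs, a vectorized equivalent (a sketch derived from the loop above) reduces the double loop to two dot products:
# first component: sum over l of acc_x[l] * (sum over m of a_prime[m, l])
# second component: sum over m of acc_y[m] * (sum over l of a_prime[m, l])
prod = np.array([np.dot(acc_x[:, 0], a_prime.sum(axis=0)),
                 np.dot(acc_y[:, 0], a_prime.sum(axis=1))])
points = prod / np.sum(a_prime)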
For this loop, you can separate out the dependent and independent elements step by step.
# FOR LOOP TO BE TIME OPTIMIZED
for i in range(0, np.size(true_col, 0)):  # true_col and true_rows have the same size, so either works
    for m in range(0, Npy):
        for l in range(0, Npx):
            d = math.sqrt(
                math.pow((true_col[i] - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                         - (l * delta_px - (Npx * delta_px / 2) + x0), 2)
                + math.pow((true_rows[i] - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                           - (m * delta_py - (Npy * delta_py / 2) + y0), 2))
            min_idx = np.argmin(np.abs(d - r_k))
            rk_hat = r_k[0, min_idx]
            if np.abs(d - rk_hat) < rho:
                a[m, l, min_idx] = a[m, l, min_idx] + 1
The outer loop for i in range(0, np.size(true_col, 0)) is fine.
You do not need the two inner loops to compute d, though: you can allocate index matrices so that every (m, l) pair lines up 1:1 with an array element.
for m in range(0, Npy):
    for l in range(0, Npx):
        d = math.sqrt(
            math.pow((true_col[i] - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                     - (l * delta_px - (Npx * delta_px / 2) + x0), 2)
            + math.pow((true_rows[i] - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                       - (m * delta_py - (Npy * delta_py / 2) + y0), 2))
To emulate the m and l behaviour, you can create Npy-by-Npx index matrices. Although this pattern may seem odd, NumPy inherited these tricks from the MATLAB ecosystem, because the goal of MATLAB/NumPy is to simplify code and let you spend more time fixing your logic.
## l matrix
[[0, 1, 2, 3, ..., Npx-1],
 [0, 1, 2, 3, ..., Npx-1],
 ...
 [0, 1, 2, 3, ..., Npx-1]]

## m matrix
[[0, 0, 0, ..., 0],
 [1, 1, 1, ..., 1],
 ...
 [Npy-1, Npy-1, ..., Npy-1]]

## You can create both with one command
l_mat, m_mat = np.meshgrid(np.arange(Npx), np.arange(Npy))
>>> l_mat
array([[  0,   1,   2, ..., 126, 127, 128],
       [  0,   1,   2, ..., 126, 127, 128],
       [  0,   1,   2, ..., 126, 127, 128],
       ...,
       [  0,   1,   2, ..., 126, 127, 128],
       [  0,   1,   2, ..., 126, 127, 128],
       [  0,   1,   2, ..., 126, 127, 128]])
>>> m_mat
array([[ 0,  0,  0, ...,  0,  0,  0],
       [ 1,  1,  1, ...,  1,  1,  1],
       [ 2,  2,  2, ...,  2,  2,  2],
       ...,
       [57, 57, 57, ..., 57, 57, 57],
       [58, 58, 58, ..., 58, 58, 58],
       [59, 59, 59, ..., 59, 59, 59]])
With those two matrices, you can compute d for all (m, l) at once, along the lines of
d = np.sqrt(np.power((true_col[i] - np.floor((N_im + 1) / 2)) / (N_im + 1) / 2 - (l_mat * delta_px - ...), 2) + ...)
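Completed with the same terms as your original loop, that sketch would read (untested, for a single edge point i; np.power is the NumPy spelling of pow):
half = math.floor((N_im + 1) / 2)
d = np.sqrt(np.power((true_col[i] - half) / (N_im + 1) / 2
                     - (l_mat * delta_px - Npx * delta_px / 2 + x0), 2)
            + np.power((true_rows[i] - half) / (N_im + 1) / 2
                       - (m_mat * delta_py - Npy * delta_py / 2 + y0), 2))  # shape (Npy, Npx)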
For these two lines of code, you seem to be setting up an argmin matrix.
min_idx = np.argmin(np.abs(d - r_k))
rk_hat = r_k[0, min_idx]
https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html
vfunc = np.vectorize(lambda x: np.argmin(np.abs(x - r_k)))
min_idx = vfunc(d)
vfunc2 = np.vectorize(lambda x: r_k[0, x])
rk_hat = vfunc2(min_idx)
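Note that np.vectorize is essentially a Python-level loop, so it helps readability more than speed. Since r_k is a plain array, the same result can be obtained with broadcasting (a sketch, assuming d has shape (Npy, Npx)):
min_idx = np.argmin(np.abs(d[..., None] - r_k[0]), axis=-1)  # shape (Npy, Npx)
rk_hat = r_k[0, min_idx]                                     # shape (Npy, Npx)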
For the last two lines, d and rk_hat should be Npy-by-Npx matrices. You can use matrix slicing or np.where to create a matrix mask.
if np.abs(d - rk_hat) < rho:
points = np.where( np.abs(d-rk_hat) < rho )
https://numpy.org/doc/stable/reference/generated/numpy.where.html
I gave up on vectorizing the last line; it probably doesn't matter if you keep it in a loop. The original
a[m, l, min_idx] = a[m, l, min_idx] + 1
becomes (note np.where returns a tuple of index arrays, so zip them to iterate over points):
for xy in zip(*points):
    a[xy[0], xy[1], min_idx[xy[0], xy[1]]] += 1
New answer, which optimizes the nested loop:
....
for i in range(0, np.size(true_col, 0)):  # true_col and true_rows have the same size, so either works
    for m in range(0, Npy):
        for l in range(0, Npx):
There is a substantial improvement in processing time: for true_col and true_rows of length 2500, it takes about 3 seconds on my machine. It is wrapped in a function for testing purposes.
def new():
    a = np.zeros((Npy, Npx, Nr), dtype=int)
    # tease out and separate some of the terms
    # used in the calculation of the distance d
    bb = N_im + 1
    cc = Npx * delta_px / 2
    dd = Npy * delta_py / 2
    l, m = np.meshgrid(np.arange(Npx), np.arange(Npy))
    q = (true_col - math.floor(bb / 2)) / bb / 2   # shape (len(true_col),)
    r = l * delta_px - cc + x0                     # shape (Npy, Npx)
    s = np.square(q - r[..., None])                # shape (Npy, Npx, len(true_col))
    # - the last dimension is the outer loop of the original
    t = (true_rows - math.floor(bb / 2)) / bb / 2  # shape (len(true_rows),)
    u = m * delta_py - dd + y0                     # shape (Npy, Npx) = (60, 129)
    v = np.square(t - u[..., None])                # shape (Npy, Npx, len(true_col))
    d = np.sqrt(s + v)                             # shape (Npy, Npx, len(true_col))
    e1 = np.abs(d[..., None] - r_k.squeeze())      # shape (Npy, Npx, len(true_col), Nr)
    min_idx = np.argmin(e1, -1)                    # shape (Npy, Npx, len(true_col))
    rk_hat = r_k[0, min_idx]                       # shape (Npy, Npx, len(true_col))
    zz = np.abs(d - rk_hat)                        # shape (Npy, Npx, len(true_col))
    condition = zz < rho                           # shape (Npy, Npx, len(true_col))
    # seemingly unavoidable for loop needed to perform
    # a bincount along the last dimension (filtered)
    # while retaining the 2d position info;
    # this will be pretty fast though -
    # nothing really going on other than indexing and assignment
    for iii in range(Npy * Npx):
        row, col = divmod(iii, Npx)
        mask = condition[row, col]
        one_d = min_idx[row, col]
        counts = np.bincount(one_d[mask])
        a[row, col, :counts.size] = counts
    return a
I could not figure out how to use NumPy methods to get rid of the final loop, which filters for less than rho AND does a bincount - if I figure it out, I will update.
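One candidate (an untested sketch) is np.add.at, which accumulates repeated indices; since condition and min_idx share the shape (Npy, Npx, n), the filtered increments can be scattered straight into a:
rows, cols, pts = np.nonzero(condition)  # positions where |d - rk_hat| < rho
np.add.at(a, (rows, cols, min_idx[rows, cols, pts]), 1)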
Data from your question and comments:
import math
import numpy as np
np.random.seed(5)
n_ = 2500
true_col = np.random.randint(0,256,n_)
true_rows = np.random.randint(0,256,n_)
N_im = 256
x0 = 0
y0 = -0.25
Npx = 129
Npy = 60
# Npx = 8
# Npy = 4
delta_py = 0.025
delta_px = 0.031
Nr = 9
delta_r = 0.5
rho = 0.063
epsilon = 0.75
r_min = 0.5
r_k = np.arange(Nr) * delta_r + r_min
r_k = r_k.reshape(1,Nr)
Your original nested loops in a function - with some diagnostic additions.
def original(writer=None):
    '''writer should be a csv.Writer object.'''
    a = np.zeros((Npy, Npx, Nr), dtype=int)
    for i in range(0, np.size(true_col, 0)):  # true_col and true_rows have the same size, so either works
        for m in range(0, Npy):
            for l in range(0, Npx):
                d = math.sqrt(math.pow((true_col[i] - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                                       - (l * delta_px - (Npx * delta_px / 2) + x0), 2)
                              + math.pow((true_rows[i] - math.floor((N_im + 1) / 2)) / (N_im + 1) / 2
                                         - (m * delta_py - (Npy * delta_py / 2) + y0), 2))
                min_idx = np.argmin(np.abs(d - r_k))  # scalar
                rk_hat = r_k[0, min_idx]              # scalar
                if np.abs(d - rk_hat) < rho:
                    # if (m, l) == (0, 0):
                    if writer:
                        writer.writerow([i, m, l, d, min_idx, rk_hat, a[m, l, min_idx] + 1])
                    # print(f'condition satisfied: i:{i} a[{m},{l},{min_idx}] = {a[m, l, min_idx]} + 1')
                    a[m, l, min_idx] = a[m, l, min_idx] + 1
    return a
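A quick equivalence check between the two implementations (using the data above) might be:
a_ref = original()  # slow reference
a_new = new()       # vectorized version
print(np.array_equal(a_ref, a_new))  # should print True if the vectorization is faithful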
I have come across a very strange problem where I do a lot of math and the result is inf or nan when my input is of type <class 'numpy.int64'>, but I get the correct (analytically checked) results when my input is of type <class 'int'>. The only library functions I use are np.math.factorial(), np.sum() and np.array(). I also use a generator object to sum over series, and the Boltzmann constant from scipy.constants.
My question is essentially this: are there any known cases where np.int64 objects behave very differently from int objects?
When I run with np.int64 input, I get the RuntimeWarnings overflow encountered in long_scalars, divide by zero encountered in double_scalars and invalid value encountered in double_scalars. However, the largest number I plug into the factorial function is 36, and I don't get these warnings when I use int input.
Below is code that reproduces the behaviour; I was unable to pin down exactly where it comes from.
import numpy as np
import scipy.constants as const
# Some representative numbers
sigma = np.array([1, 2])
sigma12 = 1.5
mole_weights = np.array([10,15])
T = 100
M1, M2 = mole_weights/np.sum(mole_weights)
m0 = np.sum(mole_weights)
fac = np.math.factorial
def summation(start, stop, func, args=None):
    # sum func(i) for all ints i from start up to and including stop,
    # passing 'args' as additional arguments
    if args is not None:
        return sum(func(i, args) for i in range(start, stop + 1))
    else:
        return sum(func(i) for i in range(start, stop + 1))
def delta(i, j):
    # Kronecker delta
    if i == j:
        return 1
    else:
        return 0
def w(l, r):
    # l, r are ints; returns a float
    return 0.25 * (2 - ((1 / (l + 1)) * (1 + (-1) ** l))) * np.math.factorial(r + 1)
def omega(ij, l, r):
    # l, r are ints, ij is an ID; returns a float
    if ij in (1, 2):
        return sigma[ij - 1] ** 2 * np.sqrt(
            (np.pi * const.Boltzmann * T) / mole_weights[ij - 1]) * w(l, r)
    elif ij in (12, 21):
        return 0.5 * sigma12 ** 2 * np.sqrt(
            2 * np.pi * const.Boltzmann * T / (m0 * M1 * M2)) * w(l, r)
    else:
        raise ValueError('(' + str(ij) + ', ' + str(l) + ', ' + str(r) + ') are non-valid arguments for omega.')
def A_prime(p, q, r, l):
    '''
    p, q, r, l are ints. Returns a float.
    '''
    F = (M1 ** 2 + M2 ** 2) / (2 * M1 * M2)
    G = (M1 - M2) / M2

    def inner(w, args):
        i, k = args
        return ((8 ** i * fac(p + q - 2 * i - w) * (-1) ** (r + i) * fac(r + 1)
                 * fac(2 * (p + q + 2 - i - w)) * 2 ** (2 * r) * F ** (i - k) * G ** w) /
                (fac(p - i - w) * fac(q - i - w) * fac(r - i) * fac(p + q + 1 - i - r - w)
                 * fac(2 * r + 2) * fac(p + q + 2 - i - w)
                 * 4 ** (p + q + 1) * fac(k) * fac(i - k) * fac(w))) * (
                2 ** (2 * w - 1) * M1 ** i * M2 ** (p + q - i - w)) * 2 * (
                M1 * (p + q + 1 - i - r - w) * delta(k, l) - M2 * (r - i) * delta(k, l - 1))

    def sum_w(k, i):
        return summation(0, min(p, q, p + q + 1 - r) - i, inner, args=(i, k))

    def sum_k(i):
        return summation(l - 1, min(l, i), sum_w, args=i)

    return summation(l - 1, min(p, q, r, p + q + 1 - r), sum_k)
def H_i(p, q):
    '''
    p, q are ints. Returns a float.
    '''
    def inner(r, l):
        return A_prime(p, q, r, l) * omega(12, l, r)

    def sum_r(l):
        return summation(l, p + q + 2 - l, inner, args=l)

    return 8 * summation(1, min(p, q) + 1, sum_r)
p, q = np.int64(8), np.int64(8)
print(H_i(p,q)) #nan
print(H_i(int(p) ,int(q))) #1.3480582058153066e-08
NumPy's int64 is a 64-bit integer, meaning it consists of 64 bits that are each either 0 or 1. Thus the smallest representable value is -2**63 and the biggest is 2**63 - 1.
Python's int is essentially unlimited in length, so it can represent any value; it is equivalent to a BigInteger in Java. Internally, CPython stores it as an array of smaller fixed-size digits that together represent a single large number.
What you have here is a classic integer overflow. You mention that you "only" plug 36 into the factorial function, but the factorial grows very fast, and 36! = 3.7e41 > 9.2e18 = 2**63 - 1, so you get a number bigger than you can represent in an int64!
Since int64s are also called longs, this is exactly what the warning overflow encountered in long_scalars is telling you!
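A minimal demonstration of the difference (under NumPy 1.x, where scalar overflow produces a RuntimeWarning rather than an exception):
import numpy as np
a = np.int64(10) ** 18  # still fits: the int64 maximum is about 9.2e18
print(a * 10)           # wraps around -> RuntimeWarning: overflow encountered in long_scalars
print(int(a) * 10)      # 10000000000000000000, exact: Python ints have arbitrary precision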