Below are some of the functions I wrote for squared-distance calculation in 3-D toroidal (periodic) geometry for a collection of particles in that 3-D space:
import itertools
import time
import numpy as np
import scipy
import numba
from numba import njit
@njit(cache=True)
def get_dr2(i=np.array([]), j=np.array([]), cellsize=np.array([])):
    k = np.zeros(3, dtype=np.float64)
    dr2 = 0.0
    for idx in numba.prange(cellsize.shape[0]):
        k[idx] = (j[idx]-i[idx]) - cellsize[idx]*np.rint((j[idx]-i[idx])/cellsize[idx])
        dr2 += k[idx]**2
    return dr2
@numba.guvectorize(["void(float64[:],float64[:],float64[:],float64[:])"],
                   "(m),(m),(m)->()", nopython=True, cache=True)
def get_dr2_vec(i, j, cellsize, dr2):
    dr2[:] = 0.0
    k = np.zeros(3, dtype=np.float64)
    for idx in numba.prange(cellsize.shape[0]):
        k[idx] = (j[idx]-i[idx]) - cellsize[idx]*np.rint((j[idx]-i[idx])/cellsize[idx])
        dr2[0] += k[idx]**2
@njit(cache=True)
def pair_vec_gen(pIList=np.array([[]]), pJList=np.array([[]])):
    assert pIList.shape[1] == pJList.shape[1]
    vecI = np.zeros((pIList.shape[0]*pJList.shape[0], pIList.shape[1]))
    vecJ = np.zeros_like(vecI)
    for i in numba.prange(pIList.shape[0]):
        for j in numba.prange(pJList.shape[0]):
            for k in numba.prange(pIList.shape[1]):
                vecI[j+pJList.shape[0]*i][k] = pIList[i][k]
                vecJ[j+pJList.shape[0]*i][k] = pJList[j][k]
    return vecI, vecJ
@njit(cache=True)
def pair_vec_dist(pIList=np.array([[]]), pJList=np.array([[]]), cellsize=np.array([])):
    assert pIList.shape[1] == pJList.shape[1]
    vecI = np.zeros((pIList.shape[0]*pJList.shape[0], pIList.shape[1]))
    vecJ = np.zeros_like(vecI)
    r2List = np.zeros(vecI.shape[0])
    for i in numba.prange(pIList.shape[0]):
        for j in numba.prange(pJList.shape[0]):
            for k in numba.prange(pIList.shape[1]):
                vecI[j+pJList.shape[0]*i][k] = pIList[i][k]
                vecJ[j+pJList.shape[0]*i][k] = pJList[j][k]
    r2List = get_dr2_vec2(vecI, vecJ, cellsize)
    return r2List
@njit(cache=True)
def get_dr2_vec2(i=np.array([[]]), j=np.array([[]]), cellsize=np.array([])):
    dr2 = np.zeros(i.shape[0], dtype=np.float64)
    k = np.zeros(i.shape[1], dtype=np.float64)
    for m in numba.prange(i.shape[0]):
        for n in numba.prange(i.shape[1]):
            k[n] = (j[m,n]-i[m,n]) - cellsize[n]*np.rint((j[m,n]-i[m,n])/cellsize[n])
            dr2[m] += k[n]**2
    return dr2
def pair_dist_calculator_cdist(pIList=np.array([[]]), pJList=np.array([[]]), cellsize=np.array([])):
    assert pIList.shape[1] == pJList.shape[1]
    r2List = (scipy.spatial.distance.cdist(pIList, pJList, metric=get_dr2_wrapper(cellsize=cellsize))).flatten()
    return np.array(r2List).flatten()

def get_dr2_wrapper(cellsize=np.array([])):
    return lambda u, v: get_dr2(u, v, cellsize)
frames = 50
timedata = np.zeros((5, frames), dtype=np.float64)
N, dim = 100, 3  # 100 particles in 3D
cellsize = np.array([26.4, 19.4, 102.4])
for i in range(frames):
    print("\rIter {}".format(i), end='')
    vec = np.random.random((N, dim))
    rList1 = []; rList2 = []; rList3 = []; rList4 = []; rList5 = []
    # method 1
    #print("method 1")
    start = time.perf_counter()
    for (pI, pJ) in itertools.product(vec, vec):
        rList1.append(get_dr2(pI, pJ, cellsize))
    end = time.perf_counter()
    timedata[0, i] = (end-start)
    # method 2
    #print("method 2")
    pIvec = []; pJvec = []; rList2 = []
    start = time.perf_counter()
    for (pI, pJ) in itertools.product(vec, vec):
        pIvec.append(pI)
        pJvec.append(pJ)
    rList2 = get_dr2_vec(np.array(pIvec), np.array(pJvec), cellsize)
    end = time.perf_counter()
    timedata[1, i] = (end-start)
    # method 3
    #print("method 3")
    start = time.perf_counter()
    rList3 = get_dr2_vec(*pair_vec_gen(vec, vec), cellsize)
    end = time.perf_counter()
    timedata[2, i] = (end-start)
    # method 4
    #print("method 4")
    start = time.perf_counter()
    rList4 = pair_vec_dist(vec, vec, cellsize)
    end = time.perf_counter()
    timedata[3, i] = (end-start)
    # method 5
    #print("method 5")
    #start = time.perf_counter()
    #rList5 = pair_dist_calculator_cdist(np.array(pIvec), np.array(pJvec), cellsize)
    #end = time.perf_counter()
    #timedata[4, i] = (end-start)
    assert (rList1 == rList2).all()
    assert (rList2 == rList3).all()
    assert (rList3 == rList4).all()
    #assert rList4 == rList5
print("\n")
for i in range(4):
    print("Method {} Average time {:.3g}s \u00B1 {:.3g}s".format(i+1, np.mean(timedata[i, 1:]), np.std(timedata[i, 1:])))
exit()
The essential idea is that at a particular time you have a snapshot, or frame, which contains the positions of the particles. To calculate all the distances between the particles, we can use the following approaches:
1. Calculate the distance between points iteratively in pure Python, passing the positions of the two particles one by one to a Numba function.
2. Create an iteration list (in pure Python) beforehand and pass the whole list to a Numba @guvectorize function.
3. Do (2), but with all steps in Numba.
4. Integrate all steps of (3) into one simple Numba function.
5. (optional) Pass the positions to scipy.spatial.distance.cdist with the distance function as the distance metric.
For 50 frames containing 100 particles we have the respective times (frames, N = 50, 100):
Method 1 Average time 0.017s ± 0.00555s
Method 2 Average time 0.0181s ± 0.00573s
Method 3 Average time 0.00182s ± 0.000944s
Method 4 Average time 0.000485s ± 0.000348s
For 50 frames containing 1000 particles we have the respective times (frames, N = 50, 1000):
Method 1 Average time 2.11s ± 0.977s
Method 2 Average time 2.42s ± 0.859s
Method 3 Average time 0.349s ± 0.12s
Method 4 Average time 0.0694s ± 0.022s
and for 1000 frames containing 100 particles we have the respective times (frames, N = 1000, 100):
Method 1 Average time 0.0244s ± 0.0166s
Method 2 Average time 0.0288s ± 0.0254s
Method 3 Average time 0.00258s ± 0.00231s
Method 4 Average time 0.000636s ± 0.00086s
(All the times shown above are after removing the contribution from the first iteration.)
Method 5 simply fails due to memory requirements and is much slower than any other method.
Given the above dataset, I tend to prefer Method 4, though I am a bit concerned about the average-time increase when I increase frames from 50 to 1000. Are there any further optimizations I can do in these implementations, or does someone have ideas for a much faster and more memory-conscious implementation? Any suggestions are welcome.
Update
Based on Jerome's answer the modified function is now:
@njit(cache=True, parallel=True)
def pair_vec_dist(pIList=np.array([[]]), pJList=np.array([[]]), cellsize=np.array([])):
    assert pIList.shape[1] == pJList.shape[1]
    assert cellsize.size == 3
    dr2 = np.zeros(pIList.shape[0]*pJList.shape[0], dtype=np.float64)
    inv_cellsize = 1.0 / cellsize
    for i in numba.prange(pIList.shape[0]):
        for j in range(pJList.shape[0]):
            offset = j + pJList.shape[0] * i
            xdist = pJList[j,0]-pIList[i,0]
            ydist = pJList[j,1]-pIList[i,1]
            zdist = pJList[j,2]-pIList[i,2]
            xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
            yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
            zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
            dr2[offset] = xk**2+yk**2+zk**2
    return dr2
As Jerome pointed out, a very simple optimization could be running the loops over just the "lower half of the symmetric matrix" that the distance calculation creates, though in a realistic situation I might have vector lists pI and pJ where pI is a subset of pJ, which complicates this situation. Either I have to create two separate functions and control them via a wrapper function, or somehow manage that in one single function. If there are any suggestions on how to do so, that would be really helpful.
Update 2
I should clarify the problem further. In this code I am trying to calculate the distance between all points in a frame/snapshot, which is used further for pair distance distribution analysis. But in some cases we might want to focus on a subset of coordinates in a frame and calculate the distribution from their perspective. In such a case we select this subset smallVec from the pool of all coordinates vec (such that smallVec + restOfVec = vec) and calculate pair_vec_dist(smallVec,vec) instead of pair_vec_dist(vec,vec). For this calculation one can, for example, concatenate the results of pair_vec_dist(smallVec,smallVec) and pair_vec_dist(smallVec,restOfVec).
Based on the discussion with Jerome, I modified my function as:
@njit(cache=True, parallel=True)
def pair_vec_dist_cmb(pIList=np.array([[]]), pJList=np.array([[]]), cellsize=np.array([]), is_sq=True, is_nonsq=True):
    assert pIList.shape[1] == pJList.shape[1]
    assert cellsize.size == 3
    dr2_1 = 0; dr2_2 = 0
    dr2_1 = int(0.5*pIList.shape[0]*(pIList.shape[0]+1))
    if is_nonsq:
        dr2_2 = int(pIList.shape[0]*pJList.shape[0])
    dr2 = np.zeros((dr2_1+dr2_2), dtype=np.float64)
    inv_cellsize = 1.0 / cellsize
    for j in numba.prange(0, pIList.shape[0], 1):
        if is_sq:
            for i in range(j, pIList.shape[0], 1):
                index_1 = int(0.5*i*(i+1)+j)
                xdist = pIList[j,0]-pIList[i,0]
                ydist = pIList[j,1]-pIList[i,1]
                zdist = pIList[j,2]-pIList[i,2]
                xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
                yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
                zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
                dr2[index_1] = xk**2+yk**2+zk**2
        if is_nonsq:
            for j in range(pJList.shape[0]):
                index_2 = dr2_1 + j + pJList.shape[0] * i
                xdist = pJList[j,0]-pIList[i,0]
                ydist = pJList[j,1]-pIList[i,1]
                zdist = pJList[j,2]-pIList[i,2]
                xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
                yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
                zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
                dr2[index_2] = xk**2+yk**2+zk**2
    return dr2
Here pI (size (N,3)) is a subset of pJ (size (M,3)). In this code we subdivide the calculation into two sections: the pair distances within pI, which are symmetric, so we only need the lower triangular matrix, i.e. N(N-1)/2 unique values; and the pI-pJ distances, where we have to go through N(M-N) unique values. To further optimize the function, I have made two additional changes:
Combining the outer loop for both sections. In order to do so I am now iterating over the upper triangular matrix, which translates to N(N+1)/2 values (a quick check of this indexing is sketched after this list). One could also add an if check to see whether the coordinates are identical, though I am not sure how much time that would save.
To avoid appending the results from the two sections together, I am predefining the returned array and partitioning it by length.
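A quick standalone check of the triangular indexing used above (this small snippet is just an illustration, not part of the function): the expression i*(i+1)//2 + j with j <= i hits every slot of the N(N+1)/2-long section exactly once.
N = 5
idx = sorted(i*(i+1)//2 + j for i in range(N) for j in range(i+1))
assert idx == list(range(N*(N+1)//2))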
A further assumption I have made is that the time needed for partitioning vec into smallVec and restOfVec is negligible with respect to the pair distance calculation. Obviously, if that is wrong, one might need to rethink the optimization pathway.
The resultant function is 1.5 times faster than the previous function. I am looking to further optimize it, but I am very new to loop tiling and other advanced optimizations, so if you have any suggestions, please let me know.
Update 3
So I figured that I should focus on making the function more optimized in terms of serial calculation, as I can simply use Dask or multiprocessing to work on multiple sections of an input collection of frames. So the reference function is now:
@njit(cache=True, parallel=False, fastmath=True, boundscheck=False, nogil=True)
def pair_vec_dist_test(pIList, pJList, cellsize):
    _I = pIList.shape[0]
    _J = pJList.shape[0]
    dr2 = np.empty(int(_I*_J), dtype=np.float32)
    inv_cellsize = 1.0 / cellsize
    for i in numba.prange(pIList.shape[0]):
        for j in range(pJList.shape[0]):
            index = j + pJList.shape[0] * i
            xdist = pJList[j,0]-pIList[i,0]
            ydist = pJList[j,1]-pIList[i,1]
            zdist = pJList[j,2]-pIList[i,2]
            xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
            yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
            zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
            dr2[index] = xk**2+yk**2+zk**2
    return dr2
Going back to the main problem while ignoring the symmetry aspect, I tried to further optimize the distance function as:
@njit(cache=True, parallel=False, fastmath=True, boundscheck=False, nogil=True)
def pair_vec_dist_test_v2(pIList, pJList, cellsize):
    _I = pIList.shape[0]
    _J = pJList.shape[0]
    dr2 = np.empty(int(_I*_J), dtype=np.float32)
    inv_cellsize = 1.0 / cellsize
    tile = 32
    for ii in range(0, _I, tile):
        for jj in range(0, _J, tile):
            for i in range(ii, min(_I, ii+tile)):
                for j in range(jj, min(_J, jj+tile)):
                    index = j + _J * i
                    xdist = pJList[j,0]-pIList[i,0]
                    ydist = pJList[j,1]-pIList[i,1]
                    zdist = pJList[j,2]-pIList[i,2]
                    xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
                    yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
                    zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
                    dr2[index] = xk**2+yk**2+zk**2
    return dr2
which essentially tiles up the two vector arrays. However, I couldn't get any speedup, as the execution time for both functions is roughly the same. I also thought about working with the transpose of the vector arrays, but I couldn't figure out how to align them in a loop when the vector lengths are not a multiple of the tile length (a padding sketch follows below). Does anyone have any further suggestions or ideas on how to proceed?
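One way to handle lengths that are not a multiple of the tile size would be to pad the arrays up front. This is only a hypothetical helper (pad_to_tile is not part of the code above), sketching what I have in mind:
import numpy as np

def pad_to_tile(p, tile=32):
    # round the particle count up to a multiple of `tile`, zero-filling the
    # tail, and return the transposed (3, n_padded) layout
    n = p.shape[0]
    n_pad = (-n) % tile
    padded = np.zeros((n + n_pad, 3), dtype=p.dtype)
    padded[:n] = p
    return padded.T
The padded tail would then have to be masked out of the final dr2, which is what the failed v3 attempt below tries to do.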
Edit: Another failed trial
@njit(cache=True, parallel=False, fastmath=True, boundscheck=False, nogil=True)
def pair_vec_dist_test_v3(pIList, pJList, cellsize):
    inv_cellsize = 1.0 / cellsize
    tile = 32
    _I = pIList.shape[0]
    _J = pJList.shape[0]
    vecI = np.empty((_I+2*tile, 3), dtype=np.float64)  # for rolling effect
    vecJ = np.empty((_J+2*tile, 3), dtype=np.float64)  # for rolling effect
    vecI_mask = np.ones((_I+2*tile), dtype=np.uint8)
    vecJ_mask = np.ones((_J+2*tile), dtype=np.uint8)
    vecI[:_I] = pIList
    vecJ[:_J] = pJList
    vecI[_I:] = 0.
    vecJ[_J:] = 0.
    vecI_mask[_I:] = 0
    vecJ_mask[_J:] = 0
    #print(vecI,vecJ)
    ILim = _I+(tile-_I%tile)
    JLim = _J+(tile-_J%tile)
    dr2 = np.empty((ILim*JLim), dtype=np.float64)
    vecI = vecI.T
    vecJ = vecJ.T
    for ii in range(ILim):
        for jj in range(0, JLim, tile):
            index = jj + JLim*ii
            #print(ii,jj,index)
            mask = np.multiply(vecJ_mask[jj:jj+tile], vecI_mask[ii:ii+tile])
            xdist = vecJ[0, jj:jj+tile]-vecI[0, ii:ii+tile]
            ydist = vecJ[1, jj:jj+tile]-vecI[1, ii:ii+tile]
            zdist = vecJ[2, jj:jj+tile]-vecI[2, ii:ii+tile]
            xk = xdist-cellsize[0]*np.rint(xdist*inv_cellsize[0])
            yk = ydist-cellsize[1]*np.rint(ydist*inv_cellsize[1])
            zk = zdist-cellsize[2]*np.rint(zdist*inv_cellsize[2])
            arr = xk**2+yk**2+zk**2
            dr2[index:index+tile] = np.multiply(arr, mask)
    return dr2
First things first: there are race conditions in your current code. This basically means the produced results can be corrupted (and it also impacts performance). In practice, this causes undefined behaviour. For example, k[n] is accessed by multiple threads in get_dr2_vec2. One needs to be very careful when using prange. In this case, the race condition can be removed by simply not using a temporary array (which is not really useful) and not using prange in the inner loop, since dr2[m] is updated there (updating it from multiple threads also causes a race condition).
Moreover, prange is often not practically useful when parallel=True is not set in the Numba decorator. Indeed, the current functions are not parallel since this flag is missing.
Finally, you can merge the functions pair_vec_dist and get_dr2_vec2, and their internal loops, so as to avoid creating and filling large temporary arrays. Indeed, RAM throughput is pretty small nowadays compared to the computing power of modern processors, and this gap has been getting bigger over the last two decades. This effect is called the "memory wall" and it is not expected to disappear any time soon. Code that is less memory-bound generally tends to be faster and scale better.
Here is the resulting code:
@njit(cache=True, parallel=True)
def pair_vec_dist(pIList=np.array([[]]), pJList=np.array([[]]), cellsize=np.array([])):
    assert pIList.shape[1] == pJList.shape[1]
    dr2 = np.zeros(pIList.shape[0]*pJList.shape[0], dtype=np.float64)
    inv_cellsize = 1.0 / cellsize
    for i in numba.prange(pIList.shape[0]):
        for j in range(pJList.shape[0]):
            offset = j + pJList.shape[0] * i
            for k in range(pIList.shape[1]):
                tmp = pJList[j,k]-pIList[i,k]
                tmp = tmp-cellsize[k]*np.rint(tmp*inv_cellsize[k])
                dr2[offset] += tmp**2
    return dr2
It is 11 times faster with frames=50 and N=1000 on my 6-core machine (i5-9600KF).
The code can be optimized further. For example, dr2 is a flattened symmetric square matrix, so only the upper-right part needs to be computed and the bottom-left part can just be copied. Note that to do this efficiently in parallel, the work needs to be balanced between the threads (otherwise, the slowest thread becomes the bottleneck). One can also generate an optimized version of the function that only supports cellsize.size == 3. Moreover, one can use register tiling so as to make the code more cache-friendly. Finally, one can transpose the input so the layout is more SIMD-friendly (this certainly requires the loop to be manually unrolled and the register tiling optimization to be done first).
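As a rough sketch of the symmetry point only (not the full set of optimizations above; the function name pair_vec_dist_sym and the (n, n) output layout are illustrative choices, and this naive row split is not load-balanced), one could compute each unordered pair once and mirror it:
import numpy as np
import numba
from numba import njit

@njit(cache=True, parallel=True)
def pair_vec_dist_sym(pList, cellsize):
    # assumes pIList and pJList are the same (n, 3) array
    n = pList.shape[0]
    dr2 = np.zeros((n, n), dtype=np.float64)
    inv_cellsize = 1.0 / cellsize
    for i in numba.prange(n):
        for j in range(i + 1, n):
            xdist = pList[j, 0] - pList[i, 0]
            ydist = pList[j, 1] - pList[i, 1]
            zdist = pList[j, 2] - pList[i, 2]
            xk = xdist - cellsize[0] * np.rint(xdist * inv_cellsize[0])
            yk = ydist - cellsize[1] * np.rint(ydist * inv_cellsize[1])
            zk = zdist - cellsize[2] * np.rint(zdist * inv_cellsize[2])
            d = xk**2 + yk**2 + zk**2
            dr2[i, j] = d
            dr2[j, i] = d
    return dr2
Note that row i only has n-i-1 pairs, so the rows would still need to be distributed more evenly across threads to get the full benefit described above.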
Related
I'm looking to calculate the similarity between two routes, where a route is defined as an origin-destination pair. I've looked around and couldn't find a similarity measure which takes both direction and length into account, so I invented my own (but to be honest, I'd prefer doing something standard).
Here's the code to calculate the similarity score for this metric:
import mpu

def calculate_similarity(o1_lat, o1_long, d1_lat, d1_long, o2_lat, o2_long, d2_lat, d2_long):
    l1 = mpu.haversine_distance((o1_lat, o1_long), (d1_lat, d1_long))
    l2 = mpu.haversine_distance((o2_lat, o2_long), (d2_lat, d2_long))
    od = mpu.haversine_distance((o1_lat, o1_long), (o2_lat, o2_long))
    dd = mpu.haversine_distance((d1_lat, d1_long), (d2_lat, d2_long))
    sim = (l1+l2)/(l1+l2+od+dd)
    return sim
My main problem now is that this is too slow - I have 200k origin-destination pairs that all need to be compared to each other, and on my current compute that'll take 45 days.
Does anyone know of a different approach that I could use?
import mpu
import time
from itertools import combinations
import numpy as np
from numpy import sin, cos, sqrt, arctan

# initial function
def calculate_similarity(o1_lat, o1_long, d1_lat, d1_long, o2_lat, o2_long, d2_lat, d2_long):
    l1 = mpu.haversine_distance((o1_lat, o1_long), (d1_lat, d1_long))
    l2 = mpu.haversine_distance((o2_lat, o2_long), (d2_lat, d2_long))
    od = mpu.haversine_distance((o1_lat, o1_long), (o2_lat, o2_long))
    dd = mpu.haversine_distance((d1_lat, d1_long), (d2_lat, d2_long))
    sim = (l1+l2)/(l1+l2+od+dd)
    return sim

# get n random origin, destination coordinates
def get_random_coordinates(n_coordinates):
    MIN_LATITUDE = -90
    MAX_LATITUDE = 90
    MIN_LONGITUDE = -180
    MAX_LONGITUDE = 180
    coordinates = []
    for i in range(n_coordinates):
        o_lat = np.random.uniform(MIN_LATITUDE, MAX_LATITUDE)
        o_long = np.random.uniform(MIN_LONGITUDE, MAX_LONGITUDE)
        d_lat = np.random.uniform(MIN_LATITUDE, MAX_LATITUDE)
        d_long = np.random.uniform(MIN_LONGITUDE, MAX_LONGITUDE)
        coordinates.append((o_lat, o_long, d_lat, d_long))
    return coordinates

# compute similarity matrix (iterative approach)
def calculate_similarity_matrix(coordinates, similarity_function):
    n = len(coordinates)
    similarity_matrix = np.diag(np.ones(n))
    for i, j in combinations(range(n), 2):
        o1_lat, o1_long, d1_lat, d1_long = coordinates[i]
        o2_lat, o2_long, d2_lat, d2_long = coordinates[j]
        similarity = similarity_function(
            o1_lat, o1_long, d1_lat, d1_long, o2_lat, o2_long, d2_lat, d2_long
        )
        similarity_matrix[i, j] = similarity
        similarity_matrix[j, i] = similarity
    return similarity_matrix

# faster approach
def csm(coordinates):
    n = len(coordinates)
    rc = np.radians(coordinates)
    c = np.array(list(combinations(np.arange(n), 2)))
    a = sin((rc[:,2]-rc[:,0])/2)**2+cos(rc[:,0])*cos(rc[:,2])*sin((rc[:,3]-rc[:,1])/2)**2
    d = arctan(sqrt(a/(1-a)))
    l1_p_l2 = d[c[:,0]]+d[c[:,1]]
    aod = sin((rc[c[:,1],0]-rc[c[:,0],0])/2)**2+cos(rc[c[:,0],0])*cos(rc[c[:,1],0])*sin((rc[c[:,1],1]-rc[c[:,0],1])/2)**2
    add = sin((rc[c[:,1],2]-rc[c[:,0],2])/2)**2+cos(rc[c[:,0],2])*cos(rc[c[:,1],2])*sin((rc[c[:,1],3]-rc[c[:,0],3])/2)**2
    tri = np.diag(np.full(n, 1/2))
    tri[np.triu_indices(n, 1)] = l1_p_l2/(l1_p_l2+arctan(sqrt(aod/(1-aod)))+arctan(sqrt(add/(1-add))))
    return tri + tri.T
# simulate coordinates
n_coordinates = 1000
np.random.seed(69)
coordinates = get_random_coordinates(n_coordinates)
# number of pairs
target = 200000
# time provided function
start = time.time()
similarity_matrix = calculate_similarity_matrix(coordinates, calculate_similarity)
end = time.time()
runtime = end - start
# time faster approach
start2 = time.time()
similarity_matrix2 = csm(coordinates)
end2 = time.time()
runtime2 = end2 - start2
# ensure correctness
assert np.max(np.abs(similarity_matrix - similarity_matrix2)) < 1e-8
# calculate estimated speedup & runtime
speedup = runtime / runtime2
n_sim = n_coordinates * (n_coordinates - 1) / 2
n_target = target * (target - 1) / 2
ratio = n_target / n_sim
runtime = runtime2 * ratio / 3600
print(f'Estimated speedup: {speedup:.2f}x')
print(f'Estimated runtime: {runtime:.2f} hours')
prints
Estimated speedup: 31.31x
Estimated runtime: 4.70 hours
This code employs a wide range of tricks to speed up the runtime; above all, it vectorizes the computation of the similarity measures using numpy. Some of those tricks are:
Avoid unnecessary recomputations of the route distances: if calculate_similarity were called on each pair of routes out of a total of n routes, each route distance would be computed n-1 times. By effectively precomputing the distances only once, one can thus save n-2 distance computations per route.
Avoid unnecessarily converting the distances to kilometers by removing the multiplication with the earth's diameter. This works due to the way SimScore is defined, i.e. due to the fact that multiplying each additive component l1, l2, od, dd by a constant does not affect the ratio (l1 + l2) / (l1 + l2 + od + dd).
Avoid unnecessary conversions of latitude and longitude degrees to radians by computing them only once up front.
Avoid performing checks for coordinate validity. Note that this presupposes all coordinates are within the valid range.
Simplify the math to save on expensive operations, e.g. rewriting atan2(sqrt(a), sqrt(1 - a)) as atan(sqrt(a / (1 - a))) (saving square root operations) or sin(x) * sin(x) as sin(x) ** 2 (saving sine operations); a quick numerical check of the first rewrite follows this list.
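A standalone sanity check of that rewrite (for 0 < a < 1, arctan2(sqrt(a), sqrt(1 - a)) and arctan(sqrt(a / (1 - a))) agree):
import numpy as np
a = np.random.uniform(0.0, 1.0, 1000)
assert np.allclose(np.arctan2(np.sqrt(a), np.sqrt(1.0 - a)),
                   np.arctan(np.sqrt(a / (1.0 - a))))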
To estimate the speedup and runtime, n=1000 random routes were simulated and both approaches were timed. To verify correctness, the code asserts that the infinity norm of the difference between the similarity matrices is essentially zero. Note that these estimates should only serve as rough proxies, as the chosen timing method is not very robust, things will likely look a bit different for larger n (the estimated runtime is simply a linear extrapolation), and these numbers were achieved on a GPU via Google Colab, which additionally affects the estimation bias.
I believe there are still optimizations that could be made, especially: if highly accurate similarity measures are not required for the use case at hand and routes (i.e. l1, l2) and/or distances between start- and endpoints (i.e. od, dd) are small enough, then using a cheaper-to-compute distance metric may approximate the haversine distance well enough, and thus could be employed if/where applicable.
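As an illustration of that last point, one cheaper candidate is the equirectangular approximation. This is only a sketch (not part of the code above); lat/lon are in radians and it is only reasonable for short distances:
import numpy as np

def equirect_dist(lat1, lon1, lat2, lon2):
    # flat-earth approximation of the great-circle distance; left unscaled,
    # like the unscaled haversine above, since SimScore is a ratio
    x = (lon2 - lon1) * np.cos(0.5 * (lat1 + lat2))
    y = lat2 - lat1
    return np.sqrt(x * x + y * y)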
However, as 200000 * (200000 - 1) / 2 = 19'999'900'000 similarity measures are to be computed, this will inevitably take a while.
I created a Python program that integrates a given function over a given interval using Monte Carlo simulation. It works well, except for the fact that it runs painfully slow when you want higher levels of accuracy (larger N value). I figured I'd give multiprocessing a try in order to speed it up, but then I realized I have no clue how to implement it. Here's what I have right now:
from scipy import random
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Process
import os

# GOAL: Approximate the integral of a function f(x) from lower bound a to upper bound b using Monte Carlo simulation

# bounds of integration
a = 0
b = np.pi

# function to integrate
def f(x):
    return np.sin(x)

N = 10000
areas = []

def mcIntegrate():
    for i in range(N):
        # array filled with random numbers between limits
        xrand = random.uniform(a, b, N)
        # sum the return values of the function of each random number
        integral = 0.0
        for i in range(N):
            integral += f(xrand[i])
        # scale integral by difference of bounds divided by amount of random values
        ans = integral * ((b - a) / float(N))
        # add approximation to list of other approximations
        areas.append(ans)

if __name__ == "__main__":
    processes = []
    numProcesses = os.cpu_count()
    for i in range(numProcesses):
        process = Process(target=mcIntegrate)
        processes.append(process)
    for process in processes:
        process.start()
    for process in processes:
        process.join()

    # graph approximation distribution
    plt.title("Distribution of Approximated Integrals")
    plt.hist(areas, bins=30, ec='black')
    plt.xlabel("Areas")
    plt.show()
Can I get some help with this implementation?
Took advice from the comments and used multiprocessing.Pool, and also cut down on some operations by using NumPy instead. Went from taking about 5 min to run to about 6 sec now (for N = 10000). Here's my implementation:
import scipy
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
import os

# GOAL: Approximate the integral of function f from lower bound a to upper bound b using Monte Carlo simulation

a = 0        # lower bound of integration
b = np.pi    # upper bound of integration
f = np.sin   # function to integrate
N = 10000    # sample size

def mcIntegrate(p):
    xrand = scipy.random.uniform(a, b, N)     # create array filled with random numbers within bounds
    integral = np.sum(f(xrand))               # sum return values of function of each random number
    approx = integral * ((b - a) / float(N))  # scale integral by difference of bounds divided by sample size
    return approx

if __name__ == "__main__":
    # run simulation N times in parallel and store results in array
    with multiprocessing.Pool(os.cpu_count()) as pool:
        areas = pool.map(mcIntegrate, range(N))

    # graph approximation distribution
    plt.title("Distribution of Approximated Integrals")
    plt.hist(areas, bins=30, ec='black')
    plt.xlabel("Areas")
    plt.show()
This turned out to be a more interesting problem than I thought it would be when I got to optimising it. The basic method is very simple:
from multiprocessing import Pool

def f(x):
    return x

with Pool() as pool:
    results = pool.map(f, range(100))
Here is your mcIntegrate adapted for multiprocessing:
from tqdm import tqdm

def mcIntegrate(steps):
    tasks = []
    print("Setting up simulations")
    # linear
    for _ in tqdm(range(steps)):
        xrand = random.uniform(a, b, steps)
        for i in range(steps):
            tasks.append(xrand[i])

    pool = Pool(cpu_count())
    print("Simulating (no progress)")
    results = pool.map(f, tasks)
    pool.close()

    print("summing")
    areas = []
    for chunk in tqdm(range(steps)):
        vals = results[chunk * steps : (chunk + 1) * steps]
        integral = sum(vals)
        ans = integral * ((b - a) / float(steps))
        areas.append(ans)
    return areas
tqdm is just used to display a progress bar.
This is the basic workflow for multiprocessing: break the question up into tasks, solve all the tasks, then add them all back together again. And indeed the code as given works. (Note that I've changed your N for steps).
For completeness, the script now begins:
from scipy import random
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool, cpu_count
from tqdm import tqdm
# function to integrate
def f(x):
    return np.sin(x)
and ends
areas = mcIntegrate(3_000)
a = 0
b = np.pi
plt.title("Distribution of Approximated Integrals")
plt.hist(areas, bins=30, ec="black")
plt.xlabel("Areas")
plt.show()
Optimisation
I deliberately split the problem up at the smallest possible level. Was this a good idea? To answer that, consider: how might we optimise the linear process of generating the tasks? This does take a considerable while at the moment. We could parallelise it:
def _prepare(steps):
    xrand = random.uniform(a, b, steps)
    return [xrand[i] for i in range(steps)]

def mcIntegrate(steps):
    ...
    tasks = []
    for res in tqdm(pool.imap(_prepare, (steps for _ in range(steps))), total=steps):
        tasks += res  # slower except for very large steps
Here I've used pool.imap, which returns an iterator we can consume as soon as results become available, allowing us to build a progress bar. If you do this and compare, you will see that it runs slower than the linear solution. Removing the progress bar (on my machine) and replacing it with:
import time
start = time.perf_counter()
results = pool.map(_prepare, (steps for _ in range(steps)))
tasks = []
for res in results:
    tasks += res
print(time.perf_counter() - start)
This is only marginally faster: it's still slower than running linearly. Serialising data to a process and then deserialising it has an overhead. If you try to get a progress bar on the whole thing, it becomes excruciatingly slow:
results = []
for result in tqdm(pool.imap(f, tasks), total=len(tasks)):
    results.append(result)
So what about iterating at a higher level? Here's another adaptation of your mcIntegrate:
a = 0
b = np.pi
def _mcIntegrate(steps):
    xrand = random.uniform(a, b, steps)
    integral = 0.0
    for i in range(steps):
        integral += f(xrand[i])
    ans = integral * ((b - a) / float(steps))
    return ans

def mcIntegrate(steps):
    areas = []
    p = Pool(cpu_count())
    for ans in tqdm(p.imap(_mcIntegrate, ((steps) for _ in range(steps))), total=steps):
        areas.append(ans)
    return areas
This, on my machine, is considerably faster. It's also considerably simpler. I was expecting a difference, but not such a considerable difference.
Takeaways
Multiprocessing isn't free. Something as simple as np.sin() is too cheap to multiprocess: we pay to serialise, deserialise, append, and so on, all for one sin() calculation. But if each task does too many calculations, you will waste time as you lose granularity. Here the effect is more striking than I was expecting. The only way to know the right level of granularity for a particular problem... is to profile and try.
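One knob worth profiling in that spirit is the chunksize argument of pool.map / pool.imap, which controls how many tasks are pickled per round-trip. A minimal sketch (the function one_estimate and the numbers are just placeholders, not taken from the code above):
from multiprocessing import Pool, cpu_count
import numpy as np

def one_estimate(steps):
    xrand = np.random.uniform(0, np.pi, steps)
    return np.sin(xrand).sum() * (np.pi - 0) / steps

if __name__ == "__main__":
    with Pool(cpu_count()) as p:
        # larger chunksize -> fewer serialisation round-trips, coarser load balancing
        areas = p.map(one_estimate, [10_000] * 3_000, chunksize=64)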
My experience is that multiprocessing is often not very efficient (a ton of overhead). The more you push your code into numpy, the faster it will be, with one caveat: you can overload your memory if you're not careful (10k x 10k is getting large). Lastly, it looks like N is doing double duty, both defining the sample size for each estimate and serving as the number of trial estimates.
Here is how I would do this (with minor style changes):
import numpy as np
f = np.sin
a = 0
b = np.pi
# number samples for each trial, trial count, and number calculated at once
N = 10000
TRIALS = 10000
BATCH_SIZE=1000
def mc_integrate(f, a, b, N, batch_size=BATCH_SIZE):
    # compute everything carrying `batch_size` copies by extending the array dimension.
    # samples.shape == (N, batch_size)
    samples = np.random.uniform(a, b, size=(N, batch_size))
    integrals = np.sum(f(samples), axis=0)
    mc_estimates = integrals * ((b - a) / N)
    return mc_estimates

# loop over batch values to get final result
n, r = divmod(TRIALS, BATCH_SIZE)
results = []
for j in [BATCH_SIZE]*n + [r]:
    results.extend(mc_integrate(f, a, b, N, batch_size=j))
On my machine this takes a few seconds.
Problem: I want to speed up my python loop containing a lot of products and summations with np.einsum, but I'm also open to any other solutions.
My function takes a vector configuration S of shape (n,n,3) (in my case n=72) and does a Fourier transform of the correlation function for N*N points. The correlation function is defined as the product of every vector with every other vector. This gets multiplied by a cosine of the positions of the vectors times the kx and ky values. Every position i,j is summed in the end to get one point in k-space p,m:
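Written out explicitly (as reconstructed from the loop body below), the quantity computed for each k-point (p, m) is:
\chi_{pm} = \frac{2}{n^2} \sum_{i=1}^{n^2} \sum_{j=1}^{n^2} \left(\mathbf{S}_i \cdot \mathbf{S}_j\right) \cos\!\bigl(k_{x,p}\,(x_i - x_j) + k_{y,m}\,(y_i - y_j)\bigr)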
def spin_spin(S, N):
    n = len(S)
    conf = np.reshape(S, (n**2, 3))
    chi = np.zeros((N, N))
    kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N)
    ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N)
    x = np.reshape(triangular(n)[0], (n**2))
    y = np.reshape(triangular(n)[1], (n**2))
    for p in range(N):
        for m in range(N):
            for i in range(n**2):
                for j in range(n**2):
                    chi[p,m] += 2/(n**2)*np.dot(conf[i], conf[j])*np.cos(kx[p]*(x[i]-x[j]) + ky[m]*(y[i]-y[j]))
    return (chi, kx, ky)
My problem is that I need roughly 100*100 points, denoted by kx*ky, and the loop takes too many hours to finish this job for a lattice with 72*72 vectors.
Number of calculations: 72*72*72*72*100*100
I cannot use the built-in FFT of numpy because of my triangular grid, so I need some other option to reduce the computational cost here.
My idea: First I recognized that reshaping the configuration into a list of vectors instead of a matrix reduces the computational cost. Furthermore I used the numba package, which has also reduced the cost, but it's still too slow. I found out that a good way of calculating these kinds of objects is the np.einsum function. Calculating the product of every vector with every vector is done with the following:
np.einsum('ij,kj -> ik',np.reshape(S,(72**2,3)),np.reshape(S,(72**2,3)))
The tricky part is the calculation of the term inside the np.cos. Here I want to calculate the product between a list of shape (100,1) and the positions of the vectors (e.g. np.shape(x)=(72**2,1)). In particular, I really don't know how to implement the distance in the x-direction and y-direction with np.einsum.
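For concreteness, the pairwise differences I am after could be formed without einsum via plain broadcasting (just a sketch; it costs two (n**2, n**2) arrays in memory):
# x and y have shape (n**2,) after the reshape above
dx = x[:, None] - x[None, :]   # dx[i, j] = x[i] - x[j], shape (n**2, n**2)
dy = y[:, None] - y[None, :]
# together with the dot-product matrix from the einsum above, a single
# k-point would then be: chi[p, m] = 2/n**2 * np.sum(dots * np.cos(kx[p]*dx + ky[m]*dy))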
To reproduce the code (you probably won't need this): First you need a vector configuration. You can do it simply with np.ones((72,72,3)), or you can take random vectors as an example with:
import random
import numpy as np

def spherical_to_cartesian(r, theta, phi):
    '''Convert spherical coordinates (physics convention) to cartesian coordinates'''
    sin_theta = np.sin(theta)
    x = r * sin_theta * np.cos(phi)
    y = r * sin_theta * np.sin(phi)
    z = r * np.cos(theta)
    return x, y, z  # return a tuple

def random_directions(n, r):
    '''Return ``n`` 3-vectors in random directions with radius ``r``'''
    out = np.empty(shape=(n,3), dtype=np.float64)
    for i in range(n):
        # Pick directions randomly in solid angle
        phi = random.uniform(0, 2*np.pi)
        theta = np.arccos(random.uniform(-1, 1))
        # unpack a tuple
        x, y, z = spherical_to_cartesian(r, theta, phi)
        out[i] = x, y, z
    return out
S = np.reshape(random_directions(72**2,1),(72,72,3))
(The reshape in this example is needed to shape it in the function spin_spin back to the (72**2,3) shape.)
For the positions of vectors I use a triangular grid defined by
def triangular(nsize):
    '''Positional arguments of the spin configuration'''
    X = np.zeros((nsize, nsize))
    Y = np.zeros((nsize, nsize))
    for i in range(nsize):
        for j in range(nsize):
            X[i,j] += 1/2*j + i
            Y[i,j] += np.sqrt(3)/2*j
    return (X, Y)
Optimized Numba implementation
The main problem in your code is calling the external BLAS function np.dot repeatedly with extremely small data. In this code it would make more sense to calculate the dot products only once, but if you have to do these calculations in a loop, write a Numba implementation (example below).
Optimized function (brute-force)
import numpy as np
import numba as nb

@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def spin_spin(S, N):
    n = len(S)
    conf = np.reshape(S, (n**2, 3))
    chi = np.zeros((N, N))
    kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N).astype(np.float32)
    ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N).astype(np.float32)
    x = np.reshape(triangular(n)[0], (n**2)).astype(np.float32)
    y = np.reshape(triangular(n)[1], (n**2)).astype(np.float32)

    # precalc some values
    fact = nb.float32(2/(n**2))
    conf_dot = np.dot(conf, conf.T).astype(np.float32)

    for p in nb.prange(N):
        for m in range(N):
            # accumulating on a scalar is often beneficial
            acc = nb.float32(0)
            for i in range(n**2):
                for j in range(n**2):
                    acc += conf_dot[i,j]*np.cos(kx[p]*(x[i]-x[j]) + ky[m]*(y[i]-y[j]))
            chi[p,m] = fact*acc
    return (chi, kx, ky)
Optimized function (removal of redundant calculations)
There are a lot of redundant calculations being done. This is an example of how to remove them. It is also a version which does the calculations in double precision.
@nb.njit()
def precalc(S):
    # There may not be all redundancies removed
    n = len(S)
    conf = np.reshape(S, (n**2, 3))
    conf_dot = np.dot(conf, conf.T)
    x = np.reshape(triangular(n)[0], (n**2))
    y = np.reshape(triangular(n)[1], (n**2))

    x_s = set()
    y_s = set()
    for i in range(n**2):
        for j in range(n**2):
            x_s.add((x[i]-x[j]))
            y_s.add((y[i]-y[j]))

    x_arr = np.sort(np.array(list(x_s)))
    y_arr = np.sort(np.array(list(y_s)))

    conf_dot_sel = np.zeros((x_arr.shape[0], y_arr.shape[0]))
    for i in range(n**2):
        for j in range(n**2):
            ii = np.searchsorted(x_arr, x[i]-x[j])
            jj = np.searchsorted(y_arr, y[i]-y[j])
            conf_dot_sel[ii,jj] += conf_dot[i,j]

    return x_arr, y_arr, conf_dot_sel

@nb.njit(fastmath=True, error_model="numpy", parallel=True)
def spin_spin_opt_2(S, N):
    chi = np.empty((N, N))
    n = len(S)
    kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N)
    ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N)

    x_arr, y_arr, conf_dot_sel = precalc(S)
    fact = 2/(n**2)

    for p in nb.prange(N):
        for m in range(N):
            acc = nb.float32(0)
            for i in range(x_arr.shape[0]):
                for j in range(y_arr.shape[0]):
                    acc += fact*conf_dot_sel[i,j]*np.cos(kx[p]*x_arr[i] + ky[m]*y_arr[j])
            chi[p,m] = acc
    return (chi, kx, ky)
Timings
#brute-force
%timeit res=spin_spin(S,100)
#48 s ± 671 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#new version
%timeit res_2=spin_spin_opt_2(S,100)
#5.33 s ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=spin_spin_opt_2(S,1000)
#1min 23s ± 2.43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Edit (SVML-check)
import numba as nb
import numpy as np

@nb.njit(fastmath=True)
def foo(n):
    x = np.empty(n*8, dtype=np.float64)
    ret = np.empty_like(x)
    for i in range(ret.size):
        ret[i] += np.cos(x[i])
    return ret

foo(1000)

if 'intel_svmlcc' in foo.inspect_llvm(foo.signatures[0]):
    print("found")
else:
    print("not found")
#found
If the output is "not found", read this link. It should work on Linux and Windows, but I haven't tested it on macOS.
Here is one approach to speed things up. I didn't start using np.einsum because a little tweaking of your loops was sufficient.
The main thing slowing down your code was redundant recalculations of the same thing. The nested loop here is the perpetrator:
for p in range(N):
    for m in range(N):
        for i in range(n**2):
            for j in range(n**2):
                chi[p,m] += 2/(n**2)*np.dot(conf[i],conf[j])*np.cos(kx[p]*(x[i]-x[j])+ ky[m]*(y[i]-y[j]))
It contains a lot of redundancy, recalculating vector operations many times.
Consider the np.dot(...): this calculation is completely independent of the points kx and ky; only the cosine term requires indexing with p and m. So you can run the dot products over all i and j just once and save the result, as opposed to recalculating them for each p,m (which would be 10,000 times!).
In a similar vein, there is no need for the position differences to be recalculated at each point in k-space. At every k-point you currently recompute every difference, when all that is needed is to calculate the differences once and merely reuse the result at each lattice point.
So, having fixed the loops and used dictionaries with indices (i,j) as keys to store all the values, you can just look up the relevant value during the loop over i, j. Here is my code:
import itertools

def spin_spin(S, N):
    n = len(S)
    conf = np.reshape(S, (n**2, 3))
    chi = np.zeros((N, N))
    kx = np.linspace(-5*np.pi/3, 5*np.pi/3, N)
    ky = np.linspace(-3*np.pi/np.sqrt(3), 3*np.pi/np.sqrt(3), N)

    # Minor point; no need to use triangular twice
    x, y = triangular(n)
    x, y = np.reshape(x, (n**2)), np.reshape(y, (n**2))

    # Build a look-up for all the dot products to save calculating them many times
    dot_prods = dict()
    x_diffs, y_diffs = dict(), dict()
    for i, j in itertools.product(range(n**2), range(n**2)):
        dot_prods[(i, j)] = np.dot(conf[i], conf[j])
        x_diffs[(i, j)], y_diffs[(i, j)] = x[i] - x[j], y[i] - y[j]

    # Minor point; improve syntax by converting nested for loops to one line
    for p, m in itertools.product(range(N), range(N)):
        for i, j in itertools.product(range(n**2), range(n**2)):
            # All vector operations are replaced by look ups to the dictionaries defined above
            chi[p, m] += 2/(n**2)*dot_prods[(i, j)]*np.cos(kx[p]*(x_diffs[(i, j)]) + ky[m]*(y_diffs[(i, j)]))
    return (chi, kx, ky)
I am running this at the moment with the dimensions you provide, on a decent machine, and the loop over i,j finishes in two minutes. That only needs to happen once; then it is just a loop over p,m. Each one of these is taking about 90 seconds, so still a 2-3 hour run time. I welcome any suggestions on how to optimise that cos calculation to speed that up!
I hit the low hanging fruit of optimization, but to give a sense of speed, the loop of i, j takes 2 minutes, and this way it runs 9,999 fewer times!
I have a numpy array a of length n, which has the numbers 0 through n-1 shuffled in some way. I also have a numpy array mask of length <= n, containing some subset of the elements of a, in a different order.
The query I want to compute is "give me the elements of a that are also in mask in the order that they appear in a".
I had a similar question here, but the difference was that mask was a boolean mask instead of a mask on the individual elements.
I've outlined and tested 4 methods below:
import timeit
import numpy as np
import matplotlib.pyplot as plt

n_test = 100
n_coverages = 10

np.random.seed(0)

def method1():
    return np.array([x for x in a if x in mask])

def method2():
    s = set(mask)
    return np.array([x for x in a if x in s])

def method3():
    return a[np.in1d(a, mask, assume_unique=True)]

def method4():
    bmask = np.full((n_samples,), False)
    bmask[mask] = True
    return a[bmask[a]]

methods = [
    ('naive membership', method1),
    ('python set', method2),
    ('in1d', method3),
    ('binary mask', method4)
]

p_space = np.linspace(0, 1, n_coverages)
for n_samples in [1000]:
    a = np.arange(n_samples)
    np.random.shuffle(a)
    for label, method in methods:
        if method == method1 and n_samples == 10000:
            continue
        times = []
        for coverage in p_space:
            mask = np.random.choice(a, size=int(n_samples * coverage), replace=False)
            time = timeit.timeit(method, number=n_test)
            times.append(time * 1e3)
        plt.plot(p_space, times, label=label)

plt.xlabel(r'Coverage ($\frac{|\mathrm{mask}|}{|\mathrm{a}|}$)')
plt.ylabel('Time (ms)')
plt.title('Comparison of 1-D Intersection Methods for $n = {}$ samples'.format(n_samples))
plt.legend()
plt.show()
Which produced the following results:
So, binary mask is, without a doubt, the fastest method of these 4 for any size of the mask.
My question is, is there a faster way?
I totally agree that the binary mask method is the fastest one. I also don't think there could be any better way, in terms of computational complexity, to do what you need.
Let me analyse your method time results:
Method 1's running time is T = O(|a|*|mask|). Every element of a is checked for membership in mask by iterating over every element of mask. This gives O(|mask|) time per element in the worst case, when the element is missing from mask. |a| does not change,
so consider it a constant.
|mask| = coverage * |a|
T = O(|a|² * coverage)
Hence the linear dependency on coverage in the plot. Note that the running time has a quadratic dependency on |a|. If |mask| ≤ |a| and |a| = n, then T = O(n²).
Method 2 uses a set. A set is a data structure that performs insertion/lookup in O(log(n)), where n is the number of elements in the set. s = set(mask) takes O(|mask|*log(|mask|)) to complete because there are |mask| insertion operations.
x in s is a lookup operation. So the second line runs in O(|a|*log(|mask|)).
The overall time complexity is O(|mask|*log(|mask|) + |a|*log(|mask|)). If |mask| ≤ |a| and |a| = n, then T = O(n*log(n)). You probably observe an f(x) = log(x) dependency in the plot.
Method 3 (in1d) runs in O(|mask|*log(|mask|) + |a|*log(|mask|)) as well. Same T = O(n*log(n)) complexity and f(x) = log(x) dependency in the plot.
Method 4's time complexity is O(|a| + |mask|), which is T = O(n), and it is the best. You observe a constant dependency in the plot. The algorithm simply iterates over the a and mask arrays a couple of times.
The thing is, if you have to output n items, you already have T = O(n) complexity. So this method 4 algorithm is optimal.
P.S. In order to observe the mentioned f(n) dependencies, you'd better vary |a| and let |mask| = 0.9*|a|.
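A minimal adaptation of the benchmark above that would do this (reusing its module-level globals n_test, methods, a, mask and n_samples; the sizes are just examples):
for n_samples in [1_000, 2_000, 5_000, 10_000]:
    a = np.arange(n_samples)
    np.random.shuffle(a)
    mask = np.random.choice(a, size=int(0.9 * n_samples), replace=False)
    for label, method in methods:
        print(n_samples, label, timeit.timeit(method, number=n_test))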
EDIT: Looks like the Python set indeed performs lookup/insert in O(1) (on average) using a hash table.
Assuming a is the bigger one.
def with_searchsorted(a, b):
    sb = b.argsort()
    bs = b[sb]
    sa = a.argsort()
    ia = np.arange(len(a))
    ra = np.empty_like(sa)
    ra[sa] = ia
    ac = bs.searchsorted(ia) % b.size
    return a[(bs[ac] == ia)[ra]]
demo
a = np.arange(10)
np.random.shuffle(a)
b = np.random.choice(a, 5, False)
print(a)
print(b)
[7 2 9 3 0 4 8 5 6 1]
[0 8 5 4 6]
print(with_searchsorted(a, b))
[0 4 8 5 6]
how it works
# sort b for faster searchsorting
sb = b.argsort()
bs = b[sb]
# sort a for faster searchsorting
sa = a.argsort()
# this is the sorted a... we just cheat because we know what it will be
ia = np.arange(len(a))
# construct the reverse sort look up
ra = np.empty_like(sa)
ra[sa] = ia
# perform searchsort
ac = bs.searchsorted(ia) % b.size
return a[(bs[ac] == ia)[ra]]
I have written this python code to get neighbours of a label (a set of pixels sharing some common properties). The neighbours for a label are defined as the other labels that lie on the other side of the boundary (the neighbouring labels share a boundary). So, the code I wrote works but is extremely slow:
# segments: It is a 2-dimensional numpy array (an image really)
# where segments[x, y] = label_index. So each entry defines the
# label associated with a pixel.
# i: The label whose neighbours we want.
import numpy as np

def get_boundaries(segments, i):
    neighbors = []
    for y in range(1, segments.shape[1]):
        for x in range(1, segments.shape[0]):
            # Check if current index has the label we want
            if segments[x-1, y] == i:
                # Check if neighbour in the x direction has
                # a different label
                if segments[x-1, y] != segments[x, y]:
                    neighbors.append(segments[x, y])
            # Check if neighbour in the y direction has
            # a different label
            if segments[x, y-1] == i:
                if segments[x, y-1] != segments[x, y]:
                    neighbors.append(segments[x, y])
    return np.unique(np.asarray(neighbors))
As you can imagine, I have probably completely misused python here. I was wondering if there is a way to optimize this code to make it more pythonic.
Here you go:
def get_boundaries2(segments, i):
    x, y = np.where(segments == i)  # where i is
    right = x + 1
    rightMask = right < segments.shape[0]  # keep in bounds
    down = y + 1
    downMask = down < segments.shape[1]
    rightNeighbors = segments[right[rightMask], y[rightMask]]
    downNeighbors = segments[x[downMask], down[downMask]]
    neighbors = np.union1d(rightNeighbors, downNeighbors)
    return neighbors
As you can see, there are no Python loops at all; I also tried to minimize copies (the first attempt made a copy of segments with a NAN border, but then I devised the "keep in bounds" check).
Note that I did not filter out i itself from the "neighbors" here; you can add that easily at the end if you want. Some timings:
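For example, that filtering could be done at the end with a one-liner applied to the returned array:
neighbors = neighbors[neighbors != i]   # optionally drop the label itself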
Input 2000x3000: original takes 13 seconds, mine takes 370 milliseconds (35x speedup).
Input 1000x300: original takes 643 ms, mine takes 17.5 ms (36x speedup).
You need to replace your for loops with numpy's implicit looping.
I don't know enough about your code to convert it directly, but I can give an example.
Suppose you have an array of 100000 random integers, and you need to get an array of each element divided by its neighbor.
import random, numpy as np
a = np.fromiter((random.randint(1, 100) for i in range(100000)), int)
One way to do this would be:
[a[i] / a[i+1] for i in range(len(a)-1)]
Or this, which is much faster:
a / np.roll(a, -1)
Timeit:
initcode = 'import random, numpy as np; a = np.fromiter((random.randint(1, 100) for i in range(100000)), int)'
timeit.timeit('[a[i] / a[i+1] for i in range(len(a)-1)]', initcode, number=100)
5.822079309000401
timeit.timeit('(a / np.roll(a, -1))', initcode, number=100)
0.1392055350006558