Find all point pairs closer than a given maximum distance - python

I want to find (efficiently) all pairs of points that are closer than some distance max_d. My current method, using cdist, is:
import numpy as np
from scipy.spatial.distance import cdist
def close_pairs(X,max_d):
    d = cdist(X,X)
    I,J = (d<max_d).nonzero()
    IJ = np.sort(np.vstack((I,J)), axis=0)

    # remove diagonal elements
    IJ = IJ[:,np.diff(IJ,axis=0).ravel()!=0]

    # remove duplicates
    dt = np.dtype([('i',int),('j',int)])
    pairs = np.unique(IJ.T.view(dtype=dt)).view(int).reshape(-1,2)

    return pairs

def test():
    X = np.random.rand(100,2)*20
    p = close_pairs(X,2)

    from matplotlib import pyplot as plt
    plt.clf()
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[p,0].T,X[p,1].T,'-b')
But I think this is overkill (and not very readable), because most of the work is done only to remove distance-to-self and duplicates.
My main question is: is there a better way to do it?
(Note: the type of outputs (array, set, ...) is not important at this point)
My current thinking is to use pdist, which returns a condensed distance array containing only the right pairs. However, once I have found the suitable indices k in the condensed distance array, how do I compute which i,j pair each of them is equivalent to?
So the alternative question is: is there an easy way to get the list of index pairs corresponding to the entries of the pdist output, i.e.
a function f(k) -> i,j
such that cdist(X,X)[i,j] = pdist(X)[k]?
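To make the requirement concrete (a small sketch added for clarity, not part of my original code): the k-th entry of pdist(X) corresponds to the k-th pair enumerated over the upper triangle, so the function I am after must satisfy the same relation that np.triu_indices exposes:

import numpy as np
from scipy.spatial.distance import cdist, pdist

X = np.random.rand(6, 2)
I, J = np.triu_indices(len(X), k=1)   # I[k], J[k] is the pair behind condensed index k
assert np.allclose(pdist(X), cdist(X, X)[I, J])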

In my experience, there are two fastest ways to find neighbor lists in 3D. One is to use naive double-for-loop code written in C++ or Cython (in my case, both). It runs in O(N^2) time, but is very fast for small systems. The other way is to use a linear-time algorithm. SciPy's cKDTree is a good choice, but has limitations. Neighbor-list finders from molecular dynamics software are the most powerful, but are very hard to wrap, and likely have slow initialization times.
Below I compare four methods:
Naive Cython code
Wrapper around OpenMM (very hard to install, see below)
scipy.spatial.cKDTree
scipy.spatial.distance.pdist
Test setup: n points scattered in a rectangular box at volume density 0.2. System sizes range from 10 to 1,000,000 (a million) particles. The contact radius takes the values 0.5, 1, 2, 4, 7, 10. Note that because the density is 0.2, at contact radius 0.5 we'll have on average about 0.1 contacts per particle, at 1 about 0.8, at 2 about 6.7, and at 10 about 800! Contact finding was repeated several times for small systems and done once for systems >30k particles. If the time per call exceeded 5 seconds, the run was aborted.
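(For reference, the expected number of contacts per particle is just the number density times the volume of the contact sphere; a quick back-of-the-envelope sketch:)

import numpy as np
rho = 0.2  # number density used in the benchmark
for radius in [0.5, 1, 2, 4, 7, 10]:
    print("r=%s -> %.1f contacts per particle" % (radius, rho * 4.0 / 3.0 * np.pi * radius ** 3))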
Setup: dual Xeon 2687Wv3, 128GB RAM, Ubuntu 14.04, python 2.7.11, scipy 0.16.0, numpy 1.10.1. None of the code was using parallel optimizations (except for OpenMM, though the parallel part went so quickly that it was not even noticeable on a CPU graph; most of the time was spent piping data to and from OpenMM).
Results: Note that the plots below are log-scale and spread over 6 orders of magnitude. Even a small visual difference may actually be a 10-fold difference.
For systems of fewer than 1000 particles, the Cython code was always faster. However, above 1000 particles the results depend on the contact radius. The pdist implementation was always slower than Cython and takes much more memory, because it explicitly creates a distance array, and it is slow because of the sqrt.
At small contact radius (<1 contact per particle), ckdtree is a good choice for all system sizes.
At medium contact radius (5-50 contacts per particle), the naive Cython implementation is the best up to about 10,000 particles; then OpenMM starts to win by several orders of magnitude, though ckdtree performs only 3-10 times worse.
At high contact radius (>200 contacts per particle), the naive methods work up to 100k or 1M particles; beyond that OpenMM may win.
Installing OpenMM is very tricky; you can read more in the file "contactmaps.py" in http://bitbucket.org/mirnylab/openmm-polymer or in its readme. However, the results above show that it is only advantageous for 5-50 contacts per particle and N>100k particles.
Cython code below:
import numpy as np
cimport numpy as np
cimport cython

cdef extern from "<vector>" namespace "std":
    cdef cppclass vector[T]:
        cppclass iterator:
            T operator*()
            iterator operator++()
            bint operator==(iterator)
            bint operator!=(iterator)
        vector()
        void push_back(T&)
        T& operator[](int)
        T& at(int)
        iterator begin()
        iterator end()

np.import_array()  # initialize C API to call PyArray_SimpleNewFromData

cdef public api tonumpyarray(int* data, long long size) with gil:
    if not (data and size >= 0): raise ValueError
    cdef np.npy_intp dims = size
    # NOTE: it doesn't take ownership of `data`. You must free `data` yourself
    return np.PyArray_SimpleNewFromData(1, &dims, np.NPY_INT, <void*>data)

@cython.boundscheck(False)
@cython.wraparound(False)
def contactsCython(inArray, cutoff):
    inArray = np.asarray(inArray, dtype = np.float64, order = "C")
    cdef int N = len(inArray)
    cdef np.ndarray[np.double_t, ndim = 2] data = inArray
    cdef int j,i
    cdef double curdist
    cdef double cutoff2 = cutoff * cutoff  # IMPORTANT to avoid slow sqrt calculation
    cdef vector[int] contacts1
    cdef vector[int] contacts2
    for i in range(N):
        for j in range(i+1, N):
            curdist = (data[i,0] - data[j,0]) **2 + (data[i,1] - data[j,1]) **2 + (data[i,2] - data[j,2]) **2
            if curdist < cutoff2:
                contacts1.push_back(i)
                contacts2.push_back(j)

    cdef int M = len(contacts1)
    cdef np.ndarray[np.int32_t, ndim = 2] contacts = np.zeros((M,2), dtype = np.int32)
    for i in range(M):
        contacts[i,0] = contacts1[i]
        contacts[i,1] = contacts2[i]
    return contacts
Compilation (or makefile) for Cython code:
cython --cplus fastContacts.pyx
g++ -g -march=native -Ofast -fpic -c fastContacts.cpp -o fastContacts.o `python-config --includes`
g++ -g -march=native -Ofast -shared -o fastContacts.so fastContacts.o `python-config --libs`
Testing code:
from __future__ import print_function, division

import signal
import time
from contextlib import contextmanager

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import ckdtree
from scipy.spatial.distance import pdist

from contactmaps import giveContactsOpenMM  # remove this unless you have OpenMM and openmm-polymer libraries installed
from fastContacts import contactsCython


class TimeoutException(Exception): pass


@contextmanager
def time_limit(seconds):
    def signal_handler(signum, frame):
        raise TimeoutException("Timed out!")
    signal.signal(signal.SIGALRM, signal_handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)


matplotlib.rcParams.update({'font.size': 8})


def close_pairs_ckdtree(X, max_d):
    tree = ckdtree.cKDTree(X)
    pairs = tree.query_pairs(max_d)
    return np.array(list(pairs))


def condensed_to_pair_indices(n, k):
    x = n - (4. * n ** 2 - 4 * n - 8 * k + 1) ** .5 / 2 - .5
    i = x.astype(int)
    j = k + i * (i + 3 - 2 * n) / 2 + 1
    return np.array([i, j]).T


def close_pairs_pdist(X, max_d):
    d = pdist(X)
    k = (d < max_d).nonzero()[0]
    return condensed_to_pair_indices(X.shape[0], k)


a = np.random.random((100, 3)) * 3  # test set
methods = {"cython": contactsCython, "ckdtree": close_pairs_ckdtree, "OpenMM": giveContactsOpenMM,
           "pdist": close_pairs_pdist}

# checking that each method gives the same value
allUniqueInds = []
for ind, method in methods.items():
    contacts = method(a, 1)
    uniqueInds = contacts[:, 0] + 100 * contacts[:, 1]  # unique index of each contact
    allUniqueInds.append(np.sort(uniqueInds))  # adding sorted unique contacts
for j in allUniqueInds:
    assert np.allclose(j, allUniqueInds[0])

# now actually doing the testing
repeats = [30, 30, 30, 30, 30, 20, 20, 10, 5, 3, 2, 1, 1, 1]
sizes = [10, 30, 100, 200, 300, 500, 1000, 2000, 3000, 10000, 30000, 100000, 300000, 1000000]
systems = [[np.random.random((n, 3)) * ((n / 0.2) ** 0.333333) for k in range(repeat)] for n, repeat in
           zip(sizes, repeats)]

for j, radius in enumerate([0.5, 1, 2, 4, 7, 10]):
    plt.subplot(2, 3, j + 1)
    plt.title("Radius = {0}; {1:.2f} cont per particle".format(radius, 0.2 * (4 / 3 * np.pi * radius ** 3)))

    times = {i: [] for i in methods}
    for name, method in methods.items():
        for n, system, repeat in zip(sizes, systems, repeats):
            if name == "pdist" and n > 30000:
                break  # memory issues
            st = time.time()
            try:
                with time_limit(5 * repeat):
                    for ind in range(repeat):
                        k = len(method(system[ind], radius))
            except:
                print("Run aborted")
                break
            end = time.time()
            mytime = (end - st) / repeat
            times[name].append((n, mytime))
            print("{0} radius={1} n={2} time={3} repeat={4} contPerParticle={5}".format(name, radius, n, mytime, repeat, 2 * k / n))

    for name in sorted(times.keys()):
        plt.plot(*zip(*times[name]), label=name)
    plt.xscale("log")
    plt.yscale("log")
    plt.xlabel("System size")
    plt.ylabel("Time (seconds)")
    plt.legend(loc=0)

plt.show()

Here's how to do it with the cKDTree module; see query_pairs.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.spatial import ckdtree
def close_pairs(X,max_d):
    d = cdist(X,X)
    I,J = (d<max_d).nonzero()
    IJ = np.sort(np.vstack((I,J)), axis=0)

    # remove diagonal elements
    IJ = IJ[:,np.diff(IJ,axis=0).ravel()!=0]

    # remove duplicates
    dt = np.dtype([('i',int),('j',int)])
    pairs = np.unique(IJ.T.view(dtype=dt)).view(int).reshape(-1,2)

    return pairs

def close_pairs_ckdtree(X, max_d):
    tree = ckdtree.cKDTree(X)
    pairs = tree.query_pairs(max_d)
    return np.array(list(pairs))

def test():
    np.random.seed(0)
    X = np.random.rand(100,2)*20
    p = close_pairs(X,2)
    q = close_pairs_ckdtree(X, 2)

    from matplotlib import pyplot as plt
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[p,0].T,X[p,1].T,'-b')

    plt.figure()
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[q,0].T,X[q,1].T,'-b')

    plt.show()

test()

I finally found it myself. The function converting an index k in the condensed distance array to the equivalent i,j in the square distance array is:
def condensed_to_pair_indices(n,k):
    x = n-(4.*n**2-4*n-8*k+1)**.5/2-.5
    i = x.astype(int)
    j = k+i*(i+3-2*n)/2+1
    return i,j
I had to play a little with sympy to find it. Now, to compute all point pairs that are less than a given distance apart:
def close_pairs_pdist(X,max_d):
    d = pdist(X)
    k = (d<max_d).nonzero()[0]
    return condensed_to_pair_indices(X.shape[0],k)
As expected, it is more efficient than the other methods (but I did not test ckdtree). I will update the timeit answer.
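A quick sanity check of the formula against np.triu_indices (my own addition, not part of the timing code). The original code relies on Python 2 integer division; the sketch uses // so it also holds on Python 3:

import numpy as np

def condensed_to_pair_indices(n, k):
    x = n - (4.*n**2 - 4*n - 8*k + 1)**.5/2 - .5
    i = x.astype(int)
    j = k + i*(i + 3 - 2*n)//2 + 1
    return i, j

n = 50
k = np.arange(n*(n-1)//2)          # every condensed index
i, j = condensed_to_pair_indices(n, k)
ti, tj = np.triu_indices(n, 1)     # reference enumeration used by pdist
assert np.array_equal(i, ti) and np.array_equal(j, tj)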

Slightly faster. I didn't test the time difference thoroughly, but running it a few times gave a time of about 0.0756 s for my method and 0.0929 s for yours. I use np.triu_indices to get rid of the upper triangle of the array (where the duplicates are), including the diagonal (which is where the self-distances are), and I don't sort either, since if you plot it, it does not matter whether the pairs are plotted in order or not. So I guess it speeds things up by about 15% or so.
import numpy as np
import time
from scipy.spatial.distance import cdist, pdist
from scipy.misc import comb  # in newer scipy this lives in scipy.special
def close_pairs(X,max_d):
    d = cdist(X,X)
    I,J = (d<max_d).nonzero()
    IJ = np.sort(np.vstack((I,J)), axis=0)

    # remove diagonal elements
    IJ = IJ[:,np.diff(IJ,axis=0).ravel()!=0]

    # remove duplicates
    dt = np.dtype([('i',int),('j',int)])
    pairs = np.unique(IJ.T.view(dtype=dt)).view(int).reshape(-1,2)

    return pairs

def close_pairs1(X,max_d):
    d = cdist(X,X)
    d1 = np.triu_indices(len(X))  # indices of the upper triangle including the diagonal
    d[d1] = max_d+1               # value that will not get selected when doing d<max_d in the next line
    I,J = (d<max_d).nonzero()
    pairs = np.vstack((I,J)).T
    return pairs
def close_pairs3(X, max_d):
    d = pdist(X)
    n = len(X)
    pairs = np.zeros((0,2))
    for i in range(n):
        for j in range(i+1,n):
            # formula from http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html
            a = d[int(comb(n,2)-comb(n-i,2)+j-i-1+0.1)]  # the +0.1 is because otherwise I get floating point trouble
            if a < max_d:
                pairs = np.r_[pairs, np.array([i,j])[None,:]]
    return pairs
def close_pairs4(X, max_d):
    d = pdist(X)
    n = len(X)
    a = np.where(d<max_d)[0]
    i = np.arange(n)[:,None]
    j = np.arange(n)[None,:]
    b = np.array(comb(n,2)-comb(n-i,2)+j-i-1+0.1, dtype=int)
    d1 = np.tril_indices(n)
    b[d1] = -1
    pairs = np.zeros((0,2), dtype=int)
    # next part is the bottleneck: the np.where each time
    for v in a:
        i, j = np.where(v==b)
        pairs = np.r_[pairs, np.array([i[0],j[0]])[None,:]]
    return pairs
def close_pairs5(X, max_d):
    t0 = time.time()
    d = pdist(X)
    n = len(X)
    a = np.where(d<max_d)[0]
    i = np.arange(n)[:,None]
    j = np.arange(n)[None,:]
    t1 = time.time()
    b = np.array(comb(n,2)-comb(n-i,2)+j-i-1+0.1, dtype=int)
    d1 = np.tril_indices(n)
    b[d1] = -1
    t2 = time.time()
    V = b[:,:,None]-a[None,None,:]  # takes a little time
    t3 = time.time()
    p = np.where(V==0)  # takes most of the time; I thought removing the for-loop from the previous method might improve it, but it does not help that much. This method contains the formula you wanted, but apparently it is still faster to use the cdist methods
    t4 = time.time()
    pairs = np.vstack((p[0],p[1])).T
    print t4-t3, t3-t2, t2-t1, t1-t0
    return pairs
def test():
    X = np.random.rand(1000,2)*20

    t0 = time.time()
    p = close_pairs(X,2)
    t1 = time.time()
    p2 = close_pairs1(X,2)
    t2 = time.time()
    print t2-t1, t1-t0

    from matplotlib import pyplot as plt
    plt.figure()
    plt.clf()
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[p,0].T,X[p,1].T,'-b')

    plt.figure()
    plt.clf()
    plt.plot(X[:,0],X[:,1],'.r')
    plt.plot(X[p2,0].T,X[p2,1].T,'-b')
    plt.show()

test()
NOTE: plotting lags if you do it for 1K points, but you need 1K points to compare speeds; I checked that the plot is correct when doing it with 100 points.
The speed difference is something like ten to twenty percent, and I think it will not get much better than this, since I got rid of all the sorting and unique-element work, so the part that takes most of the time is probably the d = cdist(X, X) line.
Edit: some more testing shows that, of those times, the cdist line takes about 0.065 s, while the rest of your method takes about 0.02 s and mine about 0.015 s. Conclusion: the main bottleneck of your code is the d = cdist(X, X) line; the changes I made speed up the rest of your code, but the main bottleneck stays.
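For reference, a rough way to reproduce that breakdown (my own sketch; the numbers will obviously differ per machine) is to time the cdist call and the post-processing separately:

import time
import numpy as np
from scipy.spatial.distance import cdist

X = np.random.rand(1000, 2)*20
t0 = time.time()
d = cdist(X, X)                      # the claimed bottleneck
t1 = time.time()
d[np.triu_indices(len(X))] = 3       # mask upper triangle + diagonal (max_d=2 here)
I, J = (d < 2).nonzero()
pairs = np.vstack((I, J)).T          # the remaining post-processing
t2 = time.time()
print("cdist: %.3f s, post-processing: %.3f s" % (t1 - t0, t2 - t1))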
Edit: added the method close_pairs3, which uses the formula you asked about, but its speed is terrible (I still need to figure out how to invert that formula, and then it will be super fast; will do that tomorrow, using np.where on pdist(X)).
Edit: added method close_pairs4, which is slightly better than 3 and explains what happens, but is very slow; the same goes for method 5, which drops that for-loop but is still very slow.

I made some code to compare the proposed solutions.
Note: I use scipy 0.11 and cannot use the ckdtree solution (only kdtree), which I expect to be slower. Could anyone with scipy v0.12+ run this code?
import numpy as np
from scipy.spatial.distance import cdist, pdist
from scipy.spatial import ckdtree
from scipy.spatial import kdtree

def close_pairs(X,max_d):
    d = cdist(X,X)
    I,J = (d<max_d).nonzero()
    IJ = np.sort(np.vstack((I,J)), axis=0)

    # remove diagonal elements
    IJ = IJ[:,np.diff(IJ,axis=0).ravel()!=0]

    # remove duplicates
    dt = np.dtype([('i',int),('j',int)])
    pairs = np.unique(IJ.T.view(dtype=dt)).view(int).reshape(-1,2)

    return pairs

def condensed_to_pair_indices(n,k):
    x = n-(4.*n**2-4*n-8*k+1)**.5/2-.5
    i = x.astype(int)
    j = k+i*(i+3-2*n)/2+1
    return i,j

def close_pairs_pdist(X,max_d):
    d = pdist(X)
    k = (d<max_d).nonzero()[0]
    return condensed_to_pair_indices(X.shape[0],k)

def close_pairs_triu(X,max_d):
    d = cdist(X,X)
    d1 = np.triu_indices(len(X))  # indices of the upper triangle including the diagonal
    d[d1] = max_d+1               # value that will not get selected when doing d<max_d in the next line
    I,J = (d<max_d).nonzero()
    pairs = np.vstack((I,J)).T
    return pairs

def close_pairs_ckdtree(X, max_d):
    tree = ckdtree.cKDTree(X)
    pairs = tree.query_pairs(max_d)
    return pairs  # the conversion to an array is not required

def close_pairs_kdtree(X, max_d):
    tree = kdtree.KDTree(X)
    pairs = tree.query_pairs(max_d)
    return pairs  # the conversion to an array is not required

methods = [close_pairs, close_pairs_pdist, close_pairs_triu, close_pairs_kdtree]  #, close_pairs_ckdtree]

def time_test(n=[10,50,100], max_d=[5,10,50], iter_num=100):
    import timeit
    for method in methods:
        print '-- time using ' + method.__name__ + ' ---'
        for ni in n:
            for d in max_d:
                setup = '\n'.join(['import numpy as np', 'import %s' % __name__, 'np.random.seed(0)', 'X = np.random.rand(%d,2)*100' % ni])
                stmt = '%s.%s(X,%f)' % (__name__, method.__name__, d)
                time = timeit.timeit(stmt=stmt, setup=setup, number=iter_num)/iter_num
                print 'n=%3d, max_d=%2d: \t%.2fms' % (ni, d, time*1000)
The output of time_test(iter_num=10, n=[20,100,500], max_d=[1,5,10]) is:
-- time using close_pairs ---
n= 20, max_d= 1: 0.22ms
n= 20, max_d= 5: 0.16ms
n= 20, max_d=10: 0.21ms
n=100, max_d= 1: 0.41ms
n=100, max_d= 5: 0.53ms
n=100, max_d=10: 0.97ms
n=500, max_d= 1: 7.12ms
n=500, max_d= 5: 12.28ms
n=500, max_d=10: 33.41ms
-- time using close_pairs_pdist ---
n= 20, max_d= 1: 0.11ms
n= 20, max_d= 5: 0.10ms
n= 20, max_d=10: 0.11ms
n=100, max_d= 1: 0.19ms
n=100, max_d= 5: 0.19ms
n=100, max_d=10: 0.19ms
n=500, max_d= 1: 2.31ms
n=500, max_d= 5: 2.82ms
n=500, max_d=10: 2.49ms
-- time using close_pairs_triu ---
n= 20, max_d= 1: 0.17ms
n= 20, max_d= 5: 0.16ms
n= 20, max_d=10: 0.16ms
n=100, max_d= 1: 0.83ms
n=100, max_d= 5: 0.80ms
n=100, max_d=10: 0.80ms
n=500, max_d= 1: 23.64ms
n=500, max_d= 5: 22.87ms
n=500, max_d=10: 22.96ms
-- time using close_pairs_kdtree ---
n= 20, max_d= 1: 1.71ms
n= 20, max_d= 5: 1.69ms
n= 20, max_d=10: 1.96ms
n=100, max_d= 1: 34.99ms
n=100, max_d= 5: 35.47ms
n=100, max_d=10: 34.91ms
n=500, max_d= 1: 253.87ms
n=500, max_d= 5: 255.05ms
n=500, max_d=10: 256.66ms
Conclusion:
The overall fastest method is close_pairs_pdist.
The initial method is relatively fast, but sensitive to both the number of samples and the percentage of pairs to return.
Both close_pairs_triu and close_pairs_kdtree are sensitive to the number of samples but relatively insensitive to the number of outputs.
The close_pairs_triu method is much faster than close_pairs_kdtree.
However, the ckdtree method still needs to be tested.
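For anyone on scipy 0.12+, re-running the benchmark with the cKDTree variant should only require putting it back into the list (untested here, since I only have 0.11):

methods = [close_pairs, close_pairs_pdist, close_pairs_triu, close_pairs_kdtree, close_pairs_ckdtree]
time_test(iter_num=10, n=[20,100,500], max_d=[1,5,10])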

Related

Numba parallel code slower than its sequential counterpart

I'm new to Numba and I'm trying to implement an old Fortran code in Python using Numba (version 0.54.1), but when I add parallel = True the program actually slows down. My program is very simple: I change the positions x and y on an L x L grid and, for each position in the grid, I perform a summation:
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
    L = len(x_grid)
    lypnv = np.zeros((L, L))
    for ii in nb.prange(L):
        for jj in range(L):
            x = x_grid[ii]
            y = y_grid[jj]
            beta0 = 0
            sumT11 = 0
            for j in range(N):
                y = (y - k*np.sin(x)) % (2*np.pi)
                x = (x + y) % (2*np.pi)
                J = np.array([[1.0, -k*np.cos(x)], [1.0, 1.0 - k*np.cos(x)]])
                beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
                T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
                sumT11 += np.log(abs(T11))/np.log(2)
                beta0 = beta
            lypnv[ii, jj] = sumT11/N
    return lypnv

# Compile
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1, 10)

# Parameters
N = int(1e3)
L = 128
pi = np.pi
k = 1.5
# Limits of the phase space
x0 = -pi
xf = pi
y0 = -pi
yf = pi
# Grid positions
x = np.linspace(x0, xf, L, endpoint=True)
y = np.linspace(y0, yf, L, endpoint=True)

lypnv = lyapunov_grid(x, y, k, N)
With parallel=False it takes about 8 s to run, however with parallel=True it takes about 14 s. I also tested another code from https://github.com/animator/mandelbrot-numba, and in that case the parallelization works.
import math
import numpy as np
import numba as nb

WIDTH = 1000
MAX_ITER = 1000

@nb.njit(parallel=True)
def mandelbrot(width, max_iter):
    pixels = np.zeros((width, width, 3), dtype=np.uint8)
    for y in nb.prange(width):
        for x in range(width):
            c0 = complex(3.0*x/width - 2, 3.0*y/width - 1.5)
            c = 0
            for i in range(1, max_iter):
                if abs(c) > 2:
                    log_iter = math.log(i)
                    pixels[y, x, :] = np.array([int(255*(1+math.cos(3.32*log_iter))/2),
                                                int(255*(1+math.cos(0.774*log_iter))/2),
                                                int(255*(1+math.cos(0.412*log_iter))/2)],
                                               dtype=np.uint8)
                    break
                c = c * c + c0
    return pixels

# compile
_ = mandelbrot(WIDTH, 10)

calcpixels = mandelbrot(WIDTH, MAX_ITER)
One main issue is that the second function call compiles the function again. Indeed, the types of the provided arguments change: in the first call the third argument is an integer (an int, mapped to np.int_), while in the second call the third argument (k) is a floating point number (a float, mapped to np.float64). Numba recompiles the function for different parameter types because they are deduced from the arguments, and it does not know you want to use a np.float64 type for the third argument (since the first time the function is compiled for an np.int_ type). One simple solution is to change the first call to:
_ = lyapunov_grid(np.linspace(0, 1, 10), np.linspace(0, 1, 10), 1.0, 10)
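To check whether a recompilation happened (my own addition, not from the original code), you can inspect the dispatcher's compiled signatures after running the script; with the original int argument there will be two entries differing in the type of the third argument, with the fixed call above only one:

# one entry per compiled specialization of lyapunov_grid
print(lyapunov_grid.signatures)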
However, this is not a robust way to fix the problem. You can specify the parameter types to Numba so that it compiles the function at declaration time. This also removes the need to artificially call the function (with useless parameters).
@nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
Note that (J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)) is zero the first time, resulting in a division by 0.
Another main issue comes from the allocation of many small arrays in the loop, causing contention in the standard allocator (see this post for more information). While Numba could theoretically optimize this (i.e. replace the array with local variables), it currently does not, resulting in a huge slowdown and contention. Fortunately, in your case, you do not need to allocate the array in the innermost loop: you can create it once in the encompassing loop and modify it in the innermost loop. Here is the optimized code:
@nb.njit('float64[:,:](float64[::1], float64[::1], float64, float64)', parallel=True)
def lyapunov_grid(x_grid, y_grid, k, N):
    L = len(x_grid)
    lypnv = np.zeros((L, L))
    for ii in nb.prange(L):
        J = np.ones((2, 2), dtype=np.float64)
        for jj in range(L):
            x = x_grid[ii]
            y = y_grid[jj]
            beta0 = 0
            sumT11 = 0
            for j in range(N):
                y = (y - k*np.sin(x)) % (2*np.pi)
                x = (x + y) % (2*np.pi)
                J[0, 1] = -k*np.cos(x)
                J[1, 1] = 1.0 - k*np.cos(x)
                beta = np.arctan((-J[1,0]*np.cos(beta0) + J[1,1]*np.sin(beta0))/(J[0,0]*np.cos(beta0) - J[0,1]*np.sin(beta0)))
                T11 = np.cos(beta0)*(J[0,0]*np.cos(beta) - J[1,0]*np.sin(beta)) - np.sin(beta0)*(J[0,1]*np.cos(beta) - J[1,1]*np.sin(beta))
                sumT11 += np.log(abs(T11))/np.log(2)
                beta0 = beta
            lypnv[ii, jj] = sumT11/N
    return lypnv
Here are the results on an old 2-core machine (with 4 hardware threads):
Original sequential: 15.9 s
Original parallel: 11.9 s
Fix-build sequential: 15.7 s
Fix-build parallel: 10.1 s
Optimized sequential: 2.73 s
Optimized parallel: 0.94 s
The optimized implementation is much faster than the others. The optimized parallel version scales very well compared with the original one (2.9 times faster than the optimized sequential version). Finally, the best version is about 12 times faster than the original parallel version. I expect a much faster computation on a recent machine with many more cores.

Fastest way to perform calculations on every NXN sub-array in 2D numpy array

I have a 2D numpy array which represents a grayscale image. I need to extract every N x N sub-array within that array, with a specified overlap between sub-arrays, and calculate a property such as the mean, standard deviation, or median.
The code below performs this task but is quite slow because it uses Python for loops. Any ideas on how to vectorize this calculation or otherwise speed it up?
import numpy as np

img = np.random.randn(100, 100)
N = 4
step = 2

h, w = img.shape
out = []
for i in range(0, h - N, step):
    outr = []
    for j in range(0, w - N, step):
        outr.append(np.mean(img[i:i+N, j:j+N]))
    out.append(outr)
out = np.array(out)
For mean and standard deviation, there is a fast cumsum based solution.
Here are timings for a 500x200 image, 30x20 window and step sizes 5 and 3. For comparison I use skimage.util.view_as_windows with numpy mean and std.
mn + sd using cumsum 1.1531693299184553 ms
mn using view_as_windows 3.495307120028883 ms
sd using view_as_windows 21.855629019846674 ms
Code:
import numpy as np
from math import gcd
from timeit import timeit

def wsum2d(A, winsz, stepsz, canoverwriteA=False):
    M, N = A.shape
    m, n = winsz
    i, j = stepsz
    for X, x, s in ((M, m, i), (N, n, j)):
        g = gcd(x, s)
        if g > 1:
            X //= g
            x //= g
            s //= g
            A = A[:X*g].reshape(X, g, -1).sum(axis=1)
        elif not canoverwriteA:
            A = A.copy()
            canoverwriteA = True
        A[x:] -= A[:-x]
        A = A.cumsum(axis=0)[x-1::s]
        A = A.T
    return A

def w2dmnsd(A, winsz, stepsz):
    # combine A and A*A into a complex array, so overheads apply only once
    M21 = wsum2d(A*(A+1j), winsz, stepsz, True)
    M2, mean_ = M21.real / np.prod(winsz), M21.imag / np.prod(winsz)
    sd = np.sqrt(M2 - mean_*mean_)
    return mean_, sd

# test
np.random.seed(0)
A = np.random.random((500, 200))
wsz = (30, 20)
stpsz = (5, 3)
mn, sd = w2dmnsd(A, wsz, stpsz)

from skimage.util import view_as_windows
Av = view_as_windows(A, wsz, stpsz)  # this emits a warning on my system
assert np.allclose(mn, np.mean(Av, axis=(2, 3)))
assert np.allclose(sd, np.std(Av, axis=(2, 3)))

from timeit import repeat
print('mn + sd using cumsum ', min(repeat(lambda: w2dmnsd(A, wsz, stpsz), number=100))*10, 'ms')
print('mn using view_as_windows', min(repeat(lambda: np.mean(Av, axis=(2, 3)), number=100))*10, 'ms')
print('sd using view_as_windows', min(repeat(lambda: np.std(Av, axis=(2, 3)), number=100))*10, 'ms')
If Numba is an option, the only thing to do is to avoid the list appends (it works with list appends too, but slower).
To make use of parallelization too, I rewrote the implementation a bit to avoid the step argument within range, which is not supported when using parfor.
Example
import numpy as np
import numba as nb

@nb.njit(error_model='numpy', parallel=True)
def calc_p(img, N, step):
    h, w = img.shape
    i_w = (h - N)//step
    j_w = (w - N)//step

    out = np.empty((i_w, j_w))
    for i in nb.prange(0, i_w):
        for j in range(0, j_w):
            out[i, j] = np.std(img[i*step:i*step+N, j*step:j*step+N])
    return out

def calc_n(img, N, step):
    h, w = img.shape
    out = []
    for i in range(0, h - N, step):
        outr = []
        for j in range(0, w - N, step):
            outr.append(np.std(img[i:i+N, j:j+N]))
        out.append(outr)
    return np.array(out)
Timings
All timings are without compilation overhead of about 0.5s (the first call to the function is excluded from the timings).
#Data
img = np.random.randn(100, 100)
N = 4
step = 2
calc_n :17ms
calc_p :0.033ms
Because this is actually a rolling mean there is further room for improvement if N gets larger.
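To illustrate that idea, here is a minimal sketch of my own (using a summed-area table rather than the technique above, and assuming the same (i, j) ranges as the question's loop):

import numpy as np

def rolling_mean_2d(img, N, step):
    # S[i, j] holds the sum of img[:i, :j] (summed-area table)
    S = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    S[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    i = np.arange(0, img.shape[0] - N, step)[:, None]
    j = np.arange(0, img.shape[1] - N, step)[None, :]
    # sum of each N x N window via four table lookups, then normalize
    window_sums = S[i + N, j + N] - S[i, j + N] - S[i + N, j] + S[i, j]
    return window_sums / (N * N)

img = np.random.randn(100, 100)
assert np.allclose(rolling_mean_2d(img, 4, 2),
                   [[np.mean(img[i:i+4, j:j+4]) for j in range(0, 96, 2)]
                    for i in range(0, 96, 2)])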
You could use scikit-image's block_reduce, so your code becomes:
import numpy as np
import skimage.measure
N = 4
# Your main array
a = np.arange(9).reshape(3,3)
mean = skimage.measure.block_reduce(a, (N,N), np.mean)
std_dev = skimage.measure.block_reduce(a, (N,N), np.std)
median = skimage.measure.block_reduce(a, (N,N), np.median)
However, the above code only works for non-overlapping blocks; it does not let you choose a step/stride independently of N.
For mean, you could use mean pooling which is available in any modern day ML package. As for median and standard deviation, this seems the right approach.
The general case can be solved using scipy.ndimage.generic_filter:
import numpy as np
from scipy.ndimage import generic_filter
img = np.random.randn(100, 100)
N = 4
filtered = generic_filter(img.astype(float), np.std, size=N)
step = 2
output = filtered[::step, ::step]
However, this may actually run not much faster than a simple for loop.
To apply a mean or median filter you can use skimage.rank.mean and skimage.rank.median, respectively, which should be faster. There is also scipy.ndimage.median_filter. Otherwise, the mean can also be computed effectively through a simple convolution with an (N, N) array with values 1./N^2 (see the sketch below). For the standard deviation you probably have to bite the bullet and use generic_filter, unless your step size is larger than or equal to N.
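To illustrate the convolution route for the mean, here is a sketch (with the caveat that it may include one trailing row/column that the question's loop, which stops before h - N, does not):

import numpy as np
from scipy.signal import fftconvolve

img = np.random.randn(100, 100)
N, step = 4, 2
kernel = np.full((N, N), 1.0 / N**2)
# 'valid' mode gives one output per fully-contained window; then subsample by step
means = fftconvolve(img, kernel, mode='valid')[::step, ::step]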

Vectorizing for loop with repeated indices in python

I am trying to optimize a snippet that gets called a lot (millions of times) so any type of speed improvement (hopefully removing the for-loop) would be great.
I am computing a correlation function of some j'th particle with all others
C_j(|r-r'|) = sqrt(E((s_j(r')-s_k(r))^2)) averaged over k.
My idea is to have a variable corrfun which bins the data into some bins (the r, defined elsewhere). I find which bin of r each s_k belongs to, and this is stored in ind. So ind[0] is the index of r (and thus of corrfun) to which the j=0 point corresponds. Multiple points can fall into the same bin (in fact I want bins to be big enough to contain multiple points), so I sum together all of the (s_j(r')-s_k(r))^2 and then divide by the number of points in that bin (stored in the variable rw). The code I ended up writing for this is the following (np is numpy):
for k, v in enumerate(ind):
    if j==k:
        continue
    corrfun[v] += (s[k]-s[j])**2
    rw[v] += 1
rw2 = rw
rw2[rw < 1] = 1
corrfun = np.sqrt(np.divide(corrfun, rw2))
Note, the rw2 business is because I want to avoid divide-by-zero problems, but I do return the rw array and I want to be able to differentiate between the rw=0 and rw=1 elements. Perhaps there is a more elegant solution for this as well.
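(One cleaner option, which the combined version at the bottom of this thread also ends up using, is np.maximum; note also that rw2 = rw does not make a copy, so the line after it modifies rw too:)

corrfun = np.sqrt(np.divide(corrfun, np.maximum(rw, 1)))  # rw itself stays untouched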
Is there a way to make the for-loop faster? While I would like to not add the self-interaction (j==k), I am even OK with having self-interaction if it means I can get a significantly faster calculation (the length of ind is ~1e6, so self-interaction is probably insignificant anyway).
Thank you!
Ilya
Edit:
Here is the full code. Note, in the full code I am averaging over j as well.
import numpy as np

def twopointcorr(x,y,s,dr):
    width = np.max(x)-np.min(x)
    height = np.max(y)-np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)
    print(r)
    corrfun = r*0
    rw = r*0
    print(maxR)
    ''' go through all points'''
    for j in range(0, n-1):
        hypot = np.sqrt((x[j]-x)**2+(y[j]-y)**2)
        ind = [np.abs(r-h).argmin() for h in hypot]
        for k, v in enumerate(ind):
            if j==k:
                continue
            corrfun[v] += (s[k]-s[j])**2
            rw[v] += 1
    rw2 = rw
    rw2[rw < 1] = 1
    corrfun = np.sqrt(np.divide(corrfun, rw2))
    return r, corrfun, rw
I test and debug it the following way:
from twopointcorr import twopointcorr
import numpy as np
import matplotlib.pyplot as plt
import time
n=1000
x = np.random.rand(n)
y = np.random.rand(n)
s = np.random.rand(n)
print('running two point corr functinon')
start_time = time.time()
r,corrfun,rw = twopointcorr(x,y,s,0.1)
print("--- Execution time is %s seconds ---" % (time.time() - start_time))
fig1=plt.figure()
plt.plot(r, corrfun,'-x')
fig2=plt.figure()
plt.plot(r, rw,'-x')
plt.show()
Again, the main issue is that in the real dataset n~1E6. I can resample to make it smaller, of course, but I would love to actually crank through the dataset.
Here is the code that uses broadcasting, hypot, round, and bincount to remove all the loops:
def twopointcorr2(x, y, s, dr):
    width = np.max(x)-np.min(x)
    height = np.max(y)-np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)

    osub = lambda x: np.subtract.outer(x, x)

    ind = np.clip(np.round(np.hypot(osub(x), osub(y)) / dr), 0, len(r)-1).astype(int)
    rw = np.bincount(ind.ravel())
    rw[0] -= len(x)
    corrfun = np.bincount(ind.ravel(), (osub(s)**2).ravel())
    return r, corrfun, rw
to compare, I modified your code as follows:
def twopointcorr(x,y,s,dr):
    width = np.max(x)-np.min(x)
    height = np.max(y)-np.min(y)
    n = len(x)
    maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR, dr)
    corrfun = r*0
    rw = r*0
    for j in range(0, n):
        hypot = np.sqrt((x[j]-x)**2+(y[j]-y)**2)
        ind = [np.abs(r-h).argmin() for h in hypot]
        for k, v in enumerate(ind):
            if j==k:
                continue
            corrfun[v] += (s[k]-s[j])**2
            rw[v] += 1
    return r, corrfun, rw
and here is the code to check the results:
import numpy as np
n=1000
x = np.random.rand(n)
y = np.random.rand(n)
s = np.random.rand(n)
r1, corrfun1, rw1 = twopointcorr(x,y,s,0.1)
r2, corrfun2, rw2 = twopointcorr2(x,y,s,0.1)
assert np.allclose(r1, r2)
assert np.allclose(corrfun1, corrfun2)
assert np.allclose(rw1, rw2)
and the %timeit results:
%timeit twopointcorr(x,y,s,0.1)
%timeit twopointcorr2(x,y,s,0.1)
outputs:
1 loop, best of 3: 5.16 s per loop
10 loops, best of 3: 134 ms per loop
Your original code on my system runs in about 5.7 seconds. I fully vectorized the inner loop and got it to run in 0.39 seconds. Simply replace your "go through all points" loop with this:
points = np.column_stack((x,y))
hypots = scipy.spatial.distance.cdist(points, points)
inds = np.rint(hypots.clip(max=maxR) / dr).astype(int)

# go through all points
for j in range(n):  # n.b. previously n-1, not sure why
    ind = inds[j]
    np.add.at(corrfun, ind, (s - s[j])**2)
    np.add.at(rw, ind, 1)
    rw[ind[j]] -= 1  # subtract self
The first observation was that your hypot code was computing 2D distances, so I replaced that with cdist from SciPy to do it all in a single call. The second was that the inner for loop was slow, and thanks to an insightful comment from @hpaulj I vectorized that as well using np.add.at().
Since you asked how to vectorize the inner loop as well, I did that later. It now takes 0.25 seconds to run, for a total speedup of over 20x. Here's the final code:
points = np.column_stack((x,y))
hypots = scipy.spatial.distance.cdist(points, points)
inds = np.rint(hypots.clip(max=maxR) / dr).astype(int)
sn = np.tile(s, (n,1)) # n copies of s
diffs = (sn - sn.T)**2 # squares of pairwise differences
np.add.at(corrfun, inds, diffs)
rw = np.bincount(inds.flatten(), minlength=len(r))
np.subtract.at(rw, inds.diagonal(), 1) # subtract self
This uses more memory but does produce a substantial speedup vs. the single-loop version above.
OK, so as it turns out outer products are incredibly memory-expensive. However, using the answers from @HYRY and @JohnZwinck I was able to make code that is still roughly linear in n in memory and computes fast (0.5 seconds for the test case):
import numpy as np

def twopointcorr(x,y,s,dr,maxR=-1):
    width = np.max(x)-np.min(x)
    height = np.max(y)-np.min(y)
    n = len(x)
    if maxR < dr:
        maxR = np.sqrt((width/2)**2 + (height/2)**2)
    r = np.arange(0, maxR+dr, dr)

    corrfun = r*0
    rw = r*0
    for j in range(0, n):
        ind = np.clip(np.round(np.hypot(x[j]-x,y[j]-y) / dr), 0, len(r)-1).astype(int)
        np.add.at(corrfun, ind, (s - s[j])**2)
        np.add.at(rw, ind, 1)
    rw[0] -= n
    corrfun = np.sqrt(np.divide(corrfun, np.maximum(rw,1)))
    r = np.delete(r,-1)
    rw = np.delete(rw,-1)
    corrfun = np.delete(corrfun,-1)
    return r, corrfun, rw

extract the N closest pairs from a numpy distance array

I have a large, symmetric, 2D distance array. I want to get the closest N pairs of observations.
The array is stored as a numpy condensed array, and has on the order of 100 million observations.
Here's an example to get the 100 closest distances on a smaller array (~500k observations), but it's a lot slower than I would like.
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = np.ceil(np.sqrt(2 * len(dists)))
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)
It seems to me like there must be quicker ways to do this with standard numpy or scipy functions, but I'm stumped.
NB If lots of pairs are equidistant, that's OK and I don't care about their ordering in that case.
You don't need to calculate ti in each call to condensed_to_square_index. Here's a basic modification that calculates it only once:
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100

r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(ti, c):
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = np.ceil(np.sqrt(2 * len(dists)))
ti = np.triu_indices(n, 1)

for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)
You can also vectorize the creation of r:
r = zip(ti[0][closest] + 1, ti[1][closest] + 1)
or
r = np.vstack(ti)[:, closest] + 1
You can speed up the location of the minimum values very notably if you are using numpy 1.8 using np.partition:
def smallest_n(a, n):
    return np.sort(np.partition(a, n)[:n])

def argsmallest_n(a, n):
    ret = np.argpartition(a, n)[:n]
    b = np.take(a, ret)
    return np.take(ret, np.argsort(b))

dists = np.random.rand(1000*999//2)  # a pdist array
In [3]: np.all(argsmallest_n(dists, 100) == np.argsort(dists)[:100])
Out[3]: True
In [4]: %timeit np.argsort(dists)[:100]
10 loops, best of 3: 73.5 ms per loop
In [5]: %timeit argsmallest_n(dists, 100)
100 loops, best of 3: 5.44 ms per loop
And once you have the smallest indices, you don't need a loop to extract the indices, do it in a single shot:
closest = argsmallest_n(dists, 100)
tu = np.triu_indices(1000, 1)
pairs = np.column_stack((np.take(tu[0], closest),
                         np.take(tu[1], closest))) + 1
The best solution probably won't generate all of the distances.
Proposal:
Make a heap of max size 100 (if it grows bigger, reduce it).
Use the Closest Pair algorithm to find the closest pair.
Add the pair to the heap (priority queue).
Choose one of that pair. Add its 99 closest neighbors to the heap.
Remove the chosen point from the list.
Find the next closest pair and repeat. The number of neighbors added is 100 minus the number of times you ran the Closest Pair algorithm.
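Here is a rough sketch of a simpler variant in the same spirit (not the Closest Pair algorithm described above, but it likewise avoids generating all pairwise distances, by only querying each point's k nearest neighbours with a cKDTree; k is an assumption and must be large enough that the true closest pairs are among the neighbours returned):

import heapq
import numpy as np
from scipy.spatial import cKDTree

def closest_n_pairs(points, n_pairs, k=10):
    tree = cKDTree(points)
    dists, idx = tree.query(points, k=k + 1, p=1)  # p=1 matches 'cityblock'; column 0 is usually the point itself
    heap, seen = [], set()
    for i in range(len(points)):
        for d, j in zip(dists[i, 1:], idx[i, 1:]):
            pair = (min(i, j), max(i, j))
            if i == j or pair in seen:
                continue
            seen.add(pair)
            heapq.heappush(heap, (-d, pair))  # max-heap via negation
            if len(heap) > n_pairs:
                heapq.heappop(heap)           # drop the current largest distance
    return sorted((-negd, i, j) for negd, (i, j) in heap)

For the question's data this would be called as closest_n_pairs(c, 100), raising k if you suspect the kept pairs are not the global closest ones.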

How do I speed up a Python nested loop?

I'm trying to calculate the gravity effect of a buried object by calculating the effect of each side of the body, then summing up the contributions to get one measurement at one station, and repeating for a number of stations. The code is as follows (the body is a square and the code works clockwise around it, which is why it goes from -x back to the -x coordinates):
import time
import scipy as si  # the code below uses scipy's re-exported numpy functions via `si`

grav = []
x = si.arange(-30.0,30.0,0.5)
#-9.79742526 9.78716693 22.32153704 27.07382349 2138.27146193
xcorn = (-9.79742526,9.78716693 ,9.78716693 ,-9.79742526,-9.79742526)
zcorn = (22.32153704,22.32153704,27.07382349,27.07382349,22.32153704)
gamma = (6.672*(10**-11))  #'N m^2 / Kg^2'
rho = 2138.27146193  #'Kg / m^3'
grav = []
iter_time = []

def procedure():
    for i in si.arange(len(x)):  # cycles position
        t0 = time.clock()
        sum_lines = 0.0
        for n in si.arange(len(xcorn)-1):  # cycles corners
            x1 = xcorn[n]-x[i]
            x2 = xcorn[n+1]-x[i]
            z1 = zcorn[n]-0.0  # just depth to corner since all observations are on the surface
            z2 = zcorn[n+1]-0.0
            r1 = ((z1**2) + (x1**2))**0.5
            r2 = ((z2**2) + (x2**2))**0.5
            O1 = si.arctan2(z1,x1)
            O2 = si.arctan2(z2,x2)
            denom = z2-z1
            if denom == 0.0:
                denom = 1.0e-6
            alpha = (x2-x1)/denom
            beta = ((x1*z2)-(x2*z1))/denom
            factor = (beta/(1.0+(alpha**2)))
            term1 = si.log(r2/r1)  # natural log
            term2 = alpha*(O2-O1)
            sum_lines = sum_lines + (factor*(term1-term2))
        sum_lines = sum_lines*2*gamma*rho
        grav.append(sum_lines)
        t1 = time.clock()
        dt = t1-t0
        iter_time.append(dt)
Any help in speeding this loop up would be appreciated. Thanks.
Your xcorn and zcorn values repeat, so consider caching the result of some of the computations.
Take a look at the timeit and profile modules to get more information about what is taking the most computational time.
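For example, a minimal profiling sketch (assuming procedure() is defined as in the question):

import cProfile
cProfile.run('procedure()', sort='cumulative')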
It is very inefficient to access individual elements of a numpy array in a Python loop. For example, this Python loop:
for i in xrange(0, len(a), 2):
    a[i] = i
would be much slower than:
a[::2] = np.arange(0, len(a), 2)
You could use a better algorithm (lower time complexity) or use vector operations on numpy arrays as in the example above. But the quicker route might be just to compile the code using Cython:
#cython: boundscheck=False, wraparound=False
#procedure_module.pyx
import numpy as np
cimport numpy as np

ctypedef np.float64_t dtype_t

def procedure(np.ndarray[dtype_t,ndim=1] x,
              np.ndarray[dtype_t,ndim=1] xcorn):
    cdef:
        Py_ssize_t i, j
        dtype_t x1, x2, z1, z2, r1, r2, O1, O2
        np.ndarray[dtype_t,ndim=1] grav = np.empty_like(x)

    for i in range(x.shape[0]):
        for j in range(xcorn.shape[0]-1):
            x1 = xcorn[j]-x[i]
            x2 = xcorn[j+1]-x[i]
            ...
        grav[i] = ...
    return grav
It is not necessary to define all types but if you need a significant speed up compared to Python you should define at least types of arrays and loop indexes.
You could use cProfile (Cython supports it) instead of manual calls to time.clock().
To call procedure():
#!/usr/bin/env python
import pyximport; pyximport.install() # pip install cython
import numpy as np
from procedure_module import procedure
x = np.arange(-30.0,30.0,0.5)
xcorn = np.array((-9.79742526,9.78716693 ,9.78716693 ,-9.79742526,-9.79742526))
grav = procedure(x, xcorn)
