I have a large, symmetric, 2D distance array. I want to get the closest N pairs of observations.
The array is stored as a condensed numpy array (as produced by scipy.spatial.distance.pdist) and has on the order of 100 million entries.
Here's an example that gets the 100 closest distances on a smaller array (~500k entries), but it's a lot slower than I would like.
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')
# these are the indices of the closest N observations
closest = dists.argsort()[:N]
# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)
It seems to me like there must be quicker ways to do this with standard numpy or scipy functions, but I'm stumped.
NB If lots of pairs are equidistant, that's OK and I don't care about their ordering in that case.
You don't need to calculate ti in each call to condensed_to_square_index. Here's a basic modification that calculates it only once:
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')
# these are the indices of the closest N observations
closest = dists.argsort()[:N]
# but it's really slow to get out the pairs of observations
def condensed_to_square_index(ti, c):
    return ti[0][c] + 1, ti[1][c] + 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))
ti = np.triu_indices(n, 1)
for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)
You can also vectorize the creation of r:
r = zip(ti[0][closest] + 1, ti[1][closest] + 1)
or
r = np.vstack(ti)[:, closest] + 1
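If memory is a concern at your full scale (triu_indices materializes two index arrays as long as the whole condensed array), there is also a closed-form conversion from a condensed index to its (row, column) pair. This is my own sketch, not part of the answer above; condensed_to_pair is a hypothetical helper name, and it assumes 0-based pdist indexing with m original observations:
import numpy as np

def condensed_to_pair(k, m):
    # map condensed pdist indices k to 0-based (i, j) pairs for m observations
    k = np.asarray(k)
    i = ((2*m - 1 - np.sqrt((2*m - 1)**2 - 8*k)) // 2).astype(int)
    j = (k - i*(2*m - i - 1)//2 + i + 1).astype(int)
    return i, j

# quick check against triu_indices for a small m
m = 10
ti = np.triu_indices(m, 1)
ks = np.arange(m*(m - 1)//2)
assert np.array_equal(np.vstack(condensed_to_pair(ks, m)), np.vstack(ti))
With this, the closest indices from above can be mapped to pairs without building the full index arrays, e.g. i, j = condensed_to_pair(closest, 1000); add 1 to both outputs if you want the 1-based labels used in the code above.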
You can speed up the location of the minimum values very noticeably if you are using numpy 1.8 or later, by using np.partition:
def smallest_n(a, n):
    return np.sort(np.partition(a, n)[:n])

def argsmallest_n(a, n):
    ret = np.argpartition(a, n)[:n]
    b = np.take(a, ret)
    return np.take(ret, np.argsort(b))
dists = np.random.rand(1000*999//2) # a pdist array
In [3]: np.all(argsmallest_n(dists, 100) == np.argsort(dists)[:100])
Out[3]: True
In [4]: %timeit np.argsort(dists)[:100]
10 loops, best of 3: 73.5 ms per loop
In [5]: %timeit argsmallest_n(dists, 100)
100 loops, best of 3: 5.44 ms per loop
And once you have the smallest indices, you don't need a loop to extract the pairs; do it in a single shot:
closest = argsmallest_n(dists, 100)
tu = np.triu_indices(1000, 1)
pairs = np.column_stack((np.take(tu[0], closest),
np.take(tu[1], closest))) + 1
The best solution probably won't generate all of the distances.
Proposal:
1. Make a heap with a maximum size of 100 (if it grows bigger, reduce it).
2. Use the Closest Pair algorithm to find the closest pair.
3. Add the pair to the heap (priority queue).
4. Choose one point of that pair and add its 99 closest neighbors to the heap.
5. Remove the chosen point from the list.
6. Find the next closest pair and repeat. The number of neighbors added is 100 minus the number of times you ran the Closest Pair algorithm.
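To illustrate just the bounded-heap bookkeeping from this proposal (not the Closest Pair search itself), here is a rough sketch that streams the condensed distances in chunks and keeps only the N smallest seen so far. It assumes you can produce the distances block by block rather than in one giant pdist call, and n_smallest_streaming is a name I made up:
import heapq
import numpy as np

def n_smallest_streaming(chunks, n):
    # bounded max-heap of the n smallest (distance, condensed index) pairs seen
    # so far; distances are negated so heap[0] is always the current worst
    heap = []
    for values, offset in chunks:
        # pre-filter each chunk with argpartition so at most n candidates
        # reach the Python-level heap
        kth = min(n, len(values) - 1)
        for local_idx in np.argpartition(values, kth)[:n]:
            item = (-values[local_idx], offset + int(local_idx))
            if len(heap) < n:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted((-neg_d, idx) for neg_d, idx in heap)

# usage sketch on a fake condensed array, fed in 1M-element chunks
dists = np.random.rand(10_000_000)
chunks = ((dists[i:i+1_000_000], i) for i in range(0, len(dists), 1_000_000))
print(n_smallest_streaming(chunks, 5))
The heap never holds more than n items, so memory stays constant regardless of how many distances you stream through it.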
Related
I have a 2D numpy array which represents a grayscale image. I need to extract every N x N sub-array within that array, with a specified overlap between sub-arrays, and calculate a property such as the mean, standard deviation, or median.
The code below performs this task but is quite slow because it uses Python for loops. Any ideas on how to vectorize this calculation or otherwise speed it up?
import numpy as np
img = np.random.randn(100, 100)
N = 4
step = 2
h, w = img.shape
out = []
for i in range(0, h - N, step):
    outr = []
    for j in range(0, w - N, step):
        outr.append(np.mean(img[i:i+N, j:j+N]))
    out.append(outr)
out = np.array(out)
For the mean and standard deviation, there is a fast cumsum-based solution.
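As a minimal 1-D illustration of the identity the code below exploits (my own sketch, not part of the timed solution): the sum over every window of length w with step s is a difference of two cumulative sums.
import numpy as np

a = np.arange(10, dtype=float)
w, s = 4, 2
c = np.concatenate(([0.0], np.cumsum(a)))
win_sums = (c[w:] - c[:-w])[::s]          # sum of a[i:i+w] for i = 0, s, 2s, ...
ref = [a[i:i+w].sum() for i in range(0, len(a) - w + 1, s)]
assert np.allclose(win_sums, ref)
The 2-D code below applies the same idea along each axis in turn, and packs the sum and the sum of squares into one complex-valued pass so both the mean and the standard deviation come out of a single call.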
Here are timings for a 500x200 image, 30x20 window and step sizes 5 and 3. For comparison I use skimage.util.view_as_windows with numpy mean and std.
mn + sd using cumsum 1.1531693299184553 ms
mn using view_as_windows 3.495307120028883 ms
sd using view_as_windows 21.855629019846674 ms
Code:
import numpy as np
from math import gcd
from timeit import timeit
def wsum2d(A, winsz, stepsz, canoverwriteA=False):
    # 2-D sliding-window sum: windowed cumsum along each axis in turn
    M, N = A.shape
    m, n = winsz
    i, j = stepsz
    for X, x, s in ((M, m, i), (N, n, j)):
        g = gcd(x, s)
        if g > 1:
            X //= g
            x //= g
            s //= g
            A = A[:X*g].reshape(X, g, -1).sum(axis=1)
        elif not canoverwriteA:
            A = A.copy()
            canoverwriteA = True
        A[x:] -= A[:-x]
        A = A.cumsum(axis=0)[x-1::s]
        A = A.T
    return A

def w2dmnsd(A, winsz, stepsz):
    # combine A and A*A into a complex, so overheads apply only once
    M21 = wsum2d(A*(A+1j), winsz, stepsz, True)
    M2, mean_ = M21.real / np.prod(winsz), M21.imag / np.prod(winsz)
    sd = np.sqrt(M2 - mean_*mean_)
    return mean_, sd
# test
np.random.seed(0)
A = np.random.random((500, 200))
wsz = (30, 20)
stpsz = (5, 3)
mn, sd = w2dmnsd(A, wsz, stpsz)
from skimage.util import view_as_windows
Av = view_as_windows(A, wsz, stpsz) # this emits a warning on my system
assert np.allclose(mn, np.mean(Av, axis=(2, 3)))
assert np.allclose(sd, np.std(Av, axis=(2, 3)))
from timeit import repeat
print('mn + sd using cumsum ', min(repeat(lambda: w2dmnsd(A, wsz, stpsz), number=100))*10, 'ms')
print('mn using view_as_windows', min(repeat(lambda: np.mean(Av, axis=(2, 3)), number=100))*10, 'ms')
print('sd using view_as_windows', min(repeat(lambda: np.std(Av, axis=(2, 3)), number=100))*10, 'ms')
If Numba is an option, the only thing to do is to avoid the list appends (it works with list appends too, but more slowly).
To make use of parallelization as well, I rewrote the implementation a bit to avoid the step argument within range, which is not supported when using parallel loops (nb.prange).
Example
import numpy as np
import numba as nb

@nb.njit(error_model='numpy', parallel=True)
def calc_p(img, N, step):
    h, w = img.shape
    i_w = (h - N) // step
    j_w = (w - N) // step
    out = np.empty((i_w, j_w))
    for i in nb.prange(0, i_w):
        for j in range(0, j_w):
            out[i, j] = np.std(img[i*step:i*step+N, j*step:j*step+N])
    return out

def calc_n(img, N, step):
    h, w = img.shape
    out = []
    for i in range(0, h - N, step):
        outr = []
        for j in range(0, w - N, step):
            outr.append(np.std(img[i:i+N, j:j+N]))
        out.append(outr)
    return np.array(out)
Timings
All timings are without compilation overhead of about 0.5s (the first call to the function is excluded from the timings).
#Data
img = np.random.randn(100, 100)
N = 4
step = 2
calc_n :17ms
calc_p :0.033ms
Because this is effectively a rolling computation, there is further room for improvement if N gets larger.
You could use scikit-image block_reduce:
So your code becomes:
import numpy as np
import skimage.measure
N = 4
# Your main array
a = np.arange(9).reshape(3,3)
mean = skimage.measure.block_reduce(a, (N,N), np.mean)
std_dev = skimage.measure.block_reduce(a, (N,N), np.std)
median = skimage.measure.block_reduce(a, (N,N), np.median)
Note, however, that block_reduce applies the function over non-overlapping blocks, so this only covers the case where the step between sub-arrays equals the window size N (i.e. no overlap).
For the mean, you could use mean pooling, which is available in any modern-day ML package. For the median and standard deviation, this seems like the right approach.
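For completeness, here is a numpy-only sketch of that mean-pooling idea for non-overlapping N x N blocks (my own example; it assumes the image dimensions are divisible by N):
import numpy as np

img = np.random.randn(100, 100)
N = 4
# reshape into (rows of blocks, N, cols of blocks, N) and average each block
pooled = img.reshape(img.shape[0] // N, N, img.shape[1] // N, N).mean(axis=(1, 3))
print(pooled.shape)   # (25, 25), one mean per non-overlapping 4x4 block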
The general case can be solved using scipy.ndimage.generic_filter:
import numpy as np
from scipy.ndimage import generic_filter
img = np.random.randn(100, 100)
N = 4
filtered = generic_filter(img.astype(float), np.std, size=N)
step = 2
output = filtered[::step, ::step]
However, this may not actually run much faster than a simple for loop.
To apply a mean or median filter you can use skimage.filters.rank.mean and skimage.filters.rank.median, respectively, which should be faster. There is also scipy.ndimage.median_filter. Otherwise, the mean can also be computed efficiently through simple convolution with an (N, N) array whose values are all 1./N**2. For the standard deviation you probably have to bite the bullet and use generic_filter, unless your step size is larger than or equal to N.
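A sketch of that convolution route for the mean, using scipy.signal.fftconvolve in 'valid' mode and then taking every step-th result (this is my own example, not part of the answer above):
import numpy as np
from scipy.signal import fftconvolve

img = np.random.randn(100, 100)
N, step = 4, 2
kernel = np.full((N, N), 1.0 / N**2)
# 'valid' keeps only fully overlapping windows; then apply the step
means = fftconvolve(img, kernel, mode='valid')[::step, ::step]
# cross-check one window against the naive computation
assert np.isclose(means[0, 0], img[:N, :N].mean())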
Given an array x of length 1000, and y of length 500k, we can compute the index k for which x is the closest to "y-shifted by k indices":
mindistance = np.inf  # infinity
for k in range(len(y)-1000):
    t = np.sum(np.power(x-y[k:k+1000], 2))
    if t < mindistance:
        mindistance = t
        index = k
print index
# x is close to y[index:index+1000]
According to my tests, this is computationally costly. Is there a clever numpy way to compute it faster?
Note: it seems that if I reduce the length of x from 1000 to 100, it doesn't change the computation time much. The slowness seems to come mostly from the for k in range(...) loop. How can I speed it up?
This can be done with np.correlate which computes not the coefficient of correlation (as one might guess), but simply the sum of products like x[n]*y[m] (here m is n plus some shift). Since
(x[n] - y[m])**2 = x[n]**2 - 2*x[n]*y[m] + y[m]**2
we can get the sum of squares of differences from this, by adding the sums of squares of x and of a part of y. (Actually, the sum of x[n]**2 will not depend on the shift, since we'll always get just np.sum(x**2), but I'll include it all the same.) The sum of a part of y**2 can also be found in this way, by replacing x with an all-ones array of the same size, and y with y**2.
Here is an example.
import numpy as np
x = np.array([3.1, 1.2, 4.2])
y = np.array([8, 5, 3, -2, 3, 1, 4, 5, 7])
diff_sq = np.sum(x**2) - 2*np.correlate(y, x) + np.correlate(y**2, np.ones_like(x))
print(diff_sq)
This prints [39.89 45.29 11.69 39.49 0.09 12.89 23.09] which are indeed the required distances from x to various parts of y. Pick the smallest with argmin.
A little benchmark in addition to user6655984's wonderful answer:
import numpy as np
import time
x = np.random.rand(1000) # random array of size 1k
y = np.random.rand(100*1000) # random array of size 100k
print "Naive method"
start = time.time()
mindistance = np.inf
for k in range(len(y)-1000):
    t = np.sum(np.power(x-y[k:k+1000], 2))
    if t < mindistance:
        mindistance = t
        index = k
print index, mindistance
print "%.2f seconds\n" % (time.time() - start)
print "Correlation method"
start = time.time()
diff_sq = np.sum(x**2) - 2*np.correlate(y, x) + np.correlate(y**2, np.ones_like(x))
i = np.argmin(diff_sq)
print i, diff_sq[i]
print "%.2f seconds\n" % (time.time() - start)
We get a roughly 145x speed improvement :)
Naive method
60911 143.6153965841267
8.75 seconds
Correlation method
60911 143.6153965841267
0.06 seconds
The minimum of the SSD distance ("sum of squared difference") is the maximum of the correlation.
Correlations can be computed efficiently (in time N log N instead of N*M) using the famous FFT.
With N=1000 and M=500000 you can expect a speedup.
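As a concrete sketch of that FFT route (my own example, assuming scipy is available): fftconvolve against the reversed template reproduces the np.correlate terms from the earlier answer.
import numpy as np
from scipy.signal import fftconvolve

def ssd_profile_fft(x, y):
    # sum of squared differences between x and every window of y, via FFT
    corr_xy = fftconvolve(y, x[::-1], mode='valid')         # == np.correlate(y, x)
    win_y2 = fftconvolve(y**2, np.ones_like(x), mode='valid')
    return np.sum(x**2) - 2*corr_xy + win_y2

x = np.random.rand(1000)
y = np.random.rand(100000)
d = ssd_profile_fft(x, y)
k = int(np.argmin(d))
assert np.isclose(d[k], np.sum((x - y[k:k+len(x)])**2))
For these sizes both paths are fast; the FFT advantage grows as the template x gets longer.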
I have a numpy array X of size 1000 x 1000, where each element is a real number. I want to find the 5 closest points for every point in each row of this array. Here the distance metric can just be abs(x-y). I have tried to do
from sklearn.neighbors import NearestNeighbors

for i in range(X.shape[0]):
    knn = NearestNeighbors(n_neighbors=5)
    knn.fit(X[i])
    for j in range(X.shape[1]):
        d = knn.kneighbors(X[i,j], return_distance=False)
However, this does not work for me and I am not sure how efficient this is. Is there a way around this? I have seen a lot of methods for comparing vectors but not any for comparing single elements. I know that I could use a for loop and loop over and find the k smallest, but this would be computationally expensive. Could a KD tree work for this? I have tried a method similar to
Finding index of nearest point in numpy arrays of x and y coordinates
However, I can not get this to work. Is there some numpy function I don't know about that could accomplish this?
Construct a kdtree with scipy.spatial.cKDTree for each row of your data.
import numpy as np
import scipy.spatial
def nearest_neighbors(arr, k):
    k_lst = list(range(k + 2))[2:]  # [2, 3, ..., k+1]
    neighbors = []
    for row in arr:
        # stack the data so each element is in its own row
        data = np.vstack(row)
        # construct a kd-tree
        tree = scipy.spatial.cKDTree(data)
        # find k nearest neighbors for each element of data, skipping the
        # zero-distance result (the first nearest neighbor is always the point itself)
        dd, ii = tree.query(data, k=k_lst)
        # apply an index filter on data to get the nearest neighbor elements
        closest = data[ii].reshape(-1, k)
        neighbors.append(closest)
    return np.stack(neighbors)
N = 1000
k = 5
A = np.random.random((N, N))
nearest_neighbors(A, k)
I'm not really sure how you want the final results. But this definitely gets you what you need.
np.random.seed([3,1415])
X = np.random.rand(1000, 1000)
Grab upper triangle indices to track every combination of points per row
x1, x2 = np.triu_indices(X.shape[1], 1)
generate an array of all distances
d = np.abs(X[:, x1] - X[:, x2])
Find the closest 5 for every row
tpos = np.argpartition(d, 5)[:, :5]
Then x1[tpos] gives the row-wise positions of the first point in the closest pairs while x2[tpos] gives the second position of the closest pairs.
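To make that last step concrete, here is a down-scaled sketch (200 points per row instead of 1000, to keep the intermediate d array small) that turns x1[tpos] and x2[tpos] into explicit column-index pairs per row; the variable names simply mirror the answer above:
import numpy as np

np.random.seed(0)
X = np.random.rand(50, 200)
x1, x2 = np.triu_indices(X.shape[1], 1)
d = np.abs(X[:, x1] - X[:, x2])             # all pairwise |x - y| per row
tpos = np.argpartition(d, 5)[:, :5]         # 5 closest pairs per row (unordered)
first, second = x1[tpos], x2[tpos]          # column indices of each close pair
# the m-th closest pair in row 0 is (X[0, first[0, m]], X[0, second[0, m]])
print(list(zip(first[0], second[0])))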
Here is an argsorting solution that strives to take advantage of the simple metric:
import numpy as np
from numpy.lib import stride_tricks
from timeit import timeit

def nn(A, k):
    out = np.zeros((A.shape[0], A.shape[1] + 2*k), dtype=int)
    out[:, k:-k] = np.argsort(A, axis=-1)
    out[:, :k] = out[:, -k-1, None]
    out[:, -k:] = out[:, k, None]
    strd = stride_tricks.as_strided(
        out, strides=out.strides + (out.strides[-1],), shape=A.shape + (2*k+1,))
    delta = A[np.arange(A.shape[0])[:, None, None], strd]
    delta -= delta[..., k, None]
    delta = np.abs(delta)
    s = np.argpartition(delta, (0, k), axis=-1)[..., 1:k+1]
    inds = tuple(np.ogrid[:strd.shape[0], :strd.shape[1], :0][:2])
    res = np.empty(A.shape + (k,), dtype=int)
    res[np.arange(strd.shape[0])[:, None, None], out[:, k:-k, None],
        np.arange(k)[None, None, :]] = strd[inds + (s,)]
    return res
N = 1000
k = 5
r = 10
A = np.random.random((N, N))
res = nn(A, k)
# crude test
print(np.abs(A[np.arange(N)[:, None, None], res] - A[..., None]).mean())
# timings
print(timeit(lambda: nn(A, k), number=r) / r)
Output:
# 0.00150537172454
# 0.4567880852999224
I am using Python, but since I am a noob I can't figure out how to compute the average of a vector every, let's say, 100 elements inside a larger for loop.
My attempt so far, which is not what I want, is:
import numpy as np
r = np.zeros(10000)  # declare my vector
for i in range(0, 2000):  # start the loop
    r[i] = i**2  # some function to compute and save
    if (i % 100 == 0):  # each time I save 100 elements I want the mean
        av_r = np.mean(r)
        print(av_r)
My code does not work as I want: I would like to take the average of 100 elements only, then move on to the next 100, compute their mean, and go on.
I tried to reduce the size of the vector and reset it inside the if:
import numpy as np
r = np.zeros(100)  # declare my vector
for i in range(0, 2000):  # start the loop
    r[i] = i**2  # some function to compute and save
    if (i % 100 == 0):  # each time I save 100 elements I want the mean
        av_r = np.mean(r)
        print(av_r)
        r = np.zeros(100)
Naively, I thought I could save 100 elements, compute the average, clear the vector, and continue saving the next elements from 101 to 200 and so on, but it gives me errors. In particular:
IndexError: index 100 is out of bounds for axis 0 with size 100
Many thanks for your help.
Is this what you're looking for? This code will iterate from 0 to 2000 in intervals of 100, mapping some function (x -> x**2) over each interval, calculating the mean and printing the result.
import numpy as np
r = np.zeros(10000)
for i in range(0, 2000, 100):
    interval = [x ** 2 for x in r[i:i + 100]]
    av_r = np.mean(interval)
    print(av_r)
The output from this is just a series of twenty 0.0 values, since r is all zeros.
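If the intent was instead to apply the function to the loop index itself (as in the question's r[i] = i**2) and average each block of 100 indices, a small variant of the same pattern would be (this is my reading of the question, not part of the answer above):
import numpy as np

for i in range(0, 2000, 100):
    block = np.arange(i, i + 100, dtype=float) ** 2   # i**2 for the 100 indices in this block
    print(np.mean(block))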
The error you have probably encountered is an out-of-bounds index (IndexError: index 100 is out of bounds for axis 0 with size 100), because your index ranges from 0 to 1999 and you're doing
r[i] = i**2 # some function to compute and save
on a 100-element array.
Fix:
r[i%100] = i**2 # some function to compute and save
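If the whole computation can be vectorized anyway, a loop-free way to get one mean per block of 100 is to reshape and average along the second axis (a sketch, not part of the fix above):
import numpy as np

vals = np.arange(2000, dtype=float) ** 2          # the i**2 values from the question
block_means = vals.reshape(-1, 100).mean(axis=1)  # 20 means, one per block of 100
print(block_means)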
I need help vectorizing this code. Right now, with N=100, it takes a minute or so to run. I would like to speed that up. I have done something like this for a double loop, but never with a 3D loop, and I am having difficulties.
import numpy as np
N = 100
n = 12
r = np.sqrt(2)
x = np.arange(-N,N+1)
y = np.arange(-N,N+1)
z = np.arange(-N,N+1)
C = 0
for i in x:
    for j in y:
        for k in z:
            if (i+j+k)%2==0 and (i*i+j*j+k*k!=0):
                p = np.sqrt(i*i+j*j+k*k)
                p = p/r
                q = (1/p)**n
                C += q
print '\n'
print C
The meshgrid/where/indexing solution is already extremely fast. I made it about 65% faster. This is not much, but I explain it anyway, step by step:
It was easiest for me to approach this problem with all 3D vectors in the grid being columns in one large 2D 3 x M array. meshgrid is the right tool for creating all the combinations (note that numpy version >= 1.7 is required for a 3D meshgrid), and vstack + reshape bring the data into the desired form. Example:
>>> np.vstack(np.meshgrid(*[np.arange(0, 2)]*3)).reshape(3,-1)
array([[0, 0, 1, 1, 0, 0, 1, 1],
       [0, 0, 0, 0, 1, 1, 1, 1],
       [0, 1, 0, 1, 0, 1, 0, 1]])
Each column is one 3D vector. Each of these eight vectors represents one corner of a 1x1x1 cube (a 3D grid with step size 1 and length 1 in all dimensions).
Let's call this array vectors (it contains all 3D vectors representing all points in the grid). Then, prepare a bool mask for selecting those vectors fulfilling your mod2 criterion:
mod2bool = np.sum(vectors, axis=0) % 2 == 0
np.sum(vectors, axis=0) creates a 1 x M array containing the element sum for each column vector. Hence, mod2bool is a 1 x M array with a bool value for each column vector. Now use this bool mask:
vectorsubset = vectors[:,mod2bool]
This selects all rows (:) and uses boolean indexing to filter the columns; both are fast operations in numpy. Calculate the lengths of the remaining vectors, using the native numpy approach:
lengths = np.sqrt(np.sum(vectorsubset**2, axis=0))
This is quite fast -- however, scipy.stats.ss and bottleneck.ss can perform the squared sum calculation even faster than this.
Transform the lengths using your instructions:
with np.errstate(divide='ignore'):
    p = (r/lengths)**n
This involves division of finite numbers by zero, resulting in infs in the output array. This is entirely fine. We use numpy's errstate context manager to make sure that these zero divisions do not throw an exception or a runtime warning.
Now sum up the finite elements (ignore the infs) and return the sum:
return np.sum(p[np.isfinite(p)])
I have implemented this method twice below: once exactly as just explained, and once using bottleneck's ss and nansum functions. I have also added your method for comparison, and a modified version of your method that skips the np.where((x*x+y*y+z*z)!=0) indexing and instead creates infs, and finally sums up the isfinite way.
import sys
import numpy as np
import bottleneck as bn
N = 100
n = 12
r = np.sqrt(2)
x,y,z = np.meshgrid(*[np.arange(-N, N+1)]*3)
gridvectors = np.vstack((x,y,z)).reshape(3, -1)
def measure_time(func):
    import time
    def modified_func(*args, **kwargs):
        t0 = time.time()
        result = func(*args, **kwargs)
        duration = time.time() - t0
        print("%s duration: %.3f s" % (func.__name__, duration))
        return result
    return modified_func

@measure_time
def method_columnvecs(vectors):
    mod2bool = np.sum(vectors, axis=0) % 2 == 0
    vectorsubset = vectors[:, mod2bool]
    lengths = np.sqrt(np.sum(vectorsubset**2, axis=0))
    with np.errstate(divide='ignore'):
        p = (r/lengths)**n
    return np.sum(p[np.isfinite(p)])

@measure_time
def method_columnvecs_opt(vectors):
    # On my system, bn.nansum is even slightly faster than np.sum.
    mod2bool = bn.nansum(vectors, axis=0) % 2 == 0
    # Use ss from bottleneck or scipy.stats (axis=0 is default).
    lengths = np.sqrt(bn.ss(vectors[:, mod2bool]))
    with np.errstate(divide='ignore'):
        p = (r/lengths)**n
    return bn.nansum(p[np.isfinite(p)])

@measure_time
def method_original(x, y, z):
    ind = np.where((x+y+z) % 2 == 0)
    x = x[ind]
    y = y[ind]
    z = z[ind]
    ind = np.where((x*x+y*y+z*z) != 0)
    x = x[ind]
    y = y[ind]
    z = z[ind]
    p = np.sqrt(x*x+y*y+z*z)/r
    return np.sum((1/p)**n)

@measure_time
def method_original_finitesum(x, y, z):
    ind = np.where((x+y+z) % 2 == 0)
    x = x[ind]
    y = y[ind]
    z = z[ind]
    lengths = np.sqrt(x*x+y*y+z*z)
    with np.errstate(divide='ignore'):
        p = (r/lengths)**n
    return np.sum(p[np.isfinite(p)])
print method_columnvecs(gridvectors)
print method_columnvecs_opt(gridvectors)
print method_original(x,y,z)
print method_original_finitesum(x,y,z)
This is the output:
$ python test.py
method_columnvecs duration: 1.295 s
12.1318801965
method_columnvecs_opt duration: 1.162 s
12.1318801965
method_original duration: 1.936 s
12.1318801965
method_original_finitesum duration: 1.714 s
12.1318801965
All methods produce the same result. Your method becomes a bit faster when doing the isfinite style sum. My methods are faster, but I would say that this is an exercise of academic nature rather than an important improvement :-)
I have one question left: you were saying that for N=3, the calculation should produce a 12. Even yours doesn't do this. All methods above produce 12.1317530867 for N=3. Is this expected?
Thanks to @Bill, I was able to get this to work. It is very fast now. Perhaps it could be done better, especially with the two masks used to get rid of the two conditions that I originally handled with for loops.
from __future__ import division
import numpy as np
N = 100
n = 12
r = np.sqrt(2)
x, y, z = np.meshgrid(*[np.arange(-N, N+1)]*3)
ind = np.where((x+y+z)%2==0)
x = x[ind]
y = y[ind]
z = z[ind]
ind = np.where((x*x+y*y+z*z)!=0)
x = x[ind]
y = y[ind]
z = z[ind]
p=np.sqrt(x*x+y*y+z*z)/r
ans = (1/p)**n
ans = np.sum(ans)
print 'ans'
print ans