I have an array of 3-dimensional vectors, vec, and I want to find a perpendicular vector res_vec for each of them.
Other methods gave me numerically unstable behaviour, so I simply find the component with the smallest magnitude, set it to zero, swap the two remaining components, and negate one of them. That is not my main concern, though: it works fine, but it is slow.
So my question is whether my code can be rewritten to eliminate the for loop and vectorize it using some clever numpy tricks.
So far, all my attempts have failed.
This is the code:
for i in range(10000):
    index_min = np.argsort(np.abs(vec[i]))
    if index_min[0] == 0:  # x smallest magnitude
        res_vec = np.array([0, -vec[i][2], vec[i][1]])
    elif index_min[0] == 1:  # y smallest magnitude
        res_vec = np.array([vec[i][2], 0, -vec[i][0]])
    elif index_min[0] == 2:  # z smallest magnitude
        res_vec = np.array([-vec[i][1], vec[i][0], 0])
The array vec contains data of the form (3D row-vectors):
print(vec) -->
[[ 0.57743925 0.57737595 -0.5772355 ]
[ 0.5776141 0.5777615 -0.57667464]
[ 0.5772779 0.5785899 -0.57618046]
...
[ 0.5764752 0.5781902 -0.5773842 ]
[ 0.5764985 0.578053 -0.57749826]
[ 0.5764546 0.5784942 -0.57710016]]
print(vec.ndim) -->
2
print(vec.shape) -->
(32000, 3)
Since your question is about vectorizing your code, the script below compares your for-loop version (Timer 1) with Feri's vectorized version (Timer 2); the speed-up is significant. I also found that boolean indexing (Timer 3) is faster still, although the code is a little less elegant:
import numpy as np
import time
# Preparation of testdata
R = 32000
vec = 2 * np.random.rand(R,3) - 1
# For loop version
t_start = time.time()
res_vec = np.zeros(vec.shape)
for i in range(R):
    index_min = np.argsort(np.abs(vec[i]))
    if index_min[0] == 0:  # x smallest magnitude
        res_vec[i,:] = np.array([0, -vec[i][2], vec[i][1]])
    elif index_min[0] == 1:  # y smallest magnitude
        res_vec[i,:] = np.array([vec[i][2], 0, -vec[i][0]])
    elif index_min[0] == 2:  # z smallest magnitude
        res_vec[i,:] = np.array([-vec[i][1], vec[i][0], 0])
print(f'Timer 1: {time.time()-t_start}s')
# Feri's formula
t_start = time.time()
res_vec2 = np.zeros(vec.shape)
index_min = np.argmin(np.abs(vec), axis=1)
res_vec2[range(R),(index_min+1)%3] = -vec[range(R),(index_min+2)%3]
res_vec2[range(R),(index_min+2)%3] = vec[range(R),(index_min+1)%3]
print(f'Timer 2: {time.time()-t_start}s')
# Boolean indexing
t_start = time.time()
res_vec3 = np.zeros(vec.shape)
index_min = np.argmin(np.abs(vec), axis=1)
res_vec3[index_min == 0,1] = -vec[index_min == 0,2]
res_vec3[index_min == 0,2] = vec[index_min == 0,1]
res_vec3[index_min == 1,0] = vec[index_min == 1,2]
res_vec3[index_min == 1,2] = -vec[index_min == 1,0]
res_vec3[index_min == 2,0] = -vec[index_min == 2,1]
res_vec3[index_min == 2,1] = vec[index_min == 2,0]
print(f'Timer 3: {time.time()-t_start}s')
print('Results 1&2 are equal' if np.linalg.norm(res_vec-res_vec2)==0 else 'Results 1&2 differ')
print('Results 1&3 are equal' if np.linalg.norm(res_vec-res_vec3)==0 else 'Results 1&3 differ')
Output:
% python3 script.py
Timer 1: 0.24681901931762695s
Timer 2: 0.020949125289916992s
Timer 3: 0.0034308433532714844s
Results 1&2 are equal
Results 1&3 are equal
index_min = np.argmin(np.abs(vec), axis=1)
vec_c = vec.copy()
vec[range(len(vec)), index_min] = 0.
vec[range(len(vec)), (index_min + 1) % 3] = -vec_c[range(len(vec)), (index_min + 2) % 3]
vec[range(len(vec)), (index_min + 2) % 3] = vec_c[range(len(vec)), (index_min + 1) % 3]
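As a sanity check, the same index arithmetic can be wrapped in a function that leaves vec untouched and confirms perpendicularity through row-wise dot products. This is only a sketch: the function name perpendicular and the random test data are illustrative assumptions, not part of the answer above.

```python
import numpy as np

def perpendicular(vec):
    """Return one vector perpendicular to each row of an (N, 3) array."""
    vec = np.asarray(vec, dtype=float)
    rows = np.arange(len(vec))
    index_min = np.argmin(np.abs(vec), axis=1)   # smallest-magnitude component
    res = np.zeros_like(vec)                     # that component stays zero
    # swap the two remaining components and negate one of them
    res[rows, (index_min + 1) % 3] = -vec[rows, (index_min + 2) % 3]
    res[rows, (index_min + 2) % 3] = vec[rows, (index_min + 1) % 3]
    return res

rng = np.random.default_rng(0)                   # hypothetical test data
vec = 2 * rng.random((1000, 3)) - 1
res_vec = perpendicular(vec)
dots = np.einsum('ij,ij->i', vec, res_vec)       # row-wise dot products
print(np.abs(dots).max())                        # should be zero up to rounding
```

Using np.arange(len(vec)) instead of range(len(vec)) avoids converting a Python range into an index array on every assignment.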
Sorting each entire 3-element vector is unnecessary when you only care about the index of the smallest component. Do this:
for i in range(10000):
    index_min = np.argmin(np.abs(vec[i]))
    if index_min == 0:  # x smallest magnitude
        res_vec = np.array([0, -vec[i][2], vec[i][1]])
    elif index_min == 1:  # y smallest magnitude
        res_vec = np.array([vec[i][2], 0, -vec[i][0]])
    else:
        res_vec = np.array([-vec[i][1], vec[i][0], 0])
You could improve this further by using Numba to JIT compile the loop. That would also let you avoid creating the unnecessary temporary array from np.abs() because you could write a custom argmin() that uses the absolute value of each element as it goes.
You can also avoid the temporaries produced by - if you do this:
for i in range(10000):
    index_min = np.argmin(np.abs(vec[i]))
    res_vec = np.empty_like(vec[i])
    if index_min == 0:  # x smallest magnitude
        res_vec[0] = 0
        # out= must be an array, so pass one-element slices rather than scalars
        np.negative(vec[i, 2:3], out=res_vec[1:2])
        res_vec[2] = vec[i][1]
    # etc
The idea is that np.negative writes the negated values directly into res_vec, whereas - on its own always produces a newly allocated result that you don't need.
Although you say it's not the main issue, I thought I'd add this in case it is of interest.
A method I've found to have good stability to find a (unit) vector orthogonal to a given (non-zero) vector is to use Householder reflectors. These are orthogonal and symmetric (hence their own inverses) matrices defined by a non-zero vector h as
Q = I - 2*h*h'/(h'*h)
Given a non-zero vector v there is an algorithm to compute (the h defining) a Householder reflector Q that maps v to a multiple of (1,0,0)'. It follows that Q*(0,1,0)' is orthogonal to v.
In case this sounds expensive here is C code (sorry, I don't speak python) that given v, fills u with a vector orthogonal to v
static void ovec( const double* v, double* restrict u)
{
double lv = sqrt( v[0]*v[0] + v[1]*v[1] + v[2]*v[2]); // length of v
double s = copysign ( lv, v[0]); // s has abs value lv, sign of v[0]
double h = v[0] + s; // first component of householder vector for Q
// other components are v[1] and v[2]
double a = -1.0/(s*h); // householder scale
// apply reflector to (0,1,0)'
double b = a*v[1];
u[0] = b*h; u[1] = 1.0 + b*v[1]; u[2] = b*v[2];
}
A couple of things I like about this are that the same method can be used in higher dimensions, and that it is easy to extend it to make an orthogonal basis, where one vector is parallel to v, and the others are mutually orthogonal and orthogonal to v.
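For readers working in numpy, the C routine above translates almost line for line into a vectorized version. The following is a sketch under the assumption that the input is an (N, 3) array of non-zero rows; the name householder_perp is made up for illustration.

```python
import numpy as np

def householder_perp(v):
    """Unit vectors orthogonal to each non-zero row of an (N, 3) array."""
    v = np.asarray(v, dtype=float)
    lv = np.linalg.norm(v, axis=1)        # length of each v
    s = np.copysign(lv, v[:, 0])          # |s| = lv, with the sign of v[0]
    h = v[:, 0] + s                       # first component of the Householder vector
    a = -1.0 / (s * h)                    # Householder scale
    b = a * v[:, 1]
    # apply the reflector to (0, 1, 0)'
    return np.column_stack((b * h, 1.0 + b * v[:, 1], b * v[:, 2]))

rng = np.random.default_rng(0)            # hypothetical test data
v = rng.normal(size=(1000, 3))
u = householder_perp(v)
print(np.abs(np.einsum('ij,ij->i', u, v)).max())   # largest |u . v|
```

Because s takes the sign of v[0], the sum v[0] + s never cancels, which is where the numerical stability comes from.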
Related
I am looking for a method with which I can smooth a scattered dataset. The scattered dataset comes from sampling a very large array that represents a raster. I have to vectorize this array in order to downsample it. I have done so using the matplotlib.pyplot.contour() function, and I get a reasonable set of point-value pairs.
The problem is that this signal is noisy, and I need to smooth it. Smoothing the original array is no good, I need to smooth the scattered data. The best I could find is the function below, which I rewrote from a Matlab counterpart. While this function does the job, it is very slow. I am looking either for alternative functions to smooth this data or a way to make the function below faster.
def limgrad(self, triangulation, values, dfdx, imax=100):
"""
See https://github.com/dengwirda/mesh2d/blob/master/hjac-util/limgrad.m
for original source code.
"""
# triangulation is a matplotlib.tri.Triangulation instance
edge = triangulation.edges
dx = np.subtract(
triangulation.x[edge[:, 0]], triangulation.x[edge[:, 1]])
dy = np.subtract(
triangulation.y[edge[:, 0]], triangulation.y[edge[:, 1]])
elen = np.sqrt(dx**2+dy**2)
aset = np.zeros(values.shape)
ftol = np.min(values) * np.sqrt(np.finfo(float).eps)
for i in range(1, imax + 1):
aidx = np.where(aset == i-1)[0]
if len(aidx) == 0.:
break
active_idxs = np.argsort(values[aidx])
for active_idx in active_idxs:
adj_edges_idxs = np.where(
np.any(edge == active_idx, axis=1))[0]
adjacent_edges = edge[adj_edges_idxs]
for nod1, nod2 in adjacent_edges:
if values[nod1] > values[nod2]:
fun1 = values[nod2] + elen[active_idx] * dfdx
if values[nod1] > fun1+ftol:
values[nod1] = fun1
aset[nod1] = i
else:
fun2 = values[nod1] + elen[active_idx] * dfdx
if values[nod2] > fun2+ftol:
values[nod2] = fun2
aset[nod2] = i
return values
I found the answer to my own question and am posting it here for reference. The algorithm above is slow because calling np.where() to generate adj_edges_idxs for every active node carries heavy overhead. Precomputing the node neighbors eliminates that overhead: the loop went from ~80 iterations per second to 80,000 it/s.
The final version looks like this:
from collections import defaultdict
from itertools import permutations
import numpy as np
def limgrad(tri, values, dfdx=0.2, imax=100):
"""
See https://github.com/dengwirda/mesh2d/blob/master/hjac-util/limgrad.m
for original source code.
"""
xy = np.vstack([tri.x, tri.y]).T
edge = tri.edges
dx = np.subtract(xy[edge[:, 0], 0], xy[edge[:, 1], 0])
dy = np.subtract(xy[edge[:, 0], 1], xy[edge[:, 1], 1])
elen = np.sqrt(dx**2+dy**2)
ffun = values.flatten()
aset = np.zeros(ffun.shape)
ftol = np.min(ffun) * np.sqrt(np.finfo(float).eps)
# precompute neighbor table
point_neighbors = defaultdict(set)
for simplex in tri.triangles:
for i, j in permutations(simplex, 2):
point_neighbors[i].add(j)
# iterative smoothing
for _iter in range(1, imax+1):
aidx = np.where(aset == _iter-1)[0]
if len(aidx) == 0.:
break
active_idxs = np.argsort(ffun[aidx])
for active_idx in active_idxs:
adjacent_edges = point_neighbors[active_idx]
for adj_edge in adjacent_edges:
if ffun[adj_edge] > ffun[active_idx]:
fun1 = ffun[active_idx] + elen[active_idx] * dfdx
if ffun[adj_edge] > fun1+ftol:
ffun[adj_edge] = fun1
aset[adj_edge] = _iter
else:
fun2 = ffun[adj_edge] + elen[active_idx] * dfdx
if ffun[active_idx] > fun2+ftol:
ffun[active_idx] = fun2
aset[active_idx] = _iter
flag = _iter < imax
return ffun, flag
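The neighbor table is the part that removes the repeated np.where() scans, so it may help to see it in isolation. A minimal sketch on a toy mesh; the helper name build_point_neighbors and the two-triangle input are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations

def build_point_neighbors(triangles):
    """Map each node index to the set of nodes it shares an edge with."""
    point_neighbors = defaultdict(set)
    for simplex in triangles:
        # every ordered pair of corners in a triangle is an edge (i -> j)
        for i, j in permutations(simplex, 2):
            point_neighbors[i].add(j)
    return point_neighbors

# Two triangles sharing the edge (1, 2); a hypothetical toy mesh.
triangles = [(0, 1, 2), (1, 2, 3)]
neighbors = build_point_neighbors(triangles)
print(sorted(neighbors[1]))
```

The table is built once in a single pass over the triangles, after which each lookup inside the smoothing loop is a dictionary access rather than a scan over all edges.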
I'm testing new metrics to measure the distance between weight matrices in PyTorch; right now I'm trying to use Mahalanobis. To do that, I reshape every matrix into a vector, concatenate them into one matrix, and then use this matrix to calculate the Mahalanobis distance between any two of its rows. The problem is that some of the squared distances come out negative, and the square root of a negative number gives NaN.
I know the covariance matrix has to be positive definite, so I guess I'm messing up there. Or is my idea of using Mahalanobis simply not possible in this case?
Here's the code I'm using. I pass it an X with shape (64, 121), that is, 64 flattened 11x11 matrices.
def _mahalanobis(X):
VI = torch.inverse(cov(X)) #covariance matrix
total_dist = 0
for i,v in enumerate(X):
dist = 0
for j,u in enumerate(X):
if i == j:
continue
x = (v-u).unsqueeze(0).t()
y = (v - u).unsqueeze(0)
dist = torch.sqrt(torch.mm(torch.mm(y,VI),x))
total_dist +=dist
print(dist)
return total_dist
## Returns the covariance matrix of m
def cov(m, rowvar=False):
if m.dim() > 2:
raise ValueError('m has more than 2 dimensions')
if m.dim() < 2:
m = m.view(1, -1)
if not rowvar and m.size(0) != 1:
m = m.t()
# m = m.type(torch.double) # uncomment this line if desired
fact = 1.0 / (m.size(1) - 1)
m -= torch.mean(m, dim=1, keepdim=True)
mt = m.t() # if complex: mt = m.t().conj()
return fact * m.matmul(mt).squeeze()
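One likely cause, offered as a hedged diagnosis rather than a confirmed one: with 64 samples of 121 features, the sample covariance has rank at most 63, so it is singular and its computed inverse is numerically meaningless, which can make the quadratic form negative. The NumPy sketch below (NumPy rather than PyTorch, to keep it short) adds a small ridge term to the covariance and works with its Cholesky factor, so every squared distance is a sum of squares and cannot be negative. The ridge value 1e-3 and the function name are assumptions, not a recommendation.

```python
import numpy as np

def mahalanobis_pairwise(X, ridge=1e-3):
    """Pairwise Mahalanobis distances between the rows of X, shape (n, d)."""
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False)                # (d, d); rank <= n - 1
    S_reg = S + ridge * np.eye(S.shape[0])     # ridge term (assumed value) makes it positive definite
    L = np.linalg.cholesky(S_reg)              # S_reg = L @ L.T
    Z = np.linalg.solve(L, X.T).T              # whitened rows: z = L^{-1} x
    diff = Z[:, None, :] - Z[None, :, :]       # all pairwise differences
    return np.sqrt(np.sum(diff**2, axis=-1))   # sqrt of a sum of squares: never NaN

rng = np.random.default_rng(0)                 # hypothetical stand-in for the weights
X = rng.normal(size=(64, 121))
D = mahalanobis_pairwise(X)
print(D.shape, np.isnan(D).any())
```

The result depends on the ridge value, so with fewer samples than features the distances should be interpreted with care.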
In the following code I have implemented Gaussian elimination with partial pivoting for a general square linear system Ax=b. I have tested my code and it produced the right output. However now I am trying to do the following but I am not quite sure how to code it, looking for some help with this!
I want to test my implementation by solving Ax=b where A is a random 100x100 matrix and b is a random 100x1 vector.
In my code I have put in the matrices
A = np.array([[3.,2.,-4.],[2.,3.,3.],[5.,-3.,1.]])
b = np.array([[3.],[15.],[14.]])
and gotten the following correct output:
[3. 1. 2.]
[3. 1. 2.]
but now how do I change it to generate the random matrices?
here is my code below:
import numpy as np
def GEPP(A, b, doPricing = True):
'''
Gaussian elimination with partial pivoting.
input: A is an n x n numpy matrix
b is an n x 1 numpy array
output: x is the solution of Ax=b
with the entries permuted in
accordance with the pivoting
done by the algorithm
post-condition: A and b have been modified.
'''
n = len(A)
if b.size != n:
raise ValueError("Invalid argument: incompatible sizes between "+
"A & b.", b.size, n)
# k represents the current pivot row. Since GE traverses the matrix in the
# upper right triangle, we also use k for indicating the k-th diagonal
# column index.
# Elimination
for k in range(n-1):
if doPricing:
# Pivot
maxindex = abs(A[k:,k]).argmax() + k
if A[maxindex, k] == 0:
raise ValueError("Matrix is singular.")
# Swap
if maxindex != k:
A[[k,maxindex]] = A[[maxindex, k]]
b[[k,maxindex]] = b[[maxindex, k]]
else:
if A[k, k] == 0:
raise ValueError("Pivot element is zero. Try setting doPricing to True.")
#Eliminate
for row in range(k+1, n):
multiplier = A[row,k]/A[k,k]
A[row, k:] = A[row, k:] - multiplier*A[k, k:]
b[row] = b[row] - multiplier*b[k]
# Back Substitution
x = np.zeros(n)
for k in range(n-1, -1, -1):
x[k] = (b[k] - np.dot(A[k,k+1:],x[k+1:]))/A[k,k]
return x
if __name__ == "__main__":
A = np.array([[3.,2.,-4.],[2.,3.,3.],[5.,-3.,1.]])
b = np.array([[3.],[15.],[14.]])
print (GEPP(np.copy(A), np.copy(b), doPricing = False))
print (GEPP(A,b))
You're already using numpy. Have you considered np.random.rand?
np.random.rand(m, n) will get you a random matrix with values in [0, 1). You can further process it by multiplying random values or rounding.
EDIT: Something like this
if __name__ == "__main__":
A = np.round(np.random.rand(100, 100)*10)
b = np.round(np.random.rand(100)*10)
print (GEPP(np.copy(A), np.copy(b), doPricing = False))
print (GEPP(A,b))
So I would use np.random.randint for this.
numpy.random.randint(low, high=None, size=None, dtype='l')
which outputs a size-shaped array of random integers from the appropriate distribution, or a single such random int if size not provided.
low is the lower bound of the ints you want in your range
high is one greater than the upper bound in your desired range
size is the dimensions of your output array
dtype is the dtype of the result
so if I was you I would write
A = np.random.randint(0, 11, (100, 100))
b = np.random.randint(0, 11, 100)
Basically, you could create the desired matrices filled with ones and then iterate over them, setting each value to random.randint(0, 100), for example.
A matrix filled with ones is:
one_array = np.ones((100, 100))
EDIT:
like:
import random
for x in range(one_array.shape[0]):
    for y in range(one_array.shape[1]):
        one_array[x][y] = random.randint(0, 100)
A = np.random.normal(size=(100,100))
b = np.random.normal(size=(100,1))
x = np.linalg.solve(A,b)
assert max(abs(A@x - b)) < 1e-12
Clearly, you can use different distributions than normal, like uniform.
You can use numpy's native rand function:
np.random.rand()
In your code just define A and b as:
A = np.random.rand(100, 100)
b = np.random.rand(100)
This will generate a 100x100 matrix and a length-100 vector (both numpy arrays) filled with random values in [0, 1). If you need a 100x1 column vector, as in your example, use np.random.rand(100, 1).
See the docs for this function to learn more.
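When testing a solver on random input, a seeded generator makes any failure reproducible, and a residual check says whether the answer is acceptable. A sketch along those lines; check_solver is a hypothetical helper, and np.linalg.solve stands in for the GEPP function from the question, which is not repeated here.

```python
import numpy as np

def check_solver(solver, n=100, seed=0):
    """Solve a random n x n system with `solver`; return the relative residual."""
    rng = np.random.default_rng(seed)      # seeded, so a failing case can be rerun
    A = rng.random((n, n))                 # uniform values in [0, 1)
    b = rng.random(n)
    x = solver(A.copy(), b.copy())         # copies, since the solver may modify its inputs
    return np.linalg.norm(A @ x - b) / np.linalg.norm(b)

# np.linalg.solve is a stand-in; pass GEPP here to test the implementation above.
residual = check_solver(np.linalg.solve)
print(residual)
```

A relative residual is used instead of a fixed absolute threshold so that the check does not depend on the scale of the random entries.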
I need help vectorizing this code. Right now, with N=100, it takes a minute or so to run, and I would like to speed that up. I have done something like this for a double loop, but never for a triple loop, and I am having difficulties.
import numpy as np
N = 100
n = 12
r = np.sqrt(2)
x = np.arange(-N,N+1)
y = np.arange(-N,N+1)
z = np.arange(-N,N+1)
C = 0
for i in x:
    for j in y:
        for k in z:
            if (i+j+k)%2==0 and (i*i+j*j+k*k!=0):
                p = np.sqrt(i*i+j*j+k*k)
                p = p/r
                q = (1/p)**n
                C += q
print('\n')
print(C)
The meshgrid/where/indexing solution is already extremely fast. I made it about 65 % faster. This is not too much, but I explain it anyway, step by step:
It was easiest for me to approach this problem with all 3D vectors in the grid being columns in one large 2D 3 x M array. meshgrid is the right tool for creating all the combinations (note that numpy version >= 1.7 is required for a 3D meshgrid), and vstack + reshape bring the data into the desired form. Example:
>>> np.vstack(np.meshgrid(*[np.arange(0, 2)]*3)).reshape(3,-1)
array([[0, 0, 1, 1, 0, 0, 1, 1],
[0, 0, 0, 0, 1, 1, 1, 1],
[0, 1, 0, 1, 0, 1, 0, 1]])
Each column is one 3D vector. Each of these eight vectors represents one corner of a 1x1x1 cube (a 3D grid with step size 1 and length 1 in all dimensions).
Let's call this array vectors (it contains all 3D vectors representing all points in the grid). Then, prepare a bool mask for selecting those vectors fulfilling your mod2 criterion:
mod2bool = np.sum(vectors, axis=0) % 2 == 0
np.sum(vectors, axis=0) creates an 1 x M array containing the element sum for each column vector. Hence, mod2bool is a 1 x M array with a bool value for each column vector. Now use this bool mask:
vectorsubset = vectors[:,mod2bool]
This selects all rows (:) and uses boolean indexing for filtering the columns, both are fast operations in numpy. Calculate the lengths of the remaining vectors, using the native numpy approach:
lengths = np.sqrt(np.sum(vectorsubset**2, axis=0))
This is quite fast -- however, scipy.stats.ss and bottleneck.ss can perform the squared sum calculation even faster than this.
Transform the lengths using your instructions:
with np.errstate(divide='ignore'):
p = (r/lengths)**n
This involves finite number division by zero, resulting in Infs in the output array. This is entirely fine. We use numpy's errstate context manager for making sure that these zero divisions do not throw an exception or a runtime warning.
Now sum up the finite elements (ignore the infs) and return the sum:
return np.sum(p[np.isfinite(p)])
I have implemented this method two times below. Once exactly like just explained, and once involving bottleneck's ss and nansum functions. I have also added your method for comparison, and a modified version of your method that skips the np.where((x*x+y*y+z*z)!=0) indexing, but rather creates Infs, and finally sums up the isfinite way.
import sys
import numpy as np
import bottleneck as bn
N = 100
n = 12
r = np.sqrt(2)
x,y,z = np.meshgrid(*[np.arange(-N, N+1)]*3)
gridvectors = np.vstack((x,y,z)).reshape(3, -1)
def measure_time(func):
import time
def modified_func(*args, **kwargs):
t0 = time.time()
result = func(*args, **kwargs)
duration = time.time() - t0
print("%s duration: %.3f s" % (func.__name__, duration))
return result
return modified_func
@measure_time
def method_columnvecs(vectors):
mod2bool = np.sum(vectors, axis=0) % 2 == 0
vectorsubset = vectors[:,mod2bool]
lengths = np.sqrt(np.sum(vectorsubset**2, axis=0))
with np.errstate(divide='ignore'):
p = (r/lengths)**n
return np.sum(p[np.isfinite(p)])
@measure_time
def method_columnvecs_opt(vectors):
# On my system, bn.nansum is even slightly faster than np.sum.
mod2bool = bn.nansum(vectors, axis=0) % 2 == 0
# Use ss from bottleneck or scipy.stats (axis=0 is default).
lengths = np.sqrt(bn.ss(vectors[:,mod2bool]))
with np.errstate(divide='ignore'):
p = (r/lengths)**n
return bn.nansum(p[np.isfinite(p)])
@measure_time
def method_original(x,y,z):
ind = np.where((x+y+z)%2==0)
x = x[ind]
y = y[ind]
z = z[ind]
ind = np.where((x*x+y*y+z*z)!=0)
x = x[ind]
y = y[ind]
z = z[ind]
p=np.sqrt(x*x+y*y+z*z)/r
return np.sum((1/p)**n)
@measure_time
def method_original_finitesum(x,y,z):
ind = np.where((x+y+z)%2==0)
x = x[ind]
y = y[ind]
z = z[ind]
lengths = np.sqrt(x*x+y*y+z*z)
with np.errstate(divide='ignore'):
p = (r/lengths)**n
return np.sum(p[np.isfinite(p)])
print(method_columnvecs(gridvectors))
print(method_columnvecs_opt(gridvectors))
print(method_original(x,y,z))
print(method_original_finitesum(x,y,z))
This is the output:
$ python test.py
method_columnvecs duration: 1.295 s
12.1318801965
method_columnvecs_opt duration: 1.162 s
12.1318801965
method_original duration: 1.936 s
12.1318801965
method_original_finitesum duration: 1.714 s
12.1318801965
All methods produce the same result. Your method becomes a bit faster when doing the isfinite style sum. My methods are faster, but I would say that this is an exercise of academic nature rather than an important improvement :-)
I have one question left: you were saying that for N=3, the calculation should produce a 12. Even yours doesn't do this. All methods above produce 12.1317530867 for N=3. Is this expected?
Thanks to @Bill, I was able to get this to work, and it is very fast now. It could perhaps be done better, especially the two masks that replace the two conditions I originally had inside the for loops.
from __future__ import division
import numpy as np
N = 100
n = 12
r = np.sqrt(2)
x, y, z = np.meshgrid(*[np.arange(-N, N+1)]*3)
ind = np.where((x+y+z)%2==0)
x = x[ind]
y = y[ind]
z = z[ind]
ind = np.where((x*x+y*y+z*z)!=0)
x = x[ind]
y = y[ind]
z = z[ind]
p=np.sqrt(x*x+y*y+z*z)/r
ans = (1/p)**n
ans = np.sum(ans)
print('ans')
print(ans)
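The two np.where passes can be folded into a single boolean mask. Since p = sqrt(x*x+y*y+z*z)/sqrt(2), the term (1/p)**n equals (2/(x*x+y*y+z*z))**(n/2), so the square root can be skipped as well. A sketch of that idea; the function name lattice_sum and the use of sparse meshgrid output are choices made here, not part of the code above.

```python
import numpy as np

def lattice_sum(N, n=12):
    """Sum (1/p)**n over grid points with even i+j+k, excluding the origin."""
    a = np.arange(-N, N + 1)
    # sparse=True returns broadcastable 1-D-like axes instead of three full cubes
    x, y, z = np.meshgrid(a, a, a, indexing='ij', sparse=True)
    r2 = x*x + y*y + z*z                       # squared lengths, full 3D grid
    mask = ((x + y + z) % 2 == 0) & (r2 != 0)  # both conditions in one mask
    return np.sum((2.0 / r2[mask]) ** (n / 2))

print(lattice_sum(3))
```

For N=1 the only contributing points are the twelve permutations of (±1, ±1, 0), each adding exactly 1.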
I want to build a grid from sampled data. I could use a machine learning - clustering algorithm, like k-means, but I want to restrict the centres to be roughly uniformly distributed.
I have come up with an approach using the scikit-learn nearest neighbours search: pick a point at random, delete all points within radius r, then repeat. This works well, but I am wondering if anyone has a better (faster) way of doing this.
In response to comments I have tried two alternate methods, one turns out much slower the other is about the same...
Method 0 (my first attempt):
def get_centers0(X, r):
N = X.shape[0]
D = X.shape[1]
grid = np.zeros([0,D])
nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')
while N > 0:
nearest.fit(X)
x = X[int(random()*N), :]
_, del_x = nearest.radius_neighbors(x)
X = np.delete(X, del_x[0], axis = 0)
grid = np.vstack([grid, x])
N = X.shape[0]
return grid
Method 1 (using the precomputed graph):
def get_centers1(X, r):
N = X.shape[0]
D = X.shape[1]
grid = np.zeros([0,D])
nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')
nearest.fit(X)
graph = nearest.radius_neighbors_graph(X)
#This method is very slow even before doing any 'pruning'
Method 2:
def get_centers2(X, r, k):
N = X.shape[0]
D = X.shape[1]
k = k
grid = np.zeros([0,D])
nearest = near.NearestNeighbors(radius = r, algorithm = 'auto')
while N > 0:
nearest.fit(X)
x = X[np.random.randint(0,N,k), :]
#min_dist = near.NearestNeighbors().fit(x).kneighbors(x, n_neighbors = 1, return_distance = True)
min_dist = dist(x, k, 2, np.ones(k)) # where dist is a cython compiled function
x = x[min_dist < 0.1,:]
_, del_x = nearest.radius_neighbors(x)
X = np.delete(X, del_x[0], axis = 0)
grid = np.vstack([grid, x])
N = X.shape[0]
return grid
Running these as follows:
N = 50000
r = 0.1
x1 = np.random.rand(N)
x2 = np.random.rand(N)
X = np.vstack([x1, x2]).T
tic = time.time()
grid0 = get_centers0(X, r)
toc = time.time()
print('Method 0: ' + str(toc - tic))
tic = time.time()
get_centers1(X, r)
toc = time.time()
print('Method 1: ' + str(toc - tic))
tic = time.time()
grid2 = get_centers2(X, r, 10)  # k was missing from the call; 10 is assumed from the later benchmark
toc = time.time()
print('Method 2: ' + str(toc - tic))
Method 0 and 2 are about the same...
Method 0: 0.840130090714
Method 1: 2.23365592957
Method 2: 0.774812936783
I'm not sure from the question exactly what you are trying to do. You mention wanting to create an "approximate grid", or a "uniform distribution", while the code you provide selects a subset of points such that no pairwise distance is less than r.
A couple of possible suggestions:
If what you want is an approximate grid, I would construct the grid you want to approximate, and then query for the nearest neighbor of each grid point. Depending on your application, you might further trim these results to cut out points whose distance from the grid point is larger than is useful for you.
If what you want is an approximately uniform distribution drawn from among the points, I would do a kernel density estimate (sklearn.neighbors.KernelDensity) at each point, and do a randomized sub-selection from the dataset weighted by the inverse of the local density at each point.
If what you want is a subset of points such that no pairwise distance is less than r, I would start by constructing a radius_neighbors_graph with radius r, which will, in one go, give you a list of all points which are too close together. You can then use a pruning algorithm similar to the one you wrote above to remove points based on these sparse graph distances.
I hope that helps!
I have come up with a very simple method which is much more efficient than my previous attempts.
This one simply loops over the data set and adds the current point to the list of grid points only if it is greater than r distance from all existing centers. This method is around 20 times faster than my previous attempts. Because there are no external libraries involved I can run this all in cython...
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def get_centers_fast(np.ndarray[DTYPE_t, ndim = 2] x, double radius):
cdef int N = x.shape[0]
cdef int D = x.shape[1]
cdef int m = 1
cdef np.ndarray[DTYPE_t, ndim = 2] xc = np.zeros([10000, D])
cdef double r = 0
cdef double r_min = 10
cdef int i, j, k
for k in range(D):
xc[0,k] = x[0,k]
for i in range(1, N):
r_min = 10
for j in range(m):
r = 0
for k in range(D):
r += (x[i, k] - xc[j, k])**2
r = r**0.5
if r < r_min:
r_min = r
if r_min > radius:
m = m + 1
for k in range(D):
xc[m - 1,k] = x[i,k]
xc = xc[:m, :]  # keep the m centres that were filled in (a centre with x == 0 is a valid centre)
return xc
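For comparison, or for checking the Cython output, the same greedy rule can be written in a few lines of plain NumPy. This is a slower reference sketch, not the Cython routine itself; the name get_centers_greedy and the uniform test data are assumptions.

```python
import numpy as np

def get_centers_greedy(X, radius):
    """Keep a point only if it is farther than `radius` from every kept centre."""
    centers = [X[0]]                                   # the first point is always a centre
    for p in X[1:]:
        d = np.linalg.norm(np.asarray(centers) - p, axis=1)
        if d.min() > radius:                           # far from all existing centres
            centers.append(p)
    return np.asarray(centers)

rng = np.random.default_rng(0)                         # hypothetical test data
X = rng.random((2000, 2))
C = get_centers_greedy(X, 0.1)
print(C.shape)
```

By construction, any two centres are more than radius apart and every input point lies within radius of at least one centre.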
Running these methods as follows:
N = 40000
r = 0.1
x1 = np.random.normal(size = N)
x1 = (x1 - min(x1)) / (max(x1)-min(x1))
x2 = np.random.normal(size = N)
x2 = (x2 - min(x2)) / (max(x2)-min(x2))
X = np.vstack([x1, x2]).T
tic = time.time()
grid0 = gt.get_centers0(X, r)
toc = time.time()
print('Method 0: ' + str(toc - tic))
tic = time.time()
grid2 = gt.get_centers2(X, r, 10)
toc = time.time()
print('Method 2: ' + str(toc - tic))
tic = time.time()
grid3 = gt.get_centers_fast(X, r)
toc = time.time()
print('Method 3: ' + str(toc - tic))
The new method is around 20 times faster. It could be made even faster, if I stopped looping early (e.g. if k successive iterations fail to produce a new center).
Method 0: 0.219595909119
Method 2: 0.191949129105
Method 3: 0.0127329826355
Maybe you could re-fit the nearest object only every k << N deletions to speed up the process. Most of the time the neighborhood structure should not change much.
Sounds like you are trying to reinvent one of the following:
cluster features (see BIRCH)
data bubbles (see "Data bubbles: Quality preserving performance boosting for hierarchical clustering")
canopy pre-clustering
i.e. this concept has already been invented at least three times with small variations.
Technically, it is not clustering. K-means isn't really clustering either.
It is much more adequately described as vector quantization.