I would like to delete every data point that lies within 10 cm of a previously kept point.
This is what I have, but it takes a very long time because my dataset is huge:
for i in range(len(data)):
    for j in range(i, len(data)):
        if i == j:
            continue
        elif np.sqrt((data[i, 0]-data[j, 0])**2 + (data[i, 1]-data[j, 1])**2) <= 0.1:
            data[j, 0] = np.nan
data = data[~np.isnan(data).any(axis=1)]
Is there a pythonic way to do this?
Here is an approach using a KDTree:
import numpy as np
from scipy.spatial import cKDTree as KDTree

def cluster_data_KDTree(a, thr=0.1):
    t = KDTree(a)
    mask = np.ones(a.shape[:1], bool)
    idx = 0
    nxt = 1
    while nxt:
        # drop everything within thr of the current point (including itself)
        mask[t.query_ball_point(a[idx], thr)] = False
        # offset to the next surviving point (0 if none remain)
        nxt = mask[idx:].argmax()
        # keep the current point itself
        mask[idx] = True
        idx += nxt
    return a[mask]
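A quick sanity check (seed and sizes are arbitrary): a point only survives if no earlier keeper was within thr of it, so all survivors should be pairwise more than thr apart:

import numpy as np
from scipy.spatial.distance import pdist

np.random.seed(0)
pts = np.random.rand(1000, 2)
kept = cluster_data_KDTree(pts, thr=0.1)
print(kept.shape[0] < pts.shape[0])  # True: points were removed
print(pdist(kept).min() > 0.1)       # True: survivors are > thr apart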
Borrowing @Divakar's test case we see that this delivers another 100x speedup on top of the 400x Divakar reports. Compared to the OP's code we extrapolate a ridiculous 40,000x:
np.random.seed(0)
data1 = np.random.rand(10000,2)
data2 = data1.copy()
from timeit import timeit
kwds = dict(globals=globals(), number=10)
print(timeit("cluster_data_KDTree(data1)", **kwds))
print(timeit("cluster_data_pdist_v1(data2)", **kwds))
np.random.seed(0)
data1 = np.random.rand(10000,2)
data2 = data1.copy()
out1 = cluster_data_KDTree(data1, thr=0.1)
out2 = cluster_data_pdist_v1(data2, dist_thresh = 0.1)
print(np.allclose(out1, out2))
Sample output:
0.05073001119308174
5.646531613077968
True
It turns out that this test case happens to be quite favorable to my approach because there are very few clusters and thus very few iterations.
If we drastically increase the number of clusters to about 3800 by changing the threshold to 0.01, KDTree still wins, but the speedup is reduced from 100x to 15x:
0.33647687803022563
5.28947562398389
True
We can use pdist with a single loop -
from scipy.spatial.distance import pdist

def cluster_data_pdist_v1(a, dist_thresh=0.1):
    d = pdist(a)
    mask = d <= dist_thresh
    n = len(a)
    # start/stop of each row's slice in the condensed distance matrix
    idx = np.concatenate(([0], np.arange(n-1, 0, -1).cumsum()))
    start, stop = idx[:-1], idx[1:]
    idx_out = np.zeros(mask.sum(), dtype=int)  # use np.empty for a bit more speedup
    cur_start = 0
    for iterID, (i, j) in enumerate(zip(start, stop)):
        if iterID not in idx_out[:cur_start]:
            rm_idx = np.flatnonzero(mask[i:j]) + iterID + 1
            L = len(rm_idx)
            idx_out[cur_start:cur_start+L] = rm_idx
            cur_start += L
    return np.delete(a, idx_out[:cur_start], axis=0)
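The idx arithmetic works because pdist returns the upper triangle of the distance matrix row by row, so row i occupies the condensed slice idx[i]:idx[i+1]. A minimal check against squareform (sizes are arbitrary):

import numpy as np
from scipy.spatial.distance import pdist, squareform

a = np.random.rand(6, 2)
d = pdist(a)
n = len(a)
idx = np.concatenate(([0], np.arange(n - 1, 0, -1).cumsum()))
for i in range(n - 1):
    # condensed slice for row i == distances from a[i] to a[i+1:]
    assert np.allclose(d[idx[i]:idx[i + 1]], squareform(d)[i, i + 1:])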
Benchmarking
Original approach -
def cluster_data_org(data, dist_thresh=0.1):
    for i in range(len(data)):
        for j in range(i, len(data)):
            if i == j:
                continue
            elif np.sqrt((data[i, 0]-data[j, 0])**2 +
                         (data[i, 1]-data[j, 1])**2) <= dist_thresh:
                data[j, 0] = np.nan
    return data[~np.isnan(data).any(axis=1)]
Runtime test and verification on random data in the range [0,1) with 10,000 points -
In [207]: np.random.seed(0)
...: data1 = np.random.rand(10000,2)
...: data2 = data1.copy()
...:
...: out1 = cluster_data_org(data1, dist_thresh = 0.1)
...: out2 = cluster_data_pdist_v1(data2, dist_thresh = 0.1)
...: print np.allclose(out1, out2)
True
In [208]: np.random.seed(0)
...: data1 = np.random.rand(10000,2)
...: data2 = data1.copy()
In [209]: %timeit cluster_data_org(data1, dist_thresh = 0.1)
1 loop, best of 3: 1min 50s per loop
In [210]: %timeit cluster_data_pdist_v1(data2, dist_thresh = 0.1)
1 loop, best of 3: 287 ms per loop
Around 400x speedup for such a setup!
Related
I am trying to vectorize the following operation on two matrices in Python.
f = matrix([[ 96],
            [192],
            [288],
            [384]], dtype=int32)

g = matrix([[  0.],
            [ 70.],
            [200.],
            [ 60.]])
I need to create z without loops, where z[0] = f[0] and z[i] = max(f[i], z[i-1] + g[i]) for i > 0. This loop is called thousands of times, which slows down the run time.
for i in range(4):
    if i != 0:
        z[i] = max(f[i], z[i-1] + g[i])
    else:
        z[0] = f[i]
Any guidance on how to vectorize this code would be really helpful.
Thanks in advance.
Here is a vectorized version. Unrolling the recurrence z[i] = max(f[i], z[i-1] + g[i]) gives z[i] = gg[i] + max(f[k] - gg[k] for k <= i), where gg = cumsum(g); the code uses the cumulative maximum of f - gg to find the points where f[i] takes over as the new maximum:
Timings:
N = 10
loopy 0.00594156 ms
vect 0.03193051 ms
N = 100
loopy 0.05560229 ms
vect 0.03186400 ms
N = 1000
loopy 0.57484017 ms
vect 0.04492043 ms
N = 10000
loopy 5.75115310 ms
vect 0.15519847 ms
N = 100000
loopy 57.30253551 ms
vect 1.69428380 ms
Code:
import numpy as np
import types
from timeit import timeit

def setup_data(N):
    g = np.random.random((N,))
    f = 2 + np.cumsum(np.random.random((N,)))
    return f, g

def f_loopy(f, g):
    N, = f.shape
    z = np.empty_like(f)
    for i in range(N):
        if i != 0:
            z[i] = max(f[i], z[i-1] + g[i])
        else:
            z[0] = f[i]
    return z

def f_vect(f, g):
    N, = f.shape
    gg = np.cumsum(g)
    # running maximum of f - cumsum(g); constant stretches share one "anchor" f[k]
    rmx = np.maximum.accumulate(f - gg)
    # indices where a new anchor takes over
    sw = np.r_[0, 1 + np.flatnonzero(rmx[:-1] != rmx[1:]), N]
    return gg + np.repeat(f[sw[:-1]] - gg[sw[:-1]], np.diff(sw))

for N in [10, 100, 1000, 10000, 100000]:
    data = setup_data(N)
    ref = f_loopy(*data)
    print(f'N = {N}')
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert np.allclose(ref, func(*data))
            print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                'f(*data)', globals={'f': func, 'data': data}, number=100)*10))
        except:
            print("{:16s} apparently failed".format(name[2:]))
I want to implement SVD++ with numpy or tensorflow.
( https://pdfs.semanticscholar.org/8451/c2812a1476d3e13f2a509139322cc0adb1a2.pdf )
(equation 4 on page 4)
I want to implement the above equation without any for loops, but the summation of y_j over the index set R(u) makes that hard.
So my question is: I want to implement the equation below (q_v multiplied by the sum of y_j) without any for loops.
1. Is it possible to implement it with numpy without a for loop?
2. Is it possible to implement it with tensorflow without a for loop?
My implementation is below, but I want to remove the remaining for loops:
import numpy as np

num_users = 3
num_items = 5
latent_dim = 2
p = 0.1

r = np.random.binomial(1, 1 - p, (num_users, num_items))
r_hat = np.zeros([num_users, num_items])

q = np.random.randn(latent_dim, num_items)
y = np.random.randn(latent_dim, num_items)

## First Try
for user in range(num_users):
    for item in range(num_items):
        q_j = q[:, item]
        user_item_list = [i for i, e in enumerate(r[user, :]) if e != 0]  # R_u
        sum_y_j = 0  # to accumulate the sum of y_j
        for user_item in user_item_list:
            sum_y_j = sum_y_j + y[:, user_item]
        sum_y_j = np.asarray(sum_y_j)
        r_hat[user, item] = np.dot(np.transpose(q_j), sum_y_j)
print(r_hat)
print("=" * 100)

## Second Try
for user in range(num_users):
    for item in range(num_items):
        q_j = q[:, item]
        user_item_list = [i for i, e in enumerate(r[user, :]) if e != 0]  # R_u
        sum_y_j = np.sum(y[:, user_item_list], axis=1)  # sum of y_j
        r_hat[user, item] = np.dot(np.transpose(q_j), sum_y_j)
print(r_hat)
print("=" * 100)

## Third Try
for user in range(num_users):
    user_item_list = [i for i, e in enumerate(r[user, :]) if e != 0]  # R_u
    sum_y_j = np.sum(y[:, user_item_list], axis=1)  # sum of y_j
    r_hat[user, :] = np.dot(np.transpose(q), sum_y_j)
print(r_hat)
Try this.
sum_y = []
for user in range(num_users):
    # broadcast the user's 0/1 row across the latent dimension
    mask = np.repeat(r[user, :][None, :], latent_dim, axis=0)
    sum_y.append(np.sum(np.multiply(y, mask), axis=1))
sum_y = np.asarray(sum_y)
r_hat = (np.dot(q.T, sum_y.T)).T
print(r_hat)
It eliminates the enumerate loop, and the dot product can be done in a single call. I don't think it can be reduced beyond this.
Simply use two matrix-multiplications there with np.dot for the final output -
r_hat = r.dot(y.T).dot(q)
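Why two matrix multiplications suffice (a sketch of the reasoning): since r holds 0/1 indicators, the per-user sum of y_j over R(u) is sum_j r[u,j] * y[:,j], which is exactly row u of r.dot(y.T); the final dot with q then contracts the latent dimension. The same contraction written as a single einsum:

r_hat = np.einsum('uj,dj,di->ui', r, y, q)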
Sample run to verify results -
OP's sample setup -
In [68]: import numpy as np
...:
...: num_users = 3
...: num_items = 5
...: latent_dim = 2
...: p = 0.1
...:
...: r = np.random.binomial(1, 1 - p,(num_users, num_items))
...: r_hat = np.zeros([num_users,num_items])
...:
...: q = np.random.randn(latent_dim,num_items)
...: y = np.random.randn(latent_dim,num_items)
...:
In [69]: ## Second Try from OP
...: for user in range(num_users):
...: for item in range(num_items):
...: q_j = q[:,item]
...: user_item_list = [i for i, e in enumerate(r[user,:]) if e != 0] # R_u
...: sum_y_j = np.sum(y[:,user_item_list],axis=1) # to make sum of y_i
...: r_hat[user,item] = np.dot(np.transpose(q_j),sum_y_j)
...:
Let's print out the result from OP's solution -
In [70]: r_hat
Out[70]:
array([[ 4.06866107e+00, 2.91099460e+00, -6.50447668e+00,
7.44275731e-03, -2.14857566e+00],
[ 4.06866107e+00, 2.91099460e+00, -6.50447668e+00,
7.44275731e-03, -2.14857566e+00],
[ 5.57369599e+00, 3.76169533e+00, -8.47503476e+00,
1.48615948e-01, -2.82792374e+00]])
Now, I am using my proposed solution -
In [71]: r.dot(y.T).dot(q)
Out[71]:
array([[ 4.06866107e+00, 2.91099460e+00, -6.50447668e+00,
7.44275731e-03, -2.14857566e+00],
[ 4.06866107e+00, 2.91099460e+00, -6.50447668e+00,
7.44275731e-03, -2.14857566e+00],
[ 5.57369599e+00, 3.76169533e+00, -8.47503476e+00,
1.48615948e-01, -2.82792374e+00]])
Value check seems successful!
I'd like to speed up the following calculations handling r rays and n spheres. Here is what I got so far:
# shape of mu1 and mu2 is (r, n)
# shape of rays is (r, 3)
# note that intersections has 2n columns because every sphere can
# yield up to two intersections (secant, tangent, no intersection)
intersections = np.empty((r, 2*n, 3))
for col in range(n):
    intersections[:, col, :] = rays * mu1[:, col][:, np.newaxis]
    intersections[:, col + n, :] = rays * mu2[:, col][:, np.newaxis]
# [...]
# calculate the euclidean distance from the center of gravity (0, 0, 0)
distances = np.empty((r, 2 * n))
for col in range(n):
    distances[:, col] = np.linalg.norm(intersections[:, col], axis=1)
    distances[:, col + n] = np.linalg.norm(intersections[:, col + n], axis=1)
I tried speeding things up by avoiding the for loops, but couldn't figure out how to broadcast the arrays properly so that I only need a single function call. Any help is much appreciated.
Here's a vectorized way using broadcasting -
intersections = np.hstack((mu1,mu2))[...,None]*rays[:,None,:]
distances = np.sqrt((intersections**2).sum(2))
The last step could be replaced with np.einsum, like so -
distances = np.sqrt(np.einsum('ijk,ijk->ij',intersections,intersections))
Or replace almost the whole thing with np.einsum for another vectorized way, like so -
mu = np.hstack((mu1,mu2))
distances = np.sqrt(np.einsum('ij,ij,ik,ik->ij',mu,mu,rays,rays))
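Incidentally, since each intersection is just a scalar multiple of its ray, the distance factors into |mu| times the norm of the ray, so a still simpler sketch would be (the absolute value guards against negative mu roots, which is an assumption about the inputs):

mu = np.hstack((mu1, mu2))
distances = np.abs(mu) * np.linalg.norm(rays, axis=1)[:, None]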
Runtime tests and verify outputs -
def original_app(mu1, mu2, rays):
    intersections = np.empty((r, 2*n, 3))
    for col in range(n):
        intersections[:, col, :] = rays * mu1[:, col][:, np.newaxis]
        intersections[:, col + n, :] = rays * mu2[:, col][:, np.newaxis]
    distances = np.empty((r, 2 * n))
    for col in range(n):
        distances[:, col] = np.linalg.norm(intersections[:, col], axis=1)
        distances[:, col + n] = np.linalg.norm(intersections[:, col + n], axis=1)
    return distances

def vectorized_app1(mu1, mu2, rays):
    intersections = np.hstack((mu1, mu2))[..., None] * rays[:, None, :]
    return np.sqrt((intersections**2).sum(2))

def vectorized_app2(mu1, mu2, rays):
    intersections = np.hstack((mu1, mu2))[..., None] * rays[:, None, :]
    return np.sqrt(np.einsum('ijk,ijk->ij', intersections, intersections))

def vectorized_app3(mu1, mu2, rays):
    mu = np.hstack((mu1, mu2))
    return np.sqrt(np.einsum('ij,ij,ik,ik->ij', mu, mu, rays, rays))
Timings -
In [101]: # Inputs
...: r = 1000
...: n = 1000
...: mu1 = np.random.rand(r, n)
...: mu2 = np.random.rand(r, n)
...: rays = np.random.rand(r, 3)
In [102]: np.allclose(original_app(mu1,mu2,rays),vectorized_app1(mu1,mu2,rays))
Out[102]: True
In [103]: np.allclose(original_app(mu1,mu2,rays),vectorized_app2(mu1,mu2,rays))
Out[103]: True
In [104]: np.allclose(original_app(mu1,mu2,rays),vectorized_app3(mu1,mu2,rays))
Out[104]: True
In [105]: %timeit original_app(mu1,mu2,rays)
...: %timeit vectorized_app1(mu1,mu2,rays)
...: %timeit vectorized_app2(mu1,mu2,rays)
...: %timeit vectorized_app3(mu1,mu2,rays)
...:
1 loops, best of 3: 306 ms per loop
1 loops, best of 3: 215 ms per loop
10 loops, best of 3: 140 ms per loop
10 loops, best of 3: 136 ms per loop
I have a numpy operation that looks like the following:
for i in range(i_max):
    for j in range(j_max):
        r[i, j, x[i, j], y[i, j]] = c[i, j]
where x, y and c have the same shape.
Is it possible to use numpy's advanced indexing to speed this operation up?
I tried using:
i = numpy.arange(i_max)
j = numpy.arange(j_max)
r[i, j, x, y] = c
However, I didn't get the result I expected.
Using linear indexing -
d0, d1, d2, d3 = r.shape
np.put(r, np.arange(i_max)[:, None]*d1*d2*d3 + np.arange(j_max)*d2*d3 + x*d3 + y, c)
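Equivalently, the hand-rolled index arithmetic can be delegated to np.ravel_multi_index, which computes the same flat indices from the broadcast index arrays (a sketch):

flat_idx = np.ravel_multi_index(
    (np.arange(i_max)[:, None], np.arange(j_max), x, y), r.shape)
np.put(r, flat_idx, c)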
Benchmarking and verification
Define functions -
def linear_indx(r, x, y, c, i_max, j_max):
    d0, d1, d2, d3 = r.shape
    np.put(r, np.arange(i_max)[:, None]*d1*d2*d3 + np.arange(j_max)*d2*d3 + x*d3 + y, c)
    return r

def org_app(r, x, y, c, i_max, j_max):
    for i in range(i_max):
        for j in range(j_max):
            r[i, j, x[i, j], y[i, j]] = c[i, j]
    return r
Setup input arrays and benchmark -
In [134]: # Setup input arrays
...: i_max = 40
...: j_max = 50
...: D0 = 60
...: D1 = 70
...: N = 80
...:
...: r = np.zeros((D0,D1,N,N))
...: c = np.random.rand(i_max,j_max)
...:
...: x = np.random.randint(0,N,(i_max,j_max))
...: y = np.random.randint(0,N,(i_max,j_max))
...:
In [135]: # Make copies for testing, as both functions make in-situ changes
...: r1 = r.copy()
...: r2 = r.copy()
...:
In [136]: # Verify results by comparing with original loopy approach
...: np.allclose(linear_indx(r1,x,y,c,i_max,j_max),org_app(r2,x,y,c,i_max,j_max))
Out[136]: True
In [137]: # Make copies for testing, as both functions make in-situ changes
...: r1 = r.copy()
...: r2 = r.copy()
...:
In [138]: %timeit linear_indx(r1,x,y,c,i_max,j_max)
10000 loops, best of 3: 115 µs per loop
In [139]: %timeit org_app(r2,x,y,c,i_max,j_max)
100 loops, best of 3: 2.25 ms per loop
The indexing arrays need to be broadcastable against each other for this to work. The only change needed is to add an axis to the first index i so that its shape broadcasts with the rest. The quick way to accomplish this is to index with None (which is equivalent to numpy.newaxis):
i = numpy.arange(i_max)
j = numpy.arange(j_max)
r[i[:,None], j, x, y] = c
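A minimal self-contained demonstration of the fixed indexing (shapes chosen small purely for illustration):

import numpy as np

i_max, j_max, N = 2, 3, 4
r = np.zeros((i_max, j_max, N, N))
c = np.random.rand(i_max, j_max)
x = np.random.randint(0, N, (i_max, j_max))
y = np.random.randint(0, N, (i_max, j_max))

i = np.arange(i_max)
j = np.arange(j_max)
r[i[:, None], j, x, y] = c          # i broadcasts over j, x and y
assert r[1, 2, x[1, 2], y[1, 2]] == c[1, 2]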
I recently wrote a function for an ordered logit model.
But it takes a long time when running on big data, so I want to rewrite the code and substitute the numpy.where function for the if statements.
There are some problems with my new code that I don't know how to solve.
If you know, please help me. Thank you very much!
This is my original function.
import numpy as np
from scipy.stats import logistic

def func(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            for i in range(1, len(thresholds)):
                if row[0] == i:
                    diff_prob = logistic.cdf(thresholds[i] - row[1]) - logistic.cdf(thresholds[i - 1] - row[1])
                    if diff_prob <= 10 ** -5:
                        ll += np.log(10 ** -5)
                    else:
                        ll += np.log(diff_prob)
    return ll

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func(y, X, thresholds))
This is the new, but not yet working, code.
y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
ll = np.where(y == 0, logistic.logcdf(thresholds[0] - X),
              np.where(y == len(thresholds), logistic.logcdf(X - thresholds[-1]),
                       np.log(logistic.cdf(thresholds[1] - X) - logistic.cdf(thresholds[0] - X))))
print(ll.sum())
The problem is that I don't know how to rewrite the inner loop (for i in range(1, len(thresholds)):) in this form.
I think asking how to implement it just using np.where is a bit of an X/Y problem.
So I'll try to explain how I would approach optimizing this function.
My first instinct is to get rid of the for loop, which was the pain point anyway:
import numpy as np
from scipy.stats import logistic

def func1(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            diff_prob = logistic.cdf(thresholds[row[0]] - row[1]) - \
                        logistic.cdf(thresholds[row[0] - 1] - row[1])
            diff_prob = 10 ** -5 if diff_prob < 10 ** -5 else diff_prob
            ll += np.log(diff_prob)
    return ll

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func1(y, X, thresholds))
I have just replaced i with row[0], without changing the semantics of the loop. So that's one for loop less.
Now I would like to have the form of the statements in the different branches of the if-else to be the same. To that end:
import numpy as np
from scipy.stats import logistic

def func2(y, X, thresholds):
    ll = 0.0
    for row in zip(y, X):
        if row[0] == 0:
            ll += logistic.logcdf(thresholds[0] - row[1])
        elif row[0] == len(thresholds):
            ll += logistic.logcdf(row[1] - thresholds[-1])
        else:
            ll += np.log(
                np.maximum(
                    10 ** -5,
                    logistic.cdf(thresholds[row[0]] - row[1]) -
                    logistic.cdf(thresholds[row[0] - 1] - row[1])
                )
            )
    return ll

y = np.array([0, 1, 2])
X = [2, 2, 2]
thresholds = np.array([2, 3])
print(func2(y, X, thresholds))
Now the expression in each branch is of the form ll += expr.
At this point there are a couple of different paths the optimization can take. You can try to optimize the loop away by writing it as a comprehension, but I suspect that won't give you much of a speed increase.
An alternate path is to pull the if conditions out of the loop. That is what your intent with np.where was as well:
import numpy as np
from scipy.stats import logistic

def func3(y, X, thresholds):
    y_0 = y == 0
    y_end = y == len(thresholds)
    y_rest = ~(y_0 | y_end)
    ll_1 = logistic.logcdf(thresholds[0] - X[y_0])
    ll_2 = logistic.logcdf(X[y_end] - thresholds[-1])
    ll_3 = np.log(
        np.maximum(
            10 ** -5,
            logistic.cdf(thresholds[y[y_rest]] - X[y_rest]) -
            logistic.cdf(thresholds[y[y_rest] - 1] - X[y_rest])
        )
    )
    return np.sum(ll_1) + np.sum(ll_2) + np.sum(ll_3)

y = np.array([0, 1, 2])
X = np.array([2, 2, 2])
thresholds = np.array([2, 3])
print(func3(y, X, thresholds))
Note that I turned X into an np.array to be able to use fancy indexing on it.
At this point, I'd wager that it is fast enough for my purposes. However, you can stop earlier or beyond this point, depending on your requirements.
On my computer, I get the following results:
y = np.random.random_integers(0, 10, size=(10000,))
X = np.random.random_integers(0, 10, size=(10000,))
thresholds = np.cumsum(np.random.rand(10))
%timeit func(y, X, thresholds) # Original
1 loops, best of 3: 1.51 s per loop
%timeit func1(y, X, thresholds) # Removed for-loop
1 loops, best of 3: 1.46 s per loop
%timeit func2(y, X, thresholds) # Standardized if statements
1 loops, best of 3: 1.5 s per loop
%timeit func3(y, X, thresholds) # Vectorized ~ 500x improvement
100 loops, best of 3: 2.74 ms per loop
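One closing note on the original np.where attempt: np.where evaluates all of its arguments for every element before selecting between them, so the invalid branches (e.g. the log of a non-positive difference) are still computed and can emit warnings; the boolean-mask approach in func3 avoids that entirely. A tiny illustration:

import numpy as np

x = np.array([-1.0, 4.0])
# np.log(x) is evaluated for BOTH elements before np.where selects,
# so the -1.0 entry triggers an "invalid value" RuntimeWarning
result = np.where(x > 0, np.log(x), 0.0)
print(result)  # [0.         1.38629436]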