Background information on my question:
Two objects are called k-reciprocal nearest neighbours of each other if they are among each other's k-nearest neighbours. I'm only interested in objects belonging to disjoint groups. For instance, consider two sets of numbers S = {0, 1, 2}, T = {0.1, 1.1, 1.9} and k=2.
For group S,
the k-nearest neighbours of 0 in T are 0.1, 1.1.
the k-nearest neighbours of 1 in T are 1.1, 1.9.
the k-nearest neighbours of 2 in T are 1.9, 2.1.
And for group T,
the k-nearest neighbours of 0.1 in S are 0, 1.
the k-nearest neighbours of 1.1 in S are 1, 2.
the k-nearest neighbours of 1.9 in S are 1, 2.
Therefore the pairs of k-reciprocal nearest neighbours are (0, 0.1), (1, 1.1), (1, 1.9), (2, 1.9).
Let {A, B, C, D, E} and {W, X, Y, Z} be two disjoint groups of some objects. Suppose that the Euclidean metric makes sense between these groups, and that we have the following 5x4 distance matrix:
distmat = np.array([[5, 1, 4, 7.5],
                    [3, 10, 2, 11],
                    [9, 2.5, 8, 3],
                    [1, 3, 5.5, 5],
                    [4, 6, 3.5, 8]])
The five rows give the distances of objects A, B, C, D, E to the objects W, X, Y, Z (one column per object of the second group).
Question: What is an efficient way of obtaining the k-reciprocal nearest neighbours of A and of B?
Obtaining the k-nearest neighbours is fine: I used np.argsort(distmat) and then retrieved the objects whose sorted position is less than k.
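For concreteness, a small sketch of that argsort step (my own illustration, assuming k = 2 and the distmat above):
import numpy as np

k = 2
# for each row (A..E), the column indices (W=0, X=1, Y=2, Z=3) of its k nearest objects
knn_rows = np.argsort(distmat, axis=1)[:, :k]
print(knn_rows[0])  # [1 2] -> X and Y are the two nearest objects to A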
Here's what I tried for the reciprocal part. Without loss of generality, consider object A. For each k-nearest neighbour N of A, transpose distmat and check the row corresponding to N. If A is a k-nearest neighbour of N, then they are reciprocals; otherwise they are not. Some rough code:
for N in knn_A:
    knn_N = get_knn(distmat.T[N])
    if A in knn_N:
        print("{} and {} are {}-reciprocals".format(A, N, k))
Any suggestions for improvement? This is pretty slow because I have many nested for-loops already, and the size of the two groups is possibly large.
You will have to check whether this is actually faster, since I don't see any nested for loops in the code you provided. I used your example, which I think lists the wrong reciprocal neighbours because of the line "the k-nearest neighbours of 2 in T are 1.9, 2.1": 2.1 is not in the set, and if you mean 1.1, then (2, 1.1) is also a reciprocal pair.
import numpy as np
import itertools

# set k and make the example sets
k = 2
s1 = [0, 1, 2]
s2 = [.1, 1.1, 1.9]

# create the distance matrix
newarray = [[abs(s2j - s1i) for s2j in s2] for s1i in s1]
distmat = np.array(newarray)

# get the nearest neighbours for each set
neighbors_si = np.argsort(distmat)
neighbors_sj = np.argsort(distmat.T)

# map each element of each set to its k nearest neighbours
neighbors_si = {i: neighbors_si[i][0:k] for i in range(len(neighbors_si))}
neighbors_sj = {j: neighbors_sj[j][0:k] for j in range(len(neighbors_sj))}

# for each combination of i and j, determine if they are in each other's neighbour list
for i, j in itertools.product(neighbors_si.keys(), neighbors_sj.keys()):
    if j in neighbors_si[i] and i in neighbors_sj[j]:
        print('{} and {} are {}-reciprocals'.format(s1[i], s2[j], k))
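If the two groups are large, the Python-level product loop itself can become the bottleneck. Here is a fully vectorised sketch (my own addition, not part of the answer above; it reuses distmat and k from the snippet and assumes rows are the first group, columns the second). It builds boolean membership masks from the two argsort results and intersects them:
import numpy as np

def reciprocal_pairs(distmat, k):
    """Return (i, j) index pairs that are k-reciprocal nearest neighbours."""
    n_s, n_t = distmat.shape
    knn_s = np.argsort(distmat, axis=1)[:, :k]    # k nearest columns for each row
    knn_t = np.argsort(distmat.T, axis=1)[:, :k]  # k nearest rows for each column

    s_likes_t = np.zeros((n_s, n_t), dtype=bool)
    t_likes_s = np.zeros((n_t, n_s), dtype=bool)
    s_likes_t[np.arange(n_s)[:, None], knn_s] = True
    t_likes_s[np.arange(n_t)[:, None], knn_t] = True

    # (i, j) is reciprocal iff each is in the other's k-nearest list
    return np.argwhere(s_likes_t & t_likes_s.T)

print(reciprocal_pairs(distmat, k))  # index pairs into s1 and s2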
Consider two arrays I and J that determine the neighbor pairs:
I = np.array([0, 0, 1, 2, 2, 3])
J = np.array([1, 2, 0, 0, 3, 2])
Which means element 0 has two neighbors 1 and 2. Element 1 has only 0 as a neighbor and so on.
What is the most efficient way to create arrays of all neighbor triples I', J', K' such that j is a neighbor of i and k is a neighbor of j, under the condition that i, j, and k are different elements (i != j != k)?
Ip = np.array([0, 0, 2, 3])
Jp = np.array([2, 2, 0, 2])
Kp = np.array([0, 3, 1, 0])
Of course, one way is to loop over each element. Is there a more efficient algorithm? (working with 10-500 million elements)
I would go with a very simple approach and use pandas (I and J are your numpy arrays):
import pandas as pd
df1 = pd.DataFrame({'I': I, 'J': J})
df2 = df1.rename(columns={'I': 'K', 'J': 'I'})
result = pd.merge(df2, df1, on='I').query('K != J')
The advantage is that pandas.merge relies on a very fast underlying numerical implementation. You can also make the computation even faster, for example by merging on the index (a sketch of this follows the timing example below).
To reduce the memory that this approach needs, it would probably be very useful to reduce the size of df1 and df2 before merging them (for example, by changing the dtype of their columns to something that suits your needs).
Here is an example of how to optimize speed and memory of the computation:
from timeit import timeit
import numpy as np
import pandas as pd
I = np.random.randint(0, 10000, 1000000)
J = np.random.randint(0, 10000, 1000000)
df1_64 = pd.DataFrame({'I': I, 'J': J})
df1_32 = df1_64.astype('int32')
df2_64 = df1_64.rename(columns={'I': 'K', 'J': 'I'})
df2_32 = df1_32.rename(columns={'I': 'K', 'J': 'I'})
timeit(lambda: pd.merge(df2_64, df1_64, on='I').query('K != J'), number=1)
# 18.84
timeit(lambda: pd.merge(df2_32, df1_32, on='I').query('K != J'), number=1)
# 9.28
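As a rough sketch of the index-merge idea mentioned above (my own illustration, not timed in the original answer; it reuses df1_32 and df2_32 from the snippet): setting 'I' as the index of both frames lets join work directly on the index.
# Sketch: merge on the index instead of a column
left = df2_32.set_index('I')    # remaining column: K
right = df1_32.set_index('I')   # remaining column: J
result = left.join(right, how='inner').query('K != J')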
There is no particularly magic algorithm to generate all of the triples. You can avoid re-fetching a node's neighbors by an orderly search, but that's about it.
Make an empty list, N, of nodes to check.
Add some start node, S, to N.
While N is not empty:
    Pop a node off the list; call it A.
    Make a set of its neighbors, A'.
    For each neighbor B of A:
        For each element a of A':
            Generate the triple (a, A, B).
        Add B to the list of nodes to check, if it has not already been checked.
Does that help? There are still several details to handle in the algorithm above, such as avoiding duplicate generation, and fine points of moving through cliques.
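A minimal Python sketch of that traversal (my own illustration, not from the answer; it assumes the graph is given as an adjacency dict and, as noted, leaves duplicate handling and clique corner cases aside):
from collections import deque

def generate_triples(adj, start):
    """Yield (a, A, B) triples by walking the graph from `start`.

    `adj` maps each node to an iterable of its neighbors; duplicate
    triples are not filtered here.
    """
    to_check = deque([start])
    checked = {start}
    while to_check:
        A = to_check.popleft()
        neighbors_of_A = set(adj[A])
        for B in adj[A]:
            for a in neighbors_of_A:
                if a != B:                 # skip the degenerate a == B case
                    yield (a, A, B)
            if B not in checked:
                checked.add(B)
                to_check.append(B)

# example adjacency for I = [0, 0, 1, 2, 2, 3], J = [1, 2, 0, 0, 3, 2]
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
print(list(generate_triples(adj, 0)))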
What you are looking for is all paths of length 3 in the graph. You can achieve this simply with the following recursive algorithm:
import numpy as np
import networkx as nx

def findPaths(G, u, n):
    """Returns a list of all paths of length `n` starting at vertex `u`."""
    if n == 1:
        return [[u]]
    paths = [[u] + path
             for neighbor in G.neighbors(u)
             for path in findPaths(G, neighbor, n - 1)
             if u not in path]
    return paths
# Generating graph
vertices = np.unique(I)
edges = list(zip(I,J))
G = nx.Graph()
G.add_edges_from(edges)
# Grabbing all 3-paths
paths = [path for v in vertices for path in findPaths(G,v,3)]
paths
>>> [[0, 2, 3], [1, 0, 2], [2, 0, 1], [3, 2, 0]]
This is an initial solution to your problem using networkx, an optimized library for graph computations:
import numpy as np
import networkx as nx

I = np.array([0, 0, 1, 2, 2, 3])
J = np.array([1, 2, 0, 0, 3, 2])
I_, J_, K_ = [], [], []

num_nodes = np.max(np.concatenate([I, J])) + 1
A = np.zeros((num_nodes, num_nodes))
A[I, J] = 1
print("Adjacency Matrix:")
print(A)

G = nx.from_numpy_matrix(A)  # note: in networkx >= 3.0 this is nx.from_numpy_array(A)
for i in range(num_nodes):
    first_neighbors = list(G.neighbors(i))
    for j in first_neighbors:
        second_neighbor = list(G.neighbors(j))
        second_neighbor_no_circle = list(filter(lambda node: node != i, second_neighbor))
        num_second_neighbors = len(second_neighbor_no_circle)
        if num_second_neighbors > 0:
            I_.extend(num_second_neighbors * [i])
            J_.extend(num_second_neighbors * [j])
            K_.extend(second_neighbor_no_circle)

I_, J_, K_ = np.array(I_), np.array(J_), np.array(K_)
print("result:")
print(I_)
print(J_)
print(K_)
####### Output #######
Adjacency Matrix:
[[0. 1. 1. 0.]
[1. 0. 0. 0.]
[1. 0. 0. 1.]
[0. 0. 1. 0.]]
result:
[0 1 2 3]
[2 0 0 2]
[3 2 1 0]
I used %%timeit on the code above without print statements to check the running time:
49 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Complexity analysis:
Finding all the neighbors of all the neighbors is essentially taking 2 steps in a Depth First Search algorithm. This could take, depending on the graph's topology, up to O(|V| + |E|) where |E| is the number of edges in the graph and |V| is the number of vertices.
To the best of my knowledge, there is no better algorithm on a general graph.
However, if you do know some special properties about the graph, the running time could be more tightly bounded or perhaps alter the current algorithm based on this knowledge.
For instance, if you know that every vertex has at most d edges and the graph has one connected component, the work per vertex is bounded by the two-step expansion, roughly O(d²) (at most d neighbors, each with at most d second neighbors), which is much better when d << |E|.
Let me know if you have any questions.
Given a set of n vectors of dimension d stored in an (n, d) array and a second set of m vectors of the same dimension stored in an (m, d) array, I want to calculate the squared pairwise distance between the vectors, scaled by some matrix A of size (d, d).
The output should be a (n,m) array.
I expect m and n to range from 1 to 10,000 and d from 1 to 100.
The squared distance between two points u and w is given by d(u, w) = (u - w)^T A (u - w).
In the non-optimized, but working Python code this looks like this:
import numpy as np

v1 = np.array([[1, 2],
               [3, 4],
               [4, 5]])
v2 = np.array([[1, 1],
               [2, 2],
               [2, 2],
               [0, 0]])
A = np.array([[1, 0], [2, 3]])

d = np.zeros((3, 4))
for i in range(0, 3):
    for j in range(0, 4):
        d[i, j] = (v1[i, :] - v2[j, :]).T @ A @ (v1[i, :] - v2[j, :])
The squared distance between the example points is:
d = [[ 3. 1. 1. 17.]
[ 43. 17. 17. 81.]
[ 81. 43. 43. 131.]]
Is there a version of this that avoids the nested loop in Python, e.g. using broadcasting black magic?
EDIT:
For the case
A = np.array([[1,0], [0, 1]])
this is the normal squared euclidean distance which can be calculated e.g.
from scipy.spatial.distance import cdist
cdist(v1,v2,'sqeuclidean')
We can use np.einsum -
V = v1[:,None,:]-v2
d_out = np.einsum('ijk,kl,ijl->ij',V,A,V)
Also, play around with the optimize flag in np.einsum by setting it as True to use BLAS.
Explanation on the vectorized method
Original code was -
d[i,j] = (v1[i,:] - v2[j,:]).T @ A @ (v1[i,:] - v2[j,:])
I. We are translating :
v1[i,:] - v2[j,:]
to the outer operation with broadcasting :
v1[:,None,:]-v2
Schematically put:
v1[:,None,:] : n x 1 x d
v2           :     m x d
output, V    : n x m x d
More information on this outer-style operation and on broadcasting can be found in the NumPy docs.
II. Next up, (v1[i,:] - v2[j,:]).T @ A @ (v1[i,:] - v2[j,:]) with the new V becomes np.einsum('ijk,kl,ijl->ij', V, A, V) using einsum's string notation. More info can be found in the docs.
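For readers who find the einsum string hard to parse, here is an equivalent sketch (my own addition, not part of the answer) using plain broadcasting and matmul; it produces the same (n, m) result:
# equivalent formulation with broadcasting + matmul
V = v1[:, None, :] - v2               # shape (n, m, d)
d_out = ((V @ A) * V).sum(axis=-1)    # sum_k sum_l V[i,j,k] * A[k,l] * V[i,j,l]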
I am looking into the NumPy correlation function
numpy.correlate(a, v, mode='valid')
Cross-correlation of two 1-dimensional sequences.
This function computes the correlation as generally defined in signal processing texts:
c_{av}[k] = sum_n a[n+k] * conj(v[n])
Then for the example:
a = [1, 2, 3]
v = [0, 1, 0.5]
np.correlate([1, 2, 3], [0, 1, 0.5], "full")
array([ 0.5, 2. , 3.5, 3. , 0. ])
So k in the output array runs from 0 to 4 in this example. However, I am wondering how a[n+k] is defined when (n+k) > 2 in this case.
Also, how is conj(v[n]) defined, and how is each element of the output array computed?
The formula c_{av}[k] = sum_n a[n+k] * conj(v[n]) is a little misleading because k on the left is not necessarily the Python index of the output array. In the 'full' mode, the possible values of k are those for which there exists at least one n such that a[n+k] * conj(v[n]) is defined (that is, both n+k and n fall in the ranges of respective arrays).
In your examples, k in sum_n a[n+k] * conj(v[n]) can be -2, -1, 0, 1, 2. These generate 5 values that you see. For example, k being -2 results in a[2-2]*conj(v[2]) which is 0.5, and so on.
In general, the range of k in the 'full' mode is from 1-len(v) to len(a)-1 inclusive. So, if k is really understood as the Python index, then the formula should be
c_{av}[k] = sum_n a[n+k-len(v)+1] * conj(v[n])
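A short check of that Python-index formula against the example above (my own verification sketch, not part of the original answer):
import numpy as np

a = np.array([1, 2, 3])
v = np.array([0, 1, 0.5])

full = np.correlate(a, v, "full")

# recompute each output element with c[k] = sum_n a[n+k-len(v)+1] * conj(v[n]),
# where k is the Python index and out-of-range terms of a are skipped
manual = []
for k in range(len(a) + len(v) - 1):
    total = 0.0
    for n in range(len(v)):
        idx = n + k - len(v) + 1
        if 0 <= idx < len(a):
            total += a[idx] * np.conj(v[n])
    manual.append(total)

print(full)              # [0.5 2.  3.5 3.  0. ]
print(np.array(manual))  # identical values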
I have a set of data in Python with three columns:
x y angle
I want to calculate the distance between every possible pair of points and plot these distances against the difference between the two angles.
x, y, a = np.loadtxt('w51e2-pa-2pk.log', unpack=True)
n = 0
f=(((x[n])-x[n+1:])**2+((y[n])-y[n+1:])**2)**0.5
d = a[n]-a[n+1:]
plt.scatter(f,d)
There are 255 points in my data.
f is the distance and d is the difference between two angles.
My question is: can I set n = [1, 2, 3, ..., 255] and repeat the calculation to get f and d for all possible pairs?
You can obtain the pairwise distances through broadcasting by considering it as an outer operation on the array of 2-dimensional vectors as follows:
vecs = np.stack((x, y)).T
np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
For example,
In [1]: import numpy as np
...: x = np.array([1, 2, 3])
...: y = np.array([3, 4, 6])
...: vecs = np.stack((x, y)).T
...: np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
...:
Out[1]:
array([[ 0. , 1.41421356, 3.60555128],
[ 1.41421356, 0. , 2.23606798],
[ 3.60555128, 2.23606798, 0. ]])
Here, the (i, j)'th entry is the distance between the i'th and j'th vectors.
The case of the pairwise differences between angles is similar, but simpler, as you only have one dimension to deal with:
In [2]: a = np.array([10, 12, 15])
...: a[np.newaxis, :] - a[: , np.newaxis]
...:
Out[2]:
array([[ 0, 2, 5],
[-2, 0, 3],
[-5, -3, 0]])
Moreover, plt.scatter does not care that the results are given as matrices. Putting everything together using the notation of the question, you can obtain the plot of angle differences against distances with something like
vecs = np.stack((x, y)).T
f = np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
d = angle[np.newaxis, :] - angle[: , np.newaxis]
plt.scatter(f, d)
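For larger point sets, an equivalent sketch (my own addition, not part of the answer) uses scipy.spatial.distance.pdist, which avoids building the full (n, n, 2) difference array:
from scipy.spatial.distance import pdist, squareform
import numpy as np

vecs = np.stack((x, y)).T
f = squareform(pdist(vecs))                       # same (n, n) distance matrix
d = angle[np.newaxis, :] - angle[:, np.newaxis]   # pairwise angle differences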
You have to use a for loop and range() to iterate over n, e.g. like this:
n = len(x)
for i in range(n):
    # do something with the current index
    # e.g. print the points
    print(x[i])
    print(y[i])
But note that if you use i+1 inside the last iteration, this will already be outside of your list.
Also note that (x[n]) - x[n+1:] subtracts a single value x[n] from the slice x[n+1:]; this works here only because x is a NumPy array (as returned by np.loadtxt), which broadcasts the subtraction. With plain Python lists you cannot subtract a list from a number.
Maybe you will even have to use two nested loops to do what you want. I guess that you want to calculate the distance between each pair of points, so a two-dimensional array may be the data structure you want.
If you are interested in all combinations of the points in x and y, I suggest using itertools, which will give you all possible index combinations. Then you can do it as follows:
import itertools
f = [((x[i]-x[j])**2 + (y[i]-y[j])**2)**0.5 for i, j in itertools.product(range(len(x)), repeat=2) if i != j]
# and similarly for the angles
But maybe there is even an easier way...
I have two nested lists A and B:
A = [[50,140],[51,180],[54,500],......]
B = [[50.1, 170], [51,200],[55,510].....]
The 1st element in each inner list runs from 0 to around 1e5, the 0th element runs from around 50 up to around 700, and these elements are unsorted. What I want to do is run through each element A[n][1] and find the closest element in B[n][1], but when searching for the nearest neighbor I want to search only within an interval defined by A[n][0] plus or minus 0.5.
I have been using the function:
def find_nearest_vector(array, value):
    idx = np.array([np.linalg.norm(x+y) for (x, y) in array-value]).argmin()
    return array[idx]
This finds the nearest neighbor between the coordinates A[0][:] and B[0][:], for example. However, I need to confine the search range to a rectangle around some small shift in the value A[0][0]. Also, I do not want to reuse elements: I want a unique bijection between each value A[n][1] and B[n][1] within the interval A[n][0] +/- 0.5.
I have been trying to use Scipy's KDTree, but this reuses elements and I don't know how to confine the search range. Effectively, I want to do a one dimensional NNN search on a two dimensional nested list along a specific axis where the neighborhood in which the NNN search is within a hyper-rectangle defined by the 0th element in each inner list plus or minus some small shift.
I use numpy.argsort(), numpy.searchsorted(), numpy.argmin() to do the search.
%pylab inline
import numpy as np

np.random.seed(0)
A = np.random.rand(5, 2)
B = np.random.rand(100, 2)
xaxis_range = 0.02

# sort B by its x coordinate so searchsorted can be used on it
order = np.argsort(B[:, 0])
bx = B[order, 0]

# for each point in A, the slice [s:e) of sorted B points whose x lies in the window
sidx = np.searchsorted(bx, A[:, 0] - xaxis_range, side="right")
eidx = np.searchsorted(bx, A[:, 0] + xaxis_range, side="left")

result = []
for s, e, ay in zip(sidx, eidx, A[:, 1]):
    section = order[s:e]              # indices of the B points inside the x window
    by = B[section, 1]
    idx = np.argmin(np.abs(ay - by))  # closest y value within the window
    result.append(B[section[idx]])
result = np.array(result)
I plot the result as follows:
plot(A[:, 0], A[:, 1], "o")
plot(B[:, 0], B[:, 1], ".")
plot(result[:, 0], result[:, 1], "x")
My understanding of your problem is that for each A[n][1] you are trying to find the closest element among another set of points (the B[i][1], restricted to those points where A[n][0] is within +/- 0.5 of B[i][0]).
(I'm not familiar with numpy or scipy, and I'm sure that there's a better way to do this with their algorithms.)
That being said, here's my naive implementation in O(a*b*log(a*b)) time.
def main(a, b):
    for a_bound, a_val in a:
        dist_to_valid_b_points = {abs(a_val - b_val): (b_bound, b_val)
                                  for b_bound, b_val in b
                                  if are_within_bounds(a_bound, b_bound)}
        print(get_closest_point((a_bound, a_val), dist_to_valid_b_points))

def are_within_bounds(a_bound, b_bound):
    return abs(b_bound - a_bound) < 0.5

def get_closest_point(a_point, point_dict):
    return (a_point, None if not point_dict else point_dict[min(point_dict, key=point_dict.get)])
main([[50,140],[51,180],[54,500]],[[50.1, 170], [51,200],[55,510]]) yields the following output:
((50, 140), (50.1, 170))
((51, 180), (51, 200))
((54, 500), None)