Algorithm for 2D not quite Delaunay triangulation - python

I'm trying to write an algorithm to do a triangulation on a 2D sampling of grid points. The idea is similar to Delaunay triangulation but with a few custom rules.
To represent the vertices and their coordinates, the input is a sparse 2D array of 0's and 1's. An element is 1 if that grid point was sampled and must be included in the triangulation, and 0 if it should not be involved in the triangulation at all.
1) Unlike Delaunay, in my case, all the triangles must be right triangles with horizontal or vertical orientation, e.g.:
0 0 0 0
0 1 0 1
0 1 0 0
contains a right triangle formed by connecting the 1's, and it has vertical/horizontal orientation since the two non-hypotenuse edges (the legs) are horizontal and vertical.
2) No 2 triangles can share a hypotenuse, but it's ok if they share an edge that is not a hypotenuse.
3) No vertex can be the apex of a right triangle and also a non-apex of a different right triangle. In other words,
0 1 0
0 (1) 1
0 1 0
is OK because the central 1, marked with parentheses, is the apex of both right triangles.
But in the following case:
0 0 0 1 1
1 0 0 1 1
it would be OK to do:
0 0 0 X B
A 0 0 A B
meaning 2 triangles (AAX and BBX), but the following would not be allowed:
0 0 0 A B
A 0 0 X B
since now the vertex "X" would be an apex in triangle A, but would be a non-apex in triangle B.
I'm interested in any thoughts / outline for how to develop this algorithm. The matrices are pretty big, but very sparse, so the algorithm doesn't have to be too efficient; any conceptually simple approach should work fine. The output should be a list of lists:
[[(x1a,y1a),(x1b,y1b),(x1c,y1c)], [(x2a,y2a),(x2b,y2b),(x2c,y2c)], ..., [(xNa,yNa),(xNb,yNb),(xNc,yNc)]]
for the coordinates of the 3 vertices of all of the N different triangles.
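A brute-force sketch of what rule 1 alone might look like (illustrative only, reusing the small example grid from above; rules 2 and 3 would still have to be enforced on top of this, e.g. by filtering the candidate list):

import numpy as np
from itertools import combinations

grid = np.array([[0, 0, 0, 0],
                 [0, 1, 0, 1],
                 [0, 1, 0, 0]])

vertices = [tuple(p) for p in np.argwhere(grid == 1)]

triangles = []
for a, b, c in combinations(vertices, 3):
    for apex, p, q in ((a, b, c), (b, a, c), (c, a, b)):
        # right angle at `apex`: one neighbour shares its row (horizontal leg),
        # the other shares its column (vertical leg)
        if (apex[0] == p[0] and apex[1] == q[1]) or \
           (apex[0] == q[0] and apex[1] == p[1]):
            triangles.append([a, b, c])
            break

print(triangles)   # [[(1, 1), (1, 3), (2, 1)]] in (row, col) coordinates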

Related

Using Pandas Count number of Cells in Column that is within a given radius

To set up the question: I have a dataframe containing spots and their x, y positions. I want to iterate over each spot and check all other spots to see whether they are within a given radius, then record the number of spots within that radius in a new column of the dataframe. I would like to iterate over the index, as I have a decent understanding of how that works. I know I'm missing something simple, but I haven't been able to find a solution that works for me yet. Thank you in advance!
radius = 3
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
spot_id x_pos y_pos
0 1 5 4
1 2 4 10
2 3 10 8
3 4 3 6
4 5 8 3
I then want to get something that looks like this
spot_id x_pos y_pos spots_within_radius
0 1 5 4 1
1 2 4 10 0
2 3 10 8 0
3 4 3 6 1
4 5 8 3 0
To do it in a vectorized way, you can use scipy.spatial.distance_matrix to compute the distance matrix D between all N position vectors ('x_pos', 'y_pos'). D is an N x N matrix (2D numpy.ndarray) whose entry (i, j) is the Euclidean distance between the ith and jth positions.
Then check which positions are within the radius of each other (D <= radius), which gives a boolean matrix. Finally, count the True values per position with sum(axis=0) (the matrix is symmetric, so summing over either axis works). You have to subtract 1 at the end, since each point is always within the radius of itself (the diagonal entries).
import pandas as pd
from scipy.spatial import distance_matrix
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
radius = 3
pos = df[['x_pos','y_pos']]
df['spots_within_radius'] = (distance_matrix(pos, pos) <= radius).sum(axis=0) - 1
Output
>>> df
spot_id x_pos y_pos spots_within_radius
0 1 5 4 1
1 2 4 10 0
2 3 10 8 0
3 4 3 6 1
4 5 8 3 0
If you don't want to use scipy.spatial.distance_matrix, you can compute D yourself using numpy's broadcasting.
import numpy as np
pos = df[['x_pos','y_pos']].to_numpy()
D = np.sum((pos - pos[:, None])**2, axis=-1) ** 0.5
df['spots_within_radius'] = (D <= radius).sum(axis=0) - 1
I would suggest using a KD Tree to answer this kind of question. It's a data structure designed to efficiently search for nearby points, and for larger inputs it's faster than computing a full distance matrix. You can use scikit-learn to implement this.
The code
Here's how:
import sklearn.neighbors
import pandas as pd
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
def add_points_in_range_column_kd(df, radius):
    # Get positions as numpy array
    positions = df[['x_pos', 'y_pos']].to_numpy(dtype='float32')
    # Build KD Tree on those positions
    tree = sklearn.neighbors.KDTree(positions)
    # For each position, check how many points are in range.
    # Return a count, and not the actual points.
    return tree.query_radius(positions, r=radius, count_only=True) - 1

df['spots_within_radius'] = add_points_in_range_column_kd(df, 3)
The efficiency argument
Since a distance matrix needs to calculate the distance between every pair of points, it has a time complexity of O(N^2). In contrast, a single KD Tree query takes time proportional to the depth of the tree, so querying all N points is on average O(N log N). So this method will be more efficient for a large number of points.
Benchmarking
Theory is nice, but is it actually faster in practice?
I ran both a KD Tree method, and a distance matrix method, on dataframes of sizes ranging from N=10 to N=3000. I used the timeit module, running both methods in random order for 100 iterations for all point sizes. Here is a graph of the time it takes with each method:
For small numbers of points, the distance matrix method is faster. Beyond roughly 300 points, the KD Tree is faster. Note that the graph uses a log scale on both axes.
Full testing details can be found here.
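For reference, a minimal sketch of how such a timing comparison could be set up (my own reconstruction with arbitrary random data and sizes, not the original test harness):

import timeit
import numpy as np
import pandas as pd
import sklearn.neighbors
from scipy.spatial import distance_matrix

def count_distance_matrix(df, radius):
    pos = df[['x_pos', 'y_pos']]
    return (distance_matrix(pos, pos) <= radius).sum(axis=0) - 1

def count_kd_tree(df, radius):
    pos = df[['x_pos', 'y_pos']].to_numpy(dtype='float32')
    tree = sklearn.neighbors.KDTree(pos)
    return tree.query_radius(pos, r=radius, count_only=True) - 1

rng = np.random.default_rng(0)
for n in (10, 100, 1000, 3000):
    df = pd.DataFrame({'x_pos': rng.uniform(0, 100, n),
                       'y_pos': rng.uniform(0, 100, n)})
    t_dist = timeit.timeit(lambda: count_distance_matrix(df, 3), number=100)
    t_kd = timeit.timeit(lambda: count_kd_tree(df, 3), number=100)
    print(f'N={n}: distance matrix {t_dist:.3f}s, KD tree {t_kd:.3f}s')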

Numpy Interpolation Between Points Within Array (scipy.griddata)

I have a numpy array of a fixed size holding irregularly spaced data. An example would be:
[1 0 0 0 3 0 0 0 2 0
0 1 0 0 0 0 0 0 2 0
0 1 0 0 1 0 6 0 9 0
0 0 0 0 6 0 3 0 0 1]
I want to keep the array the same shape, but have all the 0 values overwritten with data interpolated from the points that do have data. If the data points in the array are thought of as height values, this would essentially be creating a surface over the points.
I have been trying to use scipy.interpolate.griddata but keep getting errors. I start with an array of my known data points as rows of [x, y, value]. For the above (showing only the first row's points, for brevity):
data = [[0, 0, 1],
        [0, 3, 3],
        [0, 8, 2],
        ...]
I then define
points = (data[:,0], data[:,1])
values = (data[:,2])
Next, I define the points to sample at (in this case, the grid I desire)
grid = np.indices((4,10))
Finally, call griddata
t = interpolate.griddata(points, values, grid, method = 'linear')
This returns the following error
ValueError: number of dimensions in xi does not match x
Am I using the wrong function?
Thanks!
Solved: You need to pass the desired points as a tuple
t = interpolate.griddata(points, values, (grid[0,:,:], grid[1,:,:]), method = 'linear')
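Putting the pieces together, here is a self-contained sketch of the full workflow on the 4x10 example array (my reconstruction: the known points are taken from the non-zero entries, and with method='linear' anything outside their convex hull comes back as NaN):

import numpy as np
from scipy import interpolate

A = np.array([[1, 0, 0, 0, 3, 0, 0, 0, 2, 0],
              [0, 1, 0, 0, 0, 0, 0, 0, 2, 0],
              [0, 1, 0, 0, 1, 0, 6, 0, 9, 0],
              [0, 0, 0, 0, 6, 0, 3, 0, 0, 1]])

# known data points: coordinates and values of the non-zero entries
x, y = np.nonzero(A)
points = (x, y)
values = A[x, y]

# all grid coordinates to sample at, passed as a tuple of coordinate arrays
grid = np.indices(A.shape)
t = interpolate.griddata(points, values, (grid[0], grid[1]), method='linear')
print(t)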

How can I optimize searching and matching through multi-dimensional arrays?

I'm trying to match up the elements in 2 different arrays. Array_A is a 3d map of A_Clouds, Array_B is a 3d map of B_Clouds. Each "cloud" is continuous, i.e. any isolated pixels would define a new cloud. The values of the pixels are a single, unique integer for each cloud. Non-cloud values are 0. Here's a 2D example:
[[0 0 0 0 0 0 0 0 0]
[0 0 0 1 1 1 0 0 0]
[0 0 1 1 1 1 1 1 0]
[0 0 0 1 1 1 1 1 0]
[0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0]]
The output I need is simply the IDs (for both clouds) of each A_Cloud which is overlapping with a B_Cloud, and the number (locations not needed) of pixels which are overlapping between those clouds.
The problem is that these are both very large 3 dimensional arrays (~2000x2000x200, both are the same size). I'm basically doing a bunch of nested for loops, which is of course very slow. Is there a faster way that I could approach this problem? Thanks in advance.
This is what I have right now (simplified to 2d):
import collections

final_matches = []
for Acloud_id in ACloud_list:
    Acloud_locs = list(set([(i,j) for j, line in enumerate(Array_A) for i,pix in enumerate(line) if pix == Acloud_id]))
    matches = []
    for loc in Acloud_locs:
        Bcloud_pix = Array_B[loc[0]][loc[1]]
        if Bcloud_pix:
            matches.append(Bcloud_pix)
    counter = collections.Counter(matches)
    final_matches.append([Acloud_id, counter])
Some considerations here:
for Acloud_id in ACloud_list:
    Acloud_locs = list(set([(i,j) for j, line in enumerate(Array_A) for i,pix in enumerate(line) if pix == Acloud_id]))
If I've read that right, this needs to check every pixel in the array in order to generate the set, and it repeats that for every cloud in A. So if you have 500 clouds, you're checking every pixel 500 times. This is not going to scale well!
Might be more efficient to store the overlap counts in a dict, and just go through the arrays once:
overlaps = dict()
for i in possible_x_coords: # define these however you like
    for j in possible_y_coords:
        if Array_A[i][j] and Array_B[i][j]:
            overlaps[(Array_A[i][j], Array_B[i][j])] = 1 + overlaps.get((Array_A[i][j], Array_B[i][j]), 0)
(apologies for any errors, I'm on the road and can't test my code)
Update: you've clarified that the arrays are about 80% sparse. If that figure were a lot higher, and if you had control over the format of your inputs, I'd suggest looking into sparse array formats - if your input only stores the non-zero values for A, this can save you the trouble of checking for zero values in A. However, for something that's only 80% sparse, I'm not sure how much efficiency this would add.
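For completeness, here is a runnable version of that dict-counting idea on a small 2D example (the array contents are made up purely for illustration):

import numpy as np

Array_A = np.array([[0, 1, 1, 0, 0],     # cloud IDs in map A, 0 = no cloud
                    [0, 1, 1, 0, 2],
                    [0, 0, 0, 2, 2]])
Array_B = np.array([[5, 5, 0, 0, 0],     # cloud IDs in map B, 0 = no cloud
                    [0, 5, 0, 7, 7],
                    [0, 0, 0, 7, 7]])

# single pass over the grid, counting overlapping (A_id, B_id) pixel pairs
overlaps = {}
for i in range(Array_A.shape[0]):
    for j in range(Array_A.shape[1]):
        a, b = int(Array_A[i, j]), int(Array_B[i, j])
        if a and b:
            overlaps[(a, b)] = overlaps.get((a, b), 0) + 1

print(overlaps)   # {(1, 5): 2, (2, 7): 3}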

Spectral Clustering a graph in python

I'd like to cluster a graph in python using spectral clustering.
Spectral clustering is a more general technique that can be applied not only to graphs but also to images or any sort of data; however, it's considered an exceptional graph clustering technique. Sadly, I can't find examples of spectral clustering graphs in python online.
Scikit Learn has two spectral clustering methods documented: SpectralClustering and spectral_clustering, which don't seem to be aliases.
Both of those methods mention that they could be used on graphs, but do not offer specific instructions. Neither does the user guide. I've asked for such an example from the developers, but they're overworked and haven't gotten to it.
A good network to document this against is the Karate Club Network. It's included as a method in networkx.
I'd love some direction in how to go about this. If someone can help me figure it out, I can add the documentation to scikit learn.
Notes:
A question much like this one has already been asked on this site.
Without much experience with Spectral-clustering and just going by the docs (skip to the end for the results!):
Code:
import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)
# Get your mentioned graph
G = nx.karate_club_graph()
# Get ground-truth: club-labels -> transform to 0/1 np-array
# (possible overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])
# Get adjacency-matrix as numpy-array
adj_mat = nx.to_numpy_matrix(G)
print('ground truth')
print(gt)
# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
# Compare ground-truth and clustering-results
print('spectral clustering')
print(sc.labels_)
print('just for better-visualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))
# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for better-visualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828
The general idea:
Introduction on the data and task from here:
The nodes in the graph represent the 34 members in a college Karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because while Zachary was collecting his data, there was a dispute in the Karate club, and it split into two factions: one led by “Mr. Hi”, and one led by “John A”. It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.
Using sklearn & spectral-clustering to tackle this:
If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.
This describes normalized graph cuts as:
Find two disjoint partitions A and B of the vertices V of a graph, so that A ∪ B = V and A ∩ B = ∅.
Given a similarity measure w(i, j) between two vertices (e.g. identity when they are connected), a cut value (and its normalized version) is defined as:
cut(A, B) = Σ_{u ∈ A, v ∈ B} w(u, v)
...
we seek the minimization of disassociation between the groups A and B and the maximization of the association within each group.
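As a quick illustration of that definition (my own toy example, not part of the quoted docs), the cut value between two vertex sets can be read straight off an adjacency matrix:

import numpy as np

# toy graph: edges 0-1, 0-2, 1-2, 2-3
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
A = [0, 1]   # one partition
B = [2, 3]   # the other partition (A ∪ B = V, A ∩ B = ∅)

# cut(A, B) = sum of w(u, v) over u in A, v in B
cut_value = adj[np.ix_(A, B)].sum()
print(cut_value)   # 2 -> edges 0-2 and 1-2 cross the cut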
Sounds alright. So we create the adjacency matrix (nx.to_numpy_matrix(G)) and set the param affinity to precomputed (as our adjacency matrix is our precomputed similarity measure).
Alternatively, using precomputed, a user-provided affinity matrix can be used.
Edit: While unfamiliar with this, I looked for parameters to tune and found assign_labels:
The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.
So trying the less sensitive approach:
sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for better-visualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351
That's a pretty much perfect fit to the ground-truth!
Here is a dummy example just to see what it does to a simple similarity matrix -- inspired by sascha's answer.
Code
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(0)
adj_mat = [[3,2,2,0,0,0,0,0,0],
           [2,3,2,0,0,0,0,0,0],
           [2,2,3,1,0,0,0,0,0],
           [0,0,1,3,3,3,0,0,0],
           [0,0,0,3,3,3,0,0,0],
           [0,0,0,3,3,3,1,0,0],
           [0,0,0,0,0,1,3,1,1],
           [0,0,0,0,0,0,1,3,1],
           [0,0,0,0,0,0,1,1,3]]
adj_mat = np.array(adj_mat)
sc = SpectralClustering(3, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
print('spectral clustering')
print(sc.labels_)
Output
spectral clustering
[0 0 0 1 1 1 2 2 2]
Let's first cluster a graph G into K=2 clusters and then generalize for all K.
We can use the function linalg.algebraicconnectivity.fiedler_vector() from networkx to compute the Fiedler vector (the eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian matrix) of the graph, with the assumption that the graph is a connected undirected graph.
Then we can threshold the values of the eigenvector to compute the cluster index each node corresponds to, as shown in the next code block:
import networkx as nx
import numpy as np
A = np.zeros((11,11))
A[0,1] = A[0,2] = A[0,3] = A[0,4] = 1
A[5,6] = A[5,7] = A[5,8] = A[5,9] = A[5,10] = 1
A[0,5] = 5
G = nx.from_numpy_matrix(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
labels = [0 if v < 0 else 1 for v in ev] # using threshold 0
labels
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
nx.draw(G, pos=nx.drawing.layout.spring_layout(G),
        with_labels=True, node_color=labels)
We can obtain the same clustering with an eigen-analysis of the graph Laplacian, choosing the eigenvector corresponding to the 2nd smallest eigenvalue:
L = nx.laplacian_matrix(G)
e, v = np.linalg.eig(L.todense())
idx = np.argsort(e)
e = e[idx]
v = v[:,idx]
labels = [0 if x < 0 else 1 for x in v[:,1]] # using threshold 0
labels
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
Drawing the graph again with the clusters labeled:
With SpectralClustering from sklearn.cluster we can get the exact same result:
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(A)
sc.labels_
# [0 0 0 0 0 1 1 1 1 1 1]
We can generalize the above for K > 2 clusters by using k-means clustering to partition the Fiedler vector instead of thresholding it. The following code demonstrates this for a 3-clustering of the graph defined by the following adjacency matrix:
A = np.array([[3,2,2,0,0,0,0,0,0],
              [2,3,2,0,0,0,0,0,0],
              [2,2,3,1,0,0,0,0,0],
              [0,0,1,3,3,3,0,0,0],
              [0,0,0,3,3,3,0,0,0],
              [0,0,0,3,3,3,1,0,0],
              [0,0,0,0,0,1,3,1,1],
              [0,0,0,0,0,0,1,3,1],
              [0,0,0,0,0,0,1,1,3]])
K = 3 # K clusters
G = nx.from_numpy_matrix(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=K, random_state=0).fit(ev.reshape(-1,1))
kmeans.labels_
# array([2, 2, 2, 0, 0, 0, 1, 1, 1])
Now draw the clustered graph, labeling the nodes with the clusters obtained above, as in the sketch below:
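A minimal way to do that, reusing G and kmeans from the code above (my own sketch):

nx.draw(G, pos=nx.drawing.layout.spring_layout(G),
        with_labels=True, node_color=kmeans.labels_)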

Is there any easy way to rotate the values of a matrix/array?

So, let's say I have the following matrix/array -
[0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0 0 0 0
0 0 0 1 1 1 1 0 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0
0 0 0 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]
It would be fairly trivial to write something that would translate these values up and down. What if I wanted to rotate it by an angle that isn't a multiple of 90 degrees? I know it is obviously impossible to get exactly the same shape (made of 1s), because of the nature of the grid. The idea that comes to mind is converting each value of 1 to a coordinate vector. Then it would amount to rotating the coordinates (which should be simpler) about a point. One could then write something that takes the coordinates, compares them to the matrix grid, and fills a box if there is a point inside it. I know I'll also have to find a center to rotate around.
Does this seem like a reasonable way to do this? If anyone has a better idea, I'm all ears. I know with a small grid like this, the shape would probably be entirely different, however if I had a large shape represented by 1s, in a large grid, the difference between representations would be smaller.
First of all, rotating a shape like that, made of only 1's and 0's, by a non-90-degree angle is not really going to look much like the original at all when it's done at such a low "resolution". However, I would recommend looking into rotation matrices. Like you said, you would probably want to find each value as a coordinate pair and rotate it around the center. It would probably be easier if you made this a two-dimensional array. Good luck!
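For example, here is a minimal sketch of that idea with a 2D rotation matrix (my own illustration; the example array, the 30-degree angle and the centroid pivot are arbitrary choices):

import numpy as np

A = np.zeros((8, 12), dtype=int)    # small binary example array
A[2:6, 3:8] = 1

theta = np.radians(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])    # rotation matrix

coords = np.argwhere(A == 1)                  # (row, col) coordinates of the 1s
center = coords.mean(axis=0)                  # rotate about the shape's centroid
rotated = (coords - center) @ R.T + center    # rotate the coordinates
rotated = np.rint(rotated).astype(int)        # snap back onto the grid

B = np.zeros_like(A)
for r, c in rotated:                          # skip anything out of bounds
    if 0 <= r < B.shape[0] and 0 <= c < B.shape[1]:
        B[r, c] = 1
print(B)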
I think this should work:
import numpy as np
from math import sin, cos, atan2, radians

i0, j0 = 0, 0        # point around which you'll rotate
alpha = radians(3)   # 3 degrees
B = np.zeros(A.shape)
for i, j in np.swapaxes(np.where(A == 1), 0, 1):
    di = i - i0
    dj = j - j0
    dist = (di**2 + dj**2)**0.5
    ang = atan2(dj, di)
    # cos goes with the row offset and sin with the column offset,
    # so that alpha = 0 maps each point back onto itself
    pi = round(cos(ang + alpha)*dist) + i0
    pj = round(sin(ang + alpha)*dist) + j0
    B[pi][pj] = 1
But please don't forget about points that land outside the array (an IndexError, or a silent wrap-around for negative indices)!
The B array should be much bigger than A, and the origin should (optimally) be in the middle of the array.
