Spectral Clustering a graph in python
I'd like to cluster a graph in python using spectral clustering.
Spectral clustering is a more general technique that can be applied not only to graphs but also to images or any sort of data; however, it is considered an exceptional graph-clustering technique. Sadly, I can't find examples of spectral clustering of graphs in Python online.
Scikit-learn has two spectral clustering methods documented: SpectralClustering and spectral_clustering, which seem not to be aliases.
Both of those methods mention that they could be used on graphs, but do not offer specific instructions. Neither does the user guide. I've asked for such an example from the developers, but they're overworked and haven't gotten to it.
A good network to document this against is the Karate Club Network. It's included as a function in networkx.
I'd love some direction in how to go about this. If someone can help me figure it out, I can add the documentation to scikit learn.
Notes:
A question much like this one has already been asked on this site.
Without much experience with Spectral-clustering and just going by the docs (skip to the end for the results!):
Code:
import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)
# Get your mentioned graph
G = nx.karate_club_graph()
# Get ground-truth: club-labels -> transform to 0/1 np-array
# (possibly overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])
# Get adjacency-matrix as numpy-array
adj_mat = nx.to_numpy_array(G)  # nx.to_numpy_matrix was removed in NetworkX 3.0
print('ground truth')
print(gt)
# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
# Compare ground-truth and clustering-results
print('spectral clustering')
print(sc.labels_)
print('just for better-visualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))
# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for better-visualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828
The general idea:
Introduction to the data and task from here:
The nodes in the graph represent the 34 members in a college Karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because while Zachary was collecting his data, there was a dispute in the Karate club, and it split into two factions: one led by “Mr. Hi”, and one led by “John A”. It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.
Using sklearn & spectral-clustering to tackle this:
If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.
This describes normalized graph cuts as:
Find two disjoint partitions A and B of the vertices V of a graph, so that A ∪ B = V and A ∩ B = ∅.
Given a similarity measure w(i, j) between two vertices (e.g. identity when they are connected), a cut value (and its normalized version) is defined as:
cut(A, B) = sum_{u in A, v in B} w(u, v)
...
We seek the minimization of disassociation between the groups A and B and the maximization of the association within each group.
Sounds alright. So we create the adjacency matrix (nx.to_numpy_array(G)) and set the parameter affinity to 'precomputed' (as our adjacency matrix is our precomputed similarity measure).
Alternatively, using precomputed, a user-provided affinity matrix can be used.
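Regarding the SpectralClustering vs. spectral_clustering distinction raised in the question: spectral_clustering is essentially the functional counterpart of the estimator and takes the affinity matrix directly. A minimal sketch (my addition, reusing adj_mat, gt and metrics from the code above):
from sklearn.cluster import spectral_clustering
labels = spectral_clustering(adj_mat, n_clusters=2, n_init=100, random_state=1)
print(labels)
print(metrics.adjusted_rand_score(gt, labels))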
Edit: While unfamiliar with this, I looked for parameters to tune and found assign_labels:
The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.
So trying the less sensitive approach:
sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for better-visualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351
That's a pretty much perfect fit to the ground-truth!
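A side note (my addition, not part of the original answer): ARI and AMI are already invariant to label permutations, so the manual inversion above is only for eyeballing the output. If you want an explicit alignment, one option is to match labels through the confusion matrix; a small sketch reusing gt and sc.labels_ from above:
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(gt, sc.labels_)
row_ind, col_ind = linear_sum_assignment(-cm)      # maximize matched counts
mapping = dict(zip(col_ind, row_ind))              # predicted label -> ground-truth label
aligned = np.array([mapping[l] for l in sc.labels_])
print('accuracy after alignment:', (aligned == gt).mean())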
Here is a dummy example just to see what it does to a simple similarity matrix -- inspired by sascha's answer.
Code
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(0)
adj_mat = [[3,2,2,0,0,0,0,0,0],
[2,3,2,0,0,0,0,0,0],
[2,2,3,1,0,0,0,0,0],
[0,0,1,3,3,3,0,0,0],
[0,0,0,3,3,3,0,0,0],
[0,0,0,3,3,3,1,0,0],
[0,0,0,0,0,1,3,1,1],
[0,0,0,0,0,0,1,3,1],
[0,0,0,0,0,0,1,1,3]]
adj_mat = np.array(adj_mat)
sc = SpectralClustering(3, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
print('spectral clustering')
print(sc.labels_)
Output
spectral clustering
[0 0 0 1 1 1 2 2 2]
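As a quick sanity check (my addition), you can reorder the rows and columns of the affinity matrix by the recovered labels to make the block structure visible:
order = np.argsort(sc.labels_)
print(adj_mat[np.ix_(order, order)])   # blocks of high similarity appear along the diagonal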
Let's first cluster a graph G into K=2 clusters and then generalize for all K.
We can use the function linalg.algebraicconnectivity.fiedler_vector() from networkx to compute the Fiedler vector (the eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian matrix) of the graph, under the assumption that the graph is a connected undirected graph.
Then we can threshold the values of the eigenvector to compute the cluster index each node corresponds to, as shown in the next code block:
import networkx as nx
import numpy as np
A = np.zeros((11,11))
A[0,1] = A[0,2] = A[0,3] = A[0,4] = 1
A[5,6] = A[5,7] = A[5,8] = A[5,9] = A[5,10] = 1
A[0,5] = 5
G = nx.from_numpy_array(A)  # nx.from_numpy_matrix was removed in NetworkX 3.0
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
labels = [0 if v < 0 else 1 for v in ev] # using threshold 0
labels
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
nx.draw(G, pos=nx.drawing.layout.spring_layout(G),
with_labels=True, node_color=labels)
We can obtain the same clustering with an eigenanalysis of the graph Laplacian, again choosing the eigenvector corresponding to the 2nd smallest eigenvalue:
L = nx.laplacian_matrix(G)
e, v = np.linalg.eig(L.todense())
idx = np.argsort(e)
e = e[idx]
v = v[:,idx]
labels = [0 if x < 0 else 1 for x in v[:,1]] # using threshold 0
labels
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
Drawing the graph again with the clusters labeled (note that the 0/1 labels come out flipped relative to the Fiedler-vector result, since the sign of an eigenvector is arbitrary; the partition itself is the same):
With SpectralClustering from sklearn.cluster we can get the exact same result:
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(A)
sc.labels_
# [0 0 0 0 0 1 1 1 1 1 1]
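One caveat (my addition, not from the answer): the matrix A above only has its upper triangle filled in, while a precomputed affinity for SpectralClustering is meant to be a symmetric similarity matrix, so it is probably safer to symmetrize it first:
A_sym = A + A.T                                     # make the affinity symmetric
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
print(sc.fit_predict(A_sym))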
We can generalize the above to K > 2 clusters by partitioning the Fiedler vector with k-means instead of thresholding it. The following code demonstrates this for a 3-clustering of the graph defined by the adjacency matrix below:
A = np.array([[3,2,2,0,0,0,0,0,0],
[2,3,2,0,0,0,0,0,0],
[2,2,3,1,0,0,0,0,0],
[0,0,1,3,3,3,0,0,0],
[0,0,0,3,3,3,0,0,0],
[0,0,0,3,3,3,1,0,0],
[0,0,0,0,0,1,3,1,1],
[0,0,0,0,0,0,1,3,1],
[0,0,0,0,0,0,1,1,3]])
K = 3 # K clusters
G = nx.from_numpy_array(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=K, random_state=0).fit(ev.reshape(-1,1))
kmeans.labels_
# array([2, 2, 2, 0, 0, 0, 1, 1, 1])
Now draw the clustered graph, labeling the nodes with the clusters obtained above:
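As an aside (my addition): for K > 2, using only the Fiedler vector is a shortcut. The more standard spectral-clustering recipe embeds each node with the eigenvectors of the K smallest Laplacian eigenvalues and runs k-means on the rows. A sketch reusing A and K from above:
from sklearn.cluster import KMeans
Lap = np.diag(A.sum(axis=1)) - A              # unnormalized graph Laplacian
e, v = np.linalg.eigh(Lap)                    # eigh: Lap is symmetric, eigenvalues come out ascending
embedding = v[:, :K]                          # eigenvectors of the K smallest eigenvalues
labels = KMeans(n_clusters=K, random_state=0, n_init=10).fit_predict(embedding)
print(labels)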
Related
Efficient way to find coordinates of connected blobs in binary image
I am looking for the coordinates of connected blobs in a binary image (2d numpy array of 0 or 1). The skimage library provides a very fast way to label blobs within the array (which I found from similar SO posts). However, I want a list of the coordinates of the blob, not a labelled array. I have a solution which extracts the coordinates from the labelled image, but it is very slow. Far slower than the initial labelling.
Minimal reproducible example:
import timeit
from skimage import measure
import numpy as np

binary_image = np.array([
    [0,1,0,0,1,1,0,1,1,0,0,1],
    [0,1,0,1,1,1,0,1,1,1,0,1],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,1,1,1,1,0,0,0,0,1,0,0],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,0,1,0,0,0,0,0,0,0,0,0],
    [0,1,0,0,1,1,0,1,1,0,0,1],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,1,1,1,1,0,0,0,0,1,0,0],
])

print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)

labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)

def extract_blobs_from_labelled_array(labelled_array):
    # The goal is to obtain lists of the coordinates
    # of each distinct blob.
    blobs = []
    label = 1
    while True:
        indices_of_label = np.where(labelled_array==label)
        if not indices_of_label[0].size > 0:
            break
        else:
            blob = list(zip(*indices_of_label))
            label += 1
            blobs.append(blob)

if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")
Output:
2d array of type: <class 'numpy.ndarray'>:
[[0 1 0 0 1 1 0 1 1 0 0 1]
 [0 1 0 1 1 1 0 1 1 1 0 1]
 [0 0 0 0 0 0 0 1 1 1 0 0]
 [0 1 1 1 1 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1 1 1 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 1 1 0 1 1 0 0 1]
 [0 0 0 0 0 0 0 1 1 1 0 0]
 [0 1 1 1 1 0 0 0 0 1 0 0]]

2d array with connected blobs labelled of type <class 'numpy.ndarray'>:
[[ 0  1  0  0  2  2  0  3  3  0  0  4]
 [ 0  1  0  2  2  2  0  3  3  3  0  4]
 [ 0  0  0  0  0  0  0  3  3  3  0  0]
 [ 0  5  5  5  5  0  0  0  0  3  0  0]
 [ 0  0  0  0  0  0  0  3  3  3  0  0]
 [ 0  0  6  0  0  0  0  0  0  0  0  0]
 [ 0  6  0  0  7  7  0  8  8  0  0  9]
 [ 0  0  0  0  0  0  0  8  8  8  0  0]
 [ 0 10 10 10 10  0  0  0  0  8  0  0]]

Beginning extract_blobs_from_labelled_array timing

Time taken:
9.346099977847189e-05
9e-05 is small, but so is this image for the example. In reality I am working with very high resolution images for which the function takes approximately 10 minutes. Is there a faster way to do this?
Side note: I'm only using list(zip()) to try to get the numpy coordinates into something I'm used to (I don't use numpy much, just Python). Should I be skipping this and just using the coordinates to index as-is? Will that speed it up?
The part of the code that is slow is here:
while True:
    indices_of_label = np.where(labelled_array==label)
    if not indices_of_label[0].size > 0:
        break
    else:
        blob = list(zip(*indices_of_label))
        label += 1
        blobs.append(blob)
First, a complete aside: you should avoid using while True when you know the number of elements you will be iterating over. It's a recipe for hard-to-find infinite-loop bugs. Instead, you should use:
for label in range(np.max(labels)):
and then you can drop the if ...: break.
A second issue is indeed that you are using list(zip(*)), which is slow compared to NumPy functions. Here you could get approximately the same result with np.transpose(indices_of_label), which will give you a 2D array of shape (n_coords, n_dim), i.e. (n_coords, 2).
But the big issue is the expression labelled_array == label. This examines every pixel of the image once for every label. (Twice, actually, because then you run np.where(), which takes another pass.) This is a lot of unnecessary work, as the coordinates can be found in one pass.
The scikit-image function skimage.measure.regionprops can do this for you. regionprops goes over the image once and returns a list containing one RegionProps object per label. Each object has a .coords attribute containing the coordinates of every pixel in the blob. So, here's your code, modified to use that function:
import timeit
from skimage import measure
import numpy as np

binary_image = np.array([
    [0,1,0,0,1,1,0,1,1,0,0,1],
    [0,1,0,1,1,1,0,1,1,1,0,1],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,1,1,1,1,0,0,0,0,1,0,0],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,0,1,0,0,0,0,0,0,0,0,0],
    [0,1,0,0,1,1,0,1,1,0,0,1],
    [0,0,0,0,0,0,0,1,1,1,0,0],
    [0,1,1,1,1,0,0,0,0,1,0,0],
])

print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)

labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)

def extract_blobs_from_labelled_array(labelled_array):
    """Return a list containing coordinates of pixels in each blob."""
    props = measure.regionprops(labelled_array)
    blobs = [p.coords for p in props]
    return blobs

if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")
Numpy Interpolation Between Points Within Array (scipy.griddata)
I have a numpy array of a fixed size holding irregularly spaced data. An example would be:
[1 0 0 0 3 0 0 0 2 0 0 1 0 0 0 0 0 0 2 0 0 1 0 0 1 0 6 0 9 0 0 0 0 0 6 0 3 0 0 1]
I want to keep the array the same shape, but have all the 0 values overwritten with data interpolated from the points that do have data. If the data points in the array are thought of as height values, this would essentially be creating a surface over the points.
I have been trying to use scipy.interpolate.griddata but am continually getting errors. I start with an array of my known data points, as [x, y, value]. For the above (first row only, for brevity):
data = [0, 0, 1
        0, 3, 3
        0, 8, 2
        ....................
I then define
points = (data[:,0], data[:,1])
values = (data[:,2])
Next, I define the points to sample at (in this case, the grid I desire):
grid = np.indices((4,10))
Finally, I call griddata:
t = interpolate.griddata(points, values, grid, method = 'linear')
This returns the following error:
ValueError: number of dimensions in xi does not match x
Am I using the wrong function? Thanks!
Solved: you need to pass the desired points as a tuple:
t = interpolate.griddata(points, values, (grid[0,:,:], grid[1,:,:]), method = 'linear')
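For completeness, a tiny self-contained sketch of the same idea (my own example data, not the asker's): build the known points from the nonzero entries, then evaluate griddata on the full grid passed as a tuple of coordinate arrays:
import numpy as np
from scipy import interpolate

a = np.zeros((4, 10))
a[0, 0], a[0, 4], a[1, 3], a[2, 8], a[3, 9] = 1, 3, 2, 6, 1   # sparse known values

rows, cols = np.nonzero(a)
points = np.column_stack((rows, cols))        # (n_points, 2) coordinates of known data
values = a[rows, cols]

grid_r, grid_c = np.indices(a.shape)          # full grid to interpolate onto
filled = interpolate.griddata(points, values, (grid_r, grid_c), method='linear')
# cells outside the convex hull of the known points come back as NaN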
Create a matrix for datapoints in same or different clusters
I want to iterate through my datapoints and check whether they are in the same cluster, after using KMeans to cluster them. Then I need to create a matrix for all the datapoints, with 1 if two points belong to the same cluster and 0 if they don't. After using KMeans, I'm not sure how to retrieve which cluster every datapoint belongs to so I can create such a matrix. Do I do that using the labels_ attribute?
k_means = KMeans(n_clusters=5).fit(X)
labels_columns = k_means.labels_
labels_row = k_means.labels_
for row in labels_row:
    for column in labels_columns:
        if row == column:
            # add 1 in matrix position
        else:
            # add 0 in matrix position
How do I best create this matrix? Or does labels_ provide different information from my understanding? Any help is appreciated!
You are on the right track. KMeans.labels_ returns a vector of n elements which tells you the cluster each point belongs to: [3, 4, 10, ...] tells you that point 0 belongs to cluster 3, point 1 belongs to cluster 4, and so on. You can build the matrix you want in many ways. One possibility I thought of, which is a bit more elegant than two for loops, would be the following:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

n_samples, n_features = 10, 2
X, y = make_blobs(n_samples, n_features)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

kmeans = KMeans(n_clusters=3).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.show()

repeat_labels = np.repeat(kmeans.labels_.T, n_samples, axis=0).reshape(n_samples, n_samples)
print(kmeans.labels_)
print(repeat_labels)

proximity_matrix = (repeat_labels == repeat_labels.T).astype(int)
print(proximity_matrix)
I use the vector of labels as my starting point. Let's say that it is the following:
[1 0 0 1 1 2 2 2 2 0]
I transform it into a 2D matrix with np.repeat, which has the following shape:
[[1 1 1 1 1 1 1 1 1 1]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1]
 .....
So I repeat the labels as many times as the number of points n. Then I can just check where this matrix and its transpose are equal. That will be true only if two points belong to the same cluster:
[[1 0 0 1 1 0 0 0 0 0]
 [0 1 1 0 0 0 0 0 0 1]
 [0 1 1 0 0 0 0 0 0 1]
 [1 0 0 1 1 0 0 0 0 0]
 .....
I cast the matrix to int, but mind you that the original output is actually a boolean array. I left the print statements and the plots in the code to hopefully make it clearer. Hope it helps!
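An even shorter equivalent (my addition) uses NumPy broadcasting instead of np.repeat; comparing a column vector of labels against a row vector yields the same proximity matrix:
labels = kmeans.labels_
proximity_matrix = (labels[:, None] == labels[None, :]).astype(int)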
sklearn agglomerative clustering input data
I have a similarity matrix between four users. I want to do agglomerative clustering. The code is like this:
lena = np.matrix('1 1 0 0;1 1 0 0;0 0 1 0.2;0 0 0.2 1')
X = np.reshape(lena, (-1, 1))

print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 3  # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='complete').fit(X)
print(ward)
label = np.reshape(ward.labels_, lena.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
print(label)
The printed result of label is:
[[1 1 0 0]
 [1 1 0 0]
 [0 0 1 2]
 [0 0 2 1]]
Does this mean it gives a list of possible cluster results, and we can choose one from them, e.g. [0, 0, 2, 1]? If that's wrong, could you tell me how to do the agglomerative algorithm based on similarity? If it's right: the similarity matrix is huge, so how can I choose the optimal clustering result from a huge list? Thanks
I think the problem here is that you fit your model with the wrong data:
# This will return a 4x4 matrix (the similarity matrix)
lena = np.matrix('1 1 0 0;1 1 0 0;0 0 1 0.2;0 0 0.2 1')
# However, this will return a 16x1 matrix
X = np.reshape(lena, (-1, 1))
The result you actually get is this:
ward.labels_
>> array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 2, 0, 0, 2, 1])
which is the label of each element in the X vector, and it doesn't make sense. If I understood your problem correctly, you need to classify your users by the distance between them (similarity). Well, in that case I would suggest using spectral clustering this way:
import numpy as np
from sklearn.cluster import SpectralClustering

lena = np.matrix('1 1 0 0;1 1 0 0;0 0 1 0.2;0 0 0.2 1')
n_clusters = 3
SpectralClustering(n_clusters).fit_predict(lena)
>> array([1, 1, 0, 2], dtype=int32)
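If you want to stay with agglomerative clustering instead (my addition, not from the answer above), it can also work directly on precomputed distances: convert the similarity into a dissimilarity and pass it with a precomputed metric (older scikit-learn versions call this parameter affinity rather than metric):
import numpy as np
from sklearn.cluster import AgglomerativeClustering

sim = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 0.2],
                [0, 0, 0.2, 1]])
dist = 1 - sim                                     # similarity -> dissimilarity
agg = AgglomerativeClustering(n_clusters=3, metric='precomputed', linkage='complete')
print(agg.fit_predict(dist))                       # users 0 and 1 end up in the same cluster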
Counting of adjacent cells in a numpy array
Past midnight, and maybe someone has an idea how to tackle a problem of mine. I want to count the number of adjacent cells (which means the number of array fields with other values, e.g. zeroes, in the vicinity of array values) as a sum for each valid value. Example:
import numpy
from scipy import ndimage

s = ndimage.generate_binary_structure(2,2)  # Structure can vary
a = numpy.zeros((6,6), dtype=int)           # Example array
a[2:4, 2:4] = 1; a[2,4] = 1                 # with example value structure
print(a)

[[0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 1 1 1 0]
 [0 0 1 1 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]

# The value at position [2,4] is surrounded by 6 zeros, while the one at
# position [2,2] has 5 zeros in the vicinity if 's' is the assumed binary structure.
# Total sum of surrounding zeroes is therefore sum(5+4+6+4+5) == 24
How can I count the number of zeroes in such a way if the structure of my values varies? I somehow believe I must make use of the binary_dilation function of SciPy, which is able to enlarge the value structure, but simple counting of overlaps can't lead me to the correct sum, or can it?
print(ndimage.binary_dilation(a,s).astype(a.dtype))

[[0 0 0 0 0 0]
 [0 1 1 1 1 1]
 [0 1 1 1 1 1]
 [0 1 1 1 1 1]
 [0 1 1 1 1 0]
 [0 0 0 0 0 0]]
Use a convolution to count neighbours:
import numpy
import scipy.signal

a = numpy.zeros((6,6), dtype=int)  # Example array
a[2:4, 2:4] = 1; a[2,4] = 1        # with example value structure

b = 1 - a
c = scipy.signal.convolve2d(b, numpy.ones((3,3)), mode='same')
print(numpy.sum(c * a))
b = 1 - a allows us to count each zero while ignoring the ones. We convolve with a 3x3 all-ones kernel, which sets each element to the sum of it and its 8 neighbouring values (other kernels are possible, such as the + kernel for only orthogonally adjacent values). With these summed values, we mask off the zeros in the original input (since we don't care about their neighbours) and sum over the whole array.
I think you already got it. After dilation, the number of 1s is 19; minus the 5 of the starting shape, you have 14, which is the number of zeros surrounding your shape. Your total of 24 has overlaps.
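A small sketch of that dilation-based count (my addition), which gives the number of distinct zero cells touching the shape (14 here) rather than the per-cell sum with overlaps (24):
import numpy as np
from scipy import ndimage

s = ndimage.generate_binary_structure(2, 2)
a = np.zeros((6, 6), dtype=int)
a[2:4, 2:4] = 1; a[2, 4] = 1

# zeros covered by the dilated shape = distinct zero neighbours of the shape
unique_zero_neighbours = np.logical_and(ndimage.binary_dilation(a, s), a == 0).sum()
print(unique_zero_neighbours)   # 14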