Create a matrix for datapoints in same or different clusters - python

I want to iterate through my datapoints and check whether they are in the same cluster, after using KMeans to cluster them.
And then I need to create a matrix for all the datapoints, and have 1 if they belong on the same cluster, and 0 if they don't.
After using Kmeans, I'm not sure how to retrieve which cluster every datapoint belongs to so I can create such matrix.
Do I do that using labels_ argument?
k_means = KMeans(n_clusters=5).fit(X)
labels_columns = k_means.labels_
labels_row = k_means.labels_
for row in labels_row:
for column in labels_columns:
if row == columns:
--add 1 in matrix position
else:
--add 0 in matrix position
How to best create this matrix? Or do they labels_ provide different information from what my understanding?
Any help is appreciated!

You are on the right track. Kmeans.labels_ returns a vector of n elements which tells you that the
cluster each point belongs to: [3, 4, 10, ...] tells you that point 0 belongs to cluster 3, point 1
belongs to cluster 4 and so on.
You can build the matrix you want in many ways. One possibility I thought which is a bit more elegant than
2 for loops would be the following:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
n_samples, n_features = 10, 2
X, y = make_blobs(n_samples, n_features)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
kmeans = KMeans(n_clusters=3).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.show()
neighbour_matrix = np.zeros(n_samples)
repeat_labels = np.repeat(kmeans.labels_.T, n_samples, axis=0).reshape(n_samples, n_samples)
print(kmeans.labels_)
print(repeat_labels)
proximity_matrix = (repeat_labels == repeat_labels.T).astype(int)
print(proximity_matrix)
I use the vector of labels as my starting point. Let's say that it is the following:
[1 0 0 1 1 2 2 2 2 0]
I transform it in a 2D matrix with np.repeat which has the following shape:
[[1 1 1 1 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[1 1 1 1 1 1 1 1 1 1]
.....
So I repeat the labels as many times as is the number of points n. Then I can just check where this
matrix and its transpose are equal. That will be true only if two points belong to the same cluster:
[[1 0 0 1 1 0 0 0 0 0]
[0 1 1 0 0 0 0 0 0 1]
[0 1 1 0 0 0 0 0 0 1]
[1 0 0 1 1 0 0 0 0 0]
.....
I casted the matrix to int, but mind you that the original output is actually a boolean array.
I left the print statements and the plots in the code to hopefully make it more clear.
Hope it helps!

Related

Numpy get secondary diagonal with offset=1 and change the values

I have this 6x6 matrix filled with 0s. I got the secondary diagonal in sec_diag. The thing I am trying to do is to change the values of above the sec_diag inside the matrix with the odds numbers from 9-1 [9,7,5,3,1]
import numpy as np
x = np.zeros((6,6), int)
sec_diag = np.diagonal(np.fliplr(x), offset=1)
The result should look like this:
[[0,0,0,0,9,0],
[0,0,0,7,0,0],
[0,0,5,0,0,0],
[0,3,0,0,0,0],
[1,0,0,0,0,0],
[0,0,0,0,0,0]]
EDIT: np.fill_diagonal isn't going to work.
You should use roll
x = np.zeros((6,6),dtype=np.int32)
np.fill_diagonal(np.fliplr(x), [9,7,5,3,1,0])
xr = np.roll(x,-1,axis=1)
print(xr)
Output
[[0 0 0 0 9 0]
[0 0 0 7 0 0]
[0 0 5 0 0 0]
[0 3 0 0 0 0]
[1 0 0 0 0 0]
[0 0 0 0 0 0]]
Maybe you should try with a double loop

Efficient way to find coordinates of connected blobs in binary image

I am looking for the coordinates of connected blobs in a binary image (2d numpy array of 0 or 1).
The skimage library provides a very fast way to label blobs within the array (which I found from similar SO posts). However I want a list of the coordinates of the blob, not a labelled array. I have a solution which extracts the coordinates from the labelled image. But it is very slow. Far slower than the inital labelling.
Minimal Reproducible example:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,1,0,1,1,1,0,1,1,1,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,0,1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
# The goal is to obtain lists of the coordinates
# Of each distinct blob.
blobs = []
label = 1
while True:
indices_of_label = np.where(labelled_array==label)
if not indices_of_label[0].size > 0:
break
else:
blob =list(zip(*indices_of_label))
label+=1
blobs.append(blob)
if __name__ == "__main__":
print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
print("Time taken:")
print(
timeit.timeit(
'extract_blobs_from_labelled_array(labels)',
globals=globals(),
number=1
)
)
print("\n\n")
Output:
2d array of type: <class 'numpy.ndarray'>:
[[0 1 0 0 1 1 0 1 1 0 0 1]
[0 1 0 1 1 1 0 1 1 1 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 1 1 0 1 1 0 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]]
2d array with connected blobs labelled of type <class 'numpy.ndarray'>:
[[ 0 1 0 0 2 2 0 3 3 0 0 4]
[ 0 1 0 2 2 2 0 3 3 3 0 4]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 5 5 5 5 0 0 0 0 3 0 0]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 0 6 0 0 0 0 0 0 0 0 0]
[ 0 6 0 0 7 7 0 8 8 0 0 9]
[ 0 0 0 0 0 0 0 8 8 8 0 0]
[ 0 10 10 10 10 0 0 0 0 8 0 0]]
Beginning extract_blobs_from_labelled_array timing
Time taken:
9.346099977847189e-05
9e-05 is small but so is this image for the example. In reality I am working with very high resolution images for which the function takes approximately 10 minutes.
Is there a faster way to do this?
Side note: I'm only using list(zip()) to try get the numpy coordinates into something I'm used to (I don't use numpy much just Python). Should I be skipping this and just using the coordinates to index as-is? Will that speed it up?
The part of the code that slow is here:
while True:
indices_of_label = np.where(labelled_array==label)
if not indices_of_label[0].size > 0:
break
else:
blob =list(zip(*indices_of_label))
label+=1
blobs.append(blob)
First, a complete aside: you should avoid using while True when you know the number of elements you will be iterating over. It's a recipe for hard-to-find infinite-loop bugs.
Instead, you should use:
for label in range(np.max(labels)):
and then you can ignore the if ...: break.
A second issue is indeed that you are using list(zip(*)), which is slow compared to NumPy functions. Here you could get approximately the same result with np.transpose(indices_of_label), which will get you a 2D array of shape (n_coords, n_dim), ie (n_coords, 2).
But the Big Issue is the expression labelled_array == label. This will examine every pixel of the image once for every label. (Twice, actually, because then you run np.where(), which takes another pass.) This is a lot of unnecessary work, as the coordinates can be found in one pass.
The scikit-image function skimage.measure.regionprops can do this for you. regionprops goes over the image once and returns a list containing one RegionProps object per label. The object has a .coords attribute containing the coordinates of each pixel in the blob. So, here's your code, modified to use that function:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,1,0,1,1,1,0,1,1,1,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,0,1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
"""Return a list containing coordinates of pixels in each blob."""
props = measure.regionprops(labelled_array)
blobs = [p.coords for p in props]
return blobs
if __name__ == "__main__":
print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
print("Time taken:")
print(
timeit.timeit(
'extract_blobs_from_labelled_array(labels)',
globals=globals(),
number=1
)
)
print("\n\n")

how to solve the edges issue in hexagonal plot?

I'm a beginner and I hope to be clear in exposing the problem.
I created a matrix like this:
[0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0]
[0 0 6 8 9 1 0 0]
[0 0 4 6 5 4 0 0]
[0 0 4 2 8 9 0 0]
[0 0 1 3 6 7 0 0]
[0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0]
the point is that I have to create a hexagonal plot in which the color scale represents the random numbers in the individual cells.
This is what I did:
import numpy as np
import matplotlib.pyplot as plt
n=4
A=np.zeros([2*n,2*n], dtype=int)
B=np.random.randint(1,10, size=(n,n))
A[2:6,2:6]=B
plt.figure(figsize=(5,5))
plt.imshow(A, origin=['lower'], cmap=plt.cm.Purples_r)
plt.colorbar()
x=[]
y=[]
for i in range (np.shape(A)[0]):
for j in range (np.shape(A)[1]):
N_occurence=A[i,j]
print(N_occurence)
for k in range (N_occurence):
x=np.append(x, i)
y=np.append(y, j)
plt.figure(figsize=(5,5))
plt.hexbin(x,y,gridsize=(10), cmap=plt.cm.Purples_r)
plt.xlim([1, 6])
plt.ylim([1, 6])
plt.colorbar()
plt.show()
but I can not solve the problem of edges, I always get half hexagons and the plot is not accurate. Does anyone know a simpler way or a similar example?
I'm still not really sure, what you're looking for, but I guess you want to have a imshow plot which uses hexagons like hexbin?
Maybe this helps a little bit:
import matplotlib.pyplot as plt
import numpy as np
# Generate array
A = np.zeros([8, 8], dtype=int)
A[2:6, 2:6] = np.random.randint(1, 10, size=(4, 4))
# Print array
print(A)
# `imshow` plot
plt.figure(figsize=(5,5))
plt.imshow(A, extent=(0, 8, 0, 8), origin='lower')
plt.colorbar()
# Rewrite array to get x and y values
# TODO: There has to be a better way than to use two `for` loops
X = []
Y = []
for y in range(len(A)):
for x, n in enumerate(A[len(A)-y-1]):
X += [x]*n
Y += [y]*n
# `scatter` plot to visualize rewritten array data
plt.figure(figsize=(5,5))
plt.scatter(X, Y)
# `hexbin` plot
plt.figure(figsize=(5,5))
plt.hexbin(X, Y, gridsize=5, extent=(0, 7, 0, 7))
plt.colorbar()
# show plots
plt.show()
Which results for random array A
[[0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0]
[0 0 3 7 3 3 0 0]
[0 0 3 5 8 1 0 0]
[0 0 4 8 7 3 0 0]
[0 0 1 7 9 3 0 0]
[0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0]]
in
imshow
scatter
hexbin
I think you might be better off with a custom solution like a scatterplot plotting hexagon tiles with your specified color.

Spectral Clustering a graph in python

I'd like to cluster a graph in python using spectral clustering.
Spectral clustering is a more general technique which can be applied not only to graphs, but also images, or any sort of data, however, it's considered an exceptional graph clustering technique. Sadly, I can't find examples of spectral clustering graphs in python online.
Scikit Learn has two spectral clustering methods documented: SpectralClustering and spectral_clustering which seem like they're not aliases.
Both of those methods mention that they could be used on graphs, but do not offer specific instructions. Neither does the user guide. I've asked for such an example from the developers, but they're overworked and haven't gotten to it.
A good network to document this against is the Karate Club Network. It's included as a method in networkx.
I'd love some direction in how to go about this. If someone can help me figure it out, I can add the documentation to scikit learn.
Notes:
A question much like this one has already been asked on this site.
Without much experience with Spectral-clustering and just going by the docs (skip to the end for the results!):
Code:
import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)
# Get your mentioned graph
G = nx.karate_club_graph()
# Get ground-truth: club-labels -> transform to 0/1 np-array
# (possible overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])
# Get adjacency-matrix as numpy-array
adj_mat = nx.to_numpy_matrix(G)
print('ground truth')
print(gt)
# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
# Compare ground-truth and clustering-results
print('spectral clustering')
print(sc.labels_)
print('just for better-visualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))
# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for better-visualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828
The general idea:
Introduction on the data and task from here:
The nodes in the graph represent the 34 members in a college Karate club. (Zachary is a sociologist, and he was one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because while Zachary was collecting his data, there was a dispute in the Karate club, and it split into two factions: one led by “Mr. Hi”, and one led by “John A”. It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.
Using sklearn & spectral-clustering to tackle this:
If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.
This describes normalized graph cuts as:
Find two disjoint partitions A and B of the vertices V of a graph, so
that A ∪ B = V and A ∩ B = ∅
Given a similarity measure w(i,j) between two vertices (e.g. identity
when they are connected) a cut value (and its normalized version) is defined as:
cut(A, B) = SUM u in A, v in B: w(u, v)
...
we seek the minimization of disassociation
between the groups A and B and the maximization of the association
within each group
Sounds alright. So we create the adjacency matrix (nx.to_numpy_matrix(G)) and set the param affinity to precomputed (as our adjancency-matrix is our precomputed similarity-measure).
Alternatively, using precomputed, a user-provided affinity matrix can be used.
Edit: While unfamiliar with this, i looked for parameters to tune and found assign_labels:
The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.
So trying the less sensitive approach:
sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')
Output:
ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for better-visualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351
That's a pretty much perfect fit to the ground-truth!
Here is a dummy example just to see what it does to a simple similarity matrix -- inspired by sascha's answer.
Code
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(0)
adj_mat = [[3,2,2,0,0,0,0,0,0],
[2,3,2,0,0,0,0,0,0],
[2,2,3,1,0,0,0,0,0],
[0,0,1,3,3,3,0,0,0],
[0,0,0,3,3,3,0,0,0],
[0,0,0,3,3,3,1,0,0],
[0,0,0,0,0,1,3,1,1],
[0,0,0,0,0,0,1,3,1],
[0,0,0,0,0,0,1,1,3]]
adj_mat = np.array(adj_mat)
sc = SpectralClustering(3, affinity='precomputed', n_init=100)
sc.fit(adj_mat)
print('spectral clustering')
print(sc.labels_)
Output
spectral clustering
[0 0 0 1 1 1 2 2 2]
Let's first cluster a graph G into K=2 clusters and then generalize for all K.
We can use the function linalg.algebraicconnectivity.fiedler_vector() from networkx, in order to compute the Fiedler vector of (the eigenvector corresponding to the second smallest eigenvalue of the Graph Laplacian matrix) of the graph, with the assumption that the graph is a connected undirected graph.
Then we can threshold the values of the eigenvector to compute the cluster index each node corresponds to, as shown in the next code block:
import networkx as nx
import numpy as np
A = np.zeros((11,11))
A[0,1] = A[0,2] = A[0,3] = A[0,4] = 1
A[5,6] = A[5,7] = A[5,8] = A[5,9] = A[5,10] = 1
A[0,5] = 5
G = nx.from_numpy_matrix(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
labels = [0 if v < 0 else 1 for v in ev] # using threshold 0
labels
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
nx.draw(G, pos=nx.drawing.layout.spring_layout(G),
with_labels=True, node_color=labels)
We can obtain the same clustering with eigen analysis of the graph Laplacian and then by choosing the eigenvector corresponding to the 2nd smallest eigenvalue too:
L = nx.laplacian_matrix(G)
e, v = np.linalg.eig(L.todense())
idx = np.argsort(e)
e = e[idx]
v = v[:,idx]
labels = [0 if x < 0 else 1 for x in v[:,1]] # using threshold 0
labels
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
drawing the graph again with the clusters labeled:
With SpectralClustering from sklearn.cluster we can get the exact same result:
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(A)
sc.labels_
# [0 0 0 0 0 1 1 1 1 1 1]
We can generalize the above for K > 2 clusters as follows (use kmeans clustering for partitioning the Fiedler vector instead of thresholding):
The following code demonstrates how k-means clustering can be used to partition the Fiedler vector and obtain a 3-clustering of a graph defined by the following adjacency matrix:
A = np.array([[3,2,2,0,0,0,0,0,0],
[2,3,2,0,0,0,0,0,0],
[2,2,3,1,0,0,0,0,0],
[0,0,1,3,3,3,0,0,0],
[0,0,0,3,3,3,0,0,0],
[0,0,0,3,3,3,1,0,0],
[0,0,0,0,0,1,3,1,1],
[0,0,0,0,0,0,1,3,1],
[0,0,0,0,0,0,1,1,3]])
K = 3 # K clusters
G = nx.from_numpy_matrix(A)
ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=K, random_state=0).fit(ev.reshape(-1,1))
kmeans.labels_
# array([2, 2, 2, 0, 0, 0, 1, 1, 1])
Now draw the clustered graph, with labeling the nodes with the clusters obtained above:

Counting of adjacent cells in a numpy array

Past midnight and maybe someone has an idea how to tackle a problem of mine. I want to count the number of adjacent cells (which means the number of array fields with other values eg. zeroes in the vicinity of array values) as sum for each valid value!.
Example:
import numpy, scipy
s = ndimage.generate_binary_structure(2,2) # Structure can vary
a = numpy.zeros((6,6), dtype=numpy.int) # Example array
a[2:4, 2:4] = 1;a[2,4] = 1 # with example value structure
print a
>[[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 1 1 1 0]
[0 0 1 1 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]]
# The value at position [2,4] is surrounded by 6 zeros, while the one at
# position [2,2] has 5 zeros in the vicinity if 's' is the assumed binary structure.
# Total sum of surrounding zeroes is therefore sum(5+4+6+4+5) == 24
How can i count the number of zeroes in such way if the structure of my values vary?
I somehow believe to must take use of the binary_dilation function of SciPy, which is able to enlarge the value structure, but simple counting of overlaps can't lead me to the correct sum or does it?
print ndimage.binary_dilation(a,s).astype(a.dtype)
[[0 0 0 0 0 0]
[0 1 1 1 1 1]
[0 1 1 1 1 1]
[0 1 1 1 1 1]
[0 1 1 1 1 0]
[0 0 0 0 0 0]]
Use a convolution to count neighbours:
import numpy
import scipy.signal
a = numpy.zeros((6,6), dtype=numpy.int) # Example array
a[2:4, 2:4] = 1;a[2,4] = 1 # with example value structure
b = 1-a
c = scipy.signal.convolve2d(b, numpy.ones((3,3)), mode='same')
print numpy.sum(c * a)
b = 1-a allows us to count each zero while ignoring the ones.
We convolve with a 3x3 all-ones kernel, which sets each element to the sum of it and its 8 neighbouring values (other kernels are possible, such as the + kernel for only orthogonally adjacent values). With these summed values, we mask off the zeros in the original input (since we don't care about their neighbours), and sum over the whole array.
I think you already got it. after dilation, the number of 1 is 19, minus 5 of the starting shape, you have 14. which is the number of zeros surrounding your shape. Your total of 24 has overlaps.

Categories