How to read contents of scipy hierarchy cluster - python

I have the following code with which I am clustering hierarchically. My data object is an array of similarity distances I calculated earlier. I think I am executing the clustering properly. I thought I could just get the leaves of the Cluster, but when I compare that to the original input I get a mismatch.
I have two questions here:
Why is there a mismatch between the leaves of my cluster and my actual input data?
How can I extract the original data from a cluster by either the linkage matrix or clusternodes?
import numpy as np
import pandas
import scipy.cluster.hierarchy as sch
def list_difference(list1, list2):
    return [value for value in list1 if value not in list2]
if __name__ == '__main__':
    # example data for this question's purpose.
    data = [10, 11, 29, 288, 16]
    X = np.array([[i] for i in data])
    linkage_matrix = sch.average(X)
    rootnode, nodelist = sch.to_tree(linkage_matrix, rd=True)
    leaves = sch.leaves_list(linkage_matrix)
    print(list_difference(leaves, data))
I want to retrieve the original data points per cluster.

Given your data
data = [10, 11, 29, 288, 16]
the result is compatible with the dendrogram
sch.dendrogram(linkage_matrix);
Analyzing linkage_matrix we can confirm
print(linkage_matrix)
array([[ 0. , 1. , 1. , 2. ],
[ 4. , 5. , 5.5 , 3. ],
[ 2. , 6. , 16.66666667, 4. ],
[ 3. , 7. , 271.5 , 5. ]])
Row by row we have:
elements 0 and 1 are merged at distance 1 into a cluster with 2 elements (this cluster will be called 5)
element 4 is merged with cluster 5 (the previous one) at distance 5.5, giving 3 elements (this cluster will be called 6)
element 2 is merged with cluster 6 (the previous one) at distance 16.667, giving 4 elements (this cluster will be called 7)
element 3 is merged with cluster 7 (the previous one) at distance 271.5, giving 5 elements
The element numbers above, like the leaves returned by leaves_list, are indices into data rather than the data values themselves, which explains the mismatch.
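To get the original data points back, the indices can be used to index into the data. The following is a minimal sketch, not the only way to do it: it assumes the data is held in a NumPy array, uses to_tree to list the members of one internal node (node 6 from the walk-through above), and cuts the tree with fcluster at a distance threshold of 20, which is an arbitrary value chosen only for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

data = np.array([10, 11, 29, 288, 16])
X = data.reshape(-1, 1)
linkage_matrix = sch.linkage(X, method="average")

# leaves_list returns observation indices (in dendrogram order), so index into data
leaves = sch.leaves_list(linkage_matrix)
leaf_values = data[leaves]

# Members of one internal node: nodelist[i] is the node with id i
rootnode, nodelist = sch.to_tree(linkage_matrix, rd=True)
members_of_6 = data[nodelist[6].pre_order()]

# Flat clusters from cutting the tree at distance 20 (an arbitrary threshold)
labels = sch.fcluster(linkage_matrix, t=20, criterion="distance")
clusters = {}
for label, value in zip(labels, data):
    clusters.setdefault(int(label), []).append(int(value))

print(leaf_values)
print(members_of_6)
print(clusters)
```

With this threshold, 288 ends up alone and the other four values share a cluster; a different threshold or criterion gives a different grouping.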


Re-calculate elements of symmetric matrix using a "i not equal to j" loop in Python

The correlation matrix is symmetric: the elements above the diagonal mirror those below it. Together these are called the off-diagonal elements, as opposed to the diagonal elements, which all equal 1 in any correlation matrix because any variable's correlation with itself is 1.
In other words, swapping the row index i and the column index j of an off-diagonal element gives the same value: the correlation of variables 1 and 2 (row 1, column 2) equals that of variables 2 and 1 (row 2, column 1). Therefore, we only need to re-calculate the elements below the diagonal and then copy them to the corresponding positions above the diagonal.
import numpy as np
from numpy.random import randn
X = randn(20,3)
Rho = np.corrcoef(X.T) #correlation matrix
print(np.tril(Rho)) #lower off-diagonal of matrix Rho to re-calculate, then copy to other side
shows
array([[ 1. , 0. , 0. ],
[-0.03003281, 1. , 0. ],
[-0.02602238, 0.06137713, 1. ]])
What is the most efficient way to code a "i not-equal-to j" loop for the following sequence of steps:
re-calculate the lower off-diagonal elements of the symmetric matrix according to some apply function (to make it simple, we will just add +2 to each of these elements)
flip those same calculations onto its mirror image (the corresponding upper off-diagonals)
Also, replace the diagonal elements of the symmetric matrix with a vector filled with 10's (instead of 1's as found in the correlation matrix)
The aim is to generate a new matrix that is a re-calculation of the original.
Let us generate Rho first (note that I'm initializing the pseudo-random number generator in order to obtain the same Rho in different runs of the code):
In [526]: import numpy as np
In [527]: np.random.seed(0)
...: n = 3
...: X = np.random.randn(20, n)
...: Rho = np.corrcoef(X.T)
In [528]: Rho
Out[528]:
array([[1. , 0.03224462, 0.05021998],
[0.03224462, 1. , 0.15140358],
[0.05021998, 0.15140358, 1. ]])
Then you can use NumPy's tril_indices_from and advanced indexing to generate the new matrix:
In [548]: result = np.zeros_like(Rho)
In [549]: lrows, lcols = np.tril_indices_from(Rho, k=-1)
In [550]: result[lrows, lcols] = Rho[lrows, lcols] + 2
In [551]: result
Out[551]:
array([[0. , 0. , 0. ],
[2.03224462, 0. , 0. ],
[2.05021998, 2.15140358, 0. ]])
In [552]: result[lcols, lrows] = result[lrows, lcols]
In [553]: result
Out[553]:
array([[0. , 2.03224462, 2.05021998],
[2.03224462, 0. , 2.15140358],
[2.05021998, 2.15140358, 0. ]])
In [554]: result[np.arange(n), np.arange(n)] = 10
In [555]: result
Out[555]:
array([[10. , 2.03224462, 2.05021998],
[ 2.03224462, 10. , 2.15140358],
[ 2.05021998, 2.15140358, 10. ]])
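The three steps above can be wrapped in a small helper. This is only a sketch: the function name recalc_symmetric and its parameters are made up for illustration, and func is assumed to work element-wise on a 1-D array.

```python
import numpy as np

def recalc_symmetric(mat, func, diag_value):
    """Apply func to the strictly lower triangle, mirror it, and set the diagonal."""
    result = np.zeros_like(mat, dtype=float)
    rows, cols = np.tril_indices_from(mat, k=-1)   # i > j only, diagonal excluded
    result[rows, cols] = func(mat[rows, cols])     # re-calculate the lower part
    result[cols, rows] = result[rows, cols]        # copy onto the upper part
    np.fill_diagonal(result, diag_value)           # replace the diagonal
    return result

rng = np.random.default_rng(0)
Rho = np.corrcoef(rng.standard_normal((20, 3)).T)
new = recalc_symmetric(Rho, lambda v: v + 2, 10)
print(new)
```

For a purely element-wise function such as adding 2, computing Rho + 2 and then calling np.fill_diagonal gives the same matrix; the index-based version is mainly useful when func is expensive and should only be evaluated once per pair.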

Define cluster centers manually

When doing k-means cluster analysis, how do I manually define the cluster centers?
For example, I want to say my cluster centers are [1,2,3] and [3,4,5], and then I want to cluster my vectors to the predefined centers.
Something like kmeans.cluster_centers_ = [[1,2,3],[3,4,5]]?
To work around my problem, this is what I do at the moment:
number_of_clusters = len(vec)
kmeans = KMeans(number_of_clusters, init='k-means++', n_init=100)
kmeans.fit(vec)
It basically defines a cluster for each vector, but it takes ages to compute as I have thousands of vectors/sentences. There must be an option to set the vector coordinates directly as cluster coordinates without the need to compute them with the k-means algorithm (the center outputs are basically the vector coordinates after I run the algorithm).
Edit to be more specific about my task:
I have tons of vectors (generated from sentences) that I want to cluster. Imagine I have two columns of sentences and always want to match a B column sentence to an A column sentence, not A column sentences to each other. That's why I want to set the cluster centers to the A column vectors and afterwards predict the closest B vectors to these centers. Hope that makes sense.
I am currently using sklearn's KMeans.
I think I know what you want to do. You want to manually select the centroids for k-Means from some known examples and then perform the clustering to assign the closest data points to your pre-defined centroids.
The parameter you are looking for is the k-Means initialization, named init (see the documentation).
I have prepared a small example that would do exactly this.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import distance_matrix
# 5 datapoints with 3 features
data = [[1, 0, 0],
        [1, 0.2, 0],
        [0, 0, 1],
        [0, 0, 0.9],
        [1, 0, 0.1]]
X = np.array(data)
distance_matrix(X,X)
The pairwise distance matrix shows which examples are closest to each other.
> array([[0. , 0.2 , 1.41421356, 1.3453624 , 0.1 ],
> [0.2 , 0. , 1.42828569, 1.36014705, 0.2236068 ],
> [1.41421356, 1.42828569, 0. , 0.1 , 1.3453624 ],
> [1.3453624 , 1.36014705, 0.1 , 0. , 1.28062485],
> [0.1 , 0.2236068 , 1.3453624 , 1.28062485, 0. ]])
You can select certain data points to be used as your initial centroids:
centroid_idx = [0,2] # let data point 0 and 2 be our centroids
centroids = X[centroid_idx,:]
print(centroids) # [[1. 0. 0.]
# [0. 0. 1.]]
kmeans = KMeans(n_clusters=2, init=centroids, n_init=1, max_iter=1) # n_init=1 because the centers are given explicitly; max_iter=1 runs a single k-Means iteration, so the centroids are updated at most once
kmeans.fit(X)
kmeans.labels_
>>> array([0, 0, 1, 1, 0], dtype=int32)
As you can see, k-Means labels the data points as expected. Omit the max_iter parameter if you want the centroids to keep being updated until convergence.
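If the centers should stay exactly where they are, as in the A column / B column use case from the question, no k-Means fit is needed at all: each vector can simply be assigned to its nearest center. Below is a minimal NumPy sketch of that idea, assuming Euclidean distance; the helper name assign_to_centers is made up for this example and is not part of sklearn.

```python
import numpy as np

def assign_to_centers(points, centers):
    """Return, for each point, the index of the nearest center (Euclidean)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # distances[i, j] = distance between points[i] and centers[j]
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

X = np.array([[1, 0, 0],
              [1, 0.2, 0],
              [0, 0, 1],
              [0, 0, 0.9],
              [1, 0, 0.1]])
centers = X[[0, 2]]          # data points 0 and 2 act as fixed centers
labels = assign_to_centers(X, centers)
print(labels)  # [0 0 1 1 0]
```

In the question's setting, centers would hold the A column vectors and points the B column vectors. The broadcasted intermediate array has shape (n_points, n_centers, n_features), so for very large inputs it may be worth processing the points in chunks.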

Coloring specific links in a dendrogram

In a dendrogram from a hierarchical clustering in scipy, I would like to highlight the link connecting two specific labels, let's say 0 and 1.
import scipy.cluster.hierarchy as hac
from matplotlib import pyplot as plt
clustering = hac.linkage(points, method='single', metric='cosine')
link_colors = ["black"] * (2 * len(points) - 1)
hac.dendrogram(clustering, link_color_func=lambda k: link_colors[k])
plt.show()
The clustering has the following format:
clustering[i] corresponds to node number len(points) + i and its first two numbers are indices of nodes that are linked. Nodes with indices smaller than len(points) correspond to original points, higher indices to the clusters.
When drawing the dendrogram, different indexing of the links is used and these are the indices that are used for choosing the color. How do the indices of the links (as indexed in link_colors) correspond to indices in clustering?
You were very close to the solution. The rows of clustering are sorted by the third column of the clustering array (the merge distance). The index into the color list used by link_color_func is the row index in clustering plus the length of points.
import scipy.cluster.hierarchy as hac
from matplotlib import pyplot as plt
import numpy as np
# Sample data
points = np.array([[8, 7, 7, 1],
                   [8, 4, 7, 0],
                   [4, 0, 6, 4],
                   [2, 4, 6, 3],
                   [3, 7, 8, 5]])
clustering = hac.linkage(points, method='single', metric='cosine')
clustering looks like this:
array([[3. , 4. , 0.00766939, 2. ],
[0. , 1. , 0.02763245, 2. ],
[5. , 6. , 0.13433008, 4. ],
[2. , 7. , 0.15768043, 5. ]])
As you can see, the ordering (and thus the row index) results from clustering being sorted by the third column.
To highlight a specific link (e.g. [0,1] as you proposed), you have to find the row index of the pair [0,1] within clustering and add len(points). The resulting number is the index into the color list provided for link_color_func.
# Initialize the link_colors list with 'black' (as you did already)
link_colors = ['black'] * (2 * len(points) - 1)
# Specify link you want to have highlighted
link_highlight = (0, 1)
# Find the row in clustering whose first two columns equal link_highlight.
# This raises an IndexError if the link is not in clustering (e.g. [0,4]).
index_highlight = np.where((clustering[:,0] == link_highlight[0]) &
                           (clustering[:,1] == link_highlight[1]))[0][0]
# Index in color_list of desired link is index from clustering + length of points
link_colors[index_highlight + len(points)] = 'red'
hac.dendrogram(clustering, link_color_func=lambda k: link_colors[k])
plt.show()
This highlights the desired link.
It also works for links between an original element and a cluster, or between two clusters (e.g. link_highlight = (5, 6)).
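The lookup can also be written as a small helper that does not depend on the order in which the two nodes are given and reports a missing link with a clearer error. This is just a sketch; the name link_color_index is hypothetical, and the linkage matrix is copied from the output above so that the example runs without the sample data.

```python
import numpy as np

def link_color_index(linkage_matrix, pair):
    """Return the link_color_func index of the merge joining the two node ids in pair."""
    n_points = linkage_matrix.shape[0] + 1
    # First two columns hold the merged node ids; sort each row so order does not matter
    merged = np.sort(linkage_matrix[:, :2].astype(int), axis=1)
    target = np.sort(np.asarray(pair, dtype=int))
    rows = np.flatnonzero((merged == target).all(axis=1))
    if rows.size == 0:
        raise ValueError(f"nodes {tuple(pair)} are not merged directly")
    return int(rows[0]) + n_points

clustering = np.array([[3., 4., 0.00766939, 2.],
                       [0., 1., 0.02763245, 2.],
                       [5., 6., 0.13433008, 4.],
                       [2., 7., 0.15768043, 5.]])
print(link_color_index(clustering, (1, 0)))  # 6
print(link_color_index(clustering, (5, 6)))  # 7
```

The returned value can be used directly as link_colors[link_color_index(clustering, (0, 1))] = 'red'.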

Correct usage of TensorFlow Transform apply_buckets

This is on TensorFlow 1.11.0. The documentation of tft.apply_buckets is not very descriptive. Specifically, I read:
"bucket_boundaries: The bucket boundaries represented as a rank 2 Tensor."
I assume this has to be bucket indices and bucket boundaries?
When I try with the toy example below:
import tensorflow as tf
import tensorflow_transform as tft
import numpy as np
tf.enable_eager_execution()
x = np.array([-1,9,19, 29, 39])
xt = tf.cast(
    tf.convert_to_tensor(x),
    tf.float32
)
boundaries = tf.cast(
    tf.transpose(
        tf.convert_to_tensor([[0, 1, 2, 3], [10, 20, 30, 40]])
    ),
    tf.float32
)
buckets = tft.apply_buckets(xt, boundaries)
I get:
InvalidArgumentError: Expected sorted boundaries [Op:BucketizeWithInputBoundaries] name: assign_buckets
Note that in this case x and bucket_boundaries arguments are:
tf.Tensor([-1. 9. 19. 29. 39.], shape=(5,), dtype=float32)
tf.Tensor(
[[ 0. 10.]
[ 1. 20.]
[ 2. 30.]
[ 3. 40.]], shape=(4, 2), dtype=float32)
So, it seems like bucket_boundaries is not supposed to be indices and boundaries. Does anyone know how to properly use this method?
After some playing around, I found out that bucket_boundaries is supposed to be a 2-dimensional array whose entries are the sorted bucket boundaries, wrapped row by row so that the array has two columns. See the example below:
import tensorflow as tf
import tensorflow_transform as tft
import numpy as np
tf.enable_eager_execution()
x = np.array([-1,9,19, 29, 39])
xt = tf.cast(
    tf.convert_to_tensor(x),
    tf.float32
)
boundaries = tf.cast(
    tf.transpose(
        tf.convert_to_tensor([[0, 20, 40, 60], [10, 30, 50, 70]])
    ),
    tf.float32
)
buckets = tft.apply_buckets(xt, boundaries)
So, the inputs and the resulting buckets are:
print (xt)
print (buckets)
print (boundaries)
tf.Tensor([-1. 9. 19. 29. 39.], shape=(5,), dtype=float32)
tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int64)
tf.Tensor(
[[ 0. 10.]
[20. 30.]
[40. 50.]
[60. 70.]], shape=(4, 2), dtype=float32)
Wanted to add to this as this is the only result for the Google search "tft.apply_buckets" :)
The example did not work for me in the latest version of TFT. The following code did work for me.
Note that the buckets are specified as a rank 2 tensor, but with only one element in the inner dimension.
(I'm using the wrong words but hopefully my example below will clarify)
import tensorflow as tf
import tensorflow_transform as tft
import numpy as np
tf.enable_eager_execution()
xt = tf.cast(tf.convert_to_tensor(np.array([-1,9,19, 29, 39])),tf.float32)
bds = [[0],[10],[20],[30],[40]]
boundaries = tf.cast(tf.convert_to_tensor(bds),tf.float32)
buckets = tft.apply_buckets(xt, boundaries)
Thanks for your help, as this answer got me most of the way there!
The rest I found from the TFT source code:
https://github.com/tensorflow/transform/blob/deb198d59f09624984622f7249944cdd8c3b733f/tensorflow_transform/mappers.py#L1697-L1698
I love this answer; I just wanted to simplify it a bit, as enabling eager execution, casting, and numpy aren't really needed. Note that the float case below is handled by writing one of the scalars as a float; TensorFlow standardizes on the highest-fidelity data type.
The code below shows how this mapping works. The number of buckets created is the length of the bucket boundaries vector + 1 or, more intuitively in my opinion, the number of commas + 2: one extra bucket covers negative infinity up to the smallest boundary, and another covers the largest boundary up to infinity. If a value lies exactly on a bucket boundary, it goes to the bucket representing the bigger numbers. What happens when the bucket boundaries aren't sorted is left as an exercise for the reader :)
import tensorflow as tf
import tensorflow_transform as tft
xt = tf.constant([-1., 9, 19, 29, 39, float('nan'), float('-inf'), float('inf')])
bucket_boundaries = tf.constant([[0], [10], [20], [30], [40]])
bucketed_floats = tft.apply_buckets(xt, bucket_boundaries)
for scalar, index in zip(xt, range(len(xt))):
    print(f"{scalar} was mapped to bucket {bucketed_floats[index]}.")
-1.0 was mapped to bucket 0.
9.0 was mapped to bucket 1.
19.0 was mapped to bucket 2.
29.0 was mapped to bucket 3.
39.0 was mapped to bucket 4.
nan was mapped to bucket 5.
-inf was mapped to bucket 0.
inf was mapped to bucket 5.
xt_int = tf.constant([-1, 9, 19, 29, 39, 41])
bucketed_ints = tft.apply_buckets(xt_int, bucket_boundaries)
for scalar, index in zip(xt_int, range(len(xt_int))):
    print(f"{scalar} was mapped to bucket {bucketed_ints[index]}.")
-1 was mapped to bucket 0.
9 was mapped to bucket 1.
19 was mapped to bucket 2.
29 was mapped to bucket 3.
39 was mapped to bucket 4.
41 was mapped to bucket 5.
Note that there's also a function called tft.bucketize, which appears to require a full pass over the data. I'm not 100% clear on the nuance between tft.apply_buckets and tft.bucketize.
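The mapping described above (a value on a boundary goes to the higher bucket, and there is one more bucket than there are boundaries) can be reproduced outside TensorFlow for experimenting. The sketch below is a NumPy analogue built on np.searchsorted, not TFT's actual implementation; the function name apply_buckets_np is made up, and raising an error for unsorted boundaries is a choice made here to mirror the "Expected sorted boundaries" message from the question.

```python
import numpy as np

def apply_buckets_np(values, boundaries):
    """Map each value to a bucket index in [0, len(boundaries)]."""
    boundaries = np.asarray(boundaries, dtype=float).ravel()
    if np.any(np.diff(boundaries) < 0):
        raise ValueError("boundaries must be sorted in ascending order")
    # side="right": a value equal to a boundary lands in the bucket above it
    return np.searchsorted(boundaries, np.asarray(values, dtype=float), side="right")

boundaries = [[0], [10], [20], [30], [40]]
print(apply_buckets_np([-1, 9, 19, 29, 39, 41], boundaries))  # [0 1 2 3 4 5]
print(apply_buckets_np([10, float("-inf"), float("inf")], boundaries))  # [2 0 5]
```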

Probability functions convolution in python

There are N distributions which take on integer values 0,... with associated probabilities. For illustration, I assume 3 variables, each given as rows of [value, prob]:
import numpy as np
x = np.array([ [0,0.3],[1,0.2],[3,0.5] ])
y = np.array([ [10,0.2],[11,0.4],[13,0.1],[14,0.3] ])
z = np.array([ [21,0.3],[23,0.7] ])
As there are N variables I convolve first x+y, then I add z, and so on.
Unfortunately numpy.convolve() takes 1-d arrays as input, so it does not apply directly in this case. I tried padding the variables so that they all take the values 0,1,2,...,23 (with Pr=0 where a value is not known), but I feel like there is a much better solution.
Does anyone have a suggestion for making it more efficient? Thanks in advance.
I don't see a built-in method for this in SciPy; there's a way to define custom discrete random variables, but those don't support addition. Here is an approach using pandas, assuming import pandas as pd and x, y, z as in your example:
values = np.add.outer(x[:,0], y[:,0]).flatten()
probs = np.multiply.outer(x[:,1], y[:,1]).flatten()
df = pd.DataFrame({'values': values, 'probs': probs})
conv = df.groupby('values').sum()
result = conv.reset_index().values
The output is
array([[ 10. , 0.06],
[ 11. , 0.16],
[ 12. , 0.08],
[ 13. , 0.13],
[ 14. , 0.31],
[ 15. , 0.06],
[ 16. , 0.05],
[ 17. , 0.15]])
With more than two variables, you don't have to go back and forth between numpy and pandas: the additional variables can be included at the beginning.
values = np.add.outer(np.add.outer(x[:,0], y[:,0]), z[:,0]).flatten()
probs = np.multiply.outer(np.multiply.outer(x[:,1], y[:,1]), z[:,1]).flatten()
Aside: it would be better to keep values and probabilities in separate numpy arrays, if they have different intrinsic data types (integers vs reals).
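For an arbitrary number of variables, the same outer-sum / outer-product idea can be folded pairwise with NumPy alone. This is a sketch under the same assumption the approach above already makes implicitly (the variables are independent); the helper name convolve_pmfs is made up for this example, and np.unique plus np.bincount stand in for the pandas groupby.

```python
from functools import reduce
import numpy as np

def convolve_pmfs(a, b):
    """Distribution of the sum of two independent variables given as [value, prob] rows."""
    values = np.add.outer(a[:, 0], b[:, 0]).ravel()
    probs = np.multiply.outer(a[:, 1], b[:, 1]).ravel()
    # Merge equal sums by adding up their probabilities
    unique_values, inverse = np.unique(values, return_inverse=True)
    summed = np.bincount(inverse.ravel(), weights=probs)
    return np.column_stack([unique_values, summed])

x = np.array([[0, 0.3], [1, 0.2], [3, 0.5]])
y = np.array([[10, 0.2], [11, 0.4], [13, 0.1], [14, 0.3]])
z = np.array([[21, 0.3], [23, 0.7]])

xy = convolve_pmfs(x, y)
total = reduce(convolve_pmfs, [x, y, z])
print(xy)
print(total)
```

Merging equal sums after every pairwise step keeps the intermediate arrays small, whereas nesting all outer products first grows with the product of the variable sizes.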
