In a dendrogram from a hierarchical clustering in scipy, I would like to highlight the links connecting two specific labels, say 0 and 1.
import scipy.cluster.hierarchy as hac
from matplotlib import pyplot as plt
clustering = hac.linkage(points, method='single', metric='cosine')
link_colors = ["black"] * (2 * len(points) - 1)
hac.dendrogram(clustering, link_color_func=lambda k: link_colors[k])
plt.show()
The clustering has the following format:
clustering[i] corresponds to node number len(points) + i and its first two numbers are indices of nodes that are linked. Nodes with indices smaller than len(points) correspond to original points, higher indices to the clusters.
When drawing the dendrogram, a different indexing of the links is used, and these are the indices used for choosing the colors. How do the indices of the links (as indexed in link_colors) correspond to the indices in clustering?
You were very close to the solution. The rows of clustering are sorted by the values in the third column (the merge distances). The index into the color list used by link_color_func is the row index in clustering plus the length of points.
import scipy.cluster.hierarchy as hac
from matplotlib import pyplot as plt
import numpy as np
# Sample data
points = np.array([[8, 7, 7, 1],
                   [8, 4, 7, 0],
                   [4, 0, 6, 4],
                   [2, 4, 6, 3],
                   [3, 7, 8, 5]])
clustering = hac.linkage(points, method='single', metric='cosine')
clustering looks like this:
array([[3. , 4. , 0.00766939, 2. ],
[0. , 1. , 0.02763245, 2. ],
[5. , 6. , 0.13433008, 4. ],
[2. , 7. , 0.15768043, 5. ]])
As you can see the ordering (and thus the row-index) results from clustering being sorted by the third column.
To highlight a specific link (e.g. [0, 1], as you proposed), you have to find the row index of the pair [0, 1] within clustering and add len(points). The resulting number is the index into the color list provided to link_color_func.
# Initialize the link_colors list with 'black' (as you did already)
link_colors = ['black'] * (2 * len(points) - 1)
# Specify link you want to have highlighted
link_highlight = (0, 1)
# Find the row in clustering whose first two columns equal link_highlight.
# This raises an IndexError if you look for a link that is not in clustering (e.g. (0, 4)).
index_highlight = np.where((clustering[:, 0] == link_highlight[0]) &
                           (clustering[:, 1] == link_highlight[1]))[0][0]
# Index in color_list of desired link is index from clustering + length of points
link_colors[index_highlight + len(points)] = 'red'
hac.dendrogram(clustering, link_color_func=lambda k: link_colors[k])
plt.show()
Like this, you can highlight the desired link:
It also works for links between an original element and a cluster, or between two clusters (e.g. link_highlight = (5, 6)).
I have the following code, with which I am clustering hierarchically. My data object is an array of similarity distances I calculated earlier. I think I am executing the clustering properly. I thought I could just get the leaves of the cluster, but when I compare them to the original input I get a mismatch.
I have two questions here:
Why is there a mismatch between the leaves of my cluster and my actual input data?
How can I extract the original data from a cluster by either the linkage matrix or clusternodes?
import numpy as np
import pandas
import scipy.cluster.hierarchy as sch
def list_difference(list1, list2):
    return [value for value in list1 if value not in list2]
if __name__ == '__main__':
    # example data for this question's purpose
    data = [10, 11, 29, 288, 16]
    X = np.array([[i] for i in data])
    linkage_matrix = sch.average(X)
    rootnode, nodelist = sch.to_tree(linkage_matrix, rd=True)
    leaves = sch.leaves_list(linkage_matrix)
    print(list_difference(leaves, data))
I want to retrieve the original data points per cluster.
Given your data
data = [10, 11, 29, 288, 16]
the result is compatible with the dendrogram
sch.dendrogram(linkage_matrix);
Analyzing linkage_matrix we can confirm
print(linkage_matrix)
array([[ 0. , 1. , 1. , 2. ],
[ 4. , 5. , 5.5 , 3. ],
[ 2. , 6. , 16.66666667, 4. ],
[ 3. , 7. , 271.5 , 5. ]])
Row by row we have:
element 0 and element 1, with distance 1, in a cluster that has 2 elements (this cluster will be called 5)
element 4 with cluster 5 (the previous one), with distance 5.5 and 3 elements (this cluster will be called 6)
element 2 with cluster 6 (the previous one), with distance 16.667 and 4 elements (this cluster will be called 7)
element 3 with cluster 7 (the previous one), with distance 271.5 and 5 elements
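To answer the second question as well: leaves_list returns indices into data, not the values themselves, which is exactly the mismatch you saw. A minimal sketch (reusing data, linkage_matrix and nodelist from your code) that maps every cluster node back to its original data points via pre_order():
# leaves are indices into data, not values -- map them back
leaves = sch.leaves_list(linkage_matrix)
print([data[i] for i in leaves])  # original values in dendrogram leaf order
# every ClusterNode can list the leaf indices below it
for node in nodelist:
    members = node.pre_order()
    print(node.id, [data[i] for i in members])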
Doing k-means cluster analysis, how do I manually define certain cluster centers?
For example I want to say my cluster centers are [1,2,3] and [3,4,5] and now I want to cluster my vectors to the predefined centers.
Something like kmeans.cluster_centers_ = [[1,2,3],[3,4,5]]?
To work around my problem, this is what I do at the moment:
number_of_clusters = len(vec)
kmeans = KMeans(number_of_clusters, init='k-means++', n_init=100)
kmeans.fit(vec)
This basically defines a cluster for each vector. But it takes ages to compute, as I have thousands of vectors/sentences. There must be an option to set the vector coordinates directly as cluster coordinates without needing to compute them with the k-means algorithm (the center outputs are basically the vector coordinates anyway after I run the algorithm...).
Edit to be more specific about my task:
What I want: I have tons of vectors (generated from sentences) and now I want to cluster them. But imagine I have two columns of sentences and always want to assign a B-column sentence to an A-column sentence, not A-column sentences to each other. That's why I want to set the cluster centers to the A-column vectors and afterwards predict the closest B vectors to these centers. Hope that makes sense.
I am using sklearn's KMeans at the moment.
I think I know what you want to do. So you want to manually select the centroids for k-means with some known examples and then perform the clustering to assign the closest data points to your pre-defined centroids.
The parameter you are looking for is the k-means initialization parameter named init (see the documentation).
I have prepared a small example that would do exactly this.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import distance_matrix
# 5 datapoints with 3 features
data = [[1, 0, 0],
        [1, 0.2, 0],
        [0, 0, 1],
        [0, 0, 0.9],
        [1, 0, 0.1]]
X = np.array(data)
distance_matrix(X,X)
The pairwise distance matrix shows which examples are the closest.
> array([[0. , 0.2 , 1.41421356, 1.3453624 , 0.1 ],
> [0.2 , 0. , 1.42828569, 1.36014705, 0.2236068 ],
> [1.41421356, 1.42828569, 0. , 0.1 , 1.3453624 ],
> [1.3453624 , 1.36014705, 0.1 , 0. , 1.28062485],
> [0.1 , 0.2236068 , 1.3453624 , 1.28062485, 0. ]])
You can select certain data points to be used as your initial centroids:
centroid_idx = [0,2] # let data point 0 and 2 be our centroids
centroids = X[centroid_idx,:]
print(centroids) # [[1. 0. 0.]
# [0. 0. 1.]]
kmeans = KMeans(n_clusters=2, init=centroids, n_init=1, max_iter=1) # just run one k-means iteration so that the centroids are not updated
kmeans.fit(X)
kmeans.labels_
>>> array([0, 0, 1, 1, 0], dtype=int32)
As you can see k-Means labels the data points as expected. You might want to omit the max_iter parameter if you want your centroids to be updated.
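For the two-column task from your edit (assign each B-column vector to its closest A-column vector), you can take this one step further: use the A vectors themselves as fixed centroids and call predict on the B vectors. A small sketch under that assumption, with vecs_a and vecs_b as hypothetical stand-ins for your sentence vectors:
import numpy as np
from sklearn.cluster import KMeans
# hypothetical sentence vectors: column A defines the centers,
# column B is what we want to assign
vecs_a = np.array([[1, 2, 3], [3, 4, 5]], dtype=float)
vecs_b = np.array([[1.1, 2.0, 2.9],
                   [2.8, 4.2, 5.1],
                   [0.9, 1.8, 3.2]])
# one cluster per A vector; n_init=1 and max_iter=1 keep the centers in place
kmeans = KMeans(n_clusters=len(vecs_a), init=vecs_a, n_init=1, max_iter=1)
kmeans.fit(vecs_a)
# each B vector is labeled with the index of its closest A vector
print(kmeans.predict(vecs_b))  # array([0, 1, 0])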
I'm certain there's a good way to do this but I'm blanking on the right search terms to google, so I'll ask here instead. My problem is this:
I have two 2-dimensional arrays with the same dimensions. One array (array 1) is the accumulated precipitation at (x, y) points. The other (array 2) is the topographic height of the same (x, y) grid. I want to sum up array 1 between specific heights of array 2, and create a bar graph with topographic height bins on the x-axis and total accumulated precipitation on the y-axis.
So I want to be able to declare a list of heights (say [0, 100, 200, ..., 1000]) and for each bin, sum up all precipitation that occurred within that bin.
I can think of a few complicated ways to do this, but I'm guessing there's probably an easier way that I'm not thinking of. My gut instinct is to loop through my list of heights, mask anything outside of that range, sum up remaining values, add those to a new array, and repeat.
I'm wondering is if there's a built-in numpy or similar library that can do this more efficiently.
This code shows what you're asking for, with some explanation in the comments:
import numpy as np
def in_range(x, lower_bound, upper_bound):
    # returns whether x is between lower_bound (inclusive) and upper_bound (exclusive)
    return x in range(lower_bound, upper_bound)
# vectorize allows you to easily 'map' the function to a numpy array
vin_range = np.vectorize(in_range)
# representing your rainfall
rainfall = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# representing your height map
height = np.array([[1, 2, 1], [2, 4, 2], [3, 6, 3]])
# the bands of height you're looking to sum
bands = [[0, 2], [2, 4], [4, 6], [6, 8]]
# computing the actual results you'd want to chart
result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands]
print(result)
The next to last line is where the magic happens. vin_range(height, *band) uses the vectorized function to create a numpy array of boolean values, with the same dimensions as height, that has True if a value of height is in the range given, or False otherwise.
By using that array to index the array with the target values (rainfall), you get an array that only has the values for which the height is in the target range. Then it's just a matter of summing those.
In more steps than result = [(band, sum(rainfall[vin_range(height, *band)])) for band in bands] (but with the same result):
result = []
for lower, upper in bands:
    include = vin_range(height, lower, upper)
    values_to_include = rainfall[include]
    sum_of_rainfall = sum(values_to_include)
    result.append(([lower, upper], sum_of_rainfall))
You can use np.bincount together with np.digitize. digitize creates an array of bin indices from the height array height and the bin boundaries bins. bincount then uses the bin indices to sum the data in array rain.
# set up
rain = np.random.randint(0,100,(5,5))/10
height = np.random.randint(0,10000,(5,5))/10
bins = [0,250,500,750,10000]
# compute
sums = np.bincount(np.digitize(height.ravel(),bins),rain.ravel(),len(bins)+1)
# result
sums
# array([ 0. , 37. , 35.6, 14.6, 22.4, 0. ])
# check against direct method
[rain[(height>=bins[i]) & (height<bins[i+1])].sum() for i in range(len(bins)-1)]
# [37.0, 35.6, 14.600000000000001, 22.4]
An example using the numpy ma module, which lets you make masked arrays. From the docs:
A masked array is the combination of a standard numpy.ndarray and a mask. A mask is either nomask, indicating that no value of the associated array is invalid, or an array of booleans that determines for each element of the associated array whether the value is valid or not.
which seems what you need in this case.
import numpy as np
pr = np.random.randint(0, 1000, size=(100, 100)) #precipitation map
he = np.random.randint(0, 1000, size=(100, 100)) #height map
bins = np.arange(0, 1001, 200)
values = []
for vmin, vmax in zip(bins[:-1], bins[1:]):
    # create the masked array; the minimum is included in the bin, the maximum excluded
    maskedpr = np.ma.masked_where((he < vmin) | (he >= vmax), pr)
    values.append(maskedpr.sum())
values is the list of values for each bin, which you can plot.
The numpy.ma.masked_where function returns an array masked where condition is True. So you need to set the condition to be True outside the bins.
The sum() method performs the sum only where the array is not masked.
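To get from values to the bar graph the question asks for, a short matplotlib sketch (reusing bins and values from above):
from matplotlib import pyplot as plt
# label each bar with its height bin, e.g. "0-200"
labels = ['{}-{}'.format(vmin, vmax) for vmin, vmax in zip(bins[:-1], bins[1:])]
plt.bar(labels, values)
plt.xlabel('topographic height bin')
plt.ylabel('total accumulated precipitation')
plt.show()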
I want to find the centroid coordinates of a cluster (list of points [x,y]).
So, I want to use NearestCentroid() from sklearn.
clf = NearestCentroid()
clf.fit(X, y)
X : np.array of my coordinates points.
y : np.array fully filled with 1
I have an error when I launch the fit() function.
ValueError: y has less than 2 classes
Maybe there is a problem with the array shapes (X: (7, 2), y: (7,)).
The centroid of a set of points can be calculated by summing up all the values in each dimension and averaging them. You can use numpy.mean() for this; refer to the documentation: numpy.mean
import numpy as np
points = [[0, 0],
          [1, 1],
          [0, 1],
          [0, 100]]
a = np.array(points)
centroid = np.mean(a, axis=0)
print(centroid)
Which will give:
[ 0.25 25.5 ]
You can verify this by hand. Sum up the x-axis values: 0+1+0+0 = 1 and average it: 1/4 = 0.25. Same for the y-axis: 0+1+1+100 = 102, averaged: 102/4 = 25.5.
Here is the code.
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
> indices
> array([[0, 1], [1, 0], [2, 1], [3, 4], [4, 3], [5, 4]])
> distances
> array([[0., 1.], [0., 1.], [0., 1.41421356],
>        [0., 1.], [0., 1.], [0., 1.41421356]])
I don't really understand the shape of 'indices' and 'distances'. How do I understand what these numbers mean?
It's pretty straightforward, actually. For each data sample in the input to kneighbors() (X here), it will show 2 neighbors (because you have specified n_neighbors=2). indices gives you the index of the sample in the training data (again X here) and distances gives you the corresponding distance to that training sample.
Take the example of a single data point. Assuming X[0] is the first query point, the answer will be indices[0] and distances[0].
So for X[0],
the index of the first nearest neighbor in the training data is indices[0, 0] = 0 and the distance is distances[0, 0] = 0. You can use this index value to get the actual data sample from the training data.
This makes sense, because you used the same data for training and testing, so the first nearest neighbor for each point is itself and the distance is 0.
the index of the second nearest neighbor is indices[0, 1] = 1 and the distance is distances[0, 1] = 1
Similarly for all other points. The first dimension of indices and distances corresponds to the query points and the second dimension to the number of neighbors asked for.
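If you want the actual neighboring points rather than their indices, you can index X with a column of indices. A quick sketch: column 0 is each point itself, so column 1 holds the nearest other point:
# nearest other point for each sample (column 0 is the sample itself)
nearest_other = X[indices[:, 1]]
print(nearest_other)
# [[-2 -1]
#  [-1 -1]
#  [-2 -1]
#  [ 2  1]
#  [ 1  1]
#  [ 2  1]]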
Maybe a little sketch will help
As an example, the closest point to the training sample with index 0 is 1, and since you are using n_neighbors = 2 (two neighbors) you would expect to see this pair in the results. And indeed you see that the pair [0, 1] appears in the output.
To add to the above, here is how you can get the n_neighbors=2 neighbors from the indices array into a pandas DataFrame:
import pandas as pd
# X is a numpy array here, so plain indexing (not .iloc) is used
df = pd.DataFrame([X[indices[row, col]]
                   for row in range(indices.shape[0])
                   for col in range(indices.shape[1])])