sklearn agglomerative clustering compute_full_tree - python

I'm trying to cluster a matrix of longitude and latitude values with the agglomerative clustering of the sklearn library.
I want to work out into how many clusters I have to divide my points so that no cluster spans more than fifty meters. I compute the maximum distance between the points of each cluster, and if any cluster exceeds fifty meters I cluster again with n_clusters += 1.
This means I rebuild the tree and the clusters on every iteration until I find the proper separation, which wastes a lot of computation time.
I would like to compute the whole tree once, using the compute_full_tree option, and then just ask for the groups for whatever number of clusters I want. Is that possible?
from sklearn.cluster import AgglomerativeClustering

def aglomerativeClust(Y, cl):
    X = []
    for y in Y:
        X.append([y.visit.latitude, y.visit.longitude])
    model = AgglomerativeClustering(linkage='complete', n_clusters=cl)
    model.fit(X)
    labels = model.labels_
    return labels

cl = 1
maxDist = [float('inf')]  # forces at least one clustering pass
while max(maxDist) > 50:
    cl += 1
    labels = aglomerativeClust(X, cl)
    for l in range(len(X)):
        X[l].cluster = str(labels[l])
    clusters = [[] for o in range(max(labels) + 1)]
    for x in X:
        clusters[int(x.cluster)].append(x)
    maxDist = [0. for o in range(max(labels) + 1)]
    # maximum pairwise distance inside each cluster
    for m in range(len(clusters)):
        for x in clusters[m]:
            for y in clusters[m]:
                dist = compute_vincenty_distance(x, y)
                if dist > maxDist[m]:
                    maxDist[m] = dist
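One way to avoid rebuilding the hierarchy on every iteration is to compute the tree once and then cut it at a 50 m threshold. The sketch below uses scipy.cluster.hierarchy rather than sklearn and assumes the same X of point objects and compute_vincenty_distance helper as above (pdist hands the metric plain [lat, lon] rows, so a thin wrapper around that helper may be needed):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# One [latitude, longitude] row per point, built as in aglomerativeClust.
coords = np.array([[p.visit.latitude, p.visit.longitude] for p in X])

# Condensed pairwise distance matrix in meters; the lambda adapts the rows
# to whatever signature compute_vincenty_distance actually expects.
D = pdist(coords, metric=lambda a, b: compute_vincenty_distance(a, b))

# Build the full tree once.
Z = linkage(D, method='complete')

# Cut it at 50 m: with complete linkage this guarantees a maximum pairwise
# distance of 50 m inside every flat cluster, with no loop over n_clusters.
labels = fcluster(Z, t=50, criterion='distance')

# The same tree can also be re-cut cheaply for any fixed number of clusters:
# labels_k = fcluster(Z, t=some_k, criterion='maxclust')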

Related

Intra-cluster for custom k-means

I'm stuck trying to implement and plot, in Python, the intra-cluster variance of each cluster in k-means in order to pick the best number of clusters k. The per-cluster quantity is

W_k = (1 / |C_k|) * Σ_{x_i ∈ C_k} ||x_i − μ_k||²

i.e. the sum of the squared distances of the data points belonging to cluster C_k from its centroid μ_k, normalized by the size of the cluster C_k.
Then we can compute the intra-cluster variance over all clusters by just adding up the individual cluster variances:

W = Σ_k W_k

Can I get help implementing Wk and W?
The custom k-means implementation:
import numpy as np
import pandas as pd

def kmeans(X, k):
    iterations = 0
    data = pd.DataFrame(X)
    cluster = np.zeros(X.shape[0])
    # taking random samples from the datapoints as an initialization of centroids
    centroids = data.sample(n=k).values
    while True:
        # for each observation
        for i, row in enumerate(X):
            mn_dist = float('inf')
            # distance of the point from all centroids
            for idx, centroid in enumerate(centroids):
                # calculating euclidean distance
                d = np.sqrt((centroid[0] - row[0])**2 + (centroid[1] - row[1])**2)
                # assign closest centroid
                if mn_dist > d:
                    mn_dist = d
                    cluster[i] = idx
        # updating centroids by taking the mean value of all datapoints of each cluster
        new_centroids = pd.DataFrame(X).groupby(by=cluster).mean().values
        iterations += 1
        # if centroids are the same then break
        if np.count_nonzero(centroids - new_centroids) == 0:
            break
        else:  # else update old centroids with new ones
            centroids = new_centroids
    return centroids, cluster, iterations
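For the two quantities themselves, a minimal sketch built on top of this kmeans could look like the following; the helper names intra_cluster_variance and total_within_variance are made up for illustration, and X is assumed to be a NumPy array as above:

import numpy as np

def intra_cluster_variance(X, centroids, cluster, k):
    # W_k: squared distances of cluster k's points to its centroid,
    # normalized by the cluster size |C_k|.
    members = X[cluster == k]
    if len(members) == 0:
        return 0.0
    return np.sum((members - centroids[k]) ** 2) / len(members)

def total_within_variance(X, centroids, cluster):
    # W: sum of W_k over all clusters.
    return sum(intra_cluster_variance(X, centroids, cluster, k)
               for k in range(len(centroids)))

# Example usage with the custom implementation above:
# centroids, cluster, _ = kmeans(X, k)
# print(total_within_variance(X, centroids, cluster))

Running this for several values of k and plotting W against k then gives the usual elbow curve.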

How to automatically and efficiently determine the number of clusters for the K-means clustering algorithm using the knee algorithm?

I want to determine the number of clusters (K) for the K-means algorithm automatically, such that the WCSS for the chosen K is below a threshold. To do so, I am using the knee algorithm (KneeLocator). However, this is very time-consuming when the dataset is large, and my criterion is not satisfied when WCSS is computed over only a small range of K, so I have to widen the range to improve the chance of finding a K that satisfies it. As a result, for a large dataset I end up running K-means and computing WCSS over a large range of candidate K values (e.g. for k in range(1, 1000)). Since the knee algorithm can't find the best K unless WCSS has been computed over a large set of K, how should I set the KneeLocator parameters so that it finds K while computing fewer WCSS values?
For smaller datasets, when I increase the range for K I sometimes get a ConvergenceWarning, so I limit the range to the one mentioned in the warning message. As a result, even after raising the sensitivity parameter S to 300, I cannot get the WCSS below my threshold. For larger datasets I have tried computing WCSS for k in range(1, 1000, step). Is there any other suggestion for finding K more efficiently?
Moreover, sometimes KneeLocator cannot find a knee at all. In that case, if it is the first time I'm computing the number of clusters I set K = 1, otherwise I increase the previously found K by 1 (K += 1). What is the best heuristic here?
The following is the base code:
from sklearn.cluster import KMeans
from kneed import KneeLocator

def computewcss(j, X, rang, step):
    wcss = []
    indexes = []
    for i in range(j, rang, step):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)
        indexes.append(i)
    return wcss, indexes

wcss, x = computewcss(1, ds, rang, step)  # rang and step are determined from the dataset length at the beginning
s = 1
ncluster = KneeLocator(x, wcss, S=s, curve='convex', direction='decreasing', online=True).knee
distortion = wcss[x.index(ncluster)]  # x is a list, so use list.index()
while distortion > threshold and s < 250:
    s += 50
    start = rang + 1
    rang *= 2
    wcsstemp, xtemp = computewcss(start, ds, rang, step)
    wcss.extend(wcsstemp)
    x.extend(xtemp)
    ncluster = KneeLocator(x, wcss, S=s, curve='convex', direction='decreasing', online=True).knee
    distortion = wcss[x.index(ncluster)]
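KneeLocator returns None for .knee when no knee is found, so one possible heuristic is to guard for that and keep widening the range before trusting the result. This is only a sketch reusing the names from the code above (ds, rang, step, threshold); the find_k wrapper itself is hypothetical:

def find_k(ds, rang, step, threshold, s=1, max_s=250):
    # Widen the search range whenever no knee is found or the knee's
    # distortion is still above the threshold.
    wcss, x = computewcss(1, ds, rang, step)
    while True:
        knee = KneeLocator(x, wcss, S=s, curve='convex',
                           direction='decreasing', online=True).knee
        if knee is not None and wcss[x.index(knee)] <= threshold:
            return knee
        if s >= max_s:
            # Fall back to the smallest K whose WCSS already meets the threshold, if any.
            ok = [k for k, w in zip(x, wcss) if w <= threshold]
            return ok[0] if ok else knee
        s += 50
        start, rang = rang + 1, rang * 2
        wcsstemp, xtemp = computewcss(start, ds, rang, step)
        wcss.extend(wcsstemp)
        x.extend(xtemp)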

k-means with a centroid constraint

I'm working on a data science project for my intro to Data Science class, and we've decided to tackle a problem relating to desalination plants in California: "Where should we place k plants to minimize the distance to zip codes?"
The data we have so far is: zip, city, county, pop, lat, long, and amount of water.
The issue is that I can't find any resources on how to constrain the centroids to stay on the coast. What we've thought of so far:
Use a normal k-means algorithm, but move each centroid to the coast once the clusters have settled (bad).
Use a normal k-means algorithm with weights, making the coastal zips have infinite weight (I've been told this isn't a great solution).
What do you guys think?
K-means does not minimize distances.
It minimizes squared errors, which is quite different.
The difference is roughly that between the median and the mean in one-dimensional data. The error can be massive.
Here is a counter example, assuming we have the coordinates:
-1 0
+1 0
0 -1
0 101
The center chosen by k-means would be (0, 25). The optimal location is (0, 0).
The sum of distances from the k-means center is > 152; the optimal location has a total distance of 104. So here the centroid is almost 50% worse than the optimum! But the centroid (= multivariate mean) is what k-means uses!
k-means does not minimize the Euclidean distance!
This is one facet of how "k-means is sensitive to outliers".
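A quick numeric check of the example above (just recomputing the totals quoted; nothing beyond NumPy is assumed):

import numpy as np

pts = np.array([[-1, 0], [1, 0], [0, -1], [0, 101]], dtype=float)

mean = pts.mean(axis=0)          # the k-means centroid: [0, 25]
best = np.array([0.0, 0.0])      # the better location for total distance

total_dist = lambda c: np.linalg.norm(pts - c, axis=1).sum()
print(total_dist(mean))  # about 152.04
print(total_dist(best))  # 104.0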
It does not get better if you try to constrain it to place "centers" on the coast only...
Also, you should at least use Haversine distance, because in California 1 degree north != 1 degree east (it's not at the equator).
Furthermore, you likely should not make the assumption that every location requires its own pipe, but rather they will be connected like a tree. This greatly reduces the cost.
I strongly suggest treating this as a generic optimization problem rather than as k-means. K-means is an optimization too, but it may optimize the wrong function for your problem.
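For reference, since Haversine distance is mentioned above, a minimal sketch of the great-circle distance between two lat/lon points (in kilometers, assuming a mean Earth radius of 6371 km):

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points given in decimal degrees.
    R = 6371.0  # mean Earth radius in km (assumption)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))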
I would approach this by defining the set of possible points that could be centers, i.e. your coastline.
I think this is close to Nathaniel Saul's first comment.
This way, at each iteration, instead of choosing a mean, a point from the possible set is chosen by its proximity to the cluster.
I've simplified the conditions to only 2 data columns (lon. and lat.), but you should be able to extrapolate the concept. For simplicity, to demonstrate, I based this on code from here.
In this example, the purple dots are points on the coastline. If I understood correctly, the optimal coastline locations should look something like the output of the code below.
See code below:
#! /usr/bin/python3.6
# Code based on:
# https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
import matplotlib.pyplot as plt
import numpy as np
import random

##### Simulation START #####
# Generate possible points.
def possible_points(n=20):
    y = list(np.linspace(-1, 1, n))
    x = [-1.2]
    X = []
    for i in list(range(1, n)):
        x.append(x[i-1] + random.uniform(-2/n, 2/n))
    for a, b in zip(x, y):
        X.append(np.array([a, b]))
    X = np.array(X)
    return X

# Generate sample
def init_board_gauss(N, k):
    n = float(N) / k
    X = []
    for i in range(k):
        c = (random.uniform(-1, 1), random.uniform(-1, 1))
        s = random.uniform(0.05, 0.5)
        x = []
        while len(x) < n:
            a, b = np.array([np.random.normal(c[0], s), np.random.normal(c[1], s)])
            # Continue drawing points from the distribution in the range [-1,1]
            if abs(a) < 1 and abs(b) < 1:
                x.append([a, b])
        X.extend(x)
    X = np.array(X)[:N]
    return X
##### Simulation END #####

# Identify points for each center.
def cluster_points(X, mu):
    clusters = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x - mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t: t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

# Get closest possible point for each cluster.
def closest_point(cluster, possiblePoints):
    closestPoints = []
    # Check average distance for each point.
    for possible in possiblePoints:
        distances = []
        for point in cluster:
            distances.append(np.linalg.norm(possible - point))
        closestPoints.append(np.sum(distances))  # minimize total distance
        # closestPoints.append(np.mean(distances))
    return possiblePoints[closestPoints.index(min(closestPoints))]

# Calculate new centers.
# Here the 'coast constraint' goes.
def reevaluate_centers(clusters, possiblePoints):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(closest_point(clusters[k], possiblePoints))
    return newmu

# Check whether centers converged.
def has_converged(mu, oldmu):
    return set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu])

# Meta function that runs the steps of the process in sequence.
def find_centers(X, K, possiblePoints):
    # Initialize to K random centers
    oldmu = random.sample(list(possiblePoints), K)
    mu = random.sample(list(possiblePoints), K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Re-evaluate centers
        mu = reevaluate_centers(clusters, possiblePoints)
    return (mu, clusters)

K = 3
X = init_board_gauss(30, K)
possiblePoints = possible_points()
results = find_centers(X, K, possiblePoints)

# Show results
# Show constraints and clusters
# List point types
pointtypes1 = ["gx", "gD", "g*"]
plt.plot(
    np.matrix(possiblePoints).transpose()[0], np.matrix(possiblePoints).transpose()[1], 'm.'
)
for i in list(range(0, len(results[0]))):
    plt.plot(
        np.matrix(results[0][i]).transpose()[0], np.matrix(results[0][i]).transpose()[1], pointtypes1[i]
    )
pointtypes = ["bx", "yD", "c*"]
# Show all cluster points
for i in list(range(0, len(results[1]))):
    plt.plot(
        np.matrix(results[1][i]).transpose()[0], np.matrix(results[1][i]).transpose()[1], pointtypes[i]
    )
plt.show()
Edited to minimize total distance.

How to get centroids from SciPy's hierarchical agglomerative clustering?

I am using SciPy's hierarchical agglomerative clustering methods to cluster an m x n matrix of features, but after the clustering is complete I can't seem to figure out how to get the centroids of the resulting clusters. My code follows:
from scipy.spatial import distance
from scipy.cluster import hierarchy

Y = distance.pdist(features)
Z = hierarchy.linkage(Y, method="average", metric="euclidean")
T = hierarchy.fcluster(Z, 100, criterion="maxclust")
I am taking my matrix of features, computing the Euclidean distance between them, and then passing the distances to the hierarchical clustering method. From there, I am creating flat clusters with a maximum of 100 clusters.
Now, based on the flat clusters T, how do I get the 1 x n centroid that represents each flat cluster?
A possible solution is a function that returns a codebook with the centroids, like kmeans in scipy.cluster.vq does. The only things you need are the partition vector with the flat clusters, part, and the original observations X.
import numpy as np

def to_codebook(X, part):
    """
    Calculates centroids according to flat cluster assignment

    Parameters
    ----------
    X : array, (n, d)
        The n original observations with d features
    part : array, (n)
        Partition vector. part[n]=c is the cluster assigned to observation n

    Returns
    -------
    codebook : array, (k, d)
        Returns a k x d codebook with k centroids
    """
    codebook = []
    for i in range(part.min(), part.max() + 1):
        codebook.append(X[part == i].mean(0))
    return np.vstack(codebook)
You can do something like this (D=number of dimensions):
# Sum the vectors in each cluster
lens = {}       # will contain the lengths for each cluster
centroids = {}  # will contain the centroids of each cluster
for idx, clno in enumerate(T):
    centroids.setdefault(clno, np.zeros(D))
    centroids[clno] += features[idx, :]
    lens.setdefault(clno, 0)
    lens[clno] += 1
# Divide by number of observations in each cluster to get the centroid
for clno in centroids:
    centroids[clno] /= float(lens[clno])
This will give you a dictionary with cluster number as the key and the centroid of the specific cluster as the value.
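If you need the centroids as a single array ordered by cluster label rather than as a dictionary, a small usage sketch building on either answer above (note that fcluster labels start at 1):

import numpy as np

# From the dictionary version:
centroid_matrix = np.vstack([centroids[clno] for clno in sorted(centroids)])

# Or directly from the codebook helper above:
# centroid_matrix = to_codebook(features, T)

# Row i of centroid_matrix is the centroid of the i-th smallest cluster label.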

Is there a way to print outliers after finishing K-means in sklearn? Let's say the top 5

Preparing the data:
df = rn.read_sql(sql, conn)
Data = df.as_matrix(['TOT_CLM_GROSS_AMT', 'UNIT_PRICE', 'QUANTITY'])
Applying K-means:
kmeansFinal = KMeans(n_clusters=47, init="k-means++", precompute_distances=True, copy_x=True, max_iter=500, n_init=20).fit(Data)
Then computing the distance of every point to every centroid:
distances = kmeansFinal.transform(Data)
I want to print the values of the first n outliers, let's say n = 5 for now.
After you have computed your distances, choose n and run:
n = 5
# distances[i] holds the distance of point i to each centroid; the distance to
# the point's own (closest) centroid is the minimum of that row.
outliers = [x[0] for x in sorted(enumerate(distances), key=lambda x: x[1].min(), reverse=True)[:n]]
Now outliers holds the indices of the data points in Data that are furthest from their assigned centroids.
for outlier in outliers:
    print(Data[outlier])
