spatial data clustering with sklearn - python

I have arrays of latitude and longitude data points on which I want to do hierarchical clustering. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

position = list(zip(longitude, latitude))
X = np.asarray(position)
# haversine here is meant to be my own distance function
knn_graph = kneighbors_graph(X, 30, include_self=False, metric=haversine)

for connectivity in (None, knn_graph):
    for n_clusters in (5, 8, 10, 15, 20):
        plt.figure(figsize=(4, 5))
        for index, linkage in enumerate(('average', 'complete', 'ward')):
            model = AgglomerativeClustering(linkage=linkage,
                                            connectivity=connectivity,
                                            n_clusters=n_clusters)
            model.fit(X)
            plt.scatter(X[:, 0], X[:, 1], c=model.labels_,
                        cmap=plt.cm.spectral)
            plt.title('linkage=%s (n_clusters=%s)' % (linkage, n_clusters),
                      fontdict=dict(verticalalignment='top'))
            plt.axis([37.1, 37.9, -122.6, -121.6])
            plt.show()
The problem is that kneighbors_graph has a parameter called metric, which defines how the distance is computed (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.kneighbors_graph.html). I want to define my own (the real distance with respect to longitude, latitude and the Earth's radius), but it seems I cannot plug in my own function. Any ideas?

Note that
the metric parameter usually expects a string (e.g. "haversine")
you use a distance in two places: in the knn graph and as the affinity for the clustering
hierarchical clustering involves two kinds of distances, and thus two distance parameters. One is the distance between objects (e.g. haversine), the other is the distance between clusters, which is usually derived from the object distance by aggregation (e.g. maximum, minimum). Both are often called "distance". In sklearn, the first is called affinity.
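A minimal sketch of both routes, assuming latitude and longitude are 1-D arrays in degrees (variable names are mine): sklearn's built-in 'haversine' metric expects points as [latitude, longitude] in radians and returns great-circle distances in radians (multiply by the Earth's radius, roughly 6371 km, to get kilometres), and kneighbors_graph also accepts a custom callable if you prefer your own formula.

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

# built-in haversine metric: expects [lat, lon] in radians
X_rad = np.radians(np.column_stack([latitude, longitude]))
knn_graph = kneighbors_graph(X_rad, 30, include_self=False, metric='haversine')

# or plug in your own callable, taking two 1-D points and returning a distance (slower)
def haversine_km(p, q, r=6371.0):
    (lat1, lon1), (lat2, lon2) = p, q
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

knn_graph_custom = kneighbors_graph(X_rad, 30, include_self=False, metric=haversine_km)

# the connectivity graph is where the haversine distance enters; the cluster
# distance ('linkage') and the object distance used for merging ('affinity',
# Euclidean by default) are configured on the estimator itself
model = AgglomerativeClustering(linkage='average', connectivity=knn_graph,
                                n_clusters=10).fit(X_rad)

Converting to radians and ordering the columns as (lat, lon) are the two things the built-in metric is picky about; everything else matches the setup in the question.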

Related

Do KMeans and creating a dendrogram produce the same labels?

I am using some data to generate labels so that I can sort my data for use in a supervised learning setting. I have been generating a dendrogram to visualize how the data clusters, but when I use KMeans to create the labels, only a few of the labels match the cluster shown in the dendrogram.
code:
import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram

combined_array = pd.read_pickle('arrays.pickle')

model = KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
               n_clusters=7, n_init=10, n_jobs=1, precompute_distances='auto',
               random_state=1, tol=0.0001, verbose=0)
model.fit(combined_array)
labels = model.predict(combined_array)
pd.DataFrame(labels).to_csv("arrays_labels.csv")

mergings = linkage(combined_array, method='ward')
dendrogram(mergings, leaf_rotation=0, leaf_font_size=14, show_contracted=True)
The image above shows a section of what files should be in that cluster but when I use kmeans to generate labels only files 28, 33, 41, 45, 70 are included. So why aren't 13, 42, 67, 81 showing up in my labels? Do KMeans and dendrogram create different types of clustering?
I don't really see how your code connects to what you are asking, but yes, they're totally different!
A dendrogram is produced by applying hierarchical clustering, which is very simple and deterministic (apply it twice and you'll get the same result).
It works in this way:
1) Compute the distances between points
2) Select the minimum distance
3) Merge the two points with the minimum distance into a cluster
4) Go to 1 until you get one cluster containing all elements
There are a lot of details omitted but this is the core.
As you can see, it's based on distances between points and does not tell you which cluster configuration is best; there are separate techniques to select the number of clusters.
K-means needs to know the number of clusters in advance (note that you specify n_clusters in your code).
It works like this:
1) Randomly initialize n Centroids (center of mass of a cluster)
2) Assign each point to its closest centroid
3) Re-compute center of mass of the clusters created
4) Go to 2 until convergence
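A bare-bones NumPy sketch of those four steps (illustrative only; this is not the scikit-learn implementation, and it assumes no cluster ever ends up empty):

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1) random init
    for _ in range(n_iter):
        # 2) assign each point to its closest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # 3) recompute the center of mass of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # 4) stop at convergence
            break
        centroids = new_centroids
    return centroids, labels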
So - if I'm right - what you are trying to do is to generate labels from a clustering algorithm to then fit a supervised model.
So what you are looking for is simply clustering model selection.
To select the best number of clusters and the best algorithm there are a lot of techniques, and the choice depends heavily on your problem and your data (take a good look at the scikit-learn documentation before doing any kind of clustering).
If you want a general approach, try to look at this library which can select the best results among the ones you provide.
PS: An approach that generally works well is silhouette analysis.
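For example, a quick sketch of silhouette-based selection of n_clusters, reusing combined_array from the question (the candidate range 2-10 is arbitrary):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(2, 11):                                   # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(combined_array)
    score = silhouette_score(combined_array, labels)     # higher is better, in [-1, 1]
    if score > best_score:
        best_k, best_score = k, score

print("best k by silhouette:", best_k, "score:", best_score)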

Find distance between centroid and points in a single feature dataframe - KMeans

I'm working on an anomaly detection task using KMeans.
The Pandas dataframe I'm using has a single feature and looks like the following:
df = array([[12534.],
            [12014.],
            [12158.],
            [11935.],
            ...,
            [ 5120.],
            [ 4828.],
            [ 4443.]])
I'm able to fit and to predict values with the following instructions:
km = KMeans(n_clusters=2)
km.fit(df)
km.predict(df)
In order to identify anomalies, I would like to calculate the distance between each centroid and each single point, but with a single-feature dataframe I'm not sure this is the correct approach.
I found examples that use the Euclidean distance to calculate this. For example:
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2)
                 for (x, y) in data[cluster_labels == i_centroid]]
    return distances

centroids = self.km.cluster_centers_
distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(day_df, cx, cy, i, clusters)
    distances.append({'x': cx, 'y': cy, 'distance': mean_distance})
This code doesn't work for me because, since I have a single-feature dataframe, my centroids look like this:
array([[11899.90692187],
       [ 5406.54143126]])
In this case, what is the correct approach to find the distance between centroid and points? Is it possible?
Thank you, and sorry for the trivial question; I'm still learning.
There's scipy.spatial.distance_matrix you can make use of:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# set up a set of 2d points
np.random.seed(2)
df = np.random.uniform(0, 1, (100, 2))
# make it a dataframe
df = pd.DataFrame(df)

# clustering with 3 clusters
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)
km.fit(df)
preds = km.predict(df)

# get centroids
centroids = km.cluster_centers_

# visualize
plt.scatter(df[0], df[1], c=preds)
plt.scatter(centroids[:, 0], centroids[:, 1], c=range(centroids.shape[0]), s=1000)
This gives a scatter plot of the points colored by predicted cluster, with the centroids overlaid.
Now the distance matrix:
from scipy.spatial import distance_matrix
dist_mat = pd.DataFrame(distance_matrix(df.values, centroids))
You can confirm that this is correct by
dist_mat.idxmin(axis=1) == preds
And finally, the mean distance to centroids:
dist_mat.groupby(preds).mean()
gives:
          0         1         2
0  0.243367  0.525194  0.571674
1  0.525350  0.228947  0.575169
2  0.560297  0.573860  0.197556
where the columns denote the centroid index and the rows denote the cluster the points belong to; each entry is the mean distance of that cluster's points to that centroid.
You can use scipy.spatial.distance.cdist to create a distance matrix:
from scipy.spatial.distance import cdist
dm = cdist(df, centroids)
This should give you a 2-d array, where each row represents an observation in your original dataset and each column represents a centroid. The entry in the x-th row and y-th column is the distance between your x-th observation and your y-th cluster centroid. cdist uses the Euclidean distance by default, but you can use other metrics (not that it matters much for a dataset with only one feature).
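For the anomaly-detection use case in the question, here is a short sketch of going from that distance matrix to each point's distance to its own centroid, reusing df and km from above (the 1% cut-off is an arbitrary example, not a recommendation):

import numpy as np
from scipy.spatial.distance import cdist

dm = cdist(df, km.cluster_centers_)                  # shape: (n_samples, n_clusters)
labels = km.predict(df)

# distance from each point to the centroid of its own cluster
dist_to_own_centroid = dm[np.arange(len(df)), labels]

# flag the most distant points as candidate anomalies
threshold = np.quantile(dist_to_own_centroid, 0.99)
anomaly_mask = dist_to_own_centroid > threshold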

Fit mixture of Gaussians with fixed covariance in Python

I have some 2D data (GPS data) with clusters (stop locations) that I know resemble Gaussians with a characteristic standard deviation (proportional to the inherent noise of GPS samples). The figure below visualizes a sample that I expect has two such clusters. The image is 25 meters wide and 13 meters tall.
The sklearn module has a function sklearn.mixture.GaussianMixture which allows you to fit a mixture of Gaussians to data. The function has a parameter, covariance_type, that enables you to assume different things about the shape of the Gaussians. You can, for example, assume them to be uniform using the 'tied' argument.
However, it does not appear directly possible to assume the covariance matrices to remain constant. From the sklearn source code it seems trivial to make a modification that enables this but it feels a bit excessive to make a pull request with an update that allows this (also I don't want to accidentally add bugs in sklearn). Is there a better way to fit a mixture to data where the covariance matrix of each Gaussian is fixed?
I want to assume that the SD should remain constant at around 3 meters for each component, since that is roughly the noise level of my GPS samples.
It is simple enough to write your own implementation of the EM algorithm, and doing so also gives you a good intuition of the process. Below I assume that the covariance is known and that the prior probabilities of the components are equal, and fit only the means.
The class would look like this (in Python 3):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal


class FixedCovMixture:
    """ The model to estimate gaussian mixture with fixed covariance matrix. """
    def __init__(self, n_components, cov, max_iter=100, random_state=None, tol=1e-10):
        self.n_components = n_components
        self.cov = cov
        self.random_state = random_state
        self.max_iter = max_iter
        self.tol = tol

    def fit(self, X):
        # initialize the process:
        np.random.seed(self.random_state)
        n_obs, n_features = X.shape
        self.mean_ = X[np.random.choice(n_obs, size=self.n_components)]
        # make EM loop until convergence
        i = 0
        for i in range(self.max_iter):
            new_centers = self.updated_centers(X)
            if np.sum(np.abs(new_centers - self.mean_)) < self.tol:
                break
            else:
                self.mean_ = new_centers
        self.n_iter_ = i

    def updated_centers(self, X):
        """ A single iteration """
        # E-step: estimate probability of each cluster given cluster centers
        cluster_posterior = self.predict_proba(X)
        # M-step: update cluster centers as weighted average of observations
        weights = (cluster_posterior.T / cluster_posterior.sum(axis=1)).T
        new_centers = np.dot(weights, X)
        return new_centers

    def predict_proba(self, X):
        likelihood = np.stack([multivariate_normal.pdf(X, mean=center, cov=self.cov)
                               for center in self.mean_])
        cluster_posterior = (likelihood / likelihood.sum(axis=0))
        return cluster_posterior

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=0)
On data like yours, the model converges quickly:
np.random.seed(1)
X = np.random.normal(size=(100, 2), scale=3)
X[50:] += (10, 5)

model = FixedCovMixture(2, cov=[[3, 0], [0, 3]], random_state=1)
model.fit(X)
print(model.n_iter_, 'iterations')
print(model.mean_)

plt.scatter(X[:, 0], X[:, 1], s=10, c=model.predict(X))
plt.scatter(model.mean_[:, 0], model.mean_[:, 1], s=100, c='k')
plt.axis('equal')
plt.show();
and output
11 iterations
[[9.92301067 4.62282807]
 [0.09413883 0.03527411]]
You can see that the estimated centers ((9.9, 4.6) and (0.09, 0.03)) are close to the true centers ((10, 5) and (0, 0)).
I think the best option would be to "roll your own" GMM model by defining a new scikit-learn class that inherits from GaussianMixture and overrides the methods to get the behavior you want. That way you have your own implementation and you don't have to change the scikit-learn code (or create a pull request).
Another option that might work is to look at the Bayesian version of GMM in scikit-learn. You might be able to set the prior for the covariance matrix so that the covariance is fixed. It seems to use the Wishart distribution as a prior for the covariance. However I'm not familiar enough with this distribution to help you out more.
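For what it's worth, a rough, hedged sketch of that idea: covariance_prior and degrees_of_freedom_prior are real BayesianGaussianMixture parameters, but a prior only pulls the fitted covariances toward the given value rather than fixing them exactly, so treat this as something to experiment with and verify against the fitted covariances_.

from sklearn.mixture import BayesianGaussianMixture

sd = 3.0  # assumed GPS noise level in metres, from the question
bgm = BayesianGaussianMixture(
    n_components=2,
    covariance_type='full',
    covariance_prior=[[sd**2, 0.0], [0.0, sd**2]],
    degrees_of_freedom_prior=1e4,   # larger values pull the estimates harder toward the prior
    random_state=0,
)
bgm.fit(X)
print(bgm.covariances_)             # check how close these stay to sd**2 * identity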
First, you can use the 'spherical' option, which gives you a single variance value per component. This way you can check yourself: if the estimated variances differ too much from each other (or from your expected noise level), something went wrong.
In case you want to preset the variance, your problem reduces to finding only the best centers for your components. You can do that with k-means, for example. If you don't know the number of components, you can sweep over all plausible values (say 1 to 20) and evaluate the decrease in fitting error, or you can optimize your own EM function to find the centers and the number of components simultaneously.
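A short sketch of both suggestions, assuming X holds the 2-D GPS points from the question:

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# sanity check: one variance per component; they should all sit near 3**2
gmm = GaussianMixture(n_components=2, covariance_type='spherical', random_state=0).fit(X)
print(gmm.covariances_)

# with the variance preset, only the centres remain to be estimated;
# k-means gives a reasonable approximation of them
centers = KMeans(n_clusters=2, random_state=0).fit(X).cluster_centers_
print(centers)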

Python: Converting KMeans Centroids to Shapefile for Pixel Classification in Land Cover Analysis

I'm trying to use KMeans centroids to label/clump pixels for a land cover analysis. I'm hoping to do this only using sklearn and matplotlib. At the moment my code looks like this:
kmeans.fit(band_5)
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1])
The shape of band_5 is (713, 1163), yet from the scatter plot I can tell that the centroid coordinates have values well in excess of that shape.
From my understanding, the centroids that KMeans provides need to be converted into the correct coordinates and then a shapefile, which would then be used in a supervised process to label/clump pixels.
How do I convert those centroids to the correct coordinates and then export to a shapefile? Also, do I need to create a shapefile?
I tried to adapt some of the code from this post, but I could not get it to work: http://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html#sphx-glr-auto-examples-cluster-plot-color-quantization-py
A couple of points:
scikit-learn expects data in columns (think of a table in a spreadsheet), so simply passing in an array representing a raster band will actually try to classify the data as if you had 1163 sample points and 713 values (bands) for each sample. Instead you'll need to flatten the array, and what kmeans returns will then be equivalent to a quantile classification of your raster if you view it in something like ArcGIS, with centroids in the range from the band's minimum value to its maximum value (not in cell coordinates). A short sketch of this single-band case is shown after the code below.
Looking at the example you provide, they have a three-band jpeg, which they reshape into three long columns:
image_array = np.reshape(china, (w * h, d))
If you need spatially constrained pixels, then you have two choices: choose a connectivity-constrained clustering method (such as Agglomerative Clustering or Affinity Propagation), or look at adding the normalised cell coordinates to your sample set, e.g.:
xs, ys = np.meshgrid(
    np.linspace(0, 1, 1163),  # x
    np.linspace(0, 1, 713),   # y
)
data_with_coordinates = np.column_stack([
    band_5.flatten(),
    xs.flatten(),
    ys.flatten()
])
# And on with the clustering
Once you've done the clustering with scikit-learn (assuming you use fit_predict), you'll get a label back for each value, and you can reshape the labels back to the original shape of the band to plot the clustered results.
labels = classifier.fit_predict(data_with_coordinates)
plt.imshow(labels.reshape(band_5.shape))
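To tie this back to the first point, here is a minimal sketch of the single-band case from the question (the choice of 5 clusters is arbitrary):

import numpy as np
from sklearn.cluster import KMeans

samples = band_5.reshape(-1, 1)               # each pixel becomes one 1-feature sample
kmeans = KMeans(n_clusters=5, random_state=0)
labels = kmeans.fit_predict(samples)

print(kmeans.cluster_centers_.ravel())        # centroids are band values, not cell coordinates
classified = labels.reshape(band_5.shape)     # back to raster shape for plotting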
Do you actually need the cluster centroids, given that you have labelled points? And do you need them in real-world spatial coordinates? If yes, then you need to look at rasterio and its affine transforms to convert between map coordinates and array coordinates, and then look into fiona to write the points to a shapefile.

Spectral Clustering Scikit learn print items in Cluster

I know I can get the contents of a particular cluster in K-means clustering with the following code using scikit-learn.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
How do I do the same for spectral clustering, since there is no 'cluster_centers_' attribute for spectral clustering? I am trying to cluster terms in text documents.
UPDATED:
Sorry, I did not understand your question correctly the first time.
I think it's impossible to do exactly what you want with spectral clustering, because the spectral clustering method doesn't compute any centers by itself; it doesn't need them at all. It doesn't even operate on the sample points in the raw space: spectral clustering transforms your dataset into a different subspace and then tries to cluster the points there. And I don't know how to invert this transformation mathematically.
A Tutorial on Spectral Clustering
Maybe you should ask your question in a more theoretical form on the math-related Stack Exchange communities.
spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack',
                                      affinity="nearest_neighbors")
spectral.fit(X)
y_pred = spectral.labels_.astype(int)
From here
Spectral clustering does not compute any centroids. In a more practical context, if you really need a kind of 'centroids' derived from the spectral clustering algorithm, you can always compute the average (mean) of the points belonging to the same cluster after the clustering process has finished. These would be an approximation of the centroids defined in the context of the typical k-means algorithm. The same principle also applies to other clustering algorithms that do not produce centroids (e.g. hierarchical).
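As a sketch of that idea for the term-clustering case in the question (assuming X is the document-term matrix passed to the clusterer, spectral the fitted SpectralClustering model, and vectorizer the fitted vectorizer; these names are taken from the question, not a fixed API):

import numpy as np

labels = spectral.labels_
terms = vectorizer.get_feature_names_out()    # use get_feature_names() on older sklearn

for i in range(spectral.n_clusters):
    # pseudo-centroid: mean of the rows assigned to cluster i
    centroid = np.asarray(X[labels == i].mean(axis=0)).ravel()
    top = [terms[ind] for ind in centroid.argsort()[::-1][:10]]
    print("Cluster %d: %s" % (i, " ".join(top)))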
While it's true that you can't get the cluster centers for spectral clustering, you can do something close that might be useful in some cases. To explain, I'll run through the spectral clustering algorithm quickly and explain the modification.
First, let's call our dataset X = {x_1, ..., x_N}, where each point is d-dimensional (d is the number of features you have in your dataset). We can think of X as an N by d matrix. Let's say we want to put this data into k clusters. Spectral clustering first transforms the dataset into another representation and then uses K-means clustering on the new representation of the data to obtain clusters. First, the affinity matrix A is formed using k-neighbors information. For this, we need to choose a positive integer n to construct A. The element A_{i, j} is equal to 1 if x_i and x_j are both in the list of the top n neighbors of each other, and A_{i, j} is equal to 0 otherwise. A is a symmetric N by N matrix. Next, we construct the normalized Laplacian matrix L of A, which is L = I - D^{-1/2}AD^{-1/2}, where D is the degree matrix. Then the eigenvalue decomposition is performed on L to get L = VEV^{-1}, where V is the matrix of eigenvectors of L, and E is the diagonal matrix with the eigenvalues of L on the diagonal. Since L is positive semi-definite, its eigenvalues are all non-negative real numbers. For spectral clustering, we use this to order the columns of V so that the first column of V corresponds to the smallest eigenvalue of L, and the last column to the largest eigenvalue of L.
Next, we take the first k columns of V and view them as N points in k-dimensional space. Let's write this truncated matrix as V', and write its rows as {v'_1, ..., v'_N}, where each v'_i is k-dimensional. Then we use the K-means algorithm to cluster these points into k clusters, {C'_1, ..., C'_k}. The clusters are then assigned to the points in the dataset X by "pulling back" the clusters from V' to X: the point x_i is in cluster C_j if and only if v'_i is in cluster C'_j.
Now, one of the main points of transforming X into V' and clustering on that representation is that often X is not spherically distributed, and V' at least comes closer to being so. Since V' is closer to being spherically distributed, the centroid will be "inside" the cluster of points it defines. We can take the point in V' that is closest to the cluster centroid for each cluster. Let's call the cluster centroids {c_1,...,c_k}. These are points in the parameter space that V' is represented in. Then for each cluster, choose the point of V' that is closest to the cluster's centroid to get k points of V'. Let's say {v'_i_1,...,v'_i_k} are the representative points closest to the cluster centroids of V'. Then choose {x_i_1,...,x_i_k} as the cluster representatives for the clusters of X.
This method might not always work how you might want, but it's at least a way to get closer to what you're wanting, and maybe you can modify it to get closer to what you need. Here's some example code to show how to do this.
Let's use some fake data provided by scikit-learn.
import numpy as np
import pandas as pd
from sklearn.datasets import make_moons

moons_data = make_moons(n_samples=1000, noise=0.07, random_state=0)
moons = pd.DataFrame(data=moons_data[0], columns=['x', 'y'])
moons['label_truth'] = moons_data[1]

moons.plot(
    kind='scatter',
    x='x',
    y='y',
    figsize=(8, 8),
    s=10,
    alpha=0.7
);
I'm going to kind of cheat and use the spectral clustering method provided by scikit-learn, and then extract the affinity matrix from there.
from sklearn.cluster import SpectralClustering
sclust = SpectralClustering(
n_clusters=2,
random_state=42,
affinity='nearest_neighbors',
n_neighbors=10,
assign_labels='kmeans'
)
sclust.fit(moons[['x', 'y']]);
moons['label_cluster'] = sclust.labels_
moons.plot(
kind='scatter',
x='x',
y='y',
figsize=(16, 14),
s=10,
alpha=0.7,
c='label_cluster',
cmap='Spectral'
);
Next, we'll compute the normalized Laplacian of the affinity matrix, and instead of computing the whole eigenvalue decomposition of the Laplacian, we use the scipy function eigsh to extract the two eigenvectors corresponding to the two smallest eigenvalues (since we want two clusters).
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

affinity_matrix = sclust.affinity_matrix_
lpn = laplacian(affinity_matrix, normed=True)
w, v = eigsh(lpn, k=2, which='SM')

pd.DataFrame(v).plot(
    kind='scatter',
    x=0,
    y=1,
    figsize=(16, 16),
    s=10,
    alpha=0.7
);
Then let's use K-means to cluster on this new representation of the data. Let's also find the two points in this new representation that are closest to the cluster centroids, and highlight them.
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean
import matplotlib.pyplot as plt

kmeans = KMeans(
    n_clusters=2,
    random_state=42
)
kmeans.fit(v)
center_0, center_1 = kmeans.cluster_centers_

representative_index_0 = np.argmin(np.array([euclidean(a, center_0) for a in v]))
representative_index_1 = np.argmin(np.array([euclidean(a, center_1) for a in v]))

fig, ax = plt.subplots(figsize=(16, 16));
pd.DataFrame(v).plot(
    kind='scatter',
    x=0,
    y=1,
    ax=ax,
    s=10,
    alpha=0.7);
pd.DataFrame(v).iloc[[representative_index_0, representative_index_1]].plot(
    kind='scatter',
    x=0,
    y=1,
    ax=ax,
    s=100,
    alpha=0.9,
    c='orange',
)
And finally, let's plot the original dataset with the corresponding points highlighted.
moons['labels_lpn_kmeans'] = kmeans.labels_

fig, ax = plt.subplots(figsize=(16, 14));
moons.plot(
    kind='scatter',
    x='x',
    y='y',
    ax=ax,
    s=10,
    alpha=0.7,
    c='labels_lpn_kmeans',
    cmap='Spectral'
);
moons.iloc[[representative_index_0, representative_index_1]].plot(
    kind='scatter',
    x='x',
    y='y',
    ax=ax,
    s=100,
    alpha=0.9,
    c='orange',
);
As we can see, the highlighted points are maybe not where we might expect them to be, but this might be useful to give some way of algorithmically choosing points from each cluster.
