After running a k-means fit with 3 clusters on some vectors, I was able to get the labels for the input data.
KMeans.cluster_centers_ returns the coordinates of the centers, so shouldn't there be some vector corresponding to each of them? How can I find the value at the centroid of these clusters?
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
The array closest will contain the index of the point in X that is closest to each centroid.
Let's say closest comes out as array([0, 8, 5]) for the three clusters. Then X[0] is the point in X closest to centroid 0, X[8] is the point closest to centroid 1, and so on.
Source: https://codedump.io/share/XiME3OAGY5Tm/1/get-nearest-point-to-centroid-scikit-learn
The cluster centre value is the value of the centroid. At the end of k-means clustering, you'll have three individual clusters and three centroids, with each centroid being located at the centre of each cluster. The centroid doesn't necessarily have to coincide with an existing data point.
Sharda neglected to import the metrics module from scikit-learn; see below.
from sklearn.metrics import pairwise_distances_argmin_min
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
or, with the module path spelled out:
import sklearn.metrics
closest, _ = sklearn.metrics.pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
Assuming X is the input data and kmeans has been fitted to it, both options give you an array, closest, in which each element is the index of the point in X closest to the corresponding centroid. Thus, closest[0] is the index of the point closest to the first centroid, and X[closest[0]] is that point.
To answer your first question, k-means starts each centroid somewhere in the feature space (depending on the initialisation scheme) and then iteratively adjusts them all to be the best representatives of the data. The centroids will not necessarily end up coinciding with any of the original data points. This contrasts with the Affinity Propagation clustering algorithm, which picks an exemplar data point as the representative for each cluster rather than just a point in the same space.
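Putting both answers together, a minimal runnable sketch with made-up data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

X = np.random.rand(100, 4)                       # 100 made-up 4-dimensional vectors
kmeans = KMeans(n_clusters=3).fit(X)

print(kmeans.cluster_centers_)                   # the centroid vectors themselves
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
print(X[closest[0]])                             # actual data point nearest the first centroid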
Related
I have a 50 by 50 grid of evenly spaced (x, y) points. Each of these points has a third scalar value. This can be visualized using a contour plot, which I have added. I am interested in the regions indicated by the red circles. These regions of low "Z-values" are what I want to extract from this data.
2D contour plot of 50 x 50 evenly spaced grid points:
I want to do this by using clustering (machine learning), which can be lightning quick when applied correctly. The problem is, however, that the points are evenly spaced together and therefore the density of the entire dataset is equal everywhere.
I have tried using a DBSCAN algorithm with a custom distance metric which takes into account the Z values of each point. I have defined the distance between two points as follows:
import numpy as np

def custom_distance(point1, point2):
    # each point is (x, y, z): take the planar Euclidean distance
    # and weight it by the mean of the two Z values
    average_Z = (point1[2] + point2[2]) / 2
    distance = np.sqrt(np.square(point1[0] - point2[0]) + np.square(point1[1] - point2[1]))
    return distance * average_Z
This essentially takes the Euclidean distance between the two points in the x-y plane and multiplies it by the average of their two Z values. In the picture below I have tested this distance function in a DBSCAN algorithm. Every point in this 50 by 50 grid has a Z value of 1, except for four clusters that I have placed at random, whose points each have a Z value of 10. As can be seen below, the algorithm is able to find these clusters in the data based on their Z value.
DBSCAN clustering result using scalar value distance determination:
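For reference, a minimal sketch of how such a callable metric can be passed to scikit-learn's DBSCAN; the eps and min_samples values here are illustrative guesses, not the ones used for the figure:

import numpy as np
from sklearn.cluster import DBSCAN

# build the 50 x 50 grid as (x, y, z) rows, with a flat Z field as a placeholder
xx, yy = np.meshgrid(np.arange(50), np.arange(50))
z = np.ones((50, 50))
points = np.column_stack([xx.ravel(), yy.ravel(), z.ravel()])

# custom_distance is the function defined above
db = DBSCAN(eps=2.0, min_samples=5, metric=custom_distance).fit(points)
labels = db.labels_    # one cluster label per grid point, -1 marks noise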
Encouraged by these results, I tried to apply the same approach to my actual data, only to be disappointed by the outcome. Since the x and y values of my data are very large, I simply rescaled them to the range 0 to 49. The Z values I left untouched. The results of the clustering can be seen in the image below:
Clustering result on original data:
This does not come close to what I want or what I was expecting. For some reason the clusters that are found are rectangular, and the light regions of low Z values that I am interested in are not extracted with this approach.
Is there any way I can make the DBSCAN algorithm work here? I suspect that it is currently not working because of the difference in scale between the x, y and Z values. I am also open to tips or recommendations on other approaches for defining and finding the lighter regions in the data.
I used the code below to output the results of the clustering. What do the values under "Cluster Centers" mean, and how should I interpret this data?
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4).fit(df)
print("Number of clusters: ", kmeans.n_clusters)
print("-"*70)
print("Cluster Centers: ", '\n', kmeans.cluster_centers_)
Number of clusters: 4
----------------------------------------------------------------------
Cluster Centers:
[[4.10000000e+02 9.92833333e+03 3.42200000e+03 3.73333333e+00
2.32433333e+03 1.36733333e+03 1.31600000e+03 5.16666667e+01
9.57000000e+02]
[4.55000000e+01 3.41650000e+03 1.42100000e+04 3.70000000e+00
5.95000000e+02 3.60000000e+02 3.46500000e+02 1.35000000e+01
2.34500000e+02]
[3.41666667e+01 1.14600000e+03 3.33358333e+03 3.69166667e+00
7.02500000e+02 4.14583333e+02 3.99166667e+02 1.53333333e+01
2.87916667e+02]
[5.14000000e+02 2.48310000e+04 5.78750000e+03 3.75000000e+00
1.75350000e+03 1.05200000e+03 1.01200000e+03 3.95000000e+01
7.02000000e+02]]
It means that you have four clusters, and the given vectors are the centers of those clusters.
So, for a new point, you can check which centroid is closest and assign the new point to that cluster.
For example, in a plot of the four clusters above, an X could mark each cluster's centroid, and a new point would be classified according to whichever centroid is nearest.
You can also compute a quality measure for the clustering yourself; see, for example: Silhouette - Wikipedia
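A minimal sketch of both ideas, assuming kmeans and df are the fitted model and data from the question (the new point here is made up):

import numpy as np
from sklearn.metrics import silhouette_score

# assign a new 9-dimensional point to the nearest centroid
new_point = np.zeros((1, 9))                 # made-up point, purely for illustration
cluster_id = kmeans.predict(new_point)[0]

# overall clustering quality (ranges from -1 to 1, higher is better)
score = silhouette_score(df, kmeans.labels_)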
Your code asks the KMeans algorithm to find four clusters (see the docs), and as expected you obtain 4 clusters. From kmeans.cluster_centers_ we can tell that your space is 9-dimensional (9 coordinates for each point), because the cluster centroids are 9-dimensional.
The centroids are the means of all points within a cluster. This doc is a good introduction for getting an intuitive understanding of the k-means algorithm.
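For example, a quick check (a sketch, assuming df is the DataFrame that was fitted) that each centroid is simply the mean of the points assigned to its cluster:

import numpy as np

labels = kmeans.labels_
# at convergence, the mean of the points in cluster 0 matches the first centroid
print(np.allclose(df[labels == 0].mean(axis=0), kmeans.cluster_centers_[0]))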
I have a dataframe with almost 4000000 entries. Based on 3 features I want to find the distance between each point and its 1000th nearest neighbor. So far I've tried it like this:
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=1000)
nbrs = neigh.fit(df[features])
distances, indices = nbrs.kneighbors(df[features])
Afterwards I would slice the distances array to get an array with just the distance to the 1000th nearest neighbor for each entry, because that's the only one I care about. However, I don't get that far, because I don't have enough memory for an array of shape (4000000, 1000).
Is there a way where I can save just the distance to the 1000th neighbor and discard all other 999?
Background is that I'm trying to find a good fit for epsilon to run a DBSCAN algorithm, but apparently my data points are too close to each other. I've already tried the code above for 5 and 100 neighbors; however, aside from some outliers, the distance is pretty much 0.
Quantiles for distances to the 100th neighbor
You may wish to try:
import numpy as np
from sklearn.neighbors import KDTree

x = np.random.randn(4000000, 3)
kdt = KDTree(x)

closest_1000th = []
for i in range(x.shape[0]):
    # query the 1000 nearest neighbours of point i and keep only the furthest of them
    dist, _ = kdt.query(x[i, :].reshape(1, -1), 1000)
    closest_1000th.append(dist[0, -1])
On my 4Gb RAM laptop it took about 1hr to complete this task.
Hat tip #bogovicj.
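Since the goal is to pick an epsilon for DBSCAN, one common way to use these values is a sorted k-distance ("elbow") plot; a minimal sketch, assuming closest_1000th from the loop above and matplotlib:

import numpy as np
import matplotlib.pyplot as plt

dists = np.sort(np.array(closest_1000th))    # distance to the 1000th neighbour, ascending
plt.plot(dists)
plt.xlabel("points sorted by distance")
plt.ylabel("distance to 1000th nearest neighbour")
plt.show()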
I am using scikit-learn to run k-means. I looked at the scikit-learn k-means code, but I don't understand how k-means pre-computes distances in advance. Which distances does k-means pre-compute while it doesn't yet know the values of the centers?
It does not pre-compute the distances between the centers; it pre-computes the distances between a point, say X, and all other points in the system, and stores them for later use.
Check line 619 in the k-means source, which calls _labels_inertia_precompute_dense, which in turn calls pairwise_distances_argmin_min at line 562.
The documentation of pairwise_distances_argmin_min states that
Compute minimum distances between one point and a set of points. This function computes for each row in X, the index of the row of Y which is closest (according to the specified distance). The minimal distances are also returned.
So it does not need to know the centres; this is just used to pre-compute the distances between all possible pairs of points.
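As a small illustration of what pairwise_distances_argmin_min returns (toy data, not taken from the k-means source):

import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

X = np.array([[0.0, 0.0], [10.0, 10.0]])             # query points
Y = np.array([[1.0, 1.0], [9.0, 9.0], [5.0, 5.0]])   # candidate points
argmin, mindist = pairwise_distances_argmin_min(X, Y)
# argmin  -> array([0, 1]): for each row of X, the index of the closest row of Y
# mindist -> the corresponding minimal Euclidean distances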
I'm trying to use KMeans centroids to label/clump pixels for a land cover analysis. I'm hoping to do this only using sklearn and matplotlib. At the moment my code looks like this:
kmeans.fit(band_5)
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1])
The shape of band_5 is (713, 1163), yet from the scatter plot I can tell that the centroid coordinates have values well in excess of that shape.
From my understanding, the centroids that KMeans provides need to be converted into the correct coordinates and then a shapefile, which would then be used in a supervised process to label/clump pixels.
How do I convert those centroids to the correct coordinates and then export to a shapefile? Also, do I need to create a shapefile?
I tried to adapt some of the code from this post, but I could not get it to work. http://scikit-learn.org/stable/auto_examples/cluster/plot_color_quantization.html#sphx-glr-auto-examples-cluster-plot-color-quantization-py
A couple of points:
scikit-learn expects data in columns (think of a table in a spreadsheet), so simply passing in an array representing a raster band will actually try to classify the data as if you had 713 sample points with 1163 values (bands) each. Instead you'll need to flatten the array; what k-means then returns will be equivalent to a quantile classification of your raster if you look at it in something like ArcGIS, with centroids in the range of the band's minimum to maximum value (not in cell coordinates).
Looking at the example you provide, they have a three-band JPEG, which they reshape into three long columns:
image_array = np.reshape(china, (w * h, d))
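In the single-band case the same idea applies; a rough sketch (the number of clusters is just an illustrative choice):

import numpy as np
from sklearn.cluster import KMeans

# band_5 has shape (713, 1163): reshape to one column so each pixel is a sample
pixels = band_5.reshape(-1, 1)
kmeans = KMeans(n_clusters=5).fit(pixels)
labels = kmeans.labels_.reshape(band_5.shape)   # back to raster shape for plotting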
If you need the pixels to be spatially constrained then you have two choices: choose a connectivity-constrained clustering method such as Agglomerative Clustering or Affinity Propagation, or look at adding the normalised cell coordinates to your sample set, e.g.:
import numpy as np

xs, ys = np.meshgrid(
    np.linspace(0, 1, 1163),  # x
    np.linspace(0, 1, 713),   # y
)
data_with_coordinates = np.column_stack([
    band_5.flatten(),
    xs.flatten(),
    ys.flatten()
])
# And on with the clustering
Once you've done the clustering with scikit-learn (assuming you use fit_predict), you'll get a label back for each value, and you can reshape the labels back to the original shape of the band to plot the clustered results.
labels = classifier.fit_predict(data_with_coordinates)
plt.imshow(labels.reshape(band_5.shape))
Do you actually need the cluster centroids, given that you have labelled points? And do you need them in real-world spatial coordinates? If so, you need to look at rasterio and its affine transforms to convert between map coordinates and array coordinates, and then look into fiona to write the points to a shapefile.
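A rough sketch of that last step, with placeholder paths and array indices (the raster filename, the sample row/column indices, and the output name are all made up here):

import rasterio
import rasterio.transform
import fiona

with rasterio.open("band_5.tif") as src:                 # hypothetical raster path
    # convert (row, col) array indices to (x, y) map coordinates
    map_xs, map_ys = rasterio.transform.xy(src.transform, [10, 20], [30, 40])
    crs = src.crs.to_dict()

schema = {"geometry": "Point", "properties": {"label": "int"}}
with fiona.open("points.shp", "w", driver="ESRI Shapefile",
                crs=crs, schema=schema) as dst:
    for i, (x, y) in enumerate(zip(map_xs, map_ys)):
        dst.write({
            "geometry": {"type": "Point", "coordinates": (x, y)},
            "properties": {"label": i},
        })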