Simple 2-D Clustering Algorithm in Python

Being new to unsupervised methods, I need a push in the right direction with some fairly simple code to run through some data as a case study. The data I'm working on only has 300 or so observations, but I want to learn how to apply clustering to very large sets that behave similarly.
I have a two-feature data set and I'd like to run DBSCAN or something similar using Euclidean distances (if this is the correct clustering approach).
As an example, the data looks like this: [scatter plot of the two features]
I can tell just by eye that clustering this way might not be the best method, as the distribution looks irregular.
What method should I use to begin understanding distributions like these, especially when the set is very large (hundreds of thousands of observations)?

For most machine learning tasks, scikit-learn is your friend here. For DBSCAN, scikit-learn has sklearn.cluster.DBSCAN. From the scikit-learn docs:
>>> from sklearn.cluster import DBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(X)
>>> clustering.labels_
array([ 0,  0,  0,  1,  1, -1])
>>> clustering
DBSCAN(algorithm='auto', eps=3, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=2, n_jobs=None, p=None)
You also have other clustering algorithms available to you through scikit-learn; you can see all of them in scikit-learn's clustering documentation.
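Since the question also asks about scaling to hundreds of thousands of observations, here is a minimal sketch (not part of the original answer) using scikit-learn's MiniBatchKMeans, which is built for large data sets; the synthetic points below are just a stand-in for real data:
from sklearn.cluster import MiniBatchKMeans
import numpy as np

# Synthetic stand-in for a large two-feature data set
rng = np.random.default_rng(0)
X_large = np.vstack([rng.normal(loc=c, scale=0.5, size=(100_000, 2))
                     for c in [(0, 0), (5, 5), (0, 5)]])

# MiniBatchKMeans processes the data in small batches, so it scales to
# hundreds of thousands of points much better than plain KMeans or DBSCAN
mbk = MiniBatchKMeans(n_clusters=3, batch_size=1024, random_state=0, n_init=3)
labels = mbk.fit_predict(X_large)
print(mbk.cluster_centers_)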

Related

Clustering algorithms with distance matrix as input

I need to cluster the graphs of countries around the world to find similarities.
The graphs are of COVID-19 cases during the pandemic.
I need a clustering method that takes a distance matrix as input.
You should make your question more precise. If possible, try to include a reproducible example with a small distance matrix to test.
Anyway, you can use https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html with affinity="precomputed":
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# symmetric pairwise distance matrix with a zero diagonal
X = np.array([[0, 2, 3], [2, 0, 3], [3, 3, 0]])

# the default "ward" linkage does not support precomputed distances,
# so pick "average" (or "complete"/"single") explicitly
clustering = AgglomerativeClustering(affinity="precomputed", linkage="average").fit(X)
clustering.labels_
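To build that distance matrix from the country curves in the first place, one option (my own assumption, since the asker didn't say how the distances are computed) is to stack the case curves into rows and use SciPy's pairwise distances:
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AgglomerativeClustering

# Hypothetical example: one row per country, one column per day of case counts
curves = np.array([[1, 2, 4, 8],
                   [2, 3, 5, 9],
                   [9, 7, 4, 1]])

# Euclidean distance between every pair of curves, as a square matrix
D = squareform(pdist(curves, metric="euclidean"))

# affinity="precomputed" is the older parameter name; newer scikit-learn
# versions call it metric="precomputed"
labels = AgglomerativeClustering(n_clusters=2, affinity="precomputed",
                                 linkage="average").fit_predict(D)
print(labels)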

Can somebody tell me the name of this algorithm, if it exists, or how to find it?

Here is the idea:
There is a huge 2D dataset (250,000 datapoints).
I need to get rid of 90% of the datapoints without hurting the data structure, which means (I believe) getting rid of the closest ones. The density must decrease...
Considering we need to keep the structure, we can't just randomly delete 90%, as this might cause bias. There may be a little element of randomness in this, but not too much.
I can put the data in a 2D matrix and divide it into cells. Some cells will then have more datapoints, some will have fewer, and some will have none.
I need an algorithm that will group those datapoints, or the cells in my matrix, into segments that all have a relatively similar number of datapoints. Those segments or cells in the "new" matrix can be of different sizes (which I believe is the point of this algorithm).
I've drawn a picture. It is not accurate, but I hope it makes the idea a bit clearer.
Also, I code in Python :^)
Thank you!!
The algorithm you are looking for is an unsupervised learning method; the most famous one in Python is k-means.
You can find the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Here is a code example for an array:
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
If you have to adjust it for a dataframe (df), it looks like this:
from sklearn.cluster import KMeans
X = df[['column A',..., 'column D']]
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
The output labels are your clusters.
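Since the actual goal is to thin out the 250,000 points while preserving their density structure, here is a minimal sketch (my own extension, not part of the answer above) that uses the k-means labels to keep the same fraction of points from every cluster:
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the real 250,000-point 2-D data set
rng = np.random.default_rng(0)
points = rng.normal(size=(250_000, 2))

# Cluster first, then keep the same fraction of points from every cluster,
# which roughly preserves the relative density structure
labels = KMeans(n_clusters=50, random_state=0, n_init=10).fit_predict(points)

keep_fraction = 0.10  # keep 10% of each cluster, i.e. drop 90% overall
keep_idx = np.concatenate([
    rng.choice(np.flatnonzero(labels == c),
               size=max(1, int(keep_fraction * np.sum(labels == c))),
               replace=False)
    for c in np.unique(labels)
])
reduced = points[keep_idx]
print(reduced.shape)  # roughly 10% of the original points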

How to apply mutual information to categorical features

I am using Scikit-learn to train a classification model. I have both discrete and continuous features in my training data.
I want to do feature selection using mutual information.
Features 1, 2 and 3 are discrete. To this end, I tried the code below:
mutual_info_classif(x, y, discrete_features=[1, 2, 3])
but it did not work; it gives me the error:
ValueError: could not convert string to float: 'INT'
A simple example with mutual information classifier:
import numpy as np
from sklearn.feature_selection import mutual_info_classif
X = np.array([[0, 0, 0],
              [1, 1, 0],
              [2, 0, 1],
              [2, 0, 1],
              [2, 0, 1]])
y = np.array([0, 1, 2, 2, 1])
mutual_info_classif(X, y, discrete_features=True)
# result: array([0.67301167, 0.22314355, 0.39575279])
mutual_info_classif can only take numeric data. You need to do label encoding of the categorical features and then run the same code.
from sklearn.preprocessing import LabelEncoder
x1 = x.apply(LabelEncoder().fit_transform)
Then run the exact same code you were running.
mutual_info_classif(x1, y, discrete_features=[1, 2, 3])
There is a difference between 'discrete' and 'categorical'.
In this case, the function requires the data to be numerical. You can use a label encoder if you have ordinal features; otherwise you would have to use one-hot encoding for nominal features. You can use pd.get_dummies for this purpose.
Mutual information measures shared information, and ordering does not matter. That being said, it should not matter whether the categorical data is ordered or not when you label-encode it.
So to answer the question:
Categorical values (like "udp", "-", "INT", which you mentioned in your comment) can be label-encoded in order to calculate the mutual information, even though sklearn recommends not using LabelEncoder on features. Of course, you can dummy-code or one-hot-encode the categorical features, but you lose the ability to look at the mutual information of the variable as a whole.
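As a minimal sketch of the two options discussed above (the column names and values are hypothetical, loosely modeled on the "udp"/"INT" values from the comments):
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import mutual_info_classif

# Hypothetical data: two categorical columns
x = pd.DataFrame({"proto": ["udp", "tcp", "udp", "icmp", "tcp", "udp"],
                  "state": ["INT", "FIN", "INT", "CON", "FIN", "INT"]})
y = [0, 1, 0, 1, 1, 0]

# Option 1: label-encode each categorical column and keep it as one feature
x1 = x.apply(LabelEncoder().fit_transform)
print(mutual_info_classif(x1, y, discrete_features=True))

# Option 2: one-hot encode; mutual information is then reported per dummy column
dummies = pd.get_dummies(x)
print(mutual_info_classif(dummies, y, discrete_features=True))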

PCA: Get Top 20 Most Important Dimensions

I'm doing a bit of machine learning and trying to find important dimensions using PCA. Here's what I've done so far:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(df_normalized)
X_reduced.shape
(2208, 1961)
So I have 2,208 rows consisting of 1,961 columns after running PCA that explains 98% of the variance in my dataset. However, I'm worried that the dimensions with the least explanatory power may actually be hurting my attempt at prediction (my model may just find spurious correlations in the data).
Does SciKit-Learn order the columns by importance? If so, I could just do:
X_final = X_reduced[:, :20], correct?
Thanks for the help!
The documentation says the components are sorted by explained variance, so yes, you should be able to do what you suggest and just take the first N dimensions of the output. You could also print the fitted PCA object's explained_variance_ (or explained_variance_ratio_) along with components_ to double-check the order.
Example from the documentation shows how to access the explained variance amounts:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
so in your case you could do print(pca.components_) and print(pca.explained_variance_ratio_) to get both (note that these are attributes of the fitted PCA object, not of the transformed array X_reduced). Then simply take the first N columns of X_reduced after finding what N explains the share of variance you want.
Be aware of the shapes, though: pca.components_ has shape [n_components, n_features], while the transformed data X_reduced has shape [n_samples, n_components]. So X_reduced[:, :20] keeps the scores of the first 20 components, and the component directions themselves would be pca.components_[:20, :].
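If it helps, here is a minimal sketch (my own example with random stand-in data, not from the question) of two equivalent ways to keep only the top 20 components:
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the normalized data frame from the question
df_normalized = np.random.rand(500, 100)

# Option 1: fit once, then keep only the first 20 columns of the transformed data
pca = PCA(n_components=0.98).fit(df_normalized)
X_top20 = pca.transform(df_normalized)[:, :20]

# Option 2: ask PCA for exactly 20 components up front
X_top20_direct = PCA(n_components=20).fit_transform(df_normalized)

# Columns are ordered by decreasing explained variance, so both keep the
# 20 highest-variance directions (up to sign)
print(pca.explained_variance_ratio_[:20].sum())
print(X_top20.shape, X_top20_direct.shape)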

Python k-means algorithm

I am looking for a Python implementation of the k-means algorithm, with examples, to cluster and cache my database of coordinates.
Update: (Eleven years after this original answer, it's probably time for an update.)
First off, are you sure you want k-means? This page gives an excellent graphical summary of some different clustering algorithms. Beyond the graphic, I'd suggest looking especially at the parameters each method requires and deciding whether you can provide them (e.g., k-means requires the number of clusters, but maybe you don't know that before you start clustering).
Here are some resources:
sklearn k-means and sklearn other clustering algorithms
scipy k-means and scipy k-means2
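For completeness, here is a minimal sketch (not part of the original answer) of the SciPy route listed above, using scipy.cluster.vq.kmeans2 on random 2-D points:
import numpy as np
from scipy.cluster.vq import kmeans2

# Random 2-D points as a stand-in for a real coordinate database
rng = np.random.default_rng(0)
points = rng.normal(size=(300, 2))

# kmeans2 returns the centroids and a cluster index for every point
centroids, labels = kmeans2(points, k=3, seed=0)
print(centroids.shape)  # (3, 2)
print(labels[:10])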
Old answer:
Scipy's clustering implementations work well, and they include a k-means implementation.
There's also scipy-cluster, which does agglomerative clustering; this has the advantage that you don't need to decide on the number of clusters ahead of time.
SciPy's kmeans2() has some numerical problems: others have reported error messages such as "Matrix is not positive definite - Cholesky decomposition cannot be computed" in version 0.6.0, and I just encountered the same in version 0.7.1.
For now, I would recommend using PyCluster instead. Example usage:
>>> import numpy
>>> import Pycluster
>>> points = numpy.vstack([numpy.random.multivariate_normal(mean,
...                                                          0.03 * numpy.diag([1, 1]),
...                                                          20)
...                        for mean in [(1, 1), (2, 4), (3, 2)]])
>>> labels, error, nfound = Pycluster.kcluster(points, 3)
>>> labels  # Cluster number for each point
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> error # The within-cluster sum of distances for the solution
1.7721661785401261
>>> nfound # Number of times this solution was found
1
For continuous data, k-means is very easy.
You need a list of your means, and for each data point, find the mean it's closest to and average the new data point into it. Your means will represent the recent salient clusters of points in the input data.
I do the averaging continuously, so there is no need to keep the old data to obtain the new average. Given the old average k, the next data point x, and a constant n which is the number of past data points to keep the average of, the new average is
k*(1-(1/n)) + x*(1/n)
Here is the full code in Python:
from __future__ import division
from random import random

# init means and data to random values
# use real data in your code
means = [random() for i in range(10)]
data = [random() for i in range(1000)]

param = 0.01  # bigger numbers make the means change faster
              # must be between 0 and 1

for x in data:
    closest_k = 0
    smallest_error = float('inf')  # effectively positive infinity
    for i, k in enumerate(means):
        error = abs(x - k)
        if error < smallest_error:
            smallest_error = error
            closest_k = i
    means[closest_k] = means[closest_k] * (1 - param) + x * param
You could just print the means when all the data has passed through, but it's much more fun to watch them change in real time. I used this on frequency envelopes of 20 ms bits of sound, and after talking to it for a minute or two, it had consistent categories for the short 'a' vowel, the long 'o' vowel, and the 's' consonant. Weird!
(Years later) this kmeans.py under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means is straightforward and reasonably fast; it uses any of the 20-odd metrics in scipy.spatial.distance.
From Wikipedia, you could use SciPy: K-means clustering and vector quantization.
Or, you could use a Python wrapper for OpenCV, ctypes-opencv.
Or you could use OpenCV's new Python interface and its kmeans implementation.
scikit-learn's KMeans() is the simplest way to apply k-means clustering in Python. Fitting clusters is as simple as:
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
This code snippet shows how to store centroid coordinates and predict clusters for an array of coordinates.
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
(courtesy of SciKit Learn's documentation, linked above)
You can also use GDAL, which has many functions for working with spatial data.
Python's Pycluster and pyplot can be used for k-means clustering and for visualization of 2D data. A recent blog post Stock Price/Volume Analysis Using Python and PyCluster gives an example of clustering using PyCluster on stock data.
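As a minimal sketch of that visualization idea (my own example, using scikit-learn's KMeans in place of Pycluster and synthetic data):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Three synthetic 2-D blobs
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                    for c in [(1, 1), (3, 1), (2, 3)]])
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(points)

# Color each point by its cluster label
plt.scatter(points[:, 0], points[:, 1], c=labels, s=10)
plt.title("k-means clusters")
plt.show()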
