I am looking for a Python implementation of the k-means algorithm, with examples, to cluster and cache my database of coordinates.
Update: (Eleven years after this original answer, it's probably time for an update.)
First off, are you sure you want k-means? This page gives an excellent graphical summary of some different clustering algorithms. Beyond the graphic, look especially at the parameters that each method requires and decide whether you can provide them (e.g., k-means requires the number of clusters, but maybe you don't know that before you start clustering).
Here are some resources:
sklearn k-means and sklearn other clustering algorithms
scipy k-means and scipy k-means2
Old answer:
Scipy's clustering implementations work well, and they include a k-means implementation.
There's also scipy-cluster, which does agglomerative clustering; this has the advantage that you don't need to decide on the number of clusters ahead of time.
SciPy's kmeans2() has some numerical problems: others have reported error messages such as "Matrix is not positive definite - Cholesky decomposition cannot be computed" in version 0.6.0, and I just encountered the same in version 0.7.1.
For now, I would recommend using PyCluster instead. Example usage:
>>> import numpy
>>> import Pycluster
>>> points = numpy.vstack([numpy.random.multivariate_normal(mean,
                                                            0.03 * numpy.diag([1, 1]),
                                                            20)
                           for mean in [(1, 1), (2, 4), (3, 2)]])
>>> labels, error, nfound = Pycluster.kcluster(points, 3)
>>> labels # Cluster number for each point
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> error # The within-cluster sum of distances for the solution
1.7721661785401261
>>> nfound # Number of times this solution was found
1
For continuous data, k-means is very easy.
You need a list of your means, and for each data point, find the mean it is closest to and average the new data point into it. Your means will represent the recent salient clusters of points in the input data.
I do the averaging continuously, so there is no need to keep the old data to obtain the new average. Given the old average k, the next data point x, and a constant n (the number of past data points to keep the average of), the new average is
k*(1 - 1/n) + x*(1/n)
Here is the full code in Python
from __future__ import division
from random import random

# init means and data to random values
# use real data in your code
means = [random() for i in range(10)]
data = [random() for i in range(1000)]

param = 0.01  # bigger numbers make the means change faster
              # must be between 0 and 1

for x in data:
    closest_k = 0
    smallest_error = float('inf')  # any real error will be smaller than this
    for k in enumerate(means):
        error = abs(x - k[1])
        if error < smallest_error:
            smallest_error = error
            closest_k = k[0]
    means[closest_k] = means[closest_k] * (1 - param) + x * param
You could just print the means when all the data has passed through, but it's much more fun to watch them change in real time. I used this on frequency envelopes of 20 ms bits of sound, and after talking to it for a minute or two, it had consistent categories for the short 'a' vowel, the long 'o' vowel, and the 's' consonant. Weird!
(Years later) The kmeans.py posted under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means is straightforward and reasonably fast; it uses any of the 20-odd metrics in scipy.spatial.distance.
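The core idea there is just Lloyd-style iterations with cdist handling the metric. The following is not that kmeans.py, only a minimal sketch of the same approach; simple_kmeans, the cityblock metric, and the random toy data are all illustrative choices:

import numpy as np
from scipy.spatial.distance import cdist

def simple_kmeans(points, k, metric="cityblock", n_iter=20, seed=0):
    # Lloyd-style k-means where the assignment step uses any scipy.spatial.distance metric.
    # Note: with non-Euclidean metrics the mean update is only a heuristic.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center under the chosen metric
        labels = cdist(points, centers, metric=metric).argmin(axis=1)
        # recompute centers; keep the old center if a cluster went empty
        centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return centers, labels

centers, labels = simple_kmeans(np.random.rand(200, 2), 3)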
From Wikipedia, you could use SciPy's k-means clustering and vector quantization module.
Or, you could use a Python wrapper for OpenCV, ctypes-opencv.
Or you could use OpenCV's new Python interface and their kmeans implementation.
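If you go the OpenCV route, a minimal sketch with the cv2 bindings might look like this (the toy data and parameter choices are assumptions; check the docs for your OpenCV version):

import numpy as np
import cv2

pts = np.random.rand(100, 2).astype(np.float32)  # cv2.kmeans expects float32 input

# stop after 10 iterations or when centers move less than 1.0
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
compactness, labels, centers = cv2.kmeans(pts, 3, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

print(centers)         # 3 x 2 array of cluster centers
print(labels.ravel())  # cluster index for each point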
SciKit Learn's KMeans() is the simplest way to apply k-means clustering in Python. Fitting clusters is as simple as:
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
This code snippet shows how to store centroid coordinates and predict clusters for an array of coordinates.
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1., 2.],
       [ 4., 2.]])
(courtesy of SciKit Learn's documentation, linked above)
You can also use GDAL, which has many functions for working with spatial data.
Python's Pycluster and pyplot can be used for k-means clustering and for visualization of 2D data. A recent blog post Stock Price/Volume Analysis Using Python and PyCluster gives an example of clustering using PyCluster on stock data.
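A minimal sketch of that combination, reusing the kind of toy data from the PyCluster example above (the data and figure styling are just placeholders):

import numpy as np
import Pycluster
import matplotlib.pyplot as plt

points = np.vstack([np.random.multivariate_normal(mean, 0.03 * np.diag([1, 1]), 20)
                    for mean in [(1, 1), (2, 4), (3, 2)]])

labels, error, nfound = Pycluster.kcluster(points, 3)

# color each 2D point by its cluster label
plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.show()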
Related
I need to make an adjacency matrix for each community (sub-graph) detected by leidenalg, but the problem is that the output of find_partition() only shows the nodes in each sub-graph. Is there any way to convert the output to something like an np.array with the edge information of each sub-graph?
import leidenalg
import igraph as ig
G = ig.Graph.Erdos_Renyi(10, 0.1);
partitions = leidenalg.find_partition(G, leidenalg.ModularityVertexPartition)
print(partitions)
output:
Clustering with 10 elements and 3 clusters
[0] 2, 5, 8, 9
[1] 3, 4, 6
[2] 0, 1, 7
You can do this by simply constructing the subgraph and then computing the adjacency matrix. Your example is not quite reproducible because ig.Graph.Erdos_Renyi uses the random number generator, so I added a little code to set the random seed and generate a graph like yours, except reproducible. I simply get the adjacency matrix for the first partition, but of course you can loop through the partitions and get all of the matrices, as shown in the sketch after the code below.
import igraph as ig
import leidenalg
import random
random.seed(a=321)
G = ig.Graph.Erdos_Renyi(10, 0.28);
partitions = leidenalg.find_partition(G, leidenalg.ModularityVertexPartition)
print(partitions)
P0 = G.subgraph(partitions[0])
P0.get_adjacency()
Out[15]: Matrix([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]])
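To get NumPy arrays for every community, as the question asks, one possible loop over the partitions (continuing from the snippet above; the .data attribute of igraph's Matrix is assumed here):

import numpy as np

adjacency = [np.array(G.subgraph(part).get_adjacency().data) for part in partitions]
for i, A in enumerate(adjacency):
    print("community", i)
    print(A)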
I am using Scikit-learn to train a classification model. I have both discrete and continuous features in my training data.
I want to do feature selection using mutual information.
The features 1, 2 and 3 are discrete. To this end, I try the code below:
mutual_info_classif(x, y, discrete_features=[1, 2, 3])
but it did not work; it gives me the error:
ValueError: could not convert string to float: 'INT'
A simple example with mutual information classifier:
import numpy as np
from sklearn.feature_selection import mutual_info_classif
X = np.array([[0, 0, 0],
              [1, 1, 0],
              [2, 0, 1],
              [2, 0, 1],
              [2, 0, 1]])
y = np.array([0, 1, 2, 2, 1])
mutual_info_classif(X, y, discrete_features=True)
# result: array([0.67301167, 0.22314355, 0.39575279])
mutual_info_classif can only take numeric data. You need to do label encoding of the categorical features and then run the same code.
from sklearn.preprocessing import LabelEncoder

x1 = x.apply(LabelEncoder().fit_transform)
Then run the exact same code you were running.
mutual_info_classif(x1, y, discrete_features=[1, 2, 3])
There is a difference between 'discrete' and 'categorical'.
In this case, the function demands that the data be numerical. Maybe you can use a label encoder if you have ordinal features; otherwise you would have to use one-hot encoding for nominal features. You can use pd.get_dummies for this purpose.
Mutual information measures shared information, where ordering does not matter. That being said, it should not matter whether categorical data is ordered or not when label-encoding it.
So to answer the question:
Categorical values (like "udp", "-", "INT", which you mentioned in your comment) can be label-encoded in order to calculate the mutual information, even though sklearn recommends not to use LabelEncoder on features. Of course, you can dummy-code or one-hot-encode the categorical features, but you lose the ability to look at the mutual information of the variable as a whole.
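A sketch of that encoding step, with a made-up frame standing in for x (the column names and values below, echoing "udp"/"-"/"INT", are hypothetical):

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import LabelEncoder

x = pd.DataFrame({
    "dur":   [0.1, 0.2, 0.4, 0.3, 0.25, 0.15],          # continuous feature
    "proto": ["udp", "tcp", "udp", "udp", "tcp", "udp"],
    "state": ["INT", "FIN", "INT", "CON", "FIN", "INT"],
    "svc":   ["-", "dns", "-", "http", "dns", "-"],
})
y = np.array([0, 1, 0, 1, 1, 0])

# label-encode only the categorical columns, leaving the continuous one alone
x1 = x.copy()
for col in ["proto", "state", "svc"]:
    x1[col] = LabelEncoder().fit_transform(x1[col])

mi = mutual_info_classif(x1, y, discrete_features=[1, 2, 3])
print(dict(zip(x.columns, mi)))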
Being new to unsupervised methods, I'm in need of a push in the right direction with some semi-simple code to run through some data as a case study. The data I'm working on only has 300 or so observations, but I want to learn how I can apply clustering to very large sets that behave similarly.
I have a 2-feature set of data and I'd like to run DBSCAN or something similar using Euclidean distances (if this is the correct clustering approach).
As an example the data looks like this:
I can just tell by eye that clustering this way might not be the best method, as the distribution looks irregular.
What method should I use to begin understanding similar distributions like these, especially when the set is very large (hundreds of thousands of observations)?
For most machine learning tasks, scikit-learn is your friend here. For DBSCAN, scikit-learn has sklearn.cluster.DBSCAN. From the scikit-learn docs:
>>> from sklearn.cluster import DBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
... [8, 7], [8, 8], [25, 80]])
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(X)
>>> clustering.labels_
array([ 0, 0, 0, 1, 1, -1])
>>> clustering
DBSCAN(algorithm='auto', eps=3, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=2, n_jobs=None, p=None)
You also have other clustering algorithms available to you through scikit-learn. You can see all of them here.
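For a sense of how this scales to the larger case mentioned in the question, here is a sketch running DBSCAN on a bigger synthetic 2-feature dataset (the sample size, blob layout, eps and min_samples are made-up choices you would tune on real data):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# synthetic stand-in for a large 2-feature dataset
X, _ = make_blobs(n_samples=20_000, centers=5, cluster_std=0.8, random_state=0)

labels = DBSCAN(eps=0.3, min_samples=10).fit(X).labels_

# DBSCAN labels noise points as -1
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))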
I'm trying to build a toy recommendation engine to wrap my mind around Singular Value Decomposition (SVD). I've read enough content to understand the motivations and intuition behind the actual decomposition of the matrix A (a user x movie matrix).
I need to know more about what goes on after that.
from numpy.linalg import svd
import numpy as np
A = np.matrix([
    [0, 0, 0, 4, 5],
    [0, 4, 3, 0, 0],
    ...
])
U, S, V = svd(A)
k = 5 #dimension reduction
A_k = U[:, :k] * np.diag(S[:k]) * V[:k, :]
Three Questions:
Do the values of matrix A_k represent the predicted/approximate ratings?
What role/ what steps does cosine similarity play in the recommendation?
And finally, I'm using Mean Absolute Error (MAE) to calculate my error. But what values am I comparing? Something like MAE(A, A_k), or something else?
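To make those questions concrete, here is a runnable version of the decomposition step with made-up ratings, plus one possible MAE comparison restricted to the observed (non-zero) entries; both the data and the choice of comparison are illustrative assumptions, not a definitive answer:

import numpy as np
from numpy.linalg import svd

# toy user x movie matrix with made-up ratings (0 = not rated)
A = np.array([[0., 0., 0., 4., 5.],
              [0., 4., 3., 0., 0.],
              [5., 5., 0., 0., 0.],
              [0., 3., 0., 4., 4.]])

U, S, Vt = svd(A, full_matrices=False)

k = 2                                        # keep the k largest singular values
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # rank-k approximation of A

observed = A != 0                            # mask of entries that were actually rated
mae = np.abs(A - A_k)[observed].mean()
print(mae)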
I am trying to run a Fisher's LDA (1, 2) to reduce the number of features of a matrix.
Basically (correct me if I am wrong), given n samples classified into several classes, Fisher's LDA tries to find an axis such that projecting onto it maximizes the value J(w), which is the ratio of the total sample variance to the sum of the variances within the separate classes.
I think this can be used to find the most useful features for each class.
I have a matrix X of m features and n samples (m rows, n columns).
I have a sample classification y, i.e. an array of n labels, each one for each sample.
Based on y, I want to reduce the number of features to, for example, the 3 most representative features.
Using scikit-learn, I tried it this way (following this documentation):
>>> import numpy as np
>>> from sklearn.lda import LDA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = LDA(n_components=3)
>>> clf.fit_transform(X, y)
array([[ 4.],
       [ 4.],
       [ 8.],
       [-4.],
       [-4.],
       [-8.]])
At this point I am a bit confused, how to obtain the most representative features?
The features you are looking for are in clf.coef_ after you have fitted the classifier.
Note that n_components=3 doesn't make sense here, since X.shape[1] == 2, i.e. your feature space only has two dimensions.
You do not need to invoke fit_transform in order to obtain coef_; calling clf.fit(X, y) will suffice.
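A minimal sketch of that, using the question's toy data; note that in current scikit-learn the class lives at sklearn.discriminant_analysis.LinearDiscriminantAnalysis, and ranking features by the absolute value of coef_ is just one heuristic:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

clf = LinearDiscriminantAnalysis()
clf.fit(X, y)                        # fit() alone is enough to populate coef_

weights = np.abs(clf.coef_).ravel()  # one weight per original feature
ranking = np.argsort(weights)[::-1]  # feature indices, most influential first
print(ranking)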