Python KMeans clustering words

I am interested in performing k-means clustering on a list of words, with the distance measure being Levenshtein.
1) I know there are a lot of frameworks out there, including scipy and Orange, that have a k-means implementation. However, they all require some sort of vector as the data, which doesn't really fit my case.
2) I need a good clustering implementation. I looked at python-clustering and realized that it a) doesn't return the sum of all the distances to each centroid, and b) doesn't have any sort of iteration limit or cut-off to ensure the quality of the clustering. python-clustering and the clustering algorithm on daniweb don't really work for me.
Can someone find me a good lib? Google hasn't been my friend.

Yeah, I think there isn't a good implementation for what I need.
I have some crazy requirements, like distance caching etc.
So I think I will just write my own lib and release it as GPLv3 soon.

Not really an answer to your specific question, but I recommend glancing at "Programming Collective Intelligence". At the end of each chapter, e.g., clustering, it wanders off into describing all the best reading on the subject.

Maybe have a look at Weka. It is a Java library with some unsupervised learning implementations and nice visualization tools. It has been a while since I used it; I'm not sure if it is great for a real production environment, but it's definitely a good starting point.

What about this very nice answer on CrossValidated?
It uses Affinity Propagation instead of k-means, and in that case you can give a distance measure as input. I do not think any k-means-based approach could work in your case, since it is based on building centroids, and in order to do that you have to be in a vector space.
Affinity Propagation has the bonus that it selects the number of clusters automatically, which you can tweak (to have more or fewer clusters) by altering the preference (which by default is the median of all pairwise distances, but you can choose other percentiles).
If you need to specify the exact number of clusters, besides tweaking Affinity Propagation by trial and error, you could look for an implementation of k-medoids (apparently there is no implementation of it in sklearn, but people have asked for it here and there). K-medoids does not build centroids, so it does not need the concept of a vector space, and an implementation might accept a precomputed distance matrix as input (I haven't checked the references I give, though).
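As an illustration, here is a minimal sketch (the word list and the pure-Python levenshtein helper are my own placeholders, not from the answer above) of feeding a precomputed Levenshtein distance matrix to scikit-learn's AffinityPropagation; it expects similarities, hence the negated distances:

import numpy as np
from sklearn.cluster import AffinityPropagation

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

words = ["kitten", "sitting", "mitten", "flask", "flash", "flail"]
dist = np.array([[levenshtein(a, b) for b in words] for a in words], dtype=float)

# Affinity Propagation takes a similarity matrix, so pass negative distances.
labels = AffinityPropagation(affinity="precomputed", random_state=0).fit_predict(-dist)
for label in set(labels):
    print(label, [w for w, l in zip(words, labels) if l == label])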

Related

Algorithms to model non-linear relationship between two vectors

I want to build a model that describes a curve that fits the data shown in the scatterplot. I thought it would be straightforward using sklearn. But the choice and application of the different methods gets rather confusing.
Which algorithms would you use to tackle this problem?
This is really a question for CrossValidated rather than a Python question.
Your data seems to strongly indicate a simple underlying model which is linear until the very end, when it perhaps becomes polynomial.
As a first step, if possible, I would investigate this phenomenon. It's unusual. Perhaps there's something wrong with the data source. But maybe not. For example, a physical phenomenon with two distinct phases might produce data like these.
As to models, I would suggest natural cubic splines for this data. They are simple and involve cutting the data up into windows which you fit with cubic polynomials (a special case of which is a line).
You might also consider smoothing splines, and local regression.
For information on these, see the free online textbook, An Introduction to Statistical Learning.
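As a rough sketch of the smoothing-spline option (with made-up data, since I don't have yours), scipy's UnivariateSpline works like this; the s parameter controls how closely the curve tracks the points:

import numpy as np
from scipy.interpolate import UnivariateSpline

# synthetic data: linear until x = 8, then curving upwards
x = np.linspace(0, 10, 200)
y = 2.0 * x + np.where(x > 8, (x - 8) ** 3, 0.0) + np.random.normal(0, 0.5, x.size)

spline = UnivariateSpline(x, y, k=3, s=len(x))  # cubic smoothing spline
y_hat = spline(x)                               # fitted curve, ready to plot over the scatter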

Efficient KNN implementation which allows inserts

Suppose I have multi-dimensional datasets, which have many vectors as data. I am writing an algorithm which needs to do k-nearest-neighbour searches for all those vectors - classical KNN. However, during my algorithm I add new vectors to the overall dataset and need to include those new vectors in my KNN search. I want to do that efficiently. I looked into the KD tree and ball tree of scikit-learn, but they don't allow inserts (by the nature of the concepts). I am not sure whether an SR tree or R tree would provide inserts, but in any case, I was not able to find a Python implementation for data beyond 3D.
Regarding the search, I am fine with either the query "give me the closest vector" (so 1-NN) or "give me all vectors that are closer than a radius".
General comment: I don't quite understand why KD-Trees are so popular for high-dimensional kNN queries. In my experience, other trees scale much better with high dimensionality or large datasets (I tested up to 25 million points and (only) up to 40 dimensions). Some more details:
KD-Trees: As far as I know, KD-Trees should support insertion at any time, but there is a chance that they get imbalanced. I don't use python, so I don't know why your KD-tree does not support insertion/deletion on the fly.
Quadtree: Depending on the dimensionality, you could also use quadtrees/octrees, but standard implementations are not good for more than 10 dimensions or so. In the reference above I tested a quadtree with a special 'hypercube' navigation approach. That requires a lot of memory but scales much better with dimensionality in terms of performance.
R-Tree/R*Tree: The original R-Trees are not very good with insertion on the fly. However, if you look at R+Trees (R-Plus-Trees), they are quite fast with reinsertion and kNN queries.
PH-Trees have basically the same kNN performance as R+Trees, but much better insertion time, because PH-Trees do not need rebalancing, while having inherently limited depth and node size. Unfortunately, implementations get a lot more complicated for >=64 dimensions (the tree uses one bit of a long integer for each dimension). I'm not aware of an implementation that supports more than 63 dimensions.
Python:
R+Trees should be available for Python. If not, you could adapt a normal R-Tree (only the insertion algorithm is different).
I heard once of someone starting to implement a PH-Tree in Python, but I haven't seen any open-source variant yet.
If you have some time/interest to do your own implementation, you could look at the Java implementations here and translate them to Python. The library contains various multidimensional indexes, except KD-Trees. KD-Tree implementations that allow on-the-fly insertion can be found here and here.
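If you end up without a suitable library, a common Python workaround (a rough sketch; the class and parameter names are my own) is to keep a scikit-learn KDTree for the bulk of the data plus a small brute-force buffer of newly inserted points, rebuilding the tree once the buffer grows:

import numpy as np
from sklearn.neighbors import KDTree

class IncrementalNN:
    def __init__(self, data, rebuild_every=1000):
        self.data = [np.asarray(p) for p in data]
        self.tree = KDTree(np.vstack(self.data))
        self.buffer = []
        self.rebuild_every = rebuild_every

    def insert(self, point):
        self.buffer.append(np.asarray(point))
        if len(self.buffer) >= self.rebuild_every:  # amortise the rebuild cost
            self.data.extend(self.buffer)
            self.tree = KDTree(np.vstack(self.data))
            self.buffer = []

    def nearest(self, query):
        query = np.asarray(query)
        dist, idx = self.tree.query(query.reshape(1, -1), k=1)
        best_d, best_p = dist[0][0], self.data[idx[0][0]]
        for p in self.buffer:                       # brute-force check of unindexed points
            d = np.linalg.norm(p - query)
            if d < best_d:
                best_d, best_p = d, p
        return best_p, best_d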

Reverse-engineering a clustering algorithm from the clusters

I have a clustering of data performed by a human based solely on their knowledge of the system. I also have a feature vector for each element. I have no knowledge about the meaning of the features, nor do I know what the reasoning behind the human clustering was.
I have complete information about which elements belong to which cluster. I can assume that the human was not stupid and there is a way to derive the clustering from the features.
Is there an intelligent way to reverse-engineer the clustering? That is, how can I select the features and the clustering algorithm that will yield the same clustering most of the time (on this data set)?
So far I have tried the naive approach - going through the clustering algorithms provided by the sklearn library in python and comparing the obtained clusters to the source one. This approach does not yield good results.
My next approach would be to use some linear combinations of the features, or subsets of features. Here, again, my question is if there is a more intelligent way to do this than to go through as many combinations as possible.
I can't shake the feeling that this is a standard problem and I'm just missing the right term to find the solution on Google.
Are you sure it was done automatically?
It sounds to me as if you should be treating this as a classification problem: construct a classifier that does the same as the human did.
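For example, a minimal sketch (X being your feature matrix and y the human cluster labels, both assumed here) with a tree-based classifier, whose feature importances also hint at which features the human may have relied on:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=200, random_state=0)
# how well the human clustering can be reproduced from the features
print(cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
# the most influential features, as a clue to the human's reasoning
for rank, i in enumerate(clf.feature_importances_.argsort()[::-1][:10], 1):
    print(rank, "feature", i, clf.feature_importances_[i])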

Document Clustering in python using SciKit

I recently started working on document clustering using the scikit-learn module in Python. However, I am having a hard time understanding the basics of document clustering.
What I know:
Document clustering is typically done using TF/IDF, which essentially converts the words in the documents to a vector space model that is then fed to the algorithm.
There are many algorithms, like k-means, neural networks and hierarchical clustering, to accomplish this.
My Data:
I am experimenting with LinkedIn data; each document would be a LinkedIn profile summary. I would like to see if similar job documents get clustered together.
Current Challenges:
My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF/IDF. Is there any proper way to handle this high-dimensional data?
K-means and other algorithms require that I specify the number of clusters (centroids); in my case I do not know the number of clusters upfront. This, I believe, is completely unsupervised learning. Are there algorithms which can determine the number of clusters themselves?
I've never worked with document clustering before. If you are aware of tutorials, textbooks or articles which address this issue, please feel free to suggest them.
I went through the code on the scikit-learn webpage; it consists of too many technical terms which I do not understand. If you have any code with good explanation or comments, please share. Thanks in advance.
My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF/IDF. Is there any proper way to handle this high-dimensional data?
My first suggestion is that you don't, unless you absolutely have to because of memory or execution-time problems.
If you must handle it, you should use dimensionality reduction (PCA, for example) or feature selection (probably better in your case; see chi2, for example).
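A minimal sketch of that pipeline (the docs list is just a placeholder); TruncatedSVD is used here instead of plain PCA because it works directly on the sparse TF-IDF matrix:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["software engineer python", "java developer backend", "nurse hospital care"]  # placeholder summaries
X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(docs)
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)  # use ~100 components on real data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)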
K-means and other algorithms require that I specify the number of clusters (centroids); in my case I do not know the number of clusters upfront. This, I believe, is completely unsupervised learning. Are there algorithms which can determine the number of clusters themselves?
If you look at the clustering algorithms available in scikit-learn, you'll see that not all of them require that you specify the number of clusters.
Another one that does not is hierarchical clustering, implemented in scipy. Also see this answer.
I would also suggest that you use KMeans and try to manually tweak the number of clusters until you are satisfied with the results.
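For instance, a short sketch of scipy's hierarchical clustering, where cutting the dendrogram at a distance threshold determines the number of clusters (X_reduced is an assumed dense feature matrix, e.g. TF-IDF after dimensionality reduction):

from scipy.cluster.hierarchy import linkage, fcluster

Z = linkage(X_reduced, method="ward")               # build the cluster hierarchy
labels = fcluster(Z, t=1.0, criterion="distance")   # number of clusters follows from the cut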
I've never worked with document clustering before. If you are aware of tutorials, textbooks or articles which address this issue, please feel free to suggest them.
Scikit-learn has a lot of tutorials for working with text data; just use the "text data" search query on their site. One is for KMeans; others are for supervised learning, but I suggest you go over those too to get more familiar with the library. From a coding, style and syntax point of view, unsupervised and supervised learning are pretty similar in scikit-learn, in my opinion.
Document clustering is typically done using TF/IDF, which essentially converts the words in the documents to a vector space model which is then input to the algorithm.
Minor correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data. It does not care what you do with that data (clustering, classification, regression, search engine things etc.) afterwards.
I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". It's done using a clustering algorithm; TF-IDF only plays a preprocessing role in document clustering.
For the large matrix after the TF/IDF transformation, consider using a sparse matrix.
You could try different k values. I am not an expert in unsupervised clustering algorithms, but I expect that with such algorithms and different parameters, you could also end up with a varying number of clusters.
This link might be useful; it provides a good amount of explanation of k-means clustering with visual output: http://brandonrose.org/clustering
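If you do try different k values by hand, a small sketch like this (X_reduced again being an assumed document-feature matrix) lets you score each candidate k with the silhouette coefficient:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    print(k, silhouette_score(X_reduced, labels))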

Python Implementation of OPTICS (Clustering) Algorithm

I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs).
I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster in the list contains a list of (x, y) pairs belonging to that cluster.
I'm not aware of a complete and exact Python implementation of OPTICS. The links posted here seem to be just rough approximations of the OPTICS idea. They also do not use an index for acceleration, so they will run in O(n^2) or more likely even O(n^3).
OPTICS has a number of tricky things besides the obvious idea. In particular, the thresholding is proposed to be done with relative thresholds ("xi") instead of absolute thresholds as posted here (at which point the result will be approximately that of DBSCAN!).
The original OPTICS paper contains a suggested approach to converting the algorithm's output into actual clusters:
http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/OPTICS.pdf
The OPTICS implementation in Weka is essentially unmaintained and just as incomplete. It doesn't actually produce clusters, it only computes the cluster order. For this it makes a duplicate of the database - it isn't really Weka code.
There seems to be a rather extensive implementation available in ELKI in Java by the group that published OPTICS in the first place. You might want to test any other implementation against this "official" version.
EDIT: the following is known to not be a complete implementation of OPTICS.
I did a quick search and found the following (Optics). I can't vouch for its quality; however, the algorithm seems pretty simple, so you should be able to validate/adapt it quickly.
Here is a quick example of how to build clusters on the output of the optics algorithm:
def cluster(order, distance, points, threshold):
    '''Given the output of the OPTICS algorithm,
    compute the clusters.

    :param order: The order of the points
    :param distance: The reachability distances of the points
    :param points: The actual points
    :param threshold: The threshold value to cluster on
    :returns: A list of cluster groups
    '''
    clusters = [[]]
    points = sorted(zip(order, distance, points))
    # A point stays in the current cluster while its reachability distance is
    # at or below the threshold; a larger distance starts a new cluster.
    flags = ((v <= threshold, p) for i, v, p in points)
    for iscluster, point in flags:
        if iscluster:
            clusters[-1].append(point)
        elif len(clusters[-1]) > 0:
            clusters.append([])
    return clusters

rd, cd, order = optics(points, 4)
print(cluster(order, rd, points, 38.0))
While not technically OPTICS, there is an HDBSCAN* implementation for Python available at https://github.com/lmcinnes/hdbscan . This is equivalent to OPTICS with an infinite maximal epsilon and a different cluster extraction method. Since the implementation provides access to the generated cluster hierarchy, you can extract clusters from it via more traditional OPTICS methods as well, if you would prefer.
Note that despite not limiting the epsilon parameter this implementation still achieves O(n log(n)) performance using kd-tree and ball-tree based minimal spanning tree algorithms, and can handle quite large datasets.
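Basic usage is short; a sketch (X being an assumed (n, 2) array of (x, y) points), where a label of -1 marks noise:

import hdbscan

labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(X)
clusters = [[p for p, l in zip(X, labels) if l == label] for label in set(labels) if label != -1]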
There now exists the library pyclustering that contains, amongst others, a Python and a C++ implementation of OPTICS.
It is now implemented in the development version (scikit-learn v0.21.dev0) of sklearn (a clustering and machine learning library for Python).
here is the link:
https://scikit-learn.org/dev/modules/generated/sklearn.cluster.OPTICS.html
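A brief usage sketch of that estimator (X is an assumed (n, 2) array of (x, y) pairs; as with DBSCAN, a label of -1 means noise):

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.rand(100, 2)
labels = OPTICS(min_samples=5).fit_predict(X)
clusters = [X[labels == label] for label in np.unique(labels) if label != -1]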
See "Density-based clustering approaches" on
http://www.chemometria.us.edu.pl/index.php?goto=downloads
You want to look at a space-filling curve or a spatial index. An SFC reduces the 2D complexity to a 1D complexity. Have a look at Nick's Hilbert curve quadtree spatial index blog; you can download my implementation of an SFC at phpclasses.org (hilbert-curve).
