Hierarchical clustering of 1 million objects


Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange.
hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but failed with 100k objects (saturated memory and eventually crashed).
I am running on a 64bit Xeon CPU (2.53GHz) and 8GB of RAM + 3GB swap on Ubuntu 11.10.

The problem is probably that they try to compute the full 2D distance matrix (about 8 TB for 1 million objects, naively with double precision), and then their algorithm will run in O(n^3) time anyway.
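To put numbers on the distance-matrix cost, here is a small back-of-the-envelope sketch. The helper name distance_matrix_bytes is made up for illustration, and it assumes 8-byte doubles with no container overhead, so real usage will be somewhat higher.

```python
def distance_matrix_bytes(n, bytes_per_value=8, condensed=False):
    """Bytes needed to hold all pairwise distances between n objects."""
    # A full square matrix has n*n entries; the condensed (upper-triangle)
    # form has n*(n-1)/2 entries.
    entries = n * (n - 1) // 2 if condensed else n * n
    return entries * bytes_per_value


for n in (18_000, 100_000, 1_000_000):
    full_gb = distance_matrix_bytes(n) / 1e9
    half_gb = distance_matrix_bytes(n, condensed=True) / 1e9
    print(f"n={n:>9,}: full {full_gb:,.1f} GB, condensed {half_gb:,.1f} GB")
```

Even the condensed form is far beyond 8 GB of RAM at 100k objects, which is consistent with the crash described in the question.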
You should seriously consider using a different clustering algorithm. Hierarchical clustering is slow and the results are not at all convincing usually. In particular for millions of objects, where you can't just look at the dendrogram to choose the appropriate cut.
If you really want to continue with hierarchical clustering, I believe that ELKI (Java, though) has an O(n^2) implementation of SLINK, which at 1 million objects should be approximately 1 million times as fast. I don't know if they already have CLINK, too. And I'm not sure if there actually is any sub-O(n^3) algorithm for variants other than single-link and complete-link.
Consider using other algorithms. k-means for example scales very well with the number of objects (it's just not very good usually either, unless your data is very clean and regular). DBSCAN and OPTICS are quite good in my opinion, once you have a feel for the parameters. If your data set is low dimensional, they can be accelerated quite well with an appropriate index structure. They should then run in O(n log n), if you have an index with O(log n) query time. Which can make a huge difference for large data sets. I've personally used OPTICS on a 110k images data set without problems, so I can imagine it scales up well to 1 million on your system.

To beat O(n^2), you'll have to first reduce your 1M points (documents)
to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
Two possible approaches:
build a hierarchical tree from say 15k points, then add the rest one by one:
time ~ 1M * treedepth
first build 100 or 1000 flat clusters,
then build your hierarchical tree of the 100 or 1000 cluster centres.
How well either of these might work depends critically
on the size and shape of your target tree --
how many levels, how many leaves ?
What software are you using,
and how many hours / days do you have to do the clustering ?
For the flat-cluster approach,
k-d trees
work fine for points in 2d, 3d, 20d, even 128d -- not your case.
I know hardly anything about clustering text;
locality-sensitive hashing ?
Take a look at scikit-learn clustering --
it has several methods, including DBSCAN.
Added: see also
google-all-pairs-similarity-search
"Algorithms for finding all similar pairs of vectors in sparse vector data", Bayardo et al. 2007
SO hierarchical-clusterization-heuristics
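The second approach above (flat clusters first, then a tree over the cluster centres) could be sketched roughly as follows. This is only an illustration on synthetic dense data: the choice of MiniBatchKMeans for the flat step, average linkage for the tree, and all sizes are assumptions, not something the answer prescribes.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 10))  # synthetic stand-in for the real points

# Stage 1: many flat clusters; the cost grows linearly with the number of points.
km = MiniBatchKMeans(n_clusters=100, batch_size=1024, n_init=3, random_state=0)
flat_labels = km.fit_predict(X)

# Stage 2: hierarchical tree over the 100 centres only (100 x 100 distances).
Z = linkage(km.cluster_centers_, method="average")

# Cut the tree into at most 10 top-level groups, then map every point
# to a group through the flat cluster it belongs to.
centre_groups = fcluster(Z, t=10, criterion="maxclust")
point_groups = centre_groups[flat_labels]
print(Z.shape, point_groups.shape)
```

The tree only ever sees the centres, so its quality depends on how well the flat step represents the data.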

Related

Clustering on large, mixed type data

I'm dealing with a dataframe of dimension 4 million x 70. Most columns are numeric, and some are categorical, in addition to the occasional missing values. It is essential that the clustering is run on all data points, and we look to produce around 400,000 clusters (so subsampling the dataset is not an option).
I have looked at using Gower's distance metric for mixed type data, but this produces a dissimilarity matrix of dimension 4 million x 4 million, which is just not feasible to work with since it has 10^13 elements. So, the method needs to avoid dissimilarity matrices entirely.
Ideally, we would use an agglomerative clustering method, since we want a large number of clusters.
What would be a suitable method for this problem? I am struggling to find a method which meets all of these requirements, and I realise it's a big ask.
Plan B is to use a simple rules-based grouping method based on categorical variables alone, handpicking only a few variables to cluster on since we will suffer from the curse of dimensionality otherwise.

The first step is going to be turning those categorical values into numbers somehow, and the second step is going to be putting the now all numeric attributes into the same scale.
Clustering is computationally expensive, so you might try a third step of representing this data by the top 10 components of a PCA (or however many components have an eigenvalue > 1) to reduce the columns.
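A minimal sketch of those three steps with plain NumPy, on synthetic data. One-hot encoding is just one possible way to turn categories into numbers, the column counts are made up, and missing values are not handled here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the real table: 12 numeric columns driven by
# 3 latent factors, plus one categorical column with 3 levels.
numeric = rng.normal(size=(n, 3)) @ rng.normal(size=(3, 12))
numeric += 0.1 * rng.normal(size=(n, 12))
category = rng.choice(["a", "b", "c"], size=n)

# Step 1: categorical -> numeric (one-hot encoding, an assumption).
levels, codes = np.unique(category, return_inverse=True)
one_hot = np.eye(len(levels))[codes]

# Step 2: put every column on the same scale (zero mean, unit variance).
X = np.hstack([numeric, one_hot])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 3: PCA via SVD. With this scaling, the eigenvalues of the correlation
# matrix are the squared singular values divided by n.
_, s, Vt = np.linalg.svd(Xs, full_matrices=False)
eigenvalues = s**2 / n
keep = eigenvalues > 1.0
X_reduced = Xs @ Vt[keep].T

print(X.shape, "->", X_reduced.shape)
```

The reduced matrix X_reduced is what would then be handed to the clustering step.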
For the clustering step, you'll have your choice of algorithms. I would think something hierarchical would be helpful for you, since even though you expect a high number of clusters, it makes intuitive sense that those clusters would fall under larger clusters that continue to make sense all the way down to a small number of "parent" clusters. A popular choice might be HDBSCAN, but I tend to prefer trying OPTICS. The implementation in the free ELKI seems to be the fastest (it takes some messing around to figure out) because it runs in Java. The output of ELKI is a little strange: it writes a file for every cluster, so unfortunately you then have to use Python to loop through the files and create your final mapping. But it's all doable (including executing the ELKI command) from Python if you're building an automated pipeline.

"fast ward" clustering in Python

In JMP software there is an option to use the "fast Ward" method when the number of rows is greater than 2000. From the documentation [fast ward]:
"Applies an algorithm that computes Ward's method more quickly for large numbers of rows. The computation time is shorter because this algorithm does not require the calculation of a distance matrix. It is used automatically whenever there are more than 2,000 rows."
Matlab does the same thing....
"Find a maximum of four clusters in a hierarchical cluster tree created using the ward linkage method. Specify 'SaveMemory' as 'on' to construct clusters without computing the distance matrix. Otherwise, you can receive an out-of-memory error if your machine does not have enough memory to hold the distance matrix."
I'm looking for something similar in Python but they all seem to require the distance matrix calculated ahead of time (which requires absurd amounts of memory for my problem of 275k rows and 10 columns). In JMP/Matlab though it works just fine on a machine with half the memory of the machine I want to run the python script on. Anybody know of something?

From a now-rolled-back edit to the question by the OP:
I found that using the "linkage_vector" option seems to be what I was looking for. I was thrown off because "vector" to me meant 1D, but I guess it can be N-D.

Have you worked with fastcluster? It has the option for "hierarchical clusters from distance matrices or from vector data"
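For reference, a hedged sketch of what the vector-data route might look like. It assumes fastcluster is installed and uses its linkage_vector function with method="ward"; the fallback to scipy is only there so the snippet still runs on small data without fastcluster, and it does build the distance matrix internally. The data and the cut into 4 clusters are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

try:
    from fastcluster import linkage_vector

    def ward_tree(X):
        # Works directly on the observation vectors.
        return linkage_vector(X, method="ward")

except ImportError:
    from scipy.cluster.hierarchy import linkage

    def ward_tree(X):
        # Fallback for small data only: scipy computes pairwise distances first.
        return linkage(X, method="ward")


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))  # small synthetic stand-in for 275k x 10

Z = ward_tree(X)  # (n - 1) x 4 linkage matrix in scipy's format
labels = fcluster(Z, t=4, criterion="maxclust")
print(Z.shape, np.bincount(labels)[1:])
```

The result has the same layout as scipy's linkage output, so fcluster and dendrogram can be applied to it afterwards.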

Python kmeans clustering for large datasets

I need to use bag of words (in this case bag of features) to generate descriptor vectors to classify the KTH video dataset. In order to do this, I need to use the kmeans clustering algorithm to cluster the extracted features and find the codebook. The features extracted from the dataset form approximately 75000 vectors of 100 elements each, and I'm facing memory issues using the scipy.cluster.kmeans2 implementation in Ubuntu. I ran some tests and discovered that with 32000 vectors of 100 elements each, the amount of memory used is around 20GB (my total memory is 32GB).
Is there any other Python kmeans implementation that is more memory efficient?
I have already read about Mahout for clustering big data, but I still don't understand what its advantages are. Is it more memory-efficient with the amount of data mentioned?

When you have many samples, consider using sklearn's MiniBatchKMeans, which is an SGD-like method built for this case! (A more tutorial-like intro which does not address memory usage, but I expect it to be better there for large n_samples. Of course memory also depends on many other parameters like k ... In the case of huge n_features it won't help in regards to memory; but that's not your problem here)
In this case you should carefully tune your mini-batch sizes then.
You can also try the classic KMeans implementation there, since you seem to be only slightly over the memory limit and that implementation may be more efficient (it is certainly more tunable).
In the latter case, init, n_init, precompute_distances, algorithm and maybe copy_x are all parameters having effect on memory-consumption.
And furthermore: if(!) your data is sparse; try calling it with sparse-matrices. (from reading kmeans2-docs it seems it's not supported, but sklearn's kmeans does!)
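A rough sketch of the mini-batch route, feeding the data in blocks through partial_fit so the full feature matrix never has to sit in memory at once. The feature_chunks generator is a hypothetical stand-in for however the descriptors are actually loaded, and the codebook size of 50 and the batch size are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
n_features, n_clusters = 100, 50


def feature_chunks(n_chunks=15, chunk_size=5000):
    """Hypothetical loader: yields one block of descriptors at a time."""
    for _ in range(n_chunks):
        # float32 halves the memory of each block compared with float64.
        yield rng.normal(size=(chunk_size, n_features)).astype(np.float32)


km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=1024, n_init=3, random_state=0)
for chunk in feature_chunks():
    km.partial_fit(chunk)  # updates the centres using this block only

codebook = km.cluster_centers_  # one row per visual word
words = km.predict(next(feature_chunks(n_chunks=1)))
print(codebook.shape, words[:10])
```

If the data fits in memory after all, a single km.fit(X) call does the mini-batching internally.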

large scale clustering library possibly with python bindings

I've been trying to cluster a larger dataset consisting of 50000 measurement vectors with dimension 7. I'm trying to generate about 30 to 300 clusters for further processing.
I've been trying the following clustering implementations with no luck:
Pycluster.kcluster (gives only 1-2 non-empty clusters on my dataset)
scipy.cluster.hierarchy.fclusterdata (runs too long)
scipy.cluster.vq.kmeans (runs out of memory)
sklearn.cluster.hierarchical.Ward (runs too long)
Are there any other implementations which I might miss?

50000 instances and 7 dimensions isn't really big, and should not kill an implementation.
Although it doesn't have python binding, give ELKI a try. The benchmark set they use on their homepage is 110250 instances in 8 dimensions, and they run k-means on it in 60 seconds apparently, and the much more advanced OPTICS in 350 seconds.
Avoid hierarchical clustering. It's really only for small data sets. The way it is commonly implemented on matrix operations is O(n^3), which is really bad for large data sets. So I'm not surprised these two timed out for you.
DBSCAN and OPTICS when implemented with index support are O(n log n). When implemented naively, they are in O(n^2). K-means is really fast, but often the results are not satisfactory (because it always splits in the middle). It should run in O(n * k * iter) which usually converges in not too many iterations (iter<<100). But it will only work with Euclidean distance, and just doesn't work well with some data (high-dimensional, discrete, binary, clusters with different sizes, ...)
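To make the O(n * k * iter) cost of k-means concrete, here is a bare-bones Lloyd iteration in NumPy. It is a sketch, not a tuned implementation: random initialisation, Euclidean distance only, and the function name and defaults are made up.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centres, labels, iterations used)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    x_sq = (X**2).sum(axis=1)[:, None]
    for n_iter in range(1, max_iter + 1):
        # Assignment step: an n x k table of squared Euclidean distances,
        # computed as |x|^2 - 2 x.c + |c|^2 to avoid an n x k x d temporary.
        d2 = x_sq - 2.0 * X @ centres.T + (centres**2).sum(axis=1)[None, :]
        labels = d2.argmin(axis=1)
        # Update step: each centre moves to the mean of its members.
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centres[j] = members.mean(axis=0)
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels, n_iter


rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 7))
centres, labels, n_iter = kmeans(X, k=30)
print(centres.shape, n_iter)
```

Each iteration touches every point once per centre, and memory stays at n x k values instead of n x n.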

Since you're already trying scikit-learn: sklearn.cluster.KMeans should scale better than Ward and supports parallel fitting on multicore machines. MiniBatchKMeans is better still, but won't do random restarts for you.
>>> import numpy as np
>>> from sklearn.cluster import MiniBatchKMeans
>>> X = np.random.randn(50000, 7)
>>> %timeit MiniBatchKMeans(30).fit(X)
1 loops, best of 3: 114 ms per loop

My package milk handles this problem easily:
import milk
import numpy as np
data = np.random.rand(50000,7)
%timeit milk.kmeans(data, 300)
1 loops, best of 3: 14.3 s per loop
I wonder whether you meant to write 500,000 data points, because 50k points is not that much. If so, milk takes a while longer (~700 sec), but still handles it well as it does not allocate any memory other than your data and the centroids.

The real answer for actually large scale situations is to use something like FAISS, Facebook Research's library for efficient similarity search and clustering of dense vectors.
See
https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization

OpenCV has a k-means implementation, Kmeans2
Expected running time is on the order of O(n**4) - for an order-of-magnitude approximation, see how long it takes to cluster 1000 points, then multiply that by seven million (50**4 rounded up).

Memory Error when calculating pairwise distances in scipy

I am trying to apply hierarchical clustering to my dataset, which consists of 14039 vectors of users. Each vector has 10 features, where each feature is basically the frequency of tags tagged by that user.
I am using the SciPy API for clustering.
Now I need to calculate pairwise distances between these 14039 users and pass this distance matrix to the linkage function.
import scipy.cluster.hierarchy as sch
Y = sch.distance.pdist( allUserVector,'cosine')
set_printoptions(threshold='nan')
print Y
But my program gives me MemoryError while calculating the distance matrix itself
File "/usr/lib/pymodules/python2.7/numpy/core/numeric.py", line 1424, in array_str
return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 306, in array2string
separator, prefix)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 210, in _array2string
format_function = FloatFormat(data, precision, suppress_small)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 392, in __init__
self.fillFormat(data)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 399, in fillFormat
non_zero = absolute(data.compress(not_equal(data, 0) & ~special))
MemoryError
Any idea how to fix this? Is my dataset too large? I would guess that clustering 14k users shouldn't be enough to cause a MemoryError.
I am running it on an i3 with 4 GB of RAM.
I need to apply DBSCAN clustering too, but that also needs a distance matrix as input.
Any suggestions appreciated.
Edit: I get the error only when I print Y. Any ideas why?

Well, hierarchical clustering doesn't make that much sense for large datasets. It's actually mostly a textbook example in my opinion. The problem with hierarchical clustering is that it doesn't really build sensible clusters. It builds a dendrogram, but with 14000 objects the dendrogram becomes pretty much unusable. And very few implementations of hierarchical clustering have non-trivial methods to extract sensible clusters from the dendrogram. Plus, in the general case, hierarchical clustering is of complexity O(n^3), which makes it scale really badly to large datasets.
DBSCAN technically does not need a distance matrix. In fact, when you use a distance matrix, it will be slow, as computing the distance matrix already is O(n^2). And even then, you can save the O(n^2) memory cost for DBSCAN by computing the distances on the fly at the cost of computing distances twice each. DBSCAN visits each point once, so there is next to no benefit from using a distance matrix except the symmetry gain. And technically, you could do some neat caching tricks to even reduce that, since DBSCAN also just needs to know which objects are below the epsilon threshold. When the epsilon is chosen reasonably, managing the neighbor sets on the fly will use significantly less memory than O(n^2) at the same CPU cost of computing the distance matrix.
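A minimal sketch of DBSCAN that computes distances on the fly, one row at a time, instead of holding the n x n matrix. It uses Euclidean distance and no index, so it is still O(n^2) in time; the function name, parameters and test data are made up for illustration.

```python
from collections import deque

import numpy as np


def dbscan(X, eps, min_pts):
    """DBSCAN without a stored distance matrix; label -1 marks noise."""
    UNVISITED, NOISE = -2, -1
    labels = np.full(len(X), UNVISITED)

    def neighbours(i):
        # One row of the distance matrix, computed on demand and discarded.
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

    cluster = -1
    for i in range(len(X)):
        if labels[i] != UNVISITED:
            continue
        nb = neighbours(i)
        if len(nb) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(nb)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            nb_j = neighbours(j)
            if len(nb_j) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb_j)
    return labels


rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(200, 2)),
    rng.normal(5.0, 0.1, size=(200, 2)),
    [[20.0, 20.0]],  # isolated point, expected to end up as noise
])
labels = dbscan(X, eps=0.5, min_pts=5)
print(np.unique(labels))
```

The neighbourhood of each point is computed once, which matches the "visits each point once" argument above.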
Any really good implementation of DBSCAN (it is spelled all uppercase, btw, as it is an abbreviation, not a scan) however should have support for index structures and then run in O(n log n) runtime.
On http://elki.dbs.ifi.lmu.de/wiki/Benchmarking they run DBSCAN on a 110250 object dataset and 8 dimensions, and the non-indexed variant takes 1446 seconds, the one with index just 219. That is about 7 times faster, including index buildup. (It's not python, however) Similarly, OPTICS is 5 times faster with the index. And their kmeans implementation in my experiments was around 6x faster than WEKA kmeans and using much less memory. Their single-link hierarchical clustering also is an optimized O(n^2) implementation. Actually the only one I've seen so far that is not the naive O(n^3) matrix-editing approach.
If you are willing to go beyond python, that might be a good choice.

It's possible that you really are running out of RAM. Finding pairwise distances between N objects means storing N^2 distances. In your case, N^2 is going to be 14039 ^ 2 = 1.97 * 10^8. If we assume that each distance takes only four bytes (which is almost certainly not the case, as they have to be held in some sort of data structure which may have non-constant overhead) that works out to 800 megabytes. That's a lot of memory for the interpreter to be working with. 32-bit architectures only allow up to 2 GB of process memory, and just your raw data is taking up around 50% of that. With the overhead of the data structure you could be looking at usage much higher than that -- I can't say how much because I don't know the memory model behind SciPy/numpy.
I would try breaking your data sets up into smaller sets, or not constructing the full distance matrix. You can break it down into more manageable chunks (say, 14 subsets of around 1000 elements) and do nearest-neighbor between each chunk and all of the vectors -- then you're looking at loading an order of magnitude less into memory at any one time (14000 * 1000, 14 times instead of 14000 * 14000 once).
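The chunking idea could look something like this: nearest neighbours under cosine distance, with only one chunk-by-n block of distances in memory at a time. This is a sketch on random data; the function name is invented, and it assumes no all-zero rows (cosine distance is undefined for those).

```python
import numpy as np


def nearest_neighbours_chunked(X, chunk_size=1000):
    """Index and cosine distance of each row's nearest other row."""
    # Normalise once so cosine similarity becomes a plain dot product.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = len(Xn)
    nn_index = np.empty(n, dtype=np.int64)
    nn_dist = np.empty(n)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a (chunk_size x n) block of distances exists at any one time.
        dist = 1.0 - Xn[start:stop] @ Xn.T
        # Exclude each row's distance to itself.
        dist[np.arange(stop - start), np.arange(start, stop)] = np.inf
        nn_index[start:stop] = dist.argmin(axis=1)
        nn_dist[start:stop] = dist.min(axis=1)
    return nn_index, nn_dist


rng = np.random.default_rng(0)
X = rng.random((14039, 10)) + 0.01  # same shape as the question's data
nn_index, nn_dist = nearest_neighbours_chunked(X)
print(nn_index[:5], nn_dist[:5])
```

With chunks of 1000 rows, the largest temporary is about 14 million values instead of about 197 million.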
Edit: agf is completely right on both counts: I missed your edit, and the problem probably comes about when it tries to construct the giant string that represents your matrix. If it's printing floating point values, and we assume 10 characters are printed per element and the string is stored with one byte per character, then you're looking at roughly 2 GB of memory usage just for the string.
