Memory Efficient Agglomerative Clustering with Linkage in Python

I want to cluster 2D points (latitude/longitude) on a map. There are about 400K points, so the input matrix is 400K x 2.
When I run scikit-learn's AgglomerativeClustering I run out of memory, even though the machine has about 500GB of RAM.
class sklearn.cluster.AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=Memory(cachedir=None), connectivity=None, n_components=None, compute_full_tree='auto', linkage='ward', pooling_func=<function mean at 0x2b8085912398>)
I also tried the memory=Memory(cachedir) option with no success. Does anybody have a suggestion (another library, or a change to the scikit-learn code) so that I can run the clustering algorithm on this data?
I have run the algorithm successfully on small datasets.
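For reference, this is roughly what I'm running; the cache path and n_clusters are placeholders, and X stands in for my real coordinate array:

import numpy as np
from joblib import Memory
from sklearn.cluster import AgglomerativeClustering

# Stand-in for the real data: ~400K (latitude, longitude) rows
X = np.random.rand(400_000, 2)

# On-disk caching of the tree computation; this did not reduce peak memory for me
memory = Memory("./agglo_cache", verbose=0)

model = AgglomerativeClustering(n_clusters=50, linkage='ward', memory=memory)
labels = model.fit_predict(X)  # this is where I run out of memory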

Related

Is there a way to cluster tweets after vectorizing them?

I need to cluster tweets based on the similarity between them. I am using doc2vec to vectorize them, and now I need a way to cluster these vectors. I tried k-means, but it wasn't a good model for me because I don't know the number of clusters. I also tried the similarity function in the gensim library, but the result is different each time and wasn't correct. So is there a way to cluster these vectors?
You need to know how many clusters you want for your particular task before applying k-means or any other clustering algorithm that requires it. And if the number of clusters is very large, some clustering algorithms like k-means will not scale well. For a large number of clusters, you could try other algorithms such as agglomerative clustering or DBSCAN.
If you only need a small number of clusters but don't know the exact number, you could use t-SNE (t-distributed Stochastic Neighbor Embedding) to get an approximate 2-D visualisation of your vectorized tweets and an idea of how many clusters you would need.
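As a rough sketch of both suggestions (the tweet vectors here are random placeholders for the real doc2vec embeddings, and the DBSCAN/t-SNE parameters would need tuning):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# Placeholder for the (n_tweets, dim) matrix of doc2vec vectors
tweet_vectors = np.random.rand(5000, 100)

# DBSCAN does not need the number of clusters up front
labels = DBSCAN(eps=0.4, min_samples=5, metric='cosine').fit_predict(tweet_vectors)

# t-SNE projection to eyeball how many clusters there roughly are
embedded = TSNE(n_components=2, perplexity=30).fit_transform(tweet_vectors)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=2)
plt.show()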

Time-series clustering in python: DBSCAN and OPTICS giving me strange results

I want to perform clustering on time-series data, using Python's scikit-learn library. First, I created a distance matrix using dynamic time warping (DTW). Then I clustered the data using the OPTICS estimator in sklearn like this:
from sklearn.cluster import OPTICS
clustering = OPTICS(min_samples=3, max_eps=0.7, cluster_method='dbscan', metric="precomputed").fit(distance_matrix)
Then I visualized these distances using MDS like the following:
from sklearn.manifold import MDS
mds = MDS(n_components=2, dissimilarity="precomputed").fit(distance_matrix)
And this is the result:
The dark blue points are the outliers and the other two colours are the clusters identified by OPTICS. I cannot understand these results: the yellow cluster doesn't make any sense. I have played with the parameters and changed them, but it always gives strange results. The same happens with DBSCAN, but with k-means and AGNES I get more reasonable clusters when I visualize them. Am I doing something wrong here?

Is k nearest neighbours regression inherently slow?

I am trying to use the k nearest neighbours implementation from scikit-learn on a fairly large dataset. The problem is that predictions take a very long time, almost as long as training, which doesn't make sense. Is it an issue with the algorithm, or with the fact that scikit-learn isn't made for large datasets (no GPU support)?
For further information, I am trying to predict lidar intensity based on x, y, z and object label. Each lidar scan has ~100,000 points, so I'm trying to predict the intensity for each point.
Things to try to make scikit-learn's KNeighborsClassifier (or KNeighborsRegressor, for this use case) run faster; a rough sketch follows the list:
try a different algorithm parameter: kd_tree or ball_tree for low-dimensional data, brute for high-dimensional data
the n_jobs parameter: using a larger n_jobs doesn't necessarily make things faster, sometimes the opposite
make sure you are using the latest version: there have been performance improvements in v0.22, and some optimizations are not yet merged (scikit-learn#14543)
use an external approximate nearest neighbours library (e.g. Annoy) together with pre-computed sparse distances, using metric="precomputed"
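For illustration, a minimal sketch of the first two points on synthetic lidar-like data (the parameter values are only guesses to start from):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for one scan: x, y, z coordinates plus an integer object label
rng = np.random.default_rng(0)
X = np.column_stack([rng.random((100_000, 3)), rng.integers(0, 10, 100_000)])
y = rng.random(100_000)  # intensity values to predict

# Low-dimensional data: a tree index plus parallel queries is usually the fast path
knn = KNeighborsRegressor(n_neighbors=5, algorithm='kd_tree', n_jobs=-1)
knn.fit(X, y)
predictions = knn.predict(X[:10_000])  # query cost dominates, so batch the queries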

Is there an efficient python implementation of spectral clustering for large, dense matrices?

Currently I'm using the spectral clustering method from sklearn on my dense 7000x7000 matrix, which performs very slowly and exceeds an execution time of 6 hours. Is there a faster implementation of spectral clustering in Python?
I'd recommend performing PCA to project the data to a lower dimensionality, and then using mini-batch k-means.
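A minimal sketch of that pipeline, assuming the 7000x7000 matrix is a plain feature matrix (n_components and n_clusters are placeholders to tune):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(7000, 7000)  # stand-in for the dense matrix

# Reduce dimensionality first, then cluster the projected data
X_reduced = PCA(n_components=50).fit_transform(X)
labels = MiniBatchKMeans(n_clusters=10, batch_size=1024, n_init=3).fit_predict(X_reduced)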

Handling K-means with large dataset 6gb with scikit-learn?

I am using scikit-learn. I want to cluster a 6GB dataset of documents and find clusters of documents.
I only have about 4GB of RAM, though. Is there a way to get k-means to handle large datasets in scikit-learn?
Thank you; please let me know if you have any questions.
Use MiniBatchKMeans together with HashingVectorizer; that way, you can learn a cluster model in a single pass over the data, assigning cluster labels as you go or in a second pass. There's an example script in the scikit-learn documentation that demonstrates MiniBatchKMeans.
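A rough sketch of that approach (read_document_chunks is a hypothetical generator standing in for whatever streams the 6GB corpus from disk, and n_clusters is a placeholder):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

def read_document_chunks():
    # Hypothetical stand-in: replace with code that yields batches of raw documents
    yield ["doc one text", "doc two text", "doc three text", "doc four text"]
    yield ["doc five text", "doc six text", "doc seven text", "doc eight text"]

# HashingVectorizer needs no fitted vocabulary, so it works in a streaming setting
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
kmeans = MiniBatchKMeans(n_clusters=3)  # use a realistic n_clusters for the real corpus

# First pass: update centroids one chunk at a time
for chunk in read_document_chunks():
    kmeans.partial_fit(vectorizer.transform(chunk))

# Second pass (or do it inside the first loop): assign cluster labels
labels = [kmeans.predict(vectorizer.transform(chunk)) for chunk in read_document_chunks()]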
Clustering is not in itself a well-defined problem (a 'good' clustering result depends on your application), and the k-means algorithm only gives locally optimal solutions that depend on the random initialization. Therefore I doubt that the results you would get from clustering a random 2GB subsample of the dataset would be qualitatively different from the results you would get clustering the entire 6GB. I would certainly try clustering on the reduced dataset as a first port of call. The next options are to subsample more intelligently, or to do multiple training runs with different subsets and perform some kind of selection/averaging across the runs.
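For example, a sketch of the subsample-then-assign idea (the shapes and n_clusters here are arbitrary placeholders):

import numpy as np
from sklearn.cluster import KMeans

# Placeholder document vectors; in practice this would be your vectorized corpus
X = np.random.rand(100_000, 300)

# Fit on a random subsample, then assign labels to everything
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=20_000, replace=False)
km = KMeans(n_clusters=50, n_init=10).fit(X[subset])
labels = km.predict(X)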
