How to implement sklearn AgglomerativeClustering from clusters? - python

I'd like to perform a "mixed" unsupervised clustering that first uses a KMeans algorithm to generate a certain number of small, homogeneous clusters and THEN applies a hierarchical clustering to the clusters I get from KMeans.
I used cluster.KMeans from scikit-learn for the first part, and I have my clusters, but I don't know how to use the AgglomerativeClustering function from sklearn so that it starts from those clusters.
Any ideas?
Thank you!

You also get the labels from KMeans (the labels_ attribute of the fitted model).
These give you the partitions.
Just see the manual.
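One possible way to chain the two steps is sketched below. It is only an illustration under assumptions not in the original question: the data is synthetic (make_blobs), the cluster counts (30 and 4) are arbitrary, and the hierarchical step is run on the KMeans centroids rather than on the raw samples.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the real feature matrix (an assumption).
X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

# Step 1: over-segment with KMeans into many small clusters.
kmeans = KMeans(n_clusters=30, n_init=10, random_state=0).fit(X)

# Step 2: run hierarchical clustering on the 30 centroids, not on the raw points.
agglo = AgglomerativeClustering(n_clusters=4, linkage="ward")
centroid_labels = agglo.fit_predict(kmeans.cluster_centers_)

# Step 3: give every sample the merged cluster of its KMeans centroid.
final_labels = centroid_labels[kmeans.labels_]
print(final_labels[:10])
```

Note that clustering the centroids treats every KMeans cluster as a single point, so cluster sizes are ignored in the merge step; whether that is acceptable depends on the data.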

Related

Calculate Silhouette coefficient for each sample in PySpark

I have a Spark ML pipeline in pyspark that looks like this:
from pyspark.ml import Pipeline, clustering
from pyspark.ml.feature import PCA, StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pca = PCA(inputCol=scaler.getOutputCol(), outputCol="pca_output")
kmeans = clustering.KMeans(seed=2014)
pipeline = Pipeline(stages=[scaler, pca, kmeans])
After training the model, I wanted to get the silhouette coefficient for each sample, just like the silhouette_samples function in sklearn.
I know that I can use ClusteringEvaluator and generate scores for the whole dataset. But I want to do it for each sample instead.
How can I achieve this efficiently in pyspark?
This has been explored before on Stack Overflow. What I would change about that answer, and would supplement, is that you can use LSH, which is part of Spark. This essentially does blind clustering with a reduced set of dimensions. It reduces the number of comparisons and allows you to specify a 'boundary' (density limit) for your clusters. It could be used as a good tool to enforce a level of density that you are interested in. You could run KMeans first and use the centroids as input to the approximate join, or vice versa, to help you pick the number of KMeans points to look at.
I found this link helpful for understanding LSH.
All that said, you could partition the data by each KMeans cluster and then run silhouette on a sample of the partitions (via mapPartitions). Then apply the sample score to the entire group. Here's a good explanation of how samples are taken so you don't have to start from scratch. I would assume that really dense clusters would be under-scored by silhouette samples, so this may not be a perfect way of going about things. But it would still be informative.
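The sampling idea can be sketched locally with scikit-learn. This is not PySpark code: it assumes a sample small enough to fit in memory has already been collected (for example on the driver), and the data, sample size, and variable names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the PCA output of the pipeline (an assumption).
X, _ = make_blobs(n_samples=2000, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Draw a random sample instead of scoring every row.
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X), size=300, replace=False)
sample_labels = labels[sample_idx]

# One silhouette coefficient per sampled row.
sample_scores = silhouette_samples(X[sample_idx], sample_labels)

# Average sampled score per cluster, to be applied to the whole group.
cluster_mean = {
    int(c): float(sample_scores[sample_labels == c].mean())
    for c in np.unique(sample_labels)
}
print(cluster_mean)
```

The per-cluster averages are only estimates; the sample has to contain points from at least two clusters for the silhouette to be defined.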

Is there a way to cluster tweets after vectorizing them?

I need to cluster tweets based on the similarity between them. I am using doc2vec to vectorize them, and now I need a way to cluster these vectors. I tried kmeans, but it wasn't a good model for me because I don't know the number of clusters. I also tried the similarity function in the gensim library, but the result is different each time and wasn't correct! So is there a way to cluster these?
K-means needs the number of clusters to be chosen before it is applied, so for your particular task you need at least a rough idea of how many clusters you want. If the number of clusters is very large, some clustering algorithms like K-means will not scale well. For a large or unknown number of clusters, you could try other clustering algorithms like agglomerative clustering or DBSCAN.
If you only need a small number of clusters but don't know the exact number, you could use t-SNE (t-distributed Stochastic Neighbor Embedding) to get an approximate 2-D visualisation of your vectorized tweets and an idea of how many clusters you would need.
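As a rough illustration of clustering without fixing the number of clusters in advance, the sketch below runs scikit-learn's AgglomerativeClustering with a distance threshold and DBSCAN on synthetic 2-D points. The data and the values of distance_threshold, eps, and min_samples are assumptions chosen for this toy example; real doc2vec vectors would need their own tuning and probably a cosine-based metric.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Three well-separated synthetic groups standing in for tweet vectors (an assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Agglomerative clustering: stop merging at a distance threshold
# instead of asking for a fixed number of clusters.
agglo = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="average"
).fit(X)
n_agglo = agglo.n_clusters_

# DBSCAN: clusters are dense regions; the label -1 marks noise points.
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
n_dbscan = len(set(db.labels_.tolist()) - {-1})

print(n_agglo, n_dbscan)
```

Both algorithms trade the choice of a cluster count for the choice of a distance scale, so some knowledge of how far apart similar tweets are is still needed.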

Time-series clustering in python: DBSCAN and OPTICS giving me strange results

I want to perform clustering on time-series data. I use Python's scikit-learn library for the project. First, I created a distance matrix using dynamic time warping (DTW). Then I clustered the data using the OPTICS function in sklearn like this:
clustering = OPTICS(min_samples=3, max_eps=0.7, cluster_method='dbscan', metric="precomputed").fit(distance_matrix)
Then I visualized these distances using MDS like the following:
mds = MDS(n_components=2, dissimilarity="precomputed").fit(distance_matrix)
And this is the result:
The dark blue points are the outliers and the other two colours are the clusters identified by OPTICS. I cannot understand these results. The cluster of yellow points doesn't make any sense. I played with the parameters and changed them, but it always gives strange results. The same happens when I use DBSCAN, but with K-Means and AGNES I get more reasonable clusters when I visualize them. Am I doing something wrong here?

Do I need to extract feature vectors from MNIST before using KMeans?

I am practicing with MNIST using sklearn.cluster.KMeans.
Intuitively, I just fit the training data with the sklearn function, but I got pretty low accuracy. I am wondering what step I have missed. Should I extract feature vectors with PCA first? Or should I use a larger n_clusters?
from sklearn import cluster
from sklearn.metrics import accuracy_score
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
clf.fit(X_train)
y_pred=clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
I got a poor 0.137 as the result. Any recommendations? Thanks!
How are you passing the images in? Are the pixels flattened or kept in the 2D format? Are the pixels normalized to between 0 and 1?
As you are running clustering, I would advise against PCA regardless and instead opt for t-SNE, which keeps neighbourhood info, but you should not need to do so before running K-Means.
The best way to debug is to see what your fitted model is predicting as the clusters. You can see an example here:
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
With this info, you can get an idea of where mistakes might be. Good luck!
Adding a note: K-Means is also probably not the best model for your purposes. It is best suited to clustering data in unsupervised contexts, whereas MNIST is a classification use case. KNN would be a better option while still allowing you to experiment with neighbours and such.
Here is an example I created with KNN: https://gist.github.com/andrew-x/0bb997b129647f3a7b7c0907b7e836fc
Unless I'm missing something: you are comparing clustering labels, which are arbitrarily numbered 0-9, to ground-truth labels, whose numbers 0-9 are meaningful. The 0s in your data might not end up in cluster number 0, yet this is the comparison you make. Clustering results are evaluated differently because of this. Some options to get a correct evaluation:
Generate a contingency matrix and plot it
Calculate the adjusted Rand index
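Both options are available in scikit-learn. The sketch below is a minimal illustration that uses the small built-in digits dataset (load_digits) as a stand-in for MNIST, which is an assumption made to keep the example self-contained.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# 8x8 digit images, already flattened to 64 features per sample.
X, y = load_digits(return_X_y=True)

# Cluster ids are arbitrary, so they cannot be compared to y with accuracy_score.
pred = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X)

# Rows are true digits, columns are cluster ids; a good clustering
# has one dominant entry per row, in no particular column order.
cm = contingency_matrix(y, pred)
print(cm)

# The adjusted Rand index is invariant to how the clusters are numbered.
ari = adjusted_rand_score(y, pred)
print(f"ARI: {ari:.3f}")
```

The contingency matrix can be plotted as a heatmap to see which digits are mixed together in the same cluster.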

Agglomerative with even sized clusters

Is there a way to make agglomerative clustering in sklearn create evenly sized clusters?
I see that there's an optional connectivity parameter, but I'm not sure how to use it or whether it helps.
Thanks,
Miki
