is there a way to cluster tweets after vectorizing them? - python

I need to cluster tweets based on the similarity between them. I am using doc2vec to vectorize them, and now I need a way to cluster these vectors. I tried k-means, but it wasn't a good model for me because I don't know the number of clusters. I also tried the similarity function in the gensim library, but the result is different each time and wasn't correct! So is there a way to cluster these?

K-means needs the number of clusters up front, so it is a poor fit when you don't know that number. And if the number of clusters is very large, algorithms like K-means may not scale well. In either case you could try agglomerative clustering (cutting the tree at a distance threshold instead of a fixed cluster count) or DBSCAN, which infers the number of clusters from its density parameters.
If you only need a small number of clusters but don't know exactly how many, you could use t-SNE (t-distributed Stochastic Neighbor Embedding) to get an approximate 2-D visualisation of your vectorized tweets and an idea of how many clusters you would need.
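As a rough illustration of clustering without fixing the number of clusters, here is a minimal sketch using scikit-learn's AgglomerativeClustering with a distance cutoff. The random vectors stand in for doc2vec output, and the threshold of 0.5 is a made-up value that would need tuning on real tweets.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

# Stand-in for doc2vec output: three tight groups of 50-dimensional vectors.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 50))
vectors = np.vstack([c + 0.05 * rng.normal(size=(20, 50)) for c in centers])

# L2-normalise so that Euclidean distance tracks cosine similarity.
unit_vectors = normalize(vectors)

# n_clusters=None plus distance_threshold lets the merge cutoff,
# not a preset k, decide how many clusters come out.
model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.5,  # hypothetical value; tune on your own data
    linkage="average",
)
labels = model.fit_predict(unit_vectors)
n_found = model.n_clusters_
print(n_found, labels[:5])
```

Lowering the threshold gives more, tighter clusters; raising it merges them. DBSCAN could be swapped in the same way, with eps playing a similar role.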

Related

Calculate Silhouette coefficient for each sample in PySpark

I have a Spark ML pipeline in pyspark that looks like this,
from pyspark.ml import Pipeline, clustering
from pyspark.ml.feature import PCA, StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
pca = PCA(inputCol=scaler.getOutputCol(), outputCol="pca_output")
kmeans = clustering.KMeans(seed=2014)
pipeline = Pipeline(stages=[scaler, pca, kmeans])
After training the model, I want to get the silhouette coefficient for each sample, just like this function in sklearn does.
I know that I can use ClusteringEvaluator to generate a score for the whole dataset, but I want a score for each sample instead.
How can I achieve this efficiently in pyspark?
This has been explored before on Stack Overflow. What I would add to that answer is that you can use LSH (locality-sensitive hashing), which is part of Spark. It essentially does blind clustering with a reduced set of dimensions: it reduces the number of comparisons and lets you specify a 'boundary' (density limit) for your clusters, so it could be a good tool for enforcing the level of density you are interested in. You could run KMeans first and use the centroids as input to the approximate join, or do it the other way round to help you pick the number of KMeans points to look at.
I found this link helpful to understand the LSH.
All that said, you could partition the data by KMeans cluster and then run silhouette on a sample of each partition (via mapPartitions), then apply the sample score to the entire group. Here's a good explanation of how samples are taken so you don't have to start from scratch. I would assume that really dense clusters would be underscored by silhouette samples, so this may not be a perfect way of going about things, but it would still be informative.
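To make the per-sample computation concrete, here is a minimal NumPy sketch of the silhouette coefficient for each row of a small collected sample. It is not Spark code: the function name is made up for illustration, and it assumes the sample contains points from at least two clusters, since the score needs distances to the nearest other cluster.

```python
import numpy as np

def silhouette_per_sample(points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every row.

    Assumes at least two distinct labels are present in the sample.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distances: fine for a sample, not a whole dataset.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    unique = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the same cluster
        a = dists[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean() for c in unique if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

sample = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
sample_labels = np.array([0, 0, 1, 1])
scores = silhouette_per_sample(sample, sample_labels)
print(scores)
```

The pairwise distance matrix grows quadratically with the sample size, which is why this only makes sense on a sample and not on the full dataset.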

Time-series clustering in python: DBSCAN and OPTICS giving me strange results

I want to perform clustering on time-series data. I use Python's scikit-learn library for the project. At first, I created a distance matrix by using dynamic time warping (DTW). Then I clustered the data using the OPTICS function in sklearn like this:
clustering = OPTICS(min_samples=3, max_eps=0.7, cluster_method='dbscan', metric="precomputed").fit(distance_matrix)
Then I visualized these distances using MDS like the following:
mds = MDS(n_components=2, dissimilarity="precomputed").fit(distance_matrix)
And this is the result:
The dark blue points are the outliers and the other two are the clusters identified by optics. I cannot understand these results. The yellow points cluster doesn't make any sense. I played with numbers and changed them but it always gives strange results. This is the same when I use DBSCAN but for K-MEANS and AGNES, I get more reasonable clusters when I visualize them. Am I doing something wrong here?

what follows after clustering

I am trying to cluster images based on their similarities with SIFT and Affinity Propagation. I did the clustering, but I don't want to just visualize the results. How can I test with a random image from the obtained labels? Or maybe there's more to it?
Other than data visualization, I just don't know what follows after clustering. How do I verify the 'clustering'?
Since clustering is unsupervised, there isn't an objective way to evaluate it without ground truth. Typically, you just observe the clusters and see whether some features are characteristic of a certain cluster.
If you have ground-truth cluster labels, you can measure the Jaccard index or something along those lines to get an error score (for example, one minus the index). Then you can tweak your distance measure, parameters, etc. to minimize the error score.
You can also do some clustering in order to group your data as the divide step in divide-and-conquer algorithms/applications.
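For the ground-truth case, one possible reading of the Jaccard index is the pair-counting variant: count pairs of samples that are grouped together in both labelings, and divide by the pairs grouped together in at least one. The sketch below assumes that variant; the function name and the toy labels are made up for illustration.

```python
from itertools import combinations

def pair_jaccard(true_labels, predicted_labels):
    """Jaccard index over pairs of samples that share a cluster."""
    both = only_true = only_pred = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        if same_true and same_pred:
            both += 1
        elif same_true:
            only_true += 1
        elif same_pred:
            only_pred += 1
    total = both + only_true + only_pred
    # If no pair is ever grouped together, the two labelings agree trivially.
    return both / total if total else 1.0

truth = [0, 0, 0, 1, 1, 1]
found = [5, 5, 9, 9, 7, 7]  # cluster ids need not match the truth ids
score = pair_jaccard(truth, found)
print(score)
```

Because only co-membership is compared, the score does not depend on how the clusters happen to be numbered. The loop is quadratic in the number of samples, so it suits small labeled subsets.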

How to implement a sklearn -AgglomerativeClustering from clusters?

I'd like to perform a "mixed" unsupervised clustering which uses first a KMeans algorithm to generate a certain number of first small and homogeneous clusters and THEN apply a hierarchical clustering on these clusters I get from Kmeans.
I used cluster.KMeans from scikit-learn for the first part, and I have my clusters, but I don't know how to use the AgglomerativeClustering function from sklearn so that it starts from those clusters.
Any ideas?
Thank you !
You also get the labels from KMeans.
These give you the partitions.
Just see the manual.
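One way this two-stage idea could look, as a sketch and not the only option: run the hierarchical step on the KMeans centroids and let each point inherit the group of its KMeans cluster via the labels. Clustering the centroids is an assumption made here, and the toy data and cluster counts are made up for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs standing in for the real feature matrix.
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(100, 2)),
])

# Step 1: over-segment into many small, homogeneous clusters.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(data)

# Step 2: hierarchically cluster the 10 centroids instead of the raw points.
agglo = AgglomerativeClustering(n_clusters=2, linkage="ward")
centroid_groups = agglo.fit_predict(kmeans.cluster_centers_)

# Step 3: each point inherits the group of its KMeans cluster.
final_labels = centroid_groups[kmeans.labels_]
print(np.bincount(final_labels))
```

Note that this treats every centroid equally, so a KMeans cluster with 5 points counts as much as one with 500 in the hierarchical step.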

How to explain clustering results?

Say I have a high dimensional dataset which I assume to be well separable by some kind of clustering algorithm. And I run the algorithm and end up with my clusters.
Is there any sort of way (preferably not "hacky" or some kind of heuristic) to explain "what features and thresholds were important in making members of cluster A (for example) part of cluster A?"
I have tried looking at cluster centroids but this gets tedious with a high dimensional dataset.
I have also tried fitting a decision tree to my clusters and then looking at the tree to determine which decision path most of the members of a given cluster follow. I have also tried fitting an SVM to my clusters and then using LIME on the closest samples to the centroids in order to get an idea of what features were important in classifying near the centroids.
However, both of these latter 2 ways require the use of supervised learning in an unsupervised setting and feel "hacky" to me, whereas I'd like something more grounded.
Have you tried using PCA or some other dimensionality reduction technique and checking whether the clusters still hold? Sometimes relationships still exist in lower dimensions (caveat: this doesn't always help one's understanding of the data). There is a cool article about visualizing MNIST data: http://colah.github.io/posts/2014-10-Visualizing-MNIST/. I hope this helps a bit.
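A rough sketch of that check, under the assumption that k-means is the clustering algorithm in use: cluster in the full space, cluster again after PCA, and compare the two labelings with the adjusted Rand index. The toy data, the choice of 2 components, and the use of loadings as a hint about feature importance are all illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Toy 20-dimensional data: three blobs around random centers.
centers = rng.normal(scale=5.0, size=(3, 20))
data = np.vstack([c + rng.normal(size=(50, 20)) for c in centers])

# Cluster in the original space.
labels_full = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

# Cluster again after projecting onto the first two principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)
labels_reduced = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# 1.0 means the two partitions are identical, values near 0 mean chance level.
agreement = adjusted_rand_score(labels_full, labels_reduced)

# Large absolute loadings hint at which original features drive a component.
top_features = np.argsort(np.abs(pca.components_[0]))[::-1][:5]
print(agreement, top_features)
```

If the agreement stays high, the few components (and the features that load on them) carry most of what separates the clusters; if it drops, the separation lives in directions PCA discarded.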
Do not treat the clustering algorithm as a black box.
Yes, k-means uses centroids. But most algorithms for high-dimensional data don't (and don't use k-means!). Instead, they will often select some features, projections, subspaces, manifolds, etc. So look at what information the actual clustering algorithm provides!
