K-means with cosine similarity in Python

I have embedded vectors and I clustered them with K-means using cosine similarity.
When I used Spark to do so, I had no problem. Now I wish to convert it into Python.

1. Is it possible to do so with scikit-learn? I couldn't find this option; it seems to support only Euclidean distance. Is that correct?
2. Do you think that scaling all vectors to lie on the unit sphere and then running the sklearn KMeans is identical to running k-means with cosine similarity?
3. If 2 is correct, then for inference on new points, should I just scale them to the unit sphere and use the sklearn KMeans prediction?
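For concreteness, here is a minimal sketch of what points 2 and 3 describe, assuming plain scikit-learn; the array names and cluster count are placeholders. Note that standard KMeans recomputes centroids as unnormalised means, so this is an approximation of "spherical" k-means rather than an exact equivalent.

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

X_train = np.random.rand(100, 16)   # placeholder embeddings
X_new = np.random.rand(5, 16)       # placeholder new points

X_unit = normalize(X_train)         # project rows onto the unit sphere
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_unit)

labels_new = km.predict(normalize(X_new))   # apply the same scaling at inference time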

Related

Is there a way to cluster tweets after vectorizing them?

I need to cluster tweets based on the similarity between them. I am using doc2vec to vectorize them and now I need a way to cluster these vectors. I also tried k-means, but it wasn't a good model for me as I don't know the number of clusters. I tried to use the similarity function in the gensim library, but the result is different each time and wasn't correct! So is there a way to cluster these?
You need to know how many clusters you want for your particular task before applying K-means or any other clustering algorithm. And if the number of clusters is very large, then some clustering algorithms like K-means will not scale well. For a large number of clusters, you could try other clustering algorithms like agglomerative clustering or DBSCAN.
If you only need a small number of clusters but don't know the exact number, you could use t-SNE (t-distributed Stochastic Neighbor Embedding) to get an approximate 2-D visualisation of your vectorized tweets and get an idea of how many clusters you would need.
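A rough sketch of that approach, assuming the vectorized tweets are already in a NumPy array; the array shape and the perplexity value are placeholders, not recommendations.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

vectors = np.random.rand(500, 100)   # stand-in for the doc2vec tweet vectors

emb_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)

plt.scatter(emb_2d[:, 0], emb_2d[:, 1], s=5)
plt.title("t-SNE projection of tweet vectors")
plt.show()

Eyeballing the number of visible blobs in the scatter plot then gives a starting value for the cluster count.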

Time-series clustering in python: DBSCAN and OPTICS giving me strange results

I want to perform clustering on time-series data. I use Python's sklearn library for the project. First, I created a distance matrix using dynamic time warping (DTW). Then I clustered the data using the OPTICS function in sklearn like this:
clustering = OPTICS(min_samples=3, max_eps=0.7, cluster_method='dbscan', metric="precomputed").fit(distance_matrix)
Then I visualized these distances using MDS like the following:
mds = MDS(n_components=2, dissimilarity="precomputed").fit(distance_matrix)
And this is the result:
The dark blue points are the outliers and the other two colours are the clusters identified by OPTICS. I cannot understand these results. The yellow points cluster doesn't make any sense. I played with the parameters and changed them, but it always gives strange results. It is the same when I use DBSCAN, but for K-means and AGNES I get more reasonable clusters when I visualize them. Am I doing something wrong here?
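For reference, a self-contained sketch of the pipeline being described, with the DTW matrix faked by a random symmetric matrix just so the code runs end to end; the point is to colour the MDS scatter by the OPTICS labels so the two views can be compared directly. Note that max_eps is in the same units as the entries of the distance matrix, so 0.7 only makes sense if the DTW distances are on that scale.

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d = rng.random((50, 50))
distance_matrix = (d + d.T) / 2        # symmetric stand-in for the DTW matrix
np.fill_diagonal(distance_matrix, 0.0)

labels = OPTICS(min_samples=3, max_eps=0.7, cluster_method='dbscan',
                metric="precomputed").fit_predict(distance_matrix)

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(distance_matrix)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=20)
plt.title("MDS embedding coloured by OPTICS labels (-1 = noise)")
plt.show()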

Clustering with Tensorflow 2

I have a dataset of 1600 points in 3-D. I need to use a clustering algorithm (like K-means) to cluster them into, for example, 10 different clusters. I want to use the Euclidean distance to do this clustering. How can I do this in TensorFlow 2?
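One straightforward option is to write Lloyd's algorithm directly with TF2 tensor ops; a rough sketch follows, where the function name, cluster count, and iteration count are arbitrary choices, and empty clusters are not handled.

import tensorflow as tf

def kmeans_tf(points, k=10, iterations=100, seed=0):
    """Plain Lloyd's algorithm with Euclidean distance, written in TF2 ops."""
    n = tf.shape(points)[0]
    # initialise centroids from k randomly chosen points
    idx = tf.random.shuffle(tf.range(n), seed=seed)[:k]
    centroids = tf.gather(points, idx)
    for _ in range(iterations):
        # squared Euclidean distance of every point to every centroid
        dists = tf.reduce_sum(
            tf.square(tf.expand_dims(points, 1) - tf.expand_dims(centroids, 0)),
            axis=2)
        assignments = tf.argmin(dists, axis=1)
        # recompute each centroid as the mean of its assigned points
        # (a cluster that loses all its points is not re-seeded here)
        centroids = tf.math.unsorted_segment_mean(points, assignments, k)
    return assignments, centroids

points = tf.random.uniform((1600, 3))   # stand-in for the 1600 3-D points
labels, centers = kmeans_tf(points, k=10)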

Explained variance from scikit-learn MDS

Is there a way to calculate the explained variance (eigenvalues) from scikit-learn's MDS? I've seen this thread, but I think scikit-learn's MDS is a "non-classical" form of MDS, so I'm guessing it wouldn't work? Is there a way to compute the explained variance from scikit-learn's implementation of MDS?
Also, if I'm using a precomputed dissimilarity matrix for scikit-learn's MDS, is it then running classical MDS? Based on the code it seems like it's still running the SMACOF algorithm regardless (rather than eigendecomposition)?
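If eigenvalues are the goal, classical MDS (also called PCoA) can be computed by hand from a precomputed dissimilarity matrix, independently of scikit-learn's SMACOF-based implementation. A sketch, where D is a hypothetical square matrix of pairwise distances:

import numpy as np

def classical_mds_eigenvalues(D):
    """Classical MDS: double-centre the squared distances and eigendecompose;
    the positive eigenvalues play the role of explained variance per component."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred Gram matrix
    eigvals = np.linalg.eigvalsh(B)[::-1]        # descending order
    pos = eigvals[eigvals > 0]
    return pos, pos / pos.sum()                  # eigenvalues, explained ratio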

Normalising Data to use Cosine Distance in Kmeans (Python)

I am currently solving a problem where I have to use cosine distance as the distance measure for k-means clustering. However, the standard KMeans implementation (from the sklearn package) uses Euclidean distance by default and does not allow you to change this.
Therefore, it is my understanding that by normalising my original dataset with the code below, I can then run KMeans (using Euclidean distance) and it will be the same as if I had changed the distance metric to cosine distance?
from sklearn import preprocessing, cluster  # preprocessing to normalise the existing X, cluster for KMeans
X_Norm = preprocessing.normalize(X)  # L2-normalise each row onto the unit sphere
km2 = cluster.KMeans(n_clusters=5, init='random').fit(X_Norm)
Please let me know if my mathematical understanding of this is incorrect.
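A quick numerical check of the identity behind this reasoning: for unit-norm vectors u and v, ||u - v||^2 = 2 * (1 - cosine_similarity(u, v)), so Euclidean k-means on L2-normalised data ranks pairs of points the same way cosine distance does. The vectors below are arbitrary stand-ins.

import numpy as np

rng = np.random.default_rng(0)
u, v = rng.random(8), rng.random(8)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)   # project onto the unit sphere

squared_euclidean = np.sum((u - v) ** 2)
cosine_distance = 1.0 - u @ v
print(np.isclose(squared_euclidean, 2 * cosine_distance))   # True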
