I have a dataset of 1600 points in 3D. I need to use a clustering algorithm (like K-means) to group them into, for example, 10 clusters, using the Euclidean distance. How can I do this in TensorFlow 2?
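One way is to implement Lloyd's algorithm directly with plain TensorFlow 2 ops. Below is a minimal sketch, assuming points is a float32 tensor of shape (1600, 3); the helper name kmeans_tf, the random initialization and the fixed iteration count are illustrative choices rather than a built-in API.

import tensorflow as tf

def kmeans_tf(points, k=10, iterations=100):
    # Pick k random points as the initial centroids.
    idx = tf.random.shuffle(tf.range(tf.shape(points)[0]))[:k]
    centroids = tf.gather(points, idx)
    for _ in range(iterations):
        # Squared Euclidean distance of every point to every centroid, shape (n_points, k).
        dists = tf.reduce_sum(
            tf.square(tf.expand_dims(points, 1) - tf.expand_dims(centroids, 0)), axis=2)
        assignments = tf.argmin(dists, axis=1)
        # Move each centroid to the mean of its assigned points
        # (empty clusters are not handled in this sketch).
        centroids = tf.stack([
            tf.reduce_mean(tf.boolean_mask(points, tf.equal(assignments, c)), axis=0)
            for c in range(k)])
    return centroids, assignments

points = tf.random.normal((1600, 3))   # stand-in for the real data
centroids, labels = kmeans_tf(points, k=10)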
Related
I have been clustering embedding vectors with K-means using cosine similarity. When I used Spark to do so, I had no problem. Now I wish to convert it into Python.
1. Is it possible to do so with scikit-learn? I couldn't find this option; it seems to support only Euclidean distance. Is that correct?
2. Do you think that scaling all vectors to lie on the unit sphere and then running sklearn's KMeans is identical to running k-means with cosine similarity?
3. If 2 is correct, then for inference on new points, should I just scale them to the unit sphere and use the sklearn KMeans prediction? (See the sketch below.)
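Regarding 2 and 3: for unit-length vectors the squared Euclidean distance equals 2 − 2·(cosine similarity), so assigning points to the nearest centroid matches the cosine criterion as long as the centroids are also unit-length; standard KMeans does not re-normalize its centroids, so this is a close approximation to spherical k-means rather than an exact equivalent. A minimal sketch with placeholder data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

X = np.random.rand(1000, 128)             # stand-in for the embedded vectors
X_unit = normalize(X)                      # project every vector onto the unit sphere

km = KMeans(n_clusters=10, random_state=0).fit(X_unit)

# Inference on new points: apply the exact same normalization, then predict.
X_new = np.random.rand(5, 128)
labels_new = km.predict(normalize(X_new))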
I need to cluster tweets based on the similarity between them. I am using doc2vec to vectorize them, and now I need a way to cluster these vectors. I tried k-means, but it wasn't a good fit for me because I don't know the number of clusters. I also tried the similarity functions in the gensim library, but the result is different each time and wasn't correct. So is there a way to cluster these?
You need to know how many clusters you want for your particular task before applying K-means or any other algorithm that takes the number of clusters as an input. And if that number is very large, some algorithms like K-means will not scale well; for a large number of clusters you could try alternatives such as agglomerative clustering or DBSCAN.
If you only need a small number of clusters but don't know the exact number, you could use t-SNE (t-distributed Stochastic Neighbor Embedding) to get an approximate 2-D visualisation of your vectorized tweets and an idea of how many clusters you would need.
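For example, a quick projection of the tweet vectors might look like the sketch below; tweet_vectors stands in for the doc2vec output, and the perplexity value is just a starting point to tune:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tweet_vectors = np.random.rand(500, 100)   # stand-in for the doc2vec vectors

# Project to 2-D; perplexity is worth tuning (roughly 5-50).
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(tweet_vectors)

plt.scatter(embedded[:, 0], embedded[:, 1], s=5)
plt.title("t-SNE projection of tweet vectors")
plt.show()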
I want to perform clustering on time-series data, using Python's scikit-learn library. First, I created a distance matrix using dynamic time warping (DTW). Then I clustered the data with sklearn's OPTICS estimator, like this:
clustering = OPTICS(min_samples=3, max_eps=0.7, cluster_method='dbscan', metric="precomputed").fit(distance_matrix)
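(For reference, the DTW step that produces distance_matrix could look like the sketch below; tslearn's cdist_dtw is used here only as an assumption, since the question does not name a DTW library.)

import numpy as np
from tslearn.metrics import cdist_dtw

series = np.random.rand(50, 100)           # stand-in: 50 series of length 100

# Square matrix of pairwise DTW distances, which is what metric="precomputed" expects above.
distance_matrix = cdist_dtw(series)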
Then I visualized these distances using MDS, like the following:
mds = MDS(n_components=2, dissimilarity="precomputed").fit(distance_matrix)
And this is the result:
The dark blue points are the outliers and the other two colours are the clusters identified by OPTICS. I cannot understand these results; the yellow cluster doesn't make any sense. I played with the parameters, but it always gives strange results. The same happens when I use DBSCAN, but with K-means and AGNES I get more reasonable clusters when I visualize them. Am I doing something wrong here?
I have a similarity matrix that I have calculated between a large number of objects, and each object can have a non-zero similarity with any other object. I generated this matrix for another task, and would now like to cluster it for a new analysis.
It seems like scikit's spectral clustering method could be a good fit, because I can pass in a precomputed affinity matrix. I also know that spectral clustering typically uses some number of nearest neighbors when building the affinity matrix, and my similarity matrix does not have that same constraint.
If I pass in a matrix that allows any number of edges between nodes in the affinity matrix, will scikit limit each node to having only a certain number of nearest neighbors? If not, I guess I will have to make that change to my pre-computed affinity matrix.
You don't have to compute the affinity yourself to do spectral clustering; sklearn does that for you.
When you call sc = SpectralClustering(), the affinity parameter allows you to choose the kernel used to compute the affinity matrix. rbf is the default kernel and does not use a particular number of nearest neighbours. However, if you decide to choose another kernel, you may want to specify that number with the n_neighbors parameter.
You can then use sc.fit_predict(your_matrix) to compute the clusters.
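For the precomputed case asked about above, a minimal sketch (assuming similarity_matrix is square, symmetric and non-negative):

import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
A = rng.random((100, 100))
similarity_matrix = (A + A.T) / 2          # stand-in: symmetric, non-negative

# With affinity="precomputed" the matrix is used as-is; scikit-learn does not
# sparsify it to a fixed number of nearest neighbours in this mode.
sc = SpectralClustering(n_clusters=5, affinity="precomputed")
labels = sc.fit_predict(similarity_matrix)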
Spectral clustering does not require a sparsified matrix.
But if I'm not mistaken, it is faster to find the eigenvectors corresponding to the smallest non-zero eigenvalues of a sparse matrix than of a dense one. The worst case may remain O(n^3), though; spectral clustering is one of the slowest methods you can find.
I have an array of TF-IDF feature vectors. I'd like to find similar vectors in the array using two methods:
Cosine similarity
k-means clustering
Using Scikit Learn, this process is pretty simple.
Now I'd like to weight certain features so that they will influence the results more than the other features. For example, I might like to weight the first 100 elements of the TF-IDF vectors so that those features are more indicative of similarity than the rest of the features.
How can I meaningfully weight certain features in my feature vectors? Is the process for weighting certain features the same for each of the similarity algorithms I listed above?
As I understand it, low values in the TF-IDF matrix mean that the words are less significant, so one approach is to lower the values in the columns you consider less important, which effectively gives the remaining columns more weight.
The arrays in scikit-learn are sparse, so for testing and debugging you might want to convert them to a dense matrix. I also used xlsxwriter to get an overview of what is really happening when applying TF-IDF and KMeans++ (see https://www.dbc-enterprise-it-consulting.com/text-classifier/).
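Putting the column-scaling idea into code, here is a sketch with dense placeholder data (a real TF-IDF matrix from, e.g., TfidfVectorizer is sparse, so you would either convert it with .toarray() first or scale the columns in sparse form); the factor 2.0 and the 100-column split mirror the example in the question:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

X = np.random.rand(200, 500)               # stand-in for a dense TF-IDF matrix

weights = np.ones(X.shape[1])
weights[:100] = 2.0                         # emphasize the first 100 features
X_weighted = X * weights                    # column-wise scaling

# The same weighted matrix feeds both methods, so the weighting step is
# identical for cosine similarity and for k-means.
sims = cosine_similarity(X_weighted)
labels = KMeans(n_clusters=10, random_state=0).fit_predict(X_weighted)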