I have the following dataframe:
print(df)
document embeddings
1 [-1.1132643 , 0.793635 , 0.8664889]
2 [-1.1132643 , 0.793635 , 0.8664889]
3 [-0.19276126, -0.48233205, 0.17549737]
4 [0.2080252 , 0.01567003, 0.0717131]
I want to cluster and visualize them to see the similarities between the documents. What is the best method/steps to do this?
This is just a small dataframe, the original dataframe has more than 20k documents.
Document vectors in your case reside in a 768-dimensional Euclidean space, meaning that in a 768-dimensional coordinate space each point represents a document. Assuming these embeddings have been trained correctly, it is safe to assume that contextually similar documents will be closer to each other in this space than dissimilar ones. This lets you apply a clustering method to group similar documents together.
For clustering, you can use multiple clustering techniques such as -
K-means (clusters based on Euclidean distances)
DBSCAN (clusters based on the notion of density)
Gaussian mixtures (clusters based on a mixture of k Gaussians)
You can use the silhouette score to find the number of clusters that gives the clustering algorithm the best separation between clusters.
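A minimal sketch of that selection step (my own illustration, not from the original answer; embeddings is a placeholder name for your (20k, 768) array of document vectors):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

embeddings = np.random.rand(1000, 768)   # placeholder: replace with your real document vectors

best_k, best_score = None, -1.0
for k in range(2, 11):                   # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score

print("best k by silhouette score:", best_k)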
For visualization, you can ONLY plot in 2D or 3D space. This means you will have to use a dimensionality reduction method to reduce the 768 dimensions down to 2 or 3.
This can be achieved with the following algorithms set to 2 or 3 components -
PCA
t-SNE
LDA (requires labels)
Once you have clustered the data AND reduced its dimensionality separately, you can use matplotlib to plot each point in 2D/3D space and color each point based on its cluster label (0 to k-1) to visualize documents and clusters.
#process flow
(20k,768) -> K-clusters    (20k,1) ---|
                                      |--- Visualize (3 axes, k colors)
(20k,768) -> Dim reduction (20k,3) ---|
Here is an example of the goal you are trying to achieve: a scatter plot of the first 2 components of the data from t-SNE, where each color represents a cluster created by your clustering method of choice (deciding the number of clusters using the silhouette score).
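As a rough sketch of this flow (again my own illustration; embeddings is a placeholder name for your document vectors):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

embeddings = np.random.rand(1000, 768)   # placeholder: replace with your real (20k, 768) array

# cluster in the full 768-dimensional space
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)

# separately, reduce to 2 dimensions purely for plotting
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE projection colored by K-means cluster")
plt.show()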
EDIT: Alternatively, you can apply dimensionality reduction first to project your 768-dimensional data into a 3D or 2D space and THEN cluster using a clustering method. This reduces the amount of computation you have to handle, since you are now clustering on only 3 dimensions instead of 768, but at the cost of information that might help you discriminate clusters better.
#process flow
                                     |--------------------------|
(20k,768) -> Dim reduction (20k,3) --|                          |--- Visualize
                                     |--- K-clusters (20k,1) ---|
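A rough sketch of this alternative flow (my own illustration, with embeddings again as a placeholder name):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

embeddings = np.random.rand(1000, 768)   # placeholder: replace with your real document vectors

# reduce first, then cluster in the low-dimensional space
reduced = PCA(n_components=3).fit_transform(embeddings)                        # (n, 3)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(reduced)  # cluster ids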
Related
I have the following dataset with 10 variables:
I want to identify clusters in this multidimensional dataset, so I tried the k-means clustering algorithm with the following code:
clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
data['clusters'] = clustering_kmeans.fit_predict(data)
In order to plot the result I used PCA for dimensionality reduction:
reduced_data = PCA(n_components=2).fit_transform(data)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])
sns.scatterplot(x="pca1", y="pca2", hue=data['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()
And in the end I get the following result:
So I have the following questions:
1.) This PCA plot looks really weird, splitting the whole dataset into two corners of the plot. Is that even correct, or did I code something wrong?
2.) Is there another algorithm for clustering multidimensional data? I looked at this but cannot find an appropriate algorithm for clustering multidimensional data... How do I even implement e.g. Ward hierarchical clustering in Python for my dataset?
3.) Why should I use PCA for dimensionality reduction? Can I also use t-SNE? Is it better?
The problem is that you fit your PCA on your dataframe, but the dataframe contains the cluster column. The 'clusters' column will probably account for most of the variation in your dataset, and therefore the information in the first PC will just coincide with the data['clusters'] column. Try to fit your PCA only on the distance columns:
data_reduced = PCA(n_components=2).fit_transform(data[['dist1', 'dist2', ..., 'dist10']])
You can fit hierarchical clustering with sklearn by using:
sklearn.cluster.AgglomerativeClustering()
You can use different distance metrics and linkage criteria such as 'ward', for example:
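A minimal sketch (my own, assuming the 10 distance columns have already been pulled into a NumPy array X):
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(200, 10)   # placeholder for data[['dist1', ..., 'dist10']].values

ward = AgglomerativeClustering(n_clusters=2, linkage="ward")
cluster_labels = ward.fit_predict(X)   # one cluster id per row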
t-SNE is used to visualize multivariate data, and the goal of this technique is not clustering.
I have a pandas dataframe with rows as records (patients) and 105 columns as features (properties of each patient).
I would like to cluster, not the patients (not the rows, as is customary), but the columns, so I can see which features are similar or correlated to which other features. I can already calculate the correlation of each feature with every other feature using df.corr(). But how can I cluster these into k=2,3,4... groups using sklearn.cluster.KMeans?
I tried KMeans(n_clusters=2).fit(df.T), which does cluster the features (because I took the transpose of the matrix), but only with a Euclidean distance function, not according to their correlations. I would prefer to cluster the features according to their correlations.
This should be very easy, but I would appreciate your help.
KMeans is not very useful in this case, but you can use any clustering method that works with a distance matrix, for example agglomerative clustering.
I'll use scipy; the sklearn version is simpler but not as powerful (e.g. in sklearn you cannot use the Ward method with a precomputed distance matrix).
from scipy.cluster import hierarchy
import scipy.spatial.distance as ssd
df = ... # your dataframe with many features
corr = df.corr() # we can consider this as affinity matrix
distances = 1 - corr.abs().values # pairwise distances
distArray = ssd.squareform(distances) # squareform converts the square distance matrix to the condensed 1-D form scipy expects
hier = hierarchy.linkage(distArray, method="ward") # you can use other methods
Read the docs to understand the structure of hier.
You can print dendrogram with
dend = hierarchy.dendrogram(hier, truncate_mode="level", p=30, color_threshold=1.5)
And finally, obtain cluster labels for your features
threshold = 1.5 # choose the threshold using the dendrogram or any other method (e.g. a quantile or the desired number of clusters)
cluster_labels = hierarchy.fcluster(hier, threshold, criterion="distance")
Create a new matrix from the correlations of all the features, df.corr(), and use this new matrix as the dataset for the k-means algorithm (see the sketch below).
This will give you clusters of features which have similar correlation patterns.
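A minimal sketch of that idea (my own illustration; df stands in for your dataframe of patient features):
import pandas as pd
from sklearn.cluster import KMeans

df = ...                     # your dataframe with 105 feature columns
corr = df.corr()             # 105 x 105: row i is feature i's correlation profile

feature_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(corr)
print(pd.Series(feature_clusters, index=corr.index))   # cluster label per feature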
After reading this post about duplicate values in k-means clustering, I realized I cannot simply use only the unique points for clustering.
https://stats.stackexchange.com/questions/152808/do-i-need-to-remove-duplicate-objects-for-cluster-analysis-of-objects
I have over 10,000,000 points, though only 8,000 unique ones. I therefore initially thought that, to speed things up, I would use only the unique points. It seems this is a bad idea.
To keep the computational time down, the post suggests adding weights to each point. How can this be implemented in Python?
Using the KMeans class from scikit-learn, clustering is performed here with 11 clusters.
The array Y contains the values used as sample weights, whereas X contains the actual points that need to be clustered.
import pandas as pd
from sklearn.cluster import KMeans # for applying KMeans

# Start k-means clustering with 11 clusters
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0, max_iter=1000)

# Run k-means with the 'X' array as the input coordinates and the 'Y' array as the sample weights
wt_kmeansclus = kmeans.fit(X, sample_weight=Y)
predicted_kmeans = kmeans.predict(X, sample_weight=Y)

# Store the results together with the respective city-state labels
kmeans_results = pd.DataFrame({"label": data_label, "kmeans_cluster": predicted_kmeans + 1})

# Print the count of points allotted to each cluster
print(kmeans_results.kmeans_cluster.value_counts())
I think the post suggests working with a weighted average.
You can create a new dataset out of the old one, where each point gets an extra attribute: its frequency (i.e. its weight).
Every time you calculate the new centroid for each cluster, take the weighted average of all points in that cluster (instead of the simple mean of all points).
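A minimal sketch of how that weighting looks with scikit-learn's sample_weight (my own illustration; points is a placeholder for your array of 10,000,000 points):
import numpy as np
from sklearn.cluster import KMeans

points = np.random.randint(0, 50, size=(100000, 2)).astype(float)   # placeholder data with many duplicates

# collapse duplicates and use the duplicate counts as weights
unique_points, counts = np.unique(points, axis=0, return_counts=True)

kmeans = KMeans(n_clusters=11, n_init=10, random_state=0)
kmeans.fit(unique_points, sample_weight=counts)   # weighted centroids, far fewer rows to process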
PS: Manipulating the dataset is dangerous. I'd parallelize the code if computational cost is a major factor.
I have a Twitter corpus which I am using to build a sentiment analysis application. The corpus has 5k tweets which have been hand-labelled as negative, neutral, or positive.
To represent the text, I am using gensim word2vec pretrained vectors. Each word is mapped to 300 dimensions. For a tweet, I add up all the word vectors to get a single 300-dimensional vector. Thus every tweet is mapped to a single vector of 300 dimensions.
I am visualizing my data using t-SNE (the tsne Python package). See attached image 1: red points = negative tweets, blue points = neutral tweets, and green points = positive tweets.
Question:
In the plot there is no clear separation (boundary) among the data points. Can I assume this will also be the case with the original points in 300 dimensions?
I.e. if points overlap in the t-SNE graph, do they also overlap in the original space, and vice versa?
Question: In the plot there is no clear separation (boundary) among the data points. Can I assume this will also be the case with the original points in 300 dimensions?
In most cases, no. By reducing dimensions you will probably lose some information.
The cases where you can reduce dimensionality without losing information are when the data is zero in some dimensions (for example, a line in 3-dimensional space) or when some dimensions are linearly dependent on others.
There are a few tricks to test how well a dimensionality reduction technique works. For example:
You can use PCA to reduce the dimension from 300 down to, say, 10. Calculate the sum of all 300 eigenvalues (the original space) and the sum of the 10 largest eigenvalues (these correspond to the eigenvectors used for the reduction). The ratio sum(top-10 eigenvalues) / sum(all 300 eigenvalues) is the fraction of variance retained, and one minus that ratio approximates the information lost. This value is not exactly "information" lost, but it is close to it.
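A short sketch of that check using scikit-learn's explained variance ratios (my own illustration; X stands in for your 300-dimensional tweet vectors):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(5000, 300)    # placeholder for the tweet vectors

pca = PCA(n_components=10).fit(X)
retained = pca.explained_variance_ratio_.sum()   # sum(top-10 eigenvalues) / sum(all eigenvalues)
print("variance retained:", retained, "roughly lost:", 1 - retained)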
Question
I implemented a K-means algorithm in Python. First I apply PCA and whitening to the input data. Then I use k-means to successfully extract k centroids from the data.
How can I use those centroids to understand the "features" learnt? Are the centroids already the features (it doesn't seem that way to me), or do I need to combine them with the input data again?
Because of some answers: k-means is not "just" a method for clustering; rather, it is a vector quantization method. That said, the goal of k-means is to describe a dataset with a reduced number of feature vectors, so there are strong analogies to methods like Sparse Filtering / Sparse Learning regarding the potential outcome.
Code Example
# Perform K-means, data already pre-processed
centroids = k_means(matrix_pca_whitened,1000)
# Assign data to centroid
idx,_ = vq(song_matrix_pca,centroids)
The clusters produced by the k-means algorithm partition your input space into K regions. When you have new data, you can tell which region it belongs to, and thus classify it (see the sketch below).
The centroids are just a property of these clusters.
You can have a look at the scikit-learn documentation if you are unsure, and at the algorithm-selection map to make sure you choose the right algorithm.
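A small sketch of that region assignment (my own illustration with made-up data):
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 20)                  # training data (placeholder)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

new_data = np.random.rand(3, 20)             # new observations
regions = kmeans.predict(new_data)           # index of the nearest centroid / region for each point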
This is sort of a circular question: "understanding" requires knowing something about the features outside of the k-means process. All that k-means does is identify k groups based on physical proximity. It says "there are clumps of stuff in these k places, and here's how all the points map to the nearest one."
What this means in terms of the features is up to the data scientist, rather than any deeper meaning that k-means can ascribe. The variance of each group may tell you a little about how tightly those points are clustered. Do remember that k-means also chooses starting points at random; an unfortunate choice can easily give a sub-optimal description of the space.
A centroid is basically the "mean" of the cluster. If you can ascribe some deeper understanding from the distribution of centroids, great -- but that depends on the data and features, rather than any significant meaning devolving from k-means.
Is that the level of answer you need?
The centroids are in fact the features learnt. Since k-means is a method of vector quantization, we look up which cluster each observation belongs to, and that observation is therefore best described by that cluster's feature vector (centroid).
If an observation has, for example, been split into 10 patches beforehand, the observation can consist of at most 10 feature vectors.
Example:
Method: K-means with k=10
Dataset: 20 observations divided into 2 patches each = 40 data vectors
We now perform K-means on this patched dataset and get the nearest centroid per patch. We could then create a vector of length 10 (=k) for each of the 20 observations, and if patch 1 belongs to centroid 5 and patch 2 belongs to centroid 9, the vector could look like: 0 - 0 - 0 - 0 - 1 - 0 - 0 - 0 - 1 - 0.
This means that this observation consists of centroids/features 5 and 9. You could also use the distance between patch and centroid instead of this hard assignment.
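A small sketch of this patch encoding (my own illustration; the patch dimension and data are made up):
import numpy as np
from sklearn.cluster import KMeans

k = 10
n_obs, patches_per_obs, dim = 20, 2, 64                   # 64 is an arbitrary patch dimension
patches = np.random.rand(n_obs * patches_per_obs, dim)    # 40 patch vectors

kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(patches)
patch_assignments = kmeans.labels_.reshape(n_obs, patches_per_obs)

# hard assignment: one length-10 indicator vector per observation
encoding = np.zeros((n_obs, k))
for i, assigned in enumerate(patch_assignments):
    encoding[i, assigned] = 1

print(encoding[0])   # 1s at the centroids that describe observation 0's patches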