I have the following dataset with 10 variables:
I want to identify clusters in this multidimensional dataset, so I tried the k-means clustering algorithm with the following code:
clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
data['clusters'] = clustering_kmeans.fit_predict(data)
In order to plot the result I used PCA for dimensionality reduction:
reduced_data = PCA(n_components=2).fit_transform(data)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])
sns.scatterplot(x="pca1", y="pca2", hue=data['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()
And in the end I get the following result:
So I have following questions:
1.) This PCA plot looks really weird, splitting the whole dataset into two corners of the plot. Is that even correct, or did I code something wrong?
2.) Is there another algorithm for clustering multidimensional data? I looked at this but I cannot find an appropriate algorithm for clustering multidimensional data... How do I even implement e.g. Ward hierarchical clustering in Python for my dataset?
3.) Why should I use PCA for dimensionality reduction? Can I also use t-SNE? Is it better?
The problem is that you fit your PCA on your dataframe, but the dataframe also contains the cluster labels. The 'clusters' column will probably contain most of the variation in your dataset, and therefore the information in the first PC will just coincide with the data['clusters'] column. Try to fit your PCA only on the distance columns:
data_reduced = PCA(n_components=2).fit_transform(data[['dist1', 'dist2', ..., 'dist10']])
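For completeness, a minimal sketch of the corrected workflow (the column selection and the KMeans parameters are assumptions; adjust them to your actual dataframe):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

features = data.drop(columns=['clusters'], errors='ignore')   # raw features only, no cluster labels
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(features)

reduced = PCA(n_components=2).fit_transform(features)          # PCA fitted on the features alone
results = pd.DataFrame(reduced, columns=['pca1', 'pca2'])
results['cluster'] = clusters

sns.scatterplot(x='pca1', y='pca2', hue='cluster', data=results)
plt.title('K-means clustering projected onto 2 principal components')
plt.show()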
You can fit hierarchical clustering with sklearn by using:
sklearn.cluster.AgglomerativeClustering()
You can use different distance metrics and linkage criteria, such as 'ward'.
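For example, a minimal sketch (assuming features holds the numeric feature columns of your dataframe, without the cluster labels):

from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges the pair of clusters that increases the within-cluster variance the least;
# it uses Euclidean distances by default
ward = AgglomerativeClustering(n_clusters=2, linkage='ward')
ward_labels = ward.fit_predict(features)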
t-SNE is used to visualize multivariate data, and the goal of this technique is not clustering.
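That said, you can still use t-SNE purely as a 2-D visualization of clusters computed elsewhere; a hedged sketch (perplexity and the other parameters are only illustrative):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features and clusters as in the PCA sketch above
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
tsne_df = pd.DataFrame(embedded, columns=['tsne1', 'tsne2'])
tsne_df['cluster'] = clusters
sns.scatterplot(x='tsne1', y='tsne2', hue='cluster', data=tsne_df)
plt.show()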
Related
I am currently working on clustering categorical attributes that come from a bank marketing dataset from Kaggle. I have created the three clusters with kmodes:
Output: cluster_df
Now I want to visualize each row of a cluster as a projection or point so that I get some kind of image:
Desired visualization
I am having a hard time with this. Euclidean distance doesn't really make sense for categorical data, right? Is there then no way to create this desired visualization?
The best way to visualize clusters is to use PCA.
You can use PCA to reduce the multi-dimensional data into 2 dimensions so that you can plot and hopefully understand the data better.
To use it see the following code:
import pandas as pd
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
where x is the data you ran your clustering on.
Now you can easily visualize your clustered data since it's 2-dimensional.
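For example, a hedged sketch of the final plot, assuming labels holds the kmodes cluster assignment for each row of x:

import matplotlib.pyplot as plt

# labels: hypothetical array of kmodes cluster assignments, one per row of x
plt.scatter(principalDf['principal component 1'],
            principalDf['principal component 2'],
            c=labels, cmap='viridis', s=20)
plt.xlabel('principal component 1')
plt.ylabel('principal component 2')
plt.title('Clusters in PCA space')
plt.show()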
My goal is to find out if I can manipulate and measure data from a PCA or t-SNE plot in Python. I want to know if there is a way I can find distances of points from a center of clusters.
I think there is a way but I'm not too sure.
You don't give many details, but maybe this can help you:
Clustering techniques information:
https://scikit-learn.org/stable/modules/clustering.html#clustering
Dimensionality reduction:
https://scikit-learn.org/stable/modules/decomposition.html#decompositions
Maybe the following script helps you:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = your_data_variables            # your feature matrix
cluster = KMeans(n_clusters=3)     # your clustering technique (anything that exposes cluster_centers_)
cluster.fit(X)

pca = PCA(n_components=2)
pca.fit(X)
pca_data = pd.DataFrame(pca.transform(X), columns=["pca1", "pca2"])
centers = pca.transform(cluster.cluster_centers_)   # cluster centers projected into PCA space
Now you have the cluster centers and your data in two dimensions, and you can calculate the distances however you want.
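For instance, a minimal sketch of the distance of each point to its own cluster center, assuming the clustering technique exposes labels_ (as KMeans does):

import numpy as np

labels = cluster.labels_                    # cluster index of each point
diffs = pca_data.values - centers[labels]   # vector from each point to its own center, in PCA space
dist_to_center = np.linalg.norm(diffs, axis=1)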
I have a pandas dataframe with rows as records (patients) and 105 columns as features (properties of each patient).
I would like to cluster, not the patients, not the rows as is customary, but the columns so I can see which features are similar or correlated to which other features. I can already calculate the correlation each feature with every other feature using df.corr(). But how can I cluster these into k=2,3,4... groups using sklearn.cluster.KMeans?
I tried KMeans(n_clusters=2).fit(df.T), which does cluster the features (because I took the transpose of the matrix), but only with a Euclidean distance function, not according to their correlations. I would prefer to cluster the features according to their correlations.
This should be very easy but I would appreciate your help.
KMeans is not very useful in this case, but you can use any clustering method that can work with a distance matrix, for example agglomerative clustering.
I'll use scipy; the sklearn version is simpler but not as powerful (e.g. in sklearn you cannot use the Ward method with a precomputed distance matrix).
from scipy.cluster import hierarchy
import scipy.spatial.distance as ssd
df = ... # your dataframe with many features
corr = df.corr()                       # we can treat this as an affinity matrix
distances = 1 - corr.abs().values      # pairwise distances
distArray = ssd.squareform(distances)  # convert the square matrix to scipy's condensed 1-d form
hier = hierarchy.linkage(distArray, method="ward")  # you can use other linkage methods
Read the docs to understand the structure of hier.
You can plot the dendrogram with:
dend = hierarchy.dendrogram(hier, truncate_mode="level", p=30, color_threshold=1.5)
And finally, obtain cluster labels for your features
threshold = 1.5 # choose the threshold using the dendrogram or any other method (e.g. a quantile or the desired number of clusters)
cluster_labels = hierarchy.fcluster(hier, threshold, criterion="distance")
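To see which features ended up together, you can map the labels back to the column names, for example:

import pandas as pd

feature_clusters = pd.Series(cluster_labels, index=corr.columns, name="cluster")
print(feature_clusters.sort_values())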
Create a new matrix by taking the correlations of all the features with df.corr(), and then use this new matrix as your dataset for the k-means algorithm.
This will give you clusters of features which have similar correlations.
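A minimal sketch of that idea (the number of clusters is just an example):

from sklearn.cluster import KMeans

corr = df.corr()   # each row is one feature's correlation profile with all other features
feature_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(corr)
# feature_labels[i] is the cluster assigned to the i-th column of df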
After reading this post here about duplicate values in k-means clustering, I realized I cannot simply use unique points for clustering.
https://stats.stackexchange.com/questions/152808/do-i-need-to-remove-duplicate-objects-for-cluster-analysis-of-objects
I have over 10,000,000 points, though only 8,000 unique ones. Therefore, I initially thought that to speed things up I would use the unique points only. That seems to be a bad idea.
To keep the computational time down, that post suggests adding weights to each point. How can this be implemented in Python?
Using the KMeans class from the scikit-learn library, clustering is performed here with the number of clusters set to 11.
The array Y contains the data that is passed in as sample weights, whereas X holds the actual points that need to be clustered.
import pandas as pd
from sklearn.cluster import KMeans  # for applying KMeans

# Start k-means clustering
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0, max_iter=1000)

# Run k-means clustering, passing the 'X' array as the input coordinates and
# the 'Y' array as sample weights
wt_kmeansclus = kmeans.fit(X, sample_weight=Y)
predicted_kmeans = kmeans.predict(X, sample_weight=Y)

# Store the results together with their respective city-state labels
# (data_label is assumed to hold one label per point)
kmeans_results = pd.DataFrame({"label": data_label,
                               "kmeans_cluster": predicted_kmeans + 1})

# Print the count of points allotted to each cluster, then the cluster centers
print(kmeans_results.kmeans_cluster.value_counts())
print(kmeans.cluster_centers_)
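If you are starting from the full array of duplicated points, a hedged sketch of how X (the unique points) and Y (their counts, used as weights) could be built; all_points is a hypothetical name for the full array:

import numpy as np

# unique rows of the full data set, and how often each one occurs
X, Y = np.unique(all_points, axis=0, return_counts=True)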
I think the post suggests working with a weighted average.
You can create a new dataset out of the old one, where each point gets an extra attribute: its frequency (i.e. its weight).
Every time you calculate the new centroid for a cluster, take the weighted average of all points in that cluster (instead of the simple mean of all points).
PS: Manipulating the dataset is dangerous. I'd parallelize the code if computational cost is a major factor.
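As a minimal sketch of what that weighted centroid update looks like in plain numpy (independent of sklearn's internals):

import numpy as np

def weighted_centroid(points, weights):
    # points: (n, d) array with one cluster's members; weights: their duplicate counts
    return np.average(points, axis=0, weights=weights)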
I have 7 known centroids with shape (7, 4) and a numpy array of shape (160000, 4).
If I remove one data set from the numpy array and just use 6 centroids, then the k-means clustering algorithm works quite well. If I include the data set that has noisy data, then the k-means clustering algorithm runs into issues.
What are some recommended ways of reducing noise or filtering it out with scikit-learn kmeans clustering?
As an alternative, I have considered using the DBSCAN algorithm: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
I also tried to use init=initial_centroids, but that did not seem to make a difference.
# numpy array containing noise; data_with_noise has shape (160000, 4)
# sk is assumed to be sklearn.cluster
centroids = sk.KMeans(n_clusters=7, init='k-means++', max_iter=1000)
# this data set gives errors
centroids.fit_transform(data_with_noise)

# works well, but only for 6 clusters
no_noise_centroids = sk.KMeans(n_clusters=6, init='k-means++', max_iter=1000)
no_noise_centroids.fit(data_no_noise)