I have a multidimensional vector designed for an NLP Classifier.
Here's the dataframe (text_df):
I used a TfidfVectorizer to create the vector:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm='l2',
                          smooth_idf=True)
X = tfidf_v.fit_transform(corpus).toarray()
y = text_df.iloc[:,1].values
Shape of X is (13834, 2701).
I used 7 clusters for KMeans:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=7, random_state=42)
y_kmeans = kmeans.fit_predict(X)
I tried using PCA, but I'm not sure if the graph looks right.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
X_pca = PCA(2).fit_transform(X)
plt.scatter(X_pca[:,0],X_pca[:,1],c=y_kmeans)
plt.title("Clusters")
plt.legend()
plt.show()
Is this normal for NLP based clusters? I was hoping for more distinctive clusters. Is there a way to clean up this cluster graph? (i.e. clearer groupings, distinct boundaries, cluster points closer together, etc.).
K-means clustering does not work very well on high-dimensional data (see this) and is usually done after dimensionality reduction (PCA, in your example).
As an aside, if your aim is to cluster the documents according to their topics, it is worth exploring topic modelling. Clustering can then be done on the distributions over topics identified by the topic modelling algorithm.
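For illustration, here is a minimal sketch of that order of operations on the tf-idf matrix from the question: reduce first, then cluster in the lower-dimensional space. The choice of 100 components is an arbitrary starting point and would need tuning for your data.
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Reduce the 2701-dimensional tf-idf matrix first, then cluster in the
# lower-dimensional space.
X_reduced = PCA(n_components=100, random_state=42).fit_transform(X)

kmeans = KMeans(n_clusters=7, random_state=42)
y_kmeans = kmeans.fit_predict(X_reduced)

# Project further down to 2 components only for the scatter plot.
X_2d = PCA(n_components=2, random_state=42).fit_transform(X_reduced)
If you go the topic-modelling route instead, scikit-learn's LatentDirichletAllocation and NMF (both in sklearn.decomposition) are common starting points.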
I have the following dataset with 10 variables:
I want to identify clusters in this multidimensional dataset, so I tried the k-means clustering algorithm with the following code:
clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
data['clusters'] = clustering_kmeans.fit_predict(data)
In order to plot the result I used PCA for dimensionality reduction:
reduced_data = PCA(n_components=2).fit_transform(data)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])
sns.scatterplot(x="pca1", y="pca2", hue=kmeans['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()
And in the end I get the following result:
So I have the following questions:
1.) This PCA plot looks really weird, splitting the whole dataset into two corners of the plot. Is that even correct, or did I code something wrong?
2.) Is there another algorithm for clustering multidimensional data? I looked at this but could not find an appropriate one... How do I implement, e.g., Ward hierarchical clustering in Python for my dataset?
3.) Why should I use PCA for dimensionality reduction? Can I also use t-SNE? Is it better?
The problem is that you fit your PCA on your dataframe, but the dataframe also contains the cluster labels. The 'clusters' column will probably contain most of the variation in your dataset, and therefore the information in the first PC will just coincide with the data['clusters'] column. Try to fit your PCA only on the distance columns:
data_reduced = PCA(n_components=2).fit_transform(data[['dist1', 'dist2', ..., 'dist10']])
You can fit hierarchical clustering with sklearn by using:
sklearn.cluster.AgglomerativeClustering()
You can use different distance metrics and linkages, such as 'ward'.
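A minimal sketch of Ward hierarchical clustering on the same distance columns (the names dist1 ... dist10 are just the placeholders used above):
from sklearn.cluster import AgglomerativeClustering

# Placeholder column names; substitute your actual feature columns.
dist_cols = ['dist' + str(i) for i in range(1, 11)]

# Ward linkage minimises within-cluster variance and requires Euclidean
# distances; other linkages ('complete', 'average', 'single') allow other
# metrics.
ward = AgglomerativeClustering(n_clusters=2, linkage='ward')
data['ward_clusters'] = ward.fit_predict(data[dist_cols])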
t-SNE is used to visualize multivariate data; the goal of this technique is not clustering.
I obtained features from 10 images from 2 categories (cats and dogs) using a CNN, so I have a (10, 2500) numpy array. I applied the OPTICS clustering algorithm on the array to find which image belongs to which cluster:
clustering = OPTICS(min_samples=2).fit(train_data_array)
Now I'm trying to plot the clusters using seaborn
sns.scatterplot(data=train_data_array).plot
But there's no plot.
There are two issues.
1.) The object returned by OPTICS only contains the labels, so you need to add them to your training data.
2.) The training data has 2500 variables, so most likely you need to do a dimension reduction to render a 2-D plot.
Below is an example using the iris dataset:
from sklearn.cluster import OPTICS
import seaborn as sns
import pandas as pd
df = sns.load_dataset("iris").iloc[:,:4]
Perform the clustering like you did:
clustering = OPTICS(min_samples=20).fit(df)
Perform PCA on this data with 4 variables, return top 2 components:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(df)
Add the PC scores and clustering results to the training data, or make a separate DataFrame:
scores = pca.transform(df)
df['PC1'] = scores[:, 0]
df['PC2'] = scores[:, 1]
df['clustering'] = clustering.labels_
Plot:
sns.scatterplot(data=df, x="PC1", y="PC2", hue="clustering")
My goal is to find out if I can manipulate and measure data from a PCA or t-SNE plot in Python. I want to know if there is a way I can find distances of points from a center of clusters.
I think there is a way but I'm not too sure.
You don't give many details, but maybe this can help you:
Clustering techniques information:
https://scikit-learn.org/stable/modules/clustering.html#clustering
Dimensionality reduction:
https://scikit-learn.org/stable/modules/decomposition.html#decompositions
Maybe the following script helps you:
from sklearn.decomposition import PCA
import pandas as pd

X = your_data_variables              # placeholder: your data
cluster = your_clustering_technique  # placeholder: e.g. a KMeans instance
cluster.fit(X)

pca = PCA(n_components=2)
pca.fit(X)
pca_data = pd.DataFrame(pca.transform(X))
centers = pca.transform(cluster.cluster_centers_)
Now you have the cluster centers and your data in two dimensions, and you can calculate the distances as you want.
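For example, a rough sketch of the distance calculation with NumPy, assuming 'cluster' above is a fitted KMeans-like estimator (so it exposes labels_ and cluster_centers_):
import numpy as np

# Distance of each point (in PCA space) to the centre of its own cluster.
labels = cluster.labels_
dist_to_own_center = np.linalg.norm(pca_data.values - centers[labels], axis=1)

# Or the full point-to-every-centre distance matrix.
dist_matrix = np.linalg.norm(pca_data.values[:, None, :] - centers[None, :, :], axis=2)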
I'd like to use Incremental Principal Component Analysis (IPCA) to reduce my feature space so that it contains x% of the information.
I would use sklearn.decomposition.IncrementalPCA(n_components=None, whiten=False, copy=True, batch_size=None).
I can leave n_components=None so that it works on all the features I have.
But later, once the whole data set is analyzed, how do I select the components that represent x% of the variance, and how do I create a transform() for that number of components?
This idea is taken from this question.
You can get the percentage of explained variance from each of the components of your PCA using explained_variance_ratio_. For example in the iris dataset, the first 2 principal components account for 98% of the variance in the data:
import numpy as np
from sklearn import decomposition
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
pca = decomposition.IncrementalPCA()
pca.fit(X)
pca.explained_variance_ratio_
#array([ 0.92461621, 0.05301557, 0.01718514, 0.00518309])
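To go from these ratios to "x% of the variance", one approach (a sketch building on the snippet above, not part of the original answer) is to take the cumulative sum of the ratios, find how many components are needed, and refit with exactly that number:
import numpy as np

target = 0.95  # e.g. keep 95% of the variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, target)) + 1

# Refit with that many components and transform the data.
ipca = decomposition.IncrementalPCA(n_components=n_components)
X_reduced = ipca.fit_transform(X)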
I have 7 known centroids of shape (7, 4) and a numpy array of shape (160000, 4).
If I remove one data set (the noisy one) from the numpy array and just use 6 centroids, then the k-means clustering algorithm works quite well. If I include the data set that has the noisy data, k-means runs into issues.
What are some recommended ways of reducing noise or filtering it out with scikit-learn k-means clustering?
As an alternative, I have considered using the DBSCAN algorithm: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
I also tried using init=initial_centroids, but that did not seem to make a difference.
from sklearn import cluster as sk  # assuming 'sk' refers to sklearn.cluster

# numpy array containing noise
# data_with_noise.shape == (160000, 4)
centroids = sk.KMeans(n_clusters=7, init='k-means++', max_iter=1000)
# this data set gives errors
centroids.fit_transform(data_with_noise)

# works well, but only for 6 clusters
no_noise_centroids = sk.KMeans(n_clusters=6, init='k-means++', max_iter=1000)
no_noise_centroids.fit(data_no_noise)
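A hedged sketch of the DBSCAN idea mentioned above: mark the noisy points first (DBSCAN labels them -1), drop them, then run k-means on what is left. The eps and min_samples values are placeholders that would need tuning for this data.
from sklearn.cluster import DBSCAN, KMeans

db = DBSCAN(eps=0.5, min_samples=10).fit(data_with_noise)
mask = db.labels_ != -1               # -1 marks points DBSCAN considers noise
data_filtered = data_with_noise[mask]

# k-means on the de-noised data, optionally seeded with the 7 known centroids
kmeans = KMeans(n_clusters=7, init=initial_centroids, n_init=1, max_iter=1000)
kmeans.fit(data_filtered)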