How to plot OPTICS clustering results using seaborn? - python

I obtained features from 10 images from 2 categories (cats and dogs) using a CNN, so I have a (10, 2500) numpy array. I applied the OPTICS clustering algorithm to the array to find which image belongs to which cluster:
clustering = OPTICS(min_samples=2).fit(train_data_array)
Now I'm trying to plot the clusters using seaborn
sns.scatterplot(data=train_data_array).plot
But there's no plot.

There are two issues.
The object returned by OPTICS only contains the labels, so you need to add them to your training data.
The training data has 2500 variables, so you most likely need to perform dimensionality reduction to render a 2-D plot.
Below is an example using the iris dataset:
from sklearn.cluster import OPTICS
import seaborn as sns
import pandas as pd
df = sns.load_dataset("iris").iloc[:,:4]
Perform the clustering like you did:
clustering = OPTICS(min_samples=20).fit(df)
Perform PCA on this data with 4 variables, return top 2 components:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
scores = pca.fit_transform(df)
Add the PC scores and clustering results to the training data, or you can make a separate DataFrame:
df['PC1'] = scores[:, 0]
df['PC2'] = scores[:, 1]
df['clustering'] = clustering.labels_
Plot:
sns.scatterplot(data=df, x="PC1", y="PC2", hue="clustering")
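Applied to the data in your question, the same recipe looks roughly like this (a minimal sketch, assuming train_data_array is the (10, 2500) feature matrix you fitted OPTICS on):
import pandas as pd
import seaborn as sns
from sklearn.cluster import OPTICS
from sklearn.decomposition import PCA
# cluster the CNN features, then project them to 2 components for plotting
clustering = OPTICS(min_samples=2).fit(train_data_array)
scores = PCA(n_components=2).fit_transform(train_data_array)
plot_df = pd.DataFrame({'PC1': scores[:, 0],
                        'PC2': scores[:, 1],
                        'clustering': clustering.labels_})
sns.scatterplot(data=plot_df, x="PC1", y="PC2", hue="clustering")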

Related

PCA after k-means clustering of multidimensional data

I have the following dataset with 10 variables:
I want to identify clusters in this multidimensional dataset, so I tried the k-means clustering algorithm with the following code:
clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
data['clusters'] = clustering_kmeans.fit_predict(data)
In order to plot the result I used PCA for dimensionality reduction:
reduced_data = PCA(n_components=2).fit_transform(data)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])
sns.scatterplot(x="pca1", y="pca2", hue=kmeans['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()
And in the end I get the following result:
So I have following questions:
1.) This PCA plot looks really weird, splitting the whole dataset into two corners of the plot. Is that even correct, or did I code something wrong?
2.) Is there another algorithm for clustering multidimensional data? I looked at this but I cannot find an appropriate algorithm for clustering multidimensional data... How do I even implement e.g. Ward hierarchical clustering in Python for my dataset?
3.) Why should I use PCA for dimensionality reduction? Can I also use t-SNE? Is it better?
The problem is that you fit your PCA on your dataframe, but the dataframe contains the cluster labels. The 'clusters' column will probably contain most of the variation in your dataset, and therefore the information in the first PC will just coincide with the data['clusters'] column. Try to fit your PCA only on the distance columns:
data_reduced = PCA(n_components=2).fit_transform(data[['dist1', 'dist2', ..., 'dist10']])
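Equivalently, you can drop the label column instead of listing every distance column; a minimal sketch, assuming the labels sit in data['clusters'] as in your code:
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
# fit PCA on the feature columns only, not on the cluster labels
features = data.drop(columns=['clusters'])
reduced_data = PCA(n_components=2).fit_transform(features)
results = pd.DataFrame(reduced_data, columns=['pca1', 'pca2'])
results['clusters'] = data['clusters'].values
sns.scatterplot(x="pca1", y="pca2", hue="clusters", data=results)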
You can fit hierarchical clustering with sklearn by using:
sklearn.cluster.AgglomerativeClustering()
You can use different distance metrics and linkages, like 'ward'.
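For example, a minimal sketch of Ward hierarchical clustering on the same feature columns (2 clusters here is just an assumption mirroring the KMeans call above):
from sklearn.cluster import AgglomerativeClustering
# Ward linkage uses Euclidean distances on the distance columns
ward = AgglomerativeClustering(n_clusters=2, linkage='ward')
data['clusters_ward'] = ward.fit_predict(data.drop(columns=['clusters']))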
t-SNE is used to visualize multivariate data; clustering is not the goal of this technique.

Perform Multi-Dimension Scaling (MDS) for clustered categorical data in python

I am currently working on clustering categorical attributes that come from a bank marketing dataset from Kaggle. I have created the three clusters with kmodes:
Output: cluster_df
Now I want to visualize each row of a cluster as a projection or point so that I get some kind of image:
Desired visualization
I am having a hard time with this. I don't get a Euclidean distance with categorical data, right? That makes no sense. Is there then no way to create this desired visualization?
The best way to visualize clusters is to use PCA.
You can use PCA to reduce the multi-dimensional data into 2 dimensions so that you can plot and hopefully understand the data better.
To use it see the following code:
import pandas as pd
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
where x is the data you used for clustering.
Now you can easily visualize your clustered data since it's 2-dimensional.
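Since PCA needs numeric input, one way to obtain x from categorical attributes is to one-hot encode them first. A hedged sketch, assuming cluster_df holds the categorical columns plus a 'cluster' label column from k-modes (the column name is an assumption):
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
# one-hot encode the categorical attributes; keep the cluster labels aside
labels = cluster_df['cluster']
x = pd.get_dummies(cluster_df.drop(columns=['cluster']))
principalComponents = PCA(n_components=2).fit_transform(x)
principalDf = pd.DataFrame(principalComponents,
                           columns=['principal component 1', 'principal component 2'])
principalDf['cluster'] = labels.values
sns.scatterplot(data=principalDf, x='principal component 1',
                y='principal component 2', hue='cluster')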

K-Means not resulting in elbow shape

I'm trying to use k-means on a dataset available at this link, using only the variables about the client. The problem is that 7 of the 8 variables are categorical, so I've used a one-hot encoder on them.
To use the elbow method to select an ideal number of clusters, I ran KMeans for 2 to 22 clusters and plotted the inertia_ values. But the shape wasn't anything like an elbow; it looked more like a straight line.
Am I doing something wrong?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
bank = pd.read_csv('bank-additional-full.csv', sep=';') #available at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#
# 1. selecting only informations about the client
cli_vars = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']
bank_cli = bank[cli_vars].copy()
#2. applying one hot encoder to categorical variables
X = bank_cli[['job', 'marital', 'education', 'default', 'housing', 'loan']]
le = preprocessing.LabelEncoder()
X_2 = X.apply(le.fit_transform)
X_2.values
enc = preprocessing.OneHotEncoder()
enc.fit(X_2)
one_hot_labels = enc.transform(X_2).toarray()
one_hot_labels.shape #(41188, 33)
#3. concatenating numeric and categorical variables
X = np.concatenate((bank_cli.values[:,0].reshape((41188,1)),one_hot_labels), axis = 1)
X.shape
X = X.astype(float)
X_fit = StandardScaler().fit_transform(X)
X_fit
#4. function to calculate k-means for 2 to 22 clusters
def calcular_cotovelo(data):
    wcss = []
    for i in range(2, 23):
        kmeans = KMeans(init='k-means++', n_init=12, n_clusters=i)
        kmeans.fit(data)
        wcss.append(kmeans.inertia_)
    return wcss
cotovelo = calcular_cotovelo(X_fit)
#5. plot to see the elbow to select the ideal number of clusters
plt.plot(cotovelo)
plt.show()
This is the plot of the inertia to select the clusters. It's not in an elbow shape, and the values are very high.
K-means is not suited for categorical data. You should look at k-prototypes instead, which combines k-modes and k-means and is able to cluster mixed numerical and categorical data.
An implementation of k-prototypes is available in Python.
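A minimal sketch with the kmodes package (pip install kmodes); the cluster count and the categorical column indices are assumptions based on bank_cli from your question, where column 0 (age) is numerical and columns 1-6 are categorical:
from kmodes.kprototypes import KPrototypes
# pass the raw client variables; no one-hot encoding is needed for k-prototypes
kproto = KPrototypes(n_clusters=3, init='Huang')
clusters = kproto.fit_predict(bank_cli.values, categorical=[1, 2, 3, 4, 5, 6])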
If you consider only the numerical variable (age), however, you can see an elbow with the k-means criterion:
To understand why you do not see any elbow (with k-means on both numerical and categorical data), you can look at the number of points per cluster. Each time you increase the number of clusters, a new cluster is formed from only a few points that were in a big cluster at the previous step, so the criterion decreases only slightly compared to the previous step.
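You can check this by counting the cluster sizes for each k; a small sketch reusing the scaled X_fit from your question (the range is just for illustration):
import numpy as np
for k in range(2, 8):
    labels = KMeans(init='k-means++', n_init=12, n_clusters=k).fit_predict(X_fit)
    print(k, np.bincount(labels))  # typically one very small cluster appears at each step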

Graphing multi-dimensional K-means cluster NLP python

I have a multidimensional vector designed for an NLP Classifier.
Here's the dataframe (text_df):
I used a TfidfVectorizer to create the vector:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_v = TfidfVectorizer(max_df=0.5,
                          max_features=13000,
                          min_df=5,
                          stop_words='english',
                          use_idf=True,
                          norm=u'l2',
                          smooth_idf=True)
X = tfidf_v.fit_transform(corpus).toarray()
y = text_df.iloc[:,1].values
Shape of X is (13834, 2701).
I used 7 clusters for KMeans:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=7, random_state=42)
y_kmeans = kmeans.fit_predict(X)
I tried using PCA, but I'm not sure if the graph looks right.
from sklearn.decomposition import PCA
X_pca = PCA(2).fit_transform(X)
plt.scatter(X_pca[:,0],X_pca[:,1],c=y_kmeans)
plt.title("Clusters")
plt.legend()
plt.show()
Is this normal for NLP based clusters? I was hoping for more distinctive clusters. Is there a way to clean up this cluster graph? (i.e. clearer groupings, distinct boundaries, cluster points closer together, etc.).
K-Means clustering does not work very well on high-dimensional data (see this) and is usually done after dimensionality reduction (PCA, in your example).
As an aside, if your aim is to cluster the documents according to their topics, it's worth exploring topic modelling. Clustering can then be done using the distributions over topics identified by the topic modelling algorithm.
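A hedged sketch of that pipeline, reducing the tf-idf matrix with TruncatedSVD (LSA) before clustering; the 100 components and the plot of the first two components are illustrative assumptions:
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# reduce the (13834, 2701) tf-idf matrix before clustering
svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X)
kmeans = KMeans(n_clusters=7, random_state=42)
y_kmeans = kmeans.fit_predict(X_reduced)
# plot the first two SVD components coloured by cluster
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_kmeans, s=5)
plt.title("Clusters after LSA")
plt.show()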

How to choose the features that describe x% of all information in data while using Incremental principal components analysis (IPCA)?

I'd like to use the Incremental principal components analysis (IPCA) to reduce my feature space such that it contains x% of information.
I would use the sklearn.decomposition.IncrementalPCA(n_components=None, whiten=False, copy=True, batch_size=None)
I can leave the n_components=None so that it works on all the features that I have.
But later, once the whole data set is analyzed, how do I select the components that represent x% of the information, and how do I create a transform() for that number of components?
This idea is taken from this question.
You can get the percentage of explained variance from each of the components of your PCA using explained_variance_ratio_. For example in the iris dataset, the first 2 principal components account for 98% of the variance in the data:
import numpy as np
from sklearn import decomposition
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
pca = decomposition.IncrementalPCA()
pca.fit(X)
pca.explained_variance_ratio_
#array([ 0.92461621, 0.05301557, 0.01718514, 0.00518309])
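Continuing the snippet above, you can then pick the smallest number of components whose cumulative ratio reaches your x% threshold and refit with that integer n_components (IncrementalPCA expects an integer n_components, unlike PCA, which also accepts a variance fraction); a sketch using 95% as the example threshold:
# keep enough components to explain at least 95% of the variance
ratios = pca.explained_variance_ratio_
n_components = int(np.searchsorted(np.cumsum(ratios), 0.95) + 1)
ipca = decomposition.IncrementalPCA(n_components=n_components)
X_reduced = ipca.fit_transform(X)  # shape (150, 2) for the iris example above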
