Assign cluster members using a log-likelihood distance metric in Python

I have 100 clusters, each with a mean and a standard deviation. These clusters were predefined in the SPSS software package using its two-step cluster method, so the optimisation of the cluster distributions to fit the data has already been done.
For new (unseen) data, we want to assign cluster membership by selecting the maximum log-likelihood cluster for any given set of coordinates X. To do this, I have written my own code for comparison with what was output by SPSS using the same method: https://www.norusis.com/pdf/SPC_v19.pdf
Using data that has already been correctly labelled by SPSS, about 42% of the points are labelled correctly by minimising the RMSE to the cluster mean (which is not what SPSS does), and fewer than 20% are labelled correctly by my code when assigning the maximum log-likelihood cluster (which is what SPSS reports to do).
I know that the maximum log-likelihood cluster should be the correct cluster (https://www.norusis.com/pdf/SPC_v19.pdf), but there is only a 20% success rate from this code when compared to the correct cluster labels from SPSS. What am I doing wrong?
Here is the code below.
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
import math
from scipy import stats
# import raw files
clusters_df = pd.read_csv('ClusterCoordinates.csv') # clusters are in order of cluster numbers enabling us to use index for identification
clusters_df = clusters_df.drop(columns=['Cluster'])
print(clusters_df.shape)
clusters = clusters_df.to_numpy()
frames_df_raw = pd.read_csv('FrameCoordinates.csv')
frames_df = frames_df_raw.drop(columns=['frame','replica','voltage','system','ff','cluster'])
print(frames_df.shape)
frames = frames_df.to_numpy()
clusters_sd_df = pd.read_csv('ClusterCoordinates_SD.csv')
clusters_sd_df = clusters_sd_df.drop(columns=['Cluster'])
print(clusters_sd_df.shape)
clusters_sd = clusters_sd_df.to_numpy()
rmseCalc = []
llCalc = []
assignedCluster_RMSE = []
assignedCluster_LL = []
# create tables with RMSE and LL values
for frame in frames:
    # compare cluster assignment by minimum RMSE vs. maximum log-likelihood;
    # llCalc stores negative log-likelihoods, so the max-LL cluster is the argmin
    for cluster, cluster_sd in zip(clusters, clusters_sd):
        rmseCalc.append(math.sqrt(mean_squared_error(np.array(cluster), np.array(frame))))
        llCalc.append(-np.sum(stats.norm.logpdf(frame, loc=cluster, scale=cluster_sd)))
    rmseCalc = np.array(rmseCalc)
    llCalc = np.array(llCalc)
    llCalc = np.nan_to_num(llCalc)
    minRMSE = np.where(rmseCalc == rmseCalc.min())
    maxLL = np.where(llCalc == llCalc.min())
    print(maxLL[0][0] + 1)
    assignedCluster_RMSE.append(minRMSE[0][0] + 1)
    assignedCluster_LL.append(maxLL[0][0] + 1)
    rmseCalc = []
    llCalc = []
frames_df_raw['predCluster_RMSE'] = np.array(assignedCluster_RMSE)
frames_df_raw['predCluster_LL'] = np.array(assignedCluster_LL)
frames_df_raw.to_csv('frames_clustered.csv')
I was expecting the cluster labels assigned by the code to match those already assigned by SPSS, since the methods used are intended to be the same.
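(For reference, a minimal vectorized sketch of the same maximum log-likelihood assignment, assuming frames, clusters, and clusters_sd as loaded in the code above. Unlike the loop, it skips the np.nan_to_num step; note that mapping NaN to 0 can let a cluster with an undefined log-density, e.g. a zero standard deviation in some dimension, spuriously win the argmin in the loop version.)
# Vectorized log-likelihood for every (frame, cluster) pair via broadcasting.
ll = stats.norm.logpdf(
    frames[:, None, :],            # shape (n_frames, 1, n_dims)
    loc=clusters[None, :, :],      # shape (1, n_clusters, n_dims)
    scale=clusters_sd[None, :, :],
).sum(axis=2)                      # sum over dimensions -> (n_frames, n_clusters)
pred_ll = ll.argmax(axis=1) + 1    # 1-based cluster labels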

Related

Expected mean of correlated data in python

I have successfully generated three correlated random variables with Cholesky. I use the same mean (10) and the same standard deviation (5) for all of them. However, when I calculated the expected mean of the correlated variables, I got some unpleasant results and can't seem to figure out where exactly the problem is. Here is a working example:
import numpy as np
import pandas as pd
corr = np.array([[1,0.7,0.7], [0.7,1,0.7],[0.7,0.7,1]])
chol = np.linalg.cholesky(corr)
N=1000
rand_data = np.random.normal(10, 5, size=(3,N))
# generate uncorrelated data
uncorrelated_data = pd.DataFrame(rand_data, index=['A','B','C']).T/100
uncorrelated_data.corr() # shows barely any correlation as it should
uncorrelated_data.mean()*100 # shows each mean around 10
Output
A 10.308595
B 9.931958
C 10.165347
Generating correlation among them
x = np.dot(chol, rand_data) # cholesky
correlated_data = pd.DataFrame(x, index=['A','B','C']).T/100
print(correlated_data.corr()) # shows there are correlations among variable
correlated_data.mean()*100 # the mean keeps increasing across the variables
Output:
A 10.308595
B 14.308853
C 16.752117
The means of the uncorrelated variables were as expected, but the means of the correlated variables keep increasing from the first variable to the last. My expectation was that each mean would be around the actual mean (10). Could someone help me figure out the problem or suggest an alternative solution?
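(One likely explanation, with a minimal sketch: multiplying by the Cholesky factor scales each output row's mean by that row's sum of factor entries, here roughly 1, 1.41, and 1.64, which matches the increasing pattern in the output above. Demeaning before mixing and adding the mean back afterwards avoids this; the sketch below reuses rand_data and chol from the code above.)
# demean, mix via Cholesky, then restore the target mean of 10
demeaned = rand_data - rand_data.mean(axis=1, keepdims=True)
x = np.dot(chol, demeaned) + 10
correlated_data = pd.DataFrame(x, index=['A','B','C']).T/100
print(correlated_data.corr())      # correlations preserved
print(correlated_data.mean()*100)  # each mean now ~10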

Grouping clusters based on one feature column

I have not clustered data in a while, and at the moment I have a massive list of accounts with their respective areas (or OUs in the table below).
I have used k-means and k-modes to try to cluster based on OU, meaning I want the output to group the 17 OUs I have and cluster them based on the provided information. So far the output has clustered each record individually rather than each OU. Can someone help me figure out how to group the output and then cluster it? Below is a sample of the code used.
# Building the model with 3 clusters
from kmodes.kmodes import KModes
kmode = KModes(n_clusters=3, init="random", n_init=5, verbose=1)
clusters = kmode.fit_predict(df)
clusters
#insert the predicted cluster values in our original dataset.
df.insert(0, "Cluster", clusters, True)
df.head(10)
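One possible approach (a sketch only, assuming the frame has a categorical 'OU' column and the remaining columns are categorical features): aggregate the accounts down to one representative row per OU, then cluster those 17 rows instead of the individual records.
# one modal (most frequent) value per feature for each OU -- 17 rows total
ou_profiles = df.groupby('OU').agg(lambda s: s.mode().iloc[0])
kmode = KModes(n_clusters=3, init="random", n_init=5, verbose=1)
ou_clusters = kmode.fit_predict(ou_profiles)
# ou_clusters[i] is the cluster assigned to ou_profiles.index[i]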
I don't have access to your data set, but below is a generic example of how to do clustering.
# Cluster analysis, or clustering, is an unsupervised machine learning task.
# It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling),
# clustering algorithms only interpret the input data and find natural groups or clusters in feature space.
import statsmodels.api as sm
import numpy as np
import pandas as pd
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X = df_cars[['mpg','hp']].copy()
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
X['kmeans'] = yhat
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
See the link below for more details.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb

Changing cluster labels for a KMeans model

I have fit a Kmeans model on document embeddings from a Doc2Vec model to cluster the embeddings and get a visualization as well as the most frequent terms per cluster. I have been able to do this fine and get the same visualization each time.
When I run kmeans.fit_predict on the model, it gives me a list of cluster labels, one per document embedding, according to the number of clusters I have specified. The issue is that when I run the model multiple times, it gives a similar spread per cluster each time, but the cluster labels change between runs. For example,
Run 1 - 0:100, 1:100, 2:10
Run 2 - 0:99 , 1:101, 2:10
Run 3 - 2:100, 0:100, 1:10
Run 4 - 0:100, 1:100, 2:10
I tried saving the model and using the same model multiple times but encountered the same issue. This causes the most frequent terms per cluster and the position of the cluster in the visualization to change, which changes the way it is interpreted. I was planning to use the labels as a classification method, but doesn't this make that impossible? I'm not sure if it's an issue with my code or if this is normal behavior; any help would be much appreciated.
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
df = pd.read_csv("data.csv")
d2v_model = Doc2Vec.load("d2vmodel")
clusters = 3
iterations = 100
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations)
X = kmeans_model.fit(d2v_model.docvecs.vectors_docs)
l = kmeans_model.fit_predict(d2v_model.docvecs.vectors_docs)
labels = kmeans_model.labels_.tolist()
pca = PCA(n_components=2).fit(d2v_model.docvecs.vectors_docs)
datapoint = pca.transform(d2v_model.docvecs.vectors_docs)
df["clusters"] = labels
cluster_list = []
cluster_colors = ["#FFFF00", "#008000", "#0000FF"]
plt.figure()
color = [cluster_colors[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker="^", s=150, c="#000000")
plt.show()
for i in range(clusters):
    df_temp = df[df["clusters"] == i]
    cluster_words = Counter(" ".join(df_temp["Body"].str.lower()).split()).most_common(25)
    cluster_list.extend(x[0] for x in cluster_words)
    cluster_list.clear()
For k-means, when you run fit multiple times, the centroids are initialized randomly each time. To make it deterministic you can use the random_state parameter; see the docs: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations, random_state=42)  # any fixed integer
Stabilizing the initialization randomization by specifying a random_state (per #qaiser's answer) may help, perhaps by ensuring that similar-ish sets of doc-vectors, run against the same starting KMeans state, tend to find the 'same' clusters in the same named slots.
But there could be situations, where the doc-vectors have a different distribution, or where initialized state is (by bad luck) highly sensitive to doc-vector distribution, where even this repeated-initialization doesn't maintain coherent clusters.
You might want to also consider one or both of:
(1) initializing the KMeans clusters to match the prior run's centroids, to bias the later analysis towards creating compatibly named/centered clusters;
(2) after the second run finishes, rename the clusters according to which (of all possible 3! arbitrary naming permutations of 3 clusters) leaves the smallest possible total distances between each 'new' cluster of the same name to the 'prior' cluster of the same name.
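A minimal sketch of option (2), assuming prior_centroids and new_centroids are (k, dim) arrays from the two runs and new_labels are the second run's labels; brute force over permutations is fine for k = 3:
from itertools import permutations
import numpy as np

def relabel_to_match(prior_centroids, new_centroids, new_labels):
    # Try every naming permutation and keep the one minimizing the total
    # distance between same-named prior and new centroids.
    k = len(prior_centroids)
    best_perm = min(
        permutations(range(k)),
        key=lambda p: sum(np.linalg.norm(prior_centroids[i] - new_centroids[p[i]])
                          for i in range(k)),
    )
    # best_perm[i] is the index of the new cluster that takes the old name i
    mapping = {best_perm[i]: i for i in range(k)}
    return np.array([mapping[label] for label in new_labels])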
I think the issue might be the use of .fit_predict. Try just .predict; see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Try:
l = kmeans_model.predict(d2v_model.docvecs.vectors_docs)
Something similar worked for me.

How to calculate the Silhouette coefficient for k-medoid clustering using the pyclustering lib?

I'd like to try the k-medoid clustering method (PAM) on the dataset https://archive.ics.uci.edu/ml/datasets/seeds
I don't know whether there exist libraries other than pyclustering for this purpose. Anyway, how can I compute the Silhouette coefficient for the clustering using this library? It doesn't provide such a method the way k-means does with sklearn.
From the documentation, you can use sklearn.metrics.silhouette_score(X, labels, metric='euclidean', sample_size=None, random_state=None, **kwds). This function returns the mean Silhouette coefficient over all samples. To obtain the values for each sample, use silhouette_samples. I also recommend looking at this vignette; there is a nice example in there for you to test too.
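A minimal sketch of that route, assuming X is the seeds feature matrix and clusters is the list of per-cluster index lists that pyclustering's get_clusters() returns (sklearn wants one flat label per sample):
from sklearn.metrics import silhouette_score, silhouette_samples
import numpy as np

labels = np.empty(len(X), dtype=int)
for cluster_id, members in enumerate(clusters):
    labels[members] = cluster_id                              # flatten index lists
mean_score = silhouette_score(X, labels, metric='euclidean')  # mean over all samples
per_sample = silhouette_samples(X, labels)                    # one value per sample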
It is also possible via pyclustering since 0.8.2; here is an example from the documentation:
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.silhouette import silhouette
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample
# Read data 'SampleSimple3' from Simple Sample collection.
sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)
# Prepare initial centers
centers = kmeans_plusplus_initializer(sample, 4).initialize()
# Perform cluster analysis
kmeans_instance = kmeans(sample, centers)
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()
# Calculate Silhouette score
score = silhouette(sample, clusters).process().get_score()
In the case of PAM you need to change the last part:
...
from pyclustering.cluster.kmedoids import kmedoids
medoids = kmeans_plusplus_initializer(sample, 4).initialize(return_index=True)
kmedoids_instance = kmedoids(sample, medoids)
clusters = kmedoids_instance.process().get_clusters()
score = silhouette(sample, clusters).process().get_score()

Gap Statistic Method

import sys
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
filename = sys.argv[1]
datafile = sio.loadmat(filename)
data = datafile['bow']
sizedata = [len(data), len(data[0])]
gap = []
SD = []
for knum in range(10, 20):
    print(knum)
    # Clustering original data
    kmeanspp = KMeans(n_clusters=knum, init='k-means++', max_iter=100)
    kmeanspp.fit(data)
    dispersion = kmeanspp.inertia_
    # Clustering reference data
    nrefs = 10
    refDisp = np.zeros(nrefs)
    for nref in range(nrefs):
        refdata = np.random.random_sample((sizedata[0], sizedata[1]))
        refkmeans = KMeans(n_clusters=knum, init='k-means++', max_iter=100)
        refkmeans.fit(refdata)
        refdisp = refkmeans.inertia_
        refDisp[nref] = np.log(refdisp)
    mean_log_refdisp = np.mean(refDisp)
    gap.append(mean_log_refdisp - np.log(dispersion))
    # Calculating standard deviation
    sd = (sum([(r - m)**2 for r, m in zip(refDisp, [mean_log_refdisp]*nrefs)]) / nrefs)**0.5
    SD.append(sd)
SD = [sd*((1 + (1/nrefs))**0.5) for sd in SD]
# Determining optimal k
opt_k = None
diff = []
for i in range(len(gap) - 1):
    diff = (SD[i+1] - (gap[i+1] - gap[i]))
    if diff > 0:
        opt_k = i + 10
        break
print(diff)
plt.plot(np.linspace(10, 19, 10, True), gap)
plt.show()
Here I am trying to implement the Gap Statistic method for determining the optimal number of clusters. But the problem is that every time I run the code I get a different value for k.
What is the solution to the problem?
How can the value of optimal k differ for the same data?
I have stored the data in a .mat file beforehand, and I am passing it as an argument via the terminal.
I am looking for the smallest value of k for which Gap(k) >= Gap(k+1) - s(k+1), where s(k+1) = sd(k+1) * sqrt(1 + 1/B), sd is the standard deviation of the reference distribution, and B is the number of Monte Carlo samples.
Stated otherwise, I am searching for the value of k for which
s(k+1) - Gap(k+1) + Gap(k) >= 0
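For reference, a direct sketch of that stopping rule, assuming gap and s are lists indexed from k = kmin as in the code above:
def smallest_k(gap, s, kmin=10):
    # smallest k with Gap(k) >= Gap(k+1) - s(k+1)
    for i in range(len(gap) - 1):
        if gap[i] >= gap[i + 1] - s[i + 1]:
            return kmin + i
    return None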
A couple of problems with your simulation:
1- sd = (sum([(r - m)**2 for r, m in zip(refDisp, [mean_log_refdisp]*nrefs)])/nrefs)**0.5
Why did you multiply the second component of zip by nrefs? That is not needed according to the original paper.
2-
if diff > 0:
    opt_k = i + 10
    break
If you have diff > 0, you want diff >= 0, since equality can happen as well.
As for why you get a different number of clusters each time: as others have said, this is a Monte Carlo simulation, so there is inherent randomness, and it also depends on what you are clustering and on your dataset. I suggest you test your algorithm against the Silhouette and Elbow methods to get a better idea of the number of clusters.
One option is to run your function several times, average the gap statistics and the s values, and find the smallest k where the average s(k+1) - Gap(k+1) + Gap(k) is greater than or equal to zero.
This will take longer but give a more reliable result.
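A sketch of that idea, assuming a hypothetical helper gap_statistic_run(data) that wraps the loop from the question and returns the gap and SD lists for one run, plus smallest_k from the sketch above:
import numpy as np

n_runs = 5
runs = [gap_statistic_run(data) for _ in range(n_runs)]  # hypothetical helper
gap_avg = np.mean([g for g, s in runs], axis=0)
s_avg = np.mean([s for g, s in runs], axis=0)
opt_k = smallest_k(gap_avg, s_avg, kmin=10)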
