I have fit a KMeans model on document embeddings from a Doc2Vec model to cluster the embeddings, visualize them, and get the most frequent terms per cluster. This part works fine and I get the same visualization each time.
When I run kmeans.fit_predict on the model, it returns a list of cluster labels (for the number of clusters I specified) with one label per document embedding. The issue is that when I run the model multiple times, the spread per cluster stays similar but the cluster labels change between runs. For example,
Run 1 - 0:100, 1:100, 2:10
Run 2 - 0:99 , 1:101, 2:10
Run 3 - 2:100, 0:100, 1:10
Run 4 - 0:100, 1:100, 2:10
I tried saving the model and reusing the same model multiple times but ran into the same issue. This changes the most frequent terms per cluster and the position of each cluster in the visualization, which changes how it is interpreted. I was planning to use the labels as a classification method, but doesn't this make that impossible? I'm not sure whether it's an issue with my code or normal behavior. If anyone can help, it would be much appreciated.
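import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv("data.csv")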
d2v_model = Doc2Vec.load("d2vmodel")
clusters = 3
iterations = 100
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations)
X = kmeans_model.fit(d2v_model.docvecs.vectors_docs)
l = kmeans_model.fit_predict(d2v_model.docvecs.vectors_docs)
labels = kmeans_model.labels_.tolist()
pca = PCA(n_components=2).fit(d2v_model.docvecs.vectors_docs)
datapoint = pca.transform(d2v_model.docvecs.vectors_docs)
df["clusters"] = labels
cluster_list = []
cluster_colors = ["#FFFF00", "#008000", "#0000FF"]
plt.figure()
color = [cluster_colors[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker="^", s=150, c="#000000")
plt.show()
for i in range(clusters):
    df_temp = df[df["clusters"] == i]
    # 25 most common lower-cased words in this cluster's documents
    cluster_words = Counter(" ".join(df_temp["Body"].str.lower()).split()).most_common(25)
    cluster_list.extend(word for word, _ in cluster_words)
    cluster_list.clear()
For KMeans, the centroids are initialized randomly every time you call fit. To make the result deterministic, pass the random_state parameter. See the docs: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations, random_state=42)  # any fixed integer works
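As a rough check of the point above, the sketch below fits KMeans twice on the same data with the same random_state and compares the labels. The synthetic `vectors` array is only a stand-in for the Doc2Vec vectors, and the `cluster` helper and the seed value are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for the document vectors: three well-separated blobs.
vectors = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(50, 5)) for c in (0.0, 5.0, 10.0)]
)

def cluster(data, seed):
    """Fit a fresh KMeans model with a fixed seed and return its labels."""
    model = KMeans(n_clusters=3, init="k-means++", n_init=10,
                   max_iter=100, random_state=seed)
    return model.fit_predict(data)

labels_a = cluster(vectors, seed=42)
labels_b = cluster(vectors, seed=42)

# Same data + same seed -> same label numbering on both runs.
print(np.array_equal(labels_a, labels_b))
```

This only guarantees repeatability for identical input data; if the vectors themselves change between runs, the numbering can still move.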
Stabilizing the initialization randomization by specifying a random_state (per @qaiser's answer) may help, perhaps by ensuring that similar-ish sets of doc-vectors, run against the same starting KMeans state, tend to find the 'same' clusters in the same named slots.
But there could be situations where the doc-vectors have a different distribution, or where the initialized state is (by bad luck) highly sensitive to the doc-vector distribution, and even this repeated initialization doesn't maintain coherent clusters.
You might want to also consider one or both of:
(1) initializing the KMeans clusters to match the prior run's centroids, to bias the later analysis towards creating compatibly named/centered clusters;
(2) after the second run finishes, rename the clusters according to which (of all possible 3! arbitrary naming permutations of 3 clusters) leaves the smallest possible total distances between each 'new' cluster of the same name to the 'prior' cluster of the same name.
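Option (2) can be sketched in a few lines, assuming the centroids of both runs are available (for example from cluster_centers_). The function name `match_labels` and the toy centroids are hypothetical; trying every permutation is only practical for a small number of clusters.

```python
from itertools import permutations
from math import dist

def match_labels(prior_centers, new_centers):
    """Map each new label to the prior label that minimizes the
    total centroid-to-centroid distance over all permutations."""
    k = len(prior_centers)
    best_perm = min(
        permutations(range(k)),
        key=lambda perm: sum(
            dist(new_centers[new], prior_centers[old])
            for new, old in enumerate(perm)
        ),
    )
    return dict(enumerate(best_perm))

# Toy centroids: the second run found the same clusters in a different order.
prior = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
new = [(9.8, 0.1), (0.2, -0.1), (5.1, 4.9)]

mapping = match_labels(prior, new)
new_labels = [0, 0, 1, 2, 1]
relabelled = [mapping[label] for label in new_labels]
print(mapping, relabelled)
```

For larger numbers of clusters the factorial search becomes expensive, and an assignment solver would be a more sensible choice than brute force.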
I think the issue might be the use of .fit_predict, which refits the model every time it is called. Try using just .predict on the already fitted model; see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Try:
l = kmeans_model.predict(d2v_model.docvecs.vectors_docs)
Something similar worked for me.
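A minimal sketch of the fit-once, predict-many pattern follows. The synthetic `vectors` array is a stand-in for the document vectors, and the pickle round trip stands in for saving the model to disk and loading it again; both are assumptions for illustration.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical stand-in for the document vectors: two separated blobs.
vectors = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(40, 4)) for c in (0.0, 6.0)]
)

# Fit exactly once; afterwards only predict() is called.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
labels_first = model.predict(vectors)

# Round trip through pickle, as if the model had been saved and reloaded.
restored = pickle.loads(pickle.dumps(model))
labels_again = restored.predict(vectors)

print(np.array_equal(labels_first, labels_again))
```

Because predict only assigns points to the stored centroids, the numbering cannot change between calls unless fit or fit_predict is run again.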
I have the below scikit learn script which outputs a nice chart (below) with each of the clusters.
I have a couple of questions:
- How can I export this to CSV - with a cluster name or ID?
- How can I name the clusters?
- How can I make sure the clusters are always named the same thing? For example, I want to call the top right segment 'high spenders'; how do I make sure that name is always attached to the correct cluster?
Thanks!
#import the required libraries
# - matplotlib is a charting library
# - Seaborn builds on top of Matplotlib and introduces additional plot types. It also makes your traditional Matplotlib plots look a bit prettier.
# - Numpy is numerical Python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
#Generate sample data, with distinct clusters for testing
#n_samples = the number of datapoints, equally split across each clusters
#centers = The number of centers to generate (number of clusters) - a center is the arithmetic mean of all the points belonging to the cluster.
#cluster_std = the standard deviation of the clusters - a quantity expressing by how much the members of a group differ from the mean value for the group (how tight is the cluster going to be)
#random_state = controls the random number generator being used. If you don't set random_state, a new random dataset is generated every time you execute the code. If you use a particular value (random_state = 1 or any other integer), the generated data will be the same every time.
#make_blobs generates "isotropic Gaussian blobs" - X is a numpy array with two columns which contain the (x, y) Gaussian coordinates of these points, whereas y contains the list of categories for each.
#X, y = simply means that the output of make_blobs() has two elements, that are assigned to X and y.
X, y = make_blobs(n_samples=300, centers=4,
                  cluster_std=0.50, random_state=0)
#X now looks like this - column zero becomes the X axis, column 1 becomes the Y axis
#array([[ 1.85219907, 1.10411295],
#       [-1.27582283, 7.76448722],
#       [ 1.0060939 , 4.43642592],
#       [-1.20998253, 7.83203579],
#       [ 1.92461484, 1.06347673],
#       [ 2.28565919, 0.79166208],
#       [-1.57379043, 2.69773813],
#       [ 1.04917913, 4.31668562],
#       [-1.07436851, 7.93489945],
#       [-1.15872975, 7.97295642]
#The below statement, will enable us to visualise matplotlib charts, even in ipython
#Using matplotlib backend: MacOSX
#Populating the interactive namespace from numpy and matplotlib
%pylab
#plot the chart
#s = the size of the points.
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], s=50);
#now, I am defining that I want to find 4 clusters within the data. The general rule I follow is, I will have 7 times fewer clusters than datapoints.
kmeans = KMeans(n_clusters=4)
#build the model, based on X with the number of clusters defined above
kmeans.fit(X)
#now we're going to find clusters in the randomly generated dataset
predict = kmeans.predict(X)
#now we can plot the prediction
#c = colour, which is based on the predict variable we defined above
#s = the size of the points
#X[:, 0] is the numpy coordinates way of selecting every row entry for column 0 - i.e. a single column from the numpy array.
#X[:, 1] is the numpy coordinates way of selecting every row entry for column 1 - i.e. a single column from the numpy array.
plt.scatter(X[:, 0], X[:, 1], c=predict, s=50)
Based on your code, the following worked for me. You can certainly stay with numpy for storing the CSV, but I simply prefer pandas. The sorting line should give you the same results every time you run the code. However, since the initialization of the clusters can have an impact, I would also set a seed in your code, e.g. np.random.seed(42), and create the KMeans object with the random_state parameter, e.g. kmeans = KMeans(n_clusters=4, random_state=42)
# transform to dataframe
import pandas as pd
import seaborn as sns
df = pd.DataFrame(X)
df.columns = ["var1", "var2"]
df["cluster"] = predict
colors = sns.color_palette()[0:4]
df = df.sort_values("cluster")
# check plot
sns.scatterplot(x=df["var1"], y=df["var2"], hue=df["cluster"], palette=colors)
plt.show()
# define rename schema
mynames = {"0": "center_left", "1": "top_left", "2": "bot_right", "3": "center"}
df["cluster_name"] = [mynames[str(i)] for i in df.cluster]
# plot again to verify order
sns.scatterplot(x=df["var1"], y=df["var2"], hue=df["cluster_name"],
                palette=colors)
sns.despine()
plt.show()
# save dataframe as CSV
df.to_csv("myoutput.csv")
The first plot looks like this:
The second plot looks like this:
The CSV will look like this:
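For the "always named the same thing" part, one possible approach, sketched here as an assumption rather than something the code above does, is to renumber the clusters by where their centroids sit before attaching names. The helper `canonical_labels`, the toy centroids, and the names are all hypothetical.

```python
import numpy as np

def canonical_labels(centers, labels):
    """Renumber clusters by centroid position: sorted by x, ties broken by y."""
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # np.lexsort uses the LAST key as the primary key, so x comes last.
    order = np.lexsort((centers[:, 1], centers[:, 0]))
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(len(order))  # rank[raw_label] -> stable label
    return rank[labels]

# Two runs that found the same clusters but numbered them differently.
centers_run1 = np.array([[2.0, 1.0], [-1.5, 3.0], [1.0, 4.0], [-1.0, 8.0]])
labels_run1 = np.array([0, 1, 2, 3, 0])
centers_run2 = centers_run1[[3, 0, 2, 1]]
labels_run2 = np.array([1, 3, 2, 0, 1])

stable1 = canonical_labels(centers_run1, labels_run1)
stable2 = canonical_labels(centers_run2, labels_run2)

# Hypothetical names, listed in the same left-to-right centroid order.
names = np.array(["far_left", "left", "right", "far_right"])
print(names[stable1], np.array_equal(stable1, stable2))
```

With a fitted model the inputs would be kmeans.cluster_centers_ and the predicted labels; the sort rule itself is a design choice and should reflect whatever makes a segment such as 'high spenders' identifiable.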
I’m just trying to get my head around clustering.
I have a series of data points, y, each with Gaussian noise added.
There are two classes of values 0 and >0 (obviously with noise). I’m trying to find the centre point of the group which is >0.
I’ve plotted the points with a simple moving average to be able to eyeball the data.
Moving average plot:
How can I cluster the data just based on the y value?
I’d like to have two clusters - one covering the points on the left and right (roughly <120 and >260 by the looks of it) and the other for the middle points (x = 120 to 260)
If I try with two clusters I get this:
k means plot - k=2:
How should I amend my code to achieve this?
import numpy as np
import matplotlib.pyplot as plt

x = range(315)
y= [-0.0019438692324050865, 0.0028994208839327852, 0.0051483573976274649, -0.0033242993359676809, -0.007205517954705391, 0.0023493638544448323, 0.0021109981155292179, 0.0035990200904119076, -0.0039516797159245328, 0.0046512034107712786, -0.0019248189368846083, 0.0036744109953683823, 0.0007898612768152954, 0.0050059088808496474, -0.0021084425769681558, 0.0014692258570182986, -0.0030711206115484175, -0.0026614801222815628, 0.0022816301256991535, 0.00019923934682088178, -0.0013181161659271139, -0.0021956355547661358, 0.0012941895041076283, 0.00337197586896105, -0.0019792508536746402, -0.002020497762984554, 0.0014495021773240431, 0.0011887337096206894, 0.0016667792145975404, -0.0010119590445208419, -0.0024506337087077676, 0.0072264471843846339, -0.0014126073097276062, -0.00065673498034648755, -0.0011355352304356647, -0.00042657980930307281, -0.0032875547481258042, -0.002351265010099495, -0.00073344218847348742, -0.0031555991687002589, 0.0026170287799315104, 0.0019289080666337198, -0.0021804765064623076, 0.0026221290350876979, 0.0019831827145683828, -0.005422907223254632, -0.0014107046201467732, -0.0049438583709020423, 0.00081884635937855494, 0.0054783747880986361, -0.0011282600170147909, -0.00436581779762948, 0.0024421851848953177, -0.0018564229613786095, -0.0052492274840120123, 0.0051775747035086306, 0.0052413417491534494, 0.0030817295096650732, -0.0014106391941506153, 0.00074380887788818206, -0.0041507550699856439, -0.00074928547462217287, -9.3938667619130614e-05, -0.00060592968804004362, 0.0064913597798387348, 0.0018098075166183621, 0.00099550852535854441, 0.0037322288350247917, 0.0027039351321340869, 0.0060238021513650541, -0.006567405116575234, 0.0020858553839503175, -0.0040329574871009084, -0.0029337227854833213, 0.0020743996957790969, 0.0041249738085716511, -0.0016678673351373336, -0.00081387164524554967, -0.0028411340446090278, 0.00013572776045231967, -0.00025350369023925548, 0.00071609777542998309, -0.0018427036825796074, -0.0015513575887011904, 
-0.0016357115978466398, 0.0038235991426514866, 0.0017693050063256977, -0.00029816429542494152, -0.0016071303644783605, -0.0031883070092131086, -0.0010340123778528594, -0.0049194467790889653, 0.0012109237666701397, 0.0024532524488299246, 0.0069307209537693721, 0.0009573350812806618, -6.0022322637651027e-05, -0.00050143013334696311, 0.0023415017810229548, 0.0033053845403900849, -0.0061156769150035222, 0.00022216114877491691, 0.0017257349557975464, 4.6919738262423826e-05, -0.0035257466102171162, -0.0043673831041441185, -0.0016592116617178102, -0.003298933045964781, -0.001667158964114637, 0.0011283739877531254, -0.0055098513985193534, 0.0023564462221116358, 0.0041971132878626258, 0.0061727231077443314, 0.0047583822927202779, 0.0022475414486232245, 0.0048682822792560521, 0.0022415648209199016, 0.00044859963858686957, -0.0018519391698513449, 0.0031460918774998763, 0.0038614233082916809, -0.0043409564348247066, -0.0055560805453666326, -0.00025133196059449212, 0.012436346397552794, 0.01136022093203152, 0.011244278807602391, 0.01470018209739289, 0.0075560289478025277, 0.012568781764361209, 0.0076068752709663838, 0.011022209533236597, 0.010545997929846045, 0.01084340614623565, 0.011728388118710915, 0.0075043238708055885, 0.012860298948366296, 0.0097297636410632864, 0.0098800557729756874, 0.011536517297700085, 0.0082316420968713416, 0.012612386004592427, 0.016617154743589352, 0.0091391582296167315, 0.014952150276251052, 0.011675391002362373, 0.01568297072839233, 0.01537664322062633, 0.01622711654371662, 0.010708828344561546, 0.016625354383482532, 0.010757807468539406, 0.016867909081979202, 0.010354635736138377, 0.014345365677006765, 0.011114328315579219, 0.010034249196973242, 0.015846180181371881, 0.014303841146954242, 0.011608682896746103, 0.0086826955459553216, 0.0088576104599897426, 0.011250553207393772, 0.005522552439745569, 0.011185993425936373, 0.010241377537878162, 0.0079206732150164348, 0.0052965651546758108, 0.011104715912291204, 0.010506408714857187, 
0.010153282642128673, 0.010286986015082572, 0.01187330766677645, 0.014541420264499783, 0.013092204890199896, 0.012979246400649271, 0.012595814351669916, 0.014714607377710237, 0.011727516021525658, 0.011035077266739704, 0.0089698030032708698, 0.0087245475140550147, 0.011139467365240661, 0.0094505568595650603, 0.014430361388952871, 0.0089241578716030695, 0.014616210804585136, 0.013295072783119581, 0.014430633057603408, 0.01200577022494694, 0.011315388654675421, 0.013359877656434442, 0.017704146495248471, 0.0089900858719559155, 0.014731590728415532, 0.0053244009632545759, 0.011199377929150522, 0.0098899254166580439, 0.012220397221188688, 0.015315682643295272, 0.0042842773538990919, 0.0098560854848898077, 0.0088592602102698509, 0.011682575531316278, 0.0098450268165344631, 0.015508017179782136, 0.0083959771972897564, 0.0057504382506886418, 0.010149849298310511, 0.011467172305959087, 0.019354427705224483, 0.013200207481702888, 0.0084555200083286791, 0.011458643458455485, 0.0067582116806278788, 0.01083616691886825, 0.013189184991857963, 0.011774794518724967, 0.014419252448288828, 0.011252283438046358, 0.013346699363583018, 0.0070752340082163006, 0.013215300343131422, 0.0083841320189162287, 0.0067600805611729283, 0.014043517055899181, 0.0098241497159076551, 0.011466675085574904, 0.01155354571355972, 0.012051701509217881, 0.010150596813866767, 0.0093930906430917619, 0.003368481869910186, 0.0048359029438027378, 0.0072083852964288445, 0.010112266453748613, 0.014009345326404186, 0.0050187514558796657, 0.0076315122645601551, 0.0098572381625301152, 0.0114902035403828, 0.018390212262653569, 0.020552166087412803, 0.010428735773226807, 0.011717974670325962, 0.011586303572796604, 0.0092978832913345726, 0.0040060048273946845, 0.012302496528511328, 0.0076707934776137684, 0.014700766223305586, 0.013491092168119941, 0.016244916923257174, 0.010387716692694397, 0.0072564046806323553, 0.0089420045528720883, 0.012125390630607462, 0.013274623392811291, 0.012783388635585766, 
0.013859113028817658, 0.0080975189401925642, 0.01379241865445455, 0.012648552766643405, 0.011380280655911323, 0.010109646424218717, 0.0098577688652478051, 0.0064661895943772208, 0.010848835432253455, -0.0010986941731458047, -0.00052875821639583262, 0.0020423603076171414, 0.0035710440970171805, 0.001652886517437206, 0.0023512717524485573, -0.002695275440737862, 0.002253880812688683, -0.0080855104018828141, -0.0020090808966136161, -0.0029794078852333791, 0.00047537441103425869, -0.0010168825525621432, 0.0028683012479151873, -0.0014733214239664142, 0.0019432702158397569, -0.0012411849653504801, -0.00034507088510895141, -0.0023587874349834145, 0.0018156591123708393, 0.0040923006067568324, 0.0043522232127477072, -0.0055992642684123371, -0.0019368557792245147, 0.0026257395447205848, 0.0025594329536029635, 0.00053681548609292378, 0.0032186216144045742, -0.003338121135450386, 0.00065996843114729585, 0.006711173245189642, 0.0032877327776177517, 0.0039528629317296367, 0.0063732674764248719, -0.0026207617244284023, 0.0061381482567009048, -0.003024741769256066, -0.0023891419421980839, -0.004011235930513047, 0.0018372067754070733, -0.0045928077859572689, -0.0021420171112169601, 0.001665179522797816, 0.0074356736689407859, 0.0065680163280897891, -0.0038116640825467678]
data = np.column_stack([x,y])
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
y_kmeans = kmeans.predict(data)
plt.scatter(data[:, 0], data[:, 1], c=y_kmeans, s=5, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
plt.grid()
I’d also like to be able to return the max, min and average for the values in each cluster - is this possible?
Some ideas on your problem.
k-means is really a multivariate method, so it is probably not a good choice in your case. You can take advantage of the 1-dimensionality of your data by looking for minima of a kernel density estimate of the y-data. A plot of the density estimate will show a bimodal density function whose two modes are separated by a minimum; that minimum is the y-value at which you want to divide the two clusters.
Have a look at http://scikit-learn.org/stable/modules/density.html#kernel-density
To get the x-values at which you divide, you could use the moving average you already computed.
However, there might be methods better suited to your kind of data. You might want to ask your question at https://stats.stackexchange.com/ as it is not really a programming problem but one about the appropriate method.
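A rough sketch of the density-minimum idea, using scikit-learn's KernelDensity. The synthetic `y`, the bandwidth, and the shortcut of searching only between the 10th and 90th percentiles (to stay between the two modes and away from the sparse tails) are all assumptions and would need tuning on real data.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Synthetic stand-in for the y-values: one group near 0, one near 0.011.
y = np.concatenate([
    rng.normal(0.0, 0.002, 185),
    rng.normal(0.011, 0.002, 130),
])

# Fit a Gaussian kernel density estimate to the 1-D values.
kde = KernelDensity(kernel="gaussian", bandwidth=0.001).fit(y.reshape(-1, 1))

# Evaluate the (log) density on a grid between the 10th and 90th percentiles.
lo, hi = np.percentile(y, [10, 90])
grid = np.linspace(lo, hi, 500)
log_density = kde.score_samples(grid.reshape(-1, 1))

# The grid point with the lowest density is the dividing y-value.
threshold = grid[np.argmin(log_density)]
labels = (y > threshold).astype(int)  # 0 = low group, 1 = high group
print(threshold, labels.sum())
```

The bandwidth controls how smooth the estimate is: too small and noise creates spurious dips, too large and the two modes merge into one.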
You can reshape your data to an n x 1 array.
But if you want to take the time into account, I suggest you look into change detection in time series instead. It can detect a change in mean.
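To illustrate detecting a change in mean, here is a brute-force least-squares sketch that looks for the two split points giving three segments with the smallest total squared error. It is not a specific library's method; the function name `two_change_points` and the synthetic series are assumptions.

```python
import numpy as np

def two_change_points(y):
    """Return (i, j) so that y[:i], y[i:j], y[j:] have the smallest
    total within-segment sum of squared deviations from their means."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Prefix sums give any segment's squared error in constant time.
    s1 = np.concatenate([[0.0], np.cumsum(y)])
    s2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def sse(a, b):
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best_cost, best_i, best_j = np.inf, None, None
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            cost = sse(0, i) + sse(i, j) + sse(j, n)
            if cost < best_cost:
                best_cost, best_i, best_j = cost, i, j
    return best_i, best_j

rng = np.random.default_rng(0)
# Synthetic series: noise around 0, with a raised mean for indices 125..269.
series = rng.normal(0.0, 0.002, 315)
series[125:270] += 0.011

start, end = two_change_points(series)
print(start, end)
```

The double loop is quadratic in the series length, which is fine for a few hundred points; dedicated change-point packages are more appropriate for long series or an unknown number of changes.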
Using your code, the simplest way to get what you want is to change:
kmeans.fit(data)
y_kmeans = kmeans.predict(data)
to
kmeans.fit(data[:,1].reshape(-1,1))
y_kmeans = kmeans.predict(data[:,1].reshape(-1,1))
You can get the max, min, mean etc. of each cluster by indexing the values with a boolean mask on the labels, for example:
np.max(data[:,1][y_kmeans == 1])
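The same boolean-mask idea extends to all clusters at once. The arrays below are small made-up examples standing in for data[:,1] and y_kmeans.

```python
import numpy as np

# Made-up stand-ins for data[:, 1] and y_kmeans.
values = np.array([0.001, -0.002, 0.012, 0.010, 0.003, 0.014])
labels = np.array([0, 0, 1, 1, 0, 1])

stats = {}
for label in np.unique(labels):
    members = values[labels == label]  # values belonging to this cluster
    stats[int(label)] = {
        "min": float(members.min()),
        "max": float(members.max()),
        "mean": float(members.mean()),
    }

for label, summary in stats.items():
    print(label, summary)
```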