Plotting clusters using PCA features as X and Y axes - python

I have applied PCA to a dataframe in order to plot clusters based on K-means. Since I have around 24 features in my original df, I don't want to plot clusters based on only 2 or 3 features at a time. What I want to do is plot the combinations of those features, to get a more general/representative graphical representation of each feature in the clusters.
I extracted the components using pca.components_ and created the following df of components:
PC-1 PC-2
media_bi_mov 0.003094 0.050599
media_bi_post 0.000762 0.028931
total_mov_prod_300 0.000836 0.573675
codsprod_0 0.440476 -0.004404
codsprod_1 0.008005 0.105349
codsprod_2 0.002851 0.042459
codsprod_3 0.001078 0.009355
codsprod_4 -0.011922 -0.022020
idaplic_0 0.392229 -0.002817
idaplic_1 0.003001 0.004822
idaplic_2 0.044730 -0.001148
idaplic_3 0.097695 -0.008628
idaplic_4 0.024273 0.486973
idaplic_5 0.234798 -0.033369
idaplic_6 0.019329 0.015455
idempro_36 0.000401 -0.000438
idempro_38 0.032149 0.292137
idempro_49 0.439413 -0.023269
codmonsw_EUR 0.440543 -0.002770
codmonsw_USD 0.000378 0.000664
resto_codsprod 0.011406 0.011731
resto_idaplic 0.041649 0.005692
días_entre_ops -0.011129 -0.015144
frecuencia 0.440543 -0.002770
valor_total_eur 0.000836 0.573675
Normally, if this were the original df, I would plot the clusters using kmeans.labels_ to apply a different color to each cluster. But my issue now is that I can't use kmeans.labels_ to differentiate the clusters in this PCA-reduced df, since kmeans.labels_ has a bigger length (one label per row of the original df, not one per feature).
How can I apply color to differentiate the clusters in this dataframe?
Thanks in advance

I didn't realise the solution to this problem was so easy: I just needed to run k-means on the components df to get a cluster label for each feature in the principal-component space. Hope this will help someone with the same doubts as me.
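For anyone after the concrete steps, here is a minimal sketch of that idea. The loadings table above is assumed to be a DataFrame called components_df, and the cluster count (3) is an arbitrary choice:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# components_df is the PC-1 / PC-2 loadings table shown above
feature_kmeans = KMeans(n_clusters=3, random_state=0).fit(components_df)

# one label per feature, so lengths now match
plt.scatter(components_df["PC-1"], components_df["PC-2"],
            c=feature_kmeans.labels_)
for name, (pc1, pc2) in components_df.iterrows():
    plt.annotate(name, (pc1, pc2), fontsize=7)
plt.xlabel("PC-1")
plt.ylabel("PC-2")
plt.show()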

Related

subsetting anndata on the basis of louvain clusters

I want to subset anndata on the basis of clusters, but I am not able to understand how to do it.
I am running the scVelo pipeline, and in it I ran the tl.louvain function to cluster cells on the basis of louvain. I got around 32 clusters, of which clusters 2 and 4 are of interest to me, and I have to run the pipeline further on these clusters only. (Initially I had the loom file, which I read into scVelo, so I now have the anndata.)
I tried using adata.obs["louvain"], which gave me the cluster information, but I need to write a new anndata with only those 2 clusters and process it further.
Please help on how to subset anndata. Any help is highly appreciated. (Being very new to it, I am finding it difficult.)
If your adata.obs has the "louvain" column I'd expect after running tl.louvain, you can do the subsetting as
adata[adata.obs["louvain"] == "2"]
if you want to obtain one cluster and
adata[adata.obs['louvain'].isin(['2', '4'])]
for obtaining clusters 2 & 4.
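Since the question also asks for a new anndata to process further, note that these subsets are views. A short sketch of materialising and saving one (the output filename is hypothetical):

# .copy() turns the view into an independent AnnData object
adata_sub = adata[adata.obs["louvain"].isin(["2", "4"])].copy()
adata_sub.write("clusters_2_and_4.h5ad")  # hypothetical output path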
Feel free to use this function I wrote for my work.
from anndata import AnnData
import numpy as np

def cluster_sampled(adata: AnnData, clusters: list, n_samples: int) -> AnnData:
    """Sample n_samples randomly from each of the provided louvain clusters.

    Parameters
    ----------
    adata
        AnnData object
    clusters
        List of clusters to sample from
    n_samples
        Number of samples to take from each cluster

    Returns
    -------
    AnnData
        Annotated data matrix with sampled cells from the clusters
    """
    sampled_indices = []
    # keep only the cells that belong to the requested clusters
    adata_cluster_sampled = adata[adata.obs["louvain"].isin(clusters), :].copy()
    # draw n_samples cells without replacement from each cluster
    for _, indices in adata_cluster_sampled.obs.groupby("louvain").indices.items():
        sampled_indices.append(np.random.choice(indices, n_samples, replace=False))
    return adata_cluster_sampled[np.concatenate(sampled_indices)]
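A hypothetical call for the case in the question, sampling 500 cells each from clusters 2 and 4 (louvain labels are stored as strings):

adata_small = cluster_sampled(adata, ["2", "4"], n_samples=500)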

plotting k-modes clusters in python

I've got 10 clusters in k-modes.
Data: categorical (I converted it to binary, then ran the model).
Technology used: Jupyter/Python.
Questions:
1. How do I find the accuracy?
2. How do I plot/visualise the clusters in 2D and 3D?
Something like this should be a good start.
import numpy as np
from scipy.cluster.vq import kmeans, vq
from matplotlib.pyplot import plot, show

# recreate data to feed into the algorithm
data = np.asarray([np.asarray(df['field1']), np.asarray(df['field2'])]).T
So now run the following piece of code:
# computing K-Means with K = 5 (5 clusters)
centroids, _ = kmeans(data, 5)
# assign each sample to a cluster
idx, _ = vq(data, centroids)
# some plotting using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'oy',
     data[idx == 2, 0], data[idx == 2, 1], 'or',
     data[idx == 3, 0], data[idx == 3, 1], 'og',
     data[idx == 4, 0], data[idx == 4, 1], 'om')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()
This is a great resource.
https://www.pythonforfinance.net/2018/02/08/stock-clusters-using-k-means-algorithm-in-python/
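The linked walkthrough and the snippet above cover the 2D case. For the 3D part of the question, a sketch along the same lines (the third column name 'field3' is hypothetical):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq

data3 = np.asarray([df['field1'], df['field2'], df['field3']], dtype=float).T
centroids3, _ = kmeans(data3, 5)
idx3, _ = vq(data3, centroids3)

# 3-D scatter, coloured by cluster assignment
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(data3[:, 0], data3[:, 1], data3[:, 2], c=idx3, s=20)
ax.scatter(centroids3[:, 0], centroids3[:, 1], centroids3[:, 2],
           c='black', marker='s', s=60)
plt.show()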

changing cluster labels for kmeans model

I have fit a KMeans model on document embeddings from a Doc2Vec model, to cluster the embeddings and get a visualization as well as the most frequent terms per cluster. I have been able to do this fine and get the same visualization each time.
When I run kmeans.fit_predict on the model, it gives me a list of cluster labels, one per document embedding, for the number of clusters I have specified. The issue is that, when running the model multiple times, it gives a similar spread per cluster each time, but the cluster labels change between runs. For example,
Run 1 - 0:100, 1:100, 2:10
Run 2 - 0:99 , 1:101, 2:10
Run 3 - 2:100, 0:100, 1:10
Run 4 - 0:100, 1:100, 2:10
I tried saving the model and using the same model multiple times, but encountered the same issue. This causes the most frequent terms per cluster and the position of each cluster in the visualization to change, which changes the way it is interpreted. I was planning to use the labels as a classification method, but doesn't this make that impossible? I'm not sure if it's an issue with my code or if this is normal behavior; if anyone can help it would be much appreciated.
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

df = pd.read_csv("data.csv")
d2v_model = Doc2Vec.load("d2vmodel")
clusters = 3
iterations = 100
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations)
X = kmeans_model.fit(d2v_model.docvecs.vectors_docs)
l = kmeans_model.fit_predict(d2v_model.docvecs.vectors_docs)
labels = kmeans_model.labels_.tolist()
# project the document vectors to 2-D for plotting
pca = PCA(n_components=2).fit(d2v_model.docvecs.vectors_docs)
datapoint = pca.transform(d2v_model.docvecs.vectors_docs)
df["clusters"] = labels
cluster_list = []
cluster_colors = ["#FFFF00", "#008000", "#0000FF"]
plt.figure()
color = [cluster_colors[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker="^", s=150, c="#000000")
plt.show()
# 25 most frequent terms per cluster
for i in range(clusters):
    df_temp = df[df["clusters"] == i]
    cluster_words = Counter(" ".join(df_temp["Body"].str.lower()).split()).most_common(25)
    [cluster_list.append(x[0]) for x in cluster_words]
    cluster_list.clear()
For KMeans, when you run fit multiple times, the centroids are initialized randomly each time. To make it deterministic you can use the random_state parameter; you can refer to the docs: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations, random_state=42)  # any fixed integer
Stabilizing the initialization randomization by specifying a random_state (per @qaiser's answer) may help, perhaps by ensuring that similar-ish sets of doc-vectors, against the same starting KMeans state, tend to find the 'same' clusters in the same named slots.
But there could be situations where the doc-vectors have a different distribution, or where the initialized state is (by bad luck) highly sensitive to the doc-vector distribution, such that even this repeated initialization doesn't maintain coherent clusters.
You might want to also consider one or both of the following (see the sketch after this list):
(1) initializing the KMeans clusters to match the prior run's centroids, to bias the later analysis towards creating compatibly named/centered clusters;
(2) after the second run finishes, renaming the clusters according to which (of all 3! possible naming permutations of 3 clusters) leaves the smallest total distance between each 'new' cluster and the 'prior' cluster of the same name.
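A minimal sketch of both ideas, assuming a prior_centroids array and a vectors array survive from the earlier run (both names are placeholders):

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

# (1) seed the new run from the previous run's centroids;
# n_init=1 because the starting points are fully specified
kmeans_model = KMeans(n_clusters=3, init=prior_centroids, n_init=1)
new_labels = kmeans_model.fit_predict(vectors)

# (2) or, after an independent run, rename clusters by optimally
# matching each new centroid to the closest prior centroid
dists = np.linalg.norm(kmeans_model.cluster_centers_[:, None, :]
                       - prior_centroids[None, :, :], axis=2)
new_ids, prior_ids = linear_sum_assignment(dists)
mapping = dict(zip(new_ids, prior_ids))
renamed = np.array([mapping[label] for label in new_labels])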
I think the issue might be the use of .fit_predict. Try just .predict; see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Try:
l = kmeans_model.predict(d2v_model.docvecs.vectors_docs)
Something similar worked for me.

Output K-Means to CSV with SciKit Learn - give cluster names

I have the below scikit-learn script, which outputs a nice chart (below) with each of the clusters.
I have a couple of questions:
- How can I export this to CSV - with a cluster name or ID?
- How can I name the clusters?
- How can I make sure the clusters are always named the same thing? For example, I want to call the top-right segment 'high spenders'; how do I do that so it will always be correct?
Thanks!
#import the required libraries
# - matplotlib is a charting library
# - Seaborn builds on top of Matplotlib and introduces additional plot types. It also makes your traditional Matplotlib plots look a bit prettier.
# - Numpy is numerical Python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.datasets import make_blobs  # formerly sklearn.datasets.samples_generator
from sklearn.cluster import KMeans
#Generate sample data, with distinct clusters for testing
#n_samples = the number of datapoints, equally split across the clusters
#centers = the number of centers to generate (number of clusters) - a center is the arithmetic mean of all the points belonging to the cluster
#cluster_std = the standard deviation of the clusters - how much the members of a group differ from the group mean (how tight the cluster will be)
#random_state = controls the random number generator. Without it, each execution generates new random values, so the datasets differ each run; with a fixed value (e.g. random_state=1) the result is the same every time.
#make_blobs generates "isotropic Gaussian blobs" - X is a numpy array with two columns containing the (x, y) coordinates of the points, and y contains the category of each point
#X, y = simply means that the output of make_blobs() has two elements, which are assigned to X and y
X, y = make_blobs(n_samples=300, centers=4,
                  cluster_std=0.50, random_state=0)
#X now looks like this (first 10 rows) - column 0 becomes the X axis, column 1 becomes the Y axis
array([[ 1.85219907,  1.10411295],
       [-1.27582283,  7.76448722],
       [ 1.0060939 ,  4.43642592],
       [-1.20998253,  7.83203579],
       [ 1.92461484,  1.06347673],
       [ 2.28565919,  0.79166208],
       [-1.57379043,  2.69773813],
       [ 1.04917913,  4.31668562],
       [-1.07436851,  7.93489945],
       [-1.15872975,  7.97295642],
       ...])
#The below statement will enable us to visualise matplotlib charts, even in ipython
#Using matplotlib backend: MacOSX
#Populating the interactive namespace from numpy and matplotlib
%pylab
#plot the chart
#s = the size of the points
#X[:, 0] selects every row entry for column 0 - i.e. a single column from the numpy array
#X[:, 1] selects every row entry for column 1
plt.scatter(X[:, 0], X[:, 1], s=50);
#now I am defining that I want to find 4 clusters within the data. The general rule I follow is to have 7 times fewer clusters than datapoints.
kmeans = KMeans(n_clusters=4)
#build the model, based on X, with the number of clusters defined above
kmeans.fit(X)
#now we're going to find clusters in the randomly generated dataset
predict = kmeans.predict(X)
#now we can plot the prediction
#c = colour, based on the predict variable defined above
#s = the size of the points
plt.scatter(X[:, 0], X[:, 1], c=predict, s=50)
Based on your code, the following worked for me. You can certainly stay with numpy for storing the CSV, but I simply prefer pandas. The sorting line should give you the same results every time you run the code. However, since the initialization of the clusters can have an impact, I would also set a seed in your code, e.g. np.random.seed(42), and call the KMeans function with the random_state parameter, e.g. kmeans = KMeans(n_clusters=4, random_state=42).
# transform to dataframe
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(X)
df.columns = ["var1", "var2"]
df["cluster"] = predict
colors = sns.color_palette()[0:4]
df = df.sort_values("cluster")
# check plot
sns.scatterplot(data=df, x="var1", y="var2", hue="cluster", palette=colors)
plt.show()
# define rename schema
mynames = {"0": "center_left", "1": "top_left", "2": "bot_right", "3": "center"}
df["cluster_name"] = [mynames[str(i)] for i in df.cluster]
# plot again to verify order
sns.scatterplot(data=df, x="var1", y="var2", hue="cluster_name",
                palette=colors)
sns.despine()
plt.show()
# save dataframe as CSV
df.to_csv("myoutput.csv")
The first plot shows the clusters coloured by their numeric label; the second shows them coloured by the assigned names; the CSV contains the var1, var2, cluster and cluster_name columns.

Using k-means clustering to cluster based on single variable

I'm just trying to get my head around clustering.
I have a series of data points, y, which have a (Gaussian) noise function associated with them.
There are two classes of values: 0 and >0 (obviously with noise). I'm trying to find the centre point of the group which is >0.
I've plotted the points with a simple moving average to be able to eyeball the data.
Moving average plot:
How can I cluster the data just based on the y value?
I’d like to have two clusters - one covering the points on the left and right (roughly <120 and >260 by the looks of it) and the other for the middle points (x = 120 to 260)
If I try with two clusters I get this:
k means plot - k=2:
How should I amend my code to achieve this?
import numpy as np
import matplotlib.pyplot as plt

x = range(315)
y= [-0.0019438692324050865, 0.0028994208839327852, 0.0051483573976274649, -0.0033242993359676809, -0.007205517954705391, 0.0023493638544448323, 0.0021109981155292179, 0.0035990200904119076, -0.0039516797159245328, 0.0046512034107712786, -0.0019248189368846083, 0.0036744109953683823, 0.0007898612768152954, 0.0050059088808496474, -0.0021084425769681558, 0.0014692258570182986, -0.0030711206115484175, -0.0026614801222815628, 0.0022816301256991535, 0.00019923934682088178, -0.0013181161659271139, -0.0021956355547661358, 0.0012941895041076283, 0.00337197586896105, -0.0019792508536746402, -0.002020497762984554, 0.0014495021773240431, 0.0011887337096206894, 0.0016667792145975404, -0.0010119590445208419, -0.0024506337087077676, 0.0072264471843846339, -0.0014126073097276062, -0.00065673498034648755, -0.0011355352304356647, -0.00042657980930307281, -0.0032875547481258042, -0.002351265010099495, -0.00073344218847348742, -0.0031555991687002589, 0.0026170287799315104, 0.0019289080666337198, -0.0021804765064623076, 0.0026221290350876979, 0.0019831827145683828, -0.005422907223254632, -0.0014107046201467732, -0.0049438583709020423, 0.00081884635937855494, 0.0054783747880986361, -0.0011282600170147909, -0.00436581779762948, 0.0024421851848953177, -0.0018564229613786095, -0.0052492274840120123, 0.0051775747035086306, 0.0052413417491534494, 0.0030817295096650732, -0.0014106391941506153, 0.00074380887788818206, -0.0041507550699856439, -0.00074928547462217287, -9.3938667619130614e-05, -0.00060592968804004362, 0.0064913597798387348, 0.0018098075166183621, 0.00099550852535854441, 0.0037322288350247917, 0.0027039351321340869, 0.0060238021513650541, -0.006567405116575234, 0.0020858553839503175, -0.0040329574871009084, -0.0029337227854833213, 0.0020743996957790969, 0.0041249738085716511, -0.0016678673351373336, -0.00081387164524554967, -0.0028411340446090278, 0.00013572776045231967, -0.00025350369023925548, 0.00071609777542998309, -0.0018427036825796074, -0.0015513575887011904, -0.0016357115978466398, 0.0038235991426514866, 0.0017693050063256977, -0.00029816429542494152, -0.0016071303644783605, -0.0031883070092131086, -0.0010340123778528594, -0.0049194467790889653, 0.0012109237666701397, 0.0024532524488299246, 0.0069307209537693721, 0.0009573350812806618, -6.0022322637651027e-05, -0.00050143013334696311, 0.0023415017810229548, 0.0033053845403900849, -0.0061156769150035222, 0.00022216114877491691, 0.0017257349557975464, 4.6919738262423826e-05, -0.0035257466102171162, -0.0043673831041441185, -0.0016592116617178102, -0.003298933045964781, -0.001667158964114637, 0.0011283739877531254, -0.0055098513985193534, 0.0023564462221116358, 0.0041971132878626258, 0.0061727231077443314, 0.0047583822927202779, 0.0022475414486232245, 0.0048682822792560521, 0.0022415648209199016, 0.00044859963858686957, -0.0018519391698513449, 0.0031460918774998763, 0.0038614233082916809, -0.0043409564348247066, -0.0055560805453666326, -0.00025133196059449212, 0.012436346397552794, 0.01136022093203152, 0.011244278807602391, 0.01470018209739289, 0.0075560289478025277, 0.012568781764361209, 0.0076068752709663838, 0.011022209533236597, 0.010545997929846045, 0.01084340614623565, 0.011728388118710915, 0.0075043238708055885, 0.012860298948366296, 0.0097297636410632864, 0.0098800557729756874, 0.011536517297700085, 0.0082316420968713416, 0.012612386004592427, 0.016617154743589352, 0.0091391582296167315, 0.014952150276251052, 0.011675391002362373, 0.01568297072839233, 0.01537664322062633, 0.01622711654371662, 0.010708828344561546, 0.016625354383482532, 
0.010757807468539406, 0.016867909081979202, 0.010354635736138377, 0.014345365677006765, 0.011114328315579219, 0.010034249196973242, 0.015846180181371881, 0.014303841146954242, 0.011608682896746103, 0.0086826955459553216, 0.0088576104599897426, 0.011250553207393772, 0.005522552439745569, 0.011185993425936373, 0.010241377537878162, 0.0079206732150164348, 0.0052965651546758108, 0.011104715912291204, 0.010506408714857187, 0.010153282642128673, 0.010286986015082572, 0.01187330766677645, 0.014541420264499783, 0.013092204890199896, 0.012979246400649271, 0.012595814351669916, 0.014714607377710237, 0.011727516021525658, 0.011035077266739704, 0.0089698030032708698, 0.0087245475140550147, 0.011139467365240661, 0.0094505568595650603, 0.014430361388952871, 0.0089241578716030695, 0.014616210804585136, 0.013295072783119581, 0.014430633057603408, 0.01200577022494694, 0.011315388654675421, 0.013359877656434442, 0.017704146495248471, 0.0089900858719559155, 0.014731590728415532, 0.0053244009632545759, 0.011199377929150522, 0.0098899254166580439, 0.012220397221188688, 0.015315682643295272, 0.0042842773538990919, 0.0098560854848898077, 0.0088592602102698509, 0.011682575531316278, 0.0098450268165344631, 0.015508017179782136, 0.0083959771972897564, 0.0057504382506886418, 0.010149849298310511, 0.011467172305959087, 0.019354427705224483, 0.013200207481702888, 0.0084555200083286791, 0.011458643458455485, 0.0067582116806278788, 0.01083616691886825, 0.013189184991857963, 0.011774794518724967, 0.014419252448288828, 0.011252283438046358, 0.013346699363583018, 0.0070752340082163006, 0.013215300343131422, 0.0083841320189162287, 0.0067600805611729283, 0.014043517055899181, 0.0098241497159076551, 0.011466675085574904, 0.01155354571355972, 0.012051701509217881, 0.010150596813866767, 0.0093930906430917619, 0.003368481869910186, 0.0048359029438027378, 0.0072083852964288445, 0.010112266453748613, 0.014009345326404186, 0.0050187514558796657, 0.0076315122645601551, 0.0098572381625301152, 0.0114902035403828, 0.018390212262653569, 0.020552166087412803, 0.010428735773226807, 0.011717974670325962, 0.011586303572796604, 0.0092978832913345726, 0.0040060048273946845, 0.012302496528511328, 0.0076707934776137684, 0.014700766223305586, 0.013491092168119941, 0.016244916923257174, 0.010387716692694397, 0.0072564046806323553, 0.0089420045528720883, 0.012125390630607462, 0.013274623392811291, 0.012783388635585766, 0.013859113028817658, 0.0080975189401925642, 0.01379241865445455, 0.012648552766643405, 0.011380280655911323, 0.010109646424218717, 0.0098577688652478051, 0.0064661895943772208, 0.010848835432253455, -0.0010986941731458047, -0.00052875821639583262, 0.0020423603076171414, 0.0035710440970171805, 0.001652886517437206, 0.0023512717524485573, -0.002695275440737862, 0.002253880812688683, -0.0080855104018828141, -0.0020090808966136161, -0.0029794078852333791, 0.00047537441103425869, -0.0010168825525621432, 0.0028683012479151873, -0.0014733214239664142, 0.0019432702158397569, -0.0012411849653504801, -0.00034507088510895141, -0.0023587874349834145, 0.0018156591123708393, 0.0040923006067568324, 0.0043522232127477072, -0.0055992642684123371, -0.0019368557792245147, 0.0026257395447205848, 0.0025594329536029635, 0.00053681548609292378, 0.0032186216144045742, -0.003338121135450386, 0.00065996843114729585, 0.006711173245189642, 0.0032877327776177517, 0.0039528629317296367, 0.0063732674764248719, -0.0026207617244284023, 0.0061381482567009048, -0.003024741769256066, -0.0023891419421980839, -0.004011235930513047, 0.0018372067754070733, 
-0.0045928077859572689, -0.0021420171112169601, 0.001665179522797816, 0.0074356736689407859, 0.0065680163280897891, -0.0038116640825467678]
data = np.column_stack([x,y])
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
y_kmeans = kmeans.predict(data)
plt.scatter(data[:, 0], data[:, 1], c=y_kmeans, s=5, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
plt.grid()
I’d also like to be able to return the max, min and average for the values in each cluster - is this possible?
Some ideas on your problem.
k-means is actually a multivariate method, so it is probably not a good choice in your case. You can take advantage of the one-dimensionality of your data by looking for minima of a kernel density estimate of the y-data. A plot of the density estimate will show a bimodal density function, with the two modes divided by a minimum, which is the y-value at which you want to divide the two clusters.
Have a look at http://scikit-learn.org/stable/modules/density.html#kernel-density
To get the x-values at which you divide, you could use the moving average you already computed.
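A minimal sketch of that idea, assuming the y list defined above; the bandwidth and the search window between the two modes are guesses that would need tuning:

import numpy as np
from sklearn.neighbors import KernelDensity

y_arr = np.asarray(y).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=0.002).fit(y_arr)

# evaluate the density on a fine grid of y-values
grid = np.linspace(y_arr.min(), y_arr.max(), 500).reshape(-1, 1)
density = np.exp(kde.score_samples(grid))

# find the density minimum between the two modes (window is a guess)
window = (grid[:, 0] > 0.002) & (grid[:, 0] < 0.01)
split = grid[window][np.argmin(density[window]), 0]
labels = (y_arr[:, 0] > split).astype(int)  # 1 = the ">0" group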
However, there might be methods better suited to your kind of data. You might want to ask your question at https://stats.stackexchange.com/ as it is not really a programming problem but one about the appropriate method.
You can reshape your data to an n x 1 array.
But if you want to take the time into account, I suggest you look into change detection in time series instead; it can detect a change in mean.
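For instance, a sketch with the third-party ruptures package (an assumption; the answer doesn't name a library), which can find the two mean shifts directly:

import numpy as np
import ruptures as rpt

signal = np.asarray(y)
# PELT with a piecewise-constant (mean-shift) cost
algo = rpt.Pelt(model="l2").fit(signal)
# returns the breakpoint indices (the last one is len(signal));
# the penalty value needs tuning for this data
change_points = algo.predict(pen=0.001)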
Using your code, the simplest way to get what you want is to change:
kmeans.fit(data)
y_kmeans = kmeans.predict(data)
to
kmeans.fit(data[:,1].reshape(-1,1))
y_kmeans = kmeans.predict(data[:,1].reshape(-1,1))
You can get the max, min, mean, etc. by indexing, for example:
np.max(data[:,1][y_kmeans == 1])
