I want to subset an AnnData object on the basis of clusters, but I am not able to understand how to do it.
I am running the scVelo pipeline, and in it I ran the tl.louvain function to cluster the cells. I got around 32 clusters, of which clusters 2 and 4 are of interest, and I have to run the pipeline further on these clusters only. (Initially I had a loom file, which I read into scVelo, so I now have the AnnData object.)
I tried using adata.obs["louvain"], which gave me the cluster information, but I need to write a new AnnData object with only those 2 clusters and process it further.
Please help me with how to subset an AnnData object. Any help is highly appreciated. (Being very new to this, I am finding it difficult.)
If your adata.obs has a "louvain" column, as I'd expect after running tl.louvain, you can do the subsetting as
adata[adata.obs["louvain"] == "2"]
if you want to obtain one cluster, and
adata[adata.obs['louvain'].isin(['2', '4'])]
for obtaining clusters 2 & 4.
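Note that subsetting like this returns a view; if you want a standalone object to write to disk and process further, a minimal sketch looks like the following (the .h5ad filename is just a placeholder):
# keep clusters 2 and 4, copying so the result is no longer a view
adata_sub = adata[adata.obs["louvain"].isin(["2", "4"])].copy()
# write the subset out for further processing
adata_sub.write("adata_clusters_2_4.h5ad")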
Feel free to use this function I wrote for my work.
from anndata import AnnData
import numpy as np

def cluster_sampled(adata: AnnData, clusters: list, n_samples: int) -> AnnData:
    """Sample n_samples randomly from each louvain cluster in the provided clusters.

    Parameters
    ----------
    adata
        AnnData object
    clusters
        List of clusters to sample from
    n_samples
        Number of samples to take from each cluster

    Returns
    -------
    AnnData
        Annotated data matrix with sampled cells from the clusters
    """
    sampled = []
    # restrict to the requested clusters first, copying so we work on a real object, not a view
    adata_cluster_sampled = adata[adata.obs["louvain"].isin(clusters), :].copy()
    # groupby(...).indices maps each cluster label to the integer positions of its cells
    for k, v in adata_cluster_sampled.obs.groupby("louvain").indices.items():
        sampled.append(np.random.choice(v, n_samples, replace=False))
    return adata_cluster_sampled[np.concatenate(sampled)]
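For example, to take 500 cells from each of clusters 2 and 4 (illustrative numbers; each cluster must contain at least n_samples cells, or np.random.choice with replace=False will raise an error):
adata_sub = cluster_sampled(adata, ["2", "4"], n_samples=500)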
I am attempting to create a chord diagram using plot_connectivity_circle from the mne_connectivity.viz library.
My data is similar to the following, where letter and number represent separate nodes and count represents the number of connections between those nodes:
import random
import string
import pandas as pd

random.seed(10)
df = pd.DataFrame({'letter': [random.choice(string.ascii_lowercase) for x in range(20)],
                   'number': [str(random.randint(0, 22)) for x in range(20)],
                   'Count': [random.randint(20, 50) for x in range(20)]})
The documentation for mne cites examples in which a square matrix of connectivity scores is used to create the chord diagram, which differs from my use case.
However, it also states that a 1d array can be used for the connectivity scores if arrays of indices are also passed that correspond to the correct list of node names. Therefore I assume that df.Count can be used to represent the connectivity scores?
Given my data, I can't figure out how to pass the relevant data to the node_names and indices arguments in the correct order and would appreciate some guidance please!
For reference, I have achieved a similar visualisation using the holoviews library but find the options for customisation to be lacking. Code and output for that visualisation are included below as an example:
import numpy as np
import holoviews as hv
from holoviews import opts, dim

hv.extension('bokeh')
hv.output(size=350)

nodes = list(set(df['letter'].tolist() + df['number'].tolist()))
nodes = hv.Dataset(pd.DataFrame(nodes, columns=['node']))
chord = hv.Chord((df, nodes))
chord.opts(
    opts.Chord(
        labels='node', label_text_font_size='12pt',
        node_color='node', node_cmap='Category20', node_size=10,
        edge_color='number', edge_cmap='Category20', edge_alpha=0.9, edge_line_width=1)
)
For the record, I have found an acceptable solution to this issue.
I returned to the original data (the df above was the result of a groupby and count to get the df.Count values) and used crosstab() to generate a dataframe containing the connectivity scores. I referred to the answer to this post for direction.
I then transformed the result to an adjacency matrix using to_numpy(), which could be passed to the con argument of plot_connectivity_circle().
A list of the columns from the crosstab() can then be passed to the node_names argument.
I don't have time to post a working example of my code right now but will hopefully find time later.
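In the meantime, a rough, untested sketch of that approach: I'm assuming the combined letter/number node set is used to build a square adjacency matrix (the zero-padding step is my assumption, since crosstab() returns a rectangular letters-by-numbers table), and df_raw is a hypothetical name for the original pre-groupby data.
import numpy as np
import pandas as pd
from mne_connectivity.viz import plot_connectivity_circle

# cross-tabulate the original data: rows are letters, columns are numbers
ct = pd.crosstab(df_raw['letter'], df_raw['number'])

# embed the rectangular crosstab in a square matrix over the combined node set
node_names = list(ct.index) + list(ct.columns)
n_letters = len(ct.index)
con = np.zeros((len(node_names), len(node_names)))
con[:n_letters, n_letters:] = ct.to_numpy()

plot_connectivity_circle(con, node_names)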
If anyone knowledgeable in the use of mne and plot_connectivity_circle can help answer the original question given the data in the form described in the original post, I'd be very interested to learn how it is done!
I have 100 clusters, each with a mean and standard deviation value. These clusters are predefined using the SPSS software package, using the two-step cluster method, so the optimisation of these cluster distributions to fit the data has already been done.
For new (unseen) data, we want to assign cluster membership by selecting the maximum log-likelihood cluster for any given set of coordinates X. To do this, I have written my own code for comparison with what was output by SPSS using the same method: https://www.norusis.com/pdf/SPC_v19.pdf
Using data that has been correctly labelled by SPSS, about 42% of the frames are correctly labelled by minimising the RMSE to the cluster mean (which is not what SPSS does), and fewer than 20% are labelled correctly by my code when assigning the maximum log-likelihood cluster (which is what SPSS reports to do).
I know that the maximum log-likelihood cluster should be the correct cluster (https://www.norusis.com/pdf/SPC_v19.pdf), but there is only a 20% success rate from this code when compared to the correct cluster labels from SPSS. What am I doing wrong?
Here is the code below.
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
import math
from scipy import stats

# import raw files
clusters_df = pd.read_csv('ClusterCoordinates.csv')  # clusters are in order of cluster number, enabling us to use the index for identification
clusters_df = clusters_df.drop(columns=['Cluster'])
print(clusters_df.shape)
clusters = clusters_df.to_numpy()

frames_df_raw = pd.read_csv('FrameCoordinates.csv')
frames_df = frames_df_raw.drop(columns=['frame','replica','voltage','system','ff','cluster'])
print(frames_df.shape)
frames = frames_df.to_numpy()

clusters_sd_df = pd.read_csv('ClusterCoordinates_SD.csv')
clusters_sd_df = clusters_sd_df.drop(columns=['Cluster'])
print(clusters_sd_df.shape)
clusters_sd = clusters_sd_df.to_numpy()

rmseCalc = []
llCalc = []
assignedCluster_RMSE = []
assignedCluster_LL = []

# create tables with RMSE and LL values
for frame in frames:
    for cluster, cluster_sd in zip(clusters, clusters_sd):
        # we compare cluster assignment using minimum RMSE vs maximum log-likelihood methods
        rmseCalc.append(math.sqrt(mean_squared_error(np.array(cluster), np.array(frame))))
        # negative log-likelihood, so the best cluster is the minimum of llCalc
        llCalc.append(-np.sum(stats.norm.logpdf(frame, loc=cluster, scale=cluster_sd)))
    rmseCalc = np.array(rmseCalc)
    llCalc = np.array(llCalc)
    llCalc = np.nan_to_num(llCalc)
    minRMSE = np.where(rmseCalc == rmseCalc.min())
    maxLL = np.where(llCalc == llCalc.min())
    print(maxLL[0][0] + 1)
    assignedCluster_RMSE.append(minRMSE[0][0] + 1)
    assignedCluster_LL.append(maxLL[0][0] + 1)
    rmseCalc = []
    llCalc = []

frames_df_raw['predCluster_RMSE'] = np.array(assignedCluster_RMSE)
frames_df_raw['predCluster_LL'] = np.array(assignedCluster_LL)
frames_df_raw.to_csv('frames_clustered.csv')
I was expecting the cluster labels assigned by the code to match those already assigned by SPSS, since the methods used are intended to be the same.
I've got 10 clusters from k-modes.
Data: categorical (I converted it to binary, then ran the model).
Technology used: Jupyter/Python.
Doubts: 1. how to find accuracy; 2. how to plot/visualise the clusters in 2D and 3D.
Something like this should be a good start.
import numpy as np
from scipy.cluster.vq import kmeans, vq
from matplotlib.pyplot import plot, show

# recreate data to feed into the algorithm; df is your dataframe with the two fields to cluster on
data = np.asarray([np.asarray(df['field1']), np.asarray(df['field2'])]).T
So now running the following piece of code:
# computing K-Means with K = 5 (5 clusters)
centroids, _ = kmeans(data, 5)
# assign each sample to a cluster
idx, _ = vq(data, centroids)
# some plotting using numpy's logical indexing
plot(data[idx==0,0], data[idx==0,1], 'ob',
     data[idx==1,0], data[idx==1,1], 'oy',
     data[idx==2,0], data[idx==2,1], 'or',
     data[idx==3,0], data[idx==3,1], 'og',
     data[idx==4,0], data[idx==4,1], 'om')
plot(centroids[:,0], centroids[:,1], 'sg', markersize=8)
show()
This is a great resource.
https://www.pythonforfinance.net/2018/02/08/stock-clusters-using-k-means-algorithm-in-python/
I have applied PCA to a dataframe in order to plot clusters based on K-means. Since I have about 24 features in my original df, I don't want to plot clusters based on only 2 or 3 features at a time. What I want to do is plot combinations of those features, to get a more general/representative graphical representation of each feature in the clusters.
I extracted the components using pca.components_ and created the following df of components:
PC-1 PC-2
media_bi_mov 0.003094 0.050599
media_bi_post 0.000762 0.028931
total_mov_prod_300 0.000836 0.573675
codsprod_0 0.440476 -0.004404
codsprod_1 0.008005 0.105349
codsprod_2 0.002851 0.042459
codsprod_3 0.001078 0.009355
codsprod_4 -0.011922 -0.022020
idaplic_0 0.392229 -0.002817
idaplic_1 0.003001 0.004822
idaplic_2 0.044730 -0.001148
idaplic_3 0.097695 -0.008628
idaplic_4 0.024273 0.486973
idaplic_5 0.234798 -0.033369
idaplic_6 0.019329 0.015455
idempro_36 0.000401 -0.000438
idempro_38 0.032149 0.292137
idempro_49 0.439413 -0.023269
codmonsw_EUR 0.440543 -0.002770
codmonsw_USD 0.000378 0.000664
resto_codsprod 0.011406 0.011731
resto_idaplic 0.041649 0.005692
días_entre_ops -0.011129 -0.015144
frecuencia 0.440543 -0.002770
valor_total_eur 0.000836 0.573675
Normally I would plot the clusters using kmeans.labels_ to apply a different color to each cluster, if this were the original df. But my issue now is that I can't use kmeans.labels_ to differentiate the clusters in this PCA-reduced df, since kmeans.labels_ has one entry per sample rather than per feature, so the lengths don't match.
How can I apply color to differentiate the clusters in this dataframe?
Thanks in advance
I didn't realise the solution to this problem was so easy: I just needed to run k-means on the components df to get cluster labels for each feature in each principal component. Hope this helps someone with the same doubts as me.
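A minimal sketch of what that looks like, assuming components_df is the (features x PCs) dataframe shown above; the number of clusters here is just illustrative:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# cluster the features by their loadings on the principal components
feature_km = KMeans(n_clusters=4, random_state=0).fit(components_df)

# one label per feature, so the lengths now match the points being plotted
plt.scatter(components_df['PC-1'], components_df['PC-2'], c=feature_km.labels_)
for name, row in components_df.iterrows():
    plt.annotate(name, (row['PC-1'], row['PC-2']), fontsize=8)
plt.xlabel('PC-1')
plt.ylabel('PC-2')
plt.show()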
I'm trying to cluster data using lat/lon as X/Y axes and DaysUntilDueDate as my Z axis. I also want to retain the index column ('PM') so that I can create a schedule later using this clustering analysis. The tutorial I found here has been wonderful, but I don't know if it's taking the Z axis into account, and my poking around hasn't resulted in anything but errors. I think the essential point in the code is the iloc parameters in this line:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(A.iloc[:, :])
I tried changing this part to iloc[1:4] (to only work on columns 1-3) but that resulted in the following error:
ValueError: n_samples=3 should be >= n_clusters=4
So my question is: How can I set up my code to run clustering analysis on 3-dimensions while retaining the index ('PM') column?
Here's my python file, thanks for your help:
from sklearn.cluster import KMeans
import csv
import pandas as pd

# Import csv file with data in following columns:
# [PM (index)] [Longitude] [Latitude] [DaysUntilDueDate]
df = pd.read_csv('point_data_test.csv', index_col=['PM'])

numProjects = len(df)
K = numProjects // 3  # Around three projects can be worked per day

print("Number of projects: ", numProjects)
print("K-clusters: ", K)

for k in range(1, K):
    # Create a kmeans model on our data, using k clusters.
    # Random_state helps ensure that the algorithm returns the
    # same results each time.
    kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])

    # These are our fitted labels for clusters --
    # the first cluster has label 0, and the second has label 1.
    labels = kmeans_model.labels_

    # Sum of distances of samples to their closest cluster center
    SSE = kmeans_model.inertia_
    print("k:", k, " SSE:", SSE)

# Add labels to df
df['Labels'] = labels
#print(df)

df.to_csv('test_KMeans_out.csv')
It seems the issue is with the syntax of iloc[1:4].
From your question it appears you changed:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, :])
to:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[1:4])
It seems to me that either you have a typo or you don't understand how iloc works. So I will explain.
You should start by reading Indexing and Selecting Data from the pandas documentation.
But in short, .iloc is an integer-based indexing method for selecting data by position.
Let's say you have the dataframe:
A B C
1 2 3
4 5 6
7 8 9
10 11 12
The use of iloc in the example you provided, iloc[:, :], selects all rows and columns and produces the entire dataframe. In case you aren't familiar with Python's slice notation, take a look at the question Explain slice notation or the docs for An Informal Introduction to Python. The example you said caused your error, iloc[1:4], selects the rows at index 1-3. This would result in:
A B C
4 5 6
7 8 9
10 11 12
Now, if you think about what you are trying to do and the error you received, you will realize that you have selected fewer samples from your data than you are looking for clusters: 3 samples (the rows at index 1, 2, and 3) but you're telling KMeans to find 4 clusters, which just isn't possible.
What you really intended to do (as I understand it) was to select all rows and the columns that correspond to your lat, lng, and z values. To do this, add a colon as the first argument to iloc, like so:
df.iloc[:, 1:4]
Now you will have selected all of your samples and the columns at index 1, 2, and 3. Assuming you have enough samples, KMeans should then work as you intended. (One caveat: since you read the csv with index_col=['PM'], 'PM' is the index rather than a column, so the three feature columns actually sit at positions 0-2 and df.iloc[:, 0:3], or simply df, selects them; iloc[:, 1:4] is what you'd use if 'PM' were still an ordinary column.)
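So the fitting line becomes something like the following sketch; which of the two slices applies depends on whether 'PM' was kept as a column:
# with 'PM' as the index (as in the question's read_csv), the three feature
# columns are at positions 0-2:
kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, 0:3])

# if 'PM' were an ordinary column instead, the features would be at positions 1-3:
# kmeans_model = KMeans(n_clusters=k, random_state=1).fit(df.iloc[:, 1:4])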