I have two datasets (data.csv and labels.csv). The labels.csv dataset contains the gene names for the samples in the data.csv dataset. I want to run the k-means clustering algorithm on these datasets and make a scatter plot, but I am having difficulties with it. Below is the code I used to get the clustering done.
import pandas as pd
from sklearn.cluster import KMeans
X = pd.read_csv("data.csv")
Y = pd.read_csv("labels.csv")
#reduce the dimension of X
import time
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
X = X.dropna()
# drop the unnamed first column, which only contains string identifiers
X = X.drop(X.columns[X.columns.str.contains('unnamed', case=False)], axis=1)
# drop the index column, then label encode the class strings into integer values
Y = Y.drop(Y.columns[0], axis=1)
le = LabelEncoder()
Y_data = le.fit_transform(Y.values.ravel())
class_names = list(le.classes_)
# use TSNE to visualize the high dimension data in 2D
t0 = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300, random_state=100)
tsne_results = tsne.fit_transform(X)
t1 = time.time()
print("TSNE took at %.2f seconds" % (t1 - t0))
# visualize TSNE
x_axis = tsne_results[:,0]
y_axis = tsne_results[:,1]
plt.scatter(x_axis, y_axis, c=Y_data, cmap=plt.cm.get_cmap("jet", 100))
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)
plt.title("TSNE Visualization")
plt.show()
The code above gives a scatter plot (shown below) of the 5 different classes in my datasets; the classes are colored from 0 to 4 (5 classes).
But when I apply the k-means clustering code (given below), it shows 5 clusters, but the class colors do not range from 0 to 4; instead they range from 0 to 9. The scatter plot image is given below the k-means clustering code.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
model=KMeans(n_clusters=5)
model.fit(tsne_results)
label=model.predict(tsne_results)
# plot the t-SNE points, colored here by the true encoded labels (Y_data)
xs = tsne_results[:, 0]
ys = tsne_results[:, 1]
plt.scatter(xs, ys, c=Y_data, alpha=0.5)
# centroid coordinates
centroids = model.cluster_centers_
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)
plt.show()
Now, which part of the code do I need to change in order to get a scatter plot like the first one, with the classes colored from 0 to 4 for the 5 clusters?
I don't see what the problem is here. You have 5 different clusters colored differently; it's just the color scale that ranges from 0 to 9, with the centroids drawn in their own color. So unless you specifically need the scale to range from 0 to 4 for some reason, I would use this plot. You can see the clusters and the centroids, which is all you really need.
The code to look at, though, would be
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)
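If you do need the scale to match the 5 classes, a minimal sketch (reusing tsne_results, Y_data, and centroids from your code) is to quantize the colormap to 5 colors and set the colorbar ticks and limits accordingly; swap c=Y_data for c=label if you want to color by the k-means assignment instead of the true class:
import matplotlib.pyplot as plt

# color by the 5 encoded classes, with a colormap quantized to 5 levels
sc = plt.scatter(tsne_results[:, 0], tsne_results[:, 1],
                 c=Y_data, cmap=plt.cm.get_cmap("jet", 5), alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='D', s=50, color='black')
plt.colorbar(sc, ticks=range(5))  # ticks 0..4
sc.set_clim(-0.5, 4.5)            # center each of the 5 color bands on an integer label
plt.show()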
Related
I have asked a similar question here: How to apply KMeans to get the centroid using dataframe with multiple features and I received some valuable responses. However, I have not succeeded in getting KMeans clustering working on a dataframe with more than 4 columns.
The dataframe in question has 5 columns as below:
col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,0.15
0.52,0.44,0.19,0.29,0.44
1.27,1.15,1.32,0.60,0.14
0.88,0.79,0.63,0.58,0.18
1.39,1.15,1.32,0.41,0.44
0.86,0.80,0.65,0.65,0.11
1.68,1.99,3.97,0.16,0.55
0.78,0.63,0.40,0.36,0.10
2.95,2.66,7.11,0.18,0.15
1.44,1.33,1.79,0.24,0.22
I have a simple KMeans clustering Python code that I try to apply to the 5-column dataframe, as below.
from numpy import unique
from numpy import where
from sklearn.cluster import KMeans
from matplotlib import pyplot
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
X = np.array(df)
model = KMeans(n_clusters=5)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4])
pyplot.show()
When I run the code it complains about the line pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4]), with the error message 'ValueError: Unrecognized marker style [[0.14 0.44 0.22]]'. However, if I remove the 5th column from the dataframe (i.e. col5) and remove X[row_ix, 4] from the code, the clustering works.
What do I need to do to get KMeans working on my example dataframe?
[Updated: 2 or 3 dimensions at a time]
From the previous post, it was suggested that I could split the task by representing 2 or 3 dimensions at a time using the function below. However, the function does not produce the expected clustering output (see the attached output.png).
def plot(self):
    import itertools
    combinations = itertools.combinations(range(self.K), 2) # generate all combinations of features
    fig, axes = plt.subplots(figsize=(12, 8), nrows=len(combinations), ncols=1) # initialise one subplot for each feature combination
    for (x, y), ax in zip(combinations, axes.ravel()): # loop through combinations and subplots
        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py)
        for point in self.centroids:
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py, marker="x", color='black', linewidth=2)
        ax.set_title('feature {} vs feature {}'.format(x, y))
    plt.show()
How can I fix the above function to get the expected clustering output?
As mentioned in the other answer and comments, you cannot plot all 5 axes together. One way is to use dimensionality reduction, such as PCA, to reduce the data to 2 dimensions and plot:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv('test.csv')

model = KMeans(n_clusters=5)
model.fit(df)
yhat = model.predict(df)
clusters = np.unique(yhat)

# reduce the 5 features to 2 principal components for plotting
dims = PCA(n_components=2).fit_transform(df)
dims = pd.DataFrame(dims, columns=['PC1', 'PC2'])

fig, ax = plt.subplots(1, 1)
for cluster in clusters:
    ix = yhat == cluster
    ax.scatter(x=dims.loc[ix, 'PC1'], y=dims.loc[ix, 'PC2'], label=cluster)
ax.legend()
Or you can use seaborn and visualize all your variables, which is fine if you only have 5 of them:
import seaborn as sns
df['cluster'] = yhat
sns.pairplot(data=df,hue='cluster',diag_kind=None)
Your KMeans works, but the way you are trying to display the result is not right.
If you look at the documentation of matplotlib's scatter function (https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html), you will see that the first four arguments of the function accept array-likes, while the fifth only accepts a 'MarkerStyle'. That's why you get an error only when you add the fifth argument.
In effect, you are trying to plot a 5-dimensional dataset in a 2-dimensional plane, which is not possible without doing dimensionality reduction beforehand.
A PCA or a PLS-DA could be a good option to reduce the dimensionality of your dataset.
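Since the other answer already sketches the PCA route, here is a minimal sketch of the PLS-DA idea (my own illustration, not a definitive recipe): use the k-means labels yhat from your code as the response, so the 2-D projection is chosen to separate the clusters, assuming df holds the 5 numeric columns from your question.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression

# one-hot encode the cluster labels to use them as the PLS response
Y_dummy = pd.get_dummies(yhat)
# project the 5 features onto 2 components that best separate the clusters
scores = PLSRegression(n_components=2).fit(df, Y_dummy).transform(df)

fig, ax = plt.subplots()
for cluster in np.unique(yhat):
    ix = yhat == cluster
    ax.scatter(scores[ix, 0], scores[ix, 1], label=cluster)
ax.legend()
plt.show()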
I have a set of data to which I've been assigned to apply PCA, retain one component, and then visualize the distribution in a scatter plot that indicates the class of each data point.
For context: the data we're working with has three columns. X is columns 1 and 2, and y is column 3, which contains the class of each data point.
It was implied that the resulting visualization should be a horizontal line, but I'm not seeing that. The resulting visualization is a scatter plot that looks like a positive linear distribution.
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
df = pd.read_csv("data.csv", header=None)
X = df.iloc[:, 0:2].values
y = df.iloc[:,-1].values
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=np.random)
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
pcaObj1 = PCA(n_components=1)
X_train_PCA = pcaObj1.fit_transform(X_train)
X_test_PCA = pcaObj1.transform(X_test)
X_set, y_set = X_test_PCA, y_test
X3 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01))
X3 = np.array(X3)
plt.xlim(X3.min(), X3.max())
plt.ylim(X3.min(), X3.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 0],
                c = ListedColormap(('purple', 'yellow'))(i), label = j)
I see that you have a test set in addition to a training set; however, this is not the usual setup for PCA. PCA has multiple applications, but one of the main ones is dimensionality reduction. Dimensionality reduction is about removing variables, and PCA serves this purpose by changing the basis of your data and ordering the new variables by the amount (or relative amount) of the total variation that they linearly explain. Since this does not require test data, we can think of it as unsupervised machine learning, although many would also prefer to call it feature engineering, as it is often used to preprocess data to improve the performance of models trained on that preprocessed data.
Let me generate a random dataset with 10 variables and 1000 entries for the sake of example. Fitting the PCA transform for 1 component, you're selecting a new variable (feature) that is a linear combination of the original variables and attempts to linearly explain the most variance in the data. As you say, it is a number line; as a quick-and-easy plot, let's use the x-axis as the index of the new variable array and the y-axis as the value of the variable.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
X_train = np.random.random((1000, 10))
y_labels = np.array([0] * 500 + [1] * 500)
pcaObj1 = PCA(n_components=1)
X_PCA = pcaObj1.fit_transform(X_train)
plt.scatter(range(len(y_labels)), X_PCA, c=['red' if i==0 else 'green' for i in y_labels])
plt.show()
You can see this produces a 1000 x 1 array representing your new variable.
>>> X_PCA.shape
(1000, 1)
If you had selected n_components=2 instead, you'd have a 1000 x 2 array with two such variables. Let's see that as an example. This time I'll plot the two principal components against each other instead of plotting a single principal component against its index.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
X_train = np.random.random((1000, 10))
y_labels = np.array([0] * 500 + [1] * 500)
pcaObj1 = PCA(n_components=2)
X_PCA = pcaObj1.fit_transform(X_train)
plt.scatter(X_PCA[:,0], X_PCA[:,1], c=['red' if i==0 else 'green' for i in y_labels])
plt.show()
Now, my randomly-generated data may not have the same properties as your data set. If you really expect the output to be a line, then I'd say it certainly isn't here, as my example generates a very erratic trace. You'll see even in the 2D case that the data doesn't seem structured by class, but that's what you would expect from random data.
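If the expected picture was literally a horizontal line, i.e. the single principal component laid out on a number line and colored by class, a quick sketch of that (reusing X_PCA and y_labels from the snippets above and taking only the first component) would be:
import matplotlib.pyplot as plt
import numpy as np

# put every point at y=0 so the single component reads as a number line
plt.scatter(X_PCA[:, 0], np.zeros_like(X_PCA[:, 0]),
            c=['red' if i == 0 else 'green' for i in y_labels], alpha=0.3)
plt.yticks([])  # hide the meaningless vertical axis
plt.xlabel('principal component 1')
plt.show()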
This example should give some clarity. Make sure you read all the comments so you can follow what's going on.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets
iris = datasets.load_iris()
# put the data into a dataframe so we can refer to features by name,
# and keep the species labels alongside it
data = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
data["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)
X = data[["sepal_length", "sepal_width", "petal_length", "petal_width"]] # we take the four features.
y = data["species"]
print(X.sample(5))
print(y.sample(5))
# see how many samples we have of each species
data["species"].value_counts()
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array, columns = X.columns)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
# use seaborn to make scatter plot showing species for each sample
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "sepal_length", "sepal_width") \
   .add_legend();
# do same for petals
sns.FacetGrid(data, hue="species", height=4) \
   .map(plt.scatter, "petal_width", "sepal_width") \
   .add_legend();
# if you have a lot of features it can be helpful to do some feature reduction
# to avoid the curse of dimensionality (i.e. needing exponentially more data
# to do accurate predictions as the number of features grows).
# you can do this with Principal Component Analysis (PCA), which remaps the data
# to a new (smaller) coordinate system which tries to account for the
# most information possible.
# you can *also* use PCA to visualize the data by reducing the
# features to 2 dimensions and making a scatterplot.
# it kind of mashes the data down into 2d, so can lose
# information - but in this case it's just going from 4d to 2d,
# so not losing too much info.
# so let's just use it to visualize the data...
# mash the data down into 2 dimensions
from sklearn.decomposition import PCA
ndimensions = 2
seed = 10
pca = PCA(n_components=ndimensions, random_state=seed)
pca.fit(X_scaled)
X_pca_array = pca.transform(X_scaled)
X_pca = pd.DataFrame(X_pca_array, columns=['PC1','PC2']) # PC=principal component
X_pca.sample(5)
# so that gives us new 2d coordinates for each data point.
# at this point, if you don't have labelled data,
# you can add the k-means cluster ids to this table and make a
# colored scatterplot.
# we do actually have labels for the data points, but let's imagine
# we don't, and use the predicted labels to see what the predictions look like.
# first, convert species to an arbitrary number
y_id_array = pd.Categorical(data['species']).codes
df_plot = X_pca.copy()
df_plot['ClusterKmeans'] = y_cluster_kmeans
df_plot['SpeciesId'] = y_id_array # also add actual labels so we can use it in later plots
df_plot.sample(5)
# so now we can make a 2d scatterplot of the clusters
# first define a plot fn
def plotData(df, groupby):
    "make a scatterplot of the first two principal components of the data, colored by the groupby field"
    # make a figure with just one subplot.
    # you can specify multiple subplots in a figure,
    # in which case ax would be an array of axes,
    # but in this case it'll just be a single axis object.
    fig, ax = plt.subplots(figsize=(7, 7))
    # color map
    cmap = mpl.cm.get_cmap('prism')
    # we can use pandas to plot each cluster on the same graph.
    # see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
    for i, cluster in df.groupby(groupby):
        cluster.plot(ax=ax, # need to pass this so all scatterplots are on same graph
                     kind='scatter',
                     x='PC1', y='PC2',
                     color=cmap(i/(nclusters-1)), # cmap maps a number in [0,1] to a color
                     label="%s %i" % (groupby, i),
                     s=30) # dot size
    ax.grid()
    ax.axhline(0, color='black')
    ax.axvline(0, color='black')
    ax.set_title("Principal Components Analysis (PCA) of Iris Dataset");
# plot the clusters each datapoint was assigned to
plotData(df_plot, 'ClusterKmeans')
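Since the actual species codes were also added to df_plot above, the same function can be called again to plot the true classes for comparison with the k-means assignment:
# plot the actual species each datapoint belongs to, for comparison
plotData(df_plot, 'SpeciesId')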
I am currently following a course on the basics of machine learning provided by IBM. After the teacher finished building the model, I noticed that he does not use the normalized data to fit the model; he uses the regular data instead, and in the end he gets good, non-overlapping clusters. But when I tried to use the normalized data to train the model, the result was a mess of nested clusters, as the code and image show. Why did normalization lead to that, even though, as far as I know, it is always good to normalize the data for distance-based algorithms?
Code that does not use the normalized data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
cust_df = pd.read_csv('D:\machine learning\Cust_Segmentation.csv')
cust_df.head()
df = cust_df.drop('Address', axis = 1)
X = df.values[:, 1:]
X = np.nan_to_num(X)
from sklearn.preprocessing import StandardScaler
norm_featur = StandardScaler().fit_transform(X)
clusterNum = 3
kmeans = KMeans(init = 'k-means++', n_clusters = clusterNum, n_init = 12)
kmeans.fit(X)
k_means_labels = kmeans.labels_
df['cluster'] = kmeans.labels_
k_means_cluster_centers = kmeans.cluster_centers_
area = np.pi * (X[:, 1])**2
plt.scatter(X[:, 0], X[:, 3], s=area, c=kmeans.labels_.astype(float), alpha=0.5)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()
CLUSTERS WITHOUT USING NORMALIZATION
Code using the normalized data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
cust_df = pd.read_csv('D:\machine learning\Cust_Segmentation.csv')
cust_df.head()
df = cust_df.drop('Address', axis = 1)
X = df.values[:, 1:]
X = np.nan_to_num(X)
from sklearn.preprocessing import StandardScaler
norm_feature = StandardScaler().fit_transform(X)
clusterNum = 3
kmeans = KMeans(init = 'k-means++', n_clusters = clusterNum, n_init = 12)
kmeans.fit(norm_feature)
k_means_labels = kmeans.labels_
df['cluster'] = kmeans.labels_
k_means_cluster_centers = kmeans.cluster_centers_
area = np.pi * (norm_feature[:, 1])**2
plt.scatter(norm_feature[:, 0], norm_feature[:, 3], s=area, c=kmeans.labels_.astype(float),
            alpha=0.5)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()
CLUSTER AFTER NORMALIZATION
Income and age are on fairly different scales here. In your first plot, a difference of ~100 in income is about the same as a difference of ~10 in age. But in k-means, that difference in income is considered 10x larger. The vertical axis easily dominates the clustering.
This is probably 'wrong', unless you happen to believe that a change of 1 in income is 'the same as' a change of 1 in age for the purposes of figuring out what's similar. This is why you standardize, which makes a different assumption: that the features are equally important.
Your second plot doesn't quite make sense; k-means can't produce 'overlapping' clusters. The problem is that you have only plotted 2 of the 4 (?) dimensions you clustered on. You can't plot 4D data, but I suspect that if you applied PCA to the result to reduce it to 2 dimensions first and plotted that, you'd see separated clusters.
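A minimal sketch of that idea, reusing norm_feature and the fitted kmeans from your second snippet:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the standardized features the model was fit on down to 2-D
pcs = PCA(n_components=2).fit_transform(norm_feature)
plt.scatter(pcs[:, 0], pcs[:, 1], c=kmeans.labels_.astype(float), alpha=0.5)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()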
I have a dataset with 2 parameters, looking like this (I have added density contour plots):
My goal is to separate this sample into 2 subsets, like this:
This image comes from QUENCHING OF STAR FORMATION IN SDSS GROUPS: CENTRALS, SATELLITES, AND GALACTIC CONFORMITY, Knobel et al., The Astrophysical Journal, 800:24 (20pp), 2015 February 1, available here. The separation line has been drawn by eye and is not perfect.
What I need is something like the red line (maximizing distances) in this nice Wikipedia graph:
Unfortunately, all the linear classification methods that seem close to what I'm looking for (SVM, SVC, etc.) are supervised learning.
I have tried unsupervised learning, such as KMeans with 2 clusters, this way (CompactSFR[['lgm_tot_p50','sSFR']] being the Pandas dataset you can find at the end of this post):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
X = CompactSFR[['lgm_tot_p50','sSFR']]
kmeans2 = KMeans(n_clusters=2)
# Fitting the input data
kmeans2 = kmeans2.fit(X)
# Getting the cluster labels
labels2 = kmeans2.predict(X)
# Centroid values
centroids = kmeans2.cluster_centers_
f, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5), sharey=True)
ax1.scatter(CompactSFR['lgm_tot_p50'],CompactSFR['sSFR'],c=labels2);
X2 = kmeans2.transform(X)
ax1.set_title("Kmeans 2 clusters", fontsize=15)
ax1.set_xlabel('$\log_{10}(M)$',fontsize=10) ;
ax1.set_ylabel('sSFR',fontsize=10) ;
f.subplots_adjust(hspace=0)
but the classification I get is this:
Which doesn't work.
Furthermore, what I want is not a simple classification but the equation of the separation line (which is obviously very different from a linear regression).
I would like to avoid developing a Bayesian model of maximum likelihood if something already exists.
You can find a small sample (959 points) here.
NB: this question doesn't correspond to my case.
The following code will do it with a Gaussian mixture model with 2 components and produces this result.
First, read the data from your file and remove outliers:
import pandas as pd
import numpy as np
from sklearn.neighbors import KernelDensity
frm = pd.read_csv(FILE, index_col=0)
kd = KernelDensity(kernel='gaussian')
kd.fit(frm.values)
density = np.exp(kd.score_samples(frm.values))
filtered = frm.values[density>0.05,:]
Then fit a Gaussian Mixture Model:
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=2, covariance_type='full')
model.fit(filtered)
cl = model.predict(filtered)
To obtain the plot:
import matplotlib.pyplot as plt
plt.scatter(filtered[cl==0,0], filtered[cl==0,1], color='Blue')
plt.scatter(filtered[cl==1,0], filtered[cl==1,1], color='Red')
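If you also want the separation curve rather than just the labels, one option (a sketch, not part of the original answer) is to evaluate the fitted model on a grid and draw the contour where the predicted class changes:
import numpy as np
import matplotlib.pyplot as plt

# build a grid covering the data range and ask the model for a label at each point
x_min, x_max = filtered[:, 0].min(), filtered[:, 0].max()
y_min, y_max = filtered[:, 1].min(), filtered[:, 1].max()
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))
zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# the 0.5 level of the predicted label is the boundary between the two components
plt.contour(xx, yy, zz, levels=[0.5], colors='black')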
The question topic is a little complex because I need a lot of help, lol. To explain, I have a csv of data with labels (names) and numerical data...
name,post_count,follower_count,following_count,anonymous_pic,is_private,...
adam,3,997,435,0,0,1,0,0,0,0
bob,2,723,600,0,0,1,0,0,0,0
jill,11,3193,962,0,0,1,0,0,0,0
sara,0,225,298,0,0,1,0,0,0,0
...
and so on. This data is loaded into a pandas dataframe from the csv. Now, I wish to pass only the numerical parts of this data into a sklearn.manifold class called TSNE (t-distributed stochastic neighbor embedding), which will output a list the same size as the input data, where each element of the new list is a list of size k (k being the number of components specified as an argument to the TSNE class). In my case k = 2.
I'm graphing this data on a 2-D scatter plot with matplotlib, and I'd like to be able to inspect the labels on the data. I know matplotlib has an annotate feature with which points can be labeled, but how do I go about separating these labels from the data for TSNE? And if I just separate the labels prior to transformation, how can I ensure that I'm relabeling the right points?
I'd like to be able to inspect these names because I need to see whether the transformation is useful on my data. This way I can look at a few really bizarre points and see if something interesting is happening. Here is my code, if you find it useful (although I'll admit it's just scratchwork):
# Data structuring
import pandas as pd
import numpy as np
# Plotting
import seaborn as sns
import matplotlib.pyplot as plt
sns.set() # for plot styling
# Load data
df = pd.read_csv('user_data.csv')
print(df.head())
# sklearn
from sklearn.mixture import GaussianMixture  # GMM has been removed from sklearn; GaussianMixture replaces it
from sklearn.manifold import TSNE
tsne = TSNE(n_components = 2, init = 'random', random_state = 0)
lab_proj = tsne.fit_transform(df)
x = [i[0] for i in lab_proj]
y = [i[1] for i in lab_proj]
print(len(lab_proj))
df['PCA1'] = x
df['PCA2'] = y
model = GaussianMixture(n_components = 1, covariance_type = 'full')
model.fit(df)
y_gmm = model.predict(df)
df['cluster'] = y_gmm
sns.lmplot(x = 'PCA1', y = 'PCA2', data = df, col = 'cluster', fit_reg = False)
plt.show()
Thanks!