Principal Component Analysis - three classes mixed across three separated groups - Python

I have a dataset with 3 labels and 27 features. I was trying to use PCA on it and reduce the dimensions to 2. The results are a bit confusing. Honestly, I didn't expect very good results, but then I got the first picture and was very surprised.
Since I have three labels, I thought I had captured my three classes pretty clearly. However, when I apply the colors, I get the following picture:
I am a bit puzzled by the fact that the three classes are totally mixed across three clearly separated groups. I also tried it in 3D and the results look exactly the same.
Is there any error in my code, or does anyone know a reason why this could happen?
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.preprocessing import (StandardScaler, MaxAbsScaler, RobustScaler,
                                   Normalizer, QuantileTransformer, PowerTransformer, MinMaxScaler)
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

Dataset = pd.read_csv("...", header=0)
feature_spalten = ['...']
x = Dataset[feature_spalten]
y = Dataset.Classifier

# Standardize the features before PCA
sc = StandardScaler()
x = sc.fit_transform(x)

p = PCA()
p.fit(x)
x_transformed = p.transform(x)

# All points in one color
plt.figure()
plt.scatter(x_transformed[:, 0], x_transformed[:, 1])

# One color per label
plt.figure()
for label in y.unique():
    x_transformed_filtered = x_transformed[y == label, :]
    plt.scatter(x_transformed_filtered[:, 0], x_transformed_filtered[:, 1],
                label=label, s=25)
plt.legend()
plt.show()

This suggests that your data is clustered in high-dimensional space, with each cluster containing instances of several labels.
The objective of PCA is to find a lower-dimensional projection that preserves the variance of the data. The following hypothetical example shows how linearly separable two-dimensional data (with three clusters) can be projected to one dimension, with the clusters in the projection not corresponding to the labels (red versus blue).
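Here is a minimal sketch of that hypothetical example (the cluster positions and label offsets are made up for illustration): three clusters are spread along one axis, which carries almost all of the variance, while the two labels are separated only along the other, low-variance axis. PCA keeps the high-variance direction, so the 1D projection preserves the clusters but mixes the labels.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
points, labels = [], []
for center in (-10, 0, 10):  # three well-separated clusters along the x-axis
    for label, offset in ((0, -1), (1, 1)):  # labels split along the y-axis
        pts = np.column_stack([rng.normal(center, 0.5, 50),
                               rng.normal(offset, 0.2, 50)])
        points.append(pts)
        labels.append(np.full(50, label))
points, labels = np.vstack(points), np.concatenate(labels)

z = PCA(n_components=1).fit_transform(points)  # project to one dimension
plt.scatter(z, np.zeros_like(z), c=labels, cmap='bwr', s=10)
plt.show()  # the three clusters survive, but each mixes both labels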


Plotting each Cluster value percentage individually

So I have been working on this problem for a bit and seem to be stuck, so I am asking for some guidance here.
This is my code:
from clusteval import clusteval
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import pandas as pd

X, labels = make_blobs(n_samples=50, centers=2, n_features=5, cluster_std=1)
X = abs(X)
X = pd.DataFrame(X, columns=['Feature_1', 'Feature_2', 'Feature_3', 'Feature_4', 'Feature_5'])

ce = clusteval('kmeans', metric='euclidean', linkage='complete')
results = ce.fit(X)
X['Cluster_labels'] = results['labx']

X.groupby('Cluster_labels').Feature_1.value_counts(normalize=True).plot(kind='bar')
plt.tight_layout()
plt.show()
This produces this image:
This image is really close to what I want, but notice that both clusters show up in the same graph. I would like to produce the same graph for only one cluster; essentially, for every cluster I have, I want a graph like this. So if I had 10 clusters, I would have 10 graphs, each showing the percentage of each value within that cluster and that cluster only.
Any guidance or help is appreciated. Thanks.
I can suggest two alternative plots. Both would benefit from visual refinement (label all axes, clean up underscores, pick nicer font sizes, etc.) but hopefully are useful starting points.
Using pandas:
axes = X.hist('Feature_1', by='Cluster_labels')
for ax in axes.flatten():  # hist may return a 2D array of axes
    ax.set_title('Cluster_labels = ' + ax.get_title())
Using seaborn:
import seaborn as sns

sns.displot(X,
            x='Feature_1',
            col='Cluster_labels',
            binwidth=0.5)
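If you want literally one figure per cluster with the original value_counts-style bars, here is a third sketch that stays closer to your own approach. It assumes the X / 'Cluster_labels' frame built in your question, and bins the continuous feature with pd.cut so the bar chart stays readable:
import matplotlib.pyplot as plt
import pandas as pd

for label, group in X.groupby('Cluster_labels'):
    binned = pd.cut(group['Feature_1'], bins=10)  # bin the continuous values
    binned.value_counts(normalize=True).sort_index().plot(kind='bar')
    plt.title(f'Cluster {label}')
    plt.tight_layout()
    plt.show()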

Scatterplot of clustered data, to show Clusters and Centers

I found the best number of clusters and the cluster assignment for each data point.
Now how can I plot my scatter plot based on the centers and clusters to see the data?
This is my dataset.
This is the code I am using:
x = df_diabetes_normalizado['Glicose']
y = df_diabetes_normalizado['Massa Corporal']
Cluster = df_diabetes_normalizado['clusters']
centers = np.random.randn(1, 2)

fig = plt.figure(figsize=(14, 9))
ax = fig.add_subplot(111)
scatter = ax.scatter(x, y, c=Cluster, s=50)
for i, j in centers:
    ax.scatter(i, j, s=50, c='red', marker='+')
ax.set_xlabel('x')
ax.set_ylabel('y')
fig.show()
However, the plot is very confusing to me.
Could you please give me some guidance on how to fix my script to generate the correct scatter plot based on the centers and cluster distribution?
Two problems here:
1) 'clusters' is merely an integer cluster label assigned by the clustering algorithm; it tells you which cluster a row fell into, not whether that row is in Classe==0 or 1. Your dependent variable should be 'Classe' (1/0, presumably for diabetic or not), not 'clusters'.
Cluster = df_diabetes_normalizado['clusters']
...
scatter = ax.scatter(x,y,c=Cluster, ...)
Your plot wrongly uses color to show c=Cluster, i.e. the cluster label; you're not plotting Classe anywhere. Plot Classe instead. (You might choose to use size=Clusters, so larger clusters plot larger.)
2) 'Generate the correct scatterplot [of two variables]' is not well-defined: clearly you have 8 variables ('Numero Gravida', 'Glicose', 'Pressao', ..., 'Idade'), and your dependent variable ('Classe') is a function of all 8 of them, not just the two you arbitrarily picked to plot: x='Glicose' and y='Massa Corporal'.
Assuming you don't want a 3D or n-dimensional plot, you can either:
do some dimensionality reduction with PCA (Principal Component Analysis), then plot the two or three most important pseudovariables (see e.g. this example..., and the sketch after the iris example below),
or else build a model based on a custom cluster distance function.
If you post an MCVE of your dataset and tell us what sort of plot you actually want, then we can post code.
Example using iris dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, 0:2]
y = iris.target

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
assignments = kmeans.labels_  # this is the CLUSTERS column in your case

plt.figure(figsize=(12, 8))
classes = np.unique(assignments)
colors = ['r', 'b', 'k', 'y']  # 4 CLUSTERS SO 4 COLORS HERE
xs = X[:, 0]
ys = X[:, 1]
for s in classes:
    plt.scatter(xs[assignments == s], ys[assignments == s], c=colors[s])  # color based on cluster
for center, color in zip(kmeans.cluster_centers_, colors):
    plt.plot(center[0], center[1], color + 'o', markersize=16, alpha=0.5)  # cluster centers
plt.grid()
plt.show()
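For the PCA route mentioned above, here is a hedged sketch using the column names from your question (df_diabetes_normalizado, 'Classe' and 'clusters' are assumed to exist exactly as shown there; adjust to your real frame):
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

feature_cols = [c for c in df_diabetes_normalizado.columns
                if c not in ('Classe', 'clusters')]  # all 8 predictors
coords = PCA(n_components=2).fit_transform(df_diabetes_normalizado[feature_cols])

plt.figure(figsize=(10, 7))
plt.scatter(coords[:, 0], coords[:, 1],
            c=df_diabetes_normalizado['Classe'],  # color by outcome, not cluster label
            s=50, cmap='coolwarm')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()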

t-SNE map into 2D or 3D plot

import pandas as pd
from influxdb import InfluxDBClient
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

features = ["Ask1", "Bid1", "smooth_midprice", "BidSize1", "AskSize1"]

client = InfluxDBClient(host='127.0.0.1', port=8086, database='data',
                        username=username, password=password)
series = "DCIX_2016_11_15"
sql = "SELECT * FROM {} WHERE time >= '{}' AND time <= '{}'".format(series, FROMT, TOT)
df = pd.DataFrame(client.query(sql).get_points())

# Separating out the features
X = df.loc[:, features].values
# Standardizing the features
X = StandardScaler().fit_transform(X)

tsne = TSNE(n_components=3, n_jobs=5).fit_transform(X)
I would like to map my 5 features into a 2D or 3D plot, but I am a bit confused about how to do that. How can I build a plot from that information?
You already have most of the work done. t-SNE is a common visualization for understanding high-dimensional data, and right now the variable tsne is an array where each row represents a set of (x, y, z) coordinates from the obtained embedding. You could use other visualizations if you would like, but t-SNE is probably a good starting place.
As far as actually seeing the results, even though you have the coordinates available you still need to plot them somehow. The matplotlib library is a good option, and that's what we'll use here.
To plot in 2D you have a couple of options. You can either keep most of your code the same and simply perform a 2D t-SNE with
tsne = TSNE(n_components=2, n_jobs=5).fit_transform(X)
Or you can just use the components you have and only look at two of them at a time. The following snippet should handle either case:
import matplotlib.pyplot as plt
plt.scatter(*zip(*tsne[:,:2]))
plt.show()
The zip(*...) transposes your data so that you can pass the x coordinates and the y coordinates individually to scatter(), and the [:,:2] piece selects two coordinates to view. You could ignore it if your data is already 2D, or you could replace it with something like [:,[0,2]] to view, for example, the 0th and 2nd features in higher-dimensional data rather than just the first 2.
To plot in 3D the code looks much the same, at least for a minimal version.
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(*zip(*tsne))
plt.show()
The main differences are a use of 3D plotting libraries and making a 3D subplot.
Adding color: t-SNE visualizations are typically more helpful if they're color-coded somehow. One example might be the smooth midprice you currently have stored in X[:,2]. For exploratory visualizations, I find 2D plots more helpful, so I'll use that as the example:
plt.scatter(*zip(*tsne[:,:2]), c=X[:,2])
You still need the imports and whatnot, but by passing the keyword argument c you can color code the scatter plot. To adjust how that numeric data is displayed, you could use a different color map like so:
plt.scatter(*zip(*tsne[:,:2]), c=X[:,2], cmap='RdBu')
As the name might suggest, this colormap consists of a gradient between red and blue, and the lower values of X[:,2] will correspond to red.
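Putting the pieces together, here is a minimal end-to-end sketch, assuming X is the standardized feature array from your question (so column 2 is the smooth midprice):
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embedding = TSNE(n_components=2).fit_transform(X)  # 2D embedding
plt.scatter(embedding[:, 0], embedding[:, 1], c=X[:, 2], cmap='RdBu')
plt.colorbar(label='smooth_midprice (standardized)')
plt.show()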

How can I plot the probability density function for a fitted Gaussian mixture model under scikit-learn?

I'm struggling with a rather simple task. I have a vector of floats to which I would like to fit a Gaussian mixture model with two Gaussian kernels:
from sklearn.mixture import GMM
gmm = GMM(n_components=2)
gmm.fit(values) # values is numpy vector of floats
I would now like to plot the probability density function for the mixture model I've created, but I can't seem to find any documentation on how to do this. How should I best proceed?
Edit:
Here is the vector of data I'm fitting. And below is a more detailed example of how I'm doing things:
from sklearn.mixture import GMM
from matplotlib.pyplot import *
import numpy as np
try:
    import cPickle as pickle
except:
    import pickle

with open('/path/to/kde.pickle') as f:  # open the data file provided above
    kde = pickle.load(f)

gmm = GMM(n_components=2)
gmm.fit(kde)
x = np.linspace(np.min(kde), np.max(kde), len(kde))

# Plot the data to which the GMM is being fitted
figure()
plot(x, kde, color='blue')

# My half-baked attempt at replicating the scipy example
fit = gmm.score_samples(x)[0]
plot(x, fit, color='red')
The fitted curve doesn't look anything like what I'd expect. It doesn't even seem Gaussian, which is a bit strange given it was produced by a Gaussian process. Am I crazy?
I followed some examples mentioned in this thread and others and managed to get closer to the solution, but the final probability density function does not integrate to one. I guess I will post a question about that in another thread.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

np.random.seed(1)
mus = np.array([[0.2], [0.8]])
sigmas = np.array([[0.1], [0.1]]) ** 2
gmm = GaussianMixture(2)
gmm.means_ = mus
gmm.covariances_ = sigmas  # the attribute is covariances_, not covars_
gmm.weights_ = np.array([0.5, 0.5])

# Fit the GMM with random data from the corresponding Gaussians
gaus_samples_1 = np.random.normal(mus[0], sigmas[0], 10).reshape(10, 1)
gaus_samples_2 = np.random.normal(mus[1], sigmas[1], 10).reshape(10, 1)
fit_samples = np.concatenate((gaus_samples_1, gaus_samples_2))
gmm.fit(fit_samples)

fig = plt.figure()
ax = fig.add_subplot(111)
x = np.linspace(0, 1, 1000).reshape(1000, 1)
logprob = gmm.score_samples(x)
pdf = np.exp(logprob)
# print(np.max(pdf)) -> about 19.8; a pdf may exceed 1 when its components are narrow
ax.plot(x, pdf, '-k')
plt.show()
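As a sanity check on the "does not integrate to one" worry, you can integrate the density numerically over the plotted range. With means near 0.2 and 0.8 and small variances, almost all of the probability mass lies inside [0, 1], so the result should come out close to 1 even though the peak value is around 19.8:
area = np.trapz(pdf, x.ravel())  # numerical integration over [0, 1]
print(area)  # expected to be close to 1.0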
Take a look at this link:
http://www.astroml.org/book_figures/chapter4/fig_GMM_1D.html
They show how to plot a 1D GMM in three different ways.
Take a look at one of the scikit-learn examples on GitHub:
https://github.com/scikit-learn/scikit-learn/blob/master/examples/mixture/plot_gmm_pdf.py
The idea is to generate a meshgrid, get its score from the GMM, and plot it.
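Here is a minimal sketch of that meshgrid idea, in the spirit of the linked example (the 2D blobs are made up, not the question's data):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (100, 2)),
                  rng.normal(4, 1, (100, 2))])  # two 2D blobs
gmm2d = GaussianMixture(n_components=2).fit(data)

gx, gy = np.meshgrid(np.linspace(-4, 8, 200), np.linspace(-4, 8, 200))
grid = np.column_stack([gx.ravel(), gy.ravel()])
Z = -gmm2d.score_samples(grid).reshape(gx.shape)  # negative log-likelihood on the grid

plt.contour(gx, gy, Z, levels=20)
plt.scatter(data[:, 0], data[:, 1], s=5)
plt.show()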
I think this is an excellent resource: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

Add more sample points to data

Given some data of shape 20x45, where each row is a separate data set, say 20 different sine curves with 45 data points each, how would I go about getting the same data, but with shape 20x100?
In other words, I have some data A of shape 20x45, and some data B of length 20x100, and I would like to have A be of shape 20x100 so I can compare them better.
This is for Python and Numpy/Scipy.
I assume it can be done with splines, so I am looking for a simple example, maybe just 2x10 to 2x20 or something, where each row is just a line, to demonstrate the solution.
Thanks!
Ubuntu beat me to it while I was typing this example, but his example just uses linear interpolation, which can be done more easily with numpy.interp... (The difference is only a keyword argument in scipy.interpolate.interp1d, however.)
I figured I'd include my example, as it shows using scipy.interpolate.interp1d with a cubic spline...
import numpy as np
import scipy as sp
import scipy.interpolate
import matplotlib.pyplot as plt
# Generate some random data
y = (np.random.random(10) - 0.5).cumsum()
x = np.arange(y.size)
# Interpolate the data using a cubic spline to "new_length" samples
new_length = 50
new_x = np.linspace(x.min(), x.max(), new_length)
new_y = sp.interpolate.interp1d(x, y, kind='cubic')(new_x)
# Plot the results
plt.figure()
plt.subplot(2,1,1)
plt.plot(x, y, 'bo-')
plt.title('Using 1D Cubic Spline Interpolation')
plt.subplot(2,1,2)
plt.plot(new_x, new_y, 'ro-')
plt.show()
One way would be to use scipy.interpolate.interp1d:
import scipy as sp
import scipy.interpolate
import numpy as np

x = np.linspace(0, 2 * np.pi, 45)
y = np.zeros((2, 45))
y[0, :] = np.sin(x)  # np.sin instead of the removed sp.sin alias
y[1, :] = np.sin(2 * x)

f = sp.interpolate.interp1d(x, y)
y2 = f(np.linspace(0, 2 * np.pi, 100))
If your data is fairly dense, it may not be necessary to use higher order interpolation.
If your application is not sensitive to precision or you just want a quick overview, you could just fill the unknown data points with averages from neighbouring known data points (in other words, do naive linear interpolation).
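Following up on that naive linear route, here is a minimal sketch using np.interp to resample each row from 45 to 100 points, assuming all rows share a common x grid (the sine data is made up to match the question's description):
import numpy as np

A = np.sin(np.linspace(0, 2 * np.pi, 45)) * np.ones((20, 1))  # toy 20x45 data
x_old = np.linspace(0, 1, A.shape[1])
x_new = np.linspace(0, 1, 100)
A_resampled = np.vstack([np.interp(x_new, x_old, row) for row in A])
print(A_resampled.shape)  # (20, 100)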
