I've used GaussianMixture to analyze a multimodal distribution. From the GaussianMixture class I can access the means and covariances using the attributes means_ and covariances_. How can I use them to now plot the two underlying unimodal distributions?
I thought of using scipy.stats.norm, but I don't know what to pass as the loc and scale parameters. The desired output would be analogous to what is shown in the attached figure.
The example code of this question was modified from the answer here.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture
from scipy.stats import norm
ls = np.linspace(0, 60, 1000)
multimodal_norm = norm.pdf(ls, 0, 5) + norm.pdf(ls, 20, 10)
plt.plot(ls, multimodal_norm)
# concatenate ls and multimodal to form an array of samples
# the shape is [n_samples, n_features]
# we reshape them to create an additional axis and concatenate along it
samples = np.concatenate([ls.reshape((-1, 1)), multimodal_norm.reshape((-1,1))], axis=-1)
print(samples.shape)
gmix = mixture.GaussianMixture(n_components = 2, covariance_type = "full")
fitted = gmix.fit(samples)
print(fitted.means_)
print(fitted.covariances_)
# The idea is something like the following (not working):
new_norm1 = norm.pdf(ls, fitted.means_, fitted.covariances_)
new_norm2 = norm.pdf(ls, fitted.means_, fitted.covariances_)
plt.plot(ls, new_norm1, label='Norm 1')
plt.plot(ls, new_norm2, label='Norm 2')
It is not entirely clear what you are trying to accomplish. You are fitting a GaussianMixture model to a two-column array consisting of a uniform grid and the sum of two Gaussian pdfs evaluated on that grid. This is not how a Gaussian mixture model is meant to be fitted: it should be fitted to random observations drawn from some distribution (typically unknown, but possibly simulated).
Let me assume that you want to fit a GaussianMixture model to a sample drawn from a Gaussian mixture distribution, presumably to test how well the fit works when you know the expected outcome. Here is code that both simulates such a sample and fits the model. It prints the parameters that the fit recovered from the sample -- we observe that they are indeed close to the ones used for the simulation -- and plots the density of the fitted mixture at the end.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture
from scipy.stats import norm
# set simulation parameters
mean1, std1, w1 = 0,5,0.5
mean2, std2, w2 = 20,10,1-w1
# simulate constituents
n_samples = 100000
np.random.seed(2021)
gauss_sample_1 = np.random.normal(loc = mean1,scale = std1,size = n_samples)
gauss_sample_2 = np.random.normal(loc = mean2,scale = std2,size = n_samples)
binomial = np.random.binomial(n=1, p=w1, size = n_samples)
# simulate gaussian mixture
multimodal_samples = (gauss_sample_1 * binomial + gauss_sample_2 * (1-binomial)).reshape(-1,1)
# define and fit the mixture model
gmix = mixture.GaussianMixture(n_components = 2, covariance_type = "full")
fitted = gmix.fit(multimodal_samples)
print('fitted means:',fitted.means_[0][0],fitted.means_[1][0])
print('fitted stdevs:',np.sqrt(fitted.covariances_[0][0][0]),np.sqrt(fitted.covariances_[1][0][0]))
print('fitted weights:',fitted.weights_)
# Plot component pdfs and a joint pdf
ls = np.linspace(-50, 50, 1000)
new_norm1 = norm.pdf(ls, fitted.means_[0][0], np.sqrt(fitted.covariances_[0][0][0]))
new_norm2 = norm.pdf(ls, fitted.means_[1][0], np.sqrt(fitted.covariances_[1][0][0]))
multi_pdf = w1*new_norm1 + (1-w1)*new_norm2
plt.plot(ls, new_norm1, label='Norm pdf 1')
plt.plot(ls, new_norm2, label='Norm pdf 2')
plt.plot(ls, multi_pdf, label='multi-norm pdf')
plt.legend(loc = 'best')
plt.show()
The results are
fitted means: 22.358448018824642 0.8607494960575028
fitted stdevs: 8.770962351118127 5.58538485134623
fitted weights: [0.42517515 0.57482485]
As we can see, they are close (up to ordering, which the model of course cannot recover, but that is irrelevant anyway) to what went into the simulation:
mean1, std1, w1 = 0,5,0.5
mean2, std2, w2 = 20,10,1-w1
And here is the plot of the density and its components. Recall that the pdf of the Gaussian mixture is not the sum of the component pdfs but their weighted average with weights w1 and 1-w1.
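If you only had the fitted model (and not the true w1), you could build the same mixture pdf entirely from the fitted attributes; here is a minimal sketch reusing ls, fitted, and the imports from the code above:
fitted_pdf = sum(
    w * norm.pdf(ls, mu[0], np.sqrt(cov[0, 0]))
    for w, mu, cov in zip(fitted.weights_, fitted.means_, fitted.covariances_)
)
plt.plot(ls, fitted_pdf, label='fitted mixture pdf')
plt.legend(loc='best')
plt.show()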
I am currently running an exploratory factor analysis in Python, which works well with the factor_analyzer package (https://factor-analyzer.readthedocs.io/en/latest/factor_analyzer.html). To choose the appropriate number of factors, I used the Kaiser criterion and the Scree plot. However, I would like to confirm my results using Horn's parallel analysis (Horn, 1965). In R I would use the parallel function from the psych package. Does anyone know an equivalent method / function / package in Python? I've been searching for some time now, but unfortunately without success.
You've probably figured out a solution by now but, for the sake of others who might be looking for it, here's some code that I've used to mimic the parallel analysis from the psych library:
import pandas as pd
from factor_analyzer import FactorAnalyzer
import numpy as np
import matplotlib.pyplot as plt
def _HornParallelAnalysis(data, K=10, printEigenvalues=False):
    ################
    # Create a random matrix to match the dataset
    ################
    n, m = data.shape
    # Set the factor analysis parameters
    fa = FactorAnalyzer(n_factors=1, method='minres', rotation=None, use_smc=True)
    # Create arrays to store the values (zero-initialised, since we accumulate into them)
    sumComponentEigens = np.zeros(m)
    sumFactorEigens = np.zeros(m)
    # Run the fit 'K' times over a random matrix
    for runNum in range(0, K):
        fa.fit(np.random.normal(size=(n, m)))
        sumComponentEigens = sumComponentEigens + fa.get_eigenvalues()[0]
        sumFactorEigens = sumFactorEigens + fa.get_eigenvalues()[1]
    # Average over the number of runs
    avgComponentEigens = sumComponentEigens / K
    avgFactorEigens = sumFactorEigens / K

    ################
    # Get the eigenvalues for the fit on supplied data
    ################
    fa.fit(data)
    dataEv = fa.get_eigenvalues()
    # Set up a scree plot
    plt.figure(figsize=(8, 6))

    ################
    ### Print results
    ################
    if printEigenvalues:
        print('Principal component eigenvalues for random matrix:\n', avgComponentEigens)
        print('Factor eigenvalues for random matrix:\n', avgFactorEigens)
        print('Principal component eigenvalues for data:\n', dataEv[0])
        print('Factor eigenvalues for data:\n', dataEv[1])
    # Find the suggested stopping points
    suggestedFactors = sum((dataEv[1] - avgFactorEigens) > 0)
    suggestedComponents = sum((dataEv[0] - avgComponentEigens) > 0)
    print('Parallel analysis suggests that the number of factors = ', suggestedFactors,
          ' and the number of components = ', suggestedComponents)

    ################
    ### Plot the eigenvalues against the number of variables
    ################
    # Line for eigenvalue 1
    plt.plot([0, m+1], [1, 1], 'k--', alpha=0.3)
    # For the random data - Components
    plt.plot(range(1, m+1), avgComponentEigens, 'b', label='PC - random', alpha=0.4)
    # For the Data - Components
    plt.scatter(range(1, m+1), dataEv[0], c='b', marker='o')
    plt.plot(range(1, m+1), dataEv[0], 'b', label='PC - data')
    # For the random data - Factors
    plt.plot(range(1, m+1), avgFactorEigens, 'g', label='FA - random', alpha=0.4)
    # For the Data - Factors
    plt.scatter(range(1, m+1), dataEv[1], c='g', marker='o')
    plt.plot(range(1, m+1), dataEv[1], 'g', label='FA - data')
    plt.title('Parallel Analysis Scree Plots', {'fontsize': 20})
    plt.xlabel('Factors/Components', {'fontsize': 15})
    plt.xticks(ticks=range(1, m+1), labels=range(1, m+1))
    plt.ylabel('Eigenvalue', {'fontsize': 15})
    plt.legend()
    plt.show()
If you call the above like this:
_HornParallelAnalysis(myDataSet)
You should get something like the following:
Example output for parallel analysis:
Thanks for sharing, Eric and Reza.
Here I also provide a faster solution for readers who only need a PCA-based parallel analysis. The code above took too long for me (apparently because of my very large dataset of size 33 x 15498; I waited a day with no result), so if you only need a PCA parallel analysis, as in my case, you can use the simple and very fast code below. Just put your dataset in a csv file; the program reads the csv and quickly produces a PCA parallel analysis plot:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
shapeMatrix = pd.read_csv("E:\\projects\\ankle_imp_ssm\\results\\parallel_analysis\\data\\shapeMatrix.csv")
shapeMatrix.dropna(axis=1, inplace=True)
normalized_shapeMatrix=(shapeMatrix-shapeMatrix.mean())/shapeMatrix.std()
pca = PCA(shapeMatrix.shape[0]-1)
pca.fit(normalized_shapeMatrix)
transformedShapeMatrix = pca.transform(normalized_shapeMatrix)
#np.savetxt("pca_data.csv", pca.explained_variance_, delimiter=",")
random_eigenvalues = np.zeros(shapeMatrix.shape[0]-1)
for i in range(100):
    random_shapeMatrix = pd.DataFrame(np.random.normal(0, 1, [shapeMatrix.shape[0], shapeMatrix.shape[1]]))
    pca_random = PCA(shapeMatrix.shape[0]-1)
    pca_random.fit(random_shapeMatrix)
    transformedRandomShapeMatrix = pca_random.transform(random_shapeMatrix)
    random_eigenvalues = random_eigenvalues + pca_random.explained_variance_ratio_
random_eigenvalues = random_eigenvalues / 100
#np.savetxt("pca_random.csv", random_eigenvalues, delimiter=",")
plt.plot(pca.explained_variance_ratio_, '--bo', label='pca-data')
plt.plot(random_eigenvalues, '--rx', label='pca-random')
plt.legend()
plt.title('parallel analysis plot')
plt.show()
I ran this piece of code on the matrix of shapes for which I created a statistical shape model (the shape matrix is of size 33 x 15498), and it takes just a few seconds to run.
I would like to cluster the following set of data into two clusters corresponding to the two lines ("\" and "/") of the "X". I was thinking it could be done using the Pearson correlation coefficient as the distance metric in scikit-learn's AgglomerativeClustering, as indicated here (How to use Pearson Correlation as distance metric in Scikit-learn Agglomerative clustering), but it doesn't seem to work.
Plot of the raw data
Data:
-6.5955882 11.344538
-6.1911765 12.027311
-5.4191176 10.346639
-4.7573529 7.5105042
-2.9191176 7.7205882
-1.5955882 6.6176471
-2.9558824 6.039916
-1.1544118 3.9915966
-0.088235294 4.7794118
-0.088235294 2.8361345
0.53676471 -1.2079832
2.7794118 0
3.4044118 -4.3592437
5.2794118 -3.9915966
6.75 -8.5609244
7.4485294 -6.8802521
5.1691176 -5.7247899
-7.1470588 -2.8361345
-6.7058824 -1.2605042
-4.4264706 -1.1554622
-3.5073529 0.78781513
-0.86029412 0.31512605
-1.0808824 2.1533613
-2.8823529 -0.42016807
1.0514706 2.2584034
1.9338235 4.4117647
4.6544118 5.5147059
3.7352941 7.0378151
6.0147059 8.2457983
7.0808824 7.7205882
The code I've tried:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.stats import pearsonr
nc=2
data = np.loadtxt("cross-data_2.dat")
plt.scatter(data[:,0], data[:,1], s=100, cmap='viridis')
def pearson_affinity(M):
    return 1 - np.array([[pearsonr(a, b)[0] for a in M] for b in M])
hc = AgglomerativeClustering(n_clusters=nc, affinity = pearson_affinity, linkage = 'average')
y_hc = hc.fit_predict(data)
plt.figure()
plt.scatter(data[y_hc ==0,0], data[y_hc == 0,1], s=100, c='red')
plt.scatter(data[y_hc==1,0], data[y_hc == 1,1], s=100, c='black')
plt.show()
The results of the clustering:
Is there something wrong in the code or should I simply use another method?
I propose yet another method for this: Gaussian mixture models.
X = (your data)

from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

gmm = GaussianMixture(n_components=2,
                      init_params='random',
                      n_init=5,
                      random_state=123)
y_pred = gmm.fit_predict(X)
plt.scatter(*X.T, c=y_pred)
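For example, X can be loaded the same way as in the snippet from the question (assuming the cross-data_2.dat file shown there):
import numpy as np

# load the two-column data listed in the question
X = np.loadtxt("cross-data_2.dat")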
I can propose an alternative method to achieve this. Since you are trying to cluster points that lie along the same angle, we can first transform the data (here x, the same array as data in your code) to polar (r-theta) coordinates and then use simple KMeans clustering.
# polar coordinates: radius and angle (arctan folds opposite directions onto the same angle)
r = np.sqrt(x[:, 0]**2 + x[:, 1]**2)
theta = np.arctan(x[:, 1] / x[:, 0])
# map back to Cartesian with the folded angle, so both arms of each line coincide
xr = np.vstack((r*np.sin(theta), r*np.cos(theta))).T
from sklearn.cluster import KMeans
km = KMeans(2)
xx = km.fit_predict(xr)
plt.scatter(x[:, 0], x[:, 1], c=xx)
I plotted some data points using K-Means clustering. The screenshot is available at "https://imageshack.com/i/pomMJXMkj". When I visualize these data points, it is clear that many points do not appear to be in their respective clusters; the green point, for instance, is far away from its centroid and clearly very near the blue centroid. According to the K-Means algorithm, a point is assigned to the cluster with the nearest centroid, so why isn't that the case here?
The code for the following visual is mentioned below and the link for the dataset is "https://github.com/Vivek-Nimmagadda/Player-Prediction-Using-Python/blob/master/Bowlers/Bowlers.csv":
# Importing the Bowlers Dataset
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Bowlers\\Bowlers.csv')
X = dataset.iloc[:, [1, 2, 3, 4, 5, 6, 7]].values
# Using Elbow Method to find the optimal number of Clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=300, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
# Fitting K-Means Clustering Algorithm to the Dataset
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, max_iter=300, random_state=0)
y_kmeans = kmeans.fit_predict(X)
# Visualising the Clusters
plt.scatter(X[y_kmeans == 0,2], X[y_kmeans == 0,4], s = 100, c = 'blue', label = 'Good Form')
plt.scatter(X[y_kmeans == 1,2], X[y_kmeans == 1,4], s = 100, c = 'purple', label = 'Average Touch')
plt.scatter(X[y_kmeans == 2,2], X[y_kmeans == 2,4], s = 100, c = 'green', label = 'Peak Form')
plt.scatter(X[y_kmeans == 3,2], X[y_kmeans == 3,4], s = 100, c = 'red', label = 'Poor Form')
plt.scatter(kmeans.cluster_centers_[:, 2], kmeans.cluster_centers_[:, 4], s = 150, c = 'cyan', label = 'Centroids')
plt.title('Recent Form of Bowlers Based on their Stats')
plt.xlabel('Wickets')
plt.ylabel('Average')
plt.legend()
plt.show()
My expected result is to see all the data points plotted neatly within their respective clusters, but the points appear to be displayed almost randomly. Can anyone please help me correct this?
From the looks of it, you're clustering a data set based on features in 7 dimensions/variables. If we were able to view 7 dimensions at a time, you would see that these points do actually cluster together correctly.
But unfortunately we can't. The plot you are viewing contains just two of those dimensions, and information contained in the other dimensions (variables) is lost. This loss of information makes it look as if the points don't cluster together, but in their original higher-dimensional space they do, which is what was found by your clustering algorithm.
A dimensionality reduction technique such as principal component analysis (also available in sklearn) can reduce the data to 2 dimensions more "effectively", projecting it onto the axes that contain the most variance in the original space. But even then you might not see the clustering behaviour you expect; if that is the case, you will just have to trust your clustering algorithm!
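For illustration, here is a minimal sketch of that projection, reusing X and y_kmeans from the question's code (a sketch only; how you scale the features beforehand is up to you):
from sklearn.decomposition import PCA

# Project the 7 features onto the first two principal components
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_kmeans, s=100)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('K-Means clusters in PCA space')
plt.show()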
After doing PCA on my data and plotting the kmeans clusters, my plot looks really weird. The centers of the clusters and scatter plot of the points do not make sense to me. Here is my code:
#clicks, conversion, bounce and search are lists of values.
clicks=[2,0,0,8,7,...]
conversion = [1,0,0,6,0...]
bounce = [2,4,5,0,1....]
X = np.array([clicks,conversion, bounce]).T
y = np.array(search)
num_clusters = 5
pca=PCA(n_components=2, whiten=True)
data2D = pca.fit_transform(X)
print data2D
>>> [[-0.07187948 -0.17784291]
[-0.07173769 -0.26868727]
[-0.07173789 -0.26867958]
...,
[-0.06942414 -0.25040886]
[-0.06950897 -0.19591147]
[-0.07172973 -0.2687937 ]]
km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(X)
labels=km.labels_
centers2D = pca.fit_transform(km.cluster_centers_)
colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
plt.hold(True)
plt.scatter(centers2D[:,0], centers2D[:,1], marker='x', c='r')
plt.show()
The red crosses are the center of the clusters. Any help would be great.
Your ordering of PCA and KMeans is screwing things up...
Here is what you need to do:
Normalize your data.
Perform PCA on X to reduce the dimensionality to 2 and produce Data2D
Normalize again
Cluster Data2D with KMeans
Plot the Centroids on top of Data2D.
Whereas, here is what you have done above:
Perform PCA on X to reduce the dimensionality to 2 and produce Data2D
Cluster the original, full-dimensional data X.
Perform a separate PCA on your cluster centroids, which produces a completely different 2D subspace for the centroids.
Plot the PCA reduced Data2D with the PCA reduced centroids on top even though these no longer are coupled properly.
Normalization:
Take a look at the code below and you'll see that it puts the centroids right where they need to be. The normalization is key and is completely reversible. ALWAYS normalize your data when you cluster, so that the distance metric weights every dimension equally. Clustering is one of the most important times to normalize your data, but in general... ALWAYS NORMALIZE :-)
A heuristic discussion that goes beyond your original question:
The entire point of dimensionality reduction is to make the KMeans clustering easier and to project out dimensions that add little to the variance of the data. So you should pass the reduced data to your clustering algorithm. I'll add that very few higher-dimensional datasets can be projected down to 2D without throwing out a lot of variance, so look at the PCA diagnostics to check whether, say, 90% of the original variance has been preserved. If not, you might not want to be so aggressive with your PCA.
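As a quick illustration of that diagnostic (a sketch only, assuming a feature matrix X such as the one in the question):
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X)                                    # fit with all components
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)  # keep enough components to reach, say, ~0.90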
New Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import seaborn as sns
%matplotlib inline
# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('/Users/angus/Desktop/Downloads/stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()
#Normalize the data
df_norm = (df - df.mean()) / (df.max() - df.min())
num_clusters = 5
pca=PCA(n_components=2)
UnNormdata2D = pca.fit_transform(df_norm)
# Check the resulting variance
var = pca.explained_variance_ratio_
print("Variance after PCA: ", var)
#Normalize again following PCA: data2D
data2D = (UnNormdata2D - UnNormdata2D.mean()) / (UnNormdata2D.max()-UnNormdata2D.min())
print("Data2D: ")
print(data2D)
km = KMeans(n_clusters=num_clusters, init='k-means++',n_init=10, verbose=1)
km.fit_transform(data2D)
labels=km.labels_
centers2D = km.cluster_centers_
colors=['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
plt.scatter( data2D[:,0], data2D[:,1], c=label_color)
# plt.hold is no longer needed (it was removed from matplotlib); repeated scatter calls draw on the same axes
plt.scatter(centers2D[:,0], centers2D[:,1],marker='x',s=150.0,color='purple')
plt.show()
Plot:
Output:
Variance after PCA: [ 0.65725709 0.29875307]
Data2D:
[[-0.00338421 -0.0009403 ]
[-0.00512081 -0.00095038]
[-0.00512081 -0.00095038]
...,
[-0.00477349 -0.00094836]
[-0.00373153 -0.00094232]
[-0.00512081 -0.00095038]]
Initialization complete
Iteration 0, inertia 51.225
Iteration 1, inertia 38.597
Iteration 2, inertia 36.837
...
...
Converged at iteration 31
Hope this helps!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# read your data, replace 'stackoverflow.csv' with your file path
df = pd.read_csv('stackoverflow.csv', usecols=[0, 2, 4], names=['freq', 'visit_length', 'conversion_cnt'], header=0).dropna()
df.describe()
Out[3]:
freq visit_length conversion_cnt
count 289705.0000 289705.0000 289705.0000
mean 0.2624 20.7598 0.0748
std 0.4399 55.0571 0.2631
min 0.0000 1.0000 0.0000
25% 0.0000 6.0000 0.0000
50% 0.0000 10.0000 0.0000
75% 1.0000 21.0000 0.0000
max 1.0000 2500.0000 1.0000
# binarize freq and conversion_cnt
df.freq = np.where(df.freq > 1.0, 1, 0)
df.conversion_cnt = np.where(df.conversion_cnt > 0.0, 1, 0)
feature_names = df.columns
X_raw = df.values
transformer = PCA(n_components=2)
X_2d = transformer.fit_transform(X_raw)
# over 99.9% variance captured by 2d data
transformer.explained_variance_ratio_
Out[4]: array([ 9.9991e-01, 6.6411e-05])
# do clustering
estimator = KMeans(n_clusters=5, init='k-means++', n_init=10, verbose=1)
estimator.fit(X_2d)
labels = estimator.labels_
colors = ['#000000','#FFFFFF','#FF0000','#00FF00','#0000FF']
col_map=dict(zip(set(labels),colors))
label_color = [col_map[l] for l in labels]
fig, ax = plt.subplots()
ax.scatter(X_2d[:,0], X_2d[:,1], c=label_color)
ax.scatter(estimator.cluster_centers_[:,0], estimator.cluster_centers_[:,1], marker='x', s=50, c='r')
KMeans tries to minimize within-group Euclidean distance, and this may or may not be appropriate for your data. Just based on the graph, I would consider a Gaussian Mixture Model to do the unsupervised clustering.
Also, if you have prior knowledge about which observations might belong to which category/label, you can use semi-supervised learning.
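For instance, here is a minimal sketch of the semi-supervised route using scikit-learn's LabelSpreading; the features and the handful of known labels below are hypothetical placeholders, not your data:
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
X = rng.random((100, 2))                # hypothetical 2D features
y = np.full(100, -1)                    # -1 marks unlabelled observations
y[:10] = rng.integers(0, 2, 10)         # a few observations with known labels

model = LabelSpreading(kernel='knn', n_neighbors=7)
model.fit(X, y)
print(model.transduction_)              # inferred labels for every observation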