I would like to write a simple function to sample points from a d-dimensional simplex, while also specifying how many of the values should be non-zero.
For example, if d=5 and I want only two non-zero values, then it could sample points such as
np.array([0.,0.,0.,0.25,0.75])
np.array([0.5, 0.5, 0., 0., 0.])
np.array([0.0, 0.0, 0.8 ,0.2,0.])
Assuming d is the dimension of the simplex, n is the number of non-zero values, and you want to sample uniformly at random.
You can decompose this into a two-step process:
1- Pick n indices in the [0, d[ range without replacement, which can be done with np.random.choice.
2- Sample on the n-dimensional simplex for those n indices. See here for details on this part: https://cs.stackexchange.com/questions/3227/uniform-sampling-from-a-simplex
import numpy as np
def simplex_sample(dim):
    xs = np.random.uniform(0, 1, dim - 1)
    xs = np.append(xs, [0, 1])
    xs = np.sort(xs)
    xs = xs[1:] - xs[:-1]
    return xs
def simplex_sample_with_non_zeros(dim, n):
    xs = simplex_sample(n)
    ys = np.zeros(dim)
    idx = np.random.choice(dim, n, replace=False)
    ys[idx] = xs
    return ys
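A quick sanity check of the sampler (the printed values are illustrative, since the draws are random): the result sums to 1 and has exactly n non-zero entries.
sample = simplex_sample_with_non_zeros(5, 2)
print(sample)                     # e.g. [0.   0.37 0.   0.   0.63]
print(sample.sum())               # 1.0 (up to floating-point error)
print(np.count_nonzero(sample))   # 2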
Here's what it looks like visually:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
n = 100
data = np.array([simplex_sample_with_non_zeros(3, 2) for i in range(n)])
ax.scatter(data[:,0], data[:,1], data[:,2])
plt.show()
See https://1000words-hq.com/n/1x4Wk0FtZM5 for a live implementation.
I am trying to generate 100 samples from 2 different Gaussian distributions, such that G1 occurs with probability 0.7 and G2 occurs with 0.3. I have the following code snippet:
from scipy.stats import norm
import numpy as np
x = [norm.rvs(0, 1, size=5), norm.rvs(10, 1, 5)]
draw = np.random.choice([0, 1], 100, p=[0.7, 0.3])
y = [x[i].rvs() for i in draw]
z = np.array(y)
When I run this, I get the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'rvs'
Is there something I am missing? Or, is there a fundamental flaw?
In this line
x = [norm.rvs(0, 1, size=5), norm.rvs(10, 1, 5)]
you are creating two arrays of random values. So in this line
[x[i].rvs() for i in draw]
you can't call .rvs() on them to draw more random values, since they are plain numpy arrays:
norm.rvs(0, 1, size=5)
# Out: array([-1.61758314, 1.19288111, -0.55599284, -0.17926848, -0.78759 ])
You want to create a list of normal distribution objects, which you then use to draw random values from:
x = [norm(0, 1), norm(10, 1)]
draw = np.random.choice([0, 1], 100, p=[0.7, 0.3])
y = [x[i].rvs() for i in draw]
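As a follow-up, a vectorized sketch of the same idea: draw the component label for every sample at once and broadcast the chosen means into a single norm.rvs call.
import numpy as np
from scipy.stats import norm
means = np.array([0.0, 10.0])               # component means
draw = np.random.choice([0, 1], size=100, p=[0.7, 0.3])
z = norm.rvs(loc=means[draw], scale=1)      # one draw per selected component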
I have been using a solution found in several places on stack overflow for fitting a piecewise function:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13, 14, 15], dtype=float)
y = np.array([5, 7, 9, 11, 13, 15, 28.92, 42.81, 56.7, 70.59, 84.47, 98.36, 112.25, 126.14, 140.03])
def piecewise_linear(x, x0, y0, k1, k2):
    return np.piecewise(x, [x < x0], [lambda x: k1*x + y0 - k1*x0, lambda x: k2*x + y0 - k2*x0])
p, e = optimize.curve_fit(piecewise_linear, x, y)
xd = np.linspace(-5, 30, 100)
plt.plot(x, y, ".")
plt.plot(xd, piecewise_linear(xd, *p))
plt.show()
(for example, here: How to apply piecewise linear fit in Python?)
The first time I try it in the console I get an OptimizeWarning:
OptimizeWarning: Covariance of the parameters could not be estimated
After that I just get a straight line for my fit. It seems as though there is clearly a bend in the data that the fit isn't following, although I cannot figure out why.
For the dataset I am using there are about 3200 points in each of x and y; is this part of the problem?
Here are some fake data that roughly simulate mine (the same problem occurs: the fit is not piecewise):
x = np.append(np.random.uniform(low=10.0, high=40.2, size=(1500,)), np.random.uniform(low=-10.0, high=20.2, size=(1500,)))
y = np.append(np.random.uniform(low=-3000, high=0, size=(1500,)), np.random.uniform(low=-2000, high=1000, size=(1500,)))
Just to complete the question with the answer provided in the comment above:
The issue was not due to the large number of points, but to the large values on my y axis. Since the default initial parameter values are all 1, my y values of around 1000 were far off that scale. To fix this, an initial guess for the line fit was supplied via the parameter p0. From the docs for scipy.optimize.curve_fit:
p0 : None, scalar, or N-length sequence, optional
Initial guess for the parameters. If None, then the initial values will all be 1 (if the number of parameters for the function can be determined using introspection, otherwise a ValueError is raised).
So my final code ended up looking like this:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13, 14, 15], dtype=float)
y = np.array([500, 700, 900, 1100, 1300, 1500, 2892, 4281, 5670, 7059, 8447, 9836, 11225, 12614, 14003])
def piecewise_linear(x, x0, y0, k1, k2):
    return np.piecewise(x, [x < x0], [lambda x: k1*x + y0 - k1*x0, lambda x: k2*x + y0 - k2*x0])
p, e = optimize.curve_fit(piecewise_linear, x, y, p0=(10, -2500, 0, -500))
xd = np.linspace(-5, 30, 100)
plt.plot(x, y, ".")
plt.plot(xd, piecewise_linear(xd, *p))
plt.show()
Just for fun (a very scattered case):
Since the original data was not available, the coordinates of the points were obtained from the figure published in Rachel W's question, by scanning the image and recording the blue pixels. There are some artefacts due to the straight line and the grid, which appear white after scanning.
The result of the piecewise regression (two segments) is drawn in red on the figure above, together with the equation of the fitted function.
The regression method used is not iterative and does not require an initial guess. The code is very simple: see pp. 12-13 of this paper: https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
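For comparison, here is a hedged sketch of a simpler alternative that also needs no initial guess (this is not the method from the paper): scan candidate breakpoints and solve an ordinary linear least-squares problem for each one, keeping the breakpoint with the smallest residual.
import numpy as np

def fit_piecewise_bruteforce(x, y, candidates):
    # Model: y = y0 + k1*(x - x0) for x < x0, y = y0 + k2*(x - x0) otherwise,
    # i.e. the same continuous two-segment line as piecewise_linear above.
    best = None
    for x0 in candidates:
        left = np.where(x < x0, x - x0, 0.0)
        right = np.where(x >= x0, x - x0, 0.0)
        A = np.column_stack([np.ones_like(x), left, right])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # coef = (y0, k1, k2)
        rss = np.sum((A @ coef - y) ** 2)
        if best is None or rss < best[0]:
            best = (rss, x0, coef)
    return best

# usage sketch: rss, x0, (y0, k1, k2) = fit_piecewise_bruteforce(x, y, np.linspace(x.min(), x.max(), 200))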
How could I smooth the x[1,3] and x[3,2] elements of the array,
x = np.array([[0,0,0,0,0],[0,0,0,1,0],[0,0,0,0,0],[0,0,1,0,0],[0,0,0,0,0]])
with two two-dimensional gaussian functions of width 1 and 2, respectively? In essence I need a function that allows me to smooth single "point like" array elements with gaussians of differing widths, such that I get an array with smoothly varying values.
I am a little confused by the question you asked and the comments you have posted. It seems to me that you want to use scipy.ndimage.gaussian_filter, but I don't understand what you mean by:
[...] gaussian functions with different sigma values to each pixel. [...]
In fact, since you use a 2-dimensional array x, the gaussian filter takes 2 sigma parameters. The rule is: one sigma value per dimension, not one sigma value per pixel.
Here is a short example:
import matplotlib.pyplot as pl
import numpy as np
import scipy as sp
import scipy.ndimage
n = 200 # width/height of the array
m = 1000 # number of points
sigma_y = 3.0
sigma_x = 2.0
# Create input array
x = np.zeros((n, n))
i = np.random.choice(range(0, n * n), size=m)
x[i // n, i % n] = 1.0  # integer division for row/column indices
# Plot input array
pl.imshow(x, cmap='Blues', interpolation='nearest')
pl.xlabel("$x$")
pl.ylabel("$y$")
pl.savefig("array.png")
# Apply gaussian filter
sigma = [sigma_y, sigma_x]
y = sp.ndimage.gaussian_filter(x, sigma, mode='constant')
# Display filtered array
pl.imshow(y, cmap='Blues', interpolation='nearest')
pl.xlabel("$x$")
pl.ylabel("$y$")
pl.title(r"$\sigma_x = " + str(sigma_x) + r"\quad \sigma_y = " + str(sigma_y) + "$")
pl.savefig("smooth_array_" + str(sigma_x) + "_" + str(sigma_y) + ".png")
Here is the initial array:
Here are some results for different values of sigma_x and sigma_y:
This makes it possible to properly account for the influence of the second parameter of scipy.ndimage.gaussian_filter.
However, according to the previous quote, you might be more interested in the assignment of different weights to each pixel. In this case, scipy.ndimage.convolve is the function you are looking for. Here is the corresponding example:
import matplotlib.pyplot as pl
import numpy as np
import scipy as sp
import scipy.ndimage
# Arbitrary weights
weights = np.array([[0, 0, 1, 0, 0],
                    [0, 2, 4, 2, 0],
                    [1, 4, 8, 4, 1],
                    [0, 2, 4, 2, 0],
                    [0, 0, 1, 0, 0]],
                   dtype=float)
weights = weights / np.sum(weights[:])
y = sp.ndimage.convolve(x, weights, mode='constant')
# Display filtered array
pl.imshow(y, cmap='Blues', interpolation='nearest')
pl.xlabel("$x$")
pl.ylabel("$y$")
pl.savefig("smooth_array.png")
And the corresponding result:
I hope this will help you.
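If the goal really is to smooth individual point-like elements with different widths, as asked in the question, one hedged sketch is to rely on the linearity of the filter: smooth each impulse separately with its own sigma and sum the results.
import numpy as np
from scipy.ndimage import gaussian_filter

# The 5x5 array from the question: one impulse at (1, 3), one at (3, 2).
x = np.zeros((5, 5))
impulses = {(1, 3): 1.0, (3, 2): 2.0}    # position -> desired gaussian width

smoothed = np.zeros_like(x)
for (i, j), sigma in impulses.items():
    impulse = np.zeros_like(x)
    impulse[i, j] = 1.0                  # isolate the single point
    smoothed += gaussian_filter(impulse, sigma=sigma, mode='constant')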
On the Wikipedia page, an elbow method is described for determining the number of clusters in k-means. The built-in method of scipy provides an implementation, but I am not sure I understand how the distortion, as they call it, is calculated.
More precisely, if you graph the percentage of variance explained by
the clusters against the number of clusters, the first clusters will
add much information (explain a lot of variance), but at some point
the marginal gain will drop, giving an angle in the graph.
Assuming that I have the following points with their associated centroids, what is a good way of calculating this measure?
points = numpy.array([[ 0,  0],
                      [ 0,  1],
                      [ 0, -1],
                      [ 1,  0],
                      [-1,  0],
                      [ 9,  9],
                      [ 9, 10],
                      [ 9,  8],
                      [10,  9],
                      [10,  8]])
kmeans(points, 2)
(array([[9, 8],
        [0, 0]]), 0.9414213562373096)
I am specifically looking at computing the 0.94... measure given just the points and the centroids. I am not sure whether any of the built-in methods of scipy can be used or whether I have to write my own. Any suggestions on how to do this efficiently for a large number of points?
In short, my questions (all related) are the following:
Given a distance matrix and a mapping of which point belongs to which
cluster, what is a good way of computing a measure that can be used
to draw the elbow plot?
How would the methodology change if a different distance function such as cosine similarity is used?
EDIT 2: Distortion
from scipy.spatial.distance import cdist
D = cdist(points, centroids, 'euclidean')
sum(numpy.min(D, axis=1))
The output for the first set of points is accurate. However, when I try a different set:
>>> pp = numpy.array([[1,2], [2,1], [2,2], [1,3], [6,7], [6,5], [7,8], [8,8]])
>>> kmeans(pp, 2)
(array([[6, 7],
[1, 2]]), 1.1330618877807475)
>>> centroids = numpy.array([[6,7], [1,2]])
>>> D = cdist(pp, centroids, 'euclidean')
>>> sum(numpy.min(D, axis=1))
9.0644951022459797
I guess the last value does not match because kmeans seems to be dividing the value by the total number of points in the dataset.
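Indeed, dividing the summed distances by the number of points reproduces the value returned by kmeans:
>>> sum(numpy.min(D, axis=1)) / pp.shape[0]
1.1330618877807475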
EDIT 1: Percent Variance
My code so far (should be added to Denis's K-means implementation):
centres, xtoc, dist = kmeanssample( points, 2, nsample=2,
                                    delta=kmdelta, maxiter=kmiter, metric=metric, verbose=0 )
print "Unique clusters: ", set(xtoc)
print ""
cluster_vars = []
for cluster in set(xtoc):
    print "Cluster: ", cluster
    truthcondition = ([x == cluster for x in xtoc])
    distances_inside_cluster = (truthcondition * dist)
    indices = [i for i, x in enumerate(truthcondition) if x == True]
    final_distances = [distances_inside_cluster[k] for k in indices]
    print final_distances
    print np.array(final_distances).var()
    cluster_vars.append(np.array(final_distances).var())
    print ""
print "Sum of variances: ", sum(cluster_vars)
print "Total Variance: ", points.var()
print "Percent: ", (100 * sum(cluster_vars) / points.var())
And following is the output for k=2:
Unique clusters: set([0, 1])
Cluster: 0
[1.0, 2.0, 0.0, 1.4142135623730951, 1.0]
0.427451660041
Cluster: 1
[0.0, 1.0, 1.0, 1.0, 1.0]
0.16
Sum of variances: 0.587451660041
Total Variance: 21.1475
Percent: 2.77787757437
On my real dataset (does not look right to me!):
Sum of variances: 0.0188124746402
Total Variance: 0.00313754329764
Percent: 599.592510943
Unique clusters: set([0, 1, 2, 3])
Sum of variances: 0.0255808508714
Total Variance: 0.00313754329764
Percent: 815.314672809
Unique clusters: set([0, 1, 2, 3, 4])
Sum of variances: 0.0588210052519
Total Variance: 0.00313754329764
Percent: 1874.74720416
Unique clusters: set([0, 1, 2, 3, 4, 5])
Sum of variances: 0.0672406353655
Total Variance: 0.00313754329764
Percent: 2143.09824556
Unique clusters: set([0, 1, 2, 3, 4, 5, 6])
Sum of variances: 0.0646291452839
Total Variance: 0.00313754329764
Percent: 2059.86465055
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7])
Sum of variances: 0.0817517362176
Total Variance: 0.00313754329764
Percent: 2605.5970695
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8])
Sum of variances: 0.0912820650486
Total Variance: 0.00313754329764
Percent: 2909.34837831
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Sum of variances: 0.102119601368
Total Variance: 0.00313754329764
Percent: 3254.76309585
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Sum of variances: 0.125549475536
Total Variance: 0.00313754329764
Percent: 4001.52168834
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Sum of variances: 0.138469402779
Total Variance: 0.00313754329764
Percent: 4413.30651542
Unique clusters: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
The distortion, as far as k-means is concerned, is used as a stopping criterion (if the change between two iterations is less than some threshold, we assume convergence).
If you want to calculate it from a set of points and the centroids, you can do the following (the code is in MATLAB using pdist2 function, but it should be straightforward to rewrite in Python/Numpy/Scipy):
% data
X = [0 0 ; 0 1 ; 0 -1 ; 1 0 ; -1 0 ; 9 9 ; 9 10 ; 9 8 ; 10 9 ; 10 8];
% centroids
C = [9 8 ; 0 0];
% euclidean distance from each point to each cluster centroid
D = pdist2(X, C, 'euclidean');
% find closest centroid to each point, and the corresponding distance
[distortions,idx] = min(D,[],2);
the result:
% total distortion
>> sum(distortions)
ans =
9.4142135623731
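A direct NumPy/SciPy transcription of the MATLAB snippet (a sketch; scipy.spatial.distance.cdist plays the role of pdist2):
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0, 0], [0, 1], [0, -1], [1, 0], [-1, 0],
              [9, 9], [9, 10], [9, 8], [10, 9], [10, 8]])
C = np.array([[9, 8], [0, 0]])

D = cdist(X, C, 'euclidean')   # distance from every point to every centroid
idx = D.argmin(axis=1)         # index of the closest centroid for each point
distortions = D.min(axis=1)    # distance to that closest centroid
print(distortions.sum())       # ~9.4142135623731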
EDIT#1:
I had some time to play around with this. Here is an example of KMeans clustering applied on the 'Fisher Iris Dataset' (4 features, 150 instances). We iterate over k=1..10, plot the elbow curve, pick K=3 as the number of clusters, and show a scatter plot of the result.
Note that I included a number of ways to compute the within-cluster variances (distortions), given the points and the centroids. The scipy.cluster.vq.kmeans function returns this measure by default (computed with Euclidean as the distance measure). You can also use the scipy.spatial.distance.cdist function to calculate the distances with the metric of your choice (provided you obtained the cluster centroids using the same distance measure: @Denis has a solution for that), then compute the distortion from that.
import numpy as np
from scipy.cluster.vq import kmeans,vq
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
# load the iris dataset
fName = 'C:\\Python27\\Lib\\site-packages\\scipy\\spatial\\tests\\data\\iris.txt'
fp = open(fName)
X = np.loadtxt(fp)
fp.close()
##### cluster data into K=1..10 clusters #####
K = range(1, 11)
# scipy.cluster.vq.kmeans
KM = [kmeans(X,k) for k in K]
centroids = [cent for (cent,var) in KM] # cluster centroids
#avgWithinSS = [var for (cent,var) in KM] # mean within-cluster sum of squares
# alternative: scipy.cluster.vq.vq
#Z = [vq(X,cent) for cent in centroids]
#avgWithinSS = [sum(dist)/X.shape[0] for (cIdx,dist) in Z]
# alternative: scipy.spatial.distance.cdist
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/X.shape[0] for d in dist]
##### plot ###
kIdx = 2
# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12,
markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')
# scatter plot
fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(X[:,2],X[:,1], s=30, c=cIdx[k])
clr = ['b','g','r','c','m','y','k']
for i in range(K[kIdx]):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,2],X[ind,1], s=30, c=clr[i], label='Cluster %d'%i)
plt.xlabel('Petal Length')
plt.ylabel('Sepal Width')
plt.title('Iris Dataset, KMeans clustering with K=%d' % K[kIdx])
plt.legend()
plt.show()
EDIT#2:
In response to the comments, I give below another complete example using the NIST hand-written digits dataset: it has 1797 images of digits from 0 to 9, each of size 8-by-8 pixels. I repeat the experiment above slightly modified: Principal Components Analysis is applied to reduce the dimensionality from 64 down to 2:
import numpy as np
from scipy.cluster.vq import kmeans
from scipy.spatial.distance import cdist,pdist
from sklearn import datasets
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
from matplotlib import cm
##### data #####
# load digits dataset
data = datasets.load_digits()
t = data['target']
# perform PCA dimensionality reduction
pca = PCA(n_components=2, svd_solver='randomized').fit(data['data'])
X = pca.transform(data['data'])
##### cluster data into K=1..20 clusters #####
K_MAX = 20
KK = range(1,K_MAX+1)
KM = [kmeans(X,k) for k in KK]
centroids = [cent for (cent,var) in KM]
D_k = [cdist(X, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
tot_withinss = np.array([sum(d**2) for d in dist]) # Total within-cluster sum of squares
totss = sum(pdist(X)**2)/X.shape[0] # The total sum of squares
betweenss = totss - tot_withinss # The between-cluster sum of squares
##### plots #####
kIdx = 9 # K=10
clr = cm.nipy_spectral( np.linspace(0,1,10) ).tolist()
mrk = 'os^p<dvh8>+x.'
# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(KK, betweenss/totss*100, 'b*-')
ax.plot(KK[kIdx], betweenss[kIdx]/totss*100, marker='o', markersize=12,
markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
ax.set_ylim((0,100))
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Percentage of variance explained (%)')
plt.title('Elbow for KMeans clustering')
# show centroids for K=10 clusters
plt.figure()
for i in range(kIdx+1):
    img = pca.inverse_transform(centroids[kIdx][i]).reshape(8,8)
    ax = plt.subplot(3,4,i+1)
    ax.set_xticks([])
    ax.set_yticks([])
    plt.imshow(img, cmap=cm.gray)
    plt.title( 'Cluster %d' % i )
# compare K=10 clustering vs. actual digits (PCA projections)
fig = plt.figure()
ax = fig.add_subplot(121)
for i in range(10):
    ind = (t==i)
    ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='%d'%i)
plt.legend()
plt.title('Actual Digits')
ax = fig.add_subplot(122)
for i in range(kIdx+1):
    ind = (cIdx[kIdx]==i)
    ax.scatter(X[ind,0],X[ind,1], s=35, c=clr[i], marker=mrk[i], label='C%d'%i)
plt.legend()
plt.title('K=%d clusters'%KK[kIdx])
plt.show()
You can see how some clusters actually correspond to distinguishable digits, while others don't match a single number.
Note: An implementation of K-means is included in scikit-learn (as well as many other clustering algorithms and various clustering metrics). Here is another similar example.
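For instance, a minimal sketch with sklearn.cluster.KMeans (random stand-in data; the inertia_ attribute is the within-cluster sum of squares that the elbow plot is based on):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                  # stand-in data
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, 'b*-')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.title('Elbow for KMeans clustering')
plt.show()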
A simple cluster measure:
1) draw "sunburst" rays from each point to its nearest cluster centre,
2) look at the lengths — distance( point, centre, metric=... ) — of all the rays.
For metric="sqeuclidean" and 1 cluster,
the average length-squared is the total variance X.var(); for 2 clusters, it's less ... down to N clusters, lengths all 0.
"Percent of variance explained" is 100 % - this average.
Code for this, under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means:
from scipy.spatial.distance import cdist

def distancestocentres( X, centres, metric="euclidean", p=2 ):
    """ all distances X -> nearest centre, any metric
        euclidean2 (~ withinss) is more sensitive to outliers,
        cityblock (manhattan, L1) less sensitive
    """
    D = cdist( X, centres, metric=metric, p=p )  # |X| x |centres|
    return D.min(axis=1)  # all the distances
Like any long list of numbers, these distances can be looked at in various ways: np.mean(), np.histogram() ... Plotting, visualization, is not easy.
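For example, a quick look at these distances (a usage sketch, assuming X and centres from the k-means run above):
dists = distancestocentres(X, centres, metric="cityblock")
print("mean distance:", np.mean(dists))
print("histogram counts:", np.histogram(dists, bins=10)[0])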
See also stats.stackexchange.com/questions/tagged/clustering, in particular
How to tell if data is “clustered” enough for clustering algorithms to produce meaningful results?