How to find mean in kmeans in single shot using numpy - python

I have a function:
def update(points, closest, centroids):
    return np.array([points[closest==k].mean(axis=0) for k in range(centroids.shape[0])])
This is basically the centroid-update step of the k-means algorithm.
points is a matrix with one row per point, and closest holds the index of the cluster each point is assigned to.
All I am doing is computing the new mean of the points in each cluster.
Can I get rid of that for loop, i.e. compute all the cluster means in one shot?

Here's a vectorized approach based on np.add.reduceat. Clusters with no assigned points are left as NaN:
c = np.bincount(closest,minlength=centroids.shape[0])
mask = c != 0
pts_grp = points[closest.argsort()]
cut_idx = np.append(0,c[mask].cumsum()[:-1])
out = np.full((centroids.shape[0],points.shape[1]),np.nan)
out[mask] = np.add.reduceat(pts_grp,cut_idx,axis=0)/c[mask,None].astype(float)
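As a sanity check, the reduceat version can be wrapped in a function and compared against the original loop. This is only a sketch: the names update_loop and update_reduceat and the toy data are made up for illustration, and k stands in for centroids.shape[0].

```python
import numpy as np

def update_loop(points, closest, k):
    # Original approach: one boolean mask and one mean per cluster.
    return np.array([points[closest == j].mean(axis=0) for j in range(k)])

def update_reduceat(points, closest, k):
    c = np.bincount(closest, minlength=k)          # points per cluster
    mask = c != 0                                  # clusters that are non-empty
    pts_grp = points[closest.argsort()]            # rows grouped by cluster label
    cut_idx = np.append(0, c[mask].cumsum()[:-1])  # start row of each non-empty group
    out = np.full((k, points.shape[1]), np.nan)    # empty clusters stay NaN
    out[mask] = np.add.reduceat(pts_grp, cut_idx, axis=0) / c[mask, None].astype(float)
    return out

# Hypothetical toy data: 10 points in 3-D, 4 clusters, every cluster non-empty.
points = np.arange(30, dtype=float).reshape(10, 3)
closest = np.arange(10) % 4
same = np.allclose(update_reduceat(points, closest, 4), update_loop(points, closest, 4))
```

Sorting by label is what makes each cluster a contiguous block of rows, so reduceat only needs the row index where each block starts.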

Related

Is there a faster way to perform this neighbour finding operation

I'm trying to calculate Moran's I in Python (this is the underlying equation). My inputs are an Nx3 array coords containing the coordinates of each point and an Nx3 array z which contains the values minus the overall mean. The operation requires each value of z to be multiplied by the value at every point within a set distance (here set to 1.99). My problem is that in my case N is about 2 million, so the find_neighbours operation is very slow. Is there a way I could speed this up?
def find_neighbours(coords,idx,k):
    distances = np.sqrt(np.power(coords - coords[idx], 2).sum(axis=1))
    distances[idx] = np.inf
    return np.argwhere(distances<=k)
z = x - np.mean(x)
n = len(coords)
A = 0
B = np.sum([z[idx]**2 for idx,coord in enumerate(coords)])
S_0 = 0
for idx in range(len(coords)):
    neighbours = find_neighbours(coords,idx,1.99)
    S_0 += len(neighbours)
    A += np.sum([(z[neighbour]*z[idx]) for neighbour in neighbours])
I = (n/S_0)*(A/B)
This is a classic problem with plenty of literature. It is called radius neighbor search in three-dimensional point clouds. You need to store your points in a better data structure to make the search faster; I would suggest an octree.
Check python code here and adapt to your case.
For explanations, check this paper.
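If you do not want to write the tree yourself, one option is SciPy's k-d tree, which is a different spatial index from the octree suggested above but serves the same purpose here. The sketch below is an assumption-laden illustration, not a drop-in replacement: the function name morans_i is made up, x is assumed to be a 1-D array with one value per point, and the weights are assumed binary (1 within the radius, 0 otherwise), as in the question's loop.

```python
import numpy as np
from scipy.spatial import cKDTree

def morans_i(coords, x, radius):
    # coords: (N, 3) array of positions; x: (N,) array of values (assumed 1-D).
    z = x - x.mean()
    tree = cKDTree(coords)
    # Every unordered pair (i, j), i < j, at distance <= radius, found once.
    pairs = tree.query_pairs(r=radius, output_type="ndarray")
    # The loop in the question visits each pair from both ends, hence the factor 2.
    s0 = 2 * len(pairs)
    a = 2.0 * np.sum(z[pairs[:, 0]] * z[pairs[:, 1]])
    b = np.sum(z ** 2)
    return (len(coords) / s0) * (a / b)
```

For dense data the number of pairs can itself become large, so for millions of points it may be necessary to process the tree in chunks; that part is not shown.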

How to create distance table using geodesic

I'm working in Python. Say I have a DataFrame like this, holding the latitude and longitude of some points:
import pandas as pd
dfa=pd.DataFrame(([1,2],[1,3],[1,1],[1,4]), columns=['y','x'])
Previously I used distance_matrix from scipy.spatial to build another DataFrame with the code below, but it cannot accurately compute distances between points given as longitude/latitude pairs:
from scipy.spatial import distance_matrix
pd.DataFrame(distance_matrix(dfa.values, dfa.values), index=dfa.index, columns=dfa.index)
Is it possible to switch the calculation to geodesic? Here is what I've tried:
from geopy.distance import geodesic
pd.DataFrame(geodesic(dfa.values[0], dfa.values[0]).kilometers, index=dfa.index, columns=dfa.index)
# I don't know how to adapt the [0] indices to each row and column
Any suggestions?
Given a list or list-like object locations of (latitude, longitude) pairs, you can do:
distances = pd.DataFrame([[geodesic(a, b).kilometers for a in locations]
                          for b in locations])
This is redundant, though, since it computes the distance for both (a, b) and (b, a) even though they are the same. Depending on the cost of geodesic, you may find one of the following alternatives faster. The first computes only one triangle of the table and mirrors it:
distances = pd.DataFrame([[geodesic(a, b).kilometers if i > j else 0.0
                           for i, a in enumerate(locations)]
                          for j, b in enumerate(locations)])
distances = distances.add(distances.T)
The second fills the table cell by cell and reuses the mirrored entry once it exists:
size = len(locations)
distances = pd.DataFrame(index=range(size), columns=range(size), dtype=float)
def get_distance(i, j):
    if i == j:
        return 0.0
    # An empty cell is NaN, which is truthy, so test it with pd.notna
    if pd.notna(distances.loc[j, i]):
        return distances.loc[j, i]
    return geodesic(locations[i], locations[j]).kilometers
for i in range(size):
    for j in range(size):
        distances.loc[i, j] = get_distance(i, j)
You can also store the data as a dictionary with the keys being output from itertools.combinations. There's also this article on creating a symmetric matrix class.
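The dictionary idea could look roughly like the sketch below. To keep it runnable without geopy it uses a plain haversine (great-circle) function as a stand-in, which is an approximation and not the same as geodesic; haversine_km and distance_table are made-up names, and you would pass something like lambda a, b: geodesic(a, b).kilometers as dist in real use.

```python
import math
from itertools import combinations

import numpy as np

def haversine_km(a, b, radius_km=6371.0088):
    # Great-circle distance between two (lat, lon) points given in degrees.
    # Stand-in only: geopy's geodesic uses an ellipsoidal model instead.
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

def distance_table(locations, dist=haversine_km):
    n = len(locations)
    # One dictionary entry per unordered pair (i, j) with i < j.
    pair_dist = {(i, j): dist(locations[i], locations[j])
                 for i, j in combinations(range(n), 2)}
    table = np.zeros((n, n))  # the diagonal stays 0
    for (i, j), d in pair_dist.items():
        table[i, j] = table[j, i] = d  # mirror into both triangles
    return table

# The (y, x) rows of dfa from the question, read as (lat, lon).
locations = [(1, 2), (1, 3), (1, 1), (1, 4)]
table = distance_table(locations)
```

With n points this calls the distance function n*(n-1)/2 times instead of n*n, and the resulting array can be wrapped in pd.DataFrame(table) if a labelled table is needed.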

Manually find the distance between centroid and labelled data points

I have carried out some clustering analysis on some data X and have arrived at both the labels y and the centroids c. Now, I'm trying to calculate the distance between X and their assigned cluster's centroid c. This is easy when we have a small number of points:
import numpy as np
# 10 random points in 3D space
X = np.random.rand(10,3)
# define the number of clusters, say 3
clusters = 3
# give each point a random label
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0,clusters) for i in range(10)]).reshape(-1,1)
# randomly assign location of centroids
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)
# calculate distances
distances = []
for i in range(len(X)):
    distances.append(np.linalg.norm(X[i]-c[y[i][0]]))
Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.
Thanks to numpy's array indexing, you can actually turn your for loop into a one-liner and avoid explicit looping altogether:
distances = np.linalg.norm(X- np.einsum('ijk->ik', c[y]), axis=1)
will do the same thing as your original for loop.
EDIT: Thanks @Kris, I forgot the axis keyword. Without it, numpy computes the norm of the entire flattened matrix rather than along each row (axis 1). I've updated the answer, and it now returns an array with one distance per point. Also, einsum was suggested by @Kris for their specific application.
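For what it's worth, the einsum call is only there to drop the extra axis that comes from y having shape (n, 1); flattening y first should give the same result. A small check, assuming the shapes from the question and using a seeded generator so it is repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 3))                       # 10 points in 3-D
clusters = 3
y = rng.integers(0, clusters, size=(10, 1))   # labels, shape (10, 1) as in the question
c = rng.random((clusters, 3))                 # centroids

# Reference: the explicit loop from the question.
loop = np.array([np.linalg.norm(X[i] - c[y[i, 0]]) for i in range(len(X))])

# c[y] has shape (10, 1, 3); einsum sums away the length-1 middle axis.
via_einsum = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)

# c[y.ravel()] has shape (10, 3) directly: row i is the centroid assigned to X[i].
via_ravel = np.linalg.norm(X - c[y.ravel()], axis=1)
```

All three arrays hold one distance per point.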

k-Nearest Neighbors rundown

I'm trying to follow an example on k-Nearest Neighbors and I'm not sure about the numpy command syntax. I'm supposed to be doing a matrix-wise distance calculation and the code given is
import operator
from numpy import array, tile
def classify(inputVector, trainingData, labels, k):
    dataSetSize = trainingData.shape[0]
    # Repeat inputVector once per training row, then subtract row by row
    diffMat = tile(inputVector, (dataSetSize, 1)) - trainingData
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}  # votes per label among the k nearest neighbours
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
My question is: how does sqDistances**0.5 amount to the distance equation ((A[0]-B[0])^2 + (A[1]-B[1])^2)^(1/2)? I don't follow how tile contributes, specifically how the matrix built with (dataSetSize,1) relates to the training data.
I hope the following will explain the working.
Numpy tile : https://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html
tile repeats the input vector dataSetSize times, producing a matrix with the same shape as the training data. Subtracting the training data from this matrix gives the element-wise differences, e.g. test[0] - train[0] and test[1] - train[1] for each training row.
diffMat**2 then squares every element, and summing along axis=1 (https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) adds the squares within each row, giving (test[0] - train[0])^2 + (test[1] - train[1])^2 per training point.
Finally, sqDistances**0.5 takes the square root of each sum, which is the Euclidean distance.
If you just need the Euclidean distance between two vectors, this might be helpful:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean
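To see the steps on concrete numbers, here is a small walk-through using the training data from createDataSet and a made-up input vector [0.0, 0.2]. It also shows that plain broadcasting gives the same result without tile, which is my own addition rather than part of the example you are following.

```python
import numpy as np

inputVector = np.array([0.0, 0.2])   # hypothetical test point
trainingData = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])

# Step 1: repeat the input vector once per training row -> shape (4, 2).
tiled = np.tile(inputVector, (trainingData.shape[0], 1))

# Step 2: element-wise differences, one row per training point.
diffMat = tiled - trainingData

# Step 3: square, sum within each row, take the square root.
distances = ((diffMat ** 2).sum(axis=1)) ** 0.5

# Same computation via broadcasting: the (2,) vector is stretched
# against the (4, 2) matrix automatically.
broadcast = np.sqrt(((inputVector - trainingData) ** 2).sum(axis=1))

nearest = distances.argsort()[0]     # index of the closest training row
```

Each entry of distances is the Euclidean distance from the input vector to one training row, which is what the argsort in classify then ranks.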

fastest way to find euclidean distance in python

I have two sets of 2D points (A and B), each with about 540 points. I need to find the points in B that are farther than a given distance alpha from all the points in A.
I have a solution, but it is not fast enough:
# find the closest point in the target set for each new point
def find_closest_point(self, A, B):
    outliers = []
    for i in range(len(B)):
        # Euclidean distances from B[i] to every point in A
        temp = distance.cdist([B[i]], A)
        minimum = numpy.min(temp)
        # a point too far from all of A is considered an outlier
        if minimum > self.alpha:
            outliers.append([i, B[i]])
        else:
            continue
    return outliers
I am using python 2.7 with numpy and scipy. Is there another way to do this that I may gain a considerable increase in speed?
Thanks in advance for the answers
>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> A = np.random.randn(540, 2)
>>> B = np.random.randn(540, 2)
>>> alpha = 1.
>>> ind = np.all(cdist(A, B) > alpha, axis=0)
>>> outliers = B[ind]
gives you the points you want.
If you have a very large set of points, you could compute the x and y bounds of A, expand them by alpha on each side, and treat every point in B that lies outside that box as an outlier right away. Only the points inside the box then need the full distance check.
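A rough sketch of that bounding-box prefilter, using numpy only. The function name far_points_mask is made up, and the inner distance check uses plain broadcasting on squared distances; cdist would work just as well there.

```python
import numpy as np

def far_points_mask(A, B, alpha):
    # Bounding box of A, grown by alpha on every side.
    lo = A.min(axis=0) - alpha
    hi = A.max(axis=0) + alpha
    inside = np.all((B >= lo) & (B <= hi), axis=1)
    # A point outside the box is more than alpha away from every point
    # of A along at least one axis, so it is far from all of A.
    far = ~inside
    cand = B[inside]
    if len(cand):
        # Squared distances between the remaining candidates and all of A.
        d2 = ((cand[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        far[inside] = d2.min(axis=1) > alpha ** 2
    return far
```

The returned boolean mask selects the outliers with B[far_points_mask(A, B, alpha)]. How much this saves depends on how many points of B actually fall outside the box.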
