k-Nearest Neighbors rundown - python

I'm trying to follow an example on k-Nearest Neighbors and I'm not sure about the numpy command syntax. I'm supposed to be doing a matrix-wise distance calculation and the code given is
from numpy import array, tile
import operator

def classify(inputVector, trainingData, labels, k):
    dataSetSize = trainingData.shape[0]
    # repeat the input vector once per training row, then subtract element-wise
    diffMat = tile(inputVector, (dataSetSize, 1)) - trainingData
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}  # vote tally per label
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() replaces the Python 2 iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels
My question is: how does sqDistances**0.5 amount to the distance equation ((A[0]-B[0])^2+(A[1]-B[1])^2)^(1/2)? I don't follow how tile influences it, specifically how the matrix is made from tile(inputVector,(dataSetSize,1))-trainingData.

I hope the following explains how it works.
Numpy tile : https://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html
tile repeats the input vector dataSetSize times, which creates a matrix with the same shape as the training data. Subtracting the training data from this matrix gives the element-wise differences you mentioned, i.e. test[0]-train[0] and test[1]-train[1] for every training row.
diffMat**2 then squares each of those elements, and the sum along axis = 1 (https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html) adds the squares within each row. This results in one value per training point of the form (test[0] - train[0])^2 + (test[1] - train[1])^2.
Finally, sqDistances**0.5 takes the square root of each sum, which gives the Euclidean distance.
To calculate the Euclidean distance directly, this might be helpful:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean
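To make the steps concrete, here is a small sketch that runs them on the four training points from the question. The query point [0, 0.2] is my own choice for illustration and is not from the original example. It also shows that plain broadcasting should give the same distances without calling tile at all.

```python
import numpy as np

training_data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
input_vector = np.array([0.0, 0.2])  # hypothetical query point

# tile repeats the query once per training row, giving a (4, 2) matrix
tiled = np.tile(input_vector, (training_data.shape[0], 1))
diff_mat = tiled - training_data             # element-wise differences
sq_distances = (diff_mat ** 2).sum(axis=1)   # one sum of squares per row
distances = sq_distances ** 0.5              # Euclidean distance per row

# Broadcasting stretches the (2,) vector across the (4, 2) matrix implicitly
distances_broadcast = np.sqrt(((input_vector - training_data) ** 2).sum(axis=1))

print(tiled.shape)
print(distances)
print(distances.argsort())  # training rows ordered from nearest to farthest
```

The argsort result is what classify uses to pick the k closest labels.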

Related

Is there a faster way to perform this neighbour finding operation

I'm trying to calculate Moran's I in Python (This is the underlying equation). My inputs are coords, an Nx3 array containing the coordinates of each point, and an Nx3 array z which contains the values minus the overall mean. The operation requires each value of z to be multiplied with the value at every point within a set distance (here set to 1.99). My problem is that in my case N is about 2 million, so the find_neighbours operation is very slow. Is there a way I could speed this up?
import numpy as np

def find_neighbours(coords,idx,k):
    # distance from point idx to every other point: O(N) work per call
    distances = np.sqrt(np.power(coords - coords[idx], 2).sum(axis=1))
    distances[idx] = np.inf
    return np.argwhere(distances<=k)

z = x - np.mean(x)
n = len(coords)
A = 0
B = np.sum([z[idx]**2 for idx,coord in enumerate(coords)])
S_0 = 0
for idx in range(len(coords)):
    neighbours = find_neighbours(coords,idx,1.99)
    S_0 += len(neighbours)
    A += np.sum([(z[neighbour]*z[idx]) for neighbour in neighbours])
I = (n/S_0)*(A/B)
This is a classical problem with plenty of literature on it. It's called radius neighbor search in three-dimensional point clouds. Your loop compares every point with every other point, which is O(N^2). You need to store your points in a spatial data structure so that each search only visits nearby points. I would suggest an octree.
Check python code here and adapt to your case.
For explanations, check this paper.
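As a rough illustration of the idea, here is a sketch that swaps the octree for something simpler: a uniform grid of cubic cells with side equal to the search radius. This is not the octree code referred to above. The function name morans_i_grid is hypothetical, and I am assuming x is a 1-D array with one value per point. Any neighbour within the radius must sit in the same cell or an adjacent one, so each point is only compared with a small candidate set.

```python
from collections import defaultdict
from itertools import product

import numpy as np


def morans_i_grid(coords, x, radius):
    """Moran's I with binary weights: w_ij = 1 if j != i and dist(i, j) <= radius."""
    z = x - x.mean()
    # Bucket points into cubic cells of side `radius`
    keys = np.floor(coords / radius).astype(np.int64)
    cells = defaultdict(list)
    for idx, key in enumerate(map(tuple, keys)):
        cells[key].append(idx)
    # Offsets to the cell itself and its adjacent cells (27 of them in 3-D)
    offsets = list(product((-1, 0, 1), repeat=coords.shape[1]))

    A = 0.0
    S_0 = 0
    for key, members in cells.items():
        # Gather candidate indices from this cell and its neighbours
        candidates = []
        for off in offsets:
            neighbour_key = tuple(k + o for k, o in zip(key, off))
            candidates.extend(cells.get(neighbour_key, ()))
        candidates = np.array(candidates)
        for i in members:
            d2 = ((coords[candidates] - coords[i]) ** 2).sum(axis=1)
            nb = candidates[(d2 <= radius ** 2) & (candidates != i)]
            S_0 += len(nb)
            A += z[i] * z[nb].sum()
    B = (z ** 2).sum()
    return (len(coords) / S_0) * (A / B)


# Synthetic data, only to show the call
rng = np.random.default_rng(0)
coords = rng.uniform(0, 20, size=(1000, 3))
x = rng.normal(size=1000)
print(morans_i_grid(coords, x, 1.99))
```

The inner loop is still plain Python, so treat this as a starting point rather than a tuned solution. If SciPy is available, scipy.spatial.cKDTree with query_ball_point is a ready-made tree structure for the same kind of radius query.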

Trying to code the nearest neighbours algorithm - euclidean distance function only calculates the distances for one row of the test set - why?

I am trying to code the Nearest Neighbours Algorithm from scratch and have come across a problem: my algorithm was only giving the index/classification of the nearest neighbour for one row/point of the training set. I went through every part of my code and realised that the problem is my Euclidean distance function. It only gives the result for one row.
This is the code I have written for Euclidean distance:
def euclidean_dist(r1, r2):
    dist = 0
    for j in range(0, len(r2)-1):
        dist = dist + (r2[j] - r1[j])**2
    return dist**0.5
Within my Nearest Neighbours algorithm this is the implementation of the Euclidean distance function:
for i in range(len(x_test)):
    dist1 = []
    dist2 = []
    for j in range(len(x_train)):
        distances = euclidean_dist(x_test[i], x_train[j,:])
        dist1.append(distances)
        dist2.append(distances)
    dist1 = np.array(dist1)
    sorting(dist1) #separate sorting function to sort the distances from lowest to highest,
    #the aim was to get one array, dist1, with the euclidean distances for each row sorted
    #and one array with the unsorted euclidean distances, dist2, (to be able to search for index later in the code)
I noticed the problem when using the iris dataset and trying out this part of the function with it. I split the data set into testing and training (X_test, X_train and y_test).
When this was implemented with the data set I got the following array for dist2:
[0.3741657386773946,
1.643167672515499,
3.389690251335658,
2.085665361461421,
1.284523257866513,
3.9572717874818752,
0.9539392014169458,
3.5805027579936315,
0.7211102550927979,
...
0.8062257748298555,
0.4242640687119287,
0.5196152422706631]
Its length is 112 which is the same length as X_train, but these are only the Euclidean distances for the first row or point of the X_test set. The dist1 array is the same except it is sorted.
Why am I not getting the Euclidean distances for every row/point of the test set? I thought I iterated through correctly with the for loops, but clearly something is not quite right. Any advice or help would be appreciated.
Using numpy for speed, built-in distance, and code length:
x_test_array = np.array(x_test)
x_train_array = np.array(x_train)
distance_matrix = np.linalg.norm(x_test_array[:,np.newaxis,:]-x_train_array[np.newaxis,:,:], axis=2)
Cell i,j in the matrix corresponds to the distance between x_test[i] and x_train[j], so the matrix has one row per test point.
You can then sort each row.
Edit: How to create the distance matrix without numpy:
matrix = []
for i in range(len(x_test)):
    dist1 = []
    for j in range(len(x_train)):
        distances = euclidean_dist(x_test[i], x_train[j,:])
        dist1.append(distances)
    matrix.append(dist1)
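Here is a short sketch of how the distance matrix could feed the nearest-neighbour lookup. The arrays are random stand-ins for X_train and X_test, so the shapes and values are assumptions, not the iris split from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(size=(6, 4))  # synthetic stand-in for X_train
x_test = rng.normal(size=(3, 4))   # synthetic stand-in for X_test

# (3, 1, 4) - (1, 6, 4) broadcasts to (3, 6, 4); taking the norm over the
# last axis leaves one row of distances per test point.
distance_matrix = np.linalg.norm(
    x_test[:, np.newaxis, :] - x_train[np.newaxis, :, :], axis=2
)

# Training indices sorted by distance for each test row;
# column 0 holds the index of the nearest neighbour.
neighbour_order = np.argsort(distance_matrix, axis=1)
nearest = neighbour_order[:, 0]

print(distance_matrix.shape)
print(nearest)
```

Using argsort keeps the unsorted distances intact, so there is no need for separate sorted and unsorted copies. Note that euclidean_dist in the question loops over range(0, len(r2)-1), which skips the last feature, so its values will differ from these unless the last column is excluded on purpose.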

Computing Mahalanobis Distance Component Wise

I have 60000 vectors of 784 dimensions. This data has 10 classes.
I must evaluate a function that takes out one dimension and computes the distance metric again. This function computes the distance of each vector to its class's mean. In code:
def objectiveFunc(self, X, y, indices):
    subX = np.array([X[:,i] for i in indices]).T
    d = np.zeros((10,1))
    for n in range(10):
        C = subX[np.where(y == n)]
        u = np.mean(C, axis = 0)
        Sinv = pinv(covariance(C))
        d[n] = np.mean(np.apply_along_axis(mahalanobis, axis = 1, arr=C, v=u, VI=Sinv))
where indices are fed in with one index removed during each iteration.
As you can imagine, I am computing a lot of individual components during the computation for Mahalanobis distance. Is there a way for me to store all the 784 component distances?
Alternatively, what's the fastest way to compute Mahalanobis distance?
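First of all, to make it easier to understand, this is the Mahalanobis distance formula:
D_M(x) = sqrt((x - mu)^T S^-1 (x - mu)), where mu is the class mean and S is the class covariance matrix.
So, to compute the Mahalanobis distance for each element according to its class, we can do: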
X_train=X_train.reshape(-1,784)
def mahalanobis(element,classe):
    part=np.where(y_train==classe)[0]
    ave=np.mean(X_train[part])
    distance_example=np.sqrt(((np.mean(X_train[part[[element]]])-ave)**2)/np.var(X_train[part]))
    return distance_example
mahalanobis(20,2)
# Out[91]: 0.13947337027828757
Then you can create a for statement to calculate all distances. For instance, class 0:
[mahalanobis(i,0) for i in range(0,len(X_train[np.where(y_train==0)[0]]))]
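On the speed question, one option is to evaluate the full quadratic form for a whole class in a single einsum call instead of using apply_along_axis row by row. The sketch below assumes the multivariate definition (class mean vector and pseudo-inverse of the class covariance, as in the question's code). The helper name class_mahalanobis and the small random data are hypothetical stand-ins for the 60000 x 784 set.

```python
import numpy as np


def class_mahalanobis(X, y, label):
    """Mahalanobis distance of every sample of one class to that class's mean."""
    C = X[y == label]
    centered = C - C.mean(axis=0)
    # pinv tolerates a singular covariance (e.g. constant features)
    VI = np.linalg.pinv(np.cov(C, rowvar=False))
    # Row-wise quadratic form (x - u)^T VI (x - u) for all rows at once
    sq = np.einsum('ij,jk,ik->i', centered, VI, centered)
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from round-off


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))       # synthetic stand-in for the real data
y = rng.integers(0, 3, size=200)    # synthetic labels
d = class_mahalanobis(X, y, 1)
print(d.mean())
```

This does not cache per-dimension components. Because the inverse covariance couples the dimensions, removing one feature generally changes the whole inverse, so I would not expect simple per-component storage to be reusable without further work.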

Optimization function for the sum Google Maps distance

I am trying to find a point (latitude/longitude) that minimizes the sum of Google maps distance to all other N points.
I was able to extract the Google Maps distances between my latitude and longitude arrays but I wasn't able to minimize my function.
Code
def minimize_g(input_g):
    gmaps1 = googlemaps.Client(key="xxx")
    def distance_f(x):
        dist = gmaps1.distance_matrix([x], np.array(input_g)[:,1:3])
        sum_ = 0
        for obs in range(len(np.array(df[:3]))):
            sum_+= dist['rows'][0]['elements'][obs]['distance']['value']
        return sum_
    #initial guess: centroid
    centroid = input_g.mean(axis=0)
    optimization = minimize(distance_f, centroid, method='COBYLA')
    return optimization.x
Thanks!
If you are looking for any point on the map that results in the shortest total distance to all coordinates in your list, you can start by writing a function that calculates the distance from one coordinate to another. Once you have that function, it's a matter of calculating the total distance from a test point to all your points.
Then, from some artificially created coordinates, you would minimize the distances to all your points with something along the lines of
import numpy as np

lats = [12.3, 12.4, 12.5]
lons = [16.1, 15.1, 14.1]

def total_distance_to_lats_and_lons(lat, lon):
    # some summation over distances from lat, lon to lats, lons
    raise NotImplementedError  # placeholder: fill in your own distance sum

# create two lists with 0.01 degree precision as an artificial grid of possibilities
test_lats = np.arange(min(lats), max(lats), 0.01)
test_lons = np.arange(min(lons), max(lons), 0.01)
test_distances = []  # empty list to fill with the total_distance to each combination of test_lat, test_lon
coordinate_combinations = []  # corresponding coordinates
for test_lat in test_lats:
    for test_lon in test_lons:
        coordinate_combinations.append([test_lat, test_lon])  # add a combination of coordinates
        test_distances.append(total_distance_to_lats_and_lons(test_lat, test_lon))  # add a distance
index_of_best_test_coordinate = np.argmin(test_distances)  # find index of the minimum value
print('Best match is index {}'.format(index_of_best_test_coordinate))
print('Coordinates: {}'.format(coordinate_combinations[index_of_best_test_coordinate]))
print('Total distance: {}'.format(test_distances[index_of_best_test_coordinate]))
This brute force approach has some precision limitations and becomes an expensive loop quite quickly, so you can also apply this method iteratively with the minimum found after each round, so iteratively increasing precision and decreasing start and end points in the test coordinate lists. After a few iterations, you should have a pretty precise estimate. On the other hand, it is possible such an iterative method converges to one of multiple local minima, yielding only one of multiple solutions.
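The iterative refinement could look roughly like the sketch below. To keep it runnable without an API key, it uses the great-circle (haversine) distance as a stand-in for the Google Maps road distance, which is an assumption on my part; swap in your own lookup for real use. The names total_distance and refine, and the rounds and steps defaults, are hypothetical.

```python
import numpy as np

lats = np.array([12.3, 12.4, 12.5])
lons = np.array([16.1, 15.1, 14.1])


def total_distance(lat, lon):
    """Sum of haversine distances in km from (lat, lon) to all points.

    Stand-in for a road-distance lookup such as the Google Maps API.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat, lon, lats, lons))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return float(np.sum(2 * 6371.0 * np.arcsin(np.sqrt(a))))


def refine(lat_lo, lat_hi, lon_lo, lon_hi, rounds=5, steps=11):
    """Grid search that zooms in around the best grid point each round."""
    best = None
    for _ in range(rounds):
        test_lats = np.linspace(lat_lo, lat_hi, steps)
        test_lons = np.linspace(lon_lo, lon_hi, steps)
        candidates = [(total_distance(la, lo), la, lo)
                      for la in test_lats for lo in test_lons]
        best = min(candidates)  # tuple with the smallest total distance
        # Shrink the window to one grid spacing either side of the best point
        dlat = (lat_hi - lat_lo) / (steps - 1)
        dlon = (lon_hi - lon_lo) / (steps - 1)
        lat_lo, lat_hi = best[1] - dlat, best[1] + dlat
        lon_lo, lon_hi = best[2] - dlon, best[2] + dlon
    return best


best_total, best_lat, best_lon = refine(lats.min(), lats.max(), lons.min(), lons.max())
print('Coordinates: {}, {}'.format(best_lat, best_lon))
print('Total distance: {}'.format(best_total))
```

Each round costs steps * steps distance evaluations, which matters if every evaluation is a billed API call, so the grid size is worth keeping small.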

Numpy stating "invalid value" while calculating normalized Mahalanobis distance

Note:
This is for a homework assignment in my data mining class.
I'm going to put relevant code snippets on this SO post, but you can find my entire program at http://pastebin.com/CzNFbLJ2
The dataset I'm using for this program can be found at http://archive.ics.uci.edu/ml/datasets/Iris
So I'm getting: RuntimeWarning: invalid value encountered in sqrt
return np.sqrt(m)
I am attempting to find the average Mahalanobis distance of the given iris dataset (for both raw and normalized datasets). The error is only happening on the normalized version of the dataset which is making me wonder if I have incorrectly understood what normalization means (both in code and mathematically).
I thought that normalization means that each component of a vector is divided by its vector length (causing the vector to add up to 1). I found this SO question, "How to normalize a 2-dimensional numpy array in python less verbose?", and thought it matched my concept of normalization. But now my code is reporting that the Mahalanobis distance over the normalized dataset is NaN.
def mahalanobis(data):
    import numpy as np
    import scipy.spatial.distance
    avg = 0
    count = 0
    covar = np.cov(data, rowvar=0)
    invcovar = np.linalg.inv(covar)
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if(j == len(data)):
                break
            avg += scipy.spatial.distance.mahalanobis(data[i], data[j], invcovar)
            count += 1
    return avg / count

def normalize(data):
    import numpy as np
    row_sums = data.sum(axis=1)
    norm_data = np.zeros((50, 4))
    for i, (row, row_sum) in enumerate(zip(data, row_sums)):
        norm_data[i,:] = row / row_sum
    return norm_data
Probably too late, but check out pages 64-65 in our textbook "Introduction to Data Mining". There's a section called "Normalization or Standardization", which explains the concept of normalized data that Hearne is looking for.
Basically, the standardized data set is x' = (x - mean(x)) / standardDeviation(x), computed separately for each attribute (column).
Since I see you're using python, here's how to do it using NumPy:
normalizedData = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
Source: http://mail.scipy.org/pipermail/numpy-discussion/2011-April/056023.html
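A quick way to check the standardization is to confirm that every column ends up with mean 0 and standard deviation 1. The data below is random and only loosely shaped like a 50 x 4 block of iris measurements; the means and spreads are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a 50 x 4 block of measurements
data = rng.normal(loc=[5.0, 3.4, 1.5, 0.2],
                  scale=[0.35, 0.38, 0.17, 0.10],
                  size=(50, 4))

# Column-wise z-score: subtract each column's mean, divide by its sample std
standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

print(standardized.mean(axis=0))         # each entry should be close to 0
print(standardized.std(axis=0, ddof=1))  # each entry should be close to 1
```

Note that axis=0 works per column (per attribute), whereas the normalize function in the question divides each row by its own sum.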
You can use pdist() to do the distance calculation without for loop:
from sklearn import datasets
iris = datasets.load_iris()
from scipy.spatial.distance import pdist, squareform
print(squareform(pdist(iris.data, 'mahalanobis')))
Normalization in this context probably does mean subtracting the mean and scaling so the data has a unit covariance matrix.
However, to scale every vector in your dataset to unit norm use: norm_data=data/np.sqrt(np.sum(data*data,1))[:,None].
You need to divide by the L2 norm of each vector, which means squaring the value of each element, then taking the square root of the sum. Broadcasting allows you to avoid explicitly coding the loop (see the answer to the question you cited: https://stackoverflow.com/a/8904762/1149913).
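As for where the NaN may come from, my reading (not confirmed by the asker) is that dividing each row by its row sum forces every row to add up to 1. The columns are then linearly dependent, so the covariance matrix is singular and np.linalg.inv cannot return a meaningful inverse; the resulting quadratic form can come out negative, and sqrt of a negative number gives NaN. The sketch below illustrates the singular covariance on random positive data, which is a stand-in for the iris values.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0.1, 8.0, size=(50, 4))  # synthetic positive measurements

# Row-sum normalization as in the question: every row now sums to 1
row_normalized = data / data.sum(axis=1, keepdims=True)
covar = np.cov(row_normalized, rowvar=False)

# The all-ones vector lies in the null space of the covariance matrix,
# so the 4 x 4 matrix only has rank 3 and has no true inverse.
rank = np.linalg.matrix_rank(covar, tol=1e-10)
null_check = covar @ np.ones(4)

print(rank)
print(null_check)  # entries are zero up to round-off
```

Column-wise standardization does not introduce this dependency, which may be why the raw dataset works while the row-normalized one does not.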
