I want to calculate the nearest cosine neighbors of a vector from the rows of a matrix, and have been testing the performance of a few Python functions for doing this.
import numpy as np
import scipy.spatial.distance

def cos_loop_spatial(matrix, vector):
    """
    Calculating pairwise cosine distance using a common for loop with the scipy cosine function.
    """
    neighbors = []
    for row in range(matrix.shape[0]):
        neighbors.append(scipy.spatial.distance.cosine(vector, matrix[row,:]))
    return neighbors
def cos_loop(matrix, vector):
    """
    Calculating pairwise cosine similarity (1 - cosine distance) using a common for loop with manually calculated cosine value.
    """
    neighbors = []
    for row in range(matrix.shape[0]):
        vector_norm = np.linalg.norm(vector)
        row_norm = np.linalg.norm(matrix[row,:])
        cos_val = vector.dot(matrix[row,:]) / (vector_norm * row_norm)
        neighbors.append(cos_val)
    return neighbors
def cos_matrix_multiplication(matrix, vector):
    """
    Calculating pairwise cosine similarity (1 - cosine distance) using matrix vector multiplication.
    """
    dotted = matrix.dot(vector)
    matrix_norms = np.linalg.norm(matrix, axis=1)
    vector_norm = np.linalg.norm(vector)
    matrix_vector_norms = np.multiply(matrix_norms, vector_norm)
    neighbors = np.divide(dotted, matrix_vector_norms)
    return neighbors
cos_functions = [cos_loop_spatial, cos_loop, cos_matrix_multiplication]

# Test performance and plot the best results of each function
# (%timeit is an IPython magic, so this part only runs in IPython/Jupyter)
import pandas as pd

mat = np.random.randn(1000,1000)
vec = np.random.randn(1000)
cos_performance = {}
for func in cos_functions:
    func_performance = %timeit -o func(mat, vec)
    cos_performance[func.__name__] = func_performance.best
pd.Series(cos_performance).plot(kind='bar')
The cos_matrix_multiplication function is clearly the fastest of these, but I'm wondering if you have suggestions of further efficiency improvements for matrix vector cosine distance calculations.
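Use scipy.spatial.distance.cdist(mat, vec[np.newaxis,:], metric='cosine'). It computes the pairwise distance between every pair of vectors drawn from two collections, where each collection is given as the rows of one of the two input matrices. Here the second collection has a single row, so the result has shape (number of rows in mat, 1).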
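A minimal sketch of the cdist call, assuming smaller random stand-in data than the benchmark above. It checks that cdist returns the cosine distance, which is 1 minus the similarity computed by the matrix-multiplication version. The names cos_dist, cos_sim and nearest are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Stand-in data (assumption): 200 rows of dimension 50
rng = np.random.default_rng(0)
mat = rng.standard_normal((200, 50))
vec = rng.standard_normal(50)

# cdist expects two 2-D arrays, so the vector becomes a 1 x 50 matrix
cos_dist = cdist(mat, vec[np.newaxis, :], metric='cosine').ravel()

# Manual cosine similarity for comparison
cos_sim = mat.dot(vec) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec))

# Indices of the five rows with the smallest cosine distance to vec
nearest = np.argsort(cos_dist)[:5]
```

Whether this beats the matrix-multiplication version depends on the data size, so it is worth timing both on the real matrix.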
Related
I am trying to code the Nearest Neighbours Algorithm from scratch and have come across a problem: my algorithm was only giving the index/classification of the nearest neighbour for one row/point of the training set. I went through every part of my code and realised that the problem is my Euclidean distance function. It only gives the result for one row.
This is the code I have written for Euclidean distance:
def euclidean_dist(r1, r2):
    dist = 0
    for j in range(0, len(r2)-1):
        dist = dist + (r2[j] - r1[j])**2
    return dist**0.5
Within my Nearest Neighbours algorithm this is the implementation of the Euclidean distance function:
for i in range(len(x_test)):
    dist1 = []
    dist2 = []
    for j in range(len(x_train)):
        distances = euclidean_dist(x_test[i], x_train[j,:])
        dist1.append(distances)
        dist2.append(distances)
    dist1 = np.array(dist1)
    sorting(dist1) #separate sorting function to sort the distances from lowest to highest,
    #the aim was to get one array, dist1, with the euclidean distances for each row sorted
    #and one array with the unsorted euclidean distances, dist2, (to be able to search for index later in the code)
I noticed the problem when using the iris dataset and trying out this part of the function with it. I split the data set into testing and training (X_test, X_train and y_test).
When this was implemented with the data set I got the following array for dist2:
[0.3741657386773946,
1.643167672515499,
3.389690251335658,
2.085665361461421,
1.284523257866513,
3.9572717874818752,
0.9539392014169458,
3.5805027579936315,
0.7211102550927979,
...
0.8062257748298555,
0.4242640687119287,
0.5196152422706631]
Its length is 112 which is the same length as X_train, but these are only the Euclidean distances for the first row or point of the X_test set. The dist1 array is the same except it is sorted.
Why am I not getting the Euclidean distances for every row/point of the test set? I thought I iterated through correctly with the for loops, but clearly something is not quite right. Any advice or help would be appreciated.
Using numpy for speed, built-in distance, and code length:
x_test_array = np.array(x_test)
x_train_array = np.array(x_train)
distance_matrix = np.linalg.norm(x_test_array[:,np.newaxis,:]-x_train_array[np.newaxis,:,:], axis=2)
Cell i,j in the matrix corresponds to the distance between x_test[i] and x_train[j].
You can then do sorting.
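As a sketch of the sorting step, assuming random stand-in arrays with the same row counts as an iris split (112 training rows, 38 test rows, 4 features), np.argsort along axis 1 gives, for every test row, the training indices ordered from nearest to farthest. The names order, sorted_distances and nearest_index are illustrative.

```python
import numpy as np

# Stand-in data (assumption): shapes mimic a 112/38 iris split
rng = np.random.default_rng(0)
x_train_array = rng.normal(size=(112, 4))
x_test_array = rng.normal(size=(38, 4))

# distance_matrix[i, j] = distance between test row i and training row j
distance_matrix = np.linalg.norm(
    x_test_array[:, np.newaxis, :] - x_train_array[np.newaxis, :, :], axis=2
)

# Training indices sorted by distance, one row per test point
order = np.argsort(distance_matrix, axis=1)
# The distances themselves in the same sorted order
sorted_distances = np.take_along_axis(distance_matrix, order, axis=1)
# Index of the single nearest training row for each test point
nearest_index = order[:, 0]
```

Keeping order next to the unsorted distance_matrix removes the need for a second list just to look up indices later.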
Edit: How to create the distance matrix without numpy:
matrix = []
for i in range(len(x_test)):
    dist1 = []
    for j in range(len(x_train)):
        distances = euclidean_dist(x_test[i], x_train[j,:])
        dist1.append(distances)
    matrix.append(dist1)
I am implementing an algorithm for k-means clustering. So far it works using Euclidean distances. Switching out Euclidean distances for Mahalanobis distances fails to cluster correctly.
For some reason, the Mahalanobis distance is negative at times. Turns out the covariance matrix has negative eigenvalues, which apparently is not good for covariance matrices.
Here are the functions I'm using:
import numpy as np

#takes in data point x, centroid m, covariance matrix sigma
def mahalanobis(x, m, sigma):
    return np.dot(np.dot(np.transpose(x - m), np.linalg.inv(sigma)), x - m)

#takes in centroid m and data (iris in 2d, dimensions: 2x150)
def covar_matrix(m, data):
    d, n = data.shape
    R = np.zeros((d,d))
    for i in range(n):
        R += np.dot(data[:,i:i+1] , np.transpose(data[:,i:i+1]))
    R /= n
    return R - np.dot(m, np.transpose(m))
    #autocorrelation_matrix - centroid*centroid'
How I implemented the algorithm:
1. Set k
2. Randomly choose k centroids
3. Calculate covar_matrix() of each centroid
4. Calculate mahalanobis() of each data point to each centroid and add to closest cluster
5. Start looking for new centroids; for each data point* in each cluster, calculate the sum of mahalanobis() to every other point in the cluster; point with minimum sum becomes new centroid
6. Repeat 3-5 until old centroid and new centroids are the same
*Calculate covar_matrix() with this point
I expect a positive Mahalanobis distance and a positive definite covariance matrix (the latter will fix the former I hope).
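One possible explanation, offered as an assumption rather than a confirmed diagnosis: the expression autocorrelation_matrix - m*m' equals a covariance matrix only when m is the mean of the data passed in. With a randomly chosen centroid or a single data point as m, the result is not guaranteed to be positive semi-definite. A minimal sketch of an alternative is to average the outer products of (x - m) directly, which is positive semi-definite for any m. The helper names scatter_about and mahalanobis_sq and the random stand-in data are hypothetical.

```python
import numpy as np

def scatter_about(m, data):
    """Average outer product of (x - m) over the columns of data (d x n)."""
    centered = data - m.reshape(-1, 1)
    return centered @ centered.T / data.shape[1]

def mahalanobis_sq(x, m, sigma):
    """Squared Mahalanobis distance of x from m under sigma."""
    diff = x - m
    # solve() avoids forming the explicit inverse of sigma
    return diff @ np.linalg.solve(sigma, diff)

# Stand-in data (assumption): 2 x 150, same layout as the question
rng = np.random.default_rng(0)
data = rng.normal(size=(2, 150)) + np.array([[5.0], [3.0]])
m = data[:, 0]  # a data point used as centroid, not the mean

sigma = scatter_about(m, data)
eigenvalues = np.linalg.eigvalsh(sigma)
distances = np.array(
    [mahalanobis_sq(data[:, i], m, sigma) for i in range(data.shape[1])]
)
```

With this construction the eigenvalues are non-negative by design, so the squared distances cannot go negative as long as sigma is invertible.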
I have 60000 vectors of 784 dimensions. This data has 10 classes.
I must evaluate a function that takes out one dimension and computes the distance metric again. This function computes the distance of each vector to its class's mean. In code:
def objectiveFunc(self, X, y, indices):
    subX = np.array([X[:,i] for i in indices]).T
    d = np.zeros((10,1))
    for n in range(10):
        C = subX[np.where(y == n)]
        u = np.mean(C, axis = 0)
        Sinv = pinv(covariance(C))
        d[n] = np.mean(np.apply_along_axis(mahalanobis, axis = 1, arr=C, v=u, VI=Sinv))
where indices are fed in with one index removed during each iteration.
As you can imagine, I am computing a lot of individual components during the computation for Mahalanobis distance. Is there a way for me to store all the 784 component distances?
Alternatively, what's the fastest way to compute Mahalanobis distance?
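First of all and to make it easier to understand, this is the Mahalanobis Distance formula:
$D_M(x) = \sqrt{(x - \mu)^{T} S^{-1} (x - \mu)}$, where $\mu$ is the class mean and $S$ is the class covariance matrix. In one dimension this reduces to $|x - \mu| / \sigma$.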
So, to compute the mahalanobis distance for each element according to its class, we can do:
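X_train=X_train.reshape(-1,784)
def mahalanobis(element,classe):
    # indices of the training samples that belong to class `classe`
    part=np.where(y_train==classe)[0]
    # scalar mean over all pixels of all samples in the class
    ave=np.mean(X_train[part])
    # one-dimensional form: |mean of the sample - class mean| / class standard deviation
    distance_example=np.sqrt(((np.mean(X_train[part[[element]]])-ave)**2)/np.var(X_train[part]))
    return distance_example
mahalanobis(20,2)
# Out[91]: 0.13947337027828757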
Then you can create a for statement to calculate all distances. For instance, class 0:
[mahalanobis(i,0) for i in range(0,len(X_train[np.where(y_train==0)[0]]))]
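For the full multivariate form, one commonly used way to avoid a Python-level loop over rows is a single np.einsum call over the centred data. The sketch below assumes random stand-in data for one class and uses np.cov with np.linalg.pinv in place of the question's covariance and pinv helpers, whose exact origin is not shown. The function name mahalanobis_to_mean is hypothetical.

```python
import numpy as np

def mahalanobis_to_mean(C):
    """Mahalanobis distance of every row of C to the mean of C."""
    u = C.mean(axis=0)
    # Pseudo-inverse tolerates a singular covariance (e.g. constant features)
    Sinv = np.linalg.pinv(np.cov(C, rowvar=False))
    diff = C - u
    # sq[i] = diff[i] @ Sinv @ diff[i], computed for all rows at once
    sq = np.einsum('ij,jk,ik->i', diff, Sinv, diff)
    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.maximum(sq, 0.0))

# Stand-in data (assumption): 200 samples of one class, 5 features
rng = np.random.default_rng(0)
C = rng.normal(size=(200, 5))
d = mahalanobis_to_mean(C)
class_score = d.mean()  # analogous to d[n] in the question
```

The pseudo-inverse still has to be recomputed whenever a feature is removed, so this only speeds up the per-row part of the computation.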
I am trying to find a point (latitude/longitude) that minimizes the sum of Google maps distance to all other N points.
I was able to extract the Google Maps distances between my latitude and longitude arrays but I wasn't able to minimize my function.
Code
import numpy as np
import googlemaps
from scipy.optimize import minimize

def minimize_g(input_g):
    gmaps1 = googlemaps.Client(key="xxx")

    def distance_f(x):
        dist = gmaps1.distance_matrix([x], np.array(input_g)[:,1:3])
        sum_ = 0
        for obs in range(len(np.array(df[:3]))):
            sum_+= dist['rows'][0]['elements'][obs]['distance']['value']
        return sum_

    #initial guess: centroid
    centroid = input_g.mean(axis=0)
    optimization = minimize(distance_f, centroid, method='COBYLA')
    return optimization.x
Thanks!
If you are looking for any point on the map that results in the shortest total distance to all coordinates in your list, you can try writing a function that calculates the distance from one coordinate to another coordinate. Once you have that function ready to go, it's a matter of calculating the total distance to all your points from a test point.
Then, from some artificially created coordinates, you would minimize the distances to all your points with something along the lines of
import numpy as np

lats = [12.3, 12.4, 12.5]
lons = [16.1, 15.1, 14.1]

def total_distance_to_lats_and_lons(lat, lon):
    # some summation over distances from lat, lon to lats, lons
    raise NotImplementedError

# create two lists with 0.01 degree precision as an artificial grid of possibilities
test_lats = np.arange(min(lats), max(lats), 0.01)
test_lons = np.arange(min(lons), max(lons), 0.01)

test_distances = []  # empty list to fill with the total_distance to each combination of test_lat, test_lon
coordinate_combinations = []  # corresponding coordinates

for test_lat in test_lats:
    for test_lon in test_lons:
        coordinate_combinations.append([test_lat, test_lon])  # add a combination of coordinates
        test_distances.append(total_distance_to_lats_and_lons(test_lat, test_lon))  # add a distance

index_of_best_test_coordinate = np.argmin(test_distances)  # find index of the minimum value

print('Best match is index {}'.format(index_of_best_test_coordinate))
print('Coordinates: {}'.format(coordinate_combinations[index_of_best_test_coordinate]))
print('Total distance: {}'.format(test_distances[index_of_best_test_coordinate]))
This brute force approach has limited precision and quickly becomes an expensive loop. You can therefore apply it iteratively: after each round, centre a narrower range of test coordinates on the minimum found and use a finer step, so the precision increases while the search window shrinks. After a few iterations, you should have a fairly precise estimate. On the other hand, such an iterative method may converge to one of several local minima, yielding only one of multiple solutions.
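A rough sketch of the iterative refinement, under the assumption that great-circle (haversine) distance stands in for the Google Maps distance the question actually wants. Road distances would need the API call instead. The functions total_haversine_km and refine_grid, the Earth radius of 6371 km, and the step and round counts are all illustrative choices.

```python
import numpy as np

lats = np.array([12.3, 12.4, 12.5])
lons = np.array([16.1, 15.1, 14.1])

def total_haversine_km(lat, lon, lats=lats, lons=lons, radius_km=6371.0):
    """Sum of great-circle distances from (lat, lon) to all target points."""
    phi1, phi2 = np.radians(lat), np.radians(lats)
    dphi = phi2 - phi1
    dlam = np.radians(lons - lon)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2
    return float(np.sum(2 * radius_km * np.arcsin(np.sqrt(a))))

def refine_grid(lat_range, lon_range, steps=21, rounds=4):
    """Grid search that re-centres a smaller grid on the best point each round."""
    (lat_lo, lat_hi), (lon_lo, lon_hi) = lat_range, lon_range
    for _ in range(rounds):
        test_lats = np.linspace(lat_lo, lat_hi, steps)
        test_lons = np.linspace(lon_lo, lon_hi, steps)
        totals = np.array([[total_haversine_km(la, lo) for lo in test_lons]
                           for la in test_lats])
        i, j = np.unravel_index(np.argmin(totals), totals.shape)
        best_lat, best_lon, best_total = test_lats[i], test_lons[j], totals[i, j]
        # Next window: one grid cell on either side of the current best point
        lat_step = test_lats[1] - test_lats[0]
        lon_step = test_lons[1] - test_lons[0]
        lat_lo, lat_hi = best_lat - lat_step, best_lat + lat_step
        lon_lo, lon_hi = best_lon - lon_step, best_lon + lon_step
    return best_lat, best_lon, best_total

best_lat, best_lon, best_total = refine_grid(
    (lats.min(), lats.max()), (lons.min(), lons.max())
)
```

Each round costs steps * steps distance evaluations, which matters if every evaluation is a paid API request.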
I'm trying to follow an example on k-Nearest Neighbors and I'm not sure about the numpy command syntax. I'm supposed to be doing a matrix-wise distance calculation and the code given is
from numpy import array, tile
import operator

def classify(inputVector, trainingData,labels,k):
    dataSetSize=trainingData.shape[0]
    diffMat=tile(inputVector,(dataSetSize,1))-trainingData
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}  # vote count per label
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # items() works on Python 2 and 3; the original used the Python 2 only iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels
My question is: how does sqDistances**0.5 amount to the distance equation ((A[0]-B[0])^2+(A[1]-B[1])^2)^(1/2)? I don't follow how tile influences it, specifically how the matrix is made from tile(inputVector,(dataSetSize,1)) - trainingData.
I hope the following will explain the working.
Numpy tile : https://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html
Using this function, you create a matrix that repeats the input vector once per row, so it has the same shape as the training data. Subtracting the training data from this matrix gives the element-wise differences, for example test[0] - train[0] and test[1] - train[1] in each row.
Each element is then squared with diffMat**2, and the squares are summed along axis = 1 (https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html). For each row this gives (test[0] - train[0])^2 + (test[1] - train[1])^2.
Finally, sqDistances**0.5 takes the square root of each of these sums, which is the Euclidean distance.
To calculate Euclidean distance, this might be helpful
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean
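The steps can be traced on the data from createDataSet. This is a small sketch that assumes an input vector of [0, 0], which is not part of the question, and uses snake_case names of its own.

```python
import numpy as np

training_data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
input_vector = np.array([0.0, 0.0])  # assumed query point

# tile repeats the input vector once per training row: shape (4, 2)
tiled = np.tile(input_vector, (training_data.shape[0], 1))

# Element-wise differences, then squares, then row sums, then square roots
diff_mat = tiled - training_data
sq_distances = (diff_mat ** 2).sum(axis=1)
distances = sq_distances ** 0.5

# Broadcasting gives the same result without building the tiled matrix
broadcast_distances = np.sqrt(((input_vector - training_data) ** 2).sum(axis=1))

# Training rows ordered from nearest to farthest
nearest_order = distances.argsort()
```

Here distances holds one Euclidean distance per training row, and nearest_order puts the two 'B' rows first because they lie closest to the origin.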