I have 60000 vectors of 784 dimensions. This data has 10 classes.
I must evaluate a function that takes out one dimension and computes the distance metric again. This function is computing the distance of each vector to it's classes' mean. In code:
def objectiveFunc(self, X, y, indices):
subX = np.array([X[:,i] for i in indices]).T
d = np.zeros((10,1))
for n in range(10):
C = subX[np.where(y == n)]
u = np.mean(C, axis = 0)
Sinv = pinv(covariance(C))
d[n] = np.mean(np.apply_along_axis(mahalanobis, axis = 1, arr=C, v=u, VI=Sinv))
where indices are fed in with one index removed during each iteration.
As you can imagine, I am computing a lot of individual components during the computation for Mahalanobis distance. Is there a way for me to store all the 784 component distances?
Alternatively, what's the fastest way to compute Mahalanobis distance?
First of all and to make it easier to understand, this is the Mahalanobis Distance formula:
So, to compute the mahalanobis distance for each element according to its class, we can do:
X_train=X_train.reshape(-1,784)
def mahalanobis(element,classe):
part=np.where(y_train==classe)[0]
ave=np.mean(X_train[part])
distance_example=np.sqrt(((np.mean(X_train[part[[element]]])-ave)**2)/np.var(X_train[part]))
return distance_example
mahalanobis(20,2)
# Out[91]: 0.13947337027828757
Then you can create a for statement to calculate all distances. For instance, class 0:
[mahalanobis(i,0) for i in range(0,len(X_train[np.where(y_train==0)[0]]))]
Related
I'm working on an assignment where I am tasked to implement PCA in Python for an online course. Unfortunately, when I try to run a comparison (provided by the course) between my implementation and SKLearn's, my results appear to differ too greatly.
After many hours of review, I am still unsure where it is going wrong. If someone could take a look and determine what step I have coded or interpreted incorrectly, I would greatly appreciate it.
def normalize(X):
"""
Normalize the given dataset X to have zero mean.
Args:
X: ndarray, dataset of shape (N,D)
Returns:
(Xbar, mean): tuple of ndarray, Xbar is the normalized dataset
with mean 0; mean is the sample mean of the dataset.
Note:
You will encounter dimensions where the standard deviation is zero.
For those ones, the process of normalization results in normalized data with NaN entries.
We can handle this by setting the std = 1 for those dimensions when doing normalization.
"""
# YOUR CODE HERE
### Uncomment and modify the code below
mu = np.mean(X, axis = 0) # Setting axis = 0 will compute means column-wise. Setting it to 1 will compute the mean across rows.
std = np.std(X, axis = 0) # Computing the std dev column wise using axis = 0.
std_filled = std.copy()
std_filled[std == 0] = 1
# Compute the normalized data as Xbar
Xbar = (X - mu)/std_filled
return Xbar, mu, # std_filled
def eig(S):
"""
Compute the eigenvalues and corresponding unit eigenvectors for the covariance matrix S.
Args:
S: ndarray, covariance matrix
Returns:
(eigvals, eigvecs): ndarray, the eigenvalues and eigenvectors
Note:
the eigenvals and eigenvecs should be sorted in descending
order of the eigen values
"""
# YOUR CODE HERE
# Uncomment and modify the code below
# Compute the eigenvalues and eigenvectors
# You can use library routines in `np.linalg.*` https://numpy.org/doc/stable/reference/routines.linalg.html for this
eigvals, eigvecs = np.linalg.eig(S)
# The eigenvalues and eigenvectors need to be sorted in descending order according to the eigenvalues
# We will use `np.argsort` (https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html) to find a permutation of the indices
# of eigvals that will sort eigvals in ascending order and then find the descending order via [::-1], which reverse the indices
sort_indices = np.argsort(eigvals)[::-1]
# Notice that we are sorting the columns (not rows) of eigvecs since the columns represent the eigenvectors.
return eigvals[sort_indices], eigvecs[:, sort_indices]
def projection_matrix(B):
"""Compute the projection matrix onto the space spanned by the columns of `B`
Args:
B: ndarray of dimension (D, M), the basis for the subspace
Returns:
P: the projection matrix
"""
# YOUR CODE HERE
P = B # (np.linalg.inv(B.T # B)) # B.T
return P
def select_components(eig_vals, eig_vecs, num_components):
"""
Selects the n components desired for projecting the data upon.
Args:
eig_vals: The eigenvalues sorted in descending order of magnitude.
eig_vecs: The eigenvectors sorted in order relative to that of the eigenvalues.
num_components: the number of principal components to use.
Returns:
The number of desired components to keep for projection of the data upon.
"""
principal_vals, principal_components = eig_vals[:num_components], eig_vecs[:, range(num_components)]
return principal_vals, principal_components
def PCA(X, num_components):
"""
Projects normalized data onto the 'n' desired principal components.
Args:
X: ndarray of size (N, D), where D is the dimension of the data,
and N is the number of datapoints
num_components: the number of principal components to use.
Returns:
the reconstructed data, the sample mean of the X, principal values
and principal components
"""
# Normalize to have mean 0 and variance 1.
Z, mean_vec = normalize(X)
# Calculate the covariance matrix
S = np.cov(Z, rowvar=False, bias=True) # Set rowvar = False to treat columns as variables. Set bias = True to ensure normalization is done with N and not N-1
# Calculate the (unit) eigenvectors and eigenvalues of S. Sort them in descending order of importance relative to the magnitude of the eigenvalues.
eig_vals, eig_vecs = eig(S)
# Keep only the n largest Principle Components of the sorted unit eigenvectors.
principal_vals, principal_components = select_components(eig_vals, eig_vecs, num_components)
# Compute the projection matrix using only the n largest Principle Components of the sorted unit eigenvectors, where n = num_components.
#P = projection_matrix(eig_vecs[:, :num_components])
P = projection_matrix(principal_components)
# Reconstruct the data by using the projection matrix to project the data onto the principal component vectors we've kept
X_reconst = (P # X.T).T
return X_reconst, mean_vec, principal_vals, principal_components
And here is the test case I'm supposed to pass:
random = np.random.RandomState(0)
X = random.randn(10, 5)
from sklearn.decomposition import PCA as SKPCA
for num_component in range(1, 4):
# We can compute a standard solution given by scikit-learn's implementation of PCA
pca = SKPCA(n_components=num_component, svd_solver="full")
sklearn_reconst = pca.inverse_transform(pca.fit_transform(X))
reconst, _, _, _ = PCA(X, num_component)
# The difference in the result should be very small (<10^-20)
print(
"difference in reconstruction for num_components = {}: {}".format(
num_component, np.square(reconst - sklearn_reconst).sum()
)
)
np.testing.assert_allclose(reconst, sklearn_reconst)
As far as I can tell, there are a few things wrong with your code.
Your projection matrix is wrong.
If the eigenvectors of your covariance matrix is B with dimension D x M where M is the number of components you select and D is the dimension of the original data, then the projection matrix is just B # B.T.
In standard implementation of PCA, we typically do not scale the data by the inverse of the standard deviation. You seem to be trying to do an approximation of a whitened PCA (ZCA), but even then it looks wrong.
As a quick test, you can compute the normalized data without dividing by the standard deviation, and when you compute the covariance matrix, set bias=False.
You should also subtract the mean from the data before multiplying it by the projection operator, and adding it back after that, i.e.,
X_reconst = (P # (X - mean_vec).T).T + mean_vec.
PCA essentially is just a change of basis, followed by discarding coordinates corresponding to directions with low variance. The eigenvectors of the covariance matrix corresponds to the new orthogonal basis, and the eigenvalues tells you the variance of the data along the direction of the corresponding eigenvectors. P = B # B.T is just the change of basis followed to the new basis (and discarding some coordinates), B, followed by a change back to the original basis.
Edit
I'm curious to know which online course teaches people to implement PCA this way.
I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).
The issue:
eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;
but
sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1 and I want DBSCAN to consider two points to be neighbours if the distance between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.
I see two possible solutions:
pass a range of values to the eps parameter of DBSCAN e.g. eps=[0.75,1]
Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarities matrix that is spit out by sklearn.metrics.pairwise.cosine_similarity
I do not know how to implement either of these.
Any guidance would be appreciated!
DBSCAN has a metric keyword argument. Docstring:
metric : string, or callable
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.calculate_distance for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
So probably the easiest thing to do is to precompute a distance matrix using cosine similarity as your distance metric, preprocess the distance matrix such that it fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CD) -1), where CD is your cosine distance matrix), and then set metric to precomputed, and pass the precomputed distance matrix D in for X, i.e. the data.
For example:
#!/usr/bin/env python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN
total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)
cosine_distance = cosine_similarity(points)
# option 1) vectors are close to each other if they are parallel
bespoke_distance = np.abs(np.abs(cosine_distance) -1)
# option 2) vectors are close to each other if they point in the same direction
bespoke_distance = np.abs(cosine_distance - 1)
results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)
A) check out Generalized DBSCAN which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.
B) you can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).
C) you supposedly can even pass -cosinesimilarity as precomputed distance matrix and use -0.75 as eps.
d) just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 of the cosine similarity is larger than your threshold, and 0 otherwise. Then use DBSCAN with eps=0.5. it is trivial to show that distance < eps if and only if similarity > threshold.
A few options:
dist = np.abs(cos_sim - 1) accepted answer here
dist = np.arccos(cos_sim) / np.pi https://math.stackexchange.com/a/3385463/816178
dist = 1 - (sim + 1) / 2 https://math.stackexchange.com/q/3241174/816178
I've found they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit the snag too). As I understand #2 is the more mathematically-correct approach; preserving angular distance.
I am writing some code to calculate the real distance between one point and the rest of the points from the same array. The array holds positions of particles in 3D space. There is N-particles so the array's shape is (N,3). I choose one particle and calculate the distance between this particle and the rest of the particles, all within one array.
Would anyone here have any idea how to do this?
What I have so far:
xbox = 10
ybox = 10
zbox = 10
nparticles =15
positions = np.empty([nparticles, 3])
for i in range(nparticles):
xrandomalocation = random.uniform(0, xbox)
yrandomalocation = random.uniform(0, ybox)
zrandomalocation = random.uniform(0, zbox)
positions[i, 0] = xrandomalocation
positions[i, 1] = yrandomalocation
positions[i, 2] = zrandomalocation
And that's pretty much all I have right now. I was thinking of using np.linalg.norm however I am not sure at all how to implement it to my code (or maybe use it in a loop)?
It sounds like you could use scipy.distance.cdist or scipy.distance.pdist for this. For example, to get the distances from point X to the points in coords:
>>> from scipy.spatial import distance
>>> X = [(35.0456, -85.2672)]
>>> coords = [(35.1174, -89.9711),
... (35.9728, -83.9422),
... (36.1667, -86.7833)]
>>> distance.cdist(X, coords, 'euclidean')
array([[ 4.70444794, 1.6171966 , 1.88558331]])
pdist is similar, but only takes one array, and you get the distances between all pairs.
i am using this function:
from scipy.spatial import distance
def closest_node(node, nodes):
closest = distance.cdist([node], nodes)
index = closest.argmin()
euclidean = closest[0]
return nodes[index], euclidean[index]
where node is the single point in the space you want to compare with an array of points called nodes. it returns the point and the euclidean distance to your original node
I'm trying to follow an example on k-Nearest Neighbors and I'm not sure about the numpy command syntax. I'm supposed to be doing a matrix-wise distance calculation and the code given is
def classify(inputVector, trainingData,labels,k):
dataSetSize=trainingData.shape[0]
diffMat=tile(inputVector,(dataSetSize,1))-trainingData
sqDiffMat = diffMat**2
sqDistances = sqDiffMat.sum(axis=1)
distances = sqDistances**0.5
sortedDistIndicies = distances.argsort()
for i in range(k):
voteIlabel = labels[sortedDistIndicies[i]]
classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
return sortedClassCount[0][0]
def createDataSet():
group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
labels = ['A','A','B','B']
return group, labels
my question is how does sqDistances**0.5 amount to the distance equation ((A[0]-B[0])+(A[1]-B[1]))^1/2? I don't follow how tile influences it specifically how the matrix is made from (datasetsize,1)-training data.
I hope the following will explain the working.
Numpy tile : https://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html
Using this function, you are creating matrix from input vector same to the shape of training data. From this matrix you are subtracting training data which will give you some part from what you mentioned say test[0]-train[0] i.e. element wise difference.
Then you squared each obtained element by using diffMat**2 and then taken sum along axis = 1 (https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html). This resulted in equations like (test[0] - train[0])^2 + (test[1] - train[1])^2.
Next by taking sqDistances**0.5 , it will give Euclidean distance.
To calculate Euclidean distance, this might be helpful
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean
I want to calculate the nearest cosine neighbors of a vector from the rows of a matrix, and have been testing the performance of a few Python functions for doing this.
def cos_loop_spatial(matrix, vector):
"""
Calculating pairwise cosine distance using a common for loop with the numpy cosine function.
"""
neighbors = []
for row in range(matrix.shape[0]):
neighbors.append(scipy.spatial.distance.cosine(vector, matrix[row,:]))
return neighbors
def cos_loop(matrix, vector):
"""
Calculating pairwise cosine distance using a common for loop with manually calculated cosine value.
"""
neighbors = []
for row in range(matrix.shape[0]):
vector_norm = np.linalg.norm(vector)
row_norm = np.linalg.norm(matrix[row,:])
cos_val = vector.dot(matrix[row,:]) / (vector_norm * row_norm)
neighbors.append(cos_val)
return neighbors
def cos_matrix_multiplication(matrix, vector):
"""
Calculating pairwise cosine distance using matrix vector multiplication.
"""
dotted = matrix.dot(vector)
matrix_norms = np.linalg.norm(matrix, axis=1)
vector_norm = np.linalg.norm(vector)
matrix_vector_norms = np.multiply(matrix_norms, vector_norm)
neighbors = np.divide(dotted, matrix_vector_norms)
return neighbors
cos_functions = [cos_loop_spatial, cos_loop, cos_matrix_multiplication]
# Test performance and plot the best results of each function
mat = np.random.randn(1000,1000)
vec = np.random.randn(1000)
cos_performance = {}
for func in cos_functions:
func_performance = %timeit -o func(mat, vec)
cos_performance[func.__name__] = func_performance.best
pd.Series(cos_performance).plot(kind='bar')
The cos_matrix_multiplication function is clearly the fastest of these, but I'm wondering if you have suggestions of further efficiency improvements for matrix vector cosine distance calculations.
use scipy.spatial.distance.cdist(mat, vec[np.newaxis,:], metric='cosine'), basically computes pairwise distance between every pairs of the two collections of vectors, represented by rows of the two input matrices.