One-dimensional Mahalanobis distance in Python

I've been trying to validate my Python code for calculating the Mahalanobis distance (and to cross-check the result against OpenCV).
My data points are one-dimensional (5 rows x 1 column).
In OpenCV (C++), I was able to calculate the Mahalanobis distance for data points with these dimensions.
The following code fails to calculate the Mahalanobis distance when the matrix is 5 rows x 1 column, but it works when the matrix has more than one column:
import numpy
import scipy.spatial.distance
s = numpy.array([[20],[123],[113],[103],[123]])
covar = numpy.cov(s, rowvar=0)
invcovar = numpy.linalg.inv(covar)
print(scipy.spatial.distance.mahalanobis(s[0],s[1],invcovar))
I get the following error:
Traceback (most recent call last):
  File "/home/abc/Desktop/Return.py", line 6, in <module>
    invcovar = numpy.linalg.inv(covar)
  File "/usr/lib/python2.6/dist-packages/numpy/linalg/linalg.py", line 355, in inv
    return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
IndexError: tuple index out of range

One-dimensional Mahalanobis distance is really easy to calculate manually:
import numpy as np
s = np.array([[20], [123], [113], [103], [123]])
std = s.std(ddof=1)  # sample standard deviation, matching np.cov's default normalization
print(np.abs(s[0] - s[1]) / std)
(reducing the formula to the one-dimensional case).
The problem with the scipy.spatial.distance approach is that np.cov returns a scalar, i.e. a zero-dimensional array, when given a single one-dimensional variable, while np.linalg.inv needs a 2d array. Reshape the covariance to a 1x1 matrix before inverting it:
>>> covar = np.cov(s, rowvar=0)
>>> covar.shape
()
>>> invcovar = np.linalg.inv(covar.reshape((1,1)))
>>> invcovar.shape
(1, 1)
>>> from scipy.spatial.distance import mahalanobis
>>> mahalanobis(s[0], s[1], invcovar)
2.3674720531046645
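As a minimal sketch of the same idea without the manual reshape, np.atleast_2d can promote the zero-dimensional covariance to a 1x1 matrix. The helper name mahalanobis_distance is hypothetical, and the distance is computed directly from the formula with NumPy only rather than through scipy:

```python
import numpy as np

def mahalanobis_distance(u, v, samples):
    """Hypothetical helper: distance between u and v, with rows of `samples` as observations."""
    # np.cov returns a 0-d array for single-column data; atleast_2d makes it (1, 1)
    covar = np.atleast_2d(np.cov(samples, rowvar=False))
    inv_covar = np.linalg.inv(covar)
    delta = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    # sqrt(delta^T * inverse covariance * delta)
    return float(np.sqrt(delta @ inv_covar @ delta))

s = np.array([[20], [123], [113], [103], [123]])
d = mahalanobis_distance(s[0], s[1], s)
print(d)
```

For this data the result should agree with the scipy value shown above, and the same function should also accept data with more than one column.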

Covariance is computed across a set of arrays. Both np.cov() and OpenCV's calcCovarMatrix accept the arrays stacked on top of each other (use vstack). You can also place the arrays side by side if you set rowvar to False in NumPy or use COVAR_COLS in OpenCV. If your arrays are multidimensional, just flatten() them first.
So if I want to compare two 32x32 images, I flatten each of them into a 1x1024 array, then stack the two to get a 2x1024 array, and that is the first argument of np.cov().
You should then get a large square matrix that relates each element position to every other element position. In my example it will be 1024x1024. Note that np.cov treats rows as variables by default, so a 2x1024 input needs rowvar=False to produce that shape (the default returns a 2x2 matrix). THAT is what you pass into your invert function.
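To check which orientation np.cov expects, here is a small sketch using two hypothetical 4x4 arrays in place of real images (the names img_a and img_b are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
img_a = rng.random((4, 4))   # hypothetical stand-ins for two small images
img_b = rng.random((4, 4))

# Flatten each image and stack them as rows: shape (2, 16)
stacked = np.vstack([img_a.flatten(), img_b.flatten()])

cov_rows_as_variables = np.cov(stacked)                # default rowvar=True
cov_cols_as_variables = np.cov(stacked, rowvar=False)  # each column is a variable

print(stacked.shape)                 # (2, 16)
print(cov_rows_as_variables.shape)   # (2, 2)
print(cov_cols_as_variables.shape)   # (16, 16)
```

With only two observations the larger matrix is rank-deficient, so it cannot be reliably inverted; np.linalg.pinv is one option in that situation.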

Related

Python numpy comparing two 3D Arrays for similarity

I am trying to compare two 3D numpy arrays to calculate similarity. I have found these two posts, which I am trying to stitch together into something useful.
Comparing NumPy Arrays for Similarity
Subtracting numpy arrays of different shape efficiently
To make a long story short, I have two arrays created from 3D point clouds so they are filled with 3D coordinates, but because the 3D objects are different, the arrays have different lengths.
If requested, I can post some sample arrays, but they contain more than 1,000 points each, so that would be a lot of text to post.
Here is what I am trying to do now. You can get array1 and array2 data here: https://pastebin.com/WbNvRUwG (array2 starts at line 1858).
array1 = [long np array with 3D coordinates]
array2 = [long np array with 3D coordinates]
array1_original = array1.copy()
if len(array1) < len(array2):
    array1, array2 = array2, array1
# The [:, None] is from the second link; it lets arrays of different lengths be subtracted
array_difference = np.subtract(array1, array2[:, None])
array_abs_difference = np.absolute(array_difference)
array_total_difference = np.sum(array_abs_difference)
similarity = 1 - (array_total_difference / np.sum(array1_original))
My array differences are fine and represent what I want, so the most similar arrays have small differences, but when I do the sum of array1_original it comes out way smaller than my differences and therefore my similarity score becomes negative.
I also tried to calculate the difference from an array filled with zeros to array1_original, but it comes out about the same.
Can anyone tell me why np.sum(array1_original) would not be bigger than np.sum(array_abs_difference)?
The numpy comparison ended up being too slow, so I just used Open3D instead. It works for me.
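A likely reason for the negative score is that array2[:, None] makes the subtraction broadcast over every pair of points, so the sum of differences runs over len(array1) * len(array2) * 3 terms, while np.sum(array1_original) covers only len(array1) * 3 terms. The sketch below uses small random arrays as hypothetical stand-ins for the point clouds, and the nearest-point distance at the end is just one possible alternative measure, not the asker's method:

```python
import numpy as np

rng = np.random.default_rng(0)
array1 = rng.random((5, 3))   # 5 hypothetical 3D points
array2 = rng.random((4, 3))   # 4 hypothetical 3D points

# Broadcasting (5, 3) against (4, 1, 3) yields every pairwise difference
pairwise = np.subtract(array1, array2[:, None])
print(pairwise.shape)   # (4, 5, 3)

# Distance from each point in array1 to its nearest point in array2
nearest = np.sqrt((pairwise ** 2).sum(axis=-1)).min(axis=0)
print(nearest.shape)    # (5,)
```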

Error getting more than two eigenvalues in PCA

I am trying to perform PCA from scratch on a subset of MNIST data (digits 0 and 1) using Python.
(NOTE: x_train_0_scaled has dimensions : 5923x784 where 5923 is the number of images and 784 is the 28*28 flattened pixel values)
Here's my code to find eigenvalues:
import numpy as np
from scipy.linalg import eigh
# matrix multiplication using numpy
covar_matrix = np.matmul(x_train_0_scaled.T, x_train_0_scaled)
print("The shape of variance matrix = ", covar_matrix.shape)
# the parameter 'eigvals' is defined (low value to high value)
# the eigh function returns the eigenvalues in ascending order
# this code generates only the top 2 eigenvalues (indices 782 and 783).
values, vectors = eigh(covar_matrix, eigvals=(782, 783))
print("Shape of eigen vectors = ", vectors.shape)
However when I try to get more than two eigenvalues, I get this error:
values, vectors = eigh(covar_matrix, eigvals=(782, 783, 781))
File "/usr/local/lib/python3.8/site-packages/scipy/linalg/decomp.py", line 484, in eigh
lo, hi = [int(x) for x in subset_by_index]
ValueError: too many values to unpack (expected 2)
The reason I want more than two eigenvectors is that, as per the image below, I guess my data is not clearly separable, so I want to find more dimensions to plot my data on. Is my intuition correct?
The issue is resolved. eigvals takes its argument as an inclusive index range (lo, hi). So instead of listing the indices as (781, 782, 783), I needed to pass eigvals=(781, 783) to get the top 3 eigenvalues.
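The index-range behaviour can be sketched on a small matrix. This example assumes a SciPy release that accepts subset_by_index, the keyword that appears in the traceback above and that newer releases use in place of eigvals. The random matrix x is a hypothetical stand-in for x_train_0_scaled:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 6))   # hypothetical stand-in for x_train_0_scaled
covar_matrix = x.T @ x             # symmetric (6, 6) matrix

n = covar_matrix.shape[0]
# The range is inclusive and counts from the smallest eigenvalue,
# so (n - 3, n - 1) selects the 3 largest eigenvalues.
values, vectors = eigh(covar_matrix, subset_by_index=(n - 3, n - 1))

print(values.shape)    # (3,)
print(vectors.shape)   # (6, 3)
```

The eigenvalues come back in ascending order, so the last column of vectors corresponds to the largest eigenvalue.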

Python - euclidean distance different size vectors

I have a numpy array of shape (9126, 12) and two reference cluster points of shape (2, 12). I'm trying to calculate the distance from each row of the array to each reference point in order to label the rows. I understand how this is meant to work in principle, but I can't implement it because the arrays have different sizes.
I know I could use numpy.linalg, but this is part of a homework assignment, so I'm not allowed to.
ValueError: operands could not be broadcast together with shapes (9126,12) (2,12)
def euclid_dist(v1, v2):
    return np.sqrt(((v1 - v2) ** 2).sum(axis=1))
def check_euclid_dist(data, reference_vectors):
    npdata = data.to_numpy()
    dst = euclid_dist(npdata, reference_vectors)
    # Get the indices of minimum element in numpy array
    result = np.where(dst == np.amin(dst))
    print(result)
    return result
You can compute the distance of each vector to each of the reference points by inserting an extra dimension in both arrays and letting NumPy broadcast them against each other:
distances = np.linalg.norm(npdata[:, None, ...] - reference_vectors[None, ...], axis=-1)
Then you can find the nearest cluster by using np.argmin:
cluster_id = np.argmin(distances, axis=1)
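Since the question rules out numpy.linalg, the same broadcasting can be sketched with plain arithmetic. The function name nearest_cluster and the tiny arrays below are hypothetical, chosen only to make the result easy to check by hand:

```python
import numpy as np

def nearest_cluster(data, reference_vectors):
    """Hypothetical helper: label each row of `data` with its nearest reference vector."""
    # (n, 1, d) - (1, k, d) broadcasts to (n, k, d)
    diff = data[:, None, :] - reference_vectors[None, :, :]
    # Euclidean distance without np.linalg: shape (n, k)
    distances = np.sqrt((diff ** 2).sum(axis=-1))
    return distances.argmin(axis=1), distances

data = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
refs = np.array([[0.0, 0.0], [10.0, 10.0]])
labels, distances = nearest_cluster(data, refs)
print(labels)   # [0 0 1]
```

With a pandas DataFrame as in the question, data.to_numpy() would be passed as the first argument.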

Python: Functions of arrays that return arrays of the same shape

Note: I'm using numpy
import numpy as np
Given 4 arrays of the same (but arbitrary) shape, I am trying to write a function that forms 2x2 matrices from each corresponding element of the arrays, finds the eigenvalues, and returns two arrays of the same shape as the original four, with its elements being eigenvalues (i.e. the resulting arrays would have the same shape as the input, with array1 holding all the first eigenvalues and array2 holding all the second eigenvalues).
I tried doing the following, but unsurprisingly, it gives me an error that says the array is not square.
temp = np.linalg.eig([[m1, m2],[m3, m4]])[0]
I suppose I can make an empty temp variable in the same shape,
temp = np.zeros_like(m1)
and go over each element of the original arrays and repeat the process. My problem is that I want this generalised for arrays of any arbitrary shape (need not be one dimensional). I would guess that finding the shape of the arrays and designing loops to go over each element would not be a very good way of doing it. How do I do this efficiently?
Construct a 2x2x... array:
temp = np.array([[m1, m2], [m3, m4]])
Move the first two dimensions to the end for a ...x2x2 array:
for _ in range(2):
    temp = np.rollaxis(temp, 0, temp.ndim)
Call np.linalg.eigvals (which broadcasts) for a ...x2 array of eigenvalues:
eigvals = np.linalg.eigvals(temp)
And split this into an array of first eigenvalues and an array of second eigenvalues:
eigvals1, eigvals2 = eigvals[..., 0], eigvals[..., 1]
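The steps above can be collected into one function. This is a minimal sketch that builds the ...x2x2 array with np.stack instead of np.rollaxis, and the name elementwise_eigvals is hypothetical. Note that np.linalg.eigvals does not guarantee any particular ordering of the two eigenvalues:

```python
import numpy as np

def elementwise_eigvals(m1, m2, m3, m4):
    """Hypothetical helper: eigenvalues of [[m1, m2], [m3, m4]] for each element."""
    top = np.stack([m1, m2], axis=-1)         # shape (..., 2): first matrix row
    bottom = np.stack([m3, m4], axis=-1)      # shape (..., 2): second matrix row
    mats = np.stack([top, bottom], axis=-2)   # shape (..., 2, 2)
    eigvals = np.linalg.eigvals(mats)         # shape (..., 2)
    return eigvals[..., 0], eigvals[..., 1]

m1 = np.array([[1.0, 2.0], [3.0, 4.0]])
m2 = np.zeros((2, 2))
m3 = np.zeros((2, 2))
m4 = np.full((2, 2), 10.0)
e1, e2 = elementwise_eigvals(m1, m2, m3, m4)
print(e1.shape, e2.shape)   # (2, 2) (2, 2)
```

Because the matrices in this usage example are diagonal, the two eigenvalues at each position are simply the corresponding entries of m1 and m4.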

Calculating Correlation Coefficient with Numpy

I have a list of values and a 1-d numpy array, and I would like to calculate the correlation coefficient using numpy.corrcoef(x,y,rowvar=0). I get the following error:
Traceback (most recent call last):
  File "testLearner.py", line 25, in <module>
    corr = np.corrcoef(valuesToCompare,queryOutput,rowvar=0)
  File "/usr/local/lib/python2.6/site-packages/numpy/lib/function_base.py", line 2003, in corrcoef
    c = cov(x, y, rowvar, bias, ddof)
  File "/usr/local/lib/python2.6/site-packages/numpy/lib/function_base.py", line 1935, in cov
    X = concatenate((X,y), axis)
ValueError: array dimensions must agree except for d_0
I printed out the shape for my numpy array and got (400,1). When I convert my list to an array with numpy.asarray(y) I get (400,)!
I believe this is the problem. I did an array.reshape to (400,1) and printed out the shape, and I still get (400,). What am I missing?
Thanks in advance.
I think you might have assumed that reshape modifies the value of the original array. It doesn't:
>>> a = np.random.randn(5)
>>> a.shape
(5,)
>>> b = a.reshape(5,1)
>>> b.shape
(5, 1)
>>> a.shape
(5,)
np.asarray treats a regular list as a 1d array, but your original numpy array that you said was 1d is actually 2d (because its shape is (400,1)). If you want to use your list like a 2d array, there are two easy approaches:
np.asarray(lst).reshape((-1, 1)): -1 means "however many it needs" for that dimension.
np.asarray([lst]).T: .T means array transpose, which switches from (1,5) to (5,1).
You could also reshape your original array to 1d via ary.reshape((-1,)).
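Putting this together, one way to get matching shapes is to flatten both inputs to 1d before calling np.corrcoef. The data below is made up for illustration, and the names values_to_compare and query_output only loosely mirror the ones in the traceback:

```python
import numpy as np

values_to_compare = [1.0, 2.0, 3.0, 4.0]                # plain list, hypothetical data
query_output = np.array([[2.0], [4.1], [5.9], [8.0]])   # 2d array of shape (4, 1)

x = np.asarray(values_to_compare)   # shape (4,)
y = query_output.ravel()            # shape (4,), flattened copy or view

# reshape and ravel return new array objects, so the result must be assigned
corr = np.corrcoef(x, y)[0, 1]
print(x.shape, y.shape, corr)
```

np.corrcoef returns a 2x2 matrix for two 1d inputs, and the off-diagonal entry is the correlation between them.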
