Distance of an array of vectors from its own elements - python

I've got an array of vectors and I want to build a matrix that shows the distance between each pair of its vectors. For example, take this matrix with 2 vectors:
[[a, b , c]
[d, e , f]]
and I want to get the following, where dist is, for example, the Euclidean distance:
[[dist(vect1,vect1), dist(vect1,vect2)]
[dist(vect2,vect1), dist(vect2,vect2)]]
So obviously I'm expecting a symmetric matrix with zeros on the diagonal. I tried something using scikit-learn.
# Create clusters containing the similar vectors from the clustering algo
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
list_cluster = [[] for x in range(0, n_clusters_ + 1)]
for index, label in enumerate(labels):
    if label == -1:
        list_cluster[n_clusters_].append(sparse_matrix[index])
    else:
        list_cluster[label].append(sparse_matrix[index])
vector_rows = []
for cluster in list_cluster:
    for row in cluster:
        vector_rows.append(row)
# Create my array of vectors per cluster order
sim_matrix = np.array(vector_rows)
# Build my resulting matrix
sim_matrix = metrics.pairwise.pairwise_distances(sim_matrix, sim_matrix)
The problem is that my resulting matrix is not symmetric, so I guess there is something wrong in my code.
Here is a small sample if you want to test; I computed the Euclidean distance vector by vector:
input_matrix = [[0, 0, 0, 3, 4, 1, 0, 2],
[0, 0, 0, 2, 5, 2, 0, 3],
[2, 1, 1, 0, 4, 0, 2, 3],
[3, 0, 2, 0, 5, 1, 1, 2]]
expected_result = [[0, 2, 4.58257569, 4.89897949],
[2, 0, 4.35889894, 4.47213595],
[4.58257569, 4.35889894, 0, 2.64575131],
[4.89897949, 4.47213595, 2.64575131, 0]]

The functions pdist and squareform will do the trick:
In [897]: import numpy as np
...: from scipy.spatial.distance import pdist, squareform
In [898]: input_matrix = np.asarray([[0, 0, 0, 3, 4, 1, 0, 2],
...: [0, 0, 0, 2, 5, 2, 0, 3],
...: [2, 1, 1, 0, 4, 0, 2, 3],
...: [3, 0, 2, 0, 5, 1, 1, 2]])
In [899]: squareform(pdist(input_matrix))
Out[899]:
array([[0. , 2. , 4.58257569, 4.89897949],
[2. , 0. , 4.35889894, 4.47213595],
[4.58257569, 4.35889894, 0. , 2.64575131],
[4.89897949, 4.47213595, 2.64575131, 0. ]])
As expected, the resulting distance matrix is a symmetric array.
By default, pdist computes the Euclidean distance. You can calculate a different distance by passing the appropriate value to the metric parameter in the function call. For example:
In [900]: squareform(pdist(input_matrix, metric='jaccard'))
Out[900]:
array([[0. , 1. , 0.875 , 0.71428571],
[1. , 0. , 0.875 , 0.85714286],
[0.875 , 0.875 , 0. , 1. ],
[0.71428571, 0.85714286, 1. , 0. ]])
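To double-check the properties the question asks for, here is a minimal sketch that verifies the pdist/squareform result is symmetric with a zero diagonal, and compares it against a plain NumPy broadcasting computation. The variable names (dist, manual, and so on) are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

input_matrix = np.asarray([[0, 0, 0, 3, 4, 1, 0, 2],
                           [0, 0, 0, 2, 5, 2, 0, 3],
                           [2, 1, 1, 0, 4, 0, 2, 3],
                           [3, 0, 2, 0, 5, 1, 1, 2]])

# Square Euclidean distance matrix built from the condensed pdist output
dist = squareform(pdist(input_matrix))

# Properties expected of any distance matrix
is_symmetric = np.allclose(dist, dist.T)
zero_diagonal = np.allclose(np.diag(dist), 0.0)

# Same computation with broadcasting: pairwise row differences, then the norm
diff = input_matrix[:, None, :] - input_matrix[None, :, :]
manual = np.sqrt((diff ** 2).sum(axis=-1))

print(is_symmetric, zero_diagonal, np.allclose(dist, manual))
```

The broadcasting version allocates an n x n x d intermediate array, so it is only meant as a cross-check on small inputs.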

Related

How can I extract a column and create a vector out of them using NumPy?

mat = [[ 1. 2. 3. 4. 5.]
[ 6. 7. 8. 9. 10.]
[11. 12. 13. 14. 15.]]
Suppose, I have this NumPy array.
Say, I need to extract the 2nd column of each row, convert them into binary, and then create a vector out of them.
How can I do it using NumPy?
For instance, if I select 2nd column of this NumPy array, my output should look as follows:
[[0 0 1 0],
[0 1 1 1],
[1 1 0 0]]
I tried as follows:
my_data = np.genfromtxt('data_input')
print(my_data)
my_data_2nd_column = my_data[:, 1]
my_data_2nd_column_binary = Utils.encode(my_data_2nd_column)
my_2nd_column_binary = np.apply_along_axis(Utils.encode, 1, my_data)
print(my_2nd_column_binary)
NumPy has a built-in function for this. First, you can get a particular column using indexing:
>>> arr
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15]])
>>> arr[:, [1]]
array([[ 2],
[ 7],
[12]])
Then, you could use the built-in function, but make sure you convert to unsigned, 8-bit integers:
>>> np.unpackbits(arr[:, [1]].astype(np.uint8), axis=1)
array([[0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 1, 1, 0, 0]], dtype=uint8)
Of course, if you need the second dimension to have size 4, just slice again. It is probably worth copying the result if you are going to do lots of operations on it:
>>> np.unpackbits(arr[:, [1]].astype(np.uint8), axis=1)[:, -4:]
array([[0, 0, 1, 0],
[0, 1, 1, 1],
[1, 1, 0, 0]], dtype=uint8)
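Since np.unpackbits only accepts uint8 input, values above 255 would be wrapped by the astype(np.uint8) conversion. As a possible alternative for wider values, here is a sketch using bit shifts; the helper name column_to_bits and its width parameter are made up for this example.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5],
                [6, 7, 8, 9, 10],
                [11, 12, 13, 14, 15]])

def column_to_bits(a, col, width):
    """Return the `width` lowest bits of column `col`, most significant bit first."""
    values = a[:, col].astype(np.int64)
    # Shift amounts width-1, ..., 1, 0 so the leftmost output column is the highest bit
    shifts = np.arange(width - 1, -1, -1)
    return (values[:, None] >> shifts) & 1

bits = column_to_bits(arr, 1, 4)
print(bits)

# A value that does not fit in 8 bits
wide = column_to_bits(np.array([[0, 300]]), 1, 9)
print(wide)
```

The width has to be chosen by the caller; bits above that width are silently dropped.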
I wrote this example without using the numpy library.
I have commented all the functions.
mat = [[ 1, 2, 3, 4, 1,],
       [ 6, 7, 8, 9, 40,],
       [11, 12, 13, 14, 15,]]
# convert the binary into a vector of elements
def split(word):
    return [int(char) for char in word]
# returns the vector size of the largest binary
def binaryBig(lista):
    maior = max(lista, key=int)
    temp = "{0:b}".format(maior)
    return len(split(temp))
# convert the element to binary
def binary(x,big):
    temp = split(format(x, "b"))
    for n in range(len(temp),big):
        temp.insert(0,0)
    return temp
# create the matrix with the binaries
def createBinaryMat(lista):
    big = binaryBig(lista)
    mat = []
    for i in lista:
        mat.append(binary(i,big))
    return mat
# select the column and return the created matrix
def binaryElementsOfColum(colum,mat):
    lista = []
    for i in mat:
        lista.append(i[colum])
    return createBinaryMat(lista)
for i in binaryElementsOfColum(4,mat):
    print(i)
Output:
[0, 0, 0, 0, 0, 1]
[1, 0, 1, 0, 0, 0]
[0, 0, 1, 1, 1, 1]

New multidimensional array assigning values from array to indices of second array

While I don't feel this is overly complex, I'm struggling with how to even search for similar questions/answers.
I have two arrays.
indices_array: [0, 1, 1, 0, 0, 1, 0]
value_array: [1, 2, 3, 4, 5, 6, 7]
I want to create a new array using the first array as indices for assignment and the second array for values. This should result in a new array with two values per index, however, one value being zero and the other being the value from my second array.
Using my example arrays above, it should result in:
[[1, 0],
[0, 2],
[0, 3],
[4, 0],
[5, 0],
[0, 6],
[7, 0]]
I can easily create an empty version of my desired array using: np.zeros((total_len, values_per_index))
My intuition fails when attempting something like: target_array[indices_array] = value_array
I believe I understand why my attempted method fails, but how to actually accomplish this eludes me. Is there a simple way of doing this? Python is not my best language by far and some of the numpy tricks seem overly magical in nature at times..
Edit: I know a for loop would accomplish this, but I'm truly looking to understand numpy better and to ideally avoid iteration when possible for code cleanliness as much as readability.
You could do:
import numpy as np
indices = np.array([0, 1, 1, 0, 0, 1, 0])
values = np.array([1, 2, 3, 4, 5, 6, 7])
result = np.zeros((len(indices), 2))
result[np.arange(len(indices)), indices] = values
print(result)
Output
[[1. 0.]
[0. 2.]
[0. 3.]
[4. 0.]
[5. 0.]
[0. 6.]
[7. 0.]]
See indexing in numpy.
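To illustrate why pairing a row index with each column index works, here is a small sketch that repeats the assignment with an integer result array and compares it with np.put_along_axis, which expresses the same "one column per row" write. The names rows and alt are only for illustration.

```python
import numpy as np

indices = np.array([0, 1, 1, 0, 0, 1, 0])
values = np.array([1, 2, 3, 4, 5, 6, 7])

# Integer array indexing: element k is written at (rows[k], indices[k])
rows = np.arange(len(indices))
result = np.zeros((len(indices), 2), dtype=values.dtype)
result[rows, indices] = values

# Same write expressed per row along axis 1
alt = np.zeros_like(result)
np.put_along_axis(alt, indices[:, None], values[:, None], axis=1)

print(result)
print(np.array_equal(result, alt))
```

By contrast, target_array[indices_array] = value_array indexes only the first axis, so it selects whole rows 0 and 1 rather than one cell per row.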
You can use multiplication, as below:
import numpy as np
indices_array = np.array([0, 1, 1, 0, 0, 1, 0])
value_array = np.array([1, 2, 3, 4, 5, 6, 7])
# one column per index value; each boolean mask zeroes out the entries of the other column
value_array = np.c_[value_array * (indices_array==0), value_array * (indices_array==1)]
print(value_array)
Does this work for you?
import numpy as np
a = np.array([0, 1, 1, 0, 0, 1, 0])
b = np.array([1, 2, 3, 4, 5, 6, 7])
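# put the index-0 column first, then transpose so that each input element becomes a row
print(np.array([(1-a)*b, a*b]).T)
[[1 0]
 [0 2]
 [0 3]
 [4 0]
 [5 0]
 [0 6]
 [7 0]]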
You may simply use column_stack and multiply using NumPy broadcasting:
i_arr = np.array([0, 1, 1, 0, 0, 1, 0])
v_arr = np.array([1, 2, 3, 4, 5, 6, 7])
np.column_stack((1-i_arr, i_arr)) * v_arr[:,None]
Out[61]:
array([[1, 0],
[0, 2],
[0, 3],
[4, 0],
[5, 0],
[0, 6],
[7, 0]])

Neighbour Cells of a matrix

Suppose I have a matrix called "grid":
grid = [ [1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1] ]
I want to define a function that takes the contents of each neighbour within a 1-cell radius and adds those values into a new matrix, like so:
grid = [ [3, 5, 5, 3],
[5, 8, 8, 5],
[5, 8, 8, 5],
[3, 5, 5, 3] ]
One approach is to treat this as a 2-D convolution problem. You just need to define the appropriate mask.
In this case, you can use a 3x3 matrix of ones and zero out the center element.
import numpy as np
mask = np.ones((3, 3))
mask[1, 1] = 0
print(mask)
#[[1. 1. 1.]
# [1. 0. 1.]
# [1. 1. 1.]]
Now to the convolution:
from scipy.signal import convolve2d
print(convolve2d(grid, mask, mode='same'))
#[[3. 5. 5. 3.]
# [5. 8. 8. 5.]
# [5. 8. 8. 5.]
# [3. 5. 5. 3.]]
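If SciPy is not available, the same neighbour sum can be sketched with NumPy alone by zero-padding the grid and adding up the eight shifted views. This assumes cells outside the grid count as 0, which matches the expected output in the question; the function name neighbour_sum is made up for this example.

```python
import numpy as np

def neighbour_sum(grid):
    """Sum the 8 neighbours of every cell, treating out-of-grid cells as 0."""
    g = np.asarray(grid)
    rows, cols = g.shape
    # One layer of zeros around the grid so every cell has 8 neighbours
    padded = np.pad(g, 1, mode="constant", constant_values=0)
    total = np.zeros_like(g)
    for dr in (0, 1, 2):
        for dc in (0, 1, 2):
            if dr == 1 and dc == 1:
                continue  # skip the centre cell itself
            total += padded[dr:dr + rows, dc:dc + cols]
    return total

grid = [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]
result = neighbour_sum(grid)
print(result)
```

The loop runs only eight times regardless of the grid size, so the work per cell is still vectorized.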
I found that solution - quick and dirty :)
grid = [ [1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1] ]
rows = len(grid)
cols = len(grid[0])
def get_sum_of_neighbours(grid, r, c):
    neighbours = [
        (r - 1, c - 1), (r - 1, c), (r - 1, c + 1),
        (r, c - 1), (r, c + 1),
        (r + 1, c - 1), (r + 1, c), (r + 1, c + 1),
    ]
    return sum([grid[r_n][c_n]
                for r_n, c_n in neighbours
                if 0 <= r_n < rows and 0 <= c_n < cols])
resultgrid = []
for r in range(rows):
    row = []
    for c in range(cols):
        row.append(get_sum_of_neighbours(grid, r, c))
    resultgrid.append(row)
for row in resultgrid:
    print(row)
without third party tools like scipy or numpy...

Calculating cosine distance between the rows of matrix

I'm trying to calculate the cosine distance in Python between the rows of a matrix and have a couple of questions. I'm creating the matrix matr and populating it from lists, then reshaping it for analysis purposes:
s = []
for i in range(len(a)):
    for j in range(len(b_list)):
        s.append(a[i].count(b_list[j]))
matr = np.array(s)
d = matr.reshape((22, 254))
The output of d gives me something like:
array([[0, 0, 0, ..., 0, 0, 0],
[2, 0, 0, ..., 1, 0, 0],
[2, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0]])
Then I want to use the scipy.spatial.distance.cosine function to calculate the cosine distance from the first row to every other row in the d matrix.
How can I do that? Should I use a for loop? I don't have much experience with matrix and array operations.
How can I loop over the second argument (d[1], d[2], and so on) so that I don't have to run this by hand every time:
from scipy.spatial.distance import cosine
x = cosine(d[0], d[6])
You said "calculate cosine from first row to every other else in the d matrix" [sic]. If I understand correctly, you can do that with scipy.spatial.distance.cdist, passing the first row as the first argument and the remaining rows as the second argument:
In [31]: from scipy.spatial.distance import cdist
In [32]: matr = np.random.randint(0, 3, size=(6, 8))
In [33]: matr
Out[33]:
array([[1, 2, 0, 1, 0, 0, 0, 1],
[0, 0, 2, 2, 1, 0, 1, 1],
[2, 0, 2, 1, 1, 2, 0, 2],
[2, 2, 2, 2, 0, 0, 1, 2],
[0, 2, 0, 2, 1, 0, 0, 0],
[0, 0, 0, 1, 2, 2, 2, 2]])
In [34]: cdist(matr[0:1], matr[1:], metric='cosine')
Out[34]: array([[ 0.65811827, 0.5545646 , 0.1752139 , 0.24407105, 0.72499045]])
If it turns out that you want to compute all the pairwise distances in matr, you can use scipy.spatial.distance.pdist.
For example,
In [35]: from scipy.spatial.distance import pdist
In [36]: pdist(matr, metric='cosine')
Out[36]:
array([ 0.65811827, 0.5545646 , 0.1752139 , 0.24407105, 0.72499045,
0.36039785, 0.27625314, 0.49748109, 0.41498206, 0.2799177 ,
0.76429774, 0.37117185, 0.41808563, 0.5765951 , 0.67661917])
Note that the first five values returned by pdist are the same values returned above using cdist.
For further explanation of the return value of pdist, see How does condensed distance matrix work? (pdist)
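As a rough illustration of how the two results relate, the sketch below expands the condensed pdist output with squareform and checks that its first row matches the cdist result. The demo matrix is random with strictly positive entries, an assumption made here so that no row is all zeros (the cosine distance is undefined for a zero vector, since its norm is zero).

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(0)
# Entries in 1..3 so that no row is all zeros
matr = rng.integers(1, 4, size=(6, 8))

# Distances from the first row to every other row: shape (1, 5)
first_row = cdist(matr[0:1], matr[1:], metric='cosine')

# All pairwise distances in condensed form: 6 * 5 / 2 = 15 values
condensed = pdist(matr, metric='cosine')

# Expanded to a full symmetric 6 x 6 matrix with zeros on the diagonal
square = squareform(condensed)

print(np.allclose(square[0, 1:], first_row[0]))
```

With the square form, the distance between rows i and j is simply square[i, j], at the cost of storing every value twice.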
You can just use a simple for loop with scipy.spatial.distance.cosine:
import scipy.spatial.distance
dists = []
for row in matr:
    dists.append(scipy.spatial.distance.cosine(matr[0,:], row))
Here's how you might calculate it easily by hand:
import numpy as np
from numpy.linalg import norm
M = np.random.randint(1, 11, size=(5, 5))  # create demo matrix with entries in 1..10
# dot products of every row with every other row
DotProducts = M.dot(M.T)
# outer product of the row norms
row_norms = norm(M, axis=1)
NormOuter = np.outer(row_norms, row_norms)
CosineSimilarity = DotProducts / NormOuter
CosineDistance = 1 - CosineSimilarity
np.set_printoptions(precision=2, suppress=True)
print(CosineDistance)
Output:
[[-0. 0.15 0.1 0.11 0.22]
[ 0.15 0. 0.15 0.13 0.06]
[ 0.1 0.15 0. 0.15 0.14]
[ 0.11 0.13 0.15 0. 0.18]
[ 0.22 0.06 0.14 0.18 -0. ]]
For example, this matrix is read as: "the cosine distance between row 3 and row 2 (or, equally, between row 2 and row 3) is 0.15".

python: calculate center of mass

I have a data set with 4 columns: x,y,z, and value, let's say:
x y z value
0 0 0 0
0 1 0 0
0 2 0 0
1 0 0 0
1 1 0 1
1 2 0 1
2 0 0 0
2 1 0 0
2 2 0 0
I would like to calculate the center of mass CM = (x_m,y_m,z_m) of all values. In the present example, I would like to see (1,1.5,0) as output.
I thought this must be a trivial problem, but I can't find a solution to it on the internet. scipy.ndimage.measurements.center_of_mass seems to be the right thing, but unfortunately the function always returns two values (instead of 3). In addition, I can't find any documentation on how to set up an ndimage from an array: would I use a NumPy array N of shape (9,4)? Would N[:,0] then be the x-coordinate?
Any help is highly appreciated.
The simplest way I can think of is this: just find an average of the coordinates of mass components weighted by each component's contribution.
import numpy
masses = numpy.array([[0, 0, 0, 0],
[0, 1, 0, 0],
[0, 2, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 1],
[1, 2, 0, 1],
[2, 0, 0, 0],
[2, 1, 0, 0],
[2, 2, 0, 0]])
nonZeroMasses = masses[numpy.nonzero(masses[:,3])] # Not really necessary, can just use masses because 0 mass used as weight will work just fine.
CM = numpy.average(nonZeroMasses[:,:3], axis=0, weights=nonZeroMasses[:,3])
Another option is to use the scipy center of mass:
from scipy import ndimage
import numpy
masses = numpy.array([[0, 0, 0, 0],
[0, 1, 0, 0],
[0, 2, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 1],
[1, 2, 0, 1],
[2, 0, 0, 0],
[2, 1, 0, 0],
[2, 2, 0, 0]])
ndimage.measurements.center_of_mass(masses)
How about:
import numpy as np
# x y z value
table = np.array([[ 5. , 1.3, 8.3, 9. ],
[ 6. , 6.7, 1.6, 5.9],
[ 9.1, 0.2, 6.2, 3.7],
[ 2.2, 2. , 6.7, 4.6],
[ 3.4, 5.6, 8.4, 7.3],
[ 4.8, 5.9, 5.7, 5.8],
[ 3.7, 1.1, 8.2, 2.2],
[ 0.3, 0.7, 7.3, 4.6],
[ 8.1, 1.9, 7. , 5.3],
[ 9.1, 8.2, 3.3, 5.3]])
def com(xyz, mass):
    mass = mass.reshape((-1, 1))
    # mass-weighted average of the coordinates: sum(m_i * r_i) / sum(m_i)
    return (xyz * mass).sum(0) / mass.sum()
print(com(table[:, :3], table[:, 3]))
Why did ndimage.measurements.center_of_mass not give the expected result?
The key is in how the input data masses was represented, as an array of 4-tuples (x, y, z, value):
# x y z value
[[0, 0, 0, 0],
[0, 1, 0, 0],
[0, 2, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 1],
[1, 2, 0, 1],
[2, 0, 0, 0],
[2, 1, 0, 0],
[2, 2, 0, 0]]
The array masses here represents the 3-D position and weight of each mass.
Note, however, that this array is only 2-D. Its shape is (9, 4).
The input you need to pass to ndimage to get the expected result is a 3-D array containing zeros everywhere and the weight of each mass at the appropriate coordinates within the array, like this:
from scipy import ndimage
import numpy
masses = numpy.zeros((3, 3, 1))
# x y z value
masses[1, 1, 0] = 1
masses[1, 2, 0] = 1
CM = ndimage.measurements.center_of_mass(masses)
# x y z
# (1.0, 1.5, 0.0)
Which is exactly the expected output.
Note that a limitation of this solution (and of the ndimage library) is that it requires non-negative integer coordinates. It will also be inefficient for large and/or sparse volumes, because each "pixel" of the ndimage needs to be instantiated in memory.
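Rather than filling the 3-D array by hand, it can be built from the (x, y, z, value) table directly. The following is a sketch that assumes the coordinates are non-negative integers, as in the question's data; it uses ndimage.center_of_mass, the same function exposed at the top level of scipy.ndimage, and cross-checks the result against the weighted average.

```python
import numpy as np
from scipy import ndimage

#                  x  y  z  value
table = np.array([[0, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 2, 0, 0],
                  [1, 0, 0, 0],
                  [1, 1, 0, 1],
                  [1, 2, 0, 1],
                  [2, 0, 0, 0],
                  [2, 1, 0, 0],
                  [2, 2, 0, 0]])

coords = table[:, :3].astype(int)
weights = table[:, 3]

# Dense volume just large enough to hold every coordinate: here (3, 3, 1)
shape = tuple(int(n) for n in coords.max(axis=0) + 1)
volume = np.zeros(shape)
# Accumulate each weight at its (x, y, z) cell; add.at also handles repeated coordinates
np.add.at(volume, tuple(coords.T), weights)

cm_grid = ndimage.center_of_mass(volume)
cm_avg = np.average(coords, axis=0, weights=weights)
print(cm_grid, cm_avg)
```

Both approaches should agree here; the weighted average avoids allocating the dense volume and also works for non-integer coordinates.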
