Calculating cosine distance between the rows of a matrix - python

I'm trying to calculate the cosine distance in Python between the rows of a matrix, and I have a couple of questions. So I'm creating the matrix matr, populating it from the lists, and then reshaping it for analysis purposes:
import numpy as np

s = []
for i in range(len(a)):
    for j in range(len(b_list)):
        s.append(a[i].count(b_list[j]))
matr = np.array(s)
d = matr.reshape((22, 254))
The output of d looks something like this:
array([[0, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 1, 0, 0],
       [2, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])
Then I want to use the scipy.spatial.distance.cosine function to calculate the cosine distance from the first row to every other row in the d matrix.
How can I do that? Should I use a for loop? I don't have much experience with matrix and array operations.
So how can I loop over the second argument (d[1], d[2], and so on) in this construction, instead of launching it manually every time:
from scipy.spatial.distance import cosine
x = cosine(d[0], d[6])

You said you want to "calculate the cosine distance from the first row to every other row in the d matrix". If I understand correctly, you can do that with scipy.spatial.distance.cdist, passing the first row as the first argument and the remaining rows as the second argument:
In [31]: from scipy.spatial.distance import cdist
In [32]: matr = np.random.randint(0, 3, size=(6, 8))
In [33]: matr
Out[33]:
array([[1, 2, 0, 1, 0, 0, 0, 1],
       [0, 0, 2, 2, 1, 0, 1, 1],
       [2, 0, 2, 1, 1, 2, 0, 2],
       [2, 2, 2, 2, 0, 0, 1, 2],
       [0, 2, 0, 2, 1, 0, 0, 0],
       [0, 0, 0, 1, 2, 2, 2, 2]])
In [34]: cdist(matr[0:1], matr[1:], metric='cosine')
Out[34]: array([[ 0.65811827, 0.5545646 , 0.1752139 , 0.24407105, 0.72499045]])
If it turns out that you want to compute all the pairwise distances in matr, you can use scipy.spatial.distance.pdist.
For example,
In [35]: from scipy.spatial.distance import pdist
In [36]: pdist(matr, metric='cosine')
Out[36]:
array([ 0.65811827,  0.5545646 ,  0.1752139 ,  0.24407105,  0.72499045,
        0.36039785,  0.27625314,  0.49748109,  0.41498206,  0.2799177 ,
        0.76429774,  0.37117185,  0.41808563,  0.5765951 ,  0.67661917])
Note that the first five values returned by pdist are the same values returned above using cdist.
For further explanation of the return value of pdist, see How does condensed distance matrix work? (pdist)
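For instance, squareform expands the condensed vector into the full symmetric distance matrix, and its first row (excluding the diagonal) reproduces the cdist result above. A small sketch continuing the session (the In/Out numbers are illustrative):
In [37]: from scipy.spatial.distance import squareform
In [38]: squareform(pdist(matr, metric='cosine'))[0, 1:]
Out[38]: array([ 0.65811827,  0.5545646 ,  0.1752139 ,  0.24407105,  0.72499045])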

You can just use a simple for loop with scipy.spatial.distance.cosine:
import scipy.spatial.distance

# distance from the first row to every row (including row 0 itself, which gives 0.0)
dists = []
for row in matr:
    dists.append(scipy.spatial.distance.cosine(matr[0, :], row))
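Equivalently (a minor stylistic variant, not a different method), the loop can be written as a list comprehension:
dists = [scipy.spatial.distance.cosine(matr[0, :], row) for row in matr]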

Here's how you might calculate it easily by hand:
import numpy as np
from numpy.linalg import norm

M = np.random.randint(1, 11, size=(5, 5))  # create a demo matrix
# dot products of rows against themselves
DotProducts = M.dot(M.T)
# outer product of the row norms
NormKronecker = np.array([norm(M, axis=1)]) * np.array([norm(M, axis=1)]).T
CosineSimilarity = DotProducts / NormKronecker
CosineDistance = 1 - CosineSimilarity
np.set_printoptions(precision=2, suppress=True)
print(CosineDistance)
Output:
[[-0.    0.15  0.1   0.11  0.22]
 [ 0.15  0.    0.15  0.13  0.06]
 [ 0.1   0.15  0.    0.15  0.14]
 [ 0.11  0.13  0.15  0.    0.18]
 [ 0.22  0.06  0.14  0.18 -0.  ]]
This matrix is interpreted as, for example, "the cosine distance between row 3 and row 2 (or, equally, row 2 and row 3) is 0.15".
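As a quick sanity check (a sketch, assuming scipy is available), the hand-rolled matrix should agree with scipy's pairwise cosine distances up to floating-point error:
from scipy.spatial.distance import cdist
print(np.allclose(CosineDistance, cdist(M, M, metric='cosine')))  # True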


Efficient way to substitute repeating np.vstack in python?

I am trying to implement this post in python.
import numpy as np

x = np.array([0, 0, 0])
for r in range(3):
    x = np.vstack((x, np.array([-r, r, -r])))
x gets this value
array([[ 0,  0,  0],
       [ 0,  0,  0],
       [-1,  1, -1],
       [-2,  2, -2]])
I am concerned about the runtime efficiency of the repeated np.vstack. Is there a more efficient way to do this?
Build a list of arrays or lists, and apply np.array (or vstack) to that once:
In [598]: np.array([[-r,r,-r] for r in [0,0,1,2]])
Out[598]:
array([[ 0,  0,  0],
       [ 0,  0,  0],
       [-1,  1, -1],
       [-2,  2, -2]])
But if the column pattern is consistent, broadcasting two arrays against each other will be faster:
In [599]: np.array([-1,1,-1])*np.array([0,0,1,2])[:,None]
Out[599]:
array([[ 0,  0,  0],
       [ 0,  0,  0],
       [-1,  1, -1],
       [-2,  2, -2]])
Would it be useful to use numpy.tile?
N = 3
A = np.array([[0, *range(0, -N, -1)]]).T
B = np.tile(A, (1, N))
B[:,1] = -B[:,1]
The first line sets the expected number of rows after the first row of zeros. The second creates a NumPy array consisting of an initial value of 0 followed by the linear sequence 0, -1, -2, ..., down to -N + 1. Note the use of the splat operator, which unpacks the range object into individual list elements; these are concatenated with the leading 0, and we obtain a 2D NumPy array that is a column vector. The third line tiles this vector N times horizontally to get the desired shape. Finally, the fourth line negates the second column to produce your desired output.
Example Run
In [175]: N = 3
In [176]: A = np.array([[0, *range(0, -N, -1)]]).T
In [177]: B = np.tile(A, (1, N))
In [178]: B[:,1] = -B[:,1]
In [179]: B
Out[179]:
array([[ 0,  0,  0],
       [ 0,  0,  0],
       [-1,  1, -1],
       [-2,  2, -2]])
You can use np.block as follows:
First, create the block that you are currently building inside the for loop.
Finally, vertically stack a row of zeros on top of it using np.vstack to get the final desired answer.
import numpy as np

size = 3
sign = np.ones(size, dtype=int) * ((-1) ** np.arange(1, size + 1))  # repeating pattern -1, 1, -1, ...
A = np.ones((size, size), dtype=int)
B = np.arange(0, size) * A
B = sign * np.block([B.T])
# array([[ 0,  0,  0],
#        [-1,  1, -1],
#        [-2,  2, -2]])
answer = np.vstack([B[0], B])
# array([[ 0,  0,  0],
#        [ 0,  0,  0],
#        [-1,  1, -1],
#        [-2,  2, -2]])

Distance of an array of vectors from its own elements

I've got an array of vectors and I want to build a matrix that shows the distance between each pair of its vectors. For example, I've got this matrix with these two vectors:
[[a, b, c],
 [d, e, f]]
and I want to get the following, where dist is, for example, the Euclidean distance:
[[dist(vect1, vect1), dist(vect1, vect2)],
 [dist(vect2, vect1), dist(vect2, vect2)]]
So obviously I'm expecting a symmetric matrix with zeros on the diagonal. I tried something using scikit-learn.
# Create clusters containing the similar vectors from the clustering algo
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
list_cluster = [[] for x in range(0, n_clusters_ + 1)]
for index, label in enumerate(labels):
    if label == -1:
        list_cluster[n_clusters_].append(sparse_matrix[index])
    else:
        list_cluster[label].append(sparse_matrix[index])
vector_rows = []
for cluster in list_cluster:
    for row in cluster:
        vector_rows.append(row)
# Create my array of vectors per cluster order
sim_matrix = np.array(vector_rows)
# Build my resulting matrix
sim_matrix = metrics.pairwise.pairwise_distances(sim_matrix, sim_matrix)
The problem is my resulting matrix is not symmetric so I guess there is something wrong in my code.
I've added a small sample if you want to test; I computed it with the Euclidean distance, vector by vector:
input_matrix = [[0, 0, 0, 3, 4, 1, 0, 2],
                [0, 0, 0, 2, 5, 2, 0, 3],
                [2, 1, 1, 0, 4, 0, 2, 3],
                [3, 0, 2, 0, 5, 1, 1, 2]]
expected_result = [[0, 2, 4.58257569, 4.89897949],
                   [2, 0, 4.35889894, 4.47213595],
                   [4.58257569, 4.35889894, 0, 2.64575131],
                   [4.89897949, 4.47213595, 2.64575131, 0]]
The functions pdist and squareform will do the trick:
In [897]: import numpy as np
     ...: from scipy.spatial.distance import pdist, squareform
In [898]: input_matrix = np.asarray([[0, 0, 0, 3, 4, 1, 0, 2],
     ...:                            [0, 0, 0, 2, 5, 2, 0, 3],
     ...:                            [2, 1, 1, 0, 4, 0, 2, 3],
     ...:                            [3, 0, 2, 0, 5, 1, 1, 2]])
In [899]: squareform(pdist(input_matrix))
Out[899]:
array([[0.        , 2.        , 4.58257569, 4.89897949],
       [2.        , 0.        , 4.35889894, 4.47213595],
       [4.58257569, 4.35889894, 0.        , 2.64575131],
       [4.89897949, 4.47213595, 2.64575131, 0.        ]])
As expected, the resulting distance matrix is a symmetric array.
By default pdist computes the Euclidean distance. You can calculate a different distance by passing the proper value to the metric parameter in the function call. For example:
In [900]: squareform(pdist(input_matrix, metric='jaccard'))
Out[900]:
array([[0.        , 1.        , 0.875     , 0.71428571],
       [1.        , 0.        , 0.875     , 0.85714286],
       [0.875     , 0.875     , 0.        , 1.        ],
       [0.71428571, 0.85714286, 1.        , 0.        ]])
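If you would rather stay within scikit-learn (as in the question's code), passing a single dense array to metrics.pairwise_distances should give the same symmetric result; a sketch, assuming the sample input above (the In/Out numbers are illustrative):
In [901]: from sklearn.metrics import pairwise_distances
In [902]: D = pairwise_distances(input_matrix)   # Euclidean by default
In [903]: np.allclose(D, D.T)
Out[903]: True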

Python sparse intersection of matrices non-zero values

I have two sparse* adjacency matrices A1 and A2 of type 'numpy.int64'.
The nodes of the corresponding graphs are labeled by integers and the indices of the matrices correspond to these nodes (the matrix value being the link weight between the nodes).
I'm trying to compute a similarity measure between the graphs. To do this I need to find the adjacency matrix for the subgraph of each graph, which contains the nodes common to both graphs.
Nothing is guaranteed about the matrices having equal sizes or about which nodes they have in common.
The result should be the same adjacency matrices with values for nodes not in both graphs equal to zero.
Example:
A1:
array([[0, 1, 2, 1],
       [1, 0, 0, 0],
       [2, 0, 0, 0],
       [1, 0, 0, 0]])
A2:
array([[0, 0, 1],
       [0, 0, 0],
       [1, 0, 0]])
Outcome:
A1':
array([[0, 0, 2, 0],
       [0, 0, 0, 0],
       [2, 0, 0, 0],
       [0, 0, 0, 0]])
A2':
array([[0, 0, 1],
       [0, 0, 0],
       [1, 0, 0]])
The matrices I'm using are on the order of 10^5 x 10^5. The resulting size doesn't matter; I'll slice down the size of the smallest afterwards.
I'll be repeating this operation many times and so speed is important.
Attempts so far:
I can get the list of common nodes by:
np.intersect1d(A1.nonzero()[0], A2.nonzero()[0])
But I can't find a way of using this as a filter to map the values for indices not in this list to 0.
*I don't think I necessarily need to use sparse matrices, though they are very preferable for scalability later.
If I understand your question correctly, based on the example you have provided, you can simply use the numpy.in1d method to get a boolean index array, for example:
A1 = np.array([[0, 1, 2, 1],
               [1, 0, 0, 0],
               [2, 0, 0, 0],
               [1, 0, 0, 0]])
A2 = np.array([[0, 0, 1],
               [0, 0, 0],
               [1, 0, 0]])
idx = np.in1d(A1, A2).reshape(A1.shape)
A1[idx] = 0
print(A1)
# prints
[[0 0 2 0]
 [0 0 0 0]
 [2 0 0 0]
 [0 0 0 0]]
For sparse matrices, the right solution depends on which sparse format you are using. If you are using the csr or csc format, you can apply the same technique to the stored coefficients of the matrix, A1.data, and then use the resulting boolean array (idx) to modify the corresponding indices, i.e. A1.indices and A1.indptr.
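Here is a minimal sketch of that idea for CSR matrices (assuming A1 and A2 are scipy.sparse.csr_matrix and, as above, matching is done on the stored values):
import numpy as np
from scipy import sparse

A1 = sparse.csr_matrix([[0, 1, 2, 1],
                        [1, 0, 0, 0],
                        [2, 0, 0, 0],
                        [1, 0, 0, 0]])
A2 = sparse.csr_matrix([[0, 0, 1],
                        [0, 0, 0],
                        [1, 0, 0]])

# zero out the stored values of A1 that also occur as values in A2
idx = np.in1d(A1.data, A2.data)
A1.data[idx] = 0
A1.eliminate_zeros()   # drops the explicit zeros, updating indices and indptr
print(A1.toarray())
# [[0 0 2 0]
#  [0 0 0 0]
#  [2 0 0 0]
#  [0 0 0 0]]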

How Do I create a Binomial Array of specific size

I am trying to generate a numpy array of length 100 randomly filled with sets of 5 1s and 0s as such:
[ [1,1,1,1,1] , [0,0,0,0,0] , [0,0,0,0,0] ... [1,1,1,1,1], [0,0,0,0,0] ]
Essentially, at each position there should be a 50% chance of five 1s and a 50% chance of five 0s.
Currently, I have been messing about with numpy.random.binomial(), and tried running:
numpy.random.binomial(1, .5 , (100,5))
but this creates an array as such:
[ [0,1,0,0,1] , [0,1,1,1,0] , [1,1,0,0,1] ... ]
I need each set of elements to be consistent within itself, not random. How can I do this?
Use numpy.random.randint to generate a random column of 100 1s and 0s, then use tile to repeat the column 5 times:
>>> numpy.tile(numpy.random.randint(0, 2, size=(100, 1)), 5)
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0],
       ...
You want to create a temporary array of zeros and ones, and then randomly index into that array to create a new array. In the code below, the first line creates an array whose 0'th row contains all zeros and whose 1st row contains all ones. The function randint returns a random sequence of zeros and ones, which can be used as indices into the temporary array.
import numpy as np
...
def make_coded_array(n, k=5):
    codes = np.array([[0] * k, [1] * k])   # row 0 is all zeros, row 1 is all ones
    return codes[np.random.randint(2, size=n)]
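For example (a usage sketch):
coded = make_coded_array(100)   # shape (100, 5); each row is all 0s or all 1s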
import numpy as np
import random

c = random.sample([[0, 0, 0, 0, 0], [1, 1, 1, 1, 1]], 1)
for i in range(99):
    # np.vstack keeps the result 2-D; np.append would flatten it
    c = np.vstack((c, random.sample([[0, 0, 0, 0, 0], [1, 1, 1, 1, 1]], 1)))
Not the most efficient way though
Use numpy.ones and numpy.random.binomial
>>> numpy.ones((100, 5), dtype=numpy.int64) * numpy.random.binomial(1, .5, (100, 1))
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       ...
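Equivalently (a sketch of the same idea), you can repeat the random column directly instead of multiplying by a matrix of ones:
>>> numpy.repeat(numpy.random.binomial(1, .5, (100, 1)), 5, axis=1)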

Scikit-learn χ² (chi-squared) statistic and corresponding contingency table

In the docs for the chi-squared univariate feature selection function of scikit-learn http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html, it states
This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.
I am struggling to understand what the corresponding contingency table would look like, especially in the case of frequency features.
For example, consider the below dataset with boolean features and targets:
import numpy as np
>>> X = np.random.randint(2, size=50).reshape(10, 5)
array([[1, 0, 0, 0, 1],
       [1, 1, 0, 1, 1],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 1, 1, 1],
       [0, 1, 1, 0, 0],
       [1, 0, 1, 1, 1],
       [1, 1, 1, 1, 0]])
>>> y = np.random.randint(2, size=10)
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
To construct the contingency table with respect to the first feature, we can do this (excuse my PEP8 violation)
import scipy as sp
>>> contingency_table = sp.sparse.coo_matrix(
...     (np.ones_like(y), (X[:, 0], y)),
...     shape=(np.unique(X[:, 0]).shape[0], np.unique(y).shape[0])).A
array([[1, 2],
       [3, 4]])
So now I can calculate the chi-squared statistic and its p-values
>>> sp.stats.chi2_contingency(contingency_table)
(0.17857142857142855,
 0.67260381744151676,
 1,
 array([[ 1.2,  1.8],
        [ 2.8,  4.2]]))
And this ought to be consistent with scikit-learn's chi2
from sklearn.feature_selection import chi2
>>> chi2_, pval = chi2(X, y)
>>> chi2_[0], pval[0]
(0.023809523809523787, 0.87737055606414338)
...Nope. Have I misinterpreted something?
Also, what does the contingency table look like in the case of frequencies? I assumed it would be something like
contingency_table = sp.sparse.coo_matrix(
    (np.ones_like(y), (X[:, 0], y)),
    shape=(X[:, 0].max() + 1, np.unique(y).shape[0])).A
But the corresponding table of expected frequencies will most likely have several zero elements.
Edit:
To clarify further, consider the first feature X[:, 0], which is, say, gender, and the targets y, say, handedness.
From this we get the cross tabulation
                 Right-handed   Left-handed (!right-handed)
Male                  1                   2
Female (!male)        3                   4
And we can assess the significance of the difference between the two proportions using the chi-squared test, comparing the observed counts with the expected frequencies under independence.
sklearn.feature_selection.chi2 does this directly without resorting to explicitly computing the table and obtains the scores using a more efficient procedure that is equivalent to scipy.stats.chisquare.
After explicitly enumerating the table shown above, I wanted to verify it is consistent with chi2 when applying scipy.stats.chi2_contingency and to my dismay, it isn't. I'd like to ask why it isn't.
Consider a column x of X. sklearn.feature_selection.chi2 tests whether
the frequencies of the y values where x is 1 agree with the frequencies of y in
the full population. (#larsman's answer shows how you can reproduce the calculation with numpy and scipy.) This is not the same as the standard 2x2 contingency table
analysis of x and y. In a 2x2 contingency table analysis, the frequencies of y
where x is 0 also contribute to the test.
Suppose we form the contingency table for x and y:
    |  y=0   y=1
----+------------
x=0 |   a     b
x=1 |   c     d
Let n = a + b + c + d. This is the number of samples (i.e. same as len(x) and len(y)).
Let nx = c + d. This is the number of occurrences of 1 in x.
Let py1 = (b + d)/n. This is the fraction of the full population where y is 1.
sklearn.feature_selection.chi2 performs a chi2 test on [c, d] using the expected
values [(1-py1)*nx, py1*nx]. This is not the same as the standard contingency table
analysis of a 2x2 table.
Here's an extreme example. Suppose the 2x2 contingency table for x and y is
    |  y=0   y=1
----+------------
x=0 |   8     8
x=1 |  20   188
The sklearn calculation produces a chi2 score of 1.58, with a p-value of 0.208.
The contingency table analysis of scipy.stats.chi2_contingency gives a chi2 score of 18.6, with a p-value of 1.60e-5.
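To make the difference concrete, here is a small sketch (using only the numbers quoted above) that reproduces both figures:
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# 2x2 contingency table: rows are x=0 and x=1, columns are y=0 and y=1
table = np.array([[8, 8],
                  [20, 188]])

# what sklearn.feature_selection.chi2 effectively does: test [c, d]
# against expected counts based on the full-population frequency of y=1
n = table.sum()
nx = table[1].sum()            # number of samples with x = 1
py1 = table[:, 1].sum() / n    # fraction of the population with y = 1
score, pval = chisquare(table[1], f_exp=[(1 - py1) * nx, py1 * nx])
print(score, pval)             # approximately 1.58 and 0.208

# the standard 2x2 contingency table analysis
chi2_stat, p, dof, expected = chi2_contingency(table)
print(chi2_stat, p)            # approximately 18.6 and 1.6e-5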
Given your data,
>>> X = array([[1, 0, 0, 0, 1],
...            [1, 1, 0, 1, 1],
...            [1, 0, 0, 0, 0],
...            [0, 0, 0, 0, 0],
...            [0, 0, 0, 0, 1],
...            [1, 0, 0, 0, 1],
...            [1, 0, 1, 1, 1],
...            [0, 1, 1, 0, 0],
...            [1, 0, 1, 1, 1],
...            [1, 1, 1, 1, 0]])
>>> y = array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
this is what feature_selection.chi2 computes:
>>> Y = np.vstack([1 - y, y])
>>> observed = np.dot(Y, X)
>>> observed
array([[3, 1, 1, 2, 2],
       [4, 2, 3, 2, 4]])
These are the observed feature frequencies, per class, i.e. the contingency table. Then the expected values:
>>> feature_count = X.sum(axis=0)
>>> class_prob = Y.mean(axis=1)
>>> expected = np.dot(feature_count.reshape(-1, 1), class_prob.reshape(1, -1)).T
>>> expected
array([[ 2.8,  1.2,  1.6,  1.6,  2.4],
       [ 4.2,  1.8,  2.4,  2.4,  3.6]])
Finally, it runs a χ² test:
>>> from scipy.stats import chisquare
>>> score, pval = chisquare(observed, expected)
>>> score
array([ 0.02380952, 0.05555556, 0.375 , 0.16666667, 0.11111111])
>>> pval
array([ 0.87737056, 0.81366372, 0.54029137, 0.6830914 , 0.73888268])
The scores are the relevant bit: they're used to sort the features by discriminative power. Note that you get one score and one p-value per feature.
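As a quick cross-check (a sketch), the per-feature scores and p-values computed above should match what sklearn reports for the same X and y:
>>> from sklearn.feature_selection import chi2
>>> chi2_sk, pval_sk = chi2(X, y)
>>> np.allclose(chi2_sk, score), np.allclose(pval_sk, pval)
(True, True)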
