python hcluster, distance matrix and condensed distance matrix - python

I'm using the module hcluster to calculate a dendrogram from a distance matrix. My distance matrix is an array of arrays generated like this:
import hcluster
import numpy as np
mols = (..a list of molecules)
distMatrix = np.zeros((10, 10))
for i in range(0,10):
    for j in range(0,10):
        sim = OETanimoto(mols[i],mols[j]) # a function to calculate similarity between molecules
        distMatrix[i][j] = 1 - sim
I then use the command distVec = hcluster.squareform(distMatrix) to convert the matrix into a condensed vector and calculate the linkage matrix with vecLink = hcluster.linkage(distVec).
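For reference, a condensed distance vector is the strict upper triangle of the square matrix, read row by row. The NumPy-only sketch below builds one by hand. It assumes hcluster.squareform uses this same ordering (the one pdist returns), and the 4 x 4 matrix is made up for illustration.

```python
import numpy as np

# Hypothetical symmetric distance matrix with a zero diagonal.
dist_matrix = np.array([
    [0.0, 0.2, 0.5, 0.9],
    [0.2, 0.0, 0.4, 0.7],
    [0.5, 0.4, 0.0, 0.3],
    [0.9, 0.7, 0.3, 0.0],
])

n = dist_matrix.shape[0]
# Indices of the strict upper triangle (k=1 skips the diagonal), row by row.
rows, cols = np.triu_indices(n, k=1)
condensed = dist_matrix[rows, cols]

print(condensed)
print(condensed.size)  # n * (n - 1) / 2 entries
```

For the 10 x 10 matrix in the question, the condensed vector would have 45 entries.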
All this works fine, but if I calculate the linkage matrix from the distance matrix rather than the condensed vector, matLink = hcluster.linkage(distMatrix), I get a different linkage matrix (the distances between the nodes are a lot larger and the topology is slightly different).
Now I'm not sure whether this is because hcluster only works with condensed vectors or whether I'm making mistakes on the way there.
Thanks for your help!

I knocked up a quick random example similar to yours and experienced the same problem.
In the docstring it does say :
Performs hierarchical/agglomerative clustering on the
condensed distance matrix y. y must be a :math:{n \choose 2} sized
vector where n is the number of original observations paired
in the distance matrix.
However, having had a quick look at the code, it seems the intent is for it to work with both vector-shaped and matrix-shaped input:
In hierarchy.py there is a switch based upon the shape of the matrix.
It seems however that the key bit of info is in the function linkage's docstring:
- Q : ndarray
A condensed or redundant distance matrix. A condensed
distance matrix is a flat array containing the upper
triangular of the distance matrix. This is the form that
``pdist`` returns. Alternatively, a collection of
:math:`m` observation vectors in n dimensions may be passed as
a :math:`m` by :math:`n` array.
So I think that the interface doesn't allow the passing of a distance matrix.
Instead it thinks you are passing it m observation vectors in n dimensions .
Hence the difference in result?
Does that seem reasonable?
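A minimal NumPy-only sketch of why that reading would inflate the distances, assuming the 2-D input is treated as observation vectors and compared with the Euclidean metric (the random points are hypothetical, and hcluster itself is not called here). If each row of a square distance matrix is treated as an observation, the "distance" between rows i and j is at least sqrt(2) times the true distance, because the two entries involving the zero diagonal alone contribute 2 * d(i, j)^2.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((6, 3))  # hypothetical observations

# True pairwise Euclidean distances: the redundant (square) distance matrix.
diff = points[:, None, :] - points[None, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))

# What a routine computes if it treats each ROW of dist_matrix as an
# observation vector and takes Euclidean distances between those rows.
row_diff = dist_matrix[:, None, :] - dist_matrix[None, :, :]
row_dist = np.sqrt((row_diff ** 2).sum(axis=-1))

# Compare the two on the off-diagonal entries only.
off_diag = ~np.eye(len(points), dtype=bool)
ratio = row_dist[off_diag] / dist_matrix[off_diag]
print(ratio.min())  # never below sqrt(2), up to rounding
```

This would be consistent with the larger node distances reported in the question, and since the inflation is not uniform across pairs, the merge order can change as well.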
Otherwise, just take a look at the code itself; I'm sure you'll be able to debug it and figure out why your examples are different.
Cheers
Matt

Related

Implementation of Hellinger distance with numpy only

I got this task to implement a python function using NumPy.
The function should compute the Hellinger distance between two matrices P and Q with dimensions (n, k). p_i is the vector of row i of P and p_i,j is the value of row i in column j of P.
The Hellinger distance for matrices is defined as follows:
h_i = 1/sqrt(2) * sqrt( sum_{j=1}^{k} (sqrt(p_i,j) - sqrt(q_i,j))^2 )
H is a vector of length n and h_i is the value i of H, with i = 1,...,n. So the Hellinger distance between two matrices is equivalent to the Hellinger distance between the rows of the matrices. For each row, the distance is stored in the output vector H.
The task now is to implement the function (using NumPy), which will compute the above-described problem. It gets handed over two 2D-NumPy-Arrays P and Q, and it should return a 1D-Numpy-Array H of the right length.
I have never worked with NumPy before, so I would be very grateful for any suggestions.
I have read a little of the NumPy docs already.
I found out that you need to use the axis argument in certain NumPy functions (e.g. np.sum()) in order to tell NumPy if it should iterate over the rows or columns of an array. I did exactly that: return np.sqrt(1/2) * np.sqrt( np.sum((np.sqrt(P) - np.sqrt(Q))**2,axis=1) ) and it works.
The only problem is that it still gives back negative values. How is that possible, since the subtraction is taken to the power of 2?
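The one-liner above can be wrapped into a function as in the sketch below. The function name hellinger and the two small test matrices are made up for illustration. Every factor in the expression is a square root or a square, so for real, non-negative inputs the result cannot be negative; negative entries in P or Q would produce NaN from np.sqrt instead, so the inputs are worth checking.

```python
import numpy as np

def hellinger(P, Q):
    """Row-wise Hellinger distance between two (n, k) arrays."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # axis=1 sums over the k columns, leaving one value per row.
    squared_diff = (np.sqrt(P) - np.sqrt(Q)) ** 2
    return np.sqrt(squared_diff.sum(axis=1)) / np.sqrt(2.0)

# Hypothetical inputs: each row is a probability distribution.
P = np.array([[1.0, 0.0], [0.5, 0.5]])
Q = np.array([[0.0, 1.0], [0.5, 0.5]])
H = hellinger(P, Q)
print(H)
```

The first pair of rows has disjoint support, which gives the maximum distance; the second pair is identical, which gives zero.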

Generating invertible matrices in numpy/tensorflow

I would like to generate invertible matrices (specifically those from GL(n), a general linear group of size n) using Tensorflow and/or Numpy for use with my neural network.
How can this be done and what would be the best way of doing so?
I understand there is a way to generate symmetric invertible matrices by computing (A + A.T)/2 for arbitrary square matrices A, however, I would like mine to not just be symmetric.
I happened to have found one way which I believe can generate a large variety of random invertible matrices using diagonal dominance.
The theorem is that, given an nxn matrix, if the absolute value of each diagonal element is strictly larger than the sum of the absolute values of all the other elements in its row, and this holds for every row, then the matrix is invertible. (Here is the corresponding Wikipedia article: https://en.wikipedia.org/wiki/Diagonally_dominant_matrix)
Therefore the following code snippet generates an arbitrary invertible matrix.
n = 5 # size of invertible matrix I wish to generate
m = np.random.rand(n, n)
mx = np.sum(np.abs(m), axis=1)
np.fill_diagonal(m, mx)
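A quick way to check the construction is sketched below, using a seeded generator so the run is repeatable (the seed and variable names are arbitrary choices, not part of the original snippet). It verifies strict row diagonal dominance and that the computed inverse multiplies back to the identity within floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # size of the invertible matrix
m = rng.random((n, n))

# Put each row's total absolute sum on the diagonal, as in the snippet above.
row_sums = np.abs(m).sum(axis=1)
np.fill_diagonal(m, row_sums)

# Sum of absolute off-diagonal entries per row.
off = np.abs(m)
np.fill_diagonal(off, 0.0)
off_diag_sums = off.sum(axis=1)

strictly_dominant = bool(np.all(np.abs(np.diag(m)) > off_diag_sums))
inverse_ok = np.allclose(m @ np.linalg.inv(m), np.eye(n))
print(strictly_dominant, inverse_ok)
```

Note that this only samples matrices with non-negative entries and a heavy diagonal, which is a small part of GL(n).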

Matrix inverse in numpy/python not giving correct matrix?

I have an nxn matrix C and use inv from numpy.linalg to take the inverse to get Cinverse. My C matrix has elements of order 10**4 but my Cinverse matrix has elements of order 10**12 and higher (not sure if that's correct). When I do numpy.dot(C,Cinverse), I do not get the identity matrix. Why is this?
I have a vector x which I multiply by itself to get a matrix.
x = np.array([121.41191662, 74.22830468, 73.23156336, 75.48354975,
              79.89580817])
c = np.outer(x, x)
this is a 5x5 matrix.
then I get its inverse by
from numpy.linalg import inv
cinverse=inv(c)
then I want to see if I can get identity matrix back.
identity = np.dot(c, cinverse)
However, I do not get the identity matrix. cinverse has very large matrix elements
around 10**13 and higher while c has matrix elements around 10,000.
The outer product of two vectors (be they the same or not) is not invertible. Since it is just a stack of scaled copies of the same vector, its rank is one. Rank-deficient matrices cannot be inverted.
I'm surprised that numpy is not raising an exception or at least giving a warning.
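One way to catch this before calling inv is to look at the numerical rank or the condition number, as in the sketch below. It reuses the vector from the question; the threshold used to decide what counts as "too ill-conditioned" is left to the reader.

```python
import numpy as np

x = np.array([121.41191662, 74.22830468, 73.23156336, 75.48354975,
              79.89580817])
c = np.outer(x, x)

# Numerical rank from the SVD; an outer product has one nonzero singular value.
rank = np.linalg.matrix_rank(c)

# Ratio of largest to smallest singular value; enormous (or inf) when singular.
cond = np.linalg.cond(c)

print(rank, c.shape[0])
print(cond)
```

A rank below the matrix size, or a condition number near the reciprocal of machine epsilon, means the output of inv should not be trusted.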
So here is some code that generates the inverse matrix, and I will comment about it afterwards.
import numpy as np
x = np.random.rand(5,5)*10000 # makes a 5x5 matrix with elements around 10000
xinv = np.linalg.inv(x)
iden = np.dot(x,xinv)
Now the first line of your iden matrix probably looks something like this:
[ 1.00000000e+00, -2.05382445e-16, -5.61067365e-16, 1.99719718e-15, -2.12322957e-16]
Notice that the first element is exactly 1, as it should be, but the others are not exactly 0. They are essentially zero, however, and should be regarded as zero to within machine precision.
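In practice, a comparison like this is usually done with a tolerance rather than by eye. The sketch below uses np.allclose with its default tolerances; the seed is arbitrary and only there to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((5, 5)) * 10000  # 5x5 matrix with elements up to 10000
xinv = np.linalg.inv(x)
iden = x @ xinv

# Tolerance-based comparison: treats values like 1e-16 as zero.
is_identity = np.allclose(iden, np.eye(5))

# Exact comparison: generally fails because of rounding error.
exactly_identity = np.array_equal(iden, np.eye(5))

print(is_identity, exactly_identity)
```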

Array is too big for euclidean distance

my_array is a sparse 78000 x 200 matrix of zeros and ones; many rows are only zeros. I am trying to calculate the euclidean distance. Further I use the multidimensional scaling to get the coordinates of every column (every column is a word in a vocabulary).
I get the error "array is too big" while calculating the euclidean distance. There are other similar questions, but I don't know how to apply them in this case. What I imagine is that if I reduce the precision of the "dist array" it will be smaller, but I don't know how to do that. Working with sparse matrices or np.memmap might also help, but my_array itself is not the problem. The problem starts when it tries to keep all the distance values, so I need to integrate one of these techniques into the dist array calculation. The dist array is a 78000 x 78000 matrix.
So my question is, how do I integrate any of these techniques in the calculation of the euclidean distance?
Or would it make sense to loop through dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))? And adjust the data type somewhere in there?
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import euclidean_distances
my_array = np.array(Y[2:])
dist = euclidean_distances(my_array)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)
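A 78000 x 78000 matrix holds about 6.1 billion values: roughly 49 GB in float64 and about 24 GB in float32. The sketch below shows one way to combine the ideas from the question, under stated assumptions: it uses the expansion dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)), works in float32, and fills a preallocated output one block of rows at a time. The function name chunked_euclidean and the chunk size are made up for illustration; the output array could be an np.memmap backed by a file instead of an in-memory array. It only addresses the distance computation, not whether the downstream MDS step can cope with a matrix of that size.

```python
import numpy as np

def chunked_euclidean(X, out, chunk_size=1000):
    """Fill `out` (n x n, preallocated) with pairwise Euclidean distances."""
    X = np.asarray(X, dtype=np.float32)
    sq_norms = np.einsum("ij,ij->i", X, X)  # dot(x, x) for every row
    n = X.shape[0]
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Squared distances between rows [start:stop) and all rows.
        block = (sq_norms[start:stop, None]
                 - 2.0 * (X[start:stop] @ X.T)
                 + sq_norms[None, :])
        # Rounding can leave tiny negative values; clip before the sqrt.
        np.maximum(block, 0.0, out=block)
        out[start:stop] = np.sqrt(block)
    return out

# Small hypothetical stand-in for the 78000 x 200 zero/one matrix.
rng = np.random.default_rng(0)
my_array = (rng.random((300, 200)) < 0.05).astype(np.float32)
dist = np.empty((300, 300), dtype=np.float32)  # or an np.memmap
chunked_euclidean(my_array, dist, chunk_size=64)
print(dist.shape, dist.dtype)
```

Only one block of chunk_size x n values is held as a temporary at any time, so peak memory beyond the output itself is controlled by chunk_size.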

What's wrong with my PCA?

My code:
from numpy import *
def pca(orig_data):
    data = array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    u, s, v = linalg.svd(data)
    print(s)  # should be s**2 instead!
    print(v)
def load_iris(path):
    lines = []
    with open(path) as input_file:
        lines = input_file.readlines()
    data = []
    for line in lines:
        cur_line = line.rstrip().split(',')
        cur_line = cur_line[:-1]
        cur_line = [float(elem) for elem in cur_line]
        data.append(array(cur_line))
    return array(data)
if __name__ == '__main__':
    data = load_iris('iris.data')
    pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Output:
[ 20.89551896 11.75513248 4.7013819 1.75816839]
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - Same as I got but transposed so okay I guess
Also, what's with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:
cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print val
But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should have 4 eigenvalues and not 150 like the eig gives me. Am I doing something wrong?
edit: I've noticed that the values differ by 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to add to be equal to the number of dimensions, in this case, 4. What I don't understand is why this difference is happening. If I simply divided the eigenvalues by len(data) I could get the result I want, but I don't understand why. Either way the proportion of the eigenvalues isn't altered, but they are important to me so I'd like to understand what's going on.
You decomposed the wrong matrix.
Principal Component Analysis requires manipulating the eigenvectors/eigenvalues
of the covariance matrix, not the data itself. For an m x n data matrix (m observations, n features), the covariance matrix is n x n; the closely related correlation matrix additionally has ones along the main diagonal.
You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef, which returns the correlation matrix:
import numpy as NP
import numpy.linalg as LA
# a simulated data set with 8 data points, each point having five features
# (cast to float so the in-place mean centering below works on the array)
data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)
# usually a good idea to mean center your data first:
data -= NP.mean(data, axis=0)
# calculate the correlation matrix (columns are the variables)
C = NP.corrcoef(data, rowvar=False)
# returns a 5 x 5 matrix (one row and one column per feature)
# now get the eigenvalues/eigenvectors of C:
evals, evecs = LA.eig(C)
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) LA module: it is a little easier to work with than svd, since the return values are the eigenvalues and eigenvectors themselves and nothing else. By contrast, as you know, svd doesn't return these directly.
Granted, the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however, when doing PCA you'll always have a square matrix to decompose, regardless of the form that your data is in. This is because the matrix you are decomposing in PCA is a covariance matrix, which by definition is always square (i.e., its rows and columns both index the features of the original data, and each cell is the covariance of the corresponding pair of features; in the correlation matrix, the ones down the main diagonal reflect that each feature is perfectly correlated with itself).
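The left singular vectors returned by SVD(A) are the eigenvectors of AA^T.
The covariance matrix of a (mean-centered) dataset A is: 1/(N-1) * AA^T
Now, when you do PCA by using the SVD, you have to divide the squared singular values by (N-1) (equivalently, divide each entry in your A matrix by sqrt(N-1) before decomposing) so you get the eigenvalues of the covariance with the correct scale.
In your case, N=150 and you haven't done this division, hence the discrepancy. (Because data.std() normalizes by N rather than N-1 by default, dividing your squared singular values by N=150 reproduces the expected 2.9108, 0.9212, 0.1474 and 0.0206.)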
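The scaling can be checked numerically with a short sketch on synthetic data (the random matrix is a stand-in, not the iris data). Note the orientation: here each row is an observation, as in the question's code, so the covariance is A^T A / (N-1) and the principal directions are the right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(150, 4))      # hypothetical: 150 observations, 4 features
centered = data - data.mean(axis=0)
n_obs = centered.shape[0]

# Route 1: SVD of the centered data, then rescale the squared singular values.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
eigvals_from_svd = s ** 2 / (n_obs - 1)

# Route 2: eigendecomposition of the 4 x 4 covariance matrix.
cov = np.cov(centered, rowvar=False)  # np.cov normalizes by N - 1 by default
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = eigvals[::-1]               # eigh sorts ascending; svd sorts descending

print(eigvals_from_svd)
print(eigvals)
```

Both routes give the same four eigenvalues; without the division, the SVD route is off by exactly the factor N-1.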
This is explained in detail here
(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)
You probably used cov incorrectly by not transposing the matrix first. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.
Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:
a. the left singular vectors of X,
b. the principal components of X,
c. the eigenvectors of X X^T.
Also, the eigenvalues of X X^T are equal to the square of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
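Carrying out that substitution, using only that Q and V have orthonormal columns (so Q^T Q = I and V^T V = I), gives the following short derivation:

```latex
\begin{aligned}
X X^{T} &= (Q S V^{T})(Q S V^{T})^{T} = Q S V^{T} V S^{T} Q^{T} = Q \, S S^{T} Q^{T}, \\
D &= Q^{T} X X^{T} Q = Q^{T} Q \, S S^{T} Q^{T} Q = S S^{T}.
\end{aligned}
```

Since S is diagonal, S S^T is diagonal with entries sigma_i^2, so the eigenvalues of X X^T are the squared singular values of X and the columns of Q are the corresponding eigenvectors.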
Question already addressed: Principal component analysis in Python
