Looking for an effective way to calculate affinity matrix - python

Well, suppose that I have an n by n image, which is ordered lexicographically into a column vector x of dimensions n^2 by 1.
I want to calculate the following affinity matrix for the image:
W[i,j] = exp(-(x[i] - x[j])^2 / sigma)  if max(|r_i - r_j|, |c_i - c_j|) <= R and i != j, and W[i,j] = 0 otherwise,
where (r_i, c_i) are the row and column of pixel i in the original image.
Clearly, the matrix is symmetric, and we can use this property to reduce the calculations. This is my attempt to implement the above matrix calculation:
import numpy as np

def w(R, sigma, x):
    n = x.shape[0]
    img = x.ravel()                           # lexicographic (row-major) pixel vector of length n^2
    N = img.shape[0]
    w = np.zeros((N, N), dtype=np.float64)    # the n^2 by n^2 affinity matrix
    for i in range(N):
        for j in range(N):
            # skip pairs farther apart than R (Chebyshev distance on pixel coordinates) and the diagonal
            if max(abs(i // n - j // n), abs(i % n - j % n)) > R or i == j:
                continue
            w[i, j] = np.exp(-(int(img[i]) - int(img[j])) ** 2 / sigma)
    return w
where R is the neighborhood radius (a threshold on the Chebyshev distance between pixel positions), sigma is the exponential damping factor, and x is the image.
However, this code is very inefficient: the double loop visits every pair of pixels, so it is quadratic in the number of pixels (O(n^4) in the image side length n). Do you have any better idea for implementing this faster?
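For reference, here is a vectorized sketch of the same computation using NumPy broadcasting. It assumes the full n^2 by n^2 matrix fits in memory, and the function name w_vectorized is mine:

import numpy as np

def w_vectorized(R, sigma, x):
    n = x.shape[0]
    img = x.ravel().astype(np.float64)            # lexicographic pixel vector of length n^2
    rows, cols = np.divmod(np.arange(n * n), n)   # row and column index of every pixel
    # Chebyshev distance between every pair of pixel coordinates
    cheb = np.maximum(np.abs(rows[:, None] - rows[None, :]),
                      np.abs(cols[:, None] - cols[None, :]))
    w = np.exp(-(img[:, None] - img[None, :]) ** 2 / sigma)
    w[(cheb > R) | (cheb == 0)] = 0               # zero out far pairs and the diagonal
    return w

This removes the Python-level double loop but still materializes the dense matrix; for large images a sparse construction over each pixel's (2R+1) by (2R+1) neighborhood would be the next step.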

Related

Generating invertible matrices in numpy/tensorflow

I would like to generate invertible matrices (specifically those from GL(n), a general linear group of size n) using Tensorflow and/or Numpy for use with my neural network.
How can this be done and what would be the best way of doing so?
I understand there is a way to generate symmetric invertible matrices by computing (A + A.T)/2 for arbitrary square matrices A; however, I would like mine not to be restricted to symmetric matrices.
I happened to have found one way which I believe can generate a large variety of random invertible matrices using diagonal dominance.
The theorem is that given an n x n matrix, if for every row the absolute value of the diagonal element is larger than the sum of the absolute values of the other elements in that row, then the matrix is invertible (it is strictly diagonally dominant; here is the corresponding Wikipedia article: https://en.wikipedia.org/wiki/Diagonally_dominant_matrix).
Therefore the following code snippet generates an arbitrary invertible matrix.
import numpy as np

n = 5                              # size of the invertible matrix I wish to generate
m = np.random.rand(n, n)           # random entries in [0, 1)
mx = np.sum(np.abs(m), axis=1)     # row sums of absolute values
np.fill_diagonal(m, mx)            # each diagonal entry now dominates its row
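A quick sanity check (my addition) that the construction really is invertible:

# a strictly diagonally dominant matrix is nonsingular, so this should not raise
np.linalg.inv(m)
print(np.linalg.matrix_rank(m) == n)   # True: full rank

One caveat: strict diagonal dominance only covers a subset of GL(n), so this does not sample uniformly from all invertible matrices.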

How to find the Autocorrelation Matrix of a 1D array/vector

I have a 1D array of size n which represents a signal in the time domain. I need to find the autocorrelation matrix of this signal using Python, then I'll compute the eigenvectors and eigenvalues of this matrix.
What I've tried is to use the Toeplitz method from scipy.linalg as follows:
res = scipy.linalg.toeplitz(c=np.asarray(signal),r=np.asarray(signal))
eigenValues,eigenVectors = numpy.linalg.eig(res)
I'm not sure if that's correct, because on Matlab forums I saw a quite different solution.
Terminology about correlations is confusing, so let me take care in defining what it sounds like you want to compute.
Autocorrelation matrix of a random signal
"Autocorrelation matrix" is usually understood as a characterization of random vectors: for a random vector (X[1], ..., X[N]) where each element is a real-valued random variable, the autocorrelation matrix is an NxN symmetric matrix R_XX whose (i,j)th element is
R_XX[i,j] = E[X[i] ⋅ X[j]]
and E[⋅] denotes expectation.
To reasonably estimate an autocorrelation matrix, you need multiple observations of random vector X to estimate the expectations. But it sounds like you have only one 1D array x. If we nevertheless apply the above formula, expectations simplify away to
R_XX[i,j] = E[X[i] ⋅ X[j]] ~= x[i] ⋅ x[j].
In other words, the matrix degenerates to the outer product np.outer(x, x), a rank-1 matrix with one nonzero eigenvalue. But this is an awful estimate of R_XX and doesn't reveal new insight about the signal.
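A minimal illustration of that degeneration (the variable names here are mine):

import numpy as np

x = np.random.randn(100)                  # a single observed 1D signal
R_outer = np.outer(x, x)                  # the "estimate" from one observation
print(np.linalg.matrix_rank(R_outer))     # 1: rank-1, a single nonzero eigenvalue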
Autocorrelation for a WSS signal
In signal processing, a common modeling assumption is that a signal is "wide-sense stationary" (WSS), meaning that any time shift of the signal has the same statistics. This assumption is useful precisely because it lets the expectations above be estimated from a single observation of the signal:
R_XX[i,j] = E[X[i] ⋅ X[j]] ~= sum_n (x[i + n] ⋅ x[j + n])
where the sum over n is over all samples. For simplicity, imagine in this description that x is a signal that goes on infinitely. In practice on a finite-length signal, something has to be done at the signal edges, but I'll gloss over this. Equivalently by the change of variable m = i + n we have
R_XX[i,j] = E[X[i] ⋅ X[j]] ~= sum_m (x[m] ⋅ x[j - i + m]),
with i and j only appearing together as a difference (j - i) in the right-hand side. So this autocorrelation is usually indexed in terms of the "lag" k = j - i,
R_xx[k] = sum_m (x[m] ⋅ x[k + m]).
Note that this results in a 1D array rather than a matrix. You can compute it for instance with scipy.signal.correlate(x, x) in Python or xcorr(x, x) in Matlab. Again, I'm glossing over boundary handling at the signal edges; see the documentation of those functions for the options they provide.
You can relate the 1D correlation array R_xx[k] with the matrix R_XX[i,j] by
R_XX[i,j] ~= R_xx[j - i]
which like you said is Toeplitz.
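Putting this together, here is a small sketch (under the WSS assumption, using the 'full' correlation and ignoring edge-effect normalization) of how the 1D autocorrelation relates to the Toeplitz matrix:

import numpy as np
from scipy.signal import correlate
from scipy.linalg import toeplitz

x = np.random.randn(50)                     # one observed signal
r = correlate(x, x, mode='full')            # R_xx[k] for lags k = -(n-1) .. n-1
zero_lag = len(x) - 1                       # index of lag k = 0
# R_XX[i, j] ~= R_xx[j - i]; for a real signal R_xx is even in the lag,
# so the Toeplitz matrix can be built from the non-negative lags alone
R = toeplitz(r[zero_lag:])
eigenvalues, eigenvectors = np.linalg.eigh(R)   # R is symmetric, so eigh applies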

Covariance of large matrix

I have a large matrix (70000x784) that I want to compute the covariance matrix (70000x70000) of. I tried using numpy.cov(), but I get a memory error because there are too many observations (and yes, I am running a 64-bit version of Python on a 64-bit computer).
I attempted to calculate the covariance matrix using a nested for loop (which is really slow), but I know it's not correct because the resulting covariance matrix is not symmetrical (X[i,j]!=X[j,i]).
Surely, there must be an easier and quicker way to do this?
Here is my attempt, where the input matrix with dimensions 70000x784 is X_scaled:
Xt = np.transpose(X_scaled)
aveRows = np.mean(Xt, axis=0)
for i, val in enumerate(X_scaled[:, 0]):
    for j, val in enumerate(X_scaled[:, 0]):
        cov_matrix[i, j] = np.mean((X_scaled[i, :] - aveRows[i]) * (X_scaled[j, :] - aveRows[j]), axis=0)
        # increase cov_matrix by one row and one column:
        cov_matrix = np.lib.pad(cov_matrix, ((0, 1), (0, 1)), 'constant', constant_values=(0))
print(cov_matrix.shape)
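For what it's worth, the row-by-row covariance can be computed with a single matrix product rather than nested loops (a sketch of the standard formula; the function name is mine). Note, though, that a 70000x70000 float64 matrix needs roughly 39 GB on its own, so the memory error is expected however it is computed:

import numpy as np

def row_covariance(X):
    # covariance between the rows of X (shape m x d): C[i, j] = cov(X[i, :], X[j, :])
    Xc = X - X.mean(axis=1, keepdims=True)   # center each row
    return Xc @ Xc.T / (X.shape[1] - 1)      # equivalent to np.cov(X)

For data this size it is usually better to work blockwise, or with the much smaller 784x784 matrix Xc.T @ Xc (e.g. for PCA).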

Spectral norm 2x2 matrix in tensorflow

I've got a 2x2 matrix defined by the variables J00, J01, J10, J11 coming in from other inputs. Since the matrix is small, I was able to compute the spectral norm by first computing the trace and determinant
J_T = tf.reduce_sum([J00, J11])
J_ad = tf.reduce_prod([J00, J11])
J_cb = tf.reduce_prod([J01, J10])
J_det = tf.reduce_sum([J_ad, -J_cb])
and then solving the quadratic
L1 = J_T/2.0 + tf.sqrt(J_T**2/4.0 - J_det)
L2 = J_T/2.0 - tf.sqrt(J_T**2/4.0 - J_det)
spectral_norm = tf.maximum(L1, L2)
This works, but it looks rather ugly and it isn't generalizable to larger matrices. Is there a cleaner way (maybe a method call that I'm missing) to compute spectral_norm?
The spectral norm of a matrix J equals the largest singular value of the matrix.
Therefore you can use tf.svd() to perform the singular value decomposition, and take the largest singular value:
spectral_norm = tf.svd(J,compute_uv=False)[...,0]
where J is your matrix.
Notes:
I use compute_uv=False since we are interested only in singular values, not singular vectors.
J does not need to be square.
This solution works also for the case where J has any number of batch dimensions (as long as the two last dimensions are the matrix dimensions).
The ellipsis ... operation works as in NumPy.
I take the 0 index because we are interested only in the largest singular value.
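For completeness, here is a sketch of how this could be wired up for the 2x2 case in the question (assuming TensorFlow 1.x, where the call is tf.svd; in TensorFlow 2.x it lives at tf.linalg.svd):

import tensorflow as tf

# assemble the 2x2 matrix from the scalar inputs J00, J01, J10, J11
J = tf.stack([tf.stack([J00, J01]),
              tf.stack([J10, J11])])
# the largest singular value is the spectral norm
spectral_norm = tf.svd(J, compute_uv=False)[..., 0]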

python hcluster, distance matrix and condensed distance matrix

I'm using the module hcluster to calculate a dendrogram from a distance matrix. My distance matrix is an array of arrays generated like this:
import hcluster
import numpy as np

mols = (..a list of molecules)
distMatrix = np.zeros((10, 10))
for i in range(0, 10):
    for j in range(0, 10):
        sim = OETanimoto(mols[i], mols[j])  # a function to calculate similarity between molecules
        distMatrix[i][j] = 1 - sim
I then use the command distVec = hcluster.squareform(distMatrix) to convert the matrix into a condensed vector and calculate the linkage matrix with vecLink = hcluster.linkage(distVec).
All this works fine, but if I calculate the linkage matrix using the distance matrix instead of the condensed vector, matLink = hcluster.linkage(distMatrix), I get a different linkage matrix (the distances between the nodes are a lot larger and the topology is slightly different).
Now I'm not sure whether this is because hcluster only works with condensed vectors or whether I'm making mistakes on the way there.
Thanks for your help!
I knocked up a quick random example similar to yours and experienced the same problem.
In the docstring it does say:
Performs hierarchical/agglomerative clustering on the
condensed distance matrix y. y must be a :math:{n \choose 2} sized
vector where n is the number of original observations paired
in the distance matrix.
However, having had a quick look at the code, it seems like the intent is for it to work with both vector-shaped and matrix-shaped input:
In hierarchy.py there is a switch based upon the shape of the input matrix.
It seems however that the key bit of info is in the function linkage's docstring:
- Q : ndarray
A condensed or redundant distance matrix. A condensed
distance matrix is a flat array containing the upper
triangular of the distance matrix. This is the form that
``pdist`` returns. Alternatively, a collection of
:math:`m` observation vectors in n dimensions may be passed as
a :math:`m` by :math:`n` array.
So I think that the interface doesn't allow the passing of a distance matrix.
Instead it thinks you are passing it m observation vectors in n dimensions.
Hence the difference in result?
Does that seem reasonable?
Else just take a look at the code itself I'm sure you'll be able to debug it and figure out why your examples are different.
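For what it's worth, here is a small sketch of the workaround I would use, converting explicitly with squareform so that linkage unambiguously receives a condensed vector (shown with scipy.spatial.distance and scipy.cluster.hierarchy, which provide the same squareform/linkage interface as hcluster):

from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# squareform requires a symmetric matrix with a zero diagonal
distVec = squareform(distMatrix, checks=True)   # condensed upper-triangular vector
vecLink = linkage(distVec)                      # clusters on the actual pairwise distances
# passing distMatrix directly is interpreted as 10 observation vectors
# in 10 dimensions, which is why the two results differ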
Cheers
Matt
