I have two arrays: R3_mod with shape (21, 21) and P2 with shape (21,), both containing many zeros. I am computing the inverse of R3_mod using np.linalg.pinv() and then multiplying it by P2, as shown below. Is there a more efficient way to invert such arrays and then multiply?
Since the arrays are too big to post here, you can access them at: https://drive.google.com/drive/u/0/folders/1NjEiNoneMaCbmbmObEs2GCNIb08NFIy3
import numpy as np
X = np.linalg.pinv(R3_mod).dot(P2)
Assuming that the matrix R3_mod is indeed invertible, I think it's best to use np.linalg.inv instead of linalg.pinv.
inv computes the inverse of the matrix directly, whereas pinv (short for pseudo-inverse, see https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse) computes the matrix A' that minimizes |AA' - I|. If the input matrix is invertible, pinv returns the same result as inv.
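As a minimal sketch of that suggestion (assuming R3_mod really is square and invertible; np.linalg.solve is added here as a further alternative that avoids forming an explicit inverse):

import numpy as np

# direct inverse, valid only if R3_mod is actually invertible
X = np.linalg.inv(R3_mod).dot(P2)

# alternative: solve the linear system R3_mod @ X = P2 without forming the inverse
X = np.linalg.solve(R3_mod, P2)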
I'm trying to compute the cumulative distribution function of a multivariate normal using scipy.
I'm running into the "input matrix must be symmetric positive definite" error.
To my knowledge, a diagonal matrix with positive diagonal entries is positive definite (see page 1, problem 2).
However, when I use relatively small values on the diagonal, the error shows up for the smaller ones.
For example, this code:
import numpy as np
from scipy.stats import multivariate_normal
std = np.array([0.001, 2])
mean = np.array([1.23, 3])
multivariate_normal(mean=mean, cov=np.diag(std**2)).cdf([2,1])
returns 0.15865525393145702
while replacing the third line with:
std = np.array([0.00001, 2])
causes the error to show up.
I'm guessing that it has something to do with floating-point error.
The problem is that as the dimension of the covariance matrix grows, the smallest diagonal values that are still accepted get bigger and bigger.
I tried several values on the diagonal of a 9x9 covariance matrix; it seems that when the other diagonal values are very large, the small ones trigger the error.
Examining the stack trace, you will see that _eigvalsh_to_eps uses a relative eigenvalue cutoff of
1e6 * np.finfo('d').eps ~ 2.2e-10.
In your example the smaller eigenvalue is (5e-6)**2 = 2.5e-11 times the largest eigenvalue, which falls below that cutoff, so it is treated as zero.
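For illustration only (this check is not in the original answer), you can reproduce the cutoff that scipy applies internally:

import numpy as np

std = np.array([0.00001, 2])
cov = np.diag(std**2)

eigvals = np.linalg.eigvalsh(cov)
cutoff = 1e6 * np.finfo('d').eps * np.max(np.abs(eigvals))  # same threshold as _eigvalsh_to_eps
print(eigvals)           # smallest eigenvalue is 1e-10, largest is 4.0
print(eigvals < cutoff)  # [ True, False] -> the smaller eigenvalue is treated as zero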
You can pass allow_singular=True to get it working
import numpy as np
from scipy.stats import multivariate_normal
std = np.array([0.000001, 2])
mean = np.array([1.23, 3])
multivariate_normal(mean=mean, cov=np.diag(std**2), allow_singular=True).cdf([2,1])
I'm computing covariance in two ways that I think should tie out, but they do not.
Method 1: Compute the covariance matrix of a slice of an array of data
Method 2: Compute the covariance matrix of the full array of data, and reference an equivalent slice of that matrix.
The differences are tiny (on the order of 1e-18), but they grow with subsequent calculations in my code and prevent reproducibility. Is this a floating-point issue? If so, I'd still like to understand why it's happening and how to avoid it.
I'm on numpy 1.16.3
Thanks!
import numpy as np

state = np.random.RandomState(1)
X = state.rand(40, 100)

A = np.cov(X[:20])         # covariance of the first 20 rows only
B = np.cov(X)[:20, :20]    # covariance of all rows, then the same 20x20 block

print(np.array_equal(A, B))
diff = A - B
print(np.max(diff))
I would have expected a True result from array_equal, but I get False
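As a quick diagnostic (not part of the original post), comparing A and B from the snippet above with a tolerance tied to machine epsilon shows the mismatch is at rounding level rather than a logic error:

print(np.allclose(A, B))                                  # True
print(np.max(np.abs(A - B)) < 1e3 * np.finfo(float).eps)  # True for this example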
I'm using the module hcluster to calculate a dendrogram from a distance matrix. My distance matrix is an array of arrays generated like this:
import hcluster
import numpy as np
mols = (..a list of molecules)
distMatrix = np.zeros((10, 10))
for i in range(0, 10):
    for j in range(0, 10):
        sim = OETanimoto(mols[i], mols[j])  # a function to calculate similarity between molecules
        distMatrix[i][j] = 1 - sim
I then use the command distVec = hcluster.squareform(distMatrix) to convert the matrix into a condensed vector and calculate the linkage matrix with vecLink = hcluster.linkage(distVec).
All this works fine, but if I calculate the linkage matrix from the square distance matrix instead of the condensed vector, matLink = hcluster.linkage(distMatrix), I get a different linkage matrix (the distances between the nodes are a lot larger and the topology is slightly different).
Now I'm not sure whether this is because hcluster only works with condensed vectors or whether I'm making mistakes on the way there.
Thanks for your help!
I knocked up a quick random example similar to yours and experienced the same problem.
In the docstring it does say:
Performs hierarchical/agglomerative clustering on the
condensed distance matrix y. y must be a :math:`{n \choose 2}` sized
vector where n is the number of original observations paired
in the distance matrix.
However, having had a quick look at the code, it seems the intent is for it to work with both vector-shaped and matrix-shaped input:
In hierarchy.py there is a switch based upon the shape of the input.
It seems however that the key bit of info is in the function linkage's docstring:
- Q : ndarray
A condensed or redundant distance matrix. A condensed
distance matrix is a flat array containing the upper
triangular of the distance matrix. This is the form that
``pdist`` returns. Alternatively, a collection of
:math:`m` observation vectors in n dimensions may be passed as
a :math:`m` by :math:`n` array.
So I think the interface doesn't allow passing a square distance matrix.
Instead, it thinks you are passing it m observation vectors in n dimensions.
Hence the difference in results?
Does that seem reasonable?
Else just take a look at the code itself I'm sure you'll be able to debug it and figure out why your examples are different.
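As a minimal sketch (reusing the calls already shown in the question, and assuming hcluster behaves like scipy.cluster.hierarchy here), always convert the square matrix to its condensed form before clustering:

# convert the square distance matrix to a condensed vector, then cluster;
# this is the form linkage expects for precomputed distances
distVec = hcluster.squareform(distMatrix)
vecLink = hcluster.linkage(distVec)

# passing distMatrix directly makes linkage treat each row as an observation
# vector and recompute pairwise distances (euclidean by default), which is
# why matLink differs from vecLink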
Cheers
Matt
My code:
from numpy import *
def pca(orig_data):
    data = array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    u, s, v = linalg.svd(data)
    print s  # should be s**2 instead!
    print v
def load_iris(path):
    lines = []
    with open(path) as input_file:
        lines = input_file.readlines()
    data = []
    for line in lines:
        cur_line = line.rstrip().split(',')
        cur_line = cur_line[:-1]
        cur_line = [float(elem) for elem in cur_line]
        data.append(array(cur_line))
    return array(data)

if __name__ == '__main__':
    data = load_iris('iris.data')
    pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Output:
[ 20.89551896 11.75513248 4.7013819 1.75816839]
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - Same as I got but transposed so okay I guess
Also, what's with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:
cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print val
But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should get 4 eigenvalues, not the 150 that eig gives me. Am I doing something wrong?
edit: I've noticed that the values differ by a factor of 150, which is the number of samples in the dataset. Also, the eigenvalues are supposed to add up to the number of dimensions, in this case 4. What I don't understand is why this difference is happening. If I simply divide the eigenvalues by len(data) I get the result I want, but I don't understand why. Either way the proportions of the eigenvalues aren't altered, but they matter to me, so I'd like to understand what's going on.
You decomposed the wrong matrix.
Principal Component Analysis requires the eigenvectors/eigenvalues
of the covariance (or correlation) matrix, not of the data itself. Built from an m x n data matrix (m observations of n features), that matrix is n x n; for the correlation matrix, the main diagonal is all ones.
You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef:
import numpy as NP
import numpy.linalg as LA

# a simulated data set with 8 data points, each point having five features
data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)

# usually a good idea to mean-center your data first
data -= NP.mean(data, axis=0)

# calculate the correlation matrix (rowvar=0 treats columns as features);
# this returns an n x n matrix, here 5 x 5
C = NP.corrcoef(data, rowvar=0)

# now get the eigenvalues/eigenvectors of C
evals, evecs = LA.eig(C)
To get the eigenvectors/eigenvalues, I did not decompose the matrix using SVD,
though you certainly can. My preference is to calculate them using eig from NumPy's (or SciPy's)
LA module--it is a little easier to work with than svd, and the return values are the eigenvectors
and eigenvalues themselves, nothing else. By contrast, svd doesn't return these directly.
Granted, svd will decompose any matrix, not just square ones (to which eig is limited); however, when doing PCA this way you always have a square matrix to decompose,
regardless of the shape of your data. That is because the matrix being decomposed in PCA is a
covariance (or correlation) matrix, which is square by definition: its rows and columns both
correspond to the variables of the original data, and each cell holds the covariance of a pair
of variables. In the correlation matrix the main diagonal is all ones, since every variable is
perfectly correlated with itself.
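As a follow-up usage sketch (not part of the original answer), once you have evals and evecs from the snippet above, you would typically sort them and project the data onto the leading components:

# sort eigenvalues (and the matching eigenvectors) in decreasing order
order = NP.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# for correlation-based PCA, project the standardized data onto the
# first two principal components
standardized = data / data.std(axis=0)
reduced = standardized.dot(evecs[:, :2])  # shape (8, 2)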
The left singular vectors returned by SVD(A) are the eigenvectors of AA^T.
The covariance matrix of a dataset A (mean-centered, with observations in its columns) is: 1/(N-1) * AA^T.
Now, when you do PCA via the SVD, you have to divide the squared singular values by (N-1) (equivalently, scale A by 1/sqrt(N-1)) to get the eigenvalues of the covariance matrix with the correct scale.
In your case, N=150 and you haven't done this division, hence the discrepancy.
This is explained in detail here
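A minimal sketch of this scaling (illustrative only; it uses standardized random data rather than the iris set): the squared singular values of the centered data, divided by N-1, match the eigenvalues of numpy's covariance matrix.

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(150, 4)                      # 150 observations, 4 features
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize, as in the original pca()

# SVD route: squared singular values, rescaled by N-1
u, s, vt = np.linalg.svd(X, full_matrices=False)
svd_eigvals = s**2 / (X.shape[0] - 1)

# eig route: eigenvalues of the covariance matrix (np.cov uses the N-1 convention)
cov_eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print(np.allclose(svd_eigvals, cov_eigvals))  # True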
(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)
You probably used cov incorrectly by not transposing the matrix first: by default, cov treats each row as a variable, so a 150-by-4 data matrix yields a 150-by-150 covariance matrix. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.
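For example (a two-line sketch; rowvar is a standard keyword of numpy's cov, and orig_data refers to the 150-by-4 array from the question):

cov_mat = cov(orig_data.T)       # or: cov(orig_data, rowvar=False)
val, vec = linalg.eig(cov_mat)   # now yields 4 eigenvalues and 4 eigenvectors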
Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:
a. the left singular vectors of X,
b. the principal components of X,
c. the eigenvectors of X X^T.
Also, the eigenvalues of X X^T are equal to the square of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
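A small numerical check of these identities (illustrative only; the 4-by-150 shape follows the text above):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(4, 150)  # each column is one 4-dimensional observation

Q, s, Vt = np.linalg.svd(X, full_matrices=False)  # left singular vectors Q, singular values s
evals, evecs = np.linalg.eigh(X @ X.T)            # eigendecomposition of X X^T (ascending order)

# eigenvalues of X X^T equal the squared singular values of X
print(np.allclose(np.sort(s**2), evals))  # True

# eigenvectors of X X^T match the left singular vectors up to sign and ordering
# (assuming distinct eigenvalues, which holds for generic random data)
print(np.allclose(np.abs(Q.T @ evecs[:, ::-1]), np.eye(4), atol=1e-8))  # True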
Question already addressed: Principal component analysis in Python