Finding eigenvectors and eigenvalues using np.linalg.svd()? - python

I am trying to find eigenvectors and eigenvalues of my covariance matrix for PCA.
My code:
values, vectors = np.linalg.eigh(covariance_matrix)
This is the output:
Eigen Vectors:
[[ 0.26199559 0.72101681 -0.37231836 0.52237162]
[-0.12413481 -0.24203288 -0.92555649 -0.26335492]
[-0.80115427 -0.14089226 -0.02109478 0.58125401]
[ 0.52354627 -0.6338014 -0.06541577 0.56561105]]
Eigen Values:
[0.02074601 0.14834223 0.92740362 2.93035378]
Then I found that np.linalg.svd() also returns the same.
U, S, V = np.linalg.svd(standardized_x.T)
print(U)
print(S)
print(V)
[[-0.52237162 -0.37231836 0.72101681 0.26199559]
[ 0.26335492 -0.92555649 -0.24203288 -0.12413481]
[-0.58125401 -0.02109478 -0.14089226 -0.80115427]
[-0.56561105 -0.06541577 -0.6338014 0.52354627]]
[20.89551896 11.75513248 4.7013819 1.75816839]
[[ 1.08374515e-01 9.98503796e-02 1.13323362e-01 ... -7.27833114e-02
-6.58701606e-02 -4.59092965e-02]
[-4.30198387e-02 5.57547718e-02 2.70926177e-02 ... -2.26960075e-02
-8.64611208e-02 1.89567788e-03]
[ 2.59377669e-02 4.83370288e-02 -1.09498919e-02 ... -3.81328738e-02
-1.98113038e-01 -1.12476331e-01]
...
[ 5.42576376e-02 5.32189412e-03 2.76010922e-02 ... 9.89545817e-01
-1.40226565e-02 -7.86338250e-04]
[ 1.60581494e-03 8.56651825e-02 1.78415121e-01 ... -1.24233079e-02
9.52228601e-01 -2.19591161e-02]
[ 2.27770498e-03 6.44405862e-03 1.49430370e-01 ... -6.58105858e-04
-2.32385318e-02 9.77215825e-01]]
The resulting U (eigenvectors) is the same for both np.linalg.eigh() and np.linalg.svd(), but the S (variance/eigenvalue) values are not the same.
Am I missing something?
Can anyone explain what U, S and V stand for in the np.linalg.svd() function?
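A short sketch of how the two outputs relate, assuming standardized_x has shape (n_samples, n_features) and is centered (the random data below is a stand-in, not the data from the question). np.linalg.svd factors its input as U @ diag(S) @ Vh: the columns of U are the left singular vectors, S holds the singular values in descending order, and the third return value is V already transposed (Vh), not V. For centered data, the singular values of standardized_x.T and the eigenvalues of the covariance matrix are related by eigenvalue = S**2 / (n_samples - 1). The numbers in the question fit this with n_samples - 1 = 149: 20.89551896**2 / 149 is about 2.93035.

```python
import numpy as np

# Stand-in data (assumption): 150 samples, 4 correlated features.
rng = np.random.default_rng(0)
raw = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
standardized_x = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
n_samples = standardized_x.shape[0]

# Route 1: eigendecomposition of the covariance matrix.
# eigh returns eigenvalues in ASCENDING order.
covariance_matrix = np.cov(standardized_x.T)
values, vectors = np.linalg.eigh(covariance_matrix)

# Route 2: SVD of the (features x samples) data matrix.
# svd returns singular values in DESCENDING order.
U, S, Vh = np.linalg.svd(standardized_x.T, full_matrices=False)

# Singular values are not eigenvalues: square them and divide by n - 1.
eigenvalues_from_svd = S**2 / (n_samples - 1)

# Reverse one of the two orderings before comparing.
print(np.allclose(eigenvalues_from_svd[::-1], values))

# Columns of U match the eigenvectors up to sign and column order.
print(np.allclose(np.abs(U[:, ::-1]), np.abs(vectors)))
```

Passing full_matrices=False keeps Vh at shape (n_features, n_samples) instead of the large square matrix printed in the question.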

Related

Numpy dot product between a 3d matrix and 2d matrix

I have a 3d array that has shape (2, 10, 3) and a 2d array that has shape (2, 3) like this:
print(t) #2d array
Output:
[[1.003 2.32 3.11 ]
[1.214 5.32 2.13241]]
print(normal) #3d array
Output:
[[[0.69908573 0.0826756 0.84485978]
[0.51058213 0.4052637 0.5068118 ]
[0.45974276 0.25819549 0.10780089]
[0.27484999 0.33367648 0.128262 ]
[0.35963389 0.77600065 0.89393939]
[0.46937506 0.59291623 0.06620307]
[0.87603987 0.44414505 0.83394174]
[0.83186093 0.62491876 0.38160734]
[0.96819897 0.80183442 0.75102768]
[0.54182908 0.19403844 0.07925769]]
[[2.82248573 3.2341756 0.96825978]
[2.63398213 3.5567637 0.6302118 ]
[2.58314276 3.40969549 0.23120089]
[2.39824999 3.48517648 0.251662 ]
[2.48303389 3.92750065 1.01733939]
[2.59277506 3.74441623 0.18960307]
[2.99943987 3.59564505 0.95734174]
[2.95526093 3.77641876 0.50500734]
[3.09159897 3.95333442 0.87442768]
[2.66522908 3.34553844 0.20265769]]]
How can I take the dot product of each row in the 2d array t with the corresponding matrix in the 3d array normal, so that I end up with an array of shape (2, 10) whose nth row contains all 10 dot products between the nth row of the 2d array and the rows of the nth matrix in the 3d array?
[0.62096458 0.62618459 0.37528887 0.5728386 1.19634398 0.79620507
1.997884 0.75229492 1.2236496 0.4210626 ]
[2.96347746 3.30738892 3.50596579 4.93082295 5.33811805 4.44872493
7.33480393 4.19173472 4.7406248 7.83229689]
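You can use numpy.einsum (np.einsum('ijk,ik->ij', t, normal)) to get this result. Note that in the code below t is the 3d array and normal is the 2d array, the reverse of the naming in the question: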
import numpy as np
normal = np.array([
[1.003,2.32,3.11],
[1.214,5.32,2.13241]
])
t = np.array([
[
[0.69908573, 0.0826756, 0.84485978],
[0.51058213, 0.4052637, 0.5068118 ],
[0.45974276, 0.25819549, 0.10780089],
[0.27484999, 0.33367648, 0.128262 ],
[0.35963389, 0.77600065, 0.89393939],
[0.46937506, 0.59291623, 0.06620307],
[0.87603987, 0.44414505, 0.83394174],
[0.83186093, 0.62491876, 0.38160734],
[0.96819897, 0.80183442, 0.75102768],
[0.54182908, 0.19403844, 0.07925769]
],
[
[2.82248573, 3.2341756, 0.96825978],
[2.63398213, 3.5567637, 0.6302118 ],
[2.58314276, 3.40969549, 0.23120089],
[2.39824999, 3.48517648, 0.251662 ],
[2.48303389, 3.92750065, 1.01733939],
[2.59277506, 3.74441623, 0.18960307],
[2.99943987, 3.59564505, 0.95734174],
[2.95526093, 3.77641876, 0.50500734],
[3.09159897, 3.95333442, 0.87442768],
[2.66522908, 3.34553844, 0.20265769]
]
])
np.einsum('ijk,ik->ij', t, normal)
This results in
array([[ 3.52050429, 3.02851036, 1.39539629, 1.44869879, 4.9411858 ,
2.05224039, 4.50264332, 3.47096686, 5.16705551, 1.24011516],
[22.69703871, 23.46350713, 21.76853041, 21.98926093, 26.07809129,
23.47223475, 24.81159677, 24.75511727, 26.64957859, 21.46600189]])
Which is the same as doing the two matrix-vector multiplications in order:
t[0] @ normal[0]
t[1] @ normal[1]
This gives the two arrays:
array([3.52050429, 3.02851036, 1.39539629, 1.44869879, 4.9411858 ,
2.05224039, 4.50264332, 3.47096686, 5.16705551, 1.24011516])
array([22.69703871, 23.46350713, 21.76853041, 21.98926093, 26.07809129,
23.47223475, 24.81159677, 24.75511727, 26.64957859, 21.46600189])
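For comparison, here is a small sketch of two other ways to write the same batched dot product, using random stand-in arrays (the names stack and rows are illustrative, not from the question). All three compute, for each i, the product of the i-th (10, 3) matrix with the i-th length-3 vector.

```python
import numpy as np

rng = np.random.default_rng(0)
stack = rng.random((2, 10, 3))  # plays the role of the 3d array
rows = rng.random((2, 3))       # plays the role of the 2d array

# einsum: sum over k for each batch index i and row index j.
via_einsum = np.einsum('ijk,ik->ij', stack, rows)

# Batched matmul: turn each vector into a (3, 1) column, then drop the last axis.
via_matmul = (stack @ rows[:, :, None])[..., 0]

# Broadcasting: multiply elementwise and sum over the last axis.
via_sum = (stack * rows[:, None, :]).sum(axis=-1)

print(via_einsum.shape)
print(np.allclose(via_einsum, via_matmul), np.allclose(via_einsum, via_sum))
```

Which form reads best is a matter of taste; einsum states the index pattern explicitly, while the broadcasting version needs no extra API.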

How to calculate the distance between all atoms in a PDB file and create a distance matrix from that

I would like to calculate the distances between all atoms in a PDB file and then create a distance matrix from the result.
I currently have all the x, y and z coordinates, but I am struggling to do this distance calculation for all atoms.
distance = sqrt((x1-x2)^2+(y1-y2)^2+(z1-z2)^2)
For example:
Distance between Atom 1 and Atom 2 ,3 ,4...
Distance between Atom 2 and Atom 3, 4, 5...
And so forth for every Atom in the PDB file. I'm new to coding so any method to achieve the end result would be great.
pdb file in question - https://files.rcsb.org/download/6GCH.pdb
Considering your code, you can collect the coordinates into an array:
x_y_z_ = list()
...
for atom in residue:
    x = (atom.coord[0])
    y = (atom.coord[1])
    z = (atom.coord[2])
    x_y_z_.append([x,y,z])
...
x_y_z_ = np.array(x_y_z_)
print( pairwise_distances(x_y_z_,x_y_z_) )
and then use pairwise_distances from sklearn, like:
from sklearn.metrics import pairwise_distances
import numpy as np
x_y_z_ = np.array([[120,130,123],
[655,123,666],
[111,444,333],
[520,876,222]])
print( pairwise_distances(x_y_z_,x_y_z_) )
out:
[[ 0. 762.31423967 377.8584391 852.24233643]
[762.31423967 0. 714.04901793 884.51681725]
[377.8584391 714.04901793 0. 605.1660929 ]
[852.24233643 884.51681725 605.1660929 0. ]]
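If you would rather not depend on sklearn, the formula from the question can be applied to every pair at once with NumPy broadcasting. This is a minimal sketch, assuming the coordinates are already in an (n_atoms, 3) array; distance_matrix is a hypothetical helper name. It builds an intermediate (n, n, 3) array, so memory grows with the square of the atom count.

```python
import numpy as np

def distance_matrix(coords):
    """Return the (n, n) matrix of Euclidean distances between rows of coords."""
    coords = np.asarray(coords, dtype=float)
    # diff[i, j] is the vector from point j to point i, shape (n, n, 3).
    diff = coords[:, None, :] - coords[None, :, :]
    # sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2) for every pair.
    return np.sqrt((diff ** 2).sum(axis=-1))

coords = np.array([[120, 130, 123],
                   [655, 123, 666],
                   [111, 444, 333],
                   [520, 876, 222]])
dist = distance_matrix(coords)
print(dist)
```

The result is symmetric with zeros on the diagonal, so entry [i, j] is the distance between atom i and atom j. scipy.spatial.distance.cdist is another option for the same computation.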

Incorrect output for calculating the compact svd

I am trying to calculate the truncated SVD of a given matrix. I have written the code, but when I test it the output is incorrect, and I'm not sure where I am going wrong. I think I may be calculating my pos_v incorrectly, but I can't seem to find the issue. Can anyone give any guidance?
Here is my code:
def compact_svd(A, tol=1e-6):
    """Compute the truncated SVD of A.
    Parameters:
        A ((m,n) ndarray): The matrix (of rank r) to factor.
        tol (float): The tolerance for excluding singular values.
    Returns:
        ((m,r) ndarray): The orthonormal matrix U in the SVD.
        ((r,) ndarray): The singular values of A as a 1-D array.
        ((r,n) ndarray): The orthonormal matrix V^H in the SVD.
    """
    lambda_, v = sp.linalg.eig((A.conj().T @ A))
    lambda_ = lambda_.real
    sigma = np.sqrt(lambda_)
    indices = np.argsort(sigma)[::-1]
    v = v[:, indices]
    r = 0
    for i in range(len(sigma)):
        if sigma[i] > tol:
            r = r + 1
    pos_sigma = sigma[:r]
    pos_v = v[:,:r]
    U = (A @ pos_v) / pos_sigma
    return U, pos_sigma, pos_v.conj().T
Here is my test matrix:
A = np.array([[9,9,9,3,2,9,3,7,7,8],
[4,4,7,4,2,4,8,7,1,8],
[1,4,7,4,5,6,8,4,1,6],
[5,5,1,8,9,4,9,4,2,7],
[7,7,7,9,4,7,4,3,7,1]],dtype = float)
print(compact_svd(A))
The correct output:
(array([[ 0.54036027+0.j, 0.58805563+0.j, -0.29423603+0.j,
-0.4346745 +0.j, -0.29442248+0.j],
[ 0.41227593+0.j, -0.21929894+0.j, -0.51747179+0.j,
0.08375491+0.j, 0.71214086+0.j],
[ 0.38514303+0.j, -0.32015959+0.j, -0.24745912+0.j,
0.60060756+0.j, -0.57201156+0.j],
[ 0.43356274+0.j, -0.61204283+0.j, 0.41057641+0.j,
-0.51216171+0.j, -0.080897 +0.j],
[ 0.44914089+0.j, 0.35916564+0.j, 0.64485588+0.j,
0.42544582+0.j, 0.26912684+0.j]]),
array([39.03360665, 11.91940614, 9.3387396 , 5.38285176, 3.33439025]),
array([[ 0.31278916-0.j, 0.34239004-0.j, 0.35924746-0.j,
0.31566457-0.j, 0.24413875-0.j, 0.35101654-0.j,
0.35095554-0.j, 0.28925585-0.j, 0.22009374-0.j,
0.34370454-0.j],
[ 0.29775734-0.j, 0.21717625-0.j, 0.28679345-0.j,
-0.17261926-0.j, -0.41403132-0.j, 0.21480395-0.j,
-0.5556673 -0.j, -0.00587411-0.j, 0.40832611-0.j,
-0.24296833-0.j],
[ 0.17147953-0.j, 0.09198514-0.j, -0.32960263-0.j,
0.55102537-0.j, 0.36556324-0.j, -0.00497598-0.j,
-0.07790604-0.j, -0.33140639-0.j, 0.26883294-0.j,
-0.47752981-0.j],
[-0.47542292-0.j, -0.14068908-0.j, 0.62131114-0.j,
0.21645498-0.j, -0.11266769-0.j, 0.17761373-0.j,
0.23467192-0.j, -0.15350902-0.j, -0.07515751-0.j,
-0.43906049-0.j],
[ 0.33174054-0.j, -0.18290668-0.j, 0.04021533-0.j,
0.43552649-0.j, -0.50269662-0.j, -0.50174342-0.j,
0.17580464-0.j, 0.33582599-0.j, -0.05960136-0.j,
-0.1162055 -0.j]])
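Two things in the code above look likely to cause trouble. First, v is reordered with indices but sigma is not, so sigma[:r] is no longer paired with v[:, :r]. Second, A^H A is Hermitian, and a general eigensolver can return tiny negative or complex eigenvalues for its zero eigenvalues, whose square roots are NaN. The sketch below is one possible fix, not necessarily the intended solution: it switches to np.linalg.eigh (a swapped-in routine for Hermitian matrices), clips negative round-off to zero, and sorts sigma and v together. The name compact_svd_sketch is hypothetical.

```python
import numpy as np

def compact_svd_sketch(A, tol=1e-6):
    """Compact SVD of A via the eigendecomposition of A^H A (illustrative)."""
    # eigh assumes a Hermitian input and returns real eigenvalues
    # with orthonormal eigenvectors.
    lambda_, v = np.linalg.eigh(A.conj().T @ A)
    # Round-off can make zero eigenvalues slightly negative; clip before sqrt.
    sigma = np.sqrt(np.clip(lambda_, 0.0, None))
    # Sort singular values in descending order and reorder v the SAME way.
    order = np.argsort(sigma)[::-1]
    sigma = sigma[order]
    v = v[:, order]
    # Keep only the singular values above the tolerance.
    r = int(np.sum(sigma > tol))
    pos_sigma = sigma[:r]
    pos_v = v[:, :r]
    # Each column of U is A v_i / sigma_i.
    U = (A @ pos_v) / pos_sigma
    return U, pos_sigma, pos_v.conj().T

A = np.array([[9, 9, 9, 3, 2, 9, 3, 7, 7, 8],
              [4, 4, 7, 4, 2, 4, 8, 7, 1, 8],
              [1, 4, 7, 4, 5, 6, 8, 4, 1, 6],
              [5, 5, 1, 8, 9, 4, 9, 4, 2, 7],
              [7, 7, 7, 9, 4, 7, 4, 3, 7, 1]], dtype=float)
U, s, Vh = compact_svd_sketch(A, tol=1e-4)
print(s)
print(np.allclose(U @ np.diag(s) @ Vh, A))
```

Because forming A^H A squares the condition number, singular values near zero are only accurate to roughly sqrt(machine epsilon) times the largest singular value, so a tolerance much smaller than that may let spurious values through. The example call uses a looser tol=1e-4 for that reason. Individual columns of U and rows of V^H may still differ in sign from another library's result.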

Implementing a PCA (Eigenvector based) in Python

I am trying to implement PCA in Python. My goal is to create a version which behaves similarly to Matlab's PCA implementation. However, I think I am missing a crucial point, as my tests partly produce results with the wrong sign (+/-).
Can you find a mistake in the algorithm? Why are the signs sometimes different?
An implementation of PCA based on eigen vectors:
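import numpy as np
# Only the function body was shown; the name pca and its signature are assumed.
def pca(A, new_array_rank=4):
    # Center each column (feature) on its mean
    A_mean = np.mean(A, axis=0)
    A = A - A_mean
    covariance_matrix = np.cov(A.T)
    eigen_values, eigen_vectors = np.linalg.eig(covariance_matrix)
    # Sort eigenpairs by descending eigenvalue
    new_index = np.argsort(eigen_values)[::-1]
    eigen_vectors = eigen_vectors[:,new_index]
    eigen_values = eigen_values[new_index]
    # Keep the leading components and project the centered data onto them
    eigen_vectors = eigen_vectors[:,:new_array_rank]
    return np.dot(eigen_vectors.T, A.T).T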
My test values:
array([[ 0.13298325, 0.2896928 , 0.53589224, 0.58164269, 0.66202221,
0.95414116, 0.03040784, 0.26290471, 0.40823539, 0.37783385],
[ 0.90521267, 0.86275498, 0.52696221, 0.15243867, 0.20894357,
0.19900414, 0.50607341, 0.53995902, 0.32014539, 0.98744942],
[ 0.87689087, 0.04307512, 0.45065793, 0.29415066, 0.04908066,
0.98635538, 0.52091338, 0.76291385, 0.97213094, 0.48815925],
[ 0.75136801, 0.85946751, 0.10508436, 0.04656418, 0.08164919,
0.88129981, 0.39666754, 0.86325704, 0.56718669, 0.76346602],
[ 0.93319721, 0.5897521 , 0.75065047, 0.63916306, 0.78810679,
0.92909485, 0.23751963, 0.87552313, 0.37663086, 0.69010429],
[ 0.53189229, 0.68984247, 0.46164066, 0.29953259, 0.10826334,
0.47944168, 0.93935082, 0.40331874, 0.18541041, 0.35594587],
[ 0.36399075, 0.00698617, 0.61030608, 0.51136309, 0.54185601,
0.81383604, 0.50003674, 0.75414875, 0.54689801, 0.9957493 ],
[ 0.27815017, 0.65417397, 0.57207255, 0.54388744, 0.89128334,
0.3512483 , 0.94441934, 0.05305929, 0.77389942, 0.93125228],
[ 0.80409485, 0.2749575 , 0.22270875, 0.91869706, 0.54683128,
0.61501493, 0.7830902 , 0.72055598, 0.09363186, 0.05103846],
[ 0.12357816, 0.29758902, 0.87807485, 0.94348706, 0.60896429,
0.33899019, 0.36310027, 0.02380186, 0.67207071, 0.28638936]])
My result of the PCA with eigen vectors:
array([[ 5.09548931e-01, -3.97079651e-01, -1.47555867e-01,
-3.55343967e-02, -4.92125732e-01, -1.78191399e-01,
-3.29543974e-02, 3.71406504e-03, 1.06404170e-01,
-1.66533454e-16],
[ -5.15879041e-01, 6.40833419e-01, -7.54601587e-02,
-2.00776798e-01, -7.07247669e-02, 2.68582368e-01,
-1.66124362e-01, 1.03414828e-01, 7.76738500e-02,
5.55111512e-17],
[ -4.42659342e-01, -5.13297786e-01, -1.65477203e-01,
5.33670847e-01, 2.00194213e-01, 2.06176265e-01,
1.31558875e-01, -2.81699724e-02, 6.19571305e-02,
-8.32667268e-17],
[ -8.50397468e-01, 5.14319846e-02, -1.46289906e-01,
6.51133920e-02, -2.83887201e-01, -1.90516618e-01,
1.45748370e-01, 9.49464768e-02, -1.05989648e-01,
4.16333634e-17],
[ -1.61040296e-01, -3.47929944e-01, -1.19871598e-01,
-6.48965493e-01, 7.53188055e-02, 1.31730340e-01,
1.33229858e-01, -1.43587499e-01, -2.20913989e-02,
-3.40005801e-16],
[ -1.70017435e-01, 4.22573148e-01, 4.81511942e-01,
2.42170125e-01, -1.18575764e-01, -6.87250591e-02,
-1.20660307e-01, -2.22865482e-01, -1.73666882e-02,
-1.52655666e-16],
[ 6.90841779e-02, -2.86233901e-01, -4.16612350e-01,
9.38935057e-03, 3.02325120e-01, -1.61783482e-01,
-3.55465509e-01, 1.15323059e-02, -5.04619674e-02,
4.71844785e-16],
[ 5.26189089e-01, 6.81324113e-01, -2.89960115e-01,
2.01781673e-02, 3.03159463e-01, -2.11777986e-01,
2.25937548e-01, -5.49219872e-05, 3.66268329e-02,
-1.11022302e-16],
[ 6.68680313e-02, -2.99715813e-01, 8.53428694e-01,
-1.30066853e-01, 2.31410283e-01, -1.02860624e-01,
1.95449586e-02, 1.30218425e-01, 1.68059569e-02,
2.22044605e-16],
[ 9.68303353e-01, 4.80944309e-02, 2.62865615e-02,
1.44821658e-01, -1.47094421e-01, 3.07366196e-01,
1.91849667e-02, 5.08517759e-02, -1.03558238e-01,
1.38777878e-16]])
Test result of the same data using Matlab's PCA function:
array([[ -5.09548931e-01, 3.97079651e-01, 1.47555867e-01,
3.55343967e-02, -4.92125732e-01, -1.78191399e-01,
-3.29543974e-02, -3.71406504e-03, -1.06404170e-01,
-0.00000000e+00],
[ 5.15879041e-01, -6.40833419e-01, 7.54601587e-02,
2.00776798e-01, -7.07247669e-02, 2.68582368e-01,
-1.66124362e-01, -1.03414828e-01, -7.76738500e-02,
-0.00000000e+00],
[ 4.42659342e-01, 5.13297786e-01, 1.65477203e-01,
-5.33670847e-01, 2.00194213e-01, 2.06176265e-01,
1.31558875e-01, 2.81699724e-02, -6.19571305e-02,
-0.00000000e+00],
[ 8.50397468e-01, -5.14319846e-02, 1.46289906e-01,
-6.51133920e-02, -2.83887201e-01, -1.90516618e-01,
1.45748370e-01, -9.49464768e-02, 1.05989648e-01,
-0.00000000e+00],
[ 1.61040296e-01, 3.47929944e-01, 1.19871598e-01,
6.48965493e-01, 7.53188055e-02, 1.31730340e-01,
1.33229858e-01, 1.43587499e-01, 2.20913989e-02,
-0.00000000e+00],
[ 1.70017435e-01, -4.22573148e-01, -4.81511942e-01,
-2.42170125e-01, -1.18575764e-01, -6.87250591e-02,
-1.20660307e-01, 2.22865482e-01, 1.73666882e-02,
-0.00000000e+00],
[ -6.90841779e-02, 2.86233901e-01, 4.16612350e-01,
-9.38935057e-03, 3.02325120e-01, -1.61783482e-01,
-3.55465509e-01, -1.15323059e-02, 5.04619674e-02,
-0.00000000e+00],
[ -5.26189089e-01, -6.81324113e-01, 2.89960115e-01,
-2.01781673e-02, 3.03159463e-01, -2.11777986e-01,
2.25937548e-01, 5.49219872e-05, -3.66268329e-02,
-0.00000000e+00],
[ -6.68680313e-02, 2.99715813e-01, -8.53428694e-01,
1.30066853e-01, 2.31410283e-01, -1.02860624e-01,
1.95449586e-02, -1.30218425e-01, -1.68059569e-02,
-0.00000000e+00],
[ -9.68303353e-01, -4.80944309e-02, -2.62865615e-02,
-1.44821658e-01, -1.47094421e-01, 3.07366196e-01,
1.91849667e-02, -5.08517759e-02, 1.03558238e-01,
-0.00000000e+00]])
The sign and other normalization choices for eigenvectors are arbitrary. Matlab and numpy normalize the eigenvectors in the same way, but the sign is arbitrary and can depend on details of the linear algebra library that is used.
When I wrote the numpy equivalent of Matlab's princomp, I just normalized the sign of the eigenvectors before comparing them to Matlab's in my unit tests.
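One way to do such a sign normalization is sketched below. The convention used here (flip each eigenvector so that its largest-magnitude component is positive) is an assumption for illustration and not necessarily the one Matlab uses; normalize_signs is a hypothetical helper. What matters for a unit test is that both results are passed through the same convention before being compared.

```python
import numpy as np

def normalize_signs(vectors):
    """Flip each column so its largest-magnitude entry is positive."""
    vectors = np.asarray(vectors, dtype=float)
    # Row index of the largest-magnitude entry in every column.
    rows = np.argmax(np.abs(vectors), axis=0)
    cols = np.arange(vectors.shape[1])
    signs = np.sign(vectors[rows, cols])
    signs[signs == 0] = 1.0  # leave all-zero columns untouched
    return vectors * signs

rng = np.random.default_rng(0)
data = rng.random((10, 4))
_, vecs = np.linalg.eigh(np.cov(data.T))

# Simulate another library returning some eigenvectors with opposite sign.
flipped = vecs * np.array([1, -1, 1, -1])
same = np.allclose(normalize_signs(vecs), normalize_signs(flipped))
print(same)
```

If you compare projected data rather than eigenvectors, multiply the projected columns by the same signs, since flipping an eigenvector flips the corresponding score column.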

Why does numpy.random.dirichlet() not accept multidimensional arrays?

On the numpy page they give the example of
s = np.random.dirichlet((10, 5, 3), 20)
which is all fine and great; but what if you want to generate random samples from a 2D array of alphas?
alphas = np.random.randint(10, size=(20, 3))
If you try np.random.dirichlet(alphas), np.random.dirichlet([x for x in alphas]), or np.random.dirichlet((x for x in alphas)), it results in a
ValueError: object too deep for desired array. The only thing that seems to work is:
y = np.empty(alphas.shape)
for i in range(len(alphas)):
    y[i] = np.random.dirichlet(alphas[i])
print(y)
...which is far from ideal for my code structure. Why is this the case, and can anyone think of a more "numpy-like" way of doing this?
Thanks in advance.
np.random.dirichlet is written to generate samples for a single Dirichlet distribution. That code is implemented in terms of the Gamma distribution, and that implementation can be used as the basis for a vectorized code to generate samples from different distributions. In the following, dirichlet_sample takes an array alphas with shape (n, k), where each row is an alpha vector for a Dirichlet distribution. It returns an array also with shape (n, k), each row being a sample of the corresponding distribution from alphas. When run as a script, it generates samples using dirichlet_sample and np.random.dirichlet to verify that they are generating the same samples (up to normal floating point differences).
import numpy as np
def dirichlet_sample(alphas):
    """
    Generate samples from an array of alpha distributions.
    """
    r = np.random.standard_gamma(alphas)
    return r / r.sum(-1, keepdims=True)
if __name__ == "__main__":
    alphas = 2 ** np.random.randint(0, 4, size=(6, 3))
    np.random.seed(1234)
    d1 = dirichlet_sample(alphas)
    print("dirichlet_sample:")
    print(d1)
    np.random.seed(1234)
    d2 = np.empty(alphas.shape)
    for k in range(len(alphas)):
        d2[k] = np.random.dirichlet(alphas[k])
    print("np.random.dirichlet:")
    print(d2)
    # Compare d1 and d2:
    err = np.abs(d1 - d2).max()
    print("max difference:", err)
Sample run:
dirichlet_sample:
[[ 0.38980834 0.4043844 0.20580726]
[ 0.14076375 0.26906604 0.59017021]
[ 0.64223074 0.26099934 0.09676991]
[ 0.21880145 0.33775249 0.44344606]
[ 0.39879859 0.40984454 0.19135688]
[ 0.73976425 0.21467288 0.04556287]]
np.random.dirichlet:
[[ 0.38980834 0.4043844 0.20580726]
[ 0.14076375 0.26906604 0.59017021]
[ 0.64223074 0.26099934 0.09676991]
[ 0.21880145 0.33775249 0.44344606]
[ 0.39879859 0.40984454 0.19135688]
[ 0.73976425 0.21467288 0.04556287]]
max difference: 5.55111512313e-17
I think you're looking for
y = np.array([np.random.dirichlet(x) for x in alphas])
for your list comprehension. Otherwise you're simply passing a python list or tuple. I imagine the reason numpy.random.dirichlet does not accept your list of alpha values is because it's not set up to - it already accepts an array, which it expects to have a dimension of k, as per the documentation.
