What is the function of numpy.linalg.norm method?
In this Kmeans Clustering sample the numpy.linalg.norm function is used to get the distance between new centroids and old centroids in the movement centroid step but I cannot understand what is the meaning by itself
Could somebody give me a few ideas in relation to this Kmeans clustering context?
What is the norm of a vector?
numpy.linalg.norm is used to calculate the norm of a vector or a matrix.
This is the help document taken from numpy.linalg.norm:
numpy.linalg.norm(x, ord=None, axis=None, keepdims=False)[source]
This is the code snippet taken from K-Means Clustering in Python:
# Euclidean Distance Caculator
def dist(a, b, ax=1):
return np.linalg.norm(a - b, axis=ax)
It take order=None as default, so just to calculate the Frobenius norm of (a-b), this is ti calculate the distance between a and b( using the upper Formula).
I am not a mathematician but here is my layman's explanation of “norm”:
A vector describes the location of a point in space relative to the origin. Here’s an example in 2D space for the point [3 2]:
The norm is the distance from the point to the origin. In the 2D case it’s easy to visualize the point as the diametrically opposed point of a right triangle and see that the norm is the same thing as the hypotenuse.
However, In higher dimensions it’s no longer a shape we describe in average-person language, but the distance from the origin to the point is still called the norm. Here's an example in 3D space:
I don’t know why the norm is used in K-means clustering. You stated that it was part of determing the distance between the old and new centroid in each step. Not sure why one would use the norm for this since you can get the distance between two points in any dimensionality* using an extension of the from used in 2D algebra:
You just add a term for each addtional dimension, for example here is a 3D version:
*where the dimensions are positive integers
numpy.linalg.norm function is used to get the sum from a row or column of a matrix.Suppose ,
>>> c = np.array([[ 1, 2, 3],
... [-1, 1, 4]])
>>> LA.norm(c, axis=0)
array([ 1.41421356, 2.23606798, 5. ])
>>> LA.norm(c, axis=1)
array([ 3.74165739, 4.24264069])
>>> LA.norm(c, ord=1, axis=1)
array([6, 6])
Related
I would like to build a function npbatch(U,X) which compares data points in an input matrix (U) with data points in a training matrix (X) and gets me the index of X with the shortest euclidean distance to the data point in U.
I would like to avoid any loops to increase the performance and I would like to use the function scipy.spatial.distance.cdist to compute the distance.
Example Input:
U
array([[0.69646919, 0.28613933, 0.22685145],
[0.55131477, 0.71946897, 0.42310646],
[0.9807642 , 0.68482974, 0.4809319 ]])
X
array([[0.24875591, 0.16306678, 0.78364326],
[0.80852339, 0.62562843, 0.60411363],
[0.8857019 , 0.75911747, 0.18110506]])
--> Expected Output: Array with the three indices of the data points in X which have the shortest distance to the three data points in U.
My overall target is then to get the label of the corresponding data point using the index which I've got. Example for label input would be:
Y
array([1, 0, 0])
Thank you for any hint!
With scipy.spatial.distance.cdist you already chose a well-suited function for the task. To get the indices, we just have to apply numpy.argmin along the axis 0 (or axis 1 for cdist(U, X)):
ix = numpy.argmin(scipy.spatial.distance.cdist(X, U), 0)
Getting the label is then trivial:
Y[ix]
I was trying to figure out how to calculate the Frobenius of a matrix in numpy. This way I can get the 2-norm of each row in the matrix x below:
My question is about the ord parameter in numpy's linalg.norm module and how the relevant part of numpy documentation describes which norm of a matrix one can calculate. I was able to get the Frobenius norm by setting ord=2, however, it says that only setting ord=None gives the Frobenius norm.
Here is my example:
x = np.array([[0, 3, 4],
[1, 6, 4]])
I found that I can the Frobenius norm with the following line of code:
x_norm = np.linalg.norm(x, ord = 2, axis=1,keepdims=True )
>>> x_norm
array([[ 5. ],
[ 7.28010989]])
My question is whether the documentation here would be considered not as helpful as possible and if this warrants a request to change the description of setting ord=2 in the aforementioned table.
You're not taking a matrix norm. Since you've passed axis=1, you're taking vector norms, and you should be looking at the vector norm column instead of the matrix norm column.
For vector norms, ord=None and ord=2 both produce a 2-norm.
I have been utilizing PCA implemented in scikit-learn. However, I want to find the eigenvalues and eigenvectors that result after we fit the training dataset. There is no mention of both in the docs.
Secondly, can these eigenvalues and eigenvectors themselves be utilized as features for classification purposes?
I am assuming here that by EigenVectors you mean the Eigenvectors of the Covariance Matrix.
Lets say that you have n data points in a p-dimensional space, and X is a p x n matrix of your points then the directions of the principal components are the Eigenvectors of the Covariance matrix XXT. You can obtain the directions of these EigenVectors from sklearn by accessing the components_ attribute of the PCA object. This can be done as follows:
from sklearn.decomposition import PCA
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA()
pca.fit(X)
print pca.components_
This gives an output like
[[ 0.83849224 0.54491354]
[ 0.54491354 -0.83849224]]
where every row is a principal component in the p-dimensional space (2 in this toy example). Each of these rows is an Eigenvector of the centered covariance matrix XXT.
As far as the Eigenvalues go, there is no straightforward way to get them from the PCA object. The PCA object does have an attribute called explained_variance_ratio_ which gives the percentage of the variance of each component. These numbers for each component are proportional to the Eigenvalues. In the case of our toy example, we get these if print the explained_variance_ratio_ attribute :
[ 0.99244289 0.00755711]
This means that the ratio of the eigenvalue of the first principal component to the eigenvalue of the second principal component is 0.99244289:0.00755711.
If the understanding of the basic mathematics of PCA is clear, then a better way to get the Eigenvectors and Eigenvalues is to use numpy.linalg.eig to get Eigenvalues and Eigenvectors of the centered covariance matrix. If your data matrix is a p x n matrix, X (p features, n points), then the you can use the following code:
import numpy as np
centered_matrix = X - X.mean(axis=1)[:, np.newaxis]
cov = np.dot(centered_matrix, centered_matrix.T)
eigvals, eigvecs = np.linalg.eig(cov)
Coming to your second question. These EigenValues and EigenVectors cannot be used themselves for classification. For classification you need features for each data point. These Eigenvectors and Eigenvalues that you generate are derived from the entire covariance matrix, XXT. For dimensionality reduction you could use the projections of your original points(in the p-dimensional space) on the principal components obtained as a result of PCA. However, this is also not always useful, because PCA does not take into account the labels of your training data. I would recommend you to look into LDA for supervised problems.
Hope that helps.
The docs say explained_variance_ will give you
"The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.", new in version 0.18.
Seems a little questionable since the first and second sentences do not seem to agree.
sklearn PCA documentation
I'm using the module hcluster to calculate a dendrogram from a distance matrix. My distance matrix is an array of arrays generated like this:
import hcluster
import numpy as np
mols = (..a list of molecules)
distMatrix = np.zeros((10, 10))
for i in range(0,10):
for j in range(0,10):
sim = OETanimoto(mols[i],mols[j]) # a function to calculate similarity between molecules
distMatrix[i][j] = 1 - sim
I then use the command distVec = hcluster.squareform(distMatrix) to convert the matrix into a condensed vector and calculate the linkage matrix with vecLink = hcluster.linkage(distVec).
All this works fine but if I calculate the linkage matrix using the distance matrix and not the condensed vector matLink = hcluster.linkage(distMatrix) I get a different linkage matrix (the distances between the nodes are a lot larger and topology is slightly different)
Now I'm not sure whether this is because hcluster only works with condensed vectors or whether I'm making mistakes on the way there.
Thanks for your help!
I knocked up a quick random example similar to yours and experienced the same problem.
In the docstring it does say :
Performs hierarchical/agglomerative clustering on the
condensed distance matrix y. y must be a :math:{n \choose 2} sized
vector where n is the number of original observations paired
in the distance matrix.
However, having had a quick look at the code, it seems like the intent is for it to both work with vector shaped and matrix shaped code:
In hierachy.py there is a switch based upon the shape of the matrix.
It seems however that the key bit of info is in the function linkage's docstring:
- Q : ndarray
A condensed or redundant distance matrix. A condensed
distance matrix is a flat array containing the upper
triangular of the distance matrix. This is the form that
``pdist`` returns. Alternatively, a collection of
:math:`m` observation vectors in n dimensions may be passed as
a :math:`m` by :math:`n` array.
So I think that the interface doesn't allow the passing of a distance matrix.
Instead it thinks you are passing it m observation vectors in n dimensions .
Hence the difference in result?
Does that seem reasonable?
Else just take a look at the code itself I'm sure you'll be able to debug it and figure out why your examples are different.
Cheers
Matt
Given two vectors X and Y, I have to find their correlation, i.e. their linear dependence/independence. Both vectors have equal dimension. The result should be a floating point number from [-1.0 .. 1.0].
Example:
X=[-1, 2, 0]
Y=[ 4, 2, -0.3]
Find y = cor(X,Y) such that y belongs to [-1.0 .. 1.0].
It should be a simple construction involving a list-comprehension. No external library is allowed.
UPDATE: ok, if the dot product is enough, then here is my solution:
nX = 1/(sum([x*x for x in X]) ** 0.5)
nY = 1/(sum([y*y for y in Y]) ** 0.5)
cor = sum([(x*nX)*(y*nY) for x,y in zip(X,Y) ])
right?
Sounds like a dot product to me.
Solve the equation for the cosine of the angle between the two vectors, which is always in the range [-1, 1], and you'll have what you want.
It's equal to the dot product divided by the magnitudes of two vectors.
Since range is supposed to be [-1, 1] I think that the Pearson Correlation can be ok for your purposes.
Also dot-product would work but you'll have to normalize vectors before calculating it and you can have a -1,1 range just if you have also negative values.. otherwise you would have 0,1
Don't assume because a formula is algebraically correct that its direct implementation in code will work. There can be numerical problems with some definitions of correlation.
See How to calculate correlation accurately