Does sklearn's implementation of PCA preserve order of input? - python

Let's say this is how I do my PCA with sklearn's sklearn.decomposition.PCA:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
def doPCA(arr):
    scaler = StandardScaler()
    scaler.fit(arr)
    arr = scaler.transform(arr)
    pca = PCA(n_components=2)
    X = pca.fit_transform(arr)
    return X
My current understanding is that I get an output array of the same length, but each sample is now of dimension 2.
Now, I am interested where a value in my original array arr ended up after the PCA.
My question is:
Can I assume that X[i] corresponds to arr[i]?

What you obtain as X in your code, which is U[:, :n_components] * S[:n_components], holds the coordinates of your samples along the first n_components principal components (these are usually called scores; this answer calls them loadings). To understand why X[i] corresponds to arr[i], let us see what these values mean.
Loadings
Imagine the eigenvectors to be basis vectors for a new space of dimension n_components. The loadings define where each data point lies in this new space. In other words, they are the original data points from the full feature space projected onto the reduced-dimensional space: row i of X is computed from row i of the (standardized, centered) input and from nothing else. They are also the coefficients of the linear combination of components, np.dot(X, pca.components_), that approximates the original (standardized) features.
So you can assume that X[i] corresponds to arr[i].
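The row correspondence can be checked directly. The following is a minimal sketch on made-up random data (the array shape, the seed and the index i are assumptions chosen only for illustration): transforming one row on its own should give the same values as the matching row of the full output.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data: 20 samples with 5 features each (illustrative only).
rng = np.random.default_rng(0)
arr = rng.normal(size=(20, 5))

scaler = StandardScaler().fit(arr)
pca = PCA(n_components=2)
X = pca.fit_transform(scaler.transform(arr))

# Transform a single row on its own and compare it with the same row of X.
i = 7
row_alone = pca.transform(scaler.transform(arr[i:i + 1]))[0]
same_row = np.allclose(X[i], row_alone)

print(X.shape)    # one output row per input row
print(same_row)
```

If same_row is True for every index you try, the output rows are in the same order as the input rows.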

Related

Trouble in understanding how PCA is achieving image compression and reducing dimension

I was going through this amazing playlist on SVD by Steve Brunton on YouTube. I think I got the majority of the concepts, but there are some gaps. Let me add a couple of screenshots so that it's easier for me to explain.
He is considering the input matrix X to be a collection of images. So, considering an image is 28x28 pixels, we flatten it to create a 784x1 column vector. So, each column denotes an image, and the rows denote pixel indices. Let's take the dimension of X to be n x m. Now, after computing the economy SVD, if we keep only the first r (<< m) singular values, then the approximation of X is given by
X' = σ1.u1.v1(T) + σ2.u2.v2(T) + ... + σr.ur.vr(T)
I understand that here, we're throwing away information, so the reconstructed images would be pixelated but they would still be of the same dimension (28x28). So, how are we achieving compression here? Is it because instead of storing 784m pixel values, we'll have to store r x (28 (length of each u) + 28 (length of each v)) pixels? Or is there something more to it?
My second question is, if I try to draw an analogy to numerical features, e.g. let's say a housing price dataset, that has 50 features, and 1000 data points. So, our X matrix has dimension 50 x 1000 (each column being a feature vector). In that case, if there are useless features, we'll get << 50 features (maybe 20, or 10... whatever) after applying PCA, right? I'm not able to grasp how that smaller feature vector is derived when we select only the biggest r singular values. Because X and X' have the same dimensions.
Let's have a sample code. The dimensions are reversed because of how sklearn expects it.
pca = PCA(n_components=10)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape: ", X.shape) # original shape: (1000, 50)
print("transformed shape:", X_pca.shape) # transformed shape: (1000, 10)
So, how are we going from 50 to 10 here? I get that in this case there would be 50 U basis vectors. So, even if we choose the top r from these 50, the dimensions will still be the same, right? Any help is appreciated.
I've been searching for the answer all over the web, and finally it clicked when I saw this video tutorial. We know X = U x ∑ x V.T. Here, the columns of U give us the principal components for the column space of X. Similarly, the rows of V.T give us the principal components for the row space of X. Since in PCA we tend to represent a feature vector by a row (unlike in SVD), we'd select the first r principal components from the matrix V.T.
Let's assume the dimensions of X to be mxn. So, we have m samples each having n features. That gives us the following dimensions for the SVD:
U: mxm
∑: mxn
V: nxn
Now, if we select only r (<< n) principal components, then the projection of X onto the r-dimensional space is given by X.[v1 v2 ... vr]. Here each of v1, v2, ... vr is a column vector of length n, so the dimension of [v1 v2 ... vr] is nxr. If we now multiply X (mxn) by this matrix, we get an mxr matrix, which is nothing but the projection of all m data points onto r dimensions.
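The projection described above can be sketched with NumPy and compared against sklearn. The data here is random and made up for illustration, and the sketch centers X first, because sklearn's PCA subtracts the column means before the SVD (an extra step the explanation above does not spell out). The sign of each component is arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: m = 1000 samples, n = 50 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
r = 10

# Center the columns, then take the economy SVD: Xc = U @ diag(S) @ Vt.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

V_r = Vt[:r].T       # n x r matrix [v1 v2 ... vr]
X_proj = Xc @ V_r    # (m x n) @ (n x r) -> m x r

# Compare with sklearn; component signs may differ, so compare magnitudes.
X_pca = PCA(n_components=r, svd_solver="full").fit_transform(X)
matches = np.allclose(np.abs(X_proj), np.abs(X_pca), atol=1e-6)

print(V_r.shape, X_proj.shape)
print(matches)
```

So the 50 features are not "dropped"; each sample is re-expressed by its r coordinates along v1 ... vr.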

Finding and utilizing eigenvalues and eigenvectors from PCA in scikit-learn

I have been utilizing PCA implemented in scikit-learn. However, I want to find the eigenvalues and eigenvectors that result after we fit the training dataset. There is no mention of both in the docs.
Secondly, can these eigenvalues and eigenvectors themselves be utilized as features for classification purposes?
I am assuming here that by EigenVectors you mean the Eigenvectors of the Covariance Matrix.
Let's say that you have n data points in a p-dimensional space, and X is a p x n matrix of your points. Then the directions of the principal components are the Eigenvectors of the Covariance matrix XX^T. You can obtain the directions of these EigenVectors from sklearn by accessing the components_ attribute of the PCA object. This can be done as follows:
from sklearn.decomposition import PCA
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA()
pca.fit(X)
print(pca.components_)
This gives an output like
[[ 0.83849224 0.54491354]
[ 0.54491354 -0.83849224]]
where every row is a principal component in the p-dimensional space (2 in this toy example). Each of these rows is an Eigenvector of the centered covariance matrix XX^T.
As far as the Eigenvalues go, there is no straightforward way to get them from the PCA object. The PCA object does have an attribute called explained_variance_ratio_, which gives the fraction of the total variance explained by each component. These numbers are proportional to the Eigenvalues. In the case of our toy example, we get the following if we print the explained_variance_ratio_ attribute:
[ 0.99244289 0.00755711]
This means that the ratio of the eigenvalue of the first principal component to the eigenvalue of the second principal component is 0.99244289:0.00755711.
If the basic mathematics of PCA is clear to you, then a better way to get the Eigenvectors and Eigenvalues is to use numpy.linalg.eig on the centered covariance matrix. If your data matrix is a p x n matrix, X (p features, n points), then you can use the following code:
import numpy as np
centered_matrix = X - X.mean(axis=1)[:, np.newaxis]
cov = np.dot(centered_matrix, centered_matrix.T)
eigvals, eigvecs = np.linalg.eig(cov)
Coming to your second question: these EigenValues and EigenVectors cannot themselves be used for classification. For classification you need features for each data point, whereas the Eigenvectors and Eigenvalues that you generate are derived from the entire covariance matrix, XX^T. For dimensionality reduction you could use the projections of your original points (in the p-dimensional space) onto the principal components obtained as a result of PCA. However, this is also not always useful, because PCA does not take into account the labels of your training data. I would recommend looking into LDA for supervised problems.
Hope that helps.
The docs say explained_variance_ will give you
"The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X.", new in version 0.18.
Seems a little questionable since the first and second sentences do not seem to agree.
sklearn PCA documentation
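The docs' statement can be checked on the toy data from the answer above. This is a small sketch, assuming a scikit-learn version that provides explained_variance_; it compares that attribute with the eigenvalues of the sample covariance matrix from np.cov (which divides by n - 1), and components_ with the eigenvectors up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data from the answer above: 6 points (rows), 2 features (columns).
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
pca = PCA().fit(X)

# Sample covariance of the features (np.cov divides by n - 1).
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: cov is symmetric
order = np.argsort(eigvals)[::-1]             # sort largest first, like PCA
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

eigvals_match = np.allclose(pca.explained_variance_, eigvals)
ratio_match = np.allclose(pca.explained_variance_ratio_, eigvals / eigvals.sum())
# Rows of components_ are the eigenvectors, up to an arbitrary sign.
vecs_match = np.allclose(np.abs(pca.components_), np.abs(eigvecs.T))

print(eigvals_match, ratio_match, vecs_match)
```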

How do I compute the variance of a column of a sparse matrix in Scipy?

I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:
E[X^2] - (E[X])^2
E[X] stands for the mean. So to calculate E[X^2] you would have to square the csc_matrix and then use the mean function. To get (E[X])^2 you simply need to square the result of the mean function obtained using the normal input.
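A minimal sketch of this formula for the columns of a sparse matrix follows. The helper name sparse_column_variance and the random test matrix are made up for illustration. Squaring is done element-wise with multiply, which keeps the matrix sparse, and the result is the population variance (dividing by the number of rows, not by rows minus one).

```python
import numpy as np
from scipy import sparse

def sparse_column_variance(mat):
    """Population variance of each column: E[X^2] - (E[X])^2."""
    mean = np.asarray(mat.mean(axis=0)).ravel()
    # Element-wise square; zeros stay zero, so the result is still sparse.
    mean_of_squares = np.asarray(mat.multiply(mat).mean(axis=0)).ravel()
    return mean_of_squares - mean ** 2

# Made-up sparse matrix: 200 rows, 30 columns, 5% non-zero entries.
mat = sparse.random(200, 30, density=0.05, format="csc", random_state=0)

col_var = sparse_column_variance(mat)
# Clip tiny negative values that floating-point cancellation could produce.
col_std = np.sqrt(np.maximum(col_var, 0.0))
print(col_var.shape)
```

Note that this formula can lose precision when the mean is large compared to the spread of the values.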
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower compared to converting the whole matrix at once):
import numpy as np
# mat is the sparse matrix
# Get the number of columns
cols = mat.shape[1]
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler
scalar = StandardScaler(with_mean=False)
scalar.fit(X)
Then the variances are in the attribute var_:
X_var = scalar.var_
The curious thing though, is that when I densified first using pandas (which is very slow) my answer was off by a few percent. I don't know which is more accurate.
The efficient way is actually to densify the entire matrix, then standardize it column by column in the usual way with
X = X.toarray()
X -= X.mean(axis=0)
X /= X.std(axis=0)
As @Sebastian has noted in his comments, standardizing destroys the sparsity structure (it introduces lots of non-zero elements) in the subtraction step, so there's no use keeping the matrix in a sparse format.

python hcluster, distance matrix and condensed distance matrix

I'm using the module hcluster to calculate a dendrogram from a distance matrix. My distance matrix is an array of arrays generated like this:
import hcluster
import numpy as np
mols = (..a list of molecules)
distMatrix = np.zeros((10, 10))
for i in range(0,10):
    for j in range(0,10):
        sim = OETanimoto(mols[i],mols[j]) # a function to calculate similarity between molecules
        distMatrix[i][j] = 1 - sim
I then use the command distVec = hcluster.squareform(distMatrix) to convert the matrix into a condensed vector and calculate the linkage matrix with vecLink = hcluster.linkage(distVec).
All this works fine, but if I calculate the linkage matrix using the distance matrix and not the condensed vector, matLink = hcluster.linkage(distMatrix), I get a different linkage matrix (the distances between the nodes are a lot larger and the topology is slightly different).
Now I'm not sure whether this is because hcluster only works with condensed vectors or whether I'm making mistakes on the way there.
Thanks for your help!
I knocked up a quick random example similar to yours and experienced the same problem.
In the docstring it does say :
Performs hierarchical/agglomerative clustering on the
condensed distance matrix y. y must be a :math:{n \choose 2} sized
vector where n is the number of original observations paired
in the distance matrix.
However, having had a quick look at the code, it seems like the intent is for it to work with both vector-shaped and matrix-shaped input:
In hierarchy.py there is a switch based upon the shape of the matrix.
It seems however that the key bit of info is in the function linkage's docstring:
- Q : ndarray
A condensed or redundant distance matrix. A condensed
distance matrix is a flat array containing the upper
triangular of the distance matrix. This is the form that
``pdist`` returns. Alternatively, a collection of
:math:`m` observation vectors in n dimensions may be passed as
a :math:`m` by :math:`n` array.
So I think that the interface doesn't allow the passing of a distance matrix.
Instead it thinks you are passing it m observation vectors in n dimensions.
Hence the difference in result?
Does that seem reasonable?
Else just take a look at the code itself I'm sure you'll be able to debug it and figure out why your examples are different.
Cheers
Matt
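The behaviour described in this answer can be reproduced with scipy.cluster.hierarchy. Using SciPy here is an assumption: its linkage docstring reads the same way as the one quoted above, but this sketch does not run hcluster itself. The points are random and made up for illustration.

```python
import warnings
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Made-up observations: 10 points in 3 dimensions (illustrative only).
rng = np.random.default_rng(0)
points = rng.random((10, 3))

dist_matrix = squareform(pdist(points))   # 10 x 10 square distance matrix
dist_vec = squareform(dist_matrix)        # condensed vector, 10 * 9 / 2 entries

link_from_vec = linkage(dist_vec)         # treated as distances
with warnings.catch_warnings():
    # SciPy may warn that the input looks like an uncondensed distance matrix.
    warnings.simplefilter("ignore")
    # A 2-D input is treated as 10 observation vectors in 10 dimensions.
    link_from_matrix = linkage(dist_matrix)

same_as_points = np.allclose(link_from_vec, linkage(points))
same_heights = np.allclose(link_from_vec[:, 2], link_from_matrix[:, 2])
print(dist_vec.shape, same_as_points, same_heights)
```

Passing the square matrix makes linkage compute distances between its rows, which is why the merge heights come out larger than those obtained from the condensed vector.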

What's wrong with my PCA?

My code:
from numpy import *
def pca(orig_data):
    data = array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    u, s, v = linalg.svd(data)
    print(s)  # should be s**2 instead!
    print(v)
def load_iris(path):
    lines = []
    with open(path) as input_file:
        lines = input_file.readlines()
    data = []
    for line in lines:
        cur_line = line.rstrip().split(',')
        cur_line = cur_line[:-1]
        cur_line = [float(elem) for elem in cur_line]
        data.append(array(cur_line))
    return array(data)
if __name__ == '__main__':
    data = load_iris('iris.data')
    pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Output:
[ 20.89551896 11.75513248 4.7013819 1.75816839]
[[ 0.52237162 -0.26335492 0.58125401 0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
[ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
[ 0.26199559 -0.12413481 -0.80115427 0.52354627]]
Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - Same as I got but transposed so okay I guess
Also, what's with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:
cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print(val)
But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should have 4 eigenvalues and not 150 like the eig gives me. Am I doing something wrong?
edit: I've noticed that the values differ by a factor of 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to add up to the number of dimensions, in this case 4. What I don't understand is why this difference is happening. If I simply divided the eigenvalues by len(data) I could get the result I want, but I don't understand why. Either way the proportion of the eigenvalues isn't altered, but they are important to me so I'd like to understand what's going on.
You decomposed the wrong matrix.
Principal Component Analysis requires manipulating the eigenvectors/eigenvalues of the covariance matrix, not the data itself. For a data matrix with n rows (data points) and m columns (features), the covariance matrix is m x m. If the features are standardized first, it becomes the correlation matrix, which has ones along the main diagonal.
You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef:
import numpy as NP
import numpy.linalg as LA
# a simulated data set with 8 data points, each point having five features
data = NP.random.randint(0, 10, 40).reshape(8, 5).astype(float)
# usually a good idea to mean center your data first
# (the float dtype is needed for the in-place subtraction to work):
data -= NP.mean(data, axis=0)
# calculate the correlation matrix (the covariance matrix of the standardized data)
C = NP.corrcoef(data, rowvar=False)
# returns an m x m matrix, here a 5 x 5 matrix
# now get the eigenvalues/eigenvectors of C
# (named evals/evecs to avoid shadowing the built-in eval):
evals, evecs = LA.eig(C)
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) LA module. It is a little easier to work with than svd: the return values are the eigenvectors and eigenvalues themselves, and nothing else. By contrast, as you know, svd doesn't return these directly.
Granted, the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however, when doing PCA, you'll always have a square matrix to decompose, regardless of the form that your data is in. This is because the matrix you are decomposing in PCA is a covariance matrix, which by definition is always square (its rows and its columns both correspond to the features of the original matrix, and each cell is the covariance of those two features; for the correlation matrix returned by corrcoef the main diagonal is all ones, since a feature is perfectly correlated with itself).
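The left singular vectors returned by SVD(A) are the eigenvectors of AA^T.
The covariance matrix of a (centered) dataset A is: 1/(N-1) * AA^T
Now, when you do PCA by using the SVD, you have to square the singular values and divide them by (N-1) (equivalently, divide each entry in your A matrix by sqrt(N-1) before the SVD) so you get the eigenvalues of the covariance with the correct scale.
In your case, N=150 and you haven't done this division, hence the discrepancy. Because you standardized with data.std(axis=0), which divides by N rather than N-1, the divisor that reproduces your desired output is N: 20.89551896**2 / 150 ≈ 2.9108.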
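The scaling can be checked numerically. The sketch below uses made-up random data of the same shape as iris (150 x 4), not the iris file itself, and standardizes it the way the question does (NumPy's std divides by N by default). Under that assumption the squared singular values divided by N should equal the eigenvalues of the correlation matrix and sum to the number of features.

```python
import numpy as np

# Made-up stand-in for the iris measurements: 150 samples, 4 correlated features.
rng = np.random.default_rng(0)
raw = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
N = raw.shape[0]

# Standardize as in the question (std uses ddof=0, i.e. divides by N).
data = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Singular values of the standardized data, largest first.
s = np.linalg.svd(data, compute_uv=False)
eig_from_svd = s ** 2 / N

# Eigenvalues of the correlation matrix, sorted largest first.
corr_eigs = np.sort(np.linalg.eigvalsh(np.corrcoef(raw, rowvar=False)))[::-1]

print(np.allclose(eig_from_svd, corr_eigs))
print(eig_from_svd.sum())   # should be close to the number of features
```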
This is explained in detail here
(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)
You probably used cov incorrectly by not transposing the matrix first. If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.
Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:
a. the left singular vectors of X,
b. the principal components of X,
c. the eigenvectors of X X^T.
Also, the eigenvalues of X X^T are equal to the square of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens.
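The substitution suggested above gives X X^T = Q S V^T V S Q^T = Q S^2 Q^T, so D = Q^T X X^T Q = S^2. A small numerical sketch on made-up random data (4 features as rows, 150 observations as columns, matching the layout assumed in this answer):

```python
import numpy as np

# Made-up data: each of the 150 columns is one 4-element observation.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 150))
X = X - X.mean(axis=1, keepdims=True)   # center each feature (row)

# Economy SVD: X = Q @ diag(S) @ Vt.
Q, S, Vt = np.linalg.svd(X, full_matrices=False)

# D = Q^T X X^T Q should be diagonal with the squared singular values.
D = Q.T @ X @ X.T @ Q
off_diagonal = D - np.diag(np.diag(D))
is_diagonal = np.allclose(off_diagonal, 0.0)
eigs_match = np.allclose(np.diag(D), S ** 2)

# Each column of Q is an eigenvector of X X^T with eigenvalue S**2.
is_eigvec = np.allclose(X @ X.T @ Q, Q * S ** 2)
print(is_diagonal, eigs_match, is_eigvec)
```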
Question already addressed: Principal component analysis in Python
