I'm working on an assignment where I am tasked to implement PCA in Python for an online course. Unfortunately, when I try to run a comparison (provided by the course) between my implementation and SKLearn's, my results appear to differ too greatly.
After many hours of review, I am still unsure where it is going wrong. If someone could take a look and determine what step I have coded or interpreted incorrectly, I would greatly appreciate it.
import numpy as np

def normalize(X):
    """
    Normalize the given dataset X to have zero mean.
    Args:
        X: ndarray, dataset of shape (N,D)
    Returns:
        (Xbar, mean): tuple of ndarray, Xbar is the normalized dataset
        with mean 0; mean is the sample mean of the dataset.
    Note:
        You will encounter dimensions where the standard deviation is zero.
        For those, the process of normalization results in normalized data with NaN entries.
        We can handle this by setting std = 1 for those dimensions when doing the normalization.
    """
    # YOUR CODE HERE
    ### Uncomment and modify the code below
    mu = np.mean(X, axis=0)   # axis=0 computes the mean column-wise; axis=1 would compute it across rows.
    std = np.std(X, axis=0)   # column-wise standard deviation, again using axis=0
    std_filled = std.copy()
    std_filled[std == 0] = 1
    # Compute the normalized data as Xbar
    Xbar = (X - mu) / std_filled
    return Xbar, mu
def eig(S):
    """
    Compute the eigenvalues and corresponding unit eigenvectors for the covariance matrix S.
    Args:
        S: ndarray, covariance matrix
    Returns:
        (eigvals, eigvecs): ndarray, the eigenvalues and eigenvectors
    Note:
        the eigvals and eigvecs should be sorted in descending
        order of the eigenvalues
    """
    # YOUR CODE HERE
    # Uncomment and modify the code below
    # Compute the eigenvalues and eigenvectors
    # You can use library routines in `np.linalg.*` https://numpy.org/doc/stable/reference/routines.linalg.html for this
    eigvals, eigvecs = np.linalg.eig(S)
    # The eigenvalues and eigenvectors need to be sorted in descending order according to the eigenvalues
    # We will use `np.argsort` (https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html) to find a permutation of the indices
    # of eigvals that sorts eigvals in ascending order, and then reverse it via [::-1] to get descending order
    sort_indices = np.argsort(eigvals)[::-1]
    # Notice that we are sorting the columns (not rows) of eigvecs, since the columns represent the eigenvectors.
    return eigvals[sort_indices], eigvecs[:, sort_indices]
def projection_matrix(B):
    """Compute the projection matrix onto the space spanned by the columns of `B`.
    Args:
        B: ndarray of dimension (D, M), the basis for the subspace
    Returns:
        P: the projection matrix
    """
    # YOUR CODE HERE
    P = B @ (np.linalg.inv(B.T @ B)) @ B.T
    return P
def select_components(eig_vals, eig_vecs, num_components):
    """
    Selects the n components desired for projecting the data upon.
    Args:
        eig_vals: The eigenvalues sorted in descending order of magnitude.
        eig_vecs: The eigenvectors sorted in order relative to that of the eigenvalues.
        num_components: the number of principal components to use.
    Returns:
        The principal values and principal components to keep for projection of the data.
    """
    principal_vals, principal_components = eig_vals[:num_components], eig_vecs[:, :num_components]
    return principal_vals, principal_components
def PCA(X, num_components):
    """
    Projects normalized data onto the 'n' desired principal components.
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
           and N is the number of datapoints
        num_components: the number of principal components to use.
    Returns:
        the reconstructed data, the sample mean of X, the principal values
        and the principal components
    """
    # Normalize to have mean 0 and variance 1.
    Z, mean_vec = normalize(X)
    # Calculate the covariance matrix
    S = np.cov(Z, rowvar=False, bias=True)  # rowvar=False treats columns as variables; bias=True normalizes by N rather than N-1
    # Calculate the (unit) eigenvectors and eigenvalues of S, sorted in descending order of the eigenvalue magnitudes.
    eig_vals, eig_vecs = eig(S)
    # Keep only the n largest principal components of the sorted unit eigenvectors.
    principal_vals, principal_components = select_components(eig_vals, eig_vecs, num_components)
    # Compute the projection matrix using only the n largest principal components, where n = num_components.
    P = projection_matrix(principal_components)
    # Reconstruct the data by using the projection matrix to project the data onto the principal component vectors we've kept
    X_reconst = (P @ X.T).T
    return X_reconst, mean_vec, principal_vals, principal_components
And here is the test case I'm supposed to pass:
random = np.random.RandomState(0)
X = random.randn(10, 5)
from sklearn.decomposition import PCA as SKPCA
for num_component in range(1, 4):
    # We can compute a standard solution given by scikit-learn's implementation of PCA
    pca = SKPCA(n_components=num_component, svd_solver="full")
    sklearn_reconst = pca.inverse_transform(pca.fit_transform(X))
    reconst, _, _, _ = PCA(X, num_component)
    # The difference in the result should be very small (<10^-20)
    print(
        "difference in reconstruction for num_components = {}: {}".format(
            num_component, np.square(reconst - sklearn_reconst).sum()
        )
    )
    np.testing.assert_allclose(reconst, sklearn_reconst)
As far as I can tell, there are a few things wrong with your code.
Your projection matrix is wrong.
If the eigenvectors of your covariance matrix are collected in B, with dimension D x M, where M is the number of components you select and D is the dimension of the original data, then the projection matrix is just B @ B.T.
In a standard implementation of PCA, we typically do not scale the data by the inverse of the standard deviation. You seem to be trying to do an approximation of a whitened PCA (ZCA), but even then it looks wrong.
As a quick test, you can compute the normalized data without dividing by the standard deviation, and when you compute the covariance matrix, set bias=False.
You should also subtract the mean from the data before multiplying it by the projection operator, and add it back after that, i.e.,
X_reconst = (P @ (X - mean_vec).T).T + mean_vec.
PCA essentially is just a change of basis, followed by discarding the coordinates corresponding to directions with low variance. The eigenvectors of the covariance matrix correspond to the new orthogonal basis, and the eigenvalues tell you the variance of the data along the direction of the corresponding eigenvectors. P = B @ B.T is just the change of basis to the new basis B (with some coordinates discarded), followed by a change back to the original basis.
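Putting those points together, here is a minimal sketch of a corrected pipeline as I read the advice above (reusing the eig and select_components helpers defined in the question):

import numpy as np

def PCA_corrected(X, num_components):
    # Center the data; do not divide by the standard deviation.
    mean_vec = np.mean(X, axis=0)
    Z = X - mean_vec
    # Covariance of the centered data with bias=False (N-1 normalization).
    S = np.cov(Z, rowvar=False, bias=False)
    eig_vals, eig_vecs = eig(S)
    principal_vals, principal_components = select_components(eig_vals, eig_vecs, num_components)
    # Projection onto the span of the kept (orthonormal) eigenvectors: P = B @ B.T
    P = principal_components @ principal_components.T
    # Project the centered data, then add the mean back.
    X_reconst = (P @ Z.T).T + mean_vec
    return X_reconst, mean_vec, principal_vals, principal_components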
Edit
I'm curious to know which online course teaches people to implement PCA this way.
Related
I am trying to calculate the divergence of a 3D velocity field in a multi-phase flow setting (with solids immersed in a fluid). If we assume u, v, w to be the three velocity components (each an n x n x n 3D numpy array), here is the function I have for calculating divergence:
def calc_divergence_velocity(df, h=0.025):
    """
    @param df: A dataframe with the entire vector field with columns [x,y,z,u,v,w], with
               x,y,z indicating the 3D coordinates of each point in the field and u,v,w
               the velocities in the x,y,z directions respectively.
    @param h:  The dimension of a single side of the 3D (uniform) grid. Used
               as input to the numpy.gradient() function.
    """
    # Reshape dataframe columns to get 3D numpy arrays (dim = 80), so each of u,v,w is an
    # 80x80x80 ndarray.
    dim = 80
    u = df['u'].values.reshape((dim, dim, dim))
    v = df['v'].values.reshape((dim, dim, dim))
    w = df['w'].values.reshape((dim, dim, dim))
    # Supply x,y,z coordinates appropriately.
    # Note: Only a scalar `h` has been supplied to np.gradient because
    # the type of grid we are dealing with is a uniform grid with each
    # grid cell having the same dimensions in the x,y,z directions.
    u_grad = np.gradient(u, h, axis=0)  # central diff. du_dx
    v_grad = np.gradient(v, h, axis=1)  # central diff. dv_dy
    w_grad = np.gradient(w, h, axis=2)  # central diff. dw_dz
    # The `mask` column in the dataframe is a binary column indicating the locations
    # in the field where we are interested in measuring divergence.
    # The problem I am looking at is multi-phase flow with solid particles and a fluid,
    # hence we are only interested in the fluid locations.
    sdf = df['mask'].values.reshape((dim, dim, dim))
    div = (u_grad * sdf) + (v_grad * sdf) + (w_grad * sdf)
    return div
The problem I'm having is that the divergence values I am seeing are far too high.
For example, the distribution I am seeing has values between [-350, 350], whereas most values should technically be close to zero, somewhere between [-20, 20] in my case. This tells me I'm calculating the divergence incorrectly, and I would like some pointers as to how to correct the above function so that it calculates the divergence appropriately. As far as I can tell (please correct me if I'm wrong), I think I have done something similar to this upvoted SO response. Thanks in advance!
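One quick way to check whether the np.gradient spacing and axis conventions match the grid is to run the same three-gradient sum on a synthetic field whose divergence is known exactly. A sketch (dim and h here are placeholder values, not the poster's data):

import numpy as np

dim, h = 20, 0.025
coords = np.arange(dim) * h
x, y, z = np.meshgrid(coords, coords, coords, indexing='ij')
# For u = x, v = y, w = z the divergence is exactly 3 everywhere.
u, v, w = x, y, z
div = (np.gradient(u, h, axis=0)
       + np.gradient(v, h, axis=1)
       + np.gradient(w, h, axis=2))
print(div.mean(), div.std())  # expect ~3.0 and ~0.0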
I would like to generate a random potential in 1D or 2D space with a specified autocorrelation function, and according to some mathematical derivations, including the Wiener-Khinchin theorem and properties of the Fourier transform, it turns out that this can be done using the following equation:
V(x) = F^{-1}[ sqrt(F[C](k)) * e^{2*pi*i*phi(k)} ](x),
where phi(k) is uniformly distributed in the interval [0, 1). The phase satisfies phi(-k) = -phi(k), which ensures that the generated potential is always real.
The autocorrelation function should not affect what I am doing here, and I take a simple Gaussian, C(x) = V0^2 * exp(-x^2 / rho^2).
The choice of the phase term and the condition on phi(k) are based on the following properties:
The phase term must have a modulus of 1 (by the Wiener-Khinchin theorem, i.e. the Fourier transform of the autocorrelation of a function equals the squared modulus of the Fourier transform of that function);
The Fourier transform of a real function must satisfy F(-k) = conj(F(k)) (by directly inspecting the definition of the Fourier transform in integral form);
Both the generated potential and the autocorrelation are real.
By combining these three properties, the phase term can only take the form stated above.
For the relevant mathematics, you may refer to p.16 of the following pdf:
https://d-nb.info/1007346671/34
I randomly generated a numpy array using a uniform distribution and concatenated the negative of the array with the original array, so that it satisfies the condition on phi(k) stated above. Then I performed the numpy (inverse) fast Fourier transform.
I have tried both 1D and 2D cases, and only the 1D case is shown below.
import numpy as np
from numpy.fft import fft, ifft
import matplotlib.pyplot as plt
## The Gaussian autocorrelation function
def c(x, V0, rho):
    return V0**2 * np.exp(-x**2 / rho**2)
x_min, x_max, interval_x = -10, 10, 10000
x = np.linspace(x_min, x_max, interval_x, endpoint=False)
V0 = 1
## the correlation length
rho = 1
## (Uniformly) randomly generated array for k>0
phi1 = np.random.rand(interval_x // 2)
phi = np.concatenate((-1*phi1[::-1], phi1))
phase = np.exp(2j*np.pi*phi)
C = c(x, V0, rho)
V = ifft(np.power(fft(C), 0.5)*phase)
plt.plot(x, V.real)
plt.plot(x, V.imag)
plt.show()
The plot is similar to the following: [plot of V.real and V.imag omitted].
However, the generated potential turns out to be complex, and the imaginary parts are of the same order of magnitude as the real parts, which is not expected. I have checked the math many times, but I couldn't spot any problems. So I am wondering whether it is an implementation issue, for example whether the data points are dense enough for the fast Fourier transform, etc.
You have a few misunderstandings about how fft (more correctly, DFT) operates.
First, note that the DFT assumes that the samples of the sequence are indexed as 0, 1, ..., N-1, where N is the number of samples. Instead, you generate a sequence whose samples are centered around zero (x running from -10 to 10) rather than starting at index 0. Second, note that the DFT of a real sequence will generate real values for the "frequencies" corresponding to 0 and N/2. You also seem to not take this into account.
I won't go into further details as this is out of the scope of this stackexchange site.
Just for a sanity check, the code below generates a sequence that has the properties expected for the DFT (FFT) of a real-valued sequence:
conjugate symmetry of positive and negative frequencies,
real-valued elements corresponding to frequencies 0 and N/2
sequence assumed to correspond to indices 0 to N-1
As you can see, the ifft of this sequence indeed generates a real-valued sequence
import numpy as np
import matplotlib.pyplot as plt
from scipy.fftpack import ifft
N = 32 # number of samples
n_range = np.arange(N) # indices over which the sequence is defined
n_range_positive = np.arange(int(N/2)+1) # the "positive frequencies" sample indices
n_range_negative = np.arange(int(N/2)+1, N) # the "negative frequencies" sample indices
# generate a complex-valued sequence with the properties expected for the DFT of a real-valued sequence
abs_FFT_positive = np.exp(-n_range_positive**2/100)
phase_FFT_positive = np.r_[0, np.random.uniform(0, 2*np.pi, int(N/2)-1), 0] # note last frequency has zero phase
FFT_positive = abs_FFT_positive * np.exp(1j * phase_FFT_positive)
FFT_negative = np.conj(np.flip(FFT_positive[1:-1]))
FFT = np.r_[FFT_positive, FFT_negative] # this is the final FFT sequence
# compute the IFFT of the above sequence
IFFT = ifft(FFT)
#plot the results
plt.plot(np.abs(FFT), '-o', label = 'FFT sequence (abs. value)')
plt.plot(np.real(IFFT), '-s', label = 'IFFT (real part)')
plt.plot(np.imag(IFFT), '-x', label = 'IFFT (imag. part)')
plt.legend()
More care needs to be taken when concatenating:
phi1 = np.random.rand(int(interval_x)//2-1)
phi = np.concatenate(([0], phi1, [0], -phi1[::-1]))
The first element is the offset (zero frequency mode). "Negative" frequencies come after the midpoint.
This gives me: [resulting plot omitted].
I have 60000 vectors of 784 dimensions. This data has 10 classes.
I must evaluate a function that takes out one dimension and computes the distance metric again. This function computes the distance of each vector to its class's mean. In code:
from numpy.linalg import pinv
from scipy.spatial.distance import mahalanobis

def objectiveFunc(self, X, y, indices):
    # Keep only the columns (features) listed in `indices`
    subX = np.array([X[:, i] for i in indices]).T
    d = np.zeros((10, 1))
    for n in range(10):
        C = subX[np.where(y == n)]
        u = np.mean(C, axis=0)
        Sinv = pinv(covariance(C))  # `covariance` is a helper computing the covariance matrix of C
        d[n] = np.mean(np.apply_along_axis(mahalanobis, axis=1, arr=C, v=u, VI=Sinv))
    return d
where indices are fed in with one index removed during each iteration.
As you can imagine, I am computing a lot of individual components during the computation for Mahalanobis distance. Is there a way for me to store all the 784 component distances?
Alternatively, what's the fastest way to compute Mahalanobis distance?
First of all, and to make it easier to understand, this is the Mahalanobis distance formula:
D_M(x) = sqrt( (x - mu)^T * S^{-1} * (x - mu) ),
where mu is the mean and S is the covariance matrix of the class.
So, to compute the Mahalanobis distance for each element according to its class, we can do:
X_train = X_train.reshape(-1, 784)
def mahalanobis(element, classe):
    part = np.where(y_train == classe)[0]
    ave = np.mean(X_train[part])
    distance_example = np.sqrt(((np.mean(X_train[part[[element]]]) - ave)**2) / np.var(X_train[part]))
    return distance_example
mahalanobis(20, 2)
# Out[91]: 0.13947337027828757
Then you can create a for statement to calculate all distances. For instance, class 0:
[mahalanobis(i,0) for i in range(0,len(X_train[np.where(y_train==0)[0]]))]
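If speed is the main concern, here is a vectorized sketch of the per-class computation from the question itself (it assumes X of shape (N, D) and integer labels y as described there, and uses np.cov in place of the covariance helper):

import numpy as np
from numpy.linalg import pinv

def mean_mahalanobis_per_class(X, y, n_classes=10):
    d = np.zeros(n_classes)
    for n in range(n_classes):
        C = X[y == n]                          # all samples of class n
        diff = C - C.mean(axis=0)              # deviations from the class mean
        Sinv = pinv(np.cov(C, rowvar=False))   # pseudo-inverse of the class covariance
        # Row-wise (x - u) Sinv (x - u)^T in one einsum call instead of apply_along_axis
        sq = np.einsum('ij,jk,ik->i', diff, Sinv, diff)
        d[n] = np.sqrt(np.clip(sq, 0, None)).mean()
    return d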
I want to perform a PCA on my dataset:
XT.shape
->(2500,260)
The rows of the complex data matrix XT contain the samples (2500); the columns contain the variables (260).
I perform SVD like this: (Python)
u, s, vh = np.linalg.svd(XT)
proj_0 = np.dot(XT,vh)[:,0]
I thought this would give me the projection of my data onto the first principal component. However, if I do a PCA using the covariance matrix:
cov = np.cov(XT, rowvar=False)
eVals, eVecs = np.linalg.eigh(cov)
# Sort eigenvalues in decreasing order and eigenvectors alike
idx = np.argsort(np.abs(eVals))[::-1]
eVals = eVals[idx]
eVecs = eVecs[:,idx]
# Project data on eigenvectors
PCA_0 = np.dot(eVecs.T, XT.T).T[:,0]
Then PCA_0 and proj_0 do not yield the same results. So there is something I am missing, but what?
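For what it's worth, a small sketch (with random real-valued data standing in for the actual dataset) of how the two routes usually line up: if the data are mean-centered first, the rows of vh are the covariance eigenvectors up to sign, so the first-component scores agree up to a sign flip:

import numpy as np

rng = np.random.default_rng(0)
XT = rng.standard_normal((2500, 260))
Xc = XT - XT.mean(axis=0)                  # center the data first

u, s, vh = np.linalg.svd(Xc, full_matrices=False)
proj_0 = Xc @ vh[0]                        # rows of vh are the principal directions

cov = np.cov(Xc, rowvar=False)
eVals, eVecs = np.linalg.eigh(cov)
idx = np.argsort(eVals)[::-1]
PCA_0 = Xc @ eVecs[:, idx[0]]

print(np.allclose(np.abs(proj_0), np.abs(PCA_0)))  # True, up to the sign of the eigenvector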
I am trying to calculate the PCA of a matrix.
Sometimes the resulting eigenvalues/eigenvectors are complex, so when trying to project a point onto a lower-dimensional plane by multiplying the eigenvector matrix with the point coordinates, I get the following warning
ComplexWarning: Casting complex values to real discards the imaginary part
in this line of code: np.dot(self.u[0:components,:], vector)
The whole code I used to calculate the PCA:
import numpy as np
import numpy.linalg as la

class PCA:
    def __init__(self, inputData):
        data = inputData.copy()
        # m = no of points
        # n = no of features per point
        self.m = data.shape[0]
        self.n = data.shape[1]
        # mean center the data
        data -= np.mean(data, axis=0)
        # calculate the covariance matrix
        c = np.cov(data, rowvar=0)
        # get the eigenvalues/eigenvectors of c
        eval, evec = la.eig(c)
        # u = eigen vectors (transposed)
        self.u = evec.transpose()

    def getPCA(self, vector, components):
        if components > self.n:
            raise Exception("components must be > 0 and <= n")
        return np.dot(self.u[0:components, :], vector)
The covariance matrix is symmetric, and thus has real eigenvalues. You may see a small imaginary part in some eigenvalues due to numerical error. The imaginary parts can generally be ignored.
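In code, two common ways to deal with this (a sketch, using a covariance matrix built the same way as in the question):

import numpy as np

data = np.random.rand(100, 5)
data -= np.mean(data, axis=0)
c = np.cov(data, rowvar=0)

# Option 1: keep np.linalg.eig and drop the numerical-noise imaginary parts.
evals, evecs = np.linalg.eig(c)
evals, evecs = np.real(evals), np.real(evecs)

# Option 2: np.linalg.eigh is meant for symmetric/Hermitian matrices and
# always returns real eigenvalues (in ascending order).
evals, evecs = np.linalg.eigh(c)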
You can use the scikit-learn Python library for PCA; this is an example of how to use it.
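A minimal usage sketch (the data here is a random placeholder, not the matrix from the question):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)            # 100 points, 5 features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # points projected onto the first 2 components
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance explained by each kept component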