I am working on an exercise in remote sensing and am having some trouble performing a PCA. I have 103 different spectral bands, so the shape of my array is (m, n, 103). I want to reduce this to ten components.
I tried to use PCA from sklearn.decomposition, but my dimensions are wrong and I don't think I understand how to use the function.
pca = PCA(n_components = 10)
test = feature_scaling(data)
pca_components = pca.fit_transform(test)
This gives me the following error: *** ValueError: Found array with dim 3. Estimator expected <= 2.
I was wondering whether I need to extract each of the spectral bands and flatten them, but I am not sure what my input to the PCA is supposed to look like.
Does anyone know what I can do?
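For reference, a minimal sketch of the flattening idea (with a random array standing in for the real hyperspectral cube and the scaling step omitted): scikit-learn's PCA expects a 2D array of shape (n_samples, n_features), so each pixel becomes one sample with 103 band values, and the result can be reshaped back to the original spatial dimensions.
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the hyperspectral cube from the question: shape (m, n, 103).
data = np.random.rand(50, 60, 103)

m, n, bands = data.shape
flat = data.reshape(-1, bands)                 # (m*n, 103): one row per pixel

pca = PCA(n_components=10)
flat_reduced = pca.fit_transform(flat)         # (m*n, 10)

reduced_cube = flat_reduced.reshape(m, n, 10)  # back to (m, n, 10)
print(reduced_cube.shape)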
I have a function f(u,v,w) which I would like to interpolate using a scipy function (with linear interpolation). This is easy enough.
When I run the interpolation step, I simply do the following (interpolating over a u,v,w grid):
u = np.linspace(-1,1,100)
v = np.linspace(-2,2,50)
w = np.linspace(3,8,30)
values_grid = np.zeros((len(u), len(v), len(w)))
for i in range(len(u)):
    for j in range(len(v)):
        for k in range(len(w)):
            values_grid[i, j, k] = f(u[i], v[j], w[k])
from scipy.interpolate import RegularGridInterpolator
my_interpolating_function = RegularGridInterpolator((u, v, w), values_grid, method='linear',bounds_error=False,fill_value=-999)
This is fine for many cases. However, when I want to evaluate this interpolation function, it seems I am required to use inputs with shape (number of input samples) x (dimension of samples). E.g.:
func_input = np.vstack([u_samps, v_samps, w_samps]).T  # e.g. shape is (500, 3)
output = my_interpolating_function(func_input)         # has output shape (500,)
This works fine. The issue is that I would like to evaluate this function over a grid where the samples have the following shape
shape(u_samps) = 500
shape(v_samps) = (100,100)
shape(w_samps) = (100,100)
Meaning I would like to evaluate
my_interpolating_function([u_samps, v_samps, w_samps])
and get out an array with shape (500, 100, 100) (so the interpolation is evaluated for all 500 u_samps over the v_samps and w_samps grids). I can flatten the v_samps and w_samps arrays, but then I have to make hundreds of copies of u_samps to get the inputs into the correct format. So is there any way to have an interpolation function that can take the inputs above (u_samps, v_samps, w_samps with the specified shapes) and return an array with shape (500, 100, 100) efficiently?
Any help is greatly appreciated; I have been stuck on this problem and it's really holding up my progress! The end goal is to use this function in a statistical likelihood which needs to be sampled with MCMC, so speed is pretty important (and making hundreds of copies of massive arrays is very slow).
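One way to avoid tiling u_samps by hand is to rely on NumPy broadcasting together with the fact that RegularGridInterpolator accepts points of shape (..., ndim) and returns values of shape (...). A sketch with a made-up f and random sample values; the only large array that gets materialised is the stacked point array the interpolator needs anyway:
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Grid and a stand-in for f(u, v, w), as in the question.
u = np.linspace(-1, 1, 100)
v = np.linspace(-2, 2, 50)
w = np.linspace(3, 8, 30)
U, V, W = np.meshgrid(u, v, w, indexing='ij')
values_grid = U + V * W  # hypothetical f

interp = RegularGridInterpolator((u, v, w), values_grid, method='linear',
                                 bounds_error=False, fill_value=-999)

# Sample points with the shapes from the question.
u_samps = np.random.uniform(-1, 1, 500)
v_samps = np.random.uniform(-2, 2, (100, 100))
w_samps = np.random.uniform(3, 8, (100, 100))

# Broadcast u_samps against the (100, 100) grids instead of copying it;
# all three arrays broadcast to (500, 100, 100), then stack into (..., 3).
pts = np.stack(np.broadcast_arrays(u_samps[:, None, None],
                                   v_samps[None, :, :],
                                   w_samps[None, :, :]), axis=-1)
out = interp(pts)  # shape (500, 100, 100)
print(out.shape)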
I've been tasked with implementing my own PCA code to convert data to a 2D field for a KNN assignment. My PCA code creates an array with the eigenvectors, called PCevecs.
def __PCA(data):
    # Normalize data
    data_cent = data - np.mean(data)
    # Calculate covariance
    covarianceMatrix = np.cov(data_cent, bias=True)
    # Find eigenvectors and eigenvalues
    eigenvalue, eigenvector = np.linalg.eigh(covarianceMatrix)
    # Sort the eigenvectors and eigenvalues
    PCevals = eigenvalue[::-1]
    PCevecs = eigenvector[:, ::-1]
    return PCevals, PCevecs
The assignment transforms the training-data using the PCA. The returned PCevecs has the shape (88, 88) given by calling print(PCevecs.shape). The shape of the training data is (88, 4).
np.dot(trainingFeatures, PCevecs[:, 0:2])
When the code runs I get the error message "ValueError: shapes (88,4) and (88,2) not aligned: 4 (dim 1) != 88 (dim 0)". I can see that the shapes don't match, but I can't see what I've done wrong in the PCA implementation. I've had a look at similar problems on Stack Overflow, but I haven't seen anyone sorting the eigenvectors and eigenvalues the way I do.
(EDITED with additional info from the comments)
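For context, a quick check of where the (88, 88) shape comes from (with random numbers standing in for the training data): by default np.cov() treats each row as a variable, not each column.
import numpy as np

X = np.random.rand(88, 4)  # same shape as the training data
print(np.cov(X, bias=True).shape)                # (88, 88): rows treated as variables
print(np.cov(X, rowvar=False, bias=True).shape)  # (4, 4):   columns treated as variables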
While the PCA implementation is OK in general, you may want to either compute it on the transposed data, or make sure that you tell np.cov() which axis holds your features via the rowvar parameter.
The following would work as you expect:
import numpy as np
def __PCA_fixed(data, rowvar=False):
    # Normalize data
    data_cent = data - np.mean(data)
    # Calculate covariance (pass `rowvar` to `np.cov()`)
    covarianceMatrix = np.cov(data_cent, rowvar=rowvar, bias=True)
    # Find eigenvectors and eigenvalues
    eigenvalue, eigenvector = np.linalg.eigh(covarianceMatrix)
    # Sort the eigenvectors and eigenvalues
    PCevals = eigenvalue[::-1]
    PCevecs = eigenvector[:, ::-1]
    return PCevals, PCevecs
Testing it out with some random numbers:
data = np.random.randint(0, 100, (100, 10))
PCevals, PCevecs = __PCA_fixed(data)
print(PCevecs.shape)
# (10, 10)
Also note that, more generally, the singular value decomposition (np.linalg.svd() in NumPy) is often a better approach for principal component analysis; it has a simple relationship to the eigenvalue decomposition used here (up to a transposition).
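For illustration, a minimal sketch of that SVD route (with random numbers in place of real data, samples in rows): the principal directions are the rows of Vt, and the eigenvalues of the biased covariance matrix are s**2 / n.
import numpy as np

data = np.random.rand(88, 4)
data_cent = data - data.mean(axis=0)

# Economy-size SVD of the centred data matrix.
U, s, Vt = np.linalg.svd(data_cent, full_matrices=False)

PCevecs_svd = Vt.T                   # columns are principal directions, shape (4, 4)
PCevals_svd = s**2 / len(data_cent)  # already sorted in descending order

print(PCevecs_svd.shape, PCevals_svd)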
As a general coding style note, it may be a good idea to follow the advice from PEP 8, much of which can be readily checked by an automated tool such as autopep8.
This is the code I found online:
d0 = pd.read_csv('./mnist_train.csv')
labels = d0.label.head(15000)
data = d0.drop('label', axis=1).head(15000)
from sklearn.preprocessing import StandardScaler
standardized_data = StandardScaler().fit_transform(data)
# find the covariance matrix, which is (A^T * A) / n
sample_data = standardized_data
# matrix multiplication using numpy
covar_matrix = np.matmul(sample_data.T, sample_data) / len(sample_data)
How does multiplying the data with itself, np.matmul(sample_data.T, sample_data), give the covariance matrix? What is the covariance matrix according to this tutorial I found online? The last step is what I don't understand.
This might be a better question for the math or stats stack exchange, but I'll answer here for now.
This comes from the definition of covariance. The Wikipedia page (linked) gives a whole lot of detail, but covariance is defined as (in pseudo-code)
cov = E[dot((x - E[x]), (x - E[x]).T)]
for column vectors, but in your case you probably have row vectors, which is why the first element in your dot-product is transposed, not the second. The E[...] means expected value, which is the mean for Gaussian-distributed data. When you perform StandardScaler().fit_transform(data), you are basically subtracting out the mean of the data, so that's why you don't explicitly do so in your dot product.
Note that StandardScaler() is also dividing by the variance, so it's normalizing everything to unit variance. This is going to affect your covariance! So if you need the actual covariance of the data without normalization, just calculate it with something like np.cov() from the numpy module.
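To make that concrete, here is a quick numerical check (with random data standing in for the MNIST sample): for standardized, zero-mean data, (A^T * A) / n matches np.cov() with bias=True.
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.random.rand(1000, 5)
sample_data = StandardScaler().fit_transform(data)  # zero mean, unit variance per column

covar_matrix = np.matmul(sample_data.T, sample_data) / len(sample_data)
print(np.allclose(covar_matrix, np.cov(sample_data.T, bias=True)))  # True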
Let's build towards the covariance matrix step by step; first, let's define variance.
The variance of a random variable X is a measure of how much the values in the distribution vary on average with respect to the mean.
Now we have to define covariance.
Covariance is a measure of the joint variability of two random variables; it describes how the two variables change together.
So, armed with that, you can understand that the covariance matrix is a matrix which shows how each feature varies with changes in the other features. Its entries can be calculated as
cov(X_i, X_j) = E[(X_i - E[X_i]) * (X_j - E[X_j])]
and for mean-centered data (such as the output of StandardScaler) this reduces to the (A^T * A) / n expression that you are confused about. If you have any further queries, comment below.
Source: Wikipedia.
I'm trying to calculate eigenfaces for a set of images using python.
First I turn each image into a vector using:
list(map(lambda x:x.flatten(), x))
Then I calculate covariance matrix (after removing mean from all data):
# x is a numpy array
x = x - mean_image
cov_matrix = np.cov(x.T)
Then I calculate the eigenvalues and eigenvectors:
eigen_values, eigen_vectors = np.linalg.eig(cov_matrix)
The results are vectors with complex numbers, so I only keep the real part to be able to show them:
eigen_vectors = np.real(eigen_vectors)
When I try to show the eigenfaces (eigenvectors), the result is not even close to what an eigenface looks like:
I have managed to get a list of eigenfaces using np.linalg.svd(); however, I'm curious why my code does not work and how I can change it so it works as expected.
To stop np.linalg.eig from returning complex results I reduced the size of the images, so it no longer returns complex numbers; however, my eigenvectors still don't look like eigenfaces:
proj_data = np.dot(x.transpose(),eigen_vector).T
img = proj_data[i].reshape(height,width)
This will give you the expected result. After calculating the eigenvectors you should transpose the projected data, or you will get a mixed-up image.
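For completeness, a self-contained sketch of one common way to get real-valued eigenfaces, using np.linalg.eigh (which is designed for symmetric matrices and returns real results) rather than np.linalg.eig; the image size, image count, and random pixel values below are made up.
import numpy as np

height, width, n_images = 32, 32, 40
x = np.random.rand(n_images, height * width)  # one flattened image per row

mean_image = x.mean(axis=0)
x_cent = x - mean_image

# Covariance in pixel space: shape (height*width, height*width).
cov_matrix = np.cov(x_cent.T)

# eigh returns real eigenvalues in ascending order, so reverse to put the
# largest-variance directions first.
eigen_values, eigen_vectors = np.linalg.eigh(cov_matrix)
eigen_vectors = eigen_vectors[:, ::-1]

# Each column is one eigenface; reshape it back to image dimensions.
first_eigenface = eigen_vectors[:, 0].reshape(height, width)
print(first_eigenface.shape)  # (32, 32)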
This question pertains to machine learning.
I populate an array with the values of a greyscale image.
ben = io.ImageCollection('./Ben_bw.png')[0]
ben = np.array(ben)#array of all pixels
Now I flatten the array with:
ben_flat = ben.reshape((1, -1))
When I print ben_flat.shape I get (1, 10304), and the array is not all zeros.
Then when I try to use PCA and fit the array:
pca = PCA(n_components=200)
ben_reduced = pca.fit(ben_flat)
When I fit the array I receive an error:
RuntimeWarning: invalid value encountered in true_divide
From what I understand, there is a division by zero somewhere, but I can't work out where it is or how it ends up there.
PCA fitting is done with n samples, each with the same number of features. The components of each sample are compared and the ones with the most variance are kept first, thus retaining the most information. ben_flat is just one sample, and the algorithm has no way to decompose it into a lower dimension because there are no other samples to compare it with.
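To illustrate, a minimal sketch with random numbers standing in for real images: stacking many flattened images as rows gives PCA something to compare, and n_components must not exceed min(n_samples, n_features).
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stack of 300 flattened greyscale images, 10304 pixels each.
images = np.random.rand(300, 10304)

pca = PCA(n_components=200)
reduced = pca.fit_transform(images)
print(reduced.shape)  # (300, 200)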