This question pertains to machine learning.
I populate an array with the values of a greyscale image.
from skimage import io
import numpy as np

ben = io.ImageCollection('./Ben_bw.png')[0]
ben = np.array(ben)  # array of all pixels
Now I flatten the array with:
ben_flat = ben.reshape((1, -1))
When I print ben_flat.shape I get (1, 10304), and the array is not all zeros.
Then when I try to use PCA and fit the array:
from sklearn.decomposition import PCA

pca = PCA(n_components=200)
ben_reduced = pca.fit(ben_flat)
When I fit the array I receive the following warning:
RuntimeWarning: invalid value encountered in true_divide
From what I understand, a division by zero is happening somewhere, but I can't work out where it is or how it ends up there.
PCA is fit on n samples, each with the same number of features. It compares the samples along each direction and keeps the directions with the most variance first, thus retaining the most information. ben_flat is just one sample, so the algorithm has no way to decompose it into a lower dimension: there are no other samples to compare it with.
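For illustration, here is a minimal sketch with synthetic data standing in for a real image collection (the image count and size are made up), showing the kind of 2-D input PCA expects, with one flattened image per row:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a collection of 400 greyscale images of 112 x 92 pixels.
images = np.random.rand(400, 112, 92)

# One flattened image per row: shape (400, 10304).
X = images.reshape((images.shape[0], -1))

pca = PCA(n_components=200)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (400, 200)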
I am working on an exercise in remote sensing and am having some trouble performing a PCA. I have 103 different spectral bands, meaning the shape of my array is (m, n, 103). I want to reduce this to ten components.
I tried to use PCA from sklearn.decomposition, but my dimensions are wrong and I don't think I understand how to use the function.
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
test = feature_scaling(data)  # feature_scaling is defined elsewhere
pca_components = pca.fit_transform(test)
Giving me the following error: *** ValueError: Found array with dim 3. Estimator expected <= 2.
I was wondering if I maybe need to extract each of the spectral bands and flatten them, but I am not sure what my input to the PCA is supposed to look like.
Does anyone know what I can do?
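One common layout, sketched below with placeholder array sizes, is to treat every pixel as a sample with 103 spectral features, run PCA on that flattened 2-D array, and reshape the result afterwards:

import numpy as np
from sklearn.decomposition import PCA

# Placeholder cube standing in for the real (m, n, 103) data.
m, n, bands = 50, 60, 103
data = np.random.rand(m, n, bands)

# Every pixel becomes one sample with 103 spectral features.
flat = data.reshape(-1, bands)           # (m * n, 103)

pca = PCA(n_components=10)
reduced = pca.fit_transform(flat)        # (m * n, 10)

# Back to image layout, now with 10 components per pixel.
reduced_cube = reduced.reshape(m, n, 10)
print(reduced_cube.shape)                # (50, 60, 10)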
I'm confused by the problem mentioned in the title. Does n_components=None mean that no transformation is applied to the input, or that the input is transformed into the new space but, instead of the usual "reduction" (keeping the few components with the highest eigenvalues), all of the new synthetic features are kept? The documentation suggests the latter to me:
Hence, the None case results in: n_components == min(n_samples, n_features) - 1
But this is not entirely clear, and additionally: if it indeed means keeping all the components, why on earth does their number equal min(n_samples, n_features) - 1 rather than n_features?
However, the other alternative (that in the None case the whole PCA step is effectively dropped) seems strange to me as well; then again, I have never heard of applying PCA without omitting at least some of the eigenvectors...
As per the official documentation:
If svd_solver == 'arpack', the number of components must be strictly less than the minimum of n_features and n_samples.
Hence, the None case results in:
n_components == min(n_samples, n_features) - 1
So it depends on the type of solver being used for the eigenvectors (which can be set via the svd_solver parameter).
If arpack : run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)
As for your second query about dropping the whole PCA step, that depends entirely on what you are trying to solve. Since the PCA components explain the variation of the data in decreasing order (the first component explains the most variance, the last component explains the least), keeping only the components that explain the most variance can be useful for specific tasks.
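A quick check with arbitrary shapes illustrates the difference between the default solver and 'arpack' when n_components=None:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(20, 5)  # 20 samples, 5 features

pca_full = PCA(n_components=None).fit(X)
print(pca_full.n_components_)    # 5 == min(n_samples, n_features) with the default solver

pca_arpack = PCA(n_components=None, svd_solver='arpack').fit(X)
print(pca_arpack.n_components_)  # 4 == min(n_samples, n_features) - 1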
I've been tasked with implementing my own PCA to reduce the data to a 2-D space for a KNN assignment. My PCA code produces an array with the eigenvectors called PCevecs.
def __PCA(data):
    # Normalize data
    data_cent = data - np.mean(data)
    # Calculate covariance
    covarianceMatrix = np.cov(data_cent, bias=True)
    # Find eigenvectors and eigenvalues
    eigenvalue, eigenvector = np.linalg.eigh(covarianceMatrix)
    # Sort the eigenvectors and eigenvalues in descending order
    PCevals = eigenvalue[::-1]
    PCevecs = eigenvector[:, ::-1]
    return PCevals, PCevecs
The assignment then transforms the training data using the PCA. The returned PCevecs has shape (88, 88), as shown by print(PCevecs.shape). The shape of the training data is (88, 4).
np.dot(trainingFeatures, PCevecs[:, 0:2])
When the code runs I get the error message "ValueError: shapes (88,4) and (88,2) not aligned: 4 (dim 1) != 88 (dim 0)". I can see that the arrays don't match, but I can't see that I've done anything wrong in the PCA implementation. I've looked at similar problems on Stack Overflow, but I haven't seen anyone sorting the eigenvectors and eigenvalues the same way.
(EDITED with additional info from the comments)
While the PCA implementation is OK in general, you may want to either compute it on the transposed data, or make sure that you tell np.cov() which axis holds your dimensions via the rowvar parameter.
The following would work as you expect:
import numpy as np

def __PCA_fixed(data, rowvar=False):
    # Normalize data
    data_cent = data - np.mean(data)
    # Calculate covariance (pass `rowvar` on to `np.cov()`)
    covarianceMatrix = np.cov(data_cent, rowvar=rowvar, bias=True)
    # Find eigenvectors and eigenvalues
    eigenvalue, eigenvector = np.linalg.eigh(covarianceMatrix)
    # Sort the eigenvectors and eigenvalues in descending order
    PCevals = eigenvalue[::-1]
    PCevecs = eigenvector[:, ::-1]
    return PCevals, PCevecs
Testing it out with some random numbers:
data = np.random.randint(0, 100, (100, 10))
PCevals, PCevecs = __PCA_fixed(data)
print(PCevecs.shape)
# (10, 10)
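With that shape, the projection from the question lines up; continuing the snippet above (using column-wise centering for the projection, the usual PCA convention):

# Project the (column-centered) data onto the first two principal components.
projected = np.dot(data - np.mean(data, axis=0), PCevecs[:, 0:2])
print(projected.shape)
# (100, 2)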
Also note that, in more general terms, the singular value decomposition (np.linalg.svd() in NumPy) may be a better approach for principal component analysis; it has a simple relationship (up to transposition) to the eigenvalue decomposition you use.
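For reference, here is a minimal SVD-based sketch (the function name is mine; it uses column-wise centering and the same biased covariance convention as above):

def __PCA_svd(data):
    # Center each feature (column-wise mean).
    data_cent = data - np.mean(data, axis=0)
    # Economy-size SVD: the rows of Vt are the principal directions.
    U, s, Vt = np.linalg.svd(data_cent, full_matrices=False)
    PCevals = s ** 2 / data.shape[0]  # eigenvalues of the biased covariance matrix
    PCevecs = Vt.T                    # columns are eigenvectors, already sorted by variance
    return PCevals, PCevecs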
As a general coding-style note, it may be a good idea to follow the advice of PEP 8, much of which can be checked automatically by a tool such as autopep8.
Let us say I have a DataFrame of 20 columns and 10K rows. Since the data has a wide range of values, I use the following code to normalize it:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled now contains both negative and positive values.
Now if I pass this normalized data frame to the spectral cluster as follows,
from sklearn.cluster import SpectralClustering

spectral = SpectralClustering(n_clusters=k,
                              n_init=30,
                              affinity='nearest_neighbors',
                              random_state=cluster_seed,
                              assign_labels='kmeans')
clusters = spectral.fit_predict(df_scaled)
I will get the cluster labels.
Here is what confuses me: the official doc says that
"Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm."
Questions: Do the normalized negative values in df_scaled affect the clustering result?
OR
Does it depend on the affinity computation I am using, e.g. precomputed or rbf? If so, how can I pass the normalized input values to SpectralClustering?
My understanding is that normalizing could improve the clustering results and speed up the computation.
I appreciate any help or tips on how I can approach the problem.
You are passing a data matrix, not a precomputed affinity matrix.
The "nearest neighbors" uses a binary kernel, which is non-negative.
To better understand the inner workings, please have a look at the source code.
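To see this concretely, you can build the same kind of connectivity graph yourself; a sketch, assuming your df_scaled from above and the default n_neighbors=10 (recent scikit-learn versions symmetrize the graph this way internally):

from sklearn.neighbors import kneighbors_graph

# Binary k-nearest-neighbour connectivity graph, as used by affinity='nearest_neighbors'.
connectivity = kneighbors_graph(df_scaled, n_neighbors=10, include_self=True)
affinity_matrix = 0.5 * (connectivity + connectivity.T)

# All entries are 0, 0.5 or 1, so the affinities stay non-negative
# regardless of the sign of the scaled feature values.
print(affinity_matrix.min() >= 0)  # True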
I am trying to use scikit-learn to train on 2 features, called x1 and x2. Both of these arrays have shape (490,1). In order to pass a single X argument into clf.fit(X,y), I used np.concatenate to produce an array of shape (490,2). The label array is composed of 1's and 0's and has shape (490,). The code is shown below:
import numpy as np
from sklearn.svm import SVC

x1 = int_x   # previously defined array, shape (490,1)
x2 = int_x2  # previously defined array, shape (490,1)
y = np.ravel(close)  # close is composed of 1's and 0's, shape (490,1)
X, y = np.concatenate((x1[:-1], x2[:-1]), axis=1), y[:-1]  # train on all data points except the last
clf = SVC()
clf.fit(X, y)
The following error is shown:
X.shape[1] = 1 should be equal to 2, the number of features at training time
What I don't understand is why this message appears even though, when I check the shape of X, its second dimension is indeed 2 and not 1. I originally tried this with only one feature and clf.fit(X,y) worked well, so I am inclined to think that np.concatenate produced something that was not suitable. Any suggestions would be great.
It's difficult to say without having the concrete values of int_x, int_x2 and close. Indeed, if I try with int_x, int_x2 and close randomly constructed as
import numpy as np
from sklearn.svm import SVC
int_x = np.random.normal(size=(490,1))
int_x2 = np.random.normal(size=(490,1))
close = np.random.randint(2, size=(490,))
which conforms to your specs, then your code works. Thus the error may be in the way you constructed int_x, int_x2 and close.
If you believe the problem is not there, could you please share a minimal reproducible example with specific values of int_x, int_x2 and close?
I think I understand what was wrong with my code.
First, I should have created another variable, say x, defined as the concatenation of int_x and int_x2, with shape (490,2) so that it has the same number of rows as close. This came in handy later.
Next, clf.fit(X,y) was not incorrect in itself; my prediction code was. I had written clf.predict([close[-1]]) in the hope of getting the binary target output (either 0 or 1), but the argument passed to the method was wrong: it should have been clf.predict([x[-1]]), because the classifier predicts a label from a feature vector, not the other way around. Since x has the same number of rows as close, clf.predict([x[-1]]) produces the predicted value of close[-1].
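A minimal sketch of that corrected flow, with synthetic data standing in for int_x, int_x2 and close:

import numpy as np
from sklearn.svm import SVC

# Synthetic stand-ins for the real arrays.
int_x = np.random.normal(size=(490, 1))
int_x2 = np.random.normal(size=(490, 1))
close = np.random.randint(2, size=(490, 1))

x = np.concatenate((int_x, int_x2), axis=1)  # features, shape (490, 2)
y = np.ravel(close)                          # labels, shape (490,)

clf = SVC()
clf.fit(x[:-1], y[:-1])          # train on all data points except the last

print(clf.predict([x[-1]]))      # predicted label for the last feature row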