Scikit-learn principal component analysis (PCA) for dimension reduction - python

I want to perform principal component analysis for dimension reduction and data integration.
I have 3 features (variables) and 5 samples, like below. I want to integrate them into a 1-dimensional (1-feature) output by transforming them (computing the 1st PC). I want to use the transformed data for further statistical analysis, because I believe that it captures the 'main' characteristics of the 3 input features.
I first wrote a test code in Python using scikit-learn, like below. It is the simple case where the values of the 3 features are all equivalent. In other words, I applied PCA to three identical vectors, [0, 1, 2, 1, 0].
Code
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=1)

# Three identical features (columns), five samples (rows)
samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])

pc1 = pca.fit_transform(samples)
print(pc1)
Output
[[-1.38564065]
 [ 0.34641016]
 [ 2.07846097]
 [ 0.34641016]
 [-1.38564065]]
Is taking the 1st PC after dimension reduction a proper approach for data integration?
1-2. For example, if the features are [power rank, speed rank] and power has a roughly negative correlation with speed (a 2-feature case), I want to find the samples that have both 'high power' and 'high speed'. It is easy to decide that [power 1, speed 1] is better than [power 2, speed 2], but difficult for a case like [power 4, speed 2] vs [power 3, speed 3].
So I want to apply PCA to the 2-dimensional 'power and speed' dataset, take the 1st PC, and then use the rank of the 1st PC scores. Is this kind of approach still proper?
In this case, I think the output should also be [0, 1, 2, 1, 0], the same as the input. But the output was [-1.38564065, 0.34641016, 2.07846097, 0.34641016, -1.38564065]. Is there a problem with the code, or is this the right answer?

Yes, it is a proper approach; this is also called data projection (onto the lower dimension).
The output is centered according to the training data and then projected onto the unit-length first principal component. With three identical features that direction is [1, 1, 1]/sqrt(3), so each centered value is scaled by sqrt(3) (e.g. -0.8 * sqrt(3) ≈ -1.386 for the first sample), which is why you do not get [0, 1, 2, 1, 0] back. The result is correct.
With only 5 samples I don't think it is wise to run any statistical method. And if you believe that your features are the same, just check that the correlation between dimensions is close to 1; then you can simply disregard the other dimensions.
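A quick sketch to check both points on the toy data from the question: the PC1 scores are just the centered column values scaled by sqrt(3) (the first principal direction for three identical features is [1, 1, 1]/sqrt(3)), and the pairwise correlations between the identical features are exactly 1:
import numpy as np
from sklearn.decomposition import PCA

samples = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [1, 1, 1], [0, 0, 0]])
pc1 = PCA(n_components=1).fit_transform(samples)

centered = samples[:, 0] - samples[:, 0].mean()          # [-0.8, 0.2, 1.2, 0.2, -0.8]
print(np.allclose(pc1.ravel(), np.sqrt(3) * centered))   # True here (the sign of a PC is arbitrary in general)
print(np.corrcoef(samples, rowvar=False))                # all entries are 1 for identical columns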

There is no need to use PCA for such a small dataset, and for PCA your array should be scaled.
In any case, you have only 3 dimensions: you can plot the points and inspect them by eye, or you can calculate distances (some kind of nearest-neighbour approach).

Related

Can somebody tell me the name of this algorithm if it exists, or tell me how to find it?

Here is the idea:
There is a huge 2D dataset (250,000 datapoints).
I need to get rid of 90% of the datapoints without hurting the data structure, which means (I believe) getting rid of the closest ones. The density must decrease...
Considering we need to keep the structure, we can't just randomly delete 90%, as this might cause bias. There may be a small element of randomness in this, but not too much.
I can put the data in a 2D matrix and divide it into cells. Some cells will then have more datapoints, some will have fewer, and some will have none.
I need an algorithm that will group those datapoints, or the cells in my matrix, into segments which all have a relatively close number of datapoints in them. Those segments, or cells of the "new" matrix, can be of different sizes (which I believe is the point of this algorithm).
I've drawn a picture. It is not accurate, but I hope it makes the idea a bit clearer.
Also, I code in Python :^)
Thank you!!
The algorithm you are searching for is an unsupervised learning method; the most famous one is k-means, which is available in Python through scikit-learn.
You can find the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Here is a code example for an array:
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)  # cluster label assigned to each point
If you have to adjust it for a dataframe (df), it looks like this:
from sklearn.cluster import KMeans

X = df[['column A',..., 'column D']]  # the feature columns of your dataframe
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)
The output labels are your clusters.
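To connect this back to the thinning problem: a rough sketch (my own illustration, not part of the answer above) of one way to use the cluster labels, keeping roughly 10% of each cluster so that sparse regions are not wiped out. The stand-in data, the number of clusters (100) and the per-cluster fraction are placeholders you would tune:
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(25_000, 2))           # stand-in for the real 250,000-point 2D dataset

kmeans = KMeans(n_clusters=100, random_state=0, n_init=10).fit(X)

keep = []
for label in np.unique(kmeans.labels_):
    idx = np.where(kmeans.labels_ == label)[0]
    n_keep = max(1, int(0.1 * len(idx)))   # keep roughly 10% of every cluster
    keep.append(rng.choice(idx, size=n_keep, replace=False))

X_thinned = X[np.concatenate(keep)]
print(X.shape, X_thinned.shape)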

Understanding results from 1D np.correlate

I am trying to determine the similarity between two 1D time-series using numpy.correlate.
I wrote a small example program to learn more about how cross-correlation works; however, I am not completely understanding the trend in the correlation output.
Code:
import numpy as np
import matplotlib.pyplot as plt

# sample arrays to correlate
arr_1 = np.arange(1, 101)                                   # [1, 2, 3, ..., 100]
arr_2 = np.concatenate([np.zeros(50), np.arange(50, 101)])  # [0, 0, ..., 50, 51, ..., 100]

cross_corr = np.correlate(arr_1, arr_2, "same")
plt.plot(list(cross_corr))
plt.show()
This graph raises a couple of questions for me. It is my understanding that cross-correlation relies on the convolution operation (essentially the integral of the inner product of two signals, accounting for some lag).
Why does the correlation signal (above) steadily increase over (0, 50) if arr_2 is full of 0's from index 0 to 50?
How can I set the lag for the convolution operation? From the numpy docs I can't find a parameter which allows me to tweak the lag.
The peak at 50 is due to the fact that both signals line up at index 50, but why then does the correlation steadily decrease thereafter? If the two signals are lining up, shouldn't the correlation be increasing?
A correlation is significant only if its value is greater than 2/sqrt(n - abs(k)), where n is the number of samples and k is the lag. How would correlation significance come into play for the graph shown above?
It seems that you are confused about what exactly is being output. The documentation is honestly a little lacking. The output contains the correlation between your two arrays at each lag; the midpoint is where the lag is 0 and where the correlation is highest.
FYI, your two arrays are not the same size: arr_1 has length 100 and arr_2 has length 101. Not sure if this was intentional.
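On the lag question: np.correlate has no lag parameter, because the output already contains one value per lag; you just have to build the matching lag axis yourself. A minimal sketch with mode='full', reusing the arrays from the question (the lag axis is constructed purely from their lengths):
import numpy as np
import matplotlib.pyplot as plt

arr_1 = np.arange(1, 101)                                    # length 100
arr_2 = np.concatenate([np.zeros(50), np.arange(50, 101)])   # length 101

# mode='full' returns one correlation value for every possible overlap,
# ordered from the most negative lag to the most positive.
cross_corr = np.correlate(arr_1, arr_2, mode='full')
lags = np.arange(-(len(arr_2) - 1), len(arr_1))              # lag axis matching the 'full' output

plt.plot(lags, cross_corr)
plt.xlabel('lag')
plt.ylabel('cross-correlation')
plt.show()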

PCA: Get Top 20 Most Important Dimensions

I'm doing a bit of machine learning and trying to find important dimensions using PCA. Here's what I've done so far:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(df_normalized)
X_reduced.shape
(2208, 1961)
So I have 2,208 rows consisting of 1,961 columns after running PCA that explains 98% of the variance in my dataset. However, I'm worried that the dimensions with the least explanatory power may actually be hurting my attempt at prediction (my model may just find spurious correlations in the data).
Does SciKit-Learn order the columns by importance? If so, I could just do:
X_final = X_reduced[:, :20], correct?
Thanks for the help!
From the documentation, the output is sorted by explained variance. So yes, you should be able to do what you suggest and just take the first N dimensions of the output. You could also print explained_variance_ (or explained_variance_ratio_) along with components_ to double-check the order.
This example from the documentation shows how to access the explained variance amounts:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
So in your case you could do print(pca.components_) and print(pca.explained_variance_ratio_) to get both; note that these attributes live on the fitted pca object, not on the transformed array X_reduced. Then simply take the first N columns of X_reduced after finding what N explains y% of your variance.
Be aware of the shapes, though: pca.components_ is of shape [n_components, n_features], so if you want the directions of the first 20 components you should use pca.components_[:20, :], while the first 20 transformed features are X_reduced[:, :20].
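A minimal sketch of that workflow, reusing df_normalized from the question: fit once, look at the cumulative explained variance ratio, and then slice the first N columns of the transformed output:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(df_normalized)   # df_normalized as defined in the question

# Components come out sorted by explained variance, so the cumulative sum tells you
# how much variance the first N transformed columns cover.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative[:20])

# Keep only the 20 highest-variance dimensions of the transformed data.
X_final = X_reduced[:, :20]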

Finding and utilizing eigenvalues and eigenvectors from PCA in scikit-learn

I have been using the PCA implementation in scikit-learn. However, I want to find the eigenvalues and eigenvectors that result after we fit the training dataset. There is no mention of either in the docs.
Secondly, can these eigenvalues and eigenvectors themselves be used as features for classification purposes?
I am assuming here that by eigenvectors you mean the eigenvectors of the covariance matrix.
Let's say that you have n data points in a p-dimensional space, and X is a p x n matrix of your points. Then the directions of the principal components are the eigenvectors of the covariance matrix XX^T. You can obtain these directions from sklearn by accessing the components_ attribute of the PCA object. This can be done as follows:
from sklearn.decomposition import PCA
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA()
pca.fit(X)
print(pca.components_)
This gives an output like
[[ 0.83849224  0.54491354]
 [ 0.54491354 -0.83849224]]
where every row is a principal component in the p-dimensional space (2 in this toy example). Each of these rows is an eigenvector of the centered covariance matrix XX^T.
As far as the eigenvalues go, there is no straightforward way to get them from the PCA object. The PCA object does have an attribute called explained_variance_ratio_, which gives the percentage of the variance explained by each component; these numbers are proportional to the eigenvalues. For our toy example, printing the explained_variance_ratio_ attribute gives:
[ 0.99244289 0.00755711]
This means that the ratio of the eigenvalue of the first principal component to the eigenvalue of the second principal component is 0.99244289:0.00755711.
If the basic mathematics of PCA is clear to you, then a better way to get the eigenvectors and eigenvalues is to use numpy.linalg.eig on the centered covariance matrix. If your data matrix is a p x n matrix X (p features, n points), then you can use the following code:
import numpy as np

# Center each feature (row), then take the eigenvalues/eigenvectors of the covariance matrix
centered_matrix = X - X.mean(axis=1)[:, np.newaxis]
cov = np.dot(centered_matrix, centered_matrix.T)
eigvals, eigvecs = np.linalg.eig(cov)
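One thing to watch out for: np.linalg.eig does not return the eigenvalues in any particular order, so sort them (together with the matching eigenvector columns) before comparing them to the principal components. A small check of the relationship, assuming sklearn's usual (n - 1) variance normalisation, on the toy data above transposed into the p x n layout this code expects:
import numpy as np
from sklearn.decomposition import PCA

# Same toy data as above, transposed to the p x n (features x samples) layout
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]).T

centered_matrix = X - X.mean(axis=1)[:, np.newaxis]
cov = np.dot(centered_matrix, centered_matrix.T)
eigvals, eigvecs = np.linalg.eig(cov)

# Sort eigenvalues (and the matching eigenvector columns) in descending order.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA().fit(X.T)
print(eigvecs.T)                    # matches pca.components_ up to the sign of each row
print(pca.components_)
print(eigvals / (X.shape[1] - 1))   # should match pca.explained_variance_
print(pca.explained_variance_)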
Coming to your second question: these eigenvalues and eigenvectors cannot themselves be used for classification. For classification you need features for each data point, whereas the eigenvectors and eigenvalues you generate are derived from the entire covariance matrix XX^T. For dimensionality reduction you could instead use the projections of your original points (in the p-dimensional space) onto the principal components obtained from PCA. However, this is not always useful either, because PCA does not take the labels of your training data into account. I would recommend you look into LDA for supervised problems.
Hope that helps.
The docs say explained_variance_ will give you
"The amount of variance explained by each of the selected components. Equal to n_components largest eigenvalues of the covariance matrix of X." (new in version 0.18)
Seems a little questionable, since the first and second sentences do not seem to agree.
sklearn PCA documentation
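In other words, with a reasonably recent scikit-learn you can read the eigenvalues straight off the fitted PCA object; a minimal sketch on the same toy data as the answer above:
from sklearn.decomposition import PCA
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA().fit(X)

print(pca.explained_variance_)   # eigenvalues of the covariance matrix (sklearn >= 0.18)
print(pca.components_)           # the corresponding eigenvectors, one per row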

Modify k-means algorithm for a 1D array where order matters

I want to find groups in a one-dimensional array where order/position matters. I tried to use numpy's kmeans2, but it only works when I have numbers in increasing order.
I have to maximize the average difference between neighbouring sub-arrays.
For example: if I have the array [1,2,2,8,9,0,0,0,1,1,1] and I want to get 4 groups, the result should be something like [1,2,2], [8,9], [0,0,0], [1,1,1].
Is there a way to do it better than O(n^k)?
Answer: I ended up with a modified dendrogram, where I merge neighbours only.
K-means is about minimizing the least squares. Among its largest drawbacks (there are many) is that you need to know k. Why do you want to inherit this drawback?
Instead of hacking k-means into not ignoring the order, why don't you look at time-series segmentation and change-detection approaches, which are much more appropriate for this problem?
E.g. split your time series wherever abs(x[i] - x[i-1]) > stddev, where stddev is the standard deviation of your data set (or of the last 10 samples, say). For the series above the standard deviation is about 3, so it would split as [1,2,2], [8,9], [0,0,0,1,1,1], because the change from 0 to 1 is not significant.
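A minimal sketch of that splitting rule on the example series from the question, using the global standard deviation as the threshold (just one way to set it):
import numpy as np

x = [1, 2, 2, 8, 9, 0, 0, 0, 1, 1, 1]
threshold = np.std(x)        # roughly 3 for this series

segments = [[x[0]]]
for prev, cur in zip(x, x[1:]):
    if abs(cur - prev) > threshold:
        segments.append([])  # jump larger than the threshold: start a new segment
    segments[-1].append(cur)

print(segments)              # [[1, 2, 2], [8, 9], [0, 0, 0, 1, 1, 1]]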
