I am using the following code to do principal component analysis of the iris data:
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
dat = pd.DataFrame(data=iris.data, columns=['sl', 'sw', 'pl', 'pw'])
from sklearn.preprocessing import scale
stddat = scale(dat)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pc_out = pca.fit_transform(stddat)
pcdf = pd.DataFrame(data = pc_out , columns = ['PC-1', 'PC-2'])
print(pcdf.head())
Output:
PC-1 PC-2
0 -2.264542 0.505704
1 -2.086426 -0.655405
2 -2.367950 -0.318477
3 -2.304197 -0.575368
4 -2.388777 0.674767
Now I want to determine PC-1 for a new set of values of 'sl', 'sw', 'pl' and 'pw', say: 4.8, 3.1, 1.3, 0.2. How can I do this? I could not find any way to do this using the sklearn library.
Edit: as mentioned in the comments, I can get PC values for new data with the command pca.transform(new_data). However, I am interested in getting the variable loadings so that I can use these numbers to compute PC values later and from anywhere, rather than just in the current environment.
By loadings I mean "the weight by which each standardized original variable should be multiplied to get the component score" (from https://en.wikipedia.org/wiki/Principal_component_analysis ). I cannot find a method for this on the documentation page: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
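For reference, here is a minimal sketch of what the comments suggest (this swaps scale for StandardScaler, which stores the training mean and standard deviation so a new raw observation can be standardized consistently before pca.transform is applied):
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = datasets.load_iris()
dat = pd.DataFrame(data=iris.data, columns=['sl', 'sw', 'pl', 'pw'])

# StandardScaler keeps mean_ and scale_ so new samples can be standardized the same way
scaler = StandardScaler()
stddat = scaler.fit_transform(dat)

pca = PCA(n_components=2)
pca.fit(stddat)

# new observation: sl=4.8, sw=3.1, pl=1.3, pw=0.2
new_data = pd.DataFrame([[4.8, 3.1, 1.3, 0.2]], columns=['sl', 'sw', 'pl', 'pw'])
new_pcs = pca.transform(scaler.transform(new_data))
print(new_pcs)  # first column is PC-1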
Here's the transform function from the sklearn source:
def transform(self, X):
    """Apply dimensionality reduction to X.

    X is projected on the first principal components previously extracted
    from a training set.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        New data, where n_samples is the number of samples
        and n_features is the number of features.

    Returns
    -------
    X_new : array-like, shape (n_samples, n_components)

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.decomposition import IncrementalPCA
    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    >>> ipca = IncrementalPCA(n_components=2, batch_size=3)
    >>> ipca.fit(X)
    IncrementalPCA(batch_size=3, copy=True, n_components=2, whiten=False)
    >>> ipca.transform(X) # doctest: +SKIP
    """
    check_is_fitted(self, ['mean_', 'components_'], all_or_any=all)

    X = check_array(X)
    if self.mean_ is not None:
        X = X - self.mean_
    X_transformed = np.dot(X, self.components_.T)
    if self.whiten:
        X_transformed /= np.sqrt(self.explained_variance_)
    return X_transformed
The variable loadings are the components, which you get from pca.components_. Make sure that your mean_ is 0 and whiten is False; then you can simply take that matrix and use it wherever you want to transform your matrices/vectors.
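As an illustration, and continuing the sketch above, the only numbers you need to carry around are pca.components_, scaler.mean_, and scaler.scale_; with those, a component score for a new raw observation is just a dot product:
import numpy as np

# loadings: one row per component, one column per standardized variable
loadings = pca.components_        # shape (2, 4)
train_mean = scaler.mean_         # per-variable mean of the training data
train_std = scaler.scale_         # per-variable standard deviation of the training data

# standardize the new observation with the training statistics
new_raw = np.array([4.8, 3.1, 1.3, 0.2])
new_std = (new_raw - train_mean) / train_std

# each component score is a dot product of the loadings with the standardized values
pc1 = np.dot(loadings[0], new_std)
pc2 = np.dot(loadings[1], new_std)
print(pc1, pc2)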
Related
After fitting my data with PCA:
X = ...  # my data, a 2-D array
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.fit_transform(X)
now X_pca has one dimension.
When I perform the inverse transformation, isn't it by definition supposed to return the original data, that is X, a 2-D array?
When I do
X_ori = pca.inverse_transform(X_pca)
I get the same dimensions but different numbers.
Also, if I plot both X and X_ori, they are different.
When I perform the inverse transformation, isn't it by definition supposed to return the original data
No, you can only expect this if the number of components you specify is the same as the dimensionality of the input data. For any n_components less than this, you will get different numbers than the original dataset after applying the inverse PCA transformation: the following diagrams give an illustration in two dimensions.
It cannot do that, since by reducing the dimensions with PCA you have lost information (check pca.explained_variance_ratio_ for the percentage of information you still have). However, it tries its best to go back to the original space as well as it can; see the picture below
(generated with
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(1)
X_orig = np.random.rand(10, 2)
# project down to 1-D, then map back up to 2-D
X_re_orig = pca.inverse_transform(pca.fit_transform(X_orig))

plt.scatter(X_orig[:, 0], X_orig[:, 1], label='Original points')
plt.scatter(X_re_orig[:, 0], X_re_orig[:, 1], label='InverseTransform')
[plt.plot([X_orig[i, 0], X_re_orig[i, 0]], [X_orig[i, 1], X_re_orig[i, 1]]) for i in range(10)]
plt.legend()
plt.show()
)
If you keep the number of dimensions the same (set pca = PCA(2)), you do recover the original points (the new points sit exactly on top of the original ones).
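A quick numerical check of that claim (a minimal sketch; np.allclose compares the arrays up to floating-point error):
import numpy as np
from sklearn.decomposition import PCA

X_orig = np.random.rand(10, 2)

# keep as many components as the data has dimensions: nothing is lost
pca_full = PCA(2)
X_back = pca_full.inverse_transform(pca_full.fit_transform(X_orig))
print(np.allclose(X_orig, X_back))      # True

# keep only one component: the reconstruction differs from the original
pca_1d = PCA(1)
X_back_1d = pca_1d.inverse_transform(pca_1d.fit_transform(X_orig))
print(np.allclose(X_orig, X_back_1d))   # almost always False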
I'm a Matlab user and I'm learning Python with the sklearn library. I want to translate this Matlab code
[coeff,score] = pca(X)
For coeff I have tried this in Python:
from sklearn.decomposition import PCA
import numpy as np
pca = PCA()
pca.fit(X)
coeff = np.transpose(pca.components_)
print(coeff)
I don't know whether or not it's right; for score I have no idea.
Could anyone enlighten me about the correctness of coeff and the feasibility of score?
The sklearn PCA has a score method as described in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Try pca.score(X) or pca.score_samples(X), depending on whether you want a score for each sample (the latter) or a single score for all samples (the former).
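A small sketch of the difference between the two outputs (X here is just a placeholder data matrix):
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=1).fit(X)

print(pca.score(X))          # a single float: average log-likelihood over all samples
print(pca.score_samples(X))  # an array of shape (n_samples,): log-likelihood per sample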
The PCA score in sklearn is different from the one in Matlab.
In sklearn, pca.score() or pca.score_samples() gives the log-likelihood of the samples, whereas Matlab's score gives the principal component scores.
From sklearn Documentation:
Return the log-likelihood of each sample.
Parameters:
X : array, shape(n_samples, n_features)
The data.
Returns:
ll : array, shape (n_samples,)
Log-likelihood of each sample under the current model
From matlab documentation:
[coeff,score,latent] = pca(___) also returns the principal component
scores in score and the principal component variances in latent. You
can use any of the input arguments in the previous syntaxes.
Principal component scores are the representations of X in the
principal component space. Rows of score correspond to observations,
and columns correspond to components.
The principal component variances are the eigenvalues of the
covariance matrix of X.
Now, the equivalent of Matlab's score in sklearn is fit_transform() or transform():
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> matlab_equi_score = pca.fit_transform(X)
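Continuing that snippet, the Matlab coeff and latent outputs also have sklearn counterparts (a sketch; note that the sign of each component may be flipped between the two libraries, which is a normal ambiguity in PCA):
>>> matlab_equi_coeff = pca.components_.T         # like Matlab's coeff: one column per component
>>> matlab_equi_latent = pca.explained_variance_  # like Matlab's latent: per-component variance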
A LinearRegression object from sklearn.linear_model can be used to fit data points to a line. As can be seen from the code below, the fit method takes two parameters: a list of points and another list of just the y coordinates.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
My question is this: why is the second parameter even required? Is it not redundant information?
Linear models are not restricted to only 1 predictor variable and 1 response variable. In other words, you can have X and Y as the two predictors being used to predict the response variable Z, where Z might depend linearly on X and Y. In your case you are only trying to predict Y from X, so change your code to the following:
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0], [1], [2]], [0, 1, 2])
to fit data points to a line
It rather fits a line through your data points.
the fit method takes two parameters: a list of points and another list of just the y coordinates
X holds your data samples, where each row is one datapoint (one sample, an N-dimensional feature vector).
y holds the datapoint labels, one per datapoint. The fit method finds a matrix W (feature weights) and a vector b (bias) that minimize the distance between the prediction yhat = Wx + b and the real y.
E.g. if you are given 2-dimensional datapoints with coordinates [x, y] and you would like to predict y based on x, you pass the xs as the first argument and the ys as the second argument to fit, as sketched below.
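A minimal sketch of that last point (the names points, xs and ys are just illustrative):
import numpy as np
from sklearn.linear_model import LinearRegression

# 2-dimensional datapoints [x, y]; we want to predict y from x
points = np.array([[0, 0], [1, 1], [2, 2]])
xs = points[:, [0]]   # first column as a 2-D feature matrix, shape (3, 1)
ys = points[:, 1]     # second column as the target vector, shape (3,)

reg = LinearRegression()
reg.fit(xs, ys)

print(reg.coef_, reg.intercept_)  # slope close to 1.0, intercept close to 0.0
print(reg.predict([[3]]))         # close to 3.0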
I can run the simple pykalman Kalman Filter example given in the pykalman documentation:
import pykalman
import numpy as np
kf = pykalman.KalmanFilter(transition_matrices = [[1, 1], [0, 1]], observation_matrices = [[0.1, 0.5], [-0.3, 0.0]])
measurements = np.asarray([[1,0], [0,0], [0,1]]) # 3 observations
(filtered_state_means, filtered_state_covariances) = kf.filter(measurements)
print(filtered_state_means)
This correctly returns the state estimates (one for each observation):
[[ 0.07285974 0.39708561]
[ 0.30309693 0.2328318 ]
[-0.5533711 -0.0415223 ]]
However, if I provide only a single observation, the code fails:
import pykalman
import numpy as np
kf = pykalman.KalmanFilter(transition_matrices = [[1, 1], [0, 1]], observation_matrices = [[0.1, 0.5], [-0.3, 0.0]])
measurements = np.asarray([[1,0]]) # 1 observation
(filtered_state_means, filtered_state_covariances) = kf.filter(measurements)
print(filtered_state_means)
with the following error:
ValueError: could not broadcast input array from shape (2,2) into shape (2,1)
How can I use pykalman to update an initial state and initial covariance using just a single observation?
From the documentation at: http://pykalman.github.io/#kalmanfilter
filter_update(filtered_state_mean, filtered_state_covariance, observation=None, transition_matrix=None, transition_offset=None, transition_covariance=None, observation_matrix=None, observation_offset=None, observation_covariance=None)
This takes in the filtered_state_mean and filtered_state_covariance at time t, together with an observation at t+1, and returns the state mean and state covariance at t+1 (to be used for the next update).
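A minimal sketch of using filter_update with a single observation (the initial mean and covariance below are made-up placeholders; if you have better prior knowledge of the state, use that instead):
import numpy as np
import pykalman

kf = pykalman.KalmanFilter(transition_matrices=[[1, 1], [0, 1]],
                           observation_matrices=[[0.1, 0.5], [-0.3, 0.0]])

# placeholder initial estimate of the 2-dimensional state
initial_state_mean = np.zeros(2)
initial_state_covariance = np.eye(2)

# incorporate one observation and get the updated mean and covariance
observation = np.array([1, 0])
new_mean, new_covariance = kf.filter_update(initial_state_mean,
                                            initial_state_covariance,
                                            observation=observation)
print(new_mean)
print(new_covariance)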
If I understand the Kalman filter algorithm correctly, you can predict the state using just one observation. But the gain and the covariance would be way off, and the prediction would be nowhere close to the actual state.
You need to give a Kalman filter a few observations as a training set so that it reaches a steady state.
I am trying to run Fisher's LDA (1, 2) to reduce the number of features of a matrix.
Basically, correct me if I am wrong: given n samples classified into several classes, Fisher's LDA tries to find an axis such that projecting onto it maximizes the value J(w), which is the ratio of the total sample variance to the sum of the variances within the separate classes.
I think this can be used to find the most useful features for each class.
I have a matrix X of m features and n samples (m rows, n columns).
I have a sample classification y, i.e. an array of n labels, one for each sample.
Based on y, I want to reduce the number of features to, for example, the 3 most representative features.
Using scikit-learn I tried in this way (following this documentation):
>>> import numpy as np
>>> from sklearn.lda import LDA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> clf = LDA(n_components=3)
>>> clf.fit_transform(X, y)
array([[ 4.],
[ 4.],
[ 8.],
[-4.],
[-4.],
[-8.]])
At this point I am a bit confused: how do I obtain the most representative features?
The features you are looking for are in clf.coef_ after you have fitted the classifier.
Note that n_components=3 doesn't make sense here, since X.shape[1] == 2, i.e. your feature space only has two dimensions.
You do not need to invoke fit_transform in order to obtain coef_; calling clf.fit(X, y) will suffice.
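For illustration, a small sketch of reading those weights (this uses the current import path sklearn.discriminant_analysis.LinearDiscriminantAnalysis, which replaced sklearn.lda.LDA in later scikit-learn releases):
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

# with 2 classes, at most n_classes - 1 = 1 discriminant component is available
clf = LinearDiscriminantAnalysis(n_components=1)
clf.fit(X, y)

# one weight per original feature; larger absolute values indicate features
# that contribute more to separating the classes
print(clf.coef_)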