Difference in PCA implementation between numpy only vs sklearn - python

import numpy as np
import scipy.linalg
import matplotlib.pyplot as plt
from tensorflow.examples.tutorials.mnist import input_data
mnist=input_data.read_data_sets('data/MNIST/', one_hot=True)
numpy implementation
# Entire Data set
Data=np.array(mnist.train.images)
#centering the data
mu_D=np.mean(Data, axis=0)
Data-=mu_D
COV_MA = np.cov(Data, rowvar=False)
eigenvalues, eigenvec=scipy.linalg.eigh(COV_MA, eigvals_only=False)
together = zip(eigenvalues, eigenvec)
together = sorted(together, key=lambda t: t[0], reverse=True)
eigenvalues[:], eigenvec[:] = zip(*together)
n=3
pca_components=eigenvec[:,:n]
print(pca_components.shape)
data_reduced = Data.dot(pca_components)
print(data_reduced.shape)
data_original = np.dot(data_reduced, pca_components.T) # inverse_transform
print(data_original.shape)
plt.imshow(data_original[10].reshape(28,28),cmap='Greys',interpolation='nearest')
sklearn implementation
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(Data)
data_reduced = np.dot(Data, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
plt.imshow(data_original[10].reshape(28,28),cmap='Greys',interpolation='nearest')
I'd like to implement the PCA algorithm using NumPy. However, I don't know how to reconstruct the images from it, and I'm not even sure this code is correct.
When I used sklearn.decomposition.PCA, the result was different from my NumPy implementation.
Can you explain the differences?

I can spot a few differences already.
For one:
n=300
projections = only_2.dot(eigenvec[:,:n])
Xhat = np.dot(projections, eigenvec[:,:n].T)
Xhat += mu_D
plt.imshow(Xhat[5].reshape(28,28),cmap='Greys',interpolation='nearest')
The point I'm trying to make is that, if my understanding is correct, with n = 300 you are projecting onto 300 eigenvectors, ordered from highest to lowest eigenvalue.
But in sklearn
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
pca.fit(only_2)
data_reduced = np.dot(only_2, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
It seems to me that you are fitting just the FIRST component (the one that maximizes variance), not all 300.
Furthermore:
One thing I can clearly say is that you seem to understand what's happening in PCA, but you're having trouble implementing it. Correct me if I'm wrong, but:
data_reduced = np.dot(only_2, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
In this part, you are trying to PROJECT your data onto your eigenvectors, which is what you should be doing in PCA, but in sklearn the way to do it is the following:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=300)
pca.fit_transform(only_2)
If you could tell me how you created only_2, I can give you a much more specific answer tomorrow.
Here is what sklearn says about fit_transform for PCA: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform:
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
y : Ignored
Returns:
X_new : array-like, shape (n_samples, n_components)
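To make the comparison concrete, here is a minimal sketch of the idiomatic sklearn round trip; it assumes the Data array from the question, which has already been centered with mu_D:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
data_reduced = pca.fit_transform(Data)               # centers internally, then projects
data_original = pca.inverse_transform(data_reduced)  # projects back and adds pca.mean_ (~0 here, since Data is pre-centered)
# This matches the manual version in the question only because Data was already centered:
# data_reduced = np.dot(Data, pca.components_.T)
# data_original = np.dot(data_reduced, pca.components_) + pca.mean_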

Related

Difference between statsmodels OLS and scikit-learn linear regression

I tried to practice fitting a linear regression model with the iris dataset.
from sklearn import datasets
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
# load iris data
train = sns.load_dataset('iris')
train
# one-hot-encoding
species_encoded = pd.get_dummies(train["species"], prefix = "speceis")
species_encoded
train = pd.concat([train, species_encoded], axis = 1)
train
# Split by feature and target
feature = ["sepal_length", "petal_length", "speceis_setosa", "speceis_versicolor", "speceis_virginica"]
target = ["petal_width"]
X_train = train[feature]
y_train = train[target]
case 1 : statsmodels
# model
X_train_constant = sm.add_constant(X_train)
model = sm.OLS(y_train, X_train_constant).fit()
print("const : {:.6f}".format(model.params[0]))
print(model.params[1:])
result :
const : 0.253251
sepal_length -0.001693
petal_length 0.231921
speceis_setosa -0.337843
speceis_versicolor 0.094816
speceis_virginica 0.496278
case 2 : scikit-learn
# model
model = LinearRegression()
model.fit(X_train, y_train)
print("const : {:.6f}".format(model.intercept_[0]))
print(pd.Series(model.coef_[0], model.feature_names_in_))
result :
const : 0.337668
sepal_length -0.001693
petal_length 0.231921
speceis_setosa -0.422260
speceis_versicolor 0.010399
speceis_virginica 0.411861
Why are the results of statsmodels and sklearn different?
Additionally, the coefficients of the two models are the same except for all or part of the one-hot-encoded features.
You included a full set of one-hot encoded dummies as regressors. Those dummies sum to the constant column, so you have perfect multicollinearity: the covariance matrix of the regressors is singular and cannot be inverted.
Under the hood, both statsmodels and sklearn rely on the Moore-Penrose pseudoinverse and can "invert" singular matrices just fine; the problem is that the coefficients obtained in the singular case don't mean anything in any physical sense. The implementations differ a bit between packages (sklearn relies on scipy.linalg.lstsq, while statsmodels has a custom procedure, statsmodels.tools.pinv_extended, which is basically numpy.linalg.svd with minimal changes), so at the end of the day they both display «nonsense» (since no meaningful coefficients can be obtained); it's just a design choice of what kind of «nonsense» to display.
If you take the sum of the coefficients of the one-hot encoded dummies, you can see that for statsmodels it is equal to the constant, while for sklearn it is equal to 0 and the constant differs from the statsmodels constant. The coefficients of variables that are not «responsible» for the perfect multicollinearity are unaffected.
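As a quick sanity check of those sums, a sketch along these lines (reusing the question's setup; dtype=float is passed to get_dummies so the design matrix stays numeric on recent pandas versions) should reproduce the pattern:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
train = sns.load_dataset('iris')
dummies = pd.get_dummies(train["species"], prefix="speceis", dtype=float)
train = pd.concat([train, dummies], axis=1)
feature = ["sepal_length", "petal_length", "speceis_setosa", "speceis_versicolor", "speceis_virginica"]
X_train, y_train = train[feature], train["petal_width"]
sm_model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
sk_model = LinearRegression().fit(X_train, y_train)
dummy_cols = ["speceis_setosa", "speceis_versicolor", "speceis_virginica"]
print(sm_model.params[dummy_cols].sum(), sm_model.params["const"])  # dummy sum equals the statsmodels constant
print(sk_model.coef_[-3:].sum(), sk_model.intercept_)               # dummy sum equals 0; the constant differs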

Expected 2D array, got 1D array instead. Where's the mistake?

I am beginning to learn SVM and PCA. I tried to apply SVM to the scikit-learn 'load_digits' dataset.
When I apply the .fit method of SVC, I get an error:
"Expected 2D array, got 1D array instead:
array=[ 1.9142151 0.58897807 1.30203491 ... 1.02259477 1.07605691
-1.25769703].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature
or array.reshape(1, -1) if it contains a single sample."
Here is the code I wrote:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)
pca=PCA(n_components=10).fit_transform(data)
reduced_data = PCA(n_components=2).fit_transform(data)
from sklearn.svm import SVC
clf = SVC(kernel='rbf', C=1E6)
X=reduced_data[:,0]
y=reduced_data[:,1]
clf.fit(X, y)
Can someone help me out? Thank you in advance.
Your error results from the fact that clf.fit() requires the array X to be 2-dimensional (currently it is 1-dimensional); using X.reshape(-1, 1) turns X into an (N, 1) array (2D, as required) rather than (N,) (1D), where N is the number of samples in the dataset. However, I also believe that your interpretation of reduced_data may be incorrect (from my limited experience with sklearn):
The reduced_data array that you have contains two principal components (the two directions of greatest variance, n_components=2), which you should be using as the new "data" (X).
Instead, you have taken the first column of reduced_data to be the samples X and the second column to be the target values y. My understanding is that a better approach would be to make X = reduced_data, since the sample data should consist of both PCA features, and y = y_digits, since the labels (targets) are unchanged by PCA.
(I also noticed you defined pca = PCA(n_components=10).fit_transform(data), but did not go on to use it, so I have removed it from the code in my answer).
As a result, you would have something like this:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.svm import SVC
X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)
# pca=PCA(n_components=10).fit_transform(data)
reduced_data = PCA(n_components=2).fit_transform(data)
clf = SVC(kernel='rbf', C=1e6)
clf.fit(reduced_data, y_digits)
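If it helps, you could then sanity-check the fit; training accuracy is optimistic, so a train/test split would give a fairer number:
# accuracy on the two PCA components (evaluated on the training data, so an optimistic estimate)
print(clf.score(reduced_data, y_digits))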
I hope this has helped!

Interpreting OLS Weights after PCA (in Python)

I want to interpret the regression model weights in a model where the input data has been pre-processed with PCA. In reality, I have 100s of input dimensions which are highly correlated, so I know that PCA is useful. However, for the sake of illustration I will use the Iris dataset.
The sklearn code below illustrates my question:
import numpy as np
import sklearn.datasets, sklearn.decomposition
from sklearn.linear_model import LinearRegression
# load data
X = sklearn.datasets.load_iris().data
w = np.array([0.3, 10, -0.1, -0.01])
Y = np.dot(X, w)
# set number of components to keep from PCA
n_components = 4
# reconstruct w
reg = LinearRegression().fit(X, Y)
w_hat = reg.coef_
print(w_hat)
# apply PCA
pca = sklearn.decomposition.PCA(n_components=n_components)
pca.fit(X)
X_trans = pca.transform(X)
# reconstruct w
reg_trans = LinearRegression().fit(X_trans, Y)
w_trans_hat = np.dot(reg_trans.coef_, pca.components_)
print(w_trans_hat)
Running this code, one can see that the weights are reproduced fine.
However, if I set the number of components to 3 (i.e. n_components = 3), then the weights printed out deviate substantially from the true ones.
Am I misunderstanding how I can transform back these weights? Or is it because of PCA's information loss moving from 4 to 3 components?
I think this was working fine; it's just that I was looking at w_trans_hat instead of the reconstructed Y. With n_components = 3, the reconstructed Y stays close to the original, whereas w_trans_hat can only capture the part of w that lies in the span of the three retained components:
import numpy as np
import sklearn.datasets, sklearn.decomposition
from sklearn.linear_model import LinearRegression
# load data
X = sklearn.datasets.load_iris().data
# create fake loadings
w = np.array([0.3, 10, -0.1, -0.01])
# centre X
X = np.subtract(X, np.mean(X, 0))
# calculate Y
Y = np.dot(X, w)
# set number of components to keep from PCA
n_components = 3
# reconstruct w using linear regression
reg = LinearRegression().fit(X, Y)
w_hat = reg.coef_
print(w_hat)
# apply PCA
pca = sklearn.decomposition.PCA(n_components=n_components)
pca.fit(X)
X_trans = pca.transform(X)
# regress Y on principal components
reg_trans = LinearRegression().fit(X_trans, Y)
# reconstruct Y using regressed weights and transformed X
Y_trans = np.dot(X_trans, reg_trans.coef_)
# show MSE to original Y
print(np.mean((Y - Y_trans) ** 2))
# show w implied by reduced model in original space
w_trans_hat = np.dot(reg_trans.coef_, pca.components_)
print(w_trans_hat)

Multivariate linear regression in python

I have a data set arranged in X and Y matrices.
I want to find a 2x2 matrix A such that y_i = A x_i for all i = 1, ..., n. So I am using the following code for linear regression in Python:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn import datasets, linear_model
#n=5
X=np.random.uniform(0,1,(2,5))
A=np.random.uniform(0,1,(2,2))
y=np.dot(A,X)
print(y)
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
model=regr.fit(X, y)
#model.predict(X)
model.coef_
However, my model.coef_ is a 5x5 matrix instead of the 2x2 matrix that I want for A. How do I achieve this?
Fit the model on the transpose of the samples and the outcome, so that the second dimension (of size 2) supplies the features and outputs; this gives a 2x2 coefficient array:
model=regr.fit(X.T, y.T)
# test
np.testing.assert_allclose(y.T, model.predict(X.T))
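And if it's useful, the 2x2 coefficient matrix itself recovers A in this setup (a quick check, assuming y was generated exactly as y = A·X with no noise, as in the code above):
print(model.coef_.shape)                                  # (2, 2)
np.testing.assert_allclose(model.coef_, A, atol=1e-8)     # coef_ recovers A; intercept_ is ~0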

Add a new vector to PCA-transformed space - python

Imagine I have training data with 9 dimensions and 6000 samples, and I applied PCA using sklearn's PCA.
I reduced its dimensionality to 4, and now I want to convert one new sample with 9 features into my training data's 4-component space as fast as possible.
Here is my first PCA code:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
X_std = StandardScaler().fit_transform(df1)
pca = PCA(n_components = 4)
result = pca.fit_transform(X_std)
Is there any way to do this with sklearn's PCA?
If you want to transform the original matrix into the reduced-dimensionality projection offered by PCA, you can use the transform function, which runs an efficient inner product of the input matrix with the eigenvectors (components):
pca = PCA(n_components=4)
pca.fit(X_std)
X_std_reduced = pca.transform(X_std)
From the scikit source:
X_transformed = fast_dot(X, self.components_.T)
So applying the PCA transformation is simply a linear combination -- very fast. Now you can apply the same projection to the training set and to any new data that you want to test against in the future.
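For the original question (a single new 9-feature sample), a minimal sketch could look like the following; it assumes df1 from the question, keeps the fitted StandardScaler around (a new sample must be standardized with the training statistics before pca.transform), and uses a hypothetical new_sample placeholder:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler().fit(df1)                    # scaler fitted on the training data
pca = PCA(n_components=4).fit(scaler.transform(df1))
new_sample = np.zeros((1, 9))                         # hypothetical new sample, shape (1, 9)
new_sample_4d = pca.transform(scaler.transform(new_sample))  # shape (1, 4)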
This article describes the process in more detail: http://www.eggie5.com/69-dimensionality-reduction-using-pca
