PCA features do not match original features - python

I am trying to reduce the feature dimensions using PCA. I have been able to apply PCA to my training data, but am struggling to understand why the reduced feature set (X_train_pca) shares no similarities with the original features (X_train).
print(X_train.shape) # (26215, 727)
pca = PCA(0.5)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
print(X_train_pca.shape) # (26215, 100)
most_important_features_indicies = [np.abs(pca.components_[i]).argmax() for i in range(pca.n_components_)]
most_important_feature_index = most_important_features_indicies[0]
Should the first feature vector in X_train_pca not be just a subset of the first feature vector in X_train? For example, why doesn't the following equal True?
print(X_train[0][most_important_feature_index] == X_train_pca[0][0]) # False
Furthermore, none of the features from the first feature vector of X_train are in the first feature vector of X_train_pca:
for i in X_train[0]:
print(i in X_train_pca[0])
# False
# False
# False
# ...

PCA transforms your high dimensional feature vectors into low dimensional feature vectors.
It does not simply determine the least important index in the original space and drop that dimension.
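To make this concrete, here is a minimal sketch that performs PCA by hand with NumPy's SVD instead of sklearn. The toy data and the variable names (`components`, `X_pca`, `top_feature`) are assumptions for illustration. It shows that each PCA coordinate is a weighted sum over all centred original features, which is why it will generally not equal any single original value.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy correlated data (assumption: stands in for X_train)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# PCA by hand: centre the data, take the SVD, keep the top-k right singular vectors
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
k = 2
components = Vt[:k]                 # shape (k, n_features), plays the role of pca.components_
X_pca = (X - mean) @ components.T   # shape (n_samples, k)

# The first PCA coordinate of the first sample mixes ALL original features
first_coord = np.dot(X[0] - mean, components[0])

# The feature with the largest weight is only one term of that sum
top_feature = np.abs(components[0]).argmax()
print(X[0, top_feature], X_pca[0, 0])
```

So `np.abs(components[0]).argmax()` only tells you which original feature contributes most to the first component; it does not mean that feature was copied into the output.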

This is normal since the PCA algorithm applies a transformation to your data:
PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
(https://en.wikipedia.org/wiki/Principal_component_analysis#Dimensionality_reduction)
Run the following code sample to see the effect of the PCA algorithm on a simple Gaussian data set.
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
pca = PCA(2)
X = np.random.multivariate_normal(mean=np.array([0, 0]), cov=np.array([[1, 0.75],[0.75, 1]]), size=(1000,))
X_new = pca.fit_transform(X)
plt.scatter(X[:, 0], X[:, 1], s=5, label='Initial data')
plt.scatter(X_new[:, 0], X_new[:, 1], s=5, label='Transformed data')
plt.legend()
plt.show()

Related

Principal Component Analysis: Order of components AFTER transformation

I am using the PCA class from sklearn.decomposition to reduce the dimensionality of the feature space in order to plot that feature space.
I am wondering the following: after applying the fit and transform methods of the PCA class, I get back an array X_transformed of shape (n_samples, n_components), as stated in the documentation. Are the columns of X_transformed sorted by the amount of explained variance? The documentation says that PCA.components_ is sorted by explained variance, so I am assuming that the columns of X_transformed are as well, but please correct me if I am wrong.
Little example:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X) # X is an array containing my original features. X.shape=(n_samples, n_features)
X_transformed = pca.transform(X) # X_transformed.shape=(n_samples, n_components). Are X_transformed's columns sorted by explained variance?
Thanks!
Hmm maybe just got an idea to test that
from sklearn.decomposition import PCA
import numpy as np
pca_2 = PCA(n_components=2)
X_transformed_2 = pca_2.fit_transform(X)
# X_transformed_2 hold two components with most variance explained
pca_10 = PCA(n_components=10)
X_transformed_10 = pca_10.fit_transform(X)
# X_transformed_10 hold 10 components with most variance explained
# Hypothesis: If the first 2 components in X_transformed_10 are ordered by explained variance, its first 2 columns should equal X_transformed_2
np.array_equal(X_transformed_2, X_transformed_10[:, :2]) ## returns True
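The same check can be sketched without sklearn, under the assumption that PCA is computed through a plain SVD of the centred data. `pca_project` is a hypothetical helper, not a library function. With sklearn itself, comparing with `np.allclose` is probably safer than `np.array_equal`, since different solvers may agree only up to floating-point tolerance.

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its top-k principal axes using a plain SVD (hypothetical helper)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(1)
# Toy data whose columns have different spreads (assumption for illustration)
X = rng.normal(size=(300, 12)) * np.arange(1, 13)

Z2 = pca_project(X, 2)
Z10 = pca_project(X, 10)

# Singular values come out sorted, so column variances are non-increasing
col_var = Z10.var(axis=0)
variances_sorted = bool(np.all(np.diff(col_var) <= 1e-9))

# The first two columns of the 10-component projection match the 2-component one
first_two_match = bool(np.allclose(Z2, Z10[:, :2]))
print(variances_sorted, first_two_match)
```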

Strange sampling results from statsmodels.api.GLM (Generalised linear model)

I have encountered a problem using the Python tool "statsmodels.api.GLM" that I cannot understand, so I am asking for help here.
I'm working through an example from the section "Cubic and Natural Cubic Splines"
on this page: https://www.analyticsvidhya.com/blog/2018/03/introduction-regression-splines-python-codes/ (a link to the data is included on the page)
The problem is this: after fitting the data, I try to predict values at given x positions (e.g. xp00 and xp01 in the following code). I find that when the requested positions have a different min and max (i.e. xp01) from the training x-set (i.e. xp), the result changes. This goes against my expectation that, at the same position, the prediction should be exactly the same value regardless of how the request is made, because the fit to the data is done and fixed. I expected pred01 to overlap pred00, just shorter at the left end.
# import modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
# read data_set
data = pd.read_csv("Wage.csv")
data.head()
data_x = data['age']
data_y = data['wage']
# Dividing data into train and validation datasets
from sklearn.model_selection import train_test_split
train_x, valid_x, train_y, valid_y = train_test_split(data_x, data_y, test_size=0.33, random_state = 1)
from patsy import dmatrix
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error
from math import sqrt
# Generating cubic spline with 3 knots at 25, 40 and 60
transformed_x = dmatrix("bs(train, knots=(25,40,60), degree=3, include_intercept=False)", {"train": train_x},return_type='dataframe')
# Fitting Generalised linear model on transformed dataset
fit1 = sm.GLM(train_y, transformed_x).fit()
# Prediction on splines
pred1 = fit1.predict(dmatrix("bs(valid, knots=(25,40,60), include_intercept=False)", {"valid": valid_x}, return_type='dataframe'))
# Calculating RMSE values
rms1 = sqrt(mean_squared_error(valid_y, pred1))
print(rms1)
#-> 39.4
# We will plot the graph for 70 observations only
xp = np.linspace(valid_x.min(),valid_x.max(),70)
xp00 = np.linspace(valid_x.min(),valid_x.max(),170)
xp01 = np.linspace(valid_x.min()+4,valid_x.max(),170) # just shift the lower bound a bit
# Make some predictions
pred1 = fit1.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)", {"xp": xp}, return_type='dataframe'))
pred00 = fit1.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)", {"xp": xp00}, return_type='dataframe'))
pred01 = fit1.predict(dmatrix("bs(xp, knots=(25,40,60), include_intercept=False)", {"xp": xp01}, return_type='dataframe'))
SMALL_SIZE = 4
gamma=0.4
plt.rc('font', size=SMALL_SIZE)
plt.rc('axes', titlesize=SMALL_SIZE)
plt.figure(figsize=(5,2),dpi=300)
# Plot the splines and error bands
plt.scatter(data.age, data.wage, facecolor='None', edgecolor='k', alpha=0.1)
#plt.plot(xp, pred1, label='Specifying degree =3 with 3 knots')
plt.plot(xp, pred1, color='r', label='Specifying degree =3 with 3 knots xp')
plt.plot(xp00, pred00, color='g', label='Specifying degree =3 with 3 knots xp00')
plt.plot(xp01, pred01, color='b', label='Specifying degree =3 with 3 knots xp01')
plt.legend()
plt.xlim(15,85)
plt.ylim(0,350)
plt.xlabel('age')
plt.ylabel('wage')
plt.show()
I am not allowed to embed the figure in the post, so please follow the link below to see the strange results. Perhaps they are not strange and I just don't know how to use the tool.
The strange results: https://i.stack.imgur.com/uFkGH.jpg
Thanks!!
Yanbin
Splines are a stateful transformation. That means that computing the splines needs parameters, such as knot locations, that are based on the data. This is similar to standardization, which depends on the mean and standard deviation of the sample.
Using formulas in statsmodels keeps track of stateful transformations, such as splines, that are provided by patsy. So the original parameters of the stateful transformation are used when computing the transformed design matrix for new prediction points.
In the example code, the spline basis is computed separately for the training and test examples. However, it specifies the interior knots to be the same in both cases.
My guess at what happens in the example is that patsy adjusts the boundary knots to the data being transformed. In that case, even if the interior knots are the same, the boundary knots differ.
As a consequence, the B-spline basis will agree in the interior of the data space, but not for points close to the boundary.
A second source of differences is that removing the intercept from the spline basis can be a "global" transformation that affects all spline basis columns and not just a single column. (I do not remember what patsy's default for removing the intercept is for B-splines.)
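The effect of a stateful transformation can be illustrated with a small NumPy-only analogy. This is not patsy's spline code: `MinMaxCubicBasis` is a hypothetical class that rescales x using bounds remembered from the training data, much as boundary knots would be. Reusing the fitted state gives consistent predictions, while recomputing the state from the new points changes them even though the coefficients are identical.

```python
import numpy as np

class MinMaxCubicBasis:
    """Hypothetical stateful transform: a cubic basis on x rescaled by remembered bounds."""

    def fit(self, x):
        # The "state": bounds taken from whatever data is passed in
        self.lo_, self.hi_ = float(np.min(x)), float(np.max(x))
        return self

    def transform(self, x):
        t = (np.asarray(x, dtype=float) - self.lo_) / (self.hi_ - self.lo_)
        return np.column_stack([np.ones_like(t), t, t**2, t**3])

rng = np.random.default_rng(0)
x_train = rng.uniform(18, 80, size=200)   # toy "age" values (assumption)
y_train = 50 + 2 * x_train - 0.02 * x_train**2 + rng.normal(scale=5, size=200)

basis = MinMaxCubicBasis().fit(x_train)
coef, *_ = np.linalg.lstsq(basis.transform(x_train), y_train, rcond=None)

x_new = np.linspace(30, 80, 5)            # narrower range than the training data

# Correct: reuse the state learned from the training data
pred_reused = basis.transform(x_new) @ coef
# Problematic: rebuild the basis from the new points, so the bounds change
pred_refit = MinMaxCubicBasis().fit(x_new).transform(x_new) @ coef

print(np.allclose(pred_reused, pred_refit))
```

The two prediction vectors differ because the same x value maps to different basis values once the bounds change, which mirrors the guess above about boundary knots.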

How to plot normal vector of decision boundary?

I've managed to plot the decision boundary of a support vector machine in 2D and 3D. Now, I'd like to plot the normal vector of it as well, but in a way that works not only in 2D / 3D but also in higher-dimensional spaces. At the moment, I'm simply calculating the normal vector by computing the slope of it with m1 * m2 = -1.
Going deeper into the mathematics behind SVMs, I've found out that there's the w-vector which is perpendicular to the decision boundary. I'm using the LinearSVC implementation of sklearn to train the classifier. As far as I know, the w-vector is given by the coef_[0] attribute, but plotting this vector doesn't give the result I was expecting.
Is there a general way to compute the normal vector of an SVM decision boundary that works not only in 2D / 3D but also in high-dimensional spaces?
What I'm trying to achieve is to navigate inside an n-dimensional space gradually from one class to another. Since it's not possible to visualize a high-dimensional space, I'd like to validate everything first in 2D/3D to gain a better understanding.
I have a data set of labeled fashion item images. First, I extracted 2048-dimensional feature vectors using a CNN (ResNet-50). Then, I perform PCA to reduce the dimensionality of the vectors. Before that, I performed some data cleaning and filtering.
num_feature_dimensions = 2 # Set the number of embedding dimensions
pca = PCA(n_components = num_feature_dimensions)
embs_compressed = pca.fit_transform(df_embs_filtered)
df_embs_filtered_compressed = pd.DataFrame(embs_compressed)
df_embs_filtered_compressed
After that, I train the SVM with the uncompressed embeddings as X and the season feature as y (binary problem, either winter or summer).
X = df_embs_filtered
y = df_filtered["season"]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
svm_clf = LinearSVC(C=1, max_iter=100000)
svm_clf.fit(X_scaled, y)
The last step is to visualize the embedding space (in case of 2D / 3D) with the decision boundary and an orthogonal axis. It should be possible for a user to navigate over that orthogonal axis to go from one class to the other. So, I'm creating a marker for the user and utilizing an ipywidgets FloatSlider which updates the position. Then, depending on the user's position, it'll show the image of the nearest neighbor embedding.
This is the whole code for creating the scatter plot, computing the decision boundary and its orthogonal axis, and the FloatSlider for 2D. I left out some snippets which I think aren't relevant to the question.
from ipywidgets import AppLayout, FloatSlider
from matplotlib.offsetbox import (AnnotationBbox, OffsetImage, TextArea)
plt.ioff()
fig, ax = plt.subplots(figsize=(15,7))
fig.canvas.header_visible = False
fig.canvas.layout.min_height = '400px'
# Create Scatterplot of filtered dataset colored by season feature
sns.scatterplot(x="x", y="y",
hue="season",
data=df_filtered,
legend="full",
alpha=0.8)
# Computes the decision boundary of a trained classifier
db_xx, db_yy = calc_svm_decision_boundary(svm_clf, -35, 35)
# Rotate the decision boundary 90° to get perpendicular axis
neg_yy = np.negative(db_yy)
neg_slope = -1 / -svm_clf.coef_[0][0]
bias = svm_clf.intercept_[0] / svm_clf.coef_[0][1]
ortho_db_yy = neg_slope * db_xx - bias
# Plot the axes
plt.plot(db_xx, db_yy, "k-", linewidth=1)
plt.plot(db_xx, ortho_db_yy, "r-", linewidth=1)
#plt.plot(neg_yy, db_xx, "g-", linewidth=2)
# Choose a random starting position and initialize user marker on that position
rand_idx = random.choice(range(len(db_xx)))
x = db_xx[rand_idx]
y = ortho_db_yy[rand_idx]
user_marker, user_positon = create_user_marker(x, y)
# Compute the nearest neighbour and annotate it with its respective image
nearest_neighbour, nearest_neighbour_pos = get_nearest_neighbour(user_positon, df_filtered)
annotate_nearest_neighbour(nearest_neighbour, nearest_neighbour_pos, ax, df_filtered)
plt.title('Nearest Embedding: {} with season: {}, pos: {}'.format(nearest_neighbour, df_filtered.loc[df_filtered['id'] == nearest_neighbour].season.values[0], user_positon))
# Create Slider to interact with the plot
slider = FloatSlider(
orientation="horizontal",
description="x-Position:",
value=user_positon[0],
min=min(db_xx),
max=max(db_xx)
)
slider.layout.margin = '0px 30% 0px 30%'
slider.layout.width = '25%'
slider.observe(update_user_position_2D, names='value')
AppLayout(
center=fig.canvas,
footer=slider,
pane_heights=[0, 6, 1]
)
def calc_svm_decision_boundary(svm_clf, xmin, xmax):
"""Compute the decision boundary"""
w = svm_clf.coef_[0]
b = svm_clf.intercept_[0]
xx = np.linspace(xmin, xmax, 200)
yy = -w[0]/w[1] * xx - b/w[1]
return xx, yy
This results in the following plot:
This approach to computing the orthogonal axis works in 2D, but I'm looking for a general approach that works regardless of the dimension of the space. As you can see in the plot, for the season feature there's no well-fitting separating hyperplane in lower-dimensional spaces. My hypothesis is that there's a hyperplane in a higher-dimensional space that separates the classes well.
Now, I thought that I could use the w-vector, which is perpendicular to the decision boundary, to compute an orthogonal axis in any space. Is that possible, or is there an error in my reasoning?
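For a linear classifier with decision function w·x + b, the vector w is normal to the hyperplane w·x + b = 0 in any number of dimensions. A minimal sketch follows; here `w` and `b` are random stand-ins for `svm_clf.coef_[0]` and `svm_clf.intercept_[0]`, and `signed_distance` is a hypothetical helper. Note that w lives in the space the classifier was trained in (above, the scaled uncompressed embeddings), so it would not line up with a plot drawn in a different 2D PCA space.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 5                      # any dimension works
w = rng.normal(size=n_features)     # stand-in for svm_clf.coef_[0]
b = 0.3                             # stand-in for svm_clf.intercept_[0]

# Unit normal of the hyperplane w.x + b = 0
w_unit = w / np.linalg.norm(w)

def signed_distance(x):
    """Signed distance from x to the hyperplane (positive on the side w points to)."""
    return (x @ w + b) / np.linalg.norm(w)

x0 = rng.normal(size=n_features)    # starting point, e.g. an embedding
d0 = signed_distance(x0)

# Walk along the normal: onto the boundary, then the same distance past it
x_on_boundary = x0 - d0 * w_unit
x_other_side = x0 - 2 * d0 * w_unit

print(signed_distance(x_on_boundary), signed_distance(x_other_side))
```

Moving a point by `t * w_unit` changes its signed distance by exactly `t`, so a slider value could be mapped to `t` to travel from one class to the other along the normal.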

Sorted data not plotted in correct datapoints [duplicate]

This seems like a sklearn question but it's not (at least not directly). I only use sklearn here to get the data points, since this makes my problem fully reproducible. Some background:
I use sklearn to predict some points in a small interval. First I build a synthetic domain X with 2D vectors (rows in a matrix).
Then I calculate some image points y = x_1 + x_2 + noise using those rows x = (x_1, x_2) and some noise to try to replicate some real data.
To do the regression (aka interpolation), as part of the method I randomly pick vectors/points (rows, in matrix form) from the domain X using train_test_split. I will skip the details, but the resulting arrays are random subsets of the space (the space is (x_1, x_2, y) for all (x_1, x_2) in my compact support).
Then I do the regression using sklearn. So far so good: everything works as expected, and I get the predictions in y_pred_test_sine and they work well. But the predictions are completely shuffled, since the method picks random points from the domain as a test set.
Here comes the problem...
Since I want to plot the prediction as a continuous function (interpolated by matplotlib, which is fine; I will play with my own interpolation tests later), I do two things:
(1) Create a new vector with the sorted domain points from the test set: X_test_sort
(2) Create a new vector with the sorted predicted image points from the test set: y_pred_test_sine_sort
(1) and (2) should match each data point in the predicted model (they are only sorted so they can easily be plotted with plt.plot as lines rather than markers).
Then I plot them and they do not match (AT ALL) the expected points in my solution space.
In the plot, the solid black line (the sorted predicted line) does not follow the orange dots (the predicted points). That is not what I expected at all.
Here is the code to reproduce the issue.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
plt.close('all')
rng = np.random.RandomState(42)
regressor = LinearRegression()
# Synthetic dataset
x_1 = np.linspace(-3, 3, 300)
x_2 = np.sin(4*x_1)
noise = rng.uniform(size=len(x_1))
y = x_1 + x_2 + noise
X = np.vstack((x_1, x_2)).T
# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Regression 2 features data
fit_sine = regressor.fit(X_train, y_train)
y_pred_test_sine = regressor.predict(X_test)
# Here I have sorted the X values and its image points Y = f(x)
# Why those are not correctly placed over the 'prediction' points
X_test_sort = np.sort(X_test[:,0].ravel())
y_pred_test_sine_sort = np.sort(y_pred_test_sine.ravel())
# DO THE PLOTTING
plt.plot(X_test[:,0], y_test, 'o', alpha=.5, label='data')
plt.plot(X_test[:,0], y_pred_test_sine, 'o', alpha=.5, label='prediction')
plt.plot(X_test_sort, y_pred_test_sine_sort, 'k', label='prediction line')
plt.plot(x_1, np.sin(4*x_1)+x_1+.5, 'k:', alpha=0.3, label='trend')
plt.legend()
As you mentioned in your comments, by sorting y you break the positional correspondence between X and y. Instead, use argsort to get the sorting order of X, and then order both X_test and y with it:
argsort_X_test = np.argsort((X_test[:,0].ravel()))
X_test_sort = X_test[argsort_X_test, 0]
y_pred_test_sine_sort = y_pred_test_sine[argsort_X_test]
This will give you the desired graph
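A tiny example with made-up numbers shows the difference. Sorting each array on its own pairs every x with the wrong y, while indexing both arrays with the same argsort order keeps the pairs together.

```python
import numpy as np

# Made-up paired data: y[i] belongs to x[i], and y decreases as x increases
x = np.array([3.0, 1.0, 2.0])
y = np.array([0.5, 2.0, 1.0])

# Sorting independently destroys the pairing
x_sorted_alone = np.sort(x)      # [1. 2. 3.]
y_sorted_alone = np.sort(y)      # [0.5 1. 2.]  -> (1.0, 0.5) was never a pair

# Sorting both by the order of x keeps each y with its x
order = np.argsort(x)            # [1 2 0]
x_sorted = x[order]              # [1. 2. 3.]
y_sorted = y[order]              # [2. 1. 0.5]

print(list(zip(x_sorted, y_sorted)))
```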

Time series prediction using support vector regression

I've been trying to implement a time series prediction tool using support vector regression in Python. I use the SVR module from scikit-learn for non-linear support vector regression, but I have a serious problem predicting future events. The regression line fits the original function well (on known data), but as soon as I want to predict future steps, it returns the value from the last known step.
My code looks like this:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.svm import SVR
X = np.arange(0,100)
Y = np.sin(X)
svr_rbf = SVR(kernel='rbf', C=1e5, gamma=1e5)
y_rbf = svr_rbf.fit(X[:-10, np.newaxis], Y[:-10]).predict(X[:, np.newaxis])
figure = plt.figure()
tick_plot = figure.add_subplot(1, 1, 1)
tick_plot.plot(X, Y, label='data', color='green', linestyle='-')
tick_plot.axvline(x=X[-10], alpha=0.2, color='gray')
tick_plot.plot(X, y_rbf, label='data', color='blue', linestyle='--')
plt.show()
Any ideas?
thanks in advance,
Tom
You are not really doing time-series prediction. You are trying to predict each element of Y from a single element of X, which means that you are just solving a standard kernelized regression problem.
Another problem is when computing the RBF kernel over a range of vectors [[0],[1],[2],...], you will get a band of positive values along the diagonal of the kernel matrix while values far from the diagonal will be close to zero. The test set portion of your kernel matrix is far from the diagonal and will therefore be very close to zero, which would cause all of the SVR predictions to be close to the bias term.
For time series prediction I suggest building the training set as
x[0]=Y[0:K]; y[0]=Y[K]
x[1]=Y[1:K+1]; y[1]=Y[K+1]
...
that is, try to predict future elements of the sequence from a window of previous elements.
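The windowing scheme above can be sketched as follows. `make_windows` is a hypothetical helper name and K=5 is an arbitrary choice for illustration.

```python
import numpy as np

def make_windows(series, K):
    """Build (X, y) pairs where X[i] = series[i:i+K] and y[i] = series[i+K]."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + K] for i in range(len(series) - K)])
    y = series[K:]
    return X, y

Y = np.sin(np.arange(100))
X_win, y_win = make_windows(Y, K=5)
print(X_win.shape, y_win.shape)   # (95, 5) (95,)
```

The resulting `X_win` and `y_win` could then be passed to `SVR(...).fit(X_win, y_win)`. To forecast several steps ahead, one option is to predict one step at a time and append each prediction to the window used for the next step.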
