Mean centering before PCA - python

I am unsure whether this kind of question (related to PCA) is acceptable here or not.
As is commonly advised, data should be mean-centered before PCA. I have 2 different classes (each class has different participants), and my aim is to distinguish and classify those 2 classes. However, I am not sure whether mean centering should be applied to the whole data set or to each class separately.
Is it better to do it separately? (If so, should the other preprocessing steps also be done separately?) Or does that not make sense?

PCA is just a rotation, optionally accompanied by a projection onto a lower-dimensional space. It finds axes of maximal variance (which happen to be the principal axes of inertia of your point cloud) and then rotates the dataset to align those axes with your coordinate system. You get to decide how many such axes you'd like to retain, which means the rotation is then followed by a projection onto the first k axes of greatest variance, where k is the dimensionality of the representation space you have chosen.
With this in mind, again like for calculating axes of inertia, you could decide to look for such axes through the center of mass of your cloud (the mean), or through any arbitrary origin of choice. In the former case, you would mean-center your data, and in the latter you may translate the data to any arbitrary point, with the result being to diminish the importance of the intrinsic cloud shape itself and increase the importance of the distance between the center of mass and the arbitrary point. Thus, in practice, you would almost always center your data.
You may also want to standardize your data (center and divide by standard deviation so as to make variance 1 on each coordinate), or even whiten your data.
In any case, you will want to apply the same transformations to the entire dataset, not class by class. If you were to apply the transformation class by class, whatever distance exists between the centers of gravity of each class would be reduced to 0, and you would likely observe a collapsed representation with the two classes overlapping. This may be interesting if you want to observe the intrinsic shape of each class, but then you would also apply PCA separately for each class.
Please note that PCA may make it easier for you to visualize the two classes (without guarantees, if the data are truly n-dimensional without much of a lower-dimensional embedding). But in no circumstances would it make it easier to discriminate between the two. If anything, PCA will reduce how discriminable your classes are, and it is often the case that the projection will intermingle classes (increase ambiguity) that are otherwise quite distinct and e.g. separable with a simple hyper-surface.
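To make the "same transformation for both classes" point concrete, here is a minimal sketch, assuming X is a hypothetical array stacking the samples of both classes and y the corresponding class labels: the centering/standardization and the rotation are both estimated once, over all samples together.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X: (n_samples, n_features) with BOTH classes stacked; y: class labels (hypothetical names)
scaler = StandardScaler()              # centers each feature and scales it to unit variance
X_std = scaler.fit_transform(X)        # statistics computed over the whole dataset, not per class

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X_std)      # one shared rotation/projection for both classes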

PCA is, more or less by definition, an SVD applied to centered data.
Depending on the implementation (if you use PCA from a library), the centering is applied automatically, e.g. in sklearn, because, as said, the data has to be centered by definition.
So with sklearn you do not need this preprocessing step, and in general you apply it over your whole data.
PCA is unsupervised and can be used to find a representation that is more meaningful and representative for your classes afterwards. So you need all your samples in the same feature space, obtained via the same PCA.
In short: you do the PCA once, over your whole (training) data, and it must be centered over your whole (training) data. Libraries like sklearn do the centering automatically.
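As a quick illustration of that last point (a hedged sketch with hypothetical X_train / X_test splits): sklearn's PCA stores the per-feature mean it subtracts, and the same fitted object is then used to transform every sample.
import numpy as np
from sklearn.decomposition import PCA

# X_train, X_test: hypothetical (n_samples, n_features) splits of your data
pca = PCA(n_components=2).fit(X_train)                 # centering happens inside fit
print(np.allclose(pca.mean_, X_train.mean(axis=0)))    # True: the stored per-feature mean

Z_train = pca.transform(X_train)    # transform subtracts pca.mean_ before rotating
Z_test = pca.transform(X_test)      # same PCA, same feature space for all samples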

A k-nearest-neighbor classifier will help you distinguish between the two classes. Also try t-SNE to visualize classes that live in higher dimensions.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def pca_classifier(X, y, n_components=2, n_neighbors=1):
    """
    X: numpy array of shape (n_samples, n_features)
    y: numpy array of shape (n_samples, )
    n_components: int, number of components to keep
    n_neighbors: int, number of neighbors to use in the knn classifier
    """
    # 1. PCA (fit on the whole data; centering is done internally)
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    # 2. KNN on the projected data
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_pca, y)
    # 3. plot the first two principal components, colored by class
    plt.figure(figsize=(8, 6))
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('PCA')
    plt.show()
    return knn
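A hypothetical usage sketch of the function above, with the iris dataset standing in for your own data:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)   # stand-in for your own samples and class labels
knn = pca_classifier(X, y, n_components=2, n_neighbors=3)
# Note: to classify new samples you would also need the fitted PCA object,
# so in practice you may want pca_classifier to return the pca as well as knn.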

Related

Python - How to resample a 2D shape?

I am writing a python script for some geometrical data manipulation (calculating motion trajectories for a multi-drive industrial machine). Generally, the idea is that there is a given shape (let's say an ellipse, but in the general case it can be any convex shape, defined by a series of 2D points), which is rotated, and its uppermost tangent point must be followed. I don't have a problem with the latter part, but I need a little hint with the 2D shape preparation.
Let's say that the ellipse was defined with too few points, for example 25. (As I said, ultimately this can be any shape, for example a rounded hexagon.) To maintain the necessary precision I need far more points (let's say 1000), preferably equally distributed over the whole shape, or with a higher density of points near corners, sharp curves, etc.
I have a few things ringing in my head; I guess that the DFT (FFT) would be a good starting point for this resampling. Looking at scipy.signal.resample() I found that there are far more functions in the scipy.signal package which sound promising to me...
What I'm asking for is a suggestion on which way I should follow and which tool may be the most suitable for this job. Maybe there is a tool meant exactly for what I'm looking for, or maybe I'm overthinking this and one of the implementations of the FFT like resample() will work just fine (of course, after some adjustments at the starting and ending points of the shape to make sure it closes without issues)?
scipy.signal sounds promising; however, as far as I understand, it is meant to work with time series data, not geometrical data. I guess this may cause some problems, as my data isn't a function (in the mathematical sense).
Thanks and best regards!
As far as I understand, what you want is to get an interpolated version of your original data.
The DFT (or FFT) will not achieve this purpose, since it performs a Fourier transform (which is not what you want).
Theoretically speaking, what you need in order to interpolate your data is a function that calculates the values at the new data points.
So, let's say your data contains 5 points, each holding a 1D value (to simplify), and you want a new, denser array filled with a linear interpolation of your original data.
Using numpy.interp:
import numpy as np

original_data = [2, 0, 3, 5, 1]   # define your data in 1D
new_data_resolution = 0.5         # new sampling distance (i.e., your x-axis resolution)
interp_data = np.interp(
    x=np.arange(0, 5 - 1 + new_data_resolution, new_data_resolution),  # new sampling points (new axis)
    xp=np.arange(len(original_data)),  # original sampling points
    fp=original_data
)
# now interp_data contains (5-1) / 0.5 + 1 = 9 points
After this, you will have an array of length (5-1)/new_resolution + 1 (which is greater than 5, since new_resolution < 1), whose values will be (in this case) a linear interpolation of your original data.
After you have achieved/understood this example, you can dive into the scipy.interpolate module to get a better understanding of the interpolation functions (my example uses a linear function to get the data at the missing points).
Applying this to n-dimensional arrays is straightforward: iterate over each dimension of your data.
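For the closed 2D shape in the question, a parametric spline is often more convenient than interpolating x and y by hand. Below is a minimal sketch using scipy.interpolate.splprep/splev with per=True so the curve closes cleanly; pts is an assumed (N, 2) array of the ordered contour points.
import numpy as np
from scipy import interpolate

# pts: (N, 2) array of the original shape points, ordered along the contour (assumed)
tck, u = interpolate.splprep([pts[:, 0], pts[:, 1]], s=0, per=True)  # periodic spline through the points
u_new = np.linspace(0, 1, 1000, endpoint=False)   # 1000 equally spaced parameter values
x_new, y_new = interpolate.splev(u_new, tck)      # resampled closed curve
Note that the new points are equally spaced in the spline parameter, not exactly in arc length; if you need arc-length spacing or a higher density near sharp curves, you would resample u_new accordingly.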

PCA Explained Variance Analysis

I'm very new to PCA.
I have 11 X variables for my model. These are the X variable labels
x = ['Day','Month', 'Year', 'Rolling Average','Holiday Effect', 'Day of the Week', 'Week of the Year', 'Weekend Effect', 'Last Day of the Month', "Quarter" ]
This is the explained variance output I generated (in the graph I plotted from it, the x-axis is the principal component index):
[ 3.47567089e-01 1.72406623e-01 1.68663799e-01 8.86739892e-02
4.06427375e-02 2.75054035e-02 2.26578769e-02 5.72892368e-03
2.49272688e-03 6.37160140e-05]
I need to know whether I have a good selection of features, and how I can tell which feature contributes the most.
from sklearn import decomposition
pca = decomposition.PCA()
pca.fit(X_norm)
scores = pca.explained_variance_
Though I do NOT know the dataset, I recommend that you scale your features before using PCA (variance will be maximized along the axes). I think X_norm refers to that in your code.
By using PCA, we aim to reduce dimensionality. To do that, we start with a feature space which includes all X variables in your case, and end up with a projection of that space, which typically is a different feature (sub)space.
In practice, when you have correlations between features, PCA can help you project those correlated features onto fewer dimensions.
Think about it this way: if I'm holding a paper on my desk full of dots, do I need the 3rd dimension to represent that dataset? Probably not, since all the dots are on the paper and could be represented in 2D space.
When you are trying to decide how many principal components to use from your new feature space, you can look at the explained variance; it tells you how much information is there for each principal component.
When I look at the principal components in your data, I see that ~85% of the variance can be attributed to the first 6 principal components.
You can also set n_components. For example if you use n_components=2, then your transformed dataset will have 2 features.
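To address the two concrete questions, a hedged sketch (reusing the fitted pca from the snippet above and assuming the label list x lines up with the columns of X_norm): the cumulative explained-variance ratio tells you how many components you need, and the loadings in pca.components_ tell you which original features drive each component.
import numpy as np

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)   # pick the smallest k where this passes your threshold, e.g. 0.85

# pca.components_[i, j] is the weight of original feature j in principal component i;
# large absolute values show which features contribute most to that component.
for i, component in enumerate(pca.components_[:3]):
    top = np.argsort(np.abs(component))[::-1][:3]
    print(f"PC{i+1}: " + ", ".join(x[j] for j in top))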

Python - Kriging (Gaussian Process) in scikit_learn

I am considering using this method to interpolate some 3D points I have. As an input I have atmospheric concentrations of a gas at various elevations over an area. The data I have appears as values every few feet of vertical elevation for several tens of feet, but horizontally separated by many hundreds of feet (so 'columns' of tightly packed values).
The assumption is that values vary in the vertical direction significantly more than in the horizontal direction at any given point in time.
I want to perform 3D kriging with that assumption accounted for (as a parameter I can adjust or that is statistically defined - either/or).
I believe the scikit learn module can do this. If it can, my question is how do I create a discrete cell output? That is, output into a 3D grid of data with dimensions of, say, 50 x 50 x 1 feet. Ideally, I would like an output of [x_location, y_location, value] with separation of those (or similar) distances.
Unfortunately I don't have a lot of time to play around with it, so I'm just hoping to figure out if this is possible in Python before delving into it. Thanks!
Yes, you can definitely do that in scikit_learn.
In fact, it is a basic feature of kriging/Gaussian process regression that you can use anisotropic covariance kernels.
As specified in the manual (cited below), you can either set the parameters of the covariance yourself or estimate them, and you can choose to have all parameters equal or all different.
theta0 : double array_like, optional
An array with shape (n_features, ) or (1, ). The parameters in the
autocorrelation model. If thetaL and thetaU are also specified, theta0
is considered as the starting point for the maximum likelihood
estimation of the best set of parameters. Default assumes isotropic
autocorrelation model with theta0 = 1e-1.
In the 2d case, something like this should work:
import numpy as np
from sklearn.gaussian_process import GaussianProcess  # older scikit-learn API

# Prediction grid
x = np.arange(1, 51)
y = np.arange(1, 51)
X, Y = np.meshgrid(x, y)

# obs_x, obs_y: coordinates of your observations; obs_data: the measured values
points = np.column_stack([obs_x, obs_y])
values = obs_data

gp = GaussianProcess(theta0=0.1, thetaL=.001, thetaU=1., nugget=0.001)
gp.fit(points, values)

XY_pairs = np.column_stack([X.flatten(), Y.flatten()])
predicted = gp.predict(XY_pairs).reshape(X.shape)
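Note that GaussianProcess has been removed from recent scikit-learn releases; with the current API the same anisotropic idea (long horizontal, short vertical correlation lengths) looks roughly like the sketch below, where obs_points_3d, obs_values and grid_points_3d are placeholder names for your observation coordinates, observed values and prediction grid.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# One length scale per input dimension (x, y, z); the starting values are guesses
# that the optimizer refines within the given bounds during fit.
kernel = RBF(length_scale=[500.0, 500.0, 5.0],
             length_scale_bounds=(1e-1, 1e4)) + WhiteKernel(noise_level=1e-3)

gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(obs_points_3d, obs_values)        # obs_points_3d: (n, 3) array of x, y, z
predicted = gpr.predict(grid_points_3d)   # grid_points_3d: (m, 3) cell centers of your output grid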

Python - Convolution with a Gaussian

I need to convolve the following curve with a Gaussian function with specific parameters, centered at 3934.8 A.
The problem I see is that my curve is a discrete array, while the Gaussian would be a well-defined continuous function. How can I make this work?
To do this, you need to create a Gaussian that's discretized at the same spatial scale as your curve, then just convolve.
Specifically, say your original curve has N points that are uniformly spaced along the x-axis (where N will generally be somewhere between 50 and 10,000 or so). Then the point spacing along the x-axis will be (physical range)/(digital range) = (3940-3930)/N, and the code would look like this:
import numpy as np

dx = float(3940 - 3930) / N                 # point spacing, in physical units
gx = np.arange(-3 * sigma, 3 * sigma, dx)   # sigma: your Gaussian width, in the same units
gaussian = np.exp(-(gx / sigma)**2 / 2)
gaussian /= gaussian.sum()                  # normalize so the convolution preserves the curve's scale
result = np.convolve(original_curve, gaussian, mode="full")
Here this is a zero-centered gaussian and does not include the offset you refer to (which to me would just add confusion, since the convolution by its nature is a translating operation, so starting with something already translated is confusing).
I highly recommend keeping everything in real, physical units, as I did above. Then it's clear, for example, what the width of the gaussian is, etc.
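For completeness, here is a self-contained toy version of the same recipe, with an assumed wavelength grid around the line at 3934.8 A and an assumed sigma, using mode="same" so the smoothed curve stays aligned with the original grid:
import numpy as np
import matplotlib.pyplot as plt

N = 1000
wavelengths = np.linspace(3930.0, 3940.0, N)            # assumed x-grid, physical units
curve = np.random.default_rng(0).normal(1.0, 0.05, N)   # stand-in for the measured curve
curve[np.abs(wavelengths - 3934.8) < 0.2] += 2.0        # a fake spectral feature

sigma = 0.5                                             # assumed Gaussian width, same units
dx = wavelengths[1] - wavelengths[0]
gx = np.arange(-3 * sigma, 3 * sigma, dx)
kernel = np.exp(-(gx / sigma)**2 / 2)
kernel /= kernel.sum()                                  # preserve the curve's scale

smoothed = np.convolve(curve, kernel, mode="same")      # same length/alignment as curve

plt.plot(wavelengths, curve, label='original')
plt.plot(wavelengths, smoothed, label='convolved')
plt.legend()
plt.show()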

How do I perform a convolution in python with a variable-width Gaussian?

I need to perform a convolution using a Gaussian, however the width of the Gaussian needs to change. I'm not doing traditional signal processing; instead I need to take my perfect Probability Density Function (PDF) and "smear" it, based on the resolution of my equipment.
For instance, suppose my PDF starts out as a spike/delta-function. I'll model this as a very narrow Gaussian. After being run through my equipment, it will be smeared out according to some Gaussian resolution. I can calculate this using the scipy.signal convolution functions.
import numpy as np
import matplotlib.pylab as plt
import scipy.signal as signal
import scipy.stats as stats
# Create the initial function. I model a spike
# as an arbitrarily narrow Gaussian
mu = 1.0 # Centroid
sig=0.001 # Width
original_pdf = stats.norm(mu,sig)
x = np.linspace(0.0,2.0,1000)
y = original_pdf.pdf(x)
plt.plot(x,y,label='original')
# Create the "smearing" function to convolve with the
# original function.
# I use a Gaussian, centered at 0.0 (no bias) and
# width of 0.5
mu_conv = 0.0 # Centroid
sigma_conv = 0.5 # Width
convolving_term = stats.norm(mu_conv,sigma_conv)
xconv = np.linspace(-5,5,1000)
yconv = convolving_term.pdf(xconv)
convolved_pdf = signal.convolve(y/y.sum(),yconv,mode='same')
plt.plot(x,convolved_pdf,label='convolved')
plt.ylim(0,1.2*max(convolved_pdf))
plt.legend()
plt.show()
This all works no problem. But now suppose my original PDF is not a spike, but some broader function, for example a Gaussian with sigma=1.0. And now suppose my resolution actually varies over x: at x=0.5, the smearing function is a Gaussian with sigma_conv=0.5, but at x=1.5, the smearing function is a Gaussian with sigma_conv=1.5. And suppose I know the functional form of the x-dependence of my smearing Gaussian. Naively, I thought I would change the line above to
convolving_term = stats.norm(mu_conv,lambda x: 0.2*x + 0.1)
But that doesn't work, because the norm function expects a value for the width, not a function. In some sense, I need my convolving function to be a 2D array, where I have a different smearing Gaussian for each point in my original PDF, which remains a 1D array.
So is there a way to do this with functions already defined in Python? I have some code to do this that I wrote myself... but I want to make sure I've not just re-invented the wheel.
Thanks in advance!
Matt
Question, in brief:
How do I convolve with a non-stationary kernel, for example a Gaussian that changes width for different locations in the data, and does Python have an existing tool for this?
Answer, sort-of:
It's difficult to prove a negative, but I do not think that a function to perform a convolution with a non-stationary kernel exists in scipy or numpy. Anyway, as you describe it, it can't really be vectorized well, so you may as well do a loop or write some custom C code.
One trick that might work for you is, instead of changing the kernel size with position, to stretch the data with the inverse scale (i.e., at places where you'd want the Gaussian width to be 0.5 of the base width, stretch the data by 2x). This way, you can do a single warping operation on the data, a standard convolution with a fixed-width Gaussian, and then unwarp the data to the original scale.
The advantages of this approach are that it's very easy to write and completely vectorized, and therefore probably fairly fast to run.
Warping the data (using, say, an interpolation method) will cause some loss of accuracy, but if you choose things so that the data is always expanded and not reduced in your initial warping operation, the losses should be minimal.
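If you do end up writing the loop yourself, here is a minimal sketch of the direct (brute-force) version of the question's 2D-array idea. It builds one Gaussian column per source point, with the width taken from an assumed sigma_of_x function, and conserves the total probability by normalizing each column:
import numpy as np
import scipy.stats as stats

def variable_width_smear(x, y, sigma_of_x):
    """Smear y(x) with a Gaussian whose width depends on the source position.

    Builds the full (N, N) kernel matrix: O(N^2), but simple and exact for modest N.
    """
    sigmas = sigma_of_x(x)                              # resolution at each source point
    kernel = np.exp(-0.5 * ((x[:, None] - x[None, :]) / sigmas[None, :])**2)
    kernel /= kernel.sum(axis=0, keepdims=True)         # each column sums to 1
    return kernel @ y

# Example with the x-dependent width from the question: sigma(x) = 0.2*x + 0.1
x = np.linspace(0.0, 2.0, 1000)
y = stats.norm(1.0, 0.1).pdf(x)                         # a broader starting PDF
smeared = variable_width_smear(x, y, lambda x: 0.2 * x + 0.1)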
