I'm trying to use KernelPCA for reducing the dimensionality of a dataset to 2D (both for visualization purposes and for further data analysis).
I experimented with computing KernelPCA using an RBF kernel at various values of gamma, but the result is unstable:
(each frame is a slightly different value of Gamma, where Gamma is varying continuously from 0 to 1)
Looks like it is not deterministic.
Is there a way to stabilize it/make it deterministic?
Code used to generate transformed data:
from sklearn.decomposition import KernelPCA

def pca(X, gamma1):
    kpca = KernelPCA(kernel="rbf", fit_inverse_transform=True, gamma=gamma1)
    X_kpca = kpca.fit_transform(X)
    # X_back = kpca.inverse_transform(X_kpca)
    return X_kpca
KernelPCA should be deterministic and evolve continuously with gamma. It is different from RBFSampler, which does have built-in randomness in order to provide an efficient (more scalable) approximation of the RBF kernel.
However what can change in KernelPCA is the order of the principal components: in scikit-learn they are returned sorted in order of descending eigenvalue, so if you have 2 eigenvalues close to each other it could be that the order changes with gamma.
My guess (from the gif) is that this is what is happening here: the axes along which you are plotting are not constant so your data seems to jump around.
Could you provide the code you used to produce the gif?
I'm guessing it is a plot of the data points along the 2 first principal components but it would help to see how you produced it.
You could try to further inspect it by looking at the values of kpca.alphas_ (the eigenvectors) for each value of gamma.
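For example, a rough sketch (assuming X is your data array) that tracks the top two eigenvalues as gamma varies, so you can see whether they get close and swap order:

import numpy as np
from sklearn.decomposition import KernelPCA

for gamma in np.linspace(0.01, 1.0, 20):
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma).fit(X)
    # lambdas_ holds the eigenvalues sorted in descending order,
    # alphas_ the corresponding eigenvectors
    print(gamma, kpca.lambdas_[:2])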
Hope this makes sense.
EDIT: As you noted, it looks like the points are reflected across an axis; the most plausible explanation is that one of the eigenvectors flips sign (note that this does not affect the eigenvalue).
I put in a simple gist to reproduce the issue (you'll need a Jupyter notebook to run it). You can see the sign-flipping when you change the value of gamma.
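If you mainly need the plots to be stable, one possible workaround (just a sketch, not part of the gist) is to fix the sign convention yourself after each fit, e.g. flip each component so that its largest-magnitude coordinate is positive:

import numpy as np

def fix_signs(X_kpca):
    # flip each column (component) so that the entry with the largest
    # absolute value is positive; this removes the arbitrary +/- ambiguity
    max_abs_rows = np.argmax(np.abs(X_kpca), axis=0)
    signs = np.sign(X_kpca[max_abs_rows, range(X_kpca.shape[1])])
    return X_kpca * signs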
As a complement, note that this kind of discrepancy happens only because you fit the KernelPCA object several times. Once you have settled on a particular gamma value and fit kpca once, you can call transform several times and get consistent results.
For the classical PCA the docs mention that:
Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.
I don't know about the behavior of a single KernelPCA object that you would fit several times (I did not find anything relevant in the docs).
It does not apply to your case though as you have to fit the object with several gamma values.
So... I can't give you a definitive answer on why KernelPCA is not deterministic. The behavior resembles the differences I've observed between the results of PCA and RandomizedPCA. PCA is deterministic, but RandomizedPCA is not, and sometimes the eigenvectors are flipped in sign relative to the PCA eigenvectors.
That leads me to my vague idea of how you might get more deterministic results....maybe. Use RBFSampler with a fixed seed:
from sklearn.decomposition import PCA
from sklearn.kernel_approximation import RBFSampler

def pca(X, gamma1):
    # approximate the RBF kernel feature map with a fixed seed...
    kernvals = RBFSampler(gamma=gamma1, random_state=0).fit_transform(X)
    # ...then run ordinary (deterministic) PCA on the approximated features
    X_kpca = PCA().fit_transform(kernvals)
    return X_kpca
Related
In many tutorials/blogs I've seen the output of np.fft.fft(signal) divided by the number of sample points N.
I understand that some implementations scale/normalize the transform by a factor such as 1/N. However, I just read the docs, and by default the output of fft.fft() is unscaled. Yet I still see the output divided by N everywhere.
Why is this?
I have noticed that by scaling the output by 1/N I get back the correct amplitudes of the contributing wave signals. So obviously it is necessary, but I'd like to understand what the pure output is as compared to the scaled output.
For the DFT to be reversible (x == IDFT(DFT(x))), you need to divide by N somewhere. In signal processing this normalization is typically done in the inverse transform. For example Wikipedia shows it this way.
In other fields it is more often done in the forward transform. In physics I have seen half the normalization (1/sqrt(N)) applied to each transform, making them symmetric.
When the forward transform is normalized, the values it returns are independent of the signal length (for example, the zero-frequency component is the mean of all signal values). This is therefore the more useful variant when studying signal power.
With the variant where the normalization is applied in the inverse transform (as commonly implemented in signal processing software, such as np.fft.fft() and MATLAB's fft), computing a convolution by multiplication in the frequency domain is easiest: one can directly write g = IDFT(DFT(f)*DFT(h)). If the normalization is applied elsewhere, it must be partly undone to obtain a correctly scaled result.
Other software, for example the FFTW library, does not normalize the transform at all, leaving that up to the user. This avoids unnecessary multiplications if the user wants a different normalization variant than what the library chooses.
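To make the convention concrete, here is a small sketch (the 4 Hz cosine and the sample count are made up for illustration) showing that np.fft.fft leaves the forward transform unscaled while np.fft.ifft applies the 1/N:

import numpy as np

N = 64
t = np.arange(N) / 64.0                  # 1 second sampled at 64 Hz
x = 3 * np.cos(2 * np.pi * 4 * t)        # 4 Hz cosine with amplitude 3

X = np.fft.fft(x)                        # unnormalized forward transform
print(np.abs(X[4]))                      # ~ 3 * N / 2 = 96, grows with the signal length
print(2 * np.abs(X[4]) / N)              # ~ 3, the amplitude, once divided by N
                                         # (the factor 2 accounts for the mirrored negative-frequency bin)
print(np.allclose(np.fft.ifft(X), x))    # True: the 1/N lives in the inverse transform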
Based on Parseval's theorem,

    sum_{n=0}^{N-1} |x[n]|^2 = (1/N) * sum_{k=0}^{N-1} |X[k]|^2

This expresses that the energies in the time and frequency domains are the same. It means that the magnitude |X[k]| of each frequency bin k is contributed to by N samples. In order to find the average contribution by each sample, the magnitude is normalized as |X[k]|/N, which leads to

    (1/N) * sum_{n=0}^{N-1} |x[n]|^2 = sum_{k=0}^{N-1} |X[k]/N|^2

where the LHS is the power of the signal.
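A quick numerical check of Parseval's relation for the unnormalized DFT, with a made-up random signal:

import numpy as np

x = np.random.randn(128)
X = np.fft.fft(x)
# time-domain energy equals the (unnormalized) frequency-domain energy divided by N
print(np.allclose(np.sum(np.abs(x) ** 2), np.sum(np.abs(X) ** 2) / len(x)))   # True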
However, such a scale normally doesn't matter unless you care about the unit of the magnitude, like in the case of the Sound Pressure Level (SPL) spectrum.
Assume I have a set of points (x, y, and size). I want to find clusters in my data using sklearn.cluster.DBSCAN, and their centers. That is no problem if I treat every point the same. But actually I want the weighted centers instead of the geometric centers (meaning a bigger point should count more than a smaller one).
I came across sample_weight, but I don't quite get whether that is what I need. When I use sample_weight (right side) I get completely different clusters from the case when I don't use it (left side):
Second, I thought about using np.repeat(x, w), where x is my data and w is the size of each point, so that I get multiple copies of the points proportional to their weights. But this is probably not a smart solution as I have a lot of data, right?
Is sample_weight useful in my case, or are there better solutions than using np.repeat? I know that there are already some questions about sample_weight, but I could not work out how to use it exactly.
Thanks!
The most important thing for DBSCAN is the parameter setting. There are two parameters: epsilon and minPts (= min_samples). Epsilon is the neighborhood radius around each point, and minPts is the minimum number of points that must fall within that radius for a point to be considered part of a cluster. So instead of using np.repeat I would suggest adjusting the parameters for this dataset.
According to the documentation of DBSCAN, sample_weight is a tuning parameter for your runtime:
Another way to reduce memory and computation time is to remove
(near-)duplicate points and use sample_weight instead.
I think you want to address the quality of your result first before you tune your runtime.
I am not sure what you mean by weighted centers; probably you are referring to a different clustering algorithm, such as a Gaussian mixture model.
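That said, if by weighted centers you simply mean a weighted mean of the points in each cluster, a minimal sketch of both pieces could look like this (X and sizes are placeholder arrays standing in for your coordinates and point sizes):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))              # placeholder coordinates
sizes = rng.uniform(0.5, 5.0, size=200)    # placeholder point sizes used as weights

db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit(X, sample_weight=sizes).labels_

# weighted center of each cluster (label -1 is noise and is skipped)
for k in set(labels) - {-1}:
    mask = labels == k
    center = np.average(X[mask], axis=0, weights=sizes[mask])
    print(k, center)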
I have trouble using WAIC (widely applicable information criterion) in PyMC3. Namely, I have data which I know to be distributed according to a multivariate Dirichlet distribution. I try to fit the data by assuming that the marginal distributions are beta distributions in one case and log-normal distributions in the other. Obviously, in the first case I get a lower (better) WAIC value than in the second case.
The problem arises in the third case, when I assume that the data is distributed according to a Dirichlet distribution. The third WAIC is significantly larger (worse) than in the first two cases. I would expect this WAIC to be lower (better) than the one I get in the second (log-normal) case.
Basically I want to show that the log-normal fit is bad. This is easily seen by the naked eye, but I would like to have a formal result to show.
The minimal code to replicate what I get:
import pandas as pd
import numpy as np
import pymc3 as pm

# generate the data
df = pd.DataFrame(np.random.dirichlet([10, 10, 10], size=2000))

# fit the first case (assuming beta marginal distributions)
betaModel = pm.Model()
with betaModel:
    alpha = pm.Uniform("alpha", lower=0, upper=20, shape=3)
    beta = pm.Uniform("beta", lower=0, upper=40, shape=3)
    observed = pm.Beta("obs", alpha=alpha, beta=beta, observed=df.values, shape=df.shape)
    betaTrace = pm.sample()

# fit the second case (assuming log-normal marginal distributions)
lognormalModel = pm.Model()
with lognormalModel:
    mu = pm.Normal("mu", mu=0, sd=3, shape=3)
    sd = pm.HalfNormal("sd", sd=3, shape=3)
    observed = pm.Lognormal("obs", mu=mu, sd=sd, observed=df.values, shape=df.shape)
    lognormalTrace = pm.sample()

# fit the third case (assuming Dirichlet multivariate distribution)
dirichletModel = pm.Model()
with dirichletModel:
    alpha = pm.HalfNormal("alpha", sd=3, shape=3)
    observed = pm.Dirichlet("obs", a=alpha, observed=df.values, shape=df.shape)
    dirichletTrace = pm.sample()

# compare WAIC
print(pm.waic(betaTrace, betaModel))
print(pm.waic(lognormalTrace, lognormalModel))
print(pm.waic(dirichletTrace, dirichletModel))
The output is:
WAIC_r(WAIC=-12801.95319823564, WAIC_se=105.07502476563575, p_WAIC=5.941977774190434)
WAIC_r(WAIC=-12534.643059697866, WAIC_se=115.43257184238044, p_WAIC=6.68850211472046)
WAIC_r(WAIC=-9156.050975326929, WAIC_se=81.45146980652675, p_WAIC=2.7977039523187996)
I guess the problem might be related to an error:
ValueError: operands could not be broadcast together with shapes (6000,) (2000,)
which I get when I try to run:
pm.compare((betaTrace,lognormalTrace,dirichletTrace),(betaModel,lognormalModel,dirichletModel))
Any suggestions how to obtain a reasonable comparison?
Edit
After thinking about the problem, I have come to believe that the comparison itself is somewhat "improper". I tend to think so because WAIC is a relative measure, so it is likely that only similar statistical models can be reasonably compared. If the models are too dissimilar, then you get what I got.
The error I get from pm.compare seems to be related to how random vectors are treated. In the first two cases each component of a random vector is treated as a separate random variate (3 components per 2000 vectors = 6000 points). In the third case the random vector as a whole is treated as a single random variate (2000 vectors = 2000 points).
Initially I thought that this problem could be resolved by reducing the number of points in the first two cases, i.e. by summing the log-probabilities of the three components of each vector. But as the first two statistical models (wrongly) assume that the components are independent, adding the log-probabilities does not change anything; the WAIC values remain the same.
Currently I think that a small cheat is possible: fit the data to the Dirichlet distribution, but calculate WAIC as if I had fitted the beta distributions. This gives the expected result: WAIC for the Dirichlet fit is slightly larger than WAIC for the beta fit, but smaller than WAIC for the log-normal fit.
The code for this "cheat":
from collections import namedtuple
from scipy.special import logsumexp

def cheat_logp(tracePoint, model):
    values = model.obs.eval()
    _, components = values.shape
    cb = [None] * components
    beta = np.sum(tracePoint["alpha"])
    for i in range(components):
        cheatBeta = pm.Beta.dist(alpha=tracePoint["alpha"][i], beta=beta - tracePoint["alpha"][i])
        cb[i] = cheatBeta.logp(values[:, i]).eval()
    return np.array(cb).T

def _log_post_trace(trace, model):
    # copy the contents of the _log_post_trace function from pymc3/stats.py,
    # but replace "var.logp_elemwise(pt)" with "cheat_logp(pt, model)"
    # <...>

def mywaic(trace, model=None, pointwise=False):
    # copy the contents of the waic function from pymc3/stats.py
    # <...>
Obviously this cheat is not very "nice", and I am still very much interested in how to achieve similar results in a proper manner, if that is possible at all.
I have written python (2.7.3) code wherein I aim to create a weighted sum of 16 data sets, and compare the result to some expected value. My problem is to find the weighting coefficients which will produce the best fit to the model. To do this, I have been experimenting with scipy's optimize.minimize routines, but have had mixed results.
Each of my individual data sets is stored as a 15x15 ndarray, so their weighted sum is also a 15x15 array. I define my own 'model' of what the sum should look like (also a 15x15 array), and quantify the goodness of fit between my result and the model using a basic least squares calculation.
R=np.sum(np.abs(model/np.max(model)-myresult)**2)
'myresult' is produced as a function of some set of parameters 'wts'. I want to find the set of parameters 'wts' which will minimise R.
To do so, I have been trying this:
res = minimize(get_best_weightings,wts,bounds=bnds,method='SLSQP',options={'disp':True,'eps':100})
Where my objective function is:
def get_best_weightings(wts):
    wts_tr = wts[0:16]
    wts_ti = wts[16:32]
    for i, j in enumerate(portlist):
        originalwtsr[j] = wts_tr[i]
        originalwtsi[j] = wts_ti[i]
    realwts = originalwtsr
    imagwts = originalwtsi
    myresult = make_weighted_beam(realwts, imagwts, 1)
    R = np.sum((np.abs(modelbeam / np.max(modelbeam) - myresult)) ** 2)
    return R
The input (wts) is an ndarray of shape (32,), and the output, R, is just some scalar, which should get smaller as my fit gets better. By my understanding, this is exactly the sort of problem ("Minimization of scalar function of one or more variables.") which scipy.optimize.minimize is designed to optimize (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.minimize.html ).
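To be explicit about the call pattern I mean, here it is reduced to a self-contained toy (the quadratic objective and the target vector are made up, not my real beam code):

import numpy as np
from scipy.optimize import minimize

target = np.random.RandomState(0).uniform(-1, 1, size=32)   # stand-in for the 'true' weights

def toy_objective(wts):
    # least-squares misfit, analogous to R above
    return np.sum(np.abs(wts - target) ** 2)

x0 = np.zeros(32)
bnds = [(-4000, 4000)] * 32
res = minimize(toy_objective, x0, bounds=bnds, method='SLSQP')
print(res.success, res.fun)   # on this toy quadratic, SLSQP should drive res.fun to ~0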
However, when I run the code, although the optimization routine seems to iterate over different values of all the elements of wts, only a few of them seem to 'stick', i.e., all but four of the values are returned unchanged from my initial guess. To illustrate, I plot the values of my initial guess for wts (in blue) and the optimized values (in red). You can see that for most elements, the two lines overlap.
Image:
http://imgur.com/p1hQuz7
Changing just these few parameters is not enough to get a good answer, and I can't understand why the other parameters aren't also being optimised. I suspect that maybe I'm not understanding the nature of my minimization problem, so I'm hoping someone here can point out where I'm going wrong.
I have experimented with a variety of minimize's built-in methods (I am by no means committed to SLSQP, or certain that it's the most appropriate choice), and with a variety of 'step sizes' eps. The bounds I am using for my parameters are all (-4000, 4000). I only have scipy version 0.11, so I haven't tested a basinhopping routine to get the global minimum (this needs 0.12). I have looked at optimize.brute, but haven't tried implementing it yet; I thought I'd check if anyone can steer me in a better direction first.
Any advice appreciated! Sorry for the wall of text and the possibly (probably?) idiotic question. I can post more of my code if necessary, but it's pretty long and unpolished.
I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these- similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do principal component analysis and match by the first eigenvector.
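A rough sketch of that idea (the helper names are made up; each point set is assumed to be an (n, 2) array):

import numpy as np

def principal_axis(points):
    # unit eigenvector of the covariance matrix with the largest eigenvalue,
    # i.e. the main orientation of the point cloud
    centered = points - points.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(centered.T))
    return vecs[:, np.argmax(vals)]

def best_match(query, candidates):
    # compare orientations; abs() so that a 180-degree flip does not matter
    q = principal_axis(query)
    scores = [abs(np.dot(q, principal_axis(c))) for c in candidates]
    return int(np.argmax(scores))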
You could also just fit the distributions to the data, determine the chi^2 deviation for each one, and look at an F-test. See for instance these notes on model fitting.
You might also want to consider non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data sets) in order to compare the statistics or distances of the estimated distributions. In Python, scipy.stats.gaussian_kde (in scipy.stats) provides an implementation.
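For example, a minimal sketch of that idea (the array names are placeholders; each point set is assumed to be an (n, 2) array):

import numpy as np
from scipy.stats import gaussian_kde

def kde_score(query, reference):
    # fit a KDE on the reference set and return the mean log-density of the
    # query points under it (higher = better match)
    kde = gaussian_kde(reference.T)   # gaussian_kde expects shape (n_dims, n_points)
    return np.mean(np.log(kde(query.T)))

# pick the reference set under which the query points are most likely:
# best = np.argmax([kde_score(query, ref) for ref in references])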