Using scikit-learn's WhiteKernel for Gaussian Process Regression - python

There are two ways to specify the noise level for Gaussian Process Regression (GPR) in scikit-learn.
The first way is to specify the parameter alpha in the constructor of the class GaussianProcessRegressor which just adds values to the diagonal as expected.
The second way is incorporate the noise level in the kernel with WhiteKernel.
The documentation of GaussianProcessRegressor (see documentation here) says that specifying alpha is "equivalent to adding a WhiteKernel with c=alpha". However, I am experiencing a different behavior and want to find out what the reason is for that (and, of course, what the "correct" way or "truth" is).
Here is a code snippet plotting two different regression fits for a perturbed version of the function f(x)=x^2 although they should show the same:
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, RBF, WhiteKernel
rnd.seed(0)
n = 40
xs = np.linspace(-1, 1, num=n)
noise = 0.1
kernel1 = C()*RBF() + WhiteKernel(noise_level=noise)
kernel2 = C()*RBF()
data = xs**2 + rnd.multivariate_normal(mean=np.zeros(n), cov=noise*np.eye(n))
gpr1 = GaussianProcessRegressor(kernel=kernel1, alpha=0.0, optimizer=None)
gpr1.fit(xs[:, np.newaxis], data)
gpr2 = GaussianProcessRegressor(kernel=kernel2, alpha=noise, optimizer=None)
gpr2.fit(xs[:, np.newaxis], data)
xs_plt = np.linspace(-1., 1., num=100)
for gpr in [gpr1, gpr2]:
pred, pred_std = gpr.predict(xs_plt[:, np.newaxis], return_std=True)
plt.figure()
plt.plot(xs_plt, pred, 'C0', lw=2)
plt.scatter(xs, data, c='C1', s=20)
plt.fill_between(xs_plt, pred - 1.96*pred_std, pred + 1.96*pred_std,
alpha=0.2, color='C0')
plt.title("Kernel: %s\n Log-Likelihood: %.3f"
% (gpr.kernel_, gpr.log_marginal_likelihood(gpr.kernel_.theta)),
fontsize=12)
plt.ylim(-1.2, 1.2)
plt.tight_layout()
plt.show()
I already was looking into the implementation in the scikit-learn package, but was not able to find out what is going wrong. Or maybe I am just overseeing something and the outputs make perfect sense.
Does anyone have an idea of what is going on here or had a similar experience?
Thanks a lot!

I might be wrong here, but I believe the claim 'specifying alpha is "equivalent to adding a WhiteKernel with c=alpha"' is subtly incorrect.
When setting the GP-Regression noise, the noise is added only to K, the covariance between the training points. When adding a Whitenoise-Kernel, the noise is also added to K**, the covariance between test points.
In your case, the test points and training points are identical. However, the three different matrices are likely still created. This could lead to the discrepancy observed here.

I argue that the documentation is incorrect. See github issue #13267 about this with (which I opened).
In practice, what I do is fit a GP with the WhiteKernel then take that noice level. I then add that value to alpha and recompute the necessary variables. An easier alternative is to make a new GP with the alpha set and the same length scales but do not fit it.
I should note that it is not universally accepted as to whether or not this is the right approach. I had this discussion with a colleague and we came to the following conclusion. This pertains to the data bei.ng noise from experimental error
If you want to sample the GP to predict what a new experiment with more independent measurements, you want the WhiteKernel
If you want to sample the possible underlying truth, you do not want the WhiteKernel since you want a smooth response

https://gpflow.readthedocs.io/en/awav-documentation/notebooks/regression.html
Maybe you can use the GPflow package, which makes separate prediction for latent function f and observation y (f+ noise).
m.predict_f returns the mean and variance of the latent function (f) at the points Xnew.
m.predict_y returns the mean and variance of a new data point (i.e. includes the noise variance).

Related

Very low p-values in Python Kolmogorov-Smirnov Goodness of Fit Test

I have a set of data and fit the corresponding histogram by a lognormal distribution.
I first calculate the optimal parameters for the lognormal function, and then plot the histogram and the lognormal function. This gives quite good results:
import scipy as sp
import numpy as np
import matplotlib.pyplot as plt
num_data = len(data)
x_axis = np.linspace(min(data),
max(data),num_data)
number_of_bins = 240
histo, bin_edges = np.histogram(data, number_of_bins, normed=False)
shape, location, scale = sp.stats.lognorm.fit(data)
plt.hist(data, number_of_bins, normed=False);
# the scaling factor scales the normalized lognormal function up to the size
# of the histogram:
scaling_factor = len(data)*(max(data)-min(data))/number_of_bins
plt.plot(x_axis,scaling_factor*sp.stats.lognorm.pdf(x_axis, shape,
location, scale),'r-')
# adjust the axes dimensions:
plt.axis([bin_edges[0]-10,bin_edges[len(bin_edges)-1]+10,0, histo.max()*1.1])
However, when performing the Kolmogorov-Smirnov test on the data versus the fitting function, I get way too low p-values (of the order of e-32):
lognormal_ks_statistic, lognormal_ks_pvalue =
sp.stats.kstest(
data,
lambda k: sp.stats.lognorm.cdf(k, shape, location, scale),
args=(),
N=len(data),
alternative='two-sided',
mode='approx')
print(lognormal_ks_statistic)
print(lognormal_ks_pvalue)
This is not normal, since we see from the plot that the fitting is quite accurate... does anybody know where I made a mistake?
Thanks a lot!!
Charles
This simply means that your data isn't exactly log-normal. Based on the histogram, you have a lot of data points for the K-S test to use. This means that if your data is evenly slightly different than would be expected based on a log-normal distribution with those parameters, the K-S test will indicate the data isn't drawn from log-normal.
Where is the data from? If it is from an organic source, or any source other than specifically drawing random numbers from a lognormal distribution, I would expect an extremely small p-value, even if the fits looks great. This certainly isn't a problem though as long as the fit is sufficiently good for your purposes.

Interpolate with lmfit?

I am trying to fit a curve with lmfit but the data set I'm working with does not contain a lot of points and this makes the resulting fit look jagged instead of curved.
I'm simply using the line:
out = mod.fit(SV, pars, x=VR)
were VR and SV are the coordinates of the points I'm trying to fit.
I've tried using scipy.interpolate.UnivariateSpline and the fitted the resulting data but I want to know if there is a built-in or faster way to do this.
Thank you
There is not a built-in way to automatically interpolate with lmfit. With a lmfit Model, you provide the array on independent values at which the Model should be evaluated, and an array of data to compared to that model.
You're free to interpolate or smooth the data or perform some other transformation (I sometimes Fourier transform data and model to emphasize some frequencies), but you'll have to include that as part of the model.
While you might be able to do the job with scipy.interpolate.UnivariateSpline, you would basically be fitting to the fit you already did.
Instead you can use the components that are given to you already from your original fit. It's very trivial once you know how, but the lmfit documentation does not provide a clear case.
import numpy as np
from lmfit.models import GaussianModel
import matplotlib.pyplot as plt
y, _ = np.histogram(np.random.normal(size=1000), bins=10, density=True)
x = np.linspace(0, 1, y.size)
# Replace with whatever model you are using (with the caveat that the above dataset is gaussian).
model = GaussianModel()
params = model.guess(y, x=x)
result = model.fit(y, params, x=x)
x_interp = np.linspace(0, 1, 100*y.size)
# The model is attached to the result, which makes it easier if you're sending it somewhere.
y_interp = result.model.func(x_interp, **result.best_values)
plt.plot(x, y, label='original')
plt.plot(x_interp, y_interp, label='interpolated')
plt.legend()
plt.show()

How good is this interpolation method?

I came up with a custom interpolation method for my problem and I'd like to ask if there are any risks using it. I am not a math or programming expert, that's why I'd like a feedback :)
Story:
I was searching for a good curve-fit method for my data when I came up with an idea to interpolate the data.
I am mixing paints together and making reflectance measurements with a spectrophotometer when the film is dry. I would like to calculate the required proportions of white and colored paints to reach a certain lightness, regardless of any hue shift (e.g. black+white paints gives a bluish grey) or chroma loss (e.g. orange+white gives "pastel" yellowish orange, etc.)
I check if Beer-Lambert law applies, but it does not. Pigment-mixing behaves in a more complicated fashion than dye-dilutions. So I wanted to fit a curve to my data points (the process is explained here: Interpolation for color-mixing
First step was doing a calibration curve, I tested the following ratios of colored VS white paints mixed together:
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
This is the plot of my carefully prepared samples, measured with a spectrophotometer, the blue curve represents the full color (ratio = 1), the red curve represents the white paint (ratio = 0), the black curves the mixed samples:
Second step I wanted to guess from this data a function that would compute a spectral curve for any ration between 0 and 1. I did test several curve fitting (fitting an exponential function) and interpolation (quadratic, cubic) methods but the results were of a poor quality.
For example, this is my reflectance data at 380nm for all the color samples:
This is the result of scipy.optimize.curve_fit using the function:
def func(x, a, b, c):
return a * np.exp(-b * x) + c
popt, pcov = curve_fit(func, x, y)
Then I came-up with this idea: the logarithm of the spectral data gives a closer match to a straight line, and the logarithm of the logarithm of the data is almost a straight line, as demonstrated by this code and graph:
import numpy as np
import matplotlib.pyplot as plt
reflectance_at_380nm = 5.319, 13.3875, 24.866, 35.958, 47.1105, 56.2255, 65.232, 83.9295
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
linear_approx = np.log(np.log(reflectance_at_380nm))
plt.plot(ratios, linear_approx)
plt.show()
What I did then is to interpolate the linear approximation an then convert the data back to linear, then I got a very nice interpolation of my data, much better than what I got before:
import numpy as np
import matplotlib.pyplot as plt
import scipy.interpolate
reflectance_at_380nm = 5.319, 13.3875, 24.866, 35.958, 47.1105, 56.2255, 65.232, 83.9295
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
linear_approx = np.log(np.log(reflectance_at_380nm))
xnew = np.arange(100)/100.
cs = scipy.interpolate.spline(ratios, linear_approx, xnew, order=1)
cs = np.exp(np.exp(cs))
plt.plot(xnew,cs)
plt.plot(x,y,'ro')
plt.show()
So my question is for experts: how good is this interpolation method and what are the risks of using it? Can it lead to wrong results?
Also: can this method be improved or does it already exists and if so how is it called?
Thank you for reading
This looks similar to the Kernel Method that is used for fitting regression lines or finding decision boundaries for classification problems.
The idea behind the Kernel trick being, the data is transformed into a dimensional space (often higher dimensional), where the data is linearly separable (for classification), or has a linear curve-fit (for regression). After the curve-fitting is done, inverse transformations can be applied. In your case successive exponentiations (exp(exp(X))), seems to be the inverse transformation and successive logarithms (log(log(x)))seems to be the transformation.
I am not sure if there is a kernel that does exactly this, but the intuition is similar. Here is a medium article explaining this for classification using SVM:
https://medium.com/#zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d
Since it is a method that is quite popularly used in Machine Learning, I doubt it will lead to wrong results if the fit is done properly (not under-fit or over-fit) - and this needs to be judged by statistical testing.

Intuitive interpolation between unevenly spaced points

I have the following graph that I want to digitize to a high-quality publication grade figure using Python and Matplotlib:
I used a digitizer program to grab a few samples from one of the 3 data sets:
x_data = np.array([
1,
1.2371,
1.6809,
2.89151,
5.13304,
9.23238,
])
y_data = np.array([
0.0688824,
0.0490012,
0.0332843,
0.0235889,
0.0222304,
0.0245952,
])
I have already tried 3 different methods of fitting a curve through these data points. The first method being to draw a spline through the points using scipy.interpolate import spline
This results in (with the actual data points drawn as blue markers):
This is obvisously no good.
My second attempt was to draw a curve fit using a series of different order polinimials using scipy.optimize import curve_fit. Even up to a fourth order polynomial the answer is useless (the lower order ones were even more useless):
Finally, I used scipy.interpolate import interp1d to try and interpolate between the data points. Linear interpolation obviously yields expected results but the line are straight and the whole purpose of this exercise is to get a nice smooth curve:
If I then use cubic interpolation I get a rubish result, however quadratic interpolation yields a slightly better result:
But it's not quite there yet, and I don't think interp1d can do higher order interpolation.
Is there anyone out there who has a good method of doing this? Maybe I would be better off trying to do it in IPE or something?
Thank you!
A standard cubic spline is not very good at reasonable looking interpolations between data points that are very unevenly spaced. Fortunately, there are plenty of other interpolation algorithms and Scipy provides a number of them. Here are a few applied to your data:
import numpy as np
from scipy.interpolate import spline, UnivariateSpline, Akima1DInterpolator, PchipInterpolator
import matplotlib.pyplot as plt
x_data = np.array([1, 1.2371, 1.6809, 2.89151, 5.13304, 9.23238])
y_data = np.array([0.0688824, 0.0490012, 0.0332843, 0.0235889, 0.0222304, 0.0245952])
x_data_smooth = np.linspace(min(x_data), max(x_data), 1000)
fig, ax = plt.subplots(1,1)
spl = UnivariateSpline(x_data, y_data, s=0, k=2)
y_data_smooth = spl(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'b')
bi = Akima1DInterpolator(x_data, y_data)
y_data_smooth = bi(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'g')
bi = PchipInterpolator(x_data, y_data)
y_data_smooth = bi(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'k')
ax.plot(x_data_smooth, y_data_smooth)
ax.scatter(x_data, y_data)
plt.show()
I suggest looking through these, and also a few others, and finding one that matches what you think looks right. Also, though, you may want to sample a few more points. For example, I think the PCHIP algorithm wants to keep the fit monotonic between data points, so digitizing your minimum point would be useful (and probably a good idea regardless of the algorithm you use).

Spline representation with scipy.interpolate: Poor interpolation for low-amplitude, rapidly oscillating functions

I need to (numerically) calculate the first and second derivative of a function for which I've attempted to use both splrep and UnivariateSpline to create splines for the purpose of interpolation the function to take the derivatives.
However, it seems that there's an inherent problem in the spline representation itself for functions who's magnitude is order 10^-1 or lower and are (rapidly) oscillating.
As an example, consider the following code to create a spline representation of the sine function over the interval (0,6*pi) (so the function oscillates three times only):
import scipy
from scipy import interpolate
import numpy
from numpy import linspace
import math
from math import sin
k = linspace(0, 6.*pi, num=10000) #interval (0,6*pi) in 10'000 steps
y=[]
A = 1.e0 # Amplitude of sine function
for i in range(len(k)):
y.append(A*sin(k[i]))
tck =interpolate.UnivariateSpline(x, y, w=None, bbox=[None, None], k=5, s=2)
M=tck(k)
Below are the results for M for A = 1.e0 and A = 1.e-2
http://i.imgur.com/uEIxq.png Amplitude = 1
http://i.imgur.com/zFfK0.png Amplitude = 1/100
Clearly the interpolated function created by the splines is totally incorrect! The 2nd graph does not even oscillate the correct frequency.
Does anyone have any insight into this problem? Or know of another way to create splines within numpy/scipy?
Cheers,
Rory
I'm guessing that your problem is due to aliasing.
What is x in your example?
If the x values that you're interpolating at are less closely spaced than your original points, you'll inherently lose frequency information. This is completely independent from any type of interpolation. It's inherent in downsampling.
Nevermind the above bit about aliasing. It doesn't apply in this case (though I still have no idea what x is in your example...
I just realized that you're evaluating your points at the original input points when you're using a non-zero smoothing factor (s).
By definition, smoothing won't fit the data exactly. Try putting s=0 in instead.
As a quick example:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
x = np.linspace(0, 6.*np.pi, num=100) #interval (0,6*pi) in 10'000 steps
A = 1.e-4 # Amplitude of sine function
y = A*np.sin(x)
fig, axes = plt.subplots(nrows=2)
for ax, s, title in zip(axes, [2, 0], ['With', 'Without']):
yinterp = interpolate.UnivariateSpline(x, y, s=s)(x)
ax.plot(x, yinterp, label='Interpolated')
ax.plot(x, y, 'bo',label='Original')
ax.legend()
ax.set_title(title + ' Smoothing')
plt.show()
The reason that you're only clearly seeing the effects of smoothing with a low amplitude is due to the way the smoothing factor is defined. See the documentation for scipy.interpolate.UnivariateSpline for more details.
Even with a higher amplitude, the interpolated data won't match the original data if you use smoothing.
For example, if we just change the amplitude (A) to 1.0 in the code example above, we'll still see the effects of smoothing...
The problem is in choosing suitable values for the s parameter. Its values depend on the scaling of the data.
Reading the documentation carefully, one can deduce that the parameter should be chosen around s = len(y) * np.var(y), i.e. # of data points * variance. Taking for example s = 0.05 * len(y) * np.var(y) gives a smoothing spline that does not depend on the scaling of the data or the number of data points.
EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.

Categories