Interpolation to a less dense grid + least-squares fitting in Python

I am new to Python and a bit confused about interpolation and least-squares fitting of two ndarrays.
I have 2 ndarrays:
My final goal is to do a least-squares fit of the modelled spectrum (blue curve) to the observed spectrum (orange curve).
Blue curve ndarray has the following parameters:
Orange curve ndarray has the following parameters:
As a first and easiest step I wanted to plot the residuals (differences) between the two ndarrays, but the problem is that they have different sizes: 391 and 256 respectively. I've tried the numpy.reshape and ndarray.reshape functions, but they lead to errors.
Probably the proper solution is to start by interpolating the blue curve onto the coarser grid of the orange curve. I've tried the numpy.interp function, but it also leads to errors.

Something along the lines of the following:
import numpy as np
import matplotlib.pyplot as plt
n_denser = 33
n_coarser = 7
x_denser = np.linspace(0,1,n_denser)
y_denser = np.power(x_denser, 2) + np.random.randn(n_denser)/10.
x_coarser = np.linspace(0,1,n_coarser)
y_coarser = np.power(x_coarser, 2) + np.random.randn(n_coarser)/10. + 0.5
y_dense_interp = np.interp(x_coarser, x_denser, y_denser)
plt.plot(x_denser, y_denser, 'b+-')
plt.plot(x_coarser, y_coarser, 'ro:')
plt.plot(x_coarser, y_dense_interp, 'go')
plt.legend(['dense data', 'coarse data', 'interp data'])
plt.show()
Which returns something like:

Your confusion seems to stem from mixing up the methods you mention. Least-squares is not an interpolation method; it is a curve-fitting method that minimizes the sum of squared residuals. One key difference is that an interpolant always passes through the original data points. With least-squares this can happen, but it is not generally the case.
Cubic-spline interpolation will give you 'nice' plots if you need to pass through the original data points.
If you want to use least-squares, you need to know what degree polynomial you want to fit. The most common is linear (first order).
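For the original problem (a 391-point model and a 256-point observation), the two steps can be combined: resample the model onto the observed grid, then solve a linear least-squares problem. The following is only a minimal sketch. The spectra are synthetic stand-ins, and the assumption that the model needs just a scale factor and a constant offset to match the observation is illustrative, not something stated in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two spectra: a dense model and a coarser observation
x_model = np.linspace(0.0, 1.0, 391)
y_model = np.exp(-((x_model - 0.5) / 0.1) ** 2)
x_obs = np.linspace(0.0, 1.0, 256)
y_obs = (2.0 * np.exp(-((x_obs - 0.5) / 0.1) ** 2) + 0.3
         + rng.normal(0.0, 0.01, x_obs.size))

# Step 1: resample the model onto the observed (coarser) grid
y_model_on_obs = np.interp(x_obs, x_model, y_model)

# Step 2: least-squares fit of y_obs ~ scale * model + offset (assumed model form)
design = np.column_stack([y_model_on_obs, np.ones_like(x_obs)])
(scale, offset), *_ = np.linalg.lstsq(design, y_obs, rcond=None)

# Residuals are well defined now that both arrays have 256 points
residuals = y_obs - (scale * y_model_on_obs + offset)
print(scale, offset, residuals.std())
```

Note that np.interp expects the x values of the source grid (here x_model) to be increasing; an unsorted wavelength axis is one possible reason for unexpected results from it.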

Related

Inaccurate interpolation with scipy.interpolate.Rbf()

When I execute the following code
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import Rbf
x_coarse, y_coarse = np.mgrid[0:5, 0:5]
x_fine, y_fine = np.mgrid[1:4:0.23,1:4:0.23]
data_coarse = np.ones([5,5])
rbfi = Rbf(x_coarse.ravel(), y_coarse.ravel(), data_coarse.ravel())
interpolated_data = rbfi(x_fine.ravel(), y_fine.ravel()).reshape([x_fine.shape[0],
                                                                  y_fine.shape[0]])
plt.imshow(interpolated_data)
the array interpolated_data has values ranging from 0.988 to 1.002 and the corresponding plot looks like this:
However, I would expect that in such a simple interpolation case, the interpolated values would be a lot closer to the correct value, i.e. 1.000.
I think the variations in the interpolated values are caused by the different distances from the interpolated points to the given data points.
My question is: Is there a way to avoid this behavior? How can I get an interpolation that is not weighted by the distance of the interpolated points to the data points and gives me nothing but 1.000 in interpolated_data?
I would expect that in such a simple interpolation case,
An unwarranted expectation. RBF interpolation, as its name says, uses radial basis functions. By default the basis function is sqrt((r/epsilon)**2 + 1), where r is the distance from a data point and epsilon is a positive parameter. There is no way for a weighted sum of such functions to be identically constant. RBF interpolation isn't like linear or bilinear interpolation. It's a rough interpolation suitable for rough data.
By setting an absurdly large epsilon, you can get closer to 1; just because it makes the basis functions nearly identical on the grid.
rbfi = Rbf(x_coarse.ravel(), y_coarse.ravel(), data_coarse.ravel(), epsilon=10)
# ...
print(interpolated_data.min(), interpolated_data.max())
# outputs 0.9999983458255883 1.0000002402521204
However this is not a good idea, because when the data is not constant, there will be too much long-range influence in the interpolant.
gives me nothing but 1.000 in interpolated_data?
That would be linear interpolation. LinearNDInterpolator has similar syntax to Rbf, in that it returns a callable.
from scipy.interpolate import LinearNDInterpolator
linear = LinearNDInterpolator(np.stack((x_coarse.ravel(), y_coarse.ravel()), axis=-1),
                              data_coarse.ravel())
interpolated_data = linear(x_fine.ravel(), y_fine.ravel()).reshape([x_fine.shape[0], y_fine.shape[0]])
print(interpolated_data.min(), interpolated_data.max())
# outputs 1.0 1.0
There is also a griddata which has more interpolation modes.
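As a rough sketch of the griddata route, reusing the grids from the question (the variable names points and interpolated are only illustrative):

```python
import numpy as np
from scipy.interpolate import griddata

x_coarse, y_coarse = np.mgrid[0:5, 0:5]
x_fine, y_fine = np.mgrid[1:4:0.23, 1:4:0.23]
data_coarse = np.ones([5, 5])

# griddata takes the scattered sample locations as an (n_points, 2) array
points = np.column_stack([x_coarse.ravel(), y_coarse.ravel()]).astype(float)

# method can be 'nearest', 'linear' or 'cubic'; the result has the shape of x_fine
interpolated = griddata(points, data_coarse.ravel(), (x_fine, y_fine), method='linear')
print(interpolated.shape, interpolated.min(), interpolated.max())
```

With method='linear', points outside the convex hull of the data come back as NaN unless fill_value is set; that does not occur here because the fine grid lies inside the coarse one.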

Fit spline through scatter

I have two sets of data between which I want to find a correlation. Although the data is quite scattered, there is obviously a relation. I currently use numpy polyfit (8th order), but the line "wiggles" (especially at the beginning and the end), which is not appropriate. Secondly, I don't think the fit is very good at the beginning of the line (the curve should be slightly steeper).
How can I get a best fit "spline" through these data points?
My current code:
# fit regression line
regressionLineOrder = 8
regressionLine = np.polyfit(data['x'], data['y'], regressionLineOrder)
p = np.poly1d(regressionLine)
Take a look at @MatthewDrury's answer to "Why use regularisation in polynomial regression instead of lowering the degree?". It's simply fantastic and spot on. The most interesting bit comes at the end, when he starts talking about using a natural cubic spline to fit a regression in place of a regularized polynomial of degree 10. You could use scipy.interpolate.CubicSpline to accomplish something very similar. scipy.interpolate also contains many classes for other spline methods.
Here is a simple example:
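from scipy.interpolate import CubicSpline
# CubicSpline needs x in strictly increasing order (sorted, no repeated values)
cs = CubicSpline(data['x'], data['y'])
# x_min, x_max and some_step are placeholders for the plotting range and resolution
x_range = np.arange(x_min, x_max, some_step)
plt.plot(x_range, cs(x_range), label='Cubic Spline')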
There are some possible issues with your data set. In your plot the n (x, y) points are linked with straight lines; if you display the points as markers instead, you should see the point density along your domain, and it is not evenly distributed. Say your domain is [xmin, xmax]: an 8th-order polynomial can follow the data, but it wiggles because of the high order and also because the point density is so uneven. Polynomials are not good for extrapolation either, since there are no control points outside your domain. You could fix that with a spline: a natural cubic spline constrains the (second) derivative at xmin and xmax. To do that, sort your dataset along the x axis and take a subsample of the n points, smoothed with a rolling average, as control points for the spline algorithm. If your problem has an analytical solution (your point distribution looks like a Gaussian variogram, for instance), just try optimizing its parameters (range and sill, in the Gaussian variogram case) to minimize the error inside the domain and follow the asymptotes outside.
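An interpolating spline passes through every scattered point, so for noisy data a smoothing spline is one alternative worth trying. The sketch below uses scipy.interpolate.UnivariateSpline on synthetic scatter; the data, the noise level and the choice of smoothing factor s are all assumptions made for illustration and would need tuning on the real data set.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)

# Hypothetical scatter: a saturating trend plus noise, with x in random order
x = rng.uniform(0.0, 10.0, 200)
y = 1.0 - np.exp(-x / 2.0) + rng.normal(0.0, 0.05, x.size)

# UnivariateSpline expects x in increasing order, so sort first
order = np.argsort(x)
x_sorted, y_sorted = x[order], y[order]

# s is the smoothing factor: s=0 interpolates every point, larger s gives a
# smoother curve. Here it is set from the assumed noise level (n * sigma**2).
spl = UnivariateSpline(x_sorted, y_sorted, k=3, s=x.size * 0.05 ** 2)

x_range = np.linspace(x_sorted[0], x_sorted[-1], 300)
y_fit = spl(x_range)
```

Plotting y_fit against x_range together with the raw points shows whether s is too small (wiggles return) or too large (the steep start is flattened).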

How good is this interpolation method?

I came up with a custom interpolation method for my problem and I'd like to ask if there are any risks in using it. I am not a math or programming expert, which is why I'd like some feedback :)
Story:
I was searching for a good curve-fit method for my data when I came up with an idea to interpolate the data.
I am mixing paints together and making reflectance measurements with a spectrophotometer when the film is dry. I would like to calculate the required proportions of white and colored paints to reach a certain lightness, regardless of any hue shift (e.g. black+white paints gives a bluish grey) or chroma loss (e.g. orange+white gives "pastel" yellowish orange, etc.)
I checked whether the Beer-Lambert law applies, but it does not. Pigment mixing behaves in a more complicated fashion than dye dilution. So I wanted to fit a curve to my data points (the process is explained here: Interpolation for color-mixing).
First step was doing a calibration curve, I tested the following ratios of colored VS white paints mixed together:
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
This is the plot of my carefully prepared samples, measured with a spectrophotometer, the blue curve represents the full color (ratio = 1), the red curve represents the white paint (ratio = 0), the black curves the mixed samples:
Second step: I wanted to derive from this data a function that would compute a spectral curve for any ratio between 0 and 1. I tested several curve-fitting (fitting an exponential function) and interpolation (quadratic, cubic) methods, but the results were of poor quality.
For example, this is my reflectance data at 380nm for all the color samples:
This is the result of scipy.optimize.curve_fit using the function:
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * np.exp(-b * x) + c
popt, pcov = curve_fit(func, x, y)
Then I came up with this idea: the logarithm of the spectral data gives a closer match to a straight line, and the logarithm of the logarithm of the data is almost a straight line, as demonstrated by this code and graph:
import numpy as np
import matplotlib.pyplot as plt
reflectance_at_380nm = 5.319, 13.3875, 24.866, 35.958, 47.1105, 56.2255, 65.232, 83.9295
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
linear_approx = np.log(np.log(reflectance_at_380nm))
plt.plot(ratios, linear_approx)
plt.show()
What I did then was to interpolate the linear approximation and then convert the data back to linear, and I got a very nice interpolation of my data, much better than what I got before:
import numpy as np
import matplotlib.pyplot as plt
reflectance_at_380nm = 5.319, 13.3875, 24.866, 35.958, 47.1105, 56.2255, 65.232, 83.9295
ratios = 1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0
linear_approx = np.log(np.log(reflectance_at_380nm))
xnew = np.arange(100)/100.
# scipy.interpolate.spline(..., order=1) is no longer available in recent SciPy;
# np.interp is the piecewise-linear equivalent (it needs increasing x, hence [::-1])
cs = np.interp(xnew, ratios[::-1], linear_approx[::-1])
cs = np.exp(np.exp(cs))
plt.plot(xnew, cs)
plt.plot(ratios, reflectance_at_380nm, 'ro')
plt.show()
So my question is for experts: how good is this interpolation method and what are the risks of using it? Can it lead to wrong results?
Also: can this method be improved or does it already exists and if so how is it called?
Thank you for reading
This looks similar to the Kernel Method that is used for fitting regression lines or finding decision boundaries for classification problems.
The idea behind the kernel trick is that the data is transformed into a different (often higher-dimensional) space where it is linearly separable (for classification) or has a linear curve fit (for regression). After the curve fitting is done, the inverse transformation can be applied. In your case, successive logarithms (log(log(x))) seem to be the transformation and successive exponentiations (exp(exp(x))) the inverse transformation.
I am not sure if there is a kernel that does exactly this, but the intuition is similar. Here is a medium article explaining this for classification using SVM:
https://medium.com/#zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d
Since it is a method that is quite popularly used in Machine Learning, I doubt it will lead to wrong results if the fit is done properly (not under-fit or over-fit) - and this needs to be judged by statistical testing.
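One simple way to judge the fit on the data from the question is a leave-one-out check: drop each interior sample, predict it from the remaining ones, and compare with the measured value. This is only a sketch of that idea, not an established test for this method; the helper predict is a hypothetical name, and plain linear interpolation is included purely as a baseline.

```python
import numpy as np

reflectance_at_380nm = np.array([5.319, 13.3875, 24.866, 35.958,
                                 47.1105, 56.2255, 65.232, 83.9295])
ratios = np.array([1, 1/2., 1/4., 1/8., 1/16., 1/32., 1/64., 0])

# np.interp needs increasing x, so sort by ratio
order = np.argsort(ratios)
x = ratios[order]
r = reflectance_at_380nm[order]

def predict(x_known, r_known, x_new, transform=True):
    """Piecewise-linear interpolation, optionally done in log(log(R)) space."""
    if transform:
        return np.exp(np.exp(np.interp(x_new, x_known, np.log(np.log(r_known)))))
    return np.interp(x_new, x_known, r_known)

# Leave out each interior sample in turn and predict it from the others
errors_loglog, errors_plain = [], []
for i in range(1, len(x) - 1):
    x_rest, r_rest = np.delete(x, i), np.delete(r, i)
    errors_loglog.append(predict(x_rest, r_rest, x[i]) - r[i])
    errors_plain.append(predict(x_rest, r_rest, x[i], transform=False) - r[i])

print("log-log errors:", np.round(errors_loglog, 2))
print("plain errors:  ", np.round(errors_plain, 2))
```

One risk that follows directly from the transform: log(log(R)) is only defined for R > 1, so the method breaks for reflectance values at or below 1 in these units.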

Intuitive interpolation between unevenly spaced points

I have the following graph that I want to digitize to a high-quality publication grade figure using Python and Matplotlib:
I used a digitizer program to grab a few samples from one of the 3 data sets:
x_data = np.array([
1,
1.2371,
1.6809,
2.89151,
5.13304,
9.23238,
])
y_data = np.array([
0.0688824,
0.0490012,
0.0332843,
0.0235889,
0.0222304,
0.0245952,
])
I have already tried 3 different methods of fitting a curve through these data points. The first method was to draw a spline through the points using spline from scipy.interpolate.
This results in (with the actual data points drawn as blue markers):
This is obviously no good.
My second attempt was to fit a curve using a series of polynomials of different orders with curve_fit from scipy.optimize. Even up to a fourth-order polynomial the answer is useless (the lower-order ones were even more useless):
Finally, I used interp1d from scipy.interpolate to try to interpolate between the data points. Linear interpolation obviously yields the expected results, but the lines are straight and the whole purpose of this exercise is to get a nice smooth curve:
If I then use cubic interpolation I get a rubbish result; quadratic interpolation, however, yields a slightly better result:
But it's not quite there yet, and I don't think interp1d can do higher-order interpolation.
Is there anyone out there who has a good method of doing this? Maybe I would be better off trying to do it in IPE or something?
Thank you!
A standard cubic spline is not very good at reasonable looking interpolations between data points that are very unevenly spaced. Fortunately, there are plenty of other interpolation algorithms and Scipy provides a number of them. Here are a few applied to your data:
import numpy as np
# (the old scipy.interpolate.spline helper is not needed here and no longer exists in recent SciPy)
from scipy.interpolate import UnivariateSpline, Akima1DInterpolator, PchipInterpolator
import matplotlib.pyplot as plt
x_data = np.array([1, 1.2371, 1.6809, 2.89151, 5.13304, 9.23238])
y_data = np.array([0.0688824, 0.0490012, 0.0332843, 0.0235889, 0.0222304, 0.0245952])
x_data_smooth = np.linspace(min(x_data), max(x_data), 1000)
fig, ax = plt.subplots(1,1)
# Quadratic spline; s=0 forces it through every data point
spl = UnivariateSpline(x_data, y_data, s=0, k=2)
y_data_smooth = spl(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'b', label='UnivariateSpline (k=2)')
# Akima interpolation
bi = Akima1DInterpolator(x_data, y_data)
y_data_smooth = bi(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'g', label='Akima')
# PCHIP interpolation
bi = PchipInterpolator(x_data, y_data)
y_data_smooth = bi(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'k', label='PCHIP')
ax.scatter(x_data, y_data, label='data')
ax.legend()
plt.show()
I suggest looking through these, and also a few others, and finding one that matches what you think looks right. Also, though, you may want to sample a few more points. For example, I think the PCHIP algorithm wants to keep the fit monotonic between data points, so digitizing your minimum point would be useful (and probably a good idea regardless of the algorithm you use).

Fourier smoothing of data set

I am following this link to do a smoothing of my data set.
The technique is based on the principle of removing the higher order terms of the Fourier Transform of the signal, and so obtaining a smoothed function.
This is part of my code:
N = len(y)
y = y.astype(float) # fix issue, see below
yfft = fft(y, N)
yfft[31:] = 0.0 # set higher harmonics to zero
y_smooth = fft(yfft, N)
ax.errorbar(phase, y, yerr = err, fmt='b.', capsize=0, elinewidth=1.0)
ax.plot(phase, y_smooth/30, color='black') #arbitrary normalization, see below
However some things do not work properly.
Indeed, you can check the resulting plot :
The blue points are my data, while the black line should be the smoothed curve.
First of all I had to convert my array of data y by following this discussion.
Second, I normalized arbitrarily in order to compare the curve with the data, since I don't know why the curve originally had values much higher than the data points.
Most importantly, the curve looks mirrored ("specular") with respect to the data points, and I don't know why this happens.
It would be great to have some advice, especially on the third point, and more generally on how to optimize the smoothing with this technique for my particular data set shape.
Your problem is probably due to the shifting that the standard FFT does. You can read about it here.
Your data is real, so you can take advantage of symmetries in the FT and use the special function np.fft.rfft
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(40)
y = np.log(x + 1) * np.exp(-x/8.) * x**2 + np.random.random(40) * 15
rft = np.fft.rfft(y)
rft[5:] = 0 # Note, rft.shape == (21,)
y_smooth = np.fft.irfft(rft)
plt.plot(x, y, label='Original')
plt.plot(x, y_smooth, label='Smoothed')
plt.legend(loc=0)
plt.show()
If you plot the absolute value of rft, you will see that there is almost no information in frequencies beyond 5, so that is why I choose that threshold (and a bit of playing around, too).
Here are the results:
From what I can gather you want to build a low pass filter by doing the following:
1) Move to the frequency domain. (Fourier transform)
2) Remove undesired frequencies.
3) Move back to the time domain. (Inverse Fourier transform)
Looking at your code, instead of doing 3) you're just doing another fourier transform. Instead, try doing an actual inverse fourier transform to move back to the time domain:
y_smooth = ifft(yfft, N)
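A self-contained sketch of the three steps with the full complex FFT, on synthetic data (the signal, the seed and the cutoff keep are assumptions chosen for illustration). It also shows why a second forward transform looks mirrored and far too large: applying the forward DFT twice returns the time-reversed signal multiplied by N.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
t = np.arange(n)
clean = np.sin(2 * np.pi * t / n)
y = clean + 0.2 * rng.standard_normal(n)

# 1) Move to the frequency domain
yfft = np.fft.fft(y)

# 2) Remove undesired frequencies. Bins 1..keep are the low positive frequencies
#    and bins n-keep..n-1 their negative counterparts, so zero everything between.
keep = 5
yfft[keep + 1:n - keep] = 0.0

# 3) Move back with the INVERSE transform; the result is real up to rounding
y_smooth = np.fft.ifft(yfft).real

# For comparison: two forward transforms give n times the time-reversed signal
twice = np.fft.fft(np.fft.fft(y)).real
```

Zeroing only the upper part of the spectrum (as in yfft[31:] = 0.0) also removes the negative-frequency half, so the inverse transform is no longer purely real; zeroing symmetrically, or using rfft/irfft, avoids that.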
Have a look at scipy signal to see a bunch of already available filters.
(Edit: I'd be curious to see the results, do share!)
I would be very cautious in using this technique. By zeroing out frequency components of the FFT you are effectively constructing a brick wall filter in the frequency domain. This will result in convolution with a sinc in the time domain and likely distort the information you want to process. Look up "Gibbs phenomenon" for more information.
You're probably better off designing a low pass filter or using a simple N-point moving average (which is itself a LPF) to accomplish the smoothing.
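A minimal sketch of the N-point moving average mentioned above, using np.convolve on synthetic data (the helper name moving_average and the window length of 9 are illustrative choices):

```python
import numpy as np

def moving_average(y, n_points):
    """Smooth y with an n_points-wide boxcar window; output has the length of y."""
    window = np.ones(n_points) / n_points
    return np.convolve(y, window, mode='same')

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

y_smooth = moving_average(y, 9)
```

With mode='same' the signal is implicitly zero-padded, so the first and last n_points // 2 samples are pulled toward zero and are best ignored or handled separately.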
