Discontinuity in numerical derivative of an interpolating cubic spline - python

I am trying to calculate and plot the numerical derivative (dy/dx) from two lists x and y. I am using scipy.interpolate.UnivariateSpline and scipy.interpolate.UnivariateSpline.derivative to compute the slope. The plot of y vs x seems to be C1 continuous, and I was expecting the slope dy/dx to be smooth as well when plotted against x. So what is causing the little bump in the plot here? Also, any suggestions on how I can massage the code to make the derivative C1 continuous?
import numpy as np
from matplotlib import pyplot as plt
from scipy.interpolate import UnivariateSpline
x=[20.14141131550861, 20.29161104293003, 20.458574567775457, 20.653802880772922, 20.910446090013004, 21.404599384233677, 21.427939384233678, 21.451279384233676, 21.474619384233677, 21.497959384233678, 21.52129938423368, 21.52130038423368, 21.54463938423368, 21.56797938423368, 21.59131938423368, 21.61465938423368, 21.63799938423368, 22.132152678454354, 22.388795887694435, 22.5840242006919]
y=[-1.6629252348586834, -1.7625046339166028, -1.875358801338162, -2.01040013818419, -2.193327440415778, -2.5538174545988306, -2.571799827167608, -2.5896274995868005, -2.607298426787476, -2.624811539182082, -2.642165776735291, -2.642165776735291, -2.659360089028171, -2.6763934353217587, -2.693264784620056, -2.7099731157324367, -2.7265165368570314, -3.0965791078676754, -3.290845721407758, -3.440799238587583]
spl1 = UnivariateSpline(x,y,s=0)
dydx = spl1.derivative(n=1)
T = dydx(x)
plt.plot(x,y,'-x')
plt.plot(x,T,'-')
plt.show()

The given data points look like they define a nice C1-smooth curve, but they do not. Plotting the slopes (difference of y over difference of x) shows this:
plt.plot(np.diff(y)/np.diff(x))
There are duplicate values of y in the array that look like they don't belong, along with some near-duplicate (but not identical) values of x.
The easiest way to fix the spline is to allow a tiny bit of smoothing:
spl1 = UnivariateSpline(x, y, s=1e-5)
makes the derivative what you expected:
Removing the "bad apple" also helps, though not as much.
spl1 = UnivariateSpline(x[:10] + x[11:], y[:10] + y[11:], s=0)  # list concatenation drops the offending point at index 10
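If you'd rather locate such points programmatically than by eye, here is a small sketch (reusing the x list from the question; the 1e-3 threshold is an arbitrary choice of mine):
import numpy as np

# Flag consecutive samples whose x-spacing is tiny compared to the
# typical gap; these near-duplicates are what kink the derivative.
dx = np.diff(x)
suspects = np.where(dx < 1e-3 * np.median(dx))[0]
print(suspects)  # [10] -> x[10] and x[11] are nearly identical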

Related

savgol_filter from the scipy.signal library: how to get the resulting polynomial function?

savgol_filter gives me the series.
I want to get the underlying polynomial function.
The function of the red line in a below picture.
So that I can extrapolate a point beyond the given x range.
Or I can find the slope of the function at the two extreme data points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
yhat = savgol_filter(y, 51, 3) # window size 51, polynomial order 3
plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()
**Edit**
Since the filter uses least-squares regression to fit the data in a small window to a polynomial of a given degree, you can probably only extrapolate from the ends. I think the fitted curve is a piecewise function of these local fits, and each piece alone would not be a good representation of the entire data as a whole. What you could do is take the end windows of your data and fit them to the same polynomial degree as the Savitzky-Golay filter (using numpy's polyfit). It likely will not be accurate very far from the window, though.
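For example, here is a minimal sketch of that end-window idea, reusing x and y from the code above (the window size and degree mirror the filter call; everything else is my own choice):
import numpy as np

window, degree = 51, 3
# Refit a polynomial of the same degree to the last `window` samples only.
tail_poly = np.poly1d(np.polyfit(x[-window:], y[-window:], degree))
print(tail_poly(2*np.pi + 0.1))   # extrapolated value just past the data
print(tail_poly.deriv()(x[-1]))   # slope at the right-hand end point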
You can also use scipy.signal.savgol_coeffs() to get the coefficients of the filter. I think you take the dot product of the coefficient array with a window of your data to get the value at a given point. You can include a derivative argument to get the slope at the ends of your data.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_coeffs.html#scipy.signal.savgol_coeffs
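A sketch of the savgol_coeffs route, as I read the docs (pos=50 asks for the last point of the 51-sample window, and use='dot' orders the coefficients for a dot product):
from scipy.signal import savgol_coeffs

# First-derivative coefficients, evaluated at the window's last sample.
c = savgol_coeffs(51, 3, deriv=1, delta=x[1] - x[0], pos=50, use='dot')
end_slope = c.dot(y[-51:])  # slope estimate at the right end of the data
print(end_slope)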

Curve fitting with cubic spline

I am trying to interpolate a cumulated distribution of e.g. i) number of people against ii) number of owned cars, showing that e.g. the top 20% of people own much more than 20% of all cars; of course 100% of people own 100% of cars. I also know that there are e.g. 100mn people and 200mn cars.
Now coming to my code:
#import libraries (more than required here)
import pandas as pd
from scipy import interpolate
from scipy.interpolate import interp1d
from sympy import symbols, solve, Eq
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
curve=pd.read_excel('inputs.xlsx',sheet_name='inputdata')
Input data: Curveplot (cumulated people (x) on the left // cumulated cars (y) on the right)
#Input data in list form (I am not sure how to interpolate from a list for the moment)
cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]
x, y = cumulatedpeople, cumulatedcars  # the lists above (originally read from the excel file as `points`)
interpolation = interp1d(x, y, kind = 'cubic')
number_of_people_mn= 100000000
oneperson = 1 / number_of_people_mn
dataset = pd.DataFrame(range(number_of_people_mn + 1))
dataset.columns = ["nr_of_one_person"]
dataset.drop(dataset.index[:1], inplace=True)
#calculating the position of every single person on the cumulated x-axis (between 0 and 1)
dataset["cumulatedpeople"] = dataset["nr_of_one_person"] / number_of_people_mn
#finding the "cumulatedcars" to the "cumulatedpeople" via interpolation (between 0 and 1)
dataset["cumulatedcars"] = interpolation(dataset["cumulatedpeople"])
plt.plot(dataset["cumulatedpeople"], dataset["cumulatedcars"])
plt.legend(['Cubic interpolation'], loc = 'best')
plt.xlabel('Cumulated people')
plt.ylabel('Cumulated cars')
plt.title("People-to-car cumulated curve")
plt.show()
However, when looking at the actual plot, I get the following result, which is wrong: [cubic interpolation plot]
In fact, the curve should look almost like one from a linear interpolation with the exact same input data (however, linear interpolation is not accurate enough for my purpose): [linear interpolation plot]
Is there any relevant step I am missing out or what would be the best way to get an accurate interpolation from the inputs that almost looks like the one from a linear interpolation?
Short answer: your code is doing the right thing, but the data is unsuitable for cubic interpolation.
Let me explain. Here is your code, simplified for clarity:
import numpy as np
from scipy.interpolate import interp1d
from matplotlib import pyplot as plt
cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]
interpolation = interp1d(cumulatedpeople, cumulatedcars, kind = 'cubic')
number_of_people_mn = 100  # 000000 -- trailing zeros commented out so the demo runs fast
cumppl = np.arange(number_of_people_mn + 1)/number_of_people_mn
cumcars = interpolation(cumppl)
plt.plot(cumppl, cumcars)
plt.plot(cumulatedpeople, cumulatedcars,'o')
plt.show()
Note the last couple of lines: I am plotting, on the same graph, both the interpolated results and the input data. Here is the result:
Orange dots are the original data; the blue line is the cubic interpolation. The interpolator passes through all the points, so technically it is doing the right thing.
Clearly, though, it is not doing what you would want.
The reason for such strange behavior is mostly at the right end where you have a few x-points that are very close together -- the interpolator produces massive wiggles trying to fit very closely spaced points.
If I remove two right-most points from the interpolator:
interpolation = interp1d(cumulatedpeople[:-2], cumulatedcars[:-2], kind = 'cubic')
it looks a bit more reasonable:
But still, one would argue linear interpolation is better. The wiggles are now at the left end, because the gaps between the initial x-points there are too large.
The moral here is that cubic interpolation should really be used only if the gaps between x points are roughly uniform.
Your best bet here, I think, is to use something like curve_fit.
A related discussion can be found here.
Specifically, monotone interpolation, as explained here, yields good results on your data. Copying the relevant bits, you would replace the interpolator with
from scipy.interpolate import pchip
interpolation = pchip(cumulatedpeople, cumulatedcars)
and get a decent-looking fit:
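As a quick sanity check (a sketch reusing the arrays above): PCHIP preserves the monotonicity of the input, so the fit cannot overshoot anywhere:
import numpy as np

grid = np.linspace(0, 1, 1001)
vals = interpolation(grid)
print(np.all(np.diff(vals) >= 0))  # True: no wiggles anywhere on [0, 1]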

Fourier transform or fit of sines and cosines to a 2D surface from discrete point cloud data

I have x,y,z data that define a surface (x and y position, z height).
The data is imperfect, in that it contains some noise, i.e. not every point lies precisely on the plane I wish to model, just very close to it.
I only have data within a triangular region, not the full x,y plane.
Here is an example with z represented by colour:
In this example the data has been sampled in the centres of triangles on a mesh like this (each blue dot is a sample):
If it is necessary, the samples could be evenly spaced on an x,y grid, though a solution where this is not required is preferable.
I want to represent this data as a sum of sines and cosines in order to manipulate it mathematically. Ideally using as few terms as are needed to keep the error of the fit acceptably low.
If this were a square region I would take the 2D Fourier transform and discard higher frequency terms.
However I think this situation has two key differences that make this approach not viable:
Ideally I want to use samples at the points indicated by the blue dots in my grid above. I could instead use a regular x,y grid if there is no alternative, but this is not an ideal solution
I do not have data for the whole x,y plane. The white areas in the first image above do not contain data that should be considered in the fit.
So, in summary, my question is:
Is there a way to extract coefficients for a best-fit of this data using a linear combination of sines and cosines?
Ideally using python.
My apologies if this is more of a mathematics question and stack overflow is not the correct place to post this!
EDIT: Here is an example dataset in python style [x,y,z] form - sorry it's huge but apparently I can't use pastebin?:
[[1.7500000000000001e-08, 1.0103629710818452e-08, 14939.866751020554],
[1.7500000000000001e-08, 2.0207259421636904e-08, 3563.2218207404617],
[8.7500000000000006e-09, 5.0518148554092277e-09, 24529.964593228644],
[2.625e-08, 5.0518148554092261e-09, 24529.961688158553],
[1.7500000000000001e-08, 5.0518148554092261e-09, 21956.74682671843],
[2.1874999999999999e-08, 1.2629537138523066e-08, 10818.190869824304],
[1.3125000000000003e-08, 1.2629537138523066e-08, 10818.186813746233],
[1.7500000000000001e-08, 2.5259074277046132e-08, 3008.9480862705223],
[1.3125e-08, 1.7681351993932294e-08, 5630.9978116591838],
[2.1874999999999999e-08, 1.768135199393229e-08, 5630.9969846863969],
[8.7500000000000006e-09, 1.0103629710818454e-08, 13498.380006002562],
[4.3750000000000003e-09, 2.5259074277046151e-09, 40376.866196753763],
[1.3125e-08, 2.5259074277046143e-09, 26503.432370909999],
[2.625e-08, 1.0103629710818452e-08, 13498.379635232159],
[2.1874999999999999e-08, 2.5259074277046139e-09, 26503.430698738041],
[3.0625000000000005e-08, 2.525907427704613e-09, 40376.867011915041],
[8.7500000000000006e-09, 1.2629537138523066e-08, 11900.832515759088],
[6.5625e-09, 8.8406759969661469e-09, 17422.002946526718],
[1.09375e-08, 8.8406759969661469e-09, 17275.788904632376],
[4.3750000000000003e-09, 5.0518148554092285e-09, 30222.756636780832],
[2.1875000000000001e-09, 1.2629537138523088e-09, 64247.241146490327],
[6.5625e-09, 1.2629537138523084e-09, 35176.652106572205],
[1.3125e-08, 5.0518148554092277e-09, 22623.574247287044],
[1.09375e-08, 1.2629537138523082e-09, 27617.700396641056],
[1.5312500000000002e-08, 1.2629537138523078e-09, 25316.907231576402],
[2.625e-08, 1.2629537138523066e-08, 11900.834523905782],
[2.4062500000000001e-08, 8.8406759969661469e-09, 17275.796410700641],
[2.8437500000000002e-08, 8.8406759969661452e-09, 17422.004617294893],
[2.1874999999999999e-08, 5.0518148554092269e-09, 22623.570035270699],
[1.96875e-08, 1.2629537138523076e-09, 25316.9042194055],
[2.4062500000000001e-08, 1.2629537138523071e-09, 27617.700160860692],
[3.0625000000000005e-08, 5.0518148554092261e-09, 30222.765972585737],
[2.8437500000000002e-08, 1.2629537138523069e-09, 35176.65151453446],
[3.2812500000000003e-08, 1.2629537138523065e-09, 64247.246775422129],
[2.1875000000000001e-09, 2.5259074277046151e-09, 46711.23463223876],
[1.0937500000000001e-09, 6.3147685692615553e-10, 101789.89315354674],
[3.28125e-09, 6.3147685692615543e-10, 52869.788364220134],
[3.2812500000000003e-08, 2.525907427704613e-09, 46711.229428833962],
[3.1718750000000001e-08, 6.3147685692615347e-10, 52869.79233902022],
[3.3906250000000006e-08, 6.3147685692615326e-10, 101789.92509671643],
[1.0937500000000001e-09, 1.2629537138523088e-09, 82527.848790063814],
[5.4687500000000004e-10, 3.1573842846307901e-10, 137060.87432327325],
[1.640625e-09, 3.157384284630789e-10, 71884.380087542726],
[3.3906250000000006e-08, 1.2629537138523065e-09, 82527.861035177877],
[3.3359375000000005e-08, 3.1573842846307673e-10, 71884.398689011548],
[3.4453125000000001e-08, 3.1573842846307663e-10, 137060.96214950032],
[4.3750000000000003e-09, 6.3147685692615347e-09, 18611.868317256733],
[3.28125e-09, 4.4203379984830751e-09, 27005.961455364879],
[5.4687499999999998e-09, 4.4203379984830751e-09, 28655.126635802204],
[3.0625000000000005e-08, 6.314768569261533e-09, 18611.869287539808],
[2.9531250000000002e-08, 4.4203379984830734e-09, 28655.119850641502],
[3.1718750000000001e-08, 4.4203379984830726e-09, 27005.959731047784]]
Nothing stops you from doing normal linear least squares with whatever basis you like. (You'll have to work out the periodicity you want, as mikuszefski said.) The lack of samples outside the triangle will naturally blind the method to the function's behavior out there. You probably want to weight the samples according to the area of their mesh cell, to avoid overfitting the corners.
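A minimal sketch of that idea follows; the periods Lx and Ly, the truncation order N, and the uniform weights are all placeholder assumptions (a fuller basis would also include negative-m wavevectors):
import numpy as np

def fit_fourier(x, y, z, w, Lx, Ly, N):
    # Design matrix: a constant column plus cos/sin for each wavevector (n, m).
    cols = [np.ones_like(x)]
    for n in range(N + 1):
        for m in range(N + 1):
            if n == 0 and m == 0:
                continue
            ph = 2*np.pi*(n*x/Lx + m*y/Ly)
            cols.extend([np.cos(ph), np.sin(ph)])
    A = np.column_stack(cols)
    # Weighted least squares: scale each row (and z) by sqrt of its weight.
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], z * sw, rcond=None)
    return coef, A

# e.g. with the posted dataset loaded as an (n, 3) array `data`:
# coef, A = fit_fourier(data[:, 0], data[:, 1], data[:, 2],
#                       np.ones(len(data)), 3.5e-8, 2.6e-8, 2)
# residuals = data[:, 2] - A @ coef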
Here is some code that might help to fit periodic spikes. It also shows the use of the basis x, x/2 + sqrt(3)/2 * y. The flat part can then be handled by a low-order Fourier series. I hope that gives an idea. (BTW I agree with Davis Herring that area weighting is a good idea.) For the fit, I guess, good initial guesses are crucial.
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
def gauss(x, s):
    return np.exp(-x**2/(2.*s**2))

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') was removed in recent matplotlib
X = np.arange(-5, 5, 0.15)
Y = np.arange(-5, 5, 0.15)
X, Y = np.meshgrid(X, Y)
kX=np.sin(X)
kY=np.sin(0.5*X+0.5*np.sqrt(3.)*Y)
R = np.sqrt(kX**2 + kY**2)
Z = gauss(R,.4)
#~ surf = ax.plot_wireframe(X, Y, Z, linewidth=1)
surf= ax.plot_surface(X, Y, Z, rstride=1, cstride=1,linewidth=0, antialiased=False)
plt.show()
Output:

Discrete fourier transformation from a list of x-y points

What I'm trying to do is, from a list of x-y points that has a periodic pattern, calculate the period. With my limited mathematics knowledge I know that Fourier Transformation can do this sort of thing.
I'm writing Python code.
I found a related answer here, but it uses an evenly-distributed x axis, i.e. dt is fixed, which isn't the case for me. Since I don't really understand the math behind it, I'm not sure if it would work properly in my code.
My question is, does it work? Or, is there some method in numpy that already does my work? Or, how can I do it?
EDIT: All values are Pythonic float (i.e. double-precision)
For samples that are not evenly spaced, you can use scipy.signal.lombscargle to compute the Lomb-Scargle periodogram. Here's an example, with a signal whose dominant frequency is 2.5 rad/s.
from __future__ import division
import numpy as np
from scipy.signal import lombscargle
import matplotlib.pyplot as plt
np.random.seed(12345)
n = 100
x = np.sort(10*np.random.rand(n))
# Dominant periodic signal
y = np.sin(2.5*x)
# Add some smaller periodic components
y += 0.15*np.cos(0.75*x) + 0.2*np.sin(4*x+.1)
# Add some noise
y += 0.2*np.random.randn(x.size)
plt.figure(1)
plt.plot(x, y, 'b')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
dxmin = np.diff(x).min()
duration = x.ptp()
freqs = np.linspace(1/duration, n/duration, 5*n)
periodogram = lombscargle(x, y, freqs)
kmax = periodogram.argmax()
print("%8.3f" % (freqs[kmax],))
plt.figure(2)
plt.plot(freqs, np.sqrt(4*periodogram/(5*n)))
plt.xlabel('Frequency (rad/s)')
plt.grid()
plt.axvline(freqs[kmax], color='r', alpha=0.25)
plt.show()
The script prints 2.497 and generates the following plots:
As starting point:
(I assume all coordinates are positive integers; otherwise map them to a reasonable range like 0..4095)
find max coordinates xMax, yMax in list
make 2D array with dimensions yMax, xMax
fill it with zeros
walk through your list, setting the array elements corresponding to each coordinate pair to 1
make 2D Fourier transform
look for peculiarities (peaks) in the FT result (a rough sketch of these steps follows below)
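Here is a rough sketch of those steps with toy integer data (real coordinates would first be mapped into such a range):
import numpy as np

points = [(3, 7), (8, 2), (13, 7), (18, 2), (23, 7)]  # toy periodic list
xmax = max(px for px, py in points)
ymax = max(py for px, py in points)
img = np.zeros((ymax + 1, xmax + 1))
for px, py in points:
    img[py, px] = 1.0  # mark each sample in the 2D array
spectrum = np.abs(np.fft.fft2(img))
# Peaks away from the zero-frequency corner of `spectrum` indicate periods.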
This page from SciPy covers the basics of how the Discrete Fourier Transform works:
http://docs.scipy.org/doc/numpy-1.10.0/reference/routines.fft.html
They also provide an API for using the DFT. For your case, you should look at how to use fft2.

Spline representation with scipy.interpolate: Poor interpolation for low-amplitude, rapidly oscillating functions

I need to (numerically) calculate the first and second derivative of a function, for which I've attempted to use both splrep and UnivariateSpline to create splines that interpolate the function so I can take the derivatives.
However, it seems that there's an inherent problem in the spline representation itself for functions whose magnitude is of order 10^-1 or lower and which oscillate (rapidly).
As an example, consider the following code to create a spline representation of the sine function over the interval (0,6*pi) (so the function oscillates three times only):
import scipy
from scipy import interpolate
import numpy
from numpy import linspace
import math
from math import sin, pi
k = linspace(0, 6.*pi, num=10000) #interval (0,6*pi) in 10'000 steps
y=[]
A = 1.e0 # Amplitude of sine function
for i in range(len(k)):
    y.append(A*sin(k[i]))
tck = interpolate.UnivariateSpline(x, y, w=None, bbox=[None, None], k=5, s=2)
M=tck(k)
Below are the results for M for A = 1.e0 and A = 1.e-2
http://i.imgur.com/uEIxq.png Amplitude = 1
http://i.imgur.com/zFfK0.png Amplitude = 1/100
Clearly the interpolated function created by the splines is totally incorrect! The second graph does not even oscillate at the correct frequency.
Does anyone have any insight into this problem? Or know of another way to create splines within numpy/scipy?
Cheers,
Rory
I'm guessing that your problem is due to aliasing.
What is x in your example?
If the x values that you're interpolating at are less closely spaced than your original points, you'll inherently lose frequency information. This is completely independent from any type of interpolation. It's inherent in downsampling.
Never mind the above bit about aliasing. It doesn't apply in this case (though I still have no idea what x is in your example...).
I just realized that you're evaluating your points at the original input points when you're using a non-zero smoothing factor (s).
By definition, smoothing won't fit the data exactly. Try passing s=0 instead.
As a quick example:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
x = np.linspace(0, 6.*np.pi, num=100)  # interval (0, 6*pi) in 100 steps
A = 1.e-4 # Amplitude of sine function
y = A*np.sin(x)
fig, axes = plt.subplots(nrows=2)
for ax, s, title in zip(axes, [2, 0], ['With', 'Without']):
    yinterp = interpolate.UnivariateSpline(x, y, s=s)(x)
    ax.plot(x, yinterp, label='Interpolated')
    ax.plot(x, y, 'bo', label='Original')
    ax.legend()
    ax.set_title(title + ' Smoothing')
plt.show()
The reason that you're only clearly seeing the effects of smoothing with a low amplitude is due to the way the smoothing factor is defined. See the documentation for scipy.interpolate.UnivariateSpline for more details.
Even with a higher amplitude, the interpolated data won't match the original data if you use smoothing.
For example, if we just change the amplitude (A) to 1.0 in the code example above, we'll still see the effects of smoothing...
The problem is in choosing suitable values for the s parameter. Its values depend on the scaling of the data.
Reading the documentation carefully, one can deduce that the parameter should be chosen around s = len(y) * np.var(y), i.e. # of data points * variance. Taking for example s = 0.05 * len(y) * np.var(y) gives a smoothing spline that does not depend on the scaling of the data or the number of data points.
EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.
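A minimal sketch of that scaling-aware choice (the 0.05 factor is illustrative, as above):
import numpy as np
from scipy.interpolate import UnivariateSpline

x = np.linspace(0, 6*np.pi, 10000)
for A in (1.0, 1e-2):
    y = A*np.sin(x)
    s = 0.05 * len(y) * np.var(y)  # scales with the data's variance
    spl = UnivariateSpline(x, y, k=5, s=s)
    print(A, np.abs(spl(x) - y).max())  # error tracks the amplitude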
