With N points of 1-dimensional data X, I would like to evaluate each point at K cubic B-splines. In R there is a simple function with an intuitive API, called bs. There is also a Python package, patsy, which replicates this, but I can't use that package -- only scipy and the like.
Having looked through the scipy.interpolate documentation on spline-related functions, the closest I can find is BSpline, or BSpline.basis_element, but how to get just the K basis functions is totally mysterious to me. I tried the following:
import numpy as np
import scipy.interpolate as intrp
import matplotlib.pyplot as plt
import patsy # for comparison
# in Patsy/R: nice and sensible
x = np.linspace(0., 1., 100)
y = patsy.bs(x, knots=np.linspace(0,1,4), degree=3)
plt.subplot(1,2,1)
plt.plot(x,y)
plt.title('B-spline basis')
# in scipy: ?????
y_py = np.zeros((x.shape[0], 6))
for i in range(6):
    y_py[:, i] = intrp.BSpline(np.linspace(0, 1, 10), (np.arange(6) == i).astype(float), 3, extrapolate=False)(x)
plt.subplot(1,2,2)
plt.plot(x,y_py)
plt.title('Something else')
It doesn't work, and it makes me realise I don't actually know what this function is doing. First of all, it will not accept fewer than 8 interior knots, and I don't understand why. Secondly, it only thinks the splines are defined within roughly the (1/3, 2/3) range, which maybe means it is ignoring the first 3 and last 3 knot values for some reason? Do I need to pad the knots?
Any help would be appreciated!
EDIT: I have solved this discrepancy; indeed it seems that BSpline ignores the first 3 and last 3 values of knots. I'm still interested in knowing why there is this discrepancy, so that I feel less bad about the odd hour spent debugging a strange interface.
For posterity, here is the code that does produce the basis functions
import numpy as np
import scipy.interpolate as intrp
import matplotlib.pyplot as plt
import patsy # for comparison
these_knots = np.linspace(0,1,5)
# in Patsy/R: nice and sensible
x = np.linspace(0., 1., 100)
y = patsy.bs(x, knots=these_knots, degree=3)
plt.subplot(1,2,1)
plt.plot(x,y)
plt.title('B-spline basis')
# in scipy: ?????
numpyknots = np.concatenate(([0,0,0],these_knots,[1,1,1])) # because??
y_py = np.zeros((x.shape[0], len(these_knots)+2))
for i in range(len(these_knots)+2):
    y_py[:, i] = intrp.BSpline(numpyknots, (np.arange(len(these_knots)+2) == i).astype(float), 3, extrapolate=False)(x)
plt.subplot(1,2,2)
plt.plot(x,y_py)
plt.title('In SciPy')
Looks like you already found the answer, but to clarify why you need to define multiple knots at the edges, you can read the scipy docs. The basis functions are defined by the Cox-de Boor recursion formula. This formula starts by defining neighbouring support domains between the given knot points, each with a constant value of 1 (zeroth order). These are convolved to obtain the higher-order basis functions: two domains make one first-order basis function, three domains make one second-order basis function, and four domains (= 5 knot points) make one third-order basis function that is supported within the range of those 5 knot points. If you want n basis functions of degree k = 3, you will need to have (n + k + 1) knot points.
The minimum of 8 knots comes from requiring n >= k + 1, which gives 2 * (k + 1) knots. The base interval t[k] ... t[n] in scipy is the only range where you can define full-degree basis functions. To make sure that this base interval reaches the outer knot points, the two end knots are usually given a multiplicity of (k + 1). scipy probably only showed this base interval in your 'Something else' result.
Note that you can also get the basis functions using
y_py[:,i] = intrp.BSpline.basis_element(numpyknots[i:i+5], extrapolate=False)(x)
this also removes the difference at x = 1.
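As an aside: if your SciPy is recent enough (BSpline.design_matrix was added around version 1.8, if I recall correctly), you can get all the basis functions in one call instead of looping. A minimal sketch under that assumption:
import numpy as np
import scipy.interpolate as intrp
these_knots = np.linspace(0, 1, 5)
numpyknots = np.concatenate(([0, 0, 0], these_knots, [1, 1, 1]))
x = np.linspace(0., 1., 100)
# rows = points, columns = the len(numpyknots) - 3 - 1 = 7 basis functions
y_py = intrp.BSpline.design_matrix(x, numpyknots, 3).toarray()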
I am trying to interpolate a cumulated distribution of e.g. i) number of people to ii) number of owned cars, showing that e.g. the top 20% of people own much more than 20% of all cars -- of course 100% of people own 100% of cars. I also know that there are e.g. 100mn people and 200mn cars.
Now coming to my code:
# import libraries (more than required here)
import pandas as pd
from scipy.interpolate import interp1d
from sympy import symbols, solve, Eq
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px

curve = pd.read_excel('inputs.xlsx', sheet_name='inputdata')
Input data: Curveplot (cumulated people (x) on the left // cumulated cars (y) on the right)
#Input data in list form (I am not sure how to interpolate from a list for the moment)
cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]
x, y = cumulatedpeople, cumulatedcars
interpolation = interp1d(x, y, kind = 'cubic')
number_of_people_mn= 100000000
oneperson = 1 / number_of_people_mn
dataset = pd.DataFrame(range(number_of_people_mn + 1))
dataset.columns = ["nr_of_one_person"]
dataset.drop(dataset.index[:1], inplace=True)
#calculating the position of every single person on the cumulated x-axis (between 0 and 1)
dataset["cumulatedpeople"] = dataset["nr_of_one_person"] / number_of_people_mn
#finding the "cumulatedcars" to the "cumulatedpeople" via interpolation (between 0 and 1)
dataset["cumulatedcars"] = interpolation(dataset["cumulatedpeople"])
plt.plot(dataset["cumulatedpeople"], dataset["cumulatedcars"])
plt.legend(['Cubic interpolation'], loc = 'best')
plt.xlabel('Cumulated people')
plt.ylabel('Cumulated cars')
plt.title("People-to-car cumulated curve")
plt.show()
However, when looking at the actual plot, I get the following result, which is wrong: Cubic interpolation
In fact, the curve should look almost like the one from a linear interpolation with the exact same input data -- however, linear interpolation is not accurate enough for my purpose: Linear interpolation
Is there any relevant step I am missing, or what would be the best way to get an accurate interpolation from the inputs that almost looks like the linear one?
Short answer: your code is doing the right thing, but the data is unsuitable for cubic interpolation.
Let me explain. Here is your code that I simplified for clarity
import numpy as np
from scipy.interpolate import interp1d
from matplotlib import pyplot as plt
cumulatedpeople = [0, 0.453086, 0.772334, 0.950475, 0.978981, 0.999876, 0.999990, 1]
cumulatedcars= [0, 0.016356, 0.126713, 0.410482, 0.554976, 0.950073, 0.984913, 1]
interpolation = interp1d(cumulatedpeople, cumulatedcars, kind = 'cubic')
number_of_people_mn= 100#000000
cumppl = np.arange(number_of_people_mn + 1)/number_of_people_mn
cumcars = interpolation(cumppl)
plt.plot(cumppl, cumcars)
plt.plot(cumulatedpeople, cumulatedcars,'o')
plt.show()
Note the last couple of lines -- I am plotting, on the same graph, both the interpolated results and the input data. Here is the result:
Orange dots are the original data, the blue line is the cubic interpolation. The interpolator passes through all the points, so technically it is doing the right thing.
Clearly, though, it is not doing what you would want.
The reason for such strange behavior is mostly at the right end where you have a few x-points that are very close together -- the interpolator produces massive wiggles trying to fit very closely spaced points.
If I remove two right-most points from the interpolator:
interpolation = interp1d(cumulatedpeople[:-2], cumulatedcars[:-2], kind = 'cubic')
it looks a bit more reasonable:
But one could still argue that linear interpolation is better. The wiggles at the left end appear now because the gaps between the initial x-points there are too large.
The moral here is that cubic interpolation should really only be used when the gaps between x-points are roughly uniform.
Your best bet here, I think, is to use something like curve_fit
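For instance, a minimal curve_fit sketch, assuming a simple one-parameter model y = x**p and reusing the lists and plt from above -- the model form is my own illustrative pick, and a real Lorenz-type curve would likely need something richer:
import numpy as np
from scipy.optimize import curve_fit
# hypothetical model: y = x**p passes through (0,0) and (1,1) by construction
def model(x, p):
    return np.asarray(x, dtype=float) ** p
popt, _ = curve_fit(model, cumulatedpeople, cumulatedcars, p0=[3.0])
xx = np.linspace(0, 1, 200)
plt.plot(xx, model(xx, *popt), label='fit y = x**p')
plt.plot(cumulatedpeople, cumulatedcars, 'o', label='data')
plt.legend()
plt.show()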
a related discussion can be found here
specifically, monotone interpolation as explained here yields good results on your data. Copying the relevant bits: you would replace the interpolator with
from scipy.interpolate import pchip
interpolation = pchip(cumulatedpeople, cumulatedcars)
and get a decent-looking fit:
The following problem has been researched, primarily with matplotlib in Python.
"Basic" functions are possible, such as y = x^2, but what if I want to plot an equation, which isn't necessarily a function due to multiple x-y associations, e.g.:
x^2 + y^2 = 1 (just a basic circle with a radius of 1 around the point (0, 0) in a two-dimensional coordinate system).
Is there any way to plot such an equation with matplotlib or any similar library?
The idea of rewriting the equation into a drawable function has come to mind, but due to the two branches it just looks harder than the original equation; e.g. the circle above becomes |y| = sqrt(1 - x^2), i.e. the two branches y = +sqrt(1 - x^2) and y = -sqrt(1 - x^2).
//EDIT: On request from @mkrieger1, an edit of this question.
The aim of my software is to take an input (given by another function; a string representing any equation, e.g. "y^3-sqrt(sin(x^2)-2)*2 = 3x") and turn it into a plot. I personally failed with the approach of solving the equations for y (as mentioned previously), especially with more complex functions. Splitting these equations into "smaller pieces" is, given the broad variety of mathematical inputs, pretty hard as well, so I thought a contour-solving approach would be the best route (as @mkrieger1 suggested).
Once again, this approach is tricky due to the editing of the equation needed before passing it to plt.contour(X, Y, func, [0]), as well as a UserWarning later on.
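For reference, the bare-bones contour approach I mean looks like this for the circle example (just a minimal sketch of my own):
import numpy as np
import matplotlib.pyplot as plt
# plot x^2 + y^2 = 1 as the zero level set of F(x, y) = x^2 + y^2 - 1
xs = np.linspace(-1.5, 1.5, 400)
ys = np.linspace(-1.5, 1.5, 400)
X, Y = np.meshgrid(xs, ys)
plt.contour(X, Y, X**2 + Y**2 - 1, [0])
plt.gca().set_aspect('equal')
plt.show()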
You can also use sympy to convert an expression in a string to an equation and then plot it. I left out the -2 of the example, as it would lead to a quite empty plot. Sympy's parser supports transformations that allow the multiplication sign to be left out (as in 3x) and that convert Python's xor operator (^) to a power.
from sympy import plot_implicit, Eq
from sympy.parsing.sympy_parser import parse_expr
from sympy.parsing.sympy_parser import standard_transformations, convert_xor, implicit_multiplication
string = "y^3-sqrt(sin(x^2))*2 = 3x"
transformations = (standard_transformations + (implicit_multiplication,) + (convert_xor,))
lhs = parse_expr(string.split('=')[0], transformations=transformations)
rhs = parse_expr(string.split('=')[1], transformations=transformations)
plot_implicit(Eq(lhs, rhs))
Another example:
from sympy import plot_implicit, Eq, cos
from sympy.abc import x, y
plot_implicit(Eq(x/y, cos(y)), (x, -10, 10), (y, -10, 10))
Note that without explicitly setting the range for the variables, plot_implicit assumes default ranges between -5 and 5.
If you use matplotlib at all, you will notice that plot accepts a pair of arrays of equal length, representing sequences of x-y pairs. It has no knowledge of functions, equations, or any of the other concepts you mention.
The assertion that plotting a simple function is supported is therefore largely meaningless, even if true. That being said, a standard approach to handling something that is not a function in Cartesian space, like a circle, is to parametrize it. One possible parametrization for many popular non-functions is polar coordinates.
For example:
import numpy as np

t = np.linspace(0, 2 * np.pi, 100)  # the parameter
x = np.cos(t)
y = np.sin(t)
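Feeding those two arrays to plot then draws the circle; a quick sketch:
import matplotlib.pyplot as plt
plt.plot(x, y)
plt.gca().set_aspect('equal')  # so the circle is not squashed into an ellipse
plt.show()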
I have a discrete set of points (x_n, y_n) that I would like to approximate/represent as a linear combination of B-spline basis functions. I need to be able to manually change the number of B-spline basis functions used by the method, and I am trying to implement this in python using scipy. To be specific, below is a bit of code that I am using:
import scipy.interpolate

spl = scipy.interpolate.splrep(x, y)
However, unless I have misunderstood or missed something in the documentation, it seems I cannot change the number of B-spline basis functions that scipy uses. That seems to be set by the size of x and y. So, my specific questions are:
Can I change the number of B-spline basis functions used by scipy in the "splrep" function that I used above?
Once I have performed the transformation shown in the code above, how can I access the coefficients of the linear combination? Am I correct in thinking that these coefficients are stored in the vector spl[1]?
Is there a better method/toolbox that I should be using?
Thanks in advance for any help/guidance you can provide.
Yes, spl[1] contains the coefficients, and spl[0] contains the knot vector.
However, if you want better control, you can manipulate BSpline objects directly and construct them with make_interp_spline or make_lsq_spline, which accept the knot vector, and that knot vector determines the B-spline basis functions to use.
You can change the number of B-spline basis functions by supplying a knot vector via the t parameter. Since number of knots = number of coefficients + degree + 1, the number of knots also defines the number of coefficients (== the number of basis functions).
The usage of the t parameter is not so intuitive, since the knots you give should be only the inner knots. So, for example, if you want 7 coefficients for a cubic spline, you need to give 3 inner knots. Internally, the function pads (degree+1) knots at each end using xb and xe (clamped end conditions; see for example here).
Furthermore, as the documentation says, the knots should satisfy the Schoenberg-Whitney conditions.
Here is an example code that does this:
import numpy as np
import scipy.interpolate

# Input:
x = np.linspace(0, 2*np.pi, 9)
y = np.sin(x)

# Your code:
spl = scipy.interpolate.splrep(x, y)
t, c, k = spl  # knots, coefficients, degree (== 3 for cubic)

# Computing the inner knots and using them:
t3 = np.linspace(x[0], x[-1], 5)  # five equally spaced knots in the interval
t3 = t3[1:-1]                     # take only the three inner values
spl3 = scipy.interpolate.splrep(x, y, t=t3)
Regarding your second question, you're right that the coefficients are indeed stored in spl[1]. However, note that (as the documentation says) the last (degree+1) values are zero-padded and should be ignored.
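A quick way to see that bookkeeping, reusing t, c, k unpacked from spl above:
print(len(t), len(c))  # same length; FITPACK zero-pads the coefficient array
print(c[-(k+1):])      # the trailing (degree+1) entries are the padding
print(len(t) - k - 1)  # the number of actual basis functions / coefficients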
In order to evaluate the resulting B-spline you can use the function splev or the class BSpline.
Below is some example code that evaluates and draws the above splines (resulting in the following figure):
import matplotlib.pyplot as plt

xx = np.linspace(x[0], x[-1], 101)       # sample points
yy = scipy.interpolate.splev(xx, spl)    # evaluate original spline
yy3 = scipy.interpolate.splev(xx, spl3)  # evaluate new spline

plt.plot(x, y, 'b.', label='data')       # plot original interpolation points
plt.plot(xx, yy, 'r-', label='spl')
plt.plot(xx, yy3, 'g-', label='spl3')
plt.legend()
plt.show()
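Equivalently, with the BSpline class (same (t, c, k) tuple, just wrapped in a callable object):
from scipy.interpolate import BSpline
bspl3 = BSpline(*spl3)            # build once from the (t, c, k) tuple
yy3_alt = bspl3(xx)               # evaluate by calling it
print(np.allclose(yy3, yy3_alt))  # True on the base interval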
I need to interpolate temperature data linearly in 4 dimensions (latitude, longitude, altitude and time).
The number of points is fairly high (360x720x50x8) and I need a fast method of computing the temperature at any point in space and time within the data bounds.
I have tried using scipy.interpolate.LinearNDInterpolator but using Qhull for triangulation is inefficient on a rectangular grid and takes hours to complete.
By reading this SciPy ticket, the solution seemed to be implementing a new nd interpolator, using the standard interp1d to calculate a higher number of data points, and then using a "nearest neighbor" approach with the new dataset.
This, however, takes a long time again (minutes).
Is there a quick way of interpolating data on a rectangular grid in 4 dimensions without it taking minutes to accomplish?
I thought of using interp1d 4 times without calculating a higher density of points, but leaving it for the user to call with the coordinates, but I can't get my head around how to do this.
Otherwise would writing my own 4D interpolator specific to my needs be an option here?
Here's the code I've been using to test this:
Using scipy.interpolate.LinearNDInterpolator:
import numpy as np
from scipy.interpolate import LinearNDInterpolator
lats = np.arange(-90,90.5,0.5)
lons = np.arange(-180,180,0.5)
alts = np.arange(1,1000,21.717)
time = np.arange(8)
data = np.random.rand(len(lats)*len(lons)*len(alts)*len(time)).reshape((len(lats),len(lons),len(alts),len(time)))
coords = np.zeros((len(lats),len(lons),len(alts),len(time),4))
coords[...,0] = lats.reshape((len(lats),1,1,1))
coords[...,1] = lons.reshape((1,len(lons),1,1))
coords[...,2] = alts.reshape((1,1,len(alts),1))
coords[...,3] = time.reshape((1,1,1,len(time)))
coords = coords.reshape((data.size,4))
interpolatedData = LinearNDInterpolator(coords,data)
Using scipy.interpolate.interp1d:
import numpy as np
from scipy.interpolate import interp1d
lats = np.arange(-90,90.5,0.5)
lons = np.arange(-180,180,0.5)
alts = np.arange(1,1000,21.717)
time = np.arange(8)
data = np.random.rand(len(lats)*len(lons)*len(alts)*len(time)).reshape((len(lats),len(lons),len(alts),len(time)))
interpolatedData = np.array([None, None, None, None])
interpolatedData[0] = interp1d(lats,data,axis=0)
interpolatedData[1] = interp1d(lons,data,axis=1)
interpolatedData[2] = interp1d(alts,data,axis=2)
interpolatedData[3] = interp1d(time,data,axis=3)
Thank you very much for your help!
In the same ticket you have linked, there is an example implementation of what they call tensor product interpolation, showing the proper way to nest recursive calls to interp1d. This is equivalent to quadrilinear interpolation if you choose the default kind='linear' for your interp1d calls.
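A minimal sketch of that recursion in 2D (my own illustrative reduction; extending it to your 4D case follows the same pattern, peeling off one axis per call):
import numpy as np
from scipy.interpolate import interp1d
def tensor_interp(grids, values, point):
    # interpolate out the last axis at point[-1], then recurse on the rest;
    # kind='linear' (the default) makes this multilinear
    if len(grids) == 1:
        return float(interp1d(grids[0], values, axis=0)(point[0]))
    reduced = interp1d(grids[-1], values, axis=-1)(point[-1])
    return tensor_interp(grids[:-1], reduced, point[:-1])
# sanity check: values encode f(x, y) = 2*x + y at the corners of the unit
# square, so the multilinear interpolant at (0.5, 0.5) should give 1.5
xg = np.array([0.0, 1.0])
yg = np.array([0.0, 1.0])
v = np.array([[0.0, 1.0], [2.0, 3.0]])
print(tensor_interp([xg, yg], v, [0.5, 0.5]))  # -> 1.5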
While this may be good enough, it is not piecewise linear interpolation: there will be higher-order cross terms in the interpolating function, as this image from the Wikipedia entry on bilinear interpolation shows:
This may very well be good enough for what you are after, but there are applications where a triangulated, truly piecewise linear, interpolation is preferred. If you really need that, there is an easy way of working around the slowness of qhull.
Once LinearNDInterpolator has been setup, there are two steps to coming up with an interpolated value for a given point:
figure out inside which triangle (4D hypertetrahedron in your case) the point is, and
interpolate using the barycentric coordinates of the point relative to the vertices as weights.
You probably do not want to mess with barycentric coordinates, so better leave that to LinearNDInterpolator. But you do know some things about the triangulation. Mostly that, because you have a regular grid, the triangulation inside each hypercube is going to be the same. So to interpolate a single value, you could first determine which subcube your point is in, build a LinearNDInterpolator with the 16 vertices of that cube, and use it to interpolate your value:
from itertools import product
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def interpolator(coords, data, point):
    dims = len(point)
    indices = []
    sub_coords = []
    for j in range(dims):
        # locate the grid interval along axis j that contains point[j]
        idx = np.digitize([point[j]], coords[j])[0]
        indices += [[idx - 1, idx]]
        sub_coords += [coords[j][indices[-1]]]
    # all 2**dims corners of the enclosing hypercube
    indices = np.array([j for j in product(*indices)])
    sub_coords = np.array([j for j in product(*sub_coords)])
    sub_data = data[tuple(np.swapaxes(indices, 0, 1))]
    li = LinearNDInterpolator(sub_coords, sub_data)
    return li([point])[0]
>>> point = np.array([12.3,-4.2, 500.5, 2.5])
>>> interpolator((lats, lons, alts, time), data, point)
0.386082399091
This cannot work on vectorized data, since that would require storing a LinearNDInterpolator for every possible subcube, and even though it probably would be faster than triangulating the whole thing, it would still be very slow.
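For what it's worth: if your SciPy is new enough, scipy.interpolate.RegularGridInterpolator (added around version 0.14, if I remember right) does fast multilinear interpolation on a rectangular grid -- the tensor-product kind discussed above, not the triangulated kind -- vectorized and without any triangulation. A sketch reusing lats, lons, alts, time, data and point from above:
from scipy.interpolate import RegularGridInterpolator
rgi = RegularGridInterpolator((lats, lons, alts, time), data)  # method='linear' is the default
print(rgi(point))  # accepts a single point or a whole array of query points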
scipy.ndimage.map_coordinates is a nice fast interpolator for uniform grids (all boxes the same size). See multivariate-spline-interpolation-in-python-scipy on SO for a clear description.
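A bare-bones sketch of calling it directly; note that map_coordinates takes fractional array indices, not physical coordinates (that scaling is what Intergrid automates):
import numpy as np
from scipy.ndimage import map_coordinates
data = np.random.rand(361, 720, 47, 8)
# two hypothetical query points, as fractional indices, shape (ndim, npoints)
query = np.array([[123.4, 10.0], [56.7, 300.2], [10.2, 45.9], [3.5, 0.5]])
print(map_coordinates(data, query, order=1))  # order=1 -> multilinear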
For non-uniform rectangular grids, a simple wrapper, Intergrid, maps / scales non-uniform to uniform grids, then does map_coordinates.
On a 4d test case like yours it takes about 1 μsec per query:
Intergrid: 1000000 points in a (361, 720, 47, 8) grid took 652 msec
For very similar things I use Scientific.Functions.Interpolation.InterpolatingFunction.
import numpy as np
from Scientific.Functions.Interpolation import InterpolatingFunction
lats = np.arange(-90,90.5,0.5)
lons = np.arange(-180,180,0.5)
alts = np.arange(1,1000,21.717)
time = np.arange(8)
data = np.random.rand(len(lats)*len(lons)*len(alts)*len(time)).reshape((len(lats),len(lons),len(alts),len(time)))
axes = (lats, lons, alts, time)
f = InterpolatingFunction(axes, data)
You can now leave it to the user to call the InterpolatingFunction with coordinates:
>>> f(0,0,10,3)
0.7085675631375401
InterpolatingFunction has nice additional features, such as integration and slicing.
However, I do not know for sure whether the interpolation is linear. You would have to look in the module source to find out.
I cannot open this address, and I cannot find enough information about this package.
I need to (numerically) calculate the first and second derivative of a function, for which I've attempted to use both splrep and UnivariateSpline to create splines for the purpose of interpolating the function to take the derivatives.
However, it seems that there's an inherent problem in the spline representation itself for functions whose magnitude is of order 10^-1 or lower and which are (rapidly) oscillating.
As an example, consider the following code to create a spline representation of the sine function over the interval (0,6*pi) (so the function oscillates three times only):
import scipy
from scipy import interpolate
import numpy
from numpy import linspace
import math
from math import sin, pi

k = linspace(0, 6.*pi, num=10000)  # interval (0,6*pi) in 10'000 steps
y = []
A = 1.e0  # Amplitude of sine function
for i in range(len(k)):
    y.append(A*sin(k[i]))

tck = interpolate.UnivariateSpline(x, y, w=None, bbox=[None, None], k=5, s=2)
M = tck(k)
Below are the results for M for A = 1.e0 and A = 1.e-2
http://i.imgur.com/uEIxq.png Amplitude = 1
http://i.imgur.com/zFfK0.png Amplitude = 1/100
Clearly the interpolated function created by the splines is totally incorrect! The 2nd graph does not even oscillate at the correct frequency.
Does anyone have any insight into this problem? Or know of another way to create splines within numpy/scipy?
Cheers,
Rory
I'm guessing that your problem is due to aliasing.
What is x in your example?
If the x values that you're interpolating at are less closely spaced than your original points, you'll inherently lose frequency information. This is completely independent of any type of interpolation; it's inherent in downsampling.
Never mind the above bit about aliasing. It doesn't apply in this case (though I still have no idea what x is in your example...).
I just realized that you're evaluating your points at the original input points while using a non-zero smoothing factor (s).
By definition, smoothing won't fit the data exactly. Try putting in s=0 instead.
As a quick example:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
x = np.linspace(0, 6.*np.pi, num=100)  # interval (0, 6*pi) in 100 steps
A = 1.e-4 # Amplitude of sine function
y = A*np.sin(x)
fig, axes = plt.subplots(nrows=2)
for ax, s, title in zip(axes, [2, 0], ['With', 'Without']):
    yinterp = interpolate.UnivariateSpline(x, y, s=s)(x)
    ax.plot(x, yinterp, label='Interpolated')
    ax.plot(x, y, 'bo', label='Original')
    ax.legend()
    ax.set_title(title + ' Smoothing')
plt.show()
The reason that you're only clearly seeing the effects of smoothing with a low amplitude is due to the way the smoothing factor is defined. See the documentation for scipy.interpolate.UnivariateSpline for more details.
Even with a higher amplitude, the interpolated data won't match the original data if you use smoothing.
For example, if we just change the amplitude (A) to 1.0 in the code example above, we'll still see the effects of smoothing...
The problem is in choosing suitable values for the s parameter; its appropriate values depend on the scaling of the data.
Reading the documentation carefully, one can deduce that s should be chosen around s = len(y) * np.var(y), i.e. the number of data points times the variance. Taking, for example, s = 0.05 * len(y) * np.var(y) gives a smoothing spline that does not depend on the scaling of the data or the number of data points.
EDIT: sensible values for s of course also depend on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2, where std is the standard deviation associated with the "noise" you want to smooth over.
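A quick sketch of that rule of thumb on the sine example from above (the 0.05 factor is just an illustrative choice, not anything canonical):
import numpy as np
from scipy.interpolate import UnivariateSpline
x = np.linspace(0, 6*np.pi, 100)
for A in (1.0, 1e-2, 1e-4):
    y = A*np.sin(x)
    s = 0.05 * len(y) * np.var(y)  # scales with the data, unlike a fixed s=2
    spl = UnivariateSpline(x, y, s=s)
    print(A, np.max(np.abs(spl(x) - y))/A)  # relative deviation is roughly scale-free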