Iteratively fitting polynomial curve - python

I want to iteratively fit a curve to data in Python with the following approach:
1. Fit a polynomial curve (or any non-linear approach).
2. Discard values more than 2 standard deviations from the curve.
3. Repeat steps 1 and 2 until all values are within the confidence interval of the curve.
I can fit a polynomial curve as follows:
import numpy as np

vals = np.array([0.00441025, 0.0049001 , 0.01041189, 0.47368389, 0.34841961,
0.3487533 , 0.35067096, 0.31142986, 0.3268407 , 0.38099566,
0.3933048 , 0.3479948 , 0.02359819, 0.36329588, 0.42535543,
0.01308297, 0.53873956, 0.6511364 , 0.61865282, 0.64750302,
0.6630047 , 0.66744816, 0.71759617, 0.05965622, 0.71335208,
0.71992683, 0.61635697, 0.12985441, 0.73410642, 0.77318621,
0.75675988, 0.03003641, 0.77527201, 0.78673995, 0.05049178,
0.55139476, 0.02665514, 0.61664748, 0.81121749, 0.05521697,
0.63404375, 0.32649395, 0.36828268, 0.68981099, 0.02874863,
0.61574739])
x_values = np.linspace(0, 1, len(vals))
poly_degree = 3
coeffs = np.polyfit(x_values, vals, poly_degree)
poly_eqn = np.poly1d(coeffs)
y_hat = poly_eqn(x_values)
How do I do steps 2 and 3?

Since you want to eliminate points that are too far from an expected solution, you are probably looking for RANSAC (RANdom SAmple Consensus), which fits a curve (or any other function) to data within certain bounds, like your case with 2*STD.
You can use the scikit-learn RANSAC estimator, which is well aligned with included regressors such as LinearRegression. For your polynomial case you need to define your own regression class:
import numpy as np
from sklearn.metrics import mean_squared_error

class PolynomialRegression(object):
    def __init__(self, degree=3, coeffs=None):
        self.degree = degree
        self.coeffs = coeffs

    def fit(self, X, y):
        self.coeffs = np.polyfit(X.ravel(), y, self.degree)

    def get_params(self, deep=False):
        return {'coeffs': self.coeffs}

    def set_params(self, coeffs=None, random_state=None):
        self.coeffs = coeffs

    def predict(self, X):
        poly_eqn = np.poly1d(self.coeffs)
        y_hat = poly_eqn(X.ravel())
        return y_hat

    def score(self, X, y):
        return mean_squared_error(y, self.predict(X))
and then you can use RANSAC:

from sklearn.linear_model import RANSACRegressor

# x_vals / y_vals are the x_values / vals arrays from the question
ransac = RANSACRegressor(PolynomialRegression(degree=poly_degree),
                         residual_threshold=2 * np.std(y_vals),
                         random_state=0)
ransac.fit(np.expand_dims(x_vals, axis=1), y_vals)
inlier_mask = ransac.inlier_mask_
Note that the X variable is transformed to a 2D array because the sklearn RANSAC implementation requires it, and it is flattened back inside our custom class because numpy's polyfit works with 1D arrays.
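As a quick illustration of that shape round-trip (a minimal sketch, not part of the fitting code itself):

X_2d = np.expand_dims(x_vals, axis=1)  # shape (n,) -> (n, 1), what RANSACRegressor expects
x_1d = X_2d.ravel()                    # back to shape (n,), what np.polyfit expects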
y_hat = ransac.predict(np.expand_dims(x_vals, axis=1))
plt.plot(x_vals, y_vals, 'bx', label='input samples')
plt.plot(x_vals[inlier_mask], y_vals[inlier_mask], 'go', label='inliers (2*STD)')
plt.plot(x_vals, y_hat, 'r-', label='estimated curve')
Moreover, playing with the polynomial order and the residual threshold, I got the following results with degree=4 and a 1*STD range:
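For reference, a sketch of that variant with the changed parameters (same setup as above, just degree=4 and a 1*STD threshold):

ransac = RANSACRegressor(PolynomialRegression(degree=4),
                         residual_threshold=np.std(y_vals),
                         random_state=0)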
Another option is to use a higher-order regressor such as a Gaussian process:

from sklearn.gaussian_process import GaussianProcessRegressor

ransac = RANSACRegressor(GaussianProcessRegressor(),
                         residual_threshold=np.std(y_vals))
As for generalizing to a DataFrame, you just need to use all columns except one as features and the remaining one as the output, like here:
import pandas as pd
df = pd.DataFrame(np.array([x_vals, y_vals]).T)
ransac.fit(df[df.columns[:-1]], df[df.columns[-1]])
y_hat = ransac.predict(df[df.columns[:-1]])

It doesn't look like you'll get anything worthwhile by following that procedure; there are much better techniques for handling unexpected data. Googling for "outlier detection" would be a good start.
With that said, here's how to answer your question.
Start by pulling in libraries and getting some data:
import matplotlib.pyplot as plt
import numpy as np
Y = np.array([
0.00441025, 0.0049001 , 0.01041189, 0.47368389, 0.34841961,
0.3487533 , 0.35067096, 0.31142986, 0.3268407 , 0.38099566,
0.3933048 , 0.3479948 , 0.02359819, 0.36329588, 0.42535543,
0.01308297, 0.53873956, 0.6511364 , 0.61865282, 0.64750302,
0.6630047 , 0.66744816, 0.71759617, 0.05965622, 0.71335208,
0.71992683, 0.61635697, 0.12985441, 0.73410642, 0.77318621,
0.75675988, 0.03003641, 0.77527201, 0.78673995, 0.05049178,
0.55139476, 0.02665514, 0.61664748, 0.81121749, 0.05521697,
0.63404375, 0.32649395, 0.36828268, 0.68981099, 0.02874863,
0.61574739])
X = np.linspace(0, 1, len(Y))
Next, do an initial plot of the data:
plt.plot(X, Y, '.')
This lets you see what we're dealing with and whether a polynomial could ever be a good fit; the short answer is that this method isn't going to get very far with this sort of data.
At this point we should stop, but to answer the question I'll go on, mostly following your polynomial fitting code:
poly_degree = 5
sd_cutoff = 1 # 2 keeps everything
coeffs = np.polyfit(X, Y, poly_degree)
poly_eqn = np.poly1d(coeffs)
Y_hat = poly_eqn(X)
delta = Y - Y_hat
sd_p = np.std(delta)
ok = abs(delta) < sd_p * sd_cutoff
Hopefully this makes sense: I use a higher-degree polynomial and only cut off at 1 SD because otherwise nothing will be thrown away. The ok array contains True for those points that are within sd_cutoff standard deviations of the curve.
To check this, I'd then do another plot, something like:
plt.scatter(X, Y, color=np.where(ok, 'k', 'r'))
plt.fill_between(
    X,
    Y_hat - sd_p * sd_cutoff,
    Y_hat + sd_p * sd_cutoff,
    color='#00000020')
plt.plot(X, Y_hat)
which gives me:
The black dots are the points to keep (i.e. X[ok] gives these back, and np.where(ok) gives you the indices).
You can play around with the parameters, but you probably want a distribution with fatter tails (e.g. a Student's t-distribution). As I said above, though, searching for outlier detection techniques would be my suggestion.
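If you do want the iteration from steps 2 and 3 of the question, a minimal sketch wrapping the code above in a loop could look like this (the X_keep/Y_keep names are just for this sketch, and it assumes enough points survive each pass to refit the polynomial):

X_keep, Y_keep = X, Y
while True:
    coeffs = np.polyfit(X_keep, Y_keep, poly_degree)
    Y_fit = np.poly1d(coeffs)(X_keep)
    delta = Y_keep - Y_fit
    ok = np.abs(delta) < np.std(delta) * sd_cutoff
    # stop when nothing gets discarded, or when too few points would remain to refit
    if ok.all() or ok.sum() <= poly_degree:
        break
    X_keep, Y_keep = X_keep[ok], Y_keep[ok]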

There are three functions needed to solve this. First, a line-fitting function is necessary to fit a line to a set of points:
def fit_line(x_values, vals, poly_degree):
    coeffs = np.polyfit(x_values, vals, poly_degree)
    poly_eqn = np.poly1d(coeffs)
    y_hat = poly_eqn(x_values)
    return poly_eqn, y_hat
We need to know the standard deviation from the points to the line. This function computes that standard deviation:
def compute_sd(x_values, vals, y_hat):
    distances = []
    for x, y, y1 in zip(x_values, vals, y_hat):
        distances.append(abs(y - y1))
    return np.std(distances)
Finally, we need to compare the distance from a point to the line. The point needs to be thrown out if the distance from the point to the line is greater than two times the standard deviation.
def compare_distances(x_values, vals):
    # relies on the globals poly_eqn and sd set by the loop further below
    new_vals, new_x_vals = [], []
    for x, y in zip(x_values, vals):
        y1 = np.polyval(poly_eqn, x)
        distance = abs(y - y1)
        if distance < 2*sd:
            plt.plot((x, x), (y, y1), c='g')
            new_vals.append(y)
            new_x_vals.append(x)
        else:
            plt.plot((x, x), (y, y1), c='r')
            plt.scatter(x, y, c='r')
    return new_vals, new_x_vals
As you can see in the following graphs, this method does not work well for fitting a line to data that has a lot of outliers. All the points end up getting eliminated for being too far from the fitted line.
while len(vals) > 0:
    poly_eqn, y_hat = fit_line(x_values, vals, poly_degree)
    plt.scatter(x_values, vals)
    plt.plot(x_values, y_hat)
    sd = compute_sd(x_values, vals, y_hat)
    new_vals, new_x_vals = compare_distances(x_values, vals)
    plt.show()
    vals, x_values = np.array(new_vals), np.array(new_x_vals)

Related

how to fit a function to data in python

I want to fit a function to the independent (X) and dependent (y) variables:
import numpy as np
y = np.array([1.45952016, 1.36947283, 1.31433227, 1.24076599, 1.20577963,
1.14454815, 1.13068077, 1.09638278, 1.08121406, 1.04417094,
1.02251471, 1.01268524, 0.98535659, 0.97400591])
X = np.array([4.571428571362048, 8.771428571548313, 12.404761904850602, 17.904761904850602,
22.904761904850602, 31.238095237873495, 37.95833333302289,
44.67857142863795, 51.39880952378735, 64.83928571408615,
71.5595238097012, 85., 98.55357142863795, 112.1071428572759])
I already tried the scipy package in this way:
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return 1/(a*(x**2) + b*(x**1) + c)

g = [1, 1, 1]
c, cov = curve_fit(func, X.flatten(), y.flatten(), g)
test_ar = np.arange(min(X), max(X), 0.25)
pred = np.empty(len(test_ar))
for i in range(len(test_ar)):
    pred[i] = func(test_ar[i], c[0], c[1], c[2])
I can add higher polynomial orders to make my func more accurate, but I want to keep it simple. I would very much appreciate it if anyone can give me some help on how to find another function or make my prediction better. The figure also shows the result of the prediction:
The first thing you want to do is to specify how you measure "accuracy", which in your case is not an appropriate term at all.
What you are essentially doing is called linear regression. Suitable metrics in this case are mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE). It is up to you to decide which metric to use and what threshold to set for being "acceptable".
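For example, these metrics can be computed with sklearn.metrics; y_true and y_pred are placeholder names here for observed and predicted values evaluated at the same x points (a quick sketch):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # RMSE is just the square root of MSE
mae = mean_absolute_error(y_true, y_pred)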
The image that you are showing above (where you've fitted the line) looks fine, BUT please expand your X-axis from -100 to 300 and show us the image again; this is a problem with high-degree polynomials.
This is a 101 example of how to use regression in scikit-learn. In your case, if you want to use x^2 or x^3 for predicting y, you just need to add them to the data. Currently your X variable is an array (a vector); you need to expand it into a matrix where each column is a feature (x, x^2, x^3, ...).
Here is some code:
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
y = [1.45952016, 1.36947283, 1.31433227, 1.24076599,
1.20577963, 1.14454815, 1.13068077, 1.09638278,
1.08121406, 1.04417094, 1.02251471, 1.01268524, 0.98535659,
0.97400591]
x = [4.571428571362048, 8.771428571548313, 12.404761904850602,
17.904761904850602, 22.904761904850602, 31.238095237873495,
37.95833333302289, 44.67857142863795, 51.39880952378735,
64.83928571408615, 71.5595238097012, 85., 98.55357142863795, 112.1071428572759]
df = pd.DataFrame({
    'x': x,
    'x^2': [i**2 for i in x],
    'x^3': [i**3 for i in x],
    'y': y
})
X = df[['x','x^2','x^3']]
y = df['y']
model = linear_model.LinearRegression()
model.fit(X, y)
y1 = model.predict(X)
coef = model.coef_
intercept = model.intercept_
You can see the coefficients in the coef variable:
array([-1.67456732e-02, 2.03899728e-04, -8.70976426e-07])
You can see the intercept in the intercept variable:
1.5042389677980577
which in your case means: y1 = -1.67e-2*x + 2.03e-4*x^2 - 8.70e-7*x^3 + 1.5
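To see the extrapolation issue mentioned above, here is a small sketch that evaluates the fitted cubic on a much wider grid than the data (roughly -100 to 300); x_wide is just an illustrative name:

import numpy as np

x_wide = np.linspace(-100, 300, 400)
X_wide = pd.DataFrame({'x': x_wide, 'x^2': x_wide**2, 'x^3': x_wide**3})
y_wide = model.predict(X_wide)

# the cubic blows up well outside the range of the training data
plt.plot(x_wide, y_wide, label='fitted cubic, extrapolated')
plt.scatter(x, df['y'], color='k', label='data')
plt.legend()
plt.show()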

emcee MCMC sampling not converging to appropriate parameter values

I'm trying to implement emcee MCMC sampling in Python with a predefined likelihood function to find the best boundary between two populations of data.
For emcee see: http://dfm.io/emcee/current/user/line/
The likelihood function calculates the true positive and true negative classifications, given some linear boundary line, and is used to minimise the difference between the two values whilst maximising their sum.
This way, TP and TN rates of 1 will give a likelihood value of 1, while TP and TN rates of 0 will return a likelihood value of 0.
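As a quick worked example of that likelihood formula (illustrative values only, matching the expression used in lnlike below):

tp, tn = 0.9, 0.6
likelihood_val = (0.5 * (tp + tn)) / (1 + abs(tp - tn))  # = 0.75 / 1.3 ≈ 0.577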
But when I attempt to sample the parameter space for m and b, the gradient and offset (or bias), for the boundary line, I get some wildly big and/or small values for the walks.
I have put an example code below which generates some nicely divided populations and then MCMCs around the initial guesses of the parameter values. I'm unsure as to why the MCMC chains don't converge nicely to an appropriate value here so any help would be greatly appreciated.
The following code should run out-of-the-box.
import emcee
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
#generate some test x and y data
folded_xy_train = np.random.uniform(0,1,10000) #test x data
folded_z_train = np.random.uniform(0,1,10000) #test y data
#define the true gradient and offset for the boundary line
m_true, b_true = 5,-2.5
#generate labels for the test data
rounded_labels_train = np.ones(len(folded_z_train))
model = (m_true*folded_xy_train) + b_true
difference = model - folded_z_train
rounded_labels_train[difference<0] = 0
#show the test data
plt.figure()
plt.scatter(folded_xy_train,folded_z_train,c=rounded_labels_train,s=1.0)
#define a likelihood function for the boundary line
def lnlike(theta, x, y, labels):
    m, b = theta
    model = (m*x) + b
    difference = model - y
    classifications = np.ones(len(y))
    classifications[difference<0] = 0
    cfm = confusion_matrix(labels, classifications)
    cm = cfm.astype('float') / cfm.sum(axis=1)[:, np.newaxis]
    tn, fp, fn, tp = cm.ravel()
    likelihood_val = (0.5*(tp+tn))/(1+np.abs(tp-tn))
    ln_like = -np.log(likelihood_val)
    return ln_like
#define a wide flat prior
def lnprior(theta):
    m, b = theta
    if 0 < m < 10 and -20 < b < 5:
        return 0.0
    return -np.inf
#define the posterior
def lnprob(p, x, y, labels):
    lp = lnprior(p)
    if not np.isfinite(lp):
        return 0
    return lp + lnlike(p, x, y, labels)
#setup the MCMC sampling
nwalkers = 4
ndim = 2
p0 = np.array([4.2,-2]) + [np.random.rand(ndim) for i in range(nwalkers)]
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=(folded_xy_train, folded_z_train, rounded_labels_train))
sampler.run_mcmc(p0, 500)
#extract the MCMC paramater value chains
samples = sampler.chain[:, 50:, :].reshape((-1, ndim))
#view the parameter chains
plt.figure()
plt.subplot(211)
plt.plot(samples[:,0])
plt.subplot(212)
plt.plot(samples[:,1])
The initial test data, showing an obvious boundary line for given x y data (coloured by binary class label):
The sample walks, showing strange sampling for the gradient parameter (top) and offset parameter (bottom). The x-axis denotes the MCMC walk step number and the y-axis denotes the MCMC parameter values at a given step:

Convolution of randomly distributed data with gaussians

Let's say I have randomly distributed data which looks like:
I want to replace each data point y[x_i] with a fixed-width Gaussian
and add them together. It should give me:
My code is very primitive and slow:
def gaussian(x, mu, sig):
    return 1/(sig*np.sqrt(2*np.pi))*np.exp(-np.power(x - mu, 2.) / (
        2 * np.power(sig, 2.)))

def gaussian_smoothing(x, y, sig=0.5, n=1000):
    x_new = np.linspace(x.min()-10*sig, x.max()+10*sig, n)
    y_new = np.zeros(x_new.shape)
    for _x, _y in zip(x, y):
        y_new += _y*gaussian(x_new, _x, sig)
    return x_new, y_new
For large data-sets it takes a long time to perform such smoothing.
I was looking at np.convolve. However, it seems that it is only applicable to evenly spaced data, where the x step for the data and the Gaussians is the same. What would be the fastest way to perform such an operation?
You can try to estimate it as a Gaussian mixture with a smaller number of components (via the EM algorithm) using sklearn:
import matplotlib.pyplot as plt
from numpy.random import choice
from sklearn import mixture
import scipy.stats
import numpy

# generate some data
x = numpy.array([1.,1.1,1.6,2.,2.1,2.2,2.9,3.,8.,62.,62.2,63.,63.4,64.5,65.,67.,69.])
# generate weights for it
y = numpy.random.rand(x.shape[0])
# normalize the weights to 1
y /= y.sum()
# resample to 5000 samples with equal weights, according to the original weights
x_rsmp = numpy.array([choice(x, p=y) for _ in range(5000)])
x_rsmp.sort()
x_rsmp = x_rsmp.reshape(-1, 1)
# define the number of components - this must be user selected or estimated
n_comp = 2
# fit the mixture
gmm = mixture.GaussianMixture(n_components=n_comp, covariance_type='full')
gmm.fit(x_rsmp)
# plot it
fig = plt.figure()
ax = fig.add_subplot(111)
x_gauss = numpy.linspace(-10, 100, 1000)
for n_c in range(n_comp):
    # scipy.stats.norm.pdf expects a standard deviation, hence the sqrt of the covariance
    norm_pdf = scipy.stats.norm.pdf(x_gauss, gmm.means_[n_c, 0],
                                    numpy.sqrt(gmm.covariances_[n_c, 0, 0]))
    ax.plot(x_gauss, norm_pdf, label='gauss %d' % (n_c+1))
ax.stem(x, y, 'gray')
plt.legend()
It yields n_comp Gaussian components with means gmm.means_ and covariances gmm.covariances_.
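If you want the combined (weighted) mixture density rather than the individual components, one option is to evaluate it via score_samples, which returns per-sample log-likelihoods (a short sketch):

# evaluate the full mixture density on the same grid
mix_pdf = numpy.exp(gmm.score_samples(x_gauss.reshape(-1, 1)))
ax.plot(x_gauss, mix_pdf, '--', label='mixture density')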

L1 norm instead of L2 norm for cost function in regression model

I was wondering if there's a function in Python that would do the same job as scipy.linalg.lstsq but uses “least absolute deviations” regression instead of “least squares” regression (OLS). I want to use the L1 norm, instead of the L2 norm.
In fact, I have 3D points for which I want the best-fit plane. The common approach is the least-squares method, as in this GitHub link. But it's known that this doesn't always give the best fit, especially when we have outliers in our data set, and it's better to calculate the least absolute deviations. The difference between the two methods is explained more here.
It can't be solved by functions such as MAD, since it's an Ax = b matrix equation and requires iteration to minimize the result. I want to know if anyone knows of a relevant function in Python - probably in a linear algebra package - that would calculate "least absolute deviations" regression?
This is not so difficult to roll yourself, using scipy.optimize.minimize and a custom cost_function.
Let us first import the necessities,
from scipy.optimize import minimize
import numpy as np
And define a custom cost function (and a convenience wrapper for obtaining the fitted values),
def fit(X, params):
    return X.dot(params)

def cost_function(params, X, y):
    return np.sum(np.abs(y - fit(X, params)))
Then, if you have some X (design matrix) and y (observations), we can do the following,
output = minimize(cost_function, x0, args=(X, y))
y_hat = fit(X, output.x)
Here x0 is some suitable initial guess for the optimal parameters (you could take @JamesPhillips' advice and use the fitted parameters from an OLS approach).
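For instance, such an OLS starting point could be obtained with numpy's least-squares solver (a sketch, assuming X and y as defined just below):

# ordinary least-squares solution used as the initial guess for the L1 fit
x0 = np.linalg.lstsq(X, y, rcond=None)[0]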
In any case, when test-running with a somewhat contrived example,
X = np.asarray([np.ones((100,)), np.arange(0, 100)]).T
y = 10 + 5 * np.arange(0, 100) + 25 * np.random.random((100,))
I find,
fun: 629.4950595335436
hess_inv: array([[ 9.35213468e-03, -1.66803210e-04],
[ -1.66803210e-04, 1.24831279e-05]])
jac: array([ 0.00000000e+00, -1.52587891e-05])
message: 'Optimization terminated successfully.'
nfev: 144
nit: 11
njev: 36
status: 0
success: True
x: array([ 19.71326758, 5.07035192])
And,
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes()
ax.plot(y, 'o', color='black')
ax.plot(y_hat, 'o', color='blue')
plt.show()
With the fitted values in blue, and the data in black.
You can solve your problem using the scipy.optimize.minimize function. You have to define the function you want to fit (in our case a plane of the form Z = aX + bY + c) and the error function (the L1 norm), then run the minimizer with some starting value.
import numpy as np
import scipy.linalg
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

def fit(X, params):
    # 3d plane Z = aX + bY + c
    return X.dot(params[:2]) + params[2]

def cost_function(params, X, y):
    # L1 norm
    return np.sum(np.abs(y - fit(X, params)))
We generate 3D points:
# Generating 3-dim points
mean = np.array([0.0,0.0,0.0])
cov = np.array([[1.0,-0.5,0.8], [-0.5,1.1,0.0], [0.8,0.0,1.0]])
data = np.random.multivariate_normal(mean, cov, 50)
Last, we run the minimizer:
output = minimize(cost_function, [0.5,0.5,0.5], args=(np.c_[data[:,0], data[:,1]], data[:, 2]))
y_hat = fit(np.c_[data[:,0], data[:,1]], output.x)
X,Y = np.meshgrid(np.arange(min(data[:,0]), max(data[:,0]), 0.5), np.arange(min(data[:,1]), max(data[:,1]), 0.5))
XX = X.flatten()
YY = Y.flatten()
# evaluate it on the grid
Z = output.x[0]*X + output.x[1]*Y + output.x[2]
fig = plt.figure(figsize=(10,10))
ax = fig.gca(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, alpha=0.2)
ax.scatter(data[:,0], data[:,1], data[:,2], c='r')
plt.show()
Note: I have used the previous answer's code and the code from the GitHub link as a starting point.

Python natural smoothing splines

I am trying to find a python package that would give an option to fit natural smoothing splines with user selectable smoothing factor. Is there an implementation for that? If not, how would you use what is available to implement it yourself?
By natural spline I mean that there should be a condition that the second derivative of the fitted function at the endpoints is zero (linear).
By smoothing spline I mean that the spline should not be 'interpolating' (passing through all the datapoints). I would like to decide the correct smoothing factor lambda (see the Wikipedia page for smoothing splines) myself.
What I have found
scipy.interpolate.CubicSpline [link]: Does natural (cubic) spline fitting. Does interpolation, and there is no way to smooth the data.
scipy.interpolate.UnivariateSpline [link]: Does spline fitting with user selectable smoothing factor. However, there is no option to make the splines natural.
After hours of investigation, I did not find any pip-installable packages which could fit a natural cubic spline with user-controllable smoothness. However, after deciding to write one myself, while reading about the topic I stumbled upon a blog post by GitHub user madrury. He has written Python code capable of producing natural cubic spline models.
The model code is available here (NaturalCubicSpline) with a BSD-licence. He has also written some examples in an IPython notebook.
But since this is the Internet and links tend to die, I will copy the relevant parts of the source code here, plus a helper function (get_natural_cubic_spline_model) written by me, and show an example of how to use it. The smoothness of the fit can be controlled by using a different number of knots. The positions of the knots can also be specified by the user.
Example
from matplotlib import pyplot as plt
import numpy as np
def func(x):
    return 1/(1+25*x**2)
# make example data
x = np.linspace(-1,1,300)
y = func(x) + np.random.normal(0, 0.2, len(x))
# The number of knots can be used to control the amount of smoothness
model_6 = get_natural_cubic_spline_model(x, y, minval=min(x), maxval=max(x), n_knots=6)
model_15 = get_natural_cubic_spline_model(x, y, minval=min(x), maxval=max(x), n_knots=15)
y_est_6 = model_6.predict(x)
y_est_15 = model_15.predict(x)
plt.plot(x, y, ls='', marker='.', label='originals')
plt.plot(x, y_est_6, marker='.', label='n_knots = 6')
plt.plot(x, y_est_15, marker='.', label='n_knots = 15')
plt.legend(); plt.show()
The source code for get_natural_cubic_spline_model
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
def get_natural_cubic_spline_model(x, y, minval=None, maxval=None, n_knots=None, knots=None):
    """
    Get a natural cubic spline model for the data.

    For the knots, give (a) `knots` (as an array) or (b) minval, maxval and n_knots.

    If the knots are not directly specified, the resulting knots are equally
    spaced within the *interior* of (max, min). That is, the endpoints are
    *not* included as knots.

    Parameters
    ----------
    x: np.array of float
        The input data
    y: np.array of float
        The output data
    minval: float
        Minimum of interval containing the knots.
    maxval: float
        Maximum of the interval containing the knots.
    n_knots: positive integer
        The number of knots to create.
    knots: array or list of floats
        The knots.

    Returns
    --------
    model: a model object
        The returned model will have following method:
        - predict(x):
            x is a numpy array. This will return the predicted y-values.
    """
    if knots:
        spline = NaturalCubicSpline(knots=knots)
    else:
        spline = NaturalCubicSpline(max=maxval, min=minval, n_knots=n_knots)

    p = Pipeline([
        ('nat_cubic', spline),
        ('regression', LinearRegression(fit_intercept=True))
    ])

    p.fit(x, y)

    return p
class AbstractSpline(BaseEstimator, TransformerMixin):
    """Base class for all spline basis expansions."""

    def __init__(self, max=None, min=None, n_knots=None, n_params=None, knots=None):
        if knots is None:
            if not n_knots:
                n_knots = self._compute_n_knots(n_params)
            knots = np.linspace(min, max, num=(n_knots + 2))[1:-1]
            max, min = np.max(knots), np.min(knots)
        self.knots = np.asarray(knots)

    @property
    def n_knots(self):
        return len(self.knots)

    def fit(self, *args, **kwargs):
        return self
class NaturalCubicSpline(AbstractSpline):
    """Apply a natural cubic basis expansion to an array.

    The features created with this basis expansion can be used to fit a
    piecewise cubic function under the constraint that the fitted curve is
    linear *outside* the range of the knots. The fitted curve is continuously
    differentiable to the second order at all of the knots.

    This transformer can be created in two ways:
      - By specifying the maximum, minimum, and number of knots.
      - By specifying the cutpoints directly.

    If the knots are not directly specified, the resulting knots are equally
    spaced within the *interior* of (max, min). That is, the endpoints are
    *not* included as knots.

    Parameters
    ----------
    min: float
        Minimum of interval containing the knots.
    max: float
        Maximum of the interval containing the knots.
    n_knots: positive integer
        The number of knots to create.
    knots: array or list of floats
        The knots.
    """

    def _compute_n_knots(self, n_params):
        return n_params

    @property
    def n_params(self):
        return self.n_knots - 1

    def transform(self, X, **transform_params):
        X_spl = self._transform_array(X)
        if isinstance(X, pd.Series):
            col_names = self._make_names(X)
            X_spl = pd.DataFrame(X_spl, columns=col_names, index=X.index)
        return X_spl

    def _make_names(self, X):
        first_name = "{}_spline_linear".format(X.name)
        rest_names = ["{}_spline_{}".format(X.name, idx)
                      for idx in range(self.n_knots - 2)]
        return [first_name] + rest_names

    def _transform_array(self, X, **transform_params):
        X = X.squeeze()
        try:
            X_spl = np.zeros((X.shape[0], self.n_knots - 1))
        except IndexError:  # For arrays with only one element
            X_spl = np.zeros((1, self.n_knots - 1))
        X_spl[:, 0] = X.squeeze()

        def d(knot_idx, x):
            def ppart(t): return np.maximum(0, t)
            def cube(t): return t*t*t
            numerator = (cube(ppart(x - self.knots[knot_idx]))
                         - cube(ppart(x - self.knots[self.n_knots - 1])))
            denominator = self.knots[self.n_knots - 1] - self.knots[knot_idx]
            return numerator / denominator

        for i in range(0, self.n_knots - 2):
            X_spl[:, i+1] = (d(i, X) - d(self.n_knots - 2, X)).squeeze()

        return X_spl
You could use this numpy/scipy implementation of a natural cubic smoothing spline for univariate/multivariate data smoothing. The smoothing parameter should be in the range [0.0, 1.0]. If we use a smoothing parameter equal to 1.0, we get the natural cubic spline interpolant without data smoothing. The implementation also supports vectorization for univariate data.
Univariate example:
import numpy as np
import matplotlib.pyplot as plt
import csaps
np.random.seed(1234)
x = np.linspace(-5., 5., 25)
y = np.exp(-(x/2.5)**2) + (np.random.rand(25) - 0.2) * 0.3
sp = csaps.UnivariateCubicSmoothingSpline(x, y, smooth=0.85)
xs = np.linspace(x[0], x[-1], 150)
ys = sp(xs)
plt.plot(x, y, 'o', xs, ys, '-')
plt.show()
Bivariate example:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import csaps
xdata = [np.linspace(-3, 3, 61), np.linspace(-3.5, 3.5, 51)]
i, j = np.meshgrid(*xdata, indexing='ij')
ydata = (3 * (1 - j)**2. * np.exp(-(j**2) - (i + 1)**2)
- 10 * (j / 5 - j**3 - i**5) * np.exp(-j**2 - i**2)
- 1 / 3 * np.exp(-(j + 1)**2 - i**2))
np.random.seed(12345)
noisy = ydata + (np.random.randn(*ydata.shape) * 0.75)
sp = csaps.MultivariateCubicSmoothingSpline(xdata, noisy, smooth=0.988)
ysmth = sp(xdata)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(j, i, noisy, linewidths=0.5, color='r')
ax.scatter(j, i, noisy, s=5, c='r')
ax.plot_surface(j, i, ysmth, linewidth=0, alpha=1.0)
plt.show()
The Python package patsy has functions for generating spline bases, including a natural cubic spline basis, as described in the documentation.
Any library can then be used for fitting a model, e.g. scikit-learn or statsmodels.
The df parameter for cr() can be used to control the "smoothness".
Note that too low a df can result in underfitting (see below).
A simple example using scikit-learn:
import numpy as np
from sklearn.linear_model import LinearRegression
from patsy import cr
import matplotlib.pyplot as plt
n_obs = 600
np.random.seed(0)
x = np.linspace(-3, 3, n_obs)
y = 1 / (x ** 2 + 1) * np.cos(np.pi * x) + np.random.normal(0, 0.2, size=n_obs)
def plot_smoothed(df=5):
    # Generate spline basis with different degrees of freedom
    x_basis = cr(x, df=df, constraints="center")
    # Fit model to the data
    model = LinearRegression().fit(x_basis, y)
    # Get estimates
    y_hat = model.predict(x_basis)
    plt.plot(x, y_hat, label=f"df={df}")

plt.scatter(x, y, s=4, color="tab:blue")
for df in (5, 7, 10, 25):
    plot_smoothed(df)
plt.legend()
plt.title(f"Natural cubic spline with varying degrees of freedom")
plt.show()
For a project of mine, I needed to create intervals for time-series modeling, and to make the procedure more efficient I created tsmoothie: a Python library for time-series smoothing and outlier detection in a vectorized way.
It provides different smoothing algorithms together with the possibility to compute intervals.
In the case of SplineSmoother of natural cubic type:
import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.smoother import *
def func(x):
    return 1/(1+25*x**2)
# make example data
x = np.linspace(-1,1,300)
y = func(x) + np.random.normal(0, 0.2, len(x))
# operate smoothing
smoother = SplineSmoother(n_knots=10, spline_type='natural_cubic_spline')
smoother.smooth(y)
# generate intervals
low, up = smoother.get_intervals('prediction_interval', confidence=0.05)
# plot the first smoothed timeseries with intervals
plt.figure(figsize=(11,6))
plt.plot(smoother.smooth_data[0], linewidth=3, color='blue')
plt.plot(smoother.data[0], '.k')
plt.fill_between(range(len(smoother.data[0])), low[0], up[0], alpha=0.3)
I also point out that tsmoothie can carry out the smoothing of multiple time series in a vectorized way, as sketched below.
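A minimal sketch of that, built on the example above; I'm assuming smooth() accepts a 2D array where each row is one series, which matches the smoother.smooth_data[0] indexing used earlier:

# stack two noisy series as rows of a 2D array and smooth both at once
y2 = func(x) + np.random.normal(0, 0.3, len(x))
data = np.vstack([y, y2])          # shape (2, n_samples)
smoother.smooth(data)
plt.plot(smoother.smooth_data[0])  # first smoothed series
plt.plot(smoother.smooth_data[1])  # second smoothed series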
The programming language R offers a very good implementation of natural cubic smoothing splines. You can use R functions in Python with rpy2:
import rpy2.robjects as robjects
r_y = robjects.FloatVector(y_train)
r_x = robjects.FloatVector(x_train)
r_smooth_spline = robjects.r['smooth.spline']  # extract R function
# run smoothing function
spline1 = r_smooth_spline(x=r_x, y=r_y, spar=0.7)
ySpline=np.array(robjects.r['predict'](spline1,robjects.FloatVector(x_smooth)).rx2('y'))
plt.plot(x_smooth,ySpline)
If you want to set lambda directly, spline1 = r_smooth_spline(x=r_x, y=r_y, lambda=42) doesn't work, because lambda already has another meaning in Python, but there is a solution: How to use the lambda argument of smooth.spline in RPy WITHOUT Python interpreting it as lambda.
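A sketch of the usual workaround along those lines (my assumption is that passing the R argument name through keyword-dict unpacking avoids the Python keyword clash):

# 'lambda' is only a string key here, so Python never parses it as the keyword
spline1 = r_smooth_spline(x=r_x, y=r_y, **{'lambda': 42})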
To get the code running, you first need to define the data x_train and y_train, and you can define x_smooth = np.array(np.linspace(-3, 5, 1920)) if you want to plot it between -3 and 5 in full-HD resolution.
Note that this code is not fully compatible with Jupyter notebooks for the latest versions of rpy2. You can fix this by using !pip install -Iv rpy2==3.4.2, as described in NotImplementedError: Conversion 'rpy2py' not defined for objects of type '<class 'rpy2.rinterface.SexpClosure'>' only after I run the code twice.
