Error propagation in a linear fit using Python

Let's say I take multiple measurements of some dependent variable y relative to some independent variable x. I also record the uncertainty dy in each measurement. As an example, this may look like:
import numpy as np
x = np.array([1, 2, 3, 4])
y = np.array([4.1, 5.8, 8.1, 9.7])
dy = np.array([0.2, 0.3, 0.2, 0.4])
Now assume I expect the measured values to obey a linear relationship y = mx + b and I want to determine the y value y_unm for some unmeasured x value x_unm. I can perform a linear fit in Python pretty easily if I don't consider the error:
fit_params, residuals, rank, s_values, rcond = np.polyfit(x, y, 1, full=True)
poly_func = np.poly1d(fit_params)
x_unm # The unmeasured x value
y_unm = poly_func(x_unm) # The predicted y value at x_unm
I have two problems with this approach. First is that np.polyfit does not incorporate the error on each point. Second is that I have no idea what the uncertainty on y_unm is.
Does anyone know how to fit data with uncertainties in a way that would allow me to determine the uncertainty in y_unm?

This problem can be solved analytically, though the derivation is perhaps better suited to a math/statistics discussion; many sources cover it.
The error in the fit can be calculated analytically. It is important to note, though, that the fit itself is different when accounting for errors in the measurements.
In Python I am not aware of a built-in function that handles errors, but here is an example of doing a chi-squared minimization using scipy.optimize.fmin:
#Calculate Chi^2 function to minimize
def chi_2(params,x,y,sigy):
    m, c = params  # slope and intercept
    return sum(((y-m*x-c)/sigy)**2)
For comparison I used this, your polyfit solution, and the analytic solution and plotted for the data you gave.
The results for the parameters from the given techniques:
Weighted Chi-squared with fmin:
[Figure: Linear fits to given data]
Here is the full code:
import numpy as np
from scipy.optimize import fmin
import matplotlib.pyplot as plt
x = np.array([1, 2, 3, 4])
y = np.array([4.1, 5.8, 8.1, 9.7])
dy = np.array([0.2, 0.3, 0.2, 0.4])
#Calculate Chi^2 function to minimize
def chi_2(params,x,y,sigy):
    m, c = params  # slope and intercept
    return sum(((y-m*x-c)/sigy)**2)
#Unweighted fit to compare
#Analytic solution
plt.plot(xplt,yplt1,label='Error Weighted',color='black')
plt.plot(xplt,yplt2,label='Non-Error Weighted',color='blue')
plt.plot(xplt,yplt3,label='Error Weighted Analytic',linestyle='--',color='red')
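As a complement to the chi-squared approach above, here is a minimal sketch of a weighted straight-line fit that also propagates the parameter covariance to the predicted value. It assumes the dy values are 1-sigma Gaussian uncertainties; x_unm = 2.5 is a hypothetical unmeasured point chosen only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([4.1, 5.8, 8.1, 9.7])
dy = np.array([0.2, 0.3, 0.2, 0.4])

# np.polyfit expects weights w = 1/sigma (not 1/sigma**2).
# cov="unscaled" treats dy as absolute 1-sigma errors instead of
# rescaling the covariance by the reduced chi-squared.
params, cov = np.polyfit(x, y, 1, w=1.0 / dy, cov="unscaled")
m, b = params

x_unm = 2.5  # hypothetical unmeasured x value (assumption)
y_unm = m * x_unm + b

# Linear error propagation: var(y) = J cov J^T with J = [dy/dm, dy/db]
jac = np.array([x_unm, 1.0])
dy_unm = np.sqrt(jac @ cov @ jac)

print(f"m = {m:.3f} +/- {np.sqrt(cov[0, 0]):.3f}")
print(f"b = {b:.3f} +/- {np.sqrt(cov[1, 1]):.3f}")
print(f"y({x_unm}) = {y_unm:.3f} +/- {dy_unm:.3f}")
```

The off-diagonal covariance term matters here: slope and intercept are correlated, so adding their individual errors in quadrature would misstate the uncertainty on y_unm.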


How to calculate the probability between two numbers from a probability distribution in python

I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1) # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis] # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
    probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
    return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
This returns a probability that converges at 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1) # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis] # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
    probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
    return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
More edits:
Using CDFs per @7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's epa per pass from last season
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
    '' \
    + str(YEAR) + '.csv.gz?raw=True', compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
Could we use sampling means or even just bootstrap resampling methods to:
1. Make a more "normal" distribution with sampling means in order to incorporate CDFs if the initial distribution doesn't quite appear normal? (This, though, would be a distribution of means rather than individual samples. Is this not encouraged?)
2. If the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
Computing the probability p for some interval is not overly complicated. However, it can be tricky to combine the right tools, in particular because there are several statistical approaches to choose from.
0. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Equivalently, p is the area under the graph of the probability density function (PDF) f over the interval [lower, upper].
However, when the CDF/PDF is unknown, this becomes a statistical question. In a nutshell, estimating the PDF f and computing the area under its graph over the interval will do. But there are several paradigms and estimation procedures to obtain such an estimate.
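To make the CDF/PDF equivalence concrete, here is a small sketch for the case where the distribution is known. It assumes a standard normal and the interval [-1, 1] purely for illustration; both routes should give the familiar one-sigma probability.

```python
from scipy import stats
from scipy.integrate import quad

lower, upper = -1.0, 1.0  # illustrative interval: one standard deviation

# Route 1: difference of the CDF at the two endpoints
p_cdf = stats.norm.cdf(upper) - stats.norm.cdf(lower)

# Route 2: area under the PDF over [lower, upper]
p_area, abs_err = quad(stats.norm.pdf, lower, upper)

print(round(p_cdf, 4), round(p_area, 4))
```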
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or convenience. Then, one just needs to estimate its parameters mu (aka mean or location) and sigma (aka standard deviation or scale). scipy.stats provides all we need in this setting. Moreover, it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameters (maximum-likelihood estimates of location and scale)
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x = np.arange(lower, upper, 0.01),
                 y1 = stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat))
plt.text(x=0.1, y=0.1, s= 'p=' + str(round(p, 3)))
plt.show()
which yields
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it by estimated parameters as seen above). Kernel density estimation is the most popular variant to do so. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))
def f_pred(x):
    '''wrapper function returning the estimated density at a single point x'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]
# quad returns a tuple (integral estimate, absolute error estimate)
p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x = np.arange(lower, upper, 0.01),
                 y1 = np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))))
plt.text(x=0.15, y=0.1, s= 'p=' + str(round(p[0], 3)))
plt.show()
and yields
I do see a bug in the get_probability function, but that bug causes it to compute a too high result - in np.sum(kd_vals * step), it's multiplying N sample values by a step with N-1 in the denominator, effectively resulting in an output a factor of N/(N-1) too high. (If they wanted to use a trapezoid rule computation for the integral, they should have divided the left and right endpoint values by 2 first.)
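The endpoint issue described above can be checked numerically. The sketch below uses the standard normal PDF as a stand-in for the fitted density (an assumption, so that the exact answer is known) and compares the question's sum with a trapezoid rule on the same sample points.

```python
import numpy as np
from scipy.stats import norm

def riemann_sum(pdf, start, end, n):
    """Sum used in the question: n samples, each times a step of width/(n-1)."""
    xs = np.linspace(start, end, n)
    step = (end - start) / (n - 1)
    return np.sum(pdf(xs) * step)

def trapezoid(pdf, start, end, n):
    """Trapezoid rule on the same samples: the two endpoints count half."""
    xs = np.linspace(start, end, n)
    step = (end - start) / (n - 1)
    vals = pdf(xs)
    return step * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

exact = norm.cdf(1.0) - norm.cdf(-1.0)
naive = riemann_sum(norm.pdf, -1.0, 1.0, 100)
trap = trapezoid(norm.pdf, -1.0, 1.0, 100)
print(f"exact={exact:.5f} naive={naive:.5f} trapezoid={trap:.5f}")
```

The naive sum overshoots, which confirms that this bug cannot explain a result that is too low.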
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
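The flattening effect can be estimated without fitting anything. A rough sketch, assuming normal data and enough samples that the Gaussian KDE behaves like the true density convolved with the kernel: the estimate then approaches a normal distribution whose variance is inflated by the squared bandwidth.

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0      # standard deviation of the sampled data
bandwidth = 0.5  # KDE bandwidth used in the question

# Convolving N(0, sigma^2) with a Gaussian kernel of width `bandwidth`
# gives N(0, sigma^2 + bandwidth^2) in the large-sample limit.
sigma_kde = np.sqrt(sigma**2 + bandwidth**2)

p_true = norm.cdf(sigma, scale=sigma) - norm.cdf(-sigma, scale=sigma)
p_kde = norm.cdf(sigma, scale=sigma_kde) - norm.cdf(-sigma, scale=sigma_kde)
print(round(p_true, 4), round(p_kde, 4))
```

Under these assumptions the smoothed model puts noticeably less than 68% of its mass within one standard deviation of the mean; a smaller bandwidth shrinks the gap.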

Simulating correlated lognormals in Python

I'm following the answer in this question How can I sample a multivariate log-normal distribution in Python?, but I'm finding that the marginal distributions of the sample data fail to have the same mean and standard deviation as the inputted marginals. For example, consider the multivariate distribution below in the code sample. If we label the marginals as X, Y, and Z, then I would expect the location and scale parameters (implied from the sample data) to match the inputted data. However, for X, you can see below that the location and scale parameters are 0.1000 and 0.5219. So the location is what we expect, but the scale is off by about 4%. I tried setting the correlation matrix to the identity matrix, and then the location and scale of the sample data match the inputted data. So something must be wrong with my covariance matrix, or I'm making another fundamental error, but I can't figure out what. Any help would be appreciated. Please advise if the question is unclear.
import pandas as pd
import numpy as np
from copy import deepcopy
mu = [0.1, 0.2, 0.3]
sigma = [0.5, 0.8, 0.6]
sims = 3000000
rho = [[1, 0.9, 0.3], [0.9, 1, 0.8], [0.3, 0.8 ,1]]
cov = deepcopy(rho)
for row in range(len(rho)):
    for col in range(len(rho)):
        cov[row][col] = rho[row][col] * sigma[row] * sigma[col]
mvn = np.random.multivariate_normal(mu, cov, size=sims)
sim = pd.DataFrame(np.exp(mvn), columns=['X', 'Y', 'Z'])
def computeImpliedLogNormalsParams(mean, std):
    # This method implies lognormal params which match the moments inputted
    secondMoment = std * std + mean * mean
    location = np.log(mean * mean / np.sqrt(secondMoment))
    scale = np.sqrt(np.log(secondMoment / (mean * mean)))
    return (location, scale)
def printDistributionProp(col, sim):
    print(f"Mean = {sim[col].mean()}, std = {sim[col].std()}")
    location, scale = computeImpliedLogNormalsParams(sim[col].mean(), sim[col].std())
    print(f"Matching moments gives a lognormal with location {location} and scale {scale}")
printDistributionProp('X', sim)
Mean = 1.2665338803521895, std = 0.708713940557892
Matching moments gives a lognormal with location 0.10008162992913544 and scale 0.5219239625443672
Observing the output, we would expect that scale parameter to be very close to 0.5, but it's a bit off. Increasing the number of simulations does nothing since the value has converged.
The covariance matrix isn't positive semidefinite:
>>> mvn = np.random.multivariate_normal(mu, cov, size=sims, check='raise')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mtrand.pyx", line 4542, in mtrand.RandomState.multivariate_normal
ValueError: covariance is not symmetric positive-semidefinite.
and therefore there is no distribution of data that actually has the requested covariance structure. At a high level, consider that you are specifying X and Z to both be highly correlated with Y (0.9 and 0.8, respectively), but at the same time to be rather weakly correlated with each other (0.3). A detailed discussion specifically about three-variable correlation constraints can be found on Mathematics SE.
I don't know the internals of how NumPy gets around it (you should have seen a warning), but if you check the final correlation structure:
>>> np.corrcoef(mvn.T)
array([[1. , 0.79817321, 0.33343102],
[0.79817321, 1. , 0.74525583],
[0.33343102, 0.74525583, 1. ]])
one can see that the X and Z have lower correlations with Y and higher correlation with each other than originally specified by rho. Again, not sure how exactly the variances get adjusted, but because the covariance is impossible, NumPy can pretty much do what it wants; fortunately, it seems to stay pretty close.
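A quick way to catch this before sampling is to inspect the eigenvalues of the correlation matrix. The sketch below also evaluates the standard three-variable bound on corr(X, Z) given the other two correlations; the variable names are illustrative.

```python
import numpy as np

rho = np.array([[1.0, 0.9, 0.3],
                [0.9, 1.0, 0.8],
                [0.3, 0.8, 1.0]])

# A valid correlation matrix has no negative eigenvalues
eigenvalues = np.linalg.eigvalsh(rho)
is_psd = bool(np.all(eigenvalues >= -1e-12))
print("eigenvalues:", eigenvalues, "PSD:", is_psd)

# Admissible range for corr(X, Z) given corr(X, Y) and corr(Y, Z)
r_xy, r_yz = rho[0, 1], rho[1, 2]
half_width = np.sqrt((1 - r_xy**2) * (1 - r_yz**2))
r_xz_min = r_xy * r_yz - half_width
r_xz_max = r_xy * r_yz + half_width
print(f"corr(X, Z) must lie in [{r_xz_min:.3f}, {r_xz_max:.3f}]")
```

The requested value of 0.3 falls below the lower bound, which is why one eigenvalue turns negative.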

using undetermined number of parameters in scipy function curve_fit

First question:
I'm trying to fit experimental data with a function of the following form:
f(x) = m_0*(1-exp(-t_0*x)) + ... + m_j*(1-exp(-t_j*x))
Currently, I can't find a way to have an undetermined number of parameters m_j, t_j; I'm forced to do something like this:
def fitting_function(x, m_1, t_1, m_2, t_2):
    return m_1*(1.-numpy.exp(-t_1*x)) + m_2*(1.-numpy.exp(-t_2*x))
parameters, covariance = curve_fit(fitting_function, xExp, yExp, maxfev = 100000)
(xExp and yExp are my experimental points)
Is there a way to write my fitting function like this:
def fitting_function(x, li):
    res = 0.
    for idx in range(len(li) // 2):
        res += li[2*idx]*(1-numpy.exp(-li[2*idx+1]*x))
    return res
where li is the list of fitting parameters, and then do a curve_fit? I don't know how to tell curve_fit how many fitting parameters there are.
When I try this kind of form for fitting_function, I get errors like "ValueError: Unable to determine number of fit parameters."
Second question:
Is there any way to force my fitting parameters to be positive?
Any help appreciated :)
See my question and answer here. I've also made a minimal working example demonstrating how it could be done for your application. I make no claims that this is the best way - I am muddling through all this myself, so any critiques or simplifications are appreciated.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as pl
def wrapper(x, *args): #take a flat list of arguments and split it into two lists for the fit function to understand
    N = len(args) // 2  # integer division so N can be used as a slice index
    amplitudes = list(args[0:N])
    timeconstants = list(args[N:2*N])
    return fit_func(x, amplitudes, timeconstants)
def fit_func(x, amplitudes, timeconstants): #the actual fit function
    fit = np.zeros(len(x))
    for m,t in zip(amplitudes, timeconstants):
        fit += m*(1.0-np.exp(-t*x))
    return fit
def gen_data(x, amplitudes, timeconstants, noise=0.1): #generate some fake data
    y = np.zeros(len(x))
    for m,t in zip(amplitudes, timeconstants):
        y += m*(1.0-np.exp(-t*x))
    if noise:
        y += np.random.normal(0, noise, size=len(x))
    return y
def main():
    x = np.arange(0,100)
    amplitudes = [1, 2, 3]
    timeconstants = [0.5, 0.2, 0.1]
    y = gen_data(x, amplitudes, timeconstants, noise=0.01)
    p0 = [1, 2, 3, 0.5, 0.2, 0.1]  # the length of p0 tells curve_fit how many parameters to fit
    popt, pcov = curve_fit(lambda x, *p0: wrapper(x, *p0), x, y, p0=p0) #call with lambda function
    yfit = gen_data(x, popt[0:3], popt[3:6], noise=0)
    print(popt)
    print(pcov)
if __name__=="__main__":
    main()
A word of warning, though. A linear sum of exponentials is going to make the fit EXTREMELY sensitive to any noise, particularly for a large number of parameters. You can test that by adding even a small amount of noise to the data generated in the script: even small deviations cause it to get the wrong answer entirely while the fit still looks perfectly valid by eye (test with noise=0, 0.01, and 0.1). Be very careful interpreting your results even if the fit looks good. It's also a form that allows for variable swapping: the best-fit solution is the same even if you swap any pair (m_i, t_i) with (m_j, t_j), meaning your chi-square has multiple identical local minima, so your variables might get swapped around during fitting, depending on your initial conditions. This is unlikely to be a numerically robust way to extract these parameters.
To your second question: yes, you can, by fitting proxy parameters instead of the parameters themselves.
Basically, square them all in your actual fit function, fit them, and then square the results (which could be negative or positive) to get your actual parameters. You can also constrain variables to lie within a certain range by using different proxy forms.
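The squaring trick might look roughly like the following sketch. The helper names (model, model_squared), the synthetic data, and the starting values are made up for illustration. For comparison, it also shows the bounds argument of curve_fit, which is a different technique from the proxy approach described above.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, *p):
    """Sum of saturating exponentials; p = (m_1..m_n, t_1..t_n)."""
    n = len(p) // 2
    return sum(m * (1.0 - np.exp(-t * x)) for m, t in zip(p[:n], p[n:]))

def model_squared(x, *q):
    """Same model, but the optimizer sees q while the model uses q**2 >= 0."""
    return model(x, *np.square(q))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 100.0, 200)
true_params = [1.0, 3.0, 0.5, 0.05]   # made-up values: m_1, m_2, t_1, t_2
y = model(x, *true_params) + rng.normal(0.0, 1e-3, x.size)

p0 = [0.8, 2.5, 0.4, 0.06]            # initial guess (assumption)

# Option 1: fit the proxies, then square them to recover the parameters
q_opt, _ = curve_fit(model_squared, x, y, p0=np.sqrt(p0))
popt_proxy = np.square(q_opt)

# Option 2: box constraints through curve_fit's `bounds` argument
popt_bounds, _ = curve_fit(model, x, y, p0=p0, bounds=(0.0, np.inf))

print(popt_proxy)
print(popt_bounds)
```

Note that the covariance returned in option 1 refers to the proxies q, not to the squared parameters, so it would need to be propagated before being quoted as parameter errors.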

Linear fit including all errors with NumPy/SciPy

I have a lot of x-y data points with errors on y that I need to fit non-linear functions to. Those functions can be linear in some cases, but are more usually exponential decay, gauss curves and so on. SciPy supports this kind of fitting with scipy.optimize.curve_fit, and I can also specify the weight of each point. This gives me weighted non-linear fitting which is great. From the results, I can extract the parameters and their respective errors.
There is just one caveat: the errors are only used as weights, but are not included in the resulting parameter errors. If I double the errors on all of my data points, I would expect the uncertainty of the result to increase as well. So I built a test case (source code) to test this.
Fit with scipy.optimize.curve_fit gives me:
Parameters: [ 1.99900756 2.99695535]
Errors: [ 0.00424833 0.00943236]
Same but with 2 * y_err:
Parameters: [ 1.99900756 2.99695535]
Errors: [ 0.00424833 0.00943236]
So you can see that the values are identical. This tells me that the algorithm does not take those into account, but I think the values should be different.
I read about another fit method here, so I tried to fit with scipy.odr as well:
Beta: [ 2.00538124 2.95000413]
Beta Std Error: [ 0.00652719 0.03870884]
Same but with 20 * y_err:
Beta: [ 2.00517894 2.9489472 ]
Beta Std Error: [ 0.00642428 0.03647149]
The values are slightly different, but I do not think that this accounts for the increase in the error at all. I think that this is just rounding error or slightly different weighting.
Is there some package that allows me to fit the data and get the actual errors? I have the formulas here in a book, but I do not want to implement this myself if I do not have to.
I have now read about linfit in another question. This handles what I have in mind quite well. It supports both modes, and the first one is what I need.
Fit with linfit:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00772283 0.04449971]
Same but with 20 * y_err:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.15445662 0.88999413]
Fit with linfit(relsigma=True):
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00622595 0.03587451]
Same but with 20 * y_err:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00622595 0.03587451]
Should I answer my question or just close/delete it now?
One way that works well and actually gives a better result is the bootstrap method. When data points with errors are given, one uses a parametric bootstrap and lets each x and y value describe a Gaussian distribution. Then one draws a point from each of those distributions and obtains a new bootstrapped sample. Performing a simple unweighted fit on that sample gives one value for each of the parameters.
This process is repeated some 300 to a couple thousand times. One will end up with a distribution of the fit parameters where one can take mean and standard deviation to obtain value and error.
Another neat thing is that one does not obtain a single fit curve as a result, but lots of them. For each interpolated x value one can again take mean and standard deviation of the many values f(x, param) and obtain an error band:
Further steps in the analysis are then performed again hundreds of times with the various fit parameters. This will then also take into account the correlation of the fit parameters as one can see clearly in the plot above: Although a symmetric function was fitted to the data, the error band is asymmetric. This will mean that interpolated values on the left have a larger uncertainty than on the right.
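The procedure described above could be sketched as follows. This is a simplified illustration, not the original analysis: it assumes errors on y only, a straight-line model, and synthetic data generated on the spot.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumption): y = 2x + 3 with Gaussian noise of width 0.5
x = np.linspace(0.0, 10.0, 50)
y_err = np.full(x.shape, 0.5)
y = 2.0 * x + 3.0 + rng.normal(0.0, y_err)

n_boot = 1000
x_grid = np.linspace(0.0, 10.0, 101)
params = np.empty((n_boot, 2))
curves = np.empty((n_boot, x_grid.size))

for i in range(n_boot):
    # Draw each point from its own Gaussian, then do a plain unweighted fit
    y_sample = rng.normal(y, y_err)
    params[i] = np.polyfit(x, y_sample, 1)
    curves[i] = np.polyval(params[i], x_grid)

# Parameter values and errors from the bootstrap distribution
slope, intercept = params.mean(axis=0)
slope_err, intercept_err = params.std(axis=0, ddof=1)

# Error band: mean and spread of the fitted curves at each grid point
band_mean = curves.mean(axis=0)
band_err = curves.std(axis=0, ddof=1)

print(f"slope = {slope:.3f} +/- {slope_err:.3f}")
print(f"intercept = {intercept:.3f} +/- {intercept_err:.3f}")
```

Because each bootstrap sample keeps its own (slope, intercept) pair, the band automatically includes the correlation between the parameters.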
Please note that, from the documentation of curve_fit:
sigma : None or N-length sequence
If not None, this vector will be used as relative weights in the
least-squares problem.
The key point here is as relative weights, therefore, yerr in line 53 and 2*yerr in 57 should give you similar, if not the same result.
When you increase the actual residual error, you will see the values in the covariance matrix grow. Say we change y += random to y += 5*random in the function generate_data():
Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 1.92810458, 3.97843448]))
('Errors: ', array([ 0.09617346, 0.64127574]))
Compare this to the original result:
Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 2.00760386, 2.97817514]))
('Errors: ', array([ 0.00782591, 0.02983339]))
Also notice that the parameter estimate is now further off from (2, 3), as we would expect from the increased residual error and the larger confidence interval of the parameter estimates.
Short answer
For absolute values that include uncertainty in y (and in x for odr case):
In the scipy.odr case use stddev = numpy.sqrt(numpy.diag(cov))
where the cov is the covariance matrix odr gives in the output.
In the scipy.optimize.curve_fit case use absolute_sigma=True
For relative values (excludes uncertainty):
In the scipy.odr case use the sd value from the output.
In the scipy.optimize.curve_fit case use absolute_sigma=False flag.
Use numpy.polyfit like this:
p, cov = numpy.polyfit(x, y, 1,cov = True)
errorbars = numpy.sqrt(numpy.diag(cov))
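The difference between the two curve_fit modes can be checked with a small experiment. This is a sketch on made-up data: doubling sigma should leave the reported errors unchanged with absolute_sigma=False and double them with absolute_sigma=True.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a * x + b

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)
y_err = np.full(x.shape, 0.3)  # assumed 1-sigma errors on y
y = line(x, 2.0, 3.0) + rng.normal(0.0, y_err)

def fit_errors(sigma, absolute):
    """Return the parameter standard deviations from curve_fit."""
    _, pcov = curve_fit(line, x, y, p0=[1.0, 1.0],
                        sigma=sigma, absolute_sigma=absolute)
    return np.sqrt(np.diag(pcov))

rel_1, rel_2 = fit_errors(y_err, False), fit_errors(2 * y_err, False)
abs_1, abs_2 = fit_errors(y_err, True), fit_errors(2 * y_err, True)

print("relative:", rel_1, rel_2)
print("absolute:", abs_1, abs_2)
```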
Long answer
There is some undocumented behavior in all of the functions. My guess is that the functions are mixing relative and absolute values. At the end of this answer is code that either gives what you want (or doesn't) depending on how you process the output (is there a bug?). Also, curve_fit might have gotten the 'absolute_sigma' flag only recently?
My point is in the output. It seems that odr calculates the standard deviation as if there were no uncertainties, similar to polyfit, but if the standard deviation is calculated from the covariance matrix, the uncertainties are included. curve_fit does this with the absolute_sigma=True flag. Below is the output containing
the diagonal elements of the covariance matrix, cov(0,0) and cov(1,1),
the wrong way to get the standard deviation from the outputs, for the slope and for the constant, and
the right way to get the standard deviation from the outputs, for the slope and for the constant.
odr: 1.739631e-06 0.02302262 [ 0.00014863 0.0170987 ] [ 0.00131895 0.15173207]
curve_fit: 2.209469e-08 0.00029239 [ 0.00014864 0.01709943] [ 0.0004899 0.05635713]
polyfit: 2.232016e-08 0.00029537 [ 0.0001494 0.01718643]
Notice that odr and polyfit give almost exactly the same standard deviation. polyfit does not get the uncertainties as an input, so odr evidently doesn't use the uncertainties when calculating its standard deviation either. The covariance matrix does use them: if, in the odr case, the standard deviation is calculated from the covariance matrix, the uncertainties are included and the result changes if the uncertainty is increased. Fiddling with dy in the code below will show it.
I am writing this here mostly because this is important to know when finding out error limits (and the fortran odrpack guide where scipy refers has some misleading information about this: standard deviation should be the square root of covariance matrix like the guide says but it is not).
import scipy.odr
import scipy.optimize
import numpy
x = numpy.arange(200)
y = x + 0.4*numpy.random.random(x.shape)
dy = 0.4
def stddev(cov): return numpy.sqrt(numpy.diag(cov))
def f(B, x): return B[0]*x + B[1]
linear = scipy.odr.Model(f)
mydata = scipy.odr.RealData(x, y, sy = dy)
myodr = scipy.odr.ODR(mydata, linear, beta0 = [1.0, 1.0], sstol = 1e-20, job=0)
myoutput = myodr.run()
cov = myoutput.cov_beta
sd = myoutput.sd_beta
p = myoutput.beta
print('odr: ', cov[0,0], cov[1,1], sd, stddev(cov))
# curve_fit expects one sigma per data point, so broadcast the scalar dy
sigma = numpy.full(x.shape, dy)
p2, cov2 = scipy.optimize.curve_fit(lambda x, a, b:a*x+b,
                                    x, y, [1,1],
                                    sigma = sigma,
                                    absolute_sigma = False,
                                    xtol = 1e-20)
p3, cov3 = scipy.optimize.curve_fit(lambda x, a, b:a*x+b,
                                    x, y, [1,1],
                                    sigma = sigma,
                                    absolute_sigma = True,
                                    xtol = 1e-20)
print('curve_fit: ', cov2[0,0], cov2[1,1], stddev(cov2), stddev(cov3))
p, cov4 = numpy.polyfit(x, y, 1,cov = True)
print('polyfit: ', cov4[0,0], cov4[1,1], stddev(cov4))

Spline representation with scipy.interpolate: Poor interpolation for low-amplitude, rapidly oscillating functions

I need to (numerically) calculate the first and second derivatives of a function, for which I've attempted to use both splrep and UnivariateSpline to create splines that interpolate the function so that I can take the derivatives.
However, it seems that there's an inherent problem in the spline representation itself for functions whose magnitude is of order 10^-1 or lower and that are (rapidly) oscillating.
As an example, consider the following code to create a spline representation of the sine function over the interval (0,6*pi) (so the function oscillates only three times):
import scipy
from scipy import interpolate
import numpy
from numpy import linspace
import math
from math import sin, pi
k = linspace(0, 6.*pi, num=10000) #interval (0,6*pi) in 10'000 steps
A = 1.e0 # Amplitude of sine function
for i in range(len(k)):
tck =interpolate.UnivariateSpline(x, y, w=None, bbox=[None, None], k=5, s=2)
Below are the results for M for A = 1.e0 and A = 1.e-2.
[Figures: Amplitude = 1; Amplitude = 1/100]
Clearly the interpolated function created by the splines is totally incorrect! The second graph does not even oscillate at the correct frequency.
Does anyone have any insight into this problem? Or know of another way to create splines within numpy/scipy?
I'm guessing that your problem is due to aliasing.
What is x in your example?
If the x values that you're interpolating at are less closely spaced than your original points, you'll inherently lose frequency information. This is completely independent from any type of interpolation. It's inherent in downsampling.
Never mind the above bit about aliasing. It doesn't apply in this case (though I still have no idea what x is in your example...).
I just realized that you're evaluating your points at the original input points when you're using a non-zero smoothing factor (s).
By definition, smoothing won't fit the data exactly. Try putting s=0 in instead.
As a quick example:
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
x = np.linspace(0, 6.*np.pi, num=100) #interval (0,6*pi) sampled at 100 points
A = 1.e-4 # Amplitude of sine function
y = A*np.sin(x)
fig, axes = plt.subplots(nrows=2)
for ax, s, title in zip(axes, [2, 0], ['With', 'Without']):
    yinterp = interpolate.UnivariateSpline(x, y, s=s)(x)
    ax.plot(x, yinterp, label='Interpolated')
    ax.plot(x, y, 'bo',label='Original')
    ax.set_title(title + ' Smoothing')
plt.show()
The reason that you're only clearly seeing the effects of smoothing with a low amplitude is due to the way the smoothing factor is defined. See the documentation for scipy.interpolate.UnivariateSpline for more details.
Even with a higher amplitude, the interpolated data won't match the original data if you use smoothing.
For example, if we just change the amplitude (A) to 1.0 in the code example above, we'll still see the effects of smoothing...
The problem is in choosing suitable values for the s parameter. Its values depend on the scaling of the data.
Reading the documentation carefully, one can deduce that the parameter should be chosen around s = len(y) * np.var(y), i.e. # of data points * variance. Taking for example s = 0.05 * len(y) * np.var(y) gives a smoothing spline that does not depend on the scaling of the data or the number of data points.
EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.
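To see how the choice of s interacts with the scale of the data, here is a small sketch comparing a fixed s=2, s=0, and the variance-scaled value suggested above. It reuses the low-amplitude sine from the question; the sample count and the 0.05 factor are illustrative choices, not recommendations.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

A = 1.0e-2  # low amplitude, as in the question
x = np.linspace(0.0, 6.0 * np.pi, 100)
y = A * np.sin(x)

# s is compared against sum((y - spline(x))**2), so it scales like y**2.
spl_fixed = UnivariateSpline(x, y, s=2)   # far too large for |y| ~ 1e-2
spl_exact = UnivariateSpline(x, y, s=0)   # interpolating spline
spl_scaled = UnivariateSpline(x, y, s=0.05 * len(y) * np.var(y))

for name, spl in [("s=2", spl_fixed), ("s=0", spl_exact), ("scaled", spl_scaled)]:
    max_err = np.max(np.abs(spl(x) - y))
    print(f"{name}: knots={len(spl.get_knots())}, max error={max_err:.2e}")

# Derivatives come from the same spline object
dy_dx = spl_exact.derivative(1)(x)
d2y_dx2 = spl_exact.derivative(2)(x)
```

With s=2 the allowed residual exceeds the total sum of squares of the data, so the smoothing condition is met without following the oscillation at all.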
