I need to fit a sine curve created from two sine waves and extract the parameters for the fitted curve (such as frequency, amplitude, etc).
Data example:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.arange(0, 50, 0.01)
x2 = np.arange(0, 100, 0.02)
x3 = np.arange(0, 150, 0.03)
sin1 = np.sin(x)
sin2 = np.sin(x2)
sin3= np.sin(x3/2)
sin4 = sin1 + sin2+sin3
plt.plot(x, sin4)
plt.show()
I used the code provided in this answer.
yy = sin4
tt = x
res = fit_sin(tt, yy)
print(str(i), "Amplitude=%(amp)s, Angular freq.=%(omega)s, phase=%(phase)s, offset=%(offset)s, Max. Cov.=%(maxcov)s" % res )
fit_values=res["fitfunc"](tt)
Frequenc_fit= res['freq']
print(i, Frequenc_fit)
Frequenc_fit=Frequenc_fit
Amp_fit=res['amp']
Omega_fit=res['omega']
Phase_fit=res['phase']
Offset_fit=res['offset']
maxcov_fit=res['maxcov']
plt.plot(tt, yy, "-k", label="y", linewidth=2)
plt.plot(tt,fit_values, "r-", label="y fit curve", linewidth=2)
plt.legend(loc="best")
plt.show()
I got a fitted sine curve with a single frequency and amplitude as follows:
2 Amplitude=1.0149282025860233, Angular freq.=2.01112187048004, phase=-0.2730905030152767, offset=0.003304158823058212, Max. Cov.=0.0015266032307905222
2 0.3200799868471169
Is there a method to obtain a fitted curve that matches the original one?
Supposing that the function to be fitted is
y(x)=a * sin( w * x )+b * sin( W * x )
the principle of the method below is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
The graphical representation of the result is :
Blue curve : From data obtained by scanning the graph given in the question.
Black curve : From the above calculus.
The available data was not accurate because it comes from scanning the original figure. The deviation is mainly due to the numerical integrations used to compute the values of SS and SSSS (four successive numerical integrations are not accurate, especially with biased data).
Probably the correct result should be : w=2 , W=1 , a=1 , b=1.
NOTE : The above method is not iterative and thus doesn't require guessed values of the parameters to start an iterative process. The approximate results for the parameters can be good initial values for an iterative non-linear regression process.
NOTE : If the values of w and W were known a priori, solving by linear regression would be very simple and much more accurate (only the last 2x2 matrix calculation shown above).
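As an illustration of the last note, here is a minimal sketch of the linear regression step when w and W are assumed known. The data is synthetic (generated with a = 1, b = 1, w = 2, W = 1 plus a little noise), and the variable names are chosen for this example only; it is not the integral-equation method described above.

```python
import numpy as np

# Synthetic data for illustration only: a = 1, b = 1, w = 2, W = 1 plus noise.
rng = np.random.default_rng(0)
x = np.arange(0, 50, 0.01)
w, W = 2.0, 1.0  # frequencies assumed to be known a priori
y = 1.0 * np.sin(w * x) + 1.0 * np.sin(W * x) + rng.normal(0, 0.05, x.size)

# Design matrix: one column per basis function.
A = np.column_stack([np.sin(w * x), np.sin(W * x)])

# Least-squares solution of A @ [a, b] = y
# (equivalent to solving the 2x2 normal equations).
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"a = {a:.4f}, b = {b:.4f}")
```

Because the model is linear in a and b once the frequencies are fixed, no initial guess is needed for this step.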
I have a series of coordinates that I want to apply a KDE to, and have been using scipy.stats.gaussian_kde to do so. The issue here is that this function expects a discrete set of coordinates, which it would then perform a density estimation of.
This causes issues when I wish to take the log of my data (for sets where the coordinates are particularly sparse, and using the untouched data gives very little information). As you can imagine, if you must work with a discrete number of points and one point appears 18 times and another 24 times, taking the log of 18 and 24 will make them identical, as the results must be rounded to the nearest integer in order to remain discrete.
As a workaround for this, I have been using the weights parameter in the scipy.stats.gaussian_kde function. Instead of having an array where each point appears a number of times equal to its density, each point appears a single time and is instead weighted by its density. So now, using the example from before, the 2 points that have density 18 and 24 will not be identical, as with weights these densities can be continuous.
This works and produces what appears to be a good estimate; however, the two methods produce graphs with minor differences. If I had just been using one method, I would remain blissfully ignorant, but now that I've used both, I can't be sure the estimate is reasonable.
Is there a reason these two methods produce differing results?
See below some example code that reproduces the issue:
from scipy.stats import gaussian_kde
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
discrete_points = np.random.randint(0,10,size=(2,400))
continuous_points = [[],[]]
continuous_weights = []
recorded_points = []
for i in range(discrete_points.shape[1]):
    p = discrete_points[:,i]
    if tuple(p) in recorded_points:
        continuous_weights[recorded_points.index(tuple(p))] += 1
    else:
        continuous_points[0].append(p[0])
        continuous_points[1].append(p[1])
        continuous_weights.append(1)
        recorded_points.append(tuple(p))
resolution = 1
kde = gaussian_kde(discrete_points)
x, y = discrete_points
# https://www.oreilly.com/library/view/python-data-science/9781491912126/ch04.html
x_step = int((max(x)-min(x))/resolution)
y_step = int((max(y)-min(y))/resolution)
xgrid = np.linspace(min(x), max(x), x_step+1)
ygrid = np.linspace(min(y), max(y), y_step+1)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
Zgrid = Z.reshape(Xgrid.shape)
ext = [min(x), max(x), min(y), max(y)]
earth = plt.cm.gist_earth_r
plt.imshow(Zgrid,
origin='lower', aspect='auto',
extent=ext,
alpha=0.8,
cmap=earth)
plt.title("Discrete method (no weights)")
plt.savefig("noweights.png")
kde = gaussian_kde(continuous_points, weights=continuous_weights)
x, y = continuous_points
# https://www.oreilly.com/library/view/python-data-science/9781491912126/ch04.html
x_step = int((max(x)-min(x))/resolution)
y_step = int((max(y)-min(y))/resolution)
xgrid = np.linspace(min(x), max(x), x_step+1)
ygrid = np.linspace(min(y), max(y), y_step+1)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
Zgrid = Z.reshape(Xgrid.shape)
ext = [min(x), max(x), min(y), max(y)]
earth = plt.cm.gist_earth_r
plt.imshow(Zgrid,
origin='lower', aspect='auto',
extent=ext,
alpha=0.8,
cmap=earth)
plt.title("Continuous method (weights)")
plt.savefig("weights.png")
Which produces the following plots:
and
An important aspect of a KDE is the bandwidth used. Scipy's gaussian_kde uses "Scott's factor" as a guess for the bandwidth.
In particular, gaussian_kde uses n**(-1./(d+4)) where d is the dimension (2 in this case), and n is:
the number of data points in the case of the non-weighted version;
the "effective number of datapoints" in the case of the weighted version, calculated as neff = sum(weights)^2 / sum(weights^2).
In the example of the post n = 400 and neff = sum(continuous_weights)**2 / sum([w**2 for w in continuous_weights]) = 84.0336.
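To make the difference concrete, the following sketch computes the effective number of points and the resulting Scott's factor by hand. The helper names effective_n and scott_factor and the weight values are illustrative assumptions, not part of scipy.

```python
import numpy as np

def effective_n(weights):
    """Effective number of data points: sum(w)^2 / sum(w^2)."""
    weights = np.asarray(weights, dtype=float)
    return weights.sum() ** 2 / (weights ** 2).sum()

def scott_factor(n, d):
    """Scott's factor n**(-1/(d+4)) for n points in d dimensions."""
    return n ** (-1.0 / (d + 4))

# Equal weights: the effective number equals the number of points.
n_equal = effective_n(np.ones(400))

# Unequal weights (hypothetical counts 18 and 24): the effective number
# is smaller than the sum of the weights, so the factor is larger.
n_unequal = effective_n([18, 24])

print(n_equal, scott_factor(n_equal, d=2))
print(n_unequal, scott_factor(n_unequal, d=2))
```

A smaller effective n gives a larger factor, which is why the weighted version looks smoother unless the bandwidth is fixed explicitly.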
To get the same result, the same bandwidth should be used in both cases. It can be set explicitly as gaussian_kde(..., bw_method=bandwidth).
bandwidth = discrete_points.shape[1]**(-1./(2+4))
# kde without weights
kde = gaussian_kde(discrete_points, bw_method=bandwidth)
# kde for the weighted points
kde = gaussian_kde(continuous_points, weights=continuous_weights, bw_method=bandwidth)
If you plan to create multiple plots, you probably want to use the same bandwidth for all of them, independent of the number of points or the weights. You might want to experiment with the resolution and the bandwidth. A higher bandwidth smooths everything out over a larger distance; a smaller bandwidth is more faithful to the given data.
I want to calculate a simple linear regression where I need to force a particular value for one point. Namely, I have x and y arrays, and I want my regression f(x) to force f(x[-1]) == y[-1] - that is, the prediction over the last element of x should be equal to the last element of y.
Is there a way to do it using Python and scikit-learn?
Here's a slightly roundabout trick that will do it.
Try re-centering your data, i.e. subtract x[-1], y[-1] from all datapoints so that x[-1], y[-1] is now the origin.
Now fit your data using sklearn.linear_model.LinearRegression with fit_intercept set to False. This way, the data is fit so that the line is forced to pass through the origin. Because we've re-centered the data, the origin corresponds to x[-1], y[-1].
When you use the model to make predictions, subtract x[-1] from any datapoint for which you are making a prediction, then add y[-1] to the resulting prediction, and this will give you the same results as forcing your model to pass through x[-1], y[-1].
This is a little roundabout but it's the simplest way that occurs to me to do it using the sklearn linear regression function (without writing your own).
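The re-centering trick described above could look roughly like the sketch below. The data is synthetic, and predict is a hypothetical helper written for this example rather than part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = x + rng.uniform(0, 3, x.size)

# Re-center so that the last point becomes the origin.
x0, y0 = x[-1], y[-1]
X_centered = (x - x0).reshape(-1, 1)  # sklearn expects a 2-D feature array
y_centered = y - y0

# Fit a line forced through the origin of the re-centered data.
reg = LinearRegression(fit_intercept=False).fit(X_centered, y_centered)

def predict(x_new):
    """Shift the inputs, predict, then shift the prediction back."""
    x_new = np.asarray(x_new, dtype=float).reshape(-1, 1)
    return reg.predict(x_new - x0) + y0

# The prediction at the last x value reproduces the last y value.
print(predict([x[-1]])[0], y[-1])
```

The fitted slope is available as reg.coef_[0]; the intercept of the line in the original coordinates is y0 - reg.coef_[0] * x0.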
The suggestion from HappyDog is great as a quick way to get a fit; however, I'd like to introduce another method which doesn't require any manipulation of your data. It uses scipy.optimize.curve_fit to fit your data.
First, we need to realize that a normal linear regression will find A and B such that y=Ax+B provides the best fit to the input data. Your requirements state that the fit must pass through the final point in your sample data set. Essentially we'll be dropping a line that passes through your final point and rotating it around this point until we can minimize the errors.
Take a look at the point-slope equation for a line: y-yi = m*(x-xi) where (xi, yi) is any point on that line. If we make the substitution that this (xi, yi) point is the final point (xf, yf) from your data set and solve for y, we get y=m*(x-xf)+yf. This is the model we will fit.
Translating this model to a Python function, we have:
def model(x, m, xf, yf):
    return m*(x-xf)+yf
We create a mock-data set for this example and just for demonstration purposes we will significantly shift the final y-value:
x = np.linspace(0, 10, 100)
y = x + np.random.uniform(0, 3, len(x))
y[-1] += 10
We're almost ready to perform the fit. The curve_fit function expects a callable function (model) to fit, the x and y data, and a list of the guesses of each parameter we are trying to fit. Since our model accepts two extra "constant" arguments (xf and yf), we use functools.partial to "set" these arguments based on our data.
partial_model = functools.partial(model, xf=x[-1], yf=y[-1])
p0 = [y[-1]/x[-1]] # Initial guess for m, as long as xf != 0
Now we can fit!
best_fit, covar = curve_fit(partial_model, x, y, p0=p0)
print("Best fit:", best_fit)
y_fit = model(x, best_fit[0], x[-1], y[-1])
intercept = model(0, best_fit[0], x[-1], y[-1]) # The y-intercept
And we look at the results:
plt.plot(x, y, "g*") # Input data will be green stars
plt.plot(x, y_fit, "r-") # Fit will be a red line
plt.legend(["Sample Data", f"y=mx+b ; m={best_fit[0]:.4f}, b={intercept:.4f}"])
plt.show()
Putting all this together in one code block and including imports gives:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
import functools
def model(x, m, xf, yf):
    return m*(x-xf)+yf
x = np.linspace(0, 10, 100)
y = x + np.random.uniform(0, 3, len(x))
y[-1] += 10
partial_model = functools.partial(model, xf=x[-1], yf=y[-1])
p0 = [y[-1]/x[-1]] # Initial guess for m, as long as xf != 0
best_fit, covar = curve_fit(partial_model, x, y, p0=p0)
print("Best fit:", best_fit)
y_fit = model(x, best_fit[0], x[-1], y[-1])
intercept = model(0, best_fit[0], x[-1], y[-1]) # The y-intercept
plt.plot(x, y, "g*") # Input data will be green stars
plt.plot(x, y_fit, "r-") # Fit will be a red line
plt.legend(["Sample Data", f"y=mx+b ; m={best_fit[0]:.4f}, b={intercept:.4f}"])
plt.show()
We see a line passing through the final point, as required, and have found the best slope to represent this dataset.
I have two numpy arrays, one is an array of x values and the other an array of y values and together they give me the empirical cdf. E.g.:
plt.plot(xvalues, yvalues)
plt.show()
I assume the data needs to be smoothed somehow in order to give a smooth pdf.
I would like to plot the pdf. How can I do that?
The raw data is at: http://dpaste.com/1HVK5DR .
There are two main problems: your data seems to be quite noisy, and it is not equally spaced. The points at the low end are sampled quite densely, while the points at the high end are sampled quite sparsely. This can cause numerical issues.
So first I suggest resampling the data using a linear interpolation to get equally spaced samples. (Note that all the snippets appended to each other form the content of one Python file.)
import matplotlib.pyplot as plt
import numpy as np
from data import xvalues, yvalues #load data from file
print("#datapoints: {}".format(len(xvalues)))
#don't use every point if your computer is not very fast
xv = np.array(xvalues)[::5]
yv = np.array(yvalues)[::5]
#interpolate to have evenly spaced data
xi = np.linspace(xv.min(), xv.max(), 400)
yi = np.interp(xi, xv, yv)
Then, to smooth the data, I suggest performing an RBF regression (i.e. using an "RBF network"). The idea is to fit a curve of the form
c(t) = sum a(i) * phi(t - x(i)) #(not part of the program)
where phi is some radial basis function. (In theory we could use any function.) To have a very smooth result I choose a very smooth function, namely a Gaussian: phi(x) = exp( - x^2/sigma^2), where sigma is yet to be determined. (Note that the code below actually uses an inverse quadratic kernel, 1/(1+(r/sigma)^2), which is also smooth.) The x(i) are just some nodes that we can define. If we have a smooth function, we just need a few nodes. The number of nodes also determines how much computation needs to be done. The a(i) are the coefficients we can optimize to get the best fit. In this case I just use a least squares approach.
Note that if we can write a function in the form above, it is very easy to compute the derivative; it is just
c'(t) = sum a(i) * phi'(t - x(i)) #(not part of the program)
where phi' is the derivative of phi.
Regarding sigma: It is usually a good idea to choose it as a multiple of the step between the nodes we chose. The greater we choose sigma, the smoother the resulting function gets.
#set up rbf network
rbf_nodes = xv[::50][None, :]#use a subset of the x-values as rbf nodes
print("#rbfs: {}".format(rbf_nodes.shape[1]))
#estimate width of kernels:
sigma = 20 #greater = smoother, this is the primary parameter to play with
sigma *= np.max(np.abs(rbf_nodes[0,1:]-rbf_nodes[0,:-1]))
# kernel & derivative
rbf = lambda r:1/(1+(r/sigma)**2)
Drbf = lambda r: -2*r*sigma**2/(sigma**2 + r**2)**2
#compute coefficients of rbf network
r = np.abs(xi[:, None]-rbf_nodes)
A = rbf(r)
coeffs = np.linalg.lstsq(A, yi, rcond=None)[0]
print(coeffs)
#evaluate rbf network
N=1000
xe = np.linspace(xi.min(), xi.max(), N)
Ae = rbf(xe[:, None] - rbf_nodes)
ye = Ae @ coeffs
#evaluate derivative
N=1000
xd = np.linspace(xi.min(), xi.max(), N)
Bd = Drbf(xe[:, None] - rbf_nodes)
yd = Bd @ coeffs
fig,ax = plt.subplots()
ax2 = ax.twinx()
ax.plot(xv, yv, '-')
ax.plot(xi, yi, '-')
ax.plot(xe, ye, ':')
ax2.plot(xd, yd, '-')
fig.savefig('graph.png')
print('done')
You need the derivative to go from CDF to PDF
PDF(x) = d CDF(x)/ dx
With NumPy, you could use gradient
pdf = np.gradient(yvalues, xvalues)
plt.plot(xvalues, pdf)
plt.show()
or compute the finite differences manually
pdf = np.diff(yvalues)/np.diff(xvalues)
l = np.asarray(xvalues[:-1])
r = np.asarray(xvalues[1:])
plt.plot((l+r)/2.0, pdf) # points in the middle of interval
plt.show()
Both produce similar results.
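As a sanity check of the np.gradient approach on unevenly spaced samples, here is a small sketch using a CDF whose PDF is known exactly. The exponential distribution and the sample spacing are assumptions chosen for illustration; they are not the data from the question, and noisy data would still need smoothing first.

```python
import numpy as np

# Synthetic, unevenly spaced CDF samples of an exponential distribution,
# chosen because its PDF, exp(-x), is known exactly.
x = np.linspace(0, 2, 400) ** 2  # dense near 0, sparse near 4
cdf = 1.0 - np.exp(-x)

# np.gradient accepts the sample coordinates, so uneven spacing is handled.
pdf = np.gradient(cdf, x)

# Compare against the exact PDF.
max_error = np.max(np.abs(pdf - np.exp(-x)))
print(max_error)
```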
I'm trying to obtain a confidence interval on an exponential fit to some x,y data (available here). Here's the MWE I have to find the best exponential fit to the data:
from pylab import *
from scipy.optimize import curve_fit
# Read data.
x, y = np.loadtxt('exponential_data.dat', unpack=True)
def func(x, a, b, c):
    '''Exponential 3-param function.'''
    return a * np.exp(b * x) + c
# Find best fit.
popt, pcov = curve_fit(func, x, y)
print(popt)
# Plot data and best fit curve.
scatter(x, y)
x = linspace(11, 23, 100)
plot(x, func(x, *popt), c='r')
show()
which produces:
How can I obtain the 95% (or some other value) confidence interval on this fit preferably using either pure python, numpy or scipy (which are the packages I already have installed)?
You can use the uncertainties module to do the uncertainty calculations.
uncertainties keeps track of uncertainties and correlation. You can create correlated uncertainties.ufloat directly from the output of curve_fit.
To be able to do those calculations with non-builtin operations such as exp, you need to use the functions from uncertainties.unumpy.
You should also avoid your from pylab import * import. This even overwrites python built-ins such as sum.
A complete example:
import numpy as np
from scipy.optimize import curve_fit
import uncertainties as unc
import matplotlib.pyplot as plt
import uncertainties.unumpy as unp
def func(x, a, b, c):
    '''Exponential 3-param function.'''
    return a * np.exp(b * x) + c
x, y = np.genfromtxt('data.txt', unpack=True)
popt, pcov = curve_fit(func, x, y)
a, b, c = unc.correlated_values(popt, pcov)
# Plot data and best fit curve.
plt.scatter(x, y, s=3, linewidth=0, alpha=0.3)
px = np.linspace(11, 23, 100)
# use unumpy.exp
py = a * unp.exp(b * px) + c
nom = unp.nominal_values(py)
std = unp.std_devs(py)
# plot the nominal value
plt.plot(px, nom, c='r')
# and the 2-sigma uncertainty lines
plt.plot(px, nom - 2 * std, c='c')
plt.plot(px, nom + 2 * std, c='c')
plt.savefig('fit.png', dpi=300)
And the result:
Gabriel's answer is incorrect. Here in red the 95% confidence band for his data as calculated by GraphPad Prism:
Background: the "confidence interval of a fitted curve" is typically called confidence band. For a 95% confidence band, one can be 95% confident that it contains the true curve. (This is different from prediction bands, shown above in gray. Prediction bands are about future data points. For more details, see, e.g., this page of the GraphPad Curve Fitting Guide.)
In Python, kmpfit can calculate the confidence band for non-linear least squares. Here for Gabriel's example:
from pylab import *
from kapteyn import kmpfit
x, y = np.loadtxt('_exp_fit.txt', unpack=True)
def model(p, x):
    a, b, c = p
    return a*np.exp(b*x)+c
f = kmpfit.simplefit(model, [.1, .1, .1], x, y)
print(f.params)
# confidence band
a, b, c = f.params
dfdp = [np.exp(b*x), a*x*np.exp(b*x), 1]
yhat, upper, lower = f.confidence_band(x, dfdp, 0.95, model)
scatter(x, y, marker='.', s=10, color='#0000ba')
ix = np.argsort(x)
for i, l in enumerate((upper, lower, yhat)):
    plot(x[ix], l[ix], c='g' if i == 2 else 'r', lw=2)
show()
The dfdp are the partial derivatives ∂f/∂p of the model f = a*e^(b*x) + c with respect to each parameter p (i.e., a, b, and c). For background, see the kmpfit Tutorial or this page of the GraphPad Curve Fitting Guide. (Unlike my sample code, the kmpfit Tutorial does not use confidence_band() from the library but its own, slightly different, implementation.)
Finally, the Python plot matches the Prism one:
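For readers without kmpfit, the same idea can be sketched with scipy alone, using the partial derivatives and the covariance matrix from curve_fit. This is an illustrative approximation (first-order error propagation with a Student-t quantile), not the kmpfit implementation; the data is synthetic and all variable names are assumptions made for this example.

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

def func(x, a, b, c):
    """Exponential 3-parameter model."""
    return a * np.exp(b * x) + c

# Synthetic data for illustration only (not the data from the question).
rng = np.random.default_rng(0)
x = np.linspace(0, 2, 200)
y = func(x, 1.0, 1.0, 0.5) + rng.normal(0, 0.1, x.size)

popt, pcov = curve_fit(func, x, y, p0=(1.0, 1.0, 0.0))
a, b, c = popt

# Partial derivatives of the model w.r.t. a, b and c, one column each.
J = np.column_stack([np.exp(b * x), a * x * np.exp(b * x), np.ones_like(x)])

# Variance of the fitted curve at each x: the diagonal of J @ pcov @ J.T.
var_fit = np.einsum("ij,jk,ik->i", J, pcov, J)

# Two-sided 95% Student-t quantile with n - p degrees of freedom.
dof = x.size - popt.size
tval = stats.t.ppf(0.975, dof)

yhat = func(x, *popt)
upper = yhat + tval * np.sqrt(var_fit)
lower = yhat - tval * np.sqrt(var_fit)
```

The arrays lower and upper can then be plotted against x in the same way as the kmpfit output.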
Notice: the actual answer to obtaining the fitted curve's confidence interval is given by Ulrich here.
After some research (see here, here and 1.96) I came up with my own solution.
It accepts an arbitrary X% confidence interval and plots upper and lower curves.
Here's the MWE:
from pylab import *
from scipy.optimize import curve_fit
from scipy import stats
def func(x, a, b, c):
    '''Exponential 3-param function.'''
    return a * np.exp(b * x) + c
# Read data.
x, y = np.loadtxt('exponential_data.dat', unpack=True)
# Define confidence interval.
ci = 0.95
# Convert to percentile point of the normal distribution.
# See: https://en.wikipedia.org/wiki/Standard_score
pp = (1. + ci) / 2.
# Convert to number of standard deviations.
nstd = stats.norm.ppf(pp)
print(nstd)
# Find best fit.
popt, pcov = curve_fit(func, x, y)
# Standard deviation errors on the parameters.
perr = np.sqrt(np.diag(pcov))
# Add nstd standard deviations to parameters to obtain the upper confidence
# interval.
popt_up = popt + nstd * perr
popt_dw = popt - nstd * perr
# Plot data and best fit curve.
scatter(x, y)
x = linspace(11, 23, 100)
plot(x, func(x, *popt), c='g', lw=2.)
plot(x, func(x, *popt_up), c='r', lw=2.)
plot(x, func(x, *popt_dw), c='r', lw=2.)
text(12, 0.5, '{}% confidence interval'.format(ci * 100.))
show()
curve_fit() returns the covariance matrix, pcov, whose diagonal holds the estimated variances of the parameters; the square roots of the diagonal are the 1-sigma uncertainties. This assumes errors are normally distributed, which is sometimes questionable.
You might also consider using the lmfit package (pure python, built on top of scipy), which provides a wrapper around scipy.optimize fitting routines (including leastsq(), which is what curve_fit() uses) and can, among other things, calculate confidence intervals explicitly.
I've always been a fan of simple bootstrapping to get confidence intervals. If you have n data points, then use the random package to select n points from your data WITH REPLACEMENT (i.e. allow your program to get the same point multiple times if that's what it wants to do - very important). Once you've done that, plot the resampled points and get the best fit. Do this 10,000 times, getting a new fit line each time. Then your 95% confidence interval is the pair of lines that enclose 95% of the best fit lines you made.
It's a pretty easy method to program in Python, but it's a bit unclear how this would work out from a statistical point of view. Some more information on why you want to do this would probably lead to more appropriate answers for your task.
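A rough sketch of this bootstrap idea follows. It uses numpy's random generator instead of the random package, synthetic data, and only 200 resamples to keep it quick (all assumptions for illustration). The band is taken as pointwise percentiles of the refitted curves, which is a common simplification of "the pair of lines that enclose 95% of the fits".

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    """Exponential 3-parameter model."""
    return a * np.exp(b * x) + c

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0, 2, 100)
y = func(x, 1.0, 1.0, 0.5) + rng.normal(0, 0.1, x.size)

n_boot = 200  # illustrative; the text suggests 10,000
grid = np.linspace(x.min(), x.max(), 50)
curves = []
for _ in range(n_boot):
    # Draw n indices with replacement, so points can repeat.
    idx = rng.integers(0, x.size, x.size)
    try:
        p, _cov = curve_fit(func, x[idx], y[idx], p0=(1.0, 1.0, 0.0))
    except RuntimeError:
        continue  # skip resamples where the fit did not converge
    curves.append(func(grid, *p))

curves = np.array(curves)

# Pointwise 2.5th and 97.5th percentiles of the refitted curves.
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
```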