How do you fit a polynomial to a data set? - python

I'm working on two functions. I have two data sets, eg [[x(1), y(1)], ..., [x(n), y(n)]], dataSet and testData.
createMatrix(D, S) which returns a data matrix, where D is the degree and S is a vector of real numbers [s(1), s(2), ..., s(n)].
I know numpy has a function called polyfit. But polyfit takes in three variables, any advice on how I'd create the matrix?
polyFit(D), which takes in the polynomial of degree D and fits it to the data sets using linear least squares. I'm trying to return the weight vector and errors. I also know that there is lstsq in numpy.linag that I found in this question: Fitting polynomials to data
Is it possible to use that question to recreate what I'm trying?
This is what I have so far, but it isn't working.
def createMatrix(D, S):
x = []
y = []
for i in dataSet:
x.append(i[0])
y.append(i[1])
polyfit(x, y, D)
What I don't get here is what does S, the vector of real numbers, have to do with this?
def polyFit(D)
I'm basing a lot of this on the question posted above. I'm unsure about how to get just w though, the weight vector. I'll be coding the errors, so that's fine I was just wondering if you have any advice on getting the weight vectors themselves.

It looks like all createMatrix is doing is creating the two vectors required by polyfit. What you have will work, but, the more pythonic way to do it is
def createMatrix(dataSet, D):
D = 3 # set this to whatever degree you're trying
x, y = zip(*dataSet)
return polyfit(x, y, D)
(This S/O link provides a detailed explanation of the zip(*dataSet) idiom.)
This will return a vector of coefficients that you can then pass to something like poly1d to generate results. (Further explanation of both polyfit and poly1d can be found here.)
Obviously, you'll need to decide what value you want for D. The simple answer to that is 1, 2, or 3. Polynomials of higher order than cubic tend to be rather unstable and the intrinsic errors make their output rather meaningless.
It sounds like you might be trying to do some sort of correlation analysis (i.e., does y vary with x and, if so, to what extent?) You'll almost certainly want to just use linear (D = 1) regression for this type of analysis. You can try to do a least squares quadratic fit (D = 2) but, again, the error bounds are probably wider than your assumptions (e.g. normality of distribution) will tolerate.

Related

Exponential fit with the least squares Python

I have a very specific task, where I need to find the slope of my exponential function.
I have two arrays, one denoting the wavelength range between 400 and 750 nm, the other the absorption spectrum. x = wavelengths, y = absorption.
My fit function should look something like that:
y_mod = np.float(a_440) * np.exp(-S*(x - 440.))
where S is the slope and in the image equals 0.016, which should be in the range of S values I should get (+/- 0.003). a_440 is the reference absorption at 440 nm, x is the wavelength.
Modelled vs. original plot:
I would like to know how to define my function in order to get an exponential fit (not on log transformed quantities) of it without guessing beforehand what the S value is.
What I've tried so far was to define the function in such way:
def func(x, a, b):
return a * np.exp(-b * (x-440))
And it gives pretty nice matches
fitted vs original.
What I'm not sure is whether this approach is correct or should I do it differently?
How would one use also the least squares or the absolute differences in y approaches for minimization in order to remove the effect of overliers?
Is it possible to also add random noise to the data and recompute the fit?
Your situation is the same as the one described in the documentation for scipy's curve_fit.
The problem you're incurring is that your definition of the function accepts only one argument when it should receive three: x (the independent variable where the function is evaluated), plus a_440 and S.
Cleaning a bit, the function should be more like this.
def func(x, A, S):
return A*np.exp(-S*(x-440.))
It might be that you run into a warning about the covariance matrix. you solve that by providing a decent starting point to the curve_fit through the argument p0 and providing a list. For example in this case p0=[1,0.01] and in the fitting call it would look like the following
curve_fit(func, x, y, p0=[1,0.01])

Reducing difference between two graphs by optimizing more than one variable in MATLAB/Python?

Suppose 'h' is a function of x,y,z and t and it gives us a graph line (t,h) (simulated). At the same time we also have observed graph (observed values of h against t). How can I reduce the difference between observed (t,h) and simulated (t,h) graph by optimizing values of x,y and z? I want to change the simulated graph so that it imitates closer and closer to the observed graph in MATLAB/Python. In literature I have read that people have done same thing by Lavenberg-marquardt algorithm but don't know how to do it?
You are actually trying to fit the parameters x,y,z of the parametrized function h(x,y,z;t).
MATLAB
You're right that in MATLAB you should either use lsqcurvefit of the Optimization toolbox, or fit of the Curve Fitting Toolbox (I prefer the latter).
Looking at the documentation of lsqcurvefit:
x = lsqcurvefit(fun,x0,xdata,ydata);
It says in the documentation that you have a model F(x,xdata) with coefficients x and sample points xdata, and a set of measured values ydata. The function returns the least-squares parameter set x, with which your function is closest to the measured values.
Fitting algorithms usually need starting points, some implementations can choose randomly, in case of lsqcurvefit this is what x0 is for. If you have
h = #(x,y,z,t) ... %// actual function here
t_meas = ... %// actual measured times here
h_meas = ... %// actual measured data here
then in the conventions of lsqcurvefit,
fun <--> #(params,t) h(params(1),params(2),params(3),t)
x0 <--> starting guess for [x,y,z]: [x0,y0,z0]
xdata <--> t_meas
ydata <--> h_meas
Your function h(x,y,z,t) should be vectorized in t, such that for vector input in t the return value is the same size as t. Then the call to lsqcurvefit will give you the optimal set of parameters:
x = lsqcurvefit(#(params,t) h(params(1),params(2),params(3),t),[x0,y0,z0],t_meas,h_meas);
h_fit = h(x(1),x(2),x(3),t_meas); %// best guess from curve fitting
Python
In python, you'd have to use the scipy.optimize module, and something like scipy.optimize.curve_fit in particular. With the above conventions you need something along the lines of this:
import scipy.optimize as opt
popt,pcov = opt.curve_fit(lambda t,x,y,z: h(x,y,z,t), t_meas, y_meas, p0=[x0,y0,z0])
Note that the p0 starting array is optional, but all parameters will be set to 1 if it's missing. The result you need is the popt array, containing the optimal values for [x,y,z]:
x,y,z = popt
h_fit = h(x,y,z,t_meas)

Finding complex roots from set of non-linear equations in python

I have been testing an algorithm that has been published in literature that involves solving a set of 'm' non-linear equations in both Matlab and Python. The set of non-linear equations involves input variables that contain complex numbers, and therefore the resulting solutions should also be complex. As of now, I have been able to get pretty good results in Matlab by using the following lines of code:
lambdas0 = ones(1,m)*1e-5;
options = optimset('Algorithm','levenberg-marquardt',...
'MaxFunEvals',1000000,'MaxIter',10000,'TolX',1e-20,...
'TolFun',1e-20);
Eq = #(lambda)maxentfun(lambda,m,h,g);
[lambdasf] = fsolve(Eq,lambdas0,options);
where h and g are a complex matrix and vector, respectively. The solution converges very well for a wide range of initial values.
I have been trying to mimic these results in Python with very little success however. The numerical solvers seem to be set up much differently, and the 'levenburg-marquardt' algorithm exists under the function root. In python this algorithm cannot handle complex roots, and when I run the following lines:
lambdas0 = np.ones(m)*1e-5
sol = root(maxentfun, lambdas0, args = (m,h,g), method='lm', tol = 1e-20, options = {'maxiter':10000, 'xtol':1e-20})
lambdasf = sol.x
I get the following error:
minpack.error: Result from function call is not a proper array of floats.
I have tried using some of the other algorithms, such as 'broyden2' and 'anderson', but they are much inferior to Matlab, and only give okay results after playing around with the initial conditions. The function 'fsolve' also cannot handle complex variables either.
I was wondering if there is something I am applying incorrectly, and if anybody has an idea on maybe how to properly solve complex non-linear equations in Python.
Thank you very much
When I encounter this type of problem I try to rewrite my function as an array of real and imaginary parts. For example, if f is your function which takes complex input array x (say x has size 2, for simplicity)
from numpy import *
def f(x):
# Takes a complex-valued vector of size 2 and outputs a complex-valued vector of size 2
return [x[0]-3*x[1]+1j+2, x[0]+x[1]] # <-- for example
def real_f(x1):
# converts a real-valued vector of size 4 to a complex-valued vector of size 2
# outputs a real-valued vector of size 4
x = [x1[0]+1j*x1[1],x1[2]+1j*x1[3]]
actual_f = f(x)
return [real(actual_f[0]),imag(actual_f[0]),real(actual_f[1]),imag(actual_f[1])]
The new function, real_f can be used in fsolve: the real and imaginary parts of the function are simultaneously solved for, treating the real and imaginary parts of the input argument as independent.
Here append() and extend() methods can be used to make it automatic and easily extendable to N number of variables
def real_eqns(y1):
y=[]
for i in range(N):
y.append(y1[2*i+0]+1j*y1[2*i+1])
real_eqns1 = eqns(y)
real_eqns=[]
for i in range(N):
real_eqns.extend([real_eqns1[i].real,real_eqns1[i].imag])
return real_eqns

Derivative of an array in python?

Currently I have two numpy arrays: x and y of the same size.
I would like to write a function (possibly calling numpy/scipy... functions if they exist):
def derivative(x, y, n = 1):
# something
return result
where result is a numpy array of the same size of x and containing the value of the n-th derivative of y regarding to x (I would like the derivative to be evaluated using several values of y in order to avoid non-smooth results).
This is not a simple problem, but there are a lot of methods that have been devised to handle it. One simple solution is to use finite difference methods. The command numpy.diff() uses finite differencing where you can specify the order of the derivative.
Wikipedia also has a page that lists the needed finite differencing coefficients for different derivatives of different accuracies. If the numpy function doesn't do what you want.
Depending on your application you can also use scipy.fftpack.diff which uses a completely different technique to do the same thing. Though your function needs a well defined Fourier transform.
There are lots and lots and lots of variants (e.g. summation by parts, finite differencing operators, or operators designed to preserve known evolution constants in your system of equations) on both of the two ideas above. What you should do will depend a great deal on what the problem is that you are trying to solve.
The good thing is that there is a lot of work has been done in this field. The Wikipedia page for Numerical Differentiation has some resources (though it is focused on finite differencing techniques).
The findiff project is a Python package that can do derivatives of arrays of any dimension with any desired accuracy order (of course depending on your hardware restrictions). It can handle arrays on uniform as well as non-uniform grids and also create generalizations of derivatives, i.e. general linear combinations of partial derivatives with constant and variable coefficients.
Would something like this solve your problem?
def get_inflection_points(arr, n=1):
"""
returns inflextion points from array
arr: array
n: n-th discrete difference
"""
inflections = []
dx = 0
for i, x in enumerate(np.diff(arr, n)):
if x >= dx and i > 0:
inflections.append(i*n)
dx = x
return inflections

Generalized least square on large dataset

I'd like to linearly fit the data that were NOT sampled independently. I came across generalized least square method:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
The equation is Matlab format; X and Y are coordinates of the data points, and V is a "variance matrix".
The problem is that due to its size (1000 rows and columns), the V matrix becomes singular, thus un-invertable. Any suggestions for how to get around this problem? Maybe using a way of solving generalized linear regression problem other than GLS? The tools that I have available and am (slightly) familiar with are Numpy/Scipy, R, and Matlab.
Instead of:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
Use
b= (X'/V *X)\X'/V*Y
That is, replace all instances of X*(Y^-1) with X/Y. Matlab will skip calculating the inverse (which is hard, and error prone) and compute the divide directly.
Edit: Even with the best matrix manipulation, some operations are not possible (for example leading to errors like you describe).
An example of that which may be relevant to your problem is if try to solve least squares problem under the constraint the multiple measurements are perfectly, 100% correlated. Except in rare, degenerate cases this cannot be accomplished, either in math or physically. You need some independence in the measurements to account for measurement noise or modeling errors. For example, if you have two measurements, each with a variance of 1, and perfectly correlated, then your V matrix would look like this:
V = [1 1; ...
1 1];
And you would never be able to fit to the data. (This generally means you need to reformulate your basis functions, but that's a longer essay.)
However, if you adjust your measurement variance to allow for some small amount of independence between the measurements, then it would work without a problem. For example, 95% correlated measurements would look like this
V = [1 0.95; ...
0.95 1 ];
You can use singular value decomposition as your solver. It'll do the best that can be done.
I usually think about least squares another way. You can read my thoughts here:
http://www.scribd.com/doc/21983425/Least-Squares-Fit
See if that works better for you.
I don't understand how the size is an issue. If you have N (x, y) pairs you still only have to solve for (M+1) coefficients in an M-order polynomial:
y = a0 + a1*x + a2*x^2 + ... + am*x^m

Categories