Fit a given function - python

I have a basic question. I want to use scikit-learn to fit a polynomial model to my data. I could do that by PolynomialFeatures but I want to fit a polynomial with some specific form.
For example, if I have 2 features I want to create a model such that:
F = a1 * x1 + a2 * x2 + a3 * x1 * x2 + a4 * x1^2 + a5 * x2^3
Can You please guide me how can I do that? I could not find any example that I can use for my purpose.

I've used the following method to try and fit curves that map to specific functions, you might be able to rework some of this to meet your needs:
First, define your model function as a function that takes a value of x, and some set of parameters and which return an associated y value.
You'll need to be sure that your function really is a function in the mathematical sense (i.e. that they return a single value of y for any input value of x)
This is your "model" function - for example :
# A kind of elliptic curve with two parameters n, m
def mn_elliptical(x,m,n):
return 1-((1 - (x)**(n))**(m))
For 2-dimensional models (i.e. where inputs are x and y, and there's a third output of z) there are ways of formulating your model and input data discussed here: https://scipython.com/blog/non-linear-least-squares-fitting-of-a-two-dimensional-data/ - also, for an example of this in practice, see end of this answer.
Then, using the scipy.optimize.curve_fit method, you need to feed it a pair of arrays, one for the x's and one for the y's of the known observations you have collected, against which the fitting will take place.
xdata = [ ... all x values for all observations ... ]
ydata = [ ... all observed y values in the same order as above ... ]
from scipy.optimize import curve_fit
fitp, fite = curve_fit(mn_elliptical, xdata, ydata))
This will yield fitp the optimal parameters fit by the method, and fite an output describing how much (least squares) error remains after having performed the fit. If your fite values are too big, then it's likely your model function isn't a good one.
You can help guide the process by helping set the expected bounds of the parameters you want to return, and this can speed things up significantly - or, if you've got a skewy function, it can help focus in on the right values that would otherwise get missed. These details are covered in more depth in the linked scipy docs.
Having validated and accepted the amount of error, you can can then retrieve the parameters from fitp and use these to generate additional values of x through your (now fitted) model, and get predicted results.
new_y = mn_elliptical(x, *fitp)
Which will yield a single result - use more advanced numpy/pandas methods to generate multiple results from arrays of x values that you supply.
Just to demonstrate that 2-dimensional use-case, let's imagine a crudely plotted circle, with points A,B,C,D,E at the following xy coordinates (4,1), (6.5,3.5), (4,6), (1.5,3.5), (2,2)
We know that a circle follows the formula (x-cx)^2+(y-cy)^2=r^2, so can write that in a fittable function form:
def circ(xy, cx, cy, r):
x,y = xy
return (((x-cx)**2) + ((y-cy)**2))-(r**2)
Notice that I've flattened the return value to always be zero (at least for values of xy that conform) due to the nature of the formula.
We use the observed data points laid out here:
xdata = np.array([4,6.5,4,1.5,2])
ydata = np.array([1,3.5,6,3.5,2])
zdata = np.array([0,0,0,0,0])
And transform that data based on the method used in the linked article on 2-dimensions.
xdata = np.vstack((xdata.ravel(), ydata.ravel()))
ydata = zdata.ravel()
And then feed this into the 2-d circ function.
curve_fit(circ,xdata,ydata)
This yields:
(array([ 4. , 3.5, -2.5]),
array([[ 0., -0., -0.],
[-0., 0., -0.],
[-0., -0., 0.]]))
The first part of which describes the (cx, cy, r) parameters from my fit circ function, the first two being x,y coordinates of the centre, and the third being the radius of the circle. Based on my pencil on paper drawing, this is pretty much on the money.
The second part describes the errors encountered, which for this hand-drawn example are not horrendous.

Related

How do you fit a polynomial to a data set?

I'm working on two functions. I have two data sets, eg [[x(1), y(1)], ..., [x(n), y(n)]], dataSet and testData.
createMatrix(D, S) which returns a data matrix, where D is the degree and S is a vector of real numbers [s(1), s(2), ..., s(n)].
I know numpy has a function called polyfit. But polyfit takes in three variables, any advice on how I'd create the matrix?
polyFit(D), which takes in the polynomial of degree D and fits it to the data sets using linear least squares. I'm trying to return the weight vector and errors. I also know that there is lstsq in numpy.linag that I found in this question: Fitting polynomials to data
Is it possible to use that question to recreate what I'm trying?
This is what I have so far, but it isn't working.
def createMatrix(D, S):
x = []
y = []
for i in dataSet:
x.append(i[0])
y.append(i[1])
polyfit(x, y, D)
What I don't get here is what does S, the vector of real numbers, have to do with this?
def polyFit(D)
I'm basing a lot of this on the question posted above. I'm unsure about how to get just w though, the weight vector. I'll be coding the errors, so that's fine I was just wondering if you have any advice on getting the weight vectors themselves.
It looks like all createMatrix is doing is creating the two vectors required by polyfit. What you have will work, but, the more pythonic way to do it is
def createMatrix(dataSet, D):
D = 3 # set this to whatever degree you're trying
x, y = zip(*dataSet)
return polyfit(x, y, D)
(This S/O link provides a detailed explanation of the zip(*dataSet) idiom.)
This will return a vector of coefficients that you can then pass to something like poly1d to generate results. (Further explanation of both polyfit and poly1d can be found here.)
Obviously, you'll need to decide what value you want for D. The simple answer to that is 1, 2, or 3. Polynomials of higher order than cubic tend to be rather unstable and the intrinsic errors make their output rather meaningless.
It sounds like you might be trying to do some sort of correlation analysis (i.e., does y vary with x and, if so, to what extent?) You'll almost certainly want to just use linear (D = 1) regression for this type of analysis. You can try to do a least squares quadratic fit (D = 2) but, again, the error bounds are probably wider than your assumptions (e.g. normality of distribution) will tolerate.

Generating 3D Gaussian Data [duplicate]

This question already has answers here:
Generating 3D Gaussian distribution in Python
(2 answers)
Closed 2 years ago.
I'm trying to generate a 3D distribution, where x, y represents the surface plane, and z is the magnitude of some value, distributed over a range.
I'm looking at numpy's multivariate_normal, but it only lets me get a number of samples. I'd like the ability to specify some x, y coordinate, and get back what the z value should be; so I'd be able to query gp(x, y) and get back a z value that adheres to some mean and covariance.
Perhaps a more illustrative (toy) example: assume I have some temperature distribution that can be modeled as a gaussian process. So I might have a mean temperature of 20 at (0, 0), and some covariance [[1, 0], [0, 1]]. I'd like to be able to create a model that I can then query at different x, y locations to get the temperature at that position (so, at (5, 5) I might get back something like 7 degrees).
How to best accomplish this?
I assume that your data can be copied to a single np.array, which I will refer to as X in my code, with shape X.shape = (n,2), where n is the number of data points you have and you can have n = 1, if you wish to test a single point at a time. 2, of course, refers to the 2D space spanned by your coordinates (x and y) base. Then:
def estimate_gaussian(X):
return X.mean(axis=0), np.cov(X.T)
def mva_gaussian( X, mu, sigma2 ):
k = len(mu)
# check if sigma2 is a vector and, if yes, use as the diagonal of the covariance matrix
if sigma2.ndim == 1 :
sigma2 = np.diag(sigma2)
X = X - mu
return (2 * np.pi)**(-k/2) * np.linalg.det(sigma2)**(-0.5) * \
np.exp( -0.5 * np.sum( np.multiply( X.dot( np.linalg.inv(sigma2) ), X ), axis=1 ) ).reshape( ( X.shape[0], 1 ) )
will do what you want - that is, given data points you will get the value of the gaussian function at those points (or a single point). This is actually a generalized version of what you need, as this function can describe a multivariate gaussian. You seem to be interested in the k = 2 case and a diagonal covariance matrix sigma2.
Moreover, this is also a probability distribution - which you say you don't want. We don't have enough info to know what exactly it is you're trying to fit to (i.e. what you expect the three parameters of the gaussian function to be. Usually, people are interested in a normal distribution). Nevertheless, you can simply change the parameters in the return statement of the mva_gaussian function according to your needs and ignore the estimate gaussian function if you don't want a normalized distribution (although a normalized function would still give you what you seek - a real valued temperature - as long as you know the normalization process - which you do :-) ).
You can create a multivariate normal using scipy.stats.multivariate_normal.
>>> import scipy.stats
>>> dist = scipy.stats.multivariate_normal(mean=[2,3], cov=[[1,0],
[0,1]])
Then to find p(x,y) you can use pdf
>>> dist.pdf([2,3])
0.15915494309189535
>>> dist.pdf([1,1])
0.013064233284684921
Which represents the probability (which you called z) given any [x,y]

Reducing difference between two graphs by optimizing more than one variable in MATLAB/Python?

Suppose 'h' is a function of x,y,z and t and it gives us a graph line (t,h) (simulated). At the same time we also have observed graph (observed values of h against t). How can I reduce the difference between observed (t,h) and simulated (t,h) graph by optimizing values of x,y and z? I want to change the simulated graph so that it imitates closer and closer to the observed graph in MATLAB/Python. In literature I have read that people have done same thing by Lavenberg-marquardt algorithm but don't know how to do it?
You are actually trying to fit the parameters x,y,z of the parametrized function h(x,y,z;t).
MATLAB
You're right that in MATLAB you should either use lsqcurvefit of the Optimization toolbox, or fit of the Curve Fitting Toolbox (I prefer the latter).
Looking at the documentation of lsqcurvefit:
x = lsqcurvefit(fun,x0,xdata,ydata);
It says in the documentation that you have a model F(x,xdata) with coefficients x and sample points xdata, and a set of measured values ydata. The function returns the least-squares parameter set x, with which your function is closest to the measured values.
Fitting algorithms usually need starting points, some implementations can choose randomly, in case of lsqcurvefit this is what x0 is for. If you have
h = #(x,y,z,t) ... %// actual function here
t_meas = ... %// actual measured times here
h_meas = ... %// actual measured data here
then in the conventions of lsqcurvefit,
fun <--> #(params,t) h(params(1),params(2),params(3),t)
x0 <--> starting guess for [x,y,z]: [x0,y0,z0]
xdata <--> t_meas
ydata <--> h_meas
Your function h(x,y,z,t) should be vectorized in t, such that for vector input in t the return value is the same size as t. Then the call to lsqcurvefit will give you the optimal set of parameters:
x = lsqcurvefit(#(params,t) h(params(1),params(2),params(3),t),[x0,y0,z0],t_meas,h_meas);
h_fit = h(x(1),x(2),x(3),t_meas); %// best guess from curve fitting
Python
In python, you'd have to use the scipy.optimize module, and something like scipy.optimize.curve_fit in particular. With the above conventions you need something along the lines of this:
import scipy.optimize as opt
popt,pcov = opt.curve_fit(lambda t,x,y,z: h(x,y,z,t), t_meas, y_meas, p0=[x0,y0,z0])
Note that the p0 starting array is optional, but all parameters will be set to 1 if it's missing. The result you need is the popt array, containing the optimal values for [x,y,z]:
x,y,z = popt
h_fit = h(x,y,z,t_meas)

Fast 3D interpolation of atmospheric data in Numpy/Scipy

I am trying to interpolate 3D atmospheric data from one vertical coordinate to another using Numpy/Scipy. For example, I have cubes of temperature and relative humidity, both of which are on constant, regular pressure surfaces. I want to interpolate the relative humidity to constant temperature surface(s).
The exact problem I am trying to solve has been asked previously here, however, the solution there is very slow. In my case, I have approximately 3M points in my cube (30x321x321), and that method takes around 4 minutes to operate on one set of data.
That post is nearly 5 years old. Do newer versions of Numpy/Scipy perhaps have methods that handle this faster? Maybe new sets of eyes looking at the problem have a better approach? I'm open to suggestions.
EDIT:
Slow = 4 minutes for one set of data cubes. I'm not sure how else I can quantify it.
The code being used...
def interpLevel(grid,value,data,interp='linear'):
"""
Interpolate 3d data to a common z coordinate.
Can be used to calculate the wind/pv/whatsoever values for a common
potential temperature / pressure level.
grid : numpy.ndarray
The grid. For example the potential temperature values for the whole 3d
grid.
value : float
The common value in the grid, to which the data shall be interpolated.
For example, 350.0
data : numpy.ndarray
The data which shall be interpolated. For example, the PV values for
the whole 3d grid.
kind : str
This indicates which kind of interpolation will be done. It is directly
passed on to scipy.interpolate.interp1d().
returns : numpy.ndarray
A 2d array containing the *data* values at *value*.
"""
ret = np.zeros_like(data[0,:,:])
for yIdx in xrange(grid.shape[1]):
for xIdx in xrange(grid.shape[2]):
# check if we need to flip the column
if grid[0,yIdx,xIdx] > grid[-1,yIdx,xIdx]:
ind = -1
else:
ind = 1
f = interpolate.interp1d(grid[::ind,yIdx,xIdx], \
data[::ind,yIdx,xIdx], \
kind=interp)
ret[yIdx,xIdx] = f(value)
return ret
EDIT 2:
I could share npy dumps of sample data, if anyone was interested enough to see what I am working with.
Since this is atmospheric data, I imagine that your grid does not have uniform spacing; however if your grid is rectilinear (such that each vertical column has the same set of z-coordinates) then you have some options.
For instance, if you only need linear interpolation (say for a simple visualization), you can just do something like:
# Find nearest grid point
idx = grid[:,0,0].searchsorted(value)
upper = grid[idx,0,0]
lower = grid[idx - 1, 0, 0]
s = (value - lower) / (upper - lower)
result = (1-s) * data[idx - 1, :, :] + s * data[idx, :, :]
(You'll need to add checks for value being out of range, of course).For a grid your size, this will be extremely fast (as in tiny fractions of a second)
You can pretty easily modify the above to perform cubic interpolation if need be; the challenge is in picking the correct weights for non-uniform vertical spacing.
The problem with using scipy.ndimage.map_coordinates is that, although it provides higher order interpolation and can handle arbitrary sample points, it does assume that the input data be uniformly spaced. It will still produce smooth results, but it won't be a reliable approximation.
If your coordinate grid is not rectilinear, so that the z-value for a given index changes for different x and y indices, then the approach you are using now is probably the best you can get without a fair bit of analysis of your particular problem.
UPDATE:
One neat trick (again, assuming that each column has the same, not necessarily regular, coordinates) is to use interp1d to extract the weights doing something like follows:
NZ = grid.shape[0]
zs = grid[:,0,0]
ident = np.identity(NZ)
weight_func = interp1d(zs, ident, 'cubic')
You only need to do the above once per grid; you can even reuse weight_func as long as the vertical coordinates don't change.
When it comes time to interpolate then, weight_func(value) will give you the weights, which you can use to compute a single interpolated value at (x_idx, y_idx) with:
weights = weight_func(value)
interp_val = np.dot(data[:, x_idx, y_idx), weights)
If you want to compute a whole plane of interpolated values, you can use np.inner, although since your z-coordinate comes first, you'll need to do:
result = np.inner(data.T, weights).T
Again, the computation should be practically immediate.
This is quite an old question but the best way to do this nowadays is to use MetPy's interpolate_1d funtion:
https://unidata.github.io/MetPy/latest/api/generated/metpy.interpolate.interpolate_1d.html
There is a new implementation of Numba accelerated interpolation on regular grids in 1, 2, and 3 dimensions:
https://github.com/dbstein/fast_interp
Usage is as follows:
from fast_interp import interp2d
import numpy as np
nx = 50
ny = 37
xv, xh = np.linspace(0, 1, nx, endpoint=True, retstep=True)
yv, yh = np.linspace(0, 2*np.pi, ny, endpoint=False, retstep=True)
x, y = np.meshgrid(xv, yv, indexing='ij')
test_function = lambda x, y: np.exp(x)*np.exp(np.sin(y))
f = test_function(x, y)
test_x = -xh/2.0
test_y = 271.43
fa = test_function(test_x, test_y)
interpolater = interp2d([0,0], [1,2*np.pi], [xh,yh], f, k=5, p=[False,True], e=[1,0])
fe = interpolater(test_x, test_y)

Linear fit including all errors with NumPy/SciPy

I have a lot of x-y data points with errors on y that I need to fit non-linear functions to. Those functions can be linear in some cases, but are more usually exponential decay, gauss curves and so on. SciPy supports this kind of fitting with scipy.optimize.curve_fit, and I can also specify the weight of each point. This gives me weighted non-linear fitting which is great. From the results, I can extract the parameters and their respective errors.
There is just one caveat: The errors are only used as weights, but not included in the error. If I double the errors on all of my data points, I would expect that the uncertainty of the result increases as well. So I built a test case (source code) to test this.
Fit with scipy.optimize.curve_fit gives me:
Parameters: [ 1.99900756 2.99695535]
Errors: [ 0.00424833 0.00943236]
Same but with 2 * y_err:
Parameters: [ 1.99900756 2.99695535]
Errors: [ 0.00424833 0.00943236]
Same but with 2 * y_err:
So you can see that the values are identical. This tells me that the algorithm does not take those into account, but I think the values should be different.
I read about another fit method here as well, so I tried to fit with scipy.odr as well:
Beta: [ 2.00538124 2.95000413]
Beta Std Error: [ 0.00652719 0.03870884]
Same but with 20 * y_err:
Beta: [ 2.00517894 2.9489472 ]
Beta Std Error: [ 0.00642428 0.03647149]
The values are slightly different, but I do think that this accounts for the increase in the error at all. I think that this is just rounding errors or a little different weighting.
Is there some package that allows me to fit the data and get the actual errors? I have the formulas here in a book, but I do not want to implement this myself if I do not have to.
I have now read about linfit.py in another question. This handles what I have in mind quite well. It supports both modes, and the first one is what I need.
Fit with linfit:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00772283 0.04449971]
Same but with 20 * y_err:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.15445662 0.88999413]
Fit with linfit(relsigma=True):
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00622595 0.03587451]
Same but with 20 * y_err:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00622595 0.03587451]
Should I answer my question or just close/delete it now?
One way that works well and actually gives a better result is the bootstrap method. When data points with errors are given, one uses a parametric bootstrap and let each x and y value describe a Gaussian distribution. Then one will draw a point from each of those distributions and obtains a new bootstrapped sample. Performing a simple unweighted fit gives one value for the parameters.
This process is repeated some 300 to a couple thousand times. One will end up with a distribution of the fit parameters where one can take mean and standard deviation to obtain value and error.
Another neat thing is that one does not obtain a single fit curve as a result, but lots of them. For each interpolated x value one can again take mean and standard deviation of the many values f(x, param) and obtain an error band:
Further steps in the analysis are then performed again hundreds of times with the various fit parameters. This will then also take into account the correlation of the fit parameters as one can see clearly in the plot above: Although a symmetric function was fitted to the data, the error band is asymmetric. This will mean that interpolated values on the left have a larger uncertainty than on the right.
Please note that, from the documentation of curvefit:
sigma : None or N-length sequence
If not None, this vector will be used as relative weights in the
least-squares problem.
The key point here is as relative weights, therefore, yerr in line 53 and 2*yerr in 57 should give you similar, if not the same result.
When you increase the actually residue error, you will see the values in the covariance matrix grow large. Say if we change the y += random to y += 5*random in function generate_data():
Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 1.92810458, 3.97843448]))
('Errors: ', array([ 0.09617346, 0.64127574]))
Compares to the original result:
Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 2.00760386, 2.97817514]))
('Errors: ', array([ 0.00782591, 0.02983339]))
Also notice that the parameter estimate is now further off from (2,3), as we would expect from increased residue error and larger confidence interval of parameter estimates.
Short answer
For absolute values that include uncertainty in y (and in x for odr case):
In the scipy.odr case use stddev = numpy.sqrt(numpy.diag(cov))
where the cov is the covariance matrix odr gives in the output.
In the scipy.optimize.curve_fit case use absolute_sigma=True
flag.
For relative values (excludes uncertainty):
In the scipy.odr case use the sd value from the output.
In the scipy.optimize.curve_fit case use absolute_sigma=False flag.
Use numpy.polyfit like this:
p, cov = numpy.polyfit(x, y, 1,cov = True)
errorbars = numpy.sqrt(numpy.diag(cov))
Long answer
There is some undocumented behavior in all of the functions. My guess is that the functions mixing relative and absolute values. At the end this answer is the code that either gives what you want (or doesn't) based on how you process the output (there is a bug?). Also, curve_fit might have gotten the 'absolute_sigma' flag recently?
My point is in the output. It seems that odr calculates the standard deviation as there is no uncertainties, similar to polyfit, but if the standard deviation is calculated from the covariance matrix, the uncertainties are there. The curve_fit does this with absolute_sigma=True flag. Below is the output containing
diagonal elements of the covariance matrix cov(0,0) and
cov(1,1),
wrong way for standard deviation from the outputs for slope and
wrong way for the constant, and
right way for standard deviation from the outputs for slope and
right way for the constant
odr: 1.739631e-06 0.02302262 [ 0.00014863 0.0170987 ] [ 0.00131895 0.15173207]
curve_fit: 2.209469e-08 0.00029239 [ 0.00014864 0.01709943] [ 0.0004899 0.05635713]
polyfit: 2.232016e-08 0.00029537 [ 0.0001494 0.01718643]
Notice that the odr and polyfit have exactly the same standard deviation. Polyfit does not get the uncertainties as an input so odr doesn't use uncertainties when calculating standard deviation. The covariance matrix uses them and if in the odr case the the standard deviation is calculated from the covariance matrix uncertainties are there and they change if the uncertainty is increased. Fiddling with dy in the code below will show it.
I am writing this here mostly because this is important to know when finding out error limits (and the fortran odrpack guide where scipy refers has some misleading information about this: standard deviation should be the square root of covariance matrix like the guide says but it is not).
import scipy.odr
import scipy.optimize
import numpy
x = numpy.arange(200)
y = x + 0.4*numpy.random.random(x.shape)
dy = 0.4
def stddev(cov): return numpy.sqrt(numpy.diag(cov))
def f(B, x): return B[0]*x + B[1]
linear = scipy.odr.Model(f)
mydata = scipy.odr.RealData(x, y, sy = dy)
myodr = scipy.odr.ODR(mydata, linear, beta0 = [1.0, 1.0], sstol = 1e-20, job=00000)
myoutput = myodr.run()
cov = myoutput.cov_beta
sd = myoutput.sd_beta
p = myoutput.beta
print 'odr: ', cov[0,0], cov[1,1], sd, stddev(cov)
p2, cov2 = scipy.optimize.curve_fit(lambda x, a, b:a*x+b,
x, y, [1,1],
sigma = dy,
absolute_sigma = False,
xtol = 1e-20)
p3, cov3 = scipy.optimize.curve_fit(lambda x, a, b:a*x+b,
x, y, [1,1],
sigma = dy,
absolute_sigma = True,
xtol = 1e-20)
print 'curve_fit: ', cov2[0,0], cov2[1,1], stddev(cov2), stddev(cov3)
p, cov4 = numpy.polyfit(x, y, 1,cov = True)
print 'polyfit: ', cov4[0,0], cov4[1,1], stddev(cov4)

Categories