I have two sets of data for which I want to find a correlation. Although there is quite a bit of scatter in the data, there is an obvious relation. I currently use numpy polyfit (8th order), but the line "wiggles" (especially at the beginning and the end), which is not appropriate. Secondly, I don't think the fit is very good at the beginning of the line (the curve should be slightly steeper).
How can I get a best fit "spline" through these data points?
My current code:
import numpy as np

# fit regression line (8th-order polynomial)
regressionLineOrder = 8
regressionLine = np.polyfit(data['x'], data['y'], regressionLineOrder)
p = np.poly1d(regressionLine)
Take a look at Matthew Drury's answer to Why use regularisation in polynomial regression instead of lowering the degree? It's simply fantastic and spot on. The most interesting bit comes at the end, where he talks about using a natural cubic spline to fit a regression in place of a regularized polynomial of degree 10. You could use scipy.interpolate.CubicSpline to accomplish something very similar, and scipy.interpolate contains many other spline classes as well.
Here is a simple example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import CubicSpline

cs = CubicSpline(data['x'], data['y'])          # x values must be strictly increasing
x_range = np.arange(x_min, x_max, some_step)    # x_min, x_max and some_step are placeholders
plt.plot(x_range, cs(x_range), label='Cubic Spline')
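Since your data is noisy and CubicSpline passes through every point exactly, one of the other scipy.interpolate classes, UnivariateSpline, may be closer to the "best fit spline" you are after: it trades exactness for smoothness through a smoothing factor. A minimal sketch, assuming data['x'] and data['y'] are 1-D arrays and the smoothing factor s is only illustrative:

import numpy as np
from scipy.interpolate import UnivariateSpline

order = np.argsort(data['x'])                 # x values must be increasing
x_sorted = np.asarray(data['x'])[order]
y_sorted = np.asarray(data['y'])[order]

spl = UnivariateSpline(x_sorted, y_sorted, s=len(x_sorted))  # larger s gives a smoother curve
plt.plot(x_range, spl(x_range), label='Smoothing spline')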
There are some possible issues with your data set. Judging from your plot of the n (x, y) points, they are linked with straight lines; if you display points instead, you should see that the point density along your domain is not evenly distributed, just as the line segments are not. Say your domain is [xmin, xmax]: an 8th-order polynomial is fine for interpolation, but it wiggles because of the high order and because of the uneven point density. Polynomials are also poor for extrapolation, since there are no control points outside your domain.

You could fix that with a spline: a natural cubic spline controls the derivative at xmin and xmax. To do that, you should sort your dataset along the x axis and take a subsample of the n points, smoothed with a rolling average, as control points for the spline algorithm (a rough sketch follows below). If your problem has an analytical solution (a Gaussian variogram, for instance, looks like your point distribution), just optimize its parameters (range and sill, in the Gaussian variogram case) to minimize the error inside the domain and follow the asymptotes outside.
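Here is a hedged, minimal sketch of that sort / rolling-average / natural-spline idea, assuming data['x'] and data['y'] are 1-D arrays; the window size and the number of control points are purely illustrative:

import numpy as np
from scipy.interpolate import CubicSpline

# sort the dataset along the x axis
order = np.argsort(data['x'])
x_sorted = np.asarray(data['x'])[order]
y_sorted = np.asarray(data['y'])[order]

# rolling average to build smooth control points
window = 25                                   # tune to your point density
kernel = np.ones(window) / window
x_smooth = np.convolve(x_sorted, kernel, mode='valid')
y_smooth = np.convolve(y_sorted, kernel, mode='valid')

# subsample the smoothed points and fit a natural cubic spline through them
# (assumes the subsampled x values are strictly increasing, i.e. no duplicates)
idx = np.linspace(0, len(x_smooth) - 1, 15).astype(int)
cs_natural = CubicSpline(x_smooth[idx], y_smooth[idx], bc_type='natural')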
My problem is that I have about 50,000 non-linear data points (x, y, z), with z depending on the independent variables x and y. Viewed from one side, i.e. from a two-dimensional perspective, the data points look like a polynomial of degree 7. Unfortunately I cannot show this data.
My goal is to find a polynomial in 3D that fits this data without knowing the degree of the polynomial beforehand. So I would like a function like f(x,y) = ax^3 + bx^2 + cx^2y + dy^3 + ...
Unfortunately, in Python I have only found surface-fitting approaches where you need the degree beforehand, or approaches that transform the polynomial problem into a multivariable linear problem with scikit-learn. The latter gave very poor results with my dataset.
Does anyone know of a better method for this problem? Many thanks in advance.
As far as fitting a polynomial to a surface goes, I think your best bet is to try polynomials of different degrees and rank them based on fit, as described here (there is a rough sketch of that idea after the demo below).
If you are willing to try different surface-fitting methods, I would recommend looking into what scipy has to offer, particularly in the multivariate, unstructured data section. scipy.interpolate.griddata, for example, can interpolate scattered points with a piecewise-cubic method. See the code below for a demo:
import numpy as np
from scipy.interpolate import griddata

# X and Y features form a 2D numpy array of shape (n_points, 2)
xy = np.random.randn(20, 2)
# z is a nonlinear function of x and y
z = xy[:, 0] + xy[:, 1]**2

# make a grid of x and y points to interpolate onto
xsurf = np.arange(-3, 3, 0.1)
ysurf = xsurf
xsurf, ysurf = np.meshgrid(xsurf, ysurf)

# piecewise-cubic interpolation of the scattered data onto the grid
surfPts = griddata(xy, z, np.vstack((xsurf.flatten(), ysurf.flatten())).T, method='cubic')
That code will yield the following surface fit:
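As for the degree-ranking idea mentioned earlier, here is a hedged sketch that reuses the synthetic xy and z arrays from the demo above: build a bivariate polynomial design matrix for each candidate degree, solve by least squares with np.linalg.lstsq, and compare the residual errors. The helper name poly_design_matrix is made up for illustration; with your real 50,000 points you can extend the loop to degree 7.

import numpy as np

def poly_design_matrix(x, y, degree):
    # columns are x**i * y**j for all i + j <= degree
    cols = [x**i * y**j for i in range(degree + 1) for j in range(degree + 1 - i)]
    return np.column_stack(cols)

x_obs, y_obs = xy[:, 0], xy[:, 1]
for degree in range(1, 5):
    A = poly_design_matrix(x_obs, y_obs, degree)
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    rmse = np.sqrt(np.mean((A @ coeffs - z)**2))
    print("degree %d: RMSE = %.4f" % (degree, rmse))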
I am new to Python and a bit confused about the interpolation and least-squares fitting of two ndarrays.
I have 2 ndarrays:
My final goal is to make Least-squares fitting of the modelled spectrum (blue curve) to the observed spectrum (orange curve).
Blue curve ndarray has the following parameters:
Orange curve ndarray has the following parameters:
As a first and easiest step I wanted to plot the residuals (the difference) between the two ndarrays, but the problem is that they have different sizes, 391 and 256 respectively. I've tried to use the numpy.reshape and ndarray.reshape functions, but they lead to errors.
Probably the proper solution is to start by interpolating the blue curve onto the less dense grid of the orange curve. I've tried to use the numpy.interp function but it also leads to errors.
Something along the lines of the following:
import numpy as np
import matplotlib.pyplot as plt

# two synthetic curves sampled on grids of different density
n_denser = 33
n_coarser = 7

x_denser = np.linspace(0, 1, n_denser)
y_denser = np.power(x_denser, 2) + np.random.randn(n_denser)/10.

x_coarser = np.linspace(0, 1, n_coarser)
y_coarser = np.power(x_coarser, 2) + np.random.randn(n_coarser)/10. + 0.5

# resample the dense curve onto the coarse grid
y_dense_interp = np.interp(x_coarser, x_denser, y_denser)

plt.plot(x_denser, y_denser, 'b+-')
plt.plot(x_coarser, y_coarser, 'ro:')
plt.plot(x_coarser, y_dense_interp, 'go')
plt.legend(['dense data', 'coarse data', 'interp data'])
plt.show()
Which returns something like:
Your confusion seems to stem from mixing up the methods you mention. Least-squares is not an interpolation method; rather, it is a minimizing curve-fitting method. One key difference is that with interpolation the curve always passes through the original data points. With least-squares this can happen, but it is not generally the case.
Cubic-spline interpolation will give you 'nice' plots if you need to pass through the original data points.
If you want to use least-squares, you need to know what degree polynomial you want to fit. The most common is linear (first order).
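Continuing the toy example above (and reusing its arrays), here is a minimal sketch of both steps: the residuals are straightforward once the two curves live on the same grid, and np.polyfit handles the first-order least-squares fit.

# residuals are well defined once both curves share the coarse grid
residuals = y_coarser - y_dense_interp

# first-order (linear) least-squares fit of the coarse curve
slope, intercept = np.polyfit(x_coarser, y_coarser, 1)
print("linear fit: y = %.3f * x + %.3f" % (slope, intercept))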
Suppose h is a function of x, y, z and t, and it gives us a simulated graph (t, h). At the same time we also have an observed graph (observed values of h against t). How can I reduce the difference between the observed and simulated (t, h) graphs by optimizing the values of x, y and z? I want to change the simulated graph so that it matches the observed graph more and more closely, in MATLAB or Python. In the literature I have read that people have done the same thing with the Levenberg-Marquardt algorithm, but I don't know how to do it.
You are actually trying to fit the parameters x,y,z of the parametrized function h(x,y,z;t).
MATLAB
You're right that in MATLAB you should use either lsqcurvefit from the Optimization Toolbox or fit from the Curve Fitting Toolbox (I prefer the latter).
Looking at the documentation of lsqcurvefit:
x = lsqcurvefit(fun,x0,xdata,ydata);
It says in the documentation that you have a model F(x,xdata) with coefficients x and sample points xdata, and a set of measured values ydata. The function returns the least-squares parameter set x, with which your function is closest to the measured values.
Fitting algorithms usually need starting points; some implementations can choose them randomly, but in the case of lsqcurvefit this is what x0 is for. If you have
h = @(x,y,z,t) ... %// actual function here
t_meas = ... %// actual measured times here
h_meas = ... %// actual measured data here
then in the conventions of lsqcurvefit,
fun <--> @(params,t) h(params(1),params(2),params(3),t)
x0 <--> starting guess for [x,y,z]: [x0,y0,z0]
xdata <--> t_meas
ydata <--> h_meas
Your function h(x,y,z,t) should be vectorized in t, such that for vector input in t the return value is the same size as t. Then the call to lsqcurvefit will give you the optimal set of parameters:
x = lsqcurvefit(@(params,t) h(params(1),params(2),params(3),t),[x0,y0,z0],t_meas,h_meas);
h_fit = h(x(1),x(2),x(3),t_meas); %// best guess from curve fitting
Python
In Python, you'd use the scipy.optimize module, and scipy.optimize.curve_fit in particular. With the above conventions you need something along these lines:
import scipy.optimize as opt
popt, pcov = opt.curve_fit(lambda t, x, y, z: h(x, y, z, t), t_meas, h_meas, p0=[x0, y0, z0])
Note that the p0 starting array is optional, but all parameters will be set to 1 if it's missing. The result you need is the popt array, containing the optimal values for [x,y,z]:
x,y,z = popt
h_fit = h(x,y,z,t_meas)
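To make the recipe concrete, here is a hedged, self-contained illustration where the made-up model h(x, y, z, t) = x*exp(-y*t) + z merely stands in for your actual function; everything else follows the curve_fit call above.

import numpy as np
import scipy.optimize as opt

def h(x, y, z, t):
    return x * np.exp(-y * t) + z              # vectorized in t

# synthetic "observed" data, only for the illustration
t_meas = np.linspace(0, 10, 50)
h_meas = h(2.0, 0.5, 1.0, t_meas) + 0.05 * np.random.randn(t_meas.size)

popt, pcov = opt.curve_fit(lambda t, x, y, z: h(x, y, z, t),
                           t_meas, h_meas, p0=[1.0, 1.0, 0.0])
x, y, z = popt
h_fit = h(x, y, z, t_meas)                     # best-fit simulated curve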
I am considering using this method to interpolate some 3D points I have. As an input I have atmospheric concentrations of a gas at various elevations over an area. The data I have appears as values every few feet of vertical elevation for several tens of feet, but horizontally separated by many hundreds of feet (so 'columns' of tightly packed values).
The assumption is that values vary in the vertical direction significantly more than in the horizontal direction at any given point in time.
I want to perform 3D kriging with that assumption accounted for (as a parameter I can adjust or that is statistically defined - either/or).
I believe the scikit-learn module can do this. If it can, my question is: how do I create a discrete cell output? That is, output into a 3D grid of data with dimensions of, say, 50 x 50 x 1 feet. Ideally, I would like an output of [x_location, y_location, value] with separation of those (or similar) distances.
Unfortunately I don't have a lot of time to play around with it, so I'm just hoping to figure out if this is possible in Python before delving into it. Thanks!
Yes, you can definitely do that in scikit-learn.
In fact, it is a basic feature of kriging/Gaussian process regression that you can use anisotropic covariance kernels.
As the manual explains (quoted below), you can either set the covariance parameters yourself or estimate them, and you can choose to have all parameters equal or all different.
theta0 : double array_like, optional
An array with shape (n_features, ) or (1, ). The parameters in the
autocorrelation model. If thetaL and thetaU are also specified, theta0
is considered as the starting point for the maximum likelihood
estimation of the best set of parameters. Default assumes isotropic
autocorrelation model with theta0 = 1e-1.
In the 2d case, something like this should work:
import numpy as np
from sklearn.gaussian_process import GaussianProcess  # legacy API (removed in scikit-learn 0.20)

# prediction grid
x = np.arange(1, 51)
y = np.arange(1, 51)
X, Y = np.meshgrid(x, y)

# observations: coordinates and measured values
points = np.column_stack([obs_x, obs_y])  # replace with your observation coordinates
values = obs_data                         # replace with your observed data

gp = GaussianProcess(theta0=0.1, thetaL=.001, thetaU=1., nugget=0.001)
gp.fit(points, values)

XY_pairs = np.column_stack([X.flatten(), Y.flatten()])
predicted = gp.predict(XY_pairs).reshape(X.shape)
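And here is a hedged sketch of the anisotropic 3D case the answer is really about: one theta per feature, so the vertical direction can have a different correlation scale from the horizontal ones. The values are purely illustrative, obs_x, obs_y and obs_z are placeholders for your observation coordinates, and it still uses the same legacy GaussianProcess API as above.

# one theta per feature (x, y, z); a larger theta means a shorter correlation length,
# matching the assumption that values vary faster vertically than horizontally
gp_aniso = GaussianProcess(theta0=[0.1, 0.1, 10.0],
                           thetaL=[1e-3, 1e-3, 1e-2],
                           thetaU=[1.0, 1.0, 100.0],
                           nugget=0.001)
gp_aniso.fit(np.column_stack([obs_x, obs_y, obs_z]), values)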
I need to convolve the following curve with a Gaussian function of specific parameters centered at 3934.8 Å.
The problem I see is that my curve is a discrete array while the Gaussian would be a well-defined continuous function. How can I make this work?
To do this, you need to create a Gaussian that's discretized at the same spatial scale as your curve, then just convolve.
Specifically, say your original curve has N points that are uniformly spaced along the x-axis (where N will generally be somewhere between 50 and 10,000 or so). Then the point spacing along the x-axis will be (physical range)/(digital range) = (3940-3930)/N, and the code would look like this:
import numpy as np

dx = float(3940 - 3930)/N                       # grid spacing, in the same units as sigma
gx = np.arange(-3*sigma, 3*sigma, dx)           # sample points for the Gaussian
gaussian = np.exp(-(gx/sigma)**2/2)             # Gaussian discretized on the curve's spacing
result = np.convolve(original_curve, gaussian, mode="full")
Note that this is a zero-centered Gaussian and does not include the offset you refer to (which to me would just add confusion, since convolution is by its nature a translating operation, so starting with something already translated is confusing).
I highly recommend keeping everything in real, physical units, as I did above. Then it's clear, for example, what the width of the gaussian is, etc.
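If it helps, here is a hedged usage sketch for plotting: with mode="full" the output is longer than the input, so for plotting against the original wavelength axis it is often easier to use mode="same" and a normalized kernel. wavelengths and original_curve are placeholders for your data, and gaussian is the kernel built above.

import numpy as np
import matplotlib.pyplot as plt

# normalizing the kernel keeps the overall area (flux) of the curve unchanged
kernel = gaussian / gaussian.sum()
smoothed = np.convolve(original_curve, kernel, mode="same")  # same length as the input

plt.plot(wavelengths, original_curve, label="original")
plt.plot(wavelengths, smoothed, label="convolved with Gaussian")
plt.legend()
plt.show()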