numpy.polyfit with adapted parameters - python

This follows on from the question "polynomial equation parameters",
where I get 3 parameters for a quadratic function y = a*x² + b*x + c. Now I want only the first parameter, for a quadratic of the form y = a*x². In other words: I want to set b = c = 0 and fit only a. As far as I understand, polyfit isn't able to do this.

This can be done by numpy.linalg.lstsq. To explain how to use it, it is maybe easiest to show how you would do a standard 2nd order polyfit 'by hand'. Assuming you have your measurement vectors x and y, you first construct a so-called design matrix M like so:
M = np.column_stack((x**2, x, np.ones_like(x)))
after which you can obtain the usual coefficients as the least-square solution to the equation M * k = y using lstsq like this:
k, _, _, _ = np.linalg.lstsq(M, y)
where k is the column vector [a, b, c] with the usual coefficients. Note that lstsq returns some other parameters, which you can ignore. This is a very powerful trick, which allows you to fit y to any linear combination of the columns you put into your design matrix. It can be used e.g. for 2D fits of the type z = a * x + b * y (see e.g. this example, where I used the same trick in Matlab), or polyfits with missing coefficients like in your problem.
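As a rough illustration of the 2D case mentioned above, here is a minimal sketch of fitting z = a * x + b * y with the same design-matrix trick. The data is synthetic and the "true" values 2.0 and -0.5 are made up for the example; rcond=None is passed only to select NumPy's current default behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Synthetic data (assumed for this sketch): z = 2.0*x - 0.5*y plus small noise
x = rng.uniform(-1.0, 1.0, 200)
y = rng.uniform(-1.0, 1.0, 200)
z = 2.0 * x - 0.5 * y + 0.01 * rng.standard_normal(200)

# One design-matrix column per term of the model z = a*x + b*y
M = np.column_stack((x, y))

# Least-squares solution of M * k = z; k holds [a, b]
k, _, _, _ = np.linalg.lstsq(M, z, rcond=None)
a, b = k
print(a, b)
```

Adding or dropping a term of the model just means adding or dropping the matching column of M.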
In your case, the design matrix is simply a single column containing x**2. Quick example:
import numpy as np
import matplotlib.pylab as plt
# generate some noisy data
x = np.arange(1000)
y = 0.0001234 * x**2 + 3*np.random.randn(len(x))
# do fit
M = np.column_stack((x**2,)) # construct design matrix
k, _, _, _ = np.linalg.lstsq(M, y) # least-square fit of M * k = y
# quick plot
plt.plot(x, y, '.', x, k*x**2, 'r', linewidth=3)
plt.legend(('measurement', 'fit'), loc=2)
plt.title('best fit: y = {:.8f} * x**2'.format(k[0]))
plt.show()
Result:

The coefficients are chosen to minimize the squared error; you don't assign them. However, you can set some of the coefficients to zero afterwards if they are insignificant. E.g., I have a list of points on the curve y = 33*x²:
In [51]: x=np.arange(20)
In [52]: y=33*x**2 #y = 33*x²
In [53]: coeffs=np.polyfit(x, y, 2)
In [54]: coeffs
Out[54]: array([ 3.30000000e+01, 8.99625199e-14, -7.62430619e-13])
In [55]: epsilon=np.finfo(np.float32).eps
In [56]: coeffs[np.abs(coeffs)<epsilon]=0
In [57]: coeffs
Out[57]: array([ 33., 0., 0.])

Related

Why is scipy.optimize.curve_fit not producing a line of best fit for my points?

I am trying to plot several datasets for repeat R-T measurements and fit a cubic root line of best fit through each dataset using scipy.optimize.curve_fit.
My code produces a line for each dataset, but not a cubic root line of best fit. Each dataset is colour-coded to its corresponding line of best fit:
I've tried increasing the order of magnitude of my data, as I heard that sometimes scipy.optimize.curve_fit doesn't like very small numbers, but this made no change. If anyone could point out where I am going wrong I would be extremely grateful:
import numpy as np
from scipy.optimize import curve_fit
import scipy.optimize as scpo
import matplotlib.pyplot as plt
files = [ '50mA30%set1.lvm','50mA30%set3.lvm', '50mA30%set4.lvm',
          '50mA30%set5.lvm']
for file in files:
    data = np.loadtxt(file)
    current_YBCO = data[:,1]
    voltage_YBCO = data[:,2]
    current_thermometer = data[:,3]
    voltage_thermometer = data[:,4]
    T = data[:,5]
    R = voltage_thermometer/current_thermometer
    p = np.polyfit(R, T, 4)
    T_fit = p[0]*R**4 + p[1]*R**3 + p[2]*R**2 + p[3]*R + p[4]
    y = voltage_YBCO/current_YBCO
    def test(T_fit, a, b, c):
        return a * (T_fit+b)**(1/3) + c
    param, param_cov = curve_fit(test, np.array(T_fit), np.array(y),
                                 maxfev=100000)
    ans = (param[0]*(np.array(T_fit)+param[1])**(1/3)+param[2])
    plt.scatter(T_fit,y, 0.5)
    plt.plot(T_fit, ans, '--', label ="optimized data")
plt.xlabel("YBCO temperature(K)")
plt.ylabel("Resistance of YBCO(Ohms)")
plt.xlim(97, 102)
plt.ylim(-.00025, 0.00015)
Two things are making this harder for you.
First, cube roots of negative numbers for numpy arrays. If you try this you'll see that you aren't getting the result you want:
x = np.array([-8, 0, 8])
x**(1/3) # array([nan, 0., 2.])
This means that your test function is going to have a problem any time it gets a negative value, and you need the negative values to create the left hand side of the curves. Instead, use np.cbrt
x = np.array([-8, 0, 8])
np.cbrt(x) # array([-2., 0., 2.])
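If it helps to see why np.cbrt behaves differently, here is a small sketch that builds the real cube root by hand from sign and magnitude and compares it with np.cbrt. The variable names are just for illustration.

```python
import numpy as np

x = np.array([-8.0, 0.0, 8.0])

# Real cube root from sign and magnitude: the base of the fractional
# power is never negative, so no nan appears
manual = np.sign(x) * np.abs(x) ** (1 / 3)

# Built-in real cube root
builtin = np.cbrt(x)

same = np.allclose(manual, builtin)
print(manual, builtin, same)
```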
Secondly, your function is
def test(T_fit, a, b, c):
    return a * (T_fit + b)**(1/3) + c
Unfortunately, this just doesn't look very much like the graph you show. This makes it really hard for the optimisation to find a "good" fit. Things I particularly dislike about this function are
- it goes vertical at T_fit == -b (where T_fit + b is zero). Your data has a definite slope at this point
- it keeps growing quite strongly away from T_fit = -b. Your data goes horizontal.
However, it is sometimes possible to get a more "sensible fit" by giving the optimisation a good starting point.
You haven't given us any code to work from, which makes this much harder. So, by way of illustration, try this:
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize
fig, ax = plt.subplots(1)
# Generate some data which looks a bit like yours
x = np.linspace(95, 105, 1001)
y = 0.001 * (-0.5 + 1/(1 + np.exp((100-x)/0.5)) + 0.125 * np.random.rand(len(x)))
# A fitting function
def fit(x, a, b, c):
    return a * np.cbrt((x + b)) + c
# Perform the fitting
param, param_cov = scipy.optimize.curve_fit(fit, x, y, p0=[0.004, -100, 0.001], maxfev=100000)
# Calculate the fitted data:
yf = fit(x, *param)
print(param)
# Plot data and the fitted curve
ax.plot(x, y, '.')
ax.plot(x, yf, '-')
Now, if I run this code I do get a fit which roughly follows the data. However, if I take the initial guess out, i.e. do the fitting by calling
param, param_cov = scipy.optimize.curve_fit(fit, x, y, maxfev=100000)
then the fit is much worse. The reason why is that curve_fit will start from an initial guess of [1, 1, 1]. The solution which looks approximately right lies in a different valley to [1, 1, 1] and therefore it isn't the solution which is found. Said another way, it only finds the local minimum, not the global.
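One common workaround for the local-minimum problem is to run the fit from several starting points and keep the one with the smallest sum of squared residuals. The following is only a sketch of that idea, using the same made-up data shape as above with a fixed seed; the list of starting guesses is an assumption and would need adapting to real data.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit(x, a, b, c):
    return a * np.cbrt(x + b) + c

# Synthetic data resembling the example above (seeded for reproducibility)
rng = np.random.default_rng(0)
x = np.linspace(95, 105, 1001)
y = 0.001 * (-0.5 + 1 / (1 + np.exp((100 - x) / 0.5)) + 0.125 * rng.random(len(x)))

# Hypothetical starting guesses; the first is curve_fit's default
starts = [
    [1.0, 1.0, 1.0],
    [0.004, -100.0, 0.001],
    [0.001, -99.0, 0.0],
]

results = []
for p0 in starts:
    try:
        param, _ = curve_fit(fit, x, y, p0=p0, maxfev=100000)
    except RuntimeError:
        continue  # this start did not converge, try the next one
    sse = float(np.sum((fit(x, *param) - y) ** 2))
    if np.isfinite(sse):
        results.append((sse, param))

# Keep the converged fit with the smallest sum of squared errors
best_sse, best_param = min(results, key=lambda r: r[0])
print(best_sse, best_param)
```

This does not guarantee the global minimum either, it only makes it less likely that a single bad starting point decides the result.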

Doing many iterations of scipy's `curve_fit` in one go

Consider the following MWE
import numpy as np
from scipy.optimize import curve_fit
X=np.arange(1,10,1)
Y=abs(X+np.random.randn(15,9))
def linear(x, a, b):
    return (x/b)**a
coeffs=[]
for ix in range(Y.shape[0]):
    print(ix)
    c0, pcov = curve_fit(linear, X, Y[ix])
    coeffs.append(c0)
XX=np.tile(X, Y.shape[0])
c0, pcov = curve_fit(linear, XX, Y.flatten())
I have a problem where I have to do basically that, but instead of 15 iterations it's thousands and it's pretty slow.
Is there any way to do all of those iterations at once with curve_fit? I know the result from the function is supposed to be a 1D-array, so just passing the args like this
c0, pcov = curve_fit(nlinear, X, Y)
is not going to work. Also I think the answer has to be in flattening Y, so I can get a flattened result, but I just can't get anything to work.
EDIT
I know that if I do something like
XX=np.tile(X, Y.shape[0])
c0, pcov = curve_fit(nlinear, XX, Y.flatten())
then I get a "mean" value of the coefficients, but that's not what I want.
EDIT 2
For the record, I solved it using Jacques Kvam's set-up, but implemented it with NumPy (because of a limitation):
lX = np.log(X)
lY = np.log(Y)
A = np.vstack([lX, np.ones(len(lX))]).T
m, c=np.linalg.lstsq(A, lY.T)[0]
And then m is a and to get b:
b=np.exp(-c/m)
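The reasoning behind those formulas: taking logs of y = (x/b)**a gives log(y) = a*log(x) - a*log(b), a straight line in log(x) with slope m = a and intercept c = -a*log(b), so b = exp(-c/m). A small noise-free sketch (the true_a and true_b values are made up) that fits all rows in a single lstsq call and recovers the parameters:

```python
import numpy as np

X = np.arange(1, 10)

# Hypothetical true parameters, one (a, b) pair per row of Y
true_a = np.array([1.0, 2.0, 3.0])
true_b = np.array([0.5, 2.0, 4.0])
Y = (X[None, :] / true_b[:, None]) ** true_a[:, None]

# Design matrix for log(y) = m*log(x) + c
lX = np.log(X)
A = np.column_stack((lX, np.ones_like(lX)))

# One column of log(Y).T per data set, so every row is fitted at once
m, c = np.linalg.lstsq(A, np.log(Y).T, rcond=None)[0]

a_est = m
b_est = np.exp(-c / m)
print(a_est, b_est)
```

With noisy data the recovered values will differ from a direct curve_fit, because the log transform changes how the noise is weighted.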
Least squares won't give the same result because the noise is transformed by log in this case. If the noise is zero, both methods give the same result.
import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
rng.seed(0)
X=np.arange(1,7)
Y = np.zeros((4, 6))
for i in range(4):
    b = a = i + 1
    Y[i] = (X/b)**a + 0.01 * rng.randn(6)
def linear(x, a, b):
    return (x/b)**a
coeffs=[]
for ix in range(Y.shape[0]):
    print(ix)
    c0, pcov = curve_fit(linear, X, Y[ix])
    coeffs.append(c0)
coeffs is
[array([ 0.99309127, 0.98742861]),
array([ 2.00197613, 2.00082722]),
array([ 2.99130237, 2.99390585]),
array([ 3.99644048, 3.9992937 ])]
I'll use scikit-learn's implementation of linear regression since I believe that scales well.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
Take logs of X and Y
lX = np.log(X)[None, :]
lY = np.log(Y)
Now fit and check that the coefficients are the same as before.
lr.fit(lX.T, lY.T)
lr.coef_
Which gives similar exponents.
array([ 0.98613517, 1.98643974, 2.96602892, 4.01718514])
Now check the divisor.
np.exp(-lr.intercept_ / lr.coef_.ravel())
Which gives similar coefficients, though you can see the two methods diverging somewhat in their answers.
array([ 0.99199406, 1.98234916, 2.90677142, 3.73416501])
It might be useful in some situations to have the best fit parameters as a numpy array for further calculations. One can add the following after the for loop:
bestfit_par = np.asarray(coeffs)

Python: Scipy's curve_fit for NxM arrays?

Usually I use scipy.optimize.curve_fit to fit custom functions to data.
Data in this case was always a 1-dimensional array.
Is there a similar function for a two-dimensional array?
So, for example, I have a 10x10 numpy array. Then I have a function that does some stuff and creates a 10x10 numpy array, and I want to fit the function, so that the resulting 10x10 array has the best fit to the input array.
Maybe an example is better :)
data = pyfits.getdata('data.fits') #fits is an image format, this gives me a NxM numpy array
mod1 = pyfits.getdata('mod1.fits')
mod2 = pyfits.getdata('mod2.fits')
mod3 = pyfits.getdata('mod3.fits')
mod1_1D = numpy.ravel(mod1)
mod2_1D = numpy.ravel(mod2)
mod3_1D = numpy.ravel(mod3)
def dostuff(a,b): #originally this is a function for 2D arrays
    newdata = (mod1_1D*12)+(mod2_1D)**a - mod3_1D/b
    return newdata
Now a and b should be fitted, so that newdata is as close as possible to data.
What I got so far:
data1D = numpy.ravel(data)
data_X = numpy.arange(data1D.size)
fit = curve_fit(dostuff,data_X,data1D)
But print fit only gives me
(array([ 1.]), inf)
I do have some NaNs in the arrays, maybe that's a problem?
The goal is to express the 2D function as a 1D function: g(x, y, ...) --> f(xy, ...)
Converting the coordinate pair (x, y) into a single number xy may seem tricky at first. But it's actually quite simple. Just enumerate all data points and you have a single number that uniquely defines each coordinate pair. The fitted function simply has to reconstruct the original coordinates, do its calculations and return the result.
Example that fits a 2D linear gradient in a 20x10 image:
import scipy as sp
import scipy.optimize # makes sp.optimize available
import numpy as np
import matplotlib.pyplot as plt
n, m = 10, 20
# noisy example data
x = np.arange(m).reshape(1, m)
y = np.arange(n).reshape(n, 1)
z = x + y * 2 + np.random.randn(n, m) * 3
def f(xy, a, b):
    i = xy // m # reconstruct y coordinates
    j = xy % m # reconstruct x coordinates
    out = i * a + j * b
    return out
xy = np.arange(z.size) # 0 is the top left pixel and 199 is the bottom right pixel
res = sp.optimize.curve_fit(f, xy, np.ravel(z))
z_est = f(xy, *res[0])
z_est2d = z_est.reshape(n, m)
plt.subplot(2, 1, 1)
plt.plot(np.ravel(z), label='original')
plt.plot(z_est, label='fitted')
plt.legend()
plt.subplot(2, 2, 3)
plt.imshow(z)
plt.xlabel('original')
plt.subplot(2, 2, 4)
plt.imshow(z_est2d)
plt.xlabel('fitted')
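A variation on the same idea, offered as a sketch rather than a drop-in answer: as far as I know, curve_fit also accepts a (k, M)-shaped xdata for functions of k predictors, so the row and column indices can be passed directly instead of being reconstructed from a flat index. The sketch also masks out NaN pixels before fitting, since the question mentions NaNs in the arrays. The function name plane and the low noise level are assumptions for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

n, m = 10, 20
rng = np.random.default_rng(0)

# Row (i) and column (j) index of every pixel
rows, cols = np.indices((n, m))

# Synthetic image: z = 1*j + 2*i plus noise, with one missing pixel
z = cols + 2.0 * rows + 0.1 * rng.standard_normal((n, m))
z[0, 0] = np.nan

def plane(coords, a, b):
    i, j = coords  # coords has shape (2, number_of_pixels)
    return a * i + b * j

coords = np.vstack((rows.ravel(), cols.ravel()))
valid = np.isfinite(z.ravel())  # drop NaN pixels before fitting

popt, pcov = curve_fit(plane, coords[:, valid], z.ravel()[valid])
print(popt)
```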
I would recommend using symfit for this. I wrote it to take care of all of the magic for you automatically.
In symfit you would just write the equation pretty much as you would on paper, and then you can run the fit.
I would do something like this:
from symfit import parameters, variables, Fit
# Assuming all this data is in the form of NxM arrays
data = pyfits.getdata('data.fits')
mod1 = pyfits.getdata('mod1.fits')
mod2 = pyfits.getdata('mod2.fits')
mod3 = pyfits.getdata('mod3.fits')
a, b = parameters('a, b')
x, y, z, u = variables('x, y, z, u')
model = {u: (x * 12) + y**a - z / b}
fit = Fit(model, x=mod1, y=mod2, z=mod3, u=data)
fit_result = fit.execute()
print(fit_result)
Unfortunately I have not yet included examples of the kind you need in the docs, but if you look at the docs I think you can figure it out in case this doesn't work out of the box.

numpy polyfit passing through 0

Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
y_fit = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
Thanks
You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
    # Curve fitting function
    return a * x**3 + b * x**2 + c * x # d=0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
It does not answer the question in the strict sense, because it does not use numpy's polyfit function to pass through the origin, but it solves the problem.
Hope someone finds it useful :)
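The question's polyfit call also passes weights (w=wgt), which the snippet above leaves out. As far as I understand the two APIs, polyfit multiplies each residual by w while curve_fit divides each residual by sigma, so sigma = 1/wgt should give the same weighting. Here is a sketch under that assumption, with made-up data and weights, cross-checked against a weighted lstsq fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_func(x, a, b, c):
    return a * x**3 + b * x**2 + c * x  # d = 0 is implied

# Made-up data: noisier points get smaller weights
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
wgt = rng.uniform(0.5, 2.0, x.size)
y = 0.5 * x**3 - 1.0 * x**2 + 2.0 * x + rng.standard_normal(x.size) / wgt

# Weighted fit with curve_fit: sigma plays the role of 1/weight
popt, _ = curve_fit(fit_func, x, y, sigma=1.0 / wgt)

# Same weighted problem solved directly: scale rows of M and y by the weights
M = np.column_stack((x**3, x**2, x))
p_lstsq = np.linalg.lstsq(M * wgt[:, None], y * wgt, rcond=None)[0]

print(popt, p_lstsq)
```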
You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(XX, y)[0]
pp = np.polyfit(x, y, 3)
print(np.isclose(pp, p_all).all())
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking 'that's nice, but I still haven't answered the question'. From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y)[0] # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x,y)=(0,0) you will bias your estimates of your output solution coefficients (p) because lstsq will be trying to compensate for the fact that there is an offset in your data. Sort of a 'square peg round hole' problem.
Furthermore, you could also fit your data to a cubic only by doing:
p_ = np.linalg.lstsq(XX[:, :1], y)[0]
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
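To make the bias warning concrete, here is a deliberately simple sketch using a straight line instead of a cubic (my simplification, not part of the answer above). The data is noise-free with a true slope of 2 and a true offset of 5; forcing the fit through the origin pushes the slope estimate up to absorb the missing offset.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 5.0  # noise-free line that does NOT pass through the origin

# Fit with the offset column: recovers slope and offset
M_full = np.column_stack((x, np.ones_like(x)))
slope_full, offset_full = np.linalg.lstsq(M_full, y, rcond=None)[0]

# Fit forced through the origin: the slope has to make up for the offset
M_origin = x[:, None]
slope_origin = np.linalg.lstsq(M_origin, y, rcond=None)[0][0]

print(slope_full, offset_full, slope_origin)
```

The forced-origin slope comes out noticeably larger than 2, even though the data contains no noise at all.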
Best of luck!

What's the error of numpy.polyfit?

I want to use numpy.polyfit for physical calculations, therefore I need the magnitude of the error.
If you specify full=True in your call to polyfit, it will include extra information:
>>> x = np.arange(100)
>>> y = x**2 + 3*x + 5 + np.random.rand(100)
>>> np.polyfit(x, y, 2)
array([ 0.99995888, 3.00221219, 5.56776641])
>>> np.polyfit(x, y, 2, full=True)
(array([ 0.99995888, 3.00221219, 5.56776641]), # coefficients
array([ 7.19260721]), # residuals
3, # rank
array([ 11.87708199, 3.5299267 , 0.52876389]), # singular values
2.2204460492503131e-14) # conditioning threshold
The residual value returned is the sum of the squares of the fit errors, not sure if this is what you are after:
>>> np.sum((np.polyval(np.polyfit(x, y, 2), x) - y)**2)
7.1926072073491056
In version 1.7 there is also a cov keyword that will return the covariance matrix for your coefficients, which you could use to calculate the uncertainty of the fit coefficients themselves.
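A minimal sketch of the cov route, assuming a NumPy version that supports the keyword: the square roots of the diagonal of the covariance matrix are commonly used as one-sigma uncertainties of the coefficients. As far as I know the matrix is scaled by the residual variance by default, so check the documentation of your version before quoting the numbers.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
x = np.arange(100)
y = x**2 + 3 * x + 5 + rng.random(100)

# cov=True returns the coefficient covariance matrix alongside the fit
p, cov = np.polyfit(x, y, 2, cov=True)

# One-sigma uncertainty estimate for each coefficient
p_err = np.sqrt(np.diag(cov))

for name, value, err in zip("abc", p, p_err):
    print(f"{name} = {value:.5f} +/- {err:.5f}")
```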
As you can see in the documentation:
Returns
-------
p : ndarray, shape (M,) or (M, K)
    Polynomial coefficients, highest power first.
    If `y` was 2-D, the coefficients for `k`-th data set are in ``p[:,k]``.
residuals, rank, singular_values, rcond : present only if `full` = True
    Residuals of the least-squares fit, the effective rank of the scaled
    Vandermonde coefficient matrix, its singular values, and the specified
    value of `rcond`. For more details, see `linalg.lstsq`.
Which means that you can do a fit and get the residuals as:
import numpy as np
x = np.arange(10)
y = x**2 - 3*x + np.random.random(10)
deg = 2 # degree of the fitted polynomial
p, res, _, _, _ = np.polyfit(x, y, deg, full=True)
Then, the p are your fit parameters, and the res will be the residuals, as described above. The _'s are because you don't need to save the last three parameters, so you can just save them in the variable _ which you won't use. This is a convention and is not required.
@Jaime's answer explains what the residual means. Another thing you can do is look at those squared deviations as a function (the sum of which is res). This is particularly helpful for spotting a trend that wasn't fitted sufficiently. res can be large because of statistical noise, or possibly because of systematically poor fitting, for example:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y = 1000*np.sqrt(x) + x**2 - 10*x + 500*np.random.random(100) - 250
p = np.polyfit(x,y,2) # insufficient degree to include sqrt
yfit = np.polyval(p,x)
plt.figure()
plt.plot(x,y, label='data')
plt.plot(x,yfit, label='fit')
plt.plot(x,yfit-y, label='var')
plt.legend() # show the labels given above
So in the figure, note the bad fit near x = 0:
