How to implement a constrained linear fit in Python?

I'm trying to fit a linear model to a set of data, with the constraint that all the residuals (model - data) are positive - in other words, the model should be the "best overestimate". Without this constraint, linear models can be easily found with numpy's polyfit as shown below.
import numpy as np
import matplotlib.pyplot as plt
x = [-4.12179107e-01, -1.40664082e-01, -5.52301563e-06, 1.82898473e-01]
y = [-4.14846251, -3.31607886, -3.57827245, -5.09914559]
plt.scatter(x,y)
coeff = np.polyfit(x,y,1)
plt.plot(x,np.polyval(coeff,x),c='r',label='numpy-polyval')
plt.plot(x,np.polyval([-2,-3.6],x),c='g',label='desired-fit') #a rough guess of the desired result
plt.legend()
[figure: scatter of the data with the numpy-polyval fit (red) and the rough desired fit (green)]
Is there an efficient way to implement a linear fit with this type of constraint?

This is a quadratic programming problem. There are several libraries (CVXOPT, quadprog etc.) that can be used to solve it. Here is an example using quadprog:
import numpy as np
import matplotlib.pyplot as plt
import quadprog
x = [-4.12179107e-01, -1.40664082e-01, -5.52301563e-06, 1.82898473e-01]
y = [-4.14846251, -3.31607886, -3.57827245, -5.09914559]
A = np.c_[x, np.ones(len(x))]
y = np.array(y)
G = A.T @ A          # quadratic term of 1/2 * coeffs^T G coeffs - a^T coeffs
a = A.T @ y          # linear term
C = A.T              # constraint C^T coeffs >= b, i.e. A coeffs >= y
b = y
coeffs = quadprog.solve_qp(G, a, C, b)[0]
plt.scatter(x, y)
plt.plot(x, np.polyval(coeffs, x), c='r')
plt.show()
This gives:
See e.g. this post for more information. It describes, in particular, how to set up a linear regression problem as a quadratic programming problem.
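If you prefer to stay within scipy, the same constrained least-squares fit can also be set up with scipy.optimize.minimize and a linear inequality constraint. A minimal sketch, just as a cross-check of the quadprog result above:
import numpy as np
from scipy.optimize import minimize

x = np.array([-4.12179107e-01, -1.40664082e-01, -5.52301563e-06, 1.82898473e-01])
y = np.array([-4.14846251, -3.31607886, -3.57827245, -5.09914559])
A = np.c_[x, np.ones(len(x))]

def sse(coeffs):
    # squared error between the candidate line and the data
    return np.sum((A @ coeffs - y) ** 2)

# constraint: every residual (model - data) must be non-negative
cons = [{'type': 'ineq', 'fun': lambda coeffs: A @ coeffs - y}]
res = minimize(sse, x0=np.polyfit(x, y, 1), constraints=cons)
print(res.x)   # slope and intercept of the best overestimate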
As a side note, the optimal line will always pass through one data point, but it need not pass through two such points. For example, take x = [-1., 0., 1.] and y = [1., 2., 1.].

Yes: for this data, the best fit is a line through the top two points. I do an argsort to find the two highest y values, compute the slope and y-intercept, and off we go:
import numpy as np
import matplotlib.pyplot as plt
x = [-4.12179107e-01, -1.40664082e-01, -5.52301563e-06, 1.82898473e-01]
y = [-4.14846251, -3.31607886, -3.57827245, -5.09914559]
plt.scatter(x,y)
coeff = np.polyfit(x,y,1)
model1 = np.polyval(coeff,x)
model1 += (y-model1).max()
print(model1)
print(sum((y-model1)**2))
z = np.argsort(y)
pt0 = (x[z[-1]],y[z[-1]])
pt1 = (x[z[-2]],y[z[-2]])
m = (pt1[1]-pt0[1])/(pt1[0]-pt0[0])
b = pt0[1]-m*pt0[0]
model2 = np.polyval([m,b],x)
print(model2)
print(sum((y-model2)**2))
plt.plot(x,model1,c='r',label='numpy-polyval')
plt.plot(x,model2,c='g',label='generated')
plt.legend()
plt.show()
Output:
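As a quick sanity check (a small sketch reusing model2 and y from the code above), you can verify that the generated line never dips below the data:
# sanity check: an overestimate means every residual (model - data) is >= 0
residuals = model2 - np.array(y)
print(residuals)
print(np.all(residuals >= -1e-12))   # small tolerance for floating-point error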

Related

how to fit a function to data in python

I want to fit a function to the independent (X) and dependent (y) variables:
import numpy as np
y = np.array([1.45952016, 1.36947283, 1.31433227, 1.24076599, 1.20577963,
1.14454815, 1.13068077, 1.09638278, 1.08121406, 1.04417094,
1.02251471, 1.01268524, 0.98535659, 0.97400591])
X = np.array([4.571428571362048, 8.771428571548313, 12.404761904850602, 17.904761904850602,
22.904761904850602, 31.238095237873495, 37.95833333302289,
44.67857142863795, 51.39880952378735, 64.83928571408615,
71.5595238097012, 85., 98.55357142863795, 112.1071428572759])
I already tried the scipy package in this way:
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return 1 / (a * x**2 + b * x + c)

g = [1, 1, 1]
c, cov = curve_fit(func, X.flatten(), y.flatten(), g)
test_ar = np.arange(min(X), max(X), 0.25)
pred = np.empty(len(test_ar))
for i in range(len(test_ar)):
    pred[i] = func(test_ar[i], c[0], c[1], c[2])
I can add higher polynomial orders to make my func more accurate, but I want to keep it simple. I would very much appreciate it if anyone can give me some help on how to find another function or improve my prediction. The figure also shows the result of the prediction:
The first thing you want to do is specify how you measure "accuracy", which in your case is not really the appropriate term.
What you are essentially doing is called linear regression. Suitable metrics in this case are the mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE). It is up to you to decide which metric to use and what threshold counts as "acceptable".
The image you are showing above (where you've fitted the line) looks fine, BUT please expand your X-axis from -100 to 300 and show us the image again; this is a known problem with high-degree polynomials.
Below is a 101 example of how to use regression in scikit-learn. In your case, if you want to use x^2 or x^3 for predicting y, you just need to add them to the data. Currently your X variable is an array (a vector); you need to expand that into a matrix where each column is a feature (x, x^2, x^3, ...).
here is some code:
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
y = [1.45952016, 1.36947283, 1.31433227, 1.24076599,
1.20577963, 1.14454815, 1.13068077, 1.09638278,
1.08121406, 1.04417094, 1.02251471, 1.01268524, 0.98535659,
0.97400591]
x = [4.571428571362048, 8.771428571548313, 12.404761904850602,
17.904761904850602, 22.904761904850602, 31.238095237873495,
37.95833333302289, 44.67857142863795, 51.39880952378735,
64.83928571408615, 71.5595238097012, 85., 98.55357142863795, 112.1071428572759]
df = pd.DataFrame({
    'x': x,
    'x^2': [i**2 for i in x],
    'x^3': [i**3 for i in x],
    'y': y
})
X = df[['x','x^2','x^3']]
y = df['y']
model = linear_model.LinearRegression()
model.fit(X, y)
y1 = model.predict(X)
coef = model.coef_
intercept = model.intercept_
you can see the coefficients from the coef variable:
array([-1.67456732e-02, 2.03899728e-04, -8.70976426e-07])
you can see the intercept from the intercept variable:
1.5042389677980577
which in your case means: y1 = -1.67e-2*x + 2.03e-4*x^2 - 8.70e-7*x^3 + 1.5
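Since mean_squared_error and r2_score are already imported above, here is a short sketch of how the fit could be quantified with those metrics (using y and y1 from the code above):
# quantify the fit with the metrics mentioned above
mse = mean_squared_error(y, y1)
print('MSE :', mse)
print('RMSE:', mse ** 0.5)
print('R^2 :', r2_score(y, y1))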

Why is scipy.optimize.curve_fit not producing a line of best fit for my points?

I am trying to plot several datasets for repeat R-T measurements and fit a cubic root line of best fit through each dataset using scipy.optimize.curve_fit.
My code produces a line for each dataset, but not a cubic root line of best fit. Each dataset is colour-coded to its corresponding line of best fit:
I've tried increasing the order of magnitude of my data, as I heard that sometimes scipy.optimize.curve_fit doesn't like very small numbers, but this made no change. If anyone could point out where I am going wrong I would be extremely grateful:
import numpy as np
from scipy.optimize import curve_fit
import scipy.optimize as scpo
import matplotlib.pyplot as plt
files = [ '50mA30%set1.lvm','50mA30%set3.lvm', '50mA30%set4.lvm',
'50mA30%set5.lvm']
for file in files:
    data = np.loadtxt(file)
    current_YBCO = data[:, 1]
    voltage_YBCO = data[:, 2]
    current_thermometer = data[:, 3]
    voltage_thermometer = data[:, 4]
    T = data[:, 5]
    R = voltage_thermometer / current_thermometer

    # convert thermometer resistance to temperature with a 4th-order polynomial
    p = np.polyfit(R, T, 4)
    T_fit = p[0]*R**4 + p[1]*R**3 + p[2]*R**2 + p[3]*R + p[4]
    y = voltage_YBCO / current_YBCO

    def test(T_fit, a, b, c):
        return a * (T_fit + b)**(1/3) + c

    param, param_cov = curve_fit(test, np.array(T_fit), np.array(y),
                                 maxfev=100000)
    ans = param[0]*(np.array(T_fit) + param[1])**(1/3) + param[2]

    plt.scatter(T_fit, y, 0.5)
    plt.plot(T_fit, ans, '--', label="optimized data")

plt.xlabel("YBCO temperature (K)")
plt.ylabel("Resistance of YBCO (Ohms)")
plt.xlim(97, 102)
plt.ylim(-.00025, 0.00015)
Two things are making this harder for you.
First, cube roots of negative numbers for numpy arrays. If you try this you'll see that you aren't getting the result you want:
x = np.array([-8, 0, 8])
x**(1/3) # array([nan, 0., 2.])
This means that your test function is going to have a problem any time it gets a negative value, and you need the negative values to create the left-hand side of the curves. Instead, use np.cbrt:
x = np.array([-8, 0, 8])
np.cbrt(x) # array([-2., 0., 2.])
Secondly, your function is
def test(T_fit, a, b, c):
    return a * (T_fit + b)**(1/3) + c
Unfortunately, this just doesn't look very much like the graph you show, which makes it really hard for the optimisation to find a "good" fit. Things I particularly dislike about this function are:
- it goes vertical at T_fit == -b, whereas your data has a definite slope at this point;
- it keeps growing quite strongly away from T_fit == -b, whereas your data goes horizontal.
However, it is sometimes possible to get a more "sensible fit" by giving the optimisation a good starting point.
You haven't given us any data to work from, which makes this much harder. So, by way of illustration, try this:
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize
fig, ax = plt.subplots(1)
# Generate some data which looks a bit like yours
x = np.linspace(95, 105, 1001)
y = 0.001 * (-0.5 + 1/(1 + np.exp((100-x)/0.5)) + 0.125 * np.random.rand(len(x)))
# A fitting function
def fit(x, a, b, c):
    return a * np.cbrt(x + b) + c
# Perform the fitting
param, param_cov = scipy.optimize.curve_fit(fit, x, y, p0=[0.004, -100, 0.001], maxfev=100000)
# Calculate the fitted data:
yf = fit(x, *param)
print(param)
# Plot data and the fitted curve
ax.plot(x, y, '.')
ax.plot(x, yf, '-')
Now, if I run this code I do get a fit which roughly follows the data. However, if I take the initial guess out, i.e. do the fitting by calling
param, param_cov = scipy.optimize.curve_fit(fit, x, y, maxfev=100000)
then the fit is much worse. The reason is that curve_fit will start from an initial guess of [1, 1, 1]. The solution which looks approximately right lies in a different valley from [1, 1, 1], and therefore it isn't the solution that is found. Said another way, the optimiser only finds a local minimum, not the global one.
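If a good starting point is hard to guess, one common workaround is a simple multi-start search: try several initial guesses and keep the best fit. A rough sketch reusing fit, x and y from the example above (the guess ranges are arbitrary and would need tuning for real data):
# multi-start: try several random initial guesses, keep the lowest-residual fit
rng = np.random.default_rng(0)
best_param, best_err = None, np.inf
for _ in range(20):
    p0 = rng.uniform(-100, 100, size=3)        # arbitrary search box
    try:
        p, _ = scipy.optimize.curve_fit(fit, x, y, p0=p0, maxfev=10000)
    except RuntimeError:
        continue                               # this start did not converge
    err = np.sum((fit(x, *p) - y) ** 2)
    if err < best_err:
        best_param, best_err = p, err
print(best_param, best_err)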

L1 norm instead of L2 norm for cost function in regression model

I was wondering if there's a function in Python that would do the same job as scipy.linalg.lstsq but uses “least absolute deviations” regression instead of “least squares” regression (OLS). I want to use the L1 norm, instead of the L2 norm.
In fact, I have 3d points and I want their best-fit plane. The common approach is the least-squares method, as in this Github link. But it's known that this doesn't always give the best fit, especially when there are outliers in the data set, and it's better to compute the least absolute deviations. The difference between the two methods is explained further here.
It can't be solved by functions such as MAD, since it's an Ax = b matrix equation and requires iteration to minimise the result. Does anyone know of a relevant function in Python - probably in a linear algebra package - that would calculate "least absolute deviations" regression?
This is not so difficult to roll yourself, using scipy.optimize.minimize and a custom cost_function.
Let us first import the necessities,
from scipy.optimize import minimize
import numpy as np
And define a custom cost function (and a convenience wrapper for obtaining the fitted values),
def fit(X, params):
    return X.dot(params)

def cost_function(params, X, y):
    return np.sum(np.abs(y - fit(X, params)))
Then, if you have some X (design matrix) and y (observations), we can do the following,
output = minimize(cost_function, x0, args=(X, y))
y_hat = fit(X, output.x)
where x0 is some suitable initial guess for the optimal parameters (you could take @JamesPhillips' advice here, and use the fitted parameters from an OLS approach).
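For instance (a small sketch, using the X and y from the test example below), the OLS starting point can be computed directly with numpy:
# ordinary least-squares solution as the starting guess for the L1 fit
x0 = np.linalg.lstsq(X, y, rcond=None)[0]
output = minimize(cost_function, x0, args=(X, y))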
In any case, when test-running with a somewhat contrived example,
X = np.asarray([np.ones((100,)), np.arange(0, 100)]).T
y = 10 + 5 * np.arange(0, 100) + 25 * np.random.random((100,))
I find,
      fun: 629.4950595335436
 hess_inv: array([[  9.35213468e-03,  -1.66803210e-04],
       [ -1.66803210e-04,   1.24831279e-05]])
      jac: array([  0.00000000e+00,  -1.52587891e-05])
  message: 'Optimization terminated successfully.'
     nfev: 144
      nit: 11
     njev: 36
   status: 0
  success: True
        x: array([ 19.71326758,   5.07035192])
And,
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes()
ax.plot(y, 'o', color='black')
ax.plot(y_hat, 'o', color='blue')
plt.show()
With the fitted values in blue, and the data in black.
You can solve your problem using the scipy.optimize.minimize function. You have to define the function you want to minimize (in our case a plane of the form Z = aX + bY + c) and the error function (the L1 norm), then run the minimizer with some starting value.
import numpy as np
import scipy.linalg
from scipy.optimize import minimize
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
def fit(X, params):
    # 3d plane: Z = aX + bY + c
    return X.dot(params[:2]) + params[2]

def cost_function(params, X, y):
    # L1 norm
    return np.sum(np.abs(y - fit(X, params)))
We generate 3d points
# Generating 3-dim points
mean = np.array([0.0,0.0,0.0])
cov = np.array([[1.0,-0.5,0.8], [-0.5,1.1,0.0], [0.8,0.0,1.0]])
data = np.random.multivariate_normal(mean, cov, 50)
Lastly, we run the minimizer:
output = minimize(cost_function, [0.5,0.5,0.5], args=(np.c_[data[:,0], data[:,1]], data[:, 2]))
y_hat = fit(np.c_[data[:,0], data[:,1]], output.x)
X,Y = np.meshgrid(np.arange(min(data[:,0]), max(data[:,0]), 0.5), np.arange(min(data[:,1]), max(data[:,1]), 0.5))
XX = X.flatten()
YY = Y.flatten()
# evaluate the fitted plane on the grid
Z = output.x[0]*X + output.x[1]*Y + output.x[2]
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, alpha=0.2)
ax.scatter(data[:,0], data[:,1], data[:,2], c='r')
plt.show()
Note: I used the previous answer's code and the code from the Github link as a starting point.

Python curve_fit with multiple independent variables

Python's curve_fit calculates the best-fit parameters for a function with a single independent variable, but is there a way, using curve_fit or something else, to fit for a function with multiple independent variables? For example:
def func(x, y, a, b, c):
    return log(a) + b*log(x) + c*log(y)
where x and y are the independent variables and we would like to fit for a, b, and c.
You can pass curve_fit a multi-dimensional array for the independent variables, but then your func must accept the same thing. For example, calling this array X and unpacking it to x, y for clarity:
import numpy as np
from scipy.optimize import curve_fit
def func(X, a, b, c):
    x, y = X
    return np.log(a) + b*np.log(x) + c*np.log(y)
# some artificially noisy data to fit
x = np.linspace(0.1,1.1,101)
y = np.linspace(1.,2., 101)
a, b, c = 10., 4., 6.
z = func((x,y), a, b, c) * 1 + np.random.random(101) / 100
# initial guesses for a,b,c:
p0 = 8., 2., 7.
print(curve_fit(func, (x,y), z, p0))
Gives the fit:
(array([ 9.99933937,  3.99710083,  6.00875164]),
 array([[  1.75295644e-03,   9.34724308e-05,  -2.90150983e-04],
        [  9.34724308e-05,   5.09079478e-06,  -1.53939905e-05],
        [ -2.90150983e-04,  -1.53939905e-05,   4.84935731e-05]]))
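Once fitted, the returned parameters can be reused to evaluate the model at other points; a short sketch (the new input values here are made up for illustration):
popt, pcov = curve_fit(func, (x, y), z, p0)
# evaluate the fitted model on new (made-up) inputs
x_new = np.linspace(0.2, 1.0, 5)
y_new = np.linspace(1.2, 1.8, 5)
print(func((x_new, y_new), *popt))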
optimizing a function with multiple input dimensions and a variable number of parameters
This example shows how to fit a polynomial with a two-dimensional input (R^2 -> R) with an increasing number of coefficients. The design is very flexible, so the callable f passed to curve_fit is defined once for any number of non-keyword arguments.
minimal reproducible example
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def poly2d(xy, *coefficients):
    x = xy[:, 0]
    y = xy[:, 1]
    proj = x + y
    res = 0
    for order, coef in enumerate(coefficients):
        res += coef * proj ** order
    return res
nx = 31
ny = 21
range_x = [-1.5, 1.5]
range_y = [-1, 1]
target_coefficients = (3, 0, -19, 7)
xs = np.linspace(*range_x, nx)
ys = np.linspace(*range_y, ny)
im_x, im_y = np.meshgrid(xs, ys)
xdata = np.c_[im_x.flatten(), im_y.flatten()]
im_target = poly2d(xdata, *target_coefficients).reshape(ny, nx)
fig, axs = plt.subplots(2, 3, figsize=(29.7, 21))
axs = axs.flatten()
ax = axs[0]
ax.set_title('Unknown polynomial P(x+y)\n[secret coefficients: ' + str(target_coefficients) + ']')
sm = ax.imshow(
im_target,
cmap = plt.get_cmap('coolwarm'),
origin='lower'
)
fig.colorbar(sm, ax=ax)
for order in range(5):
    ydata = im_target.flatten()
    popt, pcov = curve_fit(poly2d, xdata=xdata, ydata=ydata, p0=[0]*(order+1))
    im_fit = poly2d(xdata, *popt).reshape(ny, nx)

    ax = axs[1+order]
    title = 'Fit O({:d}):'.format(order)
    for o, p in enumerate(popt):
        if o % 2 == 0:
            title += '\n'
        if o == 0:
            title += ' {:=-{w}.1f} (x+y)^{:d}'.format(p, o, w=int(np.log10(max(abs(p), 1))) + 5)
        else:
            title += ' {:=+{w}.1f} (x+y)^{:d}'.format(p, o, w=int(np.log10(max(abs(p), 1))) + 5)
    title += '\nrms: {:.1f}'.format(np.mean((im_fit-im_target)**2)**.5)
    ax.set_title(title)
    sm = ax.imshow(
        im_fit,
        cmap=plt.get_cmap('coolwarm'),
        origin='lower'
    )
    fig.colorbar(sm, ax=ax)

for ax in axs.flatten():
    ax.set_xlabel('x')
    ax.set_ylabel('y')
plt.show()
P.S. The concept of this answer is identical to my other answer here, but the code example is much clearer. In due time, I will delete the other answer.
Fitting to an unknown number of parameters
In this example, we try to reproduce some measured data measData.
In this example, measData is generated by the function measuredData(x, a=.2, b=-2, c=-.8, d=.1). In practice, we might have measured measData in some way, so we have no idea how it is described mathematically. Hence the fit.
We fit with a polynomial, which is described by the function polynomFit(inp, *args). As we want to try out different polynomial orders, it is important to be flexible in the number of input parameters.
The independent variables (x and y in your case) are encoded in the 'columns'/second dimension of inp.
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def measuredData(inp, a=.2, b=-2, c=-.8, d=.1):
    x = inp[:, 0]
    y = inp[:, 1]
    return a + b*x + c*x**2 + d*x**3 + y

def polynomFit(inp, *args):
    x = inp[:, 0]
    y = inp[:, 1]
    res = 0
    for order in range(len(args)):
        res += args[order] * x**order
    return res + y
inpData=np.linspace(0,10,20).reshape(-1,2)
inpDataStr=['({:.1f},{:.1f})'.format(a,b) for a,b in inpData]
measData=measuredData(inpData)
fig, ax = plt.subplots()
ax.plot(np.arange(inpData.shape[0]), measData, label='measured', marker='o', linestyle='none')
for order in range(5):
    popt, pcov = curve_fit(polynomFit, xdata=inpData, ydata=measData, p0=[0]*(order+1))
    fitData = polynomFit(inpData, *popt)
    ax.plot(np.arange(inpData.shape[0]), fitData, label='polyn. fit, order '+str(order), linestyle='--')
    ax.legend(loc='upper left', bbox_to_anchor=(1.05, 1))
    print(order, popt)

ax.set_xticklabels(inpDataStr, rotation=90)
Result:
Yes, we can pass multiple variables to curve_fit. I have written a piece of code:
import numpy as np
from scipy.optimize import curve_fit

x = np.random.randn(2, 100)
w = np.array([1.5, 0.5]).reshape(1, 2)
esp = np.random.randn(1, 100)
y = np.dot(w, x) + esp
y = y.reshape(100,)
In the above code I have generated x, a 2D data set of shape (2, 100), i.e. two variables with 100 data points each. I have fit the dependent variable y to the independent variables x with some added noise.
def model_func(x, w1, w2, b):
    w = np.array([w1, w2]).reshape(1, 2)
    b = np.array([b]).reshape(1, 1)
    y_p = np.dot(w, x) + b
    return y_p.reshape(100,)
We have defined a model function that establishes the relation between y and x.
Note: the output of the model function (the predicted y) should have shape (number of data points,).
popt, pcov = curve_fit(model_func,x,y)
popt is a 1D numpy array containing the fitted parameters. In our case there are 3 parameters.
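As a quick sanity check (a sketch using the variables defined above), the fitted parameters should roughly recover the values used to generate the data:
print(popt)        # fitted [w1, w2, b]
print(w.ravel())   # true weights [1.5, 0.5]; b should come out near 0, since only zero-mean noise was added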
Yes, there is: simply give curve_fit a multi-dimensional array for xData.
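For completeness, a minimal sketch of that approach (the model function and numbers here are made up for illustration):
import numpy as np
from scipy.optimize import curve_fit

def func(X, a, b):
    x1, x2 = X                      # unpack the two independent variables
    return a * x1 + b * x2

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 50)
x2 = rng.uniform(1, 2, 50)
z = 3.0 * x1 + 0.5 * x2 + 0.01 * rng.standard_normal(50)

popt, pcov = curve_fit(func, (x1, x2), z)
print(popt)                         # should be close to [3.0, 0.5]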

numpy.polyfit with adapted parameters

Regarding this: polynomial equation parameters
where I get 3 parameters for a quadratic function y = a*x² + b*x + c. Now I want to get only the first parameter for a quadratic that describes my function y = a*x². In other words: I want to set b = c = 0 and obtain the adapted parameter for a. If I understand it correctly, polyfit isn't able to do this.
This can be done by numpy.linalg.lstsq. To explain how to use it, it is maybe easiest to show how you would do a standard 2nd order polyfit 'by hand'. Assuming you have your measurement vectors x and y, you first construct a so-called design matrix M like so:
M = np.column_stack((x**2, x, np.ones_like(x)))
after which you can obtain the usual coefficients as the least-square solution to the equation M * k = y using lstsq like this:
k, _, _, _ = np.linalg.lstsq(M, y, rcond=None)
where k is the column vector [a, b, c] with the usual coefficients. Note that lstsq returns some other parameters, which you can ignore. This is a very powerful trick, which allows you to fit y to any linear combination of the columns you put into your design matrix. It can be used e.g. for 2D fits of the type z = a * x + b * y (see e.g. this example, where I used the same trick in Matlab), or polyfits with missing coefficients like in your problem.
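For example, the z = a * x + b * y case mentioned above could look like this (a short sketch with made-up data):
import numpy as np

# made-up data on the plane z = 2*x - 3*y, plus a little noise
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = rng.uniform(-1, 1, 200)
z = 2 * x - 3 * y + 0.05 * rng.standard_normal(200)

M = np.column_stack((x, y))                  # design matrix with the two columns to fit against
k, _, _, _ = np.linalg.lstsq(M, z, rcond=None)
print(k)                                     # approximately [ 2., -3.]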
In your case, the design matrix is simply a single column containing x**2. Quick example:
import numpy as np
import matplotlib.pylab as plt
# generate some noisy data
x = np.arange(1000)
y = 0.0001234 * x**2 + 3*np.random.randn(len(x))
# do fit
M = np.column_stack((x**2,)) # construct design matrix
k, _, _, _ = np.linalg.lstsq(M, y, rcond=None) # least-squares fit of M * k = y
# quick plot
plt.plot(x, y, '.', x, k*x**2, 'r', linewidth=3)
plt.legend(('measurement', 'fit'), loc=2)
plt.title('best fit: y = {:.8f} * x**2'.format(k[0]))
plt.show()
Result:
The coefficients are obtained by minimizing the squared error; you don't assign them. However, you can set some of the coefficients to zero afterwards if they are negligibly small. E.g., I have a list of points on the curve y = 33*x²:
In [51]: x=np.arange(20)
In [52]: y=33*x**2 #y = 33*x²
In [53]: coeffs=np.polyfit(x, y, 2)
In [54]: coeffs
Out[54]: array([ 3.30000000e+01, 8.99625199e-14, -7.62430619e-13])
In [55]: epsilon=np.finfo(np.float32).eps
In [56]: coeffs[np.abs(coeffs)<epsilon]=0
In [57]: coeffs
Out[57]: array([ 33., 0., 0.])
