I want to fit a function to the independent (X) and dependent (y) variables:
import numpy as np
y = np.array([1.45952016, 1.36947283, 1.31433227, 1.24076599, 1.20577963,
1.14454815, 1.13068077, 1.09638278, 1.08121406, 1.04417094,
1.02251471, 1.01268524, 0.98535659, 0.97400591])
X = np.array([4.571428571362048, 8.771428571548313, 12.404761904850602, 17.904761904850602,
22.904761904850602, 31.238095237873495, 37.95833333302289,
44.67857142863795, 51.39880952378735, 64.83928571408615,
71.5595238097012, 85., 98.55357142863795, 112.1071428572759])
I already tried scipy package in this way:
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return 1/(a*(x**2) + b*(x**1) + c)
g = [1, 1, 1]
c, cov = curve_fit (func, X.flatten(), y.flatten(), g)
test_ar = np.arange(min(X), max(X), 0.25)
pred = np.empty(len(test_ar))
for i in range(len(test_ar)):
    pred[i] = func(test_ar[i], c[0], c[1], c[2])
I can add higher orders of polynomial to make my func more accurate, but I want to keep it simple. I would very much appreciate it if anyone can give me some help on how to find another function or make my prediction better. The figure also shows the result of the prediction:
The first thing you want to do is specify how you measure "accuracy", which in your case is not an appropriate term at all.
What you are essentially doing is called regression (accuracy is a classification metric). Suitable metrics in this case are mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE). It is up to you to decide which metric to use and what threshold to set for being "acceptable".
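To make the three metrics concrete, here is a minimal sketch in plain NumPy. The helper name regression_errors and the toy numbers are only for illustration, they are not from the question.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return (MSE, RMSE, MAE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)        # mean squared error
    rmse = np.sqrt(mse)                  # same units as y
    mae = np.mean(np.abs(residuals))     # less sensitive to single large errors
    return mse, rmse, mae

# Toy numbers chosen only for illustration (not the data from the question).
mse, rmse, mae = regression_errors([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(mse, rmse, mae)
```

RMSE is often the easiest to interpret because it is in the same units as y.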
The image that you are showing above (where you've fitted the line) looks fine, BUT please expand your x-axis to run from -100 to 300 and show us the image again. Behaviour outside the data range is a problem with high-degree polynomials.
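A quick way to see the extrapolation problem without plotting is to evaluate a fitted polynomial outside the data range. The sketch below uses the data from the question and, as an assumption for illustration, a plain cubic fitted with np.polyfit (not the 1/(quadratic) function from the question).

```python
import numpy as np

x = np.array([4.571428571362048, 8.771428571548313, 12.404761904850602,
              17.904761904850602, 22.904761904850602, 31.238095237873495,
              37.95833333302289, 44.67857142863795, 51.39880952378735,
              64.83928571408615, 71.5595238097012, 85., 98.55357142863795,
              112.1071428572759])
y = np.array([1.45952016, 1.36947283, 1.31433227, 1.24076599, 1.20577963,
              1.14454815, 1.13068077, 1.09638278, 1.08121406, 1.04417094,
              1.02251471, 1.01268524, 0.98535659, 0.97400591])

# Cubic least-squares fit (an illustrative choice of model).
coeffs = np.polyfit(x, y, 3)

# Predictions inside the observed range stay close to the data ...
inside = np.polyval(coeffs, x)
# ... but the cubic term takes over once we leave that range.
outside = np.polyval(coeffs, np.array([-100.0, 300.0]))

print("inside range :", inside.min(), inside.max())
print("at x=-100, 300:", outside)
```

The values at -100 and 300 land far away from anything seen in the data, which is why the fit should only be trusted inside the range it was fitted on.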
This is a basic (101) example of how to use regression in scikit-learn. In your case, if you want to use x^2 or x^3 for predicting y, you just need to add them to the data. Currently your X variable is an array (a vector); you need to expand it into a matrix where each column is a feature (x, x^2, x^3, ...).
Here is some code:
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
y = [1.45952016, 1.36947283, 1.31433227, 1.24076599,
1.20577963, 1.14454815, 1.13068077, 1.09638278,
1.08121406, 1.04417094, 1.02251471, 1.01268524, 0.98535659,
0.97400591]
x = [4.571428571362048, 8.771428571548313, 12.404761904850602,
17.904761904850602, 22.904761904850602, 31.238095237873495,
37.95833333302289, 44.67857142863795, 51.39880952378735,
64.83928571408615, 71.5595238097012, 85., 98.55357142863795, 112.1071428572759]
df = pd.DataFrame({
    'x': x,
    'x^2': [i**2 for i in x],
    'x^3': [i**3 for i in x],
    'y': y
})
X = df[['x','x^2','x^3']]
y = df['y']
model = linear_model.LinearRegression()
model.fit(X, y)
y1 = model.predict(X)
coef = model.coef_
intercept = model.intercept_
You can see the coefficients in the coef variable:
array([-1.67456732e-02, 2.03899728e-04, -8.70976426e-07])
The intercept is in the intercept variable (about 1.5),
which in your case means: y1 = -1.67e-2*x + 2.04e-4*x^2 - 8.71e-7*x^3 + 1.5
I'm trying to fit a linear model to a set of data, with the constraint that all the residuals (model - data) are positive - in other words, the model should be the "best overestimate". Without this constraint, linear models can be easily found with numpy's polyfit as shown below.
import numpy as np
import matplotlib.pyplot as plt
x = [-4.12179107e-01, -1.40664082e-01, -5.52301563e-06, 1.82898473e-01]
y = [-4.14846251, -3.31607886, -3.57827245, -5.09914559]
coeff = np.polyfit(x,y,1)
plt.plot(x,np.polyval([-2,-3.6],x),c='g',label='desired-fit') #a rough guess of the desired result
Is there an efficient way to implement a linear fit with this type of constraint?
This is a quadratic programming problem. There are several libraries (CVXOPT, quadprog etc.) that can be used to solve it. Here is an example using quadprog:
import numpy as np
import matplotlib.pyplot as plt
import quadprog
x = [-4.12179107e-01, -1.40664082e-01, -5.52301563e-06, 1.82898473e-01]
y = [-4.14846251, -3.31607886, -3.57827245, -5.09914559]
A = np.c_[x, np.ones(len(x))]
y = np.array(y)
G = A.T @ A
a = A.T @ y
C = A.T
b = y
coeffs = quadprog.solve_qp(G, a, C, b)[0]
plt.scatter(x, y)
plt.plot(x, np.polyval(coeffs, x), c='r')
This gives:
See e.g. this post for more information. It describes, in particular, how to set up a linear regression problem as a quadratic programming problem.
As a side note, the optimal line will always pass through one data point, but it need not pass through two such points. For example, take x = [-1., 0., 1.] and y = [1., 2., 1.].
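The side note can be checked on that three-point example with a brute-force sketch in plain NumPy. This is not what quadprog does internally: it simply enumerates every line that touches one data point (with the least-squares slope through that point) or two data points, keeps the ones that lie on or above all the data, and picks the one with the smallest sum of squares. The function name best_overestimate is made up for this illustration, and the approach is only sensible for a handful of points.

```python
import itertools
import numpy as np

def best_overestimate(x, y, tol=1e-9):
    """Brute-force least-squares line with model - data >= 0 everywhere."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    candidates = []
    # Lines touching exactly one chosen point: best slope through (x[i], y[i]).
    for i in range(len(x)):
        dx, dy = x - x[i], y - y[i]
        if dx @ dx > 0:
            m = (dx @ dy) / (dx @ dx)
            candidates.append((m, y[i] - m * x[i]))
    # Lines through two data points.
    for i, j in itertools.combinations(range(len(x)), 2):
        if x[i] != x[j]:
            m = (y[j] - y[i]) / (x[j] - x[i])
            candidates.append((m, y[i] - m * x[i]))
    best = None
    for m, c in candidates:
        resid = m * x + c - y            # model - data
        if np.all(resid >= -tol):        # keep only overestimating lines
            sse = resid @ resid
            if best is None or sse < best[0]:
                best = (sse, m, c)
    return best[1], best[2]

m, c = best_overestimate([-1., 0., 1.], [1., 2., 1.])
print(m, c)
```

For this example the winning line is horizontal through the middle point, so it touches one data point only; the two lines through the top point and one of its neighbours are feasible but have a larger sum of squares.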
Yes, the best fit is a line through the top two points. I do an argsort to find the top Ys, compute the slope and y-intercept, and off we go:
import numpy as np
import matplotlib.pyplot as plt
x = [-4.12179107e-01, -1.40664082e-01, -5.52301563e-06, 1.82898473e-01]
y = [-4.14846251, -3.31607886, -3.57827245, -5.09914559]
coeff = np.polyfit(x,y,1)
model1 = np.polyval(coeff,x)
model1 += (y-model1).max()
z = np.argsort(y)
pt0 = (x[z[-1]],y[z[-1]])
pt1 = (x[z[-2]],y[z[-2]])
m = (pt1[1]-pt0[1])/(pt1[0]-pt0[0])
b = pt0[1]-m*pt0[0]
model2 = np.polyval([m,b],x)
Is there any way I can fit two independent variables and one dependent variable in numpy.polyfit()?
I have a pandas DataFrame that I loaded from a csv file.
I wish to include two columns as independent variables to run multiple linear regression using NumPy.
Currently my simple linear regression looks like this:
model_combined = np.polyfit(data.Exercise, y, 1)
I wish to include data.Age in x as well.
Assuming your equation is a * exercise + b * age + intercept = y, you can fit a multiple linear regression with numpy or scikit-learn as follows:
from sklearn import linear_model
import numpy as np
X = np.random.randint(low=1, high=10, size=20).reshape(10, 2)
X = np.c_[X, np.ones(X.shape[0])] # add intercept
y = np.random.randint(low=1, high=10, size=10)
# Option 1
a, b, intercept = np.linalg.pinv((X.T).dot(X)).dot(X.T.dot(y))
print(a, b, intercept)
# Option 2
a, b, intercept = np.linalg.lstsq(X,y, rcond=None)[0]
print(a, b, intercept)
# Option 3
clf = linear_model.LinearRegression(fit_intercept=False)
clf.fit(X, y)
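Since the random integers above make it hard to tell whether the fit is right, here is a small sanity check of Option 2 on synthetic data with known coefficients. The column names exercise and age mirror the question, but the numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
exercise = rng.uniform(0, 10, size=50)    # stand-in for data.Exercise
age = rng.uniform(20, 60, size=50)        # stand-in for data.Age

# Noise-free target with known coefficients, so the fit can be verified.
y = 2.0 * exercise - 0.5 * age + 3.0

# Design matrix: one column per predictor plus a column of ones for the intercept.
X = np.column_stack([exercise, age, np.ones_like(exercise)])
a, b, intercept = np.linalg.lstsq(X, y, rcond=None)[0]
print(a, b, intercept)
```

With a real DataFrame the same design matrix could be built from the two columns, for example with np.column_stack([data.Exercise, data.Age, np.ones(len(data))]).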
I want to iteratively fit a curve to data in python with the following approach:
Fit a polynomial curve (or any non-linear approach)
Discard values > 2 standard deviation from mean of the curve
repeat steps 1 and 2 till all values are within confidence interval of the curve
I can fit a polynomial curve as follows:
vals = array([0.00441025, 0.0049001 , 0.01041189, 0.47368389, 0.34841961,
0.3487533 , 0.35067096, 0.31142986, 0.3268407 , 0.38099566,
0.3933048 , 0.3479948 , 0.02359819, 0.36329588, 0.42535543,
0.01308297, 0.53873956, 0.6511364 , 0.61865282, 0.64750302,
0.6630047 , 0.66744816, 0.71759617, 0.05965622, 0.71335208,
0.71992683, 0.61635697, 0.12985441, 0.73410642, 0.77318621,
0.75675988, 0.03003641, 0.77527201, 0.78673995, 0.05049178,
0.55139476, 0.02665514, 0.61664748, 0.81121749, 0.05521697,
0.63404375, 0.32649395, 0.36828268, 0.68981099, 0.02874863,
x_values = np.linspace(0, 1, len(vals))
poly_degree = 3
coeffs = np.polyfit(x_values, vals, poly_degree)
poly_eqn = np.poly1d(coeffs)
y_hat = poly_eqn(x_values)
How do I do steps 2 and 3?
To eliminate points that are too far from an expected solution, you are probably looking for RANSAC (RANdom SAmple Consensus), which fits a curve (or any other function) to the data within certain bounds, like your case with 2*STD.
You can use the scikit-learn RANSAC estimator, which is well aligned with the included regressors such as LinearRegression. For your polynomial case you need to define your own regression class:
import numpy as np
from sklearn.metrics import mean_squared_error
class PolynomialRegression(object):
    def __init__(self, degree=3, coeffs=None):
        self.degree = degree
        self.coeffs = coeffs
    def fit(self, X, y):
        # np.polyfit expects 1D x, so flatten the (n, 1) array from sklearn
        self.coeffs = np.polyfit(X.ravel(), y, self.degree)
    def get_params(self, deep=False):
        return {'coeffs': self.coeffs}
    def set_params(self, coeffs=None, random_state=None):
        self.coeffs = coeffs
    def predict(self, X):
        poly_eqn = np.poly1d(self.coeffs)
        y_hat = poly_eqn(X.ravel())
        return y_hat
    def score(self, X, y):
        return mean_squared_error(y, self.predict(X))
and then you can use RANSAC
from sklearn.linear_model import RANSACRegressor
ransac = RANSACRegressor(PolynomialRegression(degree=poly_degree),
                         residual_threshold=2 * np.std(y_vals))
ransac.fit(np.expand_dims(x_vals, axis=1), y_vals)
inlier_mask = ransac.inlier_mask_
Note that the X variable is transformed to a 2D array, as required by the sklearn RANSAC implementation, and flattened back in our custom class because numpy's polyfit function works with 1D arrays.
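The shape round trip mentioned in the note looks like this (a tiny sketch with a made-up five-point x_vals, just to show the shapes):

```python
import numpy as np

x_vals = np.linspace(0, 1, 5)             # 1D, as np.polyfit expects
X_2d = np.expand_dims(x_vals, axis=1)     # (n_samples, 1), as sklearn expects
x_back = X_2d.ravel()                     # back to 1D inside fit/predict

print(x_vals.shape, X_2d.shape, x_back.shape)
```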
y_hat = ransac.predict(np.expand_dims(x_vals, axis=1))
plt.plot(x_vals, y_vals, 'bx', label='input samples')
plt.plot(x_vals[inlier_mask], y_vals[inlier_mask], 'go', label='inliers (2*STD)')
plt.plot(x_vals, y_hat, 'r-', label='estimated curve')
Moreover, playing with the polynomial order and the residual distance, I got the following results with degree=4 and a range of 1*STD.
Another option is to use a higher-order regressor such as a Gaussian process:
from sklearn.gaussian_process import GaussianProcessRegressor
ransac = RANSACRegressor(GaussianProcessRegressor(),
Talking about generalization to DataFrame, you just need to set that all columns except one are features and the one remaining is the output, like here:
import pandas as pd
df = pd.DataFrame(np.array([x_vals, y_vals]).T)
ransac.fit(df[df.columns[:-1]], df[df.columns[-1]])
y_hat = ransac.predict(df[df.columns[:-1]])
it doesn't look like you'll get anything worthwhile following that procedure, there are much better techniques for handling unexpected data. googling for "outlier detection" would be a good start.
with that said, here's how to answer your question:
start by pulling in libraries and getting some data:
import matplotlib.pyplot as plt
import numpy as np
Y = np.array([
0.00441025, 0.0049001 , 0.01041189, 0.47368389, 0.34841961,
0.3487533 , 0.35067096, 0.31142986, 0.3268407 , 0.38099566,
0.3933048 , 0.3479948 , 0.02359819, 0.36329588, 0.42535543,
0.01308297, 0.53873956, 0.6511364 , 0.61865282, 0.64750302,
0.6630047 , 0.66744816, 0.71759617, 0.05965622, 0.71335208,
0.71992683, 0.61635697, 0.12985441, 0.73410642, 0.77318621,
0.75675988, 0.03003641, 0.77527201, 0.78673995, 0.05049178,
0.55139476, 0.02665514, 0.61664748, 0.81121749, 0.05521697,
0.63404375, 0.32649395, 0.36828268, 0.68981099, 0.02874863,
X = np.linspace(0, 1, len(Y))
next do an initial plot of the data:
plt.plot(X, Y, '.')
as this lets you see what we're dealing with and whether a polynomial would ever be a good fit --- short answer is that this method isn't going to get very far with this sort of data
at this point we should stop, but to answer the question I'll go on, mostly following your polynomial fitting code:
poly_degree = 5
sd_cutoff = 1 # 2 keeps everything
coeffs = np.polyfit(X, Y, poly_degree)
poly_eqn = np.poly1d(coeffs)
Y_hat = poly_eqn(X)
delta = Y - Y_hat
sd_p = np.std(delta)
ok = abs(delta) < sd_p * sd_cutoff
hopefully this makes sense, I use a higher degree polynomial and only cutoff at 1SD because otherwise nothing will be thrown away. the ok array contains True values for those points that are within sd_cutoff standard deviations
to check this, I'd then do another plot. something like:
plt.scatter(X, Y, color=np.where(ok, 'k', 'r'))
plt.fill_between(
    X,
    Y_hat - sd_p * sd_cutoff,
    Y_hat + sd_p * sd_cutoff,
    alpha=0.2)
plt.plot(X, Y_hat)
which gives me:
so the black dots are the points to keep (i.e. X[ok] gives me these back, and np.where(ok) gives you indices).
you can play around with the parameters, but you probably want a distribution with fatter tails (e.g. a Student's T-distribution) but, as I said above, using Google for outlier detection would be my suggestion
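for completeness, the single pass above can be wrapped in a loop that refits and rejects until nothing else gets thrown away (steps 2 and 3 of the question). this is only a sketch: the function name iterative_clip, the max_iter guard and the synthetic data are my own assumptions, and the caveats above about this procedure still apply.

```python
import numpy as np

def iterative_clip(x, y, degree=3, sd_cutoff=2.0, max_iter=50):
    """Refit a polynomial, dropping points beyond sd_cutoff residual SDs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(len(x), dtype=bool)
    coeffs = np.polyfit(x, y, degree)
    for _ in range(max_iter):
        resid = y - np.polyval(coeffs, x)
        sd = np.std(resid[keep])                     # SD of the kept points only
        new_keep = keep & (np.abs(resid) <= sd_cutoff * sd)
        # stop when nothing changes or too few points remain for the fit
        if new_keep.sum() == keep.sum() or new_keep.sum() <= degree + 1:
            break
        keep = new_keep
        coeffs = np.polyfit(x[keep], y[keep], degree)
    return coeffs, keep

# synthetic data: a gentle trend plus four obvious outliers
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)
y = 0.2 + 0.6 * x + 0.05 * rng.standard_normal(x.size)
outliers = np.array([5, 20, 35, 50])
y[outliers] += 5.0

coeffs, keep = iterative_clip(x, y, degree=3, sd_cutoff=2.0)
print(keep.sum(), "of", len(x), "points kept")
```

note that repeated clipping keeps shrinking the residual SD, so some perfectly good points are usually rejected as well.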
Three functions are needed to solve this. First, a line-fitting function is necessary to fit a line to a set of points:
def fit_line(x_values, vals, poly_degree):
    coeffs = np.polyfit(x_values, vals, poly_degree)
    poly_eqn = np.poly1d(coeffs)
    y_hat = poly_eqn(x_values)
    return poly_eqn, y_hat
We need to know the standard deviation from the points to the line. This function computes that standard deviation:
def compute_sd(x_values, vals, y_hat):
    distances = []
    for x, y, y1 in zip(x_values, vals, y_hat):
        distances.append(abs(y - y1))
    return np.std(distances)
Finally, we need to compare the distance from a point to the line. The point needs to be thrown out if the distance from the point to the line is greater than two times the standard deviation.
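def compare_distances(x_values, vals):
    # uses poly_eqn and sd from the enclosing (global) scope
    new_vals, new_x_vals = [], []
    for x, y in zip(x_values, vals):
        y1 = np.polyval(poly_eqn, x)
        distance = abs(y - y1)
        if distance < 2*sd:
            # close enough: keep the point
            plt.plot((x,x),(y,y1), c='g')
            new_vals.append(y)
            new_x_vals.append(x)
        else:
            # too far from the fitted line: drop the point
            plt.plot((x,x),(y,y1), c='r')
            plt.scatter(x,y, c='r')
    return new_vals, new_x_vals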
As you can see in the following graphs, this method does not work well for fitting a line to data that has a lot of outliers. All the points end up getting eliminated for being too far from the fitted line.
while len(vals)>0:
    poly_eqn, y_hat = fit_line(x_values, vals, poly_degree)
    plt.scatter(x_values, vals)
    plt.plot(x_values, y_hat)
    sd = compute_sd(x_values, vals, y_hat)
    new_vals, new_x_vals = compare_distances(x_values, vals)
    vals, x_values = np.array(new_vals), np.array(new_x_vals)
Consider the following MWE
import numpy as np
from scipy.optimize import curve_fit
def linear(x, a, b):
    return (x/b)**a
for ix in range(Y.shape[0]):
    c0, pcov = curve_fit(linear, X, Y[ix])
XX=np.tile(X, Y.shape[0])
c0, pcov = curve_fit(linear, XX, Y.flatten())
I have a problem where I have to do basically that, but instead of 15 iterations it's thousands and it's pretty slow.
Is there any way to do all of those iterations at once with curve_fit? I know the result from the function is supposed to be a 1D-array, so just passing the args like this
c0, pcov = curve_fit(linear, X, Y)
is not going to work. Also I think the answer has to be in flattening Y, so I can get a flattened result, but I just can't get anything to work.
I know that if I do something like
XX=np.tile(X, Y.shape[0])
c0, pcov = curve_fit(linear, XX, Y.flatten())
then I get a "mean" value of the coefficients, but that's not what I want.
For the record, I solved with using Jacques Kvam's set-up but implemented using Numpy (because of a limitation)
lX = np.log(X)
lY = np.log(Y)
A = np.vstack([lX, np.ones(len(lX))]).T
m, c=np.linalg.lstsq(A, lY.T)[0]
And then m is a, and b follows from the intercept c as b = exp(-c/m).
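The reasoning is that log(y) = a*log(x) - a*log(b), so the slope of the log-log fit is a and the intercept is c = -a*log(b), which gives b = exp(-c/m). Here is a self-contained sketch on noise-free synthetic data (the X values and the true parameters are invented for illustration), fitting all rows of Y in a single lstsq call:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
true_a = np.array([1.0, 2.0, 3.0])
true_b = np.array([1.5, 2.0, 2.5])

# One row of Y per (a, b) pair: Y[i] = (X / b_i) ** a_i
Y = (X[None, :] / true_b[:, None]) ** true_a[:, None]

# log(Y) = a*log(X) - a*log(b): a straight line in log-log space.
A = np.column_stack([np.log(X), np.ones_like(X)])
# lstsq accepts a 2D right-hand side and solves every column at once.
slopes, intercepts = np.linalg.lstsq(A, np.log(Y).T, rcond=None)[0]

a_est = slopes
b_est = np.exp(-intercepts / slopes)
print(a_est, b_est)
```

This only works when all X and Y values are positive, and with noisy data the result will differ somewhat from a direct curve_fit because the noise is transformed by the logarithm.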
Least squares won't give the same result because the noise is transformed by log in this case. If the noise is zero, both methods give the same result.
import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
Y = np.zeros((4, 6))
for i in range(4):
    b = a = i + 1
    Y[i] = (X/b)**a + 0.01 * rng.randn(6)
def linear(x, a, b):
    return (x/b)**a
coefs = []
for ix in range(Y.shape[0]):
    c0, pcov = curve_fit(linear, X, Y[ix])
    coefs.append(c0)
coefs is
[array([ 0.99309127, 0.98742861]),
array([ 2.00197613, 2.00082722]),
array([ 2.99130237, 2.99390585]),
array([ 3.99644048, 3.9992937 ])]
I'll use scikit-learn's implementation of linear regression since I believe that scales well.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
Take logs of X and Y
lX = np.log(X)[None, :]
lY = np.log(Y)
Now fit and check that the coefficients are the same as before.
lr.fit(lX.T, lY.T)
Which gives similar exponents.
array([ 0.98613517, 1.98643974, 2.96602892, 4.01718514])
Now check the divisor.
np.exp(-lr.intercept_ / lr.coef_.ravel())
Which gives similar coefficients, although you can see the two methods diverging somewhat in their answers.
array([ 0.99199406, 1.98234916, 2.90677142, 3.73416501])
It might be useful in some situations to have the best fit parameters as a numpy array for further calculations. One can add the following after the for loop:
bestfit_par = np.asarray(coeffs)
Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
y_fit = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
    # Curve fitting function
    return a * x**3 + b * x**2 + c * x # d=0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
It does not answer the question in the sense of using numpy's polyfit function to pass through the origin, but it solves the problem.
Hope someone finds it useful :)
You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(XX, y)[0]
pp = np.polyfit(x, y, 3)
print(np.isclose(pp, p_all).all())
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking 'that's nice, but I still haven't answered the question'. From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y)[0] # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x,y)=(0,0), you will bias your estimates of your output solution coefficients (p), because lstsq will be trying to compensate for the fact that there is an offset in your data. Sort of a 'square peg round hole' problem.
Furthermore, you could also fit your data to the cubic term only by keeping just the first column:
p_ = np.linalg.lstsq(XX[:, :1], y)[0]
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
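The question also had a weight vector wgt. np.polyfit applies its weights to the unsquared residuals, so one way to get the same behaviour with lstsq is to multiply each row of the matrix and each y value by its weight. Below is a sketch on noise-free synthetic data (the x range, the weights and the true coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.5, 5.0, 40)
y = 2.0 * x**3 - 1.0 * x**2 + 3.0 * x        # passes through the origin
wgt = rng.uniform(0.5, 2.0, size=x.size)     # illustrative weights

# Cubic through the origin: no column of ones, rows scaled by the weights.
A = np.column_stack([x**3, x**2, x])
a, b, c = np.linalg.lstsq(A * wgt[:, None], y * wgt, rcond=None)[0]
print(a, b, c)

# Sanity check: with the constant column included, the same row scaling
# should agree with np.polyfit(..., w=wgt).
A_full = np.column_stack([x**3, x**2, x, np.ones_like(x)])
p_full = np.linalg.lstsq(A_full * wgt[:, None], y * wgt, rcond=None)[0]
p_polyfit = np.polyfit(x, y, deg=3, w=wgt)
print(np.allclose(p_full, p_polyfit, atol=1e-6))
```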
Best of luck!