I have some simple data:
import numpy

x = numpy.array([1, 2, 3,
                 4, 5, 6,
                 7, 8, 9,
                 10, 11, 12,
                 13, 14, 15,
                 16, 17, 18,
                 19, 20, 21,
                 22, 23, 24])
y = numpy.array([2149, 2731, 3397,
                 3088, 2928, 2108,
                 1200, 659, 289,
                 1141, 1726, 2910,
                 4410, 5213, 5851,
                 5817, 5307, 4314,
                 3656, 3081, 3103,
                 3535, 4512, 5584])
I can create a linear regression and make predictions with this code:
z = numpy.polyfit(x, y, 1)
p = numpy.poly1d(z)
But I want to create a non-linear regression of this data and draw a graph with code like this:
import matplotlib.pyplot as plt
xp1 = numpy.linspace(1,24,100)
plt.plot(x, y, 'r--', xp1, p(xp1))
plt.show()
I saw code like this, but it didn't help me:
def func(x, a, b, c):
    return a*np.exp(-b*x) + c
...
popt, pcov = curve_fit(func, x, y)
...
So what's the code for making a non-linear regression, and how can I make predictions with the resulting non-linear equation?
What you are referring to is the scipy module. You are right in that this is probably the module you want to be using.
Then, what you are interested in knowing is how curve_fit(func, x, y) works. The idea is that you want to minimize the difference between some model function (like y = m*x + b for a line) and the points in your data. The func argument represents this model: you write a function whose first argument is the independent variable of the model (x in my example) and whose remaining arguments are the parameters of the model (those would be m and b in the case of the linear model). The x and y you have already figured out.
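As a minimal sketch of that argument convention, using the linear y = m*x + b example and the x and y arrays from the question (the function name line is just illustrative):

from scipy.optimize import curve_fit

# First argument is the independent variable, the rest are the model parameters.
def line(x, m, b):
    return m*x + b

popt, pcov = curve_fit(line, x, y)   # popt holds the fitted m and b
m, b = popt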
The real problem, though (and yes, I realize I'm not answering your question directly), is that you need to figure out some sort of model for your data yourself, at least the type of model: exponential, linear, polynomial, etc. There is no easy way around that. Judging from your data, though, I would go for a model of the form
y = a*sin(b*x + c) + d*x + e
or a degree-5 polynomial.
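For concreteness, here is a hedged sketch of fitting that sinusoid-plus-trend model to the x and y arrays above with curve_fit; the starting guesses in p0 are rough eyeballed values (amplitude, period, phase, trend, offset), not part of the answer, and the degree-5 polynomial alternative is a one-liner with numpy.polyfit:

import numpy
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def model(x, a, b, c, d, e):
    return a*numpy.sin(b*x + c) + d*x + e

# Rough starting guesses: amplitude ~2500, period ~12 (so b ~ 2*pi/12), no trend, offset ~3000
p0 = [2500, 2*numpy.pi/12, 0, 0, 3000]
popt, pcov = curve_fit(model, x, y, p0=p0)

xp1 = numpy.linspace(1, 24, 100)
plt.plot(x, y, 'r--', xp1, model(xp1, *popt))   # predictions from the fitted non-linear model
plt.show()

# Alternative: a degree-5 polynomial, analogous to the degree-1 polyfit above
p5 = numpy.poly1d(numpy.polyfit(x, y, 5))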
I am trying to get an a*log(b/x)^c type fit for the following data (simplified to 10 data points).
I have tried methods described in some other questions like this one, using both curve_fit and lmfit, but the solution never converges. My guess is that my initial conditions are bad. I was able to get a fit with the other (commented-out) exponential function, but the application requires a log fit of the form given. The data with the fit that works is attached for reference.
import numpy as np
from scipy.optimize import curve_fit

x = [0, 0.89790454, 1.79580908, 2.69371362, 3.59161816, 4.48952269, 5.38742723, 6.28533177, 7.18323631, 8.08114085]
y = [0.39599324, 0.10255828, 0.07094521, 0.05500624, 0.04636146, 0.04585985, 0.0398909, 0.03340628, 0.03041699, 0.02498938]
x = np.array(x, dtype=float)
y = np.array(y, dtype=float)

def func(x, a, b, c):
    #return a*np.exp(-c*(x*b))+d
    return a*(np.log(b/x)**c)

popt, pcov = curve_fit(func, x, y, p0=[.5, .5, 1], maxfev=10000)
print(popt)
a, b, c = popt
Replace your function with:

def func(x, a, b, c):
    #return a*np.exp(-c*(x*b))+d
    t1 = np.log(b/x)
    t2 = a*t1**c
    print(a, b, c, t1, t2)
    return t2
You will quickly see that t1 = np.log(b / x) may be negative (this happens whenever b < x). A negative number raised to a non-integer power is not a real number, so here numpy produces nan results.
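A quick illustration of that failure mode (the numbers are arbitrary, chosen only to make b < x):

import numpy as np

t1 = np.log(0.5 / 2.0)   # negative, because b = 0.5 < x = 2.0
print(t1**1.5)           # nan: negative base raised to a non-integer power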
I have no difficulty with my fitting software (result below).
A common cause of difficulty with non-linear fitting by iterative regression methods is the choice of initial parameter values used to start the iteration.
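A hedged sketch of that advice applied to the x and y arrays from the question (the starting values and bounds below are illustrative guesses, not taken from either answer): keeping b above every x keeps log(b/x) positive, and starting near plausible values gives the iteration a fair chance.

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a*(np.log(b/x)**c)

mask = x > 0                             # drop x = 0, where b/x blows up
p0 = [0.05, 10.0, 1.0]                   # rough starting values, with b above max(x)
bounds = ([0, np.max(x), 0], np.inf)     # force a, c >= 0 and b >= max(x)
popt, pcov = curve_fit(func, x[mask], y[mask], p0=p0, bounds=bounds)
print(popt)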
I am trying to fit a model to a spectrum (see below) using least squares minimisation in scipy. The spectrum contains features characterised by the following curves:
Since the spectrum contains features of an exponential and a Gaussian function, I combined the first two characteristics (the Gaussian and the exponential) and fitted the model using the least squares minimisation method in Python. See the code below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq

#The model to fit to the data
def f(a, x):
    return a[0]*np.exp(-((x - a[1])**2)/(2*(a[2])**2)) + a[0]*np.exp(-a[1]*(x - a[2])) + a[0]

#Residuals
def err(a, x, y):
    return f(a, x) - y

xdata, ydata = np.loadtxt('data2.csv', delimiter = ',', unpack = True)

#Performing the fit
a0 = [0.1, 1, 1] #initial guess
a, success = leastsq(err, a0, args = (xdata, ydata))

x = np.linspace(xdata.min(), xdata.max(), 1000)
plt.plot(x, f(a, x), 'r-')
plt.step(xdata, ydata)
#plt.title('Spectrum')
plt.show()
Output:
The curve I am trying to fit to the spectrum is shown in red, but it does not oscillate around the peak. I tried increasing and decreasing the initial guesses, but that does not help. Is my model wrong, or am I missing something in my code? Any help will be much appreciated.
Link to the data
I want to calculate a simple linear regression where I need to force a particular value for one point. Namely, I have x and y arrays, and I want my regression f(x) to force f(x[-1]) == y[-1]; that is, the prediction for the last element of x should equal the last element of y.
Is there a way to do it using Python and scikit-learn?
Here's a slightly roundabout trick that will do it.
Try re-centering your data, i.e. subtract x[-1], y[-1] from all datapoints so that x[-1], y[-1] is now the origin.
Now fit your data using sklearn.linear_model.LinearRegression with fit_intercept set to False. This way, the data is fit so that the line is forced to pass through the origin. Because we've re-centered the data, the origin corresponds to x[-1], y[-1].
When you use the model to make predictions, subtract x[-1] from any datapoint for which you are making a prediction, then add y[-1] to the resulting prediction, and this will give you the same results as forcing your model to pass through x[-1], y[-1].
This is a little roundabout but it's the simplest way that occurs to me to do it using the sklearn linear regression function (without writing your own).
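For concreteness, here is a minimal sketch of that trick; the data and variable names below are made up for illustration, not taken from the question:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 6.0])

# Re-center so that (x[-1], y[-1]) becomes the origin, then fit without an intercept.
xc = (x - x[-1]).reshape(-1, 1)
yc = y - y[-1]
reg = LinearRegression(fit_intercept=False).fit(xc, yc)

# To predict: shift the input the same way, then shift the prediction back.
x_new = np.array([2.5, 5.0, 6.0]).reshape(-1, 1)
y_pred = reg.predict(x_new - x[-1]) + y[-1]
print(y_pred)   # the prediction at x = 5.0 (== x[-1]) is exactly y[-1]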
The suggestion from HappyDog is great as a quick way to get a fit; however, I'd like to introduce another method that doesn't require any manipulation of your data. This method uses scipy.optimize.curve_fit to fit your data.
First, we need to realize that a normal linear regression will find A and B such that y=Ax+B provides the best fit to the input data. Your requirements state that the fit must pass through the final point in your sample data set. Essentially we'll be dropping a line that passes through your final point and rotating it around this point until we can minimize the errors.
Take a look at the point-slope equation for a line: y - yi = m*(x - xi), where (xi, yi) is any point on that line. If we make the substitution that this (xi, yi) point is the final point from your data set and solve for y, we get y = m*(x - xf) + yf. This is the model we will fit.
Translating this model to a python-function, we have:
def model(x, m, xf, yf):
    return m*(x - xf) + yf
We create a mock data set for this example, and just for demonstration purposes we significantly shift the final y-value:
x = np.linspace(0, 10, 100)
y = x + np.random.uniform(0, 3, len(x))
y[-1] += 10
We're almost ready to perform the fit. The curve_fit function expects a callable function (model) to fit, the x and y data, and a list of the guesses of each parameter we are trying to fit. Since our model accepts two extra "constant" arguments (xf and yf), we use functools.partial to "set" these arguments based on our data.
partial_model = functools.partial(model, xf=x[-1], yf=y[-1])
p0 = [y[-1]/x[-1]] # Initial guess for m, as long as xf != 0
Now we can fit!
best_fit, covar = curve_fit(partial_model, x, y, p0=p0)
print("Best fit:", best_fit)
y_fit = model(x, best_fit[0], x[-1], y[-1])
intercept = model(0, best_fit[0], x[-1], y[-1]) # The y-intercept
And we look at the results:
plt.plot(x, y, "g*") # Input data will be green stars
plt.plot(x, y_fit, "r-") # Fit will be a red line
plt.legend(["Sample Data", f"y=mx+b ; m={best_fit[0]:.4f}, b={intercept:.4f}"])
plt.show()
Putting all this together in one code block and including imports gives:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
import functools
def model(x, m, xf, yf):
    return m*(x - xf) + yf
x = np.linspace(0, 10, 100)
y = x + np.random.uniform(0, 3, len(x))
y[-1] += 10
partial_model = functools.partial(model, xf=x[-1], yf=y[-1])
p0 = [y[-1]/x[-1]] # Initial guess for m, as long as xf != 0
best_fit, covar = curve_fit(partial_model, x, y, p0=p0)
print("Best fit:", best_fit)
y_fit = model(x, best_fit[0], x[-1], y[-1])
intercept = model(0, best_fit[0], x[-1], y[-1]) # The y-intercept
plt.plot(x, y, "g*") # Input data will be green stars
plt.plot(x, y_fit, "r-") # Fit will be a red line
plt.legend(["Sample Data", f"y=mx+b ; m={best_fit[0]:.4f}, b={intercept:.4f}"])
plt.show()
We see a line passing through the final point, as required, and have found the best slope to represent this dataset.
I am experimenting with Python to fit curves to a series of data-points, summary below:
From the below, it would seem that polynomials of order greater than 2 are the best fit, followed by linear, and finally exponential, which has the worst outcome overall.
While I appreciate this might not be exponential growth, I just wanted to know whether you would expect the exponential function to perform so badly (essentially the coefficient of x, b, has been set to 0 and an arbitrary point on the curve has been picked to intersect), or whether I have somehow done something wrong in my code for the fit.
The code I'm using to fit is as follows:
# Fitting
def exponenial_func(x, a, b, c):
    return a*np.exp(-b*x) + c

def linear(x, m, c):
    return m*x + c

def quadratic(x, a, b, c):
    return a*x**2 + b*x + c

def cubic(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d

x = np.array(x)
yZero = np.array(cancerSizeMean['levelZero'].values)[start:]

print(len(x))
print(len(yZero))

popt, pcov = curve_fit(exponenial_func, x, yZero, p0=(1, 1, 1))
expZeroFit = exponenial_func(x, *popt)
plt.plot(x, expZeroFit, label='Control, Exponential Fit')
popt, pcov = curve_fit(linear, x, yZero, p0=(1, 1))
linearZeroFit = linear(x, *popt)
plt.plot(x, linearZeroFit, label='Control, Linear')
popt, pcov = curve_fit(quadratic, x, yZero, p0=(1, 1, 1))
quadraticZeroFit = quadratic(x, *popt)
plt.plot(x, quadraticZeroFit, label='Control, Quadratic')
popt, pcov = curve_fit(cubic, x, yZero, p0=(1, 1, 1, 1))
cubicZeroFit = cubic(x, *popt)
plt.plot(x, cubicZeroFit, label='Control, Cubic')
Edit: curve_fit is imported from the scipy.optimize package:
from scipy.optimize import curve_fit
curve_fit tends to perform poorly if you give it a poor initial guess for functions like the exponential, which can blow up to very large numbers. You could try raising the maxfev input so that it runs more iterations. Otherwise, I would suggest trying something like:
p0=(1000,-.005,0)
The negative value of b (around -.005) is because the data roughly doubles from 300 to 500 and you have -b in your equation; the 1000 is because y is roughly 3000 at x = 300 (about 1.5 doublings up from 0). See how that turns out.
As for why the initial exponential fit doesn't work at all: your initial guess is b = 1, and x is in the range of roughly (300, 1000). This means Python is calculating exp(-300), which either throws an exception or underflows to 0. At that point, whether b is increased or decreased, the exponential will still evaluate to 0 for any value in the general vicinity of the initial estimate.
Basically, Python uses numerical methods with limited precision, and the exponential estimate went outside the range of values they can handle.
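For illustration, here is a hedged sketch of that suggestion in code. The data below is synthetic, standing in for the question's cancerSizeMean values, just so the example runs on its own; the scaled initial guess is the one proposed above.

import numpy as np
from scipy.optimize import curve_fit

def exponenial_func(x, a, b, c):
    return a*np.exp(-b*x) + c

# Synthetic stand-in for the question's data: x in (300, 1000), roughly doubling every 200 units.
x = np.linspace(300, 1000, 50)
yZero = 3000 * 2**((x - 300) / 200.0) + np.random.normal(0, 100, len(x))

# Scaled initial guess (and more iterations) instead of the default p0 = (1, 1, 1).
popt, pcov = curve_fit(exponenial_func, x, yZero, p0=(1000, -.005, 0), maxfev=10000)
print(popt)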
I'm not sure how you're fitting the curves -- are you using polynomial least squares? In that case, you'd expect the fit to improve with each additional degree of flexibility, and you choose the power based on diminishing marginal improvement / outside theory.
The improving fit should look something like this.
I actually wrote some code to do Polynomial Least Squares in python for a class a while back, which you can find here on Github. It's a bit hacky though and loosely commented since I was just using it to solve exercises. Hope it's helpful.
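As a hedged illustration of that "improving fit with degree" behaviour (synthetic data, with numpy.polyfit standing in for the linked Github code): the residual sum of squares keeps shrinking as the degree grows, and you pick the degree where the marginal improvement levels off.

import numpy as np

# Synthetic data with some curvature and noise
x = np.linspace(0, 10, 50)
y = 0.5*x**2 - 2*x + 3 + np.random.normal(0, 2, len(x))

# Residual sum of squares for increasing polynomial degree
for deg in range(1, 6):
    coeffs = np.polyfit(x, y, deg)
    rss = np.sum((np.polyval(coeffs, x) - y)**2)
    print(deg, rss)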
Below is an example of using curve_fit from SciPy based on a linear equation. My understanding of curve fitting in general is that it takes a set of scattered points and creates a curve showing the "best fit" to that series of data points. My question is about what scipy's curve_fit returns:
"Optimal values for the parameters so that the sum of the squared error of f(xdata, *popt) - ydata is minimized".
What exactly do these two values mean in simple English? Thanks!
import numpy as np
from scipy.optimize import curve_fit

# Creating a function to model and create data
def func(x, a, b):
    return a * x + b

# Generating clean data
x = np.linspace(0, 10, 100)
y = func(x, 1, 2)

# Adding noise to the data
yn = y + 0.9 * np.random.normal(size=len(x))

# Executing curve_fit on noisy data
popt, pcov = curve_fit(func, x, yn)

# popt returns the best fit values for parameters of
# the given model (func).
print(popt)
You're asking SciPy to tell you the "best" line through a set of pairs of points (x, y).
Here's the equation of a straight line:
y = a*x + b
The slope of the line is a; the y-intercept is b.
You have two parameters, a and b, so you only need two equations to solve for two unknowns. Two points define a line, right?
So what happens when you have more than two points? You can't go through all the points. How do you choose the slope and intercept to give you the "best" line?
One way to define "best" is to calculate the slope and intercept that minimize the sum of squared differences between each y value and the predicted y at that x on the line:
error = sum[(y(i) - (a*x(i) + b))^2]
It's an easy exercise if you know calculus: take the first derivatives of error w.r.t. a and b and set them equal to zero. You'll have two equations with two unknowns, a and b. You solve them to get the coefficients for the "best" line.
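For what it's worth, here is a hedged sketch of what solving those two equations gives (the standard closed-form least-squares slope and intercept), checked against curve_fit's popt on the question's noisy data:

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * x + b

x = np.linspace(0, 10, 100)
yn = func(x, 1, 2) + 0.9 * np.random.normal(size=len(x))

# Closed-form solution of d(error)/da = 0 and d(error)/db = 0
a = np.sum((x - x.mean()) * (yn - yn.mean())) / np.sum((x - x.mean())**2)
b = yn.mean() - a * x.mean()

popt, pcov = curve_fit(func, x, yn)
print(a, b)       # matches popt up to numerical precision
print(popt)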