I want to calculate a simple linear regression where the fit is forced through a particular point. Namely, I have x and y arrays, and I want my regression f(x) to satisfy f(x[-1]) == y[-1] - that is, the prediction for the last element of x should equal the last element of y.
Is there a way to do it using Python and scikit-learn?
Here's a slightly roundabout trick that will do it.
Try re-centering your data, i.e. subtract x[-1], y[-1] from all datapoints so that x[-1], y[-1] is now the origin.
Now fit your data using sklearn.linear_model.LinearRegression with fit_intercept set to False. This way, the data is fit so that the line is forced to pass through the origin. Because we've re-centered the data, the origin corresponds to x[-1], y[-1].
When you use the model to make predictions, subtract x[-1] from any datapoint for which you are making a prediction, then add y[-1] to the resulting prediction, and this will give you the same results as forcing your model to pass through x[-1], y[-1].
This is a little roundabout but it's the simplest way that occurs to me to do it using the sklearn linear regression function (without writing your own).
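For concreteness, here is a minimal sketch of the trick, assuming x and y are 1-D NumPy arrays (the variable names here are mine):

import numpy as np
from sklearn.linear_model import LinearRegression

# Shift the data so that (x[-1], y[-1]) becomes the origin.
x0, y0 = x[-1], y[-1]
X_shifted = (x - x0).reshape(-1, 1)  # sklearn expects a 2-D feature matrix
y_shifted = y - y0

# fit_intercept=False forces the fitted line through the (shifted) origin,
# i.e. through (x[-1], y[-1]) in the original coordinates.
reg = LinearRegression(fit_intercept=False).fit(X_shifted, y_shifted)

# To predict: shift the inputs down, then shift the predictions back up.
x_new = np.array([1.5, 2.5])  # example query points
y_pred = reg.predict((x_new - x0).reshape(-1, 1)) + y0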
The suggestion from HappyDog is great as a quick way to get a fit; however, I'd like to introduce another method that doesn't require any manipulation of your data. This method uses scipy.optimize.curve_fit to fit your data.
First, we need to realize that a normal linear regression will find A and B such that y=Ax+B provides the best fit to the input data. Your requirements state that the fit must pass through the final point in your sample data set. Essentially we'll be dropping a line that passes through your final point and rotating it around this point until we can minimize the errors.
Take a look at the point-slope equation for a line: y-yi = m*(x-xi), where (xi, yi) is any point on that line. If we make the substitution that this (xi, yi) point is the final point from your data set and solve for y, we get y=m*(x-xf)+yf. This is the model we will fit.
Translating this model to a python-function, we have:
def model(x, m, xf, yf):
    return m*(x-xf)+yf
We create a mock-data set for this example and just for demonstration purposes we will significantly shift the final y-value:
x = np.linspace(0, 10, 100)
y = x + np.random.uniform(0, 3, len(x))
y[-1] += 10
We're almost ready to perform the fit. The curve_fit function expects a callable model function, the x and y data, and a list of initial guesses for each parameter being fitted. Since our model accepts two extra "constant" arguments (xf and yf), we use functools.partial to "set" these arguments based on our data.
partial_model = functools.partial(model, xf=x[-1], yf=y[-1])
p0 = [y[-1]/x[-1]] # Initial guess for m, as long as xf != 0
Now we can fit!
best_fit, covar = curve_fit(partial_model, x, y, p0=p0)
print("Best fit:", best_fit)
y_fit = model(x, best_fit[0], x[-1], y[-1])
intercept = model(0, best_fit[0], x[-1], y[-1]) # The y-intercept
And we look at the results:
plt.plot(x, y, "g*") # Input data will be green stars
plt.plot(x, y_fit, "r-") # Fit will be a red line
plt.legend(["Sample Data", f"y=mx+b ; m={best_fit[0]:.4f}, b={intercept:.4f}"])
plt.show()
Putting all this together in one code block and including imports gives:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
import functools
def model(x, m, xf, yf):
    return m*(x-xf)+yf
x = np.linspace(0, 10, 100)
y = x + np.random.uniform(0, 3, len(x))
y[-1] += 10
partial_model = functools.partial(model, xf=x[-1], yf=y[-1])
p0 = [y[-1]/x[-1]] # Initial guess for m, as long as xf != 0
best_fit, covar = curve_fit(partial_model, x, y, p0=p0)
print("Best fit:", best_fit)
y_fit = model(x, best_fit[0], x[-1], y[-1])
intercept = model(0, best_fit[0], x[-1], y[-1]) # The y-intercept
plt.plot(x, y, "g*") # Input data will be green stars
plt.plot(x, y_fit, "r-") # Fit will be a red line
plt.legend(["Sample Data", f"y=mx+b ; m={best_fit[0]:.4f}, b={intercept:.4f}"])
plt.show()
We see a line passing through the final point, as required, and have found the best slope to represent this dataset.
Related
Below (blue dashed line) is what I get when I try to do linear regression on my data. It looks very off (but maybe it's correct?). Here is the image (the site does not allow me to embed it):
and here is the code:
mm, cs, err = get_cols(data)
a = np.asarray(mm, dtype=float)
b = np.asarray(cs, dtype=float)
ax.errorbar(a, b, xerr=None, yerr=err, fmt='o', c='b', label='Detection Rate')
logB = np.log10(b)
m, y0 = np.polyfit(a, logB, 1)
ax.plot(a, np.exp(a*m+y0), '--')
The log scale of matplotlib uses the logarithm of base 10 by default. It therefore makes sense to use np.log10(b) to transform the data for fitting.
However, once fitting is done, the data needs to be backtransformed using the inverse of the transformation function.
In case of y = log10(x) the inverse is x = 10**(y), while
in case of y = log(x) the inverse is x = exp(y).
So you need to pick one of these two cases and use it consistently for both the fit and the back-transform.
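For example, sticking with base 10, the corrected plotting line would look like this (a sketch, assuming a, b and ax are defined as in your snippet):

logB = np.log10(b)
m, y0 = np.polyfit(a, logB, 1)
ax.plot(a, 10**(a*m + y0), '--')  # the inverse of log10 is 10**(...)

# Alternatively, fit with the natural logarithm and invert with exp:
# m, y0 = np.polyfit(a, np.log(b), 1)
# ax.plot(a, np.exp(a*m + y0), '--')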
I'm trying to find exponential growth trends in a polynomial model and having issues identifying them. I've looked at scipy.optimize.curve_fit in Python and can't seem to figure out how to know the shape of the curve without plotting it. The end goal is to find trends that look like:
I tried:
z = np.polyfit(x, y, 3)
f = np.poly1d(z)
# calculate new x's and y's
x_new = np.linspace(x[0], x[-1], 50)
y_new = f(x_new)
plt.plot(x,y,'o', x_new, y_new)
plt.xlim([x[0]-1, x[-1] + 1 ])
plt.show()
But I cannot identify the shape of the curve in a single value, which could then be sorted to see which have positive trends.
"identify the shape of the curve in a single value"
This can be done with polyfit by extracting just one, most important, coefficient of the fitted polynomial. Of course, a single value can only give a rough idea of some trend: up/down, concave up/concave down. Here is how:
import numpy as np

x = np.arange(600)
y = np.exp(x/300) + np.cos(x/30) + np.random.uniform(size=x.shape) # simulated data
slope = np.polyfit(x, y, 1)[0]
concavity = np.polyfit(x, y, 2)[0]
print(slope, concavity)
The slope is the leading coefficient of the linear fit, which is positive (about 0.01), indicating an upward trend.
The concavity, the leading coefficient of the quadratic fit, is also positive (the shape is concave up on average).
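To address the sorting part of the question: once each dataset is reduced to this single coefficient, you can rank datasets by it. A small sketch, assuming a hypothetical dict datasets mapping names to (x, y) arrays:

# datasets is hypothetical: {name: (x_array, y_array), ...}
concavity = {name: np.polyfit(xs, ys, 2)[0] for name, (xs, ys) in datasets.items()}
# Keep only the datasets that curve upward, most strongly curved first.
concave_up = sorted((n for n, c in concavity.items() if c > 0),
                    key=concavity.get, reverse=True)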
I'm trying to fit a histogram with some data in it using scipy.optimize.curve_fit. If I want to add an error in y, I can simply do so by applying a weight to the fit. But how do I apply the error in x (i.e. the error due to binning in the case of histograms)?
My question also applies to errors in x when making a linear regression with curve_fit or polyfit; I know how to add errors in y, but not in x.
Here an example (partly from the matplotlib documentation):
import numpy as np
import pylab as P
from scipy.optimize import curve_fit
# create the data histogram
mu, sigma = 200, 25
x = mu + sigma*P.randn(10000)
# define fit function
def gauss(x, *p):
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2*sigma**2))
# the histogram of the data
n, bins, patches = P.hist(x, 50, histtype='step')
sigma_n = np.sqrt(n) # Adding Poisson errors in y
bin_centres = (bins[:-1] + bins[1:])/2
sigma_x = (bins[1] - bins[0])/np.sqrt(12) # Binning error in x
P.setp(patches, 'facecolor', 'g', 'alpha', 0.75)
# fitting and plotting
p0 = [700, 200, 25]
popt, pcov = curve_fit(gauss, bin_centres, n, p0=p0, sigma=sigma_n, absolute_sigma=True)
x = np.arange(100, 300, 0.5)
fit = gauss(x, *popt)
P.plot(x, fit, 'r--')
Now, this fit (when it doesn't fail) does consider the y-errors sigma_n, but I haven't found a way to make it consider sigma_x. I scanned a couple of threads on the scipy mailing list and found out how to use the absolute_sigma value, and a post on Stack Overflow about asymmetrical errors, but nothing about errors in both directions. Is this possible to achieve?
scipy.optimize.curve_fit uses standard non-linear least squares optimization and therefore only minimizes the deviation in the response variable. If you want the error in the independent variable to be considered as well, you can try scipy.odr, which uses orthogonal distance regression. As its name suggests, it minimizes the distance in both the independent and dependent variables.
Have a look at the sample below. The fit_type parameter determines whether scipy.odr does full ODR (fit_type=0) or least squares optimization (fit_type=2).
EDIT
Although the example worked, it did not make much sense, since the y data was calculated from the noisy x data, which just resulted in an unequally spaced independent variable. I updated the sample, which now also shows how to use RealData, which allows specifying the standard errors of the data instead of weights.
from scipy.odr import ODR, Model, Data, RealData
import numpy as np
from pylab import *
def func(beta, x):
    y = beta[0]+beta[1]*x+beta[2]*x**3
    return y
#generate data
x = np.linspace(-3,2,100)
y = func([-2.3,7.0,-4.0], x)
# add some noise
x += np.random.normal(scale=0.3, size=100)
y += np.random.normal(scale=0.1, size=100)
data = RealData(x, y, 0.3, 0.1)
model = Model(func)
odr = ODR(data, model, [1,0,0])
odr.set_job(fit_type=2)
output = odr.run()
xn = np.linspace(-3,2,50)
yn = func(output.beta, xn)
# hold(True) is no longer needed: repeated plot() calls draw on the same axes by default
plot(x,y,'ro')
plot(xn,yn,'k-',label='leastsq')
odr.set_job(fit_type=0)
output = odr.run()
yn = func(output.beta, xn)
plot(xn,yn,'g-',label='odr')
legend(loc=0)
show()
I have some simple data:
x = numpy.array([1,2,3,
4,5,6,
7,8,9,
10,11,12,
13,14,15,
16,17,18,
19,20,21,
22,23,24])
y = numpy.array([2149,2731,3397,
3088,2928,2108,
1200,659,289,
1141,1726,2910,
4410,5213,5851,
5817,5307,4314,
3656,3081,3103,
3535,4512,5584])
I can create a linear regression and make predictions with this code:
z = numpy.polyfit(x, y, 1)
p = numpy.poly1d(z)
But I want to create a nonlinear regression of this data and draw a graph with code like this:
import matplotlib.pyplot as plt
xp1 = numpy.linspace(1,24,100)
plt.plot(x, y, 'r--', xp1, p(xp1))
plt.show()
I saw code like this, but it didn't help me:
def func(x, a, b, c):
return a*np.exp(-b*x) + c
...
popt, pcov = curve_fit(func, x, y)
...
So what code do I need to create a nonlinear regression, and how can I make predictions with the resulting nonlinear equation?
What you are referring to is the scipy module. You are right in that this is probably the module you want to be using.
Then, what you are interested in knowing is how curve_fit(func, x, y) works. The idea is that you want to minimize the difference between some model function (like y = m*x + b for a line) and the points in your data. The func argument represents this model: you write a function whose first argument is the independent variable of the model (x in my example) and whose remaining arguments are the parameters of the model (m and b in the case of the linear model). The x and y you have already figured out.
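As a concrete sketch of that signature, here is the linear model above passed to curve_fit on some made-up numbers:

import numpy as np
from scipy.optimize import curve_fit

def line(x, m, b):  # first argument: independent variable; then the parameters
    return m*x + b

x_demo = np.array([1, 2, 3, 4, 5], dtype=float)  # illustrative data only
y_demo = np.array([2.1, 3.9, 6.2, 8.0, 9.9])
popt, pcov = curve_fit(line, x_demo, y_demo)  # popt holds the fitted (m, b)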
The real problem though, and yes I realize I'm not answering your question, is that you need to figure out manually some sort of model for your data (at least the type of model: exponential, linear, polynomial, etc.). There is no easy way around that. Judging from your data, though, I would go for a model of the form
y = a*sin(b*x + c) + d*x + e
or a degree-5 polynomial.
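Here is a sketch of fitting the sinusoid-plus-line model to your data with curve_fit. The initial guesses in p0 are rough eyeballed values (amplitude, a period of roughly 12, phase, slope, offset); a nonlinear fit like this usually needs a reasonable starting point to converge:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c, d, e):
    return a*np.sin(b*x + c) + d*x + e

# x and y are the arrays from the question
p0 = [2000, 2*np.pi/12, 0, 100, 2000]  # rough guesses: amplitude, frequency, phase, slope, offset
popt, pcov = curve_fit(func, x, y, p0=p0)

xp = np.linspace(1, 24, 200)
plt.plot(x, y, 'o', xp, func(xp, *popt), '-')
plt.show()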
How do you calculate a best fit line in python, and then plot it on a scatterplot in matplotlib?
Currently, I calculate the linear best-fit line using Ordinary Least Squares Regression as follows:
from sklearn import linear_model
clf = linear_model.LinearRegression()
x = [[t.x1,t.x2,t.x3,t.x4,t.x5] for t in self.trainingTexts]
y = [t.human_rating for t in self.trainingTexts]
clf.fit(x,y)
regress_coefs = clf.coef_
regress_intercept = clf.intercept_
This is multivariate (there are many x-values for each case). So, X is a list of lists, and y is a single list.
For example:
x = [[1,2,3,4,5], [2,2,4,4,5], [2,2,4,4,1]]
y = [1,2,3,4,5]
But how do I do this with higher-order polynomial functions? For example, not just linear (x to the power of M=1), but quadratic (x to the power of M=2), quartic (M=4), and so on. For example, how do I get the best-fit curves from the following?
Extracted from Christopher Bishop's "Pattern Recognition and Machine Learning", p. 7:
The accepted answer to this question provides a small multipolyfit library which will do exactly what you need using numpy, and you can plug the result into the plotting as I've outlined below.
You would just pass in your arrays of x and y points and the degree (order) of fit you require into multipolyfit. This returns the coefficients, which you can then use for plotting with numpy's polyval.
Note: The code below has been amended to do multivariate fitting, but the plot image was part of the earlier, non-multivariate answer.
import numpy
import matplotlib.pyplot as plt
import multipolyfit as mpf
data = [[1,1],[4,3],[8,3],[11,4],[10,7],[15,11],[16,12]]
x, y = zip(*data)
x = numpy.asarray(x)  # convert the tuples from zip to arrays so arithmetic works
y = numpy.asarray(y)
plt.plot(x, y, 'kx')

deg = 3  # degree (order) of the fit
stacked_x = numpy.array([x, x+1, x-1])
coeffs = mpf(stacked_x, y, deg)
x2 = numpy.arange(min(x)-1, max(x)+1, .01)  # use more points for a smoother plot
y2 = numpy.polyval(coeffs, x2)  # evaluate the polynomial for each x2 value
plt.plot(x2, y2, label="deg=3")
Note: This was part of the answer earlier on; it is still relevant if you don't have multivariate data. Instead of coeffs = mpf(...), use coeffs = numpy.polyfit(x, y, 3).
For non-multivariate data sets, the easiest way to do this is probably with numpy's polyfit:
numpy.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
Least squares polynomial fit.
Fit a polynomial p(x) = p[0] * x**deg + ... + p[deg] of degree deg to points (x, y). Returns a vector of coefficients p that minimises the squared error.
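For instance, a sketch of the single-variable cubic fit on the same small dataset used above:

import numpy
import matplotlib.pyplot as plt

x = [1, 4, 8, 11, 10, 15, 16]
y = [1, 3, 3, 4, 7, 11, 12]

coeffs = numpy.polyfit(x, y, 3)  # cubic least-squares fit
x2 = numpy.arange(min(x) - 1, max(x) + 1, .01)
plt.plot(x, y, 'kx')
plt.plot(x2, numpy.polyval(coeffs, x2), label="deg=3")
plt.legend()
plt.show()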
Slightly out of context because the resulting function is not a polynomial, but still interesting perhaps. One major problem with polynomial fitting is Runge's phenomenon: the higher the degree, the more dramatic the oscillations. This isn't just a constructed problem either; it will come back to bite you in practice.
As a remedy, I created smoothfit a while ago. It solves an appropriate least-squares problem and gives nice results, e.g.:
import numpy as np
import matplotlib.pyplot as plt
import smoothfit
x = [1, 4, 8, 11, 10, 15, 16]
y = [1, 3, 3, 4, 7, 11, 12]
a = 0.0
b = 17.0
plt.plot(x, y, 'kx')
lmbda = 3.0 # controls the smoothness
n = 100
u = smoothfit.fit1d(x, y, a, b, n, lmbda)
x = np.linspace(a, b, n)
vals = [u(xx) for xx in x]
plt.plot(x, vals, "-")
plt.show()