Python linear regression, first order polynomial

I have a data set that I'm asked to fit a first degree polynomial to.
I'm using the numpy function polyfit, but I'm getting some pretty strange results.
I use the following code to find that polynomial and plot it:
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('men-olympics-100.txt')
year = data[:,0]
time = data[:,1]
plt.scatter(year, time)
xplot = np.linspace(1896,2008,100)
poly = np.polyfit(year,time,1)
print(poly)
yplot = poly[0]+poly[1]*(xplot)
plt.plot(xplot,yplot)
This is the resulting plot
Clearly I have done something wrong here, but I cannot figure out exactly where. Am I using polyfit wrong, or am I plotting it wrong?

This line
yplot = poly[0]+poly[1]*(xplot)
needs to be like this:
yplot = poly[1]+poly[0]*(xplot)
because np.polyfit returns the coefficients in descending order of degree: the slope first, then the intercept. Or, more generally (thanks @Victor Chubukov):
np.polyval(poly, xplot)
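A quick illustration of the coefficient order, with made-up data rather than the Olympics file:
import numpy as np
# polyfit returns coefficients in descending order of degree, so for a
# degree-1 fit poly[0] is the slope and poly[1] is the intercept
poly = np.polyfit([0, 1, 2], [1, 3, 5], 1)   # points on the line y = 2x + 1
print(poly)                  # approximately [2., 1.]
print(np.polyval(poly, 10))  # approximately 21.0, same as poly[0]*10 + poly[1]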

savgol_filter from the scipy.signal library: how to get the resulting polynomial function?

savgol_filter gives me the filtered series.
I want to get the underlying polynomial function, i.e. the function of the red line in the picture below, so that I can extrapolate a point beyond the given x range, or find the slope of the function at the two extreme data points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
yhat = savgol_filter(y, 51, 3) # window size 51, polynomial order 3
plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()
**edit**
Since the filter uses least-squares regression to fit the data in a small window to a polynomial of a given degree, you can probably only extrapolate from the ends. The fitted curve is essentially a piecewise function of these local fits, and no single one of them is a good representation of the entire data as a whole. What you could do is take the end windows of your data and fit them to the same polynomial degree as the Savitzky-Golay filter (e.g. with numpy's polyfit). It likely will not be accurate very far beyond the window, though.
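For example, here is a minimal sketch of that end-window idea, reusing the data from the question (the extrapolation point is an arbitrary illustration):
import numpy as np
# refit the last `window` samples with a polynomial of the same order
# as the filter, then evaluate it beyond the data
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x) + np.random.random(100)*0.2
window, order = 51, 3
coeffs = np.polyfit(x[-window:], y[-window:], order)
print(np.polyval(coeffs, x[-1] + 0.1))        # extrapolated estimate just past the data
print(np.polyval(np.polyder(coeffs), x[-1]))  # slope at the right end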
You can also use scipy.signal.savgol_coeffs() to get the coefficients of the filter. You then take the dot product of the coefficient array with a window of your data to get the filtered value at a point, and you can pass a deriv argument to get the slope at the ends of your data.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_coeffs.html#scipy.signal.savgol_coeffs
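And a minimal sketch of the savgol_coeffs route on the same data (choosing pos=window-1 to evaluate at the last sample is my assumption about what you want):
import numpy as np
from scipy.signal import savgol_coeffs
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x) + np.random.random(100)*0.2
window, order = 51, 3
delta = x[1] - x[0]  # sample spacing, needed to scale the derivative
# use='dot' orders the coefficients so a dot product applies the filter;
# pos=window-1 evaluates the local polynomial at the window's last sample
c = savgol_coeffs(window, order, deriv=1, delta=delta, pos=window - 1, use='dot')
print(c.dot(y[-window:]))  # slope estimate at the right end, roughly cos(x[-1])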

Python Polynomial Regression on 3D Data points

My problem is that I have about 50,000 non-linear data points (x, y, z), with z depending on the independent variables x and y. Viewed from one side, i.e. from a two-dimensional perspective, the data points look like a polynomial of degree 7. Unfortunately I cannot show this data.
My goal is to find a polynomial in 3D that can fit this data without knowing the degree of the polynomial beforehand. So I would like a function like f(x,y) = ax^3 + bx^2 + cx^2y + dy^3 + ...
Unfortunately, in Python I have only found surface-fitting approaches where you need the degree beforehand, or approaches that transform the polynomial problem into a multivariable linear problem with scikit-learn. The latter gave very poor results with my dataset.
Does anyone know a better method for this problem? Many thanks in advance.
As far as fitting a polynomial to a surface goes, I think your best bet is to try polynomials of different degrees and rank them by goodness of fit, as described here (see the sketch at the end of this answer).
If you are willing to try different surface-fitting methods, I would recommend looking into what scipy has to offer, particularly in the Multivariate, unstructured data section. scipy.interpolate.griddata, for example, can interpolate between scattered data points with a piecewise-cubic interpolant (method='cubic'). See the code below for a demo:
import numpy as np
from scipy.interpolate import griddata
# X and Y features form a 2D numpy array of scattered sample points
xy = np.random.randn(20, 2)
# z is a nonlinear function of x and y
z = xy[:, 0] + xy[:, 1]**2
# make a grid of x and y points to interpolate at
xsurf = np.arange(-3, 3, 0.1)
ysurf = xsurf
xsurf, ysurf = np.meshgrid(xsurf, ysurf)
# piecewise-cubic interpolation; points outside the convex hull come back NaN
surfPts = griddata(xy, z, np.vstack((xsurf.flatten(), ysurf.flatten())).T, method='cubic')
That code will yield a smooth surface fit through the data points.
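To make the first suggestion concrete, here is a minimal sketch of fitting bivariate polynomials of increasing degree with plain least squares and ranking them by residual error. The synthetic data, the fit_poly2d helper, and the RMSE score are all illustrative assumptions, not a library API:
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
y = rng.uniform(-2, 2, 500)
z = x**3 - 2*x*y + y**2 + rng.normal(0, 0.1, 500)

def fit_poly2d(x, y, z, degree):
    # design matrix of all monomials x^i * y^j with i + j <= degree
    cols = [x**i * y**j for i in range(degree + 1)
                        for j in range(degree + 1 - i)]
    A = np.column_stack(cols)
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    rmse = np.sqrt(np.mean((z - A @ coeffs)**2))
    return coeffs, rmse

for degree in range(1, 8):
    _, rmse = fit_poly2d(x, y, z, degree)
    print(degree, rmse)  # pick the degree where the error stops improving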

Discontinuity in numerical derivative of an interpolating cubic spline

I am trying to calculate and plot the numerical derivative (dy/dx) from two lists x and y. I am using scipy.interpolate.UnivariateSpline and scipy.interpolate.UnivariateSpline.derivative to compute the slope. The plot of y vs x seems to be C1 continuous, so I was expecting the slope dy/dx to be smooth as well when plotted against x. But then what is causing the little bump in the plot here? Also, any suggestions on how I can massage the code to make it C1 continuous?
import numpy as np
from matplotlib import pyplot as plt
from scipy.interpolate import UnivariateSpline
x=[20.14141131550861, 20.29161104293003, 20.458574567775457, 20.653802880772922, 20.910446090013004, 21.404599384233677, 21.427939384233678, 21.451279384233676, 21.474619384233677, 21.497959384233678, 21.52129938423368, 21.52130038423368, 21.54463938423368, 21.56797938423368, 21.59131938423368, 21.61465938423368, 21.63799938423368, 22.132152678454354, 22.388795887694435, 22.5840242006919]
y=[-1.6629252348586834, -1.7625046339166028, -1.875358801338162, -2.01040013818419, -2.193327440415778, -2.5538174545988306, -2.571799827167608, -2.5896274995868005, -2.607298426787476, -2.624811539182082, -2.642165776735291, -2.642165776735291, -2.659360089028171, -2.6763934353217587, -2.693264784620056, -2.7099731157324367, -2.7265165368570314, -3.0965791078676754, -3.290845721407758, -3.440799238587583]
spl1 = UnivariateSpline(x,y,s=0)
dydx = spl1.derivative(n=1)
T = dydx(x)
plt.plot(x,y,'-x')
plt.plot(x,T,'-')
plt.show()
The given data points look like they define a nice C1-smooth curve, but they do not. Plotting the slopes (difference of y over difference of x) shows this:
plt.plot(np.diff(y)/np.diff(x))
There are some duplicate values of y in the array that look like they don't belong, as well as some near-duplicate (but not identical) values of x.
The easiest way to fix the spline is to allow a tiny bit of smoothing:
spl1 = UnivariateSpline(x, y, s=1e-5)
makes the derivative what you expected:
Removing the "bad apple" also helps, though not as much:
spl1 = UnivariateSpline(x[:10] + x[11:], y[:10] + y[11:], s=0)  # list concatenation drops the point at index 10
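If you want to locate such points programmatically rather than by eye, a small sketch (the 1e-5 threshold is just an illustrative choice):
import numpy as np
# flag steps where x barely advances or y repeats exactly
dx = np.diff(x)
dy = np.diff(y)
suspect = np.where((np.abs(dx) < 1e-5) | (dy == 0))[0]
print(suspect)  # [10] for the data above, i.e. the duplicated point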

Why is the graph of the linear regression incorrect? Code and image provided

I was doing a linear regression using statsmodels in Python, and when I plotted the result it appeared erroneous. I checked a different data set, using the code from this question.
But even when I use the following code (taken from the above linked question), the line of best fit is still not displayed correctly. I am unsure what the problem is.
Code:
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
X = np.random.rand(100)
Y = X + np.random.rand(100)*0.1
results = sm.OLS(Y,sm.add_constant(X)).fit()
print(results.summary())
plt.scatter(X,Y)
X_plot = np.linspace(0,1,100)
plt.plot(X_plot, X_plot*results.params[0] + results.params[1])
plt.show()
My output:
Why isn't the line of best fit correct?
add_constant prepends the constant by default, which means that the constant is the first parameter and the slopes are the following parameters.
The predicted values are also available as fittedvalues or by calling predict without arguments.
For the explicit calculation the indices of params need to be corrected, i.e.
predicted = X_plot*results.params[1] + results.params[0]
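For completeness, a short sketch of the built-in accessors mentioned above, reusing the results object from the question:
# in-sample predictions at the training X
in_sample = results.fittedvalues
# predictions at new points; the exog needs the same constant column
out_of_sample = results.predict(sm.add_constant(X_plot))
plt.plot(X_plot, out_of_sample)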

Intuitive interpolation between unevenly spaced points

I have the following graph that I want to digitize to a high-quality publication grade figure using Python and Matplotlib:
I used a digitizer program to grab a few samples from one of the 3 data sets:
x_data = np.array([
1,
1.2371,
1.6809,
2.89151,
5.13304,
9.23238,
])
y_data = np.array([
0.0688824,
0.0490012,
0.0332843,
0.0235889,
0.0222304,
0.0245952,
])
I have already tried 3 different methods of fitting a curve through these data points. The first method was to draw a spline through the points using spline from scipy.interpolate.
This results in (with the actual data points drawn as blue markers):
This is obviously no good.
My second attempt was to fit a series of polynomials of different orders using curve_fit from scipy.optimize. Even up to a fourth-order polynomial the answer is useless (the lower-order ones were even more useless):
Finally, I used interp1d from scipy.interpolate to try and interpolate between the data points. Linear interpolation obviously yields expected results, but the lines are straight, and the whole purpose of this exercise is to get a nice smooth curve:
If I then use cubic interpolation I get a rubbish result, whereas quadratic interpolation yields a slightly better result:
But it's not quite there yet, and I don't think interp1d can do higher order interpolation.
Is there anyone out there who has a good method of doing this? Maybe I would be better off trying to do it in IPE or something?
Thank you!
A standard cubic spline is not very good at reasonable looking interpolations between data points that are very unevenly spaced. Fortunately, there are plenty of other interpolation algorithms and Scipy provides a number of them. Here are a few applied to your data:
import numpy as np
from scipy.interpolate import UnivariateSpline, Akima1DInterpolator, PchipInterpolator
import matplotlib.pyplot as plt
x_data = np.array([1, 1.2371, 1.6809, 2.89151, 5.13304, 9.23238])
y_data = np.array([0.0688824, 0.0490012, 0.0332843, 0.0235889, 0.0222304, 0.0245952])
x_data_smooth = np.linspace(min(x_data), max(x_data), 1000)
fig, ax = plt.subplots(1,1)
# quadratic spline through every point (blue)
spl = UnivariateSpline(x_data, y_data, s=0, k=2)
y_data_smooth = spl(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'b')
# Akima interpolation (green)
bi = Akima1DInterpolator(x_data, y_data)
y_data_smooth = bi(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'g')
# PCHIP interpolation (black)
bi = PchipInterpolator(x_data, y_data)
y_data_smooth = bi(x_data_smooth)
ax.plot(x_data_smooth, y_data_smooth, 'k')
ax.scatter(x_data, y_data)
plt.show()
I suggest looking through these, and also a few others, and finding one that matches what you think looks right. Also, though, you may want to sample a few more points. For example, I think the PCHIP algorithm wants to keep the fit monotonic between data points, so digitizing your minimum point would be useful (and probably a good idea regardless of the algorithm you use).
