Summarize polynomial fit to data in a single number, without plotting - python

I'm trying to find exponential growth trends in data fitted with a polynomial model and having issues identifying them. I've looked at scipy.optimize.curve_fit in Python and can't seem to figure out how to determine the shape of the fitted curve without plotting it. The end goal is to find trends that look like exponential growth.
I tried:
import numpy as np
import matplotlib.pyplot as plt

# x and y are the data arrays to be fitted
z = np.polyfit(x, y, 3)  # fit a cubic polynomial
f = np.poly1d(z)
# calculate new x's and y's
x_new = np.linspace(x[0], x[-1], 50)
y_new = f(x_new)
plt.plot(x, y, 'o', x_new, y_new)
plt.xlim([x[0] - 1, x[-1] + 1])
plt.show()
But I cannot reduce the shape of the curve to a single value, which could then be sorted to see which datasets have positive trends.

"identify the shape of the curve in a single value"
This can be done with polyfit by extracting just the single most important coefficient of the fitted polynomial. Of course, a single value can only give a rough idea of a trend: up/down, or concave up/concave down. Here is how:
import numpy as np

x = np.arange(600)
y = np.exp(x/300) + np.cos(x/30) + np.random.uniform(size=x.shape)  # simulated data
slope = np.polyfit(x, y, 1)[0]      # leading coefficient of a linear fit
concavity = np.polyfit(x, y, 2)[0]  # leading coefficient of a quadratic fit
print(slope, concavity)
The slope is the leading coefficient of the linear fit, which is positive (about 0.01), indicating an upward trend.
The concavity, the leading coefficient of the quadratic fit, is also positive (the shape is concave up on average).
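To apply this across many datasets, here is a minimal sketch (with made-up data and hypothetical series names, not part of the original answer) that ranks several y-series by the slope of a linear fit:
import numpy as np

# Hypothetical example: each series is a 1-D array of y-values on a shared x grid.
x = np.arange(100)
series = {
    "flat":  np.random.uniform(size=x.shape),
    "up":    0.05 * x + np.random.uniform(size=x.shape),
    "down": -0.05 * x + np.random.uniform(size=x.shape),
}

# Leading coefficient of a degree-1 fit is the slope; sort descending.
slopes = {name: np.polyfit(x, y, 1)[0] for name, y in series.items()}
for name, slope in sorted(slopes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: slope={slope:+.4f}")  # positive slope = upward trend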

Related

Is there a method to fit a wave created from two waves?

I need to fit a sine curve created from two sine waves and extract the parameters of the fitted curve (such as frequency, amplitude, etc.).
Data example:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

x = np.arange(0, 50, 0.01)
x2 = np.arange(0, 100, 0.02)
x3 = np.arange(0, 150, 0.03)
sin1 = np.sin(x)
sin2 = np.sin(x2)
sin3 = np.sin(x3 / 2)
sin4 = sin1 + sin2 + sin3
plt.plot(x, sin4)
plt.show()
I used the code provided in this answer.
yy = sin4
tt = x
res = fit_sin(tt, yy)  # fit_sin is defined in the linked answer
# i is a loop index from surrounding code not shown here
print(str(i), "Amplitude=%(amp)s, Angular freq.=%(omega)s, phase=%(phase)s, offset=%(offset)s, Max. Cov.=%(maxcov)s" % res)
fit_values = res["fitfunc"](tt)
Frequenc_fit = res['freq']
print(i, Frequenc_fit)
Amp_fit = res['amp']
Omega_fit = res['omega']
Phase_fit = res['phase']
Offset_fit = res['offset']
maxcov_fit = res['maxcov']
plt.plot(tt, yy, "-k", label="y", linewidth=2)
plt.plot(tt, fit_values, "r-", label="y fit curve", linewidth=2)
plt.legend(loc="best")
plt.show()
I got a fitted sine curve with a single frequency and amplitude as follows:
2 Amplitude=1.0149282025860233, Angular freq.=2.01112187048004, phase=-0.2730905030152767, offset=0.003304158823058212, Max. Cov.=0.0015266032307905222
2 0.3200799868471169
Is there a method to obtain a fitted curve that matches the original one?
Supposing that the function to be fitted is

y(x) = a*sin(w*x) + b*sin(W*x)
the principle of the method below is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
The graphical representation of the result (figure not reproduced here):
Blue curve: data obtained by scanning the graph given in the question.
Black curve: the fit from the above calculation.
The available data is not accurate because it comes from scanning the original figure. The deviation is mainly due to the numerical integrations used in computing the values of SS and SSSS (four successive numerical integrations are not accurate, especially with biased data).
The correct result is probably w=2, W=1, a=1, b=1.
NOTE: The above method is not iterative and thus doesn't require guessed values of the parameters to start an iterative process. The approximate parameter values can serve as good initial values for an iterative non-linear regression process.
NOTE: If the values of w and W were known a priori, solving by linear regression would be very simple and much more accurate (only the last 2x2 matrix computation referred to above).
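As a hedged illustration of that last note, here is a minimal sketch of the iterative refinement using scipy.optimize.curve_fit; the synthetic data and initial guesses below are assumptions for demonstration, not part of the original answer:
import numpy as np
from scipy.optimize import curve_fit

def two_sines(x, a, w, b, W):
    # Model from the answer: y(x) = a*sin(w*x) + b*sin(W*x)
    return a * np.sin(w * x) + b * np.sin(W * x)

# Synthetic data with known parameters a=b=1, w=1, W=2 (an assumption for demo).
x = np.arange(0, 50, 0.01)
y = np.sin(x) + np.sin(2 * x)

# Initial guesses, e.g. the approximate values from the non-iterative method.
p0 = [1.1, 0.9, 0.9, 2.1]
params, _ = curve_fit(two_sines, x, y, p0=p0)
print("a, w, b, W =", params)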

Linear regression forcing one specific value

I want to calculate a simple linear regression where I need to force a particular value for one point. Namely, I have x and y arrays, and I want my regression f(x) to force f(x[-1]) == y[-1] - that is, the prediction over the last element of x should be equal to the last element of y.
Is there a way to do it using Python and scikit-learn?
Here's a slightly roundabout trick that will do it.
Try re-centering your data, i.e. subtract x[-1], y[-1] from all datapoints so that x[-1], y[-1] is now the origin.
Now fit your data using sklearn.linear_model.LinearRegression with fit_intercept set to False. This way, the data is fit so that the line is forced to pass through the origin. Because we've re-centered the data, the origin corresponds to x[-1], y[-1].
When you use the model to make predictions, subtract x[-1] from any datapoint for which you are making a prediction, then add y[-1] to the resulting prediction, and this will give you the same results as forcing your model to pass through x[-1], y[-1].
This is a little roundabout but it's the simplest way that occurs to me to do it using the sklearn linear regression function (without writing your own).
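A minimal sketch of the trick, with made-up data for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 10, 100)
y = x + np.random.uniform(0, 3, len(x))

x0, y0 = x[-1], y[-1]  # the point the line must pass through
reg = LinearRegression(fit_intercept=False)
reg.fit((x - x0).reshape(-1, 1), y - y0)  # line forced through the shifted origin

# To predict: shift the input, predict, then shift the output back.
x_new = np.array([0.0, 5.0, x0])
y_pred = reg.predict((x_new - x0).reshape(-1, 1)) + y0
print(y_pred[-1], y0)  # the prediction at x[-1] equals y[-1] exactly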
The suggestion from HappyDog is great as a quick way to get a fit; however, I'd like to introduce another method that doesn't require any manipulation of your data. This method uses scipy.optimize.curve_fit to fit your data.
First, we need to realize that a normal linear regression will find A and B such that y=Ax+B provides the best fit to the input data. Your requirements state that the fit must pass through the final point in your sample data set. Essentially we'll be dropping a line that passes through your final point and rotating it around this point until we can minimize the errors.
Take a look at the point-slope equation for a line: y - yi = m*(x - xi), where (xi, yi) is any point on that line. If we make the substitution that this (xi, yi) point is the final point from your data set and solve for y, we get y = m*(x - xf) + yf. This is the model we will fit.
Translating this model to a python-function, we have:
def model(x, m, xf, yf):
    return m*(x - xf) + yf
We create a mock-data set for this example and just for demonstration purposes we will significantly shift the final y-value:
x = np.linspace(0, 10, 100)
y = x + np.random.uniform(0, 3, len(x))
y[-1] += 10
We're almost ready to perform the fit. The curve_fit function expects a callable function (model) to fit, the x and y data, and a list of the guesses of each parameter we are trying to fit. Since our model accepts two extra "constant" arguments (xf and yf), we use functools.partial to "set" these arguments based on our data.
partial_model = functools.partial(model, xf=x[-1], yf=y[-1])
p0 = [y[-1]/x[-1]] # Initial guess for m, as long as xf != 0
Now we can fit!
best_fit, covar = curve_fit(partial_model, x, y, p0=p0)
print("Best fit:", best_fit)
y_fit = model(x, best_fit[0], x[-1], y[-1])
intercept = model(0, best_fit[0], x[-1], y[-1]) # The y-intercept
And we look at the results:
plt.plot(x, y, "g*") # Input data will be green stars
plt.plot(x, y_fit, "r-") # Fit will be a red line
plt.legend(["Sample Data", f"y=mx+b ; m={best_fit[0]:.4f}, b={intercept:.4f}"])
plt.show()
Putting all this together in one code block and including imports gives:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
import functools

def model(x, m, xf, yf):
    return m*(x - xf) + yf

x = np.linspace(0, 10, 100)
y = x + np.random.uniform(0, 3, len(x))
y[-1] += 10
partial_model = functools.partial(model, xf=x[-1], yf=y[-1])
p0 = [y[-1]/x[-1]]  # Initial guess for m, as long as xf != 0
best_fit, covar = curve_fit(partial_model, x, y, p0=p0)
print("Best fit:", best_fit)
y_fit = model(x, best_fit[0], x[-1], y[-1])
intercept = model(0, best_fit[0], x[-1], y[-1])  # The y-intercept
plt.plot(x, y, "g*")      # Input data will be green stars
plt.plot(x, y_fit, "r-")  # Fit will be a red line
plt.legend(["Sample Data", f"y=mx+b ; m={best_fit[0]:.4f}, b={intercept:.4f}"])
plt.show()
We see a line passing through the final point, as required, and have found the best slope to represent this dataset.

Algorithm for generating smooth curves from incoming data points

I'm looking for an algorithm that smoothly interpolates points as they come in live.
For example, say I start with an array of 10 (x,y) pairs. I'm currently using scipy and a gaussian window to generate a smooth curve. However, what I can't figure out is how to update the smoothed curve in response to an 11th point generated at some future point (without completely redoing the smoothing for all 11 points).
What I'm looking for is an algorithm that follows the previous smooth curve up to the 10th (x,y) pair and also smoothly interpolates between the 10th and 11th pair (in a way that's similar to redoing the entire algorithm - so no sharp edges). Is there something out there that does what I'm looking for?
I think you could make use of a Cubic Spline. Given a list of n points (x_1, y_1)..(x_n, y_n), the algorithm finds a cubic polynomial p_k between (x_k, y_k) and (x_{k+1}, y_{k+1}) with the following constraints:
polynomials p_k and p_{k+1} pass through the point (x_{k+1}, y_{k+1});
polynomials p_k and p_{k+1} have the same first derivative at (x_{k+1}, y_{k+1});
polynomials p_k and p_{k+1} have the same second derivative at (x_{k+1}, y_{k+1}).
Also, there are boundary conditions defined for the first and last polynomial. I used natural boundary conditions, which force the second derivative to zero at the ends of the curve.
The steps that you could apply are:
Interpolate the first 10 points using the Cubic Spline.
Assign the first derivative value at p_10 to a variable d.
Run the Cubic Spline for p_10 and p_11, enforcing that the first derivative at p_10 is d and the second derivative at p_11 is zero.
From there, you can repeat the same steps for the remaining points.
This code will generate an interpolation through all the points:
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import CubicSpline

height = 4
n = 20
x = np.arange(n)
xs = np.arange(-0.1, n + 0.1, 0.1)
y = np.random.uniform(low=0, high=height, size=n)
plt.plot(x, y, 'o', label='data')
cs = CubicSpline(x, y)
plt.plot(xs, cs(xs), color='orange')
plt.ylim([0, height + 1])
Now, this code will interpolate the first 10 points, followed by another interpolation between points 10 and 11:
k = 10
delta = 0.001
plt.plot(x, y, 'o', label='data')
# spline through the first k points
xs = np.arange(x[0], x[k-1] + delta, delta)
cs = CubicSpline(x[0:k], y[0:k])
plt.plot(xs, cs(xs), color='red')
d = cs(x[k-1], 1)  # first derivative of the spline at the k-th point
# spline between points k and k+1, matching the derivative d on the left
xs2 = np.arange(x[k-1], x[k] + delta, delta)
cs2 = CubicSpline(x[k-1:k+1], y[k-1:k+1], bc_type=((1, d), 'natural'))
plt.plot(xs2, cs2(xs2), color='blue')
plt.ylim([0, height + 1])

How to compute and plot the pdf from the empirical cdf?

I have two numpy arrays, one is an array of x values and the other an array of y values and together they give me the empirical cdf. E.g.:
plt.plot(xvalues, yvalues)
plt.show()
I assume the data needs to be smoothed somehow in order to give a smooth pdf.
I would like to plot the pdf. How can I do that?
The raw data is at: http://dpaste.com/1HVK5DR .
There are two main problems: your data is quite noisy, and it is not equally spaced: the points at the low end are sampled quite densely, while the points at the high end are sampled quite sparsely. This can cause numerical issues.
So first I suggest resampling the data using linear interpolation to get equally spaced samples. (Note that all the snippets, appended to each other, form the contents of one Python file.)
import matplotlib.pyplot as plt
import numpy as np
from data import xvalues, yvalues #load data from file
print("#datapoints: {}".format(len(xvalues)))
#don't use every point if your computer is not very fast
xv = np.array(xvalues)[::5]
yv = np.array(yvalues)[::5]
#interpolate to have evenly space data
xi = np.linspace(xv.min(), xv.max(), 400)
yi = np.interp(xi, xv, yv)
Then, to smooth the data, I suggest performing an RBF regression (i.e., using an "RBF network"). The idea is to fit a curve of the form

c(t) = sum_i a_i * phi(t - x_i)

where phi is some radial basis function. (In theory we could use any functions.) To get a very smooth result, phi should itself be smooth; the code below uses the inverse quadratic kernel phi(r) = 1/(1 + (r/sigma)^2), where sigma is yet to be determined (a Gaussian exp(-r^2/sigma^2) would behave similarly). The x_i are just some nodes that we can define. If the target function is smooth, we only need a few nodes; the number of nodes also determines how much computation needs to be done. The a_i are the coefficients we optimize to get the best fit, here via a simple least-squares approach.
Note that if we can write a function in the form above, its derivative is very easy to compute:

c'(t) = sum_i a_i * phi'(t - x_i)

where phi' is the derivative of phi.
Regarding sigma: it is usually a good idea to choose it as a multiple of the step between the chosen nodes. The larger sigma is, the smoother the resulting function.
#set up rbf network
rbf_nodes = xv[::50][None, :]#use a subset of the x-values as rbf nodes
print("#rbfs: {}".format(rbf_nodes.shape[1]))
#estimate width of kernels:
sigma = 20 #greater = smoother, this is the primary parameter to play with
sigma *= np.max(np.abs(rbf_nodes[0,1:]-rbf_nodes[0,:-1]))
# kernel & derivative (inverse quadratic RBF)
rbf = lambda r: 1/(1 + (r/sigma)**2)
Drbf = lambda r: -2*r*sigma**2/(sigma**2 + r**2)**2
#compute coefficients of rbf network
r = np.abs(xi[:, None]-rbf_nodes)
A = rbf(r)
coeffs = np.linalg.lstsq(A, yi, rcond=None)[0]
print(coeffs)
#evaluate rbf network
N = 1000
xe = np.linspace(xi.min(), xi.max(), N)
Ae = rbf(xe[:, None] - rbf_nodes)
ye = Ae @ coeffs
#evaluate the derivative of the network (this is the smoothed pdf)
xd = np.linspace(xi.min(), xi.max(), N)
Bd = Drbf(xd[:, None] - rbf_nodes)
yd = Bd @ coeffs
fig,ax = plt.subplots()
ax2 = ax.twinx()
ax.plot(xv, yv, '-')
ax.plot(xi, yi, '-')
ax.plot(xe, ye, ':')
ax2.plot(xd, yd, '-')
fig.savefig('graph.png')
print('done')
You need the derivative to go from the CDF to the PDF:

PDF(x) = d CDF(x) / dx

With NumPy, you could use np.gradient:
pdf = np.gradient(yvalues, xvalues)
plt.plot(xvalues, pdf)
plt.show()
or a manual finite difference:
pdf = np.diff(yvalues)/np.diff(xvalues)
l = np.asarray(xvalues[:-1])
r = np.asarray(xvalues[1:])
plt.plot((l+r)/2.0, pdf) # points in the middle of interval
plt.show()
Both produce a similar-looking plot of the pdf.

Generate surface equation from x, y, z data

What is the most effective way of generating an equation for a surface where x, y, and z are known? There seem to be many ways to interpolate spline by spline from a few data points; however, I have all the data points representing a smooth surface and would still not have a single equation for the whole surface. Each spline is fairly simple in that it rises and falls once.
I have generated an equation from a least squares example on Brandon Stafford's blog, but the resulting equation does not represent the more complicated form.
I realize the cross terms are missing. How can I add each cross term (xy, xy^2, x^2y, x^3y, x^3y^2, y^3x, y^3x^2) into the script? Once I have cross terms do I need to add degrees for them?
import numpy as np

# Set up the canonical least squares form
Ax = np.vander(X, degree)
Ay = np.vander(Y, degree)
A = np.hstack((Ax, Ay))
# Solve for the least squares estimate
coeffs, residuals, rank, sing_vals = np.linalg.lstsq(A, Z, rcond=None)
# Extract coefficients and create polynomials in x and y
xcoeffs = coeffs[0:degree]
ycoeffs = coeffs[degree:2 * degree]
fx = np.poly1d(xcoeffs)
fy = np.poly1d(ycoeffs)
print(fx)
print(fy)
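One way to add the cross terms is to build the design matrix column by column from explicit (i, j) power pairs instead of stacking two single-variable Vandermonde matrices; then no extra degree bookkeeping is needed beyond the exponent list itself. A minimal sketch, with made-up data and an exponent list covering the terms named above (all data and names here are illustrative assumptions):
import numpy as np

# Hypothetical data; X, Y, Z are equal-length 1-D arrays.
X = np.random.uniform(-1, 1, 200)
Y = np.random.uniform(-1, 1, 200)
Z = 1 + X + Y + X * Y + (X ** 2) * Y + np.random.normal(0, 0.01, 200)

# Each pair (i, j) stands for a term X**i * Y**j: a constant, the pure
# powers, and the cross terms listed in the question.
powers = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (0, 2), (0, 3),
          (1, 1), (1, 2), (2, 1), (3, 1), (3, 2), (1, 3), (2, 3)]

A = np.column_stack([X ** i * Y ** j for i, j in powers])
coeffs, residuals, rank, sing_vals = np.linalg.lstsq(A, Z, rcond=None)

for (i, j), c in zip(powers, coeffs):
    print(f"x^{i} y^{j}: {c:+.4f}")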
