What does np.polyfit do and return? - python

I went through the docs but I'm not able to interpret them correctly.
In my code, I wanted to find a line that goes through two points (x1, y1), (x2, y2), so I've used
np.polyfit((x1, x2), (y1, y2), 1)
since it's a degree-1 polynomial (a straight line).
It returns [ -1.04 727.2 ].
Though my code (which is a much larger file) runs properly and does what it is intended to do, I want to understand what this call is returning.
I'm assuming polyfit returns a line (curved, straight, whatever) that satisfies (goes through) the points given to it, so how can that line be represented by the two numbers it returns?

From the numpy.polyfit documentation:
Returns:
p : ndarray, shape (deg + 1,) or (deg + 1, K)
Polynomial coefficients, highest power first. If y was 2-D, the coefficients for k-th data set are in p[:,k].
So these numbers are the coefficients of your polynomial. Thus, in your case:
y = -1.04*x + 727.2
By the way, numpy.polyfit will only return an equation that goes through all the points (say you have N) if the degree of the polynomial is at least N-1. Otherwise, it will return a best fit that minimises the squared error.
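As a quick check, here is a minimal sketch (with illustrative point values chosen so the result reproduces the numbers in the question; the original x1, y1, x2, y2 are not given) that evaluates the returned coefficients with np.polyval:

import numpy as np

# Two illustrative points (assumed values, not from the question).
x1, y1 = 100.0, 623.2
x2, y2 = 200.0, 519.2

coeffs = np.polyfit((x1, x2), (y1, y2), 1)   # [slope, intercept], highest power first
print(coeffs)                                # approx [ -1.04  727.2 ]

# np.polyval evaluates the polynomial at given x values:
print(np.polyval(coeffs, x1), y1)            # the line passes through (x1, y1)
print(np.polyval(coeffs, x2), y2)            # ... and through (x2, y2)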

These are essentially the beta and alpha values for the given data, where beta (the slope) indicates the degree of volatility.

Related

scipy.stats.wasserstein_distance implementation

I am trying to understand the implementation used in
scipy.stats.wasserstein_distance
For p=1 and no weights, with u_values and v_values the two 1-D distributions, the code comes down to:
u_sorter = np.argsort(u_values)                                            # (1)
v_sorter = np.argsort(v_values)
all_values = np.concatenate((u_values, v_values))                          # (2)
all_values.sort(kind='mergesort')
deltas = np.diff(all_values)                                               # (3)
u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')  # (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')
v_cdf = v_cdf_indices / v_values.size                                      # (5)
u_cdf = u_cdf_indices / u_values.size
return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))                  # (6)
What is the reasoning behind this implementation, is there some literature?
I did look at the paper cited which I believe explains why calculating the Wasserstein distance in its general definition in 1D is equivalent to evaluating the integral,
\int_{-\infty}^{+\infty} |U(x) - V(x)| \, dx,
with U and V the cumulative distribution functions for the distributions u_values and v_values,
but I don't understand how this integral is evaluated in the scipy implementation.
In particular,
a) why are they multiplying by the deltas in (6) to solve the integral?
b) how are v_cdf and u_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the element order of the distributions u_values and v_values is not preserved. Shouldn't it be, according to the general definition of the Wasserstein distance?
Thank you for your help!
The order of the PDF, histogram, or KDE is preserved and is important in the Wasserstein distance. If you only pass u_values and v_values, the function has to calculate something like a PDF, KDE, or histogram itself. Normally you would provide the PDF and the range of U and V as the four arguments to wasserstein_distance, so in the case where only samples are provided you are not passing a real data point, simply a collection of repeated "experiments". Steps (1) and (4) in your list of code blocks basically bin your data by the number of discrete values. A CDF is the number of discrete values up to a given point, i.e. P(x < X); it is basically the cumulative sum of a PDF, histogram, or KDE. Step (5) normalizes the CDF to between 0.0 and 1.0, or, said another way, it divides each bin by the number of bins.
So the order of the discrete values is preserved, not the original order within the data point.
b) It may make more sense if you plot the CDFs of a data point, such as an image file, using the code above.
The transportation problem, however, may not need a PDF but rather a data point of ordered features, or some way to measure distance between features, in which case you would calculate it differently.
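To make the CDF/delta mechanics concrete, here is a small self-contained sketch (a re-derivation of the same approach, not the scipy source) that builds the two empirical CDFs at the sorted breakpoints and sums |U - V| times the segment widths, i.e. a piecewise-constant evaluation of the integral above. It should agree with scipy.stats.wasserstein_distance for the unweighted p=1 case:

import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_1d(u_values, v_values):
    u_values = np.asarray(u_values, dtype=float)
    v_values = np.asarray(v_values, dtype=float)

    # All breakpoints of the two empirical CDFs, in increasing order.
    all_values = np.sort(np.concatenate((u_values, v_values)))
    deltas = np.diff(all_values)            # widths of the segments between breakpoints

    # Empirical CDF of each sample evaluated on each segment: the fraction
    # of sample points that are <= the left endpoint of the segment.
    u_cdf = np.searchsorted(np.sort(u_values), all_values[:-1], side='right') / u_values.size
    v_cdf = np.searchsorted(np.sort(v_values), all_values[:-1], side='right') / v_values.size

    # Both CDFs are constant on each segment, so the integral of |U - V|
    # reduces to a sum of rectangle areas: |U - V| * segment width.
    return np.sum(np.abs(u_cdf - v_cdf) * deltas)

rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, 200)
v = rng.normal(0.5, 1.5, 300)
print(wasserstein_1d(u, v))
print(wasserstein_distance(u, v))   # should match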

How can I change the number of basis functions when performing B-Spline fitting in scipy (python)?

I have a discrete set of points (x_n, y_n) that I would like to approximate/represent as a linear combination of B-spline basis functions. I need to be able to manually change the number of B-spline basis functions used by the method, and I am trying to implement this in python using scipy. To be specific, below is a bit of code that I am using:
import scipy
spl = scipy.interpolate.splrep(x, y)
However, unless I have misunderstood or missed something in the documentation, it seems I cannot change the number of B-spline basis functions that scipy uses. That seems to be set by the size of x and y. So, my specific questions are:
Can I change the number of B-spline basis functions used by scipy in the "splrep" function that I used above?
Once I have performed the transformation shown in the code above, how can I access the coefficients of the linear combination? Am I correct in thinking that these coefficients are stored in the vector spl[1]?
Is there a better method/toolbox that I should be using?
Thanks in advance for any help/guidance you can provide.
Yes, spl[1] are the coefficients, and spl[0] contains the knot vector.
However, if you want better control, you can work with BSpline objects directly and construct them with make_interp_spline or make_lsq_spline, which accept the knot vector; the knot vector determines the B-spline basis functions to use.
You can change the number of B-spline basis functions by supplying a knot vector with the t parameter. Since number of knots = number of coefficients + degree + 1, the number of knots also determines the number of coefficients (i.e. the number of basis functions).
The usage of the t parameter is not entirely intuitive, since the given knots should be only the inner knots. For example, if you want 7 coefficients for a cubic spline, you need to give 3 inner knots. Inside the function, the first and last (degree+1) knots are padded with xb and xe (clamped end conditions; see for example here).
Furthermore, as the documentation says, the knots should satisfy the Schoenberg-Whitney conditions.
Here is an example code that does this:
import numpy as np
import scipy.interpolate

# Input:
x = np.linspace(0, 2*np.pi, 9)
y = np.sin(x)

# Your code:
spl = scipy.interpolate.splrep(x, y)
t, c, k = spl                       # knots, coefficients, degree (== 3 for cubic)

# Computing the inner knots and using them:
t3 = np.linspace(x[0], x[-1], 5)    # five equally spaced knots in the interval
t3 = t3[1:-1]                       # keep only the three inner values
spl3 = scipy.interpolate.splrep(x, y, t=t3)
Regarding your second question, you're right that the coefficients are indeed stored in spl[1]. However, note that (as the documentation says) the last (degree+1) values are zero-padded and should be ignored.
In order to evaluate the resulting B-spline you can use the function splev or the class BSpline.
Below is some example code that evaluates and draws the above splines:
import matplotlib.pyplot as plt

xx = np.linspace(x[0], x[-1], 101)         # sample points
yy = scipy.interpolate.splev(xx, spl)      # evaluate original spline
yy3 = scipy.interpolate.splev(xx, spl3)    # evaluate new spline

plt.plot(x, y, 'b.')                       # plot original interpolation points
plt.plot(xx, yy, 'r-', label='spl')
plt.plot(xx, yy3, 'g-', label='spl3')
plt.legend()
plt.show()
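If you prefer the BSpline interface mentioned above, here is a rough equivalent using make_lsq_spline (a sketch continuing with x, y, xx, plt from the snippets above; note that make_lsq_spline expects the full knot vector, including the repeated boundary knots, unlike splrep's t parameter):

from scipy.interpolate import make_lsq_spline

k = 3                                        # cubic
inner = np.linspace(x[0], x[-1], 5)[1:-1]    # the same three inner knots as t3 above
# Full knot vector: repeat each boundary knot (k+1) times (clamped ends).
t_full = np.r_[(x[0],) * (k + 1), inner, (x[-1],) * (k + 1)]

spl_lsq = make_lsq_spline(x, y, t_full, k)   # returns a BSpline object
print(spl_lsq.c.size)                        # number of coefficients: len(t_full) - k - 1 == 7

plt.plot(xx, spl_lsq(xx), 'm--', label='make_lsq_spline')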

What do extra results of numpy.polyfit mean?

When creating a line of best fit with numpy's polyfit, you can specify the parameter full to be True. This returns four extra values, apart from the coefficients. What do these values mean, and what do they tell me about how well the function fits my data?
https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html
What I'm doing is:
bestFit = np.polyfit(x_data, y_data, deg=1, full=True)
and I get the result:
(array([ 0.00062008,  0.00328837]), array([ 0.00323329]), 2,
 array([ 1.30236506,  0.55122159]), 1.1102230246251565e-15)
The documentation says that the four extra pieces of information are: residuals, rank, singular_values, and rcond.
Edit:
I am looking for a further explanation of how rcond and singular_values describe the goodness of fit.
Thank you!
how rcond and singular_values describe the goodness of fit
Short answer: they don't.
They do not describe how well the polynomial fits the data; this is what residuals are for. They describe how numerically robust was the computation of that polynomial.
rcond
The value of rcond is not really about the quality of the fit; it describes the process by which the fit was obtained, namely a least-squares solution of a linear system. Most of the time the user of polyfit does not provide this parameter, so a suitable value is picked by polyfit itself. This value is then returned to the user for their information.
rcond is used for truncation in ill-conditioned matrices. The least-squares solver does two things:
1. Finds x that minimizes the norm of the residual Ax - b.
2. If multiple x achieve this minimum, returns the x with the smallest norm among those.
The second clause occurs when some changes of x do not affect the right-hand side at all. But since floating point computations are imperfect, usually what happens is that some changes of x affect the right hand side very little. And this is where rcond is used to decide when "very little" should be considered as "zero up to noise".
For example, consider the system
x1 = 1
x1 + 0.0000000001 * x2 = 2
This one can be solved exactly: x1 = 1 and x2 = 10000000000. But... that tiny coefficient (which, in reality, came out of some matrix manipulations) has some numeric error in it; for all we know it could be negative, or zero. Should we let it have such a huge influence on the solution?
So, in such a situation the matrix (specifically its singular values) gets truncated at level rcond. This leaves
x1 = 1
x1 = 2
for which the least-squares solution is x1 = 1.5, x2 = 0. Note that this solution is robust: no huge numbers from tiny fluctuations of coefficients.
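To see this effect directly, here is a small sketch (my own illustration) using np.linalg.lstsq, which takes the same rcond parameter. With rcond left at its machine-precision default the tiny singular value survives and you get the exact but explosive solution; with a larger rcond it is truncated as noise and the robust solution is returned:

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1e-10]])
b = np.array([1.0, 2.0])

# Default rcond (machine precision): the tiny singular value is kept,
# so we get the "exact" solution with x2 = 1e10.
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_exact)        # approx [1.0, 1.0e10]

# A coarser rcond truncates the tiny singular value, treating it as noise;
# the least-squares solution is then roughly [1.5, 0].
x_trunc, *_ = np.linalg.lstsq(A, b, rcond=1e-8)
print(x_trunc)        # approx [1.5, 0.0]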
Singular values
When one solves a linear system Ax = b in the least-squares sense, the singular values of A determine how numerically tricky this is. Specifically, a large disparity between the largest and smallest singular values is problematic: such systems are ill-conditioned. An example is
0.835*x1 + 0.667*x2 = 0.168
0.333*x1 + 0.266*x2 = 0.067
The exact solution is (1, -1). But if the right-hand side is changed from 0.067 to 0.066, the solution becomes (-666, 834), which is totally different. The problem is that the singular values of A are (roughly) 1 and 1e-6; this magnifies any changes on the right-hand side by a factor of 1e6.
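A quick numerical check of this sensitivity (my own illustration):

import numpy as np

A = np.array([[0.835, 0.667],
              [0.333, 0.266]])

print(np.linalg.solve(A, [0.168, 0.067]))   # approx [   1.,   -1.]
print(np.linalg.solve(A, [0.168, 0.066]))   # approx [-666.,  834.]
print(np.linalg.cond(A))                    # condition number on the order of 1e6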
Unfortunately, polynomial fitting often results in ill-conditioned matrices. For example, fitting a polynomial of degree 24 to 25 equally spaced data points is inadvisable.
import numpy as np
x = np.arange(25)
np.polyfit(x, x, 24, full=True)
The singular values are
array([4.68696731e+00, 1.55044718e+00, 7.17264545e-01, 3.14298605e-01,
       1.16528492e-01, 3.84141241e-02, 1.15530672e-02, 3.20120674e-03,
       8.20608411e-04, 1.94870760e-04, 4.28461687e-05, 8.70404409e-06,
       1.62785983e-06, 2.78844775e-07, 4.34463936e-08, 6.10212689e-09,
       7.63709211e-10, 8.39231664e-11, 7.94539407e-12, 6.32326226e-13,
       4.09332903e-14, 2.05501534e-15, 7.55397827e-17, 4.81104905e-18,
       8.98275758e-20])
which, with the default value of rcond (5.55e-15 here), gets four of them truncated to 0.
The difference in magnitude between smallest and largest singular values indicates that perturbing the y-values by numbers of size 1e-15 can result in changes of about 1 to the coefficients. (Not every perturbation will do that, just some that happen to align with a singular vector for a small singular value).
Rank
The effective rank is just the number of singular values above the rcond threshold. In the above example it's 21. This means that even though the fit is for 25 points, and we get a polynomial with 25 coefficients, there are only 21 degrees of freedom in the solution.
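As a quick sanity check (my own sketch), the effective rank reported by polyfit should equal the number of returned singular values that exceed rcond times the largest one:

import numpy as np

x = np.arange(25)
coeffs, residuals, rank, sv, rcond = np.polyfit(x, x, 24, full=True)

print(rank)                            # effective rank (21 here, as discussed above)
print(np.sum(sv > rcond * sv.max()))   # the same count, computed by hand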

Derivatives blow up in python

I am trying to find higher order derivatives of a dataset (x,y). x and y are 1D arrays of length N.
Let's say I generate them as:
import numpy as np

xder0 = np.linspace(0, 10, 1000)
yder0 = np.sin(xder0)
I define a derivative function which takes in two arrays (x, y) and returns (x1, y1), where y1 is the derivative calculated at each index as (y[i+1]-y[i])/(x[i+1]-x[i]) and x1 is just the mean of x[i+1] and x[i].
Here is the function that does it:
def deriv(x, y):
    delx  = np.zeros(len(x) - 1, dtype=np.longdouble)   # midpoints of consecutive x values
    ydiff = np.zeros(len(x) - 1, dtype=np.longdouble)   # finite-difference slopes
    for i in range(len(x) - 1):
        delx[i]  = (x[i+1] + x[i]) / 2.0
        ydiff[i] = (y[i+1] - y[i]) / (x[i+1] - x[i])
    return delx, ydiff
Now to calculate the first derivative, I call this function as:
xder1, yder1 = deriv(xder0, yder0)
Similarly for second derivative, I call this function giving first derivatives as input:
xder2, yder2 = deriv(xder1, yder1)
And it goes on:
xder3, yder3 = deriv(xder2, yder2)
xder4, yder4 = deriv(xder3, yder3)
xder5, yder5 = deriv(xder4, yder4)
xder6, yder6 = deriv(xder5, yder5)
xder7, yder7 = deriv(xder6, yder6)
xder8, yder8 = deriv(xder7, yder7)
xder9, yder9 = deriv(xder8, yder8)
Something peculiar happens after I reach order 7: the 7th-order derivative becomes very noisy! The earlier derivatives are all either sine or cosine functions, as expected, but the 7th order is a noisy sine, and hence all derivatives after that blow up.
Any idea what is going on?
This is a well-known stability issue with numerical interpolation using equally spaced points. Read the answers at http://math.stackexchange.com.
To overcome this problem you have to use non-equally-spaced points, like the roots of a Legendre polynomial. The instability occurs due to the lack of information at the boundaries, so a higher concentration of points at the boundaries is required, as given by the roots of, say, Legendre polynomials or others with similar properties, such as Chebyshev polynomials.
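As one concrete illustration of avoiding repeated finite differencing (my own sketch, not part of the answer above): fit a Chebyshev series to the samples and differentiate the fitted series analytically; the higher derivatives then stay smooth instead of blowing up:

import numpy as np

xder0 = np.linspace(0, 10, 1000)
yder0 = np.sin(xder0)

# Fit a Chebyshev series (degree chosen generously for sin(x) on [0, 10]).
cheb = np.polynomial.chebyshev.Chebyshev.fit(xder0, yder0, deg=20)

# Differentiate the fitted series analytically, seven times.
d7 = cheb.deriv(7)

# The 7th derivative of sin(x) is -cos(x); the error should stay small.
xx = np.linspace(0, 10, 200)
print(np.max(np.abs(d7(xx) + np.cos(xx))))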

Finding error range for peak point using polynomial fitting

I have data for a spectral line which makes a noisy U-shaped curve.
I want to fit a curve and find the x, y values of the minimum point.
I fitted a polynomial to it using polyfit.
I then found the minimum point on the fitted curve.
NB: The original curve is not symmetric (the left side is slightly steeper than the right).
Therefore min(original) is slightly to the left of min(fitted_curve).
How do I find the x and y errors for this point?
Here are the bones of my code:
import pylab, numpy

x = [...]  # linear list of floats
y = [...]  # list of floats; produces a noisy U-shaped curve

fit = numpy.polyfit(x, y, 3)    # fit a third-order polynomial
fit2 = numpy.polyval(fit, x)    # evaluate it at the data points

miny = ...  # min y value on the fitted curve
minx = ...  # corresponding x value (not the actual min(x))

pylab.plot(x, y, 'k-')
pylab.plot(x, fit2, 'r-')
pylab.plot(minx, miny, 'ro')
pylab.show()
Now that I have the original [x, y], the fitted curve [x, fit2], and the minimum point on the fitted curve [minx, miny], how do I find the error range for this single point?
Thanks.
Since numpy 1.7, polyfit has the option cov=True, which gives you the covariance matrix of the coefficients as additional output. From this, using Gaussian error propagation, you can get the error of the minimum. But what kind of spectrum is it? Very often there are model shapes to fit, so that there is no need for a polynomial fit.
You might also want to look at scipy.optimize.curve_fit.
PS: What makes you think that the true value is to the left of your fitted value? This would be true if your fit function were symmetric and applied to an asymmetric peak. A third-order polynomial, however, should be able to capture the asymmetry.
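A minimal sketch of that error-propagation idea (my own illustration, with synthetic data standing in for the original spectrum, since x and y are not given): fit with cov=True, locate the minimum of the cubic from its derivative, and push the coefficient covariance through a numerical Jacobian of (minx, miny) with respect to the coefficients:

import numpy as np

# Synthetic noisy U-shaped data in place of the spectrum.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 3.0, 200)
y = 0.3*(x - 1.1)**3 + (x - 1.1)**2 + 5.0 + rng.normal(0, 0.05, x.size)

coeffs, cov = np.polyfit(x, y, 3, cov=True)

def minimum_of_cubic(p):
    # Stationary points of the cubic, keeping only real ones inside the data range.
    roots = np.roots(np.polyder(p))
    roots = roots[np.isreal(roots)].real
    roots = roots[(roots > x.min()) & (roots < x.max())]
    minx = roots[np.argmin(np.polyval(p, roots))]
    return np.array([minx, np.polyval(p, minx)])

best = minimum_of_cubic(coeffs)

# Numerical Jacobian of (minx, miny) with respect to the four coefficients,
# then sigma^2 = J C J^T (first-order Gaussian error propagation).
J = np.zeros((2, 4))
for i in range(4):
    step = 1e-8 * max(abs(coeffs[i]), 1.0)
    dp = np.zeros(4)
    dp[i] = step
    J[:, i] = (minimum_of_cubic(coeffs + dp) - minimum_of_cubic(coeffs - dp)) / (2 * step)

err = np.sqrt(np.diag(J @ cov @ J.T))
print("minx = %.4f +/- %.4f" % (best[0], err[0]))
print("miny = %.4f +/- %.4f" % (best[1], err[1]))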
