I use numpy.polyfit to fit a 2nd-order polynomial to a set of data:
fit1, fit_err1, _, _, _ = np.polyfit(xint[:index_max],
                                     yint[:index_max],
                                     2,
                                     full=True)
For a few of my data sets, the variable fit_err1 is empty even though the fit was successful, i.e. fit1 is not empty!
Does anybody know what an empty residual means in this context? Thank you!
EDIT:
one example data set:
x = [-488., -478., -473.]
y = [ 0.02080881, 0.03233648, 0.03584448]
fit1, fit_err1, _, _, _ = np.polyfit(x, y, 2, full=True)
result:
fit1 = [ -3.00778818e-05 -2.79024663e-02 -6.43272769e+00]
fit_err1 = []
I know that fitting a 2nd-order polynomial to a set of three points is not very useful, but then I still expect the function to either raise a warning, or (as it actually determined a fit) return the actual residuals, or both (like "here are the residuals, but your conditions are poor!").
As pointed out by @Jaime, if you have three points, a second-order polynomial will fit them exactly. Your point that the error should be 0 rather than an empty array makes sense, but this is the current behavior of np.linalg.lstsq, which np.polyfit wraps.
We can test this behavior by doing the least-squares fit of the equation y = a*x**0 + b*x**1 + c*x**2, for which we know the answer should be a=0, b=0, c=1:
np.linalg.lstsq([[1, 1, 1], [1, 2, 4], [1, 3, 9]], [1, 4, 9])
#(array([ -3.43396424e-15, 3.88578059e-15, 1.00000000e+00]),
# array([], dtype=float64),
# 3,
# array([ 10.64956309, 1.2507034 , 0.15015641]))
where we can see that the second output is an empty array. This is the intended behavior.
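If you need an explicit error value even in the exactly determined case, a minimal workaround (sketched here with the example data from above) is to compute the sum of squared residuals yourself:

import numpy as np

x = np.array([-488., -478., -473.])
y = np.array([0.02080881, 0.03233648, 0.03584448])

fit1 = np.polyfit(x, y, 2)
# evaluate the fitted polynomial and sum the squared residuals manually
residual = np.sum((np.polyval(fit1, x) - y) ** 2)
print(residual)  # essentially 0, since three points determine the parabola exactly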
I am testing the scipy.optimize function curve_fit(). I am testing it on a quadratic function, and I have assigned the x and y data manually for this question. I get the expected parameter values for basically every guess I put in. However, I noticed that for guesses of the first parameter not close to 0 (in particular, greater than 1), I get a covariance matrix full of infinities. I am not sure why such a simple test is failing.
# python version: 3.9.7
# using a venv
# numpy version: 1.23.2
# scipy version: 1.9.0
import numpy as np
from scipy.optimize import curve_fit
# data taken from a quadratic function of: y = 3*x**2 + 2
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.float64)
y = np.array([2, 5, 14, 29, 50, 77, 110, 149, 194, 245, 302], dtype=np.float64)
# quadratic function
def func(x, a, b, c):
    return a * x**2 + b * x + c

# test to reproduce success case - notice that we have success when changing the first value up to a value of 1.0
success = [0, 0, 0]
# test to reproduce failure case
failure = [4, 0, 0]
popt, pcov = curve_fit(func, x, y, p0=failure) # change p0 to success or failure
print(popt) # expected answer is [3, 0, 2]
print(pcov) # covariance matrix
I'm not sure why you're expecting a different covariance matrix. The documentation says:
If the Jacobian matrix at the solution doesn’t have a full rank, then ‘lm’ method returns a matrix filled with np.inf
As far as I understand, the Jacobian matrix is estimated during the optimization, and depending on the initialization you use, the above case might happen. Note that the result in popt still converges!
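As a quick check (a minimal sketch reusing the question's setup), you can verify that popt still ends up essentially at [3, 0, 2] even when pcov comes back filled with inf:

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * x**2 + b * x + c

x = np.arange(11, dtype=np.float64)
y = 3 * x**2 + 2

popt, pcov = curve_fit(func, x, y, p0=[4, 0, 0])  # the "failure" initial guess
print(popt)            # per the question, this still converges to roughly [3, 0, 2]
print(np.isinf(pcov))  # the covariance entries may still be inf in this case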
The covariance matrix is really only useful (and, in general, can only be calculated) when each and every variable is optimized. That generally means each variable is moved away from its initial value in such a way that the dependence of the fit quality (typically, chi-square) on the value of that variable can be determined.
It also turns out that if initial guesses are bad, the solution may not be found -- and some variables may not actually be moved from their initial values. I think that is what is happening for you.
An initial value of "0" is particularly troublesome, as the fit really does not know "how zero" that is. Is that "magnitude less than 1e-16" or "magnitude less than 1"? Even using initial values of [4, 0.01, 0.01] would get to a good solution.
An additional potential problem is that your "data" is exactly given by the model function and values. At "the right solution", the residual will be extremely close to zero, and converting the Jacobian matrix of derivatives (of misfit with respect to variables) to a covariance matrix can be numerically unstable. That would be very unlikely with real data, but you may want to add a small amount of noise to the data being modeled.
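A minimal sketch combining both suggestions (non-zero initial values and a small amount of added noise; the noise scale of 0.1 is just an illustrative choice):

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * x**2 + b * x + c

rng = np.random.default_rng(0)
x = np.arange(11, dtype=np.float64)
y = 3 * x**2 + 2 + rng.normal(scale=0.1, size=x.size)  # data no longer fits the model exactly

popt, pcov = curve_fit(func, x, y, p0=[4, 0.01, 0.01])  # avoid exact zeros in the initial guess
print(popt)  # close to [3, 0, 2]
print(pcov)  # finite covariance estimates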
I am not a mathematician but I think that what I am after is called a "multiple linear regression"; please correct me if I am wrong.
I use numpy.polyfit and numpy.poly1d on a series of angle/pulse_width values from a servo motor, to obtain a function, angles_to_pulsewidths.
angles_to_pulsewidths is a polynomial function that models the servo and represents a line of good fit for the series. Given an angle value, it returns a corresponding pulse_width.
I am now trying to do a similar thing but instead of a single angle value in my series, I have pair of x/y co-ordinates for each pulse_width. I want to obtain a function that given an x/y pair, returns a corresponding pulse_width.
This is my code for creating my angles_to_pulsewidths function:
import numpy
angles_and_pulsewidths = [
    [-162, 2490],
    [-144, 2270],
    [-126, 2070],
    [-108, 1880]
]
angles_values_array = numpy.array(angles_and_pulsewidths)[:, 0]
pulsewidths_values_array = numpy.array(angles_and_pulsewidths)[:, 1]
coefficients = numpy.polyfit(
    angles_values_array,
    pulsewidths_values_array,
    3
)
angles_to_pulsewidths = numpy.poly1d(coefficients)
I have been trying to modify this so that instead of providing a one-dimensional array of angles I will provide a two-dimensional array of x/y values:
xy_values = [[1, 2], [3, 4], [5, 6], [6, 7]]
pulse_widths = [2490, 2270, 2070, 1880]
However, in this case, I can't use polyfit, because it takes only a one-dimensional array for its x parameter.
I can use numpy.linalg.lstsq instead, but I can't work out what to do with the results it gives me.
I'm also not even sure if I am on the right track; am I? I have read numerous related questions here, and have found numerous clues, but not enough to get me to the next step.
It is possible to use scipy's curve_fit for this.
If you know the general format of the function, perhaps you think it will be something of the form:
a x^2 + b x y + c y^2 + d x + e y + f
then you can use scipy's curve_fit to estimate what I will refer to as "parameters": a, b, c, d, e, f.
First we need to define the general form of our function:
def func(variables, a, b, c, d, e, f):
    x, y = variables
    return a * x ** 2 + b * x * y + c * y ** 2 + d * x + e * y + f
Note that our function has 6 parameters. To demonstrate how this works we need more data points than parameters, so I'm extending your example data set to have 7 pairs of xy values and 7 pulse widths:
xy_values = [[1, 2], [3, 4], [5, 6], [6, 7], [8, 9], [10, 11], [12, 13]]
pulse_widths = [2490, 2270, 2070, 1880, 2000, 500, 600]
(If you do not have more data than parameters, then you probably can choose a general form of your function that has fewer parameters.)
We need to reshape our xy_values so that it is not pairs of values but a single pair of two sets of values (the xs and the ys). To do this I'm choosing to create a numpy array and transpose it:
xy_values = np.array(xy_values).T
We can now call our func on our array:
func(variables=xy_values, a=0, b=0, c=0, d=0, e=0, f=4)
Which gives:
array([4, 4, 4, 4, 4, 4, 4])
We can now actually use our data and curve_fit to estimate the best parameters:
from scipy.optimize import curve_fit
popt, pcov = curve_fit(f=func, xdata=xy_values, ydata=pulse_widths)
pcov contains information about how good the fit is, and popt holds the actual values of the parameters, which we can directly inspect and use:
popt
gives:
array([ -25.61043682, -106.84636863, 119.10145249, -374.6200899 ,
        230.65326227, 2141.55126789])
and we can call the function with it on some new value of x and y:
func([0, 5], *popt)
which gives:
6272.353891536915
Choosing the correct general form of the function you want to fit is case dependent. If there is any knowledge of the problem at hand (perhaps you expect there to be some trigonometric relationship), then you can use it; otherwise it's a case of trial and error and finding a relationship that's "good enough" for your use case.
EDIT: Your original suggestion of needing to use multiple linear regression (MLR) is not completely incorrect. The solution approach I've described allows you to do MLR but it just assumes a specific type of func: one where all the terms are linear.
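For example, a minimal sketch of what a purely linear func would look like (linear_func is just an illustrative name; it is passed to curve_fit exactly like func above):

def linear_func(variables, a, b, c):
    # every term is linear: a*x + b*y + c
    x, y = variables
    return a * x + b * y + c

# fitted the same way as before:
# popt, pcov = curve_fit(f=linear_func, xdata=xy_values, ydata=pulse_widths)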
I'm working on an assignment where I need to do KNN Regression using the sklearn library--but, if I have missing data (assume it's missing-at-random) I am not supposed to impute it. Instead, I have to leave it as null and somehow in my code account for it to ignore comparisons where one value is null.
For example, if my observations are (1, 2, 3, 4, null, 6) and (1, null, 3, 4, 5, 6), then I would ignore both the second and the fifth values in the comparison.
Is this possible with the sklearn library?
ETA: I would just drop the null values, but I won't know what the data looks like that they'll be testing and it could end up dropping anywhere between 0% and 99% of the data.
This depends a little on what exactly you're trying to do.
Ignore all columns with nulls: I imagine this isn't what you're asking since that's more of a data pre-processing step and isn't really unique to sklearn. Even in pure python, just search for column indices containing nulls and construct a new data set with those indices filtered out.
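For example, a minimal pure-python sketch of that filtering step (using the observations from the question):

X = [[1, 2, 3, 4, None, 6],
     [1, None, 3, 4, 5, 6]]
# keep only the column indices that contain no None in any row
keep = [j for j in range(len(X[0])) if all(row[j] is not None for row in X)]
X_filtered = [[row[j] for j in keep] for row in X]
# X_filtered == [[1, 3, 4, 6], [1, 3, 4, 6]]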
Ignore null values in vector comparisons: This one is actually kind of fun. Essentially you're saying that the distance between [1, 2, 3, 4, None, 6] and [1, None, 3, 4, 5, 6] is computed using only the coordinates where both vectors have a value, i.e. sqrt((1-1)**2 + (3-3)**2 + (4-4)**2 + (6-6)**2) = 0 here. In this case you need some kind of a custom metric, which sklearn supports. Unfortunately you can't input null values into the KNN fit() method, so even with a custom metric you can't quite get what you want. The solution is to pre-compute distances. E.g.:
from math import sqrt, isfinite
from sklearn.neighbors import KNeighborsRegressor

X_train = [
    [1, 2, 3, 4, None, 6],
    [1, None, 3, 4, 5, 6],
]
y_train = [3.14, 2.72]  # we're regressing something

def euclidean(p, q):
    # Could also use numpy routines
    return sqrt(sum((x - y)**2 for x, y in zip(p, q)))

def is_num(x):
    # The `is not None` check needs to happen first because of short-circuiting
    return x is not None and isfinite(x)

def restricted_points(p, q):
    # Returns copies of `p` and `q` except at coordinates where either vector
    # is None, inf, or nan
    return tuple(zip(*[(x, y) for x, y in zip(p, q) if all(map(is_num, (x, y)))]))

def dist(p, q):
    # Note that in this form you can use any metric you like on the
    # restricted vectors, not just the euclidean metric
    return euclidean(*restricted_points(p, q))

dists = [[dist(p, q) for p in X_train] for q in X_train]

knn = KNeighborsRegressor(
    n_neighbors=1,  # only needed in our test example since we have so few data points
    metric='precomputed'
)
knn.fit(dists, y_train)

X_test = [
    [1, 2, 3, None, None, 6],
]

# We tell sklearn which points in the knn graph to use by telling it how far
# our queries are from every input. This is super inefficient.
predictions = knn.predict([[dist(q, p) for p in X_train] for q in X_test])
There's still an open question of what to do if you have nulls in the outputs you're regressing to, but your problem statement doesn't make it sound like that's an issue for you.
This should work:
import pandas as pd

df = pd.read_csv("your_data.csv")
df.dropna(inplace=True)
This is one of those questions that's probably going to be totally obvious once answered, but for now I'm stuck.
I'm trying to re-create an equation from a result dataset and the four parameters that produced it.
The data is in a matrix with the last column being the result.
I saw that numpy.polyfit allows multiple values for y, so I tried...
result=data[:,-1]
variables=data[:,0:-1]
factors=numpy.polyfit(result,variables,2)
The result comes out as:
[[-4.69652251e-01  8.09734523e-01  1.93673361e-02 -1.62700198e+00]
 [ 1.42092582e+01 -7.06024402e+00 -9.94583683e-02  1.11882833e+01]
 [ 7.44030682e+00  2.08161127e+01  2.65025708e-01  1.14229534e+01]]
I'm assuming the result coefficients are in the form
[[A^2,   B^2,   C^2,   D^2  ]
 [A,     B,     C,     D    ]
 [const, const, const, const]]
Which is a little puzzling, especially since if I apply the coefficients to the input data I don't seem to be getting anything even close to the result data.
First off, am I even right about the meaning of polyfit's results?
Second, why are there four constants, all different? Am I supposed to add them together, or what?
Is this merely solving A vs result, then B vs result, etc, rather than combined multi-dimensional minimizing of the whole?? (And if so, how could I do that instead?)
Or am I just misguided what polyfit is doing in the first place?
Polyfit docs tell us that
Several data sets of sample points sharing the same x-coordinates can
be fitted at once by passing in a 2D-array that contains one dataset
per column.
Let us understand it.
Firstly, let us consider an example. Say we have 3 points in the plane and want to fit them with a polynomial of degree 1. That means we want to draw a line through the given 3 points such that the squared distance to the points is minimal.
Say we have 3 points: (1, 1), (2, 2), (3, 3). Obviously, it is possible to find a line that goes through these points without any error, and this line is y = x. If we think of the line in terms of y = a * x + b, then a = 1, b = 0.
Good. Now let us start by giving this example to numpy polyfit:
X = np.array([1, 2, 3])
y = np.array([1, 2, 3])
a, b = np.polyfit(X, y, deg=1)
(a, b)
>>> (0.9999999999999997, 1.2083031466395714e-15)
a * 1000 + b
>>> 999.9999999999997
Nice. Now let us redo the example with a matrix instead of a single vector y. The docs tell us that we just have multiple lines sharing the same X coordinates. Let us check this. We take two sets of points: (1, 1), (2, 2), (3, 3), fitted by the line y = x, and (1, 2), (2, 4), (3, 6), fitted by the line y = 2x (check!).
We transpose the second matrix because polyfit wants one dataset per column.
X = np.array([1, 2, 3])
y = np.array([[1, 2, 3], [2, 4, 6]]).T
coeff = np.polyfit(X, y, deg=1)
coeff
>>> array([[1.00000000e+00, 2.00000000e+00],
           [1.20830315e-15, 2.41660629e-15]])
We see that we have a matrix with first row (1, 2) and second row (0, 0). So the first column contains the coefficients for the first line, and the second column contains the coefficients for the second line. Let us check:
a, b = coeff[:, 0]
a * 10 + b
>>> 9.999999999999998
a, b = coeff[:, 1]
a * 100 + b
>>> 199.99999999999994
So, you can pass multiple lines that share the same X coordinates and get many fits simultaneously. This can be useful, for example, for fitting a whole batch of features at once.
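For instance, a minimal sketch of such batch fitting (the random data here is purely illustrative):

import numpy as np

X = np.array([1, 2, 3])
Y = np.random.rand(3, 5)          # 5 features sharing the same X coordinates
coeffs = np.polyfit(X, Y, deg=1)  # shape (2, 5): one column of (a, b) per feature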
I want to solve the linear equations for n given points in n-dimensional space to get the equation of a hyperplane.
For example, in the two-dimensional case: Ax + By + C = 0.
How can I get one solution when a system of linear equations has infinitely many solutions?
I have tried scipy.linalg.solve(), but it requires the coefficient matrix A to be nonsingular.
I also tried sympy:
A = Matrix([[0, 0, 1], [1, 1, 1]])
b = Matrix([0, 0])
linsolve((A, b), [x, y, z])
It returned this:
{(-y, y, 0)}
I have to parse the result to determine which variable is free and then assign a number to it to get a solution.
Is there a more convenient way, since I only want to get one specific solution?
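One possible convenience (a minimal sketch building on the sympy attempt above; substituting 1 is an arbitrary choice) is to take the parametric solution returned by linsolve and substitute a number for every remaining free symbol:

from sympy import Matrix, linsolve, symbols

x, y, z = symbols('x y z')
A = Matrix([[0, 0, 1], [1, 1, 1]])
b = Matrix([0, 0])

sol, = linsolve((A, b), [x, y, z])                     # parametric solution, e.g. (-y, y, 0)
specific = sol.subs({s: 1 for s in sol.free_symbols})  # pick 1 for every free variable
print(specific)                                        # one concrete solution, here (-1, 1, 0)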