I have some data in two arrays which appears to have a break in it. I want my code to figure out where the break is, using np.piecewise together with scipy's curve_fit. Here is what I have:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.array([7228,7620,7730,7901,8139,8370,8448,8737,8824,9089,9233,9321,9509,9568,9642,9756,9915,10601,10942], dtype=float)
y = np.array([.874,.893,.8905,.8916,.9095,.9142,.9109,.9185,.9169,.9251,.9290,.9304,.9467,.9378,0.9464,0.9508,0.9583,0.9857,0.9975], dtype=float)

def piecewise_linear(x, x0, y0, k1, k2):
    return np.piecewise(x, [x < x0], [lambda x: k1*x + y0 - k1*x0, lambda x: k2*x + y0 - k2*x0])
p , e = optimize.curve_fit(piecewise_linear, x, y)
perr = np.sqrt(np.diag(e))
xd = np.linspace(7228, 11000, 3000)
plt.plot(x, y, "o")
plt.plot(xd, piecewise_linear(xd, *p))
My issue is that if I run this, I get the warning "OptimizeWarning: Covariance of the parameters could not be estimated". I'm not sure how to get around this. Is there maybe a way to feed initial parameters into this function to help it converge, or something similar?
Note, I do realize that the other way I could get this to work is by interpolating and finding the second derivative of my data. I've already done this, but because my data is not evenly spaced and the y-axis data has some error in it, I am interested in getting it to work this way as well for statistical purposes. So, to be clear, what I want here are the parameters for the two lines (slope/intercept) and the break point. (Ideally I would love to get errors on these too, but I'm not sure if that's possible with this method.) Thanks in advance!
The code works perfectly fine, only the initial values are causing problems.
By default curve_fit starts with all parameters set to 1. Thus, x0 starts way out of range of the x in your data and the optimizer cannot compute a sensible gradient.
This small modification will fix the issue:
# make sure initial x0 and y0 are in range of the data
p0 = [np.mean(x), np.mean(y), 1, 1]
p , e = optimize.curve_fit(piecewise_linear, x, y, p0) # set initial parameter estimates
perr = np.sqrt(np.diag(e))
xd = np.linspace(7228, 11000, 3000)
plt.plot(x, y, "o")
plt.plot(xd, piecewise_linear(xd, *p))
print(p) # [ 9.32099947e+03 9.32965835e-01 2.58225121e-05 4.05400820e-05]
print(np.diag(e)) # [ 4.56978067e+04 5.52060368e-05 3.88418404e-12 7.05010755e-12]
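Since you also asked for the slopes/intercepts of the two lines and the break point with uncertainties, here is a minimal sketch of how those could be read off the fit above (it reuses p and perr from the code; the intercept uncertainties would need full error propagation with the off-diagonal covariance terms, which is omitted here):
x0, y0, k1, k2 = p
x0_err, y0_err, k1_err, k2_err = perr
b1 = y0 - k1*x0  # intercept of the left line
b2 = y0 - k2*x0  # intercept of the right line
print("left line:  slope %.3e +/- %.1e, intercept %.4f" % (k1, k1_err, b1))
print("right line: slope %.3e +/- %.1e, intercept %.4f" % (k2, k2_err, b2))
print("break point: x0 = %.1f +/- %.1f" % (x0, x0_err))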
Your software probably uses an iterative method starting from an initial guess. Generally, the initial guess is the weakness of such methods.
If you want to overcome this kind of difficulty, use a non-iterative method which doesn't require an initial guess. If the fitting criterion of the non-iterative method is not convenient for you, nevertheless use the non-iterative method first to obtain a first solution, and then use a classical iterative method starting from the solution found first.
For example, the result below is obtained with the very simple algorithm (non-iterative, no initial guess) given on pp. 12-13 of the paper: https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
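If you prefer to stay in Python, here is a rough sketch of the same idea of avoiding a hand-picked guess (this is not the algorithm from the paper, just a brute-force alternative): try every interior data point as the break, fit each half with ordinary linear least squares, and hand the winner to curve_fit as p0.
import numpy as np

def guess_break(x, y):
    # keep the split with the smallest total squared residual of the two line fits
    best = None
    for i in range(2, len(x) - 2):
        k1, b1 = np.polyfit(x[:i], y[:i], 1)
        k2, b2 = np.polyfit(x[i:], y[i:], 1)
        resid = (np.sum((y[:i] - (k1*x[:i] + b1))**2)
                 + np.sum((y[i:] - (k2*x[i:] + b2))**2))
        if best is None or resid < best[0]:
            best = (resid, x[i], k1*x[i] + b1, k1, k2)
    _, x0, y0, k1, k2 = best
    return [x0, y0, k1, k2]

p0 = guess_break(x, y)  # then pass p0 to curve_fit as in the answer above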
What I have done so far:
I am trying to fit noisy data (which I generated myself by adding random noise to my function) to a Gauss-Hermite function that I have defined. It works well in some cases for lower values of h3 and h4, but every once in a while it will produce a really bad fit even for low h3, h4 values, and for higher h3, h4 values it always gives a bad fit.
My code:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as mpl
import matplotlib.pyplot as plt
# Let's define the Gauss-Hermite function
# a = amplitude, x0 = location of peak, sig = std dev, h3, h4 = Hermite coefficients
def gh_func(x, a, x0, sig, h3, h4):
    t = (x - x0)/sig
    return a*np.exp(-.5*t**2)*(1 + h3*(-np.sqrt(3)*t + (2/np.sqrt(3))*t**3)
                               + h4*(np.sqrt(6)/4 - np.sqrt(6)*t**2 + (np.sqrt(6)/3)*t**4))
#generate clean data
x = np.linspace(-10, 20, 100)
y = gh_func(x, 10, 5, np.sqrt(3), -0.10,-0.03) #it gives okay fit for h3=-0.10, h4=-0.03 but bad fits for higher values like h3=-0.4 and h4=-0.3.
#add noise to data
noise=np.random.normal(0,np.sqrt(0.5),size=len(x))
yn = y + noise
fig = mpl.figure(1)
ax = fig.add_subplot(111)
ax.plot(x, y, c='k', label='analytic function')
ax.scatter(x, yn, s=5, label='fake noisy data')
fig.savefig('model_and_noise_h3h4.png')
# Executing curve_fit on noisy data
popt, pcov = curve_fit(gh_func, x, yn)
#popt returns the best fit values for parameters of the given model (func)
print('Fitted Parameters (Gaus_Hermite):\na = %.10f , x0 = %.10f , sig = %.10f\nh3 = %.10f , h4 = %.10f' \
%(popt[0],popt[1],popt[2],popt[3],popt[4]))
ym = gh_func(x, popt[0], popt[1], popt[2], popt[3], popt[4])
ax.plot(x, ym, c='r', label='Best fit')
ax.legend()
fig.savefig('model_fit_h3h4.png')
plt.legend(loc='upper left')
plt.xlabel("v")
plt.ylabel("f(v)")
What I want to do:
I want to find a better fitting method than just curve_fit from scipy.optimize, but I am not sure what I can use. Even if we end up using curve_fit, I need a way to produce better fits by providing initial guesses for the parameters that are generated automatically; e.g. one approach for a single-peak Gaussian (Jean Jacquelin's method) is described in the accepted answer of this post: gaussian fitting inaccurate for lower peak width using Python. But this is just for mu, sigma and amplitude, not h3, h4.
Besides curve_fit from scipy.optimize, I think there's a package called lmfit: https://lmfit.github.io/lmfit-py/ but I am not sure how I would implement it in my code. I do not want to use manual initial guesses for the parameters. I want the fitting to find them automatically.
Thanks a lot!
Using lmfit for this fitting problem would be straightforward, by creating a lmfit.Model from your gh_func, with something like
from lmfit import Model
gh_model = Model(gh_func)
params = gh_model.make_params(x0=?, a=?, sig=?, h3=?, h4=?)
where those question marks would have to be initial values for the parameters in question.
Your use of scipy.optimize.curve_fit does not provide initial values for the variables. Unfortunately, this does not raise an error or give an indication of a problem, because the authors of scipy.optimize.curve_fit have lied to you by making initial values optional. Initial values for all parameters are always required for all non-linear least-squares analyses. It is not a choice of the implementation, it is a feature of the mathematical algorithm. What curve_fit hides from you is that leaving p0=None makes all initial values 1.0. Whether that is remotely appropriate depends on the model and data being fit - it cannot always be reasonable. As an example, if the x values for your data extended from 9000 to 9500, and the peak function was centered around 9200, starting with x0 of 1 would almost certainly not find a suitable fit. It is always deceptive to imply in any way that initial values are optional. They just are not. Not sometimes, not for well-defined problems. Initial values are never, ever optional.
Initial values don't have to be perfect, but they need to be of the right scale. Many people will warn you about "false minima" - getting stuck at a solution that is not bad but not "best". My experience is that the more common problem people run into is initial values that are so far off that the model is just not sensitive to small changes in their values, and so can never be optimized (with x0=1,sig=1 for data with x on [9000, 9500] and centered at 9200 being exactly in that category).
Your essential question of "providing initial guesses for the parameters which are generated automatically" is hard to answer without knowing the properties of your model function. You might be able to use some heuristics to guess values from a data set; lmfit has such heuristic "guess parameter values from data" functions for many peak-like functions. You probably have some sense of the physical (or other domain) meaning of h3 and h4 and know what kinds of values are reasonable, and can probably give better initial values than h3=h4=1. It might be that you want to start by guessing parameters as if the data were a simple Gaussian (say, using lmfit.models.guess_from_peak()) and then use the difference between that and your data to get the scale for h3 and h4.
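For illustration, a minimal sketch of that "guess as a plain Gaussian first" idea with lmfit (this assumes the gh_func, x and yn from the question, and converts lmfit's Gaussian amplitude, which is an area, to a peak height by hand):
import numpy as np
from lmfit import Model
from lmfit.models import GaussianModel

# step 1: treat the data as a plain Gaussian to get center/width/height estimates
gpars = GaussianModel().guess(yn, x=x)
height = gpars['amplitude'].value / (gpars['sigma'].value * np.sqrt(2*np.pi))

# step 2: use those as starting values for the full Gauss-Hermite model
gh_model = Model(gh_func)
params = gh_model.make_params(a=height, x0=gpars['center'].value,
                              sig=gpars['sigma'].value, h3=0.0, h4=0.0)
result = gh_model.fit(yn, params, x=x)
print(result.fit_report())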
I have solved a single second order differential equation with two boundary conditions using the module solve_bvp. However, now I am trying to solve the system of two second order differential equations;
U'' + a*B' = 0
B'' + b*U' = 0
with the boundary conditions U(+/-0.5) = +/-0.01 and B(+/-0.5) = 0. I have split this into a system of first-order ordinary differential equations and I am trying to use solve_bvp to solve them numerically. However, I am just getting arrays full of zeros for my solution. I believe I am implementing the boundary conditions incorrectly; it is not clear to me from the documentation how to handle more than two equations. My attempt is below.
import numpy as np
from scipy.integrate import solve_bvp
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.integrate import solve_bvp
alpha = 1E-8
zeta = 8E-3
C_k = 0.05
sigma = 0.01
def fun(x, y):
    return np.vstack((y[1], -((alpha)/(C_k*sigma))*y[2], y[2], -(1/(C_k*zeta))*y[1]))

def bc(ya, yb):
    return np.array([ya[0]+0.001, yb[0]-0.001, ya[0]-0, yb[0]-0])
x = np.linspace(-0.5, 0.5, 5000)
y = np.zeros((4, x.size))
print(y)
sol = solve_bvp(fun, bc, x, y)
print(sol)
In the equations above I have just relabeled the coefficients as a and b; they're just parameters that I input. I have the analytic solution for this set of equations, so I know a non-trivial solution exists. Any help would be greatly appreciated.
It is usually very helpful if you state at least once, in a comment or by assigning to specifically named variables, how you intend to compose the state vector.
By the form of the derivative return vector, I would think you intend
U, U', B, B'
which means that U=y[0], U'=y[1] and B=y[2],B'=y[3], so that your derivatives vector should correctly be
return y[1], -((alpha)/(C_k*sigma))*y[3], y[3], -(1/(C_k*zeta))*y[1]
and the boundary conditions
return ya[0]+0.001, yb[0]-0.001, ya[2]-0, yb[2]-0
In particular, your boundary conditions should make the algorithm fail in the first step because of a singular Jacobian; always check the .success field and the .message field of the solution structure.
Note that by default the absolute and relative tolerance of the experimental solve_bvp is 1e-3, and the number of mesh nodes is limited (max_nodes defaults to 1000).
Setting the initial node number to 50 (5000 is much too much, the solver refines the mesh where necessary) and the tolerance to 1e-6, I get solution plots that visibly satisfy the boundary conditions.
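Putting that together, here is a sketch of the corrected setup (same constants, the 0.001 boundary values from the question's code, a 50-node initial mesh and tol=1e-6):
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_bvp

alpha, zeta, C_k, sigma = 1E-8, 8E-3, 0.05, 0.01

def fun(x, y):
    # state vector: y[0] = U, y[1] = U', y[2] = B, y[3] = B'
    return np.vstack((y[1], -(alpha/(C_k*sigma))*y[3],
                      y[3], -(1/(C_k*zeta))*y[1]))

def bc(ya, yb):
    return np.array([ya[0] + 0.001, yb[0] - 0.001, ya[2], yb[2]])

x = np.linspace(-0.5, 0.5, 50)  # coarse initial mesh; the solver refines it
y = np.zeros((4, x.size))
sol = solve_bvp(fun, bc, x, y, tol=1e-6)
print(sol.status, sol.message)
plt.plot(sol.x, sol.y[0], label="U")
plt.plot(sol.x, sol.y[2], label="B")
plt.legend()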
For several hours now I have been trying to fit a model to a (generated) dataset as a test case for a problem I've been struggling with. I generated data points for the function f(x) = A*cos^n(x) + b and added some noise. When I try to fit the dataset with this function and curve_fit, I get the error
./tester.py:10: RuntimeWarning: invalid value encountered in power
return Amp*(np.cos(x))**n + b
/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py:690: OptimizeWarning: Covariance of the parameters could not be estimated category=OptimizeWarning)
The code I'm using to generate the datapoints and fit the model is the following:
#!/usr/bin/env python
from __future__ import print_function
import numpy as np
from scipy.optimize import curve_fit
from matplotlib.pyplot import figure, show, rc, plot
def f(x, Amp, n, b):
    return np.real(Amp*(np.cos(x))**n + b)
x = np.arange(0, 6.28, 0.01)
randomPart = np.random.rand(len(x))-0.5
fig = figure()
sample = f(x, 5, 2, 5)+randomPart
frame = fig.add_subplot(1,1,1)
frame.plot(x, sample, label="Sample measurements")
popt, pcov = curve_fit(f, x, sample, p0=(1,1,1))
modeldata = f(x, popt[0], popt[1], popt[2])
print(modeldata)
frame.plot(x, modeldata, label="Best fit")
frame.legend()
frame.set_xlabel("x")
frame.set_ylabel("y")
show()
The noisy data is shown - see the image below.
Do any of you have a clue what's going on? I suspect it has something to do with the power going into the complex domain, as the real part of the function is nowhere divergent. I have already tried returning only the real part of the function, setting realistic bounds in curve_fit, and using a numpy array instead of a python list for p0. I'm running the latest version of scipy available to me, scipy 0.17.0-1.
The problem is the following:
>>> (-2)**1.1
(-2.0386342710747223-0.6623924280875919j)
>>> np.array(-2)**1.1
__main__:1: RuntimeWarning: invalid value encountered in power
nan
Unlike native python floats, numpy doubles usually refuse to take part in operations leading to complex results:
>>> np.sqrt(-1)
__main__:1: RuntimeWarning: invalid value encountered in sqrt
nan
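Conversely, if the input already has a complex dtype, the same operations are well-defined; this is what the x.astype(np.complex128) trick further below relies on:
print(np.array(-2.0 + 0j)**1.1)       # (-2.0386342710747223-0.6623924280875919j), no warning
print(np.sqrt(np.array(-1.0 + 0j)))   # 1j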
As a quick workaround I suggest adding an np.abs call to your function, and using appropriate bounds for the fit to make sure this doesn't give a spurious result. If your model is near the truth and your sample (I mean the cosine in your sample) is positive, then adding an absolute value around it should be a no-op (update: I realize this is never the case; see the proper approach below).
def f(x, Amp, n, b):
    return Amp*(np.abs(np.cos(x)))**n + b  # only change here
With this small change I get this:
For reference, the parameters from the fit are (4.96482314, 2.03690954, 5.03709923), compared to the values (5, 2, 5) used for generation.
After giving it a bit more thought I realized that the cosine will always be negative for half your domain (duh). So the workaround I suggested might be a bit problematic, or at least its correctness is non-trivial. On the other hand, thinking of your original formula containing cos(x)^n, with negative values for cos(x) this only makes sense as a model if n is an integer, otherwise you get a complex result. Since we can't solve Diophantine fitting problems, we need to handle this properly.
The most proper way (by which I mean the way that is least likely to bias your data) is this: first do the fitting with a model that converts your data to complex numbers and then takes the complex magnitude of the output:
def f(x, Amp, n, b):
    return Amp*np.abs(np.cos(x.astype(np.complex128))**n) + b
This is obviously much less efficient than my workaround, since in each fitting step we create a new mesh, and do some extra work both in the form of complex arithmetic and an extra magnitude calculation. This gives me the following fit even with no bounds set:
The parameters are (5.02849409, 1.97655728, 4.96529108). These are close too. However, if we put these values back into the actual model (without np.abs), we get imaginary parts as large as -0.37, which is not overwhelming but significant.
So the second step should be redoing the fit with a proper model---one that has an integer exponent. Take the exponent 2 which is obvious from your fit, and do a new fit with this model. I don't believe any other approach gives you a mathematically sound result. You can also start from the original popt, hoping that it's indeed close to the truth. Of course we could use the original function with some currying, but it's much faster to use a dedicated double-specific version of your model.
from __future__ import print_function
import numpy as np
from scipy.optimize import curve_fit
from matplotlib.pyplot import subplots, show
def f_aux(x, Amp, n, b):
    return Amp*np.abs(np.cos(x.astype(np.complex128))**n) + b

def f_real(x, Amp, n, b):
    return Amp*np.cos(x)**n + b
x = np.arange(0, 2*np.pi, 0.01) # pi
randomPart = np.random.rand(len(x)) - 0.5
sample = f_real(x, 5, 2, 5) + randomPart
fig,(frame_aux,frame) = subplots(ncols=2)
for fr in frame_aux, frame:
    fr.plot(x, sample, label="Sample measurements")
    fr.legend()
    fr.set_xlabel("x")
    fr.set_ylabel("y")
# auxiliary fit for n value
popt_aux, pcov_aux = curve_fit(f_aux, x, sample, p0=(1,1,1))
modeldata = f_aux(x, *popt_aux)
#print(modeldata)
print('Auxiliary fit parameters: {}'.format(popt_aux))
frame_aux.plot(x, modeldata, label="Auxiliary fit")
# check visually / test that the fitted exponent is close to an integer, then round it
n = np.round(popt_aux[1])
# actual fit with integral exponent
popt, pcov = curve_fit(lambda x,Amp,b,n=n: f_real(x,Amp,n,b), x, sample, p0=(popt_aux[0],popt_aux[2]))
modeldata = f_real(x, popt[0], n, popt[1])
#print(modeldata)
print('Final fit parameters: {}'.format([popt[0],n,popt[1]]))
frame.plot(x, modeldata, label="Best fit")
frame_aux.legend()
frame.legend()
show()
Note that I changed a few things in your code which don't really affect my point. The figure from the above, i.e. the one showing both the auxiliary fit and the proper one:
The output:
Auxiliary fit parameters: [ 5.02628994 2.00886409 5.00652371]
Final fit parameters: [5.0288141074549699, 2.0, 5.0009730316739462]
Just to reiterate: while there might be no visual difference between the auxiliary fit and the proper one, only the latter gives a meaningful answer to your problem.
I am playing around with logistic regression in Python. I have implemented a version where the minimization of the cost function is done via gradient descent, and now I'd like to use the BFGS algorithm from scipy (scipy.optimize.fmin_bfgs).
I have a set of data (features in matrix X, with one sample in every row of X, and corresponding labels in a vertical vector y). I am trying to find parameters Theta to minimize:
I have trouble understanding how fmin_bfgs works exactly. As far as I get it, I have to pass a function to be minimized and a set of initial values for Thetas.
I do the following:
initial_values = numpy.zeros((len(X[0]), 1))
myargs = (X, y)
theta = scipy.optimize.fmin_bfgs(computeCost, x0=initial_values, args=myargs)
where computeCost calculates J(Thetas) as illustrated above. But I get some index-related errors, so I think I am not supplying what fmin_bfgs expects.
Can anyone shed some light on this?
After wasting hours on it, solved again by the power of posting... I was defining computeCost(X, y, Thetas), but since Thetas is the target parameter for the optimization, it should have been the first parameter in the signature. Fixed, and it works!
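For reference, a minimal sketch of what the corrected setup might look like (hypothetical names; it assumes the standard unregularized logistic-regression cost, with X an (m, n) feature array and y a flat 0/1 label vector):
import numpy as np
from scipy.optimize import fmin_bfgs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def computeCost(theta, X, y):
    # theta comes first because it is the quantity being optimized
    h = sigmoid(X.dot(theta))
    return -np.mean(y*np.log(h) + (1 - y)*np.log(1 - h))

initial_values = np.zeros(X.shape[1])  # 1-D initial guess
theta = fmin_bfgs(computeCost, x0=initial_values, args=(X, y))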
I don't know your whole code, but have you tried
initial_values = numpy.zeros(len(X[0]))
? The x0 argument should be a 1-D vector, I think.
The leastsq method in the scipy library fits a curve to some data. The method assumes that in this data the Y values depend on some X argument, and it minimizes the vertical distance between the curve and each data point (dy).
But what if I need to minimize the distance along both axes (dy and dx)? Is there some way to implement this calculation?
Here is a sample of code when using one axis calculation:
import numpy as np
from scipy.optimize import leastsq
xData = [some data...]
yData = [some data...]
def mFunc(p, x, y):
    return y - (p[0]*x**p[1])  # takes into account only the y axis

plsq, pcov = leastsq(mFunc, [1, 1], args=(xData, yData))
print(plsq)
I recently tried the scipy.odr library and it returns proper results only for linear functions. For other functions like y = a*x^b it returns wrong results. This is how I use it:
from scipy.odr import Model, Data, ODR

def f(p, x):
    return p[0]*x**p[1]

myModel = Model(f)
myData = Data(xData, yData)
myOdr = ODR(myData, myModel, beta0=[1, 1])
myOdr.set_job(fit_type=0)  # if fit_type=2 is set, it returns the same as leastsq
out = myOdr.run()
out.pprint()
This returns wrong results, not the desired ones, and for some input data they are not even close to the real values. Maybe there is some special way of using it; what am I doing wrong?
I've found the solution. Scipy's ODRPACK works normally, but it needs a good initial guess for correct results. So I divided the process into two steps.
First step: find the initial guess by using the ordinary least squares method.
Second step: substitute this initial guess into ODR as the beta0 parameter.
And it works very well with an acceptable speed.
Thank you guys, your advice directed me to the right solution.
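For reference, a rough sketch of that two-step approach (assuming the xData, yData arrays and the y = a*x^b model from the question):
import numpy as np
from scipy.optimize import leastsq
from scipy.odr import Model, Data, ODR

def f(p, x):
    return p[0]*x**p[1]

# step 1: ordinary least squares (vertical distances only) for a rough estimate
p_ols, _ = leastsq(lambda p, x, y: y - f(p, x), [1.0, 1.0], args=(xData, yData))

# step 2: orthogonal distance regression seeded with the OLS result
myOdr = ODR(Data(xData, yData), Model(f), beta0=p_ols)
myOdr.set_job(fit_type=0)  # 0 = explicit orthogonal distance regression
out = myOdr.run()
print(out.beta)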
scipy.odr implements the Orthogonal Distance Regression. See the instructions for basic use in the docstring and documentation.
If/when you are able to invert the function described by p, you may just include x - pinverted(y) in mFunc, combined, I guess, as sqrt(dx^2 + dy^2), so (pseudo-code):
return sqrt( (y - (p[0]*x**p[1]))**2 + (x - pinverted(y))**2 )
For example, for
y = k*x + m, p = [m, k]
pinv = [-m/k, 1/k]
return sqrt( (y - (p[0] + x*p[1]))**2 + (x - (pinv[0] + y*pinv[1]))**2 )
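A literal, runnable version of that pseudo-code for the linear case, written as a leastsq residual (assuming xData and yData as above; note that this combines the vertical and horizontal offsets rather than the true perpendicular distance, and k must be nonzero):
import numpy as np
from scipy.optimize import leastsq

def resid_both(p, x, y):
    m, k = p                # the line y = k*x + m
    pinv = [-m/k, 1.0/k]    # its inverse x = y/k - m/k
    return np.sqrt((y - (p[0] + x*p[1]))**2 + (x - (pinv[0] + y*pinv[1]))**2)

plsq, _ = leastsq(resid_both, [1.0, 1.0], args=(xData, yData))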
But what you ask for is in some cases problematic. For example, if a polynomial (or your x^j) curve has a minimum ym at some xm and you have a point (x, y) with y lower than ym, what kind of value do you want to return? There's not always a solution.
You can use the ONLS package in R.