Scipy curve_fit fails for data with sine function

I'm trying to fit a curve through some data. The function I'm trying to fit is as follows:
def f(x, a, b, c):
    return a + b * x**c
When using scipy.optimize.curve_fit I do not get any results: it returns the (default) initial parameters:
(array([ 1., 1., 1.]),
array([[ inf, inf, inf],
[ inf, inf, inf],
[ inf, inf, inf]]))
I've tried reproducing the data, and found that a sine function was causing the problem (the data contains daily variation):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
xdata = np.random.rand(1000) + 0.002 * np.sin(np.arange(1000) / (1.5 * np.pi))
ydata = 0.1 + 23.4 * xdata**0.56 + np.random.normal(0, 2, 1000)

def f(x, a, b, c):
    return a + b * x**c

fit = curve_fit(f, xdata, ydata)
fig, ax = plt.subplots(1, 1)
ax.plot(xdata, ydata, 'k.', markersize=3)
ax.plot(np.arange(0, 1, .01), f(np.arange(0, 1, .01), *fit[0]))
fig.show()
I would obviously expect curve_fit to return something close to [0.1, 23.4, .56].
Note that the sine function does not really seem to affect the data ('xdata') in value, as the first term of xdata ranges between 0 and 1 and I'm adding something between -0.002 and +0.002, but it does cause the fitting procedure to fail. I found the value 0.002 to be close to the 'critical' value for failure; if it is smaller the procedure is less likely to fail, and vice versa. At 0.002 the procedure fails about as often as not.
I have tried solving this problem by shuffling the 'xdata' and 'ydata' simultaneously, to no effect. I thought (for no particular reason) that perhaps removing the autocorrelation of the data would solve the problem.
So my question is: how can I fix/bypass this problem? I can change the sine contribution in the synthetic data in the snippet above, but for my real data I obviously cannot.

You can eliminate the NaNs generated by negative x-values within the model function:
def f(x, a, b, c):
    y = a + b * x**c
    y[np.isnan(y)] = 0.0
    return y
Replacing all NaNs by 0 might not be the best choice. You could try neighbour values or do some kind of extrapolation.
If you feed in generated test data, you have to make sure that there are no NaNs in it either. So directly after data generation, put something like:
if xdata.min() < 0:
    print('expecting NaNs')
ydata[np.isnan(ydata)] = 0.0
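An alternative to zeroing the NaNs, sketched below under the assumption that the few negative x-values carry no signal you need, is to drop those points before fitting, so the model function never sees them:
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b, c):
    return a + b * x**c

# Synthetic data as in the question.
xdata = np.random.rand(1000) + 0.002 * np.sin(np.arange(1000) / (1.5 * np.pi))
ydata = 0.1 + 23.4 * xdata**0.56 + np.random.normal(0, 2, 1000)

# Keep only points with non-negative x, so x**c cannot produce NaN.
mask = xdata >= 0
popt, pcov = curve_fit(f, xdata[mask], ydata[mask])
print(popt)  # should land near [0.1, 23.4, 0.56]
Whether dropping points is acceptable depends on your data; for the synthetic example above it only removes the handful of points that the sine term pushed below zero.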

Related

Fitting data with an exponential law

I'd like to fit some data with an exponential function. I used scipy.optimize.curve_fit because I already used it for other fits. This time, there is an issue and I can't figure out what's wrong.
Here is what the data look like when plotted:
[figure: the data rise steeply from 0 and level off around 10.5]
As you can see, they seem to follow an exponential law.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
data = np.array([
0., 1.93468444, 3.69735865, 5.38185988, 6.02549022,
6.69199075, 7.72316694, 8.08913061, 8.84570241, 8.69711608,
8.80038144, 9.78951087, 9.68486674, 10.06175145, 10.44039495,
10.0481156 , 9.76656204, 9.88581457, 9.81805445, 10.42432252,
10.41102239, 11.2911395 , 9.64866184, 9.98072231, 10.83644694,
10.24748571, 10.81333209, 10.75949899, 10.90367328, 10.42446764,
10.51441017, 10.73047737, 10.8159758 , 10.51013538, 10.02862504,
9.76352112, 10.64829309, 10.6293347 , 10.67752596, 10.34801542,
10.53158576, 10.92883362, 10.67002314, 10.37015825, 10.74876349,
10.12821343, 10.8974205 , 10.1591103 , 10.588377 , 11.92134556,
10.309095 , 11.1174362 , 10.72654524, 10.60890374, 10.37456491,
10.05935346, 11.21295863, 11.09013951, 10.60862773, 11.2558922 ,
11.24660234, 10.35981557, 10.81284365, 10.96113067, 10.22716439,
9.8394873 , 10.01892084, 10.38237311, 10.04920671, 10.87782442,
10.42438756, 10.05614503, 10.5446946 , 9.99974368, 10.76930547,
10.22164072, 10.36942999, 10.89888302, 10.47035428, 10.58157374,
11.12615892, 11.30866718, 10.33215937, 10.46723351, 10.54072701,
11.45027197, 10.45895588, 10.34176601, 10.78405493, 10.43964778,
10.34047484, 10.25099046, 11.05847515, 10.27408195, 10.27529163,
10.16568845, 10.86451738, 10.73205291, 10.73300649, 10.49463959,
10.03729782
])
t = np.linspace(0, 100, len(data))  # time array

def expo(x, a, b, c):  # exponential function for fitting
    return a * np.exp(b * x) + c
fig1, ax1 = plt.subplots()
ax1.plot(t, data, ".", label="data")
coefs = curve_fit(expo, t, data)[0] # fitting
ax1.plot(t, expo(t, coefs[0], coefs[1], coefs[2]), "-", label="fit")
ax1.legend()
plt.show()
The problem is that curve_fit() returns very big or very small coefficients a, b and c, while it should return something more like a = -10.5, b = -0.2, c = 10.5.
The fitting process works by finding a local minimum of a loss function. If the problem is unconstrained, there may be several such local minima, each giving different values of the parameters, and you may get a different one than the one that you are expecting. If you have a guess what the parameters should be, you can provide it to narrow the search:
# with an initial guess for values of a, b, c
coefs = curve_fit(expo, t, data, p0=[-10, -1, 10])[0]
The coefficients it produces are:
array([-10.48815244, -0.2091102 , 10.56699883])
Alternatively, you can specify bounds for the parameters:
# with lower and upper bounds for a, b, c
coefs = curve_fit(expo, t, data, bounds=([-20, -2, 0], [-10, 2, 20]))[0]
This gives the same results as above.
Probably a non-linear regression algorithm is implemented in your software. "Guessed" initial values of the parameters are required to start the iterative process. If the user provides no initial values, the software evaluates some itself; that is often a cause of failure, because the computed initial values can be too far from the correct ones. Good initial values can be found using a linear regression method, which doesn't require initial values; see the sketch below.
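The linked document derives initial values from a regression on integral equations; that derivation is not reproduced here. A simpler linearization, sketched below under the assumption that the data approach a plateau c from below (so a < 0 and b < 0), also gives rough starting values: if y = a*exp(b*x) + c, then log(c - y) = log(-a) + b*x is linear in x. Reusing t, data and expo from the question:
import numpy as np
from scipy.optimize import curve_fit

c0 = data.max() + 0.5                 # plateau estimate, nudged above the noise
z = np.log(c0 - data)                 # linear in t if the model holds
b0, log_neg_a = np.polyfit(t, z, 1)   # slope ~ b, intercept ~ log(-a)
a0 = -np.exp(log_neg_a)
coefs = curve_fit(expo, t, data, p0=[a0, b0, c0])[0]
The estimates are rough because c0 is only a guess at the plateau, but they are close enough to start the non-linear fit.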
If the accuracy of the above result is not sufficient according to some specified fitting criteria, a non-linear regression is necessary. In this case the above values of the parameters a, b, c can be used as initial values for the iterative computation.
Note: The principle of the method which linearizes the non-linear regression as shown above is explained in: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
Here is what I tried: using a negative b in np.exp:
def expo(x, a, b, c):
    return a * np.exp(-b * x) + c
which gives [-10.4881516, 0.20911016, 10.5669989].

Why did my p-value equals 0 and statistic equals 1 when I use ks test in python?

Thanks to anyone who has a look.
My code is:
import numpy as np
from scipy.stats import kstest
data=[31001, 38502, 40842, 40852, 43007, 47228, 48320, 50500, 54545, 57437, 60126, 65556, 71215, 78460, 81299, 96851, 106472, 108398, 118495, 130832, 141678, 155703, 180689, 218032, 222238, 239553, 250895, 274025, 298231, 330228, 330910, 352058, 362993, 369690, 382487, 397270, 414179, 454013, 504993, 518475, 531767, 551032, 782483, 913658, 1432195, 1712510, 2726323, 2777535, 3996759, 13608152]
x=np.array(data)
test_sta=kstest(x, 'norm')
print(test_sta)
The result of kstest is KstestResult(statistic=1.0, pvalue=0.0). Is there anything wrong with the code, or is the data just not normal at all?
I've not used this before, but I think you're testing whether your data are standard normal (i.e. mean=0, variance=1).
Plotting a histogram shows the data to be much closer to log-normal. I'd therefore do:
x = np.log(data)
x -= np.mean(x)
x /= np.std(x)
kstest(x, 'norm')
which gives me a test statistic of 0.095 and a p-value of 0.75, so we cannot reject the hypothesis that the data are log-normally distributed.
A good way to check this sort of thing is to generate some random data (from a known distribution) and see what the test gives you back. For example:
kstest(np.random.normal(size=100), 'norm')
gives me p-values near 1, while:
kstest(np.random.normal(loc=13, size=100), 'norm')
gives me p-values near 0.
A log-normal distribution just means that the data are normally distributed after a log transform. If you really want to test against a normal distribution, just don't log-transform the data, e.g.:
x = np.array(data, dtype=float)
x -= np.mean(x)
x /= np.std(x)
kstest(x, 'norm')
which gives me a p-value of 7e-7, indicating that we can reliably reject the hypothesis that it's normally distributed.

`ValueError: A value in x_new is above the interpolation range.` - what reasons other than non-ascending values?

I receive this error from scipy's interp1d function. Normally, this error would be generated if x was not monotonically increasing.
import numpy as np
import scipy.interpolate as spi

def refine(coarsex, coarsey, step):
    finex = np.arange(min(coarsex), max(coarsex) + step, step)
    intfunc = spi.interp1d(coarsex, coarsey, axis=0)
    finey = intfunc(finex)
    return finex, finey

for num, tfile in enumerate(files):
    tfile = tfile.dropna(how='any')
    x = np.array(tfile['col1'])
    y = np.array(tfile['col2'])
    finex, finey = refine(x, y, 0.01)
The code is correct, because it successfully worked on 6 data files and threw the error for the 7th. So there must be something wrong with the data. But as far as I can tell, the data increase all the way down.
I am sorry for not providing an example, because I am not able to reproduce the error on an example.
There are two things that could help me:
1. Some brainstorming: if the data are indeed monotonically increasing, what else could produce this error? Another hint, regarding the decimals, could be in this question, but I think my solution (using the min and max of x) is robust enough to avoid it. Or isn't it?
2. Is it possible (and how?) to return the value of x_new and its index when the ValueError: A value in x_new is above the interpolation range is thrown, so that I could actually see where in the file the problem is?
UPDATE
So the problem is that, for some reason, max(finex) is larger than max(coarsex) (one ends in .x39 and the other in .x4). I hoped rounding the original values to 2 significant digits would solve the problem, but it didn't: the values display fewer digits, but the computation still uses the undisplayed ones. What can I do about it?
If you are running SciPy v0.17.0 or newer, then you can pass fill_value='extrapolate' to spi.interp1d, and it will extrapolate to accommodate those values of yours that lie outside the interpolation range. So define your interpolation function like so:
intfunc = spi.interp1d(coarsex, coarsey, axis=0, fill_value="extrapolate")
Be forewarned, however!
Depending on what your data look like and the type of interpolation you are performing, the extrapolated values can be erroneous. This is especially true if you have noisy or non-monotonic data. In your case you might be OK, because your x_new value is only slightly beyond your interpolation range.
Here's a simple demonstration of how this feature can work nicely but also give erroneous results.
import numpy as np
import matplotlib.pyplot as plt
import scipy.interpolate as spi

x = np.linspace(0, 1, 100)
y = x + np.random.randint(-1, 1, 100) / 100
x_new = np.linspace(0, 1.1, 100)
intfunc = spi.interp1d(x, y, fill_value="extrapolate")
y_interp = intfunc(x_new)

plt.plot(x_new, y_interp, 'r', label='interp/extrap')
plt.plot(x, y, 'b--', label='data')
plt.legend()
plt.show()
So the interpolated portion (in red) worked well, but the extrapolated portion clearly fails to follow the otherwise linear trend in this data because of the noise. So have some understanding of your data and proceed with caution.
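If blind extrapolation worries you, interp1d can instead hold the edge values flat. Passing a (below, above) tuple as fill_value requires SciPy 0.18 or newer, if I remember the version correctly; a sketch using the names from the question:
# Assumes coarsex is sorted ascending, so coarsey[0] and coarsey[-1] are the
# boundary values; points outside the range then get these values instead of
# an extrapolated trend.
intfunc = spi.interp1d(coarsex, coarsey, axis=0, bounds_error=False,
                       fill_value=(coarsey[0], coarsey[-1]))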
A quick test of your finex calc shows that it can (always?) get into the extrapolation region.
In [124]: coarsex = np.random.rand(100)
In [125]: max(coarsex)
Out[125]: 0.97393109991816473
In [126]: step = .01; finex = np.arange(min(coarsex), max(coarsex)+step, step); (max(finex), max(coarsex))
Out[126]: (0.98273730602114795, 0.97393109991816473)
In [127]: step = .001; finex = np.arange(min(coarsex), max(coarsex)+step, step); (max(finex), max(coarsex))
Out[127]: (0.97473730602114794, 0.97393109991816473)
Again, it is a quick test and may be missing some critical step or value.
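One way to avoid the overshoot entirely, as a sketch: clamp the fine grid back to the coarse range before calling the interpolator (or build the grid with np.linspace, which hits the endpoints exactly):
import numpy as np
import scipy.interpolate as spi

def refine(coarsex, coarsey, step):
    finex = np.arange(min(coarsex), max(coarsex) + step, step)
    # Drop any grid points that float rounding pushed past the data range,
    # so interp1d is never asked to extrapolate.
    finex = finex[finex <= max(coarsex)]
    intfunc = spi.interp1d(coarsex, coarsey, axis=0)
    return finex, intfunc(finex)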

Why does InterpolatedUnivariateSpline return nan values

I have some data, y vs x, which I would like to interpolate at a finer resolution xx using a cubic spline.
Here is my dataset:
import numpy as np
import scipy
print(np.version.version)     # prints: 1.9.2
print(scipy.version.version)  # prints: 0.15.1
x = np.array([0.5372973, 0.5382103, 0.5392305, 0.5402197, 0.5412042, 0.54221, 0.543209,
0.5442277, 0.5442277, 0.5452125, 0.546217, 0.5472153, 0.5482086,
0.5492241, 0.5502117, 0.5512249, 0.5522136, 0.5532056, 0.5532056,
0.5542281, 0.5552039, 0.5562125, 0.5567836])
y = np.array([0.01, 0.03108, 0.08981, 0.18362, 0.32167, 0.50941, 0.72415, 0.90698,
0.9071, 0.97955, 0.99802, 1., 0.97863, 0.9323, 0.85344, 0.72936,
0.56413, 0.36997, 0.36957, 0.17623, 0.05922, 0.0163, 0.01, ])
xx = np.array([0.5372981, 0.5374106, 0.5375231, 0.5376356, 0.5377481, 0.5378606,
0.5379731, 0.5380856, 0.5381981, 0.5383106, 0.5384231, 0.5385356,
0.5386481, 0.5387606, 0.5388731, 0.5389856, 0.5390981, 0.5392106,
0.5393231, 0.5394356, 0.5395481, 0.5396606, 0.5397731, 0.5398856,
0.5399981, 0.5401106, 0.5402231, 0.5403356, 0.5404481, 0.5405606,
0.5406731, 0.5407856, 0.5408981, 0.5410106, 0.5411231, 0.5412356,
0.5413481, 0.5414606, 0.5415731, 0.5416856, 0.5417981, 0.5419106,
0.5420231, 0.5421356, 0.5422481, 0.5423606, 0.5424731, 0.5425856,
0.5426981, 0.5428106, 0.5429231, 0.5430356, 0.5431481, 0.5432606,
0.5433731, 0.5434856, 0.5435981, 0.5437106, 0.5438231, 0.5439356,
0.5440481, 0.5441606, 0.5442731, 0.5443856, 0.5444981, 0.5446106,
0.5447231, 0.5448356, 0.5449481, 0.5450606, 0.5451731, 0.5452856,
0.5453981, 0.5455106, 0.5456231, 0.5457356, 0.5458481, 0.5459606,
0.5460731, 0.5461856, 0.5462981, 0.5464106, 0.5465231, 0.5466356,
0.5467481, 0.5468606, 0.5469731, 0.5470856, 0.5471981, 0.5473106,
0.5474231, 0.5475356, 0.5476481, 0.5477606, 0.5478731, 0.5479856,
0.5480981, 0.5482106, 0.5483231, 0.5484356, 0.5485481, 0.5486606,
0.5487731, 0.5488856, 0.5489981, 0.5491106, 0.5492231, 0.5493356,
0.5494481, 0.5495606, 0.5496731, 0.5497856, 0.5498981, 0.5500106,
0.5501231, 0.5502356, 0.5503481, 0.5504606, 0.5505731, 0.5506856,
0.5507981, 0.5509106, 0.5510231, 0.5511356, 0.5512481, 0.5513606,
0.5514731, 0.5515856, 0.5516981, 0.5518106, 0.5519231, 0.5520356,
0.5521481, 0.5522606, 0.5523731, 0.5524856, 0.5525981, 0.5527106,
0.5528231, 0.5529356, 0.5530481, 0.5531606, 0.5532731, 0.5533856,
0.5534981, 0.5536106, 0.5537231, 0.5538356, 0.5539481, 0.5540606,
0.5541731, 0.5542856, 0.5543981, 0.5545106, 0.5546231, 0.5547356,
0.5548481, 0.5549606, 0.5550731, 0.5551856, 0.5552981, 0.5554106,
0.5555231, 0.5556356, 0.5557481, 0.5558606, 0.5559731, 0.5560856,
0.5561981, 0.5563106, 0.5564231, 0.5565356, 0.5566481, 0.5567606])
I am trying to fit using scipy's InterpolatedUnivariateSpline method, with a 3rd-order spline (k=3) and extrapolation set to zeros (ext='zeros'):
import scipy.interpolate as interp
import matplotlib.pyplot as plt

yspline = interp.InterpolatedUnivariateSpline(x, y, k=3, ext='zeros')
yvals = yspline(xx)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y, 'ko', label='Values')
ax.plot(xx, yvals, 'b-.', lw=2, label='Spline')
plt.xlim([min(x), max(x)])
However, the resulting plot shows that the spline returns NaN values :(
Is there a reason? I am pretty sure my x values are all increasing, so I am stumped as to why this is happening. I have many other datasets I am fitting using this method, and it only fails on this specific set of data.
Any help is greatly appreciated.
Thank you for reading.
EDIT!
The solution was that I had duplicate x values, with differing y values!
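One way to sanitize such data, as a sketch, is to collapse the duplicate x values by averaging their y values before building the spline (InterpolatedUnivariateSpline requires strictly increasing x):
import numpy as np
import scipy.interpolate as interp

xu, inv = np.unique(x, return_inverse=True)          # unique, sorted x
yu = np.bincount(inv, weights=y) / np.bincount(inv)  # mean y per unique x
yspline = interp.InterpolatedUnivariateSpline(xu, yu, k=3, ext='zeros')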
For this interpolation, you should rather use scipy.interpolate.interp1d with the argument kind='cubic' (see a related SO question).
I have yet to find a use case where InterpolatedUnivariateSpline can be used in practice (or maybe I just don't understand its purpose). With your code, the interpolation works but shows extremely strong oscillations, making it unusable; that is typically the result I was getting with this interpolation method in the past. With a lower-order spline (e.g. k=1) it works better, but then you lose the advantage of cubic interpolation.
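A minimal sketch of that suggestion, assuming the duplicate x values flagged in the question's edit have already been removed:
from scipy.interpolate import interp1d

# kind='cubic' fits a cubic spline through the data; points in xx outside
# [x.min(), x.max()] are filled with 0 instead of raising an error.
f_cubic = interp1d(x, y, kind='cubic', bounds_error=False, fill_value=0.0)
yvals = f_cubic(xx)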
I've also encountered the problem of InterpolatedUnivariateSpline returning NaN values, but in my case the reason was not duplicates in the x array; the values in x were decreasing, while the docs state that values "must be increasing".
So, in such a case, instead of the original x and y one must supply them reversed: x[::-1] and y[::-1].
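A small guard along those lines, as a sketch:
import numpy as np

# InterpolatedUnivariateSpline wants increasing x; reverse both arrays
# if x is strictly decreasing.
if np.all(np.diff(x) < 0):
    x, y = x[::-1], y[::-1]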

scipy.optimize.curve_fit: Not able to do a curve fitting

I am still new to Python and I have a problem with curve fitting. The following program is a simplification of a bigger program that I created, but it reproduces the problem I have.
The problem is that I have a function, which I called burger, that I cannot fit a curve to. The line y = np.sqrt(y) is the problem: when I remove it, I can fit the curve perfectly, but that is not the function I want.
How can I fit this function with the y = np.sqrt(y) step included?
# -*- coding: utf-8 -*-
"""
Created on Wed Dec 11 22:14:54 2013
#author:
"""
import numpy as np
import matplotlib.pyplot as plt
from math import pi, tan
from scipy.optimize import curve_fit

#################### Burger function ###############################
def burger(t, E1, E2, N, tau):
    nu = 0.4      # Poisson's ratio
    P = 50        # peak force
    alpha = 70.3  # tip angle
    y = ((pi/2.) * P * (1. - nu**2.) / tan(alpha)) * (1./E1 + 1./E2 * (1. - np.exp(-t/tau)) + 1./(N * (1. - nu)) * t)
    y = np.sqrt(y)
    return y

####### usage example ##########
xlist = np.linspace(0, 1, 100)
ylist = [burger(t, 3, 2, 1, 0.1) for t in xlist]
pa, j = curve_fit(burger, xlist, ylist)
yfit = [burger(x, *pa) for x in xlist]
plt.figure()
plt.plot(xlist, ylist, marker='o')
plt.plot(xlist, yfit)
plt.show()
So, this probably won't be the best answer you get, but while you wait for others, here are some things to think about.
First, since you are new to Python, maybe you don't know this (or maybe there is a reason to do these things with list comprehensions), but I don't think you need the list comprehensions. You can use numpy's math operations to handle a whole array at a time. Instead of
y=((((pi/2.)*P*(1.-nu**2.))/(tan(alpha)))* ...
You can write
y = ((((np.pi/2.)*P*(1.-nu**2.))/(np.tan(alpha)))* ...
Then instead of
[ burger(t, 3., 2., 1., 0.1) for t in xlist]
you can do
burger(xlist, 3., 2., 1., 0.1)
This will be a lot faster when you are working with arrays.
Secondly, I looked through a couple of things that were happening in the algorithm, and it wasn't searching for your parameters in the right ranges. I looked up the algorithm it uses on the scipy.optimize page (here); Wikipedia says that convergence depends on the initial guess, and also that it finds a local, not global, minimum. (Sometimes your code hit negative parameter values, which made the sqrt of y undefined.) If you can give it a good initial guess, it should work ([1., 3., 3., 2] worked for me). My command that solved it was: pa, j = curve_fit(burger, xlist, ylist, [1., 3., 3., 2], maxfev=10000).
Thirdly, the first error I got when I ran your code was that it reached the maximum number of function evaluations. Add maxfev=10000 (or more if you need) as the last argument to curve_fit.
Check it out. If you can give your bigger problem an initial guess, then maybe you'll get it to converge. Otherwise, maybe a different algorithm would be more suitable?
Update: See this question for a more detailed explanation of why this works, but you can get it to work without a guess if you give it another kwarg, diag.
Use:
# Note: diag (passed through to leastsq) needs one positive scale factor per
# fit parameter; burger has four, so a length-4 sequence is required here.
pa, j = curve_fit(burger, xlist, ylist, diag=(1./xlist.mean(),) * 4, maxfev=10000)
