SciPy Curve Fit Fails Power Law - python

So, I'm trying to fit a set of data with a power law of the following kind:
def f(x,N,a): # Power law fit
if a >0:
return N*x**(-a)
else:
return 10.**300
par,cov = scipy.optimize.curve_fit(f,data,time,array([10**(-7),1.2]))
where the else condition is just to force a to be positive. Using scipy.optimize.curve_fit yields an awful fit (green line), returning values of 1.2e+04 and 1.9e0-7 for N and a, respectively, with absolutely no intersection with the data. From fits I've put in manually, the values should land around 1e-07 and 1.2 for N and a, respectively, though putting those into curve_fit as initial parameters doesn't change the result. Removing the condition for a to be positive results in a worse fit, as it chooses a negative, which leads to a fit with the wrong sign slope.
I can't figure out how to get a believable, let alone reliable, fit out of this routine, but I can't find any other good Python curve fitting routines. Do I need to write my own least-squares algorithm or is there something I'm doing wrong here?

UPDATE
In the original post, I showed a solution that uses lmfit which allows to assign bounds to your parameters. Starting with version 0.17, scipy also allows to assign bounds to your parameters directly (see documentation). Please find this solution below after the EDIT which can hopefully serve as a minimal example on how to use scipy's curve_fit with parameter bounds.
Original post
As suggested by #Warren Weckesser, you could use lmfit to get this task done, which allows you to assign bounds to your parameters and avoids this 'ugly' if-clause.
Since you do not provide any data, I created some which are shown here:
They follow the law f(x) = 10.5 * x ** (-0.08)
I fit them - as suggested by #roadrunner66 - by transforming the power law in a linear function:
y = N * x ** a
ln(y) = ln(N * x ** a)
ln(y) = a * ln(x) + ln(N)
So I first use np.log on the original data and then do the fit. When I now use lmfit, I get the following output:
[[Variables]]
lN: 2.35450302 +/- 0.019531 (0.83%) (init= 1.704748)
a: -0.08035342 +/- 0.005158 (6.42%) (init=-0.5)
So a is pretty close to the original value and np.exp(2.35450302) gives 10.53 which is also very close to the original value.
The plot then looks as follows; as you can see the fit describes the data very well:
Here is the entire code with a couple of inline comments:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import minimize, Parameters, Parameter, report_fit
# generate some data with noise
xData = np.linspace(0.01, 100., 50.)
aOrg = 0.08
Norg = 10.5
yData = Norg * xData ** (-aOrg) + np.random.normal(0, 0.5, len(xData))
plt.plot(xData, yData, 'bo')
plt.show()
# transform data so that we can use a linear fit
lx = np.log(xData)
ly = np.log(yData)
plt.plot(lx, ly, 'bo')
plt.show()
def decay(params, x, data):
lN = params['lN'].value
a = params['a'].value
# our linear model
model = a * x + lN
return model - data # that's what you want to minimize
# create a set of Parameters
params = Parameters()
params.add('lN', value=np.log(5.5), min=0.01, max=100) # value is the initial value
params.add('a', value=-0.5, min=-1, max=-0.001) # min, max define parameter bounds
# do fit, here with leastsq model
result = minimize(decay, params, args=(lx, ly))
# write error report
report_fit(params)
# plot data
xnew = np.linspace(0., 100., 5000.)
# plot the data
plt.plot(xData, yData, 'bo')
plt.plot(xnew, np.exp(result.values['lN']) * xnew ** (result.values['a']), 'r')
plt.show()
EDIT
Assuming that you have scipy 0.17 installed, you can also do the following using curve_fit. I show it for your original definition of the power law (red line in the plot below) as well as for the logarithmic data (black line in the plot below). The data is generated in the same way as above. The plot the looks as follows:
As you can see, the data is described very well. If you print popt and popt_log, you obtain array([ 10.47463426, 0.07914812]) and array([ 2.35158653, -0.08045776]), respectively (note: for the letter one you will have to take the exponantial of the first argument - np.exp(popt_log[0]) = 10.502 which is close to the original data).
Here is the entire code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# generate some data with noise
xData = np.linspace(0.01, 100., 50)
aOrg = 0.08
Norg = 10.5
yData = Norg * xData ** (-aOrg) + np.random.normal(0, 0.5, len(xData))
# get logarithmic data
lx = np.log(xData)
ly = np.log(yData)
def f(x, N, a):
return N * x ** (-a)
def f_log(x, lN, a):
return a * x + lN
# optimize using the appropriate bounds
popt, pcov = curve_fit(f, xData, yData, bounds=(0, [30., 20.]))
popt_log, pcov_log = curve_fit(f_log, lx, ly, bounds=([0, -10], [30., 20.]))
xnew = np.linspace(0.01, 100., 5000)
# plot the data
plt.plot(xData, yData, 'bo')
plt.plot(xnew, f(xnew, *popt), 'r')
plt.plot(xnew, f(xnew, np.exp(popt_log[0]), -popt_log[1]), 'k')
plt.show()

Related

How can I fit a sine wave to my data in python? [duplicate]

I am trying to show that economies follow a relatively sinusoidal growth pattern. I am building a python simulation to show that even when we let some degree of randomness take hold, we can still produce something relatively sinusoidal.
I am happy with the data I'm producing, but now I'd like to find some way to get a sine graph that pretty closely matches the data. I know you can do polynomial fit, but can you do sine fit?
Here is a parameter-free fitting function fit_sin() that does not require manual guess of frequency:
import numpy, scipy.optimize
def fit_sin(tt, yy):
'''Fit sin to the input time sequence, and return fitting parameters "amp", "omega", "phase", "offset", "freq", "period" and "fitfunc"'''
tt = numpy.array(tt)
yy = numpy.array(yy)
ff = numpy.fft.fftfreq(len(tt), (tt[1]-tt[0])) # assume uniform spacing
Fyy = abs(numpy.fft.fft(yy))
guess_freq = abs(ff[numpy.argmax(Fyy[1:])+1]) # excluding the zero frequency "peak", which is related to offset
guess_amp = numpy.std(yy) * 2.**0.5
guess_offset = numpy.mean(yy)
guess = numpy.array([guess_amp, 2.*numpy.pi*guess_freq, 0., guess_offset])
def sinfunc(t, A, w, p, c): return A * numpy.sin(w*t + p) + c
popt, pcov = scipy.optimize.curve_fit(sinfunc, tt, yy, p0=guess)
A, w, p, c = popt
f = w/(2.*numpy.pi)
fitfunc = lambda t: A * numpy.sin(w*t + p) + c
return {"amp": A, "omega": w, "phase": p, "offset": c, "freq": f, "period": 1./f, "fitfunc": fitfunc, "maxcov": numpy.max(pcov), "rawres": (guess,popt,pcov)}
The initial frequency guess is given by the peak frequency in the frequency domain using FFT. The fitting result is almost perfect assuming there is only one dominant frequency (other than the zero frequency peak).
import pylab as plt
N, amp, omega, phase, offset, noise = 500, 1., 2., .5, 4., 3
#N, amp, omega, phase, offset, noise = 50, 1., .4, .5, 4., .2
#N, amp, omega, phase, offset, noise = 200, 1., 20, .5, 4., 1
tt = numpy.linspace(0, 10, N)
tt2 = numpy.linspace(0, 10, 10*N)
yy = amp*numpy.sin(omega*tt + phase) + offset
yynoise = yy + noise*(numpy.random.random(len(tt))-0.5)
res = fit_sin(tt, yynoise)
print( "Amplitude=%(amp)s, Angular freq.=%(omega)s, phase=%(phase)s, offset=%(offset)s, Max. Cov.=%(maxcov)s" % res )
plt.plot(tt, yy, "-k", label="y", linewidth=2)
plt.plot(tt, yynoise, "ok", label="y with noise")
plt.plot(tt2, res["fitfunc"](tt2), "r-", label="y fit curve", linewidth=2)
plt.legend(loc="best")
plt.show()
The result is good even with high noise:
Amplitude=1.00660540618, Angular freq.=2.03370472482, phase=0.360276844224, offset=3.95747467506, Max. Cov.=0.0122923578658
You can use the least-square optimization function in scipy to fit any arbitrary function to another. In case of fitting a sin function, the 3 parameters to fit are the offset ('a'), amplitude ('b') and the phase ('c').
As long as you provide a reasonable first guess of the parameters, the optimization should converge well.Fortunately for a sine function, first estimates of 2 of these are easy: the offset can be estimated by taking the mean of the data and the amplitude via the RMS (3*standard deviation/sqrt(2)).
Note: as a later edit, frequency fitting has also been added. This does not work very well (can lead to extremely poor fits). Thus, use at your discretion, my advise would be to not use frequency fitting unless frequency error is smaller than a few percent.
This leads to the following code:
import numpy as np
from scipy.optimize import leastsq
import pylab as plt
N = 1000 # number of data points
t = np.linspace(0, 4*np.pi, N)
f = 1.15247 # Optional!! Advised not to use
data = 3.0*np.sin(f*t+0.001) + 0.5 + np.random.randn(N) # create artificial data with noise
guess_mean = np.mean(data)
guess_std = 3*np.std(data)/(2**0.5)/(2**0.5)
guess_phase = 0
guess_freq = 1
guess_amp = 1
# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = guess_std*np.sin(t+guess_phase) + guess_mean
# Define the function to optimize, in this case, we want to minimize the difference
# between the actual data and our "guessed" parameters
optimize_func = lambda x: x[0]*np.sin(x[1]*t+x[2]) + x[3] - data
est_amp, est_freq, est_phase, est_mean = leastsq(optimize_func, [guess_amp, guess_freq, guess_phase, guess_mean])[0]
# recreate the fitted curve using the optimized parameters
data_fit = est_amp*np.sin(est_freq*t+est_phase) + est_mean
# recreate the fitted curve using the optimized parameters
fine_t = np.arange(0,max(t),0.1)
data_fit=est_amp*np.sin(est_freq*fine_t+est_phase)+est_mean
plt.plot(t, data, '.')
plt.plot(t, data_first_guess, label='first guess')
plt.plot(fine_t, data_fit, label='after fitting')
plt.legend()
plt.show()
Edit: I assumed that you know the number of periods in the sine-wave. If you don't, it's somewhat trickier to fit. You can try and guess the number of periods by manual plotting and try and optimize it as your 6th parameter.
More userfriendly to us is the function curvefit. Here an example:
import numpy as np
from scipy.optimize import curve_fit
import pylab as plt
N = 1000 # number of data points
t = np.linspace(0, 4*np.pi, N)
data = 3.0*np.sin(t+0.001) + 0.5 + np.random.randn(N) # create artificial data with noise
guess_freq = 1
guess_amplitude = 3*np.std(data)/(2**0.5)
guess_phase = 0
guess_offset = np.mean(data)
p0=[guess_freq, guess_amplitude,
guess_phase, guess_offset]
# create the function we want to fit
def my_sin(x, freq, amplitude, phase, offset):
return np.sin(x * freq + phase) * amplitude + offset
# now do the fit
fit = curve_fit(my_sin, t, data, p0=p0)
# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = my_sin(t, *p0)
# recreate the fitted curve using the optimized parameters
data_fit = my_sin(t, *fit[0])
plt.plot(data, '.')
plt.plot(data_fit, label='after fitting')
plt.plot(data_first_guess, label='first guess')
plt.legend()
plt.show()
The current methods to fit a sin curve to a given data set require a first guess of the parameters, followed by an interative process. This is a non-linear regression problem.
A different method consists in transforming the non-linear regression to a linear regression thanks to a convenient integral equation. Then, there is no need for initial guess and no need for iterative process : the fitting is directly obtained.
In case of the function y = a + r*sin(w*x+phi) or y=a+b*sin(w*x)+c*cos(w*x), see pages 35-36 of the paper "RĂ©gression sinusoidale" published on Scribd
In case of the function y = a + p*x + r*sin(w*x+phi) : pages 49-51 of the chapter "Mixed linear and sinusoidal regressions".
In case of more complicated functions, the general process is explained in the chapter "Generalized sinusoidal regression" pages 54-61, followed by a numerical example y = r*sin(w*x+phi)+(b/x)+c*ln(x), pages 62-63
All the above answers are based on curve fitting, and most use an iterative method - they all work very nicely, but I wanted to add a different approach using an FFT. Here, we transform the data, set all but the peak frequency to zero and then do the inverse transform. Note, that you probably want to remove the data mean (and detrend) before doing the FFT and then you can add those back in after.
import numpy as np
import pylab as plt
# fake data
N = 1000 # number of data points
t = np.linspace(0, 4*np.pi, N)
f = 1.05
data = 3.0*np.sin(f*t+0.001) + np.random.randn(N) # create artificial data with noise
# FFT...
mfft=np.fft.fft(data)
imax=np.argmax(np.absolute(mfft))
mask=np.zeros_like(mfft)
mask[[imax]]=1
mfft*=mask
fdata=np.fft.ifft(mfft)
plt.plot(t, data, '.')
plt.plot(t, fdata,'.', label='FFT')
plt.legend()
plt.show()

scipy curve_fit raises "OptimizeWarning: Covariance of the parameters could not be estimated"

I am trying to fit this function to some data:
But when I use my code
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
def f(x, start, end):
res = np.empty_like(x)
res[x < start] =-1
res[x > end] = 1
linear = np.all([[start <= x], [x <= end]], axis=0)[0]
res[linear] = np.linspace(-1., 1., num=np.sum(linear))
return res
if __name__ == '__main__':
xdata = np.linspace(0., 1000., 1000)
ydata = -np.ones(1000)
ydata[500:1000] = 1.
ydata = ydata + np.random.normal(0., 0.25, len(ydata))
popt, pcov = curve_fit(f, xdata, ydata, p0=[495., 505.])
print(popt, pcov)
plt.figure()
plt.plot(xdata, f(xdata, *popt), 'r-', label='fit')
plt.plot(xdata, ydata, 'b-', label='data')
plt.show()
I get the error
OptimizeWarning: Covariance of the parameters could not be estimated
Output:
In this example start and end should be closer to 500, but they dont change at all from my initial guess.
The warning (not error) of
OptimizeWarning: Covariance of the parameters could not be estimated
means that the fit could not determine the uncertainties (variance) of the fitting parameters.
The main problem is that your model function f treats the parameters start and end as discrete values -- they are used as integer locations for the change in functional form. scipy's curve_fit (and all other optimization routines in scipy.optimize) assume that parameters are continuous variables, not discrete.
The fitting procedure will try to take small steps (typically around machine precision) in the parameters to get a numerical derivative of the residual with respect to the variables (the Jacobian). With values used as discrete variables, these derivatives will be zero and the fitting procedure will not know how to change the values to improve the fit.
It looks like you're trying to fit a step function to some data. Allow me to recommend trying lmfit (https://lmfit.github.io/lmfit-py) which provides a higher-level interface to curve fitting, and has many built-in models. For example, it includes a StepModel that should be able to model your data.
For a slight modification of your data (so that it has a finite step), the following script with lmfit can fit such data:
#!/usr/bin/python
import numpy as np
from lmfit.models import StepModel, LinearModel
import matplotlib.pyplot as plt
np.random.seed(0)
xdata = np.linspace(0., 1000., 1000)
ydata = -np.ones(1000)
ydata[500:1000] = 1.
# note that a linear step is added here:
ydata[490:510] = -1 + np.arange(20)/10.0
ydata = ydata + np.random.normal(size=len(xdata), scale=0.1)
# model data as Step + Line
step_mod = StepModel(form='linear', prefix='step_')
line_mod = LinearModel(prefix='line_')
model = step_mod + line_mod
# make named parameters, giving initial values:
pars = model.make_params(line_intercept=ydata.min(),
line_slope=0,
step_center=xdata.mean(),
step_amplitude=ydata.std(),
step_sigma=2.0)
# fit data to this model with these parameters
out = model.fit(ydata, pars, x=xdata)
# print results
print(out.fit_report())
# plot data and best-fit
plt.plot(xdata, ydata, 'b')
plt.plot(xdata, out.best_fit, 'r-')
plt.show()
which prints out a report of
[[Model]]
(Model(step, prefix='step_', form='linear') + Model(linear, prefix='line_'))
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 49
# data points = 1000
# variables = 5
chi-square = 9.72660131
reduced chi-square = 0.00977548
Akaike info crit = -4622.89074
Bayesian info crit = -4598.35197
[[Variables]]
step_sigma: 20.6227793 +/- 0.77214167 (3.74%) (init = 2)
step_center: 490.167878 +/- 0.44804412 (0.09%) (init = 500)
step_amplitude: 1.98946656 +/- 0.01304854 (0.66%) (init = 0.996283)
line_intercept: -1.00628058 +/- 0.00706005 (0.70%) (init = -1.277259)
line_slope: 1.3947e-05 +/- 2.2340e-05 (160.18%) (init = 0)
[[Correlations]] (unreported correlations are < 0.100)
C(step_amplitude, line_slope) = -0.875
C(step_sigma, step_center) = -0.863
C(line_intercept, line_slope) = -0.774
C(step_amplitude, line_intercept) = 0.461
C(step_sigma, step_amplitude) = 0.170
C(step_sigma, line_slope) = -0.147
C(step_center, step_amplitude) = -0.146
C(step_center, line_slope) = 0.127
and produces a plot of
Lmfit has lots of extra features. For example, if you want to set bounds on some of the parameter values or fix some from varying, you can do the following:
# make named parameters, giving initial values:
pars = model.make_params(line_intercept=ydata.min(),
line_slope=0,
step_center=xdata.mean(),
step_amplitude=ydata.std(),
step_sigma=2.0)
# now set max and min values for step amplitude"
pars['step_amplitude'].min = 0
pars['step_amplitude'].max = 100
# fix the offset of the line to be -1.0
pars['line_offset'].value = -1.0
pars['line_offset'].vary = False
# then run fit with these parameters
out = model.fit(ydata, pars, x=xdata)
If you know the model should be Step+Constant and that the constant should be fixed, you could also modify the model to be
from lmfit.models import ConstantModel
# model data as Step + Constant
step_mod = StepModel(form='linear', prefix='step_')
const_mod = ConstantModel(prefix='const_')
model = step_mod + const_mod
pars = model.make_params(const_c=-1,
step_center=xdata.mean(),
step_amplitude=ydata.std(),
step_sigma=2.0)
pars['const_c'].vary = False

numpy polyfit passing through 0

Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
y_fit = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
Thanks
You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
# Curve fitting function
return a * x**3 + b * x**2 + c * x # d=0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
It does not answer the question in the sense that it uses numpy's polyfit function to pass through the origin, but it solves the problem.
Hope someone finds it useful :)
You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(X_, y)[0]
pp = np.polyfit(x, y, 3)
print np.isclose(pp, p_all).all()
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking 'that's nice, but I still haven't answered the question'. From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y)[0] # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x,y)=(0,0) you will bias your estimates of your output solution coefficients (p) because lstsq will be trying to compensate for that fact that there is an offset in your data. Sort of a 'square peg round hole' problem.
Furthermore, you could also fit your data to a cubic only by doing:
p_ = np.linalg.lstsq(X_[:1, :], y)[0]
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
Best of luck!

Fitting a Weibull distribution using Scipy

I am trying to recreate maximum likelihood distribution fitting, I can already do this in Matlab and R, but now I want to use scipy. In particular, I would like to estimate the Weibull distribution parameters for my data set.
I have tried this:
import scipy.stats as s
import numpy as np
import matplotlib.pyplot as plt
def weib(x,n,a):
return (a / n) * (x / n)**(a - 1) * np.exp(-(x / n)**a)
data = np.loadtxt("stack_data.csv")
(loc, scale) = s.exponweib.fit_loc_scale(data, 1, 1)
print loc, scale
x = np.linspace(data.min(), data.max(), 1000)
plt.plot(x, weib(x, loc, scale))
plt.hist(data, data.max(), density=True)
plt.show()
And get this:
(2.5827280639441961, 3.4955032285727947)
And a distribution that looks like this:
I have been using the exponweib after reading this http://www.johndcook.com/distributions_scipy.html. I have also tried the other Weibull functions in scipy (just in case!).
In Matlab (using the Distribution Fitting Tool - see screenshot) and in R (using both the MASS library function fitdistr and the GAMLSS package) I get a (loc) and b (scale) parameters more like 1.58463497 5.93030013. I believe all three methods use the maximum likelihood method for distribution fitting.
I have posted my data here if you would like to have a go! And for completeness I am using Python 2.7.5, Scipy 0.12.0, R 2.15.2 and Matlab 2012b.
Why am I getting a different result!?
My guess is that you want to estimate the shape parameter and the scale of the Weibull distribution while keeping the location fixed. Fixing loc assumes that the values of your data and of the distribution are positive with lower bound at zero.
floc=0 keeps the location fixed at zero, f0=1 keeps the first shape parameter of the exponential weibull fixed at one.
>>> stats.exponweib.fit(data, floc=0, f0=1)
[1, 1.8553346917584836, 0, 6.8820748596850905]
>>> stats.weibull_min.fit(data, floc=0)
[1.8553346917584836, 0, 6.8820748596850549]
The fit compared to the histogram looks ok, but not very good. The parameter estimates are a bit higher than the ones you mention are from R and matlab.
Update
The closest I can get to the plot that is now available is with unrestricted fit, but using starting values. The plot is still less peaked. Note values in fit that don't have an f in front are used as starting values.
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
>>> plt.plot(data, stats.exponweib.pdf(data, *stats.exponweib.fit(data, 1, 1, scale=02, loc=0)))
>>> _ = plt.hist(data, bins=np.linspace(0, 16, 33), normed=True, alpha=0.5);
>>> plt.show()
It is easy to verify which result is the true MLE, just need a simple function to calculate log likelihood:
>>> def wb2LL(p, x): #log-likelihood
return sum(log(stats.weibull_min.pdf(x, p[1], 0., p[0])))
>>> adata=loadtxt('/home/user/stack_data.csv')
>>> wb2LL(array([6.8820748596850905, 1.8553346917584836]), adata)
-8290.1227946678173
>>> wb2LL(array([5.93030013, 1.57463497]), adata)
-8410.3327470347667
The result from fit method of exponweib and R fitdistr (#Warren) is better and has higher log likelihood. It is more likely to be the true MLE. It is not surprising that the result from GAMLSS is different. It is a complete different statistic model: Generalized Additive Model.
Still not convinced? We can draw a 2D confidence limit plot around MLE, see Meeker and Escobar's book for detail).
Again this verifies that array([6.8820748596850905, 1.8553346917584836]) is the right answer as loglikelihood is lower that any other point in the parameter space. Note:
>>> log(array([6.8820748596850905, 1.8553346917584836]))
array([ 1.92892018, 0.61806511])
BTW1, MLE fit may not appears to fit the distribution histogram tightly. An easy way to think about MLE is that MLE is the parameter estimate most probable given the observed data. It doesn't need to visually fit the histogram well, that will be something minimizing mean square error.
BTW2, your data appears to be leptokurtic and left-skewed, which means Weibull distribution may not fit your data well. Try, e.g. Gompertz-Logistic, which improves log-likelihood by another about 100.
Cheers!
I know it's an old post, but I just faced a similar problem and this thread helped me solve it. Thought my solution might be helpful for others like me:
# Fit Weibull function, some explanation below
params = stats.exponweib.fit(data, floc=0, f0=1)
shape = params[1]
scale = params[3]
print 'shape:',shape
print 'scale:',scale
#### Plotting
# Histogram first
values,bins,hist = plt.hist(data,bins=51,range=(0,25),normed=True)
center = (bins[:-1] + bins[1:]) / 2.
# Using all params and the stats function
plt.plot(center,stats.exponweib.pdf(center,*params),lw=4,label='scipy')
# Using my own Weibull function as a check
def weibull(u,shape,scale):
'''Weibull distribution for wind speed u with shape parameter k and scale parameter A'''
return (shape / scale) * (u / scale)**(shape-1) * np.exp(-(u/scale)**shape)
plt.plot(center,weibull(center,shape,scale),label='Wind analysis',lw=2)
plt.legend()
Some extra info that helped me understand:
Scipy Weibull function can take four input parameters: (a,c),loc and scale.
You want to fix the loc and the first shape parameter (a), this is done with floc=0,f0=1. Fitting will then give you params c and scale, where c corresponds to the shape parameter of the two-parameter Weibull distribution (often used in wind data analysis) and scale corresponds to its scale factor.
From docs:
exponweib.pdf(x, a, c) =
a * c * (1-exp(-x**c))**(a-1) * exp(-x**c)*x**(c-1)
If a is 1, then
exponweib.pdf(x, a, c) =
c * (1-exp(-x**c))**(0) * exp(-x**c)*x**(c-1)
= c * (1) * exp(-x**c)*x**(c-1)
= c * x **(c-1) * exp(-x**c)
From this, the relation to the 'wind analysis' Weibull function should be more clear
I was curious about your question and, despite this is not an answer, it compares the Matlab result with your result and with the result using leastsq, which showed the best correlation with the given data:
The code is as follows:
import scipy.stats as s
import numpy as np
import matplotlib.pyplot as plt
import numpy.random as mtrand
from scipy.integrate import quad
from scipy.optimize import leastsq
## my distribution (Inverse Normal with shape parameter mu=1.0)
def weib(x,n,a):
return (a / n) * (x / n)**(a-1) * np.exp(-(x/n)**a)
def residuals(p,x,y):
integral = quad( weib, 0, 16, args=(p[0],p[1]) )[0]
penalization = abs(1.-integral)*100000
return y - weib(x, p[0],p[1]) + penalization
#
data = np.loadtxt("stack_data.csv")
x = np.linspace(data.min(), data.max(), 100)
n, bins, patches = plt.hist(data,bins=x, normed=True)
binsm = (bins[1:]+bins[:-1])/2
popt, pcov = leastsq(func=residuals, x0=(1.,1.), args=(binsm,n))
loc, scale = 1.58463497, 5.93030013
plt.plot(binsm,n)
plt.plot(x, weib(x, loc, scale),
label='weib matlab, loc=%1.3f, scale=%1.3f' % (loc, scale), lw=4.)
loc, scale = s.exponweib.fit_loc_scale(data, 1, 1)
plt.plot(x, weib(x, loc, scale),
label='weib stack, loc=%1.3f, scale=%1.3f' % (loc, scale), lw=4.)
plt.plot(x, weib(x,*popt),
label='weib leastsq, loc=%1.3f, scale=%1.3f' % tuple(popt), lw=4.)
plt.legend(loc='upper right')
plt.show()
I had the same problem, but found that setting loc=0 in exponweib.fit primed the pump for the optimization. That was all that was needed from #user333700's answer. I couldn't load your data -- your data link points to an image, not data. So I ran a test on my data instead:
import scipy.stats as ss
import matplotlib.pyplot as plt
import numpy as np
N=30
counts, bins = np.histogram(x, bins=N)
bin_width = bins[1]-bins[0]
total_count = float(sum(counts))
f, ax = plt.subplots(1, 1)
f.suptitle(query_uri)
ax.bar(bins[:-1]+bin_width/2., counts, align='center', width=.85*bin_width)
ax.grid('on')
def fit_pdf(x, name='lognorm', color='r'):
dist = getattr(ss, name) # params = shape, loc, scale
# dist = ss.gamma # 3 params
params = dist.fit(x, loc=0) # 1-day lag minimum for shipping
y = dist.pdf(bins, *params)*total_count*bin_width
sqerror_sum = np.log(sum(ci*(yi - ci)**2. for (ci, yi) in zip(counts, y)))
ax.plot(bins, y, color, lw=3, alpha=0.6, label='%s err=%3.2f' % (name, sqerror_sum))
return y
colors = ['r-', 'g-', 'r:', 'g:']
for name, color in zip(['exponweib', 't', 'gamma'], colors): # 'lognorm', 'erlang', 'chi2', 'weibull_min',
y = fit_pdf(x, name=name, color=color)
ax.legend(loc='best', frameon=False)
plt.show()
There have been a few answers to this already here and in other places. likt in Weibull distribution and the data in the same figure (with numpy and scipy)
It still took me a while to come up with a clean toy example so I though it would be useful to post.
from scipy import stats
import matplotlib.pyplot as plt
#input for pseudo data
N = 10000
Kappa_in = 1.8
Lambda_in = 10
a_in = 1
loc_in = 0
#Generate data from given input
data = stats.exponweib.rvs(a=a_in,c=Kappa_in, loc=loc_in, scale=Lambda_in, size = N)
#The a and loc are fixed in the fit since it is standard to assume they are known
a_out, Kappa_out, loc_out, Lambda_out = stats.exponweib.fit(data, f0=a_in,floc=loc_in)
#Plot
bins = range(51)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(bins, stats.exponweib.pdf(bins, a=a_out,c=Kappa_out,loc=loc_out,scale = Lambda_out))
ax.hist(data, bins = bins , density=True, alpha=0.5)
ax.annotate("Shape: $k = %.2f$ \n Scale: $\lambda = %.2f$"%(Kappa_out,Lambda_out), xy=(0.7, 0.85), xycoords=ax.transAxes)
plt.show()
In the meantime, there is really good package out there: reliability. Here is the documentation: reliability # readthedocs.
Your code simply becomes:
from reliability.Fitters import Fit_Weibull_2P
...
wb = Fit_Weibull_2P(failures=data)
plt.show()
Saves a lot of headaches and makes beautiful plots, too.
the order of loc and scale is messed up in the code:
plt.plot(x, weib(x, scale, loc))
the scale parameter should come first.

How do I fit a sine curve to my data with pylab and numpy?

I am trying to show that economies follow a relatively sinusoidal growth pattern. I am building a python simulation to show that even when we let some degree of randomness take hold, we can still produce something relatively sinusoidal.
I am happy with the data I'm producing, but now I'd like to find some way to get a sine graph that pretty closely matches the data. I know you can do polynomial fit, but can you do sine fit?
Here is a parameter-free fitting function fit_sin() that does not require manual guess of frequency:
import numpy, scipy.optimize
def fit_sin(tt, yy):
'''Fit sin to the input time sequence, and return fitting parameters "amp", "omega", "phase", "offset", "freq", "period" and "fitfunc"'''
tt = numpy.array(tt)
yy = numpy.array(yy)
ff = numpy.fft.fftfreq(len(tt), (tt[1]-tt[0])) # assume uniform spacing
Fyy = abs(numpy.fft.fft(yy))
guess_freq = abs(ff[numpy.argmax(Fyy[1:])+1]) # excluding the zero frequency "peak", which is related to offset
guess_amp = numpy.std(yy) * 2.**0.5
guess_offset = numpy.mean(yy)
guess = numpy.array([guess_amp, 2.*numpy.pi*guess_freq, 0., guess_offset])
def sinfunc(t, A, w, p, c): return A * numpy.sin(w*t + p) + c
popt, pcov = scipy.optimize.curve_fit(sinfunc, tt, yy, p0=guess)
A, w, p, c = popt
f = w/(2.*numpy.pi)
fitfunc = lambda t: A * numpy.sin(w*t + p) + c
return {"amp": A, "omega": w, "phase": p, "offset": c, "freq": f, "period": 1./f, "fitfunc": fitfunc, "maxcov": numpy.max(pcov), "rawres": (guess,popt,pcov)}
The initial frequency guess is given by the peak frequency in the frequency domain using FFT. The fitting result is almost perfect assuming there is only one dominant frequency (other than the zero frequency peak).
import pylab as plt
N, amp, omega, phase, offset, noise = 500, 1., 2., .5, 4., 3
#N, amp, omega, phase, offset, noise = 50, 1., .4, .5, 4., .2
#N, amp, omega, phase, offset, noise = 200, 1., 20, .5, 4., 1
tt = numpy.linspace(0, 10, N)
tt2 = numpy.linspace(0, 10, 10*N)
yy = amp*numpy.sin(omega*tt + phase) + offset
yynoise = yy + noise*(numpy.random.random(len(tt))-0.5)
res = fit_sin(tt, yynoise)
print( "Amplitude=%(amp)s, Angular freq.=%(omega)s, phase=%(phase)s, offset=%(offset)s, Max. Cov.=%(maxcov)s" % res )
plt.plot(tt, yy, "-k", label="y", linewidth=2)
plt.plot(tt, yynoise, "ok", label="y with noise")
plt.plot(tt2, res["fitfunc"](tt2), "r-", label="y fit curve", linewidth=2)
plt.legend(loc="best")
plt.show()
The result is good even with high noise:
Amplitude=1.00660540618, Angular freq.=2.03370472482, phase=0.360276844224, offset=3.95747467506, Max. Cov.=0.0122923578658
You can use the least-square optimization function in scipy to fit any arbitrary function to another. In case of fitting a sin function, the 3 parameters to fit are the offset ('a'), amplitude ('b') and the phase ('c').
As long as you provide a reasonable first guess of the parameters, the optimization should converge well.Fortunately for a sine function, first estimates of 2 of these are easy: the offset can be estimated by taking the mean of the data and the amplitude via the RMS (3*standard deviation/sqrt(2)).
Note: as a later edit, frequency fitting has also been added. This does not work very well (can lead to extremely poor fits). Thus, use at your discretion, my advise would be to not use frequency fitting unless frequency error is smaller than a few percent.
This leads to the following code:
import numpy as np
from scipy.optimize import leastsq
import pylab as plt
N = 1000 # number of data points
t = np.linspace(0, 4*np.pi, N)
f = 1.15247 # Optional!! Advised not to use
data = 3.0*np.sin(f*t+0.001) + 0.5 + np.random.randn(N) # create artificial data with noise
guess_mean = np.mean(data)
guess_std = 3*np.std(data)/(2**0.5)/(2**0.5)
guess_phase = 0
guess_freq = 1
guess_amp = 1
# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = guess_std*np.sin(t+guess_phase) + guess_mean
# Define the function to optimize, in this case, we want to minimize the difference
# between the actual data and our "guessed" parameters
optimize_func = lambda x: x[0]*np.sin(x[1]*t+x[2]) + x[3] - data
est_amp, est_freq, est_phase, est_mean = leastsq(optimize_func, [guess_amp, guess_freq, guess_phase, guess_mean])[0]
# recreate the fitted curve using the optimized parameters
data_fit = est_amp*np.sin(est_freq*t+est_phase) + est_mean
# recreate the fitted curve using the optimized parameters
fine_t = np.arange(0,max(t),0.1)
data_fit=est_amp*np.sin(est_freq*fine_t+est_phase)+est_mean
plt.plot(t, data, '.')
plt.plot(t, data_first_guess, label='first guess')
plt.plot(fine_t, data_fit, label='after fitting')
plt.legend()
plt.show()
Edit: I assumed that you know the number of periods in the sine-wave. If you don't, it's somewhat trickier to fit. You can try and guess the number of periods by manual plotting and try and optimize it as your 6th parameter.
More userfriendly to us is the function curvefit. Here an example:
import numpy as np
from scipy.optimize import curve_fit
import pylab as plt
N = 1000 # number of data points
t = np.linspace(0, 4*np.pi, N)
data = 3.0*np.sin(t+0.001) + 0.5 + np.random.randn(N) # create artificial data with noise
guess_freq = 1
guess_amplitude = 3*np.std(data)/(2**0.5)
guess_phase = 0
guess_offset = np.mean(data)
p0=[guess_freq, guess_amplitude,
guess_phase, guess_offset]
# create the function we want to fit
def my_sin(x, freq, amplitude, phase, offset):
return np.sin(x * freq + phase) * amplitude + offset
# now do the fit
fit = curve_fit(my_sin, t, data, p0=p0)
# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = my_sin(t, *p0)
# recreate the fitted curve using the optimized parameters
data_fit = my_sin(t, *fit[0])
plt.plot(data, '.')
plt.plot(data_fit, label='after fitting')
plt.plot(data_first_guess, label='first guess')
plt.legend()
plt.show()
The current methods to fit a sin curve to a given data set require a first guess of the parameters, followed by an interative process. This is a non-linear regression problem.
A different method consists in transforming the non-linear regression to a linear regression thanks to a convenient integral equation. Then, there is no need for initial guess and no need for iterative process : the fitting is directly obtained.
In case of the function y = a + r*sin(w*x+phi) or y=a+b*sin(w*x)+c*cos(w*x), see pages 35-36 of the paper "RĂ©gression sinusoidale" published on Scribd
In case of the function y = a + p*x + r*sin(w*x+phi) : pages 49-51 of the chapter "Mixed linear and sinusoidal regressions".
In case of more complicated functions, the general process is explained in the chapter "Generalized sinusoidal regression" pages 54-61, followed by a numerical example y = r*sin(w*x+phi)+(b/x)+c*ln(x), pages 62-63
All the above answers are based on curve fitting, and most use an iterative method - they all work very nicely, but I wanted to add a different approach using an FFT. Here, we transform the data, set all but the peak frequency to zero and then do the inverse transform. Note, that you probably want to remove the data mean (and detrend) before doing the FFT and then you can add those back in after.
import numpy as np
import pylab as plt
# fake data
N = 1000 # number of data points
t = np.linspace(0, 4*np.pi, N)
f = 1.05
data = 3.0*np.sin(f*t+0.001) + np.random.randn(N) # create artificial data with noise
# FFT...
mfft=np.fft.fft(data)
imax=np.argmax(np.absolute(mfft))
mask=np.zeros_like(mfft)
mask[[imax]]=1
mfft*=mask
fdata=np.fft.ifft(mfft)
plt.plot(t, data, '.')
plt.plot(t, fdata,'.', label='FFT')
plt.legend()
plt.show()

Categories