Monte-Carlo Fitting on python data - python

I have written a Monte-Carlo simulation to fit 49 data points with asymmetric error bars. Since the errors are asymmetric on both axes, I cannot simply use scipy.optimize.curve_fit. This is my basic approach:
1. Generate 1000 random numbers per data point from within the confidence level (error range), using a triangular probability distribution with maximum probability at the data point. This gives a list of dimensions [49*1000].
2. Transpose this list from [49*1000] to [1000*49], so that I have a data set of 1000 samples of 49 points, each within the error range.
3. Use scipy's curve_fit to fit these 1000 samples separately and find the free parameter in the function y=m*x+c (c is the free parameter; I already know the slope m).
4. Find the mean squared error of each of these 1000 samples using sklearn.metrics.mean_squared_error.
5. Find the index with the least mean squared error and use the popt value (c parameter) at this index to plot the fit.
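The steps above can be sketched in vectorized form. This is only an illustration on synthetic data: the values of x_mode, y_mode, the error widths and true_c are made up, and the intercept is computed in closed form as mean(y - m*x), which is the least-squares solution when the slope is fixed, instead of calling curve_fit once per sample.

```python
import numpy as np

rng = np.random.default_rng(42)

n_points, trials, m = 49, 1000, 1.65

# Synthetic stand-ins for the real data (assumptions, not the asker's values).
true_c = -10.0
x_mode = np.linspace(11.0, 13.0, n_points)   # most probable x values
y_mode = m * x_mode + true_c                 # most probable y values
x_err_low = np.full(n_points, 0.2)
x_err_up = np.full(n_points, 0.3)
y_err_low = np.full(n_points, 0.3)
y_err_up = np.full(n_points, 0.4)

# One draw per (trial, point); the result already has shape (trials, n_points),
# so no transpose step is needed.
x_samples = rng.triangular(x_mode - x_err_low, x_mode, x_mode + x_err_up,
                           size=(trials, n_points))
y_samples = rng.triangular(y_mode - y_err_low, y_mode, y_mode + y_err_up,
                           size=(trials, n_points))

# For a fixed slope m, the least-squares intercept of each sample is mean(y - m*x).
c_fits = (y_samples - m * x_samples).mean(axis=1)

# Mean squared error of every sample against its own fitted line.
mse = ((y_samples - (m * x_samples + c_fits[:, None])) ** 2).mean(axis=1)
best = int(np.argmin(mse))

print("intercept at lowest MSE:", c_fits[best])
print("mean and spread of all intercepts:", c_fits.mean(), c_fits.std())
```

Printing the mean and spread next to the lowest-MSE value makes it easy to see how much the chosen sample differs from the bulk of the fits.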
Working Code: This is my code:
trials=int(1e4)
#xdata: most probable x value
#ydata: most probable y value
#xerr6low: lower bound on x error
#xerr6up: upper bound on x error
#yerr6low: lower bound on y error
#yerr6up: upper bound on y error
#generating random number using triangular distribution weighted with highest probability at xdata/ydata:
xarray1=[0]*len(obs6xr)
yarray1=[0]*len(obs6xr)
for i in range(0,len(obs6xr)):
    xarray1[i]=np.random.triangular(xdata[i]-xerr6low[i],xdata[i],xdata[i]+xerr6up[i],trials)
    yarray1[i]=np.random.triangular(ydata[i]-yerr6low[i],ydata[i],ydata[i]+yerr6up[i],trials)
xarray=[list(x) for x in zip(*xarray1)]
yarray=[list(x) for x in zip(*yarray1)]
def func(x, c):
    #return (np.log10(a)+(b*(x-12))+np.log10(10**(8+c)))
    #return ((x/1e12)**a)*(b)*(10**(8+c))
    m = 1.65
    mx = [element * m for element in x]
    y = [j+c for j in mx]
    return y
#Fit for the parameters a, b, c of the function func:
popt=np.zeros(trials)
pcov=np.zeros(trials)
print(len(xarray),len(yarray),len(popt))
for i in tqdm(range (trials),desc='Optimizing'):
    popt[i], pcov[i] = curve_fit(func, xarray[i], np.array(yarray[i]))
MSE=np.zeros(trials)
for i in tqdm(range (trials),desc='Calculating MSE'):
    MSE[i]=mean_squared_error(np.array(yarray[i]),func(np.array(xarray[i]),popt[i]))
minimum_MSE=np.amin(MSE)
index_MSE=np.where(MSE == np.amin(MSE))
print('minimum MSE = ',minimum_MSE,'at index = ', index_MSE)
print(popt[index_MSE])
def MbhthShimasakupropc(x,c):
    Mbhth = (10**(c))*(x**1.65)
    return Mbhth
plt.figure(figsize=(5,5))
x=np.logspace(11,14,trials)
plt.loglog(obs6x,MbhthShimasakuprop(obs6x,popt[index_MSE]))
plt.xscale('log')
plt.yscale('log')
m=np.logspace(np.log10(4e10),np.log10(3e14),1000)
plt.errorbar(obs0x, obs0y, xerr=asymmetric_errorx0, yerr=asymmetric_errory0, fmt='o',color='black', markersize='2.5', ecolor='black',capsize=2, elinewidth=1)
plt.errorbar(obs6x, obs6y, xerr=asymmetric_errorx6, yerr=asymmetric_errory6, fmt='o',color='red', markersize='2.5', ecolor='red',capsize=2, elinewidth=1)
plt.loglog(m,MbhthShimasaku(m,0),color='black',linestyle=':',label='Shimasaku-Ferrarese z=0')
plt.xlim(4e10,3e14)
plt.ylim(1e6,1e11)
plt.legend(['Monte-Carlo Fitting','local relation','z=0','z~6'])
plt.show()
Result:
As we can see, this code works perfectly fine. But if I change the function to two parameters (a and b), func(x,a,b), and apply the same code with some small tweaks, the code fails miserably.
Not Working Code:
trials=int(1e3)
#generating random number:
xarray1=[0]*len(obs6xr)
yarray1=[0]*len(obs6xr)
for i in range(0,len(obs6xr)):
    xarray1[i]=np.random.triangular(xdata[i]-xerr6low[i],xdata[i],xdata[i]+xerr6up[i],trials)
    yarray1[i]=np.random.triangular(ydata[i]-yerr6low[i],ydata[i],ydata[i]+yerr6up[i],trials)
xarray=[list(x) for x in zip(*xarray1)]
yarray=[list(x) for x in zip(*yarray1)]
def func(x, a, b):
    y=a*x+b
    return y
#Fit for the parameters a, b, c of the function func:
popt=[0]*(trials)
pcov=[0]*(trials)
print(len(xarray),len(yarray),len(popt))
for i in tqdm(range (trials)):
    popt[i], pcov[i] = curve_fit(func, np.array(xarray[i]), np.array(yarray[i]))
MSE=np.zeros(trials)
for i in tqdm(range (trials)):
    MSE[i]=mean_squared_error(np.array(yarray[i]),func(np.array(xarray[i]),np.array(popt[i])[0],np.array(popt[i])[1]))
minimum_MSE=np.amin(MSE)
index_MSE=np.where(MSE == np.amin(MSE))
print('minimum MSE = ',minimum_MSE,'at index = ', index_MSE)
def MbhthShimasakupropmc(x,m,c):
    Mbhth = (10**(c))*(x**m)
    return Mbhth
plt.figure(figsize=(5,5))
x=np.logspace(11,14,trials)
plt.loglog(obs6x,MbhthShimasakupropmc(obs6x,*popt[[index_MSE][0][0][0]]))
plt.xscale('log')
plt.yscale('log')
m=np.logspace(np.log10(4e10),np.log10(3e14),1000)
plt.errorbar(obs0x, obs0y, xerr=asymmetric_errorx0, yerr=asymmetric_errory0, fmt='o',color='black', markersize='2.5', ecolor='black',capsize=2, elinewidth=1)
plt.errorbar(obs6x, obs6y, xerr=asymmetric_errorx6, yerr=asymmetric_errory6, fmt='o',color='red', markersize='2.5', ecolor='red',capsize=2, elinewidth=1)
plt.loglog(m,MbhthShimasaku(m,0),color='black',linestyle=':',label='Shimasaku-Ferrarese z=0')
plt.legend(['Monte-Carlo Fitting','local relation','z=0','z~6'])
plt.show()
Result:
I don't know what I am doing wrong. Can someone help me debug the issue?
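One detail that may simplify the indexing in the two-parameter version: np.where(MSE == np.amin(MSE)) returns a tuple of index arrays, whereas np.argmin returns a plain integer that can index a list of popt arrays directly. The sketch below uses made-up MSE and popt values purely to show the difference; it is not a diagnosis of the failing fit.

```python
import numpy as np

# Hypothetical values standing in for the per-sample results.
mse = np.array([0.31, 0.12, 0.47])
popt = [np.array([1.6, -9.8]),
        np.array([1.7, -11.0]),
        np.array([1.5, -8.9])]

# np.where gives a tuple holding one array of matching indices.
where_result = np.where(mse == mse.min())

# np.argmin gives the first matching index as a single integer.
best = int(np.argmin(mse))

slope, intercept = popt[best]
print(where_result, best, slope, intercept)
```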

Related

How to write a function to fit data to a sum of N Gaussian-like peaks without explicitly defining the expression for every possible N?

I am trying to fit a progression of Gaussian peaks to a spectral lineshape.
The progression is a summation of N evenly spaced Gaussian peaks. When coded as a function, the formula for N=1 looks like this:
A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)
where A, e0, hf, S and fwhm are to be determined from the fit with some good initial guesses.
Importantly, the parameter i starts at 0 and is incremented by 1 for every additional component.
So, for N = 3 the expression would take the form:
A * ((e0-0*hf)/e0)**3 * ((S**0)/np.math.factorial(0)) * np.exp(-4*np.log(2)*((x-e0+0*hf)/fwhm)**2) +
A * ((e0-1*hf)/e0)**3 * ((S**1)/np.math.factorial(1)) * np.exp(-4*np.log(2)*((x-e0+1*hf)/fwhm)**2) +
A * ((e0-2*hf)/e0)**3 * ((S**2)/np.math.factorial(2)) * np.exp(-4*np.log(2)*((x-e0+2*hf)/fwhm)**2)
All the parameters except i are constant for every component in the summation, and this is intended. i is changing in a controlled way depending on the number of parameters.
I am using curve_fit. One way to code the fitting routine would be to explicitly define the expression for any reasonable N and just use the appropriate one. Here, for example, it would be 5 or 6, depending on the spacing, which is determined by hf. I could just define a long function with N components, writing the appropriate i value into each component. I understand how to do that (and did). But I would like to code this more intelligently. My goal is to write a function that accepts any value of N, adds the appropriate number of components as described above, computes the expression while incrementing i properly, and returns the result.
I have attempted a variety of things. My main hurdle is that I don't know how to tell the program to use a particular N and the corresponding values of i. Finally, after some searching I thought I found a good way to code it with a lambda function.
from scipy.optimize import curve_fit
import numpy as np
def fullfunc(x,p,n):
    def func(x,A,e0,hf,S,fwhm,i):
        return A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)
    y_fit = np.zeros_like(x)
    for i in range(n):
        y_fit += func(x,p[0],p[1],p[2],p[3],p[4],i)
    return y_fit
p = [1,26000,1400,1,1000]
x = [27027,25062,23364,21881,20576,19417,18382,17452,16611,15847,15151]
y = [0.01,0.42,0.93,0.97,0.65,0.33,0.14,0.06,0.02,0.01,0.004]
n = 7
fittedParameters, pcov = curve_fit(lambda x,p: fullfunc(x,p,n), x, y, p)
A,e0,hf,S,fwhm = fittedParameters
This gives:
TypeError: <lambda>() takes 2 positional arguments but 7 were given
and I don't understand why. I have a feeling the lambda function can't deal with a list of initial parameters.
I would greatly appreciate any advice on how to make this work without explicitly writing all the equations out, as I find that a bit too rigid.
The x and y ranges provided are samples of real data which give a general idea of what the shape is.
Since you only use summation over a range i=0, 1, ..., n-1, there is no need to refer to complicated lambda constructs that may or may not work in the context of curve fit. Just define your fit function as the summation of n components:
import math
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
def func(x, A, e0, hf, S, fwhm):
    # math.factorial is used because the np.math alias is not available in newer NumPy versions
    return sum((A * ((e0-i*hf)/e0)**3 * ((S**i)/math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)) for i in range(n))
p = [1,26000,1400,1,1000]
x = [27027,25062,23364,21881,20576,19417,18382,17452,16611,15847,15151]
y = [0.01,0.42,0.93,0.97,0.65,0.33,0.14,0.06,0.02,0.01,0.004]
n = 7
fittedParameters, pcov = curve_fit(func, x, y, p0=p)
#A,e0,hf,S,fwhm = fittedParameters
print(fittedParameters)
plt.plot(x, y, "ro", label="data")
x_fit = np.linspace(min(x), max(x), 100)
y_fit = func(x_fit, *fittedParameters)
plt.plot(x_fit, y_fit, label="fit")
plt.legend()
plt.show()
Sample output:
P.S.: By the look of it, these data points are already well fitted with n=1.
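If n should be an explicit argument rather than a global, one option is a lambda with a variable-length parameter list. As far as I know, curve_fit calls the model as f(x, *params) and takes the number of parameters from p0 when p0 is supplied, so lambda x, *p works where lambda x, p does not. The sketch below is an assumption-laden illustration: progression and fit_progression are hypothetical names, and the data are synthetic, generated from known parameters.

```python
import math

import numpy as np
from scipy.optimize import curve_fit


def progression(x, A, e0, hf, S, fwhm, n):
    """Sum of n evenly spaced Gaussian components, i = 0 .. n-1."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for i in range(n):
        total += (A * ((e0 - i * hf) / e0) ** 3
                  * (S ** i) / math.factorial(i)
                  * np.exp(-4 * np.log(2) * ((x - e0 + i * hf) / fwhm) ** 2))
    return total


def fit_progression(x, y, p0, n):
    # *p collects the five fit parameters; n is fixed through the closure.
    return curve_fit(lambda x, *p: progression(x, *p, n), x, y, p0=p0)


# Synthetic demo data from known parameters (not the spectrum in the question).
true_params = (1.0, 26000.0, 1400.0, 1.0, 1500.0)
x_demo = np.linspace(15000, 28000, 60)
y_demo = progression(x_demo, *true_params, n=4)

popt_demo, pcov_demo = fit_progression(
    x_demo, y_demo, p0=(0.9, 25900.0, 1350.0, 1.1, 1400.0), n=4)
print(popt_demo)
```

Supplying p0 is essential here, because the number of parameters cannot be inferred from a signature that only contains *p.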

Python curvefit inside a for loop

I have a simple linear regression model whose coefficients I find using SciPy's curve_fit as follows:
import numpy as np
from scipy.optimize import curve_fit
def line(x,m,c): #linear fit function in order to get the slope
    return m*x + c
x = np.array([2005.38,2005.46,2017.39])
y = np.array([631137.78, 631137.88, 631138.12])
popt, pcov = curve_fit(line,x,y)
slope = popt[0]
intercept = popt[1]
perr = np.sqrt(np.diag(pcov))
slope_err = perr[0]
intercept_err = perr[1]
Then I performed a Monte Carlo simulation based on a prior and generated about 1000 similar y arrays, as shown below. My x array should stay the same, so the same x array is used for all MC-generated y arrays:
y = [np.array([631137.97960858, 631137.97958298, 631137.97544918]),
     np.array([631138.00349615, 631138.00462398, 631138.18676081]),
     np.array([631137.83121579, 631137.83457397, 631138.37689362]),
     np.array([631138.03276579, 631138.03322997, 631138.10819225]),
     np.array([631137.79168171, 631137.79288829, 631137.98774176])]
Now, I would like to perform the same calculation and obtain coefficients as I showed above, however, when I put them in a for loop it does not properly calculate coefficients.
nsims = 1000
y = []
slope_mc = []
int_mc = []
for i in range(nsims):
    m = models[i]
    y.append(m[:,0])
    popt, pcov = curve_fit(line,x,m[:,0])
    slope = popt[0]
    intercept = popt[1]
    slope_mc.append(slope)
    int_mc.append(intercept)
I have received an error stating
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
I have looked at similar solutions like this one, but they did not solve my problem. Also, is there an easier/faster way without using a for loop? I appreciate any help.
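For a straight-line model, one loop-free option is np.polyfit, which accepts a 2D y array with one data set per column and fits all of them against the same x in a single call. The sketch below uses synthetic y arrays (true_m, true_c and the noise level are made-up values, not the MC models from the question) and centers x before fitting, which generally improves numerical conditioning when the x values are large and closely spaced. It does not claim to explain the OptimizeWarning.

```python
import numpy as np

x = np.array([2005.38, 2005.46, 2017.39])

# Synthetic stand-in for the ~1000 simulated y arrays (hypothetical values).
rng = np.random.default_rng(0)
true_m, true_c = 0.02, 631097.7
nsims = 1000
y_sims = true_m * x + true_c + rng.normal(0.0, 0.05, size=(nsims, x.size))

# Center x so the design matrix is well conditioned.
x_mean = x.mean()

# polyfit expects shape (len(x), n_datasets), hence the transpose.
coeffs = np.polyfit(x - x_mean, y_sims.T, 1)   # shape (2, nsims)

slope_mc = coeffs[0]
# Convert the intercept back to the uncentered x axis.
int_mc = coeffs[1] - slope_mc * x_mean

print(slope_mc.mean(), slope_mc.std())
```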

given percentiles find distribution function python

From https://stackoverflow.com/a/30460089/2202107, we can generate CDF of a normal distribution:
import numpy as np
import matplotlib.pyplot as plt
N = 100
Z = np.random.normal(size = N)
# method 1
H,X1 = np.histogram( Z, bins = 10, density = True )  # density replaces the removed normed argument
dx = X1[1] - X1[0]
F1 = np.cumsum(H)*dx
#method 2
X2 = np.sort(Z)
F2 = np.array(range(N))/float(N)
# plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()
Question: How do we generate the "original" normal distribution, given only x (eg X2) and y (eg F2) coordinates?
My first thought was plt.plot(x,np.gradient(y)), but the gradient of y was constant (the data points are evenly spaced in y, but not in x). This kind of data often arises in percentile calculations. The key is to get the data evenly spaced in x rather than in y, using interpolation:
x=X2
y=F2
num_points=10
xinterp = np.linspace(-2,2,num_points)
yinterp = np.interp(xinterp, x, y)
# for normalizing that sum of all bars equals to 1.0
tot_val=1.0
normalization_factor = tot_val/np.trapz(np.ones(len(xinterp)),yinterp)
plt.bar(xinterp, normalization_factor * np.gradient(yinterp), width=0.2)
plt.show()
output looks good to me:
I put my approach here for examination. Let me know if my logic is flawed.
One issue is that when num_points is large, the plot looks bad. This is a discretization issue, and I am not sure how to avoid it.
Related posts:
I failed to understand why the answer was so complicated in https://stats.stackexchange.com/a/6065/131632
I also didn't understand why my approach was different than Generate distribution given percentile ranks
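A variant of the interpolation approach above is to pass the x grid to np.gradient, so that the result is dF/dx (a density estimate) rather than a difference per grid step that needs separate normalization. This is a minimal sketch on synthetic normal samples; the grid range and the number of grid points are arbitrary choices, and a coarser grid trades resolution for less noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical CDF of synthetic normal samples (x sorted, y evenly spaced).
samples = np.sort(rng.normal(size=5000))
ecdf = np.arange(samples.size) / samples.size

# Resample the CDF on a grid that is evenly spaced in x.
grid = np.linspace(-3, 3, 25)
cdf_on_grid = np.interp(grid, samples, ecdf)

# Differentiate with respect to x to estimate the density.
pdf_estimate = np.gradient(cdf_on_grid, grid)

dx = grid[1] - grid[0]
area = float(np.sum(pdf_estimate) * dx)
print("approximate area under the estimate:", area)
```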

np.polyfit won't plot a characteristic but gives values

I am using the code below to find the 1-norm of my error, and I have two issues. First, when I plot the error against the step size h, the error values are quite small, in the range of 10^-14 to 10^-16. Second, below you can see my attempt to apply np.polyfit to my graph; when run, it does not plot the fitted line but does output values. The value of p[0] is not exact, so I believe something is wrong, but it is "close" to the desired output of 3. Is this just a matter of wrong input or bad data?
def rk3(A,bvector,y0,interval,N):
    x0=interval[0]
    x_end=interval[1]
    x=np.linspace(x0,x_end,N+1)
    h=(x_end-x0)/N
    y=np.zeros((N+1,len(y0)))
    y[0, :] = y0
    for n in range(N):
        y_1=y[n,:]+h*(np.dot(A,y[n,:])+bvector(x[n]))
        y_2=(3/4)*y[n,:]+(1/4)*y_1+(1/4)*h*(np.dot(A,y_1)+bvector(x[n]+h))
        y[n+1,:]=(1/3)*y[n,:]+(2/3)*y_2+(2/3)*h*(np.dot(A,y_2)+bvector(x[n]+(1/2)*h))
    return x,y
err_vals = []
h_vals = []
for k in range(2,11): #for the range of N=40k, where k=1,...,10
    N=40*k
    x, y = rk3(A,bvector,y0,[0,0.1],N)
    yc = y[-1,:]
    h = (x[-1]-x[0])/N
    h_vals.append(h)
    yvals.append(yc)
    yn = y[:,1]
    abs_err = np.zeros(N)
    print("The value of y at k=",k," is ",yc)
    for j in range(1,N):
        y_exact=np.array([np.exp(-1000*x[j]), (1000/999)*(np.exp(-x[j])-np.exp(-1000*x[j]))])
        y_exact_2 = y_exact[1]
        abs_err[j] = np.abs((y[j, 1] - y_exact_2)/y_exact_2)
    Error = h*np.sum(abs_err[j])
    err_vals.append(Error)
p = np.polyfit(np.log(h_vals), np.log(err_vals), 1)
pyplot.loglog(h_vals,err_vals,"kx")
pyplot.xlabel("h")
pyplot.ylabel("Error")
pyplot.loglog(h,np.exp(p[1])*h**(p[0]), 'r--')
print("Best fit line slope ",format(p[0]))
My evolution of your code below gives a completely straight line with slope close to 3 for the integration over the interval [0,0.01].
For the given interval [0,0.1] the slope value is about 1/3 larger. The error profiles, that is, the absolute error divided by the expected global error power of the step size, gives a converging pattern, confirming the convergence of order 3 of the method.
The error bound 2e7*h^3 is rather large, showing why the combination of problem and method can become very problematic for larger step sizes.
The error is computed via the L1 norms of the function difference and exact solution,
Error = sum(abs((y-y_exact(x))[:,1]))/sum(abs(y[:,1]))
giving a mathematically sound quantity. The summation of the local relative errors can lead to distortions of the total error where the exact solution has a root or small values. But still, even using your computation method of integrating the local relative error leaving out the first data point which is zero,
Error = sum(abs((y[1:,1]/y_exact(x)[1:,1]-1)))*h
gives a similar linear plot, with the range shifted down to 1e-7..1e-9, the slope staying at 3.0293
Note that if you want to use the list h_vals in a computation like the one to plot the fitted line, you have to convert it into a numpy array first.
h=np.asarray(h_vals)
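As a small illustration of that conversion, here is a sketch with synthetic errors that follow exactly 2e7*h^3 (the bound mentioned above, used here only as made-up data). np.polyfit accepts the lists after np.log, but evaluating the fitted power law needs an array, because a Python list does not support the ** operator with a float exponent.

```python
import numpy as np

# Step sizes for N = 40*k on [0, 0.01], and synthetic third-order errors.
h_vals = [0.01 / (40 * k) for k in range(2, 11)]
err_vals = [2e7 * h ** 3 for h in h_vals]

# Straight-line fit in log-log space: log(err) = p[0]*log(h) + p[1].
p = np.polyfit(np.log(h_vals), np.log(err_vals), 1)

# h_vals ** p[0] would raise a TypeError for a list, so convert first.
h = np.asarray(h_vals)
fitted = np.exp(p[1]) * h ** p[0]

print("estimated order:", p[0])
```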
complete code
import numpy as np
import matplotlib.pyplot as plt
def rk3(A,bvector,y0,interval,N):
    r"""Solves an IVP y'=f(x, y(x)) on x \in [0, x_end] with y(0) = y0 using N points, using Runge-Kutta method."""
    x=np.linspace(*interval,N+1)
    h=x[1]-x[0]
    y=np.zeros((N+1,len(y0)))
    y[0, :] = y0
    for n in range(N):
        y_1=y[n]+h*(np.dot(A,y[n])+bvector(x[n]))
        y_2=(3/4)*y[n,:]+(1/4)*y_1+(1/4)*h*(np.dot(A,y_1)+bvector(x[n]+h))
        y[n+1]=(1/3)*y[n]+(2/3)*y_2+(2/3)*h*(np.dot(A,y_2)+bvector(x[n]+0.5*h))
    return x,y
A = np.array([[-1000.0,0.0],[1000.0,-1.0]]);
bvector = lambda x: 0
y_exact = lambda x: np.array([np.exp(-1000*x), (1000/999)*(np.exp(-x)-np.exp(-1000*x))]).T
y0 = y_exact(0)
plt.figure(figsize=(6,3));
h_vals, y_vals, err_vals = [],[],[]
for k in range(2,11): #for the range of N=40k, where k=1,...,10
    N=40*k
    x, y = rk3(A,bvector,y0,[0,0.01],N)
    yc = y[-1,:]
    h = x[1]-x[0];
    # error profile: absolute error scaled by the expected h**3 behaviour
    plt.plot(x,(y-y_exact(x))[:,1]/h**3)
    h_vals.append(h)
    y_vals.append(yc)
    yn = y[:,1]
    print("The value of y at k=",k," is ",yc)
    Error = sum(abs((y-y_exact(x))[:,1]))/sum(abs(y[:,1]))
    err_vals.append(Error)
plt.grid(); plt.show()
p = np.polyfit(np.log(h_vals), np.log(err_vals), 1)
plt.figure(figsize=(6,4))
plt.loglog(h_vals,err_vals,"kx")
h=np.asarray(h_vals)
plt.plot(h,np.exp(p[1])*h**(p[0]), '--r', lw=0.5)
plt.xlabel("h")
plt.ylabel("Error")
plt.grid(); plt.show()
print("Best fit line slope ",format(p[0]))

Nonlinear e^(-x) regression using scipy, python, numpy

The code below is giving me a flat line for the line of best fit rather than a nice curve along the model of e^(-x) that would fit the data. Can anyone show me how to fix the code below so that it fits my data?
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize
def _eNegX_(p,x):
    x0,y0,c,k=p
    y = (c * np.exp(-k*(x-x0))) + y0
    return y
def _eNegX_residuals(p,x,y):
    return y - _eNegX_(p,x)
def Get_eNegX_Coefficients(x,y):
    print('x is: ',x)
    print('y is: ',y)
    # Calculate p_guess for the vectors x,y. Note that p_guess is the
    # starting estimate for the minimization.
    p_guess=(np.median(x),np.min(y),np.max(y),.01)
    # Calls the leastsq() function, which calls the residuals function with an initial
    # guess for the parameters and with the x and y vectors. Note that the residuals
    # function also calls the _eNegX_ function. This will return the parameters p that
    # minimize the least squares error of the _eNegX_ function with respect to the original
    # x and y coordinate vectors that are sent to it.
    # (The old warning=True argument is dropped; current SciPy does not accept it.)
    p, cov, infodict, mesg, ier = scipy.optimize.leastsq(
        _eNegX_residuals,p_guess,args=(x,y),full_output=1)
    # Define the optimal values for each element of p that were returned by the leastsq() function.
    x0,y0,c,k=p
    print('''Reference data:\
x0 = {x0}
y0 = {y0}
c = {c}
k = {k}
'''.format(x0=x0,y0=y0,c=c,k=k))
    print('x.min() is: ',x.min())
    print('x.max() is: ',x.max())
    # Create a numpy array of x-values (np.linspace needs an integer count)
    numPoints = int(np.floor((x.max()-x.min())*100))
    xp = np.linspace(x.min(), x.max(), numPoints)
    print('numPoints is: ',numPoints)
    print('xp is: ',xp)
    print('p is: ',p)
    pxp=_eNegX_(p,xp)
    print('pxp is: ',pxp)
    # Plot the results
    plt.plot(x, y, '>', xp, pxp, 'g-')
    plt.xlabel('BPM%Rest')
    plt.ylabel('LVET/BPM',rotation='vertical')
    plt.xlim(0,3)
    plt.ylim(0,4)
    plt.grid(True)
    plt.show()
    return p
# Declare raw data for use in creating regression equation
x = np.array([1,1.425,1.736,2.178,2.518],dtype='float')
y = np.array([3.489,2.256,1.640,1.043,0.853],dtype='float')
p=Get_eNegX_Coefficients(x,y)
It looks like it's a problem with your initial guesses; something like (1, 1, 1, 1) works fine:
You have
p_guess=(np.median(x),np.min(y),np.max(y),.01)
for the function
def _eNegX_(p,x):
x0,y0,c,k=p
y = (c * np.exp(-k*(x-x0))) + y0
return y
So that's test_data_max * e^(-.01 * (x - test_data_median)) + test_data_min
I don't know much about the art of choosing good starting parameters, but I can say a few things. leastsq is finding a local minimum here - the key in choosing these values is to find the right mountain to climb, not to try to cut down on the work that the minimization algorithm has to do. Your initial guess looks like this (green):
(1.736, 0.85299999999999998, 3.4889999999999999, 0.01)
which results in your flat line (blue):
(-59.20295956, 1.8562 , 1.03477144, 0.69483784)
Greater gains were made in adjusting the height of the line than in increasing the k value. If you know you're fitting to this kind of data, use a larger k. If you don't know, I guess you could try to find a decent k value by sampling your data, or working back from the slope between an average of the first half and the second half, but I wouldn't know how to go about that.
Edit: You could also start with several guesses, run the minimization several times, and take the line with the lowest residuals.
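The multi-start idea from the edit can be sketched as follows, using the data from the question. The list of starting guesses is an arbitrary choice for illustration (the original guess with a few different k values, plus the (1, 1, 1, 1) guess mentioned above), and model and residuals are simply renamed copies of the question's functions.

```python
import numpy as np
from scipy.optimize import leastsq

x = np.array([1, 1.425, 1.736, 2.178, 2.518], dtype=float)
y = np.array([3.489, 2.256, 1.640, 1.043, 0.853], dtype=float)


def model(p, x):
    x0, y0, c, k = p
    return c * np.exp(-k * (x - x0)) + y0


def residuals(p, x, y):
    return y - model(p, x)


# Candidate starting points (illustrative, not tuned).
starts = [(np.median(x), np.min(y), np.max(y), k) for k in (0.01, 0.5, 1.0, 2.0)]
starts.append((1.0, 1.0, 1.0, 1.0))

best_p, best_ssr = None, np.inf
for guess in starts:
    p, ier = leastsq(residuals, guess, args=(x, y))
    ssr = float(np.sum(residuals(p, x, y) ** 2))
    # Keep the parameter set with the smallest sum of squared residuals.
    if np.isfinite(ssr) and ssr < best_ssr:
        best_p, best_ssr = p, ssr

print("best parameters:", best_p)
print("sum of squared residuals:", best_ssr)
```

Note that x0 and c are not independent in this model (shifting x0 only rescales c), so different starts can return different parameter values that describe the same curve; comparing residuals rather than parameters avoids being misled by that.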
