The code below is giving me a flat line for the line of best fit rather than a nice curve along the model of e^(-x) that would fit the data. Can anyone show me how to fix the code below so that it fits my data?
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize
def _eNegX_(p,x):
x0,y0,c,k=p
y = (c * np.exp(-k*(x-x0))) + y0
return y
def _eNegX_residuals(p,x,y):
return y - _eNegX_(p,x)
def Get_eNegX_Coefficients(x,y):
print 'x is: ',x
print 'y is: ',y
# Calculate p_guess for the vectors x,y. Note that p_guess is the
# starting estimate for the minimization.
p_guess=(np.median(x),np.min(y),np.max(y),.01)
# Calls the leastsq() function, which calls the residuals function with an initial
# guess for the parameters and with the x and y vectors. Note that the residuals
# function also calls the _eNegX_ function. This will return the parameters p that
# minimize the least squares error of the _eNegX_ function with respect to the original
# x and y coordinate vectors that are sent to it.
p, cov, infodict, mesg, ier = scipy.optimize.leastsq(
_eNegX_residuals,p_guess,args=(x,y),full_output=1,warning=True)
# Define the optimal values for each element of p that were returned by the leastsq() function.
x0,y0,c,k=p
print('''Reference data:\
x0 = {x0}
y0 = {y0}
c = {c}
k = {k}
'''.format(x0=x0,y0=y0,c=c,k=k))
print 'x.min() is: ',x.min()
print 'x.max() is: ',x.max()
# Create a numpy array of x-values
numPoints = np.floor((x.max()-x.min())*100)
xp = np.linspace(x.min(), x.max(), numPoints)
print 'numPoints is: ',numPoints
print 'xp is: ',xp
print 'p is: ',p
pxp=_eNegX_(p,xp)
print 'pxp is: ',pxp
# Plot the results
plt.plot(x, y, '>', xp, pxp, 'g-')
plt.xlabel('BPM%Rest')
plt.ylabel('LVET/BPM',rotation='vertical')
plt.xlim(0,3)
plt.ylim(0,4)
plt.grid(True)
plt.show()
return p
# Declare raw data for use in creating regression equation
x = np.array([1,1.425,1.736,2.178,2.518],dtype='float')
y = np.array([3.489,2.256,1.640,1.043,0.853],dtype='float')
p=Get_eNegX_Coefficients(x,y)
It looks like it's a problem with your initial guesses; something like (1, 1, 1, 1) works fine:
You have
p_guess=(np.median(x),np.min(y),np.max(y),.01)
for the function
def _eNegX_(p,x):
x0,y0,c,k=p
y = (c * np.exp(-k*(x-x0))) + y0
return y
So that's test_data_maxe^( -.01(x - test_data_median)) + test_data_min
I don't know much about the art of choosing good starting parameters, but I can say a few things. leastsq is finding a local minimum here - the key in choosing these values is to find the right mountain to climb, not to try to cut down on the work that the minimization algorithm has to do. Your initial guess looks like this (green):
(1.736, 0.85299999999999998, 3.4889999999999999, 0.01)
which results in your flat line (blue):
(-59.20295956, 1.8562 , 1.03477144, 0.69483784)
Greater gains were made in adjusting the height of the line than in increasing the k value. If you know you're fitting to this kind of data, use a larger k. If you don't know, I guess you could try to find a decent k value by sampling your data, or working back from the slope between an average of the first half and the second half, but I wouldn't know how to go about that.
Edit: You could also start with several guesses, run the minimization several times, and take the line with the lowest residuals.
Related
I have written a Monte-Carlo simulation to fit 49 data points with asymmetric error bars. Since the errors are asymmetric on both axis, I cannot simply use scipy.optimize.curve_fit module. This is my basic approach:
Generate a list of 1000 random numbers from within the confidence level (error range) using triangular probability distribution distribution with maximum probability at a certain data point. Now I have a list of dimensions [49*1000].
Convert this list from [491000] to [100049]. I did this so that I have a data set of 1000 samples of 49 points which are within the error range.
Use scipy's curve_fit to fit these 1000 samples seperately and find the free parameters in the function y=m*x+c (c is free parameter and I already know the slope m.)
Find mean squared error in each of these 1000 samples using sklearn.metrics.mean_squared_error module.
Find the index with least Mean squared and use the popt value (c parameter) at this index to plot the fit.
Working Code: This is my code:
trials=int(1e4)
#xdata: most probable x value
#ydata: most probable y value
#xerr6low: lower bound on x error
#xerr6up: upper bound on x error
#yerr6low: lower bound on y error
#yerr6up: upper bound on y error
#generating random number using triangular distribution weighted with highest probability at xdata/ydata:
xarray1=[0]*len(obs6xr)
yarray1=[0]*len(obs6xr)
for i in range(0,len(obs6xr)):
xarray1[i]=np.random.triangular(xdata[i]-xerr6low[i],xdata[i],xdata[i]+xerr6up[i],trials)
yarray1[i]=np.random.triangular(ydata[i]-yerr6low[i],ydata[i],ydata[i]+yerr6up[i],trials)
xarray=[list(x) for x in zip(*xarray1)]
yarray=[list(x) for x in zip(*yarray1)]
def func(x, c):
#return (np.log10(a)+(b*(x-12))+np.log10(10**(8+c)))
#return ((x/1e12)**a)*(b)*(10**(8+c))
m = 1.65
mx = [element * m for element in x]
y = [j+c for j in mx]
return y
#Fit for the parameters a, b, c of the function func:
popt=np.zeros(trials)
pcov=np.zeros(trials)
print(len(xarray),len(yarray),len(popt))
for i in tqdm(range (trials),desc='Optimizing'):
popt[i], pcov[i] = curve_fit(func, xarray[i], np.array(yarray[i]))
MSE=np.zeros(trials)
for i in tqdm(range (trials),desc='Calculating MSE'):
MSE[i]=mean_squared_error(np.array(yarray[i]),func(np.array(xarray[i]),popt[i]))
minimum_MSE=np.amin(MSE)
index_MSE=np.where(MSE == np.amin(MSE))
print('minimum MSE = ',minimum_MSE,'at index = ', index_MSE)
print(popt[index_MSE])
def MbhthShimasakupropc(x,c):
Mbhth = (10**(c))*(x**1.65)
return Mbhth
plt.figure(figsize=(5,5))
x=np.logspace(11,14,trials)
plt.loglog(obs6x,MbhthShimasakuprop(obs6x,popt[index_MSE]))
plt.xscale('log')
plt.yscale('log')
m=np.logspace(np.log10(4e10),np.log10(3e14),1000)
plt.errorbar(obs0x, obs0y, xerr=asymmetric_errorx0, yerr=asymmetric_errory0, fmt='o',color='black', markersize='2.5', ecolor='black',capsize=2, elinewidth=1)
plt.errorbar(obs6x, obs6y, xerr=asymmetric_errorx6, yerr=asymmetric_errory6, fmt='o',color='red', markersize='2.5', ecolor='red',capsize=2, elinewidth=1)
plt.loglog(m,MbhthShimasaku(m,0),color='black',linestyle=':',label='Shimasaku-Ferrarese z=0')
plt.xlim(4e10,3e14)
plt.ylim(1e6,1e11)
plt.legend(['Monte-Carlo Fitting','local relation','z=0','z~6'])
plt.show()
Result:
As we can see, this code works perfectly fine. But if I change function to 2 parameters (a&b),
func(x,a,b) and apply the same code with some small tweaks, the code fails miserably.
Not Working Code:
trials=int(1e3)
#generating random number:
xarray1=[0]*len(obs6xr)
yarray1=[0]*len(obs6xr)
for i in range(0,len(obs6xr)):
xarray1[i]=np.random.triangular(xdata[i]-xerr6low[i],xdata[i],xdata[i]+xerr6up[i],trials)
yarray1[i]=np.random.triangular(ydata[i]-yerr6low[i],ydata[i],ydata[i]+yerr6up[i],trials)
xarray=[list(x) for x in zip(*xarray1)]
yarray=[list(x) for x in zip(*yarray1)]
def func(x, a, b):
y=a*x+b
return y
#Fit for the parameters a, b, c of the function func:
popt=[0]*(trials)
pcov=[0]*(trials)
print(len(xarray),len(yarray),len(popt))
for i in tqdm(range (trials)):
popt[i], pcov[i] = curve_fit(func, np.array(xarray[i]), np.array(yarray[i]))
MSE=np.zeros(trials)
for i in tqdm(range (trials)):
MSE[i]=mean_squared_error(np.array(yarray[i]),func(np.array(xarray[i]),np.array(popt[i])[0],np.array(popt[i])[1]))
minimum_MSE=np.amin(MSE)
index_MSE=np.where(MSE == np.amin(MSE))
print('minimum MSE = ',minimum_MSE,'at index = ', index_MSE)
def MbhthShimasakupropmc(x,m,c):
Mbhth = (10**(c))*(x**m)
return Mbhth
plt.figure(figsize=(5,5))
x=np.logspace(11,14,trials)
plt.loglog(obs6x,MbhthShimasakupropmc(obs6x,*popt[[index_MSE][0][0][0]]))
plt.xscale('log')
plt.yscale('log')
m=np.logspace(np.log10(4e10),np.log10(3e14),1000)
plt.errorbar(obs0x, obs0y, xerr=asymmetric_errorx0, yerr=asymmetric_errory0, fmt='o',color='black', markersize='2.5', ecolor='black',capsize=2, elinewidth=1)
plt.errorbar(obs6x, obs6y, xerr=asymmetric_errorx6, yerr=asymmetric_errory6, fmt='o',color='red', markersize='2.5', ecolor='red',capsize=2, elinewidth=1)
plt.loglog(m,MbhthShimasaku(m,0),color='black',linestyle=':',label='Shimasaku-Ferrarese z=0')
plt.legend(['Monte-Carlo Fitting','local relation','z=0','z~6'])
plt.show()
Result:
I don't know what I am doing wrong. Can someone help me debug the issue.
I am trying to plot several datasets for repeat R-T measurements and fit a cubic root line of best fit through each dataset using scipy.optimize.curve_fit.
My code produces a line for each dataset, but not a cubic root line of best fit. Each dataset is colour-coded to its corresponding line of best fit:
I've tried increasing the order of magnitude of my data, as I heard that sometimes scipy.optimize.curve_fit doesn't like very small numbers, but this made no change. If anyone could point out where I am going wrong I would be extremely grateful:
import numpy as np
from scipy.optimize import curve_fit
import scipy.optimize as scpo
import matplotlib.pyplot as plt
files = [ '50mA30%set1.lvm','50mA30%set3.lvm', '50mA30%set4.lvm',
'50mA30%set5.lvm']
for file in files:
data = numpy.loadtxt(file)
current_YBCO = data[:,1]
voltage_YBCO = data[:,2]
current_thermometer = data[:,3]
voltage_thermometer = data[:,4]
T = data[:,5]
R = voltage_thermometer/current_thermometer
p = np.polyfit(R, T, 4)
T_fit = p[0]*R**4 + p[1]*R**3 + p[2]*R**2 + p[3]*R + p[4]
y = voltage_YBCO/current_YBCO
def test(T_fit, a, b, c):
return a * (T_fit+b)**(1/3) + c
param, param_cov = curve_fit(test, np.array(T_fit), np.array(y),
maxfev=100000)
ans = (param[0]*(np.array(T_fit)+param[1])**(1/3)+param[2])
plt.scatter(T_fit,y, 0.5)
plt.plot(T_fit, ans, '--', label ="optimized data")
plt.xlabel("YBCO temperature(K)")
plt.ylabel("Resistance of YBCO(Ohms)")
plt.xlim(97, 102)
plt.ylim(-.00025, 0.00015)
Two things are making this harder for you.
First, cube roots of negative numbers for numpy arrays. If you try this you'll see that you aren't getting the result you want:
x = np.array([-8, 0, 8])
x**(1/3) # array([nan, 0., 2.])
This means that your test function is going to have a problem any time it gets a negative value, and you need the negative values to create the left hand side of the curves. Instead, use np.cbrt
x = np.array([-8, 0, 8])
np.cbrt(x) # array([-2., 0., 2.])
Secondly, your function is
def test(T_fit, a, b, c):
return a * (T_fit + b)**(1/3) + c
Unfortunately, this just doesn't look very much like the graph you show. This makes it really hard for the optimisation to find a "good" fit. Things I particularly dislike about this function are
it goes vertical at T_fit == b. Your data has a definite slope at this point
it keeps growing quite strongly away from T_fit = b. Your data goes horizontal.
However, it is sometimes possible to get a more "sensible fit" by giving the optimisation a good starting point.
You haven't given us any code to work from, which makes this much harder. So, by way of illustration, try this:
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize
fig, ax = plt.subplots(1)
# Generate some data which looks a bit like yours
x = np.linspace(95, 105, 1001)
y = 0.001 * (-0.5 + 1/(1 + np.exp((100-x)/0.5)) + 0.125 * np.random.rand(len(x)))
# A fitting function
def fit(x, a, b, c):
return a * np.cbrt((x + b)) + c
# Perform the fitting
param, param_cov = scipy.optimize.curve_fit(fit, x, y, p0=[0.004, -100, 0.001], maxfev=100000)
# Calculate the fitted data:
yf = fit(x, *param)
print(param)
# Plot data and the fitted curve
ax.plot(x, y, '.')
ax.plot(x, yf, '-')
Now, if I run this code I do get a fit which roughly follows the data. However, if I take the initial guess out, i.e. do the fitting by calling
param, param_cov = scipy.optimize.curve_fit(fit, x, y, maxfev=100000)
then the fit is much worse. The reason why is that curve_fit will start from an initial guess of [1, 1, 1]. The solution which looks approximately right lies in a different valley to [1, 1, 1] and therefore it isn't the solution which is found. Said another way, it only finds the local minimum, not the global.
I have heard that the Cauchy integration formula can be used to interpolate complex-valued functions along a closed boundary of a disk to points inside the disk. For my current project, this sounds rather valuable, so I attempted to give this a shot. Unfortunately, my experiments were not very successful so far, and I am not certain what is going wrong. Some degree of interpolation is certainly going on, but the results do not seem to be correct along the boundaries. Here is what my code returns:
Here is my initial code example:
import scipy.stats
import numpy as np
import scipy.integrate
import scipy.interpolate
import matplotlib.pyplot as plt
plt.close('all')
# This is the interpolation function, which takes as input a position on the
# boundary in radians (x), a complex evaluation point (eval_point), and the
# function which returns the boundary condition
def f(x,eval_point,itp):
# What is the complex coordinate of this point on the boundary?
zi = np.cos(x) + 1j*np.sin(x)
# Get the boundary condition value
fz = itp(x)
return fz/(zi-eval_point)
# Complex quadrature for integration, adapted from
# https://stackoverflow.com/questions/57325919/using-scipy-quad-with-i%ce%b5-trick-bad-results
def cquad(func, a, b, **kwargs):
real_integral = scipy.integrate.quad(lambda x: np.real(func(x, **kwargs)), a, b, limit=200)
imag_integral = scipy.integrate.quad(lambda x: np.imag(func(x, **kwargs)), a, b, limit=200)
return (real_integral[0] + 1j*imag_integral[0], real_integral[1:], imag_integral[1:])
# Define the interpolation function for the boundary values
itp = scipy.interpolate.interp1d(
x = [0,np.pi/2,np.pi,1.5*np.pi,2*np.pi],
y = [0+0j,0+1j,1+1j,1+0j,0+0j])
# Get some evaluation points
X,Y = np.meshgrid(np.linspace(-1,1,51),
np.linspace(-1,1,51))
XY = X+1j*Y
x = np.ndarray.flatten(XY)
# Throw away all points outside the unit disk; avoid evaluting at radius 1 to
# dodge singularities
x = x[np.where(np.abs(x) <= 0.99)]
# Calculate the result for each evaluation point
res = []
for val in x:
res.append(cquad(
func = f,
a = 0,
b = 2*np.pi,
eval_point = val,
itp = itp)[0]/(2*np.pi*1j))
# Convert the results into an array
res = np.asarray(res)
# Plot the real part of the results
plt.tricontour(
np.real(x),
np.imag(x),
np.real(res),
cmap = 'jet')
plt.colorbar(label='real part')
# Plot the imaginary part of the results
plt.tricontour(
np.real(x),
np.imag(x),
np.imag(res),
cmap = 'Greys')
plt.colorbar(label='imaginary part')
Does anybody have an idea what is going wrong?
You can get an easy approximation of that function by employing the FFT. The inverse FFT can be interpreted as polynomial evaluation at the corresponding points on the unit circle, so that the polynomial in total is an approximation of the Cauchy-formula
c = np.fft.fft(itp(np.linspace(0,2*np.pi,401)[:-1]))
c=c[::-1]/len(c)
np.polyval(c,[1,1j,-1,-1j])
returns
[5.55111512e-17+5.55111512e-17j, 5.55111512e-17+1.00000000e+00j,
1.00000000e+00+1.00000000e+00j, 1.00000000e+00+5.55111512e-17j]
these are the values that were expected.
X,Y = np.meshgrid(np.linspace(-1,1,151),
np.linspace(-1,1,151))
Z = (X+1j*Y).flatten()
Z = Z[np.where(np.abs(Z) <= 0.99)]
W = np.polyval(c,Z)
# Plot the real part of the results
plt.tricontour( Z.real, Z.imag, W.real, cmap = 'jet')
plt.colorbar(label='real part')
# Plot the imaginary part of the results
plt.tricontour( Z.real, Z.imag, W.imag, cmap = 'Greys')
plt.colorbar(label='imaginary part')
plt.tight_layout(); plt.show()
This then gives the picture
The dominant terms of the polynomial are
(1+1j)*(0.500000 - 0.045040*z^3 - 0.008279*z^7
- 0.005012*z^391 - 0.016220*z^395 - 0.405293*z^399)
As far as I could see, the leading degree 3 after the constant term is constant under refinement of the sampling sequence.
I am trying to fit a progression of Gaussian peaks to a spectral lineshape.
The progression is a summation of N evenly spaced Gaussian peaks. When coded as a function, the formula for N=1 looks like this:
A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)
where A, e0, hf, S and fwhm are to be determined from the fit with some good initial guesses.
Importantly, the parameter i starts at 0 and is incremented by 1 for every additional component.
So, for N = 3 the expression would take the form:
A * ((e0-0*hf)/e0)**3 * ((S**0)/np.math.factorial(0)) * np.exp(-4*np.log(2)*((x-e0+0*hf)/fwhm)**2) +
A * ((e0-1*hf)/e0)**3 * ((S**1)/np.math.factorial(1)) * np.exp(-4*np.log(2)*((x-e0+1*hf)/fwhm)**2) +
A * ((e0-2*hf)/e0)**3 * ((S**2)/np.math.factorial(2)) * np.exp(-4*np.log(2)*((x-e0+2*hf)/fwhm)**2)
All the parameters except i are constant for every component in the summation, and this is intended. i is changing in a controlled way depending on the number of parameters.
I am using curve_fit. One way to code the fitting routine would be to explicitly define the expression for any reasonable N and just use an appropriate one. Like, here it'would be 5 or 6, depending on the spacing, which is determined by hf. I could just define a long function with N components, writing an appropriate i value into each component. I understand how to do that (and did). But I would like to code this more intelligently. My goal is to write a function that will accept any value of N, add the appropriate amount of components as described above, compute the expression while incrementing the i properly and return the result.
I have attempted a variety of things. My main hurdle is that I don't know how to tell the program to use a particular N and the corresponding values of i. Finally, after some searching I thought I found a good way to code it with a lambda function.
from scipy.optimize import curve_fit
import numpy as np
def fullfunc(x,p,n):
def func(x,A,e0,hf,S,fwhm,i):
return A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)
y_fit = np.zeros_like(x)
for i in range(n):
y_fit += func(x,p[0],p[1],p[2],p[3],p[4],i)
return y_fit
p = [1,26000,1400,1,1000]
x = [27027,25062,23364,21881,20576,19417,18382,17452,16611,15847,15151]
y = [0.01,0.42,0.93,0.97,0.65,0.33,0.14,0.06,0.02,0.01,0.004]
n = 7
fittedParameters, pcov = curve_fit(lambda x,p: fullfunc(x,p,n), x, y, p)
A,e0,hf,S,fwhm = fittedParameters
This gives:
TypeError: <lambda>() takes 2 positional arguments but 7 were given
and I don't understand why. I have a feeling the lambda function can't deal with a list of initial parameters.
I would greatly appreciate any advice on how to make this work without explicitly writing all the equations out, as I find that a bit too rigid.
The x and y ranges provided are samples of real data which give a general idea of what the shape is.
Since you only use summation over a range i=0, 1, ..., n-1, there is no need to refer to complicated lambda constructs that may or may not work in the context of curve fit. Just define your fit function as the summation of n components:
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
def func(x, A, e0, hf, S, fwhm):
return sum((A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)) for i in range(n))
p = [1,26000,1400,1,1000]
x = [27027,25062,23364,21881,20576,19417,18382,17452,16611,15847,15151]
y = [0.01,0.42,0.93,0.97,0.65,0.33,0.14,0.06,0.02,0.01,0.004]
n = 7
fittedParameters, pcov = curve_fit(func, x, y, p0=p)
#A,e0,hf,S,fwhm = fittedParameters
print(fittedParameters)
plt.plot(x, y, "ro", label="data")
x_fit = np.linspace(min(x), max(x), 100)
y_fit = func(x_fit, *fittedParameters)
plt.plot(x_fit, y_fit, label="fit")
plt.legend()
plt.show()
Sample output:
P.S.: By the look of it, these data points are already well fitted with n=1.
From https://stackoverflow.com/a/30460089/2202107, we can generate CDF of a normal distribution:
import numpy as np
import matplotlib.pyplot as plt
N = 100
Z = np.random.normal(size = N)
# method 1
H,X1 = np.histogram( Z, bins = 10, normed = True )
dx = X1[1] - X1[0]
F1 = np.cumsum(H)*dx
#method 2
X2 = np.sort(Z)
F2 = np.array(range(N))/float(N)
# plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()
Question: How do we generate the "original" normal distribution, given only x (eg X2) and y (eg F2) coordinates?
My first thought was plt.plot(x,np.gradient(y)), but gradient of y was all zero (data points are evenly spaced in y, but not in x) These kind of data is often met in percentile calculations. The key is to get the data evenly space in x and not in y, using interpolation:
x=X2
y=F2
num_points=10
xinterp = np.linspace(-2,2,num_points)
yinterp = np.interp(xinterp, x, y)
# for normalizing that sum of all bars equals to 1.0
tot_val=1.0
normalization_factor = tot_val/np.trapz(np.ones(len(xinterp)),yinterp)
plt.bar(xinterp, normalization_factor * np.gradient(yinterp), width=0.2)
plt.show()
output looks good to me:
I put my approach here for examination. Let me know if my logic is flawed.
One issue is: when num_points is large, the plot looks bad, but it's a issue in discretization, not sure how to avoid it.
Related posts:
I failed to understand why the answer was so complicated in https://stats.stackexchange.com/a/6065/131632
I also didn't understand why my approach was different than Generate distribution given percentile ranks