I'm trying to fit a set of data to a CDF exponential function. However, I'm not sure what is going wrong, either in my code or in the initial parameter guess, but the fit only produces a straight line. The data was imported from a CSV file.
#Plot Data
plt.figure(1,dpi=120)
plt.title("Cell A3")
plt.xlabel(rawdata[0][0])
plt.ylabel(rawdata[0][1])
plt.scatter(xdata,ydata,label="A3 Cell 1")
#Define Function
def func(t, lam):
    return 1 - (np.exp(-lam * t))
funcdata = func(xdata,1.17)
plt.plot(xdata,funcdata,label="Model")
plt.legend()
#CurveFit data to model
popt, pcov = curve_fit(func,xdata,ydata,p0=(-0.64))
perr = np.sqrt(np.diag(pcov))
Image of the graph I get with the initial data and the straight line that the curve_fit gives
You cannot correctly fit a simple exponential of the form
y = (1 - np.exp(-lam * t)) * scale
to this data, because the shape of this function is far from the shape of the data in the range 0 < t < 5. Consider instead a function of the logistic kind, for example:
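(The example formula in the original answer was given as an image and is not reproduced here. As an illustration only, a logistic-style model might look like the following; the parameterization is my own, not necessarily the author's.)

import numpy as np

def logistic(t, scale, k, t0):
    # Generic logistic model: rises from 0 toward `scale`,
    # with steepness k and midpoint t0 (illustrative parameterization).
    return scale / (1.0 + np.exp(-k * (t - t0)))

A model of this shape can saturate at an arbitrary level and has an inflection point, which suits data that is flat near t = 0 and then rises, unlike 1 - exp(-lam*t), which is steepest at t = 0.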
Think about your data and your function. ydata takes quite large values. What is the maximum value of
def func(t, lam):
    return 1 - (np.exp(-lam * t))
I think you will find that as lam approaches infinity the function approaches 1, so its maximum value is 1. How can a function whose maximum value is 1 fit data in the 1000s? If you want to be able to scale beyond 1, you need more parameters in your function. Try
def func(t, lam, scale):
    return (1 - np.exp(-lam * t)) * scale
and see if scipy is able to better fit the data.
EDIT:
I managed to get that to work; however, you aren't even plotting the optimal parameters. To do that, see my code with simulated xdata and ydata:
#Plot Data
import numpy as np
from scipy.optimize import curve_fit
from matplotlib import pyplot as plt
def func(t, lam, scale):
    return (1 - np.exp(-lam * t)) * scale
xdata = np.arange(25.)
ydata = func(xdata, 1.12, 2000.)
plt.figure(1,dpi=120)
plt.title("Cell A3")
plt.xlabel("time")       # was rawdata[0][0]; replaced with a literal so the simulated example runs
plt.ylabel("intensity")  # was rawdata[0][1]
plt.scatter(xdata,ydata,label="A3 Cell 1")
#CurveFit data to model
popt, pcov = curve_fit(func,xdata,ydata,p0=[0.5, 1000.1])
plt.plot(np.arange(25),func(np.arange(25), *popt),label="Model")
plt.legend()
outputs:
I don't understand the parameters returned by the _fitstart() method of scipy.stats.levy_stable for distributions with positive versus negative beta parameters. Intuitively, changing the sign of beta when generating a random sample should not affect the estimate for alpha when fitting the data. I am not sure what effect the sign of beta should have on the third parameter returned by _fitstart(), but I hoped the sign might just get reversed after converting the return values as suggested by this answer.
from scipy.stats import levy_stable
from scipy.stats import rv_continuous as rvc
import numpy as np
points = 1000000
jennys_constant = 8675309
pconv = lambda alpha, beta, mu, sigma: (alpha, beta, mu - sigma * beta * np.tan(np.pi * alpha / 2.0), sigma)
rvc.random_state = jennys_constant
def test_fitstart(alpha, beta):
    draw = levy_stable.rvs(alpha, beta, size=points)
    # use scipy's quantile estimator to estimate the parameters and convert to S parameterization
    return pconv(*levy_stable._fitstart(draw))

print("A few calls with beta=1")
for i in range(3):
    print(test_fitstart(alpha=1.3, beta=1))

print("A few calls with beta=-1")
for i in range(3):
    print(test_fitstart(alpha=1.3, beta=-1))
>>> A few calls with beta=1
>>> (1.3059810788754223, 1.0, 1.9212069030505312, 1.0017497273563876)
>>> (1.3048494867305243, 1.0, 1.92631956349381, 1.000064636906844)
>>> (1.3010492983811222, 1.0, 1.9544520781484407, 0.9999042085058586)
>>> A few calls with beta=-1
>>> (1.3652389860952416, -1.0, 0.3424825654388899, 1.0317366391952136)
>>> (1.370069101697994, -1.0, 0.3560781956631771, 1.0397745333221347)
>>> (1.3682310757082936, -1.0, 0.34621980810217745, 1.037169706715312)
Looking at the _fitstart() code I think the lookup for alpha should probably be using the absolute value of nu_beta, but isn't, so the lookup is probably extrapolating outside the nu_beta_range.
Similarly, I wonder if the absolute value of something should be used inside the calculation of delta, before clipping gets applied, with a post-clipping adjustment for the sign of beta? Actually, looking at it again I think clipping should be applied to c (the scaling parameter, which must be positive). Clipping should not be applied to delta (the location parameter = mean, which can vary from -inf to inf). Is this right?
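To make the clipping suggestion concrete, this is roughly the behaviour I have in mind (illustrative only, using hypothetical variable names rather than the actual _fitstart internals):

import numpy as np

# c_est and delta_est stand in for whatever intermediate quantile
# estimates _fitstart computes; the names here are hypothetical.
c_est, delta_est = 0.98, -1.9

c = np.clip(c_est, np.finfo(float).eps, np.inf)  # scale: must stay strictly positive
delta = delta_est                                # location: any real value, so no clipping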
levy_stable._fitstart() is not handling negatively skewed data correctly, but we can work around it by reflecting the sample about the origin. _fitstart() will then return sensible estimates for the stability and scale parameters, which are not affected by reflection. The estimates of skewness and loc parameters are inverted in the reflected sample.
A simple wrapper function can check whether the data are skewed to the right or left before calling _fitstart(), and then invert the inverted parameter estimates as necessary. This won't fix levy_stable.fit() itself, but at least we can get quantile estimates from _fitstart().
import numpy as np
from scipy import __version__ as scipy_version
from scipy.stats import levy_stable
points = 1000000
const = 314
def lsfitstart(data):
    """Wrapper for levy_stable._fitstart() to fix data with negative skew"""
    skewleft = np.mean(data) <= np.median(data)
    # reverse sign of the data points if distribution has negative skew
    alpha, beta, loc, scale = levy_stable._fitstart(-data if skewleft else data)
    # reverse sign of skewness and loc estimates if distribution has negative skew
    beta_fixed, loc_fixed = [-x if skewleft else x for x in (beta, loc)]
    # clip scale parameter to ensure it is positive
    scale_fixed = np.clip(scale, np.finfo(float).eps, np.inf)
    return (alpha, beta_fixed, loc_fixed, scale_fixed)
print(scipy_version)
sample = levy_stable.rvs(alpha=1.3, beta=1, size=points, random_state=const)
print("levy_stable fit : alpha (stabililty), beta (skewness), loc, scale")
print("_fitstart(positive ): ", levy_stable._fitstart(sample))
print("_fitstart(negative=bad): ", levy_stable._fitstart(-sample))
print()
print("lsfitstart(positive ): ", lsfitstart(sample))
print("lsfitstart(negative=OK): ", lsfitstart(-sample))
>>> 1.5.2
>>> levy_stable fit : alpha (stability), beta (skewness), loc, scale
>>> _fitstart(positive ): (1.3055214922752139, 1.0, 2.220446049250313e-16, 1.0002159057207403)
>>> _fitstart(negative=bad): (1.3673555827622812, -1.0, 1.9389262857717497, 1.0337386531320203)
>>>
>>> lsfitstart(positive ): (1.3055214922752139, 1.0, 2.220446049250313e-16, 1.0002159057207403)
>>> lsfitstart(negative=OK): (1.3055214922752139, -1.0, -2.220446049250313e-16, 1.0002159057207403)
I'd like to fit some data with an exponential function. I used scipy.optimize.curve_fit because I already used it for other fits. This time, there is an issue and I can't figure out what's wrong.
Here is what the data looks like when plotted :
data.png
As you can see, it seems to follow an exponential law.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
data = np.array([
0., 1.93468444, 3.69735865, 5.38185988, 6.02549022,
6.69199075, 7.72316694, 8.08913061, 8.84570241, 8.69711608,
8.80038144, 9.78951087, 9.68486674, 10.06175145, 10.44039495,
10.0481156 , 9.76656204, 9.88581457, 9.81805445, 10.42432252,
10.41102239, 11.2911395 , 9.64866184, 9.98072231, 10.83644694,
10.24748571, 10.81333209, 10.75949899, 10.90367328, 10.42446764,
10.51441017, 10.73047737, 10.8159758 , 10.51013538, 10.02862504,
9.76352112, 10.64829309, 10.6293347 , 10.67752596, 10.34801542,
10.53158576, 10.92883362, 10.67002314, 10.37015825, 10.74876349,
10.12821343, 10.8974205 , 10.1591103 , 10.588377 , 11.92134556,
10.309095 , 11.1174362 , 10.72654524, 10.60890374, 10.37456491,
10.05935346, 11.21295863, 11.09013951, 10.60862773, 11.2558922 ,
11.24660234, 10.35981557, 10.81284365, 10.96113067, 10.22716439,
9.8394873 , 10.01892084, 10.38237311, 10.04920671, 10.87782442,
10.42438756, 10.05614503, 10.5446946 , 9.99974368, 10.76930547,
10.22164072, 10.36942999, 10.89888302, 10.47035428, 10.58157374,
11.12615892, 11.30866718, 10.33215937, 10.46723351, 10.54072701,
11.45027197, 10.45895588, 10.34176601, 10.78405493, 10.43964778,
10.34047484, 10.25099046, 11.05847515, 10.27408195, 10.27529163,
10.16568845, 10.86451738, 10.73205291, 10.73300649, 10.49463959,
10.03729782
])
t = np.linspace(0, 100, len(data)) #time array
def expo(x, a, b, c):  # exponential function for fitting
    return a * np.exp(b * x) + c
fig1, ax1 = plt.subplots()
ax1.plot(t, data, ".", label="data")
coefs = curve_fit(expo, t, data)[0] # fitting
ax1.plot(t, expo(t, coefs[0], coefs[1], coefs[2]), "-", label="fit")
ax1.legend()
plt.show()
The problem is that curve_fit() returns very big or very small coefficients a, b and c, while it should return something more like a = -10.5, b = -0.2, c = 10.5.
The fitting process works by finding a local minimum of a loss function.
If the problem is unconstrained, there may be several such local minima,
each giving different values of parameters, and you may get a different one
than the one that you are expecting.
If you have a guess what the parameters should be, you can provide it to narrow the search:
# with an initial guess for values of a, b, c
coefs = curve_fit(expo, t, data, p0=[-10, -1, 10])[0]
The coefficients it produces are:
array([-10.48815244, -0.2091102 , 10.56699883])
Alternatively, you can specify bounds for the parameters:
# with lower and upper bounds for a, b, c
coefs = curve_fit(expo, t, data, bounds=([-20, -2, 0], [-10, 2, 20]))[0]
This gives the same results as above.
Your software probably implements a non-linear regression algorithm. "Guessed" initial values of the parameters are required to start the iterative process. If no initial values are provided by the user, the software evaluates some itself. That is often a cause of failure, because the computed initial values may be too far from the correct ones.
Good initial values can be found with a linear regression method that does not require initial values; see the calculation sketched below.
The result is: (shown as an image in the original answer)
If the accuracy of the above result is not sufficient according to some specified fitting criterion, a non-linear regression is necessary. In that case the above values of the parameters $a,b,c$ can be used as initial values for the iterative calculation.
Note: the principle of the method, which linearizes the non-linear regression as shown above, is explained in: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
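Here is a rough sketch of the linearization idea described in the answer above, assuming the model y = a*exp(b*x) + c from the question. This is my own minimal implementation of the integral-equation approach, not the exact formulas from the linked document:

import numpy as np

def guess_abc(x, y):
    # Rough initial guesses for y = a*exp(b*x) + c, obtained without any
    # starting values, via an integral-equation linearization.
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    # S approximates the cumulative integral of y from x[0] to x (trapezoid rule).
    S = np.concatenate(([0.0], np.cumsum(0.5 * (y[1:] + y[:-1]) * np.diff(x))))
    # For y = a*exp(b*x) + c one has  y - y[0] = b*S - b*c*(x - x[0]),
    # so an ordinary linear least-squares fit gives b directly.
    M = np.column_stack((S, x - x[0]))
    (b, _), *_ = np.linalg.lstsq(M, y - y[0], rcond=None)
    # With b fixed, the model is linear in a and c.
    N = np.column_stack((np.exp(b * x), np.ones_like(x)))
    (a, c), *_ = np.linalg.lstsq(N, y, rcond=None)
    return a, b, c

# These guesses can then seed the non-linear fit, e.g.:
# coefs = curve_fit(expo, t, data, p0=guess_abc(t, data))[0]

The point of such a linear pre-fit is only to provide starting values; the non-linear regression then refines them.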
Here is what I tried: I used a negative b in np.exp.
def expo(x, a, b, c):
    return a * np.exp(-b * x) + c
>>>[-10.4881516 0.20911016 10.5669989 ]
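For completeness, I assume the fit itself was run exactly as in the question, with no explicit starting guess:

coefs = curve_fit(expo, t, data)[0]  # same call as in the question, just with the redefined expo
print(coefs)

With the sign moved inside the function, the fitted b comes out positive (about 0.209, per the output above) instead of negative.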
I have some noisy data that can contain anywhere from 0 to n Gaussian shapes. I am trying to implement an algorithm that takes the highest data points and fits a Gaussian to them, as per the following scheme:
New attempt, steps:
1. Fit a spline through all data points.
2. Get the first derivative of the spline function.
3. Get the two data points (left/right of the data point with maximum intensity) where f'(x) is around 0.
4. Fit a Gaussian through the data points returned from step 3.
4a. Plot the Gaussian (stopping at the baseline) in the PDF.
5. Calculate the area under the Gaussian curve.
6. Calculate the area under the raw data points.
7. Calculate the percentage of the total area explained by the Gaussian area.
I have implemented this concept using the following code (minimal working example):
#! /usr/bin/env python
from scipy.interpolate import InterpolatedUnivariateSpline
from scipy.optimize import curve_fit
from scipy.signal import argrelextrema
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
data = [(9.60380153195,187214),(9.62028167623,181023),(9.63676350256,174588),(9.65324602212,169389),(9.66972824591,166921),(9.68621215187,167597),(9.70269675106,170838),(9.71918105436,175816),(9.73566703995,181552),(9.75215371878,186978),(9.76864010158,191718),(9.78512816681,194473),(9.80161692526,194169),(9.81810538757,191203),(9.83459553243,186603),(9.85108637051,180273),(9.86757691233,171996),(9.88406913682,163653),(9.90056205454,156032),(9.91705467586,149928),(9.93354897998,145410),(9.95004397733,141818),(9.96653867816,139042),(9.98303506191,137546),(9.99953213889,138724)]
data2 = [(9.60476933166,163571),(9.62125990879,156662),(9.63775225872,150535),(9.65424539203,146960),(9.67073831905,146794),(9.68723301904,149326),(9.70372850238,152616),(9.72022377931,155420),(9.73672082933,156151),(9.75321866271,154633),(9.76971628954,151549),(9.78621568961,148298),(9.80271587303,146333),(9.81921584976,146734),(9.83571759987,150351),(9.85222013334,156612),(9.86872245996,164192),(9.88522656011,171199),(9.90173144362,175697),(9.91823612015,176867),(9.93474257034,175029),(9.95124980389,171762),(9.96775683032,168449),(9.98426563055,165026)]
def gaussFunction(x, *p):
    """ TODO
    """
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))

def quantify(data):
    """ TODO
    """
    backGround = 105000 # Normally this is dynamically determined but this value is fine for testing on the provided data
    time, intensity = zip(*data)
    x_data = np.array(time)
    y_data = np.array(intensity)
    # number of samples for linspace must be an integer
    newX = np.linspace(x_data[0], x_data[-1], int(2500*(x_data[-1]-x_data[0])))
    f = InterpolatedUnivariateSpline(x_data, y_data)
    fPrime = f.derivative()
    newY = f(newX)
    newPrimeY = fPrime(newX)
    maxm = argrelextrema(newPrimeY, np.greater)
    minm = argrelextrema(newPrimeY, np.less)
    breaks = maxm[0].tolist() + minm[0].tolist()
    maxPoint = 0
    for index, j in enumerate(breaks):
        try:
            if max(newY[breaks[index]:breaks[index+1]]) > maxPoint:
                maxPoint = max(newY[breaks[index]:breaks[index+1]])
                xData = newX[breaks[index]:breaks[index+1]]
                yData = [x - backGround for x in newY[breaks[index]:breaks[index+1]]]
        except:
            pass
    # Gaussian fit on main points
    newGaussX = np.linspace(x_data[0], x_data[-1], int(2500*(x_data[-1]-x_data[0])))
    p0 = [np.max(yData), xData[np.argmax(yData)], 0.1]
    try:
        coeff, var_matrix = curve_fit(gaussFunction, xData, yData, p0)
        newGaussY = gaussFunction(newGaussX, *coeff)
        newGaussY = [x + backGround for x in newGaussY]
        # Generate plot for visual confirmation
        fig = plt.figure()
        ax = fig.add_subplot(111)
        plt.plot(x_data, y_data, 'b*')
        plt.plot((newX[0], newX[-1]), (backGround, backGround), 'red')
        plt.plot(newX, newY, color='blue', linestyle='dashed')
        plt.plot(newGaussX, newGaussY, color='green', linestyle='dashed')
        plt.title("Test")
        plt.xlabel("rt [m]")
        plt.ylabel("intensity [au]")
        plt.savefig("Test.pdf", bbox_inches="tight")
        plt.close(fig)
    except:
        pass
# Call the test
#quantify(data)
quantify(data2)
where normally the background (red line in the pictures below) is dynamically determined, but for the sake of this example I have set it to a fixed number. The problem I have is that for some data it works really well:
Corresponding f'(x):
However, for some other data it fails horrendously:
Corresponding f'(x):
Therefore, I would like to hear some suggestions or ideas on why this happens and on potential approaches to fix it. I have included the data that is shown in the picture below (in case anyone wants to try it):
The error lay in the following bit:
breaks = maxm[0].tolist() + minm[0].tolist()
for index,j in enumerate(breaks):
The breaks list now contains both the maxima and the minima, but they are not sorted by time, so for the poor fit the list yields the data points 9.78, 9.62 and 9.86.
The program would then examine the data from 9.78 to 9.62 and from 9.62 to 9.86, which meant that the 9.62 to 9.86 segment contained the highest-intensity data point, yielding the fit shown in the second graph.
The fix was rather simple: just sort breaks before looping over it, as follows:
breaks = maxm[0].tolist() + minm[0].tolist()
breaks = sorted(breaks)
for index,j in enumerate(breaks):
The program then yielded a fit more closely resembling what I would expect:
This is quite a specific problem that I was hoping the community could help me out with. Thanks in advance.
So I have two sets of data: one is experimental and the other is based on an equation. I am trying to fit my data points to this curve and hence obtain the missing variables I am interested in, namely a and b in the Ebfit function.
Here is the code:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as spys
from scipy.optimize import curve_fit
time = [60,220,520,1840]
Moment = [0.64227262,0.468318916,0.197100772,0.104512508]
Temperature = 25 # Bake temperature in degrees C
Nb = len(Moment) # Number of bake measurements
Baketime_a = time #[s]
N_Device = 10000 # No. of devices considered in the array
T_ambient = 273 + Temperature
kt = 0.0256*(T_ambient/298) # In units of eV
f0 = 1e9 # Attempt frequency
def Ebfit(x, a, b):
    Eb_mean = a*(0.0256/kt) # Eb at bake temperature
    Eb_sigma = b*Eb_mean
    Foursigma = 4*Eb_sigma
    Eb_a = np.linspace(Eb_mean-Foursigma, Eb_mean+Foursigma, N_Device)
    dEb = Eb_a[1] - Eb_a[0]
    pdfEb_a = spys.norm.pdf(Eb_a, Eb_mean, Eb_sigma)
    ## Retention Time
    DMom = np.zeros(len(x), float)
    tau = (1/f0)*np.exp(Eb_a)
    for bb in range(len(x)):
        DMom[bb] = (1 - 2*(sum(pdfEb_a*(1 - np.exp(np.divide(-x[bb], tau))))*dEb))
    return DMom
a = 30
b = 0.10
params,extras = curve_fit(Ebfit,time,Moment)
x_new = list(range(0,2000,1))
y_new = Ebfit(x_new,params[0],params[1])
plt.plot(time,Moment, 'o', label = 'data points')
plt.plot(x_new,y_new, label = 'fitted curve')
plt.legend()
The main problem I am having is that fitting the data to the function does not work when I use a large number of points. When I use the 4 points above (time & Moment), the code works fine.
I get the following values for a and b.
array([ 29.11832766, 0.13918353])
The expected range for a is 23-50 and for b is 0.06-0.15, so these values are within the acceptable range. This is the corresponding plot:
However, this breaks down when I use my actual experimental normalized data, which has about 500 points.
EDIT: This data:
Normalized Data
https://www.dropbox.com/s/64zke4wckxc1r75/Normalized%20Data.csv?dl=0
Raw Data
https://www.dropbox.com/s/ojgse5ibp59r8nw/Data1.csv?dl=0
I get the following values and plot for a and b, which are out of the acceptable range:
array([-13.76687781, -12.90494196])
I know these values are wrong; if I were to do it manually (slowly adjusting the values to obtain a proper fit), they would be around a = 30.1 and b = 0.09, and when plotted it looks like this:
I have tried changing the initial guess values for a & b, using other sets of experimental data, and other suggestions from similar threads. None seem to work for me. Any help you can provide is appreciated. Thanks.
ADDITIONAL INFORMATION
The model I am trying to fit the data to comes from the following equation (given as an image in the original post), where Dmom = 1 - 2*Psw.
Here a is the Eb value and b is the sigma value; Eb takes a range of values determined by the probability density function, spanning four sigma on either side of the mean (i.e. Foursigma). This distribution is then summed over to produce the final equation.
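Reading the Ebfit code above, the model it implements appears to be (this is my reconstruction from the code, since the equation image is not reproduced here):

$$D_\mathrm{mom}(t) = 1 - 2\,P_\mathrm{sw}(t), \qquad P_\mathrm{sw}(t) = \int p(E_b)\left(1 - e^{-t/\tau(E_b)}\right)\,dE_b, \qquad \tau(E_b) = \frac{1}{f_0}\,e^{E_b},$$

where $E_b$ is expressed in units of $k_B T$ as in the code, $p(E_b)$ is a normal density with mean Eb_mean = a*(0.0256/kt) and standard deviation b*Eb_mean, and the integral is approximated by a sum over N_Device grid points spanning $\pm 4\sigma$ around the mean.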
It looks like you do need to play around with the initial guesses for a and b after all. Perhaps the function you're fitting is not very well behaved, which is why it's so prone to fail for initial guesses away from the global minimum. That being said, here's a working example of how to fit your data:
import pandas as pd
data_df = pd.read_csv('data.csv')
time = data_df['Time since start, Time [s]'].values
moment = data_df['Signal X direction, Moment [emu]'].values
params, extras = curve_fit(Ebfit, time, moment, p0=[40, 0.3])
Yields the values of a and b of:
In [6]: params
Out[6]: array([ 30.47553689, 0.08839412])
Which results in a nicely aligned fit of the function:
x_big = np.linspace(1, 1800, 2000)
y_big = Ebfit(x_big, params[0], params[1])
plt.plot(time, moment, 'o', alpha=0.5, label='all points')
plt.plot(x_big, y_big, label = 'fitted curve')
plt.legend()
plt.show()