I have been trying to fit a Gaussian curve to my data.
I have used the following code:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
def gaus(x, y0, a, b, c):
    return y0 + a*np.exp(-np.power(x - b, 2)/(2*np.power(c, 2)))
popt, pcov = curve_fit(gaus, x, y)
plt.figure()
plt.scatter(x, y, c='grey', marker = 'o', label = "Measured values", s = 2)
plt.plot(x, gaus(x, *popt), c='grey', linestyle = '-')
And what I am getting is clearly not a good fit.
I have the x/y data available here in case you want to try it by yourself.
Any idea on how I can get a fit? This data is obviously Gaussian shaped, so it seems weird that I cannot fit a Gaussian curve.
The fit needs a decent starting point. Per the docs, if you do not specify the starting point, all parameters are set to 1, which is clearly not appropriate here, and the fit gets stuck in a wrong local minimum. Try this, where I chose the starting point by eyeballing the data:
popt, pcov = curve_fit(gaus, x, y, p0 = (1500,2000,20, 1))
you would get a good fit, and the solution found by the solver is
popt
array([1559.13138798, 2128.64718985, 21.50092272, 0.16298357])
Even just getting the mean (parameter b) roughly right is enough for the solver to find the solution; e.g., try this:
popt, pcov = curve_fit(gaus, x, y, p0 = (1,1,20, 1))
you should see the same (good) result
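If you would rather not eyeball the starting point, a rough p0 can also be estimated from the data itself. This is only a sketch, assuming x and y are the NumPy arrays loaded from the linked file and gaus is the function defined above:

import numpy as np

# Rough, data-driven initial guess for (y0, a, b, c)
y0_guess = y.min()                  # baseline offset
a_guess = y.max() - y.min()         # peak height above the baseline
b_guess = x[np.argmax(y)]           # location of the peak
w = y - y0_guess                    # weights for a crude width estimate
c_guess = np.sqrt(np.sum(w * (x - b_guess)**2) / np.sum(w))

popt, pcov = curve_fit(gaus, x, y, p0=(y0_guess, a_guess, b_guess, c_guess))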
I plotted a graph for entropy. Now I want to do a curve fit, but I am not able to understand how to initiate the process. I tried it using the curve_fit module of scipy, but what I get is just a straight line rather than a Gaussian curve. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = [18,23,95,142,154,156,157,158,258,318,367,382,484,501,522,574,681,832,943,1071,1078,1101,1133,1153,1174,1264]
y = [0.179,0.179,0.692,0.574,0.669,0.295,0.295,0.295,0.387,0.179,0.179,0.462,0.179,0.179,0.179,0.179,0.179,0.179,0.179,0.179,0.179,0.462,0.179,0.387,0.179,0.295]
x = np.asarray(x)
y = np.asarray(y)
def Gauss(x, A, B):
    y = A*np.exp(-1*B*x**2)
    return y
parameters_g, covariance_g = curve_fit(Gauss, x, y)
fit_A = parameters_g[0]
fit_B = parameters_g[1]
print(fit_A)
print(fit_B)
fit_y = Gauss(x, fit_A, fit_B)
plt.figure(figsize=(18,10))
plt.plot(x, y, 'o', label='data')
plt.plot(x, fit_y, '-', label='fit')
plt.legend()
plt.show()
I just connected the values using a spline and got that kind of curve. Can anyone suggest how to fit a Gaussian curve to this data?
Edit 1: Now I have a bar plot (sort of) with a y value for each corresponding x value. My x-axis ranges from 0 to 1273 and the y-axis from 0 to 1. How can I do a curve fit, and what would be the right curve here? I was trying to fit a bimodal distribution to the given data. You can find the data here.
Bar plot image: https://i.stack.imgur.com/1Awt5.png
Data : https://drive.google.com/file/d/1_uiweIWRWgzy5wNVLOvn4WN25jteu8rQ/view?usp=sharing
You don't have a line; you indeed have a Gaussian curve (centered around zero because of your definition of Gauss). You can clearly see that when you plot the function on a different scale:
x_arr = np.linspace(-2,2,100)
fit_y = Gauss(x_arr, fit_A, fit_B)
plt.figure(figsize=(18,10))
plt.plot(x_arr, fit_y, '-', label='fit')
plt.legend()
plt.show()
That plot was made with the estimated fit_A == fit_B == 1 from your code. You can provide a different initial guess, which leads to a different result, via:
parameters_g, covariance_g = curve_fit(Gauss, x, y, p0=(1,1/1000))
but I would say that your data is not that well described via a Gaussian curve.
One thing I would always recommend is setting those values by hand to see the effect they have on the plot. That way you can get a feeling for whether the task you are trying to automate is realistic in the first place.
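As for the bimodal distribution mentioned in Edit 1: a sum of two Gaussians (plus a constant baseline) can be fitted in exactly the same way, it just needs more parameters. This is only a sketch; the starting values below are illustrative guesses rather than values derived from your data, they may need tweaking, and as noted above the data may not be described well by Gaussians at all:

def bimodal(x, c, A1, mu1, s1, A2, mu2, s2):
    # Constant baseline plus two Gaussian components
    return (c
            + A1*np.exp(-(x - mu1)**2/(2*s1**2))
            + A2*np.exp(-(x - mu2)**2/(2*s2**2)))

p0_bimodal = (0.18, 0.4, 130, 40, 0.2, 1100, 100)  # baseline, then (height, center, width) per peak
params_b, covariance_b = curve_fit(bimodal, x, y, p0=p0_bimodal, maxfev=5000)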
I'm currently working on a lab report for Brownian Motion using this PDF equation with the intent of evaluating D:
Brownian PDF equation
And I am trying to curve_fit it to a histogram. However, whenever I plot my curve_fits, it's a line and does not appear correctly on the histogram.
Example Histogram with bad curve_fit
And here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
# Variables
eta = 1e-3
ra = 0.95e-6
T = 296.5
t = 0.5
# Random data
r = np.array(np.random.rayleigh(0.5e-6, 500))
# Histogram
plt.hist(r, bins=10, density=True, label='Counts')
# Curve fit
x,y = np.histogram(r, bins=10, density=True)
x = x[2:]
y = y[2:]
bin_width = y[1] - y[2]
print(bin_width)
bin_centers = (y[1:] + y[:-1])/2
err = x*0 + 0.03
def f(r, a):
    return (((1e-6)*3*np.pi*r*eta*ra)/(a*T*t))*np.exp(((-3*(1e-6 * r)**2)*eta*ra*np.pi)/(a*T*t))
print(x) # these are flipped for some reason
print(y)
plt.plot(bin_centers, x, label='Fitting this', color='red')
popt, pcov = optimize.curve_fit(f, bin_centers, x, p0 = (1.38e-23), sigma=err, maxfev=1000)
plt.plot(y, f(y, popt), label='PDF', color='orange')
print(popt)
plt.title('Distance vs Counts')
plt.ylabel('Counts')
plt.xlabel('Distance in micrometers')
plt.legend()
Is the issue with my curve_fit? Or is there an underlying issue I'm missing?
EDIT: I broke D down so that the Boltzmann constant appears as a in the function, which is why there are more numbers in f than in the equation above (from the expressions for D and Gamma).
I've tried messing with the initial conditions and plotting the function with 1.38e-23 instead of popt, but that is also wrong (the purple line). This tells me something is wrong with the equation for f, but no issues jump out at me when I look at it. Am I missing something?
EDIT 2: I changed the function to this to simplify it and match the numpy.random.rayleigh() distribution:
def f(r, a):
    return ((r)/(a))*np.exp((-1*(r)**2)/(2*a))
But this doesn't resolve the issue that the curve_fit result is a line with a positive slope instead of anything remotely like what I'm interested in. Now I am even more confused as to what the issue is.
There are a few things here. I don't think x and y were ever flipped; at least when I assumed they weren't, everything seemed to work fine. I also cleaned up a few parts of the code: for example, I'm not sure why you compute two different histograms, and I think there may have been problems with the single-element tuple of parameters. Also, for curve fitting, the initial parameter guess often needs to be in the ballpark, so I changed that too.
Here's a version that works for me:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
# Random data
r = np.array(np.random.rayleigh(0.5e-6, 500))
# Histogram
hist_values, bin_edges, patches = plt.hist(r, bins=10, density=True, label='Counts')
bin_centers = (bin_edges[1:] + bin_edges[:-1])/2
x = bin_centers[2:] # not necessary, and I'm not sure why the OP did this, but I'm doing this here because OP does
y = hist_values[2:]
def f(r, a):
    return (r/(a*a))*np.exp((-1*(r**2))/(2*a*a))
plt.plot(x, y, label='Fitting this', color='red')
err = x*0 + 0.03
popt, pcov = optimize.curve_fit(f, x, y, p0 = (1.38e-6,), sigma=err, maxfev=1000)
plt.plot(x, f(x, *popt), label='PDF', color='orange')
plt.title('Distance vs Counts')
plt.ylabel('Counts')
plt.xlabel('Distance in Meters') # Motion seems to be in micron range, but calculation and plot has been done in meters
plt.legend()
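As a quick sanity check, since the synthetic data were drawn with np.random.rayleigh(0.5e-6, 500), the fitted Rayleigh scale parameter should come out close to 0.5e-6:

# The fitted scale should be near the value used to generate the data
print("fitted scale a:", popt[0])
print("scale used to generate the data:", 0.5e-6)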
So I have two lists of data, which I can plot in a scatter plot, as such:
from matplotlib import pyplot as plt
x = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y = [22.4155688819,22.3936180362,22.3177538001,22.1924849792,21.7721194577,21.1590235248,20.6670446864,20.4996957642,20.4260953411,20.3595072628,20.3926201626,20.6023149681,21.1694961343,22.1077417713,23.8270366414,26.5355924353,31.3179807276,42.7871637946,61.9639549412,84.7710953311]
plt.scatter(x, y)
This gives you a plot that looks like a Gaussian distribution, which is good, as it should. My issue, however, is that I am trying to fit a Gaussian distribution to this and failing miserably, because (a) it's only half a Gaussian instead of a full one, and (b) what I've used before has only ever worked on a single array of samples. So, something like:
from scipy.stats import norm
import matplotlib.mlab as mlab

# best fit of data
num_bins = 20
(mu, sigma) = norm.fit(sixteen)
y = mlab.normpdf(num_bins, mu, sigma)
n, bins, patches = plt.hist(deg_array, num_bins, normed=1, facecolor='blue', alpha=0.5)
# add a 'best fit' line
y = mlab.normpdf(bins, mu, sigma)
plt.plot(bins, y, 'r--')
Does this approach work at all here, or am I going about this in the wrong way completely? Thanks...
It seems that your usual approach is to find the expectation value and standard deviation of sample data directly (with norm.fit) rather than doing a least-squares fit, which does not apply to (x, y) curve data like this. Here is a solution using curve_fit from scipy.optimize:
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
x = np.array([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19])
y = [22.4155688819,22.3936180362,22.3177538001,22.1924849792,21.7721194577,21.1590235248,20.6670446864,20.4996957642,20.4260953411,20.3595072628,20.3926201626,20.6023149681,21.1694961343,22.1077417713,23.8270366414,26.5355924353,31.3179807276,42.7871637946,61.9639549412,84.7710953311]
# Define a Gaussian function with a constant offset
def gaussian_func(x, a, x0, sigma, c):
    return a * np.exp(-(x-x0)**2/(2*sigma**2)) + c
initial_guess = [1,20,2,0]
popt, pcov = curve_fit(gaussian_func, x, y,p0=initial_guess)
xplot = np.linspace(0,30,1000)
plt.scatter(x,y)
plt.plot(xplot,gaussian_func(xplot,*popt))
plt.show()
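If you also want the kind of numbers norm.fit would have given you, the center and width can be read straight off the fitted parameters:

# Unpack the fitted parameters: amplitude, center, width, offset
a_fit, x0_fit, sigma_fit, c_fit = popt
print("center (x0):", x0_fit)
print("width (sigma):", sigma_fit)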
I'm using scipy.optimize.curve_fit to approximate peaks in my data with Gaussian functions. This works well for strong peaks, but it is more difficult with weaker peaks. However, I think fixing a parameter (say, width of the Gaussian) would help with this. I know I can set initial "estimates" but is there a way that I can easily define a single parameter without changing the function I'm fitting to?
If you want to "fix" a parameter of your fit function, you can just define a new fit function which makes use of the original fit function, yet setting one argument to a fixed value:
custom_gaussian = lambda x, mu: gaussian(x, mu, 0.05)
Here's a complete example of fixing sigma of a Gaussian function to 0.05 (instead of the optimal value of 0.1). Of course, this doesn't really make sense here because the algorithm has no problem finding the optimal values. Still, you can see how mu is found despite the fixed sigma.
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize
def gaussian(x, mu, sigma):
    return 1 / sigma / np.sqrt(2 * np.pi) * np.exp(-(x - mu)**2 / 2 / sigma**2)
# Create sample data
x = np.linspace(0, 2, 200)
y = gaussian(x, 1, 0.1) + np.random.rand(*x.shape) - 0.5
plt.plot(x, y, label="sample data")
# Fit with original fit function
popt, _ = scipy.optimize.curve_fit(gaussian, x, y)
plt.plot(x, gaussian(x, *popt), label="gaussian")
# Fit with custom fit function with fixed `sigma`
custom_gaussian = lambda x, mu: gaussian(x, mu, 0.05)
popt, _ = scipy.optimize.curve_fit(custom_gaussian, x, y)
plt.plot(x, custom_gaussian(x, *popt), label="custom_gaussian")
plt.legend()
plt.show()
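If you prefer not to use a lambda, functools.partial expresses the same idea; this is just a variation on the example above (p0 is passed explicitly so curve_fit does not have to inspect the partial object's signature):

from functools import partial

# Same fixed-sigma fit as above, written with functools.partial
partial_gaussian = partial(gaussian, sigma=0.05)
popt, _ = scipy.optimize.curve_fit(partial_gaussian, x, y, p0=[1.0])
print("fitted mu with sigma fixed at 0.05:", popt[0])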
Hopefully this is helpful. I had to use a bit of a hack, since curve_fit is pretty strict about the signature of the function it takes.
import numpy as np
from numpy import random
import scipy as sp
from scipy.optimize import curve_fit
import matplotlib.pyplot as pl
def exp1(t, a1, tau1):
    # A1*exp(-t/t1)
    val = (a1*np.exp(-t/tau1))*np.heaviside(t, 0)
    return val

def wrapper(t, *args):
    # Build a call string in which held parameters are baked in as literal values
    # and free parameters refer to the args passed in by curve_fit.
    # Note: this indexing only works when the held parameters come after the free ones.
    global hold
    global p0
    wrapperName = 'exp1(t,'
    for i in range(0, len(hold)):
        if hold[i]:
            wrapperName += str(p0[i])
        else:
            wrapperName += 'args[' + str(i) + ']'
        if i < len(hold):
            wrapperName += ','
    wrapperName += ')'
    return eval(wrapperName)
p0=np.array([1.5,500.])
hold=np.array([0,1])
p1=np.delete(p0,1)
timepoints = np.arange(0.,2000.,20.)
y=exp1(timepoints,1,1000)+np.random.normal(0, .1, size=len(timepoints))
popt, pcov = curve_fit(exp1, timepoints, y, p0=p0)
print('unheld parameters:', popt, pcov)
popt, pcov = curve_fit(wrapper, timepoints, y, p0=p1)
for i in range(0, len(hold)):
    if hold[i]:
        popt = np.insert(popt, i, p0[i])
yfit=exp1(timepoints,popt[0],popt[1])
pl.plot(timepoints,y,timepoints,yfit)
pl.show()
print('held parameters:', popt, pcov)
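If you would rather avoid building a string for eval, the same hold-parameter trick can be written with a closure; this is a sketch of that variation, reusing exp1, p0, hold, p1, timepoints, and y from above:

def make_wrapper(func, p_full, hold):
    # Return a function of (t, *free) with the held parameters frozen at their p_full values
    def wrapped(t, *free):
        it = iter(free)
        params = [p_full[i] if hold[i] else next(it) for i in range(len(hold))]
        return func(t, *params)
    return wrapped

wrapped_exp1 = make_wrapper(exp1, p0, hold)
popt2, pcov2 = curve_fit(wrapped_exp1, timepoints, y, p0=p1)
print('held parameters (closure version):', popt2)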
I have fitted a curve to a set of data points. I would like to know how to find the maximum point of my curve and then annotate that point (I don't want to simply use the largest y value from my data to do this). I cannot write my exact code, but here is the basic layout of it.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

x = np.array([1,2,3,4,5])
y = np.array([1,4,16,4,1])

def f(x, p1, p2, p3):
    return p3*(p1/((x-p2)**2 + (p1/2)**2))

p0 = (8, 16, 0.1) # guess parameters
plt.plot(x,y,"ro")
popt, pcov = curve_fit(f, x, y, p0)
plt.plot(x, f(x, *popt))
Also, is there a way to find the peak width?
Am I missing a simple built-in function that could do this? Could I differentiate the function and find the point at which the derivative is zero? If so, how?
After the fit has found the best parameters, you can locate the peak of the fitted function using minimize_scalar (or one of the other methods from scipy.optimize).
Note that in below, I've shifted x[2]=3.2 so that the peak of the curve doesn't land on a data point and we can be sure we're finding the peak to the curve, not the data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit, minimize_scalar
x = [1,2,3.2,4,5]
y = [1,4,16,4,1]
def f(x, p1, p2, p3):
    return p3*(p1/((x-p2)**2 + (p1/2)**2))

p0 = (8, 16, 0.1) # guess parameters
plt.plot(x,y,"ro")
popt, pcov = curve_fit(f, x, y, p0)
# find the peak
fm = lambda x: -f(x, *popt)
r = minimize_scalar(fm, bounds=(1, 5), method='bounded')
print("maximum:", r["x"], f(r["x"], *popt)) # maximum: 2.99846874275 18.3928199902
x_curve = np.linspace(1, 5, 100)
plt.plot(x_curve, f(x_curve, *popt))
plt.plot(r['x'], f(r['x'], *popt), 'ko')
plt.show()
Of course, rather than optimizing the function, we could just calculate it for a bunch of x-values and get close:
x = np.linspace(1, 5, 10000)
y = f(x, *popt)
imax = np.argmax(y)
print(imax, x[imax]) # 4996 2.99859985999
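The same dense grid also gives a simple estimate of the peak width as the full width at half maximum (FWHM), i.e. the distance between the two points where the curve crosses half of its peak value. For this particular Lorentzian-shaped f the half-maximum points sit at p2 ± p1/2, so the result should come out close to the fitted p1:

# FWHM estimated numerically from the dense evaluation above
half_max = y[imax] / 2.0
above = np.where(y >= half_max)[0]   # indices where the curve is at least half the peak value
print(x[above[-1]] - x[above[0]])    # approximate full width at half maximum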
If you don't mind using sympy, it's pretty easy. Assuming the code you posted has already been run:
import sympy
sym_x = sympy.symbols('x', real=True)
sym_f = f(sym_x, *popt)
sym_df = sym_f.diff()
solns = sympy.solve(sym_df) # returns [3.0]
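To confirm that the critical point is a maximum and read off the peak value, you can keep going with the same symbolic expression:

x_peak = solns[0]
print(sym_f.subs(sym_x, x_peak))          # height of the fitted curve at the peak
print(sym_df.diff().subs(sym_x, x_peak))  # second derivative; a negative value confirms a maximum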