I have two arrays. The first is E, an integer array running from 0 to 21. The second is E_Dist, which gives the number of elements for each index of the E array; the E_Dist values range from 0 to 450000.
I want to plot E_Dist vs E, i.e. E on the horizontal axis and E_Dist on the vertical axis, and fit a Gaussian function to this plot.
Unfortunately, however much I try, I cannot fit a Gaussian function to it. In fact I have a couple of problems:
1- How can I guess the center of the Gaussian correctly? I can simply use the mean value of E, which is 10, but in reality my plot peaks at about 5. I tried a formula like
E_cen = sum(E*E_Dist)/sum(E)
but the result is much worse, about 10e4.
2- I used two different forms of the Gaussian function, one with an amplitude parameter and one without. I could never fit a Gaussian when I included the amplitude; without the amplitude I do get a Gaussian curve, but it never fits my data points.
I think one issue is that I couldn't calculate the sigma or the other Gaussian parameters correctly. Another could be the large spread of the E_Dist values, which cover 0 to 450000.
I attached my plot. Can anyone help me work out how to solve this problem?
Thanks in advance.
########### Gaussian fit
E_cen = np.mean(E)
print 'E_cen=', E_cen
>>> 10.54
mean_N = np.mean(E_Dist)  # mean number of particles
print 'mean_N=', mean_N
>>> 58937
sigma = np.sqrt(sum((E_Dist - mean_N)**2)/len(E_Dist) - 1)
print 'sigma =', sigma  #/1.e9,'*1.e9'
>>> 119888.7
##########
# def gaus(x, a, x0, sigma):
#     return a*np.exp(-(x-x0)**2/(2*sigma**2))  # Gaussian function with amplitude
# popt, pcov = curve_fit(gaus, E, E_Dist, p0=[2, E_cen, sigma])
def gaus(x, x0, sigma):
    return np.exp(-(x-x0)**2/(2*sigma**2))
popt, pcov = curve_fit(gaus, E, E_Dist, p0=[E_cen, sigma])  # no amplitude
#################
print 'min(popt)=', min(popt)
>>> 10.54
print 'max(popt)=', max(popt)
>>> 119800
print 'mean(popt)=', np.mean(popt)
>>> 59940
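For reference, one standard way to get starting guesses here is to treat E_Dist as weights on E and take the weighted first and second moments. A minimal sketch, assuming E and E_Dist are equal-length 1-D NumPy arrays:
import numpy as np

# E_Dist acts as the (unnormalized) weight of each E value
E_cen = np.sum(E*E_Dist)/np.sum(E_Dist)                        # weighted mean
sigma = np.sqrt(np.sum((E - E_cen)**2*E_Dist)/np.sum(E_Dist))  # weighted standard deviation
amp = np.max(E_Dist)                                           # rough amplitude guess
These values would then go into p0 for the three-parameter (amplitude, centre, sigma) form of gaus.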
I am trying to understand the implementation that is used in
scipy.stats.wasserstein_distance
for p=1 and no weights, with u_values, v_values the two 1-D distributions, the code comes down to
u_sorter = np.argsort(u_values)                                             # (1)
v_sorter = np.argsort(v_values)
all_values = np.concatenate((u_values, v_values))                           # (2)
all_values.sort(kind='mergesort')
deltas = np.diff(all_values)                                                 # (3)
u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')   # (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')
v_cdf = v_cdf_indices / v_values.size                                        # (5)
u_cdf = u_cdf_indices / u_values.size
return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))                   # (6)
What is the reasoning behind this implementation, is there some literature?
I did look at the paper cited, which I believe explains why calculating the Wasserstein distance in its general definition in 1D is equivalent to evaluating the integral
\int_{-\infty}^{+\infty} |U(x) - V(x)| \, dx,
with U and V the cumulative distribution functions of the distributions u_values and v_values, but I don't understand how this integral is evaluated in the SciPy implementation.
In particular,
a) why are they multiplying by the deltas in (6) to solve the integral?
b) how are v_cdf and u_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the element order of the distribution u_values and v_values is not preserved. Shouldn't this be the case in the general Wasserstein distance definition?
Thank you for your help!
The order of the PDF, histogram or KDE is preserved and is important in the Wasserstein distance. If you only pass u_values and v_values, the function has to calculate something like a PDF, KDE or histogram itself. Normally you would provide the PDF and the range of U and V as the four arguments to wasserstein_distance. So when only samples are passed, you are not passing real data points, simply a collection of repeated "experiments". Steps (1) and (4) in your list essentially bin your data by its distinct values. A CDF is the number of discrete values up to a point, i.e. P(X <= x); it is the cumulative sum of a PDF, histogram or KDE. Step (5) normalizes the CDF to between 0.0 and 1.0 by dividing the counts by the number of samples.
So the order of the discrete values is preserved, just not the original order within the data point.
b) It may make more sense if you plot the CDFs of a data point, such as an image file, using the code above.
The transportation problem, however, may not need a PDF, but rather a data point of ordered features, or some way to measure distance between features, in which case you would calculate it differently.
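To make (4)-(6) concrete, here is a small self-contained check (a sketch with made-up sample arrays): between consecutive values of the pooled, sorted samples the two empirical CDFs are constant, so the integral of |U - V| collapses to a sum of rectangle areas, which is exactly what (6) computes.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
u = rng.normal(0.0, 1.0, size=50)   # made-up samples
v = rng.normal(0.5, 1.5, size=80)

all_values = np.sort(np.concatenate((u, v)))
deltas = np.diff(all_values)
# empirical CDF at the left edge of each interval:
# (number of samples <= x) / (sample size)
u_cdf = np.searchsorted(np.sort(u), all_values[:-1], side='right')/u.size
v_cdf = np.searchsorted(np.sort(v), all_values[:-1], side='right')/v.size
manual = np.sum(np.abs(u_cdf - v_cdf)*deltas)

print(manual, wasserstein_distance(u, v))  # the two numbers should agree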
I am trying to find the correlation function of the following stochastic process:
where beta and D are constants and xi(t) is a Gaussian noise term.
After simulating this process with the Euler method, I want to find its autocorrelation function. First of all, I found an analytical solution for the correlation function, and I also computed it numerically straight from the definition of the autocorrelation; the two results were pretty close (please see the photo; the corresponding code is at the end of this post).
(Figure 1)
Now I want to use the Wiener-Khinchin theorem to find the correlation function via the FFT: take the FFT of the realization, multiply it by its complex conjugate, and take the inverse FFT to get the correlation function. But I am getting results that are way off the expected correlation function, so I am pretty sure there is something I have misunderstood in the code.
Here is my code for the solution of the stochastic process (which I am fairly sure is right, although it might be sloppy) and my attempt to find the autocorrelation with the FFT:
import math
import random
import numpy as np
import matplotlib.pyplot as plt

N = 1000000
dt = 0.01
gamma = 1
D = 1
v_data = []
v_factor = math.sqrt(2*D*dt)
v = 1
for t in range(N):
    F = random.gauss(0, 1)
    v = v - gamma*dt + v_factor*F
    if v < 0:  # boundary conditions
        v = -v
    v_data.append(v)

def S(x, dt):  # power spectrum
    N = len(x)
    fft = np.fft.fft(x)
    s = fft*np.conjugate(fft)
    # n = N*np.ones(N) - np.arange(0, N)  # divide res(m) by (N-m)
    return s.real/N

c = np.fft.ifft(S(v_data, 0.01))  # correlation function
t = np.linspace(0, 1000, len(c))
plt.plot(t, c.real, label='fft method')
plt.xlim(0, 20)
plt.legend()
plt.show()
And this is what I would get using this method for the correlation function,
And this is my code for the correlation function using the definition:
from scipy import special

def c_theo(t, b, d):  # obtained by integrating the solution of the SDE
    I1 = ((-t*d) + ((d**2)/(b**2)) - ((1/4)*(b**2)*(t**2)))*special.erfc(b*t/(2*np.sqrt(d*t)))
    I2 = (((d/b)*(np.sqrt(d*t/np.pi))) + ((1/2)*(b*t)*(np.sqrt(d*t/np.pi))))*np.exp(-((b**2)*t)/(4*d))
    return I1 + I2

# correlation function plotted in figure 1, computed from the definition of the autocorrelation
Ntau = 500
sum2 = np.zeros(Ntau)
c = np.zeros(Ntau)
v_mean = 0
for i in range(0, N):
    v_mean = v_mean + v_data[i]
v_mean = v_mean/N
for itau in range(0, Ntau):
    for i in range(0, N - 10*itau):
        sum2[itau] = sum2[itau] + v_data[i]*v_data[itau*10 + i]
    sum2[itau] = sum2[itau]/(N - itau*10)
    c[itau] = sum2[itau] - v_mean**2
t = np.arange(Ntau)*dt*10
plt.plot(t, c, label='numerically')
plt.plot(t, c_theo(t, 1, 1), label='analytically')
plt.legend()
plt.show()
So would someone please point out where the mistake in my code is, and how I could do this better to get the right correlation function?
There are two issues with the code that I can see.
As francis said in a comment, you need to subtract the mean from your signal to get the autocorrelation to reach zero.
You plot your autocorrelation function against the wrong x-axis values.
v_data is defined with:
N = 1000000   # 1e6
dt = 0.01     # 1e-2
meaning that t goes from 0 to 1e4. However:
t = np.linspace(0,1000,len(c))
meaning that you plot with t from 0 to 1e3. You should probably define t with
t = np.arange(N) * dt
Looking at the plot, I'd say that stretching the blue line by a factor 10 would make it line up with the red line quite well.
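Putting the two fixes together, a minimal sketch (reusing v_data and dt from the question) could look like this:
import numpy as np
import matplotlib.pyplot as plt

v = np.asarray(v_data)
v = v - v.mean()                   # fix 1: subtract the mean so the ACF decays to zero

fft = np.fft.fft(v)
acf = np.fft.ifft(fft*np.conjugate(fft)).real/len(v)

t = np.arange(len(v))*dt           # fix 2: the time axis runs from 0 to N*dt = 1e4
plt.plot(t, acf, label='fft method, corrected')
plt.xlim(0, 20)
plt.legend()
plt.show()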
I'm trying to fit a sine curve to this data distribution, but for some reason the fit comes out wrong:
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
from scipy.optimize import curve_fit
#=======================
#====== Analysis =======
#=======================
# sine curve fit
def fit_Sin(t, A, b, C):
    return A*np.sin(t*b) + C

# data extraction
t, y, y1 = np.loadtxt("new10_CoCore_5to20_BL.txt", unpack=True)
xdata = t
popt, pcov = curve_fit(fit_Sin, t, y)
print "A = %s , b = %s, C = %s" % (popt[0], popt[1], popt[2])
#=======================
#====== Plotting =======
#=======================
fig1 = plt.figure()
ax1 = fig1.add_subplot(111)
ax1.plot(t, y, ".")
ax1.plot(t, fit_Sin(t, *popt))
plt.show()
This fit badly underestimates the data. Any ideas why that is?
The data is provided here: https://www.dropbox.com/sh/72jnpkkk0jf3sjg/AAAb17JSPbqhQOWnI68xK7sMa?dl=0
Sine waves are extremely difficult to fit if your frequency guess is off. That is because with a sufficient number of cycles in the data, the guess will be out of phase with half the data and in phase with half of it for even a small error in the frequency. At that point, a straight line offers a better fit than a sine wave of different frequency. That is how Fourier transforms work by the way.
I can think of three ways to estimate the frequency well enough to allow a nonlinear least squares algorithm to take over:
Eyeball it. Subtract the x-values of two peaks in the GUI or even in the command line. If you have very low noise data, you can automate this process quite easily.
Use a Discrete Fourier transform. If your data is a sine wave with a single component, the first non-constant peak will give you the frequency. I have found this to require some additional tweaking, since the sampling frequency is often not an exact multiple of the sine wave frequency; a parabolic fit to the three points around the peak (including the peak itself) can help in that situation. A rough sketch of this approach is shown after this list.
Find where your data crosses the vertical offset. This is similar to #1 but is easier to automate for relatively non-noisy data. The wavelength is twice the distance between a pair of intersections.
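A rough sketch of option 2, without the parabolic refinement (assuming t is uniformly sampled and y is the signal loaded in the question):
import numpy as np

dt = t[1] - t[0]
freqs = np.fft.rfftfreq(len(y), d=dt)
spectrum = np.abs(np.fft.rfft(y - y.mean()))   # remove the constant offset first
f0 = freqs[np.argmax(spectrum[1:]) + 1]        # first non-constant peak
b_guess = 2*np.pi*f0                           # angular frequency for A*sin(b*t + d) + C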
Using #1, I can clearly see that your wavelength is 50. The initial guess for b should therefore be 2*np.pi/50. Also, don't forget to add a phase shift parameter to allow the fit to slide horizontally: A*sin(b*t + d) + C.
You will need to pass in an initial guess via the p0 parameter to curve_fit. A good eyeball estimate is p0=(0.55, np.pi/25, 0.0, -np.pi/25*12.5). The phase shift in your data appears to be a quarter period to the right, hence the 12.5.
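Concretely, with the question's variable names, the extended model and call might look like this (the parameter ordering A, b, C, d below is my reading of the p0 tuple above):
import numpy as np
from scipy.optimize import curve_fit

def fit_Sin(t, A, b, C, d):
    return A*np.sin(t*b + d) + C   # same model as the question, plus a phase shift d

# wavelength ~ 50 so b ~ 2*np.pi/50 = np.pi/25; phase shifted a quarter period (12.5)
popt, pcov = curve_fit(fit_Sin, t, y, p0=(0.55, np.pi/25, 0.0, -np.pi/25*12.5))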
I am currently in the process of writing an algorithm for fitting noisy sine waves with a single frequency component that I will submit to SciPy. Will update when I finish.
I need to convolve the following curve with a Gaussian function of specific parameters, centered at 3934.8 Å.
The problem I see is that my curve is a discrete array, while the Gaussian is a well-defined continuous function. How can I make this work?
To do this, you need to create a Gaussian that's discretized at the same spatial scale as your curve, then just convolve.
Specifically, say your original curve has N points that are uniformly spaced along the x-axis (where N will generally be somewhere between 50 and 10,000 or so). Then the point spacing along the x-axis will be (physical range)/(digital range) = (3940-3930)/N, and the code would look like this:
import numpy as np

dx = float(3940 - 3930)/N
gx = np.arange(-3*sigma, 3*sigma, dx)
gaussian = np.exp(-(gx/sigma)**2/2)
result = np.convolve(original_curve, gaussian, mode="full")
Here the Gaussian is zero-centered and does not include the offset you refer to (which to me would just add confusion, since convolution is by its nature a translating operation, so starting with something already translated is confusing).
I highly recommend keeping everything in real, physical units, as I did above. Then it's clear, for example, what the width of the gaussian is, etc.
I have two lists:
import numpy
x = numpy.array([7250, ... list of 600 ints ... ,7849])
y = numpy.array([2.4*10**-16, ... list of 600 floats ... , 4.3*10**-16])
They make a U-shaped curve.
Now I want to fit a Gaussian to that curve.
import pylab
from scipy.optimize import curve_fit

n = len(x)
mean = sum(y)/n
sigma = sum(y - mean)**2/n

def gaus(x, a, x0, sigma, c):
    return a*numpy.exp(-(x-x0)**2/(2*sigma**2)) + c

popt, pcov = curve_fit(gaus, x, y, p0=[-1, mean, sigma, -5])
pylab.plot(x, y, 'r-')
pylab.plot(x, gaus(x, *popt), 'k-')
pylab.show()
I just end up with the noisy original U-shaped curve and a straight horizontal line running through the curve.
I am not sure what the -1 and the -5 represent in the above code but I am sure that I need to adjust them or something else to get the gaussian curve. I have been playing around with possible values but to no avail.
Any ideas?
First of all, your variable sigma is actually variance, i.e. sigma squared --- http://en.wikipedia.org/wiki/Variance#Definition.
This confuses the curve_fit by giving it a suboptimal starting estimate.
Then, your fitting ansatz, gaus, includes an amplitude a and an offset c. Is this what you actually need? And the starting values are a = -1 (a negated bell shape) and offset c = -5. Where do these come from?
Here's what I'd do:
Fix your fitting model. Do you want just a Gaussian? Does it need to be normalized? If it does, then the amplitude a is fixed by sigma, etc.
Have a look at the actual data: what is the tail level (the offset), and what is the sign of the peak (the amplitude sign)?
If you actually want just a Gaussian without any bells and whistles, you might not need curve_fit at all: a Gaussian is fully defined by its first two moments, the mean and sigma. Calculate them as you do, plot the result over the data, and see whether you're not already all set.
p0 in your call to curve_fit gives the initial guesses for the parameters of your function other than x. In the above code you are telling curve_fit to use -1 as the initial guess for a, -5 as the initial guess for c, mean as the initial guess for x0, and sigma as the guess for sigma. curve_fit will then adjust these parameters to try to get a better fit. The problem is that your initial guesses are really bad given the orders of magnitude of your (x, y) values.
Think a little about the order of magnitude of the different Gaussian parameters. a should be around the size of your y values (10**-16), since at the peak of the Gaussian the exponential part is never larger than 1. x0 gives the position within your x values at which the exponential part of the Gaussian equals 1, so x0 should be around 7500, probably somewhere near the centre of your data. sigma indicates the width, or spread, of your Gaussian, so perhaps something in the hundreds, just as a guess. Finally, c is simply an offset that shifts the whole Gaussian up and down.
What I would recommend is, before fitting the curve, pick some values for a, x0, sigma and c that seem reasonable, plot the data together with the Gaussian, and play with a, x0, sigma and c until it looks at least somewhat like the fit you want; then use those values as the starting points in curve_fit's p0. The values I gave should get you started, but may not do exactly what you want. For instance, a probably needs to be negative if you want to flip the Gaussian to get a "U" shape.
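For example, an overlay of a hand-picked guess might look like this (all numbers below are illustrative placeholders, not fitted values; gaus, x and y are from the question):
import pylab

a0 = -2e-16     # roughly the size of the y values, negative to flip into a "U"
x0_0 = 7550     # somewhere near the centre of the x range
sigma0 = 150    # width guess, in the same units as x
c0 = 4e-16      # vertical offset near the tails

pylab.plot(x, y, 'r-')
pylab.plot(x, gaus(x, a0, x0_0, sigma0, c0), 'k--')  # check the overlap before fitting
pylab.show()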
Also, printing out the values that curve_fit settles on for a, x0, sigma and c might help you see what it is doing and whether it is on the right track to minimizing the residual of the fit.
I have had similar problems doing curve fitting with gnuplot: if the initial values are too far from what you want to fit, it moves the parameters in completely the wrong direction while minimizing the residuals, and you could probably do better by eye. Think of these functions as a way to fine-tune your by-eye estimates of these parameters.
hope that helps
I don't think you are estimating your initial guesses for mean and sigma correctly.
Take a look at the SciPy Cookbook here
I think it should look like this.
x = numpy.array([7250, ... list of 600 ints ... ,7849])
y = numpy.array([2.4*10**-16, ... list of 600 floats ... , 4.3*10**-16])
n = len(x)
mean = sum(x*y)/sum(y)
sigma = numpy.sqrt(abs(sum((x - mean)**2*y)/sum(y)))

def gaus(x, a, x0, sigma, c):
    return a*numpy.exp(-(x-x0)**2/(2*sigma**2)) + c

popt, pcov = curve_fit(gaus, x, y, p0=[-max(y), mean, sigma, min(x) + (max(x) - min(x))/2])
pylab.plot(x, gaus(x, *popt))
If anyone has a link to a simple explanation of why these are the correct moments, I would appreciate it. I am going on faith that the SciPy Cookbook got it right.
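For what it is worth, these expressions treat y as an (unnormalized) weight on each x value, so they are simply the first two weighted moments of x:
\mu = \frac{\sum_i x_i y_i}{\sum_i y_i}, \qquad \sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2 y_i}{\sum_i y_i}},
which reduce to the ordinary mean and standard deviation when all the weights are equal.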
Here is the solution, thanks to everyone.
import math
import numpy
import pylab
from scipy.optimize import curve_fit

x = numpy.array([7250, ... list of 600 ints ... ,7849])
y = numpy.array([2.4*10**-16, ... list of 600 floats ... , 4.3*10**-16])
n = len(x)
mean = sum(x)/n
sigma = math.sqrt(sum((x - mean)**2)/n)

def gaus(x, a, x0, sigma, c):
    return a*numpy.exp(-(x-x0)**2/(2*sigma**2)) + c

popt, pcov = curve_fit(gaus, x, y, p0=[-max(y), mean, sigma, min(x) + (max(x) - min(x))/2])
pylab.plot(x, gaus(x, *popt))
Maybe it is because I use MATLAB and fminsearch, or because my fits have to work on many fewer data points (~5-10), but I get much better results with the following starting values (as simple as they are):
a = max(y)-min(y);
imax= find(y==max(y),1);
mean = x(imax);
avg = sum(x.*y)./sum(y);
sigma = sqrt(abs(sum((x-avg).^2.*y) ./ sum(y)));
c = min(y);
The sigma works fine.
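In NumPy terms the same starting values would read roughly as follows (a sketch; x and y are the 1-D data arrays):
import numpy as np

a = max(y) - min(y)
mean = x[np.argmax(y)]                                    # x position of the peak
avg = np.sum(x*y)/np.sum(y)
sigma = np.sqrt(abs(np.sum((x - avg)**2*y)/np.sum(y)))    # weighted spread about avg
c = min(y)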