Asymmetric Gaussian Fit in Python

I'm trying to fit an asymmetric Gaussian to this data: http://ge.tt/99iNaL53 (csv file).
I have tried to use a skewed Gaussian model from lmfit, and also a spline, but I'm not able to get the Gaussian model to fit well, and the splines are not what I'm looking for (I don't want the spline to fit the data exactly, as shown below, and altering the level of smoothing isn't helping).
Here is code using the above data that produces the plot below. The second figure is an example of what I'm trying to achieve, with the goal of reading the rise and decay times from the fit.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import CubicSpline
from scipy.interpolate import UnivariateSpline
from lmfit.models import SkewedGaussianModel
data = np.loadtxt('data.csv', delimiter=',')
x = data[:,0]
y = data[:,1]
# Skewed Gaussian fit
model = SkewedGaussianModel()
params = model.make_params(amplitude=400, center=3, sigma=7, gamma=1)
result = model.fit(y, params, x=x)
# Cubic Spline
cs = CubicSpline(x, y)
x_range = np.arange(x[0], x[-1], 0.1)
# Univariate Spline (degree 1)
us = UnivariateSpline(x, y, k=1)
# Univariate Spline (degree 5; k sets the degree, not the smoothing)
us2 = UnivariateSpline(x, y, k=5)
plt.scatter(x, y, marker = '^', color = 'k', linewidth = 0.5, s = 10, label = 'data')
plt.plot(x_range, cs(x_range), label = 'Cubic Spline')
plt.plot(x_range, us(x_range), label = 'Univariate Spline, k = 1')
plt.plot(x_range, us2(x_range), label = 'Univariate Spline, k = 5')
plt.plot(x, result.best_fit, color = 'red', label = 'Skewed Gaussian Attempt')
plt.xlabel('x')
plt.ylabel('y')
plt.yscale('log')
plt.ylim(1,500)
plt.legend()
plt.show()

Is there a question here? I don't see one, actually.
That result from lmfit is the best fit to a skewed Gaussian model.
You've chosen to plot the result on a log scale. That completely changes the view of the quality of the fit, and of what is not fit well.
It seems like you're expecting a better fit, but not one that is *too* good. Well, it looks like your data are not perfectly represented by a single skewed Gaussian. It seems like you were not expecting them to be. You could try different forms for the model function, say a skewed Lorentzian or something. But your data have that low-x shoulder that definitely does not look like your uncited figure.
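If a single peak isn't enough, one option (a sketch of my own, not something from your post, reusing your x and y arrays) is a composite lmfit model, built with +, that adds a second skewed Gaussian for the low-x shoulder; all prefixes and starting values below are guesses:
from lmfit.models import SkewedGaussianModel

model = (SkewedGaussianModel(prefix='main_')
         + SkewedGaussianModel(prefix='shoulder_'))
params = model.make_params(
    main_amplitude=400, main_center=3, main_sigma=7, main_gamma=1,
    shoulder_amplitude=50, shoulder_center=1, shoulder_sigma=2, shoulder_gamma=0)
result = model.fit(y, params, x=x)
print(result.fit_report())  # check whether the shoulder component is justified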

I wrote something for J. Chem. Ed. [1] that involved fitting asymmetric Gaussian functions to data; you can find the core repo here [2], but below is a snippet on how I went about fitting a data set, where x = data[:,0] and y = data[:,1], to the type of function you're working with:
import numpy as np
from scipy.optimize import leastsq
from scipy.special import erf

# data is an (N, 2) array with x in column 0 and y in column 1
initials = [6.5, 13, 1, 0]  # initial guess: [area, location, scale, skew]

def asymGaussian(x, p):
    # skew-normal form: amplitude * Gaussian spread * skew term
    amp = p[0] / (p[2] * np.sqrt(2 * np.pi))
    spread = np.exp(-(x - p[1]) ** 2.0 / (2 * p[2] ** 2.0))
    skew = 1 + erf(p[3] * (x - p[1]) / (p[2] * np.sqrt(2)))
    return amp * spread * skew

def residuals(p, y, x):
    return y - asymGaussian(x, p)

# least-squares regression to optimize the initial parameters
cnsts = leastsq(
    residuals,
    initials,
    args=(
        data[:, 1],  # y values
        data[:, 0]   # x values
    ))[0]

y = asymGaussian(data[:, 0], cnsts)  # fitted curve evaluated at the x values
Finally, just plot(data[:,0], y). Hope this helps!
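A possible follow-up (my sketch, not part of the original answer), since the question asks about rise and decay times: read them off the fitted curve as the distances from the half-maximum crossings to the peak.
x_fine = np.linspace(data[:, 0].min(), data[:, 0].max(), 5000)
y_fine = asymGaussian(x_fine, cnsts)
peak = np.argmax(y_fine)
above = np.where(y_fine >= y_fine[peak] / 2.0)[0]  # indices above half-maximum
rise_time = x_fine[peak] - x_fine[above[0]]    # leading half-max crossing to peak
decay_time = x_fine[above[-1]] - x_fine[peak]  # peak to trailing half-max crossing
print(rise_time, decay_time)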
[1] https://pubs.acs.org/doi/10.1021/acs.jchemed.9b00818
[2] https://github.com/1mikegrn/pyGC

Related

Calculating R-square of a slope of specific part of a graph

As is tradition, I want to say that I am pretty new to Python.
I have a set of x and y values as a csv file, and my y values are pretty noisy. So far, I have managed to use a filter (scipy.signal.savgol_filter) to smooth the noise, plot my graph, and get a linear regression of my data where it shows a linear trend. This part is important, because my question is related to linear fitting of some part of the data. Here is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.signal
import os
from scipy.signal import savgol_coeffs
from sklearn.metrics import r2_score
from scipy.linalg import lstsq
plt.rc('lines',linewidth=1)
plt.rc('axes', labelsize=16)
plt.rc('xtick', labelsize=14)
plt.rc('ytick', labelsize=14)
plt.rc('legend', fontsize=10)
# define material parameters
deprate = 5.90E-07
#deposition rate unit: cm/s
Ms = 180000 # Si substrate modulus, unit: MPa
hs = 0.03 # Si substrate thickness, unit: cm
stressfac = Ms*hs**2/6 # stress prefactor, unit: MPa
def fit_slope(hfilm, curvature, h0, h1):
    # Use least-squares fitting to find the slope of the film thickness vs.
    # curvature curve between thickness h0 and h1.
    # Returns fitting parameters p; the least-squares line is y = p[1]*x + p[0].
    xdata = hfilm[(hfilm > h0) & (hfilm < h1)]
    ydata = curvature[(hfilm > h0) & (hfilm < h1)]
    A = xdata[:, np.newaxis] ** [0, 1]
    p, *_ = lstsq(A, ydata)
    return p
def read_MOSSdata(filename, deprate):
    data = pd.read_csv(filename, sep=r'\s*,\s*', engine='python')
    time = data['time [s]'][~data['time [s]'].isna()].to_numpy()
    curvature = data['Curvature'][~data['time [s]'].isna()]  # curvature unit: 1/cm
    hfilm = time * deprate  # film thickness unit: cm
    return time, hfilm, curvature
filename = (r'C:\Users\yavuz\01-0722-2.csv')
time, hfilm, curvature = read_MOSSdata(filename, deprate)
h0 = 0.00005
h1 = 0.00008
xdata = np.linspace(h0, h1, 500)
yhat = scipy.signal.savgol_filter(curvature, 21,1)
p = fit_slope(hfilm, yhat, h0, h1)
plt.plot(hfilm, curvature)
plt.plot(hfilm, yhat, color='red', label = 'filtered data')
plt.plot(xdata, p[1]*xdata + p[0], color='green', linewidth=4, label = 'linear fitting')
plt.xlabel("Film thickness (cm)")
plt.ylabel("Curvature(1/cm)")
print(f'fitted stress = {-p[1]*stressfac} MPa')
plt.legend(loc=0)
My question is: how do I calculate the R-squared value of this slope on my graph? I tried R-squared calculators like sklearn.metrics, but the problem is that I am limiting my x values to get the slope of a window, and all of the code I tried fails with ''expected x and y to have same length''. I would add the csv file, but it seems there is no such option. Thanks a lot for the help!
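A minimal sketch (my addition, reusing the variable names from the code above): restrict both the filtered data and the fitted line to the same thickness window so the lengths match, then call sklearn's r2_score.
from sklearn.metrics import r2_score

mask = (hfilm > h0) & (hfilm < h1)    # same window used in fit_slope
y_obs = yhat[mask]                    # filtered curvature inside the window
y_fit = p[1] * hfilm[mask] + p[0]     # fitted line evaluated inside the window
print(f'R-squared = {r2_score(y_obs, y_fit):.4f}')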

What could be causing incorrect 2-D interpolation in SciPy?

I have a rectilinear (not regular) grid of data (x, y, V), where V is the value at position (x, y). I would like to use this data source to interpolate my results, so that I can fill in the gaps and plot the interpolated values (inside the range) in the future. (I also need the griddata-style ability to evaluate arbitrary points inside the range.)
I looked at the documentation at SciPy and here.
Here is what I tried, and the result:
It clearly doesn't match the data.
# INTERPOLATION ATTEMPT?
from scipy.interpolate import Rbf
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
edges = np.linspace(-0.05, 0.05, 100)
centers = edges[:-1] + np.diff(edges[:2])[0] / 2.
XI, YI = np.meshgrid(centers, centers)
# use RBF
rbf = Rbf(x, y, z, epsilon=2)
ZI = rbf(XI, YI)
# plot the result
plt.subplots(1,figsize=(12,8))
X_edges, Y_edges = np.meshgrid(edges, edges)
lims = dict(cmap='viridis')
plt.pcolormesh(X_edges, Y_edges, ZI, shading='flat', **lims)
plt.scatter(x, y, 200, z, edgecolor='w', lw=0.1, **lims)
#decoration
plt.title('RBF interpolation?')
plt.xlim(-0.05, 0.05)
plt.ylim(-0.05, 0.05)
plt.colorbar()
plt.show()
For reference, here is my data (extracted); it has a circular pattern that I need the interpolation to recognize.
#DATA
experiment1raw = np.array([
[0,40,1,11.08,8.53,78.10,2.29],
[24,-32,2,16.52,11.09,69.03,3.37],
[8,-32,4,14.27,10.68,71.86,3.19],
[-8,-32,6,10.86,9.74,76.69,2.72],
[-24,-32,8,6.72,12.74,77.08,3.45],
[32,-24,9,18.49,13.67,64.32,3.52],
[-32,-24,17,6.72,12.74,77.08,3.45],
[16,-16,20,13.41,21.33,59.92,5.34],
[0,-16,22,12.16,14.67,69.04,4.12],
[-16,-16,24,9.07,13.37,74.20,3.36],
[32,-8,27,19.35,17.88,57.86,4.91],
[-32,-8,35,6.72,12.74,77.08,3.45],
[40,0,36,19.25,20.36,54.97,5.42],
[16,0,39,13.41,21.33,59.952,5.34],
[0,0,41,10.81,19.55,64.37,5.27],
[-16,0,43,8.21,17.83,69.34,4.62],
[-40,0,46,5.76,13.43,77.23,3.59],
[32,8,47,15.95,23.61,54.34,6.10],
[-32,8,55,5.97,19.09,70.19,4.75],
[16,16,58,11.27,26.03,56.36,6.34],
[0,16,60,9.19,24.94,60.06,5.79],
[-16,16,62,7.10,22.75,64.57,5.58],
[32,24,65,12.39,29.19,51.17,7.26],
[-32,24,73,5.40,24.55,64.33,5.72],
[24,32,74,10.03,31.28,50.96,7.73],
[8,32,76,8.68,30.06,54.34,6.92],
[-8,32,78,6.88,28.78,57.84,6.49],
[-24,32,80,5.83,26.70,61.00,6.46],
[0,-40,81,7.03,31.55,54.40,7.01],
])
#Atomic Percentages are set here
Cr1 = experiment1raw[:,3]
Mn1 = experiment1raw[:,4]
Fe1 = experiment1raw[:,5]
Co1 = experiment1raw[:,6]
#COORDINATE VALUES IN PRE-T
x_pret = experiment1raw[:,0]/1000
y_pret = experiment1raw[:,1]/1000
#important translation
x = -y_pret
y = -x_pret
z = Cr1
You used too large an epsilon in Rbf. Your best bet is to leave it at the default and let SciPy calculate an appropriate value (roughly the average distance between input nodes); see the implementation here.
So, using the default epsilon:
rbf = Rbf(x, y, z)
I got a pretty good interpolation for your data (subjective opinion).
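As a side note (my addition, not part of the original answer): newer SciPy versions recommend RBFInterpolator over the legacy Rbf class. An equivalent call, assuming the same x, y, z and XI, YI grid as above, would be:
from scipy.interpolate import RBFInterpolator
import numpy as np

points = np.column_stack([x, y])                  # (n, 2) observation coordinates
interp = RBFInterpolator(points, z)               # default thin-plate-spline kernel
grid = np.column_stack([XI.ravel(), YI.ravel()])  # (m, 2) evaluation coordinates
ZI = interp(grid).reshape(XI.shape)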

Python-Fitting 2D Gaussian to data set

I have data points in a .txt file (delimiter = white space); the first column is the x axis and the second is the y axis. I want to fit a 2D Gaussian to these data points using Python. Truth is, I don't understand the theory behind Gaussian fitting (either one- or two-dimensional). I have read similar posts here on Stack Overflow and got a code, but it's not fitting well. Please, someone help. Thanks.
Below is what I have in the .txt file:
3.369016418457e+02 3.761813938618e-01
3.369006652832e+02 4.078308343887e-01
3.368996887207e+02 4.220226705074e-01
3.368987121582e+02 4.200653433800e-01
3.368977355957e+02 4.454285204411e-01
3.368967590332e+02 4.156131148338e-01
3.368957824707e+02 3.989491164684e-01
3.368948059082e+02 4.512043893337e-01
3.368938293457e+02 4.565380811691e-01
3.368928527832e+02 4.095999598503e-01
3.368918762207e+02 4.196371734142e-01
3.368908996582e+02 4.002234041691e-01
3.368899230957e+02 4.133881926537e-01
3.368889465332e+02 4.394644796848e-01
3.368879699707e+02 4.504477381706e-01
3.368869934082e+02 3.946847021580e-01
3.368860168457e+02 4.214486181736e-01
3.368850402832e+02 3.753573596478e-01
3.368840637207e+02 3.673824667931e-01
3.368830871582e+02 4.088735878468e-01
3.368821105957e+02 4.351278841496e-01
3.368811340332e+02 4.393630325794e-01
3.368801574707e+02 4.210205972195e-01
3.368791809082e+02 4.322172403336e-01
3.368782043457e+02 4.652716219425e-01
3.368772277832e+02 5.251595377922e-01
3.368762512207e+02 5.873318314552e-01
3.368752746582e+02 6.823697686195e-01
3.368742980957e+02 8.375824093819e-01
3.368733215332e+02 9.335057735443e-01
3.368723449707e+02 1.083636641502e+00
3.368713684082e+02 1.170072913170e+00
3.368703918457e+02 1.224770784378e+00
3.368694152832e+02 1.158735513687e+00
3.368684387207e+02 1.131350398064e+00
3.368674621582e+02 1.073648810387e+00
3.368664855957e+02 9.659162163734e-01
3.368655090332e+02 8.495713472366e-01
3.368645324707e+02 7.637447714806e-01
3.368635559082e+02 6.956064105034e-01
3.368625793457e+02 6.713893413544e-01
3.368616027832e+02 5.285132527351e-01
3.368606262207e+02 4.968771338463e-01
3.368596496582e+02 5.077748298645e-01
3.368586730957e+02 4.686309695244e-01
3.368576965332e+02 4.693206846714e-01
3.368567199707e+02 4.462305009365e-01
3.368557434082e+02 3.872672021389e-01
3.368547668457e+02 4.243377447128e-01
3.368537902832e+02 3.918920457363e-01
3.368528137207e+02 3.848327994347e-01
3.368518371582e+02 4.093343317509e-01
3.368508605957e+02 4.321203231812e-01
Below is the code I have tried:
%pylab inline
import matplotlib.pyplot as plt
import numpy as np
import astropy
import scipy.optimize as opt
import pylab as plb
from scipy.optimize import curve_fit
from scipy import asarray as ar,exp
x,y=np.loadtxt('taper2reduced.txt', unpack= True, delimiter=' ')
mean = sum(x * y) / sum(y)
sigma = np.sqrt(sum(y * (x - mean)**2) / sum(y))
def Gauss(x, a, x0, sigma):
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2))
popt,pcov = curve_fit(Gauss, x, y, p0=[max(y), mean, sigma])
plt.plot(x, y, 'b+:', label='data')
plt.plot(x, Gauss(x, *popt), 'r-', label='fit')
plt.legend()
plt.title('Fig. 1 - Fit for Frequency')
plt.xlabel('Frequecy (GHz)')
plt.ylabel('Flux Density (mJy)')
plt.show()
Your problem is that your function doesn't reflect well your data set. You define a distribution between 0 and max_y, while in reality your data are between min_y and max_y. Change your function like this:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
#function declaration with additional offset parameter
def Gauss(x, a, x0, sigma, offset):
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2)) + offset
#loading x, y dataset
x, y = np.loadtxt('test.txt', unpack = True, delimiter=' ')
#calculate parameter of fit function with scipy's curve fitting algorithm
popt, pcov = curve_fit(Gauss, x, y, p0=[np.max(y), np.median(x), np.std(x), np.min(y)])
#plot original data
plt.plot(x, y, 'b+:', label='data')
#create different x value array for smooth fit function curve
x_fit = np.linspace(np.min(x), np.max(x), 1000)
#plot fit function
plt.plot(x_fit, Gauss(x_fit, *popt), 'r-', label='fit')
#beautify graph
plt.legend()
plt.title('Fig. 1 - Fit for Frequency')
plt.xlabel('Frequecy (GHz)')
plt.ylabel('Flux Density (mJy)')
plt.show()
Output:
You might have noticed that I changed two more things.
I de-cluttered the imports. It is not a good idea to load a lot of unused functions and modules into your namespace.
And I changed the starting-parameter estimates. We don't have to be exact here; an approximation that doesn't need much code and is fast will usually do.
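One more thing you might add (my sketch, not part of the original answer): report the fitted parameters with one-sigma uncertainties taken from the covariance matrix that curve_fit returns.
perr = np.sqrt(np.diag(pcov))  # one-sigma uncertainties of the fit parameters
for name, val, err in zip(('a', 'x0', 'sigma', 'offset'), popt, perr):
    print(f'{name} = {val:.4g} +/- {err:.2g}')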
Note that you say you want to fit the data to a 2-D distribution, but the code shows how to fit a 1-D distribution.
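For completeness, a minimal sketch (my addition, with made-up example data, not the poster's) of fitting a genuine 2-D Gaussian with curve_fit: flatten the coordinate grid so the model maps a pair of 1-D coordinate arrays to a 1-D vector of values.
import numpy as np
from scipy.optimize import curve_fit

def gauss2d(xy, a, x0, y0, sx, sy, offset):
    x, y = xy
    return a * np.exp(-((x - x0)**2 / (2 * sx**2) + (y - y0)**2 / (2 * sy**2))) + offset

# synthetic grid and data, purely for illustration
X, Y = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
Z = gauss2d((X.ravel(), Y.ravel()), 1.0, 0.5, -0.3, 0.8, 1.2, 0.1)
popt, pcov = curve_fit(gauss2d, (X.ravel(), Y.ravel()), Z, p0=[1, 0, 0, 1, 1, 0])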

Curve fit or interpolation in a semilogy plot using scipy

I have very few data points, and I want to create a line to best fit the data points when plotted on a semilogy scale. I have tried curve fitting and cubic interpolation from scipy, but neither seems very reasonable to me compared to the data trend.
I would kindly ask you to check if there is a more efficient way to create a straight-line fit for the data. Probably extrapolation could do it, but I did not find good documentation on extrapolation in Python.
Your help is very much appreciated.
import sys
import os
import numpy
import matplotlib.pyplot as plt
from pylab import *
from scipy.optimize import curve_fit
import scipy.optimize as optimization
from scipy.interpolate import interp1d
from scipy import interpolate
Mass500 = numpy.array([ 13.938 , 13.816, 13.661, 13.683, 13.621, 13.547, 13.477, 13.492, 13.237,
13.232, 13.07, 13.048, 12.945, 12.861, 12.827, 12.577, 12.518])
y500 = numpy.array([ 7.65103978e-06, 4.79865790e-06, 2.08218909e-05, 4.98385924e-06,
5.63462673e-06, 2.90785458e-06, 2.21166794e-05, 1.34501705e-06,
6.26021870e-07, 6.62368879e-07, 6.46735547e-07, 3.68589447e-07,
3.86209019e-07, 5.61293275e-07, 2.41428755e-07, 9.62491134e-08,
2.36892162e-07])
plt.semilogy(Mass500, y500, 'o')
# interpolation
f2 = interp1d(Mass500, y500, kind='cubic')
plt.semilogy(Mass500, f2(Mass500), '--')
# curve-fit
def line(x, a, b):
    return 10**(a*x + b)
#Initial guess.
x0 = numpy.array([1.e-6, 1.e-6])
print(optimization.curve_fit(line, Mass500, y500, x0))
popt, pcov = curve_fit(line, Mass500, y500)
print(popt)
plt.semilogy(Mass500, line(Mass500, popt[0], popt[1]), 'r-')
plt.legend(['data', 'cubic', 'curve-fit'], loc='best')
show()
There are many regression functions available in numpy and scipy.
scipy.stats.linregress is one of the simpler functions, and it returns common linear regression parameters.
Here are two options for fitting semi-log data:
Plot Transformed Data
Rescale Axes and Transform Input/Output Function Values
Given
import numpy as np
import scipy as sp
import scipy.stats  # linregress lives here; make sure the submodule is loaded
import matplotlib.pyplot as plt
%matplotlib inline
# Data
mass500 = np.array([
13.938 , 13.816, 13.661, 13.683,
13.621, 13.547, 13.477, 13.492,
13.237, 13.232, 13.07, 13.048,
12.945, 12.861, 12.827, 12.577,
12.518
])
y500 = np.array([
7.65103978e-06, 4.79865790e-06, 2.08218909e-05, 4.98385924e-06,
5.63462673e-06, 2.90785458e-06, 2.21166794e-05, 1.34501705e-06,
6.26021870e-07, 6.62368879e-07, 6.46735547e-07, 3.68589447e-07,
3.86209019e-07, 5.61293275e-07, 2.41428755e-07, 9.62491134e-08,
2.36892162e-07
])
Code
Option 1: Plot Transformed Data
# Regression Function
def regress(x, y):
    """Return a tuple of predicted y values and parameters for linear regression."""
    p = sp.stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)  # np.polyval replaces the removed sp.polyval
    return y_pred, p
# Plotting
x, y = mass500, np.log(y500) # transformed data
y_pred, _ = regress(x, y)
plt.plot(x, y, "mo", label="Data")
plt.plot(x, y_pred, "k--", label="Pred.")
plt.xlabel("Mass500")
plt.ylabel("log y500") # label axis
plt.legend()
Output
A simple approach is to plot transformed data and label the appropriate log axes.
Option 2: Rescale Axes and Transform Input/Output Function Values
Code
x, y = mass500, y500 # data, non-transformed
y_pred, _ = regress(x, np.log(y)) # transformed input
plt.plot(x, y, "o", label="Data")
plt.plot(x, np.exp(y_pred), "k--", label="Pred.") # transformed output
plt.xlabel("Mass500")
plt.ylabel("y500")
plt.semilogy()
plt.legend()
Output
A second option is to alter the axes to semi-log scales (via plt.semilogy()). Here the non-transformed data naturally appears linear. Also notice the labels represent the data as-is.
To make an accurate regression, all that remains is to transform the data passed into the regression function (via np.log(x) or np.log10(x)) in order to return the proper regression parameters. This transformation is immediately reversed when plotting predicted values, using a complementary operation, i.e. np.exp(x) or 10**x.
If you want a line that will look good on log-y scale, then fit a line to the logarithms of the y-values.
def line(x, a, b):
    return a*x + b
popt, pcov = curve_fit(line, Mass500, np.log10(y500))
plt.semilogy(Mass500, 10**line(Mass500, popt[0], popt[1]), 'r-')
This is it; I only left out the cubic interpolation part which didn't seem relevant.

plot individual peaks after gaussian curve fitting with python-lmfit

From this piece of code I can plot the final fit with out.best_fit. What I would like to do now is plot each of the peaks as an individual Gaussian curve, instead of all of them merged into one single curve.
from pylab import *
from lmfit import minimize, Parameters, report_errors
from lmfit.models import GaussianModel, LinearModel, SkewedGaussianModel
from scipy.interpolate import interp1d
from numpy import *
fit_data = interp1d(x_data, y_data)
mod = LinearModel()
pars = mod.make_params(slope=0.0, intercept=0.0)
pars['slope'].set(vary=False)
pars['intercept'].set(vary=False)
x_peak = [278.35, 334.6, 375]
y_peak = [fit_data(x) for x in x_peak]
i = 0
for x, y in zip(x_peak, y_peak):
    sigma = 1.0
    A = y*sqrt(2.0*pi)*sigma
    prefix = 'g' + str(i) + '_'
    peak = GaussianModel(prefix=prefix)
    pars.update(peak.make_params(center=x, sigma=1.0, amplitude=A))
    pars[prefix+'center'].set(min=x-20.0, max=x+20.0)
    pars[prefix+'amplitude'].set(min=0.0)
    mod = mod + peak
    i += 1
out = mod.fit(y_data, pars, x=x_data)
plt.figure(1)
plt.plot(x_data, y_data)
plt.figure(1)
plt.plot(x_data, out.best_fit, '--')
Plot of the global fit:
I think you want to do this after your fit:
components = out.eval_components(x=x_data)
for model_name, model_value in components.items():
    plt.plot(x_data, model_value)

# or more simply, if you prefer:
plt.plot(x_data, components['g0_'])
plt.plot(x_data, components['g1_'])
...
That is, ModelResult.eval_components() for a composite model will return a dictionary with keys that are the prefixes of the component models, and values that are the calculated model for that component.
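If you also want to draw the linear background component, my understanding (worth double-checking against your lmfit version) is that an unprefixed component is keyed by its model name, here 'linear':
plt.plot(x_data, components['linear'], ':', label='background')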
