I want to test how close the distribution of a data set is to a Gaussian with mean=0 and variance=1.
The kuiper test from astropy.stats has a cdf parameter, from the documentation: "A callable to evaluate the CDF of the distribution being tested against. Will be called with a vector of all values at once. The default is a uniform distribution", but I don't know how to use this to test against a normal distribution. What if I want e.g. a normal distribution with mean 0.2 and variance 2?
So I used kuiper_two, also from astropy, and created a random normal distribution. See example below.
The problem I see with this is that the result depends on the number of data points I generate to compare against. If I had used 100 instead of 10000 data points, the probability (fpp) would have risen to 43%.
I guess the question is, how do I do this properly? Also, how do I interpret the D number?
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from astropy.stats import kuiper_two

# create data and its empirical CDF
np.random.seed(0)
data = np.random.normal(loc=0.02, scale=1.15, size=50)
data_sort = np.sort(data)
data_cdf = [x/len(data) for x in range(0, len(data))]
# create the normal data with mean 0 and variance 1
xx = np.random.normal(loc=0, scale=1, size=10000)
xx_sort = np.sort(xx)
xx_cdf = [x/len(xx) for x in range(0, len(xx))]
# compute the pdf for a plot
x = np.linspace(-4, 4, 50)
x_pdf = stats.norm.pdf(x, 0, 1)
# we can see it all in a plot
fig, ax = plt.subplots(figsize=(8, 6))
plt.hist(xx, bins=20, density=True, stacked=True, histtype='stepfilled', alpha=0.6)
plt.hist(data, density=True, stacked=True, histtype='step', lw=3)
plt.plot(x, x_pdf, lw=3, label=r'G($\mu=0$, $\sigma^2=1$)')
ax2 = ax.twinx()
ax2.plot(xx_sort, xx_cdf, marker='o', ms=8, mec='green', mfc='green', ls='None')
ax2.plot(data_sort, data_cdf, marker='^', ms=8, mec='orange', mfc='orange', ls='None')
# Kuiper test
D, fpp = kuiper_two(data_sort, xx_sort)
print('# D number =', round(D, 5))
print('# fpp =', round(fpp, 5))
# Which resulted in:
# D number = 0.211
# fpp = 0.14802
astropy.stats.kuiper expects as first argument a sample from the distribution you want to test, and as second argument the CDF of the distribution you want to test against.
This argument is a callable that itself takes a sample (or an array of samples) and returns the corresponding value(s) of the cumulative distribution function. You can use scipy.stats' CDFs for that, and with functools.partial you can fix any parameters.
import numpy as np
from scipy.stats import norm
from astropy.stats import kuiper
from functools import partial
from random import shuffle
np.random.seed(0)
data = np.random.normal(loc=0.02, scale=1.15, size=50)
print(kuiper(data, partial(norm.cdf, loc=0.2, scale=2.0)))
# Output: (0.2252118027033838, 0.08776036566607946)
# The data does not have to be sorted, in case you wondered:
shuffle(data)
print(kuiper(data, partial(norm.cdf, loc=0.2, scale=2.0)))
# Output: (0.2252118027033838, 0.08776036566607946)
The diagram in the Wikipedia article about this test gives an idea of what the Kuiper statistic V measures: it is the sum of the largest deviation above and the largest deviation below the comparison CDF.
If the comparison distribution shares the parameters the data was actually generated with, the distance is smaller and the estimated probability that the respective underlying CDFs are identical rises:
print(kuiper(data, partial(norm.cdf, loc=0.02, scale=1.15)))
# Output: (0.14926352419821276, 0.68365004302431)
The function astropy.stats.kuiper_two, in contrast, expects two samples of empirical data to compare with one another. So if you want to compare against a distribution with a tractable CDF, it is preferable to use that CDF directly (with kuiper) rather than to sample from the comparison distribution (and use kuiper_two).
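To see the difference in practice, here is a minimal sketch (the seed and the comparison sample sizes are arbitrary choices of mine) showing how the kuiper_two result drifts with the size of the drawn comparison sample, while kuiper with the analytic CDF has no sampling noise on the comparison side:
import numpy as np
from astropy.stats import kuiper, kuiper_two
from scipy.stats import norm
from functools import partial
np.random.seed(0)
data = np.random.normal(loc=0.02, scale=1.15, size=50)
# kuiper_two: the fpp depends on how many comparison points happen to be drawn
for n in (100, 10000):
    sample = np.random.normal(loc=0.0, scale=1.0, size=n)
    print(n, kuiper_two(np.sort(data), np.sort(sample)))
# kuiper with the analytic CDF: no second sample needed
print(kuiper(data, partial(norm.cdf, loc=0.0, scale=1.0)))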
One nit-pick: apart from the fact that the *_cdf variables are not actually CDFs, this formulation is much more readable than the list comprehensions above:
data_cdf = np.linspace(0.0, 1.0, len(data), endpoint=False)
xx_cdf = np.linspace(0.0, 1.0, len(xx), endpoint=False)
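A quick sanity check (plain numpy, not part of the original code) that the linspace formulation matches the list comprehension used above:
import numpy as np
n = 7
assert np.allclose(np.linspace(0.0, 1.0, n, endpoint=False),
                   [i / n for i in range(n)])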
I'm trying to fit an asymmetric Gaussian to this data: http://ge.tt/99iNaL53 (csv file).
I have tried to use a skewed Gaussian model from lmfit, and also a spline, but I'm not able to get the Gaussian model to fit well and the splines are not what I'm looking for (I don't want the spline to fit the data exactly as shown below, and altering the level of smoothing isn't helping).
Here is code using the above data that produces the plot below. The second figure is an example of what I'm trying to achieve with the goal of reading the rise and decay time from the fit.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import CubicSpline
from scipy.interpolate import UnivariateSpline
from lmfit.models import SkewedGaussianModel
data = np.loadtxt('data.csv', delimiter=',')
x = data[:,0]
y = data[:,1]
# Skewed Gaussian fit
model = SkewedGaussianModel()
params = model.make_params(amplitude=400, center=3, sigma=7, gamma=1)
result = model.fit(y, params, x=x)
# Cubic Spline
cs = CubicSpline(x, y)
x_range = np.arange(x[0], x[-1], 0.1)
# Univariate Spline
us = UnivariateSpline(x, y, k = 1)
# Univariate Spline (smoothed)
us2 = UnivariateSpline(x, y, k = 5)
plt.scatter(x, y, marker = '^', color = 'k', linewidth = 0.5, s = 10, label = 'data')
plt.plot(x_range, cs(x_range), label = 'Cubic Spline')
plt.plot(x_range, us(x_range), label = 'Univariate Spline, k = 1')
plt.plot(x_range, us2(x_range), label = 'Univariate Spline, k = 5')
plt.plot(x, result.best_fit, color = 'red', label = 'Skewed Gaussian Attempt')
plt.xlabel('x')
plt.ylabel('y')
plt.yscale('log')
plt.ylim(1,500)
plt.legend()
plt.show()
Is there a question here? I don't see one, actually.
That result from lmfit is the best fit to a skewed Gaussian model.
You've chosen to plot the result on a log-scale. That completely changes the view of the quality of the fit or what is not fit well.
It seems like you're expecting a better fit, but not a *too* good one. Well, it looks like your data is simply not perfectly represented by a single skewed Gaussian, and it seems you were not expecting it to be. You could try different forms for the model function, say a skewed Lorentzian or something similar. But your data has that low-x shoulder which definitely does not look like the (uncited) figure you are aiming for.
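If you do want to chase that low-x shoulder, here is a hedged sketch (the second component, the prefixes, and all starting values are guesses of mine, not something derived from your data) of a composite lmfit model:
import numpy as np
from lmfit.models import SkewedGaussianModel, GaussianModel
data = np.loadtxt('data.csv', delimiter=',')  # same file as in the question
x, y = data[:, 0], data[:, 1]
# main skewed peak plus a small extra Gaussian to soak up the shoulder
main = SkewedGaussianModel(prefix='main_')
shoulder = GaussianModel(prefix='sh_')
model = main + shoulder
params = model.make_params(
    main_amplitude=400, main_center=3, main_sigma=7, main_gamma=1,
    sh_amplitude=50, sh_center=-5, sh_sigma=2,
)
result = model.fit(y, params, x=x)
print(result.fit_report())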
I wrote something for J. Chem. Ed. [1] that involved fitting asymmetric Gaussian functions to data; you can find the core repo here [2]. Below is a snippet showing how I went about fitting a data set, where x = data[:,0] and y = data[:,1], to the type of function you're working with:
import numpy as np
from scipy.optimize import leastsq
from scipy.special import erf
# data is an (N, 2) array: data[:, 0] = x values, data[:, 1] = y values
initials = [6.5, 13, 1, 0]  # initial guess for [amplitude, center, width, skew]

def asymGaussian(x, p):
    amp = (p[0] / (p[2] * np.sqrt(2 * np.pi)))
    spread = np.exp((-(x - p[1]) ** 2.0) / (2 * p[2] ** 2.0))
    skew = (1 + erf((p[3] * (x - p[1])) / (p[2] * np.sqrt(2))))
    return amp * spread * skew

def residuals(p, y, x):
    return y - asymGaussian(x, p)

# execute least-squares regression to optimize the initial parameters
cnsts = leastsq(
    residuals,
    initials,
    args=(
        data[:, 1],  # y values
        data[:, 0]   # x values
    ))[0]

fit = asymGaussian(data[:, 0], cnsts)
Finally, just plot fit against data[:, 0]. Hope this helps!
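For completeness, a quick sketch of that final plotting step, using the data array and the fit values from the snippet above:
import matplotlib.pyplot as plt
plt.scatter(data[:, 0], data[:, 1], s=10, color='k', label='data')
plt.plot(data[:, 0], fit, color='r', label='asymmetric Gaussian fit')
plt.legend()
plt.show()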
[1] https://pubs.acs.org/doi/10.1021/acs.jchemed.9b00818
[2] https://github.com/1mikegrn/pyGC
I have very few data points and I want to create a line of best fit through them when plotted on a semilogy scale. I have tried curve_fit and cubic interpolation from scipy, but neither seems very reasonable to me compared with the data trend.
I would kindly ask you to check whether there is a more effective way to create a straight-line fit for the data. Probably extrapolation would do, but I could not find good documentation on extrapolation in Python.
Your help is very much appreciated.
import sys
import os
import numpy
import matplotlib.pyplot as plt
from pylab import *
from scipy.optimize import curve_fit
import scipy.optimize as optimization
from scipy.interpolate import interp1d
from scipy import interpolate
Mass500 = numpy.array([ 13.938 , 13.816, 13.661, 13.683, 13.621, 13.547, 13.477, 13.492, 13.237,
13.232, 13.07, 13.048, 12.945, 12.861, 12.827, 12.577, 12.518])
y500 = numpy.array([ 7.65103978e-06, 4.79865790e-06, 2.08218909e-05, 4.98385924e-06,
5.63462673e-06, 2.90785458e-06, 2.21166794e-05, 1.34501705e-06,
6.26021870e-07, 6.62368879e-07, 6.46735547e-07, 3.68589447e-07,
3.86209019e-07, 5.61293275e-07, 2.41428755e-07, 9.62491134e-08,
2.36892162e-07])
plt.semilogy(Mass500, y500, 'o')
# interpolation
f2 = interp1d(Mass500, y500, kind='cubic')
plt.semilogy(Mass500, f2(Mass500), '--')
# curve-fit
def line(x, a, b):
    return 10**(a*x + b)
#Initial guess.
x0 = numpy.array([1.e-6, 1.e-6])
print(optimization.curve_fit(line, Mass500, y500, x0))
popt, pcov = curve_fit(line, Mass500, y500)
print(popt)
plt.semilogy(Mass500, line(Mass500, popt[0], popt[1]), 'r-')
plt.legend(['data', 'cubic', 'curve-fit'], loc='best')
show()
There are many regression functions available in numpy and scipy.
scipy.stats.linregress is one of the simpler functions, and it returns common linear regression parameters.
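As a minimal sketch on the arrays from the question (regressing log10(y500) against Mass500):
import numpy as np
from scipy import stats
result = stats.linregress(Mass500, np.log10(y500))
print(result.slope, result.intercept, result.rvalue**2)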
Here are two options for fitting semi-log data:
Plot Transformed Data
Rescale Axes and Transform Input/Output Function Values
Given
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
# Data
mass500 = np.array([
13.938 , 13.816, 13.661, 13.683,
13.621, 13.547, 13.477, 13.492,
13.237, 13.232, 13.07, 13.048,
12.945, 12.861, 12.827, 12.577,
12.518
])
y500 = np.array([
7.65103978e-06, 4.79865790e-06, 2.08218909e-05, 4.98385924e-06,
5.63462673e-06, 2.90785458e-06, 2.21166794e-05, 1.34501705e-06,
6.26021870e-07, 6.62368879e-07, 6.46735547e-07, 3.68589447e-07,
3.86209019e-07, 5.61293275e-07, 2.41428755e-07, 9.62491134e-08,
2.36892162e-07
])
Code
Option 1: Plot Transformed Data
# Regression Function
def regress(x, y):
    """Return a tuple of predicted y values and parameters for linear regression."""
    p = stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)
    return y_pred, p
# Plotting
x, y = mass500, np.log(y500) # transformed data
y_pred, _ = regress(x, y)
plt.plot(x, y, "mo", label="Data")
plt.plot(x, y_pred, "k--", label="Pred.")
plt.xlabel("Mass500")
plt.ylabel("log y500") # label axis
plt.legend()
Output
A simple approach is to plot transformed data and label the appropriate log axes.
Option 2: Rescale Axes and Transform Input/Output Function Values
Code
x, y = mass500, y500 # data, non-transformed
y_pred, _ = regress(x, np.log(y)) # transformed input
plt.plot(x, y, "o", label="Data")
plt.plot(x, np.exp(y_pred), "k--", label="Pred.") # transformed output
plt.xlabel("Mass500")
plt.ylabel("y500")
plt.semilogy()
plt.legend()
Output
A second option is to alter the axes to semi-log scales (via plt.semilogy()). Here the non-transformed data naturally appears linear. Also notice the labels represent the data as-is.
To make an accurate regression, all that remains is to transform the data passed into the regression function (via np.log(x) or np.log10(x)) in order to return the proper regression parameters. This transformation is immediately reversed when plotting predicted values, using the complementary operation, i.e. np.exp(x) or 10**x.
If you want a line that will look good on log-y scale, then fit a line to the logarithms of the y-values.
def line(x, a, b):
    return a*x + b
popt, pcov = curve_fit(line, Mass500, np.log10(y500))
plt.semilogy(Mass500, 10**line(Mass500, popt[0], popt[1]), 'r-')
This is it; I only left out the cubic interpolation part which didn't seem relevant.
I would like to fit multiple Gaussian curves to Mass spectrometry data in Python. Right now I'm fitting the data one Gaussian at a time -- literally one range at a time.
Is there a more streamlined way to do this? Is there a way I can run the data through a loop to plot a Gaussian at each peak? I'm guessing there's gotta be a better way, but I've combed through the internet without finding one.
My graph for two Gaussians is shown below.
My example data can be found at: http://txt.do/dooxv
And here's my current code:
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt
from scipy.interpolate import interp1d
RGAdata = np.loadtxt("/Users/ilenemitchell/Desktop/RGAscan.txt", skiprows=14)
RGAdata=RGAdata.transpose()
x=RGAdata[0]
y=RGAdata[1]
# graph labels
plt.ylabel('ion current')
plt.xlabel('mass/charge ratio')
plt.xticks(np.arange(min(RGAdata[0]), max(RGAdata[0])+2, 2.0))
plt.ylim([10**-12.5, 10**-9])
plt.title('RGA Data Jul 25, 2017')
plt.semilogy(x, y,'b')
# fitting a Gaussian to a peak
def gauss(x, a, mu, sig):
    return a*np.exp(-(x-mu)**2/(2*sig**2))
fitx=x[(x>40)*(x<43)]
fity=y[(x>40)*(x<43)]
mu=np.sum(fitx*fity)/np.sum(fity)
sig=np.sqrt(np.sum(fity*(fitx-mu)**2)/np.sum(fity))
print (mu, sig, max(fity))
popt, pcov = opt.curve_fit(gauss, fitx, fity, p0=[max(fity),mu, sig])
plt.semilogy(x, gauss(x, popt[0],popt[1],popt[2]), 'r-', label='fit')
# second Gaussian
fitx2=x[(x>26)*(x<31)]
fity2=y[(x>26)*(x<31)]
mu=np.sum(fitx2*fity2)/np.sum(fity2)
sig=np.sqrt(np.sum(fity2*(fitx2-mu)**2)/np.sum(fity2))
print (mu, sig, max(fity2))
popt2, pcov2 = opt.curve_fit(gauss, fitx2, fity2, p0=[max(fity2),mu, sig])
plt.semilogy(x, gauss(x, popt2[0],popt2[1],popt2[2]), 'm', label='fit2')
plt.show()
In addition to Alex F's answer, you need to identify peaks and analyze their surroundings to identify the xmin and xmax values.
If you have done that, you can use this slightly refactored code, and the loop within it, to plot all relevant data:
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt
from scipy.interpolate import interp1d
def _gauss(x, a, mu, sig):
    return a*np.exp(-(x-mu)**2/(2*sig**2))

def gauss(x, y, xmin, xmax):
    fitx = x[(x>xmin)*(x<xmax)]
    fity = y[(x>xmin)*(x<xmax)]
    mu = np.sum(fitx*fity)/np.sum(fity)
    sig = np.sqrt(np.sum(fity*(fitx-mu)**2)/np.sum(fity))
    print(mu, sig, max(fity))
    popt, pcov = opt.curve_fit(_gauss, fitx, fity, p0=[max(fity), mu, sig])
    return _gauss(x, popt[0], popt[1], popt[2])
# Load data and define x - y
RGAdata = np.loadtxt("/Users/ilenemitchell/Desktop/RGAscan.txt", skiprows=14)
x, y = RGAdata.T
# Create the plot
fig, ax = plt.subplots()
ax.semilogy(x, y, 'b')
# Plot the Gaussian's between xmin and xmax
for xmin, xmax in [(40, 43), (26, 31)]:
    yG = gauss(x, y, xmin, xmax)
    ax.semilogy(x, yG)
# Prettify the graph
ax.set_xlabel("mass/charge ratio")
ax.set_ylabel("ion current")
ax.set_xticks(np.arange(min(x), max(x)+2, 2.0))
ax.set_ylim([10**-12.5, 10**-9])
ax.set_title("RGA Data Jul 25, 2017")
plt.show()
Here's some sample code of identifying peaks in a data set to get you started. You can find a link to all the examples here.
import numpy as np
import peakutils
cb = np.array([-0.010223, ... ])
indexes = peakutils.indexes(cb, thres=0.02/max(cb), min_dist=100)
# [ 333 693 1234 1600]
interpolatedIndexes = peakutils.interpolate(range(0, len(cb)), cb, ind=indexes)
# [ 332.61234263 694.94831376 1231.92840845 1600.52446335]
You may find the lmfit module (https://lmfit.github.io/lmfit-py/) helpful. It provides a pre-built GaussianModel class for fitting a peak to a single Gaussian, and it supports adding multiple models (not necessarily Gaussians, but also other peak models and other functions that might be useful for backgrounds and so forth) into a composite model that can be fit at once.
Lmfit supports fixing some parameters or giving them bounds, so you could build a model as a sum of Gaussians with fixed positions, or limit each centroid to vary within some range (so the peaks cannot get confused). In addition, you can impose simple mathematical constraints on parameter values, so that you might require that all peak widths are the same size (or related in some simple form).
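For instance, here is a hedged sketch (the prefixes, bounds, and starting values are invented for illustration) of a two-Gaussian composite in which the second width is tied to the first:
from lmfit.models import GaussianModel
g1 = GaussianModel(prefix='g1_')
g2 = GaussianModel(prefix='g2_')
model = g1 + g2
params = model.make_params(g1_amplitude=1e-10, g1_center=41, g1_sigma=0.3,
                           g2_amplitude=1e-10, g2_center=28, g2_sigma=0.3)
params['g1_center'].set(min=40, max=43)   # keep each centroid in its own window
params['g2_center'].set(min=26, max=31)
params['g2_sigma'].set(expr='g1_sigma')   # require both widths to be equal
# result = model.fit(y, params, x=x)      # x, y as in the question
# print(result.fit_report())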
In particular, you might look at https://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes for an example of a fit using 2 Gaussians and a background function.
For peak finding, I've found scipy.signal.find_peaks_cwt to be pretty good.
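A minimal sketch of that (the widths range is a guess you would tune to your peak shapes):
import numpy as np
from scipy.signal import find_peaks_cwt
peak_indices = find_peaks_cwt(y, widths=np.arange(1, 10))
print(x[peak_indices])  # approximate mass/charge positions of the peaks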
I have a disordered list named d that looks like:
[0.0000, 123.9877,0.0000,9870.9876, ...]
I simply want to plot a CDF graph based on this list using Matplotlib in Python, but I don't know whether there's a built-in function I can use.
d = []
d_sorted = []
for line in fd.readlines():
    (addr, videoid, userag, usertp, timeinterval) = line.split()
    d.append(float(timeinterval))
d_sorted = sorted(d)
class discrete_cdf:
    def __init__(data):
        self._data = data  # must be sorted
        self._data_len = float(len(data))

    def __call__(point):
        return (len(self._data[:bisect_left(self._data, point)]) /
                self._data_len)
cdf = discrete_cdf(d_sorted)
xvalues = range(0, max(d_sorted))
yvalues = [cdf(point) for point in xvalues]
plt.plot(xvalues, yvalues)
Now I am using this code, but the error message is :
Traceback (most recent call last):
File "hitratioparea_0117.py", line 43, in <module>
cdf = discrete_cdf(d_sorted)
TypeError: __init__() takes exactly 1 argument (2 given)
I know I'm late to the party. But, there is a simpler way if you just want the cdf for your plot and not for future calculations:
plt.hist(put_data_here, density=True, cumulative=True, label='CDF',
         histtype='step', alpha=0.8, color='k')
As an example,
plt.hist(dataset, bins=bins, density=True, cumulative=True, label='CDF DATA',
         histtype='step', alpha=0.55, color='purple')
# bins and (lognormal / normal) datasets are pre-defined
EDIT: This example from the matplotlib docs may be more helpful.
As mentioned, cumsum from numpy works well. Make sure that your data is a proper PDF (ie. sums to one), otherwise the CDF won't end at unity as it should. Here is a minimal working example:
import numpy as np
from pylab import *
# Create some test data
dx = 0.01
X = np.arange(-2, 2, dx)
Y = np.exp(-X ** 2)
# Normalize the data to a proper PDF
Y /= (dx * Y).sum()
# Compute the CDF
CY = np.cumsum(Y * dx)
# Plot both
plot(X, Y)
plot(X, CY, 'r--')
show()
The numpy function for cumulative sums, cumsum, can be useful here:
In [1]: from numpy import cumsum
In [2]: cumsum([.2, .2, .2, .2, .2])
Out[2]: array([ 0.2, 0.4, 0.6, 0.8, 1. ])
Nowadays, you can just use seaborn's kdeplot function with cumulative as True to generate a CDF.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
X1 = np.arange(100)
X2 = (X1 ** 2) / 100
sns.kdeplot(data = X1, cumulative = True, label = "X1")
sns.kdeplot(data = X2, cumulative = True, label = "X2")
plt.legend()
plt.show()
For an arbitrary collection of values, x:
import numpy as np
import matplotlib.pyplot as plt

def cdf(x, plot=True, *args, **kwargs):
    x, y = sorted(x), np.arange(len(x)) / len(x)
    return plt.plot(x, y, *args, **kwargs) if plot else (x, y)
(If you're new to Python: *args and **kwargs let you pass positional and named arguments through to plt.plot without declaring and managing them explicitly.)
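Example usage, assuming the imports above:
samples = np.random.randn(1000)
cdf(samples, marker='.', linestyle='none', label='empirical CDF')
plt.legend()
plt.show()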
What works best for me is the quantile function of pandas.
Say I have 71 participants, and each participant has a certain number of interruptions. I want to compute the CDF plot of #interruptions over participants. The goal is to be able to tell what percentage of participants have at least 30 interventions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

step = 0.05
indices = np.arange(0, 1 + step, step)
num_interruptions_per_participant = [32,70,52,52,39,20,37,31,60,57,31,71,24,23,38,4,77,37,79,43,63,43,75,13
,45,31,57,28,61,29,30,52,65,11,76,37,65,28,33,73,65,43,50,33,45,40,50,44
,33,49,24,69,55,47,22,45,54,11,30,13,32,52,31,50,10,46,10,25,47,51,83]
CDF = pd.DataFrame({'dummy':num_interruptions_per_participant})['dummy'].quantile(indices)
plt.plot(CDF,indices,linewidth=9, label='#interventions', color='blue')
According to the graph, almost 25% of the participants have fewer than 30 interventions.
You can use this statistic for your further analysis. For instance, in my case I need at least 30 interventions from each participant in order to meet the minimum sample requirement needed for leave-one-subject-out evaluation. The CDF tells me that I have a problem with 25% of the participants.
import matplotlib.pyplot as plt

X = sorted(data)  # data: your list of values
Y = []
l = len(X)
Y.append(float(1)/l)
for i in range(2, l+1):
    Y.append(float(1)/l + Y[i-2])
plt.plot(X, Y, color='b', marker='o', label='xyz')
I guess this would do; for the procedure, refer to http://www.youtube.com/watch?v=vcoCVVs0fRI