I am trying to understand the results from the scikit-learn gaussian mixture model implementation. Take a look at the following example:
#!/opt/local/bin/python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
# Define simple gaussian
def gauss_function(x, amp, x0, sigma):
    return amp * np.exp(-(x - x0) ** 2. / (2. * sigma ** 2.))
# Generate sample from three gaussian distributions
samples = np.random.normal(-0.5, 0.2, 2000)
samples = np.append(samples, np.random.normal(-0.1, 0.07, 5000))
samples = np.append(samples, np.random.normal(0.2, 0.13, 10000))
# Fit GMM
gmm = GaussianMixture(n_components=3, covariance_type="full", tol=0.001)
gmm = gmm.fit(X=np.expand_dims(samples, 1))
# Evaluate GMM
gmm_x = np.linspace(-2, 1.5, 5000)
gmm_y = np.exp(gmm.score_samples(gmm_x.reshape(-1, 1)))
# Construct function manually as sum of gaussians
gmm_y_sum = np.full_like(gmm_x, fill_value=0, dtype=np.float32)
for m, c, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(),
                   gmm.weights_.ravel()):
    gmm_y_sum += gauss_function(x=gmm_x, amp=w, x0=m, sigma=np.sqrt(c))
# Normalize so that integral is 1
gmm_y_sum /= np.trapz(gmm_y_sum, gmm_x)
# Make regular histogram
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=[8, 5])
ax.hist(samples, bins=50, density=True, alpha=0.5, color="#0070FF")
ax.plot(gmm_x, gmm_y, color="crimson", lw=4, label="GMM")
ax.plot(gmm_x, gmm_y_sum, color="black", lw=4, label="Gauss_sum")
# Annotate diagram
ax.set_ylabel("Probability density")
ax.set_xlabel("Arbitrary units")
# Draw legend
plt.legend()
plt.show()
Here I first generate a sample distribution constructed from Gaussians, then fit a Gaussian mixture model to these data. Next, I want to calculate the probability density for some given input. Conveniently, the scikit-learn implementation offers the score_samples method to do just that. Now I am trying to understand these results. I always thought that I could take the parameters of the Gaussians from the GMM fit and construct the very same distribution by summing over them and then normalising the integral to 1. However, as you can see in the plot, while the curve obtained from the score_samples method (red line) fits the original data (blue histogram) perfectly, the manually constructed distribution (black line) does not. I would like to understand where my thinking went wrong and why I can't construct the distribution myself by summing the Gaussians given by the GMM fit. Thanks a lot for any input!
Just in case anyone in the future is wondering about the same thing: one has to normalise the individual components, not the sum. The mixture density is p(x) = sum_k w_k * N(x; mu_k, sigma_k^2), where each N is a normalised Gaussian. Using the weight as the amplitude gives each component an area that depends on its sigma, and rescaling the sum afterwards distorts the relative contributions. Instead, normalise each component to unit area and then scale it by its weight:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
# Define simple gaussian
def gauss_function(x, amp, x0, sigma):
    return amp * np.exp(-(x - x0) ** 2. / (2. * sigma ** 2.))
# Generate sample from three gaussian distributions
samples = np.random.normal(-0.5, 0.2, 2000)
samples = np.append(samples, np.random.normal(-0.1, 0.07, 5000))
samples = np.append(samples, np.random.normal(0.2, 0.13, 10000))
# Fit GMM
gmm = GaussianMixture(n_components=3, covariance_type="full", tol=0.001)
gmm = gmm.fit(X=np.expand_dims(samples, 1))
# Evaluate GMM
gmm_x = np.linspace(-2, 1.5, 5000)
gmm_y = np.exp(gmm.score_samples(gmm_x.reshape(-1, 1)))
# Construct function manually as sum of gaussians
gmm_y_sum = np.full_like(gmm_x, fill_value=0, dtype=np.float32)
for m, c, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_.ravel()):
    gauss = gauss_function(x=gmm_x, amp=1, x0=m, sigma=np.sqrt(c))
    gmm_y_sum += gauss / np.trapz(gauss, gmm_x) * w
# Make regular histogram
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=[8, 5])
ax.hist(samples, bins=50, density=True, alpha=0.5, color="#0070FF")
ax.plot(gmm_x, gmm_y, color="crimson", lw=4, label="GMM")
ax.plot(gmm_x, gmm_y_sum, color="black", lw=4, label="Gauss_sum", linestyle="dashed")
# Annotate diagram
ax.set_ylabel("Probability density")
ax.set_xlabel("Arbitrary units")
# Make legend
plt.legend()
plt.show()
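As a cross-check, the same curve can be built in closed form with scipy.stats.norm, which avoids the numerical np.trapz normalisation entirely. A minimal sketch reusing gmm, gmm_x and gmm_y from the code above:
from scipy.stats import norm
# Closed-form mixture density: p(x) = sum_k w_k * N(x; mu_k, sigma_k^2)
gmm_y_analytic = np.zeros_like(gmm_x)
for m, c, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_.ravel()):
    gmm_y_analytic += w * norm.pdf(gmm_x, loc=m, scale=np.sqrt(c))
# This should agree with np.exp(gmm.score_samples(...)) up to floating-point error
print(np.max(np.abs(gmm_y_analytic - gmm_y)))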
I have a histogram with a fitted Gaussian curve, and I'd like to find and calculate the full width at half maximum (FWHM) for this curve. The data used in this code is a single column from a dataframe. I've included a link to an image of my plot. I'm new to Python and have no idea how to do this.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def gaussian(n, mean, amplitude, standard_deviation):
    return amplitude * np.exp(-(n - mean)**2 / (2*standard_deviation**2))
n = df_OI_CMC['Area_1_Micrometers']
#Plot Histogram 1
bin_heights, bin_borders, _ = plt.hist(
    n, bins=(0, 1, 5, 10, 25, 50, 75, 100, 125, 150, 200, 250, 500, 750,
             1000, 2500, 5000, 7500, 10000),
    label='histogram', edgecolor='white')
bin_widths = np.diff(bin_borders)
bin_centers = bin_borders[:-1] + np.diff(bin_borders) / 2
#Generate enough x values to make the curves look smooth
n_interval_for_fit = np.linspace(bin_borders[0], bin_borders[-1], 10000)
#CurveFit to Histogram
popt, _ = curve_fit(gaussian, bin_centers, bin_heights,
                    p0=[-44.0543433, 1480.64682738, 68.86641026])
plt.rcParams["figure.figsize"] = [12,12]
plt.plot(n_interval_for_fit, gaussian(n_interval_for_fit, *popt), label='fit')
plt.ylim([0, 1500])
plt.xlim([-10,1000])
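For a Gaussian, the FWHM follows directly from the fitted standard deviation: FWHM = 2*sqrt(2*ln(2))*sigma, i.e. roughly 2.355*sigma. A minimal sketch using the popt returned by curve_fit above (popt[2] is standard_deviation in the parameter order of the gaussian function; the abs guards against a sign-flipped sigma, which is unidentifiable because sigma only appears squared):
# Full width at half maximum from the fitted standard deviation
fwhm = 2 * np.sqrt(2 * np.log(2)) * abs(popt[2])
print('FWHM =', fwhm)
# Optionally mark the half-maximum level on the plot
half_max = popt[1] / 2  # the peak height is the fitted amplitude
plt.hlines(half_max, popt[0] - fwhm / 2, popt[0] + fwhm / 2, colors='red', label='FWHM')
plt.legend()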
I need to fit data (x axis: sigma, y axis: Mbh) with an exponential model. This is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
#define my data
Mbh = np.array([1.8e6,2.5e6,4.5e7,3.7e7,4.4e7,1.5e7,1.4e7,4.1e7, 1.0e9,2.1e8,1.0e8,1.0e8,1.6e7,1.9e8,3.9e7,5.2e8,3.1e8,3.0e8,7.0e7,1.1e8,3.0e9,5.6e7,7.8e7,2.0e9,1.7e8,1.4e7,2.4e8,5.3e8,3.3e8,3.5e6,2.5e9])
sigma = np.array([103,75,160,209,205,151,175,140,230,205,145,206,143,182,130,315,242,225,186,190,375,162,152,385,177,90,234,290,266,67,340])
#define my model to fit
def Mbh02(alpha, sigma, beta):
    return alpha * np.exp(beta * sigma)
#calculate the fit parameter:
#for second model
popt02, pcov02 = curve_fit(Mbh02, sigma, Mbh, p0=[1, 0.058])
print(f'Parameter of the second function : {popt02}')
sigma_plot = [103,75,160,209,205,151,175,140,230,205,145,206,143,182,130,315,242,225,186,190,375,162,152,385,177,90,234,290,266,67,340]
sigma_plot.sort()
sigma_plot = np.array(sigma_plot)
#plot model with data with
plt.figure(figsize=(6,6))
plt.scatter(sigma, Mbh * 1e-9, marker = '+', color ='black', label = 'Data')
plt.plot(sigma_plot , Mbh02(alpha = popt02[0], sigma = sigma_plot, beta = popt02[1]) * 1e-9, color='orange', ls ='-', label ='2. fit')
plt.ylabel(r'$M_{BH}$ in $M_\odot *10^9$ unit', fontsize=16)
plt.xlabel(r'$\sigma$', fontsize=16)
# plt.ylim(-1,10)
plt.title('Plot of the black hole mass $M_{BH}$ \nagainst the velocity dispersion $\sigma$ \nfor different elliptical galaxies', fontsize=18)
plt.grid(True)
plt.legend()
plt.show()
and I get the following parameters:
print(popt02) = [16.13278858 0.91788691]
which produce a clearly poor fit.
If I instead pick the parameters manually and plot them with:
plt.plot(sigma_plot , (1 * np.exp(0.058 * sigma_plot)) * 1e-9, ls ='--', label ='2. fit manual')
I get a much better fit. What is the problem? Why is curve_fit not working and giving such parameters?
In the curve_fit documentation, it says
Assumes ydata = f(xdata, *params) + eps
So if you change your function definition so that the x data is first in your function, it will work:
def Mbh02(sigma, alpha, beta):
    return alpha * np.exp(beta * sigma)
# Rest of code
plt.plot(sigma_plot , Mbh02(sigma_plot, *popt02) * 1e-9, color='orange', ls ='-')
Have you tried fitting log(Mbh) with a linear fit instead of fitting the exponential model directly? This is usually much more stable.
import numpy as np
import matplotlib.pyplot as plt
Mbh = np.array([1.8e6,2.5e6,4.5e7,3.7e7,4.4e7,1.5e7,1.4e7,4.1e7, 1.0e9,2.1e8,1.0e8,1.0e8,1.6e7,1.9e8,3.9e7,5.2e8,3.1e8,3.0e8,7.0e7,1.1e8,3.0e9,5.6e7,7.8e7,2.0e9,1.7e8,1.4e7,2.4e8,5.3e8,3.3e8,3.5e6,2.5e9])
sigma = np.array([103,75,160,209,205,151,175,140,230,205,145,206,143,182,130,315,242,225,186,190,375,162,152,385,177,90,234,290,266,67,340])
plt.figure(2)
plt.plot(sigma,Mbh,'.')
lnMbh= np.log(Mbh)
p = np.polyfit(sigma,lnMbh,1)
plt.plot(sigma, np.exp(np.polyval(p,sigma)),'*')
beta = p[0]           # slope of the log-linear fit
alpha = np.exp(p[1])  # the intercept is log(alpha), so alpha = exp(intercept)
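If the direct exponential fit is still wanted afterwards, a common pattern is to seed curve_fit with these stable log-linear estimates as starting values. A sketch assuming the corrected Mbh02(sigma, alpha, beta) signature from the other answer:
from scipy.optimize import curve_fit
def Mbh02(sigma, alpha, beta):
    return alpha * np.exp(beta * sigma)
# Use the log-linear estimates alpha, beta as starting values for the nonlinear fit
popt, pcov = curve_fit(Mbh02, sigma, Mbh, p0=[alpha, beta])
print(popt)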
As mentioned here, scikit-learn's Gaussian process regression (GPR) permits "prediction without prior fitting (based on the GP prior)". But I have an idea for what my prior should be (i.e. it should not simply have a mean of zero but perhaps my output, y, scales linearly with my input, X, i.e. y = X). How could I encode this information into GPR?
Below is a working example, but it assumed zero mean for my prior. I read that "The GaussianProcessRegressor does not allow for the specification of the mean function, always assuming it to be the zero function, highlighting the diminished role of the mean function in calculating the posterior." I believe this is the motivation behind custom kernels (e.g. heteroscedastic) with variable scales at different X, although I'm still trying to better understand what capability they provide. Are there ways to get around the zero mean prior so that an arbitrary prior can be specified in scikit-learn?
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
def f(x):
    """The function to predict."""
    return (1.5*(1. - np.tanh(100.*(x-0.96))) + 1.5*x*(x-0.95) + 0.4
            + 1.5*(1.-x)*np.random.random(x.shape))
# Instantiate a Gaussian Process model
kernel = C(10.0, (1e-5, 1e5)) * RBF(10.0, (1e-5, 1e5))
X = np.array([0.803, 0.827, 0.861, 0.875, 0.892, 0.905,
              0.91, 0.92, 0.925, 0.935, 0.941, 0.947, 0.96,
              0.974, 0.985, 0.995, 1.0])
X = np.atleast_2d(X).T
# Observations and noise
y = f(X).ravel()
noise = np.linspace(0.4,0.3,len(X))
y += noise
# Instantiate a Gaussian Process model
gp = GaussianProcessRegressor(kernel=kernel, alpha=noise ** 2,
                              n_restarts_optimizer=10)
# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y)
# Make the prediction on the meshed x-axis (ask for MSE as well)
x = np.atleast_2d(np.linspace(0.8, 1.02, 1000)).T
y_pred, sigma = gp.predict(x, return_std=True)
plt.figure()
plt.errorbar(X.ravel(), y, noise, fmt='k.', markersize=10, label=u'Observations')
plt.plot(x, y_pred, 'k-', label=u'Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate([y_pred - 1.9600 * sigma,
                         (y_pred + 1.9600 * sigma)[::-1]]),
         alpha=.1, fc='k', ec='None', label='95% confidence interval')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim(0.8, 1.02)
plt.ylim(0, 5)
plt.legend(loc='lower left')
plt.show()
Here is an example of how to use a prior mean function with the sklearn GPR model: subtract the prior mean from the training targets, fit the GP to the residuals, and add the prior mean back to the predictions.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
A=np.linspace(5,25,num=100)
# prior mean function
prior_beta=12-0.3*A
# true function
true_beta=20-0.7*A
np.random.seed(44)
# Training data
size=15
ind=np.random.randint(0,100,size=size)
# generate the posterior variance (noisy samples)
var_=np.random.uniform(0.1,10.0,size=size)
A_=A[ind][:, np.newaxis]
beta_=true_beta[ind]-prior_beta[ind]
beta_1=true_beta[ind]
plt.figure()
kernel = ConstantKernel(4) * RBF(length_scale=2, length_scale_bounds=(1e-3, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, alpha=var_,
                              optimizer=None).fit(A_, beta_)
X_ = np.linspace(5, 25, 100)
y_mean, y_cov = gp.predict(X_[:, np.newaxis], return_cov=True)
# Now you add the prior mean function back
y_mean=y_mean+12-0.3*X_
plt.plot(X_, y_mean, 'k', lw=3, zorder=9, label='predicted')
plt.fill_between(X_, y_mean - 3*np.sqrt(np.diag(y_cov)),
                 y_mean + 3*np.sqrt(np.diag(y_cov)),
                 alpha=0.5, color='k', label='+-3sigma')
plt.plot(A,true_beta, 'r', lw=3, zorder=9,label='truth')
plt.plot(A,prior_beta, 'blue', lw=3, zorder=9,label='prior')
plt.errorbar(A_[:,0], beta_1, yerr=3*np.sqrt(var_), fmt='x', ecolor='g',
             marker='s', mfc='g', ms=10, capsize=6, label='training set')
plt.title("Initial: %s\n"% (kernel))
plt.legend()
plt.show()
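The same trick can be wrapped into small helpers so it works for any prior mean. A hypothetical sketch (fit_gp_with_prior_mean and predict_gp_with_prior_mean are not part of scikit-learn), assuming prior_mean is any callable mapping inputs to prior mean values:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def fit_gp_with_prior_mean(X, y, prior_mean, **gp_kwargs):
    # Fit the GP to the residuals around the prior mean
    gp = GaussianProcessRegressor(**gp_kwargs)
    return gp.fit(X, y - prior_mean(X).ravel())

def predict_gp_with_prior_mean(gp, prior_mean, X):
    # Predict the residuals, then add the prior mean back
    resid, std = gp.predict(X, return_std=True)
    return resid + prior_mean(X).ravel(), std

# Example with the linear prior from above, mean(A) = 12 - 0.3*A
prior = lambda a: 12 - 0.3 * a
gp2 = fit_gp_with_prior_mean(A_, beta_1, prior, kernel=kernel, alpha=var_, optimizer=None)
y_mean2, y_std2 = predict_gp_with_prior_mean(gp2, prior, X_[:, np.newaxis])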
I have a problem with the shape of the 2D Gaussian distribution.
I generated two random Gaussian samples and used them to build a 2D Gaussian distribution. Since the two standard deviations differ, I expected the 2D distribution to have an elliptical shape; instead I get a circle. Can someone explain where I went wrong?
import numpy as np
import matplotlib.pyplot as plt
import scipy.integrate as integrate
import scipy.special as special
mu1, sigma1 = 0, 0.1
s1 = np.random.normal(mu1, sigma1, 10000)
mu2, sigma2 = 0.8, 0.3
s2 = np.random.normal(mu2, sigma2, 10000)
plt.figure(1)
plt.title('Histogram of a 2D-Gaussian Distribution')
bins1 = plt.hist(s1, 100)
bins2 = plt.hist(s2, 100)
plt.show()
plt.figure(2)
plt.title('2D-Gaussian Distribution')
bins = plt.hist2d(s1, s2, 100)
cb = plt.colorbar()
cb.set_label('counts in bin')
plt.show()
Thank you in advance.
It's probably just the scaling of the axes. Try
plt.axis('equal')
for the second plot.
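Applied to the second figure from the question, a minimal sketch looks like this; with sigma1 = 0.1 and sigma2 = 0.3, equal axis scaling makes the elongated shape visible:
plt.figure(2)
plt.title('2D-Gaussian Distribution')
plt.hist2d(s1, s2, 100)
cb = plt.colorbar()
cb.set_label('counts in bin')
plt.axis('equal')  # same data-to-screen scale on both axes
plt.show()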
I have some data to which I have fitted a normal distribution using the fit function of the scipy.stats.norm object, like so:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
x = np.random.normal(size=50000)
fig, ax = plt.subplots()
nbins = 75
mu, sigma = norm.fit(x)
n, bins, patches = ax.hist(x, nbins, density=True, facecolor='grey', alpha=0.5, label='before')
y0 = norm.pdf(bins, mu, sigma)  # line of best fit (mlab.normpdf was removed from matplotlib)
ax.plot(bins,y0,'k--',linewidth = 2, label='fit before')
ax.set_title('$\mu$={}, $\sigma$={}'.format(mu, sigma))
plt.show()
I would now like to extract the uncertainty/error in the fitted mu and sigma values. How can I go about this?
You can use scipy.optimize.curve_fit:
This method returns not only the estimated optimal values of the parameters, but also the corresponding covariance matrix:
popt : array
Optimal values for the parameters so that the sum of the squared residuals
of f(xdata, *popt) - ydata is minimized
pcov : 2d array
The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
How the sigma parameter affects the estimated covariance depends on absolute_sigma argument, as described above.
If the Jacobian matrix at the solution doesn’t have a full rank, then ‘lm’ method returns a matrix filled with np.inf, on the other hand ‘trf’ and ‘dogbox’ methods use Moore-Penrose pseudoinverse to compute the covariance matrix.
You can calculate the standard deviation errors of the parameters from the square roots of the diagonal elements of the covariance matrix as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.optimize import curve_fit
x = np.random.normal(size=50000)
fig, ax = plt.subplots()
nbins = 75
n, bins, patches = ax.hist(x,nbins, density=True, facecolor = 'grey', alpha = 0.5, label='before');
centers = (0.5*(bins[1:]+bins[:-1]))
pars, cov = curve_fit(lambda x, mu, sig : norm.pdf(x, loc=mu, scale=sig), centers, n, p0=[0,1])
ax.plot(centers, norm.pdf(centers,*pars), 'k--',linewidth = 2, label='fit before')
ax.set_title('$\mu={:.4f}\pm{:.4f}$, $\sigma={:.4f}\pm{:.4f}$'.format(pars[0], np.sqrt(cov[0,0]), pars[1], np.sqrt(cov[1,1])))
plt.show()
This results in a plot of the histogram and the fitted curve, with the parameter values and their uncertainties shown in the title.
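Alternatively, since norm.fit is a maximum-likelihood fit to the raw samples, the asymptotic standard errors of its estimates are known in closed form: SE(mu) = sigma/sqrt(N) and SE(sigma) ≈ sigma/sqrt(2N). A minimal sketch:
import numpy as np
from scipy.stats import norm
x = np.random.normal(size=50000)
mu, sigma = norm.fit(x)
N = len(x)
mu_err = sigma / np.sqrt(N)          # standard error of the mean
sigma_err = sigma / np.sqrt(2 * N)   # asymptotic standard error of sigma
print('mu = {:.4f} +/- {:.4f}, sigma = {:.4f} +/- {:.4f}'.format(mu, mu_err, sigma, sigma_err))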
See also lmfit (https://github.com/lmfit/lmfit-py) which gives an easier interface and reports uncertainties in fitted variables. To fit data to a normal distribution, see http://lmfit.github.io/lmfit-py/builtin_models.html#example-1-fit-peak-data-to-gaussian-lorentzian-and-voigt-profiles
and use something like
from lmfit.models import GaussianModel
model = GaussianModel()
# create parameters with initial guesses:
params = model.make_params(center=9, amplitude=40, sigma=1)
result = model.fit(ydata, params, x=xdata)
print(result.fit_report())
The report will include the 1-sigma errors like
[[Variables]]
sigma: 1.23218358 +/- 0.007374 (0.60%) (init= 1.0)
center: 9.24277047 +/- 0.007374 (0.08%) (init= 9.0)
amplitude: 30.3135620 +/- 0.157126 (0.52%) (init= 40.0)
fwhm: 2.90157055 +/- 0.017366 (0.60%) == '2.3548200*sigma'
height: 9.81457817 +/- 0.050872 (0.52%) == '0.3989423*amplitude/max(1.e-15, sigma)'