I am doing some work which requires fitting a Gaussian to a cluster of points which is expected to be distributed normally.
I have data which looks like this, you can see the small tightly grouped cluster of points on the left:
I zoom in around the cluster, and use scikit-learn KDE to get a density distribution (with Gaussian kernel), which looks like this:
Then I fit the Gaussian, and it turns out to have a sigma that is far too small:
centroid_x: -36.3204357
centroid_y: -12.8734763
sigma_x: 0.17916588
sigma_y: 0.07428976
From inspection of the density distribution, the x and y sigmas should be on the order of ~1 rather than ~0.1. Does anyone know why this might be happening? I don't believe there are significant errors in my code or method; this technique has worked well on other data sets, for example:
Is there a way in Python to generate random data based on the distribution of already existing data?
Here are the statistical parameters of my dataset:
Data
count 209.000000
mean 1.280144
std 0.374602
min 0.880000
25% 1.060000
50% 1.150000
75% 1.400000
max 4.140000
As the data are not normally distributed, this cannot be done with np.random.normal. Any ideas?
Thank you.
Edit: Performing KDE:
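import seaborn as sns
from sklearn.neighbors import KernelDensity
# Gaussian KDE fitted to the 'y' column of the DataFrame `data`
kde = KernelDensity(kernel='gaussian', bandwidth=0.525566).fit(data['y'].to_numpy().reshape(-1, 1))
# draw 2400 new samples and plot them (histplot replaces the deprecated distplot)
samples = kde.sample(2400)
sns.histplot(samples.ravel(), kde=True, stat='density')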
In general, real-world data doesn't exactly follow a "nice" distribution like the normal or Weibull distributions.
Similarly to machine learning, there are generally two steps to sampling from a distribution of data points:
Fit a data model to the data.
Then, predict a new data point based on that model, with the help of randomness.
There are several ways to estimate the distribution of data and sample from that estimate:
Kernel density estimation.
Gaussian mixture models.
Histograms.
Regression models.
Other machine learning models.
In addition, methods such as maximum likelihood estimation make it possible to fit a known distribution (such as the normal distribution) to data, but the estimated distribution is generally rougher than with kernel density estimation or other machine learning models.
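The two steps above (fit a model, then sample from it) can be sketched as follows. This is only an illustration: the data are synthetic stand-ins for a skewed data set of 209 values, and the choice of a log-normal for the maximum likelihood fit is an assumption, not something implied by the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real data: 209 positive, right-skewed values.
data = 0.85 + rng.lognormal(mean=-1.0, sigma=0.6, size=209)

# Option 1 (non-parametric): fit a Gaussian KDE, then resample from it.
kde = stats.gaussian_kde(data)                # bandwidth from Scott's rule by default
kde_samples = kde.resample(2400, seed=1)[0]   # resample returns shape (1, n)

# Option 2 (parametric): fit a log-normal by maximum likelihood, then draw from it.
# floc=0 pins the location parameter; this is an assumption made for stability.
shape, loc, scale = stats.lognorm.fit(data, floc=0)
mle_samples = stats.lognorm.rvs(shape, loc=loc, scale=scale,
                                size=2400, random_state=2)

print(np.median(data), np.median(kde_samples), np.median(mle_samples))
```

Comparing summary statistics or histograms of the generated samples against the original data is a quick way to judge which estimate is closer.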
See also my section "Random Numbers from a Distribution of Data Points".
I have the given data set:
I would like to fit a Gaussian curve to the peak that the red arrow points to. I have attempted this by restricting the data points to a range of channels close to the peak and using scipy.optimize.curve_fit with a Gaussian function to obtain the fit shown below.
This method, however, does not take into account the slope of the background noise, which affects the accuracy of the fitted peak position.
I would like to take this background slope into account. How do I go about doing so in Python?
You have to somehow model the background and the Gaussian peak, and perhaps any other peaks in the spectrum. Your background looks to be roughly 1/x (or some other power of x), but it might also be exponential. You may know this, or you may find that plotting on a semi-log plot can help decide which of these forms is better.
To fit the background and Gaussian with curve_fit, you would have to write a model function that modeled both. Allow me to recommend using lmfit (http://lmfit.github.io/lmfit-py/) as it has several built-in models and can help you compose a model of several different line shapes. An example that might be helpful for your problem is at (http://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes).
A script to fit your data might look like
import numpy as np
from lmfit.models import PowerLawModel, ExponentialModel, GaussianModel
# make models for individual components
mod_expon = ExponentialModel(prefix='exp_')
mod_gauss = GaussianModel(prefix='g1_')
# sum components to make a composite model (add more if needed)
model = mod_expon + mod_gauss
# create fitting parameters by name, give initial values
params = model.make_params(g1_amplitude=5, g1_center=55, g1_sigma=5,
                           exp_amplitude=5, exp_decay=10)
# do fit
result = model.fit(ydata, params, x=xdata)
# print out fitting statistics, best-fit parameters, uncertainties,....
print(result.fit_report())
There are many more examples in the docs, including showing how to extract and plot the individual components, and so on.
I would do this with a model that fits both the signal and the background. That is, fit not just a Gaussian, but a Gaussian plus a function that describes the background. The first approximation to your background is a linear slope, so you could use a form like a*exp(-(x-x0)**2/w**2) + m*x + c.
This gives you more fitting parameters, all of which are interdependent, but if you can give them reasonable initial values then the fit normally converges well.
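A minimal sketch of that form with scipy.optimize.curve_fit is shown here. The spectrum is synthetic and all numeric values (peak position, slope, noise level) are made up for illustration; the initial guesses are derived from the data in a simple way that may need adapting for a real spectrum.

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss_plus_line(x, a, x0, w, m, c):
    """Gaussian peak a*exp(-(x-x0)**2/w**2) on a linear background m*x + c."""
    return a * np.exp(-(x - x0) ** 2 / w ** 2) + m * x + c

# Synthetic spectrum (illustrative values only): a peak at channel 55 on a falling slope.
rng = np.random.default_rng(0)
x = np.arange(30, 81, dtype=float)
true_params = (40.0, 55.0, 4.0, -0.8, 120.0)
y = gauss_plus_line(x, *true_params) + rng.normal(0.0, 1.5, x.size)

# Initial guesses: background line through the end points,
# peak height and position from the largest residual above that line.
m0 = (y[-1] - y[0]) / (x[-1] - x[0])
c0 = y[0] - m0 * x[0]
resid = y - (m0 * x + c0)
p0 = (resid.max(), x[np.argmax(resid)], 5.0, m0, c0)

popt, pcov = curve_fit(gauss_plus_line, x, y, p0=p0)
perr = np.sqrt(np.diag(pcov))  # 1-sigma parameter uncertainties
print("centre = %.2f +/- %.2f" % (popt[1], perr[1]))
```

Because w enters only as w**2, the fit may return it with either sign; take its absolute value when reporting the width.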
I have a vector of data points that seems to represent a 3D Gaussian distribution or a Gaussian mixture distribution. Is there a way to fit a 3D Gaussian distribution or a Gaussian mixture distribution to this matrix, and if yes, do there exist libraries to do that (e.g. in Python)?
The question seems related to the following one, but I would like to fit a 3D Gaussian to it:
Fit multivariate gaussian distribution to a given dataset
The targeted end results would look like this (a single distribution or a mixture):
For example, very much simplified, my data vector (from which the Gaussian (mixture) distribution should be learned) looks like this:
[[0,0,0,0,0,0], [0,1,1,1,1,0], [0,1,2,2,1,0], [1,2,3,3,2,1], [0,1,2,2,1,0], [0,0,0,0,0,0]]
I can give an answer if you know the number of Gaussians. Your vector gives the Z values at a grid of X, Y points. You can make X and Y vectors:
import numpy as np
num_x, num_y = np.shape(z)
xx = np.outer(np.ones(num_x), np.arange(num_y))
yy = np.outer(np.arange(num_x), np.ones(num_y))
Then follow any routine fitting procedure, for instance 2D Gaussian Fit for intensities at certain coordinates in Python.
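As one possible version of such a fitting procedure, the sketch below fits a single axis-aligned 2D Gaussian to the example grid from the question using scipy.optimize.curve_fit. The model (no rotation, no constant offset) and the starting values are assumptions chosen to keep the example short.

```python
import numpy as np
from scipy.optimize import curve_fit

# The simplified example grid from the question.
z = np.array([[0, 0, 0, 0, 0, 0],
              [0, 1, 1, 1, 1, 0],
              [0, 1, 2, 2, 1, 0],
              [1, 2, 3, 3, 2, 1],
              [0, 1, 2, 2, 1, 0],
              [0, 0, 0, 0, 0, 0]], dtype=float)

# yy holds the row index and xx the column index of every grid point.
yy, xx = np.indices(z.shape)

def gauss2d(coords, amp, x0, y0, sx, sy):
    """Axis-aligned 2D Gaussian evaluated at flattened (x, y) coordinates."""
    x, y = coords
    return amp * np.exp(-0.5 * ((x - x0) / sx) ** 2 - 0.5 * ((y - y0) / sy) ** 2)

coords = np.vstack((xx.ravel(), yy.ravel())).astype(float)

# Start from the location and height of the largest grid value.
row0, col0 = np.unravel_index(np.argmax(z), z.shape)
p0 = (z.max(), float(col0), float(row0), 1.0, 1.0)

popt, pcov = curve_fit(gauss2d, coords, z.ravel(), p0=p0, maxfev=10000)
amp, x0, y0, sx, sy = popt
print("centre (x, y):", x0, y0, " widths:", abs(sx), abs(sy))
```

For a mixture of several Gaussians, the same pattern applies with a model function that sums one such term per component.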
There are so-called Gaussian mixture models (GMMs), with a lot of literature behind them. There is also Python code for sampling, parameter estimation, etc., though I am not sure whether it fits your needs:
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GMM.html
Disclaimer: used scikit, but never used GMM
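For reference, a small sketch of fitting a mixture with scikit-learn follows. It uses sklearn.mixture.GaussianMixture, which as far as I know is the current class (the GMM class linked above was removed from later releases). Note the assumption that the input is a set of point samples, one row per observation, rather than gridded intensities as in the question; the two clusters are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical data: point samples from two 2D clusters.
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(300, 2)),
    rng.normal(loc=(4.0, 3.0), scale=0.8, size=(200, 2)),
])

# Fit a two-component mixture with full covariance matrices.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(points)

print("weights:", gmm.weights_)
print("means:", gmm.means_)

# Draw new points from the fitted mixture.
new_points, labels = gmm.sample(100)
```

The number of components has to be chosen up front; comparing gmm.bic(points) across several values of n_components is one common way to pick it.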
I'm trying to fit an exponential distribution to a dataset I have. Strangely, no matter what I do I can't seem to scale the histogram so it fits the fitted exponential distribution.
param=expon.fit(data)
pdf_fitted=norm.pdf(x,loc=param[0],scale=param[1])
plot(x,pdf_fitted,'r-')
hist(constraint1N55, normed=1,alpha=.3,histtype='stepfilled')
For some reason, the histogram takes up much more space than the probability distribution, even though I have normed=1. Is there something I can do to make things fit more appropriately?
You made an error. You fitted to an exponential, but plotted a normal distribution:
pdf_fitted=expon.pdf(x,loc=param[0],scale=param[1])
The data looks good when plotted properly:
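A self-contained version of the corrected comparison might look like the sketch below. The data are synthetic (an exponential with scale 2 is assumed), and np.histogram with density=True stands in for the plot; matplotlib's hist takes the same density=True argument, which replaced the older normed=1.

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)  # hypothetical data set

# Fit an exponential distribution; returns (loc, scale).
loc, scale = expon.fit(data)

# Normalised histogram: bar areas sum to 1, so it is comparable to a pdf.
heights, edges = np.histogram(data, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Evaluate the fitted *exponential* pdf (not norm.pdf) at the bin centres.
pdf_fitted = expon.pdf(centers, loc=loc, scale=scale)

area = np.sum(heights * np.diff(edges))
print("loc = %.4f, scale = %.4f, histogram area = %.3f" % (loc, scale, area))
```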
I am trying to fit a gamma distribution to my data points, and I can do that using code below.
import scipy.stats as ss
import numpy as np
dataPoints = np.arange(0,1000,0.2)
fit_alpha,fit_loc,fit_beta = ss.rv_continuous.fit(ss.gamma, dataPoints, floc=0)
I want to reconstruct a larger distribution using many such small gamma distributions (the larger distribution is irrelevant for the question, only justifying why I am trying to fit a cdf as opposed to a pdf).
To achieve that, I want to fit a cumulative distribution, as opposed to a pdf, to my smaller distribution data. More precisely, I want to fit the data to only a part of the cumulative distribution.
For example, I want to fit the data only until the cumulative probability function (with a certain scale and shape) reaches 0.6.
Any thoughts on using fit() for this purpose?
I understand that you are trying to piecewise reconstruct your cdf with several small gamma distributions each with a different scale and shape parameter capturing the 'local' regions of your distribution.
Probably makes sense if your empirical distribution is multi-modal / difficult to be summarized by one 'global' parametric distribution.
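If the goal really is to fit only the lower part of a gamma CDF, one possible approach (not using fit(), which works on the likelihood of the full sample) is a least-squares fit of the CDF to the empirical CDF, restricted to the points below the chosen probability. The sketch below assumes synthetic gamma-distributed samples, a location fixed at 0, and the 0.6 cutoff from the question.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import gamma

rng = np.random.default_rng(0)
# Hypothetical samples; replace with the real data.
samples = np.sort(rng.gamma(shape=2.0, scale=3.0, size=5000))

# Empirical CDF evaluated at the sorted samples.
ecdf = np.arange(1, samples.size + 1) / samples.size

# Keep only the part where the cumulative probability is at most 0.6.
mask = ecdf <= 0.6
x_part, F_part = samples[mask], ecdf[mask]

def gamma_cdf(x, a, scale):
    """Gamma CDF with shape a, fixed loc=0 and the given scale."""
    return gamma.cdf(x, a, loc=0, scale=scale)

# Least-squares fit of the CDF; bounds keep both parameters positive.
p0 = (1.0, np.mean(x_part))
(a_fit, scale_fit), _ = curve_fit(gamma_cdf, x_part, F_part,
                                  p0=p0, bounds=(1e-6, np.inf))

max_resid = np.max(np.abs(gamma_cdf(x_part, a_fit, scale_fit) - F_part))
print("shape = %.3f, scale = %.3f, max residual = %.4f"
      % (a_fit, scale_fit, max_resid))
```

Because ECDF points are not independent, the covariance returned by curve_fit should not be read as a reliable uncertainty estimate here.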
Don't know if you have specific reasons behind specifically fitting several gamma distributions, but in case your goal is to try to fit a distribution which is relatively smooth and captures your empirical cdf well perhaps you can take a look at Kernel Density Estimation. It is essentially a non-parametric way to fit a distribution to your data.
http://scikit-learn.org/stable/modules/density.html
http://en.wikipedia.org/wiki/Kernel_density_estimation
For example, you can try out a Gaussian kernel and change the bandwidth parameter to control how smooth your fit is. A bandwidth that is too small leads to an unsmooth ("overfitted") result [high variance, low bias]. A bandwidth that is too large gives a very smooth result, but with high bias.
from sklearn.neighbors import KernelDensity
# KernelDensity expects a 2D array of shape (n_samples, n_features)
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(dataPoints.reshape(-1, 1))
A good way to select a bandwidth that balances the bias-variance tradeoff is to use cross-validation. The high-level idea is to partition your data, fit on the training set and 'validate' on the held-out test set, which guards against overfitting the data.
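A minimal sketch of that cross-validation, assuming scikit-learn's GridSearchCV and synthetic one-dimensional data, could look like this. The bandwidth grid and the number of folds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Hypothetical 1D data, reshaped to (n_samples, 1) as KernelDensity expects.
data_points = rng.gamma(shape=2.0, scale=1.5, size=300).reshape(-1, 1)

# Candidate bandwidths on a log scale from 0.1 to 10.
bandwidths = np.logspace(-1, 1, 20)

# KernelDensity.score returns the total log-likelihood of held-out data,
# so the grid search picks the bandwidth that generalises best.
search = GridSearchCV(KernelDensity(kernel="gaussian"),
                      {"bandwidth": bandwidths}, cv=5)
search.fit(data_points)

best_bandwidth = search.best_params_["bandwidth"]
kde = search.best_estimator_  # refitted on all data with the best bandwidth

# Log-density of the fitted KDE at two example points.
log_density = kde.score_samples(np.array([[1.0], [3.0]]))
print("best bandwidth:", best_bandwidth)
```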
Fortunately, sklearn also provides a nice example of choosing the best bandwidth of a Gaussian kernel using cross-validation, which you can borrow some code from:
http://scikit-learn.org/stable/auto_examples/neighbors/plot_digits_kde_sampling.html
Hope this helps!