I'm trying to fit an exponential distribution to a dataset I have. Strangely, no matter what I do I can't seem to scale the histogram so it fits the fitted exponential distribution.
param=expon.fit(data)
pdf_fitted=norm.pdf(x,loc=param[0],scale=param[1])
plot(x,pdf_fitted,'r-')
hist(constraint1N55, normed=1,alpha=.3,histtype='stepfilled')
For some reason, the histogram takes up much more space than the probability distribution, even though I have normed=1. Is there something I can do to make things fit more appropriately?
You made an error. You fitted to an exponential, but plotted a normal distribution:
pdf_fitted=expon.pdf(x,loc=param[0],scale=param[1])
The data looks good when plotted properly:
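For reference, a complete corrected version might look like this (a sketch that uses a simulated sample in place of your data; note that recent Matplotlib versions use density=True instead of normed=1):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

# simulated stand-in for your data array
data = np.random.exponential(scale=2.0, size=1000)

# expon.fit returns (loc, scale)
loc, scale = expon.fit(data)

x = np.linspace(data.min(), data.max(), 200)
pdf_fitted = expon.pdf(x, loc=loc, scale=scale)

plt.hist(data, bins=30, density=True, alpha=.3, histtype='stepfilled')
plt.plot(x, pdf_fitted, 'r-')
plt.show()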
I'm trying to fit a smooth curve to a set of data that's very noisy. By using the "UnivariateSpline" function from scipy I have almost managed to reach my goal, but the fitting of the curve seems to not be able to fit the beginning correctly.
The first picture shows the whole plot (red is the fitted curve, green the noisy data).
First plot
The second picture is zoomed in on the part that the fitting gets wrong.
Second plot with the fitting error
Does anyone have an idea for how to make this more aligned with the green data?
I have tried splitting the data into a first part (from x=0 up to the spike, an exponential-like rise) and a second part (from the top of the spike onwards, a decaying exponential), but this didn't work.
In the end, the important thing is that y always increases with x before the spike and decreases with x after it.
First, if the data is noisy and you don't want to fit the noise, then interpolation is always tricky because it tries to pass through every data point. It may be OK for a one-off analysis, though.
The issue you're observing is the trade-off between the smoothing factor and the sharp transition in your data. You can try reducing s to a value smaller than the number of data points (the default), but you may find that the spline then starts to fit the noise.
Otherwise, I'd recommend that you try fitting an MLP model. Be sure to use activation='tanh' as that will make the function approximation smoother. You may also want to increase alpha, which controls the weight decay and further helps to keep the neural net function smooth. Lastly, set early_stopping=True to reduce the chances of overfitting to the noise.
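A minimal sketch of both suggestions on synthetic data shaped like yours (a sharp spike plus noise); all names and hyperparameter values below are illustrative, not tuned:

import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.neural_network import MLPRegressor

# synthetic noisy series standing in for the green data: rising before the spike, falling after
x = np.linspace(0, 10, 500)
y = np.where(x < 3, np.exp(x - 3), np.exp(-(x - 3))) + 0.03 * np.random.randn(x.size)

# spline: the default smoothing factor is about len(x); a smaller s follows
# the spike more closely but eventually starts to fit the noise
spline = UnivariateSpline(x, y, s=len(x) * 0.1)
y_spline = spline(x)

# MLP: tanh activations, weight decay (alpha) and early stopping keep the fit smooth
mlp = MLPRegressor(hidden_layer_sizes=(50, 50), activation='tanh',
                   alpha=1e-2, early_stopping=True, max_iter=5000)
mlp.fit(x.reshape(-1, 1), y)
y_mlp = mlp.predict(x.reshape(-1, 1))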
I am doing some work which requires fitting a Gaussian to a cluster of points which is expected to be distributed normally.
I have data which looks like this, you can see the small tightly grouped cluster of points on the left:
I zoom in around the cluster, and use scikit-learn KDE to get a density distribution (with Gaussian kernel), which looks like this:
Then I fit the Gaussian, and it turns out to have far too small a sigma:
centroid_x: -36.3204357
centroid_y: -12.8734763
sigma_x: 0.17916588
sigma_y: 0.07428976
From inspection of the density distribution, the x and y sigma should be more on the order of ~1, rather than ~0.1. Does anyone know why this behaviour might be occurring? I don't believe there are significant errors in my code or method; this technique has worked well on other data sets, for example:
I have the given data set:
To this data I would like to fit a Gaussian curve at the point the red arrow points to. I have attempted to do so by restricting the data points to a range of channels close to the peak and using scipy.optimize.curve_fit with a Gaussian function, which produces the fit shown below.
This method, however, does not take into account the slope of the background noise in the data points, which affects the accuracy of the peak position obtained from the fit.
I would like to take into account this background slope. How do I go about doing so in python?
You have to somehow model the background and the Gaussian peak, and perhaps any other peaks in the spectrum. Your background looks to be roughly 1/x (or some other power of x), but it might also be exponential. You may know this, or you may find that plotting on a semi-log plot can help decide which of these forms is better.
To fit the background and Gaussian with curve_fit, you would have to write a model function that modeled both. Allow me to recommend using lmfit (http://lmfit.github.io/lmfit-py/) as it has several built-in models and can help you compose a model of several different line shapes. An example that might be helpful for your problem is at (http://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes).
A script to fit your data might look like
import numpy as np
from lmfit.models import PowerLawModel, ExponentialModel, GaussianModel

# make models for the individual components
# (PowerLawModel is imported in case a power-law background fits better)
mod_expon = ExponentialModel(prefix='exp_')
mod_gauss = GaussianModel(prefix='g1_')

# sum components to make a composite model (add more if needed)
model = mod_expon + mod_gauss

# create fitting parameters by name, give initial values
params = model.make_params(g1_amplitude=5, g1_center=55, g1_sigma=5,
                           exp_amplitude=5, exp_decay=10)

# do the fit (xdata and ydata are your channel and count arrays)
result = model.fit(ydata, params, x=xdata)

# print out fit statistics, best-fit parameters, uncertainties, ...
print(result.fit_report())
There are many more examples in the docs, including showing how to extract and plot the individual components, and so on.
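For example, extracting and plotting the individual components might look like this (a sketch that reuses result, xdata and ydata from the script above and assumes Matplotlib is available):

import matplotlib.pyplot as plt

# evaluate each component (exponential background, Gaussian peak) on its own
comps = result.eval_components(x=xdata)

plt.plot(xdata, ydata, 'b.', label='data')
plt.plot(xdata, result.best_fit, 'r-', label='total fit')
plt.plot(xdata, comps['exp_'], 'k--', label='exponential background')
plt.plot(xdata, comps['g1_'], 'g--', label='Gaussian peak')
plt.legend()
plt.show()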
How I would do this is to use a fit that models both the signal and the background. That is, fit not just a Gaussian, but a Gaussian plus a function that describes the background. The first approximation to your background is a linear slope, so you could use a form like a*exp(-(x-x0)**2/w**2) + m*x + c.
This gives you more fitting parameters, all of which are interdependent, but if you can give them reasonable initial values then the fit normally converges well.
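A hedged sketch of that idea with scipy.optimize.curve_fit, using synthetic data in place of your channels and counts (the initial guesses are only rough starting points):

import numpy as np
from scipy.optimize import curve_fit

def gauss_plus_line(x, a, x0, w, m, c):
    # Gaussian peak sitting on a linear background
    return a * np.exp(-(x - x0)**2 / w**2) + m * x + c

# synthetic stand-in for the channel/count data around the peak
xdata = np.linspace(40, 70, 120)
ydata = gauss_plus_line(xdata, 10, 55, 3, -0.2, 30) + np.random.randn(xdata.size)

# rough initial guesses: amplitude, centre, width, slope, offset
p0 = [ydata.max() - ydata.min(), xdata[np.argmax(ydata)], 5.0, 0.0, ydata.min()]
popt, pcov = curve_fit(gauss_plus_line, xdata, ydata, p0=p0)
a, x0, w, m, c = popt  # x0 is the peak position with the sloped background accounted for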
I'd like to know ways to determine how well a Gaussian function is fitting my data.
Here are a few plots I've been testing methods against. Currently, I'm just using the RMSE of the fit versus the sample (red is fit, blue is sample).
For instance, here are 2 good fits:
And here are 2 terrible fits that should be flagged as bad data:
In general, I'm looking for suggestions of additional metrics to measure the goodness of fit. Additionally, as you can see in the second 'good' fit, there can sometimes be other peaks outside the data. Currently, these are penalized by the RMSE method, even though they should not be.
I'm looking for suggestions of additional metrics to measure the goodness of fit.
The one-sample Kolmogorov-Smirnov (KS) test would be a good starting point.
I'd suggest the Wikipedia article as an introduction.
The test is available in SciPy as scipy.stats.kstest. The function computes and returns both the KS test statistic and the p-value.
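For example (a sketch; the sample is simulated here, and the parameters passed to kstest should be the ones your Gaussian fit produced):

import numpy as np
from scipy.stats import kstest, norm

# simulated stand-in for the data the Gaussian was fitted to
sample = np.random.normal(loc=2.0, scale=1.5, size=300)

# test against the fitted normal, not the standard normal
mu, sigma = norm.fit(sample)
statistic, p_value = kstest(sample, 'norm', args=(mu, sigma))
print(statistic, p_value)

Keep in mind that estimating the parameters from the same sample makes the nominal p-value somewhat optimistic.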
You can try quantile-quantile (Q-Q) plots using probplot from scipy.stats:

import pylab
from scipy.stats import probplot

# data: the sample you fitted the Gaussian to
probplot(data, dist='norm', plot=pylab)
pylab.show()
Calculate quantiles for a probability plot, and optionally show the plot.

Generates a probability plot of sample data against the quantiles of a specified theoretical distribution (the normal distribution by default). probplot optionally calculates a best-fit line for the data and plots the results using Matplotlib or a given plot function.
There are other ways of evaluating a good fit, but most of them are not robust to outliers.
There is MSE (mean squared error), which you already know, and RMSE, which is its square root.
You can also measure the error with MAE (mean absolute error) and MAPE (mean absolute percentage error).
There is also the Kolmogorov-Smirnov test, which is more involved; SciPy provides it as scipy.stats.kstest, whereas MAE, MAPE, and MSE are easy to implement yourself.
(If you are dealing with unsupervised data and/or classification, which is not your case apparently, ROC curves and confusion matrix are also accuracy metrics.)
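MAE, MAPE, MSE and RMSE are straightforward to compute with NumPy; a minimal sketch (the arrays are illustrative stand-ins for the fitted curve and the sample at the same points):

import numpy as np

# fitted-curve values and observed values at the same x positions
sample = np.array([1.0, 2.5, 3.8, 2.2, 1.1])
fit = np.array([1.1, 2.3, 4.0, 2.0, 1.0])

mse = np.mean((fit - sample) ** 2)                     # mean squared error
rmse = np.sqrt(mse)                                    # root mean squared error
mae = np.mean(np.abs(fit - sample))                    # mean absolute error
mape = np.mean(np.abs((fit - sample) / sample)) * 100  # percent; assumes sample has no zeros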
I am trying to fit a gamma distribution to my data points, and I can do that using code below.
import scipy.stats as ss
import numpy as np
dataPoints = np.arange(0,1000,0.2)
fit_alpha,fit_loc,fit_beta = ss.rv_continuous.fit(ss.gamma, dataPoints, floc=0)
I want to reconstruct a larger distribution out of many such small gamma distributions (the larger distribution is irrelevant to the question; it only explains why I am trying to fit a cdf as opposed to a pdf).
To achieve that, I want to fit a cumulative distribution function, rather than a pdf, to my smaller distribution data. More precisely, I want to fit the data to only a part of the cumulative distribution.
For example, I want to fit the data only until the cumulative probability function (with a certain scale and shape) reaches 0.6.
Any thoughts on using fit() for this purpose?
I understand that you are trying to piecewise reconstruct your cdf with several small gamma distributions each with a different scale and shape parameter capturing the 'local' regions of your distribution.
That probably makes sense if your empirical distribution is multi-modal or difficult to summarize with one 'global' parametric distribution.
I don't know if you have specific reasons for fitting several gamma distributions, but if your goal is to fit a distribution that is relatively smooth and captures your empirical cdf well, you could take a look at kernel density estimation (KDE). It is essentially a non-parametric way to fit a distribution to your data.
http://scikit-learn.org/stable/modules/density.html
http://en.wikipedia.org/wiki/Kernel_density_estimation
For example, you can try out a Gaussian kernel and change the bandwidth parameter to control how smooth your fit is. A bandwidth which is too small leads to an unsmooth ("overfitted") result [high variance, low bias]. A bandwidth which is too large results in a very smooth estimate but with high bias.
from sklearn.neighbors import KernelDensity

# KernelDensity expects a 2-D array of shape (n_samples, n_features)
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(dataPoints.reshape(-1, 1))
A good way to select a bandwidth that balances the bias-variance trade-off is cross-validation. The high-level idea is that you partition your data, fit on the training portion and 'validate' on the held-out portion, which prevents overfitting to the data.
Fortunately, sklearn also provides a nice example of choosing the best bandwidth of a Gaussian kernel using cross-validation, which you can borrow some code from:
http://scikit-learn.org/stable/auto_examples/neighbors/plot_digits_kde_sampling.html
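A compact sketch of that search (the bandwidth grid is illustrative; GridSearchCV relies on KernelDensity's built-in score, the total log-likelihood):

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# dataPoints: the 1-D array from the question, reshaped to (n_samples, n_features)
X = dataPoints.reshape(-1, 1)

grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.logspace(-1, 1, 20)},
                    cv=5)
grid.fit(X)

print(grid.best_params_)
kde = grid.best_estimator_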
Hope this helps!