Evaluating a Gaussian Fit - python

I'd like to know ways to determine how well a Gaussian function is fitting my data.
Here are a few plots I've been testing methods against. Currently, I'm just using the RMSE of the fit versus the sample (red is fit, blue is sample).
For instance, here are 2 good fits:
And here are 2 terrible fits that should be flagged as bad data:
In general, I'm looking for suggestions of additional metrics to measure the goodness of fit. Additionally, as you can see in the second 'good' fit, there can sometimes be other peaks outside the data. Currently, these are penalized by the RMSE method, though they should not be.

I'm looking for suggestions of additional metrics to measure the goodness of fit.
The one-sample Kolmogorov-Smirnov (KS) test would be a good starting point.
I'd suggest the Wikipedia article as an introduction.
The test is available in SciPy as scipy.stats.kstest. The function computes and returns both the KS test statistic and the p-value.
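A minimal sketch of what the call looks like, assuming the data is a 1-D array of observations. The arrays here are synthetic stand-ins, and the normal parameters are estimated from the sample itself, which makes the reported p-value only approximate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for real data: one Gaussian sample, one clearly non-Gaussian
sample = rng.normal(loc=5.0, scale=2.0, size=500)
bad_sample = rng.exponential(scale=1.0, size=500)

# Compare each sample against a normal distribution with its own mean and std
statistic, p_value = stats.kstest(
    sample, "norm", args=(sample.mean(), sample.std(ddof=1))
)
bad_statistic, bad_p_value = stats.kstest(
    bad_sample, "norm", args=(bad_sample.mean(), bad_sample.std(ddof=1))
)

print(f"Gaussian sample:    D = {statistic:.3f}, p = {p_value:.3g}")
print(f"Exponential sample: D = {bad_statistic:.3f}, p = {bad_p_value:.3g}")
```

A small statistic (and a large p-value) means the empirical CDF stays close to the reference CDF; the exponential sample should give a much larger statistic.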

You can try quantile-quantile (Q-Q) plots using probplot from scipy.stats:
import pylab
from scipy.stats import probplot

# Q-Q plot of the sample against a normal distribution
plot = probplot(data, dist='norm', plot=pylab)
pylab.show()
Calculate quantiles for a probability plot, and optionally show the
plot.
Generates a probability plot of sample data against the quantiles of a
specified theoretical distribution (the normal distribution by
default). probplot optionally calculates a best-fit line for the data
and plots the results using Matplotlib or a given plot function.
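If you want a number rather than a picture, probplot also returns the best-fit line together with its correlation coefficient r. A rough sketch, assuming a 1-D sample (synthetic here), where r close to 1 suggests the sample quantiles follow the normal quantiles:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.normal(loc=0.0, scale=1.0, size=500)  # synthetic stand-in for real data

# With the default fit=True, probplot returns the quantile pairs and the
# least-squares line (slope, intercept, r) through them
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    data, dist="norm"
)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.4f}")
```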

There are other ways of evaluating a good fit, but most of them are not robust to outliers.
There is MSE - Mean squared error, which you already know, and RMSE which is the root of it.
But you can also measure it using MAE - Mean Absolute Error and MAPE - Mean absolute percentage error.
Also, there is the Kolmogorov-Smirnov test, which is far more complex, so you would probably need a library for it, while MAE, MAPE and MSE are quite easy to implement yourself.
(If you were dealing with classification, which apparently is not your case, ROC curves and confusion matrices would be the usual accuracy metrics.)
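To illustrate how little code these metrics need, here is a small sketch with NumPy. The function name fit_errors and the sample arrays are made up for the example; MAPE skips points where the observed value is zero, which is one possible convention among several:

```python
import numpy as np

def fit_errors(observed, fitted):
    """Return MSE, RMSE, MAE and MAPE (in percent) between data and fit."""
    observed = np.asarray(observed, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    residual = observed - fitted
    mse = np.mean(residual ** 2)
    nonzero = observed != 0  # MAPE is undefined where the observed value is 0
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(residual)),
        "mape": 100.0 * np.mean(np.abs(residual[nonzero] / observed[nonzero])),
    }

# Hypothetical sample and fitted values
observed = np.array([1.0, 2.0, 4.0, 2.0, 1.0])
fitted = np.array([1.0, 2.5, 3.5, 2.0, 1.0])
errors = fit_errors(observed, fitted)
print(errors)
```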

Related

Chi-square value from lmfit

I have been trying to do a pixel-by-pixel fitting of a set of images, i.e. I have data at different wavelengths in different images and I am trying to fit a function for each pixel individually. I have done the fitting using lmfit and obtained the values of the unknown parameters for each pixel. Now, I want to obtain the chi-squared value for each fit. I know that lmfit has an attribute called chisqr which can give me this, but what is confusing me is this line from the lmfit GitHub site:
"Note that the calculation of chi-square and reduced chi-square assume that the returned residual function is scaled properly to the uncertainties in the data. For these statistics to be meaningful, the person writing the function to be minimized must scale them properly."
I suspect that the values I am getting from the chisqr attribute are not exactly right and that some scaling needs to be done. Can somebody please explain how lmfit calculates the chi-square value and what scaling I am required to do?
This is a sample of my fitting function
def fcn2fit(params, freq, F, sigma):
    colden = params['colden'].value
    tk = params['tk'].value
    model = greybodyfit(np.array(freq), colden, tk)
    # residual scaled by the uncertainty in F
    return (model - F) / sigma
colden and tk are the free parameters, freq is the independent variable and F is the dependent variable, sigma is the error in F. Is returning (model-F)/sigma the right way of scaling the residuals so that the chisqr attribute gives the correct chi-square value?
The value reported for chi-square is the sum of the square of the residual for the fit. Lmfit cannot tell whether that residual function is properly scaled by the standard error of the data - this scaling must be done in the objective function if you are using lmfit.minimize or passed in as weights if using lmfit.Model.
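In other words, with a residual of (model - F)/sigma the reported value is the usual chi-square, and the reduced chi-square divides it by the number of data points minus the number of varied parameters. A small NumPy sketch of that arithmetic (the function name chi_square and the numbers are invented for illustration; this is not lmfit's own code):

```python
import numpy as np

def chi_square(model, data, sigma, n_varys):
    """Chi-square and reduced chi-square for residuals scaled by sigma."""
    residual = (np.asarray(model) - np.asarray(data)) / np.asarray(sigma)
    chisqr = float(np.sum(residual ** 2))
    n_free = residual.size - n_varys  # degrees of freedom
    return chisqr, chisqr / n_free

# Hypothetical data, model values and 1-sigma uncertainties
data = np.array([1.0, 2.0, 3.0, 4.0])
model = np.array([1.1, 1.9, 3.2, 3.8])
sigma = np.array([0.1, 0.1, 0.2, 0.2])

chisqr, redchi = chi_square(model, data, sigma, n_varys=2)
print(chisqr, redchi)
```

Here every point is off by exactly one sigma, so chi-square comes out at about 4 and the reduced value at about 2. Without the division by sigma the same sum would depend on the units of F and could not be compared with the number of degrees of freedom.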

What is the difference between a linear regression classifier and linear regression used to extract a confidence interval?

I am a beginner with machine learning. I want to use time series linear regression to extract the confidence interval of my dataset. I don't need to use the linear regression as a classifier. Firstly, what is the difference between the two cases? Secondly, is there a different way to implement them in Python?
The main difference is that a classifier computes a probability for a label, whereas a regression computes a quantitative output.
For instance, if you want to compute the price of a flat from some criteria, you use a regression; if you want to assign a label (luxurious, modest, ...) to the same flat from the same criteria, you use a classifier.
However, using a regression to compute a threshold that separates the observed labels is also a common technique. That is the case for a linear SVM, which computes a boundary between labels, called the decision boundary. Be warned that the main drawback of a linear model is that it is linear: the boundary separating the labels is necessarily a straight line. Sometimes that is good enough, sometimes it is not.
Logistic regression is an exception because it actually computes a probability. Its name is misleading.
For regression, when you want a quantitative output, you can use a confidence interval to get an idea of the error. In classification there is no confidence interval; even with a linear SVM it makes no sense. You can use the decision function, but it is difficult to interpret in practice, or you can use the predicted probabilities, count how often the predicted label is wrong, and compute an error ratio. There is a plethora of such ratios to choose from depending on your problem, and frankly that is the subject of a whole book.
Anyway, if you are working with a time series, then as far as I know your goal is a quantitative output, so you do not need a classifier, as you said. How to extract the interval depends entirely on the object you used to compute the regression in Python, that is, on the attributes that object exposes, and therefore on the library too. It would be much easier to answer if you indicated which libraries and objects you are using.
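Since the library is not stated, here is only one possible sketch, assuming scipy.stats.linregress and a synthetic series: it builds a 95% confidence interval for the slope from the standard error that linregress reports and a Student t quantile with n - 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic time series: linear trend plus noise (stand-in for real data)
t = np.arange(50, dtype=float)
y = 2.0 * t + 1.0 + rng.normal(scale=3.0, size=t.size)

result = stats.linregress(t, y)

# Two-sided 95% interval for the slope
t_crit = stats.t.ppf(0.975, df=t.size - 2)
slope_low = result.slope - t_crit * result.stderr
slope_high = result.slope + t_crit * result.stderr

print(f"slope = {result.slope:.3f}, 95% CI = [{slope_low:.3f}, {slope_high:.3f}]")
```

Other libraries expose the same information differently, so the attribute names above apply to this function only.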

Weighted Gaussian kernel density estimation in `python`

Update: Weighted samples are now supported by scipy.stats.gaussian_kde. See here and here for details.
It is currently not possible to use scipy.stats.gaussian_kde to estimate the density of a random variable based on weighted samples. What methods are available to estimate densities of continuous random variables based on weighted samples?
Neither sklearn.neighbors.KernelDensity nor statsmodels.nonparametric seem to support weighted samples. I modified scipy.stats.gaussian_kde to allow for heterogeneous sampling weights and thought the results might be useful for others. An example is shown below.
An ipython notebook can be found here: http://nbviewer.ipython.org/gist/tillahoffmann/f844bce2ec264c1c8cb5
Implementation details
The weighted arithmetic mean is
The unbiased data covariance matrix is then given by
The bandwidth can be chosen by scott or silverman rules as in scipy. However, the number of samples used to calculate the bandwidth is Kish's approximation for the effective sample size.
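As a rough sketch of these quantities for the univariate case, assuming the standard expressions for normalized weights (weighted mean sum(w*x), unbiased variance sum(w*(x - mean)^2) / (1 - sum(w^2)), and Kish's effective sample size 1 / sum(w^2)), which are taken here to be the ones intended above. The last lines use the weights argument that scipy.stats.gaussian_kde now accepts:

```python
import numpy as np
from scipy import stats

samples = np.array([10.0, 5.0])
weights = np.array([3.0, 1.0])

# Normalize the weights so that they sum to one
w = weights / weights.sum()

# Weighted arithmetic mean
mean = np.sum(w * samples)

# Unbiased weighted variance (1-D analogue of the covariance matrix)
variance = np.sum(w * (samples - mean) ** 2) / (1.0 - np.sum(w ** 2))

# Kish's approximation for the effective sample size
n_eff = 1.0 / np.sum(w ** 2)

# Weighted KDE; the bandwidth rule uses the effective sample size
kde = stats.gaussian_kde(samples, weights=weights)
grid = np.linspace(0.0, 15.0, 301)
density = kde(grid)

print(mean, variance, n_eff)
```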
For univariate distributions you can use KDEUnivariate from statsmodels. It is not well documented, but the fit method accepts a weights argument. FFT cannot be used together with weights, so pass fft=False. Here is an example:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.kde import KDEUnivariate

# Unweighted: the value 10 appears three times
kde1 = KDEUnivariate(np.array([10., 10., 10., 5.]))
kde1.fit(bw=0.5)
plt.plot(kde1.support, [kde1.evaluate(xi) for xi in kde1.support], 'x-')

# Weighted: the same data expressed as two points with weights 3 and 1
kde1 = KDEUnivariate(np.array([10., 5.]))
kde1.fit(weights=np.array([3., 1.]),
         bw=0.5,
         fft=False)
plt.plot(kde1.support, [kde1.evaluate(xi) for xi in kde1.support], 'o-')
which produces this figure:
Check out the packages PyQT-Fit and statistics for Python. They seem to have kernel density estimation with weighted observations.

How to normalize a histogram of an exponential distribution in scipy?

I'm trying to fit an exponential distribution to a dataset I have. Strangely, no matter what I do I can't seem to scale the histogram so it fits the fitted exponential distribution.
param=expon.fit(data)
pdf_fitted=norm.pdf(x,loc=param[0],scale=param[1])
plot(x,pdf_fitted,'r-')
hist(constraint1N55, normed=1,alpha=.3,histtype='stepfilled')
For some reason, the histogram takes up much more space than the probability distribution, even though I have normed=1. Is there something I can do to make things fit more appropriately?
You made an error. You fitted to an exponential, but plotted a normal distribution:
pdf_fitted=expon.pdf(x,loc=param[0],scale=param[1])
The data looks good when plotted properly:
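For reference, a self-contained sketch of the corrected workflow on synthetic data (the original dataset is not available, so the sample below is made up). It uses numpy.histogram with density=True to get bar heights that integrate to one; in current matplotlib the hist keyword is likewise density=True, since normed has been removed:

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(2)
data = rng.exponential(scale=3.0, size=5000)  # synthetic stand-in for the dataset

# Fit an exponential distribution; fit returns (loc, scale)
loc, scale = expon.fit(data)

# Normalized histogram: bar heights times bin widths sum to 1
heights, edges = np.histogram(data, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Evaluate the fitted exponential (not the normal) PDF at the bin centers
pdf_fitted = expon.pdf(centers, loc=loc, scale=scale)

area = np.sum(heights * np.diff(edges))
print(f"loc = {loc:.3f}, scale = {scale:.3f}, histogram area = {area:.3f}")
```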

Fitting data points to a cumulative distribution

I am trying to fit a gamma distribution to my data points, and I can do that using code below.
import scipy.stats as ss
import numpy as np
dataPoints = np.arange(0,1000,0.2)
fit_alpha,fit_loc,fit_beta = ss.rv_continuous.fit(ss.gamma, dataPoints, floc=0)
I want to reconstruct a larger distribution using many such small gamma distributions (the larger distribution is irrelevant for the question, only justifying why I am trying to fit a cdf as opposed to a pdf).
To achieve that, I want to fit a cumulative distribution, as opposed to a pdf, to my smaller distribution data. More precisely, I want to fit the data to only a part of the cumulative distribution.
For example, I want to fit the data only until the cumulative probability function (with a certain scale and shape) reaches 0.6.
Any thoughts on using fit() for this purpose?
I understand that you are trying to piecewise reconstruct your cdf with several small gamma distributions each with a different scale and shape parameter capturing the 'local' regions of your distribution.
Probably makes sense if your empirical distribution is multi-modal / difficult to be summarized by one 'global' parametric distribution.
Don't know if you have specific reasons behind specifically fitting several gamma distributions, but in case your goal is to try to fit a distribution which is relatively smooth and captures your empirical cdf well perhaps you can take a look at Kernel Density Estimation. It is essentially a non-parametric way to fit a distribution to your data.
http://scikit-learn.org/stable/modules/density.html
http://en.wikipedia.org/wiki/Kernel_density_estimation
For example, you can try out a Gaussian kernel and change the bandwidth parameter to control how smooth your fit is. A bandwidth which is too small leads to an unsmooth ("overfitted") result [high variance, low bias]. A bandwidth which is too large gives a very smooth result but with high bias.
from sklearn.neighbors import KernelDensity
# KernelDensity expects a 2-D array of shape (n_samples, n_features)
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(dataPoints.reshape(-1, 1))
A good way to select a bandwidth that balances the bias-variance tradeoff is cross-validation. The high-level idea is that you partition your data, run the analysis on the training set and 'validate' on the test set, which guards against overfitting the data.
Fortunately, sklearn also has a nice example of choosing the best bandwidth of a Gaussian kernel using cross-validation, which you can borrow some code from:
http://scikit-learn.org/stable/auto_examples/neighbors/plot_digits_kde_sampling.html
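A condensed sketch of that idea, assuming GridSearchCV over a hand-picked bandwidth grid and a synthetic gamma-distributed sample in place of your data (the grid range and the 5 folds are arbitrary choices for the example):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)

# Synthetic stand-in for the data, as a column of shape (n_samples, 1)
data_points = rng.gamma(shape=2.0, scale=1.5, size=300).reshape(-1, 1)

# Candidate bandwidths; each one is scored by held-out log-likelihood
bandwidths = np.logspace(-1, 1, 20)
search = GridSearchCV(
    KernelDensity(kernel="gaussian"), {"bandwidth": bandwidths}, cv=5
)
search.fit(data_points)

best_bandwidth = search.best_params_["bandwidth"]
log_density = search.best_estimator_.score_samples(data_points)

print(f"best bandwidth: {best_bandwidth:.3f}")
```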
Hope this helps!
