I am trying to fit a gamma distribution to my data points, and I can do that using the code below.
import scipy.stats as ss
import numpy as np
dataPoints = np.arange(0, 1000, 0.2)
# Fit a gamma distribution with the location fixed at zero.
fit_alpha, fit_loc, fit_beta = ss.gamma.fit(dataPoints, floc=0)
I want to reconstruct a larger distribution using many such small gamma distributions (the larger distribution is irrelevant for the question, only justifying why I am trying to fit a cdf as opposed to a pdf).
To achieve that, I want to fit a cumulative distribution function, as opposed to a pdf, to my smaller distribution's data. More precisely, I want to fit the data to only part of the cumulative distribution.
For example, I want to fit the data only up to the point where the cumulative probability function (with a certain scale and shape) reaches 0.6.
Any thoughts on using fit() for this purpose?
I understand that you are trying to piecewise-reconstruct your cdf with several small gamma distributions, each with a different scale and shape parameter capturing the 'local' regions of your distribution.
That probably makes sense if your empirical distribution is multi-modal or otherwise difficult to summarize with one 'global' parametric distribution.
I don't know if you have specific reasons for fitting several gamma distributions in particular, but if your goal is to fit a distribution that is relatively smooth and captures your empirical cdf well, you might take a look at kernel density estimation (KDE). It is essentially a non-parametric way to fit a distribution to your data.
http://scikit-learn.org/stable/modules/density.html
http://en.wikipedia.org/wiki/Kernel_density_estimation
For example, you can try out a Gaussian kernel and change the bandwidth parameter to control how smooth your fit is. A bandwidth that is too small leads to an unsmooth ("overfitted") result [high variance, low bias]; a bandwidth that is too large leads to a very smooth result but with high bias.
from sklearn.neighbors import KernelDensity
# KernelDensity expects a 2-D array of shape (n_samples, n_features).
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(dataPoints.reshape(-1, 1))
A good way to select a bandwidth parameter that balances the bias-variance tradeoff is cross-validation. The high-level idea: partition your data, run the analysis on the training set, and 'validate' on the test set; this will prevent overfitting to the data.
Fortunately, sklearn also implements a nice example of choosing the best bandwidth of a Gaussian kernel using cross-validation, which you can borrow some code from:
http://scikit-learn.org/stable/auto_examples/neighbors/plot_digits_kde_sampling.html
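A minimal sketch of that approach, assuming dataPoints is the 1-D array from the question (the bandwidth grid is an arbitrary placeholder):
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Cross-validate over candidate bandwidths; KernelDensity.score() is the
# total log-likelihood, which GridSearchCV maximizes.
params = {'bandwidth': np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(kernel='gaussian'), params, cv=5)
grid.fit(dataPoints.reshape(-1, 1))
kde = grid.best_estimator_  # KDE refit on all data with the best bandwidth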
Hope this helps!
I'm currently trying to train a GP regression model in GPflow which will predict precipitation values given some meteorological inputs. I'm using a Linear+RBF+WhiteNoise kernel, which seems appropriate given the set of predictors I'm using.
My problem at the moment is that when I get the model to predict new values, it has a tendency to predict negative precipitation - see attached figure.
How can I "enforce" physical constraints when building the model? The training data doesn't contain any negative precipitation values, but it does contain a lot of values close to zero, which I assume means the GPR model isn't learning the "precipitation must be >=0" constraint very well.
If there's a way of explicitly enforcing a constraint like this it'd be perfect, but I'm not sure how that would work. Would this require a different optimization algorithm? Or is it possible to somehow build this constraint into the kernel structure?
This is more of a question for CrossValidated... A Gaussian process is essentially a distribution over functions with Gaussian marginals: the predictive distribution of f(x) at any point is, by construction, a Gaussian and hence unbounded. For example, if you have lots of observations close to zero, your model expects that values just below zero must also be quite likely.
If your observations are strictly positive, you could use a different likelihood, e.g. Exponential (gpflow.likelihoods.Exponential) or Beta (gpflow.likelihoods.Beta). Note that model.predict_y() always returns mean and variance, and for non-Gaussian likelihoods the variance may not actually be what you want. In practice, you're more likely to care about quantiles (e.g. 10%-90% confidence interval); there is an open issue on the GPflow github that relates to this. Which likelihood you use is part of your modelling choice, and depends on your data.
The simplest practical answer to your problem is to consider modelling the log-precipitation: if your original dataset is X and Y (with Y > 0 for all entries), compute logY = np.log(Y) and create your GP model e.g. using gpflow.models.GPR((X, logY), kernel). You then predict logY at test points and convert back from log-precipitation into precipitation space. (This is equivalent to a LogNormal likelihood, which isn't currently implemented in GPflow, though it would be straightforward to add.)
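A minimal sketch of that log-transform route, assuming the GPflow 2 API; Xtest and the kernel choice here are placeholders rather than anything from the original post:
import numpy as np
import gpflow

# Assumes X, Y are (N, 1) float arrays with Y > 0 everywhere.
logY = np.log(Y)
kernel = gpflow.kernels.Linear() + gpflow.kernels.SquaredExponential()
model = gpflow.models.GPR((X, logY), kernel=kernel)

gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# Predict in log space; exp of the latent mean is the median of the
# implied LogNormal and is guaranteed to be positive.
mean_log, var_log = model.predict_f(Xtest)
precip_median = np.exp(mean_log)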
I am doing some work which requires fitting a Gaussian to a cluster of points which is expected to be distributed normally.
I have data which looks like this; you can see the small, tightly grouped cluster of points on the left:
I zoom in around the cluster, and use scikit-learn KDE to get a density distribution (with Gaussian kernel), which looks like this:
Then I fit a Gaussian to the density, and it turns out to have far too small a sigma:
centroid_x: -36.3204357
centroid_y: -12.8734763
sigma_x: 0.17916588
sigma_y: 0.07428976
From inspection of the density distribution, the x and y sigma should be more on the order of ~1 rather than ~0.1. Does anyone know why this behaviour might be occurring? I don't believe there are significant errors in my code or method; this technique has worked well on other data sets, for example:
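To make the setup concrete, here is roughly the pipeline described above as a sketch; pts is assumed to be the (N, 2) array of cluster points, and the bandwidth and grid resolution are placeholder values:
import numpy as np
from scipy.optimize import curve_fit
from sklearn.neighbors import KernelDensity

# Estimate the 2-D density of the cluster with a Gaussian-kernel KDE.
kde = KernelDensity(kernel='gaussian', bandwidth=0.3).fit(pts)

# Evaluate the density on a grid covering the cluster.
xs = np.linspace(pts[:, 0].min(), pts[:, 0].max(), 100)
ys = np.linspace(pts[:, 1].min(), pts[:, 1].max(), 100)
XX, YY = np.meshgrid(xs, ys)
grid = np.column_stack([XX.ravel(), YY.ravel()])
density = np.exp(kde.score_samples(grid))

# Fit an axis-aligned 2-D Gaussian to the gridded density.
def gauss2d(xy, amp, x0, y0, sx, sy):
    x, y = xy
    return amp * np.exp(-((x - x0)**2 / (2 * sx**2) + (y - y0)**2 / (2 * sy**2)))

p0 = (density.max(), pts[:, 0].mean(), pts[:, 1].mean(), 1.0, 1.0)
popt, _ = curve_fit(gauss2d, (grid[:, 0], grid[:, 1]), density, p0=p0)
sigma_x, sigma_y = popt[3], popt[4]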
Is there a way in Python to generate random data based on the distribution of already existing data?
Here are the statistical parameters of my dataset:
Data
count 209.000000
mean 1.280144
std 0.374602
min 0.880000
25% 1.060000
50% 1.150000
75% 1.400000
max 4.140000
Since it is not a normal distribution, it is not possible to do this with np.random.normal. Any ideas?
Thank you.
Edit: Performing KDE:
from sklearn.neighbors import KernelDensity
import seaborn as sns

# Gaussian KDE on the 1-D data, then draw and plot new samples from it.
kde = KernelDensity(kernel='gaussian', bandwidth=0.525566).fit(data['y'].to_numpy().reshape(-1, 1))
sns.distplot(kde.sample(2400))
In general, real-world data doesn't exactly follow a "nice" distribution like the normal or Weibull distributions.
As in machine learning, there are generally two steps to sampling from a distribution of data points:
Fit a data model to the data.
Then, predict a new data point based on that model, with the help of randomness.
There are several ways to estimate the distribution of data and sample from that estimate:
Kernel density estimation.
Gaussian mixture models (see the sketch after this list).
Histograms.
Regression models.
Other machine learning models.
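For the Gaussian mixture option, a minimal sketch, assuming data['y'] as in the question (n_components=3 is an arbitrary placeholder; in practice you would pick it with a criterion such as BIC):
from sklearn.mixture import GaussianMixture

# Fit a mixture of Gaussians to the 1-D data, then draw new samples.
X = data['y'].to_numpy().reshape(-1, 1)
gmm = GaussianMixture(n_components=3).fit(X)
samples, _ = gmm.sample(2400)  # sample() also returns component labels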
In addition, methods such as maximum likelihood estimation make it possible to fit a known distribution (such as the normal distribution) to data, but the estimated distribution is generally a coarser approximation of the data than kernel density estimation or other machine learning models would give.
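As a sketch of that route with scipy (the gamma distribution here is an arbitrary placeholder, not a recommendation for this data):
import scipy.stats as ss

# Fit a gamma distribution by maximum likelihood, then sample from it.
shape, loc, scale = ss.gamma.fit(data['y'])
new_points = ss.gamma.rvs(shape, loc=loc, scale=scale, size=2400)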
See also my section "Random Numbers from a Distribution of Data Points".
I'd like to know ways to determine how well a Gaussian function is fitting my data.
Here are a few plots I've been testing methods against. Currently, I'm just using the RMSE of the fit versus the sample (red is fit, blue is sample).
For instance, here are 2 good fits:
And here are 2 terrible fits that should be flagged as bad data:
In general, I'm looking for suggestions of additional metrics to measure the goodness of fit. Additionally, as you can see in the second 'good' fit, there can sometimes be other peaks outside the data. Currently, these are penalized by the RMSE method, though they should not be.
I'm looking for suggestions of additional metrics to measure the goodness of fit.
The one-sample Kolmogorov-Smirnov (KS) test would be a good starting point.
I'd suggest the Wikipedia article as an introduction.
The test is available in SciPy as scipy.stats.kstest. The function computes and returns both the KS test statistic and the p-value.
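A minimal sketch, assuming sample is your data and mu, sigma are the parameters of your Gaussian fit:
from scipy import stats

# One-sample KS test of the data against the fitted normal; a small
# p-value suggests the fitted distribution is a poor match.
statistic, p_value = stats.kstest(sample, 'norm', args=(mu, sigma))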
You can try quantile-quantile (QQ) plots using probplot from scipy.stats:
import pylab
from scipy.stats import probplot

# Plot the sample quantiles against normal quantiles; points close to the
# line indicate a good fit.
probplot(data, dist='norm', plot=pylab)
pylab.show()
Calculate quantiles for a probability plot, and optionally show the plot. Generates a probability plot of sample data against the quantiles of a specified theoretical distribution (the normal distribution by default). probplot optionally calculates a best-fit line for the data and plots the results using Matplotlib or a given plot function.
There are other ways of evaluating a good fit, but most of them are not robust to outliers.
There is MSE (mean squared error), which you already know, and RMSE, which is its square root.
But you can also measure the fit using MAE (mean absolute error) and MAPE (mean absolute percentage error).
There is also the Kolmogorov-Smirnov test, which is more involved and for which you would probably want a library, while MAE, MAPE, and MSE you can implement yourself quite easily, as sketched below.
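A minimal sketch (y_true would be your sample and y_pred the fitted curve evaluated at the same points; both names are placeholders):
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error.
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    # Mean absolute percentage error; assumes y_true contains no zeros.
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0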
(If you were dealing with classification rather than regression, which is apparently not your case, ROC curves and the confusion matrix would also be relevant accuracy metrics.)
Update: Weighted samples are now supported by scipy.stats.gaussian_kde. See here and here for details.
It is currently not possible to use scipy.stats.gaussian_kde to estimate the density of a random variable based on weighted samples. What methods are available to estimate densities of continuous random variables based on weighted samples?
Neither sklearn.neighbors.KernelDensity nor statsmodels.nonparametric seems to support weighted samples. I modified scipy.stats.gaussian_kde to allow for heterogeneous sampling weights and thought the results might be useful for others. An example is shown below.
An ipython notebook can be found here: http://nbviewer.ipython.org/gist/tillahoffmann/f844bce2ec264c1c8cb5
Implementation details
The weighted arithmetic mean is

$$\bar{x} = \frac{\sum_i w_i x_i}{\sum_i w_i}.$$

The unbiased data covariance matrix is then given by

$$\Sigma = \frac{\sum_i w_i (x_i - \bar{x})(x_i - \bar{x})^\top}{\sum_i w_i - \sum_i w_i^2 / \sum_i w_i}.$$
The bandwidth can be chosen by the Scott or Silverman rules as in scipy. However, the number of samples used to calculate the bandwidth is Kish's approximation for the effective sample size, $n_{\mathrm{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2$.
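Given the update at the top, a minimal sketch of the now built-in route (scipy >= 1.2; the data and weights below are made up for illustration):
import numpy as np
from scipy.stats import gaussian_kde

values = np.random.normal(size=1000)
weights = np.random.uniform(0.5, 1.5, size=1000)

# gaussian_kde accepts per-sample weights directly.
kde = gaussian_kde(values, weights=weights)
density = kde.evaluate(np.linspace(-3.0, 3.0, 100))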
For univariate distributions you can use KDEUnivariate from statsmodels. It is not well documented, but the fit method accepts a weights argument; note that you cannot use the FFT path together with weights. Here is an example:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.kde import KDEUnivariate

# Unweighted KDE: repeating the value 10 three times acts as an implicit weight of 3.
kde1 = KDEUnivariate(np.array([10., 10., 10., 5.]))
kde1.fit(bw=0.5)
plt.plot(kde1.support, [kde1.evaluate(xi) for xi in kde1.support], 'x-')

# Weighted KDE: the same data expressed via weights; fft must be disabled
# when weights are supplied.
kde2 = KDEUnivariate(np.array([10., 5.]))
kde2.fit(weights=np.array([3., 1.]), bw=0.5, fft=False)
plt.plot(kde2.support, [kde2.evaluate(xi) for xi in kde2.support], 'o-')
which produces this figure:
Check out the packages PyQt-Fit and statistics for Python. They seem to have kernel density estimation with weighted observations.