What does 'frozen distribution' mean in Scipy? - python

In the documentation of scipy, the 'frozen pdf', etc., is mentioned sometimes, but I don't know what it means. Is it a statistical concept or scipy terminology?

I agree that the docs are somewhat unclear on the issue. A frozen distribution simply fixes the distribution's parameters (shape, loc and scale) for the programmer's convenience. I am unaware of the term "frozen distribution" outside of SciPy.
SciPy's frozen distribution is perhaps best described here:
Passing the loc and scale keywords time and again can become quite
bothersome. The concept of freezing a RV is used to solve such
problems.
rv = gamma(1, scale=2.)
By using rv we no longer have to include the scale or the shape
parameters anymore. Thus, distributions can be used in one of two
ways, either by passing all distribution parameters to each method
call (such as we did earlier) or by freezing the parameters for the
instance of the distribution. Let us check this:
rv.mean(), rv.std()
(2.0, 2.0)
This is, indeed, what we should get.
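For concreteness, here is a minimal sketch contrasting the two call styles the quoted passage describes (the shape value 1 and scale 2.0 follow the quoted example):

from scipy.stats import gamma

# Non-frozen usage: pass the shape and scale to every method call.
mean_a = gamma.mean(1, scale=2.0)
std_a = gamma.std(1, scale=2.0)

# Frozen usage: fix the parameters once and reuse the resulting object.
rv = gamma(1, scale=2.0)
mean_b, std_b = rv.mean(), rv.std()   # (2.0, 2.0), as above
density = rv.pdf(1.0)                 # no need to repeat the parameters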
In the scipy tutorial page, we see the following line:
(We explain the meaning of a frozen distribution below).
The only mention of frozen distribution after that point is the following:
The main additional methods of the not frozen distribution are related
to the estimation of distribution parameters:
fit: maximum likelihood estimation of distribution parameters, including location
and scale
fit_loc_scale: estimation of location and scale when shape parameters are given
nnlf: negative log likelihood function
expect: calculate the expectation of a function against the pdf or pmf
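As a small, hedged illustration of the first two items (the sample below is synthetic and the variable names are arbitrary):

import numpy as np
from scipy.stats import gamma

data = gamma.rvs(2.0, scale=3.0, size=1000, random_state=0)   # stand-in sample

# fit: maximum likelihood estimates of shape, loc and scale (a non-frozen call).
a_hat, loc_hat, scale_hat = gamma.fit(data)

# fit_loc_scale: quick loc/scale estimates when the shape parameter is already known.
loc0, scale0 = gamma.fit_loc_scale(data, 2.0)

# The estimates can then be frozen into a fitted distribution object.
fitted = gamma(a_hat, loc=loc_hat, scale=scale_hat)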

Related

Fit spline with given number of knots, but not knot positions

Given a set of 2D points, I would like to fit the optimal spline to this data with a given number of internal knots.
I have seen that we can use scipy's LSQUnivariateSpline to specify the number and position of knots, however it does not allow us to only specify the number of knots.
From the UnivariateSpline documentation, it seems implied that they have a method for fitting the spline with a given number of knots, as the documentation for the smoothing factor s states (emphasis mine):
Positive smoothing factor used to choose the number of knots. Number
of knots will be increased until the smoothing condition is satisfied...
So while I could go about this in a kind of backwards way and search through smoothing factors until it yields a spline with the desired number of knots, this seems to be a rather ridiculous way to approach this from a computational efficiency standpoint. Two extra search steps are happening just to cancel each other out and obtain a result that was already computed directly at the start.
I've searched around but haven't found a function to access this spline interpolation with a given number of knots directly. I'm not sure if I've missed something simple, or if it's hidden deeper down somewhere and/or not available in the API.
Note: a scipy solution is not required, any python libraries or handcrafted python code is fine (I am using scipy here just because that's where all of my searches about spline interpolation in python have landed me).
Unfortunately, it looks like the UnivariateSpline constructor passes off the computational work to the function dfitpack.curf0, which is implemented in Fortran.
Therefore, although the documentation indicates that the smoothing requirement is met by adjusting the number of knots, there is no way to directly access the function which fits a spline given a number of knots from the python API.
In light of this, it looks like one may need to look to another library or write the algorithm oneself, if avoiding the roundabout double search method is desired. However, in many cases, it may be acceptable to simply run a binary search for the desired number of knots by adjusting the smoothing parameter.
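A rough sketch of that roundabout approach follows. It assumes x is strictly increasing, that the knot count shrinks monotonically as the smoothing factor s grows (usual, though not guaranteed, FITPACK behaviour), and the helper name and search limits are arbitrary:

import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_with_n_knots(x, y, n_interior, s_lo=1e-9, s_hi=None, max_iter=60):
    # Bisect on the smoothing factor s until the spline has the requested
    # number of interior knots. Heuristic only; may never hit the target exactly.
    if s_hi is None:
        s_hi = 10 * len(x) * np.var(y)   # crude upper bound for the search
    best = None
    for _ in range(max_iter):
        s = 0.5 * (s_lo + s_hi)
        spl = UnivariateSpline(x, y, s=s)
        n = len(spl.get_knots()) - 2     # subtract the two boundary knots reported by get_knots()
        if n == n_interior:
            return spl
        if n > n_interior:
            s_lo = s                     # too many knots: smooth more
        else:
            s_hi = s                     # too few knots: smooth less
        best = spl
    return best                          # closest spline found if the target was never hit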
Scipy does not have smoothing splines with a fixed number of knots. You either provide your knots, or let FITPACK select them via the smoothing-condition knob.

KDE method more sensitive to multimodality

I am using scipy's stats.gaussian_kde() to produce a PDF estimate for a sample of data.
Scipy's docs clearly state:
"The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed."
Is there a method that may be more sensitive to spikes in frequency that does not involve manually setting bandwidth?
My assumption is that, because it is non-parametric, Gaussian KDE does not assume the shape of the distribution, yet it seems to be forced to assume normality nonetheless.
You can have a look here for an in-depth discussion on KDEs.
The only issue with multimodal distributions, as stated, is over-smoothing; no normality is assumed, and the Gaussian kernel is just a smoothing choice.
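A brief sketch of how the bandwidth, not any normality assumption, drives the over-smoothing (the data and the bandwidth factor 0.1 below are arbitrary illustrative choices):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# A clearly bimodal sample: two well-separated clumps.
data = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])

grid = np.linspace(-6, 6, 400)
kde_default = gaussian_kde(data)                 # Scott's rule bandwidth (the default)
kde_narrow = gaussian_kde(data, bw_method=0.1)   # smaller factor, less smoothing
# kde_narrow(grid) resolves the two modes more sharply than kde_default(grid).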

Identifying a distribution from a pdf in python

I have a probability density function of an unknown distribution which is given as a set of tuples (x, f(x)), where x=numpy.arange(0,1,size) and f(x) is the corresponding probability.
What is the best way to identify the corresponding distribution? So far my idea is to draw a large number of samples based on the pdf (by writing the code myself from scratch) and then use the obtained data to fit all of the distributions implemented in scipy.stats, then take the best fit.
Is there a better way to solve this problem? For example, is there some kind of utility in scipy.stats that I'm missing that would help me solve this problem?
In a fundamental sense, it's not really possible to definitively identify a distribution from empirical samples - see here for a discussion.
It's possible to do something more limited, which is to reject/accept the hypothesis that it comes out of one of a finite set of (parametric) distributions, based on a somewhat arbitrary criterion.
Given the finite set of distributions, for each distribution, you could perhaps realistically do the following:
Fit the distribution's parameters to the data. E.g., scipy.stats.beta.fit will fit the best parameters of the Beta distribution (all of scipy's continuous distributions have this method).
Reject/accept the hypothesis that the data was generated by this distribution. There is more than one way of doing this. A particularly simple way is to use the rvs() method of the fitted distribution to generate another sample, then use ks_2samp to run a two-sample Kolmogorov-Smirnov test.
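A hedged sketch of those two steps for a handful of candidate scipy.stats families (the candidate list, the helper name and the ranking by p-value are arbitrary illustrative choices):

import numpy as np
from scipy import stats

def best_candidate(data, candidates=(stats.beta, stats.gamma, stats.norm), seed=0):
    rng = np.random.default_rng(seed)
    results = []
    for dist in candidates:
        params = dist.fit(data)                      # step 1: ML fit of the parameters
        sample = dist.rvs(*params, size=len(data),   # step 2: draw a comparison sample
                          random_state=rng)
        stat, pvalue = stats.ks_2samp(data, sample)  # two-sample Kolmogorov-Smirnov test
        results.append((pvalue, dist.name, params))
    results.sort(key=lambda r: r[0], reverse=True)   # highest p-value first
    return results[0]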
Note that some specific distributions might have better, ad-hoc algorithms for testing whether a member of the distribution's family generated the data. As usual, the Normal distribution in particular has many (see normality tests).

What is scipy's equivalent to matlab's `mle` function?

I'm trying to fit some data to a mixed model using an expectation maximization approach. In Matlab, the code is as follows
% mixture model's PDF
mixtureModel = ...
@(x,pguess,kappa) pguess/180 + (1-pguess)*exp(kappa*cos(2*x/180*pi))/(180*besseli(0,kappa));
% Set up parameters for the MLE function
options = statset('mlecustom');
options.MaxIter = 20000;
options.MaxFunEvals = 20000;
% fit the model using maximum likelihood estimate
params = mle(data, 'pdf', mixtureModel, 'start', [.1 1/10], ...
'lowerbound', [0 1/50], 'upperbound', [1 50], ...
'options', options);
The data parameter is a 1-D vector of floats.
I'm wondering how the equivalent computation can be achieved in Python. I looked into scipy.optimize.minimize, but this doesn't seem to be a drop-in replacement for Matlab's mle.
I'm a bit lost and overwhelmed. Can somebody point me in the right direction (ideally with some example code)?
Thanks very much in advance!
Edit: In the meantime I've found this, but I'm still rather lost as (1) this seems primarily focused on Gaussian mixture models (which mine is not) and (2) my mathematical skills are severely lacking. That said, I'll happily accept an answer that elucidates how this notebook relates to my specific problem!
This is a mixture model (not a mixed model) of uniform and von Mises distributions, whose parameters you are trying to infer by direct maximum likelihood estimation (not EM, although EM may be more appropriate). You can find theses written on this exact problem if you search the internet. SciPy doesn't have anything that is as clear a choice as Matlab's fmincon, which mle uses as its default in your code, but you can look for scipy optimization methods that allow bounds on parameters. The scipy interface is different from Matlab's mle: the data go into the 'args' argument of the scipy minimization functions, while pguess and kappa are represented by a single parameter array of length 2.
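A minimal sketch along those lines, using scipy.optimize.minimize with the L-BFGS-B bounded optimiser (one reasonable choice, not the only one); the synthetic data line is a stand-in for the real 1-D vector, and the start values and bounds mirror the Matlab call:

import numpy as np
from scipy.optimize import minimize
from scipy.special import i0   # modified Bessel function of the first kind, order 0

def neg_log_likelihood(params, data):
    pguess, kappa = params
    # Same uniform + von Mises-style mixture density as the Matlab PDF.
    pdf = (pguess / 180.0
           + (1 - pguess) * np.exp(kappa * np.cos(2 * data / 180.0 * np.pi))
           / (180.0 * i0(kappa)))
    return -np.sum(np.log(pdf))

data = np.random.default_rng(0).uniform(-90.0, 90.0, 500)   # stand-in for the real data vector

result = minimize(neg_log_likelihood,
                  x0=[0.1, 1.0 / 10.0],
                  args=(data,),
                  method="L-BFGS-B",
                  bounds=[(0.0, 1.0), (1.0 / 50.0, 50.0)])
pguess_hat, kappa_hat = result.x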
I believe the scikit-learn toolkit has what you need:
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GMM.html.
Gaussian Mixture Model
Representation of a Gaussian mixture model probability distribution. This class allows for easy evaluation of, sampling from, and maximum-likelihood estimation of the parameters of a GMM distribution.
Initializes parameters such that every mixture component has zero mean and identity covariance.
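For reference, this class has since been renamed; in current scikit-learn releases the equivalent is sklearn.mixture.GaussianMixture. A minimal usage sketch (synthetic stand-in data; note it fits a mixture of Gaussians, not the uniform/von Mises mixture in the question):

import numpy as np
from sklearn.mixture import GaussianMixture

data = np.random.default_rng(0).normal(size=500)   # stand-in 1-D data vector
gm = GaussianMixture(n_components=2, random_state=0)
gm.fit(data.reshape(-1, 1))   # scikit-learn expects shape (n_samples, n_features)
print(gm.weights_, gm.means_, gm.covariances_)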

Method of moments in scipy?

Following from this question, is there a way to use any method other than MLE (maximum-likelihood estimation) for fitting a continuous distribution in scipy? I think that my data may be resulting in the MLE method diverging, so I want to try using the method of moments instead, but I can't find out how to do it in scipy. Specifically, I'm expecting to find something like
scipy.stats.genextreme.fit(data, method=method_of_moments)
Does anyone know if this is possible, and if so how to do it?
A few things to mention:
1) scipy does not have support for GMM (the generalized method of moments). There is some support for GMM via statsmodels (http://statsmodels.sourceforge.net/stable/gmm.html), and you can also access many R routines via Rpy2 (and R is bound to have every flavour of GMM ever invented): http://rpy.sourceforge.net/rpy2/doc-2.1/html/index.html
2) Regarding stability of convergence, if this is the issue, then probably your problem is not with the objective being maximised (e.g. the likelihood, as opposed to a generalised moment) but with the optimiser. Gradient optimisers can be really fussy (or rather, the problems we give them are not really suited for gradient optimisation, leading to poor convergence).
If statsmodels and Rpy do not give you the routine you need, it is perhaps a good idea to write your moment computation out explicitly and see how you can optimise it yourself - perhaps a custom-made little tool would work well for you?
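As a hedged illustration of writing the moment computation out by hand, here is textbook moment matching for a gamma distribution, chosen because its first two moments invert in closed form (the GEV case in the question would need a numerical solve); the sample is synthetic:

import numpy as np
from scipy import stats

data = stats.gamma.rvs(2.0, scale=3.0, size=2000, random_state=0)   # stand-in sample

# For gamma(a, scale): mean = a*scale and var = a*scale**2,
# so a = mean**2 / var and scale = var / mean.
m, v = data.mean(), data.var()
a_mm = m**2 / v
scale_mm = v / m

Note also that recent SciPy releases accept fit(data, method='MM') on continuous distributions, which is worth checking before rolling your own.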
