KDE method more sensitive to multimodality

I am using scipy's stats.gaussian_kde() to produce a PDF estimate for a sample of data.
Scipy's docs clearly state:
"The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed."
Is there a method that may be more sensitive to spikes in frequency that does not involve manually setting bandwidth?
My assumption is that, because it is non-parametric, Gaussian KDE does not assume the shape of the distribution, yet it seems to be forced to assume normality nonetheless.

You can have a look here for an in-depth discussion on KDEs.
As stated, the only issue with multimodal distributions is over-smoothing. No normality is assumed; the Gaussian kernel is just a smoothing choice.
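To make the over-smoothing point concrete, here is a minimal sketch on synthetic data (the sample, the seed, and the bandwidth factor 0.05 are all assumptions chosen for illustration). It does set the bandwidth by hand, which the question wanted to avoid, but only to show that the smoothing comes from the bandwidth and not from any normality assumption:

```python
import numpy as np
from scipy import stats

# Synthetic bimodal sample (assumed for illustration): two well-separated peaks.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-3.0, 0.5, 500), rng.normal(3.0, 0.5, 500)])

# Default bandwidth (Scott's rule) is driven by the overall spread of the sample.
kde_default = stats.gaussian_kde(sample)

# A much smaller bandwidth factor, set by hand only to show the effect.
kde_narrow = stats.gaussian_kde(sample, bw_method=0.05)

# Density in the valley between the modes and at one of the peaks.
valley_default = kde_default([0.0])[0]
valley_narrow = kde_narrow([0.0])[0]
peak_default = kde_default([3.0])[0]
peak_narrow = kde_narrow([3.0])[0]

print("bandwidth factors:", kde_default.factor, kde_narrow.factor)
print("valley density:", valley_default, valley_narrow)
print("peak density:", peak_default, peak_narrow)
```

The narrower kernel leaves less density in the valley and more at the peaks; both estimates are still sums of Gaussian bumps centred on the data points.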

Related

how to separate two distributions from a pdf(probability density function)?

Assume the pdf (probability density function) of my dataset is as shown below. There are two distributions in the dataset, and I want to fit them with a Gamma and a Gaussian distribution.
I have read this, but it applies only to two Gaussian distributions:
How to use python to separate two gaussian curves?
Here are the steps that I would like to follow:
1. Manually estimate the mean and variance of the Gaussian distribution.
2. Based on the estimated mean and variance, create the pdf of the Gaussian distribution.
3. Subtract the pdf of the Gaussian from the original pdf.
4. Fit the remaining pdf to a Gamma distribution.
I am able to do steps 1 to 3, but for step 4 I do not know how to fit a Gamma distribution from a pdf
(not from data samples, so stats.gamma.fit(data) does not work here).
Are the above steps reasonable for dealing with this kind of problem, and how can I do step 4 in Python? Any help is appreciated.

Interesting question. One issue I can see is that it will be sometimes difficult to disambiguate which mode is the Gamma and which is the Gaussian.
What I would perhaps do is try an expectation-maximization (EM) algorithm. Given the ambiguity stated above, I would do a few runs and select the best fit.
Perhaps, to speed things up a little, you could try EM with two possible starting points (somewhat similar to your idea):
1. Estimate a mixture model of two Gaussians, say g0, g1.
2. Run one EM to solve a mixture (Gamma, Gaussian), starting with an initial point that is (gamma_params_approx_gauss(g0), g1), where gamma_params_approx_gauss(g0) is a maximum-likelihood estimator of Gamma parameters given Gaussian parameters (see e.g. here).
3. Run another EM, but starting with (gamma_params_approx_gauss(g1), g0).
4. Select the best fit among the two EM results.
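As a rough sketch of the gamma_params_approx_gauss helper named in the steps above: the version below uses simple moment matching rather than the maximum-likelihood estimator the answer refers to, so it is a stand-in, not the linked method. The signature (mean and standard deviation as inputs) is an assumption, and it only makes sense for a Gaussian with a positive mean.

```python
import numpy as np
from scipy import stats

def gamma_params_approx_gauss(mean, std):
    """Hypothetical helper: Gamma (shape, scale) with the same mean and
    variance as a Gaussian N(mean, std**2). Moment matching, not MLE."""
    if mean <= 0 or std <= 0:
        raise ValueError("moment matching needs mean > 0 and std > 0")
    shape = (mean / std) ** 2   # k = mu^2 / sigma^2
    scale = std ** 2 / mean     # theta = sigma^2 / mu
    return shape, scale

# Example starting point for EM, using made-up Gaussian parameters for g0.
g0_mean, g0_std = 4.0, 1.5
shape0, scale0 = gamma_params_approx_gauss(g0_mean, g0_std)
gamma_init = stats.gamma(shape0, scale=scale0)
print(shape0, scale0, gamma_init.mean(), gamma_init.std())
```

The resulting Gamma has the same first two moments as the Gaussian component, which is usually close enough to serve as an EM initial point.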

What does 'frozen distribution' mean in Scipy?

In the documentation of scipy, the 'frozen pdf', etc, is mentioned sometimes, but I don't know the meaning of it? Is it a statistical concept or scipy terminology?

I agree that the docs are somewhat unclear on the issue. It seems that a frozen distribution fixes the distribution's parameters (shape, location, and scale) for the programmer's convenience. I am unaware of the term "frozen distribution" outside of SciPy.
SciPy's frozen distribution is perhaps best described here:
Passing the loc and scale keywords time and again can become quite
bothersome. The concept of freezing a RV is used to solve such
problems.
rv = gamma(1, scale=2.)
By using rv we no longer have to include the scale or the shape
parameters anymore. Thus, distributions can be used in one of two
ways, either by passing all distribution parameters to each method
call (such as we did earlier) or by freezing the parameters for the
instance of the distribution. Let us check this:
rv.mean(), rv.std()
(2.0, 2.0)
This is, indeed, what we should get.
In the scipy tutorial page, we see the following line:
(We explain the meaning of a frozen distribution below).
The only mention of frozen distribution after that point is the following:
The main additional methods of the not frozen distribution are related
to the estimation of distribution parameters:
fit: maximum likelihood estimation of distribution parameters, including location
and scale
fit_loc_scale: estimation of location and scale when shape parameters are given
nnlf: negative log likelihood function
expect: calculate the expectation of a function against the pdf or pmf
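A short sketch of the two usage styles described in the quoted tutorial text, using the same gamma(1, scale=2.) example; the evaluation point 1.0 is an arbitrary choice for illustration:

```python
import numpy as np
from scipy.stats import gamma

# Frozen: shape and scale are fixed once, when the object is created.
rv = gamma(1, scale=2.0)
frozen = (rv.mean(), rv.std(), rv.cdf(1.0))

# Not frozen: the same parameters are passed to every method call.
unfrozen = (
    gamma.mean(1, scale=2.0),
    gamma.std(1, scale=2.0),
    gamma.cdf(1.0, 1, scale=2.0),
)

print(frozen)
print(unfrozen)
```

Both tuples hold the same values; freezing only saves repeating the parameters.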

Is there a fast alternative to scipy _norm_pdf for correlated distribution sampling?

I have fit a series of SciPy continuous distributions for a Monte-Carlo simulation and am looking to take a large number of samples from these distributions. However, I would like to be able to take correlated samples, such that the ith sample takes the e.g., 90th percentile from each of the distributions.
In doing this, I've found a quirk in SciPy performance:
# very fast way to take many uncorrelated samples of length n
for shape, loc, scale in distro_props:
    sp.stats.norm.rvs(*shape, loc=loc, scale=scale, size=n)
# verrrrryyyyy slow way to take correlated samples of length n
correlate = np.random.uniform(size=n)
for shape, loc, scale in distro_props:
    sp.stats.norm.ppf(correlate, *shape, loc=loc, scale=scale)
Most of the results about this claim that the slowness of these SciPy distros is from the type-checking etc. wrappers. However, when I profiled the code, the vast bulk of the time was spent in the underlying math function, _continuous_distns.py:179(_norm_pdf). Furthermore, it scales with n, implying that it's looping through every element internally.
The SciPy docs on rv_continuous almost seem to suggest that the subclass should override this for performance, but it seems bizarre that I would monkeypatch into SciPy to speed up their ppf. I would just compute this for the normal from the ppf formula, but I also use lognormal and skewed normal, which are more of a pain to implement.
So, what is the best way in Python to compute a fast ppf for normal, lognormal, and skewed normal distributions? Or more broadly, to take correlated samples from several such distributions?

If you need just the normal ppf, it is indeed puzzling that it is so slow, but you can use scipy.special.erfinv instead:
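import numpy as np
from scipy import special, stats
from timeit import timeit

x = np.random.uniform(0,1,100)
# the normal ppf expressed through the inverse error function
np.allclose(special.erfinv(2*x-1)*np.sqrt(2),stats.norm().ppf(x))
# True
timeit(lambda:stats.norm().ppf(x),number=1000)
# 0.7717257660115138
timeit(lambda:special.erfinv(2*x-1)*np.sqrt(2),number=1000)
# 0.015020604943856597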
EDIT:
lognormal and triangle are also straightforward:
c = np.random.uniform()
np.allclose(np.exp(c*special.erfinv(2*x-1)*np.sqrt(2)),stats.lognorm(c).ppf(x))
# True
np.allclose(((1-np.sqrt(1-(x-c)/((x>c)-c)))*((x>c)-c))+c,stats.triang(c).ppf(x))
# True
I'm not familiar enough with the skew normal, unfortunately.

Ultimately, this issue was caused by my use of the skew-normal distribution. The ppf of the skew-normal does not have a closed-form analytic definition, so in order to compute the ppf, SciPy fell back to the generic numeric approximation in scipy.stats.rv_continuous, which iteratively computes the cdf and uses that to zero in on the ppf value. The skew-normal pdf is the product of the normal pdf and normal cdf, so this numeric approximation called the normal's pdf and cdf many, many times. This is why, when I profiled the code, it looked like the normal distribution was the problem, not the skew normal. The other answer to this question was able to achieve time savings by skipping type-checking, but that didn't actually make a difference to the run-time growth, just to small-n runtimes.
To solve this problem, I have replaced the skew-normal distribution with the Johnson SU distribution. It has 2 more free parameters than a normal distribution so it can fit different types of skew and kurtosis effectively. It's supported for all real numbers, and it has a closed-form ppf definition with a fast implementation in SciPy. Below you can see example Johnson SU distributions I've been fitting from the 10th, 50th, and 90th percentiles.
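A minimal sketch of correlated sampling with Johnson SU, assuming made-up shape, loc, and scale values in distro_props (these are not the fitted values from my simulation). Every distribution is evaluated at the same uniform draws, so the ith sample sits at the same percentile in each:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical (shape, loc, scale) triples; shape is (a, b) for Johnson SU.
distro_props = [((-0.5, 1.5), 0.0, 1.0), ((1.0, 2.0), 10.0, 3.0)]

# One shared vector of percentiles drives every distribution.
correlate = rng.uniform(size=n)
samples = [
    stats.johnsonsu.ppf(correlate, *shape, loc=loc, scale=scale)
    for shape, loc, scale in distro_props
]

# The closed form behind the ppf: x = loc + scale * sinh((z - a) / b),
# where z is the standard normal quantile of the same percentile.
(a, b), loc, scale = distro_props[0]
manual = loc + scale * np.sinh((stats.norm.ppf(correlate) - a) / b)

rank_corr = stats.spearmanr(samples[0], samples[1])[0]
print(np.allclose(samples[0], manual), rank_corr)
```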

Identifying a distribution from a pdf in python

I have a probability density function of an unknown distribution which is given as a set of tuples (x, f(x)), where x=numpy.arange(0,1,size) and f(x) is the corresponding probability.
What is the best way to identify the corresponding distribution? So far my idea is to draw a large number of samples based on the pdf (by writing the code myself from scratch) and then use the obtained data to fit all of the distributions implemented in scipy.stats, then take the best fit.
Is there a better way to solve this problem? For example, is there some kind of utility in scipy.stats that I'm missing that would help me solve this problem?

In a fundamental sense, it's not really possible to summarize a distribution based on empirical samples - see here for a discussion.
It's possible to do something more limited, which is to reject/accept the hypothesis that it comes out of one of a finite set of (parametric) distributions, based on a somewhat arbitrary criterion.
Given the finite set of distributions, for each distribution, you could perhaps realistically do the following:
Fit the distribution's parameters to the data. E.g., scipy.stats.beta.fit will fit the best parameters of the Beta distribution (all scipy distributions have this method).
Reject/accept the hypothesis that the data was generated by this distribution. There is more than a single way of doing this. A particularly simple way is to use the rvs() method of the distribution to generate another sample, then use ks_2samp to generate a Kolmogorov-Smirnov test.
Note that some specific distributions might have better, ad-hoc algorithms for testing whether a member of the distribution's family generated the data. As usual, the Normal distribution in particular has many (see normality test).
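The fit-then-test loop above might look roughly like this. The data, the candidate list, and the fit keyword arguments are assumptions for illustration (the Beta fit pins loc and scale because the question's x lies in [0, 1)); a low KS statistic is only a rough indication of a plausible fit, not proof of the distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in for samples drawn from the unknown pdf (assumed Beta(2, 5) here).
data = stats.beta.rvs(2.0, 5.0, size=500, random_state=rng)

# Hypothetical finite set of candidate families, with optional fit constraints.
candidates = [
    ("beta", stats.beta, {"floc": 0, "fscale": 1}),
    ("norm", stats.norm, {}),
    ("expon", stats.expon, {}),
]

results = {}
for name, dist, fit_kwargs in candidates:
    # Step 1: fit the family's parameters to the data.
    params = dist.fit(data, **fit_kwargs)
    # Step 2: draw a fresh sample from the fitted distribution and compare
    # it with the data using a two-sample Kolmogorov-Smirnov test.
    synthetic = dist.rvs(*params, size=data.size, random_state=rng)
    statistic, p_value = stats.ks_2samp(data, synthetic)
    results[name] = (statistic, p_value)

for name, (statistic, p_value) in results.items():
    print(f"{name}: KS statistic={statistic:.3f}, p-value={p_value:.3f}")
```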

Fitting a bimodal distribution to a set of values

Given a 1D array of values, what is the simplest way to figure out what the best fit bimodal distribution to it is, where each 'mode' is a normal distribution? Or in other words, how can you find the combination of two normal distributions that best reproduces the 1D array of values?
Specifically, I'm interested in implementing this in python, but answers don't have to be language specific.
Thanks!

What you are trying to do is called a Gaussian mixture model. The standard approach to solving this is expectation maximization; scipy svn includes a section on machine learning and EM called scikits. I use it a fair bit.
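For reference, a bare-bones EM loop for a two-component 1D Gaussian mixture, written with NumPy only. This is a sketch, not the scikits implementation: the function name, the quartile-based initialisation, the fixed iteration count, and the synthetic data are all assumptions.

```python
import numpy as np

def fit_two_gaussians(x, n_iter=200):
    """Hypothetical helper: EM for a mixture of two 1D Gaussians."""
    x = np.asarray(x, dtype=float)
    # Crude starting point: means at the quartiles, common spread, equal weights.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    sigma = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        z = (x[:, None] - mu) / sigma
        dens = w * np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and standard deviations.
        nk = resp.sum(axis=0)
        w = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sigma

# Synthetic bimodal data (assumed): 60% around -2, 40% around 3.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 600), rng.normal(3.0, 1.0, 400)])
weights, means, sigmas = fit_two_gaussians(data)
print(weights, means, sigmas)
```

With poorly separated or strongly unbalanced modes the result depends on the starting point, so several initialisations are worth trying.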

I suggest using the awesome scipy package.
It provides a few methods for optimisation.
There's a big fat caveat with simply applying a pre-defined least square fit or something along those lines.
Here are a few problems you will run into:
Noise larger than second/both peaks.
Partial peak - your data is cut off at one of the borders.
Sampling - the width of the peaks is smaller than your sampling interval.
It isn't normal - you'll get some result ...
Overlap - if peaks overlap, you'll often find that one peak is fitted correctly but the second will approach zero...
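For the well-behaved case, a least-squares fit of two Gaussians to a histogram could be sketched as follows. The model function, bin count, starting guesses, and synthetic data are assumptions; the caveats listed above are exactly where a fit like this tends to break, and it relies on reasonable starting values in p0.

```python
import numpy as np
from scipy.optimize import curve_fit

def two_gaussians(x, w, mu1, s1, mu2, s2):
    """Hypothetical model: weighted sum of two normal densities."""
    def pdf(x, mu, s):
        s = abs(s)  # keep the width positive during optimisation
        return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return w * pdf(x, mu1, s1) + (1 - w) * pdf(x, mu2, s2)

# Synthetic, clean, well-separated data (assumed for illustration).
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(-2.0, 0.5, 600), rng.normal(3.0, 1.0, 400)])

# Turn the samples into a density histogram and fit the curve to it.
heights, edges = np.histogram(values, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

p0 = [0.5, -1.0, 1.0, 2.0, 1.0]  # rough initial guesses
popt, pcov = curve_fit(two_gaussians, centers, heights, p0=p0)
print(popt)
```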
