Find underlaying normal distribution of random vectors - python

I am trying to solve a statistics-related real world problem with Python and am looking for inputs on my ideas: I have N random vectors from a m-dimensional normal distribution. I have no information about the means and the covariance matrix of the underlying distribution, in fact also that it is a normal distribution is only an assumption, a very plausible one though. I want to compute an approximation of the mean vector and covariance matrix of the distribution. The number of random vectors is in the order of magnitude of 100 to 300, the dimensionality of the normal distribution is somewhere between 2 and 5. The time for the calculation should ideally not exceed 1 minute on a standard home computer.
I am currently thinking about three approaches and am happy about all suggestions for other approaches or preferences between those three:
Fitting: Make a multi dimensional histogram of all random vectors and fit a multi dimensional normal distribution to the histogram. Problem about that approach: The covariance matrix has many entries, this could possibly be a problem for the fitting process?
Invert cumulative distribution function: Make a multi dimensional histogram as approximation of the density function of the random vectors. Then integrate this to get a multi dimensional cumulative distribution function. For one dimension, this is invertible and one could use the cum-dist function to distribute random numbers like in the original distribution. Problem: For the multi-dimensional case the cum-dist function is not invertible(?) and I don't know if this approach still works then?
Bayesian: Use Bayesian Statistics with some normal distribution as prior and update for every observation. The result should always be again a normal distribution. Problem: I think this is computationally expensive? Also, I don't want the later updates have more impact on the resulting distribution than the earlier ones.
Also, maybe there is some library which has this task already implemented? I did not find exactly this in Numpy or Scipy, maybe someone has an idea where else to look?

If the simple estimates described in the section Parameter estimation of the wikipedia article on the multivariate normal distribution are sufficient for your needs, you can use numpy.mean to compute the mean and numpy.cov to compute the sample covariance matrix.

Related

Python: How to discretize continuous probability distributions for Kullback-Leibler Divergence

I want to find out how many samples are needed at minimum to more or less correctly fit a probability distribution (In my case the Generalized Extreme Value Distribution from scipy.stats).
In order to evaluate the matched function, I want to compute the KL-divergence between the original function and the fitted one.
Unfortunately, all implementations I found (e.g. scipy.stats.entropy) only take discrete arrays as input. So obviously I thought of approximating the pdf by a discrete array, but I just can't seem to figure it out.
Does anyone have experience with something similar? I would be thankful for hints relating directly to my question, but also for better alternatives to determine a distance between two functions in python, if there are any.

Is there a fast alternative to scipy _norm_pdf for correlated distribution sampling?

I have fit a series of SciPy continuous distributions for a Monte-Carlo simulation and am looking to take a large number of samples from these distributions. However, I would like to be able to take correlated samples, such that the ith sample takes the e.g., 90th percentile from each of the distributions.
In doing this, I've found a quirk in SciPy performance:
# very fast way to many uncorrelated samples of length n
for shape, loc, scale, in distro_props:
sp.stats.norm.rvs(*shape, loc=loc, scale=scale, size=n)
# verrrrryyyyy slow way to take correlated samples of length n
correlate = np.random.uniform(size=n)
for shape, loc, scale, in distro_props:
sp.stats.norm.ppf(correlate, *shape, loc=loc, scale=scale)
Most of the results about this claim that the slowness on these SciPy distros if from the type-checking etc. wrappers. However when I profiled the code, the vast bulk of the time is spent in the underlying math function [_continuous_distns.py:179(_norm_pdf)]1. Furthermore, it scales with n, implying that it's looping through every elemnt internally.
The SciPy docs on rv_continuous almost seem to suggest that the subclass should override this for performance, but it seems bizarre that I would monkeypatch into SciPy to speed up their ppf. I would just compute this for the normal from the ppf formula, but I also use lognormal and skewed normal, which are more of a pain to implement.
So, what is the best way in Python to compute a fast ppf for normal, lognormal, and skewed normal distributions? Or more broadly, to take correlated samples from several such distributions?
If you need just the normal ppf, it is indeed puzzling that it is so slow, but you can use scipy.special.erfinv instead:
x = np.random.uniform(0,1,100)
np.allclose(special.erfinv(2*x-1)*np.sqrt(2),stats.norm().ppf(x))
# True
timeit(lambda:stats.norm().ppf(x),number=1000)
# 0.7717257660115138
timeit(lambda:special.erfinv(2*x-1)*np.sqrt(2),number=1000)
# 0.015020604943856597
EDIT:
lognormal and triangle are also straight forward:
c = np.random.uniform()
np.allclose(np.exp(c*special.erfinv(2*x-1)*np.sqrt(2)),stats.lognorm(c).ppf(x))
# True
np.allclose(((1-np.sqrt(1-(x-c)/((x>c)-c)))*((x>c)-c))+c,stats.triang(c).ppf(x))
# True
skew normal I'm not familiar enough, unfortunately.
Ultimately, this issue was caused by my use of the skew-normal distribution. The ppf of the skew-normal actually does not have a closed-form analytic definition, so in order to compute the ppf, it fell back to scipy.continuous_rv's numeric approximation, which involved iteratively computing the cdf and using that to zero in on the ppf value. The skew-normal pdf is the product of the normal pdf and normal cdf, so this numeric approximation called the normal's pdf and cdf many many times. This is why when I profiled the code, it looked like the normal distribution was the problem, not the SKU normal. The other answer to this question was able to achieve time savings by skipping type-checking, but didn't actually make a difference on the run-time growth, just a difference on small-n runtimes.
To solve this problem, I have replaced the skew-normal distribution with the Johnson SU distribution. It has 2 more free parameters than a normal distribution so it can fit different types of skew and kurtosis effectively. It's supported for all real numbers, and it has a closed-form ppf definition with a fast implementation in SciPy. Below you can see example Johnson SU distributions I've been fitting from the 10th, 50th, and 90th percentiles.

Sampling methods

Can you help me out with these questions? I'm using Python
Sampling Methods
Sampling (or Monte Carlo) methods form a general and useful set of techniques that use random numbers to extract information about (multivariate) distributions and functions. In the context of statistical machine learning, we are most often concerned with drawing samples from distributions to obtain estimates of summary statistics such as the mean value of the distribution in question.
When we have access to a uniform (pseudo) random number generator on the unit interval (rand in Matlab or runif in R) then we can use the transformation sampling method described in Bishop Sec. 11.1.1 to draw samples from more complex distributions. Implement the transformation method for the exponential distribution
$$p(y) = \lambda \exp(−\lambda y) , y \geq 0$$
using the expressions given at the bottom of page 526 in Bishop: Slice sampling involves augmenting z with an additional variable u and then drawing samples from the joint (z,u) space.
The crucial point of sampling methods is how many samples are needed to obtain a reliable estimate of the quantity of interest. Let us say we are interested in estimating the mean, which is
$$\mu_y = 1/\lambda$$
in the above distribution, we then use the sample mean
$$b_y = \frac1L \sum^L_{\ell=1} y(\ell)$$
of the L samples as our estimator. Since we can generate as many samples of size L as we want, we can investigate how this estimate on average converges to the true mean. To do this properly we need to take the absolute difference
$$|\mu_y − b_y|$$
between the true mean $µ_y$ and estimate $b_y$
averaged over many, say 1000, repetitions for several values of $L$, say 10, 100, 1000.
Plot the expected absolute deviation as a function of $L$.
Can you plot some transformed value of expected absolute deviation to get a more or less straight line and what does this mean?
I'm new to this kind of statistical machine learning and really don't know how to implement it in Python. Can you help me out?
There are a few shortcuts you can take. Python has some built-in methods to do sampling, mainly in the Scipy library. I can recommend a manuscript that implements this idea in Python (disclaimer: I am the author), located here.
It is part of a larger book, but this isolated chapter deals with the more general Law of Large Numbers + convergence, which is what you are describing. The paper deals with Poisson random variables, but you should be able to adapt the code to your own situation.

Find a random method that best fit list of values

I have a list of many float numbers, representing the length of an operation made several times.
For each type of operation, I have a different trend in numbers.
I'm aware of many random generators presented in some python modules, like in numpy.random
For example, I have binomial, exponencial, normal, weibul, and so on...
I'd like to know if there's a way to find the best random generator, given a list of values, that best fit each list of numbers that I have.
I.e, the generator (with its params) that best fit the trend of the numbers on the list
That's because I'd like to automatize the generation of time lengths, of each operation, so that I can simulate it during n years, without having to find by hand what method fits best what list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers. I'm trying to find the probability distribution that best fit the array of numbers I already have. The only problem I see is that each probability distribution has input params that may interfer on the result. So I'll have to figure out how to enter this params automatically, trying to best fit the list.
Any idea?
You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it would be to consider a Q-Q plot. Using the random number generators, create a sample of the same size as your data. Sort both of these, and plot them against one another. If the distributions are the same, then you should get a straight line.
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT Adding reference to numpy least squares fitting example
Given a parameterized univariate distirbution (e.g. exponential depends on lambda, or gamma depends on theta and k), the way to find the parameter values that best fit a given sample of numbers is called the Maximum Likelyhood procedure. It is not a least squares procedure, which would require binning and thus loosing information! Some Wikipedia distribution articles give expressions for the maximum likelyhood estimates of parameters, but many do not, and even the ones that do are missing expressions for error bars and covarainces. If you know calculus, you can derive these results by expressing the log likeyhood of your data set in terms of the parameters, setting the second derivative to zero to maximize it, and using the inverse of the curvature matrix at the minimum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelyhood ratio test. Basically, you just pick the one with the larger log likelyhood.
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it might be easy to figure what maximum likelihood method commands the parameter to be.
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the estimated parameter for exponential distribution is sum (1/la -x_i) from where la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.

Fitting a bimodal distribution to a set of values

Given a 1D array of values, what is the simplest way to figure out what the best fit bimodal distribution to it is, where each 'mode' is a normal distribution? Or in other words, how can you find the combination of two normal distributions that bests reproduces the 1D array of values?
Specifically, I'm interested in implementing this in python, but answers don't have to be language specific.
Thanks!
What you are trying to do is called a Gaussian Mixture model. The standard approach to solving this is using Expectation Maximization, scipy svn includes a section on machine learning and em called scikits. I use it a a fair bit.
I suggest using the awesome scipy package.
It provides a few methods for optimisation.
There's a big fat caveat with simply applying a pre-defined least square fit or something along those lines.
Here are a few problems you will run into:
Noise larger than second/both peaks.
Partial peak - your data is cut of at one of the borders.
Sampling - width of peaks are smaller than your sampled data.
It isn't normal - you'll get some result ...
Overlap - If peaks overlap you'll find that often one peak is fitted correctly but the second will apporach zero...

Categories