How to use the Erlang distribution function in SciPy (Python)

I want to randomly generate numbers that follow an Erlang distribution for an arrival process. I want to set the number of arrivals k as a parameter of the Erlang distribution.
scipy.stats.erlang.rvs(a, loc=0, scale=1, size=1, random_state=None)
I am not sure what loc and scale mean, as the documentation does not really clarify what they represent.
Any help would be appreciated.

As the Erlang distribution is a particular case of the Gamma distribution, we can check the gamma documentation:
The probability density above is defined in the “standardized” form. To shift and/or scale the distribution use the loc and scale parameters. Specifically, gamma.pdf(x, a, loc, scale) is identically equivalent to gamma.pdf(y, a) / scale with y = (x - loc) / scale. Note that shifting the location of a distribution does not make it a “noncentral” distribution; noncentral generalizations of some distributions are available in separate classes.
In the case of the Erlang distribution, a should be an integer and scale should be 1/lambda.
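For example, something along these lines should work (the values of k and lambda below are just illustrative picks):

from scipy.stats import erlang

k = 3          # shape parameter a: the number of arrivals, a positive integer
lam = 2.0      # rate parameter lambda (example value)

# scale = 1/lambda; loc is left at its default of 0
samples = erlang.rvs(k, scale=1.0/lam, size=10, random_state=0)
print(samples)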

Related

Generate Random Number in Range from Single-Tailed Distribution with Python

I want to generate a random float in the range [0, 1) from a one-tailed distribution that looks like the chi-squared distribution. I can only find resources on drawing from a uniform distribution in a range, however.
You could use a Beta distribution, e.g.
import numpy as np
np.random.seed(2018)
np.random.beta(2, 5, 10)
#array([ 0.18094173, 0.26192478, 0.14055507, 0.07172968, 0.11830031,
# 0.1027738 , 0.20499125, 0.23220654, 0.0251325 , 0.26324832])
Here we draw numbers from a Beta(2, 5) distribution.
The Beta distribution is a very versatile and fundamental distribution in statistics; without going into any details, by changing the parameters alpha and beta you can make the distribution left-skewed, right-skewed, uniform, symmetric etc. The distribution is defined on the interval [0, 1] which is consistent with what you're after.
A more technical comment
While the Kumaraswamy distribution certainly has more benign algebraic properties than the Beta distribution I would argue that the latter is the more fundamental distribution; for example, in Bayesian inference, the Beta distribution often enters as the conjugate prior when dealing with binomial(-like) processes.
Secondly, the mean and variance of the Beta distribution can be expressed quite simply in terms of the parameters alpha, beta; for example, the mean is simply given by alpha / (alpha + beta).
Lastly, from a computational and statistical inference point of view, fitting a Beta distribution to data is usually done in a few lines of code in Python (or R), since most Python libraries like numpy and scipy already include methods to deal with the Beta distribution.
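For instance, a quick sketch of fitting a Beta distribution with scipy (the data below is synthetic, and fixing floc=0 and fscale=1 so that only the two shape parameters are estimated is my own choice):

import numpy as np
from scipy import stats

np.random.seed(2018)
data = np.random.beta(2, 5, size=1000)   # synthetic data on [0, 1]

# fix loc=0 and scale=1 so only alpha and beta are estimated
alpha_hat, beta_hat, loc, scale = stats.beta.fit(data, floc=0, fscale=1)
print(alpha_hat, beta_hat)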
I would lean toward a distribution which is naturally bounded on the [0, 1] interval (or any other [a, b] interval which could be rescaled later), as in @MauritsEvers' answer. The reason is that you know the distribution and can derive (or read) some interesting facts about it. If you use chi2 and truncate it, it is unclear how to argue about the properties of what you've got.
Personally I prefer the Kumaraswamy distribution over the Beta distribution; the expressions for the mean, mode, variance etc. are a lot simpler.
Just install it
pip install kumaraswamy
and sample
from kumaraswamy import kumaraswamy
d = kumaraswamy(a=2.0, b=5.0)
q = d.rvs(10)
print(q)
will produce 10 numbers following the magenta curve in the Wikipedia article.
If you don't want Beta or Kumaraswamy, there is for example the Logit-normal distribution, and quite a few others.
Look at the numpy.random.chisquare method in the numpy.random library.
numpy.random.chisquare(df, size=None)
>>> np.random.chisquare(2,4)
array([ 1.89920014, 9.00867716, 3.13710533, 5.62318272])
If you want to draw a sample of size N = 5 from a ChiSquare distribution, you can try the OpenTURNS library:
import openturns as ot
# define your distribution. Here, nu = 3. (nu is a float > 0)
distribution = ot.ChiSquare(3)
# draw a sample of size N from `distribution`
N=5
sample = distribution.getSample(N)
A complete list of distributions is available here
sample has an OpenTURNS format but you can manipulate it as a NumPy array:
import numpy as np
s = np.array(sample)
print(s)
>>>array([[1.65299759],
[6.78405097],
[0.88528975],
[0.87900211],
[0.25031129]])
You can also easily plot the distribution PDF just by calling: distribution.drawPDF()
Customizations:
from openturns.viewer import View
graph = distribution.drawPDF()
title = str(distribution)[:100].split('\n')[0]
graph.setTitle(title)
View(graph, add_legend=False)

How to generate random numbers with predefined probability distribution?

I would like to implement a function in Python (using numpy) that takes a mathematical function (for example p(x) = e^(-x), like below) as input and generates random numbers distributed according to that function's probability distribution. And I need to plot them, so we can see the distribution.
Actually, I need a random number generator for exactly the following two mathematical functions as input, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?
For simple distributions like the ones you need, or if you have a CDF that is easy to invert in closed form, you can find plenty of samplers in NumPy, as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov chain Monte Carlo (MCMC) sampling methods.
The simplest and perhaps easiest-to-understand variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
start from a random point x and take a random step xnew = x + delta
evaluate the desired probability distribution at the current point p(x) and at the new one p(xnew)
if the new point is more probable, p(xnew)/p(x) >= 1, accept the move
if the new point is less probable, randomly decide whether to accept or reject depending on how probable [1] the new point is
take a new step from this point and repeat the cycle
It can be shown (see e.g. Sokal [2]) that points sampled with this method follow the desired probability distribution.
An extensive implementation of Monte Carlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
import numpy as np

def uniform_proposal(x, delta=2.0):
    return np.random.uniform(x - delta, x + delta)

def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
    x = 1  # start somewhere
    for i in range(nsamples):
        trial = proposal(x)  # random neighbour from the proposal distribution
        acceptance = p(trial)/p(x)
        # accept the move conditionally
        if np.random.uniform() < acceptance:
            x = trial
        yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
    return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)

p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
    return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))
p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain within which to sample your random steps [3].
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
Conclusions
There is still way too much to say, about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
[1] The idea here is to favor exploration where the probability is higher, but still look at low-probability regions as they might lead to other peaks. The choice of the proposal distribution, i.e. how you pick new points to explore, is fundamental: steps that are too small might constrain you to a limited area of your distribution, while steps that are too big could lead to very inefficient exploration.
[2] Physics-oriented. The Bayesian formalism (Metropolis-Hastings) is preferred these days, but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke University.
[3] The implementation is not shown so as not to add too much confusion, but it's straightforward: you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
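For instance, one possible reading of [3] is to make the target zero outside the allowed interval and reuse the metropolis_sampler defined above unchanged (a rough sketch; the bounded helper and its name are my own, not part of the original code, and this replaces the domain keyword used in the calls above):

import numpy as np

def bounded(p, domain):
    # wrap p so it returns 0 outside [lo, hi]; out-of-domain proposals are then always rejected
    lo, hi = domain
    return lambda x: p(x) if lo <= x <= hi else 0.0

# the sqrt example from above, restricted to the interval (0, 10)
samples = list(metropolis_sampler(bounded(np.sqrt, (0, 10)), 100000))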
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both cases the arguments are optional, as these are the default values for these distributions.
As a sidenote, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
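If you also want to plot the samples to see the distribution, a quick sketch with matplotlib (assuming it is installed; the sample size and bin count are arbitrary choices) could look like this:

import numpy as np
import matplotlib.pyplot as plt

exp_samples = np.random.exponential(1, size=10000)   # p(x) = e^(-x)
norm_samples = np.random.normal(0, 1, size=10000)    # g(x), the standard normal

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(exp_samples, bins=50, density=True)
ax1.set_title('exponential(1)')
ax2.hist(norm_samples, bins=50, density=True)
ax2.set_title('normal(0, 1)')
plt.show()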
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and input values from a uniform distribution.
inverse_cdf(np.random.uniform())
For example, if NumPy did not provide the exponential distribution, you could do this.
def exponential():
    return -np.log(1 - np.random.uniform())
If you encounter distributions whose CDF is not easy to invert, then consider filippo's great answer.

What are the loc and scale parameters in scipy.stats.maxwell?

The Maxwell-Boltzmann distribution is given by
P(x) = sqrt(2/pi) * x**2 * exp(-x**2 / (2*a**2)) / a**3
(from MathWorld - A Wolfram Web Resource: wolfram.com). The scipy.stats.maxwell distribution uses loc and scale parameters to define this distribution. How are the parameters in the two definitions connected? I would also appreciate it if someone could explain, in general, how to determine the relation between the parameters in scipy.stats and their usual definitions.
The loc parameter always shifts the x variable. In other words, it generalizes
the distribution to allow shifting x=0 to x=loc. So that when loc is nonzero,
maxwell.pdf(x) = sqrt(2/pi) * x**2 * exp(-x**2/2), for x > 0
becomes
maxwell.pdf(x, loc) = sqrt(2/pi) * (x-loc)**2 * exp(-(x-loc)**2/2), for x > loc.
The doc string for scipy.stats.maxwell states:
A special case of a chi distribution, with df = 3, loc = 0.0, and
given scale = a, where a is the parameter used in the Mathworld
description.
So the scale corresponds to the parameter a in the equation above (from MathWorld - A Wolfram Web Resource: wolfram.com).
In general you need to read the distribution's doc string to know what parameters the distribution has. The beta distribution, for example, has a and b shape parameters in addition to loc and scale.
However, I believe for all continuous distributions,
distribution.pdf(x, loc, scale) is identically equivalent to
distribution.pdf(y) / scale with y = (x - loc) / scale.
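You can check this relation numerically for maxwell; the particular x, loc and scale values below are arbitrary:

from scipy.stats import maxwell

x, loc, scale = 2.5, 1.0, 0.5
y = (x - loc) / scale
print(maxwell.pdf(x, loc=loc, scale=scale))   # shifted/scaled form
print(maxwell.pdf(y) / scale)                 # standardized pdf divided by scale; same value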

How to perform a chi-squared goodness of fit test using scientific libraries in Python?

Let's assume I have some data I obtained empirically:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?
I can fit a model with:
param = stats.expon.fit(x)
import matplotlib.pyplot as plt
plt.hist(x, density=True, color='white', hatch='/')
grid = np.linspace(0, 100, 10000)
plt.plot(grid, stats.expon.pdf(grid, *param))
Calculating the Kolmogorov-Smirnov test, by contrast, is very elegant:
>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)
However, I can't find a good way of calculating the chi-squared test.
There is a chi-squared GoF function in statsmodels, but it assumes a discrete distribution (and the exponential distribution is continuous).
The official scipy.stats tutorial only covers a case for a custom distribution and probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and DoF are already obtained.
Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.
An approximate solution for equal-probability bins:
Estimate the parameters of the distribution.
Use the inverse cdf (ppf, if it's a scipy.stats distribution) to get the bin edges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args).
Then use np.histogram to count the number of observations in each bin.
Finally, use the chisquare test on those frequencies, as in the sketch below.
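A rough sketch of those four steps for the exponential example in the question (the bin count, variable names, and ddof choice are my own illustrative picks, not a canonical recipe):

import numpy as np
from scipy import stats

x = 10 * stats.expon.rvs(size=10000) + 0.2 * np.random.uniform(size=10000)   # example data
params = stats.expon.fit(x)                                                  # 1) estimate parameters
n_bins = 20
edges = stats.expon.ppf(np.linspace(0, 1, n_bins + 1), *params)              # 2) equal-probability bin edges
observed, _ = np.histogram(x, bins=edges)                                    # 3) observed counts per bin
expected = np.full(n_bins, len(x) / n_bins)                                  # equal expected counts by construction
# 4) chi-square test; ddof accounts for the two estimated parameters (loc, scale)
stat, pvalue = stats.chisquare(observed, expected, ddof=len(params))
print(stat, pvalue)

With equal-probability bins the expected count in every bin is simply len(x) / n_bins, and ddof subtracts the number of estimated parameters from the degrees of freedom.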
An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
This is only approximate, since the theory for the chisquare test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of bin edges based on the data affects the asymptotic distribution.
I haven't looked into this in a long time.
If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.
Why do you need to "verify" that it's exponential? Are you sure you need a statistical test? I can pretty much guarantee that it isn't ultimately exponential and that the test would be significant if you had enough data, making the logic of using the test rather forced. It may help you to read this CV thread: Is normality testing 'essentially useless'?, or my answer here: Testing for heteroscedasticity with many observations.
It is typically better to use a qq-plot and/or pp-plot (depending on whether you are concerned about the fit in the tails or middle of the distribution, see my answer here: PP-plots vs. QQ-plots). Information on how to make qq-plots in Python SciPy can be found in this SO thread: Quantile-Quantile plot using SciPy
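For example, a rough sketch of a QQ-plot against the fitted exponential using scipy.stats.probplot (my own choice of function, not taken from the linked thread):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = 10 * stats.expon.rvs(size=10000) + 0.2 * np.random.uniform(size=10000)
param = stats.expon.fit(x)
# QQ-plot of the data against the fitted exponential distribution
stats.probplot(x, sparams=param, dist=stats.expon, plot=plt)
plt.show()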
I tried your problem with OpenTURNS.
The beginning is the same:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
If you suspect that your sample x is coming from an Exponential distribution, you can use ot.ExponentialFactory() to fit the parameters:
import openturns as ot
sample = ot.Sample([[p] for p in x])
distribution = ot.ExponentialFactory().build(sample)
As the factory needs an ot.Sample() as input, I needed to format x and reshape it as 10,000 points of dimension 1.
Let's now assess this fit using the ChiSquared test:
result = ot.FittingTest.ChiSquared(sample, distribution, 0.01)
print('Exponential?', result.getBinaryQualityMeasure(), ', P-value=', result.getPValue())
>>> Exponential? True , P-value= 0.9275212544642293
Very good!
And of course, print(distribution) will give you the fitted parameters:
>>> Exponential(lambda = 0.0982391, gamma = 0.0274607)

Scipy - Stats - Meaning of parameters for probability distributions

Scipy docs give the distribution form used by exponential as:
expon.pdf(x) = lambda * exp(- lambda*x)
However, the fit function takes:
fit(data, loc=0, scale=1)
And the rvs function takes:
rvs(loc=0, scale=1, size=1)
Question 1:
Why the extraneous location variable? I know that exponentials are just specific forms of a more general distribution (gamma), but why include the unneeded information? Even gamma doesn't have a location parameter.
Question 2:
Is the output of fit(...) in the same order as the input variables? By that I mean:
If I do:
t = fit([....]), t will have the form t[0], t[1].
Should I interpret t[0] as the shape and t[1] as the scale?
Does this hold for all the distributions?
What about for gamma :
fit(data, a, loc=0, scale=1)
Every univariate probability distribution, no matter what its usual formulation, can be extended to include a location and scale parameter. Sometimes, this entails extending the support of the distribution from just the positive/non-negative reals to the whole real number line with just a PDF value of 0 when below the loc value. scipy.stats does this to move all of the handling of loc and scale to a common method shared by all distributions. It has been suggested to remove this, and make distributions like gamma loc-less to follow their canonical formulations. However, it turns out that some people do actually use "shifted gamma" distributions with nonzero loc parameters to model the sizes of sunspots, if I remember correctly, and the current behavior of scipy.stats was perfect for them. So we're keeping it.
The output of the fit() method is a tuple of the form (shape0, shape1, ..., loc, scale), with all shape parameters first. For a normal distribution, which has no shape parameters, it will return just (loc, scale). For a gamma distribution, which has one, it will return (shape, loc, scale). Multiple shape parameters will be in the same order that you give to every other method on the distribution. This holds for all distributions.
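For instance (an illustrative sketch; the distributions and sample size are arbitrary):

from scipy import stats

data = stats.gamma.rvs(2.0, loc=0.0, scale=3.0, size=1000)

# gamma has one shape parameter, so fit() returns (shape, loc, scale)
shape, loc, scale = stats.gamma.fit(data)
print(shape, loc, scale)

# the normal distribution has no shape parameters, so fit() returns (loc, scale)
mu, sigma = stats.norm.fit(data)
print(mu, sigma)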
