The Maxwell-Boltzmann distribution is given by
P(x) = sqrt(2/pi) * x**2 * exp(-x**2 / (2*a**2)) / a**3
(from MathWorld - A Wolfram Web Resource: wolfram.com).
The scipy.stats.maxwell distribution uses loc and scale parameters to define this distribution. How are the parameters in the two definitions connected? I would also appreciate it if someone could explain how, in general, to determine the relation between the parameters in scipy.stats and their usual definitions.
The loc parameter always shifts the x variable. In other words, it generalizes
the distribution so that the point x=0 moves to x=loc. When loc is nonzero,
maxwell.pdf(x) = sqrt(2/pi) * x**2 * exp(-x**2/2), for x > 0
becomes
maxwell.pdf(x, loc) = sqrt(2/pi) * (x-loc)**2 * exp(-(x-loc)**2/2), for x > loc.
The doc string for scipy.stats.maxwell states:
A special case of a chi distribution, with df = 3, loc = 0.0, and
given scale = a, where a is the parameter used in the Mathworld
description.
So the scale corresponds to the parameter a in the equation
P(x) = sqrt(2/pi) * x**2 * exp(-x**2 / (2*a**2)) / a**3
(from MathWorld - A Wolfram Web Resource: wolfram.com)
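As a quick numerical check of that correspondence, the sketch below compares scipy.stats.maxwell with scale=a against the MathWorld density sqrt(2/pi) * x**2 * exp(-x**2 / (2*a**2)) / a**3. The value a = 1.7 and the helper name mathworld_maxwell_pdf are arbitrary choices made for illustration, not part of SciPy.

```python
import numpy as np
from scipy.stats import maxwell


def mathworld_maxwell_pdf(x, a):
    """Maxwell density in the MathWorld parameterization (hypothetical helper)."""
    return np.sqrt(2 / np.pi) * x**2 * np.exp(-x**2 / (2 * a**2)) / a**3


a = 1.7  # arbitrary illustrative value for the MathWorld parameter
x = np.linspace(0.1, 8.0, 50)

scipy_pdf = maxwell.pdf(x, loc=0, scale=a)      # SciPy: scale plays the role of a
reference_pdf = mathworld_maxwell_pdf(x, a)     # hand-written formula

max_abs_diff = np.max(np.abs(scipy_pdf - reference_pdf))
print(max_abs_diff)  # should be at floating-point rounding level
```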
In general you need to read the distribution's doc string to know what parameters the distribution has. The beta distribution, for example, has a and b shape parameters in addition to loc and scale.
However, I believe for all continuous distributions,
distribution.pdf(x, loc, scale) is identically equivalent to
distribution.pdf(y) / scale with y = (x - loc) / scale.
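The identity above can be checked numerically. This is a minimal sketch using maxwell as the example distribution; the loc and scale values are arbitrary.

```python
import numpy as np
from scipy.stats import maxwell

loc, scale = 2.0, 1.5  # arbitrary illustrative values
x = np.linspace(loc + 0.1, loc + 10.0, 50)

# Left-hand side: let SciPy apply loc and scale itself.
lhs = maxwell.pdf(x, loc=loc, scale=scale)

# Right-hand side: standardize by hand, then divide by scale.
y = (x - loc) / scale
rhs = maxwell.pdf(y) / scale

print(np.allclose(lhs, rhs))
```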
I want to randomly generate numbers that follow an Erlang distribution for an arrival process. I want to set the number of arrivals k as a parameter of the Erlang distribution.
scipy.stats.erlang.rvs(a, loc=0, scale=1, size=1, random_state=None)
I am not so sure what loc and scale mean, as in the documentation they did not really clarify what they represent.
Any help would be appreciated.
The Erlang distribution is a special case of the gamma distribution, so the gamma documentation applies:
The probability density above is defined in the “standardized” form. To shift and/or scale the distribution use the loc and scale parameters. Specifically, gamma.pdf(x, a, loc, scale) is identically equivalent to gamma.pdf(y, a) / scale with y = (x - loc) / scale. Note that shifting the location of a distribution does not make it a “noncentral” distribution; noncentral generalizations of some distributions are available in separate classes.
For the Erlang distribution, the shape a should be an integer (the number of arrivals k) and the scale should be 1/lambda, where lambda is the arrival rate.
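A minimal sketch of that mapping, assuming k = 3 arrivals and a rate lambda = 2.0 (both values are made up for illustration). The sample mean should land close to the theoretical mean k / lambda.

```python
import numpy as np
from scipy.stats import erlang

k = 3        # integer shape: number of arrivals (illustrative)
lam = 2.0    # arrival rate lambda (illustrative)

rng = np.random.default_rng(0)  # seeded for reproducibility
samples = erlang.rvs(k, loc=0, scale=1 / lam, size=100_000, random_state=rng)

# Theoretical mean of an Erlang(k, lambda) variable is k / lambda.
print(samples.mean(), k / lam)
```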
I was looking here: numpy
And I can see you can use the command np.random.standard_cauchy() specifying an array, to sample from a standard Cauchy.
I need to sample from a Cauchy which might have x_0 != 0 and gamma != 1, i.e. might not be located at the origin, nor have scale equal to 1.
How can I do this?
If you have scipy, you can use scipy.stats.cauchy, which takes a location (x0) and a scale (gamma) parameter. It exposes the rvs method to draw random samples:
from scipy import stats
x = stats.cauchy.rvs(loc=100, scale=2.5, size=1000)  # draw 1000 samples
You can also avoid the dependency on SciPy, since the Cauchy distribution belongs to the location-scale family. That means that if you draw a sample x from Cauchy(0, 1), you can multiply it by gamma and then shift it by x_0: x' = x_0 + gamma * x is distributed according to Cauchy(x_0, gamma).
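The NumPy-only route can be sketched as follows. The values x0 = 100 and gamma = 2.5 are illustrative. Because a Cauchy distribution has no mean or variance, the check uses the median (which equals x0) and half the interquartile range (which equals gamma).

```python
import numpy as np

x0, gamma = 100.0, 2.5  # illustrative location and scale

rng = np.random.default_rng(0)
standard = rng.standard_cauchy(size=100_000)  # draws from Cauchy(0, 1)
samples = x0 + gamma * standard               # draws from Cauchy(x0, gamma)

# Robust summaries: median estimates x0, half the IQR estimates gamma.
median = np.median(samples)
q25, q75 = np.percentile(samples, [25, 75])
half_iqr = (q75 - q25) / 2
print(median, half_iqr)
```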
From the docs
The probability mass function for zipf is:
zipf.pmf(k, a) = 1/(zeta(a) * k**a)
for k >= 1.
zipf takes a as shape parameter.
The probability mass function above is defined in the “standardized” form. To shift distribution use the loc parameter. Specifically, zipf.pmf(k, a, loc) is identically equivalent to zipf.pmf(k - loc, a).
But what does the a and k refer to? What does "shape parameter" mean?
Additionally, in scipy.stats.zipf.interval, there's an alpha parameter.
The description of the .interval() method is simply:
Endpoints of the range that contains alpha percent of the distribution
What does the alpha parameter mean? Is that the "confidence interval"?
What does "shape parameter" mean?
As the name suggests, a shape parameter determines the shape of a distribution. This is probably easiest to explain when starting with what a shape parameter is not:
A location parameter shifts the distribution but leaves it otherwise unchanged. For example, the mean of a normal distribution is a location parameter. If X is normally distributed with mean mu, then X + a is normally distributed with mean mu + a.
A scale parameter makes the distribution wider or narrower. For example, the standard deviation of a normal distribution is a scale parameter. If X is normally distributed with standard deviation sigma, then X * a (for a > 0) is normally distributed with standard deviation sigma * a.
Finally, a shape parameter changes the shape of the distribution. For example, the Gamma distribution has a shape parameter k that determines how skewed the distribution is (= how much it "leans" to one side).
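The three kinds of parameter can be told apart by their effect on summary statistics. In this sketch (all numeric values are arbitrary), loc moves the mean, scale changes the variance, and the gamma shape parameter changes the skewness, which for the gamma distribution is 2 / sqrt(a).

```python
from scipy.stats import gamma, norm

# Location: shifting loc moves the mean and leaves the variance alone.
mean_shifted, var_shifted = (float(v) for v in norm.stats(loc=5, scale=1, moments="mv"))

# Scale: widening by a factor of 3 multiplies the variance by 9.
mean_scaled, var_scaled = (float(v) for v in norm.stats(loc=0, scale=3, moments="mv"))

# Shape: the gamma shape parameter a controls how skewed the distribution is.
skew_small_a = float(gamma.stats(1.0, moments="s"))   # strongly skewed
skew_large_a = float(gamma.stats(25.0, moments="s"))  # much closer to symmetric

print(mean_shifted, var_shifted)
print(mean_scaled, var_scaled)
print(skew_small_a, skew_large_a)
```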
But what does the a and k refer to?
k is the value of the random variable that the distribution describes. With zipf.pmf you can compute the probability of any k, given the shape parameter a. The plot generated by the code at the end of this answer demonstrates how a changes the shape of the distribution (the individual probabilities of different k).
A high a makes large values of k very unlikely, while a low a makes small k less likely and larger k are possible.
What does the alpha parameter mean? Is that the "confidence interval"?
It is wrong to say that alpha is the confidence interval; it is the confidence level, which I guess is what you meant. For example, alpha=0.95 means that you get a 95% interval. If you generate random values of k from the particular distribution, roughly 95% of them will fall in the range returned by zipf.interval.
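A small sketch of zipf.interval, assuming a = 2.6 and a level of 0.95. The level is passed positionally because, as far as I know, the keyword name for this argument has changed between SciPy releases; check the signature of your installed version if you want to pass it by name. Since k is discrete, the probability mass inside the returned range is at least the requested level rather than exactly equal to it.

```python
from scipy.stats import zipf

a = 2.6       # shape parameter (illustrative)
level = 0.95  # requested probability content of the interval

low, high = zipf.interval(level, a)

# Probability that low <= k <= high, computed from the CDF.
coverage = zipf.cdf(high, a) - zipf.cdf(low - 1, a)
print(low, high, coverage)
```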
Code for the plot:
from scipy.stats import zipf
import matplotlib.pyplot as plt
import numpy as np

# zipf is discrete: the pmf is nonzero only at integers k >= 1
k = np.arange(1, 11)
for a in [1.3, 2.6]:
    p = zipf.pmf(k, a=a)
    plt.plot(k, p, 'o-', label='a={}'.format(a), linewidth=2)
plt.xlabel('k')
plt.ylabel('probability')
plt.legend()
plt.show()
Could you use the kstest in scipy.stats for the non-standard distribution functions (i.e. vary the DOF for Student's t, or vary gamma for Cauchy)? My end goal is to find the maximum p-value and the corresponding parameter for my distribution fit, but that isn't the issue.
EDIT:
"
scipy.stats' cauchy pdf is:
cauchy.pdf(x) = 1 / (pi * (1 + x**2))
which implies x_0 = 0 for the location parameter and Y = 1 for gamma. I actually need it to look like this:
cauchy.pdf(x, x_0, Y) = Y**2 / [(Y * pi) * ((x - x_0)**2 + Y**2)]
"
Q1) Could Student's t, at least, be used in a way perhaps like
stuff = []
for dof in range(1, 100):  # degrees of freedom must be positive
    d, p = scipy.stats.kstest(data, "t", args=(dof,))  # kstest returns (statistic, p-value)
    stuff.append((d, p, dof))
since it seems to have the option to vary the parameter?
Q2) How would you do this if you needed the full normal distribution equation (need to vary sigma) and Cauchy as written above (need to vary gamma)? EDIT: Instead of searching scipy.stats for non-standard distributions, is it actually possible to feed a function I write into kstest so that it finds the p-values?
Thanks kindly
It seems that what you really want to do is parameter estimation. The KS test is not really meant to be used in this manner. You should use the .fit method of the corresponding distribution.
>>> import numpy as np, scipy.stats as stats
>>> arr = stats.norm.rvs(loc=10, scale=3, size=10) # generate 10 random samples from a normal distribution
>>> arr
array([ 11.54239861, 15.76348509, 12.65427353, 13.32551871,
10.5756376 , 7.98128118, 14.39058752, 15.08548683,
9.21976924, 13.1020294 ])
>>> stats.norm.fit(arr)
(12.364046769964004, 2.3998164726918607)
>>> stats.cauchy.fit(arr)
(12.921113834451496, 1.5012714431045815)
Now to quickly check the documentation:
>>> help(stats.cauchy.fit)
Help on method fit in module scipy.stats._distn_infrastructure:

fit(data, *args, **kwds) method of scipy.stats._continuous_distns.cauchy_gen instance
    Return MLEs for shape, location, and scale parameters from data.

    MLE stands for Maximum Likelihood Estimate.  Starting estimates for
    the fit are given by input arguments; for any arguments not provided
    with starting estimates, ``self._fitstart(data)`` is called to generate
    such.

    One can hold some parameters fixed to specific values by passing in
    keyword arguments ``f0``, ``f1``, ..., ``fn`` (for shape parameters)
    and ``floc`` and ``fscale`` (for location and scale parameters,
    respectively).

    ...

    Returns
    -------
    shape, loc, scale : tuple of floats
        MLEs for any shape statistics, followed by those for location and
        scale.

    Notes
    -----
    This fit is computed by maximizing a log-likelihood function, with
    penalty applied for samples outside of range of the distribution. The
    returned answer is not guaranteed to be the globally optimal MLE, it
    may only be locally optimal, or the optimization may fail altogether.
So, if you want to hold one of those parameters constant, you can easily do:
>>> stats.cauchy.fit(arr, floc=10)
(10, 2.4905786982353786)
>>> stats.norm.fit(arr, floc=10)
(10, 3.3686549590571668)
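If a goodness-of-fit number is still wanted after fitting, the estimated parameters can be passed to kstest through args, or a callable CDF can be supplied instead of a distribution name. The following is only a sketch on synthetic data (the true loc=10 and scale=3 are made up). Note that a KS p-value computed with parameters estimated from the same data tends to be optimistic, so treat it as a rough indicator.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = stats.cauchy.rvs(loc=10, scale=3, size=2000, random_state=rng)

# Step 1: estimate the parameters by maximum likelihood.
loc_hat, scale_hat = stats.cauchy.fit(data)

# Step 2a: KS test against the named distribution with the fitted parameters.
d_stat, p_value = stats.kstest(data, "cauchy", args=(loc_hat, scale_hat))

# Step 2b: the same test with a user-supplied callable CDF.
def fitted_cdf(x):
    return stats.cauchy.cdf(x, loc=loc_hat, scale=scale_hat)

d_stat_callable, p_value_callable = stats.kstest(data, fitted_cdf)

print(loc_hat, scale_hat, d_stat, p_value)
```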
The SciPy docs give the form of the exponential distribution as:
expon.pdf(x) = lambda * exp(- lambda*x)
However, the fit function takes:
fit(data, loc=0, scale=1)
And the rvs function takes:
rvs(loc=0, scale=1, size=1)
Question 1:
Why the extraneous location variable? I know that exponentials are just specific forms of a more general distribution (gamma), but why include the unneeded information? Even gamma doesn't have a location parameter.
Question 2:
Is the output of fit(...) in the same order as the input variables? By that I mean,
if I do:
t = fit([....]), t will have the form t[0], t[1].
Should I interpret t[0] as the shape and t[1] as the scale?
Does this hold for all the distributions?
What about for gamma :
fit(data, a, loc=0, scale=1)
Every univariate probability distribution, no matter what its usual formulation, can be extended to include a location and scale parameter. Sometimes, this entails extending the support of the distribution from just the positive/non-negative reals to the whole real number line with just a PDF value of 0 when below the loc value. scipy.stats does this to move all of the handling of loc and scale to a common method shared by all distributions. It has been suggested to remove this, and make distributions like gamma loc-less to follow their canonical formulations. However, it turns out that some people do actually use "shifted gamma" distributions with nonzero loc parameters to model the sizes of sunspots, if I remember correctly, and the current behavior of scipy.stats was perfect for them. So we're keeping it.
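To connect this with the rate form lambda * exp(-lambda * x), here is a small sketch assuming lambda = 0.5 and a shift of loc = 2.0 (both arbitrary). Setting scale = 1 / lambda reproduces the rate form, and a nonzero loc simply moves the support so that the density is 0 below loc.

```python
import numpy as np
from scipy.stats import expon

lam = 0.5                       # rate parameter lambda (illustrative)
x = np.array([0.5, 1.0, 4.0])

# scale = 1 / lambda reproduces the usual rate parameterization.
pdf_scaled = expon.pdf(x, loc=0, scale=1 / lam)
pdf_manual = lam * np.exp(-lam * x)

# A nonzero loc shifts the support: the density is 0 below loc.
loc = 2.0
pdf_below_loc = expon.pdf(1.0, loc=loc, scale=1 / lam)
pdf_shifted = expon.pdf(x + loc, loc=loc, scale=1 / lam)

print(pdf_scaled, pdf_manual, pdf_below_loc)
```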
The output of the fit() method is a tuple of the form (shape0, shape1, ..., shapeN, loc, scale) if there are N shape parameters. For a normal distribution, which has no shape parameters, it will return just (loc, scale). For a gamma distribution, which has one, it will return (shape, loc, scale). Multiple shape parameters will be in the same order that you give to every other method on the distribution. This holds for all distributions.
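The ordering of the returned tuple can be seen on synthetic data. This sketch uses made-up samples, and fixes loc at 0 for the gamma fit with floc=0 to show that a fixed parameter still appears in the result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=10, scale=3, size=500)
positive_data = rng.gamma(shape=2.0, scale=1.5, size=500)

norm_params = stats.norm.fit(normal_data)              # (loc, scale)
expon_params = stats.expon.fit(positive_data)          # (loc, scale)
gamma_params = stats.gamma.fit(positive_data, floc=0)  # (shape, loc, scale)

# Unpack in the documented order: shape parameters first, then loc, then scale.
a_hat, loc_hat, scale_hat = gamma_params
print(norm_params, expon_params, gamma_params)
```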