Scipy docs give the distribution form used by exponential as:
expon.pdf(x) = lambda * exp(- lambda*x)
However, the fit function takes:
fit(data, loc=0, scale=1)
And the rvs function takes:
rvs(loc=0, scale=1, size=1)
Question 1:
Why the extraneous location variable? I know that exponentials are just specific forms of a more general distribution (gamma), but why include the unneeded information? Even gamma doesn't have a location parameter.
Question 2:
Is the output of fit(...) in the same order as the input variables? By that I mean:
If I do:
t = fit([....]), t will have the form t[0], t[1].
Should I interpret t[0] as the shape and t[1] as the scale?
Does this hold for all the distributions?
What about for gamma :
fit(data, a, loc=0, scale=1)
Every univariate probability distribution, no matter what its usual formulation, can be extended to include a location and scale parameter. Sometimes this entails extending the support of the distribution from just the positive/non-negative reals to the whole real number line, with a PDF value of 0 below the loc value. scipy.stats does this to move all of the handling of loc and scale into a common method shared by all distributions.
It has been suggested to remove this and make distributions like gamma loc-less to follow their canonical formulations. However, it turns out that some people do actually use "shifted gamma" distributions with nonzero loc parameters (to model the sizes of sunspots, if I remember correctly), and the current behavior of scipy.stats was perfect for them. So we're keeping it.
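A minimal sketch of what this means for expon: with a nonzero loc, the support simply shifts, and the PDF is 0 to the left of loc.
import numpy as np
from scipy import stats

# Shifted exponential: support starts at loc instead of 0
print(stats.expon.pdf(1.0, loc=2.0))   # 0.0 -- below the shifted support
print(stats.expon.pdf(3.0, loc=2.0))   # exp(-1), same as stats.expon.pdf(1.0)
print(np.exp(-1.0))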
The output of the fit() method is a tuple of the form (shape0, shape1, ..., loc, scale), with one entry per shape parameter followed by loc and scale. For a normal distribution, which has no shape parameters, it will return just (loc, scale). For a gamma distribution, which has one, it will return (shape, loc, scale). Multiple shape parameters come in the same order that you would give them to every other method on the distribution. This holds for all distributions.
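A quick sketch illustrating the ordering (the parameter values here are just for illustration):
import numpy as np
from scipy import stats

sample = stats.gamma.rvs(2.0, loc=0, scale=3.0, size=1000)

# norm has no shape parameters: fit returns (loc, scale)
mu, sigma = stats.norm.fit(sample)

# gamma has one shape parameter: fit returns (shape, loc, scale)
shape, loc, scale = stats.gamma.fit(sample)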
Related
I want to randomly generate numbers that follow an Erlang distribution for an arrival process. I want to set the number of arrivals k as a parameter of the Erlang distribution.
scipy.stats.erlang.rvs(a, loc=0, scale=1, size=1, random_state=None)
I am not sure what loc and scale mean, as the documentation does not really clarify what they represent.
Any help would be appreciated.
Since the Erlang distribution is a particular case of the gamma distribution, we can check the gamma documentation:
The probability density above is defined in the “standardized” form. To shift and/or scale the distribution use the loc and scale parameters. Specifically, gamma.pdf(x, a, loc, scale) is identically equivalent to gamma.pdf(y, a) / scale with y = (x - loc) / scale. Note that shifting the location of a distribution does not make it a “noncentral” distribution; noncentral generalizations of some distributions are available in separate classes.
In the case of the Erlang distribution, a should be an integer and scale should be 1/lambda.
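A minimal sketch (the rate lam and the number of arrivals k below are just illustrative values):
from scipy import stats

k = 5        # number of arrivals (the integer shape parameter a)
lam = 2.0    # arrival rate, so scale = 1/lam

# Draw samples from Erlang(k, lam)
samples = stats.erlang.rvs(k, loc=0, scale=1/lam, size=1000)
print(samples.mean())  # should be close to k/lam = 2.5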
I have a plot for the CDF distribution of packet losses. I thus do not have the original data or the CDF model itself but samples from the CDF curve. (The data is extracted from plots published in literature.)
I want to find which distribution and with what parameters offers the closest fit to the CDF samples.
I've seen that scipy.stats distributions offer a fit(data) method, but all examples apply to raw data points. The PDF/CDF is subsequently drawn from the fitted parameters. Using fit with my CDF samples does not give sensible results.
Am I right in assuming that fit() cannot be directly applied to data samples from an empirical CDF?
What alternatives could I use to find a matching known distribution?
I'm not sure exactly what you're trying to do. When you say you have a CDF, what does that mean? Do you have some data points, or the function itself? It would be helpful if you could post more information or some sample data.
If you have some data points and know the distribution, it's not hard to do using scipy. If you don't know the distribution, you could just iterate over all distributions until you find one which works reasonably well.
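For the case where you do have raw samples, a minimal sketch of that iteration might look like the following (the candidate list and the array samples are placeholders, not from the question):
import numpy as np
import scipy.stats

# Try a few candidate distributions and rank them by the KS statistic of the fit
candidates = ['norm', 'expon', 'gamma', 'lognorm']
results = []
for name in candidates:
    dist = getattr(scipy.stats, name)
    params = dist.fit(samples)                       # MLE fit to raw samples
    d, p = scipy.stats.kstest(samples, name, args=params)
    results.append((d, name, params))
results.sort()                                       # smallest KS statistic first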
We can define functions of the form required for scipy.optimize.curve_fit. I.e., the first argument should be x, and then the other arguments are parameters.
I use this function to generate some test data based on the CDF of a normal random variable with a bit of added noise.
import numpy as np
import scipy.stats
import scipy.optimize

n = 100
x = np.linspace(-4, 4, n)
f = lambda x, mu, sigma: scipy.stats.norm(mu, sigma).cdf(x)
data = f(x, 0.2, 1) + 0.05*np.random.randn(n)  # noisy CDF samples
Now, use curve_fit to find parameters.
mu,sigma = scipy.optimize.curve_fit(f,x,data)[0]
This gives output
>>> mu, sigma
0.1828320963531838, 0.9452044983927278
We can plot the original CDF (orange), the noisy data, and the fitted CDF (blue) and observe that it works pretty well.
Note that curve_fit can take some additional parameters, and that the output gives additional information about how good a fit the function is.
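For instance, the covariance matrix returned alongside the parameters gives rough uncertainty estimates (a sketch, reusing f, x, and data from above):
import numpy as np
import scipy.optimize

popt, pcov = scipy.optimize.curve_fit(f, x, data)
perr = np.sqrt(np.diag(pcov))  # one-standard-deviation uncertainty per parameter
print(popt, perr)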
@tch Thank you for the answer. I read up on the technique and successfully applied it. I wanted to apply the fit to all continuous distributions supported by scipy.stats, so I ended up doing the following:
import numpy as np
import scipy.stats as ss
from scipy.optimize import curve_fit

fitted = []
failed = []
for d in dist_list:
    dist_name = d[0]                       # fetch the distribution name
    dist_object = getattr(ss, dist_name)   # fetch the distribution object
    param_default = d[1]                   # fetch the default distribution parameters
    # For distributions with only location and scale, default to loc=0 and scale=1
    if not param_default:
        param_default = (0, 1)
    # Compute parameters of the fitted distribution
    try:
        param, cov = curve_fit(dist_object.cdf, data_in, data_out,
                               p0=param_default, method='trf')
        # Only keep distributions whose covariance is not all zero,
        # as an all-zero covariance indicates an invalid fit
        if np.any(cov):
            fitted.append((dist_name, param))
    # Capture distributions that cannot be fitted (for a variety of reasons)
    except (NotImplementedError, RuntimeError) as e:
        failed.append((dist_name, e))
In the above, the empirical CDF is captured in data_out, which holds the sampled CDF values for the range of data points in data_in. For each continuous distribution in scipy.stats, the list dist_list holds the distribution's name as its first element and a list of its default shape parameters as its second element. I extract the default parameters from scipy.stats._distr_params.
Some distributions cannot be fitted and raise an error; I keep those in the failed list.
Finally, I generate a list fitted which holds, for each successfully fitted distribution, the estimated parameters.
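For reference, a sketch of how dist_list can be built; note that scipy.stats._distr_params is a private module, so its contents may change between scipy versions:
from scipy.stats._distr_params import distcont

# distcont is a list of [name, shape_parameters] pairs, e.g. ['gamma', (2.0,)]
dist_list = [(name, params) for name, params in distcont]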
I have a 1-D discrete dataset. On this set, I want to perform a kernel density estimation with sklearn's built-in function:
from sklearn.neighbors import KernelDensity  # sklearn.neighbors.kde in older versions
data = ... # array of shape [5000, 1]
## perform kde with gaussian kernels
kde = KernelDensity(kernel='gaussian', bandwidth=0.8).fit(data.reshape(-1, 1))
With help from kde's instance method score_samples, I am able to plot a reasonable estimation of the underlying density function:
## code for plot
X_plot = np.linspace(-5, 100, 10000)[:, np.newaxis]
log_dens = kde.score_samples(X_plot)
plt.plot(X_plot[:, 0], np.exp(log_dens))
I want to use this distribution to perform a one-sample KS-test. I found that scipy already implements this functionality. Check the documentation here. It says:
scipy.stats.kstest(rvs, cdf, args=(), N=20, alternative='two-sided', mode='approx')
rvs : str, array or callable
If a string, it should be the name of a distribution in scipy.stats.
If an array, it should be a 1-D array of observations of random
variables. If a callable, it should be a function to generate random
variables; it is required to have a keyword argument size.
cdf : str or callable
If a string, it should be the name of a distribution in scipy.stats. If rvs is a string then cdf can be False or the same as
rvs. If a callable, that callable is used to calculate the cdf.
Basically, rvs is the new sample data and cdf is the cumulative distribution function (integral of the pdf). I was not able to find out how to access the function that calculates the pdf within sklearn, so that I can integrate it and feed it to the kstest.
Has anybody an idea how to get there? Also, if there are any alternatives to this approach, please let me know.
You could simply integrate score_samples to obtain the cdf. scipy.integrate.quad might work.
** EDIT ** It seems that score_samples is the log density, which when un-logged integrates to 1. It does need some reshaping though, and the scipy integration bounds unfortunately don't accept arrays.
import numpy as np
import scipy.integrate
import scipy.stats

def cdf(y):
    # un-log the KDE density and integrate it from -inf up to y
    pdf = lambda x: np.exp(kde.score_samples(np.array([x]).reshape(-1, 1)))[0]
    return scipy.integrate.quad(pdf, -np.inf, y)[0]

def array_cdf(X):
    # quad does not accept array bounds, so evaluate the CDF pointwise
    return np.array(list(map(cdf, X)))

scipy.stats.kstest(data.ravel(), array_cdf)  # kstest expects a 1-D sample
From the docs
The probability mass function for zipf is:
zipf.pmf(k, a) = 1/(zeta(a) * k**a)
for k >= 1.
zipf takes a as shape parameter.
The probability mass function above is defined in the “standardized” form. To shift distribution use the loc parameter. Specifically, zipf.pmf(k, a, loc) is identically equivalent to zipf.pmf(k - loc, a).
But what do a and k refer to? What does "shape parameter" mean?
Additionally, in scipy.stats.zipf.interval, there's an alpha parameter.
The description of the .interval() method is simply:
Endpoints of the range that contains alpha percent of the distribution
What does the alpha parameter mean? Is that the "confidence interval"?
What does "shape parameter" mean?
As the name suggests, a shape parameter determines the shape of a distribution. This is probably easiest to explain when starting with what a shape parameter is not:
A location parameter shifts the distribution but leaves it otherwise unchanged. For example, the mean of a normal distribution is a location parameter. If X is normally distributed with mean mu, then X + a is normally distributed with mean mu + a.
A scale parameter makes the distribution wider or narrower. For example, the standard deviation of a normal distribution is a scale parameter. If X is normally distributed with standard deviation sigma, then X * a is normally distributed with standard deviation sigma * a.
Finally, a shape parameter changes the shape of the distribution. For example, the Gamma distribution has a shape parameter k that determines how skewed the distribution is (= how much it "leans" to one side).
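A quick numeric check of the loc/scale conventions in scipy.stats (a sketch):
from scipy import stats

# loc shifts the distribution: a normal shifted by 0.5
print(stats.norm.pdf(1.5, loc=0.5), stats.norm.pdf(1.0))           # equal

# scale stretches it; pdf values shrink by a factor of 1/scale
print(stats.norm.pdf(2.0, scale=2.0), stats.norm.pdf(1.0) / 2.0)   # equal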
But what do a and k refer to?
k is the variable parameterized by the distribution. With zipf.pmf you can compute the probability of any k, given shape parameter a. Below is a plot that demonstrates how a changes the shape of the distribution (the individual probabilities of different k).
A high a makes large values of k very unlikely, while with a low a, small k are less likely and larger k become possible.
What does the alpha parameter mean? Is that the "confidence interval"?
It is wrong to say that alpha is the confidence interval. It is the confidence level; I guess that is what you meant. For example, alpha=0.95 means that you have a 95% confidence interval. If you generate random k values from the particular distribution, 95% of them will be in the range returned by zipf.interval.
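A sketch illustrating this, using the positional alpha argument as in the quoted docs:
from scipy.stats import zipf

a = 2.6
lo, hi = zipf.interval(0.95, a)          # range containing 95% of the mass
samples = zipf.rvs(a, size=10000)
inside = ((samples >= lo) & (samples <= hi)).mean()
print(lo, hi, inside)                    # inside should be roughly 0.95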
Code for the plot:
from scipy.stats import zipf
import matplotlib.pyplot as plt
import numpy as np

k = np.arange(1, 11)  # zipf is discrete, so evaluate the pmf at integer k only
for a in [1.3, 2.6]:
    p = zipf.pmf(k, a=a)
    plt.plot(k, p, 'o-', label='a={}'.format(a), linewidth=2)
plt.xlabel('k')
plt.ylabel('probability')
plt.legend()
plt.show()
Could you use the kstest in scipy.stats for non-standard distribution functions (i.e., vary the DOF for Student's t, or vary gamma for Cauchy)? My end goal is to find the max p-value and corresponding parameter for my distribution fit, but that isn't the issue.
EDIT:
"
scipy.stat's cauchy pdf is:
cauchy.pdf(x) = 1 / (pi * (1 + x**2))
which implies x_0 = 0 for the location parameter and Y = 1 for gamma. I actually need it to look like this:
cauchy.pdf(x, x_0, Y) = Y**2 / [(Y * pi) * ((x - x_0)**2 + Y**2)]
"
Q1) Could Student's t, at least, be used in a way perhaps like
stuff = []
for dof in range(1, 100):  # dof = 0 is not a valid t distribution
    d, p = scipy.stats.kstest(data, "t", args=(dof,))  # kstest returns (statistic, pvalue)
    stuff.append(np.hstack((d, p, dof)))
since it seems to have the option to vary the parameter?
Q2) How would you do this if you needed the full normal distribution equation (need to vary sigma) and Cauchy as written above (need to vary gamma)? EDIT: Instead of searching scipy.stats for non-standard distributions, is it actually possible to feed a function I write into kstest so it will find p-values?
Thanks kindly
It seems that what you really want to do is parameter estimation. Using the KS test in this manner is not really what it is meant for. You should use the .fit method of the corresponding distribution.
>>> import numpy as np, scipy.stats as stats
>>> arr = stats.norm.rvs(loc=10, scale=3, size=10) # generate 10 random samples from a normal distribution
>>> arr
array([ 11.54239861, 15.76348509, 12.65427353, 13.32551871,
10.5756376 , 7.98128118, 14.39058752, 15.08548683,
9.21976924, 13.1020294 ])
>>> stats.norm.fit(arr)
(12.364046769964004, 2.3998164726918607)
>>> stats.cauchy.fit(arr)
(12.921113834451496, 1.5012714431045815)
Now to quickly check the documentation:
>>> help(stats.cauchy.fit)
Help on method fit in module scipy.stats._distn_infrastructure:
fit(data, *args, **kwds) method of scipy.stats._continuous_distns.cauchy_gen instance
Return MLEs for shape, location, and scale parameters from data.
MLE stands for Maximum Likelihood Estimate. Starting estimates for
the fit are given by input arguments; for any arguments not provided
with starting estimates, ``self._fitstart(data)`` is called to generate
such.
One can hold some parameters fixed to specific values by passing in
keyword arguments ``f0``, ``f1``, ..., ``fn`` (for shape parameters)
and ``floc`` and ``fscale`` (for location and scale parameters,
respectively).
...
Returns
-------
shape, loc, scale : tuple of floats
MLEs for any shape statistics, followed by those for location and
scale.
Notes
-----
This fit is computed by maximizing a log-likelihood function, with
penalty applied for samples outside of range of the distribution. The
returned answer is not guaranteed to be the globally optimal MLE, it
may only be locally optimal, or the optimization may fail altogether.
So, let's say you wanted to hold one of those parameters constant; you could easily do:
>>> stats.cauchy.fit(arr, floc=10)
(10, 2.4905786982353786)
>>> stats.norm.fit(arr, floc=10)
(10, 3.3686549590571668)
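And since the original question was about varying the degrees of freedom of Student's t: shape parameters can be held fixed the same way via the f0, f1, ... keywords mentioned in the help text above (a sketch):
# Hold the t distribution's shape parameter (df) fixed at 10 while
# estimating loc and scale; f0 refers to the first shape parameter.
df, loc, scale = stats.t.fit(arr, f0=10)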