Non-standard distribution variables for KS testing? - python

Could you use kstest in scipy.stats for non-standard distribution functions (i.e. vary the DOF for Student's t, or vary gamma for Cauchy)? My end goal is to find the max p-value and corresponding parameter for my distribution fit, but that isn't the issue.
EDIT:
"
scipy.stat's cauchy pdf is:
cauchy.pdf(x) = 1 / (pi * (1 + x**2))
which implies x_0 = 0 for the location parameter and Y = 1 for the scale parameter gamma. I actually need it to look like this:
cauchy.pdf(x, x_0, Y) = Y**2 / [(Y * pi) * ((x - x_0)**2 + Y**2)]
"
Q1) Could Student's t, at least, be used in a way perhaps like
stuff = []
for dof in range(1, 100):
    d, p = scipy.stats.kstest(data, "t", args=(dof,))
    stuff.append(np.hstack((d, p, dof)))
since it seems to have the option to vary the parameter?
Q2) How would you do this if you needed the full normal distribution equation (need to vary sigma) and the Cauchy as written above (need to vary gamma)? EDIT: Instead of searching scipy.stats for non-standard distributions, is it actually possible to feed a function I write into kstest so that it will find p-values?
Thanks kindly

It seems that what you really want to do is parameter estimation. Using the KS test in this manner is not really what it is meant for. You should use the .fit method of the corresponding distribution.
>>> import numpy as np, scipy.stats as stats
>>> arr = stats.norm.rvs(loc=10, scale=3, size=10) # generate 10 random samples from a normal distribution
>>> arr
array([ 11.54239861, 15.76348509, 12.65427353, 13.32551871,
10.5756376 , 7.98128118, 14.39058752, 15.08548683,
9.21976924, 13.1020294 ])
>>> stats.norm.fit(arr)
(12.364046769964004, 2.3998164726918607)
>>> stats.cauchy.fit(arr)
(12.921113834451496, 1.5012714431045815)
Now to quickly check the documentation:
>>> help(stats.cauchy.fit)
Help on method fit in module scipy.stats._distn_infrastructure:
fit(data, *args, **kwds) method of scipy.stats._continuous_distns.cauchy_gen instance
Return MLEs for shape, location, and scale parameters from data.
MLE stands for Maximum Likelihood Estimate. Starting estimates for
the fit are given by input arguments; for any arguments not provided
with starting estimates, ``self._fitstart(data)`` is called to generate
such.
One can hold some parameters fixed to specific values by passing in
keyword arguments ``f0``, ``f1``, ..., ``fn`` (for shape parameters)
and ``floc`` and ``fscale`` (for location and scale parameters,
respectively).
...
Returns
-------
shape, loc, scale : tuple of floats
MLEs for any shape statistics, followed by those for location and
scale.
Notes
-----
This fit is computed by maximizing a log-likelihood function, with
penalty applied for samples outside of range of the distribution. The
returned answer is not guaranteed to be the globally optimal MLE, it
may only be locally optimal, or the optimization may fail altogether.
So, say you wanted to hold one of those parameters constant; you could easily do:
>>> stats.cauchy.fit(arr, floc=10)
(10, 2.4905786982353786)
>>> stats.norm.fit(arr, floc=10)
(10, 3.3686549590571668)
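To connect this back to the original KS question: kstest will also accept fitted parameters via args, or any callable CDF (for example a frozen distribution's .cdf), so once you have fitted a distribution you can check its goodness of fit directly. A minimal sketch, reusing arr from above:
>>> loc, scale = stats.norm.fit(arr)
>>> stats.kstest(arr, 'norm', args=(loc, scale))                 # returns (KS statistic, p-value)
>>> stats.kstest(arr, stats.cauchy(*stats.cauchy.fit(arr)).cdf)  # any callable CDF works too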

Related

Python: how to randomly sample from nonstandard Cauchy distribution, hence with different parameters?

I was looking at the numpy documentation, and I can see that you can use np.random.standard_cauchy(), specifying the size of the output array, to sample from a standard Cauchy.
I need to sample from a Cauchy which might have x_0 != 0 and gamma != 1, i.e. might not be located at the origin, nor have scale equal to 1.
How can I do this?
If you have scipy, you can use scipy.stats.cauchy, which takes a location (x0) and a scale (gamma) parameter. It exposes the rvs method to draw random samples:
from scipy import stats
x = stats.cauchy.rvs(loc=100, scale=2.5, size=1000)  # draw 1000 samples
You may avoid the dependency on SciPy, since the Cauchy distribution is part of the location-scale family. That means that if you draw a sample x from Cauchy(0, 1), scale it by gamma and shift it by x_0, then x' = x_0 + gamma * x will be distributed according to Cauchy(x_0, gamma).
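A minimal sketch of that shift-and-scale approach using NumPy only (x_0 and gamma below are placeholders for your own location and scale):
import numpy as np

x_0, gamma = 100.0, 2.5              # desired location and scale
z = np.random.standard_cauchy(1000)  # samples from the standard Cauchy(0, 1)
x = x_0 + gamma * z                  # now distributed as Cauchy(x_0, gamma)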

What does the parameters in scipy.stats.zipf mean?

From the docs
The probability mass function for zipf is:
zipf.pmf(k, a) = 1/(zeta(a) * k**a)
for k >= 1.
zipf takes a as shape parameter.
The probability mass function above is defined in the “standardized” form. To shift distribution use the loc parameter. Specifically, zipf.pmf(k, a, loc) is identically equivalent to zipf.pmf(k - loc, a).
But what do the a and k refer to? What does "shape parameter" mean?
Additionally, in scipy.stats.zipf.interval, there's an alpha parameter.
The description of the .interval() method is simply:
Endpoints of the range that contains alpha percent of the distribution
What does the alpha parameter mean? Is that the "confidence interval"?
What does "shape parameter" mean?
As the name suggests, a shape parameter determines the shape of a distribution. This is probably easiest to explain when starting with what a shape parameter is not:
A location parameter shifts the distribution but leaves it otherwise unchanged. For example, the mean of a normal distribution is a location parameter. If X is normally distributed with mean mu, then X + a is normally distributed with mean mu + a.
A scale parameter makes the distribution wider or narrower. For example, the standard deviation of a normal distribution is a scale parameter. If X is normally distributed with standard deviation sigma, then X * a is normally distributed with standard deviation sigma * a.
Finally, a shape parameter changes the shape of the distribution. For example, the Gamma distribution has a shape parameter k that determines how skewed the distribution is (= how much it "leans" to one side).
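A quick numerical illustration of the three kinds of parameters using scipy.stats (the particular numbers are arbitrary):
from scipy import stats

stats.norm(loc=5).mean()               # location: shifts the normal to mean 5
stats.norm(scale=3).std()              # scale: widens it to standard deviation 3
stats.gamma(a=2).stats(moments='s')    # shape: the gamma's skewness depends on a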
But what do the a and k refer to?
k is the variable parameterized by the distribution. With zipf.pmf you can compute the probability of any k, given the shape parameter a. Below is a plot that demonstrates how a changes the shape of the distribution (the individual probabilities of different k).
A high a makes large values of k very unlikely, while a low a makes small k less likely and larger k more plausible.
What does the alpha parameter mean? Is that the "confidence interval"?
It is wrong to say that alpha is the confidence interval. It is the confidence level; I guess that is what you meant. For example, alpha=0.95 means that you have a 95% confidence interval. If you generate random k values from the particular distribution, 95% of them will be in the range returned by zipf.interval.
Code for the plot:
from scipy.stats import zipf
import matplotlib.pyplot as plt
import numpy as np

k = np.arange(1, 11)                 # zipf is discrete, so evaluate the pmf at integer k only
for a in [1.3, 2.6]:
    p = zipf.pmf(k, a=a)
    plt.plot(k, p, 'o-', label='a={}'.format(a), linewidth=2)
plt.xlabel('k')
plt.ylabel('probability')
plt.legend()
plt.show()
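As a rough sanity check of what interval returns (the 0.95 here is just an example confidence level; the sampled fraction should come out near, or slightly above, 0.95 since zipf is discrete):
lo, hi = zipf.interval(0.95, a=2.6)    # range holding ~95% of the probability mass
samples = zipf.rvs(a=2.6, size=10000)
frac = ((samples >= lo) & (samples <= hi)).mean()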

Reducing difference between two graphs by optimizing more than one variable in MATLAB/Python?

Suppose h is a function of x, y, z and t and it gives us a graph line (t, h) (simulated). At the same time we also have an observed graph (observed values of h against t). How can I reduce the difference between the observed (t, h) and simulated (t, h) graphs by optimizing the values of x, y and z? I want to change the simulated graph so that it comes closer and closer to the observed graph in MATLAB/Python. In the literature I have read that people have done the same thing with the Levenberg-Marquardt algorithm, but I don't know how to do it.
You are actually trying to fit the parameters x,y,z of the parametrized function h(x,y,z;t).
MATLAB
You're right that in MATLAB you should either use lsqcurvefit of the Optimization toolbox, or fit of the Curve Fitting Toolbox (I prefer the latter).
Looking at the documentation of lsqcurvefit:
x = lsqcurvefit(fun,x0,xdata,ydata);
It says in the documentation that you have a model F(x,xdata) with coefficients x and sample points xdata, and a set of measured values ydata. The function returns the least-squares parameter set x, with which your function is closest to the measured values.
Fitting algorithms usually need starting points; some implementations can choose them randomly, and in the case of lsqcurvefit this is what x0 is for. If you have
h = @(x,y,z,t) ... %// actual function here
t_meas = ... %// actual measured times here
h_meas = ... %// actual measured data here
then in the conventions of lsqcurvefit,
fun <--> @(params,t) h(params(1),params(2),params(3),t)
x0 <--> starting guess for [x,y,z]: [x0,y0,z0]
xdata <--> t_meas
ydata <--> h_meas
Your function h(x,y,z,t) should be vectorized in t, such that for vector input in t the return value is the same size as t. Then the call to lsqcurvefit will give you the optimal set of parameters:
x = lsqcurvefit(@(params,t) h(params(1),params(2),params(3),t),[x0,y0,z0],t_meas,h_meas);
h_fit = h(x(1),x(2),x(3),t_meas); %// best guess from curve fitting
Python
In python, you'd have to use the scipy.optimize module, and something like scipy.optimize.curve_fit in particular. With the above conventions you need something along the lines of this:
import scipy.optimize as opt
popt, pcov = opt.curve_fit(lambda t, x, y, z: h(x, y, z, t), t_meas, h_meas, p0=[x0, y0, z0])
Note that the p0 starting array is optional, but all parameters will be set to 1 if it's missing. The result you need is the popt array, containing the optimal values for [x,y,z]:
x,y,z = popt
h_fit = h(x,y,z,t_meas)
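Putting the Python side together, here is a self-contained sketch; h is a made-up model and the data are synthetic, purely to show the call pattern:
import numpy as np
from scipy.optimize import curve_fit

def h(x, y, z, t):                               # hypothetical model h(x, y, z; t)
    return x * np.exp(-y * t) + z

t_meas = np.linspace(0, 10, 50)
h_meas = h(2.0, 0.5, 1.0, t_meas) + 0.05 * np.random.randn(t_meas.size)

popt, pcov = curve_fit(lambda t, x, y, z: h(x, y, z, t),
                       t_meas, h_meas, p0=[1.0, 1.0, 1.0])
x_opt, y_opt, z_opt = popt
h_fit = h(x_opt, y_opt, z_opt, t_meas)           # best-fit curve at the measured times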

Linear fit including all errors with NumPy/SciPy

I have a lot of x-y data points with errors on y that I need to fit non-linear functions to. Those functions can be linear in some cases, but are more usually exponential decay, gauss curves and so on. SciPy supports this kind of fitting with scipy.optimize.curve_fit, and I can also specify the weight of each point. This gives me weighted non-linear fitting which is great. From the results, I can extract the parameters and their respective errors.
There is just one caveat: the errors are only used as weights, but are not reflected in the uncertainty of the result. If I double the errors on all of my data points, I would expect the uncertainty of the result to increase as well. So I built a test case (source code) to test this.
Fit with scipy.optimize.curve_fit gives me:
Parameters: [ 1.99900756 2.99695535]
Errors: [ 0.00424833 0.00943236]
Same but with 2 * y_err:
Parameters: [ 1.99900756 2.99695535]
Errors: [ 0.00424833 0.00943236]
So you can see that the values are identical. This tells me that the algorithm does not take the errors into account, but I think the values should be different.
I read about another fit method here as well, so I tried fitting with scipy.odr:
Beta: [ 2.00538124 2.95000413]
Beta Std Error: [ 0.00652719 0.03870884]
Same but with 20 * y_err:
Beta: [ 2.00517894 2.9489472 ]
Beta Std Error: [ 0.00642428 0.03647149]
The values are slightly different, but I do not think that this accounts for the increase in the error at all. I think that it is just rounding error or a slightly different weighting.
Is there some package that allows me to fit the data and get the actual errors? I have the formulas here in a book, but I do not want to implement this myself if I do not have to.
I have now read about linfit.py in another question. This handles what I have in mind quite well. It supports both modes, and the first one is what I need.
Fit with linfit:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00772283 0.04449971]
Same but with 20 * y_err:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.15445662 0.88999413]
Fit with linfit(relsigma=True):
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00622595 0.03587451]
Same but with 20 * y_err:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00622595 0.03587451]
Should I answer my question or just close/delete it now?
One way that works well and actually gives a better result is the bootstrap method. When data points with errors are given, one uses a parametric bootstrap and lets each x and y value describe a Gaussian distribution. One then draws a point from each of those distributions and obtains a new bootstrapped sample. Performing a simple unweighted fit gives one value for the parameters.
This process is repeated some 300 to a couple of thousand times. One ends up with a distribution of the fit parameters, from which one can take the mean and standard deviation to obtain a value and an error.
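A minimal sketch of such a parametric bootstrap (the model f, the data x, y and the errors y_err are placeholders, and only y is resampled here since only y errors are given; 1000 repetitions is an arbitrary choice):
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):                                  # placeholder model
    return a * x + b

x = np.linspace(0, 10, 30)
y = 2.0 * x + 3.0 + np.random.normal(0, 0.5, x.size)
y_err = np.full(x.size, 0.5)

params = []
for _ in range(1000):
    y_boot = np.random.normal(y, y_err)          # draw each point from its Gaussian
    popt, _ = curve_fit(f, x, y_boot)            # simple unweighted fit
    params.append(popt)
params = np.array(params)

values = params.mean(axis=0)                     # parameter estimates
errors = params.std(axis=0)                      # their uncertainties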
Another neat thing is that one does not obtain a single fit curve as a result, but lots of them. For each interpolated x value one can again take the mean and standard deviation of the many values f(x, param) and obtain an error band.
Further steps in the analysis are then performed again hundreds of times with the various fit parameters. This also takes the correlation of the fit parameters into account; it shows up clearly in such an error-band plot: although a symmetric function was fitted to the data, the error band is asymmetric. This means that interpolated values on the left have a larger uncertainty than those on the right.
Please note that, from the documentation of curve_fit:
sigma : None or N-length sequence
If not None, this vector will be used as relative weights in the
least-squares problem.
The key point here is as relative weights; therefore, yerr in line 53 and 2*yerr in line 57 should give you similar, if not identical, results.
When you increase the actual residual error, you will see the values in the covariance matrix grow larger. Say we change y += random to y += 5*random in the function generate_data():
Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 1.92810458, 3.97843448]))
('Errors: ', array([ 0.09617346, 0.64127574]))
Compares to the original result:
Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 2.00760386, 2.97817514]))
('Errors: ', array([ 0.00782591, 0.02983339]))
Also notice that the parameter estimate is now further from (2, 3), as we would expect with the increased residual error and the wider confidence intervals of the parameter estimates.
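A small sketch of that relative-weights behaviour on synthetic data: with the default absolute_sigma=False, multiplying all sigma values by a constant leaves the reported parameter errors unchanged.
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    return a * x + b

x = np.linspace(0, 10, 50)
y = 2.0 * x + 3.0 + np.random.normal(0, 0.3, x.size)
y_err = np.full(x.size, 0.3)

for scale in (1, 2):
    popt, pcov = curve_fit(f, x, y, sigma=scale * y_err)  # absolute_sigma defaults to False
    print(popt, np.sqrt(np.diag(pcov)))                   # same errors for both scale factors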
Short answer
For absolute values that include uncertainty in y (and in x for odr case):
In the scipy.odr case use stddev = numpy.sqrt(numpy.diag(cov))
where the cov is the covariance matrix odr gives in the output.
In the scipy.optimize.curve_fit case use absolute_sigma=True
flag.
For relative values (excludes uncertainty):
In the scipy.odr case use the sd value from the output.
In the scipy.optimize.curve_fit case use absolute_sigma=False flag.
Use numpy.polyfit like this:
p, cov = numpy.polyfit(x, y, 1, cov=True)
errorbars = numpy.sqrt(numpy.diag(cov))
Long answer
There is some undocumented behavior in all of these functions. My guess is that they are mixing relative and absolute values. At the end of this answer is code that either gives what you want (or doesn't), depending on how you process the output (is there a bug?). Also, curve_fit may have gained the absolute_sigma flag only recently.
My point is in the output. It seems that odr calculates the standard deviation as if there were no uncertainties, similar to polyfit, but if the standard deviation is calculated from the covariance matrix, the uncertainties are there. curve_fit does this with the absolute_sigma=True flag. Below is the output containing:
the diagonal elements of the covariance matrix, cov(0,0) and cov(1,1),
the wrong way to get the standard deviations of the slope and the constant from the outputs, and
the right way to get the standard deviations of the slope and the constant.
odr: 1.739631e-06 0.02302262 [ 0.00014863 0.0170987 ] [ 0.00131895 0.15173207]
curve_fit: 2.209469e-08 0.00029239 [ 0.00014864 0.01709943] [ 0.0004899 0.05635713]
polyfit: 2.232016e-08 0.00029537 [ 0.0001494 0.01718643]
Notice that odr and polyfit give exactly the same standard deviation. Polyfit does not take the uncertainties as an input, so apparently odr does not use the uncertainties when calculating its standard deviation either. The covariance matrix does use them: if, in the odr case, the standard deviation is calculated from the covariance matrix, the uncertainties are there, and they change if the uncertainty is increased. Fiddling with dy in the code below will show this.
I am writing this here mostly because it is important to know when determining error limits (and the Fortran ODRPACK guide that scipy refers to has some misleading information about this: the standard deviation should be the square root of the covariance matrix, as the guide says, but it is not).
import scipy.odr
import scipy.optimize
import numpy

x = numpy.arange(200)
y = x + 0.4*numpy.random.random(x.shape)
dy = 0.4 * numpy.ones(x.shape)   # per-point uncertainty (an array so curve_fit accepts it)

def stddev(cov):
    return numpy.sqrt(numpy.diag(cov))

def f(B, x):
    return B[0]*x + B[1]

# ODR fit
linear = scipy.odr.Model(f)
mydata = scipy.odr.RealData(x, y, sy=dy)
myodr = scipy.odr.ODR(mydata, linear, beta0=[1.0, 1.0], sstol=1e-20, job=0)
myoutput = myodr.run()
cov = myoutput.cov_beta
sd = myoutput.sd_beta
p = myoutput.beta
print('odr:       ', cov[0, 0], cov[1, 1], sd, stddev(cov))

# curve_fit with relative (default) and absolute sigma
p2, cov2 = scipy.optimize.curve_fit(lambda x, a, b: a*x + b,
                                    x, y, [1, 1],
                                    sigma=dy,
                                    absolute_sigma=False,
                                    xtol=1e-20)
p3, cov3 = scipy.optimize.curve_fit(lambda x, a, b: a*x + b,
                                    x, y, [1, 1],
                                    sigma=dy,
                                    absolute_sigma=True,
                                    xtol=1e-20)
print('curve_fit: ', cov2[0, 0], cov2[1, 1], stddev(cov2), stddev(cov3))

# polyfit (no uncertainties as input)
p4, cov4 = numpy.polyfit(x, y, 1, cov=True)
print('polyfit:   ', cov4[0, 0], cov4[1, 1], stddev(cov4))

Scipy - Stats - Meaning of parameters for probability distributions

Scipy docs give the distribution form used by exponential as:
expon.pdf(x) = lambda * exp(- lambda*x)
However the fit function takes :
fit(data, loc=0, scale=1)
And the rvs function takes:
rvs(loc=0, scale=1, size=1)
Question 1:
Why the extraneous location variable? I know that exponentials are just specific forms of a more general distribution (the gamma), but why include the unneeded information? Even the gamma doesn't have a location parameter.
Question 2:
Is the output of fit(...) in the same order as the input variables? By that I mean:
If I do t = fit([....]), t will have the form t[0], t[1].
Should I interpret t[0] as the shape and t[1] as the scale?
Does this hold for all the distributions?
What about for gamma :
fit(data, a, loc=0, scale=1)
Every univariate probability distribution, no matter what its usual formulation, can be extended to include a location and scale parameter. Sometimes, this entails extending the support of the distribution from just the positive/non-negative reals to the whole real number line with just a PDF value of 0 when below the loc value. scipy.stats does this to move all of the handling of loc and scale to a common method shared by all distributions. It has been suggested to remove this, and make distributions like gamma loc-less to follow their canonical formulations. However, it turns out that some people do actually use "shifted gamma" distributions with nonzero loc parameters to model the sizes of sunspots, if I remember correctly, and the current behavior of scipy.stats was perfect for them. So we're keeping it.
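A hedged illustration of how loc and scale enter every continuous distribution: the PDF is evaluated in its standardized form at (x - loc) / scale and divided by scale (the numbers below are arbitrary):
from scipy import stats

x, loc, scale = 4.0, 2.0, 3.0
stats.expon.pdf(x, loc=loc, scale=scale)      # shifted and scaled exponential
stats.expon.pdf((x - loc) / scale) / scale    # same value from the standard form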
The output of the fit() method is a tuple of the form (shape0, shape1, ..., shapeN, loc, scale) if there are N shape parameters. For a normal distribution, which has no shape parameters, it will return just (loc, scale). For a gamma distribution, which has one, it will return (shape, loc, scale). Multiple shape parameters will be in the same order that you give to every other method on the distribution. This holds for all distributions.
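A quick illustration of that ordering on synthetic data (the exact fitted numbers will vary):
from scipy import stats

data = stats.gamma.rvs(2.0, loc=0, scale=3.0, size=1000)   # synthetic gamma sample
print(stats.norm.fit(data))     # -> (loc, scale): no shape parameters
print(stats.gamma.fit(data))    # -> (shape, loc, scale)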
