RVS Argument of Scipy Kstest for non-standard (Landau) distribution - python

I am working with Landau distributions (https://en.wikipedia.org/wiki/Landau_distribution) and am trying to use scipy.stats.kstest to get a goodness of fit statistic.
Since the Landau distribution is a non-standard distribution for scipy, I am using an import from ROOT (the CERN C++ package) as a custom, callable CDF as per the stats.kstest documentation. I also have a 1-D array of my measured/empirical data.
As it stands, I have the following definition of my kstest function:
def get_ks_test(fit_info):
    coeffs, erfc_bool = fit_info[0], fit_info[4]
    (mu, sigma, A_l, A_e) = coeffs

    def Lcdf(x):
        return [A_l*Math.landau_cdf(i, sigma, mu) for i in x]

    def integrand(x):
        return A_e*erfc((x - mu)/sigma)

    def total_cdf(x):
        return Lcdf(x) + quad(integrand, 0, x)

    scatter_data = pdf_fit_return[3]
    return kstest(scatter_data, total_cdf)
Here Lcdf and integrand are combined to form my callable CDF function. The problem arises with the first argument of kstest, the "rvs" array scatter_data. As I understand it, the "rvs" argument is supposed to be an array of observations to match with the distribution in question. But in the source code, "rvs" is sent as an argument to "cdf". This makes no sense to me; shouldn't the CDF be a function of x, not of the data it's supposed to match?
When I pass my scatter_data as the first argument of kstest, it tries to plug those values into total_cdf when that function is supposed to take an argument in x. I don't know if I am failing to understand the usage of rvs/cdf here or what, but the way the kstest function is designed seems backwards to me.
I would like to be able to give kstest my data and my intended distribution (a callable CDF), but all the examples I've seen online involve built-in or standard distributions (normal, beta, etc.).
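For what it's worth, this is by design: the KS statistic is D = max_i |ECDF(x_i) - CDF(x_i)|, so kstest has to evaluate the theoretical CDF at the observed sample values, which is why rvs gets passed into cdf. Below is a minimal sketch of the calling convention, using a hand-written exponential CDF in place of the ROOT Landau CDF; the callable just has to accept an array of sample values and return the CDF evaluated at them.

import numpy as np
from scipy.stats import kstest

def expon_cdf(x, lam=1.0):
    # vectorized CDF of Exp(lam); kstest will call this on the data array
    x = np.asarray(x, dtype=float)
    return 1.0 - np.exp(-lam * x)

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=500)  # hypothetical observations

stat, pvalue = kstest(sample, expon_cdf)
print(stat, pvalue)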

Related

What do the scipy.stats.binom and scipy.stats.hypergeom functions actually do?

I am trying to work with some hypergeometric and binomial random variables, and so I am looking at the scipy.stats functionality. But I'm confused about what the scipy.stats.binom() and scipy.stats.hypergeom() functions actually do. Do they implicitly create a PMF with the given parameters, which we then access with the pmf() function, or do they define a function from the sample space to the numerical quantities we define? The latter is what a random variable actually does, but I haven't passed a sample space to the binom or hypergeom functions, so I'm confused about what they are actually doing. The reference manual doesn't clear things up.
Thank you for any help.
According to the documentation:
A binomial discrete random variable.
As an instance of the rv_discrete class, binom object inherits from it
a collection of generic methods (see below for the full list), and
completes them with details specific for this particular distribution.
Some of these methods are pmf(k, n, p, loc=0), median(n, p, loc=0), and std(n, p, loc=0).
Alternatively, the distribution object can be called (as a function) to fix the shape and location. This returns a “frozen” RV object holding the given parameters fixed.
So that
from scipy.stats import binom
n,p = 5, 0.4
rv = binom(n, p)
rv.rvs(size=1000)
binom.rvs(n, p, size=1000)
do the same thing, because you froze the parameters at n,p when you called the constructor function binom.
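To make the frozen/unfrozen equivalence concrete, here is a small sketch showing that every generic method gives the same result whether you freeze the shape parameters once or pass them on each call:

from scipy.stats import binom

n, p = 5, 0.4
rv = binom(n, p)  # frozen: n and p are stored inside the object

print(rv.pmf(2), binom.pmf(2, n, p))  # P(X = 2), identical results
print(rv.cdf(3), binom.cdf(3, n, p))  # P(X <= 3), identical results
print(rv.mean(), binom.mean(n, p))    # n*p = 2.0 either way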

Limits involving the cumulative distribution function of a normal variable

I'm working through some exercises on improper integrals and I've stumbled across an issue I can't resolve. I'm attempting to use the limit() function on the following problem:

lim_{x -> 0} (N(x) - 1/2 - x/sqrt(2*pi)) / x^3

Here N(x) is the cumulative distribution function of the standard normal variable.
The limit() function so far hasn't caused any problems, including on problems which require L'Hôpital's rule to be applied. However, I'm struggling to compute the correct answer for this particular problem and can't work out why. The following code yields an incorrect answer:
from sympy import *
x, y = symbols('x y')
init_printing(use_unicode=False)  # print the answers without unicode characters
cum_distribution = (1/sqrt(2*pi)*(integrate(exp(-y**2/2), (y, -oo, x))))
func = (cum_distribution -(1/2)-(x/sqrt(2*pi)))/(x**3)
limit(func, x, 0)
If I apply L'Hôpital's rule, I get the correct answer:
l_hopital = diff((cum_distribution -(1/2)-(x/sqrt(2*pi))), x)/diff(x**3, x)
limit(l_hopital, x, 0)
I looked through the limit() function source code, and my understanding is that L'Hôpital's rule isn't applied. In that case, can this problem be solved using the limit() function without applying the rule manually?
At present, a limit involving the function erf (the error function, to which the normal CDF is related) can only be evaluated when the argument of erf tends to positive infinity. Limits at other places are either not evaluated, or evaluated incorrectly. (Related PR). This includes the limit
limit(-(sqrt(2)*x - sqrt(pi)*erf(sqrt(2)*x/2))/(2*sqrt(pi)*x**3), x, 0)
which returns unevaluated (though I would not call this incorrect). As a workaround, you can compute the Taylor series of this function with one term (the constant term), which gives the correct value of the limit:
series(func, x, 0, 1).removeO()
returns -sqrt(2)/(12*sqrt(pi)).
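Putting the workaround together with the question's definitions gives a self-contained sketch (Rational(1, 2) is used instead of the float 1/2 so the result stays exact):

from sympy import symbols, sqrt, pi, exp, integrate, series, oo, Rational

x, y = symbols('x y')

# standard normal CDF written as an integral, as in the question
cum_distribution = (1/sqrt(2*pi)) * integrate(exp(-y**2/2), (y, -oo, x))
func = (cum_distribution - Rational(1, 2) - x/sqrt(2*pi)) / x**3

# one-term series around x = 0: the constant term is the limit
print(series(func, x, 0, 1).removeO())  # -sqrt(2)/(12*sqrt(pi))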
As in calculus practice, L'Hopital's rule is inferior to power series techniques when it comes to algorithmic computations, and SymPy relies primarily on the latter. The algorithm it uses is devised and explained in On Computing Limits in a Symbolic Manipulation System by Dominik Gruntz.

What is fitfunc and errfunc intuitively in python?

I just wanted to ask what fitfunc and errfunc, followed by scipy.optimize.leastsq, are doing, intuitively. I am not really used to Python, but I would like to understand this. Here is the code that I am trying to understand:
def optimize_parameters2(p0, mz):
    fitfunc = lambda p, p0, mz: calculate_sp2(p, p0, mz)
    errfunc = lambda p, p0, mz: exp - fitfunc(p, p0, mz)
    return scipy.optimize.leastsq(errfunc, p0, args=(p0, mz))
Can someone please explain what this code is saying narratively word by word?
Sorry for being so specific but I really do have trouble understanding what it's saying.
This particular code snippet is implementing nonlinear least-squares regression to find the parameters of a curve function (this is the fitfunc, here) that best fit a set of data (exp, probably an abbreviation for "experimental data"). leastsq() is a somewhat more general routine for doing nonlinear least-squares optimization, not just curve-fitting. It requires a function (named errfunc, here) that is given a vector of parameters (p) and returns an array. It will attempt to find the parameter vector that minimizes the square of the returned array. In order to implement "fitting a curve to data" with leastsq, you have to provide an errfunc that evaluates the curve (fitfunc) at the given trial parameter vector and then subtracts it from the data (i.e. calculate the "error" or sometimes called the "residuals").
Just to be clear, none of these names are important. I'm just using them to refer to specific parts of the code snippet you provided. You will find other code that uses leastsq() for curve-fitting that names and organizes the code a little bit differently, but now that you know the general scheme, you should be able to follow along.
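Just to make that scheme concrete, here is a minimal, self-contained sketch of curve fitting with leastsq (a hypothetical exponential-decay model, not the calculate_sp2 code from the question):

import numpy as np
from scipy.optimize import leastsq

# hypothetical model ("fitfunc"): exponential decay with parameters p = [A, k]
def fitfunc(p, x):
    A, k = p
    return A * np.exp(-k * x)

# "errfunc": residuals = data - model; leastsq minimizes the sum of their squares
def errfunc(p, x, data):
    return data - fitfunc(p, x)

# fake "experimental" data generated from known parameters plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
data = fitfunc([2.5, 1.3], x) + 0.05 * rng.standard_normal(x.size)

p_best, ier = leastsq(errfunc, [1.0, 1.0], args=(x, data))
print(p_best)  # close to [2.5, 1.3]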
Python supports the creation of anonymous functions (i.e. functions that are not bound to a name) at runtime, using a construct called lambda. In your example, fitfunc and errfunc are two such lambda functions.
I believe calculate_sp2 and exp_fitfunc are simply two functions which are in the code but you didn't provide their code in the example. So, in short fitfunc actually calls the calculate_sp2 function with 3 parameters (p, p0, mz) and returns the value which is returned by calculate_sp2. errfunc also works in the same manner.
As mentioned in official documentation of scipy.optimize.leastsq, leastsq() minimizes the sum of squares of a set of equations. You can learn about the parameters of leastsq() from the official documentation.
I am giving a simple example to illustrate how lambda function works.
def add(x, y):
    return x + y

def subtract(x, y):
    return x - y if x > y else y - x

def main(x, y):
    addition = lambda x, y: add(x, y)
    subtraction = lambda x, y: subtract(x, y)
    return addition(x, y) * subtraction(x, y)

print(main(7, 4))  # prints 33, which is equal to (7+4)*(7-4)

Understanding numpy.random.lognormal

I'm translating Matlab code (written by someone else) to Python.
In one section of the Matlab code, a variable X_new is set to a value drawn from a log-normal distribution as follows:
% log normal distribution
X_new = exp(normrnd(log(X_old), sigma));
That is, a random value is drawn from a normal distribution centered at log(X_old), and X_new is set to e raised to this value.
The direct translation of this code to Python is as follows:
import numpy as np
X_new = np.exp(np.random.normal(np.log(X_old), sigma))
But numpy includes a log-normal distribution which can be sampled directly.
My question is, is the line of code that follows equivalent to the lines of code above?
X_new = np.random.lognormal(np.log(X_old), sigma)
I think I'm going to have to answer my own question here.
From the documentation for np.random.lognormal, we have
A variable x has a log-normal distribution if log(x) is normally distributed.
Let's think about X_new from the Matlab code as a particular instance of a random variable x. The question is, is log(x) normally distributed here? Well, log(X_new) is just normrnd(log(X_old), sigma). So the answer is yes.
Now let's move to the call to np.random.lognormal in the second version of the Python code. X_new is again a particular instance of a random variable we can call x. Is log(x) normally distributed here? Yes, it must be, else numpy would not call this function lognormal. The mean of the underlying normal distribution is log(X_old) which is the same as the mean of the normal distribution in the Matlab code.
Hence, all implementations of the log-normal distribution in the question are equivalent (ignoring any very low-level implementation differences between the languages).
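A quick empirical check (a sketch with arbitrary values for X_old and sigma) shows the two versions agree up to sampling noise:

import numpy as np

X_old, sigma = 3.0, 0.5  # arbitrary values for illustration
rng = np.random.default_rng(0)
n = 100_000

# version 1: exponentiate a normal draw centered at log(X_old)
a = np.exp(rng.normal(np.log(X_old), sigma, size=n))

# version 2: sample the log-normal directly with the same underlying parameters
b = rng.lognormal(np.log(X_old), sigma, size=n)

print(np.median(a), np.median(b))  # both close to X_old = 3.0
print(a.mean(), b.mean())          # both close to X_old * exp(sigma**2 / 2)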

Fitting a distribution to data: how to penalize "bad" parameter estimates?

I'm using scipy's least-squares optimization to fit an exponentially modified Gaussian distribution to a set of reaction time measurements. In general, it works well, but sometimes, the optimization goes off the rails and chooses a crazy value for a parameter -- the resulting plot clearly doesn't fit the data very well. In general, it looks like the problems arise from floating-point precision errors -- we head off into 0 or inf or nan-land.
I'm thinking of doing two things:
Using the parameters to simultaneously fit a CDF and PDF to the data; I have formulas for both. (I'm using a kernel density estimate to approximate the PDF from the data.)
Somehow taking into account the distance from the initial parameter estimates (generated by the method of moments approach on the wikipedia page). Those estimates are far from perfect, but are pretty good and seem to steer clear of "exploding floating point" problems.
Combining the PDF and CDF fits sounds pretty straightforward; the scales of the error will even be generally the same. But getting the initial parameter fits in there: I'm not quite sure if it's even a good idea -- but if it is:
What would I do about the difference in scale? Should I normalize the parameter "error" to a percent error?
Is there a reasonable way to decide on a relative weight between the data estimation error and parameter "error"?
Are these even the right questions to be asking? Are there generally-regarded "correct" answers, or is "try some stuff until you find something that seems to work" a good approach?
One example dataset
As requested, here's a dataset for which this process isn't working very well. I know there are only a few samples and that the data don't fit the distribution well; I'm still hoping against hope that I can get a "reasonable-looking" result from optimization.
array([ 450., 560., 692., 730., 758., 723., 486., 596., 716.,
695., 757., 522., 535., 419., 478., 666., 637., 569.,
859., 883., 551., 652., 378., 801., 718., 479., 544.])
MLE Update
I had a bunch of problems getting my MLE estimate to converge to a "reasonable" value, until I discovered this: If X contains at least one nan, np.sum(X) == nan when X is a numpy array but not when X is a pandas Series. So the sum of the log-likelihood was doing stupid things when the parameters started to go out of bounds.
Added a np.asarray() call and everything is great!
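That behaviour is easy to reproduce (a small sketch, assuming pandas is installed):

import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, np.nan])
print(np.sum(x))             # nan: NumPy propagates NaN
print(np.sum(pd.Series(x)))  # 3.0: np.sum defers to Series.sum, which skips NaN by default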
This should have been a comment, but I ran out of space.
I think a maximum likelihood (ML) fit is probably the most appropriate approach here. The ML method is already implemented for many distributions in scipy.stats: for example, you can find the MLE of the normal distribution by calling scipy.stats.norm.fit, and find the MLE of the exponential distribution in a similar way. Combining these two sets of MLE parameters should give you a pretty good starting point for the Ex-Gaussian ML fit. In fact, I would imagine most of your data is quite nicely normally distributed. If that is the case, the ML parameter estimates for the normal distribution alone should give you a pretty good starting parameter vector.
Since the Ex-Gaussian only has 3 parameters, I don't think an ML fit will be hard at all. If you could provide a dataset for which your current method doesn't work well, it would be easier to show a real example.
Alright, here you go:
>>> import scipy.special as sse
>>> import scipy.stats as sss
>>> import scipy.optimize as so
>>> from numpy import *
>>> def eg_pdf(p, x):  # defines the PDF
...     m = p[0]
...     s = p[1]
...     l = p[2]
...     return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
...
>>> xo = array([ 450.,  560.,  692.,  730.,  758.,  723.,  486.,  596.,  716.,
...              695.,  757.,  522.,  535.,  419.,  478.,  666.,  637.,  569.,
...              859.,  883.,  551.,  652.,  378.,  801.,  718.,  479.,  544.])
>>> sss.norm.fit(xo)  # get the starting parameter vector from the normal MLE
(624.22222222222217, 132.23977474531389)
>>> def llh(p, f, x):  # defines the negative log-likelihood function
...     return -sum(log(f(p, x)))
...
>>> so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo))  # yeah, the data is not good
Warning: Maximum number of function evaluations has been exceeded.
array([  6.14003407e+02,   1.31843250e+02,   9.79425845e-02])
>>> przt = so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo), maxfun=1000)  # so, we raise the limit on function evaluations
Optimization terminated successfully.
         Current function value: 170.195924
         Iterations: 376
         Function evaluations: 681
>>> llh(array([624.22222222222217, 132.23977474531389, 1e-6]), eg_pdf, xo)
400.02921290185645
>>> llh(przt, eg_pdf, xo)  # quite an improvement over the initial guess
170.19592431051217
>>> przt
array([  6.14007039e+02,   1.31844654e+02,   9.78934519e-02])
The optimizer used here (fmin, i.e. the Nelder-Mead simplex algorithm) does not use any gradient information and usually works much more slowly than optimizers that do. It appears that the derivative of the negative log-likelihood function of the Exponential Gaussian can be written in closed form fairly easily. If so, optimizers that utilize the gradient/derivative (such as fmin_bfgs) will be a better and more efficient choice.
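Even without a hand-written derivative, a sketch of that refinement step could look like this (reusing llh, eg_pdf, xo and the Nelder-Mead result przt from the session above; since fprime is omitted, fmin_bfgs falls back to finite-difference gradients):

import scipy.optimize as so  # already imported in the session above

# sketch: refine the Nelder-Mead solution with a gradient-based optimizer
refined = so.fmin_bfgs(llh, przt, args=(eg_pdf, xo))
print(refined)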
The other thing to consider is parameter constraints. By definition, sigma and lambda have to be positive for the Exponential Gaussian. You can use a constrained optimizer (such as fmin_l_bfgs_b). Alternatively, you can optimize a reparametrized PDF:
>>> def eg_pdf2(p, x):  # defines the PDF
...     m = p[0]
...     s = exp(p[1])
...     l = exp(p[2])
...     return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
Due to the functional invariance property of the MLE, the MLE obtained with this parametrization should be the same as that of the original eg_pdf. There are other transformations you can use, besides exp(), to map (-inf, +inf) to (0, +inf).
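For example (a sketch continuing the session above, with so, llh, eg_pdf2 and xo already defined), you fit in the transformed space and then map the two scale parameters back with exp():

from numpy import array, exp, log

# start from the normal-MLE guess, with the scale parameters log-transformed
p0 = array([624.222, log(132.240), log(1e-6)])
fit2 = so.fmin(llh, p0, (eg_pdf2, xo), maxfun=2000)

mu_hat = fit2[0]
sigma_hat = exp(fit2[1])   # undoing the transform guarantees positivity
lambda_hat = exp(fit2[2])
print(mu_hat, sigma_hat, lambda_hat)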
And you can also consider http://en.wikipedia.org/wiki/Lagrange_multiplier.
