Probability density function convolution in python: how to efficiently? - python

Given that this is a fairly long post, I think it's best if I start with my question: What is the best way of implementing this model in python?
I'm developing a statistical model that involves the convolution of probability density functions. Without giving away what I can't share, here is my problem:
Parameter A describes the time from t0 to event A at t1. It is given by a uniform distribution with lower and upper bounds a and b respectively.
A ~ U(a, b)
Parameter B describes the time from t1 to event B at t2. It is also given by a uniform distribution with lower and upper bounds c and d respectively.
B ~ U(c, d)
Parameter G describes the time from t0 to event C, at an unknown time. It is given by a gamma distribution with shape and rate parameters alpha and beta, with mean = alpha / beta and variance = alpha / beta^2.
G ~ G(alpha, beta)
We would like to know the probability that G falls between t1 and t2. We would also like to perform a grid search over these parameters, so it is important that the calculation is completed efficiently. The target end users include civil service agencies that do not have access to high-performance computing (as we do), so parallelisation is not an option.
The time at which t1 occurs is given by A; the time at which t2 occurs is given by the sum of A and B:
Y = A + B
W = A - G
Z = Y - G
As these two distributions are uniform, they may be analytically convolved. The probability that G occurs between A and Y is given as P(A < G < Y). We can rewrite this using the variables W and Z defined above (note that W <= Z always, since B >= 0, so the event {W > 0} is contained in {Z > 0}):
P(A < G < Y) = P(W < 0 < Z)
P(A < G < Y) = P(Z > 0) - P(W > 0)
P(A < G < Y) = [1 - P(Z < 0)] - [1 - P(W < 0)]
P(A < G < Y) = Fw(0) - Fz(0)
Fw(0) and Fz(0) refer to the cumulative distribution functions of W and Z respectively.
Fw(0) can be calculated as a double integral (shown as an image in the original post).
For Fz(0), a triple integral is required.
What would be the best way of computing this in Python? We have a working program, but it is slow even when we constrain the integration bounds and analytically convolve A and B to give Y. The current strategy is to convolve numerically, using quad from scipy.integrate. This is slow, frequently gives us round-off errors, and overruns the maximum number of subdivisions.
Is there a better way?
We can verify the results of the program against two existing implementations, one programmed in Fortran but utilising NAG library integration routines, and another written in MATLAB. The Fortran version is satisfactory for our own scientific research, but our motivation for using Python is to produce distributable source code written in a freely available language using open-source routines such as those in the SciPy library. The use of NAG library routines precludes this with Fortran.
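One possible way to cut the cost, sketched below under the assumptions that A, B and G are independent and that the bounds a, b, c, d and the gamma parameters alpha, beta are known: since P(A < G < A + B) = E[F_G(A + B) - F_G(A)], where F_G is the gamma CDF available analytically from scipy.stats, the triple integral collapses to a double integral over the two uniform densities, which scipy.integrate.dblquad can evaluate. This is a sketch of the idea, not a drop-in replacement for the existing program.

from scipy.integrate import dblquad
from scipy.stats import gamma

def prob_G_between(a, b, c, d, alpha, beta):
    # frozen gamma with shape alpha and rate beta (scipy uses scale = 1/rate)
    G = gamma(a=alpha, scale=1.0 / beta)
    density = 1.0 / ((b - a) * (d - c))  # joint density of independent A ~ U(a, b), B ~ U(c, d)
    # integrand takes the inner variable (bb) first, then the outer variable (aa)
    integrand = lambda bb, aa: (G.cdf(aa + bb) - G.cdf(aa)) * density
    value, abserr = dblquad(integrand, a, b, lambda _: c, lambda _: d)  # aa over [a, b], bb over [c, d]
    return value

# hypothetical parameter values, purely for illustration
print(prob_G_between(1.0, 2.0, 0.5, 1.5, alpha=3.0, beta=1.0))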

Related

Python: How can I test for a relatively straight line in a series of cubic lines?

I have a collection of curved lines, representing the third degree polynomial line of best fit for some datasets.
I want to pick out the relatively flat lines and filter those plots out for further analysis.
For example I want to filter subplots 20935, 21004, 21010, 18761, 21037.
How can I do this, with a list of floats as input for these lines?
(using Python 3.8, NumPy, math, matplotlib in an Anaconda env)
If you have got a list of xs and their respective ys, you can compute the slope for each point and check if the slope is always a constant value.
threshold = 0.001  # add your precision here; zero indicates a perfect straight line
is_straight_line = True
slope = (y[1] - y[0]) / (x[1] - x[0])
for i, (xval, yval) in enumerate(zip(x[2:], y[2:])):
    # slope between the current point and the previous one (index i + 1 in the full lists)
    s = (yval - y[i + 1]) / (xval - x[i + 1])
    if abs(s - slope) > threshold:
        is_straight_line = False
        break
print(is_straight_line)
If you need the computation to be efficient, you should consider using NumPy instead.
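A vectorised version of the same check, as a sketch (the function name and the default threshold are mine, not part of the answer above):

import numpy as np

def is_straight(x, y, threshold=0.001):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slopes = np.diff(y) / np.diff(x)  # slope of each consecutive segment
    return bool(np.all(np.abs(slopes - slopes[0]) <= threshold))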
Knowledge of first-year calculus is assumed. There's a geometric property called "curvature" that basically determines how much a shape bends at a certain point (really the inverse of the radius of the osculating circle at that point).
We can use the standard curvature formula to derive an expression for a cubic function with coefficients [a, b, c, d] at a point x.
def cubic_curvature(a, b, c, d, x):
    k = abs(6*a*x + 2*b) / (1 + (3*a*x**2 + 2*b*x + c)**2) ** 1.5
    return k
More general algorithms can be created for any polynomial, possibly with assistance from the sympy library depending on your needs.
With this in mind, you can set some threshold for curvature that determines whether the cubic is "straight" enough given its coefficients (I believe scipy or similar should be able to give you these from a list of points) and the x-value to be evaluated at (try the median independent variable).
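As a sketch of how that thresholding might look in practice, reusing the cubic_curvature function defined above (np.polyfit and the threshold value of 0.1 are my assumptions, not part of the answer):

import numpy as np

def is_flat(x, y, threshold=0.1):
    # fit a cubic; np.polyfit returns coefficients from highest degree down, matching [a, b, c, d]
    a, b, c, d = np.polyfit(x, y, 3)
    x_med = np.median(x)  # evaluate curvature at the median independent variable
    return cubic_curvature(a, b, c, d, x_med) < threshold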

How to implement Quadratic constraint in SCS for python

I have a quadratic optimisation problem of the form
minimise over x: (1/2) x'x + q'x, subject to Gx <= h
I have a rather big problem (a few million points and constraints), and while CVXOPT's default solver proved effective, I'm curious about implementing it in SCS, which should be faster (and without using the CVXPY interface).
After a literature search (mainly Boyd and Vandenberghe's Convex Optimization), the reformulation to SOCP form should yield
minimise over z: c'z, subject to Az + s = b, s in K
with c = (1 q)' and z = (t x)', where t is a scalar and K is the Cartesian product of the linear cone associated with my original constraints (Gx <= h) and a quadratic cone Q = { (t, x) | t >= ||x|| }
However, how should I actually define A and b ?
I imagined something in the like of A = [[0 G],[2, 1, .., 1]] and b = (h 0).
However, with the 'q' option set to [1] in the cone dictionary argument of Python's scs.solve, I can't get it to work. What is the expected syntax for A's last row? (That is, assuming my mathematical reformulation is correct...)
Thank you for your help!
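Since no answer is recorded here, the following is only a sketch of one way the data could be assembled, and every detail should be treated as unverified. One wrinkle worth noting: with K containing Q = {(t, x) | t >= ||x||}, the objective t + q'x would minimise ||x|| + q'x rather than (1/2) x'x + q'x, so the sketch instead encodes (1/2) x'x <= t through the equivalent standard second-order cone ||(x, t-1)|| <= t + 1. It also assumes the older scs.solve(data, cone) calling convention.

import numpy as np
import scipy.sparse as sp
import scs

def build_scs_data(G, h, q):
    m, n = G.shape
    c = np.concatenate(([1.0], q))  # objective t + q'x over z = (t, x)

    # linear cone rows: G x + s_l = h, s_l >= 0
    A_lin = sp.hstack([sp.csc_matrix((m, 1)), sp.csc_matrix(G)])

    # second-order cone rows: s_q = b_q - A_q z = (t + 1, x, t - 1),
    # and s_q[0] >= ||s_q[1:]|| encodes (1/2) x'x <= t
    A_soc = sp.vstack([
        sp.hstack([sp.csc_matrix([[-1.0]]), sp.csc_matrix((1, n))]),
        sp.hstack([sp.csc_matrix((n, 1)), -sp.identity(n)]),
        sp.hstack([sp.csc_matrix([[-1.0]]), sp.csc_matrix((1, n))]),
    ])
    b = np.concatenate([h, [1.0], np.zeros(n), [-1.0]])

    A = sp.vstack([A_lin, A_soc]).tocsc()
    cone = {'l': m, 'q': [n + 2]}  # linear rows first, then one SOC of size n + 2
    return {'A': A, 'b': b, 'c': c}, cone

# hypothetical usage:
# data, cone = build_scs_data(G, h, q)
# solution = scs.solve(data, cone)

Note also that more recent SCS releases accept a quadratic objective term P directly in the data dictionary, which may make the epigraph reformulation unnecessary.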

SciPy: what is a "generalized" continuous random variable?

I was trying to implement a half-logistic distribution and came across halflogistic and genhalflogistic.
halflogistic: "A half-logistic continuous random variable."
genhalflogistic: "A generalized half-logistic continuous random variable."
This "generalized" version comes up for some of SciPy's other continuous random variables as well, such as gennorm.
My question is: what does "generalized" mean and how is it different from the non-generalized version?
"Generalized" means having one or more additional parameters which somehow affect the shape of the distribution. To find what they are, compare the probability density functions. Let's start with normal:
norm.pdf(x) = exp(-x**2/2)/sqrt(2*pi)
versus
gennorm.pdf(x, beta) = beta / (2 * gamma(1/beta)) * exp(-|x|**beta)
Here, beta is the additional parameter. If beta = 2, you get the normal distribution (scaled a bit differently compared to norm). With beta = 1 you get the Laplace distribution, and other positive values of beta give lighter or heavier tails than the normal.
It's a bit more confusing with half-logistic, though, because the formulas do not look alike:
halflogistic.pdf(x) = 2 * exp(-x) / (1+exp(-x))**2
versus
genhalflogistic.pdf(x, c) = 2 * (1-c*x)**(1/c-1) / (1+(1-c*x)**(1/c))**2
But taking the limit as c→0 in the latter formula gives the former. So, c is the shape parameter here. The support of generalized half-logistic is the interval [0, 1/c]. The limit form c→0 has infinite support [0, ∞).
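A quick numerical check of that limit (this snippet is mine, not part of the answer): as the shape parameter c shrinks, the generalized pdf approaches the plain half-logistic pdf on the shared part of their supports.

import numpy as np
from scipy.stats import halflogistic, genhalflogistic

x = np.linspace(0, 5, 6)
for c in (0.5, 0.1, 0.01):
    diff = np.max(np.abs(genhalflogistic.pdf(x, c) - halflogistic.pdf(x)))
    print("c = %-5g max |difference| = %.4f" % (c, diff))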

how to estimate parameters of mixture of 2 exponential random variables (ideally in Python)

Imagine a simulation experiment in which the output is n total numbers, where k of them are sampled from an exponential random variable with rate a and n-k are sampled from an exponential random variable with rate b. The constraints are that 0 < a ≤ b and 0 ≤ k ≤ n, but a, b, and k are all unknown. Further, because of details of the simulation experiment, when a << b, k ≈ 0, and when a = b, k ≈ n/2.
My goal is to estimate either a or b (I don't care about k, and I don't need to estimate both a and b: just one of the two is fine). From speculation, it seems as though estimating just b might be the easiest path (when a << b, there is pretty much nothing to use to estimate a and plenty to estimate b, and when a = b, there is still plenty to estimate b). I want to do it in Python ideally, but I am open to any free software.
My first approach was to use scipy.optimize to optimize a likelihood function where, for each number in my dataset, I compute P(X=x) for an exponential with rate a, compute the same for an exponential with rate b, and simply choose the larger of the two:
from sys import stdin
from math import exp, log
from scipy.optimize import fmin

DATA = None

def pdf(x, l):  # compute P(X=x) for an exponential rv X with rate l
    return l * exp(-1 * l * x)

def logML(X, la, lb):  # compute the log-ML of data points X given two exponentials with rates la and lb where la < lb
    ml = 0.0
    for x in X:
        ml += log(max(pdf(x, la), pdf(x, lb)))
    return ml

def f(x):  # objective function to minimize
    assert DATA is not None, "DATA cannot be None"
    la, lb = x
    if la > lb:  # force la <= lb
        return float('inf')
    elif la <= 0 or lb <= 0:
        return float('inf')  # force la and lb > 0
    return -1 * logML(DATA, la, lb)

if __name__ == "__main__":
    DATA = [float(x) for x in stdin.read().split()]  # read input data
    Xbar = sum(DATA) / len(DATA)  # compute mean
    x0 = [1 / Xbar, 1 / Xbar]  # start with la = lb = 1/mean
    result = fmin(f, x0, disp=False)
    print("ML Rates: la = %f and lb = %f" % tuple(result))
This unfortunately didn't work very well. For some selections of the parameters, it's within an order of magnitude, but for others, it's absurdly off. Given my problem (with its constraints) and my goal of estimating the larger parameter of the two exponentials (without caring about the smaller parameter nor the number of points that came from either), any ideas?
I posted the question in more general statistical terms on the stats Stack Exchange, and it got an answer:
https://stats.stackexchange.com/questions/291642/how-to-estimate-parameters-of-mixture-of-2-exponential-random-variables-ideally
Also, I tried the following, which worked decently well:
First, for every integer percentile (1st percentile, 2nd percentile, ..., 99th percentile), I compute an estimate of b using the closed-form quantile equation for an exponential distribution (the p-th quantile = −ln(1 − p) / λ, so λ = −ln(1 − p) / (p-th quantile), where the p-th quantile is the (p*100)-th percentile). The result is a list in which the i-th element is the b estimate obtained from the (i+1)-th percentile.
Then, I perform peak-calling on this list using the Python implementation of the MATLAB peak-calling function. Then, I take the list of resulting peaks and return the minimum. It seems to work fairly well.
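A sketch of that percentile-based procedure, substituting scipy.signal.find_peaks for the MATLAB-style peak caller (the fallback when no peak is found is my own assumption):

import numpy as np
from scipy.signal import find_peaks

def estimate_b(data):
    ps = np.arange(1, 100)  # 1st..99th percentiles
    quantiles = np.percentile(data, ps)
    lam = -np.log(1 - ps / 100.0) / quantiles  # lambda = -ln(1 - p) / (p-th quantile)
    peaks, _ = find_peaks(lam)
    return lam[peaks].min() if len(peaks) else lam[-1]  # fall back to the last estimate if no peaks

# quick check on simulated data (a = 1, b = 10, half of the n = 1000 points from each)
rng = np.random.default_rng(0)
data = np.concatenate([rng.exponential(1.0, 500), rng.exponential(0.1, 500)])
print(estimate_b(data))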
I will implement the EM solution in the Stack Exchange post as well and see which works better.
EDIT: I implemented the EM solution, and it seems to work decently well in my simulations (n = 1000, various a and b).

On ordinary differential equations (ODE) and optimization, in Python

I want to solve this kind of problem:
dy/dt = 0.01*y*(1-y), find t when y = 0.8 (0<t<3000)
I've tried the ode function in Python, but it can only calculate y when t is given.
So are there any simple ways to solve this problem in Python?
PS: This function is just a simple example. My real problem is so complex that it can't be solved analytically. So I want to know how to solve it numerically. And I think this problem is more like an optimization problem:
Objective function y(t) = 0.8, Subject to dy/dt = 0.01*y*(1-y), and 0<t<3000
PPS: My real problem is:
objective function: F(t) = 0.85,
subject to: F(t) = sqrt(x(t)^2+y(t)^2+z(t)^2),
x''(t) = (1/F(t)-1)*250*x(t),
y''(t) = (1/F(t)-1)*250*y(t),
z''(t) = (1/F(t)-1)*250*z(t)-10,
x(0) = 0, y(0) = 0, z(0) = 0.7,
x'(0) = 0.1, y'(0) = 1.5, z'(0) = 0,
0<t<5
This differential equation can be solved analytically quite easily:
dy/dt = 0.01 * y * (1-y)
rearrange to gather y and t terms on opposite sides
0.01 dt = 1/(y * (1-y)) dy
The lhs integrates trivially to 0.01 * t; the rhs is slightly more complicated. We can always write a product of two quotients as a sum of the two quotients * some constants:
1/(y * (1-y)) = A/y + B/(1-y)
The values for A and B can be worked out by putting the rhs on the same denominator and comparing constant and first order y terms on both sides. In this case it is simple, A=B=1. Thus we have to integrate
1/y + 1/(1-y) dy
The first term integrates to ln(y), the second term can be integrated with a change of variables u = 1-y to -ln(1-y). Our integrated equation therefore looks like:
0.01 * t + C = ln(y) - ln(1-y)
not forgetting the constant of integration (it is convenient to write it on the lhs here). We can combine the two logarithm terms:
0.01 * t + C = ln( y / (1-y) )
In order to solve for t at an exact value of y, we first need to work out the value of C. We do this using the initial conditions. It is clear that if y starts at exactly 1 (or 0), dy/dt = 0 and the value of y never changes, so assume it does not. Plug in the values for y and t at the beginning:
0.01 * 0 + C = ln( y(0) / (1 - y(0)) )
This will give a value for C (assuming y(0) is not 0 or 1); then use y = 0.8 to get a value for t. Note that, because of the logarithm, y will reach 0.8 well within the given range of t unless the initial value of y is incredibly small. It is of course also straightforward to rearrange the equation above to express y in terms of t, so you can plot the function as well.
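A short worked example of that closed form (the initial value y(0) = 0.5 is an assumption, since the question does not give one):

from math import log

y0, y_target = 0.5, 0.8
C = log(y0 / (1 - y0))  # from 0.01 * 0 + C = ln(y0 / (1 - y0))
t = (log(y_target / (1 - y_target)) - C) / 0.01
print(t)  # roughly 138.6 for y0 = 0.5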
Edit: Numerical integration
For a more complex ODE which cannot be solved analytically, you will have to try numerically. Initially we only know the value of the function at zero time, y(0) (we have to know at least that in order to uniquely define the trajectory of the function), and how to evaluate the gradient. The idea of numerical integration is that we can use our knowledge of the gradient (which tells us how the function is changing) to work out what the value of the function will be in the vicinity of our starting point. The simplest way to do this is Euler integration:
y(dt) = y(0) + dy/dt * dt
Euler integration assumes that the gradient is constant between t=0 and t=dt. Once y(dt) is known, the gradient can be calculated there also and in turn used to calculate y(2 * dt) and so on, gradually building up the complete trajectory of the function. If you are looking for a particular target value, just wait until the trajectory goes past that value, then interpolate between the last two positions to get the precise t.
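A minimal sketch of that Euler scheme with crossing interpolation (the step size and the initial value 0.5 are assumptions; shrink dt for better accuracy):

def euler_first_crossing(f, y0, target, dt=0.01, t_max=3000.0):
    t, y = 0.0, y0
    while t < t_max:
        y_next = y + f(t, y) * dt  # y(t + dt) ~ y(t) + dy/dt * dt
        if (y - target) * (y_next - target) <= 0 and y_next != y:
            # interpolate linearly between the last two points
            return t + dt * (target - y) / (y_next - y)
        t, y = t + dt, y_next
    return None  # target never reached within t_max

print(euler_first_crossing(lambda t, y: 0.01 * y * (1 - y), 0.5, 0.8))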
The problem with Euler integration (and with all other numerical integration methods) is that its results are only accurate when its assumptions are valid. Because the gradient is not constant between pairs of time points, a certain amount of error will arise for each integration step, which over time will build up until the answer is completely inaccurate. In order to improve the quality of the integration, it is necessary to use more sophisticated approximations to the gradient. Check out for example the Runge-Kutta methods, which are a family of integrators that remove successively higher-order error terms at the cost of increased computation time. If your function is differentiable, knowing the second or even third derivatives can also be used to reduce the integration error.
Fortunately of course, somebody else has done the hard work here, and you don't have to worry too much about solving problems like numerical stability or have an in depth understanding of all the details (although understanding roughly what is going on helps a lot). Check out http://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.ode.html#scipy.integrate.ode for an example of an integrator class which you should be able to use straightaway. For instance
from scipy.integrate import ode

def deriv(t, y):
    return 0.01 * y * (1 - y)

my_integrator = ode(deriv)
my_integrator.set_initial_value(0.5)

t = 0.1  # start with a small value of time
while t < 3000:
    y = my_integrator.integrate(t)[0]
    if y > 0.8:
        print("y(%f) = %f" % (t, y))
        break
    t += 0.1
This code will print out the first t value when y passes 0.8 (or nothing if it never reaches 0.8). If you want a more accurate value of t, keep the y of the previous t as well and interpolate between them.
As an addition to Krastanov's answer:
Aside from PyDSTool, there are other packages, like PySUNDIALS and Assimulo, which provide bindings to the solver IDA from SUNDIALS. This solver has root-finding capabilities.
Use scipy.integrate.odeint to handle your integration, and analyse the results afterward.
import numpy as np
from scipy.integrate import odeint

ts = np.arange(0, 3000, 1)  # time series - start, stop, step

def rhs(y, t):
    return 0.01 * y * (1 - y)

y0 = np.array([0.5])  # initial value (must be strictly between 0 and 1, or y never moves)
ys = odeint(rhs, y0, ts)
Then analyse the numpy array ys to find your answer (the length of ts matches that of ys). (This may not work first time because I am constructing it from memory.)
This might involve using the scipy interpolate function for the ys array, such that you get a result at time t.
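Continuing from the odeint example above, one possible way to do that last step (a sketch; np.interp and the simple first-crossing search are my choice, not part of the answer):

import numpy as np

y_flat = ys.ravel()
idx = int(np.argmax(y_flat >= 0.8))  # index of the first sample at or above 0.8
if idx > 0 and y_flat[idx] >= 0.8:
    # interpolate linearly between the two bracketing time steps
    t_cross = np.interp(0.8, y_flat[idx - 1:idx + 1], ts[idx - 1:idx + 1])
    print(t_cross)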
EDIT: I see that you wish to solve a spring in 3D. This should be fine with the above method; the odeint documentation on the SciPy website has examples of systems such as coupled springs, and these could be extended.
What you are asking for is an ODE integrator with root-finding capabilities. They exist, and the low-level code for such integrators is supplied with scipy, but they have not yet been wrapped in Python bindings.
For more information see this mailing list post that provides a few alternatives: http://mail.scipy.org/pipermail/scipy-user/2010-March/024890.html
You can use the following example implementation which uses backtracking (hence it is not optimal as it is a bolt-on addition to an integrator that does not have root finding on its own): https://github.com/scipy/scipy/pull/4904/files
