I was wondering if anybody knows of a reliable and fast analytic HDI calculation, preferably for beta functions.
The definition of the HDI is in this question called "Highest Posterior Density Region".
I am looking for a function that has the following I/O:
Input
credMass - credible interval mass (e.g, 0.95 for 95% credible interval)
a - shape parameter (e.g, number of HEADS coin tosses)
b - shape parameter (e.g, number of TAILS coin tosses)
output
ci_min - minimum bound of mass (value between 0. to ci_max)
ci_max - maximum bound of mass (value between ci_min to 1.)
One way to go about it is to speed up this script that I found
in the same said question (adapted from R to Python) and taken from the book Doing bayesian data analysis by John K. Kruschke). I have used this solution, which is quite reliable, but it's a bit too slow for multiple calls. A speedup by 100X or even 10X would be very helpful!
from scipy.optimize import fmin
from scipy.stats import *
def HDIofICDF(dist_name, credMass=0.95, **args):
# freeze distribution with given arguments
distri = dist_name(**args)
# initial guess for HDIlowTailPr
incredMass = 1.0 - credMass
def intervalWidth(lowTailPr):
return distri.ppf(credMass + lowTailPr) - distri.ppf(lowTailPr)
# find lowTailPr that minimizes intervalWidth
HDIlowTailPr = fmin(intervalWidth, incredMass, ftol=1e-8, disp=False)[0]
# return interval as array([low, high])
return distri.ppf([HDIlowTailPr, credMass + HDIlowTailPr])
usage
print HDIofICDF(beta, credMass=0.95, a=5, b=4)
Word of caution! Some solutions confuse this HDI with the Equal Tail Interval solution (which the said question calls "Central Credible Region") which is far easier to compute, but does not answer the same question. (E.g, see Kruschke's Why HDI and not equal-tailed interval?)
Also this question does not concern the MCMC approaches that I have seen in PyMC3 (pymc3.stats.hpd(a), where a is a random variable sample), but rather regards an analytical solution.
Thank you!
Not analytical, but just in case you still find this useful. Instead of evaluating the ppf (that can be very slow for many distributions) you can find an approximate solution by evaluating the pdf, something like this.
pdf = dist.pdf(x_vals)
pdf = pdf / pdf.sum()
idx = np.argsort(pdf)[::-1]
mass_cum = 0
indices = []
for i in idx:
mass_cum += pdf[i]
indices.append(i)
if mass_cum >= mass:
break
return x_vals[np.sort(indices)[[0, -1]]]
Where mass is a number between 0 and 1. And x_vals is a vector of equally spaced values in the support of the distribution. So for the beta x_vals = np.linspace(0, 1, size) should do the work. You can control the tradeoff of accuracy and speed by setting a proper value of size for your problem. How much faster this solution is compared with the other one (using the ppf and optimization) will depend on the details of the SciPy's ppf and pdf method (that will be different for different distributions). And could even depend on the parameters of the distribution.
Related
I am new to using the PyMC3 package and am just trying to implement an example from a course on measurement uncertainty that I’m taking. (Note this is an optional employee education course through work, not a graded class where I shouldn’t find answers online). The course uses R but I find python to be preferable.
The (simple) problem is posed as following:
Say you have an end-gauge of actual (unknown) length at room-temperature length, and measured length m. The relationship between the two is:
length = m / (1 + alpha*dT)
where alpha is an expansion coefficient and dT is the deviation from room temperature and m is the measured quantity. The goal is to find the posterior distribution on length in order to determine its expected value and standard deviation (i.e. the measurement uncertainty)
The problem specifies prior distributions on alpha and dT (Gaussians with small standard deviation) and a loose prior on length (Gaussian with large standard deviation). The problem specifies that m was measured 25 times with an average of 50.000215 and standard deviation of 5.8e-6. We assume that the measurements of m are normally distributed with a mean of the true value of m.
One issue I had is that the likelihood doesn’t seem like it can be specified just based on these statistics in PyMC3, so I generated some dummy measurement data (I ended up doing 1000 measurements instead of 25). Again, the question is to get a posterior distribution on length (and in the process, although of less interest, updated posteriors on alpha and dT).
Here’s my code, which is not working and having convergence issues:
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import pymc3 as pm
import theano.tensor as tt
basic_model = pm.Model()
xdata = np.random.normal(50.000215,5.8e-6*np.sqrt(1000),1000)
with basic_model:
#prior distributions
theta = pm.Normal('theta',mu=-.1,sd=.04)
alpha = pm.Normal('alpha',mu=.0000115,sd=.0000012)
length = pm.Normal('length',mu=50,sd=1)
mumeas = length*(1+alpha*theta)
with basic_model:
obs = pm.Normal('obs',mu=mumeas,sd=5.8e-6,observed=xdata)
#yobs = Normal('yobs',)
start = pm.find_MAP()
#trace = pm.sample(2000, step=pm.Metropolis, start=start)
step = pm.Metropolis()
trace = pm.sample(10000, tune=200000,step=step,start=start,njobs=1)
length_samples = trace['length']
fig,ax=plt.subplots()
plt.hist(length_samples, histtype='stepfilled', bins=30, alpha=0.85,
label="posterior of $\lambda_1$", color="#A60628", normed=True)
I would really appreciate any help as to why this isn’t working. I've been trying for a while and it never converges to the expected solution given from the R code. I tried the default sampler (NUTS I think) as well as Metropolis but that completely failed with a zero gradient error. The (relevant) course slides are attached as an image. Finally, here is the comparable R code:
library(rjags)
#Data
jags_data <- list(xbar=50.000215)
jags_code <- jags.model(file = "calibration.txt",
data = jags_data,
n.chains = 1,
n.adapt = 30000)
post_samples <- coda.samples(model = jags_code,
variable.names =
c("l","mu","alpha","theta"),#,"ypred"),
n.iter = 30000)
summary(post_samples)
mean(post_samples[[1]][,"l"])
sd(post_samples[[1]][,"l"])
plot(post_samples)
and the calibration.txt model:
model{
l~dnorm(50,1.0)
alpha~dnorm(0.0000115,694444444444)
theta~dnorm(-0.1,625)
mu<-l*(1+alpha*theta)
xbar~dnorm(mu,29726516052)
}
(note I think the dnorm distribution takes 1/sigma^2, hence the weird-looking variances)
Any help or insight as to why the PyMC3 sampling isn't converging and what I should do differently would be extremely appreciated. Thanks!
I also had trouble getting anything useful from the generated data and model in the code. It seems to me that the level of noise in the fake data could equally be explained by the different sources of variance in the model. That can lead to a situation of highly correlated posterior parameters. Add to that the extreme scale imbalances, then it makes sense this would have sampling issues.
However, looking at the JAGS model, it seems they really are using just that one input observation. I've never seen this technique(?) before, that is, inputting summary statistics of data instead of the raw data itself. I suppose it worked for them in JAGS, so I decided to try running the exact same MCMC, including using the precision (tau) parameterization of the Gaussian.
Original Model with Metropolis
with pm.Model() as m0:
# tau === precision parameterization
dT = pm.Normal('dT', mu=-0.1, tau=625)
alpha = pm.Normal('alpha', mu=0.0000115, tau=694444444444)
length = pm.Normal('length', mu=50.0, tau=1.0)
mu = pm.Deterministic('mu', length*(1+alpha*dT))
# only one input observation; tau indicates the 5.8 nm sd
obs = pm.Normal('obs', mu=mu, tau=29726516052, observed=[50.000215])
trace = pm.sample(30000, tune=30000, chains=4, cores=4, step=pm.Metropolis())
While it's still not that great at sampling length and dT, it at least appears convergent overall:
I think noteworthy here is that despite the relatively weak prior on length (sd=1), the strong priors on all the other parameters appear to propagate a tight uncertainty bound on the length posterior. Ultimately, this is the posterior of interest, so this seems to be consistent with the intent of the exercise. Also, see that mu comes out in the posterior as exactly the distribution described, namely, N(50.000215, 5.8e-6).
Trace Plots
Forest Plot
Pair Plot
Here, however, you can see the core problem is still there. There's both strong correlation between length and dT, plus 4 or 5 orders of magnitude scale difference between the standard errors. I'd definitely do a long run before I really trusted the result.
Alternative Model with NUTS
In order to get this running with NUTS, you'd have to address the scaling issue. That is, somehow we need to reparameterize to get all the tau values closer to 1. Then, you'd run the sampler and transform back into the units you're interested in. Unfortunately, I don't have time to play around with this right now (I'd have to figure it out too), but maybe it's something you can start exploring on your own.
I would like to implement a function in python (using numpy) that takes a mathematical function (for ex. p(x) = e^(-x) like below) as input and generates random numbers, that are distributed according to that mathematical-function's probability distribution. And I need to plot them, so we can see the distribution.
I need actually exactly a random number generator function for exactly the following 2 mathematical functions as input, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?
For simple distributions like the ones you need, or if you have an easy to invert in closed form CDF, you can find plenty of samplers in NumPy as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov-Chain Montecarlo sampling methods.
The simplest and maybe easier to understand variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
start from a random point x and take a random step xnew = x + delta
evaluate the desired probability distribution in the starting point p(x) and in the new one p(xnew)
if the new point is more probable p(xnew)/p(x) >= 1 accept the move
if the new point is less probable randomly decide whether to accept or reject depending on how probable1 the new point is
new step from this point and repeat the cycle
It can be shown, see e.g. Sokal2, that points sampled with this method follow the acceptance probability distribution.
An extensive implementation of Montecarlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
def uniform_proposal(x, delta=2.0):
return np.random.uniform(x - delta, x + delta)
def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
x = 1 # start somewhere
for i in range(nsamples):
trial = proposal(x) # random neighbour from the proposal distribution
acceptance = p(trial)/p(x)
# accept the move conditionally
if np.random.uniform() < acceptance:
x = trial
yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)
p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))
p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain where to sample your random steps3
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
Conclusions
There is still way too much to say, about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
The idea here is to favor exploration where the probability is higher but still look at low probability regions as they might lead to other peaks. Fundamental is the choice of the proposal distribution, i.e. how you pick new points to explore. Too small steps might constrain you to a limited area of your distribution, too big could lead to a very inefficient exploration.
Physics oriented. Bayesian formalism (Metropolis-Hastings) is preferred these days but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke university.
Implementation not shown not to add too much confusion, but it's straightforward you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both case, the arguments are optional as these are the default values for these distributions.
As a sidenote, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and input values from a uniform distribution.
inverse_cdf(np.random.uniform())
By example if NumPy did not provide the exponential distribution, you could do this.
def exponential():
return -np.log(-np.random.uniform())
If you encounter distributions which CDF is not easy to compute, then consider filippo's great answer.
I was trying to implement a half-logistic distribution and came across halflogistic and genhalflogistic.
halflogistic: "A half-logistic continuous random variable."
genhalflogistic: "A generalized half-logistic continuous random variable."
This "generalized" version comes up for some of SciPy's other continuous random variables as well, such as gennorm.
My question is: what does "generalized" mean and how is it different from the non-generalized version?
"Generalized" means having one or more additional parameters which somehow affect the shape of the distribution. To find what they are, compare the probability density functions. Let's start with normal:
norm.pdf(x) = exp(-x**2/2)/sqrt(2*pi)
versus
beta
gennorm.pdf(x, beta) = --------------- exp(-|x|**beta)
2 gamma(1/beta)
Here, beta is the additional parameter. If beta = 2, you get the normal distribution (scaled a bit differently compared to norm). With 0 < beta < 2 you get other stable distributions.
It's a bit more confusing with half-logistic, though, because the formulas do not look alike:
halflogistic.pdf(x) = 2 * exp(-x) / (1+exp(-x))**2
versus
genhalflogistic.pdf(x, c) = 2 * (1-c*x)**(1/c-1) / (1+(1-c*x)**(1/c))**2
But taking the limit as c→0 in the latter formula gives the former. So, c is the shape parameter here. The support of generalized half-logistic is the interval [0, 1/c]. The limit form c→0 has infinite support [0, ∞).
I'm trying to obtain the function expected_W or H that is the result of an integration:
where:
theta is a vector with two elements: theta_0 and theta_1
f(beta | theta) is a normal density for beta with mean theta_0 and variance theta_1
q(epsilon) is a normal density for epsilon with mean zero and variance sigma_epsilon (set to 1 by default).
w(p, theta, eps, beta) is a function I take as input, so I cannot predict exactly how it looks. It will likely be non-linear, but not particularly nasty.
This is the way I implement the problem. I'm sure the wrapper functions I make are a mess, so I'd be happy to receive any help on that too.
from __future__ import division
from scipy import integrate
from scipy.stats import norm
import math
import numpy as np
def exp_w(w_B, sigma_eps = 1, **kwargs):
'''
Integrates the w_B function
Input:
+ w_B : the function to be integrated.
+ sigma_eps : variance of the epsilon term. Set to 1 by default
'''
#The integrand function gives everything under the integral:
# w(B(p, \theta, \epsilon, \beta)) f(\beta | \theta ) q(\epsilon)
def integrand(eps, beta, p, theta_0, theta_1, sigma_eps=sigma_eps):
q_e = norm.pdf(eps, loc=0, scale=math.sqrt(sigma_eps))
f_beta = norm.pdf(beta, loc=theta_0, scale=math.sqrt(theta_1))
return w_B(p = p,
theta_0 = theta_0, theta_1 = theta_1,
eps = eps, beta=beta)* q_e *f_beta
#limits of integration. Using limited support for now.
eps_inf = lambda beta : -10 # otherwise: -np.inf
eps_sup = lambda beta : 10 # otherwise: np.inf
beta_inf = -10
beta_sup = 10
def integrated_f(p, theta_0, theta_1):
return integrate.dblquad(integrand, beta_inf, beta_sup,
eps_inf, eps_sup,
args = (p, theta_0, theta_1))
# this integrated_f is the H referenced at the top of the question
return integrated_f
I tested this function with a simple w function for which I know the analytic solution (this won't usually be the case).
def test_exp_w():
def w_B(p, theta_0, theta_1, eps, beta):
return 3*(p*eps + p*(theta_0 + theta_1) - beta)
# Function that I get
integrated = exp_w(w_B, sigma_eps = 1)
# Function that I should get
def exp_result(p, theta_0, theta_1):
return 3*p*(theta_0 + theta_1) - 3*theta_0
args = np.random.rand(3)
d_args = {'p' : args[0], 'theta_0' : args[1], 'theta_1' : args[2]}
if not (np.allclose(
integrated(**d_args)[0], exp_result(**d_args)) ):
raise Exception("Integration procedure isn't working!")
Hence, my implementation seems to be working, but it's very slow for my purpose. I need to repeat this process with tens or hundreds of thousands of times (this is a step in a Value function iteration. I can give more info if people think it's relevant).
With scipy version 0.14.0 and numpy version 1.8.1, this integral takes 15 seconds to compute.
Does anybody have any suggestion on how to go about this?
To start with, tt probably would help to get bounded domains of integration, but I haven't figure out how to do that or if the gaussian quadrature in SciPy takes care of it in a good way (does it use Gauss-Hermite?).
Thanks for your time.
---- Edit: adding profiling times -----
%lprun results gives that most of the time is spent in
_distn_infraestructure.py:1529(pdf) and
_continuous_distns.py:97(_norm_pdf)
each with a whopping 83244 number calls.
The time taken to integrate your function sounds very long if the function is not a nasty one.
First thing I suggest you do is to profile where the time is spent. Is it spent in dblquad or elsewhere? How many calls are made to w_B during the integration? If the time is spent in dblquad and the number of calls is very high, could you use looser tolerances in the integration?
It seems that the multiplication by the gaussians actually enables you to limit the integration limits a great deal, as most of the energy of the gaussian is within a very small area. You might want to try and calculate reasonable tighter bounds. You have already limited the area into -10..10; is there any significant performance change between -100..100, -10..10, and -1..1?
If you know your functions are relatively smooth, then there is a Mickey-Mouse version of the integration:
determine reasonable upper and lower limits in both axes (by the gaussians)
calculate a reasonable grid density (e.g. 100 points in each direction)
calculate the w_B for each of these points (and this will be much faster, if it is possible to require a vectorized version of w_B)
sum it all together
This is very low-tech but also very fast. Whether or not it gives you results which are good enough for the outer iteration is an interesting question. It just might.
I have a set of measured radii (t+epsilon+error) at an equally spaced angles.
The model is circle of radius (R) with center at (r, Alpha) with added small noise and some random error values which are much bigger than noise.
The problem is to find the center of the circle model (r,Alpha) and the radius of the circle (R). But it should not be too much sensitive to random error (in below data points at 7 and 14).
Some radii could be missing therefore the simple mean would not work here.
I tried least square optimization but it significantly reacts on error.
Is there a way to optimize least deltas but not the least squares of delta in Python?
Model:
n=36
R=100
r=10
Alpha=2*Pi/6
Data points:
[95.85, 92.66, 94.14, 90.56, 88.08, 87.63, 88.12, 152.92, 90.75, 90.73, 93.93, 92.66, 92.67, 97.24, 65.40, 97.67, 103.66, 104.43, 105.25, 106.17, 105.01, 108.52, 109.33, 108.17, 107.10, 106.93, 111.25, 109.99, 107.23, 107.18, 108.30, 101.81, 99.47, 97.97, 96.05, 95.29]
It seems like your main problem here is going to be removing outliers. There are a couple of ways to do this, but for your application, your best bet is to probably just to remove items based on their distance from the median (Since the median is much less sensitive to outliers than the mean.)
If you're using numpy that would looks like this:
def remove_outliers(data_points, margin=1.5):
nd = np.abs(data_points - np.median(data_points))
s = nd/np.median(nd)
return data_points[s<margin]
After which you should run least squares.
If you're not using numpy you can do something similar with native python lists:
def median(points):
return sorted(points)[len(points)/2] # evaluates to an int in python2
def remove_outliers(data_points, margin=1.5):
m = median(data_points)
centered_points = [abs(point - m) for point in data_points]
centered_median = median(centered_points)
ratios = [datum/centered_median for datum in centered_points]
return [point for i, point in enumerate(data_points) if ratios[i]>margin]
If you're looking to just not count outliers as highly you can just calculate the mean of your dataset, which is just a linear equivalent of the least-squares optimization.
If you're looking for something a little better I might suggest throwing your data through some kind of low pass filter, but I don't think that's really needed here.
A low-pass filter would probably be the best, which you can do as follows: (Note, alpha is a number you will have to fiddle with to get your desired output.)
def low_pass(data, alpha):
new_data = [data[0]]
for i in range(1, len(data)):
new_data.append(alpha * data[i] + (1 - alpha) * new_data[i-1])
return new_data
At which point your least squares optimization should work fine.
Replying to your final question
Is there a way to optimize least deltas but not the least squares of delta in Python?
Yes, pick an optimization method (for example downhill simplex implemented in scipy.optimize.fmin) and use the sum of absolute deviations as a merit function. Your dataset is small, I suppose that any general purpose optimization method will converge quickly. (In case of non-linear least squares fitting it is also possible to use general purpose optimization algorithm, but it's more common to use the Levenberg-Marquardt algorithm which minimizes sums of squares.)
If you are interested when minimizing absolute deviations instead of squares has theoretical justification see Numerical Recipes, chapter Robust Estimation.
From practical side, the sum of absolute deviations may not have unique minimum.
In the trivial case of two points, say, (0,5) and (1,9) and constant function y=a, any value of a between 5 and 9 gives the same sum (4). There is no such problem when deviations are squared.
If minimizing absolute deviations would not work, you may consider heuristic procedure to identify and remove outliers. Such as RANSAC or ROUT.