I want to supply a negative exponent for the scipy.stats.powerlaw routine, e.g. a=-1.5, in order to draw random samples:
"""
powerlaw.pdf(x, a) = a * x**(a-1)
"""
from scipy.stats import powerlaw
R = powerlaw.rvs(a, size=100)
Why is a > 0 required? How can I supply a negative a in order to generate the random samples, and how can I supply a normalization coefficient/transform, i.e.
PDF(x,C,a) = C * x**a
The documentation is here
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.powerlaw.html
Thanks!
EDIT: I should add that I'm trying to replicate IDL's RANDOMP function:
http://idlastro.gsfc.nasa.gov/ftp/pro/math/randomp.pro
A PDF, integrated over its domain, must equal one. In other words, the area under a probability density function's curve must equal one.
In [36]: import scipy.integrate as integrate
In [40]: y, err = integrate.quad(lambda x: 0.5*x**(-0.5), 0, 1)
In [41]: y
Out[41]: 0.9999999999999998 # The integral is close to 1
The powerlaw density function is defined on the domain 0 <= x <= 1. On this domain, the integral of x**b is finite for any b > -1. For b <= -1, x**b blows up too rapidly near x = 0, so it is not a valid probability density function.
In [38]: integrate.quad(lambda x: x**(-1), 0, 1)
UserWarning: The maximum number of subdivisions (50) has been achieved...
# The integral blows up
Thus for x**(a-1), a must satisfy a-1 > -1 or equivalently, a > 0.
The leading a in a * x**(a-1) is the normalizing constant that makes the integral of a * x**(a-1) over the domain [0, 1] equal to 1, so you don't get to choose this constant independently.
Now if you change the domain to be a measurable distance away from 0, then yes, you could define a PDF of the form C * x**a for negative a. But you'd have to state what domain you want, and I don't think there is (yet) a PDF available in scipy.stats for this.
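If all that is needed is something like IDL's RANDOMP, i.e. a density proportional to x**a on a finite range bounded away from zero, a minimal inverse-transform sketch could look like the following. The function name, arguments, and chosen range are my own illustration, not anything provided by scipy.stats:
import numpy as np

def sample_power_law(a, lo, hi, size, rng=None):
    """Draw samples with density proportional to x**a on [lo, hi], for a != -1 and lo > 0."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=size)
    ap1 = a + 1.0
    # invert the CDF F(x) = (x**ap1 - lo**ap1) / (hi**ap1 - lo**ap1)
    return (lo**ap1 + u * (hi**ap1 - lo**ap1)) ** (1.0 / ap1)

samples = sample_power_law(-1.5, 1.0, 100.0, size=100)  # e.g. a = -1.5 on [1, 100]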
The Python package powerlaw can do this. Consider, for a > 1, a power-law distribution with probability density function
f(x) = c * x^(-a)
for x > x_min and f(x) = 0 otherwise. Here c is a normalization factor and is determined as
c = (a-1) * x_min^(a-1).
In the example below, a = 1.5 and x_min = 1.0; comparing the probability density function estimated from the random sample with the PDF from the expression above gives the expected result.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as pl
import numpy as np
import powerlaw
a, xmin = 1.5, 1.0
N = 10000
# generates random variates of power law distribution
vrs = powerlaw.Power_Law(xmin=xmin, parameters=[a]).generate_random(N)
# plotting the PDF estimated from variates
bin_min, bin_max = np.min(vrs), np.max(vrs)
bins = 10**(np.linspace(np.log10(bin_min), np.log10(bin_max), 100))
counts, edges = np.histogram(vrs, bins, density=True)
centers = (edges[1:] + edges[:-1])/2.
# plotting the expected PDF
xs = np.linspace(bin_min, bin_max, 100000)
pl.plot(xs, [(a-1)*xmin**(a-1)*x**(-a) for x in xs], color='red')
pl.plot(centers, counts, '.')
pl.xscale('log')
pl.yscale('log')
pl.savefig('powerlaw_variates.png')
This produces a log-log plot comparing the estimated and expected PDFs (image not shown).
If r is a uniform random deviate U(0,1), then x in the following expression is a power-law distributed random deviate:
x = xmin * (1-r) ** (-1/(alpha-1))
where xmin is the smallest (positive) value above which the power-law distribution holds, and alpha is the exponent of the distribution.
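For example, a direct NumPy translation of that formula could look like the following; the parameter values are arbitrary, just for illustration:
import numpy as np

xmin, alpha, n = 1.0, 2.5, 10000           # illustrative values
r = np.random.uniform(size=n)              # r ~ U(0, 1)
x = xmin * (1 - r) ** (-1 / (alpha - 1))   # power-law distributed deviates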
If you want to generate a power-law distribution, you can use a uniform random deviate: generate a random number on [0,1] and apply the inverse transform method (Wolfram). In this case, the probability density function is:
p(k) = k^(-gamma)
and y is a uniform random variable between 0 and 1:
y ~ U(0,1)
import numpy as np
def power_law(k_min, k_max, y, gamma):
    return ((k_max**(-gamma+1) - k_min**(-gamma+1))*y + k_min**(-gamma+1.0))**(1.0/(-gamma + 1.0))
Now to generate a distribution, you just have to create an array
nodes = 1000
scale_free_distribution = np.zeros(nodes, float)
k_min = 1.0
k_max = 100*k_min
gamma = 3.0
for n in range(nodes):
    scale_free_distribution[n] = power_law(k_min, k_max, np.random.uniform(0,1), gamma)
This will generate a power-law distribution with gamma = 3.0. If you want to fix the average of the distribution, you have to study complex networks, because k_min depends on k_max and the average connectivity.
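As a side note (my addition, not part of the original answer), the loop is not strictly necessary: power_law works on arrays, so the whole sample can be drawn in one vectorized call using the names defined above:
scale_free_distribution = power_law(k_min, k_max, np.random.uniform(0, 1, nodes), gamma)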
My answer is almost the same as Virgil's above, with the crucial difference that alpha is actually the negative exponent of the power-law distribution.
So, if r is a uniform random deviate U(0,1), then x in the following expression is a power-law distributed random deviate:
x = xmin * (1-r) ** (-1/(alpha-1))
where xmin is the smallest (positive) value above which the power-law distribution holds, and alpha is the negative exponent of the distribution, that is, P(x) = [constant] * x**(-alpha).
I can't seem to wrap my head around why the theoretical and simulated results are so different for the probability distribution of a normal random variable squared (e.g. the power of a Gaussian noise voltage signal).
I suspect I'm doing something wrong and wanted to ask, if anyone could help with this.
Here is the code explaining what I'm trying to do:
import numpy as np
from scipy.integrate import quad, simps
from matplotlib import pyplot as plt
def PDF(x, sigma=1, mu=0):  # Gaussian normal distribution PDF
    return 1/(np.sqrt(2*np.pi)*sigma)*np.exp(-1/(2*sigma**2)*(x-mu)**2)

def PDFu(u, u_rms=1, u_mean=0):
    return PDF(u, sigma=u_rms, mu=u_mean)

def PDFP(P):
    return 2*PDFu(np.sqrt(P))  # substitute the input variable with the 'scaled' one

def probDensity(x, nbins):  # calculate the probability density based on the input samples
    distr, bins = np.histogram(x, nbins)  # similar to plt.hist(density=True)
    binWidth = bins[1]-bins[0]
    binCenters = bins[:-1]+binWidth/2
    return distr/len(x)/binWidth, binCenters
npoints = 100000
rms = 1
u = np.random.normal(0, rms, npoints) # samples with Gaussian normal distribution
P = u**2 # square of the samples with Gaussian normal distribution - should follow chi-squared distribution?
nbins = 500
u_distr, u_bins = probDensity(u, nbins) # calculate PDF based on the samples
print('U_distr integral = ', simps(u_distr,u_bins)) # integrate the calculated PDF, should be 1
plt.plot(u_bins, u_distr)
us = np.linspace(-10, 10, 500)
PDFu_u = PDFu(us) # calculate the theoretical PDF
print('PDFu_u integral = ', quad(PDFu, -np.inf, np.inf)) # integral of the theoretical PDF, should be 1
plt.plot(us, PDFu_u)
nbins = 1000
P_distr, P_bins = probDensity(P, nbins) # calculate PDF based on the samples
print('P_distr integral = ', simps(P_distr, P_bins)) # integrate the calculated PDF, should be 1
plt.plot(P_bins, P_distr)
Ps = np.linspace(0, 8, npoints)
PDFP_P = PDFP(Ps) # calculate the theoretical PDF
plt.plot(Ps, PDFP_P)
print('PDFP_P integral = ', quad(PDFP, 0, np.inf)) # integral of the theoretical PDF, should be 1
plt.show()
The theoretical and the simulated probability distributions of the normal random variable (u) seem to match nicely; I use this as a sanity check. But the difference is substantial in the case of the squared variable, and I can't understand why or how to get them to match. By the way, I tried various plausible scaling factors for the theoretical distribution (e.g. 0.5, 2, sqrt(2)), but it did not work, and I don't see why I would even need them. Shouldn't it work by just substituting 'P' for 'u' according to the formula u=sqrt(P*R) [R=1] and using the normal distribution of 'u' to calculate the PDF value for certain 'P's?
I trust the simulated distribution a little more and I am wondering how the theoretical one should be properly calculated. Why doesn't the substitution method work?
Thank you for the help in advance!
Your theoretical density for the square of a Gaussian is wrong. Here is the calculation. If X is Gaussian, then for the CDF $F$ of the squared variable $Y=X^2$ we have
$$
F(x) = P(Y<x) = P(X^2 <x) = P(-\sqrt{x} < X < \sqrt{x}) = \Phi(\sqrt{x}) - \Phi(-\sqrt{x})
$$
where $\Phi$ is the Gaussian CDF
so for the PDF $f(x)$ of $Y$ we differentiate that and we get
$$
f(x) = F'(x) = (1/(2\sqrt{x})) \Phi'(\sqrt{x}) + (1/(2\sqrt{x})) \Phi'(-\sqrt{x}) = (1/(2\sqrt{x})) \left( \psi(\sqrt{x}) + \psi(-\sqrt{x}) \right)
$$
where $\psi$ is the Gaussian PDF
so at the very least you are missing the term $(1/(2\sqrt{x}))$
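As a quick cross-check (my addition, not part of this answer), the derived density is just the chi-squared density with one degree of freedom, so it can be compared numerically against scipy.stats.chi2:
import numpy as np
from scipy.stats import norm, chi2

x = np.linspace(0.1, 8, 50)
f = (norm.pdf(np.sqrt(x)) + norm.pdf(-np.sqrt(x))) / (2 * np.sqrt(x))
print(np.allclose(f, chi2.pdf(x, df=1)))  # expected: True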
For reference, here is the code with the corrected PDF, based on piterbarg's answer. Thanks again!
import numpy as np
from scipy.integrate import quad, simps
from matplotlib import pyplot as plt
def PDF(x, sigma=1, mu=0):  # Gaussian normal distribution PDF
    return 1/(np.sqrt(2*np.pi)*sigma)*np.exp(-1/(2*sigma**2)*(x-mu)**2)

def PDFu(u, u_rms=1, u_mean=0):
    return PDF(u, sigma=u_rms, mu=u_mean)

def PDFP(P):
    return 1/(2*np.sqrt(P))*2*PDFu(np.sqrt(P))  # change of variables P = u**2, including the Jacobian term 1/(2*sqrt(P))

def probDensity(x, nbins):  # calculate the probability density based on the input samples
    distr, bins = np.histogram(x, nbins)  # similar to plt.hist(density=True)
    binWidth = bins[1]-bins[0]
    binCenters = bins[:-1]+binWidth/2
    return distr/len(x)/binWidth, binCenters
npoints = 100000
rms = 1
u = np.random.normal(0, rms, npoints) # samples with Gaussian normal distribution
P = u**2 # square of the samples with Gaussian normal distribution - should follow chi-squared distribution?
nbins = 500
u_distr, u_bins = probDensity(u, nbins) # calculate PDF based on the samples
print('U_distr integral = ', simps(u_distr,u_bins)) # integrate the calculated PDF, should be 1
plt.plot(u_bins, u_distr)
us = np.linspace(-10, 10, 500)
PDFu_u = PDFu(us) # calculate the theoretical PDF
print('PDFu_u integral = ', quad(PDFu, -np.inf, np.inf)) # integral of the theoretical PDF, should be 1
plt.plot(us, PDFu_u)
nbins = 1000
P_distr, P_bins = probDensity(P, nbins) # calculate PDF based on the samples
print('P_distr integral = ', simps(P_distr, P_bins)) # integrate the calculated PDF, should be 1
plt.plot(P_bins, P_distr)
Ps = np.linspace(0, 8, npoints)
PDFP_P = PDFP(Ps) # calculate the theoretical PDF
plt.plot(Ps, PDFP_P)
print('PDFP_P integral = ', quad(PDFP, 0, np.inf)) # integral of the theoretical PDF, should be 1
plt.show()
The formula below is a special case of the Wasserstein distance/optimal transport when the source and target distributions, x and y (also called marginal distributions), are 1D, that is, are vectors:
W_p(u, v) = ( ∫_0^1 | F_u^{-1}(t) - F_v^{-1}(t) |^p dt )^(1/p)
where the F^{-1} are the inverse probability distribution functions (quantile functions) of the cumulative distributions of the marginals u and v, derived from real data called x and y, both generated from the normal distribution:
import numpy as np
from numpy.random import randn
import scipy.stats as ss
n = 100
x = randn(n)
y = randn(n)
How can the integral in the formula be coded in python and scipy? I'm guessing the x and y have to be converted to ranked marginals, which are non-negative and sum to 1, while Scipy's ppf could be used to calculate the inverse F^{-1}'s?
Note that when n gets large, a sorted set of n samples approaches the inverse CDF sampled at 1/n, 2/n, ..., n/n. E.g.:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
plt.plot(norm.ppf(np.linspace(0, 1, 1000)), label="invcdf")
plt.plot(np.sort(np.random.normal(size=1000)), label="sortsample")
plt.legend()
plt.show()
Also note that your integral from 0 to 1 can be approximated as a sum over 1/n, 2/n, ..., n/n.
Thus we can simply answer your question:
def W(p, u, v):
    assert len(u) == len(v)
    return np.mean(np.abs(np.sort(u) - np.sort(v))**p)**(1/p)
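As a quick sanity check (my addition), for p = 1 and equal-length samples this should agree with SciPy's built-in 1D Wasserstein distance up to floating-point error:
from scipy.stats import wasserstein_distance

u = np.random.normal(size=500)
v = np.random.normal(loc=1.0, size=500)
print(W(1, u, v), wasserstein_distance(u, v))  # the two values should be essentially identical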
Note that if len(u) != len(v) you can still apply the method with linear interpolation:
def W(p, u, v):
    u = np.sort(u)
    v = np.sort(v)
    if len(u) != len(v):
        if len(u) > len(v):
            u, v = v, u
        us = np.linspace(0, 1, len(u))
        vs = np.linspace(0, 1, len(v))
        u = np.interp(vs, us, u)  # resample the shorter sorted sample onto the longer grid
    return np.mean(np.abs(u - v)**p)**(1/p)
An alternative method if you have prior information about the sort of distribution of your data, but not its parameters, is to find the best fitting distribution on your data (e.g. with scipy.stats.norm.fit) for both u and v and then do the integral with the desired precision. E.g.:
from scipy.stats import norm as gauss
def W_gauss(p, u, v, num_steps):
    ud = gauss(*gauss.fit(u))
    vd = gauss(*gauss.fit(v))
    z = np.linspace(0, 1, num_steps, endpoint=False) + 1/(2*num_steps)
    return np.mean(np.abs(ud.ppf(z) - vd.ppf(z))**p)**(1/p)
I guess I am a bit late, but this is what I would do for an exact solution (using only numpy):
import numpy as np
from numpy.random import randn
n = 100
m = 80
p = 2
x = np.sort(randn(n))
y = np.sort(randn(m))
a = np.ones(n)/n
b = np.ones(m)/m
# cdfs
ca = np.cumsum(a)
cb = np.cumsum(b)
# points on which we need to evaluate the quantile functions
cba = np.sort(np.hstack([ca, cb]))
# weights for integral
h = np.diff(np.hstack([0, cba]))
# construction of first quantile function
bins = ca + 1e-10  # small tolerance to avoid rounding errors and enforce right continuity
index_qx = np.digitize(cba, bins, right=True)  # right=True because the quantile function is right continuous
qx = x[index_qx]  # quantile function F^{-1}
# construction of second quantile function
bins = cb + 1e-10
index_qy = np.digitize(cba, bins, right=True)  # right=True because the quantile function is right continuous
qy = y[index_qy]  # quantile function G^{-1}
ot_cost = np.sum((qx - qy)**p * h)
print(ot_cost)
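As an optional check (my addition), for p = 1 the same quantile-based cost should agree with scipy.stats.wasserstein_distance on the raw samples:
from scipy.stats import wasserstein_distance
ot_cost_p1 = np.sum(np.abs(qx - qy) * h)  # the p = 1 cost using the quantile functions above
print(ot_cost_p1, wasserstein_distance(x, y))  # the two values should agree closely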
In case you are interested, here you can find a more detailed numpy based implementation of the ot problem on the real line with dual and primal solutions as well: https://github.com/gnies/1d-optimal-transport. (I am still working on it though).
What do I have to use to figure out the inverse probability density function for a normal distribution? I'm using scipy to find the normal distribution probability density function:
from scipy.stats import norm
norm.pdf(1000, loc=1040, scale=210)
0.0018655737107410499
How can I figure out that 0.0018 probability corresponds to 1000 in the given normal distribution?
There can be no 1:1 mapping from probability density to quantile.
Because the normal PDF is symmetric about its mean (its exponent is quadratic in x), there can be either two, one, or zero quantiles that have a particular probability density.
Update
It's actually not that hard to find the roots analytically. The PDF of a normal distribution is given by:
pd = 1 / (sigma * sqrt(2 * pi)) * exp( -(x - mu)**2 / (2 * sigma**2) )
With a bit of rearrangement we get:
(x - mu)**2 = -2 * sigma**2 * log( pd * sigma * sqrt(2 * pi))
If the expression on the RHS is < 0, there are no real roots. If it equals zero, there is a single root (at x = mu), and if it is > 0 there are two roots.
To put it all together into a function:
import numpy as np
def get_quantiles(pd, mu, sigma):
    discrim = -2 * sigma**2 * np.log(pd * sigma * np.sqrt(2 * np.pi))
    # no real roots
    if discrim < 0:
        return None
    # one root, where x == mu
    elif discrim == 0:
        return mu
    # two roots
    else:
        return mu - np.sqrt(discrim), mu + np.sqrt(discrim)
This gives the desired quantile(s), to within rounding error:
from scipy.stats import norm
pd = norm.pdf(1000, loc=1040, scale=210)
print(get_quantiles(pd, 1040, 210))
# (1000.0000000000001, 1079.9999999999998)
import scipy.stats as stats
import scipy.optimize as optimize
norm = stats.norm(loc=1040, scale=210)
y = norm.pdf(1000)
print(y)
# 0.00186557371074
print(optimize.fsolve(lambda x:norm.pdf(x)-y, norm.mean()-norm.std()))
# [ 1000.]
print(optimize.fsolve(lambda x:norm.pdf(x)-y, norm.mean()+norm.std()))
# [ 1080.]
There exist densities which attain a given value an infinite number of times. (For example, the simple function with value 1 on an infinite sequence of intervals with lengths 1/2, 1/4, 1/8, etc. attains the value 1 an infinite number of times, and it is a valid density since 1/2 + 1/4 + 1/8 + ... = 1.)
So the use of fsolve above is not guaranteed to find all values of x where pdf(x) equals a certain value, but it may help you find some root.
I need to calculate binomial confidence intervals for a large set of data within a Python script. Do you know of any Python function or library that can do this?
Ideally I would like to have a function like this http://statpages.org/confint.html implemented on python.
Thanks for your time.
Just noting because it hasn't been posted elsewhere here that statsmodels.stats.proportion.proportion_confint lets you get a binomial confidence interval with a variety of methods. It only does symmetric intervals, though.
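For example (my addition; this assumes a reasonably recent statsmodels), the Clopper-Pearson interval corresponds to method='beta':
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(13, 100, alpha=0.05, method='beta')  # Clopper-Pearson ("beta")
print(low, high)  # should be close to the exact intervals quoted elsewhere in this thread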
I would say that R (or another stats package) would probably serve you better if you have the option. That said, if you only need the binomial confidence interval you probably don't need an entire library. Here's the function in my most naive translation from javascript.
def binP(N, p, x1, x2):
    p = float(p)
    q = p/(1-p)
    k = 0.0
    v = 1.0
    s = 0.0
    tot = 0.0
    while(k<=N):
        tot += v
        if(k >= x1 and k <= x2):
            s += v
        if(tot > 10**30):
            s = s/10**30
            tot = tot/10**30
            v = v/10**30
        k += 1
        v = v*q*(N+1-k)/k
    return s/tot
def calcBin(vx, vN, vCL = 95):
    '''
    Calculate the exact confidence interval for a binomial proportion

    Usage:
    >>> calcBin(13,100)
    (0.07107391357421874, 0.21204372406005856)
    >>> calcBin(4,7)
    (0.18405151367187494, 0.9010086059570312)
    '''
    vx = float(vx)
    vN = float(vN)
    # Set the confidence bounds
    vTU = (100 - float(vCL))/2
    vTL = vTU

    vP = vx/vN
    if(vx==0):
        dl = 0.0
    else:
        v = vP/2
        vsL = 0
        vsH = vP
        p = vTL/100
        while((vsH-vsL) > 10**-5):
            if(binP(vN, v, vx, vN) > p):
                vsH = v
                v = (vsL+v)/2
            else:
                vsL = v
                v = (v+vsH)/2
        dl = v

    if(vx==vN):
        ul = 1.0
    else:
        v = (1+vP)/2
        vsL = vP
        vsH = 1
        p = vTU/100
        while((vsH-vsL) > 10**-5):
            if(binP(vN, v, 0, vx) < p):
                vsH = v
                v = (vsL+v)/2
            else:
                vsL = v
                v = (v+vsH)/2
        ul = v
    return (dl, ul)
While the scipy.stats module has a method .interval() to compute the equal-tails confidence interval, it lacks a similar method to compute the highest density interval. Here is a rough way to do it using methods found in scipy and numpy.
This solution also assumes you want to use a Beta distribution as a prior. The hyper-parameters a and b are set to 1, so that the default prior is a uniform distribution between 0 and 1.
import numpy
from scipy.stats import beta
from scipy.stats import norm
def binomial_hpdr(n, N, pct, a=1, b=1, n_pbins=1e3):
    """
    Function computes the posterior mode along with the upper and lower bounds of the
    **Highest Posterior Density Region**.

    Parameters
    ----------
    n: number of successes
    N: sample size
    pct: the size of the confidence interval (between 0 and 1)
    a: the alpha hyper-parameter for the Beta distribution used as a prior (Default=1)
    b: the beta hyper-parameter for the Beta distribution used as a prior (Default=1)
    n_pbins: the number of bins to segment the p_range into (Default=1e3)

    Returns
    -------
    A tuple that contains the mode as well as the lower and upper bounds of the interval
    (mode, lower, upper)
    """
    # fixed random variable object for posterior Beta distribution
    rv = beta(n+a, N-n+b)
    # determine the mode and standard deviation of the posterior
    stdev = rv.stats('v')**0.5
    mode = (n+a-1.)/(N+a+b-2.)
    # compute the number of sigma that corresponds to this confidence
    # this is used to set the rough range of possible success probabilities
    n_sigma = numpy.ceil(norm.ppf( (1+pct)/2. ))+1
    # set the min and max values for success probability
    max_p = mode + n_sigma * stdev
    if max_p > 1:
        max_p = 1.
    min_p = mode - n_sigma * stdev
    if min_p < 0:
        min_p = 0.
    # make the range of success probabilities
    p_range = numpy.linspace(min_p, max_p, int(n_pbins)+1)
    # construct the probability mass function over the given range
    if mode > 0.5:
        sf = rv.sf(p_range)
        pmf = sf[:-1] - sf[1:]
    else:
        cdf = rv.cdf(p_range)
        pmf = cdf[1:] - cdf[:-1]
    # find the upper and lower bounds of the interval
    sorted_idxs = numpy.argsort( pmf )[::-1]
    cumsum = numpy.cumsum( numpy.sort(pmf)[::-1] )
    j = numpy.argmin( numpy.abs(cumsum - pct) )
    upper = p_range[ (sorted_idxs[:j+1]).max()+1 ]
    lower = p_range[ (sorted_idxs[:j+1]).min() ]
    return (mode, lower, upper)
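A minimal usage sketch (my addition), mirroring the 13-successes-in-100-trials example used elsewhere in this thread:
mode, lower, upper = binomial_hpdr(13, 100, 0.95)
print(mode, lower, upper)  # the HPD interval should be broadly comparable to the equal-tailed ones above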
Just been trying this myself. If it helps, here's my solution, which takes two lines of code and seems to give equivalent results to that JS page. This is the frequentist one-sided interval; I'm calling the input argument the MLE (maximum likelihood estimate) of the binomial parameter theta, i.e. mle = number of successes / number of trials. I find the upper bound of the one-sided interval. The alpha value used here is therefore double the one in the JS page for the upper limit.
from scipy.stats import binom
from scipy.optimize import bisect
def binomial_ci(mle, N, alpha=0.05):
    """
    One sided confidence interval for a binomial test.

    If after N trials we obtain mle as the proportion of those
    trials that resulted in success, find c such that

        P(k/N < mle; theta = c) = alpha

    where k/N is the proportion of successes in the set of trials,
    and theta is the success probability for each trial.
    """
    to_minimise = lambda c: binom.cdf(mle*N, N, c) - alpha
    return bisect(to_minimise, 0, 1)
To find the two sided interval, call with (1-alpha/2) and alpha/2 as arguments.
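For example (my addition), following that note for a 95% two-sided interval on 13 successes in 100 trials:
lower = binomial_ci(0.13, 100, alpha=1 - 0.05/2)  # call with (1 - alpha/2)
upper = binomial_ci(0.13, 100, alpha=0.05/2)      # call with alpha/2
print(lower, upper)  # roughly comparable to the exact intervals quoted elsewhere in the thread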
The following gives the exact (Clopper-Pearson) interval for the binomial distribution in a simple way.
def binomial_ci(x, n, alpha=0.05):
    # x is number of successes, n is number of trials
    from scipy import stats
    if x == 0:
        c1 = 0
    else:
        c1 = stats.beta.interval(1-alpha, x, n-x+1)[0]
    if x == n:
        c2 = 1
    else:
        c2 = stats.beta.interval(1-alpha, x+1, n-x)[1]
    return c1, c2
You may check the code by e.g.:
p1,p2 = binomial_ci(2,7)
from scipy import stats
assert abs(stats.binom.cdf(1,7,p1)-.975)<1E-5
assert abs(stats.binom.cdf(2,7,p2)-.025)<1E-5
assert abs(binomial_ci(0,7, alpha=.1)[0])<1E-5
assert abs((1-binomial_ci(0,7, alpha=.1)[1])**7-0.05)<1E-5
assert abs(binomial_ci(7,7, alpha=.1)[1]-1)<1E-5
assert abs((binomial_ci(7,7, alpha=.1)[0])**7-0.05)<1E-5
I used the relation between the binomial proportion confidence interval and the regularized incomplete beta function, as described here:
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper%E2%80%93Pearson_interval
I needed to do this as well. I was using R and wanted to learn a way to work it out for myself. I would not say it is strictly pythonic.
The docstring explains most of it. It assumes you have scipy installed.
def exact_CI(x, N, alpha=0.95):
    """
    Calculate the exact confidence interval of a proportion
    where there is a wide range in the sample size or the proportion.

    This method avoids the assumption that data are normally distributed. The sample size
    and proportion are described by a beta distribution.

    Parameters
    ----------
    x: the number of cases from which the proportion is calculated as a positive integer.
    N: the sample size as a positive integer.
    alpha : set at 0.95 for 95% confidence intervals.

    Returns
    -------
    The proportion with the lower and upper confidence intervals as a dict.
    """
    from scipy.stats import beta
    x = float(x)
    N = float(N)
    p = round((x/N)*100, 2)

    intervals = [round(i, 4)*100 for i in beta.interval(alpha, x, N-x+1)]
    intervals.insert(0, p)

    result = {'Proportion': intervals[0], 'Lower CI': intervals[1], 'Upper CI': intervals[2]}
    return result
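A short usage example (my addition):
print(exact_CI(13, 100))  # e.g. {'Proportion': ..., 'Lower CI': ..., 'Upper CI': ...}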
A numpy/scipy-free way of computing the same thing using the Wilson score and an approximation to the inverse normal CDF:
import math
def binconf(p, n, c=0.95):
    '''
    Calculate binomial confidence interval based on the number of positive and
    negative events observed.

    Parameters
    ----------
    p: int
      number of positive events observed
    n: int
      number of negative events observed
    c : optional, [0,1]
      confidence percentage. e.g. 0.95 means 95% confident the probability of
      success lies between the 2 returned values

    Returns
    -------
    theta_low : float
      lower bound on confidence interval
    theta_high : float
      upper bound on confidence interval
    '''
    p, n = float(p), float(n)
    N = p + n
    if N == 0.0:
        return (0.0, 1.0)
    p = p / N
    z = normcdfi(1 - 0.5 * (1-c))
    a1 = 1.0 / (1.0 + z * z / N)
    a2 = p + z * z / (2 * N)
    a3 = z * math.sqrt(p * (1-p) / N + z * z / (4 * N * N))
    return (a1 * (a2 - a3), a1 * (a2 + a3))

def erfi(x):
    """Approximation to inverse error function"""
    a = 0.147  # MAGIC!!!
    a1 = math.log(1 - x * x)
    a2 = (
        2.0 / (math.pi * a)
        + a1 / 2.0
    )
    return (
        sign(x) *
        math.sqrt( math.sqrt(a2 * a2 - a1 / a) - a2 )
    )

def sign(x):
    if x < 0: return -1
    if x == 0: return 0
    if x > 0: return 1

def normcdfi(p, mu=0.0, sigma2=1.0):
    """Inverse CDF of normal distribution"""
    if mu == 0.0 and sigma2 == 1.0:
        return math.sqrt(2) * erfi(2 * p - 1)
    else:
        return mu + math.sqrt(sigma2) * normcdfi(p)
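A quick usage example (my addition), again for 13 successes in 100 trials, i.e. 13 positive and 87 negative events:
print(binconf(13, 87))  # Wilson interval; broadly similar to the exact intervals above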
Astropy provides such a function (although installing and importing astropy may be a bit excessive):
astropy.stats.binom_conf_interval
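For example (my addition; the keyword names have shifted a bit between astropy versions, so treat this as a sketch):
from astropy.stats import binom_conf_interval

binom_conf_interval(4, 7, 0.95, interval='wilson')  # -> array([lower, upper])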
I am not an expert on statistics, but binomtest is built into SciPy and produces the same results as the accepted answer:
from scipy.stats import binomtest
binomtest(13, 100).proportion_ci()
Out[11]: ConfidenceInterval(low=0.07107304618545972, high=0.21204067708744978)
binomtest(4, 7).proportion_ci()
Out[25]: ConfidenceInterval(low=0.18405156764007, high=0.9010117215575631)
It uses the Clopper-Pearson exact method by default, which matches Curt's accepted answer; for comparison, that answer gives these values:
Usage:
>>> calcBin(13,100)
(0.07107391357421874, 0.21204372406005856)
>>> calcBin(4,7)
(0.18405151367187494, 0.9010086059570312)
It also has options for Wilson's method, with or without continuity correction, which matches TheBamf's astropy answer:
binomtest(4, 7).proportion_ci(method='wilson')
Out[32]: ConfidenceInterval(low=0.2504583645276572, high=0.8417801447485302)
binom_conf_interval(4, 7, 0.95, interval='wilson')
Out[33]: array([0.25045836, 0.84178014])
This also matches R's binom.test and statsmodels.stats.proportion.proportion_confint, according to cxrodgers' comment:
For 30 successes in 60 trials, both R's binom.test and statsmodels.stats.proportion.proportion_confint give (.37, .63) using Clopper-Pearson.
binomtest(30, 60).proportion_ci(method='exact')
Out[34]: ConfidenceInterval(low=0.3680620319424367, high=0.6319379680575633)
Given a mean and standard deviation defining a normal distribution, how would you calculate the following probabilities in pure Python (i.e. no NumPy/SciPy or other packages not in the standard library)?
1. The probability of a random variable r where r < x or r <= x.
2. The probability of a random variable r where r > x or r >= x.
3. The probability of a random variable r where x > r > y.
I've found some libraries, like Pgnumerics, that provide functions for calculating these, but the underlying math is unclear to me.
Edit: To show this isn't homework, below is my working code for Python <= 2.6, although I'm not sure whether it handles the boundary conditions correctly.
from math import *
import unittest
def erfcc(x):
    """
    Complementary error function.
    """
    z = abs(x)
    t = 1. / (1. + 0.5*z)
    r = t * exp(-z*z-1.26551223+t*(1.00002368+t*(.37409196+
        t*(.09678418+t*(-.18628806+t*(.27886807+
        t*(-1.13520398+t*(1.48851587+t*(-.82215223+
        t*.17087277)))))))))
    if (x >= 0.):
        return r
    else:
        return 2. - r

def normcdf(x, mu, sigma):
    t = x-mu
    y = 0.5*erfcc(-t/(sigma*sqrt(2.0)))
    if y > 1.0:
        y = 1.0
    return y

def normpdf(x, mu, sigma):
    u = (x-mu)/abs(sigma)
    y = (1/(sqrt(2*pi)*abs(sigma)))*exp(-u*u/2)
    return y

def normdist(x, mu, sigma, f):
    if f:
        y = normcdf(x, mu, sigma)
    else:
        y = normpdf(x, mu, sigma)
    return y

def normrange(x1, x2, mu, sigma, f=True):
    """
    Calculates probability of random variable falling between two points.
    """
    p1 = normdist(x1, mu, sigma, f)
    p2 = normdist(x2, mu, sigma, f)
    return abs(p1-p2)
All these are very similar: If you can compute #1 using a function cdf(x), then the solution to #2 is simply 1 - cdf(x), and for #3 it's cdf(x) - cdf(y).
Since Python has included the (Gauss) error function built in since version 2.7, you can do this by calculating the CDF of the normal distribution using the equation from the article you linked to:
import math
print(0.5 * (1 + math.erf((x - mean)/math.sqrt(2 * standard_dev**2))))
where mean is the mean and standard_dev is the standard deviation.
Some notes since what you asked seemed relatively straightforward given the information in the article:
The CDF of a random variable (say X) is the probability that X lies between -infinity and some limit, say x (lower case). For continuous distributions, the CDF is the integral of the PDF. The CDF is exactly what you described for #1: you want some normally distributed RV to be between -infinity and x (<= x).
< and <= as well as > and >= are the same for continuous random variables, since the probability that the RV equals any single point is 0. So whether or not x itself is included doesn't actually matter when calculating the probabilities for continuous distributions.
Probabilities sum to 1: if X is not < x then it is >= x, so if you have cdf(x), then 1 - cdf(x) is the probability that the random variable X >= x. Since >= is equivalent to > for continuous random variables, this is also the probability X > x.
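Putting the pieces together (my addition, just a sketch combining the answers above; the numbers are arbitrary):
import math

def cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 0.0, 1.0                       # arbitrary example parameters
p_below   = cdf(1.5, mu, sigma)            # P(r < x) == P(r <= x) for x = 1.5
p_above   = 1 - cdf(1.5, mu, sigma)        # P(r > x) == P(r >= x)
p_between = cdf(1.5, mu, sigma) - cdf(-0.5, mu, sigma)  # P(x > r > y) with x = 1.5, y = -0.5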