Maximum likelihood estimation for trajectories estimation in python


I am trying to predict trajectories using maximum likelihood estimation. How should I use the mean and variance from MLE to compute the parameters for my trajectory estimation?
Let's say I have a function representing the X coordinate of a gesture, where:
X(t)=a*X(t-1)+a1*X(t-2)+a2*Y(t-1)+ε and a2=da(2-1)+u, where ε and u are noise.
Here t is the next time period, t-1 the current one, and Y the Y coordinate of the hand. I need to estimate a, a1 and a2 using MLE in order to predict X(t).
Any suggestions since I am really new to this?
I am currently using the following Python code to compute the mean and variance:
import pandas as pd
import numpy as np

def expectation_max(data, max_iter=1000):
    data = pd.DataFrame(data)
    mu0 = data.mean()
    c0 = data.cov()
    for j in range(max_iter):
        w = []
        # perform the E part of algorithm
        for i in data:
            wk = (5 + len(data))/(5 + np.dot(np.dot(np.transpose(i - mu0), np.linalg.inv(c0)), (i - mu0)))
            w.append(wk)
        w = np.array(w)
        # perform the M part of the algorithm
        mu = (np.dot(w, data))/(np.sum(w))
        c = 0
        for i in range(len(data)):
            c += w[i] * np.dot((data[i] - mu0), (np.transpose(data[i] - mu0)))
        cov = c/len(data)
        mu0 = mu
        c0 = cov
    return mu0, c0
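For what it is worth, if ε is assumed to be Gaussian with constant variance and a2 is treated as a fixed constant (ignoring the second noise term u, which is a simplification on my part), maximising the likelihood of X(t) = a*X(t-1) + a1*X(t-2) + a2*Y(t-1) + ε is the same as ordinary least squares on the lagged values. A minimal sketch under those assumptions; the function name fit_ar_mle and the synthetic data are made up for illustration:

```python
import numpy as np

def fit_ar_mle(x, y):
    """Gaussian MLE (= least squares) for X(t) = a*X(t-1) + a1*X(t-2) + a2*Y(t-1) + eps."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # One row per predicted time step: [X(t-1), X(t-2), Y(t-1)]
    design = np.column_stack([x[1:-1], x[:-2], y[1:-1]])
    target = x[2:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    sigma2 = np.mean(resid ** 2)  # MLE of the noise variance (divides by n)
    return coef, sigma2

# Synthetic check with known coefficients (illustrative values only)
rng = np.random.default_rng(0)
T = 2000
true_a, true_a1, true_a2 = 0.5, 0.2, 0.3
y = rng.normal(size=T)
x = np.zeros(T)
for t in range(2, T):
    x[t] = true_a * x[t-1] + true_a1 * x[t-2] + true_a2 * y[t-1] + 0.1 * rng.normal()

(a, a1, a2), sigma2 = fit_ar_mle(x, y)
# One-step-ahead prediction of the next X from the last observed values
x_next = a * x[-1] + a1 * x[-2] + a2 * y[-1]
```

The estimated coefficients play the role of the "mean" part of the model and sigma2 the variance of ε; the prediction for the next step is just the fitted linear combination.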

Related

How to subtract baseline from spectrum with rising tail in python?

I have a spectrum that I want to subtract a baseline from. The spectrum data are:
1.484043000000000001e+00 1.121043091000000004e+03
1.472555999999999976e+00 1.140899658000000045e+03
1.461239999999999872e+00 1.135047851999999921e+03
1.450093000000000076e+00 1.153286499000000049e+03
1.439112000000000169e+00 1.158624877999999853e+03
1.428292000000000117e+00 1.249718872000000147e+03
1.417629999999999946e+00 1.491854857999999922e+03
1.407121999999999984e+00 2.524922362999999677e+03
1.396767000000000092e+00 4.102439940999999635e+03
1.386559000000000097e+00 4.013319579999999860e+03
1.376497999999999999e+00 3.128252441000000090e+03
1.366578000000000070e+00 2.633181152000000111e+03
1.356797999999999949e+00 2.340077147999999852e+03
1.347154999999999880e+00 2.099404540999999881e+03
1.337645999999999891e+00 2.012083983999999873e+03
1.328268000000000004e+00 2.052154540999999881e+03
1.319018999999999942e+00 2.061067871000000196e+03
1.309895999999999949e+00 2.205770507999999609e+03
1.300896999999999970e+00 2.199266602000000148e+03
1.292019000000000029e+00 2.317792235999999775e+03
1.283260000000000067e+00 2.357031494000000293e+03
1.274618000000000029e+00 2.434981689000000188e+03
1.266089999999999938e+00 2.540746337999999923e+03
1.257675000000000098e+00 2.605709472999999889e+03
1.249370000000000092e+00 2.667244141000000127e+03
1.241172999999999860e+00 2.800522704999999860e+03
I've taken only every 20th data point from the actual data file, but the general shape is preserved.
import numpy as np
import matplotlib.pyplot as plt
share = the_above_array  # the two-column data listed above, as a NumPy array
plt.plot(share)
[Figure: original spectrum]
There is a clear tail around the high x values. Assume the tail is an artifact and needs to be removed. I've tried solutions using the ALS algorithm by P. Eilers, a rubberband approach, and the peakutils package, but these either subtract the tail and create a rise around the low x values, or fail to produce a suitable baseline.
ALS algorithm. In this example I am using lam=1E6 and p=0.001; these were the best parameters I was able to find manually:
# ALS approach
from scipy import sparse
from scipy.sparse.linalg import spsolve

def baseline_als(y, lam, p, niter=10):
    L = len(y)
    D = sparse.csc_matrix(np.diff(np.eye(L), 2))
    w = np.ones(L)
    for i in range(niter):
        W = sparse.spdiags(w, 0, L, L)
        Z = W + lam * D.dot(D.transpose())
        z = spsolve(Z, w*y)
        w = p * (y > z) + (1-p) * (y < z)
    return z

baseline = baseline_als(share[:,1], 1E6, 0.001)
baseline_subtracted = share[:,1] - baseline
plt.plot(baseline_subtracted)
[Figure: ALS result]
Rubberband approach:
# rubberband approach
from scipy.spatial import ConvexHull

def rubberband(x, y):
    # Find the convex hull
    v = ConvexHull(share).vertices
    # Rotate convex hull vertices until they start from the lowest one
    v = np.roll(v, v.argmax())
    # Leave only the ascending part
    v = v[:v.argmax()]
    # Create baseline using linear interpolation between vertices
    return np.interp(x, x[v], y[v])

baseline_rubber = rubberband(share[:,0], share[:,1])
intensity_rubber = share[:,1] - baseline_rubber
plt.plot(intensity_rubber)
[Figure: rubberband result]
peakutils package:
# peakutils approach
import peakutils
baseline_peakutils = peakutils.baseline(share[:,1])
intensity_peakutils = share[:,1] - baseline_peakutils
plt.plot(intensity_peakutils)
[Figure: peakutils result]
Are there any suggestions, aside from masking the low x value data, for constructing a baseline and subtracting the tail without creating a rise in the low x values?

I found a set of similar ALS algorithms here. One of these algorithms, asymmetrically reweighted penalized least squares smoothing (arpls), gives a slightly better fit than als.
# arpls approach
from scipy.linalg import cholesky

def arpls(y, lam=1e4, ratio=0.05, itermax=100):
    r"""
    Baseline correction using asymmetrically
    reweighted penalized least squares smoothing
    Sung-June Baek, Aaron Park, Young-Jin Ahna and Jaebum Choo,
    Analyst, 2015, 140, 250 (2015)
    """
    N = len(y)
    D = sparse.eye(N, format='csc')
    D = D[1:] - D[:-1]  # numpy.diff( ,2) does not work with sparse matrix. This is a workaround.
    D = D[1:] - D[:-1]
    H = lam * D.T * D
    w = np.ones(N)
    for i in range(itermax):
        W = sparse.diags(w, 0, shape=(N, N))
        WH = sparse.csc_matrix(W + H)
        C = sparse.csc_matrix(cholesky(WH.todense()))
        z = spsolve(C, spsolve(C.T, w * y))
        d = y - z
        dn = d[d < 0]
        m = np.mean(dn)
        s = np.std(dn)
        wt = 1. / (1 + np.exp(2 * (d - (2 * s - m)) / s))
        if np.linalg.norm(w - wt) / np.linalg.norm(w) < ratio:
            break
        w = wt
    return z

baseline = baseline_als(share[:,1], 1E6, 0.001)
baseline_subtracted = share[:,1] - baseline
plt.plot(baseline_subtracted, 'r', label='als')
baseline_arpls = arpls(share[:,1], 1e5, 0.1)
intensity_arpls = share[:,1] - baseline_arpls
plt.plot(intensity_arpls, label='arpls')
plt.legend()
[Figure: ARPLS result compared with ALS]
Fortunately, the improvement is larger when the data from the entire spectrum are used.
Note that the parameters differed between the two algorithms. For now, I think the arpls algorithm is as close as I can get, at least for spectra that look like this. We'll see how robustly the algorithm fits spectra with different shapes. Of course, I am always open to suggestions or improvements!

Have a look at the RamPy library in python, which proposes various baseline subtraction algorithms. This includes splines, ARPLS, ALS, polynomial functions, and many more. It also offers various other features, such as resampling, normalisation, and peak fitting examples.
In your case, a simple spline function fitted before and after the peak should easily do the job. Have a look at this example Jupyter notebook.
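To illustrate the "spline fitted before and after the peak" idea without depending on any particular library's baseline API, here is a minimal sketch using scipy.interpolate.UnivariateSpline directly. The function name spline_baseline, the peak window limits and the synthetic spectrum are all assumptions made for the example; for real, noisy data the smoothing value would need tuning.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_baseline(x, y, peak_lo, peak_hi, smoothing=0.0):
    """Fit a cubic spline to the points outside [peak_lo, peak_hi] and evaluate it everywhere."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)  # UnivariateSpline needs increasing x
    x_sorted, y_sorted = x[order], y[order]
    outside = (x_sorted < peak_lo) | (x_sorted > peak_hi)
    spline = UnivariateSpline(x_sorted[outside], y_sorted[outside], k=3, s=smoothing)
    return spline(x)  # baseline evaluated at the original x positions

# Synthetic spectrum: smooth rising background plus one Gaussian peak
x_demo = np.linspace(0.0, 10.0, 201)
background = 0.5 * x_demo ** 2 + 100.0
peak = 50.0 * np.exp(-0.5 * ((x_demo - 5.0) / 0.3) ** 2)
y_demo = background + peak

baseline_demo = spline_baseline(x_demo, y_demo, peak_lo=3.0, peak_hi=7.0)
corrected_demo = y_demo - baseline_demo
```

Because only the regions outside the peak window are fitted, the rising tail is treated as part of the baseline rather than as signal, which is the behaviour asked for in the question.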

1D Wasserstein distance in Python

The formula below is a special case of the Wasserstein distance/optimal transport when the source and target distributions, x and y (also called marginal distributions), are 1D, that is, are vectors:
W_p(u, v) = \left( \int_0^1 \left| F^{-1}(z) - G^{-1}(z) \right|^p \, dz \right)^{1/p}
where F^{-1} and G^{-1} are the inverse cumulative distribution functions (quantile functions) of the marginals u and v, derived from real data called x and y, both generated from the normal distribution:
import numpy as np
from numpy.random import randn
import scipy.stats as ss
n = 100
x = randn(n)
y = randn(n)
How can the integral in the formula be coded in python and scipy? I'm guessing the x and y have to be converted to ranked marginals, which are non-negative and sum to 1, while Scipy's ppf could be used to calculate the inverse F^{-1}'s?

Note that when n gets large we have that a sorted set of n samples approaches the inverse CDF sampled at 1/n, 2/n, ..., n/n. E.g.:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
plt.plot(norm.ppf(np.linspace(0, 1, 1000)), label="invcdf")
plt.plot(np.sort(np.random.normal(size=1000)), label="sortsample")
plt.legend()
plt.show()
Also note that your integral from 0 to 1 can be approximated as a sum over 1/n, 2/n, ..., n/n.
Thus we can simply answer your question:
def W(p, u, v):
    assert len(u) == len(v)
    return np.mean(np.abs(np.sort(u) - np.sort(v))**p)**(1/p)
Note that if len(u) != len(v) you can still apply the method with linear interpolation:
def W(p, u, v):
    u = np.sort(u)
    v = np.sort(v)
    if len(u) != len(v):
        # make u the shorter sample, then resample it onto the grid of v
        if len(u) > len(v): u, v = v, u
        us = np.linspace(0, 1, len(u))
        vs = np.linspace(0, 1, len(v))
        u = np.interp(vs, us, u)
    return np.mean(np.abs(u - v)**p)**(1/p)
An alternative method if you have prior information about the sort of distribution of your data, but not its parameters, is to find the best fitting distribution on your data (e.g. with scipy.stats.norm.fit) for both u and v and then do the integral with the desired precision. E.g.:
from scipy.stats import norm as gauss
def W_gauss(p, u, v, num_steps):
    ud = gauss(*gauss.fit(u))
    vd = gauss(*gauss.fit(v))
    z = np.linspace(0, 1, num_steps, endpoint=False) + 1/(2*num_steps)
    return np.mean(np.abs(ud.ppf(z) - vd.ppf(z))**p)**(1/p)
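As a sanity check for the p = 1 case, the sorted-sample formula can be compared with scipy.stats.wasserstein_distance, which computes the first Wasserstein distance between two empirical samples. A small sketch; the sample sizes and the shift of 1.0 are arbitrary choices for the demonstration:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def W(p, u, v):
    # Sorted-sample estimate of the p-Wasserstein distance (equal sample sizes)
    assert len(u) == len(v)
    return np.mean(np.abs(np.sort(u) - np.sort(v)) ** p) ** (1 / p)

rng = np.random.default_rng(0)
u = rng.normal(size=500)
v = rng.normal(loc=1.0, size=500)

w1_sorted = W(1, u, v)
w1_scipy = wasserstein_distance(u, v)
print(w1_sorted, w1_scipy)
```

For samples of equal size the two values should agree up to floating-point error; SciPy's function only covers p = 1, so the sorted-sample version is still needed for other p.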

I guess I am a bit late, but this is what I would do for an exact solution (using only numpy):
import numpy as np
from numpy.random import randn
n = 100
m = 80
p = 2
x = np.sort(randn(n))
y = np.sort(randn(m))
a = np.ones(n)/n
b = np.ones(m)/m
# cdfs
ca = np.cumsum(a)
cb = np.cumsum(b)
# points on which we need to evaluate the quantile functions
cba = np.sort(np.hstack([ca, cb]))
# weights for integral
h = np.diff(np.hstack([0, cba]))
# construction of first quantile function
bins = ca + 1e-10  # small tolerance to avoid rounding errors and enforce right continuity
index_qx = np.digitize(cba, bins, right=True)  # right=True because the quantile function is right continuous
qx = x[index_qx]  # quantile function F^{-1}
# construction of second quantile function
bins = cb + 1e-10
index_qy = np.digitize(cba, bins, right=True)  # right=True because the quantile function is right continuous
qy = y[index_qy]  # quantile function G^{-1}
# transport cost, i.e. the p-th power of the Wasserstein distance (abs keeps it valid for odd p)
ot_cost = np.sum(np.abs(qx - qy)**p * h)
print(ot_cost)
In case you are interested, here you can find a more detailed numpy based implementation of the ot problem on the real line with dual and primal solutions as well: https://github.com/gnies/1d-optimal-transport. (I am still working on it though).

Using Fractional Distance as Distance Metric for K Means Clustering for High Dimensional Dataset in Python

I have datasets with at least 30 features to be clustered via k-means clustering in Python. Does anyone have any idea how I can set fractional distance (https://medium.com/#amit02093/the-right-distance-approximation-in-high-dimensions-fractional-distances-bb0b8cd858b2) as the distance metric for k-means clustering? scikit-learn does not seem to have fractional distance. Thank you in advance!
This is the code for fractional distance which I have obtained from another stackoverflow post:
def fractional_distance(p_coord_array, q_coord_array):
    # f is an arbitrary value, but must be greater than zero and
    # less than one. In this case, I used 3/10. I took advantage
    # of the difference of cubes in this case, so that I wouldn't
    # encounter an overflow error.
    a = np.sum(np.array(p_coord_array, dtype=np.float64))
    b = np.sum(np.array(q_coord_array, dtype=np.float64))
    a2 = np.sum(np.power(p_coord_array, 2))
    ab = np.sum(p_coord_array) * np.sum(q_coord_array)
    b2 = np.sum(np.power(p_coord_array, 2))
    diffab = a - b
    suma2abb2 = a2 + ab + b2
    temp_dist = abs(diffab * suma2abb2)
    temp_dist = np.power(temp_dist, 1./10)
    dist = np.power(temp_dist, 10./3)
    return dist
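For comparison, the fractional distance as it is usually defined is simply the Minkowski formula with an exponent f between 0 and 1, that is (sum_i |p_i - q_i|^f)^(1/f). A minimal sketch of that definition follows; the name lf_distance and the default f=0.3 are my own choices, and this is not a drop-in metric for scikit-learn's KMeans, which only works with Euclidean distance:

```python
import numpy as np

def lf_distance(p_coords, q_coords, f=0.3):
    """Minkowski-style distance with a fractional exponent 0 < f < 1."""
    if not 0 < f < 1:
        raise ValueError("f must lie strictly between 0 and 1")
    p = np.asarray(p_coords, dtype=np.float64)
    q = np.asarray(q_coords, dtype=np.float64)
    # Per-coordinate absolute differences raised to f, summed, then the 1/f root
    return np.sum(np.abs(p - q) ** f) ** (1.0 / f)

print(lf_distance([0, 0], [1, 1], f=0.5))
```

A function like this could be plugged into a clustering routine that accepts a user-supplied distance, or used to build a pairwise distance matrix for a medoid-based method.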

How do you compute the confidence interval for Pearson's r in Python?

In Python, I know how to calculate r and associated p-value using scipy.stats.pearsonr, but I'm unable to find a way to calculate the confidence interval of r. How is this done? Thanks for any help :)

According to [1], calculating the confidence interval directly from Pearson's r is complicated because r is not normally distributed. The following steps are needed:
1. Convert r to z'.
2. Calculate the z' confidence interval. The sampling distribution of z' is approximately normal and has a standard error of 1/sqrt(n-3).
3. Convert the confidence interval back to r.
Here is some sample code:
import math
from scipy import stats

def r_to_z(r):
    return math.log((1 + r) / (1 - r)) / 2.0

def z_to_r(z):
    e = math.exp(2 * z)
    return((e - 1) / (e + 1))

def r_confidence_interval(r, alpha, n):
    z = r_to_z(r)
    se = 1.0 / math.sqrt(n - 3)
    z_crit = stats.norm.ppf(1 - alpha/2)  # 2-tailed z critical value
    lo = z - z_crit * se
    hi = z + z_crit * se
    # Return a sequence
    return (z_to_r(lo), z_to_r(hi))

Reference:
[1] http://onlinestatbook.com/2/estimation/correlation_ci.html

Using rpy2 and the psychometric library (you will need R installed and to run install.packages("psychometric") within R first)
from rpy2.robjects.packages import importr
psychometric=importr('psychometric')
psychometric.CIr(r=.9, n = 100, level = .95)
Where 0.9 is your correlation, n the sample size and 0.95 the confidence level

Here's a solution that uses bootstrapping to compute the confidence interval, rather than the Fisher transformation (which assumes bivariate normality, etc.), borrowing from this answer:
import numpy as np

def pearsonr_ci(x, y, ci=95, n_boots=10000):
    x = np.asarray(x)
    y = np.asarray(y)
    # (n_boots, n_observations) paired arrays
    rand_ixs = np.random.randint(0, x.shape[0], size=(n_boots, x.shape[0]))
    x_boots = x[rand_ixs]
    y_boots = y[rand_ixs]
    # differences from mean
    x_mdiffs = x_boots - x_boots.mean(axis=1)[:, None]
    y_mdiffs = y_boots - y_boots.mean(axis=1)[:, None]
    # sums of squares
    x_ss = np.einsum('ij, ij -> i', x_mdiffs, x_mdiffs)
    y_ss = np.einsum('ij, ij -> i', y_mdiffs, y_mdiffs)
    # pearson correlations
    r_boots = np.einsum('ij, ij -> i', x_mdiffs, y_mdiffs) / np.sqrt(x_ss * y_ss)
    # upper and lower bounds for confidence interval
    ci_low = np.percentile(r_boots, (100 - ci) / 2)
    ci_high = np.percentile(r_boots, (ci + 100) / 2)
    return ci_low, ci_high

The answer given by bennylp is mostly correct; however, its critical value is only right if alpha is the significance level (e.g. 0.05). If alpha is passed as the confidence level (e.g. 0.95), the third function should instead be:
def r_confidence_interval(r, alpha, n):
    z = r_to_z(r)
    se = 1.0 / math.sqrt(n - 3)
    z_crit = stats.norm.ppf((1 + alpha)/2)  # 2-tailed z critical value
    lo = z - z_crit * se
    hi = z + z_crit * se
    # Return a sequence
    return (z_to_r(lo), z_to_r(hi))
Here's another post for reference: Scipy - two tail ppf function for a z value?
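As a side note, the Fisher transformation and its inverse are just the inverse hyperbolic tangent and the hyperbolic tangent, so the same interval can be written more compactly with numpy. This is a sketch of the same technique, not a different method; the function name pearson_ci_fisher and the argument name confidence (the confidence level, e.g. 0.95) are my own:

```python
import numpy as np
from scipy import stats

def pearson_ci_fisher(r, n, confidence=0.95):
    z = np.arctanh(r)                 # same as 0.5 * log((1 + r) / (1 - r))
    se = 1.0 / np.sqrt(n - 3)         # standard error of z'
    z_crit = stats.norm.ppf((1 + confidence) / 2)  # two-tailed critical value
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

lo, hi = pearson_ci_fisher(0.9, 100)
print(lo, hi)
```

Naming the argument after what it holds avoids the confusion over whether alpha means 0.05 or 0.95.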

I know bootstrapping has been suggested above; here is another variation of it, which may suit some other setups better.
#1 Resample your data with replacement (paired X and Y values; other variables such as weights can be included), apply the original fit to each resample, and record the R2. Then extract the confidence interval from the distribution of all recorded R2 values.
#2 Additionally, you can fit a model on each resample and use it to predict on the original, non-resampled X (or on a continuous range, to extend your predictions) to get confidence intervals on your Y hats.
So in sample code:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
from sklearn.metrics import r2_score

x = np.array([your numbers here])
y = np.array([your numbers here])
### define list for R2 values
r2s = []
### define dataframe to append your bootstrapped fits for Y hat ranges
ci_df = pd.DataFrame({'x': x})
### define how many samples you want
how_many_straps = 5000

### define your fit function/s
def func_exponential(x, a, b):
    return np.exp(b) * np.exp(a * x)

### fit original, using log because fitting exponential
polyfit_original = np.polyfit(x,
                              np.log(y),
                              1,
                              # w= could supply weights for observations here
                              )
for i in range(how_many_straps+1):
    ### zip into tuples attaching X to Y, can combine more variables as well
    zipped_for_boot = pd.Series(tuple(zip(x, y)))
    ### sample zipped X & Y pairs above with replacement
    zipped_resampled = zipped_for_boot.sample(frac=1,
                                              replace=True)
    ### create your sampled X & Y
    boot_x = []
    boot_y = []
    for sample in zipped_resampled:
        boot_x.append(sample[0])
        boot_y.append(sample[1])
    ### predict sampled using original fit
    y_hat_boot_via_original_fit = func_exponential(np.asarray(boot_x),
                                                   polyfit_original[0],
                                                   polyfit_original[1])
    ### calculate r2 and append
    r2s.append(r2_score(boot_y, y_hat_boot_via_original_fit))
    ### fit sampled
    polyfit_boot = np.polyfit(boot_x,
                              np.log(boot_y),
                              1,
                              # w= could supply weights for observations here
                              )
    ### predict original via sampled fit or on a range of min(x) to Z
    y_hat_original_via_sampled_fit = func_exponential(x,
                                                      polyfit_boot[0],
                                                      polyfit_boot[1])
    ### insert y hat into dataframe for calculating y hat confidence intervals
    ci_df["trial_" + str(i)] = y_hat_original_via_sampled_fit

### R2 conf interval
low = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[0], 3)
up = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[1], 3)
print(f"r2 confidence interval = {low} - {up}")

Is there any python function/library for calculate binomial confidence intervals?

I need to calculate binomial confidence intervals for a large set of data within a Python script. Do you know of any Python function or library that can do this?
Ideally I would like to have a function like this http://statpages.org/confint.html implemented in Python.
Thanks for your time.

Just noting because it hasn't been posted elsewhere here that statsmodels.stats.proportion.proportion_confint lets you get a binomial confidence interval with a variety of methods. It only does symmetric intervals, though.

I would say that R (or another stats package) would probably serve you better if you have the option. That said, if you only need the binomial confidence interval you probably don't need an entire library. Here's the function in my most naive translation from javascript.
def binP(N, p, x1, x2):
    p = float(p)
    q = p/(1-p)
    k = 0.0
    v = 1.0
    s = 0.0
    tot = 0.0
    while(k<=N):
        tot += v
        if(k >= x1 and k <= x2):
            s += v
        if(tot > 10**30):
            s = s/10**30
            tot = tot/10**30
            v = v/10**30
        k += 1
        v = v*q*(N+1-k)/k
    return s/tot

def calcBin(vx, vN, vCL = 95):
    '''
    Calculate the exact confidence interval for a binomial proportion
    Usage:
    >>> calcBin(13,100)
    (0.07107391357421874, 0.21204372406005856)
    >>> calcBin(4,7)
    (0.18405151367187494, 0.9010086059570312)
    '''
    vx = float(vx)
    vN = float(vN)
    #Set the confidence bounds
    vTU = (100 - float(vCL))/2
    vTL = vTU
    vP = vx/vN
    if(vx==0):
        dl = 0.0
    else:
        v = vP/2
        vsL = 0
        vsH = vP
        p = vTL/100
        while((vsH-vsL) > 10**-5):
            if(binP(vN, v, vx, vN) > p):
                vsH = v
                v = (vsL+v)/2
            else:
                vsL = v
                v = (v+vsH)/2
        dl = v
    if(vx==vN):
        ul = 1.0
    else:
        v = (1+vP)/2
        vsL =vP
        vsH = 1
        p = vTU/100
        while((vsH-vsL) > 10**-5):
            if(binP(vN, v, 0, vx) < p):
                vsH = v
                v = (vsL+v)/2
            else:
                vsL = v
                v = (v+vsH)/2
        ul = v
    return (dl, ul)

While the scipy.stats module has a method .interval() to compute the equal tails confidence, it lacks a similar method to compute the highest density interval. Here is a rough way to do it using methods found in scipy and numpy.
This solution also assumes you want to use a Beta distribution as a prior. The hyper-parameters a and b are set to 1, so that the default prior is a uniform distribution between 0 and 1.
import numpy
from scipy.stats import beta
from scipy.stats import norm

def binomial_hpdr(n, N, pct, a=1, b=1, n_pbins=1e3):
    """
    Function computes the posterior mode along with the upper and lower bounds of the
    **Highest Posterior Density Region**.
    Parameters
    ----------
    n: number of successes
    N: sample size
    pct: the size of the confidence interval (between 0 and 1)
    a: the alpha hyper-parameter for the Beta distribution used as a prior (Default=1)
    b: the beta hyper-parameter for the Beta distribution used as a prior (Default=1)
    n_pbins: the number of bins to segment the p_range into (Default=1e3)
    Returns
    -------
    A tuple that contains the mode as well as the lower and upper bounds of the interval
    (mode, lower, upper)
    """
    # fixed random variable object for posterior Beta distribution
    rv = beta(n+a, N-n+b)
    # determine the mode and standard deviation of the posterior
    stdev = rv.stats('v')**0.5
    mode = (n+a-1.)/(N+a+b-2.)
    # compute the number of sigma that corresponds to this confidence
    # this is used to set the rough range of possible success probabilities
    n_sigma = numpy.ceil(norm.ppf( (1+pct)/2. ))+1
    # set the min and max values for success probability
    max_p = mode + n_sigma * stdev
    if max_p > 1:
        max_p = 1.
    min_p = mode - n_sigma * stdev
    if min_p < 0:
        min_p = 0.
    # make the range of success probabilities (linspace needs an integer count)
    p_range = numpy.linspace(min_p, max_p, int(n_pbins)+1)
    # construct the probability mass function over the given range
    if mode > 0.5:
        sf = rv.sf(p_range)
        pmf = sf[:-1] - sf[1:]
    else:
        cdf = rv.cdf(p_range)
        pmf = cdf[1:] - cdf[:-1]
    # find the upper and lower bounds of the interval
    sorted_idxs = numpy.argsort( pmf )[::-1]
    cumsum = numpy.cumsum( numpy.sort(pmf)[::-1] )
    j = numpy.argmin( numpy.abs(cumsum - pct) )
    upper = p_range[ (sorted_idxs[:j+1]).max()+1 ]
    lower = p_range[ (sorted_idxs[:j+1]).min() ]
    return (mode, lower, upper)

Just been trying this myself. If it helps here's my solution, which takes two lines of code and seems to give equivalent results to that JS page. This is the frequentist one-sided interval, I'm calling the input argument the MLE (maximum likelihood estimate) of the binomial parameter theta. I.e. mle = number of successes/number of trials. I find the upper bound of the one sided interval. The alpha value used here is therefore double the one in the JS page for the upper limit.
from scipy.stats import binom
from scipy.optimize import bisect

def binomial_ci( mle, N, alpha=0.05 ):
    """
    One sided confidence interval for a binomial test.
    If after N trials we obtain mle as the proportion of those
    trials that resulted in success, find c such that
    P(k/N < mle; theta = c) = alpha
    where k/N is the proportion of successes in the set of trials,
    and theta is the success probability for each trial.
    """
    to_minimise = lambda c: binom.cdf(mle*N,N,c)-alpha
    return bisect(to_minimise,0,1)
To find the two sided interval, call with (1-alpha/2) and alpha/2 as arguments.

The following gives exact (Clopper-Pearson) interval for binomial distribution in a simple way.
def binomial_ci(x, n, alpha=0.05):
    #x is number of successes, n is number of trials
    from scipy import stats
    if x==0:
        c1 = 0
    else:
        c1 = stats.beta.interval(1-alpha, x,n-x+1)[0]
    if x==n:
        c2=1
    else:
        c2 = stats.beta.interval(1-alpha, x+1,n-x)[1]
    return c1, c2
You may check the code by e.g.:
p1,p2 = binomial_ci(2,7)
from scipy import stats
assert abs(stats.binom.cdf(1,7,p1)-.975)<1E-5
assert abs(stats.binom.cdf(2,7,p2)-.025)<1E-5
assert abs(binomial_ci(0,7, alpha=.1)[0])<1E-5
assert abs((1-binomial_ci(0,7, alpha=.1)[1])**7-0.05)<1E-5
assert abs(binomial_ci(7,7, alpha=.1)[1]-1)<1E-5
assert abs((binomial_ci(7,7, alpha=.1)[0])**7-0.05)<1E-5
I used the relation between the binomial proportion confidence interval and the regularized incomplete beta function, as described here:
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper%E2%80%93Pearson_interval

I needed to do this as well. I was using R and wanted to learn a way to work it out for myself. I would not say it is strictly pythonic.
The docstring explains most of it. It assumes you have scipy installed.
def exact_CI(x, N, alpha=0.95):
    """
    Calculate the exact confidence interval of a proportion
    where there is a wide range in the sample size or the proportion.
    This method avoids the assumption that data are normally distributed. The sample size
    and proportion are described by a beta distribution.
    Parameters
    ----------
    x: the number of cases from which the proportion is calculated as a positive integer.
    N: the sample size as a positive integer.
    alpha : set at 0.95 for 95% confidence intervals.
    Returns
    -------
    The proportion with the lower and upper confidence intervals as a dict.
    """
    from scipy.stats import beta
    x = float(x)
    N = float(N)
    p = round((x/N)*100,2)
    tail = (1 - alpha)/2
    # Clopper-Pearson: lower bound from Beta(x, N-x+1), upper bound from Beta(x+1, N-x)
    lower = beta.ppf(tail, x, N-x+1) if x > 0 else 0.0
    upper = beta.ppf(1 - tail, x+1, N-x) if x < N else 1.0
    intervals = [round(i,4)*100 for i in (lower, upper)]
    intervals.insert(0,p)
    result = {'Proportion': intervals[0], 'Lower CI': intervals[1], 'Upper CI': intervals[2]}
    return result

A numpy/scipy-free way of computing the same thing using the Wilson score and an approximation to the inverse of the normal cumulative distribution function:
import math

def binconf(p, n, c=0.95):
    '''
    Calculate binomial confidence interval based on the number of positive and
    negative events observed.
    Parameters
    ----------
    p: int
        number of positive events observed
    n: int
        number of negative events observed
    c : optional, [0,1]
        confidence percentage. e.g. 0.95 means 95% confident the probability of
        success lies between the 2 returned values
    Returns
    -------
    theta_low : float
        lower bound on confidence interval
    theta_high : float
        upper bound on confidence interval
    '''
    p, n = float(p), float(n)
    N = p + n
    if N == 0.0: return (0.0, 1.0)
    p = p / N
    z = normcdfi(1 - 0.5 * (1-c))
    a1 = 1.0 / (1.0 + z * z / N)
    a2 = p + z * z / (2 * N)
    a3 = z * math.sqrt(p * (1-p) / N + z * z / (4 * N * N))
    return (a1 * (a2 - a3), a1 * (a2 + a3))

def erfi(x):
    """Approximation to inverse error function"""
    a = 0.147  # MAGIC!!!
    a1 = math.log(1 - x * x)
    a2 = (
        2.0 / (math.pi * a)
        + a1 / 2.0
    )
    return (
        sign(x) *
        math.sqrt( math.sqrt(a2 * a2 - a1 / a) - a2 )
    )

def sign(x):
    if x < 0: return -1
    if x == 0: return 0
    if x > 0: return 1

def normcdfi(p, mu=0.0, sigma2=1.0):
    """Inverse CDF of normal distribution"""
    if mu == 0.0 and sigma2 == 1.0:
        return math.sqrt(2) * erfi(2 * p - 1)
    else:
        return mu + math.sqrt(sigma2) * normcdfi(p)

Astropy provides such a function (although installing and importing astropy may be a bit excessive):
astropy.stats.binom_conf_interval

I am not an expert on statistics, but binomtest is built into SciPy and produces the same results as the accepted answer:
from scipy.stats import binomtest
binomtest(13, 100).proportion_ci()
Out[11]: ConfidenceInterval(low=0.07107304618545972, high=0.21204067708744978)
binomtest(4, 7).proportion_ci()
Out[25]: ConfidenceInterval(low=0.18405156764007, high=0.9010117215575631)
It uses Clopper-Pearson exact method by default, which matches Curt's accepted answer, which gives these values, for comparison:
Usage:
>>> calcBin(13,100)
(0.07107391357421874, 0.21204372406005856)
>>> calcBin(4,7)
(0.18405151367187494, 0.9010086059570312)
It also has options for Wilson's method, with or without continuity correction, which matches TheBamf's astropy answer:
binomtest(4, 7).proportion_ci(method='wilson')
Out[32]: ConfidenceInterval(low=0.2504583645276572, high=0.8417801447485302)
binom_conf_interval(4, 7, 0.95, interval='wilson')
Out[33]: array([0.25045836, 0.84178014])
This also matches R's binom.test and statsmodels.stats.proportion.proportion_confint, according to cxrodgers' comment:
For 30 successes in 60 trials, both R's binom.test and statsmodels.stats.proportion.proportion_confint give (.37, .63) using Clopper-Pearson.
binomtest(30, 60).proportion_ci(method='exact')
Out[34]: ConfidenceInterval(low=0.3680620319424367, high=0.6319379680575633)
