I'm creating a non-linear response r(k) to a series of random values a(k) from {-1, +1} using a simple second-order Volterra kernel:

r(k) = \sum_{i \le j} w_{ij} a(k-i) a(k-j)
Since the a(k) values have zero mean, I would expect r(k) to have zero mean as well for arbitrary w values. However, the mean of r(k) is always positive, while the mean of a(k) behaves as expected: it is close to zero and changes sign from run to run.
Why don't I get a similar behavior for r(k)? Is it because a(k) are pseudo-random and two different values from a are not actually independent?
Here is the code that I use:
import numpy as np
import matplotlib.pyplot as plt
import itertools

# array of random values {-1, 1}
A = np.random.randint(2, size=10000)
A = [x*2 - 1 for x in A]

# array of random weights
M = 3
w = np.random.rand(int(M*(M+1)/2))

# non-linear response to random values
R = []
for i in range(M, len(A)):
    vals = np.asarray([np.prod(x) for x in itertools.combinations_with_replacement(A[i-M:i], 2)])
    R.append(np.dot(vals, w))

print(np.mean(A), np.var(A))
print(np.mean(R), np.var(R))
Edit:
The check of whether the quadratic form employed by the kernel is positive definite fails (i.e. there are negative principal minors). The code for the check:
import scipy.linalg as lin

wm = np.zeros((M, M))
w_index = 0
# check Sylvester's criterion
# reconstruct weights for quadratic form
for r in range(0, M):
    for c in range(r, M):
        wm[r, c] += w[w_index]/2
        wm[c, r] += w[w_index]/2
        w_index += 1

# check principal minors
for r in range(0, M):
    if lin.det(wm[:r+1, :r+1]) < 0:
        print('found negative principal minor of order', r)
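An equivalent, more compact check (a sketch assuming the symmetrized matrix wm built above): a symmetric matrix is positive definite exactly when all of its eigenvalues are positive.

# eigenvalues of the symmetric weight matrix
eigenvalues = np.linalg.eigvalsh(wm)
print('positive definite:', np.all(eigenvalues > 0))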
I'm not certain whether this holds for Volterra kernels, but many kernels are positive definite, and some kernels, such as covariance functions, do not admit values less than zero (e.g. the Squared Exponential/RBF, Rational Quadratic, and Matern kernels).
If that is not the case for the Volterra kernel, you can also seed the RNG differently to check whether the behaviour persists. Here is a looped version of your code that iterates over different random seeds:
import numpy as np
import matplotlib.pyplot as plt
import itertools

# Loop over random seeds
for i in range(10):
    # Seed the RNG
    np.random.seed(i)

    # array of random values {-1, 1}
    A = np.random.randint(2, size=10000)
    A = [x*2 - 1 for x in A]

    # array of random weights
    M = 3
    w = np.random.rand(int(M*(M+1)/2))

    # non-linear response to random values
    R = []
    for t in range(M, len(A)):
        vals = np.asarray([np.prod(x) for x in itertools.combinations_with_replacement(A[t-M:t], 2)])
        R.append(np.dot(vals, w))

    # Convert R to a numpy array to allow boolean slicing
    R = np.array(R)

    print("A: ", np.mean(A), np.var(A))
    print("R <= 0: ", R[R <= 0])
    print("R: ", np.mean(R), np.var(R))
Running this, I get the following values:
A: 0.017 0.9997109999999997
R <= 0: []
R: 1.487637375177384 0.14880206863520892
A: -0.0012 0.9999985600000002
R <= 0: []
R: 2.28108226352669 0.5926651729251319
A: 0.0104 0.9998918400000001
R <= 0: []
R: 1.6138015284426408 0.9526360372883802
A: -0.0064 0.9999590399999999
R <= 0: []
R: 0.988332642595828 0.9650456000380685
A: 0.0026 0.9999932399999998
R <= 0: [-0.75835076 -0.75835076 -0.75835076 ... -0.75835076 -0.75835076
-0.75835076]
R: 0.7352258581171865 1.2668744674748733
A: -0.0048 0.9999769599999996
R <= 0: [-0.02201476 -0.29894937 -0.29894937 ... -0.02201476 -0.29894937
-0.02201476]
R: 0.7396699663779303 1.3844391355510492
A: -0.0012 0.9999985600000002
R <= 0: []
R: 2.4343947709617475 1.6377776468054106
A: -0.0052 0.99997296
R <= 0: []
R: 0.8778918601676095 0.07656607914368625
A: 0.0086 0.99992604
R <= 0: []
R: 2.3490174001719937 0.059871902764070624
A: 0.0046 0.9999788399999996
R <= 0: []
R: 1.7699147798471178 1.8049209966313247
So, as you can see, R does still take some negative values for certain seeds, while its mean stays positive in every run. My guess is that this occurs because your kernel is positive definite.
This question ended up being about math, and not programming. Nevertheless, this is my own answer.
Simply put, when the indices of the a(k-i) factors are equal, the variables in the resulting product are not independent (they are the same variable). Such a product does not have zero mean, hence the mean of the whole expression is shifted into the positive range.
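A quick numerical illustration of this point (a small sketch, not part of the original post):

import numpy as np

a = np.random.randint(2, size=100000) * 2 - 1   # values in {-1, +1}
print(np.mean(a[:-1] * a[1:]))   # product of two distinct values: mean close to 0
print(np.mean(a * a))            # product of a value with itself: always exactly 1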
Formally, the implemented function is a quadratic form, whose mean value can be calculated as

E[A^T W A] = tr(W \Sigma) + \mu^T W \mu,

where \mu and \Sigma are the vector of expected values and the covariance matrix of the vector A, respectively.
Having a zero vector \mu leaves only the first term of this equation. The resulting estimate can be computed with the following code, and it indeed gives values close to the statistical results in the question.
# Estimate R mean
# sum weights on the main diagonal of the quadratic form (matrix trace)
w_sum = 0
w_index = 0
for r in range(0, M):
    for c in range(r, M):
        if r == c: w_sum += w[w_index]
        w_index += 1

Rmean_est = np.var(A) * w_sum
print(Rmean_est)
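Equivalently, if the symmetrized weight matrix wm from the Sylvester check above has already been built, the diagonal weight sum is just its trace (a small sketch under that assumption):

# same estimate via the trace of the symmetrized quadratic-form weight matrix
Rmean_est_trace = np.var(A) * np.trace(wm)
print(Rmean_est_trace)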
This estimate relies on the assumption that elements of a with different indices are independent. Any implicit dependence due to the nature of the pseudo-random generator, if present, probably changes the estimate only slightly.
I'm using the k-prototypes library for mixed numerical and nominal data types. Following https://github.com/nicodv/kmodes/issues/46,
to calculate the silhouette score for k-prototypes, I compute the silhouette score of the categorical data (based on Hamming distance) and the silhouette score of the numerical data (based on Euclidean distance). However, the code I developed is pretty slow and takes 10 hours to calculate the silhouette for 60,000 records. My laptop has 12 GB RAM and a Core i7.
Any help to improve the speed of the code, please?
import numpy as np
import pandas as pd
from kmodes.kprototypes import KPrototypes

# -------- import data
df = pd.read_csv(r'C:\Users\data.csv')

# ------------- Normalize the data ---------------
# print(df.columns)  # To get column names
x_df = df[['R', 'F']]
x_df_norm = x_df.apply(lambda x: (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0)))
x_df_norm['COType'] = df[['COType']]

def calc_euclian_dis(_s1, _s2):
    # s1 = np.array((3, 5))
    _eucl_dist = np.linalg.norm(_s2 - _s1)  # calculate Euclidean distance, accepts an array input, e.g. [2 6]
    return _eucl_dist

def calc_simpleMatching_dis(_s1, _s2):
    _cat_dist = 0
    if _s1 != _s2:
        _cat_dist = 1
    return _cat_dist

k = 3
# calculate silhouette for one cluster number
kproto = KPrototypes(n_clusters=k, init='Cao', verbose=2)
clusters_label = kproto.fit_predict(x_df_norm, categorical=[2])
_identical_cluster_labels = list(dict.fromkeys(clusters_label))

# Assign cluster labels to the dataset
x_df_norm['Cluster_label'] = clusters_label

# ------------- calculate _silhouette_Index -------------
# 1. Calculate ai
_silhouette_Index_arr = []
for i in x_df_norm.itertuples():
    _ai_cluster_label = i[-1]
    # return samples of the same cluster
    _samples_cluster = x_df_norm[x_df_norm['Cluster_label'] == _ai_cluster_label]
    _dist_array_ai = []
    _s1_nume_att = np.array((i[1], i[2]))
    _s1_cat_att = i[3]
    for j in _samples_cluster.itertuples():
        _s2_nume_att = np.array((j[1], j[2]))
        _s2_cat_att = j[3]
        _euclian_dis = calc_euclian_dis(_s1_nume_att, _s2_nume_att)
        _cat_dis = calc_simpleMatching_dis(_s1_cat_att, _s2_cat_att)
        _dist_array_ai.append(_euclian_dis + (kproto.gamma * _cat_dis))
    ai = np.average(_dist_array_ai)

    # 2. Calculate bi
    # 2.1. determine the samples of other clusters
    _identical_cluster_labels.remove(_ai_cluster_label)
    _dic_cluseter = {}
    _bi_arr = []
    for ii in _identical_cluster_labels:
        _samples = x_df_norm[x_df_norm['Cluster_label'] == ii]
        # 2.2. calculate bi
        _dist_array_bi = []
        for j in _samples.itertuples():
            _s2_nume_att = np.array((j[1], j[2]))
            _s2_cat_att = j[3]
            _euclian_dis = calc_euclian_dis(_s1_nume_att, _s2_nume_att)
            _cat_dis = calc_simpleMatching_dis(_s1_cat_att, _s2_cat_att)
            _dist_array_bi.append(_euclian_dis + (kproto.gamma * _cat_dis))
        _bi_arr.append(np.average(_dist_array_bi))
    _identical_cluster_labels.append(_ai_cluster_label)

    # min bi is determined as final bi variable
    bi = min(_bi_arr)

    # 3. calculate silhouette Index
    if ai == bi:
        _silhouette_i = 0
    elif ai < bi:
        _silhouette_i = 1 - (ai / bi)
    elif ai > bi:
        _silhouette_i = 1 - (bi / ai)
    _silhouette_Index_arr.append(_silhouette_i)

silhouette_score = np.average(_silhouette_Index_arr)
print('_silhouette_Index = ' + str(silhouette_score))
Hi! I reimplemented your function using linear algebra operations to compute the dissimilarities instead of nested for loops.
It is way faster :-)
def euclidean_dissim(a, b, **_):
    """Euclidean distance dissimilarity function.
    b is the single point, a is the matrix of vectors."""
    if np.isnan(a).any() or np.isnan(b).any():
        raise ValueError("Missing values detected in numerical columns.")
    return np.linalg.norm(a - b, axis=1)

def matching_dissim(a, b, **_):
    """Simple matching dissimilarity function.
    b is the single point, a is the matrix of all other vectors;
    count how many values match (difference = 0)."""
    # Subtract the match count from the dimension, since this is a dissimilarity, not a similarity
    dimension = len(b)
    return dimension - np.sum((b - a) == 0, axis=1)

def calc_silhouette_proto(dataset, numerical_pos, cat_pos, kproto_model):
    '''------------- calculate _silhouette_Index -------------'''
    # 1. Compute a(i)
    silhouette_Index_arr = []
    for i in dataset.itertuples():
        # convert tuple to np array
        i = np.array(i)
        unique_cluster_labels = list(np.unique(dataset['cluster_labels']))
        # Each time remove the considered tuple from the dataset, since we don't compute distances from itself
        data = dataset.copy()
        ai_cluster = i[-1]  # The cluster is in the last position of the tuple
        # Removing the tuple from the dataset
        tuple_index = dataset.index.isin([i[0]])
        data = data[~tuple_index]
        # Get samples of the same cluster
        samples_of_cluster = data[data['cluster_labels'] == ai_cluster].loc[:, data.columns != 'cluster_labels'].to_numpy()
        # Compute the 2 distances between the single point and all the others
        euclidian_distances = euclidean_dissim(samples_of_cluster[:, numerical_pos], i[np.array(numerical_pos) + 1])
        categ_distances = matching_dissim(samples_of_cluster[:, cat_pos], i[np.array(cat_pos) + 1])
        # Weighted average of the 2 distances
        ai = np.average(euclidian_distances) + (kproto_model.gamma * np.average(categ_distances))

        # 2. Calculate bi
        unique_cluster_labels.remove(ai_cluster)
        bi_arr = []
        for ii in unique_cluster_labels:
            # Get all the samples of cluster ii
            samples = data[data['cluster_labels'] == ii].loc[:, data.columns != 'cluster_labels'].to_numpy()
            # Compute the 2 distances between the single point and all the others
            euclidian_distances = np.linalg.norm(samples[:, numerical_pos] - i[np.array(numerical_pos) + 1], axis=1)
            categ_distances = matching_dissim(samples[:, cat_pos], i[np.array(cat_pos) + 1])
            distance_bi = np.average(euclidian_distances) + (kproto_model.gamma * np.average(categ_distances))
            bi_arr.append(np.average(distance_bi))

        # min bi is determined as final bi variable
        if len(bi_arr) == 0:
            bi = 0
        else:
            bi = min(bi_arr)

        # 3. calculate silhouette Index
        if ai == bi:
            silhouette_i = 0
        elif ai < bi:
            silhouette_i = 1 - (ai / bi)
        elif ai > bi:
            silhouette_i = 1 - (bi / ai)
        silhouette_Index_arr.append(silhouette_i)

    silhouette_score = np.average(silhouette_Index_arr)
    return silhouette_score
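A hedged usage sketch for the dataframe from the question (assumptions: the categorical column is numerically encoded so .to_numpy() yields a numeric array, and the cluster labels live in a last column named 'cluster_labels', which is what the function expects):

# encode the categorical column and rename the label column as assumed by the function
x_df_norm['COType'] = x_df_norm['COType'].astype('category').cat.codes
x_df_norm = x_df_norm.rename(columns={'Cluster_label': 'cluster_labels'})

score = calc_silhouette_proto(x_df_norm,
                              numerical_pos=[0, 1],   # positions of 'R' and 'F'
                              cat_pos=[2],            # position of 'COType'
                              kproto_model=kproto)
print('silhouette score:', score)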
I would like to implement the simple hierarchical agglomerative clustering according to the pseudocode:
I got stuck at the last part where I need to update the distance matrix. So far I have:
import numpy as np

X = np.array([[1, 2],
              [0, 3],
              [2, 3]])

# Clusters
C = np.zeros((X.shape[0], X.shape[0]))

# Keeps track of active clusters
I = np.zeros(X.shape[0])

# For all N datapoints
for n in range(X.shape[0]):
    for i in range(X.shape[0]):
        # Compute the similarity of all N x N pairs of points
        C[n][i] = np.linalg.norm(X[n] - X[i])
    I[n] = 1

# Collects clustering as a sequence of merges
A = []

# In each of N iterations
for k in range(X.shape[0] - 1):
    # TODO: Find the indices of the smallest distance
    # TODO: Update the distance matrix
    pass
I would like to implement single-linkage clustering, so I need to find the argmin of the distance matrix. I originally thought about doing something like:
i, m = np.where(C == np.min(C[np.nonzero(C)]))
i, m = i[0], m[0]
A.append((i, m))
to find the argmin, but I think it is incorrect as it doesn't specify a condition on the active clusters in I. I am also confused because I should just be looking at the upper or lower triangle of the matrix, so if I use the above method I could get the same argmin twice due to symmetry.
I was also thinking about first creating the rows and columns of the new merged cluster:
C = np.vstack((C, np.zeros((1, C.shape[1]))))
C = np.hstack((C, np.zeros((C.shape[0], 1))))
Then somehow update it like:
for j in range(X.shape[0]):
C[i][j] = min(C[i][j], C[m][j])
C[j][i] = min(C[i][j], C[m][j])
I am not sure if this is the right approach. Is there a simpler way to find the argmin, merge the rows and columns, and update the values?
If you are confused about how to find the row and column indices of the minimum distance:
Firstly,
to avoid getting the argmin twice due to symmetry, you can construct your initial distance matrix as a lower triangular matrix.
import math

def euclidean_distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

distance_matrix = np.zeros((X.shape[0], X.shape[0]))
for i in range(len(distance_matrix)):
    for j in range(i):
        distance_matrix[i][j] = euclidean_distance(X[i], X[j])
Secondly,
you can do the min search over the matrix by hand if you don't want to use numpy tools or are looking for a simple way.
min_value = np.inf
for i in range(len(distance_matrix)):
    for j in range(i):
        if distance_matrix[i][j] < min_value:
            min_value = distance_matrix[i][j]
            # keep min_i as the smaller index and min_j as the larger one,
            # which is what the update step below assumes
            min_i = j
            min_j = i
Finally,
update the distance matrix and merge the clusters as follows:
for i in range(len(distance_matrix)):
    if i > min_i and i < min_j:
        distance_matrix[i][min_i] = min(distance_matrix[i][min_i], distance_matrix[min_j][i])
    elif i > min_j:
        distance_matrix[i][min_i] = min(distance_matrix[i][min_i], distance_matrix[i][min_j])

for j in range(len(distance_matrix)):
    if j < min_i:
        distance_matrix[min_i][j] = min(distance_matrix[min_i][j], distance_matrix[min_j][j])

# remove one of the old clusters' data from the distance matrix
distance_matrix = np.delete(distance_matrix, min_j, axis=1)
distance_matrix = np.delete(distance_matrix, min_j, axis=0)

A[min_i] = A[min_i] + A[min_j]
A.pop(min_j)
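For completeness, here is one way the three pieces above might be tied together into a full single-linkage loop. This is only a sketch following the lower-triangle convention used above (with min_i kept as the smaller index), not code from the original answer:

import numpy as np

X = np.array([[1, 2], [0, 3], [2, 3]])

# lower-triangular distance matrix, as in the first snippet
distance_matrix = np.zeros((X.shape[0], X.shape[0]))
for i in range(len(distance_matrix)):
    for j in range(i):
        distance_matrix[i][j] = np.linalg.norm(X[i] - X[j])

A = [[i] for i in range(X.shape[0])]   # clusters as lists of point indices
merges = []                            # records each pair of merged clusters

while len(distance_matrix) > 1:
    # min search over the lower triangle (min_i is kept as the smaller index)
    min_value = np.inf
    for i in range(len(distance_matrix)):
        for j in range(i):
            if distance_matrix[i][j] < min_value:
                min_value = distance_matrix[i][j]
                min_i, min_j = j, i

    # single-linkage update, then drop cluster min_j
    for i in range(len(distance_matrix)):
        if min_i < i < min_j:
            distance_matrix[i][min_i] = min(distance_matrix[i][min_i], distance_matrix[min_j][i])
        elif i > min_j:
            distance_matrix[i][min_i] = min(distance_matrix[i][min_i], distance_matrix[i][min_j])
    for j in range(min_i):
        distance_matrix[min_i][j] = min(distance_matrix[min_i][j], distance_matrix[min_j][j])
    distance_matrix = np.delete(distance_matrix, min_j, axis=1)
    distance_matrix = np.delete(distance_matrix, min_j, axis=0)

    merges.append((list(A[min_i]), list(A[min_j])))
    A[min_i] = A[min_i] + A[min_j]
    A.pop(min_j)

print(merges)   # sequence of merges
print(A)        # single remaining cluster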
I am trying to predict trajectories using maximum likelihood estimation. How should I use the mean and variance from MLE to compute the parameters for my trajectory estimation?
Let's say I have a function representing the X coordinate of a gesture, where:
X(t) = a*X(t-1) + a1*X(t-2) + a2*Y(t-1) + ε, with a2 = da(2-1) + u, where ε and u are noise terms.
Here t represents the next time period, t-1 the current one, and Y is the Y coordinate of the hand. I need to estimate a, a1 and a2 using MLE in order to predict X(t).
Any suggestions, since I am really new to this?
Currently I am using the following Python code to compute the mean and variance:
import pandas as pd
import numpy as np

def expectation_max(data, max_iter=1000):
    data = pd.DataFrame(data)
    mu0 = data.mean()
    c0 = data.cov()
    for j in range(max_iter):
        w = []
        # perform the E part of algorithm
        for i in data:
            wk = (5 + len(data)) / (5 + np.dot(np.dot(np.transpose(i - mu0), np.linalg.inv(c0)), (i - mu0)))
            w.append(wk)
        w = np.array(w)
        # perform the M part of the algorithm
        mu = (np.dot(w, data)) / (np.sum(w))
        c = 0
        for i in range(len(data)):
            c += w[i] * np.dot((data[i] - mu0), (np.transpose(data[i] - mu0)))
        cov = c / len(data)
        mu0 = mu
        c0 = cov
    return mu0, c0
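The question itself does not show the coefficient estimation, so here is a hedged sketch of one common route for the trajectory model above: if the noise ε is assumed Gaussian, the maximum-likelihood estimates of a, a1 and a2 coincide with ordinary least squares on the lagged regressors. The arrays x and y below are hypothetical stand-ins for recorded gesture coordinates.

import numpy as np

# hypothetical recorded trajectories; replace with real gesture data
rng = np.random.default_rng(0)
T = 200
x = np.cumsum(rng.normal(size=T))
y = np.cumsum(rng.normal(size=T))

# regression X(t) ~ a*X(t-1) + a1*X(t-2) + a2*Y(t-1)
target = x[2:]
design = np.column_stack([x[1:-1], x[:-2], y[1:-1]])

# under a Gaussian noise assumption, the MLE of (a, a1, a2) is the least-squares solution
coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
a, a1, a2 = coeffs

# one-step-ahead prediction of the next X value
x_next = a * x[-1] + a1 * x[-2] + a2 * y[-1]
print(coeffs, x_next)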
I need to calculate binomial confidence intervals for a large set of data within a Python script. Do you know any function or library in Python that can do this?
Ideally I would like to have a function like the one at http://statpages.org/confint.html implemented in Python.
Thanks for your time.
Just noting, because it hasn't been posted elsewhere here, that statsmodels.stats.proportion.proportion_confint lets you get a binomial confidence interval with a variety of methods. It only does symmetric intervals, though.
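A minimal usage sketch (13 successes out of 100 trials; 'beta' is statsmodels' name for the Clopper-Pearson method):

from statsmodels.stats.proportion import proportion_confint

print(proportion_confint(13, 100, alpha=0.05, method='beta'))    # Clopper-Pearson
print(proportion_confint(13, 100, alpha=0.05, method='wilson'))  # Wilson score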
I would say that R (or another stats package) would probably serve you better if you have the option. That said, if you only need the binomial confidence interval you probably don't need an entire library. Here's the function in my most naive translation from JavaScript.
def binP(N, p, x1, x2):
    p = float(p)
    q = p/(1-p)
    k = 0.0
    v = 1.0
    s = 0.0
    tot = 0.0
    while k <= N:
        tot += v
        if k >= x1 and k <= x2:
            s += v
        if tot > 10**30:
            s = s/10**30
            tot = tot/10**30
            v = v/10**30
        k += 1
        v = v*q*(N+1-k)/k
    return s/tot

def calcBin(vx, vN, vCL=95):
    '''
    Calculate the exact confidence interval for a binomial proportion

    Usage:
    >>> calcBin(13,100)
    (0.07107391357421874, 0.21204372406005856)
    >>> calcBin(4,7)
    (0.18405151367187494, 0.9010086059570312)
    '''
    vx = float(vx)
    vN = float(vN)
    # Set the confidence bounds
    vTU = (100 - float(vCL))/2
    vTL = vTU

    vP = vx/vN
    if vx == 0:
        dl = 0.0
    else:
        v = vP/2
        vsL = 0
        vsH = vP
        p = vTL/100
        while (vsH - vsL) > 10**-5:
            if binP(vN, v, vx, vN) > p:
                vsH = v
                v = (vsL + v)/2
            else:
                vsL = v
                v = (v + vsH)/2
        dl = v

    if vx == vN:
        ul = 1.0
    else:
        v = (1 + vP)/2
        vsL = vP
        vsH = 1
        p = vTU/100
        while (vsH - vsL) > 10**-5:
            if binP(vN, v, 0, vx) < p:
                vsH = v
                v = (vsL + v)/2
            else:
                vsL = v
                v = (v + vsH)/2
        ul = v
    return (dl, ul)
While the scipy.stats module has an .interval() method to compute the equal-tails confidence interval, it lacks a similar method to compute the highest density interval. Here is a rough way to do it using methods found in scipy and numpy.
This solution also assumes you want to use a Beta distribution as a prior. The hyper-parameters a and b are set to 1, so that the default prior is a uniform distribution between 0 and 1.
import numpy
from scipy.stats import beta
from scipy.stats import norm

def binomial_hpdr(n, N, pct, a=1, b=1, n_pbins=1e3):
    """
    Compute the posterior mode along with the upper and lower bounds of the
    **Highest Posterior Density Region**.

    Parameters
    ----------
    n: number of successes
    N: sample size
    pct: the size of the confidence interval (between 0 and 1)
    a: the alpha hyper-parameter for the Beta distribution used as a prior (Default=1)
    b: the beta hyper-parameter for the Beta distribution used as a prior (Default=1)
    n_pbins: the number of bins to segment the p_range into (Default=1e3)

    Returns
    -------
    A tuple that contains the mode as well as the lower and upper bounds of the interval
    (mode, lower, upper)
    """
    # frozen random variable object for the posterior Beta distribution
    rv = beta(n + a, N - n + b)
    # determine the mode and standard deviation of the posterior
    stdev = rv.stats('v')**0.5
    mode = (n + a - 1.) / (N + a + b - 2.)
    # compute the number of sigma that corresponds to this confidence
    # this is used to set the rough range of possible success probabilities
    n_sigma = numpy.ceil(norm.ppf((1 + pct) / 2.)) + 1
    # set the min and max values for success probability
    max_p = mode + n_sigma * stdev
    if max_p > 1:
        max_p = 1.
    min_p = mode - n_sigma * stdev
    if min_p < 0:
        min_p = 0.
    # make the range of success probabilities
    p_range = numpy.linspace(min_p, max_p, int(n_pbins) + 1)
    # construct the probability mass function over the given range
    if mode > 0.5:
        sf = rv.sf(p_range)
        pmf = sf[:-1] - sf[1:]
    else:
        cdf = rv.cdf(p_range)
        pmf = cdf[1:] - cdf[:-1]
    # find the upper and lower bounds of the interval
    sorted_idxs = numpy.argsort(pmf)[::-1]
    cumsum = numpy.cumsum(numpy.sort(pmf)[::-1])
    j = numpy.argmin(numpy.abs(cumsum - pct))
    upper = p_range[(sorted_idxs[:j+1]).max() + 1]
    lower = p_range[(sorted_idxs[:j+1]).min()]

    return (mode, lower, upper)
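A brief usage sketch, assuming the function above (95% HPD region for 13 successes out of 100 trials):

# returns (mode, lower, upper)
print(binomial_hpdr(13, 100, 0.95))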
I have just been trying this myself. If it helps, here's my solution, which takes two lines of code and seems to give equivalent results to that JS page. This is the frequentist one-sided interval; I'm calling the input argument the MLE (maximum likelihood estimate) of the binomial parameter theta, i.e. mle = number of successes / number of trials. I find the upper bound of the one-sided interval. The alpha value used here is therefore double the one in the JS page for the upper limit.
from scipy.stats import binom
from scipy.optimize import bisect

def binomial_ci(mle, N, alpha=0.05):
    """
    One-sided confidence interval for a binomial test.

    If after N trials we obtain mle as the proportion of those
    trials that resulted in success, find c such that

        P(k/N < mle; theta = c) = alpha

    where k/N is the proportion of successes in the set of trials,
    and theta is the success probability for each trial.
    """
    to_minimise = lambda c: binom.cdf(mle * N, N, c) - alpha
    return bisect(to_minimise, 0, 1)
To find the two sided interval, call with (1-alpha/2) and alpha/2 as arguments.
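For example, a hedged sketch for a two-sided 95% interval on 13 successes in 100 trials, following that note (1 - alpha/2 for the lower bound, alpha/2 for the upper):

mle = 13 / 100
lower = binomial_ci(mle, 100, alpha=1 - 0.05 / 2)
upper = binomial_ci(mle, 100, alpha=0.05 / 2)
print(lower, upper)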
The following gives an exact (Clopper-Pearson) interval for the binomial distribution in a simple way.
def binomial_ci(x, n, alpha=0.05):
    # x is the number of successes, n is the number of trials
    from scipy import stats
    if x == 0:
        c1 = 0
    else:
        c1 = stats.beta.interval(1 - alpha, x, n - x + 1)[0]
    if x == n:
        c2 = 1
    else:
        c2 = stats.beta.interval(1 - alpha, x + 1, n - x)[1]
    return c1, c2
You may check the code by e.g.:
p1,p2 = binomial_ci(2,7)
from scipy import stats
assert abs(stats.binom.cdf(1,7,p1)-.975)<1E-5
assert abs(stats.binom.cdf(2,7,p2)-.025)<1E-5
assert abs(binomial_ci(0,7, alpha=.1)[0])<1E-5
assert abs((1-binomial_ci(0,7, alpha=.1)[1])**7-0.05)<1E-5
assert abs(binomial_ci(7,7, alpha=.1)[1]-1)<1E-5
assert abs((binomial_ci(7,7, alpha=.1)[0])**7-0.05)<1E-5
I used the relation between the binomial proportion confidence interval and the regularized incomplete beta function, as described here:
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper%E2%80%93Pearson_interval
I needed to do this as well. I was using R and wanted to learn a way to work it out for myself. I would not say it is strictly pythonic.
The docstring explains most of it. It assumes you have scipy installed.
def exact_CI(x, N, alpha=0.95):
    """
    Calculate the exact confidence interval of a proportion
    where there is a wide range in the sample size or the proportion.

    This method avoids the assumption that data are normally distributed. The sample size
    and proportion are described by a beta distribution.

    Parameters
    ----------
    x: the number of cases from which the proportion is calculated, as a positive integer.
    N: the sample size as a positive integer.
    alpha: set at 0.95 for 95% confidence intervals.

    Returns
    -------
    The proportion with the lower and upper confidence intervals as a dict.
    """
    from scipy.stats import beta
    x = float(x)
    N = float(N)
    p = round((x/N)*100, 2)

    intervals = [round(i, 4)*100 for i in beta.interval(alpha, x, N - x + 1)]
    intervals.insert(0, p)

    result = {'Proportion': intervals[0], 'Lower CI': intervals[1], 'Upper CI': intervals[2]}

    return result
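A short usage sketch, assuming the function above (13 cases out of a sample of 100):

# returns a dict with 'Proportion', 'Lower CI' and 'Upper CI'
print(exact_CI(13, 100))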
Here is a numpy/scipy-free way of computing the same thing using the Wilson score and an approximation to the normal cumulative density function:
import math

def binconf(p, n, c=0.95):
    '''
    Calculate binomial confidence interval based on the number of positive and
    negative events observed.

    Parameters
    ----------
    p: int
        number of positive events observed
    n: int
        number of negative events observed
    c: optional, [0,1]
        confidence percentage. e.g. 0.95 means 95% confident the probability of
        success lies between the 2 returned values

    Returns
    -------
    theta_low  : float
        lower bound on confidence interval
    theta_high : float
        upper bound on confidence interval
    '''
    p, n = float(p), float(n)
    N = p + n

    if N == 0.0:
        return (0.0, 1.0)

    p = p / N
    z = normcdfi(1 - 0.5 * (1 - c))

    a1 = 1.0 / (1.0 + z * z / N)
    a2 = p + z * z / (2 * N)
    a3 = z * math.sqrt(p * (1 - p) / N + z * z / (4 * N * N))

    return (a1 * (a2 - a3), a1 * (a2 + a3))


def erfi(x):
    """Approximation to inverse error function"""
    a = 0.147  # MAGIC!!!
    a1 = math.log(1 - x * x)
    a2 = (
        2.0 / (math.pi * a)
        + a1 / 2.0
    )
    return (
        sign(x) *
        math.sqrt(math.sqrt(a2 * a2 - a1 / a) - a2)
    )


def sign(x):
    if x < 0: return -1
    if x == 0: return 0
    if x > 0: return 1


def normcdfi(p, mu=0.0, sigma2=1.0):
    """Inverse CDF of normal distribution"""
    if mu == 0.0 and sigma2 == 1.0:
        return math.sqrt(2) * erfi(2 * p - 1)
    else:
        return mu + math.sqrt(sigma2) * normcdfi(p)
Astropy provides such a function (although installing and importing astropy may be a bit excessive):
astropy.stats.binom_conf_interval
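A brief usage sketch (the positional call mirrors the one quoted later in this thread; keyword names may differ between astropy versions):

from astropy.stats import binom_conf_interval

# 4 successes out of 7 trials, 95% Wilson interval
print(binom_conf_interval(4, 7, 0.95, interval='wilson'))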
I am not an expert on statistics, but binomtest is built into SciPy and produces the same results as the accepted answer:
from scipy.stats import binomtest
binomtest(13, 100).proportion_ci()
Out[11]: ConfidenceInterval(low=0.07107304618545972, high=0.21204067708744978)
binomtest(4, 7).proportion_ci()
Out[25]: ConfidenceInterval(low=0.18405156764007, high=0.9010117215575631)
It uses the Clopper-Pearson exact method by default, which matches Curt's accepted answer; for comparison, that answer gives these values:
Usage:
>>> calcBin(13,100)
(0.07107391357421874, 0.21204372406005856)
>>> calcBin(4,7)
(0.18405151367187494, 0.9010086059570312)
It also has options for Wilson's method, with or without continuity correction, which matches TheBamf's astropy answer:
binomtest(4, 7).proportion_ci(method='wilson')
Out[32]: ConfidenceInterval(low=0.2504583645276572, high=0.8417801447485302)
binom_conf_interval(4, 7, 0.95, interval='wilson')
Out[33]: array([0.25045836, 0.84178014])
This also matches R's binom.test and statsmodels.stats.proportion.proportion_confint, according to cxrodgers' comment:
For 30 successes in 60 trials, both R's binom.test and statsmodels.stats.proportion.proportion_confint give (.37, .63) using Clopper-Pearson.
binomtest(30, 60).proportion_ci(method='exact')
Out[34]: ConfidenceInterval(low=0.3680620319424367, high=0.6319379680575633)