Estimating parameters of binomial distribution to use as machine learning features

Estimating parameters of binomial distribution to use as machine learning features - python

I'm working with genetic data in which alleles were observed n times in t number of chromosomes sequenced. In other words, n successes in t trials.
I want to include an estimate of each allele's frequency as a feature in a machine learning algorithm. I can of course get a point estimate with n/t, but I want to represent the confidence of that point estimate -- i.e. something about the likelihood of that estimate.
Now, I believe the negative binomial (or just binomial) distribution would be the right one to use, but
How can I estimate the parameters of the distribution in Python?
What representation of the distribution would be ideal as a feature for classical (non-NN) machine learning? A conservative estimate might be the 95% CI upper bound, but how would I calculate that, and is there a better way to featurize the distribution than just taking that one value?
Thanks!

I suppose that all of the required information that you need can be calculated by mean of the standard statistical methods without applying machine learning.
MLE estimate of the parameter p of your Binomial distribution
Bin(t,p) is just n/t as you properly suggested. If you want to get a confidence interval instead of a point estimate, there is one way to do it by means of the
Wald method:
where z is 1 - 0.5α quantile of a standard normal distribution. You can find more possibilities via the following link depending on your modelling assumptions: Binomial confidence intervals.
95% CI for p̂ can be calculated as indicated above with z = 1.96.
As for the feature engineering for the machine learning algorithm: since your parametric distribution basically depends only on one estimated parameter p (except for t which is given), you can use it directly as a feature for the unique distribution representation. It is also possible to add CI or variance as additional features of course. Everything depends on what exactly you are going to learn and what is your final objective/criterion is.

Binoculars implements many methods for calculating binomial confidence intervals. (PS: i am the author of Binoculars).
pip install bincoulars
If N=(total chromosomes sequenced) and p=(number of times allele is observed / N), you can estimate the confidence interval straightforwardly:
from binoculars import binomial_confidence
N, p = 100, 0.2
binomial_confidence(p, N)
# (0.1307892803998113, 0.28628125447599173)

Related

Robust Gaussian Fit

I have tried to find some literature on robust gaussian fits, all I could find was good old EM gaussian mixtures.
The question is : given a mixture of gaussians, find the dominant one around a given point.
The problem with gaussian mixtures is that you need to know how many components you have beforehand. If you don't, there are algos that will run for a range of components and choose the one with the least BIC or AIC. For data with high absolute kurtosis, you can get two (or more) components with relatively equal means but different standard deviations. You can start merging the results, but hyperparameters get in the way and it becomes a mess.
So I tried my own approach by tweaking the EM algorithm a little bit, I have one hyperparameter (bw for bandwidth) (mu is mean and std is standard deviation):
Start with a mu and a reasonable std.
Expectation : find the points in [mu-bw.std,mu+bw.std]
Maximization : recalc the mu and std for those points. correct the std by dividing by the std of a trimmed standard normal on [-bw,bw].
continue until convergence,
the weight of the local dominant gaussian is the share of points in [mu-bw.std,mu+bw.std] (E-step) divided by the integral of a standard
normal on [-bw,bw].
Here you can find a notebook
https://colab.research.google.com/drive/1kFSD1JVPoLFkWjydNj_7tQ91Z6BJDZRD?usp=sharing
I'm obviously weighing the points by a rectangular function. I was thinking of weighing by the gaussian itself (self-weighted). The mean wouldn't need correcting, but the weight and the std would. The weight is corrected by multiplying by (2sqrt(pi)) and the std by sqrt(2).
The pros of the self-weighted are that there is no need for a hyperparameter, it's faster in terms of loops, and has less bias on highly overlapped components. The con is that it will always converge to the global dominant gaussian whatever the starting point.
The pros of the rectangular-weighted are that it will converge on a local dominant, given a small enough bw (compared to the overlapping of components), although a small bw will have larger standard error on the parameters.
Edit : by this time, I have tried different mixtures and the self_weighted fails to converge. The correcting coefficients are wrong and I'm looking for help.

Limiting density of discrete points (LDDP) in python

Shannon's entropy from information theory measures the uncertainty or disorder in a discrete random variable's empirical distribution, while differential entropy measures it for a continuous r.v. The classical definition of differential entropy was found to be wrong, however, and was corrected with the Limiting density of discrete points (LDDP). Does scipy or other compute the LDDP? How can I estimate LDDP in python?

Since LDDP is equivalent to the negative KL-divergence from your density function m(x) to your probability distribution p(x), you might be able to use one of the many implementations of KL-divergence, for example from scipy.stats.entropy.
An appropriate procedure (assuming you have finite support) is to approximate the continuous distribution with a discrete one by sampling over its support, and calculating the KL divergence.
If this is not possible, then your only option that I can think of is probably to use numerical (or possibly analytic?) integration methods, of which you should have plenty. An easy first step would be to try monte-carlo methods.

What exactly is the variance on the parameters of SciPy curve fit? (Python)

I'm currently using the curve_fit function of the scipy.optimize package in Python, and know that if you take the square root of the diagonal entries of the covariance matrix that you get from curve_fit, you get the standard deviation on the parameters that curve_fit calculated. What I'm not sure about, is what exactly this standard deviation means. It's an approximation using a Hesse matrix as far as I understand, but what would the exact calculation be? Standard deviation on the Gaussian Bell Curve tells you what percentage of area is within a certain range of the curve, so I assumed for curve_fit it tells you how many datapoints are between certain parameter values, but apparently that isn't right...
I'm sorry if this should be basic knowledge for curve fitting, but I really can't figure out what the standard deviations do, they express an error on the parameters, but those parameters are calculated as the best possible fit for the function, it's not like there's a whole collection of optimal parameters, and we get the average value of that collection and consequently also have a standard deviation. There's only one optimal value, what is there to compare it with? I guess my question really comes down to this: how can I manually and accurately calculate these standard deviations, and not just get an approximation using a Hesse matrix?

The variance in the fitted parameters represents the uncertainty in the best-fit value based on the quality of the fit of the model to the data. That is, it describes by how much the value could change away from the best-fit value and still have a fit that is almost as good as the best-fit value.
With standard definition of chi-square,
chi_square = ( ( (data - model)/epsilon )**2 ).sum()
and reduced_chi_square = chi_square / (ndata - nvarys) (where data is the array of the data values, model the array of the calculated model, epsilon is uncertainty in the data, ndata is the number of data points, and nvarys the number of variables), a good fit should have reduced_chi_square around 1 or chi_square around ndata-nvary. (Note: not 0 -- the fit will not be perfect as there is noise in the data).
The variance in the best-fit value for a variable gives the amount by which you can change the value (and re-optimize all other values) and increase chi-square by 1. That gives the so-called '1-sigma' value of the uncertainty.
As you say, these values are expressed in the diagonal terms of the covariance matrix returned by scipy.optimize.curve_fit (the off-diagonal terms give the correlations between variables: if a value for one variable is changed away from its optimal value, how would the others respond to make the fit better). This covariance matrix is built using the trial values and derivatives near the solution as the fit is being done -- it calculates the "curvature" of the parameter space (ie, how much chi-square changes when a variables value changes).
You can calculate these uncertainties by hand. The lmfit library (https://lmfit.github.io/lmfit-py/) has routines to more explicitly explore the confidence intervals of variables from least-squares minimization or curve-fitting. These are described in more detail at
https://lmfit.github.io/lmfit-py/confidence.html. It's probably easiest to use lmfit for the curve-fitting rather than trying to re-implement the confidence interval code for curve_fit.

python scipy.stats pdf and expect functions

I was wondering if someone could please explain what the following functions in scipy.stats do:
rv_continuous.expect
rv_continuous.pdf
I have read the documentation but I am still confused.
Here is my task, quite simple in theory, but I am still confused with what these functions do.
So, I have a list of areas, 16383 values. I want to find the probability that the variable area takes any value between a smaller value , called "inf" and a larger value "sup".
So, what I thought is:
scipy.stats.rv_continuous.pdf(a) #a being the list of areas
scipy.stats.rv_continuous.expect(pdf, lb = inf, ub = sup)
So that i can get the probability that any area is between sup and inf.
Can anyone help me by explaining in a simple way what the functions do and any hint on how to compute the integral of f(a) between inf and sup, please?
Thanks
Blaise

rv_continuous is a base class for all of the probability distributions implemented in scipy.stats. You would not call methods on rv_continuous yourself.
Your question is not entirely clear about what you want to do, so I will assume that you have an array of 16383 data points drawn from some unknown probability distribution. From the raw data, you will need to estimate the cumulative distribution, find the values of that cumulative distribution at the sup and inf values and subtract to find the probability that a value drawn from the unknown distribution.
There are lots of ways to estimate the unknown distribution from the data depending on how much modelling you want to do and how many assumptions you want to make. At the more complicated end of the spectrum, you could try to fit one of the standard parametric probability distributions to the data. For example, if you had a suspicion that your data were lognormally distributed, you could use scipy.stats.lognorm.fit(data, floc=0) to find the parameters of the lognormal distribution that fit your data. Then you could use scipy.stats.lognorm.cdf(sup, *params) - scipy.stats.lognorm.cdf(inf, *params) to estimate the probability of the value being between those values.
In the middle are the non-parametric forms of distribution estimation like histograms and kernel density estimates. For example, scipy.stats.gaussian_kde(data).integrate_box_1d(inf, sup) is an easy way to make this estimate using a Gaussian kernel density estimate of the unknown distribution. However, kernel density estimates aren't always appropriate and require some tweaking to get right.
The simplest thing you could do is just count the number of data points that fall between inf and sup and divide by the total number of data points that you have. This only works well with a largish number of points (which you have) and with bounds that aren't too far in the tails of the data.

The cumulative density function might give you what you want.
Then the probability P of being between two values is
P(inf < area < sup) = cdf(sup) - cdf(inf)
There's a tutorial about probabilities here and here
They are all related. The pdf is the "density" of the probabilities. They must be greater than zero and sum to 1. I think of it as indicating how likely something is. The expectation is is a generalisation of the idea of average.
E[x] = sum(x.P(x))

Sampling methods

Can you help me out with these questions? I'm using Python
Sampling Methods
Sampling (or Monte Carlo) methods form a general and useful set of techniques that use random numbers to extract information about (multivariate) distributions and functions. In the context of statistical machine learning, we are most often concerned with drawing samples from distributions to obtain estimates of summary statistics such as the mean value of the distribution in question.
When we have access to a uniform (pseudo) random number generator on the unit interval (rand in Matlab or runif in R) then we can use the transformation sampling method described in Bishop Sec. 11.1.1 to draw samples from more complex distributions. Implement the transformation method for the exponential distribution
$$p(y) = \lambda \exp(−\lambda y) , y \geq 0$$
using the expressions given at the bottom of page 526 in Bishop: Slice sampling involves augmenting z with an additional variable u and then drawing samples from the joint (z,u) space.
The crucial point of sampling methods is how many samples are needed to obtain a reliable estimate of the quantity of interest. Let us say we are interested in estimating the mean, which is
$$\mu_y = 1/\lambda$$
in the above distribution, we then use the sample mean
$$b_y = \frac1L \sum^L_{\ell=1} y(\ell)$$
of the L samples as our estimator. Since we can generate as many samples of size L as we want, we can investigate how this estimate on average converges to the true mean. To do this properly we need to take the absolute diﬀerence
$$|\mu_y − b_y|$$
between the true mean $µ_y$ and estimate $b_y$
averaged over many, say 1000, repetitions for several values of $L$, say 10, 100, 1000.
Plot the expected absolute deviation as a function of $L$.
Can you plot some transformed value of expected absolute deviation to get a more or less straight line and what does this mean?
I'm new to this kind of statistical machine learning and really don't know how to implement it in Python. Can you help me out?

There are a few shortcuts you can take. Python has some built-in methods to do sampling, mainly in the Scipy library. I can recommend a manuscript that implements this idea in Python (disclaimer: I am the author), located here.
It is part of a larger book, but this isolated chapter deals with the more general Law of Large Numbers + convergence, which is what you are describing. The paper deals with Poisson random variables, but you should be able to adapt the code to your own situation.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.