I have tried to find some literature on robust gaussian fits, all I could find was good old EM gaussian mixtures.
The question is : given a mixture of gaussians, find the dominant one around a given point.
The problem with gaussian mixtures is that you need to know how many components you have beforehand. If you don't, there are algos that will run for a range of components and choose the one with the least BIC or AIC. For data with high absolute kurtosis, you can get two (or more) components with relatively equal means but different standard deviations. You can start merging the results, but hyperparameters get in the way and it becomes a mess.
So I tried my own approach by tweaking the EM algorithm a little bit, I have one hyperparameter (bw for bandwidth) (mu is mean and std is standard deviation):
Start with a mu and a reasonable std.
Expectation : find the points in [mu-bw.std,mu+bw.std]
Maximization : recalc the mu and std for those points. correct the std by dividing by the std of a trimmed standard normal on [-bw,bw].
continue until convergence,
the weight of the local dominant gaussian is the share of points in [mu-bw.std,mu+bw.std] (E-step) divided by the integral of a standard
normal on [-bw,bw].
Here you can find a notebook
https://colab.research.google.com/drive/1kFSD1JVPoLFkWjydNj_7tQ91Z6BJDZRD?usp=sharing
I'm obviously weighing the points by a rectangular function. I was thinking of weighing by the gaussian itself (self-weighted). The mean wouldn't need correcting, but the weight and the std would. The weight is corrected by multiplying by (2sqrt(pi)) and the std by sqrt(2).
The pros of the self-weighted are that there is no need for a hyperparameter, it's faster in terms of loops, and has less bias on highly overlapped components. The con is that it will always converge to the global dominant gaussian whatever the starting point.
The pros of the rectangular-weighted are that it will converge on a local dominant, given a small enough bw (compared to the overlapping of components), although a small bw will have larger standard error on the parameters.
Edit : by this time, I have tried different mixtures and the self_weighted fails to converge. The correcting coefficients are wrong and I'm looking for help.
Related
Assume the pdf(probability density function) of my dataset is as below, there are two distributions in the dataset and I want to fit them with a Gamma and a Gaussian distribution.
I have read this but this applies only for two Gaussian distribution.
How to use python to separate two gaussian curves?
Here is the steps that I would like to do
manually estimate the mean and variance of the Gaussian distribution
base on estimated mean and variance, create the pdf of the Gaussian distribution
substract the pdf of Gaussian from the original pdf
fit the pdf to a Gamma distribution
I am able to do 1~3, but for step 4 I do not know how to fit a Gamma distribution from a pdf
(not from data samples so stats.gamma.fit(data) does not work here).
Are the above steps reasonable for dealing with this kind of problem, and how to do step 4 in Python ? Appreciated for any help.
Interesting question. One issue I can see is that it will be sometimes difficult to disambiguate which mode is the Gamma and which is the Gaussian.
What I would perhaps do is try an expectation-maximization (EM) algorithm. Given the ambiguity stated above, I would do a few runs and select the best fit.
Perhaps, to speed things up a little, you could try EM with two possible starting points (somewhat similar to your idea):
estimate a mixture model of two Gaussians, say g0, g1.
run one EM to solve a mixture (Gamma, Gaussian), starting with an initial point that is (gamma_params_approx_gauss(g0), g1), where gamma_params_approx_gauss(g0) is a maximum-likelihood estimator of Gamma parameters given Gaussian parameters (see e.g. here).
run another EM, but starting with (gamma_params_approx_gauss(g1), g0).
select the best fit among the two EM results.
I'm working with genetic data in which alleles were observed n times in t number of chromosomes sequenced. In other words, n successes in t trials.
I want to include an estimate of each allele's frequency as a feature in a machine learning algorithm. I can of course get a point estimate with n/t, but I want to represent the confidence of that point estimate -- i.e. something about the likelihood of that estimate.
Now, I believe the negative binomial (or just binomial) distribution would be the right one to use, but
How can I estimate the parameters of the distribution in Python?
What representation of the distribution would be ideal as a feature for classical (non-NN) machine learning? A conservative estimate might be the 95% CI upper bound, but how would I calculate that, and is there a better way to featurize the distribution than just taking that one value?
Thanks!
I suppose that all of the required information that you need can be calculated by mean of the standard statistical methods without applying machine learning.
MLE estimate of the parameter p of your Binomial distribution
Bin(t,p) is just n/t as you properly suggested. If you want to get a confidence interval instead of a point estimate, there is one way to do it by means of the
Wald method:
where z is 1 - 0.5α quantile of a standard normal distribution. You can find more possibilities via the following link depending on your modelling assumptions: Binomial confidence intervals.
95% CI for p̂ can be calculated as indicated above with z = 1.96.
As for the feature engineering for the machine learning algorithm: since your parametric distribution basically depends only on one estimated parameter p (except for t which is given), you can use it directly as a feature for the unique distribution representation. It is also possible to add CI or variance as additional features of course. Everything depends on what exactly you are going to learn and what is your final objective/criterion is.
Binoculars implements many methods for calculating binomial confidence intervals. (PS: i am the author of Binoculars).
pip install bincoulars
If N=(total chromosomes sequenced) and p=(number of times allele is observed / N), you can estimate the confidence interval straightforwardly:
from binoculars import binomial_confidence
N, p = 100, 0.2
binomial_confidence(p, N)
# (0.1307892803998113, 0.28628125447599173)
Question: The best way to find out the Eps and MinPts parameters for DBSCAN algorithm?
Problem: The goal is to find the locations (clusters) based on coordinates (input data). The algorithm calculates the most visited areas and retrieves these clusters.
Approach:
I defined the epsilon (EPS) parameter as 1.5 km - converted to radians to be used by the DBSCAN algorithm: epsilon = 1.5 / 6371.0088 (ref to this 1.5 km: https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/).
If I define the MinPts to a low value (e.g. MinPts = 5, it will produce 2000 clusters), the DBSCAN will produce too many clusters and I want to limit the relevance/size of the clusters to an acceptable value.
I use the haversine metric and ball tree algorithm to calculate great-circle distances between points.
Suggestions:
knn approach to find EPS;
domain knowledge and to decide the best values for EPS and MinPts.
Data: I'm using 160k coordinates but the program should be capable to handle different data inputs.
As you may know, setting MinPts high will not only prevent small clusters from forming, but will also change the shape of larger clusters as its outskirts will be considered outliers.
Consider instead a third way to reduce the number of clusters; simply sort by descending size (number of coordinates) and limit that to 4 or 5. This way, you won't be shown all the small clusters if you're not interested in them, but you can instead treat all those points as noise.
You're essentially using DBSCAN for something it's not meant for, namely to find the n largest clusters, but that's fine - you just need to "tweak the algorithm" to fit your use case.
Update
If you know the entire dataset and it will not change in the future, I would just tweak minPts manually, based on your knowledge.
In scientific environments and with varying data sets, you consider the data as "generated from a stochastic process". However, that would mean that there is a chance - no matter how small - that there are minPts dogs in a remote forest somewhere at the same time, or minPts - 1 dogs in Central Park, where it's normally overcrowded.
What I mean by that is that if you go down the scientific road, you need to find a balance between the deterministic value of minPts and the probabilistic distribution of the points in your data set.
In my experience, it all comes down to whether or not you trust your knowledge, or would like to defer responsibility. In some government/scientific/large corporate positions, it's a safer choice to pin something on an algorithm than on gut feeling. In other situations, it's safe to use gut feeling.
I was wondering if someone could please explain what the following functions in scipy.stats do:
rv_continuous.expect
rv_continuous.pdf
I have read the documentation but I am still confused.
Here is my task, quite simple in theory, but I am still confused with what these functions do.
So, I have a list of areas, 16383 values. I want to find the probability that the variable area takes any value between a smaller value , called "inf" and a larger value "sup".
So, what I thought is:
scipy.stats.rv_continuous.pdf(a) #a being the list of areas
scipy.stats.rv_continuous.expect(pdf, lb = inf, ub = sup)
So that i can get the probability that any area is between sup and inf.
Can anyone help me by explaining in a simple way what the functions do and any hint on how to compute the integral of f(a) between inf and sup, please?
Thanks
Blaise
rv_continuous is a base class for all of the probability distributions implemented in scipy.stats. You would not call methods on rv_continuous yourself.
Your question is not entirely clear about what you want to do, so I will assume that you have an array of 16383 data points drawn from some unknown probability distribution. From the raw data, you will need to estimate the cumulative distribution, find the values of that cumulative distribution at the sup and inf values and subtract to find the probability that a value drawn from the unknown distribution.
There are lots of ways to estimate the unknown distribution from the data depending on how much modelling you want to do and how many assumptions you want to make. At the more complicated end of the spectrum, you could try to fit one of the standard parametric probability distributions to the data. For example, if you had a suspicion that your data were lognormally distributed, you could use scipy.stats.lognorm.fit(data, floc=0) to find the parameters of the lognormal distribution that fit your data. Then you could use scipy.stats.lognorm.cdf(sup, *params) - scipy.stats.lognorm.cdf(inf, *params) to estimate the probability of the value being between those values.
In the middle are the non-parametric forms of distribution estimation like histograms and kernel density estimates. For example, scipy.stats.gaussian_kde(data).integrate_box_1d(inf, sup) is an easy way to make this estimate using a Gaussian kernel density estimate of the unknown distribution. However, kernel density estimates aren't always appropriate and require some tweaking to get right.
The simplest thing you could do is just count the number of data points that fall between inf and sup and divide by the total number of data points that you have. This only works well with a largish number of points (which you have) and with bounds that aren't too far in the tails of the data.
The cumulative density function might give you what you want.
Then the probability P of being between two values is
P(inf < area < sup) = cdf(sup) - cdf(inf)
There's a tutorial about probabilities here and here
They are all related. The pdf is the "density" of the probabilities. They must be greater than zero and sum to 1. I think of it as indicating how likely something is. The expectation is is a generalisation of the idea of average.
E[x] = sum(x.P(x))
Can you help me out with these questions? I'm using Python
Sampling Methods
Sampling (or Monte Carlo) methods form a general and useful set of techniques that use random numbers to extract information about (multivariate) distributions and functions. In the context of statistical machine learning, we are most often concerned with drawing samples from distributions to obtain estimates of summary statistics such as the mean value of the distribution in question.
When we have access to a uniform (pseudo) random number generator on the unit interval (rand in Matlab or runif in R) then we can use the transformation sampling method described in Bishop Sec. 11.1.1 to draw samples from more complex distributions. Implement the transformation method for the exponential distribution
$$p(y) = \lambda \exp(−\lambda y) , y \geq 0$$
using the expressions given at the bottom of page 526 in Bishop: Slice sampling involves augmenting z with an additional variable u and then drawing samples from the joint (z,u) space.
The crucial point of sampling methods is how many samples are needed to obtain a reliable estimate of the quantity of interest. Let us say we are interested in estimating the mean, which is
$$\mu_y = 1/\lambda$$
in the above distribution, we then use the sample mean
$$b_y = \frac1L \sum^L_{\ell=1} y(\ell)$$
of the L samples as our estimator. Since we can generate as many samples of size L as we want, we can investigate how this estimate on average converges to the true mean. To do this properly we need to take the absolute difference
$$|\mu_y − b_y|$$
between the true mean $µ_y$ and estimate $b_y$
averaged over many, say 1000, repetitions for several values of $L$, say 10, 100, 1000.
Plot the expected absolute deviation as a function of $L$.
Can you plot some transformed value of expected absolute deviation to get a more or less straight line and what does this mean?
I'm new to this kind of statistical machine learning and really don't know how to implement it in Python. Can you help me out?
There are a few shortcuts you can take. Python has some built-in methods to do sampling, mainly in the Scipy library. I can recommend a manuscript that implements this idea in Python (disclaimer: I am the author), located here.
It is part of a larger book, but this isolated chapter deals with the more general Law of Large Numbers + convergence, which is what you are describing. The paper deals with Poisson random variables, but you should be able to adapt the code to your own situation.