How to separate two distributions from a pdf (probability density function)? - Python

Assume the pdf (probability density function) of my dataset is as below; there are two distributions in the dataset and I want to fit them with a Gamma and a Gaussian distribution.
I have read the question below, but it applies only to two Gaussian distributions:
How to use python to separate two gaussian curves?
Here are the steps I would like to take:
1. Manually estimate the mean and variance of the Gaussian distribution.
2. Based on the estimated mean and variance, create the pdf of the Gaussian distribution.
3. Subtract the pdf of the Gaussian from the original pdf.
4. Fit the remaining pdf to a Gamma distribution.
I am able to do steps 1-3, but for step 4 I do not know how to fit a Gamma distribution from a pdf
(not from data samples, so stats.gamma.fit(data) does not work here).
Are the above steps reasonable for dealing with this kind of problem, and how can I do step 4 in Python? Any help is appreciated.
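One possible way to handle step 4, assuming the residual pdf from step 3 is available as a curve sampled on a grid, is to least-squares fit the parametric Gamma pdf to that curve with scipy.optimize.curve_fit. The names below (x, residual_pdf, gamma_pdf) are placeholders and the residual curve is synthetic; this is only a sketch, not the asker's code:

import numpy as np
from scipy import stats, optimize

# Placeholder grid and target curve; in practice these would be the x values
# and the pdf left over after subtracting the estimated Gaussian (steps 1-3).
x = np.linspace(0.01, 20, 500)
residual_pdf = stats.gamma.pdf(x, a=3.0, scale=1.5)

def gamma_pdf(x, shape, scale):
    # Parametric Gamma pdf, fitted to the curve rather than to raw samples.
    return stats.gamma.pdf(x, a=shape, scale=scale)

params, _ = optimize.curve_fit(gamma_pdf, x, residual_pdf, p0=[2.0, 1.0], bounds=(1e-6, np.inf))
shape_hat, scale_hat = params
print(shape_hat, scale_hat)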

Interesting question. One issue I can see is that it will sometimes be difficult to disambiguate which mode is the Gamma and which is the Gaussian.
What I would perhaps do is try an expectation-maximization (EM) algorithm. Given the ambiguity stated above, I would do a few runs and select the best fit.
Perhaps, to speed things up a little, you could try EM with two possible starting points (somewhat similar to your idea):
1. Estimate a mixture model of two Gaussians, say g0, g1.
2. Run one EM to solve a (Gamma, Gaussian) mixture, starting from the initial point (gamma_params_approx_gauss(g0), g1), where gamma_params_approx_gauss(g0) is a maximum-likelihood estimator of Gamma parameters given Gaussian parameters (see e.g. here; a simple stand-in is sketched after this list).
3. Run another EM, but starting from (gamma_params_approx_gauss(g1), g0).
4. Select the better of the two EM fits.
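As an illustration of the gamma_params_approx_gauss idea: the linked answer uses a maximum-likelihood estimator, but a simple moment-matching stand-in (a hypothetical helper, not from the answer) looks like this:

def gamma_params_approx_gauss(mu, sigma):
    # Match the Gamma(shape k, scale theta) moments to the Gaussian's:
    # mean = k * theta, variance = k * theta**2.
    k = (mu / sigma) ** 2
    theta = sigma ** 2 / mu
    return k, theta

# Example: a Gaussian component with mean 5 and std 2 maps to shape 6.25, scale 0.8.
print(gamma_params_approx_gauss(5.0, 2.0))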

Related

Robust Gaussian Fit

I have tried to find some literature on robust Gaussian fits; all I could find was good old EM Gaussian mixtures.
The question is: given a mixture of Gaussians, find the dominant one around a given point.
The problem with Gaussian mixtures is that you need to know how many components you have beforehand. If you don't, there are algorithms that will run for a range of component counts and choose the one with the lowest BIC or AIC. For data with high absolute kurtosis, you can get two (or more) components with roughly equal means but different standard deviations. You can start merging the results, but hyperparameters get in the way and it becomes a mess.
So I tried my own approach by tweaking the EM algorithm a little. I have one hyperparameter, bw (bandwidth); mu is the mean and std is the standard deviation:
Start with a mu and a reasonable std.
Expectation: find the points in [mu - bw*std, mu + bw*std].
Maximization: recalculate mu and std from those points; correct the std by dividing by the std of a standard normal trimmed to [-bw, bw].
Continue until convergence.
The weight of the local dominant Gaussian is the share of points in [mu - bw*std, mu + bw*std] (E-step) divided by the integral of a standard normal on [-bw, bw].
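For concreteness, here is a rough sketch of that loop (all names are mine, the notebook linked below is the authoritative version, and scipy is assumed to be available):

import numpy as np
from scipy import stats

def local_dominant_gaussian(x, mu0, std0, bw=1.5, tol=1e-6, max_iter=200):
    # Rectangular-weighted local EM as described above (sketch only).
    trunc_std = stats.truncnorm(-bw, bw).std()       # std of a standard normal trimmed to [-bw, bw]
    mass = stats.norm.cdf(bw) - stats.norm.cdf(-bw)  # integral of N(0, 1) over [-bw, bw]
    mu, std = mu0, std0
    window = x
    for _ in range(max_iter):
        window = x[(x >= mu - bw * std) & (x <= mu + bw * std)]  # E-step: points in the window
        if window.size < 2:
            break
        new_mu = window.mean()                    # M-step: recalculate mu...
        new_std = window.std(ddof=1) / trunc_std  # ...and std, corrected for the trimming
        converged = abs(new_mu - mu) < tol and abs(new_std - std) < tol
        mu, std = new_mu, new_std
        if converged:
            break
    weight = (window.size / x.size) / mass        # share of points, corrected for the trimming
    return mu, std, weight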
Here you can find a notebook: https://colab.research.google.com/drive/1kFSD1JVPoLFkWjydNj_7tQ91Z6BJDZRD?usp=sharing
I'm obviously weighting the points by a rectangular function. I was thinking of weighting by the Gaussian itself (self-weighted). The mean wouldn't need correcting, but the weight and the std would: the weight is corrected by multiplying by 2*sqrt(pi) and the std by sqrt(2).
The pros of the self-weighted variant are that there is no need for a hyperparameter, it is faster in terms of loops, and it has less bias on highly overlapping components. The con is that it always converges to the globally dominant Gaussian, whatever the starting point.
The pro of the rectangular-weighted variant is that it converges on a locally dominant Gaussian, given a small enough bw (relative to the overlap of the components), although a small bw gives a larger standard error on the parameters.
Edit: since then, I have tried different mixtures and the self_weighted variant fails to converge. The correction coefficients are wrong and I'm looking for help.

Find underlying normal distribution of random vectors

I am trying to solve a statistics-related real-world problem with Python and am looking for input on my ideas: I have N random vectors from an m-dimensional normal distribution. I have no information about the mean and the covariance matrix of the underlying distribution; in fact, even that it is a normal distribution is only an assumption, though a very plausible one. I want to compute an approximation of the mean vector and covariance matrix of the distribution. The number of random vectors is on the order of 100 to 300, and the dimensionality of the normal distribution is somewhere between 2 and 5. The calculation should ideally take no more than 1 minute on a standard home computer.
I am currently considering three approaches and would welcome suggestions for other approaches or preferences among these three:
Fitting: Make a multidimensional histogram of all random vectors and fit a multidimensional normal distribution to the histogram. Problem with this approach: the covariance matrix has many entries, which could be a problem for the fitting process?
Invert the cumulative distribution function: Make a multidimensional histogram as an approximation of the density function of the random vectors, then integrate it to get a multidimensional cumulative distribution function. In one dimension this is invertible, and one could use the CDF to generate random numbers distributed like the original distribution. Problem: in the multidimensional case the CDF is not invertible(?), and I don't know whether this approach still works.
Bayesian: Use Bayesian statistics with some normal distribution as a prior and update it for every observation. The result should always again be a normal distribution. Problem: I think this is computationally expensive? Also, I don't want later updates to have more impact on the resulting distribution than earlier ones.
Also, maybe there is a library that already implements this task? I did not find exactly this in NumPy or SciPy; maybe someone has an idea where else to look?
If the simple estimates described in the Parameter estimation section of the Wikipedia article on the multivariate normal distribution are sufficient for your needs, you can use numpy.mean to compute the mean and numpy.cov to compute the sample covariance matrix.
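For example (the data here is synthetic, just to show the two calls):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the N observed m-dimensional vectors (one row per vector).
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=[[2.0, 0.3, 0.0],
                                 [0.3, 1.0, 0.2],
                                 [0.0, 0.2, 0.5]],
                            size=200)

mean_est = np.mean(X, axis=0)      # estimated mean vector
cov_est = np.cov(X, rowvar=False)  # unbiased sample covariance matrix
print(mean_est)
print(cov_est)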

Fitting a curve: which model to describe the distribution in weighted knowledge graphs

As a simple model to represent a knowledge network and learn about properties of weighted graphs, I computed the cosine similarity between Wikipedia articles.
I am now looking at the distribution of the similarity weights for each article (see pictures).
In the pictures, you can see that the curve changes slope around a certain value (perhaps from exponential to linear): I would like to fit the curve and extract the value where the derivative visibly (or expectedly) changes, so that I can divide similar articles into two sets: the "most similar" (left of the threshold) and the "others" (right of the threshold).
I want to fit the curve for each article's distribution; compare each distribution with the mean distribution over all articles; and compare each distribution with the distribution of a random weighted network.
(Your suggestions on a working procedure are most welcome: I would like to use this as a toy model to then study how a network, or an article, may evolve in time.)
My background is user experience with a twist of data science. I wish to better understand which model may describe the distribution of values I observed, a proper way to compare distributions, and Python tools (or Mathematica 11) to fit the curve and obtain the derivative at each point.
Which model do you suggest to describe the distribution of observed similarity values between objects in a weighted network (here, a collaborative knowledge base is represented as a weighted network, where the weight is the similarity value of two given articles)? Should I expect an exponential? A Poissonian? Why?
How do I compute the curve fit and extract the derivative of the curve at a given point (Python or Mathematica 11)?
Working with Mathematica, suppose your data is in the list data. If you want to find the cubic polynomial that best fits your data, use the Fit function:
Fit[data, {1, x, x^2, x^3}, x]
In general the usage for the Fit command looks like
Fit["data set", "list of functions", "independent variable"]
where Mathematica tries to fit a linear combination of the functions in that list to your data set. I'm not sure what sort of curve we should expect this data to be best modeled by, but remember that any smooth function can be approximated to arbitrary precision on a bounded interval by a polynomial with sufficiently many terms. So if you have the computational power to spare, just let your list of functions be a long list of powers of x. It does look like you have an asymptote at x = 0, though, so you might allow a 1/x term to capture that. You can then use Plot to draw your fitted curve on top of your data to compare them visually.
Now to get this best fit curve as a function in Mathematica that you can take a derivative of:
f[x_] := Fit[data, {1, x, x^2, x^3}, x]
And then the obvious change you are talking about occurs when the second derivative is zero, so to get that x value:
NSolve[f''[x] == 0, x]
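Since the question also asks about Python, a rough NumPy equivalent (with placeholder x and y arrays standing in for the similarity curve) could be:

import numpy as np

x = np.linspace(0.01, 1.0, 200)   # placeholder abscissa
y = np.exp(-5 * x) + 0.1 * x      # placeholder curve; substitute your own data

coeffs = np.polyfit(x, y, deg=3)        # cubic least-squares fit
second_deriv = np.polyder(coeffs, m=2)  # coefficients of f''(x)
print(np.roots(second_deriv))           # x values where f''(x) == 0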

Sampling methods

Can you help me out with these questions? I'm using Python.
Sampling Methods
Sampling (or Monte Carlo) methods form a general and useful set of techniques that use random numbers to extract information about (multivariate) distributions and functions. In the context of statistical machine learning, we are most often concerned with drawing samples from distributions to obtain estimates of summary statistics such as the mean value of the distribution in question.
When we have access to a uniform (pseudo) random number generator on the unit interval (rand in Matlab or runif in R), we can use the transformation sampling method described in Bishop, Sec. 11.1.1, to draw samples from more complex distributions. Implement the transformation method for the exponential distribution
$$p(y) = \lambda \exp(-\lambda y), \quad y \geq 0$$
using the expressions given at the bottom of page 526 in Bishop. Slice sampling involves augmenting $z$ with an additional variable $u$ and then drawing samples from the joint $(z, u)$ space.
The crucial point of sampling methods is how many samples are needed to obtain a reliable estimate of the quantity of interest. Let us say we are interested in estimating the mean, which is
$$\mu_y = 1/\lambda$$
in the above distribution. We then use the sample mean
$$b_y = \frac{1}{L} \sum_{\ell=1}^{L} y^{(\ell)}$$
of the L samples as our estimator. Since we can generate as many samples of size L as we want, we can investigate how this estimate on average converges to the true mean. To do this properly we need to take the absolute difference
$$|\mu_y - b_y|$$
between the true mean $\mu_y$ and the estimate $b_y$,
averaged over many, say 1000, repetitions for several values of $L$, say 10, 100, 1000.
Plot the expected absolute deviation as a function of $L$.
Can you plot some transformed value of the expected absolute deviation to get a more or less straight line, and what does this mean?
I'm new to this kind of statistical machine learning and really don't know how to implement it in Python. Can you help me out?
There are a few shortcuts you can take. Python has ready-made methods for sampling, mainly in the SciPy library. I can recommend a manuscript that implements this idea in Python (disclaimer: I am the author), located here.
It is part of a larger book, but this isolated chapter deals with the more general Law of Large Numbers and convergence, which is what you are describing. The chapter deals with Poisson random variables, but you should be able to adapt the code to your own situation.
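For the exercise itself, a small sketch (assuming, for illustration, a rate lam = 2.0 and the inverse-CDF transform y = -ln(1 - u)/lam; all names are placeholders) might look like this:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
lam = 2.0               # assumed rate parameter
true_mean = 1.0 / lam

def sample_exponential(L):
    u = rng.uniform(size=L)        # uniform samples on (0, 1)
    return -np.log(1.0 - u) / lam  # transformation method for the exponential

Ls = [10, 100, 1000]
mean_abs_dev = []
for L in Ls:
    devs = [abs(true_mean - sample_exponential(L).mean()) for _ in range(1000)]
    mean_abs_dev.append(np.mean(devs))

plt.loglog(Ls, mean_abs_dev, "o-")  # on log-log axes the error falls roughly as 1/sqrt(L)
plt.xlabel("L (samples per estimate)")
plt.ylabel("expected |mu_y - b_y|")
plt.show()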

Fitting a bimodal distribution to a set of values

Given a 1D array of values, what is the simplest way to figure out the best-fit bimodal distribution, where each 'mode' is a normal distribution? In other words, how can you find the combination of two normal distributions that best reproduces the 1D array of values?
Specifically, I'm interested in implementing this in Python, but answers don't have to be language specific.
Thanks!
What you are trying to do is fit a Gaussian mixture model. The standard approach to solving this is Expectation Maximization; the scipy svn includes a machine-learning and EM package called scikits. I use it a fair bit.
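These days the same EM fit is available in scikit-learn, a successor of the scikits mentioned above; a minimal sketch on synthetic bimodal data might look like this:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic bimodal data: samples from two Gaussian components.
values = np.concatenate([rng.normal(-2.0, 0.5, 500),
                         rng.normal(3.0, 1.0, 500)])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(values.reshape(-1, 1))   # the API expects a 2D (n_samples, n_features) array

print("means:", gmm.means_.ravel())
print("stds:", np.sqrt(gmm.covariances_).ravel())
print("weights:", gmm.weights_)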
I suggest using the awesome scipy package.
It provides a few methods for optimisation.
There's a big fat caveat with simply applying a pre-defined least-squares fit or something along those lines.
Here are a few problems you will run into:
Noise larger than the second peak, or than both peaks.
Partial peak - your data is cut off at one of the borders.
Sampling - the peaks are narrower than your sampling resolution.
It isn't normal - you'll still get some result...
Overlap - if the peaks overlap, you'll often find that one peak is fitted correctly but the second approaches zero...
