Scipy - Inverse Sampling Method from custom probability density function - python

I am trying to perform inverse sampling from a custom probability density function (PDF). I am wondering if this is even possible, i.e. integrating the PDF, inverting the result and then solving it for a given uniform number. The PDF has the shape f(x, alpha, mean(x)) = (1/(Gamma(alpha+1)*x)) * ((x*(alpha+1)/mean(x))^(alpha+1)) * exp(-(alpha+1)*(x/mean(x))), where x > 0. From the shape, only values below 150 are relevant, and for what I am trying to do values below 80 are good enough. Extending the range shouldn't be too hard though.
I have tried the inversion method, but only found a numerical way to do the integral, which isn't necessarily helpful considering that I need to invert the function to solve:
u = integral(f(x, alpha, mean(x))dx) from 0 to y, where y is unknown and u is a uniform random variable between 0 and 1.
The integral has a gamma function and an incomplete gamma function, so trying to invert it is kind of a mess. Any help is welcome.
Thanks a bunch in advance.
Cheers

Assuming you mean that you're trying to randomly choose values which will be distributed according to your PDF, then yes, it is possible. This is described on Wikipedia as inverse transform sampling. Basically, it's just what you said: integrate the PDF to produce the cumulative distribution (CDF), invert it (which can be done ahead of time), and then choose a random number and run it through the inverted CDF.
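If your domain is 0 to positive infinity, your distribution appears to match the gamma distribution, which is built into NumPy and SciPy, with shape k = alpha+1 and scale theta = mean(x)/(alpha+1). With these parameters the mean of the gamma distribution, k*theta, equals mean(x).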
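A minimal sketch of both routes, under some assumptions: matching the PDF in the question term by term against the gamma density gives shape alpha+1 and scale mean(x)/(alpha+1) (an identification made here, worth double-checking against your own formula), and the values of alpha and x_mean below are made up for illustration. The first route uses scipy.stats.gamma.ppf as the inverted CDF; the second tabulates the CDF numerically on a grid and inverts it by interpolation, which also works when no closed-form inverse is available.

```python
import numpy as np
from scipy.special import gamma as gamma_fn
from scipy.stats import gamma

alpha = 2.0     # hypothetical shape parameter of the question's PDF
x_mean = 20.0   # hypothetical value of mean(x)

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)  # uniform numbers in [0, 1)

# Route 1: the PDF is a gamma density, so use its built-in inverse CDF (ppf).
k = alpha + 1.0
theta = x_mean / (alpha + 1.0)
samples = gamma.ppf(u, a=k, scale=theta)

# Route 2: purely numerical inversion of the PDF as written in the question.
def pdf(x):
    z = (alpha + 1.0) * x / x_mean
    return z ** (alpha + 1.0) * np.exp(-z) / (gamma_fn(alpha + 1.0) * x)

grid = np.linspace(1e-6, 150.0, 20_001)   # "sub-150" range from the question
cdf = np.cumsum(pdf(grid))                # crude Riemann-sum CDF on the grid
cdf /= cdf[-1]                            # normalize so the CDF ends at 1
samples_numeric = np.interp(u, cdf, grid) # look up x for each uniform u
```

Both arrays should have a sample mean close to x_mean; the grid resolution and the cutoff at 150 control the accuracy of the second route.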

Related

scipy.stats.wasserstein_distance implementation

I am trying to understand the implementation that is used in
scipy.stats.wasserstein_distance
for p=1 and no weights, with u_values, v_values the two 1-D distributions, the code comes down to
u_sorter = np.argsort(u_values)  # (1)
v_sorter = np.argsort(v_values)
all_values = np.concatenate((u_values, v_values))  # (2)
all_values.sort(kind='mergesort')
deltas = np.diff(all_values)  # (3)
u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')  # (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')
v_cdf = v_cdf_indices / v_values.size  # (5)
u_cdf = u_cdf_indices / u_values.size
return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))  # (6)
What is the reasoning behind this implementation, is there some literature?
I did look at the cited paper, which I believe explains why calculating the Wasserstein distance in its general definition in 1D is equivalent to evaluating the integral
\int_{-\infty}^{+\infty} |U(x) - V(x)| dx,
with U and V the cumulative distribution functions of the distributions u_values and v_values,
but I don't understand how this integral is evaluated in the SciPy implementation.
In particular,
a) why are they multiplying by the deltas in (6) to solve the integral?
b) how are v_cdf and u_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the element order of the distribution u_values and v_values is not preserved. Shouldn't this be the case in the general Wasserstein distance definition?
Thank you for your help!
The order of the values in a PDF, histogram or KDE is preserved and matters for the Wasserstein distance. If you pass only u_values and v_values, the function has to build something like a histogram from them itself. Optionally you can also pass u_weights and v_weights as the third and fourth arguments of wasserstein_distance, which lets you supply a histogram directly as values plus weights. When only samples are provided, each array is not a single ordered data point but simply a collection of repeated "experiments". Steps (1) and (4) in your listing sort the samples and count, for each value in all_values, how many samples are less than or equal to it. A CDF is the fraction of samples up to that point, P(X <= x); it is basically the cumulative sum of a PDF, histogram or KDE. Step (5) normalizes these counts to between 0.0 and 1.0 by dividing them by the number of samples.
So the order of the discrete values is preserved, not the original order of the elements in the arrays.
b) It may make more sense if you plot the CDFs computed by the code above for some data, such as the pixel values of an image file.
The transportation problem, however, may not need a PDF but rather data points with ordered features, or some way to measure the distance between features, in which case you would calculate it differently.
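On point a), a sketch that may help: an empirical CDF is a step function that only changes at the sample values, so between two consecutive entries of all_values the difference |U - V| is constant, and the integral reduces to a sum of rectangle areas (height |U - V| times width delta). The function name wasserstein_1d below is made up for illustration; it re-implements the quoted steps with NumPy only and is not the SciPy source.

```python
import numpy as np

def wasserstein_1d(u_values, v_values):
    """1-D Wasserstein distance (p=1, no weights) from two samples."""
    u_sorted = np.sort(np.asarray(u_values, dtype=float))
    v_sorted = np.sort(np.asarray(v_values, dtype=float))

    # All points where either empirical CDF can jump, in increasing order.
    all_values = np.sort(np.concatenate((u_sorted, v_sorted)))
    deltas = np.diff(all_values)  # widths of the intervals between jumps

    # Empirical CDFs on each interval: fraction of samples <= its left edge.
    u_cdf = np.searchsorted(u_sorted, all_values[:-1], side="right") / u_sorted.size
    v_cdf = np.searchsorted(v_sorted, all_values[:-1], side="right") / v_sorted.size

    # Sum of rectangle areas = exact integral of |U - V| for step functions.
    return np.sum(np.abs(u_cdf - v_cdf) * deltas)

distance = wasserstein_1d([0, 1, 3], [5, 6, 8])
shuffled = wasserstein_1d([3, 0, 1], [8, 5, 6])
```

For these inputs the interval widths are 1, 2, 2, 1, 2 and the CDF differences are 1/3, 2/3, 1, 2/3, 1/3, so the sum is 5. Shuffling the inputs gives the same result, which is why the element order of u_values and v_values does not matter.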

Pareto distribution and whether a chart conforms to it

I have a figure as shown below, and I want to know whether or not it conforms to the Pareto distribution. It's a cumulative plot.
I also want to find the point on the x-axis that marks the 80-20 rule, i.e. the x-axis point which splits the plot into the 20 percent having 80 percent of the wealth.
Also, I'm really confused by the scipy.stats Pareto function. It would be great if someone could give an intuitive explanation of it, since the documentation is pretty confusing.
scipy.stats.pareto is the Pareto distribution object; its rvs method provides random draws from the Pareto distribution.
To know whether your distribution conforms to the Pareto distribution, you should perform a Kolmogorov-Smirnov test.
Draw a random sample from the Pareto distribution using pareto.rvs(shape, size=1000), where shape is the estimated shape parameter of your Pareto distribution, and use scipy.stats.kstest to compare it with your data (values):
import scipy.stats
from scipy.stats import pareto
pareto_smp = pareto.rvs(shape, size=1000)
D, p_value = scipy.stats.kstest(pareto_smp, values)
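A possible variant, sketched under assumptions: instead of comparing against a random Pareto sample, estimate the shape with pareto.fit and test the data against the fitted CDF directly. The array values here is synthetic stand-in data, and fixing the location at 0 (floc=0) is a modelling choice made for the example, not something stated in the question.

```python
import numpy as np
from scipy.stats import kstest, pareto

rng = np.random.default_rng(0)
# Stand-in for the observed data: synthetic Pareto draws with shape 2.5.
values = pareto.rvs(2.5, size=2000, random_state=rng)

# Maximum-likelihood fit with the location pinned at 0.
shape, loc, scale = pareto.fit(values, floc=0)

# One-sample KS test of the data against the fitted Pareto CDF.
D, p_value = kstest(values, "pareto", args=(shape, loc, scale))
```

A small D (and a p-value that is not small) means the test found no evidence against the Pareto model. Because the parameters were estimated from the same data, the p-value tends to be optimistic, so treat it as a rough guide.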
Nobody can simply determine whether an observed dataset follows a particular distribution. Based on your situation, what you need is to:
fit an empirical distribution using
statsmodels' ECDF (statsmodels.distributions.empirical_distribution.ECDF)
then compare this (nonparametrically) with your data to see whether the null hypothesis can be rejected.
For the 20/80 rule:
rescale your X to the range [0, 1] and simply pick 0.2 on the x-axis.
Source: https://arxiv.org/pdf/1306.0100.pdf
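One way to read the 80/20 question directly from raw data, as a rough sketch: sort the observations from largest to smallest, take the top 20 percent, and check what share of the total they hold. The helper name top_share and the toy data are made up for illustration, and this assumes you have the underlying values rather than only the plotted curve.

```python
import numpy as np

def top_share(values, top_fraction=0.2):
    """Return (threshold, share) for the largest `top_fraction` of values."""
    x = np.sort(np.asarray(values, dtype=float))[::-1]  # largest first
    n_top = int(np.ceil(top_fraction * x.size))         # size of the top group
    threshold = x[n_top - 1]          # smallest value still inside the top group
    share = x[:n_top].sum() / x.sum() # fraction of the total held by that group
    return threshold, share

# Toy data: one observation out of five holds 16 of the 20 units in total.
threshold, share = top_share([1, 1, 1, 1, 16])
```

Here the top 20 percent is the single value 16, which holds 16/20 = 0.8 of the total. For real data, a share well away from 0.8 suggests the 80-20 rule does not hold.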

How to generate random numbers with predefined probability distribution?

I would like to implement a function in Python (using NumPy) that takes a mathematical function (for example p(x) = e^(-x), as below) as input and generates random numbers that are distributed according to that function's probability distribution. I also need to plot them, so we can see the distribution.
What I actually need is a random number generator for exactly the following 2 mathematical functions as input, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?
For simple distributions like the ones you need, or if you have a CDF that is easy to invert in closed form, you can find plenty of samplers in NumPy, as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov chain Monte Carlo (MCMC) sampling methods.
The simplest, and maybe the easiest to understand, variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
1) start from a random point x and take a random step xnew = x + delta
2) evaluate the desired probability distribution at the starting point, p(x), and at the new one, p(xnew)
3) if the new point is more probable, p(xnew)/p(x) >= 1, accept the move
4) if the new point is less probable, randomly decide whether to accept or reject it depending on how probable [1] the new point is
5) take a new step from this point and repeat the cycle
It can be shown, see e.g. Sokal [2], that points sampled with this method follow the desired probability distribution.
An extensive implementation of Monte Carlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
import numpy as np

def uniform_proposal(x, delta=2.0):
    # propose a point uniformly within +/- delta of the current one
    return np.random.uniform(x - delta, x + delta)

def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
    x = 1  # start somewhere
    for i in range(nsamples):
        trial = proposal(x)  # random neighbour from the proposal distribution
        acceptance = p(trial) / p(x)
        # accept the move conditionally
        if np.random.uniform() < acceptance:
            x = trial
        yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
    return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)
p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
    return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))
p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain in which to take your random steps [3].
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
Conclusions
There is still way too much to say, about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
[1] The idea here is to favor exploration where the probability is higher but still look at low-probability regions, as they might lead to other peaks. The choice of the proposal distribution, i.e. how you pick new points to explore, is fundamental. Steps that are too small might confine you to a limited area of your distribution; steps that are too big could lead to a very inefficient exploration.
[2] Physics oriented. The Bayesian formalism (Metropolis-Hastings) is preferred these days, but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke University.
[3] Implementation not shown so as not to add too much confusion, but it's straightforward: you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
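For completeness, a sketch of what a domain-aware version could look like. This is an assumption about how the omitted implementation might work, not the answerer's actual code: it takes the "function goes to zero outside the domain" route by rejecting any trial step that lands outside domain. The delta, x0 and seed parameters are additions made for this example.

```python
import numpy as np

def metropolis_sampler(p, nsamples, domain=None, delta=2.0, x0=1.0, seed=None):
    """Metropolis sampler with a uniform proposal and an optional domain."""
    rng = np.random.default_rng(seed)
    x = x0  # start somewhere inside the domain, where p(x0) > 0
    for _ in range(nsamples):
        trial = rng.uniform(x - delta, x + delta)
        # Outside the domain the target is treated as zero: reject the move
        # and record the current point again.
        if domain is not None and not (domain[0] <= trial <= domain[1]):
            yield x
            continue
        # Accept with probability min(1, p(trial) / p(x)).
        if rng.uniform() < p(trial) / p(x):
            x = trial
        yield x

# Unnormalized target sqrt(x) restricted to [0, 10].
samples = np.array(list(
    metropolis_sampler(lambda x: np.sqrt(x), 50_000, domain=(0, 10), seed=0)
))
```

For a density proportional to sqrt(x) on [0, 10] the exact mean is 6, so samples.mean() should land near that value. Counting a rejected step as a repeat of the current point matters: dropping it would bias the sample near the edges.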
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both cases, the arguments are optional, as these are the default values for these distributions.
As a side note, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and input values from a uniform distribution.
inverse_cdf(np.random.uniform())
For example, if NumPy did not provide the exponential distribution, you could do this (the CDF is 1 - e^(-x), so its inverse is -log(1 - u)).
def exponential():
    return -np.log(1 - np.random.uniform())
If you encounter distributions whose CDF is not easy to compute or invert, then consider filippo's great answer.
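A vectorized sketch of the same idea, in case it is useful: pass any inverse CDF and draw many samples at once. The helper name sample_inverse_cdf is made up for this example. The exponential inverse CDF used here follows from solving u = 1 - e^(-x) for x.

```python
import numpy as np

def sample_inverse_cdf(inverse_cdf, size, seed=None):
    """Inverse transform sampling: push uniform draws through an inverse CDF."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=size)  # u lies in [0, 1), so 1 - u is never 0
    return inverse_cdf(u)

def exponential_inverse_cdf(u):
    # Solve u = 1 - exp(-x) for x.
    return -np.log(1.0 - u)

samples = sample_inverse_cdf(exponential_inverse_cdf, 100_000, seed=0)

# Normalized histogram of the samples, for comparison with p(x) = exp(-x).
density, edges = np.histogram(samples, bins=50, range=(0, 5), density=True)
```

The sample mean should be close to 1, the mean of the exponential distribution with parameter 1, and the histogram heights can be plotted against e^(-x) at the bin centres to check the shape.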

Python: Gaussian Copula or inverse of cdf

Let's say I have a column x with uniform distributed values.
To these values, I applied a cdf-function.
Now I want to calculate the Gaussian copula, but I can't find the function in Python. I have already read that the Gaussian copula is something like the "inverse of the cdf function".
The reason why I'm doing it comes from this paragraph:
A visual depiction of applying the Gaussian Copula process to normalize
an observation by applying 𝑛 = Phi^-1(𝐹(𝑥)). Calculating 𝐹(𝑥) yields a value 𝑢 ∈ [0, 1]
representing the proportion of shaded area at the left. Then Phi^−1(𝑢) yields a value 𝑛
by matching the shaded area in a Gaussian distribution.
I need your help: does anyone have an idea how to calculate that?
I have 2 ideas so far:
1) gauss = 1/(sqrt(2*pi)*s)*e**(-0.5*(float(x-m)/s)**2)
--> so transform all the values with this to a new value
2) norm.ppf(array,loc,scale)
--> So give the ppf function the mean and the std and the array and it will calculate me the inverse of the CDF... But I doubt #2
The thing is,
norm.cdf(norm.ppf(0.95))
is not what I want. The reason I'm doing this is to transform a non-normal (non-Gaussian) distribution into a normal distribution.
Like here:
Transform from a non gaussian distribution to a gaussian distribution with Gaussian Copula
Any other ideas or tips?
Thank you very much :)
EDIT:
I found 2 links which are quite useful:
1. https://stats.stackexchange.com/questions/197283/how-to-transform-an-arcsine-distribution-to-a-normal-distribution
2. https://stats.stackexchange.com/questions/125648/transformation-chi-squared-to-normal-distribution/125653#125653
In these posts it is said that you have to:
All the details are in the answer already - you take your random variable, and transform it by its own cdf ..... yielding a uniform result.
That's true for me. If I take a random distribution and apply the norm.cdf(data, mean, std) function, I get a uniformly distributed cdf.
Compare:
import numpy as np
from scipy.stats import norm
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
cdf = norm.cdf(data, np.mean(data), np.std(data))
print(cdf)
But how can I do the following?
You then transform again, applying the quantile function (inverse cdf) of the desired distribution (in this case by the standard normal quantile function /inverse of the normal cdf, producing a variable with a standard normal distribution).
Because when I use e.g. the norm.ppf function, the values are not reasonable.
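A sketch of the two-step transform n = Phi^-1(F(x)) described in the quoted paragraph, under one assumption: since the true CDF F of the data is usually unknown, it is estimated here from ranks (an empirical CDF). The function name to_standard_normal and the exponential test data are made up for the example. Note that F has to be the data's own CDF; applying norm.cdf with the sample mean and std to non-normal data and then norm.ppf just undoes the first step up to a rescaling, which may be why the values looked unreasonable.

```python
import numpy as np
from scipy.stats import norm, rankdata

def to_standard_normal(x):
    """Map a sample to approximately standard-normal values via its ranks."""
    x = np.asarray(x, dtype=float)
    # Step 1: empirical CDF from ranks, scaled into the open interval (0, 1)
    # so that norm.ppf never receives 0 or 1 (which map to -inf and +inf).
    u = rankdata(x) / (x.size + 1)
    # Step 2: standard normal quantile function (inverse CDF).
    return norm.ppf(u)

rng = np.random.default_rng(0)
skewed = rng.exponential(size=5000)  # clearly non-Gaussian input
z = to_standard_normal(skewed)
```

The result z keeps the ordering of the original values but has a mean near 0 and a standard deviation near 1. With heavily tied data, such as the example list above with only six distinct values, rankdata assigns average ranks, so the output also takes only six distinct values and cannot look smoothly Gaussian.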

Generating perfect random gaussian numbers

I tried to generate a series of random numbers with a Gaussian distribution, so I used numpy.random.normal(mean, standard deviation, size). However, when I converted these numbers into a probability density function using numpy.histogram, it was not the same as the Gaussian distribution with the same mean and standard deviation produced by matplotlib.mlab.normpdf.
I understand it may be because numpy.random.normal is random sampling. So, the PDF of these numbers can't be perfectly Gaussian.
Would you please give any advice about how to get the series of random numbers with mean and standard deviation which would have a Gaussian PDF, if it is possible?
The size of the numbers which I tried to get is 660.
I will really appreciate any advice and help.
Best regards,
Isaac
Well, you could "z-score" the sample, by subtracting the sample mean and then dividing by the sample standard deviation:
import numpy as np
x = np.random.normal(0, 1, size=660)
x = (x - x.mean()) / x.std()
That will make your vector have a mean of 0 and a standard deviation of 1. But that doesn't mean you will have "perfectly gaussian random numbers." I don't think that's really a concept that makes sense.
It would be helpful to know what application you want to use this for, maybe then it would be easier to suggest alternatives.
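If the goal is only a set of 660 numbers whose histogram follows the Gaussian curve as closely as possible, one alternative (not from the answer above, and offered as a sketch) is to skip random sampling and take evenly spaced quantiles of the normal distribution, then shuffle them. The function name gaussian_shaped_sample is made up for this example. These values are not independent random draws, so they should not be used where genuine randomness matters.

```python
import numpy as np
from scipy.stats import norm

def gaussian_shaped_sample(mean, std, size, seed=None):
    """Evenly spaced normal quantiles in random order (not independent draws)."""
    rng = np.random.default_rng(seed)
    # Midpoint probabilities (i + 0.5) / size stay strictly inside (0, 1).
    probs = (np.arange(size) + 0.5) / size
    values = norm.ppf(probs, loc=mean, scale=std)
    rng.shuffle(values)  # randomize the order only, not the values
    return values

x = gaussian_shaped_sample(0.0, 1.0, 660, seed=0)
```

The mean comes out at essentially 0 by symmetry, while the standard deviation is slightly below the requested value because the extreme tails are represented by only a few points.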
