How to simulate from an (arbitrary) continuous probability distribution? [duplicate] - python

This question already has answers here:
Fast arbitrary distribution random sampling (inverse transform sampling)
(5 answers)
Closed 5 years ago.
I have a probability density function like this:
from math import sin

def p1(x):
    return (sin(x) ** (-0.75)) / (4.32141 * (x ** (1/5)))
I want to generate random values on [0, 1] with this pdf. How can I draw such random values?

As mentioned by Francis, it is best to know the CDF of your distribution.
In any case, scipy provides a handy way to define custom distributions.
It looks roughly like this:
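import numpy as np
from scipy import stats

class your_distribution(stats.rv_continuous):
    def _pdf(self, x):
        return (np.sin(x) ** (-0.75)) / (4.32141 * (x ** (1/5)))

# a and b restrict the support to [0, 1]; the default support is the whole real line
distribution = your_distribution(a=0, b=1)
distribution.rvs()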

Without using scipy and given a numerical sampling of your PDF, you can sample using a cumulative distribution and linear interpolation. The code below assumes equal spacing in x. It could be modified to do an integration for an arbitrarily sampled PDF. Note it renormalises the PDF to 1 within the range of x.
import numpy as np

def randdist(x, pdf, nvals):
    """Produce nvals random samples from pdf(x), assuming constant spacing in x."""
    # get cumulative distribution from 0 to 1
    cumpdf = np.cumsum(pdf)
    cumpdf *= 1/cumpdf[-1]
    # input random values
    randv = np.random.uniform(size=nvals)
    # find where random values would go
    idx1 = np.searchsorted(cumpdf, randv)
    # get previous value, avoiding division by zero below
    idx0 = np.where(idx1==0, 0, idx1-1)
    idx1[idx0==0] = 1
    # do linear interpolation in x
    frac1 = (randv - cumpdf[idx0]) / (cumpdf[idx1] - cumpdf[idx0])
    randdist = x[idx0]*(1-frac1) + x[idx1]*frac1
    return randdist
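A short usage sketch of the same idea, written with np.interp rather than the function above, may make the steps easier to follow. The density 2x on [0, 1] and the names sample_from_grid, grid and samples are illustrative assumptions, chosen only because the exact mean (2/3) is known.

```python
import numpy as np

def sample_from_grid(x, pdf_vals, nvals, rng):
    """Draw nvals samples from a PDF tabulated on the grid x (hypothetical helper)."""
    # Trapezoidal cumulative integral, starting at 0 on the first grid point
    steps = 0.5 * (pdf_vals[1:] + pdf_vals[:-1]) * np.diff(x)
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    cdf /= cdf[-1]                      # renormalise so the CDF ends at 1
    u = rng.uniform(size=nvals)         # uniform draws on [0, 1)
    return np.interp(u, cdf, x)         # invert the CDF by linear interpolation

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 1001)
samples = sample_from_grid(grid, 2.0 * grid, 100_000, rng)
print(samples.mean())                   # close to 2/3 for the density 2x
```

Because the cumulative integral uses np.diff(x), this variant does not rely on equal spacing, but it does assume the tabulated CDF is strictly increasing.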

Related

np.random.choice not producing expected histogram

I'm looking to generate random normally distributed numbers between 1 and 0, but as the mean moves closer to 1 or 0, the right or left side respectively becomes "squished".
After modifying the normal distribution and playing around with sliders in geogebra, I came up with the following:
Next I needed to create a method in python which would generate random samples that would be distributed according to this PDF.
Originally I thought the only way to do this was to try and derive a new equation for generating random numbers as seen in the Box-Muller proof (which I got by following along with this tutorial).
However, I thought there might be an easier way to do this by using the numpy library's np.random.choice() method.
After all, I should be able to integrate the PDF at a very small step size and get the various probabilities for said steps (approximately of course).
So with that I wrote the following script:
# Standard libs
import math
# Third party libs
import numpy as np
from alive_progress import alive_bar
from matplotlib import pyplot as plt

class RandomNumberGenerator:
    def __init__(self):
        pass

    def clamped_normal_distribution(self, mu: float,
                                    stddev: float, x: float):
        """ Computes a value from the clamped normal distribution """
        divideByZeroAvoider = 1e-5
        if x < 0 or x > 1:
            return 0
        elif x >= 0 and x <= mu:
            return math.exp(-0.5*( (x - mu) / (stddev) )**2 \
                * (1/(x**2 + divideByZeroAvoider)))
        elif x <= 1 and x > mu:
            return math.exp(-0.5*( (x - mu) / (stddev) )**2 \
                * (1/((1-x)**2 + divideByZeroAvoider)))
        else:
            print("This shouldn't happen!: {}".format(x))
            return 0

if __name__ == '__main__':
    rng = RandomNumberGenerator()
    mu = 0.7
    stddev = 1
    stepSize = 1e-3
    x = np.linspace(stepSize,1, int(1/stepSize) - 1)
    # Determine the total area under the curve
    samples = []
    print("Generating samples...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            samples.append(rng.clamped_normal_distribution(
                mu, stddev, i))
            bar()
    area = np.trapz(samples, dx=stepSize)
    print("Area = {}".format(area))
    # Determine the probability of x falling in a specific interval
    probabilities = []
    print("Generating probabilities...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            lead = rng.clamped_normal_distribution(mu,
                stddev, i)
            lag = rng.clamped_normal_distribution(mu,
                stddev, i - stepSize)
            probability = np.trapz(
                np.array([lag, lead]),
                dx=stepSize)
            # Divide by the area because this isn't a standard normal
            probabilities.append(probability / area)
            bar()
    # Should be approximately 1
    print("Probability: {}".format(sum(probabilities)))
    plt.plot(x, probabilities)
    plt.show()
    y = []
    print("Performing distribution test...")
    testSize = int(10e3)
    with alive_bar(testSize) as bar:
        for _ in range(testSize):
            randSamp = np.random.choice(samples, p=probabilities)
            y.append(randSamp)
            bar()
    plt.hist(y,300)
    plt.show()
The first plot of the probabilities against the linearly spaced samples looks promising, giving me the following graph:
However, if we use these samples as choices with given probabilities, we get the following histogram:
I have no idea why this isn't working correctly.
I've tried other (smaller) examples like the ones listed on the numpy website, and they produce histograms that match the given probabilities array.
I'd really appreciate some advice/intuition if at all possible :).
It looks like there is a problem with the first argument in the call np.random.choice(samples, p=probabilities). The first argument should be x, not samples.
ADDITION BY AUTHOR:
The reason for this is the samples are the values of the curve (i.e. the y-axis and NOT the x-axis).
Thus the values with the highest probabilities (i.e. the samples around the mean) all have a value of ~1, which is why we see such a massive spike around the value 1.
Changing this to x gives us the following graphs (for 10e3 samples):
Working as expected, very nice.
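A minimal sketch of the distinction, using a small illustrative curve rather than the one from the question (the Gaussian-shaped weights and the names weights and draws are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(0.0, 1.0, 1001)                   # candidate values (the x-axis)
weights = np.exp(-0.5 * ((x - 0.7) / 0.1) ** 2)   # unnormalised curve heights (the y-axis)
probabilities = weights / weights.sum()           # must sum to 1 for choice()

# Draw from x, weighted by probabilities; drawing from `weights` instead
# would return curve heights rather than positions.
draws = rng.choice(x, size=10_000, p=probabilities)
print(draws.mean())                               # near 0.7 for this illustrative curve
```

Passing size to a single choice call also avoids the Python-level loop over individual draws.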

Remapping Points for a Growing Exponential Distribution

I am trying to take the data points from an array that currently range from 0 to 1 and remap them according to a few different distributions. For example, I am remapping the data to a decaying exponential (lambda * e^(-lambda * x)) with a standard deviation of .06 below.
# Import the packages I need
from pyDOE import lhs
from scipy.stats.distributions import norm
from scipy.stats.distributions import expon
import matplotlib.pyplot as plt

# CREATING THE LHC
n = 3  # The number of parameters to generate. Columns
samples = 40  # The number of sample points for each parameter. Rows
criterion = 'maximin'  # The spacing between parameters. maximin for our purposes
lhd = lhs(n, samples=samples, criterion=criterion)  # Making the Latin-Hyper-Square
# print(lhd) # Show the array
# plt.hist(lhd, bins=20) # Plot the array

# Trying the transformation with exponentials
lhd1 = lhd.copy()  # Copy the array (plain assignment would alias lhd) so I can compare and contrast
mean = [0]
stdv = [.06]
for i in range(n):
    lhd1[:, i] = expon(loc=mean, scale=stdv).ppf(lhd1[:, i])
print(lhd1)  # Show the Transformed array
plt.hist(lhd1,bins=20)  # Plot the array
I would like to do the same thing but for growing exponentials (lambda * e^(lambda * x)). Everything I can find online and in the documentation is about the decaying exponential probability distribution, and there is almost nothing about a positive exponential.
Can I just alter the "expon" distribution? Is there another distribution that I should be using instead? Any advice is welcome.

How to plot the pdf and cdf for an arbitrary function in python? [duplicate]

The random module (http://docs.python.org/2/library/random.html) has several fixed functions to randomly sample from. For example random.gauss will sample random point from a normal distribution with a given mean and sigma values.
I'm looking for a way to extract a number N of random samples between a given interval using my own distribution as fast as possible in python. This is what I mean:
def my_dist(x):
    # Some distribution, assume c1,c2,c3 and c4 are known.
    f = c1*exp(-((x-c2)**c3)/c4)
    return f
# Draw N random samples from my distribution between given limits a,b.
N = 1000
N_rand_samples = ran_func_sample(my_dist, a, b, N)
where ran_func_sample is what I'm after and a, b are the limits from which to draw the samples. Is there anything of that sort in python?
You need to use the inverse transform sampling method to get random values distributed according to the law you want. With this method, you apply the inverse of the cumulative distribution function to random numbers drawn from the standard uniform distribution on the interval [0,1].
Once you have found the inverse function, you get 1000 numbers distributed according to the desired distribution in the obvious way:
[inverted_function(random.random()) for x in range(1000)]
More on Inverse Transform Sampling:
http://en.wikipedia.org/wiki/Inverse_transform_sampling
Also, there is a good question on StackOverflow related to the topic:
Pythonic way to select list elements with different probability
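As a concrete instance of the recipe above, here is a sketch for a case where the inverse is known in closed form. The exponential distribution and the names inverse_exponential_cdf and rate are illustrative choices, not part of the question:

```python
import math
import random

def inverse_exponential_cdf(u, rate):
    """Inverse of F(x) = 1 - exp(-rate * x), valid for 0 <= u < 1."""
    return -math.log(1.0 - u) / rate

random.seed(0)
rate = 2.0  # illustrative value
samples = [inverse_exponential_cdf(random.random(), rate) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to 1 / rate = 0.5
```

When no closed-form inverse exists, the same structure still applies with a numerically tabulated and interpolated inverse CDF.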
This code implements the sampling of n-d discrete probability distributions. By setting a flag on the object, it can also be made to be used as a piecewise constant probability distribution, which can then be used to approximate arbitrary pdf's. Well, arbitrary pdfs with compact support; if you efficiently want to sample extremely long tails, a non-uniform description of the pdf would be required. But this is still efficient even for things like airy-point-spread functions (which I created it for, initially). The internal sorting of values is absolutely critical there to get accuracy; the many small values in the tails should contribute substantially, but they will get drowned out in fp accuracy without sorting.
import numpy as np

class Distribution(object):
    """
    draws samples from a one dimensional probability distribution,
    by means of inversion of a discrete cumulative density function
    the pdf can be sorted first to prevent numerical error in the cumulative sum
    this is set as default; for big density functions with high contrast,
    it is absolutely necessary, and for small density functions,
    the overhead is minimal
    a call to this distribution object returns indices into density array
    """
    def __init__(self, pdf, sort = True, interpolation = True, transform = lambda x: x):
        self.shape = pdf.shape
        self.pdf = pdf.ravel()
        self.sort = sort
        self.interpolation = interpolation
        self.transform = transform
        #a pdf can not be negative
        assert(np.all(pdf>=0))
        #sort the pdf by magnitude
        if self.sort:
            self.sortindex = np.argsort(self.pdf, axis=None)
            self.pdf = self.pdf[self.sortindex]
        #construct the cumulative distribution function
        self.cdf = np.cumsum(self.pdf)

    @property
    def ndim(self):
        return len(self.shape)

    @property
    def sum(self):
        """cached sum of all pdf values; the pdf need not sum to one, and is implicitly normalized"""
        return self.cdf[-1]

    def __call__(self, N):
        """draw """
        #pick numbers which are uniformly random over the cumulative distribution function
        choice = np.random.uniform(high = self.sum, size = N)
        #find the indices corresponding to this point on the CDF
        index = np.searchsorted(self.cdf, choice)
        #if necessary, map the indices back to their original ordering
        if self.sort:
            index = self.sortindex[index]
        #map back to multi-dimensional indexing
        index = np.unravel_index(index, self.shape)
        index = np.vstack(index)
        #is this a discrete or piecewise continuous distribution?
        if self.interpolation:
            index = index + np.random.uniform(size=index.shape)
        return self.transform(index)

if __name__=='__main__':
    shape = 3,3
    pdf = np.ones(shape)
    pdf[1]=0
    dist = Distribution(pdf, transform=lambda i:i-1.5)
    print(dist(10))
    import matplotlib.pyplot as pp
    pp.scatter(*dist(1000))
    pp.show()
And as a more real-world relevant example:
x = np.linspace(-100, 100, 512)
p = np.exp(-x**2)
pdf = p[:,None]*p[None,:] #2d gaussian
dist = Distribution(pdf, transform=lambda i:i-256)
print(dist(1000000).mean(axis=1)) #should be in the 1/sqrt(1e6) range
import matplotlib.pyplot as pp
pp.scatter(*dist(1000))
pp.show()
Here is a rather nice way of performing inverse transform sampling with a decorator.
import numpy as np
from scipy.interpolate import interp1d
def inverse_sample_decorator(dist):
    def wrapper(pnts, x_min=-100, x_max=100, n=1e5, **kwargs):
        x = np.linspace(x_min, x_max, int(n))
        cumulative = np.cumsum(dist(x, **kwargs))
        cumulative -= cumulative.min()
        f = interp1d(cumulative/cumulative.max(), x)
        return f(np.random.random(pnts))
    return wrapper
Using this decorator on a Gaussian distribution, for example:
@inverse_sample_decorator
def gauss(x, amp=1.0, mean=0.0, std=0.2):
    return amp*np.exp(-(x-mean)**2/std**2/2.0)
You can then generate sample points from the distribution by calling the function. The keyword arguments x_min and x_max are the limits of the original distribution and can be passed as arguments to gauss along with the other key word arguments that parameterise the distribution.
samples = gauss(5000, mean=20, std=0.8, x_min=19, x_max=21)
Alternatively, this can be done as a function that takes the distribution as an argument (as in your original question),
def inverse_sample_function(dist, pnts, x_min=-100, x_max=100, n=1e5,
                            **kwargs):
    x = np.linspace(x_min, x_max, int(n))
    cumulative = np.cumsum(dist(x, **kwargs))
    cumulative -= cumulative.min()
    f = interp1d(cumulative/cumulative.max(), x)
    return f(np.random.random(pnts))
I was in a similar situation but I wanted to sample from a multivariate distribution, so, I implemented a rudimentary version of Metropolis-Hastings (which is an MCMC method).
def metropolis_hastings(target_density, size=500000):
    burnin_size = 10000
    size += burnin_size
    x0 = np.array([[0, 0]])
    xt = x0
    samples = []
    for i in range(size):
        xt_candidate = np.array([np.random.multivariate_normal(xt[0], np.eye(2))])
        accept_prob = (target_density(xt_candidate))/(target_density(xt))
        if np.random.uniform(0, 1) < accept_prob:
            xt = xt_candidate
        samples.append(xt)
    samples = np.array(samples[burnin_size:])
    samples = np.reshape(samples, [samples.shape[0], 2])
    return samples
This function requires a function target_density which takes in a data point and computes its probability density (it only needs to be known up to a constant factor, since only the ratio is used).
For details, check out this detailed answer of mine.
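To show what such a target_density might look like, here is a compact, self-contained sketch of random-walk Metropolis. It is a simplified variant rather than the function above: the Gaussian target centred at (1, -1), the function name random_walk_metropolis and the chain lengths are all assumptions made for illustration.

```python
import numpy as np

def target_density(point):
    """Unnormalised 2-D Gaussian centred at (1, -1); an illustrative target."""
    diff = point - np.array([1.0, -1.0])
    return np.exp(-0.5 * diff @ diff)

def random_walk_metropolis(density, n_samples, burn_in, rng):
    """Random-walk Metropolis with a unit-variance Gaussian proposal."""
    current = np.zeros(2)
    chain = []
    for _ in range(n_samples + burn_in):
        candidate = current + rng.normal(size=2)
        # Symmetric proposal, so the acceptance ratio is just the density ratio
        if rng.uniform() < density(candidate) / density(current):
            current = candidate
        chain.append(current)
    return np.array(chain[burn_in:])

rng = np.random.default_rng(0)
chain = random_walk_metropolis(target_density, 20_000, 1_000, rng)
print(chain.mean(axis=0))  # roughly [1, -1]
```

Successive points in the chain are correlated, so the effective number of independent samples is smaller than the chain length.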
import numpy as np
import scipy.interpolate as interpolate
def inverse_transform_sampling(data, n_bins, n_samples):
    hist, bin_edges = np.histogram(data, bins=n_bins, density=True)
    cum_values = np.zeros(bin_edges.shape)
    cum_values[1:] = np.cumsum(hist*np.diff(bin_edges))
    inv_cdf = interpolate.interp1d(cum_values, bin_edges)
    r = np.random.rand(n_samples)
    return inv_cdf(r)
So if we pass in a data sample that follows a specific distribution, the inverse_transform_sampling function will return a dataset with approximately the same distribution, up to the resolution of the histogram bins. The advantage here is that we can choose our own sample size via the n_samples argument.

Generating 3D Gaussian Data [duplicate]

This question already has answers here:
Generating 3D Gaussian distribution in Python
(2 answers)
Closed 2 years ago.
I'm trying to generate a 3D distribution, where x, y represents the surface plane, and z is the magnitude of some value, distributed over a range.
I'm looking at numpy's multivariate_normal, but it only lets me get a number of samples. I'd like the ability to specify some x, y coordinate, and get back what the z value should be; so I'd be able to query gp(x, y) and get back a z value that adheres to some mean and covariance.
Perhaps a more illustrative (toy) example: assume I have some temperature distribution that can be modeled as a gaussian process. So I might have a mean temperature of 20 at (0, 0), and some covariance [[1, 0], [0, 1]]. I'd like to be able to create a model that I can then query at different x, y locations to get the temperature at that position (so, at (5, 5) I might get back something like 7 degrees).
How to best accomplish this?
I assume that your data can be copied to a single np.array, which I will refer to as X in my code, with shape X.shape = (n, 2), where n is the number of data points you have; you can use n = 1 if you wish to test a single point at a time. The 2, of course, refers to the 2D space spanned by your coordinates (x and y). Then:
import numpy as np

def estimate_gaussian(X):
    return X.mean(axis=0), np.cov(X.T)

def mva_gaussian( X, mu, sigma2 ):
    k = len(mu)
    # check if sigma2 is a vector and, if yes, use as the diagonal of the covariance matrix
    if sigma2.ndim == 1 :
        sigma2 = np.diag(sigma2)
    X = X - mu
    return (2 * np.pi)**(-k/2) * np.linalg.det(sigma2)**(-0.5) * \
        np.exp( -0.5 * np.sum( np.multiply( X.dot( np.linalg.inv(sigma2) ), X ), axis=1 ) ).reshape( ( X.shape[0], 1 ) )
will do what you want - that is, given data points you will get the value of the gaussian function at those points (or a single point). This is actually a generalized version of what you need, as this function can describe a multivariate gaussian. You seem to be interested in the k = 2 case and a diagonal covariance matrix sigma2.
Moreover, this is also a probability distribution, which you say you don't want. We don't have enough info to know exactly what you're trying to fit to (i.e. what you expect the three parameters of the gaussian function to be; usually, people are interested in a normal distribution). Nevertheless, you can simply change the parameters in the return statement of the mva_gaussian function according to your needs, and ignore the estimate_gaussian function if you don't want a normalized distribution (although a normalized function would still give you what you seek, a real-valued temperature, as long as you know the normalization process, which you do :-) ).
You can create a multivariate normal using scipy.stats.multivariate_normal.
>>> import scipy.stats
>>> dist = scipy.stats.multivariate_normal(mean=[2,3], cov=[[1,0],
[0,1]])
Then to find p(x,y) you can use pdf
>>> dist.pdf([2,3])
0.15915494309189535
>>> dist.pdf([1,1])
0.013064233284684921
This is the probability density (which you called z) at any given [x,y].
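The same pdf call also accepts many points at once, which is one way to build the queryable surface described in the question. In the sketch below, the grid extent and the rescaling to a peak of 20 are hypothetical choices for illustration; they are one possible reading of the temperature example, not something scipy provides.

```python
import numpy as np
from scipy import stats

dist = stats.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]])

# Build a grid of (x, y) points; the last axis holds the two coordinates
xs, ys = np.meshgrid(np.linspace(-3, 3, 61), np.linspace(-3, 3, 61))
points = np.dstack((xs, ys))        # shape (61, 61, 2)
density = dist.pdf(points)          # shape (61, 61)

# Hypothetical rescaling so the surface peaks at a chosen value (20 here)
peak_value = 20.0
z = peak_value * density / dist.pdf([0, 0])
print(z[30, 30])                    # value at the grid point nearest (0, 0)
```

Note that this gives a deterministic surface; it is not a draw from a Gaussian process.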

python: random sampling from self-defined probability function [duplicate]

This question already has answers here:
Fast arbitrary distribution random sampling (inverse transform sampling)
(5 answers)
Closed 5 years ago.
I have a piecewise quartic distribution with a probability density function:
p(x)= c(x/a)^2 if 0≤x<a;
c((b+a-x)^2/b)^2 if a≤x≤b;
0 otherwise
Suppose c, a, b are known, I am trying to draw 100 random samples from the distribution. How can I do it with numpy/scipy?
One standard way is to find an explicit formula, G = F^-1 for the inverse of the cumulative distribution function. That is doable here (although it will naturally be piecewise defined) and then use G(U) where U is uniform on [0,1] to generate your samples.
In this case, I think that I worked out the details, but you will need to check the Calculus/Algebra.
First of all, to streamline things it helps to introduce a couple of new parameters. Let
f(a,b,c,d,e,x) = c*x**2 #if 0 <= x <= a
and
f(a,b,c,d,e,x) = d*(x-e)**4 #if a < x <= b
Then your p(x) is given by
p(x) = f(a,b,c/a**2,c/b**2,a+b,x)
I integrated f to find the cumulative distribution and then inverted and got the following:
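def Finverse(a,b,c,d,e,x):
    if x <= (c*a**3)/3:
        return (3*x/c)**(1/3)
    else:
        # solve c*a**3/3 + d*((t-e)**5 - (a-e)**5)/5 = x for t; since t < e,
        # take the real fifth root of the positive quantity (e-t)**5
        return e - ((e-a)**5 - 5*(x - (c*a**3)/3)/d)**(1/5)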
Assuming this is right, then simply:
import random

def randX(a,b,c):
    u = random.random()
    return Finverse(a,b,c/a**2,c/b**2,a+b,u)
In this case it was possible to work out an explicit formula. When you can't work out such a formula for the inverse, consider using the Monte Carlo methods described by @lucianopaz.
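Writing the integration out explicitly may help when checking the algebra. With f = c x^2 on [0, a] and f = d (x - e)^4 on (a, b], and assuming the density is normalised so that F(b) = 1, the cumulative distribution and its inverse work out as:

```latex
F(x) =
\begin{cases}
\dfrac{c x^3}{3}, & 0 \le x \le a, \\[2ex]
\dfrac{c a^3}{3} + \dfrac{d}{5}\left[(x - e)^5 - (a - e)^5\right], & a < x \le b,
\end{cases}
\qquad
F^{-1}(u) =
\begin{cases}
\left(\dfrac{3u}{c}\right)^{1/3}, & u \le \dfrac{c a^3}{3}, \\[2ex]
e - \left[(e - a)^5 - \dfrac{5}{d}\left(u - \dfrac{c a^3}{3}\right)\right]^{1/5}, & \text{otherwise.}
\end{cases}
```

Since e = a + b is larger than b, the difference x - e is negative on the second branch, which is why the fifth root is taken of the positive quantity (e - x)^5 and then subtracted from e.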
As your function is bounded both in x and p(x), I recommend that you use Monte Carlo rejection sampling. The basic principle is that you draw two uniform random numbers: one representing a candidate x within the bounds [0, b], and another, y, drawn uniformly from [0, pmax]. If y is less than or equal to the normalized p(x), the sampled x is returned; if not, the loop continues to the next iteration.
import numpy as np
def rejection_sampler(p,xbounds,pmax):
    while True:
        x = np.random.rand(1)*(xbounds[1]-xbounds[0])+xbounds[0]
        y = np.random.rand(1)*pmax
        if y<=p(x):
            return x
Here, p should be a callable to your normalized piecewise probability density, xbounds can be a list or tuple containing the lower and upper bounds, and pmax the maximum of the probability density in the x interval.
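A sketch of how this could be applied to the piecewise density from the question, in a batched form that draws all 100 samples in a few passes. The helper names make_density and rejection_sample, the parameter values a = 1 and b = 2, and the choice to derive c from the normalisation condition are assumptions made for illustration.

```python
import numpy as np

def make_density(a, b):
    """Return the normalised piecewise density from the question (hypothetical helper)."""
    # Normalising constant c from integrating both pieces over [0, b]
    c = 1.0 / (a / 3.0 + (b**5 - a**5) / (5.0 * b**2))
    def p(x):
        x = np.asarray(x, dtype=float)
        left = c * (x / a) ** 2
        right = c * ((b + a - x) ** 2 / b) ** 2
        inside = (x >= 0) & (x <= b)
        return np.where(inside, np.where(x < a, left, right), 0.0)
    return p, c

def rejection_sample(p, xbounds, pmax, n, rng):
    """Batched rejection sampling: propose n points at a time, keep accepted ones."""
    out = np.empty(0)
    while out.size < n:
        x = rng.uniform(xbounds[0], xbounds[1], size=n)
        y = rng.uniform(0.0, pmax, size=n)
        out = np.concatenate((out, x[y <= p(x)]))
    return out[:n]

rng = np.random.default_rng(0)
a, b = 1.0, 2.0                      # illustrative parameter values
p, c = make_density(a, b)
pmax = c * max(1.0, b**2)            # largest value of p on [0, b], reached at x = a
samples = rejection_sample(p, (0.0, b), pmax, 100, rng)
print(samples.shape)                 # (100,)
```

The acceptance rate is 1 / (pmax * (xbounds[1] - xbounds[0])), so a sharply peaked density makes this approach slow.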
