When I sample from a distribution in PyTorch, both sample and rsample appear to give similar results:
import torch, seaborn as sns
x = torch.distributions.Normal(torch.tensor([0.0]), torch.tensor([1.0]))
sns.distplot(x.sample((100000,)))
sns.distplot(x.rsample((100000,)))
When should I use sample(), and when should I use rsample()?
Using rsample allows for pathwise derivatives:
The other way to implement these stochastic/policy gradients would be to use the reparameterization trick from the rsample() method, where the parameterized random variable can be constructed via a parameterized deterministic function of a parameter-free random variable. The reparameterized sample therefore becomes differentiable.
sample(): random sampling from the probability distribution. So, we cannot backpropagate, because it is random! (the computation graph is cut off).
See the source code of sample in torch.distributions.normal.Normal:
def sample(self, sample_shape=torch.Size()):
shape = self._extended_shape(sample_shape)
with torch.no_grad():
return torch.normal(self.loc.expand(shape), self.scale.expand(shape))
torch.normal returns a tensor of random numbers. Also, torch.no_grad() context prevents the computation graph from growing any further.
You see, we cannot backprop. The returned tensor of sample() contains just some numbers, not the whole computational graph.
So, what is rsample()?
By using rsample, we can backpropagate, because it keeps the computation graph alive.
How? By putting the randomness aside in a separate parameter. This is called the "reparameterization trick".
rsample: sampling using reparameterization trick.
There is eps in the source code:
def rsample(self, sample_shape=torch.Size()):
shape = self._extended_shape(sample_shape)
eps = _standard_normal(shape, dtype=self.loc.dtype, device=self.loc.device)
return self.loc + eps * self.scale
# `self.loc` is the mean and `self.scale` is the standard deviation.
eps is the separate parameter responsible for the randomness of the sampling.
Look at the return: mean + eps * standard deviation
eps does not depend on the parameters you want to differentiate with respect to.
So, now you can freely backpropagate(=differentiate) because eps does not change when the parameters change.
(If we change the parameters, the distribution of the reparameterized samples does change because self.loc and self.scale change, but the distribution of the eps does not change.)
Note that the randomness of the sampling comes from the random sampling of the eps. There is no randomness in the computation graph itself. Once eps is chosen, it is fixed. (the distribution of the elements of the eps is fixed, after they are sampled.)
For example, in an implementation of the SAC(Soft Actor-Critic) algorithm in reinforcement learning, eps may consist of elements corresponding to a single minibatch of actions (and one action may consist of many elements).
Related
Ultimately I am trying to visualise the copula between two PDFs which are estimated from data (both via a KDE). Suppose, for one of the KDEs, I have discrete x,y data sorted in a tuple called data. I need to generate random variables with this distribution in order to perform the probability integral transform (and ultimately to obtain the uniform distribution). My methodology to generate random variables is as follows:
import scipy.stats as st
from scipy import interpolate, integrate
pdf1 = interpolate.interp1d(data[0], data[1])
class pdf1_class(st.rv_continuous):
def _pdf(self,x):
return pdf1(x)
pdf1_rv = pdf1_class(a = data[0][0], b= data[0][-1], name = 'pdf1_class')
pdf1_samples = pdf1_rv.rvs(size=10000)
However, this method is extremely slow. I also get the following warnings:
IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
warnings.warn(msg, IntegrationWarning)
IntegrationWarning: The occurrence of roundoff error is detected, which prevents
the requested tolerance from being achieved. The error may be
underestimated.
warnings.warn(msg, IntegrationWarning)
Is there a better way to generate the random variables?
As per suggestion by #unutbu I implemented _cdf and _ppf, which makes the calculation of 10000 samples instantaneous. To do this I added the following to the above code:
discrete_cdf1 = integrate.cumtrapz(y=data[1], x = data[0])
cdf1 = interpolate.interp1d(data[0][1:], discrete_cdf1)
ppf1 = interpolate.interp1d(discerete_cdf1, data[0][:-1])
I then add the following two methods to pdf1_class
def _cdf(self,x):
return cdf1(x)
def _ppf(self,x):
return ppf1(x)
I would like to implement a function in python (using numpy) that takes a mathematical function (for ex. p(x) = e^(-x) like below) as input and generates random numbers, that are distributed according to that mathematical-function's probability distribution. And I need to plot them, so we can see the distribution.
I need actually exactly a random number generator function for exactly the following 2 mathematical functions as input, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?
For simple distributions like the ones you need, or if you have an easy to invert in closed form CDF, you can find plenty of samplers in NumPy as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov-Chain Montecarlo sampling methods.
The simplest and maybe easier to understand variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
start from a random point x and take a random step xnew = x + delta
evaluate the desired probability distribution in the starting point p(x) and in the new one p(xnew)
if the new point is more probable p(xnew)/p(x) >= 1 accept the move
if the new point is less probable randomly decide whether to accept or reject depending on how probable1 the new point is
new step from this point and repeat the cycle
It can be shown, see e.g. Sokal2, that points sampled with this method follow the acceptance probability distribution.
An extensive implementation of Montecarlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
def uniform_proposal(x, delta=2.0):
return np.random.uniform(x - delta, x + delta)
def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
x = 1 # start somewhere
for i in range(nsamples):
trial = proposal(x) # random neighbour from the proposal distribution
acceptance = p(trial)/p(x)
# accept the move conditionally
if np.random.uniform() < acceptance:
x = trial
yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)
p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))
p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain where to sample your random steps3
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
Conclusions
There is still way too much to say, about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
The idea here is to favor exploration where the probability is higher but still look at low probability regions as they might lead to other peaks. Fundamental is the choice of the proposal distribution, i.e. how you pick new points to explore. Too small steps might constrain you to a limited area of your distribution, too big could lead to a very inefficient exploration.
Physics oriented. Bayesian formalism (Metropolis-Hastings) is preferred these days but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke university.
Implementation not shown not to add too much confusion, but it's straightforward you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both case, the arguments are optional as these are the default values for these distributions.
As a sidenote, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and input values from a uniform distribution.
inverse_cdf(np.random.uniform())
By example if NumPy did not provide the exponential distribution, you could do this.
def exponential():
return -np.log(-np.random.uniform())
If you encounter distributions which CDF is not easy to compute, then consider filippo's great answer.
I want to add Gaussian random noise to a variable in my model for each separate time-step and not to generate a noise array and add it to my signal afterwards. In this way I want to examine a standard dynamic effect of my system.
So, I want to generate each time-step a random noise (i.e. a single value) and it to my signal (e.g. add noise then it calculates the next state, add noise it calculates the next state, etc.). I thought to do this via NumPy using the following in the dynamic section of my model over a set of time-steps:
self.x = self.x + self.a * ((d-f)/100)
self.x = self.x + np.random.normal(0, 0.5, None)`
the second line is drawing random samples from a normal distribution and adds it to my variable.
0 is the mean of the normal distribution I am choosing from, 0.5 is the standard deviation of the normal distribution and the third argument is the size.
I am wondering if numpy.random.normal is the correct way to do it and, if so, what parameter I should use for the size argument?
To generate a single number, use the 2-argument form of np.random.normal:
In [47]: np.random.normal(0, 0.5)
Out[47]: 0.6138972867165546
You may need to scale this number (multiply it by a small number epsilon) so the noise is small compared to self.x.
I try to convert matlab code to numpy and figured out that numpy has a different result with the std function.
in matlab
std([1,3,4,6])
ans = 2.0817
in numpy
np.std([1,3,4,6])
1.8027756377319946
Is this normal? And how should I handle this?
The NumPy function np.std takes an optional parameter ddof: "Delta Degrees of Freedom". By default, this is 0. Set it to 1 to get the MATLAB result:
>>> np.std([1,3,4,6], ddof=1)
2.0816659994661326
To add a little more context, in the calculation of the variance (of which the standard deviation is the square root) we typically divide by the number of values we have.
But if we select a random sample of N elements from a larger distribution and calculate the variance, division by N can lead to an underestimate of the actual variance. To fix this, we can lower the number we divide by (the degrees of freedom) to a number less than N (usually N-1). The ddof parameter allows us change the divisor by the amount we specify.
Unless told otherwise, NumPy will calculate the biased estimator for the variance (ddof=0, dividing by N). This is what you want if you are working with the entire distribution (and not a subset of values which have been randomly picked from a larger distribution). If the ddof parameter is given, NumPy divides by N - ddof instead.
The default behaviour of MATLAB's std is to correct the bias for sample variance by dividing by N-1. This gets rid of some of (but probably not all of) of the bias in the standard deviation. This is likely to be what you want if you're using the function on a random sample of a larger distribution.
The nice answer by #hbaderts gives further mathematical details.
The standard deviation is the square root of the variance. The variance of a random variable X is defined as
An estimator for the variance would therefore be
where denotes the sample mean. For randomly selected , it can be shown that this estimator does not converge to the real variance, but to
If you randomly select samples and estimate the sample mean and variance, you will have to use a corrected (unbiased) estimator
which will converge to . The correction term is also called Bessel's correction.
Now by default, MATLABs std calculates the unbiased estimator with the correction term n-1. NumPy however (as #ajcr explained) calculates the biased estimator with no correction term by default. The parameter ddof allows to set any correction term n-ddof. By setting it to 1 you get the same result as in MATLAB.
Similarly, MATLAB allows to add a second parameter w, which specifies the "weighing scheme". The default, w=0, results in the correction term n-1 (unbiased estimator), while for w=1, only n is used as correction term (biased estimator).
For people who aren't great with statistics, a simplistic guide is:
Include ddof=1 if you're calculating np.std() for a sample taken from your full dataset.
Ensure ddof=0 if you're calculating np.std() for the full population
The DDOF is included for samples in order to counterbalance bias that can occur in the numbers.
I'm trying to infer models parameters with PyMC. In particular the observed data is modeled as a sum of two different random variables: a negative binomial and a poisson.
In PyMC, an algebraic composition of random variables is described by a "deterministic" object. Is it possible to assign the observed data to this deterministic object?
If not possible, we still know that the PDF of the sum is the convolution the PDF of the components. Is there any trick to compute this convolution efficiently?
It is not possible to make a deterministic node observed in PyMC2, but you can achieve an equivalent model by making one part of your convolution a latent variable. Here is a small example:
def model(values):
# priors for model parameters
mu_A = pm.Exponential('mu_A', beta=1, value=1)
alpha_A = pm.Exponential('alpha_A', beta=1, value=1)
mu_B_minus_A = pm.Uninformative('mu_B_minus_A', value=1)
# latent variable for negative binomial
A = pm.NegativeBinomial('A', mu=mu_A, alpha=alpha_A, value=0)
# observed variable for conditional poisson
B = pm.Poisson('B', mu=mu_B_minus_A+A, value=values, observed=True)
return locals()
Here is a notebook that tests it out. It seems like it will be tough to fit without some additional information on the model parameters. Perhaps there is a clever way to calculate or approximate the convolution of a NB and a Poisson that you could use as a custom observed stochastic instead.