Estimate the following integral with Monte Carlo integration:
I am trying to do Monte Carlo Integration on the problem below, where p(x) is a Gaussian distribution with a mean of 1 and a variance of 2. (see image).
I was told that once we draw samples from a normal distribution the pdf vanishes in the integral. Please explain this concept and how to solve it in Python. Below is my attempt.
import math
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

def func(x):
    return math.exp(x) * x

mu = 1
sigma = sqrt(2)
N = 1000
areas = []
for i in range(N):
    xrand = np.zeros(N)
    for i in range(len(xrand)):
        xrand[i] = np.random.normal(mu, sigma)
    integral = 0.0
    for i in range(N):
        integral += func(xrand[i]) / N
    answer = integral
    areas.append(answer)

plt.title("Distribution of areas calculated")
plt.hist(areas, 60, ec='black')
plt.xlabel("Areas")
integral
Monte Carlo integration is a way of approximating complex integrals without computing their closed-form solution. To answer your question, the PDF "vanishes" because all you need to do is 1) sample random values from the specified normal distribution, 2) evaluate the rest of the integrand at each sample, and 3) average those values. The PDF never appears explicitly in the computation; it matters only insofar as it ensures that more likely values are sampled more frequently. You can think of the sample average as a weighted average, if that makes things more intuitive.
Here is a Python implementation based on your original source code.
import math
import numpy as np
from statistics import mean

def func(x):
    return x * math.exp(x)

def monte_carlo(n_sample, mu, sigma):
    val_lst = []
    for _ in range(n_sample):
        x = np.random.normal(mu, sigma)
        val_lst.append(func(x))
    return mean(val_lst)
You can change func to any function of your choice to perform a Monte Carlo approximation of its integral against the sampling distribution. You can also edit the parameters of the monte_carlo function if you are given a different probability distribution.
Here is a snippet you can use to visualize the gradual convergence of the Monte Carlo approximation. As you might expect, the estimates converge as the number of samples grows, i.e. as you increase the value of n_sample.
from math import sqrt
import matplotlib.pyplot as plt

MAX_SAMPLE = 200  # Adjust this value as you need
x = np.arange(1, MAX_SAMPLE)  # start at 1 so monte_carlo always gets at least one sample
y = [monte_carlo(i, 1, sqrt(2)) for i in x]
plt.plot(x, y)
plt.show()
The resulting plot will show the value to which the estimates converge, which approximates the value computed from the closed-form solution of the definite integral.
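For reference: if the integral in the image is the expectation of x*exp(x) under N(1, 2), which is what the code above estimates, its closed-form value follows from the normal moment generating function, E[X*exp(X)] = (mu + sigma^2)*exp(mu + sigma^2/2) = 3*e^2, roughly 22.17. A quick sanity check (this assumes the integral really is that expectation; the image is not reproduced here):

import math
from math import sqrt

closed_form = (1 + 2) * math.exp(1 + 2 / 2)     # 3 * e**2, about 22.17
estimate = monte_carlo(100000, 1, sqrt(2))      # Monte Carlo estimate of the same quantity
print(closed_form, estimate)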
scipy.stats.entropy calculates the differential entropy for a continuous random variable. By which estimation method, and with which formula, exactly is it calculating the differential entropy (e.g. the differential entropy of a norm distribution versus that of the beta distribution)?
Below is its GitHub code. Differential entropy is the negative integral of the p.d.f. multiplied by the log of the p.d.f., but nowhere do I see that product or the log written out. Could it be hidden in the call to integrate.quad?
def _entropy(self, *args):
    def integ(x):
        val = self._pdf(x, *args)
        return entr(val)

    # upper limit is often inf, so suppress warnings when integrating
    _a, _b = self._get_support(*args)
    with np.errstate(over='ignore'):
        h = integrate.quad(integ, _a, _b)[0]

    if not np.isnan(h):
        return h
    else:
        # try with different limits if integration problems
        low, upp = self.ppf([1e-10, 1. - 1e-10], *args)
        if np.isinf(_b):
            upper = upp
        else:
            upper = _b
        if np.isinf(_a):
            lower = low
        else:
            lower = _a
        return integrate.quad(integ, lower, upper)[0]
Source (lines 2501 - 2524): https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py
You have to store a continuous random variable in some parametrized way anyway, unless you work with an approximation. When it is stored parametrically, you usually work with distribution objects, and for known distributions, formulae for the differential entropy in terms of the parameters exist.
Scipy accordingly provides an entropy method for rv_continuous that calculates the differential entropy where possible:
In [5]: import scipy.stats as st
In [6]: rv = st.beta(0.5, 0.5)
In [7]: rv.entropy()
Out[7]: array(-0.24156448)
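To the specific point in the original question: the log is hidden inside scipy.special.entr, which computes -p*log(p) elementwise, and integrate.quad then integrates that over the support, which is exactly the definition of differential entropy. A quick sanity check against the known closed form for the standard normal (my own sketch, not taken from the scipy source):

import numpy as np
from scipy import integrate, special, stats

rv = stats.norm(0, 1)
# special.entr(p) equals -p*log(p), so integrating it gives -Integral(p*log(p))
h, _ = integrate.quad(lambda x: special.entr(rv.pdf(x)), -np.inf, np.inf)
print(h, 0.5*np.log(2*np.pi*np.e), rv.entropy())   # all three are about 1.4189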
The actual question here is how you store a continuous random variable in memory. You might use some discretization technique and calculate the entropy of the resulting discrete random variable.
You may also check TensorFlow Probability, which treats distributions essentially as tensors and provides an entropy() method on its Distribution class.
I am trying to find the correlation function of the following stochastic process:
where beta and D are constants and xi(t) is a Gaussian noise term.
After simulating this process with the Euler method, I want to find the autocorrelation function of this process. First, I found an analytical solution for the correlation function and also computed it numerically from the definition of the autocorrelation, and the two results were pretty close (please see the figure; the corresponding code is at the end of this post).
(Figure 1)
Now I want to use the Wiener-Khinchin theorem (via the fft) to find the correlation function: take the fft of the realizations, multiply it by its conjugate, and then take the ifft. But I am getting results that are way off the expected correlation function, so I am pretty sure there is something I have misunderstood in the code that gives these wrong results.
Here is my code for the solution of the stochastic process (which I am sure is right, although my code might be sloppy) and my attempt to find the autocorrelation with the fft:
import math
import random
import numpy as np
import matplotlib.pyplot as plt

N = 1000000
dt = 0.01
gamma = 1
D = 1
v_data = []
v_factor = math.sqrt(2*D*dt)
v = 1
for t in range(N):
    F = random.gauss(0, 1)
    v = v - gamma*dt + v_factor*F
    if v < 0:  ### boundary conditions.
        v = -v
    v_data.append(v)

def S(x, dt):  ### power spectrum
    N = len(x)
    fft = np.fft.fft(x)
    s = fft*np.conjugate(fft)
    # n = N*np.ones(N) - np.arange(0, N)  # divide res(m) by (N-m)
    return s.real/(N)

c = np.fft.ifft(S(v_data, 0.01))  ### correlation function
t = np.linspace(0, 1000, len(c))
plt.plot(t, c.real, label='fft method')
plt.xlim(0, 20)
plt.legend()
plt.show()
And this is what I would get using this method for the correlation function:
And this is my code for the correlation function using the definition:
def c_theo(t,b,d): ##this was obtained by integrating the solution of the SDE
I1=((-t*d)+((d**2)/(b**2))-((1/4)*(b**2)*(t**2)))*special.erfc(b*t/(2*np.sqrt(d*t)))
I2=(((d/b)*(np.sqrt(d*t/np.pi)))+((1/2)*(b*t)*(np.sqrt(d*t/np.pi))))*np.exp(-((b**2)*t)/(4*d))
return I1+I2
## this is the correlation function that was plotted in figure 1 using the definition of the autocorrelation.
Ntau = 500
sum2 = np.zeros(Ntau)
c = np.zeros(Ntau)
v_mean = 0
for i in range(0, N):
    v_mean = v_mean + v_data[i]
v_mean = v_mean/N
for itau in range(0, Ntau):
    for i in range(0, N - 10*itau):
        sum2[itau] = sum2[itau] + v_data[i]*v_data[itau*10 + i]
    sum2[itau] = sum2[itau]/(N - itau*10)
    c[itau] = sum2[itau] - v_mean**2

t = np.arange(Ntau)*dt*10
plt.plot(t, c, label='numerically')
plt.plot(t, c_theo(t, 1, 1), label='analytically')
plt.legend()
plt.show()
So would someone please point out where the mistake in my code is, and how I could simulate this better to get the right correlation function?
There are two issues with the code that I can see.
As francis said in a comment, you need to subtract the mean from your signal to get the autocorrelation to reach zero.
You plot your autocorrelation function with the wrong x-axis values.
v_data is defined with:
N = 1000000   # 1e6
dt = 0.01     # 1e-2
meaning that t goes from 0 to 1e4. However:
t = np.linspace(0,1000,len(c))
meaning that you plot with t from 0 to 1e3. You should probably define t with
t = np.arange(N) * dt
Looking at the plot, I'd say that stretching the blue line by a factor 10 would make it line up with the red line quite well.
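Putting both fixes together, a minimal sketch (it reuses v_data, N and dt from the question and assumes numpy and matplotlib are already imported):

v = np.asarray(v_data) - np.mean(v_data)          # 1) subtract the mean
fft = np.fft.fft(v)
c = np.fft.ifft(fft*np.conjugate(fft)).real/N     # autocorrelation via Wiener-Khinchin
t = np.arange(N)*dt                               # 2) correct time axis
plt.plot(t, c, label='fft method')
plt.xlim(0, 20)
plt.legend()
plt.show()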
I have a piecewise quartic distribution with a probability density function:
p(x) = c*(x/a)^2             if 0 ≤ x < a
p(x) = c*((b+a-x)^2/b)^2     if a ≤ x ≤ b
p(x) = 0                     otherwise
Suppose c, a, b are known; I am trying to draw 100 random samples from the distribution. How can I do that with numpy/scipy?
One standard way is to find an explicit formula G = F^-1 for the inverse of the cumulative distribution function. That is doable here (although it will naturally be piecewise defined); then use G(U), where U is uniform on [0,1], to generate your samples.
In this case, I think that I worked out the details, but you will need to check the Calculus/Algebra.
First of all, to streamline things it helps to introduce a couple of new parameters. Let

f(a, b, c, d, e, x) = c*x**2         # if 0 <= x <= a

and

f(a, b, c, d, e, x) = d*(x - e)**4   # if a < x <= b

Then your p(x) is given by

p(x) = f(a, b, c/a**2, c/b**2, a+b, x)
I integrated f to find the cumulative distribution, inverted it, and got the following:

def Finverse(a, b, c, d, e, x):
    # here c = (original c)/a**2, d = (original c)/b**2, e = a + b, and x is uniform on [0, 1]
    if x <= (c*a**3)/3:
        return (3*x/c)**(1/3)
    else:
        y = (a - e)**5 + 5*(x - (c*a**3)/3)/d
        # y is negative on this branch, so take the real fifth root by hand
        return e - (-y)**(1/5)
Assuming this is right, then simply:
import random

def randX(a, b, c):
    u = random.random()
    return Finverse(a, b, c/a**2, c/b**2, a+b, u)
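For completeness, a usage sketch: the a and b values below are made up, and c is then fixed by requiring p to integrate to 1 (the two pieces integrate to c*a/3 and c*(b**5 - a**5)/(5*b**2) respectively).

a, b = 1.0, 3.0
c = 1.0/(a/3 + (b**5 - a**5)/(5*b**2))   # normalization constant
samples = [randX(a, b, c) for _ in range(100)]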
In this case it was possible to work out an explicit formula. When you can't work out such a formula for the inverse, consider using the Monte Carlo methods described by @lucianopaz.
As your function is bounded both in x and p(x), I recommend that you use Monte Carlo rejection sampling. The basic principle is that you draw two uniform random numbers, one representing a candidate x within the x bounds [0, b] and another representing y. If y is less than or equal to p(x), the sampled x is returned; if not, it continues to the next iteration.
import numpy as np

def rejection_sampler(p, xbounds, pmax):
    while True:
        x = np.random.rand(1)*(xbounds[1] - xbounds[0]) + xbounds[0]
        y = np.random.rand(1)*pmax
        if y <= p(x):
            return x
Here, p should be a callable that evaluates your normalized piecewise probability density, xbounds can be a list or tuple containing the lower and upper bounds, and pmax should be the maximum of the probability density on that interval.
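For example, with the piecewise density from the question (the a and b values here are made up; c is fixed by normalization and pmax is a simple upper bound on the density):

a, b = 1.0, 3.0
c = 1.0/(a/3 + (b**5 - a**5)/(5*b**2))   # so that p integrates to 1

def p(x):
    x = np.asarray(x).item()   # rejection_sampler passes a length-1 array
    if 0 <= x < a:
        return c*(x/a)**2
    elif a <= x <= b:
        return c*((b + a - x)**2/b)**2
    return 0.0

pmax = c*max(1.0, b**2)   # the density's largest value for these parameters, reached at x = a
samples = np.array([rejection_sampler(p, (0, b), pmax) for _ in range(100)])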
I'm trying to understand the spherical harmonics expansion in order to solve a more complex problem, but the result of a very simple calculation is not what I expect. I have no clue why this is happening.
A bit of theory: it is well known that a function f(theta, phi) on the surface of a sphere can be written as an infinite sum over constant coefficients f_lm and the spherical harmonics Y_lm:

f(theta, phi) = sum_{l=0..inf} sum_{m=-l..l} f_lm * Y_lm(theta, phi)

The spherical harmonics are defined as:

Y_lm(theta, phi) = sqrt( (2l+1)/(4*pi) * (l-m)!/(l+m)! ) * P_lm(cos(theta)) * exp(i*m*phi)

where P_lm are the associated Legendre polynomials.

And finally, the constant coefficients can be calculated (similarly to the Fourier transform) as follows:

f_lm = Integral( f(theta, phi) * conj(Y_lm(theta, phi)) * dOmega )

The problem: let's assume we have a sphere centered at the origin on whose surface the function is equal to 1 for all points. We want to calculate the constant coefficients f_lm and then reconstruct the surface function from a band-limited sum. Since f is identically 1, the calculation of the constant coefficients reduces to:

f_lm = Integral( conj(Y_lm(theta, phi)) * dOmega )

which numerically (in Python) can be approximated using:
import numpy as np
from numpy import pi
import scipy.special

def Ylm(l, m, theta, phi):
    return scipy.special.sph_harm(m, l, theta, phi)

def flm(l, m):
    phi, theta = np.mgrid[0:pi:101j, 0:2*pi:101j]
    return Ylm(l, m, theta, phi).sum()
Then, by computing a band-limited sum up to some maximum degree L, I'm expecting to get back f ≈ 1 at any given point (theta0, phi0):
L = 20
f = 0
theta0, phi0 = 0.0, 0.0
for l in xrange(0, L+1):
    for m in xrange(-l, l+1):
        f += flm(l, m)*Ylm(l, m, theta0, phi0)
print f
but the value it gives me is not 1, and it does not get closer to 1 when I increase L.
I know this may seem more like a mathematics problem, but the formulas should be correct; the problem seems to be in my computation. It could be a really stupid mistake, but I cannot spot it. Any suggestions?
Thanks
The spherical harmonics are orthonormal with respect to the inner product

<f|g> = Integral( conj(f(theta,phi)) * g(theta,phi) * sin(theta) * dphi * dtheta )

So you should calculate the coefficients by

clm = Integral( conj(Ylm(theta,phi)) * sin(theta) * dphi * dtheta )
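A minimal sketch of what that means for the code in the question (my own, not part of the original answer): besides the sin weight, it also multiplies by the grid spacing dphi*dtheta so that the double sum actually approximates the integral. Note that in scipy's sph_harm the fourth argument is the polar angle, which the question's code calls phi.

import numpy as np
from scipy.special import sph_harm

def flm(l, m, n=101):
    phi, theta = np.mgrid[0:np.pi:n*1j, 0:2*np.pi:n*1j]   # polar, azimuthal
    dphi, dtheta = np.pi/(n - 1), 2*np.pi/(n - 1)
    # weight each grid point by sin(polar angle) and by the area element dphi*dtheta
    return (np.conj(sph_harm(m, l, theta, phi)) * np.sin(phi)).sum() * dphi * dtheta

# With this weighting the band-limited sum does converge to 1:
L, theta0, phi0 = 20, 0.0, 0.0
f = sum(flm(l, m) * sph_harm(m, l, theta0, phi0)
        for l in range(L + 1) for m in range(-l, l + 1))
print(f)   # approximately (1+0j)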
The MWE below shows two ways of integrating the same 2D kernel density estimate, obtained for this data using the stats.gaussian_kde() function.
The integration is performed for all (x, y) below the threshold point (x1, y1), which defines the upper integration limits (lower integration limits are -infinity; see MWE).
The int1 function uses a simple Monte Carlo approach.
The int2 function uses the scipy.integrate.nquad function.
The issue is that int1 (i.e., the Monte Carlo method) gives systematically larger values for the integral than int2. I don't know why this happens.
Here's an example of the integral values obtained after 200 runs of int1 (blue histogram) versus the integral result given by int2 (red vertical line):
What is the origin of this difference in the resulting integral value?
MWE
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import integrate

def int1(kernel, x1, y1):
    # Compute the point below which to integrate
    iso = kernel((x1, y1))
    # Sample KDE distribution
    sample = kernel.resample(size=50000)
    # Filter the sample
    insample = kernel(sample) < iso
    # The integral is equivalent to the probability of drawing a
    # point that gets through the filter
    integral = insample.sum() / float(insample.shape[0])
    return integral

def int2(kernel, x1, y1):
    def f_kde(x, y):
        return kernel((x, y))
    # 2D integration in: (-inf, x1), (-inf, y1).
    integral = integrate.nquad(f_kde, [[-np.inf, x1], [-np.inf, y1]])
    return integral

# Obtain data from file.
data = np.loadtxt('data.dat', unpack=True)
# Perform a kernel density estimate (KDE) on the data
kernel = stats.gaussian_kde(data)
# Define the threshold point that determines the integration limits.
x1, y1 = 2.5, 1.5

i2 = int2(kernel, x1, y1)
print i2

int1_vals = []
for _ in range(200):
    i = int1(kernel, x1, y1)
    int1_vals.append(i)
    print i
Add
Notice that this question originated from this answer. At first I didn't notice that the answer was mistaken in the integration limits used, which explains why the results between int1 and int2 are different.
int1 is integrating in the domain f(x,y)<f(x1,y1) (where f is the kernel density estimate), while int2 integrates in the domain (x,y)<(x1,y1).
You resample the distribution
sample = kernel.resample(size=50000)
and then, for each sampled point, check whether its probability density is less than the density at the bound
insample = kernel(sample) < iso
This is incorrect. Consider the bounds (0, 100) and assume your data has mean u = (0, 0) and cov = [[100, 0], [0, 100]]. The points (0, 50) and (50, 0) have the same density under this kernel, but only one of them is within the bounds. Since both pass the test, you are overcounting.
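A quick way to see this concretely, using an exact bivariate normal in place of the KDE (my own illustration, not part of the original answer):

import numpy as np
from scipy import stats

rv = stats.multivariate_normal([0, 0], [[100, 0], [0, 100]])
print(rv.pdf([0, 50]), rv.pdf([50, 0]))          # identical densities
print(np.all(np.array([0, 50]) <= [0, 100]),     # (0, 50) is inside the bounds  -> True
      np.all(np.array([50, 0]) <= [0, 100]))     # (50, 0) is outside the bounds -> False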
You should be testing whether each point in sample is inside the bounds, then compute the probability. Something like
def int1(kernel, x1, y1):
    # Sample KDE distribution
    sample = kernel.resample(size=100)
    include = (sample < np.repeat([[x1], [y1]], sample.shape[1], axis=1)).all(axis=0)
    integral = include.sum() / float(sample.shape[1])
    return integral
I tested this using the following code
import numpy as np
import scipy.stats

def measure(n):
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(size=n)
    return m1, m2

a = scipy.stats.gaussian_kde(np.vstack(measure(1000)))
print(int1(a, -10, -10))
print(int2(a, -10, -10))
print(int1(a, 0, 0))
print(int2(a, -0, -0))
Yields
0.0
(4.304674927251112e-232, 4.6980863813551415e-230)
0.26
(0.25897626178338407, 1.4536217446381293e-08)
Monte Carlo integration should work like this:
1) Sample N random values (uniformly, not from your distribution) over some subset of the possible values of x/y (below I bound it by 10 SDs from the mean).
2) For each random value, compute kernel(rand_x, rand_y).
3) Compute the sum and multiply by (volume)/N_samples.
In code:
def mc_wo_sample(kernel, x1, y1, lboundx, lboundy):
    nsamples = 50000
    volume = (x1 - lboundx)*(y1 - lboundy)
    # generate uniform points in range
    xrand = np.random.rand(nsamples, 1)*(x1 - lboundx) + lboundx
    yrand = np.random.rand(nsamples, 1)*(y1 - lboundy) + lboundy
    randvals = np.hstack((xrand, yrand)).transpose()
    print(randvals.shape)
    return (volume*kernel(randvals).sum())/nsamples
Running the following
print(int1(a,-9,-9))
print(int2(a,-9,-9))
print(mc_wo_sample(a,-9,-9,-10,-10))
print(int1(a,0,0))
print(int2(a,-0,-0))
print(mc_wo_sample(a,0,0,-10,-10))
yields
0.0
(4.012958496109042e-70, 6.7211236076277e-71)
4.08538890986e-70
0.36
(0.37101621760650216, 1.4670898180664756e-08)
0.361614657674