Differential entropy is calculated with integrate.quad in scipy.stats?

The entropy method of the continuous distributions in scipy.stats (rv_continuous.entropy) calculates the differential entropy of a continuous random variable. Which estimation method, and which formula exactly, does it use? (e.g. for the differential entropy of the norm distribution versus that of the beta distribution)
Below is its GitHub code. Differential entropy is the negative integral of the p.d.f. multiplied by the log of the p.d.f., but nowhere do I see this or the log written. Could it be in the call to integrate.quad?
def _entropy(self, *args):
    def integ(x):
        val = self._pdf(x, *args)
        return entr(val)
    # upper limit is often inf, so suppress warnings when integrating
    _a, _b = self._get_support(*args)
    with np.errstate(over='ignore'):
        h = integrate.quad(integ, _a, _b)[0]
    if not np.isnan(h):
        return h
    else:
        # try with different limits if integration problems
        low, upp = self.ppf([1e-10, 1. - 1e-10], *args)
        if np.isinf(_b):
            upper = upp
        else:
            upper = _b
        if np.isinf(_a):
            lower = low
        else:
            lower = _a
        return integrate.quad(integ, lower, upper)[0]
Source (lines 2501 - 2524): https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py
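The log is not written out in the snippet because it sits inside entr: scipy.special.entr(p) evaluates -p*log(p) elementwise (and returns 0 at p == 0). So integrating entr(pdf(x)) with integrate.quad is the usual formula for differential entropy. A minimal sketch that reproduces this outside the class, assuming the standard normal as the test case (the names integrand, h_quad, h_closed and h_scipy are only for illustration):

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import entr

# entr(p) is -p*log(p), which is exactly the integrand of differential entropy
p = 0.3
manual = -p * np.log(p)

def integrand(x):
    # same structure as integ() in the snippet: entr applied to the pdf
    return entr(stats.norm.pdf(x))

# numerical integral of -pdf*log(pdf) over the whole support
h_quad = integrate.quad(integrand, -np.inf, np.inf)[0]

# known closed form for a normal with sigma = 1: 0.5*log(2*pi*e)
h_closed = 0.5 * np.log(2 * np.pi * np.e)

# what the entropy method of the distribution reports
h_scipy = float(stats.norm.entropy())

print(h_quad, h_closed, h_scipy)
```

Note that individual distributions may override _entropy with a closed-form expression, so the generic quadrature above is only the fallback path.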

You have to store a continuous random variable in some parametrized way anyway, unless you work with an approximation. In that case, you usually work with distribution objects; and for known distributions, formulae for the differential entropy in terms of the parameters exist.
Scipy accordingly provides an entropy method for rv_continuous that calculates the differential entropy where possible:
In [5]: import scipy.stats as st
In [6]: rv = st.beta(0.5, 0.5)
In [7]: rv.entropy()
Out[7]: array(-0.24156448)
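As a cross-check of the value above, here is a small sketch using the textbook closed form for the differential entropy of a Beta(a, b) distribution, ln B(a, b) - (a-1)ψ(a) - (b-1)ψ(b) + (a+b-2)ψ(a+b). The helper name beta_entropy is hypothetical, and this is not a claim about how SciPy implements it internally:

```python
import numpy as np
import scipy.stats as st
from scipy.special import betaln, digamma

def beta_entropy(a, b):
    # closed-form differential entropy of Beta(a, b), in nats
    return (betaln(a, b)
            - (a - 1) * digamma(a)
            - (b - 1) * digamma(b)
            + (a + b - 2) * digamma(a + b))

closed_form = beta_entropy(0.5, 0.5)
from_scipy = float(st.beta(0.5, 0.5).entropy())
print(closed_form, from_scipy)
```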

The actual question here is how do you store a continuous variable in memory. You might use some discretization techniques and calculate entropy for a discrete random variable.
You also may check Tensorflow Probability, which treats distributions essentially as tensors and has a method entropy() for a Distribution class.

Related

scipy rv_continuous very slow

I am using a custom function f(x) to define a custom distribution using SciPy's rv_continuous class. My code is
class my_pdf_gen(rv_continuous):
    def _pdf(self, x, integral):
        return f(x)/integral
where integral ensures the normalisation. I am able to create an instance of it with
my_pdf = my_pdf_gen(my_int,a = a, b = b, name = 'my pdf')
with a, b the lower and upper limits of the value's range, and my_int = scipy.integrate.quad(f, a, b)[0].
I am also able to create a random sample of data using my_pdf.rvs(my_int, size = 5), but this is very slow. (Up to 6 seconds when size=9).
I read that one should also overwrite some other methods in the class (like _ppf), but from the examples I found it isn't clear to me how to achieve it in my case.
Thanks a lot!

It's expected to be slow since the generic implementation does root-solving for cdf, which itself uses numerical integration.
So your best bet is to provide a _ppf or _rvs implementation. How to do this greatly depends on the details of f(x). If you cannot invert the cdf analytically (that is, solve F(x) = r for x, where F is the cdf), consider tabulating / inverse interpolation or rejection sampling.
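A minimal sketch of the tabulating / inverse interpolation idea, written as a standalone sampler rather than an rv_continuous subclass. It assumes f is vectorized and strictly positive on [a, b]; make_sampler, n_grid and seed are hypothetical names for illustration, and the Gaussian-shaped f is only a stand-in for your function:

```python
import numpy as np

def make_sampler(f, a, b, n_grid=2001, seed=None):
    # tabulate the unnormalised pdf on a grid
    x = np.linspace(a, b, n_grid)
    pdf = f(x)
    # cumulative trapezoid rule gives the unnormalised cdf at the grid points
    steps = 0.5 * (pdf[1:] + pdf[:-1]) * np.diff(x)
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    cdf /= cdf[-1]  # normalise so that cdf runs from 0 to 1
    rng = np.random.default_rng(seed)

    def sample(size):
        # inverse-transform sampling: interpolate x as a function of the cdf
        u = rng.uniform(size=size)
        return np.interp(u, cdf, x)

    return sample

# stand-in for the custom f(x): an unnormalised Gaussian shape on [-5, 5]
sample = make_sampler(lambda x: np.exp(-0.5 * x**2), -5.0, 5.0, seed=0)
draws = sample(20000)
print(draws.mean(), draws.std())
```

The table is built once, so each draw costs one uniform random number and one interpolation instead of a root-solve over a numerically integrated cdf.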

I solved the problem by changing approach and using Monte Carlo's rejection sampler method
def rejection_sampler(p,xbounds,pmax):
    while True:
        x = np.random.rand(1)*(xbounds[1]-xbounds[0])+xbounds[0]
        y = np.random.rand(1)*pmax
        if y<=p(x):
            return x
where p is the probability density function, xbounds is a tuple containing the lower and upper limits of the pdf's domain, and pmax is the maximum value of the pdf on that domain.
Monte Carlo's rejection sampler was suggested here: python: random sampling from self-defined probability function

Solver tolerance and residual error when using sweep function in FiPy

I was trying to use FiPy to solve a set of PDEs when I realized the command sweep was not working the way I thought it would. Here goes a sample with part of my code:
from pylab import *
import sys
from fipy import *
viscosity = 5.55555555556e-06
Pe =5.
pfi=100.
lfi=0.01
Ly=1.
Nx =200
Ny=100
Lx=Ly*Nx/Ny
dL=Ly/Ny
mesh = PeriodicGrid2DTopBottom(nx=Nx, ny=Ny, dx=dL, dy=dL)
x, y = mesh.cellCenters
xVelocity = CellVariable(mesh=mesh, hasOld=True, name='X velocity')
xVelocity.constrain(Pe, mesh.facesLeft)
xVelocity.constrain(Pe, mesh.facesRight)
rad=0.1
var1 = DistanceVariable(name='distance to center', mesh=mesh, value=numerix.sqrt((x-Nx*dL/2.)**2+(y-Ny*dL/2.)**2))
pi_fi= CellVariable(mesh=mesh, value=0.,name='Fluid-interface energy map')
pi_fi.setValue(pfi*exp(-1.*(var1-rad)/lfi), where=(var1 > rad) )
pi_fi.setValue(pfi, where=(var1 <= rad))
xVelocityEq = DiffusionTerm(coeff=viscosity) - ImplicitSourceTerm(pi_fi)
xres=10.
while (xres > 1.e-6) :
    xVelocity.updateOld()
    mySolver = LinearGMRESSolver(iterations=1000,tolerance=1.e-6)
    xres = xVelocityEq.sweep(var=xVelocity,solver=mySolver)
    print('Result = ', xres)
#Thats it
In short, I am declaring a function called xVelocityEq and solving it using sweep. Here is my output:
Result = 0.0007856742013190237
Result = 6.414470433257661e-07
As you can see, the while loop ends after two iterations. My first question is: why is my first residual error (=0.0007856742013190237) higher than the solver's tolerance? I thought that, since xVelocityEq corresponds to a linear system, solver tolerance and residual error would mean the same thing.
If I increase the no. of iterations in mySolver from 1000 to 10000, I get the following output:
Result = 0.0007856742013190237
Result = 2.4619110931978988e-09
Why did the second residual change, given that the first remained the same?
If I increase the tolerance in mySolver from 1.e-6 to 7.e-4, I get the following output:
Result = 0.0007856742013190237
Result = 6.414470433257661e-07
Note that these residuals are the same as in the first output. Now if I try to further increase the tolerance to 8.e-4, here's what I get as output:
Result = 0.0007856742013190237
Result = 0.0007856742013190237
Result = 0.0007856742013190237
Result = 0.0007856742013190237
Result = 0.0007856742013190237
...
At this point I was completely lost. Why do the residuals have the same values for all solver tolerances smaller than 7.e-4? And why are these residuals constant and equal to 0.0007856742013190237 for solver tolerances higher than 7.e-4?
If I change the mySolver to LinearLUSolver (iterations=1000, tolerance=1.e-6), here's what I get:
Result = 0.0007856742013190237
Result = 1.6772757200988522e-18
Why in the world is my first residual the same as before, even though I have changed the solver?

why is my first residual error (=0.0007856742013190237) higher than the solver's tolerance?
The residual calculated by .sweep() is calculated before the solver is invoked to calculate a new solution vector. The matrix L and right-hand-side vector b are calculated based on the initial value of the solution vector x.
The residual is a measure of how well the current solution vector satisfies the non-linear PDE. The solver tolerance places a limit on how hard the solver should work to satisfy the linear system of equations discretized from the PDE.
Even if the PDE is linear (e.g., the diffusion coefficient is not a function of the solution variable), the initial value presumably doesn't solve the PDE, so the residual is large. After the solver is invoked, then x should solve the PDE, to within the solver tolerance. If the PDE is non-linear, then a well-converged solution to the linear algebra is still probably not a good solution to the PDE; that's what sweeping is for.
I thought that, since xVelocityEq corresponds to a linear system, solver tolerance and residual error would mean the same thing.
There wouldn't be any utility in keeping track of both. In addition to the residual being before the solve and the solver tolerance being used to terminate the solve, there are different normalizations that can be used and a lot of the solver documentation can be kind of sketchy. FiPy uses |L x - b|_2 as its residual. Solvers may normalize by the magnitude of b, the diagonal of L, or the phase of the moon, all of which can make it hard to directly compare the residual with the tolerance.
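To make the ordering concrete, here is a toy sketch in plain NumPy (not FiPy code). It assumes a small, well-conditioned linear system and a hypothetical toy_sweep function that mimics the behaviour described here: report |L x - b|_2 for the incoming x, then solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# toy diffusion-like matrix with an implicit sink term on the diagonal
L = (np.diag(-2.5 * np.ones(n))
     + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1))
b = rng.normal(size=n)

def toy_sweep(x):
    # residual is evaluated with the incoming x, before any solve happens
    residual = np.linalg.norm(L @ x - b)
    # only afterwards is the linear system solved for a new x
    x_new = np.linalg.solve(L, b)
    return residual, x_new

x = np.zeros(n)
res1, x = toy_sweep(x)  # depends only on the initial guess, not on the solver
res2, x = toy_sweep(x)  # reflects how well the previous solve did
print(res1, res2)
```

In this sketch the first residual equals |b|_2 for any solver, because it only measures the initial guess; the second one is tiny because the previous solve was exact.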
Why did the second residual change, given that the first remained the same?
By allowing 10000 iterations instead of 1000, the solver was able to drive to a more exacting tolerance which, in turn, led to a smaller residual for the next sweep.
Why do the residuals have the same values for all solver tolerances smaller than 7.e-4? And why are these residuals constant and equal to 0.0007856742013190237 for solver tolerances higher than 7.e-4?
Probably because the solver is failing and so not changing the value of the solution vector. Some solvers don't report this. In other cases, we should be doing a better job of reporting that fact to you.
Why in the world is my first residual the same as before, even though I have changed the solver?
The residual is not a property of the solver. It is a property of the discretized system of equations that approximates your PDE. Those linear algebra equations are then the input to the solver.

Stochastic integration with python

I want to numerically solve integrals that contain white noise.
Mathematically, white noise can be described by a random variable X(t) with time average Avg[X(t)] = 0 and correlation function Avg[X(t) X(t')] = delta_distribution(t-t').
A simple example would be to calculate the integral over X(t) from t=0 to t=1. On average this is of course zero, but what I need are different realizations of this integral.
The problem is that this does not work with scipy.integrate.quad().
Are there any packages for python that deal with stochastic integrals?

This is a good starting point for numerical SDE methods: http://math.gmu.edu/~tsauer/pre/sde.pdf.
Here is a simple numpy solver for the stochastic differential equation dX_t = a(t,X_t)dt + b(t,X_t)dW_t which I wrote for a class project last year. It is based on the forward Euler method for ordinary differential equations, and in practice is fairly widely used when solving SDEs.
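import numpy as np
def euler_maruyama(a,b,x0,t):
    N = len(t)
    x = np.zeros((N,len(x0)))
    x[0] = x0
    for i in range(N-1):
        dt = t[i+1]-t[i]
        # np.random.normal takes the standard deviation, so use sqrt(dt) to get variance dt
        dWt = np.random.normal(0,np.sqrt(dt))
        x[i+1] = x[i] + a(t[i],x[i])*dt + b(t[i],x[i])*dWt
    return x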
Essentially, at each timestep, the deterministic part of the function is integrated using forward Euler, and the stochastic part is integrated by generating a normal random variable dWt with mean 0 and variance dt and integrating the stochastic part with respect to this.
The reason we generate dWt like this is based on the definition of Brownian motion. In particular, if $W$ is a Brownian motion, then $(W_t-W_s)$ is normally distributed with mean 0 and variance $t-s$. So dWt is a discretization of the change in $W$ over a small time interval.
This is the docstring from the function above:
Parameters
----------
a : callable a(t,X_t),
    where t is scalar time and X_t is vector position
b : callable b(t,X_t),
    where t is scalar time and X_t is vector position
x0 : ndarray
    the initial position
t : ndarray
    list of times at which to evaluate trajectory
Returns
-------
x : ndarray
    positions of trajectory at each time in t
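For the example in the question (the integral of white noise from t=0 to t=1), one can take a = 0 and b = 1, so that X_1 is one realization of the integral. A minimal sketch under some assumptions of mine: the function is restated so the block is self-contained, it takes an explicit rng argument, and it draws an independent increment per component so that each component of x0 is a separate realization. These are variations on the solver above, not part of it.

```python
import numpy as np

def euler_maruyama(a, b, x0, t, rng):
    x0 = np.asarray(x0, dtype=float)
    x = np.zeros((len(t), len(x0)))
    x[0] = x0
    for i in range(len(t) - 1):
        dt = t[i + 1] - t[i]
        # independent Brownian increments, each with mean 0 and variance dt
        dW = rng.normal(0.0, np.sqrt(dt), size=len(x0))
        x[i + 1] = x[i] + a(t[i], x[i]) * dt + b(t[i], x[i]) * dW
    return x

rng = np.random.default_rng(42)
t = np.linspace(0.0, 1.0, 101)
# a = 0 and b = 1: dX = dW, so X_1 is the integral of white noise over [0, 1]
paths = euler_maruyama(lambda t, x: 0.0 * x,
                       lambda t, x: np.ones_like(x),
                       np.zeros(5000), t, rng)
realizations = paths[-1]
print(realizations.mean(), realizations.var())
```

The sample mean should be close to 0 and the sample variance close to 1, since the integral of white noise over [0, 1] is W_1.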

Slow scipy double quadrature integration

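I'm trying to obtain the function expected_W or H that is the result of an integration:
$$H(p, \theta) = \int \int w(p, \theta, \epsilon, \beta) \, f(\beta \mid \theta) \, q(\epsilon) \, d\epsilon \, d\beta$$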
where:
theta is a vector with two elements: theta_0 and theta_1
f(beta | theta) is a normal density for beta with mean theta_0 and variance theta_1
q(epsilon) is a normal density for epsilon with mean zero and variance sigma_epsilon (set to 1 by default).
w(p, theta, eps, beta) is a function I take as input, so I cannot predict exactly how it looks. It will likely be non-linear, but not particularly nasty.
This is the way I implement the problem. I'm sure the wrapper functions I make are a mess, so I'd be happy to receive any help on that too.
from __future__ import division
from scipy import integrate
from scipy.stats import norm
import math
import numpy as np
def exp_w(w_B, sigma_eps = 1, **kwargs):
    '''
    Integrates the w_B function
    Input:
    + w_B : the function to be integrated.
    + sigma_eps : variance of the epsilon term. Set to 1 by default
    '''
    #The integrand function gives everything under the integral:
    # w(B(p, \theta, \epsilon, \beta)) f(\beta | \theta ) q(\epsilon)
    def integrand(eps, beta, p, theta_0, theta_1, sigma_eps=sigma_eps):
        q_e = norm.pdf(eps, loc=0, scale=math.sqrt(sigma_eps))
        f_beta = norm.pdf(beta, loc=theta_0, scale=math.sqrt(theta_1))
        return w_B(p = p,
                   theta_0 = theta_0, theta_1 = theta_1,
                   eps = eps, beta=beta)* q_e *f_beta
    #limits of integration. Using limited support for now.
    eps_inf = lambda beta : -10 # otherwise: -np.inf
    eps_sup = lambda beta : 10 # otherwise: np.inf
    beta_inf = -10
    beta_sup = 10
    def integrated_f(p, theta_0, theta_1):
        return integrate.dblquad(integrand, beta_inf, beta_sup,
                                 eps_inf, eps_sup,
                                 args = (p, theta_0, theta_1))
    # this integrated_f is the H referenced at the top of the question
    return integrated_f
I tested this function with a simple w function for which I know the analytic solution (this won't usually be the case).
def test_exp_w():
    def w_B(p, theta_0, theta_1, eps, beta):
        return 3*(p*eps + p*(theta_0 + theta_1) - beta)
    # Function that I get
    integrated = exp_w(w_B, sigma_eps = 1)
    # Function that I should get
    def exp_result(p, theta_0, theta_1):
        return 3*p*(theta_0 + theta_1) - 3*theta_0
    args = np.random.rand(3)
    d_args = {'p' : args[0], 'theta_0' : args[1], 'theta_1' : args[2]}
    if not (np.allclose(
            integrated(**d_args)[0], exp_result(**d_args)) ):
        raise Exception("Integration procedure isn't working!")
Hence, my implementation seems to be working, but it's very slow for my purpose. I need to repeat this process with tens or hundreds of thousands of times (this is a step in a Value function iteration. I can give more info if people think it's relevant).
With scipy version 0.14.0 and numpy version 1.8.1, this integral takes 15 seconds to compute.
Does anybody have any suggestion on how to go about this?
To start with, it would probably help to get bounded domains of integration, but I haven't figured out how to do that or whether the Gaussian quadrature in SciPy takes care of it in a good way (does it use Gauss-Hermite?).
Thanks for your time.
---- Edit: adding profiling times -----
%lprun results show that most of the time is spent in
_distn_infrastructure.py:1529(pdf) and
_continuous_distns.py:97(_norm_pdf)
each with a whopping 83244 calls.

The time taken to integrate your function sounds very long if the function is not a nasty one.
First thing I suggest you do is to profile where the time is spent. Is it spent in dblquad or elsewhere? How many calls are made to w_B during the integration? If the time is spent in dblquad and the number of calls is very high, could you use looser tolerances in the integration?
It seems that the multiplication by the gaussians actually enables you to limit the integration limits a great deal, as most of the energy of the gaussian is within a very small area. You might want to try and calculate reasonable tighter bounds. You have already limited the area into -10..10; is there any significant performance change between -100..100, -10..10, and -1..1?
If you know your functions are relatively smooth, then there is a Mickey-Mouse version of the integration:
1. determine reasonable upper and lower limits in both axes (by the gaussians)
2. calculate a reasonable grid density (e.g. 100 points in each direction)
3. calculate the w_B for each of these points (and this will be much faster, if it is possible to require a vectorized version of w_B)
4. sum it all together
This is very low-tech but also very fast. Whether or not it gives you results which are good enough for the outer iteration is an interesting question. It just might.
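A minimal sketch of that low-tech grid sum, assuming w_B accepts NumPy arrays for eps and beta. The function name grid_expectation and the parameters n and width (number of grid points per axis and half-width of the grid in standard deviations) are my own choices for illustration; the test function is the one from the question.

```python
import numpy as np
from scipy.stats import norm

def grid_expectation(w_B, p, theta_0, theta_1, sigma_eps=1.0, n=201, width=8.0):
    # theta_1 and sigma_eps are variances, as in the question
    sd_beta = np.sqrt(theta_1)
    sd_eps = np.sqrt(sigma_eps)
    # limits chosen from the gaussians: +/- width standard deviations
    beta = np.linspace(theta_0 - width * sd_beta, theta_0 + width * sd_beta, n)
    eps = np.linspace(-width * sd_eps, width * sd_eps, n)
    d_beta = beta[1] - beta[0]
    d_eps = eps[1] - eps[0]
    # evaluate everything on the full grid in one vectorized call
    E, B = np.meshgrid(eps, beta, indexing="ij")
    weights = norm.pdf(E, scale=sd_eps) * norm.pdf(B, loc=theta_0, scale=sd_beta)
    values = w_B(p=p, theta_0=theta_0, theta_1=theta_1, eps=E, beta=B)
    # plain Riemann sum over the grid
    return np.sum(values * weights) * d_eps * d_beta

def w_B(p, theta_0, theta_1, eps, beta):
    return 3 * (p * eps + p * (theta_0 + theta_1) - beta)

p, theta_0, theta_1 = 0.3, 0.5, 0.7
approx = grid_expectation(w_B, p, theta_0, theta_1)
exact = 3 * p * (theta_0 + theta_1) - 3 * theta_0
print(approx, exact)
```

This evaluates the normal pdfs on about 40000 grid points in a single vectorized call instead of tens of thousands of scalar calls, which is where the profile says the time goes. How coarse the grid can be depends on how smooth the real w is.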

How to make sure that solution is global minimum while using python scipy.optimize.minimize

I was implementing logistic regression in python. To find theta , I was struggling to decide which is the best algorithm that always guarantees global optima without bothering about initial parameter theta.
import numpy as np
import scipy.optimize as op
def Sigmoid(z):
    return 1/(1 + np.exp(-z))
def Gradient(theta,x,y):
    m , n = x.shape
    theta = theta.reshape((n,1))
    y = y.reshape((m,1))
    sigmoid_x_theta = Sigmoid(x.dot(theta))
    grad = ((x.T).dot(sigmoid_x_theta-y))/m
    return grad.flatten()
def CostFunc(theta,x,y):
    m,n = x.shape
    theta = theta.reshape((n,1))
    y = y.reshape((m,1))
    term1 = np.log(Sigmoid(x.dot(theta)))
    term2 = np.log(1-Sigmoid(x.dot(theta)))
    term1 = term1.reshape((m,1))
    term2 = term2.reshape((m,1))
    term = y * term1 + (1 - y) * term2
    J = -((np.sum(term))/m)
    return J
data = np.loadtxt('ex2data1.txt',delimiter=',')
# m training samples and n attributes
m , n = data.shape
X = data[:,0:n-1]
y = data[:,n-1:]
X = np.concatenate((np.ones((m,1)), X),axis = 1)
initial_theta = np.zeros((n,1))
m , n = X.shape
Result = op.minimize(fun = CostFunc,
                     x0 = initial_theta,
                     args = (X,y),
                     method = 'TNC',
                     jac = Gradient)
theta = Result.x
where content of ex2data1.txt is:
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
The code above gives theta = Result.x as [-25.87282405 0.21193078 0.20722013]. This is the global minimum if initial_theta = np.zeros((n,1)). But if initial_theta = np.ones((n,1)), it gives an error. So in this case the result depends on the initial value of the parameter theta. Can this be automated in any way to avoid this issue?
I also tried using the 'BFGS' method instead of the 'TNC' method in the minimize call, as shown below; then I get a RuntimeWarning.
initial_theta = np.zeros((n,1))
result = op.minimize(fun = CostFunc,
                     x0 = initial_theta,
                     args = (X,y),
                     method = 'BFGS',
                     jac = Gradient)
optimal_theta = result.x
I called the above function several times with different values of initial_theta and found that BFGS converges to a local minimum most of the time. When I called BFGS with
initial_theta = np.array([-25,0.2,0.2])
which is nearer to the global minimum, it converged. So it seems that TNC is better than BFGS because, with initial_theta being the same in both cases, TNC converges to the global minimum while BFGS converges to a local minimum. So:
1. Is this true in all cases, or does it depend on the particular problem?
2. Which is better, BFGS or TNC?
3. Is there any difference between scipy.optimize.fmin_bfgs and scipy.optimize.minimize with method = 'BFGS', or are both the same?
Any help or insight will be helpful. Thank you.

There is no practical algorithm that is guaranteed to find a global optimum. However, there are some heuristics like DIRECT (see e.g. here) that work very well in practice for given bounds. These can be used to find a good initialization for an algorithm that finds the local optimum in the vicinity of the initialization and works more efficiently.
However, logistic regression is a convex optimization problem. That means there is only one minimum of the objective function (error function), i.e. the local minimum is always the global minimum. Hence, you can use any local optimizer (Gradient Descent, L-BFGS, Conjugate Gradient, ...). The only problem is that you cannot compute the minimum directly because of the nonlinear logistic function. There is a similar problem called linear regression without that logistic function. In this case the global minimum of the error function can be computed directly without any complex optimization algorithm.
A comparison of optimizers for logistic regression can be found in Fabian Pedregosa's blog. My first guess would be that you have an error in your gradient computation. Maybe you should compare it to the numerical approximation of the gradient with scipy.optimize.check_grad.
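A minimal sketch of that check, assuming small synthetic data instead of ex2data1.txt and a compact rewrite of the cost and gradient (the lower-case names sigmoid, cost and gradient, and the synthetic true_theta, are mine). It also runs BFGS from two different starting points to illustrate the convexity argument above.

```python
import numpy as np
from scipy.optimize import check_grad, minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / y.size

# synthetic, well-scaled data (an assumption; not the data from the question)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_theta = np.array([-0.5, 1.0, -2.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_theta)).astype(float)

# 2-norm of the difference between the analytic and finite-difference gradients
err = check_grad(cost, gradient, np.zeros(3), X, y)

# the problem is convex, so different starting points should agree
fits = [minimize(cost, x0, args=(X, y), jac=gradient, method="BFGS").x
        for x0 in (np.zeros(3), np.ones(3))]
print(err, fits[0], fits[1])
```

If err is tiny and the fits still disagree on the real data, the likelier culprit is numerical: with unscaled features in the range 30 to 100, np.exp can overflow and np.log can receive 0, which would explain the RuntimeWarning.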
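scipy.optimize.minimize with method = 'BFGS' and scipy.optimize.fmin_bfgs use the same underlying BFGS implementation; fmin_bfgs is just the older, method-specific interface.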

This isn't possible with an efficient, general algorithm. You'll never really know what the cost function looked like on the inputs you didn't try. Perhaps there was some miracle trench running through a high plateau you ignored. Perhaps the cost function starts with if arg1 == secret: return -1e100. Who can say? If you really, absolutely need a global minimum, you either need to take advantage of extra knowledge about the cost function, or you need to try each and every single possible input.
