Formula for partial expectation in scipy lognorm - python

I have been trying to match scipy's outputs for the lognormal distribution to the formulas on Wikipedia, and I am stuck on the partial expectation with a lower bound.
If I use this simple lognormal distribution:
import numpy as np
import scipy.stats as scist

k = .25
sigma = .5
mu = .1 # from the logged variable
lnorm = scist.lognorm(s=sigma, scale=np.exp(mu))
where k is the lower bound, the partial expectation, as I understand it, is given by
g(k) = exp(mu + sigma^2/2) * Phi((mu + sigma^2 - ln(k)) / sigma)
Fine. So we are simply talking about the mean of the lognormal distribution times a standard normal CDF evaluated at a z-score. scipy provides the partial expectation directly:
lnorm.expect(lambda x:x, lb=k)
>>> 1.25199...
Indeed, we can confirm this is the partial expectation by checking it against the conditional expectation: dividing the partial by the survival probability gives the same result as asking scipy for the conditional expectation directly:
lnorm.expect(lambda x:x, lb=k) / (1 - lnorm.cdf(k))
>>> 1.25385...
lnorm.expect(lambda x:x, lb=k, conditional=True)
>>> 1.25385...
However, scipy's cdf function takes the x variable, not the z-score, and I am uncertain how to transform the argument of the normal CDF above, (mu + sigma^2 - ln(k)) / sigma, into an x value that I can pass to lnorm.cdf. I have tried many different flavors; I would have thought that adjusting for mu would do the trick, to account for the subtraction of mu that must occur when scipy's cdf (presumably) computes the z-score internally, but every formulation I use ends up with a very small or zero value.
Any help would be greatly appreciated.

IIUC, you can simply compute the CDF of a standard normal distribution N(0,1) at (mu + sigma^2 - ln(k)) / sigma, i.e.
import numpy as np
import scipy.stats as sps
def partial_expectation(mu, sigma, k):
    """
    Returns the partial expectation E[X; X > k] given
    mu and sigma of the underlying normal and the lower bound k.
    https://en.wikipedia.org/wiki/Log-normal_distribution
    """
    # cumulative distribution function of N(0, 1) at x = x_phi
    x_phi = (mu + sigma**2 - np.log(k)) / sigma
    phi = sps.norm.cdf(x_phi, loc=0, scale=1)
    # mean of the lognormal distribution
    lognorm_mu = np.exp(mu + .5 * sigma**2)
    # result
    return lognorm_mu * phi
k = .25
sigma = .5
mu = .1 # from the logged variable
lnorm = sps.lognorm(s=sigma, scale=np.exp(mu))
print('from def:', partial_expectation(mu, sigma, k))
print('from sps:', lnorm.expect(lb=k))
from def: 1.251999952174895
from sps: 1.2519999521748952
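To address the original question of passing an x value to lnorm.cdf rather than a z-score: since the lognormal CDF is F(x) = Phi((ln(x) - mu)/sigma), the term Phi((mu + sigma^2 - ln(k))/sigma) equals 1 - F(k * exp(-sigma^2)). Here is a small sketch (not part of the original answer) that uses only the frozen lognorm object:
import numpy as np
import scipy.stats as sps

k, sigma, mu = 0.25, 0.5, 0.1
lnorm = sps.lognorm(s=sigma, scale=np.exp(mu))

# partial expectation written purely in terms of the lognormal CDF:
# E[X] * (1 - F(k * exp(-sigma**2)))
partial = np.exp(mu + 0.5 * sigma**2) * (1 - lnorm.cdf(k * np.exp(-sigma**2)))
print(partial)             # ~1.2519999...
print(lnorm.expect(lb=k))  # agrees with scipy's numerical result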

Related

How to calculate integral for very very small y values (SciPy quad)

Here is a probability density function of a lognormal distribution:
import numpy as np
from scipy.stats import lognorm
def f(x): return lognorm.pdf(x, s=0.2, loc=0, scale=np.exp(10))
This function has very small y values (max ~ 1E-5) spread over x values on the order of 1E5. We know that the integral of a PDF should be 1, but when using the following code to calculate the integral directly, the answer is around 1E-66, since the numerical accuracy is not enough.
from scipy.integrate import quad
import pandas as pd
ans, err = quad(f, -np.inf, np.inf)
Could you kindly help me to correctly calculate an integral like this? Thank you.
The values that you are using correspond to the underlying normal distribution having mean mu = 10 and standard deviation sigma = 0.2. With those values, the mode of the distribution (i.e. the location of the maximum of the PDF) is at exp(mu - sigma**2) = 21162.795717500194. The function quad works pretty well, but it can be fooled. In this case, apparently quad only samples the function where the values are extremely small--it never "sees" the higher values way out around 20000.
You can fix this by computing the integral over two intervals, say [0, mode] and [mode, np.inf]. (There is no need to compute the integral over the negative axis, since the PDF is 0 there.)
For example, this script prints 1.0000000000000004
import numpy as np
from scipy.stats import lognorm
from scipy.integrate import quad
def f(x, mu=0, sigma=1):
    return lognorm.pdf(x, s=sigma, loc=0, scale=np.exp(mu))
mu = 10
sigma = 0.2
mode = np.exp(mu - sigma**2)
ans1, err1 = quad(f, 0, mode, args=(mu, sigma))
ans2, err2 = quad(f, mode, np.inf, args=(mu, sigma))
integral = ans1 + ans2
print(integral)
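If you prefer a single quad call, the points argument can serve a similar purpose on a finite interval. This is a sketch, not from the original answer; points cannot be combined with infinite limits, so the upper bound here is a generous finite cutoff:
# same f, mu, sigma and mode as above
ans, err = quad(f, 0, 10 * mode, points=[mode], args=(mu, sigma))
print(ans)  # should also be very close to 1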

Fitting For Discrete Data: Negative Binomial, Poisson, Geometric Distribution

In scipy there is no support for fitting discrete distributions to data. I know there are a lot of questions about this.
For example, if I have an array like the one below:
x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0]
I can't apply the usual fit call to this array:
from scipy.stats import nbinom
param = nbinom.fit(x)
But I would like to ask: as of today, is there any way to fit these three discrete distributions and then choose the best fit for the discrete dataset?
You can use Method of Moments to fit any particular distribution.
Basic idea: get empirical first, second, etc. moments, then derive distribution parameters from these moments.
All three distributions here have at most two parameters, so we only need two moments. Let's get them:
import pandas as pd
# for other distributions, you'll need to implement PMF
from scipy.stats import nbinom, poisson, geom
x = pd.Series(x)
mean = x.mean()
var = x.var()
likelihoods = {} # we'll use it later
Note: I used pandas instead of numpy. That is because numpy's var() and std() don't apply Bessel's correction, while pandas' do. If you have 100+ samples, there shouldn't be much difference, but on smaller samples it could be important.
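For instance (a quick illustration of that note, not part of the original answer):
import numpy as np
import pandas as pd

x = [2, 3, 4, 5, 6, 7, 0, 1, 1, 0, 1, 8, 10, 9, 1, 1, 1, 0, 0]
print(np.var(x))           # numpy default ddof=0: divides by n, no Bessel's correction
print(np.var(x, ddof=1))   # divides by n-1
print(pd.Series(x).var())  # pandas default is ddof=1, same as the line above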
Now, let's get parameters for these distributions. Negative binomial has two parameters: p, r. Let's estimate them and calculate likelihood of the dataset:
# From the wikipedia page, we have:
# mean = pr / (1-p)
# var = pr / (1-p)**2
# without wiki, you could use MGF to get moments; too long to explain here
# Solving for p and r, we get:
p = 1 - mean / var # TODO: check for zero variance and limit p by [0, 1]
r = (1-p) * mean / p
UPD: Wikipedia and scipy are using different definitions of p, one treating it as the probability of success and the other as the probability of failure. To be consistent with the scipy convention, use:
p = mean / var
r = p * mean / (1-p)
END OF UPD
UPD2:
I'd suggest using the log-likelihood approach from thilak's answer below instead. It avoids loss of precision, which is especially important on large samples.
END OF UPD2
Calculate likelihood:
likelihoods['nbinom'] = x.map(lambda val: nbinom.pmf(val, r, p)).prod()
Same for Poisson, there is only one parameter:
# from Wikipedia,
# mean = variance = lambda. Nothing to solve here
lambda_ = mean
likelihoods['poisson'] = x.map(lambda val: poisson.pmf(val, lambda_)).prod()
Same for Geometric distribution:
# mean = 1 / p # this form fits the scipy definition
p = 1 / mean
likelihoods['geometric'] = x.map(lambda val: geom.pmf(val, p)).prod()
Finally, let's get the best fit:
best_fit = max(likelihoods, key=lambda x: likelihoods[x])
print("Best fit:", best_fit)
print("Likelihood:", likelihoods[best_fit])
Let me know if you have any questions
Great answer by Marat.
In addition to Marat's post I would most certainly recommend taking log of the probability mass function. Some information on why log likelihood is preferred over likelihood- https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution
I would rewrite the code for Negative Binomial to-
log_likelihoods = {}
log_likelihoods['nbinom'] = x.map(lambda val: nbinom.logpmf(val, r, p)).sum()
Note that I have used-
logpmf instead of pmf
sum instead of product
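The same substitution carries over to the other two candidates. A sketch (not part of the original answer), reusing mean from Marat's code:
log_likelihoods['poisson'] = x.map(lambda val: poisson.logpmf(val, mean)).sum()
# scipy's geom has support k >= 1, so zeros in the data give -inf here,
# just as they give a zero likelihood in the product version above
log_likelihoods['geometric'] = x.map(lambda val: geom.logpmf(val, 1 / mean)).sum()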
And to find out the best distribution-
best_fit = max(log_likelihoods, key=lambda x: log_likelihoods[x])
print("Best fit:", best_fit)
print("log_Likelihood:", log_likelihoods[best_fit])

Get reverse cumulative density function with NumPy?

I am interested in a particular density, and I need to sample it "regularly" in a way that represents its shape (not randomly).
Formally, f is my density function and F is the corresponding cumulative distribution function (F' = f), whose inverse function rF = F^-1 does exist. I am interested in mapping a regular sample from [0, 1] into my variable's domain through F^-1. Something like:
import numpy as np
uniform_sample = np.linspace(0., 1., 256 + 2)[1:-1] # source sample
shaped_sample = rF(uniform_sample) # this is what I want to get
Is there a dedicated way to do this with numpy, or should I do this by hand? Here is the 'by hand' way for exponential law:
l = 5. # exponential parameter
# f = lambda x: l * np.exp(-l * x) # density function, not used
# F = lambda x: 1 - np.exp(-l * x) # cumulative density function, not used either
rF = lambda y: np.log(1. / (1. - y)) / l # reverse `F^-1` function
# What I need is:
shaped_sample = rF(uniform_sample)
I know that, in theory, rF is internally used for drawing random samples when np.random.exponential is called, for example (a uniform, random sample from [0, 1] is transformed by rF to get the actual result). So my guess is that numpy.random does know the rF function for each distribution it offers.
How do I access it? Does numpy provide functions like:
np.random.<any_numpy_distribution>.rF
or
np.random.get_reverse_F(<any_custom_density_function>)
.. or should I derive / approximate them myself?
scipy has probability distribution objects for all (I think) of the probability distributions in numpy.random.
http://docs.scipy.org/doc/scipy/reference/stats.html
They all have a ppf() method that does what you want.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.ppf.html
In your example:
import scipy.stats as st
l = 5. # exponential parameter
dist = st.expon(scale=1./l) # distribution object provided by scipy; scale is 1/lambda
f = dist.pdf # probability density function
F = dist.cdf # cumulative distribution function
rF = dist.ppf # percent point function: the inverse `F^-1` function
shaped_sample = rF(uniform_sample)
# and much more!
As far as I'm aware there isn't a way to do this directly in numpy. For the case of functions where the cumulative distribution is analytic but its inverse isn't, I generally use a spline to do the inversion numerically.
import numpy as np
from scipy.interpolate import UnivariateSpline
x = np.linspace(0.0, 1.0, 1000)
F = cumulative_distn(x) #This we know and is analytic
rF = UnivariateSpline(F, x) #This will then be the inverse
Note that if you can do the inversion of F to rF by hand then you should. This method is only for the case where the inverse cannot be found in a closed form.
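As a concrete check (a sketch, not from the original answer), here is the spline inversion applied to the exponential CDF from the question, compared against the known closed-form inverse:
import numpy as np
from scipy.interpolate import UnivariateSpline

l = 5.
x = np.linspace(0.0, 3.0, 1000)           # covers essentially all of the mass
F = 1 - np.exp(-l * x)                    # analytic CDF
rF_spline = UnivariateSpline(F, x, s=0)   # s=0 interpolates instead of smoothing

uniform_sample = np.linspace(0., 1., 256 + 2)[1:-1]
shaped_sample = rF_spline(uniform_sample)
exact = np.log(1. / (1. - uniform_sample)) / l
print(np.max(np.abs(shaped_sample - exact)))  # should be very small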

Highest Posterior Density Region and Central Credible Region

Given a posterior p(Θ|D) over some parameters Θ, one can define the following:
Highest Posterior Density Region:
The Highest Posterior Density Region is the set of most probable values of Θ that, in total, constitute 100(1-α) % of the posterior mass.
In other words, for a given α, we look for a p* such that the set {Θ : p(Θ|D) > p*} contains posterior mass 1 - α, i.e.
P( p(Θ|D) > p* | D ) = 1 - α
and then obtain the Highest Posterior Density Region as the set
{Θ : p(Θ|D) > p*}
Central Credible Region:
Using the same notation as above, a Credible Region (or interval) is defined as any set C satisfying
P(Θ ∈ C | D) = 1 - α
Depending on the distribution, there could be many such intervals. The central credible interval is defined as the credible interval with (1-α)/2 of the mass in each tail.
Computation:
For general distributions, given samples from the distribution, are there any built-ins to obtain the two quantities above in Python or PyMC?
For common parametric distributions (e.g. Beta, Gaussian, etc.) are there any built-ins or libraries to compute this using SciPy or statsmodels?
From my understanding, the "central credible region" is not any different from how confidence intervals are calculated; all you need is the inverse of the cdf at alpha/2 and 1 - alpha/2. In scipy this is called ppf (percent point function). So, for a Gaussian posterior distribution:
>>> from scipy.stats import norm
>>> alpha = .05
>>> l, u = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)
to verify that [l, u] covers (1-alpha) of posterior density:
>>> norm.cdf(u) - norm.cdf(l)
0.94999999999999996
similarly for Beta posterior with say a=1 and b=3:
>>> from scipy.stats import beta
>>> l, u = beta.ppf(alpha / 2, a=1, b=3), beta.ppf(1 - alpha / 2, a=1, b=3)
and again:
>>> beta.cdf(u, a=1, b=3) - beta.cdf(l, a=1, b=3)
0.94999999999999996
Here you can see the parametric distributions included in scipy, and I guess all of them have a ppf function.
As for the highest posterior density region, it is more tricky, since the pdf is not necessarily invertible, and in general such a region may not even be connected; for example, in the case of a Beta with a = b = .5 (as can be seen here).
But in the case of a Gaussian distribution, it is easy to see that the "Highest Posterior Density Region" coincides with the "Central Credible Region", and I think that this is the case for all symmetric unimodal distributions (i.e. if the pdf is symmetric around the mode of the distribution).
A possible numerical approach for the general case would be a binary search over the value of p*, using numerical integration of the pdf and the fact that the integral is a monotone function of p*.
Here is an example for mixture Gaussian:
[ 1 ] The first thing you need is an analytical pdf function; for a mixture of Gaussians that is easy:
import numpy as np
from scipy.stats import norm

def mix_norm_pdf(x, loc, scale, weight):
    return np.dot(weight, norm.pdf(x, loc, scale))
so for example for location, scale and weight values as in
loc = np.array([-1, 3]) # mean values
scale = np.array([.5, .8]) # standard deviations
weight = np.array([.4, .6]) # mixture probabilities
you will get two nice Gaussian distributions holding hands.
[ 2 ] Now you need an error function which, given a test value for p*, integrates the pdf above p* and returns the squared error from the desired value 1 - alpha:
def errfn(p, alpha, *args):
    from scipy import integrate
    def fn(x):
        pdf = mix_norm_pdf(x, *args)
        return pdf if pdf > p else 0
    # ideally the integration limits should not
    # be hard coded but inferred
    lb, ub = -3, 6
    prob = integrate.quad(fn, lb, ub)[0]
    return (prob + alpha - 1.0)**2
[ 3 ] Now, for a given value of alpha, we can minimize the error function to obtain p*:
alpha = .05
from scipy.optimize import fmin
p = fmin(errfn, x0=0, args=(alpha, loc, scale, weight))[0]
which results in p* = 0.0450 and the HPD shown below; the red area represents 1 - alpha of the distribution, and the horizontal dashed line is p*.
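Once p* is known, the HPD region itself can be read off a dense grid. This is a sketch continuing the code above (not from the original answer); with a bimodal mixture the region may consist of several disjoint intervals:
import numpy as np

xs = np.linspace(-3, 6, 10001)
# mix_norm_pdf above expects a scalar x, so evaluate point by point
dens = np.array([mix_norm_pdf(xi, loc, scale, weight) for xi in xs])
mask = dens >= p
# indices where the mask flips give approximate endpoints of the HPD intervals
edges = np.flatnonzero(np.diff(mask.astype(int)))
print(xs[edges])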
To calculate the HPD you can leverage pymc3. Here is an example:
import pymc3
from scipy.stats import norm
a = norm.rvs(size=10000)
pymc3.stats.hpd(a)
Another option (adapted from R to Python and taken from the book Doing Bayesian Data Analysis by John K. Kruschke) is the following:
from scipy.optimize import fmin
from scipy.stats import *
def HDIofICDF(dist_name, credMass=0.95, **args):
    # freeze distribution with given arguments
    distri = dist_name(**args)
    # initial guess for HDIlowTailPr
    incredMass = 1.0 - credMass
    def intervalWidth(lowTailPr):
        return distri.ppf(credMass + lowTailPr) - distri.ppf(lowTailPr)
    # find lowTailPr that minimizes intervalWidth
    HDIlowTailPr = fmin(intervalWidth, incredMass, ftol=1e-8, disp=False)[0]
    # return interval as array([low, high])
    return distri.ppf([HDIlowTailPr, credMass + HDIlowTailPr])
The idea is to create a function intervalWidth that returns the width of the interval that starts at lowTailPr and has credMass mass. The minimum of the intervalWidth function is found using the fmin minimizer from scipy.
For example the result of:
print(HDIofICDF(norm, credMass=0.95, loc=0, scale=1))
is
[-1.95996398 1.95996398]
The names of the distribution parameters passed to HDIofICDF must be exactly the same as those used in scipy.
PyMC has a built-in function for computing the HPD. In v2.3 it's in utils. See the source here. As an example, here is a linear model and its HPD:
import pymc as pc
import numpy as np
import matplotlib.pyplot as plt
## data
np.random.seed(1)
x = np.array(range(0,50))
y = np.random.uniform(low=0.0, high=40.0, size=50)
y = 2*x+y
## plt.scatter(x,y)
## priors
emm = pc.Uniform('m', -100.0, 100.0, value=0)
cee = pc.Uniform('c', -100.0, 100.0, value=0)
# linear model
@pc.deterministic(plot=False)
def lin_mod(x=x, cee=cee, emm=emm):
    return emm*x + cee
#likelihood
llhy = pc.Normal('y', mu=lin_mod, tau=1.0/(10.0**2), value=y, observed=True)
linearModel = pc.Model( [llhy, lin_mod, emm, cee] )
MCMClinear = pc.MCMC( linearModel)
MCMClinear.sample(10000,burn=5000,thin=5)
linear_output=MCMClinear.stats()
## pc.Matplot.plot(MCMClinear)
## print HPD using the trace of each parameter
print(pc.utils.hpd(MCMClinear.trace('m')[:] , 1.- 0.95))
print(pc.utils.hpd(MCMClinear.trace('c')[:] , 1.- 0.95))
You may also consider calculating the quantiles
print(linear_output['m']['quantiles'])
print(linear_output['c']['quantiles'])
where I think if you just take the 2.5% to 97.5% values you get your 95% central credible interval.
I stumbled across this post trying to find a way to estimate an HDI from an MCMC sample but none of the answers worked for me.
Like aloctavodia, I adapted an R example from the book Doing Bayesian Data Analysis to Python. I needed to compute a 95% HDI from an MCMC sample. Here's my solution:
import numpy as np
def HDI_from_MCMC(posterior_samples, credible_mass):
    # Computes highest density interval from a sample of representative values,
    # estimated as the shortest credible interval.
    # Arguments: posterior_samples (samples from the posterior) and credible_mass (normally .95).
    sorted_points = sorted(posterior_samples)
    ciIdxInc = np.ceil(credible_mass * len(sorted_points)).astype('int')
    nCIs = len(sorted_points) - ciIdxInc
    ciWidth = [0] * nCIs
    for i in range(0, nCIs):
        ciWidth[i] = sorted_points[i + ciIdxInc] - sorted_points[i]
    HDImin = sorted_points[ciWidth.index(min(ciWidth))]
    HDImax = sorted_points[ciWidth.index(min(ciWidth)) + ciIdxInc]
    return (HDImin, HDImax)
The method above is giving me logical answers based on the data I have!
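As a quick sanity check (not from the original post), feeding it standard normal samples should give an interval close to the familiar (-1.96, 1.96):
from scipy.stats import norm

samples = norm.rvs(size=100000, random_state=0)
print(HDI_from_MCMC(samples, 0.95))  # approximately (-1.96, 1.96)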
You can get the central credible interval in two ways. Graphically, when you call summary_plot on variables in your model, there is an hpd flag that is set to True by default; changing this to False will draw the central intervals instead. The second place you can get it is when you call the summary method on your model or a node; it will give you posterior quantiles, and the outer ones form a 95% central interval by default (which you can change with the alpha argument).
In R you can use the stat.extend package
If you are dealing with standard parametric distributions, and you don't mind using R, then you can use the HDR functions in the stat.extend package. This package has HDR functions for all the base distributions and some of the distributions in extension packages. It computes the HDR using the quantile function for the distribution, and automatically adjusts for the shape of the distribution (e.g., unimodal, bimodal, etc.). Here are some examples of HDRs computed with this package for standard parametric distributions.
#Load library
library(stat.extend)
#---------------------------------------------------------------
#Compute HDR for gamma distribution
HDR.gamma(cover.prob = 0.9, shape = 3, scale = 4)
Highest Density Region (HDR)
90.00% HDR for gamma distribution with shape = 3 and scale = 4
Computed using nlm optimisation with 6 iterations (code = 1)
[1.76530758147504, 21.9166988492762]
#---------------------------------------------------------------
#Compute HDR for (unimodal) beta distribution
HDR.beta(cover.prob = 0.9, shape1 = 3.2, shape2 = 3.0)
Highest Density Region (HDR)
90.00% HDR for beta distribution with shape1 = 3.2 and shape2 = 3
Computed using nlm optimisation with 4 iterations (code = 1)
[0.211049233508331, 0.823554556452285]
#---------------------------------------------------------------
#Compute HDR for (bimodal) beta distribution
HDR.beta(cover.prob = 0.9, shape1 = 0.3, shape2 = 0.4)
Highest Density Region (HDR)
90.00% HDR for beta distribution with shape1 = 0.3 and shape2 = 0.4
Computed using nlm optimisation with 6 iterations (code = 1)
[0, 0.434124342324438] U [0.640580807770818, 1]

Apply kurtosis to a distribution in python

I have a dataset which is in the format of
frequency, direction, normalised power spectral density, spread, skewness, kurtosis
I am able to visualise the distribution of a specific record using the code from the top answer in skew normal distribution in scipy but I am not sure how to apply a kurtosis value to a distribution?
from numpy import linspace, pi, sqrt, exp  # scipy's top-level re-exports of these are deprecated
from scipy.special import erf
from pylab import plot, show

def pdf(factor, x):
    return (100*factor)/sqrt(2*pi) * exp(-x**2/2)

def cdf(x):
    return (1 + erf(x/sqrt(2))) / 2

def skew(x, e=0, w=1, a=0, norm_psd=1):
    t = (x-e) / w
    return 2 / w * pdf(norm_psd, t) * cdf(a*t)
n = 540
e = 341.9 # direction
w = 59.3 # spread
a = 3.3 # skew
k = 4.27 # kurtosis
n_psd = 0.5 # normalised power spectral density
x = linspace(-90, 450, n)
p = skew(x, e, w, a, n_psd)
print(max(p))
plot(x,p)
show()
Edit: I removed "skew normal" from my title as I don't think it is actually possible to apply a kurtosis value to the above distribution; I think a different distribution is necessary. As direction is involved, a distribution from circular statistics may be more appropriate.
Thanks to the answer below, I can apply kurtosis using the pdf_mvsk function demonstrated in the code below. Unfortunately my skew values cause negative y values, but the answer satisfies my question.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.sandbox.distributions.extras as extras
pdffunc = extras.pdf_mvsk([341.9, 59.3, 3.3, 4.27])
xvals = np.arange(0, 360, 0.1)  # avoid shadowing the built-in range
plt.plot(xvals, pdffunc(xvals))
plt.show()
If you have the mean, standard deviation, skew and kurtosis, then you can build an approximate distribution with those moments using a Gram-Charlier expansion (an expansion around the normal distribution).
I looked into this some time ago, scipy.stats had a function that was wrong and was removed.
I don't remember what the status of this is, since it was a long time ago that I put this in the statsmodels sandbox
http://statsmodels.sourceforge.net/devel/generated/statsmodels.sandbox.distributions.extras.pdf_mvsk.html#statsmodels.sandbox.distributions.extras.pdf_mvsk
