Given a posterior p(Θ|D) over some parameters Θ, one can define the following:
Highest Posterior Density Region:
The Highest Posterior Density Region is the set of most probable values of Θ that, in total, constitute 100(1-α) % of the posterior mass.
In other words, for a given α, we look for a p* that satisfies
∫_{Θ : p(Θ|D) > p*} p(Θ|D) dΘ = 1 − α,
and then obtain the Highest Posterior Density Region as the set {Θ : p(Θ|D) > p*}.
Central Credible Region:
Using the same notation as above, a Credible Region (or interval) C is defined as any set satisfying ∫_C p(Θ|D) dΘ = 1 − α.
Depending on the distribution, there could be many such intervals. The central credible interval is defined as a credible interval where there is (1-α)/2 mass on each tail.
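For concreteness, given posterior samples (e.g. MCMC draws), the central credible interval is just the pair of empirical quantiles at alpha/2 and 1 - alpha/2; a minimal sketch with np.percentile on stand-in samples:
import numpy as np

alpha = 0.05
samples = np.random.normal(size=100000)   # stand-in for posterior draws
lower, upper = np.percentile(samples, [100 * alpha / 2, 100 * (1 - alpha / 2)])
# roughly (-1.96, 1.96) for a standard normal posterior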
Computation:
For general distributions, given samples from the distribution, are there any built-ins to obtain the two quantities above in Python or PyMC?
For common parametric distributions (e.g. Beta, Gaussian, etc.) are there any built-ins or libraries to compute this using SciPy or statsmodels?
From my understanding, the "central credible region" is computed the same way confidence intervals are: all you need is the inverse of the cdf at alpha/2 and 1 - alpha/2. In scipy this is called ppf (percent point function). So, for a Gaussian posterior distribution:
>>> from scipy.stats import norm
>>> alpha = .05
>>> l, u = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)
to verify that [l, u] covers (1 - alpha) of the posterior density:
>>> norm.cdf(u) - norm.cdf(l)
0.94999999999999996
similarly, for a Beta posterior with, say, a=1 and b=3:
>>> from scipy.stats import beta
>>> l, u = beta.ppf(alpha / 2, a=1, b=3), beta.ppf(1 - alpha / 2, a=1, b=3)
and again:
>>> beta.cdf(u, a=1, b=3) - beta.cdf(l, a=1, b=3)
0.94999999999999996
Here you can see the parametric distributions that are included in scipy, and I guess all of them have a ppf function.
As for the highest posterior density region, it is trickier, since the pdf function is not necessarily invertible, and in general such a region may not even be connected; for example, in the case of a Beta with a = b = .5 (as can be seen here).
But in the case of the Gaussian distribution, it is easy to see that the "Highest Posterior Density Region" coincides with the "Central Credible Region", and I think that is the case for all symmetric unimodal distributions (i.e. if the pdf is symmetric around the mode of the distribution).
A possible numerical approach for the general case would be a binary search over the value of p*, using numerical integration of the pdf and exploiting the fact that the integrated mass is a monotone function of p*; a minimal sketch of that idea follows.
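The helper name hpd_threshold below is mine; it assumes a normalized pdf callable and integration limits [lb, ub] that cover essentially all of the mass. The worked mixture-Gaussian example that follows uses a squared-error/fmin variant of the same idea instead.
import numpy as np
from scipy import integrate

def hpd_threshold(pdf, alpha, lb, ub, tol=1e-6):
    # binary search for p* such that the mass of {x : pdf(x) > p*} equals 1 - alpha
    lo, hi = 0.0, max(pdf(t) for t in np.linspace(lb, ub, 1001))
    while hi - lo > tol:
        p_star = 0.5 * (lo + hi)
        mass = integrate.quad(lambda t: pdf(t) if pdf(t) > p_star else 0.0, lb, ub, limit=200)[0]
        if mass > 1 - alpha:
            lo = p_star   # too much mass kept -> raise the threshold
        else:
            hi = p_star   # too little mass kept -> lower the threshold
    return 0.5 * (lo + hi)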
Here is an example for a Gaussian mixture:
[1] The first thing you need is an analytical pdf function; for a Gaussian mixture that is easy:
import numpy as np
from scipy.stats import norm

def mix_norm_pdf(x, loc, scale, weight):
    return np.dot(weight, norm.pdf(x, loc, scale))
so for example for location, scale and weight values as in
loc = np.array([-1, 3]) # mean values
scale = np.array([.5, .8]) # standard deviations
weight = np.array([.4, .6]) # mixture probabilities
you will get two nice Gaussian distributions holding hands:
[2] Next, you need an error function which, given a test value for p*, integrates the pdf above p* and returns the squared error from the desired value 1 - alpha:
def errfn(p, alpha, *args):
    from scipy import integrate

    def fn(x):
        pdf = mix_norm_pdf(x, *args)
        return pdf if pdf > p else 0

    # ideally integration limits should not
    # be hard coded but inferred
    lb, ub = -3, 6
    prob = integrate.quad(fn, lb, ub)[0]
    return (prob + alpha - 1.0)**2
[3] Now, for a given value of alpha, we can minimize the error function to obtain p*:
alpha = .05
from scipy.optimize import fmin
p = fmin(errfn, x0=0, args=(alpha, loc, scale, weight))[0]
which results in p* = 0.0450, and the HPD as below; the red area represents 1 - alpha of the distribution's mass, and the horizontal dashed line is p*.
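Once p* is known, the region itself can be read off the density, for example on a dense grid; a rough sketch continuing the example above (a root finder such as scipy.optimize.brentq on mix_norm_pdf(x, ...) - p would give sharper endpoints):
xs = np.linspace(-3, 6, 10000)
dens = np.array([mix_norm_pdf(t, loc, scale, weight) for t in xs])
# grid points where the density crosses p*; consecutive pairs are the approximate HPD endpoints
crossings = xs[np.flatnonzero(np.diff((dens > p).astype(int)))]
print(crossings)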
To calculate the HPD you can leverage pymc3. Here is an example:
import pymc3
from scipy.stats import norm
a = norm.rvs(size=10000)
pymc3.stats.hpd(a)
Another option (adapted from R to Python and taken from the book Doing Bayesian Data Analysis by John K. Kruschke) is the following:
from scipy.optimize import fmin
from scipy.stats import *
def HDIofICDF(dist_name, credMass=0.95, **args):
    # freeze distribution with given arguments
    distri = dist_name(**args)
    # initial guess for HDIlowTailPr
    incredMass = 1.0 - credMass

    def intervalWidth(lowTailPr):
        return distri.ppf(credMass + lowTailPr) - distri.ppf(lowTailPr)

    # find lowTailPr that minimizes intervalWidth
    HDIlowTailPr = fmin(intervalWidth, incredMass, ftol=1e-8, disp=False)[0]
    # return interval as array([low, high])
    return distri.ppf([HDIlowTailPr, credMass + HDIlowTailPr])
The idea is to create a function intervalWidth that returns the width of the interval that starts at lowTailPr and contains credMass of the mass. The minimum of the intervalWidth function is found using the fmin minimizer from scipy.
For example the result of:
print(HDIofICDF(norm, credMass=0.95, loc=0, scale=1))
is
[-1.95996398 1.95996398]
The names of the distribution parameters passed to HDIofICDF must be exactly the same as those used in scipy.
PyMC has a built-in function for computing the HPD. In v2.3 it's in utils. See the source here. As an example, here is a linear model and its HPD:
import pymc as pc
import numpy as np
import matplotlib.pyplot as plt
## data
np.random.seed(1)
x = np.array(range(0,50))
y = np.random.uniform(low=0.0, high=40.0, size=50)
y = 2*x+y
## plt.scatter(x,y)
## priors
emm = pc.Uniform('m', -100.0, 100.0, value=0)
cee = pc.Uniform('c', -100.0, 100.0, value=0)
# linear model
@pc.deterministic(plot=False)
def lin_mod(x=x, cee=cee, emm=emm):
    return emm*x + cee
#likelihood
llhy = pc.Normal('y', mu=lin_mod, tau=1.0/(10.0**2), value=y, observed=True)
linearModel = pc.Model( [llhy, lin_mod, emm, cee] )
MCMClinear = pc.MCMC( linearModel)
MCMClinear.sample(10000,burn=5000,thin=5)
linear_output=MCMClinear.stats()
## pc.Matplot.plot(MCMClinear)
## print HPD using the trace of each parameter
print(pc.utils.hpd(MCMClinear.trace('m')[:] , 1.- 0.95))
print(pc.utils.hpd(MCMClinear.trace('c')[:] , 1.- 0.95))
You may also consider calculating the quantiles
print(linear_output['m']['quantiles'])
print(linear_output['c']['quantiles'])
where, if you just take the 2.5% and 97.5% values, you get your 95% central credible interval.
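Equivalently, those quantiles can be taken directly from the trace with numpy:
print(np.percentile(MCMClinear.trace('m')[:], [2.5, 97.5]))
print(np.percentile(MCMClinear.trace('c')[:], [2.5, 97.5]))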
I stumbled across this post trying to find a way to estimate an HDI from an MCMC sample but none of the answers worked for me.
Like aloctavodia, I adapted an R example from the book Doing Bayesian Data Analysis to Python. I needed to compute a 95% HDI from an MCMC sample. Here's my solution:
import numpy as np
def HDI_from_MCMC(posterior_samples, credible_mass):
    # Computes highest density interval from a sample of representative values,
    # estimated as the shortest credible interval
    # Takes Arguments posterior_samples (samples from posterior) and credible mass (normally .95)
    sorted_points = sorted(posterior_samples)
    ciIdxInc = np.ceil(credible_mass * len(sorted_points)).astype('int')
    nCIs = len(sorted_points) - ciIdxInc
    ciWidth = [0]*nCIs
    for i in range(0, nCIs):
        ciWidth[i] = sorted_points[i + ciIdxInc] - sorted_points[i]
    HDImin = sorted_points[ciWidth.index(min(ciWidth))]
    HDImax = sorted_points[ciWidth.index(min(ciWidth)) + ciIdxInc]
    return (HDImin, HDImax)
The method above is giving me logical answers based on the data I have!
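As a quick sanity check on synthetic data (so the exact numbers vary a little from run to run), a standard normal sample should give an interval close to (-1.96, 1.96):
samples = np.random.normal(size=10000)
print(HDI_from_MCMC(samples, 0.95))   # approximately (-1.96, 1.96)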
You can get the central credible interval in two ways. Graphically: when you call summary_plot on variables in your model, there is an hpd flag that is set to True by default; changing this to False will draw the central intervals instead. The second place you can get it is when you call the summary method on your model or a node: it will give you posterior quantiles, and the outer ones form the 95% central interval by default (which you can change with the alpha argument).
In R you can use the stat.extend package
If you are dealing with standard parametric distributions, and you don't mind using R, then you can use the HDR functions in the stat.extend package. This package has HDR functions for all the base distributions and some of the distributions in extension packages. It computes the HDR using the quantile function for the distribution, and automatically adjusts for the shape of the distribution (e.g., unimodal, bimodal, etc.). Here are some examples of HDRs computed with this package for standard parametric distributions.
#Load library
library(stat.extend)
#---------------------------------------------------------------
#Compute HDR for gamma distribution
HDR.gamma(cover.prob = 0.9, shape = 3, scale = 4)
Highest Density Region (HDR)
90.00% HDR for gamma distribution with shape = 3 and scale = 4
Computed using nlm optimisation with 6 iterations (code = 1)
[1.76530758147504, 21.9166988492762]
#---------------------------------------------------------------
#Compute HDR for (unimodal) beta distribution
HDR.beta(cover.prob = 0.9, shape1 = 3.2, shape2 = 3.0)
Highest Density Region (HDR)
90.00% HDR for beta distribution with shape1 = 3.2 and shape2 = 3
Computed using nlm optimisation with 4 iterations (code = 1)
[0.211049233508331, 0.823554556452285]
#---------------------------------------------------------------
#Compute HDR for (bimodal) beta distribution
HDR.beta(cover.prob = 0.9, shape1 = 0.3, shape2 = 0.4)
Highest Density Region (HDR)
90.00% HDR for beta distribution with shape1 = 0.3 and shape2 = 0.4
Computed using nlm optimisation with 6 iterations (code = 1)
[0, 0.434124342324438] U [0.640580807770818, 1]
I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges to 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value lying within one standard deviation of the mean (in either direction) should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
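For example, something along these lines: keep only the part of the kde line between the two values of interest and integrate it numerically with np.trapz (just a sketch; lower and upper stand for whatever bounds you care about, here one standard deviation around the mean):
lower, upper = x.mean() - x.std(), x.mean() + x.std()
mask = (xs >= lower) & (xs <= upper)
prob = np.trapz(ys[mask], xs[mask])   # area under the kde line between the two values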
More edits:
Using CDFs per #7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's epa per pass from last season
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
+ str(YEAR) + '.csv.gz?raw=True',compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to either:
make a more "normal" distribution out of sampling means, in order to incorporate cdfs when the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than of individual samples; is that not encouraged?),
or
if the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
Computing the probability p for some interval is not overly complicated. However, it can be tricky to combine the right tools to do so, in particular because there are several statistical approaches available.
1. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed in between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Similarly, p coincides with the area enclosed by the graph of the probability density function (PDF) f over the interval [lower, upper].
However, when the CDF/PDF is unknown, it becomes a statistical question. In a nutshell, estimating the PDF f and computing the area its graph encloses over the interval will do. But there are several paradigms and estimation procedures for obtaining that estimate.
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or for convenience. Then, one just needs to estimate its parameters mu (aka loc or mean) and sigma (aka scale or standard deviation). scipy.stats provides all we need in this setting: it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameter
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x = np.arange(lower, upper, 0.01),
y1 = stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat) ,
facecolor='red',
alpha=0.35)
plt.text(x=0.1, y=0.1, s= 'p=' + str(round(p, 3)))
plt.show()
which yields
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it via estimated parameters as above). Kernel density estimation is the most popular variant for doing so. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
import numpy as np

lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))

def f_pred(x):
    '''wrapper function to evaluate the estimated density at x'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]

p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x = np.arange(lower, upper, 0.01),
y1 = np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
facecolor='red',
alpha=0.35)
plt.text(x=0.15, y=0.1, s= 'p=' + str(round(p[0], 3)))
plt.show()
and yields
I do see a bug in the get_probability function, but it causes the result to be too high: in np.sum(kd_vals * step), it multiplies N sample values by a step whose denominator is N - 1, effectively producing an output a factor of N/(N-1) too high. (If they wanted a trapezoid-rule computation of the integral, they should have halved the left and right endpoint values first.)
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
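To illustrate both points, here is a sketch of the same computation with the integral done by np.trapz (which handles the endpoints correctly) and a much smaller bandwidth; with those two changes the estimate lands close to the expected 0.68 (the exact number varies with the sample and the bandwidth choice):
import numpy as np
from sklearn.neighbors import KernelDensity

x = np.random.normal(loc=0.0, scale=1.0, size=100000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.05).fit(x.reshape(-1, 1))

grid = np.linspace(x.mean() - x.std(), x.mean() + x.std(), 100)
pdf_vals = np.exp(kd.score_samples(grid.reshape(-1, 1)))
print(np.trapz(pdf_vals, grid))   # ~0.68 rather than ~0.63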
I have just been trying to match the scipy outputs of the lognormal distribution to the formulas on Wikipedia, and I am stuck on the partial expectation with a lower bound.
If I use this simple lognormal distribution:
k = .25
sigma = .5
mu = .1 # from the logged variable
lnorm = scist.lognorm(s=sigma, scale=np.exp(mu))
where k is the lower bound, the partial expectation, as I understand it, is given by
g(k) = E[X; X > k] = exp(mu + sigma^2 / 2) * Phi((mu + sigma^2 - ln k) / sigma),
where Phi is the standard normal CDF. Fine. So we are simply talking about the mean of the lognormal distribution and a CDF evaluated at a z-score. scipy provides the partial expectation:
lnorm.expect(lambda x:x, lb=k)
>>> 1.25199...
Indeed, we can confirm this is the partial expectation by checking it against the conditional expectation. Computing the conditional expectation directly or via the partial expectation above yields the same result:
lnorm.expect(lambda x:x, lb=k) / (1 - lnorm.cdf(k))
>>> 1.25385...
lnorm.expect(lambda x:x, lb=k, conditional=True)
>>> 1.25385...
However, scipy's cdf function takes the x variable, not the z-score, and I am uncertain how to transform the argument (mu + sigma^2 - ln k)/sigma into an x value. I have tried many different flavors.
I would have thought that accounting for the subtraction of mu, which scipy's cdf (presumably) performs internally when computing the z-score, would do the trick, but every formulation I use ends up with a very small or 0 value.
Any help would be greatly appreciated.
IIUC, you can simply compute the CDF of a standard normal distribution N(0,1) at (mu + sigma^2 - ln(k))/sigma, i.e.
import numpy as np
import scipy.stats as sps
def partial_expectation(mu, sigma, k):
    """
    Returns partial expectation given
    mean, standard deviation and k.
    https://en.wikipedia.org/wiki/Log-normal_distribution
    """
    # compute cumulative distribution function
    # of Normal distribution N(0,1) in x=x_phi
    x_phi = (mu + sigma**2 - np.log(k))/sigma
    phi = sps.norm.cdf(x_phi, loc=0, scale=1)
    # mean of lognormal
    lognorm_mu = np.exp(mu + .5*(sigma**2))
    # result
    return lognorm_mu * phi
k = .25
sigma = .5
mu = .1 # from the logged variable
lnorm = sps.lognorm(s=sigma, scale=np.exp(mu))
print('from def:', partial_expectation(mu, sigma, k))
print('from sps:', lnorm.expect(lb=k))
from def: 1.251999952174895
from sps: 1.2519999521748952
Given some time series data:
np.random.seed(123)
r = pd.Series(np.random.beta(a=0.5, b=0.5, size=1000),
index=pd.date_range('2013', periods=1000))
and the distributions within scipy.stats._continuous_distns._distn_names:
import scipy.stats as scs
dists = scs._continuous_distns._distn_names
I would like to be able to establish a new distribution and then call its .ppf (percent point function), while incorporating exponential weights into the building of the distribution.
For example, with a normal distribution, this would just entail estimating an exponentially-weighted mean and standard deviation:
All continuous distributions take loc and scale as keyword parameters
to adjust the location and scale of the distribution, e.g. for the
standard normal distribution the location is the mean and the scale is
the standard deviation. [source]
ewm = r.ewm(span=60)
loc = ewm.mean().iloc[-1]
scale = ewm.std().iloc[-1]
print(scs.norm.ppf(q=0.05, loc=loc, scale=scale))
-0.196734019969
But I would like to be able to extend this to the broader family of continuous distributions, where other (shape) parameters are often involved. For instance,
johnsonsu has parameters a, b, loc, scale;
bradford has parameters c, loc, scale;
burr has parameters c, d, loc, scale.
How could I extend this process to distributions that have parameters besides loc and scale?
Combined snippets from above:
import scipy.stats as scs
import numpy as np
import pandas as pd
np.random.seed(123)
r = pd.Series(np.random.beta(a=0.5, b=0.5, size=1000),
index=pd.date_range('2013', periods=1000))
ewm = r.ewm(span=60)
loc = ewm.mean().iloc[-1]
scale = ewm.std().iloc[-1]
print(scs.norm.ppf(q=0.05, loc=loc, scale=scale))
# -0.196734019969
Here is my implementation:
Given an empirical distribution x, assign an exponential weight to each observation.
Use these weights to bootstrap-sample a new distribution; the weights are the p parameter to np.random.choice.
The .fit method of any distribution can then be called on that bootstrapped data.
Code:
def ewm_weights(i, alpha):
    w = (1 - alpha) ** np.arange(i)[::-1]
    w /= w.sum()
    return w

def bootstrap(a, alpha, size=None):
    p = ewm_weights(i=len(a), alpha=alpha)
    return np.random.choice(a=a, size=size, p=p)
The definition of ewm_weights follows http://pandas.pydata.org/pandas-docs/stable/computation.html#exponentially-weighted-windows with adjust=True.
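For a quick feel for the weights (the most recent observation gets the largest weight; values rounded):
ewm_weights(5, alpha=0.1)
# array([0.160, 0.178, 0.198, 0.220, 0.244])  -- approximately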
Example:
# Create a nonstationary `x` variable with larger mean and stdev in period 2
x1 = np.random.normal(loc=4, scale=3, size=1000)
x2 = np.random.normal(loc=10, scale=5, size=1000)
x = np.hstack((x1,x2))
The histogram of x looks like this:
plt.hist(x, bins=25)
While a bootstrapped b with alpha=0.03 looks like:
b = bootstrap(x, alpha=0.03, size=int(1e6))
plt.hist(b, bins=25)
Any continuous distribution from scipy.stats._continuous_distns._distn_names can then be fit to b.
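For example, a distribution with extra shape parameters can then be fit to b and its .ppf evaluated with the fitted parameters; a sketch with johnsonsu (.fit returns the shape parameters followed by loc and scale, which unpack straight into .ppf):
params = scs.johnsonsu.fit(b)          # (a, b, loc, scale)
print(scs.johnsonsu.ppf(0.05, *params))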
Issues:
A softmax function might make ewm_weights safer.
This approach ignores autocorrelation in x.
Despite having searched related questions for two days, I have not really found an answer to this problem yet...
In the following code, I generate n normally distributed random variables, which are then represented in a histogram:
import numpy as np
import matplotlib.pyplot as plt
n = 10000 # number of generated random variables
x = np.random.normal(0,1,n) # generate n random variables
# plot this in a non-normalized histogram:
plt.hist(x, bins='auto', normed=False)
# get the arrays containing the bin counts and the bin edges:
histo, bin_edges = np.histogram(x, bins='auto', normed=False)
number_of_bins = len(bin_edges)-1
After that, a fitting function and its parameters are found. It is a normal distribution with parameters a1 and b1, scaled with scaling_factor to account for the fact that the sample is not normalized. It indeed fits the histogram quite well:
import scipy as sp
import scipy.stats  # make sp.stats available

a1, b1 = sp.stats.norm.fit(x)
scaling_factor = n*(x.max()-x.min())/number_of_bins

# x values at which the fitted curve is evaluated
x_achse = np.linspace(x.min(), x.max(), 1000)
plt.plot(x_achse, scaling_factor*sp.stats.norm.pdf(x_achse, a1, b1), 'b')
Here's the plot of the histogram together with the fitting function.
After that, I want to test how well this function fits the histogram using the chi-squared test.
This test uses the observed values and the expected values at those points. To calculate the expected values, I first compute the location of the middle of each bin; this information is contained in the array x_middle. I then evaluate the fitting function at the middle point of each bin, which gives the expected_values array:
observed_values = histo
bin_width = bin_edges[1] - bin_edges[0]
# array containing the middle point of each bin:
x_middle = np.linspace( bin_edges[0] + 0.5*bin_width,
bin_edges[0] + (0.5 + number_of_bins)*bin_width,
num = number_of_bins)
expected_values = scaling_factor*sp.stats.norm.pdf(x_middle,a1,b1)
Plugging this into SciPy's chisquare function, I get p-values on the order of 1e-5 to 1e-15, which tells me the fitting function does not describe the histogram:
print(sp.stats.chisquare(observed_values,expected_values,ddof=2))
But this is not true, the function fits the histogram very well!
Does anybody know where I made a mistake?
Thanks a lot!!
Charles
p.s.: I set the number of delta degrees of freedom to 2, because the two parameters a1 and b1 are estimated from the sample. I tried other values of ddof, but the results were still just as poor!
Your calculation of the end-point of the array x_middle is off by one; it should be:
x_middle = np.linspace(bin_edges[0] + 0.5*bin_width,
bin_edges[0] + (0.5 + number_of_bins - 1)*bin_width,
num=number_of_bins)
Note the extra - 1 in the second argument of linspace().
A more concise version is
x_middle = 0.5*(bin_edges[1:] + bin_edges[:-1])
A different (and possibly more accurate) approach to computing expected_values is to use the differences of the CDF, instead of approximating those differences using the PDF in the middle of each interval:
In [75]: from scipy import stats
In [76]: cdf = stats.norm.cdf(bin_edges, a1, b1)
In [77]: expected_values = n * np.diff(cdf)
With that calculation, I get the following result from the chi-squared test:
In [85]: stats.chisquare(observed_values, expected_values, ddof=2)
Out[85]: Power_divergenceResult(statistic=61.168393496775181, pvalue=0.36292223875686402)
From the docs
The probability mass function for zipf is:
zipf.pmf(k, a) = 1/(zeta(a) * k**a)
for k >= 1.
zipf takes a as shape parameter.
The probability mass function above is defined in the “standardized” form. To shift distribution use the loc parameter. Specifically, zipf.pmf(k, a, loc) is identically equivalent to zipf.pmf(k - loc, a).
But what do a and k refer to? What does "shape parameter" mean?
Additionally, in scipy.stats.zipf.interval, there's an alpha parameter.
The description of the .interval() method is simply:
Endpoints of the range that contains alpha percent of the distribution
What does the alpha parameter mean? Is that the "confidence interval"?
What does "shape parameter" mean?
As the name suggests, a shape parameter determines the shape of a distribution. This is probably easiest to explain when starting with what a shape parameter is not:
A location parameter shifts the distribution but leaves it otherwise unchanged. For example, the mean of a normal distribution is a location parameter. If X is normally distributed with mean mu, then X + a is normally distributed with mean mu + a.
A scale parameter makes the distribution wider or narrower. For example, the standard deviation of a normal distribution is a scale parameter. If X is normally distributed with standard deviation sigma, then X * a is normally distributed with standard deviation sigma * a.
Finally, a shape parameter changes the shape of the distribution. For example, the Gamma distribution has a shape parameter k that determines how skewed the distribution is (= how much it "leans" to one side).
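For instance, scipy exposes the gamma distribution's skewness (which is 2/sqrt(k)), and you can watch it shrink as the shape parameter grows:
from scipy.stats import gamma
print(gamma.stats(1.0, moments='s'))   # skewness = 2.0
print(gamma.stats(9.0, moments='s'))   # skewness ~ 0.67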
But what do a and k refer to?
k is the variable parameterized by the distribution. With zipf.pmf you can compute the probability of any k, given the shape parameter a. Below is a plot that demonstrates how a changes the shape of the distribution (i.e. the individual probabilities of different k).
A high a makes large values of k very unlikely, while a low a makes small k less likely and larger k possible.
What does the alpha parameter mean? Is that the "confidence interval"?
It is wrong to say that alpha is the confidence interval; it is the confidence level, which I guess is what you meant. For example, alpha=0.95 means that you get a 95% confidence interval: if you generate random k's from the particular distribution, 95% of them will be in the range returned by zipf.interval.
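As a concrete illustration (the first argument of interval is that confidence level):
from scipy.stats import zipf

low, high = zipf.interval(0.95, 2.0)   # endpoints of a range containing roughly 95% of the mass for a = 2
print(low, high)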
Code for the plot:
from scipy.stats import zipf
import matplotlib.pyplot as plt
import numpy as np
k = np.arange(1, 11)  # zipf is a discrete distribution, so evaluate the pmf at integer k
for a in [1.3, 2.6]:
    p = zipf.pmf(k, a=a)
    plt.plot(k, p, 'o-', label='a={}'.format(a), linewidth=2)
plt.xlabel('k')
plt.ylabel('probability')
plt.legend()
plt.show()