Confidence Interval for Inverse Gauss distribution with scipy.stats - python

I am attempting to fit an inverse Gaussian distribution to data using the scipy.stats toolbox. The data fits well using the following code:
from scipy import stats
dist = stats.invgauss
# fit the distribution to the data
dist_fit = dist.fit(data)
dist_model = dist(*dist_fit)
# find the distribution mean
dist_mu = dist_model.mean()
# find the distribution standard deviation
dist_std = dist_model.std()
This produces a fit to the distribution that looks like this: inverse_gaussian_fit.
I am trying to determine the confidence interval for the mean of this distribution. My understanding is that the confidence interval of the mean equals the standard error of the mean (the standard deviation divided by the square root of the number of samples) multiplied by the percent point function (the inverse of the cumulative distribution function) evaluated at the desired confidence level. I can do this using the following code:
# find the inverse Gaussian standard error/confidence interval
import numpy as np
n = len(data)  # number of samples
dist_se = dist_std / np.sqrt(n)
dist_ci_l = dist_se * dist_model.ppf(0.05)
dist_ci_h = dist_se * dist_model.ppf(0.95)
Unfortunately, this produces unrealistic results like this: inverse_gaussian_running_averages.
How can I generate an asymmetric confidence interval for an inverse Gaussian distribution? I have seen many applications that assume the confidence interval from a normal distribution, but that creates symmetric confidence intervals.
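One common way to get an asymmetric interval without assuming normality (not from the original post) is to bootstrap the sample mean; a minimal sketch, assuming data is a 1-D NumPy array:
import numpy as np

rng = np.random.default_rng(0)
# resample the data with replacement and record each resample's mean
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10000)]
# the 5th and 95th percentiles of the bootstrap means give an
# asymmetric 90% confidence interval for the mean
ci_l, ci_h = np.percentile(boot_means, [5, 95])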

Related

Obtaining the percentile from a distribution

How can I obtain the percentiles (for example the mean, or the 10th and 90th percentiles) of a distribution obtained from some program or experiment? In the sample below I generate a normal distribution just for illustration.
import numpy as np
from scipy.stats import norm
x = np.linspace(1, 10, 1001)
count = norm.pdf(x, 5, 1)
This will be a Gaussian curve (for this particular illustration case) if plotted as plt.plot(x, count). Note that this is not the data points but the distribution (which you can obtain with, e.g., count, x, _ = plt.hist(data)), so I can't use p10 = np.percentile(count, 10)
but I would want something similar, such as
p10 = module.percentile(x,dist,10)
Does any of you know of such a module, or do you know of some other means of obtaining the percentile?
I am not sure if this is what you are looking for, but scipy.stats distributions have a ppf method that computes their percentiles. For example, to get the 30th percentile of the normal distribution with mean 5 and standard deviation 1 you can use:
from scipy.stats import norm
norm.ppf(0.3, loc=5, scale=1)
This gives:
4.475599487291959
Then, you can select elements of an array x which are in this percentile:
x[x < norm.ppf(0.3, loc=5, scale=1)]
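For the 10th and 90th percentiles mentioned in the question, the same method applies (a short sketch, assuming the same mean-5, standard-deviation-1 normal):
from scipy.stats import norm

p10 = norm.ppf(0.10, loc=5, scale=1)  # ~3.718
p90 = norm.ppf(0.90, loc=5, scale=1)  # ~6.282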

Discretize a normal distribution to get the probability of a random variable

Suppose I draw randomly from a normal distribution with mean zero and standard deviations given by a vector of, say, dimension 3:
scale_rng=np.array([1,2,3])
eps=np.random.normal(0,scale_rng)
I need to compute a weighted average based on some simulations for which I draw the above-mentioned eps. The weights of this average are "the probability of eps" (hence I will have a vector with 3 weights). By weighted average I simply mean an arithmetic sum where each component is multiplied by a weight, i.e. a number between 0 and 1, and where all the weights sum to one.
This weighted average shall be calculated as follows: I have a time series of observations for one variable, x. I calculate an expanding rolling standard deviation of x (say these are the values in scale). Then, for each time-observation in x, I extract a random variable eps from a normal distribution as explained above and add it, obtaining y = x + eps. Finally, I need to compute the weighted average of y where each value of y is weighted by "the probability of drawing each value of eps from a normal distribution with mean zero and standard deviation equal to scale".
Now, I know that I cannot think of this as the points on the pdf corresponding to the values randomly drawn, because a normal random variable is continuous and as such the probability of any single point is zero. Hence, the only solution I found is to discretize the normal distribution with a certain number of bins and then find the probability that a value extracted with the code above falls into a given bin. How could I do this in Python?
EDIT: the solution I found is to use
norm.cdf(eps_it+0.5, loc=0, scale=scale_rng)-norm.cdf(eps_it-0.5, loc=0, scale=scale_rng)
which is not really based on discretization, but at least it seems feasible to me "probability-wise".
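As a runnable version of that EDIT (a sketch, assuming a fixed bin half-width of 0.5 and one draw per scale, as above):
import numpy as np
from scipy.stats import norm

scale_rng = np.array([1, 2, 3])
eps_it = np.random.normal(0, scale_rng)

# probability mass of a width-1 bin centred on each draw
w = (norm.cdf(eps_it + 0.5, loc=0, scale=scale_rng)
     - norm.cdf(eps_it - 0.5, loc=0, scale=scale_rng))
w /= w.sum()  # normalise so the weights sum to one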
Here's an example leaving everything continuous:
import numpy as np
from scipy import stats

# some function we want a Monte Carlo estimate of
def fn(eps):
    return np.sum(np.abs(eps), axis=1)

# define distribution of eps
sd = np.array([1, 2, 3])
d_eps = stats.norm(0, sd)

# draw uniform samples so we don't double apply the normal density
eps = np.random.uniform(-6*sd, 6*sd, size=(10000, 3))

# calculate weights (working with log-likelihood is better for numerical stability)
w = np.prod(d_eps.pdf(eps), axis=1)

# normalise so weights sum to 1
w /= np.sum(w)

# get estimate
np.sum(fn(eps) * w)
which gives me 4.71, 4.74, 4.70, 4.78 if I run it a few times. We can verify this is correct by just taking the mean when eps is drawn from the normal directly:
np.mean(fn(d_eps.rvs(size=(10000, 3))))
which gives me essentially the same values, but with lower variance, as expected: e.g. 4.79, 4.76, 4.77, 4.82, 4.80.

How to use norm.ppf()?

I couldn't understand how to properly use this function, could someone please explain it to me?
Let's say I have:
a mean of 172.7815
a standard deviation of 4.1532
N = 50 (50 samples)
When I'm asked to calculate the (95%) margin of error using norm.ppf(), will the code look like the line below?
norm.ppf(0.95, loc=172.78, scale=4.15)
or will it look like this?
norm.ppf(0.95, loc=0, scale=1)
I know it's calculating the area of the curve to the right of the confidence interval (95%, 97.5%, etc.; see image below), but when I have a mean and a standard deviation, I get really confused about how to use the function.
The method norm.ppf() takes a percentage and returns a standard deviation multiplier for the value at which that percentage occurs.
It is equivalent to a 'one-tail test' on the density plot.
From scipy.stats.norm:
ppf(q, loc=0, scale=1) Percent point function (inverse of cdf — percentiles).
Standard Normal Distribution
The code:
norm.ppf(0.95, loc=0, scale=1)
Returns a 95% significance interval for a one-tail test on a standard normal distribution (i.e. a special case of the normal distribution where the mean is 0 and the standard deviation is 1).
Our Example
To calculate the value at which our 95% significance interval lies for the OP-provided example (for a one-tail test), we would use:
norm.ppf(0.95, loc=172.7815, scale=4.1532)
This will return a value (that functions as a 'standard-deviation multiplier') marking where 95% of data points would be contained if our data is a normal distribution.
To get the exact number, we take the norm.ppf() output and multiply it by our standard deviation for the distribution in question.
A Two-Tailed Test
If we need to calculate a 'two-tail test' (i.e. we're concerned with values both greater and less than our mean), then we need to split the significance (i.e. our alpha value), because we're still using a one-tail calculation method. Splitting alpha in half allocates the significance level to both tails. A 95% significance level has a 5% alpha; splitting the 5% alpha across both tails gives 2.5% per tail. Subtracting 2.5% from 100% gives 97.5% as the input for the significance level.
Therefore, if we were concerned with values on both sides of our mean, our code would input .975 to represent a 95% significance level across two-tails:
norm.ppf(0.975, loc=172.7815, scale=4.1532)
Margin of Error
Margin of error expresses the sampling uncertainty when estimating a population parameter with a sample statistic. We want to generate our 95% confidence interval using the two-tailed input to norm.ppf(), since we're concerned with values both greater and less than our mean:
ppf = norm.ppf(0.975, loc=172.7815, scale=4.1532)
Next, we'd take the ppf and multiply it by our standard deviation to return the interval value:
interval_value = std * ppf
Finally, we'd mark the confidence intervals by adding & subtracting the interval value from the mean:
lower_95 = mean - interval_value
upper_95 = mean + interval_value
Plot with vertical lines:
_ = plt.axvline(lower_95, color='r', linestyle=':')
_ = plt.axvline(upper_95, color='r', linestyle=':')
James' statement that norm.ppf returns a "standard deviation multiplier" is wrong. This feels pertinent, as his post is the top Google result when one searches for norm.ppf.
norm.ppf is the inverse of norm.cdf. In the example, it simply returns the value at the 95th percentile. There is no "standard deviation multiplier" involved.
A better answer exists here:
How to calculate the inverse of the normal cumulative distribution function in python?
You can figure out the confidence interval with norm.ppf directly, without calculating the margin of error:
upper_of_interval = norm.ppf(0.975, loc=172.7815, scale=4.1532/np.sqrt(50))
lower_of_interval = norm.ppf(0.025, loc=172.7815, scale=4.1532/np.sqrt(50))
4.1532 is the sample standard deviation, not the standard deviation of the sampling distribution of the sample mean. So scale in norm.ppf is specified as scale = 4.1532 / np.sqrt(50), which is an estimator of the standard deviation of the sampling distribution.
(The standard deviation of the sampling distribution equals the population standard deviation / np.sqrt(sample size). Here we do not know the population standard deviation, and the sample size is more than 30, so sample standard deviation / np.sqrt(sample size) can be used as a good estimator.)
Margin of error can be calculated with (upper_of_interval - lower_of_interval) / 2.
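Putting this answer together as one runnable snippet (a minimal sketch using the OP's numbers):
import numpy as np
from scipy.stats import norm

mean, s, n = 172.7815, 4.1532, 50  # sample mean, sample sd, sample size

upper_of_interval = norm.ppf(0.975, loc=mean, scale=s/np.sqrt(n))
lower_of_interval = norm.ppf(0.025, loc=mean, scale=s/np.sqrt(n))
margin_of_error = (upper_of_interval - lower_of_interval) / 2

print(lower_of_interval, upper_of_interval, margin_of_error)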
Calculate the value at the 95th percentile and draw a vertical line with an annotation:
import matplotlib.pyplot as plt
from scipy.stats import norm

mean = 172.7815
std = 4.1532
N = 50
results = norm.rvs(mean, std, size=N)
pct_95 = norm.ppf(.95, mean, std)  # value at the 95th percentile
plt.hist(results, bins=10)
plt.axvline(pct_95)
plt.annotate(f"{pct_95:.2f}", xy=(pct_95, 6))
plt.show()
As other answers pointed out, norm.ppf(1-alpha) returns the value at the (1-alpha)x100-th percentile of the normal distribution specified by the parameters passed to it. For example, in the OP it returns the 95th percentile of a normal distribution with mean 172.78 and standard deviation 4.15.
If you're looking for a function that returns the same value (the N-th percentile on the normal distribution) as a function of alpha instead, there's the inverse survival function, norm.isf(alpha), which tells you the value above which a fraction alpha of the distribution lies.
import numpy as np
from scipy.stats import norm

alpha = 0.05
v1 = norm.isf(alpha)
v2 = norm.ppf(1 - alpha)
np.isclose(v1, v2)  # True

Can a normal distribution's probability density be greater than 1? ... based on a Python code check

I have a question:
Given a mean and variance, I want to calculate the probability of a sample, using a normal distribution as the probability basis.
The numbers are:
mean = -0.546369
var = 0.006443
curr_sample = -0.466102
import numpy as np
prob = 1/(np.sqrt(2*np.pi*var)) * np.exp(-((curr_sample - mean)**2) / (2*var))
I get a probability which is larger than 1: prob = 3.014558...
What is causing this? Is it the fact that the variance is so small? It seems like totally legal input to the formula and should give something small, not greater than 1. Any suggestions?
OK, what you compute is not a probability, but a probability density (which may be larger than one). In order to get 1, you have to integrate over the normal distribution, like so:
import numpy as np
mean = -0.546369
var = 0.006443
curr_sample = np.linspace(-10, 10, 10000)
prob = np.sum(1/(np.sqrt(2*np.pi*var)) * np.exp(-((curr_sample - mean)**2)/(2*var)) * (curr_sample[1] - curr_sample[0]))
print(prob)
which results in
0.99999999999961509
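The same check can be done with scipy's built-in cdf instead of a manual Riemann sum (a quick sketch, assuming the same mean and variance):
import numpy as np
from scipy.stats import norm

mean, var = -0.546369, 0.006443
# total probability mass between -10 and 10: ~1.0
print(norm.cdf(10, loc=mean, scale=np.sqrt(var))
      - norm.cdf(-10, loc=mean, scale=np.sqrt(var)))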
The formula you give is a probability density, not a probability. The density formula is such that when you integrate it between two values of x, you get the probability of being in that interval. However, this means that the probability of getting any particular sample is, in fact, 0 (it's the density times the infinitesimally small dx).
So what are you actually trying to calculate? You probably want something like the probability of getting your value or larger, the so-called tail probability, which is often used in statistics (it so happens that this is given by the error function when you're talking about a normal distribution, although you need to be careful of exactly how it's defined).
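A minimal sketch of that tail probability with scipy, assuming the OP's numbers (norm.sf is the survival function, i.e. 1 - cdf):
import numpy as np
from scipy.stats import norm

mean, var, curr_sample = -0.546369, 0.006443, -0.466102
# probability of drawing curr_sample or larger: ~0.159
tail_prob = norm.sf(curr_sample, loc=mean, scale=np.sqrt(var))
print(tail_prob)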
When considering the bell-shaped probability density function (PDF) of given mean and variance, the peak value of the curve (the height at the mode) is 1/sqrt(2*pi*var). It is 1/sqrt(2*pi), about 0.399, for the standard normal distribution (mean 0 and var 1). Hence, when evaluating the pdf of a general normal distribution at a specific value, results larger than 1 are possible whenever var < 1/(2*pi).
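With the OP's variance the peak density is about 4.97, well above 1 (a quick numeric check, assuming the same numbers):
import numpy as np
from scipy.stats import norm

mean, var = -0.546369, 0.006443
print(norm.pdf(mean, loc=mean, scale=np.sqrt(var)))  # ~4.97
print(1/np.sqrt(2*np.pi*var))                        # same value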

Probability to z-score and vice versa

How do I calculate the z-score of a p-value and vice versa?
For example, if I have a p-value of 0.95, I should get 1.96 in return.
I saw some functions in scipy, but they only run a z-test on an array.
I have access to numpy, statsmodel, pandas, and scipy (I think).
>>> import scipy.stats as st
>>> st.norm.ppf(.95)
1.6448536269514722
>>> st.norm.cdf(1.64)
0.94949741652589625
As other users noted, Python calculates left/lower-tail probabilities by default. If you want to determine the density points where 95% of the distribution is included, you have to take another approach:
>>> st.norm.ppf(.975)
1.959963984540054
>>> st.norm.ppf(.025)
-1.959963984540054
Starting in Python 3.8, the standard library provides the NormalDist object as part of the statistics module.
It can be used to get the z-score bounding the central x% of the area under a normal curve (excluding both tails).
We can obtain one from the other and vice versa using the inv_cdf (inverse cumulative distribution function) and the cdf (cumulative distribution function) on the standard normal distribution:
from statistics import NormalDist
NormalDist().inv_cdf((1 + 0.95) / 2.)
# 1.9599639845400536
NormalDist().cdf(1.9599639845400536) * 2 - 1
# 0.95
An explanation for the '(1 + 0.95) / 2.' formula can be found in this wikipedia section.
If you are interested in the t-test, you can do something similar:
The z-statistic (z-score) is used when the data follows a normal distribution, the population standard deviation sigma is known, and the sample size is above 30. The z-score tells you how many standard deviations from the mean your result is. It is calculated using the formula:
z_score = (xbar - mu) / sigma
The t-statistic (t-score), based on Student's t-distribution, is used when the data follows a normal distribution, the population standard deviation (sigma) is NOT known, but the sample standard deviation (s) is known or can be calculated, and the sample size is below 30. The t-score tells you how many standard deviations from the mean your result is. It is calculated using the formula:
t_score = (xbar - mu) / (s / sqrt(n))
Summary: If the sample size is larger than 30, the z-distribution and the t-distribution are pretty much the same and either one can be used. If the population standard deviation is available and the sample size is greater than 30, the t-distribution can be used with the population standard deviation instead of the sample standard deviation.
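As a tiny numeric sketch of the two formulas (all numbers here are made up for illustration):
import numpy as np

xbar, mu = 175.0, 172.7815      # sample mean, hypothesised population mean (made up)
sigma, s, n = 4.1532, 4.0, 25   # population sd, sample sd, sample size (made up)

z_score = (xbar - mu) / sigma             # when sigma is known
t_score = (xbar - mu) / (s / np.sqrt(n))  # when only s is known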
test statistic | lookup table | lookup value | critical value | normal distribution | population standard deviation (sigma) | sample size
z-statistic | z-table | z-score | z-critical is the z-score at a specific confidence level | yes | known | > 30
t-statistic | t-table | t-score | t-critical is the t-score at a specific confidence level | yes | not known | < 30
Python's percent point function (ppf) is used to calculate the critical values at a specific confidence level:
z-critical = stats.norm.ppf(1 - alpha) (use alpha = alpha/2 for two-sided)
t-critical = stats.t.ppf(1 - alpha/num_of_tails, df) (df = degrees of freedom)
Code:
import numpy as np
from scipy import stats
# alpha to critical
alpha = 0.05
n_sided = 2 # 2-sided test
z_crit = stats.norm.ppf(1-alpha/n_sided)
print(z_crit) # 1.959963984540054
# critical to alpha
alpha = stats.norm.sf(z_crit) * n_sided
print(alpha) # 0.05
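A matching sketch for the t-critical value (assuming, for illustration, a sample of size 50, hence 49 degrees of freedom):
from scipy import stats

alpha = 0.05
n_sided = 2   # 2-sided test
df = 50 - 1   # degrees of freedom = sample size - 1 (assumed n = 50)
t_crit = stats.t.ppf(1 - alpha/n_sided, df)
print(t_crit)  # ~2.0096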
Z-score to probability:
The code snippet below maps the negative of the absolute value of the z-score to the cdf of a standard normal distribution and multiplies by 2. This gives the probability of landing in either tail (Area1 + Area2 in the reference below):
import numpy as np
from scipy.stats import norm

zscore = 1.96  # example z-score
norm(0, 1).cdf(-np.absolute(zscore)) * 2  # ~0.05
Ref: https://mathbitsnotebook.com/Algebra2/Statistics/STzScores.html
