Obtaining the percentile from a distribution - python

How can I obtain percentiles (for example the median, or the 10th and 90th percentiles) of a distribution produced by some program or experiment? In the sample below I generate a normal distribution just for illustration.
import numpy as np
from scipy.stats import norm
x = np.linspace(1,10,1001)
count = norm.pdf(x,5,1)
This will be a Gaussian curve (for this particular illustration) if plotted with plt.plot(x,count). Note that these are not the data points but the distribution (which you can obtain with, e.g., count, edges, _ = plt.hist(data)), so I can't use p10 = np.percentile(count,10)
but I would want something similar, such as
p10 = module.percentile(x,dist,10)
Does anyone know of such a module, or of some other means of obtaining the percentile?

I am not sure if this is what you are looking for, but scipy.stats distributions have a ppf method (the percent point function, i.e. the inverse of the CDF) that computes their percentiles. For example, to get the 30th percentile of the normal distribution with mean 5 and standard deviation 1 you can use:
from scipy.stats import norm
norm.ppf(0.3, loc=5, scale=1)
This gives:
4.475599487291959
Then you can select the elements of an array x that fall below this percentile:
x[x < norm.ppf(0.3, loc=5, scale=1)]
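If all you have is the tabulated curve (x, count) rather than a scipy.stats distribution object, one possible approach is to integrate the curve numerically into a CDF and interpolate its inverse. The sketch below is only an illustration under that assumption: percentile_from_density is a made-up helper name, not a function from any library, and it assumes x is sorted in increasing order and count is non-negative.

```python
import numpy as np
from scipy.stats import norm

def percentile_from_density(x, density, q):
    """Hypothetical helper: q-th percentile (0-100) of a density tabulated on x."""
    x = np.asarray(x, dtype=float)
    density = np.asarray(density, dtype=float)
    # Trapezoidal area of each interval between neighbouring grid points
    steps = 0.5 * (density[1:] + density[:-1]) * np.diff(x)
    # Running sum of the areas gives the (unnormalised) CDF, starting at 0
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    cdf /= cdf[-1]  # normalise so the CDF ends at 1
    # Invert the CDF by linear interpolation
    return np.interp(q / 100.0, cdf, x)

# Same illustration data as in the question
x = np.linspace(1, 10, 1001)
count = norm.pdf(x, 5, 1)

p10 = percentile_from_density(x, count, 10)
p50 = percentile_from_density(x, count, 50)
p90 = percentile_from_density(x, count, 90)
print(p10, p50, p90)
```

For histogram output you would pass the bin centres as x and the bin counts as the density. Accuracy depends on how finely the curve is sampled and on how much probability lies outside the tabulated range.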

Related

Generating non-random normally distributed values between two points

I've stumbled across this code in an answer to a question and I'd like to automate the process of getting the distribution to fit neatly between two bounds.
import numpy as np
from scipy import stats
bounds = [0, 100]
n = np.mean(bounds)
# your distribution:
distribution = stats.norm(loc=n, scale=20)
# percentile point, the range for the inverse cumulative distribution function:
bounds_for_range = distribution.cdf(bounds)
# Linspace for the inverse cdf:
pp = np.linspace(*bounds_for_range, num=1000)
x = distribution.ppf(pp)
# And just to check that it makes sense you can try:
from matplotlib import pyplot as plt
plt.hist(x)
plt.show()
Let's say I have the values [720, 965], or any other bounds, that I would like to fit my distribution across. Is there a way to soft-code the adjustment of scale in stats.norm to fit this distribution across my bounds without any unreasonable gaps? Or are there any functions that have this type of functionality?
A scale of ~20 works well for the example code, but I have to adjust it to ~50 for the example of [720, 965]
I am not sure, but a truncated normal distribution should be exactly what you are looking for.
from scipy.stats import truncnorm
# a and b are the clip points in units of standard deviations from loc:
# a = (lower - loc) / scale, b = (upper - loc) / scale
distr_ab = truncnorm(a, b) # standard normal truncated to the interval [a, b]
distr_ab.rvs(size=100) # get 100 samples from the distribution
# distr_ab.cdf, distr_ab.ppf etc. are all accessible
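As a rough sketch of how this could be applied to the bounds [720, 965] from the question: truncnorm takes its clip points a and b in standardized units, so they have to be derived from loc and scale. Placing the bounds two standard deviations from the mean is purely an assumption made here for illustration; any other ratio works the same way.

```python
import numpy as np
from scipy.stats import truncnorm

lower, upper = 720.0, 965.0

# Assumption for illustration: centre the distribution between the bounds
# and put each bound 2 standard deviations away from the mean.
loc = (lower + upper) / 2
scale = (upper - lower) / 4

# truncnorm expects the clip points in standardized units
a = (lower - loc) / scale
b = (upper - loc) / scale
dist = truncnorm(a, b, loc=loc, scale=scale)

# Evenly spaced probabilities mapped through the inverse CDF give
# non-random, normally shaped values that stay inside the bounds.
pp = np.linspace(0, 1, 1000)
x = dist.ppf(pp)
print(x.min(), x.max())
```

Because the scale is computed from the bounds, changing lower and upper does not require hand-tuning it.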

How to correctly calculate the MEDIAN of a probability function?

I am trying to calculate the exact median of a simple standard normal PDF in Python 3.6. The code looks like this:
from scipy.stats import norm
from pynverse import inversefunc
mean = 'some_number'
standard_deviation = 1
inverse_normal_pdf = inversefunc(lambda x: norm.pdf(x, mean, standard_deviation))
median = inverse_normal_pdf(norm.pdf(float('-inf'), mean, standard_deviation)+.5)
I use the pynverse library to get the inverse of the normal PDF and use the solver for the upper limit of integration to arrive at the solution for the median. But this method works only for means in the range [-8.6:11.2]; any mean outside this range gives me exactly the number 2.6180339603380443 for some reason. I can't figure out what's happening here. What is this number?
If your distribution is symmetric (as the normal distribution is), the theoretical median equals the mean.
Otherwise, the median is the value at which the CDF equals 0.5.
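A minimal sketch of both cases with scipy.stats, where ppf is the inverse of the CDF. The mean of 100 and the choice of the exponential distribution as the skewed example are arbitrary values picked for illustration.

```python
from scipy.stats import norm, expon

mean = 100.0
standard_deviation = 1.0

# Symmetric case: the value where the CDF equals 0.5 coincides with the mean
median = norm.ppf(0.5, loc=mean, scale=standard_deviation)

# Skewed case: the median of the standard exponential is ln(2), not its mean of 1
skewed_median = expon.ppf(0.5)

print(median, skewed_median)
```

The inversion here goes through the CDF, not the PDF, so it does not depend on where the mean sits.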

Confidence Interval for Inverse Gauss distribution with scipy.stats

I am attempting to fit an inverse gauss distribution to data using the scipy.stats toolbox. The data fits well using the following code:
from scipy import stats
dist = stats.invgauss
# fit a distribution to the data
dist_fit = dist.fit(data)
dist_model = dist(*dist_fit)
# find the distribution mean
dist_mu = dist_model.mean()
# find the distribution standard deviation
dist_std = dist_model.std()
This produces a fit to the distribution that looks like this: [figure: inverse_gaussian_fit]
I am trying to determine the confidence interval for the mean of this distribution. From my understanding, the confidence interval of the mean is equal to the standard error of the mean (which is equal to the standard deviation divided by the square root of the number of tests) multiplied by the percent point function (which is equal to the inverse of the cumulative distribution function) at the confidence level desired. I can do this using the following code:
import numpy as np
# find the inverse gaussian standard error/confidence interval
dist_se = dist_std / np.sqrt(n)
dist_ci_l = dist_se * dist_model.ppf(0.05)
dist_ci_h = dist_se * dist_model.ppf(0.95)
Unfortunately, this produces unrealistic results like this: [figure: inverse_gaussian_running_averages]
How can I generate the asymmetric confidence interval for an inverse gauss function? I have seen many applications where one assumes the confidence interval from a normal distribution, but that creates symmetric confidence intervals.
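One possible route to an asymmetric interval, offered only as a sketch and not as an established answer to this question, is a percentile bootstrap of the mean: resample the data with replacement, recompute the mean each time, and read the interval off the percentiles of those means. The data below are synthetic stand-ins drawn from an inverse Gaussian with an arbitrarily chosen mu, and the 5%/95% levels mirror the ppf(0.05) and ppf(0.95) calls in the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the measured data (mu=0.5 is an arbitrary choice)
data = stats.invgauss.rvs(mu=0.5, size=500, random_state=rng)

# Percentile bootstrap: resample with replacement and recompute the mean
n_boot = 2000
idx = rng.integers(0, len(data), size=(n_boot, len(data)))
boot_means = data[idx].mean(axis=1)

# The 5th and 95th percentiles of the bootstrap means bound a 90% interval,
# which need not be symmetric around the sample mean
ci_low, ci_high = np.percentile(boot_means, [5, 95])
print(ci_low, data.mean(), ci_high)
```

This bootstraps the sample mean rather than refitting the distribution on every resample, which keeps it fast; refitting inside the loop is an alternative if the fitted mean is the quantity of interest.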

numpy.random.normal different distribution: selecting values from distribution

I have a power-law distribution of energies and I want to pick n random energies based on the distribution. I tried doing this manually using random numbers but it is too inefficient for what I want to do. I'm wondering whether there is a method in numpy (or elsewhere) that works like numpy.random.normal, except that instead of using a normal distribution, the distribution may be specified. So in my mind an example might look like this (similar to numpy.random.normal):
import numpy as np
# Energies from within which I want values drawn
eMin = 50.
eMax = 2500.
# Amount of energies to be drawn
n = 10000
photons = []
for i in range(n):
    # Method that I just made up which would work like random.normal,
    # i.e. return an energy on the distribution based on its probability,
    # but take a distribution other than a normal distribution
    photons.append(np.random.distro(eMin, eMax, lambda e: e**(-1.)))
print(photons)
Printing photons should give me a list of length 10000 populated by energies in this distribution. If I were to histogram this it would have much greater bin values at lower energies.
I am not sure if such a method exists but it seems like it should. I hope it is clear what I want to do.
EDIT:
I have seen numpy.random.power but my exponent is -1 so I don't think this will work.
Sampling from arbitrary PDFs well is actually quite hard. There are large and dense books just about how to efficiently and accurately sample from the standard families of distributions.
It looks like you could probably get by with a custom inversion method for the example that you gave.
If you want to sample from an arbitrary distribution you need the inverse of the cumulative distribution function (the cdf, not the pdf).
You then sample a probability uniformly from range [0,1] and feed this into the inverse of the cdf to get the corresponding value.
It is often not possible to obtain the cdf from the pdf analytically.
However, if you're happy to approximate the distribution, you could do so by calculating f(x) at regular intervals over its domain, then doing a cumsum over this vector to get an approximation of the cdf and from this approximate the inverse.
Rough code snippet:
import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate
def f(x):
    """
    substitute this function with your arbitrary distribution
    must be positive over domain
    """
    return 1.0 / x
# you should vary inputVals to cover the domain of f (for better accuracy you can
# be clever about the spacing of values as well). Here they are spaced logarithmically
# over the energy range from the question, but you could definitely do better
inputVals = np.logspace(np.log10(50.), np.log10(2500.), 10000)
# everything else should just work
funcVals = f(inputVals)
# cumulative trapezoidal integral of f over the grid approximates the cdf
cdf = np.zeros(len(funcVals))
cdf[1:] = np.cumsum(0.5 * (funcVals[1:] + funcVals[:-1]) * np.diff(inputVals))
cdf /= cdf[-1]
# you could also improve the approximation by choosing an appropriate interpolator
inverseCdf = scipy.interpolate.interp1d(cdf, inputVals)
# grab 10k samples from the distribution
samples = inverseCdf(np.random.uniform(0, 1, size=10000))
plt.hist(samples, bins=500)
plt.show()
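For the specific 1/e density in the question, the inversion can also be done in closed form, which avoids the grid entirely. On [eMin, eMax] the cdf is ln(e/eMin) / ln(eMax/eMin), so its inverse is eMin * (eMax/eMin)**u for u uniform on [0, 1]. A minimal sketch under that assumption (the seed and variable names are illustrative only):

```python
import numpy as np

e_min = 50.0
e_max = 2500.0
n = 10000

rng = np.random.default_rng(0)

# Uniform probabilities fed through the analytic inverse cdf of p(e) ~ 1/e
u = rng.uniform(0.0, 1.0, size=n)
photons = e_min * (e_max / e_min) ** u

print(photons.min(), photons.max())
```

This only works because the cdf of 1/e can be inverted by hand; for densities without a closed-form inverse, the numerical approach above is the fallback.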
Why don't you use eval and put the distribution in a string?
>>> cmd = "numpy.random.normal(500)"
>>> eval(cmd)
You can manipulate the string as you wish to set the distribution.

Probability to z-score and vice versa

How do I calculate the z score of a p-value and vice versa?
For example if I have a p-value of 0.95 I should get 1.96 in return.
I saw some functions in scipy but they only run a z-test on an array.
I have access to numpy, statsmodels, pandas, and scipy (I think).
>>> import scipy.stats as st
>>> st.norm.ppf(.95)
1.6448536269514722
>>> st.norm.cdf(1.64)
0.94949741652589625
As other users noted, scipy.stats works with left (lower-tail) probabilities by default. If you want the points that bound the central 95% of the distribution, you have to take another approach:
>>> st.norm.ppf(.975)
1.959963984540054
>>> st.norm.ppf(.025)
-1.9599639845400545
Starting in Python 3.8, the standard library provides the NormalDist object as part of the statistics module.
It can be used to get the zscore for which x% of the area under a normal curve lies (ignoring both tails).
We can obtain one from the other and vice versa using the inv_cdf (inverse cumulative distribution function) and the cdf (cumulative distribution function) on the standard normal distribution:
from statistics import NormalDist
NormalDist().inv_cdf((1 + 0.95) / 2.)
# 1.9599639845400536
NormalDist().cdf(1.9599639845400536) * 2 - 1
# 0.95
An explanation for the '(1 + 0.95) / 2.' formula can be found in this wikipedia section.
If you are interested in the t-test, you can do something similar:
The z-statistic (z-score) is used when the data follow a normal distribution, the population standard deviation sigma is known, and the sample size is above 30. The z-score tells you how many standard errors the sample mean xbar lies from the population mean mu. It is calculated using the formula:
z_score = (xbar - mu) / (sigma/sqrt(n))
The t-statistic (t-score), based on Student's t-distribution, is used when the data follow a normal distribution, the population standard deviation (sigma) is NOT known but the sample standard deviation (s) is known or can be calculated, and the sample size is below 30. The t-score tells you how many estimated standard errors the sample mean lies from mu. It is calculated using the formula:
t_score = (xbar - mu) / (s/sqrt(n))
Summary: If the sample sizes are larger than 30, the z-distribution and the t-distributions are pretty much the same and either one can be used. If the population standard deviation is available and the sample size is greater than 30, t-distribution can be used with the population standard deviation instead of the sample standard deviation.
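A small worked sketch of the two statistics for a sample mean, dividing by the standard error (sigma/sqrt(n) or s/sqrt(n)). The sample values, mu, and sigma are made up for illustration; scipy.stats.ttest_1samp is used only as a cross-check of the hand-computed t-score.

```python
import numpy as np
from scipy import stats

# Made-up sample and hypothesised population mean
sample = np.array([5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3])
mu = 5.0
sigma = 0.3  # assumed known population standard deviation (z case only)

n = len(sample)
xbar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation

# z uses the known sigma, t uses the estimated s
z_score = (xbar - mu) / (sigma / np.sqrt(n))
t_score = (xbar - mu) / (s / np.sqrt(n))

# Cross-check the t-score against scipy's one-sample t-test
t_check = stats.ttest_1samp(sample, mu).statistic
print(z_score, t_score, t_check)
```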
test statistics | lookup table | lookup values | critical value | normal distribution | population standard deviation (sigma) | sample size
z-statistics | z-table | z-score | z-critical is z-score at a specific confidence level | yes | known | > 30
t-statistics | t-table | t-score | t-critical is t-score at a specific confidence level | yes | not known | < 30
The percent point function (ppf) in scipy.stats is used to calculate the critical values at a specific confidence level:
z-critical = stats.norm.ppf(1 - alpha) (use alpha/2 in place of alpha for two-sided)
t-critical = stats.t.ppf(1 - alpha/numOfTails, df) (df = degrees of freedom, n - 1 for a one-sample test)
Code
import numpy as np
from scipy import stats
# alpha to critical
alpha = 0.05
n_sided = 2 # 2-sided test
z_crit = stats.norm.ppf(1-alpha/n_sided)
print(z_crit) # 1.959963984540054
# critical to alpha
alpha = stats.norm.sf(z_crit) * n_sided
print(alpha) # 0.05
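The same round trip can be sketched for the t-distribution, which additionally needs the degrees of freedom. The value df = 9 (a one-sample test with 10 observations) is an arbitrary choice for illustration.

```python
from scipy import stats

alpha = 0.05
n_sided = 2  # 2-sided test
df = 9       # degrees of freedom, e.g. n - 1 for a one-sample test with n = 10

# alpha to critical value (upper tail)
t_crit = stats.t.ppf(1 - alpha / n_sided, df)
print(t_crit)

# critical value back to alpha, using the survival function (1 - cdf)
alpha_back = stats.t.sf(t_crit, df) * n_sided
print(alpha_back)
```

With few degrees of freedom the t critical value is noticeably larger than the z critical value of about 1.96; the two converge as df grows.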
Z-score to probability:
The code snippet below maps the negative of the absolute value of the z-score to the cdf of a standard normal distribution and multiplies by 2. This gives the two-tailed probability, i.e. the combined area of both tails (Area1 + Area2 shaded in the referenced picture):
import numpy as np
from scipy.stats import norm
zscore = 1.96 # example value
norm(0, 1).cdf(-np.absolute(zscore)) * 2
Ref: https://mathbitsnotebook.com/Algebra2/Statistics/STzScores.html
