How can write this formula to Python code? [duplicate] - python

I am looking for a function in Numpy or Scipy (or any rigorous Python library) that will give me the cumulative normal distribution function in Python.

Here's an example:
>>> from scipy.stats import norm
>>> norm.cdf(1.96)
0.9750021048517795
>>> norm.cdf(-1.96)
0.024997895148220435
In other words, approximately 95% of the standard normal interval lies within two standard deviations, centered on a standard mean of zero.
If you need the inverse CDF:
>>> norm.ppf(norm.cdf(1.96))
array(1.9599999999999991)

It may be too late to answer the question but since Google still leads people here, I decide to write my solution here.
That is, since Python 2.7, the math library has integrated the error function math.erf(x)
The erf() function can be used to compute traditional statistical functions such as the cumulative standard normal distribution:
from math import *
def phi(x):
#'Cumulative distribution function for the standard normal distribution'
return (1.0 + erf(x / sqrt(2.0))) / 2.0
Ref:
https://docs.python.org/2/library/math.html
https://docs.python.org/3/library/math.html
How are the Error Function and Standard Normal distribution function related?

Starting Python 3.8, the standard library provides the NormalDist object as part of the statistics module.
It can be used to get the cumulative distribution function (cdf - probability that a random sample X will be less than or equal to x) for a given mean (mu) and standard deviation (sigma):
from statistics import NormalDist
NormalDist(mu=0, sigma=1).cdf(1.96)
# 0.9750021048517796
Which can be simplified for the standard normal distribution (mu = 0 and sigma = 1):
NormalDist().cdf(1.96)
# 0.9750021048517796
NormalDist().cdf(-1.96)
# 0.024997895148220428

Adapted from here http://mail.python.org/pipermail/python-list/2000-June/039873.html
from math import *
def erfcc(x):
"""Complementary error function."""
z = abs(x)
t = 1. / (1. + 0.5*z)
r = t * exp(-z*z-1.26551223+t*(1.00002368+t*(.37409196+
t*(.09678418+t*(-.18628806+t*(.27886807+
t*(-1.13520398+t*(1.48851587+t*(-.82215223+
t*.17087277)))))))))
if (x >= 0.):
return r
else:
return 2. - r
def ncdf(x):
return 1. - 0.5*erfcc(x/(2**0.5))

To build upon Unknown's example, the Python equivalent of the function normdist() implemented in a lot of libraries would be:
def normcdf(x, mu, sigma):
t = x-mu;
y = 0.5*erfcc(-t/(sigma*sqrt(2.0)));
if y>1.0:
y = 1.0;
return y
def normpdf(x, mu, sigma):
u = (x-mu)/abs(sigma)
y = (1/(sqrt(2*pi)*abs(sigma)))*exp(-u*u/2)
return y
def normdist(x, mu, sigma, f):
if f:
y = normcdf(x,mu,sigma)
else:
y = normpdf(x,mu,sigma)
return y

Alex's answer shows you a solution for standard normal distribution (mean = 0, standard deviation = 1). If you have normal distribution with mean and std (which is sqr(var)) and you want to calculate:
from scipy.stats import norm
# cdf(x < val)
print norm.cdf(val, m, s)
# cdf(x > val)
print 1 - norm.cdf(val, m, s)
# cdf(v1 < x < v2)
print norm.cdf(v2, m, s) - norm.cdf(v1, m, s)
Read more about cdf here and scipy implementation of normal distribution with many formulas here.

Taken from above:
from scipy.stats import norm
>>> norm.cdf(1.96)
0.9750021048517795
>>> norm.cdf(-1.96)
0.024997895148220435
For a two-tailed test:
Import numpy as np
z = 1.96
p_value = 2 * norm.cdf(-np.abs(z))
0.04999579029644087

Simple like this:
import math
def my_cdf(x):
return 0.5*(1+math.erf(x/math.sqrt(2)))
I found the formula in this page https://www.danielsoper.com/statcalc/formulas.aspx?id=55

Related

Better way to calculate λ in a Poisson distribution if the probability of occurrence P(X>0) is known

Currently I am using the following function but I am wondering if there is a more efficient way or a simple formula to accomplish this?
from scipy.stats import poisson
def calc_expected_value(event_proba):
x = 0.01
while round(1 - poisson.pmf(0, x), 2) != round(event_proba, 2):
x += 0.01
return x
P(X = 0) = exp(-lambda), hence P(X > 0) = 1 - exp(-lambda). If you call this probability event_proba, then
exp(-lambda) = 1 - event_proba
hence
lambda = -log(1 - event_proba)
Of course, in actual Python code you should avoid the name lambda since it has a built-in meaning.
Here it is as a simple formula:
from math import log
def calc_lambda(p_gt_0):
return -log(1.0 - p_gt_0)
This is derived from the fact that P{X=0} = 1 - P{X>0}.
Plug in the formula for a Poisson probability and solve to yield the implementation given above.

Optimize the rejection method for generating variables

I have a problem with optimization of the rejection method of generating continuous random variables. I've got a density: f(x) = 3/2 (1-x^2). Here's my code:
import random
import matplotlib.pyplot as plt
import numpy as np
import time
import scipy.stats as ss
a=0 # xmin
b=1 # xmax
m=3/2 # ymax
variables = [] #list for variables
def f(x):
return 3/2 * (1 - x**2) #probability density function
reject = 0 # number of rejections
start = time.time()
while len(variables) < 100000: #I want to generate 100 000 variables
u1 = random.uniform(a,b)
u2 = random.uniform(0,m)
if u2 <= f(u1):
variables.append(u1)
else:
reject +=1
end = time.time()
print("Time: ", end-start)
print("Rejection: ", reject)
x = np.linspace(a,b,1000)
plt.hist(variables,50, density=1)
plt.plot(x, f(x))
plt.show()
ss.probplot(variables, plot=plt)
plt.show()
My first question: Is my probability plot made properly?
And the second, what is in the title. How to optimize that method? I would like to get some advice to optimize the code. Now that code takes about 0.5 seconds and there are about 50 000 rejections. Is it possible to reduce the time and number of rejections? If it's needed I can optimize using a different method of generating variables.
My first question: Is my probability plot made properly?
No. It is made versus default normal distribution. You have to pack your function f(x) into class derived from stats.rv_continuous, make it into _pdf method, and pass it to probplot
And the second, what is in the title. How to optimise that method? Is it possible to reduce the time and number of rejections?
Sure, you have the power of NumPy vector abilities at your hands. Don't ever write explicit loops - vectoriz, vectorize and vectorize!
Look at modified code below, not a single loop, everything is done via NumPy vectors. Time went down on my computer for 100000 samples (Xeon, Win10 x64, Anaconda Python 3.7) from 0.19 to 0.003.
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
import time
a = 0. # xmin
b = 1. # xmax
m = 3.0/2.0 # ymax
def f(x):
return 1.5 * (1.0 - x*x) # probability density function
start = time.time()
N = 100000
u1 = np.random.uniform(a, b, N)
u2 = np.random.uniform(0.0, m, N)
negs = np.empty(N)
negs.fill(-1)
variables = np.where(u2 <= f(u1), u1, negs) # accepted samples are positive or 0, rejected are -1
end = time.time()
accept = np.extract(variables>=0.0, variables)
reject = N - len(accept)
print("Time: ", end-start)
print("Rejection: ", reject)
x = np.linspace(a, b, 1000)
plt.hist(accept, 50, density=True)
plt.plot(x, f(x))
plt.show()
ss.probplot(accept, plot=plt) # against normal distribution
plt.show()
Concerning reducing number of rejections, you could sample with 0 rejects doing inverse method, it is cubic equation so it could work with easy
UPDATE
Here is the code to use for probplot:
class my_pdf(ss.rv_continuous):
def _pdf(self, x):
return 1.5 * (1.0 - x*x)
ss.probplot(accept, dist=my_pdf(a=a, b=b, name='my_pdf'), plot=plt)
and you should get something like
Regarding your first question, scipy.stats.probplot compares your sample against the quantiles of the normal distribution. If you'd like it to compare against the quantiles of your f(x) distribution, check out the dist parameter of probplot.
In terms of making this sampling procedure faster, avoiding loops is generally the way to go. Replacing the code between start = ... and end = ... with the following resulted in a >20x speedup for me.
n_before_accept_reject = 150000
u1 = np.random.uniform(a, b, size=n_before_accept_reject)
u2 = np.random.uniform(0, m, size=n_before_accept_reject)
variables = u1[u2 <= f(u1)]
reject = n_before_accept_reject - len(variables)
Note that this will give you approximately 100000 accepted samples each time you run it. You could raise the value of n_before_accept_reject slightly to effectively guarantee that variables will always have >100000 accepted values, and then just cap the size of variables to return exactly 100000 if necessary.
Others have spoken to the probability plotting, I'm going to address the efficiency of the rejection algorithm.
Acceptance/rejection schemes are based on m(x), a "majorizing function". A majorizing function should have two properties: 1) m(x)≥ f(x) ∀ x; and 2) m(x), when scaled to be a distribution, should be easy to generate values from.
You went with the constant function m = 3/2, which meets both requirements but does not bound f(x) very closely. Integrated from zero to one, that has an area of 3/2. Your f(x), being a valid density function, has an area of 1. Consequently, ∫f(x)) / ∫m(x)) = 1 / (3/2) = 2/3. In other words, 2/3 of the values you generate from the majorizing function are accepted, and you are rejecting 1/3 of the attempts.
You need an m(x) which provides a tighter bound for f(x). I went with a line which is tangent to f(x) at x = 1/2. With a little bit of calculus to get the slope, I derived m(x) = 15/8 - 3x/2.
This choice of m(x) has an area of 9/8, so only 1/9 of the values will be rejected. A bit more calculus yielded the inverse transform generator for x's based on this m(x) is x = (5 - sqrt(25 - 24U)) / 4, where U is a uniform(0,1) random varible.
Here's an implementation, based off your original version. I wrapped the rejection scheme in a function, and created the values with a list comprehension rather than appending to a list. As you'll see if you run this, it produces a lot fewer rejections than your original version.
import random
import matplotlib.pyplot as plt
import numpy as np
import time
import math
import scipy.stats as ss
a = 0 # xmin
b = 1 # xmax
reject = 0 # number of rejections
def f(x):
return 3.0 / 2.0 * (1.0 - x**2) #probability density function
def m(x):
return 1.875 - 1.5 * x
def generate_x():
global reject
while True:
x = (5.0 - math.sqrt(25.0 - random.uniform(0.0, 24.0))) / 4.0
u = random.uniform(0, m(x))
if u <= f(x):
return x
reject += 1
start = time.time()
variables = [generate_x() for _ in range(100000)]
end = time.time()
print("Time: ", end-start)
print("Rejection: ", reject)
x = np.linspace(a,b,1000)
plt.hist(variables,50, density=1)
plt.plot(x, f(x))
plt.show()

Solving implicit function and passing in three arguments

In the equation above I want to solve for f and pass in Re, D, and epsilon. Here is my code below:
import math
from scipy.optimize import fsolve
# Colebrook Turbulent Friction Factor Correlation
def colebrook(Re, D, eps):
return fsolve(-(1 / math.sqrt(f)) - 2 * math.log10(((eps / D) / 3.7) + (2.51 / Re * math.sqrt(f))), f)
Would I use fsolve() or solve()? I read up on fsolve() on Python's main site, however I don't understand some of the inputs it wants. Thank you in advance!
Also, I am using Spyder (Python 3.6)
The wikipedia page on "Darcy friction factor formulae" has a section on the Colebrook equation, and shows how f can be expressed in terms of the other parameters using the Lambert W function.
SciPy has an implementation of the Lambert W function, so you can use that to compute f without using a numerical solver:
import math
from scipy.special import lambertw
def colebrook(Re, D, eps):
"""
Solve the Colebrook equation for f, given Re, D and eps.
See
https://en.wikipedia.org/wiki/Darcy_friction_factor_formulae#Colebrook%E2%80%93White_equation
for more information.
"""
a = 2.51 / Re
b = eps / (3.7*D)
p = 1/math.sqrt(10)
lnp = math.log(p)
x = -lambertw(-lnp/a * math.pow(p, -b/a))/lnp - b/a
if x.imag != 0:
raise ValueError('x is complex')
f = 1/x.real**2
return f
For example,
In [84]: colebrook(125000, 0.315, 0.00015)
Out[84]: 0.019664137795383934
For comparison, the calculator at https://www.engineeringtoolbox.com/colebrook-equation-d_1031.html gives 0.0197.

scipy.stats.ttest_ind without array (python)

I have done a number of calculations to estimate μ, σ and N for my two samples. Due to a number of approximations I don't have the arrays that are expected as input to scipy.stats.ttest_ind. Unless I am mistaken I only need μ, σ and N to do a welch's t test. Is there a way to do this in python?
Here’s a straightforward implementation based on this:
import scipy.stats as stats
import numpy as np
def welch_t_test(mu1, s1, N1, mu2, s2, N2):
# Construct arrays to make calculations more succint.
N_i = np.array([N1, N2])
dof_i = N_i - 1
v_i = np.array([s1, s2]) ** 2
# Calculate t-stat, degrees of freedom, use scipy to find p-value.
t = (mu1 - mu2) / np.sqrt(np.sum(v_i / N_i))
dof = (np.sum(v_i / N_i) ** 2) / np.sum((v_i ** 2) / ((N_i ** 2) * dof_i))
p = stats.distributions.t.sf(np.abs(t), dof) * 2
return t, p
It yields virtually identical results:
sample1 = np.random.rand(10)
sample2 = np.random.rand(15)
result_test = welch_t_test(np.mean(sample1), np.std(sample1, ddof=1), sample1.size,
np.mean(sample2), np.std(sample2, ddof=1), sample2.size)
result_scipy = stats.ttest_ind(sample1, sample2,equal_var=False)
np.allclose(result_test, result_scipy)
True
As an update
The function is now available in scipy.stats, since version 0.16.0
http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.ttest_ind_from_stats.html
scipy.stats.ttest_ind_from_stats(mean1, std1, nobs1, mean2, std2, nobs2, equal_var=True)
T-test for means of two independent samples from descriptive statistics.
This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values.
I have written t-test and z-test functions that take the summary statistics for statsmodels.
Those were intended mainly as internal shortcuts to avoid code duplication, and are not well documented.
For example http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.weightstats._tstat_generic.html
The list of related functions is here:
http://statsmodels.sourceforge.net/devel/stats.html#basic-statistics-and-t-tests-with-frequency-weights
edit: in reply to comment
The function just does the core calculation, the actual calculation of the standard deviation of the difference under different assumptions is added to the calling method.
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/stats/weightstats.py#L713
edit
Here is an example how to use the methods of the CompareMeans class that includes the t-test based on summary statistics. We need to create a class that holds the relevant summary statistic as attributes. At the end there is a function that just wraps the relevant calls.
"""
Created on Wed Jul 23 05:47:34 2014
Author: Josef Perktold
License: BSD-3
"""
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import CompareMeans, ttest_ind
class SummaryStats(object):
def __init__(self, nobs, mean, std):
self.nobs = nobs
self.mean = mean
self.std = std
self._var = std**2
np.random.seed(123)
nobs = 20
x1 = 1 + np.random.randn(nobs)
x2 = 1 + 1.5 * np.random.randn(nobs)
print stats.ttest_ind(x1, x2, equal_var=False)
print ttest_ind(x1, x2, usevar='unequal')
s1 = SummaryStats(x1.shape[0], x1.mean(0), x1.std(0))
s2 = SummaryStats(x2.shape[0], x2.mean(0), x2.std(0))
print CompareMeans(s1, s2).ttest_ind(usevar='unequal')
def ttest_ind_summ(summ1, summ2, usevar='unequal'):
"""t-test for equality of means based on summary statistic
Parameters
----------
summ1, summ2 : tuples of (nobs, mean, std)
summary statistic for the two samples
"""
s1 = SummaryStats(*summ1)
s2 = SummaryStats(*summ2)
return CompareMeans(s1, s2).ttest_ind(usevar=usevar)
print ttest_ind_summ((x1.shape[0], x1.mean(0), x1.std(0)),
(x2.shape[0], x2.mean(0), x2.std(0)),
usevar='unequal')
''' result
(array(1.1590347327654558), 0.25416326823881513)
(1.1590347327654555, 0.25416326823881513, 35.573591346616553)
(1.1590347327654558, 0.25416326823881513, 35.57359134661656)
(1.1590347327654558, 0.25416326823881513, 35.57359134661656)
'''

Calculating Probability of a Random Variable in a Distribution in Python

Given a mean and standard-deviation defining a normal distribution, how would you calculate the following probabilities in pure-Python (i.e. no Numpy/Scipy or other packages not in the standard library)?
The probability of a random variable r where r < x or r <= x.
The probability of a random variable r where r > x or r >= x.
The probability of a random variable r where x > r > y.
I've found some libraries, like Pgnumerics, that provide functions for calculating these, but the underlying math is unclear to me.
Edit: To show this isn't homework, posted below is my working code for Python<=2.6, albeit I'm not sure if it handles the boundary conditions correctly.
from math import *
import unittest
def erfcc(x):
"""
Complementary error function.
"""
z = abs(x)
t = 1. / (1. + 0.5*z)
r = t * exp(-z*z-1.26551223+t*(1.00002368+t*(.37409196+
t*(.09678418+t*(-.18628806+t*(.27886807+
t*(-1.13520398+t*(1.48851587+t*(-.82215223+
t*.17087277)))))))))
if (x >= 0.):
return r
else:
return 2. - r
def normcdf(x, mu, sigma):
t = x-mu;
y = 0.5*erfcc(-t/(sigma*sqrt(2.0)));
if y>1.0:
y = 1.0;
return y
def normpdf(x, mu, sigma):
u = (x-mu)/abs(sigma)
y = (1/(sqrt(2*pi)*abs(sigma)))*exp(-u*u/2)
return y
def normdist(x, mu, sigma, f):
if f:
y = normcdf(x,mu,sigma)
else:
y = normpdf(x,mu,sigma)
return y
def normrange(x1, x2, mu, sigma, f=True):
"""
Calculates probability of random variable falling between two points.
"""
p1 = normdist(x1, mu, sigma, f)
p2 = normdist(x2, mu, sigma, f)
return abs(p1-p2)
All these are very similar: If you can compute #1 using a function cdf(x), then the solution to #2 is simply 1 - cdf(x), and for #3 it's cdf(x) - cdf(y).
Since Python includes the (gauss) error function built in since version 2.7 you can do this by calculating the cdf of the normal distribution using the equation from the article you linked to:
import math
print 0.5 * (1 + math.erf((x - mean)/math.sqrt(2 * standard_dev**2)))
where mean is the mean and standard_dev is the standard deviation.
Some notes since what you asked seemed relatively straightforward given the information in the article:
CDF of a random variable (say X) is the probability that X lies between -infinity and some limit, say x (lower case). CDF is the integral of the pdf for continuous distributions. The cdf is exactly what you described for #1, you want some normally distributed RV to be between -infinity and x (<= x).
< and <= as well as > and >= are same for continuous random variables as the probability that the rv is any single point is 0. So whether or not x itself is included doesn't actually matter when calculating the probabilities for continuous distributions.
Sum of probabilities is 1, if its not < x then it's >= x so if you have the cdf(x). then 1 - cdf(x) is the probability that the random variable X >= x. Since >= is equivalent for continuous random variables to >, this is also the probability X > x.

Categories