How to compute percentiles from a normal distribution in Python?

Problem Statement - A random variable X is N(25, 4). Find the indicated percentile for X:
a. The 10th percentile
b. The 90th percentile
c. The 80th percentile
d. The 50th percentile
Attempt 1
My code:
import numpy as np
import math
import scipy.stats
mu = 25
sigma = 4
a = mu - (1.282 * sigma)   # 10th percentile
b = mu + (1.282 * sigma)   # 90th percentile
... and so on. I took the z-values from the z-score table given at
https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_probability/bs704_probability10.html
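A minimal sketch of the same idea, using the exact z-value from scipy.stats.norm.ppf instead of the rounded 1.282 from the table (which avoids rounding error):
import scipy.stats as sps
mu, sigma = 25, 4
z90 = sps.norm.ppf(0.90)      # exact z for the 90th percentile, ~1.2816
print(mu - z90 * sigma)       # 10th percentile, ~19.87
print(mu + z90 * sigma)       # 90th percentile, ~30.13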
Attempt 2
X = np.random.normal(25, 4, 10000)  # sample size not given in the problem; I just assumed one
a_9 = np.percentile(X, 10)
b_9 = np.percentile(X, 90)
c_9 = np.percentile(X, 80)
d_9 = np.percentile(X, 50)
But the answers are incorrect as per the hidden test cases of the practice platform. Can anyone please tell me the right way to compute the answers? Is there any scipy.stats function for this?

You can use scipy.stats and its built-in ppf method (see the documentation):
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt
mu = 25
sigma = 4
# define the normal distribution and PDF
dist = sps.norm(loc=mu, scale=sigma)
x = np.linspace(dist.ppf(.001), dist.ppf(.999))
y = dist.pdf(x)
# calculate PPFs
ppfs = {}
for ppf in [.1, .5, .8, .9]:
    p = dist.ppf(ppf)
    ppfs.update({ppf*100: p})
# plot results
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(x, y, color='k')
for i, ppf in enumerate(ppfs):
    ax.axvline(ppfs[ppf], color=f'C{i}', label=f'{ppf:.0f}th: {ppfs[ppf]:.1f}')
ax.legend()
plt.show()
That gives a plot of the pdf with vertical lines marking the 10th, 50th, 80th, and 90th percentiles.

Use the ppf method from scipy.stats.norm (normal distribution).
scipy.stats.norm.ppf(0.1, loc=25, scale=4)
This function is analogous to the qnorm function in R: ppf (the percent point function, i.e. the inverse CDF) returns the value of the random variable at the given percentile.
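A minimal sketch applying that to all four requested percentiles (the values in the comments are approximate):
from scipy.stats import norm
mu, sigma = 25, 4
print(norm.ppf(0.10, loc=mu, scale=sigma))  # ~19.87
print(norm.ppf(0.90, loc=mu, scale=sigma))  # ~30.13
print(norm.ppf(0.80, loc=mu, scale=sigma))  # ~28.37
print(norm.ppf(0.50, loc=mu, scale=sigma))  # 25.0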

With a much larger sample, the empirical percentiles from np.percentile converge to the exact answers, e.g.
X = np.random.normal(25, 4, 10000000)
gives approximately
a_9 = 19.88
b_9 = 30.12
c_9 = 28.36
d_9 = 25.00
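A minimal sketch of that convergence (same mu and sigma as above; variable names are mine), comparing the sample percentile with the exact ppf value as the sample grows:
import numpy as np
from scipy.stats import norm
rng = np.random.default_rng(0)
exact = norm.ppf(0.10, loc=25, scale=4)   # exact 10th percentile, ~19.87
for n in (10000, 10000000):
    sample = rng.normal(25, 4, n)
    print(n, np.percentile(sample, 10), exact)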

Related

Kernel Density Estimation using scipy's gaussian_kde and sklearn's KernelDensity leads to different results

I created some data from two superposed normal distributions and then applied sklearn.neighbors.KernelDensity and scipy.stats.gaussian_kde to estimate the density function. However, using the same bandwidth (1.0) and the same kernel, the two methods produce different outcomes. Can someone explain the reason for this to me? Thanks for the help.
Below you can find the code to reproduce the issue:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity
n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))
kde_sk = KernelDensity(bandwidth=1.0, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))
kde_sp = gaussian_kde(x, bw_method=1.0)
y_sp = kde_sp.pdf(eval_points)
sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
If I change the scipy bandwidth to 0.25, the results of the two methods look approximately the same.
What is meant by "bandwidth" in scipy.stats.gaussian_kde and sklearn.neighbors.KernelDensity is not the same. scipy.stats.gaussian_kde uses a bandwidth factor (see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html). For 1-D kernel density estimation the relationship is:
bandwidth of sklearn.neighbors.KernelDensity = bandwidth factor of scipy.stats.gaussian_kde * standard deviation of the sample
For your data this means the sample standard deviation is roughly 4, which is why a scipy factor of 0.25 behaves like a sklearn bandwidth of 1.0.
I would like to refer to "Getting bandwidth used by SciPy's gaussian_kde function" for more information.
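A minimal sketch of that conversion (the setup and names here are mine, not from the answer): divide the absolute bandwidth you want by the sample standard deviation before passing it to bw_method, and the two estimates should coincide.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity
x = np.concatenate((np.random.normal(-5, 2, 1000),
                    np.random.normal(5, 3, 9000)))
eval_points = np.linspace(np.min(x), np.max(x), 200)
h = 1.0  # absolute (sklearn-style) bandwidth both estimators should use
kde_sk = KernelDensity(bandwidth=h, kernel='gaussian').fit(x.reshape(-1, 1))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1, 1)))
kde_sp = gaussian_kde(x, bw_method=h / x.std(ddof=1))  # scipy factor = bandwidth / sample std
y_sp = kde_sp.pdf(eval_points)
print(np.max(np.abs(y_sk - y_sp)))  # should be close to zero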
To be honest, I don't know why, but using the scipy keyword bw_method='scott' makes it work exactly the same as seaborn.
So it seems to be all about the hyperparameters. We could find out why by understanding them in depth, but in the meantime just use 'scott' or 'silverman' instead of a hand-picked scalar.
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
import seaborn as sns
from sklearn.neighbors import KernelDensity
n = 10000
dist_frac = 0.1
x1 = np.random.normal(-5,2,int(n*dist_frac))
x2 = np.random.normal(5,3,int(n*(1-dist_frac)))
x = np.concatenate((x1,x2))
np.random.shuffle(x)
eval_points = np.linspace(np.min(x), np.max(x))
kde_sk = KernelDensity(bandwidth=1, kernel='gaussian')
kde_sk.fit(x.reshape([-1,1]))
y_sk = np.exp(kde_sk.score_samples(eval_points.reshape(-1,1)))
kde_sp = gaussian_kde(x, bw_method='scott') ### I MEAN HERE! ###
y_sp = kde_sp.pdf(eval_points)
sns.kdeplot(x)
plt.plot(eval_points, y_sk)
plt.plot(eval_points, y_sp)
plt.legend(['seaborn','scikit','scipy'])
Increase the size of the 'random normal' sample; your data points are too few.
Try with n = 500000 and check the results.

Write a random number generator that, based on uniformly distributed numbers between 0 and 1, samples from a Lévy-distribution?

I'm completely new to Python. Could someone show me how I can write a random number generator that samples from the Lévy distribution? I've written the function for the distribution, but I'm confused about how to proceed further!
I want to use the random numbers generated from this distribution to simulate a 2D random walk.
I'm aware that from scipy.stats I can use the Levy class, but I want to write the sampler myself.
import numpy as np
import matplotlib.pyplot as plt
# Levy distribution
"""
f(x) = 1/(2*pi*x^3)^(1/2) * exp(-1/(2x))
"""
def levy(x):
    return 1 / np.sqrt(2*np.pi*x**3) * np.exp(-1/(2*x))
N = 50
foo = levy(N)
@pjs's code looks OK to me, but there is a discrepancy between his code and what SciPy thinks about Lévy - basically, the sampling does not match the PDF.
Code, Python 3.8 Windows 10 x64
import numpy as np
from scipy.stats import levy
from scipy.stats import norm
import matplotlib.pyplot as plt
rng = np.random.default_rng(312345)
# Arguments
# u: a uniform[0,1) random number
# c: scale parameter for Levy distribution (defaults to 1)
# mu: location parameter (offset) for Levy (defaults to 0)
def my_levy(u, c = 1.0, mu = 0.0):
    return mu + c / (2.0 * (norm.ppf(1.0 - u))**2)
fig, ax = plt.subplots()
rnge=(0, 20.0)
x = np.linspace(rnge[0], rnge[1], 1001)
N = 200000
q = np.empty(N)
for k in range(0, N):
    u = rng.random()
    q[k] = my_levy(u)
nrm = levy.cdf(rnge[1])
ax.plot(x, levy.pdf(x)/nrm, 'r-', lw=5, alpha=0.6, label='levy pdf')
ax.hist(q, bins=100, range=rnge, density=True, alpha=0.2)
plt.show()
This produces a graph of the sample histogram against the normalized scipy levy pdf.
UPDATE
Well, I tried using a home-made PDF: same output, same problem.
# replace levy.pdf(x) with PDF(x)
def PDF(x):
    return np.where(x <= 0.0, 0.0, 1.0 / np.sqrt(2*np.pi*x**3) * np.exp(-1./(2.*x)))
UPDATE II
After applying @pjs's corrected sampling routine, the samples and the PDF are aligned perfectly (new graph).
Here's a straightforward implementation of the generating algorithm for the Levy distribution found on Wikipedia:
import random
from scipy.stats import norm
# Arguments
# u: a uniform[0,1) random number
# c: scale parameter for Levy distribution (defaults to 1)
# mu: location parameter (offset) for Levy (defaults to 0)
def my_levy(u, c = 1.0, mu = 0.0):
    return mu + c / (2 * norm.ppf(1.0 - u)**2)
# Generate a handful of samples
for _ in range(10):
    print(my_levy(random.random()))
I don't normally use Python, so please suggest improvements.
ADDENDUM
Kudos to Severin Pappadeux for the work in his answer. I had already noted that a simpler approach would be to take the inverse of a squared Gaussian, but Advaita had asked for an explicit function of U ~ Uniform(0,1), so I didn't pursue that. It turns out that I should have. The Wikipedia article mentions that, but without the scale factor of 2 in the denominator. When I take the 2 out of the implementation of Wikipedia's generating algorithm, i.e. change the implementation to
def my_levy(u, c = 1.0, mu = 0.0):
    return mu + c / (norm.ppf(1.0 - u)**2)
the resulting histogram aligns beautifully with the normalized plot of the pdf. (Note - I've now also edited the incorrect Wikipedia entry to correct the formula.)
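A quick sanity check, as a sketch (not part of the original answer): a Kolmogorov-Smirnov test against scipy's reference Lévy CDF should report that samples from the corrected routine are consistent with the distribution.
import numpy as np
from scipy.stats import levy, norm, kstest
rng = np.random.default_rng(42)
u = rng.random(100000)
samples = 1.0 / norm.ppf(1.0 - u)**2   # corrected my_levy with c = 1, mu = 0
# a large p-value means the samples are consistent with the standard Levy distribution
print(kstest(samples, levy.cdf))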

Why are the random samples drawn from my custom distribution not following the pdf?

I have created a custom distribution using scipy's rv_continuous class. I am trying to create the energy distribution of an electron produced by beta decay. Given its pdf:
N(KE) = C * sqrt(KE^2 + 2*KE*m_e*c^2) * (Q - KE)^2 * (KE + m_e*c^2) * F(Z', KE_e)
which I took from: http://hyperphysics.phy-astr.gsu.edu/hbase/Nuclear/beta2.html#c1
I define my distribution:
import numpy as np
from scipy.stats import rv_continuous
import matplotlib.pyplot as plt
class beta_decay(rv_continuous):
    def _pdf(self, x):
        return 22.48949986*np.sqrt(x**2 + 2*x*0.511)*((0.6-x)**2)*(x+0.511)
# create distribution from 0 --> Q value = 0.6
beta = beta_decay(a=0, b= 0.6)
# plot pdf
x = np.linspace(0,0.6)
plt.plot(x, beta.pdf(x))
plt.show()
# random sample the distribution and plot histogram
random = beta.rvs(size=100)
plt.hist(random)
plt.show()
Here x = KE, Q = 0.6, and C = 22.48... (found by integrating the above expression from 0 to Q and setting the result equal to 1 to normalize); I disregard the Fermi function F(Z', KE_e) in the equation above.
When I graph the pdf, it looks right:
However, when I try to draw random samples from it using .rvs(), the values they take are massively peaked towards the right-hand side, not under the peak of the pdf as I'd expect:
Ultimately, my code needs to sample the distribution to get the KE of an electron released by beta decay. Why is my histogram so wrong?
I think your PDF is defined in the wrong way: it is not normalized. After I normalized it and made a proper (density) histogram, it seems to work fine.
Code (Win10 x64, Anaconda Python 3.7)
#%%
import numpy as np
import matplotlib.pyplot as plt
import scipy.integrate as integrate
from scipy.stats import rv_continuous
def bd(x):
    return 22.48949986*np.sqrt(x**2 + 2*x*0.511)*((0.6-x)**2)*(x+0.511)
a = 0.0
b = 0.6
norm = integrate.quad(bd, a, b) # normalization integral
print(norm)
class beta_decay(rv_continuous):
    def _pdf(self, x):
        return bd(x) / norm[0]
# create Q distribution in the [0...0.6] interval
beta = beta_decay(a = a, b = b)
# plot pdf
x = np.linspace(a, b)
plt.plot(x, beta.pdf(x))
plt.show()
# sample from pdf
r = beta.rvs(size = 10000)
plt.hist(r, range=(a, b), density=True)
plt.show()
And the resulting plots of the pdf and of the sampling histogram confirm this.

Generating synthetic data with Gaussian distribution

Problem
In a paper I am reading now, the authors define a new metric and claim some advantages over previous metrics. They verify their claim with some synthetic data, which looks like the following (figure from the paper).
The implementation of their metric is pretty straightforward. However, I am not sure how they create this kind of synthetic data.
What I Have Done
This looks like a Gaussian where x is restricted to certain intervals. I tried the following code but did not get anything similar to the plot presented in the paper.
import numpy as np
def generate_gaussian(size=1000, lb=-0.1, up=0.1):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data
np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 0.3
background_neg = base + 0.7
Now I am wondering whether the authors created these data using some special distribution (other than a Gaussian) that I do not know about.
NumPy has numpy.random.normal, which draws random samples from a normal (Gaussian) distribution.
import numpy as np
import matplotlib.pyplot as plt
sigma = 0.05
s0 = np.random.normal(0.2, sigma, 5000)
s1 = np.random.normal(0.6, sigma, 5000)
plt.hist(s0, 300, density=True, color="b")
plt.hist(s1, 300, density=True, color="r")
plt.xlim(0, 1)
plt.show()
You can change the values of mu (mean) and sigma (standard deviation) to alter the distributions:
mu = 0.55
sigma = 0.1
dist = np.random.normal(mu, sigma, 5000)
You have cut off the data at +/- 0.1. A normalised Gaussian distribution only 'looks Gaussian' if you look over a range of approximately +/- 3. Try this:
import numpy as np
def generate_gaussian(size=1000, lb=-3, up=3):
    data = np.random.randn(5000)
    data = data[(data <= up) & (data >= lb)][:size]
    return data
np.random.seed(1234)
base = generate_gaussian()
background_pos = base + 5
background_neg = base + 15
You can use scipy.stats.norm (see its documentation).
import libraries
>>> from scipy.stats import norm
>>> from matplotlib import pyplot
plot
>>> pyplot.hist(norm.rvs(loc=1, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_1')
>>> pyplot.hist(norm.rvs(loc=5, scale=0.5, size=10000), bins=30, alpha=0.5, label='norm_2')
>>> pyplot.legend()
>>> pyplot.show()
Clarification:
A normal distribution is defined by its mean (loc, the distribution center) and standard deviation (scale, a measure of the distribution's dispersion or width). rvs generates random samples from the desired normal distribution of size size. For example, the next snippet generates 4 random elements of a normal distribution (mean = 1, SD = 1).
>>> norm.rvs(loc=1, scale=1, size=4)
array([ 0.52154255, 1.40873701, 1.55959291, -0.01730568])

Is there any better solution for fitting a beta prime distribution to data than using SciPy?

I was trying to fit a beta prime distribution to my data using Python. Since there's scipy.stats.betaprime.fit, I tried this:
import numpy as np
import math
import scipy.stats as sts
import matplotlib.pyplot as plt
N = 5000
nb_bin = 100
a = 12; b = 106; scale = 36; loc = -a/(b-1)*scale
y = sts.betaprime.rvs(a,b,loc,scale,N)
a_hat,b_hat,loc_hat,scale_hat = sts.betaprime.fit(y)
print('Estimated parameters: \n a=%.2f, b=%.2f, loc=%.2f, scale=%.2f'%(a_hat,b_hat,loc_hat,scale_hat))
plt.figure()
count, bins, ignored = plt.hist(y, nb_bin, density=True)
pdf_ini = sts.betaprime.pdf(bins,a,b,loc,scale)
pdf_est = sts.betaprime.pdf(bins,a_hat,b_hat,loc_hat,scale_hat)
plt.plot(bins,pdf_ini,'g',linewidth=2.0,label='ini');plt.grid()
plt.plot(bins,pdf_est,'y',linewidth=2.0,label='est');plt.legend();plt.show()
It shows me this result:
Estimated parameters:
a=9935.34, b=10846.64, loc=-90.63, scale=98.93
which is quite different from the original one and the figure from the PDF:
If I give the real values of loc and scale as inputs to the fit function, the estimate is much better. Has anyone already worked on this, or does anyone have a better solution?
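One common remedy, as a sketch rather than a definitive answer: the generic scipy fit method accepts floc and fscale keyword arguments that hold loc and scale fixed during the fit, which usually stabilizes the estimates of the shape parameters a and b when loc and scale are known or can be estimated separately.
import scipy.stats as sts
# data generated as in the question
a, b, scale = 12, 106, 36
loc = -a/(b-1)*scale
y = sts.betaprime.rvs(a, b, loc, scale, 5000)
# fix loc and scale at known (or separately estimated) values; only a and b are fitted
a_hat, b_hat, loc_hat, scale_hat = sts.betaprime.fit(y, floc=loc, fscale=scale)
print(a_hat, b_hat, loc_hat, scale_hat)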
