pdf estimation with scipy.stats - python

Say I compute the density of Beta(4,8):
import numpy as np
from scipy.stats import beta

rv = beta(4, 8)
x = np.linspace(start=0, stop=1, num=200)
my_pdf = rv.pdf(x)
Why does the integral of the pdf not equal one?
> my_pdf.sum()
199.00000139548044

The integral of the pdf is one. You can verify this with numerical integration from scipy:
>>> from scipy.integrate import quad
>>> quad(rv.pdf, 0, 1)
(0.9999999999999999, 1.1102230246251564e-14)
or by writing your own ad hoc integration (a trapezoidal rule in this example):
>>> x = np.linspace(start=0, stop=1, num=201)
>>> (0.5 * rv.pdf(x[0]) + rv.pdf(x[1:-1]).sum() + 0.5 * rv.pdf(x[-1])) / 200.0
1.0000000068732813

rv.pdf returns the value of the pdf at each value of x. It doesn't sum to one because you aren't actually computing an integral: a sum of pdf values only approximates the integral once you multiply by the width of each interval, i.e. divide by the number of intervals, which is len(x) - 1 = 199. That gives a result very close to 1.
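For reference, numpy's built-in trapezoidal rule does the same bookkeeping for you; a quick sanity check (not part of the original answer):
import numpy as np
from scipy.stats import beta

rv = beta(4, 8)
x = np.linspace(0, 1, 200)

# Trapezoidal rule with the spacing taken from x, so no manual division by the
# number of intervals is needed. (On newer NumPy/SciPy the same function is
# available as np.trapezoid / scipy.integrate.trapezoid.)
print(np.trapz(rv.pdf(x), x))  # approximately 1.0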


Why does numpy fft return incorrect phase information?

I am comparing the phase and amplitude spectra computed in Matlab and in numpy. Matlab seems to work correctly; numpy computes the correct amplitude spectrum, but the phase spectrum is strange. How must I change the Python code so that numpy computes the FFT correctly?
Matlab:
fs = 1e4;
dt = 1 / fs;
t = 0:dt:0.5;
F = 1e3;
y = cos(2*pi*F*t);
S = fftshift(fft(y) / length(y));
f_scale = linspace(-1, 1, length(y)) * (fs / 2);
a = abs(S);
phi = (angle(S));
subplot(2, 1, 1)
plot(f_scale, a)
title('amplitude')
subplot(2, 1, 2)
plot(f_scale, phi)
title('phase')
Python:
import numpy as np
import matplotlib.pyplot as plt
fs = 1e4
dt = 1 / fs
t = np.arange(0, 0.5, dt)
F = 1e3
y = np.cos(2*np.pi*F*t)
S = np.fft.fftshift(np.fft.fft(y) / y.shape[0])
f_scale = np.linspace(-1, 1, y.shape[0]) * (fs / 2)
a = np.abs(S)
phi = np.angle(S)
plt.subplot(2, 1, 1, title="amplitude")
plt.plot(f_scale, a)
plt.subplot(2, 1, 2, title="phase")
plt.plot(f_scale, phi)
plt.show()
[Matlab output: amplitude and phase plots]
[numpy output: amplitude and phase plots]
The problem is with np.arange: it stops one dt before reaching the stop value (the interval you pass is half-open on the right). If you define
t = np.arange(0, 0.5+dt, dt)
everything will work fine.
As pointed out in another answer, to make the Python plot match the matlab output, you have to adjust the t array to have the same values as the t array in the matlab code.
However, if your intent was to have an integer number of periods in the signal, so the FFT has just two nonzero values (at ± the input frequency), then it is the Python code that is correct. The phase in the Python code looks strange because all the Fourier coefficients except those associated with the signal's frequency are (theoretically) 0. With finite precision arithmetic, the coefficients end up being numerical "noise" with very small amplitude and essentially random phase.
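If you do want the integer-number-of-periods behaviour, one way to make the phase plot readable (a sketch of the idea, not from the original answers) is to mask out the bins whose amplitude is only numerical noise before looking at the phase:
import numpy as np

fs = 1e4
dt = 1 / fs
t = np.arange(0, 0.5, dt)              # 5000 samples, i.e. exactly 500 periods of a 1 kHz cosine
y = np.cos(2 * np.pi * 1e3 * t)

S = np.fft.fftshift(np.fft.fft(y) / y.shape[0])
a = np.abs(S)
phi = np.angle(S)

keep = a > 1e-8 * a.max()              # bins with a genuinely nonzero amplitude
phi[~keep] = 0.0                       # zero the phase where the amplitude is pure numerical noise
print(np.count_nonzero(keep))          # 2 significant bins, at +/- 1 kHz
print(phi[keep])                       # both phases are ~0, as expected for a pure cosine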

Oscillatory integral in python

I wrote the following code to plot the intensity of light exiting an optical component, which is basically a spherical Fourier integral of the incident field, so it contains a Bessel function whose argument depends on the integration variable (x) and the plotting variable (r).
from sympy import *
import matplotlib.pyplot as plt
import numpy as np
from scipy.integrate import quad
from scipy.special import jn
#constants
mm = 1
um = 1e-3 * mm
nm = 1e-6 * mm
wavelength = 532*nm
klaser = 2*np.pi / wavelength
waist = 3.2*mm
angle = 2 #degrees
focus = 125 * mm
ng = 1.5 # refractive index of axicon
upperintegration = 5
#integrals
def b(angle):
    radians = angle * np.pi / 180
    return klaser * (ng - 1) * np.tan(radians)

def delta(angle):
    return np.pi / (b(angle) * waist)

def integrand(x, r):
    return klaser/focus * waist**2 * np.exp(-x**2) * np.exp(-np.pi * 1j * x/delta(angle)) * jn(0, waist*klaser*r*x/focus) * x

def intensity1D(r):
    return np.sqrt(quad(lambda x: np.real(integrand(x, r)), 0, upperintegration)[0]**2 + quad(lambda x: np.imag(integrand(x, r)), 0, upperintegration)[0]**2)

fig = plt.figure()
ax = fig.add_subplot(111)
t = np.linspace(-3.5, 3.5, 25)
plt.plot(t, np.vectorize(intensity1D)(t))
The issue is that the plot changes drastically when I change the number of points in the linspace I use for plotting.
I suspect this may be because of the oscillatory nature of the integral, so the step-size taken can dramatically change the value of the exponent and hence of the integral.
How does quad deal with this? Are there better methods to integrate numerically for this particular application?
In the call to quad, set the limit argument to a large number. This increases the maximum number of subintervals that quad is allowed to use to estimate the integral. When I use
def intensity1D(r):
    re = quad(lambda x: np.real(integrand(x, r)), 0, upperintegration, limit=8000)[0]
    im = quad(lambda x: np.imag(integrand(x, r)), 0, upperintegration, limit=8000)[0]
    return np.sqrt(re**2 + im**2)
and compute the function with the array t defined as
t = np.linspace(1.5, 3, 1000)
I get the following plot:
(I also removed the line from sympy import *; sympy does not appear to be used in your script.)
You should always check the error estimate that is the second return value of quad.
For example:
In [14]: r = 3.0
In [15]: val, err = quad(lambda x: np.real(integrand(x, r)), 0, upperintegration, limit=8000)
In [16]: val
Out[16]: 2.975500141416676e-11
In [17]: err
Out[17]: 1.4590630152807049e-08
As you can see, the error estimate is much larger than the approximate integral. The estimates returned by quad might be conservative, but a result with such a large error estimate should still be treated with caution. Let's take a look at the corresponding imaginary part:
In [25]: val, err = quad(lambda x: np.imag(integrand(x, r)), 0, upperintegration, limit=8000)
In [26]: val
Out[26]: 0.0026492702707317257
In [27]: err
Out[27]: 1.4808416189183e-08
val is now orders of magnitude larger than the estimated error. So when the magnitude of the complex value is computed in intensity1D(), we end up with estimated relative error on the order of 1e-5. That may be sufficient for your calculation.
At the peak near r=2.1825, the magnitude of the error estimate is still small, and it is much smaller than the computed integral:
In [32]: r = 2.1825
In [33]: quad(lambda x: np.real(integrand(x, r)), 0, upperintegration, limit=8000)
Out[33]: (6.435730031424414, 8.801375195176556e-08)
In [34]: quad(lambda x: np.imag(integrand(x, r)), 0, upperintegration, limit=8000)
Out[34]: (-6.583055286038913, 9.211333259956749e-08)
There are specific methods for integration of oscillatory integrands that actually increase in accuracy as the frequency increases. Filon and Levin methods are described here:
https://www.sciencedirect.com/science/article/pii/S0377042706005929
Mathematica should use one of these if you specify LevinRule as the method in NIntegrate. If your integrand has this form, which is apparently common in optics calculations, it may even be simple enough to implement yourself in your favorite efficient numerical programming language.
I suspect that using usual quadrature for oscillatory integrands is going to be painfully slow if you want to get accurate results.
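For what it's worth, SciPy's quad also exposes QUADPACK's dedicated oscillatory routine (QAWO) through the weight='cos' / weight='sin' arguments. It only accounts for an explicit cos(wvar*x) or sin(wvar*x) factor, so it would not absorb the Bessel factor in the integrand above, but here is a minimal self-contained sketch on a model envelope with a comparable oscillation frequency (the envelope and the value of w are my own stand-ins, not the question's exact integrand):
import numpy as np
from scipy.integrate import quad

w = 660.0  # roughly the oscillation frequency pi/delta implied by the question's parameters

def envelope(x):
    # smooth, non-oscillatory part; the cos(w*x) factor is supplied via weight/wvar
    return np.exp(-x**2) * x

# QAWO handles the cos(w*x) factor analytically on each subinterval.
val, err = quad(envelope, 0, 5, weight='cos', wvar=w, limit=100)
print(val, err)

# Brute-force evaluation of the same integral for comparison; needs a large 'limit'.
val2, err2 = quad(lambda x: envelope(x) * np.cos(w * x), 0, 5, limit=8000)
print(val2, err2)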

Inverse probability density function

What do I have to use to figure out the inverse probability density function for a normal distribution? I'm using scipy to compute the probability density function of a normal distribution:
from scipy.stats import norm
norm.pdf(1000, loc=1040, scale=210)
0.0018655737107410499
How can I figure out which value(s) of x correspond to the density 0.0018 in the given normal distribution?
There can be no 1:1 mapping from probability density to quantile.
Because the normal PDF depends on x only through the quadratic term (x - mu)**2, there can be either two, one, or zero quantiles that have a particular probability density.
Update
It's actually not that hard to find the roots analytically. The PDF of a normal distribution is given by:
pd = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)**2 / (2 * sigma**2))
With a bit of rearrangement we get:
(x - mu)**2 = -2 * sigma**2 * log( pd * sigma * sqrt(2 * pi))
If the discriminant on the RHS is < 0, there are no real roots. If it equals zero, there is a single root (where x = mu), and where it is > 0 there are two roots.
To put it all together into a function:
import numpy as np

def get_quantiles(pd, mu, sigma):
    discrim = -2 * sigma**2 * np.log(pd * sigma * np.sqrt(2 * np.pi))
    # no real roots
    if discrim < 0:
        return None
    # one root, where x == mu
    elif discrim == 0:
        return mu
    # two roots
    else:
        return mu - np.sqrt(discrim), mu + np.sqrt(discrim)
This gives the desired quantile(s), to within rounding error:
from scipy.stats import norm
pd = norm.pdf(1000, loc=1040, scale=210)
print(get_quantiles(pd, 1040, 210))
# (1000.0000000000001, 1079.9999999999998)
import scipy.stats as stats
import scipy.optimize as optimize
norm = stats.norm(loc=1040, scale=210)
y = norm.pdf(1000)
print(y)
# 0.00186557371074
print(optimize.fsolve(lambda x:norm.pdf(x)-y, norm.mean()-norm.std()))
# [ 1000.]
print(optimize.fsolve(lambda x:norm.pdf(x)-y, norm.mean()+norm.std()))
# [ 1080.]
There exist distributions whose density attains a given value an infinite number of times. (For example, the density that equals 1 on an infinite sequence of intervals with lengths 1/2, 1/4, 1/8, etc. attains the value 1 infinitely often, and it is a valid density since 1/2 + 1/4 + 1/8 + ... = 1.)
So the use of fsolve above is not guaranteed to find all values of x where pdf(x) equals a certain value, but it may help you find some root.
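If you need every root on a bounded interval, a robust if brute-force alternative (a sketch, not from the original answers) is to scan a grid for sign changes of pdf(x) - y and polish each bracket with a root finder:
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

dist = norm(loc=1040, scale=210)
y = dist.pdf(1000)

# Evaluate pdf(x) - y on a grid and bracket each sign change with brentq.
grid = np.linspace(dist.mean() - 6 * dist.std(), dist.mean() + 6 * dist.std(), 2001)
diff = dist.pdf(grid) - y
roots = [brentq(lambda x: dist.pdf(x) - y, a, b)
         for a, b, da, db in zip(grid[:-1], grid[1:], diff[:-1], diff[1:])
         if da * db < 0]
print(roots)  # approximately [1000.0, 1080.0]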

python scipy.stats.powerlaw negative exponent

I want to supply a negative exponent for the scipy.stats.powerlaw routine, e.g. a=-1.5, in order to draw random samples:
"""
powerlaw.pdf(x, a) = a * x**(a-1)
"""
from scipy.stats import powerlaw
a = -1.5  # the negative exponent I want to use
R = powerlaw.rvs(a, size=100)  # scipy requires a > 0, so this does not work
Why is a > 0 required, how can I supply a negative a in order to generate the random samples, and how can I supply a normalization coefficient/transform, i.e.
PDF(x,C,a) = C * x**a
The documentation is here
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.powerlaw.html
Thanks!
EDIT: I should add that I'm trying to replicate IDL's RANDOMP function:
http://idlastro.gsfc.nasa.gov/ftp/pro/math/randomp.pro
A PDF, integrated over its domain, must equal one. In other words, the area under a probability density function's curve must equal one.
In [36]: import scipy.integrate as integrate
In [40]: y, err = integrate.quad(lambda x: 0.5*x**(-0.5), 0, 1)
In [41]: y
Out[41]: 0.9999999999999998 # The integral is close to 1
The powerlaw density function has domain 0 <= x <= 1. On this domain, the integral of x**b is finite only for b > -1. When b <= -1, x**b blows up too rapidly near x = 0, so it cannot be normalized into a valid probability density function.
In [38]: integrate.quad(lambda x: x**(-1), 0, 1)
UserWarning: The maximum number of subdivisions (50) has been achieved...
# The integral blows up
Thus for x**(a-1), a must satisfy a-1 > -1 or equivalently, a > 0.
The first constant a in a * x**(a-1) is the normalizing constant which makes the integral of a * x**(a-1) over the domain [0,1] equal to 1. So you don't get to choose this constant independent of a.
Now if you change the domain so that it is bounded away from 0, then yes, you could define a PDF of the form C * x**a for negative a. But you'd have to state what domain you want, and I don't think there is (yet) such a PDF available in scipy.stats.
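That said, scipy.stats does let you define such a distribution yourself by subclassing rv_continuous. A minimal sketch (the class name and normalization below are mine, not a scipy built-in), with support [1, 100] and exponent g = 3 chosen purely for illustration:
import numpy as np
from scipy import stats
from scipy.integrate import quad

class TruncPowerLaw(stats.rv_continuous):
    """Hypothetical pdf(x) = C * x**(-g) on [self.a, self.b]; not a built-in scipy distribution."""
    def _pdf(self, x, g):
        a, b = self.a, self.b
        C = (1.0 - g) / (b**(1.0 - g) - a**(1.0 - g))  # normalizes the pdf over [a, b]
        return C * x**(-g)

dist = TruncPowerLaw(a=1.0, b=100.0, name='truncpowerlaw')

print(quad(dist.pdf, 1.0, 100.0, args=(3.0,)))  # integrates to ~1
print(dist.rvs(3.0, size=5))                    # generic (slow) sampling via numerical inversion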
The Python package powerlaw can do this. Consider, for a > 1, a power-law distribution with probability density function
f(x) = c * x^(-a)
for x > x_min and f(x) = 0 otherwise. Here c is a normalization factor and is determined as
c = (a-1) * x_min^(a-1).
In the example below, a = 1.5 and x_min = 1.0; comparing the probability density function estimated from the random sample with the PDF from the expression above gives the expected result.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as pl
import numpy as np
import powerlaw
a, xmin = 1.5, 1.0
N = 10000
# generates random variates of power law distribution
vrs = powerlaw.Power_Law(xmin=xmin, parameters=[a]).generate_random(N)
# plotting the PDF estimated from variates
bin_min, bin_max = np.min(vrs), np.max(vrs)
bins = 10**(np.linspace(np.log10(bin_min), np.log10(bin_max), 100))
counts, edges = np.histogram(vrs, bins, density=True)
centers = (edges[1:] + edges[:-1])/2.
# plotting the expected PDF
xs = np.linspace(bin_min, bin_max, 100000)
pl.plot(xs, [(a-1)*xmin**(a-1)*x**(-a) for x in xs], color='red')
pl.plot(centers, counts, '.')
pl.xscale('log')
pl.yscale('log')
pl.savefig('powerlaw_variates.png')
[Resulting plot: the PDF estimated from the random variates (dots) matches the expected power-law PDF (red line) on log-log axes.]
If r is a uniform random deviate U(0,1), then x in the following expression is a power-law distributed random deviate:
x = xmin * (1-r) ** (-1/(alpha-1))
where xmin is the smallest (positive) value above which the power-law distribution holds, and alpha is the exponent of the distribution.
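A minimal NumPy sketch of that recipe (the function name mimics IDL's RANDOMP but is otherwise my own; with no upper cutoff the samples follow a Pareto tail above xmin):
import numpy as np

def randomp(alpha, n, xmin=1.0, rng=None):
    """Draw n samples from p(x) ~ x**(-alpha) for x >= xmin via inverse-transform sampling."""
    rng = np.random.default_rng() if rng is None else rng
    r = rng.random(n)
    return xmin * (1.0 - r) ** (-1.0 / (alpha - 1.0))

samples = randomp(2.5, 100000)
print(samples.min(), samples.mean())  # min is ~xmin; mean ~ xmin*(alpha-1)/(alpha-2) for alpha > 2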
If you want to generate a power-law distribution, you can use inverse transform sampling (the inverse method; see Wolfram): generate a uniform random number in [0, 1] and apply the inverse of the CDF. In this case, the probability density function is
p(k) = k^(-gamma)
and y is a uniform random variable between 0 and 1:
y ~ U(0,1)
import numpy as np
def power_law(k_min, k_max, y, gamma):
    return ((k_max**(-gamma+1) - k_min**(-gamma+1))*y + k_min**(-gamma+1.0))**(1.0/(-gamma + 1.0))
Now to generate a distribution, you just have to create an array
nodes = 1000
scale_free_distribution = np.zeros(nodes, float)
k_min = 1.0
k_max = 100*k_min
gamma = 3.0

for n in range(nodes):
    scale_free_distribution[n] = power_law(k_min, k_max, np.random.uniform(0, 1), gamma)
This will generate a power-law distribution with gamma = 3.0. If you want to fix the average of the distribution, you have to look at the complex-networks literature, because k_min depends on k_max and the average connectivity.
My answer is almost the same as Virgil's above, with the crucial difference that alpha is actually the negative exponent of the power-law distribution.
So, if r is a uniform random deviate U(0,1), then x in the following expression is a power-law distributed random deviate:
x = xmin * (1-r) ** (-1/(alpha-1))
where xmin is the smallest (positive) value above which the power-law distribution holds, and alpha is the negative exponent of the distribution, that is, P(x) = [constant] * x**(-alpha).

Artefacts from Riemann sum in scipy.signal.convolve

Short summary: How do I quickly calculate the finite convolution of two arrays?
Problem description
I am trying to obtain the finite convolution of two functions f(x), g(x), defined by
(f * g)(t) = integral from 0 to t of f(x) * g(t - x) dx
To achieve this, I have taken discrete samples of the functions and turned them into arrays of length steps:
xarray = [x * i / steps for i in range(steps)]
farray = [f(x) for x in xarray]
garray = [g(x) for x in xarray]
I then tried to calculate the convolution using the scipy.signal.convolve function. This function gives the same results as the algorithm conv suggested here. However, the results differ considerably from analytical solutions. Modifying the algorithm conv to use the trapezoidal rule gives the desired results.
To illustrate this, I let
f(x) = exp(-x)
g(x) = 2 * exp(-2 * x)
the results are:
Here Riemann represents a simple Riemann sum, trapezoidal is a modified version of the Riemann algorithm to use the trapezoidal rule, scipy.signal.convolve is the scipy function and analytical is the analytical convolution.
Now let g(x) = x^2 * exp(-x) and the results become:
Here 'ratio' is the ratio of the values obtained from scipy to the analytical values. The above demonstrates that the problem cannot be solved by renormalising the integral.
The question
Is it possible to use the speed of scipy but retain the better results of a trapezoidal rule or do I have to write a C extension to achieve the desired results?
An example
Just copy and paste the code below to see the problem I am encountering. The two results can be brought to closer agreement by increasing the steps variable. I believe that the problem is due to artefacts from right hand Riemann sums because the integral is overestimated when it is increasing and approaches the analytical solution again as it is decreasing.
EDIT: I have now included the original algorithm 2 as a comparison which gives the same results as the scipy.signal.convolve function.
import numpy as np
import scipy.signal as signal
import matplotlib.pyplot as plt
import math
def convolveoriginal(x, y):
    '''
    The original algorithm from http://www.physics.rutgers.edu/~masud/computing/WPark_recipes_in_python.html.
    '''
    P, Q, N = len(x), len(y), len(x) + len(y) - 1
    z = []
    for k in range(N):
        t, lower, upper = 0, max(0, k - (Q - 1)), min(P - 1, k)
        for i in range(lower, upper + 1):
            t = t + x[i] * y[k - i]
        z.append(t)
    return np.array(z) #Modified to include conversion to numpy array

def convolve(y1, y2, dx = None):
    '''
    Compute the finite convolution of two signals of equal length.
    #param y1: First signal.
    #param y2: Second signal.
    #param dx: [optional] Integration step width.
    #note: Based on the algorithm at http://www.physics.rutgers.edu/~masud/computing/WPark_recipes_in_python.html.
    '''
    P = len(y1) #Determine the length of the signal
    z = [] #Create a list of convolution values
    for k in range(P):
        t = 0
        lower = max(0, k - (P - 1))
        upper = min(P - 1, k)
        for i in range(lower, upper):
            t += (y1[i] * y2[k - i] + y1[i + 1] * y2[k - (i + 1)]) / 2
        z.append(t)
    z = np.array(z) #Convert to a numpy array
    if dx is not None: #Is a step width specified?
        z *= dx
    return z
steps = 50 #Number of integration steps
maxtime = 5 #Maximum time
dt = float(maxtime) / steps #Obtain the width of a time step
time = [dt * i for i in range (steps)] #Create an array of times
exp1 = [math.exp(-t) for t in time] #Create an array of function values
exp2 = [2 * math.exp(-2 * t) for t in time]
#Calculate the analytical expression
analytical = [2 * math.exp(-2 * t) * (-1 + math.exp(t)) for t in time]
#Calculate the trapezoidal convolution
trapezoidal = convolve(exp1, exp2, dt)
#Calculate the scipy convolution
sci = signal.convolve(exp1, exp2, mode = 'full')
#Slice the first half to obtain the causal convolution and multiply by dt
#to account for the step width
sci = sci[0:steps] * dt
#Calculate the convolution using the original Riemann sum algorithm
riemann = convolveoriginal(exp1, exp2)
riemann = riemann[0:steps] * dt
#Plot
plt.plot(time, analytical, label = 'analytical')
plt.plot(time, trapezoidal, 'o', label = 'trapezoidal')
plt.plot(time, riemann, 'o', label = 'Riemann')
plt.plot(time, sci, '.', label = 'scipy.signal.convolve')
plt.legend()
plt.show()
Thank you for your time!
Or, for those who prefer numpy to C: it will be slower than the C implementation (see the next answer), but it's just a few lines.
>>> t = np.linspace(0, maxtime-dt, 50)
>>> fx = np.exp(-np.array(t))
>>> gx = 2*np.exp(-2*np.array(t))
>>> analytical = 2 * np.exp(-2 * t) * (-1 + np.exp(t))
This looks like the trapezoidal rule in this case (but I didn't check the math):
>>> s2a = signal.convolve(fx[1:], gx, 'full')*dt
>>> s2b = signal.convolve(fx, gx[1:], 'full')*dt
>>> s = (s2a+s2b)/2
>>> s[:10]
array([ 0.17235682, 0.29706872, 0.38433313, 0.44235042, 0.47770012,
0.49564748, 0.50039326, 0.49527721, 0.48294359, 0.46547582])
>>> analytical[:10]
array([ 0. , 0.17221333, 0.29682141, 0.38401317, 0.44198216,
0.47730244, 0.49523485, 0.49997668, 0.49486489, 0.48254154])
largest absolute error:
>>> np.max(np.abs(s[:len(analytical)-1] - analytical[1:]))
0.00041657780840698155
>>> np.argmax(np.abs(s[:len(analytical)-1] - analytical[1:]))
6
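An equivalent way to apply the same kind of correction (my own rephrasing, along the same lines as the code above): the full discrete convolution is a one-sided Riemann sum, and subtracting half of the two endpoint terms of each partial sum turns it into the trapezoidal rule, while still letting scipy.signal.convolve do the heavy lifting:
import numpy as np
import scipy.signal as signal

steps, maxtime = 50, 5.0
dt = maxtime / steps
t = dt * np.arange(steps)

f = np.exp(-t)
g = 2.0 * np.exp(-2.0 * t)
analytical = 2.0 * np.exp(-2.0 * t) * (np.exp(t) - 1.0)

# signal.convolve gives the plain Riemann-sum partial sums; removing half of the
# two endpoint terms of each partial sum converts them to trapezoidal estimates.
riemann = signal.convolve(f, g, mode='full')[:steps]
trapezoid = dt * (riemann - 0.5 * (f[0] * g[:steps] + f[:steps] * g[0]))

print(np.max(np.abs(dt * riemann - analytical)))   # error of the uncorrected scipy result
print(np.max(np.abs(trapezoid - analytical)))      # noticeably smaller error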
Short answer: Write it in C!
Long answer
Using the cookbook about numpy arrays I rewrote the trapezoidal convolution method in C. In order to use the C code one requires three files (https://gist.github.com/1626919)
The C code (performancemodule.c).
The setup file to build the code and make it callable from python (performancemodulesetup.py).
The python file that makes use of the C extension (performancetest.py)
After downloading, the code should run if you do the following:
Adjust the include path in performancemodule.c.
Run the following
python performancemodulesetup.py build
python performancetest.py
You may have to copy the library file performancemodule.so or performancemodule.dll into the same directory as performancetest.py.
Results and performance
The results agree neatly with one another as shown below:
The performance of the C method is even better than scipy's convolve method. Running 10k convolutions with array length 50 requires
convolve (seconds, microseconds) 81 349969
scipy.signal.convolve (seconds, microseconds) 1 962599
convolve in C (seconds, microseconds) 0 87024
Thus, the C implementation is about 1000 times faster than the python implementation and a bit more than 20 times as fast as the scipy implementation (admittedly, the scipy implementation is more versatile).
EDIT: This does not solve the original question exactly but is sufficient for my purposes.
