1D Wasserstein distance in Python

The formula below is a special case of the Wasserstein distance/optimal transport when the source and target distributions, x and y (also called marginal distributions), are 1D, that is, are vectors:

W_p(u, v) = ( integral_0^1 |F_u^{-1}(q) - F_v^{-1}(q)|^p dq )^(1/p)

where the F^{-1} are the inverse cumulative distribution functions (quantile functions) of the marginals u and v, derived from real data called x and y, both generated from the normal distribution:
import numpy as np
from numpy.random import randn
import scipy.stats as ss
n = 100
x = randn(n)
y = randn(n)
How can the integral in the formula be coded in python and scipy? I'm guessing the x and y have to be converted to ranked marginals, which are non-negative and sum to 1, while Scipy's ppf could be used to calculate the inverse F^{-1}'s?

Note that as n gets large, a sorted set of n samples approaches the inverse CDF sampled at 1/n, 2/n, ..., n/n. E.g.:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
plt.plot(norm.ppf(np.linspace(0, 1, 1002)[1:-1]), label="invcdf")  # drop the endpoints: ppf(0) = -inf, ppf(1) = +inf
plt.plot(np.sort(np.random.normal(size=1000)), label="sortsample")
plt.legend()
plt.show()
Also note that your integral from 0 to 1 can be approximated as a sum over 1/n, 2/n, ..., n/n.
Thus we can simply answer your question:
def W(p, u, v):
    assert len(u) == len(v)
    return np.mean(np.abs(np.sort(u) - np.sort(v))**p)**(1/p)
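For p = 1 this is just the mean absolute difference of the sorted samples, which is exactly what scipy.stats.wasserstein_distance computes, so it makes a convenient cross-check (my addition, not part of the original answer):
import numpy as np
from numpy.random import randn
from scipy.stats import wasserstein_distance

u, v = randn(100), randn(100)
print(W(1, u, v))                  # quantile-based estimate of W_1
print(wasserstein_distance(u, v))  # scipy's 1-Wasserstein distance; the two should agree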
Note that if len(u) != len(v) you can still apply the method with linear interpolation:
def W(p, u, v):
    u = np.sort(u)
    v = np.sort(v)
    if len(u) != len(v):
        if len(u) > len(v):
            u, v = v, u
        us = np.linspace(0, 1, len(u))
        vs = np.linspace(0, 1, len(v))
        v = np.interp(us, vs, v)  # resample the longer sorted sample onto the shorter quantile grid
    return np.mean(np.abs(u - v)**p)**(1/p)
An alternative method if you have prior information about the sort of distribution of your data, but not its parameters, is to find the best fitting distribution on your data (e.g. with scipy.stats.norm.fit) for both u and v and then do the integral with the desired precision. E.g.:
from scipy.stats import norm as gauss
def W_gauss(p, u, v, num_steps):
    ud = gauss(*gauss.fit(u))
    vd = gauss(*gauss.fit(v))
    z = np.linspace(0, 1, num_steps, endpoint=False) + 1/(2*num_steps)
    return np.mean(np.abs(ud.ppf(z) - vd.ppf(z))**p)**(1/p)
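A quick usage sketch with the x and y defined in the question (my addition):
print(W(2, x, y))              # empirical quantile-based estimate of W_2
print(W_gauss(2, x, y, 1000))  # parametric estimate from fitted normal distributions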

I guess I am a bit late, but this is what I would do for an exact solution (using only numpy):
import numpy as np
from numpy.random import randn
n = 100
m = 80
p = 2
x = np.sort(randn(n))
y = np.sort(randn(m))
a = np.ones(n)/n
b = np.ones(m)/m
# cdfs
ca = np.cumsum(a)
cb = np.cumsum(b)
# points on which we need to evaluate the quantile functions
cba = np.sort(np.hstack([ca, cb]))
# weights for integral
h = np.diff(np.hstack([0, cba]))
# construction of first quantile function
bins = ca + 1e-10 # small tolerance to avoid rounding errors and enforce right continuity
index_qx = np.digitize(cba, bins, right=True) # right=True because the quantile function is
                                              # right continuous
qx = x[index_qx] # quantile function F^{-1}
# construction of second quantile function
bins = cb + 1e-10
index_qy = np.digitize(cba, bins, right=True) # right=True because the quantile function is
                                              # right continuous
qy = y[index_qy] # quantile function G^{-1}
ot_cost = np.sum(np.abs(qx - qy)**p * h) # abs keeps the cost well defined for odd p
print(ot_cost)
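As a quick sanity check (my addition, not part of the original answer): ot_cost is W_p raised to the p-th power, so rerunning the snippet above with p = 1 should reproduce scipy.stats.wasserstein_distance, which accepts unequal sample sizes and explicit weights:
from scipy.stats import wasserstein_distance
print(wasserstein_distance(x, y))        # equals ot_cost when p = 1
print(wasserstein_distance(x, y, a, b))  # same thing with the uniform weights made explicit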
In case you are interested, here you can find a more detailed numpy based implementation of the ot problem on the real line with dual and primal solutions as well: https://github.com/gnies/1d-optimal-transport. (I am still working on it though).


How to add a phase shift to a sin wave in the frequency domain with fft?

I want to shift a sine wave in the frequency domain
My idea is the following:
Fourier-Transform
Add a phase shift of pi in frequency domain
Inverse-Fourier-Transform
In code:
import numpy as np
import matplotlib.pyplot as plt

A = 1.0  # amplitude (not defined in the original snippet; assumed here)
t = np.arange(0, 6, 0.001)
values = A*np.sin(t)
ft_values = np.fft.fft(values)
ft_values_phase = ft_values + 1j*np.pi
back_again = np.fft.ifft(ft_values_phase)
plt.subplot(211)
plt.plot(t, values)
plt.subplot(212)
plt.plot(t, back_again)
plt.show()
I expected two plots in which one wave is shifted by pi relative to the other; however, the result shows no phase shift:
Thank you for any help!
You did not make a phase shift.
What you did was to add a 6000-vector, say P, with constant value P(i) = j π to V, the FFT of v.
Let's write Ṽ = V + P.
Due to linearity of the FFT (and of IFFT), what you have called back_again is
        ṽ = IFFT(Ṽ) = IFFT(V) + IFFT(P) = v + p
where, of course, p = IFFT(P) is the difference back_again - values. Now, let's check what p is...
In [51]: P = np.pi*1j*np.ones(6000)
...: p = np.fft.ifft(P)
...: plt.plot(p.real*10**16, label='real(p)*10**16')
...: plt.plot(p.imag, label='imag(p)')
...: plt.legend();
As you can see, you modified values by adding a real part that is essentially numerical noise from the IFFT computation (hence no visible change in the plot, which shows the real part of back_again) and a single imaginary spike at t=0 whose height is, unsurprisingly, equal to π.
The transform of a constant is a spike at ω=0, the antitransform of a constant (in frequency domain) is a spike at t=0.
On the other hand, if you multiply each FFT term by a constant, you also multiply the time domain signal by the same constant (remember, FFT and IFFT are linear).
To do what you want, you have to remember that a shift in the time domain is just the (circular) convolution of the (periodic) signal with a time-shifted spike, so you have to multiply the FFT of the signal by the FFT of the shifted spike.
Because the Fourier Transform of a Dirac Distribution δ(t-a) is exp(-iωa) you have to multiply each term of the FFT of the signal by a frequency dependent term, exp(-iωa)=cos(ωa)-i·sin(ωa) (Note: of course each one of these multiplicative terms has unit amplitude).
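A minimal sketch of the same idea (my addition; names are illustrative): instead of transforming a shifted spike, you can build the exp(-iωa) factor directly from np.fft.fftfreq and multiply the FFT by it.
import numpy as np

N = 1000
T = 2*np.pi                                   # total duration of the sampled window
t = np.arange(N)*T/N                          # uniform grid, endpoint excluded
v0 = np.sin(t)

a = np.pi/4                                   # desired time shift
omega = 2*np.pi*np.fft.fftfreq(N, d=T/N)      # angular frequency of each FFT bin
vs = np.fft.ifft(np.fft.fft(v0)*np.exp(-1j*omega*a)).real

print(np.allclose(vs, np.sin(t - a)))         # True: vs is sin(t) delayed by pi/4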
An Example
Some preliminaries
In [61]: import matplotlib.pyplot as plt
...: import numpy as np
In [62]: def multiple_formatter(x, pos, den=60, number=np.pi, latex=r'\pi'):
... # search on SO for an implementation
In [63]: def plot(t, x):
    ...:     fig, ax = plt.subplots()
    ...:     ax.plot(t, x)
    ...:     ax.xaxis.set_major_formatter(plt.FuncFormatter(multiple_formatter))
    ...:     ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
    ...:     ax.xaxis.set_minor_locator(plt.MultipleLocator(np.pi / 4))
    ...:     return fig, ax
A function to compute the discrete FT of a Dirac Distribution centered in n for a period N
In [64]: def shift(n, N):
...: s = np.zeros(N)
...: s[n] = 1.0
...: return np.fft.fft(s)
Let's plot a signal and the shifted signal
In [65]: t = np.arange(4096)*np.pi/1024
In [66]: v0 = np.sin(t)
In [67]: v1 = np.sin(t-np.pi/4)
In [68]: f, a = plot(t, v0)
In [69]: a.plot(t, v1, label='shifted by $\\pi/4$');
In [70]: a.legend();
Now compute the FFT of the correct spike (note that π/4 = (4π)/16), the FFT of the shifted signal, the IFFT of the FFT of the s.s. and finally plot our results
In [71]: S = shift(4096//16-1, 4096)
In [72]: VS = np.fft.fft(v0)*S
In [73]: vs = np.fft.ifft(VS)
In [74]: f, ay = plot(t, v0)
In [75]: ay.plot(t, vs.real, label='shifted in frequency domain');
In [76]: ay.legend();
Nice, that helped!
For anyone who wants to do the same, here it is in one Python file:
import numpy as np
from matplotlib.pyplot import plot, legend, show

def shift(n, N):
    s = np.zeros(N)
    s[n] = 1.0
    return np.fft.fft(s)

t = np.linspace(0, 2*np.pi, 1000)
v0 = np.sin(t)
S = shift(1000//4, 1000)  # shift by a quarter of the 2*pi window, i.e. pi/2
VS = np.fft.fft(v0)*S
vs = np.fft.ifft(VS)
plot(t, v0, label='original')
plot(t, vs.real, label='shifted in frequency domain')
legend()
show()

Generating 3D Gaussian distribution in Python

I want to generate a Gaussian distribution in Python with the x and y dimensions denoting position and the z dimension denoting the magnitude of a certain quantity.
The distribution has a maximum value of 2e6 and a standard deviation sigma=0.025.
In MATLAB I can do this with:
x1 = linspace(-1,1,30);
x2 = linspace(-1,1,30);
mu = [0,0];
Sigma = [.025,.025];
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = 314159.153*reshape(F,length(x2),length(x1));
surf(x1,x2,F);
In Python, what I have so far is:
x = np.linspace(-1,1,30)
y = np.linspace(-1,1,30)
mu = (np.median(x),np.median(y))
sigma = (.025,.025)
There is a NumPy function, numpy.random.multivariate_normal, which can supposedly do the same as MATLAB's mvnpdf, but I am struggling to understand the documentation, especially how to obtain the covariance matrix needed by numpy.random.multivariate_normal.
As of scipy 0.14, you can use scipy.stats.multivariate_normal.pdf()
import numpy as np
from scipy.stats import multivariate_normal
x, y = np.mgrid[-1.0:1.0:30j, -1.0:1.0:30j]
# Need an (N, 2) array of (x, y) pairs.
xy = np.column_stack([x.flat, y.flat])
mu = np.array([0.0, 0.0])
sigma = np.array([.025, .025])
covariance = np.diag(sigma**2)
z = multivariate_normal.pdf(xy, mean=mu, cov=covariance)
# Reshape back to a (30, 30) grid.
z = z.reshape(x.shape)
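If you also want the 2e6 peak from the question (the normalized PDF above peaks at 1/(2*pi*sigma_x*sigma_y), roughly 255), one simple option (my addition) is to rescale the evaluated grid, which turns it into an unnormalized Gaussian surface rather than a PDF:
import matplotlib.pyplot as plt

z_scaled = 2e6 * z / z.max()   # peak is now exactly 2e6
plt.contourf(x, y, z_scaled)
plt.colorbar()
plt.show()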
I am working on a scikit called scikit-guess that contains some fast estimation routines for non-linear fits. It has a function skg.ngauss.model (also accessible as skg.ngauss_fit.model or skg.ngauss.ngauss_fit.model) which does exactly what you want. The nice thing is that it's not a PDF, so you set the amplitude out of the box:
import numpy as np
import skg.ngauss
a = 2e6
mu = 0, 0
sigma = 0.025, 0.025
x = y = np.linspace(-1, 1, 31)
cov = np.diag(sigma)**2
X = np.meshgrid(x, y)
data = skg.ngauss.model(X, a, mu, cov, axis=0)
You need to tell it axis=0 because it automatically stacks your arrays for you. To avoid passing in that argument, you could write
X = np.stack(np.meshgrid(x, y), axis=-1)
You can plot the result:
from matplotlib import pyplot as plt
plt.imshow(data)
plt.show()
This is not a very exciting distribution because the spread is so small that you end up with a value of ~2e-5 just one pixel away. You may want to up your sampling space to get any sort of meaningful resolution.
Note: At time of writing, the fitting function (ngauss_fit) is still buggy, but the model has been tested successfully, just not in the scikit.
Disclaimer: In case it wasn't obvious from the above, I am the author of scikit-guess.

Getting spline equation from UnivariateSpline object

I'm using UnivariateSpline to construct piecewise polynomials for some data that I have. I would then like to use these splines in other programs (either in C or FORTRAN) and so I would like to understand the equation behind the generated spline.
Here is my code:
import numpy as np
import scipy as sp
from scipy.interpolate import UnivariateSpline
import matplotlib.pyplot as plt
import bisect
data = np.loadtxt('test_C12H26.dat')
Tmid = 800.0
print "Tmid", Tmid
nmid = bisect.bisect(data[:,0],Tmid)
fig = plt.figure()
plt.plot(data[:,0], data[:,7],ls='',marker='o',markevery=20)
npts = len(data[:,0])
#print "npts", npts
w = np.ones(npts)
w[0] = 100
w[nmid] = 100
w[npts-1] = 100
spline1 = UnivariateSpline(data[:nmid,0],data[:nmid,7],s=1,w=w[:nmid])
coeffs = spline1.get_coeffs()
print(coeffs)
print(spline1.get_knots())
print(spline1.get_residual())
print(coeffs[0] + coeffs[1] * (data[0,0] - data[0,0])
      + coeffs[2] * (data[0,0] - data[0,0])**2
      + coeffs[3] * (data[0,0] - data[0,0])**3,
      data[0,7])
print(coeffs[0] + coeffs[1] * (data[nmid,0] - data[0,0])
      + coeffs[2] * (data[nmid,0] - data[0,0])**2
      + coeffs[3] * (data[nmid,0] - data[0,0])**3,
      data[nmid,7])
print(Tmid, data[-1,0])
spline2 = UnivariateSpline(data[nmid-1:,0],data[nmid-1:,7],s=1,w=w[nmid-1:])
print(spline2.get_coeffs())
print(spline2.get_knots())
print(spline2.get_residual())
plt.plot(data[:,0],spline1(data[:,0]))
plt.plot(data[:,0],spline2(data[:,0]))
plt.savefig('test.png')
And here is the resulting plot. I believe I have valid splines for each interval, but it looks like my spline equation is not correct... I can't find any reference to what it is supposed to be in the scipy documentation. Does anybody know? Thanks!
The scipy documentation does not have anything to say about how one can take the coefficients and manually generate the spline curve. However, it is possible to figure out how to do this from the existing literature on B-splines. The following function bspleval shows how to construct the B-spline basis functions (the matrix B in the code), from which one can easily generate the spline curve by multiplying the coefficients with the highest-order basis functions and summing:
import numpy as np
import matplotlib.pyplot as plt

def bspleval(x, knots, coeffs, order, debug=False):
    '''
    Evaluate a B-spline at a set of points.

    Parameters
    ----------
    x : list or ndarray
        The set of points at which to evaluate the spline.
    knots : list or ndarray
        The set of knots used to define the spline.
    coeffs : list or ndarray
        The set of spline coefficients.
    order : int
        The order of the spline.

    Returns
    -------
    y : ndarray
        The value of the spline at each point in x.
    '''
    k = order
    t = knots
    m = len(t)
    npts = len(x)
    B = np.zeros((m-1, k+1, npts))

    if debug:
        print('k=%i, m=%i, npts=%i' % (k, m, npts))
        print('t=', t)
        print('coeffs=', coeffs)

    ## Create the zero-order B-spline basis functions.
    for i in range(m-1):
        B[i,0,:] = np.logical_and(x >= t[i], x < t[i+1]).astype(np.float64)
    if (k == 0):
        B[m-2,0,-1] = 1.0

    ## Next iteratively define the higher-order basis functions, working from lower order to higher.
    for j in range(1, k+1):
        for i in range(m-j-1):
            if (t[i+j] - t[i] == 0.0):
                first_term = 0.0
            else:
                first_term = ((x - t[i]) / (t[i+j] - t[i])) * B[i,j-1,:]
            if (t[i+j+1] - t[i+1] == 0.0):
                second_term = 0.0
            else:
                second_term = ((t[i+j+1] - x) / (t[i+j+1] - t[i+1])) * B[i+1,j-1,:]
            B[i,j,:] = first_term + second_term
        B[m-j-2,j,-1] = 1.0

    if debug:
        plt.figure()
        for i in range(m-1):
            plt.plot(x, B[i,k,:])
        plt.title('B-spline basis functions')

    ## Evaluate the spline by multiplying the coefficients with the highest-order basis functions.
    y = np.zeros(npts)
    for i in range(m-k-1):
        y += coeffs[i] * B[i,k,:]

    if debug:
        plt.figure()
        plt.plot(x, y)
        plt.title('spline curve')
        plt.show()

    return y
To give an example of how this can be used with Scipy's existing univariate spline functions, the following is an example script. This takes the input data and uses Scipy's functional and also its object-oriented approach to spline fitting. Taking the coefficients and knot points from either of the two and using these as inputs to our manually-calculated routine bspleval, we reproduce the same curve that they do. Note that the difference between the manually evaluated curve and Scipy's evaluation method is so small that it is almost certainly floating-point noise.
from numpy import array
import matplotlib.pyplot as plt
from scipy.interpolate import splrep, splev, UnivariateSpline

x = array([-273.0, -176.4, -79.8, 16.9, 113.5, 210.1, 306.8, 403.4, 500.0])
y = array([2.25927498e-53, 2.56028619e-03, 8.64512988e-01, 6.27456769e+00, 1.73894734e+01,
3.29052124e+01, 5.14612316e+01, 7.20531200e+01, 9.40718450e+01])
x_nodes = array([-273.0, -263.5, -234.8, -187.1, -120.3, -34.4, 70.6, 194.6, 337.8, 500.0])
y_nodes = array([2.25927498e-53, 3.83520726e-46, 8.46685318e-11, 6.10568083e-04, 1.82380809e-01,
2.66344008e+00, 1.18164677e+01, 3.01811501e+01, 5.78812583e+01, 9.40718450e+01])
## Now get scipy's spline fit.
k = 3
tck = splrep(x_nodes, y_nodes, k=k, s=0)
knots = tck[0]
coeffs = tck[1]
print('knot points=', knots)
print('coefficients=', coeffs)
## Now try scipy's object-oriented version. The result is exactly the same as "tck": the knots are the
## same and the coeffs are the same, they are just queried in a different way.
uspline = UnivariateSpline(x_nodes, y_nodes, s=0)
uspline_knots = uspline.get_knots()
uspline_coeffs = uspline.get_coeffs()
## Here are scipy's native spline evaluation methods. Again, "ytck" and "y_uspline" are exactly equal.
ytck = splev(x, tck)
y_uspline = uspline(x)
y_knots = uspline(knots)
## Now let's try our manually-calculated evaluation function.
y_eval = bspleval(x, knots, coeffs, k, debug=False)
plt.plot(x, ytck, label='tck')
plt.plot(x, y_uspline, label='uspline')
plt.plot(x, y_eval, label='manual')
## Next plot the knots and nodes.
plt.plot(x_nodes, y_nodes, 'ko', markersize=7, label='input nodes') ## nodes
plt.plot(knots, y_knots, 'mo', markersize=5, label='tck knots') ## knots
plt.xlim((-300.0,530.0))
plt.legend(loc='best', prop={'size':14})
plt.figure()
plt.title('difference')
plt.plot(x, ytck-y_uspline, label='tck-uspl')
plt.plot(x, ytck-y_eval, label='tck-manual')
plt.legend(loc='best', prop={'size':14})
plt.show()
The coefficients given by get_coeffs are B-spline (Basis spline) coefficients, described here: B-spline (Wikipedia)
Probably whatever other program/language you will be using has an implementation. Supply the knot locations and coefficients, and you should be all set.
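Since the goal is to re-use the spline in C or FORTRAN, another option worth mentioning (my addition, assuming a reasonably recent SciPy) is to convert the B-spline representation into plain piecewise-polynomial coefficients with scipy.interpolate.PPoly.from_spline; those are straightforward to evaluate in any language:
import numpy as np
from scipy.interpolate import splrep, PPoly

x = np.linspace(0.0, 10.0, 50)
y = np.sin(x)

tck = splrep(x, y, k=3, s=0)   # (knot vector, B-spline coefficients, degree)
pp = PPoly.from_spline(tck)

# pp.x holds the breakpoints; pp.c[:, i] holds the polynomial coefficients on
# [pp.x[i], pp.x[i+1]], highest power first, in the local variable (x - pp.x[i]).
print(pp.x.shape, pp.c.shape)
print(np.allclose(pp(x), y))   # the piecewise polynomial reproduces the interpolating spline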

Why is my 2D interpolant generating a matrix with swapped axes in SciPy?

I solve a differential equation with vector inputs
y' = f(t,y), y(t_0) = y_0
where y0 = y(x)
using the explicit Euler method, which says that
y_(i+1) = y_i + h*f(t_i, y_i)
where t is a time vector, h is the step size, and f is the right-hand side of the differential equation.
The python code for the method looks like this:
for i in np.arange(0, n-1):
    y[i+1,...] = y[i,...] + dt*myode(t[i], y[i,...])
The result is a k,m matrix y, where k is the size of the t dimension, and m is the size of y.
The vectors y and t are returned.
t, x, and y are passed to scipy.interpolate.RectBivariateSpline(t, x, y, kx=1, ky=1):
g = scipy.interpolate.RectBivariateSpline(t, x, y, kx=1, ky=1)
The resulting object g takes new vectors ti,xi ( g(p,q) ) to give y_int, which is y interpolated at the points defined by ti and xi.
Here is my problem:
The documentation for RectBivariateSpline describes the __call__ method in terms of x and y:
__call__(x, y[, mth]) Evaluate spline at the grid points defined by the coordinate arrays
The matplotlib documentation for plot_surface uses similar notation:
Axes3D.plot_surface(X, Y, Z, *args, **kwargs)
with the important difference that X and Y are 2D arrays which are generated by numpy.meshgrid().
When I compute simple examples, the input order is the same in both and the result is exactly what I would expect. In my explicit Euler example, however, the initial order is ti,xi, yet the surface plot of the interpolant output only makes sense if I reverse the order of the inputs, like so:
ax2.plot_surface(xi, ti, u, cmap=cm.coolwarm)
While I am glad that it works, I'm not satisfied because I cannot explain why, nor why (apart from the array geometry) it is necessary to swap the inputs. Ideally, I would like to restructure the code so that the input order is consistent.
Here is a working code example to illustrate what I mean:
# Heat equation example with explicit Euler method
import numpy as np
import matplotlib.pyplot as mplot
import matplotlib.cm as cm
import scipy.sparse as sp
import scipy.interpolate as interp
from mpl_toolkits.mplot3d import Axes3D
import pdb
# explicit Euler method
def eev(myode, tspan, y0, dt):
    # Preprocessing
    # Time steps
    tspan[1] = tspan[1] + dt
    t = np.arange(tspan[0], tspan[1], dt, dtype=float)
    n = t.size
    m = y0.shape[0]
    y = np.zeros((n, m), dtype=float)
    y[0,:] = y0
    # explicit Euler recurrence relation
    for i in np.arange(0, n-1):
        y[i+1,...] = y[i,...] + dt*myode(t[i], y[i,...])
    return y, t

# generate matrix A
# u'(t) = A*u(t) + g*u(t)
def a_matrix(n):
    aa = sp.diags([1, -2, 1], [-1, 0, 1], (n, n))
    return aa

# System of ODEs with finite differences
def f(t, u):
    dydt = np.divide(1, h**2)*A.dot(u)
    return dydt

# homogeneous Dirichlet boundary conditions
def rbd(t):
    ul = np.zeros((t, 1))
    return ul

# Initial value problem -----------
def main():
    # Metal rod
    # spatial discretization
    # number of inner nodes
    m = 20
    x0 = 0
    xn = 1
    x = np.linspace(x0, xn, m+2)
    # Step size
    global h
    h = x[1] - x[0]
    # Initial values
    u0 = np.sin(np.pi*x)
    # A matrix
    global A
    A = a_matrix(m)
    # Time
    t0 = 0
    tend = 0.2
    # Time step width
    dt = 0.0001
    tspan = [t0, tend]
    # Test r for stability
    r = np.divide(dt, h**2)
    if r <= 0.5:
        u, t = eev(f, tspan, u0[1:-1], dt)
    else:
        print('r = ', r)
        print('r > 0.5. Explicit Euler method will not be stable.')
    # Add boundary values back
    rb = rbd(t.size)
    u = np.hstack((rb, u, rb))
    # Interpolate heat values
    # Create interpolant. Note the parameter order
    fi = interp.RectBivariateSpline(t, x, u, kx=1, ky=1)
    # Create vectors for interpolant
    xi = np.linspace(x[0], x[-1], 100)
    ti = np.linspace(t0, tend, 100)
    # Compute function values from interpolant
    u_int = fi(ti, xi)
    # Change xi, ti in to 2D arrays
    xi, ti = np.meshgrid(xi, ti)
    # Create figure and axes objects
    fig3 = mplot.figure(1)
    ax3 = fig3.add_subplot(projection='3d')  # gca(projection='3d') was removed in newer matplotlib
    print('xi.shape =', xi.shape, 'ti.shape =', ti.shape, 'u_int.shape =', u_int.shape)
    # Plot surface. Note the parameter order, compare with interpolant!
    ax3.plot_surface(xi, ti, u_int, cmap=cm.coolwarm)
    ax3.set_xlabel('xi')
    ax3.set_ylabel('ti')

main()
mplot.show()
As far as I can see, you define:
# Change xi, ti in to 2D arrays
xi,ti = np.meshgrid(xi,ti)
Change this to :
ti,xi = np.meshgrid(ti,xi)
and
ax3.plot_surface(xi, ti, u_int, cmap=cm.coolwarm)
to
ax3.plot_surface(ti, xi, u_int, cmap=cm.coolwarm)
and it works fine (if I understood you well).
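If the goal is a consistent (t, x) argument order everywhere, another option (my suggestion, not part of the answer above) is to build the meshgrid with indexing='ij', so the first argument stays along the first axis, matching how RectBivariateSpline lays out fi(ti, xi):
import numpy as np

ti = np.linspace(0.0, 0.2, 100)
xi = np.linspace(0.0, 1.0, 100)
u_int = np.zeros((ti.size, xi.size))            # placeholder with the shape of fi(ti, xi)

Ti, Xi = np.meshgrid(ti, xi, indexing='ij')     # Ti[i, j] = ti[i], Xi[i, j] = xi[j]
# ax3.plot_surface(Ti, Xi, u_int, cmap=cm.coolwarm)  # same (t, x) order as the interpolant call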

python scipy.stats.powerlaw negative exponent

I want to supply a negative exponent for the scipy.stats.powerlaw routine, e.g. a=-1.5, in order to draw random samples:
"""
powerlaw.pdf(x, a) = a * x**(a-1)
"""
from scipy.stats import powerlaw
a = -1.5
R = powerlaw.rvs(a, size=100)  # fails: scipy requires a > 0
Why is a > 0 required, how can I supply a negative a in order to generate the random samples, and how can I supply a normalization coefficient/transform, i.e.
PDF(x,C,a) = C * x**a
The documentation is here
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.powerlaw.html
Thanks!
EDIT: I should add that I'm trying to replicate IDL's RANDOMP function:
http://idlastro.gsfc.nasa.gov/ftp/pro/math/randomp.pro
A PDF, integrated over its domain, must equal one. In other words, the area under a probability density function's curve must equal one.
In [36]: import scipy.integrate as integrate
In [40]: y, err = integrate.quad(lambda x: 0.5*x**(-0.5), 0, 1)
In [41]: y
Out[41]: 0.9999999999999998 # The integral is close to 1
The powerlaw density function has a domain from 0 <= x <= 1. On this domain, the integral of x**b is finite for any b > -1. When b is smaller, x**b blows up too rapidly near x = 0. So it is not a valid probability density function when b <= -1.
In [38]: integrate.quad(lambda x: x**(-1), 0, 1)
UserWarning: The maximum number of subdivisions (50) has been achieved...
# The integral blows up
Thus for x**(a-1), a must satisfy a-1 > -1 or equivalently, a > 0.
The first constant a in a * x**(a-1) is the normalizing constant which makes the integral of a * x**(a-1) over the domain [0,1] equal to 1. So you don't get to choose this constant independent of a.
Now if you change the domain to be a measurable distance away from 0, then yes, you could define a PDF of the form C * x**a for negative a. But you'd have to state what domain you want, and I don't think there is (yet) a PDF available in scipy.stats for this.
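To illustrate that last point, here is a hedged sketch (my addition, with made-up names) of how you could define such a bounded density C * x**a yourself by subclassing scipy.stats.rv_continuous and stating the domain explicitly:
import numpy as np
from scipy.stats import rv_continuous

class BoundedPowerLaw(rv_continuous):
    """PDF proportional to x**exponent on [xmin, xmax], with xmin > 0 and exponent != -1."""
    def __init__(self, exponent, xmin, xmax):
        super().__init__(a=xmin, b=xmax)    # a and b set the support
        self.exponent = exponent

    def _pdf(self, x):
        e1 = self.exponent + 1
        C = e1 / (self.b**e1 - self.a**e1)  # normalization so the PDF integrates to 1 on [a, b]
        return C * x**self.exponent

    def _cdf(self, x):
        e1 = self.exponent + 1
        return (x**e1 - self.a**e1) / (self.b**e1 - self.a**e1)

dist = BoundedPowerLaw(exponent=-1.5, xmin=1.0, xmax=100.0)
samples = dist.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())  # all samples lie inside [1, 100]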
The Python package powerlaw can do this. Consider for a>1 a power law distribution with probability density function
f(x) = c * x^(-a)
for x > x_min and f(x) = 0 otherwise. Here c is a normalization factor and is determined as
c = (a-1) * x_min^(a-1).
In the example below it is a = 1.5 and x_min = 1.0 and comparing the probability density function estimated from the random sample with the PDF from the expression above gives the expected result.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as pl
import numpy as np
import powerlaw
a, xmin = 1.5, 1.0
N = 10000
# generates random variates of power law distribution
vrs = powerlaw.Power_Law(xmin=xmin, parameters=[a]).generate_random(N)
# plotting the PDF estimated from variates
bin_min, bin_max = np.min(vrs), np.max(vrs)
bins = 10**(np.linspace(np.log10(bin_min), np.log10(bin_max), 100))
counts, edges = np.histogram(vrs, bins, density=True)
centers = (edges[1:] + edges[:-1])/2.
# plotting the expected PDF
xs = np.linspace(bin_min, bin_max, 100000)
pl.plot(xs, [(a-1)*xmin**(a-1)*x**(-a) for x in xs], color='red')
pl.plot(centers, counts, '.')
pl.xscale('log')
pl.yscale('log')
pl.savefig('powerlaw_variates.png')
Running the script saves the comparison plot to powerlaw_variates.png.
If r is a uniform random deviate U(0,1), then x in the following expression is a power-law distributed random deviate:
x = xmin * (1-r) ** (-1/(alpha-1))
where xmin is the smallest (positive) value above which the power-law distribution holds, and alpha is the exponent of the distribution.
If you want to generate a power-law distribution, you can use inverse transform sampling: generate a random number uniform on [0,1] and apply the inverse of the CDF (Wolfram). In this case, the probability density function is:
p(k) = k^(-gamma)
and y is the uniform variable between 0 and 1:
y ~ U(0,1)
import numpy as np

def power_law(k_min, k_max, y, gamma):
    return ((k_max**(-gamma+1) - k_min**(-gamma+1))*y + k_min**(-gamma+1.0))**(1.0/(-gamma + 1.0))
Now to generate a distribution, you just have to create an array
nodes = 1000
scale_free_distribution = np.zeros(nodes, float)
k_min = 1.0
k_max = 100*k_min
gamma = 3.0
for n in range(nodes):
    scale_free_distribution[n] = power_law(k_min, k_max, np.random.uniform(0,1), gamma)
This will generate a power-law distribution with gamma = 3.0. If you want to fix the average of the distribution, you have to look at the complex-networks literature, because k_min depends on k_max and the average connectivity.
My answer is almost the same as Virgil's above, with the crucial difference that alpha is actually the negative exponent of the power-law distribution.
So, if r is a uniform random deviate U(0,1), then x in the following expression is a power-law distributed random deviate:
x = xmin * (1-r) ** (-1/(alpha-1))
where xmin is the smallest (positive) value above which the power-law distribution holds, and alpha is the negative exponent of the distribution, that is, P(x) = [constant] * x**(-alpha).
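A minimal NumPy sketch of that formula (my addition; assumes alpha > 1, no upper cutoff, and illustrative parameter values):
import numpy as np

rng = np.random.default_rng(0)
xmin, alpha, n = 1.0, 3.5, 100_000

r = rng.uniform(size=n)
x = xmin * (1 - r) ** (-1.0 / (alpha - 1))   # power-law deviates with P(x) ~ x**(-alpha)

# For alpha > 2 the analytic mean is xmin*(alpha-1)/(alpha-2); the sample mean should be close.
print(x.mean(), xmin * (alpha - 1) / (alpha - 2))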
