What function can I use in Python if I want to sample a truncated integer power law?
That is, given two parameters a and m, generate a random integer x in the range [1,m) that follows a distribution proportional to 1/x^a.
I've been searching around numpy.random, but I haven't found this distribution.
AFAIK, neither NumPy nor SciPy defines this distribution for you. However, with SciPy it is easy to define your own discrete distribution using scipy.stats.rv_discrete:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

def truncated_power_law(a, m):
    # support is 1..m inclusive; use m-1 here for the half-open range [1, m)
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))

a, m = 2, 10
d = truncated_power_law(a=a, m=m)
N = 10**4
sample = d.rvs(size=N)
# one bin per integer 1..m, with edges at the half-integers
plt.hist(sample, bins=np.arange(m+1)+0.5)
plt.show()
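For comparison, here is a minimal NumPy-only sketch of the same idea that sticks to the half-open range [1, m) from the question. The function name sample_truncated_power_law and the optional rng argument are illustrative assumptions, not part of the answer above.

```python
import numpy as np

def sample_truncated_power_law(a, m, size, rng=None):
    """Draw integers x in [1, m) with P(x) proportional to 1/x**a."""
    rng = np.random.default_rng() if rng is None else rng
    support = np.arange(1, m)             # 1, 2, ..., m-1
    pmf = support.astype(float) ** (-a)   # unnormalized weights 1/x**a
    pmf /= pmf.sum()                      # normalize so the weights sum to 1
    return rng.choice(support, size=size, p=pmf)

sample = sample_truncated_power_law(a=2, m=10, size=10_000)
```

This recomputes the weights on every call, which is fine for small m; for repeated sampling you would probably cache the pmf.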
I don't use Python, so rather than risk syntax errors I'll try to describe the solution algorithmically. This is a brute-force discrete inversion. It should translate quite easily into Python. I'm assuming 0-based indexing for the array.
Setup:
Generate an array cdf of size m with cdf[0] = 1 as the first entry, cdf[i] = cdf[i-1] + 1/(i+1)**a for the remaining entries.
Scale all entries by dividing cdf[m-1] into each -- now they actually are CDF values.
Usage:
Generate your random values by generating a Uniform(0,1) and
searching through cdf[] until you find an entry greater than your
uniform. Return the index + 1 as your x-value.
Repeat for as many x-values as you want.
For instance, with a,m = 2,10, I calculate the probabilities directly as:
[0.6452579827864142, 0.16131449569660355, 0.07169533142071269, 0.04032862392415089, 0.02581031931145657, 0.017923832855178172, 0.013168530260947229, 0.010082155981037722, 0.007966147935634743, 0.006452579827864143]
and the CDF is:
[0.6452579827864142, 0.8065724784830177, 0.8782678099037304, 0.9185964338278814, 0.944406753139338, 0.9623305859945162, 0.9754991162554634, 0.985581272236501, 0.9935474201721358, 1.0]
When generating, if I got a Uniform outcome of 0.90 I would return x=4 because 0.918... is the first CDF entry larger than my uniform.
If you're worried about speed you could build an alias table, but with a geometric decay the probability of early termination of a linear search through the array is quite high. With the given example, for instance, you'll terminate on the first peek almost 2/3 of the time.
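The setup and usage steps above can be sketched in Python roughly as follows. This is one possible translation, not the answerer's own code: the helper names build_cdf and invert are made up here, and a binary search via np.searchsorted stands in for the linear search described above.

```python
import numpy as np

def build_cdf(a, m):
    """Setup: running sum of 1/k**a for k = 1..m, scaled so the last entry is 1."""
    cdf = np.cumsum(1.0 / np.arange(1, m + 1, dtype=float) ** a)
    return cdf / cdf[-1]

def invert(cdf, u):
    """Usage: 1-based position of the first CDF entry strictly greater than u."""
    return int(np.searchsorted(cdf, u, side="right")) + 1

cdf = build_cdf(a=2, m=10)
x = invert(cdf, 0.90)   # the worked example: a uniform draw of 0.90

rng = np.random.default_rng()
samples = [invert(cdf, u) for u in rng.random(1000)]
```

Note that, like the description above, this covers x = 1..m inclusive; build the table with m-1 entries if m itself must be excluded.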
Use numpy.random.zipf and just reject any samples greater than or equal to m
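A rough sketch of that rejection idea, assuming a > 1 (NumPy's Zipf sampler requires it) and m > 1. The function name truncated_zipf and the batching strategy are illustrative choices, not something the answer specifies.

```python
import numpy as np

def truncated_zipf(a, m, size, rng=None):
    """Rejection sampler: keep Zipf(a) draws that fall in [1, m)."""
    rng = np.random.default_rng() if rng is None else rng
    kept = np.empty(0, dtype=np.int64)
    while kept.size < size:
        draws = rng.zipf(a, size=size)                    # untruncated Zipf draws
        kept = np.concatenate([kept, draws[draws < m]])   # reject anything >= m
    return kept[:size]

sample = truncated_zipf(a=2.0, m=10, size=10_000)
```

Rejection gets slow when a is close to 1 and m is small, because most draws are then thrown away.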
I want a function randgen(f, N) in Python that generates N samples from a given pdf.
This is what I wrote:
import numpy as np
import matplotlib.pyplot as plt

def randgen(f, N, M=1):
    sample = M*np.random.random(N)
    y = []
    sum = 0
    for x in sample:
        v = f(x)
        sum += v
        y.append(v)
    y = y/sum
    return np.random.choice(sample, p=y, size=N)

def pp(x):
    return x**2

z = randgen(pp, 2000)
plt.hist(z)
It generates a histogram for the function y=x^2, and it seems to work.
I have seen similar questions, but none with a clear definition of a function randgen(f, N) that can take arbitrary functions. I would like to know whether my approach is correct or whether I missed something.
Okay, to unpack your solution:
generate N random numbers between 0 and 1
calculate a probability for each number depending on a given function
rescale your solution so that the integral of that function is 1
draw N numbers from your "generated" pdf
The way you did it definitely fulfills the criteria for a probability density function, and your solution should be correct, but you can improve it by using uniformly spaced numbers for the calculation of your pdf.
numpy.linspace(start, stop, N) produces N evenly spaced numbers between start and stop. (https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html)
Your solution is fine for discrete pdfs if you apply my suggestion and replace
sample = M*np.random.random(N)
with
sample = np.linspace(start, stop, N)
edit: A pdf also requires that the probabilities be non-negative, so there should be some mechanism to reject negative function values for x in the range [0, 1].
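Putting the suggestion together, a revised randgen might look like the sketch below. The extra parameters (start, stop, grid_size, rng) are assumptions added for illustration, and the negativity check is just one simple way to enforce the requirement mentioned in the edit.

```python
import numpy as np

def randgen(f, n, start=0.0, stop=1.0, grid_size=1000, rng=None):
    """Draw n samples from a discretized version of the pdf f on [start, stop]."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.linspace(start, stop, grid_size)            # evenly spaced x values
    weights = np.array([f(x) for x in grid], dtype=float)
    if np.any(weights < 0):
        raise ValueError("f must be non-negative on [start, stop]")
    return rng.choice(grid, size=n, p=weights / weights.sum())

z = randgen(lambda x: x**2, 2000)
```

The samples can only take values on the grid, so a finer grid_size gives a closer approximation to the continuous density.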
I have a number X of integers (very large) and a probability p with which I want to draw a sample s (a number) from X following a Poisson distribution. For example, if X = 10^8 and p=0.05, I expect s to be the number of heads we get.
I was able to easily do this with random.binomial as:
s=np.random.binomial(n=X, p=p)
How can I apply the same idea using random.poisson?
Just multiply p and X:
np.random.poisson(10**8 * 0.05)
The probability to get more than 10**8 is numerically zero.
Professor @pjs emphasizes that we are combining the probability and the number into a rate, which is the parameter of the Poisson process.
It is further worth mentioning that for such a large number the pmfs of the binomial and the Poisson are very similar to each other, and both (like their cumulative probability functions, or "cdfs" as engineers call them) are close to a Gaussian.
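To see how close the two are, here is a small comparison sketch using the numbers from the question. The sample size of 10,000 and the fixed seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X, p = 10**8, 0.05
lam = X * p   # Poisson rate = expected number of successes

binom = rng.binomial(n=X, p=p, size=10_000)
poiss = rng.poisson(lam=lam, size=10_000)

# Both sample means should sit near lam; the spreads differ slightly,
# since the binomial variance is lam*(1-p) and the Poisson variance is lam.
print(binom.mean(), poiss.mean())
print(binom.std(), poiss.std())
```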
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.poisson.html
import numpy as np
X, p = 10**8, 0.05
s = np.random.poisson(lam=X * p)  # the rate is the expected count X*p
My problem:
I have an array of ufloats (e.g. a uarray) from Python's uncertainties package.
Every value in the array has its own error, and I need a function that gives me the average of the array, taking into account both the error I get when calculating the mean of the nominal values and the influence of the individual values' errors.
I have a uarray:
2 +/- 1
3 +/- 2
4 +/- 3
and need a function that gives me an average value of the array.
Thanks
Assuming Gaussian statistics, the uncertainties stem from Gaussian parent distributions. In such a case, it is standard to weight the measurements (nominal values) by the inverse variance. Applying this to the general weighted average gives
$$ \frac{\sum_i w_i x_i}{\sum_i w_i} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}. $$
One need only perform good ol' error propagation on this to get the uncertainty of the weighted average as
$$ \sqrt{\frac{1}{\sum_i 1/\sigma_i^2}}. $$
I don't have general n-element code for this on hand, but here's how one could get the weighted average and its uncertainty in a simple case:
import numpy as np
import uncertainties as un

a = un.ufloat(5, 2)
b = un.ufloat(8, 4)
wavg = un.ufloat((a.n/a.s**2 + b.n/b.s**2)/(1/a.s**2 + 1/b.s**2),
                 np.sqrt(1/(1/a.s**2 + 1/b.s**2)))
print(wavg)
>>> 5.6+/-1.8
As one would expect, the result tends toward the value with the smaller uncertainty. This is good, since a smaller uncertainty in a measurement implies that its nominal value is likely closer to the true value of the parent distribution than those with larger uncertainties.
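For arrays of any length, the inverse-variance weighting can be sketched with plain Python as below. It assumes independent measurements with nonzero uncertainties and uses the standard propagated uncertainty sqrt(1 / sum_i 1/sigma_i^2); the function name weighted_mean is made up for this example.

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean and its propagated uncertainty."""
    weights = [1.0 / s**2 for s in sigmas]   # w_i = 1 / sigma_i^2
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return mean, math.sqrt(1.0 / total)

mean, sigma = weighted_mean([5, 8], [2, 4])
```

With ufloats, you would pass the nominal values (v.n) and standard deviations (v.s) and wrap the result back into a ufloat.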
Unless I'm missing something, you could calculate the sum divided by the length of the array:
from uncertainties import unumpy, ufloat
import numpy as np
arr = np.array([ufloat(2, 1), ufloat(3, 2), ufloat(4,3)])
print(sum(arr)/len(arr))
# 3.0+/-1.2
You can also define it like this:
arr1 = unumpy.uarray([2, 3, 4], [1, 2, 3])
print(sum(arr1)/len(arr1))
# 3.0+/-1.2
uncertainties takes care of the rest.
I used Captain Morgan's answer to serve up some sweet Python code for a project and discovered that it needed a little extra ingredient:
import numpy as np
from uncertainties import ufloat, unumpy as unp

# values: a sequence of ufloats, assumed to be defined elsewhere
epsilon = unp.nominal_values(values).mean()/(1e12)
weights = [1/(v.s**2+epsilon) for v in values]
# inverse-variance weighted mean with uncertainty sqrt(1 / sum of weights)
wavg = ufloat(sum(w*v.n for w, v in zip(weights, values))/sum(weights),
              np.sqrt(1/sum(weights)))
if wavg.s <= np.sqrt(epsilon):
    wavg = ufloat(wavg.n, 0.0)
Without that little something (epsilon) we'd get div/0 errors from observations recorded with zero uncertainty.
If you already have a .csv file which stores variables in 'mean+/-sted' format, you could try the code below; it works for me.
import pandas as pd
from uncertainties import ufloat_fromstr

df = pd.read_csv(r'Z:\compare\SL2P_PAR.csv')
# parse each 'mean+/-sted' string once, then split it into two columns
parsed = df['uncertainty'].apply(ufloat_fromstr)
df['mean'] = parsed.apply(lambda u: u.n)
df['sted'] = parsed.apply(lambda u: u.s)
I have a power-law distribution of energies and I want to pick n random energies based on the distribution. I tried doing this manually using random numbers, but it is too inefficient for what I want to do. I'm wondering whether there is a method in numpy (or elsewhere) that works like numpy.random.normal, except that instead of using a normal distribution, the distribution may be specified. So in my mind an example might look like this (similar to numpy.random.normal):
import numpy as np
# Energies from within which I want values drawn
eMin = 50.
eMax = 2500.
# Amount of energies to be drawn
n = 10000
photons = []
for i in range(n):
    # Method that I just made up which would work like random.normal,
    # i.e. return an energy on the distribution based on its probability,
    # but take a distribution other than a normal distribution
    photons.append(np.random.distro(eMin, eMax, lambda e: e**(-1.)))
print(photons)
Printing photons should give me a list of length 10000 populated by energies in this distribution. If I were to histogram this it would have much greater bin values at lower energies.
I am not sure if such a method exists but it seems like it should. I hope it is clear what I want to do.
EDIT:
I have seen numpy.random.power but my exponent is -1 so I don't think this will work.
Sampling from arbitrary PDFs well is actually quite hard. There are large and dense books just about how to efficiently and accurately sample from the standard families of distributions.
It looks like you could probably get by with a custom inversion method for the example that you gave.
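For the 1/e density in the question, the inversion can be worked out by hand: the CDF on [eMin, eMax] is F(e) = ln(e/eMin) / ln(eMax/eMin), so e = eMin * (eMax/eMin)**u for a uniform u in [0, 1). A minimal sketch, with the function name sample_inverse_power chosen here for illustration:

```python
import numpy as np

def sample_inverse_power(e_min, e_max, n, rng=None):
    """Draw n energies with density proportional to 1/e on [e_min, e_max]."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(n)                      # uniform draws in [0, 1)
    return e_min * (e_max / e_min) ** u    # inverse of the analytic CDF

photons = sample_inverse_power(50.0, 2500.0, 10_000)
```

Other exponents need a different closed form, so this only covers the exponent -1 case from the question.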
If you want to sample from an arbitrary distribution, you need the inverse of the cumulative distribution function (the cdf, not the pdf).
You then sample a probability uniformly from range [0,1] and feed this into the inverse of the cdf to get the corresponding value.
It is often not possible to obtain the cdf from the pdf analytically.
However, if you're happy to approximate the distribution, you could do so by calculating f(x) at regular intervals over its domain, then doing a cumsum over this vector to get an approximation of the cdf and from this approximate the inverse.
Rough code snippet:
import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate

def f(x):
    """
    substitute this function with your arbitrary distribution
    must be positive over domain
    """
    return 1/float(x)

# you should vary inputVals to cover the domain of f (for better accuracy you can
# be clever about spacing of values as well). Here I space them logarithmically
# up to 1, then at regular intervals, but you could definitely do better.
# The lower bound of 1e-3 is arbitrary; it only keeps x away from 0, where 1/x diverges.
inputVals = np.hstack([np.logspace(-3, 0, 300, endpoint=False), np.arange(1, 10000)])

# everything else should just work
funcVals = np.array([f(x) for x in inputVals])
# left Riemann sum: accumulate f(x)*dx to approximate the cdf
cdf = np.zeros(len(funcVals))
cdf[1:] = np.cumsum(funcVals[:-1]*np.diff(inputVals))
cdf /= cdf[-1]

# you could also improve the approximation by choosing an appropriate interpolator
inverseCdf = scipy.interpolate.interp1d(cdf, inputVals)

# grab 10k samples from the distribution
samples = inverseCdf(np.random.uniform(0, 1, size=10000))
plt.hist(samples, bins=500)
plt.show()
Why don't you use eval and put the distribution in a string?
>>> cmd = "numpy.random.normal(500)"
>>> eval(cmd)
You can manipulate the string as you wish to set the distribution.
I want to use the Gaussian function in Python to generate some numbers within a specific range, given the mean and variance.
So let's say I have a range between 0 and 10,
and I want my mean to be 3 and my variance to be 4:
mean = 3, variance = 4
How can I do that?
Use random.gauss. From the docs:
random.gauss(mu, sigma)
Gaussian distribution. mu is the mean, and sigma is the standard deviation. This is slightly
faster than the normalvariate() function defined below.
It seems to me that you can clamp the results of this, but that wouldn't make it a Gaussian distribution. I don't think you can satisfy all the constraints simultaneously. If you want to clamp it to the range [0, 10], you could get your numbers:
num = min(10, max(0, random.gauss(3, 2)))  # sigma = sqrt(variance) = 2
But then the resulting distribution of numbers won't be truly Gaussian. In this case, it seems you can't have your cake and eat it, too.
There's probably a better way to do this, but this is the function I ended up creating to solve this problem:
import random

def trunc_gauss(mu, sigma, bottom, top):
    a = random.gauss(mu, sigma)
    while not (bottom <= a <= top):
        a = random.gauss(mu, sigma)
    return a
If we break it down line by line:
import random
This allows us to use functions from the random library, which includes a Gaussian random number generator (random.gauss).
def trunc_gauss(mu, sigma, bottom, top):
The function arguments allow us to specify the mean (mu) and standard deviation (sigma), as well as the bottom and top of our desired range.
    a = random.gauss(mu, sigma)
Inside the function, we generate an initial random number according to a Gaussian distribution.
    while not (bottom <= a <= top):
        a = random.gauss(mu, sigma)
Next, the while loop checks whether the number is within our specified range, and generates a new random number as long as the current number is outside our range.
    return a
As soon as the number is inside our range, the while loop stops running and the function returns the number.
This should give a better approximation of a gaussian distribution, since we don't artificially inflate the top and bottom boundaries of our range by rounding up or down the outliers.
I'm quite new to Python, so there are most probably simpler ways, but this worked for me.
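If SciPy is available, the same kind of truncated draw can be had from scipy.stats.truncnorm, which expects its bounds in units of standard deviations from the mean. The wrapper below is only a sketch; the name trunc_gauss_scipy and its arguments are illustrative and mirror the function above.

```python
from scipy.stats import truncnorm

def trunc_gauss_scipy(mu, sigma, bottom, top, size=1, random_state=None):
    """Sample a normal(mu, sigma) restricted to [bottom, top]."""
    a = (bottom - mu) / sigma   # lower bound in standard-deviation units
    b = (top - mu) / sigma      # upper bound in standard-deviation units
    return truncnorm.rvs(a, b, loc=mu, scale=sigma, size=size,
                         random_state=random_state)

# mean 3, variance 4 (sigma = 2), restricted to [0, 10]
samples = trunc_gauss_scipy(mu=3, sigma=2, bottom=0, top=10, size=1000)
```

Unlike the rejection loop, this does not slow down when the allowed range covers only a small part of the distribution.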
I was working on some numerical analytical computation and I ran into this python tutorial site - http://www.python-course.eu/weighted_choice_and_sample.php
Here is what I offer as a solution, in case anyone is too busy to visit the site.
I don't know how many gaussian values you need so I'll go with 100 as n, mu you gave as 3 and variance as 4 which makes sigma = 2. Here's the code:
from random import gauss
n = 100
values = []
frequencies = {}
while len(values) < n:
    value = gauss(3, 2)
    if 0 < value < 10:
        frequencies[int(value)] = frequencies.get(int(value), 0) + 1
        values.append(value)
print(values)
I hope this helps. You can get the plot as well. It's all in the tutorials.
If you have a small range of integers, you can create a list with a gaussian distribution of the numbers within that range and then make a random choice from it.
import math
from random import uniform
import numpy as np
from scipy.special import erf, erfinv

def trunc_gauss(mu, sigma, xmin=np.nan, xmax=np.nan):
    """Truncated Gaussian distribution.
    mu is the mean, and sigma is the standard deviation.
    """
    # the normal cdf is 0.5*(1 + erf((x - mu)/(sigma*sqrt(2)))), hence this scale
    s = sigma * math.sqrt(2)
    if np.isnan(xmin):
        zmin = -1  # erf tends to -1 when there is no lower bound
    else:
        zmin = erf((xmin - mu)/s)
    if np.isnan(xmax):
        zmax = 1  # erf tends to +1 when there is no upper bound
    else:
        zmax = erf((xmax - mu)/s)
    y = uniform(zmin, zmax)
    z = erfinv(y)
    # This will not come up often, but erfinv returns +/-inf at y = +/-1,
    # so redraw in that case
    while math.isinf(z):
        z = erfinv(uniform(zmin, zmax))
    return mu + z*s
You can use minimal code to draw 150 values:
import numpy as np
s = np.random.normal(3, 2, 150)  # mean = 3, standard deviation = 2 (variance = 4)
print(s)
The normal distribution is just another random (stochastic) distribution,
so we can check it by plotting:
import seaborn as sns
import matplotlib.pyplot as plt
AA1_plot = sns.histplot(s, kde=True)
plt.show()