I would like to implement a function in Python (using NumPy) that takes a mathematical function (for example p(x) = e^(-x), as below) as input and generates random numbers distributed according to that function's probability distribution. And I need to plot them, so we can see the distribution.
I really only need a random number generator for the following two mathematical functions as input, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have an idea how this can be done in Python?
For simple distributions like the ones you need, or if you have a CDF that is easy to invert in closed form, you can find plenty of samplers in NumPy, as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov chain Monte Carlo (MCMC) sampling methods.
The simplest, and perhaps easiest to understand, variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
start from a random point x and take a random step xnew = x + delta
evaluate the desired probability distribution in the starting point p(x) and in the new one p(xnew)
if the new point is more probable, p(xnew)/p(x) >= 1, accept the move
if the new point is less probable, randomly decide whether to accept or reject, depending on how probable[1] the new point is
take a new step from this point and repeat the cycle
It can be shown, see e.g. Sokal[2], that points sampled with this method follow the acceptance probability distribution.
An extensive implementation of Monte Carlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
import numpy as np

def uniform_proposal(x, delta=2.0):
    return np.random.uniform(x - delta, x + delta)

def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
    x = 1  # start somewhere
    for i in range(nsamples):
        trial = proposal(x)  # random neighbour from the proposal distribution
        acceptance = p(trial)/p(x)
        # accept the move conditionally
        if np.random.uniform() < acceptance:
            x = trial
        yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
    return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)

p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
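To see the distribution (as the question asks), the samples can be histogrammed and compared with the target density. A minimal sketch, assuming matplotlib is available; the division by 3 only renormalises the mixture, since its three components each integrate to 1:
import matplotlib.pyplot as plt

xs = np.linspace(-2, 4, 500)
plt.hist(samples, bins=100, density=True, label='Metropolis samples')
plt.plot(xs, p(xs) / 3, label='target density')  # p integrates to 3, hence the division
plt.legend()
plt.show()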
Cauchy
def cauchy(x, mu, gamma):
    return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))

p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain where your random steps are sampled[3]:
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
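The domain keyword is not implemented in the toy sampler above; footnote [3] below describes the idea. A minimal sketch of one way to do it, wrapping trial steps at the domain edges (the function name and defaults here are mine, not part of the original answer):
def bounded_uniform_proposal(x, delta=2.0, domain=(0.0, 10.0)):
    lo, hi = domain
    trial = np.random.uniform(x - delta, x + delta)
    # wrap the step back into [lo, hi) so the chain never leaves the domain
    return lo + (trial - lo) % (hi - lo)

# usage, e.g.:
# samples = list(metropolis_sampler(p, 100000,
#                proposal=lambda x: bounded_uniform_proposal(x, domain=(0, 10))))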
Conclusions
There is still way too much to say about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place, and there is plenty of much better material than what I could write here available online.
[1] The idea here is to favor exploration where the probability is higher, but still look at low-probability regions as they might lead to other peaks. Fundamental is the choice of the proposal distribution, i.e. how you pick new points to explore. Steps that are too small might constrain you to a limited area of your distribution; steps that are too big could lead to very inefficient exploration.
[2] Physics oriented. The Bayesian formalism (Metropolis-Hastings) is preferred these days, but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke University.
[3] Implementation not shown so as not to add too much confusion; it's straightforward, you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both cases the arguments are optional, as these are the default values for these distributions.
As a sidenote, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
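To generate many samples and look at the distribution, as the question asks, a minimal sketch (assuming matplotlib is available) could be:
import numpy as np
import matplotlib.pyplot as plt

samples = np.random.exponential(1, size=100000)  # or np.random.normal(0, 1, size=100000)
plt.hist(samples, bins=100, density=True)
plt.show()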
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and input values from a uniform distribution.
inverse_cdf(np.random.uniform())
For example, if NumPy did not provide the exponential distribution, you could do this:
def exponential():
    return -np.log(1.0 - np.random.uniform())
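For the question's second function, the standard normal, the inverse CDF is not elementary, but it can be written with the inverse error function. A sketch, assuming SciPy is acceptable in addition to NumPy:
from scipy.special import erfinv

def standard_normal():
    # inverse of the standard normal CDF applied to a uniform draw
    u = np.random.uniform()
    return np.sqrt(2) * erfinv(2 * u - 1)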
If you encounter distributions whose CDF is not easy to compute, then consider filippo's great answer.
Related
I was wondering if anybody knows of a reliable and fast analytic HDI calculation, preferably for beta distributions.
The definition of the HDI is in this question, where it is called the "Highest Posterior Density Region".
I am looking for a function that has the following I/O:
Input
credMass - credible interval mass (e.g, 0.95 for 95% credible interval)
a - shape parameter (e.g, number of HEADS coin tosses)
b - shape parameter (e.g, number of TAILS coin tosses)
Output
ci_min - minimum bound of mass (value between 0 and ci_max)
ci_max - maximum bound of mass (value between ci_min and 1)
One way to go about it is to speed up this script that I found in the same question (adapted from R to Python, and taken from the book Doing Bayesian Data Analysis by John K. Kruschke). I have used this solution, which is quite reliable, but it's a bit too slow for multiple calls. A speedup of 100x or even 10x would be very helpful!
from scipy.optimize import fmin
from scipy.stats import *

def HDIofICDF(dist_name, credMass=0.95, **args):
    # freeze distribution with given arguments
    distri = dist_name(**args)
    # initial guess for HDIlowTailPr
    incredMass = 1.0 - credMass

    def intervalWidth(lowTailPr):
        return distri.ppf(credMass + lowTailPr) - distri.ppf(lowTailPr)

    # find lowTailPr that minimizes intervalWidth
    HDIlowTailPr = fmin(intervalWidth, incredMass, ftol=1e-8, disp=False)[0]
    # return interval as array([low, high])
    return distri.ppf([HDIlowTailPr, credMass + HDIlowTailPr])
Usage:
print(HDIofICDF(beta, credMass=0.95, a=5, b=4))
Word of caution! Some solutions confuse this HDI with the equal-tailed interval (which the said question calls the "Central Credible Region"), which is far easier to compute but does not answer the same question. (E.g., see Kruschke's "Why HDI and not equal-tailed interval?")
Also, this question does not concern the MCMC approaches that I have seen in PyMC3 (pymc3.stats.hpd(a), where a is a random variable sample), but rather an analytical solution.
Thank you!
Not analytical, but just in case you still find this useful: instead of evaluating the ppf (which can be very slow for many distributions), you can find an approximate solution by evaluating the pdf, something like this.
import numpy as np

def hdi_from_grid(dist, x_vals, mass):
    # approximate HDI: accumulate grid points from most to least probable
    pdf = dist.pdf(x_vals)
    pdf = pdf / pdf.sum()
    idx = np.argsort(pdf)[::-1]
    mass_cum = 0
    indices = []
    for i in idx:
        mass_cum += pdf[i]
        indices.append(i)
        if mass_cum >= mass:
            break
    # the interval is bounded by the lowest and highest retained grid points
    return x_vals[np.sort(indices)[[0, -1]]]
Where mass is a number between 0 and 1, and x_vals is a vector of equally spaced values over the support of the distribution. So for the beta, x_vals = np.linspace(0, 1, size) should do the job. You can control the tradeoff between accuracy and speed by setting a proper value of size for your problem. How much faster this solution is compared with the other one (using the ppf and optimization) will depend on the details of SciPy's ppf and pdf methods (which differ between distributions), and could even depend on the parameters of the distribution.
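For the Beta(5, 4) case from the question, usage could look like the following sketch (hdi_from_grid is simply the name I gave to the snippet above; size controls the grid resolution):
from scipy import stats
import numpy as np

size = 10000
x_vals = np.linspace(0, 1, size)
print(hdi_from_grid(stats.beta(a=5, b=4), x_vals, mass=0.95))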
I have a variable A which is Bernoulli distributed, A = pymc.Bernoulli('A', p_A), but I don't have a hard value for p_A and want to sample for it. I do know that it should be small, so I want to use an exponential distribution, p_A = pymc.Exponential('p_A', 10).
However, the exponential distribution can return values higher than 1, which would throw off A. Is there a way of bounding the output of p_A without having to re-implement either the Bernoulli or the Exponential distributions in my own @pymc.stochastic-decorated function?
You can use a deterministic function to truncate the Exponential distribution. Personally I believe it would be better if you use a distribution that is bound between 0 and 1, but to exactly solve your problem you can do as follows:
import pymc as pm

p_A = pm.Exponential('p_A', 10)

@pm.deterministic
def p_B(p=p_A):
    return min(1, p)

A = pm.Bernoulli('A', p_B)

model = dict(p_A=p_A, p_B=p_B, A=A)
S = pm.MCMC(model)
S.sample(1000)
p_B_trace = S.trace('p_B')[:]
PyMC provides bounds. The following should also work:
p_A = pymc.Bound(pymc.Exponential, upper=1)('p_A', lam=10)
For any other lost souls who come across this:
I think the best solution for my purposes (that is, I was only using the exponential distribution because the probabilities I was looking to generate were probably small, rather than out of mathematical convenience) was to use a Beta distribution instead.
For certain parameter values it approximates the shape of an exponential function (and can do the same for binomials and normals), but it is bounded to [0, 1]. Probably only useful for doing things numerically, though, as I imagine it's a pain to do any analysis with.
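For instance, something like the following sketch (PyMC 2 syntax as in the answers above; the shape parameters 1 and 10 are only illustrative, chosen so that most of the mass sits near zero):
import pymc as pm

# Beta(1, 10) has mean 1/11 and is confined to [0, 1], roughly mimicking
# an exponential prior that favours small probabilities
p_A = pm.Beta('p_A', alpha=1.0, beta=10.0)
A = pm.Bernoulli('A', p_A)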
SciPy's stats module has objects of the type "random variable" (it calls them rv_frozen). They make it easy to plot, say, CDFs of random variables of a given distribution. Here's a very simple example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

n = stats.norm()
x = np.linspace(-3, 3)
y = n.cdf(x)
plt.plot(x, y)
plt.show()
I wondered whether there's a way of doing basic arithmetic manipulations on such random variables. The following example is wishful thinking (it doesn't work):
du_list = [stats.randint(2, 5) for _ in range(100)]
du_avg = sum(du_list) / len(du_list)
x = np.linspace(0, 10)
y = du_avg.cdf(x)
plt.plot(x, y)
This wishful-thinking example should produce the graph of the cumulative distribution function of the random variable which is the average of 100 i.i.d. random variables, each distributed uniformly on the set {2, 3, 4}.
I realize this is a bit late, but I figured I'd answer in case anyone else needs this in the future. I needed the same functionality recently and even thought about extending scipy's rv_discrete to implement this, but then I found PaCAL.
PaCAL is a Python software package for doing arithmetic on random variables and it supports quite a few distributions, including continuous distributions. There is even some support for bivariate joint distributions. Available as a package on PyPI. Only for Python 2.x though.
EDIT: The PaCAL repo on Github now supports Python 3.x as well.
The method that matches your description exactly doesn't exist. The CDFs of the different distributions are all defined in the scipy/stats/distributions.py source file. For example:
Boltzmann distribution CDF (line 7675):
def _cdf(self, x, lambda_, N):
k = floor(x)
return (1-exp(-lambda_*(k+1)))/(1-exp(-lambda_*N))
You can, however, estimate the parameters by MLE and then call the cdf method; see this sample:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss

unknown = np.random.normal(loc=1.1, scale=2.0, size=100)
Loc, Scale = ss.norm.fit_loc_scale(unknown)  # making an MLE fit
unknown_cdf = lambda x: ss.norm.cdf(x, loc=Loc, scale=Scale)  # the cdf of the MLE fit to the data
plt.plot(np.linspace(-10, 10), unknown_cdf(np.linspace(-10, 10)), '-')
plt.show()
You could compute it by hand.
With X the random variable created as the average of the Xi, i.i.d. random variables each distributed uniformly on {2, 3, 4} (the question's randint(2, 5)): sample the distribution of X to obtain the pdf, and integrate to obtain the cdf.
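A minimal Monte Carlo sketch of that sampling approach for the question's example (the average of 100 draws from randint(2, 5)), plotting the empirical CDF:
import numpy as np
import matplotlib.pyplot as plt

n_vars, n_samples = 100, 100000
# each row holds 100 i.i.d. draws from the discrete uniform on {2, 3, 4}
draws = np.random.randint(2, 5, size=(n_samples, n_vars))
avg = draws.mean(axis=1)

# empirical cdf of the average
xs = np.sort(avg)
plt.plot(xs, np.arange(1, n_samples + 1) / n_samples)
plt.show()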
Or you could try to find an analytical solution for this problem.
See the Irwin-Hall distribution
and the related discussion on Math-stackexchange.
Let's assume I have some data I obtained empirically:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?
I can fit a model with:
param = stats.expon.fit(x)
grid = np.linspace(0, 100, 10000)
plt.hist(x, density=True, color='white', hatch='/')
plt.plot(grid, stats.expon.pdf(grid, *param))
The Kolmogorov-Smirnov test is very elegant to compute:
>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)
However, I can't find a good way of calculating the chi-squared test.
There is a chi-squared GoF function in statsmodels, but it assumes a discrete distribution (and the exponential distribution is continuous).
The official scipy.stats tutorial only covers a case for a custom distribution, and the probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and DoF are already obtained.
Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.
An approximate solution for equal-probability bins (a code sketch follows this list):
1) Estimate the parameters of the distribution.
2) Use the inverse cdf, ppf if it's a scipy.stats.distribution, to get the bin edges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args).
3) Use np.histogram to count the number of observations in each bin.
4) Use the chisquare test on the frequencies.
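A rough sketch of those steps for the question's data (n_bins is an arbitrary choice, and ddof accounts only approximately for the two estimated parameters):
import numpy as np
from scipy import stats

size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)

# 1) estimate the parameters
params = stats.expon.fit(x)

# 2) bin edges on a regular probability grid via the inverse cdf (ppf)
n_bins = 30
edges = stats.expon.ppf(np.linspace(0, 1, n_bins + 1), *params)
edges[0], edges[-1] = x.min(), x.max()  # replace the degenerate/infinite extremes

# 3) observed counts per bin
observed, _ = np.histogram(x, bins=edges)

# 4) expected counts are equal by construction; chi-square test on the frequencies
expected = np.full(n_bins, size / n_bins)
print(stats.chisquare(observed, expected, ddof=len(params)))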
An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
This is only approximate, since the theory for the chisquare test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of bin edges based on the data affects the asymptotic distribution.
I haven't looked into this in a long time.
If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.
Why do you need to "verify" that it's exponential? Are you sure you need a statistical test? I can pretty much guarantee that it isn't ultimately exponential, and the test would be significant if you had enough data, making the logic of using the test rather forced. It may help you to read this CV thread: Is normality testing 'essentially useless'?, or my answer here: Testing for heteroscedasticity with many observations.
It is typically better to use a qq-plot and/or pp-plot (depending on whether you are concerned about the fit in the tails or middle of the distribution, see my answer here: PP-plots vs. QQ-plots). Information on how to make qq-plots in Python SciPy can be found in this SO thread: Quantile-Quantile plot using SciPy
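For completeness, a minimal sketch of a QQ-plot against the fitted exponential, reusing x and param from the question (scipy.stats.probplot draws the plot when given a matplotlib module or axes):
import matplotlib.pyplot as plt
from scipy import stats

stats.probplot(x, sparams=param, dist=stats.expon, plot=plt)
plt.show()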
I tried your problem with OpenTURNS.
The beginning is the same:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
If you suspect that your sample x is coming from an Exponential distribution, you can use ot.ExponentialFactory() to fit the parameters:
import openturns as ot
sample = ot.Sample([[p] for p in x])
distribution = ot.ExponentialFactory().build(sample)
As the factory needs an ot.Sample as input, I needed to format x and reshape it as 10,000 points of dimension 1.
Let's now assess this fit using a chi-squared test:
result = ot.FittingTest.ChiSquared(sample, distribution, 0.01)
print('Exponential?', result.getBinaryQualityMeasure(), ', P-value=', result.getPValue())
>>> Exponential? True , P-value= 0.9275212544642293
Very good!
And of course, print(distribution) will give you the fitted parameters:
>>> Exponential(lambda = 0.0982391, gamma = 0.0274607)
I have a set of measured radii (t + epsilon + error) at equally spaced angles.
The model is a circle of radius R with center at (r, Alpha), with small added noise and some random error values which are much bigger than the noise.
The problem is to find the center of the circle model (r, Alpha) and the radius of the circle (R). But it should not be too sensitive to the random errors (in the data below, points 7 and 14).
Some radii could be missing, therefore the simple mean would not work here.
I tried least-squares optimization but it reacts strongly to the errors.
Is there a way to optimize for the least absolute deltas, rather than the least squares of the deltas, in Python?
Model:
n=36
R=100
r=10
Alpha=2*Pi/6
Data points:
[95.85, 92.66, 94.14, 90.56, 88.08, 87.63, 88.12, 152.92, 90.75, 90.73, 93.93, 92.66, 92.67, 97.24, 65.40, 97.67, 103.66, 104.43, 105.25, 106.17, 105.01, 108.52, 109.33, 108.17, 107.10, 106.93, 111.25, 109.99, 107.23, 107.18, 108.30, 101.81, 99.47, 97.97, 96.05, 95.29]
It seems like your main problem here is going to be removing outliers. There are a couple of ways to do this, but for your application your best bet is probably just to remove items based on their distance from the median (since the median is much less sensitive to outliers than the mean).
If you're using numpy, that would look like this:
import numpy as np

def remove_outliers(data_points, margin=1.5):
    nd = np.abs(data_points - np.median(data_points))
    s = nd / np.median(nd)
    return data_points[s < margin]
After which you should run least squares.
If you're not using numpy, you can do something similar with native Python lists:
def median(points):
    return sorted(points)[len(points) // 2]  # lower median for an even number of points

def remove_outliers(data_points, margin=1.5):
    m = median(data_points)
    centered_points = [abs(point - m) for point in data_points]
    centered_median = median(centered_points)
    ratios = [datum / centered_median for datum in centered_points]
    return [point for i, point in enumerate(data_points) if ratios[i] < margin]
If you're looking to just not weight outliers as highly, you can simply calculate the mean of your dataset, which is essentially a linear equivalent of the least-squares optimization.
If you're looking for something a little better, I might suggest throwing your data through some kind of low-pass filter, but I don't think that's really needed here.
A low-pass filter would probably be the best, which you can do as follows: (Note, alpha is a number you will have to fiddle with to get your desired output.)
def low_pass(data, alpha):
    new_data = [data[0]]
    for i in range(1, len(data)):
        new_data.append(alpha * data[i] + (1 - alpha) * new_data[i - 1])
    return new_data
At which point your least squares optimization should work fine.
Replying to your final question
Is there a way to optimize least deltas but not the least squares of delta in Python?
Yes, pick an optimization method (for example downhill simplex implemented in scipy.optimize.fmin) and use the sum of absolute deviations as a merit function; a sketch is given below. Your dataset is small, so I suppose any general-purpose optimization method will converge quickly. (In the case of non-linear least-squares fitting it is also possible to use a general-purpose optimization algorithm, but it's more common to use the Levenberg-Marquardt algorithm, which minimizes sums of squares.)
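A sketch of what that could look like for the data in the question. Note this uses my own reading of the model: the measured radius at angle theta, for a circle of radius R centred at the polar point (r, Alpha), is r*cos(theta - Alpha) + sqrt(R^2 - r^2*sin(theta - Alpha)^2).
import numpy as np
from scipy.optimize import fmin

radii = np.array([95.85, 92.66, 94.14, 90.56, 88.08, 87.63, 88.12, 152.92, 90.75,
                  90.73, 93.93, 92.66, 92.67, 97.24, 65.40, 97.67, 103.66, 104.43,
                  105.25, 106.17, 105.01, 108.52, 109.33, 108.17, 107.10, 106.93,
                  111.25, 109.99, 107.23, 107.18, 108.30, 101.81, 99.47, 97.97,
                  96.05, 95.29])
angles = np.linspace(0, 2 * np.pi, len(radii), endpoint=False)

def model(params, theta):
    # distance from the origin to a circle of radius R centred at the polar point (r, alpha)
    R, r, alpha = params
    return r * np.cos(theta - alpha) + np.sqrt(R**2 - (r * np.sin(theta - alpha))**2)

def sum_abs_dev(params):
    return np.sum(np.abs(radii - model(params, angles)))

R_fit, r_fit, alpha_fit = fmin(sum_abs_dev, [100.0, 5.0, 1.0], disp=False)
print(R_fit, r_fit, alpha_fit)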
If you are interested in whether minimizing absolute deviations instead of squares has theoretical justification, see Numerical Recipes, the chapter on Robust Estimation.
On the practical side, the sum of absolute deviations may not have a unique minimum.
In the trivial case of two points, say (0, 5) and (1, 9), and a constant function y = a, any value of a between 5 and 9 gives the same sum (4). There is no such problem when deviations are squared.
If minimizing absolute deviations does not work, you may consider a heuristic procedure to identify and remove outliers, such as RANSAC or ROUT.