I have a variable A which is Bernoulli distributed, A = pymc.Bernoulli('A', p_A), but I don't have a hard value for p_A and want to sample for it. I do know that it should be small, so I want to use an exponential distribution p_A = pymc.Exponential('p_A', 10).
However, the exponential distribution can return values higher than 1, which would throw off A. Is there a way of bounding the output of p_A without having to re-implement either the Bernoulli or the Exponential distributions in my own #pymc.stochastic-decorated function?
You can use a deterministic function to truncate the Exponential distribution. Personally I believe it would be better if you use a distribution that is bound between 0 and 1, but to exactly solve your problem you can do as follows:
import pymc as pm
p_A = pm.Exponential('p_A',10)
#pm.deterministic
def p_B(p=p_A):
return min(1, p)
A = pm.Bernoulli('A', p_B)
model = dict(p_A=p_A, p_B=p_B, A=A)
S = pm.MCMC(model)
S.sample(1000)
p_B_trace = S.trace('p_B')[:]
PyMC provides bounds. The following should also work:
p_A = pymc.Bound(pymc.Exponential, upper=1)('p_A', lam=10)
For any other lost souls who come across this:
I think the best solution for my purposes (that is, I was only using the exponential distribution because the probabilities I was looking to generate were probably small, rather than out of mathematical convenience) was to use a Beta function instead.
For certain parameter values it approximates the shape of an exponential function (and can do the same for binomials and normals), but is bounded to [0 1]. Probably only useful for doing things numerically, though, as I imagine it's a pain to do any analysis with.
Related
I have some data {x_i,y_i} and I want to fit a model function y=f(x,a,b,c) to find the best fitting values of the parameters (a,b,c); however, the three of them are not totally independent but constraints to 1<b , 0<=c<1 and g(a,b,c)>0, where g is a "good" function. How could I implement this in Python since with curve_fit one cannot put the parametric constraints directly?
I have been reading with lmfit but I only see numerical constraints like 1<b, 0<=c<1 and not the one with g(a,b,c)>0, which is the most important.
If I understand correctly, you have
def g(a,b,c):
c1 = (1.0 - c)
cx = 1/c1
c2 = 2*c1
g = a*a*b*gamma(2+cx)*gamma(cx)/gamma(1+3/c2)-b*b/(1+b**c2)**(1/c2)
return g
If so, and if get the math right, this could be represented as
a = sqrt((g+b*b/(1+b**c2)**(1/c2))*gamma(1+3/c2)/(b*gamma(2+cx)*gamma(cx)))
Which is to say that you could think about your problem as having a variable g which is > 0 and a value for a derived from b, c, and g by the above expression.
And that you can do with lmfit and its expression-based constraint mechanism. You would have to add the gamma function, as with
from lmfit import Parameters
from scipy.special import gamma
params = Parameters()
params._asteval.symtable['gamma'] = gamma
and then set up the parameters with bounds and constraints. I would probably follow the math above to allow better debugging and use something like:
params.add('b', 1.5, min=1)
params.add('c', 0.4, min=0, max=1)
params.add('g', 0.2, min=0)
params.add('c1', expr='1-c')
params.add('cx', expr='1.0/c1')
params.add('c2', expr='2*c1')
params.add('gprod', expr='b*gamma(2+cx)*gamma(cx)/gamma(1+3/c2)')
params.add('bfact', expr='(1+b**c2)**(1/c2)')
params.add('a', expr='sqrt(g+b*b/(bfact*gprod))')
Note that this gives 3 actual variables (now g, b, and c) with plenty of derived values calculated from these, including a. I would certainly check all that math. It looks like you're safe from negative**fractional_power, sqrt(negitive), and gamma(-1), but be aware of these possibilities that will kill the fit.
You could embed all of that into your fitting function, but using constraint expressions gives you the ability to constrain parameter values independently of how the fitting or model function is defined.
Hope that helps. Again, if this does not get to what you are trying to do, post more details about the constraint you are trying to impose.
Like James Phillips, I was going to suggest SciPy's curve_fit. But the way that you have defined your function, one of the constraints is on the function itself, and SciPy's bounds are defined only in terms of input variables.
What, exactly, are the forms of your functions? Can you transform them so that you can use a standard definition of bounds, and then reverse the transformation to give a function in the original form that you wanted?
I have encountered a related problem when trying to fit exponential regressions using SciPy's curve_fit. The parameter search algorithms vary in a linear fashion, and it's really easy to fail to establish a gradient. If I write a function which fits the logarithm of the function I want, it's much easier to make curve_fit work. Then, for my final work, I take the exponent of my fitted function.
This same strategy could work for you. Predict ln(y). The value of that function can be unbounded. Then for your final result, output exp(ln(y)) = y.
I would like to implement a function in python (using numpy) that takes a mathematical function (for ex. p(x) = e^(-x) like below) as input and generates random numbers, that are distributed according to that mathematical-function's probability distribution. And I need to plot them, so we can see the distribution.
I need actually exactly a random number generator function for exactly the following 2 mathematical functions as input, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?
For simple distributions like the ones you need, or if you have an easy to invert in closed form CDF, you can find plenty of samplers in NumPy as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov-Chain Montecarlo sampling methods.
The simplest and maybe easier to understand variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
start from a random point x and take a random step xnew = x + delta
evaluate the desired probability distribution in the starting point p(x) and in the new one p(xnew)
if the new point is more probable p(xnew)/p(x) >= 1 accept the move
if the new point is less probable randomly decide whether to accept or reject depending on how probable1 the new point is
new step from this point and repeat the cycle
It can be shown, see e.g. Sokal2, that points sampled with this method follow the acceptance probability distribution.
An extensive implementation of Montecarlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
def uniform_proposal(x, delta=2.0):
return np.random.uniform(x - delta, x + delta)
def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
x = 1 # start somewhere
for i in range(nsamples):
trial = proposal(x) # random neighbour from the proposal distribution
acceptance = p(trial)/p(x)
# accept the move conditionally
if np.random.uniform() < acceptance:
x = trial
yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)
p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))
p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain where to sample your random steps3
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
Conclusions
There is still way too much to say, about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
The idea here is to favor exploration where the probability is higher but still look at low probability regions as they might lead to other peaks. Fundamental is the choice of the proposal distribution, i.e. how you pick new points to explore. Too small steps might constrain you to a limited area of your distribution, too big could lead to a very inefficient exploration.
Physics oriented. Bayesian formalism (Metropolis-Hastings) is preferred these days but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online, see e.g. this one from Duke university.
Implementation not shown not to add too much confusion, but it's straightforward you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both case, the arguments are optional as these are the default values for these distributions.
As a sidenote, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and input values from a uniform distribution.
inverse_cdf(np.random.uniform())
By example if NumPy did not provide the exponential distribution, you could do this.
def exponential():
return -np.log(-np.random.uniform())
If you encounter distributions which CDF is not easy to compute, then consider filippo's great answer.
I have a function to optimize, which I can't get the derivative or Hessian or Jacobian out of (hence the "black box" in the title). Say my function looks like this:
def my_fun(some_int, some_other_int, some_string):
return float(some_int + some_other_int + len(some_string))
note that I only perform the cast to show that the function returns a floating point number.
The search space / constraints / bounds (or however you call it) would be:
some_int = [1..10] # int interval
some_other_int = [1, 2, 3] # int discrete
some_string = ["methodA", "methodB", "methodC"] #discrete
How should I formulate the problem in python? This is what I've searched so far:
Scipy optimizers don't seem to accept constraints for the Nelder-Mead Simplex (or the Powell method) that is applicable in my case
PyOpt ... well I can't get around using multivariate objective function
There's also Pyswarm. Does anyone know how to do this in Pyswarm?
Any thoughts?
You could use an hyperparameter optimization package such as https://github.com/Dreem-Organization/benderopt/
It supports the different type of your parameters.
Let's assume I have some data I obtained empirically:
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?
I can fit a model with:
param = stats.expon.fit(x)
plt.hist(x, normed=True, color='white', hatch='/')
plt.plot(grid, distr.pdf(np.linspace(0, 100, 10000), *param))
It is very elegant to calculate the Kolmogorov-Smirnov test.
>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)
However, I can't find a good way of calculating the chi-squared test.
There is a chi-squared GoF function in statsmodel, but it assumes a discrete distribution (and the exponential distribution is continuous).
The official scipy.stats tutorial only covers a case for a custom distribution and probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and DoF are already obtained.
Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.
An approximate solution for equal probability bins:
Estimate the parameters of the distribution
Use the inverse cdf, ppf if it's a scipy.stats.distribution, to get the binedges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args)
Then, use np.histogram to count the number of observations in each bin
then use chisquare test on the frequencies.
An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
This is only approximate, since the theory for the chisquare test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of binedges based on the data affects the asymptotic distribution.
I haven't looked into this into a long time.
If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.
Why do you need to "verify" that it's exponential? Are you sure you need a statistical test? I can pretty much guarantee that is isn't ultimately exponential & the test would be significant if you had enough data, making the logic of using the test rather forced. It may help you to read this CV thread: Is normality testing 'essentially useless'?, or my answer here: Testing for heteroscedasticity with many observations.
It is typically better to use a qq-plot and/or pp-plot (depending on whether you are concerned about the fit in the tails or middle of the distribution, see my answer here: PP-plots vs. QQ-plots). Information on how to make qq-plots in Python SciPy can be found in this SO thread: Quantile-Quantile plot using SciPy
I tried you problem with OpenTURNS.
Beginning is the same:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
If you suspect that your sample x is coming from an Exponential distribution, you can use ot.ExponentialFactory() to fit the parameters:
import openturns as ot
sample = ot.Sample([[p] for p in x])
distribution = ot.ExponentialFactory().build(sample)
As Factory needs a an ot.Sample() as input, I needed format x and reshape it as 10.000 points of dimension 1.
Let's now assess this fitting using ChiSquared test:
result = ot.FittingTest.ChiSquared(sample, distribution, 0.01)
print('Exponential?', result.getBinaryQualityMeasure(), ', P-value=', result.getPValue())
>>> Exponential? True , P-value= 0.9275212544642293
Very good!
And of course, print(distribution) will give you the fitted parameters:
>>> Exponential(lambda = 0.0982391, gamma = 0.0274607)
I am trying to fit a polynomial to my data, e.g.
import scipy as sp
x = [1,6,9,17,23,28]
y = [6.1, 7.52324, 5.71, 5.86105, 6.3, 5.2]
and say I know the degree of polynomial (e.g.: 3), then I just use scipy.polyfit method to get the polynomial of a given degree:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fittedModelFunction = sp.polyfit(x, y, 3)
func = sp.poly1d(fittedModelFunction)
++++++++++++++++++++++++++++++
QUESTIONS: ++++++++++++++++++++++++++++++
1) How can I tell in addition that the resulting function func must be always positive (i.e. f(x) >= 0 for any x)?
2) How can I further define a constraint (e.g. number of (local) min and max points, etc.) in order to get a better fitting?
Is there smth like this:
http://mail.scipy.org/pipermail/scipy-user/2007-July/013138.html
but more accurate?
Always Positve
I haven't been able to find a scipy reference that determines if a function is positive-definite, but an indirect way would be to find the all the roots - Scipy Roots - of the function and inspect the limits near those roots. There are a few cases to consider:
No roots at all
Pick any x and evaluate the function. Since the function does not cross the x-axis because of a lack of roots, any positive result will indicate the function is positive!
Finite number of roots
This is probably the most likely case. You would have to inspect the limits before and after each root - Scipy Limits. You would have to specify your own minimum acceptable delta for the limit however. I haven't seen a 2-sided limit method provided by Scipy, but it looks simple enough to make your own.
from sympy import limit
// f: function, v: variable to limit, p: point, d: delta
// returns two limit values
def twoSidedLimit(f, v, p, d):
return limit(f, v, p-d), limit(f, v, p+d)
Infinite roots
I don't think that polyfit would generate an oscillating function, but this is something to consider. I don't know how to handle this with the method I have already offered... Um, hope it does not happen?
Constraints
The only built-in form of constraints seems to be limited to the optimize library of SciPy. A crude way to enforce constraints for polyfit would be to get the function from polyfit, generate a vector of values for various x, and try to select values from the vector that violate the constraint. If you try to use filter, map, or lambda it may be slow with large vectors since python's filter makes a copy of the list/vector being filtered. I can't really help in this regard.