A generator for multivariate normal variates in Python - python

I want to generate samples from a multivariate normal distribution with given mean and covariance, which, of course, is possible with numpy.random.multivariate_normal. But I want to generate a (philosophically) infinite stream of such things, and so I want to define a multivariate normal generator mvn so that mvn.next() produces another random vector with given mean and covariance. Of course, I can just keep calling numpy.random.multivariate_normal(mean, cov, 1), but that is extremely inefficient (I would be recomputing the decomposition of the covariance matrix on each call). I can also implement this from scratch myself, but it seems like something like this should already exist...

Just to establish some ground truth, this is how one might implement it:
from collections.abc import Generator
import numpy as np
class multinorm(Generator):
    def __init__(self, themean, themat):
        # Decompose the covariance once: themat = cmat @ diag(eigs) @ cmat.T
        self.eigs, self.cmat = np.linalg.eigh(themat)
        self.meanvec = themean
        self.thedim = self.meanvec.shape[0]
        self.themult = np.diag(np.sqrt(self.eigs))
    def send(self, ignored_arg):
        tmpvec = np.random.randn(self.thedim)
        # cmat @ sqrt(diag(eigs)) maps standard normals to covariance themat
        return (self.cmat @ self.themult) @ tmpvec + self.meanvec
    def throw(self, type=None, value=None, traceback=None):
        raise StopIteration
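For comparison, a plain generator function gives the same "factor once, draw forever" behaviour with less machinery. This is only a minimal sketch, not an existing library API: the name mvn_stream and its seed parameter are made up for illustration, and it swaps the eigendecomposition for a Cholesky factor, which assumes the covariance matrix is strictly positive definite.

```python
import numpy as np

def mvn_stream(mean, cov, seed=None):
    """Yield an endless stream of draws from N(mean, cov).

    Hypothetical helper: the covariance is factored once, up front,
    and every draw afterwards costs one matrix-vector product.
    """
    mean = np.asarray(mean, dtype=float)
    # Cholesky factor L with L @ L.T == cov (requires positive definite cov).
    chol = np.linalg.cholesky(np.asarray(cov, dtype=float))
    rng = np.random.default_rng(seed)
    while True:
        yield mean + chol @ rng.standard_normal(mean.shape[0])

mvn = mvn_stream([0.0, 1.0], [[2.0, 0.5], [0.5, 1.0]], seed=0)
first = next(mvn)   # one random vector of length 2
second = next(mvn)  # another, without refactoring the covariance
```

In Python 3 the stream is advanced with next(mvn) rather than mvn.next().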

Related

ODE with non-analytical time-dependent parameters in PyMC3

I'm working on solving the following ODE with PyMC3:
def production( y, t, p ):
    return p[0]*getBeam( t ) - p[1]*y[0]
getBeam( t ) is my time-dependent coefficient. The coefficients are given by an array of data which is accessed by the time index as follows:
def getBeam( t ):
    nBeam = I[int(t/10)]*pow( 10, -6 )/q_e
    return nBeam
I have successfully implemented it using scipy.integrate.odeint, but I am having a hard time doing the same with pymc3.ode. In fact, using the following:
ode_model = DifferentialEquation(func=production, times=x, n_states=1, n_theta=3, t0=0)
with pm.Model() as model:
    a = pm.Uniform( "S-Factor", lower=0.01, upper=100 )
    ode_solution = ode_model(y0=[0], theta=[a, Yield, lambd])
I obviously get the error TypeError: __trunc__ returned non-Integral (type TensorVariable), because t is a TensorVariable and thus cannot be used to index the array in which the coefficients are stored.
Is there a way to overcome this difficulty? I thought about using theano.function, but I cannot get it working since, unfortunately, the coefficients cannot be expressed by any analytical function: they are just stored in the array I, whose index represents the time variable t.
Thanks
Since you already have a working implementation with scipy.integrate.odeint, you could use theano.compile.ops.as_op, though it comes with some inconveniences (see how to fit a method belonging to an instance with pymc3? and How to write a custom Deterministic or Stochastic in pymc3 with theano.op?)
Using your exact definitions for production and getBeam, the following code seems to work for me:
from scipy.integrate import odeint
from theano.compile.ops import as_op
import theano.tensor as tt
import pymc3 as pm
# production, getBeam, x, Yield and lambd are the definitions from the question
def ScipySolveODE(a):
    return odeint(production, y0=[0], t=x, args=([a, Yield, lambd],)).flatten()
# as_op wraps the plain-Python solver so it accepts and returns Theano variables
@as_op(itypes=[tt.dscalar], otypes=[tt.dvector])
def TheanoSolveODE(a):
    return ScipySolveODE(a)
with pm.Model() as model:
    a = pm.Uniform( "S-Factor", lower=0.01, upper=100 )
    ode_solution = TheanoSolveODE(a)
Sorry I know this is more of a workaround than an actual solution...
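The SciPy half of this workaround can be checked on its own, without PyMC3 or Theano. The sketch below is an illustration under stated assumptions, not the asker's setup: the current array I, the time grid x, the decay constant and the two-element parameter list are all hypothetical placeholders. The index clipping in getBeam is also an addition, included because the solver may probe times slightly outside the tabulated range.

```python
import numpy as np
from scipy.integrate import odeint

# All numbers below are hypothetical placeholders, not the asker's data.
q_e = 1.602176634e-19              # elementary charge in coulombs
I = np.full(40, 2.0)               # beam current samples, one per 10 time units
x = np.linspace(0.0, 390.0, 40)    # output times

def getBeam(t):
    # Table lookup by time index; clip so out-of-range probes stay valid.
    idx = min(max(int(t / 10), 0), len(I) - 1)
    return I[idx] * 1e-6 / q_e

def production(y, t, p):
    return p[0] * getBeam(t) - p[1] * y[0]

def solve(a, decay=0.01):
    # Plain floats in, plain NumPy array out: the kind of function as_op wraps.
    return odeint(production, y0=[0.0], t=x, args=([a, decay],)).flatten()

solution = solve(1e-13)
```

Because the lookup runs in ordinary Python, int(t/10) receives a float here rather than a TensorVariable.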

lmfit for exponential data returns linear function

I'm working on fitting muon lifetime data to a curve to extract the mean lifetime using the lmfit module. The general process I'm using is to bin the 13,000 data points into 10 bins using the histogram function, calculate the uncertainty as the square root of the counts in each bin (it's an exponential model), then use lmfit to determine the best fit along with means and uncertainties. However, graphing the output of the model.fit() method gives this graph, where the red line is the fit (and obviously not the correct fit). [Figure: fit result output graph]
I've looked online and can't find a solution to this, I'd really appreciate some help figuring out what's going on. Here's the code.
import os
import numpy as np
import matplotlib.pyplot as plt
from numpy import sqrt, pi, exp, linspace
from lmfit import Model
class data():
    def __init__(self,file_name):
        times_dirty = sorted(np.genfromtxt(file_name, delimiter=' ',unpack=False)[:,0])
        self.times = []
        for i in range(len(times_dirty)):
            if times_dirty[i]<40000:
                self.times.append(times_dirty[i])
        self.counts = []
        self.binBounds = []
        self.uncertainties = []
        self.means = []
    def binData(self,k):
        self.counts, self.binBounds = np.histogram(self.times, bins=k)
        self.binBounds = self.binBounds[:-1]
    def calcStats(self):
        if len(self.counts)==0:
            print('Run binData function first')
        else:
            self.uncertainties = sqrt(self.counts)
    def plotData(self,fit):
        plt.errorbar(self.binBounds, self.counts, yerr=self.uncertainties, fmt='bo')
        plt.plot(self.binBounds, fit.init_fit, 'k--')
        plt.plot(self.binBounds, fit.best_fit, 'r')
        plt.show()
def decay(t, N, lamb, B):
    return N * lamb * exp(-lamb * t) +B
def main():
    # raw string so the backslashes in the Windows path are not read as escapes
    muonEvents = data(r'C:\Users\Colt\Downloads\muon.data')
    muonEvents.binData(10)
    muonEvents.calcStats()
    mod = Model(decay)
    result = mod.fit(muonEvents.counts, t=muonEvents.binBounds, N=1, lamb=1, B = 1)
    muonEvents.plotData(result)
    print(result.fit_report())
    print (len(muonEvents.times))
if __name__ == "__main__":
    main()
This might be a simple scaling problem. As a quick test, try dividing all raw data by a factor of 1000 (both X and Y) to see if changing the magnitude of the data has any effect.
Just to build on James Phillips's answer, I think the data you show in your graph imply values for N, lamb, and B that are very different from 1, 1, 1. Keep in mind that exp(-lamb*t) is essentially 0 for lamb = 1 and t > 100. So, if the algorithm starts at lamb=1 and varies it by a little bit to find a better value, it won't actually be able to see any difference in how well the model matches the data.
I would suggest trying to start with values that are more reasonable for the data you have, perhaps N=1.e6, lamb=1.e-4, and B=100.
As James suggested, keeping the variables on the order of 1, with scale factors put in as necessary, often helps in getting numerically stable solutions.
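The flat-gradient argument can be checked numerically without running a fit. The sketch below evaluates the question's decay model on a set of hypothetical bin edges (the real edges depend on the data file, which is not shown) and compares how much the model output changes when lamb is nudged by 0.1% at each starting point.

```python
import numpy as np

def decay(t, N, lamb, B):
    return N * lamb * np.exp(-lamb * t) + B

# Hypothetical left bin edges, standing in for the histogram of the real data.
t = np.linspace(500.0, 36500.0, 10)

# Starting point from the question: the exponential term vanishes in floating
# point, so nudging lamb leaves the model output bit-for-bit unchanged.
flat_a = decay(t, 1.0, 1.0, 1.0)
flat_b = decay(t, 1.0, 1.001, 1.0)
print(np.array_equal(flat_a, flat_b))

# Suggested starting point: the same relative nudge now visibly moves the model,
# so the optimizer has a slope to follow.
good_a = decay(t, 1.0e6, 1.0e-4, 100.0)
good_b = decay(t, 1.0e6, 1.001e-4, 100.0)
print(np.max(np.abs(good_a - good_b)))
```

With lamb=1 the optimizer has no gradient in lamb to follow on these points, which is consistent with a fit that comes back as a flat line.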

Using scipy.optimize.curve_fit within a class

I have a class describing a mathematical function. The class needs to be able to least squares fit itself to passed in data. i.e. you can call a method like this:
classinstance.Fit(x,y)
and it adjusts its internal variables to best fit the data. I'm trying to use scipy.optimize.curve_fit for this, and it needs me to pass in a model function. The problem is that the model function is within the class and needs to access the variables and members of the class to compute the data. However, curve_fit can't call a function whose first parameter is self. Is there a way to make curve_fit use a method of the class as its model function?
Here is a minimum executable snippet to show the issue:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# This is a class which encapsulates a gaussian and fits itself to data.
class GaussianComponent():
    # This is a formula string showing the exact code used to produce the gaussian.
    # It has to be printed for the user, and it can be used to compute values.
    Formula = 'self.Amp*np.exp(-((x-self.Center)**2/(self.FWHM**2*np.sqrt(2))))'
    # These parameters describe the gaussian.
    Center = 0
    Amp = 1
    FWHM = 1
    # HERE IS THE CONUNDRUM: IF I LEAVE SELF IN THE DECLARATION, CURVE_FIT
    # CANNOT CALL IT SINCE IT REQUIRES THE WRONG NUMBER OF PARAMETERS.
    # IF I REMOVE IT, FITFUNC CAN'T ACCESS THE CLASS VARIABLES.
    def FitFunc(self, x, y, Center, Amp, FWHM):
        eval('y - ' + self.Formula.replace('self.', ''))
    # This uses curve_fit to adjust the gaussian parameters to best match the
    # data passed in.
    def Fit(self, x, y):
        #FitFunc = lambda x, y, Center, Amp, FWHM: eval('y - ' + self.Formula.replace('self.', ''))
        FitParams, FitCov = curve_fit(self.FitFunc, x, y, (self.Center, self.Amp, self.FWHM))
        self.Center = FitParams[0]
        self.Amp = FitParams[1]
        self.FWHM = FitParams[2]
    # Give back a vector which describes what this gaussian looks like.
    def GetPlot(self, x):
        y = eval(self.Formula)
        return y
# Make a gaussian with default shape and position (height 1 at the origin, FWHM 1).
g = GaussianComponent()
# Make a space in which we can plot the gaussian.
x = np.linspace(-5,5,100)
y = g.GetPlot(x)
# Make some "experimental data" which is just the default shape, noisy, and
# moved up the y axis a tad so the best fit will be different.
ynoise = y + np.random.normal(loc=0.1, scale=0.1, size=len(x))
# Draw it
plt.plot(x,y, x,ynoise)
plt.show()
# Do the fit (but this doesn't work...)
g.Fit(x,y)
And this produces the following graph and then crashes since the model function is incorrect when it tries to do the fit.
Thanks in advance!
I spent some time looking at your code and unfortunately turned up 2 minutes late. Anyhow, to make things a bit more interesting I've edited your class a bit. Here's what I concocted:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
class GaussianComponent():
    def __init__(self, func, params=None):
        self.formula = func
        self.params = params
    def eval(self, x, *values):
        # curve_fit calls eval(x, *trial_values); with no values given,
        # fall back to the stored parameters
        values = values or tuple(self.params.values())
        allowed_locals = dict(zip(self.params, values))
        allowed_locals["x"] = x
        allowed_globals = {"np":np}
        return eval(self.formula, allowed_globals, allowed_locals)
    def Fit(self, x, y):
        # p0 must be a sequence of numbers, in the same order as the dict keys
        FitParams, FitCov = curve_fit(self.eval, x, y, p0=list(self.params.values()))
        self.fitparams = FitParams
# Make a gaussian with default shape and position (height 1 at the origin, FWHM 1).
g = GaussianComponent("Amp*np.exp(-((x-Center)**2/(FWHM**2*np.sqrt(2))))",
                      params={"Amp":1, "Center":0, "FWHM":1})
**SNIPPED FOR BREVITY**
I believe you'll perhaps find this a more satisfying solution?
Currently all your Gaussian parameters are class attributes, which means every instance shares the same defaults, and changing a value on the class itself (for example GaussianComponent.Center = 2) changes it for every instance that has not set its own. Storing the parameters as instance attributes avoids that coupling; keeping per-object state is what classes are for in the first place.
Your issue with self stems from the fact that you write self in your Formula. Now you don't have to any more. I think it makes a bit more sense like this, because when you instantiate an object of the class you can add as many or as few params to your declared function as you want. It doesn't even have to be Gaussian now (unlike before).
Just throw all params to a dictionary, just like curve_fit does and forget about them.
By explicitly stating what eval can use, you help make sure that any evil-doers have a harder time breaking your code. It's still possible though, it always is with eval.
Good luck, ask if you need something clarified.
Ahh! It was actually a bug in my code. If I change this line:
def FitFunc(self, x, y, Center, Amp, FWHM):
to
def FitFunc(self, x, Center, Amp, FWHM):
Then we are fine. So curve_fit does correctly handle the self parameter but my model function shouldn't include y.
(Embarrassed!)
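To make that resolution concrete, here is a minimal sketch of a class that passes one of its own bound methods to curve_fit. The class and method names (GaussianFit, model, fit) are invented for illustration, and the formula is written out directly instead of going through eval. The key point is that the model takes x followed by the parameters only; y is supplied to curve_fit separately.

```python
import numpy as np
from scipy.optimize import curve_fit

class GaussianFit:
    """Hypothetical stand-in for GaussianComponent that fits itself to data."""

    def __init__(self, center=0.0, amp=1.0, fwhm=1.0):
        self.center = center
        self.amp = amp
        self.fwhm = fwhm

    def model(self, x, center, amp, fwhm):
        # Same functional form as the Formula string in the question.
        return amp * np.exp(-((x - center) ** 2 / (fwhm ** 2 * np.sqrt(2))))

    def fit(self, x, y):
        # self.model is a bound method, so curve_fit only sees (x, center, amp, fwhm).
        p0 = (self.center, self.amp, self.fwhm)
        params, cov = curve_fit(self.model, x, y, p0=p0)
        self.center, self.amp, self.fwhm = params
        return params

x = np.linspace(-5, 5, 100)
truth = GaussianFit(center=0.3, amp=2.0, fwhm=1.5)
y = truth.model(x, truth.center, truth.amp, truth.fwhm)  # noise-free test data

g = GaussianFit()
g.fit(x, y)
```

Since fwhm only enters the formula squared, a fit may in principle return it with either sign.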

SciPy: generating custom random variable from PMF

I'm trying to generate random variables according to a certain ugly distribution, in Python. I have an explicit expression for the PMF, but it involves some products which makes it unpleasant to obtain and invert the CDF (see below code for explicit form of PMF).
In essence, I'm trying to define a random variable in Python by its PMF and then have built-in code do the hard work of sampling from the distribution. I know how to do this if the support of the RV is finite, but here the support is countably infinite.
The code I am currently trying to run as per @askewchan's advice below is:
import numpy as np
from scipy import stats
class limitrv_gen(stats.rv_discrete):
    def _pmf(self,k,param):
        num = np.arange(1+param, k+param, 1)
        denom = np.arange(3+2*param, k+3+2*param, 1)
        p = (2+param)*(np.prod(num)/np.prod(denom))
        return p
pa_limit = limitrv_gen()
print(pa_limit.rvs(alpha, size=1))
However, this returns the error while running:
File "limiting_sim.py", line 42, in _pmf
num = np.arange(1+param, k+param, 1)
TypeError: only length-1 arrays can be converted to Python scalars
Basically, it seems that the np.arange() list isn't working somehow inside the def _pmf() function. I'm at a loss to see why. Can anyone enlighten me here and/or point out a fix?
EDIT 1: cleared up some questions by askewchan, edits reflected above.
EDIT 2: askewchan suggested an interesting approximation using the factorial function, but I'm looking more for an exact solution such as the one that I'm trying to get work with np.arange.
You should be able to subclass rv_discrete like so:
class mydist_gen(rv_discrete):
    def _pmf(self, n, param):
        return yourpmf(n, param)
Then you can create a distribution instance with:
mydist = mydist_gen()
And generate samples with:
mydist.rvs(param, size=1000)
Or you can then create a frozen distribution object with:
mydistp = mydist(param)
And finally generate samples with:
mydistp.rvs(1000)
With your example, this should work, since factorial automatically broadcasts. But, it might fail for large enough alpha:
import numpy as np
from scipy import stats
from scipy.special import factorial  # was scipy.misc.factorial in older SciPy releases
class limitrv_gen(stats.rv_discrete):
    def _pmf(self, k, alpha):
        #num = np.prod(np.arange(1+alpha, k+alpha))
        num = factorial(k+alpha-1) / factorial(alpha)
        #denom = np.prod(np.arange(3+2*alpha, k+3+2*alpha))
        denom = factorial(k + 2 + 2*alpha) / factorial(2 + 2*alpha)
        return (2+alpha) * num / denom
pa_limit = limitrv_gen()
alpha = 100
pa_limit.rvs(alpha, size=10)
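Two things are worth spelling out here. First, the TypeError in the question arises because SciPy calls _pmf with k as an array, and np.arange needs scalar endpoints. Second, the factorials overflow double precision once their argument passes about 170, so the ratio becomes inf/inf for large alpha. One way around both, sketched below, is to compute the same products through log-gamma functions, which broadcast over arrays and stay finite. This is a suggestion rather than part of the original answer, and the a=1 argument reflects an assumption that the support starts at k = 1.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

class limitrv_gen(stats.rv_discrete):
    def _pmf(self, k, alpha):
        # log of prod(arange(1+alpha, k+alpha)) = lgamma(k+alpha) - lgamma(1+alpha)
        log_num = gammaln(k + alpha) - gammaln(1 + alpha)
        # log of prod(arange(3+2*alpha, k+3+2*alpha))
        #   = lgamma(k+3+2*alpha) - lgamma(3+2*alpha)
        log_denom = gammaln(k + 3 + 2 * alpha) - gammaln(3 + 2 * alpha)
        # Subtract in log space first, then exponentiate the (small) ratio.
        return (2 + alpha) * np.exp(log_num - log_denom)

# a=1 sets the lower end of the support (assumed here to be k = 1).
pa_limit = limitrv_gen(a=1, name="limitrv")
alpha = 100
samples = pa_limit.rvs(alpha, size=10, random_state=0)
total = pa_limit.pmf(np.arange(1, 501), alpha).sum()  # should be very close to 1
```

The generic rvs still inverts the CDF numerically by summing the PMF, so sampling can be slow for distributions with heavy tails.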

Creating new distributions in scipy

I'm trying to create a distribution based on some data I have, then draw randomly from that distribution. Here's what I have:
from scipy import stats
import numpy
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.inf, x)
    return rv()
if __name__ == "__main__":
    # pretend this is real data
    data = numpy.concatenate((numpy.random.normal(2,5,100), numpy.random.normal(25,5,100)))
    d = getDistribution(data)
    print(d.rvs(size=100)) # this usually fails
I think this is doing what I want it to, but I frequently get an error (see below) when I try to do d.rvs(), and d.rvs(100) never works. Am I doing something wrong? Is there an easier or better way to do this? If it's a bug in scipy, is there some way to get around it?
Finally, is there more documentation on creating custom distributions somewhere? The best I've found is the scipy.stats.rv_continuous documentation, which is pretty spartan and contains no useful examples.
The traceback:
Traceback (most recent call last):
  File "testDistributions.py", line 19, in <module>
    print d.rvs(size=100)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 696, in rvs
    vals = self._rvs(*args)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1193, in _rvs
    Y = self._ppf(U,*args)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1212, in _ppf
    return self.vecfunc(q,*args)
  File "/usr/local/lib/python2.6/dist-packages/numpy-1.6.1-py2.6-linux-x86_64.egg/numpy/lib/function_base.py", line 1862, in __call__
    theout = self.thefunc(*newargs)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1158, in _ppf_single_call
    return optimize.brentq(self._ppf_to_solve, self.xa, self.xb, args=(q,)+args, xtol=self.xtol)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/optimize/zeros.py", line 366, in brentq
    r = _zeros._brentq(f,a,b,xtol,maxiter,args,full_output,disp)
ValueError: f(a) and f(b) must have different signs
Edit
For those curious, following the advice in the answer below, here's code that works:
from scipy import stats
import numpy
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _rvs(self, *x, **y):
            # don't ask me why it's using self._size
            # nor why I have to cast to int
            return kernel.resample(int(self._size))
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.inf, x)
        def _pdf(self, x):
            return kernel.evaluate(x)
    return rv(name='kdedist', xa=-200, xb=200)
Specifically to your traceback:
rvs uses the inverse of the cdf, ppf, to create random numbers. Since you are not specifying ppf, it is calculated by a root-finding algorithm, brentq. brentq uses lower and upper bounds on where it should search for the value at which the function is zero (find x such that cdf(x)=q, where q is the quantile).
The defaults for the limits, xa and xb, are too small in your example. The following works for me with scipy 0.9.0; xa and xb can be set when creating the distribution instance:
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.inf, x)
    return rv(name='kdedist', xa=-200, xb=200)
There is currently a pull request for scipy to improve this, so in the next release xa and xb will be expanded automatically to avoid the f(a) and f(b) must have different signs exception.
There is not much documentation on this; the easiest approach is to follow some examples (and ask on the mailing list).
edit: addition
pdf: Since you have the density function also given by gaussian_kde, I would add the _pdf method, which will make some calculations more efficient.
edit2: addition
rvs: If you are interested in generating random numbers, then gaussian_kde has a resample method. Random Samples can be generated by sampling from the data and adding gaussian noise. So, this will be faster than the generic rvs using the ppf method. I would write a ._rvs method that just calls gaussian_kde's resample method.
precomputing ppf: I don't know of any general way to precompute the ppf. However, the way I thought of doing it (but never tried so far) is to precompute the ppf at many points and then use linear interpolation to approximate the ppf function.
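The interpolation idea can be sketched directly on top of gaussian_kde, outside the rv_continuous machinery. The code below is an illustration under stated assumptions rather than a tested recipe from the answer: the grid size of 512, the three-bandwidth padding, and the helper name approx_ppf are all arbitrary choices made here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in data, mirroring the two-component mixture from the question.
data = np.concatenate((rng.normal(2, 5, 100), rng.normal(25, 5, 100)))
kernel = stats.gaussian_kde(data)

# Tabulate the CDF once on a grid that extends a few bandwidths past the data.
bandwidth = np.sqrt(kernel.covariance[0, 0])
grid = np.linspace(data.min() - 3 * bandwidth, data.max() + 3 * bandwidth, 512)
cdf_grid = np.array([kernel.integrate_box_1d(-np.inf, g) for g in grid])

def approx_ppf(q):
    # Invert the tabulated CDF by linear interpolation. Quantiles outside
    # [cdf_grid[0], cdf_grid[-1]] are clamped to the ends of the grid.
    return np.interp(q, cdf_grid, grid)

# Inverse-transform sampling through the interpolated ppf ...
samples = approx_ppf(rng.uniform(size=1000))
# ... versus the built-in resample method, which returns shape (1, size) for 1-D data.
direct = kernel.resample(1000)[0]
```

The tabulation costs one pass of CDF evaluations up front; after that each quantile lookup is a single interpolation, at the price of truncating the extreme tails.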
edit3: about _rvs to answer Srivatsan's question in the comment
_rvs is the distribution-specific method that is called by the public method rvs. rvs is a generic method that does some argument checking, adds location and scale, and sets the attribute self._size, which is the size of the requested array of random variables, and then calls the distribution-specific method ._rvs or its generic counterpart. The extra arguments in ._rvs are shape parameters, but since there are none in this case, *x and **y are redundant and unused.
I don't know how well the size or shape of the .rvs method works in the multivariate case. These distributions are designed for univariate distributions, and might not fully work for the multivariate case, or might need some reshapes.
