I'm trying to create a distribution based on some data I have, then draw randomly from that distribution. Here's what I have:
from scipy import stats
import numpy

def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
    return rv()

if __name__ == "__main__":
    # pretend this is real data
    data = numpy.concatenate((numpy.random.normal(2, 5, 100),
                              numpy.random.normal(25, 5, 100)))
    d = getDistribution(data)
    print d.rvs(size=100)  # this usually fails
I think this is doing what I want it to, but I frequently get an error (see below) when I try to do d.rvs(), and d.rvs(100) never works. Am I doing something wrong? Is there an easier or better way to do this? If it's a bug in scipy, is there some way to get around it?
Finally, is there more documentation on creating custom distributions somewhere? The best I've found is the scipy.stats.rv_continuous documentation, which is pretty spartan and contains no useful examples.
The traceback:
Traceback (most recent call last):
  File "testDistributions.py", line 19, in <module>
    print d.rvs(size=100)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 696, in rvs
    vals = self._rvs(*args)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1193, in _rvs
    Y = self._ppf(U,*args)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1212, in _ppf
    return self.vecfunc(q,*args)
  File "/usr/local/lib/python2.6/dist-packages/numpy-1.6.1-py2.6-linux-x86_64.egg/numpy/lib/function_base.py", line 1862, in __call__
    theout = self.thefunc(*newargs)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1158, in _ppf_single_call
    return optimize.brentq(self._ppf_to_solve, self.xa, self.xb, args=(q,)+args, xtol=self.xtol)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/optimize/zeros.py", line 366, in brentq
    r = _zeros._brentq(f,a,b,xtol,maxiter,args,full_output,disp)
ValueError: f(a) and f(b) must have different signs
Edit
For those curious, following the advice in the answer below, here's code that works:
from scipy import stats
import numpy

def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _rvs(self, *x, **y):
            # don't ask me why it's using self._size
            # nor why I have to cast to int
            return kernel.resample(int(self._size))
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
        def _pdf(self, x):
            return kernel.evaluate(x)
    return rv(name='kdedist', xa=-200, xb=200)
Specifically to your traceback:
rvs uses the inverse of the cdf, the ppf, to create random numbers. Since you are not specifying a ppf, it is calculated by a rootfinding algorithm, brentq. brentq uses lower and upper bounds on where it should search for the value at which the function is zero (find x such that cdf(x) = q, where q is the quantile).
The defaults for the limits, xa and xb, are too small in your example. The following works for me with scipy 0.9.0; xa and xb can be set when creating the instance of the distribution:
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
    return rv(name='kdedist', xa=-200, xb=200)
There is currently a pull request for scipy to improve this, so in the next release xa and xb will be expanded automatically to avoid the f(a) and f(b) must have different signs exception.
There is not much documentation on this; the easiest approach is to follow some examples (and to ask on the mailing list).
edit: addition
pdf: Since you also have the density function given by gaussian_kde, I would add the _pdf method, which will make some calculations more efficient.
edit2: addition
rvs: If you are interested in generating random numbers, then gaussian_kde has a resample method. Random samples can be generated by sampling from the data and adding Gaussian noise, so this will be faster than the generic rvs, which uses the ppf method. I would write a ._rvs method that just calls gaussian_kde's resample method.
precomputing ppf: I don't know of any general way to precompute the ppf. However, the way I thought of doing it (but have never tried so far) is to precompute the ppf at many points and then use linear interpolation to approximate the ppf function.
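For illustration, a minimal sketch of that idea (untested; the grid bounds and resolution here are arbitrary and must cover essentially all of the data's support, or uniform draws can fall outside the interpolation range):

from scipy import stats, interpolate
import numpy

def getApproxPPF(data, xa=-200, xb=200, npoints=1000):
    # evaluate the kde's cdf on a grid, then interpolate x as a function
    # of cdf(x); that inverse is an approximate ppf
    kernel = stats.gaussian_kde(data)
    grid = numpy.linspace(xa, xb, npoints)
    cdfvals = numpy.array([kernel.integrate_box_1d(-numpy.inf, x)
                           for x in grid])
    return interpolate.interp1d(cdfvals, grid)

# random draws via the inverse-cdf method:
# ppf = getApproxPPF(data)
# samples = ppf(numpy.random.uniform(size=100))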
edit3: about _rvs to answer Srivatsan's question in the comment
_rvs is the distribution-specific method that is called by the public method rvs. rvs is a generic method that does some argument checking, adds location and scale, and sets the attribute self._size, which is the size of the requested array of random variables; it then calls the distribution-specific method ._rvs or its generic counterpart. The extra arguments in ._rvs are shape parameters, but since there are none in this case, *x and **y are redundant and unused.
I don't know how well the size or shape arguments of the .rvs method work in the multivariate case. These distributions are designed for univariate distributions and might not fully work for the multivariate case, or might need some reshaping.
Related
I'm trying to solve for two unknown parameters in my function using scipy.optimize.curve_fit. The equation I used is as follows:
y = omiga * exp(-pi * f * tstar) / (1 + (f / 18.15)^2)
My code is as follows:
p_freqs =np.array(0.,8.19672131,16.39344262,24.59016393,32.78688525,
40.98360656,49.18032787,57.37704918,65.57377049,73.7704918,
81.96721311,90.16393443,98.36065574,106.55737705,114.75409836,
122.95081967,131.14754098,139.3442623, 147.54098361,155.73770492,
163.93442623,172.13114754,180.32786885,188.52459016,196.72131148,
204.91803279,213.1147541, 221.31147541,229.50819672,237.70491803,
245.90163934)
p_fft_amp1 = np.array(3.34278536e-08,5.73549829e-08,1.94897033e-08,1.59088184e-08,
9.23948302e-09,3.71198908e-09,1.85535722e-09,1.86064653e-09,
1.52149363e-09,1.33626573e-09,1.19468040e-09,1.08304535e-09,
9.96594475e-10,9.25671797e-10,8.66775330e-10,8.17287132e-10,
7.75342888e-10,7.39541296e-10,7.08843676e-10,6.82440637e-10,
6.59712650e-10,6.40169517e-10,6.23422124e-10,6.09159901e-10,
5.97134297e-10,5.87146816e-10,5.79040074e-10,5.72691200e-10,
5.68006964e-10,5.64920239e-10,5.63387557e-10)
def cal_omiga_tstar(omiga,tstar,f):
    return omiga*np.exp(-np.pi*f*tstar)/(1+(f/18.15)**2)

omiga,tstar = optimize.curve_fit(cal_omiga_tstar,p_freqs,p_fft_amp1)[0]
When I run the code I get the following warning:
OptimizeWarning: Covariance of the parameters could not be estimated
  warnings.warn('Covariance of the parameters could not be estimated'
I couldn't pinpoint the exact cause of your error message because your code has some errors before that point. First, the construction of the two arrays is invalid syntax; second, your definition of cal_omiga_tstar has the wrong argument order. While fixing these problems I did get your error message once, but weirdly enough I haven't been able to reproduce it since. However, I did manage to fit your function. You should supply initial guesses for the parameters, especially since your y has so many low values. There's no magic here: just plot your model and data until they're relatively close, then let the algorithm take the wheel.
Here's my code:
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np

# Changed here, was "np.array(0.,..."
p_freqs = np.array([0.,8.19672131,16.39344262,24.59016393,32.78688525,
                    40.98360656,49.18032787,57.37704918,65.57377049,73.7704918,
                    81.96721311,90.16393443,98.36065574,106.55737705,114.75409836,
                    122.95081967,131.14754098,139.3442623,147.54098361,155.73770492,
                    163.93442623,172.13114754,180.32786885,188.52459016,196.72131148,
                    204.91803279,213.1147541,221.31147541,229.50819672,237.70491803,
                    245.90163934])
p_fft_amp1 = np.array([3.34278536e-08,5.73549829e-08,1.94897033e-08,1.59088184e-08,
                       9.23948302e-09,3.71198908e-09,1.85535722e-09,1.86064653e-09,
                       1.52149363e-09,1.33626573e-09,1.19468040e-09,1.08304535e-09,
                       9.96594475e-10,9.25671797e-10,8.66775330e-10,8.17287132e-10,
                       7.75342888e-10,7.39541296e-10,7.08843676e-10,6.82440637e-10,
                       6.59712650e-10,6.40169517e-10,6.23422124e-10,6.09159901e-10,
                       5.97134297e-10,5.87146816e-10,5.79040074e-10,5.72691200e-10,
                       5.68006964e-10,5.64920239e-10,5.63387557e-10])

# Changed argument order from "omiga, tstar, f" to "f, omiga, tstar":
# curve_fit expects the independent variable first
def cal_omiga_tstar(f, omiga, tstar):
    return omiga*np.exp(-np.pi*f*tstar)/(1+(f/18.15)**2)

# Changed this call to get popt, pcov, and supplied the initial guesses
popt, pcov = curve_fit(cal_omiga_tstar, p_freqs, p_fft_amp1, p0=(1E-5, 1E-2))
Here's popt: array([ 4.51365934e-08, -1.48124194e-06]) and pcov: array([[1.35757744e-17, 3.54656128e-12],[3.54656128e-12, 2.90508007e-06]]). As you can see, the covariance matrix could be estimated in this case.
Here's the model vs. data curve:
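A plot along these lines regenerates that curve (a sketch, assuming the arrays and popt from the code above):

# compare the fitted model to the data; a log y-scale makes the
# small amplitudes visible
plt.semilogy(p_freqs, p_fft_amp1, 'o', label='data')
plt.semilogy(p_freqs, cal_omiga_tstar(p_freqs, *popt), label='fit')
plt.xlabel('frequency')
plt.ylabel('amplitude')
plt.legend()
plt.show()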
I'm working on solving the following ODE with PyMC3:
def production(y, t, p):
    return p[0]*getBeam(t) - p[1]*y[0]
getBeam(t) is my time-dependent coefficient. Those coefficients are given by an array of data which is accessed by the time index as follows:
def getBeam(t):
    nBeam = I[int(t/10)]*pow(10, -6)/q_e
    return nBeam
I have successfully implemented it using scipy.integrate.odeint, but I am having a hard time doing it with pymc3.ode. In fact, using the following:
ode_model = DifferentialEquation(func=production, times=x, n_states=1, n_theta=3, t0=0)

with pm.Model() as model:
    a = pm.Uniform("S-Factor", lower=0.01, upper=100)
    ode_solution = ode_model(y0=[0], theta=[a, Yield, lambd])
I obviously get the error TypeError: __trunc__ returned non-Integral (type TensorVariable), since t is a TensorVariable and therefore cannot be used to index the array in which the coefficients are stored.
Is there a way to overcome this difficulty? I thought about using theano.function, but I cannot get it working since, unfortunately, the coefficients cannot be expressed by any analytical function: they are just stored in the array I, whose index represents the time variable t.
Thanks
Since you already have a working implementation with scipy.integrate.odeint, you could use theano.compile.ops.as_op, though it comes with some inconveniences (see how to fit a method belonging to an instance with pymc3? and How to write a custom Deterministic or Stochastic in pymc3 with theano.op?)
Using your exact definitions for production and getBeam, the following code seems to work for me:
from scipy.integrate import odeint
from theano.compile.ops import as_op
import theano.tensor as tt
import pymc3 as pm

def ScipySolveODE(a):
    return odeint(production, y0=[0], t=x, args=([a, Yield, lambd],)).flatten()

@as_op(itypes=[tt.dscalar], otypes=[tt.dvector])
def TheanoSolveODE(a):
    return ScipySolveODE(a)

with pm.Model() as model:
    a = pm.Uniform("S-Factor", lower=0.01, upper=100)
    ode_solution = TheanoSolveODE(a)
Sorry I know this is more of a workaround than an actual solution...
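To actually draw samples you would still need to attach a likelihood to ode_solution; a hypothetical continuation (y_obs and the noise scale sigma=1.0 are placeholders for your actual data and noise level):

with model:
    # hypothetical Gaussian likelihood around the ODE solution
    pm.Normal("obs", mu=ode_solution, sigma=1.0, observed=y_obs)
    # ops wrapped with as_op have no gradient, so use a gradient-free sampler
    trace = pm.sample(step=pm.Slice())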
I want to generate samples from a multivariate normal distribution with given mean and covariance, which, of course, is possible with numpy.random.multivariate_normal. But I want to generate a (philosophically) infinite stream of such samples, and so I want to define a multivariate normal generator mvn so that mvn.next() produces another random vector with the given mean and covariance. Of course, I can just keep calling numpy.random.multivariate_normal(mean, cov, 1), but that is extremely inefficient (I would be computing the eigendecomposition of the covariance matrix on each call). Of course I can implement this from scratch myself, but it seems like something like this should already exist...
Just to establish some ground truth, this is how one might implement it:
from collections.abc import Generator
import numpy as np

class multinorm(Generator):
    def __init__(self, themean, themat):
        # the eigendecomposition is done once, here; eigh returns the
        # eigenvectors as the columns of cmat, so themat = cmat @ diag(eigs) @ cmat.T
        self.eigs, self.cmat = np.linalg.eigh(themat)
        self.meanvec = themean
        self.thedim = self.meanvec.shape[0]
        self.themult = np.diag(np.sqrt(self.eigs))

    def send(self, ignored_arg):
        tmpvec = np.random.randn(self.thedim)
        return (self.cmat @ self.themult) @ tmpvec + self.meanvec

    def throw(self, type=None, value=None, traceback=None):
        raise StopIteration
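Usage is then just the generator protocol; for example (the mean and covariance here are arbitrary):

mean = np.array([0.0, 10.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
mvn = multinorm(mean, cov)
draws = [next(mvn) for _ in range(5)]  # one vector per call, decomposition done once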
I have a problem using orthogonal distance regression with a function that uses the error function math.erf(). To be more precise, it seems I have a problem with the variables that should be fitted.
import numpy as np
from scipy import odr
import math

def CDF(x, a, b, c):
    # definition of error function to fit data
    return c/2*(1+math.erf((x-a)/(b*math.sqrt(2))))

# to make things a little shorter I excluded the parts in which the data
# is read and prepared

# using found values for a, b, c from curve_fit without errors
guess = Ergebnis_popt[1]
guess_a = guess[0]
guess_b = guess[1]
guess_c = guess[2]

# determination of the x errors; the values for the y errors are included
# in the prepared data
xerr = Verschiebung_fit[2]
xerr = xerr/xerr
xerr = xerr/200

# building the model for the odr fit; as the erf function seems to only
# handle single values I use np.vectorize
cfd = odr.Model(np.vectorize(CDF, excluded=['a','b','c']), extra_args=['a','b','c'])

# definition of the data with errors for the fit
mydata = odr.RealData(Verschiebung_fit[2], y=y_values_fit[2], sx=xerr, sy=y_values_fit[4])

# basis for the odr fit
odr = odr.ODR(mydata, cfd, beta0=[guess_a, guess_b, guess_c])
myoutput = odr.run()
myoutput.pprint()
All this results in an error:
File "C:\Users\Public\Anaconda\lib\site-packages\numpy\lib\function_base.py", line 2785, in _get_ufunc_and_otypes
outputs = func(*inputs)
File "C:\Users\Public\Anaconda\lib\site-packages\numpy\lib\function_base.py", line 2750, in func
return self.pyfunc(*the_args, **kwargs)
TypeError: CDF() takes 4 positional arguments but 5 were given
I am a novice at Python. I guess there is a problem with communicating the variables that should be fitted, but I cannot figure out where this problem occurs.
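For what it's worth, scipy.odr expects the model function to take the parameter vector beta as its first argument and the independent variable as the second, which also avoids np.vectorize entirely if the array-aware scipy.special.erf is used; a minimal sketch of that convention (not a drop-in fix for the code above, and untested against this data):

import numpy as np
from scipy import odr
from scipy.special import erf  # vectorized, unlike math.erf

def CDF(beta, x):
    a, b, c = beta
    return c/2*(1 + erf((x - a)/(b*np.sqrt(2))))

cdf_model = odr.Model(CDF)
# mydata = odr.RealData(x_data, y=y_data, sx=xerr, sy=yerr)
# fit = odr.ODR(mydata, cdf_model, beta0=[guess_a, guess_b, guess_c]).run()
# fit.pprint()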
I'm trying to code my own logistic regression and compare different methods of maximizing the log-likelihood. Using the Newton-CG method, I get the error message "ValueError: setting an array element with a sequence". Reading around, it seems this error arises if the function to be minimized returns a non-scalar, but that is not the case here. I need the three methods given below to give (approximately) the same result, but when running on my real data, one does not converge, another gives a worse LL than the initial guess, and the third does not run at all.
Why do I get the ValueError message and how can I fix it?
My code (with dummy data, the real data is ~100 measurements) is as follows:
import numpy as np
from numpy import linalg
import scipy
from scipy.optimize import minimize

def CalcLL(beta, xinlist, yinlist):
    LL = 0.0
    ncol = len(beta)
    pi = FindPi(xinlist, beta.reshape(ncol, 1))
    for i in range(len(yinlist)):
        LL = LL + np.where(yinlist[i]==1, np.log(pi[i]), np.log(1-pi[i]))
    return -LL

def Jacobian(beta, xinlist, yinlist):
    ncol = len(beta)
    nrow = np.shape(xinlist)[0]
    pi = FindPi(xinlist, beta.reshape(ncol, 1))
    Jac = np.transpose(np.matrix(yinlist-pi))*np.matrix(xinlist)
    return Jac

def Hessian(beta, xinlist, yinlist):
    ncol = len(beta)
    nrow = np.shape(xinlist)[0]
    pi = FindPi(xinlist, beta.reshape(ncol, 1))
    W = FindW(pi)
    Hes = np.matrix(np.transpose(xinlist))*(np.matrix(W)*np.matrix(xinlist))
    return Hes

def FindPi(xinlist, beta):
    rows = np.shape(xinlist)[0]  # number of rows in x_new
    cols = np.shape(xinlist)[1]  # number of columns in x_new
    expon = np.dot(xinlist, beta)
    expon = np.array(expon).reshape(rows, 1)
    pi = np.exp(expon)/(1+np.exp(expon))
    return pi

def FindW(pi):
    W = np.zeros(len(pi)*len(pi)).reshape(len(pi), len(pi))
    for i in range(len(pi)):
        W[i,i] = float(pi[i]*(1-pi[i]))
    return W

xinlist = np.matrix([[1,1],[0,1],[1,1],[1,1],[1,1],[0,1],[0,1],[1,1],[1,1],[0,1]])
yinlist = np.transpose(np.matrix([0,0,0,0,0,1,1,1,1,1]))
ncol = np.shape(xinlist)[1]

beta1 = np.zeros(ncol).reshape(ncol, 1)  # initial guess for parameter values
limit = 0.000001  # self-written Newton-Raphson method
iter_i = limit + 1
while iter_i > limit:
    Hes = Hessian(beta1, xinlist, yinlist)
    Jac = np.transpose(Jacobian(beta1, xinlist, yinlist))
    root_diff = np.array(linalg.inv(Hes)*Jac)
    beta1 = beta1 + root_diff
    iter_i = np.sum(root_diff*root_diff)
print "When running self-written algorithm, the log-likelihood is", -CalcLL(beta1, xinlist, yinlist)

beta2 = np.zeros(ncol).reshape(ncol, 1)
res = minimize(CalcLL, beta2, args=(xinlist, yinlist), method='Nelder-Mead', options={'xtol':1e-8, 'disp':True, 'maxiter':10000})
beta2 = res.x
print "The log-likelihood using Nelder-Mead is", -CalcLL(beta2, xinlist, yinlist)

beta3 = np.zeros(ncol).reshape(ncol, 1)
res = minimize(CalcLL, beta3, args=(xinlist, yinlist), method='Newton-CG', jac=Jacobian, hess=Hes, options={'xtol':1e-8, 'disp':True})
beta3 = res.x
print "The log-likelihood using Newton-CG is", -CalcLL(beta3, xinlist, yinlist)
EDIT:
The error stack is as follows:
Traceback (most recent call last):
  File "MyLogisticRegression2.py", line 62, in <module>
    res=minimize(CalcLL,beta3,args=(xinlist,yinlist),method='Newton-CG',jac=Jacobian,hess=Hes,options={'xtol':1e-8,'disp':True})
  File "C:\Python27\lib\site-packages\scipy\optimize\_minimize.py", line 447, in minimize
    **options)
  File "C:\Python27\lib\site-packages\scipy\optimize\optimize.py", line 2393, in _minimize_newtoncg
    eta = numpy.min([0.5, numpy.sqrt(maggrad)])
  File "C:\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2393, in amin
    out=out, **kwargs)
  File "C:\Python27\lib\site-packages\numpy\core\_methods.py", line 29, in _amin
    return umr_minimum(a, axis, None, out, keepdims)
ValueError: setting an array element with a sequence
I found out the problem arose from the beta arrays having shape (2,1) instead of (2,), and likewise for the Jacobian. Reshaping these two solved the problem.
The Newton-CG solver apparently needs 1-d arrays for the Jacobian.
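Concretely, a minimal sketch of that fix against the code above (only the changed pieces shown):

# initial guess as a flat (ncol,) vector rather than (ncol, 1)
beta3 = np.zeros(ncol)

def Jacobian(beta, xinlist, yinlist):
    ncol = len(beta)
    pi = FindPi(xinlist, beta.reshape(ncol, 1))
    jac = np.transpose(np.matrix(yinlist - pi))*np.matrix(xinlist)
    # Newton-CG expects the gradient as a flat 1-d array
    return np.asarray(jac).ravel()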