I'm trying to fit two global parameters of a galactic model using SciPy's curve_fit in Python. I have an array of independent variables and an array of dependent variables. The first quarter of the data set needs to be fit to a function depending on the two global parameters and two local parameters, the next quarter to another function depending on the two global parameters and two local parameters, and so on.
Is there any way to write a function that calls the appropriate function, with the right index and the global parameters, across the entire array?
What I have so far is:
def galaxy_func_inner(time,a,b,c,d):
    telescope_inner = lt.station(rot_angle=c,pol_angle=d)
    power = telescope_inner.calculate_gpowervslstarray(time)[0]
    return a*np.array(power)+b

def galaxy_func_outer(time,a,b,c,d):
    telescope_outer = lt.station(rot_angle=c,pol_angle=d)
    power = telescope_outer.calculate_gpowervslstarray(time)[0]
    return a*np.array(power)+b

def galaxy_func_global(time,R,P,a,b,c,d,e,f,g,h):
    for t_index in range(len(time)):
        if t_index in range(0,50):
            return galaxy_func_outer(t_index,a,b,R,P)
        elif t_index in range(50,100):
            return galaxy_func_outer(t_index,c,d,R,P)
        elif t_index in range(100,150):
            return galaxy_func_inner(t_index,e,f,R,P)
        elif t_index in range(150,200):
            return galaxy_func_inner(t_index,g,h,R,P)
The problem is that this only fits the first time point rather than the whole time array, and that single point is only fitted to the corresponding model point and not to the whole array. How can I reformulate this? I've tried rewriting it as:
def galaxy_func_global(xdata,R,P,a,b,c,d,e,f,g,h):
    return galaxy_func_outer(xdata[0:50],a,b,R,P),galaxy_func_outer(xdata[50:100],c,d,R,P),galaxy_func_inner(xdata[100:150],e,f,R,P),galaxy_func_inner(xdata[150:200],g,h,R,P)
but I get the error:
File "galaxy_calibration.py", line 117, in <module>
popt,pcov = curve_fit(galaxy_func_global,xdata,ydata)
File "/Library/Python/2.7/site-packages/scipy-0.14.0.dev_7cefb25-py2.7-macosx-10.9-intel.egg/scipy/optimize/minpack.py", line 555, in curve_fit
res = leastsq(func, p0, args=args, full_output=1, **kw)
File "/Library/Python/2.7/site-packages/scipy-0.14.0.dev_7cefb25-py2.7-macosx-10.9-intel.egg/scipy/optimize/minpack.py", line 369, in leastsq
shape, dtype = _check_func('leastsq', 'func', func, x0, args, n)
File "/Library/Python/2.7/site-packages/scipy-0.14.0.dev_7cefb25-py2.7-macosx-10.9-intel.egg/scipy/optimize/minpack.py", line 20, in _check_func
res = atleast_1d(thefunc(*((x0[:numinputs],) + args)))
File "/Library/Python/2.7/site-packages/scipy-0.14.0.dev_7cefb25-py2.7-macosx-10.9-intel.egg/scipy/optimize/minpack.py", line 445, in _general_function
return function(xdata, *params) - ydata
ValueError: operands could not be broadcast together with shapes (4,) (191,)
Any help would be much appreciated.
If you want to cut your input data into 4 batches (based on the index of the time points) and process the data depending on the batches, then return the results in a single array, then you can do this:
def galaxy_func_global(time,R,P,a,b,c,d,e,f,g,h):
    return np.concatenate([galaxy_func_outer(time[0:50],a,b,R,P),
                           galaxy_func_outer(time[50:100],c,d,R,P),
                           galaxy_func_inner(time[100:150],e,f,R,P),
                           galaxy_func_inner(time[150:200],g,h,R,P)])
This will slice into your time array to pick out each slice of interest, then call the appropriate function for each piece. It seems to me that these functions return simple np.arrays, which can be concatenated to get a single array as result.
(I just realized that I could've just said "what you tried was almost perfect, but you need to concatenate the resulting arrays into a single array":)
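The concatenation idea can be tried on synthetic data before plugging in the telescope model. The following is only a minimal sketch under stated assumptions: segment_model, global_model and the parameter names shared, off1 and off2 are hypothetical stand-ins (a straight line with one shared slope and one local offset per half), not the galaxy functions from the question.

```python
import numpy as np
from scipy.optimize import curve_fit

def segment_model(x, shared, offset):
    # Hypothetical stand-in for galaxy_func_inner/outer:
    # one shared (global) slope and one local offset.
    return shared * x + offset

def global_model(x, shared, off1, off2):
    # Split x into two halves and evaluate each half with its own
    # local parameter; concatenate so the output has the length of x.
    half = len(x) // 2
    return np.concatenate([segment_model(x[:half], shared, off1),
                           segment_model(x[half:], shared, off2)])

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
# Synthetic data: shared slope 2, local offsets 1 and -3, plus noise.
y = global_model(x, 2.0, 1.0, -3.0) + rng.normal(0.0, 0.1, x.size)

popt, pcov = curve_fit(global_model, x, y, p0=[1.0, 0.0, 0.0])
print(popt)
```

Because the model returns one 1-D array with as many elements as the data, curve_fit can subtract ydata from it without any broadcasting error.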
Note that there are at least two ways in which you can have dimensioning problems.
Firstly, you should make sure that the return value of both of your functions (galaxy...inner/outer()) is a 1d numpy array. Otherwise you'll run into problems with your global return value.
Secondly, the fit compares the model output with ydata element by element, so the function must return exactly as many values as there are data points. Your current code can therefore fail if time is not exactly 200 elements long, since the output is truncated to 200 elements even if time is longer. At the very least you should put
galaxy_func_inner(time[150:],g,h,R,P)
into your last function call to catch all the remaining points of time, but to do it properly, compute the slice boundaries from the length of time:
def galaxy_func_global(time,R,P,a,b,c,d,e,f,g,h):
    # slice boundaries must be integers, so cast the result of linspace
    inds=np.linspace(0,len(time),5).astype(int)
    return np.concatenate([galaxy_func_outer(time[0:inds[1]],a,b,R,P),
                           galaxy_func_outer(time[inds[1]:inds[2]],c,d,R,P),
                           galaxy_func_inner(time[inds[2]:inds[3]],e,f,R,P),
                           galaxy_func_inner(time[inds[3]:],g,h,R,P)])
Also note that your original error is formally of this kind:
File "/Library/Python/2.7/site-packages/scipy-0.14.0.dev_7cefb25-py2.7-macosx-10.9-intel.egg/scipy/optimize/minpack.py", line 445, in _general_function
return function(xdata, *params) - ydata
ValueError: operands could not be broadcast together with shapes (4,) (191,)
This tells you that Python couldn't subtract ydata from function(xdata,*params) (i.e. your fitting model) because one has length 4 while the other has length 191. If your function ends with return a,b,c,d, it actually returns the tuple (a,b,c,d), so the return value has length 4. More interesting is that your ydata has length 191 rather than 200, which might mean that you'll still run into an error.
I'm trying to code my own logistic regression and compare different methods of maximizing the log-likelihood. Using the Newton-CG method, I get the error message "ValueError: setting an array element with a sequence". From what I've read, this error arises if the function to be minimized returns a non-scalar, but that is not the case here. I need the three methods given below to give (approximately) the same result, but when running on my real data, one does not converge, another gives a worse LL than the initial guess, and the third does not run at all.
Why do I get the ValueError message and how can I fix it?
My code (with dummy data, the real data is ~100 measurements) is as follows:
import numpy as np
from numpy import linalg
import scipy
from scipy.optimize import minimize
def CalcLL(beta,xinlist,yinlist):
    LL=0.0
    ncol=len(beta)
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    for i in range(len(yinlist)):
        LL=LL+np.where(yinlist[i]==1,np.log(pi[i]),np.log(1-pi[i]))
    return -LL

def Jacobian(beta,xinlist,yinlist):
    ncol=len(beta)
    nrow=np.shape(xinlist)[0]
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    Jac=np.transpose(np.matrix(yinlist-pi))*np.matrix(xinlist)
    return Jac

def Hessian(beta,xinlist,yinlist):
    ncol=len(beta)
    nrow=np.shape(xinlist)[0]
    pi=FindPi(xinlist,beta.reshape(ncol,1))
    W=FindW(pi)
    Hes=np.matrix(np.transpose(xinlist))*(np.matrix(W)*np.matrix(xinlist))
    return Hes

def FindPi(xinlist,beta):
    rows=np.shape(xinlist)[0]# Number of rows in x_new
    cols=np.shape(xinlist)[1]# Number of columns in x_new
    expon=np.dot(xinlist,beta)
    expon=np.array(expon).reshape(rows,1)
    pi=np.exp(expon)/(1+np.exp(expon))
    return pi

def FindW(pi):
    W=np.zeros(len(pi)*len(pi)).reshape(len(pi),len(pi))
    for i in range(len(pi)):
        W[i,i]=float(pi[i]*(1-pi[i]))
    return W

xinlist=np.matrix([[1,1],[0,1],[1,1],[1,1],[1,1],[0,1],[0,1],[1,1],[1,1],[0,1]])
yinlist=np.transpose(np.matrix([0,0,0,0,0,1,1,1,1,1]))
ncol=np.shape(xinlist)[1]
beta1=np.zeros(ncol).reshape(ncol,1) # Initial guess for parameter values
limit=0.000001 # selfwritten Newton-Raphson method
iter_i=limit+1
while iter_i>limit:
    Hes=Hessian(beta1,xinlist,yinlist)
    Jac=np.transpose(Jacobian(beta1,xinlist,yinlist))
    root_diff=np.array(linalg.inv(Hes)*Jac)
    beta1=beta1+root_diff
    iter_i=np.sum(root_diff*root_diff)
print "When running self-written algorithm, the log-likelihood is",-CalcLL(beta1,xinlist,yinlist)
beta2=np.zeros(ncol).reshape(ncol,1)
res=minimize(CalcLL,beta2,args=(xinlist,yinlist),method='Nelder-Mead',options={'xtol':1e-8,'disp':True,'maxiter':10000})
beta2=res.x
print "The log-likelihood using Nelder-Mead is", -CalcLL(beta2,xinlist,yinlist)
beta3=np.zeros(ncol).reshape(ncol,1)
res=minimize(CalcLL,beta3,args=(xinlist,yinlist),method='Newton-CG',jac=Jacobian,hess=Hes,options={'xtol':1e-8,'disp':True})
beta3=res.x
print "The log-likelihood using Newton-CG is", -CalcLL(beta3,xinlist,yinlist)
EDIT:
The errorstack is as follows:
Traceback (most recent call last):
File "MyLogisticRegression2.py", line 62, in
res=minimize(CalcLL,beta3,args=(xinlist,yinlist),method='Newton-CG',jac=Jacobian,hess=Hes,options={'xtol':1e-8,'disp':True})
File C:\Python27\lib\site-packages\scipy\optimize_minimize.py, line 447, in minimize **options)
File C:\Python27\lib\site-packages\scipy\optimize\optimize.py, line 2393, in _minimize_newtoncg eta=numpy.min([0.5, numpy.sqrt(maggrad)])
File C:\Python27\lib\site-packages\numpy\core\fromnumeric.py, line 2393, in amin out=out, **kwargs)
File C:\Python27\lib\site-packages\numpy\core_methods.py, line 29, in _amin return umr_minimum(a,axis,None,out,keepdims)
ValueError: setting an array element with a sequence
I found out that the problem arose from the beta arrays having shape (2,1) instead of (2,), and likewise for the Jacobian. Reshaping these two solved the problem.
Apparently the Newton-CG solver only accepts 1-D arrays for the parameter vector and the Jacobian.
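To illustrate the shape requirement, here is a minimal sketch that uses plain 1-D NumPy arrays throughout, with the dummy data from the question. The function names neg_log_likelihood, gradient and hessian are my own, and the sketch assumes the objective is the negative log-likelihood, whose gradient is X^T (p - y) (note the sign). It is one possible arrangement, not necessarily what the original poster ended up with.

```python
import numpy as np
from scipy.optimize import minimize

# Dummy data from the question, as plain arrays instead of np.matrix.
X = np.array([[1, 1], [0, 1], [1, 1], [1, 1], [1, 1],
              [0, 1], [0, 1], [1, 1], [1, 1], [0, 1]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

def neg_log_likelihood(beta, X, y):
    # -sum(y*log(p) + (1-y)*log(1-p)) written in a numerically stable form.
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

def gradient(beta, X, y):
    # Gradient of the negative log-likelihood, shape (ncol,).
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - y)

def hessian(beta, X, y):
    # Hessian X^T W X with W = diag(p*(1-p)), shape (ncol, ncol).
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return (X * (p * (1.0 - p))[:, None]).T @ X

beta0 = np.zeros(X.shape[1])  # shape (2,), not (2, 1)
res = minimize(neg_log_likelihood, beta0, args=(X, y), method='Newton-CG',
               jac=gradient, hess=hessian, options={'xtol': 1e-8})
print(res.x, -res.fun)
```

Here hess is passed as a callable rather than as a precomputed matrix, so the solver can re-evaluate it at each iterate.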
I am trying to use scipy.optimize.fsolve to solve an equation. I noticed that the function is evaluated with the same value multiple times at the beginning and the end of the iteration. For example, when the following code is run:
from scipy.optimize import fsolve
def yy(x):
    print(x)
    return x**2+9*x+20
y = fsolve(yy,22.)
print(y)
The following output is obtained:
[ 22.]
[ 22.]
[ 22.]
[ 22.00000033]
[ 8.75471707]
[ 4.34171812]
[ 0.81508685]
[-1.16277103]
[-2.42105811]
[-3.17288066]
[-3.61657372]
[-3.85653348]
[-3.96397335]
[-3.99561793]
[-3.99984826]
[-3.99999934]
[-4.]
[-4.]
[-4.]
Therefore the function is evaluated with 22. three times, which is unnecessary.
This is especially annoying when the function requires substantial evaluation time. Could anyone please explain this and suggest how to avoid this issue?
The first evaluation is done only to check the shape and data type of the output of the function. Specifically, fsolve calls _root_hybr which contains the line
shape, dtype = _check_func('fsolve', 'func', func, x0, args, n, (n,))
Naturally, _check_func calls the function:
res = atleast_1d(thefunc(*((x0[:numinputs],) + args)))
Since only the shape and data type are retained from this evaluation, the solver calls the function with the value x0 again when the actual root-finding process begins.
The above accounts for one extraneous call (out of two). I did not track down the other one, but it's conceivable that the FORTRAN code does some kind of preliminary check of its own. This sort of thing happens when algorithms written long ago get wrapped over and over again.
If you really want to save these two evaluations of expensive function yy, one way is to compute the value yy(x0) separately and store it. For example:
def yy(x):
    if x == x0 and y0 is not None:
        return y0
    print(x)
    return x**2+9*x+20
x0 = 22.
y0 = None
y0 = yy(x0)
y = fsolve(yy, x0)
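A slightly more general variant of the same idea is to cache every evaluation, so that any repeated argument (not just x0) is served from memory. This is only a sketch: make_cached and the counters dictionary are hypothetical helpers of my own, not part of SciPy.

```python
import numpy as np
from scipy.optimize import fsolve

def make_cached(func):
    # Wrap func so that repeated calls with identical arguments
    # reuse the stored result instead of recomputing it.
    cache = {}
    counters = {"calls": 0, "evals": 0}

    def wrapper(x):
        counters["calls"] += 1
        x = np.atleast_1d(x)
        key = tuple(x.tolist())
        if key not in cache:
            counters["evals"] += 1
            cache[key] = np.array(func(x), dtype=float)
        # Return a copy so the solver cannot modify the cached value.
        return cache[key].copy()

    return wrapper, counters

def yy(x):
    return x**2 + 9*x + 20

cached_yy, counters = make_cached(yy)
root = fsolve(cached_yy, 22.0)
print(root, counters)
```

Comparing counters["calls"] with counters["evals"] shows how many evaluations were saved. The cache only helps when the arguments repeat exactly, which is the case for the repeated calls at x0 described above.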
I realised an important reason for this issue is that fsolve is not meant for such a problem: Solvers should be chosen wisely :)
multivariate: fmin, fmin_powell, fmin_cg, fmin_bfgs, fmin_ncg
nonlinear: leastsq
constrained: fmin_l_bfgs_b, fmin_tnc, fmin_cobyla
global: basinhopping, brute, differential_evolution
local: fminbound, brent, golden, bracket
n-dimensional: fsolve
one-dimensional: brenth, ridder, bisect, newton
scalar: fixed_point
I am implementing a neural network whose input and output matrices are very large, so I am using dask arrays to store them.
X is the input matrix of size 32000 x 7500 and y is the output matrix of the same dimensions.
Below is the neural network code, which has one hidden layer:
class Neural_Network(object):
    def __init__(self,i,j,k):
        #define hyperparameters
        self.inputLayerSize = i
        self.outputLayerSize = j
        self.hiddenLayerSize = k
        #weights
        self.W1 = da.random.normal(0.5,0.5,size =(self.inputLayerSize,self.hiddenLayerSize),chunks=(1000,1000))
        self.W2 = da.random.normal(0.5,0.5,size =(self.hiddenLayerSize,self.outputLayerSize),chunks=(1000,1000))
        self.W1 = self.W1.astype('float96')
        self.W2 = self.W2.astype('float96')

    def forward(self,X):
        self.z2 = X.dot(self.W1)
        self.a2 = self.z2.map_blocks(self.sigmoid)
        self.z3 = self.a2.dot(self.W2)
        yhat = self.z3.map_blocks(self.sigmoid)
        return yhat

    def exp(z):
        return np.exp(z)

    def sigmoid(self,z):
        #sigmoid function
        ## return 1/(1+np.exp(-z))
        return 1/(1+(-z).map_blocks(self.exp))

    def sigmoidprime(self,z):
        ez = (-z).map_blocks(self.exp)
        return ez/(1+ez**2)

    def costFunction (self,X,y):
        self.yHat = self.forward(X)
        return 1/2*sum((y-self.yHat)**2)

    def costFunctionPrime (self,X,y):
        self.yHat = self.forward(X)
        self.error = -(y - self.yHat)
        self.delta3 = self.error*self.z3.map_blocks(self.sigmoidprime)
        dJdW2 = self.a2.transpose().dot(self.delta3)
        self.delta2 = self.delta3.dot(self.W2.transpose())*self.z2.map_blocks(self.sigmoidprime)
        dJdW1 = X.transpose().dot(self.delta2)
        return dJdW1 , dJdW2
Now I try to reduce the cost function as below:
>>> n = Neural_Network(7420,7420,5000)
>>> for i in range(0,500):
        cost1,cost2 = n.costFunctionPrime(X,y)
        n.W1 = n.W1 -3*cost1
        n.W2 = n.W2 -3*cost2
        if i%5==0:
            print (i*100/500,'%')
But when i reaches around 120, it gives me this error:
File "<pyshell#127>", line 3, in <module>
n.W1 = n.W1 -3*cost1
File "c:\python34\lib\site-packages\dask\array\core.py", line 1109, in __sub__
return elemwise(operator.sub, self, other)
File "c:\python34\lib\site-packages\dask\array\core.py", line 2132, in elemwise
dtype=dt, name=name)
File "c:\python34\lib\site-packages\dask\array\core.py", line 1659, in atop
return Array(merge(dsk, *dsks), out, chunks, dtype=dtype)
File "c:\python34\lib\site-packages\toolz\functoolz.py", line 219, in __call__
return self._partial(*args, **kwargs)
File "c:\python34\lib\site-packages\toolz\curried\exceptions.py", line 20, in merge
return toolz.merge(*dicts, **kwargs)
File "c:\python34\lib\site-packages\toolz\dicttoolz.py", line 39, in merge
rv.update(d)
MemoryError
It also gives MemoryError when I do nn.W1.compute()
This looks like it's failing while building the graph, not during computation. Two things come to mind:
Avoid excessive looping
Each iteration of your for loop may be dumping millions of tasks into the task graph. Each task probably takes up something like 100B to 1kB. When these add up they can easily overwhelm your machine.
In a typical deep learning library, like Theano, you would use a scan operation for something like this. Dask.array has no such operation.
Avoid inserting graphs into graphs
You call map_blocks on a function that itself calls map_blocks.
self.delta2 = self.delta3.dot(self.W2.transpose())*self.z2.map_blocks(self.sigmoidprime)
def sigmoidprime(self,z):
    ez = (-z).map_blocks(self.exp)
    return ez/(1+ez**2)
Instead you might just make a sigmoid prime function
def sigmoidprime(z):
    ez = np.exp(-z)
    return ez / (1 + ez ** 2)
And then map that function
self.z2.map_blocks(sigmoidprime)
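A function passed to map_blocks only needs to accept and return NumPy arrays, so it can be checked with plain NumPy first. The sketch below is such a check and does not use dask. One caveat I am adding here: the derivative of the logistic sigmoid is e^(-z)/(1+e^(-z))^2, so the denominator below is (1 + ez) ** 2, whereas the 1 + ez ** 2 in the snippets above appears to be a typo carried over from the question. The name sigmoid_prime is my own.

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid acting elementwise on a NumPy array.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: exp(-z) / (1 + exp(-z))**2.
    ez = np.exp(-z)
    return ez / (1.0 + ez) ** 2

# Compare against a central finite difference as a sanity check.
z = np.linspace(-3.0, 3.0, 7)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
matches = np.allclose(sigmoid_prime(z), numeric, atol=1e-6)
print(matches)
```

Once the check passes, both functions can be handed to map_blocks unchanged, since each block they receive is an ordinary NumPy array.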
Deep learning is tricky
Generally speaking, doing deep learning well often requires specialization. The libraries designed to do this well generally aren't general purpose for a reason. A general purpose library, like dask.array might be useful but will probably never reach the smooth operation of a library like Theano.
A possible approach
You might try building a function that takes just one step. It would read from disk, do all of your dot products, transposes, and normal computations, and would then store explicitly into an on-disk dataset. You would then call this function many times. Even then I'm not convinced that the scheduling policies behind dask.array could do this well.
I used fsolve to find the zeros of an example sine function, and it worked great. However, I want to do the same with a dataset: two lists of floats, later converted to arrays with numpy.asarray(), containing the (x,y) values, namely 't' and 'ys'.
Although I found some related questions, I failed to adapt the code provided in them, as I try to show here. Our arrays of interest are stored in a 2D list, data[i][j], where 'i' corresponds to a variable (e.g. data[0]==t==time==x values) and 'j' indexes the values of that variable along the x axis (e.g. data[1]==Force). Keep in mind that each data[i] is an array of floats.
Could you offer example code that takes two inputs (the two arrays mentioned) and returns their intersection points with a given function (e.g. 'y=0')?
I include some testing I made regarding the other related question. ( #HYRY 's answer)
I do not think it is relevant, but I'm using Spyder through Anaconda.
Thanks in advance!
"""
Following the answer provided by #HYRY in the 'related questions' (see link above).
At this point of the code, the variable 'data' has already been defined as stated before.
"""
from scipy.optimize import fsolve
def tfun(x):
    return data[0][x]
def yfun(x):
    return data[14][x]
def findIntersection(fun1, fun2, x0):
    return [fsolve(lambda x:fun1(x)-fun2(x, y), x0) for y in range(1, 10)]
print findIntersection(tfun, yfun, 0)
This returns the following error:
File "E:/Data/Anaconda/[...]/00-Latest/fsolvestacktest001.py", line 36, in tfun
return data[0][x]
IndexError: arrays used as indices must be of integer (or boolean) type
The full output is as follows:
Traceback (most recent call last):
File "<ipython-input-16-105803b235a9>", line 1, in <module>
runfile('E:/Data/Anaconda/[...]/00-Latest/fsolvestacktest001.py', wdir='E:/Data/Anaconda/[...]/00-Latest')
File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
execfile(filename, namespace)
File "E:/Data/Anaconda/[...]/00-Latest/fsolvestacktest001.py", line 44, in <module>
print findIntersection(tfun, yfun, 0)
File "E:/Data/Anaconda/[...]/00-Latest/fsolvestacktest001.py", line 42, in findIntersection
return [fsolve(lambda x:fun1(x)-fun2(x, y), x0) for y in range(1, 10)]
File "C:\Anaconda\lib\site-packages\scipy\optimize\minpack.py", line 140, in fsolve
res = _root_hybr(func, x0, args, jac=fprime, **options)
File "C:\Anaconda\lib\site-packages\scipy\optimize\minpack.py", line 209, in _root_hybr
ml, mu, epsfcn, factor, diag)
File "E:/Data/Anaconda/[...]/00-Latest/fsolvestacktest001.py", line 42, in <lambda>
return [fsolve(lambda x:fun1(x)-fun2(x, y), x0) for y in range(1, 10)]
File "E:/Data/Anaconda/[...]/00-Latest/fsolvestacktest001.py", line 36, in tfun
return data[0][x]
IndexError: arrays used as indices must be of integer (or boolean) type
You can 'convert' a dataset (a pair of arrays) to a continuous function by means of interpolation. scipy.interpolate.interp1d is a factory that provides you with the resulting function, which you can then use with your root-finding algorithm.
--edit-- an example for computing an intersection of sin and cos from 20 samples (I've used cubic spline interpolation, as piecewise linear gives warnings about the smoothness):
>>> import numpy, scipy.optimize, scipy.interpolate
>>> x = numpy.linspace(0,2*numpy.pi, 20)
>>> x
array([ 0. , 0.33069396, 0.66138793, 0.99208189, 1.32277585,
1.65346982, 1.98416378, 2.31485774, 2.64555171, 2.97624567,
3.30693964, 3.6376336 , 3.96832756, 4.29902153, 4.62971549,
4.96040945, 5.29110342, 5.62179738, 5.95249134, 6.28318531])
>>> y1sampled = numpy.sin(x)
>>> y2sampled = numpy.cos(x)
>>> y1int = scipy.interpolate.interp1d(x,y1sampled,kind='cubic')
>>> y2int = scipy.interpolate.interp1d(x,y2sampled,kind='cubic')
>>> scipy.optimize.fsolve(lambda x: y1int(x) - y2int(x), numpy.pi)
array([ 3.9269884])
>>> scipy.optimize.fsolve(lambda x: numpy.sin(x) - numpy.cos(x), numpy.pi)
array([ 3.92699082])
Note that interpolation only gives you 'guesses' about what the data should be between the sampling points, and in general there is no way to tell how good these guesses are (although in my example you can see that the estimate is pretty good).
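For the case asked about in the question (intersections of sampled data with a constant level such as y=0), the interpolation idea can be combined with a bracketing root finder. This is a minimal sketch under my own assumptions: zero_crossings is a hypothetical helper, the data are sampled sine values standing in for 't' and 'ys', and samples that hit the level exactly are not treated specially.

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.optimize import brentq

def zero_crossings(t, ys, level=0.0):
    # Return the t values where the interpolated data cross `level`.
    t = np.asarray(t, dtype=float)
    shifted = np.asarray(ys, dtype=float) - level
    interp = interp1d(t, shifted, kind='cubic')
    f = lambda x: float(interp(x))
    # A strict sign change between neighbouring samples brackets a root.
    signs = np.sign(shifted)
    idx = np.where(signs[:-1] * signs[1:] < 0)[0]
    return np.array([brentq(f, t[i], t[i + 1]) for i in idx])

# Stand-in data: 40 samples of sin(t) on [0.5, 9].
t = np.linspace(0.5, 9.0, 40)
ys = np.sin(t)
roots = zero_crossings(t, ys)
print(roots)
```

Bracketing each sign change first means every crossing in the data is found, whereas fsolve only returns the root nearest to its starting guess.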
I'm trying to create a distribution based on some data I have, then draw randomly from that distribution. Here's what I have:
from scipy import stats
import numpy
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
    return rv()

if __name__ == "__main__":
    # pretend this is real data
    data = numpy.concatenate((numpy.random.normal(2,5,100), numpy.random.normal(25,5,100)))
    d = getDistribution(data)
    print d.rvs(size=100) # this usually fails
I think this is doing what I want it to, but I frequently get an error (see below) when I try to do d.rvs(), and d.rvs(100) never works. Am I doing something wrong? Is there an easier or better way to do this? If it's a bug in scipy, is there some way to get around it?
Finally, is there more documentation on creating custom distributions somewhere? The best I've found is the scipy.stats.rv_continuous documentation, which is pretty spartan and contains no useful examples.
The traceback:
Traceback (most recent call last): File "testDistributions.py", line
19, in
print d.rvs(size=100) File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py",
line 696, in rvs
vals = self._rvs(*args) File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py",
line 1193, in _rvs
Y = self._ppf(U,*args) File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py",
line 1212, in _ppf
return self.vecfunc(q,*args) File "/usr/local/lib/python2.6/dist-packages/numpy-1.6.1-py2.6-linux-x86_64.egg/numpy/lib/function_base.py",
line 1862, in call
theout = self.thefunc(*newargs) File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py",
line 1158, in _ppf_single_call
return optimize.brentq(self._ppf_to_solve, self.xa, self.xb, args=(q,)+args, xtol=self.xtol) File
"/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/optimize/zeros.py",
line 366, in brentq
r = _zeros._brentq(f,a,b,xtol,maxiter,args,full_output,disp) ValueError: f(a) and f(b) must have different signs
Edit
For those curious, following the advice in the answer below, here's code that works:
from scipy import stats
import numpy
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _rvs(self, *x, **y):
            # don't ask me why it's using self._size
            # nor why I have to cast to int
            return kernel.resample(int(self._size))
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
        def _pdf(self, x):
            return kernel.evaluate(x)
    return rv(name='kdedist', xa=-200, xb=200)
Specifically to your traceback:
rvs uses the inverse of the cdf, ppf, to create random numbers. Since you are not specifying ppf, it is calculated by a root-finding algorithm, brentq. brentq uses lower and upper bounds on where it should search for the value at which the function is zero (find x such that cdf(x)=q, where q is the quantile).
The default for the limits, xa and xb, are too small in your example. The following works for me with scipy 0.9.0, xa, xb can be set when creating the function instance
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
    return rv(name='kdedist', xa=-200, xb=200)
There is currently a pull request for scipy to improve this, so in the next release xa and xb will be expanded automatically to avoid the f(a) and f(b) must have different signs exception.
There is not much documentation on this, the easiest is to follow some examples (and ask on the mailing list).
edit: addition
pdf: Since you have the density function also given by gaussian_kde, I would add the _pdf method, which will make some calculations more efficient.
edit2: addition
rvs: If you are interested in generating random numbers, then gaussian_kde has a resample method. Random Samples can be generated by sampling from the data and adding gaussian noise. So, this will be faster than the generic rvs using the ppf method. I would write a ._rvs method that just calls gaussian_kde's resample method.
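If random draws are all that is needed, the resample method can also be called directly, without wrapping the KDE in rv_continuous. A minimal sketch, using synthetic data similar to the question's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Pretend this is real data: a mixture of two normal distributions.
data = np.concatenate((rng.normal(2, 5, 100), rng.normal(25, 5, 100)))

kernel = stats.gaussian_kde(data)
# resample returns an array of shape (dimension, size); the data are
# one-dimensional, so take row 0 to get a flat array of draws.
samples = kernel.resample(1000)[0]
print(samples.shape)
```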
precomputing ppf: I don't know of any general way to precompute the ppf. However, the way I thought of doing it (but never tried so far) is to precompute the ppf at many points and then use linear interpolation to approximate the ppf function.
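The precomputation idea might look roughly like the sketch below. As said above, this is untested in practice; the grid size of 512, the padding of three standard deviations and the name ppf_approx are arbitrary assumptions of mine.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = np.concatenate((rng.normal(2, 5, 100), rng.normal(25, 5, 100)))
kernel = stats.gaussian_kde(data)

# Tabulate the cdf on a grid that extends beyond the data range.
pad = 3.0 * data.std()
grid = np.linspace(data.min() - pad, data.max() + pad, 512)
cdf = np.array([kernel.integrate_box_1d(-np.inf, x) for x in grid])

def ppf_approx(q):
    # Approximate inverse cdf: linear interpolation with the roles of
    # the cdf values and the grid points swapped.
    return np.interp(q, cdf, grid)

# Inverse-transform sampling using the tabulated ppf.
u = rng.uniform(cdf[0], cdf[-1], size=1000)
samples = ppf_approx(u)
median = ppf_approx(0.5)
print(median)
```

The table is built once, so each subsequent draw costs only an interpolation instead of a root search.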
edit3: about _rvs to answer Srivatsan's question in the comment
_rvs is the distribution-specific method that is called by the public method rvs. rvs is a generic method that does some argument checking, adds location and scale, sets the attribute self._size (the size of the requested array of random variables), and then calls the distribution-specific method ._rvs or its generic counterpart. The extra arguments in ._rvs are shape parameters, but since there are none in this case, *x and **y are redundant and unused.
I don't know how well the size or shape of the .rvs method works in the multivariate case. These distributions are designed for univariate distributions, and might not fully work for the multivariate case, or might need some reshapes.