Memory error with dask array - python

I am implementing a neural network whose input and output matrices are very large, so I am using dask arrays to store them.
X is an input matrix of 32000 x 7500, and y is an output matrix of the same dimensions.
Below is the neural network code, which has one hidden layer:
class Neural_Network(object):
    def __init__(self,i,j,k):
        #define hyperparameters
        self.inputLayerSize = i
        self.outputLayerSize = j
        self.hiddenLayerSize = k
        #weights
        self.W1 = da.random.normal(0.5,0.5,size =(self.inputLayerSize,self.hiddenLayerSize),chunks=(1000,1000))
        self.W2 = da.random.normal(0.5,0.5,size =(self.hiddenLayerSize,self.outputLayerSize),chunks=(1000,1000))
        self.W1 = self.W1.astype('float96')
        self.W2 = self.W2.astype('float96')
    def forward(self,X):
        self.z2 = X.dot(self.W1)
        self.a2 = self.z2.map_blocks(self.sigmoid)
        self.z3 = self.a2.dot(self.W2)
        yhat = self.z3.map_blocks(self.sigmoid)
        return yhat
    def exp(z):
        return np.exp(z)
    def sigmoid(self,z):
        #sigmoid function
        ## return 1/(1+np.exp(-z))
        return 1/(1+(-z).map_blocks(self.exp))
    def sigmoidprime(self,z):
        ez = (-z).map_blocks(self.exp)
        return ez/(1+ez**2)
    def costFunction (self,X,y):
        self.yHat = self.forward(X)
        return 1/2*sum((y-self.yHat)**2)
    def costFunctionPrime (self,X,y):
        self.yHat = self.forward(X)
        self.error = -(y - self.yHat)
        self.delta3 = self.error*self.z3.map_blocks(self.sigmoidprime)
        dJdW2 = self.a2.transpose().dot(self.delta3)
        self.delta2 = self.delta3.dot(self.W2.transpose())*self.z2.map_blocks(self.sigmoidprime)
        dJdW1 = X.transpose().dot(self.delta2)
        return dJdW1 , dJdW2
Now I try to reduce cost of function as below:
>>> n = Neural_Network(7420,7420,5000)
>>> for i in range(0,500):
        cost1,cost2 = n.costFunctionPrime(X,y)
        n.W1 = n.W1 -3*cost1
        n.W2 = n.W2 -3*cost2
        if i%5==0:
            print (i*100/500,'%')
But when i reaches around 120 it gives me error:
File "<pyshell#127>", line 3, in <module>
n.W1 = n.W1 -3*cost1
File "c:\python34\lib\site-packages\dask\array\core.py", line 1109, in __sub__
return elemwise(operator.sub, self, other)
File "c:\python34\lib\site-packages\dask\array\core.py", line 2132, in elemwise
dtype=dt, name=name)
File "c:\python34\lib\site-packages\dask\array\core.py", line 1659, in atop
return Array(merge(dsk, *dsks), out, chunks, dtype=dtype)
File "c:\python34\lib\site-packages\toolz\functoolz.py", line 219, in __call__
return self._partial(*args, **kwargs)
File "c:\python34\lib\site-packages\toolz\curried\exceptions.py", line 20, in merge
return toolz.merge(*dicts, **kwargs)
File "c:\python34\lib\site-packages\toolz\dicttoolz.py", line 39, in merge
rv.update(d)
MemoryError
It also gives MemoryError when I do nn.W1.compute()

This looks like it's failing while building the graph, not during computation. Two things come to mind:
Avoid excessive looping
Each iteration of your for loop may be dumping millions of tasks into the task graph. Each task probably takes up something like 100B to 1kB. When these add up they can easily overwhelm your machine.
In a typical deep learning library, like Theano, you would use a scan operation for something like this. Dask.array has no such operation.
Avoid inserting graphs into graphs
You call map_blocks on a function that itself calls map_blocks.
self.delta2 = self.delta3.dot(self.W2.transpose())*self.z2.map_blocks(self.sigmoidprime)
def sigmoidprime(self,z):
    ez = (-z).map_blocks(self.exp)
    return ez/(1+ez**2)
Instead you might just make a sigmoid prime function
def sigmoidprime(z):
    ez = np.exp(-z)
    # derivative of the sigmoid: e^-z / (1 + e^-z)^2
    return ez / (1 + ez) ** 2
And then map that function
self.z2.map_blocks(sigmoidprime)
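Because a function passed to map_blocks only ever sees plain NumPy blocks, it can be sanity-checked on a small NumPy array before dask is involved at all. The sketch below is only an illustration (the test grid and the step size h are arbitrary choices): it compares the usual closed form of the sigmoid derivative, e^-z / (1 + e^-z)^2, against a central finite difference.

```python
import numpy as np


def sigmoid(z):
    # Plain NumPy sigmoid; works on any block that map_blocks hands over.
    return 1.0 / (1.0 + np.exp(-z))


def sigmoidprime(z):
    # Closed-form derivative of the sigmoid.
    ez = np.exp(-z)
    return ez / (1.0 + ez) ** 2


# Compare against a central finite difference on a small grid.
z = np.linspace(-4.0, 4.0, 9)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_err = np.max(np.abs(numeric - sigmoidprime(z)))
print("largest deviation from the finite difference:", max_err)
```

A check like this is cheap, and it catches formula slips that would otherwise only show up as a network that fails to learn.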
Deep learning is tricky
Generally speaking, doing deep learning well often requires specialization. The libraries designed to do this well generally aren't general purpose for a reason. A general-purpose library like dask.array might be useful, but it will probably never reach the smooth operation of a library like Theano.
A possible approach
You might try building a function that takes just one step. It would read from disk, do all of your dot products, transposes, and normal computations, and would then store explicitly into an on-disk dataset. You would then call this function many times. Even then I'm not convinced that the scheduling policies behind dask.array could do this well.
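The control flow of that idea can be sketched with plain NumPy as a stand-in. Everything here is an assumption made for illustration: the tiny shapes, the .npz file used as the on-disk store, the learning rate, and the helper name one_step. With dask.array one would presumably read the stored weights with something like da.from_array and write them back with da.store, so that each call starts from concrete data instead of an ever-growing task graph.

```python
import os
import tempfile

import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def one_step(X, y, weights_path, lr):
    """Load weights from disk, take one gradient step, write them back."""
    with np.load(weights_path) as stored:
        W1, W2 = stored["W1"], stored["W2"]
    a2 = sigmoid(X @ W1)                      # hidden activations
    yhat = sigmoid(a2 @ W2)                   # network output
    delta3 = (yhat - y) * yhat * (1.0 - yhat)
    delta2 = (delta3 @ W2.T) * a2 * (1.0 - a2)
    W1 = W1 - lr * (X.T @ delta2)
    W2 = W2 - lr * (a2.T @ delta3)
    np.savez(weights_path, W1=W1, W2=W2)      # results are materialized here
    return 0.5 * np.sum((y - yhat) ** 2)      # cost before the update


# Toy data and initial weights (hypothetical sizes, far smaller than the question's).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = sigmoid(X @ rng.normal(size=(4, 2)))
weights_path = os.path.join(tempfile.mkdtemp(), "weights.npz")
np.savez(weights_path,
         W1=0.1 * rng.normal(size=(4, 3)),
         W2=0.1 * rng.normal(size=(3, 2)))

costs = [one_step(X, y, weights_path, lr=0.01) for _ in range(200)]
```

The point of the design is that nothing symbolic survives from one iteration to the next: every call ends by writing concrete arrays, so the amount of bookkeeping per step stays constant.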

Related

How can I make pseudo function and function prime for SciPy fmin_l_bfgs_b?

I want to use scipy.optimize.fmin_l_bfgs_b to find the minimum of a cost function.
To do this, I first want to create an instance of one_batch (its code is given below) to specify the batch of training examples and those parameters that are not part of the loss function but are necessary to calculate the loss.
Because the module loss_calc is designed to return the loss and the loss gradient at the same time, I'm facing the problem of separating the loss function from its derivative for scipy.optimize.fmin_l_bfgs_b.
As you can see from the code of one_batch, given a batch of training examples, [loss, dloss/dParameters] is calculated in parallel for each example. I don't want to do the exact same calculations twice, once for get_loss and once for get_loss_prime.
So how can I design the methods get_loss and get_loss_prime so that I only need to do the parallel calculation once?
Here is the code of one_batch
from calculator import loss_calc
class one_batch:
    def __init__(self,
                 auxiliary_model_parameters,
                 batch_example):
        # auxiliary_model_parameters are parameters need to specify
        # the loss calculator but are not included in the loss function.
        self.auxiliary_model_parameters = auxiliary_model_parameters
        self.batch_example = batch_example
    def parallel(self, func, args):
        pool = multiprocessing.Pool(multiprocessing.cpu_count())
        result = pool.map(func, args)
        return result
    def one_example(self, example):
        temp_instance = loss_calc(self.auxiliary_model_parameters,
                                  self.model_vector)
        loss, dloss = temp_instance(example).calculate()
        return [loss, dloss]
    def main(self, model_vector):
        self.model_vector = model_vector
        # model_vector and auxiliary_model_parameters are necessary
        # for creating an instance of loss function calculator
        result_list = parallel(self.one_example,
                               self.batch_examples)
        # result_list is a list of sublists, each sublist is
        # [loss, dloss/dParameter] for each training example
    def get_loss(self):
        ?
    def get_loss_prime(self):
        ?
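You can pass an objective function that returns both the function value and the gradient directly to fmin_l_bfgs_b. When fprime is not given (and approx_grad is left false), func is expected to return the pair (f, g):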
from scipy.optimize import fmin_l_bfgs_b
import numpy as np
def obj_fun(x):
    fx = 2*x**2 + 2*x + 1
    grad = np.array([4*x + 2])
    return fx, grad
fmin_l_bfgs_b(obj_fun, x0=[12])
(array([-0.5]), array([0.5]), {'grad': array([[-3.55271368e-15]]),
 'task': b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL',
 'funcalls': 4, 'nit': 2, 'warnflag': 0})
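If separate get_loss and get_loss_prime methods are still wanted, as in the question, one possible pattern is to cache the last evaluation so that both methods share a single computation per parameter vector. The sketch below is only an illustration: CachedObjective and toy_loss_and_grad are hypothetical names, and the toy function stands in for the expensive parallel computation.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b


class CachedObjective:
    """Evaluate loss and gradient together, at most once per parameter vector."""

    def __init__(self, loss_and_grad):
        self._loss_and_grad = loss_and_grad   # callable returning (loss, grad)
        self._x = None
        self._loss = None
        self._grad = None
        self.n_evals = 0                      # number of real evaluations

    def _update(self, x):
        x = np.asarray(x, dtype=float)
        if self._x is None or not np.array_equal(x, self._x):
            self._loss, self._grad = self._loss_and_grad(x)
            self._x = x.copy()
            self.n_evals += 1

    def get_loss(self, x):
        self._update(x)
        return self._loss

    def get_loss_prime(self, x):
        self._update(x)
        return self._grad


def toy_loss_and_grad(x):
    # Hypothetical stand-in for the expensive per-batch computation.
    loss = float(np.sum(2 * x**2 + 2 * x + 1))
    grad = 4 * x + 2
    return loss, grad


obj = CachedObjective(toy_loss_and_grad)
x_opt, f_opt, info = fmin_l_bfgs_b(obj.get_loss, x0=np.array([12.0]),
                                   fprime=obj.get_loss_prime)
```

Returning the pair directly from one function, as above, is simpler; the cache is only worth it when the two-method interface has to be kept.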

Tensorflow (GPU) vs. Numpy

I've got two implementations of a linear regression using gradient descent: one in Tensorflow, one in Numpy. I'm finding the one in Numpy is about 3x faster than the one in Tensorflow. Here's my code:
Tensorflow:
class network_cluster(object):
    def __init__(self, data_frame, feature_cols, label_cols):
        self.init_data(data_frame, feature_cols, label_cols)
        self.init_tensors()
    def init_data(self, data_frame, feature_cols, label_cols):
        self.data_frame = data_frame
        self.feature_cols = feature_cols
        self.label_cols = label_cols
    def init_tensors(self):
        self.features = tf.placeholder(tf.float32)
        self.labels = tf.placeholder(tf.float32)
        self.weights = tf.Variable(tf.random_normal((len(self.feature_cols), len(self.label_cols))))
        self.const = tf.Variable(tf.random_normal((len(self.label_cols),)))
    def linear_combiner(self):
        return tf.add(tf.matmul(self.features, self.weights), self.const)
    def predict(self):
        return self.linear_combiner()
    def error(self):
        return tf.reduce_mean(tf.pow(self.labels - self.predict(), 2), axis = 0)
    def learn_model(self, epocs = 100):
        optimizer = tf.train.AdadeltaOptimizer(1).minimize(self.error())
        error_rcd = []
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for epoc in range(epocs):
                _, error = sess.run([optimizer, self.error()], feed_dict={
                    self.features: self.data_frame[self.feature_cols],
                    self.labels: self.data_frame[self.label_cols]
                })
                error_rcd.append(error[0])
        return error_rcd
    def get_coefs(self):
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            coefs = sess.run([self.weights, self.const])
        return coefs
test_cluster = network_cluster(dataset, ['ship_jumps', 'npc_kills', 'ship_kills', 'pod_kills'], ['hour_of_week'])
%timeit test_cluster.learn_model(epocs = 100)
And numpy:
def grad_descent(dataset, features, predictor, max_iters = 10000):
    def initialize_model(dataset, features, predictor):
        constant_array = np.ones(shape = (len(dataset), 1))
        features_array = dataset.loc[:, features].values
        features_array = np.append(constant_array, features_array, axis = 1)
        predict_array = dataset.loc[:, predictor].values
        betas = np.zeros(shape = (len(features) + 1, len(predictor)))
        return (features_array, predict_array, betas)
    def calc_gradient(features_array, predict_array, betas):
        prediction = np.dot(features_array, betas)
        predict_error = predict_array - prediction
        gradient = -2 * np.dot(features_array.transpose(), predict_error)
        gradient_two = 2 * np.expand_dims(np.sum(features_array ** 2, axis = 0), axis = 1)
        return (gradient, gradient_two)
    def update_betas(gradient, gradient_two, betas):
        new_betas = betas - ((gradient / gradient_two) / len(betas))
        return new_betas
    def model_error(features_array, predict_array, betas):
        prediction = np.dot(features_array, betas)
        predict_error = predict_array - prediction
        model_error = np.sqrt(np.mean(predict_error ** 2))
        return model_error
    features_array, predict_array, betas = initialize_model(dataset, features, predictor)
    prior_error = np.inf
    for iter_count in range(max_iters):
        gradient, gradient_two = calc_gradient(features_array, predict_array, betas)
        betas = update_betas(gradient, gradient_two, betas)
        curr_error = model_error(features_array, predict_array, betas)
        if curr_error == prior_error:
            break
        prior_error = curr_error
    return (betas, iter_count, curr_error)
%timeit grad_descent(dataset, ['ship_jumps', 'npc_kills', 'ship_kills', 'pod_kills'], ['hour_of_week'], max_iters = 100)
I'm testing using the Spyder IDE, and I do have an Nvidia GPU (960). The Tensorflow code clocks in at ~20 seconds, with the Numpy code at about 7 seconds on the same dataset. The dataset is almost 1 million rows.
I would have expected Tensorflow to beat out Numpy handily here, but that's not the case. Granted I am new to using Tensorflow, and the Numpy implementation doesn't use a class, but still, 3x better with Numpy?!
Hoping for some thoughts/ideas on what I'm doing wrong here.
Without looking at your code in detail (not that much experience with TF):
This comparison is flawed!
Yaroslav's comment is of course true: GPU computing has some overhead (at least data preparation; I'm not sure what kind of compilation is being timed here).
It seems you are comparing pure GD to Adadelta in full-batch mode:
Adadelta of course implies some overhead (there are more operations than calculating the gradient and taking a step from the current iterate), as it is one of the common adaptive-learning-rate methods, which come at a price!
The idea is to invest some additional operations in order to:
reduce the number of iterations needed for a given learning rate
(it is actually more complex than that: for most people, the goal is good convergence using the default learning rates)
It seems you are just running 100 epochs of each and timing that.
That's not meaningful!
It's quite possible that the objective values reached are very different:
if the number of iterations is not enough
or the initial learning rate is badly chosen
or they are the same, but the lack of early stopping means that a possibly better algorithm, which has already converged (according to some criterion), wastes additional time doing all iterations until 100 is reached!
(Adadelta was probably designed for the SGD setting, not GD.)
It's very hard to compare such different algorithms, especially when using just one task / dataset.
Even if you introduced early stopping, you would observe random-seed-dependent, non-deterministic performance, which is hard to interpret.
You are basically measuring time per iteration, but this is not a good measure. Compare first-order methods (gradient-based: SGD, GD, ...) with second-order methods (Hessian-based: Newton). The latter are very slow per iteration, but usually achieve quadratic local convergence, resulting in far fewer iterations! In NN applications this comparison is more like L-BFGS vs. SGD/... (although I don't know if L-BFGS is available in TF; torch supports it). Quasi-Newton methods of the BFGS family are known for fast local convergence (superlinear for full BFGS), which is again hard to interpret in real-world tasks (especially as the size of the limited-memory approximation of the inverse Hessian is a parameter of L-BFGS). The same comparison can be made in linear programming, where the simplex method has fast iterations while interior-point methods (basically Newton-based, but since they treat constrained optimization, some additional ideas are needed) are much slower per iteration (despite converging faster in many cases).
What I ignored here: nearly all theoretical results regarding convergence and the like are limited to convex and smooth functions. NNs are typically non-convex, which means that evaluating these performance measures is even harder. But your problem here is of course convex.
I also have to admit that my answer only scratches the surface of this complex problem, even though unconstrained smooth convex optimization is one of the easier tasks in numerical optimization (compared to constrained, nonsmooth, nonconvex optimization).
For a general introduction to numerical optimization, which also covers first-order vs. second-order methods in depth (and there are many methods in between), I recommend Numerical Optimization by Nocedal and Wright, which can be found on the web.
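To make the "time per iteration vs. number of iterations" point concrete, here is a small NumPy sketch on a synthetic least-squares problem. All of it is assumed for illustration (the data, the tolerance, the step-size choice, and the helper name run); it is not the asker's dataset or either of the implementations above. Both methods are run until the gradient norm drops below the same tolerance, so steps and wall time can be compared at equal solution quality.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X @ np.arange(1.0, 6.0) + 0.01 * rng.normal(size=2000)


def gradient(beta):
    # Gradient of the mean squared error.
    return 2.0 * X.T @ (X @ beta - y) / len(y)


hessian = 2.0 * X.T @ X / len(y)   # constant for a least-squares objective


def run(step_fn, tol=1e-8, max_steps=10_000):
    """Iterate until the gradient norm is below tol; return (beta, steps, seconds)."""
    beta = np.zeros(X.shape[1])
    n_steps = 0
    start = time.perf_counter()
    while n_steps < max_steps and np.linalg.norm(gradient(beta)) >= tol:
        beta = beta - step_fn(gradient(beta))
        n_steps += 1
    return beta, n_steps, time.perf_counter() - start


# First-order: fixed step 1/L, where L is the largest Hessian eigenvalue.
lr = 1.0 / np.linalg.eigvalsh(hessian).max()
gd_beta, gd_steps, gd_time = run(lambda g: lr * g)

# Second-order: Newton step, which solves a linear system in every iteration.
nt_beta, nt_steps, nt_time = run(lambda g: np.linalg.solve(hessian, g))

print("GD:     %d steps, %.4f s" % (gd_steps, gd_time))
print("Newton: %d steps, %.4f s" % (nt_steps, nt_time))
```

Each Newton step costs more than a gradient step, yet on this quadratic objective far fewer of them are needed, which is why timing a fixed number of epochs says little about which method reaches a given accuracy first.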

How to write a custom Deterministic or Stochastic in pymc3 with theano.op?

I'm doing some pymc3 and I would like to create custom Stochastics; however, there doesn't seem to be a lot of documentation about how it's done. I know how to use the as_op way, but apparently that makes it impossible to use the NUTS sampler, in which case I don't see the advantage of pymc3 over pymc.
The tutorial mentions that it can be done by inheriting from theano.Op. But can anyone show me how that would work (I'm still getting started on theano)? I have two Stochastics that I want to define.
The first one should be easier, it's an N dimension vector F that has only constant parent variables:
with myModel:
    F = DensityDist('F', lambda value: pymc.skew_normal_like(value, F_mu_array, F_std_array, F_a_array), shape = N)
I want a skew normal distribution, which doesn't seem to be implemented in pymc3 yet, I just imported the pymc2 version. Unfortunately, F_mu_array, F_std_array, F_a_array and F are all N-dimensional vectors, and the lambda thing doesn't seem to work with an N-dimension list value.
Firstly, is there a way to make the lambda input an N-dimensional array? If not, I guess I would need to define the Stochastic F directly, and this is where I presume I need theano.Op to make it work.
The second example is a more complicated function of other Stochastics. Here is how I want to define it (incorrectly at the moment):
with myModel:
    ln2_var = Uniform('ln2_var', lower=-10, upper=4)
    sigma = Deterministic('sigma', exp(0.5*ln2_var))
    A = Uniform('A', lower=-10, upper=10, shape=5)
    C = Uniform('C', lower=0.0, upper=2.0, shape=5)
    sw = Normal('sw', mu=5.5, sd=0.5, shape=5)
    # F from before
    F = DensityDist('F', lambda value: skew_normal_like(value, F_mu_array, F_std_array, F_a_array), shape = N)
    M = Normal('M', mu=M_obs_array, sd=M_stdev, shape=N)
    # Radius forward-model (THIS IS THE STOCHASTIC IN QUESTION)
    R = Normal('R', mu = R_forward(F, M, A, C, sw, N), sd=sigma, shape=N)
Where the function R_forward(F,M,A,C,sw,N) is naively defined as:
from theano.tensor import lt, le, eq, gt, ge
def R_forward(Flux, Mass, A, C, sw, num):
    for i in range(num):
        if lt(Mass[i], 0.2):
            if lt(Flux[i], sw[0]):
                muR = C[0]
            else:
                muR = A[0]*log10(Flux[i]) + C[0] - A[0]*log10(sw[0])
        elif (le(0.2, Mass[i]) or le(Mass[i], 0.5)):
            if lt(Flux[i], sw[1]):
                muR = C[1]
            else:
                muR = A[1]*log10(Flux[i]) + C[1] - A[1]*log10(sw[1])
        elif (le(0.5, Mass[i]) or le(Mass[i], 1.5)):
            if lt(Flux[i], sw[2]):
                muR = C[2]
            else:
                muR = A[2]*log10(Flux[i]) + C[2] - A[2]*log10(sw[2])
        elif (le(1.5, Mass[i]) or le(Mass[i], 3.5)):
            if lt(Flux[i], sw[3]):
                muR = C[3]
            else:
                muR = A[3]*log10(Flux[i]) + C[3] - A[3]*log10(sw[3])
        else:
            if lt(Flux[i], sw[4]):
                muR = C[4]
            else:
                muR = A[4]*log10(Flux[i]) + C[4] - A[4]*log10(sw[4])
    return muR
This presumably won't work of course. I can see how I would use as_op, but I want to preserve the NUTS sampling.
I realize this is a bit late now, but I thought I'd answer the question (rather vaguely) anyways.
If you want to define a stochastic function (e.g. a probability distribution), then you need to do a couple of things:
First, define a subclass of either Discrete (pymc3.distributions.Discrete) or Continuous, which has at least the method logp, which returns the log-likelihood of your stochastic. If you define this as a simple symbolic equation (x+1), I believe you do not need to take care of any gradients (but don't quote me on this; see the documentation about this). I'll get on to more complicated cases below. In the unfortunate case that you need to do anything more complex, as in your second example (pymc3 now has a skew normal distribution implemented, by the way), you need to define the operations required for it (used in the logp method) as a Theano Op. If you need no derivatives, then the as_op does the job, but as you said, gradients are kind of the idea of pymc3.
This is where it gets complicated. If you want to use NUTS (or need gradients for whatever reason), then you need to implement your operation used in logp as a subclass of theano.gof.Op. Your new op class (let's call it just Op from now on) will need two or three methods at least. The first one defines inputs/outputs to the Op (check the Op documentation). The perform() method (or variants you might choose) is the one that does the operation you want (your R_forward function, for example). This can be done in pure python, if you so wish. The third method, grad(), is where you define the gradient of your perform()'s output wrt the inputs. The actual output to grad() is a bit different, but not a big deal.
And it is in grad() that using Theano pays off. If you define your entire perform() in Theano, then it might be that you can easily use automatic differentiation (theano.tensor.grad or theano.tensor.jacobian) to do the work for you (see the example below). However, this is not necessarily going to be easy.
In your second example, it would mean implementing your R_forward function in Theano, which could be complicated.
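One way to see what such an implementation would have to compute is to first write the forward model without Python branching, as whole-array selections. The sketch below does this in plain NumPy under stated assumptions: the function name r_forward_numpy is hypothetical, the mass bins are read as consecutive half-open intervals with edges 0.2, 0.5, 1.5 and 3.5, and all fluxes are positive. In Theano the element-wise selection done here with np.where would be expressed with theano.tensor.switch; how best to compute the bin index symbolically is left open.

```python
import numpy as np

# Bin boundaries taken from the question's R_forward.
MASS_EDGES = np.array([0.2, 0.5, 1.5, 3.5])


def r_forward_numpy(flux, mass, A, C, sw):
    """Vectorized forward model: one mean radius per (flux, mass) pair."""
    flux = np.asarray(flux, dtype=float)
    mass = np.asarray(mass, dtype=float)
    # Index 0..4 of the mass bin each element falls into.
    k = np.searchsorted(MASS_EDGES, mass, side="right")
    a, c, s = A[k], C[k], sw[k]
    # Note: np.where evaluates both branches, so flux must be positive.
    sloped = a * np.log10(flux) + c - a * np.log10(s)
    # Flat at C[k] below the switch point, sloped above it.
    return np.where(flux < s, c, sloped)


# Tiny usage example with made-up parameter values.
A = np.full(5, 2.0)
C = np.arange(5.0)
sw = np.full(5, 10.0)
mu_R = r_forward_numpy(flux=[1.0, 100.0], mass=[0.1, 1.0], A=A, C=C, sw=sw)
```

Unlike the loop in the question, this returns one value per element, which is what a Normal with shape=N needs for its mu.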
Here I include a somewhat minimal example of an Op that I created while learning to do these things.
# Assumed aliases; the original snippet does not show its imports.
import numpy as np
import theano as th

def my_th_fun():
    """ Some needed auxiliary functions.
    """
    X = th.tensor.vector('X')
    SCALE = th.tensor.scalar('SCALE')
    X.tag.test_value = np.array([1,2,3,4])
    SCALE.tag.test_value = 5.
    Scale, upd_sm_X = th.scan(lambda x, scale: scale*(scale+ x),
                              sequences=[X],
                              outputs_info=[SCALE])
    fun_Scale = th.function(inputs=[X, SCALE], outputs=Scale)
    D_out_d_scale = th.tensor.grad(Scale[-1], SCALE)
    fun_d_out_d_scale = th.function([X, SCALE], D_out_d_scale)
    return Scale, fun_Scale, D_out_d_scale, fun_d_out_d_scale

class myOp(th.gof.Op):
    """ Op subclass with a somewhat silly computation. It uses
    th.scan and th.tensor.grad is used to calculate the gradient
    automagically in the grad() method.
    """
    __props__ = ()
    itypes = [th.tensor.dscalar]
    otypes = [th.tensor.dvector]
    def __init__(self, *args, **kwargs):
        super(myOp, self).__init__(*args, **kwargs)
        self.base_dist = np.arange(1,5)
        (self.UPD_scale, self.fun_scale,
         self.D_out_d_scale, self.fun_d_out_d_scale)= my_th_fun()
    def perform(self, node, inputs, outputs):
        scale = inputs[0]
        updated_scale = self.fun_scale(self.base_dist, scale)
        out1 = self.base_dist[0:2].sum()
        out2 = self.base_dist[2:4].sum()
        maxout = np.max([out1, out2])
        exp_out1 = np.exp(updated_scale[-1]*(out1-maxout))
        exp_out2 = np.exp(updated_scale[-1]*(out2-maxout))
        norm_const = exp_out1 + exp_out2
        outputs[0][0] = np.array([exp_out1/norm_const, exp_out2/norm_const])
    def grad(self, inputs, output_gradients): #working!
        """ Calculates the gradient of the output of the Op wrt
        to the input. As a simple example, the input is scalar.
        Notice how the output is actually the gradient multiplied
        by the output_gradients, which is an input provided by
        theano when calculating gradients.
        """
        scale = inputs[0]
        X = th.tensor.as_tensor(self.base_dist)
        # Do I need to recalculate all this or can I assume that perform() has
        # always been called before grad() and thus can take it from there?
        # In any case, this is a small enough example to recalculate quickly:
        all_scale, _ = th.scan(lambda x, scale_1: scale_1*(scale_1+ x),
                               sequences=[X],
                               outputs_info=[scale])
        updated_scale = all_scale[-1]
        # same slices as in perform()
        out1 = self.base_dist[0:2].sum()
        out2 = self.base_dist[2:4].sum()
        maxout = np.max([out1, out2])
        exp_out1 = th.tensor.exp(updated_scale*(out1 - maxout))
        exp_out2 = th.tensor.exp(updated_scale*(out2 - maxout))
        norm_const = exp_out1 + exp_out2
        d_S_d_scale = th.tensor.grad(all_scale[-1], scale)
        Jac1 = (-(out1-out2)*d_S_d_scale*
                th.tensor.exp(updated_scale*(out1+out2 - 2*maxout))/(norm_const**2))
        Jac2 = -Jac1
        return Jac1*output_gradients[0][0]+ Jac2*output_gradients[0][1],
This Op can then be used inside the logp() method of a stochastic in pymc3:
import pymc3 as pm
class myDist(pm.distributions.Discrete):
    def __init__(self, invT, *args, **kwargs):
        super(myDist, self).__init__(*args, **kwargs)
        self.invT = invT
        self.myOp = myOp()
    def logp(self, value):
        return self.myOp(self.invT)[value]
I hope it helps any (hopeless) pymc3/theano newbie out there.

Scipy Curve Fit Global Parameters

I'm trying to fit two global parameters of a galactic model using SciPy's curve_fit in Python. I have an array of independent variables and an array of dependent variables. The first quarter of the data set needs to be fit to a function depending on the two global parameters and two local parameters, the next quarter to another function depending on the two global parameters and two local parameters, etc.
Is there any way I can write a function that calls the appropriate function with the right index and the global parameters across the entire array?
What I have so far is:
def galaxy_func_inner(time,a,b,c,d):
    telescope_inner = lt.station(rot_angle=c,pol_angle=d)
    power = telescope_inner.calculate_gpowervslstarray(time)[0]
    return a*np.array(power)+b
def galaxy_func_outer(time,a,b,c,d):
    telescope_outer = lt.station(rot_angle=c,pol_angle=d)
    power = telescope_outer.calculate_gpowervslstarray(time)[0]
    return a*np.array(power)+b
def galaxy_func_global(time,R,P,a,b,c,d,e,f,g,h):
    for t_index in range(len(time)):
        if t_index in range(0,50):
            return galaxy_func_outer(t_index,a,b,R,P)
        elif t_index in range(50,100):
            return galaxy_func_outer(t_index,c,d,R,P)
        elif t_index in range(100,150):
            return galaxy_func_inner(t_index,e,f,R,P)
        elif t_index in range(150,200):
            return galaxy_func_inner(t_index,g,h,R,P)
The problem is that this only evaluates the first time point rather than the whole time array, and the single point is only fitted to the corresponding model point and not the whole array. Any help as to how to reformulate this? I've tried to reformulate it as:
def galaxy_func_global(xdata,R,P,a,b,c,d,e,f,g,h):
    return galaxy_func_outer(xdata[0:50],a,b,R,P),galaxy_func_outer(xdata[50:100],c,d,R,P),galaxy_func_inner(xdata[100:150],e,f,R,P),galaxy_func_inner(xdata[150:200],g,h,R,P)
but I get the error:
File "galaxy_calibration.py", line 117, in <module>
popt,pcov = curve_fit(galaxy_func_global,xdata,ydata)
File "/Library/Python/2.7/site-packages/scipy-0.14.0.dev_7cefb25-py2.7-macosx-10.9-intel.egg/scipy/optimize/minpack.py", line 555, in curve_fit
res = leastsq(func, p0, args=args, full_output=1, **kw)
File "/Library/Python/2.7/site-packages/scipy-0.14.0.dev_7cefb25-py2.7-macosx-10.9-intel.egg/scipy/optimize/minpack.py", line 369, in leastsq
shape, dtype = _check_func('leastsq', 'func', func, x0, args, n)
File "/Library/Python/2.7/site-packages/scipy-0.14.0.dev_7cefb25-py2.7-macosx-10.9-intel.egg/scipy/optimize/minpack.py", line 20, in _check_func
res = atleast_1d(thefunc(*((x0[:numinputs],) + args)))
File "/Library/Python/2.7/site-packages/scipy-0.14.0.dev_7cefb25-py2.7-macosx-10.9-intel.egg/scipy/optimize/minpack.py", line 445, in _general_function
return function(xdata, *params) - ydata
ValueError: operands could not be broadcast together with shapes (4,) (191,)
Any help would be much appreciated.
If you want to cut your input data into 4 batches (based on the index of the time points) and process the data depending on the batches, then return the results in a single array, then you can do this:
def galaxy_func_global(time,R,P,a,b,c,d,e,f,g,h):
    return np.concatenate([galaxy_func_outer(time[0:50],a,b,R,P),
                           galaxy_func_outer(time[50:100],c,d,R,P),
                           galaxy_func_inner(time[100:150],e,f,R,P),
                           galaxy_func_inner(time[150:200],g,h,R,P)])
This will slice into your time array to pick out each slice of interest, then call the appropriate function for each piece. It seems to me that these functions return simple np.arrays, which can be concatenated to get a single array as result.
(I just realized that I could've just said "what you tried was almost perfect, but you need to concatenate the resulting arrays into a single array":)
Note that there are at least two ways in which you can have dimensioning problems.
Firstly, you should make sure that the return value of both of your functions (galaxy...inner/outer()) is a 1d numpy array. Otherwise you'll run into problems with your global return value.
Secondly, every fitting method expects a function the return value of which has the same size (shape) as the input variable, for obvious reasons. So you can also run into problems with your current code if time is not exactly 200 elements long, since your output will be truncated to 200 elements even if time is longer. At least you should put
galaxy_func_inner(time[150:],g,h,R,P)
into your last function call to catch all the remaining points of time, but if you want to do it properly, call
def galaxy_func_global(time,R,P,a,b,c,d,e,f,g,h):
    # slice indices must be integers, so cast after flooring
    inds=np.floor(np.linspace(0,len(time)-1,5)).astype(int)
    return np.concatenate([galaxy_func_outer(time[0:inds[1]],a,b,R,P),
                           galaxy_func_outer(time[inds[1]:inds[2]],c,d,R,P),
                           galaxy_func_inner(time[inds[2]:inds[3]],e,f,R,P),
                           galaxy_func_inner(time[inds[3]:],g,h,R,P)])
Also note that your original error is formally of this kind:
File "/Library/Python/2.7/site-packages/scipy-0.14.0.dev_7cefb25-py2.7-macosx-10.9-intel.egg/scipy/optimize/minpack.py", line 445, in _general_function
return function(xdata, *params) - ydata
ValueError: operands could not be broadcast together with shapes (4,) (191,)
This tells you that python couldn't subtract ydata from function(xdata,*params) (i.e. your fitting model) because one is of length 4 while the other is of length 191. This is because if your function calls return a,b,c,d, then it will return a tuple (a,b,c,d), so the return value will have a length of 4. It's more interesting that your ydata has length 191, this might mean that you'll still run into an error.
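The concatenation approach can be tried end to end on synthetic data. In the sketch below, segment_model is a hypothetical stand-in for galaxy_func_inner/outer (the real functions depend on the lt.station object, which is not available here), the data are made up, and np.array_split is used as one possible way to cut an array whose length (191, as in the traceback) is not divisible by four.

```python
import numpy as np
from scipy.optimize import curve_fit


def segment_model(t, a, b, R, P):
    # Hypothetical stand-in for galaxy_func_inner/outer: local scale a and
    # offset b, with global parameters R and P shared by every segment.
    return a * np.exp(-R * t) + b + P * t


def global_model(t, R, P, a, b, c, d, e, f, g, h):
    parts = np.array_split(t, 4)          # four consecutive, nearly equal pieces
    local = [(a, b), (c, d), (e, f), (g, h)]
    return np.concatenate([segment_model(seg, la, lb, R, P)
                           for seg, (la, lb) in zip(parts, local)])


# Synthetic data with 191 points.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0, 191)
true_params = [1.3, 0.4, 1.0, 0.0, 2.0, 0.5, 0.5, -1.0, 1.5, 2.0]
y = global_model(t, *true_params) + 0.01 * rng.normal(size=t.size)


def rms(params):
    return np.sqrt(np.mean((global_model(t, *params) - y) ** 2))


p0 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
popt, pcov = curve_fit(global_model, t, y, p0=p0)
print("RMS residual before/after the fit:", rms(p0), rms(popt))
```

Whatever the segment functions are, the property that matters to curve_fit is that the model returns exactly one value per element of xdata, so that it can be subtracted from ydata.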

Creating new distributions in scipy

I'm trying to create a distribution based on some data I have, then draw randomly from that distribution. Here's what I have:
from scipy import stats
import numpy
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
    return rv()
if __name__ == "__main__":
    # pretend this is real data
    data = numpy.concatenate((numpy.random.normal(2,5,100), numpy.random.normal(25,5,100)))
    d = getDistribution(data)
    print d.rvs(size=100) # this usually fails
I think this is doing what I want it to, but I frequently get an error (see below) when I try to do d.rvs(), and d.rvs(100) never works. Am I doing something wrong? Is there an easier or better way to do this? If it's a bug in scipy, is there some way to get around it?
Finally, is there more documentation on creating custom distributions somewhere? The best I've found is the scipy.stats.rv_continuous documentation, which is pretty spartan and contains no useful examples.
The traceback:
Traceback (most recent call last):
  File "testDistributions.py", line 19, in <module>
    print d.rvs(size=100)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 696, in rvs
    vals = self._rvs(*args)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1193, in _rvs
    Y = self._ppf(U,*args)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1212, in _ppf
    return self.vecfunc(q,*args)
  File "/usr/local/lib/python2.6/dist-packages/numpy-1.6.1-py2.6-linux-x86_64.egg/numpy/lib/function_base.py", line 1862, in __call__
    theout = self.thefunc(*newargs)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/stats/distributions.py", line 1158, in _ppf_single_call
    return optimize.brentq(self._ppf_to_solve, self.xa, self.xb, args=(q,)+args, xtol=self.xtol)
  File "/usr/local/lib/python2.6/dist-packages/scipy-0.10.0-py2.6-linux-x86_64.egg/scipy/optimize/zeros.py", line 366, in brentq
    r = _zeros._brentq(f,a,b,xtol,maxiter,args,full_output,disp)
ValueError: f(a) and f(b) must have different signs
Edit
For those curious, following the advice in the answer below, here's code that works:
from scipy import stats
import numpy
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _rvs(self, *x, **y):
            # don't ask me why it's using self._size
            # nor why I have to cast to int
            return kernel.resample(int(self._size))
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
        def _pdf(self, x):
            return kernel.evaluate(x)
    return rv(name='kdedist', xa=-200, xb=200)
Specifically to your traceback:
rvs uses the inverse of the cdf, ppf, to create random numbers. Since you are not specifying ppf, it is calculated by a root-finding algorithm, brentq. brentq uses lower and upper bounds on where it should search for the value at which the function is zero (find x such that cdf(x) = q, where q is the quantile).
The default limits, xa and xb, are too narrow for your example. The following works for me with scipy 0.9.0; xa and xb can be set when creating the distribution instance:
def getDistribution(data):
    kernel = stats.gaussian_kde(data)
    class rv(stats.rv_continuous):
        def _cdf(self, x):
            return kernel.integrate_box_1d(-numpy.Inf, x)
    return rv(name='kdedist', xa=-200, xb=200)
There is currently a pull request for scipy to improve this, so in the next release xa and xb will be expanded automatically to avoid the f(a) and f(b) must have different signs exception.
There is not much documentation on this; the easiest approach is to follow some examples (and ask on the mailing list).
edit: addition
pdf: Since you have the density function also given by gaussian_kde, I would add the _pdf method, which will make some calculations more efficient.
edit2: addition
rvs: If you are interested in generating random numbers, then gaussian_kde has a resample method. Random Samples can be generated by sampling from the data and adding gaussian noise. So, this will be faster than the generic rvs using the ppf method. I would write a ._rvs method that just calls gaussian_kde's resample method.
precomputing ppf: I don't know of any general way to precompute the ppf. However, the way I thought of doing it (but never tried so far) is to precompute the ppf at many points and then use linear interpolation to approximate the ppf function.
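The tabulate-and-interpolate idea for the ppf, and the direct resample route, can both be sketched without subclassing rv_continuous. This is only an illustration under assumptions: the grid size, the padding of five standard deviations, and the helper name approx_ppf are arbitrary choices, and the accuracy depends on how fine the grid is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Same kind of bimodal toy data as in the question.
data = np.concatenate((rng.normal(2, 5, 100), rng.normal(25, 5, 100)))
kernel = stats.gaussian_kde(data)

# Tabulate the KDE's CDF on a grid that comfortably covers the data.
pad = 5 * data.std()
grid = np.linspace(data.min() - pad, data.max() + pad, 2001)
cdf = np.array([kernel.integrate_box_1d(-np.inf, x) for x in grid])
cdf = np.maximum.accumulate(cdf)   # guard against rounding breaking monotonicity


def approx_ppf(q):
    """Approximate inverse CDF by linear interpolation of the table."""
    return np.interp(q, cdf, grid)


# Inverse-transform sampling with the tabulated ppf ...
samples_ppf = approx_ppf(rng.uniform(size=1000))
# ... and the direct route through gaussian_kde.resample.
samples_kde = kernel.resample(1000)[0]
```

The table is built once, after which each draw costs only an interpolation; resample avoids the CDF altogether and is the simpler choice when only random numbers are needed.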
edit3: about _rvs to answer Srivatsan's question in the comment
_rvs is the distribution-specific method that is called by the public method rvs. rvs is a generic method that does some argument checking, adds location and scale, sets the attribute self._size (the size of the requested array of random variables), and then calls the distribution-specific method ._rvs or its generic counterpart. The extra arguments in ._rvs are shape parameters, but since there are none in this case, *x and **y are redundant and unused.
I don't know how well the size or shape of the .rvs method works in the multivariate case. These distributions are designed for univariate distributions, and might not fully work for the multivariate case, or might need some reshapes.
