Quantopian / Zipline: weird pattern in Pipeline package - python

I recently found a very strange pattern in the "Pipeline" API from Quantopian/Zipline: they have a CustomFactor class with a compute() method that you override when implementing your own factor model.
The signature of compute() is: def compute(self, today, assets, out, *inputs), with the following comment for parameter "out":
Output array of the same shape as assets. compute should write its desired return values into out.
When I asked why the function could not simply return an output array instead of writing into an input parameter, I received the following answer:
"If the API required that the output array be returned by compute(), we'd end up doing a copy of the array into the actual output buffer which means an extra copy would get made unnecessarily."
I fail to understand why they end up doing so. Obviously in Python there are no issues with passing by value, and there is no risk of unnecessarily copying data. This is really painful, because this is the kind of implementation they recommend people to write:
def compute(self, today, assets, out, data):
    out[:] = data[-1]
So my question is, why could it not simply be:
def compute(self, today, assets, data):
    return data[-1]

(I designed and implemented the API in question here.)
You're right that Python objects aren't copied when passed into and out of functions. The reason there's a difference between returning a row out of your CustomFactor and writing values into a provided array has to do with copies that would be made in the code that's calling your CustomFactor compute method.
When the CustomFactor API was originally designed, the code that calls your compute method looked roughly like this:
def _compute(self, windows, dates, assets):
    # `windows` here is a list of iterators yielding 2D slices of
    # the user's requested inputs.
    # `dates` and `assets` are row/column labels for the final output.

    # Allocate a (dates x assets) output array.
    # Each invocation of the user's `compute` function
    # corresponds to one row of output.
    output = allocate_output()

    for i in range(len(dates)):
        # Grab the next set of input arrays.
        inputs = [next(w) for w in windows]
        # Call the user's compute, which is responsible for writing
        # values into `out`.
        self.compute(
            dates[i],
            assets,
            # This index is a non-copying operation.
            # It creates a view into row `i` of `output`.
            output[i],
            *inputs  # Unpack all the inputs.
        )
    return output
The basic idea here is that we've pre-fetched a sizeable amount of data, and we're now going to loop over windows into that data, call the user's compute function on the data, and write the result into a pre-allocated output array, which is then passed along to further transformations.
No matter what we do, we have to pay the cost of at least one copy to get the result of the user's compute function into the output array.
The most obvious API, as you point out, is to have the user simply return the output row, in which case the calling code would look like:
# Get the result row from the user.
result_row = self.compute(dates[i], assets, *inputs)
# Copy the user's result into our output buffer.
output[i] = result_row
If that were the API, then we'd be locked into paying at least the following costs for each invocation of the user's compute:
1. Allocating the ~64000-byte array that the user will return.
2. A copy of the user's computed data into the user's output array.
3. A copy from the user's output array into our own, larger array.
With the existing API, we avoid costs (1) and (3).
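To make the copy accounting concrete, here is a small NumPy sketch (not Zipline code; the engine's real buffer allocation is simplified away) showing that writing through the view output[i] fills the final buffer directly, while returning a fresh array forces an extra allocation plus a copy:

import numpy as np

output = np.empty((2, 4))                 # stand-in for the engine's output buffer
data = np.arange(12.0).reshape(3, 4)      # stand-in for one input window

def compute_into(out, data):
    # Writes the result straight into the row view passed in; no new array.
    np.mean(data, axis=0, out=out)

def compute_return(data):
    # Allocates a brand-new result array that the caller must then copy.
    return data.mean(axis=0)

compute_into(output[0], data)             # result lands in `output` directly
output[1] = compute_return(data)          # extra temporary + copy into `output`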
With all that said, we've since made changes to how CustomFactors work that make some of the above optimizations less useful. In particular, we now only pass data to compute for assets that weren't masked out on that day, which requires a partial copy of the output array before and after the call to compute.
There are still some design reasons to prefer the existing API though. In particular, leaving the engine in control of the output allocation makes it easier for us to do things like pass recarrays for multi-output factors.
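As a rough illustration of why engine-controlled allocation helps there (field names below are invented, not Zipline's API): if the engine hands compute a structured/record array, a multi-output factor can fill each named output through the same out parameter.

import numpy as np

# Hypothetical multi-output buffer allocated by the engine.
out = np.recarray((4,), dtype=[('alpha', 'f8'), ('beta', 'f8')])
data = np.arange(12.0).reshape(3, 4)

# A user compute() could then write each output in place:
out.alpha[:] = data.mean(axis=0)
out.beta[:] = data.std(axis=0)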

Related

Dask map_blocks is running earlier with a bad result for overlap and nested procedures

I'm using Dask to create a simple data-manipulation pipeline. I'm basically using 3 functions. The first two use a simple map_blocks, and the third one also uses map_blocks, but on overlapped data.
For some reason, the third map_blocks is executing earlier than I want. See the code and the IPython output (without calling compute()):
import numpy as np
import dask.array as da

data = np.arange(2000)
data_da = da.from_array(data, chunks=(500,))

def func1(block, block_info=None):
    return block + 1

def func2(block, block_info=None):
    return block * 2

def func3(block, block_info=None):
    print("func3", block_info)
    return block

data_da_1 = data_da.map_blocks(func1)
data_da_2 = data_da_1.map_blocks(func2)
data_da_over = da.overlap.overlap(data_da_2, depth=(1), boundary='periodic')
data_da_map = data_da_over.map_blocks(func3)
data_da_3 = da.overlap.trim_internal(data_da_map, {0: 1})
The output is:
func3 None
func3 None
It is still not respecting the number of blocks, which is 4 here.
I really don't know what is wrong with this code, especially because if I use visualize() to see the task graph, it builds exactly the data sequence I want.
Initially, I thought that overlap requires compute() before like rechunk does, but I've already tested that too.
Like many dask operations, da.overlap operations can either be passed a meta argument specifying the output type and dimensions, or dask will execute the function on a small (or length-zero) subset of the data to infer them.
From the dask.array.map_overlap docs:
Note that this function will attempt to automatically determine the output array type before computing it, please refer to the meta keyword argument in map_blocks if you expect that the function will not succeed when operating on 0-d arrays.
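Based on that, here is a sketch of the workaround for the snippet above (assuming the goal is just to suppress the early inference calls): pass meta to map_blocks so dask already knows the output type and dtype.

import numpy as np
import dask.array as da

def func3(block, block_info=None):
    print("func3", block_info)
    return block

arr = da.from_array(np.arange(2000), chunks=(500,))
overlapped = da.overlap.overlap(arr, depth=1, boundary='periodic')

# With `meta` given, dask skips calling func3 on an empty array to infer the
# output type, so the stray "func3 None" lines disappear.
mapped = overlapped.map_blocks(func3, meta=np.array((), dtype=arr.dtype))
result = da.overlap.trim_internal(mapped, {0: 1})
result.compute()   # func3 now runs once per overlapped block, at compute time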

With pytorch/python, is it better to overwrite variables or define new ones?

Suppose I have an algorithm that does the following in python with pytorch. Please ignore whether the steps are efficient. This is just a silly toy example.
import torch
import torch.nn.functional as F

def foo(input_list):
    # input_list is a list of N 2-D pytorch tensors of shape (h, w)
    tensor = torch.stack(input_list)                      # convert to tensor.shape (h, w, N)
    tensor1 = torch.transpose(tensor, 0, 2).unsqueeze(1)  # convert to tensor.shape (N, 1, h, w)
    tensor2 = F.interpolate(tensor1, size=(500, 500))     # upsample to new shape (N, 1, 500, 500)
    return tensor2

def bar(input_list):
    tensor = torch.stack(input_list)                      # convert to tensor.shape (h, w, N)
    tensor = torch.transpose(tensor, 0, 2).unsqueeze(1)   # convert to tensor.shape (N, 1, h, w)
    tensor = F.interpolate(tensor, size=(500, 500))       # upsample to new shape (N, 1, 500, 500)
    return tensor
My question is whether it makes more sense to use method foo() or bar() or if it doesn't matter. My thought was that I save memory by rewriting over the same variable name (bar), since I will never actually need those intermediate steps. But if the CUDA interface is creating new memory spaces for each function, then I'm spending the same amount of memory with both methods.
tensor and tensor1 in your example are just different views of the same data in memory, so the memory difference of potentially maintaining two slightly different references to it should be negligible. The relevant part would only be tensor1 vs tensor2.
You might want to see this similar question:
Python: is the "old" memory free'd when a variable is assigned new content?
Since the reassignment to tensor that actually allocates new memory is also the final call in bar, I suspect that in this particular example the total memory wouldn't be impacted (tensor1 would be unreferenced once the function returns anyway).
With a longer chain of operations, I don't think the GC is guaranteed to be called on any of these reassignments, though it might give python some more flexibility. I'd probably prefer the style in foo just because it's easier to later change the order of operations in the chain. Keeping track of different names adds overhead for the programmer, not just the interpreter.
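A quick way to check the "same data in memory" point (the shapes here are arbitrary, not from the question):

import torch

t = torch.arange(24.0).reshape(2, 3, 4)
t1 = torch.transpose(t, 0, 2).unsqueeze(1)   # a view, like tensor1 above

print(t1.data_ptr() == t.data_ptr())   # True: both names point at the same storage
print(t1.shape)                        # torch.Size([4, 1, 3, 2])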

Counting used attributes at runtime

I'm working on a python project that requires me to compile certain attributes of some objects into a dataset. The code I'm currently using is something like the following:
class VectorBuilder(object):
    SIZE = 5

    def __init__(self, player, frame_data):
        self.player = player
        self.fd = frame_data

    def build(self):
        self._vector = []
        self._add(self.player)
        self._add(self.fd.getSomeData())
        self._add(self.fd.getSomeOtherData())
        char = self.fd.getCharacter()
        self._add(char.getCharacterData())
        self._add(char.getMoreCharacterData())
        assert len(self._vector) == self.SIZE
        return self._vector

    def _add(self, element):
        self._vector.append(element)
However, this code is slightly unclean because adding/removing attributes to/from the dataset also requires correctly adjusting the SIZE variable. The reason I even have the SIZE variable is that the size of the dataset needs to be known at runtime before the dataset itself is created.
I've thought of instead keeping a list of all the functions used to construct the dataset as strings (as in attributes = ['getPlayer', 'fd.getSomeData', ...]) and then defining the build function as something like:
def build(self):
    self._vector = []
    for att in attributes:
        self._vector.append(getattr(self, att)())
    return self._vector
This would let me access the size as simply len(attributes) and I only ever need to edit attributes, but I don't know how to make this approach work with the chained function calls, such as self.fd.getCharacter().getCharacterData().
Is there a cleaner way to accomplish what I'm trying to do?
EDIT:
Some additional information and clarification is necessary.
I was using __ due to some bad advice I read online (essentially saying I should use _ for module-private members and __ for class-private members). I've edited them to _ attributes now.
The getters are a part of the framework I'm using.
The vector is stored as a private class member so I don't have to pass it around the construction methods, which are in actuality more numerous than the simple _add, doing some other stuff like normalisation and bool->int conversion on the elements before adding them to the vector.
SIZE, as it currently stands, is a true constant. It is only ever given a value in the first line of VectorBuilder and never changed at runtime. I realise that I did not clarify this properly in the main post, but new attributes never get added at runtime. The adjustment I was talking about would take place at programming time. For example, if I wanted to add a new attribute, I would need to add it in the build function, e.g.:
self._add(self.fd.getCharacter().getAction().getActionData().getSpeed())
as well as change the SIZE definition to SIZE = 6.
The attributes are compiled into what is currently a simple python list (but will probably be replaced with a numpy array), then passed into a neural network as an input vector. However, the neural network itself needs to be built first, and this happens before any data is made available (i.e. before any input vectors are created). In order to be built successfully, the neural network needs to know the size of the input vectors it will be receiving, though. This is why SIZE is necessary and also the reason for the assert statement - to ascertain that the vectors I'm passing to the network are in fact the size I claimed I would be passing to it.
I'm aware the code is unpythonic, that is why I'm here - the code works, it's just ugly.
Instead of providing a list of attribute-name strings to build the input arguments from, why don't you call the build function with a list containing all the values returned by your getter functions?
Your SIZE variable would then simply be the length of the dynamic argument list provided to build(self, *args), for example.
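A minimal, hypothetical sketch of that idea (the FrameData/Character stubs below stand in for the question's framework objects; chained getters are resolved when the value list is built, so they need no special handling):

class VectorBuilder(object):
    def build(self, *args):
        # `args` already holds every value the vector should contain,
        # so no separate SIZE constant is needed.
        self._vector = list(args)
        return self._vector

# Stand-ins for the real framework objects from the question.
class Character(object):
    def getCharacterData(self):
        return 3
    def getMoreCharacterData(self):
        return 4

class FrameData(object):
    def getSomeData(self):
        return 1
    def getSomeOtherData(self):
        return 2
    def getCharacter(self):
        return Character()

player, fd = 0, FrameData()
char = fd.getCharacter()
values = [
    player,
    fd.getSomeData(),
    fd.getSomeOtherData(),
    char.getCharacterData(),
    char.getMoreCharacterData(),
]
vector = VectorBuilder().build(*values)
assert len(vector) == len(values)   # the size is simply len(values)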

Modifying variables in Python function is affecting variables with different names outside the function

I have a nested dictionary containing a bunch of data on a number of different objects (where I mean object in the non-programming sense of the word). The format of the dictionary is allData[i][someDataType], where i is a number designation of the object that I have data on, and someDataType is a specific data array associated with the object in question.
Now, I have a function that I have defined that requires a particular data array for a calculation to be performed for each object. The data array is called cleanFDF. So I feed this to my function, along with a bunch of other things it requires to work. I call it like this:
rm.analyze4complexity(allData[i]['cleanFDF'], other data, other data, other data)
Inside the function itself, I straight away re-assign the cleanFDF data to another variable name, namely clFDF. I.e. The end result is:
clFDF = allData[i]['cleanFDF']
I then have to zero out all of the data that lies below a certain threshold, as such:
clFDF[ clFDF < threshold ] = 0
OK - the function works as it is supposed to. But now when I try to plot the original cleanFDF data back in the main script, the entries that got zeroed out in clFDF are also zeroed out in allData[i]['cleanFDF']. WTF? Obviously something is happening here that I do not understand.
To make matters even weirder (from my point of view), I've tried to do a bodgy kludge to get around this by 'saving' the array to another variable before calling the function. I.e. I do
saveFDF = allData[i]['cleanFDF']
then run the function, then update the cleanFDF entry with the 'saved' data:
allData[i].update( {'cleanFDF':saveFDF} )
but somehow, simply by performing clFDF[ clFDF < threshold ] = 0 within the function modifies clFDF, saveFDF and allData[i]['cleanFDF'] in the main friggin' script, zeroing out all the entires at the same array indexes! It is like they are all associated global variables somehow, but I've made no such declarations anywhere...
I am a hopeless Python newbie, so no doubt I'm not understanding something about how it works. Any help would be greatly appreciated!
You are passing the value at allData[i]['cleanFDF'] by reference (decent explanation at https://stackoverflow.com/a/430958/337678). Any changes made to it will be made to the object it refers to, which is still the same object as the original, just assigned to a different variable.
Making a deep copy of the data will likely fix your issue (Python's copy module has a deepcopy function that should do the trick ;)).
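A sketch of that fix (the allData layout and threshold value are taken loosely from the question):

import copy
import numpy as np

allData = {0: {'cleanFDF': np.array([0.1, 0.5, 0.9])}}
threshold = 0.4

# Copy first, then threshold: the array stored in allData stays untouched.
clFDF = copy.deepcopy(allData[0]['cleanFDF'])   # or allData[0]['cleanFDF'].copy()
clFDF[clFDF < threshold] = 0

print(clFDF)                    # [0.  0.5 0.9]
print(allData[0]['cleanFDF'])   # [0.1 0.5 0.9]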
Everything is a reference in Python.
def function(y):
    y.append('yes')
    return y

example = list()
function(example)
print(example)
This prints ['yes'] even though I am not directly changing the variable 'example'.
See Why does list.append evaluate to false?, Python append() vs. + operator on lists, why do these give different results?, Python lists append return value.

Force matrix_world to be recalculated in Blender

I'm designing an add-on for Blender that changes the location of certain vertices of an object. Every object in Blender has a matrix_world attribute, holding a matrix that transforms the coordinates of the vertices from the object frame to the world frame.
print(object.matrix_world) # unit matrix (as expected)
object.location += mathutils.Vector((5,0,0))
object.rotation_quaternion *= mathutils.Quaternion((0.0, 1.0, 0.0), math.radians(45))
print(object.matrix_world) # Also unit matrix!?!
The above snippet shows that after the translation, you still have the same matrix_world. How can I force blender to recalculate the matrix_world?
You should call Scene.update after changing those values, otherwise Blender won't recalculate matrix_world until it's needed [somewhere else]. The reason, according to the "Gotchas" section in the API docs, is that this re-calculation is an expensive operation, so it's not done right away:
Sometimes you want to modify values from python and immediately access the updated values, eg:
Once changing the objects bpy.types.Object.location you may want to access its transformation right after from bpy.types.Object.matrix_world, but this doesn’t work as you might expect.
Consider the calculations that might go into working out the objects final transformation, this includes:
animation function curves.
drivers and their pythons expressions.
constraints
parent objects and all of their f-curves, constraints etc.
To avoid expensive recalculations every time a property is modified, Blender defers making the actual calculations until they are needed.
However, while the script runs you may want to access the updated values.
This can be done by calling bpy.types.Scene.update after modifying values which recalculates all data that is tagged to be updated.
Calls to bpy.context.scene.update() can become expensive when called within a loop.
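A minimal sketch of that advice, reusing the snippet from the question (this assumes the Blender 2.7x API, which has scene.update(); newer versions replaced it with view-layer/depsgraph updates):

import math
import bpy
import mathutils

obj = bpy.context.active_object
obj.location += mathutils.Vector((5, 0, 0))
obj.rotation_quaternion *= mathutils.Quaternion((0.0, 1.0, 0.0), math.radians(45))

bpy.context.scene.update()   # force the deferred re-calculation
print(obj.matrix_world)      # now reflects the new transform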
If your objects have no complex constraints (e.g. plain or parented), the following can be used to recompute the world matrix after changing the object's .location, .rotation_euler/.rotation_quaternion, or .scale.
def update_matrices(obj):
    if obj.parent is None:
        obj.matrix_world = obj.matrix_basis
    else:
        obj.matrix_world = obj.parent.matrix_world * \
                           obj.matrix_parent_inverse * \
                           obj.matrix_basis
Some notes:
Immediately after setting object location/rotation/scale the object's matrix_basis is updated
But matrix_local (when parented) and matrix_world are only updated during scene.update()
When matrix_world is manually recomputed (using the code above), matrix_local is recomputed as well
If the object is parented, then its world matrix depends on the parent's world matrix as well as the parent's inverse matrix at the time of creation of the parenting relationship.
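Hypothetical usage of the update_matrices helper above (obj is any object whose basis transform was just changed; Blender 2.7x API assumed):

import bpy

obj = bpy.context.active_object
obj.location.x += 5.0
update_matrices(obj)        # matrix_world (and matrix_local) are now up to date
print(obj.matrix_world)     # reflects the new location without a scene.update()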
I needed to do this too but needed this value to be updated whilst I imported a large scene with tens of thousands of objects.
Calling 'scene.update()' became exponentially slower, so I needed to find a way to do this without calling that function.
This is what I came up with:
from mathutils import Matrix

def BuildScaleMatrix(s):
    return (Matrix.Scale(s[0], 4, (1, 0, 0)) *
            Matrix.Scale(s[1], 4, (0, 1, 0)) *
            Matrix.Scale(s[2], 4, (0, 0, 1)))

def BuildRotationMatrixXYZ(r):
    return (Matrix.Rotation(r[2], 4, 'Z') *
            Matrix.Rotation(r[1], 4, 'Y') *
            Matrix.Rotation(r[0], 4, 'X'))

def BuildMatrix(t, r, s):
    return Matrix.Translation(t) * BuildRotationMatrixXYZ(r) * BuildScaleMatrix(s)

def UpdateObjectTransform(ob):
    ob.matrix_world = BuildMatrix(ob.location, ob.rotation_euler, ob.scale)
This isn't the most efficient way to build a matrix (if you know of a better way in Blender, please add), and it only works for XYZ-order transforms, but it avoids the exponential slowdowns when dealing with large data sets.
Accessing Object.matrix_world causes it to "freeze" even though you don't do anything to it, e.g.:
m = C.active_object.matrix_world
causes the matrix to be stuck. Whenever you want to read the matrix, use
Object.matrix_world.copy()
Only if you want to write the matrix, use
C.active_object.matrix_world = m
