Dask difference between .take and .compute - python

I don't fully understand the semantics of .take and .compute.
From the dask.bag.Bag API docs:
Bag.compute(**kwargs) Compute this dask collection
Bag.take(k[, npartitions, compute, warn]) Take the first k elements.
This makes me think that if I compute the whole collection and then take one element, .take will not trigger a re-computation. But it does. So when should someone use take vs compute? Isn't compute supposed to be used during development, when you want to inspect the result of a computation? Because if take(N) gives the same result and using compute doesn't save anything, why use compute?
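(For reference, a minimal sketch of how the two calls behave on a small bag; the bag contents are made up for illustration.)

import dask.bag as db

b = db.from_sequence(range(10), npartitions=2)

# take() runs a (partial) computation every time it is called;
# by default it only inspects the first partition.
first_three = b.take(3)          # triggers a computation
first_three_again = b.take(3)    # triggers another computation

# compute() materializes the whole bag as a plain Python list once;
# after that the result can be sliced without touching the scheduler again.
items = b.compute()
first_three_cheap = items[:3]    # no recomputation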

Related

scipy.signal.lfilter in Python

I have used
scipy.signal.lfilter(coefficient, 1, input, axis=0)
for filtering a signal in Python with 9,830,000 samples (I have to use axis=0 to get an answer similar to MATLAB's). Compared to MATLAB's
filter(coefficient, 1, input), which takes seconds,
it takes a very long time (minutes), and it becomes worse when I have to filter several times. Is there any suggestion for that? I have tried NumPy as well, with the same timing issue. Is there any other filter with which I can get similar answers?
You can use ifilter from itertools to implement a faster filter.
The problem with the built-in filter function in Python is that it returns a list, which is memory-consuming when you are dealing with lots of data. Therefore, itertools.ifilter is a better option because it calls the function only when needed.
Some other ideas depending on your implementation are:
If you are designing an FIR filter, you can apply convolve.
Or you can use correlation to get faster results, because it uses FFT.
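As a rough illustration of the FIR suggestion above (the firwin coefficients are a made-up example): for a pure FIR filter the denominator is 1, so lfilter reduces to a convolution, and an FFT-based convolution gives the same answer while usually being much faster for long filters.

import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
x = rng.standard_normal(9_830_000)     # signal of roughly the size in the question
coeffs = signal.firwin(101, 0.1)       # example FIR coefficients (hypothetical design)

# Direct-form filtering, as in the question
y_lfilter = signal.lfilter(coeffs, 1, x)

# FFT-based convolution; for an FIR filter (denominator == 1) this matches
# lfilter up to floating-point error and is typically much faster
y_fft = signal.fftconvolve(x, coeffs, mode='full')[:len(x)]

assert np.allclose(y_lfilter, y_fft)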

Using dask for loop parallelization in nested loops

I am just learning to use dask and read many threads on this forum related to Dask and for loops. But I am still unclear how to apply those solutions to my problem. I am working with climate data that are functions of (time, depth, location). The 'location' coordinate is a linear index such that each value corresponds to a unique (longitude, latitude). I am showing below a basic skeleton of what I am trying to do, assuming var1 and var2 are two input variables. I want to parallelize over the location parameter 'nxy', as my calculations can proceed simultaneously at different locations.
for loc in range(0, nxy):        # nxy = total no. of locations
    for it in range(0, ntimes):
        out1 = expression1 involving ( var1(loc), var2(it,loc) )
        out2 = expression2 involving ( var1(loc), var2(it,loc) )
        # <a dozen more output variables>
My questions:
(i) Many examples illustrating the use of 'delayed' show something like "delayed(function)(arg)". In my case, I don't have too many (if any) functions, but lots of expressions. If 'delayed' only operates at the level of functions, should I convert each expression into a function and add a 'delayed' in front?
(ii) Should I wrap the entire for loop shown above inside a function and then call that function using 'delayed'? I tried doing something like this but might not be doing it correctly, as I did not get any speed-up compared to not using dask. Here's what I did:
from dask import delayed

def test_dask(n):
    for loc in range(0, n):
        # same code as before
    return var1  # just returning one variable for now

var1 = delayed(test_dask)(nxy)
var1.compute()
Thanks for your help.
Every delayed task adds about 1ms of overhead. So if your expression is slow (maybe you're calling out to some other expensive function), then yes dask.delayed might be a good fit. If not, then you should probably look elsewhere.
In particular, it looks like you're just iterating through a couple arrays and operating element by element. Please be warned that Python is very slow at this. You might want to not use Dask at all, but instead try one of the following approaches:
Find some clever way to rewrite your computation with Numpy expressions
Use Numba
Also, given the terms you're using like lat/lon/depth, it may be that Xarray is a good project for you.
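If the per-location work really is expensive, a sketch of how dask.delayed could be applied along these lines is shown below; the expressions are stand-ins for the real ones, and the dummy var1/var2/nxy only mimic the shapes described in the question.

import numpy as np
from dask import delayed, compute

nxy, ntimes = 100, 50                       # small dummy sizes for illustration
var1 = np.random.random(nxy)                # stand-in for var1(loc)
var2 = np.random.random((ntimes, nxy))      # stand-in for var2(it, loc)

def process_location(v1_loc, v2_loc):
    # Bundle all per-location work (the inner time loop) into one task,
    # so each task does enough work to amortize the ~1 ms scheduling overhead.
    out1 = np.empty(len(v2_loc))
    out2 = np.empty(len(v2_loc))
    for it in range(len(v2_loc)):
        out1[it] = v1_loc + v2_loc[it]      # stand-in for expression1
        out2[it] = v1_loc * v2_loc[it]      # stand-in for expression2
    return out1, out2

# One delayed task per location
tasks = [delayed(process_location)(var1[loc], var2[:, loc]) for loc in range(nxy)]
results = compute(*tasks)                   # tuple of (out1, out2) pairs, one per location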

Iterate and compute over multiple dask arrays

I have multiple dask arrays and would like to save them to a GIF or some movie format using imageio one frame at a time, but I think the problem is generic enough that the solution could help other people. I'm wondering if there is a way to compute the arrays in order and, while one array is being computed and written to disk, start computing the next one on the remaining workers. If possible, it would be nice if the scheduler/graph could share tasks between the dask arrays, if any.
The code would look something like this in my eyes:
import dask.array as da

writer = Writer(...)
for dask_arr in da.compute([dask_arr1, dask_arr2, dask_arr3]):
    writer.write_frame(dask_arr)
It looks like this is probably hackable by users with the distributed scheduler, but I'd like to use the threaded scheduler if possible. I'm also not sure if this is super useful in my exact real-world case, given memory usage or possibly having to write entire frames at a time instead of chunks. I also don't doubt that this could be handled in a custom array-like object with da.store...somehow.
If you're able to write a function that takes in a slice of the array and then writes it appropriately you might be able to use a function like da.map_blocks.
This would become much more complex if you're trying to write into a single file where random access is harder to guarantee.
Perhaps you could use map_blocks to save each slice as a single image and then use some post-processing tool to stitch those images together.
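A rough sketch of that idea, assuming each chunk holds exactly one frame and using imageio to write per-frame PNGs (the array contents and file names are made up):

import numpy as np
import dask.array as da
import imageio

def save_block(block, block_info=None):
    # block_info[None] describes the output block; 'chunk-location' is its
    # index in the chunk grid, used here to number the frame on disk.
    frame_idx = block_info[None]['chunk-location'][0]
    imageio.imwrite(f'frame_{frame_idx:04d}.png', block.squeeze(axis=0))
    return block                            # pass the data through unchanged

frames = (da.random.random((10, 256, 256), chunks=(1, 256, 256)) * 255).astype(np.uint8)
da.map_blocks(save_block, frames, dtype=frames.dtype).compute()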

Opportunistic caching with reusable custom graphs in Dask

Dask supports defining custom computational graphs as well as opportunistic caching. The question is how they can be used together.
For instance, let's define a very simple computational graph that computes the x + 1 operation:
import dask

def compute(x):
    graph = {'step1': (sum, [x, 1])}
    return dask.get(graph, 'step1')
print('Cache disabled:', compute(1), compute(2))
this yields 2 and 3 as expected.
Now we enable opportunistic caching,
from dask.cache import Cache
cc = Cache(1e9)
cc.register()
print('Cache enabled: ', compute(1), compute(2))
print(cc.cache.data)
we incorrectly get a result of 2 in both cases, because cc.cache.data is {'step1': 2} irrespective of the input.
I imagine this means that the input needs to be hashed (e.g. with dask.base.tokenize) and appended to all the keys in the graph. Is there a simpler way of doing it, particularly since the tokenize function is not part of the public API?
The issue is that in complex graphs, an arbitrary step's name needs to account for the hash of all the inputs provided to its child steps, which means that it's necessary to do a full graph resolution.
It's important that key names in dask graphs are unique (as you found above). Additionally, we'd like identical computations to have the same key so we can avoid computing them multiple times - this isn't necessary for dask to work though, it just provides some opportunities for optimization.
In dask's internals we make use of dask.base.tokenize to compute a "hash" of the inputs, resulting in deterministic key names. You are free to make use of this function as well. In the issue you linked above we say the function is public, just that the implementation might change (not the signature).
Also note that for many use cases, we recommend using dask.delayed now instead of custom graphs for generating custom computations. This will do the deterministic hashing for you behind the scenes.
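For the custom-graph route specifically, a minimal sketch of the keying approach described above (the 'step1-' prefix is just an illustrative naming convention):

import dask
from dask.base import tokenize

def compute(x):
    # Derive the key from a hash of the inputs so that different inputs get
    # different cache entries instead of colliding on a fixed 'step1' key.
    key = 'step1-' + tokenize(x)
    graph = {key: (sum, [x, 1])}
    return dask.get(graph, key)

print(compute(1), compute(2))   # 2 3, even with the opportunistic cache registered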

Flink batch data processing

I'm evaluating Flink for processing batches of data. As a simple example, say I have 2000 points which I would like to pass through an FIR filter using functionality provided by scipy. The scipy filter is a simple function which accepts a set of coefficients and the data to filter, and returns the filtered data. Is it possible to create a transformation to handle this in Flink? It seems Flink transformations are applied on a point-by-point basis, but I may be missing something.
This should certainly be possible. Flink already has a Python API (beta) you might want to use.
About your second question: Flink can apply a function point by point and can do other things, too. It depends on what kind of function you are defining. For example, filter, project, map, and flatMap are applied per record; max, min, reduce, etc. are applied to a group of records (the groups are defined via groupBy). There is also the possibility to join data from different datasets using join, cross, or cogroup. Please have a look at the list of available transformations in the documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/dataset_transformations.html
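For the filtering use case specifically, the per-group logic could look like the plain-Python sketch below: an ordinary callable over an iterable of (index, value) records, which is the shape of function you would plug into a group-wise transformation such as the ones listed above (the Flink wiring itself is omitted because the Python API details depend on the release, and the coefficients are a made-up example).

import numpy as np
from scipy import signal

COEFFS = signal.firwin(31, 0.2)                    # example FIR coefficients (hypothetical design)

def filter_group(records):
    # records: iterable of (index, value) pairs forming one batch of points.
    records = sorted(records)                      # keep the samples in order
    values = np.array([v for _, v in records])
    filtered = signal.lfilter(COEFFS, 1, values)   # run the scipy filter once per batch
    return list(zip((i for i, _ in records), filtered))

# Example batch of 2000 points, as in the question
batch = list(enumerate(np.random.random(2000)))
print(filter_group(batch)[:3])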
