Disable pure function assumption in dask distributed - python

The Dask distributed library documentation says:
By default, distributed assumes that all functions are pure.
[...]
The scheduler avoids redundant computations. If the result is already in memory from a previous call then that old result will be used rather than recomputing it.
When benchmarking function runtimes, this caching behavior will get in the way, as we want to call the same function multiple times with the same inputs.
So is there a way to completely disable it?
I know that submit and map accept a pure=False argument for this. But for computations on dask collections I have not found any solution.
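For reference, this is the per-call switch that exists for futures. A minimal sketch (slow_fn is a placeholder):

from dask.distributed import Client

client = Client()

def slow_fn(x):
    return x * 2

# pure=False gives every call a unique key, so nothing is cached
f1 = client.submit(slow_fn, 1, pure=False)
f2 = client.submit(slow_fn, 1, pure=False)
assert f1.key != f2.key  # two distinct tasks, both will run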

After some digging in the source code of distributed, I believe I have found an answer myself. Although someone might correct me if I didn't come to the right conclusion.
Short answer
It is not possible to globally disable the purity assumption in distributed.
However, for dask collections it is possible to separate computations from precomputed results with dask.graph_manipulation.clone().
Long answer
Internally, dask splits its computation up into labelled tasks.
A task label is called a "key" and is used to identify results from a computation (an execution of a task). Keys are used to identify dependencies between tasks and are therefore essential for how dask works.
When we submit a new computation graph, which is basically a list of tasks with their dependencies, to the scheduler in distributed, the scheduler checks whether some tasks have already been computed by checking their keys against the keys of the finished tasks, which the scheduler still holds.
This happens near the beginning of Scheduler.update_graph(), which is the method called by the client when it wants to start a new computation.
There is no switch in the current implementation to disable this. The calls to plugin.update_graph() for the registered scheduler plugins also happen after this optimization phase, so we cannot regulate this behavior through plugins either.
So what can we do?
By manually modifying the keys of the individual tasks in the graph, we can trick the scheduler into thinking that we have not yet computed this task.
Task keys usually have the format prefix-token, where the prefix is the original task name (e.g. function name) and the token is a hash built from the arguments of the task.
Distributed uses the task prefix to group tasks together and get an estimate for future runtimes.
The token is primarily used to identify different executions of the same task with different arguments.
So we can just adjust the token of the key to let dask think that we are running the task with different arguments.
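To make this concrete, here is a minimal sketch of what keys look like for a delayed task (inc is a placeholder; the exact hash will differ):

import dask

def inc(x):
    return x + 1

# pure=True makes the token deterministic: same arguments, same key
t1 = dask.delayed(inc, pure=True)(1)
t2 = dask.delayed(inc, pure=True)(1)
print(t1.key)            # something like 'inc-a1b2c3...' (prefix-token)
assert t1.key == t2.key  # identical arguments produce identical keys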
This is in principle what dask.graph_manipulation.clone() does for us. It copies a dask collection and returns a new one such that the keys of the tasks in the internal graph are rewritten.
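A minimal, self-contained sketch of the clone() workaround (again with a placeholder inc):

import dask
from dask.graph_manipulation import clone

def inc(x):
    return x + 1

t = dask.delayed(inc, pure=True)(1)
t_clone = clone(t)           # same computation, but all keys are rewritten
assert t.key != t_clone.key  # the scheduler now sees an unseen task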

Related

How to submit a large set of long running parallel tasks to dask?

I have a computational workload that I originally ran with concurrent.futures.ProcessPoolExecutor which I converted to use dask so that I could make use of dask's integrations with distributed computing systems for scaling beyond one machine. The workload consists of two task types:
Task A: takes string/float inputs and produces a matrix (around 2000 x 2000). Task duration is usually 60 seconds or less.
Task B: takes the matrix from task A and uses it and some other small inputs to solve an ordinary differential equation. The solution is written to disk (so no return value). Task duration can be up to fifteen minutes.
There can be multiple B tasks for each A task.
Originally, my code looked like this:
a_results = client.map(calc_a, a_inputs)
all_b_inputs = [(a_result, b_input) for b_input in b_inputs for a_result in a_results]
b_results = client.map(calc_b, all_b_inputs)
dask.distributed.wait(b_results)
because that was the clean translation from the concurrent.futures code (I actually kept the code so that it could be run either with dask or concurrent.futures so I could compare). client here is a distributed.Client instance.
I have been experiencing some stability issues with this code, especially for large numbers of tasks, and I think I might not be using dask in the best way. Recently, I changed my code to use Delayed instead like this:
a_results = [dask.delayed(calc_a)(a) for a in a_inputs]
b_results = [dask.delayed(calc_b)(a, b) for a in a_results for b in b_inputs]
client.compute(b_results)
I did this because I thought perhaps the scheduler could work through the tasks more efficiently if it examined the entire graph before starting anything rather than beginning to schedule the A tasks before knowing about the B tasks. This change seems to help some but I still see some stability issues.
I can create separate questions for the stability problems, but I first wanted to find out if I am using dask in the best way for this use case or if I should modify how I am submitting the tasks. Just to describe the problems briefly, the worst problem to me is that over time my workers drop to 0% CPU and tasks stop completing. Other problems include things like getting KilledWorker exceptions and seeing log messages about an unresponsive loop and timeouts. Usually the scheduler runs fine for at least a few hours, completing thousands of tasks before these issues show up (which makes debugging difficult since the feedback loop is so long).
Some questions I have been wondering about:
I can have thousands of tasks to run. Can I submit these all to dask to start out or do I need to submit them in batches? My thought was that the dask scheduler would be better at scheduling tasks than my batching code.
If I do need to batch things myself, can I query the scheduler to find out the maximum number of workers so I can write something that will submit batches of the right size? Or do I need to make the batch size an input to my batching code?
In the end, my results all get written to disk and nothing gets returned. With the way I am running tasks, are resources getting held onto longer than necessary?
My B tasks are long, but they could be split by scheduling tasks that solve for solutions at intermediate time steps and feeding those in as the inputs to subsequent solving tasks. I think I need to do this anyway because I would like to use an HPC cluster with a timed queue, and I think I need to use the lifetime parameter to retire workers to keep them from running over the time limit, which works best with short-lived tasks (to avoid losing work when shut down early). Is there an optimal way to split the B task?
There are lots of questions here. With regard to the code snippets you provided, both look correct, but the futures version will scale better in my experience. The reason is that, by default, when one of the delayed tasks fails, the computation of all delayed tasks halts, while futures can proceed as long as they are not directly affected by the failure.
Another observation is that delayed values will tend to hold on to resources after completion, while for futures you can at least .release() them once they have been completed (or use fire_and_forget).
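That pattern could look roughly like this (a sketch reusing the question's calc_b, with made-up inputs):

from dask.distributed import Client, as_completed

def calc_b(args):
    a_result, b_input = args  # unpack; solve the ODE and write to disk (omitted)

client = Client()
futures = client.map(calc_b, [("a", i) for i in range(100)])  # hypothetical inputs
for future in as_completed(futures):
    future.release()  # drop each result as soon as its task completes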
Finally, with very large task lists, it might be worth making them a bit more resilient to restarts, as in the sketch below. One basic option is to create simple text files after successful completion of a task, and then on restart check which tasks need to be re-computed. Fancier options include Prefect and joblib.Memory, but if you don't need all the bells and whistles, the text-file route is often fastest.
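A minimal sketch of the text-file route (run_b_task, task_id, and done_dir are hypothetical names; calc_b stands in for the question's solver):

import os

def calc_b(a_result, b_input):  # stand-in for the question's ODE solver
    pass

def run_b_task(task_id, a_result, b_input, done_dir="done"):
    marker = os.path.join(done_dir, f"{task_id}.done")
    if os.path.exists(marker):
        return                       # already completed in a previous run
    calc_b(a_result, b_input)        # writes its solution to disk
    os.makedirs(done_dir, exist_ok=True)
    open(marker, "w").close()        # marker created only after success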

Python and dask: how to invoke methods instead of functions on workers?

I have a dask Client with its workers.
I want to make my calculation in two steps:
1) Run the pre-calculation code (it takes small settings objects, calculates slowly, and generates rather big intermediate structures) once per worker, and save the intermediate data on each worker.
2) Run the calculation function (much faster than the pre-calculation, runs many times per worker, and uses the intermediate data saved on each worker).
How can I do this?
You do not need to do anything special for this to take place. Dask takes pains to schedule tasks on workers where the data required by those tasks already resides. There are also heuristics in place comparing the size of the data, the transfer speed, and any work backlog to decide when it might be worthwhile to copy data to another worker.
Unless you are hitting specific problems with work distribution, it is likely that if you simply do the normal thing, writing functions which depend on their inputs using either delayed, the collections, or the futures interface, things will be sensibly scheduled for you.
https://distributed.readthedocs.io/en/latest/locality.html
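In practice, the two-step pattern from the question can be written directly with futures. A sketch with made-up stand-ins for the pre-calculation and calculation functions:

from dask.distributed import Client

def precalculate(settings):      # stand-in for the slow step
    return {"table": [s * 2 for s in settings]}

def calculate(x, intermediate):  # stand-in for the fast step
    return x + intermediate["table"][0]

client = Client()
# step 1: the slow pre-calculation runs once; its result stays on a worker
intermediate = client.submit(precalculate, [1, 2, 3])
# step 2: many fast tasks depend on that future; the scheduler prefers
# workers that already hold the intermediate data
results = client.gather([client.submit(calculate, x, intermediate) for x in range(10)])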

schedule tasks based purely on idleness, instead of data communication

Using distributed to schedule lots of interdependent tasks, running on Google Compute Engine. When I start an extra instance with workers halfway through, no tasks get scheduled to it (though it registers fine with the scheduler). I presume this is because (from http://distributed.readthedocs.io/en/latest/scheduling-state.html#distributed.scheduler.decide_worker):
"If the task requires data communication, then we choose to minimize the number of bytes sent between workers. This takes precedence over worker occupancy."
Once I'm halfway through running the task tree, all remaining tasks depend on the result of tasks which have already run. So, if I interpret the above quote right, nothing will ever be scheduled on the new workers, no matter how idle they are, as the dependent data is never already there but always on an 'old' worker.
However, I do make sure the amount of data to transfer is fairly minimal, usually just a small string. So in this case it would make much more sense to let idleness prevail over data communication. Would it be possible to allow this (e.g. by setting a 'scheduler policy')? Or maybe even have a data-vs-idleness tradeoff coefficient which could be tuned?
Update after comment #1:
Complicating factor: every task is using the resources framework to make sure it either runs on the set of workers for cpu-bound tasks ("CPU=1") or on the set of workers for network-bound tasks ("NET=1"). This separation was made to avoid overloading up/download servers and restrict up/download tasks to a certain max, while still being able to scale the other tasks. However, according to http://distributed.readthedocs.io/en/latest/work-stealing.html, task stealing will not happen in these cases? Is there a way to allow task stealing while keeping the resource restrictions?
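For reference, the setup described above looks roughly like this (a sketch with hypothetical function names and scheduler address):

from dask.distributed import Client

# assumed worker startup (shell):
#   dask-worker scheduler:8786 --resources "CPU=1"
#   dask-worker scheduler:8786 --resources "NET=1"

def crunch(x):  # stand-in for a cpu-bound task
    return sum(range(x))

def fetch(x):   # stand-in for an up/download task
    return x

client = Client("scheduler:8786")
cpu_future = client.submit(crunch, 10_000, resources={"CPU": 1})
net_future = client.submit(fetch, 42, resources={"NET": 1})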
Update 2: I see there is an open issue for that: https://github.com/dask/distributed/issues/1389. Are there plans to implement this?
While Dask prefers to schedule work to reduce communication, it also acknowledges that this isn't always best. Generally, Dask runs a task on the machine where it will finish first, taking into account both communication costs and existing task backlogs on overloaded workers.
For more information on load balancing you might consider reading this documentation page:
http://distributed.readthedocs.io/en/latest/work-stealing.html

Understanding concurrent.futures.Executor.map()

I'm trying to parallelize some Python code using processes and concurrent.futures. It looks like I can execute a function multiple times in parallel either by submitting calls and then calling Future.result() on the futures, or by using Executor.map().
I'm wondering if the latter is just syntactic sugar for the former and if there's any difference performance-wise. It doesn't seem immediately clear from the documentation.
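For reference, the two approaches being compared look like this (a minimal sketch with a placeholder square function):

from concurrent.futures import ProcessPoolExecutor, as_completed

def square(x):
    return x * x

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # map(): results come back lazily, in input order
        ordered = list(pool.map(square, range(10)))

        # submit(): returns individual futures; as_completed() yields
        # them in completion order, not submission order
        futures = [pool.submit(square, x) for x in range(10)]
        unordered = [f.result() for f in as_completed(futures)]
    print(ordered, unordered)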
Executor.map() will allow you to execute a function multiple times concurrently, which is not necessarily the same as true parallel execution.
Performance-wise, I recently found that ProcessPoolExecutor.submit() and ProcessPoolExecutor.map() consumed the same amount of compute time to complete the same task. Note: .submit() returns a future object (let's call it f) and you need to call its f.result() method to see its result. On the other hand, .map() directly returns an iterator.
When converting their results into an ordered list using sorted(), I have found that the compute time of the entire .map() code can be faster than that of the entire .submit() code in certain scenarios.
When converting their results into an unordered list using list(), the compute times of the entire .submit() and .map() codes are the same. Also, both of these are faster than the versions using sorted().
You can read the details in my answer. There, I have also shared my codes where you can see how they work. I hope they can be helpful to you.
I have not used ThreadPoolExecutor so I can't comment in detail. However, I have read that it is implemented in a similar way to ProcessPoolExecutor and that it is more suited to I/O-bound tasks than to CPU-bound tasks. You do need to specify the max_workers argument, i.e. the maximum number of threads, whereas in ProcessPoolExecutor max_workers is an optional argument which defaults to the number of CPUs returned by os.cpu_count().

Is CCKeyDerivationPBKDF thread safe?

I'm using CCKeyDerivationPBKDF to generate and verify password hashes in a concurrent environment and I'd like to know whether it is thread safe. The documentation of the function doesn't mention thread safety at all, so I'm currently using a lock to be on the safe side, but I'd prefer not to use a lock if I don't have to.
After going through the source code of CCKeyDerivationPBKDF(), I find it to be thread-unsafe. While the code uses many library functions that are thread-safe (e.g. bzero), most of the internal functions (e.g. the PRF) and the functions they call are potentially thread-unsafe, for example due to the use of shared pointers and unsafe casting of memory (e.g. in CCHMac). I would suggest that, unless the underlying functions are all made thread-safe, or at least conditionally thread-safe, you stick with your approach, or modify the CommonCrypto code to make it thread-safe and use that instead.
Hope it helps.
Lacking documentation or source code, one option is to build a test app with say 10 threads looping on calls to CCKeyDerivationPBKDF with a random selection from say 10 different sets of arguments with 10 known results.
Each thread checks the result of a call to make sure it is what is expected. Each thread should also have a usleep() call for some random amount of time (bell curve sitting on say 10% of the time each call to CCKeyDerivationPBKDF takes) in this loop in order to attempt to interleave operations as much as possible.
You'll probably want to instrument it with debugging that keeps track of how much concurrency you are able to generate. With a 10% sleep time and 10 threads, you should be able to keep 9 threads concurrent.
If it makes it through an aggregate of say 100,000,000 calls without an error, I'd assume it was thread safe. Of course you could run it for much longer than that to get greater assurances.
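The same idea sketched in Python, with hashlib.pbkdf2_hmac standing in for the C function under test (a real harness would call CCKeyDerivationPBKDF instead):

import hashlib
import random
import threading
import time

# known-answer vectors: (password, salt, rounds) -> expected derived key
CASES = [(b"pw%d" % i, b"salt%d" % i, 1_000) for i in range(10)]
EXPECTED = [hashlib.pbkdf2_hmac("sha256", p, s, r, 32) for p, s, r in CASES]
errors = []

def hammer(iterations=10_000):
    for _ in range(iterations):
        i = random.randrange(len(CASES))
        p, s, r = CASES[i]
        if hashlib.pbkdf2_hmac("sha256", p, s, r, 32) != EXPECTED[i]:
            errors.append(i)                # a mismatch suggests a race
        time.sleep(random.random() * 1e-4)  # jitter to interleave threads

threads = [threading.Thread(target=hammer) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("mismatches:", len(errors))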