Distributed chained computing with Dask on a high failure-rate cluster? - python

I am using Dask Bag to run some simple map-reduce computation on a special cluster:
import dask.bag as bag
summed_image = bag.from_sequence(my_ids).map(gen_image_from_ids).reduction(sum, sum).compute()
This code builds a chained computation: it maps over the sequence produced by from_sequence with gen_image_from_ids, and then reduces all results into one with sum. Thanks to Dask Bag's reduction, the summation is done in parallel as a multi-level tree.
My cluster has a high failure rate because any worker can be killed at any time: the CPU is taken over by higher-priority processes and released again after a while. A kill may hit only a single node once every 5 minutes or so, but my total reduction job can take longer than 5 minutes.
Although Dask is good at failure recovery, my job sometimes simply never finishes: if any internal node of the reduction tree gets killed, the temporary intermediate results of all the computations feeding into it are lost and have to be recomputed from the beginning.
There is replicate for Dask Future objects, but I could not find a similar feature on the higher-level Dask Bag or DataFrame collections to ensure data resiliency. Please let me know if there is a common way to keep intermediate results alive in a Dask cluster with a very high failure rate.
Update - My workaround
Maybe any distributed computing system will suffer from frequent failures, even if it can recover from them. In my case the worker shutdown is not really a system failure but is triggered by the higher-priority process. So instead of killing my workers directly, the higher-priority process now launches a small Python script that sends a retire_workers() command when it starts running.
As documented, on retire_workers() the scheduler moves the data from the retiring worker to another available one, so my problem is solved for now. However, I still leave the question open, since I think replicated, redundant computation would be a faster solution and would make better use of idle nodes in the cluster.
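For reference, a minimal sketch of such a script, assuming the higher-priority process knows the scheduler address and the address of the worker it is about to preempt (both addresses below are placeholders):
from dask.distributed import Client

SCHEDULER_ADDRESS = "tcp://scheduler:8786"  # placeholder
WORKER_ADDRESS = "tcp://10.0.0.5:43210"     # placeholder: the worker to vacate

client = Client(SCHEDULER_ADDRESS)
# Gracefully retire the worker: the scheduler copies its data to other workers
# before removing it from the cluster.
client.retire_workers(workers=[WORKER_ADDRESS])
client.close()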

This might not be the solution you are looking for, but one option is to divide the task sequence into batches small enough that each one will complete in time (or will be quick to redo from scratch).
Something like this perhaps:
import dask.bag as db
from toolz import partition_all
n_per_chunk = 100 # just a guess, the best number depends on the case
tasks = list(partition_all(n_per_chunk, my_ids))
results = []
for t in tasks:
    summed_image = (
        db
        .from_sequence(t)        # use the current chunk, not the full my_ids
        .map(gen_image_from_ids)
        .reduction(sum, sum)
        .compute()
    )
    results.append(summed_image)
summed_image = sum(results) # final result
There are other things to keep in mind here regarding re-starting the workflow on failure (or potentially launching smaller tasks in parallel), but hopefully this gives you a starting point for a workable solution.
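As a rough sketch of the restart-on-failure point above (the retry count is arbitrary and the exception handling is deliberately coarse):
max_retries = 3  # arbitrary; tune to the cluster's failure rate
results = []
for t in tasks:
    for attempt in range(max_retries):
        try:
            partial = (
                db
                .from_sequence(t)
                .map(gen_image_from_ids)
                .reduction(sum, sum)
                .compute()
            )
            break  # this chunk succeeded
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after repeated failures
    results.append(partial)
summed_image = sum(results)  # final result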

Update: After more trials, this answer is not ideal because the client.replicate() call is blocking. I suspect it requires all futures to be done before making the replicas, which is unwanted because 1. any intermediate node can disconnect before everything is ready, and 2. it prevents other tasks from running asynchronously. I need another way to make the replicas.
After lots of trials, I found one way to replicate the intermediate results during the chained computation and so obtain data redundancy. Note that the parallel reduction is a Dask Bag feature, which does not directly support replication. However, as the Dask documentation states, one can replicate low-level Dask Future objects to improve resiliency.
Following @SultanOrazbayev's post to compute the partial sums manually, but using persist() to keep the partial sums in cluster memory as suggested in the comments, the returned items are backed by Dask Futures:
import dask.bag as db
from dask.distributed import futures_of
from toolz import partition_all
n_per_chunk = 100 # just a guess, the best number depends on the case
tasks = list(partition_all(n_per_chunk, my_ids))
bags = []
for t in tasks:
    summed_image = (
        db
        .from_sequence(t)        # use the current chunk, not the full my_ids
        .map(gen_image_from_ids)
        .reduction(sum, sum)
        .persist()
    )
    bags.append(summed_image)
futures = futures_of(bags) # This can only be called on the .persist() result
I can then replicate these remote intermediate partial sums and feel safer summing the futures to get the final result:
client.replicate(futures, 5) # Improve resiliency by replicating to 5 workers
summed_image = client.submit(sum, futures).result() # The only line that blocks for the final result
Here a replication factor of 5 feels stable for my cluster, although a higher value would incur more network overhead to pass the replicas between workers.
This works, but could still be improved, for example by performing the final reduction (sum) over the intermediate results in parallel, especially when there are many chunks. Please leave me your suggestions.
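For that last point, one possible direction is a pairwise tree reduction over the replicated futures instead of a single sum task. A rough sketch, reusing the futures and client from above (not tested on the cluster described here):
from toolz import partition_all

layer = list(futures)
while len(layer) > 1:
    # sum adjacent pairs on the cluster, layer by layer, until one future remains
    layer = [client.submit(sum, pair) for pair in partition_all(2, layer)]
summed_image = layer[0].result()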

Related

Dask high memory usage when computing two values with common dependency

I am using Dask on a single machine (LocalCluster with 4 processes, 16 threads, 68.56GB memory) and am running into worker memory problems when trying to compute two results at once which share a dependency.
In the example shown below, computing result with just one computation runs fine and quickly, with workers' combined memory usage maxing out at around 1GB. However, when computing results with two computations the workers quickly use all of their memory and start to write to disk when total memory usage is around 40GB. The computation will eventually finish, but there is a massive slowdown as would be expected once it starts writing to disk.
Intuitively, if one chunk is read in and then its two sums are immediately computed, then the chunk can be discarded and memory usage stays low. However, it appears that Dask is prioritizing the loading of the data instead of the later aggregate computations which clear up memory.
Any help understanding what's going on here would be greatly appreciated. How can I compute two results with a common dependency without needing to read the underlying data twice or read it fully into memory?
import dask
import dask.dataframe as dd
import dask.array as da
from dask.distributed import Client
client = Client("localhost:8786")
array = da.random.normal(size=(int(1e9), 10), chunks=(int(1e6), 10))
df = dd.from_array(array, columns=[str(i) for i in range(10)])
# does not blow up worker memory, overall usage stays below 1GB total
result = dask.compute(df["0"].sum())
# does blow up worker memory
results = dask.compute([df["0"].sum(), df["1"].sum()])
The way the array is constructed, every time a chunk is created it has to generate every column of the array. So one opportunity for optimization (if possible) is to generate/load the array in a way that allows for column-wise processing. This will reduce the memory load of a single task.
Another avenue for optimization is to explicitly specify the common dependencies; for example, dask.compute(df[['0', '1']].sum()) will run efficiently.
However, the more important point is that by default Dask follows some rules of thumb for prioritizing work, see here. You have several options to intervene (not sure if this list is exhaustive): custom priorities, resource constraints, or modifying the compute graph (to allow workers to release memory from intermediate tasks without waiting for the final task to complete).
A simple way to modify the graph is to break the dependency between the final sum and all the intermediate tasks by computing the intermediate sums manually:
[results] = dask.compute([df["0"].map_partitions(sum), df["1"].map_partitions(sum)])
Note that results will be a list of two sublists, but it's trivial to calculate the sum of each sublist (trying to run sum on a delayed object would trigger computation, so it's more efficient to run sum after results are computed).
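For the custom-priorities option mentioned above, a hedged sketch using dask.annotate (graph optimization can drop annotations in some Dask versions, so treat this as a starting point rather than a guaranteed fix):
import dask

# give the aggregations a higher priority than the default (0), so the scheduler
# prefers finishing sums, which free chunk memory, over loading new chunks
with dask.annotate(priority=10):
    s0 = df["0"].sum()
    s1 = df["1"].sum()

results = dask.compute(s0, s1)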

Dataflow pipeline throughput decreases drastically as execution advances + unexpected side input behavior

I have a dataflow pipeline processing an input of about 1Gb of data with two dicts as side_inputs. The goal is to calculate features from the main dataset with the help of those two side_inputs.
Overall structure of the pipeline is as follows:
# First side input, ends up as a 2GB dict with 3.5 million keys
side_inp1 = ( p |
"read side_input1" >> beam.io.ReadFromAvro("$PATH/*.avro") |
"to list of tuples" >> beam.Map(lambda row: (row["key"], row["value"]))
)
# Second side input, ends up as a 1.6GB dict with 4.5 million keys
side_inp2 = (p |
"read side_input2" >> beam.io.ReadFromAvro("$PATH2/*.avro") |
"to list of tuples" >> beam.Map(lambda row: (row["key"], row["value"]))
)
# The main part of the pipeline, reading an avro dataset of 1 million rows -- 20GB
(p |
"read inputs" >> beam.io.ReadFromAvro("$MainPath/*.avro") |
"main func" >> beam.Map(MyMapper, pvalue.AsDict(side_inp1), pvalue.AsDict(side_inp2))
)
Here's the Dataflow graph:
And the "Featurize" step unwrapped:
So Featurize is a function that looks up ids in the side inputs, gets the vectors, and computes something like 180 different vector dot products to calculate the features. It's a completely CPU-bound process and it's expected to take longer than the rest of the pipeline, but the stalling is what's strange here.
My problems are twofold:
The Dataflow pipeline seems to slow down drastically as it moves further into the job. I don't know what the reasons are or how I can alleviate this problem. A throughput chart of the MyMapper step can be seen below; I'm wondering about the declining throughput (from ~400 rows/sec down to ~1 row/sec at the end).
Also, the behavior of the side_inputs is strange to me. I expected the side_inputs to be read once and only once, but when I check the Job Metrics / Throughput chart I observe the following: the pipeline is constantly re-reading the side_inputs, while what I want is just two dicts that are kept in memory.
Other job configurations
zone: us-central-1a
machine_type: m1-ultramem-40 (40 CPU cores, 960GB RAM)
disk_type/size: ssd/50GB
experiments: shuffle-service enabled.
max_num_workers: 1 to help ease calculations and metrics, and not have them vary due to auto-scaling.
Extra Observations
I'm constantly seeing log entries like the following in LogViewer: "[INFO] Completed workitem: 4867151000103436312 in 1069.056863785 seconds"
All completed workItems so far have taken about 1000-1100 seconds, which is another source of confusion: why should throughput drop while processing a workItem takes the same time as before? Has parallelism dropped for some reason (maybe some hidden threading threshold that's out of my control, like harness_threads)?
In the later parts of the pipeline, looking at the logs, the execution pattern looks very sequential (it seems to execute one workItem, finish it, and then move on to the next, which is strange to me considering there is nearly 1TB of available memory and 40 cores).
There are 0 errors or even warnings
The throughput chart in point 1 is a good indicator that the performance of your job decreased somehow.
The side input is intended to live in memory; however, I'm not sure that a pipeline with only one highmem node is a good approach. With only one node the pipeline can hit bottlenecks that are difficult to identify, e.g. network or OS limitations (like the maximum number of open files, related to the files loaded into memory). Because of Beam's architecture, adding more nodes should not be a problem, even with autoscaling enabled, since autoscaling automatically chooses the number of worker instances required to run your job. If you are worried about calculations and metrics for other reasons, please share.
Regarding point 2, I think it is expected to see activity on that chart, since the side input (in memory) is read for each element being processed. However, if this doesn't match what you expect, you can always share the complete job graph so we can understand the other details of the pipeline steps.
My recommendation is to add more workers to distribute the workload, since a PCollection is a distributed dataset that will be spread across the available nodes. You can try to keep similar total computational resources with more nodes, for example 4 n2d-highmem-16 instances (16 vCPUs, 128GB each). With these changes it is possible that the bottlenecks disappear or are at least mitigated; in addition, you can monitor the new job in the same way:
Remember to check for errors in your pipeline, so you can identify any other issues that are causing the performance problem.
Check the CPU and memory usage in the Dataflow UI. If memory errors are happening at the job level, Stackdriver should show them as memory errors, but the memory on the host instance should also be checked to make sure it is not hitting an OS limit for other reasons.
You might want to check this example with side inputs as dictionaries. I'm not an expert, but you can follow the best practices in the example.
UPDATE
If the n2d-highmem-16 machines hit OOM, it seems to me that each harness thread might use its own copy of the dicts. I'm not sure whether configuring the number of threads will help, but you can try to set number_of_worker_harness_threads in the pipeline options.
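As a hedged sketch of those options (the project, region, bucket and thread count below are placeholders, not recommendations):
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                  # placeholder
    "--region=us-central1",                  # placeholder
    "--temp_location=gs://my-bucket/tmp",    # placeholder
    "--worker_machine_type=n2d-highmem-16",
    "--max_num_workers=4",
    "--number_of_worker_harness_threads=8",  # placeholder: fewer threads means fewer dict copies
])
# p = beam.Pipeline(options=options)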
On the other hand, can you expand the Featurize step? The wall time is very high in this step (~6 days); let's check which composite transforms absorbed that latency. For the problematic composite transforms, please share the code snippet. To identify the composite transforms that may have issues, refer to the Side Input Metrics, especially Time spent writing and Time spent reading.

Iterative Distributed Cross-Validation with Early Stopping

To be specific, I want to parallelize xgboost cross-validation.
Please help me design such a Dask application. Let's say I have a Dask cluster and I want to do 10-fold cross-validation for xgboost.
The scheduler needs to keep track of the current state of the job. It launches 10 xgboost tasks on 10 different workers (one per fold), with, say, a maximum of 10000 iterations per task.
After each iteration finishes, a callback reports the current metric (e.g. rmse). The worker sends that to the scheduler and receives an answer on whether to continue or wrap up.
The main scheduler keeps receiving those updates asynchronously. When all workers have reported a metric for a particular iteration, the scheduler aggregates them (just takes the mean) and pushes it onto the current result stack. It also checks whether the result has improved in the last, say, 50 iterations; if not, it tells all workers to wrap up (maybe at the next communication) and report back their results (each of which is a tree object).
After it gets them all, it returns all trees (and maybe metrics too).
To me it sounds like you're describing something similar to Hyperband, which is currently implemented in Dask-ML. You might want to look at these docs:
https://ml.dask.org/modules/generated/dask_ml.model_selection.HyperbandSearchCV.html?highlight=hyperband
If you want to implement something on your own, some of the pieces of that code may be of use to you as well. Dask-ML lives on GitHub at https://github.com/dask/dask-ml
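If you do go the hand-rolled route, one possible shape for the design described in the question is a dask.distributed Variable used as a shared stop flag that an xgboost callback checks after every boosting round. This is only a rough sketch under those assumptions; the fold preparation (get_fold_data), the scheduler address and the metric-aggregation loop are hypothetical or simplified:
import xgboost as xgb
from dask.distributed import Client, Variable, get_client

class StopFlagCallback(xgb.callback.TrainingCallback):
    # Stop training when the shared "stop" Variable is set to True.
    def after_iteration(self, model, epoch, evals_log):
        stop = Variable("stop", client=get_client())
        return bool(stop.get())  # returning True ends training

def train_fold(fold_id, params, num_rounds=10000):
    X_train, y_train, X_val, y_val = get_fold_data(fold_id)  # hypothetical helper
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    return xgb.train(
        params, dtrain,
        num_boost_round=num_rounds,
        evals=[(dval, "val")],
        callbacks=[StopFlagCallback()],
    )

client = Client("tcp://scheduler:8786")     # placeholder address
Variable("stop", client=client).set(False)  # shared early-stop flag

futures = [client.submit(train_fold, i, {"objective": "reg:squarederror"}) for i in range(10)]
# ... monitor the reported validation metrics out of band; when they stall, flip the flag:
# Variable("stop", client=client).set(True)
boosters = client.gather(futures)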

What is the most efficient way to utilize dask multiprocessing scheduler if data flow between tasks is big?

We have a Dask compute graph (quite custom, so we use dask.delayed instead of the collections). I've read in the docs that the current scheduling policy is LIFO, so that a worker process has a good chance of getting the data it has just computed for the next steps down the graph. But as far as I understood, task computation results are still (de)serialized to the hard drive even in this case.
So the question is: how much performance gain would I get by trying to keep as few tasks as possible along a single path of independent computations in the graph:
A) many small "map" tasks along each path
t --> t --> t -->...
some reduce stage
t --> t --> t -->...
B) one huge "map" task for each path
T ->
some reduce stage
T ->
Thank you!
The dask multiprocessing scheduler will automatically fuse linear chains of tasks into single tasks, so your case A above will automatically become case B.
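As an illustration of what that fusion pass does, here is the low-level dask.optimization.fuse helper applied to a hand-written task graph (the scheduler applies an equivalent optimization automatically, so you normally never call this yourself):
from dask.optimization import fuse

def inc(x):
    return x + 1

# a linear chain a -> b -> c -> d, written as a raw task graph
dsk = {"a": 1, "b": (inc, "a"), "c": (inc, "b"), "d": (inc, "c")}

fused_dsk, dependencies = fuse(dsk)
# fused_dsk now contains a single composite task that computes the whole chain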
If your workloads are more complex and do require inter-node communication then you might want to try the distributed scheduler on a single computer. It manages data movement between workers more intelligently.
$ pip install dask distributed
>>> from dask.distributed import Client
>>> c = Client() # Starts local "cluster". Becomes the global scheduler
Further reading
http://dask.pydata.org/en/latest/scheduler-choice.html
http://dask.pydata.org/en/latest/optimize.html
Correction
Also, just as a note, Dask doesn't persist intermediate results on disk. Rather it communicates intermediate results directly between processes.

In what order does data get processed from RDDs in Spark?

Context
Spark provides RDDs, on which map functions can be used to lazily set up operations for parallel processing. RDDs can be created with a partitioning parameter that determines how many partitions to create per RDD; preferably this parameter equals the number of systems (e.g. you have 12 files to process, so you create an RDD with 3 partitions, which splits the data into buckets of 4 files each, one per system, and all the files get processed concurrently on each system). It is my understanding that these partitions control the portion of data that goes to each system for processing.
Issue
I need to fine tune and control how many functions run at same time per system. If 2 or more functions run on same GPU at the same time, the system will crash.
Question
If an RDD is not split evenly (as in the example above), how many threads run concurrently on each system?
Example
In:
sample_files = ['one.jpg','free.jpg','two.png','zero.png',
'four.jpg','six.png','seven.png','eight.jpg',
'nine.png','eleven.png','ten.png','ten.png',
'one.jpg','free.jpg','two.png','zero.png',
'four.jpg','six.png','seven.png','eight.jpg',
'nine.png','eleven.png','ten.png','ten.png',
'eleven.png','ten.png']
CLUSTER_SIZE = 3
example_rdd = sc.parallelize(sample_files, CLUSTER_SIZE)
example_partitions = example_rdd.glom().collect()
# Print elements per partition
for i, l in enumerate(example_partitions): print "partition #{} length: {}".format(i, len(l))
# Print partition distribution
print example_partitions
# How many map functions run concurrently when the action is called on this transformation?
from operator import add
mapped_rdd = example_rdd.map(lambda s: (s, len(s)))
action_results = mapped_rdd.reduceByKey(add)
Out:
partition #0 length: 8
partition #1 length: 8
partition #2 length: 10
[ ['one.jpg', 'free.jpg', 'two.png', 'zero.png', 'four.jpg', 'six.png', 'seven.png', 'eight.jpg'],
['nine.png', 'eleven.png', 'ten.png', 'ten.png', 'one.jpg', 'free.jpg', 'two.png', 'zero.png'],
['four.jpg', 'six.png', 'seven.png', 'eight.jpg', 'nine.png', 'eleven.png', 'ten.png', 'ten.png', 'eleven.png', 'ten.png'] ]
In Conclusion
What I need to know is: if the RDD is split the way it is, what controls how many threads run simultaneously? Is it the number of cores, or is there a global parameter that can be set so that it only processes, say, 4 at a time on each partition (system)?
In what order does data get processed from RDDs in Spark?
Unless it is a border case, like only one partition, the order is arbitrary and nondeterministic. It will depend on the cluster, on the data and on different runtime events.
The number of partitions only sets a limit on the overall parallelism of a given stage; in other words, a partition is the minimal unit of parallelism in Spark. No matter how many resources you allocate, a single stage cannot run more concurrent tasks than it has partitions, and each task processes one partition at a time. Once again there can be border cases, e.g. when a worker is not accessible and a task is rescheduled on another machine.
Another possible limit is the number of executor threads. Even if you increase the number of partitions, a single executor thread will process only one of them at a time.
Neither of the above tells you where or when a given partition will be processed. While you can use some dirty, inefficient and non-portable tricks at the configuration level (like a single worker with a single executor thread per machine) to make sure that only one partition is processed on a given machine at a time, this is not particularly useful in general.
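A hedged sketch of that configuration-level trick (the core count is a placeholder; the idea is to cap each executor at one concurrent task by making every task claim all of the executor's cores):
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .set("spark.executor.cores", "4")  # cores available per executor (placeholder)
    .set("spark.task.cpus", "4")       # each task claims all of them, so one task runs per executor at a time
)
sc = SparkContext(conf=conf)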
As a rule of thumb, I would say that Spark code should never be concerned with the time and place it is executed. There are some low-level aspects of the API which provide means to set partition-specific preferences, but as far as I know these don't provide hard guarantees.
That being said, one can think of at least a few ways to approach this problem:
long running executor threads with configuration level guarantees - it could be acceptable if Spark is responsible only for loading and saving data
singleton objects which control queuing jobs on the GPU (see the sketch after this list)
delegating GPU processing to specialized service which ensures proper access
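For the singleton/queueing option above, a rough sketch of one crude per-machine approach, assuming a POSIX filesystem on each worker and a user-supplied run_gpu_model function (both the lock path and that function are hypothetical):
import fcntl

GPU_LOCK_PATH = "/tmp/gpu.lock"  # hypothetical per-machine lock file

def process_partition(files):
    # Serialize GPU access across all Python worker processes on this machine:
    # hold an exclusive file lock for the duration of the partition.
    with open(GPU_LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            for f in files:
                yield (f, run_gpu_model(f))  # run_gpu_model is the user's GPU routine (hypothetical)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

results = example_rdd.mapPartitions(process_partition).collect()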
On a side note, you may be interested in Large Scale Distributed Deep Learning on Hadoop Clusters, which roughly describes an architecture that could be applicable here.
