Asynchronous Training with Ray - python

I want to be able to throw a lot of data-collection tasks at some Ray workers while a trainer works concurrently and asynchronously on another CPU, training on the collected data. The idea resembles this example from the docs: https://docs.ray.io/en/master/auto_examples/plot_parameter_server.html#asynchronous-parameter-server-training
The difference is that I don't want to hang waiting for the next sample to arrive, which blocks me from assigning a new task (as the ray.wait in the linked example does). Instead I want to throw a lot of sampling tasks at the pool and have the trainer's training process start only once at least N samples have been collected by the data-collection tasks.
How can I do that using ray?

Can you take a look at e.g. DQN's or SAC's execution plan in RLlib?
ray/rllib/agents/dqn/dqn.py::execution_plan().
E.g. DQN samples via the remote workers and puts the collected samples into the buffer, while - at the same time - sampling from that buffer and doing learning updates on this buffer-sampled data. You can also set the "training intensity", the ratio between time steps sampled and time steps trained.
SAC works the same. APEX-DQN on the other hand uses distributed replay buffers to allow for even faster sample storage and retrieval.
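For the pattern asked about in the question, here is a minimal sketch that avoids blocking on single samples (the collect_sample task and Trainer actor are illustrative placeholders, not RLlib code): all collection tasks are launched up front, and ray.wait with num_returns=N only returns once at least N of them have finished, while the rest keep running.
import ray

ray.init()

@ray.remote
def collect_sample(i):
    # Placeholder for an expensive data-collection task.
    return {"sample_id": i}

@ray.remote
class Trainer:
    def __init__(self):
        self.buffer = []

    def add(self, samples):
        self.buffer.extend(samples)

    def train(self):
        # Placeholder training step on whatever is currently buffered.
        return len(self.buffer)

N = 32  # minimum number of samples required before a training step
trainer = Trainer.remote()

# Launch a large batch of collection tasks up front; nothing blocks here.
pending = [collect_sample.remote(i) for i in range(1000)]

while pending:
    # Block only until at least N tasks are done (or all remaining ones),
    # instead of waiting for samples one at a time.
    ready, pending = ray.wait(pending, num_returns=min(N, len(pending)))
    trainer.add.remote(ray.get(ready))
    # The training step runs in the Trainer actor's own process, concurrently
    # with the collection tasks that are still pending.
    trainer.train.remote()
Because the Trainer is an actor, its train() calls proceed concurrently with the still-pending collection tasks; the N threshold plays a role loosely analogous to RLlib's training intensity setting.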

Related

How to automatically define the optimal number of workers in a pipeline/conveyor?

Given a pipeline / conveyor with two steps:
a processing step
a writing step, which cannot be executed in parallel due to e.g. a filelock and is thus the bottleneck
How can we define the optimal number of workers for the processing step?
If we choose too few workers, the writing step may become idle.
If we choose too many workers, the processing step may result in a large memory spike at the beginning.
I believe we want to aim for time_processing_step / number_workers ~= time_writing_step. However, we don't know these times upfront, and the time may not even be constant for each input item. Hence: is there a way to automatically balance such a pipeline?
Notes:
I'm thinking of a pipeline implemented in the form of e.g. a ThreadPoolExecutor or ProcessPoolExecutor.
I found this relevant SO thread, but it's already 12 years old, so maybe something changed since then.
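One way to get such self-balancing behaviour without knowing the step times upfront is to bound the number of in-flight items, so the processing pool throttles itself to the writer's pace. A minimal sketch with a ThreadPoolExecutor (process_item, write_item and the limits are hypothetical placeholders, not from the question):
import threading
from concurrent.futures import ThreadPoolExecutor

def process_item(item):
    return item * 2  # placeholder for the parallel processing step

def write_item(result):
    pass  # placeholder for the serialized (file-locked) writing step

def run_pipeline(items, n_workers=8, max_in_flight=32):
    in_flight = threading.Semaphore(max_in_flight)  # bounds the memory spike
    write_lock = threading.Lock()                   # writing must not run in parallel

    def work(item):
        try:
            result = process_item(item)
            with write_lock:
                write_item(result)
        finally:
            in_flight.release()

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for item in items:
            in_flight.acquire()   # blocks submission when the writer falls behind
            pool.submit(work, item)

run_pipeline(range(1000))
With this back-pressure in place, n_workers can be set generously; the semaphore, rather than the worker count, limits how far processing runs ahead of the serialized writing step.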

Featuretools taking too long to build features without using CPU cores

I'm using featuretools Deep Feature Synthesis to build features for a dataset of 40k rows and 200 columns. I chose about 40 transformation primitives, as you can see in the code below:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="df", n_jobs=6,
                                      trans_primitives=primitives.name.to_list(),
                                      verbose=True)
But when I run my code, it takes a long time to discover the features to build, and this process doesn't run on multiple cores of my CPU; not even a single core reaches 100% usage. In other words, I'm waiting hours for a process that uses only minimal resources on my machine (memory is not a problem either).
After featuretools discovers the features (and prints the log "built n features"), it creates the cluster and uses all the cores specified in the n_jobs parameter at 100% capacity. This second phase is really fast, just a few seconds, once all my resources are being used.
My question is: why is this happening? Is it possible to discover the features faster to reduce this time? I just don't understand how a process that barely uses any resources can take so long.
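For what it's worth, the two phases described above can be made explicit, which makes the single-threaded part visible. A rough sketch, assuming the same es entityset and primitives list as in the code above and the older featuretools API that uses target_entity:
import featuretools as ft

# Phase 1: feature definition only (features_only=True skips the computation).
# This corresponds to the long, single-threaded "discovery" part observed above.
feature_defs = ft.dfs(entityset=es, target_entity="df",
                      trans_primitives=primitives.name.to_list(),
                      features_only=True)

# Phase 2: feature calculation, which fans out across the n_jobs workers
# and corresponds to the short, fully parallel part at the end.
feature_matrix = ft.calculate_feature_matrix(features=feature_defs,
                                             entityset=es,
                                             n_jobs=6, verbose=True)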

Dataflow pipeline throughput decreases drastically as execution advances + unexpected side input behavior

I have a dataflow pipeline processing an input of about 1Gb of data with two dicts as side_inputs. The goal is to calculate features from the main dataset with the help of those two side_inputs.
Overall structure of the pipeline is as follows:
# First side input, ends up as a 2GB dict with 3.5 million keys
side_inp1 = (
    p
    | "read side_input1" >> beam.io.ReadFromAvro("$PATH/*.avro")
    | "to list of tuples" >> beam.Map(lambda row: (row["key"], row["value"]))
)

# Second side input, ends up as a 1.6GB dict with 4.5 million keys
side_inp2 = (
    p
    | "read side_input2" >> beam.io.ReadFromAvro("$PATH2/*.avro")
    | "to list of tuples" >> beam.Map(lambda row: (row["key"], row["value"]))
)

# The main part of the pipeline, reading an avro dataset of 1 million rows -- 20GB
(
    p
    | "read inputs" >> beam.io.ReadFromAvro("$MainPath/*.avro")
    | "main func" >> beam.Map(MyMapper, pvalue.AsDict(side_inp1), pvalue.AsDict(side_inp2))
)
Here's the Dataflow graph:
And the "Featurize" step unwrapped:
So Featurize is a function that looks up ids in the side inputs, gets the vectors, and does roughly 180 different vector dot products to calculate some features. It's a completely CPU-bound process and it's expected to take longer than the rest of the pipeline, but the stalling is what's strange here.
My problems are twofold:
The Dataflow pipeline seems to slow down drastically as it moves further along. I don't know what the reasons are or how I can alleviate this problem. A throughput chart of the MyMapper step can be seen below; I'm puzzled by the declining throughput (from ~400 rows/sec down to nearly 1 row/sec at the end).
The behavior of the side_inputs is also strange to me. I expected the side_inputs to be read once and only once, but when I check the Job Metrics / Throughput chart, I observe the following. As can be seen, the pipeline is constantly reading the side_inputs, while what I want is just two dicts kept in memory.
Other job configurations
zone: us-central-1a
machine_type: m1-ultramem-40 (40 CPU cores, 960GB RAM)
disk_type/size: ssd/50GB
experiments: shuffle-service enabled.
max_num_workers: 1 to help ease calculations and metrics, and not have them vary due to auto-scaling.
Extra Observations
I'm constantly seeing log entries like the following in LogViewer: [INFO] Completed workitem: 4867151000103436312 in 1069.056863785 seconds
All completed workItems so far have taken about 1000-1100 seconds; this is another source of confusion: why should throughput drop while processing a workItem takes the same time as before? Has parallelism dropped for some reason (maybe some hidden threading threshold that's out of my control, like harness_threads)?
In the later parts of the pipeline, looking at the logs, it looks like the execution pattern is very sequential (it seems to execute one workItem, finish it, then move to the next), which is strange to me considering there is 1TB of available memory and 40 cores.
There are 0 errors or even warnings
The throughput chart in point 1 is a good indicator that the performance in your job decreased somehow.
The side input is intended to be in memory; however, I'm not quite sure that a pipeline with only one high-mem node is a good approach. With only one node, the pipeline might have bottlenecks that are difficult to identify, e.g. network or OS limitations (such as the maximum number of open files, related to the files loaded into memory). Because of Beam's architecture, I don't think it is a problem to have more nodes, even with autoscaling enabled, since autoscaling automatically chooses the appropriate number of worker instances required to run your job. If you are worried about calculations and metrics for other reasons, please share.
Regarding point 2, I think it is expected to see activity on the graph, since the side input (in memory) is read by each element being processed. However, if this doesn't make sense to you, you can always add the complete job graph so we can understand the other details of the pipeline steps.
My recommendation is to add more workers to distribute the workload, since a PCollection is a distributed dataset that will be spread across the available nodes. You can try to get similar computational resources with more nodes, for example 4 n2d-highmem-16 instances (16 vCPUs, 128GB each). With these changes it is possible that any bottlenecks disappear or are mitigated; in addition, you can monitor the new job in the same way:
Remember to check for errors in your pipeline, so you can identify any other issues that are causing the performance problem.
Check the CPU and memory usage in the Dataflow UI. If memory errors are happening at the job level, Stackdriver should show them as memory errors, but the memory on the host instance should also be checked to make sure it is not hitting an OS limit for other reasons.
You might want to check this example with side inputs as dictionaries. I'm not an expert, but you can follow the best practices in the example.
UPDATE
If the n2d-highmem-16 machines hit OOM, it seems to me that each harness thread might use a copy of the dicts. I'm not quite sure whether configuring the number of threads can help, but you can try setting number_of_worker_harness_threads in the pipeline options.
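For illustration, a rough sketch of setting these worker options via PipelineOptions (the project, bucket and concrete values are placeholders; number_of_worker_harness_threads and max_num_workers are standard Dataflow worker options in the Python SDK):
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder project id
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # placeholder bucket
    max_num_workers=4,                   # e.g. four smaller high-mem workers instead of one big node
    number_of_worker_harness_threads=8,  # cap harness threads to limit per-thread copies of the dicts
)

# The pipeline would then be built with these options:
# with beam.Pipeline(options=options) as p:
#     ...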
On the other hand, can you expand the Featurize step? The wall time is very high in this step (~6 days); let's check the composite transforms that absorbed such latency, and please share a code snippet for the problematic ones. To identify composite transforms that may have issues, please refer to the Side Inputs Metrics, especially Time spent writing and Time spent reading.

Getting Dask map_blocks to make use of all available resources

I am using Dask to parallelize time series satellite imagery analysis on a cluster with a substantial amount of computational resources.
I have set up a distributed scheduler with many workers (--nprocs = 56) each managing one thread (--nthreads = 1) and 4GB of memory due to the embarrassingly parallel nature of the work.
My data comes in as an xarray that is chunked into a dask array and map_blocks is used to map a function across each chunk in order to generate an output array that will be saved to an image file.
data = inputArray.chunk(chunks={'y': 1})   # one chunk per row along y
data = client.persist(data)                # persist returns the persisted collection
# Map the analysis function over every chunk of the underlying dask array.
future = data.data.map_blocks(timeSeriesTrends.timeSeriesTrends, jd, drop_axis=[1])
future = client.persist(future)
dask.distributed.wait(future)
outputArray = future.compute()
My problem is that Dask does not make use of all the resources I have allocated to it. Instead it begins with very few parallelized tasks and slowly adds more as processes finish without ever reaching capacity.
This dramatically restricts the capabilities of the hardware I have access to as many of my resources spend most of their time sitting idle.
Is my approach appropriate for generating an output array from an input array? How can I best make use of the hardware I have access to in this situation?

Benchmark of HowTo: Reading Data

I'm using tensorflow 0.10 and I was benchmarking the examples found in the official HowTo on reading data. This HowTo illustrates different methods to move data to tensorflow, using the same MNIST example.
I was surprised by the results and I was wondering if anyone has enough low-level understanding to explain what is happening.
In the HowTo there are basically 3 methods to read in data:
Feeding: building the mini-batch in python and passing it with sess.run(..., feed_dict={x: mini_batch})
Reading from files: use tf operations to open the files and create mini-batches. (Bypass handling data in python.)
Preloaded data: load all the data in either a single tf variable or constant and use tf functions to break that up in mini-batches. The variable or constant is pinned to the cpu, not gpu.
The scripts I used to run my benchmarks are found within tensorflow:
Feeding: examples/tutorials/mnist/fully_connected_feed.py
Reading from files: examples/how_tos/reading_data/convert_to_records.py and examples/how_tos/reading_data/fully_connected_reader.py
Preloaded data (constant): examples/how_tos/reading_data/fully_connected_preloaded.py
Preloaded data (variable): examples/how_tos/reading_data/fully_connected_preloaded_var.py
I ran those scripts unmodified, except for the last two, which crash (for version 0.10 at least) unless I add an extra sess.run(tf.initialize_local_variables()).
Main Question
The time to execute 100 mini-batches of 100 examples running on a GTX1060:
Feeding: ~0.001 s
Reading from files: ~0.010 s
Preloaded data (constant): ~0.010 s
Preloaded data (variable): ~0.010 s
Those results are quite surprising to me. I would have expected Feeding to be the slowest, since it does almost everything in Python, while the other methods use lower-level tensorflow/C++ to carry out similar operations. It is the complete opposite of what I expected. Does anyone understand what is going on?
Secondary question
I have access to another machine which has a Titan X and older NVidia drivers. The relative results were roughly in line with the above, except for Preloaded data (constant) which was catastrophically slow, taking many seconds for a single mini-batch.
Is this some known issue that performance can vary greatly with hardware/drivers?
Update Oct 9: the slowness comes from the computation running too fast for Python to pre-empt the computation thread and schedule the pre-fetching threads. Computation in the main thread takes 2ms, and apparently that's too little for the pre-fetching thread to grab the GIL. The pre-fetching thread has a larger delay and hence can always be pre-empted by the computation thread. So the computation thread runs through all of the examples, and then spends most of its time blocked on the GIL while some pre-fetching thread gets scheduled and enqueues a single example. The solution is to increase the number of Python threads, increase the queue size to fit the entire dataset, start the queue runners, and then pause the main thread for a couple of seconds to give the queue runners time to pre-populate the queue.
Old stuff
That's surprisingly slow.
This looks like some kind of special case making the last 3 examples unnecessarily slow (most effort went into optimizing large models like ImageNet, so MNIST didn't get as much attention).
You can diagnose the problems by getting timelines, as described here
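For reference, a minimal sketch of collecting such a timeline in TF 0.x/1.x (the matmul graph here is a throwaway, only to show the tracing calls):
import tensorflow as tf
from tensorflow.python.client import timeline

# Throwaway graph, just to have something to trace.
a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(c, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace that can be opened at chrome://tracing.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())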
Here are 3 of those examples with timeline collection enabled.
Here's the timeline for feed_dict implementation
The important thing to notice is that matmul takes a good chunk of the time, so the reading overhead is not significant
Now here's the timeline for reader implementation
You can see that the operation is bottlenecked on QueueDequeueMany, which takes a whopping 45ms.
If you zoom in, you'll see a bunch of tiny MEMCPY and Cast operations, which is a sign of some op being CPU only (parse_single_example), and the dequeue having to schedule multiple independent CPU->GPU transfers
For the var example below with GPU disabled, I don't see tiny little ops, but QueueDequeueMany still takes over 10ms. The timing seems to scale linearly with batch size, so there's some fundamental slowness there. Filed #4740
Yaroslav nails the problem well. With small models you'll need to speed up the data import. One way to do this is with the TensorFlow function tf.TFRecordReader.read_up_to, which reads multiple records in each session.run() call, thereby removing the excess overhead caused by multiple calls.
enqueue_many_size = SOME_ENQUEUE_MANY_SIZE
reader = tf.TFRecordReader(
    options=tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB))
_, queue_batch = reader.read_up_to(filename_queue, enqueue_many_size)
batch_serialized_example = tf.train.shuffle_batch(
    [queue_batch],
    batch_size=batch_size,
    num_threads=thread_number,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    enqueue_many=True)
This was also addressed in this SO question.
The main question is why the preloaded data (constant) example, examples/how_tos/reading_data/fully_connected_preloaded.py, is significantly slower than the other data-loading examples when using a GPU.
I had the same problem, that fully_connected_preloaded.py is unexpectedly slow on my Titan X. The problem was that the whole dataset was pre-loaded on CPU, not GPU.
First, let me share my initial attempts. I applied the following performance tips by Yaroslav.
set capacity=55000 for tf.train.slice_input_producer (55000 is the size of the MNIST training set in my case).
set num_threads=5 for tf.train.batch.
set capacity=500 for tf.train.batch.
put time.sleep(10) after tf.train.start_queue_runners.
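A rough sketch of how these four settings fit together in the TF 0.x queue-runner API (the zero-filled arrays below are stand-ins for the MNIST tensors used in fully_connected_preloaded.py):
import time
import numpy as np
import tensorflow as tf

# Stand-ins for data_sets.train.images / data_sets.train.labels.
images = tf.constant(np.zeros((55000, 784), dtype=np.float32))
labels = tf.constant(np.zeros((55000,), dtype=np.int32))

# Tip 1: capacity large enough to hold the whole training set.
image, label = tf.train.slice_input_producer([images, labels],
                                             shuffle=True, capacity=55000)

# Tips 2 and 3: more dequeue threads and a larger batching queue.
image_batch, label_batch = tf.train.batch([image, label], batch_size=100,
                                          num_threads=5, capacity=500)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    sess.run(tf.initialize_local_variables())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    time.sleep(10)  # Tip 4: give the queue runners a head start to fill the queues.
    # ... training steps consuming image_batch / label_batch would go here ...
    coord.request_stop()
    coord.join(threads)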
However, even with these settings the average speed per batch stayed the same. I tried timeline visualization for profiling, and still got QueueDequeueManyV2 dominating.
The problem was line 65 of fully_connected_preloaded.py. The following code loads the entire dataset onto the CPU, still creating a bottleneck for CPU-to-GPU data transfer:
with tf.device('/cpu:0'):
    input_images = tf.constant(data_sets.train.images)
    input_labels = tf.constant(data_sets.train.labels)
Hence, I switched the device allocation.
with tf.device('/gpu:0'):
Then I got a ~100x speed-up per batch.
Note:
This was possible because Titan X has enough memory space to preload entire dataset.
In the original code (fully_connected_preloaded.py), the comment on line 64 says "rest of pipeline is CPU-only". I am not sure what this comment intended.
