Need setup recommendation for parallel processing many contracts through many scenarios

I need a recommendation from the gurus out there on how to set up a modeling application. I have thousands of scenarios to run on thousands of contracts for cash flow projections. Assuming I have 1000 scenarios and 1000 contracts, I would need to run 1,000,000 projections (1000 x 1000). I'd like to do this in parallel using dask, ray or some other method. My data are in dataframes, but I'm open to better suggestions. I can create 2 nested loops (scenario, contract) for each run, but this would be sequential:
Scenario1 w Contract1
Scenario1 w Contract2
Scenario1 w Contract3
.
.
.
Scenario1000 w Contract1000
I'd like to distribute the computation across multiple processors and multiple servers.
I'll save my question about the inner-loop projections (where I have to run 100 scenario projections at each time step of each of the 1,000,000 runs) for next time.
Any suggestion that points me in the right direction would help.

From a simple conceptual perspective:
Write yourself a function that takes a contract and a scenario as parameters and performs the desired calculation.
Use Python's multiprocessing to set up a worker pool.
Create a Queue (from the multiprocessing package) that is shared across workers.
Fill the queue with all combinations. It might be a good idea to use fixed indices and push only a tuple of the contract/scenario indices (C, S) to the queue, to reduce the required space.
Map your function to the worker pool given the queue.
There are more elaborate ways to do this (including amqp/celery/...) depending whether you want to distribute tasks across multiple machines or just to all of your locally available cores. This simple concept should contain all required keywords to build your first local multiprocessing on your own!
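The steps above can be sketched roughly as follows. This is a minimal local example, not a definitive implementation: CONTRACT_NOTIONALS, SCENARIO_RATES and the interest calculation in project are hypothetical stand-ins for your own data and projection logic. Instead of an explicit Queue, it relies on the task queue that multiprocessing.Pool manages internally, and only index tuples are sent to the workers.

```python
import itertools
from multiprocessing import Pool

# Hypothetical stand-in data: in a real application these would be looked up
# from your contract and scenario tables.
CONTRACT_NOTIONALS = [100.0, 250.0, 400.0]
SCENARIO_RATES = [0.01, 0.03]


def project(pair):
    """Run one projection for a (contract_index, scenario_index) pair."""
    c, s = pair
    # Placeholder calculation: one year of interest on the notional.
    return c, s, CONTRACT_NOTIONALS[c] * SCENARIO_RATES[s]


def index_pairs(n_contracts, n_scenarios):
    """Yield every (contract, scenario) index combination lazily."""
    return itertools.product(range(n_contracts), range(n_scenarios))


def run_all(processes=2, chunksize=2):
    """Run every projection in a local worker pool and return sorted results."""
    pairs = index_pairs(len(CONTRACT_NOTIONALS), len(SCENARIO_RATES))
    with Pool(processes=processes) as pool:
        # imap_unordered hands the pairs to the workers through the pool's
        # internal task queue and yields results as they complete.
        return sorted(pool.imap_unordered(project, pairs, chunksize=chunksize))


if __name__ == "__main__":
    for c, s, value in run_all():
        print(f"contract {c}, scenario {s}: {value:.2f}")
```

With a million pairs, a larger chunksize reduces the per-task communication overhead. This sketch only uses the cores of one machine; spreading the same pairs over several servers needs one of the distributed options mentioned above.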

Related

Distributed chained computing with Dask on a high failure-rate cluster?

I am using Dask Bag to run some simple map-reduce computation on a special cluster:
import dask.bag as bag
summed_image = bag.from_sequence(my_ids).map(gen_image_from_ids).reduction(sum, sum).compute()
This code generates a chained computation: it starts mapping from from_sequence and gen_image_from_ids, and then reduces all results into one with sum. Thanks to this Dask Bag feature, the summation is done in parallel in a multi-level tree.
My special cluster has a higher failure rate because a worker can be killed at any time: the CPU is taken over by other higher-order processes and released after a while. A kill may occur on only a single node about once every 5 minutes, but my total reduction job may take more than 5 minutes.
Although Dask is good at failure recovery, my job sometimes just never ends. If any internal node in the job tree gets killed, the temporary intermediate results from all previous computations are lost, and the computation has to restart from the beginning.
There is replicate for Dask Future objects, but I could not find a similar feature on the higher-level Dask Bag or DataFrame to ensure data resiliency. Please let me know if there is a common way to keep intermediate results in a Dask cluster with a super-high failure rate.
Update - My workaround
Maybe any distributed computing system will suffer from frequent failures, even if the system can recover from them. In my case the worker shutdown is not really a system failure; it is triggered by the higher-order process. So instead of directly killing my workers, the higher-order process now launches a small Python script that sends a retire_workers() command when it starts running.
As documented, with retire_workers() the scheduler moves data from the retired worker to another available one. So my problem is temporarily solved. However, I still leave the question open, since I think replicated, redundant computing would be a faster solution and would make better use of idle nodes in the cluster.

This might not be the solution you are looking for, but one option is to divide up the task sequence into small-enough batches that can ensure that the task will complete in time (or will be quick to re-do from scratch).
Something like this perhaps:
import dask.bag as db
from toolz import partition_all
n_per_chunk = 100  # just a guess, the best number depends on the case
tasks = list(partition_all(n_per_chunk, my_ids))
results = []
for t in tasks:
    # build the bag from the ids of this batch only
    summed_image = (
        db
        .from_sequence(t)
        .map(gen_image_from_ids)
        .reduction(sum, sum)
        .compute()
    )
    results.append(summed_image)
summed_image = sum(results)  # final result
There are other things to keep in mind here regarding re-starting the workflow on failure (or potentially launching smaller tasks in parallel), but hopefully this gives you a starting point for a workable solution.
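One possible way to handle the restart part, as a sketch only: save each batch's partial result to disk as soon as it is done, so a restarted run skips the batches that already finished. The function name run_batches, the checkpoint file naming and the use of pickle are assumptions made for illustration; compute_batch stands in for the Dask bag computation on one batch.

```python
import os
import pickle
import tempfile


def run_batches(batches, compute_batch, checkpoint_dir):
    """Compute each batch once, reusing results saved by earlier runs."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    results = []
    for i, batch in enumerate(batches):
        path = os.path.join(checkpoint_dir, f"batch_{i:05d}.pkl")
        if os.path.exists(path):
            # Finished in a previous run: load instead of recomputing.
            with open(path, "rb") as f:
                results.append(pickle.load(f))
            continue
        result = compute_batch(batch)
        # Write under a temporary name first, then rename, so a crash
        # cannot leave a half-written checkpoint behind.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(result, f)
        os.replace(tmp_path, path)
        results.append(result)
    return results


if __name__ == "__main__":
    batches = [[1, 2, 3], [4, 5], [6]]
    with tempfile.TemporaryDirectory() as checkpoint_dir:
        # sum stands in for the per-batch Dask computation
        print(sum(run_batches(batches, sum, checkpoint_dir)))
```

The checkpoints are keyed by batch position, so this only works if the batches are generated in the same order on every run.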

Update: More trials later -- this answer is not ideal because the client.replicate() command is blocking. I suspect it requires all futures to be done before making replicas. This is unwanted because 1. any intermediate node can disconnect before all are ready, and 2. it prevents other tasks from running asynchronously. I need another way to make replicas.
After lots of trials, I found one way to replicate the intermediate results during a chained computation to achieve data redundancy. Note that the parallel reduction function is a Dask Bag feature, which does not directly support the replicate facility. However, as the Dask documentation states, one can replicate low-level Dask Future objects to improve resiliency.
Following @SultanOrazbayev's post, I manually compute partial sums and use the persist() function to keep them in cluster memory, as suggested in the comment; the returned item is then essentially a Dask Future:
import dask.bag as db
from dask.distributed import futures_of
from toolz import partition_all
n_per_chunk = 100  # just a guess, the best number depends on the case
tasks = list(partition_all(n_per_chunk, my_ids))
bags = []
for t in tasks:
    # build the bag from the ids of this batch only
    summed_image = (
        db
        .from_sequence(t)
        .map(gen_image_from_ids)
        .reduction(sum, sum)
        .persist()
    )
    bags.append(summed_image)
futures = futures_of(bags)  # This can only be called on the .persist() result
I can then replicate these remote intermediate partial sums and feel safer to sum the futures to get final result:
client.replicate(futures, 5) # Improve resiliency by replicating to 5 workers
summed_image = client.submit(sum, futures).result() # The only line that blocks for the final result
In my experience, 5 replicas is stable for my cluster, although a higher value incurs more network overhead to pass the replicas among workers.
This works but could be improved, for example by performing the reduction (sum) of the intermediate results in parallel, especially when there are lots of tasks. Please leave me your suggestions.
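One possible direction for the parallel reduction, sketched under assumptions: combine the partial sums pairwise, level by level, so that each level's pairs can run at the same time. tree_reduce is a hypothetical helper, not a Dask function. It is shown here with plain numbers; the comment at the end indicates how a combiner built on client.submit might be plugged in, assuming client is a dask.distributed Client and futures holds the partial sums.

```python
import operator


def tree_reduce(items, combine):
    """Combine items pairwise, level by level, until one remains."""
    items = list(items)
    if not items:
        raise ValueError("tree_reduce needs at least one item")
    while len(items) > 1:
        next_level = []
        for i in range(0, len(items) - 1, 2):
            next_level.append(combine(items[i], items[i + 1]))
        if len(items) % 2:
            # The odd item out is carried up to the next level unchanged.
            next_level.append(items[-1])
        items = next_level
    return items[0]


if __name__ == "__main__":
    # Local check with plain numbers
    print(tree_reduce(range(1, 11), operator.add))  # 55
    # With Dask futures, each pairwise sum could be submitted as its own task
    # (assumption: `client` and `futures` exist as in the code above):
    #   total = tree_reduce(futures, lambda a, b: client.submit(operator.add, a, b))
    #   summed_image = total.result()
```

Because submitting a task does not wait for its inputs, the whole tree would be handed to the scheduler up front and only the final result() call would block.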

Open question - Is high level parallelising of many multi threaded serial jobs across a cluster using joblib backend possible?

I am totally new to Ray and have a question regarding it being a potential solution.
I am optimising an image modelling code and have successfully optimised it to run on a single machine, using multi-threaded numpy operations.
Each image generation is a serial operation, which scales across a single node.
What I’d like to do is scale each of these locally parallel jobs across multiple nodes.
Before refactoring, the code was parallelised serially at a high level, calculating single images in parallel. I would like to replicate this parallel behaviour again, across multiple nodes. Essentially this would be batch running a number of independent jobs which compute a single image in parallel across multiple nodes, where those computations themselves are independent of each other, the only communication requirement is sending parameters at the beginning (small) and image arrays at the end (large).
As mentioned, the original parallel implementation used joblib to parallelise the serial image computation over CPUs locally, with each image calculation on a separate CPU. Now I want to replicate this, except with one image calculation process per node, which will then scale across that compute node using multiple threads.
So my idea is to try the joblib backend to control this process. This is the previous high-level Joblib call for running multiple serial image computations in parallel.
[Screenshot of the Joblib call; image not available]
I believe I can just encapsulate the above call with:
with joblib.parallel_backend('ray'):
The above loop is actually being called inside a method of a class, and the image computation uses the class self construct to pass around variables and arrays. Is there anything I have to do with actors to preserve this state?
Any thoughts or pointers would be greatly appreciated.

Threading vs Multiprocessing

Suppose I have a table with 100,000 rows and a Python script that performs some operations on each row of this table sequentially. To speed this up, should I create 10 separate scripts and run them simultaneously, each processing a consecutive block of 10,000 rows, or should I create 10 threads to process the rows?

Threading
Due to the Global Interpreter Lock (GIL), Python threads are not truly parallel. In other words, only a single thread can be executing Python code at a time.
If you are performing CPU-bound tasks, then dividing the workload amongst threads will not speed up your computations. If anything it will slow them down, because there are more threads for the interpreter to switch between.
Threading is much more useful for IO-bound tasks, for example if you are communicating with a number of different clients/servers at the same time. In this case you can switch between threads while you are waiting for different clients/servers to respond.
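A small sketch of the IO-bound case, using only the standard library. The fetch function is a hypothetical stand-in for a network call; time.sleep releases the GIL in the same way that waiting on a socket does, so the waits overlap across threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(request_id, delay=0.05):
    """Stand-in for a network call: the thread just waits, then answers."""
    time.sleep(delay)
    return request_id * 2


def fetch_all(request_ids, max_workers=10):
    """Issue all requests from a thread pool and return results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves the input order in its results
        return list(pool.map(fetch, request_ids))


if __name__ == "__main__":
    start = time.perf_counter()
    results = fetch_all(range(10))
    elapsed = time.perf_counter() - start
    # The ten waits overlap, so this takes far less than 10 * 0.05 s.
    print(results, f"{elapsed:.2f}s")
```

If fetch did CPU-heavy work in pure Python instead of waiting, the same pool would give no speedup, for the reason described above.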
Multiprocessing
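Separate processes each run their own interpreter with their own GIL, so CPU-bound work can run in parallel across cores. As Eman Hamed has pointed out, however, it can be difficult to share objects between processes.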
Vectorization
Libraries like pandas allow you to use vectorized methods on tables. These are highly optimized operations written in C that execute very fast on an entire table or column. Depending on the structure of your table and the operations that you want to perform, you should consider taking advantage of this

The threads of a process share a contiguous (virtual) memory block known as the heap; separate processes don't. Threads also consume fewer OS resources than whole processes (separate scripts), and switching between threads of the same process is cheaper than switching between processes.
The single biggest performance factor in multithreaded execution when there are no locks/barriers involved is data access locality, e.g. in matrix multiplication kernels.
Suppose data is stored in the heap in a linear fashion, i.e. the 0th row in bytes [0-4095], the 1st row in bytes [4096-8191], etc. Then thread-0 should operate on rows 0, 10, 20, ..., thread-1 on rows 1, 11, 21, ..., etc.
The main idea is to have a set of 4K pages kept in physical RAM and 64-byte blocks kept in L3 cache, and to operate on them repeatedly. Computers usually assume that if you use a particular memory location then you're also going to use adjacent ones, and you should do your best to do so in your program. The worst-case scenario is accessing memory locations that are ~10 MiB apart in a random fashion, so don't do that. E.g. if a single row is 1,310,720 doubles (64-bit each, so 10 MiB per row) in size, then your threads should operate in an intra-row (within a single row) rather than inter-row (as above) fashion.
Benchmark your code and act on the results. If your algorithm can process rows at around 21.3 GB/s (the peak bandwidth of DDR3-2666), then you have a memory-bound task. If your code processes more like 1 GiB/s, then you have a compute-bound task, meaning that executing instructions on the data takes more time than fetching the data from RAM. In that case you need to either optimize your code, reach a higher IPC by utilizing AVX instruction sets, or buy a newer processor with more cores or a higher frequency.
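A rough way to get the processing-speed number for that comparison, as a sketch: time the row-processing function over a buffer of known size and divide. throughput_gib_per_s and scan are hypothetical names, and the 64 MiB buffer of zeros is only a stand-in for the real table.

```python
import time


def throughput_gib_per_s(process, data, repeats=3):
    """Return the best observed throughput of process(data) in GiB/s."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process(data)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers.
    return len(data) / max(best, 1e-12) / 2**30


def scan(buf):
    """Stand-in workload: one pass over every byte of the buffer."""
    return buf.count(1)


if __name__ == "__main__":
    data = bytes(64 * 2**20)  # 64 MiB of zeros as stand-in table data
    print(f"{throughput_gib_per_s(scan, data):.1f} GiB/s")
```

Taking the best of several repeats reduces noise from other processes. The buffer should be much larger than the CPU caches, otherwise the measurement reflects cache speed rather than RAM speed.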

What is the most efficient way to utilize dask multiprocessing scheduler if data flow between tasks is big?

We have a dask compute graph (quite custom, so we use dask delayed instead of collections). I've read in the docs that the current scheduling policy is LIFO, so a worker process has a good chance of getting the data it has just computed for further steps down the graph. But as far as I understand, task computation results are still (de)serialized to the hard drive even in this case.
So the question is how much performance I would gain by keeping as few tasks as possible along a single path of independent computations in a graph:
A) many small "map" tasks along each path
t --> t --> t -->...
some reduce stage
t --> t --> t -->...
B) one huge "map" task along for each path
T ->
some reduce stage
T ->
Thank you!

The dask multiprocessing scheduler will automatically fuse linear chains of tasks into single tasks, so your case A above will automatically become case B.
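To illustrate the idea of fusing a linear chain, here is a plain-Python sketch. It is not Dask's implementation, and fuse, load, clean and total are hypothetical names: the three small steps of case A are composed into the single per-path task of case B, so the intermediate values never leave the worker that produced them.

```python
from functools import reduce


def fuse(*funcs):
    """Return one function that applies funcs left to right."""
    def fused(x):
        return reduce(lambda value, f: f(value), funcs, x)
    return fused


def load(i):
    """Hypothetical first step of a chain."""
    return list(range(i))


def clean(xs):
    """Hypothetical second step: keep the even values."""
    return [x for x in xs if x % 2 == 0]


def total(xs):
    """Hypothetical last step before the reduce stage."""
    return sum(xs)


# Case A is three small tasks per path; case B is this one fused task.
path_task = fuse(load, clean, total)

if __name__ == "__main__":
    print([path_task(i) for i in (4, 6)])  # [2, 6]
```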
If your workloads are more complex and do require inter-node communication then you might want to try the distributed scheduler on a single computer. It manages data movement between workers more intelligently.
$ pip install dask distributed
>>> from dask.distributed import Client
>>> c = Client() # Starts local "cluster". Becomes the global scheduler
Further reading
http://dask.pydata.org/en/latest/scheduler-choice.html
http://dask.pydata.org/en/latest/optimize.html
Correction
Also, just as a note, Dask doesn't persist intermediate results on disk. Rather it communicates intermediate results directly between processes.

In what order does data get processed from RDDs in Spark?

Context
Spark provides RDDs, on which map functions can be used to lazily set up operations for processing in parallel. RDDs can be created with a partitioning parameter that determines how many partitions to create per RDD; preferably this parameter equals the number of systems. (Example: you have 12 files to process, so you create an RDD with 3 partitions, which splits the data into buckets of 4 files, one bucket for each of 3 systems, and all the files get processed concurrently.) It is my understanding that these partitions control the portion of data that goes to each system for processing.
Issue
I need to fine tune and control how many functions run at same time per system. If 2 or more functions run on same GPU at the same time, the system will crash.
Question
If an RDD is not evenly nicely split (like in the example above), how many threads run concurrently on the system?
Example
In:
from operator import add
sample_files = ['one.jpg','free.jpg','two.png','zero.png',
                'four.jpg','six.png','seven.png','eight.jpg',
                'nine.png','eleven.png','ten.png','ten.png',
                'one.jpg','free.jpg','two.png','zero.png',
                'four.jpg','six.png','seven.png','eight.jpg',
                'nine.png','eleven.png','ten.png','ten.png',
                'eleven.png','ten.png']
CLUSTER_SIZE = 3
example_rdd = sc.parallelize(sample_files, CLUSTER_SIZE)
example_partitions = example_rdd.glom().collect()
# Print elements per partition
for i, l in enumerate(example_partitions):
    print("partition #{} length: {}".format(i, len(l)))
# Print partition distribution
print(example_partitions)
# How many map functions run concurrently when the action is called on this Transformation?
mapped_rdd = example_rdd.map(lambda s: (s, len(s)))
action_results = mapped_rdd.reduceByKey(add)
Out:
partition #0 length: 8
partition #1 length: 8
partition #2 length: 10
[ ['one.jpg', 'free.jpg', 'two.png', 'zero.png', 'four.jpg', 'six.png', 'seven.png', 'eight.jpg'],
['nine.png', 'eleven.png', 'ten.png', 'ten.png', 'one.jpg', 'free.jpg', 'two.png', 'zero.png'],
['four.jpg', 'six.png', 'seven.png', 'eight.jpg', 'nine.png', 'eleven.png', 'ten.png', 'ten.png', 'eleven.png', 'ten.png'] ]
In Conclusion
What I need to know, is if the RDD is split the way it is, what controls how many threads are processed simultaneously? Is it the number of cores, or is there a global parameter that can be set so it only processes 4 at a time on each partition (system)?

In what order does data get processed from RDDs in Spark?
Unless it is some border case, like only one partition, order is arbitrary or nondeterministic. This will depend on the cluster, on the data and on different runtime events.
The number of partitions only sets an upper limit on the overall parallelism of a given stage; in other words, a partition is the minimal unit of parallelism in Spark. No matter how many resources you allocate, a single stage cannot process more partitions than that at a time. Once again, there can be border cases, such as when a worker is not accessible and a task is rescheduled on another machine.
Another possible limit you can think of is the number of executor threads. Even if you increase the number of partitions, a single executor thread will process only one at a time.
Neither of the above tells you where or when a given partition will be processed. While you can use some dirty, inefficient and non-portable tricks at the configuration level (like a single worker with a single executor thread per machine) to make sure that only one partition is processed on a given machine at a time, it is not particularly useful in general.
As a rule of thumb, I would say that Spark code should never be concerned with the time and place it is executed. There are some low-level aspects of the API which provide means to set partition-specific preferences, but as far as I know these don't provide hard guarantees.
That being said one can think of at least few ways you can approach this problem:
long running executor threads with configuration level guarantees - it could be acceptable if Spark is responsible only for loading and saving data
singleton objects which control queuing jobs on the GPU
delegating GPU processing to specialized service which ensures proper access
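The singleton idea from the list above could look roughly like this in Python. This is a sketch under assumptions: GpuGate and fake_gpu_job are hypothetical names, and the gate only serializes callers inside one process. As far as I understand, PySpark tasks that run concurrently usually live in separate Python worker processes, so there a cross-process mechanism (for example a lock file) would be needed instead of a thread lock.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class GpuGate:
    """Process-wide singleton that lets one caller use the GPU at a time."""
    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls):
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._gpu_lock = threading.Lock()
                cls._instance.active = 0
                cls._instance.max_active = 0
        return cls._instance

    def run(self, func, *args):
        """Run func(*args) while holding the GPU lock."""
        with self._gpu_lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            try:
                return func(*args)
            finally:
                self.active -= 1


def fake_gpu_job(name):
    """Hypothetical stand-in for a call that uses the GPU."""
    time.sleep(0.01)
    return name, len(name)


def process_file(name):
    # Every task in this process goes through the same gate.
    return GpuGate().run(fake_gpu_job, name)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(process_file, ["one.jpg", "two.png", "ten.png"])))
    print("max concurrent GPU jobs:", GpuGate().max_active)
```

The max_active counter is only there to check that the gate works; a real version would not need it.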
On a side note, you may be interested in Large Scale Distributed Deep Learning on Hadoop Clusters, which roughly describes an architecture that may be applicable here.
