I have a Random Forest, and using K-fold validation I'm trying to find optimal values for depth lenght and some other parameters.
I am running a QuadCore CPU thus I am trying to figure out, how i can iterate over say max_depth = range(50,101), such that when one of the cores is done fitting its forest with its max_depth it automatically takes the next max_depth in the list.
Or is it better splitting max_depth in 4 equal sizes and just make 4 processes manually?
IMHO, the best solution would be using queues and manual process-creation. That way, you have full control over the processes.
Create your function that does the Random forest after getting its max_depth from an input queue. Push the result in an output queue.
You put all your max_depthvalues in the input queue
You create a number of processes that best fit your architecture. Typically, 8 processes for 4 cores (with Hyperthreading) is a goot starting point
Start the processes.
After starting the processes each one will take one argument from the input queue and does your RandomForest. After one process has finished, it puts the result in another queue and retrieves another argument from the input queue. Using the queue, you don't have to care which process is finishing first etc. as the queues are thread-safe, so that only one process can access them. Any other will wait for access. You also wouldn't have to worry what your best split of the max_depth-list is. As soon as one process is finished it automatically gets a new value for the calculation. If nothing is in the queue, it stops.
I usually prefer this type of multiprocessing over Pool as I have a bit more control. Here is a small example.
Related
I am using pool.apply_async to parallelize my code. One of the arguments I am passing in is a set that I have stored in memory. Given I am passing in a pointer to the set (rather than the set itself) does multiprocessing make copies of the set for each CPU or is the same reference passed to each CPU? I assume that given the set is stored in memory, each CPU would receives it's own set. If that is not the case, would this be a bottleneck since each CPU would request the same object?
So in principle, if you create tasks for new processes, the data has to be copied over to the new processes, as the processes don't share memory (compared to threads). The details vary a bit from Operating System to Operating System (i.e. fork vs spawn) but in general the data has to be copied over.
Whether that is the bottleneck depends a bit on how much computation you are actually doing in the processes vs the amount of data that has to be transferred over to the child processes. I suggest to measure the times: (1) process start triggered by parent, (2) actual start of computation inside the child process and (3) end of computation. That should give you roughly the ramp-up (2-1) and the computation (3-2). With these numbers you can judge best if IO is the bottleneck or computation.
Given a pipeline / conveyor with two steps:
a processing step
a writing step, which cannot be executed in parallel due to e.g. a filelock and is thus the bottleneck
How can define the optimal number of workers for the processing step?
If we choose too few workers, the writing step may become idle.
If we choose too many workers, the processing step may result in a large memory spike at the beginning.
If believe we want to aim for time_processing_step / number_workers ~= time_writing_step. However, we don't know these times upfront. And maybe the time is not even constant for each input item. Hence: Is there a way to automatically balance such a pipeline?
Notes:
I'm thinking of a pipeline implemented in the form of e.g. a ThreadPoolExecutor or ProcessPoolExecutor.
I found this relevant SO thread, but it's already 12 years old, so maybe something changed since then.
I am using Dask Bag to run some simple map-reduce computation on a special cluster:
import dask.bag as bag
summed_image = bag.from_sequence(my_ids).map(gen_image_from_ids).reduction(sum, sum).compute()
This code generates a chained computation, starts mapping from from_sequence and gen_image_from_ids, and then reduces all results into one with sum's. Thanks to Dask Bag's feature, the summation is done in parallel in a multi-level tree.
My special cluster setting has higher failure rate because my worker can be killed anytime and the CUP is taken over by other higher-order processes and then released after a while. The kill may occur once on only a single node per 5 minutes, but my total reduction job may take more than 5 minutes.
Although Dask is good at failure recovery, my job sometimes just never ends. Consider if any internal node in the job tree gets killed, the temporary intermediate results from all previous computations are missing. And the computation should restart from beginning.
There is replicate for Dask Future objects but I could not find similar feature on higher-level Dask Bag or Dataframe to ensure data resiliency. Please let me know if there is a common treatment to keep intermediate results in a Dask cluster with super-high failure rate.
Update - My workaround
Maybe any distributed computing system will suffer from frequent failures even though the system can recover from them. In my case the worker shutdown is not essentially system failure, but is triggered by the higher-order process. So instead of directly killing my workers, the higher-order process now launches a small python script to send retire_worker() command, when it starts running.
As documented, by retire_worker() scheduler will move data from the retired worker to another one available. So my problem is temporarily solved. However, I sill leave the question open since I think replicated, redundant computing would be a faster solution, and better use idle nodes in the cluster.
This might not be the solution you are looking for, but one option is to divide up the task sequence into small-enough batches that can ensure that the task will complete in time (or will be quick to re-do from scratch).
Something like this perhaps:
import dask.bag as db
from toolz import partition_all
n_per_chunk = 100 # just a guess, the best number depends on the case
tasks = list(partition_all(n_per_chunk, my_ids))
results = []
for t in tasks:
summed_image = (
db
.from_sequence(my_ids)
.map(gen_image_from_ids)
.reduction(sum, sum)
.compute()
)
results.append(summed_image)
summed_image = sum(results) # final result
There are other things to keep in mind here regarding re-starting the workflow on failure (or potentially launching smaller tasks in parallel), but hopefully this gives you a starting point for a workable solution.
Update: More trials later -- this answer is not ideal because client.replicate() command is blocking. I suspect it requires all futures to be done before making replica -- this is unwanted because 1. any intermediate node can disconnect before all are ready, and 2. it prevents other tasks to run asynchronously. I need other way to make replica.
After lots of trials, I found one way to replicate the intermediate results during chained computation to realize data redundancy. Note the parallel reduction function is a Dask Bag feature, which does not directly support replicate facility. However, as Dask document states, one can replicate low-level Dask Future objects to improve resiliency.
Following #SultanOrazbayev's post to manually perform partial sums, with persist() function to keep partial sums in cluster memory as in the comment, the returned item is essentially a Dask Future:
import dask.bag as db
from dask.distributed import futures_of
from toolz import partition_all
n_per_chunk = 100 # just a guess, the best number depends on the case
tasks = list(partition_all(n_per_chunk, my_ids))
bags = []
for t in tasks:
summed_image = (
db
.from_sequence(my_ids)
.map(gen_image_from_ids)
.reduction(sum, sum)
.persist()
)
bags.append(summed_image)
futures = futures_of(bags) # This can only be called on the .persist() result
I can then replicate these remote intermediate partial sums and feel safer to sum the futures to get final result:
client.replicate(futures, 5) # Improve resiliency by replicating to 5 workers
summed_image = client.submit(sum, futures).result() # The only line that blocks for the final result
Here I feel replica of 5 is stable for my cluster, although higher value will incur higher network overhead to pass the replica among workers.
This works but may be improved, like how to perform parallel reduction (sum) on the intermediate results, especially when there are lots of tasks. Please leave me your suggestions.
To be specific I want to parallelize xgboost cross-validation
Please help me design such Dask application. Let's say I have a dask cluster. I want to do a 10-fold cross-validation for xgboost.
Let's say Scheduler needs to keep track of the current state of the job. It launches 10 xgboost tasks on 10 different workers(for each of the folds), with say 10000 iterations for each task maximum.
After each iteration is finished, there is a callback that reports current metric like rmse. So, worker would send that to Scheduler and receive an answer whether to continue or wrap up.
The main scheduler keeps periodically receiving those updates asynchronously. When all workers report a metric at a particular iteration, the scheduler aggregates them (just calculates mean) and pushes it to the current result stack. It also checks whether the result hasn't been improved in the last say 50 iterations, the scheduler tells all workers to wrap up (maybe at the next communication) and report back the result (which is a tree object).
After it gets them all, it returns all trees (and maybe metrics too).
To me it sounds like you're describing something similar to Hyperband, which is currently implemented in Dask-ML. You might want to look at these docs:
https://ml.dask.org/modules/generated/dask_ml.model_selection.HyperbandSearchCV.html?highlight=hyperband
If you want to implement something on your own, some of the pieces for that code may be of use to you as well. Dask-ML lives on Github at https://github.com/dask/dask-ml
Context
Spark provides RDDs for which map functions can be used to lazily set up the operations for processing in parallel. RDD's can be created with a specified partitioning parameter that determines how many partitions to create per RDD, preferably this parameter equals the number of systems (Ex. You have 12 files to process, create an RDD with 3 partitions which splits the data into buckets of 4 each for 4 systems and all the files get processed concurrently in each system). It is my understand that these partitions control the portion of data that goes to each system for processing.
Issue
I need to fine tune and control how many functions run at same time per system. If 2 or more functions run on same GPU at the same time, the system will crash.
Question
If an RDD is not evenly nicely split (like in the example above), how many threads run concurrently on the system?
Example
In:
sample_files = ['one.jpg','free.jpg','two.png','zero.png',
'four.jpg','six.png','seven.png','eight.jpg',
'nine.png','eleven.png','ten.png','ten.png',
'one.jpg','free.jpg','two.png','zero.png',
'four.jpg','six.png','seven.png','eight.jpg',
'nine.png','eleven.png','ten.png','ten.png',
'eleven.png','ten.png']
CLUSTER_SIZE = 3
example_rdd = sc.parallelize(sample_files, CLUSTER_SIZE)
example_partitions = example_rdd.glom().collect()
# Print elements per partition
for i, l in enumerate(example_partitions): print "parition #{} length: {}".format(i, len(l))
# Print partition distribution
print example_partitions
# How many map functions run concurrently when the action is called on this Transformation?
example_rdd.map(lambda s: (s, len(s))
action_results = example_rdd.reduceByKey(add)
Out:
parition #0 length: 8
parition #1 length: 8
parition #2 length: 10
[ ['one.jpg', 'free.jpg', 'two.png', 'zero.png', 'four.jpg', 'six.png', 'seven.png', 'eight.jpg'],
['nine.png', 'eleven.png', 'ten.png', 'ten.png', 'one.jpg', 'free.jpg', 'two.png', 'zero.png'],
['four.jpg', 'six.png', 'seven.png', 'eight.jpg', 'nine.png', 'eleven.png', 'ten.png', 'ten.png', 'eleven.png', 'ten.png'] ]
In Conclusion
What I need to know, is if the RDD is split the way it is, what controls how many threads are processed simultaneously? Is it the number of cores, or is there a global parameter that can be set so it only processes 4 at a time on each partition (system)?
In what order does data get process from RDDs in Spark?
Unless it is some border case, like only one partition, order is arbitrary or nondeterministic. This will depend on the cluster, on the data and on different runtime events.
A number of partitions sets only a limit of overall parallelism for a given stage or in other words it is a minimal unit of parallelism in Spark. No matter how much resources you allocate you a single stage should process more data than at the time. Once again there can be border cases when worker is not accessible and task is rescheduled on another machine.
Another possible limit you can think of is the number of the executor threads. Even if you increase the number of partitions a single executor thread will process only one at the time.
Neither of the above tell you where or when given partition will be processed. While you can use some dirty, inefficient and non-portable tricks at the configuration level (like single worker with a single executor thread per machine) to make sure that only a one partition is processed on a given machine at the time it is not particularly useful in general.
As a rule of thumb I would say that Spark code should never be concerned wit a time an place it is executed. There are some low level aspects of the API which provides means to set partition specific preferences but as far as I know these don't provide hard guarantees.
That being said one can think of at least few ways you can approach this problem:
long running executor threads with configuration level guarantees - it could be acceptable if Spark is responsible only for loading and saving data
singleton objects which control queuing jobs on the GPU
delegating GPU processing to specialized service which ensures proper access
On a side not you may be interested in Large Scale Distributed Deep Learning on Hadoop Clusters which roughly describes an architecture which can be applicable here.