Dask application has trouble scaling up, any suggestions? - python

I have a fairly straightforward Dask application (filter a large wavefield) that I am having trouble scaling up. My job function is:
from scipy import signal

def filter_wavefield(pos, butter_filter, wave):
    filtered = signal.sosfilt(butter_filter, wave[pos, :]).astype("float32")
    return filtered
After loading the data to dask arrays (no problem scaling up here), I have tried three approaches to submit jobs:
Pre-scatter the data and pass the futures to the last argument of the job function
wave_future = client.scatter(wave_on_slice_channel, broadcast=True)
# I choose to parallelize over the coordinates, so I make one bag of them, split into partitions
coord_list = [i for i in range(nelem * ngll)]
coord_bag = db.from_sequence(coord_list, npartitions=500)
# Submit the tasks
filtered = coord_bag.map(filter_wavefield, butter_filter, wave_future)
filtered_waves = filtered.compute()
This approach works very well for data sizes up to about 0.5 TiB. However, my data size is now 1.35 TiB, and even this roughly 3-fold increase breaks things completely. When I run filtered_waves = filtered.compute() with the current dataset, everything freezes (i.e. the dashboard becomes useless) and the kernel simply dies.
Persist the large data
I then did some research on how people deal with > 1 TiB data in Dask and came across this page. In the video, the Dask expert persists the data right away. I tried this:
wave = wave_on_slice_channel.persist()
wait(wave)
# Same stuff about coordinates
coord_list = [i for i in range(nelem * ngll)]
coord_bag = db.from_sequence(coord_list, npartitions=500)
# Submit the tasks
filtered = coord_bag.map(filter_wavefield, butter_filter, wave)
filtered_waves = filtered.compute()
Note that here the last argument passed to the job function is no longer a future, but (I assume) the actual data already persisted in memory.
I am using 500 cores with a total of > 3.5 TiB memory, so after persisting, my dashboard looks like the following
This is all great, and I have my 1.35 TiB in distributed memory. However, everything still breaks once filtered_waves = filtered.compute() is called. The symptoms are the same as before: everything freezes and becomes unresponsive, and shortly after, the Python kernel dies.
Using for loop + submit rather than dask bags + map
After persisting the data, I tried to run one instance of the job function:
signal.sosfilt(butter_filter, wave[120, :])
and this runs in a split second, which means that accessing the large array wave is not the problem. Then I tried the following:
filtered_futures = []
for pos in coord_list:
    filtered_futures.append(client.submit(filter_wavefield, pos, butter_filter, wave))
Once the last line above is run, I do see some response on the dashboard. Most importantly, my job function filter_wavefield is now listed as one of the tasks being processed (the previous two methods wouldn't even get the job function into the progress bars), and the number of filter_wavefield tasks increases quickly. However, after it reaches about 2700, everything freezes as before. The full size of the for loop, i.e. the coordinate list, is 27,482,400.
I am fairly confident that Dask is built to handle exactly this kind of application, but I am now at a loss as to what else to try. The problem seems to lie in how I am partitioning the tasks, but I thought this is exactly what Dask Bag is supposed to handle (see here), and its failure is what led me to try the other methods.
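For reference, a minimal sketch of what batching the coordinates could look like, so that each task filters a contiguous block of rows instead of a single one (it reuses the names above, the block size is a guess, and it assumes wave is addressable as an in-memory array on the workers):
import numpy as np
from scipy import signal

block = 10_000                                  # rows per task, a guess
starts = list(range(0, nelem * ngll, block))    # ~2,750 tasks instead of ~27 million

def filter_block(start, butter_filter, wave):
    stop = min(start + block, nelem * ngll)
    # sosfilt works along the last axis, so this filters every row in the block at once
    return signal.sosfilt(butter_filter, wave[start:stop, :]).astype("float32")

block_futures = client.map(filter_block, starts, butter_filter=butter_filter, wave=wave_future)
filtered_blocks = client.gather(block_futures)
filtered_waves = np.concatenate(filtered_blocks, axis=0)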
Any suggestions would be hugely appreciated!

Related

More optimized way to do itertools.combinations

I'm trying to find unique combinations of ~70,000 IDs.
I'm currently doing an itertools.combinations([list name], 2) to get unique 2 ID combinations but it's been running for more than 800 minutes.
Is there a faster way to do this?
I tried converting the IDs into a matrix where the IDs are both the index and the columns and populating the matrix using itertools.product.
I tried doing it the manual way with loops too.
But after more than a full day of letting them run, none of my methods have actually finished running.
For additional information, I'm storing these in a data frame, to later run a function that compares each unique pair of IDs.
(70_000 * 69_999) / 2 ≈ 2.45 billion - that is not such a large number as to be uncomputable in a few hours (update: I ran a dry run of itertools.combinations(range(70000), 2) and it took less than 70 seconds, on a 2017-era i7 @ 3 GHz, naively using a single core). But if you are trying to keep all of this data in memory at once, it won't fit - and if your system is configured to swap memory to disk before raising a MemoryError, this may slow the program down by 2 or more orders of magnitude, and that is where your problem comes from.
itertools.combinations does the right thing in this respect, and there is no need to replace it with something else: it will yield one combination at a time. What you do with the results, however, does change things: if you stream the combinations to a file and do not keep them in memory, you should be fine, and then it is just computational time, which you can't speed up anyway.
If, on the other hand, you are collecting the combinations into a list or another data structure: there is your problem - don't do it.
Now, going a step further than your question: since these combinations are checkable and predictable, maybe generating them at all is not the right approach - you don't give details on how they are to be used, but if they are consumed in a reactive or lazy fashion, you might get an effectively instantaneous workflow instead.
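A minimal sketch of the streaming approach described above, assuming the ~70,000 IDs live in a list called ids (a placeholder name, not from the question):
import csv
import itertools

ids = list(range(70_000))   # placeholder; use the real ID list here

with open("id_pairs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for a, b in itertools.combinations(ids, 2):
        # Each pair goes straight to disk; nothing accumulates in memory.
        writer.writerow((a, b))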
Your RAM will fill up. You can counter this with gc.collect() or by emptying the results, but the results found so far have to be saved in between.
You could try something similar to the code below. I would create individual file names or save the results into a database, since the result file will be several GB in size. Additionally, the range of the second loop can probably be halved.
import gc

# Build the set of IDs (here just the integers 0..69999 as a stand-in)
new_set = set()
for i in range(70000):
    new_set.add(i)
print(new_set)

combined_set = set()
for i in range(len(new_set)):
    print(i)
    if i % 300 == 0:
        # Flush the pairs collected so far to disk and free the memory
        with open("results", "a") as f:
            f.write(str(combined_set))
        combined_set = set()
        gc.collect()
    for b in range(len(new_set)):
        combined_set.add((i, b))

For Python Dask, what is the difference between persist and scatter?

I have been reading the docs and searching online a bit, but I am still confused about the difference between persist and scatter.
I have been working with datasets about half a TB in size and have been using scatter to generate futures and send them to workers. This has been working fine. But recently I started scaling up and am now dealing with datasets a few TB in size, and this method stops working. On the dashboard, I see that workers are not being triggered, and I am quite certain this is a scheduler issue.
I saw this video by Matt Rocklin. When he deals with a large dataset, the first thing he does is persist it to (distributed) memory. I will give this a try with my large datasets, but in the meantime I am wondering: what is the difference between persist and scatter? Which situations is each best suited for? Do I still need to scatter after I persist?
Thanks.
First, persist: imagine you have table A, which is used to make table B, and then you use B to generate two tables, C and D. You have two chains of lineage, A->B->C and A->B->D. Without persist, the A->B step can be computed twice, once to generate C and again for D, because of Dask's lazy evaluation. Persisting B keeps its computed result in (distributed) memory, so both C and D can reuse it.
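As a minimal sketch of that point with dask.array (the array names stand in for the tables above):
import dask.array as da

a = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))   # "A"
b = (a + 1).mean(axis=0)                                        # "B", still lazy

# Without persist, each compute() re-runs the whole A -> B chain:
c = (b * 2).compute()
d = (b - 3).compute()

# Persisting B keeps its chunks in (distributed) memory, so A -> B runs only once:
b = b.persist()
c = (b * 2).compute()
d = (b - 3).compute()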
Scatter is also called broadcast in other distributed frameworks. Basically, you have a sizeable object that you want to send to the workers ahead of time to minimize the transfer. Think of something like a machine learning model: you can scatter it ahead of time so it's available on all workers.
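And a minimal sketch of scatter for that use case (train_model, score and inputs are hypothetical placeholders):
from dask.distributed import Client

client = Client()                                      # or connect to an existing cluster

model = train_model()                                  # large object built on the client side
model_future = client.scatter(model, broadcast=True)   # push one copy to every worker

# Tasks reference the future, so the model is not re-serialised with every task.
futures = client.map(score, inputs, model=model_future)
results = client.gather(futures)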

Distributed chained computing with Dask on a high failure-rate cluster?

I am using Dask Bag to run some simple map-reduce computation on a special cluster:
import dask.bag as bag
summed_image = bag.from_sequence(my_ids).map(gen_image_from_ids).reduction(sum, sum).compute()
This code sets up a chained computation: it maps gen_image_from_ids over the sequence from from_sequence, and then reduces all the results into one using sum. Thanks to Dask Bag's reduction, the summation is done in parallel in a multi-level tree.
My special cluster setup has a higher failure rate because my workers can be killed at any time, with the CPU taken over by other higher-priority processes and then released after a while. A kill may hit only a single node roughly once every 5 minutes, but my total reduction job may take more than 5 minutes.
Although Dask is good at failure recovery, my job sometimes just never ends. If any internal node in the job tree gets killed, the temporary intermediate results from all previous computations are lost, and the computation has to restart from the beginning.
There is replicate for Dask Future objects, but I could not find a similar feature for the higher-level Dask Bag or DataFrame to ensure data resiliency. Please let me know if there is a common way to keep intermediate results safe in a Dask cluster with a very high failure rate.
Update - My workaround
Maybe any distributed computing system will suffer from frequent failures, even if it can recover from them. In my case the worker shutdown is not really a system failure but is triggered by the higher-priority process. So instead of directly killing my workers, that process now launches a small Python script that sends a retire_workers() command when it starts running.
As documented, with retire_workers() the scheduler will move data from the retiring worker to another available one. So my problem is solved for now. However, I still leave the question open, since I think replicated, redundant computation would be a faster solution and would make better use of idle nodes in the cluster.
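For reference, a minimal sketch of such a drain script (the addresses are placeholders; in practice they would come from the cluster configuration or environment variables):
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")   # connect to the running scheduler

# Ask the scheduler to retire this node's worker gracefully; its in-memory
# results are moved to other workers before it shuts down.
client.retire_workers(workers=["tcp://worker-to-drain:40395"], close_workers=True)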
This might not be the solution you are looking for, but one option is to divide the task sequence into batches small enough to ensure that each one completes in time (or is quick to redo from scratch).
Something like this perhaps:
import dask.bag as db
from toolz import partition_all

n_per_chunk = 100  # just a guess, the best number depends on the case
tasks = list(partition_all(n_per_chunk, my_ids))

results = []
for t in tasks:
    summed_image = (
        db
        .from_sequence(t)          # process only this chunk of ids
        .map(gen_image_from_ids)
        .reduction(sum, sum)
        .compute()
    )
    results.append(summed_image)

summed_image = sum(results)  # final result
There are other things to keep in mind here regarding re-starting the workflow on failure (or potentially launching smaller tasks in parallel), but hopefully this gives you a starting point for a workable solution.
Update: after more trials, this answer is not ideal because the client.replicate() command is blocking. I suspect it requires all futures to be done before making replicas -- this is unwanted because 1. any intermediate node can disconnect before all are ready, and 2. it prevents other tasks from running asynchronously. I need another way to make replicas.
After lots of trials, I found one way to replicate the intermediate results during the chained computation and achieve data redundancy. Note that the parallel reduction function is a Dask Bag feature, which does not directly support replication. However, as the Dask documentation states, one can replicate low-level Dask Future objects to improve resiliency.
Following @SultanOrazbayev's post to manually perform partial sums, but using persist() to keep the partial sums in cluster memory as suggested in the comments, the returned item is essentially backed by Dask Futures:
import dask.bag as db
from dask.distributed import futures_of
from toolz import partition_all

n_per_chunk = 100  # just a guess, the best number depends on the case
tasks = list(partition_all(n_per_chunk, my_ids))

bags = []
for t in tasks:
    summed_image = (
        db
        .from_sequence(t)          # process only this chunk of ids
        .map(gen_image_from_ids)
        .reduction(sum, sum)
        .persist()
    )
    bags.append(summed_image)

futures = futures_of(bags)  # This can only be called on the .persist() result
I can then replicate these remote intermediate partial sums and feel safer summing the futures to get the final result:
client.replicate(futures, 5) # Improve resiliency by replicating to 5 workers
summed_image = client.submit(sum, futures).result() # The only line that blocks for the final result
Here I feel a replication factor of 5 is stable for my cluster, although a higher value incurs more network overhead to pass the replicas among workers.
This works but could be improved, for example by performing the reduction (sum) over the intermediate results in parallel, especially when there are lots of tasks. Please leave me your suggestions.
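On the open point about a parallel reduction: a minimal sketch of a pairwise tree sum over the replicated futures (assuming client and futures from the snippets above) could look like this:
from operator import add

def tree_sum(client, futures):
    # Sum futures pairwise, level by level, so partial sums run in parallel
    # and no single task has to hold all intermediate results at once.
    while len(futures) > 1:
        next_level = [
            client.submit(add, futures[i], futures[i + 1])
            for i in range(0, len(futures) - 1, 2)
        ]
        if len(futures) % 2:                 # odd future carries over unchanged
            next_level.append(futures[-1])
        futures = next_level
    return futures[0]

summed_image = tree_sum(client, futures).result()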

How to parallelize a for loop in python/pyspark (to potentially be run across multiple nodes on Amazon servers)?

Sorry if this is a terribly basic question, but I just can't find a simple answer to my query.
I have some computationally intensive code that's embarrassingly parallelizable. The pseudocode looks like this.
n = 500
rounds = 200
data = [d_1, ..., d_n]
values = [0 for _ in range(n)]

for _ in range(rounds):
    for i in range(n):  # Inner Loop
        values[i] = compute_stuff(data[i])
    data = special_function(values)
Each iteration of the inner loop takes 30 seconds, but they are completely independent. So I want to run the n=500 iterations in parallel by splitting the computation across 500 separate nodes running on Amazon, cutting the run-time for the inner loop down to ~30 secs. How do I do this?
I'm assuming that PySpark is the standard framework one would use for this, and Amazon EMR is the relevant service that would enable me to run this across many nodes in parallel. So my question is: how should I augment the above code to be run on 500 parallel nodes on Amazon Servers using the PySpark framework? Or else, is there a different framework and/or Amazon service that I should be using to accomplish this?
Here are some details about the pseudocode. Each data entry d_i is a custom object, though it could be converted to (and restored from) two arrays of numbers, A and B, if necessary. The return value of compute_stuff (and hence each entry of values) is also a custom object. Again, this custom object can be converted to (and restored from) a dictionary of lists of numbers. Also, compute_stuff requires the use of PyTorch and NumPy. Finally, special_function isn't some simple thing like addition, so I don't think it can really be used as the "reduce" part of vanilla map-reduce.
Any help is appreciated!
Based on your description I wouldn't use PySpark. To process your data with PySpark you would have to rewrite your code completely (just to name a few things: use of RDDs, use of Spark functions instead of Python functions).
I think it is much easier (in your case!) to use something like the wonderful pymp. You don't have to modify your code much:
# still pseudocode
import pymp

n = 500
rounds = 200
data = [d_1, ..., d_n]
values = pymp.shared.list()

for _ in range(rounds):
    with pymp.Parallel(n) as p:
        for i in p.range(n):
            values.append(compute_stuff(data[i]))
    data = special_function(values)
In case the order of your values list is important, you can use p.thread_num + i to calculate distinct indices.
Pymp lets you use all cores of your machine. If you want to use several AWS machines, you should have a look at Slurm.
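A minimal sketch of one alternative way to keep the order stable (not the p.thread_num arithmetic above, and assuming data and compute_stuff as in the question): pre-size the shared list and write results by index.
import pymp

n = 500
values = pymp.shared.list()
for _ in range(n):
    values.append(None)          # pre-size so each slot can be written in place

with pymp.Parallel(4) as p:      # 4 is a placeholder worker count
    for i in p.range(n):
        values[i] = compute_stuff(data[i])   # values[i] stays aligned with data[i]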

Dask: why has CPU usage suddenly dropped?

I'm doing some Monte Carlo for a model and figured that Dask could be quite useful for this purpose. For the first 35 hours or so, things were running quite "smoothly" (apart from the fan noise giving a sense that the computer was taking off). Each model run would take about 2 seconds and there were 8 partitions running it in parallel. Activity monitor was showing 8 python3.6 instances.
However, the computer has become "silent" and CPU usage (as displayed in Spyder) hardly exceeds 20%. Model runs are happening sequentially (not in parallel) and taking about 4 seconds each. This happened today at some point while I was working on other things. I understand that, depending on the sequence of actions, Dask won't use all cores at the same time. However, in this case there is really just one task to be performed (see further below), so one could expect all partitions to run and finish more or less simultaneously. Edit: the whole setup has run successfully for 10,000 simulations in the past; the difference now is that there are nearly 500,000 simulations to run.
Edit 2: now it has shifted to doing 2 partitions in parallel (instead of the previous 1 and original 8). It appears that something is making it change how many partitions are simultaneously processed.
Edit 3: Following recommendations, I have used a dask.distributed.Client to track what is happening and ran it for the first 400 rows. An illustration of what it looks like after completing is included below. I am struggling to understand the x-axis labels; hovering over the rectangles shows about 143 s.
Some questions therefore are:
Is there any relationship between running other software (Chrome, MS Word) and having the computer "take back" some CPU from python?
Or instead, could it be related to the fact that at some point I ran a second Spyder instance?
Or even, could the computer have somehow run out of memory? But then wouldn't the command have stopped running?
... any other possible explanation?
Is it possible to "tell" Dask to keep up the hard work and go back to using all CPU power while it is still running the original command?
Is it possible to interrupt an execution and keep whichever calculations have already been performed? I have noticed that stopping the current command doesn't seem to do much.
Is it possible to inquire on the overall progress of the computation while it is running? I would like to know how many model runs are left to have an idea of how long it would take to complete in this slow pace. I have tried using the ProgressBar in the past but it hangs on 0% until a few seconds before the end of the computations.
To be clear, uploading the model and the necessary data would be very complex. I haven't created a reproducible example either, out of fear of making the issue worse (for now the model is still running, at least...) and because - as you can probably tell by now - I have very little idea of what could be causing it, and I am not expecting anyone to be able to reproduce it. I'm aware this is not best practice and apologise in advance. However, I would really appreciate some thoughts on what could be going on and possible ways to go about it, if anyone has been through something similar before and/or has experience with Dask.
Running:
- macOS 10.13.6 (Memory: 16 GB | Processor: 2.5 GHz Intel Core i7 | 4 cores)
- Spyder 3.3.1
- dask 0.19.2
- pandas 0.23.4
Please let me know if anything needs to be made clearer
If you believe it can be relevant, the main idea of the script is:
# Create a pandas DataFrame where each column is a parameter and each row is a possible
# parameter combination (cartesian product). At the end of each row, some columns to store
# the respective values of some objective functions are pre-allocated too.
# Generate a dask dataframe that is the DataFrame above, split into 8 partitions.
# Define a function that takes a partition and, for each row:
#     - runs the model with the coefficient values defined in the row
#     - retrieves the values of the objective functions
#     - assigns these values to the respective columns of the current row in the partition
#       (columns have been pre-allocated)
#   and then returns the partition with the objective-function columns populated with the
#   calculated values.
# map_partitions() this function over the dask dataframe.
Any thoughts?
This shows how simple the script is:
The dashboard:
Update: The approach I took was to:
Set a large number of partitions (npartitions=nCores*200). This made it much easier to visualise the progress. I'm not sure if setting so many partitions is good practice but it worked without much of a slowdown.
Instead of trying to get a single huge pandas DataFrame at the end with .compute(), I had the dask dataframe written to Parquet (this way each partition was written to a separate file). Later, reading all the files into a dask dataframe and computing it to a pandas DataFrame wasn't difficult, and if something went wrong in the middle, at least I wouldn't lose the partitions that had been successfully processed and written (a sketch of this pattern follows below).
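A minimal sketch of that pattern (ddf stands for the dask dataframe from the outline above; the path and core count are placeholders):
import dask.dataframe as dd

n_cores = 8                                        # placeholder, as in the original 8 partitions
ddf = ...                                          # the dask dataframe built as outlined above
ddf = ddf.repartition(npartitions=n_cores * 200)   # many small partitions
ddf.to_parquet("results_parquet/")                 # each partition is written to its own file

# Later (or after a crash), read back whatever finished and compute the final DataFrame:
results = dd.read_parquet("results_parquet/").compute()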
This is what it looked like at a given point:
Dask has many diagnostic tools to help you understand what is going on inside your computation. See http://docs.dask.org/en/latest/understanding-performance.html
In particular I recommend using the distributed scheduler locally and watching the Dask dashboard to get a sense of what is going on in your computation. See http://docs.dask.org/en/latest/diagnostics-distributed.html#dashboard
This is a webpage that you can visit that will tell you exactly what is going on in all of your processors.
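For instance, a minimal way to get the dashboard with a local cluster:
from dask.distributed import Client

client = Client()                 # start a local distributed scheduler and workers
print(client.dashboard_link)      # open this URL to watch tasks, memory and CPU per worker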
