Problem
I'd like to get the minimum value in a dataframe and apply a function to it.
I also would like to do this lazily. However, doing so appears to introduce a performance cost.
Example
I believe this example captures this behaviour:
import dask
import dask.dataframe as dd
# Sample data
df = dask.datasets.timeseries(end='2002-01-31')
# Sample function
def f(x):
    return 2*x
task = df['id'].min()
f(task.compute()) # Takes ~1.6s on my machine
dask.delayed(f)(task).compute() # Takes ~3.5s on my machine
Why is the second computation taking longer? Can this be improved somehow?
Additional notes
Looking at the dashboard, it appears that making f delayed makes the actual processing of the data slower. That is, the longer time is not caused by f becoming slow through delayed.
If you persist the dataframe beforehand, the time for both tasks is equal (a small sketch of this variant follows these notes).
The effect also appears when the data is read with read_parquet.
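For reference, a small sketch of the persisted variant mentioned above (same setup as the example; persist materialises the partitions in memory before the timing comparison):
import dask
import dask.dataframe as dd

def f(x):
    return 2*x

df = dask.datasets.timeseries(end='2002-01-31').persist()
task = df['id'].min()
f(task.compute())                  # eager call on the computed minimum
dask.delayed(f)(task).compute()    # with persisted data, takes about as long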
Snakeviz
I have tried to visualize the tasks using snakeviz. I am showing only what I think are the important parts:
Without delayed
With delayed
Depending on the size of your input data, it seems to me that this may be related to hashing of the input to the delayed function. This is an additional operation on top of the actual calculation, which can make things slower if you have a lot of input data.
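One rough way to gauge that hypothesis (purely a sketch on my part): dask.base.tokenize is the hashing routine Dask uses to build deterministic task keys, so timing it on a comparable in-memory object gives a feel for how much hashing alone can cost at this data size:
import time
import dask
import dask.dataframe as dd
from dask.base import tokenize

pdf = dask.datasets.timeseries(end='2002-01-31').compute()  # materialise the sample data

start = time.perf_counter()
tokenize(pdf)  # hash the whole dataframe, roughly what Dask does when it needs a key
print(f"tokenize took {time.perf_counter() - start:.2f}s")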
Related
I have a fairly straightforward Dask application (filter a large wavefield) that I am having trouble scaling up. My job function is:
from scipy import signal

def filter_wavefield(pos, butter_filter, wave):
    # Filter one trace (row `pos`) of the wavefield and downcast to float32
    filtered = signal.sosfilt(butter_filter, wave[pos, :]).astype("float32")
    return filtered
After loading the data to dask arrays (no problem scaling up here), I have tried three approaches to submit jobs:
Pre-scatter the data and pass the futures to the last argument of the job function
wave_future = client.scatter(wave_on_slice_channel,broadcast=True)
# I choose to parallelize over the coordinates, so I make several bags of those
coord_list = [i for i in range(nelem*ngll)]
coord_bag = db.from_sequence(coord_list,npartitions=500)
# Submit the tasks
filtered = coord_bag.map(filter_wavefield, butter_filter,wave_future)
filtered_waves = filtered.compute()
This approach works very well for data sizes up to about 0.5 TiB. However, my data size is now 1.35 TiB, and even this roughly three-fold increase breaks things completely. When I run filtered_waves = filtered.compute() with the current dataset, everything freezes (i.e. the dashboard becomes useless) and the kernel simply dies.
Persist the large data
I then did some research on how people deal with >1 TiB data with Dask and came across this page. In the video, the Dask expert persists the data right away. I tried this:
from dask.distributed import wait

wave = wave_on_slice_channel.persist()
wait(wave)  # block until the data is fully in distributed memory
# Same stuff about coordinates
coord_list = [i for i in range(nelem*ngll)]
coord_bag = db.from_sequence(coord_list, npartitions=500)
# Submit the tasks
filtered = coord_bag.map(filter_wavefield, butter_filter, wave)
filtered_waves = filtered.compute()
Note that here the last argument passed to the job function is no longer a future; I suppose it is the actual data already persisted in memory.
I am using 500 cores with a total of > 3.5 TiB memory, so after persisting, my dashboard looks like the following
This is all great, and I have my 1.35 TiB in distributed memory. However, everything still breaks once filtered_waves = filtered.compute() is called. The symptom is the same as before: everything freezes and becomes unresponsive, and shortly afterwards the Python kernel dies.
Using for loop + submit rather than dask bags + map
After persisting the data, I tried to run one instance of the job function:
signal.sosfilt(butter_filter,wave[120,:])
and this runs in a split second, which means that accessing the large wave array is not causing problems. Then I tried the following:
filtered_futures = []
for pos in coord_list:
    filtered_futures.append(client.submit(filter_wavefield, pos, butter_filter, wave))
Once the last line above is run, I do see some response on the dashboard. Most importantly, my job function filter_wavefield is now listed as one of the tasks being processed (the previous two methods never even got the job function into the progress bars), and the number of filter_wavefield tasks increases quickly. However, after it reaches about 2700, everything freezes like before. The full size of the for loop, i.e. the coordinate list, is 27,482,400.
I am fairly confident that Dask is built to handle this specific application, but I am now very lost as to what else I could try. It seems the problem lies with how I am partitioning the tasks, but I thought this is exactly what Dask bag is supposed to handle (see here), yet its failure is what led me to try the other methods.
Any suggestions would be hugely appreciated!
If I have a dataset with unknown divisions and would like to sort it according to a column and output to Parquet, it seems to me that Dask does at least some of the work twice:
import dask
import dask.dataframe as dd
def my_identity(x):
    """Does nothing, but shows up on the Dask dashboard"""
    return x
df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)
df = df.set_index(['name']) # <- `my_identity` is calculated here, as well as other tasks
df.to_parquet('temp.parq') # <- previous tasks seem to be recalculated here
If my_identity was computationally demanding, then recomputing it would be really costly.
Am I correct in my understanding that Dask does some work twice here? Is there any way to prevent that?
The explanation below may not be accurate, but hopefully helps a bit.
Let's try to put ourselves in Dask's shoes here. We are asking Dask to create an index based on some variable. Dask only works with sorted indexes, so it will want to know how to re-arrange the data to make it sorted and also what the appropriate divisions for the partitions will be. The first round of computation you see is doing exactly that, and Dask stores only the parts of the calculation necessary for the divisions/data-reshuffling.
Then, when we ask Dask to save the data, it computes the variables, shuffles the data (in line with the previous computations) and stores it in the corresponding partitions.
How to avoid this? Possible options:
Persist before setting the index (see the sketch after this list). Once you persist, dask will compute the variable and keep it on the workers, so setting the index will refer to the results of that computation (there will still be reshuffling of the data). Note that the documentation suggests persisting after setting the index, but that case assumes that the column already exists (i.e. does not require a separate computation).
Sort within partitions; this can be done lazily, but of course it's only an option if you do not need a global sort.
Use plain pandas; this may necessitate some manual chunking of the data (this is what I tend to use for sorting).
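A minimal sketch of the first option, reusing the example from the question (the only change is the persist call before set_index):
import dask
import dask.dataframe as dd

def my_identity(x):
    """Does nothing, but shows up on the Dask dashboard"""
    return x

df = dask.datasets.timeseries()
df = df.map_partitions(my_identity)
df = df.persist()               # my_identity runs once; results stay in memory
df = df.set_index(['name'])     # reuses the persisted partitions (shuffle still happens)
df.to_parquet('temp.parq')      # my_identity is not recomputed here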
I am using Dask on a single machine (LocalCluster with 4 processes, 16 threads, 68.56GB memory) and am running into worker memory problems when trying to compute two results at once which share a dependency.
In the example shown below, computing result with just one computation runs fine and quickly, with workers' combined memory usage maxing out at around 1GB. However, when computing results with two computations the workers quickly use all of their memory and start to write to disk when total memory usage is around 40GB. The computation will eventually finish, but there is a massive slowdown as would be expected once it starts writing to disk.
Intuitively, if one chunk is read in and then its two sums are immediately computed, then the chunk can be discarded and memory usage stays low. However, it appears that Dask is prioritizing the loading of the data instead of the later aggregate computations which clear up memory.
Any help understanding what's going on here would be greatly appreciated. How can I compute two results with a common dependency without needing to read the underlying data twice or read it fully into memory?
import dask
import dask.dataframe as dd
import dask.array as da
from dask.distributed import Client
client = Client("localhost:8786")
array = da.random.normal(size=(int(1e9), 10), chunks=(int(1e6), 10))
df = dd.from_array(array, columns=[str(i) for i in range(10)])
# does not blow up worker memory, overall usage stays below 1GB total
result = dask.compute(df["0"].sum())
# does blow up worker memory
results = dask.compute([df["0"].sum(), df["1"].sum()])
The way the array is constructed, every time a chunk is created it has to generate every column of the array. So one opportunity for optimization (if possible) is to generate/load the array in a way that allows for column-wise processing. This will reduce the memory load of a single task.
Another avenue for optimization is to explicitly specify the common dependencies; for example, dask.compute(df[['0', '1']].sum()) will run efficiently.
However, the more important point is that by default dask follows some rules of thumb for how to prioritize work, see here. You have several options to intervene (not sure if this list is exhaustive): custom priorities, resource constraints, or modifying the compute graph (to allow workers to release memory from intermediate tasks without waiting for the final task to complete). A sketch of the custom-priorities option follows.
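For illustration only, a sketch of the custom-priorities route using dask.annotate (an assumption on my part that this is the most convenient knob here; note that in some Dask versions graph optimization can drop annotations, in which case compute may need optimize_graph=False):
import dask

# Annotate only the layers created inside the context (the reductions) with a
# higher priority, so the scheduler prefers finishing sums that free memory.
with dask.annotate(priority=10):
    s0 = df["0"].sum()
    s1 = df["1"].sum()

results = dask.compute(s0, s1)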
A simple way to modify the graph is to break down the dependency between the final sum figure and all the intermediate tasks by computing intermediate sums manually:
[results] = dask.compute([df["0"].map_partitions(sum), df["1"].map_partitions(sum)])
Note that results will be a list of two sublists, but it's trivial to calculate the sum of each sublist (trying to run sum on a delayed object would trigger computation, so it's more efficient to run sum after results are computed); see the snippet below.
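As a hypothetical follow-up to the snippet above, the final figures are then plain Python sums over the per-partition results:
per_partition_0, per_partition_1 = results
total_0 = sum(per_partition_0)   # final sum for column "0"
total_1 = sum(per_partition_1)   # final sum for column "1"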
While migrating some code from Pandas to Dask I found an enormous performance difference between modifying a Dask dataframe by calling DataFrame.assign() with multiple columns vs modifying it with multiple DataFrame.__setitem__() (aka dataframe[x]=y) calls.
With imports
import pandas, dask.dataframe, cProfile  # dask.dataframe needs an explicit import for from_pandas below
For a Dask dataframe defined as:
dd = dask.dataframe.from_pandas(pandas.DataFrame({'a':[1]}), npartitions=1)
cProfile.run('for i in range(100): dd["c"+str(i)]=dd["a"]+i')
takes 1.436 seconds
while
cProfile.run('dd.assign(**{"c"+str(i):dd["a"]+i for i in range(100)})')
only takes 0.211 seconds. A 6.8X difference.
I have already tried looking at the stats with pyprof2calltree but couldn't make sense of it.
What explains this difference? And more importantly, is there any way to get the assign performance without having to refactor code that is using dd[x]=y repeatedly?
This may not matter or happen for large datasets (I haven't checked), but it does for a single row (why I care about Dask being fast for single rows is a separate topic).
For context, there is a difference in Pandas too but it is a lot smaller:
df = pandas.DataFrame({'a':[1]})
cProfile.run('for i in range (100): df["c"+str(i)]=df["a"]+i')
takes 0.116 seconds.
cProfile.run('df.assign(**{"c"+str(i):df["a"]+i for i in range(100)})')
takes 0.096 seconds. Just 1.2X.
Two main reasons:
The for loop generates a larger task graph (one new layer per item in the loop), compared to the single additional task from the assign.
DataFrame.__setitem__ is actually implemented in terms of assign: https://github.com/dask/dask/blob/366c7998398bc778c4aa5f4b6bb22c25b584fbc1/dask/dataframe/core.py#L3424-L3432, so you end up calling the same code, just many more times. Each assign is associated with a copy in pandas.
I have already tried looking at the stats with pyprof2calltree but couldn't make sense of it.
Profilers like this (built on cProfile) aren't well suited for profiling parallel code like Dask.
I am just learning to use Dask and have read many threads on this forum related to Dask and for loops, but I am still unclear how to apply those solutions to my problem. I am working with climate data that are functions of (time, depth, location). The 'location' coordinate is a linear index such that each value corresponds to a unique (longitude, latitude). I am showing below a basic skeleton of what I am trying to do, assuming var1 and var2 are two input variables. I want to parallelize over the location parameter 'nxy', as my calculations can proceed simultaneously at different locations.
for loc in range(0, nxy):    # nxy = total no. of locations
    for it in range(0, ntimes):
        out1 = expression1 involving ( var1(loc), var2(it,loc) )
        out2 = expression2 involving ( var1(loc), var2(it,loc) )
        # <a dozen more output variables>
My questions:
(i) Many examples illustrating the use of 'delayed' show something like "delayed(function)(arg)". In my case, I don't have too many (if any) functions, but lots of expressions. If 'delayed' only operates at the level of functions, should I convert each expression into a function and add a 'delayed' in front?
(ii) Should I wrap the entire for loop shown above inside a function and then call that function using 'delayed'? I tried doing something like this, but I may not be doing it correctly, as I did not get any speed-up compared to not using dask. Here's what I did:
from dask import delayed

def test_dask(n):
    for loc in range(0, n):
        ...  # same code as before
    return var1  # just returning one variable for now

var1 = delayed(test_dask)(nxy)
var1.compute()
Thanks for your help.
Every delayed task adds about 1ms of overhead. So if your expression is slow (maybe you're calling out to some other expensive function), then yes dask.delayed might be a good fit. If not, then you should probably look elsewhere.
In particular, it looks like you're just iterating through a couple of arrays and operating element by element. Please be warned that Python is very slow at this. You might want to not use Dask at all, but instead try one of the following approaches:
Find some clever way to rewrite your computation with Numpy expressions (a small sketch follows this list)
Use Numba
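To make the first suggestion concrete, here is a hedged sketch with placeholder shapes and stand-in operations (var1, var2 and the expression names come from the question; your actual maths would replace the toy operations below). Broadcasting removes both Python loops:
import numpy as np

nxy, ntimes = 1000, 200                 # hypothetical sizes
var1 = np.random.rand(nxy)              # shape (nxy,)
var2 = np.random.rand(ntimes, nxy)      # shape (ntimes, nxy)

# Each output covers all (time, location) pairs at once, shape (ntimes, nxy).
out1 = var1[np.newaxis, :] * var2       # stands in for "expression1"
out2 = var1[np.newaxis, :] + var2       # stands in for "expression2"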
Also, given the terms you're using, like lat/lon/depth, it may be that Xarray is a good project for you.