Cost of accessing scattered data in a dask cluster - python

I use dask to parallelize some processing, which is quite a joy.
I have a case where the calculation on the worker side requires some lookup data that is quite heavy to generate, so I scatter this data to the workers:
[future_dict] = client.scatter([large_dict], broadcast=True)
The calculation is then something like
def worker(i):
    key = do_some_work()
    data = future_dict.result()[key]
    res = do_some_more_work(data)
    return (i, res)

f = client.map(worker, range(200))
res = client.gather(f)
This works, but the lookup future_dict.result()[key] is quite slow. The time the lookup takes in the worker is similar to unpickling a pickled version of large_dict, so I guess my dictionary is de-serialized in each worker.
Can I do anything to make access to scattered data faster? E.g., if my hypothesis about the data being de-serialized is correct, can I do something to make the de-serialization happen only once on each worker?

What you're doing should be OK, but if you want to make it faster you could pass the future in as an explicit argument.
def func(i, my_dict=None):
    key = do_some_work()
    data = my_dict[key]
    res = do_some_more_work(data)
    return (i, res)

f = client.map(func, range(200), my_dict=future_dict)
res = client.gather(f)
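For reference, here is a minimal self-contained sketch of that pattern (the lookup data and the work inside the function are illustrative stand-ins, not the original code). The idea, per the answer above, is that passing the scattered future as an argument lets each task receive the already-materialized dictionary instead of resolving the future inside every call:

from dask.distributed import Client

client = Client()  # or your existing cluster client

large_dict = {i: i * 2 for i in range(1_000_000)}  # stand-in for the heavy lookup data
[future_dict] = client.scatter([large_dict], broadcast=True)

def worker(i, my_dict=None):
    key = i % len(my_dict)      # stand-in for do_some_work()
    data = my_dict[key]         # plain dict access, no future.result() call
    return (i, data)            # stand-in for do_some_more_work()

futures = client.map(worker, range(200), my_dict=future_dict)
res = client.gather(futures)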

Related

Pattern for serial-to-parallel-to-serial data processing

I'm working with arrays of datasets, iterating over each dataset to extract information, and using the extracted information to build a new dataset that I then pass to a parallel processing function that might do parallel I/O (requests) on the data.
The return is a new dataset array with new information, which I then have to consolidate with the previous one. The pattern ends up being Loop->parallel->Loop.
parallel_request = []
for item in dataset:
    transform(item)
    subdata = extract(item)
    parallel_request.append(subdata)

new_dataset = parallel_function(parallel_request)

for item in dataset:
    transform(item)
    subdata = extract(item)
    if subdata in new_dataset:
        item[subdata] = new_dataset[subdata]
I'm forced to use two loops: once to build the parallel request, and again to consolidate the parallel results with my old data. Large chunks of these loops end up repeating steps. This pattern is becoming uncomfortably prevalent and repetitive in my code.
Is there some technique to "yield" inside the first loop after adding data to parallel_request and continue on to the next item, then, once parallel_request is filled, execute the parallel function and resume the loop for each item again, restoring the previously saved context (local variables)?
EDIT: I think one solution would be to use a function instead of a loop, and call it recursively. The downside is that I would definitely hit the recursion limit.
parallel_requests = []
final_output = []
index = 0

def process_data(dataset):
    global index, new_data
    my_index = index                 # remember this item's position before recursing
    data = dataset[my_index]
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    index += 1
    parallel_requests.append(subdata)
    # If this is not the last item, recurse;
    # otherwise, call the processing function.
    if index < len(dataset):
        process_data(dataset)
    else:
        new_data = process_requests(parallel_requests)
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[my_index], data, data2, data3)
    final_output.append(final_data)

process_data(original_dataset)
Any solution would involve somehow preserving data, data2, data3, subdata, etc., which would have to be stored somewhere. Recursion uses the stack to store them, which will trigger the recursion limit. Another way would be to store them in some array outside of the loop, which makes the code much more cumbersome. Another solution would be to just recompute them, which would also require code duplication.
So I suspect that to achieve this you'd need some specific Python facility that enables it.
I believe I have solved the issue:
Based on the previous recursive code, you can exploit the generator facilities offered by Python to preserve the serial context when calling the parallel function:
def process_data(index, data, parallel_requests, final_output):
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    parallel_requests.append(subdata)
    yield
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[index], data, data2, data3)
    final_output.append(final_data)

final_output = []
parallel_requests = []
funcs = [process_data(i, datum, parallel_requests, final_output)
         for i, datum in enumerate(dataset)]
[next(f) for f in funcs]                        # run each item up to its yield
new_data = process_requests(parallel_requests)  # read as a global inside the generators
[next(f, None) for f in funcs]                  # resume each item past its yield
The output list and generator calls are general enough that you can abstract these lines away into a helper function that sets everything up and calls the generators for you, leading to a very clean result: the code overhead is one line for the generator definition and one line to call the helper.
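A hedged sketch of one way such a helper could look (the interface is my own choice, not from the original post): it reworks the generator protocol slightly so that each generator yields its request and receives its result back via send, avoiding the module-level new_data, and it assumes process_requests returns results in the same order as the requests were collected:

def run_serial_parallel_serial(dataset, gen_func, parallel_func):
    """Run each item's generator up to its yield, call the parallel function
    once on all collected requests, then resume every generator with its own
    result and collect the generators' return values."""
    gens = [gen_func(datum) for datum in dataset]
    requests = [next(g) for g in gens]        # serial phase 1: collect one request per item
    results = parallel_func(requests)         # single parallel call
    output = []
    for g, result in zip(gens, results):      # serial phase 2: resume with the results
        try:
            g.send(result)
        except StopIteration as stop:
            output.append(stop.value)         # the generator's return value
    return output

def process_datum(data):
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    new_datum = yield subdata                 # hand off the request, get back the result
    return merge(subdata, new_datum, data, data2, data3)

final_output = run_serial_parallel_serial(dataset, process_datum, process_requests)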

Dask: how to parallelize and serialize methods?

I am trying to parallelize methods from a class using Dask on a PBS cluster.
My greatest challenge is that this method should parallelize some computations, then run further parallel computations on the result. Of course, this should be distributed on the cluster to run similar computations on other data...
The cluster is created:
cluster = PBSCluster(cores=4,
                     memory="10GB",
                     interface="ib0",
                     queue=queue,
                     processes=1,
                     nanny=False,
                     walltime="02:00:00",
                     shebang="#!/bin/bash",
                     env_extra=env_extra,
                     python=python_bin)
cluster.scale(8)
client = Client(cluster)
The class I need to distribute has two separate steps which have to be run separately, since step 1 writes a file that is then read at the beginning of step 2.
I have tried the following, putting both steps one after the other in a method:
def computations(params):
    my_class(**params).run_step1(run_path)
    my_class(**params).run_step2()

chain = []
for p in params_compute:
    y = dask.delayed(computations)(p)
    chain.append(y)
dask.compute(*chain)
But it does not work, because the second step tries to read the file immediately.
So I need to find a way to stop the execution after step 1.
I have tried to force the execution of the first step by adding a compute():
def computations(params):
    my_class(**params).run_step1(run_path).compute()
    my_class(**params).run_step2()
But it may not be a good idea, because when running dask.compute(*chain) I'd ultimately be doing compute(compute()), which might explain why the second step is not executed?
What would the best approach be?
Should I include a persist() somewhere at the end of step 1?
For info, step1 and step2 are below:
def run_step1(self, path_step):
    preprocess_result = dask.delayed(self.run_preprocess)(path_step)
    gpu_result = dask.delayed(self.run_gpu)(preprocess_result)
    post_gpu = dask.delayed(self.run_postgpu)(gpu_result)  # writes a result file post_gpu.tif
    return post_gpu

def run_step2(self):
    data_file = rio.open(self.outputdir + "/post_gpu.tif").read()  # opens the file written at the end of step1
    temp_result1 = self.process(data_file)
    final_merge = dask.delayed(self.merging)(temp_result1)
    write = dask.delayed(self.write_final)(final_merge)
    return write
This is only a rough suggestion, as I don't have a reproducible example as a starting point, but the key idea is to pass a delayed object to run_step2 to explicitly link it to run_step1. Note that I'm not sure how essential it is for you to use a class in this case, but for me it's easier to pass the params around as a dict explicitly.
def run_step1(params):
    # params is assumed to be a dict
    # unpack params here if needed (path_step was not explicitly in the
    # `for p in params_compute:` loop, so I assume it can be stored in params)
    preprocess_result = run_preprocess(path_step, params)
    gpu_result = run_gpu(preprocess_result, params)
    post_gpu = run_postgpu(gpu_result, params)  # writes a result file post_gpu.tif
    return post_gpu

def run_step2(post_gpu, params):
    # unpack params here if needed
    data_file = rio.open(outputdir + "/post_gpu.tif").read()  # opens the file written at the end of step1
    temp_result1 = process(data_file, params)
    final_merge = merging(temp_result1, params)
    write = write_final(final_merge, params)
    return write

chain = []
for p in params_compute:
    y = dask.delayed(run_step1)(p)
    z = dask.delayed(run_step2)(y, p)
    chain.append(z)
dask.compute(*chain)
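If step 2 cannot be changed to consume step 1's real output, a related trick is to thread a dummy token through the graph purely to encode the ordering. This is a hedged, generic sketch (the function bodies are placeholders, not the poster's class):

import dask

@dask.delayed
def step1(params):
    # ... run the step-1 pipeline here, writing post_gpu.tif as a side effect ...
    return "step1-done"              # token used only for ordering

@dask.delayed
def step2(token, params):
    # `token` is ignored; it only makes step2 wait for step1 in the graph
    # ... read post_gpu.tif here and run the rest of the pipeline ...
    return "step2-done"

chain = []
for p in params_compute:
    t = step1(p)
    chain.append(step2(t, p))
dask.compute(*chain)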
Sultan's answer almost works, but fails due to an internal misconception in the library I was provided with.
I have used the following workaround, which works for now (I'll use your solution later). I simply create two successive chains and compute them one after the other. Not really elegant, but it works fine...
chain1 = []
for p in params_compute:
    y = run_step1(p)
    chain1.append(y)
dask.compute(chain1)

chain2 = []
for p in params_compute:
    y = run_step2(p)
    chain2.append(y)
dask.compute(chain2)

How should I load a memory-intensive helper object per-worker in dask distributed?

I am currently trying to parse a very large number of text documents using dask + spaCy. spaCy requires that I load a relatively large Language object, and I would like to load this once per worker. I have a couple of mapping functions that I would like to apply to each document, and I would rather not reinitialize this object for each future / function call. What is the best way to handle this?
Example of what I'm talking about:
def text_fields_to_sentences(
    dataframe: pd.DataFrame,
    ...
) -> pd.DataFrame:
    # THIS IS WHAT I WOULD LIKE TO CHANGE
    nlp, = setup_spacy(scispacy_version)

    def field_to_sentences(row):
        result = []
        doc = nlp(row[text_field])
        for sentence_tokens in doc.sents:
            sentence_text = "".join([t.string for t in sentence_tokens])
            r = text_data.copy()
            r[sentence_text_field] = sentence_text
            result.append(r)
        return result

    series = dataframe.apply(
        field_to_sentences,
        axis=1
    ).explode()
    return pd.DataFrame(
        [s[new_col_order].values for s in series],
        columns=new_col_order
    )

input_data.map_partitions(text_fields_to_sentences)
You could create the object as a delayed object
corpus = dask.delayed(make_corpus)("english")
And then use this lazy value in place of the full value:
df = df.text.apply(parse, corpus=corpus)
Dask will call make_corpus once on one machine and then pass it around to the workers as it is needed. It will not recompute any task.
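Applied to the spaCy example above, a hedged sketch could look like this (the model name, column name, and output format are assumptions, not taken from the original code); the delayed nlp object is built once and then shared by the partition tasks:

import dask
import spacy

delayed_nlp = dask.delayed(spacy.load)("en_core_web_sm")  # assumed model name

def text_fields_to_sentences(dataframe, nlp):
    # `nlp` arrives here as the already-loaded Language object
    out = dataframe.copy()
    out["sentences"] = dataframe["text"].apply(
        lambda text: [sent.text for sent in nlp(text).sents]
    )
    return out

# you may need to pass meta=... if dask cannot infer the output columns
result = input_data.map_partitions(text_fields_to_sentences, delayed_nlp)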

How to easily and efficiently store simulation data for numpy ufuncs in OO

In a Jupyter notebook I modeled a resource in an OO fashion, but in the control loop I need to aggregate data over multiple objects, which is inefficient compared to ufuncs and similar operations. I chose OO to package the functionality, but for efficient and concise code I probably have to pull the data out into a storage class (maybe) and push all of the ri.log[0] rows into a 2D array, in this case of shape (2, K).
The class does not need the full log, only the last entries.
K = 100

class Resource:
    def __init__(self):
        self.log = np.random.random((5, K))
        # log gets filled during simulation

r0 = Resource()
r1 = Resource()

# while control loop:
# aggregate control data
for k in range(K):
    total_row_0 = r0.log[0][k] + r1.log[0][k]
    # do sth with the totals and loop again

This would greatly improve performance, but I have difficulty linking the data back to the class if it is stored separately. How would you approach this? pandas DataFrames, numpy views, or shallow copies?
[[...],   # r0
 [...]]   # r1  -- same data in one array: efficient, but mapping back to the class is difficult
Here is my take on it:
import numpy as np

K = 3

class Res:
    logs = 2

    def __init__(self):
        self.log = None

    def set_log(self, view):
        self.log = view

batteries = [Res(), Res()]
# one shared array holding every Res's logs, interleaved row by row
d = {'Res': np.random.random((Res.logs * len(batteries), K))}
for i in range(len(batteries)):
    # rows i, i + len(batteries), ... form this battery's (logs, K) view
    view = d['Res'].view()[i::len(batteries)][:]
    batteries[i].set_log(view)

print(d)
batteries[1].log[1][2] = 1  # test: modifies, through the view, the last entry of the second log of the second Res
print(d)
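With the logs interleaved like this, the per-timestep aggregation loop from the question can be replaced by one vectorized reduction on the shared array (a short sketch based on the layout above):

# rows 0 .. len(batteries)-1 hold log row 0 of every Res,
# so summing over axis 0 gives the per-timestep totals without a Python loop
total_row_0 = d['Res'][:len(batteries), :].sum(axis=0)   # shape (K,)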

Create a set with multiprocessing

I have a big list of items, and some auxiliary data. For each item in the list and each element in data, I compute some thing, and add all the things into an output set (there may be many duplicates). In code:
def process_list(myList, data):
    ret = set()
    for item in myList:
        for foo in data:
            thing = compute(item, foo)
            ret.add(thing)
    return ret

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    what_I_Want = process_list(myList, data)
Because myList is big and compute(item, foo) is costly, I need to use multiprocessing. For now this is what I have:
from multiprocessing import Pool

def initialize_worker(bar):
    global data
    data = bar

def process_item(item):
    ret = set()
    for foo in data:
        thing = compute(item, foo)
        ret.add(thing)
    return ret

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    p = Pool(nb_proc, initializer=initialize_worker, initiargs=(data))
    ret = p.map(process_item, myList)
    what_I_Want = set().union(*ret)
What I do not like about that is that ret can be big. I am thinking about 3 options:
1) Chop myList into chunks and pass them to the workers, who will use process_list on each chunk (hence some duplicates will be removed at that step), and then union all the sets obtained to remove the last duplicates.
Question: Is there an elegant way of doing that? Can we specify to Pool.map that it should pass the chunks to the workers instead of each item in the chunks? I know I could chop the list by myself, but this is damn ugly.
2) Have a shared set between all processes.
Question: Why does multiprocessing.Manager not offer a set()? (I know it has dict(), but still...) If I use a manager.dict(), won't the communication between the processes and the manager slow things down considerably?
3) Have a shared multiprocessing.Queue(). Each worker puts the things it computes into the queue. Another worker does the unioning until some stopItem is found (which we put in the queue after the p.map).
Question: Is this a stupid idea? Is communication between processes and a multiprocessing.Queue faster than with, say, a manager.dict()? Also, how could I get back the set computed by the worker doing the unioning?
A minor thing: the keyword is initargs (not initiargs), and it takes a tuple, so it should be initargs=(data,).
If you want to avoid creating all the results before reducing them into a set, you can use Pool.imap_unordered() with some chunk size. That will produce chunksize results from each worker as they become available.
If you want to change process_item to accept chunks directly, you have to do it manually. toolz.partition_all could be used to partition the initial dataset.
Finally, managed data structures are bound to have much higher synchronization overhead. I'd avoid them as much as possible.
Go with imap_unordered and see if that's good enough; if not, then partition; if you cannot avoid having more than a couple of duplicates in total, use a managed dict.
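A hedged sketch combining these suggestions (the chunk size is arbitrary and toolz is assumed to be installed): each worker reduces its own chunk to a set, and the parent unions the partial sets as imap_unordered delivers them:

from multiprocessing import Pool
from toolz import partition_all   # assumed dependency

def initialize_worker(bar):
    global data
    data = bar

def process_chunk(chunk):
    # reduce one chunk to a set inside the worker, removing local duplicates early
    ret = set()
    for item in chunk:
        for foo in data:
            ret.add(compute(item, foo))
    return ret

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    what_I_Want = set()
    with Pool(nb_proc, initializer=initialize_worker, initargs=(data,)) as p:
        for partial in p.imap_unordered(process_chunk, partition_all(1000, myList)):
            what_I_Want |= partial   # union each partial set as it arrives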
