Dask: how to parallelize and serialize methods?

I am trying to parallelize methods from a class using Dask on a PBS cluster.
My greatest challenge is that this method should parallelize some computations, then run further parallel computations on the result. Of course, this should be distributed on the cluster to run similar computations on other data...
The cluster is created:
from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(cores=4,
                     memory="10GB",
                     interface="ib0",
                     queue=queue,
                     processes=1,
                     nanny=False,
                     walltime="02:00:00",
                     shebang="#!/bin/bash",
                     env_extra=env_extra,
                     python=python_bin
                     )
cluster.scale(8)
client = Client(cluster)
The class I need to distribute has two separate steps which have to be run sequentially, since step1 writes a file that is then read at the beginning of step2.
I have tried the following, putting both steps one after the other in a method:
def computations(params):
    my_class(**params).run_step1(run_path)
    my_class(**params).run_step2()

chain = []
for p in params_compute:
    y = dask.delayed(computations)(p)
    chain.append(y)
dask.compute(*chain)
But it does not work, because the second step tries to read the file immediately: nothing inside computations ever computes the delayed graph built by run_step1, so the file does not exist yet when run_step2 opens it.
So I need to find a way to make sure step1 has fully completed before step2 starts.
I have tried to force the execution of the first step by adding a compute():
def computations(params):
    my_class(**params).run_step1(run_path).compute()
    my_class(**params).run_step2()
But it may not be a good idea, because when running dask.compute(*chain) I would ultimately be doing compute(compute())... which might explain why the second step is not executed?
What would the best approach be?
Should I include a persist() somewhere at the end of step1?
For info, step1 and step2 are below:
def run_step1(self, path_step):
    preprocess_result = dask.delayed(self.run_preprocess)(path_step)
    gpu_result = dask.delayed(self.run_gpu)(preprocess_result)
    post_gpu = dask.delayed(self.run_postgpu)(gpu_result)  # writes a result file post_gpu.tif
    return post_gpu

def run_step2(self):
    data_file = rio.open(self.outputdir + "/post_gpu.tif").read()  # opens the file written at the end of step1
    temp_result1 = self.process(data_file)
    final_merge = dask.delayed(self.merging)(temp_result1)
    write = dask.delayed(self.write_final)(final_merge)
    return write

This is only a rough suggestion, as I don't have a reproducible example as a starting point, but the key idea is to pass a delayed object to run_step2 to explicitly link it to run_step1. Note that I'm not sure how essential it is for you to use a class in this case, but for me it's easier to pass the params around explicitly as a dict.
def run_step1(params):
    # params is assumed to be a dict
    # unpack params here if needed (path_step was not explicitly in the
    # `for p in params_compute:` loop, so I assume it can be stored in params)
    preprocess_result = run_preprocess(path_step, params)
    gpu_result = run_gpu(preprocess_result, params)
    post_gpu = run_postgpu(gpu_result, params)  # writes a result file post_gpu.tif
    return post_gpu

def run_step2(post_gpu, params):
    # unpack params here if needed
    data_file = rio.open(outputdir + "/post_gpu.tif").read()  # opens the file written at the end of step1
    temp_result1 = process(data_file, params)
    final_merge = merging(temp_result1, params)
    write = write_final(final_merge, params)
    return write

chain = []
for p in params_compute:
    y = dask.delayed(run_step1)(p)
    z = dask.delayed(run_step2)(y, p)
    chain.append(z)
dask.compute(*chain)
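Passing y (the delayed result of run_step1) as an argument to the delayed run_step2 call is what actually links the two steps in the task graph: even though post_gpu is never used inside run_step2 itself, Dask will only schedule run_step2 once run_step1 has finished, i.e. once post_gpu.tif has been written.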

Sultan's answer almost works, but fails due to an internal issue in the library I was provided with.
I have used the following workaround, which works for now (I'll switch to that solution later): I simply create two successive chains and compute them one after the other. The first dask.compute acts as a barrier, so every step1 has finished (and written its file) before any step2 starts. Not really elegant, but it works fine...
chain1 = []
for p in params_compute:
    y = dask.delayed(run_step1)(p)
    chain1.append(y)
dask.compute(*chain1)

chain2 = []
for p in params_compute:
    y = dask.delayed(run_step2)(p)
    chain2.append(y)
dask.compute(*chain2)

Related

Pattern for serial-to-parallel-to-serial data processing

I'm working with arrays of datasets, iterating over each dataset to extract information, and using the extracted information to build a new dataset that I then pass to a parallel processing function that might do parallel I/O (requests) on the data.
The return is a new dataset array with new information, which I then have to consolidate with the previous one. The pattern ends up being Loop->parallel->Loop.
parallel_request = []
for item in dataset:
    transform(item)
    subdata = extract(item)
    parallel_request.append(subdata)

new_dataset = parallel_function(parallel_request)

for item in dataset:
    transform(item)
    subdata = extract(item)
    if subdata in new_dataset:
        item[subdata] = new_dataset[subdata]
I'm forced to use two loops: once to build the parallel request, and then again to consolidate the parallel results with my old data. Large chunks of these loops end up repeating steps. This pattern is becoming uncomfortably prevalent and repetitive in my code.
Is there some technique to "yield" inside the first loop after adding data to parallel_request and continue on to the next item, then, once parallel_request is filled, execute the parallel function and resume the loop for each item again, restoring the previously saved context (local variables)?
EDIT: I think one solution would be to use a function instead of a loop, and call it recursively. The downside being that I would definitely hit the recursion limit.
parallel_requests = []
final_output = []
index = 0

def process_data(dataset, last=False):
    global index, new_data
    data = dataset[index]
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    index += 1
    parallel_requests.append(subdata)
    # If not last, recurse;
    # otherwise, call the processing function.
    if not last:
        process_data(dataset, index == len(dataset))
    else:
        new_data = process_requests(parallel_requests)
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[index], data, data2, data3)
    final_output.append(final_data)

process_data(original_dataset)
Any solution would involve somehow preserving data, data2, data3, subdata, etc., which would have to be stored somewhere. Recursion uses the stack to store them, which will trigger the recursion limit. Another way would be to store them in some array outside of the loop, which makes the code much more cumbersome. Another solution would be to just recompute them, which would also require code duplication.
So I suspect that to achieve this you'd need some specific Python facility that enables it.
I believe I have solved the issue:
Based on the previous recursive code, you can exploit the generator facilities offered by Python to preserve the serial context when calling the parallel function:
def process_data(index, dataset, parallel_requests, final_output):
    data = dataset[index]
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    parallel_requests.append(subdata)
    yield
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[index], data, data2, data3)
    final_output.append(final_data)

final_output = []
parallel_requests = []
funcs = [process_data(i, dataset, parallel_requests, final_output)
         for i in range(len(dataset))]
[next(f) for f in funcs]        # serial pass 1: fill parallel_requests
new_data = process_requests(parallel_requests)
[next(f, None) for f in funcs]  # serial pass 2: resume and consolidate
The output list and generator calls are general enough that you can abstract these lines away in a helper function that sets everything up and calls the generators for you, leading to a very clean result: the code overhead is one line for the function definition and one line to call the helper. A possible shape for that helper is sketched below.
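For illustration only (this is my sketch, not code from the original post): one way to write such a helper is to hand each generator its own parallel result via send(), rather than going through a shared new_data list. It assumes process_requests returns one result per request, in order, just as the new_data[index] lookup above does; transform, extract and merge are the placeholders used above.
def run_serial_parallel_serial(dataset, worker, process_requests):
    # Drives generator-based workers through the serial -> parallel -> serial pattern.
    parallel_requests = []
    final_output = []
    gens = [worker(item, parallel_requests) for item in dataset]
    for g in gens:                     # serial pass 1: build the requests
        next(g)
    results = process_requests(parallel_requests)
    for g, res in zip(gens, results):  # serial pass 2: resume with each result
        try:
            g.send(res)
        except StopIteration as stop:
            final_output.append(stop.value)
    return final_output

# A worker generator for this helper would look like:
def worker(item, parallel_requests):
    data2 = transform(item)
    subdata = extract(data2)
    parallel_requests.append(subdata)
    result = yield                     # paused while process_requests runs
    return merge(subdata, result, item, data2)

final_output = run_serial_parallel_serial(dataset, worker, process_requests)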

Cost of accessing scattered data in a dask cluster

I use dask to parallelize some processing, which is quite a joy.
I have a case where the calculation on the worker side requires some lookup data that is quite heavy to generate, so I scatter these data to the workers:
[future_dict] = client.scatter([large_dict], broadcast=True)
The calculation is then something like
def worker(i):
    key = do_some_work()
    data = future_dict.result()[key]
    res = do_some_more_work(data)
    return (i, res)

f = client.map(worker, range(200))
res = client.gather(f)
This works, but the lookup future_dict.result()[key] is quite slow. The time it takes to do the lookup in a worker is similar to unpickling a pickled version of large_dict, so I guess my dictionary is de-serialized in each worker.
Can I do anything to make access to the scattered data faster? E.g., if my hypothesis that the data is de-serialized in each worker is correct, can I do something to make the de-serialization happen only once per worker?
What you're doing should be OK, but if you wanted to make it faster you could pass the future in as an explicit argument:
def func(i, my_dict=None):
    key = do_some_work()
    data = my_dict[key]
    res = do_some_more_work(data)
    return (i, res)

f = client.map(func, range(200), my_dict=future_dict)
res = client.gather(f)
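A note on why this can help: when a future is passed as a keyword argument to client.map like this, Dask treats it as a dependency and hands each task the already-materialized dictionary held by the worker, so the function body never calls .result() itself. As far as I know this avoids the per-task lookup cost you were seeing, but it is worth benchmarking against your original version.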

loop of Pool.map and memory errors

I am experiencing a memory issue when I'm trying to run the below problem.
Consider a function that, for each argument arg_i_j, returns a pandas dataframe:
def some_fun(arg_i_j):
...
return DF_i_j
Now, I have structured all arguments that I want to test in the following format,
All_lists = [ [arg_0_0,..., arg_0_N], ..., [arg_k_0,..., arg_k_N]],
and I'm trying to execute the following code in the main function
# Version A
results_per_list = []
the_pool = multiprocessing.Pool(processes=mp.cpu_count(), initializer=..., initargs=...)
for a_list in All_lists:
    results = the_pool.map(some_fun, a_list)
    results_per_list.append(results)
the_pool.close()
the_pool.join()
# then use results_per_list to do operations
and I end up with the error,
...\multiprocessing\connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
MemoryError
1) Does anyone have an idea how I can resolve this issue?
2) Do you see any problem with creating a "pool" object for each "a_list" in "All_lists", like below?
# Version B
results_per_list = []
for a_list in All_lists:
    the_pool = multiprocessing.Pool(processes=mp.cpu_count(), initializer=..., initargs=...)
    results = the_pool.map(some_fun, a_list)
    results_per_list.append(results)
    the_pool.close()
    the_pool.join()

How to create groups of N elements from a PCollection Apache Beam Python

I am trying to accomplish something like this:
Batch PCollection in Beam/Dataflow
The answer in the above link is in Java, whereas the language I'm working with is Python. Thus, I require some help getting a similar construction.
Specifically I have this:
p = beam.Pipeline(options=pipeline_options)
lines = p | 'File reading' >> ReadFromText(known_args.input)
After this, I need to create another PCollection, but with lists of N rows of "lines", since my use case requires groups of rows; I cannot operate line by line.
I tried a ParDo function using a variable to count N rows, followed by a GroupBy and a Map, but the counters are reset every 1000 records, so that's not the solution I am looking for. I read the example in the link, but I don't know how to do something like that in Python.
I tried saving the counters in Datastore; however, the speed difference between Dataflow reading and writing to Datastore is quite significant.
What is the correct way to do this? I don't know how else to approach it.
Regards.
Assuming the grouping order is not important, you can just group inside a DoFn.
class Group(beam.DoFn):
    def __init__(self, n):
        self._n = n
        self._buffer = []

    def process(self, element):
        self._buffer.append(element)
        if len(self._buffer) == self._n:
            yield list(self._buffer)
            self._buffer = []

    def finish_bundle(self):
        if len(self._buffer) != 0:
            yield list(self._buffer)
            self._buffer = []

lines = (p
         | 'File reading' >> ReadFromText(known_args.input)
         | 'Group' >> beam.ParDo(Group(known_args.N)))
...
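As a side note (not from the original answer, and depending on your Beam version): the built-in beam.BatchElements transform does roughly this kind of grouping for you, for example:
batches = (p
           | 'File reading' >> ReadFromText(known_args.input)
           | 'Batch' >> beam.BatchElements(min_batch_size=known_args.N,
                                           max_batch_size=known_args.N))
Batches may still come out smaller than N at bundle boundaries, just as with the finish_bundle flush above.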

Writing to file in Pool multiprocessing (Python 2.7)

I'm doing a lot of calculations and writing the results to a file. Using multiprocessing I'm trying to parallelise the calculations.
The problem here is that I'm writing to one output file, which all the workers are writing to as well. I'm quite new to multiprocessing and am wondering how I could make it work.
A very simple concept of the code is given below:
from multiprocessing import Pool

fout_ = open('test' + '.txt', 'w')

def f(x):
    fout_.write(str(x) + "\n")

if __name__ == '__main__':
    p = Pool(5)
    p.map(f, [1, 2, 3])
The result I want would be a file with:
1
2
3
However now I get an empty file. Any suggestions?
I greatly appreciate any help :)!
You shouldn't be letting all the workers/processes write to a single file. They can all read from one file (which may cause slowdowns due to workers waiting for one of them to finish reading), but writing to the same file will cause conflicts and potential corruption.
As said in the comments, write to separate files instead and then combine them into one in a single process. This small program illustrates it, based on the program in your post:
from multiprocessing import Pool

def f(args):
    ''' Perform computation and write
        to a separate file for each '''
    x = args[0]
    fname = args[1]
    with open(fname, 'w') as fout:
        fout.write(str(x) + "\n")

def fcombine(orig, dest):
    ''' Combine files with names in
        orig into one file named dest '''
    with open(dest, 'w') as fout:
        for o in orig:
            with open(o, 'r') as fin:
                for line in fin:
                    fout.write(line)

if __name__ == '__main__':
    # Each sublist is a combination
    # of arguments - number and temporary output
    # file name
    x = range(1, 4)
    names = ['temp_' + str(y) + '.txt' for y in x]
    args = list(zip(x, names))

    p = Pool(3)
    p.map(f, args)
    p.close()
    p.join()

    fcombine(names, 'final.txt')
It runs f for each argument combination, which in this case is a value of x and a temporary output file name. It uses a nested list of argument combinations since pool.map does not accept more than one argument. There are other ways to get around this, especially on newer Python versions (see the starmap sketch at the end of this answer).
For each argument combination and pool member it creates a separate file to which it writes the output. In principle your output will be longer; you can simply add another function that computes it to the f function. Also, there is no need to use Pool(5) for 3 arguments (though I assume that only three workers were active anyway).
Reasons for calling close() and join() are explained well in this post. It turns out (in the comment to the linked post) that map is blocking, so here you don't need them for the original reasons (wait till they all finish and then write to the combined output file from just one process). I would still use them in case other parallel features are added later.
In the last step, fcombine gathers and copies all the temporary files into one. It's a bit too nested; if you, for instance, decide to remove the temporary files after copying, you may want to use a separate function under the with open(dest, ...) or under the for loop beneath it, for readability and functionality.
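For illustration only (not from the original answer): on Python 3.3+ the argument bundling can be avoided with Pool.starmap, and the pool can be used as a context manager; f2 here is a hypothetical two-argument variant of f, and fcombine is reused from above.
from multiprocessing import Pool

def f2(x, fname):
    # Same work as f above, but with two explicit parameters.
    with open(fname, 'w') as fout:
        fout.write(str(x) + "\n")

if __name__ == '__main__':
    x = range(1, 4)
    names = ['temp_' + str(y) + '.txt' for y in x]
    with Pool(3) as p:
        p.starmap(f2, zip(x, names))
    fcombine(names, 'final.txt')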
multiprocessing.Pool spawns processes; writing to a common file from each process without a lock can cause data loss.
As you said, you are trying to parallelise the calculation, and multiprocessing.Pool can be used to parallelize the computation.
Below is a solution that does the parallel computation and writes the result to a file; hope it helps:
from multiprocessing import Pool
# library for time
import datetime

# file in which you want to write
fout = open('test.txt', 'wb')

# function for your calculations; I have tried to make it time consuming
def calc(x):
    x = x**2
    sum = 0
    for i in range(0, 1000000):
        sum += i
    return x

# function to write to the txt file; it takes a list of items to write
def f(res):
    global fout
    for x in res:
        fout.write(str(x) + "\n")

if __name__ == '__main__':
    qs = datetime.datetime.now()
    arr = [1, 2, 3, 4, 5, 6, 7]
    p = Pool(5)
    res = p.map(calc, arr)
    # write the calculated list to the file
    f(res)
    qe = datetime.datetime.now()
    print (qe-qs).total_seconds()*1000

    # to compare the improvement from multiprocessing, the iterative solution
    qs = datetime.datetime.now()
    for item in arr:
        x = calc(item)
        fout.write(str(x) + "\n")
    qe = datetime.datetime.now()
    print (qe-qs).total_seconds()*1000
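As an aside that is not part of either answer: if you really do need every worker to write to the same file directly, the usual approach is to share a multiprocessing.Lock with the workers via the pool initializer, roughly like this (untested sketch; calc_and_write and the squaring are illustrative placeholders):
from multiprocessing import Pool, Lock

lock = None

def init(l):
    # Store the shared lock in each worker process.
    global lock
    lock = l

def calc_and_write(x):
    res = x ** 2                      # stand-in for the real calculation
    with lock:                        # serialize access to the shared file
        with open('test.txt', 'a') as fout:
            fout.write(str(res) + "\n")

if __name__ == '__main__':
    l = Lock()
    p = Pool(5, initializer=init, initargs=(l,))
    p.map(calc_and_write, [1, 2, 3])
    p.close()
    p.join()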
