Applying multiple PTransforms on one PCollection simultaneously in Apache Beam pipeline - python

I am trying to create a Beam pipeline that applies multiple ParDo transforms to one PCollection at the same time, then collects and prints all of the results in a list. So far I am seeing sequential processing: the first ParDo runs, then the second ParDo after it.
Here's an example I have prepared for my issue:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from absl import app

p = beam.Pipeline(options=PipelineOptions())

class Tr1(beam.DoFn):
    def process(self, number):
        number = number + 1
        yield number

class Tr2(beam.DoFn):
    def process(self, number):
        number = number + 2
        yield number

def pipeline_test():
    numbers = p | "Create" >> beam.Create([1])
    tr1 = numbers | "Tr1" >> beam.ParDo(Tr1())
    tr2 = numbers | "Tr2" >> beam.ParDo(Tr2())
    tr1 | "Print1" >> beam.Map(print)
    tr2 | "Print2" >> beam.Map(print)

def main(argv):
    del argv
    pipeline_test()
    result = p.run()
    result.wait_until_finish()

if __name__ == '__main__':
    app.run(main)

The scheduling of transforms and elements is managed by the runner used to run the pipeline.
Runners typically try to optimize the graph and might run certain tasks in sequence or in parallel.
In your case, both Tr1 and Tr2 are stateless and are applied to the same input. In this case, a runner will typically run them on the same machine, in sequence, for a given element.
Note that the runner will still process different elements in parallel.
It should look something like this:
Thread 1
  ele1 -> Tr1
       -> Tr2
Thread 2
  ele2 -> Tr1
       -> Tr2
I would not recommend relying on the expected parallelism of different parts of the pipeline, as it depends on the runner.
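If the goal is simply to collect and print all results from both branches together, the two outputs can be merged with a Flatten. A minimal sketch (not part of the original answer; the transform labels are made up), which would go inside pipeline_test where tr1 and tr2 are defined:

# Sketch: merge the two branch outputs into one PCollection so all
# results can be printed (or further processed) together.
merged = (tr1, tr2) | "MergeResults" >> beam.Flatten()
merged | "PrintAll" >> beam.Map(print)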

Related

How to distribute multiprocess CPU usage over multiple nodes?

I am trying to run a job on an HPC using multiprocessing. Each process has a peak memory usage of ~44GB. The job class I can use allows 1-16 nodes to be used, each with 32 CPUs and a memory of 124GB. Therefore if I want to run the code as quickly as possible (and within the max walltime limit) I should be able to run 2 CPUs on each node up to a maximum of 32 across all 16 nodes. However, when I specify mp.Pool(32) the job quickly exceeds the memory limit, I assume because more than two CPUs were used on a node.
My natural instinct was to specify a maximum of 2 CPUs in the PBS script I submit my Python script from; however, that configuration is not permitted on the system. I would really appreciate any insight. I have been scratching my head on this for most of the day, and have faced and worked around similar problems in the past without addressing the fundamentals at play here.
Simplified versions of both scripts below:
#!/bin/sh
#PBS -l select=16:ncpus=32:mem=124gb
#PBS -l walltime=24:00:00
module load anaconda3/personal
source activate py_env
python directory/script.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import multiprocessing as mp

def df_function(df, arr1, arr2):
    df['col3'] = some_algorithm(df, arr1, arr2)
    return df

def parallelize_dataframe(df, func, num_cores):
    df_split = np.array_split(df, num_cores)
    with mp.Pool(num_cores, maxtasksperchild=10 ** 3) as pool:
        df = pd.concat(pool.map(func, df_split))
    return df

def main():
    # Loading input data
    direc = '/home/dir1/dir2/'
    file = 'input_data.csv'
    a_file = 'array_a.npy'
    b_file = 'array_b.npy'
    df = pd.read_csv(direc + file)
    a = np.load(direc + a_file)
    b = np.load(direc + b_file)
    # Globally defining function with keyword defaults
    global f
    def f(df):
        return df_function(df, arr1=a, arr2=b)
    num_cores = 32  # i.e. 2 per node if evenly distributed.
    # Running the function as a multiprocess:
    df = parallelize_dataframe(df, f, num_cores)
    # Saving:
    df.to_csv(direc + 'outfile.csv', index=False)

if __name__ == '__main__':
    main()
To run your job as-is, you could simply request ncpus=32 and then in your Python script set num_cores = 2. Obviously this has you paying for 32 cores and then leaving 30 of them idle, which is wasteful.
The real problem here is that your current algorithm is memory-bound, not CPU-bound. You should be going to great lengths to read only chunks of your files into memory, operating on the chunks, and then writing the result chunks to disk to be organized later.
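As a rough illustration of that chunked approach with plain pandas (a sketch only: some_algorithm, the column used, and the chunk size are placeholders, and the file names mirror the question):

# Sketch of chunk-wise processing: read, transform, and append one chunk at a time,
# so only one chunk plus the two arrays are ever held in memory.
import numpy as np
import pandas as pd

def some_algorithm(chunk, arr1, arr2):
    # placeholder standing in for the question's real function
    return chunk.iloc[:, 0] * arr1.mean() + arr2.mean()

a = np.load('array_a.npy')
b = np.load('array_b.npy')

for i, chunk in enumerate(pd.read_csv('input_data.csv', chunksize=100_000)):
    chunk['col3'] = some_algorithm(chunk, a, b)
    chunk.to_csv('outfile.csv', mode='a', header=(i == 0), index=False)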
Fortunately Dask is built to do exactly this kind of thing. As a first step, you can take out the parallelize_dataframe function and directly load and map your some_algorithm with a dask.dataframe and dask.array:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import dask.dataframe as dd
import dask.array as da

def main():
    # Loading input data
    direc = '/home/dir1/dir2/'
    file = 'input_data.csv'
    a_file = 'array_a.npy'
    b_file = 'array_b.npy'
    df = dd.read_csv(direc + file, blocksize=25e6)
    a_and_b = da.from_npy_stack(direc)
    df['col3'] = df.apply(some_algorithm, args=(a_and_b,))
    # dask is lazy, so this is the only line that does any work
    # Saving:
    df.to_csv(
        direc + 'outfile.csv',
        index=False,
        compute_kwargs={"scheduler": "threads"},  # also "processes", but try threads first
    )

if __name__ == '__main__':
    main()
That will require some tweaks to some_algorithm, and to_csv and from_npy_stack work a bit differently, but you will be able to run this reasonably on your own laptop, and it will scale to your cluster hardware. You can level up from here by using the distributed scheduler, or even deploy it directly to your cluster with dask-jobqueue.
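A hedged sketch of the dask-jobqueue route (the resource numbers are taken from the question's PBS directive; queue names, network interface, and other cluster-specific options are omitted and would need to be set for your system):

# Sketch: start dask workers as PBS jobs and point a distributed Client at them.
from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(
    cores=32,           # CPUs per PBS job (one node)
    memory='124GB',     # memory per PBS job (one node)
    walltime='24:00:00',
)
cluster.scale(jobs=16)  # request up to 16 nodes' worth of workers
client = Client(cluster)
# dask.dataframe / dask.array work submitted from here on runs on the cluster workers.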

Nested dask delayed or futures

I am looking for best practice for nested parallel jobs. I couldn't nest dask delayed or futures, so I mixed both to get it to work. Is this not recommended? Is there a better way to do this? Example:
import dask
from dask.distributed import Client
import random
import time

client = Client()

def rndSeries(x):
    time.sleep(1)
    return random.sample(range(1, 50), x)

def sqNum(x):
    time.sleep(1)
    return x**2

def subProcess(li):
    results = []
    for i in li:
        r = dask.delayed(sqNum)(i)
        results.append(r)
    return dask.compute(sum(results))[0]

futures = []
for i in range(10):
    x = client.submit(rndSeries, random.randrange(5, 10, 1))
    y = client.submit(subProcess, x)
    futures.append(y)
client.gather(futures)
Consider modifying your script to have a deterministic workflow. If you start with 1 worker, you will see that the process completes in 20 seconds (as expected: 2 tasks of 1 second + 6 tasks of 3 seconds). If you have 2 workers, the execution time drops to 10 seconds.
import dask
from dask.distributed import Client, LocalCluster
import time
import numpy as np

cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)

# if inside jupyter, split the code below into a new cell
# to see accurate timing
%%time
def rndSeries(x):
    time.sleep(1)
    return np.random.rand()

def sqNum(x):
    time.sleep(3)
    return 1

def subProcess(li):
    results = []
    li = [1, 2, 3]
    for i in li:
        r = dask.delayed(sqNum)(i)
        results.append(r)
    return dask.compute(sum(results))[0]

futures = []
for i in range(2):
    x = client.submit(rndSeries, np.random.rand())
    y = client.submit(subProcess, x)
    futures.append(y)
client.gather(futures)
What happens if you have 6 workers? Execution time is now 4 seconds (the lowest possible for this task), so it seems that the only drawback of calling dask.compute() inside a future is that it forces the results of the delayed tasks onto a single worker. This is probably OK in many cases; however, if the combined resource requirements of all the delayed tasks exceed the resources of a single worker, then the best way to proceed is to submit tasks from tasks: https://distributed.dask.org/en/latest/task-launch.html
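For reference, a minimal sketch of that "tasks from tasks" pattern (using worker_client from dask.distributed and reusing sqNum from the example above; this is an illustration, not code from the linked page):

# Sketch: inside a submitted task, reconnect to the scheduler and submit subtasks,
# so the delayed work is spread across all workers instead of one.
from dask.distributed import worker_client

def subProcess(li):
    with worker_client() as client:       # connect back to the scheduler from the worker
        futures = client.map(sqNum, li)   # each sqNum runs as its own distributed task
        return sum(client.gather(futures))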

Zero simulation time

I am creating a kind of RAM memory. The idea was first to create the RAM "write" functionality, as you can see in the code below. Besides the RAM memory, there is a RAM model driver, which is supposed to write data to the RAM (just to briefly verify that the write functionality works properly).
The RAM model driver and the RAM model are connected to each other and some transactions should occur, but the problem is that the simulation completes within zero simulation time.
Does anybody have an idea what the problem could be?
@gear
def ram_model(write_addr: Uint,
              write_data: Queue['dtype'], *,
              ram_mem=None,
              dtype=b'dtype',
              mem_granularity_in_bytes=1) -> (Queue['dtype']):
    if ram_mem is None and type(ram_mem) is not dict:
        ram_mem = {}
    ram_write_op(write_addr=write_addr,
                 write_data=write_data,
                 ram_memory=ram_mem)

@gear
async def ram_write_op(write_addr: Uint,
                       write_data: Queue, *,
                       ram_memory=None,
                       mem_granularity_in_bytes=1):
    if ram_memory is None and type(ram_mem) is not dict:
        SystemError("Ram memory is %s but it should be dictionary", (type(ram_memory)))
    byte_t = Array[Uint[8], mem_granularity_in_bytes]
    async with write_addr as addr:
        async for data, _ in write_data:
            for b in code(data, byte_t):
                ram_memory[addr] = b
                addr += 1

@gear
async def ram_model_drv(*, addr_bus_width=b'asize',
                        data_type=b'dtype') -> (Uint[8], Queue['data_type']):
    num_of_w_comnds = 15
    matrix = np.random.randint(10, size=(num_of_w_comnds, 10))
    for command_id in range(num_of_w_comnds):
        for i in range(matrix[command_id].size):
            yield (command_id, (matrix[command_id][i], i == matrix[command_id].size))

stimul = ram_model_drv(addr_bus_width=8, data_type=Fixp[8, 8])
out = ram_model(stimul[0], stimul[1])
sim()
Here is the output message:
python ram_model.py
- [INFO]: Running sim with seed: 3934280405122873233
0 [INFO]: -------------- Simulation start --------------
0 [INFO]: ----------- Simulation done ---------------
0 [INFO]: Elapsed: 0.00
Yeah, this one is a bit convoluted. The gist of the issue is that in the ram_model_drv module you are synchronously outputting data on both of its output interfaces with the yield statement. For PyGears this means that you would like the data on both of these interfaces acknowledged before continuing. The ram_write_op module is connected to both of these interfaces via its write_addr and write_data arguments. Inside that module, you acknowledge data from the write_addr interface only after you have read multiple data items from the write_data interface, hence there is a deadlock, and the PyGears simulator detects that no further simulation steps are possible and exits at the end of step 0.
There are also two additional issues in the driver:
It will never generate an eot for the output data Queue. Instead, eot should be generated when i == matrix[command_id].size - 1.
The async modules are run in an endless loop by PyGears, so your ram_model_drv will generate data endlessly unless you explicitly raise a GearDone exception.
OK, back to the main issue. There are several possibilities to circumvent it:
Use decoupling
For this to work, you first need to split the data output into two yield statements, one for write_addr and the other for write_data, since your ram_write_op uses only one address for several write data items.
@gear
async def ram_model_drv(*, addr_bus_width, data_type) -> (Uint[8], Queue['data_type']):
    num_of_w_comnds = 15
    matrix = np.random.randint(10, size=(num_of_w_comnds, 10))
    for command_id in range(num_of_w_comnds):
        yield (command_id, None)
        for i in range(matrix[command_id].size):
            yield (None, (matrix[command_id][i], i == matrix[command_id].size - 1))
    raise GearDone
You can use either the dreg or decouple modules to temporarily store the output data from ram_model_drv before it is consumed by ram_write_op.
out = ram_model(stimul[0] | decouple, stimul[1] | decouple)
Split the driver into two modules, one driving each of the two interfaces
Use the low-level synchronization API for interfaces
Underneath the yield mechanism, there is a lower-level API for communicating via PyGears interfaces. Handles to the output interfaces can be obtained via the module().dout field. Data can be sent over an interface without waiting for it to be acknowledged using the put_nb() method. Later, in order to wait for the acknowledgment, the ready() method can be awaited. Finally, the put() method combines the two in one call, so it both sends the data and waits for the acknowledgment.
@gear
async def ram_model_drv(*,
                        addr_bus_width=b'asize',
                        data_type=b'dtype') -> (Uint[8], Queue['data_type']):
    addr, data = module().dout
    num_of_w_comnds = 15
    matrix = np.random.randint(10, size=(num_of_w_comnds, 10))
    for command_id in range(num_of_w_comnds):
        addr.put_nb(command_id)
        for i in range(matrix[command_id].size):
            await data.put((matrix[command_id][i], i == matrix[command_id].size - 1))
        await addr.ready()
    raise GearDone

multiprocessing: processing a large dataset

I am working with DEAP.
I am evaluating a population (currently 50 individuals) against a large dataset (400,000 columns of 200 floating-point values).
I have successfully tested the algorithm without any multiprocessing. Execution time is about 40s/generation.
I want to work with larger populations and more generations, so I try to speed up by using multiprocessing.
I guess that my question is more related to multiprocessing than to DEAP.
This question is not directly related to sharing memory/variables between processes. The main issue is how to minimise disk access.
I have started to work with Python multiprocessing module.
The code looks like this
import multiprocessing
import pandas as pd
from datetime import datetime
from deap import base, creator

toolbox = base.Toolbox()
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

PICKLE_SEED = 'D:\\Application Data\\Dev\\20150925173629ClustersFrame.pkl'
PICKLE_DATA = 'D:\\Application Data\\Dev\\20150925091456DataSample.pkl'

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    toolbox.register("map", pool.map)
    data = pd.read_pickle(PICKLE_DATA).values
And then, a little bit further:
def main():
    NGEN = 10
    CXPB = 0.5
    MUTPB = 0.2
    population = toolbox.population_guess()
    fitnesses = list(toolbox.map(toolbox.evaluate, population))
    print(sorted(fitnesses, reverse=True))
    for ind, fit in zip(population, fitnesses):
        ind.fitness.values = fit
    # Begin the evolution
    for g in range(NGEN):
The evaluation function uses the global "data" variable.
and, finally:
if __name__ == "__main__":
start = datetime.now()
main()
pool.close()
stop = datetime.now()
delta = stop-start
print (delta.seconds)
So: the main processing loop and the pool definition are guarded by if __name__ == "__main__":.
It somehow works. Execution times are:
1 process: 398 s
2 processes: 270 s
3 processes: 272 s
4 processes: 511 s
Multiprocessing does not dramatically improve the execution time, and can even harm it.
The (lack of) performance with 4 processes can be explained by memory constraints: my system is basically paging instead of processing.
I guess that the other measurements can be explained by the loading of data.
My questions:
1) I understand that the file will be read and unpickled each time the module is started as a separate process. Is this correct? Does this mean it will be read each time one of the functions it contains is called by map?
2) I have tried to move the unpickling under the if __name__ == "__main__": guard, but then I get an error message saying that "data" is not defined when I call the evaluation function. Could you explain how I can read the file once and then only pass the array to the processes?
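One common pattern for this (not taken from this thread; the names PICKLE_DATA and evaluate mirror the question, the rest is illustrative) is to load the dataset once per worker process via a Pool initializer, so each worker unpickles the file a single time and the evaluation function reads it from a module-level global:

# Sketch: each worker runs init_worker exactly once at startup,
# so the pickle is read once per worker rather than once per evaluation.
import multiprocessing
import pandas as pd

data = None  # filled in per worker by init_worker

def init_worker(pickle_path):
    global data
    data = pd.read_pickle(pickle_path).values

def evaluate(individual):
    # placeholder fitness; a real evaluation would compute against 'data'
    return (float(data.shape[0]),)

if __name__ == "__main__":
    PICKLE_DATA = 'D:\\Application Data\\Dev\\20150925091456DataSample.pkl'
    pool = multiprocessing.Pool(processes=2,
                                initializer=init_worker,
                                initargs=(PICKLE_DATA,))
    # toolbox.register("map", pool.map) would then dispatch to these pre-loaded workers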

how to implement single program multiple data (spmd) in python

I read the multiprocessing documentation in Python and found that tasks can be assigned to different CPU cores. I would like to run the following code (as a start) in parallel.
from multiprocessing import Process
import os

def do(a):
    for i in range(a):
        print i

if __name__ == "__main__":
    proc1 = Process(target=do, args=(3,))
    proc2 = Process(target=do, args=(6,))
    proc1.start()
    proc2.start()
Now I get the output as 1 2 3 and then 1 ... 6, but I need it to work as 1 1 2 2, i.e. I want to run proc1 and proc2 in parallel (not one after the other).
So you can have your code execute in parallel just by using map. I am using a delay (with time.sleep) to slow the code down so it prints as you want it to. If you don't use the sleep, the first process will finish before the second starts… and you get 0 1 2 0 1 2 3 4 5.
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool()
>>>
>>> def do(a):
... for i in range(a):
... import time
... time.sleep(1)
... print i
...
>>> _ = p.map(do, [3,6])
0
0
1
1
2
2
3
4
5
>>>
I'm using the multiprocessing fork pathos.multiprocessing because I'm the author and I'm too lazy to code it in a file. pathos enables you to do multiprocessing in the interpreter, but otherwise it's basically the same.
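The same idea works with the standard library's multiprocessing.Pool as well; a sketch (not part of the original answer, written with print() so it runs on Python 3):

# Sketch: Pool.map interleaves the two calls across worker processes.
from multiprocessing import Pool
import time

def do(a):
    for i in range(a):
        time.sleep(1)  # slow it down so the interleaving is visible
        print(i)

if __name__ == "__main__":
    with Pool() as p:
        p.map(do, [3, 6])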
You can also use the library pp. I prefer pp over multiprocessing because it allows parallel processing across different CPUs on the network. A function (func) can be applied to a list of inputs (args) with simple code:
job_server = pp.Server(ncpus=num_local_procs, ppservers=nodes)
jobs = [job_server.submit(func, (arg,)) for arg in args]
result = [job() for job in jobs]
You can also check out more examples at: https://github.com/gopiks/mappy/blob/master/map.py
