How to execute a custom Splittable DoFn in parallel - Python

I am trying to develop a custom I/O connector for Apache Beam, written in Python. According to the official guideline, Splittable DoFn (SDF) is the framework of choice in my case.
I tried to run the pseudocode in the SDF programming guide; however, I could not get the pipeline to execute in parallel. Below is a working example.
Dummy data
myfile = open('test_beam.txt', 'w')
for i in range(0, 1000):
    myfile.write("%s\n" % i)
myfile.close()
Pipeline
Make sure to replace DUMMY_FILE with the absolute path of test_beam.txt.
import argparse
import logging
import os
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import StandardOptions
from time import sleep
import random
from apache_beam.io.restriction_trackers import OffsetRange

DUMMY_FILE = absolute_path_to_dummy_data_file


class FileToWordsRestrictionProvider(beam.transforms.core.RestrictionProvider):
    def initial_restriction(self, file_name):
        return OffsetRange(0, os.stat(file_name).st_size)

    def create_tracker(self, restriction):
        return beam.io.restriction_trackers.OffsetRestrictionTracker(
            offset_range=self.initial_restriction(file_name=DUMMY_FILE))

    def restriction_size(self, element, restriction):
        return restriction.size()


class FileToWordsFn(beam.DoFn):
    def process(
            self,
            file_name,
            # Alternatively, we can let FileToWordsFn itself inherit from
            # RestrictionProvider, implement the required methods and let
            # tracker=beam.DoFn.RestrictionParam() which will use self as
            # the provider.
            tracker=beam.DoFn.RestrictionParam(FileToWordsRestrictionProvider())):
        with open(file_name) as file_handle:
            file_handle.seek(tracker.current_restriction().start)
            while tracker.try_claim(file_handle.tell()):
                yield read_next_record(file_handle=file_handle)


def read_next_record(file_handle):
    line_number = file_handle.readline()
    logging.info(line_number)
    sleep(random.randint(1, 5))
    logging.info(f'iam done {line_number}')


def run(args, pipeline_args, file_name):
    pipeline_options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=pipeline_options) as p:
        execute_pipeline(args, p, file_name)


def execute_pipeline(args, p, file_name):
    _ = (
        p |
        'Create' >> beam.Create([file_name]) |
        'Read File' >> beam.ParDo(FileToWordsFn(file_name=file_name))
    )


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    """Build and run the pipeline."""
    parser = argparse.ArgumentParser()
    # to be added later
    args, pipeline_args = parser.parse_known_args()
    file_name = DUMMY_FILE
    run(args, pipeline_args, file_name)
The SDF is taken from the first example here; however, I had to fix a few things (e.g., define restriction_size and correct a minor misplacement of ()). Furthermore, I introduced a random sleep in read_next_record to check whether the pipeline is executed in parallel (which, apparently, it is not).
Is there perhaps a mistake in the way I constructed the pipeline? I would expect to use my SDF as the very first step in the pipeline, but doing so results in AttributeError: 'PBegin' object has no attribute 'windowing'. To circumvent this issue, I followed this post and created a PCollection containing the input file_name.
What is the correct way to execute an SDF within a pipeline in parallel?

Beam DoFns (including SplittableDoFns) operate on an input PCollection. For SplittableDoFn, the input is usually a PCollection of source configs (for example, input files). When executing a SplittableDoFn the Beam runner is able to parallelize the execution of even a single input element by isolating parts of the input read using the RestrictionTracker. So for a file, this would mean that you might have workers running in parallel that read data from the same file but at different offsets.
So your implementation seems correct and should already facilitate parallel execution for a Beam runner.
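One thing worth checking when testing locally: the Python DirectRunner processes a single bundle at a time by default, so the random sleeps will look sequential even for a correctly written SDF. Below is a minimal sketch, assuming a reasonably recent Beam Python SDK where the DirectRunner exposes the direct_num_workers and direct_running_mode options, of requesting local parallelism for the pipeline above:

from apache_beam.options.pipeline_options import PipelineOptions

# Assumed DirectRunner options; check which options your SDK version supports.
pipeline_options = PipelineOptions(
    direct_num_workers=4,                    # number of local worker processes
    direct_running_mode='multi_processing',  # run bundles in separate processes
)

with beam.Pipeline(options=pipeline_options) as p:
    execute_pipeline(args, p, file_name)

On a distributed runner such as Dataflow, no such options are needed; the runner decides how to split and schedule the restrictions itself.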

Splittable DoFns in Apache Beam let you provide a custom configuration for runner-initiated splits. In my case I had to process a big file whose content had no separators and was all on one line, and Dataflow did not scale. I used beam.transforms.core.RestrictionProvider with the split function, in which I specified the number of parts in which to read the file; with this configuration, when I ran the job, Dataflow used several workers and the processing time was reduced a lot.
class FileToLinesRestrictionProvider(beam.transforms.core.RestrictionProvider):
    def initial_restriction(self, file_name):
        return OffsetRange(0, size_file)  # e.g. 6996999736 or 43493
        # return OffsetRange(0, os.stat(file_name).st_size)

    def create_tracker(self, restriction):
        # return beam.io.restriction_trackers.OffsetRestrictionTracker(
        #     offset_range=self.initial_restriction(file_name=rutaFile_Test))
        return beam.io.restriction_trackers.OffsetRestrictionTracker(restriction)

    def split(self, file_name, restriction):
        # Configuration to read the file in parts
        bundle_ranges = calcular_segmentos_lectura(tamFila, tam_segmentos, size_file)
        for start, stop in bundle_ranges:
            yield OffsetRange(start, stop)

    def restriction_size(self, element, restriction):
        # print(restriction.size())
        return restriction.size()


class FileToLinesFn(beam.DoFn):
    def process(
            self,
            file_name,
            # Alternatively, we can let FileToLinesFn itself inherit from
            # RestrictionProvider, implement the required methods and let
            # tracker=beam.DoFn.RestrictionParam() which will use self as
            # the provider.
            tracker=beam.DoFn.RestrictionParam(FileToLinesRestrictionProvider())):
        with FileSystems.open(file_name) as file_handle:
            file_handle.seek(tracker.current_restriction().start)
            print(tracker.current_restriction())
            while tracker.try_claim(file_handle.tell()):
                # print(file_handle.tell())
                yield file_handle.read(tamFila)


def calcular_segmentos_lectura(
        size_line,
        tam_segmentos,
        tam_file):
    """Based on the file size and the line size, divides the file into parts
    according to the input parameters.
    Returns an array with the offsets to process in each step.
    """
    num_lineas = int(tam_file / size_line)
    valor_segmento = int(num_lineas / tam_segmentos)
    valor_segmento = valor_segmento * size_line
    print(valor_segmento)
    segmentos_ranges = []
    valorAnterior = 0
    for i in range(tam_segmentos):
        start = valorAnterior
        stop_position = valorAnterior + valor_segmento
        valorAnterior = stop_position
        if (i + 1) == tam_segmentos:
            stop_position = tam_file
        segmentos_ranges.append((start, stop_position))
    return segmentos_ranges
This example helped me a lot: url
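For reference, a rough sketch of how such a provider could be wired into a pipeline; it assumes size_file, tamFila, tam_segmentos and file_name are defined as module-level configuration (as in the snippet above), and that FileSystems comes from apache_beam.io.filesystems:

import logging

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems  # used by FileToLinesFn above
from apache_beam.options.pipeline_options import PipelineOptions


def run(pipeline_args, file_name):
    # A single element (the file name) enters the SDF; the runner can then
    # split its OffsetRange restriction across workers via split().
    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        _ = (
            p
            | 'Create' >> beam.Create([file_name])
            | 'Read file in splits' >> beam.ParDo(FileToLinesFn())
            | 'Log' >> beam.Map(logging.info)
        )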

Related

How do I include both data and state dependencies in a Prefect flow?

This seems like it should be simple, but I can't figure out how to include both state and data dependencies in a single flow. Here is what I attempted (simplified):
def main():
    with Flow("load_data") as flow:
        test_results = prepare_file1()
        load_file1(test_results)
        participants = prepare_file2()
        load_file2(participants)
        email = flow.add_task(EmailTask(name='email', subject='Flow succeeded!', msg='flow succeeded', email_to='xxx', email_from='xxx', smtp_server='xxx', smtp_port=25, smtp_type='INSECURE'))
        flow.set_dependencies(task=email, upstream_tasks=[load_file1, load_file2])
    flow.visualize()
I get the following graph:
Which means that load_file1 and load_file2 run twice. Can I just set up an additional dependency so that email runs when the two load tasks finish?
The issue is how you add the task to your Flow. When using tasks from the Prefect task library, it's best to first initialize them and then call them in your Flow as follows:
send_email = EmailTask(name='email', subject='Flow succeeded!', msg='flow succeeded', email_to='xxx', email_from='xxx', smtp_server='xxx', smtp_port=25, smtp_type='INSECURE')

with Flow("load_data") as flow:
    send_email()
Or alternatively, do it in one step with double round brackets EmailTask(init_kwargs)(run_kwargs). The first pair of brackets will initialize the task and the second one will call the task by invoking the task's .run() method.
with Flow("load_data") as flow:
    EmailTask(name='email', subject='Flow succeeded!', msg='flow succeeded', email_to='xxx', email_from='xxx', smtp_server='xxx', smtp_port=25, smtp_type='INSECURE')()
The full flow example could look as follows:
from prefect import task, Flow
from prefect.tasks.notifications import EmailTask
from prefect.triggers import always_run


@task(log_stdout=True)
def prepare_file1():
    print("File1 prepared!")
    return "file1"


@task(log_stdout=True)
def prepare_file2():
    print("File2 prepared!")
    return "file2"


@task(log_stdout=True)
def load_file1(file: str):
    print(f"{file} loaded!")


@task(log_stdout=True)
def load_file2(file: str):
    print(f"{file} loaded!")


send_email = EmailTask(
    name="email",
    subject="Flow succeeded!",
    msg="flow succeeded",
    email_to="xxx",
    email_from="xxx",
    smtp_server="xxx",
    smtp_port=25,
    smtp_type="INSECURE",
    trigger=always_run,
)

with Flow("load_data") as flow:
    test_results = prepare_file1()
    load1_task = load_file1(test_results)
    participants = prepare_file2()
    load2_task = load_file2(participants)
    send_email(upstream_tasks=[load1_task, load2_task])

if __name__ == "__main__":
    flow.visualize()

Dataflow Python pickling issue

I have written a simple Dataflow program that takes input from a Pub/Sub topic and calculates the Fibonacci number for that integer. However, my DoFn is not able to pickle the custom function fibonacci, which gives me errors when running on the DataflowRunner. Can someone tell me what I am doing wrong?
Below is my pipeline code.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions, SetupOptions


class Fibonacci(beam.DoFn):
    def fibonacci(self, n):
        if n < 2:
            return n
        else:
            return self.fibonacci(n-1) + self.fibonacci(n-2)

    def process(self, element, fib):
        import json
        # do some processing
        n = int(json.loads(element.data))
        # call fibonnaci
        return [fib(n)]


def Print(n):
    print(n)


if __name__ == "__main__":
    input_subscription = 'projects/consumerresearch/subscriptions/test-user-sub'

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True
    options.view_as(SetupOptions).save_main_session = True

    p = beam.Pipeline(options=options)

    raw_pubsub_data = (
        p | 'Read from topic' >> beam.io.ReadFromPubSub(subscription=input_subscription, with_attributes=True)
    )

    output = raw_pubsub_data | beam.ParDo(Fibonacci()) | beam.Map(Print)

    result = p.run()
    result.wait_until_finish()
The signature of process should be this:
process(self, element):
Your implementation has a 3rd parameter fib; Beam would not know what to pass for this. Change your implementation to reference self.fibonacci?
https://beam.apache.org/documentation/programming-guide/#pardo
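A minimal sketch of that change, keeping the rest of the pipeline unchanged (an illustration, not the only possible fix):

class Fibonacci(beam.DoFn):
    def fibonacci(self, n):
        if n < 2:
            return n
        return self.fibonacci(n - 1) + self.fibonacci(n - 2)

    def process(self, element):
        import json
        n = int(json.loads(element.data))
        # Reference the method on self instead of expecting Beam to inject it.
        yield self.fibonacci(n)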

Pyspark Luigi multiple workers issue

I want to load multiple files into Spark dataframes in parallel using a Luigi workflow and store them in a dictionary.
Once all the files are loaded, I want to be able to access these dataframes from the dictionary in the main task and do further processing. This works when I run Luigi with one worker; when I run Luigi with more than one worker, the dictionary is empty in the main task.
Any suggestion will be helpful.
import luigi
from luigi import LocalTarget
from pyspark import SQLContext
from src.etl.SparkAbstract import SparkAbstract
from src.util.getSpark import get_spark_session
from src.util import getSpark, read_json
import configparser as cp
import datetime
from src.input.InputCSVFileComponent import InputCSVFile
import os
from src.etl.Component import ComponentInfo


class fileloadTask(luigi.Task):
    compinfo = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("src/workflow_output/" + str(datetime.date.today().isoformat()) + "-" + str(self.compinfo.id) + ".csv")

    def run(self):
        a = InputCSVFile(self.compinfo)  # this class is responsible for returning the Spark dataframe and putting it in the dictionary
        a.execute()
        with self.output().open('w') as f:
            f.write("done")


class EnqueueTask(luigi.WrapperTask):
    compinfo = read_json.read_json_config('path to json file')

    def requires(self):
        folders = [
            comp.id for comp in list(self.compinfo) if comp.component_type == 'INPUTFILE'
        ]
        print(folders)
        newcominfo = []
        for index, objid in enumerate(folders):
            newcominfo.append(self.compinfo[index])
        for i in newcominfo:
            print(f" in compingo..{i.id}")
        callmethod = [fileloadTask(compinfo) for compinfo in newcominfo]
        print(callmethod)
        return callmethod


class MainTask(luigi.WrapperTask):
    def requires(self):
        return EnqueueTask()

    def output(self):
        return luigi.LocalTarget("src/workflow_output/" + str(datetime.date.today().isoformat()) + "-" + "maintask" + ".csv")

    def run(self):
        print(f"printing mapdf..{SparkAbstract.mapDf}")
        res = not SparkAbstract.mapDf
        print("Is dictionary empty ? : " + str(res))  # -------------> this is empty when workers > 1
        for key, value in SparkAbstract.mapDf.items():
            print("printing from dict")
            print(key, value.show(10))
        with self.output().open('w') as f:
            f.write("done")


"""
entry point for spark application
"""
if __name__ == "__main__":
    luigi.build([MainTask()], workers=2, local_scheduler=True)
Each worker runs in its own process. That means workers can't share Python objects (in this instance, the dictionary in which you put the results).
Generally speaking, Luigi is best at orchestrating tasks with side effects (like writing to files, etc.).
If you are trying to parallelise tasks that load data into memory, I'd recommend using dask instead of luigi.
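If you want to stay with Luigi, the usual pattern is to make the loading a side effect: each task persists its dataframe (for example as Parquet) and the downstream task reads those files back instead of relying on a shared in-memory dictionary. A rough sketch under that assumption; the Parquet paths, the compinfo.path attribute and the load_component_infos helper are hypothetical stand-ins for your own configuration:

import luigi
from src.util.getSpark import get_spark_session  # helper from the question


class fileloadTask(luigi.Task):
    compinfo = luigi.Parameter()

    def output(self):
        # hypothetical per-component output path
        return luigi.LocalTarget(f"src/workflow_output/{self.compinfo.id}.parquet")

    def run(self):
        spark = get_spark_session()
        df = spark.read.csv(self.compinfo.path, header=True)  # hypothetical source path
        df.write.parquet(self.output().path)  # persisting the dataframe is the side effect


class MainTask(luigi.Task):
    def requires(self):
        # hypothetical helper returning the same component infos as EnqueueTask
        return [fileloadTask(compinfo=c) for c in load_component_infos()]

    def run(self):
        spark = get_spark_session()
        for target in self.input():  # outputs of the upstream tasks
            df = spark.read.parquet(target.path)
            df.show(10)

Because each worker writes its result to disk, it no longer matters that the workers run in separate processes; the main task simply reads whatever the upstream tasks produced.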

disorder in multiprocessing subprocess

After running some computations nicely in a linear fashion with a moderator script (cf. below) calling an inner one that performs the computation, I struggle to bring it to execution when trying it with multiprocessing. It seems that each CPU core runs through this list (testRegister) and launches a computation even if another core already performed this task earlier (in the same session). How can I prevent this chaotic behaviour? It is my first time attempting to call multiple processors with Python.
Correction: the initial post did not show that test is a string calling "the inner script" with varying parameters m1 and m2, besides fixed arguments arg1 and arg2 belonging solely to this "inner script".
#!/usr/bin/env python3
import os
import subprocess as sub
import sys
import multiprocessing

fileRegister = []
testRegister = []


def fileCollector():
    for file in os.listdir("."):
        if file.endswith(".xyz"):
            fileRegister.append(file)
    fileRegister.sort()
    return fileRegister


def testSetup():
    data = fileRegister
    while len(data) > 1:
        for entry in fileRegister[1:]:
            m1 = str(fileRegister[0])
            m2 = str(entry)
            test = str("python foo.py ") + str(m1) + str(" ") + str(m2) +\
                str(" --arg1 --arg2")  # formulate test condition
            testRegister.append(test)
        testRegister.sort()
        del data[0]
    return testRegister


def shortAnalysator():
    for entry in testRegister:
        print(str(entry))
        sub.call(entry, shell=True)
        del testRegister[0]


def polyAnalysator():
    # apparently each CPU core works as if the register were not shared
    # reference: https://docs.python.org/3.7/library/multiprocessing.html
    if __name__ == '__main__':
        jobs = []
        for i in range(3):  # safety margin to not consume all CPUs
            p = multiprocessing.Process(target=shortAnalysator)
            jobs.append(p)
            p.start()


fileCollector()
testSetup()
shortAnalysator()  # proceeding expectably on one CPU (slow)
# polyAnalysator()  # causing irritation
sys.exit()
Your polyAnalysator is running shortAnalysator three times. Try changing your polyAnalysator as follows, and add the f function. This uses the multiprocessing Pool:
from multiprocessing import Pool


def f(test):
    sub.call(test, shell=True)


def polyAnalysator():
    # apparently each CPU core works as if the register were not shared
    # reference: https://docs.python.org/3.7/library/multiprocessing.html
    with Pool(3) as p:
        p.map(f, testRegister)
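Unlike three Process objects that each loop over the full testRegister, Pool.map partitions the list across the workers, so every test command is executed exactly once. With that change, the bottom of the script could then be reduced to something like:

if __name__ == '__main__':
    fileCollector()
    testSetup()
    polyAnalysator()  # each entry of testRegister is handled by exactly one worker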

How to initialize a pool of python multiprocessing workers with a shared state?

I am trying to execute some machine learning algorithms in parallel.
When I use multiprocessing, it's slower than without. My wild guess is that the pickle serialization of the models I use is slowing down the whole process. So the question is: how can I initialize the pool's workers with an initial state so that I don't need to serialize/deserialize the models for every single call?
Here is my current code:
import pickle
from pathlib import Path
from collections import Counter
from multiprocessing import Pool

from gensim.models.doc2vec import Doc2Vec

from wikimark import html2paragraph
from wikimark import tokenize


def process(args):
    doc2vec, regressions, filepath = args
    with filepath.open('r') as f:
        string = f.read()
    subcategories = Counter()
    for index, paragraph in enumerate(html2paragraph(string)):
        tokens = tokenize(paragraph)
        vector = doc2vec.infer_vector(tokens)
        for subcategory, model in regressions.items():
            prediction = model.predict([vector])[0]
            subcategories[subcategory] += prediction
    # compute the mean score for each subcategory
    for subcategory, prediction in subcategories.items():
        subcategories[subcategory] = prediction / (index + 1)
    # keep only the main category
    subcategory = subcategories.most_common(1)[0]
    return (filepath, subcategory)


def main():
    input = Path('./build')

    doc2vec = Doc2Vec.load(str(input / 'model.doc2vec.gz'))

    regressions = dict()
    for filepath in input.glob('./*/*/*.model'):
        with filepath.open('rb') as f:
            model = pickle.load(f)
        regressions[filepath.parent] = model

    examples = list(input.glob('../data/wikipedia/english/*'))

    with Pool() as pool:
        iterable = zip(
            [doc2vec] * len(examples),  # XXX!
            [regressions] * len(examples),  # XXX!
            examples
        )
        for filepath, subcategory in pool.imap_unordered(process, iterable):
            print('* {} -> {}'.format(filepath, subcategory))


if __name__ == '__main__':
    main()
The lines marked with XXX! point to the data that is serialized when I call pool.imap_unordered. There is at least 200MB of data that is serialized.
How can I avoid serialization?
The solution is as simple as using a global for both doc2vec and regressions.
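A minimal sketch of that approach, assuming a fork-based start method (the default on Linux) so that module-level objects created before the Pool are inherited by the workers instead of being pickled for every call:

import pickle
from pathlib import Path
from collections import Counter
from multiprocessing import Pool

from gensim.models.doc2vec import Doc2Vec
from wikimark import html2paragraph
from wikimark import tokenize

INPUT = Path('./build')

# Loaded once at module level; worker processes inherit these via fork.
DOC2VEC = Doc2Vec.load(str(INPUT / 'model.doc2vec.gz'))
REGRESSIONS = {}
for path in INPUT.glob('./*/*/*.model'):
    with path.open('rb') as f:
        REGRESSIONS[path.parent] = pickle.load(f)


def process(filepath):
    # Same logic as before, but only the cheap filepath travels to the worker.
    with filepath.open('r') as f:
        string = f.read()
    subcategories = Counter()
    for index, paragraph in enumerate(html2paragraph(string)):
        vector = DOC2VEC.infer_vector(tokenize(paragraph))
        for subcategory, model in REGRESSIONS.items():
            subcategories[subcategory] += model.predict([vector])[0]
    for subcategory, prediction in subcategories.items():
        subcategories[subcategory] = prediction / (index + 1)
    return (filepath, subcategories.most_common(1)[0])


def main():
    examples = list(INPUT.glob('../data/wikipedia/english/*'))  # same paths as in the question
    with Pool() as pool:
        for filepath, subcategory in pool.imap_unordered(process, examples):
            print('* {} -> {}'.format(filepath, subcategory))


if __name__ == '__main__':
    main()

On platforms that default to the spawn start method, the same effect can be achieved with a Pool initializer that loads the models once per worker process.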
