I have a pipeline that streams IoT log data. In addition, it has a bounded side input containing the initial configurations of the devices. This configuration changes over time and has to be updated by specific logs (ConfigLogs) coming from the main Pub/Sub source. All remaining logs (PayloadLogs) need to consume the updated configuration at some point.
In order to have access to the latest configuration, I came up with the following pipeline design:
However, I was unable to get this to work. In particular, I struggle with the correct window/trigger strategy for the side input of Use Config Side Input.
Here is a pseudocode example:
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, Repeatedly, AfterCount


class UpdateConfig(beam.DoFn):
    # currently dummy logic
    def process(self, element, outdated_config_list):
        outdated_config = outdated_config_list[0]
        print(element)
        print(outdated_config)
        yield {**outdated_config, **{element[0]: element[1]}}


class UseConfig(beam.DoFn):
    # currently dummy logic
    def process(self, element, latest_config_list):
        print(element)
        print(latest_config_list)


with beam.Pipeline() as pipeline:
    # Bounded Side Input
    outdated_config = (
        pipeline
        | "Config BQ" >> beam.Create([{3: 'config3', 4: 'config4', 5: 'config5'}])
        | "Window Side Input" >> beam.WindowInto(window.GlobalWindows())
    )

    # Unbounded Source
    config_logs, payload_logs = (
        pipeline
        | "IoT Data" >> beam.io.ReadFromPubSub(subscription="MySub").with_output_types(bytes)
        | "Decode" >> beam.Map(lambda x: eval(x.decode('utf-8')))  # dummy parsing of the tuple payload
        | "Partition Config/Payload Logs" >>
            beam.Partition(lambda x, num_partitions: 0 if isinstance(x[0], int) else 1, 2)
    )

    # Update of Bounded Side Input with part of Unbounded Source
    latest_config = (
        config_logs
        | "Update Batch Config" >>
            beam.ParDo(UpdateConfig(), outdated_config_list=beam.pvalue.AsList(outdated_config))
        | "LatestConfig Window/Trigger" >> beam.WindowInto(
            window.GlobalWindows(),
            trigger=Repeatedly(AfterCount(1)),
            accumulation_mode=AccumulationMode.DISCARDING
        )
    )

    # Use Updated Side Input
    (
        payload_logs
        | "Use Config Side Input" >>
            beam.ParDo(UseConfig(), latest_config_list=beam.pvalue.AsList(latest_config))
    )
which I feed with the following type of dummy data:
# ConfigLog: (1, 'config1')
# PayloadLog: ('1', 'value1')
The prints of Update Batch Config are always executed; however, Dataflow waits indefinitely in Use Config Side Input, even though I added an AfterCount trigger in LatestConfig Window/Trigger.
When I switch to a FixedWindow (and also add one for the PayloadLogs), I do get the prints in Use Config Side Input, but the side input is mostly empty, since the FixedWindow triggers after a fixed amount of time, irrespective of whether a ConfigLog arrived or not.
What is the correct windowing/triggering strategy when using an unbounded side input?
Related
I'm currently building a PoC Apache Beam pipeline in GCP Dataflow. In this case, I want to create a streaming pipeline with the main input from Pub/Sub and a side input from BigQuery, and store the processed data back to BigQuery.
Side pipeline code
side_pipeline = (
    p
    | "periodic" >> PeriodicImpulse(fire_interval=3600, apply_windowing=True)
    | "map to read request" >>
        beam.Map(lambda x: beam.io.gcp.bigquery.ReadFromBigQueryRequest(table=side_table))
    | beam.io.ReadAllFromBigQuery()
)
Function with side input code
def enrich_payload(payload, equipments):
    id = payload["id"]
    for equipment in equipments:
        if id == equipment["id"]:
            payload["type"] = equipment["type"]
            payload["brand"] = equipment["brand"]
            payload["year"] = equipment["year"]
            break
    return payload
Main pipeline code
main_pipeline = (
    p
    | "read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/topiq")
    | "bytes to dict" >> beam.Map(lambda x: json.loads(x.decode("utf-8")))
    | "transform" >> beam.Map(transform_function)
    | "timestamping" >> beam.Map(lambda src: window.TimestampedValue(
        src,
        dt.datetime.fromisoformat(src["timestamp"]).timestamp()
    ))
    | "windowing" >> beam.WindowInto(window.FixedWindows(30))
)

final_pipeline = (
    main_pipeline
    | "enrich data" >> beam.Map(enrich_payload, equipments=beam.pvalue.AsIter(side_pipeline))
    | "store" >> beam.io.WriteToBigQuery(bq_table)
)

result = p.run()
result.wait_until_finish()
After deploying it to Dataflow, everything looks fine and there are no errors. But then I noticed that the enrich data step has two nodes instead of one.
Also, the side input is stuck: it shows 21 for Elements Added under Input Collections and a "-" value for Elements Added under Output Collections.
You can find the full pipeline code here
I have already followed all the instructions in this documentation:
https://beam.apache.org/documentation/patterns/side-inputs/
https://beam.apache.org/releases/pydoc/2.35.0/apache_beam.io.gcp.bigquery.html
Yet I still ran into this issue. Please help me. Thanks!
Here you have a working example:
import logging

import apache_beam as beam
from apache_beam import Map, ParDo, WindowInto
from apache_beam.io import ReadAllFromBigQuery, ReadFromPubSub
from apache_beam.transforms import trigger, window
from apache_beam.transforms.periodicsequence import PeriodicImpulse

mytopic = ""
sql = "SELECT station_id, CURRENT_TIMESTAMP() timestamp FROM `bigquery-public-data.austin_bikeshare.bikeshare_stations` LIMIT 10"

def to_bqrequest(e, sql):
    from apache_beam.io import ReadFromBigQueryRequest
    yield ReadFromBigQueryRequest(query=sql)

def merge(e, side):
    for i in side:
        yield f"Main {e.decode('utf-8')} Side {i}"

pubsub = p | "Read PubSub topic" >> ReadFromPubSub(topic=mytopic)

side_pcol = (p | PeriodicImpulse(fire_interval=300, apply_windowing=False)
               | "ApplyGlobalWindow" >> WindowInto(
                   window.GlobalWindows(),
                   trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5)),
                   accumulation_mode=trigger.AccumulationMode.DISCARDING)
               | "To BQ Request" >> ParDo(to_bqrequest, sql=sql)
               | ReadAllFromBigQuery()
             )

final = (pubsub | "Merge" >> ParDo(merge, side=beam.pvalue.AsList(side_pcol))
                | Map(logging.info)
         )

p.run()
Note this uses a GlobalWindow (so that both inputs have the same window). I used a processing-time trigger so that the pane contains multiple rows. 5 was chosen arbitrarily; using 1 would work too.
Please note that matching the data between side and main inputs is non-deterministic, and you may see fluctuating values from older fired panes.
In theory, using FixedWindows should fix this, but I cannot get the FixedWindows to work.
I'm facing a situation with beam.GroupByKey(). I've loaded a file with 42,854 lines.
Due to the business rules, I need to execute a GroupByKey(); however, after it finishes I notice that I get almost double the number of lines. As you can see below:
Step before the GroupByKey():
Why am I having this behavior?
I'm not doing anything special in my pipeline:
with beam.Pipeline(runner, options=opts) as p:
    # LOAD FILE
    elements = p | 'Load File' >> beam.Create(fileRNS.values)

    # PREPARE VALUES (BULK INSERT)
    Script_Values = elements | 'Prepare Bulk Insert' >> beam.ParDo(Prepare_Bulk_Insert())
    Grouped_Values = Script_Values | 'Grouping values' >> beam.GroupByKey()

    # BULK INSERT INTO POSTGRESQL
    Grouped_Values | 'Insert PostgreSQL' >> beam.ParDo(ExecuteInsert)
Edit (2021-02-09):
When I debug, Prepare_Bulk_Insert() has the following content:
As you can see, the number of elements is correct; I don't understand why GroupByKey() shows a higher number of input elements when I'm sending the correct amount.
The Grouped_Values | 'Insert PostgreSQL' >> beam.ParDo(funcaoMap) step has its input as follows:
Double the amount. =(
Kind regards,
Juliano Medeiros
Those screenshots indicate that the "Prepare Bulk Insert" DoFn is outputting more than one element per input element. Your first screenshot shows the input PCollection of the GBK (which is produced by that DoFn) and the second shows the input to the DoFn, so the difference must be produced by that DoFn.
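To illustrate the effect, here is a minimal, self-contained sketch (hypothetical names, not your actual DoFn): a DoFn that yields twice per input element doubles the element count that the GroupByKey reports as its input.

import apache_beam as beam


class PrepareBulkInsertLike(beam.DoFn):
    # Stand-in for the real DoFn: every `yield` adds one element to the
    # output PCollection, so two yields per input element doubles the
    # number of elements flowing into the GroupByKey.
    def process(self, element):
        key, value = element
        yield (key, value)   # the intended output
        yield (key, value)   # an extra yield, e.g. from another code path


with beam.Pipeline() as p:
    (p
     | beam.Create([('a', 1), ('b', 2)])      # 2 elements in
     | beam.ParDo(PrepareBulkInsertLike())    # 4 elements out
     | beam.GroupByKey()                      # ('a', [1, 1]), ('b', [2, 2])
     | beam.Map(print))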
I have a pipeline that gets data from BigQuery and writes it to GCS; however, if I find any rejects I want to write them to a BigQuery table. I am collecting rejects into a global list variable and later loading that list into a BigQuery table. This process works fine when I run it locally, since the pipelines run in the right order. When I run it with the DataflowRunner, the order is not guaranteed (I want pipeline1 to run before pipeline2). Is there a way to have dependent pipelines in Dataflow using Python? Or please suggest whether this can be solved with a better approach. Thanks in advance.
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline1:
    data = (pipeline1
            | 'get data' >> beam.io.Read(beam.io.BigQuerySource(query=..., use_standard_sql=True))
            | 'combine output to list' >> beam.combiners.ToList()
            | 'tranform' >> beam.Map(lambda x: somefunction)  # Collecting rejects in the except block of this method to a global list variable
            ....etc
            | 'to gcs' >> beam.io.WriteToText(output)
            )

# Loading the rejects gathered in the above pipeline to BigQuery
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline2:
    rejects = (pipeline2
               | 'create pipeline' >> beam.Create(reject_list)
               | 'to json format' >> beam.Map(lambda data: {.....})
               | 'to bq' >> beam.io.WriteToBigQuery(....)
               )
You can do something like that, but with only 1 pipeline, and some additional code in the transformation.
The beam.Map(lambda x: somefunction) step should have two outputs: the one that is written to GCS, and the rejected elements that will eventually be written to BigQuery.
For that, your transform function would have to return a TaggedOutput.
There is an example in the Beam Programming Guide: https://beam.apache.org/documentation/programming-guide/#multiple-outputs-dofn
You can then write the second PCollection to BigQuery.
You don't need to have a Create in this second branch of the pipeline.
The pipeline would be something like this:
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline1:
    data = (pipeline1
            | 'get data' >> beam.io.Read(beam.io.BigQuerySource(query=..., use_standard_sql=True))
            | 'combine output to list' >> beam.combiners.ToList()
            | 'tranform' >> beam.FlatMap(transform).with_outputs('rejected', 'gcs_output')  # Tagged outputs produced here
            )

    pcoll_to_gcs = data.gcs_output
    pcoll_to_bq = data.rejected

    pcoll_to_gcs | "to gcs" >> beam.io.WriteToText(output)
    pcoll_to_bq | "to bq" >> beam.io.WriteToBigQuery(....)
Then the transform function would be something like this
def transform(element):
    if something_is_wrong_with_element:
        yield pvalue.TaggedOutput('rejected', element)
    transformed_element = ....
    yield pvalue.TaggedOutput('gcs_output', transformed_element)
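For reference, the same split can be expressed with a DoFn class and .with_outputs(), in the style of the multiple-outputs section of the programming guide linked above. This is a self-contained sketch with made-up data and a made-up reject condition, not your actual logic:

import apache_beam as beam
from apache_beam import pvalue


class SplitRejects(beam.DoFn):
    """Routes bad rows to a 'rejected' output and good rows to the main output."""
    def process(self, element):
        if element.get('value') is None:                       # placeholder reject condition
            yield pvalue.TaggedOutput('rejected', element)
        else:
            yield {**element, 'value': element['value'] * 2}   # placeholder transform


with beam.Pipeline() as p:
    results = (
        p
        | beam.Create([{'id': 1, 'value': 10}, {'id': 2, 'value': None}])
        | beam.ParDo(SplitRejects()).with_outputs('rejected', main='accepted')
    )
    results.accepted | 'accepted' >> beam.Map(lambda row: print('to gcs:', row))
    results.rejected | 'rejected' >> beam.Map(lambda row: print('to bq:', row))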
list A: 25M hashes
list B: 175K hashes
I want to check each hash in list B for existence in list A. For this I have a ParDo function, and I yield when it's not matched. This is a deduplication process.
How do I set up this ParDo efficiently? Right now I pass list A as a side input while processing list B. But shouldn't the side input go to setup() or start_bundle() of the ParDo, so that I store the lookup list (A) on the worker just once?
class Checknewrecords(beam.DoFn):
    def process(self, element, hashlist):
        if element['TA_HASH'] not in hashlist:
            yield element
        else:
            pass
If you have the answer, please include a link to the documentation, because I did not find any good documentation for the Python version.
transformed_records is a PCollection from a previous transformation
current_data is a PCollection from a BigQuery.read
new_records = transformed_records | 'Checknewrecords' >> beam.ParDo(Checknewrecords(), pvalue.AsList(current_data))
I believe that pvalue.AsDict is what you need; it will give you a dictionary-style interface for the side input. You can find some examples by searching the Apache Beam GitHub repo.
Here is a simplified example I just wrote, but please see the checked-in example below (though a bit more complicated), in case I made a mistake.
class ComputeHashes(beam.DoFn):
    def process(self, element):
        # use the element as a key to produce a KV, value is not used
        yield (HashFunction(element), True)

initial_elements = p | beam.Create(["foo"])
computed_hashes = initial_elements | beam.ParDo(ComputeHashes())

class FilterIfAlreadyComputedHash(beam.DoFn):
    def process(self, element, hashes):
        # Filter if its hash already exists in hashes
        if not hashes.get(HashFunction(element)):
            yield element

more_elements = p | beam.Create(["foo", "bar"])  # Read from your pipeline's source
small_words = more_elements | beam.ParDo(FilterIfAlreadyComputedHash(), beam.pvalue.AsDict(computed_hashes))
In the checked-in example from the Beam GitHub repo, in visionml_test.py, a PCollection is converted to the dictionary-type view using beam.pvalue.AsDict().
class VisionMlTestIT(unittest.TestCase):
    def test_text_detection_with_language_hint(self):
        IMAGES_TO_ANNOTATE = [
            'gs://apache-beam-samples/advanced_analytics/vision/sign.jpg'
        ]
        IMAGE_CONTEXT = [vision.types.ImageContext(language_hints=['en'])]
        with TestPipeline(is_integration_test=True) as p:
            contexts = p | 'Create context' >> beam.Create(
                dict(zip(IMAGES_TO_ANNOTATE, IMAGE_CONTEXT)))
            output = (
                p
                | beam.Create(IMAGES_TO_ANNOTATE)
                | AnnotateImage(
                    features=[vision.types.Feature(type='TEXT_DETECTION')],
                    context_side_input=beam.pvalue.AsDict(contexts))
                | beam.ParDo(extract))
The side input is passed into a FlatMap (in visionml.py) and, in the FlatMap's function, an entry is retrieved from the dictionary with .get(). It could also be passed into a Map or ParDo. See the Beam Python side input documentation (there they use .AsSingleton instead of .AsDict); you can find an example there of using it in the process call.
class AnnotateImage(PTransform):
    """A ``PTransform`` for annotating images using the GCP Vision API.
    ref: https://cloud.google.com/vision/docs/

    Batches elements together using ``util.BatchElements`` PTransform and sends
    each batch of elements to the GCP Vision API.
    Element is a Union[text_type, binary_type] of either an URI (e.g. a GCS URI)
    or binary_type base64-encoded image data.
    Accepts an `AsDict` side input that maps each image to an image context.
    """
    MAX_BATCH_SIZE = 5
    MIN_BATCH_SIZE = 1

    def __init__(
            self,
            features,
            retry=None,
            timeout=120,
            max_batch_size=None,
            min_batch_size=None,
            client_options=None,
            context_side_input=None,
            metadata=None):
        """
        Args:
          features: (List[``vision.types.Feature.enums.Feature``]) Required.
            The Vision API features to detect
          retry: (google.api_core.retry.Retry) Optional.
            A retry object used to retry requests.
            If None is specified (default), requests will not be retried.
          timeout: (float) Optional.
            The time in seconds to wait for the response from the Vision API.
            Default is 120.
          max_batch_size: (int) Optional.
            Maximum number of images to batch in the same request to the Vision API.
            Default is 5 (which is also the Vision API max).
            This parameter is primarily intended for testing.
          min_batch_size: (int) Optional.
            Minimum number of images to batch in the same request to the Vision API.
            Default is None. This parameter is primarily intended for testing.
          client_options:
            (Union[dict, google.api_core.client_options.ClientOptions]) Optional.
            Client options used to set user options on the client.
            API Endpoint should be set through client_options.
          context_side_input: (beam.pvalue.AsDict) Optional.
            An ``AsDict`` of a PCollection to be passed to the
            _ImageAnnotateFn as the image context mapping containing additional
            image context and/or feature-specific parameters.
            Example usage::

              image_contexts =
                [('gs://cloud-samples-data/vision/ocr/sign.jpg', Union[dict,
                  ``vision.types.ImageContext()``]),
                 ('gs://cloud-samples-data/vision/ocr/sign.jpg', Union[dict,
                  ``vision.types.ImageContext()``]),]

              context_side_input =
                (
                  p
                  | "Image contexts" >> beam.Create(image_contexts)
                )

              visionml.AnnotateImage(features,
                context_side_input=beam.pvalue.AsDict(context_side_input)))
          metadata: (Optional[Sequence[Tuple[str, str]]]): Optional.
            Additional metadata that is provided to the method.
        """
        super(AnnotateImage, self).__init__()
        self.features = features
        self.retry = retry
        self.timeout = timeout
        self.max_batch_size = max_batch_size or AnnotateImage.MAX_BATCH_SIZE
        if self.max_batch_size > AnnotateImage.MAX_BATCH_SIZE:
            raise ValueError(
                'Max batch_size exceeded. '
                'Batch size needs to be smaller than {}'.format(
                    AnnotateImage.MAX_BATCH_SIZE))
        self.min_batch_size = min_batch_size or AnnotateImage.MIN_BATCH_SIZE
        self.client_options = client_options
        self.context_side_input = context_side_input
        self.metadata = metadata

    def expand(self, pvalue):
        return (
            pvalue
            | FlatMap(self._create_image_annotation_pairs, self.context_side_input)
            | util.BatchElements(
                min_batch_size=self.min_batch_size,
                max_batch_size=self.max_batch_size)
            | ParDo(
                _ImageAnnotateFn(
                    features=self.features,
                    retry=self.retry,
                    timeout=self.timeout,
                    client_options=self.client_options,
                    metadata=self.metadata)))

    @typehints.with_input_types(
        Union[text_type, binary_type], Optional[vision.types.ImageContext])
    @typehints.with_output_types(List[vision.types.AnnotateImageRequest])
    def _create_image_annotation_pairs(self, element, context_side_input):
        if context_side_input:  # If we have a side input image context, use that
            image_context = context_side_input.get(element)
        else:
            image_context = None
        if isinstance(element, text_type):
            image = vision.types.Image(
                source=vision.types.ImageSource(image_uri=element))
        else:  # Typehint checks only allows text_type or binary_type
            image = vision.types.Image(content=element)
        request = vision.types.AnnotateImageRequest(
            image=image, features=self.features, image_context=image_context)
        yield request
Note, in Java you use it as .asMap().
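Applied to the Checknewrecords DoFn from the question, this might look roughly like the sketch below. It assumes the rows from the BigQuery read (current_data) also expose a TA_HASH field, so adjust the key extraction to your schema; the (hash, True) pairing only exists so the PCollection can be viewed with AsDict.

import apache_beam as beam


class Checknewrecords(beam.DoFn):
    def process(self, element, hash_lookup):
        # hash_lookup is materialized as a plain dict on the worker, so the
        # membership test is O(1) instead of scanning a 25M-entry list.
        if element['TA_HASH'] not in hash_lookup:
            yield element


# Key the existing hashes as (hash, True) pairs so they can be viewed as a dict.
keyed_hashes = current_data | 'Key by hash' >> beam.Map(lambda row: (row['TA_HASH'], True))

new_records = transformed_records | 'Checknewrecords' >> beam.ParDo(
    Checknewrecords(), beam.pvalue.AsDict(keyed_hashes))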
Sorry, I initially misunderstood the question. Actually, I don't think it's possible to access a side input in start_bundle; it is only accessible in process. But you could instead do the work on the first call to process and get a similar result.
class DoFnMethods(beam.DoFn):
    def __init__(self):
        self.first_element_processed = False
        self.once_retrieved_side_input_data = None

    def called_once(self, side_input):
        if self.first_element_processed:
            return
        self.once_retrieved_side_input_data = side_input.get(...)
        self.first_element_processed = True

    def process(self, element, side_input):
        self.called_once(side_input)
        ...
Note: you do need to be aware of the fact that start_bundle and finish_bundle are called once per bundle across all windows, while the side input provided to process is different for each window computed. So if you are working with windows, you may need to use a dict (keyed by window) for the self.first_element_processed and self.once_retrieved_side_input_data variables, so that you call called_once once for each window.
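If you do need that per-window bookkeeping, a rough sketch could look like this (illustrative only; the set-building stands in for whatever one-off work you want to avoid repeating per element):

import apache_beam as beam


class CachedSideInputDoFn(beam.DoFn):
    def __init__(self):
        # One cache entry per window, because the side input value that
        # process() receives can differ from window to window.
        # (In a long-running streaming job you may want to evict old windows.)
        self.cache_per_window = {}

    def process(self, element, side_input, window=beam.DoFn.WindowParam):
        if window not in self.cache_per_window:
            # First element seen for this window: derive whatever structure
            # you need from the side input exactly once.
            self.cache_per_window[window] = set(side_input)
        known = self.cache_per_window[window]
        if element not in known:
            yield element


# Usage sketch: pass the lookup PCollection as a dict-style side input.
# new = records | beam.ParDo(CachedSideInputDoFn(), beam.pvalue.AsDict(lookup))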
My task is equivalent to:
given a stream of letters (each letter is associated with its author),
build a low-latency histogram of ngrams within a fixed window,
assuming that a correct ngram consists of a sequence of letters from the same author.
My pipeline looks like:
| 'Extract and assign timestamps' >> Map(..)
| 'Extract author from message' >> Map(..)
| 'Assign author as key' >> Map(lambda x: (x.author, x))            # :(author, chr)
| 'Assign windows' >> WindowInto(window.FixedWindows(5 * 60),
                                 trigger=Repeatedly(AfterAny(
                                     AfterCount(2),
                                     AfterWatermark())),
                                 accumulation_mode=AccumulationMode.ACCUMULATING)
| 'Group letters by author' >> GroupByKey()                         # :(author, [chr])
| 'Extract many ngrams from every substring' >> ParDo(..)           # :[(author, ngram)]
| 'Re-key by ngrams' >> Map(lambda x: (x.ngram, 1))                 # :(ngram, 1)
| 'Group again, now by ngrams' >> GroupByKey()                      # :(ngram, [1])
| 'Sum ngrams counts' >> Map(lambda x: (x[0], sum(x[1])))           # :(ngram, count)
| 'Re-key by window' >> ParDo(..)                                   # :(window, (ngram, count))
| 'Group within window' >> GroupByKey()                             # :(window, [(ngram, count)])
My goal is to receive a single output message for every two input messages, such that the output message summarizes the ngram statistics between the start of the current 5-minute interval in event time and the current moment in processing time.
Without the AfterCount trigger, everything works well: each window only has a single pane, and I get a single histogram message every 5 minutes in event time.
This pipeline doesn't really work as intended with multiple panes per window, because I have 3 GBKs here, each gets triggered by panes, and each does so with accumulation. For example:
after the first two characters come in (say ('a','author1') and ('b','author1')), panes trigger in all GBKs;
I get an output message which correctly summarizes all ngrams seen so far, which may look like ('[0,300)', [('ab', 1)]);
now I have:
buffered characters in front of the first 'Group letters by author' GBK: ('author1','a') and ('author1','b');
a buffered ngram in front of the second 'Group again, now by ngrams' GBK: ('ab',1); and
a buffered ngram sum in front of the third 'Group within window' GBK: ('[0,300)',('ab',1));
when two more characters come in, ('c','author1') and ('d','author2'), all GBKs get re-triggered together with the buffered elements:
the first GBK correctly processes the full data so far: ('a','author1'), ('b','author1'), ('c','author1'), ('d','author2'), and correctly emits ('author1',['a','b','c']) and ('author2',['d']);
the second GBK processes the buffered ('ab',1) as well as the newly emitted ('bc',1) and the re-emitted ('ab',1), and here the duplication starts (and cascades on each pane of the same window);
after four characters, my summary statistics so far are ('[0,300)',[('ab',2),('bc',1)]) while they should be ('[0,300)',[('ab',1),('bc',1)]).
Here are the ideas I have:
Intuitively, I want to switch from ACCUMULATING to DISCARDING after the first GBK, but I can't see whether Apache Beam supports that.
I can use a single GBK by window and do all the intermediate grouping and map-reducing inside a custom transform, but that feels like abandoning the Apache Beam model;
I can re-window after the first GBK and switch to DISCARDING (roughly as sketched below), but then I start to get lots of empty or incomplete output windows; in other words, that doesn't satisfy my requirements.
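For concreteness, that re-windowing idea would look roughly like the sketch below. The names are placeholders: letters_by_author stands for the output of 'Group letters by author', and ExtractNgrams for the existing ngram-extraction ParDo.

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterAny, AfterCount, AfterWatermark, Repeatedly)

regrouped = (
    letters_by_author   # output of 'Group letters by author', still ACCUMULATING
    | 'Re-window, discard fired panes' >> beam.WindowInto(
        window.FixedWindows(5 * 60),
        trigger=Repeatedly(AfterAny(AfterCount(2), AfterWatermark())),
        accumulation_mode=AccumulationMode.DISCARDING)
    | 'Extract many ngrams from every substring' >> beam.ParDo(ExtractNgrams())
    # ... the remaining two GBKs now only see the elements of each new pane
)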
What's the appropriate solution for such a pipeline? How can I accumulate before the first GBK but discard before both others, and still make sure that the panes for all three GBKs close around the same data (or its derivative)?