Google Dataflow - Apache Beam GroupByKey(): Duplicating/Slow - python

I'm facing a situation with beam.GroupByKey(). I've loaded a file that contains 42,854 lines.
Due to business rules I need to execute a GroupByKey(); however, after it finished executing I noticed I got almost double the number of elements. As you can see in the screenshots below:
Step before the GroupByKey():
Why am I getting this behavior?
I'm not doing anything special in my pipeline:
with beam.Pipeline(runner, options=opts) as p:
    # LOAD FILE
    elements = p | 'Load File' >> beam.Create(fileRNS.values)
    # PREPARE VALUES (BULK INSERT)
    Script_Values = elements | 'Prepare Bulk Insert' >> beam.ParDo(Prepare_Bulk_Insert())
    Grouped_Values = Script_Values | 'Grouping values' >> beam.GroupByKey()
    # BULK INSERT INTO POSTGRESQL
    Grouped_Values | 'Insert PostgreSQL' >> beam.ParDo(ExecuteInsert)
Edit (2021-02-09):
When I debug, Prepare_Bulk_Insert() has the following content:
As you can see, the number of elements is correct; I don't understand why GroupByKey() shows a higher number of input elements when I'm sending the correct amount.
The Grouped_Values | 'Insert PostgreSQL' >> beam.ParDo(funcaoMap) step has the following input:
Double the amount. =(
Kind regards,
Juliano Medeiros

Those screenshots indicate that the "Prepare Bulk Insert" DoFn is outputting more than one element per input element. Your first screenshot shows the input PCollection of the GBK (which is produced by that DoFn) and the second shows the input to the DoFn, so the difference must be produced by that DoFn.
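A common cause is an extra yield (or a yield inside a loop) in process(). For comparison, here is a minimal sketch of a DoFn that emits exactly one (key, value) pair per input element; the class name, key, and payload shape are assumptions, since the original Prepare_Bulk_Insert() body isn't shown:
import apache_beam as beam

class PrepareBulkInsertOneToOne(beam.DoFn):
    """Hypothetical sketch: emit exactly one keyed element per input row."""
    def process(self, element):
        # Build the insert values for this single row (structure assumed).
        insert_values = element
        # Yield exactly once. A second yield, or yielding inside a loop over
        # sub-fields, multiplies the element count that the GroupByKey sees.
        yield ('bulk_insert', insert_values)
Comparing the real DoFn's process() against a strict one-in/one-out shape like this usually shows where the extra elements come from.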

Related

Windowing Strategy for Unbounded Side Input

I have a pipeline which streams IoT log data. In addition, it has a bounded side input containing the initial configurations of the devices. This configuration changes over time and has to be updated by specific logs (ConfigLogs) coming from the main PubSub source. All remaining logs (PayloadLogs) need to consume the updated configuration at some point.
In order to have access to the latest configuration, I came up with the following pipeline design:
However, I was unable to get this to work. In particular, I am struggling with the correct window/trigger strategy for the side input of Use Config Side Input.
Here is a pseudocode example:
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, Repeatedly, AfterCount


class UpdateConfig(beam.DoFn):
    # currently dummy logic
    def process(self, element, outdated_config_list):
        outdated_config = outdated_config_list[0]
        print(element)
        print(outdated_config)
        yield {**outdated_config, **{element[0]: element[1]}}


class UseConfig(beam.DoFn):
    # currently dummy logic
    def process(self, element, latest_config_list):
        print(element)
        print(latest_config_list)


with beam.Pipeline() as pipeline:
    # Bounded Side Input
    outdated_config = (
        pipeline
        | "Config BQ" >> beam.Create([{3: 'config3', 4: 'config4', 5: 'config5'}])
        | "Window Side Input" >> beam.WindowInto(window.GlobalWindows())
    )

    # Unbounded Source
    config_logs, payload_logs = (
        pipeline
        | "IoT Data" >>
          beam.io.ReadFromPubSub(subscription="MySub").with_output_types(bytes)
        | "Decode" >> beam.Map(lambda x: eval(x.decode('utf-8')))
        | "Partition Config/Payload Logs" >>
          beam.Partition(lambda x, nums: 0 if type(x[0]) == int else 1, 2)
    )

    # Update of Bounded Side Input with part of Unbounded Source
    latest_config = (
        config_logs
        | "Update Batch Config" >>
          beam.ParDo(UpdateConfig(),
                     # keyword must match the process() parameter name
                     outdated_config_list=beam.pvalue.AsList(outdated_config))
        | "LatestConfig Window/Trigger" >>
          beam.WindowInto(
              window.GlobalWindows(),
              trigger=Repeatedly(AfterCount(1)),
              accumulation_mode=AccumulationMode.DISCARDING
          )
    )

    # Use Updated Side Input
    (
        payload_logs
        | "Use Config Side Input" >>
          beam.ParDo(UseConfig(), latest_config_list=beam.pvalue.AsList(latest_config))
    )
which I feed with the following type of dummy data:
# ConfigLog: (1, 'config1')
# PayloadLog: ('1', 'value1')
The prints in Update Batch Config are always executed; however, Dataflow waits indefinitely in Use Config Side Input, even though I added an AfterCount trigger in LatestConfig Window/Trigger.
When I switch to a FixedWindow (and also add one for the PayloadLogs), I do get the prints in Use Config Side Input, but the side input is mostly empty, since the FixedWindow triggers after a fixed amount of time, irrespective of whether a ConfigLog arrived or not.
What is the correct windowing/triggering strategy when using an unbounded side input?
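For reference, the working streaming side-input example in the answer further down this page keeps the side input in a GlobalWindow and fires it with a processing-time trigger instead of AfterCount. Below is a minimal sketch of that pattern applied to latest_config; the names come from the pipeline above, and the 5-second interval is an arbitrary assumption:
from apache_beam.transforms import trigger, window

# Sketch only: keep the updated config in the GlobalWindow and re-fire it on a
# processing-time interval, mirroring the GlobalWindow + AfterProcessingTime
# pattern shown in the working example further down.
latest_config = (
    config_logs
    | "Update Batch Config" >>
      beam.ParDo(UpdateConfig(),
                 outdated_config_list=beam.pvalue.AsList(outdated_config))
    | "LatestConfig Window/Trigger" >>
      beam.WindowInto(
          window.GlobalWindows(),
          trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5)),
          accumulation_mode=trigger.AccumulationMode.DISCARDING)
)
Whether this alone unblocks Use Config Side Input may also depend on how the main input (payload_logs) is windowed, so treat it as a starting point rather than a verified fix.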

Apache Beam Cloud Dataflow Streaming Stuck Side Input

I'm currently building a PoC Apache Beam pipeline in GCP Dataflow. In this case, I want to create a streaming pipeline with its main input from PubSub and a side input from BigQuery, and store the processed data back to BigQuery.
Side pipeline code
side_pipeline = (
    p
    | "periodic" >> PeriodicImpulse(fire_interval=3600, apply_windowing=True)
    | "map to read request" >>
      beam.Map(lambda x: beam.io.gcp.bigquery.ReadFromBigQueryRequest(table=side_table))
    | beam.io.ReadAllFromBigQuery()
)
Function with side input code
def enrich_payload(payload, equipments):
    id = payload["id"]
    for equipment in equipments:
        if id == equipment["id"]:
            payload["type"] = equipment["type"]
            payload["brand"] = equipment["brand"]
            payload["year"] = equipment["year"]
            break
    return payload
Main pipeline code
main_pipeline = (
    p
    | "read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/topiq")
    | "bytes to dict" >> beam.Map(lambda x: json.loads(x.decode("utf-8")))
    | "transform" >> beam.Map(transform_function)
    | "timestamping" >> beam.Map(lambda src: window.TimestampedValue(
        src,
        dt.datetime.fromisoformat(src["timestamp"]).timestamp()
    ))
    | "windowing" >> beam.WindowInto(window.FixedWindows(30))
)

final_pipeline = (
    main_pipeline
    | "enrich data" >> beam.Map(enrich_payload, equipments=beam.pvalue.AsIter(side_pipeline))
    | "store" >> beam.io.WriteToBigQuery(bq_table)
)

result = p.run()
result.wait_until_finish()
After deploying it to Dataflow, everything looks fine and there are no errors. But then I noticed that the enrich data step has two nodes instead of one.
Also, the side input is stuck: as you can see, it has 21 under Elements Added in Input Collections and a - value under Elements Added in Output Collections.
You can find the full pipeline code here
I already followed all the instructions in these docs:
https://beam.apache.org/documentation/patterns/side-inputs/
https://beam.apache.org/releases/pydoc/2.35.0/apache_beam.io.gcp.bigquery.html
yet I still run into this issue. Please help me. Thanks!
Here you have a working example:
mytopic = ""
sql = "SELECT station_id, CURRENT_TIMESTAMP() timestamp FROM `bigquery-public-data.austin_bikeshare.bikeshare_stations` LIMIT 10"
def to_bqrequest(e, sql):
from apache_beam.io import ReadFromBigQueryRequest
yield ReadFromBigQueryRequest(query=sql)
def merge(e, side):
for i in side:
yield f"Main {e.decode('utf-8')} Side {i}"
pubsub = p | "Read PubSub topic" >> ReadFromPubSub(topic=mytopic)
side_pcol = (p | PeriodicImpulse(fire_interval=300, apply_windowing=False)
| "ApplyGlobalWindow" >> WindowInto(window.GlobalWindows(),
trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5)),
accumulation_mode=trigger.AccumulationMode.DISCARDING)
| "To BQ Request" >> ParDo(to_bqrequest, sql=sql)
| ReadAllFromBigQuery()
)
final = (pubsub | "Merge" >> ParDo(merge, side=beam.pvalue.AsList(side_pcol))
| Map(logging.info)
)
p.run()
Note this uses a GlobalWindow (so that both inputs have the same window). I used a processing time trigger so that the pane contains multiple rows. 5 was chosen arbitrarily, using 1 would work too.
Please note that matching the data between side and main inputs is non-deterministic, and you may see fluctuating values from older fired panes.
In theory, using FixedWindows should fix this, but I cannot get the FixedWindows to work.

Handling rejects in Dataflow/Apache Beam through dependent pipelines

I have a pipeline that gets data from BigQuery and writes it to GCS. However, if I find any rejects I want to write them to a BigQuery table. I am collecting rejects into a global list variable and later loading the list into a BigQuery table. This process works fine when I run it locally, because the pipelines run in the right order. When I run it using DataflowRunner, the order isn't guaranteed (I want pipeline1 to run before pipeline2). Is there a way to have dependent pipelines in Dataflow using Python? Or please suggest whether this can be solved with a better approach. Thanks in advance.
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline1:
    data = (pipeline1
            | 'get data' >> beam.io.Read(beam.io.BigQuerySource(query=..., use_standard_sql=True))
            | 'combine output to list' >> beam.combiners.ToList()
            | 'tranform' >> beam.Map(lambda x: somefunction)  # Collecting rejects in the except block of this method to a global list variable
            ....etc
            | 'to gcs' >> beam.io.WriteToText(output)
            )

# Loading the rejects gathered in the above pipeline to BigQuery
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline2:
    rejects = (pipeline2
               | 'create pipeline' >> beam.Create(reject_list)
               | 'to json format' >> beam.Map(lambda data: {.....})
               | 'to bq' >> beam.io.WriteToBigQuery(....)
               )
You can do something like that, but with only 1 pipeline, and some additional code in the transformation.
The beam.Map(lambda x: somefunction) should have two outputs: the one that is written to GCS, and the rejected elements that will be eventually written to BigQuery.
For that, your transform function would have to emit TaggedOutput values.
There is an example in the Beam Programming Guide: https://beam.apache.org/documentation/programming-guide/#multiple-outputs-dofn
You can then write the second PCollection (the rejected elements) to BigQuery.
You don't need to have a Create in this second branch of the pipeline.
The pipeline would be something like this:
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline1:
    data = (pipeline1
            | 'get data' >> beam.io.Read(beam.io.BigQuerySource(query=..., use_standard_sql=True))
            | 'combine output to list' >> beam.combiners.ToList()
            # Tagged outputs are produced here; using FlatMap (or ParDo) together
            # with with_outputs() makes the tagged PCollections accessible by name
            | 'tranform' >> beam.FlatMap(transform).with_outputs('rejected', 'gcs_output')
            )

    pcoll_to_gcs = data.gcs_output
    pcoll_to_bq = data.rejected

    pcoll_to_gcs | "to gcs" >> beam.io.WriteToText(output)
    pcoll_to_bq | "to bq" >> beam.io.WriteToBigQuery(....)
Then the transform function would be something like this:
def transform(element):
    if something_is_wrong_with_element:
        yield pvalue.TaggedOutput('rejected', element)
        return  # rejected elements are not transformed any further

    transformed_element = ....
    yield pvalue.TaggedOutput('gcs_output', transformed_element)

Apache Beam with Python : How to compute the minimum in a session window, and apply it to all related PCollections

I'm using Apache Beam's Python SDK to process dictionaries, which represent streaming analytics hits. The hits are aggregated using session windows. All my Dataflow job really has to do is apply these session windows and assign a session ID to all related hits.
As the session ID, I figured I would use the timestamp of the first hit (combined with a cookie ID for each user). Here's my pipeline:
msgs = p | 'Read' >> beam.io.gcp.pubsub.ReadFromPubSub(
    topic='projects/{PROJECT_ID}/topics/{SOURCE_TOPIC}'.format(
        PROJECT_ID=PROJECT_ID, SOURCE_TOPIC=SOURCE_TOPIC),
    id_label='hit_id',
    timestamp_attribute='time')

hits = msgs | 'Parse' >> beam.Map(my_parser)

windowed_hits = hits | 'Windowing' >> beam.WindowInto(beam.window.Sessions(1 * 60))

visit_id = (windowed_hits | 'ExtractTimestamp' >> beam.Map(my_extracter)
                          | 'GetMinimum' >> beam.CombineGlobally(my_min).without_defaults())

windowed_hits | 'SetVisitId' >> beam.Map(
    set_visit_id, visit_id=beam.pvalue.AsSingleton(visit_id))
my_parser applies literal_eval to turn strings into dicts. my_extracter takes the timestamp out of the hit. set_visit_id just takes an argument and assigns it to the visit_id key.
This doesn't seem to work. When debugging, it seems my visit_id branch works correctly and waits for the session to end before computing the minimum. But when used as a side input, I only get a pvalue.EmptySideInput. How can I get to the result I want, and why does my code return an empty side input?
Edit: I've replaced AsSingleton with AsIter, to have an idea of what's going wrong here. What I get is a _FilteringIterable with:
_iterable containing one WindowedValue. The value is the timestamp of the unique hit I sent (let's call it TS1). It's associated with one window, from TS1 to TS1 + 60. Oddly, the timestamp property of this WindowedValue is TS1 + 60(.238), but I guess this is because the branch computing the minimum waited for the session to complete before computing it.
_target_window containing one window, from TS1 + 60(.24) to TS1 + 120(.24).
So I guess the problem is this _target_window, but I do not understand why it ranges from TS1 + 60 to TS1 + 120. Could it be because of the timestamp of the WindowedValue? It seems likely, as the boundaries of the _target_window seem derived from its rounded value.
I eventually achieved what I wanted by throwing away the Combine and replacing it with a GroupByKey:
from ast import literal_eval

import apache_beam as beam


def my_parser(msg):
    result = literal_eval(msg)
    return result


def set_key(hit):
    return (hit['cid'], hit)


def set_vid2(keyed_hits):
    k, hits = keyed_hits
    visit_id = min([h['time'] for h in hits])
    for h in hits:
        h['visit_id'] = visit_id
    return hits


def unpack_list(l):
    for d in l:
        yield d


msgs = p | 'Read' >> beam.io.gcp.pubsub.ReadFromPubSub(
    topic='projects/{PROJECT_ID}/topics/{SOURCE_TOPIC}'.format(
        PROJECT_ID=PROJECT_ID, SOURCE_TOPIC=SOURCE_TOPIC),
    id_label='hit_id',
    timestamp_attribute='time')

hits = msgs | 'Parse' >> beam.Map(my_parser)

keyed_hits = hits | 'SetKey' >> beam.Map(set_key)

windowed_hits = (keyed_hits | 'Windowing' >> beam.WindowInto(beam.window.Sessions(1 * 60))
                            | 'Grouping' >> beam.GroupByKey())

clean_hits = windowed_hits | 'ComputeMin' >> beam.Map(set_vid2)

clean_hits | 'Unpack' >> beam.FlatMap(unpack_list)
After the GroupByKey, I have a PCollection containing lists of hits (which are grouped by cookie ID + session windows). Then once the visit ID is computed and set on every hit, I transform my PCollection of lists of hits into a PCollection of hits, with unpack_list.
I'm not sure this is the right way to do it, though, or whether it has any impact on performance.

How to control accumulation mode when using sequential GroupByKey? Or other idiomatic approach?

My task is equivalent to:
given a stream of letters (each letter associated with an author),
build a low-latency histogram of ngrams within a fixed window,
assuming that a correct ngram consists of a sequence of letters from the same author.
My pipeline looks like:
| 'Extract and assign timestamps' >> Map(..)
| 'Extract author from message' >> Map(..)
| 'Assign author as key' >> Map(lambda x: (x.author, x)) # :(author,chr)
| 'Assign windows' >> WindowInto(window.FixedWindows(5*60),
trigger=Repeatedly(AfterAny(
AfterCount(2),
AfterWatermark())),
accumulation_mode=ACCUMULATING)
| 'Group letters by author' >> GroupByKey() # :(author,[chr])
| 'Extract many ngrams from every substring' >> ParDo(..) # :[(author, ngram)]
| 'Re-key by ngrams' >> Map(lambda x: (x.ngram, 1)) # :(ngram,1)
| 'Group again, now by ngrams' >> GroupByKey() # :(ngram,[1])
| 'Sum ngrams counts' >> Map(lambda x: (x[0], sum(x[1]))) # :(ngram,count)
| 'Re-key by window' >> ParDo(..) # :(window, (ngram,count))
| 'Group within window' >> GroupByKey # :(window, [(ngram,count)])
My goal is to receive a single output message for every two input messages, such that the output message summarizes the ngram statistics between the start of the current 5-minute interval in event time and the current moment in processing time.
Without AfterCount-trigger, everything works well: each window only has a single pane. I get a single histogram-message for every 5-minutes in event-time.
This pipeline doesn't really work as intended with multiple panes per window, because I have 3 GBKs here and each gets triggered by panes, and does so with accumulation. For example:
after first two characters come in (let's say ('a','author1') and ('b','author1'), panes trigger in all GBKs;
I get an output message which correctly summarizes all ngrams seen so far, which may look like ('[0,300)', [('ab', 1)]);
now I have:
buffered characters in front of the first 'Group letters by author' GBK: ('author1','a') and ('author1','b');
a buffered ngram in front of the second 'Group again, now by ngrams'-GBK: ('ab',1); and
a buffered ngram-sum in front of the third 'Group within window'-GBK: ('[0,300)',('ab',1))
when two more characters come in, say ('c','author1') and ('d','author2'), all GBKs get retriggered with the buffered elements:
first GBK correctly processes full data so far: ('a','author1'), ('b','author1'), ('c','author1'), ('d','author2') and correctly emits: ('author1',['a','b','c']) and ('author2',['d'])
second GBK processes the buffered ('ab',1) as well as the newly emitted ('bc',1) and the re-emitted ('ab',1), and here the duplication starts (and cascades on each pane of the same window).
after four characters my summary-statistics-so-far is ('[0,300)',[('ab',2),('bc',1)]) while it should be ('[0,300)',[('ab',1),('bc',1)])
Here are the ideas I have:
Intuitively I want to switch from ACCUMULATING to DISCARDING after the first GBK. I can't tell whether Apache Beam supports that.
I can use a single GBK-by-window and do all intermediate grouping and map-reducing inside the custom transform, but that feels like abandoning the Apache Beam model;
I can re-window after the first GBK and switch to DISCARDING (sketched just below), but then I start to get lots of empty or incomplete output windows; in other words, that doesn't satisfy my requirements.
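For concreteness, the re-windowing mentioned in the last idea might look something like the sketch below; it reuses the pseudocode style of the pipeline above, and the trigger is copied from the upstream WindowInto as an assumption rather than a tested configuration:
# Sketch only: after the first GroupByKey, re-apply the same fixed window but
# with DISCARDING accumulation, so the downstream GBKs only see new panes.
| 'Group letters by author' >> GroupByKey()
| 'Re-window, discarding' >> WindowInto(window.FixedWindows(5 * 60),
                                        trigger=Repeatedly(AfterAny(
                                            AfterCount(2),
                                            AfterWatermark())),
                                        accumulation_mode=AccumulationMode.DISCARDING)
| 'Extract many ngrams from every substring' >> ParDo(..)
As noted above, this trades the duplication for potentially empty or incomplete panes downstream, so it illustrates the mechanics rather than a recommended fix.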
What's the appropriate solution for such a pipeline? How can I accumulate before the first GBK, but discard before both others, yet make sure that panes for all three GBKs close around the same data (or its derivative)?
