I'm currently building a PoC Apache Beam pipeline in GCP Dataflow. In this case, I want to create a streaming pipeline with its main input from PubSub and a side input from BigQuery, and store the processed data back to BigQuery.
Side pipeline code
side_pipeline = (
    p
    | "periodic" >> PeriodicImpulse(fire_interval=3600, apply_windowing=True)
    | "map to read request" >>
        beam.Map(lambda x: beam.io.gcp.bigquery.ReadFromBigQueryRequest(table=side_table))
    | beam.io.ReadAllFromBigQuery()
)
Function with side input code
def enrich_payload(payload, equipments):
    id = payload["id"]
    for equipment in equipments:
        if id == equipment["id"]:
            payload["type"] = equipment["type"]
            payload["brand"] = equipment["brand"]
            payload["year"] = equipment["year"]
            break
    return payload
Main pipeline code
main_pipeline = (
    p
    | "read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/topiq")
    | "bytes to dict" >> beam.Map(lambda x: json.loads(x.decode("utf-8")))
    | "transform" >> beam.Map(transform_function)
    | "timestamping" >> beam.Map(lambda src: window.TimestampedValue(
        src,
        dt.datetime.fromisoformat(src["timestamp"]).timestamp()
    ))
    | "windowing" >> beam.WindowInto(window.FixedWindows(30))
)

final_pipeline = (
    main_pipeline
    | "enrich data" >> beam.Map(enrich_payload, equipments=beam.pvalue.AsIter(side_pipeline))
    | "store" >> beam.io.WriteToBigQuery(bq_table)
)

result = p.run()
result.wait_until_finish()
After deploying it to Dataflow, everything looks fine and there are no errors. But then I noticed that the enrich data step has two nodes instead of one.
Also, the side input is stuck: it shows Elements Added of 21 under Input Collections, but a - value for Elements Added under Output Collections.
You can find the full pipeline code here
I have already followed all the instructions in these docs:
https://beam.apache.org/documentation/patterns/side-inputs/
https://beam.apache.org/releases/pydoc/2.35.0/apache_beam.io.gcp.bigquery.html
Yet I still run into this issue. Please help me. Thanks!
Here you have a working example:
mytopic = ""
sql = "SELECT station_id, CURRENT_TIMESTAMP() timestamp FROM `bigquery-public-data.austin_bikeshare.bikeshare_stations` LIMIT 10"
def to_bqrequest(e, sql):
from apache_beam.io import ReadFromBigQueryRequest
yield ReadFromBigQueryRequest(query=sql)
def merge(e, side):
for i in side:
yield f"Main {e.decode('utf-8')} Side {i}"
pubsub = p | "Read PubSub topic" >> ReadFromPubSub(topic=mytopic)
side_pcol = (p | PeriodicImpulse(fire_interval=300, apply_windowing=False)
| "ApplyGlobalWindow" >> WindowInto(window.GlobalWindows(),
trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5)),
accumulation_mode=trigger.AccumulationMode.DISCARDING)
| "To BQ Request" >> ParDo(to_bqrequest, sql=sql)
| ReadAllFromBigQuery()
)
final = (pubsub | "Merge" >> ParDo(merge, side=beam.pvalue.AsList(side_pcol))
| Map(logging.info)
)
p.run()
Note this uses a GlobalWindow (so that both inputs have the same window). I used a processing-time trigger so that the pane contains multiple rows; 5 was chosen arbitrarily, and using 1 would work too.
Please note that matching the data between the side and main inputs is non-deterministic, and you may see fluctuating values from older fired panes.
In theory, using FixedWindows should fix this, but I cannot get the FixedWindows to work.
Related
I have a pipeline which streams IoT log data. In addition, it has a bounded side input containing the initial configurations of the devices. This configuration changes over time and has to be updated by specific logs (ConfigLogs) coming from the main PubSub source. All remaining logs (PayloadLogs) need to consume the updated configuration at some point.
In order to have access to the latest configuration, I came up with the following pipeline design:
However, I was unable to get this to work. In particular, I struggle with the correct window/trigger strategy for the side input of Use Config Side Input.
Here is a pseudocode example
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, Repeatedly, AfterCount


class UpdateConfig(beam.DoFn):
    # currently dummy logic
    def process(self, element, outdated_config_list):
        outdated_config = outdated_config_list[0]
        print(element)
        print(outdated_config)
        yield {**outdated_config, **{element[0]: element[1]}}


class UseConfig(beam.DoFn):
    # currently dummy logic
    def process(self, element, latest_config_list):
        print(element)
        print(latest_config_list)


with beam.Pipeline() as pipeline:
    # Bounded Side Input
    outdated_config = (
        pipeline
        | "Config BQ" >> beam.Create([{3: 'config3', 4: 'config4', 5: 'config5'}])
        | "Window Side Input" >> beam.WindowInto(window.GlobalWindows())
    )

    # Unbounded Source
    config_logs, payload_logs = (
        pipeline
        | "IoT Data" >>
            beam.io.ReadFromPubSub(subscription="MySub").with_output_types(bytes)
        | "Decode" >> beam.Map(lambda x: eval(x.decode('utf-8')))
        | "Partition Config/Payload Logs" >>
            beam.Partition(lambda x, nums: 0 if type(x[0]) == int else 1, 2)
    )

    # Update of Bounded Side Input with part of Unbounded Source
    latest_config = (
        config_logs
        | "Update Batch Config" >>
            # keyword must match the process() argument name
            beam.ParDo(UpdateConfig(), outdated_config_list=beam.pvalue.AsList(outdated_config))
        | "LatestConfig Window/Trigger" >>
            beam.WindowInto(
                window.GlobalWindows(),
                trigger=Repeatedly(AfterCount(1)),
                accumulation_mode=AccumulationMode.DISCARDING
            )
    )

    # Use Updated Side Input
    (
        payload_logs
        | "Use Config Side Input" >>
            beam.ParDo(UseConfig(), latest_config_list=beam.pvalue.AsList(latest_config))
    )
which I feed with the following type of dummy data
# ConfigLog: (1, 'config1')
# PayloadLog: ('1', 'value1')
The prints in Update Batch Config are always executed; however, Dataflow waits indefinitely in Use Config Side Input, even though I added an AfterCount trigger in LatestConfig Window/Trigger.
When I switch to a FixedWindow (and also add one for the PayloadLogs), I do get the prints in Use Config Side Input, but the side input is mostly empty, since the FixedWindow is triggered after a fixed amount of time, irrespective of whether an incoming ConfigLog arrived or not.
What is the correct windowing/triggering strategy when using an unbounded side input?
I'm facing a situation with beam.GroupByKey(). I've loaded a file that has 42,854 lines.
Due to business rules, I need to execute a GroupByKey(). However, after it finishes executing I noticed that I got almost double the number of lines, as you can see below:
Step before the GroupByKey():
Why am I seeing this behavior?
I'm not doing anything special in my pipeline:
with beam.Pipeline(runner, options=opts) as p:
    # LOAD FILE
    elements = p | 'Load File' >> beam.Create(fileRNS.values)

    # PREPARE VALUES (BULK INSERT)
    Script_Values = elements | 'Prepare Bulk Insert' >> beam.ParDo(Prepare_Bulk_Insert())
    Grouped_Values = Script_Values | 'Grouping values' >> beam.GroupByKey()

    # BULK INSERT INTO POSTGRESQL
    Grouped_Values | 'Insert PostgreSQL' >> beam.ParDo(ExecuteInsert)
Edit (2021-02-09):
When I debug, Prepare_Bulk_Insert() has the following content:
As you can see, the number of elements is correct, so I don't understand why GroupByKey() has a higher number of elements at its input if I'm sending the correct amount.
The Grouped_Values | 'Insert PostgreSQL' >> beam.ParDo(funcaoMap) step has the following input:
Double the amount. =(
Kind regards,
Juliano Medeiros
Those screenshots indicate that the "Prepare Bulk Insert" DoFn is outputting more than one element per input element. Your first screenshot shows the input PCollection of the GroupByKey (which is produced by that DoFn), and the second shows the input to the DoFn, so the difference must be produced by the DoFn itself.
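For illustration only (a sketch, not the actual Prepare_Bulk_Insert code, which isn't shown in the question), a DoFn along these lines would feed roughly twice 42,854 elements into the GroupByKey:

import apache_beam as beam

class Prepare_Bulk_Insert(beam.DoFn):
    # hypothetical reconstruction: any DoFn that emits twice per input
    # element doubles the element count of the downstream PCollection
    def process(self, element):
        key = element[0]        # assumed key extraction
        yield (key, element)    # intended key/value output
        yield (key, element)    # an extra yield like this one doubles the GBK input

Check the DoFn for a second yield, a loop around the yield, or a returned list with more than one tuple per input line.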
I have a pipeline that gets data from BigQuery and writes it to GCS. However, if I find any rejects, I want to write them to a BigQuery table. I am collecting rejects into a global list variable and later loading the list into a BigQuery table. This process works fine when I run it locally, as the pipelines run in the right order. When I run it using DataflowRunner, the order isn't guaranteed (I want pipeline1 to run before pipeline2). Is there a way to have dependent pipelines in Dataflow using Python? Or please suggest whether this can be solved with a better approach. Thanks in advance.
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline1:
    data = (pipeline1
            | 'get data' >> beam.io.Read(beam.io.BigQuerySource(query=..., use_standard_sql=True))
            | 'combine output to list' >> beam.combiners.ToList()
            | 'transform' >> beam.Map(lambda x: somefunction)  # Collecting rejects in the except block of this method into a global list variable
            ....etc
            | 'to gcs' >> beam.io.WriteToText(output)
            )

# Loading the rejects gathered in the above pipeline to BigQuery
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline2:
    rejects = (pipeline2
               | 'create pipeline' >> beam.Create(reject_list)
               | 'to json format' >> beam.Map(lambda data: {.....})
               | 'to bq' >> beam.io.WriteToBigQuery(....)
               )
You can do something like that, but with only one pipeline and some additional code in the transformation.
The beam.Map(lambda x: somefunction) should have two outputs: the one that is written to GCS, and the rejected elements that will eventually be written to BigQuery.
For that, your transform function would have to emit elements wrapped in TaggedOutput.
There is an example in the Beam Programming Guide: https://beam.apache.org/documentation/programming-guide/#multiple-outputs-dofn
The second PCollection can then be written to BigQuery.
You don't need a Create in this second branch of the pipeline.
The pipeline would be something like this:
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline1:
    data = (pipeline1
            | 'get data' >> beam.io.Read(beam.io.BigQuerySource(query=..., use_standard_sql=True))
            | 'combine output to list' >> beam.combiners.ToList()
            # FlatMap + with_outputs so the tagged outputs can be accessed by name
            | 'transform' >> beam.FlatMap(transform).with_outputs('gcs_output', 'rejected')
            )

    pcoll_to_gcs = data.gcs_output
    pcoll_to_bq = data.rejected

    pcoll_to_gcs | "to gcs" >> beam.io.WriteToText(output)
    pcoll_to_bq | "to bq" >> beam.io.WriteToBigQuery(....)
Then the transform function would be something like this
from apache_beam import pvalue

def transform(element):
    if something_is_wrong_with_element:
        yield pvalue.TaggedOutput('rejected', element)
        return  # rejected elements are not written to GCS

    transformed_element = ....
    yield pvalue.TaggedOutput('gcs_output', transformed_element)
I'm building a dataflow pipeline and I'm having some trouble branching and merging outputs. The pipeline I want to build is as follows:
Read some input data input_data.
A. Extract some metric, metric_1, from input_data.
B. Extract some other metric, metric_2, from input_data.
Since these metric extractions are computationally expensive, I want to branch off of the main input_data and merge the outputs afterwards for further calculation.
Merge the outputs into output.
Here's some sample code that encapsulates my actual pipeline
import apache_beam as beam
import numpy as np


class ReadData(beam.DoFn):
    def process(self, element):
        # read from source
        return [{'input': np.random.rand(100, 10)}]


class GetFirstMetric(beam.DoFn):
    def process(self, element):
        # some processing
        return [{'first': np.random.rand(100, 4)}]


class GetSecondMetric(beam.DoFn):
    def process(self, element):
        # some processing
        return [{'second': np.random.rand(100, 3)}]


def run():
    with beam.Pipeline() as p:
        input_data = (p | 'read sample data' >> beam.ParDo(ReadData()))
        metric_1 = (input_data | 'some metric on input data' >> beam.ParDo(GetFirstMetric()))
        metric_2 = (input_data | 'some aggregate metric' >> beam.ParDo(GetSecondMetric()))

        output = ((metric_1, metric_2)
                  | beam.Flatten()
                  | beam.combiners.ToList()
                  | beam.Map(print)
                  )
When I run this, I get a 'PBegin' object has no attribute 'windowing' error. I've seen some examples and sample code for doing something like this in Java, but I couldn't find the right resources for doing the same in Python. My questions are as follows:
What's the right way to branch and merge PCollections (especially if the branches come from a common input)?
Is there a better pipeline design for accomplishing the same thing?
Thanks in advance!
In this code, your problem is that you are not 'starting' with an initial PCollection. In ReadData.process, what is the value of the variable element?
Well, the runner can't come up with a value, because there's no input PCollection. You need to create your first PCollection. You'd do something like the following code.
As for making them into a list, perhaps a side input strategy may work. Can you try the following:
def run():
    with beam.Pipeline() as p:
        # unique labels avoid a name collision between the two Create transforms
        starter_pcoll = p | 'Start' >> beam.Create(['any'])

        input_data = (starter_pcoll | 'read sample data' >> beam.ParDo(ReadData()))
        metric_1 = (input_data | 'some metric on input data' >> beam.ParDo(GetFirstMetric()))
        metric_2 = (input_data | 'some aggregate metric' >> beam.ParDo(GetSecondMetric()))

        side_in = beam.pvalue.AsList((metric_1, metric_2)
                                     | beam.Flatten())

        p | 'Print trigger' >> beam.Create(['any']) | beam.Map(lambda x, si: print(si),
                                                               side_in)
This should make your pipeline run. Happy to clarify further if you have specific questions.
I'm using Apache Beam's Python SDK to process dictionaries that represent streaming analytics hits. The hits are aggregated using session windows. All my Dataflow job really has to do is apply these session windows and assign a session ID to all related hits.
As the session ID, I've figured I would use the timestamp of the first hit (combined with a cookie ID for each user). Here's my pipeline:
msgs = p | 'Read' >> beam.io.gcp.pubsub.ReadFromPubSub(
    topic='projects/{PROJECT_ID}/topics/{SOURCE_TOPIC}'.format(
        PROJECT_ID=PROJECT_ID, SOURCE_TOPIC=SOURCE_TOPIC),
    id_label='hit_id',
    timestamp_attribute='time')

hits = msgs | 'Parse' >> beam.Map(my_parser)

windowed_hits = hits | 'Windowing' >> beam.WindowInto(beam.window.Sessions(1 * 60))

visit_id = (windowed_hits | 'ExtractTimestamp' >> beam.Map(my_extracter)
                          | 'GetMinimum' >> beam.CombineGlobally(my_min).without_defaults())

windowed_hits | 'SetVisitId' >> beam.Map(set_visit_id, visit_id=beam.pvalue.AsSingleton(visit_id))
my_parser applies literal_eval to turn the strings into dicts, my_extracter takes the timestamp out of the hit, and set_visit_id just takes an argument and assigns it to the visit_id key.
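For reference, those helpers would look roughly like this (a sketch based on that description, not the actual code; the 'time' field name is an assumption):

def my_extracter(hit):
    # take the event timestamp out of the parsed hit (field name assumed)
    return hit['time']

def my_min(timestamps):
    # callable for beam.CombineGlobally: reduces the window's timestamps
    # to the smallest one, i.e. the timestamp of the first hit
    return min(timestamps)

def set_visit_id(hit, visit_id):
    # visit_id arrives as a side input (beam.pvalue.AsSingleton)
    hit['visit_id'] = visit_id
    return hit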
This doesn't seem to work. When debugging, it seems my visit_id branch works correctly and waits for the session to end before computing the minimum. But when it's used as a side input, I only get a pvalue.EmptySideInput. How can I get the result I want, and why does my code return an empty side input?
Edit: I've replaced AsSingleton with AsIter, to get an idea of what's going wrong here. What I get is a _FilteringIterable with:
_iterable containing one WindowedValue. The value is the timestamp of the unique hit I sent (let's call it TS1). It's associated with one window, from TS1 to TS1 + 60. Oddly, the timestamp property of this WindowedValue is TS1 + 60(.238), but I guess this is because the branch computing the minimum waited for the session to complete before computing the minimum.
_target_window containing one window, from TS1 + 60(.24) to TS1 + 120(.24).
So I guess the problem is this _target_window, but I don't understand why it ranges from TS1 + 60 to TS1 + 120. Could it be because of the timestamp of the WindowedValue? It seems likely, as the boundaries of the _target_window appear to be derived from its rounded value.
I eventually managed to do what I wanted by throwing away the Combine and replacing it with a GroupByKey.
from ast import literal_eval

import apache_beam as beam


def my_parser(msg):
    result = literal_eval(msg)
    return result


def set_key(hit):
    return (hit['cid'], hit)


def set_vid2(keyed_hits):
    k, hits = keyed_hits
    visit_id = min([h['time'] for h in hits])
    for h in hits:
        h['visit_id'] = visit_id
    return hits


def unpack_list(l):
    for d in l:
        yield d


msgs = p | 'Read' >> beam.io.gcp.pubsub.ReadFromPubSub(
    topic='projects/{PROJECT_ID}/topics/{SOURCE_TOPIC}'.format(
        PROJECT_ID=PROJECT_ID, SOURCE_TOPIC=SOURCE_TOPIC),
    id_label='hit_id',
    timestamp_attribute='time')

hits = msgs | 'Parse' >> beam.Map(my_parser)

keyed_hits = hits | 'SetKey' >> beam.Map(set_key)

windowed_hits = (keyed_hits | 'Windowing' >> beam.WindowInto(beam.window.Sessions(1 * 60))
                            | 'Grouping' >> beam.GroupByKey())

clean_hits = windowed_hits | 'ComputeMin' >> beam.Map(set_vid2)

clean_hits | 'Unpack' >> beam.FlatMap(unpack_list)
After the GroupByKey, I have a PCollection containing lists of hits (grouped by cookie ID + session window). Then, once the visit ID is computed and set on every hit, I turn my PCollection of lists of hits back into a PCollection of hits with unpack_list.
I'm not sure this is the right way to do it, or whether it has any impact on performance, though.
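As a possible simplification (an untested sketch, not part of the original solution): since set_vid2 already returns a list of hits, beam.FlatMap can iterate over that list directly, so the 'ComputeMin' Map and the 'Unpack' FlatMap could be collapsed into one step:

# FlatMap flattens the list returned by set_vid2, so no separate unpack step is needed
clean_hits = windowed_hits | 'ComputeMinAndUnpack' >> beam.FlatMap(set_vid2)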