I'm quite new to Apache Beam and have implemented my first pipelines.
But now I have reached a point where I am confused about how to combine windowing and joining.
Problem definition:
I have two streams of data, one with pageviews of users and another with requests of the users. They share the key session_id, which describes the user's session, but each has other additional data.
The goal is to compute the number of pageviews in a session before a request happened. That means I want a stream of data that has every request together with the number of pageviews that happened before it. It suffices to consider the pageviews of, let's say, the last 5 minutes.
What I tried
To load the requests I use this snippet, which reads them from a Pub/Sub subscription and then extracts the session_id as the key. Lastly, I apply a window that emits every request directly when it is received.
requests = (p
    | 'Read Requests' >> (
        beam.io.ReadFromPubSub(subscription=request_sub)
        | 'Extract' >> beam.Map(lambda x: json.loads(x))
        | 'Session as Key' >> beam.Map(lambda request: (request['session_id'], request))
        | 'Window' >> beam.WindowInto(window.SlidingWindows(5 * 60, 1 * 60, 0),
                                      trigger=trigger.AfterCount(1),
                                      accumulation_mode=trigger.AccumulationMode.DISCARDING)
    )
)
Similarly, this snippet loads the pageviews and applies a sliding window that is emitted, accumulating, whenever a pageview arrives.
pageviews = (p
    | 'Read Pageviews' >> (
        beam.io.ReadFromPubSub(subscription=pageview_sub)
        | 'Extract' >> beam.Map(lambda x: json.loads(x))
        | 'Session as Key' >> beam.Map(lambda pageview: (pageview['session_id'], pageview))
        | 'Window' >> beam.WindowInto(
            windowfn=window.SlidingWindows(5 * 60, 1 * 60, 0),
            trigger=trigger.AfterCount(1),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
    )
)
To apply the join, I tried
combined = (
    {
        'requests': requests,
        'pageviews': pageviews
    }
    | 'Merge' >> beam.CoGroupByKey()
    | 'Print' >> beam.Map(print)
)
When I run this pipeline, the merged rows never contain both requests and pageviews; only one of them is ever present.
My idea was to filter the pageviews down to those that happened before the request and to count them after the CoGroupByKey. What do I need to do? I suppose my problem is with the windowing and triggering strategy.
It's also quite important that the requests get processed with low latency, possibly discarding pageviews that come in late.
I found a solution myself; here it is in case somebody is interested:
Idea
The trick is to combine the two streams using the beam.Flatten operation and to use a stateful DoFn to compute the number of pageviews before each request. Each stream contains JSON dictionaries. I tag the events by wrapping them as ('request', request) and ('pageview', pageview), so that I can keep the different event types apart in the stateful DoFn. Along the way I also compute things like the first pageview timestamp and the seconds since the first pageview. The streams have to use the session_id as the key, so that the stateful DoFn receives only the events of a single session.
Code
First of all, this is the pipeline code:
# Beam pipeline that extends each request by the number of pageviews before it in that session
with beam.Pipeline(options=options) as p:

    # The stream of requests
    requests = (
        'Read from PubSub subscription' >> beam.io.ReadFromPubSub(subscription=request_sub)
        | 'Extract JSON' >> beam.ParDo(ExtractJSON())
        | 'Add Timestamp' >> beam.ParDo(AssignTimestampFn())
        | 'Use Session ID as stream key' >> beam.Map(lambda request: (request['session_id'], request))
        | 'Add type of event' >> beam.Map(lambda r: (r[0], ('request', r[1])))
    )

    # The stream of pageviews
    pageviews = (
        'Read from PubSub subscription' >> beam.io.ReadFromPubSub(subscription=pageview_sub)
        | 'Extract JSON' >> beam.ParDo(ExtractJSON())
        | 'Add Timestamp' >> beam.ParDo(AssignTimestampFn())
        | 'Use Session ID as stream key' >> beam.Map(lambda pageview: (pageview['session_id'], pageview))
        | 'Add type of event' >> beam.Map(lambda p: (p[0], ('pageview', p[1])))
    )

    # Combine the streams and apply the stateful DoFn
    combined = (
        (
            p | ('Prepare requests stream' >> requests),
            p | ('Prepare pageviews stream' >> pageviews)
        )
        | 'Combine event streams' >> beam.Flatten()
        | 'Global Window' >> beam.WindowInto(windowfn=window.GlobalWindows(),
                                             trigger=trigger.AfterCount(1),
                                             accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | 'Stateful DoFn' >> beam.ParDo(CountPageviewsStateful())
        | 'Compute processing delay' >> beam.ParDo(LogTimeDelay())
        | 'Format for BigQuery output' >> beam.ParDo(FormatForOutputDoFn())
    )

    # Write to BigQuery.
    combined | 'Write' >> beam.io.WriteToBigQuery(
        requests_extended_table,
        schema=REQUESTS_EXTENDED_TABLE_SCHEMA,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
The interesting part is the combination of the two streams using beam.Flatten and applying the stateful DoFn CountPageviewsStateful().
Here's the code of the custom DoFns used:
# This DoFn just loads a JSON message
class ExtractJSON(beam.DoFn):
    def process(self, element):
        import json

        yield json.loads(element)


# This DoFn adds the event timestamp of a message into its JSON element for further processing
class AssignTimestampFn(beam.DoFn):
    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        import datetime

        timestamped_element = element
        timestamp_utc = datetime.datetime.utcfromtimestamp(float(timestamp))
        timestamp = timestamp_utc.strftime("%Y-%m-%d %H:%M:%S")
        timestamped_element['timestamp_utc'] = timestamp_utc
        timestamped_element['timestamp'] = timestamp
        yield timestamped_element
# This class is a stateful DoFn.
# Input elements should be of the form (session_id, ('event_type', event)),
# where events can be requests or pageviews.
# It keeps the number of pageviews and the first pageview timestamp per session
# in its internal state.
# Whenever a request comes in, it appends the internal state to the request and emits
# an extended request.
# Whenever a pageview comes in, the internal state is updated but nothing is emitted.
class CountPageviewsStateful(beam.DoFn):
    # The internal states
    NUM_PAGEVIEWS = userstate.CombiningValueStateSpec('num_pageviews', combine_fn=sum)
    FIRST_PAGEVIEW = userstate.ReadModifyWriteStateSpec('first_pageview', coder=beam.coders.VarIntCoder())

    def process(self,
                element,
                num_pageviews_state=beam.DoFn.StateParam(NUM_PAGEVIEWS),
                first_pageview_state=beam.DoFn.StateParam(FIRST_PAGEVIEW)
                ):
        import datetime

        # Extract the element
        session_id = element[0]
        event_type, event = element[1]

        # Depending on the event type, different actions are taken
        if event_type == 'request':
            # This is a request
            request = event

            # First, the stored first pageview timestamp is read and the seconds since the first pageview are calculated
            first_pageview = first_pageview_state.read()
            if first_pageview is not None:
                seconds_since_first_pageview = (int(request['timestamp_utc'].timestamp()) - first_pageview)

                first_pageview_timestamp_utc = datetime.datetime.utcfromtimestamp(float(first_pageview))
                first_pageview_timestamp = first_pageview_timestamp_utc.strftime("%Y-%m-%d %H:%M:%S")
            else:
                seconds_since_first_pageview = -1
                first_pageview_timestamp = None

            # The calculated data is appended to the request
            request['num_pageviews'] = num_pageviews_state.read()
            request['first_pageview_timestamp'] = first_pageview_timestamp
            request['seconds_since_first_pageview'] = seconds_since_first_pageview

            # The pageview counter is reset
            num_pageviews_state.clear()

            # The extended request is emitted
            yield (session_id, request)

        elif event_type == 'pageview':
            # This is a pageview
            pageview = event

            # Update the first pageview state
            first_pageview = first_pageview_state.read()
            if first_pageview is None:
                first_pageview_state.write(int(pageview['timestamp_utc'].timestamp()))
            elif first_pageview > int(pageview['timestamp_utc'].timestamp()):
                first_pageview_state.write(int(pageview['timestamp_utc'].timestamp()))

            # Increase the number of pageviews
            num_pageviews_state.add(1)
            # Do not emit anything, pageviews are not processed further
# This DoFn logs the delay between the event time and the processing time
class LogTimeDelay(beam.DoFn):
    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        import datetime
        import logging

        timestamp_utc = datetime.datetime.utcfromtimestamp(float(timestamp))
        seconds_delay = (datetime.datetime.utcnow() - timestamp_utc).total_seconds()
        logging.warning('Delayed by %s seconds', seconds_delay)
        yield element
This seems to work and gives me an average delay of about 1-2 seconds on the direct runner. On Cloud Dataflow the average delay is about 0.5-1 seconds. So all in all, this seems to solve the problem definition.
Further considerations
There are some open questions, though:
I am using global windows, which means the internal state will be kept forever as far as I can tell. Maybe session windows are the correct way to go: when there are no pageviews/requests for x seconds, the window is closed and the internal state is freed. Alternatively, the per-session state could be cleared with a timer; a sketch of that idea follows below.
The processing delay is a little high, but maybe I just need to tweak the Pub/Sub part a bit.
I do not know how much overhead or memory consumption this solution adds compared to standard Beam methods. I also haven't tested it under high workload or with more parallelisation.
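As a hedged sketch for the state-cleanup concern above (an illustration, not the code used in the pipeline): Beam's stateful processing also supports timers, so the per-session state could be cleared after a period of inactivity instead of switching to session windows. The 30-minute timeout, the simplified state, and the class name CountPageviewsWithExpiry are assumptions for the example.

import apache_beam as beam
from apache_beam.transforms import userstate
from apache_beam.transforms.timeutil import TimeDomain

SESSION_TIMEOUT = 30 * 60  # assumed: expire a session 30 minutes after its last event


class CountPageviewsWithExpiry(beam.DoFn):
    NUM_PAGEVIEWS = userstate.CombiningValueStateSpec('num_pageviews', combine_fn=sum)
    # Event-time timer that fires once the watermark passes the expiry timestamp
    EXPIRY_TIMER = userstate.TimerSpec('expiry', TimeDomain.WATERMARK)

    def process(self,
                element,
                timestamp=beam.DoFn.TimestampParam,
                num_pageviews_state=beam.DoFn.StateParam(NUM_PAGEVIEWS),
                expiry_timer=beam.DoFn.TimerParam(EXPIRY_TIMER)):
        # Push the expiry further into the future on every event of this session
        expiry_timer.set(timestamp + SESSION_TIMEOUT)

        event_type, event = element[1]
        if event_type == 'pageview':
            num_pageviews_state.add(1)
        elif event_type == 'request':
            event['num_pageviews'] = num_pageviews_state.read()
            num_pageviews_state.clear()
            yield (element[0], event)

    @userstate.on_timer(EXPIRY_TIMER)
    def expire_session(self, num_pageviews_state=beam.DoFn.StateParam(NUM_PAGEVIEWS)):
        # Free the per-session state once the session has been idle long enough
        num_pageviews_state.clear()

Whether an event-time timer or a processing-time timer (TimeDomain.REAL_TIME) fits better depends on how late pageviews should be treated.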
Related
I have a Dataflow job that accumulates UI interactions via GCP Pub/Sub. I've tested this by using a script that sends many Pub/Sub messages representing interactions to the input_topic. With a lower number of messages (<500 per second), the Dataflow job correctly counts the interactions. But when I crank the number of messages up, all of a sudden the Dataflow job sends out counts that are way higher (5-10x) than the number of Pub/Sub messages sent to the input_topic.
The ideas I've explored are:
Pub/Sub is resending messages that aren't being acked.
This doesn't make sense because the ack deadline for the input_topic subscription is 1 minute.
Something is wrong with my trigger configuration.
Something I don't understand is happening in ReadFromPubSub or CombineGlobally(CountFn())
class CountFn(beam.CombineFn):
    def create_accumulator(self):
        # interaction1, interaction2, interaction3, interaction4
        return 0, 0, 0, 0

    def add_input(self, interactions, input):
        (interaction1, interaction2, interaction3, interaction4) = interactions
        interaction1_result = interaction1 + input['interaction1'] if ('interaction1' in input and isinstance(input['interaction1'], int) and input['interaction1'] > 0) else interaction1
        interaction2_result = interaction2 + input['interaction2'] if ('interaction2' in input and isinstance(input['interaction2'], int) and input['interaction2'] > 0) else interaction2
        interaction3_result = interaction3 + input['interaction3'] if ('interaction3' in input and isinstance(input['interaction3'], int) and input['interaction3'] > 0) else interaction3
        interaction4_result = interaction4 + input['interaction4'] if ('interaction4' in input and isinstance(input['interaction4'], int) and input['interaction4'] > 0) else interaction4
        return interaction1_result, interaction2_result, interaction3_result, interaction4_result

    def merge_accumulators(self, accumulators):
        interaction1, interaction2, interaction3, interaction4 = zip(*accumulators)
        return sum(interaction1), sum(interaction2), sum(interaction3), sum(interaction4)

    def extract_output(self, interactions):
        (interaction1, interaction2, interaction3, interaction4) = interactions
        output = {
            'interaction1': interaction1,
            'interaction2': interaction2,
            'interaction3': interaction3,
            'interaction4': interaction4
        }
        return output
def to_json(e):
    try:
        return json.loads(e.decode('utf-8'))
    except json.JSONDecodeError:
        return {}


with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Read from pubsub' >> beam.io.ReadFromPubSub(topic=known_args.input_topic)
     | 'To Json' >> beam.Map(to_json)
     | 'Window' >> beam.WindowInto(window.FixedWindows(1),
                                   trigger=AfterProcessingTime(delay=1 * 3),
                                   accumulation_mode=AccumulationMode.DISCARDING,
                                   allowed_lateness=2)
     | 'Calculate Metrics' >> beam.CombineGlobally(CountFn()).without_defaults()
     | 'To bytestring' >> beam.Map(lambda e: json.dumps(e).encode('utf-8'))
     | 'Write to pubsub' >> beam.io.WriteToPubSub(topic=known_args.output_topic))
Turns out the problem was caused by the client I was using to send the test messages. Nodejs-pubsub issue 847 states that nodejs-pubsub has a problem sending high volumes of messages.
https://github.com/googleapis/nodejs-pubsub/issues/847
There is a comment suggesting a workaround, but I have not tried it myself.
https://github.com/googleapis/nodejs-pubsub/issues/847#issuecomment-886024472
I have a branching pipeline with multiple ParDo transforms that are merged and written to text file records in a GCS bucket.
I am receiving the following messages after my pipeline crashes:
The worker lost contact with the service.
RuntimeError: FileNotFoundError: [Errno 2] Not found: gs://MYBUCKET/JOBNAME.00000-of-00001.avro [while running 'WriteToText/WriteToText/Write/WriteImpl/WriteBundles/WriteBundles']
This looks like it can't find the file it's been writing to. It seems to be fine until a certain point, when the error occurs. I'd like to wrap a try/except around it or set a breakpoint, but I'm not even sure how to discover the root cause.
Is there a way to just write a single file, or to only open a file for writing once? It's spamming thousands of output files into this bucket, which I'd like to eliminate and which may be a factor.
with beam.Pipeline(argv=pipeline_args) as p:
    csvlines = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input, skip_header_lines=1)
          | 'Parse CSV to Dictionary' >> beam.ParDo(Split())
          | 'Read Files into Memory' >> beam.ParDo(DownloadFilesDoFn())
          | 'Windowing' >> beam.WindowInto(window.FixedWindows(20 * 60))
    )

    b1 = (csvlines | 'Branch1' >> beam.ParDo(Branch1DoFn()))
    b2 = (csvlines | 'Branch2' >> beam.ParDo(Branch2DoFn()))
    b3 = (csvlines | 'Branch3' >> beam.ParDo(Branch3DoFn()))
    b4 = (csvlines | 'Branch4' >> beam.ParDo(Branch4DoFn()))
    b5 = (csvlines | 'Branch5' >> beam.ParDo(Branch5DoFn()))
    b6 = (csvlines | 'Branch6' >> beam.ParDo(Branch6DoFn()))

    output = (
        (b1, b2, b3, b4, b5, b6) | 'Merge PCollections' >> beam.Flatten()
        | 'WriteToText' >> beam.io.Write(beam.io.textio.WriteToText(known_args.output))
    )
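Regarding the side question above about writing just a single output file: WriteToText accepts a num_shards argument that caps the number of output shards. A minimal sketch, assuming the output is small enough that giving up parallel writes is acceptable (all other names are reused from the pipeline above):

output = (
    (b1, b2, b3, b4, b5, b6)
    | 'Merge PCollections' >> beam.Flatten()
    # num_shards=1 forces a single output file; this serializes the final write,
    # so it is only advisable for modest output sizes.
    | 'WriteToText' >> beam.io.WriteToText(known_args.output, num_shards=1)
)

This does not address the FileNotFoundError itself, but it removes the thousands of small output files as a variable.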
This question is linked to a previous question which contains more detail about the implementation. The solution suggested there was to create an instance of google.cloud.storage.Client() in start_bundle() of every ParDo(DoFn). This connects to the same GCS bucket, given via the args in WriteToText(known_args.output):
class DownloadFilesDoFn(beam.DoFn):
    def __init__(self):
        import re
        self.gcs_path_regex = re.compile(r'gs:\/\/([^\/]+)\/(.*)')

    def start_bundle(self):
        import google.cloud.storage
        self.gcs = google.cloud.storage.Client()

    def process(self, element):
        self.file_match = self.gcs_path_regex.match(element['Url'])
        self.bucket = self.gcs.get_bucket(self.file_match.group(1))
        self.blob = self.bucket.get_blob(self.file_match.group(2))
        self.f = self.blob.download_as_bytes()
It's likely that the cause of this error is related to having too many connections to the client. I'm not clear on good practice for this, since it's been suggested elsewhere that you can set up network connections in this way for each bundle.
Adding the following to remove the client object from memory at the end of each bundle should help close some unnecessary lingering connections:
def finish_bundle(self):
    # Drop the client at the end of each bundle; start_bundle() recreates it.
    # The compiled regex from __init__ is kept, because start_bundle() does not rebuild it.
    del self.gcs
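As a hedged alternative, not part of the original suggestion: newer versions of the Beam Python SDK also provide DoFn.setup() and DoFn.teardown(), which run once per DoFn instance rather than once per bundle, so the client can be created and released even less often. A minimal sketch along those lines:

import apache_beam as beam


class DownloadFilesDoFn(beam.DoFn):
    def __init__(self):
        import re
        self.gcs_path_regex = re.compile(r'gs://([^/]+)/(.*)')

    def setup(self):
        # Runs once per DoFn instance on a worker, before any bundle is processed
        import google.cloud.storage
        self.gcs = google.cloud.storage.Client()

    def process(self, element):
        file_match = self.gcs_path_regex.match(element['Url'])
        bucket = self.gcs.get_bucket(file_match.group(1))
        blob = bucket.get_blob(file_match.group(2))
        yield blob.download_as_bytes()

    def teardown(self):
        # Runs when the DoFn instance is discarded; best-effort cleanup
        self.gcs = None

Whether a per-instance or per-bundle client lifetime works better depends on how many DoFn instances the runner keeps alive, so treat this as a design option to test rather than a guaranteed fix.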
I'm trying to create fixed windows of 10 seconds using Apache Beam 2.23 with Kafka as the data source.
It seems to get triggered for every record even if I try to set an AfterProcessingTime trigger of 15 seconds, and it throws the following error if I try to use GroupByKey.
Error : KeyError: 0 [while running '[17]: FixedWindow']
Data simulation :
from kafka import KafkaProducer
import time

producer = KafkaProducer()
id_val = 1001
while(1):
    message = {}
    message['id_val'] = str(id_val)
    message['sensor_1'] = 10
    if (id_val < 1003):
        id_val = id_val + 1
    else:
        id_val = 1001
    time.sleep(2)
    print(time.time())
    producer.send('test', str(message).encode())
Beam snippet :
class AddTimestampFn(beam.DoFn):
    def process(self, element):
        timestamp = int(time.time())
        yield beam.window.TimestampedValue(element, timestamp)


pipeline_options = PipelineOptions()
pipeline_options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=pipeline_options)

with beam.Pipeline() as p:
    lines = p | "Reading messages from Kafka" >> kafkaio.KafkaConsume(kafka_config)
    groups = (
        lines
        | 'ParseEventFn' >> beam.Map(lambda x: (ast.literal_eval(x[1])))
        | 'Add timestamp' >> beam.ParDo(AddTimestampFn())
        | 'After timestamp add ' >> beam.ParDo(PrintFn("timestamp add"))
        | 'FixedWindow' >> beam.WindowInto(
            beam.window.FixedWindows(10 * 1), allowed_lateness=30)
        | 'Group ' >> beam.GroupByKey()
        | 'After group' >> beam.ParDo(PrintFn("after group")))
What am I doing wrong here? I have just started using beam so it could be something really silly.
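One thing worth checking, offered as an assumption rather than a verified fix: beam.GroupByKey expects a PCollection of (key, value) 2-tuples, while the elements here are plain dicts after ast.literal_eval, which would produce exactly an indexing error like KeyError: 0. A minimal sketch that keys the elements first; choosing id_val as the key is an assumption:

groups = (
    lines
    | 'ParseEventFn' >> beam.Map(lambda x: ast.literal_eval(x[1]))
    | 'Add timestamp' >> beam.ParDo(AddTimestampFn())
    # Turn each dict into a (key, value) pair so GroupByKey has something to group on
    | 'Key by id_val' >> beam.Map(lambda d: (d['id_val'], d))
    | 'FixedWindow' >> beam.WindowInto(beam.window.FixedWindows(10))
    | 'Group' >> beam.GroupByKey())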
I have 2 questions about my development.
Question 1
I'm trying to create a template from Python code which consists of reading from BigQuery tables, applying some transformations, and writing to a different BigQuery table (which may or may not already exist).
The point is that I need to pass the target table as a parameter, but it looks like I can't use parameters in the pipeline method WriteToBigQuery, as it raises the following error message: apache_beam.error.RuntimeValueProviderError: RuntimeValueProvider(option: project_target, type: str, default_value: 'Test').get() not called from a runtime context
Approach 1
with beam.Pipeline(options=options) as pipeline:
    logging.info("Start logic process...")
    kpis_report = (
        pipeline
        | "Process start" >> Create(["1"])
        | "Delete previous data" >> ParDo(preTasks())
        | "Read table" >> ParDo(readTable())
        ....
        | 'Write table 2' >> Write(WriteToBigQuery(
            table=custom_options.project_target.get() + ":" + custom_options.dataset_target.get() + "." + custom_options.table_target.get(),
            schema=custom_options.target_schema.get(),
            write_disposition=BigQueryDisposition.WRITE_APPEND,
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED)))
Approach 2
I created a ParDo function in order to get the variable there and set the WriteToBigQuery method. However, despite the pipeline execution completing successfully and the output showing returned rows (theoretically written), I can't see the table or any data inserted into it.
with beam.Pipeline(options=options) as pipeline:
    logging.info("Start logic process...")
    kpis_report = (
        pipeline
        | "Process start" >> Create(["1"])
        | "Pre-tasks" >> ParDo(preTasks())
        | "Read table" >> ParDo(readTable())
        ....
        | 'Write table 2' >> Write(WriteToBigQuery()))
There I tried 2 methods and neither works: BigQueryBatchFileLoads and WriteToBigQuery.
class writeTable(beam.DoFn):
    def process(self, element):
        try:
            # First load the parameters from the custom_options variable (here we can do it)
            result1 = Write(BigQueryBatchFileLoads(destination=target_table,
                                                   schema=target_schema,
                                                   write_disposition=BigQueryDisposition.WRITE_APPEND,
                                                   create_disposition=BigQueryDisposition.CREATE_IF_NEEDED))
            result2 = WriteToBigQuery(table=target_table,
                                      schema=target_schema,
                                      write_disposition=BigQueryDisposition.WRITE_APPEND,
                                      create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
                                      method="FILE_LOADS")
Question 2
The other doubt I have is whether, in this last ParDo class, I need to return something such as the element or result1 or result2, since we are in the last pipeline step.
Appreciate your help on this.
The most advisable way to do this is similar to Approach 1, but passing the value provider without calling get(), and passing a lambda as the table:
with beam.Pipeline(options=options) as pipeline:
    logging.info("Start logic process...")
    kpis_report = (
        pipeline
        | "Process start" >> Create(["1"])
        | "Delete previous data" >> ParDo(preTasks())
        | "Read table" >> ParDo(readTable())
        ....
        | 'Write table 2' >> WriteToBigQuery(
            table=lambda x: custom_options.project_target.get() + ":" + custom_options.dataset_target.get() + "." + custom_options.table_target.get(),
            schema=custom_options.target_schema,
            write_disposition=BigQueryDisposition.WRITE_APPEND,
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED))
This should work.
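For completeness, a minimal sketch of how such runtime parameters might be declared as value providers; the option names are taken from the snippets above, the rest is an assumption:

from apache_beam.options.pipeline_options import PipelineOptions


class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # add_value_provider_argument makes the options resolvable at runtime,
        # which is what templated pipelines need
        parser.add_value_provider_argument('--project_target', type=str)
        parser.add_value_provider_argument('--dataset_target', type=str)
        parser.add_value_provider_argument('--table_target', type=str)
        parser.add_value_provider_argument('--target_schema', type=str)


options = PipelineOptions()
custom_options = options.view_as(CustomOptions)

Because the lambda passed as table only calls get() when an element is actually written, the value provider is resolved in a runtime context and the RuntimeValueProviderError from template construction goes away.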
I have a pipeline that takes a bounded PCollection, assigns timestamps to it, and "windows" it into sliding windows. After a grouping transform, I want to assign the resulting PCollection back to the global window. I have not been able to figure out how to do this. See the sample Beam pseudo-code below:
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.io.ReadFromText()
        | beam.ParDo(AddTimestampDoFn())
        | beam.WindowInto(beam.window.SlidingWindows(60, 60))
        | beam.GroupByKey()
        | beam.ParDo(SomethingElse())
        | beam.WindowInto(GlobalWindow())  # Here is where I want to go back to the global window
    )
Any ideas on how to go about it?
Using beam.WindowInto(window.GlobalWindows()) should work. For example, with this quick test:
data = [{'message': 'Hi', 'timestamp': time.time()}]

events = (p
          | 'Create Events' >> beam.Create(data)
          | 'Add Timestamps' >> beam.Map(lambda x: beam.window.TimestampedValue(x, x['timestamp']))
          | 'Sliding Windows' >> beam.WindowInto(beam.window.SlidingWindows(60, 60))
          | 'First window' >> beam.ParDo(DebugPrinterFn())
          | 'global Window' >> beam.WindowInto(window.GlobalWindows())
          | 'Second window' >> beam.ParDo(DebugPrinterFn()))
where DebugPrinterFn prints window information:
class DebugPrinterFn(beam.DoFn):
    """Just prints the element and window"""
    def process(self, element, window=beam.DoFn.WindowParam):
        logging.info("Received message %s in window=%s", element['message'], window)
        yield element
I get the following output:
INFO:root:Received message Hi in window=[1575565500.0, 1575565560.0)
INFO:root:Received message Hi in window=GlobalWindow
Tested with the DirectRunner and 2.16.0 SDK. If it does not work for you:
Do you get any error?
Which runner and SDK are you using?
Full code here