I have a Dataflow job that accumulates UI interactions via GCP Pub/Sub. I've tested this with a script that sends many Pub/Sub messages representing interactions to the input_topic. At lower message rates (<500 per second), the Dataflow job counts the interactions correctly. But when I crank the rate up, the Dataflow job suddenly sends out counts that are way higher (5-10x) than the number of Pub/Sub messages sent to the input_topic.
The ideas I've explored are:
Pub/Sub is resending messages that aren't being acked.
This doesn't make sense because the ack deadline for the input_topic subscription is 1 minute.
Something is wrong with my trigger configuration.
Something I don't understand is happening in ReadFromPubSub or CombineGlobally(CountFn())
class CountFn(beam.CombineFn):
    def create_accumulator(self):
        # interaction1, interaction2, interaction3, interaction4
        return 0, 0, 0, 0

    def add_input(self, interactions, input):
        (interaction1, interaction2, interaction3, interaction4) = interactions
        interaction1_result = interaction1 + input['interaction1'] if ('interaction1' in input and isinstance(input['interaction1'], int) and input['interaction1'] > 0) else interaction1
        interaction2_result = interaction2 + input['interaction2'] if ('interaction2' in input and isinstance(input['interaction2'], int) and input['interaction2'] > 0) else interaction2
        interaction3_result = interaction3 + input['interaction3'] if ('interaction3' in input and isinstance(input['interaction3'], int) and input['interaction3'] > 0) else interaction3
        interaction4_result = interaction4 + input['interaction4'] if ('interaction4' in input and isinstance(input['interaction4'], int) and input['interaction4'] > 0) else interaction4
        return interaction1_result, interaction2_result, interaction3_result, interaction4_result

    def merge_accumulators(self, accumulators):
        interaction1, interaction2, interaction3, interaction4 = zip(*accumulators)
        return sum(interaction1), sum(interaction2), sum(interaction3), sum(interaction4)

    def extract_output(self, interactions):
        (interaction1, interaction2, interaction3, interaction4) = interactions
        output = {
            'interaction1': interaction1,
            'interaction2': interaction2,
            'interaction3': interaction3,
            'interaction4': interaction4
        }
        return output

def to_json(e):
    try:
        return json.loads(e.decode('utf-8'))
    except json.JSONDecodeError:
        return {}

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'Read from pubsub' >> beam.io.ReadFromPubSub(topic=known_args.input_topic)
     | 'To Json' >> beam.Map(to_json)
     | 'Window' >> beam.WindowInto(window.FixedWindows(1),
                                   trigger=AfterProcessingTime(delay=1 * 3),
                                   accumulation_mode=AccumulationMode.DISCARDING,
                                   allowed_lateness=2)
     | 'Calculate Metrics' >> beam.CombineGlobally(CountFn()).without_defaults()
     | 'To bytestring' >> beam.Map(lambda e: json.dumps(e).encode('utf-8'))
     | 'Write to pubsub' >> beam.io.WriteToPubSub(topic=known_args.output_topic))
It turns out the problem was caused by the client I was using to send the test messages: nodejs-pubsub issue 847 reports that nodejs-pubsub has problems sending high volumes of messages.
https://github.com/googleapis/nodejs-pubsub/issues/847
There is a comment suggesting a workaround, but I have not tried it myself.
https://github.com/googleapis/nodejs-pubsub/issues/847#issuecomment-886024472
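As a sanity check that does not depend on the nodejs client at all, a small Python publisher can replay the load test and count exactly how many messages were actually accepted by the topic. This is only a sketch of what I mean, assuming the google-cloud-pubsub package is installed; PROJECT_ID and TOPIC_ID are placeholders.

# Hypothetical test publisher in Python (not the nodejs client from the question).
# Assumes google-cloud-pubsub is installed; PROJECT_ID and TOPIC_ID are placeholders.
import json
from concurrent import futures

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient(
    # Batch many small messages per request so the client keeps up at high rates.
    batch_settings=pubsub_v1.types.BatchSettings(
        max_messages=100,   # send a batch after 100 messages...
        max_latency=0.05,   # ...or after 50 ms, whichever comes first
    )
)
topic_path = publisher.topic_path("PROJECT_ID", "TOPIC_ID")

publish_futures = []
for _ in range(10000):
    payload = json.dumps({"interaction1": 1}).encode("utf-8")
    publish_futures.append(publisher.publish(topic_path, payload))

# Wait for every publish to be acknowledged by the service, so the number of
# messages that actually reached the topic is known exactly.
futures.wait(publish_futures, return_when=futures.ALL_COMPLETED)
print("published", len(publish_futures), "messages")

Comparing that printed count with what the Dataflow job emits makes it easier to tell a pipeline bug apart from a publisher bug.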
Related
I have a Waveshare TOF Laser Range Sensor (B) that I am using with my Raspberry Pi 3 Model B+. My main goal is to receive distance readings from the sensor and upload that data to the ThingSpeak cloud platform. The first part of the code works well: I am receiving the distance and timestamp, along with the signal status and data check, from the sensor. However, when I try to upload the timestamp and distance values to the cloud platform, the sensor's data is incorrect and shows only 5-7 cm of variation, even though the object is very close to it. I have tried using async requests with the aiohttp and asyncio libraries, to no avail.
Here is the manufacturer's demo code for the sensor, which I have modified to send async requests.
#coding: UTF-8
import RPi.GPIO as GPIO
import serial
import time
import chardet
import sys
import aiohttp
import asyncio

# ThingSpeak Cloud write definitions:
channel_id = "1853890"
write_api_key = "G22BQASFVWJT6T"

TOF_length = 16
TOF_header = (87, 0, 255)
TOF_system_time = 0
TOF_distance = 0
TOF_status = 0
TOF_signal = 0
TOF_check = 0

ser = serial.Serial('/dev/serial0', 921600)
ser.flushInput()

# Async function to upload data
async def upload(timer, dist):
    async with aiohttp.ClientSession() as session:
        upload_url = "https://api.thingspeak.com/update?api_key=G22BQASFVWJT6TOH&field1=" + str(timer) + "&field2=" + str(dist)
        async with session.get(upload_url) as res:
            print('ok')

def verifyCheckSum(data, len):
    # print(data)
    TOF_check = 0
    for k in range(0, len - 1):
        TOF_check += data[k]
    TOF_check = TOF_check % 256
    if (TOF_check == data[len - 1]):
        print("TOF data is ok!")
        return 1
    else:
        print("TOF data is error!")
        return 0

while True:
    TOF_data = ()
    timer = 0
    dist = 0
    time.sleep(0.01)
    if ser.inWaiting() >= 32:
        for i in range(0, 16):
            TOF_data = TOF_data + (ord(ser.read(1)), ord(ser.read(1)))
        # print(TOF_data)
        for j in range(0, 16):
            if ((TOF_data[j] == TOF_header[0] and TOF_data[j+1] == TOF_header[1] and TOF_data[j+2] == TOF_header[2]) and (verifyCheckSum(TOF_data[j:TOF_length], TOF_length))):
                if (((TOF_data[j+12]) | (TOF_data[j+13] << 8)) == 0):
                    print("Out of range!")
                else:
                    print("TOF id is: " + str(TOF_data[j+3]))
                    TOF_system_time = TOF_data[j+4] | TOF_data[j+5] << 8 | TOF_data[j+6] << 16 | TOF_data[j+7] << 24
                    print("TOF system time is: " + str(TOF_system_time) + 'ms')
                    timer = TOF_system_time
                    TOF_distance = (TOF_data[j+8]) | (TOF_data[j+9] << 8) | (TOF_data[j+10] << 16)
                    print("TOF distance is: " + str(TOF_distance) + 'mm')
                    dist = TOF_distance
                    TOF_status = TOF_data[j+11]
                    print("TOF status is: " + str(TOF_status))
                    TOF_signal = TOF_data[j+12] | TOF_data[j+13] << 8
                    print("TOF signal is: " + str(TOF_signal))
                    # Calling async function to upload data:
                    asyncio.run(upload(timer, dist))
                break
Here is the output when calling the upload method: [output omitted]
Here is the output when not calling the upload method: [output omitted]
Can someone please explain what I am doing wrong and correct me?
Thanks!
I'm quite new to Apache Beam and have implemented my first pipelines.
But now I've reached a point where I am confused about how to combine windowing and joining.
Problem definition:
I have two streams of data: one with pageviews of users, and another with requests of the users. They share the key session_id, which identifies the user's session, but each carries other additional data.
The goal is to compute the number of pageviews in a session before a request happened. That means I want a stream of data that has every request together with the number of pageviews before that request. It suffices to have the pageviews of, let's say, the last 5 minutes.
What I tried
To load the requests I use this snippet, which loads the requests from a pubsub subscription and then extracts the session_id as key. Lastly, I apply a window which emits every request directly when it is received.
requests = (p
            | 'Read Requests' >> (
                beam.io.ReadFromPubSub(subscription=request_sub)
                | 'Extract' >> beam.Map(lambda x: json.loads(x))
                | 'Session as Key' >> beam.Map(lambda request: (request['session_id'], request))
                | 'Window' >> beam.WindowInto(window.SlidingWindows(5 * 60, 1 * 60, 0),
                                              trigger=trigger.AfterCount(1),
                                              accumulation_mode=trigger.AccumulationMode.DISCARDING
                                              )
            )
            )
Similarly, this snippet loads the pageviews and applies a sliding window that emits accumulated results whenever a pageview arrives.
pageviews = (p
             | 'Read Pageviews' >> (
                 beam.io.ReadFromPubSub(subscription=pageview_sub)
                 | 'Extract' >> beam.Map(lambda x: json.loads(x))
                 | 'Session as Key' >> beam.Map(lambda pageview: (pageview['session_id'], pageview))
                 | 'Window' >> beam.WindowInto(
                     windowfn=window.SlidingWindows(5 * 60, 1 * 60, 0),
                     trigger=trigger.AfterCount(1),
                     accumulation_mode=trigger.AccumulationMode.ACCUMULATING
                 )
             )
             )
To apply the join, I tried
combined = (
    {
        'requests': requests,
        'pageviews': pageviews
    }
    | 'Merge' >> beam.CoGroupByKey()
    | 'Print' >> beam.Map(print)
)
When I run this pipeline, the merged rows never contain both requests and pageviews; only one of the two is ever present.
My idea was to filter the pageviews down to those that happened before the request and count them after the CoGroupByKey. What do I need to do? I suppose my problem is with the windowing and triggering strategy.
It's also quite important that requests get processed with low latency, possibly discarding pageviews that arrive late.
I found a solution myself; here it is in case somebody is interested:
Idea
The trick is to combine the two streams using the beam.Flatten operation and to use a stateful DoFn to compute the number of pageviews before each request. Each stream contains json dictionaries. I tag them as ('request', request) and ('pageview', pageview) so that I can keep the different event types apart in the stateful DoFn. Along the way I also compute things like the first pageview timestamp and the seconds since the first pageview. The streams have to use the session_id as key, so that the stateful DoFn receives all the events of one session and nothing else.
Code
First of all, this is the pipeline code:
# Beam pipeline that extends each request by the number of pageviews before it in that session
with beam.Pipeline(options=options) as p:
    # The stream of requests
    requests = (
        'Read from PubSub subscription' >> beam.io.ReadFromPubSub(subscription=request_sub)
        | 'Extract JSON' >> beam.ParDo(ExtractJSON())
        | 'Add Timestamp' >> beam.ParDo(AssignTimestampFn())
        | 'Use Session ID as stream key' >> beam.Map(lambda request: (request['session_id'], request))
        | 'Add type of event' >> beam.Map(lambda r: (r[0], ('request', r[1])))
    )

    # The stream of pageviews
    pageviews = (
        'Read from PubSub subscription' >> beam.io.ReadFromPubSub(subscription=pageview_sub)
        | 'Extract JSON' >> beam.ParDo(ExtractJSON())
        | 'Add Timestamp' >> beam.ParDo(AssignTimestampFn())
        | 'Use Session ID as stream key' >> beam.Map(lambda pageview: (pageview['session_id'], pageview))
        | 'Add type of event' >> beam.Map(lambda p: (p[0], ('pageview', p[1])))
    )

    # Combine the streams and apply the stateful DoFn
    combined = (
        (
            p | ('Prepare requests stream' >> requests),
            p | ('Prepare pageviews stream' >> pageviews)
        )
        | 'Combine event streams' >> beam.Flatten()
        | 'Global Window' >> beam.WindowInto(windowfn=window.GlobalWindows(),
                                             trigger=trigger.AfterCount(1),
                                             accumulation_mode=trigger.AccumulationMode.DISCARDING)
        | 'Stateful DoFn' >> beam.ParDo(CountPageviewsStateful())
        | 'Compute processing delay' >> beam.ParDo(LogTimeDelay())
        | 'Format for BigQuery output' >> beam.ParDo(FormatForOutputDoFn())
    )

    # Write to BigQuery.
    combined | 'Write' >> beam.io.WriteToBigQuery(
        requests_extended_table,
        schema=REQUESTS_EXTENDED_TABLE_SCHEMA,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
The interesting part is the combination of the two streams using beam.Flatten and the application of the stateful DoFn CountPageviewsStateful().
Here is the code of the custom DoFns used:
# This DoFn just loads a json message
class ExtractJSON(beam.DoFn):
    def process(self, element):
        import json

        yield json.loads(element)

# This DoFn adds the event timestamp of a message into its json element for further processing
class AssignTimestampFn(beam.DoFn):
    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        import datetime

        timestamped_element = element
        timestamp_utc = datetime.datetime.utcfromtimestamp(float(timestamp))
        timestamp = timestamp_utc.strftime("%Y-%m-%d %H:%M:%S")
        timestamped_element['timestamp_utc'] = timestamp_utc
        timestamped_element['timestamp'] = timestamp
        yield timestamped_element

# This class is a stateful DoFn.
# Input elements should be of the form (session_id, (event_type, event)),
# where events can be requests or pageviews.
# It keeps, on a per-session basis, the number of pageviews and the first pageview timestamp
# in its internal state.
# Whenever a request comes in, it appends the internal state to the request and emits
# an extended request.
# Whenever a pageview comes in, the internal state is updated but nothing is emitted.
class CountPageviewsStateful(beam.DoFn):
    # The internal states
    NUM_PAGEVIEWS = userstate.CombiningValueStateSpec('num_pageviews', combine_fn=sum)
    FIRST_PAGEVIEW = userstate.ReadModifyWriteStateSpec('first_pageview', coder=beam.coders.VarIntCoder())

    def process(self,
                element,
                num_pageviews_state=beam.DoFn.StateParam(NUM_PAGEVIEWS),
                first_pageview_state=beam.DoFn.StateParam(FIRST_PAGEVIEW)
                ):
        import datetime

        # Extract element
        session_id = element[0]
        event_type, event = element[1]

        # Process different event types
        # Depending on the event type, different actions are taken
        if event_type == 'request':
            # This is a request
            request = event

            # First, the first pageview timestamp is extracted and the seconds since that timestamp are calculated
            first_pageview = first_pageview_state.read()
            if first_pageview is not None:
                seconds_since_first_pageview = (int(request['timestamp_utc'].timestamp()) - first_pageview)
                first_pageview_timestamp_utc = datetime.datetime.utcfromtimestamp(float(first_pageview))
                first_pageview_timestamp = first_pageview_timestamp_utc.strftime("%Y-%m-%d %H:%M:%S")
            else:
                seconds_since_first_pageview = -1
                first_pageview_timestamp = None

            # The calculated data is appended to the request
            request['num_pageviews'] = num_pageviews_state.read()
            request['first_pageview_timestamp'] = first_pageview_timestamp
            request['seconds_since_first_pageview'] = seconds_since_first_pageview

            # The pageview counter is reset
            num_pageviews_state.clear()

            # The request is returned
            yield (session_id, request)

        elif event_type == 'pageview':
            # This is a pageview
            pageview = event

            # Update first pageview state
            first_pageview = first_pageview_state.read()
            if first_pageview is None:
                first_pageview_state.write(int(pageview['timestamp_utc'].timestamp()))
            elif first_pageview > int(pageview['timestamp_utc'].timestamp()):
                first_pageview_state.write(int(pageview['timestamp_utc'].timestamp()))

            # Increase number of pageviews
            num_pageviews_state.add(1)

            # Do not return anything, pageviews are not further processed

# This DoFn logs the delay between the event time and the processing time
class LogTimeDelay(beam.DoFn):
    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        import datetime
        import logging

        timestamp_utc = datetime.datetime.utcfromtimestamp(float(timestamp))
        seconds_delay = (datetime.datetime.utcnow() - timestamp_utc).total_seconds()
        logging.warning('Delayed by %s seconds', seconds_delay)
        yield element
This seems to work and gives me an average delay of about 1-2 seconds on the direct runner. On Cloud Dataflow the average delay is about 0.5-1 seconds. So, all in all, this seems to solve the problem as defined.
Further considerations
There are some open questions, though:
I am using global windows, which means internal state will be kept forever, as far as I can tell. Maybe session windows are the correct way to go: when there are no pageviews/requests for x seconds, the window is closed and its internal state is freed (see the sketch after this list).
Processing delay is a little bit high, but maybe I need to tweak the Pub/Sub part a little.
I do not know how much overhead or memory consumption this solution adds over standard Beam methods. I also didn't test high workloads or parallelisation.
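On the first point: as far as I know, Beam's user state only works with non-merging windows, so session windows cannot be combined directly with a stateful DoFn. What can work instead is keeping the global window and arming a per-key expiry timer that clears the state once a session has been idle for a while. The following is only an untested sketch; the class name and the 30-minute gap are placeholders of mine, and the request/pageview handling is elided.

import apache_beam as beam
from apache_beam.transforms import userstate
from apache_beam.transforms.timeutil import TimeDomain

class CountPageviewsStatefulWithExpiry(beam.DoFn):
    NUM_PAGEVIEWS = userstate.CombiningValueStateSpec('num_pageviews', combine_fn=sum)
    FIRST_PAGEVIEW = userstate.ReadModifyWriteStateSpec('first_pageview', coder=beam.coders.VarIntCoder())
    # Event-time timer used as a per-session "garbage collector" for the state above.
    EXPIRY_TIMER = userstate.TimerSpec('expiry', TimeDomain.WATERMARK)

    EXPIRY_GAP_SECONDS = 30 * 60  # arbitrary; tune to the expected session length

    def process(self,
                element,
                timestamp=beam.DoFn.TimestampParam,
                num_pageviews_state=beam.DoFn.StateParam(NUM_PAGEVIEWS),
                first_pageview_state=beam.DoFn.StateParam(FIRST_PAGEVIEW),
                expiry_timer=beam.DoFn.TimerParam(EXPIRY_TIMER)):
        # Re-arm the timer on every event; setting a timer overwrites the previous one,
        # so it only fires once no event has arrived for EXPIRY_GAP_SECONDS.
        expiry_timer.set(timestamp + self.EXPIRY_GAP_SECONDS)
        # ... handle 'request' / 'pageview' exactly as in CountPageviewsStateful ...

    @userstate.on_timer(EXPIRY_TIMER)
    def expiry(self,
               num_pageviews_state=beam.DoFn.StateParam(NUM_PAGEVIEWS),
               first_pageview_state=beam.DoFn.StateParam(FIRST_PAGEVIEW)):
        # Free the per-session state once the session has gone quiet.
        num_pageviews_state.clear()
        first_pageview_state.clear()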
I'm trying to create fixed windows of 10 seconds using Apache Beam 2.23 with Kafka as the data source.
The window seems to get triggered for every record even if I try to set an AfterProcessingTime trigger to 15, and it throws the following error if I try to use GroupByKey.
Error: KeyError: 0 [while running '[17]: FixedWindow']
Data simulation:
from kafka import KafkaProducer
import time

producer = KafkaProducer()
id_val = 1001
while(1):
    message = {}
    message['id_val'] = str(id_val)
    message['sensor_1'] = 10
    if (id_val < 1003):
        id_val = id_val + 1
    else:
        id_val = 1001
    time.sleep(2)
    print(time.time())
    producer.send('test', str(message).encode())
Beam snippet:
class AddTimestampFn(beam.DoFn):
    def process(self, element):
        timestamp = int(time.time())
        yield beam.window.TimestampedValue(element, timestamp)

pipeline_options = PipelineOptions()
pipeline_options.view_as(StandardOptions).streaming = True
p = beam.Pipeline(options=pipeline_options)
with beam.Pipeline() as p:
    lines = p | "Reading messages from Kafka" >> kafkaio.KafkaConsume(kafka_config)
    groups = (
        lines
        | 'ParseEventFn' >> beam.Map(lambda x: (ast.literal_eval(x[1])))
        | 'Add timestamp' >> beam.ParDo(AddTimestampFn())
        | 'After timestamp add ' >> beam.ParDo(PrintFn("timestamp add"))
        | 'FixedWindow' >> beam.WindowInto(
            beam.window.FixedWindows(10 * 1), allowed_lateness=30)
        | 'Group ' >> beam.GroupByKey()
        | 'After group' >> beam.ParDo(PrintFn("after group")))
What am I doing wrong here? I have just started using beam so it could be something really silly.
I'm using the Python 3.7 SDK of Apache Beam 2.17.0 for Dataflow. The code runs locally, but I gather the data from Pub/Sub. I try to combine per key, and everything goes fine until the pipeline calls the merge_accumulators function. From that point on, all the downstream code is executed twice.
After debugging and digging into the source code, I found that the task is not properly finalized, which is why it is executed twice.
This is the pipeline code:
options = {
    "runner": "DirectRunner",
    "streaming": True,
    "save_main_session": True
}

p = beam.Pipeline(options=PipelineOptions(flags=[], **options))

processRows = (p
               | 'Read from topic' >> beam.io.ReadFromPubSub(subscription=get_subscription_address())
               | 'Filter do not track' >> beam.ParDo(TakeOutNoTrack)
               | 'Map Data' >> beam.ParDo(mapData)
               | 'Filter metatags' >> beam.ParDo(filterMetatags)
               | 'Label admin' >> beam.ParDo(labelAdmin)
               | 'Process row' >> beam.ParDo(processRow)
               )

sessionRow = (processRows
              | 'Add timestamp' >> beam.Map(lambda x: window.TimestampedValue(x, x['timestamp']))
              | 'Key on uuid' >> beam.Map(lambda x: (x['capture_uuid'], x))
              | 'User session window' >> beam.WindowInto(
                  window.Sessions(config_triggers['session_gap']),
                  trigger=trigger.AfterWatermark(
                      early=trigger.AfterCount(config_triggers['after_count'])),
                  accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
              | 'CombineValues' >> beam.CombinePerKey(JoinSessions())
              )

printing = (sessionRow
            | 'Printing' >> beam.Map(lambda x: print(x))
            )

print('running pipeline')

p.run().wait_until_finish()

print('done running the pipeline')

return
This is the config_triggers:
config_triggers = {
    "session_gap": 1320,
    "after_count": 1,
    "session_length": 20
}
This is the combine class:
class JoinSessions(beam.CombineFn):
    def define_format(self):
        try:
            data = {
                "session_uuid": [],
                "capture_uuid": "",
                "metatags": [],
                "timestamps": [],
                "admin": []
            }
            return data
        except Exception:
            logging.error("error at define data: \n%s" % traceback.format_exc())

    def create_accumulator(self):
        try:
            return self.define_format()
        except Exception:
            logging.error("error at create accumulator: \n%s " % traceback.format_exc())

    def add_input(self, metatags, input):
        try:
            metatags["session_uuid"].append(input.get('session_uuid'))
            metatags["capture_uuid"] = input.get('capture_uuid')
            metatags["metatags"].append(input.get('metatags'))
            metatags["timestamps"].append(input.get('timestamp'))
            metatags["admin"].append(input.get('admin'))
            print('test add_input')
            return metatags
        except Exception:
            logging.error("error at add input: \n%s" % traceback.format_exc())

    def merge_accumulators(self, accumulators):
        # print(accumulators)
        try:
            global test_counter
            tags_accumulated = self.define_format()
            for tags in accumulators:
                tags_accumulated["session_uuid"] += tags['session_uuid']
                tags_accumulated["capture_uuid"] += tags['capture_uuid']
                tags_accumulated["metatags"] += tags['metatags']
                tags_accumulated["timestamps"] += tags['timestamps']
                tags_accumulated["admin"] += tags['admin']
            test_counter += 1
            print('counter = ', test_counter)
            return tags_accumulated
        except Exception:
            logging.error("Error at merge Accumulators: \n%s" % traceback.format_exc())

    def extract_output(self, metatags):
        try:
            # print('New input in the pipeline:')
            # print('Extract_output: ')
            # print(metatags, '\n')
            return metatags
        except Exception:
            logging.error("error at return input: \n%s" % traceback.format_exc())
No errors or exceptions are thrown, and there is no other diagnostic information. The output of the 'Printing' label is simply printed twice. The global counter also goes up by two, even though there is only one data entry in the pipeline.
The print in the add_input function is executed just once.
I'm new to dataflow, so, sorry if I made a dumb mistake.
I think this is due to the trigger you have set.
trigger=trigger.AfterWatermark(
    early=trigger.AfterCount(config_triggers['after_count'])),
accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
config_triggers['after_count'] is 1.
So you've set up a trigger that fires after every element, and you also accumulate the elements produced by previous trigger firings. A second trigger firing within the same window will therefore include the elements from the first firing, and so on. See the following link for details on setting the trigger of your pipeline correctly for your use case; a sketch of one possible adjustment follows below it.
https://beam.apache.org/documentation/programming-guide/#triggers
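If the goal is early results without the same elements being emitted twice, one option is to switch the accumulation mode to DISCARDING. This is only a sketch based on the pipeline from the question (reusing processRows, config_triggers and JoinSessions from above); I have not run it:

sessionRow = (processRows
              | 'Add timestamp' >> beam.Map(lambda x: window.TimestampedValue(x, x['timestamp']))
              | 'Key on uuid' >> beam.Map(lambda x: (x['capture_uuid'], x))
              | 'User session window' >> beam.WindowInto(
                  window.Sessions(config_triggers['session_gap']),
                  trigger=trigger.AfterWatermark(
                      early=trigger.AfterCount(config_triggers['after_count'])),
                  # DISCARDING: each pane contains only elements that have not been
                  # fired before, so nothing reaches the combiner twice.
                  accumulation_mode=trigger.AccumulationMode.DISCARDING)
              | 'CombineValues' >> beam.CombinePerKey(JoinSessions())
              )

The trade-off is that downstream steps then receive partial, non-overlapping panes per session that may need to be merged. If a single result per session window is enough, dropping the early firing and using a plain trigger.AfterWatermark() avoids both the duplication and the partial panes.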
Getting messages from Pub/Sub and then saving them into hourly (or other interval) files on GCS does not work. The job only writes the files when I shut it down. Can anyone point me in the right direction?
topic = 'test.txt'
jobname = 'streaming-' + topic.replace('.', '-')
input_topic = 'projects/PROJECT/topics/' + topic

u = Utils()
parsed_schema = u.get_parsed_avro_from_schema_service(
    schema_name=topic,
    schema_repo_url='localhost'
)

p = beam.Pipeline(options=pipelineoptions)

messages = p | 'Read from topic: ' + topic >> ReadFromPubSub(topic=input_topic).with_input_types(bytes)

windowed_lines = (
    messages
    | 'decode' >> beam.ParDo(DecodeAvro(), parsed_schema)
    | beam.WindowInto(
        window.FixedWindows(60),
        trigger=AfterWatermark(),
        accumulation_mode=AccumulationMode.DISCARDING
    )
)

output = windowed_lines | 'write result' >> WriteToAvro(
    file_path_prefix='gs://BUCKET/streaming/tests/',
    shard_name_template=topic.split('.')[0] + '_' + str(uuid.uuid4()) + '_SSSS-of-NNNN',
    schema=parsed_schema,
    file_name_suffix='.avro',
    num_shards=2
)

result = p.run()
result.wait_until_finish()
After some more research, I found that writing from an unbounded source to a bounded sink is not yet supported by the Python SDK, so I will have to switch to the Java SDK for this.
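A possible alternative that stays in Python (my suggestion, not part of the original finding, and untested): newer SDK versions ship apache_beam.io.fileio.WriteToFiles, which is designed for windowed file writes in streaming pipelines. A minimal sketch that writes each window as newline-delimited JSON instead of Avro, reusing windowed_lines and topic from above (BUCKET is a placeholder):

import json

import apache_beam as beam
from apache_beam.io import fileio

output = (
    windowed_lines
    | 'to json line' >> beam.Map(json.dumps)
    | 'write windowed files' >> fileio.WriteToFiles(
        path='gs://BUCKET/streaming/tests/',
        # Name files after the topic; a set of files is finalized per window.
        file_naming=fileio.default_file_naming(prefix=topic.split('.')[0], suffix='.json'),
        sink=lambda dest: fileio.TextSink(),
        shards=2)
)

Whether this meets the requirements depends on the output format; if Avro is mandatory, a custom fileio FileSink would be needed, and the Java SDK route may still be simpler.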